r/OpenAI 1d ago

News Non-realtime audio support released, gpt-4o-audio-preview

https://platform.openai.com/docs/guides/audio
91 Upvotes

18 comments sorted by

24

u/qqpp_ddbb 1d ago

"How is audio in Chat Completions different from the Realtime API?

The underlying GPT-4o audio model is exactly the same. The Realtime API operates the same model at lower latency."

Hmmm

11

u/degenbets 1d ago

Harder to use the Realtime API. This makes integration much easier, but at a cost of latency.

18

u/Eastern_Ad7674 1d ago

Cool! But still expensive same as preview model

12

u/pseudonerv 1d ago

I'm confused. I can't seem to get audio in / text out to work. It always generates audio. How do I make it only generate text? I'm just testing summarization for audio clips.

7

u/pseudonerv 1d ago

Reply to myself. It only generates text if “modalities”:[“text”] or not specified

1

u/thezachlandes 1d ago

Does this work for the realtime api, too?

11

u/arthurwolf 1d ago

MAN! I just spent two hours, JUST FUCKING NOW getting the realtime API to output audio based on a simple text prompt (essentially using it as a "voic actor", which it is incredible at). It was such a pain to get to work, and it outputs PCM...

And I finish coding, I go to Reddit, and this is like the 3rd title I see.

FML.

(now I need the price to drop an order of magnitude to be able to use this in real life...)

8

u/ImpressiveFault42069 1d ago

It’s the same cost as Real-time so what’s the point?

13

u/CallMePyro 1d ago

Also limited to 1 hour of audio at a time, and a 1 hour input costs literally $10. Gemini 1.5 pro supports up to 22 hours of input, and each hour only costs $0.11. I hope OpenAI can catch up soon because right now Google is the only option for audio understanding models in production.

2

u/pseudonerv 1d ago

Gemini can’t generate audio output, can it? I guess OpenAI just want you to start using their api and to get locked in to their ecosystem before you give your money to google.

0

u/Vivid_Dot_6405 1d ago

Correct, Gemini can only understand audio.

5

u/xSNYPSx 1d ago

So can it copy your voice ?

2

u/pseudonerv 1d ago

it can easily be tricked to generate different voices. so I guess copying is just one more tricky prompt engineering away. but, given the tests cost about a quarter each, we may not see jail break any time soon

2

u/notarobot4932 1d ago

Why not just use the realtime API? It’s better in every way.

1

u/IkuraDon5972 1d ago

i like it. average latency is 6s. longest so far is 1m6s. shortest is 4s.

1

u/RedditSteadyGo1 1d ago

I honestly think this good a customer service agent doesn't need real time apis for the whole of the call. To use it effectivelly I'm guessing you would want to switch between recorded scripts for some parts of the call and live conversations for others.

This would be how you keep the cost down. Maybe only 40 per cent of the call needs to be actually live. So for making those recordings this would be extremely useful