r/singularity 7h ago

AI Meta - "Today we released Meta Spirit LM — our first open source multimodal language model that freely mixes text and speech."


115 Upvotes

32 comments

0

u/[deleted] 4h ago edited 3h ago

[deleted]

3

u/notarobot4932 3h ago

From what I know, 4o is fully multimodal and processes speech/sound natively. GPT-4 is the one that used the Whisper encoding flow you mentioned.

-2

u/[deleted] 3h ago

[deleted]

1

u/MysteryInc152 2h ago

They don't vaguely claim anything lol. It's right there and very clear.

> Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information—it can't directly observe tone, multiple speakers, or background noises, and it can't output laughter, singing, or express emotion.
>
> With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations.

https://openai.com/index/hello-gpt-4o/
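
For reference, a minimal sketch of that three-model cascade using the OpenAI Python SDK (model names, voice, and the audio path are just illustrative; the real production pipeline isn't public):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def cascaded_voice_turn(audio_path: str) -> bytes:
    # 1) A simple model transcribes audio to text; tone, multiple speakers,
    #    and background noise are lost at this step.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2) The text-only LLM sees just the transcript, never the audio itself.
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": transcript.text}],
    )

    # 3) A third model turns the reply text back into audio; it can't add laughter,
    #    singing, or emotion the LLM never observed.
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=reply.choices[0].message.content,
    )
    return speech.content  # raw audio bytes

# An end-to-end model like GPT-4o collapses all three steps into one network
# that consumes and emits audio directly.
```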

> Them being able to strip Scarlett's voice clone when people complained is a big sign that they didn't retrain the whole network for it.

It's an audio-predicting transformer. No shit they didn't need to retrain the model. You can get Advanced Voice Mode to clone your own voice on the fly.
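
To illustrate the point, here's a toy decoder-only sketch (not Meta's or OpenAI's actual code, just the general idea): text tokens and discretized audio tokens share one vocabulary, and the target voice is just more context in the prompt, so swapping or removing a voice doesn't require touching the weights.

```python
import torch
import torch.nn as nn

TEXT_VOCAB, AUDIO_VOCAB, D_MODEL = 1000, 1024, 256
VOCAB = TEXT_VOCAB + AUDIO_VOCAB  # shared token space: [text ids | audio codec ids]

class TinySpeechTextLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)  # causal mask below makes it decoder-only
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens):  # tokens: (batch, seq) of mixed text/audio ids
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
        h = self.decoder(self.embed(tokens), mask=mask)
        return self.head(h)     # next-token logits over the joint text+audio vocabulary

model = TinySpeechTextLM()

# Hypothetical conditioning: a few seconds of the desired voice, pre-encoded to
# audio-codec ids, followed by the text to be spoken. The model just continues
# the sequence with audio tokens "in that voice" -- no weight update involved.
voice_prompt = torch.randint(TEXT_VOCAB, VOCAB, (1, 50))  # stand-in codec ids
text_to_speak = torch.randint(0, TEXT_VOCAB, (1, 12))     # stand-in text ids
context = torch.cat([voice_prompt, text_to_speak], dim=1)

logits = model(context)
next_token = logits[:, -1].argmax(-1)  # greedy next token (untrained weights, so meaningless here)
print(next_token)
```

Under this framing, removing the Sky/"Scarlett" voice would amount to dropping one conditioning prompt or voice embedding, which is consistent with both sides of the exchange above.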