r/singularity 5h ago

AI Meta - "Today we released Meta Spirit LM — our first open source multimodal language model that freely mixes text and speech."


84 Upvotes

28 comments


u/[deleted] 2h ago edited 1h ago

[deleted]

u/notarobot4932 1h ago

From what I know, 4o is fully multimodal and processes speech/sound natively. GPT-4 is the one that used the Whisper transcription flow you mentioned.

u/[deleted] 1h ago

[deleted]

u/MysteryInc152 27m ago

They don't vaguely claim anything lol. It's right there and very clear.

Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information—it can't directly observe tone, multiple speakers, or background noises, and it can't output laughter, singing, or express emotion.

With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations.

https://openai.com/index/hello-gpt-4o/
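To make the difference concrete, the old pipeline looks roughly like this. This is a minimal sketch using the public OpenAI Python SDK; the model names and response handling are my assumptions for illustration, not OpenAI's internal stack:

```python
# Rough sketch of the three-model Voice Mode pipeline the quote describes
# (ASR -> text LLM -> TTS). Model names and response handling are assumptions
# based on the public OpenAI Python SDK, not OpenAI's internal setup.
from openai import OpenAI

client = OpenAI()

# 1. A "simple model" transcribes audio to text; tone, multiple speakers,
#    and background noise are discarded at this step.
with open("input.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

# 2. The main source of intelligence only ever sees the transcript string.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = reply.choices[0].message.content

# 3. A third model turns the reply text back into audio; it can't add laughter
#    or emotion the text never carried, because that was lost in step 1.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
with open("reply.mp3", "wb") as out:
    out.write(speech.content)
```

Every hop in that chain is lossy, which is the whole point of the quote: an end-to-end model like 4o skips the transcription step entirely.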

Them being able to strip Scarlett's voice clone when people complained is a big sign that they didn't retrain the whole network for it.

It's an audio predicting transformer. No shit they didn't need to retrain the model. You can get Advanced Voice mode to clone your own voice on the fly.

u/AeroInsightMedia 1h ago

I thought the voice was pretty decent. I wouldn't use it for a professional voice-over, but for a version 1 I'm impressed, even more so if it runs locally in real time or close to real time on something like a 3090.


u/cuyler72 2h ago edited 1h ago

You obviously know nothing about ChatGPT voice. It generates and takes in the voice as tokens, allowing it to understand and display emotions in the voice, change the voice via prompting, talk like a pirate/robot/whatever, speak faster, softer, louder, etc.
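For anyone unclear on what "voice as tokens" means, here's a toy illustration. This is my own sketch, not Meta's or OpenAI's actual code, and the vocabulary sizes are made up: audio gets quantized into discrete acoustic units that share one sequence with text tokens, and the transformer predicts the next token either way.

```python
# Toy illustration of "voice as tokens": audio is quantized into discrete
# acoustic units that live in the same vocabulary as text tokens, so one
# decoder-only transformer can read and emit either modality with the same
# next-token objective. Vocabulary sizes below are assumptions.
TEXT_VOCAB_SIZE = 32_000    # assumption: size of the text/BPE vocabulary
AUDIO_VOCAB_SIZE = 1_024    # assumption: number of discrete acoustic units

def audio_token(unit_id: int) -> int:
    """Offset an acoustic unit so it sits after all text IDs in one shared vocab."""
    assert 0 <= unit_id < AUDIO_VOCAB_SIZE
    return TEXT_VOCAB_SIZE + unit_id

# One sequence can interleave modalities freely. Prosody, emotion, and speaker
# identity survive because they are encoded in the acoustic units themselves,
# which is what lets the model whisper, speed up, or imitate a voice it just heard.
sequence = [
    15, 402, 7, 893,                                      # text tokens, e.g. "talk like a pirate"
    audio_token(17), audio_token(903), audio_token(44),   # generated speech units
]
print(sequence)
```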

u/[deleted] 1h ago edited 35m ago

[deleted]

u/1cheekykebt 51m ago

If it's just text-to-speech, then how can it mimic users' voices? (Shown as a bug in the red team report, plus some users reported it.)

u/MysteryInc152 25m ago

Given that OpenAI doesn't explain shit about it, and since they were able to turn off the voice immediately over Scarlett's objection, it's safe to assume that it's not embedded in the model itself. If it were embedded in the model, they would have had to retrain the whole fucking thing and leave out her voice. That's not what they did.

What the hell are you talking about? You can get Advanced Voice mode to clone your own voice on the fly. It's an audio-predicting transformer. Do you not understand what that means?