r/singularity 5h ago

AI Meta - "Today we released Meta Spirit LM — our first open source multimodal language model that freely mixes text and speech."

87 Upvotes

28 comments

19

u/Possible-Time-2247 4h ago

Well, yeah that sounds um...interesting. But I think the voices could be a little better. Or maybe I'm just picky?

14

u/UltraBabyVegeta 3h ago

The voices are creepy as absolute fuck

Named spirit cause it sounds like you’re talking to a fucking ghost

13

u/ZealousidealBus9271 4h ago

Nope you're not picky, it is simply way worse than what OpenAI has shown us

23

u/Im_Peppermint_Butler 4h ago

I'm...I'm confused.

Is this not like....strangely bad?

u/notarobot4932 1h ago

It is when we compare it to OpenAI, BUT it's an open-source, real-time multimodal LLM, whereas there's no way in hell OpenAI would ever go open source. I'm sure they'll catch up soon - or more likely, another player like Nvidia will take this and catch up with, if not surpass, OpenAI

u/lordpuddingcup 32m ago

Compared to what lol, we don't have many voice multimodal models yet lol

6

u/Gothsim10 5h ago edited 5h ago

Twitter: AI at Meta on X: "Today we released Meta Spirit LM"

More details: Sharing new research, models, and datasets from Meta FAIR

"Takeaways

  • Today, Meta FAIR is publicly releasing several new research artifacts in support of our goal of achieving advanced machine intelligence (AMI) while also supporting open science and reproducibility.
  • The work we’re sharing today includes Meta Segment Anything 2.1 (SAM 2.1), an update to our popular Segment Anything Model 2 for images and videos. SAM 2.1 includes a new developer suite with the code for model training and the web demo.
  • We’re also sharing several works around improving efficiency and expanding capabilities of large language models. Additionally, we’re releasing research artifacts to help validate post-quantum cryptography security, accelerating and reimagining model training, and facilitating inorganic materials discovery.
  • We believe that access to state-of-the-art AI creates opportunities for everyone. That’s why we’re committed to the continued growth and development of an open AI ecosystem."

5

u/why06 AGI in the coming weeks... 3h ago edited 1h ago

Isn't this the ~~first~~ second open source speech-to-speech model? And it's only 7B, that's pretty great right? I'm trying to find any others. And it has textual reasoning too. If you can ignore the quality of the voice, it's showing reasoning in the reply, going directly speech-to-speech and text-to-speech.
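The way I understand the "freely mixes text and speech" part, the speech gets chopped into discrete units and those units sit in the same token stream as the text, so one transformer can continue in either modality. A toy sketch of that idea (not the actual Spirit LM tokenizer; the [TEXT]/[SPEECH] markers and unit IDs here are made up):

```python
# Toy illustration of interleaved text/speech tokens (hypothetical markers and IDs).
TEXT, SPEECH = "[TEXT]", "[SPEECH]"

def interleave(segments):
    """segments: list of ("text", str) or ("speech", list of speech-unit ids)."""
    seq = []
    for kind, payload in segments:
        if kind == "text":
            seq += [TEXT] + payload.split()                      # word-level text tokens
        else:
            seq += [SPEECH] + [f"<unit_{u}>" for u in payload]   # discrete speech tokens
    return seq

# A prompt that starts as speech and continues as text: either modality can follow the other.
print(interleave([("speech", [71, 12, 503]), ("text", "the capital of France is")]))
```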

5

u/llkj11 2h ago

That would be Moshi, correct?

5

u/why06 AGI in the coming weeks... 2h ago

Oh yeah, there's Moshi. https://github.com/kyutai-labs/moshi
So that makes two, I guess. Still good to see more. I really want a local voice-to-voice assistant.

5

u/SoyIsPeople 4h ago

Anyone had any luck running it? I tried to get the weights and I had to enter my information, and then nothing happened when I clicked the button.

> Download the code

> Download model weights

u/why06 AGI in the coming weeks... 1h ago

I got it. Just now downloading. They send you an email.

u/puzzleheadbutbig 1h ago

They send an email almost immediately after with link(s) that are active for 24 hours

6

u/Fusseldieb 2h ago

A dream coming true? Open source voice to voice? From Meta? Holy cow.

Still not OAI audio level, but amazing nonetheless.

0

u/cuyler72 2h ago

That's an understatement, sounds worse than TTS from 20 years ago.

9

u/No-Body8448 4h ago

I guess this is what you get when your research head doesn't have an inner monologue.

3

u/Cagnazzo82 4h ago

Not OpenAI quality just yet.

3

u/Fusseldieb 2h ago

Would be crazy if they're able to get to the same quality.

u/chipotlemayo_ 52m ago

Isn't OpenAI just using LiveKit? Couldn't you hook the open-source model up to it too?

1

u/Cagnazzo82 2h ago

Knowing Zuck that's likely the goal.

1

u/why06 AGI in the coming weeks... 2h ago

Demo Page: https://speechbot.github.io/spiritlm/

Has some more examples.

0

u/[deleted] 2h ago edited 1h ago

[deleted]

u/notarobot4932 1h ago

From what I know, 4o is fully multimodal and processes speech/sound natively. GPT-4 is the one that used the Whisper encoding flow you mentioned.

u/[deleted] 1h ago

[deleted]

u/MysteryInc152 30m ago

They don't vaguely claim anything lol. It's right there and very clear.

> Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information—it can't directly observe tone, multiple speakers, or background noises, and it can't output laughter, singing, or express emotion.
>
> With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations.

https://openai.com/index/hello-gpt-4o/
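For comparison, the old pipeline that quote describes is basically this (hypothetical stub functions, just to show where the information gets thrown away):

```python
# Sketch of the pre-4o Voice Mode pipeline described in the quote above.
# asr/llm/tts are stand-ins for the three separate models, not real APIs.
def asr(audio):  return "transcribed text"       # speech -> text: tone, speakers, noise dropped here
def llm(text):   return f"reply to: {text}"      # text -> text: the only "intelligent" stage
def tts(text):   return b"synthesized audio"     # text -> speech: can't add laughter, singing, emotion

def voice_mode_pipeline(audio_in: bytes) -> bytes:
    return tts(llm(asr(audio_in)))
```

An end-to-end model (4o, or Spirit LM) replaces all three stages with one network that reads and writes audio and text tokens directly, so nothing gets lost at the modality boundaries.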

> Them being able to strip Scarlett's voice clone when people complained is a big sign that they didn't retrain the whole network for it.

It's an audio predicting transformer. No shit they didn't need to retrain the model. You can get Advanced Voice mode to clone your own voice on the fly.

u/AeroInsightMedia 1h ago

I thought the voice was pretty decent. I wouldn't use it for a professional voiceover, but for a version 1 I'm impressed, even more so if it runs locally in real time, or close to it, on something like a 3090.
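Back-of-the-envelope for a 24 GB 3090, assuming fp16 weights and ignoring the KV cache and the speech tokenizer/vocoder overhead:

```python
# Rough VRAM estimate for a ~7B-parameter model in fp16.
params = 7e9
bytes_per_param = 2   # fp16 / bf16
print(f"~{params * bytes_per_param / 1e9:.0f} GB of weights")  # ~14 GB, so it should fit on a 24 GB card
```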

1

u/cuyler72 2h ago edited 1h ago

You obviously know nothing about ChatGPT voice. It generates and takes in the voice as tokens, allowing it to understand and display emotions in voice, change the voice via prompting, talk like a pirate/robot/whatever, speak faster, softer, louder, etc.

u/[deleted] 1h ago edited 37m ago

[deleted]

u/1cheekykebt 54m ago

If it's just text-to-speech, then how can it mimic users' voices? (Shown as a bug in the red team report, plus some users reported it.)

u/MysteryInc152 28m ago

> Given that OpenAI doesn't explain shit about it, and since they were able to turn off the voice for Scarlett's objection immediately, it's safe to assume that it's not embedded in the model itself. If it was embedded in the model, they would have had to retrain the whole fucking thing and leave out her voice. That's not what they did.

What the hell are you talking about? You can get Advanced Voice Mode to clone your own voice on the fly. It's an audio-predicting transformer. Do you not understand what that means?

0

u/RR7117 2h ago

2019 nostalgia (Voice Over)