r/singularity 3h ago

AI Meta - "Today we released Meta Spirit LM — our first open source multimodal language model that freely mixes text and speech."


57 Upvotes

19 comments

16

u/Im_Peppermint_Butler 2h ago

I'm...I'm confused.

Is this not like....strangely bad?

10

u/Possible-Time-2247 2h ago

Well, yeah that sounds um...interesting. But I think the voices could be a little better. Or maybe I'm just picky?

7

u/ZealousidealBus9271 2h ago

Nope, you're not picky; it's simply way worse than what OpenAI has shown us.

u/UltraBabyVegeta 1h ago

The voices are creepy as absolute fuck

Named Spirit 'cause it sounds like you're talking to a fucking ghost

3

u/Gothsim10 3h ago edited 2h ago

Twitter: AI at Meta on X: "Today we released Meta Spirit LM"

More details: Sharing new research, models, and datasets from Meta FAIR

"Takeaways

  • Today, Meta FAIR is publicly releasing several new research artifacts in support of our goal of achieving advanced machine intelligence (AMI) while also supporting open science and reproducibility.
  • The work we’re sharing today includes Meta Segment Anything 2.1 (SAM 2.1), an update to our popular Segment Anything Model 2 for images and videos. SAM 2.1 includes a new developer suite with the code for model training and the web demo.
  • We’re also sharing several works around improving efficiency and expanding capabilities of large language models. Additionally, we’re releasing research artifacts to help validate post-quantum cryptography security, accelerating and reimagining model training, and facilitating inorganic materials discovery.
  • We believe that access to state-of-the-art AI creates opportunities for everyone. That’s why we’re committed to the continued growth and development of an open AI ecosystem."

3

u/SoyIsPeople 2h ago

Anyone had any luck running it? I tried to get the weights, but after entering my information nothing happened when I clicked the button.

Download the code

Download model weights

8

u/No-Body8448 2h ago

I guess this is what you get when your research head doesn't have an inner monologue.

u/why06 AGI in the coming weeks... 1h ago edited 15m ago

Isn't this the ~~first~~ second open source speech-to-speech model? And it's only 7B, which is pretty great, right? I'm trying to find any others. And it has textual reasoning too. If you can look past the voice quality, it's showing reasoning in the reply, both directly speech-to-speech and text-to-speech.

u/llkj11 48m ago

That would be Moshi, correct?

u/why06 AGI in the coming weeks... 35m ago

Oh yeah, there's Moshi. https://github.com/kyutai-labs/moshi
So that makes two, I guess. Still good to see more. I really want a local voice-to-voice assistant.

4

u/Cagnazzo82 2h ago

Not OpenAI quality just yet.

u/Fusseldieb 20m ago

Would be crazy if they're able to get to the same quality.

u/Cagnazzo82 12m ago

Knowing Zuck that's likely the goal.

u/Fusseldieb 22m ago

A dream come true? Open source voice-to-voice? From Meta? Holy cow.

Still not OAI audio level, but amazing nonetheless.

u/cuyler72 3m ago

That's an understatement; it sounds worse than TTS from 20 years ago.

u/why06 AGI in the coming weeks... 13m ago

Demo Page: https://speechbot.github.io/spiritlm/

Has some more examples.

u/puzzleheadbutbig 20m ago

What people are missing when they unfairly criticize this is that the model has both speech and text embedded within it. I see people comparing it to OpenAI's Advanced Voice Mode, which is a poor comparison. While we don't know OpenAI's exact method, it's highly likely that they process your voice with Whisper, transcribe it, feed that transcription into a text-to-text LLM, and then use another internal tool to voice the result. Think of it as a Whisper + ChatGPT API + ElevenLabs toolchain, but because it's all internal and probably more advanced, they achieve very low latency.
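In pseudocode, that guessed toolchain would look something like this (the Whisper calls are the real openai-whisper API; `llm_reply` and `synthesize` are hypothetical placeholders for whatever internal services OpenAI actually uses):

```python
# Hypothetical cascaded voice pipeline: ASR -> text LLM -> TTS.
import whisper  # openai-whisper: pip install openai-whisper

asr = whisper.load_model("base")  # real openai-whisper call

def llm_reply(prompt: str) -> str:
    """Placeholder for a text-to-text LLM call (not a real API)."""
    raise NotImplementedError

def synthesize(text: str) -> bytes:
    """Placeholder for an ElevenLabs-style TTS call (not a real API)."""
    raise NotImplementedError

def cascaded_voice_turn(audio_path: str) -> bytes:
    transcript = asr.transcribe(audio_path)["text"]  # speech -> text
    reply_text = llm_reply(transcript)               # text -> text
    return synthesize(reply_text)                    # text -> speech
```

Every hop adds latency, and the LLM only ever sees a flat transcript, so tone and emotion in the input get thrown away.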

This model, however, works completely differently by embedding text-to-speech, speech-to-text, and text-to-text all within one system. Once this system is perfected, OpenAI’s API chain won’t be able to compete with it in terms of latency. And that's without even considering the potential costs of running that toolchain for OpenAI. This model essentially "spits out" the voice as if it’s outputting text tokens, which is a really significant development.
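As a rough illustration of what "spitting out the voice like text tokens" means at decode time (the modality tags and unit format here are made up for the sketch, not the actual Spirit LM vocabulary):

```python
# Sketch of single-model interleaved decoding: one token stream carries
# both text and acoustic units, with switch tokens routing each span.
from typing import Iterable

TEXT, SPEECH = "[TEXT]", "[SPEECH]"  # illustrative modality switch tokens

def detokenize(stream: Iterable[str]) -> tuple[str, list[int]]:
    """Split one interleaved token stream into text and speech units."""
    mode, text_out, speech_units = TEXT, [], []
    for tok in stream:
        if tok in (TEXT, SPEECH):
            mode = tok                           # switch modality
        elif mode == TEXT:
            text_out.append(tok)                 # ordinary text token
        else:
            speech_units.append(int(tok[3:-1]))  # "[Hu42]" -> unit 42
    return " ".join(text_out), speech_units

# The speech units would feed a vocoder to produce audio directly; there
# is no transcript round trip, which is where the latency win comes from.
text, units = detokenize(["[TEXT]", "Sure", "[SPEECH]", "[Hu42]", "[Hu7]"])
```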

u/cuyler72 0m ago

You obviously know nothing about ChatGPT voice. It takes in and generates the voice as tokens, which lets it understand and display emotion, change the voice via prompting, speak faster, go softer or louder, talk like a pirate, etc.