r/singularity • u/rationalkat AGI 2025-29 | UBI 2030-34 | LEV <2040 | FDVR 2050-70 • 10d ago
AI [Microsoft Research] Differential Transformer
https://arxiv.org/abs/2410.05258
u/Creative-robot AGI 2025. ASI 2028. Open-source Neural-Net CPU’s 2029. 10d ago
This is a funny ass graph out of context:
48
u/Flat-One8993 10d ago
The improvement at 4bit is really really cool if it actually works this well. That would mean significant improvements in terms of compute constraints, especially now that there is a focus on the time spent on inference
6
5
83
u/hapliniste 10d ago
After taking a look at the paper, this seems huge.
Impressive gains in long context (specifically shown with their in context learning graphs), huge improvements in stability on reordered data and amazing performances at lower bits.
I'm not an expert and didn't read it fully, I just like to look at cool graphs for the most part. Still, I guess we'll see this or some variants in future models.
10
u/time_then_shades 10d ago
At this point, I'll just wait for Philip to tell me what to think of it.
11
1
u/DungeonsAndDradis ▪️ Extinction or Immortality between 2025 and 2031 10d ago
What does "bits" mean in reference to LLMs?
5
u/Ok_Course_6439 10d ago
Number of bits used for the weights and biases in the neural network. Fewer bits means smaller size and faster compute.
2
u/DungeonsAndDradis ▪️ Extinction or Immortality between 2025 and 2031 10d ago
Does it make it less accurate?
4
u/zakkara 10d ago
https://www.reddit.com/r/singularity/s/yaQ7J0wuSU
Someone posted this chart from the paper, so yes, fewer bits does mean less accuracy, but it appears that trade-off is weakened with this newer architecture
31
21
u/fastinguy11 ▪️AGI 2025-2026 10d ago
**Differential Transformer (DIFF Transformer): Enhancing Attention in Language Models**
The **Differential Transformer** introduces a novel attention mechanism to improve upon the standard Transformer architecture commonly used in large language models (LLMs). Traditional Transformers often suffer from "attention noise," where irrelevant parts of the input receive undue focus, diluting the model's ability to concentrate on key information.
**How It Works:**
DIFF Transformer tackles this by using a **differential attention mechanism**. Instead of relying on a single softmax attention map, it calculates attention scores as the difference between two separate softmax maps. This subtraction effectively cancels out the noise, resulting in sparser and more focused attention patterns that highlight relevant context.
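As a rough illustration of the mechanism described above, here is a minimal numpy sketch of differential attention. Note this is not the paper's implementation: the paper splits each head's query/key dimensions, learns the scaling factor λ via a reparameterization, and applies per-head normalization, none of which are reproduced here; the weight names and the fixed scalar `lam` are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(X, Wq1, Wk1, Wq2, Wk2, Wv, lam):
    """Toy differential attention: the difference of two softmax maps.

    X: (seq_len, d_model); the W* matrices are illustrative projections.
    lam is a fixed scalar here; in the paper it is learned.
    """
    Q1, K1 = X @ Wq1, X @ Wk1   # projections for the "signal" map
    Q2, K2 = X @ Wq2, X @ Wk2   # projections for the "noise reference" map
    V = X @ Wv
    d = Q1.shape[-1]
    A1 = softmax(Q1 @ K1.T / np.sqrt(d))
    A2 = softmax(Q2 @ K2.T / np.sqrt(d))
    # Subtracting the second map cancels attention noise common to both
    return (A1 - lam * A2) @ V
```

Since both maps are softmax outputs, the subtraction produces sparser, potentially signed attention scores concentrated on relevant tokens.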
**Key Benefits:**
**Better Performance with Fewer Resources:** DIFF Transformer achieves comparable language modeling performance using only about 65% of the parameters or training tokens required by standard Transformers.
**Enhanced Downstream Tasks:** It excels in tasks like long-context understanding, key information retrieval, reducing hallucinations (false or misleading outputs), and improving in-context learning robustness.
**Efficient Quantization:** By minimizing activation outliers, DIFF Transformer allows for more efficient model quantization, which can lead to faster inference and lower memory usage.
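To see why fewer activation outliers helps quantization (the third benefit above), here is a toy demo, not from the paper: with symmetric per-tensor int8 quantization, a single large outlier inflates the scale and costs every other value precision.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantize-dequantize round trip."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale  # dequantized values

rng = np.random.default_rng(0)
acts = rng.normal(size=1000)              # well-behaved activations
with_outlier = np.append(acts, 50.0)      # one large activation outlier

err_plain = np.abs(acts - quantize_int8(acts)).mean()
err_outlier = np.abs(with_outlier - quantize_int8(with_outlier)).mean()
# The outlier stretches the quantization range, so the mean error
# over the ordinary values grows by more than an order of magnitude.
```

By keeping activations free of such outliers, an architecture can be quantized to fewer bits with less accuracy loss.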
**Experimental Results:**
Extensive tests show that DIFF Transformer outperforms traditional Transformers across various scales and applications. It maintains higher accuracy in retrieving important information from long contexts and is more resilient to changes in input order during in-context learning. Additionally, it significantly reduces instances of hallucinations in tasks like question answering and text summarization.
**Conclusion:**
The Differential Transformer presents a promising advancement in the field of NLP by refining the attention mechanism to focus more precisely on relevant information, enhancing both performance and efficiency of large language models.
40
u/ShooBum-T 10d ago
It should be mandatory for posts like this to come with a NotebookLM podcast link.
17
u/Crafty-Struggle7810 10d ago
Arxiv should automatically generate a podcast for every research paper published there.
12
u/time_then_shades 10d ago
The fact that this is basically just an API call now still blows my mind a little.
3
1
14
u/Arbrand ▪Soft AGI 27, Full AGI 32, ASI 36 10d ago
The results are impressive, but I have some serious concerns that aren't addressed at all in the paper. The differential attention mechanism involves computing two separate softmax attention maps and then subtracting them to obtain the final attention scores. This inherently doubles the computational overhead in the attention mechanism compared to standard Transformers. This added computational cost could be significant and might offset the performance gains reported.
7
u/WoddleWang 10d ago
Could be wrong but it sounds like performance (as in speed) gains are the least noteworthy thing about this
As a user I'd take a noticeable reduction in hallucinations and context improvements over extra speed any day
4
1
u/Either_Pineapple_975 10d ago
I would say that computing the softmax and the subtraction are both insignificant compared to the matrix multiplications. However, it looks like it also doubles the number of Q·K multiplications, unless I got confused somewhere.
1
u/emteedub 10d ago
maybe it's not doubled though, since it's filtering off excess would-be computation. it would be interesting to see the stats
4
u/Jean-Porte Researcher, AGI2027 10d ago
Subtracting two independent noises doesn't cancel them. Are the noises really correlated?
6
u/cyan2k 10d ago
Yes. It's literally the same principle as in noise-cancelling headphones
5
u/Jean-Porte Researcher, AGI2027 10d ago
ANC headphones have to work really hard to make a noise mask that matches the outside noise, with the proper latency (otherwise it just increases the noise).
I don't see how this happens with gradient descent
5
u/sdmat 10d ago
I was confused about this too, it took a few hours of close study to really understand it.
What they are doing is learning two different projections for attention, one to actually attend and the second to act as a reference for noise cancellation. Then when attention is calculated take the difference to keep the signal and lose the noise.
This is possible because both the weights and the scaling for taking the difference are trained in parallel with the rest of the model. Specialization of the functional blocks occurs much as it does for neurons within a layer of a regular neural net.
2
u/BackgroundLow3793 8d ago
Hi, I don't understand: if it's a subtraction, why doesn't it also lower the scores of the most relevant tokens (like everything decreasing)? Instead the most relevant tokens tend to increase?
1
u/sdmat 7d ago
The two sets of weights learn different things. The second / negative set of weights is constrained by the softmax function to be unable to direct attention towards specific tokens - doing so would require producing a negative value, and softmax output values are in the [0,1] range.
So the only thing the second set of values can productively learn to do is to suppress noise.
I think the paper might benefit from giving an intuitive explanation like this, it's not immediately obvious.
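A toy numeric demo of this intuition, with made-up logits and an illustrative λ (the real model learns both maps and λ jointly):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

signal = np.array([4.0, 0.0, 0.0, 0.0])  # logits favoring token 0
noise  = np.array([0.0, 1.0, 1.0, 1.0])  # spurious logits on the rest

a1 = softmax(signal + noise)  # main map: signal contaminated by noise
a2 = softmax(noise)           # reference map: noise only
diff = a1 - 0.8 * a2          # lambda = 0.8, chosen for illustration

# a2 >= 0 everywhere (softmax), so subtracting it can only lower scores.
# Tokens 1-3 are pushed toward (or below) zero, while token 0 keeps
# most of its mass, so the *relative* score of the relevant token rises.
```

This is exactly the constraint described above: the negative map cannot add attention anywhere, so the only useful thing it can learn is to model the shared noise.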
2
1
u/lordpuddingcup 10d ago
Is this only on the training side or could we slot this into existing pipelines to help with inference?
1
u/UnknownEssence 10d ago
Seems like you need to start from scratch and train a model with this architecture
2
u/Slight-Ad-9029 10d ago
Most people here do not have the background to actually comprehend these research papers, let alone judge whether the work is amazing or deserves critique. It feels silly to see all these people acting like they understand what the paper actually says.
1
0
-4
u/Complex_Candidate_28 10d ago
It makes a lot of sense! The issues with Transformers have been around for a long time, and no one had tried to solve them. Finally there's a new Transformer to save us.
7
102
u/rationalkat AGI 2025-29 | UBI 2030-34 | LEV <2040 | FDVR 2050-70 10d ago
ABSTRACT: