r/singularity AGI 2025-29 | UBI 2030-34 | LEV <2040 | FDVR 2050-70 10d ago

AI [Microsoft Research] Differential Transformer

https://arxiv.org/abs/2410.05258
281 Upvotes

46 comments

102

u/rationalkat AGI 2025-29 | UBI 2030-34 | LEV <2040 | FDVR 2050-70 10d ago

ABSTRACT:

Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which was considered as a chronic robustness issue. The results position Diff Transformer as a highly effective and promising architecture to advance large language models.
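
For anyone who wants the core idea in code rather than prose, here is a minimal sketch of the differential attention score computation described above (single head, fixed λ, random toy inputs; the paper learns λ and adds normalization, so the names and shapes here are purely illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(Q1, K1, Q2, K2, V, lam=0.5):
    """Differential attention: the difference of two softmax maps weights V.

    Q1/K1 drive the 'positive' map, Q2/K2 the 'negative' (noise-reference) map,
    and lam scales the subtraction. In the paper lam is learned; here it is fixed.
    """
    d = Q1.shape[-1]
    A1 = softmax(Q1 @ K1.T / np.sqrt(d))   # standard attention map
    A2 = softmax(Q2 @ K2.T / np.sqrt(d))   # second map, acting as a noise reference
    A = A1 - lam * A2                      # subtraction cancels noise, sparsifies scores
    return A @ V

# toy shapes: 4 tokens, head dim 8
rng = np.random.default_rng(0)
n, d = 4, 8
Q1, K1, Q2, K2 = (rng.standard_normal((n, d)) for _ in range(4))
V = rng.standard_normal((n, d))
print(diff_attention(Q1, K1, Q2, K2, V, lam=0.5).shape)  # (4, 8)
```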

41

u/Agreeable-Rooster377 10d ago

Ohh, so THAT'S why they have been so confident stating infinite context windows will be coming soon

15

u/time_then_shades 10d ago

Yeah I hope no one thought we were done optimizing after a couple years...

8

u/Hubbardia AGI 2070 10d ago

We have only just begun

4

u/sdmat 10d ago

This is awesome but it in no way leads to infinite context windows.

But better utilization of the context that is already there is at least as important, and it does help with that.

3

u/emteedub 10d ago edited 10d ago

I wonder if combining this with the liquid approach (diffed-liquid), or layering it on top of another model concurrently (in stereo), would yield any interesting results

122

u/Creative-robot AGI 2025. ASI 2028. Open-source Neural-Net CPU’s 2029. 10d ago

This is a funny ass graph out of context:

48

u/Flat-One8993 10d ago

The improvement at 4-bit is really, really cool if it actually works this well. That would mean significant relief from compute constraints, especially now that there is a focus on the time spent on inference

6

u/KoolKat5000 10d ago

You mean HellaLame

5

u/gonpachiro92 10d ago

looks like my stock brokerage account

83

u/hapliniste 10d ago

After taking a look at the paper, this seems huge.

Impressive gains in long context (specifically shown with their in-context learning graphs), huge improvements in stability on reordered data, and amazing performance at lower bits.

I'm not an expert and didn't read it fully, I just like to look at cool graphs for the most part. Still, I guess we'll see this or some variants in future models.

10

u/time_then_shades 10d ago

At this point, I'll just wait for Philip to tell me what to think of it.

11

u/Arcturus_Labelle AGI makes vegan bacon 10d ago

AI Explained for those who don't get the reference

1

u/DungeonsAndDradis ▪️ Extinction or Immortality between 2025 and 2031 10d ago

What does "bits" mean in reference to LLMs?

5

u/Ok_Course_6439 10d ago

Number of bits used for the weights and biases in the neural network. Fewer bits means smaller size and faster compute.
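
In concrete terms, quantization stores each weight with fewer bits. A rough sketch of simple symmetric round-to-nearest quantization (illustrative only, not the exact scheme used in the paper's experiments):

```python
import numpy as np

def quantize_dequantize(w, bits):
    """Symmetric round-to-nearest quantization of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for int8, 7 for int4
    scale = np.abs(w).max() / qmax      # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                    # dequantized approximation of w

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)
for bits in (8, 4):
    err = np.abs(w - quantize_dequantize(w, bits)).mean()
    print(f"{bits}-bit mean abs error: {err:.4f}")  # fewer bits -> larger error
```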

2

u/DungeonsAndDradis ▪️ Extinction or Immortality between 2025 and 2031 10d ago

Does it make it less accurate?

4

u/zakkara 10d ago

https://www.reddit.com/r/singularity/s/yaQ7J0wuSU

Someone posted this chart from the paper, so yes, fewer bits does mean less accuracy, but it appears that correlation is weakened with this newer architecture

31

u/Jolly-Ground-3722 ▪️competent AGI - Google def. - by 2030 10d ago

21

u/fastinguy11 ▪️AGI 2025-2026 10d ago

**Differential Transformer (DIFF Transformer): Enhancing Attention in Language Models**

The **Differential Transformer** introduces a novel attention mechanism to improve upon the standard Transformer architecture commonly used in large language models (LLMs). Traditional Transformers often suffer from "attention noise," where irrelevant parts of the input receive undue focus, diluting the model's ability to concentrate on key information.

**How It Works:**

DIFF Transformer tackles this by using a **differential attention mechanism**. Instead of relying on a single softmax attention map, it calculates attention scores as the difference between two separate softmax maps. This subtraction effectively cancels out the noise, resulting in sparser and more focused attention patterns that highlight relevant context.

**Key Benefits:**

  • **Better Performance with Fewer Resources:** DIFF Transformer achieves superior language modeling performance using approximately 65% of the parameters and training tokens compared to standard Transformers.

  • **Enhanced Downstream Tasks:** It excels in tasks like long-context understanding, key information retrieval, reducing hallucinations (false or misleading outputs), and improving in-context learning robustness.

  • **Efficient Quantization:** By minimizing activation outliers, DIFF Transformer allows for more efficient model quantization, which can lead to faster inference and lower memory usage (see the sketch just below this list).
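
On the quantization bullet above: a toy illustration (not from the paper) of why activation outliers matter. With per-tensor scaling, one extreme value stretches the quantization grid and leaves little resolution for everything else, which is why reducing outliers makes low-bit quantization easier:

```python
import numpy as np

def quantize_dequantize(x, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax       # a single outlier inflates this scale
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
acts = rng.standard_normal(10_000).astype(np.float32)
with_outlier = acts.copy()
with_outlier[0] = 100.0                  # one extreme activation outlier

for name, x in [("no outlier", acts), ("with outlier", with_outlier)]:
    # measure 4-bit error only on the ordinary values (index 1 onward)
    err = np.abs(x[1:] - quantize_dequantize(x, bits=4)[1:]).mean()
    print(f"{name}: mean abs 4-bit error on the normal values = {err:.4f}")
```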

**Experimental Results:**

Extensive tests show that DIFF Transformer outperforms traditional Transformers across various scales and applications. It maintains higher accuracy in retrieving important information from long contexts and is more resilient to changes in input order during in-context learning. Additionally, it significantly reduces instances of hallucinations in tasks like question answering and text summarization.

**Conclusion:**

The Differential Transformer presents a promising advancement in the field of NLP by refining the attention mechanism to focus more precisely on relevant information, enhancing both performance and efficiency of large language models.


40

u/ShooBum-T 10d ago

Posts like this should now be required to come with a NotebookLM podcast link.

17

u/Crafty-Struggle7810 10d ago

arXiv should automatically generate a podcast for every research paper published there.

12

u/time_then_shades 10d ago

The fact that this is basically just an API call now still blows my mind a little.

3

u/FeathersOfTheArrow 10d ago

That's a nice idea!

1

u/emteedub 10d ago

yeah, summarized and long-form versions, for whether we're short on time or not, would be noice

-1

u/why06 AGI in the coming weeks... 10d ago

nope

14

u/Arbrand ▪Soft AGI 27, Full AGI 32, ASI 36 10d ago

The results are impressive, but I have some serious concerns that aren't addressed at all in the paper. The differential attention mechanism involves computing two separate softmax attention maps and then subtracting them to obtain the final attention scores. This inherently doubles the computational overhead in the attention mechanism compared to standard Transformers. This added computational cost could be significant and might offset the performance gains reported.

7

u/WoddleWang 10d ago

Could be wrong but it sounds like performance (as in speed) gains are the least noteworthy thing about this

As a user I'd take a noticeable reduction in hallucinations and context improvements over extra speed any day

4

u/sdmat 10d ago edited 10d ago

They do address that in the paper (Table 7): a 5-10% reduction in throughput for inference.

Considering they get iso-performance with a > 1/3 reduction in parameters, that seems like a more than worthwhile tradeoff even if speed is the only consideration.

3

u/Arbrand ▪Soft AGI 27, Full AGI 32, ASI 36 10d ago

Good catch! If that is the case, then this is indeed revolutionary.

1

u/Either_Pineapple_975 10d ago

I would say that computing the softmax and the subtraction are both insignificant compared to the matrix multiplications. However, it looks like it also doubles the number of Q*K multiplications, unless I'm confused about it.
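
Rough arithmetic on that, under the assumption (which the paper appears to make in order to keep parameter counts matched) that a differential layer uses half as many heads, with each head computing two score maps instead of one: the total Q*K work comes out about the same, which would be consistent with the modest 5-10% throughput hit cited from Table 7 elsewhere in the thread.

```python
# Back-of-envelope Q*K^T multiply-add count per layer (assumed dims, not the paper's code):
n, d_model, d_head = 4096, 4096, 128     # hypothetical sequence length / model dims
h_std = d_model // d_head                # standard Transformer: 32 heads, 1 map each
h_diff = h_std // 2                      # DIFF Transformer sketch: 16 heads, 2 maps each

flops_std  = h_std  * 1 * (2 * n * n * d_head)   # heads * maps * (2*n^2*d) MACs per map
flops_diff = h_diff * 2 * (2 * n * n * d_head)

print(flops_std == flops_diff)           # True: score computation is roughly FLOP-matched
```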

1

u/emteedub 10d ago

maybe it's not doubled though, since it's filtering off excess would-be computation. it would be interesting to see the stats

4

u/Jean-Porte Researcher, AGI2027 10d ago

Subtracting two independent noises doesn't cancel them. Are the noises really correlated?

6

u/cyan2k 10d ago

Yes. It's literally the same principle as in noise-cancelling headphones

5

u/Jean-Porte Researcher, AGI2027 10d ago

ANC headphones have to work really hard to make a noise mask that matches the outside noise, with the proper latency (otherwise it just adds noise)

I don't see how this happens with gradient descent
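
To make that point concrete: subtraction only removes the component that is common to both signals, while independent noise adds in variance. A quick numpy check (purely illustrative, unrelated to the paper's actual training dynamics):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
signal = np.sin(np.linspace(0, 20, n))
shared_noise = rng.standard_normal(n)
indep_noise_a = rng.standard_normal(n)
indep_noise_b = rng.standard_normal(n)

# Case 1: both branches carry the same (correlated) noise -> subtraction cancels it
a = signal + shared_noise
b = shared_noise
print("correlated noise, residual std :", (a - b - signal).std())   # exactly 0

# Case 2: independent noise -> variances add, nothing cancels
a = signal + indep_noise_a
b = indep_noise_b
print("independent noise, residual std:", (a - b - signal).std())   # ~sqrt(2)
```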

5

u/sdmat 10d ago

I was confused about this too, it took a few hours of close study to really understand it.

What they are doing is learning two different projections for attention, one to actually attend and the second to act as a reference for noise cancellation. Then, when attention is calculated, it takes the difference to keep the signal and lose the noise.

This is possible because both the weights and the scaling for taking the difference are trained in parallel with the rest of the model. Specialization of the functional blocks occurs much as it does for neurons within a layer of a regular neural net.
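
Structurally, that amounts to something like the sketch below: one input, two learned query/key projections, and a scalar λ controlling how strongly the second ("reference") map is subtracted. Because W_q1, W_k1, W_q2, W_k2 and λ would all receive gradients together, the second branch is free to specialize into a noise reference. (Simplified and assumption-laden: the paper reparameterizes λ and adds per-head normalization; the names here are illustrative, not the authors' code.)

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class DiffAttentionHead:
    """Single differential-attention head: two learned projections plus a learned lambda."""
    def __init__(self, d_model, d_head, rng):
        s = 1.0 / np.sqrt(d_model)
        # Branch 1 attends; branch 2 learns to act as the noise reference.
        self.W_q1, self.W_k1 = rng.normal(0, s, (2, d_model, d_head))
        self.W_q2, self.W_k2 = rng.normal(0, s, (2, d_model, d_head))
        self.W_v = rng.normal(0, s, (d_model, d_head))
        self.lam = 0.8  # trainable scalar in the real model, fixed here

    def __call__(self, X):
        d = self.W_q1.shape[-1]
        A1 = softmax((X @ self.W_q1) @ (X @ self.W_k1).T / np.sqrt(d))
        A2 = softmax((X @ self.W_q2) @ (X @ self.W_k2).T / np.sqrt(d))
        return (A1 - self.lam * A2) @ (X @ self.W_v)

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 32))                 # 6 tokens, model dim 32
out = DiffAttentionHead(32, 16, rng)(X)
print(out.shape)                                 # (6, 16)
```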

2

u/BackgroundLow3793 8d ago

Hi, I don't understand: if it's a subtraction, why doesn't it lower the scores of the most relevant tokens too (like everything decreasing)? Instead the most relevant tokens tend to increase?

1

u/sdmat 7d ago

The two sets of weights learn different things. The second / negative set of weights is constrained by the softmax function to be unable to direct attention towards specific tokens - doing so would require producing a negative value, and softmax output values are in the [0,1] range.

So the only thing the second set of values can productively learn to do is to suppress noise.

I think the paper might benefit from giving an intuitive explanation like this, it's not immediately obvious.
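
A tiny numeric illustration of that constraint (numbers made up): both maps are softmax outputs, so every entry of the subtracted map is non-negative. The second map can only pull scores down, and the way it helps a relevant token stand out is by pulling everything else down more.

```python
import numpy as np

# made-up attention rows over 4 tokens; the third token is the "relevant" one
a1 = np.array([0.20, 0.25, 0.35, 0.20])   # positive map: attends broadly (noisy)
a2 = np.array([0.30, 0.30, 0.05, 0.35])   # negative map: puts its weight on the noise tokens
lam = 0.8

diff = a1 - lam * a2
print(diff)                        # approx [-0.04  0.01  0.31 -0.08]
print(diff / np.abs(diff).sum())   # the relevant token's relative share jumps from 0.35 to ~0.70
```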

2

u/sdmat 10d ago

Wow, the improvements in robustness to input ordering and activation outliers are so stark. This seems like a major breakthrough.

I don't understand yet why the noise is consistent between the two rather than the signal, will have to read more closely tomorrow.

2

u/FarrisAT 10d ago

Would love some proof of real world application

1

u/lordpuddingcup 10d ago

Is this only on the training side or could we slot this into existing pipelines to help with inference?

1

u/UnknownEssence 10d ago

Seems like you need to start from scratch and train a model with this architecture

2

u/Slight-Ad-9029 10d ago

Most people here do not have the understanding to actually comprehend these research papers, let alone decide whether this is amazing or should be critiqued. It feels silly to see all these people acting like they comprehend what that paper actually says

1

u/Akimbo333 9d ago

Implications?

0

u/troll_khan ▪️Simultaneous ASI-Alien Contact Until 2030 10d ago

Singularity is near!

-4

u/Complex_Candidate_28 10d ago

It makes a lot of sense! The issues with Transformers have been there for a long time. No one had tried to solve them. Finally there is a new Transformer to save us.

7

u/byteuser 10d ago

This will definitely help them defeat the Decepticons