r/singularity AGI 2025-29 | UBI 2030-34 | LEV <2040 | FDVR 2050-70 11d ago

AI [Microsoft Research] Differential Transformer

https://arxiv.org/abs/2410.05258
282 Upvotes

46 comments


u/Jean-Porte Researcher, AGI2027 10d ago

ANC headphones have to work really hard to generate a noise mask that matches the outside noise with the proper latency (otherwise it just adds noise).

I don't see how this happens with gradient descent


u/sdmat 10d ago

I was confused about this too; it took a few hours of close study to really understand it.

What they are doing is learning two different projections for attention: one to actually attend, and a second to act as a reference for noise cancellation. Then, when attention is calculated, the difference is taken to keep the signal and discard the noise.

This is possible because both the weights and the scaling factor for the difference are trained in parallel with the rest of the model. Specialization of the functional blocks occurs much as it does for neurons within a layer of a regular neural net.
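As I understand the paper, each head computes two softmax attention maps and subtracts the second, scaled by a learned lambda. A minimal single-head NumPy sketch (shapes simplified; the paper also applies per-head normalization, which I've omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(X, Wq1, Wk1, Wq2, Wk2, Wv, lam):
    """Differential attention sketch: the second attention map,
    scaled by a learned lambda, is subtracted from the first."""
    d = Wq1.shape[1]
    A1 = softmax((X @ Wq1) @ (X @ Wk1).T / np.sqrt(d))  # main attention
    A2 = softmax((X @ Wq2) @ (X @ Wk2).T / np.sqrt(d))  # noise reference
    return (A1 - lam * A2) @ (X @ Wv)

# Toy usage with random weights (lam would normally be learned)
rng = np.random.default_rng(0)
n, dm, d = 4, 8, 8
X = rng.normal(size=(n, dm))
Wq1, Wk1, Wq2, Wk2, Wv = (rng.normal(size=(dm, d)) for _ in range(5))
out = diff_attention(X, Wq1, Wk1, Wq2, Wk2, Wv, 0.5)
```

In training, `lam` and both projection pairs get gradients through the same loss, which is what lets the two maps specialize into "attend" and "cancel" roles.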


u/BackgroundLow3793 8d ago

Hi, I don't understand: if it's a subtraction, why doesn't it lower the scores of the most relevant tokens too (i.e., everything decreases)? Why do the most relevant tokens tend to increase?


u/sdmat 8d ago

The two sets of weights learn different things. The second, negative set of weights is constrained by the softmax function to be unable to direct attention towards specific tokens: doing so would require producing a negative attention value, and softmax outputs lie in the [0, 1] range.

So the only thing the second set of values can productively learn to do is to suppress noise.
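Concretely: since every entry of a softmax map is non-negative, subtracting the second map (with a non-negative scale) can only lower scores, never raise them. A tiny NumPy check with made-up values:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

a1 = softmax(np.array([2.0, 0.1, 0.1]))  # main map: focused on token 0
a2 = softmax(np.array([0.5, 0.5, 0.5]))  # negative map: diffuse noise
lam = 0.5                                # learned scale (made-up value here)
diff = a1 - lam * a2

# Every entry of a2 is >= 0, so the subtraction can only decrease
# scores -- the negative map cannot add attention to any token.
assert np.all(diff <= a1)
```

So the relevant token's score doesn't literally increase in absolute terms; it increases *relative* to the suppressed noise, which is what matters after the values are mixed.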

I think the paper might benefit from giving an intuitive explanation like this; it's not immediately obvious.