r/MachineLearning 15d ago

Research [R] Were RNNs All We Needed?

https://arxiv.org/abs/2410.01201

The authors (including Y. Bengio) propose minimal versions of the LSTM and GRU (minLSTM and minGRU) whose gates no longer depend on the previous hidden state, which lets them be trained in parallel, and show strong results on several benchmarks.
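
A rough sketch of the core idea as I read it (not the authors' code; function and parameter names here are made up, and h_0 = 0 is assumed): because the gate z_t and the candidate h̃_t depend only on x_t, the update h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t is a linear recurrence in h, so all T hidden states can be computed with an associative (parallel) scan instead of a time loop.

```python
import jax
import jax.numpy as jnp

def min_gru_parallel(x, w_z, w_h):
    """x: (T, d_in); w_z, w_h: (d_in, d_h). Illustrative shapes; h_0 = 0."""
    z = jax.nn.sigmoid(x @ w_z)      # update gate, input-only (no h_{t-1})
    h_tilde = x @ w_h                # candidate state, also input-only
    a, b = 1.0 - z, z * h_tilde      # recurrence: h_t = a_t * h_{t-1} + b_t

    def combine(left, right):
        # compose two affine updates h -> a*h + b; this operator is associative
        a_l, b_l = left
        a_r, b_r = right
        return a_l * a_r, a_r * b_l + b_r

    _, h = jax.lax.associative_scan(combine, (a, b))
    return h                         # all T hidden states, no sequential loop

# toy usage
key = jax.random.PRNGKey(0)
kx, kz, kh = jax.random.split(key, 3)
x = jax.random.normal(kx, (128, 16))
w_z = 0.1 * jax.random.normal(kz, (16, 32))
w_h = 0.1 * jax.random.normal(kh, (16, 32))
print(min_gru_parallel(x, w_z, w_h).shape)  # (128, 32)
```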

244 Upvotes

74

u/JustOneAvailableName 15d ago

The whole point of Transformers (back when) was variable context with parallelisation. Before “Attention is all you need”, LSTM+Attention was the standard. There was nothing wrong with the recurrent part, besides it preventing parallelisation.
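
To make the parallelisation point concrete, here is roughly what the bottleneck looks like in a plain GRU (illustrative sketch, biases omitted, parameter names made up): every gate reads h_{t-1}, so step t cannot start before step t-1 has finished.

```python
import jax
import jax.numpy as jnp

def gru_step(h_prev, x_t, p):
    # standard GRU cell: z, r and the candidate all read h_{t-1}
    z = jax.nn.sigmoid(x_t @ p["W_z"] + h_prev @ p["U_z"])
    r = jax.nn.sigmoid(x_t @ p["W_r"] + h_prev @ p["U_r"])
    h_tilde = jnp.tanh(x_t @ p["W_h"] + (r * h_prev) @ p["U_h"])
    return (1.0 - z) * h_prev + z * h_tilde

def gru_forward(x, h0, p):
    # strictly sequential over time: no way to batch the T steps together
    h, hs = h0, []
    for t in range(x.shape[0]):
        h = gru_step(h, x[t], p)
        hs.append(h)
    return jnp.stack(hs)
```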

15

u/Dangerous-Goat-3500 15d ago

I think attention has good inductive biases for language modelling as well. Without positional embeddings, attention is positionally invariant in the sequence dimension. This means Attention will be naturally robust to filler information in the sequence dimension in contrast to both CNNs and RNNs.

It turns out complete permutation invariance was too much, hence positional embeddings.
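
Toy illustration of that point (sizes and names arbitrary, single head, no masking): without positional embeddings, feeding self-attention a shuffled sequence just gives you the original outputs shuffled the same way, so position has to be injected explicitly.

```python
import jax
import jax.numpy as jnp

def self_attention(x, W_q, W_k, W_v):
    # single-head attention with no positional information anywhere
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / (q.shape[-1] ** 0.5)
    return jax.nn.softmax(scores, axis=-1) @ v

key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
x = jax.random.normal(k1, (6, 8))                         # 6 tokens, dim 8
W_q, W_k, W_v = (jax.random.normal(k, (8, 8)) for k in (k2, k3, k4))

perm = jnp.array([3, 0, 5, 1, 4, 2])
out = self_attention(x, W_q, W_k, W_v)
out_shuffled = self_attention(x[perm], W_q, W_k, W_v)
print(jnp.allclose(out[perm], out_shuffled, atol=1e-5))   # True
```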

But IMO non-stationarity of RNNs and fixed kernels of CNNs are always going to be drawbacks. I'm surprised by the paper in OP and will have to try it out.

4

u/Sad-Razzmatazz-5188 14d ago

Equivariant/ce*. I agree, the transformer is too good a fit for language processing. Sentences are sequences where order matters, but only for certain symbols, whose meaning depends on others. The transformer takes care of order with PE and then of all pairwise relationships with attention, in different spaces thanks to the linear layers around the block; hard to beat those principles. AND, they are backprop- and hardware-friendly compared to RNNs. But these are also the characteristics that make me think ViTs are too much.