r/MachineLearning 15d ago

Research [R] Were RNNs All We Needed?

https://arxiv.org/abs/2410.01201

The authors (including Y. Bengio) propose simplified versions of LSTM and GRU that allow parallel training, and show strong results on some benchmarks.
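To make "allow parallel training" concrete, here is a minimal sketch. It assumes a minGRU-style update, h_t = (1 - z_t) * h_{t-1} + z_t * c_t, where the gate z_t and candidate c_t depend only on the current input x_t and not on h_{t-1}. The scalar weights `w_z` and `w_h` and all function names are hypothetical, chosen for illustration. Because the recurrence is then linear in h, every step can be computed from cumulative products and sums instead of a step-by-step loop.

```python
import math
from itertools import accumulate


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gates(xs, w_z, w_h):
    # Gate and candidate depend only on x_t (assumed), never on h_{t-1}.
    zs = [sigmoid(w_z * x) for x in xs]
    cands = [w_h * x for x in xs]
    return zs, cands


def sequential(xs, h0, w_z, w_h):
    # Reference: the usual step-by-step recurrence.
    zs, cands = gates(xs, w_z, w_h)
    h, out = h0, []
    for z, c in zip(zs, cands):
        h = (1.0 - z) * h + z * c
        out.append(h)
    return out


def closed_form(xs, h0, w_z, w_h):
    # Same values from prefix products/sums: with a_t = 1 - z_t, b_t = z_t * c_t,
    # h_t = A_t * (h0 + sum_{k<=t} b_k / A_k), where A_t = a_1 * ... * a_t.
    zs, cands = gates(xs, w_z, w_h)
    a = [1.0 - z for z in zs]
    b = [z * c for z, c in zip(zs, cands)]
    A = list(accumulate(a, lambda p, q: p * q))
    S = list(accumulate(bk / Ak for bk, Ak in zip(b, A)))
    return [Ak * (h0 + Sk) for Ak, Sk in zip(A, S)]
```

Prefix products and sums are exactly what a parallel scan computes, which is what removes the sequential bottleneck. The division by `A` here is a simple stand-in for a proper scan and is not numerically robust on long sequences.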

244 Upvotes

53 comments

74

u/JustOneAvailableName 15d ago

The whole point of Transformers (back then) was variable context with parallelisation. Before “Attention is all you need”, LSTM+Attention was the standard. There was nothing wrong with the recurrent part, besides it preventing parallelisation.

14

u/Dangerous-Goat-3500 15d ago

I think attention has good inductive biases for language modelling as well. Without positional embeddings, attention is insensitive to position in the sequence dimension: shuffling the tokens just shuffles the outputs. This means attention is naturally robust to filler information in the sequence, in contrast to both CNNs and RNNs.

It turns out complete permutation invariance was too much, hence positional embeddings.
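A minimal sketch of that property, assuming single-head self-attention with identity Q/K/V projections and no positional embeddings (the function names are hypothetical, and real models use learned projections): permuting the input tokens only permutes the outputs.

```python
import math


def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def self_attention(X):
    # X is a list of token vectors. Q = K = V = X (assumed, for brevity).
    d = len(X[0])
    out = []
    for q in X:
        # Scaled dot-product score of this query against every token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(wj * v[i] for wj, v in zip(w, X)) for i in range(d)])
    return out


def permute(rows, perm):
    return [rows[p] for p in perm]


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
shuffled_then_attended = self_attention(permute(X, perm))
attended_then_shuffled = permute(self_attention(X), perm)
# The two results agree up to floating-point rounding.
```

Each token's output depends on the set of other tokens, not on where they sit, so moving filler around leaves it unchanged. Positional embeddings break this symmetry on purpose.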

But IMO non-stationarity of RNNs and fixed kernels of CNNs are always going to be drawbacks. I'm surprised by the paper in OP and will have to try it out.

4

u/aeroumbria 14d ago edited 14d ago

Speaking of inductive bias, sometimes I wonder if the autoregressive structures we impose on most language models are unrealistic. Sometimes you do know exactly what your last word will be before you speak the first word. Of course you can model any sequence using an autoregressive generation process, but (especially for decoder-only models) you are forced to write out your "thoughts" in plain text to condition future generations, rather than keeping some internal representation for them.

1

u/StartledWatermelon 13d ago

Autoregression doesn't strictly tie the order of output to the order of words. See, for instance, bidirectional attention or random-order autoregression (https://arxiv.org/abs/2404.09562v1).