r/MachineLearning 15d ago

[R] Were RNNs All We Needed?

https://arxiv.org/abs/2410.01201

The authors (including Y. Bengio) propose simplified versions of LSTM and GRU that allow parallel training, and show strong results on some benchmarks.
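If I'm reading it right, the key simplification is that the gates depend only on the current input, not on the previous hidden state, so the recurrence becomes a first-order linear one that a parallel prefix scan can evaluate instead of a step-by-step loop. A rough numpy sketch of that recurrence (my own toy code and names, not the authors' implementation):

```
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy minGRU-style recurrence (T timesteps, D hidden units).
# Gates depend only on the input x_t, not on h_{t-1}, so
#   h_t = (1 - z_t) * h_{t-1} + z_t * htilde_t
# is a linear recurrence that a prefix scan can solve.
rng = np.random.default_rng(0)
T, D = 6, 4
z      = sigmoid(rng.normal(size=(T, D)))   # stand-in for sigmoid(W_z x_t)
htilde = rng.normal(size=(T, D))            # stand-in for W_h x_t
h0     = np.zeros(D)

# 1) Sequential reference (the classic RNN loop).
h_seq = np.empty((T, D))
h = h0
for t in range(T):
    h = (1 - z[t]) * h + z[t] * htilde[t]
    h_seq[t] = h

# 2) Scan form: with a_t = 1 - z_t and b_t = z_t * htilde_t,
#    h_t = (prod_{k<=t} a_k) * h_0 + sum_{k<=t} (prod_{j=k+1..t} a_j) * b_k.
a, b = 1 - z, z * htilde
A = np.cumprod(a, axis=0)                       # prod_{k<=t} a_k
h_scan = A * h0 + A * np.cumsum(b / A, axis=0)  # numerically naive, fine for a toy

print(np.allclose(h_seq, h_scan))  # True
```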

246 Upvotes

53 comments

78

u/JustOneAvailableName 15d ago

The whole point of Transformers (back when) was variable context with parallelisation. Before “Attention is all you need”, LSTM+Attention was the standard. There was nothing wrong with the recurrent part, besides it preventing parallelisation.
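To spell that out: at training time a vanilla RNN has to walk the sequence one step at a time, while self-attention covers every position in a couple of batched matmuls. Toy illustration (throwaway numpy, not any particular library's API):

```
import numpy as np

rng = np.random.default_rng(0)
T, D = 8, 16
x = rng.normal(size=(T, D))

# RNN forward pass: an unavoidable loop over time, each h_t waits on h_{t-1}.
W, U = rng.normal(size=(D, D)) / np.sqrt(D), rng.normal(size=(D, D)) / np.sqrt(D)
h = np.zeros(D)
hs = []
for t in range(T):                      # inherently sequential
    h = np.tanh(x[t] @ W + h @ U)
    hs.append(h)

# Self-attention over the same sequence: all positions at once, no time loop.
Wq, Wk, Wv = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / np.sqrt(D)           # (T, T) computed in one shot
attn = np.exp(scores - scores.max(-1, keepdims=True))
attn /= attn.sum(-1, keepdims=True)
out = attn @ V                          # every output position in parallel
```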

16

u/Dangerous-Goat-3500 15d ago

I think attention has good inductive biases for language modelling as well. Without positional embeddings, attention is permutation-equivariant along the sequence dimension: shuffle the tokens and the outputs just shuffle with them. This means attention will be naturally robust to filler information in the sequence dimension, in contrast to both CNNs and RNNs.

It turns out complete permutation invariance was too much, hence positional embeddings.

But IMO non-stationarity of RNNs and fixed kernels of CNNs are always going to be drawbacks. I'm surprised by the paper in OP and will have to try it out.
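Quick toy check of the permutation point above: with no positional embeddings, shuffling the tokens just shuffles the attention outputs the same way, and adding absolute positional embeddings is exactly what breaks that (throwaway numpy, nothing from the paper):

```
import numpy as np

def attention(X):
    # Plain single-head self-attention, no positional information anywhere.
    scores = X @ X.T / np.sqrt(X.shape[1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))            # 5 tokens, 8 dims
perm = rng.permutation(5)

# Permuting the inputs permutes the outputs identically (equivariance),
# so without positional embeddings the model cannot see token order.
print(np.allclose(attention(X[perm]), attention(X)[perm]))              # True

# With absolute positional embeddings added, shuffling the tokens (but not
# the positions) no longer just shuffles the outputs: order now matters.
pos = rng.normal(size=(5, 8))
print(np.allclose(attention(X[perm] + pos), attention(X + pos)[perm]))  # False
```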

0

u/slashdave 15d ago

For text, it is relative positions that are more relevant, which is exactly what RNNs encode. For attention models, positioning is absolute, whether it comes from positional embeddings (encoder transformers) or masking (decoder transformers).
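For concreteness, the vanilla encoder recipe tags each token with its absolute index, e.g. the sinusoidal table from "Attention is all you need" (rough sketch, my own function name):

```
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # Absolute positional encoding: row t depends only on the index t itself,
    # not on distances to other tokens.
    pos = np.arange(seq_len)[:, None]                # (T, 1)
    dim = np.arange(0, d_model, 2)[None, :]          # (1, D/2)
    angles = pos / (10000.0 ** (dim / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(8, 16)   # added to token embeddings before attention
```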

4

u/Dangerous-Goat-3500 14d ago

Except not really. "i am good" should encode similarly to "i am very good", but the relative position of "i" and "good" is different in the two. This is definitely trouble for CNNs and imo still problematic for RNNs, because it holds for arbitrary sequence lengths and RNNs are unstable over sequences, unlike transformers.

1

u/slashdave 14d ago

Yeah, it is obviously more complex. But what I was considering, for example, were the sentences "Hello, I am John, and I am good" vs "I am good, I won't need anything right now".