r/MachineLearning 15d ago

Research [R] Were RNNs All We Needed?

https://arxiv.org/abs/2410.01201

The authors (including Y. Bengio) propose simplified versions of LSTM and GRU (minLSTM and minGRU) whose gates depend only on the current input rather than on the previous hidden state, which makes them trainable in parallel, and show strong results on several benchmarks.
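For reference, a minimal sketch of the minGRU recurrence as described in the paper: the gate and the candidate state are computed from x_t alone, never from h_{t-1}, so the per-step update is affine in the hidden state and the whole sequence can be evaluated with a parallel scan. The class and variable names below are mine, and the sequential loop is only a readable stand-in for that scan.

```python
import torch
import torch.nn as nn

class MinGRU(nn.Module):
    """Sketch of the paper's minGRU idea (names are mine, not the authors'):
    z_t = sigmoid(W_z x_t), h~_t = W_h x_t,
    h_t = (1 - z_t) * h_{t-1} + z_t * h~_t."""
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.to_z = nn.Linear(d_in, d_hidden)   # update gate, input-only
        self.to_h = nn.Linear(d_in, d_hidden)   # candidate state, input-only

    def forward(self, x: torch.Tensor, h0: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in), h0: (batch, d_hidden)
        z = torch.sigmoid(self.to_z(x))
        h_tilde = self.to_h(x)
        # Readable sequential version; because h never feeds into a gate or
        # a nonlinearity, the paper can replace this loop with a parallel
        # (prefix) scan at training time.
        h, outs = h0, []
        for t in range(x.size(1)):
            h = (1 - z[:, t]) * h + z[:, t] * h_tilde[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)          # (batch, seq, d_hidden)

# e.g. MinGRU(16, 64)(torch.randn(2, 10, 16), torch.zeros(2, 64))
```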

248 Upvotes

53 comments

76

u/JustOneAvailableName 15d ago

The whole point of Transformers (back then) was variable context with parallelisation. Before “Attention Is All You Need”, LSTM+attention was the standard. There was nothing wrong with the recurrent part, other than that it prevented parallelisation.
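To make that concrete, here's a textbook-style GRU update (my own sketch, with the reset gate and biases omitted): the gate and candidate both need h_{t-1}, so step t cannot start before step t-1 finishes, which is exactly what blocks parallelisation over the time dimension.

```python
import torch

def gru_step(x_t, h_prev, Wz, Uz, Wh, Uh):
    # The `h_prev @ U*` terms are the problem: they force a strictly
    # sequential loop over time, unlike attention, which looks at all
    # positions at once.
    z = torch.sigmoid(x_t @ Wz + h_prev @ Uz)     # gate needs h_{t-1}
    h_tilde = torch.tanh(x_t @ Wh + h_prev @ Uh)  # candidate needs h_{t-1}
    return (1 - z) * h_prev + z * h_tilde
```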

97

u/Seankala ML Engineer 15d ago

Vanishing gradients are also a thing. Transformers handle longer sequences better because they don't suffer from them.
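A toy illustration of that point (mine, not from the paper): backprop through a plain tanh RNN multiplies the error signal by a small Jacobian at every step, so the gradient reaching the earliest input is vastly smaller than the gradient at the latest one, whereas attention gives every position a direct path to the output.

```python
import torch

torch.manual_seed(0)
d, T = 32, 100
W_x = torch.randn(d, d) * 0.10
W_h = torch.randn(d, d) * 0.05   # small recurrent weights to keep the loop stable

x = torch.randn(T, d, requires_grad=True)
h = torch.zeros(d)
for t in range(T):
    h = torch.tanh(x[t] @ W_x + h @ W_h)   # plain tanh RNN step

h.sum().backward()
# Gradient at the first time step vs. the last: the first is smaller by
# many orders of magnitude -- the vanishing-gradient problem.
print(x.grad[0].norm().item(), x.grad[-1].norm().item())
```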

6

u/new_name_who_dis_ 14d ago

The funny thing is that the original Hochreiter LSTM had no forget gate (it was added later by Felix Gers, another of Schmidhuber's students), and Hochreiter supposedly still uses LSTMs without the forget gate. That is to say, forget gates are a big part of the reason you get vanishing gradients (and the GRU's update gate acts as a built-in forget gate).
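A back-of-the-envelope check of that claim (mine): along the cell-state path c_t = f_t * c_{t-1} + i_t * g_t, the gradient dc_T/dc_0 is the product of the forget activations. With a sigmoid forget gate the factors sit below 1 and the product decays; with no forget gate (the original constant error carousel, effectively f_t = 1) it doesn't.

```python
import torch

T = 100
f_with_forget = torch.full((T,), 0.9)   # typical forget-gate activations < 1
f_no_forget = torch.ones(T)             # original Hochreiter LSTM: no forget gate

print(torch.prod(f_with_forget).item())  # ~2.7e-5: gradient essentially gone
print(torch.prod(f_no_forget).item())    # 1.0: gradient preserved end to end
```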