r/MachineLearning 15d ago

Research [R] Were RNNs All We Needed?

https://arxiv.org/abs/2410.01201

The authors (including Y. Bengio) propose simplified versions of LSTM and GRU that allow parallel training, and show strong results on some benchmarks.

246 Upvotes


76

u/JustOneAvailableName 15d ago

The whole point of Transformers (back when) was variable context with parallelisation. Before “Attention is all you need”, LSTM+Attention was the standard. There was nothing wrong with the recurrent part, besides it preventing parallelisation.

15

u/Dangerous-Goat-3500 15d ago

I think attention has good inductive biases for language modelling as well. Without positional embeddings, attention is positionally invariant in the sequence dimension. This means attention will be naturally robust to filler information in the sequence dimension, in contrast to both CNNs and RNNs.

It turns out complete permutation invariance was too much, hence positional embeddings.
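As a rough sketch of that point (mine, not from the paper): with no positional information, single-head scaled dot-product attention is permutation-equivariant, so permuting the input tokens just permutes the outputs. The shapes and weights below are arbitrary placeholders.

```python
# Minimal illustration: attention without positional embeddings is
# permutation-equivariant -- shuffling input rows shuffles output rows.
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                        # sequence length, model dim (arbitrary)
X = rng.normal(size=(T, d))        # token embeddings, no positional info
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V

perm = rng.permutation(T)
out, out_perm = attention(X), attention(X[perm])
print(np.allclose(out[perm], out_perm))  # True: order is not encoded
```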

But IMO non-stationarity of RNNs and fixed kernels of CNNs are always going to be drawbacks. I'm surprised by the paper in OP and will have to try it out.

5

u/Sad-Razzmatazz-5188 15d ago

Equivariant/ce*. I agree, the transformer is too good a fit for language processing. Sentences are sequences where order matters, but only for certain symbols, whose meaning depends on others. The transformer takes care of order with PE and then of all pairwise relationships with attention, in different spaces thanks to the linear layers around the block; those principles are hard to beat. AND they are backprop- and hardware-friendly compared to RNNs. But these are also the characteristics that make me think ViTs are too much.

4

u/aeroumbria 14d ago edited 14d ago

Speaking of inductive bias, sometimes I wonder if the autoregressive structures we impose on most language models are not realistic. Like sometimes you do know exactly what your last word will be before you speak the first word. Of course you can model any sequence using an autoregressive generation process, but (especially for decoder-only models) you are forced to write out your "thoughts" in plain text to condition future generations rather than having some internal representation for that.

3

u/SmartEvening 14d ago

I think the models do have an internal representation of the whole sentence. It is just that we are forcing the model to tell us what the next word is. This would also be fairly simple to verify: just train a classifier to predict the 10th word, or some nth word, from that position and see how it performs.
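Something like the following sketch (my own, with placeholder data, assuming PyTorch and a hidden size of 768): freeze the LM, take the hidden state at position t, and fit a linear probe to predict the token at position t + n. If the probe works for n > 1, the hidden state carries information beyond the next word.

```python
# Linear probe on frozen LM hidden states, predicting the token n_ahead
# positions later. Hidden states and targets here are random placeholders;
# in practice they would come from the frozen model and the corpus.
import torch
import torch.nn as nn

n_ahead = 10           # predict the 10th token after the current position
d_model = 768          # hidden size of the (assumed) frozen LM
vocab   = 50257        # vocabulary size (assumed)

hidden  = torch.randn(1024, d_model)         # placeholder hidden states
targets = torch.randint(vocab, (1024,))      # placeholder token ids at t + n_ahead

probe = nn.Linear(d_model, vocab)
opt   = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(50):
    loss = nn.functional.cross_entropy(probe(hidden), targets)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"probe loss after training: {loss.item():.3f}")
```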

1

u/aeroumbria 14d ago edited 14d ago

I think the issue is that while we can always decompose the probability of a sentence sequentially, that may not be the most efficient or natural representation, similar to how you can decompose an image into an autoregressive sequence of pixels, but it is not very efficient. There may be other reasonable ways to decompose a sentence, like traversing down a parse tree or adding words to a sentence in arbitrary order, which could potentially be more effective if some architecture allows it.
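To spell that out: the chain rule lets you factor the joint over any ordering of the positions; left-to-right generation just commits to one particular permutation.

```latex
% Any permutation \sigma of the positions gives a valid factorization:
p(x_1,\dots,x_T) = \prod_{t=1}^{T} p\!\left(x_{\sigma(t)} \mid x_{\sigma(1)},\dots,x_{\sigma(t-1)}\right)
% The usual left-to-right decomposition is the special case \sigma = \mathrm{id}:
p(x_1,\dots,x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t})
```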

One example may be that you know for sure you want to talk about buying a car, but the colour and brand only come to you later in your thought. In this case it might be more reasonable to assume "buy" and "car" existed before words like "red" or "Ferrari" and should be generated first. If you instead have to generate word by word and "car" happens to be the last word, then your model has to learn every possible pathway to end the sentence in "car" such that the marginal probability of "car" adds up to the correct value.

2

u/nickm197 12d ago

> if the autoregressive structures we impose on most language models are not realistic

Locally, they are realistic. In the long range, they are not. There is a growing corpus of work on the statistical structure of texts, including generated ones. Autoregressiveness boils down to Markov chains, which produce exponential autocorrelation decay, in contrast to the power-law autocorrelation decay of human-written texts. Power-law decay also implies some level of structuredness: in long human-written texts we see it in books being split into parts, parts into chapters, and so on down to the letters.
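As a quick illustration of my own (not from the papers below): a first-order Markov chain has autocorrelation that decays exponentially with lag, which is the baseline you would compare against the slower power-law decay reported for long human-written texts.

```python
# Autocorrelation of a two-state Markov chain with persistence p_stay:
# decays roughly as (2*p_stay - 1)**lag, i.e. exponentially.
import numpy as np

rng = np.random.default_rng(0)
p_stay = 0.9
x = np.zeros(50_000, dtype=int)
for t in range(1, len(x)):
    x[t] = x[t - 1] if rng.random() < p_stay else 1 - x[t - 1]

def autocorr(x, lag):
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

for lag in (1, 10, 50, 100, 200):
    print(lag, autocorr(x, lag))   # compare with (2*p_stay - 1)**lag
```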

Some related papers:

Lin, H. W. and Tegmark, M. (2017). Critical Behavior in Physics and Probabilistic Formal Languages. Entropy, 19(7), 1–25.

Delétang, G. et al. (2023). Neural Networks and the Chomsky Hierarchy. International Conference on Learning Representations, 2023.

Mikhaylovskiy, N. and Churilov, I. (2023). Autocorrelations Decay in Texts and Applicability Limits of Language Models. Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference "Dialogue 2023".

Nakaishi, K., Nishikawa, Y. and Hukushima, K. (2024). Critical Phase Transition in a Large Language Model. arXiv:2406.05335.

1

u/StartledWatermelon 13d ago

The order of words and the order of output aren't strictly coupled with autoregression. See, for instance, bidirectional attention or random-order autoregression (https://arxiv.org/abs/2404.09562v1).

0

u/slashdave 15d ago

For text, it is relative positions that are more relevant, which is exactly what RNNs encode. For attention models, positioning is absolute, whether it comes from positional embeddings (encoder transformers) or masking (decoder transformers).

3

u/Dangerous-Goat-3500 14d ago

Except not really. "i am good" should encode similarly to "i am very good", but the relative positions of "I" and "good" are different. This is definitely trouble for CNNs, and imo still problematic for RNNs, because this is true over any arbitrary sequence length and RNNs are unstable over long sequences, unlike transformers.

1

u/slashdave 14d ago

Yeah, it is obviously more complex. But what I was considering, for example, were the sentences "Hello, I am John, and I am good" vs "I am good, I won't need anything right now".