Mamba Models: a possible replacement for Transformers?

All the AI chatbots right now, like Bard or ChatGPT, are based on an architecture called the Transformer. It works so well thanks to a special mechanism called self-attention, which lets the model look back over a sequence of inputs and perform some pretty insane text completion based on what you feed it.
But even counting and basic arithmetic are a big problem for these models.

AI chatbot developed by Krutrim AI, India

They also hallucinate on larger contexts: when you provide really long inputs like PDF documents (research papers), the exact details get missed and these chatbots give you generalized, oversimplified answers.

To address these problems, researchers built Mamba (a selective structured state space sequence model). Roughly speaking, it combines the strengths of LSTMs and Transformers.

The current SOTA models like ChatGPT get dramatically more expensive to train and run the bigger they get. Zuck casually mentioned buying 600,000 H100 GPUs to train LLMs by the end of 2024. It goes without saying that these models are not well suited to scaling up even further.

This is largely because of the attention mechanism itself: every token has to relate to every other token in the whole context, so compute and memory grow quadratically and longer texts become much harder to work with.
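
To make that quadratic cost concrete, here is a minimal NumPy-only sketch of single-head self-attention (an illustrative toy, not any production implementation). The thing to notice is the intermediate score matrix of shape (n, n), which grows with the square of the sequence length.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention over a sequence x of shape (n, d)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv              # each (n, d)
    scores = q @ k.T / np.sqrt(k.shape[-1])       # (n, n): every token scores every other token
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                            # (n, d)

n, d = 1_024, 64
x = np.random.randn(n, d)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)        # (1024, 64)
# The score matrix alone holds n * n = 1,048,576 entries for just 1,024 tokens.
```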

So some big-brain researchers from CMU and Princeton dug up an old architecture called state space models and refined it into the S4 model, short for structured state space sequence model, which they then built on to create Mamba.

First, it solves the scaling problem that Transformers have: computation grows only linearly with sequence length rather than quadratically. Second, it drops the attention mechanism entirely and can still recall the details you provided earlier in the context. From a more technical perspective, it is the first alternative architecture to match or surpass the strongest Transformer recipe on benchmarks.

Performance of Mamba vs. Transformers

For example, if you go to a party where everyone needs to meet everyone else and track all the relations between them, that's how Transformers work.
But think of a party where everyone just knows the host, so you only have to meet the host, who knows everyone and how they relate to each other. That's how Mamba works.

Mamba saves you more and more time as the number of people at the party grows. Fundamentally, the S4 model is a completely different architecture from the Transformer.

Mamba is more similar to LSTMs and other recurrent models. An LSTM needs the hidden state from the previous step together with the current input to generate the next prediction, so every step has to wait for the previous one to finish before it can proceed, which makes it extremely slow.
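
As a rough illustration of that dependency pattern (a toy nonlinear recurrence, not the full gated LSTM), the loop below cannot be parallelized across time because step t needs the hidden state produced at step t-1:

```python
import numpy as np

def recurrent_forward(x, W, U):
    """Toy nonlinear recurrence: h_t = tanh(W x_t + U h_{t-1})."""
    n, d = x.shape
    h = np.zeros(d)
    outputs = []
    for t in range(n):          # strictly sequential: step t waits on step t-1
        h = np.tanh(W @ x[t] + U @ h)
        outputs.append(h)
    return np.stack(outputs)

n, d = 1_024, 64
x = np.random.randn(n, d)
W, U = np.random.randn(d, d), np.random.randn(d, d)
print(recurrent_forward(x, W, U).shape)   # (1024, 64)
```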

In the S4 models Mamba uses, there is no nonlinearity between hidden states: the recurrence is linear, so each hidden state can be expressed directly in terms of the inputs. There is no waiting around for the result of the previous step, and all the matrix multiplications for the sequence can be done at once, which makes the computation extremely fast.
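
Here is a minimal sketch of why that works, using a plain discrete linear state space layer (not Mamba's actual selective implementation). Because the recurrence h_t = A h_{t-1} + B x_t is linear, each h_t unrolls into a direct weighted sum of the inputs, so the same outputs can be computed without stepping through time; real implementations do this with a convolution or a parallel scan.

```python
import numpy as np

def ssm_recurrent(x, A, B, C):
    """Sequential form: h_t = A h_{t-1} + B x_t, y_t = C h_t."""
    h = np.zeros(A.shape[0])
    ys = []
    for xt in x:
        h = A @ h + B @ xt
        ys.append(C @ h)
    return np.stack(ys)

def ssm_unrolled(x, A, B, C):
    """Same outputs, with each y_t written directly as sum_k (C A^(t-k) B) x_k."""
    n = len(x)
    kernels, Apow = [], np.eye(A.shape[0])
    for _ in range(n):                 # precompute kernel K[j] = C A^j B once
        kernels.append(C @ Apow @ B)
        Apow = Apow @ A
    # y is then a causal convolution of the kernel with the input sequence
    # (done naively here; real implementations use FFTs or a parallel scan).
    return np.stack([sum(kernels[t - k] @ x[k] for k in range(t + 1)) for t in range(n)])

n, d_in, d_state = 32, 4, 8
x = np.random.randn(n, d_in)
A = 0.1 * np.random.randn(d_state, d_state)   # scaled down for numerical stability
B = np.random.randn(d_state, d_in)
C = np.random.randn(d_in, d_state)
print(np.allclose(ssm_recurrent(x, A, B, C), ssm_unrolled(x, A, B, C)))  # True
```

Mamba's "selective" twist is to make these matrices depend on the input, which is what lets it choose which details to keep while still using this fast linear recurrence.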

This improves on the quadratic scaling that Transformers have:

$$O(n^2) \rightarrow O(n)$$
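
As a back-of-the-envelope comparison (illustrative interaction counts only, not a real FLOP model), here is how the two costs grow with context length n:

```python
for n in (1_024, 32_768, 1_048_576):
    attention_pairs = n * n   # every token attends to every other token
    ssm_steps = n             # one linear state update per token
    print(f"n = {n:>9,}   attention ~ {attention_pairs:>17,}   ssm ~ {ssm_steps:>11,}")
```

At a million tokens, attention has to handle roughly a trillion token pairs, while a linear-time model does about a million state updates.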

Mamba outperforms across all categories

We'll cover related models like Vision Mamba in another blog post, so stay tuned.

Refer to this GitHub repository for the latest updates on Mamba papers.