Meta challenges transformer architecture with Megalodon LLM


"A new machine learning (ML) model proposed by researchers at Meta and the University of Southern California aims to solve some of the fundamental challenges of the Transformer, the deep learning architecture that gave rise to the age of large language models (LLMs).

The new model, called Megalodon, allows language models to extend their context window to millions of tokens without requiring huge amounts of memory. Experiments show that Megalodon outperforms Transformer models of equal size in processing large texts. Megalodon is the latest of a series of new models that are being proposed as the successor to the Transformer.

Long context windows

A model’s “context window” is the number of tokens it can process at one time. Larger context windows allow LLMs to hold longer conversations, process longer documents, and extend their in-context learning abilities. However, extending the context window of Transformers comes at a steep cost.

The Transformer has “quadratic complexity”: every time you double the length of the input, the memory and computation required to process it quadruple. This quadratic relationship is due to the self-attention mechanism, which compares every element in the input sequence with every other element.
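A minimal NumPy sketch makes the quadratic cost concrete: naive self-attention builds a score matrix with one entry per pair of tokens, so doubling the sequence length quadruples the matrix. (The function name and dimensions below are illustrative, not taken from the paper.)

```python
import numpy as np

def self_attention_scores(x: np.ndarray) -> np.ndarray:
    """Naive self-attention weights for x of shape (seq_len, d_model).

    The score matrix compares every token with every other token,
    so it has seq_len * seq_len entries: quadratic in sequence length.
    """
    scores = x @ x.T                                   # (seq_len, seq_len)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)           # row-wise softmax

for n in (1024, 2048):
    weights = self_attention_scores(np.zeros((n, 64)))
    # doubling n quadruples the number of entries in the score matrix
    print(n, weights.shape)
```

Going from 1,024 to 2,048 tokens grows the score matrix from about one million to about four million entries, which is why context windows in the millions of tokens are out of reach for full attention.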

Meta’s Megalodon builds on Moving Average Equipped Gated Attention (MEGA), a technique that was first presented in 2022. MEGA makes modifications to the attention mechanism in a way that significantly reduces the complexity of the model, enabling the LLM to process longer inputs without exploding the memory and compute requirements. MEGA also uses exponential moving average (EMA), a tried-and-tested technique that helps models place the right amount of emphasis on local and long-distance relationships between tokens. This can help the models maintain their coherence as more information is fed into the context window.
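As a rough sketch of the EMA idea (not MEGA’s exact formulation, which learns multi-dimensional damped decay rates rather than a single constant), each position can blend its own embedding with a decaying summary of everything before it. The smoothing factor `alpha` below is a hypothetical fixed value:

```python
import numpy as np

def ema_over_tokens(x: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Exponential moving average along the sequence axis.

    x has shape (seq_len, d_model). Each output position mixes the
    current token with a geometrically decaying history, emphasising
    nearby tokens while retaining a trace of long-distance context.
    alpha is an illustrative constant; MEGA learns its decay rates.
    """
    out = np.empty_like(x)
    out[0] = x[0]
    for t in range(1, len(x)):
        out[t] = alpha * x[t] + (1 - alpha) * out[t - 1]
    return out
```

Because the recurrence only carries one running summary forward, its cost is linear in sequence length, which is what lets EMA-style components complement attention cheaply.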


Megalodon further improves MEGA with a few key modifications to the architecture that bring its performance on par with the full-attention mechanism used in the original Transformer model. Megalodon also uses “chunk-wise attention,” which divides the input sequence into fixed-size blocks to reduce the complexity of the model from quadratic to linear. Chunk-wise attention also makes it possible to add an extra layer of parallelism that speeds up model training.
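The chunking idea can be sketched as follows: with a fixed chunk size c, each of the seq_len / c chunks builds only a c-by-c score matrix, so the total cost is seq_len * c, linear in sequence length. This is a simplification, since Megalodon also relies on its EMA-based components to carry information across chunk boundaries; the chunk size here is arbitrary.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def chunked_attention(x: np.ndarray, chunk: int = 256) -> np.ndarray:
    """Attend only within fixed-size chunks of the sequence.

    Each chunk of c tokens needs a c x c score matrix, so total cost
    is (seq_len / c) * c**2 = seq_len * c: linear for a fixed c.
    Chunks are independent, so they can also be processed in parallel.
    """
    out = np.empty_like(x)
    for start in range(0, len(x), chunk):
        blk = x[start:start + chunk]
        weights = softmax(blk @ blk.T)      # (c, c) instead of (n, n)
        out[start:start + chunk] = weights @ blk
    return out
```

Because chunks never look at each other, editing a token in one chunk leaves every other chunk’s output unchanged, which is exactly the independence that enables the extra layer of training parallelism the article mentions.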

The researchers trained a 7-billion-parameter version of Megalodon on 2 trillion tokens and compared it with Llama-2-7B, 13B and other models. Their experiments show that Megalodon-7B “significantly outperforms the state-of-the-art variant of Transformer used to train LLAMA2-7B on both training perplexity and across downstream benchmarks.” On some tasks, Megalodon-7B matches the performance of Llama-2-13B.

With a 4,000-token context window, Megalodon is slightly slower than Llama-2, but when the context length is expanded to 32,000 tokens, Megalodon outperforms Llama-2 significantly due to its computational efficiency. Furthermore, the researchers claim that experimental results on long-context modeling suggest Megalodon can model sequences of unlimited length.

The researchers have also obtained promising results in small- and medium-scale experiments on other data modalities and plan to adapt Megalodon to multi-modal settings. They have released the code for Megalodon on GitHub under the permissive MIT license, which allows it to be adapted and used for commercial purposes.

Transformers still dominate

Scientists have been searching for alternative architectures that can replace the Transformer. Notable examples include the Mamba architecture, which now has a commercial deployment in AI21 Labs’ Jamba, and liquid neural networks, a general deep learning architecture for processing any kind of sequential data, developed by researchers at MIT.

However, for the time being, Transformers continue to remain the dominant architecture for language models. While Meta is exploring architectures such as Megalodon, it continues to work on improving its Transformer models, and it just released Llama-3, the latest version of its open-source LLMs.

Another challenge for Transformer rivals is the required hardware and software tools. There is a large ecosystem of libraries and tools for training, fine-tuning, and customizing Transformer models for different applications and hardware devices. At the same time, researchers have developed low-level software code that optimizes the performance of Transformer LLMs on memory-constrained devices. The alternatives have yet to catch up with these developments.

Meanwhile, other researchers are working on modifying the Transformer architecture to reduce its memory and compute requirements. For example, Infini-attention, a recent paper by researchers at Google, aims to give Transformer models unlimited context windows without increasing the memory and compute complexity. Current frontier models support inputs of hundreds of thousands of tokens.

However, AI research is progressing rapidly. When the Transformer paper came out in 2017, few thought it would have such an impact. One of these models might turn out to beat the Transformer at its own game."

AiUTOMATING PEOPLE, ABN ASIA was founded by people with deep roots in academia, with work experience in the US, Holland, Hungary, Japan, South Korea, Singapore, and Vietnam. ABN Asia is where academy and technology meet opportunity. With our cutting-edge solutions and competent software development services, we're helping businesses level up and take on the global scene. Our commitment: Faster. Better. More reliable. In most cases: Cheaper as well.

Feel free to reach out to us whenever you require IT services, digital consulting, off-the-shelf software solutions, or if you'd like to send us requests for proposals (RFPs). We're ready to assist you with all your technology needs.