Did you know that Llama 2 or 3 are probably among the best choices if you need a large context window with an open-source model?

"Did you know that LLama 2 or 3 are probably among the best choices if you need a large context window with an open-source model? In fact, any model using the RoPE positional embedding is a good bet!

8,192 tokens is about 6,000 words. Not bad, but it limits the possible applications. The typical Transformer architecture is composed of embeddings that encode the text input, multiple transformer blocks, and a prediction head specific to the learning task the LLM is used for. To encode the text, we use a token embedding matrix T, which has the size of the token vocabulary, and a positional embedding P, which encodes the position of each token in the input sequence. The size of that positional embedding defines the context size. The embedding can be learned, or it can be a simple sinusoidal function of the position index. Typically the two are added together, T + P, so that the same word is encoded differently at positions i and j.
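To make that concrete, here is a minimal sketch in PyTorch of a fixed sinusoidal positional embedding P added to a token embedding T. The dimensions and variable names are purely illustrative and not taken from any particular model.

import torch
import torch.nn as nn

vocab_size, context_size, d_model = 1000, 8192, 512     # illustrative sizes only

# Token embedding T: one learned vector per vocabulary entry
token_emb = nn.Embedding(vocab_size, d_model)

# Positional embedding P: a fixed sinusoidal table with one row per position
pos = torch.arange(context_size).unsqueeze(1)            # (context, 1)
dim = torch.arange(0, d_model, 2)                        # (d_model/2,)
angles = pos / (10000 ** (dim / d_model))                # (context, d_model/2)
P = torch.zeros(context_size, d_model)
P[:, 0::2] = torch.sin(angles)
P[:, 1::2] = torch.cos(angles)

# T + P: the same token id gets a different vector at positions i and j
tokens = torch.randint(0, vocab_size, (1, 16))           # a dummy input sequence
x = token_emb(tokens) + P[: tokens.shape[1]]             # (1, 16, d_model)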

The great thing about Llama is that it uses Rotary Positional Embeddings (RoPE) instead of the typical sinusoidal encoding. Each attention layer is modified with that embedding, and it ensures that the attention computed between input tokens depends only on the distance between those tokens. If a token T1 is at position i and a token T2 is at position j, the attention A(T1, T2) = f(j - i) is a function of j - i. The attention does not depend on the tokens' absolute positions, only on their relative positions.
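To see that relative-position property in action, here is a small self-contained RoPE sketch in plain PyTorch. This is not the Llama implementation; the rope function and the toy vectors below are illustrative assumptions. Rotating queries and keys by an angle proportional to their position makes the score between two tokens invariant to a common shift of their positions.

import torch

def rope(x, positions, theta: float = 10000.0):
    # Apply rotary embeddings to x of shape (seq, dim) at the given positions.
    # Each pair of channels is rotated by an angle proportional to the position.
    dim = x.shape[-1]
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))   # (dim/2,)
    angles = positions.float().unsqueeze(-1) * freqs                    # (seq, dim/2)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# The score between a query at position i and a key at position j depends only on j - i:
# shifting both positions by the same offset (here +100) leaves the score unchanged.
q, k = torch.randn(64), torch.randn(64)
score_a = rope(q[None], torch.tensor([3])) @ rope(k[None], torch.tensor([10])).T
score_b = rope(q[None], torch.tensor([103])) @ rope(k[None], torch.tensor([110])).T
print(torch.allclose(score_a, score_b, atol=1e-4))  # True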

The technique they use at Meta to extend the context window is to interpolate at non-integer positions. Basically, if the original window size is L, you can extend it to L' (with L' > L) by rescaling the integer positions

i' = i * L / L'

As an example, if you wanted to feed a text input of 16,384 tokens (so 4x the 4,096-token window of Llama 2) into Llama 2, you would just need to divide every integer position by 4: i' = i / 4. To be clear, if you look at the implementation of Llama 2 available on GitHub (line 101 in model.py today https://lnkd.in/exqcTkDD), you would just need to replace the following line of code

t = torch.arange(end, device=freqs.device)

with

t = torch.arange(end, device=freqs.device) / 4
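For context, this is roughly what the surrounding function looks like, following the precompute_freqs_cis function in the Llama 2 reference implementation, with the division factored out into a scale argument. That argument name and its default value are assumptions added here for illustration, not part of Meta's code.

import torch

def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0, scale: float = 1.0):
    # Per-channel rotation frequencies, as in the Llama 2 reference code
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    # Position indices; dividing by scale interpolates at non-integer positions,
    # squeezing an extended window back into the range the model was trained on
    t = torch.arange(end, device=freqs.device) / scale
    freqs = torch.outer(t, freqs).float()
    # Complex exponentials used to rotate the queries and keys in each attention layer
    return torch.polar(torch.ones_like(freqs), freqs)

# A 4x extension of a 4,096-token window: positions 0..16383 map back into 0..4095.75
freqs_cis = precompute_freqs_cis(dim=128, end=16384, scale=4.0)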

How simple is that? Because the model was not trained with that rescaled position embedding, you would need to fine-tune it a bit to adapt it to the new context window. Since Llama 2 will most likely be fine-tuned on private data anyway, being able to dynamically adapt the context window to our needs as we fine-tune is the icing on the cake.

You can look at the method here: https://lnkd.in/dCYuwdHz. They were able to extend Llama's context window by 16 times while keeping performance at the same level!"

Author

ABN ASIA was founded by people with deep roots in academia and work experience in the US, Holland, Hungary, Japan, South Korea, Singapore, and Vietnam. ABN Asia is where academia and technology meet opportunity. With our cutting-edge solutions and competent software development services, we're helping businesses level up and take on the global scene. Our commitment: Faster. Better. More reliable. In most cases: Cheaper as well.

Feel free to reach out to us whenever you require IT services, digital consulting, off-the-shelf software solutions, or if you'd like to send us requests for proposals (RFPs). You can contact us at contact@abnasia.org. We're ready to assist you with all your technology needs.

ABNAsia.org

© ABN ASIA