What makes Llama 3 good


"The release of Llama 3 was very sparse in terms of technical/research details. I tried to extract the hidden details from the release blog post to answer common questions. 👀

Why is Llama 3 better than Llama 2?

  • Scaled up pretraining 7x, from 2T tokens to 15T, on sequences of 8,192 tokens.

  • Improved data quality with new filtering, including heuristic filters, NSFW filters, semantic deduplication (👀) approaches, and text classifiers to predict data quality.

  • Used Llama 2 to generate synthetic training data to train text-quality classifiers.

  • Extensive experiments to find the best data mix from different sources
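To make the filtering bullets concrete, here is a toy sketch of a heuristic filter followed by similarity-based deduplication. Everything in it is my assumption, not Meta's pipeline: the thresholds, the function names, and especially `embed`, which is a bag-of-words stand-in for the learned embeddings a real semantic deduplication step would use.

```python
import math
import re
from collections import Counter


def passes_heuristics(text, min_words=5, min_alpha_ratio=0.6):
    """Toy heuristic filter: drop very short or mostly non-alphabetic text."""
    if len(text.split()) < min_words:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / max(len(text), 1) >= min_alpha_ratio


def embed(text):
    """Stand-in 'embedding': a bag-of-words count vector (not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_and_dedup(docs, sim_threshold=0.9):
    """Keep docs that pass the heuristics and are not near-copies of a kept doc."""
    kept, kept_vecs = [], []
    for doc in docs:
        if not passes_heuristics(doc):
            continue
        vec = embed(doc)
        if any(cosine(vec, other) >= sim_threshold for other in kept_vecs):
            continue
        kept.append(doc)
        kept_vecs.append(vec)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog!",  # near-duplicate
    "$$$ 1234 5678 !!! ??? 0000 ####",               # mostly non-alphabetic
    "Grouped query attention shares key and value heads across query heads.",
]
cleaned = filter_and_dedup(docs)
print(len(cleaned))  # 2 documents survive
```

The pairwise comparison is quadratic in the number of kept documents, so at pretraining scale this would presumably be replaced by clustering or approximate nearest-neighbor search.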

What changes were made to Llama 3?

  • Used an attention mask to ensure self-attention does not cross document boundaries. This wasn't done for Llama 2 (or OpenAI's GPT-3).

  • Increased input sequence length from 4096 to 8192

  • New tokenizer with a 128k vocabulary, which needs 15% fewer tokens than Llama 2 to encode the same text. It should also improve multilingual ability for continued pretraining or future versions. (That is why 7B became 8B → bigger embedding layer)

  • All model sizes use grouped query attention (GQA)
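The document-boundary masking mentioned above can be sketched in a few lines. This is a minimal illustration, assuming several documents are packed into one training sequence and each token carries a document index; `document_causal_mask` and `doc_ids` are hypothetical names of mine, not anything from Meta's code.

```python
def document_causal_mask(doc_ids):
    """Return mask[i][j] == True when position i may attend to position j.

    doc_ids[i] is the index of the document that token i came from inside
    one packed training sequence. A token may only look at earlier tokens
    (causal) that belong to the same document.
    """
    n = len(doc_ids)
    return [
        [j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
        for i in range(n)
    ]


# Two documents packed into one sequence: 3 tokens, then 2 tokens.
doc_ids = [0, 0, 0, 1, 1]
mask = document_causal_mask(doc_ids)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x....
# xx...
# xxx..
# ...x.
# ...xx
```

Without the `doc_ids[i] == doc_ids[j]` check this is the plain causal mask, where the tokens of the second document would also attend to the unrelated first document.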

How was Llama 3 Instruct trained?

  • Used a combination of supervised fine-tuning (SFT), rejection sampling (RS), proximal policy optimization (PPO), and direct preference optimization (DPO)

  • Training on preference rankings teaches the model to select the right answer on reasoning tasks

  • Fine-tuning data includes public datasets as well as over 10M human-annotated examples. It is unclear how these are split between the Reward Model and the Instruct Model

  • High-quality prompts and preference rankings (good Reward Model) are key

  • My Guess: 1️⃣ SFT → 2️⃣ Rejection Sampling → ( 3️⃣ DPO → 4️⃣ PPO) where 3️⃣ & 4️⃣ are repeated/iterated

  • My Guess: A good Reward Model was the key for Llama 3 Instruct to become that good.
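For readers who have not seen these steps written down, here is a rough sketch of two of them in their generic textbook form: best-of-n rejection sampling against a reward model, and the DPO loss for a single preference pair. Meta has not published its recipe, so none of this is their implementation. `toy_generate`, `toy_reward`, and the value of `beta` are hypothetical stand-ins of mine.

```python
import math


def rejection_sample(prompt, generate, reward, n=4):
    """Best-of-n: draw n candidates and keep the one the reward model scores highest."""
    candidates = [generate(prompt, i) for i in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    All four arguments are sequence log-probabilities: the chosen and the
    rejected answer under the policy being trained and under a frozen
    reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) rewritten as log(1 + exp(-beta * margin))
    return math.log1p(math.exp(-beta * margin))


# Hypothetical stand-ins for a generator and a reward model.
POOL = ["4", "2 + 2 = 4", "5", "I don't know"]

def toy_generate(prompt, i):
    return POOL[i % len(POOL)]

def toy_reward(prompt, completion):
    # Rewards a correct final answer, with a bonus for showing the working.
    return float(completion.endswith("4")) + 0.5 * ("=" in completion)


best = rejection_sample("What is 2 + 2?", toy_generate, toy_reward)
print(best)

# Policy identical to the reference: margin is 0, so the loss equals log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
# Policy prefers the chosen answer more than the reference does: lower loss.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

Both pieces lean on the quality of the preference signal: rejection sampling keeps whatever the reward model ranks first, and DPO only pushes the policy toward the answers labeled as chosen.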


Others:

  • Over 5% of the pretraining dataset is high-quality non-English data covering over 30 languages.

  • Even after 15T tokens, the model performance improved log-linearly 🤯

  • Human evaluation was performed with 1,800 prompts covering 12 different topics

Let's hope Meta will release a paper, along with the Reward Models, in the near future.🤞🏻"

