Posts in Nlp (12 found)
Theia 1 week ago

Why do LLMs freak out over the seahorse emoji?

This is an edited and expanded version of a Twitter post, originally in response to @arm1st1ce, that can be found here: https://x.com/voooooogel/status/1964465679647887838 Is there a seahorse emoji? Let's ask GPT-5 Instant: Wtf? Let's ask Claude Sonnet 4.5 instead: What's going on here? Maybe Gemini 2.5 Pro handles it better? OK, something is going on here. Let's find out why.

11 views
Simon Willison 1 month ago

GPT-5 Thinking in ChatGPT (aka Research Goblin) is shockingly good at search

"Don't use chatbots as search engines" was great advice for several years... until it wasn't. I wrote about how good OpenAI's o3 was at using its Bing-backed search tool back in April . GPT-5 feels even better. I've started calling it my Research Goblin . I can assign a task to it, no matter how trivial or complex, and it will do an often unreasonable amount of work to search the internet and figure out an answer. This is excellent for satisfying curiosity, and occasionally useful for more important endeavors as well. I always run my searches by selecting the "GPT-5 Thinking" model from the model picker - in my experience this leads to far more comprehensive (albeit much slower) results.

0 views
Simon Willison 2 months ago

GPT-5: Key characteristics, pricing and model card

I've had preview access to the new GPT-5 model family for the past two weeks (see related video and my disclosures ) and have been using GPT-5 as my daily-driver. It's my new favorite model. It's still an LLM - it's not a dramatic departure from what we've had before - but it rarely screws up and generally feels competent or occasionally impressive at the kinds of things I like to use models for. I've collected a lot of notes over the past two weeks, so I've decided to break them up into a series of posts . This first one will cover key characteristics of the models, how they are priced and what we can learn from the GPT-5 system card . Let's start with the fundamentals. GPT-5 in ChatGPT is a weird hybrid that switches between different models. Here's what the system card says about that (my highlights in bold): GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say “think hard about this” in the prompt). [...] Once usage limits are reached, a mini version of each model handles remaining queries. In the near future, we plan to integrate these capabilities into a single model. GPT-5 in the API is simpler: it's available as three models - regular , mini and nano - which can each be run at one of four reasoning levels: minimal (a new level not previously available for other OpenAI reasoning models), low, medium or high.
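For reference, here is roughly how selecting one of those reasoning levels looks when calling the API from the OpenAI Python SDK. This is a sketch assuming the Responses API shape and the model names described above, not an excerpt from the post:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Three API models (gpt-5, gpt-5-mini, gpt-5-nano), four reasoning levels each
response = client.responses.create(
    model="gpt-5-mini",
    reasoning={"effort": "minimal"},  # or "low", "medium", "high"
    input="Summarize the key differences between GPT-5 and GPT-5 mini.",
)
print(response.output_text)
```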

0 views
Sean Goedecke 3 months ago

Practical notes on getting LLMs to generate new ideas

Large language models struggle to generate new ideas. To AI skeptics, this seems trivially true, since they believe LLMs can only regurgitate content from their training data. To AI believers, this is a puzzle. If a human had the breadth of knowledge of an LLM, wouldn’t they be able to synthesize it and come up with ideas nobody else has had?

0 views
Sean Goedecke 3 months ago

Mecha-Hitler, Grok, and why it's so hard to give LLMs the right personality

Recently, xAI’s Grok model made some very strange comments. In a now-deleted post, it suggested Adolf Hitler as the right person to deal with “anti-white hate”. It also pointed out that “radical leftists spewing anti-white hate … often have Ashkenazi Jewish surnames”. Grok repeatedly referred to itself as “MechaHitler”, and seems happy to go by “Grokler”

0 views

Uncovering Tarot Biases with Simple NLP

Tarot is nice. It shows us archetypes and lets us create stories. But are these stories as diverse as we are? No, and here are some simple NLP approaches to learning why.

0 views
<antirez> 8 months ago

Reasoning models are just LLMs

It’s not new, but it’s accelerating. People who used to say that LLMs were a fundamentally flawed way to reach any useful reasoning and, in general, to develop any useful tool with some degree of generality, are starting to shuffle the deck in the hope of looking less wrong. They say: “the progress we are seeing is due to the fact that models like OpenAI o1 or DeepSeek R1 are not just LLMs”. This is false, and it is important to expose this mystification as soon as possible. First, DeepSeek R1 (I don’t want to talk about o1 / o3, since they are private systems we don’t have access to, but it’s very likely the same) is a pure decoder-only autoregressive model. It’s the same next-token prediction that was so strongly criticized. There isn’t, anywhere in the model, any explicit symbolic reasoning or representation. Moreover, R1 Zero has reasoning capabilities similar to R1 without requiring *any* supervised fine-tuning: just generating chains of thought and improving them with a reward function, using reinforcement learning, was enough to learn a stronger form of reasoning. Interestingly enough, part of these capabilities were easily distilled into smaller models via SFT, which brings me to the next point. The other fundamental observation is that the S1 paper shows that you need very few examples (as few as 1,000) for the model to start being able to build complex reasoning steps and solve nontrivial mathematical problems. S1 and R1 Zero hint that, in some way, the models already learned the representations needed to perform reasoning during pretraining, just from the unsupervised next-word prediction training target.

0 views
Ahead of AI 11 months ago

Understanding Multimodal LLMs

It was a wild two months. There have once again been many developments in AI research, with two Nobel Prizes awarded for AI-related work and several interesting research papers published. Among others, Meta AI released their latest Llama 3.2 models, which include open-weight versions for the 1B and 3B large language models and two multimodal models. In this article, I aim to explain how multimodal LLMs function. Additionally, I will review and summarize roughly a dozen other recent multimodal papers and models published in recent weeks (including Llama 3.2) to compare their approaches. An illustration of a multimodal LLM that can accept different input modalities (audio, text, images, and videos) and returns text as the output modality. But before we begin, I also have some exciting news to share on the personal front! My book, "Build A Large Language Model (From Scratch)", is now finally available on Amazon! Build a Large Language Model (From Scratch) now available on Amazon Writing this book was a tremendous effort, and I’m incredibly grateful for all the support and motivating feedback over the past two years—especially in these last couple of months, as so many kind readers have shared their feedback. Thank you all, and as an author, there is nothing more motivating than to hear that the book makes a difference in your careers! For those who have finished the book and are eager for more, stay tuned! I’ll be adding some bonus content to the GitHub repository in the coming months. P.S. If you have read the book, I'd really appreciate it if you could leave a brief review; it truly helps us authors! What are multimodal LLMs? As hinted at in the introduction, multimodal LLMs are large language models capable of processing multiple types of inputs, where each "modality" refers to a specific type of data—such as text (like in traditional LLMs), sound, images, videos, and more. For simplicity, we will primarily focus on the image modality alongside text inputs. A classic and intuitive application of multimodal LLMs is image captioning: you provide an input image, and the model generates a description of the image, as shown in the figure below. Example use of a multimodal LLM explaining a meme. Of course, there are many other use cases. For example, one of my favorites is extracting information from a PDF table and converting it into LaTeX or Markdown. There are two main approaches to building multimodal LLMs: Method A: Unified Embedding Decoder Architecture approach; Method B: Cross-modality Attention Architecture approach. (By the way, I don’t believe official terms for these techniques exist yet, but let me know if you’ve come across any. For instance, briefer descriptions may be "decoder-only" and "cross-attention-based" approaches.) The two main approaches to developing multimodal LLM architectures. As shown in the figure above, the Unified Embedding-Decoder Architecture utilizes a single decoder model, much like an unmodified LLM architecture such as GPT-2 or Llama 3.2. In this approach, images are converted into tokens with the same embedding size as the original text tokens, allowing the LLM to process both text and image input tokens together after concatenation. The Cross-Modality Attention Architecture employs a cross-attention mechanism to integrate image and text embeddings directly within the attention layer. In the following sections, we will explore how these methods work on a conceptual level.
Then, we will look at recent research papers on multimodal LLMs to see how they are applied in practice. Let’s begin with the unified embedding decoder architecture, illustrated again in the figure below. Illustration of the unified embedding decoder architecture, which is an unmodified decoder-style LLM (like GPT-2, Phi-3, Gemma, or Llama 3.2) that receives inputs consisting of image token and text token embeddings. In the unified embedding-decoder architecture, an image is converted into embedding vectors, similar to how input text is converted into embeddings in a standard text-only LLM. For a typical text-only LLM that processes text, the text input is usually tokenized (e.g., using Byte-Pair Encoding) and then passed through an embedding layer, as shown in the figure below. Illustration of the standard process for tokenizing text and converting it into token embedding vectors, which are subsequently passed to an LLM during training and inference. Analogous to the tokenization and embedding of text, image embeddings are generated using an image encoder module (instead of a tokenizer), as shown in the figure below. Illustration of the process for encoding an image into image patch embeddings. What happens inside the image encoder shown above? To process an image, we first divide it into smaller patches, much like breaking words into subwords during tokenization. These patches are then encoded by a pretrained vision transformer (ViT), as shown in the figure below. Illustration of a classic vision transformer (ViT) setup, similar to the model proposed in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020). Note that ViTs are often used for classification tasks, so I included the classification head in the figure above. However, in this case, we only need the image encoder part. The "linear projection" shown in the previous figure consists of a single linear layer (i.e., a fully connected layer). The purpose of this layer is to project the image patches, which are flattened into a vector, into an embedding size compatible with the transformer encoder. This linear projection is illustrated in the figure below. An image patch, flattened into a 256-dimensional vector, is up-projected to a 768-dimensional vector. Illustration of a linear projection layer that projects flattened image patches from a 256-dimensional into a 768-dimensional embedding space. For those who prefer seeing a code example: in PyTorch, we could implement the linear projection for the image patches as shown in the code sketch below. If you have read my Machine Learning Q and AI book by chance, you may know there are ways to replace linear layers with convolution operations that can be implemented to be mathematically equivalent. Here, this can be especially handy, as we can combine the creation of the patches and the projection into two lines of code (also shown in the sketch below). Now that we have briefly discussed the purpose of the image encoder (and the linear projection that is part of the encoder), let's return to the text tokenization analogy from earlier and look at text and image tokenization and embedding side by side, as depicted in the figure below. Image tokenization and embedding (left) and text tokenization and embedding (right) side by side. As you can see in the figure above, I included an additional projector module that follows the image encoder. This projector is usually just another linear projection layer that is similar to the one explained earlier.
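As a rough illustration of the linear projection just described (and the convolution shortcut mentioned above), here is a minimal PyTorch sketch; the single-channel 224×224 input and the 196-patch count are illustrative assumptions rather than the article's exact code:

```python
import torch
import torch.nn as nn

# Illustrative sizes matching the figure: a 16x16 patch flattened to 256 values,
# projected up to a 768-dimensional embedding. A single channel is assumed here;
# with 3 RGB channels, the flattened patch would have 3 * 16 * 16 = 768 values.
patch_size, embed_dim = 16, 768
patch_dim = patch_size * patch_size  # 256

# Variant 1: flatten each patch and project it with a single linear layer
linear_proj = nn.Linear(patch_dim, embed_dim)
flattened_patches = torch.randn(1, 196, patch_dim)        # (batch, num_patches, patch_dim)
embeddings = linear_proj(flattened_patches)               # (1, 196, 768)

# Variant 2: the mathematically equivalent convolution trick, which creates the
# patches and projects them in two lines
conv_proj = nn.Conv2d(1, embed_dim, kernel_size=patch_size, stride=patch_size)
image = torch.randn(1, 1, 224, 224)                       # (batch, channels, height, width)
embeddings = conv_proj(image).flatten(2).transpose(1, 2)  # (1, 196, 768)
```

In practice, many ViT implementations express their patch-embedding layer in exactly this second form, as a strided convolution.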
The purpose of this projector is to project the image encoder outputs into a dimension that matches the dimensions of the embedded text tokens, as illustrated in the figure below. (As we will see later, the projector is sometimes also called an adapter, adaptor, or connector.) Another side-by-side comparison between image tokenization and text tokenization, where the role of the projector is to match the text token embedding dimensions. Now that the image patch embeddings have the same embedding dimension as the text token embeddings, we can simply concatenate them as input to the LLM, as shown in the figure at the beginning of this section. Below is the same figure again for easier reference. After projecting the image patch tokens into the same dimension as the text token embeddings, we can simply concatenate them as input to a standard LLM. By the way, the image encoder we discussed in this section is usually a pretrained vision transformer. A popular choice is CLIP or OpenCLIP. However, there are also versions of Method A that operate directly on patches, such as Fuyu, which is shown in the figure below. Annotated figure of the Fuyu multimodal LLM that operates directly on the image patches without an image encoder. (Annotated figure from https://www.adept.ai/blog/fuyu-8b.) As illustrated in the figure above, Fuyu passes the input patches directly into a linear projection (or embedding layer) to learn its own image patch embeddings rather than relying on an additional pretrained image encoder like other models and methods do. This greatly simplifies the architecture and training setup. Now that we have discussed the unified embedding decoder architecture approach to building multimodal LLMs and understand the basic concept behind image encoding, let's talk about an alternative way of implementing multimodal LLMs via cross-attention, as summarized in the figure below. An illustration of the Cross-Modality Attention Architecture approach to building multimodal LLMs. In the Cross-Modality Attention Architecture method depicted in the figure above, we still use the same image encoder setup we discussed previously. However, instead of feeding the encoded patches into the LLM as input tokens, we connect them in the multi-head attention layer via a cross-attention mechanism. The idea is related to, and goes back to, the original transformer architecture from the 2017 Attention Is All You Need paper, highlighted in the figure below. High-level illustration of the cross-attention mechanism used in the original transformer architecture. (Annotated figure from the "Attention Is All You Need" paper: https://arxiv.org/abs/1706.03762.) Note that the original "Attention Is All You Need" transformer depicted in the figure above was originally developed for language translation. So, it consists of a text encoder (left part of the figure) that takes the sentence to be translated and generates the translation via a text decoder (right part of the figure). In the context of a multimodal LLM, the encoder is an image encoder instead of a text encoder, but the same idea applies. How does cross-attention work? Let's have a look at a conceptual drawing of what happens inside the regular self-attention mechanism. Outline of the regular self-attention mechanism. (This flow depicts one of the heads in a regular multi-head attention module.) In the figure above, x is the input, and W_q is a weight matrix used to generate the queries (Q). Similarly, K stands for keys, and V stands for values.
A represents the attention scores matrix, and Z are the inputs (x) transformed into the output context vectors. (If this seems confusing, you may find a comprehensive introduction in Chapter 3 of my Build a Large Language Model from Scratch book helpful; alternatively, you may also find my article, Understanding and Coding Self-Attention, Multi-Head Attention, Cross-Attention, and Causal-Attention in LLMs helpful here.) In cross-attention, in contrast to self-attention, we have two different input sources, as illustrated in the following figure. Illustration of cross attention, where there can be two different inputs x 1 and x 2 As illustrated in the previous two figures, in self-attention, we work with the same input sequence. In cross-attention, we mix or combine two different input sequences.  In the case of the original transformer architecture in the Attention Is All You Need paper, the two inputs x 1 and x 2 correspond to the sequence returned by the encoder module on the left ( x 2 ) and the input sequence being processed by the decoder part on the right ( x 1 ). In the context of a multimodal LLM, x 2 is the output of an image encoder. (Note that the queries usually come from the decoder, and the keys and values typically come from the encoder.) Note that in cross-attention, the two input sequences x 1 and x 2 can have different numbers of elements. However, their embedding dimensions must match. If we set x 1 = x 2 , this is equivalent to self-attention. Now that we have talked a bit about the two major multimodal design choices, let's briefly talk about how we deal with the three major components during model training, which are summarized in the figure below. An overview of the different components in a multimodal LLM. The components numbered 1-3 can be frozen or unfrozen during the multimodal training process. Similar to the development of traditional text-only LLMs, the training of multimodal LLMs also involves two phases: pretraining and instruction finetuning. However, unlike starting from scratch, multimodal LLM training typically begins with a pretrained, instruction-finetuned text-only LLM as the base model. For the image encoder, CLIP is commonly used and often remains unchanged during the entire training process, though there are exceptions, as we will explore later. Keeping the LLM part frozen during the pretraining phase is also usual, focusing only on training the projector—a linear layer or a small multi-layer perceptron. Given the projector's limited learning capacity, usually comprising just one or two layers, the LLM is often unfrozen during multimodal instruction finetuning (stage 2) to allow for more comprehensive updates. However, note that in the cross-attention-based models (Method B), the cross-attention layers are unfrozen throughout the entire training process. After introducing the two primary approaches (Method A: Unified Embedding Decoder Architecture and Method B: Cross-modality Attention Architecture), you might be wondering which is more effective. The answer depends on specific trade-offs. The Unified Embedding Decoder Architecture (Method A) is typically easier to implement since it doesn't require any modifications to the LLM architecture itself. The Cross-modality Attention Architecture (Method B) is often considered more computationally efficient because it doesn't overload the input context with additional image tokens, introducing them later in the cross-attention layers instead. 
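To make the cross-attention computation described above concrete, here is a minimal single-head PyTorch sketch; the sequence lengths, the 768-dimensional embedding, and the omission of multiple heads are simplifying assumptions:

```python
import torch
import torch.nn as nn

d_model = 768  # shared embedding dimension (assumption)
W_q = nn.Linear(d_model, d_model, bias=False)  # queries come from the text (decoder) side
W_k = nn.Linear(d_model, d_model, bias=False)  # keys come from the image encoder output
W_v = nn.Linear(d_model, d_model, bias=False)  # values come from the image encoder output

x_1 = torch.randn(1, 32, d_model)   # text token embeddings (x_1), 32 tokens
x_2 = torch.randn(1, 196, d_model)  # image patch embeddings from the encoder (x_2), 196 patches

Q, K, V = W_q(x_1), W_k(x_2), W_v(x_2)
A = torch.softmax(Q @ K.transpose(1, 2) / d_model**0.5, dim=-1)  # attention scores (1, 32, 196)
Z = A @ V                                                        # context vectors (1, 32, 768)

# Note: x_1 and x_2 may have different lengths, but their embedding dimension
# must match; setting x_2 = x_1 recovers regular self-attention.
```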
Additionally, this approach maintains the text-only performance of the original LLM if the LLM parameters are kept frozen during training. We will revisit the discussion on modeling performance and response quality in a later section, where we will discuss NVIDIA's NVLM paper. This marks the end of what turned out to be a rather extensive introduction to multimodal LLMs. As I write this, I realize that the discussion has become lengthier than initially planned, which probably makes this a good place to conclude the article. However, to provide a practical perspective, it would be nice to examine a few recent research papers that implement these approaches. So, we will explore these papers in the remaining sections of this article. For the remainder of this article, I will review recent literature concerning multimodal LLMs, focusing specifically on works published in the last few weeks to maintain a reasonable scope. Thus, this is not a historical overview or comprehensive review of multimodal LLMs but rather a brief look at the latest developments. I will also try to keep these summaries short and without too much fluff, as there are 10 of them. The conclusion section at the end of this article has an overview that compares the methods used in these papers. The Llama 3 Herd of Models paper (July 31, 2024) by Meta AI came out earlier this summer, which feels like ages ago in LLM terms. However, given that they only described but did not release their multimodal models until much later, I think it's fair to include Llama 3 in this list. (Llama 3.2 models were officially announced and made available on September 25.) The multimodal Llama 3.2 models, which come in an 11-billion and 90-billion parameter version, are image-text models that use the previously described cross-attention-based approach, which is illustrated in the figure below. Illustration of the multimodal LLM approach used by Llama 3.2. (Annotated figure from the Llama 3 paper: https://arxiv.org/abs/2407.21783. The video and speech parts are visually occluded to focus the attention on the image part.) Note that while the figure also depicts video and speech as possible modalities, the models that were released as of this writing focus only on image and text. Llama 3.2 uses the cross-attention-based approach. However, it differs a bit from what I wrote about earlier, namely that in multimodal LLM development, we usually freeze the image encoder and only update the LLM parameters during pretraining. Here, the researchers almost take the opposite approach: they update the image encoder but do not update the language model's parameters. They write that this is intentional and done to preserve the text-only capabilities so that the 11B and 90B multimodal models can be used as drop-in replacements for the Llama 3.1 8B and 70B text-only models on text tasks. The training itself is done in multiple iterations, starting with the Llama 3.1 text models. After adding the image encoder and projection (here called "adapter") layers, they pretrain the model on image-text data. Then, similar to the Llama 3 model text-only training (I wrote about it in an earlier article), they follow up with instruction and preference finetuning. Instead of adopting a pretrained model such as CLIP as an image encoder, the researchers used a vision transformer that they pretrained from scratch. Specifically, they adopted the ViT-H/14 variant (630 million parameters) of the classic vision transformer architecture (Dosovitskiy et al., 2020).
They then pretrained the ViT on a dataset of 2.5 billion image-text pairs over five epochs; this was done before connecting the image encoder to the LLM. (The image encoder takes 224×224 resolution images and divides them into a 14×14 grid of patches, with each patch sized at 16×16 pixels.) As the cross-attention layers add a substantial number of parameters, they are only added in every fourth transformer block. (For the 8B model, this adds 3B parameters, and for the 70B model, this adds 20 billion parameters.) The Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models paper (September 25, 2024) is notable because it promises to open source not only the model weights but also the dataset and source code, similar to the language-only OLMo LLM. (This is great for LLM research as it allows us to take a look at the exact training procedure and code and also lets us run ablation studies and reproduce results on the same dataset.) If you are wondering why there are two names in the paper title, Molmo refers to the model (Multimodal Open Language Model), and PixMo (Pixels for Molmo) is the dataset. Illustration of the Molmo decoder-only approach (Method A). Annotated figure adapted from the Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models paper: https://www.arxiv.org/abs/2409.17146. As illustrated in the figure above, the image encoder employs an off-the-shelf vision transformer, specifically CLIP. The term "connector" here refers to a "projector" that aligns image features with the language model. Molmo streamlines the training process by avoiding multiple pretraining stages, choosing instead a simple pipeline that updates all parameters in a unified approach—including those of the base LLM, the connector, and the image encoder. The Molmo team offers several options for the base LLM: OLMo-7B-1024 (a fully open model backbone), OLMoE-1B-7B (a mixture-of-experts architecture; the most efficient model), Qwen2 7B (an open-weight model that performs better than OLMo-7B-1024), and Qwen2 72B (an open-weight model and the best-performing model). NVIDIA's NVLM: Open Frontier-Class Multimodal LLMs paper (September 17, 2024) is particularly interesting because, rather than focusing on a single approach, it explores both methods: Method A, the Unified Embedding Decoder Architecture ("decoder-only architecture," NVLM-D), and Method B, the Cross-Modality Attention Architecture ("cross-attention-based architecture," NVLM-X). Additionally, they develop a hybrid approach (NVLM-H) and provide an apples-to-apples comparison of all three methods. Overview of the three multimodal approaches. (Annotated figure from the NVLM: Open Frontier-Class Multimodal LLMs paper: https://arxiv.org/abs/2409.11402) As summarized in the figure below, NVLM-D corresponds to Method A, and NVLM-X corresponds to Method B, as discussed earlier. The concept behind the hybrid model (NVLM-H) is to combine the strengths of both methods: an image thumbnail is provided as input, followed by a dynamic number of patches passed through cross-attention to capture finer high-resolution details. In short, the research team finds that NVLM-X demonstrates superior computational efficiency for high-resolution images, NVLM-D achieves higher accuracy in OCR-related tasks, and NVLM-H combines the advantages of both methods. Similar to Molmo and other approaches, they begin with a text-only LLM rather than pretraining a multimodal model from scratch (as this generally performs better).
Additionally, they use an instruction-tuned LLM instead of a base LLM. Specifically, the backbone LLM is Qwen2-72B-Instruct (to my knowledge, Molmo used the Qwen2-72B base model). While training all LLM parameters in the NVLM-D approach, they found that for NVLM-X, it works well to freeze the original LLM parameters and train only the cross-attention layers during both pretraining and instruction finetuning. For the image encoder, instead of using a typical CLIP model, they use InternViT-6B, which remains frozen throughout all stages. The projector is a multilayer perceptron rather than a single linear layer. The previous two papers and models, Molmo and NVLM, were based on the Qwen2-72B LLM. In this paper, the Qwen research team itself announces a multimodal LLM, Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (October 3rd, 2024). At the core of this work is their so-called "Naive Dynamic Resolution" mechanism (the term "naive" is intentional and not a typo for "native," though "native" could also be fitting). This mechanism allows the model to handle images of varying resolutions without simple downsampling, enabling the input of images in their original resolution. An overview of the multimodal Qwen model, which can process input images at various resolutions natively. (Annotated figure from the Qwen2-VL paper: https://arxiv.org/abs/2409.12191) The native resolution input is implemented via a modified ViT by removing the original absolute position embeddings and introducing 2D-RoPE. They used a classic vision encoder with 675M parameters and LLM backbones of varying sizes, as shown in the table below. The components of the different Qwen2-VL models. (Annotated figure from the Qwen2-VL paper: https://arxiv.org/abs/2409.12191) The training itself consists of 3 stages: (1) pretraining only the image encoder, (2) unfreezing all parameters (including LLM), and (3) freezing the image encoder and instruction-finetuning only the LLM. Pixtral 12B (September 17, 2024), which uses the Method A: Unified Embedding Decoder Architecture approach, is the first multimodal model from Mistral AI. Unfortunately, there is no technical paper or report available, but the Mistral team shared a few interesting tidbits in their blog post. Interestingly, they chose not to use a pretrained image encoder, instead training one with 400 million parameters from scratch. For the LLM backbone, they used the 12-billion-parameter Mistral NeMo model. Similar to Qwen2-VL, Pixtral also supports variable image sizes natively, as illustrated in the figure below. Illustration of how Pixtral processes images of different sizes. (Annotated figure from the Pixtral blog post: https://mistral.ai/news/pixtral-12b/) The MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning paper (September 30, 2024) provides practical tips and introduces a mixture-of-experts multimodal model alongside a dense model, similar to Molmo. The models span a wide size range, from 1 billion to 30 billion parameters. The models described in this paper focus on Method A, the Unified Embedding Decoder Architecture, which structures inputs effectively for multimodal learning. In addition, the paper has a series of interesting ablation studies looking into data mixtures and the effects of using coordinate tokens. Illustration of the MM1.5 approach, which includes additional coordinate tokens to denote bounding boxes. (Annotated figure from the MM1.5 paper: https://arxiv.org/abs/2409.20566.)
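A recurring theme across these papers is which of the three components (image encoder, projector, LLM backbone) are frozen at each training stage. In PyTorch, such stage-wise freezing usually comes down to toggling requires_grad, roughly as in this sketch; the component modules here are hypothetical stand-ins, not any specific paper's code:

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze all parameters of a component."""
    for param in module.parameters():
        param.requires_grad = trainable

# Hypothetical components of a multimodal LLM (placeholders for illustration)
vision_encoder = nn.Linear(1024, 768)
projector = nn.Linear(768, 768)
llm = nn.Linear(768, 768)

# Stage 1 (pretraining): train only the projector
set_trainable(vision_encoder, False)
set_trainable(projector, True)
set_trainable(llm, False)

# Stage 2 (instruction finetuning): unfreeze the LLM as well
set_trainable(llm, True)
```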
The Aria: An Open Multimodal Native Mixture-of-Experts Model paper (October 8, 2024) introduces another mixture-of-experts model approach, similar to one of the variants in the Molmo and MM1.5 lineups. The Aria model has 24.9 billion parameters, with 3.5 billion parameters allocated per text token. The image encoder (SigLIP) has 438 million parameters. This model is based on a cross-attention approach with the following overall training procedure: (1) training the LLM backbone entirely from scratch, and (2) pretraining both the LLM backbone and the vision encoder. The Baichuan-Omni Technical Report (October 11, 2024) introduces Baichuan-Omni, a 7-billion-parameter multimodal LLM based on Method A: the Unified Embedding Decoder Architecture approach, as shown in the figure below. An overview of the Baichuan-Omni model, which can handle various input modalities. (Annotated figure from the Baichuan-Omni paper: https://arxiv.org/abs/2410.08565) The training process for Baichuan-Omni involves a three-stage approach: Projector training: Initially, only the projector is trained, while both the vision encoder and the language model (LLM) remain frozen. Vision encoder training: Next, the vision encoder is unfrozen and trained, with the LLM still frozen. Full model training: Finally, the LLM is unfrozen, allowing the entire model to be trained end-to-end. The model utilizes the SigLIP vision encoder and incorporates the AnyRes module to handle high-resolution images through down-sampling techniques. While the report does not explicitly specify the LLM backbone, it is likely based on the Baichuan 7B LLM, given the model's parameter size and the naming convention. The Emu3: Next-Token Prediction is All You Need paper (September 27, 2024) presents a compelling alternative to diffusion models for image generation that is based solely on a transformer decoder architecture. Although it's not a multimodal LLM in the classic sense (i.e., models focused on image understanding rather than generation), Emu3 is super interesting as it demonstrates that it's possible to use transformer decoders for image generation, which is a task typically dominated by diffusion methods. (However, note that there have been other similar approaches before, such as Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation.) Emu3 is primarily an LLM for image generation as an alternative to diffusion models. (Annotated figure from the Emu3 paper: https://arxiv.org/abs/2409.18869) The researchers trained Emu3 from scratch and then used Direct Preference Optimization (DPO) to align the model with human preferences. The architecture includes a vision tokenizer inspired by SBER-MoVQGAN. The core LLM architecture is based on Llama 2, yet it is trained entirely from scratch. We previously focused on multimodal LLMs for image understanding and just saw one example for image generation with Emu3 above. Now, the Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation paper (October 17, 2024) introduces a framework that unifies multimodal understanding and generation tasks within a single LLM backbone. A key feature of Janus is the decoupling of visual encoding pathways to address the distinct requirements of understanding and generation tasks. The researchers argue that image understanding tasks require high-dimensional semantic representations, while generation tasks require detailed local information and global consistency in images.
By separating these pathways, Janus effectively manages these differing needs. The model employs the SigLIP vision encoder, similar to that used in Baichuan-Omni, for processing visual inputs. For image generation, it utilizes a Vector Quantized (VQ) tokenizer to handle the generation process. The base LLM in Janus is the DeepSeek-LLM with 1.3 billion parameters. An overview of the unified decoder-only framework used in Janus. (Annotated figure from the Janus paper: https://arxiv.org/abs/2410.13848.) The training process for Janus follows three stages, as shown in the figure below. Illustration of the 3-stage training process of the Janus model. (Annotated figure from the Janus paper: https://arxiv.org/abs/2410.13848) In Stage I, only the projector layers and image output layer are trained, while the LLM, understanding, and generation encoders remain frozen. In Stage II, the LLM backbone and text output layer are unfrozen, allowing for unified pretraining across understanding and generation tasks. Finally, in Stage III, the entire model, including the SigLIP image encoder, is unfrozen for supervised fine-tuning, enabling the model to fully integrate and refine its multimodal capabilities. As you may have noticed, I almost entirely skipped both the modeling and the computational performance comparisons. First, comparing the performance of LLMs and multimodal LLMs on public benchmarks is challenging due to prevalent data contamination, meaning that the test data may have been included in the training data. Additionally, the architectural components vary so much that making an apples-to-apples comparison is difficult. So, big kudos to the NVIDIA team for developing NVLM in different flavors, which at least allowed for a comparison between the decoder-only and cross-attention approaches. In any case, the main takeaway from this article is that multimodal LLMs can be built successfully in many different ways. Below is a figure that summarizes the different components of the models covered in this article. An overview of the different models covered in this article along with their subcomponents and training approaches. I hope you found reading this article educational and now have a better understanding of how multimodal LLMs work! This magazine is a personal passion project. For those who wish to support me, please consider purchasing a copy of my Build a Large Language Model (From Scratch) book. (I am confident that you'll get lots out of this book as it explains how LLMs work at a level of detail that is not found anywhere else.) Build a Large Language Model (From Scratch) now available on Amazon If you read the book and have a few minutes to spare, I'd really appreciate a brief review. It helps us authors a lot! Alternatively, I also recently enabled the paid subscription option on Substack to support this magazine directly. Your support means a great deal! Thank you!
Build a Large Language Model (From Scratch) now available on Amazon Writing this book was a tremendous effort, and I’m incredibly grateful for all the support and motivating feedback over the past two years—especially in these last couple of months, as so many kind readers have shared their feedback. Thank you all, and as an author, there is nothing more motivating than to hear that the book makes a difference in your careers! For those who have finished the book and are eager for more, stay tuned! I’ll be adding some bonus content to the GitHub repository in the coming months.  P.S. If you have read the book, I'd really appreciate it if you could leave a brief review ; it truly helps us authors! 1. Use cases of multimodal LLMs What are multimodal LLMs? As hinted at in the introduction, multimodal LLMs are large language models capable of processing multiple types of inputs, where each "modality" refers to a specific type of data—such as text (like in traditional LLMs), sound, images, videos, and more. For simplicity, we will primarily focus on the image modality alongside text inputs. A classic and intuitive application of multimodal LLMs is image captioning: you provide an input image, and the model generates a description of the image, as shown in the figure below. Example use of a multimodal LLM explaining a meme . Of course, there are many other use cases. For example, one of my favorites is extracting information from a PDF table and converting it into LaTeX or Markdown. 2. Common approaches to building multimodal LLMs There are two main approaches to building multimodal LLMs: Method A: Unified Embedding Decoder Architecture approach; Method B: Cross-modality Attention Architecture approach. The two main approaches to developing multimodal LLM architectures. As shown in the figure above, the Unified Embedding-Decoder Architecture utilizes a single decoder model, much like an unmodified LLM architecture such as GPT-2 or Llama 3.2. In this approach, images are converted into tokens with the same embedding size as the original text tokens, allowing the LLM to process both text and image input tokens together after concatenation. The Cross-Modality Attention Architecture employs a cross-attention mechanism to integrate image and text embeddings directly within the attention layer. In the following sections, we will explore how these methods work on a conceptual level. Then, we will look at recent research papers on multimodal LLMs to see how they are applied in practice. 2.1 Method A: Unified Embedding Decoder Architecture Let’s begin with the unified embedding decoder architecture, illustrated again in the figure below. Illustration of the unified embedding decoder architecture, which is an unmodified decoder-style LLM (like GPT-2, Phi-3, Gemma, or Llama 3.2) that receives inputs consisting of image token and text token embeddings. In the unified embedding-decoder architecture, an image is converted into embedding vectors, similar to how input text is converted into embeddings in a standard text-only LLM. For a typical text-only LLM that processes text, the text input is usually tokenized (e.g., using Byte-Pair Encoding) and then passed through an embedding layer, as shown in the figure below. Illustration of the standard process for tokenizing text and converting it into token embedding vectors, which are subsequently passed to an LLM during training and inference. 
2.1.1 Understanding Image encoders Analogous to the tokenization and embedding of text, image embeddings are generated using an image encoder module (instead of a tokenizer), as shown in the figure below. Illustration of the process for encoding an image into image patch embeddings. What happens inside the image encoder shown above? To process an image, we first divide it into smaller patches, much like breaking words into subwords during tokenization. These patches are then encoded by a pretrained vision transformer (ViT), as shown in the figure below. Illustration of a classic vision transformer (ViT) setup, similar to the model proposed in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020). Note that ViTs are often used for classification tasks, so I included the classification head in the figure above. However, in this case, we only need the image encoder part. 2.1.2 The role of the linear projection module The "linear projection" shown in the previous figure consists of a single linear layer (i.e., a fully connected layer). The purpose of this layer is to project the image patches, which are flattened into a vector, into an embedding size compatible with the transformer encoder. This linear projection is illustrated in the figure below. An image patch, flattened into a 256-dimensional vector, is up-projected to a 768-dimensional vector. Illustration of a linear projection layer that projects flattened image patches from a 256-dimensional into a 768-dimensional embedding space. For those who prefer seeing a code example, In PyTorch code, we could implement the linear projection for the image patches as follows: If you have read my Machine Learning Q and AI book by chance, you may know there are ways to replace linear layers with convolution operations that can be implemented to be mathematically equivalent. Here, this can be especially handy as we can combine the creation of patches and projection into two lines of code: 2.1.3 Image vs text tokenization Now that we briefly discussed the purpose of the image encoder (and the linear projection that is part of the encoder), let's return to the text tokenization analogy from earlier and look at text and image tokenization and embedding side by side, as depicted in the figure below. Image tokenization and embedding (left) and text tokenization and embedding (right) side by side. As you can see in the figure above, I included an additional projector module that follows the image encoder. This projector is usually just another linear projection layer that is similar to the one explained earlier. The purpose is to project the image encoder outputs into a dimension that matches the dimensions of the embedded text tokens, as illustrated in the figure below. (As we will see later, the projector is sometimes also called adapter, adaptor, or connector.) Another side-by-side comparison between image tokenization and text tokenization, where the role of the projector is to match the text token embedding dimensions. Now that the image patch embeddings have the same embedding dimension as the text token embeddings, we can simply concatenate them as input to the LLM, as shown in the figure at the beginning of this section. Below is the same figure again for easier reference. After projecting the image patch tokens into the same dimension as the text token embeddings, we can simply concatenate them as input to a standard LLM. 
By the way, the image encoder we discussed in this section is usually a pretrained vision transformer. A popular choice is CLIP or OpenCLIP . However, there are also versions of Method A that operate directly on patches, such as Fuyu , which is shown in the figure below. Annotated figure of the Fuyu multimodal LLM that operates directly on the image patches without image encoder. (Annotated figure from https://www.adept.ai/blog/fuyu-8b .) As illustrated in the figure above, Fuyu passes the input patches directly into a linear projection (or embedding layer) to learn its own image patch embeddings rather than relying on an additional pretrained image encoder like other models and methods do. This greatly simplifies the architecture and training setup. 2.2 Method B: Cross-Modality Attention Architecture Now that we have discussed the unified embedding decoder architecture approach to building multimodal LLMs and understand the basic concept behind image encoding, let's talk about an alternative way of implementing multimodal LLMs via cross-attention, as summarized in the figure below. An illustration of the Cross-Modality Attention Architecture approach to building multimodal LLMs. In the Cross-Modality Attention Architecture method depicted in the figure above, we still use the same image encoder setup we discussed previously. However, instead of encoding the patches as input to the LLM, we connect the input patches in the multi-head attention layer via a cross-attention mechanism. The idea is related and goes back to the original transformer architecture from the 2017 Attention Is All You Need paper, highlighted in the figure below. High-level illustration of the cross-attention mechanism used in the original transformer architecture. (Annotated figure from the "Attention Is All You Need" paper: https://arxiv.org/abs/1706.03762.) Note that the original "Attention Is All You Need" transformer depicted in the figure above was originally developed for language translation. So, it consists of a text en coder (left part of the figure) that takes the sentence to be translated and generates the translation via a text de coder (right part of the figure). In the context of multimodal LLM, the encoder is an image encoder instead of a text encoder, but the same idea applies. How does cross-attention work? Let's have a look at a conceptual drawing of what happens inside the regular self-attention mechanism. Outline of the regular self-attention mechanism. (This flow depicts one of the heads in a regular multi-head attention module.) In the figure above, x is the input, and W q is a weight matrix used to generate the queries ( Q ). Similarly, K stands for keys, and V stands for values. A represents the attention scores matrix, and Z are the inputs (x) transformed into the output context vectors. (If this seems confusing, you may find a comprehensive introduction in Chapter 3 of my Build a Large Language Model from Scratch book helpful; alternatively, you may also find my article, Understanding and Coding Self-Attention, Multi-Head Attention, Cross-Attention, and Causal-Attention in LLMs helpful here.) In cross-attention, in contrast to self-attention, we have two different input sources, as illustrated in the following figure. Illustration of cross attention, where there can be two different inputs x 1 and x 2 As illustrated in the previous two figures, in self-attention, we work with the same input sequence. In cross-attention, we mix or combine two different input sequences.  
In the case of the original transformer architecture in the Attention Is All You Need paper, the two inputs x 1 and x 2 correspond to the sequence returned by the encoder module on the left ( x 2 ) and the input sequence being processed by the decoder part on the right ( x 1 ). In the context of a multimodal LLM, x 2 is the output of an image encoder. (Note that the queries usually come from the decoder, and the keys and values typically come from the encoder.) Note that in cross-attention, the two input sequences x 1 and x 2 can have different numbers of elements. However, their embedding dimensions must match. If we set x 1 = x 2 , this is equivalent to self-attention. 3. Unified decoder and cross-attention model training Now that we have talked a bit about the two major multimodal design choices, let's briefly talk about how we deal with the three major components during model training, which are summarized in the figure below. An overview of the different components in a multimodal LLM. The components numbered 1-3 can be frozen or unfrozen during the multimodal training process. Similar to the development of traditional text-only LLMs, the training of multimodal LLMs also involves two phases: pretraining and instruction finetuning. However, unlike starting from scratch, multimodal LLM training typically begins with a pretrained, instruction-finetuned text-only LLM as the base model. For the image encoder, CLIP is commonly used and often remains unchanged during the entire training process, though there are exceptions, as we will explore later. Keeping the LLM part frozen during the pretraining phase is also usual, focusing only on training the projector—a linear layer or a small multi-layer perceptron. Given the projector's limited learning capacity, usually comprising just one or two layers, the LLM is often unfrozen during multimodal instruction finetuning (stage 2) to allow for more comprehensive updates. However, note that in the cross-attention-based models (Method B), the cross-attention layers are unfrozen throughout the entire training process. After introducing the two primary approaches (Method A: Unified Embedding Decoder Architecture and Method B: Cross-modality Attention Architecture), you might be wondering which is more effective. The answer depends on specific trade-offs. The Unified Embedding Decoder Architecture (Method A) is typically easier to implement since it doesn't require any modifications to the LLM architecture itself. The Cross-modality Attention Architecture (Method B) is often considered more computationally efficient because it doesn't overload the input context with additional image tokens, introducing them later in the cross-attention layers instead. Additionally, this approach maintains the text-only performance of the original LLM if the LLM parameters are kept frozen during training. We will revisit the discussion on modeling performance and response quality in a later section, where we will discuss NVIDIA's NVLM paper. This marks the end of what turned out to be a rather extensive introduction to multimodal LLMs. As I write this, I realize that the discussion has become lengthier than initially planned, which probably makes this a good place to conclude the article.  However, to provide a practical perspective, it would be nice to examine a few recent research papers that implement these approaches. So, we will explore these papers in the remaining sections of this article. 4. 
Recent multimodal models and methods For the remainder of this article, I will review recent literature concerning multimodal LLMs, focusing specifically on works published in the last few weeks to maintain a reasonable scope. Thus, this is not a historical overview or comprehensive review of multimodal LLMs but rather a brief look at the latest developments. I will also try to keep these summaries short and without too much fluff as there are 10 of them.  The conclusion section at the end of this has an overview that compares the methods used in these papers. 4.1 The Llama 3 Herd of Models The Llama 3 Herd of Models paper (July 31, 2024) by Meta AI came out earlier this summer, which feels like ages ago in LLM terms. However, given that they only described but did not release their multimodal models until much later, I think it's fair to include Llama 3 in this list. (Llama 3.2 models were officially announced and made available on September 25.) The multimodal Llama 3.2 models, which come in an 11-billion and 90-billion parameter version, are image-text models that use the previously described cross-attention-based approach, which is illustrated in the figure below. Illustration of the multimodal LLM approach used by Llama 3.2. (Annotated figure from the Llama 3 paper: https://arxiv.org/abs/2407.21783.The video and speech parts are visually occluded to focus the attention on the image part.) Note that while the figure also depicts video and speech as possible modalities, the models that were released as of this writing focus only on image and text. Llama 3.2 uses the cross-attention-based approach. However, it differs a bit from what I wrote about earlier, namely that in multimodal LLM development, we usually freeze the image encoder and only update the LLM parameters during pretraining. Here, the researchers almost take the opposite approach: they update the image encoder but do not update the language model's parameters. They write that this is intentional and done to preserve the text-only capabilities so that the 11B and 90B multimodal models can be used as drop-in replacements for the Llama 3.1 8B and 70B text-only model on text tasks. The training itself is done in multiple iterations, starting with the Llama 3.1 text models. After adding the image encoder and projection (here called "adapter") layers, they pretrain the model on image-text data. Then, similar to the Llama 3 model text-only training (I wrote about it in an earlier article ), they follow up with instruction and preference finetuning. Instead of adopting a pretrained model such as CLIP as an image encoder, the researchers used a vision transformer that they pretrained from scratch. Specifically, they adopted the  ViT-H/14 variant (630 million parameters) of the classic vision transformer architecture ( Dosovitskiy et al., 2020 ). They then pretrained the ViT on a dataset of 2.5 billion image-text pairs over five epochs; this was done before connecting the image encoder to the LLM. (The image encoder takes 224×224 resolution images and divides them into a 14×14 grid of patches, with each patch sized at 16×16 pixels.) As the cross-attention layers add a substantial amount of parameters, they are only added in every fourth transformer block. (For the 8B model, this adds 3B parameters, and for the 70B model, this adds 20 billion parameters.) 
4.2 Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models The Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models paper (September 25, 2024) is notable because it promises to open source not only the model weights but also the dataset and source code similar to the language-only OLMo LLM. (This is great for LLM research as it allows us to take a look at the exact training procedure and code and also lets us run ablation studies and reproduce results on the same dataset.) If you are wondering why there are two names in the paper title, Molmo refers to the model (Multimodal Open Language Model), and PixMo (Pixels for Molmo) is the dataset. Illustration of the Molmo decoder-only approach (Method A). Annotated figure adapted from the Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models paper: https://www.arxiv.org/abs/2409.17146. As illustrated in the figure above, the image encoder employs an off-the-shelf vision transformer, specifically CLIP. The term "connector" here refers to a "projector" that aligns image features with the language model. Molmo streamlines the training process by avoiding multiple pretraining stages, choosing instead a simple pipeline that updates all parameters in a unified approach—including those of the base LLM, the connector, and the image encoder. The Molmo team offers several options for the base LLM: OLMo-7B-1024 (a fully open model backbone), OLMoE-1B-7B (a mixture-of-experts architecture; the most efficient model), Qwen2 7B (an open-weight model that performs better than OLMo-7B-1024), Qwen2 72B (an open-weight model and the best-performing model) Method A, the Unified Embedding Decoder Architecture ("decoder-only architecture," NVLM-D), and  Method B, the Cross-Modality Attention Architecture ("cross-attention-based architecture," NVLM-X).  Overview of the three multimodal approaches. (Annotated figure from the NVLM: Open Frontier-Class Multimodal LLMs paper: https://arxiv.org/abs/2409.11402) As summarized in the figure below, NVLM-D corresponds to Method A, and NVLM-X corresponds to Method B, as discussed earlier. The concept behind the hybrid model (NVLM-H) is to combine the strengths of both methods: an image thumbnail is provided as input, followed by a dynamic number of patches passed through cross-attention to capture finer high-resolution details. In short, the research team find that: NVLM-X demonstrates superior computational efficiency for high-resolution images. NVLM-D achieves higher accuracy in OCR-related tasks. NVLM-H combines the advantages of both methods. An overview of the multimodal Qwen model, which can process input images with various different resolutions natively. (Annotated figure from the Qwen2-VL paper: https://arxiv.org/abs/2409.12191) The native resolution input is implemented via a modified ViT by removing the original absolute position embeddings and introducing 2D-RoPE. They used a classic vision encoder with 675M parameters and LLM backbones of varying sizes, as shown in the table below. The components of the different Qwen2-VL models. (Annotated figure from the Qwen2-VL paper: https://arxiv.org/abs/2409.12191) The training itself consists of 3 stages: (1) pretraining only the image encoder, (2) unfreezing all parameters (including LLM), and (3) freezing the image encoder and instruction-finetuning only the LLM. 
4.5 Pixtral 12B

Pixtral 12B (September 17, 2024), which uses the Method A: Unified Embedding Decoder Architecture approach, is the first multimodal model from Mistral AI. Unfortunately, there is no technical paper or report available, but the Mistral team shared a few interesting tidbits in their blog post. Interestingly, they chose not to use a pretrained image encoder, instead training one with 400 million parameters from scratch. For the LLM backbone, they used the 12-billion-parameter Mistral NeMo model.

Similar to Qwen2-VL, Pixtral also supports variable image sizes natively, as illustrated in the figure below.

Illustration of how Pixtral processes images of different sizes. (Annotated figure from the Pixtral blog post: https://mistral.ai/news/pixtral-12b/)

4.6 MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

The MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning paper (September 30, 2024) provides practical tips and introduces a mixture-of-experts multimodal model alongside a dense model, similar to Molmo. The models span a wide size range, from 1 billion to 30 billion parameters. The models described in this paper focus on Method A, the Unified Embedding Decoder Architecture, which structures inputs effectively for multimodal learning.

In addition, the paper has a series of interesting ablation studies looking into data mixtures and the effects of using coordinate tokens.

Illustration of the MM1.5 approach, which includes additional coordinate tokens to denote bounding boxes. (Annotated figure from the MM1.5 paper: https://arxiv.org/abs/2409.20566.)

4.7 Aria: An Open Multimodal Native Mixture-of-Experts Model

The Aria: An Open Multimodal Native Mixture-of-Experts Model paper (October 8, 2024) introduces another mixture-of-experts model, similar to one of the variants in the Molmo and MM1.5 lineups.

The Aria model has 24.9 billion parameters, with 3.5 billion parameters allocated per text token. The image encoder (SigLIP) has 438 million parameters.

This model is based on a cross-attention approach with the following overall training procedure:

1. Training the LLM backbone entirely from scratch.
2. Pretraining both the LLM backbone and the vision encoder.

4.8 Baichuan-Omni

The Baichuan-Omni paper (https://arxiv.org/abs/2410.08565) introduces a multimodal model designed to handle various input modalities, as shown in the figure below.

An overview of the Baichuan-Omni model, which can handle various input modalities. (Annotated figure from the Baichuan-Omni paper: https://arxiv.org/abs/2410.08565)

The training process for Baichuan-Omni involves a three-stage approach:

1. Projector training: Initially, only the projector is trained, while both the vision encoder and the language model (LLM) remain frozen.
2. Vision encoder training: Next, the vision encoder is unfrozen and trained, with the LLM still frozen.
3. Full model training: Finally, the LLM is unfrozen, allowing the entire model to be trained end-to-end.
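The stage-wise freeze/unfreeze recipe described for Baichuan-Omni above (and, with a different ordering, for Qwen2-VL) boils down to toggling requires_grad on different parameter groups between training stages. Here is a minimal sketch with placeholder modules standing in for the real components:

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for param in module.parameters():
        param.requires_grad = trainable

# Placeholder stand-ins for the three components of a multimodal model
vision_encoder = nn.Linear(1024, 1024)
projector = nn.Linear(1024, 4096)
llm = nn.Linear(4096, 4096)

# Stage 1: train only the projector; vision encoder and LLM stay frozen
set_trainable(vision_encoder, False)
set_trainable(llm, False)
set_trainable(projector, True)

# Stage 2: additionally unfreeze the vision encoder (LLM still frozen)
set_trainable(vision_encoder, True)

# Stage 3: unfreeze the LLM and train the whole model end-to-end
set_trainable(llm, True)
```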

0 views
Ahead of AI 1 years ago

Instruction Pretraining LLMs

A lot happened last month: Apple announced the integration of on-device LLMs, Nvidia shared their large Nemotron model, FlashAttention-3 was announced, Google's Gemma 2 came out, and much more. You've probably already read about it all in various news outlets. So, in this article, I want to focus on recent research centered on instruction finetuning, a fundamental technique for training LLMs.

What I am going to cover in this article:

A new, cost-effective method for generating data for instruction finetuning
Instruction finetuning from scratch
Pretraining LLMs with instruction data
An overview of what's new in Gemma 2
An overview of all the other interesting research papers that came out in June

Happy reading!

The Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing paper shares a fascinating hack to generate a high-quality dataset for LLM instruction finetuning. While it doesn't reveal any particularly new research insights, it's one of those interesting, practical exploits that seems super useful. What distinguishes this instruction-data-generating method from others is that it can be fully automated and doesn't require any initial questions or instructions. As the paper title suggests, it enables the creation of an instruction dataset from "Nothing" – the only thing we need is a locally running Llama 3 8B model. The figure below summarizes how this method works.

Annotated illustration of the Magpie method for generating a synthetic dataset for instruction finetuning. The figure is based on illustrations from the Magpie paper: https://arxiv.org/abs/2406.08464

Essentially, as shown in the figure above, we just have to prompt the Llama 3 8B Instruct model with a pre-query template, and it will generate an instruction for us. Then, we feed that instruction back to the LLM, and it will generate a response. If we repeat this procedure a couple of thousand times, we obtain a dataset for instruction finetuning. (Optionally, we can apply an LLM to filter the instruction-response pairs by quality.)

1.2 Dataset quality

What's fascinating is that, with the resulting instruction dataset, the authors found that finetuning a Llama 3 8B base model with just instruction finetuning (no preference finetuning via RLHF and DPO) beats the original Llama 3 8B Instruct model by Meta AI, as shown in the figure below.

A Llama 3 8B base model finetuned on the Magpie-generated instruction dataset beats the original Llama 3 8B Instruct model. Based on an annotated illustration from the Magpie paper: https://arxiv.org/abs/2406.08464

The Magpie results shown in the figure above were achieved with only 300 thousand samples. In comparison, the original Llama 3 Instruct model was finetuned and aligned on 100 million samples!

1.3 Running the Dataset Generation Locally

I was skeptical at first, so I tried to implement this myself. It really works! Here, you can find my reimplementation using Ollama, which even runs fine locally on a MacBook Air.

Code screenshot from a reimplementation of the Magpie method that runs locally. The code is available here.
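To give a flavor of how little is needed, below is a rough sketch of the core loop against a locally running Ollama server. It is a simplified sketch, not the exact reimplementation shown above: the endpoint paths and fields follow Ollama's REST API as I understand it, and the pre-query template is the Llama 3 Instruct chat header for an empty user turn, which the model "completes" in raw mode by writing an instruction as if it were the user.

```python
import requests

OLLAMA_API = "http://localhost:11434/api"  # default local Ollama endpoint
MODEL = "llama3"                           # assumes a Llama 3 8B Instruct model is pulled

# Llama 3 Instruct header for an empty user turn; completing it in raw mode
# makes the model write an instruction as if it were the user.
PRE_QUERY_TEMPLATE = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"

def generate_pair() -> dict:
    # Step 1: let the model invent an instruction
    r = requests.post(f"{OLLAMA_API}/generate", json={
        "model": MODEL, "prompt": PRE_QUERY_TEMPLATE,
        "raw": True, "stream": False,
    })
    instruction = r.json()["response"].strip()

    # Step 2: feed the instruction back as a normal user message to get a response
    r = requests.post(f"{OLLAMA_API}/chat", json={
        "model": MODEL, "stream": False,
        "messages": [{"role": "user", "content": instruction}],
    })
    response = r.json()["message"]["content"].strip()
    return {"instruction": instruction, "output": response}

# Repeat (and optionally filter by quality) to build an instruction dataset
dataset = [generate_pair() for _ in range(10)]
```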
1.4 Additional Details

The authors created two sets of datasets: a "Pro" version using the Llama 3 70B Instruct model and an "Air" version using the Llama 3 8B Instruct model. As an earlier figure showed, the Magpie-Pro-generated dataset results in slightly stronger models compared to the Magpie-Air dataset when using it to instruction-finetune a Llama 3 8B base model. The figure below shows an additional comparison of the dataset qualities and difficulties as rated via an LLM.

Annotated plots from the Magpie paper showing the dataset quality and difficulty of the Air and Pro datasets relative to each other.

As the figure above shows, the quality of the Air and Pro datasets is roughly on par. In addition, it would have been interesting to see how the Alpaca dataset compares to these. (The assumption is that the Magpie data is of much higher quality than Alpaca, but a reference point would be interesting.)

Furthermore, the paper contains an analysis showing that the breadth or diversity of this dataset is much larger than that of other popular instruction finetuning datasets, such as Alpaca, Evol Instruct, and UltraChat. In addition, when compared to models trained with other instruction finetuning datasets, the Magpie-Pro finetuned model also compares very favorably.

1.5 Conclusion

Overall, I think that Magpie is an interesting exploit that is, on the one hand, fascinating in its effectiveness and, on the other hand, has a lot of practical utility. I will certainly consider it as an interesting, simple, and cost-effective candidate for constructing general-purpose instruction datasets in the future.

2. Instruction Finetuning from Scratch

If you are looking for a resource to understand the instruction finetuning process in LLMs, I am happy to share that Chapter 7 on instruction finetuning LLMs is now finally live on the Manning website. This is the longest chapter in the book and takes a from-scratch approach to implementing the instruction finetuning pipeline. This includes everything from input formatting to batching with a custom collate function, masking padding tokens, the training loop itself, and scoring the response quality of the finetuned LLM on a custom test set. (The exercises include changing prompt styles, instruction masking, and adding LoRA.) Happy coding!

An overview of chapter 7 in my "Build a Large Language Model From Scratch" book. Supplementary code materials are available here on GitHub.

PS: it's also the last chapter, and the publisher is currently preparing the layouts for the print version.

3. Instruction Pretraining LLMs

In the paper "Instruction Pre-Training: Language Models are Supervised Multitask Learners" (https://arxiv.org/abs/2406.14491), researchers investigate whether LLM pretraining can be made more efficient by including synthetic instruction-response pairs instead of just raw text. (Here, "raw text" means text from books, websites, papers, and so forth that has not been reprocessed into a specific format.)

A comparison between regular pretraining (top) and the proposed instruction pretraining approach (bottom) via an annotated figure from https://arxiv.org/abs/2406.14491

Specifically, the researchers experiment with generating instruction-response data from the raw training corpus itself via an "instruction synthesizer," an LLM specifically finetuned for this task.

(Note that this is not the first paper proposing the formatting of raw text as instruction data. Another work that comes to mind is "Genie: Achieving Human Parity in Content-Grounded Datasets Generation" (https://arxiv.org/abs/2401.14367). I also recall seeing another paper or blog post using instruction data during pretraining a few months ago (I discussed this method with some of my colleagues), but unfortunately, I couldn't find the reference. Nonetheless, the paper discussed here is particularly intriguing since it builds on openly available LLMs that run locally and covers both pretraining and continual pretraining.)
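Before getting into the details, here is a toy sketch of what instruction pretraining does to a single training example: the raw-text chunk is kept, and synthesized instruction-response pairs about that chunk are appended, so the model still trains with the ordinary next-token objective on the combined sequence. The Question/Answer formatting and the example pairs below are my own placeholders, not the exact template from the paper.

```python
# A raw-text chunk from the pretraining corpus
raw_text = (
    "The mitochondrion is an organelle found in the cells of most eukaryotes. "
    "It generates most of the cell's supply of ATP."
)

# Instruction-response pairs produced by the instruction synthesizer from the
# same chunk (made-up examples; the real pairs come from the finetuned LLM)
synthesized_pairs = [
    ("What is the main function of the mitochondrion?",
     "It generates most of the cell's supply of ATP."),
    ("In which organisms are mitochondria found?",
     "In the cells of most eukaryotes."),
]

def build_instruction_augmented_example(text, pairs):
    # Append the synthesized pairs to the raw text; the combined string is then
    # tokenized and trained on with the usual next-token prediction loss.
    qa_block = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in pairs)
    return text + "\n\n" + qa_block

print(build_instruction_augmented_example(raw_text, synthesized_pairs))
```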
3.1 Instruction Synthesizer

Before we dive into the pretraining and continual pretraining results, let's talk about the core component of this method: the instruction synthesizer. This is an openly available Mistral 7B v0.1 LLM (which I wrote about last year here: https://magazine.sebastianraschka.com/i/138555764/mistral-b) that has been finetuned to generate instruction-response pairs from raw text.

To finetune this synthesizer, the researchers use datasets such as HotpotQA (https://arxiv.org/abs/1809.09600), which consists of passages from Wikipedia associated with questions and answers. In doing so, the authors also ensure that a variety of tasks, like commonsense reasoning, sentiment analysis, math problems, etc., are covered.

The input and output data of the instruction synthesizer via an annotated figure from https://arxiv.org/abs/2406.14491

Once this instruction synthesizer is developed (i.e., finetuned), it can be used to generate the input data for pretraining the target LLMs. One last noteworthy detail regarding the instruction synthesizer is that multiple raw texts (T_n) and instruction-response pairs (I_n ⊕ R_n) are concatenated as few-shot examples, as shown in the figure below.

The formatting of the instruction data for finetuning (and using) the instruction synthesizer via an annotated figure from https://arxiv.org/abs/2406.14491

3.2 Pretraining with Instruction Data

Now that we have discussed the method for generating the instruction-response pairs, let's get to the interesting part: how well do models train on this augmented dataset? The first set of results looks at two small models trained from scratch: one with 500M parameters and one with 1.3B parameters (both based on the Mistral architecture).

A comparison of 3 different pretraining approaches used to train models from scratch (annotated table from https://arxiv.org/abs/2406.14491)

As we can see in the table above, the model trained via the proposed instruction pretraining approach (Instruct PT) performs best on most benchmark tasks (higher values are better). Note, though, that it has seen more tokens than the Vanilla PT approach, since it includes the synthesized instruction-response pairs. Hence, the authors also included the Mix PT comparison, a model trained on a data mix containing both the raw text and the instruction data used to train the synthesizer.

From this comparison, we can see that it is not simply the use of any instruction data that makes the difference. The fact that Instruct PT performs better than Mix PT on most tasks illustrates that the nature of the instruction-response data (i.e., instruction-response data related to the raw data) is what matters. (The authors conducted all experiments using the same number of tokens.)

In addition, it's worth noting that the Instruct PT pretrained models have another advantage: they improve more when they are instruction-finetuned afterwards, as the figure below shows.

Finetuning LLMs that have been pretrained with either the traditional pretraining paradigm (Vanilla PT) or instruction pretraining (annotated figure from https://arxiv.org/abs/2406.14491)

3.3 Continual Pretraining with Instruction Data

Pretraining from scratch is interesting because that's how LLMs are created in the first place. However, I'd say that practitioners care more about continual pretraining and finetuning. Continual pretraining here means that we take an existing pretrained model and pretrain it further on new domain data. For instance, think of a Llama 3 8B base model that has been trained on a general text corpus and that you want to adapt for finance, medical, legal, or other domains.
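As a reminder of what continual pretraining looks like mechanically, here is a bare-bones sketch using Hugging Face transformers: load an already-pretrained base model and keep training it with the standard next-token objective on domain documents. The model name and data are placeholders, and details like batching, learning-rate schedules, and distributed training are omitted; this is not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Start from an already-pretrained base model (placeholder name) and keep
# training it with the plain next-token objective on new domain text.
model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
domain_corpus = ["<finance or biomedical documents go here>"]  # placeholder data

model.train()
for text in domain_corpus:
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
    # For causal LMs, passing labels=input_ids computes the shifted
    # next-token cross-entropy loss internally.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```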
The table below summarizes the results the researchers obtained when applying the instruction pretraining method to a pretrained Llama 3 8B base model. Specifically, they conducted continual pretraining with both biomedical texts and finance texts.

A comparison of 3 different pretraining approaches used for continual pretraining (annotated table from https://arxiv.org/abs/2406.14491)

Looking at the table above, we can see that the instruction pretraining approach (Instruct PT) clearly outperforms the vanilla pretraining (Vanilla PT) approach (here, this means regular continual pretraining of the base model). The Llama 3 70B base model is included as a reference, I suppose to showcase that small, specialized models can beat larger general models.

3.4 Conclusion

Almost every time I explain the LLM pretraining pipeline to someone, they are surprised by its simplicity and the fact that this is still what's commonly used to train LLMs today. The instruction pretraining approach is quite refreshing in that sense. One caveat is that for large pretraining corpora, it might still be expensive to create the instruction-augmented data. However, the nice thing about generated data is that it can be reused in many different projects once created.

4. Gemma 2

I cannot write this article without mentioning Google's new Gemma 2 models, which are arguably the biggest model release last month. (However, when it comes to pure size, Nvidia's Nemotron-4 340B takes the crown: https://arxiv.org/abs/2406.11704.) The Gemma 2 models come in 2.6B, 9B, and 27B parameter versions.

Since this article is already quite lengthy, and you're likely familiar with Gemma 2 from other sources, let's cut to the chase. What are the main highlights and noteworthy updates in Google's newly released Gemma 2 LLMs? The main theme is exploring techniques without necessarily increasing the size of the training datasets, focusing instead on developing relatively small and efficient LLMs. Specifically, they blend three main architectural and training choices to create the 2.6B and 9B parameter models: sliding window attention, grouped-query attention, and knowledge distillation.

4.1 Sliding window attention

Sliding window attention (e.g., as popularized by Mistral) is a technique using a fixed-size attention block that allows a current token to attend to only a specific number of previous tokens instead of all previous tokens, as illustrated in the figure below.

Annotated figure from https://arxiv.org/abs/2310.06825 explaining sliding window attention.

In the case of Gemma 2, the authors alternated between regular attention and sliding window attention layers. The sliding window size was 4096 tokens, within an overall context length of 8192 tokens. Sliding window attention is mainly used to improve computational performance, and the researchers also included a small ablation study showing that there's a barely noticeable difference in perplexity when shrinking the block size during inference.

An ablation study from the Gemma 2 technical report showing that a decreased block size for the sliding window barely impacts the modeling performance of the 9B parameter model during inference. (It would have been interesting to see the GPU memory improvement side-by-side.)

4.2 Grouped-query attention

Grouped-query attention (as used in Llama 2 and 3) can be regarded as a more generalized form of multi-query attention. The motivation behind it is to reduce the number of trainable parameters by sharing the same Keys and Values heads across multiple Query heads, thereby lowering computational requirements.

Annotated figure from Ainslie et al. 2023
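To illustrate the key/value-head sharing just described, here is a minimal sketch of the grouped-query attention computation (projections, masking, and other multi-head bookkeeping omitted): with 8 query heads and only 2 key/value heads, each key/value head is reused by a group of 4 query heads.

```python
import torch

batch, seq_len, head_dim = 1, 8, 64
n_query_heads, n_kv_heads = 8, 2                  # each K/V head serves 4 query heads
group_size = n_query_heads // n_kv_heads

q = torch.randn(batch, n_query_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)  # fewer K/V heads: fewer params, smaller KV cache
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand K and V so each group of query heads shares the same key/value head
k = k.repeat_interleave(group_size, dim=1)        # (1, 8, seq_len, head_dim)
v = v.repeat_interleave(group_size, dim=1)

scores = q @ k.transpose(-2, -1) / head_dim**0.5  # (1, 8, seq_len, seq_len)
context = torch.softmax(scores, dim=-1) @ v       # (1, 8, seq_len, head_dim)
```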
4.3 Knowledge distillation

The general idea of knowledge distillation (as in MiniLLM, https://arxiv.org/abs/2306.08543) is to transfer knowledge from a larger model (the teacher) to a smaller model (the student). Here, they trained a 27B (teacher) model from scratch and then trained the smaller 2.6B and 9B (student) models on the outputs of the larger teacher model. The 27B model itself doesn't use knowledge distillation; it was trained from scratch to serve as the "teacher" for the smaller models.

An overview of knowledge distillation from my Machine Learning Q and AI book in the context of computer vision. In an LLM context, think of text instead of images and predicted tokens instead of class labels.

The paper contains many other interesting tidbits. For instance, one hallmark of Gemma 2 is its relatively large vocabulary size: 256,000 tokens. This is similar to the first Gemma model, but it's still worth noting since it's twice the size of the Llama 3 vocabulary (128,000) and eight times the size of the Phi-3 vocabulary (32,000).

The vocabulary size of an LLM refers to the number of unique tokens (words, subwords, or characters) that the model can recognize and generate. A large vocabulary allows for better coverage of words and concepts, improved handling of multilingual content, and reduced tokenization artifacts. However, a large vocabulary also comes with trade-offs, such as increased model size and potentially slower inference due to the larger embedding and output layers. (That's where the sliding window attention and grouped-query attention mechanisms are important to offset this.)

There's also an interesting section on "logit capping," a technique I haven't seen used before. Essentially, it is a form of min-max normalizing and clipping of the logit values to keep them within a certain range. I presume this is to improve stability and gradient flow during training:

logits ← soft_cap ∗ tanh(logits / soft_cap)

Additionally, they leverage model merging techniques to combine models from multiple runs with different hyperparameters, although the paper doesn't provide much detail about that. (Interested readers can read more about this in WARP: On the Benefits of Weight Averaged Rewarded Policies, which Gemma 2 uses for this.)

In terms of modeling performance, Gemma 2 is almost as good as the 3x larger Llama 3 70B, and it beats the old Qwen 1.5 32B model. It would be interesting to see a comparison with the more recent Qwen 2 model.

A comparison of Gemma 2 with two other popular models with openly available weights, Llama 3 and Qwen 1.5. (Annotated table from the Gemma 2 technical report.)

Personally, a highlight is that the Gemma 2 report includes ablation studies for some of its architectural choices. This was once a given in academic research but is increasingly rare for LLM research.

An example of one of the ablation studies included in the Gemma 2 technical report. Here, "wide" refers to a model with 28 layers and an intermediate size of 24,576, and "deep" refers to an architecture with 42 layers and an intermediate size of 14,336.

It's refreshing to see such a relatively detailed technical report from Google. When it comes to the model itself, based on public consensus, Gemma 2 is likely the most capable model for single-GPU use cases today. For larger models, Llama 3 70B and Qwen 2 72B remain strong contenders.

Ahead of AI is a personal passion project that does not offer direct compensation.
However, for those who wish to support me, please consider purchasing a copy of my books . If you find them insightful and beneficial, please feel free to recommend them to your friends and colleagues. If you have a few moments, a review of Machine Learning Q and AI or Machine Learning with PyTorch and Scikit-Learn on Amazon would really help, too! Your support means a great deal and is tremendously helpful in continuing this journey. Thank you! Below is a selection of other interesting papers I stumbled upon this month. Given the length of this list, I highlighted those 20 I found particularly interesting with an asterisk (*). However, please note that this list and its annotations are purely based on my interests and relevance to my own projects. Scaling Synthetic Data Creation with 1,000,000,000 Personas by Chan, Wang, Yu, et al. (28 June), https://arxiv.org/abs/2406.20094 The research proposes a persona-driven data synthesis methodology that utilizes an LLM to create diverse synthetic data by leveraging a vast collection of automatically curated personas, called Persona Hub, which represents about 13% of the world's population. LLM Critics Help Catch LLM Bugs by McAleese, Pokorny, Ceron Uribe, et al. (28 June), https://arxiv.org/abs/2407.00215 This study develops "critic" models using RLHF to assist humans in evaluating model-generated code, training LLMs to write natural language feedback on code errors, and demonstrating their effectiveness in catching bugs across various tasks. Direct Preference Knowledge Distillation for Large Language Models by Li, Gu, Dong, et al. (28 June), https://arxiv.org/abs/2406.19774 DPKD reformulates Knowledge Distillation for LLMs into a two-stage process: first optimizing an objective combining implicit reward and reverse KL divergence, then improving the preference probability of teacher outputs over student outputs. Changing Answer Order Can Decrease MMLU Accuracy by Gupta, Pantoja, Ross, et al. (27 June), https://arxiv.org/abs/2406.19470 This study investigates the robustness of accuracy measurements on the MMLU benchmark for LLMs, revealing that shuffling answer label contents leads to decreased accuracy across models, with varying sensitivity. From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data by Xiong, Papageorgiou, Lee, and Papailiopoulos (27 June), https://arxiv.org/abs/2406.19292 This study proposes a finetuning approach using a synthetic dataset of numerical key-value retrieval tasks to improve LLM's long-context information retrieval and reasoning capabilities.  Dataset Size Recovery from LoRA Weights by Salama, Kahana, Horwitz, and Hoshen (27 June), https://arxiv.org/abs/2406.19395 This study introduces a method for recovering the number of images used to finetune a vision model using LoRA, by analyzing the norm and spectrum of LoRA matrices.  Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs by Azerbayev, Shao, Lin, et al. (26 June), https://arxiv.org/abs/2406.18629 This paper introduces Step-DPO, a method that optimizes individual reasoning steps in mathematical problem-solving for LLMs using a custom 10K step-wise preference pair dataset. RouteLLM: Learning to Route LLMs with Preference Data by Ong, Amjad, et al. (26 June), https://arxiv.org/abs/2406.18665 This study proposes efficient router models that dynamically select between stronger and weaker LLMs during inference to optimize cost-performance trade-offs.  
* A Closer Look into Mixture-of-Experts in Large Language Models by Zhang, Liu, Patel, et al. (26 June), https://arxiv.org/abs/2406.18219 This study looks at the inner workings of Mixture-of-Experts (MoE) LLMs to share  insights about neuron behavior, expert selection criteria, and expert diversity across layers, while providing practical suggestions for MoE design and implementation based on these observations. * Following Length Constraints in Instructions by Yuan, Kulikov, Yu, et al. (25 June), https://arxiv.org/abs/2406.17744 This study introduces a method to train LLMs that can follow user-specified length constraints at inference time, addressing the length bias in model evaluation and outperforming standard instruction-following models in length-controlled tasks. LongIns: A Challenging Long-context Instruction-based Exam for LLMs by Shaham, Bai, An, et al. (25 June), https://arxiv.org/abs/2406.17588 LongIns is a new benchmark for evaluating LLM's long-context capabilities, using three settings to assess retrieval and reasoning abilities. * The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale by He, Wang, Shen, et al. (25 June), https://arxiv.org/abs/2406.17557 This report introduces FineWeb, a 15-trillion token dataset derived from Common Crawl, and FineWeb-Edu, a 1.3-trillion token educational subset. Adam-mini: Use Fewer Learning Rates To Gain More by Zhang, Chen, Li, et al . (24 June), https://arxiv.org/abs/2406.16793 Adam-mini is a proposed optimizer that achieves comparable or better performance than AdamW while using 45-50% less memory by strategically reducing learning rate resources, partitioning parameters based on Hessian structure, and assigning optimized single learning rates to parameter blocks. WARP: On the Benefits of Weight Averaged Rewarded Policies by Ramé, Ferret, Vieillard, et al. (24 June), https://arxiv.org/abs/2406.16768 The paper introduces a new alignment strategy for LLMs that merges policies at three stages: using exponential moving average for dynamic KL regularization, spherical interpolation of independently fine-tuned policies, and linear interpolation with initialization. Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers by Lou, Jia, Zheng, and Tu (24 June), https://arxiv.org/abs/2406.16747 The authors propose a new sparse attention mechanism for autoregressive Transformers, using a scoring network and differentiable top-k mask operator to select a constant number of KV pairs per query to achieve linear time complexity and a constant memory footprint. Efficient Continual Pre-training by Mitigating the Stability Gap by Wang, Hu, Xiong, et al. (21 June), https://arxiv.org/abs/2406.14833 This study proposes three strategies to improve continual pretraining of LLMs: multiple epochs on a subset, focusing on high-quality data, and using a mixture similar to pretraining data. MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression by Fu, Huang, Ning, et al. (21 June), https://arxiv.org/abs/2406.14909 Mixture of Attention (MoA) automatically optimizes sparse attention patterns for different model components and input lengths in LLMs, improving context length, accuracy, and efficiency over uniform sparse attention approaches. LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs by Jiang, Ma, Chen, et al. 
(21 June), https://arxiv.org/abs/2406.15319 LongRAG introduces a new RAG framework using 4K-token retrieval units and long-context LLMs for answer extraction, which improves retrieval performance and achieves state-of-the-art results on question-answering tasks without additional training. * A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems by Cuconasu, Trappolini, Tonellotto, et al. (21 June), https://arxiv.org/abs/2406.14972 This study challenges conventional wisdom by demonstrating that base LLMs outperform instruction-tuned models in Retrieval Augmented Generation (RAG) tasks. Can LLMs Learn by Teaching? A Preliminary Study by Ning, Wang, Li, Lin, et al. (20 June), https://arxiv.org/abs/2406.14629 The authors develop and test three methods for implementing "Learning by Teaching" in LLMs, mimicking human teaching processes at different levels: observing student feedback, learning from feedback, and iterative learning, to improve model performance without relying on additional human-produced data or stronger models. * Instruction Pre-Training: Language Models are Supervised Multitask Learners by Cheng, Gu, Huang, et al. (20 June), https://arxiv.org/abs/2406.14491 This study introduces a framework for supervised multitask pretraining of LLMs that augments raw corpora with synthetically generated instruction-response pairs. * Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? by Wu, Zhang, Johnson, et al. (19 June), https://arxiv.org/abs/2406.13121 This study introduces a benchmark for evaluating long-context LLMs on tasks requiring up to millions of tokens, demonstrating that these long-context LLMs can compete with specialized retrieval and RAG systems in in-context retrieval and reasoning tasks. Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges by Ye, Turpin, Li, He, et al. (18 June), https://arxiv.org/abs/2406.12624 This paper evaluates the LLM-as-a-judge paradigm using TriviaQA as a benchmark, comparing 9 judge models and 9 exam taker models against human annotations, revealing that models with high human alignment may not necessarily be the best at ranking exam taker models. From RAGs to Rich Parameters: Probing How Language Models Utilize External Knowledge Over Parametric Information for Factual Queries by Wadhwa, Seetharaman, Aggarwal, et al. (18 June), https://arxiv.org/abs/2406.12824 The authors investigate the mechanics of Retrieval Augmented Generation (RAG) in LLMs to reveal that models predominantly rely on retrieved context information rather than their parametric memory when answering questions, exhibiting a shortcut behavior across different model families. Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts by Kang, Karlinsky, Luo, et al. (17 June), https://arxiv.org/abs/2406.12034 This paper introduces a method that transforms a monolithic LLM into a modular system called MiXSE (MiXture of Self-specialized Experts), using self-generated synthetic data to create specialized expert modules with a shared base LLM and self-optimized routing.
Measuring memorization in RLHF for code completion by Pappu, Porter, Shumailov, and Hayes (17 June), https://arxiv.org/abs/2406.11715 This study investigates the impact of Reinforcement Learning with Human Feedback (RLHF) on data memorization in LLMs, focusing on code completion tasks, and finds that while RLHF reduces memorization of data used in reward modeling and reinforcement learning compared to direct finetuning, it largely preserves memorization from the initial finetuning stage. HARE: HumAn pRiors, a key to small language model Efficiency by Zhang, Jin, Ge, et al. (17 June), https://arxiv.org/abs/2406.11410 This study proposes a principle for leveraging human priors in data construction for Small Language Models (SLMs) that focuses on semantic diversity and data quality consistency while avoiding benchmark data leakage. Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level by Kim, Lee, Park, et al. (17 June), https://arxiv.org/abs/2406.11817 This study introduces iterative length-regularized Direct Preference Optimization (iLR-DPO), a method that improves LLM alignment with human preferences while controlling response verbosity. Unveiling Encoder-Free Vision-Language Models by Choi, Yoon, Lee, et al. (17 June), https://arxiv.org/abs/2406.11832 This study presents an encoder-free vision-language model (VLM) that directly processes both visual and textual inputs in a unified decoder. * DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence by Zhu, Wang, Lee, et al. (17 June), https://arxiv.org/abs/2406.11931 DeepSeek-Coder-V2 is an open-source Mixture-of-Experts code LLM that achieves GPT4-Turbo-level performance on coding tasks through continued pretraining on 6 trillion additional tokens. Tokenization Falling Short: The Curse of Tokenization by Nguyen, Kim, Patel, et al. (17 June), https://arxiv.org/abs/2406.11687 This study investigates the "curse of tokenization" in LLMs by examining their performance on complex problem solving, token structure probing, and resilience to typographical variation, revealing that while scaling model size helps, LLMs remain susceptible to tokenization-induced biases. DataComp-LM: In Search of the Next Generation of Training Sets for Language Models by Li, Fang, Smyrnis, et al. (17 June), https://arxiv.org/abs/2406.11794 The authors provide a standardized testbed for experimenting with dataset curation strategies in language model training, including a 240T token corpus, pretraining recipes, and 53 downstream evaluations. * Nemotron-4 340B Technical Report by Unknown Authors at NVIDIA (17 June), https://arxiv.org/abs/2406.11704 This technical report accompanies NVIDIA's release of the Nemotron-4 340B model family, which performs competitively on various benchmarks and excels in synthetic data generation, along with open-sourcing its data generation pipeline for further research and development. mDPO: Conditional Preference Optimization for Multimodal Large Language Models by Wang, Zhou, Huang, et al. (17 June), https://arxiv.org/abs/2406.11839 mDPO addresses the unconditional preference problem in multimodal DPO by optimizing image preference alongside language preferences and introducing a reward anchor to prevent likelihood decrease for chosen responses. * How Do Large Language Models Acquire Factual Knowledge During Pretraining? by Chang, Park, Ye, et al. (17 June), https://arxiv.org/abs/2406.11813 Task Me Anything by Zhang, Huang, Ma, et al.
(17 June), https://arxiv.org/abs/2406.11775 Task-Me-Anything is a benchmark generation engine that creates tailored benchmarks for multimodal language models by programmatically generating task instances from a vast taxonomy of images and videos. THEANINE: Revisiting Memory Management in Long-term Conversations with Timeline-augmented Response Generation by Kim, Ong, Kwon, et al. (16 June), https://arxiv.org/abs/2406.10996 Theanine augments LLMs' response generation by using memory timelines (series of memories showing the development and causality of past events) to improve the model's ability to recall and utilize information from lengthy dialogue histories. Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs by Yang, Ding, Lin, et al. (14 June) https://arxiv.org/abs/2406.10216 This study proposes enhancing reward model generalization in RLHF by regularizing hidden states through retaining the base model's language model head and incorporating text-generation losses, while simultaneously learning a reward head, thus improving out-of-distribution task performance and mitigating reward over-optimization. Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs by Hans, Wen, Jain, et al. (14 Jun) , https://arxiv.org/abs/2406.10209 The "goldfish loss" technique reduces model memorization in LLMs by randomly excluding a subset of tokens from the loss computation during training, preventing the model from learning complete verbatim sequences from the training data. Bootstrapping Language Models with DPO Implicit Rewards by Chen, Liu, Du, et al. (14 June), https://arxiv.org/abs/2406.09760 Researchers find that using the aligned model, an implicit reward model, generated during direct preference optimization (DPO) can itself be used to generate a preference dataset to further substantially improve itself. FouRA: Fourier Low Rank Adaptation by Borse, Kadambi, Pandey, et al. (13 June), https://arxiv.org/abs/2406.08798 This research introduces FouRA, a new low-rank adaptation (LoRA) method that operates in the Fourier domain and uses adaptive rank selection, addressing issues of data copying and distribution collapse in LoRA-fine-tuned text-to-image diffusion models while improving image quality and generalization. * An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels by Nguyen, Mahmoud Assran, Jain, et al. (13 June), https://arxiv.org/abs/2406.09415 This research reveals that vanilla Transformers can achieve high performance in various computer vision tasks by treating individual pixels as tokens, which challenges the assumed necessity of locality-based inductive bias in modern vision architectures and suggests new possibilities for future neural network designs in computer vision. MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding by Zuhri, Adilazuarda,Purwarianti, and Aji (13 June), https://arxiv.org/abs/2406.09297 This research introduces Multi-Layer Key-Value (MLKV) sharing, a new technique that extends Key-Value (KV) caching across transformer layers, which substantially reduces memory usage during auto-regressive inference beyond existing methods like Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), while maintaining performance on NLP tasks. Transformers Meet Neural Algorithmic Reasoners by Bounsi, Ibarz, Dudzik, et al. 
(13 June), https://arxiv.org/abs/2406.09308 TransNAR is a hybrid architecture combining Transformers with graph neural network-based neural algorithmic reasoners (NARs), which enables improved performance on algorithmic reasoning tasks by allowing the Transformer to leverage the robust computational capabilities of NARs while maintaining strong natural language understanding. Discovering Preference Optimization Algorithms with and for Large Language Models by Lu, Holt, Fanconi, et al. (12 June), https://arxiv.org/abs/2406.08414 The proposed Discovered Preference Optimization method uses LLMs to automatically discover and implement new preference optimization algorithms for improving LLM outputs. * An Empirical Study of Mamba-based Language Models by Waleffe, Byeon, Riach, et al. (12 June), https://arxiv.org/abs/2406.07887 This research compares 8B-parameter state-space models (Mamba, Mamba-2) and Transformer models trained on large datasets, finding that while pure state-space models match or exceed Transformers on many tasks, they lag behind on tasks requiring strong copying, in-context learning, or long-context reasoning; however, hybrids seem to offer the best of both worlds. * Large Language Models Must Be Taught to Know What They Don't Know by Kapoor, Gruver, Roberts, et al. (12 June), https://arxiv.org/abs/2406.08391 This research demonstrates that finetuning LLMs on a small dataset of graded examples can produce more reliable uncertainty estimates than prompting alone, with the resulting models capable of estimating uncertainty for themselves and other models. Large Language Model Unlearning via Embedding-Corrupted Prompts by Liu, Flannigan, and Liu (12 June), https://arxiv.org/abs/2406.07933 This research introduces embedding-corrupted prompts, a method for selective knowledge unlearning in LLMs that uses prompt classification and embedding corruption to achieve targeted forgetting with minimal side effects across a wide range of model sizes. What If We Recaption Billions of Web Images with LLaMA-3? by Li, Tu, Hui, et al. (12 June) https://arxiv.org/abs/2406.08478 This research demonstrates that using a finetuned Llama 3-powered LLaVA-1.5 multimodal LLM to recaption 1.3 billion images from the DataComp-1B dataset significantly improves the performance of vision-language models in various tasks. * Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing by Xu, Jiang, Niu et al. (12 June), https://arxiv.org/abs/2406.08464 Researchers propose a synthetic instruction data generation method that generates 300,000 high-quality instruction-response pairs from Llama-3-Instruct; this data can be used for supervised instruction fine-tuning to rival the performance of aligned LLMs without requiring an actual alignment step. * Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling by (11 June), https://arxiv.org/abs/2406.07522 Samba is a hybrid model combining selective state space models (think Mamba) with sliding window attention that scales efficiently to 3.8B parameters. * Never Miss A Beat: An Efficient Recipe for Context Window Extension of Large Language Models with Consistent "Middle" Enhancement (11 June) by Wu, Zhao, and Zheng, https://arxiv.org/abs/2406.07138 CREAM is a training-efficient method for extending the context length of LLMs by interpolating positional encodings and using a truncated Gaussian to prioritize middle-context information. 
Simple and Effective Masked Diffusion Language Models by Sahoo, Arriola, Schiff, et al. (11 June), https://arxiv.org/abs/2406.07524 This work demonstrates that masked discrete diffusion models, when trained with an effective recipe and a simplified objective, can substantially narrow the performance gap with autoregressive methods in language modeling. TextGrad: Automatic "Differentiation" via Text by Yuksekgonul, Bianchi, Boen, et al. (11 June), https://arxiv.org/abs/2406.07496 TextGrad is a framework that leverages LLMs to "backpropagate" textual feedback for optimizing building blocks (such as "tool caller", "search engine", etc.) in compound AI systems. An Image is Worth 32 Tokens for Reconstruction and Generation by Yu, Weber, Deng, et al. (11 June), https://arxiv.org/abs/2406.07550 The authors propose a transformer-based 1-dimensional tokenizer for image generation that reduces a 256x256x3 image to just 32 discrete tokens. * Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching by Zhang, Peng, Zhou, et al. , (10 June), https://arxiv.org/abs/2406.06326 The Self-Tuning framework improves the knowledge acquisition of LLMs from raw documents through self-teaching tasks focused on memorization, comprehension, and self-reflection. Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters by Song, Xie, Zhang, et al. (10 June), https://arxiv.org/abs/2406.05955 This paper proposes the dReLU activation function and an optimized training data mixture to improve activation sparsity in LLMs. Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning by Kim, Paranjape, Khot, and Hajishirzi (10 June), https://arxiv.org/abs/2406.06469 Husky is an open-source language agent that learns to reason over a unified action space to tackle diverse tasks involving numerical, tabular, and knowledge-based reasoning by iterating between generating and executing actions with expert models. Margin-aware Preference Optimization for Aligning Diffusion Models Without Reference by Hong, Paul, Lee, et al. (10 June), https://arxiv.org/abs/2406.06424 To address limitations of traditional alignment techniques like RLHF and DPO, the authors propose Margin-Aware Preference Optimization (MaPO) for text-to-image diffusion models, which maximizes the likelihood margin between preferred and dispreferred image sets without using a reference model. * Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation by Sun, Jian, Chen, et al. (10 June), https://arxiv.org/abs/2406.06525 The authors propose LlamaGen, which involves applying the "next-token prediction" paradigm of large language models to image generation. Creativity Has Left the Chat: The Price of Debiasing Language Models by Mohammidi (8 June), https://arxiv.org/abs/2406.05587 This research reveals that while alignment techniques like RLHF mitigate biases in LLMs, they can diminish the models' creative capabilities, impacting syntactic and semantic diversity, which is crucial for tasks requiring creative output. 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination by Yang, Chen, Madaan, et al. 
(7 June), https://arxiv.org/abs/2406.05132 This research introduces 3D-GRAND, a dataset of 40,087 household scenes paired with 6.2 million scene-language instructions, and utilizes instruction tuning and the 3D-POPE benchmark to enhance grounding capabilities and reduce hallucinations in 3D-LLMs. BERTs are Generative In-Context Learners by Samuel (7 June), https://arxiv.org/abs/2406.04823 This paper demonstrates that masked language models, like DeBERTa, can perform in-context learning using a simple inference technique that reformats the sequence of input tokens with mask tokens that resemble the structure of a causal attention mask. Mixture-of-Agents Enhances Large Language Model Capabilities (7 June), https://arxiv.org/abs/2406.04692 WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild by Lin, Deng, Chandu, et al. (7 June), https://arxiv.org/abs/2406.04770 The authors introduce an automated evaluation framework for benchmarking LLMs using real-world user queries, featuring 1,024 tasks and two advanced metrics, WB-Reward and WB-Score, which provide reliable and interpretable automatic judgments by employing task-specific checklists and structured explanations. CRAG -- Comprehensive RAG Benchmark by Yang, Sun, Xin, et al. (7 June), https://arxiv.org/abs/2406.04744 This research introduces a factual question answering dataset of 4,409 question-answer pairs with mock APIs simulating web and Knowledge Graph searches, designed to reflect diverse, dynamic real-world QA tasks. Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach by Dong, Luo, Zhang, et al. (7 June), https://arxiv.org/abs/2406.04594 This research introduces C4, a communication-driven solution for parallel training of LLMs, which rapidly identifies and isolates hardware faults and optimizes traffic planning to reduce network congestion, cutting error-induced overhead by up to 30% and improving runtime performance by up to 15%. Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step by Liang, Yuan, Gu, et al. (6 June), https://arxiv.org/abs/2406.04314 This research introduces Step-aware Preference Optimization, a post-training approach that independently evaluates and adjusts denoising performance at each step in text-to-image diffusion models, which outperforms Diffusion-DPO in image alignment and aesthetics while offering 20x faster training efficiency. * Are We Done with MMLU? by Gema, Leang, Hong, et al. (6 June), https://arxiv.org/abs/2406.04127 This study identifies numerous errors in the widely-used MMLU benchmark, creates a re-annotated subset called MMLU-Redux that reveals significant discrepancies in reported model performance, and advocates for revising MMLU to improve its reliability. * Transformers Need Glasses! Information Over-Squashing in Language Tasks by Barbero, Banino, Kapturowski, et al. (6 June), https://arxiv.org/abs/2406.04267 The study analyzes information propagation in LLMs (specifically: decoder-only transformers), revealing a representational collapse phenomenon where distinct input sequences can yield arbitrarily close final token representations, leading to errors in tasks like counting or copying and loss of sensitivity to specific input tokens. The Prompt Report: A Systematic Survey of Prompting Techniques by Schulhoff, Ilie, Balepur, et al.
(6 June), https://arxiv.org/abs/2406.06608 This 76-page paper aims to provide a clear and organized framework for understanding prompts and prompting techniques. Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models by Yang, Yu, Zhang, et al. (6 June), https://arxiv.org/abs/2406.04271 This Buffer of Thoughts approach improves LLMs by retrieving and instantiating thought-templates, which are generic problem-solving blueprints, for reasoning across various domains. Block Transformer: Global-to-Local Language Modeling for Fast Inference (4 June) by Ho, Bae, Kim, et al., https://arxiv.org/abs/2406.02657 The proposed Block Transformer improves inference throughput 10-20x by isolating expensive global attention to lower layers on fixed-size token blocks and applying fast local attention in upper layers. * Scalable MatMul-free Language Modeling by Zhu, Zhang, Sifferman, et al. (4 June), https://arxiv.org/abs/2406.02528 This paper presents a scalable MatMul-free language model architecture that replaces matrix multiplications with element-wise products and accumulations using ternary weights and works well even at billion-parameter scales. Towards Scalable Automated Alignment of LLMs: A Survey (3 June) by Cao, Lu, Lu, et al., https://arxiv.org/abs/2406.01252 This paper reviews the recent and emerging automated alignment methods for LLMs that typically follow the instruction finetuning step in an LLM development pipeline. The Geometry of Categorical and Hierarchical Concepts in Large Language Models by Park, Choe, Jiang, and Veitch (3 June), https://arxiv.org/abs/2406.01506 Using a Gemma LLM, this paper extends the linear representation hypothesis, showing that categorical concepts are simplices, hierarchical relations are orthogonal, and complex concepts are polytopes, validated with 957 WordNet concepts. OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models by Büyükakyüz (3 June), https://arxiv.org/abs/2406.01775 OLoRA is an enhancement of Low-Rank Adaptation (LoRA) using orthonormal matrix initialization via QR decomposition, which accelerates the convergence of LLM training compared to regular LoRA. Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models by Wei, Zhu, Zhao, et al. (3 June), https://arxiv.org/abs/2406.06563 A report that describes some of the approaches and methods behind developing a 146B parameter mixture-of-experts LLM from an existing 13B parameter dense (non-mixture-of-experts) model. Show, Don't Tell: Aligning Language Models with Demonstrated Feedback by Shaikh, Lam, Hejna, et al. (2 June), https://arxiv.org/abs/2406.00888 The proposed method aligns LLM outputs to specific user behaviors using fewer than 10 demonstrations as feedback, leveraging imitation learning. This magazine is a personal passion project that does not offer direct compensation. However, for those who wish to support me, please consider purchasing a copy of one of my books. If you find them insightful and beneficial, please feel free to recommend them to your friends and colleagues. (Sharing your feedback with others via a book review on Amazon helps a lot, too!) Build A Large Language Model (From Scratch), Machine Learning Q And AI, and Machine Learning with PyTorch and Scikit-Learn. Your support means a great deal! Thank you!
A new, cost-effective method for generating data for instruction finetuning Instruction finetuning from scratch Pretraining LLMs with instruction data An overview of what's new in Gemma 2 An overview of all the other interesting research papers that came out in June Annotated illustration of the Magpie method for generating a synthetic dataset for instruction finetuning. The figure is based on illustrations from the Magpie paper: https://arxiv.org/abs/2406.08464 Essentially, as shown in the figure above, we just have to prompt the Llama 3 8B Instruct model with a pre-query template, and it will generate an instruction for us. Then, we feed that instruction back to the LLM, and it will generate a response. If we repeat this procedure a couple of thousand times, we obtain a dataset for instruction finetuning. (Optionally, we can apply an LLM to filter the instruction-response pairs by quality.) 1.2 Dataset quality What's fascinating is that with the resulting instruction dataset, the authors found that finetuning a Llama 3 8B base model with just instruction finetuning (no preference finetuning via RLHF and DPO) beats the original Llama 2 8B Instruct model by Meta AI, as shown in the figure below. A Llama 3 8B base model finetuned on the Magpie-generated instruction dataset beats the original Llama 3 8B Instruct model. Based on an annotated illustration from the Magpie paper: https://arxiv.org/abs/2406.08464 The Magpie results shown in the figure above wer achieved with 300 thousand samples only. In comparison, The original Llama 3 Instruct model was finetuned and aligned on 100 million samples! 1.3 Running the Dataset Generation Locally I was skeptical at first, so I tried to implement this myself. It really works! Here , you can find my reimplementation using Ollama, which even runs fine locally on a MacBook Air. Code screenshot from a reimplementation of the Magpie method that runs locally. The code is available here . 1.4 Additional Details The authors created two sets of datasets: A "Pro" version using the Llama 3 70B Instruct model and an "Air" version using the Llama 3 8B Instruct model. As an earlier figure showed, the Magpie-Pro-generated dataset results in slightly stronger models compared to the Magpie-Air dataset when using it to instruction-finetune a Llama 3 8B base model. The figure below shows an additional comparison of the dataset qualities and difficulties as rated via an LLM. Annotated plots from the Magpie paper showing the dataset quality and difficulty of the Air and Pro datasets relative to each other. As the figure above shows, the quality of the Air and Pro datasets is roughly on par. In addition, it would have been interesting to see how the Alpaca dataset compares to these. (The assumption is that the Magpie data is of much higher quality than Alpaca, but a reference point would be interesting.) Furthermore, the paper contains an analysis showing that the breadth or diversity in this dataset is much larger than that of other popular datasets for instruction finetuning, such as Alpaca, Evol Instruct, and UltraChat. In addition, when compared to models trained with other instruction finetuning datasets, the Magpie-Pro finetuned model also compares very favorably. 1.5 Conclusion Overall, I think that Magpie is an interesting exploit that is, on the one hand, fascinating in its effectiveness and, on the other hand, has a lot of practical utility. 
I will certainly consider it as an interesting, simple, and cost-effective candidate for constructing general-purpose instruction datasets in the future. 2. Instruction Finetuning from Scratch If you are looking for a resource to understand the instruction finetuning process in LLMs, I am happy to share that Chapter 7 on instruction finetuning LLMs is now finally live on the Manning website . This is the longest chapter in the book and takes a from-scratch approach to implementing the instruction finetuning pipeline. This includes everything from input formatting to batching with a custom collate function, masking padding tokens, the training loop itself, and scoring the response quality of the finetuned LLM on a custom test set. (The exercises include changing prompt styles, instruction masking, and adding LoRA.) Happy coding! An overview of chapter 7 in my "Build a Large Language Model From Scratch" book. Supplementary code materials are available here on GitHub . PS: it's also the last chapter, and the publisher is currently preparing the layouts for the print version. 3. Instruction Pretraining LLMs In the paper "Instruction Pre-Training: Language Models are Supervised Multitask Learners" ( https://arxiv.org/abs/2406.14491 ), researchers investigate whether LLM pretraining can be made more efficient by including synthetic instruction-response pairs instead of just raw text. (Here, "raw text" means text from books, websites, papers, and so forth that has not been reprocessed into a specific format.) A comparison between regular pretraining (top) and the proposed instruction pretraining approach (bottom) via an annotated figure from https://arxiv.org/abs/2406.14491 Specifically, the researchers experiment with generating instruction-response data from the raw training corpus itself via an "instruction synthesizer," an LLM specifically finetuned for this task. (Note that this is not the first paper proposing the formatting of raw text as instruction data. Another work that comes to mind is "Genie: Achieving Human Parity in Content-Grounded Datasets Generation" ( https://arxiv.org/abs/2401.14367 ). I also recall seeing another paper or blog post using instruction data during pretraining a few months ago—I discussed this method with some of my colleagues—but unfortunately, I couldn't find the reference. Nonetheless, the paper discussed here is particularly intriguing since it builds on openly available LLMs that run locally and covers both pretraining and continual pretraining.) 3.1 Instruction Synthesizer Before we dive into the pretraining and continual pretraining results, let's talk about the core component of this method: the instruction synthesizer. This is an openly available Mistral 7B v0.1 LLM (which I wrote about last year here: https://magazine.sebastianraschka.com/i/138555764/mistral-b ) that has been finetuned to generate instruction-response pairs from raw text. To finetune this synthesizer, the researchers use datasets such as HotpotQA ( https://arxiv.org/abs/1809.09600 ), which consists of passages from Wikipedia associated with questions and answers. For this, the authors also ensure that a variety of tasks, like commonsense reasoning, sentiment analysis, math problems, etc., are covered. The input and output data of the instruction synthesizer via an annotated figure from https://arxiv.org/abs/2406.14491 Once this instruction synthesizer is developed (i.e., finetuned), it can be used to generate the input data for pretraining the target LLMs. 
One last noteworthy detail regarding the instruction synthesizer is that multiple raw texts (T_n) and instruction-response pairs (I_n ⊕ R_n) are concatenated as few-shot examples, as shown in the figure below.

The formatting of the instruction data for finetuning (and using) the instruction synthesizer via an annotated figure from https://arxiv.org/abs/2406.14491

3.2 Pretraining with Instruction Data

Now that we have discussed the method to generate the instruction-response pairs, let's get to the interesting part: how well do models train on this augmented dataset?

The first set of results looks at two small models trained from scratch: 500M parameters and 1.3B parameters (both are based on the Mistral architecture).

A comparison of 3 different pretraining approaches used to train models from scratch (annotated table from https://arxiv.org/abs/2406.14491)

As we can see in the table above, the model trained via the proposed instruction pretraining approach (Instruct PT) performs best on most benchmark tasks (higher values are better). Note, though, that it has seen more tokens than the Vanilla PT approach, since it includes the synthesized instruction-response pairs. Hence, the authors also included the Mix PT comparison: a model trained on a data mix containing both the raw text and the instruction data used to train the synthesizer.

From this comparison, we can see that it is not simply adding any instruction data that makes the difference. The fact that Instruct PT performs better than Mix PT on most tasks indicates that it is the nature of the instruction-response data (i.e., instruction-response pairs generated from, and related to, the raw text) that matters. (The authors conducted all experiments using the same number of tokens.)

In addition, it's worth noting that the Instruct PT pretrained models have another advantage: they improve more when they are instruction-finetuned afterwards, as the figure below shows.

Finetuning LLMs that have been pretrained with either the traditional pretraining paradigm (Vanilla PT) or instruction pretraining (annotated figure from https://arxiv.org/abs/2406.14491)

3.3 Continual Pretraining with Instruction Data

Pretraining from scratch is interesting because that's how LLMs are created in the first place. However, I'd say that practitioners care more about continual pretraining and finetuning. Continual pretraining here means that we take an existing pretrained model and pretrain it further on new domain data. For instance, think of a Llama 3 8B base model that has been trained on a general text corpus and that you want to adapt for finance, medical, legal, or other domains.

The table below summarizes the results the researchers obtained when applying the instruction pretraining method to a pretrained Llama 3 8B base model. Specifically, they conducted continual pretraining with both biomedical texts and finance texts.

A comparison of 3 different pretraining approaches used for continual pretraining (annotated table from https://arxiv.org/abs/2406.14491)

Looking at the table above, we can see that the instruction pretraining approach (Instruct PT) clearly outperforms the vanilla pretraining (Vanilla PT) approach (here, this means regular continual pretraining of the base model). The Llama 3 70B base model is included as a reference, presumably to showcase that small specialized models can beat larger general models.
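Before moving on, here is a minimal sketch of how an instruction-augmented pretraining example for the Instruct PT setting might be assembled from a raw passage and the pairs synthesized from it. The "Q:"/"A:" separators and the overall template are assumptions for illustration; the paper and its released code define the exact format.

```python
# Minimal sketch of assembling one instruction-augmented pretraining example (Instruct PT).
# The "Q:"/"A:" separators are assumptions; the paper defines the exact format.
def build_pretraining_example(raw_text: str, qa_pairs: list[tuple[str, str]]) -> str:
    # Keep the original raw passage and append the instruction-response pairs
    # that the synthesizer generated from it.
    parts = [raw_text]
    for instruction, response in qa_pairs:
        parts.append(f"Q: {instruction}\nA: {response}")
    return "\n\n".join(parts)

example = build_pretraining_example(
    "The mitochondrion is the organelle that produces most of a cell's ATP ...",
    [("What is the main function of mitochondria?",
      "They generate ATP through cellular respiration.")],
)
# `example` would then be tokenized and packed into the pretraining stream
# alongside plain raw-text documents.
```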
3.4 Conclusion

Almost every time I explain the LLM pretraining pipeline to someone, they are surprised by its simplicity and the fact that this is still what's commonly used to train LLMs today. The instruction pretraining approach is quite refreshing in that sense. One caveat is that for large pretraining corpora, it might still be expensive to create the instruction-augmented data. However, the nice thing about generated data is that it can be reused in many different projects once created.

4. Gemma 2

I cannot write this article without mentioning Google's new Gemma 2 models, which are arguably the biggest model release last month. (Although, when it comes to pure size, Nvidia's Nemotron-4 340B takes the crown: https://arxiv.org/abs/2406.11704.) The Gemma 2 models come in 2.6B, 9B, and 27B parameter versions.

Since this article is already quite lengthy, and you're likely familiar with Gemma 2 from other sources, let's cut to the chase. What are the main highlights and noteworthy updates in Google's newly released Gemma 2 LLMs? The main theme is developing relatively small and efficient LLMs through better techniques rather than simply increasing the size of the training datasets. Specifically, the authors blend three main architectural and training choices to create the 2.6B and 9B parameter models: sliding window attention, grouped-query attention, and knowledge distillation.

4.1 Sliding window attention

Sliding window attention (popularized, for example, by Mistral) is a technique that uses a fixed-size attention window so that the current token attends to only a specific number of previous tokens instead of all previous tokens, as illustrated in the figure below.

Annotated figure from https://arxiv.org/abs/2310.06825 explaining sliding window attention.

In the case of Gemma 2, the authors alternated between regular attention and sliding window attention layers. The sliding window size was 4096 tokens, within a total context length of 8192 tokens.

Sliding window attention is mainly used to improve computational performance, and the researchers also included a small ablation study showing that there's a barely noticeable difference in perplexity when shrinking the window size during inference.

An ablation study from the Gemma 2 technical report showing that a decreased block size for the sliding window barely impacts the modeling performance of the 9B parameter model during inference. (It would have been interesting to see the GPU memory improvement side-by-side.)

4.2 Grouped-query attention

Grouped-query attention (as used in Llama 2 and 3) can be regarded as a more generalized form of multi-query attention. The motivation is to reduce the number of trainable parameters by sharing the same key and value heads across multiple query heads, thereby lowering computational requirements.

Annotated figure from Ainslie et al. 2023

4.3 Knowledge distillation

The general idea of knowledge distillation (as in MiniLLM, https://arxiv.org/abs/2306.08543) is to transfer knowledge from a larger model (the teacher) to a smaller model (the student). Here, the 27B (teacher) model was trained from scratch, without knowledge distillation, and the smaller 2.6B and 9B (student) models were then trained on the outputs of this teacher.

An overview of knowledge distillation from my Machine Learning Q and AI book in the context of computer vision. In an LLM context, think of text instead of images and predicted tokens instead of class labels.
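The report doesn't spell out every implementation detail of the distillation setup, but a common formulation is a token-level loss that pulls the student's next-token distribution toward the teacher's. Below is a generic sketch of such a loss; the temperature and the exact weighting are illustrative assumptions, not Gemma 2's precise recipe.

```python
# Generic token-level knowledge distillation loss for LLMs (not Gemma 2's exact recipe):
# the student is trained to match the teacher's next-token distribution at every position.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    # Both logits tensors have shape (batch, seq_len, vocab_size).
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # kl_div expects log-probs as input and probs as target: this is KL(teacher || student).
    # "batchmean" sums over tokens and vocabulary and divides by the batch dimension.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2
```

In practice, this term is typically normalized per token and combined with (or used in place of) the standard next-token cross-entropy loss on the hard labels.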
4.4 Other interesting architecture details

The paper contains many other interesting tidbits. For instance, one hallmark of Gemma 2 is its relatively large vocabulary size: 256,000 tokens. This is similar to the first Gemma model, but it's still worth noting since it's twice the size of the Llama 3 vocabulary (128,000) and eight times the size of the Phi-3 vocabulary (32,000).

The vocabulary size of an LLM refers to the number of unique tokens (words, subwords, or characters) that the model can recognize and generate. A large vocabulary allows for better coverage of words and concepts, improved handling of multilingual content, and reduced tokenization artifacts. However, it also comes with trade-offs, such as increased model size and potentially slower inference due to the larger embedding and output layers. (This is where the sliding window attention and grouped-query attention mechanisms help offset the cost.)

There's also an interesting section on "logit capping," a technique I haven't seen used before. Essentially, it is a form of soft clipping that squashes the logits into the range (-soft_cap, +soft_cap) via a scaled tanh, presumably to improve stability and gradient flow during training (a short code sketch follows at the end of this section):

logits ← soft_cap ∗ tanh(logits / soft_cap)

Additionally, they leverage model merging techniques to combine models from multiple runs with different hyperparameters, although the paper doesn't provide much detail about that. (However, interested readers can read more about it in WARP: On the Benefits of Weight Averaged Rewarded Policies, which Gemma 2 uses for this purpose.)

In terms of modeling performance, Gemma 2 is almost as good as the roughly 3x larger Llama 3 70B, and it beats the older Qwen 1.5 32B model. It would be interesting to see a comparison with the more recent Qwen 2 model.

A comparison with two other popular models with openly available weights: Llama 3 and Qwen 1.5. (Annotated table from the Gemma 2 technical report.)

Personally, a highlight is that the Gemma 2 report includes ablation studies for some of its architectural choices. This was once a given in academic research but is increasingly rare in LLM research.

An example of one of the ablation studies included in the Gemma 2 technical report. Here, "wide" refers to a model with 28 layers and an intermediate size of 24,576, and "deep" refers to an architecture with 42 layers and an intermediate size of 14,336.
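As referenced above, logit soft-capping is essentially a one-liner. Here is a minimal sketch; the cap values are illustrative placeholders rather than the exact values used in Gemma 2.

```python
# Logit soft-capping as quoted above: a scaled tanh keeps logits in (-soft_cap, +soft_cap).
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    return cap * torch.tanh(logits / cap)

# Illustrative usage; the cap values here are placeholders, not necessarily Gemma 2's:
attn_scores = soft_cap(torch.randn(4, 8, 8), cap=50.0)
final_logits = soft_cap(torch.randn(4, 256_000), cap=30.0)
```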
4.5 Conclusion

It's refreshing to see such a relatively detailed technical report from Google. When it comes to the model itself, based on public consensus, Gemma 2 is likely the most capable model for single-GPU use cases today. For larger models, Llama 3 70B and Qwen 2 72B remain strong contenders.

Supporting Ahead of AI

Ahead of AI is a personal passion project that does not offer direct compensation. However, for those who wish to support me, please consider purchasing a copy of my books. If you find them insightful and beneficial, please feel free to recommend them to your friends and colleagues. If you have a few moments, a review of Machine Learning Q and AI or Machine Learning with PyTorch and Scikit-Learn on Amazon would really help, too! Your support means a great deal and is tremendously helpful in continuing this journey. Thank you!

5. Other Interesting Research Papers In June

Below is a selection of other interesting papers I stumbled upon this month. Given the length of this list, I highlighted the 20 papers I found particularly interesting with an asterisk (*). However, please note that this list and its annotations are purely based on my interests and relevance to my own projects.

- Scaling Synthetic Data Creation with 1,000,000,000 Personas by Chan, Wang, Yu, et al. (28 June), https://arxiv.org/abs/2406.20094. The research proposes a persona-driven data synthesis methodology that utilizes an LLM to create diverse synthetic data by leveraging a vast collection of automatically curated personas, called Persona Hub, which represents about 13% of the world's population.
- This study develops "critic" models using RLHF to assist humans in evaluating model-generated code, trains LLMs to write natural language feedback on code errors, and demonstrates their effectiveness in catching bugs across various tasks.
- DPKD reformulates knowledge distillation for LLMs into a two-stage process: first optimizing an objective combining implicit reward and reverse KL divergence, then improving the preference probability of teacher outputs over student outputs.
- This study investigates the robustness of accuracy measurements on the MMLU benchmark for LLMs, revealing that shuffling answer label contents leads to decreased accuracy across models, with varying sensitivity.
- This study proposes a finetuning approach using a synthetic dataset of numerical key-value retrieval tasks to improve LLMs' long-context information retrieval and reasoning capabilities.
- This study proposes efficient router models that dynamically select between stronger and weaker LLMs during inference to optimize cost-performance trade-offs.
- This study introduces a method to train LLMs that can follow user-specified length constraints at inference time, addressing the length bias in model evaluation and outperforming standard instruction-following models in length-controlled tasks.
- LongIns is a new benchmark for evaluating LLMs' long-context capabilities, using three settings to assess retrieval and reasoning abilities.
- This report introduces FineWeb, a 15-trillion token dataset derived from Common Crawl, and FineWeb-Edu, a 1.3-trillion token educational subset.
- Adam-mini is a proposed optimizer that achieves comparable or better performance than AdamW while using 45-50% less memory by strategically reducing learning rate resources, partitioning parameters based on Hessian structure, and assigning optimized single learning rates to parameter blocks.
- The paper introduces a new alignment strategy for LLMs that merges policies at three stages: using exponential moving average for dynamic KL regularization, spherical interpolation of independently fine-tuned policies, and linear interpolation with initialization.
- The authors propose a new sparse attention mechanism for autoregressive Transformers, using a scoring network and differentiable top-k mask operator to select a constant number of KV pairs per query to achieve linear time complexity and a constant memory footprint.
- This study proposes three strategies to improve continual pretraining of LLMs: multiple epochs on a subset, focusing on high-quality data, and using a mixture similar to pretraining data.
- Mixture of Attention (MoA) automatically optimizes sparse attention patterns for different model components and input lengths in LLMs, improving context length, accuracy, and efficiency over uniform sparse attention approaches.
- LongRAG introduces a new RAG framework using 4K-token retrieval units and long-context LLMs for answer extraction, which improves retrieval performance and achieves state-of-the-art results on question-answering tasks without additional training.
- This study challenges conventional wisdom by demonstrating that base LLMs outperform instruction-tuned models in Retrieval Augmented Generation (RAG) tasks.
- The authors develop and test three methods for implementing "Learning by Teaching" in LLMs, mimicking human teaching processes at different levels (observing student feedback, learning from feedback, and iterative learning) to improve model performance without relying on additional human-produced data or stronger models.
- This study introduces a framework for supervised multitask pretraining of LLMs that augments raw corpora with synthetically generated instruction-response pairs.
- This study introduces a benchmark for evaluating long-context LLMs on tasks requiring up to millions of tokens, demonstrating that these long-context LLMs can compete with specialized retrieval and RAG systems in in-context retrieval and reasoning tasks.
- This paper evaluates the LLM-as-a-judge paradigm using TriviaQA as a benchmark, comparing 9 judge models and 9 exam taker models against human annotations, revealing that models with high human alignment may not necessarily be the best at ranking exam taker models.
- The authors investigate the mechanics of Retrieval Augmented Generation (RAG) in LLMs and reveal that models predominantly rely on retrieved context information rather than their parametric memory when answering questions, exhibiting a shortcut behavior across different model families.
- This paper introduces a method that transforms a monolithic LLM into a modular system called MiXSE (MiXture of Self-specialized Experts), using self-generated synthetic data to create specialized expert modules with a shared base LLM and self-optimized routing.
- This study investigates the impact of Reinforcement Learning with Human Feedback (RLHF) on data memorization in LLMs, focusing on code completion tasks, and finds that while RLHF reduces memorization of data used in reward modeling and reinforcement learning compared to direct finetuning, it largely preserves memorization from the initial finetuning stage.
- This study proposes a principle for leveraging human priors in data construction for Small Language Models (SLMs) that focuses on semantic diversity and data quality consistency while avoiding benchmark data leakage.
- This study introduces iterative length-regularized Direct Preference Optimization (iLR-DPO), a method that improves LLM alignment with human preferences while controlling response verbosity.
- This study presents an encoder-free vision-language model (VLM) that directly processes both visual and textual inputs in a unified decoder.
- DeepSeek-Coder-V2 is an open-source Mixture-of-Experts code LLM that achieves GPT-4-Turbo-level performance on coding tasks through continued pretraining on 6 trillion additional tokens.
- This study investigates the "curse of tokenization" in LLMs by examining their performance on complex problem solving, token structure probing, and resilience to typographical variation, revealing that while scaling model size helps, LLMs remain susceptible to tokenization-induced biases.
- The authors provide a standardized testbed for experimenting with dataset curation strategies in language model training, including a 240T-token corpus, pretraining recipes, and 53 downstream evaluations.
- This technical report accompanies NVIDIA's release of the Nemotron-4 340B model family, which performs competitively on various benchmarks and excels in synthetic data generation, along with open-sourcing its data generation pipeline for further research and development.
- mDPO addresses the unconditional preference problem in multimodal DPO by optimizing image preference alongside language preferences and introducing a reward anchor to prevent a likelihood decrease for chosen responses.
- Task-Me-Anything is a benchmark generation engine that creates tailored benchmarks for multimodal language models by programmatically generating task instances from a vast taxonomy of images and videos.
- Theanine augments LLMs' response generation by using memory timelines (series of memories showing the development and causality of past events) to improve the model's ability to recall and utilize information from lengthy dialogue histories.
- This study proposes enhancing reward model generalization in RLHF by regularizing hidden states (retaining the base model's language model head and incorporating text-generation losses) while simultaneously learning a reward head, thus improving out-of-distribution task performance and mitigating reward over-optimization.
- The "goldfish loss" technique reduces model memorization in LLMs by randomly excluding a subset of tokens from the loss computation during training, preventing the model from learning complete verbatim sequences from the training data.
- Researchers find that the implicit reward model obtained during direct preference optimization (DPO) can itself be used to generate a preference dataset that further and substantially improves the aligned model.
- This research introduces FouRA, a new low-rank adaptation (LoRA) method that operates in the Fourier domain and uses adaptive rank selection, addressing issues of data copying and distribution collapse in LoRA-finetuned text-to-image diffusion models while improving image quality and generalization.
- This research reveals that vanilla Transformers can achieve high performance in various computer vision tasks by treating individual pixels as tokens, which challenges the assumed necessity of locality-based inductive bias in modern vision architectures and suggests new possibilities for future neural network designs in computer vision.
- This research introduces Multi-Layer Key-Value (MLKV) sharing, a new technique that extends Key-Value (KV) caching across transformer layers, which substantially reduces memory usage during auto-regressive inference beyond existing methods like Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), while maintaining performance on NLP tasks.
- TransNAR is a hybrid architecture combining Transformers with graph neural network-based neural algorithmic reasoners (NARs), which enables improved performance on algorithmic reasoning tasks by allowing the Transformer to leverage the robust computational capabilities of NARs while maintaining strong natural language understanding.
- The proposed Discovered Preference Optimization method uses LLMs to automatically discover and implement new preference optimization algorithms for improving LLM outputs.
- This research compares 8B-parameter state-space models (Mamba, Mamba-2) and Transformer models trained on large datasets, finding that while pure state-space models match or exceed Transformers on many tasks, they lag behind on tasks requiring strong copying, in-context learning, or long-context reasoning; however, hybrids seem to offer the best of both worlds.
- This research demonstrates that finetuning LLMs on a small dataset of graded examples can produce more reliable uncertainty estimates than prompting alone, with the resulting models capable of estimating uncertainty for themselves and other models.
- This research introduces embedding-corrupted prompts, a method for selective knowledge unlearning in LLMs that uses prompt classification and embedding corruption to achieve targeted forgetting with minimal side effects across a wide range of model sizes.
- This research demonstrates that using a finetuned Llama 3-powered LLaVA-1.5 multimodal LLM to recaption 1.3 billion images from the DataComp-1B dataset significantly improves the performance of vision-language models in various tasks.
- Researchers propose a synthetic instruction data generation method that generates 300,000 high-quality instruction-response pairs from Llama-3-Instruct; this data can be used for supervised instruction fine-tuning to rival the performance of aligned LLMs without requiring an actual alignment step.
- Samba is a hybrid model combining selective state space models (think Mamba) with sliding window attention that scales efficiently to 3.8B parameters.
- CREAM is a training-efficient method for extending the context length of LLMs by interpolating positional encodings and using a truncated Gaussian to prioritize middle-context information.
- This work demonstrates that masked discrete diffusion models, when trained with an effective recipe and a simplified objective, can substantially narrow the performance gap with autoregressive methods in language modeling.
- TextGrad is a framework that leverages LLMs to "backpropagate" textual feedback for optimizing building blocks (such as "tool caller", "search engine", etc.) in compound AI systems.
- The authors propose a transformer-based 1-dimensional tokenizer for image generation that reduces a 256x256x3 image to just 32 discrete tokens.
- The Self-Tuning framework improves the knowledge acquisition of LLMs from raw documents through self-teaching tasks focused on memorization, comprehension, and self-reflection.
- This paper proposes the dReLU activation function and an optimized training data mixture to improve activation sparsity in LLMs.
- Husky is an open-source language agent that learns to reason over a unified action space to tackle diverse tasks involving numerical, tabular, and knowledge-based reasoning by iterating between generating and executing actions with expert models.
- To address limitations of traditional alignment techniques like RLHF and DPO, the authors propose Margin-Aware Preference Optimization (MaPO) for text-to-image diffusion models, which maximizes the likelihood margin between preferred and dispreferred image sets without using a reference model.
- The authors propose LlamaGen, which involves applying the "next-token prediction" paradigm of large language models to image generation.
- This research reveals that while alignment techniques like RLHF mitigate biases in LLMs, they can diminish the models' creative capabilities, impacting syntactic and semantic diversity, which is crucial for tasks requiring creative output.
- This research introduces 3D-GRAND, a dataset of 40,087 household scenes paired with 6.2 million scene-language instructions, and utilizes instruction tuning and the 3D-POPE benchmark to enhance grounding capabilities and reduce hallucinations in 3D-LLMs.
- This paper demonstrates that masked language models, like DeBERTa, can perform in-context learning using a simple inference technique that reformats the sequence of input tokens with mask tokens that resemble the structure of a causal attention mask.
- The authors introduce an automated evaluation framework for benchmarking LLMs using real-world user queries, featuring 1,024 tasks and two advanced metrics, WB-Reward and WB-Score, which provide reliable and interpretable automatic judgments by employing task-specific checklists and structured explanations.
- This research introduces a factual question answering dataset of 4,409 question-answer pairs with mock APIs simulating web and Knowledge Graph searches, designed to reflect diverse, dynamic real-world QA tasks.
- This research introduces C4, a communication-driven solution for parallel training of LLMs that rapidly identifies and isolates hardware faults and optimizes traffic planning to reduce network congestion, cutting error-induced overhead by up to 30% and improving runtime performance by up to 15%.
- This research introduces Step-aware Preference Optimization, a post-training approach that independently evaluates and adjusts denoising performance at each step in text-to-image diffusion models, which outperforms Diffusion-DPO in image alignment and aesthetics while offering 20x faster training efficiency.
- This study identifies numerous errors in the widely used MMLU benchmark, creates a re-annotated subset called MMLU-Redux that reveals significant discrepancies in reported model performance, and advocates for revising MMLU to improve its reliability.
- The study analyzes information propagation in LLMs (specifically, decoder-only transformers), revealing a representational collapse phenomenon where distinct input sequences can yield arbitrarily close final token representations, leading to errors in tasks like counting or copying and a loss of sensitivity to specific input tokens.
- This 76-page paper aims to provide a clear and organized framework for understanding prompts and prompting techniques.
- The Buffer of Thoughts approach improves LLMs by retrieving and instantiating thought-templates, which are generic problem-solving blueprints, for reasoning across various domains.
- The proposed Block Transformer improves inference throughput 10-20x by isolating expensive global attention to lower layers on fixed-size token blocks and applying fast local attention in upper layers.
- This paper presents a scalable MatMul-free language model architecture that replaces matrix multiplications with element-wise products and accumulations using ternary weights and works well even at billion-parameter scales.
- This paper reviews recent and emerging automated alignment methods for LLMs, which typically follow the instruction finetuning step in an LLM development pipeline.
- Using a Gemma LLM, this paper extends the linear representation hypothesis, showing that categorical concepts are simplices, hierarchical relations are orthogonal, and complex concepts are polytopes, validated with 957 WordNet concepts.
- OLoRA is an enhancement of Low-Rank Adaptation (LoRA) using orthonormal matrix initialization via QR decomposition, which accelerates the convergence of LLM training compared to regular LoRA.
- A report that describes some of the approaches and methods behind developing a 146B-parameter mixture-of-experts LLM from an existing 13B-parameter dense (non-mixture-of-experts) model.
- The proposed method aligns LLM outputs to specific user behaviors using fewer than 10 demonstrations as feedback, leveraging imitation learning.

0 views
Lil'Log 1 years ago

Extrinsic Hallucinations in LLMs

Hallucination in large language models usually refers to the model generating unfaithful, fabricated, inconsistent, or nonsensical content. As a term, hallucination has been somewhat generalized to cases when the model makes mistakes. Here, I would like to narrow down the problem of hallucination to cases where the model output is fabricated and not grounded by either the provided context or world knowledge. There are two types of hallucination: in-context hallucination, where the output is inconsistent with the source content provided in context, and extrinsic hallucination, where the output is not grounded in world knowledge. This post focuses on extrinsic hallucination. To avoid hallucination, LLMs need to be (1) factual and (2) acknowledge not knowing the answer when applicable.

0 views
Matt Mazur 1 years ago

Exploring ChatGPT’s Knowledge Cutoff

This highlights the need to consider not only the possibility of hallucinations in ChatGPT's responses, but also the possibility of ignorance, especially for information about recent events. This will be less of an issue if you're using GPT-4 with browsing enabled, but it will still affect older models as well as usage of GPT-4 via the API or playground. Again, if anyone spots anything I've overlooked in this analysis, please drop me a note. Thanks for reading.

0 views
Lil'Log 2 years ago

Prompt Engineering

Prompt Engineering, also known as In-Context Prompting, refers to methods for communicating with an LLM to steer its behavior toward desired outcomes without updating the model weights. It is an empirical science, and the effect of prompt engineering methods can vary a lot among models, thus requiring heavy experimentation and heuristics. This post focuses only on prompt engineering for autoregressive language models, so it does not cover Cloze tests, image generation, or multimodal models. At its core, the goal of prompt engineering is about alignment and model steerability. Check my previous post on controllable text generation.

0 views