LLM Research Papers: The 2024 List
It’s been a very eventful and exciting year in AI research, especially if you are interested in LLMs. I had big plans for this December edition and was going to publish a new article discussing all of my research highlights from 2024. I still plan to do so, but due to an accident and serious injury, I am currently unable to work at a computer and finish the draft. I hope to recover in the upcoming weeks and be back on my feet soon.

In the meantime, I want to share my running bookmark list of the many fascinating (mostly LLM-related) papers I stumbled upon in 2024. It’s just a list, but maybe it will come in handy for those who are interested in finding some gems to read over the holidays.

And if you are interested in more code-heavy reading and tinkering, my Build a Large Language Model (From Scratch) book came out on Amazon last month. In addition, I added a lot of bonus materials to the GitHub repository.

Bonus materials in the GitHub repository (stars highlight my personal favorites)

Thanks for your understanding and support, and I hope to make a full recovery soon and be back with the Research Highlights 2024 article in a few weeks!

Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

January 2024

1 Jan, Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models, https://arxiv.org/abs/2401.00788
2 Jan, A Comprehensive Study of Knowledge Editing for Large Language Models, https://arxiv.org/abs/2401.01286
2 Jan, LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning, https://arxiv.org/abs/2401.01325
2 Jan, Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models, https://arxiv.org/abs/2401.01335
2 Jan, LLaMA Beyond English: An Empirical Study on Language Capability Transfer, https://arxiv.org/abs/2401.01055
3 Jan, A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity, https://arxiv.org/abs/2401.01967
4 Jan, LLaMA Pro: Progressive LLaMA with Block Expansion, https://arxiv.org/abs/2401.02415
4 Jan, LLM Augmented LLMs: Expanding Capabilities through Composition, https://arxiv.org/abs/2401.02412
4 Jan, Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM, https://arxiv.org/abs/2401.02994
5 Jan, DeepSeek LLM: Scaling Open-Source Language Models with Longtermism, https://arxiv.org/abs/2401.02954
5 Jan, Denoising Vision Transformers, https://arxiv.org/abs/2401.02957
7 Jan, Soaring from 4K to 400K: Extending LLM’s Context with Activation Beacon, https://arxiv.org/abs/2401.03462
8 Jan, Mixtral of Experts, https://arxiv.org/abs/2401.04088
8 Jan, MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts, https://arxiv.org/abs/2401.04081
8 Jan, A Minimaximalist Approach to Reinforcement Learning from Human Feedback, https://arxiv.org/abs/2401.04056
9 Jan, RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation, https://arxiv.org/abs/2401.04679
10 Jan, Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, https://arxiv.org/abs/2401.05566
11 Jan, Transformers are Multi-State RNNs, https://arxiv.org/abs/2401.06104
11 Jan, A Closer Look at AUROC and AUPRC under Class Imbalance, https://arxiv.org/abs/2401.06091
12 Jan, An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models, https://arxiv.org/abs/2401.06692
16 Jan, Tuning Language Models by Proxy, https://arxiv.org/abs/2401.08565
16 Jan, Scalable Pre-training of Large Autoregressive Image Models, https://arxiv.org/abs/2401.08541
16 Jan, Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering, https://arxiv.org/abs/2401.08500
16 Jan, RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture, https://arxiv.org/abs/2401.08406
17 Jan, ReFT: Reasoning with Reinforced Fine-Tuning, https://arxiv.org/abs/2401.08967
18 Jan, DiffusionGPT: LLM-Driven Text-to-Image Generation System, https://arxiv.org/abs/2401.10061
18 Jan, Self-Rewarding Language Models, https://arxiv.org/abs/2401.10020
18 Jan, VMamba: Visual State Space Model, https://arxiv.org/abs/2401.10166
19 Jan, Knowledge Fusion of Large Language Models, https://arxiv.org/abs/2401.10491
22 Jan, SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities, https://arxiv.org/abs/2401.12168
22 Jan, WARM: On the Benefits of Weight Averaged Reward Models, https://arxiv.org/abs/2401.12187
22 Jan, Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text, https://arxiv.org/abs/2401.12070
24 Jan, MambaByte: Token-free Selective State Space Model, https://arxiv.org/abs/2401.13660
24 Jan, SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection, https://arxiv.org/abs/2401.13160
25 Jan, Rethinking Patch Dependence for Masked Autoencoders, https://arxiv.org/abs/2401.14391
25 Jan, Pix2gestalt: Amodal Segmentation by Synthesizing Wholes, https://arxiv.org/abs/2401.14398
25 Jan, Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities, https://arxiv.org/abs/2401.14405
26 Jan, EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty, https://arxiv.org/abs/2401.15077
29 Jan, MoE-LLaVA: Mixture of Experts for Large Vision-Language Models, https://arxiv.org/abs/2401.15947
29 Jan, Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling, https://arxiv.org/abs/2401.16380
31 Jan, KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization, https://arxiv.org/abs/2401.18079

February 2024

1 Feb, Efficient Exploration for LLMs, https://arxiv.org/abs/2402.00396
1 Feb, OLMo: Accelerating the Science of Language Models, https://arxiv.org/abs/2402.00838
1 Feb, Tiny Titans: Can Smaller Large Language Models Punch Above Their Weight in the Real World for Meeting Summarization?, https://arxiv.org/abs/2402.00841
1 Feb, Repeat After Me: Transformers are Better than State Space Models at Copying, https://arxiv.org/abs/2402.01032
2 Feb, LiPO: Listwise Preference Optimization through Learning-to-Rank, https://arxiv.org/abs/2402.01878
2 Feb, FindingEmo: An Image Dataset for Emotion Recognition in the Wild, https://arxiv.org/abs/2402.01355
3 Feb, More Agents Is All You Need, https://arxiv.org/abs/2402.05120
5 Feb, DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, https://arxiv.org/abs/2402.03300
6 Feb, MobileVLM V2: Faster and Stronger Baseline for Vision Language Model, https://arxiv.org/abs/2402.03766
6 Feb, A Phase Transition Between Positional and Semantic Learning in a Solvable Model of Dot-Product Attention, https://arxiv.org/abs/2402.03902
6 Feb, Scaling Laws for Downstream Task Performance of Large Language Models, https://arxiv.org/abs/2402.04177
6 Feb, MOMENT: A Family of Open Time-series Foundation Models, https://arxiv.org/abs/2402.03885
6 Feb, Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models, https://arxiv.org/abs/2402.03749
6 Feb, Self-Discover: Large Language Models Self-Compose Reasoning Structures, https://arxiv.org/abs/2402.03620
7 Feb, Grandmaster-Level Chess Without Search, https://arxiv.org/abs/2402.04494
7 Feb, Direct Language Model Alignment from Online AI Feedback, https://arxiv.org/abs/2402.04792
8 Feb, Buffer Overflow in Mixture of Experts, https://arxiv.org/abs/2402.05526
9 Feb, The Boundary of Neural Network Trainability is Fractal, https://arxiv.org/abs/2402.06184
11 Feb, ODIN: Disentangled Reward Mitigates Hacking in RLHF, https://arxiv.org/abs/2402.07319
12 Feb, Policy Improvement using Language Feedback Models, https://arxiv.org/abs/2402.07876
12 Feb, Scaling Laws for Fine-Grained Mixture of Experts, https://arxiv.org/abs/2402.07871
12 Feb, Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model, https://arxiv.org/abs/2402.07827
12 Feb, Step-On-Feet Tuning: Scaling Self-Alignment of LLMs via Bootstrapping, https://arxiv.org/abs/2402.07610
12 Feb, Suppressing Pink Elephants with Direct Principle Feedback, https://arxiv.org/abs/2402.07896
13 Feb, World Model on Million-Length Video And Language With RingAttention, https://arxiv.org/abs/2402.08268
13 Feb, Mixtures of Experts Unlock Parameter Scaling for Deep RL, https://arxiv.org/abs/2402.08609
14 Feb, DoRA: Weight-Decomposed Low-Rank Adaptation, https://arxiv.org/abs/2402.09353
14 Feb, Transformers Can Achieve Length Generalization But Not Robustly, https://arxiv.org/abs/2402.09371
15 Feb, BASE TTS: Lessons From Building a Billion-Parameter Text-to-Speech Model on 100K Hours of Data, https://arxiv.org/abs/2402.08093
15 Feb, Recovering the Pre-Fine-Tuning Weights of Generative Models, https://arxiv.org/abs/2402.10208
15 Feb, Generative Representational Instruction Tuning, https://arxiv.org/abs/2402.09906
16 Feb, FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models, https://arxiv.org/abs/2402.10986
17 Feb, OneBit: Towards Extremely Low-bit Large Language Models, https://arxiv.org/abs/2402.11295
18 Feb, LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration, https://arxiv.org/abs/2402.11550
19 Feb, Reformatted Alignment, https://arxiv.org/abs/2402.12219
19 Feb, AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling, https://arxiv.org/abs/2402.12226
19 Feb, Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs, https://arxiv.org/abs/2402.12030
19 Feb, LoRA+: Efficient Low Rank Adaptation of Large Models, https://arxiv.org/abs/2402.12354
20 Feb, Neural Network Diffusion, https://arxiv.org/abs/2402.13144
21 Feb, YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information, https://arxiv.org/abs/2402.13616
21 Feb, LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens, https://arxiv.org/abs/2402.13753
21 Feb, Large Language Models for Data Annotation: A Survey, https://arxiv.org/abs/2402.13446
22 Feb, TinyLLaVA: A Framework of Small-scale Large Multimodal Models, https://arxiv.org/abs/2402.14289
22 Feb, Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs, https://arxiv.org/abs/2402.14740
23 Feb, Genie: Generative Interactive Environments, https://arxiv.org/abs/2402.15391
26 Feb, CARTE: Pretraining and Transfer for Tabular Learning, https://arxiv.org/abs/2402.16785
27 Feb, The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits, https://arxiv.org/abs/2402.17764
27 Feb, Sora Generates Videos with Stunning Geometrical Consistency, https://arxiv.org/abs/2402.17403
27 Feb, When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method, https://arxiv.org/abs/2402.17193
29 Feb, Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models, https://arxiv.org/abs/2402.19427

March 2024

1 Mar, Learning and Leveraging World Models in Visual Representation Learning, https://arxiv.org/abs/2403.00504
3 Mar, Improving LLM Code Generation with Grammar Augmentation, https://arxiv.org/abs/2403.01632
3 Mar, The Hidden Attention of Mamba Models, https://arxiv.org/abs/2403.01590
4 Mar, Training-Free Pretrained Model Merging, https://arxiv.org/abs/2403.01753
4 Mar, Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures, https://arxiv.org/abs/2403.02308
5 Mar, The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning, https://arxiv.org/abs/2403.03218
5 Mar, Evolution Transformer: In-Context Evolutionary Optimization, https://arxiv.org/abs/2403.02985
5 Mar, Enhancing Vision-Language Pre-training with Rich Supervisions, https://arxiv.org/abs/2403.03346
5 Mar, Scaling Rectified Flow Transformers for High-Resolution Image Synthesis, https://arxiv.org/abs/2403.03206
5 Mar, Design2Code: How Far Are We From Automating Front-End Engineering?, https://arxiv.org/abs/2403.03163
6 Mar, ShortGPT: Layers in Large Language Models are More Redundant Than You Expect, https://arxiv.org/abs/2403.03853
6 Mar, Backtracing: Retrieving the Cause of the Query, https://arxiv.org/abs/2403.03956
6 Mar, Learning to Decode Collaboratively with Multiple Language Models, https://arxiv.org/abs/2403.03870
6 Mar, SaulLM-7B: A pioneering Large Language Model for Law, https://arxiv.org/abs/2403.03883
6 Mar, Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning, https://arxiv.org/abs/2403.03864
6 Mar, 3D Diffusion Policy, https://arxiv.org/abs/2403.03954
6 Mar, MedMamba: Vision Mamba for Medical Image Classification, https://arxiv.org/abs/2403.03849
6 Mar, GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection, https://arxiv.org/abs/2403.03507
6 Mar, Stop Regressing: Training Value Functions via Classification for Scalable Deep RL, https://arxiv.org/abs/2403.03950
7 Mar, How Far Are We from Intelligent Visual Deductive Reasoning?, https://arxiv.org/abs/2403.04732
7 Mar, Common 7B Language Models Already Possess Strong Math Capabilities, https://arxiv.org/abs/2403.04706
8 Mar, Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context, https://arxiv.org/abs/2403.05530
8 Mar, Is Cosine-Similarity of Embeddings Really About Similarity?, https://arxiv.org/abs/2403.05440
8 Mar, LLM4Decompile: Decompiling Binary Code with Large Language Models, https://arxiv.org/abs/2403.05286
9 Mar, Algorithmic Progress in Language Models, https://arxiv.org/abs/2403.05812
11 Mar, Stealing Part of a Production Language Model, https://arxiv.org/abs/2403.06634
12 Mar, Chronos: Learning the Language of Time Series, https://arxiv.org/abs/2403.07815
13 Mar, Simple and Scalable Strategies to Continually Pre-train Large Language Models, https://arxiv.org/abs/2403.08763
13 Mar, Language Models Scale Reliably With Over-Training and on Downstream Tasks, https://arxiv.org/abs/2403.08540
14 Mar, BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences, https://arxiv.org/abs/2403.09347
14 Mar, LocalMamba: Visual State Space Model with Windowed Selective Scan, https://arxiv.org/abs/2403.09338
14 Mar, GiT: Towards Generalist Vision Transformer through Universal Language Interface, https://arxiv.org/abs/2403.09394
14 Mar, MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training, https://arxiv.org/abs/2403.09611
15 Mar, RAFT: Adapting Language Model to Domain Specific RAG, https://arxiv.org/abs/2403.10131
18 Mar, TnT-LLM: Text Mining at Scale with Large Language Models, https://arxiv.org/abs/2403.12173
18 Mar, Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression, https://arxiv.org/abs/2403.15447
19 Mar, PERL: Parameter Efficient Reinforcement Learning from Human Feedback, https://arxiv.org/abs/2403.10704
20 Mar, RewardBench: Evaluating Reward Models for Language Modeling, https://arxiv.org/abs/2403.13787
20 Mar, LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models, https://arxiv.org/abs/2403.13372
21 Mar, RakutenAI-7B: Extending Large Language Models for Japanese, https://arxiv.org/abs/2403.15484
22 Mar, SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time Series, https://arxiv.org/abs/2403.15360
22 Mar, Can Large Language Models Explore In-Context?, https://arxiv.org/abs/2403.15371
22 Mar, LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement, https://arxiv.org/abs/2403.15042
25 Mar, LLM Agent Operating System, https://arxiv.org/abs/2403.16971
26 Mar, The Unreasonable Ineffectiveness of the Deeper Layers, https://arxiv.org/abs/2403.17887
26 Mar, LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning, https://arxiv.org/abs/2403.17919
26 Mar, Mechanistic Design and Scaling of Hybrid Architectures, https://arxiv.org/abs/2403.17844
27 Mar, BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text, https://arxiv.org/abs/2403.18421
27 Mar, ViTAR: Vision Transformer with Any Resolution, https://arxiv.org/abs/2403.18361
27 Mar, Long-form Factuality in Large Language Models, https://arxiv.org/abs/2403.18802
27 Mar, Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models, https://arxiv.org/abs/2403.18814
28 Mar, MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions, https://arxiv.org/abs/2403.19651
28 Mar, Model Stock: All We Need Is Just a Few Fine-Tuned Models, https://arxiv.org/abs/2403.19522

April 2024

1 Apr, Do Language Models Plan Ahead for Future Tokens?, https://arxiv.org/abs/2404.00859
1 Apr, Bigger is not Always Better: Scaling Properties of Latent Diffusion Models, https://arxiv.org/abs/2404.01367
1 Apr, The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis, https://arxiv.org/abs/2404.01204
1 Apr, Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models, https://arxiv.org/abs/2404.04478
2 Apr, Mixture-of-Depths: Dynamically Allocating Compute in Transformer-Based Language Models, https://arxiv.org/abs/2404.02258
2 Apr, Long-context LLMs Struggle with Long In-context Learning, https://arxiv.org/abs/2404.02060
2 Apr, Emergent Abilities in Reduced-Scale Generative Language Models, https://arxiv.org/abs/2404.02204
2 Apr, Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks, https://arxiv.org/abs/2404.02151
3 Apr, On the Scalability of Diffusion-based Text-to-Image Generation, https://arxiv.org/abs/2404.02883
3 Apr, BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models, https://arxiv.org/abs/2404.02827
3 Apr, Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models, https://arxiv.org/abs/2404.02747
4 Apr, Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences, https://arxiv.org/abs/2404.03715
4 Apr, Training LLMs over Neurally Compressed Text, https://arxiv.org/abs/2404.03626
4 Apr, CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues, https://arxiv.org/abs/2404.03820
5 Apr, ReFT: Representation Finetuning for Language Models, https://arxiv.org/abs/2404.03592
5 Apr, Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data, https://arxiv.org/abs/2404.03862
5 Apr, Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation, https://arxiv.org/abs/2404.04256
8 Apr, AutoCodeRover: Autonomous Program Improvement, https://arxiv.org/abs/2404.05427
8 Apr, Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence, https://arxiv.org/abs/2404.05892
8 Apr, CodecLM: Aligning Language Models with Tailored Synthetic Data, https://arxiv.org/abs/2404.05875
9 Apr, MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies, https://arxiv.org/abs/2404.06395
9 Apr, Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models, https://arxiv.org/abs/2404.06209
9 Apr, LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders, https://arxiv.org/abs/2404.05961
10 Apr, Adapting LLaMA Decoder to Vision Transformer, https://arxiv.org/abs/2404.06773
10 Apr, Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention, https://arxiv.org/abs/2404.07143
11 Apr, LLoCO: Learning Long Contexts Offline, https://arxiv.org/abs/2404.07979
11 Apr, JetMoE: Reaching Llama2 Performance with 0.1M Dollars, https://arxiv.org/abs/2404.07413
11 Apr, Best Practices and Lessons Learned on Synthetic Data for Language Models, https://arxiv.org/abs/2404.07503
11 Apr, Rho-1: Not All Tokens Are What You Need, https://arxiv.org/abs/2404.07965
12 Apr, Pre-training Small Base LMs with Fewer Tokens, https://arxiv.org/abs/2404.08634
12 Apr, Dataset Reset Policy Optimization for RLHF, https://arxiv.org/abs/2404.08495
13 Apr, LLM In-Context Recall is Prompt Dependent, https://arxiv.org/abs/2404.08865
15 Apr, State Space Model for New-Generation Network Alternative to Transformers: A Survey, https://arxiv.org/abs/2404.09516
15 Apr, Chinchilla Scaling: A Replication Attempt, https://arxiv.org/abs/2404.10102
15 Apr, Learn Your Reference Model for Real Good Alignment, https://arxiv.org/abs/2404.09656
16 Apr, Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study, https://arxiv.org/abs/2404.10719
16 Apr, Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies, https://arxiv.org/abs/2404.08197
16 Apr, How Faithful Are RAG Models? Quantifying the Tug-of-War Between RAG and LLMs' Internal Prior, https://arxiv.org/abs/2404.10198
17 Apr, A Survey on Retrieval-Augmented Text Generation for Large Language Models, https://arxiv.org/abs/2404.10981
18 Apr, When LLMs are Unfit Use FastFit: Fast and Effective Text Classification with Many Classes, https://arxiv.org/abs/2404.12365
18 Apr, Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing, https://arxiv.org/abs/2404.12253
18 Apr, OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data, https://arxiv.org/abs/2404.12195
19 Apr, The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions, https://arxiv.org/abs/2404.13208
22 Apr, How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study, https://arxiv.org/abs/2404.14047
22 Apr, Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone, https://arxiv.org/abs/2404.14219
22 Apr, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, https://arxiv.org/abs/2404.14619
22 Apr, A Survey on Self-Evolution of Large Language Models, https://arxiv.org/abs/2404.14387
23 Apr, Multi-Head Mixture-of-Experts, https://arxiv.org/abs/2404.15045
23 Apr, NExT: Teaching Large Language Models to Reason about Code Execution, https://arxiv.org/abs/2404.14662
23 Apr, Graph Machine Learning in the Era of Large Language Models (LLMs), https://arxiv.org/abs/2404.14928
24 Apr, Retrieval Head Mechanistically Explains Long-Context Factuality, https://arxiv.org/abs/2404.15574
25 Apr, Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, https://arxiv.org/abs/2404.16710
25 Apr, Make Your LLM Fully Utilize the Context, https://arxiv.org/abs/2404.16811
28 Apr, LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report, https://arxiv.org/abs/2405.00732
30 Apr, Better & Faster Large Language Models via Multi-token Prediction, https://arxiv.org/abs/2404.19737
30 Apr, RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing, https://arxiv.org/abs/2404.19543
30 Apr, A Primer on the Inner Workings of Transformer-based Language Models, https://arxiv.org/abs/2405.00208
30 Apr, When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively, https://arxiv.org/abs/2404.19705
30 Apr, KAN: Kolmogorov–Arnold Networks, https://arxiv.org/abs/2404.19756

May 2024

1 May, Is Bigger Edit Batch Size Always Better? An Empirical Study on Model Editing with Llama-3, https://arxiv.org/abs/2405.00664
1 May, Self-Play Preference Optimization for Language Model Alignment, https://arxiv.org/abs/2405.00675
1 May, A Careful Examination of Large Language Model Performance on Grade School Arithmetic, https://arxiv.org/abs/2405.00332
2 May, Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models, https://arxiv.org/abs/2405.01535
3 May, What Matters When Building Vision-Language Models?, https://arxiv.org/abs/2405.02246
5 May, Is Flash Attention Stable?, https://arxiv.org/abs/2405.02803
7 May, vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention, https://arxiv.org/abs/2405.04437
7 May, xLSTM: Extended Long Short-Term Memory, https://arxiv.org/abs/2405.04517
8 May, You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254
8 May, DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, https://arxiv.org/abs/2405.04434
8 May, Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models, https://arxiv.org/abs/2405.05417
9 May, Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?, https://arxiv.org/abs/2405.05904
10 May, Value Augmented Sampling for Language Model Alignment and Personalization, https://arxiv.org/abs/2405.06639
12 May, PHUDGE: Phi-3 as Scalable Judge, https://arxiv.org/abs/2405.08029
13 May, RLHF Workflow: From Reward Modeling to Online RLHF, https://arxiv.org/abs/2405.07863
15 May, LoRA Learns Less and Forgets Less, https://arxiv.org/abs/2405.09673
15 May, Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model, https://arxiv.org/abs/2405.09215
16 May, Chameleon: Mixed-Modal Early-Fusion Foundation Models, https://arxiv.org/abs/2405.09818
17 May, Towards Modular LLMs by Building and Reusing a Library of LoRAs, https://arxiv.org/abs/2405.11157
19 May, SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization, https://arxiv.org/abs/2405.11582
20 May, MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning, https://arxiv.org/abs/2405.12130
22 May, Attention as an RNN, https://arxiv.org/abs/2405.13956
22 May, Dense Connector for MLLMs, https://arxiv.org/abs/2405.13800
23 May, AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability, https://arxiv.org/abs/2405.14129
23 May, SimPO: Simple Preference Optimization with a Reference-Free Reward, https://arxiv.org/abs/2405.14734
23 May, Instruction Tuning With Loss Over Instructions, https://arxiv.org/abs/2405.14394
24 May, The Road Less Scheduled, https://arxiv.org/abs/2405.15682
26 May, Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training, https://arxiv.org/abs/2405.15319
26 May, gzip Predicts Data-dependent Scaling Laws, https://arxiv.org/abs/2405.16684
27 May, Trans-LoRA: Towards Data-free Transferable Parameter Efficient Finetuning, https://arxiv.org/abs/2405.17258
28 May, VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections, https://arxiv.org/abs/2405.17991
28 May, LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models, https://arxiv.org/abs/2405.18377
29 May, Contextual Position Encoding: Learning to Count What's Important, https://arxiv.org/abs/2405.18719

June 2024

2 Jun, Show, Don't Tell: Aligning Language Models with Demonstrated Feedback, https://arxiv.org/abs/2406.00888
3 Jun, Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models, https://arxiv.org/abs/2406.06563
3 Jun, OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models, https://arxiv.org/abs/2406.01775
3 Jun, The Geometry of Categorical and Hierarchical Concepts in Large Language Models, https://arxiv.org/abs/2406.01506
3 Jun, Towards Scalable Automated Alignment of LLMs: A Survey, https://arxiv.org/abs/2406.01252
4 Jun, Scalable MatMul-free Language Modeling, https://arxiv.org/abs/2406.02528
4 Jun, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs/2406.02657
6 Jun, Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models, https://arxiv.org/abs/2406.04271
6 Jun, The Prompt Report: A Systematic Survey of Prompting Techniques, https://arxiv.org/abs/2406.06608
6 Jun, Transformers Need Glasses! Information Over-Squashing in Language Tasks, https://arxiv.org/abs/2406.04267
6 Jun, Are We Done with MMLU?, https://arxiv.org/abs/2406.04127
6 Jun, Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step, https://arxiv.org/abs/2406.04314
7 Jun, Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach, https://arxiv.org/abs/2406.04594
7 Jun, CRAG -- Comprehensive RAG Benchmark, https://arxiv.org/abs/2406.04744
7 Jun, WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild, https://arxiv.org/abs/2406.04770
7 Jun, Mixture-of-Agents Enhances Large Language Model Capabilities, https://arxiv.org/abs/2406.04692
7 Jun, BERTs are Generative In-Context Learners, https://arxiv.org/abs/2406.04823
7 Jun, 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination, https://arxiv.org/abs/2406.05132
8 Jun, Creativity Has Left the Chat: The Price of Debiasing Language Models, https://arxiv.org/abs/2406.05587
10 Jun, Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation, https://arxiv.org/abs/2406.06525
10 Jun, Margin-aware Preference Optimization for Aligning Diffusion Models Without Reference, https://arxiv.org/abs/2406.06424
10 Jun, Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning, https://arxiv.org/abs/2406.06469
10 Jun, Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters, https://arxiv.org/abs/2406.05955
10 Jun, Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching, https://arxiv.org/abs/2406.06326
11 Jun, An Image is Worth 32 Tokens for Reconstruction and Generation, https://arxiv.org/abs/2406.07550
11 Jun, TextGrad: Automatic "Differentiation" via Text, https://arxiv.org/abs/2406.07496
11 Jun, Simple and Effective Masked Diffusion Language Models, https://arxiv.org/abs/2406.07524
11 Jun, Never Miss A Beat: An Efficient Recipe for Context Window Extension of Large Language Models with Consistent "Middle" Enhancement, https://arxiv.org/abs/2406.07138
11 Jun, Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling, https://arxiv.org/abs/2406.07522
12 Jun, Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing, https://arxiv.org/abs/2406.08464
12 Jun, What If We Recaption Billions of Web Images with LLaMA-3?, https://arxiv.org/abs/2406.08478
12 Jun, Large Language Model Unlearning via Embedding-Corrupted Prompts, https://arxiv.org/abs/2406.07933
12 Jun, Large Language Models Must Be Taught to Know What They Don't Know, https://arxiv.org/abs/2406.08391
12 Jun, An Empirical Study of Mamba-based Language Models, https://arxiv.org/abs/2406.07887
12 Jun, Discovering Preference Optimization Algorithms with and for Large Language Models, https://arxiv.org/abs/2406.08414
13 Jun, Transformers Meet Neural Algorithmic Reasoners, https://arxiv.org/abs/2406.09308
13 Jun, MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding, https://arxiv.org/abs/2406.09297
13 Jun, An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels, https://arxiv.org/abs/2406.09415
13 Jun, FouRA: Fourier Low Rank Adaptation, https://arxiv.org/abs/2406.08798
14 Jun, Bootstrapping Language Models with DPO Implicit Rewards, https://arxiv.org/abs/2406.09760
14 Jun, Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs, https://arxiv.org/abs/2406.10209
14 Jun, Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs, https://arxiv.org/abs/2406.10216
16 Jun, THEANINE: Revisiting Memory Management in Long-term Conversations with Timeline-augmented Response Generation, https://arxiv.org/abs/2406.10996
17 Jun, Task Me Anything, https://arxiv.org/abs/2406.11775
17 Jun, How Do Large Language Models Acquire Factual Knowledge During Pretraining?, https://arxiv.org/abs/2406.11813
17 Jun, mDPO: Conditional Preference Optimization for Multimodal Large Language Models, https://arxiv.org/abs/2406.11839
17 Jun, Nemotron-4 340B Technical Report, https://arxiv.org/abs/2406.11704
17 Jun, DataComp-LM: In Search of the Next Generation of Training Sets for Language Models, https://arxiv.org/abs/2406.11794
17 Jun, Tokenization Falling Short: The Curse of Tokenization, https://arxiv.org/abs/2406.11687
17 Jun, DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence, https://arxiv.org/abs/2406.11931
17 Jun, Unveiling Encoder-Free Vision-Language Models, https://arxiv.org/abs/2406.11832
17 Jun, Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level, https://arxiv.org/abs/2406.11817
17 Jun, HARE: HumAn pRiors, a key to small language model Efficiency, https://arxiv.org/abs/2406.11410
17 Jun, Measuring memorization in RLHF for code completion, https://arxiv.org/abs/2406.11715
17 Jun, Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts, https://arxiv.org/abs/2406.12034
18 Jun, From RAGs to Rich Parameters: Probing How Language Models Utilize External Knowledge Over Parametric Information for Factual Queries, https://arxiv.org/abs/2406.12824
18 Jun, Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges, https://arxiv.org/abs/2406.12624
19 Jun, Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?, https://arxiv.org/abs/2406.13121
20 Jun, Instruction Pre-Training: Language Models are Supervised Multitask Learners, https://arxiv.org/abs/2406.14491
20 Jun, Can LLMs Learn by Teaching? A Preliminary Study, https://arxiv.org/abs/2406.14629
21 Jun, A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems, https://arxiv.org/abs/2406.14972
21 Jun, LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs, https://arxiv.org/abs/2406.15319
21 Jun, MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression, https://arxiv.org/abs/2406.14909
21 Jun, Efficient Continual Pre-training by Mitigating the Stability Gap, https://arxiv.org/abs/2406.14833
24 Jun, Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers, https://arxiv.org/abs/2406.16747
24 Jun, WARP: On the Benefits of Weight Averaged Rewarded Policies, https://arxiv.org/abs/2406.16768
24 Jun, Adam-mini: Use Fewer Learning Rates To Gain More, https://arxiv.org/abs/2406.16793
25 Jun, The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale, https://arxiv.org/abs/2406.17557
25 Jun, LongIns: A Challenging Long-context Instruction-based Exam for LLMs, https://arxiv.org/abs/2406.17588
25 Jun, Following Length Constraints in Instructions, https://arxiv.org/abs/2406.17744
26 Jun, A Closer Look into Mixture-of-Experts in Large Language Models, https://arxiv.org/abs/2406.18219
26 Jun, RouteLLM: Learning to Route LLMs with Preference Data, https://arxiv.org/abs/2406.18665
26 Jun, Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs, https://arxiv.org/abs/2406.18629
27 Jun, Dataset Size Recovery from LoRA Weights, https://arxiv.org/abs/2406.19395
27 Jun, From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data, https://arxiv.org/abs/2406.19292
27 Jun, Changing Answer Order Can Decrease MMLU Accuracy, https://arxiv.org/abs/2406.19470
28 Jun, Direct Preference Knowledge Distillation for Large Language Models, https://arxiv.org/abs/2406.19774
28 Jun, LLM Critics Help Catch LLM Bugs, https://arxiv.org/abs/2407.00215
28 Jun, Scaling Synthetic Data Creation with 1,000,000,000 Personas, https://arxiv.org/abs/2406.20094

July 2024

1 Jul, LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives, https://arxiv.org/abs/2407.01490
1 Jul, Searching for Best Practices in Retrieval-Augmented Generation, https://arxiv.org/abs/2407.01219
1 Jul, Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models, https://arxiv.org/abs/2407.01906
1 Jul, Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion, https://arxiv.org/abs/2407.01392
1 Jul, Eliminating Position Bias of Language Models: A Mechanistic Approach, https://arxiv.org/abs/2407.01100
2 Jul, MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, https://arxiv.org/abs/2407.02490
2 Jul, TokenPacker: Efficient Visual Projector for Multimodal LLM, https://arxiv.org/abs/2407.02392
2 Jul, Reasoning in Large Language Models: A Geometric Perspective, https://arxiv.org/abs/2407.02678
2 Jul, RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs, https://arxiv.org/abs/2407.02485
3 Jul, AgentInstruct: Toward Generative Teaching with Agentic Flows, https://arxiv.org/abs/2407.03502
3 Jul, HEMM: Holistic Evaluation of Multimodal Foundation Models, https://arxiv.org/abs/2407.03418
4 Jul, Mixture of A Million Experts, https://arxiv.org/abs/2407.04153
5 Jul, Learning to (Learn at Test Time): RNNs with Expressive Hidden States, https://arxiv.org/abs/2407.04620
9 Jul, Vision Language Models Are Blind, https://arxiv.org/abs/2407.06581
9 Jul, Self-Recognition in Language Models, https://arxiv.org/abs/2407.06946
10 Jul, Inference Performance Optimization for Large Language Models on CPUs, https://arxiv.org/abs/2407.07304
11 Jul, Gradient Boosting Reinforcement Learning, https://arxiv.org/abs/2407.08250
11 Jul, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, https://arxiv.org/abs/2407.08608
12 Jul, SpreadsheetLLM: Encoding Spreadsheets for Large Language Models, https://arxiv.org/abs/2407.09025
12 Jul, New Desiderata for Direct Preference Optimization, https://arxiv.org/abs/2407.09072
12 Jul, Context Embeddings for Efficient Answer Generation in RAG, https://arxiv.org/abs/2407.09252
15 Jul, Qwen2 Technical Report, https://arxiv.org/abs/2407.10671
15 Jul, The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism, https://arxiv.org/abs/2407.10457
15 Jul, From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients, https://arxiv.org/abs/2407.11239
16 Jul, GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression, https://arxiv.org/abs/2407.12077
16 Jul, Scaling Diffusion Transformers to 16 Billion Parameters, https://arxiv.org/abs/2407.11633
16 Jul, NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?, https://arxiv.org/abs/2407.11963
17 Jul, Patch-Level Training for Large Language Models, https://arxiv.org/abs/2407.12665
17 Jul, LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models, https://arxiv.org/abs/2407.12772
17 Jul, A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks, https://arxiv.org/abs/2407.12994
17 Jul, Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models, https://arxiv.org/abs/2407.12327
18 Jul, Attention Overflow: Language Model Input Blur during Long-Context Missing Items Recommendation, https://arxiv.org/abs/2407.13481
18 Jul, Weak-to-Strong Reasoning, https://arxiv.org/abs/2407.13647
18 Jul, Understanding Reference Policies in Direct Preference Optimization, https://arxiv.org/abs/2407.13709
18 Jul, Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies, https://arxiv.org/abs/2407.13623
19 Jul, BOND: Aligning LLMs with Best-of-N Distillation, https://arxiv.org/abs/2407.14622
19 Jul, Compact Language Models via Pruning and Knowledge Distillation, https://arxiv.org/abs/2407.14679
19 Jul, LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference, https://arxiv.org/abs/2407.14057
22 Jul, Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training, https://arxiv.org/abs/2407.15892
22 Jul, DDK: Distilling Domain Knowledge for Efficient Large Language Models, https://arxiv.org/abs/2407.16154
23 Jul, Generation Constraint Scaling Can Mitigate Hallucination, https://arxiv.org/abs/2407.16908
23 Jul, Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach, https://arxiv.org/abs/2407.16833
23 Jul, Course-Correction: Safety Alignment Using Synthetic Preferences, https://arxiv.org/abs/2407.16637
26 Jul, Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?, https://arxiv.org/abs/2407.16607
28 Jul, Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge, https://arxiv.org/abs/2407.19594
29 Jul, Improving Retrieval Augmented Language Model with Self-Reasoning, https://arxiv.org/abs/2407.19813
29 Jul, Apple Intelligence Foundation Language Models, https://arxiv.org/abs/2407.21075
30 Jul, ThinK: Thinner Key Cache by Query-Driven Pruning, https://arxiv.org/abs/2407.21018
31 Jul, The Llama 3 Herd of Models, https://arxiv.org/abs/2407.21783
31 Jul, Gemma 2: Improving Open Language Models at a Practical Size, https://arxiv.org/abs/2408.00118

August 2024

1 Aug, SAM 2: Segment Anything in Images and Videos, https://arxiv.org/abs/2408.00714
2 Aug, POA: Pre-training Once for Models of All Sizes, https://arxiv.org/abs/2408.01031
2 Aug, RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework, https://arxiv.org/abs/2408.01262
2 Aug, A Survey of Mamba, https://arxiv.org/abs/2408.01129
3 Aug, MiniCPM-V: A GPT-4V Level MLLM on Your Phone, https://arxiv.org/abs/2408.01800
5 Aug, RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation, https://arxiv.org/abs/2408.02545
5 Aug, Self-Taught Evaluators, https://arxiv.org/abs/2408.02666
5 Aug, BioMamba: A Pre-trained Biomedical Language Representation Model Leveraging Mamba, https://arxiv.org/abs/2408.02600
7 Aug, EXAONE 3.0 7.8B Instruction Tuned Language Model, https://arxiv.org/abs/2408.03541
7 Aug, 1.5-Pints Technical Report: Pretraining in Days, Not Months -- Your Language Model Thrives on Quality Data, https://arxiv.org/abs/2408.03506
8 Aug, Conversational Prompt Engineering, https://arxiv.org/abs/2408.04560
8 Aug, Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP, https://arxiv.org/abs/2408.04303
12 Aug, The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery, https://arxiv.org/abs/2408.06292
15 Aug, Hermes 3 Technical Report, https://arxiv.org/abs/2408.11857
19 Aug, Customizing Language Models with Instance-wise LoRA for Sequential Recommendation, https://arxiv.org/abs/2408.10159
20 Aug, Enhancing Robustness in Large Language Models: Prompting for Mitigating the Impact of Irrelevant Information, https://arxiv.org/abs/2408.10615
20 Aug, To Code, or Not To Code? Exploring Impact of Code in Pre-training, https://arxiv.org/abs/2408.10914
21 Aug, LLM Pruning and Distillation in Practice: The Minitron Approach, https://arxiv.org/abs/2408.11796
22 Aug, Jamba-1.5: Hybrid Transformer-Mamba Models at Scale, https://arxiv.org/abs/2408.12570
22 Aug, Controllable Text Generation for Large Language Models: A Survey, https://arxiv.org/abs/2408.12599
23 Aug, Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time, https://arxiv.org/abs/2408.13233
26 Aug, A Practitioner's Guide to Continual Multimodal Pretraining, https://arxiv.org/abs/2408.14471
26 Aug, Building and Better Understanding Vision-Language Models: Insights and Future Directions, https://arxiv.org/abs/2408.12637
26 Aug, CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting Mitigation, https://arxiv.org/abs/2408.14572
27 Aug, The Mamba in the Llama: Distilling and Accelerating Hybrid Models, https://arxiv.org/abs/2408.15237
28 Aug, ReMamba: Equip Mamba with Effective Long-Sequence Modeling, https://arxiv.org/abs/2408.15496
29 Aug, Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling, https://arxiv.org/abs/2408.16737
31 Aug, LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models, https://arxiv.org/abs/2409.00509

September 2024

3 Sep, OLMoE: Open Mixture-of-Experts Language Models, https://arxiv.org/abs/2409.02060
3 Sep, In Defense of RAG in the Era of Long-Context Language Models, https://arxiv.org/abs/2409.01666
5 Sep, Attention Heads of Large Language Models: A Survey, https://arxiv.org/abs/2409.03752
5 Sep, LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA, https://arxiv.org/abs/2409.02897
5 Sep, How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data, https://arxiv.org/abs/2409.03810
6 Sep, Theory, Analysis, and Best Practices for Sigmoid Self-Attention, https://arxiv.org/abs/2409.04431
10 Sep, LLaMA-Omni: Seamless Speech Interaction with Large Language Models, https://arxiv.org/abs/2409.06666
10 Sep, What is the Role of Small Models in the LLM Era: A Survey, https://arxiv.org/abs/2409.06857
11 Sep, Policy Filtration in RLHF to Fine-Tune LLM for Code Generation, https://arxiv.org/abs/2409.06957
16 Sep, RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval, https://arxiv.org/abs/2409.10516
18 Sep, Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement, https://arxiv.org/abs/2409.12122
18 Sep, Qwen2.5-Coder Technical Report, https://arxiv.org/abs/2409.12186
21 Sep, Instruction Following without Instruction Tuning, https://arxiv.org/abs/2409.14254
30 Sep, Is Preference Alignment Always the Best Option to Enhance LLM-Based Translation? An Empirical Analysis, https://arxiv.org/abs/2409.20059
30 Sep, The Perfect Blend: Redefining RLHF with Mixture of Judges, https://arxiv.org/abs/2409.20370 (a new paper by Meta on how they did RLHF for Llama 3)

October 2024

1 Oct, Addition is All You Need for Energy-efficient Language Models, https://arxiv.org/abs/2410.00907
2 Oct, Quantifying Generalization Complexity for Large Language Models, https://arxiv.org/abs/2410.01769
2 Oct, When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1, https://arxiv.org/abs/2410.01792
2 Oct, Were RNNs All We Needed?, https://arxiv.org/abs/2410.01201
3 Oct, Selective Attention Improves Transformer, https://arxiv.org/abs/2410.02703
3 Oct, LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations, https://arxiv.org/abs/2410.02707
3 Oct, LLaVA-Critic: Learning to Evaluate Multimodal Models, https://arxiv.org/abs/2410.02712
7 Oct, Differential Transformer, https://arxiv.org/abs/2410.05258
7 Oct, GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models, https://arxiv.org/abs/2410.05229
8 Oct, ARIA: An Open Multimodal Native Mixture-of-Experts Model, https://arxiv.org/abs/2410.05993
8 Oct, O1 Replication Journey: A Strategic Progress Report -- Part 1, https://arxiv.org/abs/2410.18982
8 Oct, Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG, https://arxiv.org/abs/2410.05983
9 Oct, From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning, https://arxiv.org/abs/2410.06456
10 Oct, KV Prediction for Improved Time to First Token, https://arxiv.org/abs/2410.08391
11 Oct, Baichuan-Omni Technical Report, https://arxiv.org/abs/2410.08565
13 Oct, MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models, https://arxiv.org/abs/2410.10139
13 Oct, LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models, https://arxiv.org/abs/2410.09732
15 Oct, AFlow: Automating Agentic Workflow Generation, https://arxiv.org/abs/2410.10762
15 Oct, Toward General Instruction-Following Alignment for Retrieval-Augmented Generation, https://arxiv.org/abs/2410.09584
21 Oct, Pre-training Distillation for Large Language Models: A Design Space Exploration, https://arxiv.org/abs/2410.16215
23 Oct, MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models, https://arxiv.org/abs/2410.17637
23 Oct, Scalable Ranked Preference Optimization for Text-to-Image Generation, https://arxiv.org/abs/2410.18013
23 Oct, Scaling Diffusion Language Models via Adaptation from Autoregressive Models, https://arxiv.org/abs/2410.17891
24 Oct, Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback, https://arxiv.org/abs/2410.19133
25 Oct, Counting Ability of Large Language Models and Impact of Tokenization, https://arxiv.org/abs/2410.19730
25 Oct, A Survey of Small Language Models, https://arxiv.org/abs/2410.20011
26 Oct, Accelerating Direct Preference Optimization with Prefix Sharing, https://arxiv.org/abs/2410.20305
27 Oct, Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse, https://arxiv.org/abs/2410.21333
28 Oct, LongReward: Improving Long-context Large Language Models with AI Feedback, https://arxiv.org/abs/2410.21252
28 Oct, ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference, https://arxiv.org/abs/2410.21465
29 Oct, Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications, https://arxiv.org/abs/2410.21943
30 Oct, CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation, https://arxiv.org/abs/2410.23090
31 Oct, What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective, https://arxiv.org/abs/2410.23743
31 Oct, GPT or BERT: why not both?, https://arxiv.org/abs/2410.24159
31 Oct, Language Models can Self-Lengthen to Generate Long Texts, https://arxiv.org/abs/2410.23933

November 2024

1 Nov, Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations, https://arxiv.org/abs/2411.00640
1 Nov, Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation, https://arxiv.org/abs/2411.00412
1 Nov, Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models, https://arxiv.org/abs/2411.00492
3 Nov, Sample-Efficient Alignment for LLMs, https://arxiv.org/abs/2411.01493
4 Nov, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
4 Nov, "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization, https://arxiv.org/abs/2411.02355
4 Nov, Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study, https://arxiv.org/abs/2411.02462
5 Nov, HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems, https://arxiv.org/abs/2411.02959
6 Nov, Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination, https://arxiv.org/abs/2411.03823
6 Nov, Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding, https://arxiv.org/abs/2411.04282
6 Nov, Number Cookbook: Number Understanding of Language Models and How to Improve It, https://arxiv.org/abs/2411.03766
7 Nov, Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models, https://arxiv.org/abs/2411.04996
7 Nov, BitNet a4.8: 4-bit Activations for 1-bit LLMs, https://arxiv.org/abs/2411.04965
7 Nov, Scaling Laws for Precision, https://arxiv.org/abs/2411.04330
8 Nov, Energy Efficient Protein Language Models: Leveraging Small Language Models with LoRA for Controllable Protein Generation, https://arxiv.org/abs/2411.05966
8 Nov, Balancing Pipeline Parallelism with Vocabulary Parallelism, https://arxiv.org/abs/2411.05288
11 Nov, Toward Optimal Search and Retrieval for RAG, https://arxiv.org/abs/2411.07396
12 Nov, Large Language Models Can Self-Improve in Long-context Reasoning, https://arxiv.org/abs/2411.08147
12 Nov, Stronger Models are NOT Stronger Teachers for Instruction Tuning, https://arxiv.org/abs/2411.07133
12 Nov, Direct Preference Optimization Using Sparse Feature-Level Constraints, https://arxiv.org/abs/2411.07618
13 Nov, Cut Your Losses in Large-Vocabulary Language Models, https://arxiv.org/abs/2411.09009
15 Nov, Does Prompt Formatting Have Any Impact on LLM Performance?, https://arxiv.org/abs/2411.10541
17 Nov, SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization, https://arxiv.org/abs/2411.11909
17 Nov, SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration, https://arxiv.org/abs/2411.10958
18 Nov, Bi-Mamba: Towards Accurate 1-Bit State Space Models, https://arxiv.org/abs/2411.11843
19 Nov, RedPajama: an Open Dataset for Training Large Language Models, https://arxiv.org/abs/2411.12372
20 Nov, Hymba: A Hybrid-head Architecture for Small Language Models, https://arxiv.org/abs/2411.13676
20 Nov, Loss-to-Loss Prediction: Scaling Laws for All Datasets, https://arxiv.org/abs/2411.12925
21 Nov, When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training, https://arxiv.org/abs/2411.13476
21 Nov, Multimodal Autoregressive Pre-training of Large Vision Encoders, https://arxiv.org/abs/2411.14402
21 Nov, Natural Language Reinforcement Learning, https://arxiv.org/abs/2411.14251
22 Nov, Large Multi-modal Models Can Interpret Features in Large Multi-modal Models, https://arxiv.org/abs/2411.14982
22 Nov, TÜLU 3: Pushing Frontiers in Open Language Model Post-Training, https://arxiv.org/abs/2411.15124
23 Nov, MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs, https://arxiv.org/abs/2411.15296
24 Nov, LLMs Do Not Think Step-by-step In Implicit Reasoning, https://arxiv.org/abs/2411.15862
25 Nov, O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?, https://arxiv.org/abs/2411.16489
26 Nov, Star Attention: Efficient LLM Inference over Long Sequences, https://arxiv.org/abs/2411.17116
27 Nov, Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens, https://arxiv.org/abs/2411.17691
27 Nov, Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration, https://arxiv.org/abs/2411.17686
29 Nov, Reverse Thinking Makes LLMs Stronger Reasoners, https://arxiv.org/abs/2411.19865
29 Nov, Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability, https://arxiv.org/abs/2411.19943

December 2024

2 Dec, Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis, https://arxiv.org/abs/2412.01819
2 Dec, X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models, https://arxiv.org/abs/2412.01824
2 Dec, Free Process Rewards without Process Labels, https://arxiv.org/abs/2412.01981
3 Dec, Scaling Image Tokenizers with Grouped Spherical Quantization, https://arxiv.org/abs/2412.02632
3 Dec, RARE: Retrieval-Augmented Reasoning Enhancement for Large Language Models, https://arxiv.org/abs/2412.02830
4 Dec, Perception Tokens Enhance Visual Reasoning in Multimodal Language Models, https://arxiv.org/abs/2412.03548
4 Dec, Evaluating Language Models as Synthetic Data Generators, https://arxiv.org/abs/2412.03679
4 Dec, Best-of-N Jailbreaking, https://arxiv.org/abs/2412.03556
4 Dec, PaliGemma 2: A Family of Versatile VLMs for Transfer, https://arxiv.org/abs/2412.03555
5 Dec, VisionZip: Longer is Better but Not Necessary in Vision Language Models, https://arxiv.org/abs/2412.04467
5 Dec, Evaluating and Aligning CodeLLMs on Human Preference, https://arxiv.org/abs/2412.05210
6 Dec, MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale, https://arxiv.org/abs/2412.05237
6 Dec, Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling, https://arxiv.org/abs/2412.05271
7 Dec, LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods, https://arxiv.org/abs/2412.05579
8 Dec, Does RLHF Scale? Exploring the Impacts From Data, Model, and Method, https://arxiv.org/abs/2412.06000
9 Dec, Unraveling the Complexity of Memory in RL Agents: An Approach for Classification and Evaluation, https://arxiv.org/abs/2412.06531
9 Dec, Training Large Language Models to Reason in a Continuous Latent Space, https://arxiv.org/abs/2412.06769
9 Dec, AutoReason: Automatic Few-Shot Reasoning Decomposition, https://arxiv.org/abs/2412.06975
11 Dec, Large Concept Models: Language Modeling in a Sentence Representation Space, https://arxiv.org/abs/2412.08821
12 Dec, Phi-4 Technical Report, https://arxiv.org/abs/2412.08905
13 Dec, Byte Latent Transformer: Patches Scale Better Than Tokens, https://arxiv.org/abs/2412.09871
13 Dec, SCBench: A KV Cache-Centric Analysis of Long-Context Methods, https://arxiv.org/abs/2412.10319
13 Dec, Cultural Evolution of Cooperation among LLM Agents, https://arxiv.org/abs/2412.10270
13 Dec, DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding, https://arxiv.org/abs/2412.10302
16 Dec, No More Adam: Learning Rate Scaling at Initialization is All You Need, https://arxiv.org/abs/2412.11768
16 Dec, Precise Length Control in Large Language Models, https://arxiv.org/abs/2412.11937
16 Dec, The Open Source Advantage in Large Language Models (LLMs), https://arxiv.org/abs/2412.12004
16 Dec, A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges, https://arxiv.org/abs/2412.11936
17 Dec, Are Your LLMs Capable of Stable Reasoning?, https://arxiv.org/abs/2412.13147
18 Dec, LLM Post-Training Recipes, Improving Reasoning in LLMs, https://arxiv.org/abs/2412.14135
18 Dec, Hansel: Output Length Controlling Framework for Large Language Models, https://arxiv.org/abs/2412.14033
18 Dec, Mind Your Theory: Theory of Mind Goes Deeper Than Reasoning, https://arxiv.org/abs/2412.13631
18 Dec, Alignment Faking in Large Language Models, https://arxiv.org/abs/2412.14093
18 Dec, SCOPE: Optimizing Key-Value Cache Compression in Long-Context Generation, https://arxiv.org/abs/2412.13649
19 Dec, LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-Context Multitasks, https://arxiv.org/abs/2412.15204
20 Dec, Offline Reinforcement Learning for LLM Multi-Step Reasoning, https://arxiv.org/abs/2412.16145
24 Dec, Mulberry: Empowering MLLM with O1-like Reasoning and Reflection via Collective Monte Carlo Tree Search, https://arxiv.org/abs/2412.18319
31 Dec, Titans: Learning to Memorize at Test Time, https://arxiv.org/abs/2501.00663

This magazine is a personal passion project. For those who wish to support me, please consider purchasing a copy of my Build a Large Language Model (From Scratch) book. (I am confident that you'll get a lot out of this book, as it explains how LLMs work at a level of detail not found anywhere else.)

Build a Large Language Model (From Scratch), now available on Amazon

If you read the book and have a few minutes to spare, I'd really appreciate a brief review. It helps us authors a lot!

Alternatively, I also recently enabled the paid subscription option on Substack to support this magazine directly.
To receive new posts and support my work, consider becoming a free or paid subscriber.