Posts in Odin (17 found)
Ginger Bill 2 months ago

If Odin Had Macros

I sometimes get asked if Odin has any plans to add hygienic macros or some other similar construct. My general, and now (in)famous, answer to many such questions is: No. I am not against macros nor metaprogramming in general, and in fact I make metaprograms quite often (i.e. programs that make/analyse a program). However, my approach to the design of Odin has been extremely pragmatic. I commonly ask people who ask for such things what they are specifically trying to solve. Usually they are not trying to do anything whatsoever and are just thinking about non-existent hypotheticals. However, in the cases where people are trying to do something that they believe they need a macro for, Odin usually has a different approach to doing it, and that approach is usually much better and more suited to the specific needs of the problem.

This is the surprising thing about designing Odin: I was expecting that I would need some form of compile-time metaprogramming at the language level, be it hygienic macros, compile-time execution, or even complete compile-time AST modification. But for every problem I came across, I found a better solution that could be achieved with a different language construct or idiom. My original hypothesis has, in general, been shown to be wrong.

n.b. This is all hypothetical, and the construct is very unlikely to happen in Odin.

One of the best cases for a macro-like construct that I can think of, and which Odin does not support, would be push-based iterators. Since Go 1.23, Go has a way of doing push-based iterators. I’ve written on this topic before (Why People are Angry over Go 1.23 Iterators) and how they work. The main issue I have with them, and which many other individuals have complained about too, is that they effectively rely on closures three levels deep: you have a function that returns a closure, which in turn takes in a closure that is automatically and implicitly generated from the body of a loop. This is honestly not going to be the best construct for performance by any stretch because it relies so heavily on closures.

Since Odin has no concept of a closure, being a language with manual memory management 1 , it would not be possible to add the same approach that Go does. However, there is a way of achieving something similar which requires no closures, and it would produce very good code due to its imperative nature. Using the example from the previous Go article I wrote, consider the pseudo-syntax for such an iterator, along with the internally expanded code that the backend would produce for it (a rough sketch of the idea is given at the end of this post).

This is of course a restricted form of a hygienic macro, only applicable to iterators, and it is the only way to achieve such a construct in the language. However, the way you’d write the macro is still extremely imperative in nature. The push-iterator approach allows you to store state/data in the control flow itself, without the need for the explicit state which a pull-iterator would require. A more common example would be the iteration over a custom hash map.

This approach to iterators is very imperative and not very “composable” in the functional sense: you cannot chain multiple iterators together using this approach. I personally don’t have much need for composing iterators in practice; I usually just want the ability to iterate across a custom data structure and that’s it. I honestly don’t think the composability of iterators is an actual need most programmers have, but rather something that “seems cool” 2 to use.
I can’t think of a case when I’ve actually wanted to use reusable composable iterators either, and when I’ve had something close to that, I’ve just written the code in-line since it was always a one-off.

It’s unlikely that I would ever add a construct like this to Odin. Not because I think it isn’t useful (it obviously is), but rather because it is a slippery slope construct. A lot of features and constructs that have been proposed to me over the years fall into this category. A slippery slope is rarely a fallacy, in my opinion 3 , when it comes to design: if you allow something at one point, there isn’t much justification to stop it later on. Giving in to one whim does lead to giving in to another whim. In this case, I’d argue the slippery slope is the design for more hygienic macros throughout the language, not just to modify pre-existing control flow but to add new control flow, or to add other constructs to the language. This is why I have to say no.

To keep Odin the language that is loved by so many people, I have to hold the reins and steer it with a clear vision in the right direction. If I allow anyone’s and everyone’s desire to slip into the language, it will become worse than a design-by-committee language such as C++. The road to hell is not paved with good intentions but rather with the lack of intention. So my current answer, at the time of writing, to this construct is: No.

1. I’d argue that actual closures which are unified everywhere as a single procedure type with non-capturing procedure values require some form of automatic memory management. That does not necessarily mean garbage collection nor ARC, but it could be something akin to RAII. This is all still automatic and against the philosophy of Odin. ↩︎
2. Remember, you’re a bunch of programmers–you’re not cool. ↩︎
3. I’d easily argue that if it actually is a slippery slope, it cannot be a fallacy. ↩︎
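The code snippets referenced in the post above (the push-iterator pseudo-syntax, its backend expansion, and the custom hash-map iteration) are not reproduced in this extract. What follows is only a minimal Odin sketch of the idea, not code from the original article: the yield-style pseudo-syntax in the comments is a hypothetical construct that does not exist in Odin, and My_Map is an invented toy layout used purely for illustration. The point it tries to show is that the expansion of such a push iterator is plain, closure-free, imperative code.

```odin
package main

import "core:fmt"

// Invented toy map layout, purely for illustration.
My_Map :: struct {
	keys:     [dynamic]string,
	values:   [dynamic]int,
	occupied: [dynamic]bool,
}

// Hypothetical push-iterator pseudo-syntax (NOT valid Odin, and not necessarily
// what the original article sketched):
//
//	map_iterator :: proc(m: ^My_Map) -> (key: string, value: int) {
//		for i in 0..<len(m.keys) {
//			if m.occupied[i] {
//				yield m.keys[i], m.values[i]   // hypothetical keyword
//			}
//		}
//	}
//
//	for key, value in map_iterator(&m) {
//		fmt.println(key, value)
//	}
//
// Because the construct would be restricted to iteration, the backend could
// inline the iterator body and substitute the user's loop body at the yield
// point, with no closures involved. Hand-expanded, that is just:

main :: proc() {
	m: My_Map
	append(&m.keys, "a"); append(&m.values, 1); append(&m.occupied, true)
	append(&m.keys, "b"); append(&m.values, 2); append(&m.occupied, false)
	append(&m.keys, "c"); append(&m.values, 3); append(&m.occupied, true)

	// Hand-written equivalent of the expanded push iterator above.
	for i in 0..<len(m.keys) {
		if m.occupied[i] {
			key, value := m.keys[i], m.values[i]
			fmt.println(key, value) // the original loop body, inserted at the yield point
		}
	}
}
```

By contrast, a pull-iterator over the same map would need an explicit iterator struct carrying the current index between calls, which is exactly the explicit state that the push form lets you keep implicitly in the control flow.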

Ginger Bill 5 months ago

Unstructured Thoughts on the Problems of OSS/FOSS

Originally from replies to a Twitter thread: https://x.com/TheGingerBill/status/1914389352416993395

This is not a structured argument against FOSS/OSS but rather my uncommon thoughts on the topic. I am not sure if I agree [that FOSS/OSS derives from the same thinking process as the ideology of communism], but I understand the sentiment. The fundamental issue is that software is trivially copyable.

I have loads of issues with FOSS and OSS 1 . And part of this “ideology” (as presented in the original post) is naïvety coupled with only first-order thinking and a poor understanding of ownership. Software isn’t property in the normal sense of that word, so trying to apply the same rules to it is why it starts to break down; coupled with the first-order thinking that many FOSS advocates engage in, it leads to unsustainable practices and single sources of dependency failure. The simplest argument (which is technically wrong) is the one I already gave: software is trivially copyable, and thus not really “property” in any traditional sense of that word.

Software, at the end of the day, is a very complicated and complex implementation of an algorithm. It is neither physical property nor intellectual property, the latter of which is closer to an honour system than a property rights system, even if it has the term “property” in its phrase. So in a weird sense, it fits into a different category (or maybe a subcategory), because it’s not the algorithm/idea itself which is “precious” but rather the implementation of it—the software itself. The foundational assumptions of FOSS/OSS about the “right” to “redistribute” blend a few different aspects together. The licence itself is an honour system applied to the “software”. But the question is: is that even applicable in the first place?

There are a lot of oddities when it comes to copyright and trademark law, which mostly exist due to practicality rather than principle. A good example is that recipes 2 cannot be copyrighted, even though from a principled standpoint there isn’t any reason why not. Recipes have been passed around all over the place by numerous people over the years, so the origins are hard to trace, and even harder to enforce. This is why many industries have “trade secrets”, to protect their place in industry. Letting people know your “secrets” means that they are “secrets” no more. Even legally, (secret) recipes are classed the same as “trade secrets”.

You could argue that letting people have more knowledge is a “net benefit for society”, but that is the first-order thinking I am talking about. The assumption that “the truth will set you free” adds the assumption that everyone should know it in the first place. I am not making a pseudo-gnostic argument, but rather saying that some secrets are best kept, well, secret. It also makes sense from a business perspective not to let your competitors know how you do things if you want some sort of technical advantage. But this is still first-order thinking. To go second-order (which is still not that deep), it also means that people tend to rely on those “ideas” rather than evolving and generating more. It means that people just rely on the free things rather than trying to come up with other approaches.

To clarify further what I mean by first-order thinking: it’s thinking about the immediate results rather than the longer-term, more complex and complicated results which require thinking in higher orders. A good analogy would be a Taylor series.
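For concreteness (the standard expansion, included here only to make the analogy concrete), the Taylor series of a function $f$ about a point $a$ is

$$
f(x) = f(a) + f'(a)\,(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \frac{f'''(a)}{3!}(x - a)^3 + \cdots
$$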
In this analogy, first-order is a linear approximation, whilst second-order would be quadratic, etc. As you add more terms, you get a more accurate approximation of the actual function, but this fitting approach still has numerous limitations and flaws. And in the case of thinking, more and more orders might not be easy (or even possible) to think about.

Ideas have virtually no cost to them, or even a negative cost, and as such, when something is free, people cannot rationally perform a cost-benefit analysis of such ideas. It assumes that a “marketplace of ideas” is possible, when a market requires a price mechanism to work. A rational marketplace requires a way to rationally calculate costs from prices (as costs are determined by prices, not the other way around). There is a reason the XKCD 2347 meme exists. People will rely on something just because it is free and “forget” (I am using that term loosely) that everything has a cost to it.

And I do run an Open Source (not FOSS) project: the Odin Programming Language. If there was a possibility of actually selling a compiler nowadays, I would, but because the expected price of a compiler for a new language is now free, that is nigh impossible. You have to either rely on charity or on companies that rely on the product to pay for support. I am grateful for the amount of bug reports, Pull Requests, and general usage of Odin. It is extremely surreal that I work with a company that uses my language for all of their products, and I get paid to do it. Some of my time is spent working on the Odin compiler and Odin core library, but a lot of it is actually just working on the products themselves. And that’s what I made Odin for in the first place: a language I could actually program in to make things—a means to an end, not an end in itself.

There does seem to be a common feeling of guilt among programmers that they should give their knowledge to the world freely. But why are they feeling guilty? Is that a correctly placed emotion? Is it even valid? And why should you give your knowledge and wisdom away for free? Just because you got it for free?

I could also add, though I am not going to make this argument in general, that FOSS specifically “forces charity” on others, an act which is itself not virtuous but vicious. This, I am guessing, is why the original poster says it is similar to the “ideology of communism”: he is viewing it through the “forced charity” aspect of the “ideology”. It is also a very specific conception of charity, the view that charity is free-as-in-beer rather than a virtue of the friendship of man (or, in the traditional theological conception, “the friendship of man for God”). A lot of charity is “free” but a lot is not. You can still be beneficial to others whilst making a profit. There is nothing intrinsically wrong with making a profit in itself 3 . I’d trivially argue that people who release their source code to those who pay for it, when they don’t have to, are still being charitable. Yes, it does have a monetary cost, but that does not mean it isn’t a form of charity. But OSS/FOSS as a practice encourages, though it does not force, this by telling people they ought to work for free and give their services away for free.

To be clear, my position on this topic is far from a common one, and I don’t think it’s a “strawman” by any stretch of the imagination.
There are many “definitions” of what OSS and FOSS are, but I’d argue most are idealistic forms which, on the whole (except in certain instances), do not bring forth the ideals that they set out. To use a common cybernetics phrase (POSIWID): The purpose of a system is what it does, and there is no point claiming that the purpose of a system is to do what it constantly fails to do. “Of course” you can hypothetically make money from OSS/FOSS, but that doesn’t mean it is possible, sustainable, or even desirable. And I know this from first-hand experience. I am always grateful for all of the donations received for the Odin programming language through people’s acts of charity, and all of that money does go towards the development of Odin. However, it’s not a sustainable nor maintainable model—and I’d argue it never has been nor could ever be. And for every purported exception to this rule, I’d try to argue that none of them are because of the OSS/FOSS model but rather purely down to the individual programmer/developer.

The OSS/FOSS dream is just that, a dream that cannot live up to its “ideals”. For every hypothetical benefit it has, it should be stated that it is a hypothesis and not a theory—theories require evidence to be classed as such. Most of the benefits and freedoms of OSS/FOSS are double-edged swords which are also huge attack vectors: vectors for security, sustainability, maintainability, and redistribution. Most of the industry is based on blind faith without any way to verify that blind trust. Regardless of whether I am correctly or incorrectly “defining” OSS/FOSS according to your preferred “definition”, the multi-order effects are rarely considered. And to bastardize Lysander Spooner: this much is certain—that it has either authorized such a tech oligarchy as we have had, or has been powerless to prevent it. In either case, it is unfit to exist 4 .

All of these lists of ideals of essential freedoms—and I’d argue they are not principles in the slightest—have not aided in anything in the long run. To use GNU’s list for example (the four essential freedoms: freedom 0, to run the program as you wish; freedom 1, to study and change the source code; freedom 2, to redistribute copies; freedom 3, to distribute copies of your modified versions): as to my previous statement, none of these are principles. “Freedom 0”, the foundation for the rest, isn’t even foundational. It pretty much relies on a time preference between using pre-made software and home-made software. Software could be a service, but it’s also, again, an implementation of an algorithm/idea. Of course I know these “ideals” only apply to some FOSS advocates, and not necessarily any OSS advocates, but it’s still the general perception of it.

To conclude from this very unstructured brain-dump: I can understand the original sentiment that a lot of the mentality of advocating for OSS/FOSS comes from a similar standpoint to the “ideology of communism”, but I do not conceptualize it that way. I don’t think OSS nor FOSS has been a good thing for software, and it has probably been a huge accelerant of why software has been getting worse over the years.

1. I am making a distinction between OSS (Open Source Software) and FOSS (Free [and] Open Source Software) in this article, but most of my critiques relate to both. ↩︎
2. I’d argue patents are fundamentally of the same ilk as “recipes”, and software patents even more so. I personally don’t like patents that much except in a few cases, but I really do dislike software patents with a passion. However, that’s a different topic for a different day. ↩︎
3. Unless you are of course a communist who views that it is, but that’s a different discussion for a different day. ↩︎
4. The original quote: “But whether the Constitution really be one thing, or another, this much is certain—that it has either authorized such a government as we have had, or has been powerless to prevent it. In either case it is unfit to exist.” — Lysander Spooner, No Treason: The Constitution of No Authority. ↩︎

Notes on the individual freedoms:

On freedom 0: Assuming you can even run it in the first place. This statement is also kind of vague, but I won’t go into it too much.

On freedom 1: This is actually two “freedoms” combined together. The first is access to source code, and the second is that “secrets” should not exist (i.e. secret sauce/source). And why should that be free-as-in-beer in practice? I understand the “ideal” here is not suggesting it ought to be free-as-in-beer, but that is the end result. And I’d argue the vast majority of FOSS advocates would say paying for source (i.e. Source Available) is not Open Source.

On freedom 2: Why?! Do we allow this for other forms of intellectual property? If software is intellectual property, why is it different? I know I’ve made the argument that it is in a separate category, but if it is actually a form of intellectual property, then our legal systems do not and should not allow for this.

On freedom 3: This “freedom” is probably the most egregious with respect to the “ideology of communism” proclamation. The viral nature of licences like the GPL is a fundamentally pernicious aspect of the FOSS (not necessarily OSS) movement. It’s this idea of “forcing” charity on others. Of course you are “free” to not use the software, but if there are virtually no other options, you either have to write the thing yourself, or you have the potential to have virtually no business model. This point is also an extension of freedom 2 and, as such, does not help the case.

//pauls dev blog 7 months ago

5+2 Phenomenal FREE Platforms To Learn Coding

Some time ago, fellow colleagues who were new asked me how they could learn to code without following the "old path" of attending a university or doing an apprenticeship. If you are also new to the world of coding, it makes sense to start by teaching yourself using one of the many free resources found online. By taking advantage of these free resources, you can learn what you want and not waste money upfront. Once you’ve gone through enough free coding platforms, you will know what you want and can specialize in that field of coding.

One of the best websites to improve your coding is freeCodeCamp. It contains numerous exercises based on different topics and languages for practice. Also, this website provides a means to get certified based on your skills by taking their test. I would highly suggest having a look at  their website here  and starting to learn.

Exercism is a platform that exists to help as many people as possible attain fluency in  ANY  programming language. It provides many concept exercises to improve your coding skills. The best thing about this website is that it publishes all information and all tutorials  for free . Also, you are able to keep track of your progress. Additionally, you can opt in as a mentor and share your knowledge with other people. Exercism contains tutorials/exercises for 55 different languages you might want to master:  Python ,  JavaScript ,  TypeScript ,  Go , and  many more . Here is a link to the Exercism website.

The Odin Project is a platform where everyone can learn coding in  Ruby on Rails  and  JavaScript.  The platform is designed for people who could not attend an intensive coding school or did not have access to a good computer science education. The Odin Project follows these beliefs: education should be free and accessible, you learn best by actually building, motivation is fueled by working with others, and open source is best. I personally think that this project is mandatory for every person who wants to learn either of these two technologies. Here is a link to the website.

Codecademy was started with the goal of giving anyone in the world the ability to learn the skills they need to succeed in the 21st century. To achieve this goal, Codecademy provides ~15 courses in different programming languages. Many of them are free, but some are only included within the Pro version, which costs $18 a month. At the moment, many free courses are available within the catalog. You can  start a quiz here  to find out which course is best suited for you.

To be clear, this next one isn’t a coding platform itself, but it’s a great resource for community-curated programming courses. You can simply search for the programming language you want to learn, and you’ll get a list of the best courses, tutorials, and books that are recommended by coders and available online. Here is a link to the website.

WarriorJS is a learning platform that teaches you JavaScript while you are playing a game. This game is designed for new or advanced JavaScript programmers and will put your skills to the test! Here is a link to the website.

Elevator Saga is a game where you have to use JavaScript to transport people with an elevator in an efficient manner. While progressing through the different stages, you have to complete increasingly difficult challenges. Only efficient programs will be able to complete all of them. Here is a link to the website.

In this article, I showed five cool coding platforms (plus two coding games) that you can use to start coding. Taking advantage of any of the free coding resources out there is definitely the way to go when you are just starting.
If you want a gamification approach to learning to code, you should check out my article on that. Also, if you want a full path to follow, you can follow my guide to becoming a Full Stack developer. And if JavaScript is not your thing and you would rather learn Python, I have gathered some resources (videos, books, and websites) and listed them in another article.

I hope you find some of these coding platforms helpful and discover a suitable platform to start your learning. I would love to hear your ideas and thoughts. If you know of another free coding platform, don’t hesitate to comment here. Also, if you have any questions, please jot them down below; I will try to answer them if possible. Feel free to connect with me on  Medium ,  LinkedIn ,  Twitter ,  BlueSky , and  GitHub .

Ahead of AI 10 months ago

LLM Research Papers: The 2024 List

It’s been a very eventful and exciting year in AI research. This is especially true if you are interested in LLMs. I had big plans for this December edition and was planning to publish a new article with a discussion of all my research highlights from 2024. I still plan to do so, but due to an accident and serious injury, I am currently unable to work at a computer and finish the draft. But I hope to recover in the upcoming weeks and be back on my feet soon. In the meantime, I want to share my running bookmark list of many fascinating (mostly LLM-related) papers I stumbled upon in 2024. It’s just a list, but maybe it will come in handy for those who are interested in finding some gems to read for the holidays. And if you are interested in more code-heavy reading and tinkering, My Build A Large Language Model (From Scratch) book is out on Amazon as of last month. In addition, I added a lot of bonus materials to the GitHub repository . Bonus materials in the GitHub repository (stars highlight my personal favorites) Thanks for your understanding and support, and I hope to make a full recovery soon and be back with the Research Highlights 2024 article in a few weeks! Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. 1 Jan, Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models , https://arxiv.org/abs/2401.00788 2 Jan, A Comprehensive Study of Knowledge Editing for Large Language Models , https://arxiv.org/abs/2401.01286 2 Jan, LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning , https://arxiv.org/abs/2401.01325 2 Jan, Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models , https://arxiv.org/abs/2401.01335 2 Jan, LLaMA Beyond English: An Empirical Study on Language Capability Transfer , https://arxiv.org/abs/2401.01055 3 Jan, A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity , https://arxiv.org/abs/2401.01967 4 Jan, LLaMA Pro: Progressive LLaMA with Block Expansion , https://arxiv.org/abs/2401.02415 4 Jan, LLM Augmented LLMs: Expanding Capabilities through Composition , https://arxiv.org/abs/2401.02412 4 Jan, Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM , https://arxiv.org/abs/2401.02994 5 Jan, DeepSeek LLM: Scaling Open-Source Language Models with Longtermism , https://arxiv.org/abs/2401.02954 5 Jan, Denoising Vision Transformers , https://arxiv.org/abs/2401.02957 7 Jan, Soaring from 4K to 400K: Extending LLM’s Context with Activation Beacon , https://arxiv.org/abs/2401.03462 8 Jan, Mixtral of Experts , https://arxiv.org/abs/2401.04088 8 Jan, MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts , https://arxiv.org/abs/2401.04081 8 Jan, A Minimaximalist Approach to Reinforcement Learning from Human Feedback , https://arxiv.org/abs/2401.04056 9 Jan, RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation , https://arxiv.org/abs/2401.04679 10 Jan, Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training , https://arxiv.org/abs/2401.05566 11 Jan, Transformers are Multi-State RNNs , https://arxiv.org/abs/2401.06104 11 Jan, A Closer Look at AUROC and AUPRC under Class Imbalance , https://arxiv.org/abs/2401.06091 12 Jan, An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models , https://arxiv.org/abs/2401.06692 16 Jan, Tuning Language Models by Proxy , https://arxiv.org/abs/2401.08565 16 Jan, 
Scalable Pre-training of Large Autoregressive Image Models , https://arxiv.org/abs/2401.08541 16 Jan, Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering , https://arxiv.org/abs/2401.08500 16 Jan, RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture , https://arxiv.org/abs/2401.08406 17 Jan, ReFT: Reasoning with Reinforced Fine-Tuning , https://arxiv.org/abs/2401.08967 18 Jan, DiffusionGPT: LLM-Driven Text-to-Image Generation System , https://arxiv.org/abs/2401.10061 18 Jan, Self-Rewarding Language Models , https://arxiv.org/abs/2401.10020 18 Jan, VMamba: Visual State Space Model , https://arxiv.org/abs/2401.10166 19 Jan, Knowledge Fusion of Large Language Models , https://arxiv.org/abs/2401.10491 22 Jan, SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities , https://arxiv.org/abs/2401.12168 22 Jan, WARM: On the Benefits of Weight Averaged Reward Models , https://arxiv.org/abs/2401.12187 22 Jan, Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text , https://arxiv.org/abs/2401.12070 24 Jan, MambaByte: Token-free Selective State Space Model , https://arxiv.org/abs/2401.13660 24 Jan, SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection , https://arxiv.org/abs/2401.13160 25 Jan, Rethinking Patch Dependence for Masked Autoencoders , https://arxiv.org/abs/2401.14391 25 Jan, Pix2gestalt: Amodal Segmentation by Synthesizing Wholes , https://arxiv.org/abs/2401.14398 25 Jan, Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities , https://arxiv.org/abs/2401.14405 26 Jan, EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty , https://arxiv.org/abs/2401.15077 29 Jan, MoE-LLaVA: Mixture of Experts for Large Vision-Language Models , https://arxiv.org/abs/2401.15947 29 Jan, Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling , https://arxiv.org/abs/2401.16380 31 Jan, KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization , https://arxiv.org/abs/2401.18079 1 Feb, Efficient Exploration for LLMs , https://arxiv.org/abs/2402.00396 1 Feb, OLMo: Accelerating the Science of Language Models , https://arxiv.org/abs/2402.00838 1 Feb, Tiny Titans: Can Smaller Large Language Models Punch Above Their Weight in the Real World for Meeting Summarization? 
, https://arxiv.org/abs/2402.00841 1 Feb, Repeat After Me: Transformers are Better than State Space Models at Copying , https://arxiv.org/abs/2402.01032 2 Feb, LiPO: Listwise Preference Optimization through Learning-to-Rank , https://arxiv.org/abs/2402.01878 2 Feb, FindingEmo: An Image Dataset for Emotion Recognition in the Wild , https://arxiv.org/abs/2402.01355 3 Feb, More Agents Is All You Need , https://arxiv.org/abs/2402.05120 5 Feb, DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , https://arxiv.org/abs/2402.03300 6 Feb, MobileVLM V2: Faster and Stronger Baseline for Vision Language Model , https://arxiv.org/abs/2402.03766 6 Feb, A Phase Transition Between Positional and Semantic Learning in a Solvable Model of Dot-Product Attention , https://arxiv.org/abs/2402.03902 6 Feb, Scaling Laws for Downstream Task Performance of Large Language Models , https://arxiv.org/abs/2402.04177 6 Feb, MOMENT: A Family of Open Time-series Foundation Models , https://arxiv.org/abs/2402.03885 6 Feb, Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models , https://arxiv.org/abs/2402.03749 6 Feb, Self-Discover: Large Language Models Self-Compose Reasoning Structures , https://arxiv.org/abs/2402.03620 7 Feb, Grandmaster-Level Chess Without Search , https://arxiv.org/abs/2402.04494 7 Feb, Direct Language Model Alignment from Online AI Feedback , https://arxiv.org/abs/2402.04792 8 Feb, Buffer Overflow in Mixture of Experts , https://arxiv.org/abs/2402.05526 9 Feb, The Boundary of Neural Network Trainability is Fractal , https://arxiv.org/abs/2402.06184 11 Feb, ODIN: Disentangled Reward Mitigates Hacking in RLHF , https://arxiv.org/abs/2402.07319 12 Feb, Policy Improvement using Language Feedback Models , https://arxiv.org/abs/2402.07876 12 Feb, Scaling Laws for Fine-Grained Mixture of Experts , https://arxiv.org/abs/2402.07871 12 Feb, Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model , https://arxiv.org/abs/2402.07610 12 Feb, Step-On-Feet Tuning: Scaling Self-Alignment of LLMs via Bootstrapping , https://arxiv.org/abs/2402.07610 12 Feb, Suppressing Pink Elephants with Direct Principle Feedback , https://arxiv.org/abs/2402.07896 13 Feb, World Model on Million-Length Video And Language With RingAttention , https://arxiv.org/abs/2402.08268 13 Feb, Mixtures of Experts Unlock Parameter Scaling for Deep RL , https://arxiv.org/abs/2402.08609 14 Feb, DoRA: Weight-Decomposed Low-Rank Adaptation , https://arxiv.org/abs/2402.09353 14 Feb, Transformers Can Achieve Length Generalization But Not Robustly , https://arxiv.org/abs/2402.09371 15 Feb, BASE TTS: Lessons From Building a Billion-Parameter Text-to-Speech Model on 100K Hours of Data , https://arxiv.org/abs/2402.08093 15 Feb, Recovering the Pre-Fine-Tuning Weights of Generative Models , https://arxiv.org/abs/2402.10208 15 Feb, Generative Representational Instruction Tuning , https://arxiv.org/abs/2402.09906 16 Feb, FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models , https://arxiv.org/abs/2402.10986 17 Feb, OneBit: Towards Extremely Low-bit Large Language Models , https://arxiv.org/abs/2402.11295 18 Feb, LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration , https://arxiv.org/abs/2402.11550 19 Feb, Reformatted Alignment , https://arxiv.org/abs/2402.12219 19 Feb, AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling , https://arxiv.org/abs/2402.12226 19 Feb, Towards Cross-Tokenizer Distillation: the Universal Logit 
Distillation Loss for LLMs , https://arxiv.org/abs/2402.12030 19 Feb, LoRA+: Efficient Low Rank Adaptation of Large Models , https://arxiv.org/abs/2402.12354 20 Feb, Neural Network Diffusion , https://arxiv.org/abs/2402.13144 21 Feb, YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information , https://arxiv.org/abs/2402.13616 21 Feb, LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens , https://arxiv.org/abs/2402.13753 21 Feb, Large Language Models for Data Annotation: A Survey , https://arxiv.org/abs/2402.13446 22 Feb, TinyLLaVA: A Framework of Small-scale Large Multimodal Models , https://arxiv.org/abs/2402.14289 22 Feb, Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs , https://arxiv.org/abs/2402.14740 23 Feb, Genie: Generative Interactive Environments , https://arxiv.org/abs/2402.15391 26 Feb, CARTE: Pretraining and Transfer for Tabular Learning , https://arxiv.org/abs/2402.16785 27 Feb, The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits , https://arxiv.org/abs/2402.17764 27 Feb, Sora Generates Videos with Stunning Geometrical Consistency , https://arxiv.org/abs/2402.17403 27 Feb, When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method , https://arxiv.org/abs/2402.17193 29 Feb, Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models , https://arxiv.org/abs/2402.19427 1 Mar, Learning and Leveraging World Models in Visual Representation Learning , https://arxiv.org/abs/2403.00504 3 Mar, Improving LLM Code Generation with Grammar Augmentation , https://arxiv.org/abs/2403.01632 3 Mar, The Hidden Attention of Mamba Models , https://arxiv.org/abs/2403.01590 4 Mar, Training-Free Pretrained Model Merging , https://arxiv.org/abs/2403.01753 4 Mar, Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures , https://arxiv.org/abs/2403.02308 5 Mar, The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning , https://arxiv.org/abs/2403.03218 5 Mar, Evolution Transformer: In-Context Evolutionary Optimization , https://arxiv.org/abs/2403.02985 5 Mar, Enhancing Vision-Language Pre-training with Rich Supervisions , https://arxiv.org/abs/2403.03346 5 Mar, Scaling Rectified Flow Transformers for High-Resolution Image Synthesis , https://arxiv.org/abs/2403.03206 5 Mar, Design2Code: How Far Are We From Automating Front-End Engineering? , https://arxiv.org/abs/2403.03163 6 Mar, ShortGPT: Layers in Large Language Models are More Redundant Than You Expect , https://arxiv.org/abs/2403.03853 6 Mar, Backtracing: Retrieving the Cause of the Query , https://arxiv.org/abs/2403.03956 6 Mar, Learning to Decode Collaboratively with Multiple Language Models , https://arxiv.org/abs/2403.03870 6 Mar, SaulLM-7B: A pioneering Large Language Model for Law , https://arxiv.org/abs/2403.03883 6 Mar, Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning , https://arxiv.org/abs/2403.03864 6 Mar, 3D Diffusion Policy , https://arxiv.org/abs/2403.03954 6 Mar, MedMamba: Vision Mamba for Medical Image Classification , https://arxiv.org/abs/2403.03849 6 Mar, GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection , https://arxiv.org/abs/2403.03507 6 Mar, Stop Regressing: Training Value Functions via Classification for Scalable Deep RL , https://arxiv.org/abs/2403.03950 7 Mar, How Far Are We from Intelligent Visual Deductive Reasoning? 
, https://arxiv.org/abs/2403.04732 7 Mar, Common 7B Language Models Already Possess Strong Math Capabilities , https://arxiv.org/abs/2403.04706 8 Mar, Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context , https://arxiv.org/abs/2403.05530 8 Mar, Is Cosine-Similarity of Embeddings Really About Similarity? , https://arxiv.org/abs/2403.05440 8 Mar, LLM4Decompile: Decompiling Binary Code with Large Language Models , https://arxiv.org/abs/2403.05286 9 Mar, Algorithmic Progress in Language Models , https://arxiv.org/abs/2403.05812 11 Mar, Stealing Part of a Production Language Model , https://arxiv.org/abs/2403.06634 12 Mar, Chronos: Learning the Language of Time Series , https://arxiv.org/abs/2403.07815 13 Mar, Simple and Scalable Strategies to Continually Pre-train Large Language Models , https://arxiv.org/abs/2403.08763 13 Mar, Language Models Scale Reliably With Over-Training and on Downstream Tasks , https://arxiv.org/abs/2403.08540 14 Mar, BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences , https://arxiv.org/abs/2403.09347 14 Mar, LocalMamba: Visual State Space Model with Windowed Selective Scan , https://arxiv.org/abs/2403.09338 14 Mar, GiT: Towards Generalist Vision Transformer through Universal Language Interface , https://arxiv.org/abs/2403.09394 14 Mar, MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training , https://arxiv.org/abs/2403.09611 15 Mar, RAFT: Adapting Language Model to Domain Specific RAG , https://arxiv.org/abs/2403.10131 18 Mar, TnT-LLM: Text Mining at Scale with Large Language Models , https://arxiv.org/abs/2403.12173 18 Mar, Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression , https://arxiv.org/abs/2403.15447 19 Mar, PERL: Parameter Efficient Reinforcement Learning from Human Feedback , https://arxiv.org/abs/2403.10704 20 Mar, RewardBench: Evaluating Reward Models for Language Modeling , https://arxiv.org/abs/2403.13787 20 Mar, LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models , https://arxiv.org/abs/2403.13372 21 Mar, RakutenAI-7B: Extending Large Language Models for Japanese , https://arxiv.org/abs/2403.15484 22 Mar, SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time Series , https://arxiv.org/abs/2403.15360 22 Mar, Can Large Language Models Explore In-Context? 
, https://arxiv.org/abs/2403.15371 22 Mar, LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement , https://arxiv.org/abs/2403.15042 25 Mar, LLM Agent Operating System , https://arxiv.org/abs/2403.16971 26 Mar, The Unreasonable Ineffectiveness of the Deeper Layers , https://arxiv.org/abs/2403.17887 27 Mar, BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text , https://arxiv.org/abs/2403.18421 27 Mar, ViTAR: Vision Transformer with Any Resolution , https://arxiv.org/abs/2403.18361 27 Mar, Long-form Factuality in Large Language Models , https://arxiv.org/abs/2403.18802 27 Mar, Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models , https://arxiv.org/abs/2403.18814 26 Mar, LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning , https://arxiv.org/abs/2403.17919 26 Mar, Mechanistic Design and Scaling of Hybrid Architectures , https://arxiv.org/abs/2403.17844 28 Mar, MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions , https://arxiv.org/abs/2403.19651 28 Mar, Model Stock: All We Need Is Just a Few Fine-Tuned Models , https://arxiv.org/abs/2403.19522 1 Apr, Do Language Models Plan Ahead for Future Tokens? , https://arxiv.org/abs/2404.00859 1 Apr, Bigger is not Always Better: Scaling Properties of Latent Diffusion Models , https://arxiv.org/abs/2404.01367 1 Apr, The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis , https://arxiv.org/abs/2404.01204 1 Apr, Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models , https://arxiv.org/abs/2404.04478 2 Apr, Mixture-of-Depths: Dynamically Allocating Compute in Transformer-Based Language Models , https://arxiv.org/abs/2404.02258 2 Apr, Long-context LLMs Struggle with Long In-context Learning , https://arxiv.org/abs/2404.02060 2 Apr, Emergent Abilities in Reduced-Scale Generative Language Models , https://arxiv.org/abs/2404.02204 2 Apr, Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks , https://arxiv.org/abs/2404.02151 3 Apr, On the Scalability of Diffusion-based Text-to-Image Generation , https://arxiv.org/abs/2404.02883 3 Apr, BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models , https://arxiv.org/abs/2404.02827 3 Apr, Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models , https://arxiv.org/abs/2404.02747 4 Apr, Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences , https://arxiv.org/abs/2404.02151 4 Apr, Training LLMs over Neurally Compressed Text , https://arxiv.org/abs/2404.03626 4 Apr, CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues , https://arxiv.org/abs/2404.03820 5 Apr, ReFT: Representation Finetuning for Language Models , https://arxiv.org/abs/2404.03592 5 Apr, Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data , https://arxiv.org/abs/2404.03862 5 Apr, Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation , https://arxiv.org/abs/2404.04256 8 Apr, AutoCodeRover: Autonomous Program Improvement , https://arxiv.org/abs/2404.05427 8 Apr, Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence , https://arxiv.org/abs/2404.05892 8 Apr, CodecLM: Aligning Language Models with Tailored Synthetic Data , https://arxiv.org/abs/2404.05875 9 Apr, MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies , https://arxiv.org/abs/2404.06395 9 Apr, Elephants Never Forget: 
Memorization and Learning of Tabular Data in Large Language Models , https://arxiv.org/abs/2404.06209 9 Apr, LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders , https://arxiv.org/abs/2404.05961 10 Apr, Adapting LLaMA Decoder to Vision Transformer , https://arxiv.org/abs/2404.06773 10 Apr, Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention , https://arxiv.org/abs/2404.07143 11 Apr, LLoCO: Learning Long Contexts Offline , https://arxiv.org/abs/2404.07979 11 Apr, JetMoE: Reaching Llama2 Performance with 0.1M Dollars , https://arxiv.org/abs/2404.07413 11 Apr, Best Practices and Lessons Learned on Synthetic Data for Language Models , https://arxiv.org/abs/2404.07503 11 Apr, Rho-1: Not All Tokens Are What You Need , https://arxiv.org/abs/2404.07965 12 Apr, Pre-training Small Base LMs with Fewer Tokens , https://arxiv.org/abs/2404.08634 12 Apr, Dataset Reset Policy Optimization for RLHF , https://arxiv.org/abs/2404.08495 13 Apr, LLM In-Context Recall is Prompt Dependent , https://arxiv.org/abs/2404.08865 15 Apr, State Space Model for New-Generation Network Alternative to Transformers: A Survey , https://arxiv.org/abs/2404.09516 15 Apr, Chinchilla Scaling: A Replication Attempt , https://arxiv.org/abs/2404.10102 15 Apr, Learn Your Reference Model for Real Good Alignment , https://arxiv.org/abs/2404.09656 16 Apr, Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study , https://arxiv.org/abs/2404.10719 16 Apr, Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies , https://arxiv.org/abs/2404.08197 16 Apr, How Faithful Are RAG Models? Quantifying the Tug-of-War Between RAG and LLMs' Internal Prior , https://arxiv.org/abs/2404.10198 17 Apr, A Survey on Retrieval-Augmented Text Generation for Large Language Models , https://arxiv.org/abs/2404.10981 18 Apr, When LLMs are Unfit Use FastFit: Fast and Effective Text Classification with Many Classes , https://arxiv.org/abs/2404.12365 18 Apr, Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing , https://arxiv.org/abs/2404.12253 18 Apr, OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data , https://arxiv.org/abs/2404.12195 19 Apr, The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions , https://arxiv.org/abs/2404.13208 22 Apr, How Good Are Low-bit Quantized LLaMA3 Models? 
An Empirical Study , https://arxiv.org/abs/2404.14047 22 Apr, Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone , https://arxiv.org/abs/2404.14219 22 Apr, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework , https://arxiv.org/abs/2404.14619 22 Apr, A Survey on Self-Evolution of Large Language Models , https://arxiv.org/abs/2404.14662 23 Apr, Multi-Head Mixture-of-Experts , https://arxiv.org/abs/2404.15045 23 Apr, NExT: Teaching Large Language Models to Reason about Code Execution , https://arxiv.org/abs/2404.14662 23 Apr, Graph Machine Learning in the Era of Large Language Models (LLMs) , https://arxiv.org/abs/2404.14928 24 Apr, Retrieval Head Mechanistically Explains Long-Context Factuality , https://arxiv.org/abs/2404.15574 25 Apr, Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding , https://arxiv.org/abs/2404.16710 25 Apr, Make Your LLM Fully Utilize the Context , https://arxiv.org/abs/2404.16811 28 Apr, LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report , https://arxiv.org/abs/2405.00732 30 Apr, Better & Faster Large Language Models via Multi-token Prediction , https://arxiv.org/abs/2404.19737 30 Apr, RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing , https://arxiv.org/abs/2404.19543 30 Apr, A Primer on the Inner Workings of Transformer-based Language Models , https://arxiv.org/abs/2405.00208 30 Apr, When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively , https://arxiv.org/abs/2404.19705 30 Apr, KAN: Kolmogorov–Arnold Networks , https://arxiv.org/abs/2404.19756 1 May, Is Bigger Edit Batch Size Always Better? An Empirical Study on Model Editing with Llama-3 , https://arxiv.org/abs/2405.00664 1 May, Self-Play Preference Optimization for Language Model Alignment , https://arxiv.org/abs/2405.00675 1 May, A Careful Examination of Large Language Model Performance on Grade School Arithmetic , https://arxiv.org/abs/2405.00332 2 May, Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models , https://arxiv.org/abs/2405.01535 3 May, What Matters When Building Vision-Language Models? , https://arxiv.org/abs/2405.02246 5 May, Is Flash Attention Stable? , https://arxiv.org/abs/2405.02803 7 May, vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention , https://arxiv.org/abs/2405.04437 7 May, xLSTM: Extended Long Short-Term Memory , https://arxiv.org/abs/2405.04517 8 May, You Only Cache Once: Decoder-Decoder Architectures for Language Models , https://arxiv.org/abs/2405.05254 8 May, DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model , https://arxiv.org/abs/2405.04434 8 May, Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models , https://arxiv.org/abs/2405.05417 9 May, Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? 
, https://arxiv.org/abs/2405.05904 10 May, Value Augmented Sampling for Language Model Alignment and Personalization , https://arxiv.org/abs/2405.06639 12 May, PHUDGE: Phi-3 as Scalable Judge , https://arxiv.org/abs/2405.08029 13 May, RLHF Workflow: From Reward Modeling to Online RLHF , https://arxiv.org/abs/2405.07863 15 May, LoRA Learns Less and Forgets Less , https://arxiv.org/abs/2405.09673 15 May, Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model , https://arxiv.org/abs/2405.09215 16 May, Chameleon: Mixed-Modal Early-Fusion Foundation Models , https://arxiv.org/abs/2405.09818 17 May, Towards Modular LLMs by Building and Reusing a Library of LoRAs , https://arxiv.org/abs/2405.11157 19 May, SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization , https://arxiv.org/abs/2405.11582 20 May, MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning , https://arxiv.org/abs/2405.12130 22 May, Attention as an RNN , https://arxiv.org/abs/2405.13956 22 May, Dense Connector for MLLMs , https://arxiv.org/abs/2405.13800 23 May, AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability , https://arxiv.org/abs/2405.14129 23 May, SimPO: Simple Preference Optimization with a Reference-Free Reward , https://arxiv.org/abs/2405.14734 23 May, Instruction Tuning With Loss Over Instructions , https://arxiv.org/abs/2405.14394 24 May, The Road Less Scheduled , https://arxiv.org/abs/2405.15682 26 May, Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training , https://arxiv.org/abs/2405.15319 26 May, gzip Predicts Data-dependent Scaling Laws , https://arxiv.org/abs/2405.16684 27 May, Trans-LoRA: Towards Data-free Transferable Parameter Efficient Finetuning , https://arxiv.org/abs/2405.17258 28 May, VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections , https://arxiv.org/abs/2405.17991 28 May, LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models , https://arxiv.org/abs/2405.18377 29 May, Contextual Position Encoding: Learning to Count What's Important , https://arxiv.org/abs/2405.18719 2 Jun, Show, Don't Tell: Aligning Language Models with Demonstrated Feedback , https://arxiv.org/abs/2406.00888 3 Jun, Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models , https://arxiv.org/abs/2406.06563 3 Jun, OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models , https://arxiv.org/abs/2406.01775 3 Jun, The Geometry of Categorical and Hierarchical Concepts in Large Language Models , https://arxiv.org/abs/2406.01506 3 Jun, Towards Scalable Automated Alignment of LLMs: A Survey , https://arxiv.org/abs/2406.01252 4 Jun, Scalable MatMul-free Language Modeling , https://arxiv.org/abs/2406.02528 4 Jun, Block Transformer: Global-to-Local Language Modeling for Fast Inference , https://arxiv.org/abs/2406.02657 6 Jun, Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models , https://arxiv.org/abs/2406.04271 6 Jun, The Prompt Report: A Systematic Survey of Prompting Techniques , https://arxiv.org/abs/2406.06608 6 Jun, Transformers Need Glasses! Information Over-Squashing in Language Tasks , https://arxiv.org/abs/2406.04267 6 Jun, Are We Done with MMLU? 
, https://arxiv.org/abs/2406.04127 6 Jun, Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step , https://arxiv.org/abs/2406.04314 7 Jun, Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach , https://arxiv.org/abs/2406.04594 7 Jun, CRAG -- Comprehensive RAG Benchmark , https://arxiv.org/abs/2406.04744 7 Jun, WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild , https://arxiv.org/abs/2406.04770 7 Jun, Mixture-of-Agents Enhances Large Language Model Capabilities , https://arxiv.org/abs/2406.04692 7 Jun, BERTs are Generative In-Context Learners , https://arxiv.org/abs/2406.04823 7 Jun, 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination , https://arxiv.org/abs/2406.05132 8 Jun, Creativity Has Left the Chat: The Price of Debiasing Language Models , https://arxiv.org/abs/2406.05587 10 Jun, Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation , https://arxiv.org/abs/2406.06525 10 Jun, Margin-aware Preference Optimization for Aligning Diffusion Models Without Reference , https://arxiv.org/abs/2406.06424 10 Jun, Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning , https://arxiv.org/abs/2406.06469 10 Jun, Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters , https://arxiv.org/abs/2406.05955 10 Jun, Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching , https://arxiv.org/abs/2406.06326 11 Jun, An Image is Worth 32 Tokens for Reconstruction and Generation , https://arxiv.org/abs/2406.07550 11 Jun, TextGrad: Automatic "Differentiation" via Text , https://arxiv.org/abs/2406.07496 11 Jun, Simple and Effective Masked Diffusion Language Models , https://arxiv.org/abs/2406.07524 11 Jun, Never Miss A Beat: An Efficient Recipe for Context Window Extension of Large Language Models with Consistent "Middle" Enhancement , https://arxiv.org/abs/2406.07138 11 Jun, Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling , https://arxiv.org/abs/2406.07522 12 Jun, Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing , https://arxiv.org/abs/2406.08464 12 Jun, What If We Recaption Billions of Web Images with LLaMA-3? , https://arxiv.org/abs/2406.08478 12 Jun, Large Language Model Unlearning via Embedding-Corrupted Prompts , https://arxiv.org/abs/2406.07933 12 Jun, Large Language Models Must Be Taught to Know What They Don't Know , https://arxiv.org/abs/2406.08391 12 Jun, An Empirical Study of Mamba-based Language Models , https://arxiv.org/abs/2406.07887 12 Jun, Discovering Preference Optimization Algorithms with and for Large Language Models , https://arxiv.org/abs/2406.08414 13 Jun, Transformers Meet Neural Algorithmic Reasoners , https://arxiv.org/abs/2406.09308 13 Jun, MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding , https://arxiv.org/abs/2406.09297 13 Jun, An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels , https://arxiv.org/abs/2406.09415 13 Jun, FouRA: Fourier Low Rank Adaptation , https://arxiv.org/abs/2406.08798 14 Jun, Bootstrapping Language Models with DPO Implicit Rewards , https://arxiv.org/abs/2406.09760 14 Jun, Be like a Goldfish, Don't Memorize! 
Mitigating Memorization in Generative LLMs , https://arxiv.org/abs/2406.10209 14 Jun, Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs , https://arxiv.org/abs/2406.10216 16 Jun, THEANINE: Revisiting Memory Management in Long-term Conversations with Timeline-augmented Response Generation , https://arxiv.org/abs/2406.10996 17 Jun, Task Me Anything , https://arxiv.org/abs/2406.11775 17 Jun, How Do Large Language Models Acquire Factual Knowledge During Pretraining? , https://arxiv.org/abs/2406.11813 17 Jun, mDPO: Conditional Preference Optimization for Multimodal Large Language Models , https://arxiv.org/abs/2406.11839 17 Jun, Nemotron-4 340B Technical Report , https://arxiv.org/abs/2406.11704 17 Jun, DataComp-LM: In Search of the Next Generation of Training Sets for Language Models , https://arxiv.org/abs/2406.11794 17 Jun, Tokenization Falling Short: The Curse of Tokenization , https://arxiv.org/abs/2406.11687 17 Jun, DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence , https://arxiv.org/abs/2406.11931 17 Jun, Unveiling Encoder-Free Vision-Language Models , https://arxiv.org/abs/2406.11832 17 Jun, Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level , https://arxiv.org/abs/2406.11817 17 Jun, HARE: HumAn pRiors, a key to small language model Efficiency , https://arxiv.org/abs/2406.11410 17 Jun, Measuring memorization in RLHF for code completion , https://arxiv.org/abs/2406.11715 17 Jun, Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts , https://arxiv.org/abs/2406.12034 18 Jun, From RAGs to Rich Parameters: Probing How Language Models Utilize External Knowledge Over Parametric Information for Factual Queries , https://arxiv.org/abs/2406.12824 18 Jun, Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges , https://arxiv.org/abs/2406.12624 19 Jun, Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? , https://arxiv.org/abs/2406.13121 20 Jun, Instruction Pre-Training: Language Models are Supervised Multitask Learners , https://arxiv.org/abs/2406.14491 20 Jun, Can LLMs Learn by Teaching? A Preliminary Study , https://arxiv.org/abs/2406.14629 21 Jun, A Tale of Trust and Accuracy: Base vs. 
Instruct LLMs in RAG Systems , https://arxiv.org/abs/2406.14972 21 Jun, LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs , https://arxiv.org/abs/2406.15319 21 Jun, MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression , https://arxiv.org/abs/2406.14909 21 Jun, Efficient Continual Pre-training by Mitigating the Stability Gap , https://arxiv.org/abs/2406.14833 24 Jun, Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers , https://arxiv.org/abs/2406.16747 24 Jun, WARP: On the Benefits of Weight Averaged Rewarded Policies , https://arxiv.org/abs/2406.16768 24 Jun, Adam-mini: Use Fewer Learning Rates To Gain More , https://arxiv.org/abs/2406.16793 25 Jun, The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , https://arxiv.org/abs/2406.17557 25 Jun, LongIns: A Challenging Long-context Instruction-based Exam for LLMs , https://arxiv.org/abs/2406.17588 25 Jun, Following Length Constraints in Instructions , https://arxiv.org/abs/2406.17744 26 Jun, A Closer Look into Mixture-of-Experts in Large Language Models , https://arxiv.org/abs/2406.18219 26 Jun, RouteLLM: Learning to Route LLMs with Preference Data , https://arxiv.org/abs/2406.18665 26 Jun, Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs , https://arxiv.org/abs/2406.18629 27 Jun, Dataset Size Recovery from LoRA Weights , https://arxiv.org/abs/2406.19395 27 Jun, From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data , https://arxiv.org/abs/2406.19292 27 Jun, Changing Answer Order Can Decrease MMLU Accuracy , https://arxiv.org/abs/2406.19470 28 Jun, Direct Preference Knowledge Distillation for Large Language Models , https://arxiv.org/abs/2406.19774 28 Jun, LLM Critics Help Catch LLM Bugs , https://arxiv.org/abs/2407.00215 28 Jun, Scaling Synthetic Data Creation with 1,000,000,000 Personas , https://arxiv.org/abs/2406.20094 1 Jul, LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives , https://arxiv.org/abs/2407.01490 1 Jul, Searching for Best Practices in Retrieval-Augmented Generation , https://arxiv.org/abs/2407.01219 1 Jul, Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models , https://arxiv.org/abs/2407.01906 1 Jul, Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion , https://arxiv.org/abs/2407.01392 1 Jul, Eliminating Position Bias of Language Models: A Mechanistic Approach , https://arxiv.org/abs/2407.01100 2 Jul, MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention , https://arxiv.org/abs/2407.02490 2 Jul, TokenPacker: Efficient Visual Projector for Multimodal LLM , https://arxiv.org/abs/2407.02392 2 Jul, Reasoning in Large Language Models: A Geometric Perspective , https://arxiv.org/abs/2407.02678 2 Jul, RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs , https://arxiv.org/abs/2407.02485 3 Jul, AgentInstruct: Toward Generative Teaching with Agentic Flows , https://arxiv.org/abs/2407.03502 3 Jul, HEMM: Holistic Evaluation of Multimodal Foundation Models , https://arxiv.org/abs/2407.03418 4 Jul, Mixture of A Million Experts , https://arxiv.org/abs/2407.04153 5 Jul, Learning to (Learn at Test Time): RNNs with Expressive Hidden States , https://arxiv.org/abs/2407.04620 9 Jul, Vision Language Models Are Blind , https://arxiv.org/abs/2407.06581 9 Jul, Self-Recognition in
Language Models , https://arxiv.org/abs/2407.06946 10 Jul, Inference Performance Optimization for Large Language Models on CPUs , https://arxiv.org/abs/2407.07304 11 Jul, Gradient Boosting Reinforcement Learning , https://arxiv.org/abs/2407.08250 11 Jul, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision , https://arxiv.org/abs/2407.08608 12 Jul, SpreadsheetLLM: Encoding Spreadsheets for Large Language Models , https://arxiv.org/abs/2407.09025 12 Jul, New Desiderata for Direct Preference Optimization , https://arxiv.org/abs/2407.09072 12 Jul, Context Embeddings for Efficient Answer Generation in RAG , https://arxiv.org/abs/2407.09252 15 Jul, Qwen2 Technical Report , https://arxiv.org/abs/2407.10671 15 Jul, The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism , https://arxiv.org/abs/2407.10457 15 Jul, From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients , https://arxiv.org/abs/2407.11239 16 Jul, GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression , https://arxiv.org/abs/2407.12077 16 Jul, Scaling Diffusion Transformers to 16 Billion Parameters , https://arxiv.org/abs/2407.11633 16 Jul, NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window? , https://arxiv.org/abs/2407.11963 17 Jul, Patch-Level Training for Large Language Models , https://arxiv.org/abs/2407.12665 17 Jul, LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models , https://arxiv.org/abs/2407.12772 17 Jul, A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks , https://arxiv.org/abs/2407.12994 17 Jul, Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models , https://arxiv.org/abs/2407.12327 18 Jul, Attention Overflow: Language Model Input Blur during Long-Context Missing Items Recommendation , https://arxiv.org/abs/2407.13481 18 Jul, Weak-to-Strong Reasoning , https://arxiv.org/abs/2407.13647 18 Jul, Understanding Reference Policies in Direct Preference Optimization , https://arxiv.org/abs/2407.13709 18 Jul, Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies , https://arxiv.org/abs/2407.13623 19 Jul, BOND: Aligning LLMs with Best-of-N Distillation , https://arxiv.org/abs/2407.14622 19 Jul, Compact Language Models via Pruning and Knowledge Distillation , https://arxiv.org/abs/2407.14679 19 Jul, LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference , https://arxiv.org/abs/2407.14057 22 Jul, Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training , https://arxiv.org/abs/2407.15892 22 Jul, DDK: Distilling Domain Knowledge for Efficient Large Language Models , https://arxiv.org/abs/2407.16154 23 Jul, Generation Constraint Scaling Can Mitigate Hallucination , https://arxiv.org/abs/2407.16908 23 Jul, Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach , https://arxiv.org/abs/2407.16833 23 Jul, Course-Correction: Safety Alignment Using Synthetic Preferences , https://arxiv.org/abs/2407.16637 26 Jul, Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data? 
, https://arxiv.org/abs/2407.16607 28 Jul, Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge , https://arxiv.org/abs/2407.19594 29 Jul, Improving Retrieval Augmented Language Model with Self-Reasoning , https://arxiv.org/abs/2407.19813 29 Jul, Apple Intelligence Foundation Language Models , https://arxiv.org/abs/2407.21075 30 Jul, ThinK: Thinner Key Cache by Query-Driven Pruning , https://arxiv.org/abs/2407.21018 31 Jul, The Llama 3 Herd of Models , https://arxiv.org/abs/2407.21783 31 Jul, Gemma 2: Improving Open Language Models at a Practical Size , https://arxiv.org/abs/2408.00118 1 Aug, SAM 2: Segment Anything in Images and Videos, https://arxiv.org/abs/2408.00714 2 Aug, POA: Pre-training Once for Models of All Sizes, https://arxiv.org/abs/2408.01031 2 Aug, RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework, https://arxiv.org/abs/2408.01262 2 Aug, A Survey of Mamba, https://arxiv.org/abs/2408.01129 3 Aug, MiniCPM-V: A GPT-4V Level MLLM on Your Phone, https://arxiv.org/abs/2408.01800 5 Aug, RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation, https://arxiv.org/abs/2408.02545 5 Aug, Self-Taught Evaluators, https://arxiv.org/abs/2408.02666 5 Aug, BioMamba: A Pre-trained Biomedical Language Representation Model Leveraging Mamba, https://arxiv.org/abs/2408.02600 7 Aug, EXAONE 3.0 7.8B Instruction Tuned Language Model, https://arxiv.org/abs/2408.03541 7 Aug, 1.5-Pints Technical Report: Pretraining in Days, Not Months -- Your Language Model Thrives on Quality Data, https://arxiv.org/abs/2408.03506 8 Aug, Conversational Prompt Engineering, https://arxiv.org/abs/2408.04560 8 Aug, Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP, https://arxiv.org/abs/2408.04303 12 Aug, The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery, https://arxiv.org/abs/2408.06292 15 Aug, Hermes 3 Technical Report, https://arxiv.org/abs/2408.12570 19 Aug, Customizing Language Models with Instance-wise LoRA for Sequential Recommendation, https://arxiv.org/abs/2408.10159 20 Aug, Enhancing Robustness in Large Language Models: Prompting for Mitigating the Impact of Irrelevant Information, https://arxiv.org/abs/2408.10615 20 Aug, To Code, or Not To Code?
Exploring Impact of Code in Pre-training, https://arxiv.org/abs/2408.10914 21 Aug, LLM Pruning and Distillation in Practice: The Minitron Approach, https://arxiv.org/abs/2408.11796 22 Aug, Jamba-1.5: Hybrid Transformer-Mamba Models at Scale, https://arxiv.org/abs/2408.12570 22 Aug, Controllable Text Generation for Large Language Models: A Survey, https://arxiv.org/abs/2408.12599 23 Aug, Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time, https://arxiv.org/abs/2408.13233 26 Aug, A Practitioner's Guide to Continual Multimodal Pretraining, https://arxiv.org/abs/2408.14471 26 Aug, Building and better understanding vision-language models: insights and future directions, https://arxiv.org/abs/2408.12637 26 Aug, CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting Mitigation, https://arxiv.org/abs/2408.14572 27 Aug, The Mamba in the Llama: Distilling and Accelerating Hybrid Models, https://arxiv.org/abs/2408.15237 28 Aug, ReMamba: Equip Mamba with Effective Long-Sequence Modeling, https://arxiv.org/abs/2408.15496 29 Aug, Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling, https://arxiv.org/abs/2408.16737 31 Aug, LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models, https://arxiv.org/abs/2409.00509 3 Sep, OLMoE: Open Mixture-of-Experts Language Models, https://arxiv.org/abs/2409.02060 3 Sep, In Defense of RAG in the Era of Long-Context Language Models, https://arxiv.org/abs/2409.01666 5 Sep, Attention Heads of Large Language Models: A Survey, https://arxiv.org/abs/2409.03752 5 Sep, LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA , https://arxiv.org/abs/2409.02897 5 Sep, How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data, https://arxiv.org/abs/2409.03810 6 Sep, Theory, Analysis, and Best Practices for Sigmoid Self-Attention, https://arxiv.org/abs/2409.04431 10 Sep, LLaMA-Omni: Seamless Speech Interaction with Large Language Models, https://arxiv.org/abs/2409.06666 10 Sep, What is the Role of Small Models in the LLM Era: A Survey, https://arxiv.org/abs/2409.06857 11 Sep, Policy Filtration in RLHF to Fine-Tune LLM for Code Generation, https://arxiv.org/abs/2409.06957 16 Sep, RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval , https://arxiv.org/abs/2409.10516 18 Sep, Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement , https://arxiv.org/abs/2409.12122 18 Sep, Qwen2.5-Coder Technical Report , https://arxiv.org/abs/2409.12186 21 Sep, Instruction Following without Instruction Tuning, https://arxiv.org/abs/2409.14254 30 Sep, Is Preference Alignment Always the Best Option to Enhance LLM-Based Translation? An Empirical Analysis, https://arxiv.org/abs/2409.20059 30 Sep, The Perfect Blend: Redefining RLHF with Mixture of Judges, https://arxiv.org/abs/2409.20370 (New paper by Meta on how they did RLHF for Llama 3) 1 Oct, Addition is All You Need for Energy-efficient Language Models, https://arxiv.org/abs/2410.00907 2 Oct, Quantifying Generalization Complexity for Large Language Models, https://arxiv.org/abs/2410.01769 2 Oct, When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1 , https://arxiv.org/abs/2410.01792 2 Oct, Were RNNs All We Needed?
, https://arxiv.org/abs/2410.01201 3 Oct, Selective Attention Improves Transformer , https://arxiv.org/abs/2410.02703 3 Oct, LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations , https://arxiv.org/abs/2410.02707 3 Oct, LLaVA-Critic: Learning to Evaluate Multimodal Models , https://arxiv.org/abs/2410.02712 7 Oct, Differential Transformer , https://arxiv.org/abs/2410.05258 7 Oct, GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models , https://arxiv.org/abs/2410.05229 8 Oct, ARIA: An Open Multimodal Native Mixture-of-Experts Model , https://arxiv.org/abs/2410.05993 8 Oct, O1 Replication Journey: A Strategic Progress Report -- Part 1 , https://arxiv.org/abs/2410.18982 8 Oct, Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG, https://arxiv.org/abs/2410.05983 9 Oct, From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning , https://arxiv.org/abs/2410.06456 10 Oct, KV Prediction for Improved Time to First Token , https://arxiv.org/abs/2410.08391 11 Oct, Baichuan-Omni Technical Report , https://arxiv.org/abs/2410.08565 13 Oct, MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models , https://arxiv.org/abs/2410.10139 13 Oct, LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models , https://arxiv.org/abs/2410.09732 15 Oct, AFlow: Automating Agentic Workflow Generation , https://arxiv.org/abs/2410.10762 15 Oct, Toward General Instruction-Following Alignment for Retrieval-Augmented Generation , https://arxiv.org/abs/2410.09584 21 Oct, Pre-training Distillation for Large Language Models: A Design Space Exploration , https://arxiv.org/abs/2410.16215 23 Oct, MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models , https://arxiv.org/abs/2410.17637 23 Oct, Scalable Ranked Preference Optimization for Text-to-Image Generation , https://arxiv.org/abs/2410.18013 23 Oct, Scaling Diffusion Language Models via Adaptation from Autoregressive Models , https://arxiv.org/abs/2410.17891 24 Oct, Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback , https://arxiv.org/abs/2410.19133 25 Oct, Counting Ability of Large Language Models and Impact of Tokenization , https://arxiv.org/abs/2410.19730 25 Oct, A Survey of Small Language Models , https://arxiv.org/abs/2410.20011 26 Oct, Accelerating Direct Preference Optimization with Prefix Sharing , https://arxiv.org/abs/2410.20305 27 Oct, Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse , https://arxiv.org/abs/2410.21333 28 Oct, LongReward: Improving Long-context Large Language Models with AI Feedback , https://arxiv.org/abs/2410.21252 28 Oct, ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference , https://arxiv.org/abs/2410.21465 29 Oct, Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications , https://arxiv.org/abs/2410.21943 30 Oct, CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation , https://arxiv.org/abs/2410.23090 31 Oct, What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective , https://arxiv.org/abs/2410.23743 31 Oct, GPT or BERT: why not both? 
, https://arxiv.org/abs/2410.24159 31 Oct, Language Models can Self-Lengthen to Generate Long Texts , https://arxiv.org/abs/2410.23933 1 Nov, Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations , https://arxiv.org/abs/2411.00640 1 Nov, Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation , https://arxiv.org/abs/2411.00412 1 Nov, Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models , https://arxiv.org/abs/2411.00492 3 Nov, Sample-Efficient Alignment for LLMs , https://arxiv.org/abs/2411.01493 4 Nov, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness , https://arxiv.org/abs/2411.03350 4 Nov, "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization , https://arxiv.org/abs/2411.02355 4 Nov, Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study , https://arxiv.org/abs/2411.02462 5 Nov, HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems , https://arxiv.org/abs/2411.02959 6 Nov, Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination , https://arxiv.org/abs/2411.03823 6 Nov, Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding , https://arxiv.org/abs/2411.04282 6 Nov, Number Cookbook: Number Understanding of Language Models and How to Improve It , https://arxiv.org/abs/2411.03766 7 Nov, Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models , https://arxiv.org/abs/2411.04996 7 Nov, BitNet a4.8: 4-bit Activations for 1-bit LLMs , https://arxiv.org/abs/2411.04965 7 Nov, Scaling Laws for Precision , https://arxiv.org/abs/2411.04330 8 Nov, Energy Efficient Protein Language Models: Leveraging Small Language Models with LoRA for Controllable Protein Generation , https://arxiv.org/abs/2411.05966 8 Nov, Balancing Pipeline Parallelism with Vocabulary Parallelism , https://arxiv.org/abs/2411.05288 11 Nov, Toward Optimal Search and Retrieval for RAG , https://arxiv.org/abs/2411.07396 12 Nov, Large Language Models Can Self-Improve in Long-context Reasoning , https://arxiv.org/abs/2411.08147 12 Nov, Stronger Models are NOT Stronger Teachers for Instruction Tuning , https://arxiv.org/abs/2411.07133 12 Nov, Direct Preference Optimization Using Sparse Feature-Level Constraints , https://arxiv.org/abs/2411.07618 13 Nov, Cut Your Losses in Large-Vocabulary Language Models , https://arxiv.org/abs/2411.09009 15 Nov, Does Prompt Formatting Have Any Impact on LLM Performance?
, https://arxiv.org/abs/2411.10541 17 Nov, SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization , https://arxiv.org/abs/2411.11909 17 Nov, SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration , https://arxiv.org/abs/2411.10958 18 Nov, Bi-Mamba: Towards Accurate 1-Bit State Space Models , https://arxiv.org/abs/2411.11843 19 Nov, RedPajama: an Open Dataset for Training Large Language Models, https://arxiv.org/abs/2411.12372 20 Nov, Hymba: A Hybrid-head Architecture for Small Language Models , https://arxiv.org/abs/2411.13676 20 Nov, Loss-to-Loss Prediction: Scaling Laws for All Datasets , https://arxiv.org/abs/2411.12925 21 Nov, When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training , https://arxiv.org/abs/2411.13476 21 Nov, Multimodal Autoregressive Pre-training of Large Vision Encoders , https://arxiv.org/abs/2411.14402 21 Nov, Natural Language Reinforcement Learning , https://arxiv.org/abs/2411.14251 22 Nov, Large Multi-modal Models Can Interpret Features in Large Multi-modal Models , https://arxiv.org/abs/2411.14982 22 Nov, TÜLU 3: Pushing Frontiers in Open Language Model Post-Training , https://arxiv.org/abs/2411.15124 23 Nov, MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs , https://arxiv.org/abs/2411.15296 24 Nov, LLMs Do Not Think Step-by-step In Implicit Reasoning , https://arxiv.org/abs/2411.15862 25 Nov, O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? , https://arxiv.org/abs/2411.16489 26 Nov, Star Attention: Efficient LLM Inference over Long Sequences , https://arxiv.org/abs/2411.17116 27 Nov, Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens , https://arxiv.org/abs/2411.17691 27 Nov, Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration , https://arxiv.org/abs/2411.17686 29 Nov, Reverse Thinking Makes LLMs Stronger Reasoners , https://arxiv.org/abs/2411.19865 29 Nov, Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability , https://arxiv.org/abs/2411.19943 2 Dec, Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis , https://arxiv.org/abs/2412.01819 2 Dec, X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models , https://arxiv.org/abs/2412.01824 2 Dec, Free Process Rewards without Process Labels , https://arxiv.org/abs/2412.01981 3 Dec, Scaling Image Tokenizers with Grouped Spherical Quantization , https://arxiv.org/abs/2412.02632 3 Dec, RARE: Retrieval-Augmented Reasoning Enhancement for Large Language Models , https://arxiv.org/abs/2412.02830 4 Dec, Perception Tokens Enhance Visual Reasoning in Multimodal Language Models , https://arxiv.org/abs/2412.03548 4 Dec, Evaluating Language Models as Synthetic Data Generators , https://arxiv.org/abs/2412.03679 4 Dec, Best-of-N Jailbreaking , https://arxiv.org/abs/2412.03556 4 Dec, PaliGemma 2: A Family of Versatile VLMs for Transfer , https://arxiv.org/abs/2412.03555 5 Dec, VisionZip: Longer is Better but Not Necessary in Vision Language Models , https://arxiv.org/abs/2412.04467 5 Dec, Evaluating and Aligning CodeLLMs on Human Preference , https://arxiv.org/abs/2412.05210 6 Dec, MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale , https://arxiv.org/abs/2412.05237 6 Dec, Expanding 
Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling , https://arxiv.org/abs/2412.05271 7 Dec, LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods , https://arxiv.org/abs/2412.05579 8 Dec, Does RLHF Scale? Exploring the Impacts From Data, Model, and Method , https://arxiv.org/abs/2412.06000 9 Dec, Unraveling the Complexity of Memory in RL Agents: An Approach for Classification and Evaluation , https://arxiv.org/abs/2412.06531 9 Dec, Training Large Language Models to Reason in a Continuous Latent Space , https://arxiv.org/abs/2412.06769 9 Dec, AutoReason: Automatic Few-Shot Reasoning Decomposition , https://arxiv.org/abs/2412.06975 11 Dec, Large Concept Models: Language Modeling in a Sentence Representation Space , https://arxiv.org/abs/2412.08821 12 Dec, Phi-4 Technical Report , https://arxiv.org/abs/2412.08905 13 Dec, Byte Latent Transformer: Patches Scale Better Than Tokens , https://arxiv.org/abs/2412.09871 13 Dec, SCBench: A KV Cache-Centric Analysis of Long-Context Methods , https://arxiv.org/abs/2412.10319 13 Dec, Cultural Evolution of Cooperation among LLM Agents , https://arxiv.org/abs/2412.10270 13 Dec, DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding , https://arxiv.org/abs/2412.10302 16 Dec, No More Adam: Learning Rate Scaling at Initialization is All You Need , https://arxiv.org/abs/2412.11768 16 Dec, Precise Length Control in Large Language Models , https://arxiv.org/abs/2412.11937 16 Dec, The Open Source Advantage in Large Language Models (LLMs) , https://arxiv.org/abs/2412.12004 16 Dec, A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges , https://arxiv.org/abs/2412.11936 17 Dec, Are Your LLMs Capable of Stable Reasoning? , https://arxiv.org/abs/2412.13147 18 Dec, LLM Post-Training Recipes, Improving Reasoning in LLMs , https://arxiv.org/abs/2412.14135 18 Dec, Hansel: Output Length Controlling Framework for Large Language Models , https://arxiv.org/abs/2412.14033 18 Dec, Mind Your Theory: Theory of Mind Goes Deeper Than Reasoning , https://arxiv.org/abs/2412.13631 18 Dec, Alignment Faking in Large Language Models , https://arxiv.org/abs/2412.14093 18 Dec, SCOPE: Optimizing Key-Value Cache Compression in Long-Context Generation , https://arxiv.org/abs/2412.13649 19 Dec, LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-Context Multitasks , https://arxiv.org/abs/2412.15204 20 Dec, Offline Reinforcement Learning for LLM Multi-Step Reasoning , https://arxiv.org/abs/2412.16145 24 Dec, Mulberry: Empowering MLLM with O1-like Reasoning and Reflection via Collective Monte Carlo Tree Search , https://arxiv.org/abs/2412.18319 31 Dec, Titans: Learning to Memorize at Test Time , https://arxiv.org/abs/2501.00663 This magazine is a personal passion project. For those who wish to support me, please consider purchasing a copy of my Build a Large Language Model (From Scratch) book . (I am confident that you'll get lots out of this book as it explains how LLMs work in a level of detail that is not found anywhere else.) Build a Large Language Model (From Scratch) now available on Amazon If you read the book and have a few minutes to spare, I'd really appreciate a brief review . It helps us authors a lot! Alternatively, I also recently enabled the paid subscription option on Substack to support this magazine directly. Ahead of AI is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.
Bonus materials in the GitHub repository (stars highlight my personal favorites)
Thanks for your understanding and support, and I hope to make a full recovery soon and be back with the Research Highlights 2024 article in a few weeks!
January 2024
1 Jan, Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models , https://arxiv.org/abs/2401.00788 2 Jan, A Comprehensive Study of Knowledge Editing for Large Language Models , https://arxiv.org/abs/2401.01286 2 Jan, LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning , https://arxiv.org/abs/2401.01325 2 Jan, Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models , https://arxiv.org/abs/2401.01335 2 Jan, LLaMA Beyond English: An Empirical Study on Language Capability Transfer , https://arxiv.org/abs/2401.01055 3 Jan, A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity , https://arxiv.org/abs/2401.01967 4 Jan, LLaMA Pro: Progressive LLaMA with Block Expansion , https://arxiv.org/abs/2401.02415 4 Jan, LLM Augmented LLMs: Expanding Capabilities through Composition , https://arxiv.org/abs/2401.02412 4 Jan, Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM , https://arxiv.org/abs/2401.02994 5 Jan, DeepSeek LLM: Scaling Open-Source Language Models with Longtermism , https://arxiv.org/abs/2401.02954 5 Jan, Denoising Vision Transformers , https://arxiv.org/abs/2401.02957 7 Jan, Soaring from 4K to 400K: Extending LLM’s Context with Activation Beacon , https://arxiv.org/abs/2401.03462 8 Jan, Mixtral of Experts , https://arxiv.org/abs/2401.04088 8 Jan, MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts , https://arxiv.org/abs/2401.04081 8 Jan, A Minimaximalist Approach to Reinforcement Learning from Human Feedback , https://arxiv.org/abs/2401.04056 9 Jan, RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation , https://arxiv.org/abs/2401.04679 10 Jan, Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training , https://arxiv.org/abs/2401.05566 11 Jan, Transformers are Multi-State RNNs , https://arxiv.org/abs/2401.06104 11 Jan, A Closer Look at AUROC and AUPRC under Class Imbalance , https://arxiv.org/abs/2401.06091 12 Jan, An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models , https://arxiv.org/abs/2401.06692 16 Jan, Tuning Language Models by Proxy , https://arxiv.org/abs/2401.08565 16 Jan, Scalable Pre-training of Large Autoregressive Image Models , https://arxiv.org/abs/2401.08541 16 Jan, Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering , https://arxiv.org/abs/2401.08500 16 Jan, RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture , https://arxiv.org/abs/2401.08406 17 Jan, ReFT: Reasoning with Reinforced Fine-Tuning , https://arxiv.org/abs/2401.08967 18 Jan, DiffusionGPT: LLM-Driven Text-to-Image Generation System , https://arxiv.org/abs/2401.10061 18 Jan, Self-Rewarding Language Models , https://arxiv.org/abs/2401.10020 18 Jan, VMamba: Visual State Space Model , https://arxiv.org/abs/2401.10166 19 Jan, Knowledge Fusion of Large Language Models , https://arxiv.org/abs/2401.10491 22 Jan, SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities ,
https://arxiv.org/abs/2401.12168 22 Jan, WARM: On the Benefits of Weight Averaged Reward Models , https://arxiv.org/abs/2401.12187 22 Jan, Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text , https://arxiv.org/abs/2401.12070 24 Jan, MambaByte: Token-free Selective State Space Model , https://arxiv.org/abs/2401.13660 24 Jan, SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection , https://arxiv.org/abs/2401.13160 25 Jan, Rethinking Patch Dependence for Masked Autoencoders , https://arxiv.org/abs/2401.14391 25 Jan, Pix2gestalt: Amodal Segmentation by Synthesizing Wholes , https://arxiv.org/abs/2401.14398 25 Jan, Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities , https://arxiv.org/abs/2401.14405 26 Jan, EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty , https://arxiv.org/abs/2401.15077 29 Jan, MoE-LLaVA: Mixture of Experts for Large Vision-Language Models , https://arxiv.org/abs/2401.15947 29 Jan, Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling , https://arxiv.org/abs/2401.16380 31 Jan, KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization , https://arxiv.org/abs/2401.18079 1 Feb, Efficient Exploration for LLMs , https://arxiv.org/abs/2402.00396 1 Feb, OLMo: Accelerating the Science of Language Models , https://arxiv.org/abs/2402.00838 1 Feb, Tiny Titans: Can Smaller Large Language Models Punch Above Their Weight in the Real World for Meeting Summarization? , https://arxiv.org/abs/2402.00841 1 Feb, Repeat After Me: Transformers are Better than State Space Models at Copying , https://arxiv.org/abs/2402.01032 2 Feb, LiPO: Listwise Preference Optimization through Learning-to-Rank , https://arxiv.org/abs/2402.01878 2 Feb, FindingEmo: An Image Dataset for Emotion Recognition in the Wild , https://arxiv.org/abs/2402.01355 3 Feb, More Agents Is All You Need , https://arxiv.org/abs/2402.05120 5 Feb, DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , https://arxiv.org/abs/2402.03300 6 Feb, MobileVLM V2: Faster and Stronger Baseline for Vision Language Model , https://arxiv.org/abs/2402.03766 6 Feb, A Phase Transition Between Positional and Semantic Learning in a Solvable Model of Dot-Product Attention , https://arxiv.org/abs/2402.03902 6 Feb, Scaling Laws for Downstream Task Performance of Large Language Models , https://arxiv.org/abs/2402.04177 6 Feb, MOMENT: A Family of Open Time-series Foundation Models , https://arxiv.org/abs/2402.03885 6 Feb, Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models , https://arxiv.org/abs/2402.03749 6 Feb, Self-Discover: Large Language Models Self-Compose Reasoning Structures , https://arxiv.org/abs/2402.03620 7 Feb, Grandmaster-Level Chess Without Search , https://arxiv.org/abs/2402.04494 7 Feb, Direct Language Model Alignment from Online AI Feedback , https://arxiv.org/abs/2402.04792 8 Feb, Buffer Overflow in Mixture of Experts , https://arxiv.org/abs/2402.05526 9 Feb, The Boundary of Neural Network Trainability is Fractal , https://arxiv.org/abs/2402.06184 11 Feb, ODIN: Disentangled Reward Mitigates Hacking in RLHF , https://arxiv.org/abs/2402.07319 12 Feb, Policy Improvement using Language Feedback Models , https://arxiv.org/abs/2402.07876 12 Feb, Scaling Laws for Fine-Grained Mixture of Experts , https://arxiv.org/abs/2402.07871 12 Feb, Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model , 
https://arxiv.org/abs/2402.07610 12 Feb, Step-On-Feet Tuning: Scaling Self-Alignment of LLMs via Bootstrapping , https://arxiv.org/abs/2402.07610 12 Feb, Suppressing Pink Elephants with Direct Principle Feedback , https://arxiv.org/abs/2402.07896 13 Feb, World Model on Million-Length Video And Language With RingAttention , https://arxiv.org/abs/2402.08268 13 Feb, Mixtures of Experts Unlock Parameter Scaling for Deep RL , https://arxiv.org/abs/2402.08609 14 Feb, DoRA: Weight-Decomposed Low-Rank Adaptation , https://arxiv.org/abs/2402.09353 14 Feb, Transformers Can Achieve Length Generalization But Not Robustly , https://arxiv.org/abs/2402.09371 15 Feb, BASE TTS: Lessons From Building a Billion-Parameter Text-to-Speech Model on 100K Hours of Data , https://arxiv.org/abs/2402.08093 15 Feb, Recovering the Pre-Fine-Tuning Weights of Generative Models , https://arxiv.org/abs/2402.10208 15 Feb, Generative Representational Instruction Tuning , https://arxiv.org/abs/2402.09906 16 Feb, FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models , https://arxiv.org/abs/2402.10986 17 Feb, OneBit: Towards Extremely Low-bit Large Language Models , https://arxiv.org/abs/2402.11295 18 Feb, LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration , https://arxiv.org/abs/2402.11550 19 Feb, Reformatted Alignment , https://arxiv.org/abs/2402.12219 19 Feb, AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling , https://arxiv.org/abs/2402.12226 19 Feb, Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs , https://arxiv.org/abs/2402.12030 19 Feb, LoRA+: Efficient Low Rank Adaptation of Large Models , https://arxiv.org/abs/2402.12354 20 Feb, Neural Network Diffusion , https://arxiv.org/abs/2402.13144 21 Feb, YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information , https://arxiv.org/abs/2402.13616 21 Feb, LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens , https://arxiv.org/abs/2402.13753 21 Feb, Large Language Models for Data Annotation: A Survey , https://arxiv.org/abs/2402.13446 22 Feb, TinyLLaVA: A Framework of Small-scale Large Multimodal Models , https://arxiv.org/abs/2402.14289 22 Feb, Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs , https://arxiv.org/abs/2402.14740 23 Feb, Genie: Generative Interactive Environments , https://arxiv.org/abs/2402.15391 26 Feb, CARTE: Pretraining and Transfer for Tabular Learning , https://arxiv.org/abs/2402.16785 27 Feb, The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits , https://arxiv.org/abs/2402.17764 27 Feb, Sora Generates Videos with Stunning Geometrical Consistency , https://arxiv.org/abs/2402.17403 27 Feb, When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method , https://arxiv.org/abs/2402.17193 29 Feb, Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models , https://arxiv.org/abs/2402.19427 1 Mar, Learning and Leveraging World Models in Visual Representation Learning , https://arxiv.org/abs/2403.00504 3 Mar, Improving LLM Code Generation with Grammar Augmentation , https://arxiv.org/abs/2403.01632 3 Mar, The Hidden Attention of Mamba Models , https://arxiv.org/abs/2403.01590 4 Mar, Training-Free Pretrained Model Merging , https://arxiv.org/abs/2403.01753 4 Mar, Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures , https://arxiv.org/abs/2403.02308 5 Mar, The WMDP Benchmark: Measuring and 
Reducing Malicious Use With Unlearning , https://arxiv.org/abs/2403.03218 5 Mar, Evolution Transformer: In-Context Evolutionary Optimization , https://arxiv.org/abs/2403.02985 5 Mar, Enhancing Vision-Language Pre-training with Rich Supervisions , https://arxiv.org/abs/2403.03346 5 Mar, Scaling Rectified Flow Transformers for High-Resolution Image Synthesis , https://arxiv.org/abs/2403.03206 5 Mar, Design2Code: How Far Are We From Automating Front-End Engineering? , https://arxiv.org/abs/2403.03163 6 Mar, ShortGPT: Layers in Large Language Models are More Redundant Than You Expect , https://arxiv.org/abs/2403.03853 6 Mar, Backtracing: Retrieving the Cause of the Query , https://arxiv.org/abs/2403.03956 6 Mar, Learning to Decode Collaboratively with Multiple Language Models , https://arxiv.org/abs/2403.03870 6 Mar, SaulLM-7B: A pioneering Large Language Model for Law , https://arxiv.org/abs/2403.03883 6 Mar, Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning , https://arxiv.org/abs/2403.03864 6 Mar, 3D Diffusion Policy , https://arxiv.org/abs/2403.03954 6 Mar, MedMamba: Vision Mamba for Medical Image Classification , https://arxiv.org/abs/2403.03849 6 Mar, GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection , https://arxiv.org/abs/2403.03507 6 Mar, Stop Regressing: Training Value Functions via Classification for Scalable Deep RL , https://arxiv.org/abs/2403.03950 7 Mar, How Far Are We from Intelligent Visual Deductive Reasoning? , https://arxiv.org/abs/2403.04732 7 Mar, Common 7B Language Models Already Possess Strong Math Capabilities , https://arxiv.org/abs/2403.04706 8 Mar, Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context , https://arxiv.org/abs/2403.05530 8 Mar, Is Cosine-Similarity of Embeddings Really About Similarity? 
, https://arxiv.org/abs/2403.05440 8 Mar, LLM4Decompile: Decompiling Binary Code with Large Language Models , https://arxiv.org/abs/2403.05286 9 Mar, Algorithmic Progress in Language Models , https://arxiv.org/abs/2403.05812 11 Mar, Stealing Part of a Production Language Model , https://arxiv.org/abs/2403.06634 12 Mar, Chronos: Learning the Language of Time Series , https://arxiv.org/abs/2403.07815 13 Mar, Simple and Scalable Strategies to Continually Pre-train Large Language Models , https://arxiv.org/abs/2403.08763 13 Mar, Language Models Scale Reliably With Over-Training and on Downstream Tasks , https://arxiv.org/abs/2403.08540 14 Mar, BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences , https://arxiv.org/abs/2403.09347 14 Mar, LocalMamba: Visual State Space Model with Windowed Selective Scan , https://arxiv.org/abs/2403.09338 14 Mar, GiT: Towards Generalist Vision Transformer through Universal Language Interface , https://arxiv.org/abs/2403.09394 14 Mar, MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training , https://arxiv.org/abs/2403.09611 15 Mar, RAFT: Adapting Language Model to Domain Specific RAG , https://arxiv.org/abs/2403.10131 18 Mar, TnT-LLM: Text Mining at Scale with Large Language Models , https://arxiv.org/abs/2403.12173 18 Mar, Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression , https://arxiv.org/abs/2403.15447 19 Mar, PERL: Parameter Efficient Reinforcement Learning from Human Feedback , https://arxiv.org/abs/2403.10704 20 Mar, RewardBench: Evaluating Reward Models for Language Modeling , https://arxiv.org/abs/2403.13787 20 Mar, LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models , https://arxiv.org/abs/2403.13372 21 Mar, RakutenAI-7B: Extending Large Language Models for Japanese , https://arxiv.org/abs/2403.15484 22 Mar, SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time Series , https://arxiv.org/abs/2403.15360 22 Mar, Can Large Language Models Explore In-Context? , https://arxiv.org/abs/2403.15371 22 Mar, LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement , https://arxiv.org/abs/2403.15042 25 Mar, LLM Agent Operating System , https://arxiv.org/abs/2403.16971 26 Mar, The Unreasonable Ineffectiveness of the Deeper Layers , https://arxiv.org/abs/2403.17887 27 Mar, BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text , https://arxiv.org/abs/2403.18421 27 Mar, ViTAR: Vision Transformer with Any Resolution , https://arxiv.org/abs/2403.18361 27 Mar, Long-form Factuality in Large Language Models , https://arxiv.org/abs/2403.18802 27 Mar, Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models , https://arxiv.org/abs/2403.18814 26 Mar, LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning , https://arxiv.org/abs/2403.17919 26 Mar, Mechanistic Design and Scaling of Hybrid Architectures , https://arxiv.org/abs/2403.17844 28 Mar, MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions , https://arxiv.org/abs/2403.19651 28 Mar, Model Stock: All We Need Is Just a Few Fine-Tuned Models , https://arxiv.org/abs/2403.19522 1 Apr, Do Language Models Plan Ahead for Future Tokens? 
, https://arxiv.org/abs/2404.00859 1 Apr, Bigger is not Always Better: Scaling Properties of Latent Diffusion Models , https://arxiv.org/abs/2404.01367 1 Apr, The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis , https://arxiv.org/abs/2404.01204 1 Apr, Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models , https://arxiv.org/abs/2404.04478 2 Apr, Mixture-of-Depths: Dynamically Allocating Compute in Transformer-Based Language Models , https://arxiv.org/abs/2404.02258 2 Apr, Long-context LLMs Struggle with Long In-context Learning , https://arxiv.org/abs/2404.02060 2 Apr, Emergent Abilities in Reduced-Scale Generative Language Models , https://arxiv.org/abs/2404.02204 2 Apr, Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks , https://arxiv.org/abs/2404.02151 3 Apr, On the Scalability of Diffusion-based Text-to-Image Generation , https://arxiv.org/abs/2404.02883 3 Apr, BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models , https://arxiv.org/abs/2404.02827 3 Apr, Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models , https://arxiv.org/abs/2404.02747 4 Apr, Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences , https://arxiv.org/abs/2404.02151 4 Apr, Training LLMs over Neurally Compressed Text , https://arxiv.org/abs/2404.03626 4 Apr, CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues , https://arxiv.org/abs/2404.03820 5 Apr, ReFT: Representation Finetuning for Language Models , https://arxiv.org/abs/2404.03592 5 Apr, Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data , https://arxiv.org/abs/2404.03862 5 Apr, Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation , https://arxiv.org/abs/2404.04256 8 Apr, AutoCodeRover: Autonomous Program Improvement , https://arxiv.org/abs/2404.05427 8 Apr, Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence , https://arxiv.org/abs/2404.05892 8 Apr, CodecLM: Aligning Language Models with Tailored Synthetic Data , https://arxiv.org/abs/2404.05875 9 Apr, MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies , https://arxiv.org/abs/2404.06395 9 Apr, Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models , https://arxiv.org/abs/2404.06209 9 Apr, LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders , https://arxiv.org/abs/2404.05961 10 Apr, Adapting LLaMA Decoder to Vision Transformer , https://arxiv.org/abs/2404.06773 10 Apr, Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention , https://arxiv.org/abs/2404.07143 11 Apr, LLoCO: Learning Long Contexts Offline , https://arxiv.org/abs/2404.07979 11 Apr, JetMoE: Reaching Llama2 Performance with 0.1M Dollars , https://arxiv.org/abs/2404.07413 11 Apr, Best Practices and Lessons Learned on Synthetic Data for Language Models , https://arxiv.org/abs/2404.07503 11 Apr, Rho-1: Not All Tokens Are What You Need , https://arxiv.org/abs/2404.07965 12 Apr, Pre-training Small Base LMs with Fewer Tokens , https://arxiv.org/abs/2404.08634 12 Apr, Dataset Reset Policy Optimization for RLHF , https://arxiv.org/abs/2404.08495 13 Apr, LLM In-Context Recall is Prompt Dependent , https://arxiv.org/abs/2404.08865 15 Apr, State Space Model for New-Generation Network Alternative to Transformers: A Survey , https://arxiv.org/abs/2404.09516 15 Apr, Chinchilla Scaling: A 
Replication Attempt , https://arxiv.org/abs/2404.10102 15 Apr, Learn Your Reference Model for Real Good Alignment , https://arxiv.org/abs/2404.09656 16 Apr, Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study , https://arxiv.org/abs/2404.10719 16 Apr, Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies , https://arxiv.org/abs/2404.08197 16 Apr, How Faithful Are RAG Models? Quantifying the Tug-of-War Between RAG and LLMs' Internal Prior , https://arxiv.org/abs/2404.10198 17 Apr, A Survey on Retrieval-Augmented Text Generation for Large Language Models , https://arxiv.org/abs/2404.10981 18 Apr, When LLMs are Unfit Use FastFit: Fast and Effective Text Classification with Many Classes , https://arxiv.org/abs/2404.12365 18 Apr, Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing , https://arxiv.org/abs/2404.12253 18 Apr, OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data , https://arxiv.org/abs/2404.12195 19 Apr, The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions , https://arxiv.org/abs/2404.13208 22 Apr, How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study , https://arxiv.org/abs/2404.14047 22 Apr, Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone , https://arxiv.org/abs/2404.14219 22 Apr, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework , https://arxiv.org/abs/2404.14619 22 Apr, A Survey on Self-Evolution of Large Language Models , https://arxiv.org/abs/2404.14662 23 Apr, Multi-Head Mixture-of-Experts , https://arxiv.org/abs/2404.15045 23 Apr, NExT: Teaching Large Language Models to Reason about Code Execution , https://arxiv.org/abs/2404.14662 23 Apr, Graph Machine Learning in the Era of Large Language Models (LLMs) , https://arxiv.org/abs/2404.14928 24 Apr, Retrieval Head Mechanistically Explains Long-Context Factuality , https://arxiv.org/abs/2404.15574 25 Apr, Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding , https://arxiv.org/abs/2404.16710 25 Apr, Make Your LLM Fully Utilize the Context , https://arxiv.org/abs/2404.16811 28 Apr, LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report , https://arxiv.org/abs/2405.00732 30 Apr, Better & Faster Large Language Models via Multi-token Prediction , https://arxiv.org/abs/2404.19737 30 Apr, RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing , https://arxiv.org/abs/2404.19543 30 Apr, A Primer on the Inner Workings of Transformer-based Language Models , https://arxiv.org/abs/2405.00208 30 Apr, When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively , https://arxiv.org/abs/2404.19705 30 Apr, KAN: Kolmogorov–Arnold Networks , https://arxiv.org/abs/2404.19756 1 May, Is Bigger Edit Batch Size Always Better? An Empirical Study on Model Editing with Llama-3 , https://arxiv.org/abs/2405.00664 1 May, Self-Play Preference Optimization for Language Model Alignment , https://arxiv.org/abs/2405.00675 1 May, A Careful Examination of Large Language Model Performance on Grade School Arithmetic , https://arxiv.org/abs/2405.00332 2 May, Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models , https://arxiv.org/abs/2405.01535 3 May, What Matters When Building Vision-Language Models? , https://arxiv.org/abs/2405.02246 5 May, Is Flash Attention Stable? 
, https://arxiv.org/abs/2405.02803 7 May, vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention , https://arxiv.org/abs/2405.04437 7 May, xLSTM: Extended Long Short-Term Memory , https://arxiv.org/abs/2405.04517 8 May, You Only Cache Once: Decoder-Decoder Architectures for Language Models , https://arxiv.org/abs/2405.05254 8 May, DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model , https://arxiv.org/abs/2405.04434 8 May, Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models , https://arxiv.org/abs/2405.05417 9 May, Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? , https://arxiv.org/abs/2405.05904 10 May, Value Augmented Sampling for Language Model Alignment and Personalization , https://arxiv.org/abs/2405.06639 12 May, PHUDGE: Phi-3 as Scalable Judge , https://arxiv.org/abs/2405.08029 13 May, RLHF Workflow: From Reward Modeling to Online RLHF , https://arxiv.org/abs/2405.07863 15 May, LoRA Learns Less and Forgets Less , https://arxiv.org/abs/2405.09673 15 May, Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model , https://arxiv.org/abs/2405.09215 16 May, Chameleon: Mixed-Modal Early-Fusion Foundation Models , https://arxiv.org/abs/2405.09818 17 May, Towards Modular LLMs by Building and Reusing a Library of LoRAs , https://arxiv.org/abs/2405.11157 19 May, SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization , https://arxiv.org/abs/2405.11582 20 May, MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning , https://arxiv.org/abs/2405.12130 22 May, Attention as an RNN , https://arxiv.org/abs/2405.13956 22 May, Dense Connector for MLLMs , https://arxiv.org/abs/2405.13800 23 May, AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability , https://arxiv.org/abs/2405.14129 23 May, SimPO: Simple Preference Optimization with a Reference-Free Reward , https://arxiv.org/abs/2405.14734 23 May, Instruction Tuning With Loss Over Instructions , https://arxiv.org/abs/2405.14394 24 May, The Road Less Scheduled , https://arxiv.org/abs/2405.15682 26 May, Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training , https://arxiv.org/abs/2405.15319 26 May, gzip Predicts Data-dependent Scaling Laws , https://arxiv.org/abs/2405.16684 27 May, Trans-LoRA: Towards Data-free Transferable Parameter Efficient Finetuning , https://arxiv.org/abs/2405.17258 28 May, VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections , https://arxiv.org/abs/2405.17991 28 May, LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models , https://arxiv.org/abs/2405.18377 29 May, Contextual Position Encoding: Learning to Count What's Important , https://arxiv.org/abs/2405.18719 2 Jun, Show, Don't Tell: Aligning Language Models with Demonstrated Feedback , https://arxiv.org/abs/2406.00888 3 Jun, Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models , https://arxiv.org/abs/2406.06563 3 Jun, OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models , https://arxiv.org/abs/2406.01775 3 Jun, The Geometry of Categorical and Hierarchical Concepts in Large Language Models , https://arxiv.org/abs/2406.01506 3 Jun, Towards Scalable Automated Alignment of LLMs: A Survey , https://arxiv.org/abs/2406.01252 4 Jun, Scalable MatMul-free Language Modeling , https://arxiv.org/abs/2406.02528 4 Jun, Block Transformer: Global-to-Local Language Modeling 
for Fast Inference , https://arxiv.org/abs/2406.02657 6 Jun, Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models , https://arxiv.org/abs/2406.04271 6 Jun, The Prompt Report: A Systematic Survey of Prompting Techniques , https://arxiv.org/abs/2406.06608 6 Jun, Transformers Need Glasses! Information Over-Squashing in Language Tasks , https://arxiv.org/abs/2406.04267 6 Jun, Are We Done with MMLU? , https://arxiv.org/abs/2406.04127
, https://arxiv.org/abs/2407.16607 28 Jul, Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge , https://arxiv.org/abs/2407.19594 29 Jul, Improving Retrieval Augmented Language Model with Self-Reasoning , https://arxiv.org/abs/2407.19813 29 Jul, Apple Intelligence Foundation Language Models , https://arxiv.org/abs/2407.21075 30 Jul, ThinK: Thinner Key Cache by Query-Driven Pruning , https://arxiv.org/abs/2407.21018 31 Jul, The Llama 3 Herd of Models , https://arxiv.org/abs/2407.21783 31 Jul, Gemma 2: Improving Open Language Models at a Practical Size , https://arxiv.org/abs/2408.00118 1 Aug, S AM 2: Segment Anything in Images and Videos, https://arxiv.org/abs/2408.00714 2 Aug, POA: Pre-training Once for Models of All Sizes, https://arxiv.org/abs/2408.01031 2 Aug, RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework, https://arxiv.org/abs/2408.01262 2 Aug, A Survey of Mamba, https://arxiv.org/abs/2408.01129 3 Aug, MiniCPM-V: A GPT-4V Level MLLM on Your Phone, https://arxiv.org/abs/2408.01800 5 Aug, RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation, https://arxiv.org/abs/2408.02545 5 Aug, Self-Taught Evaluators, https://arxiv.org/abs/2408.02666 5 Aug, BioMamba: A Pre-trained Biomedical Language Representation Model Leveraging Mamba, https://arxiv.org/abs/2408.02600 5 Aug, Self-Taught Evaluators, https://arxiv.org/abs/2408.02666 7 Aug, EXAONE 3.0 7.8B Instruction Tuned Language Model, https://arxiv.org/abs/2408.03541 7 Aug, 1.5-Pints Technical Report: Pretraining in Days, Not Months -- Your Language Model Thrives on Quality Data, https://arxiv.org/abs/2408.03506 8 Aug, Conversational Prompt Engineering, https://arxiv.org/abs/2408.04560 8 Aug, Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP, https://arxiv.org/abs/2408.04303 12 Aug, The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery, https://arxiv.org/abs/2408.06292 15 Aug, Hermes 3 Technical Report, https://arxiv.org/abs/2408.12570 19 Aug, Customizing Language Models with Instance-wise LoRA for Sequential Recommendation, https://arxiv.org/abs/2408.10159 20 Aug , Enhancing Robustness in Large Language Models: Prompting for Mitigating the Impact of Irrelevant Information, https://arxiv.org/abs/2408.10615 20 Aug, To Code, or Not To Code? 
Exploring Impact of Code in Pre-training, https://arxiv.org/abs/2408.10914 21 Aug , LLM Pruning and Distillation in Practice: The Minitron Approach, https://arxiv.org/abs/2408.11796 22 Aug, Jamba-1.5: Hybrid Transformer-Mamba Models at Scale, https://arxiv.org/abs/2408.12570 22 Aug, Controllable Text Generation for Large Language Models: A Survey, https://arxiv.org/abs/2408.12599 23 Aug, Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time, https://arxiv.org/abs/2408.13233 26 Aug, A Practitioner's Guide to Continual Multimodal Pretraining, https://arxiv.org/abs/2408.14471 26 Aug, Building and better understanding vision-language models: insights and future directions, https://arxiv.org/abs/2408.12637 26 Aug, CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting Mitigation, https://arxiv.org/abs/2408.14572 27 Aug, The Mamba in the Llama: Distilling and Accelerating Hybrid Models, https://arxiv.org/abs/2408.15237 28 Aug, ReMamba: Equip Mamba with Effective Long-Sequence Modeling, https://arxiv.org/abs/2408.15496 29 Aug, Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling, https://arxiv.org/abs/2408.16737 31 Aug, LongRecipe: Recipe for Efficient Long Context Generalization in Large Languge Models, https://arxiv.org/abs/2409.00509 3 Sep, OLMoE: Open Mixture-of-Experts Language Models, https://arxiv.org/abs/2409.02060 3 Sep 2024, In Defense of RAG in the Era of Long-Context Language Models, https://arxiv.org/abs/2409.01666 5 Sep, Attention Heads of Large Language Models: A Survey, https://arxiv.org/abs/2409.03752 5 Sep, LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA , https://arxiv.org/abs/2409.02897 5 Sep, How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data, https://arxiv.org/abs/2409.03810 6 Sep, T heory, Analysis, and Best Practices for Sigmoid Self-Attention, https://arxiv.org/abs/2409.04431 10 Sep, LLaMA-Omni: Seamless Speech Interaction with Large Language Models, https://arxiv.org/abs/2409.06666 10 Sep, What is the Role of Small Models in the LLM Era: A Survey, https://arxiv.org/abs/2409.06857 11 Sep, Policy Filtration in RLHF to Fine-Tune LLM for Code Generation, https://arxiv.org/abs/2409.06957 16 Sep, RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval , https://arxiv.org/abs/2409.10516 18 Sep, Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement , https://arxiv.org/abs/2409.12122 18 Sep, Qwen2.5-Coder Technical Report , https://arxiv.org/abs/2409.12186 21 Sep, Instruction Following without Instruction Tuning, https://arxiv.org/abs/2409.14254 30 Sep, I s Preference Alignment Always the Best Option to Enhance LLM-Based Translation? An Empirical Analysis, https://arxiv.org/abs/2409.20059 30 Sep, The Perfect Blend: Redefining RLHF with Mixture of Judges, https://arxiv.org/abs/2409.20370 (New paper by Meta on how they did RLHF for Llama 3) 1 Oct, Addition is All You Need for Energy-efficient Language Models, https://arxiv.org/abs/2410.00907 2 Oct Quantifying Generalization Complexity for Large Language Models, https://arxiv.org/abs/2410.01769 2 Oct, When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1 , https://arxiv.org/abs/2410.01792 2 Oct, W ere RNNs All We Needed? 
, https://arxiv.org/abs/2410.01201 3 Oct, Selective Attention Improves Transformer , https://arxiv.org/abs/2410.02703 3 Oct, LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations , https://arxiv.org/abs/2410.02707 3 Oct, LLaVA-Critic: Learning to Evaluate Multimodal Models , https://arxiv.org/abs/2410.02712 7 Oct, Differential Transformer , https://arxiv.org/abs/2410.05258 7 Oct, GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models , https://arxiv.org/abs/2410.05229 8 Oct, ARIA: An Open Multimodal Native Mixture-of-Experts Model , https://arxiv.org/abs/2410.05993 8 Oct, O1 Replication Journey: A Strategic Progress Report -- Part 1 , https://arxiv.org/abs/2410.18982 8 Oct, Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG, https://arxiv.org/abs/2410.05983 9 Oct, From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning , https://arxiv.org/abs/2410.06456 10 Oct, KV Prediction for Improved Time to First Token , https://arxiv.org/abs/2410.08391 11 Oct, Baichuan-Omni Technical Report , https://arxiv.org/abs/2410.08565 13 Oct, MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models , https://arxiv.org/abs/2410.10139 13 Oct, LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models , https://arxiv.org/abs/2410.09732 15 Oct, AFlow: Automating Agentic Workflow Generation , https://arxiv.org/abs/2410.10762 15 Oct, Toward General Instruction-Following Alignment for Retrieval-Augmented Generation , https://arxiv.org/abs/2410.09584 21 Oct, Pre-training Distillation for Large Language Models: A Design Space Exploration , https://arxiv.org/abs/2410.16215 23 Oct, MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models , https://arxiv.org/abs/2410.17637 23 Oct, Scalable Ranked Preference Optimization for Text-to-Image Generation , https://arxiv.org/abs/2410.18013 23 Oct, Scaling Diffusion Language Models via Adaptation from Autoregressive Models , https://arxiv.org/abs/2410.17891 24 Oct, Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback , https://arxiv.org/abs/2410.19133 25 Oct, Counting Ability of Large Language Models and Impact of Tokenization , https://arxiv.org/abs/2410.19730 25 Oct, A Survey of Small Language Models , https://arxiv.org/abs/2410.20011 26 Oct, Accelerating Direct Preference Optimization with Prefix Sharing , https://arxiv.org/abs/2410.20305 27 Oct, Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse , https://arxiv.org/abs/2410.21333 28 Oct, LongReward: Improving Long-context Large Language Models with AI Feedback , https://arxiv.org/abs/2410.21252 28 Oct, ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference , https://arxiv.org/abs/2410.21465 29 Oct, Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications , https://arxiv.org/abs/2410.21943 30 Oct, CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation , https://arxiv.org/abs/2410.23090 31 Oct, What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective , https://arxiv.org/abs/2410.23743 31 Oct, GPT or BERT: why not both? 
, https://arxiv.org/abs/2410.24159 31 Oct, Language Models can Self-Lengthen to Generate Long Texts , https://arxiv.org/abs/2410.23933 1 Nov, Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations , https://arxiv.org/abs/2411.00640 1 Nov 2024, Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation , https://arxiv.org/abs/2411.00412 1 Nov 2024, Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models , https://arxiv.org/abs/2411.00492 3 Nov, S ample-Efficient Alignment for LLMs , https://arxiv.org/abs/2411.01493 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness , https://arxiv.org/abs/2411.03350 4 Nov, "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization , https://arxiv.org/abs/2411.02355 4 Nov, Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study , https://arxiv.org/abs/2411.02462 5 Nov, HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems , https://arxiv.org/abs/2411.02959 6 Nov, Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination , https://arxiv.org/abs/2411.03823 6 Nov, Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding , https://arxiv.org/abs/2411.04282 6 Nov, Number Cookbook: Number Understanding of Language Models and How to Improve It , https://arxiv.org/abs/2411.03766 7 Nov, Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models , https://arxiv.org/abs/2411.04996 7 Nov, BitNet a4.8: 4-bit Activations for 1-bit LLMs , https://arxiv.org/abs/2411.04965 7 Nov, Scaling Laws for Precision , https://arxiv.org/abs/2411.04330 8 Nov, Energy Efficient Protein Language Models: Leveraging Small Language Models with LoRA for Controllable Protein Generation , https://arxiv.org/abs/2411.05966 8 Nov, Balancing Pipeline Parallelism with Vocabulary Parallelism , https://arxiv.org/abs/2411.05288 11 Nov, Toward Optimal Search and Retrieval for RAG , https://arxiv.org/abs/2411.07396 12 Nov, Large Language Models Can Self-Improve in Long-context Reasoning , https://arxiv.org/abs/2411.08147 12 Nov, Stronger Models are NOT Stronger Teachers for Instruction Tuning , https://arxiv.org/abs/2411.07133 12 Nov, Direct Preference Optimization Using Sparse Feature-Level Constraints , https://arxiv.org/abs/2411.07618 13 Nov, Cut Your Losses in Large-Vocabulary Language Models , https://arxiv.org/abs/2411.09009 15 Nov, Does Prompt Formatting Have Any Impact on LLM Performance? 
, https://arxiv.org/abs/2411.10541 17 Nov, SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization , https://arxiv.org/abs/2411.11909 17 Nov, SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration , https://arxiv.org/abs/2411.10958 18 Nov, Bi-Mamba: Towards Accurate 1-Bit State Space Models , https://arxiv.org/abs/2411.11843 19 Nov, RedPajama: an Open Dataset for Training Large Language Models, https://arxiv.org/abs/2411.12372 20 Nov, Hymba: A Hybrid-head Architecture for Small Language Models , https://arxiv.org/abs/2411.13676 20 Nov, Loss-to-Loss Prediction: Scaling Laws for All Datasets , https://arxiv.org/abs/2411.12925 21 Nov, When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training , https://arxiv.org/abs/2411.13476 21 Nov, Multimodal Autoregressive Pre-training of Large Vision Encoders , https://arxiv.org/abs/2411.14402 21 Nov, Natural Language Reinforcement Learning , https://arxiv.org/abs/2411.14251 22 Nov, Large Multi-modal Models Can Interpret Features in Large Multi-modal Models , https://arxiv.org/abs/2411.14982 22 Nov, TÜLU 3: Pushing Frontiers in Open Language Model Post-Training , https://arxiv.org/abs/2411.15124 23 Nov, MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs , https://arxiv.org/abs/2411.15296 24 Nov, LLMs Do Not Think Step-by-step In Implicit Reasoning , https://arxiv.org/abs/2411.15862 25 Nov, O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? , https://arxiv.org/abs/2411.16489 26 Nov, Star Attention: Efficient LLM Inference over Long Sequences , https://arxiv.org/abs/2411.17116 27 Nov, Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens , https://arxiv.org/abs/2411.17691 27 Nov, Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration , https://arxiv.org/abs/2411.17686 29 Nov, Reverse Thinking Makes LLMs Stronger Reasoners , https://arxiv.org/abs/2411.19865 29 Nov, Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability , https://arxiv.org/abs/2411.19943 2 Dec, Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis , https://arxiv.org/abs/2412.01819 2 Dec, X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models , https://arxiv.org/abs/2412.01824 2 Dec, Free Process Rewards without Process Labels , https://arxiv.org/abs/2412.01981 3 Dec, Scaling Image Tokenizers with Grouped Spherical Quantization , https://arxiv.org/abs/2412.02632 3 Dec, RARE: Retrieval-Augmented Reasoning Enhancement for Large Language Models , https://arxiv.org/abs/2412.02830 4 Dec, Perception Tokens Enhance Visual Reasoning in Multimodal Language Models , https://arxiv.org/abs/2412.03548 4 Dec, Evaluating Language Models as Synthetic Data Generators , https://arxiv.org/abs/2412.03679 4 Dec, Best-of-N Jailbreaking , https://arxiv.org/abs/2412.03556 4 Dec, PaliGemma 2: A Family of Versatile VLMs for Transfer , https://arxiv.org/abs/2412.03555 5 Dec, VisionZip: Longer is Better but Not Necessary in Vision Language Models , https://arxiv.org/abs/2412.04467 5 Dec, Evaluating and Aligning CodeLLMs on Human Preference , https://arxiv.org/abs/2412.05210 6 Dec, MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale , https://arxiv.org/abs/2412.05237 6 Dec, Expanding 
Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling , https://arxiv.org/abs/2412.05271 7 Dec, LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods , https://arxiv.org/abs/2412.05579 8 Dec, Does RLHF Scale? Exploring the Impacts From Data, Model, and Method , https://arxiv.org/abs/2412.06000 9 Dec, Unraveling the Complexity of Memory in RL Agents: An Approach for Classification and Evaluation , https://arxiv.org/abs/2412.06531 9 Dec, Training Large Language Models to Reason in a Continuous Latent Space , https://arxiv.org/abs/2412.06769 9 Dec, AutoReason: Automatic Few-Shot Reasoning Decomposition , https://arxiv.org/abs/2412.06975 11 Dec, Large Concept Models: Language Modeling in a Sentence Representation Space , https://arxiv.org/abs/2412.08821 12 Dec, Phi-4 Technical Report , https://arxiv.org/abs/2412.08905 13 Dec, Byte Latent Transformer: Patches Scale Better Than Tokens , https://arxiv.org/abs/2412.09871 13 Dec, SCBench: A KV Cache-Centric Analysis of Long-Context Methods , https://arxiv.org/abs/2412.10319 13 Dec, Cultural Evolution of Cooperation among LLM Agents , https://arxiv.org/abs/2412.10270 13 Dec, DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding , https://arxiv.org/abs/2412.10302 16 Dec, No More Adam: Learning Rate Scaling at Initialization is All You Need , https://arxiv.org/abs/2412.11768 16 Dec, Precise Length Control in Large Language Models , https://arxiv.org/abs/2412.11937 16 Dec, The Open Source Advantage in Large Language Models (LLMs) , https://arxiv.org/abs/2412.12004 16 Dec, A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges , https://arxiv.org/abs/2412.11936 17 Dec, Are Your LLMs Capable of Stable Reasoning? , https://arxiv.org/abs/2412.13147 18 Dec, LLM Post-Training Recipes, Improving Reasoning in LLMs , https://arxiv.org/abs/2412.14135 18 Dec, Hansel: Output Length Controlling Framework for Large Language Models , https://arxiv.org/abs/2412.14033 18 Dec, Mind Your Theory: Theory of Mind Goes Deeper Than Reasoning , https://arxiv.org/abs/2412.13631 18 Dec, Alignment Faking in Large Language Models , https://arxiv.org/abs/2412.14093 18 Dec, SCOPE: Optimizing Key-Value Cache Compression in Long-Context Generation , https://arxiv.org/abs/2412.13649 19 Dec, LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-Context Multitasks , https://arxiv.org/abs/2412.15204 20 Dec, Offline Reinforcement Learning for LLM Multi-Step Reasoning , https://arxiv.org/abs/2412.16145 24 Dec, Mulberry: Empowering MLLM with O1-like Reasoning and Reflection via Collective Monte Carlo Tree Search , https://arxiv.org/abs/2412.18319 31 Dec, Titans: Learning to Memorize at Test Time , https://arxiv.org/abs/2501.00663

Ginger Bill 11 months ago

OpenGL is not Right-Handed

The original Twitter thread: https://x.com/TheGingerBill/status/1508833104567414785

I have a huge gripe when I read articles/tutorials on OpenGL: most people have no idea what they are talking about when it comes to coordinate systems and matrices. Specifically: OpenGL is NOT right-handed; and the confusion over column-major “matrices”.

Let’s clear up the first point. Many people will say OpenGL uses a right-handed coordinate system, and loads of articles/tutorials will keep repeating that view. So the question is, why do people think this?

- It “was” true.
- It distinguishes it from Direct3D.
- People just repeat things w/o understanding it.
- Most who use OpenGL don’t really know anything about the GPU nor what the NDC is.

Modern OpenGL only knows about the Normalized Device Coordinates (NDC), which is treated as if it is a left-handed 3D coordinate space. This means that OpenGL “is” left-handed, not right-handed as many articles will tell you. The origin of “right-handed” comes from the fixed-function days, where the Z entries in the functions and were negated, which flips the handedness. Back in those days, it meant users pretty much had to use a right-handed world-space coordinate system. The Z-axis flip is to convert from the conventional world/entity space to the left-handed Normalized Device Coordinates (NDC). And modern OpenGL only knows the NDC. PLEASE READ THE SPEC BEFORE MAKING TUTORIALS!

Now for the “column-major” part. This part is actually overloaded and means two things: what is the default vector kind (column-vector vs row-vector), and what is the internal memory layout of a matrix (array-of-column-vectors vs array-of-row-vectors). A good article on this has been written by @rygorous, so I won’t repeat it too much: Row major vs. column major, row vectors vs. column vectors. It works because (A B)^T = B^T A^T (where T is the transpose).

But where does this difference in convention come from? My best hypothesis is the physics vs mathematics distinction (just like everything else). In physics you default to using column-vectors and in mathematics you default to using row-vectors. It’s just a convention. It’s just another distinction between OpenGL and Direct3D which makes very little difference at the end of the day, especially since in both GLSL and HLSL, you can specify whether a matrix is or if necessary.

I personally prefer column-vectors because of my background in physics, but all that is important is that you are consistent/coherent in your codebase, and make sure that all conventions are made extremely clear: handedness, axis direction, “Euler”-angle conventions, units, etc.

GLSL also doesn’t help in that the matrix types are written “backwards” compared to most other conventions: in GLSL means a 3 row by 2 column matrix. GLSL does not fit this well-known mnemonic: (A, B) x (B, C) = (A, C), e.g. (3 by 4) x (4 by 1) = (3 by 1).

Then there is the other related issue of having matrices that are stored in the array-of-column-vector layout: many programming languages don’t have a built-in concept of a matrix, and an array-of-row-vectors will be easier to write than an array-of-column-vectors because it is text. It is common to write a matrix in a language like the following, and if you are trying to adhere to “column-major” memory layout, then you will have to write it down transposed compared to how you would write it down on paper.

The Odin programming language has both built-in array programming (vectors & swizzling) and matrices, and as a result, this textual issue is completely removed! It even allows you to treat vectors as if they are column-vectors or row-vectors, and things will “just work”! You can even specify the memory layout of the matrices if needed too, with (the default) or .
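To make that concrete, here is a small sketch (the numbers are made up for illustration) showing that an Odin matrix literal is written in the same row-by-row order you would write it on paper, and that the same multiplication operator treats a vector as a column-vector or a row-vector depending on which side it appears:

```odin
package main

import "core:fmt"

main :: proc() {
	// Written textually as you would on paper: 2 rows, 3 columns.
	m := matrix[2, 3]f32{
		1, 2, 3,
		4, 5, 6,
	}

	v := [3]f32{1, 0, 0}
	w := [2]f32{0, 1}

	fmt.println(m * v) // v is treated as a column-vector: the result has 2 elements
	fmt.println(w * m) // w is treated as a row-vector: the result has 3 elements
}
```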

Ginger Bill 1 year ago

String Type Distinctions

[Originally from a Twitter Thread] Original Twitter Post

One thing many languages & API designers get wrong is the concept of a string. I try to make a firm distinction between:

1. string value ( or )
2. string builder ( or )
3. Backing buffer for a string ( or )

They are not equivalent even if you can theoretically use them as such, and so many garbage-collected languages use them as such. They have different use cases which don’t actually overlap in practice. Most of the issues with strings come from trying to merge concepts into one.

In Odin, the distinction between a string value and a byte array is very important. is semantically a string and not an array of 8-bit unsigned integers. There is an implied character encoding (UTF-8) as part of the value. is also an immutable value in Odin. Having a string be immutable allows for a lot of optimizations, but in practice, you never want to mutate the string value itself once it has been created. And when you do mutate it, it is most definitely a bug. This is why it is important to make a distinction between #1 and #3 and separate the concepts.

Another way to conceptualize the ideas is as the following:

- (3) is the “backing data”, an arena of sorts (fixed or dynamic)
- (2) are the operations on that buffer (fixed or dynamic)
- (1) is the final value that points to (3) and is produced by (2)

Coupled with Run-Time Type Information (RTTI), having a distinction between []byte and string allows for a lot of really decent (de)serialization tooling, especially for “magical” printing (e.g. fmt.println).

P.S. Even C makes a distinction between a string and an array of integers.
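As a rough Odin illustration of the three concepts above, here is a sketch using `strings.Builder` from `core:strings`; treat the exact procedure names as assumptions about the current core library rather than gospel:

```odin
package main

import "core:fmt"
import "core:strings"

main :: proc() {
	// (2) the builder: operations over a backing buffer, which is (3).
	b := strings.builder_make()
	defer strings.builder_destroy(&b)

	strings.write_string(&b, "Hellope")
	strings.write_string(&b, ", world!")

	// (1) the final string value: an immutable, UTF-8, length-bounded view
	// that points into the builder's backing buffer.
	s := strings.to_string(b)
	fmt.println(s)

	// A []byte is just bytes with no implied encoding; converting makes the
	// distinction explicit rather than blurring the two concepts.
	bytes := transmute([]byte)s
	fmt.println(len(bytes))
}
```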

Ginger Bill 1 year ago

The Video That Inspired Me To Create Odin

[Originally from a Twitter Thread] Original Twitter Post

Many people may not know this, but this video by Sean Barrett @nothings is partially the reason why I made the Odin programming language. And I’ll explain what insights it gave me in this thread 🧵.

A lot of these seem “so obvious” but for some reason it never clicked for me before this video. A lot of the “higher level” “scripting” languages are not that much higher level than C. Those languages just have garbage collection and much better standard libraries than C. Those languages are “easier to use” than C just because they have loads of built-in data structures like arrays, hash maps, and built-in syntax for them (how “extravagant”, I know).

I was already well aware of the STB libraries that Sean developed, and their single-header-file style too (which is mostly a cope for C’s flaws), but it never clicked for me that the library that Sean used the most was the “stb.h” one. This led me to organize code I had used in previous projects already and put it into a single and easy to use library which I could use in any of my C projects going forward in 2015/2016: https://github.com/gingerBill/gb/blob/master/gb.h

However by around March, I began experimenting with just making an “augmented” C compiler so that I could add constructs to C which I found to be some of the most annoying things C lacked. The first, and most important, two constructs were: I had already been using in C++ for many years already. I really wished the concept was native to C. (My opinion has now changed for C for many reasons.) P.S. I wrote an article about what I had been doing back in 2015 for defer in C++11

Eventually I realized I couldn’t just add constructs and features to C to “improve” my personal projects, and that any of my professional projects could not use them either. This then led me to start my own compiler to fix those issues I had with C. Interestingly, “Odin” (which was the original codename, which just stuck) started life as a Pascal-like language with begin/end too, but I was trying to make it “C-like” enough for what I wanted. About 3 months later, I had implemented ~70% of Odin; the other ~30% took 7 years.

Odin wasn’t my first programming language that I had designed nor implemented, but it was the first where I actually went “you know what?—this will actually be useful for my general needs”. And now I use it for my living at JangaFX, where all of our products are written in it.

Other than slices and , the other things I did for Odin were to solve a few practical issues I had in C:

- A slice-based type
- Decent variadic parameters (a slice-based one)
- RTTI (not bug-prone format verbs in printf, and not my janky generated stuff)
- Actual libraries

I won’t discuss the benefits of slices here, but I will discuss the latter 3 things of that list. In C, pretty much my only use case for variadic parameters were -style things. Adding a decent form of that, coupled with RTTI, allowed me to have a runtime-typesafe . As for “actual libraries”, they need their own namespace. When the user imports it, they can change the import name itself, rather than relying on the library to prefix everything (or C++ namespaces) to minimize namespace collisions. This eventually led to Odin’s package system.

I highly recommend everyone to watch this talk by Sean Barrett! It was a huge inspiration for me, and just a general eye-opener for something which should have been obvious to me at the time but wasn’t.
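Coming back to the variadic parameters + RTTI point above, here is a tiny sketch of the kind of runtime-typesafe printing it enables; the helper name is made up for illustration:

```odin
package main

import "core:fmt"

// Variadic parameters in Odin are slice-based: args has type []any here,
// and the any type carries runtime type information for each value.
print_all :: proc(args: ..any) {
	for arg in args {
		fmt.println(arg) // no format verbs needed; fmt inspects the type info
	}
}

main :: proc() {
	print_all(123, "hellope", 3.14, [3]int{1, 2, 3})
}
```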

Ginger Bill 3 years ago

Reverse Engineering Alembic

For my work at JangaFX, we require the use of the Alembic interchange file format. We have been using other libraries which wrap reading the Alembic file format, but because they are not the native library, they have numerous issues due to the generic interface. I spent nearly 4 days trying to get the official Alembic C++ API, https://github.com/alembic/alembic/ , to compile correctly and then use the API itself. Numerous times the compilation would get corrupted (it compiled but none of the tests even ran), and when I got it to work (on another machine), the API itself was a labyrinth to navigate. After this ordeal I decided to see how difficult it would be to create my own reader, and then writer, for the format from scratch.

I don’t really like to “reinvent the wheel” when I don’t have to, but in this case, the C++ API was an absolute nightmare and I had spent more time trying to get that to work than actually solving the problem I had. Making my own library for Alembic turned out to be a little more work than I expected. Even though it is an “open source” format, it is effectively completely undocumented, and the bits that are documented refer mostly to specific (not all of them) schemas rather than the internal format. This article will be a basic document of the internals of the Alembic file format.

Through my journey into “discovering” the Alembic format, I found out that Alembic is not actually a file format but masquerades as one. It is in fact two file formats with different memory layouts which can be determined based on the magic file signature: HDF5 is a hierarchical data format which is commonly used to store and organize large amounts of data in a hierarchical fashion. It’s commonly used in the scientific field rather than the visual effects industry, and as such, this internal format is rarely used for storing data that would be useful for our tasks for importing (e.g. meshes, cameras, animations, etc.). HDF5 is effectively a storage format for database-like data; it’s a very good format in terms of usage (I will admit I have no idea about its internal design). It appears that the vast majority of “Alembic” files are not in the HDF5 format but in Ogawa.

The main format of concern is named Ogawa. It’s a little-endian binary format (thank goodness) which was designed to be readable in-place for efficient multi-threaded data reading. This part of the file format is luckily documented 2 , and small enough that I could write it in a single tweet. Similar to HDF5, Ogawa is a hierarchical data format that is simple to read, but differing from HDF5, it is completely uncompressed.

Note: When decoding binary formats, it is very useful if you have a hex-viewer to see the bytes in a readable way. On Windows, I would recommend either Hex Editor Neo or 010 Editor. Even though the Ogawa format is mostly documented on that GitHub Wiki page, I still required a hex-viewer from time to time to ensure everything was laid out as I expected, especially for the more complex embedded data.

Its header is very simple: the magic signature is just 5 bytes containing the ASCII characters , followed by a byte-wide flag which states whether the file is open or closed to writing, followed by 2 bytes specifying the version of the format (which will always be ), and finally followed by an offset to the root “group”. The is there to allow writing to the file and prevent other applications from trying to read it whilst it is not finished. It begins off as and then once the file is finished, it becomes .
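A minimal sketch of that header as an Odin struct follows; the field names are my own, the widths simply follow the description above, and I am assuming the magic is the ASCII string "Ogawa" and that the usual endian-specific integer types are appropriate here:

```odin
package alembic_sketch

// The 16-byte Ogawa header: 5 magic bytes, a 1-byte "frozen" flag,
// a 2-byte version, and a 64-bit offset to the root group.
Ogawa_Header :: struct #packed {
	magic:       [5]u8,  // ASCII "Ogawa" (assumption)
	frozen:      u8,     // whether the file is closed to further writing
	version:     u16le,  // format version
	root_offset: u64le,  // byte offset of the root group from the file base
}

#assert(size_of(Ogawa_Header) == 16)
```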
For my Alembic writer, I was storing the data in a memory buffer and then writing the entire data to the file at once, meaning this byte-wide flag was not that useful.

All group-based offsets in the Ogawa format are stored in a 64-bit unsigned little-endian integer ( ) representing the number of bytes from the base of the file 3 . These group-based offsets come in two different flavours: an Ogawa Group or an Ogawa Data (byte stream). The offset encodes a flag in its highest bit (63rd bit) to indicate which flavour: Group if 0 and Data if 1. The remaining 63 bits represent the offset to this specific kind of value; if the remaining 63 bits are all zero, it means it is a terminating value (not pointing to any actual memory). A Group begins with a representing the number of children following it in memory (all of which are another offset, ). If the value is 0, this represents an empty group that has no children. A Byte-Stream begins with a representing the number of bytes following it in memory.

This basic file format is very simple to read, but it has zero semantic information to make it useful for something like an interchange format. The semantic structure is what I am calling the Meta-Schema for Alembic. To begin decoding the format, I created a basic Ogawa to JSON viewer to see the tree structure of the file format. As I began to understand more of the format, the JSON structure became much more complex but richer in semantic data.

Note: I highly recommend converting any hierarchical format into something like JSON just as a debug view alone, since there are many tools such as Dadroit that allow for easy viewing. But if you want to make your own format and viewer with your own tools, go ahead!

After a bit of exploring, the meta-schema begins with the following data for the root group: From this point on, it can be assumed that a lot of the data is unaligned to its natural alignment and must be read accordingly. This data is only stored at the root group (child ) and referenced elsewhere by index. Each value containing the samples is stored in contiguous memory: this memory needs to be read in order to determine the number of samples.

All metadata is stored as a string. Each entry is a key and value separated by the ASCII character , and each entry is separated by . Deserializing this is a very simple operation to do in most languages. These strings are NOT NUL-terminated like in C because they are bounded by a length that usually prefixes them. As I am using Odin, its type is length-bounded, making this process a lot easier compared to C.

The indexed metadata is only stored at the root group (child ) and referenced elsewhere by index. Each value is stored in contiguous memory: the maximum number of indexed metadata entries allowed is ( ).

The root group stores an offset to the first group (child ) representing Object Data. Object data contains a list of the nested children as well as properties for its “this” Object. The Object Headers are stored in contiguous memory for each entry, but the value of its (see below) determines whether the metadata value is stored inline or stored in the indexed metadata at the root group.

There are three different forms of properties that may be stored within Alembic: Scalar, Array, and Compound. Child of the Object Data group must represent Compound property data. Compound property data is a group which stores children for properties.
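Before moving on to the property data, here is a small sketch of decoding the group/data offsets described above; the type and procedure names are my own:

```odin
package alembic_sketch

Ogawa_Child_Kind :: enum {
	Terminator, // remaining 63 bits are all zero: points at nothing
	Group,      // flag bit clear: an Ogawa Group
	Data,       // flag bit set: an Ogawa Data byte-stream
}

// Decode one raw child entry: the top bit is the Group/Data flag and the
// low 63 bits are the byte offset from the base of the file.
decode_child :: proc(raw: u64le) -> (kind: Ogawa_Child_Kind, offset: u64) {
	value := u64(raw)
	offset = value & 0x7fff_ffff_ffff_ffff // mask off the flag bit
	switch {
	case offset == 0:      kind = .Terminator
	case value >> 63 != 0: kind = .Data
	case:                  kind = .Group
	}
	return
}
```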
Compound property data is effectively a hash map (like a JSON object) containing keys, properties, and the values of more property data. In the compound property headers byte-stream, it contains a list of the nested of each property header.

n.b. The property headers are the weirdest aspect of the entire format; they are not simply expressible in terms of basic data structures and require a lot of logic for the encoding. is an which encodes information about the property: For the property headers containing inline metadata:

Scalar property data is constant if in the property header. The number of samples is equal to the . The indexed samples are stored in the children groups of the property data, beginning at relative offset 0. Depending on whether the samples are constant or not, the actual index needs to be mapped using the following procedure: The data for each sample contains both the key/hash of the data and the uncompressed data. The hash is stored at the beginning and is represented as a 16-byte (128-bit) SpookyHash; the data is stored after this hash.

Array property data is constant if in the property header. The number of samples is equal to the . Indexing into the samples is a little more complex than scalar data. The number of groups stored for the sample data is doubled up, where each pair of groups represents the sample data and then the data for the dimensions. The sample data is similar to that of the scalar, but the extra group that represents the dimensions is an array of values which represents the tensor shape of the data. Most of the time, the dimensions are just a single element representing the number of elements in the array. The data group has the same kind of hash prefixing the data (using SpookyHash).

I will discuss the schemas of Alembic at a later date since they could fill their own article. For the needs we have, we only really care about reading in Cameras and Xforms (Transforms), and writing out Particles/Points.

I think the general Ogawa format is absolutely fine, but because of its lack of semantic structure, it requires applying one to it. Even JSON has some semantic meaning by virtue of being a text format (which I have been using as a format to display and debug the Ogawa data).

Note: I would never recommend using JSON when another file format is a better fit (especially if you have the luxury of using a custom format); JSON is too general for the vast majority of use cases. If you have the choice of using an interchange format, always use a binary one (even a custom one) since it is pretty much never the case that you will need to modify the file in a text editor.

The design of the property header is very strange to me. It seems like it is trying to be as compact as possible, but the rest of the file format is uncompressed. I cannot see why this bit would need to be compressed/compacted when the property data it refers to is uncompressed. The size hint bit is really bizarre, and I personally would have just stuck with a single integer type (i.e. ) to keep things simpler. The use of indexed metadata seems a bit bizarre since, because the entire format is offset-based, the indexed data could have pointed to the same memory address to save memory. I know this means that the hierarchy is now not strict, but for metadata, I personally don’t think it would be much of an issue. Speaking of offset-based file formats, one potential issue for malicious files is forming cycles in the hierarchy, which could cause many readers to crash. I am not sure why Spooky-Hash was used.
It is not really an easy hashing algorithm to implement (~330 LOC of Odin) and not necessarily better than other hashes out there. It is also not that commonly used compared to simpler (and maybe better) hashes.

After a lot of work, I got a fully functioning Alembic reader and writer that was ~2000 LOC of Odin (including the necessary schemas we required at JangaFX). This is a hell of a lot smaller and simpler than the official Alembic C++ API for the same amount of functionality.

Root group children:

- 6 children minimum; every file that I tested always began with 6 children
- Child is a byte-stream which stores an containing something akin to the archive version (always appeared to )
- Child is a byte-stream which stores an containing the file version, which must be . From the reverse engineering, it appears that the value stored in decimal represents the file format: e.g. represents
- Child is a group to the first Objects based at the root
- Child is a byte-stream which stores the file’s metadata
- Child is a byte-stream which stores the time samples and max samples information
- Child is a byte-stream which stores indexed metadata which nested properties may use later. In my Alembic writer, I never used any indexed metadata and just did everything inline since it was easier to handle

Object Data children:

- Child stores the data for the properties in the form of Compound Property Data
- Children store the objects
- Child stores the data for the child Object Headers

Object Header metadata:

- If is equal to , the metadata is stored directly after it
- If is any other value, it represents the index to the metadata stored at the root group (this value must be less than the number of indexed metadata entries, otherwise it is an invalid file)

Property data children:

- Children represent the nested property data which is denoted by the property headers
- Child is a byte-stream which contains the compound property headers

Property header bit fields:

- Bits : represents the property type: =
- Bits : represents the size hint =
- Bits : represents the plain old data type: = (byte), = string (I am yet to see this in the wild)
- Bit states whether the and is set for a Compound type
- Bit if set, and equal
- If bits and are not set, ,
- Bit states if the property data is homogeneous
- Bits : represents the “extent” of the value, e.g.
- Bits : represents the . This follows the same rules as Object Headers: if is equal to , the metadata is stored directly after it; if is any other value, it represents the index to the metadata stored at the root group (this value must be less than the number of indexed metadata entries, otherwise it is an invalid file)
- Bits : I have no idea what these represent, if anything

Technically it’s Ogawa-Flavour Alembic with its own specific meta-schema, which I will get to later in the article.  ↩︎
https://github.com/alembic/alembic/wiki/Ogawa-Specification   ↩︎
I’m a huge fan of binary file formats designed this way, where they use effectively internal relative pointers; I should explain my preferred approach to designing binary file formats in the future.  ↩︎

Ginger Bill 3 years ago

Multiple Return Values Research

I have recently been thinking about multiple return values as a concept, and wondering if there has been any kind of literature into the topic of “true” multiple return values which are not emulated through tuples. My current working hypothesis is that (unless there is evidence to the contrary) Odin has invented a new kind of type system, something akin to polyadic expressions. It appears that Go and Odin are the only ones with what I call “true” multiple return values, but Odin’s approach is a lot more sophisticated than Go’s and extends the concept further.

Odin’s idea is pretty simple: an expression may have n-values, where n is a natural number ≥0. Meaning an expression may have 0, 1, 2, 3, 4, etc. values, of which each value has a type, not that each expression has a type. An expression with 0-values has no (final) type to it (not even like in some languages). A way of thinking about this is that Odin is fundamentally a polyadic language rather than a monadic language, like pretty much everything else. However most languages seem to be polyadic in one aspect: input parameters to procedures/functions. In most languages, you have multiple (monadic) inputs but only one (monadic) output, e.g. ( ). Odin’s approach is to just extend this to the rest of the language’s type system, not just procedure inputs.

In languages with tuples, they can emulate the ability to have multiple return values, especially if the language has sugar encouraging it. Let’s take Python as a basic example where tuples can be used to emulate multiple return values: Other languages may not use tuples to achieve multiple return values; languages such as Common Lisp and Lua do things slightly differently by pushing multiple values, BUT the handling of the number of values is handled dynamically rather than statically: This is similar to in Common Lisp, Scheme, etc., which can suffer from exactly the same issues in terms of value-dropping bugs, especially due to its dynamic typing approach. A good example of this is to show that is the same as in Common Lisp.

However, dynamically typed languages are not my area of research for this specific question; statically and strongly typed ones are. And in that case, all of the languages that allow for something like multiple return values (except for Odin and Go) use tuples. However tuples can still be passed around as a singular/monadic intermediary value. Tuples are effectively a form of monad which wrap values of (possibly) different types (essentially an ad-hoc record/struct type), which some languages may allow for explicit or implicit tuple destruction.

Odin does differ from Go on multiple return values in one important aspect: Go’s approach is a bit of a hack which only applies in the context of variable assignments. The following is valid in Odin but not valid in Go: The reason is that Go’s approach for multiple returns only works in the context of assignment, and the assignment assumes that there are either N expressions on the RHS or 1 expression on the RHS to match the N variables on the LHS. Odin’s approach is a little more sophisticated here: the assignment assumes that there are N values on the RHS for the N variables on the LHS, where the N values may be made from M expressions (N ≥ M).
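A minimal sketch of that N-values-from-M-expressions rule, with a made-up procedure for illustration:

```odin
package main

import "core:fmt"

two_values :: proc() -> (int, int) {
	return 1, 2
}

main :: proc() {
	// 3 variables on the LHS, 2 expressions on the RHS producing 3 values
	// in total: fine in Odin, not allowed in Go.
	x, y, z := two_values(), 3
	fmt.println(x, y, z) // 1 2 3
}
```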
Another fun thing about Odin because of this fact is that you can do the following (which is also not possible in Go): It also means something like this is possible in Odin: Internally, Odin’s approach could be implemented identically as if a procedure were an input tuple + an output tuple which is then automatically destructed, but semantically Odin does not support first-class tuples. n.b. For this research, implementation details are not of concern, only usage/grammar.

Polyadic expressions are also used in other areas of the Odin programming language. One of the most common is the optional-ok idiom 1 :

I am wondering if this approach to multiple return values is all Rob Pike’s fault, since the only languages I can find that are like what I speak of are:

- Alef, Limbo, Go (Rob Pike related, the first two being non-destructing tuples)
- Odin (me, who has been influenced by many of Pike’s languages)

Alef and Limbo had multiple return values through tuples, but tuples were still not really first-class data types in those languages. In many ways, Alef and Limbo were the main precursors to Go (not just Oberon and Modula). Limbo’s declaration syntax comes from Newsqueak, which is also the precursor to Odin’s very own declaration syntax 2 . So maybe the reason there is little literature (that I can find) on this topic is purely because it is limited to languages related-to/influenced-by Rob Pike.

It might be that the approach taken by Odin is unique to Odin; however, it is also one of my favourite things about Odin too. And I am really glad I added it so early on, since it has shown its utility really quickly. Coupled with named return values, bare statements, and , it is an absolute pleasure to work with.

If anyone has any more information about this topic, please feel free to contact me with it!

P.S. If there is no research done into this area, it is a good sign since there is so much left to discover in Programming Language Theory.

This is technically 3 different addressing modes depending on the context (map-index, optional-ok, or optional-ok-ptr)  ↩︎
https://www.gingerbill.org/article/2018/03/12/on-the-aesthetics-of-the-syntax-of-declarations/   ↩︎

Ginger Bill 4 years ago

The Value Propagation Experiment Part 2

[Originally from a Twitter Thread] I have revisited The Value Propagation Experiment and have come to a different conclusion as to why it failed and how I recovered it. The recovery work has been merged into master now with this PR: https://github.com/odin-lang/Odin/pull/1082

I think there were three things which were confusing which made it look like a failure of an experiment:

- was a confusing name for what the semantics were.
- as a prefix was the wrong place.
- Unifying with was a bad idea for many reasons.

Changes in the PR:

- Built-in procedure became a binary operator instead (meaning it became a keyword).
- was added as a keyword.
- became a suffix-operator/atom-expression.

Most people understood the purpose and semantics pretty intuitively, but many were very confused regarding the semantics of . My conclusion was that the keyword name and positioning were the main culprits for that.

Another aspect which I stated in the original article was that I thought the construct was not as common as I originally thought. And for my codebases (including most of Odin’s core library), this was true. However it was false for many codebases, including the new core:math/big. In the core:math/big package alone, or_return was able to replace 350+ instances of idioms. The more C-like a library is in terms of design, the more it required this construct. It appears that when a package needs it, it REALLY needs it. When this correction was made, there were 350+ instances of or_return in a ~3900 LOC package, which is ~9% of the (non-blank) lines of code. That’s a useful construct for definite.

Another (smaller) package that found or_return useful was the new core:encoding/hxa, which is an Odin native implementation of @quelsolaar ’s HxA format. https://github.com/quelsolaar/HxA The entire implementation for reading and writing is 500 LOC, of which there are 32 uses of or_return.

I do believe that my general hypotheses are still correct regarding exception-like error value handling. The main points being:

- Error value propagation ACROSS library boundaries
- Degenerate states due to type erasure or automatic inference
- Cultural lack of partial success states

The most important one is the degenerate state issue, where all values can degenerate to a single type. It appears that you and many others pretty much only want to know if there was an error value or not and then pass that up the stack, writing your code as if it was purely the happy path and then handling any error value. Contrasting with Go: Go has a built-in concept of an interface, and all values degenerate to this interface. In practice, from what I have seen of many Go programmers, most people just don’t handle error values and just pass them up the stack to a very high place and then pretty much handle any error state as if they are all the same degenerate value: “error or not”. This is now equivalent to a fancy boolean.

I do hope that this addition of or_return does aid a lot of people when making projects and designing packages. It does appear to be very useful already for many of the core library package developers for Odin.
Ginger Bill 4 years ago

The Value Propagation Experiment

[Originally from a Twitter Thread] Part 2 of this Experiment

I recently experimented with adding a feature into Odin which allowed for a way to propagate a value by early returning if that value was a failure value or not. It was in a similar vein to Rust's try! which became ? , or Zig's try , etc. I have now removed it from Odin. But why?

The hypothesis was that this idiom was common: check the end error value and return early if it is set, where that value may be an enum, a (discriminated) union, or any other kind of value that has a "failure" state. And replace it with the try construct 1 .

This construct solves a very specific kind of error handling, which optimizes for typing code rather than reading code. It also fails because Odin (and Go) are languages with multiple return values rather than single-type returns. And the more I think about it, the explicit if err != nil check and similar stuff may be the least worst option, and the best in terms of readability. It's a question of whether you are optimizing for reading or typing, and in Odin, it has usually been reading. And something like a tiny try prefix instead of the explicit check does reduce typing but is a lot harder to catch (even with syntax highlighting). It happens that Go already declined such a proposal for numerous reasons, and the research done for this is directly applicable to Odin because both languages share the multiple return value semantics.

The research has been fruitful however. I did experiment with a construct which has now become the built-in procedure or_else , which can be used on things with an optional-ok check, e.g. map indices, type assertions.

Some people may be a little surprised with my experimentation with this exception-like shorthand with error values, especially since I wrote an article (which was originally two github comments) titled: Exceptions — And Why Odin Will Never Have Them . One thing I did not comment on in that article is the cause of the problem (other than the cultural issues). My hypothesis is that if you have a degenerate type (type erasure or automatic inference), then if a value can convert to it implicitly (easily), people will (ab)use it. So in languages with exceptions, all exception values can degenerate to the "base type". In Rust, it can either go to the base trait or be inferred parametrically. In Zig, it can either go to the global error set or it will infer the error set from usage. Go has the built-in error interface type which acts as the common degenerate value.

As I discuss in the article, I am not against error value propagation within a library, but I am pretty much always against it across library boundaries. A degenerate state has high entropy and a lack of specific information. And due to this form of type erasure, "downcasting" (broad use of term) is a way to recover the information, but it assumes implicit information which is not known in the type system itself.

The other issue, when people pass the error up the stack for someone else to handle (something I criticize in the previous article already), is that it's common to see this pattern in many codebases that have such a degenerate type; Go, Rust, and Zig (public) codebases exhibit this a lot. And my hypothesis for this phenomenon is due to the very nature of this "degenerative type".

Now a design judgement is to be made when designing a language: is such a concept worth it for the problems it intrinsically has? For Odin, I do not think it was worth it. In Odin, errors are just values, and not something special. For other languages? That's another thing.
For Odin, I have found that having an error value type defined per package is absolutely fine (and ergonomic too), and it minimizes, but cannot remove, the problem of value propagation across library boundaries.

To summarize: try was a bad idea for Odin considering the rest of its semantics (multiple return values, lack of an error value type at the semantics level, optimizing for typing rather than reading), and the other part of the experiment has now become or_else , which is useful.

n.b. I am not criticizing any particular language's design for doing this, but rather saying that it does not work well for Odin's semantics nor philosophy.

Part 2 of this Experiment

The concept of try worked by popping off the end value in a multiple valued expression and checking whether it was a failure value or not, and if so, setting the end return value to that value if possible. If the procedure only had one return value, it did a simple return. If the procedure had multiple return values, try required that they were all named so that the end value could be assigned to by name and then an empty return could be called.  ↩︎
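As a rough, hypothetical illustration (my own, not the article's listing) of what that built-in became: or_else supplies a default value for expressions that have an optional-ok form, such as map indices and type assertions.

```odin
package or_else_example

import "core:fmt"

main :: proc() {
	config := map[string]int{"width" = 1280}
	defer delete(config)

	// Map indices have an optional-ok form; or_else substitutes a default
	// when the ok value would be false.
	width  := config["width"]  or_else 640
	height := config["height"] or_else 480
	fmt.println(width, height) // 1280 480

	// The same works for type assertions on a union.
	val: union {int, string} = "hello"
	n := val.(int) or_else -1
	fmt.println(n) // -1
}
```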

Ginger Bill 4 years ago

Structured Control Flow (Brain Dump)

Note: This is a "brain dump" article, and subject to be cleaned up.

Examples in Odin:

Procedure call
Terminating: return
Conditional: if , switch
Looping:
for - loop with initial statement, condition, post statement, and body
for in - loop with a value to be iterated over
for - loop with condition then body
- loop with body then condition
Branching:
break - go to end outside of the control statement
continue - skip to the end of a loop
fallthrough - merge two switch case bodies, to have multiple entry points to the merged body
Labels on other control flow statements
Deferred / Structured Exception Handling (not specifically Microsoft's SEH )
Default (named) return values

Procedure call - x86: Call instruction
Terminating - x86: Return instruction
Comparison - x86: Comparison instructions
Jump - x86: Unconditional jump instructions; x86: Conditional jump instructions
Indirect Jump (goto by value) - GNU C: Labels as values (jump tables)
Stack Unwinding (exceptions) - longjmp/setjmp
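As a rough sketch (my own, since the article's examples are not reproduced here) of several of these constructs in Odin: labels, break/continue targeting a labelled loop, switch with explicit fallthrough, and defer.

```odin
package control_flow_example

import "core:fmt"

main :: proc() {
	// defer: statement deferred to the end of the enclosing scope.
	defer fmt.println("done")

	// Labeled loops: break/continue can target an outer loop by label.
	outer: for i in 0..<3 {
		for j in 0..<3 {
			if j > i {
				continue outer
			}
			if i == 2 && j == 1 {
				break outer
			}
			fmt.println(i, j)
		}
	}

	// switch: cases do not fall through by default; fallthrough merges bodies.
	x := 1
	switch x {
	case 0:
		fmt.println("zero")
	case 1:
		fmt.println("one")
		fallthrough
	case 2:
		fmt.println("one or two")
	case:
		fmt.println("something else")
	}
}
```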

Ginger Bill 5 years ago

Relative Pointers

Pointers are a value type in programming languages that store a memory address. A pointer references a location in memory, and obtaining the value stored at this location in memory is known as dereferencing a pointer. Pointers are part and parcel of using languages like C, and are an extremely powerful tool. Pointers can be treated as a form of reference type, a value that refers to another typed value in memory. When most people think of a pointer, they usually treat them as if the pointer refers a physical memory address. I will call this form of pointer an absolute pointer. On most modern machines with virtualized memory, this is rarely the case any more, and for good reasons 1 . In the case of virtualized memory spaces, these pointers do not refer to a physical memory address but rather an alias to a physical memory address, in which usually the operating system handles on a per page basis 2 . In a sense, these pointers that belong to virtualized memory spaces can be classed as a form of relative pointer. At the end of the day, a pointer can be treated/encoded as if it was an integer of some form. Knowing this fact is extremely useful, not just with pointer arithmetic in C/C++, but also relative pointers. In general, there are four forms of relative pointers: MSVC supports a non-official extension to C and C++, based pointers , which allows the user to declare pointers based on pointers. They can be declared using the keyword. The size of a based pointer is still the same size as a regular pointer, but now the value it points to relative to that value. This form of relative pointers is useful but because it is based on a globally defined variable, it muddles the type system by coupling it with values which means it cannot be applied as generally. Based pointers can also be replicated in C++ through the use of templates and operator overloading, it also has the advantage of allowing the size of the user defined based pointer to be a different size to a regular pointer: Another approach which is very similar to based pointers but the base is not explicitly defined by the type itself, which I will refer to as offset pointers . I use this approach a lot when designing file formats, which has the base being the start of the file. I originally learnt about this trick from reading the Quake source code for the MD2/MD3/MD4 file formats which used this trick a lot, especially for relative arrays . This shows one of the main advantages of relative pointers: serializeability. This means that the data structure can be ’d and still be valid. Offset pointers can also be replicated in C++ through the use of templates. Similar to the previous user-defined based pointers above, the size of the underlying integer can be any size. One of the cool tricks here is that the data structure’s fields do not contain any type information about what data type it pointers but rather that data type is encoded in the constant parameters of the templated data structure. The same form of data structure can be implemented in Odin too: In the above cases, the represents the number of bytes away from the base . It could be used to represent the number of typed-units instead, but this may require that the base pointer is correctly aligned to the referenced type on some systems. The last approach is to define the base to be the memory address of the offset itself. This approach has been used at Valve , and in some programming languages such as Jonathan Blow’s programming language 3 Jai and my own language Odin . 
As with offset pointer , self-relative pointers have a similar main advantage: serializeability. This means that the data structure can be ’d and still be (mostly) valid. One issue with self-relative pointers is that for the value itself to be valid, it must be in an addressable memory space (i.e. must be an l-value and not in a register). If a language containing such a concept wanted to ensure the validity of such a data type, any other data type that contains it would also have to be addressable to. This adds a form of virality to the type system, adding a new concept of addressability to the type system directly. However, even if this was enforced, there are cases when a self-relative pointer in non-addressable memory could be perfectly valid. Another issue is that smaller base integer types for the internal offset value means that they can only be valid over a smaller range than regular pointers. Relative pointers can be implemented in C++ with the use of templates and operator overloading 4 : In the above cases, the represents the number of bytes away from its own memory address. It could be used to represent the number of typed-units instead, but this may require that the base pointer is correctly aligned to the referenced type on some systems. I personally find this approach less useful in practice even if a stride of greater than 1 does give a larger range to work with in theory ( bytes). It should also be noted that this implementation of a relative pointer does treat the value as a sentinel value similar to regular pointers. Relatively recently, I added self-relative pointers to Odin because I found a decent use case for them compared to just using (explicit) offset pointers. I also extended the concept of relative pointers to relative slices too, similar to the offset array example from above. Due to the fact that the underlying base integer of a self relative pointer can be specified by the programmer, it can be exploited for a lot of useful purposes. One basic example could be having a very small linked list where the next item in the list is defined to be within a certain range. This means you could defined a relative pointer to be stored in a single byte if necessary: Relative pointers allow for a lot more flexibility compared to “traditional” pointers, and a godsend with regards to serialization. They do have many advantages and disadvantages compared to “traditional” pointers, and the programmer must be absolutely aware of them before using them. Relative pointers can be easily serialized making them ideal for file formats, and relative pointers can be as small as they need to be rather than taking up the same size of a regular pointer (e.g. on 64-bit platforms, a pointer is 64 bits which might be a lot more than the user requires for a specific problem). Virtualized memory has many security, safety, and practical benefits.  ↩︎ I will not discuss how virtual memory works or its benefits in this article.  ↩︎ Jai is also the first language I have seen implement self-relative pointers as part of the type system directly.  ↩︎ Please note that this is technically undefined behaviour, even if most compilers handle it as expected.  ↩︎ Virtual memory pointers—relative to the virtualized memory space Global-based pointers—relative to a global variable Offset pointers—relative to an explicitly defined base, such as the beginning of a file Self-Relative/Auto-Relative pointers—relative to its own memory address Virtualized memory has many security, safety, and practical benefits.  
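The article's own C++ and Odin listings are not reproduced above, so here is a small, hypothetical Odin sketch of an offset pointer in the spirit described: the stored integer is a byte offset from an explicitly supplied base (e.g. the start of a file buffer), with 0 used as the nil sentinel. The Offset_Ptr and resolve names are mine.

```odin
package offset_ptr_example

import "core:fmt"

// Hypothetical offset pointer: a small integer offset, in bytes, relative to an
// explicitly supplied base pointer. offset == 0 acts as the nil sentinel.
Offset_Ptr :: struct($T: typeid) {
	offset: u32,
}

resolve :: proc(base: rawptr, p: Offset_Ptr($T)) -> ^T {
	if p.offset == 0 {
		return nil
	}
	return (^T)(rawptr(uintptr(base) + uintptr(p.offset)))
}

main :: proc() {
	// Pretend this is a file loaded into memory; the "pointer" stored in the
	// format is just the byte offset 4 from the start of the buffer.
	buf: [8]byte
	buf[4] = 42
	p := Offset_Ptr(u8){offset = 4}
	fmt.println(resolve(&buf[0], p)^) // 42
}
```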

Ginger Bill 5 years ago

A Reply to _Let's stop copying C_

I read the article Let’s stop copying C about 3 years ago. Recently someone brought it up again and I thought I would comment on the points being made. The article argues that newer languages ought not to copy the mistakes of C and comment on may of C’s mistakes. I recommend reading the original article first before reading this one as I will be commenting directly on the subsections of the article. A lot of my comments will be with regards to systems-level programming languages and not programming languages in general, and I will be referring back to my language Odin as a contrast with C. Back in the day when C was made, it is perfectly understandable why was used over different approaches. C was designed to be able to be implemented with a single-pass compiler 1 . However, since the advent of more modern computers, this approach is not needed and a proper library/module/package system is a better solution to the problem. C++ approach with s is not a solution and more of a hack. One huge annoyance with C is the issue of namespace collisions. The general approach to solve this is to make the interface to your library prefixed with something e.g. . However, this does not guarantee that namespace collisions don’t happen, and it means that you have to write these extra prefixes when you just want to write the code as normal without having to worry about all of this. From a usability standpoint, when you “import” a library, you want to be able to have those imported entities belong to its own scope which can be named with an import name . The main categories of how I think most languages handle libraries in this fashion are the following: After a lot of experimentation, I decided to settle for something like Modula for Odin as it was the best compromise in practice. I originally wanted to support single-file libraries but found that I could not get them to work with multiple-file libraries in a nice way. There is kind of a use case for textual inclusion, and that is embedding data within the program itself. This is one of the reasons why I have the built-in procedure in Odin. This procedure takes a constant string as a file path and returns a slice of bytes of the data of that file. It’s not exactly textual inclusion but it does solve some of the problems that has been used for in C. In Odin, there are no optional block delimiters but if you want to do a single statement, you have to explicitly state it (as a compromise): C’s operator precedence for bitwise operations is bad and requires an “overuse” of parentheses. I fixed this in Odin so that they feel a lot better. Getting the right precedence required a bit of experimentation to get it to feel correct. In different languages, the operator is usually implemented as either dividend modulo or divisor modulo operations. In C, it is now defined as dividend. In Odin, I added the operator to solve this issue. Where is dividend and is divisor. For unsigned integers, and are identical, but the difference comes when using signed integers. is equivalent to the following using : See the following for more information: https://wikipedia.org/wiki/Modulo_operation I agree C’s octal notation has many issues, and this has its routes back when bytes were not always a multiple of 8 bits. This is why Odin uses to denote a number as octal; it is more consistent with other numeric bases too: for binary, for octal, for explicit decimal, for dozenal, for hexadecimal. For a systems level programming language, having a “power” operator is not a good idea. 
Having an explicit procedure is a lot clearer as to what is happening for two reasons: I agree that 90% of the time you just want a style loop. However C-style loops do have their uses and getting rid of them entirely is not a great idea. If you would like to see Odin’s approach to loops, I recommend reading the overview for them at https://odin-lang.org/docs/overview/#for-statement . I personally use statement as the equivalent of a jump-table and/or if-else chain. For this reason, I usually want each case to have a at the end and for each case to be a separate scope too. This is why in Odin I added an explicit statement and made each case its own scope. I also extended the statement to properly understand: statements in Odin require all cases of and to be handled by default unless explicitly specified with . If you would like to see Odin’s approach to switch statements, I recommend reading the overview for them at https://odin-lang.org/docs/overview/#switch-statement . C’s “type first” was a weird design decision. It was meant to expression that the type declaration is the same as how it is used in an expression. For this reason, C has given us some of the most “wonderful” declarations: And not to mention that type qualifiers can be in different places and still mean the same thing: C’s approach also requires a symbol table whilst parsing to disambiguate between types declarations and an expression e.g.: This is why for Odin, I went for a more Pascal-style approach to types, which is a lot easier to read, parse, and comprehend. Instead of following C’s approach of “declarations match usage”, Odin’s approach is “types on the left, usage on the right”. Coupled with Odin’s very strong and orthogonal type system, things just work as expected and are easy to comprehend for mere mortals like myself. C’s implicit numeric conversions and array demotion to pointer conversion are a source of many bugs and confusions. C++ tried to fix some of this but still its type system is quite weak in many regards. Odin’s strong type system is one of its extremely huge advantages. More errors are caught earlier because of it, and it allows you to do a lot more. If you would like to see Odin’s approach to types, I recommend reading the overview for them at https://odin-lang.org/docs/overview/#distinct-types . I disagree with this conclusion here regarding integer division. I guess dimensional analysis is baked into me as I don’t usually want an expression to change the types based on the operator. If I have two integers of the same type and use an operator on them, I expect the result to be of the same type, not a new one. Strings are funny things because human languages are complicated things. With Odin I went with what I think is the best compromise: strings are internally byte slices with the assumption that they have UTF-8 encoding. I am not a fan of UTF-32 strings due to their size, and UTF-16 should die out but Windows still uses the encoding pervasively. C’s null-terminated strings made sense back in the day but not any more. Storing the length with the pointer to the data is personally my preferred approach, as this allows for substrings very easily. Along with the type in Odin, Odin does support to aid with the system with interfacing with languages like C that have null-terminates strings. and are both a blessing and a curse in C. They are an expression which allows for things like and to be done. However, in a language that does not require these “hacks”, these operators are quite useless. 
The other issue is that order of evaluation for in undefined behaviour in C, so is not a good idea to do. This is why Odin only has and . This is mainly an aesthetic thing and I personally don’t care either way. I chose , , and for Odin as I was trying to make an alternative to C. However, I could have easily gone down the keyword approach like Python with , , and . Multiple results are a godsend to use in Odin. They allow for numerous things where passing pointers were originally needed. In Odin, if a pointer is passed to a procedure now, it is usually implies an “in-out” parameter, coupled with the Odin’s calling convention allowing certain parameters to be passed by pointer implicitly to improve performance (equivalent to an implicit in C++). Multiple results allow for named results which allows for a form of auto-documentation. If you would like to see Odin’s approach to multiple results, I recommend reading the overview for them at https://odin-lang.org/docs/overview/#multiple-results . As many know, I am not a fan of exception-style error handling, which I expressed in my article on exceptions . With Odin’s multiple results, strong type system, and s & s, error handling is a lot nice to deal with. I think the notion that “null” is a billion dollar mistake is well overblown. / is just one of many invalid memory addresses, and in practice most of invalid memory address are not . This is related to the drunkard’s search principle (a drunk man looks for his lost keys at night under a lamppost because he can see in that area). I have found that null pointers are usually very easy to find and fix, especially since most platforms reserve the first page of (virtual) memory to check for these errors. In theory, is still a perfectly valid memory address it is just that we have decided on the convention that is useful for marking a pointer as unset. Many languages now have support for maybe/option types or nullable types (monads), however I am still not a huge fan of them in practice as I rarely require them in systems-level programming. I know very well this is a “controversial” opinion, but systems-level programming languages deal with memory all the time and can easily get into “unsafe” states on purpose. Restricting this can actually make things like custom allocators very difficult to implement, along with other things 2 . Odin can technically implement a maybe type through but not to the extent many people would like (especially with the syntactic sugar). I completely agree that assignment as an expression is a bad idea and causes more problems than the utility it gives. I think using possible operators in identifiers is just waiting for trouble, even if it “looks pretty”. This is mainly an aesthetic thing with huge consequences. Braces and semicolons serve purposes. Braces denote a new block and semicolons terminate a statement. If you want to remove semicolons from a language, you will either need to enforce a brace style (like Go) or have sensitive white-space through something like Python’s offside rule. For Odin, I wanted braces but I did not want to enforce a code style on the use at the language level, which means that it requires semicolons. I could have gone with Python’s offside rule but I personally prefer braces even if I enjoy programming in Python. A lot of the issues in C come from C’s poor and lacking type system. 
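Since the article's own listings are not reproduced above, here is a small hedged sketch (the types and procedures are my own invention) of two of the points discussed: Odin's "types on the left, usage on the right" declarations with distinct types, and multiple named results.

```odin
package types_and_results_example

import "core:fmt"

// Distinct types: structurally the same as f64, but not implicitly convertible.
Celsius    :: distinct f64
Fahrenheit :: distinct f64

to_fahrenheit :: proc(c: Celsius) -> Fahrenheit {
	return Fahrenheit(f64(c)*9.0/5.0 + 32.0)
}

// Multiple, named results double as lightweight documentation.
div :: proc(a, b: int) -> (quotient, remainder: int, ok: bool) {
	if b == 0 {
		return 0, 0, false
	}
	return a / b, a % b, true
}

main :: proc() {
	c: Celsius = 100
	fmt.println(to_fahrenheit(c)) // 212 (as a Fahrenheit value)

	q, r, ok := div(7, 2)
	fmt.println(q, r, ok) // 3 1 true

	// c = f64(3.0) // would not compile: distinct types require explicit conversion
}
```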
I am not against “blaming the programmer”, in theory, for a systems-level programming language because in many cases you are meant to be able to shoot your foot off if you so do wish. However, I do think providing extra structure to the type system to not require the user having to do any of this is a very good idea. A lot of C’s mistakes are only mistakes in retrospective, but C is still an extremely useful language to use, and I am still extremely productive in it. To quote Fred Brooks in his book No Silver Bullet – Essence and Accident in Software Engineering : There is no single development, in either technology or management technique, which by itself promises even one order of magnitude [tenfold] improvement within a decade in productivity, in reliability, in simplicity. And sadly, this is still true with regards to comparing C to other higher level general purpose languages, as they do not offer an order of magnitude improvement in productivity. C hit a local maximum of productive for which so many still have yet to understand why and learn from. The preprocessor is technically a separate language on top of C, and single-pass too.  ↩︎ This topic alone may require an article to itself.  ↩︎ Python (file is the library scope) Modula/Odin/Go (directory is the library scope) It is not a cheap operation and it is rarely a single instruction on the hardware There are different approaches to handling the power operation, so favouring one over the other in syntax is a little weird discriminated s Multiple cases The preprocessor is technically a separate language on top of C, and single-pass too.  ↩︎ This topic alone may require an article to itself.  ↩︎

Ginger Bill 6 years ago

A Reply to _The Road to Zig 1.0_

It is lovely to see many new programming languages being produced to solve different issues that the designers are trying to address. Many of the new big ones include Rust, Go, and Swift, all of which are trying to solve different problems. There are some not-as-big programming languages that I recommend everyone to checkout: They are all very good languages but with entirely different philosophies behind them. See if one suits your personal philosophy better! Which brings to me to the latest talk by Andrew Kelley about his programming language, Zig , titled The Road to Zig 1.0 . In this talk, Andrew presents the Zig programming language as a programming language for maintaining robust reusable software with a tour of the unique features of that Zig has. I recommend watching the talk before reading this article to make your own decision about it. The talk opens with the xkcd comic 2030 regarding the fragility and terrifying nature of software compared to engineering disciplines; with Andrew commenting that “we can do better than this”. I agree with this basic premise however, I do not agree with Andrew’s philosophical approach to aid this issue, and in general the entire philosophy behind Zig. Andrew’s two key points to improving the problem are: From experience of programming and the current programming culture, the second should be rephrased to “promote reuse of quality code” as with the rise of package managers and library repositories, code is being reused quite a lot, even if that code is not at all robust. This crappy code reuse as resulted in many of the infamous dependency hells 2 . However, my biggest gripe is the first point: write quality software. I wholeheartedly believe that writing quality software is not due to the issues of the programming language but the lack of incentives to make quality software and general culture around programming 3 . Humans are flawed beings and are incentive focused. Unless we have an incentive to do something, we are very unlikely to do it. There are many languages that are striving for robustness in the code, such as Rust and Ada, but to ensure robustness and quality, a robust and quality culture will be required to enforce it. Andrew then precedes to talk about many of the things which will prevent software from being widely used: These points are only mildly related to producing quality software and closer to concerns about performance, ease of use, safety, and compatibility. I appreciate the caveats do not fit on the slide but many of these points do seem to be a completely different issue to the what question of quality software . After this, Andrew talks about a lot of the issues in C which can cause friction and problems, and how Zig addresses them. Many are simple problems which have been solved by many languages before and after C, and many are core library issues and have nothing to do with the language itself. One of the issues that Andrew covers is conditional compilation and the compile-time system in Zig is used in conjunction with branch/if statements to produce compile-time evaluated clauses. The issue I have with this approach is that run-time evaluated if statements and compile-time evaluated if statements use the exact same syntax: A huge issue with this approach is that it is not clear that these clauses will be evaluated at compile-time nor that they will actually produce a scope or not. 
Odin solves this issue by using a separate keyword, , to denote a compile-time branching statement: This statement is clearer to read and because it is denoted different, it is clear that it has different semantics to an statement. A statement in Odin does not produce a new scope for each clause, and only the clause that is true at compile-time will be semantically checked (all clauses are syntactically checked, unlike an block in C), whilst in an statement, all clauses are semantically checked. My biggest issue with the design of the Zig language, which is demonstrated in this talk, is that “error values” are built into the language itself. In a previous article , I discuss the issues with exceptions, error handling, and error propagation. Zig’s error types have a similar syntax to exceptions but they do not share the same internals as exceptions in that they are handled as part of the return values rather than through unwinding the stack (or some similar mechanism). Andrew comments that people are lazy (which they are) and will do whatever the default way is, so why not at least get them to handle things “correctly?” and “make handling errors fun”. Firstly, error handling should not be fun—it should be painless. It’s akin to saying paying taxes should be fun. It will never be fun, but you can remove a lot of the drudge work from it. Secondly, as I previously wrote, my issue with exception-based/exception-like errors is not the syntax but how they encourage error propagation. This encouragement promotes a culture of pass the error up the stack for “someone else” to handle the error. I hate this culture and I do not want to encourage it at the language level. Handle errors there and then and don’t pass them up the stack. You make your mess; you clean it. Error propagation is fine within a library/package/module but you ought not to be encouraging propagation across library boundaries. There are so many issues I have seen in real life, which lead to crappy software , where errors are just left for “someone else” to handle. Error propagation also has the the tendency to remove information about the type of error as it is reduced to a “simpler” error type up the stack. If you are designing a language to aid in producing quality software , don’t add a core feature to the language which encourages sloppy habits. Odin’s approach to error handling is to treat errors as nothing special. Error cases are handled like any other piece of code. In conjunction with Odin’s multiple return values, enumerations, unions, and many other features, this means the user can return extra information, including error values , alongside the normal return values if wanted and handle them accordingly. One feature that both Odin and Zig share is . A statement defers a statement until the end of scope. I think this concept is much more powerful and versatile than C++’s RAII concept as it conceptually links the creation with the destruction without hiding it nor tying it to a type 6 . Zig extends the to work with the error system with , which is a nice and natural extension of this system and is coherent with the rest of the language’s design. Errors sets are an interesting concept that extend the error system in Zig. I would personally like to see this feature extended to enumerations too as it seems like a natural extension to the concept of a value set 7 . 
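For reference, a minimal sketch of the when statement described above. It is my own example; in current Odin, ODIN_OS is a built-in enum constant (earlier versions used a string), so treat the exact comparison as an assumption. Only the clause that is true at compile time is compiled, and a file-scope when block does not introduce a new scope for the declarations inside it.

```odin
package when_example

import "core:fmt"

// Compile-time branching: exactly one of these constants is defined,
// depending on the target operating system.
when ODIN_OS == .Windows {
	PATH_SEPARATOR :: "\\"
} else {
	PATH_SEPARATOR :: "/"
}

main :: proc() {
	fmt.println(PATH_SEPARATOR)
}
```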
However, it might be better to have a specialized error type than a specialization of a previous error type even if it increases code use (this is of course depends on the situation). The Zig build system is the language itself. It extends the compile-time execution system to call a “build” function to set up the build requirements. Having the same language be the build-language is very useful and a natural extension of the compile-time execution system. The compile-time execution system itself is already a huge requirement for the language which does complicate things but if that is accepted and desired, this feature is less of an issue. Update: It appears that the build system is not part of the compile-time system whatsoever. It appears to be apart of the core library rather than a feature of the language itself. This could be easily replicate this in any language however having it part of the core library itself, this means that there is a standard approach to use which nudges it towards being the default approach. In Odin, “foreign” code is handled by the system. Foreign code that is imported is associated with its dependencies, this means that only the dependencies that are used are built against. Coupled with Odin’s package system, it helps encapsulate the foreign code along with the Odin code. Odin does not directly import C header file, like Zig, but I found that in practice, you rarely want to use the raw bindings to the C code. From what I know of, how Zig is currently designed, its build system does not allow for this minimal dependency tracking that Odin offers. The Zig language is heavily designed around the LLVM ecosystem as a result, relies upon many of its features. I personally see this as vice rather than a virtue. Even though LLVM supports numerous target platforms, it does mean that any issue LLVM has, your language will have it too. Odin uses LLVM as its current backend, but there is work to replace it with a custom backend, and I have personally experienced a huge number of bugs and design flaws which cannot be avoided without removing LLVM itself. LLVM was built to optimize C and C++ and as a result, it is heavily designed around many of the “quirks” of C. LLVM is a huge dependency which I would rather not have to deal with. I will not talk about LLVM any more as it requires another entirely separate article if I was to go into it further. Andrew sets out by striving to aid in the production of quality software but does not really address why crappy software is produced in the first place. Many of the features in Zig itself encourage what I see to be poor and lazy practice, especially the error system. Zig is a well designed language but I disagree with many of the design decisions and the fundamental philosophy behind it. I hope Andrew is successful and is continually able to make a living off from Zig as this is a wonderful project. If you like the sound of it and want to find out more, please check it out at https://ziglang.org/ . Throughout this article, I have been discussing Zig and its issues. Why do I recommend Odin over Zig? Odin is not trying to solve the problem of quality software as that is fundamentally a cultural problem. The Odin programming language is fast, concise, readable, pragmatic, and open sourced. It is designed with the intent of being an alternative to C with the following goals: I wanted a language that solved many of the issues I had with programming. I wanted the specification of the language to be 100% knowable by a mere mortal. 
There are probably less than a dozen people in the world that know all of the C++ specification and understand it. Rust is starting to approach that complexity even though it is a new language. Swift already has a lot of baggage as its evolved past from Objective-C and the NeXTstep ecosystem. At the turn of 2016, I gave myself a New Year’s Resolution to start any new personal programming project in pure C to see what I really needed from a language. It turned out that I needed very little to be very productive. The friction from having to remember many of the features in C++ and other languages reduced my productivity a lot more than I realised. From using pure C for a few months, I noticed that there were features that I wanted to add to C to increase my productivity and reducing errors. These features included and tagged-unions. I started creating a metaprogramming tool to augment my C code so I could add these new features. I quickly began to realise that what I wanted was a new language as this endeavour was a dead-end. The project started one evening in late July 2016 when I was annoyed with programming in C++ (not a new project). The language began as a Pascal clone (with begin and end and more) but changed quite quickly to become something else. Odin borrows heavily from (in order of philosophy and impact): Pascal, C, Go, Oberon. Niklaus Wirth and Rob Pike have been the programming language design idols throughout this project. Simplicity was always a deriving force in the design, but as I found very early on, simplicity is complicated. The design underwent a lot of revisions and experiments in the very early stages as I did not know what was optimal for increasing my productivity. I knew the basic concepts of what I wanted but that was it. Concepts such as the package system took nearly a year to flesh-out and design as it took a while to discover what were the actual problems with certain approaches. At the end of the day, I am a pragmatic man and I do not see the need for type-theory purity if it increases friction in the language. If you like the sound of it and want to find out more, please check it out at https://www.odin-lang.org/ and https://github.com/odin-lang/Odin . For those who do not know, I am the creator of the Odin programming language .  ↩︎ See the Left-Pad Package causing severe issues across many other piece of code, https://en.wikipedia.org/wiki/Npm_(software)#Notable_breakages .  ↩︎ For a possible way to improve the culture, please see the Handmade.Network   ↩︎ I have discussed this issue before with Andrew and I really do not see this as much of an issue as he does. If you are a desktop machine and run out of memory, don’t try to recover from the panic, quit the program or even shut-down the computer. As for other machinery, plan accordingly!  ↩︎ And aiding your programming language with interfacing with foreign code and help it get more widely adopted.  ↩︎ I talk about the statement in a previous article .  ↩︎ It might be incorrect to call it a set as they can only be one of the values in the set of possible values. A union, sum-type, or something similar may be a better name for the concept.  
↩︎

Write quality software (compared to writing crappy software)
Promote code reuse (compared to preventing code reuse)

Garbage collection
Automatic heap allocation (which may cause the system to run out of memory)
If your code is slower than C, it will be rewritten in C
A lack of C ABI will mean it is not usable by most languages
Complicated build systems from source

Garbage collection (and automatic memory management in general) has its uses in certain problem domains and in those domains, it does not reduce the quality of that software. Automatic heap allocation is more of a problem with the language (and third party code) doing things behind the back of the developer which may have performance implications. I have never had a program cause a system to run out of memory in real software (other than artificial stress tests). If you are working in a low-memory environment, you should be extremely aware of its limitations and plan accordingly 4 . A piece of software doesn't need to be as fast as C to be quality software. If the performance is within a certain margin and/or the performance of your code has little overhead for the problem-at-hand (e.g. a one-off task), then the clarity of the code is much more important than its performance. This is why Python is so popular in the scientific community as a prototyping/glue language, with the actual heavy-lifting done by C/FORTRAN bindings to the highly optimized code. A lack of C ABI is an issue if you need your software to interface with other software but, as I said, this has nothing to do with quality software itself but rather compatibility with other software 5 . Complicated build systems are mostly a cultural issue where people are either told to use a certain approach or learn about "best practices".

high performance
built for modern systems
joy of programming

Ginger Bill 7 years ago

Exceptions --- And Why Odin Will Never Have Them

Article was originally posted here: https://odin.handmade.network/blogs/p/3372-exceptions_-_and_why_odin_will_never_have_them

Original Comments:
https://github.com/odin-lang/Odin/issues/256#issuecomment-418073701
https://github.com/odin-lang/Odin/issues/256#issuecomment-418289626

There will never be software exceptions in the traditional sense. I hate the entire philosophy behind the concept.

Go does have exceptions with the defer, panic, recover approach. They are weird on purpose. Odin could have something similar for exceptional cases.

You can get the exact same semantics as a try except block by using a switch in statement. The same is true in Go. The difference is that the stack does not need to be unwound and it's structured control flow.

Odin has discriminated unions, enums, bit sets, distinct type definitions, any type, and more. Odin also has multiple return values. Use the type system to your advantage.

I do hate how most languages handle "errors". Treat errors like any other piece of code. Handle errors there and then and don't pass them up the stack. You make your mess; you clean it.

To expand on what I mean by this statement: You can get the exact same semantics as a try except block by using a switch in statement. The semantics are very similar in this case however the control flow is completely different. In the exceptions case (shown with Python), you enclose a block of code and catch any exceptions that have been raised. In the return value case (shown with Odin), you test the return value explicitly from the call. Exceptions require unwinding the stack; this can be slower 1 when an exception happens compared to the fixed small cost of a return value. In both cases, a "catch all" is possible:

One "advantage" many people like with exceptions is the ability to catch any error from a block of code: I personally see this as a huge vice, rather than a virtue. From reading the code, you cannot know where the error comes from. Return values are explicit about this and you know exactly what and where has caused the error.

One of the consequences of exceptions is that errors can be raised anywhere and caught anywhere, which breeds the culture of passing the error up the stack for "someone else" to handle. I hate this culture and I do not want to encourage it at the language level. Handle errors there and then and don't pass them up the stack. You make your mess; you clean it.

Go's built-in error type has the exact same tendency of people returning errors up the stack: From what I have read, most people's complaints about the Go error handling system are about the if err != nil part, and not the return nil, err aspect. Another complaint people have is that this idiom is repeated so much that the Go team thought it necessary to add a construct to the language to reduce typing in the draft Go 2 proposal .

I hope this has cleared up a lot of the questions regarding Odin's take on error handling. I think error handling ought to be treated like any other piece of code. With many rules, there will be unexpected emergent behaviour.

P.S. If you really want "exceptions", you can panic until the cows come home.

Theoretically exceptions can be just as fast as a return statement because that's effectively what it is. However, most modern languages do a lot more than that to implement exceptions and they can be a lot slower in practice.  ↩︎
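To make the "switch in" pattern concrete, here is a small hypothetical sketch (the union, the Parse_Error/IO_Error variants, and load are my own, not from the original post) of handling an error value at the call site with Odin's discriminated unions and multiple return values.

```odin
package error_value_example

import "core:fmt"

// Errors are plain values: here, a discriminated union of per-failure structs.
Parse_Error :: struct { offset: int }
IO_Error    :: struct { code: int }

Error :: union {
	Parse_Error,
	IO_Error,
}

load :: proc(path: string) -> (data: []byte, err: Error) {
	// Hypothetical body; always fails for the sake of the example.
	return nil, Parse_Error{offset = 12}
}

main :: proc() {
	data, err := load("config.ini")
	switch e in err {
	case Parse_Error:
		fmt.println("parse error at offset", e.offset)
	case IO_Error:
		fmt.println("i/o error, code", e.code)
	case:
		// nil union value: no error occurred.
		fmt.println("loaded", len(data), "bytes")
	}
}
```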
