Getting a custom PyTorch LLM onto the Hugging Face Hub (Transformers: AutoModel, pipeline, and Trainer)
I spent some time recently getting some models uploaded onto the Hugging Face Hub. I'd trained a bunch of GPT-2 small sized base models from scratch as part of my LLM from scratch series, and wanted to share them with anyone who was interested. I managed to get it done, but it was kind of tricky to get right. The Hugging Face documentation is great if you're using the built-in models, but the coverage of custom architectures is... not quite as comprehensive. There are scattered examples, but they're all a bit vague and there's nothing really bringing them all together. But with what I could find, plus a lot of running things repeatedly, seeing how they failed, tweaking changes, banging my head against obscure stacktraces, and talking to various LLMs, I got there in the end. This post is the tutorial I wish I'd found before I started, and I hope it's useful for people in a similar position. The one warning I'd give is that I did not dig into tokenisers in any depth. My own models use the standard GPT-2 one, so I could just use the version that is built into Transformers. The setup you need for custom tokenisers doesn't look all that different to what you need to do for custom models, but as I haven't spent much time looking into it, I won't try to write a tutorial for something I've not done :-) Firstly, why would you want to upload a model you've trained to Hugging Face? Well, let's say you've written and trained your own LLM -- you're learning how they work, or you've got a brilliant idea about how to tweak transformers to get one step closer to AGI using the old gaming PC in your basement. You have some PyTorch code and a bunch of weights. How do you share them? You could, of course, just dump the code on GitHub and share the weights somewhere.
If people want to play with your model, they just need to download everything, install the dependencies, and then write code to load the weights and talk to your LLM -- run inference, fine-tune it, and so on. That's quite a big "just", though. Not everyone who is going to want to look at your model will have the relatively deep knowledge required to do all of that. Speaking for myself, I spent quite some time fine-tuning and running inference on models long before I knew how the internals worked. I was able to do this because of the easy-to-use abstraction layer in Hugging Face's Transformers library, using models that had been uploaded to their hub. What would be nice is to share the model within the Hugging Face ecosystem in a way that works smoothly. Let people run inference on it with a few lines of pipeline code, rather than something daunting like the code in my repo, with its 24 lines just to sample a few tokens from the model. Or let them train it with a bit of config and a short Trainer call, like what you see in the notebook I link to later, rather than with a >100-line training function. Here's what I had to do to get it working. To make it easier to follow along with this post, I've created a GitHub repo. As a starting point, I recommend you clone that, and then check out the starting tag. You'll see that there's a file containing my version of the GPT-2 style LLM code from Sebastian Raschka's book "Build a Large Language Model (from Scratch)". There's also a script with some code to run a model and get it to predict the 20 next words after a prompt string, and a config file for the LLM code, which tells it the number of layers, attention heads, and so on. If you want to use it and see what it comes up with, you can download the model weights from one of my training runs, and install the dependencies (recommended), or run it in a Python environment with the listed libraries installed.
You'll get something like this: ...well, your output will probably vary (for this and the later examples), as you'd expect from sampled LLM output, but it should at least be reasonably coherent. So: let's get it on Hugging Face! Our goal of being able to run inference with Transformers' pipeline system relies on a couple of deeper levels of abstraction. The pipeline requires that the model be available for download -- complete with all of its code and weights -- via AutoModelForCausalLM, which is the HF abstraction for models that generate text, loaded with trust_remote_code=True. If that flag is concerning you, it is indeed a bit scary-looking. But remember that our goal here is to share a model on HF that has its own code, and that means that anyone who downloads it will have to opt in to downloading and running that code -- the flag is how they do that opt-in. So it is, unfortunately, necessary. Now, that model will need a tokeniser in order to run. Perhaps not surprisingly, the HF system expects to be able to download that with similar code, via AutoTokenizer. With both of those working, appropriate code for our pretrained models, and a bit (well, to be fair, quite a lot) of configuration, we'll be all set. But that's quite a big jump. There is a more general class called AutoModel; it's much simpler, just wrapping a generic model that might be doing anything. If we support it, we'll still need to use all of that clunky inference code, but the model's code and weights will be on the Hugging Face Hub, and it can be downloaded and instantiated easily. So let's get that working first, just to work out the bugs and get the basic process down pat. Our goal is to be able to run AutoModel.from_pretrained in a Python environment where we just have the Transformers library and PyTorch installed, and then have a model that we can run inference on, just like the code in our repo, but without the hassle of having to download the weights ourselves. Definitely a QoL improvement, even if it's not the endgame. If you're following along with the git repo, there's a tag to check out for this section.
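Concretely, the end state we're after is a one-line load like this. The repo id here is a placeholder -- substitute your own username and model name:

```python
from transformers import AutoModel, AutoTokenizer

# Placeholder repo id -- substitute your own username/model.
repo_id = "your-username/your-custom-model"

# trust_remote_code=True is the explicit opt-in to downloading and
# running the custom model code stored alongside the weights.
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
```

The rest of this post is about making that call (and its AutoModelForCausalLM sibling) actually work.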
In this version, you'll see a new subdirectory to contain our HF wrapper code; you'll see why we need that later. In there, I've added a symlink to the model code itself (also to be explained later), an empty __init__.py file to make the directory a Python module, and two files with some Transformers code. Let's dig into what's going on in those two. The first thing to understand is the model-type part of the filenames. Transformers is designed to handle all kinds of different models -- for example, Meta's Llama models and Qwen's models have their own codebases. These widely-used public models have code that is already built in to the library, each with its own "model type" -- but we don't have that advantage. Our code is not built in to the library. So we need a distinct name for our type of model, which will let the library know that it has its own code and that it shouldn't try to rely on built-in stuff. I chose a name combining my Hugging Face username -- my initials 1 -- with the fact that this model is my implementation of the GPT-2 architecture. That feels like a solid pattern to me -- it's unlikely to clash with anything built in. But the format appears to be fairly free-form, so you can choose pretty much anything so long as you're consistent throughout your code, and so long as it doesn't clash with any of the built-ins. So, you need two files with those specific names: configuration_your-model-type.py and modeling_your-model-type.py. Let's look at them now. They're really simple at this stage; let's start with the configuration one. When Transformers is loading a model with from_pretrained, it's going to need to know how to configure it. At the very least, it will need to know what to pass into the model's __init__. If you look at the code, it's taking a config dictionary with stuff like the number of layers, the number of attention heads, and what-have-you. That's going to be required to instantiate the model with the right setup so that it can load the weights that we're providing.
There's other config stuff that will come in there later, but that's all we have for now. It does this using the same pattern as the various from_pretrained methods we were looking at earlier: all we're doing here is defining what kind of thing that method will return when it's all set up properly. You can see that we're inheriting from the PretrainedConfig class -- this provides all of the infrastructure we're going to need to push things to HF. I don't think that the name of the config class technically matters, but it definitely seems like best practice to name it based on the model name. However, the model_type class attribute is important -- it has to match the model type that we've chosen and used for our filenames. Apart from that, we're stashing away the config that we're provided on a field, and then calling our superclass __init__, forwarding on any kwargs we got in our own __init__. Now let's look at the modeling file. Just as with the config, there's a PreTrainedModel class for us to inherit from 2. We're defining the thing that AutoModel.from_pretrained will return when it's all set up properly. We tell Transformers that this should be configured with the config class that we just defined, using the config_class class variable, but apart from that, we're basically just wrapping the model that is defined in our original model code 3. That is imported using a relative import -- with a dot at the start of the module name -- rather than an absolute one. This is important -- it has to be that way, as we'll discover later. But for now: that's why we had to create the subdirectory and the symlink to the model code -- a relative import in Python can only happen if you're not in the "root" module, so we would not have been able to do that kind of import if the files were at the top of our repo. Now, let's take a look at the __init__. We're calling the superclass __init__, as you'd expect, then we're creating an underlying wrapped model. We're expecting a config parameter, which has the underlying model's configuration stashed away in a field by its own __init__, so we can pass that down to the wrapped model.
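To make the shape of those two files concrete, here's a minimal, self-contained sketch. Every name in it (GPTModel, YournameGPT2Config, and so on) is a hypothetical stand-in for your own; in the real layout the model code would be imported relatively (something like "from .your_model_file import ...") rather than defined inline:

```python
import torch
from torch import nn
from transformers import PretrainedConfig, PreTrainedModel

# Stand-in for the custom model code; in the real repo this is the
# symlinked file, imported relatively from the wrapper module.
class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.out = nn.Linear(cfg["emb_dim"], cfg["vocab_size"])

    def forward(self, x):
        return self.out(self.emb(x))

# configuration_<your-model-type>.py
class YournameGPT2Config(PretrainedConfig):
    # Must match the model type used in the filenames.
    model_type = "yourname-gpt2"

    def __init__(self, model_config=None, **kwargs):
        # Stash the dict the underlying model's __init__ expects
        # (layers, attention heads, and so on).
        self.model_config = model_config or {}
        super().__init__(**kwargs)

# modeling_<your-model-type>.py
class YournameGPT2Model(PreTrainedModel):
    config_class = YournameGPT2Config

    def __init__(self, config):
        super().__init__(config)
        # The wrapped model gets the raw config dict it was written for.
        self.model = GPTModel(config.model_config)
        self.post_init()

    def forward(self, input_ids, **kwargs):
        return self.model(input_ids)
```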
Finally, we call the special post_init method; that does some extra configuration. Prior to Transformers 5.0.0 you could get away without calling it, but now it's 100% necessary, as otherwise the model will not initialise its internal fields relating to whether or not it uses weight tying. Now let's take a look at how we actually use those classes to upload the model. That's back at the root of the repo, in the upload script. Before looking at the code, try running it. It takes a model config path -- the file we have that sets the number of layers and so on -- and the path of a safetensors file containing the weights. It will then try to upload our HF-friendly wrapped version of the model -- code, weights and config -- to the Hub. Let's see how it works. We do some boilerplate imports, and then import our config and our model classes -- importantly, via the submodule. Don't worry, we're getting close to the explanation of why that is :-) A bit of argument-validating boilerplate and the loading of the model config file into a dictionary so that we can use it, and now we get to the meat of it. What this is doing is telling our config class to register itself so that it is the thing that will be returned by the AutoConfig.from_pretrained call. This only applies locally for now, but by setting things up locally we're telling the library what it will need to push up to the hub later. Next: we're doing exactly the same for our model, saying that it should be returned from AutoModel.from_pretrained. We need to be explicit about which of the various auto classes we want to register it for -- the config class can only be loaded from AutoConfig, whereas the model might be something we'd want to have returned from AutoModel, or if it was a different kind of model, perhaps AutoModelForCausalLM, or something else entirely. What we want to do here is expose the basic model using AutoModel, so that's what we do.
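The registration calls themselves are one-liners. Here's a sketch with the same hypothetical class names, stubbed down to the bare minimum:

```python
from transformers import PretrainedConfig, PreTrainedModel

# Hypothetical stand-ins for the wrapper classes, stubbed right down.
class YournameGPT2Config(PretrainedConfig):
    model_type = "yourname-gpt2"

class YournameGPT2Model(PreTrainedModel):
    config_class = YournameGPT2Config

# Register locally; this is also what tells push_to_hub to write the
# auto_map entries so the hub copy knows which classes to use.
YournameGPT2Config.register_for_auto_class("AutoConfig")
YournameGPT2Model.register_for_auto_class("AutoModel")
```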
We're creating our config object, passing in that model configuration that we loaded from the file earlier, so that it will stash it on a field; then we create our model wrapper using that config. We now have an instance of our custom model, but with uninitialised weights. So next we load in the weights that were specified on the command line. Note that we have to load them into the wrapped model. The safetensors file we have is specifically for the custom model that we want to publish, not for the wrapper. But that's easily done by loading into the field that holds the wrapped model. Finally, the magic: push_to_hub. This is where the Transformers library really shows its strength. It will push the model, which means it needs to push the weights that we loaded into its wrapped model. Then it will look at the class that defines the model, and will push the file that has the source for that class. It will see that it also has a dependency on the config class, and will push that and its source. It will also spot the setup we did with our two register_for_auto_class calls above to register the classes for AutoConfig and AutoModel, and push that too. And when it's pushing the source, it will try to push the source of any dependencies too. This is where we get the final explanation of why we had to put everything in a submodule, and have a symlink to the model code. The code doesn't want to upload loads of extra stuff -- for example, any libraries you're using. It wants to be sure that it's only uploading your model code. The logic it uses for deciding whether or not something is part of the uploadable set of files is "was it imported relatively from the configuration file or the modeling file" -- that is, with a dot at the start of the module name rather than an absolute import. In order to do that kind of import, we needed to create a submodule. And in order to access our model code we needed a copy of it inside the submodule. I didn't want to have two actual copies of the file -- too easy to let them get out of sync -- so a symlink sorts that out. Hopefully that clears up any mystery about this slightly-strange file layout.
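Pulling those upload steps together, here's a hypothetical end-to-end sketch. The class names, file paths, and repo id are all stand-ins, the wrapper classes are re-stubbed inline for readability, and the final call assumes a logged-in HF account:

```python
import json
import torch
from torch import nn
from safetensors.torch import load_file
from transformers import PretrainedConfig, PreTrainedModel

# Hypothetical stand-ins for the wrapper classes described earlier.
class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.out = nn.Linear(cfg["emb_dim"], cfg["vocab_size"])

    def forward(self, x):
        return self.out(self.emb(x))

class YournameGPT2Config(PretrainedConfig):
    model_type = "yourname-gpt2"

    def __init__(self, model_config=None, **kwargs):
        self.model_config = model_config or {}
        super().__init__(**kwargs)

class YournameGPT2Model(PreTrainedModel):
    config_class = YournameGPT2Config

    def __init__(self, config):
        super().__init__(config)
        self.model = GPTModel(config.model_config)
        self.post_init()

# Load the raw model config and build the (still-uninitialised) wrapper.
with open("model_config.json") as f:       # illustrative path
    config = YournameGPT2Config(model_config=json.load(f))
model = YournameGPT2Model(config)

# The safetensors file matches the *wrapped* model, so load into the
# inner field, not the wrapper itself.
model.model.load_state_dict(load_file("model.safetensors"))

# The magic: pushes the weights, the config, and the source for any
# relatively-imported dependencies. Needs a logged-in HF account.
model.push_to_hub("your-username/your-custom-model")
```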
Let's give it a go and see what it creates! In order to upload a model to the HF Hub, you'll need an account, of course, so create one if you don't have one. Next, create an access token with write access -- the option is in the "Access Tokens" section of the "Settings". Then you need to authorize your local machine to access the hub using that token; if you already have the HF CLI available, you can just run its login command; if not, you'll need to download and install the HF CLI first and then do the same. That will store stuff on your machine so that you don't need to log in again in the future -- if you're concerned about security, there's a logout command you can call, and you can completely trash the session by deleting the associated token from the HF website. Now, let's run our upload script! You'll need to change the target HF model name at the end of the command to one with your username before the slash, of course. Once you've done that, take a look at the model on Hugging Face. You'll see a rather ugly default model card, but let's ignore that for now and take a look at the "Files and versions" tab. You should see a handful of files, including a config.json. Let's look into that config.json. The architectures bit is just showing the name of the class that was used in the push_to_hub call. This will become useful later when we get onto the pipeline code, but doesn't matter right now -- the next one is more important. The auto_map is essentially saying: if someone does AutoModel.from_pretrained on this model, then use the model class from here, and likewise AutoConfig.from_pretrained should use the config class from here. It's what that register_for_auto_class stuff we did in the upload script set up. Then there are the parameters that we're threading down to our underlying custom class; nothing exciting there. The torch_dtype is, of course, the floating point type we're using for the model, and the model_type is our unique name for this particular architecture. And the transformers_version is the version of the library used to upload it, presumably used to determine compatibility when downloading models with earlier or later versions.
So, it looks like there's enough information across those files on the hub to instantiate and use our model! Let's give that a go. The best way to check it out thoroughly is to create a completely fresh directory, away from our existing ones, and a fresh environment, and then to try to use the model. We can see where Transformers has put the downloaded code: inside a submodule that appears to have a GUID-like name. Now let's try to run some inference on it. So there we go! We've gone from a situation where we would have to publish the code and the safetensors in some way and tell people how to combine them, to a neatly-packaged model that we can download, fully set up, with just one line. But that inference loop is still a pig; if you've been working with LLM code then it's not too bad -- a basic bit of autoregression with top-k and temperature -- but it's definitely holding us back. What next? One obvious issue with the code above is that we still have that dependency on tiktoken. If we're going to run inference using the simple HF AutoModel object, it's going to need to know how to encode the input and decode the outputs. And if you have your own tokeniser (which, if you have a truly custom model, you probably do) then you won't have the luxury of being able to just install it into the target runtime env -- you would still need to copy files around. Now, as I said at the start, I'm not going to go into this in as much detail, because my use case was really simple -- although I was using tiktoken, the specific tokeniser I was using from that library was the standard GPT-2 one. Transformers has its own version of that built in. So here I'll explain how you do things for models that use a built-in Transformers tokeniser. After that I'll give some pointers that you might find useful if you're using something more custom. The good news if you're using a "standard" tokeniser that is already built into the Transformers library is that you can tell your model to use it.
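Before moving on: for reference, that "basic bit of autoregression with top-k and temperature" boils down to something like the sketch below. It's not the repo's exact helper -- the function and parameter names are my own -- and it works on any callable that returns per-position logits:

```python
import torch

def generate(model, input_ids, max_new_tokens=20, temperature=0.8, top_k=40):
    """Basic autoregressive sampling with temperature and top-k."""
    ids = input_ids.clone()
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids)[:, -1, :]      # logits for the last position
        logits = logits / temperature
        if top_k is not None:
            # Mask everything outside the k highest-scoring tokens.
            kth = torch.topk(logits, top_k).values[:, -1, None]
            logits = torch.where(
                logits < kth, torch.full_like(logits, float("-inf")), logits
            )
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
    return ids
```

It's exactly this kind of boilerplate that the pipeline abstraction will let us delete.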
The downside is that you can't do it using the trick that we did above -- that is, you can't just import the tokeniser class and add a registration call below our previous calls to register the model and config as auto classes. That will essentially do nothing. However, tokenisers do have their own push_to_hub method, and the target that you specify can be your model's repo. So, for my own models, I get the tokeniser for the built-in GPT-2 implementation (specifically the "fast" one, written in Rust), set the padding token to the end-of-sequence one for tidiness (not sure why that's not the case by default), and then push it to the model's repo. If you're following along with the code, there's a tag to check out to see that. The code goes immediately after we've pushed the model itself to the hub. So, run the upload again, and now we can try a completely fresh env without tiktoken. In there, we can see that AutoTokenizer.from_pretrained works. (Note that I had to make a small tweak here -- it appears to be new in Transformers 5.0.0.) And we can do our inference test. It may not be much shorter than the code we had when we just had the AutoModel, but it's an important step forward: we can now download and run inference on our custom model with none of the custom code -- neither the model itself nor the tokeniser -- on the machine where we're doing it. Everything is nicely packaged on the HF Hub. Now, what if you're using a tokeniser that's not already in Transformers? There are two possibilities here, which I list in the notes at the end of this post. As I said, I have not done either of these, but that's the direction I'd explore if I needed it. If you do either and want to share your experiences, then please do leave a comment below! And likewise, if and when I start writing things with custom tokenisers, I'll link to the details of how to upload them then. Anyway, we've got the tokeniser done to the level we need for this walkthrough, so let's do the QoL improvements so that we can run inference on the model using the nice HF pipeline abstraction.
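The tokeniser-pushing step described above comes down to a few lines; the repo id here is a placeholder, and the push requires a logged-in HF account:

```python
from transformers import GPT2TokenizerFast

# Reuse the built-in GPT-2 tokeniser (the fast, Rust-backed one).
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Tidiness: give it a padding token, since GPT-2 ships without one.
tokenizer.pad_token = tokenizer.eos_token

# Push it into the *model's* repo, alongside the weights and code.
tokenizer.push_to_hub("your-username/your-custom-model")
```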
Let's look at our target code for inference again. The version of the code that does this is in the repo under the relevant tag, but I'll explain how it was put in place, with the logic behind each step. In order to run a text-generation pipeline, we're going to need to wrap our model in something that provides the interface for LLMs in the Hugging Face ecosystem: AutoModelForCausalLM. So, our first step is to put the plumbing in place so that we can use the from_pretrained method on that class to download our wrapped model. IMO it's cleanest to have two separate models: one for "simple" inference that is just a regular model -- the one we have right now -- and one supporting the richer interface for easy text generation. So we can start off by adding the basic structure to the modeling file. We can then add code to register that in our upload script -- one extra line, just below the two registrations that already exist. That feels like it should be enough, but for reasons I've not been able to pin down, it's not -- you also need to massage the "auto-map" in the config object to make it all work properly. So after that code, after we've created the config object, we need a small extra tweak. With that in place, we could just upload our model -- AutoModelForCausalLM.from_pretrained would work just fine. But the model that it would return would not be any different to the one we've been using so far. To get generation to work, we need to update the model to say that it can generate text. That's actually pretty easy. Firstly, we need it to inherit from a mixin class provided by Transformers: GenerationMixin. Now, the semantics of the forward method on this class are a bit different to the ones we had previously; we were just returning the outputs of the last layer of the underlying model, the logits. For this kind of model, we need to put them in a wrapper object -- the reasoning behind this will become clearer when we get on to training. So our forward pass needs to change accordingly. Finally, some changes to our config class. For text generation, Transformers needs to know how many hidden layers the model has 4.
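Putting the text-generation changes together -- the mixin, the output wrapper, and the config tweaks that the next paragraphs walk through -- here's a self-contained sketch. Every name is a hypothetical stand-in (ToyInner takes the place of the wrapped custom model, and the n_layers parameter name is an assumption):

```python
import torch
from torch import nn
from transformers import GenerationMixin, PretrainedConfig, PreTrainedModel
from transformers.modeling_outputs import CausalLMOutput

class ToyInner(nn.Module):
    """Minimal stand-in for the wrapped custom model."""
    def __init__(self, vocab_size=50, emb_dim=8):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.out = nn.Linear(emb_dim, vocab_size)

    def forward(self, ids):
        return self.out(self.emb(ids))

class ToyConfig(PretrainedConfig):
    model_type = "yourname-gpt2"

    def __init__(self, n_layers=1, **kwargs):
        # Expose the layer count under the name generation code expects.
        self.num_hidden_layers = n_layers
        super().__init__(**kwargs)
        # No KV cache in the custom model: make generation pass the
        # full sequence on every step, not just the latest token.
        self.use_cache = False

class ToyForCausalLM(PreTrainedModel, GenerationMixin):
    config_class = ToyConfig

    def __init__(self, config):
        super().__init__(config)
        self.model = ToyInner()
        self.post_init()

    def forward(self, input_ids, attention_mask=None, **kwargs):
        logits = self.model(input_ids)
        # The generation machinery expects a structured output object,
        # not a bare logits tensor.
        return CausalLMOutput(logits=logits)
```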
In the case of the model I'm using to demonstrate, the layer count is a parameter in the underlying configuration, so we can expose it as num_hidden_layers inside the config class's __init__. Another change in the config that took me a while to puzzle out, and might catch you if you're in the same situation: Transformers, by default, assumes that the model caches previous inputs. So in an autoregressive loop, the first run of the model will get the full input, and it returns a predicted token; the next iteration of the loop, however, won't be passed the full new sequence, but rather just the token that was generated last time around. So you'll get a series of predicted tokens where the first one might make sense but the rest degenerate into gibberish: every token generated after the first had just the single previous token as its context. Luckily, you just need to specify that your model doesn't have a cache -- in the config class as well, after the call to the superclass __init__, set use_cache to False. We're almost there! At this point, we actually have all of the code that we need for a working pipeline. But there's one final tweak. A model on the hub has a "default" model class, which is the one that we used when we did the original push_to_hub. You might remember that it appeared in the config.json in that single-element architectures list. Previously our default was the basic AutoModel-style class, so when the pipeline creates a model for us, it will just use that default -- even for the text-generation task, it doesn't assume we want the causal-LM version. Luckily, that's a small change: we just upload our text-generation model instead of the basic one. With all of that in place, we can run the script, upload the model, and then try it in a fresh environment. Lovely! Now let's get it training. For this section, check out the next tag. You'll see a new file, which has the training loop from the notebook I linked to at the start of this post. It will train the model on this dataset, which is essentially a bunch of chatbot-style transcripts in the Llama 2 format.
Its goal is to help fine-tune a base model to become an instruction-following one, though of course the model I'm using here is too tiny for that to work well! It's still a useful way of checking that training works, though. To save time, it only does one training epoch, which should be enough to get the loss down a bit. If you run it against one of my other models, you can see it working (you will need to tweak the batch size if you have less than 24 GiB of VRAM). You can see that it's at least trying to answer the question after training, even if its answer is completely wrong -- pretty much what you'd expect from the tiny model in question (163M parameters trained on about 3B tokens). In order to get it working with our custom models, we just need to return the loss as well as the logits from the forward method of our text-generation class. You can see that we're getting the targets for our predictions in a labels argument, plus an attention mask; we have to shift them ourselves (that is, the label for each position is the next token in the input), and also apply the attention mask manually, and then we can do the normal PyTorch cross-entropy calculation. This makes some kind of sense. The model on HF does need to package its own loss function somehow -- cross-entropy is, of course, going to be the most likely option for a causal LM, but there's no guarantee. And while I think that personally I would have just had forward return logits and packaged up the loss calculation elsewhere so as not to muddy the interface, I can see the convenience of having it there. Anyway, having done that, we can upload the model one final time, and then use that training code to run it. We have a working training loop! Once again, it's replying, even if it has no idea what the answer is, and starts looping in typical small-model fashion. And with that, we're done. We've gone from having a custom model that was hard for other people to discover and work with, to something that plays well with the Hugging Face ecosystem.
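The shift-mask-cross-entropy calculation described above can be sketched in isolation like this. It's plain PyTorch, using the standard -100 ignore index; the helper name is mine, not the repo's:

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits, labels, attention_mask):
    """Shifted cross-entropy, as a forward() given labels would compute it.

    logits: (batch, seq, vocab); labels: (batch, seq), usually a copy of
    the input ids; attention_mask: (batch, seq), 1 for real tokens.
    """
    # The prediction at position i is scored against the token at i+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:].clone()
    # Apply the attention mask: cross_entropy skips targets set to -100.
    shift_labels[attention_mask[:, 1:] == 0] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```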
The final step is to write a decent model card so that people know what to do with it -- that, of course, depends very much on your model. I was uploading a bunch of very similar models in one go, so I wound up writing a Jinja2 template and using the model card class to upload the result, but that's just simple plumbing code -- you can see it here if you're interested. As I said at the start, this isn't a full tutorial -- it's just the code I needed to upload my own models, so it doesn't cover tokenisers that aren't already baked in to Transformers -- and there are probably other gaps too. But hopefully it's useful as-is. If you find gaps that your model needs and work out how to solve them, then please do leave comments here -- if there are useful resources out there, either things I missed or things you've written, I'd be happy to link to them from this post. Thanks for reading! I'll be returning to my normal "LLM from scratch" series shortly...

It's a fun coincidence that my initials are so similar to the architecture. Someday I should do something with my domain... ↩

I'm not sure why the capitalisation of the "t" is different -- PretrainedConfig vs PreTrainedModel -- but it seems very deliberate in the Transformers codebase, at least as of version 4.57.6. Some kind of backward-compatibility cruft, I assume. 5.0.0 provides a PreTrainedConfig alias as well, so it looks like they're making things consistent in the future. ↩

You might reasonably suggest that we could inherit from the underlying model class rather than wrapping it. I've chosen to wrap it because I generally prefer composition to inheritance -- the code generally works out nicer, to my mind. I'd suggest starting this way and then refactoring to use inheritance later if you prefer. ↩

No idea why, but it does ¯\_(ツ)_/¯ ↩

.gitattributes -- a file telling git (which is used to manage the models on the hub) which file types should use the Large File Storage plugin. Big binary files don't play nicely with git, so it uses LFS for them. We don't need to pay much more attention to that for our purposes.
README.md -- that ugly model card. Updating that is useful, but out of scope for this post.

config.json -- we'll come back to that one in a moment.

The configuration file -- a copy of the file we created locally with our config class.

The modeling file -- again, the same file as the local one, uploaded due to that clever dependency-finding stuff.

model.safetensors -- our weights. There should be an icon next to it to say that it's stored using the LFS system.

The model code file -- once more, a file that was just copied up from our local filesystem.

The two possibilities for custom tokenisers:

You're using the HF tokenizers library. With that, you can save your tokeniser to a JSON file, then you could load that into a PreTrainedTokenizerFast object, which provides a push_to_hub method to push it like I did with the GPT-2 one above.

You've got something completely custom. Just like there is a configuration file and a modeling file, I believe you can also add a tokenisation file that defines a subclass of PreTrainedTokenizer, and then you can push that to the Hub just like we did our model wrapper class.

What we ended up with:

Working AutoConfig, AutoModel, AutoModelForCausalLM, and AutoTokenizer helpers. A working text-generation pipeline. Support for HF's Trainer abstraction for follow-on training and fine-tuning.