Small is Beautiful: Building Tiny LLMs
There are two not-so-related discourses about the notion of small things in the world of AI right now. The first is a phenomenon in the startup world: tech startups are gaining valuation with smaller teams. I just saw a post this morning about Lovable and other AI startups gaining traction with sub-20-person teams. While the productivity boost from AI can be a significant factor, this phenomenon speaks more to how capital and attention operate in the startup economy. It has more to do with the tech world's imagination of what is possible with AI and institutions' embrace of productivity and the lean startup (think of the Shopify CEO's recent memo about hiring, and DOGE too).
The other discourse, much more subdued, is the age-old mentality of self-sufficiency, privacy, and the DIY spirit. Developers will find this mentality intuitive: building things from scratch, rewriting established tools, running things on bare metal, and adopting niche tools for configurability and privacy reasons. This latter discourse brings me to the topic I want to write about today: tiny LLMs.
In the past week, I joined a hackathon to implement tiny LLMs based on a paper by Ronen Eldan and Yuanzhi Li.1 The idea of the paper sounded attractive: build small LLMs (under 100M parameters) with synthetic (AI-generated) data to draft children's stories that use a vocabulary of only about 1,500 words. It is a proof of concept of how small a model can get while still generating coherent sentences. The models are first trained for text completion and then fine-tuned on an instruction-following dataset. The framework of this paper makes a perfect exercise for pretraining and fine-tuning an LLM. The authors released their datasets on Hugging Face, which were used for this hackathon. In the rest of this blog post, I present a few findings and observations about designing and training small LLMs.
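For reference, the authors' data can be pulled straight from the Hub with the `datasets` library. A minimal sketch, assuming the `roneneldan/TinyStories` dataset ID and its default `train` split:

```python
from datasets import load_dataset

# Dataset ID assumed to be the authors' public release on Hugging Face.
dataset = load_dataset("roneneldan/TinyStories")
print(dataset["train"][0]["text"][:200])  # peek at the opening of the first story
```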
Architecture of small LLMs
While Eldan and Li's study uses GPT-style architectures for their models, many recent examples adopt the LLaMA architecture. SmolLM (135M, 360M, 1.7B) is a family of open-source models developed by the Hugging Face team.2 These models received lots of coverage last year. They adopted a LLaMA-style architecture informed by MobileLLM, a study from Meta presented at ICML 2024.
The MobileLLM paper is a study of the architectural choices behind small LLMs. The team developed sub-billion-parameter models based on the LLaMA architecture.3 They demonstrated that their models achieve comparable performance to LLaMA-2 7B on certain tasks with far fewer parameters, and can run on an iPhone 13. The key architectural decisions the paper highlights are prioritizing depth over width, grouped-query attention (GQA), embedding sharing, and block-wise weight sharing. Intuitively, depth versus width is a trade-off between how far each feature is developed and how many features the model captures. With width, the model captures more features from the training samples; with depth, it sculpts each feature further. For simple datasets, depth over width makes sense: there are fewer features to begin with, so more resources should go toward developing them. The last three strategies are all about keeping the model small. Luckily, all of these features, except block-wise weight sharing, can be configured out of the box with the LLaMA module in the Hugging Face Transformers library.
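To make this concrete, here is a rough sketch of what a deep-and-narrow, GQA-enabled, embedding-shared LLaMA configuration looks like in Transformers. The dimensions are illustrative, not the MobileLLM recipe:

```python
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32000,
    hidden_size=512,             # narrow width...
    intermediate_size=1536,
    num_hidden_layers=24,        # ...but relatively deep
    num_attention_heads=8,
    num_key_value_heads=2,       # grouped-query attention: fewer KV heads than query heads
    max_position_embeddings=512,
    tie_word_embeddings=True,    # share input and output embeddings
)
model = LlamaForCausalLM(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")
```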
The Training Process
Training an LLM from the ground up is not primarily a technical challenge; the bottlenecks are compute and data. Mature packages such as Hugging Face and fastai abstract away much of the complexity of composing an LLM from scratch. When I read the source code for MobileLLM, I was surprised to find that they also used Hugging Face Transformers to scaffold their model. In the past two years, so many tutorials have poured out on the subject that it is not hard to write a basic transformer model from the ground up. Compute and high-quality datasets are another matter. Renting a 40 GB GPU on Runpod costs $0.80 per hour, and even for sub-billion-parameter models the price quickly adds up during training. The other constraint is the available training data. For most smaller models, open-source datasets will suffice. TinyStories, the dataset developed by Eldan and Li, consists of about 2M rows, which amounts to around five hundred million tokens depending on the tokenization method. Since LLMs are general-purpose models and pretraining datasets are aimed at developing basic language abilities, models can reuse similar datasets for pretraining. Creating a dataset from scratch, however, would be a major undertaking: think about scraping millions of webpages or generating that much content with LLMs.
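The token estimate depends on the tokenizer, but it is easy to check on a sample of the data. A rough sketch, using the GPT-2 tokenizer as a stand-in for whichever tokenizer your model uses:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in; use your model's tokenizer
sample = load_dataset("roneneldan/TinyStories", split="train[:10000]")

tokens_in_sample = sum(len(tokenizer(row["text"])["input_ids"]) for row in sample)
avg_tokens = tokens_in_sample / len(sample)
print(f"~{avg_tokens:.0f} tokens per story, "
      f"~{avg_tokens * 2_000_000 / 1e6:.0f}M tokens across 2M rows")
```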
Once you start playing with LLMs, you realize how much carbon footprint your project generates, without even having to calculate the numbers. I can run any web development task on my laptop, which does not have a GPU. But that same laptop cannot even run inference with these small models.
An important lesson I learned in the process is to monitor GPU performance. My training scripts are quite basic, mostly leveraging the Transformers package from Hugging Face. The biggest challenge I had was managing and monitoring GPU usage and handling out-of-memory (OOM) errors. On the one hand, you want to max out the rented GPU to reduce training time; on the other, you do not want an OOM error to wipe out your training progress. Some good ground rules to start with:
monitor GPU memory and utilization with tools like `nvitop`
estimate the length of your batches, making sure you have enough GPU memory for the longest batch (see the sketch after this list)
monitor your GPU's power consumption, which is the ground truth of how hard the GPU is actually working
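For the memory headroom check, a minimal in-script sketch (assuming a single CUDA device; `nvitop` shows the same numbers interactively):

```python
import torch

if torch.cuda.is_available():
    total = torch.cuda.get_device_properties(0).total_memory
    peak = torch.cuda.max_memory_allocated(0)  # high-water mark since the process started
    print(f"peak {peak / 1e9:.1f} GB of {total / 1e9:.1f} GB ({peak / total:.0%} used)")
```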
A significant proportion of my time was spent figuring out the appropriate batching and packing methods. The context length of my model is 512, and most samples in the dataset come to around 200 tokens, which means I can theoretically pack two rows into one window. During the first run, I packed multiple rows into one context. However, this ends up cutting off and mixing stories: the model sometimes introduces new characters in the middle of a plot, and other stories end with a broken sentence. To deal with the problem, I had two options. The first was to retokenize the dataset with a correct packing method, meaning no truncation, plus additional masks to signal to the model that the stories are separate. Since this masking is an addition to the training defaults, I would also need to tell the model how to process it, which turned out to be more complicated than I expected. Due to time constraints, I opted for the second approach: retokenize the dataset with one sample per context window and continue training the model on 20% of the correctly formatted dataset.
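For what it's worth, the second approach amounts to a few lines with the tokenizer. A sketch, assuming a `text` column and a 512-token context window:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for your model's tokenizer
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default

def tokenize_one_per_window(batch):
    # One story per context window: pad short stories, truncate the rare long ones.
    return tokenizer(batch["text"], max_length=512, truncation=True, padding="max_length")

# tokenized = dataset.map(tokenize_one_per_window, batched=True, remove_columns=["text"])
```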
After the second training session, I think the results got better, but I had no way to verify it. Here lies another lesson: develop an evaluation system to measure training results. The default loss scores are helpful only for monitoring the training process; they are inadequate for measuring the quality of the output. In my case, I would specifically want to measure truncation at the end of a generation, as well as the coherence of the generated stories.
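A full evaluation system is beyond a hackathon's scope, but even a crude heuristic helps. A sketch of a truncation check, counting generations that do not end on sentence-final punctuation:

```python
def ends_cleanly(text: str) -> bool:
    # Treat a generation as complete if it ends on sentence-final punctuation.
    return text.rstrip().endswith((".", "!", "?", '"'))

generations = ["Once upon a time, the cat slept.", "Then the dog ran to the"]
truncation_rate = sum(not ends_cleanly(g) for g in generations) / len(generations)
print(f"truncated endings: {truncation_rate:.0%}")
```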
The space of pretraining LLMs
Pretraining LLMs is increasingly monopolized by a few companies. This is hardly surprising. As mentioned above, the resources it takes to train production-ready LLMs are enormous. If you browse the catalog of available LLMs, the smallest models major vendors offer are on the order of billions of parameters. Training a model usually requires a dataset orders of magnitude larger than the model itself: the SmolLM-135M model, for example, was trained on 600B tokens.
Besides the resource constraint, there is a theoretical question of how training a small LLM from the ground up compares with distillation or quantization, two techniques for compressing large models. Quantization uses lower-precision numbers to reduce a model's memory requirement; going from 32-bit floats to 4-bit integers, for example, cuts memory to roughly 1/8 of the full-precision model, so a 1B-parameter model quantized to 4 bits occupies about the same memory as a 125M-parameter model in full precision. The advantage of quantization is that it is almost free: you can use a library to quantize a model on the fly. In practice, quantization is a no-brainer. In theory, quantized models rely heavily on the representations already learned by larger models, which may constrain their performance. And for embedded systems, even quantized billion-parameter models are still too large, so there is always a need for genuinely smaller models.
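As an example of how little effort on-the-fly quantization takes, here is a sketch using Transformers with the bitsandbytes backend (requires a CUDA GPU; the model ID is just an example):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True)  # 4-bit weights instead of 16/32-bit
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM-1.7B",     # example model ID; swap in the model you want to shrink
    quantization_config=bnb_config,
)
print(f"memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```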
In a conversation with a friend, we talked about training AI for specialized tasks, like transforming data into JSON. This goal makes intuitive sense: since small LLMs are trained on limited data, training them to perform only one task sounds like a reasonable compromise. This has been the dominant approach for training ML models, and LLMs are not so different. Before undergoing supervised fine-tuning, causal language models like GPTs can only perform text completion; if you feed them only one type of data, they will only be able to predict that type of text in a meaningful manner. A recently hyped example of a specialized model is BloombergGPT, which was trained to perform financial analysis with proprietary data from Bloomberg. However, it was shown that BloombergGPT underperforms on financial NLP tasks against general models such as GPT-4 and ChatGPT (GPT-3.5), which admittedly have more parameters than BloombergGPT.4 In practice, it is almost always preferable to fine-tune a general-purpose model for specific tasks, because of the superior quality of off-the-shelf models and the low cost of fine-tuning.
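A sketch of what that low-cost fine-tuning can look like with LoRA adapters via the `peft` library; the model ID, target modules, and hyperparameters here are illustrative:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M")  # example base model
lora = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only a small fraction of weights are trained
```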
Conclusion
Training a model from the ground up was a fun experience. I went through a few training iterations and ended up spending a lot of time wrestling with batching and packing. The process confirmed the famous mantra in ML: garbage in, garbage out. Even with the same dataset, there are many choices to make about how to pack the context window, batch the data, and add padding. So always be extra careful with what you feed the model.
Model training is a resource-intensive endeavor, which is why most companies prefer to use off-the-shelf models. It is still unclear when it is worth training a smaller model from scratch and when it is better to shrink larger models using techniques like quantization. The companies that are still pretraining models, however, focus mainly on larger, billion-parameter models, which will never be able to serve smaller IoT devices. Notably, the iPhone now hosts a 3B model on the device; but at this point, phones are more like computers than IoT devices, given their price point. Most IoT devices run on slow chips that serve limited functionality, so putting LLMs on these devices will always require specialized treatment that consumer-grade LLMs are not designed for, and this is where training small LLMs can be useful.
Eldan, R., & Li, Y. (2023, May 12). TinyStories: How small can language models be and still speak coherent English? arXiv.org. https://arxiv.org/abs/2305.07759
SmolLM - blazingly fast and remarkably powerful. (2024, July 16). https://huggingface.co/blog/smollm
Liu, Z., Zhao, C., Iandola, F., Lai, C., Tian, Y., Fedorov, I., Xiong, Y., Chang, E., Shi, Y., Krishnamoorthi, R., Lai, L., & Chandra, V. (2024, February 22). MobileLLM: Optimizing Sub-billion parameter language Models for On-Device use Cases. arXiv.org. https://arxiv.org/abs/2402.14905
Li, X., Chan, S., Zhu, X., Pei, Y., Ma, Z., Liu, X., & Shah, S. (2023, May 10). Are ChatGPT and GPT-4 General-Purpose solvers for financial text analytics? A study on several typical tasks. arXiv.org. https://arxiv.org/abs/2305.05862