There is undoubtedly a lot of hype around Large Language Models. We are following developments closely while gathering knowledge and experience in the field. These powerful models have demonstrated immense capabilities across a wide range of use cases, so our customers are curious about the new possibilities and eager to use popular large-scale models like ChatGPT in their projects. To our clients' surprise, that is not always the best choice.


In a world where bigger is often perceived as better, perhaps it’s time to challenge this preconception – at least when it comes to Large Language Models. In this article, we’ll delve into scenarios in which opting for a more modestly sized LLM might prove to be the wiser and more pragmatic approach.


Large language models (LLMs) are characterized by a very large number of parameters, often reaching billions or even trillions. As the parameter count grows, these models tend to deliver greater accuracy and higher-quality outputs in tasks like translation, text generation, and question answering. Consider GPT-3.5, developed by OpenAI, a powerful language model with 175 billion parameters. As the GPT series expands, GPT-4 is said to be based on eight models with 220 billion parameters each, which gives a total of about 1.76 trillion parameters, making it roughly 10 times larger than GPT-3.5. However, as LLMs grow, they bring along a set of challenges that must be acknowledged and considered.





The first challenge is cost, which depends on many factors. Broadly, LLMs can be divided into commercial and open-source models. For commercial models, the cost is usually charged per call, based on the number of tokens used. Even if the unit cost is relatively small, for example around $0.002 per 1,000 tokens for gpt-3.5-turbo, the total grows rapidly if you want to call the model a million times a day.
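To make the arithmetic concrete, here is a minimal sketch of a per-token cost estimate. The price of $0.002 per 1,000 tokens comes from the text above; the figure of 500 tokens per call (prompt plus response) is purely an illustrative assumption, and the function name is ours.

```python
def daily_api_cost(calls_per_day: int, tokens_per_call: int, usd_per_1k_tokens: float) -> float:
    """Estimate the daily bill for a pay-per-token model."""
    total_tokens = calls_per_day * tokens_per_call
    return total_tokens / 1000 * usd_per_1k_tokens


# Assumption: 500 tokens per call; real usage depends on prompt and response length.
cost = daily_api_cost(calls_per_day=1_000_000, tokens_per_call=500, usd_per_1k_tokens=0.002)
print(f"${cost:,.2f} per day")  # $1,000.00 per day
```

Under these assumptions, a fraction of a cent per call adds up to about $1,000 a day, or roughly $30,000 a month.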


Open-source models, on the other hand, have no direct cost per request and are generally free to use. Their expenses come from the infrastructure needed to run them. As a simplification, GPU memory requirements grow linearly with the number of model parameters: storing 1B parameters in GPU memory for inference takes about 4 GB at 32-bit float precision. The table below lists the cost of running some open-source models on the NC A100 v4 series.


Model name      Size             Cluster            GPU        Cost
LLaMA-2-7B      7B parameters    NC24ads A100 v4    1x A100    $3.67/hour
Dolly-v2-12b    12B parameters   NC24ads A100 v4    1x A100    $3.67/hour
LLaMA-2-70B     70B parameters   NC48ads A100 v4    2x A100    $7.35/hour
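The linear memory rule can be sketched in a few lines. The 4 bytes per parameter at 32-bit precision is the figure given above; the 2-byte and 1-byte entries for half precision and 8-bit quantization are added for comparison. The estimate covers model weights only and ignores activations and caches, so treat it as a lower bound rather than an exact requirement.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weights_memory_gb(params_billions: float, precision: str = "fp32") -> float:
    """Approximate GPU memory (GB) for the model weights alone."""
    # 1e9 parameters * N bytes each = N GB, so the billions cancel out.
    return params_billions * BYTES_PER_PARAM[precision]


for name, size_b in [("LLaMA-2-7B", 7), ("Dolly-v2-12b", 12), ("LLaMA-2-70B", 70)]:
    fp32 = weights_memory_gb(size_b, "fp32")
    fp16 = weights_memory_gb(size_b, "fp16")
    print(f"{name}: {fp32:.0f} GB at fp32, {fp16:.0f} GB at fp16")
```

By this rule of thumb, a 70B model needs about 280 GB at 32-bit precision but about 140 GB at 16-bit, which suggests that the larger models in the table are served at reduced precision in practice. That reading is our inference, not something the pricing table states.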


Smaller LLMs offer a more efficient alternative, allowing inference and training on less powerful hardware. Sometimes it is even possible to self-host such a model on a private machine instead of a dedicated compute server, as long as the machine meets the model's minimum system requirements. In the end, the number of requests, or usage volume, is a critical factor in determining the real cost for a given use case.
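One way to see why usage volume matters is to compute the request volume at which an always-on GPU instance costs the same as a pay-per-token API. The sketch below uses the $3.67/hour and $0.002 per 1,000 tokens figures quoted earlier; the 500 tokens per call is an illustrative assumption, and the comparison deliberately ignores throughput limits, engineering effort, and differences in output quality.

```python
def break_even_calls_per_day(hourly_instance_usd: float,
                             tokens_per_call: int,
                             usd_per_1k_tokens: float) -> float:
    """Calls per day at which an always-on instance matches the pay-per-token bill."""
    daily_hosting_usd = hourly_instance_usd * 24
    api_cost_per_call_usd = tokens_per_call / 1000 * usd_per_1k_tokens
    return daily_hosting_usd / api_cost_per_call_usd


# Assumption: 500 tokens per call, one instance running around the clock.
calls = break_even_calls_per_day(hourly_instance_usd=3.67,
                                 tokens_per_call=500,
                                 usd_per_1k_tokens=0.002)
print(f"Break-even at about {calls:,.0f} calls per day")
```

With these inputs the break-even point is roughly 88,000 calls per day. Below that volume the per-token API is cheaper; above it, the self-hosted model starts to pay off, provided a single instance can actually serve that load.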


Resource use also has an environmental side: smaller models need less compute and therefore leave a smaller carbon footprint.



Use case


Although pre-trained LLMs can provide valuable insights and generate text in many domains, they may lack the domain-specific knowledge required for certain specialized tasks. In data science projects, where the focus is on addressing specific business needs, knowing the difference between butter and margarine or the causes of the French Revolution is rarely relevant. Information from diverse areas such as cuisine or history can be interesting, but it may not matter to business clients seeking solutions tailored to their specific tasks. Not every project requires the vast knowledge and generative abilities of billion-parameter LLMs.


If your data science project involves highly technical or specialized content, using a pre-trained LLM alone may produce inaccurate or incomplete results. In such cases, incorporating domain-specific models or knowledge bases may be necessary. Smaller models can be tailored to specific use cases more effectively: they are cheaper for data scientists to fine-tune for a particular task, which can result in better performance and efficiency on that task.



Response time


Massive models can introduce delays in processing due to their size and complexity. Generally, smaller language models provide responses faster than larger models. This is because smaller models have fewer parameters and require less computational power to generate responses. They can process and generate text more quickly, making them a preferred choice for applications where low latency is important. Let’s see the difference between the OpenAI models mentioned earlier. One experiment comparing their response times gave the following results:

  • GPT-3.5: 35ms per generated token,
  • GPT-4: 94ms per generated token.
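Per-token latency translates directly into waiting time for a full response. The sketch below uses the two measurements above; the 200-token response length is an illustrative assumption, and the estimate covers generation time only, not network or queuing delays.

```python
# Per-token generation latency in milliseconds, taken from the figures above.
MS_PER_TOKEN = {"GPT-3.5": 35, "GPT-4": 94}


def generation_time_s(model: str, output_tokens: int) -> float:
    """Approximate time in seconds to generate a response of the given length."""
    return MS_PER_TOKEN[model] * output_tokens / 1000


# Assumption: a 200-token response.
for model in MS_PER_TOKEN:
    print(f"{model}: {generation_time_s(model, 200):.1f} s")
# GPT-3.5: 7.0 s
# GPT-4: 18.8 s
```

For an interactive application, the gap between about 7 and about 19 seconds per answer can matter more than a modest difference in output quality.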


The trade-off between response speed and response quality needs to be carefully considered when choosing a model for a specific application. The choice of model size should align with the specific requirements and constraints of the project.


With all that said, we hope to have broadened your perspective on language models and shown that a larger model is not always the better one. When considering an LLM for your data science project, it’s essential to evaluate the specific requirements of your task and weigh them against the potential drawbacks of using a massive model. Smaller LLMs have disadvantages and limitations of their own, but they offer practical advantages in computational efficiency, cost-effectiveness, environmental sustainability, and tailored performance.