Introduction to Quantisation
Quantisation is the process of reducing the precision of numbers used in a model. This makes models smaller, faster, and more efficient to run, often with only a small drop in accuracy. For large language models, quantisation is especially important because of their size and hardware demands.
What is Quantisation?
Quantisation involves storing a model's weights (and sometimes its activations) in low-precision formats such as 8-bit integers instead of 16- or 32-bit floats. Because each parameter then takes fewer bytes, the model occupies less memory and needs less memory bandwidth at inference time, usually at the cost of a small amount of accuracy. For a large language model with billions of parameters, this difference can determine whether the model fits on a given GPU or device at all.
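As a concrete, library-free sketch of the idea (the function names here are illustrative, not taken from any particular toolkit), the following Python snippet maps a float32 weight tensor onto int8 values using a scale and zero-point, then maps them back to show how little information is lost:

```python
import numpy as np

def quantise_int8(weights: np.ndarray):
    """Affine quantisation: map a float tensor onto the 256 int8 values."""
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = max((w_max - w_min) / 255.0, 1e-8)   # avoid division by zero
    zero_point = round(-w_min / scale) - 128     # int8 value that represents 0.0
    q = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantise(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover an approximation of the original floats."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantise_int8(weights)
error = np.abs(weights - dequantise(q, scale, zp)).max()
print(f"worst-case rounding error: {error:.6f}")
```

The scale and zero-point are stored alongside the int8 values, so the original floats can be approximated wherever the hardware needs them.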
Quantisation Methods
There are two main approaches to quantisation: Quantisation Aware Training (QAT) and Post-Training Quantisation (PTQ).
Quantisation Aware Training (QAT)
QAT involves training (or fine-tuning) the model with quantisation in mind: the forward pass simulates the reduced precision, so the model learns weights that remain accurate once they are actually quantised. This typically results in minimal loss of accuracy, at the cost of extra training time.
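A minimal sketch of the mechanism, assuming PyTorch (frameworks ship full QAT workflows; this is only meant to show the idea): weights are passed through a "fake quantisation" step that rounds them in the forward pass but lets gradients flow through unchanged, a trick known as the straight-through estimator.

```python
import torch
import torch.nn as nn

def fake_quantise(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Round weights to a low-precision grid in the forward pass while
    letting gradients flow through unchanged (straight-through estimator)."""
    qmax = 2 ** (num_bits - 1) - 1                  # 127 for int8
    scale = w.detach().abs().max() / qmax           # symmetric per-tensor scale
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()                   # value is w_q, gradient is identity

class QATLinear(nn.Module):
    """Linear layer whose forward pass always sees quantised weights,
    so training adapts the weights to the reduced precision."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ fake_quantise(self.weight).t() + self.bias
```

After training, the weights are quantised for real, and because the forward pass already saw the rounding during training, the remaining accuracy drop is usually very small.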
Post-Training Quantisation (PTQ)
PTQ, on the other hand, applies quantisation to a model after it has already been trained, typically by measuring the range of the weights (and, for activation quantisation, running a small calibration dataset) and mapping those ranges to lower precision. This method is faster and more straightforward because no retraining is needed, but it may result in a slightly larger drop in accuracy than QAT.
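As a sketch of that workflow, assuming PyTorch and a simple symmetric per-tensor scheme (production PTQ tools also calibrate activation ranges and use finer-grained, e.g. per-channel, scales), the snippet below converts the weights of an already-trained model offline:

```python
import torch
import torch.nn as nn

def ptq_int8(model: nn.Module) -> dict:
    """Post-training quantisation: convert each trained weight tensor to
    int8 plus one float scale per tensor. No retraining is involved."""
    packed = {}
    for name, tensor in model.state_dict().items():
        if not name.endswith("weight"):
            continue                                  # biases usually stay in float
        scale = tensor.abs().max().item() / 127.0     # symmetric, per-tensor
        q = torch.clamp(torch.round(tensor / scale), -128, 127).to(torch.int8)
        packed[name] = (q, scale)
    return packed

# Stand-in for a model that has already been trained.
model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 512))
packed = ptq_int8(model)

# At inference time the int8 values are dequantised on the fly (or used
# directly by int8 matrix-multiply kernels on supporting hardware).
q, scale = packed["0.weight"]
w_approx = q.float() * scale
```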
Choosing the Best Approach
The choice between QAT and PTQ depends on the specific use case and requirements of the project. If accuracy is the top priority and you can afford the extra training time (and have access to the training pipeline and data), QAT might be the better choice. However, if you need to quantise an existing model quickly and cheaply, PTQ is usually more practical.
Implications for Deployment
Quantisation has significant implications for the deployment of large language models. By reducing the size and hardware demands of these models, quantisation makes them more accessible and easier to deploy on a variety of devices, from smartphones to servers.
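To make the hardware impact concrete, the back-of-the-envelope calculation below (parameter count times bytes per parameter, ignoring activations, the KV cache, and runtime overhead, and using a hypothetical 7-billion-parameter model) shows how the weight storage alone shrinks at lower precisions:

```python
PARAMS = 7_000_000_000  # hypothetical 7B-parameter model

for fmt, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    gib = PARAMS * bits / 8 / 1024**3
    print(f"{fmt:>8}: ~{gib:.1f} GiB of weights")

# float32: ~26.1 GiB   float16: ~13.0 GiB   int8: ~6.5 GiB   int4: ~3.3 GiB
```

At 8-bit or 4-bit precision, a model that would not fit in the memory of a single consumer GPU or a phone in float16 can become deployable on that hardware.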
Conclusion
Quantisation is a crucial technique for making large language models more efficient and accessible. Understanding the different quantisation methods, such as QAT and PTQ, and their trade-offs is essential for choosing the best approach for a specific project. By applying quantisation techniques, developers can create models that are faster, smaller, and more efficient, with only a minimal loss of accuracy.
FAQs
What is the main purpose of quantisation in large language models?
The main purpose of quantisation is to reduce the precision of numbers used in a model, making it smaller, faster, and more efficient to run.
What are the two main approaches to quantisation?
The two main approaches are Quantisation Aware Training (QAT) and Post-Training Quantisation (PTQ).
Which approach is better, QAT or PTQ?
The choice between QAT and PTQ depends on the project's specific requirements. QAT is generally better for preserving accuracy, while PTQ is faster and simpler to apply because it requires no retraining.
How does quantisation affect the deployment of large language models?
Quantisation makes large language models more accessible and easier to deploy on a variety of devices by reducing their size and hardware demands.