I have rephrased the quotes for readability and enriched some of the content with additional information.
Start Simple
- Do not start with fine-tuning; start with prompt engineering and retrieval-augmented generation (RAG)
- Use OpenAI, Claude, etc.
- “Vibe-checks” are okay in the beginning
- Write simple tests and assertions (see the sketch after this list)
- Ship fast
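To make "simple tests and assertions" concrete, here is a minimal sketch; the `generate_reply` helper is a hypothetical stand-in for whatever model or API you actually call, and the checks are only examples:

```python
# Minimal sketch of assertion-style checks on model outputs.
# `generate_reply` is a hypothetical placeholder for your model call.
import json

def generate_reply(prompt: str) -> str:
    # Placeholder: swap in a call to your model provider.
    return '{"status": "shipped"}'

def test_reply_is_valid_json():
    reply = generate_reply("Return the user's order status as JSON.")
    data = json.loads(reply)      # fails loudly if the output is not JSON
    assert "status" in data       # the field we always expect

def test_reply_is_concise():
    reply = generate_reply("Write a refund confirmation message.")
    assert "[insert" not in reply.lower()   # catch unfilled template slots
    assert len(reply.split()) < 200          # keep answers short

test_reply_is_valid_json()
test_reply_is_concise()
```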
Evaluations are Central
When to Fine-Tune
If you have a prompt that resembles a programming language with numerous conditional statements, it’s an indication that fine-tuning might be beneficial.
Additionally, when the inputs are diverse and full of edge cases, it is challenging to cover all possibilities with just a few examples. There are, however, strategies to enhance few-shot prompting: for instance, you could use RAG over a database of examples so that the few-shot examples become dynamic. This approach can sometimes be effective.
These are some indicators to consider when evaluating the problem itself.
Hamel Husain
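To illustrate the kind of prompt Hamel describes, here is a hypothetical example of a prompt that has effectively turned into a program; none of the rules below come from the course:

```python
# A hypothetical support-routing prompt that has grown into pseudo-code.
# When a prompt accumulates this many branches, fine-tuning on labelled
# examples is often easier than maintaining the rules by hand.
PROMPT = """You are a support assistant.
If the user mentions a refund AND the order is older than 30 days, refuse politely.
If the user mentions a refund AND the order is newer than 30 days, ask for the order ID.
If the message is in French, answer in French unless it is a legal question.
If the user is on the enterprise plan, never mention pricing.
If none of the above apply, answer normally.
User message: {message}
"""

print(PROMPT.format(message="Je voudrais un remboursement."))
```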
Data Privacy
Fine-tuning allows you to use private data to customize a model without exposing sensitive information to third-party services.
Quality vs Latency Trade-off
While using and/or generating fewer tokens can reduce latency, a fine-tuned model may become so specialized that its ability to perform other tasks is limited.
Extremely Narrow Problem
For very specialized tasks, general models may not perform adequately. Fine-tuning can help to address these issues.
When Prompt Engineering Becomes Impractical
For complex tasks that require detailed prompts with many conditions, prompt engineering can become inefficient. In such cases, fine-tuning can be an effective solution.
Fine-Tuning vs RAG
One does not replace the other. They are two different things.
They can be complementary, though. You could use a fine-tuned model to generate better answers with your RAG system.
Preparing your Data
Get as much data as possible (prompt/output pairs). Ultimately, it will often be a matter of time and cost.
Consistent Template
It is paramount to clean your data and to format it with a single, consistent template.
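As a concrete illustration, here is a minimal sketch of applying one fixed template to every prompt/output pair before training; the template layout and the example rows are assumptions, not recommendations from the course:

```python
# Apply one fixed template to every prompt/output pair so the model
# sees the same structure during fine-tuning and at inference time.
TEMPLATE = "### Instruction:\n{prompt}\n\n### Response:\n{output}"

pairs = [
    {"prompt": "Summarize: the meeting is moved to Friday.", "output": "Meeting moved to Friday."},
    {"prompt": "Translate to French: good morning", "output": "Bonjour"},
]

training_texts = [TEMPLATE.format(**pair) for pair in pairs]
for text in training_texts:
    print(text, end="\n\n")
```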
Synthetic Data
Use the most powerful model to generate synthetic data. To avoid any legal issues, Mistral Large seems quite permissive in its terms and conditions.
Hamel Husain
But you do you. OpenAI, Anthropic, Google, and the others all used stolen data to train their models while lobbying against having their own intellectual property made open source.
This was not in the course, but it might be relevant to the intellectual-property question: A short course on OSS Licensing for Research and Education.
You probably know what you are doing.
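As a minimal sketch of the synthetic-data idea, here is one way to generate candidate prompt/output pairs with a strong model. This assumes the `openai` Python client purely as an example; any provider's chat API would work, and the model name, topics, and prompts are placeholders:

```python
# Generate synthetic training examples with a powerful model.
# Assumes the `openai` client and an API key in the environment;
# the model name and prompts are placeholders, not recommendations.
from openai import OpenAI

client = OpenAI()

seed_topics = ["refund request", "delivery delay", "password reset"]
synthetic_pairs = []

for topic in seed_topics:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Write one realistic customer email and an ideal support reply."},
            {"role": "user", "content": f"Topic: {topic}. Return the email, then the reply."},
        ],
    )
    synthetic_pairs.append(response.choices[0].message.content)

print(synthetic_pairs[0])
```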
Base Models vs Instruct Models
An instruction-tuned model is a base model that has been further fine-tuned to follow instructions and respond in a chat-like manner.
Hamel Husain
Which one to choose will depend on the problem you are trying to solve with fine-tuning. Basically, if you’re not building a chatbot, you might not need an instruct model.
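To see the difference in practice: an instruct model expects its chat template wrapped around the text, while a base model simply continues raw text. This sketch uses the Hugging Face transformers API; the checkpoint name is only an example:

```python
# Instruct models expect a chat template; base models just continue raw text.
# The model name is only an example of an instruct checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

messages = [{"role": "user", "content": "Summarize this email in one sentence."}]
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(formatted)  # the special tokens the instruct model was trained to see
```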
Which Size to Choose
Choosing the right model size is crucial as it affects both performance and operational costs.
Try the 7B range models first, but you should develop an intuition with experience. Is the task something a small model can do? Does it require more “reasoning”?
A larger model will cost more and be harder to host. I only fine-tune when it’s a very narrow problem, and where I think it’s going to fit in a small model.
Hamel Husain
Manage User Expectations
- The “Ask me anything” problem
- 9 out of 10 people will ask you to build a chatbot: don't do it
- Be skeptical of general chatbots
- Figure out how to scope the problem better
The “Ask Me Anything” approach should be avoided because it sets unrealistic expectations for users.
Maintaining such a broad scope is not only challenging but also impractical, as it requires the chatbot to understand and process a wide range of inputs. Instead, it is better to define a specific scope right from the start.
Examples of why you should not build general chatbots (see References):
- A DPD error caused its chatbot to swear at a customer
- Air Canada had to honor a refund policy its chatbot made up
- A Belgian man died by suicide following exchanges with a chatbot
Establish Standards for Ideal Prompt/Output Pairs
Don’t mix “okay” answers with “great” ones in your dataset. You should have a clear distinction between them.
It is hard to write great answers, but we are good at evaluating which ones we prefer. There are many algorithms to implement those preferences. Currently, the one that seems to work best is Direct Preference Optimization (DPO).
Remark: on the 23rd of May, 2024, a paper was published on a new algorithm reported to be both simpler and more effective than DPO: Simple Preference Optimization (SimPO). This is very new at the time of writing (25th of May, 2024), so we will have to wait and see how it really performs.
What is DPO?
Basically, instead of linking a single output to a prompt, you will create a dataset with a prompt and a preference between two outputs: a chosen response and a rejected response.
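As a minimal sketch of what such a preference dataset can look like, here is the prompt/chosen/rejected layout used by libraries such as Hugging Face TRL; the example row is invented:

```python
# A preference dataset row pairs one prompt with a preferred ("chosen")
# and a dispreferred ("rejected") response. The row below is invented.
import json

preference_data = [
    {
        "prompt": "Customer asks why their parcel is late.",
        "chosen": "I'm sorry for the delay. Your parcel left our warehouse yesterday and should arrive within 2 days.",
        "rejected": "Parcels are sometimes late. Please wait.",
    },
]

with open("dpo_train.jsonl", "w") as f:
    for row in preference_data:
        f.write(json.dumps(row) + "\n")
```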
See more: Direct Preference Optimization (Rafailov et al., 2023) and the practical TRL example, both listed in the References.
Practical Application and Comparison of DPO
The use case was to automate responses to incoming emails. Here are the results of the blinded comparisons (ordered from best to worst):
- DPO
- Human Agent
- Supervised Fine-Tuning on Mistral
- GPT-4
DPO consistently produced “super human” responses in comparison to the other methods.
References
- Your AI Product Needs Evals
- A short course on OSS Licensing for Research and Education
- DPD error caused chatbot to swear at customer
- Air Canada has to honor a refund policy its chatbot made up
- Belgian man dies by suicide following exchanges with chatbot
- SimPO: Simple Preference Optimization with a Reference-Free Reward (Meng et al., 2024)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023)
- A practical example of DPO with the TRL library from HuggingFace