I fine-tuned Flan-T5. Can it cook?
Teaching a language model to cook
Generative AI is all I see when I open Twitter, so I wanted to try these new models for myself. I love to cook, but I hate sifting through recipes on Google and scrolling a dozen times to find the actual ingredients and instructions. I decided to learn how to train a large language model (LLM) to generate a recipe for any dish I could dream up so that I would never have to search for recipes again. If you want to play with the final model, visit Le Chef.
Choosing the dataset and model architecture
On Kaggle, I found RecipeNLG, a dataset that contains over 2.2 million recipes from a range of cuisines and dish types.
For my LLM, I chose the T5 architecture because it performs well on a variety of NLP tasks. Of the various pre-trained T5 variants, the 220M-parameter Flan-T5 version provides good performance without requiring too much GPU memory for training. To get my footing, I used Blueprint to experiment with the model and begin the fine-tuning process.
Preparing the dataset
Because T5 is an encoder-decoder model, it reads an input sequence with the encoder and uses the decoder to produce the output sequence. I decided to use the name of the dish as the input sequence and the combined ingredients and directions as the output sequence. This lets the model generate the ingredients and instructions for a dish given only its name.
To preprocess the data, I applied some basic NLP techniques such as stripping extra whitespace and removing duplicates. I also only selected data from the “Gathered” source because it had a more consistent format than the other sources in the dataset. Finally, I merged the directions and ingredients to get a standardized output format:
Ingredients:
- ingredient 1
- ingredient 2
- ingredient 3
Directions:
- direction 1
- direction 2
- direction 3
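The preprocessing and merging steps above can be sketched roughly as follows. This is a minimal illustration, not the code I actually ran: the field names (`title`, `ingredients`, `directions`, `source`) and the use of newlines between items are assumptions about how the rows are laid out.

```python
import re

def clean(text):
    """Collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

def format_target(ingredients, directions):
    """Merge ingredients and directions into one standardized output string."""
    lines = ["Ingredients:"]
    lines += [f"- {clean(item)}" for item in ingredients]
    lines.append("Directions:")
    lines += [f"- {clean(step)}" for step in directions]
    return "\n".join(lines)

def build_examples(rows):
    """Keep 'Gathered' rows, drop duplicates, return (input, target) pairs."""
    seen = set()
    examples = []
    for row in rows:
        # Assumed field name: 'source' marks where the recipe came from.
        if row["source"] != "Gathered":
            continue
        title = clean(row["title"])
        target = format_target(row["ingredients"], row["directions"])
        key = (title.lower(), target)
        if key in seen:
            continue  # exact duplicate after cleaning
        seen.add(key)
        examples.append((title, target))
    return examples
```

Each pair is then (dish name, formatted recipe), which matches the input and output sequences described earlier.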
To run training, I split the dataset into train, validation, and test sets with a ratio of 70/20/10. This allowed me to use the training set to train the model, the validation set to check that the model was generalizing well, and the test set to evaluate the final performance of the model.
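A 70/20/10 split like this one can be done with a seeded shuffle. The sketch below is one possible way to do it; the function name and the seed value are illustrative choices, not details from my actual run.

```python
import random

def split_dataset(examples, percents=(70, 20, 10), seed=42):
    """Shuffle once, then cut into train / validation / test by percentage."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_train = len(items) * percents[0] // 100
    n_val = len(items) * percents[1] // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

Integer percentages are used to avoid floating-point rounding when computing the cut points.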
Fine-tuning the model
Fine-tuning any model is a trial and error process. I had to adjust several hyperparameters, such as the batch size, learning rate, number of epochs, and maximum target and source lengths, to improve the model's performance, in addition to choosing which layers of the model to freeze during training.
To find the best combination of these hyperparameters, I trained multiple models with different settings on a 40GB NVIDIA A100 GPU hosted on Blueprint and evaluated each one on the validation set. Although this GPU seems like overkill for a small 220M-parameter model, the extra headroom made it easy to experiment with hyperparameters. Blueprint was also ideal for my use case because it already provides the fine-tuning code, so I could focus on my data and hyperparameters.
I started by training a basic model with the following hyperparameters:
model_name: "t5-base"
max_target_length / max_source_length: 64 / 64
batch_size: 8
learning_rate: 5e-6
scheduler: linear
num_epochs: 1
unfreeze_layers: [lm_head]
This model performed poorly, with an average perplexity of over 500 on the validation set. Perplexity measures how well a model predicts the next token in a sequence: it is the exponential of the average negative log-likelihood per token, so lower is better and 1 is the floor. In the papers I have read, reported values usually fall somewhere between 1 and 1000.
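For intuition, perplexity can be computed from per-token log-probabilities as in the sketch below. It assumes natural-log probabilities of the correct tokens are already available; the function name is illustrative.

```python
import math

def perplexity(token_log_probs):
    """Exponential of the mean negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

A model that assigns probability 0.5 to every correct token has a perplexity of about 2, while a model that is always certain and correct has a perplexity of 1.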
Prompt: Chocolate Chip Cookies
(everything below is model output)
- Bake at 350° for 25 minutes.
- Add remaining ingredients and bake at 350° for 20 minutes.
- Bake at 350° for 30 minutes.
- Bake at 350° for 35 minutes.
After investigating, I found that the maximum target length of 64 was cutting off a large chunk of each output in the dataset. This was causing the model to perform poorly, so I increased the maximum target and source lengths to 128 and 256, respectively.
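A quick way to spot this kind of truncation is to measure what fraction of examples exceed a given length limit. The sketch below uses whitespace splitting as a stand-in tokenizer, which is an assumption: the real T5 tokenizer works on subword pieces, so actual token counts would generally be higher.

```python
def truncation_rate(texts, max_length, tokenize=str.split):
    """Fraction of texts whose token count exceeds max_length."""
    if not texts:
        return 0.0
    too_long = sum(1 for text in texts if len(tokenize(text)) > max_length)
    return too_long / len(texts)
```

Passing the model's own tokenizer as `tokenize` would give a more faithful number than the whitespace default.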
I also changed the setup to fine-tune all of the layers instead of just the last one. I’d originally assumed that you could fine-tune only the last layer and have it work (similar to how transfer learning works with image classification models), but this doesn't appear to be the case with LLMs. I haven’t found a good explanation for this, so I'm curious whether anyone knows why (feel free to reach out to @aqaderb on Twitter).
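To make the `unfreeze_layers` setting concrete, here is a rough sketch of how such a list could be turned into a set of trainable parameters. This is not Blueprint's implementation: the helper and the prefix-matching rule are hypothetical, and the parameter names used with it are only illustrative. In PyTorch, the equivalent step would be setting `requires_grad = False` on every parameter that is not selected.

```python
def trainable_names(param_names, unfreeze_layers):
    """Return the parameter names that should stay trainable.

    Hypothetical rule: 'all' unfreezes everything; otherwise a parameter is
    trainable when its name starts with one of the listed prefixes.
    """
    if "all" in unfreeze_layers:
        return list(param_names)
    prefixes = tuple(unfreeze_layers)
    return [name for name in param_names if name.startswith(prefixes)]
```

With `["lm_head"]`, only the output projection is updated; with `["all"]`, every parameter is.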
model_name: "t5-base"
max_target_length / max_source_length: 128 / 256
batch_size: 8
learning_rate: 5e-6
scheduler: linear
num_epochs: 1
unfreeze_layers: [all]
With the new maximum lengths and all layers unfrozen, the model performed significantly better. Its average perplexity was 112, a vast improvement over the previous model. However, with the default generation settings, the directions were still incomplete and a little unclear.
Prompt: Chocolate Chip Cookies
Ingredients
- 2 1/4 c. flour
- 1 tsp. baking soda
- 1/2 tsp. salt
- 3/4 c. shortening
- 3/4 c. brown sugar
- 3/4 c. white sugar
- 2 eggs
- 1 tsp. vanilla
- 1 (12 oz.) pkg. semi-sweet chocolate chips
Directions
- Preheat oven to 375°.
- Sift together flour, baking soda and salt.
- Set aside.
I wanted to see whether a larger batch size would further improve the model, so I retrained the previous configuration with a batch size of 16.
model_name: "t5-base"
max_target_length / max_source_length: 128 / 256
batch_size: 16
learning_rate: 5e-6
scheduler: linear
num_epochs: 1
unfreeze_layers: [all]
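One side effect of doubling the batch size is that a single epoch takes roughly half as many optimizer steps, which also changes how quickly a linear schedule decays. The sketch below illustrates this under two assumptions of mine: the schedule decays linearly to zero with no warmup, and the example count used with it is hypothetical rather than the real dataset size.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Number of optimizer steps needed to see every example once."""
    return math.ceil(num_examples / batch_size)

def linear_lr(step, total_steps, base_lr=5e-6):
    """Learning rate that decays linearly from base_lr to 0 (no warmup)."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps
```

For a hypothetical 1,000 training examples, a batch size of 8 gives 125 steps per epoch and a batch size of 16 gives 63.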
The average perplexity of this model came down to ~105. After trying some sample inputs, I was pretty happy with the model. For example, when given the input "chocolate chip cookies," the model generated the following output:
Prompt: Chocolate Chip Cookies
Ingredients
- 1 c. shortening
- 1 c. brown sugar
- 1 c. white sugar
- 2 eggs
- 1 tsp. vanilla
- 2 1/2 c. flour
- 1 tsp. baking soda
- 1 tsp. Salt
- 1 (12 oz.) pkg. chocolate chips
Directions
- Cream shortening and sugars.
- Add eggs and vanilla.
- Mix well.
- Sift together flour, soda and salt.
- Add to creamed mixture.
- Stir in chocolate chips.
- Drop by teaspoon onto greased cookie sheet.
- Bake at 350° for 10 to 12 minutes.
Throughout the process of fine-tuning the model (which took about two days of iterating through multiple models), I learned a few key things that are worth noting:
1) It’s critical to carefully select the maximum target and source length. If these values are too small, the model will not produce output that accurately reflects the input. On the other hand, if they are too large, the model will require a lot of GPU memory to train, which may be an issue if you do not have a sufficiently powerful GPU.
2) It’s important to use the validation and test sets to ensure that the model is generalizing well. If the model performs well on the training set but poorly on the validation set, it's a sign that the model is overfitting and will not perform well on new data. In such cases, it's important to adjust the hyperparameters or add regularization to prevent overfitting. I ran into this when I started out: the training loss would get very low while the validation loss stayed far above it (I omitted these iterations because I terminated them soon after seeing this).
3) Having good metrics really makes the iterative process easier. When changing hyperparameters across model runs, the metrics are your guiding light, helping you better understand how your changes are affecting the model.
4) Fine-tuning a model can be very experimental and requires careful adjustment of the various hyperparameters. There's no one-size-fits-all solution, and it's important to try out different combinations of hyperparameters to see which one works best for your particular dataset and use case.
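The overfitting check from the second point can be automated with a simple rule on the logged losses. The sketch below is one common approach (patience-based early stopping plus a train/validation gap), not something taken from my training runs; the function names and the default patience are illustrative.

```python
def generalization_gap(train_loss, val_loss):
    """How far the validation loss sits above the training loss."""
    return val_loss - train_loss

def should_stop(val_losses, patience=2):
    """True if validation loss has not improved in the last `patience` evals."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge yet
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before
```

A large and growing gap, or a validation loss that stops improving while the training loss keeps falling, is the signal to terminate a run early.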
In the future, there are a few improvements I would make:
1) Find a larger, more diverse dataset. With a larger dataset, it may also make sense to try a larger T5 model, like the Large or XL variants.
2) Try other ways to preprocess the data and improve the input/output sequences. For example, passing the cuisine of a dish in the input may help the model choose ingredients, since a given cuisine tends to rely on a consistent base set of ingredients.
3) Add time estimation to recipes. It’d be interesting to see whether this model could also accurately estimate the time it takes to produce a dish, or even take in a set of ingredients and produce a dish that uses them. I’d include the time estimate in the formatted output and train the model to see how accurately it could predict it.
Quick thanks to Baseten, where I work. As an employee, I had early access to Blueprint where I used their fine-tuning APIs and serverless GPUs, which made this process much faster. You can sign up for the waitlist at: blueprint.baseten.co