Introduction
Have you ever wondered how Google text or image search works, or how recommendation systems work under the hood? They use embedding models. Embedding models convert text into vector embeddings, and these vector embeddings represent the semantic similarity between two things, whether those are images or text. If you want to learn about vector embeddings in more detail, I have already written a blog about it here.
There are many embedding models available in the market, but what if you run an ecommerce business and want to implement your own recommendation system based on the similarity of your products? Then you have to fine-tune or train your own embedding model. Sounds hard, right 👀? Actually, it’s not. The sentence transformers library makes it easy to train or fine-tune your own embedding model.
In this blog, we will first learn about the sentence transformer library and then we will train our model from scratch.
💡 You can get the full source code discussed in this blog from our github repo
What are embedding models?
Embedding models are language models that convert a given text or other media into vector embeddings. These embeddings live in a vector space where you can perform different operations on them to get the desired results. For example, you can perform a semantic search to find results similar to a given sentence, or compute a similarity score that shows how similar two sentences are.
This is what the vector embeddings look like:
As you can see in the diagram, similar sentences sit closer to each other in the vector space, which makes it easy to find similar sentences for a given sentence. Also, feel free to check out my other blog in which I explain vector embeddings in more detail.
What are sentence transformers?
Sentence Transformers is a library built specifically for creating and fine-tuning embedding models for sentences. You can use it to generate embeddings for your sentences, get the similarity score between two or more sentences, or run a semantic search over your sentences. You can also easily fine-tune an existing embedding model for specific tasks or train your own model from scratch.
Let’s see some of the features of sentence transformers in action!
Converting text to embeddings
Let’s first try converting a given text into embeddings. You might have used OpenAI embeddings to get the embeddings of a given text, but that costs money; as an alternative, you can use models from Hugging Face or any other open source embedding model with sentence transformers to generate vector embeddings for your sentences.
Let’s see how we can generate embeddings using the all-MiniLM-L6-v2 model.
First, install the sentence transformer library using pip
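Something like this will do:

```
pip install -U sentence-transformers
```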
Now let’s import our model. You can also use another model of your choice.
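For example (any Sentence Transformers compatible model name works here):

```python
from sentence_transformers import SentenceTransformer

# Load a pretrained embedding model from the Hugging Face Hub
model = SentenceTransformer("all-MiniLM-L6-v2")
```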
Now we will define the sentences for which we want to generate embeddings in a list, and then use the “encode” method of our model to generate the embeddings.
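A minimal sketch; the exact sentences here are just placeholders:

```python
# Placeholder sentences to embed
sentences = [
    "This framework generates embeddings for each input sentence",
    "Sentences are passed as a list of strings.",
    "The quick brown fox jumps over the lazy dog.",
]

# encode() returns one embedding per sentence
embeddings = model.encode(sentences)

for sentence, embedding in zip(sentences, embeddings):
    print(sentence)
    print(embedding.shape)  # (384,) for all-MiniLM-L6-v2
```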
After running the above code, we can see the 384-dimensional embeddings generated for each sentence.
Cosine Similarity between Sentences
You can use cosine similarity to find out how similar two sentences are. Sentence transformers lets us compute the cosine similarity score between two sentences, so let’s see it in action!
First, we will import the required modules and convert our sentences into embeddings using the same model we used before
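Roughly like this (the two sentences here are just examples):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode the two sentences we want to compare
embedding_1 = model.encode("I love reading science fiction novels")
embedding_2 = model.encode("Sci-fi books are my favourite thing to read")
```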
Now we can find the cosine similarity between these two embeddings using the “util.cos_sim” method.
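Continuing from the snippet above:

```python
from sentence_transformers import util

# cos_sim returns a tensor of pairwise cosine similarity scores
similarity = util.cos_sim(embedding_1, embedding_2)
print(similarity)
```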
After running the above code, you will see a similarity score that shows how similar these two sentences are (the closer it is to 1, the more similar the sentences; the closer to 0, the less similar).
Semantic Search
We mentioned Google search and Google image search before, and both are essentially built on semantic search. In semantic search, you have a query (it can be a sentence or an image), you convert that query into embeddings, and then you find the sentence embeddings most similar to the query embedding by computing cosine similarity.
Once we have the similarity scores for the different sentences, we sort the sentences by score in descending order, so the most similar sentence (the one with the highest similarity score) is at the top, and we can specify the number of similar sentences we want as “k”.
Let’s see it in action!
First we will define the existing sentences, which act as our corpus, meaning that we want to find the top k most similar sentences from this list. We will convert these sentences into embeddings so that we can compute cosine similarity against them.
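Here is an illustrative corpus (yours will look different):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Our "database" of sentences to search over
corpus = [
    "A man is eating food.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]

# convert_to_tensor=True gives us PyTorch tensors we can score directly
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
```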
Now we will define our queries, and for each query we will find the top 3 most similar sentences from the corpus.
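Something along these lines, with a couple of example queries:

```python
import torch

queries = [
    "A man is eating pasta.",
    "Someone in a gorilla costume is playing a set of drums.",
]

top_k = 3
for query in queries:
    query_embedding = model.encode(query, convert_to_tensor=True)

    # Cosine similarity between the query and every corpus sentence
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print(f"\nQuery: {query}")
    for score, idx in zip(top_results.values, top_results.indices):
        print(f"{corpus[idx]} (score: {score:.4f})")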
Instead of using “util.cos_sim” and then picking the top k results yourself, you can use the “util.semantic_search” method, which does the same thing in one call.
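A sketch of the same search using util.semantic_search, continuing from the snippets above:

```python
# util.semantic_search does the cosine similarity + top-k ranking in one call
query_embeddings = model.encode(queries, convert_to_tensor=True)
hits = util.semantic_search(query_embeddings, corpus_embeddings, top_k=3)

# hits[i] is a list of {"corpus_id": ..., "score": ...} dicts for queries[i]
for query, query_hits in zip(queries, hits):
    print(f"\nQuery: {query}")
    for hit in query_hits:
        print(f"{corpus[hit['corpus_id']]} (score: {hit['score']:.4f})")
```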
Now that we know how to use sentence transformers, let’s take a look at how we can train or fine-tune our own model.
💡 You can get the full source code discussed above from our github repo
Before we train: Preparing the Model
Before we dive into training, let’s first prepare our model to fine-tune it:
Selecting a Model
You can use any “sentence similarity” model from Hugging Face, or any other open source model, with sentence transformers. Once you pick a model, you can load it just like we loaded one above.
We will use the “bert-base-uncased” model as our base model and limit the transformer layer to a maximum sequence length of 256; texts longer than that will be truncated.
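Roughly like this:

```python
from sentence_transformers import models

# bert-base-uncased as the transformer (word embedding) layer;
# inputs longer than 256 tokens will be truncated
word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256)
```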
Check out the model list on the sentence transformers website to select the model of your choice.
Pooling
BERT produces contextualized word embeddings for all input tokens in our text, so the size of the output varies with the input (and from model to model). If you want a fixed-size representation for the whole sentence, you need to add a pooling layer.
Here we want 768-dimensional sentence embeddings, so we pass that size to the pooling layer; you can easily get the embedding dimension of any model using the “get_word_embedding_dimension()” method.
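Continuing from the snippet above:

```python
from sentence_transformers import SentenceTransformer

# Mean pooling over BERT's token embeddings gives one fixed-size vector
# per sentence (768 dimensions for bert-base-uncased)
print(word_embedding_model.get_word_embedding_dimension())  # 768
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```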
Additionally, if you want to reduce the dimensionality of the output, you can add a dense layer after pooling.
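For example, projecting down to 256 dimensions (the target size here is just an example), continuing from the snippets above:

```python
from torch import nn

# Optional: a dense layer that projects the pooled 768-d vector down to 256 dimensions
dense_model = models.Dense(
    in_features=pooling_model.get_sentence_embedding_dimension(),
    out_features=256,
    activation_function=nn.Tanh(),
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])
```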
Preparing a Dataset and Loss Function
To train a SentenceTransformer model, you need to inform it somehow that two sentences have a certain degree of similarity. Therefore, each example in the data requires a label or structure that allows the model to understand whether two sentences are similar or different.
The training data depends entirely on your goal and the structure of your data. There are different types of datasets that you can prepare for your model, but the main goal of every dataset is to define the similarity between two or more sentences.
Here are some of the popular dataset types:
- Pair of sentences with label: Every example in this dataset will have a pair of sentences with a label that shows whether they are similar or not. This case applies to datasets originally prepared for Natural Language Inference (NLI), since they contain pairs of sentences with a label indicating whether they infer each other or not.
- Pair of sentences without label: Every example in this dataset will have a pair of sentences indicating that those sentences are similar. For example, pairs of paraphrases, pairs of full texts and their summaries, pairs of duplicate questions, pairs of (query, response), or pairs of (source_language, target_language)
- Sentence with an integer label: Every example in this dataset has a sentence and an integer label indicating its class. Loss functions can easily convert this data into triplets containing an anchor sentence, positive sentences from the same class as the anchor, and negative sentences from a different class.
- Triplets without class: Every example in this dataset has an anchor, positive sentences which are similar to the anchor, and negative sentences which are not similar to the anchor. These triplets don’t have any class or label.
For this blog, we are going to explore the Triplets and the pair of sentences with label type of datasets to train our model.
Loss Functions
The loss function plays a critical role in model training because it determines how well our embedding model will work for the specific task.
There is no single loss function that works for every model, so you have to pick the loss function suitable for your training data and target task. You can take a look at the table below to determine the loss function for your model:
Training an embedding model
Now we know everything we need to know before training an embedding model so it’s time to get our hands dirty 🚀 !
Sentence Transformers was designed so that fine-tuning your own sentence/text embedding models is easy. It provides most of the building blocks that you can stick together to tune embeddings for your specific task.
Here we will fine-tune a “bert-base-uncased” model using 2 different types of datasets and then we will evaluate the performance and results with both models.
Training a model using a triplets dataset
Let’s first pull our base model and apply pooling on it so that we get a fixed, 768-dimensional embedding vector as output.
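Something like this:

```python
from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```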
Now let’s pull our dataset. We are going to use “embedding-data/QQP_triplets”, but you can use any other triplet dataset if you want.
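A minimal sketch (this assumes you have the datasets library installed):

```python
from datasets import load_dataset  # pip install datasets

dataset = load_dataset("embedding-data/QQP_triplets")
```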
Let’s take a look at what each example in the dataset looks like.
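For example:

```python
print(dataset["train"][0])
# At the time of writing, each row looks roughly like:
# {'set': {'query': '...', 'pos': ['...'], 'neg': ['...', ...]}}
```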
As we can see, each example has a query, a list of positive sentences which are similar to that query, and a list of negative sentences which are not similar to the query.
We can’t pass these dataset examples into our model directly; first we have to convert them into a format that sentence transformers and the model can understand. Every training example must be an “InputExample” in sentence transformers, so we will convert our data into this format.
We will also take only the first sentence from both the “pos” and “neg” arrays to keep things simple, but in a production scenario you might want to use the full arrays for better performance and accuracy.
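A sketch of the conversion; the number of examples we take (500) is arbitrary and just keeps the demo run fast:

```python
from sentence_transformers import InputExample

n_examples = 500  # arbitrary; keeps this demo run fast
train_examples = []

for row in dataset["train"].select(range(n_examples)):
    example = row["set"]
    # anchor (query), one positive and one negative sentence per training example
    train_examples.append(
        InputExample(texts=[example["query"], example["pos"][0], example["neg"][0]])
    )
```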
Now let’s create our dataloader
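Something like this:

```python
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
```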
Now let’s define our loss function. We can use the “losses” module from sentence transformers, which gives us the different loss functions we discussed above.
We just have to attach the model to the triplet loss function.
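Like this:

```python
from sentence_transformers import losses

train_loss = losses.TripletLoss(model=model)
```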
And now we are ready. Let’s combine everything we prepared and fine-tune the model using the “model.fit” method, which takes the dataloader and loss function as a train objective.
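A minimal sketch; the number of epochs and warmup steps here are just reasonable defaults, tune them for your own run:

```python
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=4,          # assumption: tune for your own run
    warmup_steps=100,  # assumption: roughly 10% of the training steps is a common choice
)
```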
Once the training is completed, you will see output like this:
Now let’s push this fine-tuned model to Hugging Face so that we can share it with other people and they can also see what we cooked!
First, log in to Hugging Face using your access token.
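From a notebook you can do something like this (login() from huggingface_hub also works in a plain script):

```python
from huggingface_hub import notebook_login

notebook_login()  # paste your Hugging Face access token when prompted
```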
After that, call the “save_to_hub” method to push your model to Hugging Face.
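For example (the repository name here is just a placeholder):

```python
# Placeholder repo name; newer sentence-transformers versions also expose push_to_hub()
model.save_to_hub("bert-base-uncased-finetuned-qqp-triplets")
```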
And we have successfully fine-tuned and pushed the embedding model!
Training a model using labeled sentences dataset
Now let’s try to fine-tune a model using a different dataset. This time we will use a dataset in which each example contains a pair of sentences with a label that defines the relationship between the two sentences.
Let’s first load our model and add pooling to it
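Same as before:

```python
from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```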
We will use the “snli” dataset to train this model, which has the data in the format we discussed above.
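For example:

```python
from datasets import load_dataset

dataset = load_dataset("snli")
```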
Let’s take a look at what each example in the dataset looks like.
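Like this:

```python
print(dataset["train"][0])
# {'premise': '...', 'hypothesis': '...', 'label': 0}
# label: 0 = entailment, 1 = neutral, 2 = contradiction (-1 means no gold label)
```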
Now let’s convert each example into the InputExample format.
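A sketch of the conversion (again limiting the number of examples to keep the demo fast):

```python
from sentence_transformers import InputExample

n_examples = 500  # arbitrary; kept small for a quick demo run
train_examples = []

for row in dataset["train"].select(range(n_examples)):
    if row["label"] == -1:  # skip examples without a gold label
        continue
    train_examples.append(
        InputExample(texts=[row["premise"], row["hypothesis"]], label=row["label"])
    )
```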
Now let’s define our dataloader and loss function. For this type of dataset, we will use the softmax loss function.
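Something along these lines:

```python
from torch.utils.data import DataLoader
from sentence_transformers import losses

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# SNLI has three classes: entailment, neutral and contradiction
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)
```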
Now let’s train our model!
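Same as before (epochs and warmup steps are just reasonable defaults):

```python
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=4,
    warmup_steps=100,
)
```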
Once the training is completed, you will see output like this:
Now let’s push this model on huggingface
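As before (the repository name is just a placeholder):

```python
model.save_to_hub("bert-base-uncased-finetuned-snli")
```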
And we have successfully fine-tuned a model using both datasets!
💡 You can get the full source code discussed above from our github repo
Evaluation
Now it’s time to compare our fine-tuned models against the base model and analyze their accuracy and performance.
We will first get the vector embeddings of some sentences using each model, then reduce the dimensions of these embeddings to 2 using the TSNE technique, and finally plot the embeddings on a 2D graph using matplotlib.
We will use these sentences for testing
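Here is an illustrative set along these lines (a few of them come up again in the comparison below):

```python
sentences = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]
```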
Let’s first get the embeddings of these sentences using the “bert-base-uncased” model, which is our base model.
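A minimal sketch (loading a plain BERT checkpoint like this makes sentence transformers add mean pooling on top automatically):

```python
from sentence_transformers import SentenceTransformer

base_model = SentenceTransformer("bert-base-uncased")
base_embeddings = base_model.encode(sentences)
```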
Now let’s reduce the embedding dimensions using TSNE
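Something like this (the perplexity value is just a reasonable choice for a handful of sentences):

```python
from sklearn.manifold import TSNE

# perplexity has to be smaller than the number of sentences
tsne = TSNE(n_components=2, perplexity=5, random_state=42)
embeddings_2d = tsne.fit_transform(base_embeddings)
```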
Now that we have 2D embeddings, we will cluster them into different classes so that it is easier to visualize how each model groups the embeddings and where they sit in the vector space.
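A sketch using KMeans; the number of clusters is an assumption you should tune to your sentences:

```python
from sklearn.cluster import KMeans

num_clusters = 4  # assumption: roughly how many topics the test sentences cover
clustering_model = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
clustering_model.fit(embeddings_2d)
cluster_assignment = clustering_model.labels_
```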
If you print the “cluster_assignment” array, you will see the class label assigned to each sentence.
Now let’s plot these embeddings in 2D vector space using matplotlib.
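Roughly like this:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(
    embeddings_2d[:, 0],
    embeddings_2d[:, 1],
    c=cluster_assignment,  # colour each point by its cluster label
    cmap="viridis",
)
# Label each point with its sentence so the plot is readable
for i, sentence in enumerate(sentences):
    plt.annotate(sentence, (embeddings_2d[i, 0], embeddings_2d[i, 1]), fontsize=8)

plt.title("Sentence embeddings (bert-base-uncased) reduced with t-SNE")
plt.show()
```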
And here are the results!
Now let’s get the embedding plot for our fine-tuned models in the same way and compare them side by side
Here is the comparison of BERT base model with a fine-tuned model with snli dataset
As we can see from the image above, the BERT base model is not able to classify the sentence embeddings properly and they are not placed sensibly, but the model fine-tuned on the SNLI dataset classifies them better, and similar sentences end up closer to each other.
But it’s still not that good 🤔. Take a look: “The girl is carrying a baby” and “Women is playing violin” should be closer to each other but they are still far apart, and why is “monkey is playing drums” closer to “The girl is carrying a baby” 💀? This can happen because we trained the model on only a limited number of examples from the dataset.
Do we get better results with triplets? Let’s check it out 👀
Here is the comparison of BERT base model with a fine-tuned model with triplet dataset
Now you can see the classification is done properly and the nearby sentences make more sense. So we can say that the model trained on the triplet dataset gave better results than the model trained on the SNLI dataset. But all of this depends on the examples in your dataset and how many of them you use for training, so it really comes down to your use case.
Conclusion
Embedding models are very useful for search, recommendation systems and finding similar results for a query, and they are widely used across different domains. We also saw how easy it is to fine-tune or train your own model from scratch using sentence transformers in a few lines of code, and how much better the models perform after fine-tuning, so it is always advisable to fine-tune a model for your own use case to get better results.
Want to train your own LLM?
As we all know, public large language models (LLMs) like GPT or Llama are trained on public data and might not perform well for your specific use case, so to make them efficient and accurate for your specific task, you have to fine-tune a base model with your own dataset. If you have a business idea for which you might need to train or fine-tune an LLM or embedding model, then kindly book a call with us and we will be happy to turn your ideas into reality.
Thanks for reading 😀