# POTION: bag of tricks leads to better models

*29 October 2024*

This blog post describes the [Tokenlearn](https://github.com/MinishLab/tokenlearn) method, a method for pre-training Model2Vec models.

We’ve been brewing, concocting, and distilling, and came up with a new distillation technique that leads to much better models, which we are now releasing under the name POTION. We open-source all models, code, and data.
We’re releasing three versions: a 64-dim (1.9M params), a 128-dim (3.8M params), and a 256-dim (7.6M params) model, all based on the same base model, which is, in turn, a bge-base distillation. All POTION models outperform all previous distillations in their size class and should be considered drop-in replacements for our M2V_base_output model. potion-base-8M, in particular, even improves over our largest model, M2V_base_glove. potion-base-8M is better than any set of static embeddings we could find on any task, including GloVe, fastText, and specialized word embeddings.
Get them here: potion-base-2M, potion-base-4M, and potion-base-8M, all available on our HuggingFace hub.

The Tokenlearn code can be found [here](https://github.com/MinishLab/tokenlearn).

The rest of the post will detail how we made the models, how they perform, and further improvements we have in store.
In our regular Model2Vec framework, we distill sentence transformers down to really fast, tiny models by doing a forward pass for all tokens separately. We then perform Principal Component Analysis (PCA) on the resulting embeddings and weight the individual embeddings via Zipf’s law; see our previous blog post below. The new distillation framework is composed of four steps:

1. Start from a distilled Model2Vec model.
2. Use the original sentence transformer to create embeddings for an in-domain corpus.
3. Train the static model to mimic those embeddings.
4. Re-regularize the trained model with PCA and token re-weighting.

These four steps take a bit longer than the previous distillation framework. If you are looking for a quick way to get a Model2Vec model, plain distillation is still your best bet. If you are looking for maximum performance, read on!
We start from a distilled model. In our case, we are using the M2V_base_output model as our starting point.

We then go back to the original big sentence transformer and use that transformer to create ~1M embeddings on an in-domain corpus, which for us is C4. We then throw away the sentence transformer, never to see it again. Forget it existed.
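To make this step concrete, here is a minimal sketch of how such target embeddings could be created, assuming the `allenai/c4` dataset on the HuggingFace hub and `BAAI/bge-base-en-v1.5` as the original sentence transformer; the corpus slice, batch size, and file name are illustrative, not the exact Tokenlearn setup:

```python
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

N_TEXTS = 1_000_000  # illustrative corpus size

# The original, big sentence transformer we are about to throw away.
teacher = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Stream C4 so we never have to download the full corpus.
corpus = load_dataset("allenai/c4", "en", split="train", streaming=True)
texts = [row["text"] for _, row in zip(range(N_TEXTS), corpus)]

# One embedding per text; these become the training targets.
targets = teacher.encode(texts, batch_size=256, show_progress_bar=True)

np.save("c4_targets.npy", targets)
# After this point, the teacher is no longer needed. Forget it existed.
```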
So, we now have a base model, 1M texts, and 1M vector representations of those texts. We then train the base model to minimize the cosine distance between the representations it produces and the representations we produced before. In doing so, our model learns to better mimic representations made by a large model. We also add a super heavy regularization term to the produced embeddings.
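As an illustration of what such an objective could look like in PyTorch, here is a minimal sketch; the exact form and weight of the regularization term are assumptions made for the example, not the actual Tokenlearn loss:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_emb: torch.Tensor,
                      target_emb: torch.Tensor,
                      reg_weight: float = 1.0) -> torch.Tensor:
    """Cosine distance to the stored teacher targets, plus a heavy penalty
    on the norm of the produced embeddings (illustrative regularizer)."""
    cosine_distance = 1.0 - F.cosine_similarity(student_emb, target_emb, dim=-1)
    regularization = student_emb.pow(2).mean(dim=-1)
    return (cosine_distance + reg_weight * regularization).mean()

# student_emb: mean of the static token embeddings for a batch of texts
# target_emb: the corresponding sentence-transformer embeddings created earlier
loss = distillation_loss(torch.randn(32, 768), torch.randn(32, 768))
```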
During training, we apply a few standard methods to improve performance, such as reducing the learning rate on plateau and early stopping.
Finally, after training, we re-regularize our models by performing PCA and by manually re-weighting individual tokens. As we show below, this massively improves performance, again.
Of note here is the manual re-weighting, which is very similar to the Zipf weighting we use, but now relies on external data. Before, we assumed that all tokens were in rank order, and simply weighted them as follows:

```
w = log(1 / rank)
```
This works really well, as shown in our original blog post. Using actual frequencies, however, works even better. We use the same 1M documents on which we trained, and collect token probabilities for all tokens in our vocabulary. We then re-weight using the following formula from the SIF paper:

```
w = 1e-3 / (1e-3 + proba)
```

where `proba` is the probability of the token in the corpus. While this does mean our new distillation method relies on some data, it is worth it, as we will show below.
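A minimal sketch of this re-weighting step; the stand-in data and variable names are illustrative, and the real pipeline uses the token counts collected from the 1M training documents:

```python
import numpy as np

def sif_reweight(embeddings: np.ndarray, token_counts: np.ndarray,
                 smoothing: float = 1e-3) -> np.ndarray:
    """Scale each token embedding by w = s / (s + p(token)), as in the SIF paper."""
    proba = token_counts / token_counts.sum()
    weights = smoothing / (smoothing + proba)
    return embeddings * weights[:, None]

# embeddings: (vocab_size, dim) static token embeddings after PCA
# token_counts: (vocab_size,) token occurrence counts from the corpus
vocab_size, dim = 32_000, 256
reweighted = sif_reweight(np.random.rand(vocab_size, dim),
                          np.random.randint(1, 10_000, size=vocab_size))
```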
Just like in our original experiments, we again evaluate on MTEB, as well as our two additional tasks (PEARL and WordSim). The results are shown in the table below.

| Model | Avg (All) | Avg (MTEB) | Class | Clust | PairClass | Rank | Ret | STS | Sum | Pearl | WordSim |
|---|---|---|---|---|---|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 56.08 | 56.09 | 62.62 | 41.94 | 82.37 | 58.04 | 41.95 | 78.90 | 30.81 | 60.83 | 49.91 |
| potion-base-8M | 50.54 | 50.03 | 64.44 | 32.93 | 76.62 | 49.73 | 31.71 | 73.24 | 29.28 | 53.54 | 50.75 |
| M2V_base_glove_subword | 49.06 | 46.69 | 61.27 | 30.03 | 74.71 | 49.15 | 27.16 | 69.09 | 30.08 | 56.82 | 57.99 |
| potion-base-4M | 48.87 | 48.23 | 62.19 | 31.47 | 75.37 | 48.75 | 29.11 | 72.19 | 28.89 | 52.55 | 49.21 |
| M2V_base_glove | 48.58 | 47.60 | 61.35 | 30.52 | 75.34 | 48.50 | 29.26 | 70.31 | 31.50 | 50.28 | 54.29 |
| M2V_base_output | 46.79 | 45.34 | 61.25 | 25.58 | 74.90 | 47.63 | 26.14 | 68.58 | 29.20 | 54.02 | 49.18 |
| potion-base-2M | 45.52 | 44.77 | 58.45 | 27.50 | 73.72 | 46.82 | 24.13 | 70.14 | 31.51 | 50.82 | 44.72 |
| GloVe_300d | 42.84 | 42.36 | 57.31 | 27.66 | 72.48 | 43.30 | 22.78 | 61.90 | 28.81 | 45.65 | 43.05 |
| BPEmb_50k_300d | 39.34 | 37.78 | 55.76 | 23.35 | 57.86 | 43.21 | 17.50 | 55.10 | 29.74 | 47.56 | 41.28 |
As can be seen, potion-base-8M is the best model we have released so far (surpassing the 50% average MTEB score mark!), further pushing the limits of what is possible with static word embeddings. Furthermore, the 4M and 2M models still work quite well, with the 2M model outperforming GloVe while being ~55 times smaller.
To show the relationship between speed and performance, we plot the average MTEB score against the number of sentences encoded per second. The circle sizes correspond to the number of parameters in the models (larger = more parameters).

*Figure: the average MTEB score plotted against sentences per second. The circle size indicates model size.*
# Model2Vec Introduction

*14 October 2024*

This blog was first posted on the Hugging Face blog. We’re also posting it here for archival purposes.

(Large) language models have become the de facto standard for feature extraction. While these models have shown state-of-the-art performance on a large number of tasks, they also come with heavy resource requirements: large energy consumption, computational demands, and longer processing times. Although there are many ways in which you can make existing (Sentence) Transformers faster, e.g. quantization or specialized kernels, they are still relatively slow, especially on CPU. What if you need to go faster and are working on a time-constrained product (e.g. a search engine), or have very few resources available?
This is where Model2Vec comes in — offering static embeddings that are hardware- and eco-friendly while maintaining strong performance.

In this blog, we will discuss what Model2Vec is, how it works, how you can use it, and its performance.
*Figure: Visualization of the Model2Vec architecture.*
Model2Vec is a technique to distill a small, fast, high-performance static model from any Sentence Transformer. At a high level, it works by passing a vocabulary through a sentence transformer model, then reducing the dimensionality of the resulting embeddings using PCA, and finally weighting the embeddings using Zipf weighting. No dataset is needed, just a model (and optionally, a vocabulary). During inference, we simply take the mean of all token embeddings occurring in a sentence. A Model2Vec model is therefore completely uncontextualized. While this may sound like a big downside, we’ll show that it still performs quite well considering how small and fast it is.
The above might sound like a lot to you, so let’s unpack this a little.
In a sentence transformer encoding step, a string is first chopped up into subword tokens. The embeddings of these tokens are then fed through the model, which contextualizes them to create high-quality sentence representations. At the output, you get as many embeddings as you put in, so if your input sentence consists of 10 tokens, you also get 10 output embeddings. These are then turned into a sentence representation by a pooling mechanism, which can either be a simple mean or a special pooler module.
On to Model2Vec: the project first started as a kind of cache for sentence transformers. Because a transformer vocabulary typically only has about 32k tokens, a word like `astoundingly` gets chopped up into four unique tokens: `'as', '##tou', '##nding', '##ly'`, which means that we re-compute the attention between those four tokens each time this word occurs. But the meaning of this word might not be ambiguous at all!

However, as we started implementing this, we noticed that you actually do not need to cache any words at all, and you can just use the output representations of individual tokens to get good sentence representations. And this is exactly what the basic mode of operation of Model2Vec is: for each of the 32k input tokens in a sentence transformer vocabulary, we do a forward pass, and then store the resulting embedding. For a new sentence, we then just take the mean of the token embeddings we computed.

Note that the output token representations of a Model2Vec model are uncontextualized. Unlike with normal transformer models, there is no way for the model to give different meanings to the same token in different contexts. While this might seem like a huge downside, we think that the actual context provides models with enough disambiguation potential.
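To make this concrete, here is a minimal sketch of the idea (an illustration, not the actual Model2Vec implementation): embed every token in the vocabulary once, then average the stored embeddings at inference time.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

teacher = SentenceTransformer("BAAI/bge-base-en-v1.5")
tokenizer = teacher.tokenizer

# One forward pass per vocabulary token, stored as a (vocab_size, dim) matrix.
vocab = sorted(tokenizer.get_vocab(), key=tokenizer.get_vocab().get)
token_embeddings = teacher.encode(vocab, batch_size=1024)

def embed(sentence: str) -> np.ndarray:
    """Uncontextualized sentence embedding: the mean of static token embeddings."""
    ids = tokenizer.encode(sentence, add_special_tokens=False)
    return token_embeddings[ids].mean(axis=0)

print(embed("It's dangerous to go alone!").shape)  # (768,)
```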
In addition to this trick, we show that two additional tricks are necessary to get optimal performance.

We reduce the dimensionality of the resulting token space by using Principal Component Analysis (PCA). Normally, using PCA is associated with a loss in performance, because you throw away information. However, in our case, reducing the dimensionality actually increased performance significantly. We think this is because PCA also normalizes the resulting space, in the sense of removing biases in the original vector space, thereby making it easier to learn from the vectors.
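A minimal sketch of this step with scikit-learn, using a random stand-in for the token-embedding matrix; the 256-dimensional target matches the distillation examples later in this post:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the (vocab_size, 768) matrix of static token embeddings.
token_embeddings = np.random.rand(32_000, 768).astype(np.float32)

# Reduce to 256 dimensions; fitting PCA also centers the space,
# which removes some of the bias in the original vectors.
pca = PCA(n_components=256)
reduced = pca.fit_transform(token_embeddings)
print(reduced.shape)  # (32000, 256)
```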
As we take a simple mean over tokens in the space, it is important that the vectors are weighted correctly. Normally, a sentence transformer would be there to correctly weight all the tokens for us given the context, but we don’t have that luxury any more. Intuitively, we would like to use something like Inverse Document Frequency (IDF) to down-weight very frequent or uninteresting words. But we don’t have access to a corpus over which to compute document frequencies.

To overcome this, we opt to use a well-known principle from the language sciences: given a frequency-ranked list, the frequency of the items in that list follows a power-law distribution. This is called Zipf’s law. So, if we assume that a vocabulary is ranked by frequency, we can accurately down-weight really frequent items without needing access to actual frequencies. As tokenizer vocabularies are sorted by frequency, we already have access to a ranked list, so this optimization can be applied without any additional work.
*Figure: Visualization of the effects of applying PCA and Zipf weighting on the embeddings.*
The Model2Vec library has two broad modes of usage: distillation and inference. In distillation mode, you can distill your own model using any Sentence Transformer (and optionally your own vocabulary). In inference mode, you can use the distilled model (or use one of our pre-distilled models) to generate embeddings for your text data at extremely high speed.
There are three ways to distill a model; the code examples below show two of them: distilling with just a base model, and distilling with a base model plus a custom vocabulary. Note that, while vocabulary-based models are larger in terms of RAM, all models are equally fast, because inference speed is independent of vocabulary size.
Model2Vec embeddings can be used in a wide variety of applications, such as text classification, clustering, building a search engine, or a RAG system. They are an especially good fit for applications that require fast, lightweight embeddings with low resource requirements.

As we will show next, Model2Vec is very easy to use. It can either be used as a standalone package or directly in Sentence Transformers. This means you can easily integrate it into any pipeline that supports Sentence Transformers (e.g. LangChain and LlamaIndex). You can also train Model2Vec models directly using Sentence Transformers, keeping the fast inference speed while optimizing them directly for your use case.
Model2Vec can be installed using pip:

```bash
pip install model2vec
```
The easiest way to get started with Model2Vec is to download one of our flagship models from our HuggingFace hub. These models are pre-trained and ready to use. The following code snippet shows how to load a model and make embeddings:
```python
from model2vec import StaticModel

# Load a model from the HuggingFace hub (in this case the M2V_base_output model)
model_name = "minishlab/M2V_base_output"
model = StaticModel.from_pretrained(model_name)

# Make embeddings
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])
```
Or distill your own models and directly use them:
```python
from model2vec import distill

# Choose a Sentence Transformer model
base_model_name = "BAAI/bge-base-en-v1.5"

# Distill an output model with the chosen dimensions
model = distill(model_name=base_model_name, pca_dims=256)

# Make embeddings
embeddings = model.encode(["supervillain Ganondorf has invaded Hyrule!"])

print(model.tokenizer.encode("supervillain Ganondorf has invaded Hyrule!", add_special_tokens=False).tokens)
# ['super', '##vill', '##ain', 'gan', '##ond', '##orf', 'has', 'invaded', 'h', '##yr', '##ule', '!']

# It looks like we split Ganondorf and Hyrule up into many subtokens.
# To solve this, we can add these words to our vocabulary.
vocabulary = ["supervillain", "ganondorf", "hyrule"]

# Distill the model with the custom vocabulary.
model = distill(model_name=base_model_name, vocabulary=vocabulary, pca_dims=256)

print(model.tokenizer.encode("supervillain Ganondorf has invaded Hyrule!", add_special_tokens=False).tokens)
# ['supervillain', 'ganondorf', 'has', 'invaded', 'hyrule', '!']
# Much better.
```
Model2Vec is also directly supported in Sentence Transformers. To use Model2Vec in Sentence Transformers, you can initialize a `StaticEmbedding` class using `from_model2vec`. To directly distill in Sentence Transformers, the `StaticEmbedding` class can be initialized using `from_distillation`:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

# Initialize a StaticEmbedding module using a pre-trained model
static_embedding = StaticEmbedding.from_model2vec("minishlab/M2V_base_output")
model = SentenceTransformer(modules=[static_embedding])
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])

# Or distill your own directly without leaving sentence-transformers
static_embedding = StaticEmbedding.from_distillation("BAAI/bge-base-en-v1.5", device="cpu", pca_dims=256)
model = SentenceTransformer(modules=[static_embedding])
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])
```
We evaluated Model2Vec on a large number of tasks and datasets: MTEB, plus two additional tasks, PEARL (a phrase representation task) and WordSim (a collection of word similarity tasks). The results are shown in the table below.

| Model | Avg (All) | Avg (MTEB) | Class | Clust | PairClass | Rank | Ret | STS | Sum | Pearl | WordSim |
|---|---|---|---|---|---|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 56.08 | 56.09 | 62.62 | 41.94 | 82.37 | 58.04 | 41.95 | 78.90 | 30.81 | 60.83 | 49.91 |
| M2V_base_glove_subword | 49.06 | 46.69 | 61.27 | 30.03 | 74.71 | 49.15 | 27.16 | 69.09 | 30.08 | 56.82 | 57.99 |
| M2V_base_glove | 48.58 | 47.60 | 61.35 | 30.52 | 75.34 | 48.50 | 29.26 | 70.31 | 31.50 | 50.28 | 54.29 |
| M2V_base_output | 46.79 | 45.34 | 61.25 | 25.58 | 74.90 | 47.63 | 26.14 | 68.58 | 29.20 | 54.02 | 49.18 |
| GloVe_300d | 42.84 | 42.36 | 57.31 | 27.66 | 72.48 | 43.30 | 22.78 | 61.90 | 28.81 | 45.65 | 43.05 |
| BPEmb_50k_300d | 39.34 | 37.78 | 55.76 | 23.35 | 57.86 | 43.21 | 17.50 | 55.10 | 29.74 | 47.56 | 41.28 |
As can be seen, Model2Vec significantly outperforms GloVe and BPEmb on all tasks, and even outperforms MiniLM, which is a much slower model, on some tasks.
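If you want to reproduce or extend these numbers, here is a minimal sketch of running a single MTEB task on a Model2Vec model wrapped as a Sentence Transformer; it assumes the `mteb` package accepts task names as strings, and the task choice and output folder are illustrative:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

# Wrap a pre-distilled Model2Vec model as a Sentence Transformer (see above).
static_embedding = StaticEmbedding.from_model2vec("minishlab/M2V_base_output")
model = SentenceTransformer(modules=[static_embedding])

# Evaluate on one small classification task as a sanity check.
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="mteb_results")
print(results)
```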
In addition, we evaluated Model2Vec on a number of classification datasets that are not in MTEB. We also use these to benchmark the speed of the model. The results are shown in the table below.

| Model | Average | SST2 | IMDB | TREC | AG News |
|---|---|---|---|---|---|
| bge-base-en-v1.5 | 90.00 | 91.54 | 91.88 | 85.16 | 91.45 |
| all-MiniLM-L6-v2 | 84.10 | 83.95 | 81.36 | 81.31 | 89.77 |
| M2V_base_output | 82.23 | 80.92 | 84.56 | 75.27 | 88.17 |
| M2V_base_glove_subword | 81.95 | 82.84 | 85.96 | 70.51 | 88.49 |
| BPEmb_50k_300d | 81.15 | 80.42 | 84.04 | 71.25 | 88.92 |
| M2V_base_glove | 80.76 | 83.07 | 85.24 | 66.12 | 88.61 |
| GloVe_300d | 77.77 | 81.68 | 84.00 | 55.67 | 89.71 |
Again, Model2Vec outperforms GloVe and BPEmb on average, and even shows similar performance to MiniLM.
The figure below shows the relationship between the number of sentences per second and the average classification score. The circle sizes correspond to the number of parameters in the models (larger = more parameters). This plot shows that the Model2Vec models are much faster than the other models, while still being competitive in terms of classification performance with the all-MiniLM-L6-v2 model.
*Figure: The average accuracy over all classification datasets plotted against sentences per second. The circle size indicates model size.*
To better understand the factors contributing to the performance of Model2Vec, we conducted a comprehensive set of ablation studies, covering various aspects of the model’s architecture and preprocessing methods. In these studies, we examined the impact of key elements such as PCA, Zipf weighting, and the use of Sentence Transformers versus regular transformer models. We also compared the performance of input embeddings versus output embeddings, since it would seem plausible that these should also work well. The results are shown in the table below.

| Model | Avg (All) | Avg (MTEB) | Class | Clust | PairClass | Rank | Ret | STS | Sum | Pearl | WordSim |
|---|---|---|---|---|---|---|---|---|---|---|---|
| M2V_base_output | 46.79 | 45.34 | 61.25 | 25.58 | 74.90 | 47.63 | 26.14 | 68.58 | 29.20 | 54.02 | 49.18 |
| M2V_base_output_nopca | 44.04 | 42.31 | 61.42 | 20.15 | 68.21 | 44.67 | 25.25 | 61.87 | 29.85 | 51.02 | 48.96 |
| M2V_base_output_nozipf | 43.61 | 41.52 | 60.44 | 21.62 | 72.15 | 45.57 | 20.35 | 62.71 | 30.66 | 52.28 | 49.17 |
| M2V_base_input_nozipf_nopca | 40.97 | 39.55 | 54.16 | 18.62 | 68.30 | 43.65 | 23.63 | 59.38 | 32.04 | 50.19 | 40.52 |
| M2V_base_output_nozipf_nopca | 40.80 | 38.44 | 59.78 | 19.31 | 62.39 | 42.26 | 19.01 | 55.16 | 30.00 | 49.09 | 48.97 |
| M2V_base_input | 40.74 | 39.93 | 60.35 | 22.66 | 59.63 | 43.02 | 25.47 | 50.05 | 29.35 | 50.61 | 34.47 |
| M2V_bert_output_nozipf_nopca | 35.54 | 34.82 | 55.69 | 15.42 | 58.68 | 39.87 | 12.92 | 55.24 | 30.15 | 46.90 | 26.72 |
There are four main findings in these results:

1. **Using a Sentence Transformer as the base model matters.** This can be seen by comparing `M2V_bert_output_nozipf_nopca` (which uses BERT, a non-Sentence Transformer) and `M2V_base_output_nozipf_nopca` (which uses BGE-base, a Sentence Transformer). Using a Sentence Transformer gives a ~5.2% increase in performance.
2. **PCA helps.** This can be seen by comparing `M2V_base_output_nozipf_nopca` and `M2V_base_output_nozipf`, which gives a ~2.8% increase in performance. Furthermore, PCA improves performance on all tasks.
3. **Zipf weighting helps.** This can be seen by comparing `M2V_base_output_nozipf_nopca` and `M2V_base_output_nopca`, which gives a ~3.1% increase in performance.
4. **Output embeddings outperform input embeddings.** This can be seen by comparing `M2V_base_input` and `M2V_base_output`, which gives a ~6.1% increase in performance. Note that input embeddings do work well for some tasks. We hypothesize that this is because input embeddings are inherently normalized.

Thanks for reading our blog post on Model2Vec! We hope you found it informative and useful. If you have any questions or comments, please feel free to reach out to us. We are still actively working on the project, and have a number of features already planned, so stay tuned.

```bibtex
@software{minishlab2024word2vec,
  author = {Stephan Tulkens and Thomas van Dongen},
  title = {Model2Vec: Turn any Sentence Transformer into a Small Fast Model},
  year = {2024},
  url = {https://github.com/MinishLab/model2vec},
}
```

We’d like to thank Tom Aarsen for integrating Model2Vec into Sentence Transformers and helping us with our HuggingFace integration, as well as his general feedback on the project.