# Anserini: Baselines for MS MARCO V2.1

The MS MARCO V2.1 collections were created for the [TREC RAG Track](https://trec-rag.github.io/), where they served as the official corpora in 2024 and will remain the corpora for 2025.
There are two separate MS MARCO V2.1 variants: documents and segmented documents.

+ The segmented documents corpus (segments = passages) is the one actually used for the TREC RAG evaluations. It contains 113,520,750 passages.
+ The documents corpus is the source of the segments and is useful as a point of reference (but is not actually used in the TREC evaluations). It contains 10,960,555 documents.

This guide focuses on the segmented documents corpus.
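Segment IDs extend document IDs with a `#`-delimited segment index (for example, a segment ID of the form `msmarco_v2.1_doc_00_0#2` identifies a segment of document `msmarco_v2.1_doc_00_0`); this is the delimiter that the `-selectMaxPassage.delimiter "#"` option in the commands below relies on.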

## Effectiveness Summary

### TREC 2024 RAG

With Anserini, you can reproduce baseline runs on the TREC 2024 RAG test queries using BM25 and ArcticEmbed-L embeddings.
Using the [UMBRELA qrels](https://trec-rag.github.io/annoucements/umbrela-qrels/), these are the evaluation numbers you'd get:

**nDCG@20**

| Dataset              |  BM25  | ArcticEmbed-L |
|:---------------------|:------:|:-------------:|
| RAG24 Test (UMBRELA) | 0.3198 |    0.5497     |

**nDCG@100**

| Dataset              |  BM25  | ArcticEmbed-L |
|:---------------------|:------:|:-------------:|
| RAG24 Test (UMBRELA) | 0.2563 |    0.4855     |

**Recall@100**

| Dataset              |  BM25  | ArcticEmbed-L |
|:---------------------|:------:|:-------------:|
| RAG24 Test (UMBRELA) | 0.1395 |    0.2547     |

See instructions below on how to reproduce these runs.

More details can be found in the following paper:

> Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Daniel Campos, Nick Craswell, Ian Soboroff, Hoa Trang Dang, and Jimmy Lin. [A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look.](https://arxiv.org/abs/2411.08275) _arXiv:2411.08275_, November 2024.

### Dev Queries

With Anserini, you can reproduce baseline runs on the TREC 2024 RAG "dev queries".
These comprise topics and _document-level_ qrels originally targeted at the V2 documents corpus, but "projected" over to the V2.1 corpus.
These are the evaluation numbers you'd get:

**nDCG@10**

| Dataset     |  BM25  | ArcticEmbed-L |
|:------------|:------:|:-------------:|
| dev         | 0.2301 |    0.3545     |
| dev2        | 0.2339 |    0.3533     |
| DL21        | 0.5778 |    0.6989     |
| DL22        | 0.3576 |    0.5465     |
| DL23        | 0.3356 |    0.4644     |
| RAG24 RAGGy | 0.4227 |    0.5770     |

**Recall@100**

| Dataset     |  BM25  | ArcticEmbed-L |
|:------------|:------:|:-------------:|
| dev         | 0.6683 |    0.8385     |
| dev2        | 0.6771 |    0.8337     |
| DL21        | 0.3811 |    0.4077     |
| DL22        | 0.2330 |    0.3147     |
| DL23        | 0.3049 |    0.3490     |
| RAG24 RAGGy | 0.2807 |    0.3624     |

See instructions below on how to reproduce these runs.

## BM25 Baselines

For the MS MARCO V2.1 segmented document collection, Anserini provides prebuilt inverted indexes (for BM25).

❗ Beware, the `msmarco-v2.1-doc-segmented` prebuilt index is 84 GB uncompressed.
The commands below will download the index automatically, so make sure you have plenty of space.
See [this guide on prebuilt indexes](prebuilt-indexes.md) for more details.
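
Before kicking off any downloads, it's worth checking that the filesystem holding the index cache has enough room (the path below is the default cache location, also referenced later in this guide):

```bash
# Free space on the filesystem that will hold the downloaded indexes.
df -h ~/.cache/pyserini

# After downloading, check how much space the cache is actually using.
du -sh ~/.cache/pyserini/indexes/
```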

Here's how you reproduce results on the TREC 2024 RAG Track test queries:

```bash
bin/run.sh io.anserini.search.SearchCollection \
  -index msmarco-v2.1-doc-segmented \
  -topics rag24.test \
  -output runs/run.msmarco-v2.1-doc-segmented.bm25.rag24.test.txt \
  -bm25 -hits 1000
```
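
The output is a standard six-column TREC run file (`qid Q0 docid rank score tag`), one line per retrieved segment; the docids and scores below are purely illustrative, not actual output:

```
2024-105741 Q0 msmarco_v2.1_doc_00_12345#3 1 21.8970 Anserini
2024-105741 Q0 msmarco_v2.1_doc_07_67890#0 2 21.3460 Anserini
...
```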

And to evaluate:

```bash
bin/run.sh trec_eval -c -m ndcg_cut.20 rag24.test-umbrela-all runs/run.msmarco-v2.1-doc-segmented.bm25.rag24.test.txt
bin/run.sh trec_eval -c -m ndcg_cut.100 rag24.test-umbrela-all runs/run.msmarco-v2.1-doc-segmented.bm25.rag24.test.txt
bin/run.sh trec_eval -c -m recall.100 rag24.test-umbrela-all runs/run.msmarco-v2.1-doc-segmented.bm25.rag24.test.txt
```

You should arrive at exactly the effectiveness metrics [above](#trec-2024-rag).
Note that these are _passage-level_ relevance judgments.

Here's how you reproduce results on the "dev queries" (iterating over the different query sets).
Note that here we are generating document-level runs via the MaxP technique (i.e., each document is represented by its highest-scoring passage); a post-processing sketch equivalent to the `-selectMaxPassage` options follows the command below.

```bash
TOPICS=(msmarco-v2-doc.dev msmarco-v2-doc.dev2 dl21-doc dl22-doc dl23-doc rag24.raggy-dev); for t in "${TOPICS[@]}"
do
  bin/run.sh io.anserini.search.SearchCollection \
    -index msmarco-v2.1-doc-segmented \
    -topics $t \
    -output runs/run.msmarco-v2.1-doc-segmented.bm25.${t}.txt \
    -threads 16 -bm25 -hits 10000 \
    -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000
done
```
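
For intuition, the `-selectMaxPassage` options behave like the following post-processing of a passage-level run (a minimal sketch, assuming segment IDs of the form `docid#segment` and a hypothetical input file name; this is not how Anserini implements it internally):

```bash
# Collapse a passage-level run to document level: keep each document's
# highest-scoring passage, then re-rank and keep the top 1000 docs per query.
awk '{
  split($3, parts, "#"); docid = parts[1];          # strip the #segment suffix
  key = $1 SUBSEP docid;
  if (!(key in best) || $5 + 0 > best[key] + 0)     # max passage score per doc
    best[key] = $5;
}
END {
  for (k in best) {
    split(k, kv, SUBSEP);
    print kv[1], "Q0", kv[2], 0, best[k], "maxp";   # rank filled in below
  }
}' runs/passage-level-run.txt \
  | sort -k1,1 -k5,5gr \
  | awk '{ rank[$1]++; if (rank[$1] <= 1000) { $4 = rank[$1]; print } }' \
  > runs/doc-level-run.txt
```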

And to evaluate:

```bash
bin/run.sh trec_eval -c -m ndcg_cut.10 msmarco-v2.1-doc.dev runs/run.msmarco-v2.1-doc-segmented.bm25.msmarco-v2-doc.dev.txt
bin/run.sh trec_eval -c -m ndcg_cut.10 msmarco-v2.1-doc.dev2 runs/run.msmarco-v2.1-doc-segmented.bm25.msmarco-v2-doc.dev2.txt
bin/run.sh trec_eval -c -m ndcg_cut.10 dl21-doc-msmarco-v2.1 runs/run.msmarco-v2.1-doc-segmented.bm25.dl21-doc.txt
bin/run.sh trec_eval -c -m ndcg_cut.10 dl22-doc-msmarco-v2.1 runs/run.msmarco-v2.1-doc-segmented.bm25.dl22-doc.txt
bin/run.sh trec_eval -c -m ndcg_cut.10 dl23-doc-msmarco-v2.1 runs/run.msmarco-v2.1-doc-segmented.bm25.dl23-doc.txt
bin/run.sh trec_eval -c -m ndcg_cut.10 rag24.raggy-dev runs/run.msmarco-v2.1-doc-segmented.bm25.rag24.raggy-dev.txt

bin/run.sh trec_eval -c -m recall.100 msmarco-v2.1-doc.dev runs/run.msmarco-v2.1-doc-segmented.bm25.msmarco-v2-doc.dev.txt
bin/run.sh trec_eval -c -m recall.100 msmarco-v2.1-doc.dev2 runs/run.msmarco-v2.1-doc-segmented.bm25.msmarco-v2-doc.dev2.txt
bin/run.sh trec_eval -c -m recall.100 dl21-doc-msmarco-v2.1 runs/run.msmarco-v2.1-doc-segmented.bm25.dl21-doc.txt
bin/run.sh trec_eval -c -m recall.100 dl22-doc-msmarco-v2.1 runs/run.msmarco-v2.1-doc-segmented.bm25.dl22-doc.txt
bin/run.sh trec_eval -c -m recall.100 dl23-doc-msmarco-v2.1 runs/run.msmarco-v2.1-doc-segmented.bm25.dl23-doc.txt
bin/run.sh trec_eval -c -m recall.100 rag24.raggy-dev runs/run.msmarco-v2.1-doc-segmented.bm25.rag24.raggy-dev.txt
```

You should arrive at exactly the effectiveness metrics [above](#dev-queries).
Note that these are _document-level_ relevance judgments.

## ArcticEmbed-L Baselines

For the MS MARCO V2.1 segmented document collection, Anserini provides prebuilt indexes with ArcticEmbed-L embeddings.
The embedding vectors were generated by Snowflake and are freely downloadable [on Hugging Face](https://huggingface.co/datasets/Snowflake/msmarco-v2.1-snowflake-arctic-embed-l).
We provide prebuilt HNSW indexes with int8 quantization, divided into 10 shards, `00` to `09`.

❗ Beware, the complete ArcticEmbed-L index for all 10 shards of the MS MARCO V2.1 segmented document collection totals 558 GB!
The commands below will download the indexes automatically, so make sure you have plenty of space.
See [this guide on prebuilt indexes](prebuilt-indexes.md) for general information on prebuilt indexes.
Additional tips for dealing with space issues are provided below.

Here's how you reproduce results on the TREC 2024 RAG Track test queries.
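
The loops below write per-shard run files to `runs/` and logs to `logs/`; make sure both directories exist first:

```bash
mkdir -p runs logs
```

Then run the searches: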

```bash
# RAG24 test
SHARDS=(00 01 02 03 04 05 06 07 08 09); for shard in "${SHARDS[@]}"
do
  bin/run.sh io.anserini.search.SearchHnswDenseVectors \
    -index msmarco-v2.1-doc-segmented-shard${shard}.arctic-embed-l.hnsw-int8 \
    -efSearch 1000 \
    -topics rag24.test.snowflake-arctic-embed-l \
    -output runs/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.shard${shard}.txt \
    -hits 250 -threads 32 \
    > logs/log.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.shard${shard}.txt 2>&1
done
```

Note that here we are generating passage-level runs.
For evaluation purposes, you can simply concatenate the 10 per-shard run files: trec_eval re-sorts results by score, so the rank fields in the concatenated file don't matter.

```bash
cat runs/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.shard0* > runs/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.txt

bin/run.sh trec_eval -c -m ndcg_cut.20 rag24.test-umbrela-all runs/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.txt
bin/run.sh trec_eval -c -m ndcg_cut.100 rag24.test-umbrela-all runs/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.txt
bin/run.sh trec_eval -c -m recall.100 rag24.test-umbrela-all runs/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.txt
```
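
If you need a single, properly ranked merged run for use outside of trec_eval (say, to feed a downstream reranker), merge the shard runs by score and rewrite the rank column; a minimal sketch (the `fused` run tag is arbitrary):

```bash
# Merge per-shard runs: sort by query then score, keep the top 250 per query.
sort -k1,1 -k5,5gr runs/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.shard0* \
  | awk '{ rank[$1]++; if (rank[$1] <= 250) { $4 = rank[$1]; $6 = "fused"; print } }' \
  > runs/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.merged.txt
```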

You should arrive at exactly the effectiveness metrics [above](#trec-2024-rag).
Note that these are _passage-level_ relevance judgments.

Here's how you reproduce results on the "dev queries".
Note that here we are generating document-level runs via the MaxP technique (i.e., each document is represented by its highest-scoring passage).

```bash
# dev
SHARDS=(00 01 02 03 04 05 06 07 08 09); for shard in "${SHARDS[@]}"
do
  bin/run.sh io.anserini.search.SearchHnswDenseVectors \
    -index msmarco-v2.1-doc-segmented-shard${shard}.arctic-embed-l.hnsw-int8 \
    -efSearch 1000 \
    -topics msmarco-v2-doc.dev.snowflake-arctic-embed-l \
    -output runs/run.msmarco-v2.1-doc-segmented.arctic-l.msmarco-v2-doc.dev.shard${shard}.txt \
    -threads 32 -hits 1000 \
    -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 250 \
    > logs/log.msmarco-v2.1-doc-segmented.arctic-l.msmarco-v2-doc.dev.shard${shard}.txt 2>&1
done

# dev2
SHARDS=(00 01 02 03 04 05 06 07 08 09); for shard in "${SHARDS[@]}"
do
  bin/run.sh io.anserini.search.SearchHnswDenseVectors \
    -index msmarco-v2.1-doc-segmented-shard${shard}.arctic-embed-l.hnsw-int8 \
    -efSearch 1000 \
    -topics msmarco-v2-doc.dev2.snowflake-arctic-embed-l \
    -output runs/run.msmarco-v2.1-doc-segmented.arctic-l.msmarco-v2-doc.dev2.shard${shard}.txt \
    -threads 32 -hits 1000 \
    -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 250 \
    > logs/log.msmarco-v2.1-doc-segmented.arctic-l.msmarco-v2-doc.dev2.shard${shard}.txt 2>&1
done

# DL21
SHARDS=(00 01 02 03 04 05 06 07 08 09); for shard in "${SHARDS[@]}"
do
  bin/run.sh io.anserini.search.SearchHnswDenseVectors \
    -index msmarco-v2.1-doc-segmented-shard${shard}.arctic-embed-l.hnsw-int8 \
    -efSearch 1000 \
    -topics dl21.snowflake-arctic-embed-l \
    -output runs/run.msmarco-v2.1-doc-segmented.arctic-l.dl21.shard${shard}.txt \
    -threads 32 -hits 1000 \
    -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 250 \
    > logs/log.msmarco-v2.1-doc-segmented.arctic-l.dl21.shard${shard}.txt 2>&1
done

# DL22
SHARDS=(00 01 02 03 04 05 06 07 08 09); for shard in "${SHARDS[@]}"
do
  bin/run.sh io.anserini.search.SearchHnswDenseVectors \
    -index msmarco-v2.1-doc-segmented-shard${shard}.arctic-embed-l.hnsw-int8 \
    -efSearch 1000 \
    -topics dl22.snowflake-arctic-embed-l \
    -output runs/run.msmarco-v2.1-doc-segmented.arctic-l.dl22.shard${shard}.txt \
    -threads 32 -hits 1000 \
    -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 250 \
    > logs/log.msmarco-v2.1-doc-segmented.arctic-l.dl22.shard${shard}.txt 2>&1
done

# DL23
SHARDS=(00 01 02 03 04 05 06 07 08 09); for shard in "${SHARDS[@]}"
do
  bin/run.sh io.anserini.search.SearchHnswDenseVectors \
    -index msmarco-v2.1-doc-segmented-shard${shard}.arctic-embed-l.hnsw-int8 \
    -efSearch 1000 \
    -topics dl23.snowflake-arctic-embed-l \
    -output runs/run.msmarco-v2.1-doc-segmented.arctic-l.dl23.shard${shard}.txt \
    -threads 32 -hits 1000 \
    -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 250 \
    > logs/log.msmarco-v2.1-doc-segmented.arctic-l.dl23.shard${shard}.txt 2>&1
done

# RAG24 RAGGy
SHARDS=(00 01 02 03 04 05 06 07 08 09); for shard in "${SHARDS[@]}"
do
  bin/run.sh io.anserini.search.SearchHnswDenseVectors \
    -index msmarco-v2.1-doc-segmented-shard${shard}.arctic-embed-l.hnsw-int8 \
    -efSearch 1000 \
    -topics rag24.raggy-dev.snowflake-arctic-embed-l \
    -output runs/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.raggy-dev.shard${shard}.txt \
    -threads 32 -hits 1000 \
    -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 250 \
    > logs/log.msmarco-v2.1-doc-segmented.arctic-l.rag24.raggy-dev.shard${shard}.txt 2>&1
done
```

As above, for evaluation purposes, you can simply concatenate the 10 per-shard run files for each query set and evaluate:

```bash
cat runs/run.msmarco-v2.1-doc-segmented.arctic-l.msmarco-v2-doc.dev.shard0* > runs/run.msmarco-v2.1-doc-segmented.arctic-l.msmarco-v2-doc.dev.txt
cat runs/run.msmarco-v2.1-doc-segmented.arctic-l.msmarco-v2-doc.dev2.shard0* > runs/run.msmarco-v2.1-doc-segmented.arctic-l.msmarco-v2-doc.dev2.txt
cat runs/run.msmarco-v2.1-doc-segmented.arctic-l.dl21.shard0* > runs/run.msmarco-v2.1-doc-segmented.arctic-l.dl21.txt
cat runs/run.msmarco-v2.1-doc-segmented.arctic-l.dl22.shard0* > runs/run.msmarco-v2.1-doc-segmented.arctic-l.dl22.txt
cat runs/run.msmarco-v2.1-doc-segmented.arctic-l.dl23.shard0* > runs/run.msmarco-v2.1-doc-segmented.arctic-l.dl23.txt
cat runs/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.raggy-dev.shard0* > runs/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.raggy-dev.txt

bin/run.sh trec_eval -c -m ndcg_cut.10 msmarco-v2.1-doc.dev runs/run.msmarco-v2.1-doc-segmented.arctic-l.msmarco-v2-doc.dev.txt
bin/run.sh trec_eval -c -m ndcg_cut.10 msmarco-v2.1-doc.dev2 runs/run.msmarco-v2.1-doc-segmented.arctic-l.msmarco-v2-doc.dev2.txt
bin/run.sh trec_eval -c -m ndcg_cut.10 dl21-doc-msmarco-v2.1 runs/run.msmarco-v2.1-doc-segmented.arctic-l.dl21.txt
bin/run.sh trec_eval -c -m ndcg_cut.10 dl22-doc-msmarco-v2.1 runs/run.msmarco-v2.1-doc-segmented.arctic-l.dl22.txt
bin/run.sh trec_eval -c -m ndcg_cut.10 dl23-doc-msmarco-v2.1 runs/run.msmarco-v2.1-doc-segmented.arctic-l.dl23.txt
bin/run.sh trec_eval -c -m ndcg_cut.10 rag24.raggy-dev runs/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.raggy-dev.txt

bin/run.sh trec_eval -c -m recall.100 msmarco-v2.1-doc.dev runs/run.msmarco-v2.1-doc-segmented.arctic-l.msmarco-v2-doc.dev.txt
bin/run.sh trec_eval -c -m recall.100 msmarco-v2.1-doc.dev2 runs/run.msmarco-v2.1-doc-segmented.arctic-l.msmarco-v2-doc.dev2.txt
bin/run.sh trec_eval -c -m recall.100 dl21-doc-msmarco-v2.1 runs/run.msmarco-v2.1-doc-segmented.arctic-l.dl21.txt
bin/run.sh trec_eval -c -m recall.100 dl22-doc-msmarco-v2.1 runs/run.msmarco-v2.1-doc-segmented.arctic-l.dl22.txt
bin/run.sh trec_eval -c -m recall.100 dl23-doc-msmarco-v2.1 runs/run.msmarco-v2.1-doc-segmented.arctic-l.dl23.txt
bin/run.sh trec_eval -c -m recall.100 rag24.raggy-dev runs/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.raggy-dev.txt
```

You should arrive at exactly the effectiveness metrics [above](#dev-queries).
Note that these are _document-level_ relevance judgments.

The indexes for ArcticEmbed-L are big!
Here are their sizes, in GB:

```
56  lucene-hnsw-int8.msmarco-v2.1-doc-segmented-shard00.arctic-embed-l.20250114.4884f5.aab3f8e9aa0563bd0f875584784a0845
51  lucene-hnsw-int8.msmarco-v2.1-doc-segmented-shard01.arctic-embed-l.20250114.4884f5.34ea30fe72c2bc1795ae83e71b191547
64  lucene-hnsw-int8.msmarco-v2.1-doc-segmented-shard02.arctic-embed-l.20250114.4884f5.b6271d6db65119977491675f74f466d5
61  lucene-hnsw-int8.msmarco-v2.1-doc-segmented-shard03.arctic-embed-l.20250114.4884f5.a9cd644eb6037f67d2e9c06a8f60928d
58  lucene-hnsw-int8.msmarco-v2.1-doc-segmented-shard04.arctic-embed-l.20250114.4884f5.07b7e451e0525d01c1f1f2b1c42b1bd5
56  lucene-hnsw-int8.msmarco-v2.1-doc-segmented-shard05.arctic-embed-l.20250114.4884f5.2573dce175788981be2f266ebb33c96d
54  lucene-hnsw-int8.msmarco-v2.1-doc-segmented-shard06.arctic-embed-l.20250114.4884f5.a644aea445a8b78cc9e99d2ce111ff11
52  lucene-hnsw-int8.msmarco-v2.1-doc-segmented-shard07.arctic-embed-l.20250114.4884f5.402d37deccb44b5fc105049889e8aaea
58  lucene-hnsw-int8.msmarco-v2.1-doc-segmented-shard08.arctic-embed-l.20250114.4884f5.89ebcd027f7297b26a1edc8ae5726527
52  lucene-hnsw-int8.msmarco-v2.1-doc-segmented-shard09.arctic-embed-l.20250114.4884f5.5e580bb7eb9ee2bb6bfa492b3430c17d
558 total
```

The list above shows the complete index directory name after each index shard has been downloaded and unpacked into `~/.cache/pyserini/indexes/`.

One helpful tip is to share the indexes among multiple people using symlinks, instead of everyone keeping their own copy.
Something like:

```bash
cd ~/.cache/pyserini/indexes/
ln -s /path/to/lucene-hnsw-int8.msmarco-v2.1-doc-segmented-shard00.arctic-embed-l.20250114.4884f5.aab3f8e9aa0563bd0f875584784a0845 .
ln -s /path/to/lucene-hnsw-int8.msmarco-v2.1-doc-segmented-shard01.arctic-embed-l.20250114.4884f5.34ea30fe72c2bc1795ae83e71b191547 .
ln -s /path/to/lucene-hnsw-int8.msmarco-v2.1-doc-segmented-shard02.arctic-embed-l.20250114.4884f5.b6271d6db65119977491675f74f466d5 .
ln -s /path/to/lucene-hnsw-int8.msmarco-v2.1-doc-segmented-shard03.arctic-embed-l.20250114.4884f5.a9cd644eb6037f67d2e9c06a8f60928d .
ln -s /path/to/lucene-hnsw-int8.msmarco-v2.1-doc-segmented-shard04.arctic-embed-l.20250114.4884f5.07b7e451e0525d01c1f1f2b1c42b1bd5 .
ln -s /path/to/lucene-hnsw-int8.msmarco-v2.1-doc-segmented-shard05.arctic-embed-l.20250114.4884f5.2573dce175788981be2f266ebb33c96d .
ln -s /path/to/lucene-hnsw-int8.msmarco-v2.1-doc-segmented-shard06.arctic-embed-l.20250114.4884f5.a644aea445a8b78cc9e99d2ce111ff11 .
ln -s /path/to/lucene-hnsw-int8.msmarco-v2.1-doc-segmented-shard07.arctic-embed-l.20250114.4884f5.402d37deccb44b5fc105049889e8aaea .
ln -s /path/to/lucene-hnsw-int8.msmarco-v2.1-doc-segmented-shard08.arctic-embed-l.20250114.4884f5.89ebcd027f7297b26a1edc8ae5726527 .
ln -s /path/to/lucene-hnsw-int8.msmarco-v2.1-doc-segmented-shard09.arctic-embed-l.20250114.4884f5.5e580bb7eb9ee2bb6bfa492b3430c17d .
```
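
To confirm that the symlinks resolve and the shard sizes match the listing above, something like `du` with symlink dereferencing works:

```bash
cd ~/.cache/pyserini/indexes/
du -shL lucene-hnsw-int8.msmarco-v2.1-doc-segmented-shard0*
```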

On UWaterloo servers (e.g., `orca`), the base path for the shards is `/mnt/msmarco-v2_1/indexes/`.

## Reproduction Log[*](reproducibility.md)

+ Results reproduced by [@xxx](https://github.com/xxx) on 2025-xx-xx (commit [`xxxxxx`](https://github.com/castorini/anserini/commit/xxxxxx))