Update documentation (#2687)

+ README, to better describe various MS MARCO V2.1 conditions.
+ fatjar regressions, to illustrate experiments on MS MARCO V2.1 doc and segmented doc.
+ other minor doc fixes (typos, etc.); removed defunct docs.
castorini · Jan 20, 2025 · e88c485
1 parent 4cdbbf6
Showing 10 changed files with 158 additions and 249 deletions.
diff --git a/README.md b/README.md
@@ -13,13 +13,11 @@ Among other goals, our effort aims to be [the opposite of this](http://phdcomics
 Anserini grew out of [a reproducibility study of various open-source retrieval engines in 2016](https://link.springer.com/chapter/10.1007/978-3-319-30671-1_30) (Lin et al., ECIR 2016). 
 See [Yang et al. (SIGIR 2017)](https://dl.acm.org/doi/10.1145/3077136.3080721) and [Yang et al. (JDIQ 2018)](https://dl.acm.org/doi/10.1145/3239571) for overviews.
 
-❗ Anserini was upgraded from JDK 11 to JDK 21 at commit [`272565`](https://github.com/castorini/anserini/commit/39cecf6c257bae85f4e9f6ab02e0be101338c3cc) (2024/04/03), which corresponds to the release of v0.35.0.
-
 
 ## 💥 Try It!
 
 Anserini is packaged in a self-contained fatjar, which also provides the simplest way to get started.
-Assuming you've already got Java installed, fetch the fatjar:
+Assuming you've already got Java 21 installed (Yes, you need _exactly_ this version), fetch the fatjar:
 
 ```bash
 wget https://repo1.maven.org/maven2/io/anserini/anserini/0.39.0/anserini-0.39.0-fatjar.jar
@@ -39,14 +37,19 @@ java -cp anserini-0.39.0-fatjar.jar io.anserini.search.SearchCollection \
 To evaluate:
 
 ```bash
-java -cp anserini-0.39.0-fatjar.jar trec_eval -c -M 10 -m recip_rank msmarco-passage.dev-subset run.msmarco-v1-passage-dev.splade-pp-ed-onnx.txt
+java -cp anserini-0.39.0-fatjar.jar trec_eval \
+  -c -M 10 -m recip_rank msmarco-passage.dev-subset \
+  run.msmarco-v1-passage-dev.splade-pp-ed-onnx.txt
 ```
 
 See [detailed instructions](docs/fatjar-regressions/fatjar-regressions-v0.39.0.md) for the current fatjar release of Anserini (v0.39.0) to reproduce regression experiments on the MS MARCO V2.1 corpora for TREC 2024 RAG, on MS MARCO V1 Passage, and on BEIR, all directly from the fatjar!
 
 Also, Anserini comes with a built-in webapp for interactive querying along with a REST API that can be used by other applications.
 Check out our documentation [here](docs/rest-api.md).
 
+❗ Beware, Anserini ships with many prebuilt indexes, which are automatically downloaded upon request (for example, `-index msmarco-v1-passage.splade-pp-ed` above triggers the download of a prebuilt index): these indexes can take up a lot of space.
+See [this guide on prebuilt indexes](docs/prebuilt-indexes.md) for more details.
+
 <!--
 We also have [forthcoming instructions](docs/fatjar-regressions/fatjar-regressions-v0.39.1-SNAPSHOT.md) for the next release (v0.39.1-SNAPSHOT) if you're interested.
 -->
@@ -296,11 +299,11 @@ Key:
 
 ### MS MARCO V2.1 Segmented Document Regressions
 
-The MS MARCO V2.1 corpora were derived from the V2 corpora for the TREC 2024 RAG Track.
+The MS MARCO V2.1 corpora (documents and segmented documents) were derived from the V2 documents corpus for the TREC 2024 RAG Track.
 Instructions for downloading the corpus can be found [here](https://trec-rag.github.io/annoucements/2024-corpus-finalization/).
-The experiments below use _passage-level_ qrels.
+The experiments below capture topics and _passage-level_ qrels for the V2.1 segmented documents corpus.
 
-|           |                            RAG 24                             |
+|           |                        RAG 24 UMBRELA                         |
 |-----------|:-------------------------------------------------------------:|
 | baselines | [+](docs/regressions/regressions-rag24-doc-segmented-test.md) |
 
@@ -312,10 +315,11 @@ The experiments below use _passage-level_ qrels.
 
 ### MS MARCO V2.1 Document Regressions
 
-The MS MARCO V2.1 corpora were derived from the V2 corpora for the TREC 2024 RAG Track.
+The MS MARCO V2.1 corpora (documents and segmented documents) were derived from the V2 documents corpus for the TREC 2024 RAG Track.
 Instructions for downloading the corpus can be found [here](https://trec-rag.github.io/annoucements/2024-corpus-finalization/).
-The experiments below capture topics and _document-level_ qrels originally targeted at the V2 corpora, but have been "projected" over to the V2.1 corpora.
+The experiments below capture topics and _document-level_ qrels originally targeted at the V2 documents corpus, but have been "projected" over to the V2.1 documents corpus.
 These should be treated like dev topics for the TREC 2024 RAG Track; actual qrels for that track were generated at the passage level.
+There are no plans to generate additional _document-level_ qrels beyond these.
 
 |                                         |                               dev                               |                                 DL21                                 |                                 DL22                                 |                                 DL23                                 |                             RAGgy dev                              |
 |-----------------------------------------|:---------------------------------------------------------------:|:--------------------------------------------------------------------:|:--------------------------------------------------------------------:|:--------------------------------------------------------------------:|:------------------------------------------------------------------:|
@@ -635,6 +639,7 @@ Beyond that, there are always [open issues](https://github.com/castorini/anserin
 
 ## 📜️ Historical Notes
 
++ Anserini was upgraded from JDK 11 to JDK 21 at commit [`39cecf6`](https://github.com/castorini/anserini/commit/39cecf6c257bae85f4e9f6ab02e0be101338c3cc) (2024/04/03), which corresponds to the release of v0.35.0.
 + Anserini was upgraded to Lucene 9.3 at commit [`272565`](https://github.com/castorini/anserini/commit/27256551e958f39495b04e89ef55de9d27f33414) (8/2/2022): this upgrade created backward compatibility issues, see [#1952](https://github.com/castorini/anserini/issues/1952).
 Anserini will automatically detect Lucene 8 indexes and disable consistent tie-breaking to avoid runtime errors.
 However, Lucene 9 code running on Lucene 8 indexes may give slightly different results than Lucene 8 code running on Lucene 8 indexes.

diff --git a/docs/fatjar-regressions/fatjar-regressions-v0.36.1.md b/docs/fatjar-regressions/fatjar-regressions-v0.36.1.md
@@ -46,7 +46,7 @@ Both indexes will be downloaded automatically.
 
 For the TREC 2024 RAG track, we have thus far only implemented BM25 baselines on the MS MARCO V2.1 document corpus (both the doc and doc segmented variants).
 Current results are based on existing qrels that have been "projected" over from MS MARCO V2.0 passage judgments.
-The table below reports effectiveness (dev in terms of RR@10, DL21-DL23, RAGgy in terms of nDCG@10):
+The table below reports effectiveness (dev in terms of RR@100, DL21-DL23, RAGgy in terms of nDCG@10):
 
 |                                                                            |    dev |   dev2 |   DL21 |   DL22 |   DL23 |  RAGgy |
 |:---------------------------------------------------------------------------|-------:|-------:|-------:|-------:|-------:|-------:|

diff --git a/docs/fatjar-regressions/fatjar-regressions-v0.37.0.md b/docs/fatjar-regressions/fatjar-regressions-v0.37.0.md
@@ -101,7 +101,7 @@ Replace `-index msmarco-v2.1-doc` with `-index msmarco-v2.1-doc-segemented` if y
 
 Since the TREC 2024 RAG evaluation hasn't happened yet, there are no qrels for evaluation.
 However, we _do_ have results based on existing qrels that have been "projected" over from MS MARCO V2.0 passage judgments.
-The table below reports effectiveness (dev in terms of RR@10, DL21-DL23, RAGgy in terms of nDCG@10):
+The table below reports effectiveness (dev in terms of RR@100, DL21-DL23, RAGgy in terms of nDCG@10):
 
 |                                                                            |    dev |   dev2 |   DL21 |   DL22 |   DL23 |  RAGgy |
 |:---------------------------------------------------------------------------|-------:|-------:|-------:|-------:|-------:|-------:|

diff --git a/docs/fatjar-regressions/fatjar-regressions-v0.38.0.md b/docs/fatjar-regressions/fatjar-regressions-v0.38.0.md
@@ -101,7 +101,7 @@ Replace `-index msmarco-v2.1-doc` with `-index msmarco-v2.1-doc-segemented` if y
 
 Since the TREC 2024 RAG evaluation hasn't happened yet, there are no qrels for evaluation.
 However, we _do_ have results based on existing qrels that have been "projected" over from MS MARCO V2.0 passage judgments.
-The table below reports effectiveness (dev in terms of RR@10, DL21-DL23, RAGgy in terms of nDCG@10):
+The table below reports effectiveness (dev in terms of RR@100, DL21-DL23, RAGgy in terms of nDCG@10):
 
 |                                                                            |    dev |   dev2 |   DL21 |   DL22 |   DL23 |  RAGgy |
 |:---------------------------------------------------------------------------|-------:|-------:|-------:|-------:|-------:|-------:|
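
Reviewer note on the RR@10 to RR@100 correction in the fatjar regression docs: the cutoff matters because a query whose first relevant document sits below rank 10 scores zero under RR@10 but still contributes under RR@100. The sketch below is a minimal illustration of that difference, not Anserini's or `trec_eval`'s implementation. The function names and the toy run and qrels are hypothetical. It averages over every query in the qrels and truncates each ranking at `k`, which is my understanding of what the `-c` and `-M` flags in the README's `trec_eval` command do.

```python
def reciprocal_rank_at_k(ranked_docids, relevant_docids, k):
    """Return 1/rank of the first relevant document in the top k, else 0.0."""
    for rank, docid in enumerate(ranked_docids[:k], start=1):
        if docid in relevant_docids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(run, qrels, k):
    """Average RR@k over all queries in qrels; a query missing from the run scores 0."""
    if not qrels:
        return 0.0
    total = sum(
        reciprocal_rank_at_k(run.get(qid, []), relevant, k)
        for qid, relevant in qrels.items()
    )
    return total / len(qrels)


# Hypothetical toy data: q1 has its first relevant hit at rank 2, q2 at rank 50.
run = {
    "q1": ["d3", "d7", "d1"],
    "q2": [f"x{i}" for i in range(1, 51)],
}
qrels = {"q1": {"d7"}, "q2": {"x50"}}

mrr_at_10 = mean_reciprocal_rank(run, qrels, k=10)
mrr_at_100 = mean_reciprocal_rank(run, qrels, k=100)
print(f"MRR@10  = {mrr_at_10:.4f}")
print(f"MRR@100 = {mrr_at_100:.4f}")
```

On this toy data the two cutoffs give different averages because q2 only counts at the deeper cutoff, so labelling a column with the wrong cutoff misstates what the numbers measure.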