Reproduce doc2query document expansion experiments (#2642)
b8zhong authored Jan 18, 2025
1 parent b8acb5b commit 75e51e0
Showing 1 changed file with 35 additions and 25 deletions.
60 changes: 35 additions & 25 deletions docs/experiments-doc2query.md
@@ -17,9 +17,9 @@ Here's a summary of the datasets referenced in this guide:

File | Size | MD5 | Download
:----|-----:|:----|:-----
`msmarco-passage-pred-test_topk10.tar.gz` | 764 MB | `241608d4d12a0bc595bed2aff0f56ea3` | [[Dropbox](https://www.dropbox.com/s/57g2s9vhthoewty/msmarco-passage-pred-test_topk10.tar.gz?dl=1)] [[GitLab](https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/base/msmarco-passage-pred-test_topk10.tar.gz)]
`paragraphCorpus.v2.0.tar.xz` | 4.7 GB | `a404e9256d763ddcacc3da1e34de466a` | [[Dropbox](https://www.dropbox.com/s/1xq559k5i86gk17/paragraphCorpus.v2.0.tar.xz?dl=1)] [[GitLab](https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/base/paragraphCorpus.v2.0.tar.xz)]
`trec-car-pred-test_topk10.tar.gz` | 2.7 GB | `b9f98b55e6260c64e830b34d80a7afd7` | [[Dropbox](https://www.dropbox.com/s/rl4r0md0xgxg7d9/trec-car-pred-test_topk10.tar.gz?dl=1)] [[GitLab](https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/base/trec-car-pred-test_topk10.tar.gz)]
`msmarco-passage-pred-test_topk10.tar.gz` | 764 MB | `241608d4d12a0bc595bed2aff0f56ea3` | [[GitLab](https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/base/msmarco-passage-pred-test_topk10.tar.gz)]
`paragraphCorpus.v2.0.tar.xz` | 4.7 GB | `a404e9256d763ddcacc3da1e34de466a` | [[GitLab](https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/base/paragraphCorpus.v2.0.tar.xz)]
`trec-car-pred-test_topk10.tar.gz` | 2.7 GB | `b9f98b55e6260c64e830b34d80a7afd7` | [[GitLab](https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/base/trec-car-pred-test_topk10.tar.gz)]

The GitLab repo is [here](https://git.uwaterloo.ca/jimmylin/doc2query-data/) if you want direct access.
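
Since the table lists an MD5 checksum for each tarball, a download can be checked before unpacking. The following is a minimal sketch, not part of the original guide; the helper names `md5_of_file` and `verify_download` are hypothetical, and the path in the usage comment assumes the `-P collections/msmarco-passage` download location used in this guide.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that multi-gigabyte tarballs do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Compare a file's MD5 digest against the value from the table."""
    return md5_of_file(path) == expected_md5.lower()


# Example usage (assumes the tarball has already been downloaded):
# verify_download(
#     "collections/msmarco-passage/msmarco-passage-pred-test_topk10.tar.gz",
#     "241608d4d12a0bc595bed2aff0f56ea3",
# )
```

A `False` result most likely indicates a truncated or corrupted download, in which case fetching the file again is the simplest fix.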

@@ -31,8 +31,7 @@ Before going through this guide, it is recommended that you [reproduce our BM25
To start, grab the predicted queries:

```bash
# Grab tarball from either one of two sources:
wget https://www.dropbox.com/s/57g2s9vhthoewty/msmarco-passage-pred-test_topk10.tar.gz -P collections/msmarco-passage
# Grab tarball:
wget https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/base/msmarco-passage-pred-test_topk10.tar.gz -P collections/msmarco-passage

# Unpack tarball:
@@ -62,8 +61,10 @@ To verify (and to track progress), the above script will generate a total of 9 J
After the script completes, we can index the expanded documents:

```
sh target/appassembler/bin/IndexCollection -collection JsonCollection \
-generator DefaultLuceneDocumentGenerator -threads 9 \
bin/run.sh io.anserini.index.IndexCollection \
-collection JsonCollection \
-generator DefaultLuceneDocumentGenerator \
-threads 6 \
-input collections/msmarco-passage/collection_jsonl_expanded_topk10 \
-index indexes/msmarco-passage/lucene-index-msmarco-expanded-topk10 \
-storePositions -storeDocvectors -storeRaw
@@ -72,19 +73,27 @@ sh target/appassembler/bin/IndexCollection -collection JsonCollection \
And perform retrieval:

```
python tools/scripts/msmarco/retrieve.py --hits 1000 \
--index indexes/msmarco-passage/lucene-index-msmarco-expanded-topk10 \
--queries collections/msmarco-passage/queries.dev.small.tsv \
--output runs/run.msmarco-passage.dev.small.expanded-topk10.tsv
python -m pyserini.search.lucene \
--index indexes/msmarco-passage/lucene-index-msmarco-expanded-topk10 \
--topics collections/msmarco-passage/queries.dev.small.tsv \
--topics-format default \
--output runs/run.msmarco-passage.dev.small.expanded-topk10.tsv \
--output-format msmarco \
--bm25 --k1 0.82 --b 0.68 --hits 1000
```

Alternatively, we can use the Java implementation of the above script, which is faster (taking advantage of multi-threaded retrieval with the `-threads` option):

```
sh target/appassembler/bin/SearchMsmarco -hits 1000 -threads 8 \
-index indexes/msmarco-passage/lucene-index-msmarco-expanded-topk10 \
-queries collections/msmarco-passage/queries.dev.small.tsv \
-output runs/run.msmarco-passage.dev.small.expanded-topk10.tsv
bin/run.sh io.anserini.search.SearchCollection \
-index indexes/msmarco-passage/lucene-index-msmarco-expanded-topk10 \
-topics collections/msmarco-passage/queries.dev.small.tsv \
-topicReader TsvInt \
-output runs/run.msmarco-passage.dev.small.expanded-topk10.tsv \
-format msmarco \
-hits 1000 \
-threads 8 \
-bm25 -bm25.k1 0.82 -bm25.b 0.68
```
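
Both retrieval commands set the BM25 parameters `k1 = 0.82` and `b = 0.68`. To illustrate what these two parameters control, here is a minimal sketch of the textbook BM25 score for a single query term. It is an assumption-laden illustration, not the code that Anserini or Pyserini runs: the function name `bm25_term_score` is hypothetical, and Lucene's implementation may differ in details such as constant factors and document-length encoding.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq,
                    k1=0.82, b=0.68):
    """Textbook BM25 contribution of one query term to one document.

    tf          -- occurrences of the term in the document
    doc_len     -- document length in tokens
    avg_doc_len -- average document length in the collection
    num_docs    -- number of documents in the collection
    doc_freq    -- number of documents containing the term
    k1          -- term-frequency saturation (higher = slower saturation)
    b           -- strength of length normalization (0 = none, 1 = full)
    """
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)
```

Under this formula, appending predicted queries to a passage raises `tf` for terms that appear in those queries, and also raises `doc_len`, which `b` penalizes.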

Finally, to evaluate:
@@ -127,11 +136,9 @@ To start, download the TREC CAR dataset and the predicted queries:
```bash
mkdir collections/trec_car

# Grab tarballs from either one of two sources:
wget https://www.dropbox.com/s/1xq559k5i86gk17/paragraphCorpus.v2.0.tar.xz -P collections/trec_car
# Grab tarballs:
wget https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/base/paragraphCorpus.v2.0.tar.xz -P collections/trec_car

wget https://www.dropbox.com/s/rl4r0md0xgxg7d9/trec-car-pred-test_topk10.tar.gz -P collections/trec_car
wget https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/base/trec-car-pred-test_topk10.tar.gz -P collections/trec_car

# Unpack tarballs:
@@ -162,10 +169,12 @@ To verify (and to track progress), the above script will generate a total of 30
After the script completes, we can index the expanded documents:

```
sh target/appassembler/bin/IndexCollection -collection JsonCollection \
-generator DefaultLuceneDocumentGenerator -threads 30 \
-input collections/trec_car/collection_jsonl_expanded_topk10 \
-index indexes/trec_car/lucene-index.car17v2.0-expanded-topk10
bin/run.sh io.anserini.index.IndexCollection \
-collection JsonCollection \
-generator DefaultLuceneDocumentGenerator \
-threads 30 \
-input collections/trec_car/collection_jsonl_expanded_topk10 \
-index indexes/trec_car/lucene-index.car17v2.0-expanded-topk10
```

And perform retrieval on the test queries:
@@ -180,9 +189,9 @@ sh target/appassembler/bin/SearchCollection -topicReader Car \
Evaluation is performed with `trec_eval`:

```
target/appassembler/bin/trec_eval -c -m map -c -m recip_rank \
tools/topics-and-qrels/qrels.car17v2.0.benchmarkY1test.txt \
runs/run.car17v2.0.bm25.expanded-topk10.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -m map -m recip_rank \
tools/topics-and-qrels/qrels.car17v2.0.benchmarkY1test.txt \
runs/run.car17v2.0.bm25.expanded-topk10.txt
```
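
The two metrics requested here, `map` and `recip_rank`, can be sketched in a few lines. This is an illustrative sketch only, not a replacement for `trec_eval`: the function names are hypothetical, relevance is assumed to be binary, and averaging over every query in the qrels is meant to mirror the intent of the `-c` flag.

```python
def reciprocal_rank(ranked_docids, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, docid in enumerate(ranked_docids, start=1):
        if docid in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_docids, relevant):
    """Mean of precision@rank over the ranks of retrieved relevant documents,
    divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docid in enumerate(ranked_docids, start=1):
        if docid in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_over_queries(metric, run, qrels):
    """Average a per-query metric over every query in qrels.

    run   -- dict: query id -> ranked list of doc ids
    qrels -- dict: query id -> set of relevant doc ids
    Queries missing from the run contribute a score of 0.
    """
    return sum(metric(run.get(qid, []), rel) for qid, rel in qrels.items()) / len(qrels)


# Toy example: the relevant documents d1 and d2 are retrieved at ranks 2 and 4.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(reciprocal_rank(ranking, relevant))    # 0.5
print(average_precision(ranking, relevant))  # (1/2 + 2/4) / 2 = 0.5
```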

With the above commands, you should be able to reproduce the following results:
@@ -203,3 +212,4 @@ TREC CAR corpus v2.0 in this experiment instead of corpus v1.5 used in the paper
+ Results reproduced by [@HangCui0510](https://github.com/HangCui0510) on 2020-04-23 (commit [`0ae567d`](https://github.com/castorini/anserini/commit/0ae567df5c8a70ac211efd958c9ca1ff609ff782))
+ Results reproduced by [@kelvin-jiang](https://github.com/kelvin-jiang) on 2020-05-25 (commit [`b6e0367`](https://github.com/castorini/anserini/commit/b6e0367ef4e2b4fce9d81c8397ef1188e35971e7))
+ Results reproduced by [@lintool](https://github.com/lintool) on 2020-11-09 (commit [`94eae4`](https://github.com/castorini/anserini/commit/94eae4e06678446954446f2d47dae1666efe134f))
+ Results reproduced by [@b8zhong](https://github.com/b8zhong) on 2024-11-29 (commit [`778968f`](https://github.com/castorini/pyserini/commit/778968fd3a4ab7e2e756d9f7e58aca0314bfbf5d))
