minor doc updates
fschlatt committed Nov 15, 2024
1 parent 4a02a30 commit 0c86c8f
Showing 2 changed files with 9 additions and 6 deletions.
14 changes: 8 additions & 6 deletions docs/howto/dataset.rst
.. _howto-dataset:

.. _ir_datasets: https://ir-datasets.com/

====================
Use a Custom Dataset
====================

Lightning IR currently supports all datasets registered with the `ir_datasets`_ library. However, it is also possible to use custom datasets with Lightning IR. `ir_datasets`_ supports five different data types:

- Documents (a collection of documents)
- Queries (a collection of queries)
- Qrels (a collection of relevance judgements for query-document pairs)
- Training n-tuples (a collection of n-tuples consisting of a query and n-1 documents used for training)
- Run Files (a collection of queries and ranked documents)

Depending on your use case, you may need to integrate one or more of these data types. In the following, we will show you how to locally register datasets with `ir_datasets`_ for easy use in Lightning IR. First, however, we will demonstrate how to integrate custom run files, as these are often generated for datasets already supported by `ir_datasets`_.

Run Files
---------

Integrating your own run files is as simple as providing the run file to the :py:class:`~lightning_ir.data.dataset.RunDataset`. Two types of run files are supported.

1. The first is a standard TREC run file. When using this format, the file name must follow a specific naming convention: it must correspond to the `ir_datasets`_ dataset id that the run file is associated with, with slashes replaced by dashes. For example, for a run file for the TREC Deep Learning 2019 track, whose `ir_datasets`_ dataset id is ``msmarco-passage/trec-dl-2019/judged``, the run file should be named ``msmarco-passage-trec-dl-2019-judged.run``. Optionally, to discern between different run files, you can prefix the file name with meta information surrounded by two underscores, e.g., ``__my-cool-model__msmarco-passage-trec-dl-2019-judged.run``.
2. The second format is a ``.jsonl`` file that provides not only the ``query_id``, ``doc_id``, and ``score``, but also the actual query and document texts. This format is useful when you want to re-rank a run file but do not want to register the dataset with `ir_datasets`_. The file can optionally contain relevance judgements for easy evaluation.

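As an illustration, a ``.jsonl`` run file might contain lines like the following. The ``query_id``, ``doc_id``, and ``score`` fields follow the description above; the remaining field names and all values are hypothetical placeholders, not the exact schema expected by Lightning IR:

.. code-block:: json

   {"query_id": "q1", "query": "what is a custom dataset", "doc_id": "d7", "text": "A custom dataset is ...", "score": 14.7, "qrel": 1}
   {"query_id": "q1", "query": "what is a custom dataset", "doc_id": "d3", "text": "Datasets can be registered ...", "score": 12.3, "qrel": 0}
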
Registering a Local Dataset
---------------------------

To integrate a custom dataset, it must first be registered locally with `ir_datasets`_. Lightning IR provides a :py:class:`~lightning_ir.lightning_utils.callbacks.RegisterLocalDatasetCallback` class to make registering datasets easy. The callback takes a dataset id and optional paths to local files or already valid `ir_datasets`_ dataset ids.

Let's look at an example. Say we want to register a new set of training triples for the MS MARCO passage dataset, stored in a tab-separated file named ``msmarco-passage-train-triples.tsv``.

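The original example is truncated here, so as a minimal sketch, assume each line of the triples file holds a query id, a relevant document id, and a non-relevant document id separated by tabs (this three-column layout is an assumption, not the verified format). Such a file can be produced and parsed with Python's standard library:

```python
import csv
import io

# Hypothetical triples: (query_id, relevant doc_id, non-relevant doc_id).
# The three-column tab-separated layout is an assumption for illustration.
triples = [
    ("1030303", "8726436", "8726433"),
    ("1030304", "8726440", "8726441"),
]

# Write the triples to a tab-separated buffer; writing to
# msmarco-passage-train-triples.tsv works the same way.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t")
writer.writerows(triples)

# Read the triples back to verify the round trip.
buffer.seek(0)
parsed = [tuple(row) for row in csv.reader(buffer, delimiter="\t")]
print(parsed == triples)  # True
```
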
1 change: 1 addition & 0 deletions docs/model-zoo.rst
The following command and configuration can be used to reproduce the results:
.. code-block:: yaml

   trainer:
     logger: false
     enable_checkpointing: false
   model:
     class_path: CrossEncoderModule # for cross-encoders
     # class_path: BiEncoderModule # for bi-encoders
