Rename some cli options
--init to --config-init
--dynamic and -d to --incremental and -i
--dynamic-content-hash to --incremental-content-hash
ang-zeyu committed Dec 30, 2021
1 parent 0fe896b commit 2fce338
Showing 19 changed files with 157 additions and 157 deletions.
2 changes: 1 addition & 1 deletion docs/src/SUMMARY.md
@@ -12,7 +12,7 @@
- [Language](./indexer/language.md)
- [Indexing](./indexer/indexing.md)
- [Tradeoffs](./tradeoffs.md)
- [Dynamic Indexing](./dynamic_indexing.md)
- [Incremental Indexing](./incremental_indexing.md)
- [Search API]()
- [Contributing](./contributing.md)
- [Setting Up](./developers_setting_up.md)
2 changes: 1 addition & 1 deletion docs/src/developers_setting_up.md
@@ -21,7 +21,7 @@ The test collection I used for the majority of development is a `380mb` csv corp
unused, | mapped to 'title' field, | mapped to 'body' field, | mapped to 'heading' field
```

Once you have your test files, place them under the `<project-root>/test_files/1/source` directory. If using a custom file format, you will likely need to configure the [data and field mappings](./indexer/fields.md) as well. You can run `cargo run -p morsels_indexer <project-root>/test_files/1/source --init` to create the default configuration file as a template, and proceed from there.
Once you have your test files, place them under the `<project-root>/test_files/1/source` directory. If using a custom file format, you will likely need to configure the [data and field mappings](./indexer/fields.md) as well. You can run `cargo run -p morsels_indexer <project-root>/test_files/1/source --config-init` to create the default configuration file as a template, and proceed from there.

---

45 changes: 0 additions & 45 deletions docs/src/dynamic_indexing.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/src/getting_started.md
@@ -28,7 +28,7 @@ If you are using the binaries, replace `morsels` with the appropriate executable

### Other Cli Options

- `--init` or `-i`: While optional, if it is your first time running the tool, you can run the above command with the `--init` or `-i` flag, then **run it again without this flag**.
- `--config-init`: While optional, if it is your first time running the tool, you can run the above command with this flag, then **run it again without this flag**.
This flag outputs the default `morsels_config.json` that can be used to [configure the indexer](./indexer_configuration.md) later on, and does not perform any indexing.
- `-c <config-file-path>`: You may also change the config file location (relative to the `source-folder-path`) using the `-c <config-file-path>` option.
- `--preserve-output-folder`: **All existing contents** in the output folder are also **removed** when running a full index. Specify this option to avoid this.
45 changes: 45 additions & 0 deletions docs/src/incremental_indexing.md
@@ -0,0 +1,45 @@
# Incremental Indexing

*Incremental* indexing is also supported by the indexer cli tool.

Detecting **deleted, changed, or added files** is done by storing an **internal file path --> last modified timestamp** map.

To use it, simply pass the `--incremental` or `-i` option when running the indexer.
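
The path-to-timestamp comparison described above can be sketched roughly as follows. This is only an illustration of the idea, not the indexer's actual code: the `FileChanges` struct, the `detect_changes` function, and the use of plain `u64` timestamps are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical result type: paths grouped by how they differ from the last run.
#[derive(Debug, Default, PartialEq)]
struct FileChanges {
    added: Vec<String>,
    changed: Vec<String>,
    deleted: Vec<String>,
}

/// Hypothetical helper. `previous` is the stored path -> last modified map
/// from the last run; `current` is the same map built from the files on disk now.
fn detect_changes(
    previous: &HashMap<String, u64>,
    current: &HashMap<String, u64>,
) -> FileChanges {
    let mut changes = FileChanges::default();

    for (path, modified) in current {
        match previous.get(path) {
            // Not seen in the last run: a new file.
            None => changes.added.push(path.clone()),
            // Seen before, but the timestamp differs: treat as changed.
            Some(old) if old != modified => changes.changed.push(path.clone()),
            // Same timestamp: nothing to do.
            Some(_) => {}
        }
    }

    // Anything in the stored map that is no longer on disk was deleted.
    for path in previous.keys() {
        if !current.contains_key(path) {
            changes.deleted.push(path.clone());
        }
    }

    // HashMap iteration order is unspecified, so sort for stable output.
    changes.added.sort();
    changes.changed.sort();
    changes.deleted.sort();
    changes
}
```

Only the paths in `added`, `changed`, and `deleted` would need any work during an incremental run; everything else is left untouched.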

## How it Works

As the core idea of Morsels is to split up the index into many tiny parts (and not more than necessary), the incremental indexing feature works by "patching" only the files which were updated during the current run. This means that at search time, the same amount of index files are retrieved and searched through as before, to reduce the number of network requests.

This is in contrast to a more traditional "segment" based approach you might find in search servers, whereby each incremental indexing run generates an entirely separate "segment", and segments are merged together at runtime (during search). While this makes sense for traditional search tools, it may unfortunately generate too many network requests for index files and search overhead from merging files, something Morsels is trying to minimise.

## Content Based Hashing

The default change detection currently relies on the last modified time in file metadata. This may not always be guaranteed by the tools that generate the files Morsels indexes, or be an accurate reflection of whether a file's contents were updated.

If file metadata is *unavailable* for any given file, the file would always be re-indexed as well.

You may specify the `--incremental-content-hash` option in such a case to opt into using a crc32 hash comparison for all files instead. This option should also be specified when running a full index and intending to run incremental indexing somewhere down the line.

It should only be marginally more expensive for the majority of cases, and may be the default option in the future.
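
As a rough illustration of what a crc32 comparison involves, the sketch below computes a standard CRC-32 (IEEE, reflected polynomial `0xEDB88320`) bit by bit and compares it with a stored value. The function names `crc32` and `content_changed` are assumptions for this example; whether the indexer uses a hand-written routine or a library is not stated here.

```rust
/// CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320), computed bit by bit.
/// A table-driven version would be faster; this form is just easier to read.
fn crc32(bytes: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in bytes {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All ones if the lowest bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Hypothetical helper: a file counts as changed if no hash was stored for it,
/// or if the stored hash differs from the hash of its current contents.
fn content_changed(stored_hash: Option<u32>, contents: &[u8]) -> bool {
    stored_hash != Some(crc32(contents))
}
```

Unlike the timestamp check, this requires reading every file's contents on each run, which is the extra cost the paragraph above refers to.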

## Circumstances that Trigger a Full (Re)Index

Note also, that the following circumstances will forcibly trigger a **full** reindex:
- If the output folder path does not contain any files indexed by morsels
- It contains files indexed by a different version of morsels
- The configuration file (`morsels_config.json`) was changed in any way
- Usage of the `--incremental-content-hash` option changed
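
The four conditions above amount to a single check before each run, which might look something like this. The `PreviousRun` struct, its fields, and the comparison of raw configuration text are all hypothetical; the list above is the only part taken from the documentation.

```rust
/// Hypothetical record of what the previous indexing run left in the output folder.
struct PreviousRun {
    indexer_version: String,
    config_text: String,
    used_content_hash: bool,
}

/// Hypothetical helper: returns true when an incremental run must fall back
/// to a full reindex. `previous` is `None` when the output folder holds no
/// files from an earlier run.
fn needs_full_reindex(
    previous: Option<&PreviousRun>,
    current_version: &str,
    current_config_text: &str,
    use_content_hash: bool,
) -> bool {
    match previous {
        // Nothing indexed before: there is nothing to patch.
        None => true,
        Some(prev) => {
            prev.indexer_version != current_version
                || prev.config_text != current_config_text
                || prev.used_content_hash != use_content_hash
        }
    }
}
```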

## Caveats

There are some additional caveats to note when using this option. Whenever possible, try to run a full reindex of the documents, utilising incremental indexing only when indexing speed is of concern -- for example, updating the index repeatedly when developing this documentation (although, the mdbook plugin this documentation is built on currently doesn't do that).

### Collection Statistics

Collection statistics will tend to drift off when deleting documents (which also entails updating documents). This is because such documents may contain terms that were not encountered during the current run of incremental indexing (from added / updated documents). Detecting such terms is difficult, as there is no guarantee the deleted documents are available anymore. The alternative would be to store such information in a non-inverted index, but that again takes up extra space =(.

As such, the information for these terms may not be "patched". As a result, you *may* notice some slight drifting in the relative ranking of documents returned after some number of incremental indexing runs.

### File Bloat

When deleting documents or updating documents, old field stores are not removed. This may lead to file bloat after many incremental indexing runs.
2 changes: 1 addition & 1 deletion docs/src/indexer_configuration.md
@@ -2,7 +2,7 @@

All indexer configurations are sourced from a json file. By default, the cli tool looks for `morsels_config.json` in the source folder (first argument specified in the command).

You can run the cli command with the `--init` option to initialise the default configuration file in the source folder.
You can run the cli command with the `--config-init` option to initialise the default configuration file in the source folder.


## Full Example
44 changes: 22 additions & 22 deletions e2e/e2e.test.js
@@ -150,18 +150,18 @@ const testSuite = async (configFile) => {
// ------------------------------------------------------

// ------------------------------------------------------
// Test dynamic indexing addition
// Test incremental indexing addition

// 1, to be deleted later
await clearInput();
await typePhrase('This URL is invalid');
await waitNoResults();

fs.copyFileSync(
path.join(__dirname, 'dynamic_indexing/deletions/404.html'),
path.join(__dirname, 'incremental_indexing/deletions/404.html'),
path.join(__dirname, 'input/404.html'),
);
runIndexer(`cargo run -p morsels_indexer -- ./e2e/input ./e2e/output --dynamic -c ${configFile}`);
runIndexer(`cargo run -p morsels_indexer -- ./e2e/input ./e2e/output --incremental -c ${configFile}`);

await reloadPage();
await typePhrase('This URL is invalid');
@@ -174,10 +174,10 @@ const testSuite = async (configFile) => {

const contributingHtmlOutputPath = path.join(__dirname, 'input/contributing.html');
fs.copyFileSync(
path.join(__dirname, 'dynamic_indexing/updates/contributing.html'),
path.join(__dirname, 'incremental_indexing/updates/contributing.html'),
contributingHtmlOutputPath,
);
runIndexer(`cargo run -p morsels_indexer -- ./e2e/input ./e2e/output --dynamic -c ${configFile}`);
runIndexer(`cargo run -p morsels_indexer -- ./e2e/input ./e2e/output --incremental -c ${configFile}`);

await reloadPage();
await typePhrase('Contributions of any form');
@@ -186,25 +186,25 @@ const testSuite = async (configFile) => {
// ------------------------------------------------------

// ------------------------------------------------------
// Test dynamic indexing deletion
// Test incremental indexing deletion

fs.rmSync(path.join(__dirname, 'input/404.html'));
runIndexer(`cargo run -p morsels_indexer -- ./e2e/input ./e2e/output --dynamic -c ${configFile}`);
runIndexer(`cargo run -p morsels_indexer -- ./e2e/input ./e2e/output --incremental -c ${configFile}`);

await reloadPage();
await typePhrase('This URL is invalid');
await waitNoResults();

// also assert dynamic indexing is actually run
let dynamicIndexInfo = JSON.parse(
fs.readFileSync(path.join(__dirname, 'output/_dynamic_index_info.json'), 'utf-8'),
// also assert incremental indexing is actually run
let incrementalIndexInfo = JSON.parse(
fs.readFileSync(path.join(__dirname, 'output/_incremental_info.json'), 'utf-8'),
);
expect(dynamicIndexInfo.num_deleted_docs).toBe(1);
expect(incrementalIndexInfo.num_deleted_docs).toBe(1);

// ------------------------------------------------------

// ------------------------------------------------------
// Test dynamic indexing update
// Test incremental indexing update

await clearInput();
await typePhrase('Contributions of all forms');
@@ -215,7 +215,7 @@ const testSuite = async (configFile) => {
'Contributions of any form', 'Contributions of all forms atquejxusd',
);
fs.writeFileSync(contributingHtmlOutputPath, contributingHtml);
runIndexer(`cargo run -p morsels_indexer -- ./e2e/input ./e2e/output --dynamic -c ${configFile}`);
runIndexer(`cargo run -p morsels_indexer -- ./e2e/input ./e2e/output --incremental -c ${configFile}`);

await reloadPage();
await typePhrase('Contributions of any form');
@@ -229,15 +229,15 @@ const testSuite = async (configFile) => {
await typeText('atquejxusd ');
await assertSingle('contributions of all forms atquejxusd');

// also assert dynamic indexing is actually run
dynamicIndexInfo = JSON.parse(
fs.readFileSync(path.join(__dirname, 'output/_dynamic_index_info.json'), 'utf-8'),
// also assert incremental indexing is actually run
incrementalIndexInfo = JSON.parse(
fs.readFileSync(path.join(__dirname, 'output/_incremental_info.json'), 'utf-8'),
);
expect(dynamicIndexInfo.num_deleted_docs).toBe(2);
expect(incrementalIndexInfo.num_deleted_docs).toBe(2);

// then delete it again
fs.rmSync(contributingHtmlOutputPath);
runIndexer(`cargo run -p morsels_indexer -- ./e2e/input ./e2e/output --dynamic -c ${configFile}`);
runIndexer(`cargo run -p morsels_indexer -- ./e2e/input ./e2e/output --incremental -c ${configFile}`);

await reloadPage();
await typePhrase('Contributions of any form');
@@ -251,11 +251,11 @@ const testSuite = async (configFile) => {
await typeText('atquejxusd');
await waitNoResults();

// also assert dynamic indexing is actually run
dynamicIndexInfo = JSON.parse(
fs.readFileSync(path.join(__dirname, 'output/_dynamic_index_info.json'), 'utf-8'),
// also assert incremental indexing is actually run
incrementalIndexInfo = JSON.parse(
fs.readFileSync(path.join(__dirname, 'output/_incremental_info.json'), 'utf-8'),
);
expect(dynamicIndexInfo.num_deleted_docs).toBe(3);
expect(incrementalIndexInfo.num_deleted_docs).toBe(3);

// ------------------------------------------------------
};
File renamed without changes.
4 changes: 2 additions & 2 deletions packages/mdbook-morsels/src/main.rs
@@ -66,7 +66,7 @@ fn main() {
.arg(morsels_config_path);

if let Some(_livereload_url) = ctx.config.get("output.html.livereload-url") {
command.arg("--dynamic");
command.arg("--incremental");
}

let output = command.output().expect("mdbook-morsels: failed to execute indexer process");
@@ -101,7 +101,7 @@ fn setup_config_file(ctx: &PreprocessorContext, total_len: u64) -> std::path::Pa

if !morsels_config_path.exists() || !morsels_config_path.is_file() {
let mut init_config_command = Command::new("morsels");
init_config_command.current_dir(ctx.root.clone()).args(&["./", "./morsels_output", "--init"]);
init_config_command.current_dir(ctx.root.clone()).args(&["./", "./morsels_output", "--config-init"]);
init_config_command.arg("-c");
init_config_command.arg(&morsels_config_path);
init_config_command