diff --git a/docs/src/SUMMARY.md b/docs/src/SUMMARY.md index de9c055c..ae60d4d2 100644 --- a/docs/src/SUMMARY.md +++ b/docs/src/SUMMARY.md @@ -12,7 +12,7 @@ - [Language](./indexer/language.md) - [Indexing](./indexer/indexing.md) - [Tradeoffs](./tradeoffs.md) -- [Dynamic Indexing](./dynamic_indexing.md) +- [Incremental Indexing](./incremental_indexing.md) - [Search API]() - [Contributing](./contributing.md) - [Setting Up](./developers_setting_up.md) diff --git a/docs/src/developers_setting_up.md b/docs/src/developers_setting_up.md index 8ae1b17d..f0edad8c 100644 --- a/docs/src/developers_setting_up.md +++ b/docs/src/developers_setting_up.md @@ -21,7 +21,7 @@ The test collection I used for the majority of development is a `380mb` csv corp unused, | mapped to 'title' field, | mapped to 'body' field, | mapped to 'heading' field ``` -Once you have your test files, place them under the `/test_files/1/source` directory. If using a custom file format, you will likely need to configure the [data and field mappings](./indexer/fields.md) as well. You can run `cargo run -p morsels_indexer /test_files/1/source --init` to create the default configuration file as a template, and proceed from there. +Once you have your test files, place them under the `/test_files/1/source` directory. If using a custom file format, you will likely need to configure the [data and field mappings](./indexer/fields.md) as well. You can run `cargo run -p morsels_indexer /test_files/1/source --config-init` to create the default configuration file as a template, and proceed from there. --- diff --git a/docs/src/dynamic_indexing.md b/docs/src/dynamic_indexing.md deleted file mode 100644 index ebc9d7b3..00000000 --- a/docs/src/dynamic_indexing.md +++ /dev/null @@ -1,45 +0,0 @@ -# Dynamic Indexing - -*Dynamic*, or *incremental* indexing is also supported by the indexer cli tool. - -Detecting **deleted, changed, or added files** is done by storing an **internal file path --> last modified timestamp** map. 
- -To use it, simply pass the `--dynamic` or `-d` option when running the indexer. - -## How it Works - -As the core idea of Morsels is to split up the index into many tiny parts (and not more than necessary), the dynamic indexing feature works by "patching" only the files which were updated during the current run. This means that at search time, the same amount of index files are retrieved and searched through as before, to reduce the number of network requests. - -This is in contrast to a more traditional "segment" based approach you might find in search servers, whereby each dynamic indexing run generates an entirely separate "segment", and segments are merged together at runtime (during search). While this makes sense for traditional search tools, it may unfortunately generate too many network requests for index files and search overhead from merging files, something Morsels is trying to minimise. - -## Content Based Hashing - -The default change detection currently relies on the last modified time in file metadata. This may not always be guaranteed by the tools that generate the files Morsels indexes, or be an accurate reflection of whether a file's contents were updated. - -If file metadata is *unavailable* for any given file, the file would always be re-indexed as well. - -You may specify the `--dynamic-content-hash` option in such a case to opt into using a crc32 hash comparison for all files instead. This option should also be specified when running a full index and intending to run dynamic indexing somewhere down the line. - -It should only be marginally more expensive for the majority of cases, and may be the default option in the future. 
- -## Circumstances that Trigger a Full (Re)Index - -Note also, that the following circumstances will forcibly trigger a **full** reindex: -- If the output folder path does not contain any files indexed by morsels -- It contains files indexed by a different version of morsels -- The configuration file (`morsels_config.json`) was changed in any way -- Usage of the `--dynamic-content-hash` option changed - -## Caveats - -There are some additional caveats to note when using this option. Whenever possible, try to run a full reindex of the documents, utilising dynamic indexing only when indexing speed is of concern -- for example, updating the index repeatedly when developing this documentation (although, the mdbook plugin this documentation is built on currently dosen't do that). - -### Collection Statistics - -Collection statistics will tend to drift off when deleting documents (which also entails updating documents). This is because such documents may contain terms that were not encountered during the current run of dynamic indexing (from added / updated documents). Detecting such terms is difficult, as there is no guarantee the deleted documents are available anymore. The alternative would be to store such information in a non-inverted index, but that again takes up extra space =(. - -As such, the information for these terms may not be "patched". As a result, you *may* notice some slight drifting in the relative ranking of documents returned after some number of dynamic indexing runs. - -### File Bloat - -When deleting documents or updating documents, old field stores are not removed. This may lead to file bloat after many incremental indexing runs. 
diff --git a/docs/src/getting_started.md b/docs/src/getting_started.md index 0da3d681..014eeb79 100644 --- a/docs/src/getting_started.md +++ b/docs/src/getting_started.md @@ -28,7 +28,7 @@ If you are using the binaries, replace `morsels` with the appropriate executable ### Other Cli Options -- `--init` or `-i`: While optional, if it is your first time running the tool, you can run the above command with the `--init` or `-i` flag, then **run it again without this flag**. +- `--config-init`: While optional, if it is your first time running the tool, you can run the above command with this flag, then **run it again without this flag**. This flag outputs the default `morsels_config.json` that can be used to [configure the indexer](./indexer_configuration.md) later on, and does not perform any indexing. - `-c `: You may also change the config file location (relative to the `source-folder-path`) using the `-c ` option. - `--preserve-output-folder`: **All existing contents** in the output folder are also **removed** when running a full index. Specify this option to avoid this. diff --git a/docs/src/incremental_indexing.md b/docs/src/incremental_indexing.md new file mode 100644 index 00000000..c0111563 --- /dev/null +++ b/docs/src/incremental_indexing.md @@ -0,0 +1,45 @@ +# Incremental Indexing + +*Incremental* indexing is also supported by the indexer cli tool. + +Detecting **deleted, changed, or added files** is done by storing an **internal file path --> last modified timestamp** map. + +To use it, simply pass the `--incremental` or `-i` option when running the indexer. + +## How it Works + +As the core idea of Morsels is to split up the index into many tiny parts (and not more than necessary), the incremental indexing feature works by "patching" only the files which were updated during the current run. This means that at search time, the same number of index files are retrieved and searched through as before, to reduce the number of network requests. 
+ This is in contrast to a more traditional "segment" based approach you might find in search servers, whereby each incremental indexing run generates an entirely separate "segment", and segments are merged together at runtime (during search). While this makes sense for traditional search tools, it may unfortunately generate too many network requests for index files, and add search overhead from merging files, something Morsels is trying to minimise. + +## Content Based Hashing + +The default change detection currently relies on the last modified time in file metadata. This may not always be guaranteed by the tools that generate the files Morsels indexes, or be an accurate reflection of whether a file's contents were updated. + +If file metadata is *unavailable* for any given file, the file will always be re-indexed as well. + +You may specify the `--incremental-content-hash` option in such a case to opt into using a crc32 hash comparison for all files instead. This option should also be specified when running a full index if you intend to run incremental indexing later on. + +It should only be marginally more expensive in the majority of cases, and may become the default option in the future. + +## Circumstances that Trigger a Full (Re)Index + +Note also that the following circumstances will forcibly trigger a **full** reindex: +- The output folder path does not contain any files indexed by morsels +- It contains files indexed by a different version of morsels +- The configuration file (`morsels_config.json`) was changed in any way +- Usage of the `--incremental-content-hash` option changed + +## Caveats + +There are some additional caveats to note when using this option. 
Whenever possible, try to run a full reindex of the documents, utilising incremental indexing only when indexing speed is of concern -- for example, updating the index repeatedly when developing this documentation (although the mdbook plugin this documentation is built on currently doesn't do that). + +### Collection Statistics + +Collection statistics will tend to drift off when deleting documents (updating a document also entails deleting its old version). This is because such documents may contain terms that were not encountered during the current run of incremental indexing (from added / updated documents). Detecting such terms is difficult, as there is no guarantee the deleted documents are available anymore. The alternative would be to store such information in a non-inverted index, but that again takes up extra space =(. + +As such, the information for these terms may not be "patched". As a result, you *may* notice some slight drifting in the relative ranking of documents returned after some number of incremental indexing runs. + +### File Bloat + +When deleting or updating documents, old field stores are not removed. This may lead to file bloat after many incremental indexing runs. diff --git a/docs/src/indexer_configuration.md b/docs/src/indexer_configuration.md index a931ad57..b2c4f61a 100644 --- a/docs/src/indexer_configuration.md +++ b/docs/src/indexer_configuration.md @@ -2,7 +2,7 @@ All indexer configurations are sourced from a json file. By default, the cli tool looks for `morsels_config.json` in the source folder (first argument specified in the command). -You can run the cli command with the `--init` option to initialise the default configuration file in the source folder. +You can run the cli command with the `--config-init` option to initialise the default configuration file in the source folder. 
## Full Example diff --git a/e2e/e2e.test.js b/e2e/e2e.test.js index 0ea303df..d0b1a1d5 100644 --- a/e2e/e2e.test.js +++ b/e2e/e2e.test.js @@ -150,7 +150,7 @@ const testSuite = async (configFile) => { // ------------------------------------------------------ // ------------------------------------------------------ - // Test dynamic indexing addition + // Test incremental indexing addition // 1, to be deleted later await clearInput(); @@ -158,10 +158,10 @@ const testSuite = async (configFile) => { await waitNoResults(); fs.copyFileSync( - path.join(__dirname, 'dynamic_indexing/deletions/404.html'), + path.join(__dirname, 'incremental_indexing/deletions/404.html'), path.join(__dirname, 'input/404.html'), ); - runIndexer(`cargo run -p morsels_indexer -- ./e2e/input ./e2e/output --dynamic -c ${configFile}`); + runIndexer(`cargo run -p morsels_indexer -- ./e2e/input ./e2e/output --incremental -c ${configFile}`); await reloadPage(); await typePhrase('This URL is invalid'); @@ -174,10 +174,10 @@ const testSuite = async (configFile) => { const contributingHtmlOutputPath = path.join(__dirname, 'input/contributing.html'); fs.copyFileSync( - path.join(__dirname, 'dynamic_indexing/updates/contributing.html'), + path.join(__dirname, 'incremental_indexing/updates/contributing.html'), contributingHtmlOutputPath, ); - runIndexer(`cargo run -p morsels_indexer -- ./e2e/input ./e2e/output --dynamic -c ${configFile}`); + runIndexer(`cargo run -p morsels_indexer -- ./e2e/input ./e2e/output --incremental -c ${configFile}`); await reloadPage(); await typePhrase('Contributions of any form'); @@ -186,25 +186,25 @@ const testSuite = async (configFile) => { // ------------------------------------------------------ // ------------------------------------------------------ - // Test dynamic indexing deletion + // Test incremental indexing deletion fs.rmSync(path.join(__dirname, 'input/404.html')); - runIndexer(`cargo run -p morsels_indexer -- ./e2e/input ./e2e/output --dynamic -c 
${configFile}`); + runIndexer(`cargo run -p morsels_indexer -- ./e2e/input ./e2e/output --incremental -c ${configFile}`); await reloadPage(); await typePhrase('This URL is invalid'); await waitNoResults(); - // also assert dynamic indexing is actually run - let dynamicIndexInfo = JSON.parse( - fs.readFileSync(path.join(__dirname, 'output/_dynamic_index_info.json'), 'utf-8'), + // also assert incremental indexing is actually run + let incrementalIndexInfo = JSON.parse( + fs.readFileSync(path.join(__dirname, 'output/_incremental_info.json'), 'utf-8'), ); - expect(dynamicIndexInfo.num_deleted_docs).toBe(1); + expect(incrementalIndexInfo.num_deleted_docs).toBe(1); // ------------------------------------------------------ // ------------------------------------------------------ - // Test dynamic indexing update + // Test incremental indexing update await clearInput(); await typePhrase('Contributions of all forms'); @@ -215,7 +215,7 @@ const testSuite = async (configFile) => { 'Contributions of any form', 'Contributions of all forms atquejxusd', ); fs.writeFileSync(contributingHtmlOutputPath, contributingHtml); - runIndexer(`cargo run -p morsels_indexer -- ./e2e/input ./e2e/output --dynamic -c ${configFile}`); + runIndexer(`cargo run -p morsels_indexer -- ./e2e/input ./e2e/output --incremental -c ${configFile}`); await reloadPage(); await typePhrase('Contributions of any form'); @@ -229,15 +229,15 @@ const testSuite = async (configFile) => { await typeText('atquejxusd '); await assertSingle('contributions of all forms atquejxusd'); - // also assert dynamic indexing is actually run - dynamicIndexInfo = JSON.parse( - fs.readFileSync(path.join(__dirname, 'output/_dynamic_index_info.json'), 'utf-8'), + // also assert incremental indexing is actually run + incrementalIndexInfo = JSON.parse( + fs.readFileSync(path.join(__dirname, 'output/_incremental_info.json'), 'utf-8'), ); - expect(dynamicIndexInfo.num_deleted_docs).toBe(2); + 
expect(incrementalIndexInfo.num_deleted_docs).toBe(2); // then delete it again fs.rmSync(contributingHtmlOutputPath); - runIndexer(`cargo run -p morsels_indexer -- ./e2e/input ./e2e/output --dynamic -c ${configFile}`); + runIndexer(`cargo run -p morsels_indexer -- ./e2e/input ./e2e/output --incremental -c ${configFile}`); await reloadPage(); await typePhrase('Contributions of any form'); @@ -251,11 +251,11 @@ const testSuite = async (configFile) => { await typeText('atquejxusd'); await waitNoResults(); - // also assert dynamic indexing is actually run - dynamicIndexInfo = JSON.parse( - fs.readFileSync(path.join(__dirname, 'output/_dynamic_index_info.json'), 'utf-8'), + // also assert incremental indexing is actually run + incrementalIndexInfo = JSON.parse( + fs.readFileSync(path.join(__dirname, 'output/_incremental_info.json'), 'utf-8'), ); - expect(dynamicIndexInfo.num_deleted_docs).toBe(3); + expect(incrementalIndexInfo.num_deleted_docs).toBe(3); // ------------------------------------------------------ }; diff --git a/e2e/dynamic_indexing/deletions/404.html b/e2e/incremental_indexing/deletions/404.html similarity index 100% rename from e2e/dynamic_indexing/deletions/404.html rename to e2e/incremental_indexing/deletions/404.html diff --git a/e2e/dynamic_indexing/updates/contributing.html b/e2e/incremental_indexing/updates/contributing.html similarity index 100% rename from e2e/dynamic_indexing/updates/contributing.html rename to e2e/incremental_indexing/updates/contributing.html diff --git a/packages/mdbook-morsels/src/main.rs b/packages/mdbook-morsels/src/main.rs index 0bddf182..6862d57a 100644 --- a/packages/mdbook-morsels/src/main.rs +++ b/packages/mdbook-morsels/src/main.rs @@ -66,7 +66,7 @@ fn main() { .arg(morsels_config_path); if let Some(_livereload_url) = ctx.config.get("output.html.livereload-url") { - command.arg("--dynamic"); + command.arg("--incremental"); } let output = command.output().expect("mdbook-morsels: failed to execute indexer process"); @@ 
-101,7 +101,7 @@ fn setup_config_file(ctx: &PreprocessorContext, total_len: u64) -> std::path::Pa if !morsels_config_path.exists() || !morsels_config_path.is_file() { let mut init_config_command = Command::new("morsels"); - init_config_command.current_dir(ctx.root.clone()).args(&["./", "./morsels_output", "--init"]); + init_config_command.current_dir(ctx.root.clone()).args(&["./", "./morsels_output", "--config-init"]); init_config_command.arg("-c"); init_config_command.arg(&morsels_config_path); init_config_command diff --git a/packages/morsels_indexer/src/dynamic_index_info.rs b/packages/morsels_indexer/src/incremental_info.rs similarity index 81% rename from packages/morsels_indexer/src/dynamic_index_info.rs rename to packages/morsels_indexer/src/incremental_info.rs index 599a1988..de403c81 100644 --- a/packages/morsels_indexer/src/dynamic_index_info.rs +++ b/packages/morsels_indexer/src/incremental_info.rs @@ -19,7 +19,7 @@ lazy_static! { } // Not used for search -static DYNAMIC_INDEX_INFO_FILE_NAME: &str = "_dynamic_index_info.json"; +static INCREMENTAL_INFO_FILE_NAME: &str = "_incremental_info.json"; fn get_default_dictionary() -> Dictionary { Dictionary { term_infos: FxHashMap::default(), trigrams: FxHashMap::default() } @@ -33,12 +33,12 @@ struct DocIdsAndFileHash( ); #[derive(Serialize, Deserialize)] -pub struct DynamicIndexInfo { +pub struct IncrementalIndexInfo { pub ver: String, pub use_content_hash: bool, - // Mapping of external doc identifier -> internal doc id(s) / hashes, used for dynamic indexing + // Mapping of external doc identifier -> internal doc id(s) / hashes, used for incremental indexing mappings: FxHashMap, pub last_pl_number: u32, @@ -56,9 +56,9 @@ pub struct DynamicIndexInfo { pub dictionary: Dictionary, } -impl DynamicIndexInfo { - pub fn empty(use_content_hash: bool) -> DynamicIndexInfo { - DynamicIndexInfo { +impl IncrementalIndexInfo { + pub fn empty(use_content_hash: bool) -> IncrementalIndexInfo { + IncrementalIndexInfo { ver: 
MORSELS_VERSION.to_owned(), use_content_hash, mappings: FxHashMap::default(), @@ -74,23 +74,23 @@ impl DynamicIndexInfo { pub fn new_from_output_folder( output_folder_path: &Path, raw_config_normalised: &str, - is_dynamic: &mut bool, + is_incremental: &mut bool, use_content_hash: bool, - ) -> DynamicIndexInfo { - if !*is_dynamic { - return DynamicIndexInfo::empty(use_content_hash); + ) -> IncrementalIndexInfo { + if !*is_incremental { + return IncrementalIndexInfo::empty(use_content_hash); } - if let Ok(meta) = std::fs::metadata(output_folder_path.join(DYNAMIC_INDEX_INFO_FILE_NAME)) { + if let Ok(meta) = std::fs::metadata(output_folder_path.join(INCREMENTAL_INFO_FILE_NAME)) { if !meta.is_file() { - println!("Old dynamic index info missing. Running a full reindex."); - *is_dynamic = false; - return DynamicIndexInfo::empty(use_content_hash); + println!("Old incremental index info missing. Running a full reindex."); + *is_incremental = false; + return IncrementalIndexInfo::empty(use_content_hash); } } else { - println!("Old dynamic index info missing. Running a full reindex."); - *is_dynamic = false; - return DynamicIndexInfo::empty(use_content_hash); + println!("Old incremental index info missing. Running a full reindex."); + *is_incremental = false; + return IncrementalIndexInfo::empty(use_content_hash); } if let Ok(mut file) = File::open(output_folder_path.join("old_morsels_config.json")) { @@ -99,28 +99,28 @@ impl DynamicIndexInfo { let old_config_normalised = &String::from_iter(normalized(old_config.chars())); if raw_config_normalised != old_config_normalised { println!("Configuration file changed. Running a full reindex."); - *is_dynamic = false; - return DynamicIndexInfo::empty(use_content_hash); + *is_incremental = false; + return IncrementalIndexInfo::empty(use_content_hash); } } else { eprintln!("Old configuration file missing. 
Running a full reindex."); - *is_dynamic = false; - return DynamicIndexInfo::empty(use_content_hash); + *is_incremental = false; + return IncrementalIndexInfo::empty(use_content_hash); } - let info_file = File::open(output_folder_path.join(DYNAMIC_INDEX_INFO_FILE_NAME)).unwrap(); + let info_file = File::open(output_folder_path.join(INCREMENTAL_INFO_FILE_NAME)).unwrap(); - let mut info: DynamicIndexInfo = serde_json::from_reader(BufReader::new(info_file)) - .expect("dynamic index info deserialization failed!"); + let mut info: IncrementalIndexInfo = serde_json::from_reader(BufReader::new(info_file)) + .expect("incremental index info deserialization failed!"); if &info.ver[..] != MORSELS_VERSION { println!("Indexer version changed. Running a full reindex."); - *is_dynamic = false; - return DynamicIndexInfo::empty(use_content_hash); + *is_incremental = false; + return IncrementalIndexInfo::empty(use_content_hash); } else if info.use_content_hash != use_content_hash { println!("Content hash option changed. 
Running a full reindex."); - *is_dynamic = false; - return DynamicIndexInfo::empty(use_content_hash); + *is_incremental = false; + return IncrementalIndexInfo::empty(use_content_hash); } // Dictionary @@ -228,7 +228,7 @@ impl DynamicIndexInfo { let serialized = serde_json::to_string(self).unwrap(); - File::create(output_folder_path.join(DYNAMIC_INDEX_INFO_FILE_NAME)) + File::create(output_folder_path.join(INCREMENTAL_INFO_FILE_NAME)) .unwrap() .write_all(serialized.as_bytes()) .unwrap(); diff --git a/packages/morsels_indexer/src/lib.rs b/packages/morsels_indexer/src/lib.rs index 756d9e54..027da9d2 100644 --- a/packages/morsels_indexer/src/lib.rs +++ b/packages/morsels_indexer/src/lib.rs @@ -1,5 +1,5 @@ mod docinfo; -mod dynamic_index_info; +mod incremental_info; pub mod fieldinfo; pub mod loader; mod spimireader; @@ -25,7 +25,7 @@ use morsels_lang_latin::latin; use morsels_lang_chinese::chinese; use crate::docinfo::DocInfos; -use crate::dynamic_index_info::DynamicIndexInfo; +use crate::incremental_info::IncrementalIndexInfo; use crate::fieldinfo::FieldInfo; use crate::fieldinfo::FieldInfos; use crate::fieldinfo::FieldsConfig; @@ -161,7 +161,7 @@ pub struct MorselsConfig { pub raw_config: String, } -// Separate struct to support serializing for --init option but not output config +// Separate struct to support serializing for --config-init option but not output config #[derive(Serialize)] struct MorselsIndexingOutputConfig { loader_configs: FxHashMap>, @@ -208,11 +208,11 @@ pub struct Indexer { rx_worker: Receiver, num_workers_writing_blocks: Arc>, lang_config: MorselsLanguageConfig, - is_dynamic: bool, + is_incremental: bool, delete_unencountered_external_ids: bool, start_doc_id: u32, start_block_number: u32, - dynamic_index_info: DynamicIndexInfo, + incremental_info: IncrementalIndexInfo, } impl Indexer { @@ -220,7 +220,7 @@ impl Indexer { pub fn new( output_folder_path: &Path, config: MorselsConfig, - mut is_dynamic: bool, + mut is_incremental: bool, 
use_content_hash: bool, preserve_output_folder: bool, delete_unencountered_external_ids: bool, @@ -229,14 +229,14 @@ impl Indexer { let raw_config_normalised = &String::from_iter(normalized(config.raw_config.chars())); - let dynamic_index_info = DynamicIndexInfo::new_from_output_folder( + let incremental_info = IncrementalIndexInfo::new_from_output_folder( &output_folder_path, raw_config_normalised, - &mut is_dynamic, + &mut is_incremental, use_content_hash ); - if !is_dynamic && !preserve_output_folder { + if !is_incremental && !preserve_output_folder { if let Ok(read_dir) = fs::read_dir(output_folder_path) { for dir_entry in read_dir { if let Err(err) = dir_entry { @@ -316,7 +316,7 @@ impl Indexer { )) }; - let doc_infos = Arc::from(Mutex::from(if is_dynamic { + let doc_infos = Arc::from(Mutex::from(if is_incremental { let mut doc_infos_vec: Vec = Vec::new(); File::open(output_folder_path.join(DOC_INFO_FILE_NAME)) .unwrap() @@ -404,11 +404,11 @@ impl Indexer { rx_worker, num_workers_writing_blocks, lang_config: config.lang_config, - is_dynamic, + is_incremental, delete_unencountered_external_ids, start_doc_id: doc_id_counter, start_block_number: 0, - dynamic_index_info, + incremental_info, }; indexer.start_block_number = indexer.block_number(); @@ -460,8 +460,8 @@ impl Indexer { for loader in self.loaders.iter() { if let Some(loader_results) = loader.try_index_file(input_folder_path_clone, path, relative_path) { - let is_not_modified = self.dynamic_index_info.set_file(external_id, path); - if is_not_modified && self.is_dynamic { + let is_not_modified = self.incremental_info.set_file(external_id, path); + if is_not_modified && self.is_incremental { return; } @@ -474,7 +474,7 @@ impl Indexer { Self::try_index_doc(&mut self.doc_miner, &self.rx_worker, 30); // TODO 30 a little arbitrary? 
- self.dynamic_index_info.add_doc_to_file(external_id, self.doc_id_counter); + self.incremental_info.add_doc_to_file(external_id, self.doc_id_counter); self.doc_id_counter += 1; self.spimi_counter += 1; @@ -521,7 +521,7 @@ impl Indexer { .unwrap() .write_all( serde_json::to_string_pretty(&config) - .expect("Failed to serialize morsels config for --init!") + .expect("Failed to serialize morsels config for --config-init!") .as_bytes(), ) .unwrap(); @@ -564,7 +564,7 @@ impl Indexer { self.write_morsels_config(); - self.dynamic_index_info.write(&self.output_folder_path, self.doc_id_counter); + self.incremental_info.write(&self.output_folder_path, self.doc_id_counter); if !self.is_deletion_only_run() { spimireader::common::cleanup_blocks(first_block, last_block, &self.output_folder_path); @@ -583,12 +583,12 @@ impl Indexer { fn merge_blocks(&mut self, first_block: u32, last_block: u32) { let num_blocks = last_block - first_block + 1; - if self.is_dynamic { + if self.is_incremental { if self.delete_unencountered_external_ids { - self.dynamic_index_info.delete_unencountered_external_ids(); + self.incremental_info.delete_unencountered_external_ids(); } - spimireader::dynamic::modify_blocks( + spimireader::incremental::modify_blocks( self.is_deletion_only_run(), self.doc_id_counter, num_blocks, @@ -598,7 +598,7 @@ impl Indexer { std::mem::take(&mut self.doc_infos), &self.tx_main, &self.output_folder_path, - &mut self.dynamic_index_info, + &mut self.incremental_info, ); } else { spimireader::full::merge_blocks( @@ -610,7 +610,7 @@ impl Indexer { std::mem::take(&mut self.doc_infos), &self.tx_main, &self.output_folder_path, - &mut self.dynamic_index_info, + &mut self.incremental_info, ); } } @@ -624,7 +624,7 @@ impl Indexer { .into_iter() .map(|loader| (loader.get_name(), loader)) .collect(), - pl_names_to_cache: self.dynamic_index_info.pl_names_to_cache.clone(), + pl_names_to_cache: self.incremental_info.pl_names_to_cache.clone(), num_docs_per_block: 
self.indexing_config.num_docs_per_block, num_pls_per_dir: self.indexing_config.num_pls_per_dir, with_positions: self.indexing_config.with_positions, diff --git a/packages/morsels_indexer/src/main.rs b/packages/morsels_indexer/src/main.rs index cd93c8dd..35734ee3 100644 --- a/packages/morsels_indexer/src/main.rs +++ b/packages/morsels_indexer/src/main.rs @@ -23,19 +23,19 @@ struct CliArgs { preserve_output_folder: bool, #[structopt(short, long, parse(from_os_str))] config_file_path: Option, - #[structopt(short, long, help = "Initialise the configuration file in the source folder")] - init: bool, + #[structopt(long, help = "Initialises the configuration file in the source folder. Does not run any indexing.")] + config_init: bool, #[structopt( short, long, - help = "Prefer dynamic indexing if the resources in output folder are available and compatible" + help = "Prefer incremental indexing if the resources in output folder are available and compatible" )] - dynamic: bool, + incremental: bool, #[structopt( long, - help = "Prefer dynamic indexing using content hashes. This flag is required even when running a full (re)index, if intending to use dynamic indexing runs later" + help = "Prefer incremental indexing using content hashes. 
This flag is required even when running a full (re)index, if intending to use incremental indexing runs later" )] - dynamic_content_hash: bool, + incremental_content_hash: bool, #[structopt(long, hidden = true)] perf: bool, } @@ -93,7 +93,7 @@ fn main() { config_file_path.to_str().unwrap(), ); - if args.init { + if args.config_init { morsels_indexer::Indexer::write_morsels_source_config(MorselsConfig::default(), &config_file_path); return; } @@ -115,8 +115,8 @@ fn main() { let mut indexer = morsels_indexer::Indexer::new( &output_folder_path, config, - args.dynamic, - args.dynamic_content_hash, + args.incremental, + args.incremental_content_hash, args.preserve_output_folder, true, ); diff --git a/packages/morsels_indexer/src/spimireader.rs b/packages/morsels_indexer/src/spimireader.rs index 881ffe5c..db4aeecb 100644 --- a/packages/morsels_indexer/src/spimireader.rs +++ b/packages/morsels_indexer/src/spimireader.rs @@ -1,3 +1,3 @@ pub mod common; -pub mod dynamic; +pub mod incremental; pub mod full; diff --git a/packages/morsels_indexer/src/spimireader/full.rs b/packages/morsels_indexer/src/spimireader/full.rs index 5d3c5ec9..4cc83086 100644 --- a/packages/morsels_indexer/src/spimireader/full.rs +++ b/packages/morsels_indexer/src/spimireader/full.rs @@ -13,7 +13,7 @@ use crate::spimireader::common::{ self, postings_stream::PostingsStream, terms, PostingsStreamDecoder, TermDocsForMerge, }; use crate::utils::varint; -use crate::DynamicIndexInfo; +use crate::IncrementalIndexInfo; use crate::MainToWorkerMessage; use crate::MorselsIndexingConfig; use crate::Receiver; @@ -29,7 +29,7 @@ pub fn merge_blocks( doc_infos: Arc>, tx_main: &Sender, output_folder_path: &Path, - dynamic_index_info: &mut DynamicIndexInfo, + incremental_info: &mut IncrementalIndexInfo, ) { /* Gist of this function: @@ -127,7 +127,7 @@ pub fn merge_blocks( doc_freq, curr_term_max_score, num_docs_double, - &mut dynamic_index_info.pl_names_to_cache, + &mut incremental_info.pl_names_to_cache, 
indexing_config, output_folder_path, ); @@ -154,12 +154,12 @@ pub fn merge_blocks( // --------------------------------------------- } - pl_writer.flush(curr_pl_offset, indexing_config.pl_cache_threshold, &mut dynamic_index_info.pl_names_to_cache); + pl_writer.flush(curr_pl_offset, indexing_config.pl_cache_threshold, &mut incremental_info.pl_names_to_cache); dict_table_writer.flush().unwrap(); dict_string_writer.flush().unwrap(); - dynamic_index_info.last_pl_number = if curr_pl_offset != 0 || curr_pl == 0 { + incremental_info.last_pl_number = if curr_pl_offset != 0 || curr_pl == 0 { curr_pl } else { curr_pl - 1 diff --git a/packages/morsels_indexer/src/spimireader/dynamic.rs b/packages/morsels_indexer/src/spimireader/incremental.rs similarity index 94% rename from packages/morsels_indexer/src/spimireader/dynamic.rs rename to packages/morsels_indexer/src/spimireader/incremental.rs index e4c0ad7e..81ec624f 100644 --- a/packages/morsels_indexer/src/spimireader/dynamic.rs +++ b/packages/morsels_indexer/src/spimireader/incremental.rs @@ -25,7 +25,7 @@ use crate::spimireader::common::{ self, postings_stream::PostingsStream, terms, PostingsStreamDecoder, TermDocsForMerge, }; use crate::utils::varint; -use crate::DynamicIndexInfo; +use crate::IncrementalIndexInfo; use crate::MainToWorkerMessage; use crate::MorselsIndexingConfig; use crate::Receiver; @@ -148,7 +148,7 @@ impl ExistingPlWriter { } } -// The same as merge_blocks, but for dynamic indexing. +// The same as merge_blocks, but for incremental indexing. 
// // Goes through things term-at-a-time (all terms found in the current iteration) as well, // but is different in all other ways: @@ -169,15 +169,15 @@ pub fn modify_blocks( doc_infos: Arc>, tx_main: &Sender, output_folder_path: &Path, - dynamic_index_info: &mut DynamicIndexInfo, + incremental_info: &mut IncrementalIndexInfo, ) { let mut postings_streams: BinaryHeap = BinaryHeap::new(); let postings_stream_decoders: Arc> = Arc::from(DashMap::with_capacity(num_blocks as usize)); let (blocking_sndr, blocking_rcvr): (Sender<()>, Receiver<()>) = crossbeam::channel::bounded(1); - let old_num_docs = dynamic_index_info.num_docs as f64; - let new_num_docs = (doc_id_counter - dynamic_index_info.num_deleted_docs) as f64; + let old_num_docs = incremental_info.num_docs as f64; + let new_num_docs = (doc_id_counter - incremental_info.num_deleted_docs) as f64; // Unwrap the inner mutex to avoid locks as it is now read-only let doc_infos_unlocked_arc = { @@ -211,10 +211,10 @@ pub fn modify_blocks( // Dictionary table / Postings list trackers let mut new_pl_writer = common::get_pl_writer( output_folder_path, - dynamic_index_info.last_pl_number + 1, + incremental_info.last_pl_number + 1, indexing_config.num_pls_per_dir, ); - let mut new_pl = dynamic_index_info.last_pl_number + 1; + let mut new_pl = incremental_info.last_pl_number + 1; let mut new_pls_offset: u32 = 0; let mut existing_pl_writers: FxHashMap = FxHashMap::default(); @@ -233,7 +233,7 @@ pub fn modify_blocks( &blocking_rcvr, ); - let existing_term_info = dynamic_index_info.dictionary.get_term_info(&curr_term); + let existing_term_info = incremental_info.dictionary.get_term_info(&curr_term); if let Some(old_term_info) = existing_term_info { // Existing term @@ -268,7 +268,7 @@ pub fn modify_blocks( doc_freq, curr_term_max_score, &mut curr_combined_term_docs, - &dynamic_index_info.invalidation_vector, + &incremental_info.invalidation_vector, &mut varint_buf, ); @@ -285,7 +285,7 @@ pub fn modify_blocks( doc_freq, 
curr_term_max_score, new_num_docs, - &mut dynamic_index_info.pl_names_to_cache, + &mut incremental_info.pl_names_to_cache, indexing_config, output_folder_path, ); @@ -308,7 +308,7 @@ pub fn modify_blocks( pl_writer.commit(&mut pl_file_length_differences); } - new_pl_writer.flush(new_pls_offset, indexing_config.pl_cache_threshold, &mut dynamic_index_info.pl_names_to_cache); + new_pl_writer.flush(new_pls_offset, indexing_config.pl_cache_threshold, &mut incremental_info.pl_names_to_cache); // --------------------------------------------- // Dictionary @@ -325,7 +325,7 @@ pub fn modify_blocks( let mut prev_term = Rc::new(SmartString::from("")); let mut prev_dict_pl = 0; - let mut old_pairs_sorted: Vec<_> = std::mem::take(&mut dynamic_index_info.dictionary.term_infos).into_iter().collect(); + let mut old_pairs_sorted: Vec<_> = std::mem::take(&mut incremental_info.dictionary.term_infos).into_iter().collect(); // Sort by old postings list order old_pairs_sorted.sort_by(|a, b| match a.1.postings_file_name.cmp(&b.1.postings_file_name) { @@ -437,7 +437,7 @@ pub fn modify_blocks( dict_table_writer.flush().unwrap(); dict_string_writer.flush().unwrap(); - dynamic_index_info.last_pl_number = if new_pls_offset != 0 || new_pl == 0 { + incremental_info.last_pl_number = if new_pls_offset != 0 || new_pl == 0 { new_pl } else { new_pl - 1 diff --git a/packages/morsels_indexer/src/spimiwriter.rs b/packages/morsels_indexer/src/spimiwriter.rs index 42eec3c6..a6bc5b80 100644 --- a/packages/morsels_indexer/src/spimiwriter.rs +++ b/packages/morsels_indexer/src/spimiwriter.rs @@ -56,7 +56,7 @@ impl Indexer { drop(num_workers_writing_blocks); let output_folder_path = PathBuf::from(&self.output_folder_path); - let check_for_existing_field_store = self.is_dynamic && block_number == self.start_block_number; + let check_for_existing_field_store = self.is_incremental && block_number == self.start_block_number; if is_last_block { combine_worker_results_and_write_block( worker_index_results, @@ 
-152,7 +152,7 @@ pub fn combine_worker_results_and_write_block( #[cfg(debug_assertions)] println!("Num docs in block {}: {}", block_number, sorted_doc_infos.len()); } else { - // possibly just a dynamic indexing run with a deletion + // possibly just an incremental indexing run with a deletion #[cfg(debug_assertions)] println!("Encountered empty block {}", block_number); } diff --git a/packages/morsels_indexer/src/spimiwriter/fields.rs b/packages/morsels_indexer/src/spimiwriter/fields.rs index f4f0d55f..4fc46bf5 100644 --- a/packages/morsels_indexer/src/spimiwriter/fields.rs +++ b/packages/morsels_indexer/src/spimiwriter/fields.rs @@ -91,7 +91,7 @@ fn open_new_block_file( } let output_file_path = output_dir.join(format!("{}--{}.json", file_number, block_number)); if check_for_existing && output_file_path.exists() { - // The first block for dynamic indexing might have been left halfway through somewhere before + // The first block for incremental indexing might have been left halfway through somewhere before let mut field_store_file = OpenOptions::new() .read(true) .write(true) diff --git a/packages/morsels_search/src/docinfo.rs b/packages/morsels_search/src/docinfo.rs index e7aaaf38..a4a68796 100644 --- a/packages/morsels_search/src/docinfo.rs +++ b/packages/morsels_search/src/docinfo.rs @@ -26,7 +26,7 @@ impl DocInfo { let mut byte_offset = 0; - // num_docs =/= doc_length_factors.len() due to dynamic indexing + // num_docs =/= doc_length_factors.len() due to incremental indexing let num_docs = LittleEndian::read_u32(&doc_info_vec); byte_offset += 4;