
Commit

docs: add sample parameter (#87)
shreyashankar authored Oct 9, 2024
1 parent 2e6997d commit 29ace39
Showing 8 changed files with 9 additions and 4 deletions.
1 change: 1 addition & 0 deletions docs/operators/filter.md
@@ -91,6 +91,7 @@ This example demonstrates how the Filter operation distinguishes between high-im
| `num_retries_on_validate_failure` | Number of retry attempts on validation failure | 0 |
| `timeout` | Timeout for each LLM call in seconds | 120 |
| `max_retries_per_timeout` | Maximum number of retries per timeout | 2 |
| `sample` | Number of samples to use for the operation | None |

!!! info "Validation"

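For orientation, here is a minimal sketch of where the new `sample` key could sit in a filter operation's YAML config (assuming DocETL's pipeline format; every name below other than `sample` is an illustrative placeholder, not part of this diff):

```yaml
operations:
  - name: filter_high_impact        # hypothetical operation name
    type: filter
    sample: 50                      # assumption: run the filter on only 50 sampled items
    prompt: |
      Is the following issue high-impact? Explain briefly, then decide.
      {{ input.issue_text }}
    output:
      schema:
        is_high_impact: bool        # boolean key the filter uses to keep or drop items
```

With `sample` left unset (the default of None), the operation processes all data.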
1 change: 1 addition & 0 deletions docs/operators/gather.md
@@ -170,6 +170,7 @@ The Gather operation includes several key components:
- `content_key`: Indicates the field containing the chunk content
- `peripheral_chunks`: Specifies how to include context from surrounding chunks
- `doc_header_key` (optional): Denotes a field representing extracted headers for each chunk
- `sample` (optional): Number of samples to use for the operation

### Peripheral Chunks Configuration

4 changes: 2 additions & 2 deletions docs/operators/map.md
@@ -136,7 +136,7 @@ This example demonstrates how the Map operation can transform long, unstructured
| `model` | The language model to use | Falls back to `default_model` |
| `optimize` | Flag to enable operation optimization | `True` |
| `recursively_optimize` | Flag to enable recursive optimization of operators synthesized as part of rewrite rules | `false` |
| `sample_size` | Number of samples to use for the operation | Processes all data |
| `sample` | Number of samples to use for the operation | Processes all data |
| `tools` | List of tool definitions for LLM use | None |
| `validate` | List of Python expressions to validate the output | None |
| `num_retries_on_validate_failure` | Number of retry attempts on validation failure | 0 |
@@ -223,5 +223,5 @@ You can use a map operation to act as an LLM no-op, and just drop any key-value
1. **Clear Prompts**: Write clear, specific prompts that guide the LLM to produce the desired output.
2. **Robust Validation**: Use validation to ensure output quality and consistency.
3. **Appropriate Model Selection**: Choose the right model for your task, balancing performance and cost.
4. **Optimize for Scale**: For large datasets, consider using `sample_size` to test your operation before running on the full dataset.
4. **Optimize for Scale**: For large datasets, consider using `sample` to test your operation before running on the full dataset.
5. **Use Tools Wisely**: Leverage tools for complex calculations or operations that the LLM might struggle with. You can write any Python code in the tools, so you can even use tools to call other APIs or search the internet.
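To make practice 4 concrete, a minimal sketch of testing a map operation on a small sample before the full run (again assuming DocETL's YAML pipeline format; all names except `sample` are illustrative):

```yaml
operations:
  - name: extract_insights          # hypothetical operation name
    type: map
    sample: 10                      # assumption: trial run on 10 documents
    prompt: |
      Summarize the key insights in the following record:
      {{ input.text }}
    output:
      schema:
        insights: string
```

Once the sampled output looks right, removing `sample` (or raising it) reruns the operation over the full dataset.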
2 changes: 1 addition & 1 deletion docs/operators/parallel-map.md
@@ -34,7 +34,7 @@ Each prompt configuration in the `prompts` list should contain:
| `model` | The default language model to use | Falls back to `default_model` |
| `optimize` | Flag to enable operation optimization | True |
| `recursively_optimize` | Flag to enable recursive optimization | false |
| `sample_size` | Number of samples to use for the operation | Processes all data |
| `sample` | Number of samples to use for the operation | Processes all data |
| `timeout` | Timeout for each LLM call in seconds | 120 |
| `max_retries_per_timeout` | Maximum number of retries per timeout | 2 |

1 change: 1 addition & 0 deletions docs/operators/reduce.md
@@ -51,6 +51,7 @@ This Reduce operation processes customer feedback grouped by department:

| Parameter | Description | Default |
| ------------------------- | ------------------------------------------------------------------------------------------------------ | --------------------------- |
| `sample` | Number of samples to use for the operation | None |
| `synthesize_resolve` | If false, won't synthesize a resolve operation between map and reduce | true |
| `model` | The language model to use | Falls back to default_model |
| `input` | Specifies the schema or keys to subselect from each item | All keys from input items |
2 changes: 1 addition & 1 deletion docs/operators/resolve.md
@@ -126,7 +126,7 @@ After determining eligible pairs for comparison, the Resolve operation uses a Un
| `limit_comparisons` | Maximum number of comparisons to perform | None |
| `timeout` | Timeout for each LLM call in seconds | 120 |
| `max_retries_per_timeout` | Maximum number of retries per timeout | 2 |

| `sample` | Number of samples to use for the operation | None |
## Best Practices

1. **Anticipate Resolve Needs**: If you anticipate needing a Resolve operation and want to control the prompts, create it in your pipeline and let the optimizer find the appropriate blocking rules and thresholds.
1 change: 1 addition & 0 deletions docs/operators/split.md
@@ -50,6 +50,7 @@ Note that chunks will not overlap in content.
| --------------------- | ------------------------------------------------------------------------------- | ----------------------------- |
| `model` | The language model's tokenizer to use | Falls back to `default_model` |
| `num_splits_to_group` | Number of splits to group together into one chunk (only for "delimiter" method) | 1 |
| `sample` | Number of samples to use for the operation | None |

### Splitting Methods

1 change: 1 addition & 0 deletions docs/operators/unnest.md
@@ -38,6 +38,7 @@ The Unnest operation is valuable in scenarios where you need to:
| expand_fields | A list of fields to expand from the nested dictionary into the parent dictionary, if unnesting a dict | [] |
| recursive | If true, the unnest operation will be applied recursively to nested arrays | false |
| depth | The maximum depth for recursive unnesting (only applicable if recursive is true) | inf |
| sample | Number of samples to use for the operation | None |

## Output

