Commit 4da7ded: Update documentation

actions-user committed Mar 5, 2024
1 parent 4e57b25 commit 4da7ded
Showing 17 changed files with 3,085 additions and 491 deletions.
474 changes: 247 additions & 227 deletions CHANGELOG.html

Large diffs are not rendered by default.

14 changes: 14 additions & 0 deletions _sources/CHANGELOG.md.txt
@@ -1,5 +1,19 @@
# Changelog

## 0.19.0
February 22, 2024

* `#332` - Add Context

## 0.18.0
February 15, 2024

* `#339` - Azure Blob Filesystem
* `#338` - Fix is_valid_handle to work for unsaved handles.
* `#337` - Allow arguments in a PartialTask call to override previously specified arguments
* `#335` - Explicit top level redun.* exports
* `#333` - Surface job_def_extra option

## 0.17.0
November 03, 2023

64 changes: 58 additions & 6 deletions _sources/config.md.txt
@@ -104,7 +104,7 @@ redun --setup profile=dev run workflow.py main --x 1 --y 2
An integer (default: 20) specifying the number of seconds between displays of the job status table.


#### `federated_configs`

A config file may specify other config files to import. This is particularly useful
for importing remote config files containing federated entrypoints and their executors.
@@ -118,16 +118,55 @@ or

```
[scheduler]
federated_configs =
<path to config_dir_1>
<path to config_dir_2>
```

Only the `executor` and `federated_tasks` sections from these config file(s) are imported; all other sections
are ignored. There is no namespacing for these imported executors and federated tasks, either
from each other or from the rest of the scheduler's configuration. Any duplicate names will result
in an error being raised.

#### `context`

Redun's context allows users to specify workflow arguments in the `redun.ini` config file and pass them to deeply nested sub-tasks, without having to thread the arguments explicitly through every intermediate task along the way.
This avoids tedious argument passing (what React developers call "prop-drilling") and avoids rerunning tasks that never read the configuration and would only pass it along.

```ini
[scheduler]
# Here, we define context variables for our tool, my_tool.
context =
{
"my_tool": {
"ratio": 1.618
}
}
```

Then, in the task definition, we can access the context variable using `get_context`:

```python
from redun import task
from redun.context import get_context


@task
def my_tool(value: int, ratio: float = get_context("my_tool.ratio")) -> str:
    # Perform the desired computation with the ratio supplied by the context.
    return f"Ran my_tool with argument: {ratio}"
```
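
To make the prop-drilling point concrete, here is a minimal sketch (the `pipeline` and `main` tasks are hypothetical): the intermediate task never mentions `ratio`; the value flows from the context straight to the leaf task.

```python
from redun import task
from redun.context import get_context

redun_namespace = "example"


@task
def my_tool(value: int, ratio: float = get_context("my_tool.ratio")) -> str:
    return f"Ran my_tool({value}) with ratio: {ratio}"


@task
def pipeline(value: int) -> str:
    # No `ratio` parameter here: the context injects it into my_tool.
    return my_tool(value)


@task
def main(value: int = 10) -> str:
    return pipeline(value)
```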

Context values can be overridden with CLI flags passed to `redun run`.
Please use `redun run -h` to see the full list of options.

#### `context_file`

In addition to specifying the context inline with the `context` option, users can instead provide a path to a JSON file containing the context.
This is useful for large or complex contexts that are easier to manage in a separate file.
The path can be either absolute or relative to the redun config directory (`.redun` by default).

```ini
[scheduler]
context_file = /Users/piotr/context.json
```
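
For reference, assuming the file holds the same JSON object as the inline `context` value, a context file matching the earlier example would contain:

```json
{
    "my_tool": {
        "ratio": 1.618
    }
}
```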

### Backend

Backend options, such as connecting to the redun database, can be configured in the `[backend]` section.
@@ -379,18 +418,18 @@ A float (default: 3.0) that specifies the maximum time, in seconds, jobs will wa

An optional integer (default: None) that specifies the duration in seconds (measured from the job attempt's `startedAt` timestamp) after which AWS Batch will terminate the job. For more on job timeouts, see the [Job Timeouts on Batch docs](https://docs.aws.amazon.com/batch/latest/userguide/job_timeouts.html). When not set, jobs run indefinitely (except on Fargate, where there is a 14-day limit).

##### `privileged`

An optional bool (default: False) that specifies whether to run the job in privileged mode.

##### `autocreate_job_def`

An optional bool (default: True). If `autocreate_job_def` is disabled, we require a `job_def_name`
to be present and look up the job definition by name. If `autocreate_job_def` is enabled, we will create
a new job definition if an existing one matching `job_def_name` and the required properties cannot be found.
For backwards compatibility, the deprecated `autocreate_job` option is also supported.

##### `job_def_name`

An optional str (default: None) that specifies a job definition to use. If not set, a new job definition will be created.

@@ -416,6 +455,19 @@ A bool (default: True) that specifies whether redun should add default tags to a

If not none, use a multi-node job and set the number of workers.

##### `job_def_extra`

A dictionary of additional arguments to pass when creating the AWS Batch job definition. The parameters are documented
[here](https://docs.aws.amazon.com/batch/latest/userguide/job_definition_parameters.html). Only the specified
keys are changed; any other options set elsewhere by redun (such as ulimits for multi-node jobs)
remain set.

For example, to allocate 100 MiB of swap space:

```ini
job_def_extra = {"containerProperties": {"linuxParameters": {"maxSwap": 100, "swappiness": 0}}}
```
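
For context, this option lives in the executor's section of the config file. A sketch (the executor name `batch` and the `image` and `queue` values are placeholders):

```ini
[executors.batch]
type = aws_batch
image = <ECR image URI>
queue = <AWS Batch queue name>
job_def_extra = {"containerProperties": {"linuxParameters": {"maxSwap": 100, "swappiness": 0}}}
```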

#### AWS Glue executor

The [AWS Glue executor](executors.md#aws-glue-spark-executor) (`type = aws_glue`) executes tasks on the AWS Glue compute service.
42 changes: 42 additions & 0 deletions _sources/filesystems.md.txt
@@ -0,0 +1,42 @@
# Filesystems

redun supports reading and writing to various filesystems, including local, S3, GCS, and Azure Blob Storage.

The filesystem to use is determined by the protocol in the file URL passed to the `redun.File` constructor.
In most cases, users don't need to interact with `Filesystem` objects directly.

## Azure Blob Storage filesystem

This filesystem is considered experimental - please read the documentation carefully before using it.

### Usage

If you use `redun.File` or `Dir`, initialize it with an `az` path:
`az://[email protected]/path/to/file`. Other path formats (such as an HTTPS blob URI) are not
supported.
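
For example, a minimal sketch (the container and account names are hypothetical):

```python
from redun import File, task


@task
def read_blob() -> str:
    # Read a blob through the Azure Blob Storage filesystem ("az" protocol).
    file = File("az://[email protected]/path/to/data.txt")
    return file.read()
```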

### Thread safety

By design, `fsspec` operations are not stateless, and there are potential thread-safety issues:
- Listings cache (dircache), which stores the results of ls-like commands: `glob` calls *likely* do not affect the cache,
unlike `listdir`, which we currently do not call, so redun calls like `Dir.files()` are safe.
- File access (see https://filesystem-spec.readthedocs.io/en/latest/features.html#file-buffering-and-random-access):
users should assume file IO operations are unsafe and avoid reads and writes to the same file from multiple redun tasks
running on the same executor (see the sketch below).
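
A sketch of the safe pattern implied by the last point (names hypothetical): give each task its own output path rather than sharing one file.

```python
from redun import File, task


@task
def write_part(i: int) -> File:
    # Each task writes a distinct blob, so no two tasks perform IO on the
    # same file from the same executor.
    out = File(f"az://[email protected]/parts/part-{i}.txt")
    out.write(f"contents of part {i}")
    return out
```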

### Credentials lookup

For most cases, Azure SDKs should work without any issues using the default credentials, and that is what
`adlfs` (the `fsspec` implementation for Azure storage) does when not configured otherwise.

Note that this **was not** tested on Azure Batch nodes, but `DefaultAzureCredential` is expected to work inside
any Azure compute with a properly configured managed identity.

However, inside AzureML compute that uses on-behalf-of authentication, `DefaultAzureCredential` will not work.
There is a special credential class that does work, but only in synchronous mode. `adlfs` accepts both sync
and async credential classes, but it assumes custom credential classes are async, so we apply a small hack in the code
to assign the credential class to the correct class property.

Currently, we use a simple try/except to determine whether we can retrieve an auth token from `DefaultAzureCredential`,
and we use the AML credential as a fallback option. We may reconsider this in the future if we find a better way of
determining when we run on Azure compute without a managed identity (e.g. via environment variables).
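
A minimal sketch of that fallback, not redun's actual code; the token scope and the `AzureMLOnBehalfOfCredential` import path are assumptions based on the `azure-identity` and `azure-ai-ml` packages:

```python
from azure.identity import DefaultAzureCredential


def pick_credential():
    """Try DefaultAzureCredential; fall back to the AzureML on-behalf-of credential."""
    try:
        credential = DefaultAzureCredential()
        # Probe for a token; this raises if no usable credential source is found.
        credential.get_token("https://storage.azure.com/.default")
        return credential
    except Exception:
        # Works inside AzureML compute that uses on-behalf-of authentication.
        from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

        return AzureMLOnBehalfOfCredential()
```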
8 changes: 8 additions & 0 deletions _sources/redun/redun.rst.txt
@@ -38,6 +38,14 @@ redun.config module
:undoc-members:
:show-inheritance:

redun.context module
--------------------

.. automodule:: redun.context
:members:
:undoc-members:
:show-inheritance:

redun.db\_utils module
----------------------

2 changes: 1 addition & 1 deletion _sources/scheduler.md.txt
@@ -178,7 +178,7 @@ with CSE, that the cached value is appropriate to use.
Task caching operates at the granularity of a single
call to a `Task` with concrete arguments. Recall that the result of a `Task` might be a value,
or another expression that needs further evaluation. In its normal mode, caching uses single
reductions, stepping through the evaluation. See the [Results caching](design.md#Result-caching)
section for more information on how this recursive checking works.

Consider the following example:
