Commit: Merge remote-tracking branch 'origin/main' into lvreynoso/diamond_sensitive

Showing 61 changed files with 20,264 additions and 445 deletions.
A new GitHub Actions workflow (43 added lines):

```yaml
name: Index Generation NCBI Compress cargo tests

on:
  push:
    paths:
      - 'workflows/index-generation/ncbi-compress/**'

env:
  LC_ALL: C.UTF-8
  LANG: C.UTF-8
  DEBIAN_FRONTEND: noninteractive

jobs:
  cargo-test:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v2
      - name: docker login ghcr.io
        uses: docker/login-action@v1
        with:
          registry: ghcr.io
          username: ${{ github.repository_owner }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: docker build + push to ghcr.io
        run: |
          TAG=$(git describe --long --tags --always --dirty)
          IMAGE_NAME=czid-index-generation-public
          IMAGE_URI="ghcr.io/${GITHUB_REPOSITORY}/${IMAGE_NAME}"
          CACHE_FROM=""; docker pull "$IMAGE_URI" && CACHE_FROM="--cache-from $IMAGE_URI"
          ./scripts/docker-build.sh "workflows/index-generation" --tag "${IMAGE_URI}:${TAG}" $CACHE_FROM \
            || ./scripts/docker-build.sh "workflows/index-generation" --tag "${IMAGE_URI}:${TAG}"
          docker push "${IMAGE_URI}:${TAG}"
          if [[ ${GITHUB_REF##*/} == "main" ]]; then
            docker tag "${IMAGE_URI}:${TAG}" "${IMAGE_URI}:latest"
            docker push "${IMAGE_URI}:latest"
          fi
          echo "IMAGE_URI=${IMAGE_URI}" >> $GITHUB_ENV
          echo "TAG=${TAG}" >> $GITHUB_ENV
      - name: run cargo tests
        run: |
          make cargo-test-index-generation
```
# Index Generation

### Directory overview:

#### **index-generation.wdl**:
workflow code to create the following assets:
* NT and NR indexes from NCBI with redundant sequences removed - we use these to build additional indexes (notably the minimap2 and DIAMOND indexes). Subsequently, we compile a "short DB" comprising solely the accessions that were identified during non-host alignment. This database is then used to blast (blastx for NR, blastn for NT) assembled contigs during the mNGS postprocessing phase.
  * blastx translates our nucleotide contigs into protein sequences in all six reading frames and searches them against the protein database (NR in this case). We also get alignment statistics, which we display in the mNGS sample report.
  * blastn finds regions of similarity between nucleotide sequences (in this case our contigs and NT). We also get alignment statistics, which we display in the mNGS sample report.

* nt_loc.marisa and nr_loc.marisa:
  * index to quickly access an accession and its sequence from either NT or NR.
  * This is used in the consensus genome and phylotree workflows to pull reference accessions (when a CG run is kicked off from the mNGS report) and to generate the JSON file for read alignment visualizations.
  * In the postprocess step of mNGS (where we generate taxon_counts), this file is used to download accessions based on the hit summaries from DIAMOND and minimap2 to create the "short DB" mentioned above.
* nt_info.marisa and nr_info.marisa:
  * index to quickly go from accession ID to accession name, sequence length, and offset information for either NT or NR
  * mainly used to generate JSON files for the coverage viz consumed by the web app.
* minimap2 indexes:
  * chunked minimap2 index - used for non-host alignment to generate hits to NT
* DIAMOND indexes:
  * chunked DIAMOND index - used for non-host alignment to generate hits to NR
* accession2taxid.marisa:
  * index to quickly go from accession ID to taxon ID
  * used to determine the taxon assignment for each read from the hits generated by minimap2 and DIAMOND.
* taxid-lineages.marisa:
  * index used to go from taxon ID to lineage taxon IDs (taxids for species, genus, and family)
  * used for determining the optimal taxon assignment for each read from the alignment results (calling hits); for example, if a read aligns to multiple distinct references, we need to assess at which level in the taxonomic hierarchy the multiple alignments reach consensus.
  * We also use this file for generating taxon counts in the postprocess step of mNGS.
* deuterostome_taxids.txt - used to filter out eukaryotic sequences, which helps narrow taxon_counts down to microbial DNA (bacteria, viruses, fungi, and parasites).
* taxon_ignore_list.txt - taxa that we would like to ignore (synthetic constructs, plasmids, vectors, etc.) in taxon_counts from non-host alignment
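The consensus step described above (calling hits when a read aligns to multiple references) can be sketched as follows. This is a simplified illustration, not the workflow's actual implementation; the function name and tuple layout are hypothetical:

```python
def call_hit(lineages):
    """Pick the most specific taxonomic level at which all of a read's
    alignments agree, given (species, genus, family) taxid tuples.

    Returns the agreed-upon taxid, or None if the hits do not even
    share a family (no consensus)."""
    # Check species, then genus, then family (most to least specific).
    for level in range(3):
        taxids = {lin[level] for lin in lineages}
        if len(taxids) == 1:
            return taxids.pop()
    return None

# Hypothetical lineage tuples: two hits to the same species agree at
# species level, so the read is assigned that species taxid.
assert call_hit([(11, 1, 100), (11, 1, 100)]) == 11
# Hits to two species of the same genus only agree at genus level.
assert call_hit([(11, 1, 100), (12, 1, 100)]) == 1
# Hits in different families reach no consensus.
assert call_hit([(11, 1, 100), (22, 2, 200)]) is None
```

In the real workflow the lineage tuples come from taxid-lineages.marisa rather than in-memory literals.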
#### **ncbi-compress**:
compression code written in Rust to remove redundant sequences from NT and NR

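To illustrate the idea of removing redundant sequences, here is a minimal sketch that drops only exact byte-for-byte duplicates by hashing each sequence. This is a hypothetical simplification; the actual Rust compressor also handles near-duplicate and contained sequences:

```python
import hashlib

def dedup_exact(records):
    """Yield (accession, sequence) pairs whose sequence has not been
    seen before, dropping exact duplicates.

    Hashing whole sequences keeps memory bounded at one digest per
    unique sequence rather than storing the sequences themselves."""
    seen = set()
    for acc, seq in records:
        digest = hashlib.sha256(seq.encode()).digest()
        if digest not in seen:
            seen.add(digest)
            yield acc, seq

recs = [("A1", "ACGT"), ("A2", "ACGT"), ("A3", "ACGTT")]
# A2 duplicates A1's sequence and is dropped.
assert [acc for acc, _ in dedup_exact(recs)] == ["A1", "A3"]
```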
#### **helpful_for_analysis**
jupyter notebooks used for helpful common analysis steps, including:
* querying NCBI for an accession and lineage (used to investigate reads in the "all taxa with neither family nor genus classification" report for mNGS)
* querying marisa trie files - notebook to easily query all marisa trie files generated by the index generation workflow above.
* comparing non-host alignment times between two projects - this was used to benchmark how long non-host alignment took on old and new indexes.
* generating a taxon lineage changelog for a sample report - get a readout of which reads in a sample report have a taxon/lineage change between new and old index runs. Used mainly for comp bio validation purposes.
* checking sequence retention for different index compression runs - this notebook was handy for running multiple compression runs and summarizing which reads were dropped; helpful for early analysis and benchmarking of the compression workflow.

#### **ncbitax2lin.py**
used to generate taxid-lineages.marisa

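Flattening the NCBI taxonomy into per-taxid (species, genus, family) tuples amounts to walking parent pointers from nodes.dmp-style tables. A minimal sketch of that walk, with a hypothetical function name and toy inputs:

```python
def lineage(taxid, parent, rank):
    """Walk parent pointers up from taxid, recording the species,
    genus, and family taxids encountered along the way.

    parent: {taxid: parent taxid} (the NCBI root points to itself);
    rank: {taxid: rank name}. Returns (species, genus, family),
    with None for any rank not present on the path."""
    out = {"species": None, "genus": None, "family": None}
    while True:
        r = rank.get(taxid)
        if r in out and out[r] is None:
            out[r] = taxid
        nxt = parent.get(taxid)
        if nxt is None or nxt == taxid:  # reached the root
            break
        taxid = nxt
    return out["species"], out["genus"], out["family"]

# Tiny hypothetical taxonomy: 4 (species) -> 3 (genus) -> 2 (family) -> 1 (root)
parent = {4: 3, 3: 2, 2: 1, 1: 1}
rank = {4: "species", 3: "genus", 2: "family", 1: "no rank"}
assert lineage(4, parent, rank) == (4, 3, 2)
assert lineage(3, parent, rank) == (None, 3, 2)
```

The real script emits these tuples for every taxid and packs them into the marisa trie.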
#### **generate_accession2taxid.py**
used to generate accession2taxid.marisa

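NCBI distributes accession-to-taxid mappings as TSV files with a header row and four columns (accession, accession.version, taxid, gi). A sketch of parsing them into a plain dict; the real script writes the mapping into a marisa trie instead, and the sample row below is illustrative:

```python
def parse_accession2taxid(lines):
    """Parse NCBI accession2taxid TSV lines into {accession.version: taxid}."""
    mapping = {}
    it = iter(lines)
    next(it)  # skip the header row
    for line in it:
        _, acc_ver, taxid, _ = line.rstrip("\n").split("\t")
        mapping[acc_ver] = int(taxid)
    return mapping

sample = [
    "accession\taccession.version\ttaxid\tgi\n",
    "A00001\tA00001.1\t10641\t58418\n",
]
assert parse_accession2taxid(sample) == {"A00001.1": 10641}
```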
#### **generate_ncbi_db_index.py**
used to generate nt_loc.marisa and nr_loc.marisa

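The loc indexes map each accession to the byte offset and length of its sequence in the FASTA, so a sequence can be fetched with a seek instead of a scan. A simplified sketch using a dict in place of the marisa trie (function names are hypothetical):

```python
import io

def build_loc_index(fasta):
    """Scan a binary FASTA file object and record, per accession,
    the (byte offset, byte length) of its sequence block."""
    index, acc, start, pos = {}, None, 0, 0
    for line in fasta:
        if line.startswith(b">"):
            if acc is not None:
                index[acc] = (start, pos - start)
            acc = line[1:].split()[0].decode()
            start = pos + len(line)
        pos += len(line)
    if acc is not None:
        index[acc] = (start, pos - start)
    return index

def fetch(fasta, index, acc):
    """Seek straight to an accession's sequence without scanning."""
    offset, length = index[acc]
    fasta.seek(offset)
    return fasta.read(length).replace(b"\n", b"").decode()

fa = io.BytesIO(b">ACC1 desc\nACGT\nAC\n>ACC2\nGGGG\n")
idx = build_loc_index(fa)
assert fetch(fa, idx, "ACC1") == "ACGTAC"
assert fetch(fa, idx, "ACC2") == "GGGG"
```

This random-access pattern is what lets the postprocess step pull only the accessions named in the hit summaries when building the short DB.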
#### **generate_lineage_csvs.py**
used to generate the versioned-taxid-lineages.csv file that populates the taxon_lineage database table; also generates changelogs for deleted taxa, changed taxa, and new taxa.

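The three changelogs amount to a set-difference over two lineage snapshots. A minimal sketch with hypothetical names and toy data:

```python
def lineage_changelog(old, new):
    """Bucket taxa into deleted, new, and changed between two
    {taxid: lineage} snapshots (lineage may be any comparable value)."""
    deleted = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    changed = sorted(t for t in set(old) & set(new) if old[t] != new[t])
    return {"deleted": deleted, "new": added, "changed": changed}

old = {1: ("a",), 2: ("b",), 3: ("c",)}
new = {2: ("b",), 3: ("c2",), 4: ("d",)}
assert lineage_changelog(old, new) == {
    "deleted": [1], "new": [4], "changed": [3]}
```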
### Updating The Webapp to use the new Index Generation Assets:
* Follow the wiki page [here](https://github.com/chanzuckerberg/czid-web-private/wiki/%5BDEV%5D-How-to-update-the-webapp-to-use-new-index-assets-and-taxon_lineage-table-update) to update the webapp to use the new assets and switch all projects over to pinning the new index version.

### Debugging Notes for EC2:
* usually you need to launch an EC2 instance to test this workflow out at scale: `aegea launch --instance-type i3.16xlarge --no-provision-user --storage /=128GB --iam-role idseq-on-call <instance-name>`
* install rust: `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`
* install python:
  ```
  curl -O https://repo.anaconda.com/archive/Anaconda3-2023.09-0-Linux-x86_64.sh
  chmod u+x Anaconda3-2023.09-0-Linux-x86_64.sh
  ./Anaconda3-2023.09-0-Linux-x86_64.sh
  ```

* if running miniwdl:
  * make sure the version is 1.5.1 (`pip3 install miniwdl==1.5.1`)
  * downgrade importlib-metadata: `pip3 install importlib-metadata==4.13.0`

* if building the images on an EC2 machine:
  * follow [these instructions](https://docs.docker.com/engine/install/ubuntu/) to update docker before building

* change the docker build directory to be on /mnt:
  ```
  sudo systemctl stop docker
  sudo mv /var/lib/docker /mnt/docker
  sudo ln -s /mnt/docker /var/lib/docker
  sudo systemctl start docker
  ```

* add the current user to the docker group:
  * `sudo usermod -aG docker $USER`
  * log out and log back in to reflect the new group status

* build the index-generation docker image: `make rebuild WORKFLOW=index-generation`
* run and exec into the container: `docker run -v /mnt:/mnt --rm -it index-generation:latest bash`
* to run a certain task in index-generation.wdl with miniwdl:
  * `miniwdl run index-generation.wdl --task GenerateLocDB --input generate_loc_db_input.json`