ENH only store megares mappings that will be tested in `tests/megares_mappings`

These mappings are aggregated in `construct_megares_mapping.py` and are initially stored in `./db_harmonisation/mapping/megares_resfinder_argannot_mapping.tsv`. This file is then moved into `./tests/megares_mappings/`.
Vedanth-Ramji committed Feb 9, 2025
1 parent 2c74ed2 commit d8bcfd5
Showing 5 changed files with 682 additions and 8,782 deletions.
6 changes: 6 additions & 0 deletions db_harmonisation/README.md
@@ -186,3 +186,9 @@ The ResFinder v4.0 is a notable example as it contains forty instances of gene c
MEGARes v3.0 is composed of multiple public genomic repositories including ResFinder, ARG-ANNOT, the National Center for Biotechnology Information (NCBI) Lahey Clinic beta-lactamase archive, the Comprehensive Antibiotic Resistance Database (CARD), NCBI’s Bacterial Antimicrobial Resistance Reference Gene Database and BacMet. The CARD database and hence the ARO do not support metal and biocide resistance, and so all entries from BacMet are removed to construct the ARO mapping table for MEGARes. Entries from ResFinder and ARG-ANNOT are directly mapped using respective ARO annotation tables, while all other entries are mapped using RGI. All coding sequences and reverse complements are translated to amino acid sequences and processed by RGI using the ‘protein’ mode. All other sequences are processed by RGI using the ‘contig’ mode. The RGI outputs are modified as mentioned above and combined with ResFinder and ARG-ANNOT mappings to form automated annotation tables. Genes which were not given an ARO mapping by RGI are manually assigned an ARO accession. The manual curation and automated annotation tables are combined to produce ARO annotation tables.

> Note: While manually assigning ARO accessions to genes, if the gene cannot be found in the ARO and if the gene does not have a parent ARO accession, no ARO mapping will be provided.
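
The final combination step described above, together with the rule in the note, can be sketched as follows. This is a minimal illustration only: the gene IDs and ARO accessions are made-up placeholders, `combine_annotations` is a hypothetical helper, and plain dictionaries stand in for the annotation tables that the repository handles with pandas.

```python
# Toy automated annotation table: gene ID -> ARO accession (None = no mapping yet).
# IDs and accessions are invented placeholders, not real ARO entries.
automated = {
    'gene_a': 'ARO:0000001',
    'gene_b': None,
    'gene_c': None,
}

# Toy manual curation table for genes the automated step left unmapped.
manual = {'gene_b': 'ARO:0000002'}


def combine_annotations(automated, manual):
    """Return a copy of `automated` with gaps filled from `manual`.

    Genes absent from both tables keep None, i.e. no ARO mapping is provided.
    """
    combined = {}
    for gene, aro in automated.items():
        combined[gene] = aro if aro is not None else manual.get(gene)
    return combined


combined = combine_annotations(automated, manual)
```

Here `gene_c` stays unmapped because neither table supplies an accession for it, which mirrors the case the note describes.
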
+
+## ./mapping/megares_resfinder_argannot_mapping.tsv
+
+MEGARes annotations are derived from ARG-ANNOT and ResFinder mappings. In `../tests/test_lib.py`, `test_megares_mappings()` checks whether the MEGARes mappings taken directly from ARG-ANNOT and ResFinder are correct and up to date, because outdated mappings can go unnoticed when running `construct_megares_mapping.py`, which depends on existing argNorm mappings.
+
+`test_megares_mappings()` depends on `./mapping/megares_resfinder_argannot_mapping.tsv`.
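
The kind of check this README section describes might look roughly like the sketch below. It is not the actual `test_megares_mappings()`: the helper `find_stale_mappings`, the row values, and the lookup structure are all hypothetical, and the column names (`Original ID`, `Source_header`, `ARO`, `Database`) are assumed from the columns touched in this commit. It uses the standard `csv` module on an in-memory string rather than pandas and a real file.

```python
import csv
import io

# Stand-in for the stored TSV. Column names are assumed from this commit's
# diff; the rows themselves are invented.
STORED_TSV = (
    "Original ID\tSource_header\tARO\tDatabase\n"
    "meg_1\tres_hdr_1\tARO:0000001\tresfinder\n"
    "meg_2\targ_hdr_2\tARO:0000002\targannot\n"
)

# Hypothetical "current" mappings per source database: source header -> ARO.
CURRENT_AROS = {
    'resfinder': {'res_hdr_1': 'ARO:0000001'},
    'argannot': {'arg_hdr_2': 'ARO:0000009'},  # changed since the TSV was written
}


def find_stale_mappings(stored_tsv, current_aros):
    """Return (Original ID, stored ARO, current ARO) for rows that disagree."""
    stale = []
    for row in csv.DictReader(io.StringIO(stored_tsv), delimiter='\t'):
        current = current_aros[row['Database']].get(row['Source_header'])
        if current != row['ARO']:
            stale.append((row['Original ID'], row['ARO'], current))
    return stale


stale = find_stale_mappings(STORED_TSV, CURRENT_AROS)
```

A test built on this idea would assert that `stale` is empty; in the toy data above, `meg_2` is flagged because its source mapping has changed.
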
7 changes: 4 additions & 3 deletions db_harmonisation/construct_megares_mapping.py
@@ -51,6 +51,7 @@ def search_argnorm_mappings(mappings, db):
aros.append(np.nan)

mappings['ARO'] = aros
+mappings['Database'] = db
mappings.rename(columns={'MEGARes_header': 'Original ID'}, inplace=True)
return mappings

@@ -121,8 +122,7 @@ def setup_for_rgi():
get_argannot_mappings(megares_headers)
])
missing_mappings = get_missing_mappings(resfinder_argannot_mapping, megares_headers)
-resfinder_argannot_mapping.drop(columns=['Source_header'], inplace=True)
-resfinder_argannot_mapping.to_csv('./mapping/megares_card_resfinder_argannot_mapping.tsv', sep='\t', index=False)
+resfinder_argannot_mapping.to_csv('./mapping/megares_resfinder_argannot_mapping.tsv', sep='\t', index=False)
return generate_missing_mappings_fasta(missing_mappings, './dbs/megares.fasta')

@TaskGenerator
@@ -159,7 +159,8 @@ def get_contig_rgi_output(contig_fasta):

@TaskGenerator
def merge_megares_mappings(cds_mapping, contig_mapping):
-megares_mappings = pd.read_csv('./mapping/megares_card_resfinder_argannot_mapping.tsv', sep='\t')
+megares_mappings = pd.read_csv('./mapping/megares_resfinder_argannot_mapping.tsv', sep='\t')
+megares_mappings.drop(columns=['Source_header', 'Database'], inplace=True)
cds_mapping = pd.read_csv(cds_mapping, sep='\t')
contig_mapping = pd.read_csv(contig_mapping, sep='\t')

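Taken together, the hunks above keep the provenance columns in the stored TSV and drop them only when the file is read back for merging. The sketch below illustrates that pattern under stated assumptions: `write_tsv` and `read_without_provenance` are hypothetical helpers, the row is invented, and an in-memory buffer with the standard `csv` module replaces the pandas calls and file paths used in `construct_megares_mapping.py`.

```python
import csv
import io

# Columns kept in the stored file for testing but not needed for merging.
PROVENANCE_COLUMNS = ('Source_header', 'Database')


def write_tsv(rows, fieldnames):
    """Serialise rows to TSV text, keeping every column (including provenance)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                            delimiter='\t', lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def read_without_provenance(tsv_text):
    """Parse TSV text and drop the provenance columns from each row."""
    rows = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter='\t'):
        rows.append({k: v for k, v in row.items() if k not in PROVENANCE_COLUMNS})
    return rows


# Invented example row; real files contain MEGARes headers and ARO accessions.
rows = [{'Original ID': 'meg_1', 'Source_header': 'res_hdr_1',
         'ARO': 'ARO:0000001', 'Database': 'resfinder'}]
stored = write_tsv(rows, ['Original ID', 'Source_header', 'ARO', 'Database'])
merge_input = read_without_provenance(stored)
```

Keeping the extra columns on disk lets a test trace each stored mapping back to its source database, while the merge step still sees only the columns it needs.
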