ADD groot normalizer tests & documentation
Vedanth-Ramji committed Aug 7, 2024
1 parent 65c7e30 commit 2d62163
Showing 30 changed files with 626 additions and 22 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -4,6 +4,7 @@

- argNorm has been included as an nf-core module: https://nf-co.re/modules/argnorm/
- Use atomic writing for outputs
- Support GROOT v1.1.2

### New Features

14 changes: 10 additions & 4 deletions README.md
@@ -43,6 +43,7 @@ The `resistance_to_drug_classes` column will contain ARO numbers of the broader
- [ABRicate](https://github.com/tseemann/abricate) (v1.0.1) with NCBI (v3.6), ResFinder (v4.1.11), MEGARes (v2.0), ARG-ANNOT (v5), ResFinderFG (v2)
- [ResFinder](https://bitbucket.org/genomicepidemiology/resfinder/src/master/) (v4.0)
- [AMRFinderPlus](https://github.com/ncbi/amr) (v3.10.30)
- [GROOT](https://github.com/will-rowe/groot) (v1.1.2)

## Installation
argNorm can be installed using pip:
@@ -72,6 +73,7 @@ The only positional argument required is `tool` which can be:
- `abricate`
- `resfinder`
- `amrfinderplus`
- `groot`

The available options are:
- `-h` or `--help`: shows available options and exits.
@@ -82,6 +84,7 @@ The available options are:
- DeepARG (`deeparg`)
- MEGARes (`megares`)
- ARG-ANNOT (`argannot`)
- `groot-core-db`, `groot-db`, `groot-resfinder`, `groot-argannot`, `groot-card`
- `--hamronized`: use this if the input is hamronized by [hAMRonization](https://github.com/pha4ge/hAMRonization)
- `-i` or `--input`: path to the annotation result
- `-o` or `--output`: the file to save normalization results
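
The options listed above can be sketched with the standard library's `argparse`. This is a minimal illustration only, not argNorm's actual CLI source: the `build_parser` helper and the shortened help strings are assumptions, while the tool names and `--db` choices are copied from the help output in this commit.

```python
import argparse

# Choices copied from the usage text shown in this commit.
TOOLS = ['argsoap', 'abricate', 'deeparg', 'resfinder', 'amrfinderplus', 'groot']
DB_CHOICES = [
    'sarg', 'ncbi', 'resfinder', 'deeparg', 'megares', 'argannot', 'resfinderfg',
    'groot-argannot', 'groot-resfinder', 'groot-db', 'groot-core-db', 'groot-card',
]

def build_parser():
    """Hypothetical helper: builds a parser that mirrors the documented options."""
    parser = argparse.ArgumentParser(
        prog='argnorm',
        description='Normalize ARG annotation results to ARO.',
    )
    parser.add_argument('tool', choices=TOOLS,
                        help='The tool you used to do ARG annotation.')
    parser.add_argument('--db', choices=DB_CHOICES,
                        help='The database you used to do ARG annotation.')
    parser.add_argument('--hamronized', action='store_true',
                        help='Use this if the input is hamronized.')
    parser.add_argument('-i', '--input', help='path to the annotation result')
    parser.add_argument('-o', '--output', help='the file to save normalization results')
    return parser

args = build_parser().parse_args(
    ['groot', '--db', 'groot-card', '-i', 'in.tsv', '-o', 'out.tsv'])
print(args.tool, args.db, args.hamronized)  # groot groot-card False
```

Because both the positional `tool` and `--db` use `choices`, an unsupported value is rejected by `argparse` before any normalization code runs.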
@@ -90,17 +93,20 @@ Use `argnorm -h` or `argnorm --help` to see available options.

```bash
>argnorm -h
usage: argnorm [-h] [--db {sarg,ncbi,resfinder,deeparg,megares,argannot}] [--hamronized] [-i INPUT] [-o OUTPUT] {argsoap,abricate,deeparg,resfinder,amrfinderplus}
usage: argnorm [-h]
[--db {sarg,ncbi,resfinder,deeparg,megares,argannot,resfinderfg,groot-argannot,groot-resfinder,groot-db,groot-core-db,groot-card}]
[--hamronized] [-i INPUT] [-o OUTPUT]
{argsoap,abricate,deeparg,resfinder,amrfinderplus,groot}

argNorm normalizes ARG annotation results from different tools and databases to the same ontology, namely ARO (Antibiotic Resistance Ontology).

positional arguments:
{argsoap,abricate,deeparg,resfinder,amrfinderplus}
{argsoap,abricate,deeparg,resfinder,amrfinderplus,groot}
The tool you used to do ARG annotation.

options:
optional arguments:
-h, --help show this help message and exit
--db {sarg,ncbi,resfinder,deeparg,megares,argannot}
--db {sarg,ncbi,resfinder,deeparg,megares,argannot,resfinderfg,groot-argannot,groot-resfinder,groot-db,groot-core-db,groot-card}
The database you used to do ARG annotation.
--hamronized Use this if the input is hamronized (processed using the hAMRonization tool)
-i INPUT, --input INPUT
41 changes: 37 additions & 4 deletions argnorm/lib.py
@@ -12,7 +12,24 @@
MAPPING_TABLE_ARO_COL = 'ARO'
TARGET_ARO_COL = 'ARO'

DATABASES = ['argannot', 'deeparg', 'megares', 'ncbi', 'resfinder', 'resfinderfg', 'sarg']
DATABASES = [
'argannot',
'deeparg',
'megares',
'ncbi',
'resfinder',
'resfinderfg',
'sarg',
'groot',
]

groot_ref_databases = [
'groot-db',
'groot-core-db',
'groot-argannot',
'groot-resfinder',
'groot-card',
]

_ROOT = os.path.abspath(os.path.dirname(__file__))
_ARO = None
@@ -62,24 +79,40 @@ def get_aro_mapping_table(database):
aro_mapping_table['ARO'] = aro_mapping_table['ARO'].map(lambda a: f'ARO:{a}', na_action='ignore')
return aro_mapping_table

def map_to_aro(gene, database):
def map_to_aro(gene, database, groot_ref_db=None):
"""
Description: Gets ARO mapping for a specific gene in a database.
Parameters:
gene (str): The original ID of the gene as mentioned in source database.
database (str): name of database. Can be: argannot, deeparg, megares, ncbi, resfinderfg and sarg
database (str): name of database. Can be: argannot, deeparg, megares, ncbi, resfinder, resfinderfg, sarg, and groot
groot_ref_db (str, optional): name of reference db used by groot. Can be groot-argannot, groot-resfinder, groot-card, groot-core-db, or groot-db
Returns:
ARO[result] (pronto.term.Term): A pronto term with the ARO number of input gene. ARO number can be accessed using 'id' attribute and gene name can be accessed using 'name' attribute.
If an ARO mapping doesn't exist, None is returned.
"""

if database not in ['ncbi', 'deeparg', 'resfinder', 'sarg', 'megares', 'argannot']:
if database not in DATABASES:
raise Exception(f'{database} is not a supported database.')
if 'groot' in database and groot_ref_db not in groot_ref_databases:
raise Exception(f'{groot_ref_db} is not a valid groot reference database')

mapping_table = get_aro_mapping_table(database)

    # Preprocess input gene & mapping table original ids if groot is being used
    if 'groot' in database:
        if groot_ref_db == 'groot-argannot':
            gene = gene.split('~~~')[-1]
            mapping_table.index = mapping_table.index.map(lambda x: ':'.join(str(x).split(':')[1:3]))
        if groot_ref_db == 'groot-card':
            gene = gene.split('.')[0]
        if groot_ref_db in ['groot-db', 'groot-core-db']:
            if 'card' in gene.lower():
                gene = gene.split('|')[-1]
            else:
                gene = gene.split('__')[1]

try:
result = mapping_table.loc[gene, 'ARO']
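
The gene-ID preprocessing in this hunk can be tried in isolation with the sketch below. It restates the string handling from `map_to_aro` as two standalone helpers; the helper names are hypothetical, and the `groot-db` IDs and the colon-separated ARG-ANNOT mapping-table ID are assumed formats used only for illustration. The `groot-argannot` and `groot-card` IDs are taken from the example files in this commit.

```python
def preprocess_groot_gene(gene, groot_ref_db):
    """Hypothetical helper: reduce a GROOT reference ID to the key used for lookup."""
    if groot_ref_db == 'groot-argannot':
        # Keep only the accession and coordinates after the last '~~~'.
        return gene.split('~~~')[-1]
    if groot_ref_db == 'groot-card':
        # Keep only the gene name before the first '.'.
        return gene.split('.')[0]
    if groot_ref_db in ('groot-db', 'groot-core-db'):
        if 'card' in gene.lower():
            return gene.split('|')[-1]
        return gene.split('__')[1]
    # groot-resfinder IDs are used as-is.
    return gene

def shorten_argannot_id(original_id):
    """Hypothetical helper: same transformation the hunk applies to the mapping table index."""
    return ':'.join(str(original_id).split(':')[1:3])

# IDs taken from the example TSV files in this commit.
print(preprocess_groot_gene('argannot~~~(Bla)cfxA4~~~AY769933:1-966', 'groot-argannot'))
print(preprocess_groot_gene('CfxA4.3003005.AY769933.0-966.1592', 'groot-card'))

# Assumed ID formats, for illustration only.
print(preprocess_groot_gene('groot-db_RESFINDER__tet(Q)_1_L33696', 'groot-db'))
print(shorten_argannot_id('(Bla)cfxA4:AY769933:1-966:966'))
```

For `groot-argannot`, both the input gene and the mapping-table index are reduced to the `accession:coordinates` form, which is what lets the subsequent `.loc` lookup match.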
8 changes: 4 additions & 4 deletions argnorm/normalizers.py
@@ -182,13 +182,13 @@ def get_input_ids(self, itable):
col = 0

if self.database == 'groot-argannot':
return itable[col].map(lambda x: x.split('~~~')[-1])
if self.database == 'groot-resfinder':
return itable[col]
return itable[col].map(lambda x: x.split('~~~')[-1])
if self.database == 'groot-card':
return itable[col].map(lambda x: x.split('.')[0])
if self.database == 'groot-db' or self.database == 'groot-core-db':
if self.database in ['groot-db', 'groot-core-db']:
return itable[col].map(self.preprocess_groot_db_inputs)

return itable[col]

def preprocess_ref_genes(self, ref_genes):
if self.database == 'groot-argannot':
14 changes: 10 additions & 4 deletions docs/api.md
@@ -11,6 +11,7 @@ A list of supported databases.
#### Parameters
* gene (str): The original ID of the gene as mentioned in source database.
* database (str): name of database. Can be: argannot, deeparg, megares, ncbi, resfinder, resfinderfg, sarg or groot
* groot_ref_db (str, optional): name of reference database used by groot. Can be: groot-argannot, groot-resfinder, groot-card, groot-db, or groot-core-db

#### Returns
* pronto.term.Term: A pronto term with the ARO number of input gene. ARO number can be accessed using 'id' attribute and gene name can be accessed using 'name' attribute.
@@ -20,15 +21,19 @@ A list of supported databases.
#### Example

```
# Mapping the `ARR-2_1_HQ141279` gene from the `resfinder` database to the ARO
from argnorm.lib import map_to_aro
# Mapping the `ARR-2_1_HQ141279` gene from the `resfinder` database to the ARO
print(map_to_aro('ARR-2_1_HQ141279', 'resfinder'))
# Mapping the `argannot~~~(Bla)cfxA4~~~AY769933:1-966` gene in `groot` using the `groot-argannot` reference database
print(map_to_aro('argannot~~~(Bla)cfxA4~~~AY769933:1-966', 'groot', 'groot-argannot'))
```
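
The argument checks that `map_to_aro` performs before any lookup can be sketched on their own. This is an illustrative restatement, not the library code: the function name `validate_database_args` is hypothetical, and it raises `ValueError` where the source raises a plain `Exception`. The two lists are copied from `argnorm/lib.py` in this commit.

```python
# Lists copied from argnorm/lib.py as shown in this commit.
DATABASES = ['argannot', 'deeparg', 'megares', 'ncbi',
             'resfinder', 'resfinderfg', 'sarg', 'groot']
GROOT_REF_DATABASES = ['groot-db', 'groot-core-db', 'groot-argannot',
                       'groot-resfinder', 'groot-card']

def validate_database_args(database, groot_ref_db=None):
    """Hypothetical helper: reject unsupported database / reference db combinations."""
    if database not in DATABASES:
        raise ValueError(f'{database} is not a supported database.')
    # A reference database is only required (and only checked) for groot.
    if 'groot' in database and groot_ref_db not in GROOT_REF_DATABASES:
        raise ValueError(f'{groot_ref_db} is not a valid groot reference database')

validate_database_args('resfinder')                # passes, no reference db needed
validate_database_args('groot', 'groot-argannot')  # passes

try:
    validate_database_args('groot')                # groot_ref_db defaults to None
except ValueError as err:
    print(err)  # None is not a valid groot reference database
```

Since `groot_ref_db` defaults to `None`, calling with `'groot'` alone fails the second check, so the reference database is effectively mandatory for GROOT input.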

### argnorm.lib.get_aro_mapping_table(): gets ARO mapping table for a specific database

#### Parameters
* database (str): name of database. Can be: argannot, deeparg, megares, ncbi, resfinderfg and sarg
* database (str): name of database. Can be: argannot, deeparg, megares, ncbi, resfinderfg, sarg or groot

#### Returns
* pandas.DataFrame: A pandas dataframe with ARGs mapped to AROs.
@@ -81,17 +86,18 @@ print(drugs_to_drug_classes(['ARO:0000030', 'ARO:0000051', 'ARO:0000069', 'ARO:3
Normalizers classes for specific tools which normalize ARG annotation outputs. Same functionality as CLI.

All normalizers have 2 parameters:
* database (str): name of database. Can be: argannot, deeparg, megares, ncbi, resfinderfg and sarg.
* database (str): name of database. Can be: argannot, deeparg, megares, ncbi, resfinderfg, sarg, groot-db, groot-core-db, groot-card, groot-argannot, and groot-resfinder.
* is_hamronized (bool, False by default): whether or not the ARG annotation output has been processed by the hamronization package.

> Note: the database parameter only needs to be specified for AbricateNormalizer. ncbi, deeparg, resfinder, sarg, megares, argannot, resfinderfg are the supported databases.
> Note: the database parameter only needs to be specified for AbricateNormalizer and GrootNormalizer. AbricateNormalizer supports ncbi, deeparg, resfinder, sarg, megares, argannot, and resfinderfg; GrootNormalizer supports groot-db, groot-core-db, groot-argannot, groot-resfinder, and groot-card.

Available normalizers:
* argnorm.normalizers.ARGSOAPNormalizer
* argnorm.normalizers.DeepARGNormalizer
* argnorm.normalizers.ResFinderNormalizer
* argnorm.normalizers.AMRFinderPlusNormalizer
* argnorm.normalizers.AbricateNormalizer
* argnorm.normalizers.GrootNormalizer

### Methods

15 changes: 15 additions & 0 deletions docs/cli.md
@@ -147,4 +147,19 @@ argnorm resfinder -i examples/raw/resfinder.resfinder.reads.tsv -o outputs/raw/r
argnorm amrfinderplus -i examples/raw/amrfinderplus.ncbi.orfs.tsv -o outputs/raw/amrfinderplus.ncbi.orfs.tsv

argnorm amrfinderplus -i examples/hamronized/amrfinderplus.ncbi.orfs.tsv -o outputs/hamronized/amrfinderplus.ncbi.orfs.tsv
```

### GROOT
```bash
argnorm groot -i examples/raw/groot.argannot.tsv -o outputs/raw/groot.argannot.tsv --db groot-argannot
argnorm groot -i examples/raw/groot.resfinder.tsv -o outputs/raw/groot.resfinder.tsv --db groot-resfinder
argnorm groot -i examples/raw/groot.card.tsv -o outputs/raw/groot.card.tsv --db groot-card
argnorm groot -i examples/raw/groot.groot-db.tsv -o outputs/raw/groot.groot-db.tsv --db groot-db
argnorm groot -i examples/raw/groot.groot-core-db.tsv -o outputs/raw/groot.groot-core-db.tsv --db groot-core-db

argnorm groot -i examples/hamronized/groot.argannot.tsv -o outputs/hamronized/groot.argannot.tsv --db groot-argannot --hamronized
argnorm groot -i examples/hamronized/groot.resfinder.tsv -o outputs/hamronized/groot.resfinder.tsv --db groot-resfinder --hamronized
argnorm groot -i examples/hamronized/groot.card.tsv -o outputs/hamronized/groot.card.tsv --db groot-card --hamronized
argnorm groot -i examples/hamronized/groot.groot-db.tsv -o outputs/hamronized/groot.groot-db.tsv --db groot-db --hamronized
argnorm groot -i examples/hamronized/groot.groot-core-db.tsv -o outputs/hamronized/groot.groot-core-db.tsv --db groot-core-db --hamronized
```
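
The ten GROOT commands above follow one pattern, so they can also be generated rather than typed out. The sketch below only builds the argument lists; the `groot_commands` helper is hypothetical, and the file layout is taken from the example paths in this commit.

```python
# Reference database suffixes as they appear in the example file names.
REF_DBS = ['argannot', 'resfinder', 'card', 'groot-db', 'groot-core-db']

def groot_commands():
    """Hypothetical helper: build the argnorm invocations for every GROOT example file."""
    commands = []
    for hamronized in (False, True):
        folder = 'hamronized' if hamronized else 'raw'
        for ref in REF_DBS:
            # File names use 'argannot', 'card', ...; --db expects the 'groot-' prefix.
            db = ref if ref.startswith('groot-') else f'groot-{ref}'
            cmd = ['argnorm', 'groot',
                   '-i', f'examples/{folder}/groot.{ref}.tsv',
                   '-o', f'outputs/{folder}/groot.{ref}.tsv',
                   '--db', db]
            if hamronized:
                cmd.append('--hamronized')
            commands.append(cmd)
    return commands

for cmd in groot_commands():
    print(' '.join(cmd))
```

With argNorm installed, each list could be handed to `subprocess.run(cmd, check=True)` to execute the normalization; the sketch stops at printing so that it runs without the package.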
13 changes: 13 additions & 0 deletions examples/hamronized/groot.argannot.tsv
@@ -0,0 +1,13 @@
input_file_name gene_symbol gene_name reference_database_name reference_database_version reference_accession analysis_software_name analysis_software_version genetic_variation_type antimicrobial_agent coverage_percentage coverage_depth coverage_ratio drug_class input_gene_length input_gene_start input_gene_stop input_protein_length input_protein_start input_protein_stop input_sequence_id nucleotide_mutation nucleotide_mutation_interpretation predicted_phenotype predicted_phenotype_confidence_level amino_acid_mutation amino_acid_mutation_interpretation reference_gene_length reference_gene_start reference_gene_stop reference_protein_length reference_protein_start reference_protein_stop resistance_mechanism strand_orientation sequence_identity
groot.argannot.tsv argannot~~~(Bla)cfiA9~~~AB087234:1-750 argannot~~~(Bla)cfiA9~~~AB087234:1-750 groot v argannot~~~(Bla)cfiA9~~~AB087234:1-750 groot v gene_presence_detected 79.0 750
groot.argannot.tsv argannot~~~(Tet)Tet-40~~~AM419751:14211-15431 argannot~~~(Tet)Tet-40~~~AM419751:14211-15431 groot v argannot~~~(Tet)Tet-40~~~AM419751:14211-15431 groot v gene_presence_detected 315.0 1221
groot.argannot.tsv argannot~~~(MLS)ErmF~~~M14730:241-1041 argannot~~~(MLS)ErmF~~~M14730:241-1041 groot v argannot~~~(MLS)ErmF~~~M14730:241-1041 groot v gene_presence_detected 321.0 801
groot.argannot.tsv argannot~~~(AGly)Aph7~~~GG774704:686456-687373 argannot~~~(AGly)Aph7~~~GG774704:686456-687373 groot v argannot~~~(AGly)Aph7~~~GG774704:686456-687373 groot v gene_presence_detected 254.0 918
groot.argannot.tsv argannot~~~(Bla)cfxA2~~~AF504910:1-966 argannot~~~(Bla)cfxA2~~~AF504910:1-966 groot v argannot~~~(Bla)cfxA2~~~AF504910:1-966 groot v gene_presence_detected 338.0 966
groot.argannot.tsv argannot~~~(MLS)ErmB~~~M11180:714-1451 argannot~~~(MLS)ErmB~~~M11180:714-1451 groot v argannot~~~(MLS)ErmB~~~M11180:714-1451 groot v gene_presence_detected 178.0 738
groot.argannot.tsv argannot~~~(Tet)TetQ~~~Z21523:362-2287 argannot~~~(Tet)TetQ~~~Z21523:362-2287 groot v argannot~~~(Tet)TetQ~~~Z21523:362-2287 groot v gene_presence_detected 539.0 1974
groot.argannot.tsv argannot~~~(Bla)cfxA5~~~AY769934:28-993 argannot~~~(Bla)cfxA5~~~AY769934:28-993 groot v argannot~~~(Bla)cfxA5~~~AY769934:28-993 groot v gene_presence_detected 449.0 966
groot.argannot.tsv argannot~~~(Bla)OXA-347~~~JN086160:1583-2407 argannot~~~(Bla)OXA-347~~~JN086160:1583-2407 groot v argannot~~~(Bla)OXA-347~~~JN086160:1583-2407 groot v gene_presence_detected 191.0 825
groot.argannot.tsv argannot~~~(Tet)TetW~~~AJ222769:3687-5606 argannot~~~(Tet)TetW~~~AJ222769:3687-5606 groot v argannot~~~(Tet)TetW~~~AJ222769:3687-5606 groot v gene_presence_detected 203.0 1920
groot.argannot.tsv argannot~~~(Tet)Tet-32~~~DQ647324:181-2100 argannot~~~(Tet)Tet-32~~~DQ647324:181-2100 groot v argannot~~~(Tet)Tet-32~~~DQ647324:181-2100 groot v gene_presence_detected 148.0 1920
groot.argannot.tsv argannot~~~(Bla)cfxA4~~~AY769933:1-966 argannot~~~(Bla)cfxA4~~~AY769933:1-966 groot v argannot~~~(Bla)cfxA4~~~AY769933:1-966 groot v gene_presence_detected 450.0 966
27 changes: 27 additions & 0 deletions examples/hamronized/groot.card.tsv
@@ -0,0 +1,27 @@
input_file_name gene_symbol gene_name reference_database_name reference_database_version reference_accession analysis_software_name analysis_software_version genetic_variation_type antimicrobial_agent coverage_percentage coverage_depth coverage_ratio drug_class input_gene_length input_gene_start input_gene_stop input_protein_length input_protein_start input_protein_stop input_sequence_id nucleotide_mutation nucleotide_mutation_interpretation predicted_phenotype predicted_phenotype_confidence_level amino_acid_mutation amino_acid_mutation_interpretation reference_gene_length reference_gene_start reference_gene_stop reference_protein_length reference_protein_start reference_protein_stop resistance_mechanism strand_orientation sequence_identity
groot.card.tsv Mef(En2) Mef(En2).3004659.AF251288 groot v Mef(En2).3004659.AF251288.1.794-2000.5539 groot v gene_presence_detected 135.0 1206
groot.card.tsv rrsB rrsB.3003410.U00096 groot v rrsB.3003410.U00096.4166659-4168200.3242 groot v gene_presence_detected 95.0 1542
groot.card.tsv rrsB rrsB.3003396.U00096 groot v rrsB.3003396.U00096.4166659-4168200.3233 groot v gene_presence_detected 95.0 1542
groot.card.tsv Escherichia_coli_16S Escherichia_coli_16S.3003223.U00096 groot v Escherichia_coli_16S.3003223.U00096.4166659-4168200.3234 groot v gene_presence_detected 95.0 1542
groot.card.tsv ErmB ErmB.3000375.AF242872 groot v ErmB.3000375.AF242872.1.2131-2878.5430 groot v gene_presence_detected 194.0 747
groot.card.tsv rrnB rrnB.3003411.U00096 groot v rrnB.3003411.U00096.4166659-4168200.3236 groot v gene_presence_detected 95.0 1542
groot.card.tsv rrsB rrsB.3003402.U00096 groot v rrsB.3003402.U00096.4166659-4168200.3235 groot v gene_presence_detected 95.0 1542
groot.card.tsv rrnB rrnB.3003406.U00096 groot v rrnB.3003406.U00096.4166659-4168200.3237 groot v gene_presence_detected 95.0 1542
groot.card.tsv tet(40) tet(40).3000567.AM419751 groot v tet(40).3000567.AM419751.14210-15431.5150 groot v gene_presence_detected 315.0 1221
groot.card.tsv aadS aadS.3004683.M72415 groot v aadS.3004683.M72415.1.1120-1984.5568 groot v gene_presence_detected 199.0 864
groot.card.tsv CfxA4 CfxA4.3003005.AY769933 groot v CfxA4.3003005.AY769933.0-966.1592 groot v gene_presence_detected 450.0 966
groot.card.tsv rrnB rrnB.3003377.U00096 groot v rrnB.3003377.U00096.4166659-4168200.3239 groot v gene_presence_detected 95.0 1542
groot.card.tsv OXA-347 OXA-347.3001777.JN086160 groot v OXA-347.3001777.JN086160.1582-2407.4583 groot v gene_presence_detected 191.0 825
groot.card.tsv rrsB rrsB.3003376.U00096 groot v rrsB.3003376.U00096.4166659-4168200.3240 groot v gene_presence_detected 95.0 1542
groot.card.tsv rrsB rrsB.3003408.U00096 groot v rrsB.3003408.U00096.4166659-4168200.3241 groot v gene_presence_detected 95.0 1542
groot.card.tsv tetQ tetQ.3000191.Z21523 groot v tetQ.3000191.Z21523.0-1974.476 groot v gene_presence_detected 539.0 1974
groot.card.tsv rrsB rrsB.3003399.U00096 groot v rrsB.3003399.U00096.4166659-4168200.3232 groot v gene_presence_detected 95.0 1542
groot.card.tsv tetW tetW.3000194.AJ222769 groot v tetW.3000194.AJ222769.3.3686-5606.5145 groot v gene_presence_detected 203.0 1920
groot.card.tsv CfxA5 CfxA5.3003096.AY769934 groot v CfxA5.3003096.AY769934.27-993.1669 groot v gene_presence_detected 449.0 966
groot.card.tsv CfxA3 CfxA3.3003003.AF472622 groot v CfxA3.3003003.AF472622.52-1018.1514 groot v gene_presence_detected 519.0 966
groot.card.tsv rrsB rrsB.3003405.U00096 groot v rrsB.3003405.U00096.4166659-4168200.3231 groot v gene_presence_detected 95.0 1542
groot.card.tsv rrsH rrsH.3003372.U00096 groot v rrsH.3003372.U00096.223771-225312.3228 groot v gene_presence_detected 95.0 1542
groot.card.tsv rrsB rrsB.3003397.U00096 groot v rrsB.3003397.U00096.4166659-4168200.3230 groot v gene_presence_detected 95.0 1542
groot.card.tsv ErmF ErmF.3000498.M17124 groot v ErmF.3000498.M17124.1181-1982.593 groot v gene_presence_detected 321.0 801
groot.card.tsv rrsB rrsB.3003403.U00096 groot v rrsB.3003403.U00096.4166659-4168200.3238 groot v gene_presence_detected 95.0 1542
groot.card.tsv CfxA2 CfxA2.3003002.AF118110 groot v CfxA2.3003002.AF118110.1.71-1037.4470 groot v gene_presence_detected 450.0 966