README

#######################################
# HOWARD                              #
# Highly Open and Valuable tool for   #
#    Variant Annotation & Ranking     #
#    for Discovery                    #
# Release : 0.9.15.5                  #
# Date : 20210721                     #
# Author: Antony Le Bechec            #
# Copyright: HUS                      #
# Licence: GNU AGPL V3                #
#######################################

###############
# DESCRIPTION #
###############

HOWARD annotates and prioritizes genetic variations, calculates and normalizes annotations, translates VCF format and generates variant statistics.
HOWARD annotation is mainly based on the ANNOVAR and snpEff tools, using available databases (see ANNOVAR and snpEff) and home-made databases. It also uses BCFTOOLS to annotate variants with a VCF file. ANNOVAR and snpEff databases are automatically downloaded if needed.
HOWARD calculation harmonizes allele frequency (VAF), extracts Nomen (transcript, cNomen, pNomen...) from HGVS fields with an optional list of personalized transcripts, and generates a VaRank format barcode.
HOWARD prioritization algorithm uses profiles to flag variants (as passed or filtered), calculate a prioritization score, and automatically generate a comment for each variant (example: 'polymorphism identified in dbSNP. associated to Lung Cancer. Found in ClinVar database'). Prioritization profiles are defined in a configuration file. A profile is defined as a list of annotation/value pairs, using wildcards and comparison options (contains, lower than, greater than, equal...). Annotation fields may be quality values (usually from callers, such as 'GQ', 'DP') or other annotation fields provided by annotation tools, such as HOWARD itself (example: COSMIC, Clinvar, 1000genomes, PolyPhen, SIFT). Multiple profiles can be used simultaneously, which is useful to define multiple validation/prioritization levels (example: 'standard', 'stringent', 'rare variants', 'low allele frequency').
HOWARD translates VCF format into TSV format, by sorting variants using specific fields (example: 'prioritization score', 'allele frequency', 'gene symbol'), including/excluding annotations/fields, including/excluding variants, and adding fixed columns.
HOWARD generates statistics files with a specific algorithm, snpEff and BCFTOOLS.
HOWARD multithreads by splitting the work across groups of variants and across databases (data scaling).

################
# REQUIREMENTS #
################

- CentOS7 or higher
- YUM Packages "bc make wget perl-Switch perl-Digest-MD5 perl-Data-Dumper which zlib zlib2 bzip2 lzma xz"
- Java [1.8]
- bcftools/htslib [1.12]
- ANNOVAR [2019Oct24]
- snpEff [5.0e]

################
# INSTALLATION #
################

# Tools configuration
#######################
# Use env.sh and config.ini to configure tools used by HOWARD

# Annotation configuration
############################
# Use config.annotation.ini to configure annotations

# Prioritization configuration
################################
# Use config.prioritization.ini to configure filter for prioritization

# Download databases
######################
# Databases are downloaded automatically, using the annotation configuration file or command-line options (--annovar_databases, --snpeff_databases, --assembly...)
# Use a vcf file to download ANNOVAR databases (WITHOUT multithreading, "ALL" for all databases, "CORE" for core databases, "snpeff" for snpEff database, or a list of databases):
# ./HOWARD --input=docs/example.vcf --output=docs/example.annotated.vcf --annotation=ALL,snpeff [--annovar_databases=</path/to/annovar_databases> --snpeff_databases=</path/to/snpeff_databases> --assembly=<assembly> --thread=1 --verbose]
# Run this command once for each required assembly (such as hg19, hg38, mm9)
# For home-made databases, refer to the config.annotation.ini file to build and configure your database
# Beware of proxy configuration!
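# The download command above can be repeated for each assembly with a small loop. The following is
# only a sketch: the ASSEMBLIES list and the database folders are placeholders, and the HOWARD variable
# is a hypothetical dry-run switch (commands are printed unless it is set to ./HOWARD).

```shell
#!/bin/sh
# Dry run by default: each command is printed instead of executed.
# Set HOWARD=./HOWARD in the environment to really download the databases.
HOWARD="${HOWARD:-echo ./HOWARD}"

# Placeholders (assumptions): adapt the assemblies and folders to your setup.
ASSEMBLIES="${ASSEMBLIES:-hg19 hg38}"
ANNOVAR_DB="${ANNOVAR_DB:-/path/to/annovar_databases}"
SNPEFF_DB="${SNPEFF_DB:-/path/to/snpeff_databases}"

download_databases() {
    for assembly in $ASSEMBLIES; do
        # One single-threaded run per assembly (downloads are done WITHOUT multithreading)
        $HOWARD --input=docs/example.vcf \
                --output="docs/example.annotated.$assembly.vcf" \
                --annotation=ALL,snpeff \
                --annovar_databases="$ANNOVAR_DB" \
                --snpeff_databases="$SNPEFF_DB" \
                --assembly="$assembly" --threads=1 --verbose
    done
}

download_databases
```

# Reviewing the printed commands first makes it easier to check folders and proxy settings before any download starts.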

#######################################
# HOWARD [0.9.15.5-18/07/2021]
# HOWARD Annotation, Calculation, Prioritization and Translation, based on ANNOVAR and snpEff, allowing multithreading 
# Antony Le Bechec @ HUS © GNU AGPL V3
#######################################
# RELEASE NOTES:
# 0.9b-07/10/2016:
#	Script creation
# 0.9.1b-11/10/2016:
#	Add Prioritization and Translation
# 0.9.1b-11/10/2016:
#	Add snpEff annotation and stats
# 0.9.8b-21/03/2017:
#	Add Multithreading on Prioritization and Translation
# 0.9.9b-18/04/2017:
#	Add Calculation step
# 0.9.10b-07/11/2017:
#	Add generic file annotation through --annotation option
#		No need to be in configuration file
#		Need to be in ANNOVAR database folder (file 'ASSEMBLY_ANN.txt' for annotation 'ANN')
#	Add options: --force , --split
#	Add options for VCFannotation.pl: --show_annotations, --show_annotations_full
#	Add database download option nowget in VCFannotation.pl
#	Fixes: multithreading, VAF calculation, configuration and check dependencies
# 0.9.11b-07/05/2018:
#	Replace VCFTOOLS commands with BCFTOOLS commands
#	Release added into the output VCF
#	Update snpEff options
#	Add VARTYPE, CALLING_QUALITY and CALLING_QUALITY_EXPLODE option on calculation
#	Add description on calculations
# 0.9.11.1b-14/05/2018:
#	Improve VCF validation
#	Fix snpEff annotation bug
# 0.9.11.2b-17/08/2018:
#	Add --vcf input vcf file option
#	Create Output file directory automatically
#	Improve Multithreading
# 0.9.12b-24/08/2018:
#	Improve Multithreading
#	Input VCF compressed with BGZIP accepted
#	Output VCF compression level
#	Add VCF input sorting and multiallele split step (by default)
#	Add VCF input normalization step with option --norm
#	Bug fixes
# 0.9.13b-04/10/2018:
#	Multithreading improved
#	Change default output vcf
#	Input vcf without samples allowed
#	VCF Validation with contig check
#	Add multi VCF in input option
#	Add --annotate option for BCFTOOLS annotation with a VCF and TAG (beta)
#	Remove the non-multithreaded code path (single-threaded runs now use multithreading with 1 thread)
#	Remove --multithreading parameter, only --thread parameter to deal with multithreading
#	Replace --filter and --format parameters by --prioritization and --translation parameters
#	Add snpeff options to VCFannotation.pl
# 0.9.14b-21/01/2019:
#	Reorganization of folders (bin, config, docs, toolbox...).
#	Improve Translation (TSV or VCF, sort on fields, selection of fields, filtering on fields), especially memory efficiency
#	Change Number/Type/Description of new INFO/FORMAT header generated
#	Remove snpEff option --snpeff and --snpeff_hgvs. SnpEff is used through --annotation option
#	Add '#' to the TAB/TSV delimiter format header
#	Update dbNSFP config annotation file script
#	Change default configuration files for annotation (add dbNSFP 3.5a, update mcap and regsnpintron) and prioritization
#	Bug fixed: file identification in annotation configuration
#	Bug fixed: calculation INFO fields header, snpeff parameters options on multithreading
#	Bug fixed: snpeff parameters in command line
# 0.9.15b-19/09/2019:
#	Rename HOWARD.sh to HOWARD.
#	Add --nomen_fields parameter and update NOMEN calculation.
#	Add --bcftools_stats and --stats parameter.
#	Change PZScore, PZFlag, PZComment and PZInfos generation, adding default PZ and all PZ filters.
#	Bug fixed: translation fields identification.
# 0.9.15.1b-28/05/2020:
#	Bug fixed: NOMEN calculation clear previous NOMEN values if using force option.
# 0.9.15.2-12/10/2020:
#	Change --norm parameter by adding '--check-ref=s' in bcftools command.
#	Add --norm_options parameter.
#	Force VCF translation by default.
#	Change VAF_stats and add DP_stats.
# 0.9.15.4-11/04/2021:
#	Remove --snpeff_threads parameter (for snpEff 5.0e compatibility) and improve --snpeff_stats.
#	Add a check and rehead INFO fields if necessary (prevent some incorrect INFO header format).
#	Fix --compress parameter and add --index parameter.
# 0.9.15.5-21/07/2021:
#	Add INFO description Type option, with autodetection.
#	Add prioritization mode 'VaRank'/'max' for score calculation.
#	Add calculation DP, AD, GQ and associated stats.
#	Fix INFO field type for VCF.

#######################################
# USAGE: HOWARD --input=<FILE>  [options...]
#
 
### Input/Output
# --input=<FILES>                       Input files in VCF format
#                                       Format *vcf, *vcf.gz or *bcf in BGZIP compression format
#                                       Input Files will be merged (all samples in the output)
# --output=<FILE>                       Output annotated file in defined format (see --translation option).
#                                       Default input file with extension *howard.vcf
#
 
### Main Options
# --annotate=<LIST>                     Annotation with BCFTOOLS
#                                       List of VCF files with TAG to annotate
#                                       Format 'VCF:TAG;VCF:TAG...'
#                                       Example: 'annotate1.vcf:ID,QUAL,+INFO;annotate2.vcf:INFO/ANN;annotate3.vcf'.
#                                       Default TAG '+INFO'
# --annotation=<LIST>                   List of annotation (in the order to add into the input VCF, case sensitive)
#                                       Annotations sources, defined in the file 'config_annotation'
#                                       Example: 'symbol', 'symbol,hgvs', 'snpeff', 'snpeff_split'
#                                       Specific snpEff options (if snpeff_jar and snpeff_databases defined):
#                                          'snpeff' to annotate 'ANN' field
#                                          'snpeff_split' to annotate 'ANN' field and split snpEff annotation into:
#                                             'snpeff_hgvs' to annotate 'ANN' field and annotate 'snpeff_hgvs' field and add into 'hgvs'
#                                             'snpeff_gene_name' to annotate 'ANN' field and annotate 'snpeff_gene_name' field and add into 'symbol' if empty
#                                             'snpeff_annotation' to annotate 'ANN' field and annotate 'snpeff_annotation' field and add into 'location' and 'outcome' if empty
#                                             'snpeff_impact' to annotate 'ANN' field and annotate 'snpeff_impact' field
# --calculation=<LIST>                  List of calculation
#                                       Example: 'VAF', 'VAF,NOMEN'
#                                       Currently available:
#                                          'VAF' add VAF calculation for each sample, depending on information provided by callers (by order: FREQ, DP4, AD)
#                                          'VAF_STATS' add VAF statistics on INFO field, from VAF calculation for each sample, depending on information provided by callers (by order: VAF, FREQ, DP4, AD)
#                                          'DP_STATS' add DP statistics on INFO field, from DP calculation for each sample, depending on information provided by callers (by order: DP, DP4, AD)
#                                          'CALLING_QUALITY' Calling quality (FORMAT/*) of all samples in case of multiSample VCF, or all pipelines in case of multipipeline VCF
#                                          'CALLING_QUALITY_EXPLODE' Explode all Calling quality (FORMAT/*) in multiple fields in INFOS
#                                          'BARCODE' Calculate VaRank BarCode
#                                          'GENOTYPECONCORDANCE' Whether all samples have the same genotype
#                                          'FINDBYPIPELINES' Number of pipelines calling the variant
#                                          'VARTYPE' SNV if X>Y, MOSAIC if X>Y,Z or X,Y>Z, INDEL if XY>Z or X>YZ
#                                          'NOMEN' Find the NOMEN from HGVS annotation. Depends on the reference transcript.
#                                             If no reference transcript is given, the first transcript is used.
#                                             This option creates annotations in the INFO field:
#                                                NOMEN (full HGVS annotation)
#                                                CNOMEN (DNA level mutation 'c.', 'g.', 'm.')
#                                                RNOMEN (RNA level mutation 'r.', 'n.')
#                                                PNOMEN (Protein level mutation 'p.')
#                                                TVNOMEN (transcript with version)
#                                                TNOMEN (transcript)
#                                                VNOMEN (transcript version)
#                                                ENOMEN (exon, if any)
#                                                GNOMEN (gene, if any)
# --prioritization=<LIST OF FILE>       List of prioritization profiles (see prioritization config file)
# --translation=<STRING>                Output format, either TSV or VCF
#                                       Default: 'VCF'
#
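# As an illustration of the main options, the sketch below combines a BCFTOOLS annotation, database and
# snpEff annotations, and calculations in one run. The annotate*.vcf files and output name are placeholders
# taken from the examples above, and the HOWARD variable is a hypothetical dry-run switch.

```shell
#!/bin/sh
# Dry run by default: the command is printed. Set HOWARD=./HOWARD to execute it.
HOWARD="${HOWARD:-echo ./HOWARD}"

annotate_and_calculate() {
    # --annotate is quoted because ';' would otherwise end the shell command
    $HOWARD --input=docs/example.vcf \
            --output=docs/example.howard.vcf \
            --annotate='annotate1.vcf:ID,QUAL,+INFO;annotate2.vcf:INFO/ANN' \
            --annotation=symbol,hgvs,snpeff_split \
            --calculation=VAF,VAF_STATS,NOMEN \
            --translation=VCF
}

annotate_and_calculate
```

# Annotations are added in the order given to --annotation, so the order of the list matters.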
 
### Annotation Options
# --assembly=<STRING>                   Genome assembly
#                                       Default 'hg19', 'default' to use the default assembly in the configuration file
# --snpeff_spliceSiteSize=<INTEGER>     Set size for splice sites (donor and acceptor) in bases
#                                       Default 3
# --snpeff_additional_options=<STRING>  Additional options for snpEff 
#                                       Format 'param1:value1|param2:value2'
#                                       Default ''
#
 
### Calculation Options
# --transcripts=<FILE>                  File containing default transcript for each gene
#                                       Format (second field optional):
#                                          TRANSCRIPT1	GENE1
#                                          TRANSCRIPT2	GENE2
#                                          TRANSCRIPT3	GENE2
#                                          ...
# --nomen_fields=<STRING>               List of names of annotation field to use to calculate/extract NOMEN annotation
#                                       Field order determines the priority
#                                       Format: field1,field2,...
#                                       Examples: 'hgvs', 'snpeff_hgvs', 'snpeff_hgvs,hgvs'
#                                       Default: 'hgvs'
# --trio=<STRING>                       List of samples identifying a trio
#                                       Will automatically calculate VaRank barcode, and add INFO/trio_variant_type
#                                       Format: 'father_sample_name,mother_sample_name,child_sample_name'
#                                       Either 'denovo', 'dominant', 'recessive':
#                                          denovo: '001'
#                                          dominant: '011','101','111','021','201','121','211'
#                                          recessive: '112','212','122','222'
#
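# A possible way to combine the calculation options is sketched below. The transcripts file content and
# the sample names are placeholders, and HOWARD is a hypothetical dry-run switch. Reading the barcodes
# above in father,mother,child order ('001' for denovo) is an interpretation of the --trio format.

```shell
#!/bin/sh
# Dry run by default: the command is printed. Set HOWARD=./HOWARD to execute it.
HOWARD="${HOWARD:-echo ./HOWARD}"

# Hypothetical transcripts file: TRANSCRIPT<TAB>GENE, the gene column being optional.
TRANSCRIPTS="${TRANSCRIPTS:-${TMPDIR:-/tmp}/howard_transcripts.tsv}"
printf 'TRANSCRIPT1\tGENE1\nTRANSCRIPT2\tGENE2\n' > "$TRANSCRIPTS"

trio_nomen() {
    # Sample names are placeholders, given in the order father,mother,child.
    # snpeff_hgvs is listed first in --nomen_fields, so it takes priority over hgvs.
    $HOWARD --input=docs/example.vcf \
            --output=docs/example.trio.vcf \
            --calculation=NOMEN \
            --transcripts="$TRANSCRIPTS" \
            --nomen_fields=snpeff_hgvs,hgvs \
            --trio=father_sample,mother_sample,child_sample
}

trio_nomen
```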
 
### Prioritization Options
# --hard                                Remove non-PASS variants according to the default prioritization filter (PZFlag)
# --pzfields=<STRING>                   List of prioritization information to show
#                                       Default: 'PZScore,PZFlag'
#                                       Format: 'field1,field2...'
#                                       Example: 'PZScore,PZFlag,PZComment,PZInfos'
# --prioritization_score_mode=<STRING>  Prioritization mode
#                                       Default: 'HOWARD'
#                                       Either: 'HOWARD', 'VaRank'
#                                          'HOWARD': sum scores from prioritization definition
#                                          'VaRank': choose max value score from prioritization definition
#
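# The difference between the two score modes can be illustrated with a toy calculation, followed by a
# matching command. This is only a sketch of the sum/max rule described above, not HOWARD's own code:
# the score function, the three example scores and the 'default' profile name are assumptions.

```shell
#!/bin/sh
# Dry run by default: the command is printed. Set HOWARD=./HOWARD to execute it.
HOWARD="${HOWARD:-echo ./HOWARD}"
PROFILES="${PROFILES:-default}"   # hypothetical profile name from the prioritization config file

# Toy model: combine the scores of the criteria matched by one variant.
# usage: score <HOWARD|VaRank> <score> [<score>...]
score() {
    mode=$1; shift
    result=$1; shift
    for s in "$@"; do
        case $mode in
            HOWARD) result=$((result + s)) ;;                          # sum of all scores
            VaRank) if [ "$s" -gt "$result" ]; then result=$s; fi ;;   # keep the maximum
        esac
    done
    echo "$result"
}

score HOWARD 5 10 3   # prints 18
score VaRank 5 10 3   # prints 10

prioritize() {
    $HOWARD --input=docs/example.howard.vcf \
            --output=docs/example.prioritized.vcf \
            --prioritization="$PROFILES" \
            --pzfields=PZScore,PZFlag,PZComment,PZInfos \
            --prioritization_score_mode=VaRank \
            --hard
}

prioritize
```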
 
### Translation Options
# --fields=<STRING>                     List of annotations from INFO VCF field to include in the output file.
#                                       Use 'INFO' to insert a unique field INFO in TSV format
#                                       Use 'ALL' to insert all other annotations in the list
#                                       Annotations considered only if present in the file
#                                       Default 'ALL'
#                                       Example: 'PZScore,PZFlag,PZComment,Symbol,hgvs,location,outcome,ALL'
# --sort=<STRING>                       Sort variants by a field and order (default '')
#                                       Format: 'field1:type:order,field2:type:order'
#                                          field considered only if present in the file
#                                          type 'n' for numeric, '?' for autodetect with VCF header. Default ''.
#                                          order either 'DESC' or 'ASC'. Default 'ASC'
#                                       Example: 'PZFlag::DESC,PZScore:n:DESC' (to sort and order by relevance)
# --sort_by=<STRING>                    Sort variants by a field (if no 'sort' option). Default ''
#                                       Example: 'PZFlag,PZScore' (to sort by relevance)
# --order_by=<STRING>                   Order variants by a field (if no 'sort' option). Default ''.
#                                       Example: 'DESC,DESC' (useful to sort by relevance)
# --bcftools_expression=<STRING>        bcftools include expression to filter variants (default '').
#                                       Example: "PZFlag='PASS'" (useful to hard filter), "QUAL>90", or "PZFlag='PASS'&&QUAL>90" to combine 
# --columns=<STRING>                    Additional columns with values
#                                       All additional columns will be added with the associated values
#                                       Available only for 'tsv' format translation
#                                       Format: 'column1:value1,column2:value2,...'
#                                       Example: 'run:runA,sample:sample1'
#
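# The translation options can be combined as in the sketch below, which reuses the example values given
# above. File names are placeholders and HOWARD is a hypothetical dry-run switch. Values are quoted so
# that the shell does not interpret '&&', '>' or the inner quotes of the bcftools expression.

```shell
#!/bin/sh
# Dry run by default: the command is printed. Set HOWARD=./HOWARD to execute it.
HOWARD="${HOWARD:-echo ./HOWARD}"

translate_tsv() {
    # Keep PASS variants with QUAL>90, most relevant first, and add two fixed columns
    $HOWARD --input=docs/example.prioritized.vcf \
            --output=docs/example.tsv \
            --translation=TSV \
            --fields='PZScore,PZFlag,PZComment,Symbol,hgvs,location,outcome,ALL' \
            --sort='PZFlag::DESC,PZScore:n:DESC' \
            --bcftools_expression="PZFlag='PASS'&&QUAL>90" \
            --columns='run:runA,sample:sample1'
}

translate_tsv
```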
 
### Configuration
# --env=<FILE>                          Environment configuration for multithreading (BGZIP, TABIX, BCFTOOLS)
# --config=<FILE>                       Configuration file for main parameters
#                                       Default 'config.ini'
# --config_annotation=<FILE>            Configuration file for annotation parameters
#                                       Default 'config.annotation.ini'
# --config_prioritization=<FILE>        Configuration file for filter/prioritization parameters
#                                       Default 'config.prioritization.ini'
# --annovar_folder=<FOLDER>             Folder with ANNOVAR scripts
# --annovar_databases=<FOLDER>          Folder with ANNOVAR databases
# --snpeff_jar=<FILE>                   snpEff JAR file
# --snpeff_databases=<FOLDER>           Folder with snpEff databases
# --java=<FILE>                         java binary (needed if snpeff option on)
#                                       Default 'java'
# --java_flags=<STRING>                 java flags (needed if snpeff option on)
#                                       especially to configure a proxy for database downloads
#                                       Default ''
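# For snpEff database downloads behind a proxy, --java_flags can carry the standard JVM proxy properties.
# This sketch assumes a hypothetical proxy host and port, and only covers the Java side (snpEff).
# HOWARD is a hypothetical dry-run switch.

```shell
#!/bin/sh
# Dry run by default: the command is printed. Set HOWARD=./HOWARD to execute it.
HOWARD="${HOWARD:-echo ./HOWARD}"

# Hypothetical proxy settings; http(s).proxyHost/proxyPort are standard JVM system properties.
PROXY_HOST="${PROXY_HOST:-proxy.example.org}"
PROXY_PORT="${PROXY_PORT:-3128}"
JAVA_FLAGS="-Dhttp.proxyHost=$PROXY_HOST -Dhttp.proxyPort=$PROXY_PORT -Dhttps.proxyHost=$PROXY_HOST -Dhttps.proxyPort=$PROXY_PORT"

annotate_behind_proxy() {
    # --java_flags is quoted because its value contains spaces
    $HOWARD --input=docs/example.vcf \
            --output=docs/example.snpeff.vcf \
            --annotation=snpeff \
            --config=config.ini \
            --config_annotation=config.annotation.ini \
            --java=java \
            --java_flags="$JAVA_FLAGS"
}

annotate_behind_proxy
```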
 
### Statistics
# --stats=<FILE>                        Generates HOWARD stats.
# --snpeff_stats=<FILE>                 Generates snpEff stats.
# --bcftools_stats=<FILE>               Generates BCFTOOLS stats.
#
 
### Other Options
# --force                               Force annotation, calculation, prioritization
#                                       Replace values even if already exist in VCF header
# --threads=<INTEGER>                   Threads: number of threads to use (default defined in environment variable THREADS, or 1)
# --split=<INTEGER>                     Split by group of variants (default 10000)
# --compress=<INTEGER>                  Compression level output file (0 to 9; default -1, no compression)
# --index                               Indexation of output file with name <output>.idx (only for vcf and compressed output)
# --norm=<FILE>                         Genome fasta file to normalize (beware of chromosome identification, either 'x' or 'chrx')
# --norm_options=<STRING>               Normalisation options (see bcftools options)
#                                       Format: '--option1=value1,--option2=value2,...' 
#                                       Default: '--multiallelics=-any,--rm-dup=exact' 
# --tmp=<FOLDER>                        Temporary folder (default /tmp)
# --verbose                             VERBOSE option
# --debug                               DEBUG option
# --release                             RELEASE option
# --help                                HELP option
#
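# A normalized, compressed and indexed output could be requested as sketched below. The genome path and
# output name are placeholders, the --norm_options value is the documented default written out
# explicitly, and HOWARD is a hypothetical dry-run switch.

```shell
#!/bin/sh
# Dry run by default: the command is printed. Set HOWARD=./HOWARD to execute it.
HOWARD="${HOWARD:-echo ./HOWARD}"

GENOME="${GENOME:-/path/to/genome.fa}"   # placeholder; chromosome names must match the VCF
THREADS="${THREADS:-1}"                  # same fallback as HOWARD: THREADS variable, or 1

normalize_and_compress() {
    $HOWARD --input=docs/example.vcf \
            --output=docs/example.norm.vcf.gz \
            --norm="$GENOME" \
            --norm_options='--multiallelics=-any,--rm-dup=exact' \
            --compress=9 --index \
            --threads="$THREADS" --split=10000 \
            --tmp="${TMPDIR:-/tmp}"
}

normalize_and_compress
```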