User Preference

Nadia Tahiri, PhD edited this page Nov 14, 2023 · 8 revisions

aPhyloGeo Configuration

aPhyloGeo can be embedded in other applications and applied to other datasets by supplying a YAML configuration file. This file sets the parameters that control the analysis.

Configuration File Example

file_name: './datasets/example/geo.csv'
specimen: 'id'
names: ['id', 'ALLSKY_SFC_SW_DWN', 'T2M', 'PRECTOTCORR', 'QV2M', 'WS10M']
bootstrap_threshold: 0
dist_threshold: 60
window_size: 200
step_size: 100
bootstrap_amount: 100
data_names: ['ALLSKY_SFC_SW_DWN_newick', 'T2M_newick', 'QV2M_newick', 'PRECTOTCORR_newick', 'WS10M_newick']
reference_gene_dir: './datasets/example'
reference_gene_file: 'sequences.fasta'
makeDebugFiles: True
alignment_method: '1' # 1: pairwiseAligner, 2: MUSCLE, 3: CLUSTALW, 4: MAFFT
distance_method: '1' # 1: Least-Square distance, 2: Robinson-Foulds distance, 3: Euclidean distance (DendroPy)
fit_method: '1' # 1: Wider fit, elongating with gaps (starAlignment), 2: Narrow fit, preventing elongation with gaps when possible
tree_type: '1' # 1: BioPython consensus tree, 2: FastTree application
rate_similarity: 90
method_similarity: '1' # 1: Hamming distance, 2: Levenshtein distance, 3: Damerau-Levenshtein distance, 4: Jaro similarity, 5: Jaro-Winkler similarity, 6: Smith–Waterman similarity, 7: Jaccard similarity, 8: Sørensen-Dice similarity
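
As an illustration of how an embedding application might check such a file before launching an analysis, the sketch below validates an already-parsed configuration dictionary against the option codes listed in the comments above. The names `validate_config`, `ALLOWED_CODES`, and `POSITIVE_INTS` are hypothetical and are not part of aPhyloGeo; the dictionary is assumed to come from a YAML parser (for example PyYAML's `yaml.safe_load`), which is left out so the example needs only the standard library.

```python
# Hypothetical helper for illustration; not part of the aPhyloGeo API.
# Codes are taken from the comments in the example configuration file.
ALLOWED_CODES = {
    "alignment_method": {"1", "2", "3", "4"},
    # '0' (all distances) is listed in the option descriptions.
    "distance_method": {"0", "1", "2", "3"},
    "fit_method": {"1", "2"},
    "tree_type": {"1", "2"},
    "method_similarity": {"1", "2", "3", "4", "5", "6", "7", "8"},
}

# Assumption: these options must be strictly positive integers.
POSITIVE_INTS = ("window_size", "step_size")


def validate_config(config):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for key, allowed in ALLOWED_CODES.items():
        value = config.get(key)
        if value is None:
            problems.append(f"missing option: {key}")
        elif str(value) not in allowed:
            problems.append(f"{key}={value!r} is not one of {sorted(allowed)}")
    for key in POSITIVE_INTS:
        value = config.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
            problems.append(f"{key} must be a positive integer, got {value!r}")
    return problems


# Values copied from the example configuration file.
example = {
    "window_size": 200,
    "step_size": 100,
    "alignment_method": "1",
    "distance_method": "1",
    "fit_method": "1",
    "tree_type": "1",
    "method_similarity": "1",
}

print(validate_config(example))  # an empty list means no problems were found
```

Returning a list of messages rather than raising on the first error lets a caller report every problem in the file at once.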

There are 11 main options accessible to the user in the YAML configuration file:

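  1. Bootstrap Threshold (bootstrap_threshold): Threshold on the bootstrap replicates generated for each sub-MSA (one sub-MSA per position of the sliding window)
  2. Distance Threshold (dist_threshold): Threshold on the distance between the genetic tree and the climatic tree for each sub-MSA (one sub-MSA per position of the sliding window)
  3. Window Length (window_size): Size of the sliding window
  4. Step (step_size): Distance by which the sliding window advances at each step
  5. Distance Choice (distance_method): Distance used to compare the genetic tree and the climatic tree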
    • '0' for all distances (options '1', '2', and '3')
    • '1' for Least Square (LS) distance (version 1.0)
    • '2' for Robinson and Foulds (RF) distance (normalized by $2n-6$, where $n$ is the number of leaves in each tree)
    • '3' for Euclidean distance
  6. Distance Threshold: LS distance threshold at which the results are most significant
  7. Alignment Method (alignment_method): Algorithm used for sequence alignment
    • '1' for pairwiseAligner
    • '2' for MUSCLE
    • '3' for CLUSTALW
    • '4' for MAFFT
  8. Fit Method (fit_method): How gaps are used to elongate sequences
    • '1' for wider fit, elongating with gaps (starAlignment)
    • '2' for narrow fit, preventing elongation with gaps when possible
  9. Tree Inference Method (tree_type): Method used to infer the trees
    • '1' for BioPython consensus tree
    • '2' for FastTree application
  10. Rate Similarity (rate_similarity): Similarity rate between sequences, used to filter out sub-MSAs whose sequences are highly similar
  11. Method Similarity (method_similarity): Method used to measure the similarity between sequences
    • '1' for Hamming distance
    • '2' for Levenshtein distance
    • '3' for Damerau-Levenshtein distance
    • '4' for Jaro similarity
    • '5' for Jaro-Winkler similarity
    • '6' for Smith–Waterman similarity
    • '7' for Jaccard similarity
    • '8' for Sørensen-Dice similarity
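
To make a few of the options above concrete, the following sketch shows how sliding-window positions follow from the window length and step, how an RF distance is normalized by $2n-6$, and how a Hamming-based similarity could be compared with the rate similarity. It is a minimal illustration under stated assumptions, not aPhyloGeo's implementation: the function names are hypothetical, trailing partial windows are assumed to be dropped, and similarity is assumed to be expressed as a percentage.

```python
# Hypothetical helpers for illustration; not part of the aPhyloGeo API.

def window_starts(alignment_length, window_size, step_size):
    """Start index of each full sliding-window position.

    Assumption: a trailing window shorter than window_size is dropped.
    """
    if window_size <= 0 or step_size <= 0:
        raise ValueError("window_size and step_size must be positive")
    last_start = alignment_length - window_size
    return list(range(0, last_start + 1, step_size))


def normalized_rf(rf_distance, n_leaves):
    """Divide an RF distance by 2n - 6, with n the number of leaves."""
    denominator = 2 * n_leaves - 6
    if denominator <= 0:
        raise ValueError("normalization requires at least 4 leaves")
    return rf_distance / denominator


def hamming_similarity(seq_a, seq_b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# With window_size 200 and step_size 100 on a 500-column alignment:
print(window_starts(500, 200, 100))  # [0, 100, 200, 300]

# An RF distance of 4 between two 5-leaf trees:
print(normalized_rf(4, 5))  # 1.0

# Compare one pair of sequences with a rate similarity of 90:
similarity = hamming_similarity("ACGT", "ACGA")
print(similarity, similarity >= 90)  # 75.0 False
```

How aPhyloGeo aggregates pairwise similarities across all sequences in a sub-MSA is not described here, so the last comparison covers a single pair only.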