Tutorial

Nadia Tahiri edited this page Aug 6, 2024 · 13 revisions

Description

Cumacea (Crustacea: Peracarida) from the deep North Atlantic to the Arctic Ocean

The study area was located in the northern North Atlantic, including the Icelandic Sea, the Denmark Strait, and the Norwegian Sea. The specimens examined were collected as part of the IceAGE project (Icelandic marine Animals: Genetics and Ecology; cruise M85/3 in 2011), which studied the deep continental slopes and abyssal waters around Iceland (Meißner et al., 2018). The included specimens were sampled from August 30 to September 22, 2011, at depths ranging from 316 to 2568 m. The sampling plan, sample processing, DNA extraction steps, PCR amplification, sequencing, and the extracted and aligned DNA sequences are described in Uhlir et al. (2021). Refer to the example file and csv file in the datasets directory for guidance.


Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)

In a previous study (Koshkarov et al., 2022), a total of 38 distinct genetic lineages were identified across the species' range. For this analysis, we focused on 5 of these lineages, selected for their pronounced regional characteristics and relevance to our research. Based on location information, complete nucleotide sequences for these 5 lineages were collected from the NCBI Virus website. When multiple sequencing results were available for the same lineage in the same country, we selected the sequence whose collection date was closest to the earliest date presented. When several sequencing results shared the same country and the same date, we selected the sequence with the fewest ambiguous characters (N per nucleotide).
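The selection rule described above can be sketched in a few lines. This is an illustration only: the record fields (`accession`, `lineage`, `country`, `collected`, `sequence`) and all values are hypothetical, and "closest to the earliest date" is interpreted here as "earliest collection date within the group".

```python
from datetime import date

# Hypothetical records: field names and values are illustrative only.
records = [
    {"accession": "SEQ_A", "lineage": "L1", "country": "X",
     "collected": date(2021, 3, 2), "sequence": "ACGTNNAC"},
    {"accession": "SEQ_B", "lineage": "L1", "country": "X",
     "collected": date(2021, 3, 2), "sequence": "ACGTNAAC"},
    {"accession": "SEQ_C", "lineage": "L1", "country": "X",
     "collected": date(2021, 4, 9), "sequence": "ACGTAAAC"},
    {"accession": "SEQ_D", "lineage": "L2", "country": "Y",
     "collected": date(2020, 12, 1), "sequence": "ACGTAAAC"},
]


def select_representatives(records):
    """Keep one record per (lineage, country): earliest date, then fewest N."""
    best = {}
    for rec in records:
        key = (rec["lineage"], rec["country"])
        # Rank: earlier collection date first, then fewer ambiguous bases.
        rank = (rec["collected"], rec["sequence"].upper().count("N"))
        if key not in best or rank < best[key][0]:
            best[key] = (rank, rec)
    return {key: rec for key, (rank, rec) in best.items()}


selected = select_representatives(records)
for (lineage, country), rec in sorted(selected.items()):
    print(lineage, country, rec["accession"])
```

Comparing `(date, N count)` tuples keeps the tie-break explicit: the N count only matters when two records share the same collection date.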

Although the selection of samples was based on the phylogenetic cluster of lineage and transmission, most of the sites involved represent different meteorological conditions. The 5 samples involved temperatures ranging from -4 °C to 32.6 °C, with an average of 15.3 °C. Specific humidity ranged from 2.9 g/kg to 19.2 g/kg, with an average of 8.3 g/kg. Wind speed and all-sky surface shortwave downward irradiance varied relatively little across samples compared to the other parameters: wind speed ranged from 0.7 m/s to 9.3 m/s, with an average of 4.0 m/s, and all-sky surface shortwave downward irradiance ranged from 0.8 kW-hr/m²/day to 8.6 kW-hr/m²/day, with an average of 4.5 kW-hr/m²/day. In contrast to the other parameters, 75% of the cities involved receive less than 2.2 mm of precipitation per day, and only 5 cities have more than 5 mm of precipitation per day. Precipitation ranged from 0 mm/day to 12 mm/day, with an average of 2.1 mm/day. Refer to the fasta file and csv file in the datasets directory for guidance.


Input

The algorithm takes two files as input with the following definitions:

  • 🧬 Genetic file with FASTA extension. The first file, or set of files, contains the genetic sequence information for the species selected for the study. The file name should identify the gene, so the nomenclature gene_name.fasta is strongly recommended. The file should contain genetic variants (e.g., SNPs) and their associated metadata (e.g., sample IDs, location information).
  • Climatic file with csv extension (Comma-Separated Values). The second file contains the habitat information for the species selected for the study. Each row corresponds to one species identifier and each column to one climatic condition. The file should include the relevant climatic variables (e.g., temperature, precipitation) for each geographic location represented in your genetic data, with column headers clearly labeled to match the expected format.

Preparing Your Data

Include the relevant climatic variables (e.g., temperature, precipitation) for each geographic location represented in your genetic data. Column headers must be clearly labeled to match the expected format. Refer to the example files in the datasets directory for guidance.
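Before running the analysis, it can help to confirm that the two inputs describe the same samples. The sketch below is not part of aPhyloGeo: it uses small in-memory stand-ins for the two files and assumes, hypothetically, that the identifier column in the csv file is named `id` and matches the FASTA record names.

```python
import csv
import io

# Tiny in-memory stand-ins for gene_name.fasta and the climatic csv file.
# The "id" column name and all values are hypothetical.
fasta_text = """>ON129429
ACGTACGT
>OL989074
ACGTACGA
>OU471040
ACGTACGG
"""
csv_text = """id,temperature,precipitation
ON129429,15.3,2.1
OL989074,32.6,0.0
"""


def fasta_ids(handle):
    """Return the identifier (first word after '>') of every FASTA record."""
    return [line[1:].split()[0] for line in handle if line.startswith(">")]


def csv_ids(handle, id_column="id"):
    """Return the identifiers listed in the chosen column of the csv file."""
    return [row[id_column] for row in csv.DictReader(handle)]


genetic = set(fasta_ids(io.StringIO(fasta_text)))
climatic = set(csv_ids(io.StringIO(csv_text)))

missing_climate = sorted(genetic - climatic)   # sequences without climate data
missing_sequence = sorted(climatic - genetic)  # climate rows without a sequence
print("No climatic row for:", missing_climate)
print("No sequence for:", missing_sequence)
```

With real files, replace the `io.StringIO` objects with `open(...)` handles and set `id_column` to whatever header your csv file actually uses.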

Output

The algorithm returns a csv file that summarizes the results for all relevant MSAs (multiple sequence alignments; see the Workflow section for more details). The sliding windows of interest are those with high bootstrap support (i.e., indicating that the tree is robust) and high similarity to the climatic condition in question (i.e., based on the RF, RFnorm, LS, and Euclidean values). For each such window, the file gives, among other things, the name of the gene, the start and end positions of the sliding window, the average bootstrap value, the LS value, and the climatic condition for which this genetic region may explain the adaptation of the species to a given environment. In short, aPhyloGeo generates an output.csv file containing the analysis results. Additional visualizations (e.g., maps, plots) may be generated depending on your configuration.
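As a rough illustration of how such a result file could be screened afterwards, the sketch below keeps windows whose mean bootstrap is high and whose LS distance is small. The column names (`bootstrap_mean`, `ls_distance`, and so on), the thresholds, and the rows are all hypothetical; check the header of your own output.csv before adapting it.

```python
import csv
import io

# Hypothetical excerpt of an output file; the real column names may differ.
output_text = """gene,window_start,window_end,bootstrap_mean,ls_distance,climatic_condition
gene_name,0,199,94.0,0.12,temperature
gene_name,100,299,41.5,0.08,temperature
gene_name,200,399,88.0,0.75,precipitation
"""


def windows_of_interest(handle, min_bootstrap, max_ls):
    """Keep windows with high bootstrap support and a small LS distance."""
    kept = []
    for row in csv.DictReader(handle):
        well_supported = float(row["bootstrap_mean"]) >= min_bootstrap
        close_to_climate = float(row["ls_distance"]) <= max_ls
        if well_supported and close_to_climate:
            kept.append(row)
    return kept


hits = windows_of_interest(io.StringIO(output_text), min_bootstrap=80.0, max_ls=0.5)
for row in hits:
    print(row["gene"], row["window_start"], row["window_end"], row["climatic_condition"])
```

Both conditions must hold at once: a well-supported tree that is far from the climatic tree, or a close match with weak support, is dropped.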

Prerequisites

System Requirements

  • Operating System: Windows, macOS, or Linux.
  • Python: Python 3.8 or higher.

Key Features of aPhyloGeo

  • Multi-Platform: Works seamlessly on Windows, macOS, and Linux.

  • Flexible Analysis: Supports various phylogeographic analyses, including:

    • Identifying genetic lineages and their geographic origins

    • Assessing the impact of climate on genetic diversity

    • Visualizing genetic and geographic relationships

  • Customizable: Tailor analyses using a configuration file to fit your specific research questions.

  • Open Source: Freely available and encourages contributions from the research community.

Before you begin, ensure you have the following installed:

pip install pandas aphylogeo

Step-by-Step Guide

Import Necessary Modules:

Start by importing the required modules:

import pandas as pd
import time
from aphylogeo.alignement import AlignSequences
from aphylogeo.params import Params
from aphylogeo import utils
from aphylogeo.genetic_trees import GeneticTrees

Load Parameters and Sequence File

Load parameters from a configuration file and load the sequence file:

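# Read the analysis settings from the configuration file
Params.load_from_file()
# Load the reference gene sequences and prepare them for alignment
sequenceFile = utils.loadSequenceFile(Params.reference_gene_filepath)
align_sequence = AlignSequences(sequenceFile)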

Load Climatic Data

Load the climatic data using pandas:

climatic_data = pd.read_csv(Params.file_name)

Align Sequences

Align the sequences and measure the time taken for the alignment:

print("\nStarting alignment")
start_time = time.time()
alignements = align_sequence.align()
end_time = time.time()
elapsed_time = round(end_time - start_time, 3)
print(f"Elapsed time: {elapsed_time} seconds")

Generate Genetic Trees

Generate genetic trees from the aligned sequences:

geneticTrees = utils.geneticPipeline(alignements.msa)
trees = GeneticTrees(trees_dict=geneticTrees, format="newick")

Generate Climatic Trees and Filter Results

Generate climatic trees and filter the results:

climaticTrees = utils.climaticPipeline(climatic_data)
utils.filterResults(climaticTrees, geneticTrees, climatic_data)

Save Results

Save the aligned sequences and genetic trees to JSON files:

alignements.save_to_json(f"./results/aligned_{Params.reference_gene_file}.json")
trees.save_trees_to_json("./results/geneticTrees.json")
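The two save calls write into ./results. Whether aPhyloGeo creates that folder automatically is not verified here, so a cautious sketch is to create it beforehand with the standard library:

```python
from pathlib import Path

# Assumption: the save methods expect the target folder to exist already.
results_dir = Path("./results")
results_dir.mkdir(parents=True, exist_ok=True)  # harmless if already present
print(f"Results will be written to {results_dir.resolve()}")
```

Run this before the save step; `exist_ok=True` makes it safe to repeat.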

Configuration

Open the params.yaml file and adjust parameters as needed. This includes specifying the paths to your input files, choosing analysis methods, and setting visualization preferences.

Understanding geneticTrees.json

This file contains a collection of phylogenetic trees in the standard Newick format, representing hypothesized evolutionary relationships between organisms or genetic sequences. Each tree is inferred from genetic data within a specific window, with the window size and step size influencing the relationships depicted.

Key Elements of Newick Format

  • Nodes: Represent taxonomic units (species, lineages, individuals).
  • Branches: Lines connecting nodes, showing the evolutionary path between them.
  • Branch Lengths: Numbers after a colon, indicating the amount of genetic change or evolutionary time along the branch leading to a node.
  • Parentheses: Groupings of nodes and branches to define relationships.
  • Bootstrap Values (Optional): Numbers placed directly after a closing parenthesis and before the colon, indicating the statistical support for that grouping (from 0 to 100).

Example Interpretation

Let's examine the tree labeled '0_199' which shows the evolutionary relationships inferred from the MSA segment starting at position 0 and ending at position 199.

(ON129429:0.00000,(OL989074:0.13065,OU471040:0.00000)100.00:0.06799,(ON134852:0.08040,OM739053:0.00000)88.00:0.00000):0.00000;

Figure 1. Visualization of phylogenetic tree in Newick format, generated by Trex

The tree labeled '0_199' reveals the following:

  • Two Main Groups: The analyzed sequences cluster into two distinct groups:

    • ON129429: This sequence is an outlier, diverging early in the tree and suggesting a more distant evolutionary relationship to the other sequences.
    • Two Clades: The remaining sequences form two well-supported clades (subgroups):
      • Clade 1: OL989074 and OU471040
      • Clade 2: ON134852 and OM739053
  • Closer Relationships Within Clades: The sequences within each clade share a closer evolutionary relationship than with sequences in the other clade.

  • Strong Statistical Support: The bootstrap values of 100.00 and 88.00 indicate strong statistical support for Clade 1 and Clade 2, respectively.
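The clades and support values read off the '0_199' tree can also be extracted programmatically. The following is a minimal standard-library sketch, not aPhyloGeo's own parser: it handles simple Newick strings like the one on this page (no quoted names or comments), and the helper names `parse_newick`, `leaves`, and `supported_clades` are introduced here for illustration.

```python
import re

NEWICK = ("(ON129429:0.00000,(OL989074:0.13065,OU471040:0.00000)100.00:0.06799,"
          "(ON134852:0.08040,OM739053:0.00000)88.00:0.00000):0.00000;")

TOKEN = re.compile(r"[(),:;]|[^(),:;\s]+")
SYMBOLS = set("(),:;")


def parse_newick(text):
    """Parse a simple Newick string into nested dicts (label, length, children)."""
    tokens = TOKEN.findall(text)
    pos = 0

    def node():
        nonlocal pos
        children = []
        if tokens[pos] == "(":
            pos += 1  # consume "("
            children.append(node())
            while tokens[pos] == ",":
                pos += 1  # consume ","
                children.append(node())
            pos += 1  # consume ")"
        label = None
        if tokens[pos] not in SYMBOLS:  # leaf name or support value
            label = tokens[pos]
            pos += 1
        length = None
        if tokens[pos] == ":":  # branch length follows the colon
            length = float(tokens[pos + 1])
            pos += 2
        return {"label": label, "length": length, "children": children}

    return node()


def leaves(node):
    """Return the leaf labels below a node."""
    if not node["children"]:
        return [node["label"]]
    return [name for child in node["children"] for name in leaves(child)]


def supported_clades(node):
    """Yield (sorted leaf names, bootstrap) for every labelled internal node."""
    if node["children"]:
        if node["label"] is not None:
            yield sorted(leaves(node)), float(node["label"])
        for child in node["children"]:
            yield from supported_clades(child)


tree = parse_newick(NEWICK)
clades = list(supported_clades(tree))
for members, support in clades:
    print(members, support)
```

The label that follows a closing parenthesis is treated as the bootstrap value of that grouping, and the number after the colon as the branch length.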

Why Multiple Trees?

The file contains one tree per sliding window: the window size and step size parameters in the YAML file determine how the MSA is divided into segments, and a tree is inferred from each segment. This approach allows for the exploration of different evolutionary scenarios and accounts for the inherent uncertainty in phylogenetic inference. Each tree represents a plausible hypothesis supported by the data, highlighting the range of possible evolutionary relationships between the sequences.
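To see how window size and step size translate into tree labels such as '0_199', here is a small sketch. It assumes 0-based, inclusive end positions (consistent with a 200-column window labeled '0_199') and only emits full windows; how aPhyloGeo treats a trailing partial window is not covered here, and the function name `sliding_windows` is hypothetical.

```python
def sliding_windows(alignment_length, window_size, step_size):
    """Yield (start, end) pairs, 0-based and inclusive, for each full window."""
    last_start = alignment_length - window_size
    for start in range(0, last_start + 1, step_size):
        yield start, start + window_size - 1


# Illustrative values only; the real ones come from the YAML configuration.
labels = [f"{start}_{end}" for start, end in sliding_windows(500, 200, 100)]
print(labels)
```

A smaller step size produces more, heavily overlapping windows and therefore more trees, while a larger window size gives each tree more alignment columns to draw on.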

Tools for Visualization

These trees can be opened in any viewer that reads Newick format, such as the Trex tool used for Figure 1.

Important Considerations

Because each tree is inferred from a finite stretch of sequence data, no single window gives a definitive evolutionary history. Each tree represents a plausible hypothesis supported by the data within its window, and the collection as a whole illustrates the range of potential evolutionary scenarios. Comparing the patterns that recur across trees with those that conflict gives a more comprehensive picture of the possible evolutionary relationships among the sequences.

Further Analysis (Optional)

  • Consensus Trees: Calculate a consensus tree that summarizes the most common patterns across all trees.
  • Identify Clades: Focus on specific groups (clades) of interest for further investigation.
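A first step toward a consensus is to count how often each grouping appears across the window trees. The sketch below is a simplified illustration, not a full consensus method: it treats each tree as rooted exactly as written, supports only simple Newick strings, and uses made-up trees. Groupings found in more than half of the trees are the candidates a majority-rule consensus would keep.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[(),:;]|[^(),:;\s]+")
SYMBOLS = set("(),:;")


def clades(newick):
    """Return the leaf sets of all parenthesized groupings in a Newick string."""
    stack, found, previous = [], [], None
    for token in TOKEN.findall(newick):
        if token == "(":
            stack.append(set())
        elif token == ")":
            group = stack.pop()
            found.append(frozenset(group))
            if stack:
                stack[-1] |= group  # leaves also belong to the enclosing group
        elif token not in SYMBOLS and previous in ("(", ","):
            # A name right after "(" or "," is a leaf; names after ")" are
            # support values and names after ":" are branch lengths.
            stack[-1].add(token)
        previous = token
    return found


def clade_frequencies(trees):
    """Fraction of trees in which each non-trivial leaf set appears."""
    counts = Counter()
    for newick in trees:
        groups = clades(newick)
        all_leaves = max(groups, key=len) if groups else frozenset()
        counts.update({g for g in groups if 1 < len(g) < len(all_leaves)})
    return {group: n / len(trees) for group, n in counts.items()}


# Hypothetical window trees (names, lengths, and supports are made up).
window_trees = [
    "(A:0.1,(B:0.2,C:0.1)95:0.1,(D:0.1,E:0.3)80:0.2);",
    "(A:0.1,(B:0.2,C:0.1)90:0.1,(D:0.1,E:0.3)60:0.2);",
    "(B:0.1,(A:0.2,C:0.1)55:0.1,(D:0.1,E:0.3)70:0.2);",
]

freqs = clade_frequencies(window_trees)
majority = {tuple(sorted(g)): f for g, f in freqs.items() if f > 0.5}
print(majority)
```

For unrooted trees, a more careful comparison would count bipartitions (splits) rather than rooted leaf sets; dedicated phylogenetics libraries provide this.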