diff --git a/Pipelines.md b/Pipelines.md
index 6d4eaeb..b709ce1 100644
--- a/Pipelines.md
+++ b/Pipelines.md
@@ -6,6 +6,7 @@ Here three possible usage pipelines of MetaFast toolkit are presented. Each pipe
 * [Pipeline 1. Metagenomic distance estimation](#pipeline-1-metagenomic-distance-estimation)
 * [Pipeline 2. Unique metagenomic features finder](#pipeline-2-unique-metagenomic-features-finder)
 * [Pipeline 3. Specific metagenomic features counter](#pipeline-3-specific-metagenomic-features-counter)
+* [Pipeline 4. Colored metagenomic features finder](#pipeline-4-colored-metagenomic-features-finder)
 * [Format conversion tools](#format-conversion-tools)
 ## Pipeline 1. Metagenomic distance estimation
@@ -170,6 +171,49 @@ This tool is designed to be used with 3 categories of metagenomes and provide th
 java -jar metafast.jar -t kmers-multiple-filters -k -i -cd -uc -nonibd
 `
+## Pipeline 4. Colored metagenomic features finder
+
+Pipeline for extracting group-specific features from metagenomic samples and manipulating them. Feature construction is based on k-mer occurrences in samples from different categories, represented as colored nodes in a de Bruijn graph (see figure below).
+
+![Colored graph](img/pipe4_colors.svg)
+
+The data analysis pipeline is analogous to the [unique features finder](#pipeline-2-unique-metagenomic-features-finder), with new steps for k-mer filtering and feature extraction. Step-by-step data processing is shown in the image below.
+
+![Pipeline 4](img/pipe4.svg)
+
+Order of tools to run:
+
+1. **K-mers counter**
+Extracts k-mers from each metagenomic sample and saves them in an internal binary format for further processing (`workDir/kmers/*.kmers.bin`). This step can be performed separately for metagenomes with known and unknown categories. For convenience, we refer below to samples with known categories as _group\_1.kmers.bin_ ... _group\_N.kmers.bin_ for N categories and to samples with unknown category as _ungroupped.kmers.bin_.
+`
+java -jar metafast.jar -t kmer-counter-many -k -i
+`
+2. **K-mers coloring (group frequencies counter)**
+Counts the occurrence frequency of each k-mer in each category of samples and saves the result in an internal binary format for further processing (`workDir/colored_kmers/colored_kmers.kmers.bin`). The mandatory parameter `--class` requires a tab-separated text file with two columns: sample_name [string] and class [0|1|2]. If the `-val` flag is set, k-mer occurrence is counted as total coverage in samples; otherwise, as the number of samples.
+`
+java -jar metafast.jar -t kmers-color -k -kf --class [-val]
+`
+
+3. **Colored component extractor**
+Extracts graph components from the tangled graph based on k-mer coloring. These subgraph components can be used as features specific to the analyzed category (`workDir/colored-components/components_color_[0|1|2].bin`).
+**_Parameters:_**\
+`--n_groups ` – number of classes (default: 3)\
+`--separate` – use only color-specific k-mers in components (does not work in linear mode)\
+`--linear` – choose the best path at each fork to create linear components\
+`--n_comps ` – select at most X components for each class (default: -1, meaning all components)\
+`--perc` – relative abundance of a k-mer in a group required for it to become color-specific (default: 0.9)\
+`
+java -jar metafast.jar -t component-colored -k -i
+`
+
+4. **Features calculator**
+Counts the coverage of each component (subgraph) by k-mers for each metagenomic sample independently. For each sample, outputs a numerical feature vector of coverages (`workDir/vectors/*.vec`). Feature vectors for samples with known categories can then be used to train a machine learning model that predicts categories for samples with unknown categories.
+`
+java -jar metafast.jar -t features-calculator -k -cm -ka <*.kmers.bin>
+`
+
+
+
 ## Format conversion tools
 #### Binary to Fasta convertor
diff --git a/README.md b/README.md
index 88fddee..a55e5d4 100644
--- a/README.md
+++ b/README.md
@@ -54,9 +54,9 @@
 ant
 ~~~
-## MetaFast 1.5
+## MetaFast 1.3
-A new version of MetaFast software is being prepared for the release. New pipelines for comparative metagenomics data analysis have been implemented. Three recommended use cases (including the original one) and a detailed description of available tools are presented in [Pipelines.md](Pipelines.md)
+A new version of MetaFast software is being prepared for the release. New pipelines for comparative metagenomics data analysis have been implemented. Four recommended use cases (including the original one) and a detailed description of available tools are presented in [Pipelines.md](Pipelines.md)
 ## Running instructions
diff --git a/build.xml b/build.xml
index ef28ce8..c33c521 100644
--- a/build.xml
+++ b/build.xml
@@ -1,5 +1,5 @@
-
+
diff --git a/img/pipe4.svg b/img/pipe4.svg
new file mode 100644
index 0000000..439f9e4
--- /dev/null
+++ b/img/pipe4.svg
@@ -0,0 +1,3 @@
+
+
+
[Figure img/pipe4.svg, Pipeline 4 diagram. Methods: (1) k-mer counter, (2) k-mers coloring, (3) colored component extractor, (4) features calculator. Data: reads (fastq), k-mers (binary), colored k-mers (binary), components (binary), numeric vectors (tsv). Legend: data (format), train data, test data, method.]
\ No newline at end of file
diff --git a/img/pipe4_colors.svg b/img/pipe4_colors.svg
new file mode 100644
index 0000000..d4f04f0
--- /dev/null
+++ b/img/pipe4_colors.svg
@@ -0,0 +1,1813 @@
[Figure img/pipe4_colors.svg, colored de Bruijn graph illustration. Labels: metagenomic dataset; control group (sample 1 ... sample N); disease group (sample N+1 ... sample M); 100-300 bp read (ATCG...ATCG); k length (AT..CG); all k-mers with group frequencies; colored de Bruijn graph; heterogeneous node: 50% control, 50% disease; homogeneous node: 95% control, 5% disease.]
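
Review note: the `--class` file format and the `--perc` threshold added in Pipelines.md may be easier to follow with a small illustration. The Python sketch below is not MetaFast code. The function names `read_class_file` and `color_kmer` are hypothetical, and the rule "a k-mer is color-specific when its largest group share is at least `perc`" is an assumption based on the parameter description and the figure labels (95% control is homogeneous, 50%/50% is heterogeneous). The actual tool may normalize counts or break ties differently.

```python
import csv
import io


def read_class_file(text):
    """Parse a two-column tab-separated table: sample_name, class (0, 1 or 2)."""
    classes = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue  # skip empty lines
        name, label = row[0], int(row[1])
        if label not in (0, 1, 2):
            raise ValueError(f"unexpected class {label!r} for sample {name!r}")
        classes[name] = label
    return classes


def color_kmer(group_counts, perc=0.9):
    """Return the index of the dominant group, or None if the k-mer is mixed.

    group_counts[g] is how often the k-mer occurs in group g (number of
    samples, or total coverage). Assumed rule: the k-mer is color-specific
    when the largest group's share of the total is at least `perc`.
    """
    total = sum(group_counts)
    if total == 0:
        return None
    best = max(range(len(group_counts)), key=lambda g: group_counts[g])
    return best if group_counts[best] / total >= perc else None


if __name__ == "__main__":
    labels = read_class_file("sample_1\t0\nsample_2\t1\n")
    print(labels)                # {'sample_1': 0, 'sample_2': 1}
    print(color_kmer([95, 5]))   # 0 (homogeneous node from the figure)
    print(color_kmer([50, 50]))  # None (heterogeneous node from the figure)
```

With the documented default of 0.9, the homogeneous node in the figure would be assigned to the control group, while the heterogeneous node would stay uncolored under this assumed rule.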