Flye biotools (#1398)
* Update macros.xml

add bio tools id

* Update flye.xml

update flye.xml with bio tools id

* Update flye.xml

FIX the WARNING (edam ontology comes before requirements)

* Update tools/flye/flye.xml

Co-authored-by: Bérénice Batut <[email protected]>

* Try to fix linting

* Try to fix flye tests

---------

Co-authored-by: Björn Grüning <[email protected]>
Co-authored-by: Bérénice Batut <[email protected]>
3 people authored Mar 18, 2024
1 parent 20eabb7 commit acf41fa
Showing 2 changed files with 40 additions and 53 deletions.
88 changes: 35 additions & 53 deletions tools/flye/flye.xml
@@ -3,8 +3,9 @@
<macros>
<import>macros.xml</import>
</macros>
- <expand macro="requirements" />
<expand macro="edam_ontology"/>
+ <expand macro="xrefs"/>
+ <expand macro="requirements" />
<version_command>flye --version</version_command>
<command detect_errors="exit_code"><![CDATA[
#for $counter, $input in enumerate($inputs):
@@ -228,12 +229,12 @@
</output>
<output name="assembly_gfa" ftype="txt">
<assert_contents>
- <has_size value="420252" delta="100"/>
+ <has_size value="419414" delta="100"/>
</assert_contents>
</output>
<output name="consensus" ftype="fasta">
<assert_contents>
- <has_size value="427129" delta="100"/>
+ <has_size value="426277" delta="100"/>
</assert_contents>
</output>
</test>
@@ -252,17 +253,17 @@
</output>
<output name="assembly_graph" ftype="graph_dot">
<assert_contents>
- <has_size value="1273" delta="100"/>
+ <has_size value="1500" delta="100"/>
</assert_contents>
</output>
<output name="assembly_gfa" ftype="txt">
<assert_contents>
- <has_size value="420252" delta="100"/>
+ <has_size value="418422" delta="100"/>
</assert_contents>
</output>
<output name="consensus" ftype="fasta">
<assert_contents>
- <has_size value="427129" delta="100"/>
+ <has_size value="425147" delta="200"/>
</assert_contents>
</output>
</test>
@@ -287,12 +288,12 @@
</output>
<output name="assembly_gfa" ftype="txt">
<assert_contents>
- <has_size value="420252" delta="100"/>
+ <has_size value="418511" delta="100"/>
</assert_contents>
</output>
<output name="consensus" ftype="fasta">
<assert_contents>
- <has_size value="427129" delta="100"/>
+ <has_size value="425267" delta="100"/>
</assert_contents>
</output>
</test>
@@ -301,7 +302,7 @@
<param name="inputs" ftype="fastq.gz" value="ecoli_hifi_01.fastq.gz,ecoli_hifi_02.fastq.gz,ecoli_hifi_03.fastq.gz,ecoli_hifi_04.fastq.gz,ecoli_hifi_05.fastq.gz,ecoli_hifi_06.fastq.gz,ecoli_hifi_07.fastq.gz,ecoli_hifi_08.fastq.gz,ecoli_hifi_09.fastq.gz"/>
<param name="mode" value="--nano-hq"/>
<param name="min_overlap" value="1000"/>
- <param name="scaffolding" value="true"/>
+ <param name="scaffold" value="true"/>
<output name="assembly_info" ftype="tabular">
<assert_contents>
<has_size value="286" delta="100"/>
@@ -314,12 +315,12 @@
</output>
<output name="assembly_gfa" ftype="txt">
<assert_contents>
- <has_size value="420252" delta="100"/>
+ <has_size value="419414" delta="1000"/>
</assert_contents>
</output>
<output name="consensus" ftype="fasta">
<assert_contents>
- <has_size value="427129" delta="100"/>
+ <has_size value="426277" delta="1000"/>
</assert_contents>
</output>
</test>
@@ -353,8 +354,6 @@
</tests>
<help><![CDATA[
- .. class:: infomark
**Purpose**
Flye is a de novo assembler for single molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies.
@@ -364,8 +363,6 @@ assembly.
----
- .. class:: infomark
**Quick usage**
Input reads can be in FASTA or FASTQ format, uncompressed or compressed with gz. Currently, PacBio (raw, corrected, HiFi) and ONT reads
@@ -380,17 +377,13 @@ specifying *--asm-coverage* and *--genome-size* options. Typically, 40x coverage
----
- .. class:: infomark
**Outputs**
The main output files are:
- ::
- - Final assembly: contains contigs and possibly scaffolds (see below).
- - Final repeat graph: note that the edge sequences might be different (shorter) than contig sequences, because contigs might include multiple graph edges.
- - Extra information about contigs (such as length or coverage).
+ * Final assembly: contains contigs and possibly scaffolds (see below).
+ * Final repeat graph: note that the edge sequences might be different (shorter) than contig sequences, because contigs might include multiple graph edges.
+ * Extra information about contigs (such as length or coverage).
Each contig is formed by a single unique graph edge. If possible, unique contigs are extended with the sequence from flanking unresolved repeats on the graph. Thus,
a contig fully contains the corresponding graph edge (with the same id), but might be longer then this edge. This is somewhat similar to unitig-contig relation in
@@ -402,53 +395,42 @@ assembly_info.txt file (below) contains additional information about how scaffol
Extra information about contigs/scaffolds is output into the assembly_info.txt file. It is a tab-delimited table with the columns as follows:
- ::
- * Contig/scaffold id
- * Length
- * Coverage
- * Is circular, (Y)es or (N)o
- * Is repetitive, (Y)es or (N)o
- * Multiplicity (based on coverage)
- * Alternative group
- * Graph path (graph path corresponding to this contig/scaffold).
+ - Contig/scaffold id
+ - Length
+ - Coverage
+ - Is circular, (Y)es or (N)o
+ - Is repetitive, (Y)es or (N)o
+ - Multiplicity (based on coverage)
+ - Alternative group
+ - Graph path (graph path corresponding to this contig/scaffold).
- Scaffold gaps are marked with ?? symbols, and * symbol denotes a terminal graph node. Alternative contigs (representing alternative haplotypes) will have the same alt.
- group ID. Primary contigs are marked by *.
+ Scaffold gaps are marked with `??` symbols, and `*` symbol denotes a terminal graph node. Alternative contigs (representing alternative haplotypes) will have the same alt.
+ group ID. Primary contigs are marked by `*`.
----
- .. class:: infomark
**Algorithm Description**
This is a brief description of the Flye algorithm. Please refer to the manuscript for more detailed information. The draft contig extension is organized as follows:
- ::
- - K-mer counting / erroneous k-mer pre-filtering
- - Solid k-mer selection (k-mers with sufficient frequency, which are unlikely to be erroneous)
- - Contig extension. The algorithm starts from a single read and extends it with a next overlapping read (overlaps are dynamically detected using the selected solid k-mers).
+ * K-mer counting / erroneous k-mer pre-filtering
+ * Solid k-mer selection (k-mers with sufficient frequency, which are unlikely to be erroneous)
+ * Contig extension. The algorithm starts from a single read and extends it with a next overlapping read (overlaps are dynamically detected using the selected solid k-mers).
Note that we do not attempt to resolve repeats at this stage, thus the reconstructed contigs might contain misassemblies. Flye then aligns the reads on these draft
contigs using minimap2 and calls a consensus. Afterwards, Flye performs repeat analysis as follows:
- ::
- - Repeat graph is constructed from the (possibly misassembled) contigs
- - In this graph all repeats longer than minimum overlap are collapsed
- - The algorithm resolves repeats using the read information and graph structure
- - The unbranching paths in the graph are output as contigs
+ * Repeat graph is constructed from the (possibly misassembled) contigs
+ * In this graph all repeats longer than minimum overlap are collapsed
+ * The algorithm resolves repeats using the read information and graph structure
+ * The unbranching paths in the graph are output as contigs
If enabled, after resolving bridged repeats, Trestle module attempts to resolve simple unbridged repeats (of multiplicity 2) using the heterogeneities between repeat copies.
Finally, Flye performs polishing of the resulting assembly to correct the remaining errors:
- ::
- - Alignment of all reads to the current assembly using minimap2
- - Partition the alignment into mini-alignments (bubbles)
- - Error correction of each bubble using a maximum likelihood approach
+ * Alignment of all reads to the current assembly using minimap2
+ * Partition the alignment into mini-alignments (bubbles)
+ * Error correction of each bubble using a maximum likelihood approach
The polishing steps could be repeated, which might slightly increase quality for some datasets.
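The changed `has_size` expectations in the tests can be sanity-checked with a small sketch. It assumes that Galaxy's `has_size` assertion passes when the output size differs from `value` by at most `delta` bytes, and that the first value of each pair in this diff is the old expectation. The helper name `within_delta` is hypothetical; the numbers are the `assembly_gfa` sizes from the first changed test.

```python
def within_delta(actual_size: int, value: int, delta: int) -> bool:
    """Return True when actual_size is within +/- delta bytes of value."""
    return abs(actual_size - value) <= delta


# Values taken from the first changed assembly_gfa assertion in this diff.
old_value, new_value, delta = 420252, 419414, 100

# Assumption: the tool now writes a file of exactly new_value bytes.
gap = abs(new_value - old_value)
print(gap)                                        # 838
print(within_delta(new_value, old_value, delta))  # False
print(within_delta(new_value, new_value, delta))  # True
```

Under these assumptions the old expectation misses by 838 bytes, more than the 100-byte tolerance, which would explain why the expected sizes (and in some tests the `delta`) had to be updated.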
5 changes: 5 additions & 0 deletions tools/flye/macros.xml
@@ -14,6 +14,11 @@
<edam_operation>operation_0525</edam_operation>
</edam_operations>
</xml>
+ <xml name="xrefs">
+ <xrefs>
+ <xref type="bio.tools">Flye</xref>
+ </xrefs>
+ </xml>
<xml name="citations">
<citations>
<citation type="doi">10.1073/pnas.1604560113</citation>
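
The element order this commit adopts in `flye.xml` (EDAM ontology, then xrefs, then requirements) can be checked mechanically. This is a minimal sketch that only compares against the order used in this commit; the function names `expand_order` and `is_ordered`, the `EXPECTED` list, and the inline XML snippet (including its `id` and `name` attributes) are illustrative and not part of the repository or of any Galaxy linting API.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down tool file mirroring the order after this commit.
TOOL_XML = """
<tool id="flye" name="Flye">
    <macros>
        <import>macros.xml</import>
    </macros>
    <expand macro="edam_ontology"/>
    <expand macro="xrefs"/>
    <expand macro="requirements"/>
</tool>
"""

# Assumed target order, taken from the diff above.
EXPECTED = ["edam_ontology", "xrefs", "requirements"]


def expand_order(xml_text):
    """List the macro names of top-level <expand> elements in document order."""
    root = ET.fromstring(xml_text.strip())
    return [el.get("macro") for el in root.findall("expand")]


def is_ordered(found, expected):
    """Check that the names from expected that occur in found keep their relative order."""
    positions = [found.index(name) for name in expected if name in found]
    return positions == sorted(positions)


print(is_ordered(expand_order(TOOL_XML), EXPECTED))
```

The check ignores macros that are not listed in `EXPECTED`, so it stays valid when a tool file expands additional macros.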
