Skip to content
cpockrandt edited this page Dec 1, 2020 · 15 revisions

If any of your questions are not answered here, feel free to send an e-mail to pockrandt [ÄT] jhu.edu.

Which algorithm to choose for indexing?

GenMap offers two algorithms for suffix array construction (needed for building the FM index) with different resource requirements and running times:

For fasta file(s) with n < 2 GB libdivsufsort (-A divsufsort) needs about 6n bytes of main memory (RAM). For larger files with n > 2 GB it needs 10n bytes of main memory. Skew7 (-A skew) needs less main memory, but at least 20-25n of secondary memory. In terms of running time, libdivsufsort is significantly faster. Space consumption can vary depending on the number of sequences and the length of the individual sequences.

Example: Indexing the human genome takes about half an hour with libdivsufsort and over 2 hours with Skew7. The main memory peak for libdivsufsort is 31 GB, for Skew7 it is only 8 GB.

If you want to reduce the main memory consumption of libdivsufsort, you can try to incresae the parameter -S 10 (up to 64). This will also make the entire index smaller and increase the running time of the mappability computation. I am currently working on significantly reducing the main memory requirements for libdivsufsort in the next release of GenMap.

If you build the index with both algorithms on the same data, the indices won't be identical. This is because both algorithms use a different order of suffices with respect to the sentinels, but the mappability output of GenMap will always be identical for both indices.

Which alphabets are supported?

GenMap can only handle nucleotide sequences (A, C, G, T/U, N). If you load files including other letters (such as IUPAC notation for ambiguous bases), GenMap will print a warning and convert them to N.

How to load raw files (.map, .freq8, .freq16) in C++?

If you want to load the raw output with the frequency or mappability vector into your program, you can use the following snippet:

#include <vector>
#include <fstream>
#include <iostream>
#include <iterator>

template <typename value_t>
void load(std::vector<value_t> & vec, std::string && path)
{
    std::ifstream file(path, std::ios::binary);
    if (!file.eof() && !file.fail())
    {
        file.seekg(0, std::ios_base::end);
        std::streampos fileSize = file.tellg();
        vec.resize(fileSize / sizeof(value_t));
        file.seekg(0, std::ios_base::beg);
        file.read(reinterpret_cast<char*>(&vec[0]), fileSize);
        file.close();
        return;
    }
    // something went wrong ...
}

int main(int argc, char ** argv)
{
    // Example: load the mappability vector
    std::vector<float> mappability;
    load(mappability, "c.map");
    // print mappability vector
    std::copy(mappability.begin(), mappability.end(), std::ostream_iterator<float>(std::cout, " "));
    std::cout << '\n';

    // Example: load the frequency vector (for freq16 files please use uint16_t)
    std::vector<uint8_t> frequency;
    load(frequency, "c.freq8");
    // print frequency vector
    std::copy(frequency.begin(), frequency.end(), std::ostream_iterator<int>(std::cout, " "));
    std::cout << '\n';

    return 0;
}

How to compute the mappability on multiple fasta files or genomes?

GenMap can compute the mappability of more than one genome or fasta file. The index can be constructed on multiple files by passing the directory (genmap index -FD /path/to/directory/with/fasta/files -I /multi/fasta/index/output). It will index all fa/fasta/fastq files in the directory, but it will not traverse the directory recursively, i.e., fasta files in sub-directories will not be considered.

The mappability is computed just like on a single fasta file by passing the index (genmap map -I /multi/fasta/index/output ...). It will count for each k-mer in every fasta file how often it occurs in all indexed fasta files. Mappability files are written to the output directory for each indexed fasta file separately.

There is an additional feature to only count an occurrence of a k-mer only once per fasta file. By passing --exclude-pseudo the mappability algorithm will only count the number of indexed fasta files / genomes that contain the k-mer at least once, i.e., if a k-mer occurs (with errors) multiple times in a single fasta file, it is only counted once.

CSV format for detailed mappability information

Running GenMap with --csv will output a csv file for every fasta file that was indexed. It gives detailed information where every k-mer maps to with up to e errors.

Consider a single fasta file with one sequence. We take the example from the README file.

Example mappability

The csv file will list the occurrences for all k-mers, one k-mer per line. Positions are represented as tuples. The first value refers to the sequence number (since this example only has one sequence, it will be 0), the second one to the position within the corresponding sequence. All indices are 0-based.

The second line in the csv file (for position 1 with the 4-mer TCTA), would look like this:

0,1;0,1|0,6|0,11

The first column 0,1 indicates that this line lists the approximate occurrences of the k-mer from the first (0th) sequence at position 1. The second column 0,1|0,6|0,11 lists all of its approximate occurrences separated by vertical bars |, i.e., positions 1, 6 and 11 in the first sequence.

Sequences are numbered in the way they occur in the fasta file.

If the flag for the reverse complement is set, the csv file will have a third column, such that occurrences of both strands are separated. Assuming the fasta file contains sequences in 5′-to-3′ direction, the 2nd column will contain the occurrences on the plus strand (5′-to-3′) and the 3rd column will contain the occurrences on the minus strand (3′-to-5′). The position for k-mer occurrences on the negative strand refers to the beginning of the k-mer counting from the 5'-end. In other words, it refers to the location of the reverse complement of the k-mer in the original sequence, i.e., the matching k-mer 0,3 on the reverse strand is rev_compl(T[3..6]) = rev_compl(TAGG) = CCTA.

The csv line above would look as follows with a match on the minus strand at position 3:

0,1;0,1|0,6|0,11;0,3