
smudgeplot hetkmers

Kamil S. Jaron edited this page Jan 8, 2025 · 6 revisions

This algorithm extracts k-mer pairs from a FastK k-mer database. The most computationally relevant parameter is L, the count threshold above which k-mers are treated as genomic. It is usually set to a value that cleanly separates the error k-mers from the first genomic peak of the k-mer spectrum. See the wiki page on choosing L and U for details.
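As a rough illustration of "separating the errors from the first genomic peak", the sketch below picks the first local minimum of a k-mer histogram. This is only a simple heuristic under stated assumptions, not the procedure described on the choosing L and U page: the function name `first_valley` and the toy histogram are hypothetical, and real spectra are noisier and usually deserve a visual check.

```python
def first_valley(hist):
    """Return the coverage at the first local minimum of a k-mer histogram.

    hist maps coverage -> number of distinct k-mers seen with that coverage.
    Returns None when the histogram has no interior local minimum.
    """
    covs = sorted(hist)
    for prev, cur, nxt in zip(covs, covs[1:], covs[2:]):
        # A valley: no higher than the left neighbour, strictly lower than the right one.
        if hist[cur] <= hist[prev] and hist[cur] < hist[nxt]:
            return cur
    return None


# Toy spectrum (made-up numbers): error k-mers pile up at low coverage,
# then the counts rise again towards the first genomic peak.
toy_hist = {1: 9000, 2: 3000, 3: 900, 4: 400, 5: 350,
            6: 600, 7: 1500, 8: 3200, 9: 4100, 10: 3600}

candidate_L = first_valley(toy_hist)
print("candidate L:", candidate_L)
```

The value found this way is only a starting point for L; the histogram itself has to come from a separate k-mer counting step that is not shown here.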

Usage

usage: smudgeplot hetkmers [-h] [-L L] [-t T] [-o O] [-tmp TMP] [--verbose] [infile]

Calculate unique kmer pairs from FastK k-mer database.

positional arguments:
  infile      Input FastK database (.ktab) file.

options:
  -h, --help  show this help message and exit
  -L L        Count threshold below which k-mers are considered erroneous
  -t T        Number of threads (default 4)
  -o O        The pattern used to name the output (kmerpairs).
  -tmp TMP    Directory where all temporary files will be stored (default /tmp).
  --verbose   verbose mode
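
For scripted runs, the command line can be assembled from the options listed above. The helper below is a minimal sketch: the function name `hetkmers_command`, the input file `sample.ktab` and the value `L=12` are placeholders, and the `-o` default of `kmerpairs` is an assumption read from the help text. It only builds the argument list and does not run anything.

```python
import shlex


def hetkmers_command(infile, L, threads=4, out="kmerpairs", tmp="/tmp", verbose=False):
    """Build the argument list for `smudgeplot hetkmers` using only the documented options."""
    cmd = ["smudgeplot", "hetkmers",
           "-L", str(L),        # count threshold below which k-mers are considered erroneous
           "-t", str(threads),  # number of threads
           "-o", out,           # pattern used to name the output
           "-tmp", tmp]         # directory for temporary files
    if verbose:
        cmd.append("--verbose")
    cmd.append(infile)          # positional FastK .ktab database
    return cmd


# Placeholder input and threshold, for illustration only.
cmd = hetkmers_command("sample.ktab", L=12)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once smudgeplot is installed and on the PATH.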

Output

The output file is <output_pattern>_text.smu. This coverage file has the following format:

10	10	9196
10	11	15000
10	12	12912
11	11	6324
10	13	10440
11	12	10526
10	14	8578
...

where the three columns are covB (the coverage of the less covered k-mer of a pair), covA (the coverage of the more covered k-mer) and freq, the number of k-mer pairs observed with that combination of coverages. The less covered k-mer is always in the first column. At this point, the sequences of the k-mers can no longer be retrieved.
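
The three-column layout described above is straightforward to load for further inspection. The following is a minimal sketch that assumes whitespace-separated integer columns, as in the excerpt; the function name `read_smu` is hypothetical, and the summed pair coverage is just one example of a quantity that could be derived downstream.

```python
import io


def read_smu(handle):
    """Parse covB, covA, freq rows from an open coverage file into a list of int tuples."""
    pairs = []
    for line in handle:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        cov_b, cov_a, freq = map(int, fields)
        pairs.append((cov_b, cov_a, freq))
    return pairs


# The first three rows of the example above, used here as in-memory test data.
sample = io.StringIO("10\t10\t9196\n10\t11\t15000\n10\t12\t12912\n")
pairs = read_smu(sample)

total_pairs = sum(freq for _, _, freq in pairs)
print("k-mer pairs in sample:", total_pairs)

for cov_b, cov_a, freq in pairs:
    # covB never exceeds covA because the less covered k-mer comes first.
    print(cov_b, cov_a, "summed coverage:", cov_a + cov_b, "pairs:", freq)
```

For a real file, replace the `io.StringIO` object with `open("<output_pattern>_text.smu")`, substituting the output pattern that was passed to `-o`.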