Skip to content

autosome-ru/kmer-motif-enrichment

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 

Repository files navigation

K-mer motif enrichment

A script for selecting sequences enriched with k-mers from a sequence set.

Requirements

  • Python 3
  • NumPy
  • Pandas
  • SciPy
  • Matplotlib (for charting)
  • Seaborn (for charting)
  • BioPython

How to use

Command syntax:

extract_topk.py [-h] [--fastq fastq] [--out fasta] [-d] [--sample N] [-k K] [-a P] [-b P] [-n N] [-f F] [--prefix sequence] [--suffix sequence] [-i] [--chart svg/png/etc] [--title TITLE] [--seed SEED] [-v]

Finds top unique sequences by their highest k-mer enrichment for further motif discovery. The script does NOT support ambiguous IUPAC codes (ATGC only).

Arguments:
  -h, --help           show this help message and exit
  --fastq fastq        input fastq file (default: stdin)
  --out fasta          output fasta file (default: stdout)
  -d, --dinucl         use dinucleotide shuffling instead of mononucleotide
  --sample N           size of sequence subset to analyze (default: 0 [no sampling])
  -k K                 length of a k-mer (default: 4)
  -a P, --alpha P      quantile level of 'baseline' sequences [0 to 1) or the number
                       of non-baseline sequences (1, 2, ...) (default: 100'000).
                       'Single-hit' sequences are always excluded.
  -b P, --beta P       threshold 'baseline' level for 'non-baseline' [0 to 1) or
                       the number of non-baseline sequences (1, 2, ...) (default: 0.99)
  -n N                 length of N-flank to be added (default: 0)
  -f F                 length of non-N-flank to be added. If either prefix or suffix
                       are shorter than F, N-flanks are extended to match the length.
                       To avoid this, use default F (-1), the flanks will be added as-is.
  --prefix sequence    prefix sequence. Only the last F letters are prepended to
                       the output, unless F is not -1.
  --suffix sequence    suffix sequence. Only the first F letters are appended to
                       the output, unless F is not -1.
  -i, --noinvert       don't summarize counts of the reverse-complement k-mers in the score
  --chart svg/png/etc  trace an enrichment-relevance chart (optional)
  --title TITLE        use custom title for the chart (used only when the parameter
                       --chart is specified)
  --seed SEED          seed for RNG (default: 13)
  -v, --verbose        verbose output

Example usage

  cat selex_cycle4.fastq.gz | extract_topk.py -v > kmer_ext.fasta 2> k-mer_ext.log
  extract_topk.py --fastq selex_cycle4.fastq --out kmer_ext.fasta
  extract_topk.py --fastq selex_cycle4.fastq --out kmer_ext.fasta -k 5 --chart selex_cycle4_5mers.png --title "Cycle 4 5-mer enrichment"
  extract_topk.py --fastq selex_cycle4.fastq.gz --out lo-fi_kmer_ext.fasta -a 0.8 -b 0.8
  extract_topk.py --fastq selex_cycle4.fastq.gz --out flanked_kmer_ext.fasta -n 5 -f 3 --prefix TGACTA --suffix AAGATC

Citing and authors

This script was used for motif analysis of HT-SELEX data for HOCOMOCO 12 database. Please cite:

HOCOMOCO in 2024: a rebuild of the curated collection of binding models for human and mouse transcription factors

Ilya E Vorontsov, Irina A Eliseeva, Arsenii Zinkevich, Mikhail Nikonov, Sergey Abramov, Alexandr Boytsov, Vasily Kamenets, Alexandra Kasianova, Semyon Kolmykov, Ivan S Yevshin, Alexander Favorov, Yulia A Medvedeva, Arttu Jolma, Fedor Kolpakov, Vsevolod J Makeev, Ivan V Kulakovskiy

Nucleic Acids Research, Volume 52, Issue D1, 5 January 2024, Pages D154–D163

doi: 10.1093/nar/gkad1077

The algorithm was initially proposed by Philipp Bucher, and the further modifications and changes were proposed by Ivan Kulakovskii and Arsenii Zinkevich. The final script was written by Arsenii Zinkevich.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages