Scripts and background to determine short tandem repeat (STR) densities in genes and genomes
Genes contain the information necessary for living cells to create proteins. Read this wikiversity article to find out more about how genes are made up of exons, introns, and more.
There are around 20'000 protein coding genes in the human genome. What may be surprising is that there are many more proteins than there are genes, with estimates of the number of human proteins ranging between 80'000 and 400'000. This means that a single gene has to be able to produce several different protein products. This is possible through a process known as alternative splicing. The different protein products generated from the same gene are known as isoforms. Wikipedia has a good overview of alternative splicing.
Short tandem repeats (STRs) are consecutive repetitions of 1-6 basepair (bp) motifs. They are estimated to make up around 3% of the human genome. Below is an example STR locus:
STRs are a rich source of genomic variation, with some loci having mutation rates up to a 10000 times higher than point mutations. STR mutations are typically the result of DNA polymerase slippage during replication, where strand misalignment after polymerase detachment results in the insertion or deletion of one or more repeat units at an STR locus:
For more background on STRs, the way they mutate, and why they are relevant, you can have a look at our recent review: Mutation and selection processes regulating short tandem repeats give rise to genetic and phenotypic diversity across species (there's also a pdf in the literature folder).
We are interested in determining whether the evolutionary age of genes (where they first emerged in the tree of life) has a relationship to the number of STR loci. Specifically, the goal of this project is to write software that calculates the number and density of STRs in genes. This would provide clues to understanding the mechanisms behind how genes emerge. Ideally, the software should provide specific statistics for intronic, exonic, and/or coding regions.
In order to write software to do this, it is important to understand how genetic data is represented computationally, and how we can work with such data. The most important formats to be aware of initially are FASTA, BED, and GFF/GTF. For each of these, different libraries exist that can be used to parse them.
- GitHub: Source control is critical in any non-trivial project.
- miniconda: Super useful for managing and separating work environments. Is used mainly for setting up Python environments, but can also be used for R.
- bedtools: "...a swiss-army knife of tools for a wide-range of genomics analysis tasks.". Useful command line tools for working with bed files, fasta files, etc.
- Biopython: Library for biological computation in Python. The SeqIO module can be useful to read and write sequences (fasta, fastq, etc.), although I tend to prefer pysam these days (see below).
- pysam: Python wrapper for the original htslib C library. Focused on reading/writing alignment files, but also supports reading and writing of fasta, vcf, and more.
- gtftools: Software that can parse GTF files and provides functionality useful for this project. Importantly, gtftools can merge the exonic regions for different isoforms of a gene. This will be useful when determining the overlap between STRs, genes, and their exons. There is a paper describing gtftools in the literature folder.
Goal: use some test files located (in data) to get familiar with the problem we are trying to address.
A good starting point will be to determine the number of STR positions across all STRs specified in this BED file, overlap the gene region of the SAMD11 gene.
The coordinates of the SAMD11 gene are specified in this GTF file.
According to my tests, there are 68 nucleotides in the entire SAMD11 gene sequence that consist of STRs specified in the test.bed
file.
If you manage to get to the same value, you can try the next challenge, which will be to determine the amount of overlap between STRs in test.bed
and the exonic regions of the SAMD11 gene.
According to my tests, the answer here should be 24 nucleotides.
I will leave it up to you to figure out how you want to tackle this, but it may be useful to merge all overlapping exons from the different SAMD11 transcripts using.
This could be done using gtftools, for example, although you will need to figure out how to install it.
The most important part: don't get too frustrated or discouraged if something doesn't work. It is expected that it will not work perfectly the first time. This will be an iterative process where we figure out the best way to do this. Don't hestitate to reach out if you have any questions or are stuck.
Goal 1: create a standalone python script that calculates STR densities in genes and has a command-line interface (CLI). To implement the CLI, you can use python's argparse library. There are many tutorials online that show how to use it. The CLI should take at least two arguments: one that specifies the filepath to a GTF file with gene annotations, and one that specifies the filepath to a BED file that contains STR regions. The script should then calculate the STR density of each gene and return this to the user.
Goal 2: create a folder in this GitHub repository called scripts
that we can use to store our program in.
The script that you create in Goal 1
should be stored here, for example.
To learn more about Git, take a look at this page.
For now, you should propably only use the Add, Commit, and Push commands.
You can use these commands either from the command line or through VSCode (again, there are many tutorials online for this).
Good job on implementing the script!
I made a few small changes to the calculate_density()
function so it runs a bit faster.
I also propose some changes to the format in which the results are returned by the script:
- Currently, the output is just written to stdout using
print()
, please also add some functionality to write the results to an output file specified via a--output
commmand line argument. - Please return the output in CSV format so it will be easier to work with in the future.
You can make a pandas DataFrame that contains the results and then use
pd.DataFrame.to_csv()
to write it to a file. - Currently, the script only returns the
gene_name
. Please also return thegene_id
as well.
Once you've implemented these changes, you can run the script on the full dataset.
You can find the BED file we'll use here (the file you need is called 'GRCh38_repeats.bed.gz').
You can download it and then uncompress it at the command line using gunzip GRCh38_repeats.bed.gz
.
You can download a complete GTF file of the human genome directly from GENCODE (download the 'Basic gene annotation' GTF file for the 'CHR' regions).
Get the files, run the script, and store the output.
Then we can get started on the analysis!
By now you should have the STR density of all human genes in the GENCODE GTF file.
This is the point where we switch from pure coding to investigating biological questions.
The main motivation for this project was to determine if there is a link between the evolutionary age of a gene and it's STR content.
To find the evolutionary ages of the genes in our dataset, you can use GenOrigen.
Specifically, this CSV file contains the gene ages for human protein coding genes.
You can map the gene ages contained in this CSV to the genes you already analysed by matching on the ensembl_gene_id
.
Hint: you might have to remove the ENSEMBL version numbers from the gene ids in your dataset first. For example, ENSG00000290825.1 -> ENSG00000290825