diff --git a/doc/examples/pairtools_phase_walkthrough.ipynb b/doc/examples/pairtools_phase_walkthrough.ipynb index ad0db57..279de7c 100644 --- a/doc/examples/pairtools_phase_walkthrough.ipynb +++ b/doc/examples/pairtools_phase_walkthrough.ipynb @@ -7,17 +7,21 @@ "source": [ "# Pairtools phase walkthrough\n", "\n", - "Welcome to the pairtools phase walkthrough!\n", + "## Pairtools Phase Walkthrough\n", "\n", - "Haplotype-resolved Hi-C is a popular technique that helps you to resolve contacts of homologous chromosomes. \n", - "It relies on a simple idea tha homologous chromosomes have variations (e.g., SNPs) that are inherited together as **haplotypes**. DNA reads in Hi-C will have the SNVs from one of two haplotypes, which can be used to distinguish the contacts on the same chromosome (*cis-homologous*) and contacts connecting two homologs (*trans-homologous*). \n", + "Welcome to the pairtools phase walkthrough! This notebook will guide you through the process of resolving contacts between homologous chromosomes using haplotype-resolved Hi-C analysis.\n", "\n", - "The experimental challenge of the haplotype-resolved Hi-C is to increase the number of SNPs that distinguish reads from different chromosomes. This can be dome by mating highly diverged. \n", + "## What is haplotype-resolved Hi-C?\n", "\n", - "- Erceg et al. 2019 create highly heterozygous embryos of *Drosophila* [1] \n", - "- Collombet et al. 2020 create highly polymorphic F1 hybrid embryos obtained by crossing female *Mus musculus domesticus* (C57Bl/6J) with male *Mus musculus castaneus* CAST/EiJ) to resolve structures of individual chromosomes in the zygote and embryos [2] \n", - "- Tan et al. 2018 uses available heterozygous positions to infer the 3D structures of single chromosomes by single-cell variant of the protocol Dip-C [3] \n", - "- Duan et al. use dikaryonic nuclei of fungi with 0.7% heterozygosity [4]" + "Haplotype-resolved Hi-C distinguishes interactions within individual chromosomes (cis-homolog contacts) from those between homologous chromosomes (trans-homolog contacts). This separation is possible because homologous chromosomes carry variations (e.g. single nucleotide variants, or, SNVs) that can be used to tell them apart.\n", + "\n", + "The experimental challenge of haplotype-resolved Hi-C is to increase the number of SNVs that are essential to distinguish reads from different chromosomes. This can be done by mating highly diverged homozygous strains and studying their F1 progeny. \n", + "\n", + "Several studies have successfully leveraged haplotype-resolved Hi-C for novel insights:\n", + "1. Erceg et al. (2019) explored chromosome pairing in Drosophila embryos [1].\n", + "2. Collombet et al. (2020) studied chromosomal organization during early mammalian embryogenesis [2].\n", + "3. Tan et al. 2018 uses available heterozygous positions to infer the 3D structures of single chromosomes by single-cell variant of the protocol Dip-C [3]\n", + "4. Duan et al. use dikaryonic nuclei of fungi with 0.7% heterozygosity [4]" ] }, { @@ -30,18 +34,18 @@ "id": "c3795661-e308-44e6-9b0f-3f0396541250", "metadata": {}, "source": [ - "In `pairtools` we implement an approach to resolving haplotypes from Erceg et al. The outline of haplotype-resolved parsing of pairs:\n", - "\n", - "1. [Create the reference genome](#Create-the-reference-genome): create the concatenated reference genomes from two haplotypes. \n", + "Several approaches have been developed to process Hi-C data from haplotype-resolved experiments. In `pairtools`, we implement the approach that was used in Erceg et al. Here is its brief outline:\n", "\n", - " Usually the SNVs are known and can be obtained in .vsf format. We will incorporate the SNVs by [bcftools](https://samtools.github.io/bcftools/bcftools.html) into the reference and create updated fasta files with haplotype-corrected sequences.\n", - " For each homologue we will add the suffixes that identify the type of homologue (`_hap1` or `_hap2`).\n", + "1. [Create the reference genome](#Create-the-reference-genome): create a \"concatenated\" reference genome that contains sequences of both homologs of each chromosome. \n", "\n", - "2. Map the Hi-C data to the concatenated reference and parse allowing multimappers (mapq 0). \n", + " - Incorporate known SNVs (usually in .vcf format) into the reference genome using [bcftools](https://samtools.github.io/bcftools/bcftools.html) to create FASTA files with the sequences of both homologs.\n", + " - Add suffixes to the name of each homolog that identify the type (`_hap1` or `_hap2`).\n", "\n", - " We will also need the mapper to report two suboptimal alignments (aka the second and the third hit).\n", - " When the Hi-C read is mapped to some location in the genome, it will have the suffix of the homologue reported as part of chromosome name.\n", - " However, the true resolved pairs are not yet known at this step. \n", + "2. Map the Hi-C data to the concatenated reference and parse resulting alignment into Hi-C pairs. Compared to the standard Hi-C pipeline, this step would contain a couple of modifications:\n", + " - parse allowing multimappers (mapq 0). \n", + " - make the aligner report two suboptimal alignments (aka the second and the third hit).\n", + " \n", + " Note that, upon mapping to the homolog-resolved genome, Hi-C reads will report the identity of their homologue as the suffix of the chromosome name.\n", " \n", " See sections:\n", " \n", @@ -64,13 +68,12 @@ " '1' (second haplotype). \n", " \n", " \n", - " \n", " Phasing schema: \n", " \n", "![image.png](attachment:62e74fba-c1c1-44b5-a3e2-3699c3cac7ce.png)\n", "\n", "\n", - "4. Post-procesing. Do sorting, dedup and stats, as usual. \n", + "4. Post-procesing. Sort and dedup Hi-C pairs and calculate stats, similarly to the standard Hi-C pipeline. \n", "\n", " See sections:\n", " \n", @@ -98,8 +101,9 @@ "id": "a0b4c550-8168-4780-82e0-1e18493135af", "metadata": {}, "source": [ - "We will test on a sample from Collombet et al. 2019 [2], example of mouse single-cell Hi-C on embryos obtained from highly heterozygous parents. We will take some cell from the dataset, GSM3691125_2CSE_70. \n", - "Note that becuase the procedure is not strictly Hi-C, the properties of this dataset may differ from what you may obtain on bulk data. " + "We will test this pipeline on a sample from Collombet et al. 2019 [2], which is a great example of single-cell Hi-C obtained on mice hybrids of highly heterozygous parents. For the sake of brevity, we will focus on just one cell from the dataset, GSM3691125_2CSE_70. \n", + "\n", + "Note that, because this is a single-cell sample, the properties of this dataset may differ from what you may obtain on bulk data. " ] }, { @@ -109,24 +113,63 @@ "source": [ "## Create the reference genome\n", "\n", - "For phasing, map the data to the concatenated genome with two haplotypes. \n", - "Obtaining such genome is not a simple task. You will need a reference genome, and one or two lists of mutations to instroduce to the reference.\n", + "To phase input reads, we need to map the data to the concatenated genome with two haplotypes. \n", + "Below, we will generate such genome in several steps. You will need a reference genome, and one or two lists of mutations to instroduce to the reference.\n", "\n", "#### Download reference genome" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 5, "id": "9ec0743f-a299-43f0-b568-7e963ed95df8", "metadata": { "tags": [ "hide-output" ] }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--2024-03-18 13:18:25-- https://ftp.ensembl.org/pub/release-68/fasta/mus_musculus/dna/Mus_musculus.GRCm38.68.dna_sm.toplevel.fa.gz\n", + "Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.169\n", + "Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.169|:443... connected.\n", + "HTTP request sent, awaiting response... 200 OK\n", + "Length: 861993605 (822M) [application/x-gzip]\n", + "Saving to: ‘Mus_musculus.GRCm38.68.dna_sm.toplevel.fa.gz’\n", + "\n", + "Mus_musculus.GRCm38 100%[===================>] 822.06M 2.43MB/s in 5m 35s \n", + "\n", + "2024-03-18 13:24:01 (2.45 MB/s) - ‘Mus_musculus.GRCm38.68.dna_sm.toplevel.fa.gz’ saved [861993605/861993605]\n", + "\n" + ] + } + ], + "source": [ + "! wget https://ftp.ensembl.org/pub/release-68/fasta/mus_musculus/dna/Mus_musculus.GRCm38.68.dna_sm.toplevel.fa.gz" + ] + }, + { + "cell_type": "markdown", + "id": "6f31f657", + "metadata": {}, + "source": [ + "The genome must be indexed for subsequent steps. However, indexing requires the genome to be compressed with bgzip." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "79e6471d", + "metadata": {}, "outputs": [], "source": [ - "! wget ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa" + "%%bash\n", + "zcat Mus_musculus.GRCm38.68.dna_sm.toplevel.fa.gz | bgzip -c > Mus_musculus.GRCm38.68.dna_sm.toplevel.fa.bgz\n", + "samtools faidx Mus_musculus.GRCm38.68.dna_sm.toplevel.fa.bgz\n", + "rm Mus_musculus.GRCm38.68.dna_sm.toplevel.fa.gz" ] }, { @@ -139,16 +182,31 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 4, "id": "4a347a3b-2ee7-4824-a209-8377edddf640", "metadata": { "tags": [ "hide-output" ] }, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--2024-03-18 13:18:14-- https://ftp.ebi.ac.uk/pub/databases/mousegenomes/REL-1505-SNPs_Indels/strain_specific_vcfs/CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.193.165\n", + "Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.193.165|:443... connected.\n", + "HTTP request sent, awaiting response... 200 OK\n", + "Length: 785849127 (749M) [application/x-gzip]\n", + "Saving to: ‘CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz.2’\n", + "\n", + " CAST_EiJ.mgp.v5.sn 0%[ ] 4.82M 2.42MB/s ^C\n" + ] + } + ], "source": [ - "! wget ftp://ftp-mouse.sanger.ac.uk/current_snps/strain_specific_vcfs/CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz" + "! wget https://ftp.ebi.ac.uk/pub/databases/mousegenomes/REL-1505-SNPs_Indels/strain_specific_vcfs/CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz" ] }, { @@ -161,7 +219,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 6, "id": "84cebce3-29c6-42df-98bf-5388a51fb268", "metadata": {}, "outputs": [], @@ -174,30 +232,130 @@ "id": "2dd599a0-64f9-4c8b-b78f-8eabf49c052e", "metadata": {}, "source": [ - "#### Introduce the variants into the genome\n", - "\n", - "Note that you may select the variants that are only SNPs but not SNVs (deletions/insertions) by using `--include` parameter of `bcftools consensus` (e.g. `--include '(STRLEN(REF)=1) & (STRLEN(ALT[0])=1)'`).\n", - "This will make sure that the genomic coorditates correspond between the haplotypes. \n", - "Correspondence of coordinates is not a requirement, but might be important for downstream analysis. " + "#### Introduce the variants into the genome" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 16, "id": "848c9fe5-a632-4139-ba56-60871d8d1eb4", "metadata": { - "tags": [ - "hide-output" - ] + "tags": [] }, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Warning: Sequence \"JH584295.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584292.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456368.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456396.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456359.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456382.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456392.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456394.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456390.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456387.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456381.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456370.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456372.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456389.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456378.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456360.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456385.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456383.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456213.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456239.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456367.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456366.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456393.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456216.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456379.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584304.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456212.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584302.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584303.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456210.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456219.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584300.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584298.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584294.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456354.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584296.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584297.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456221.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584293.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456350.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456211.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584301.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456233.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584299.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584295.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584292.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456368.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456396.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456359.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456382.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456392.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456394.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456390.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456387.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456381.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456370.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456372.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456389.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456378.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456360.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456385.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456383.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456213.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456239.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456367.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456366.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456393.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456216.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456379.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584304.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456212.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584302.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584303.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456210.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456219.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584300.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584298.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584294.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456354.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584296.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584297.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456221.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584293.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456350.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456211.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584301.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"GL456233.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n", + "Warning: Sequence \"JH584299.1\" not in CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz\n" + ] + } + ], "source": [ "%%bash\n", - "bcftools consensus --fasta-ref GRCm38_68.fa.gz \\\n", - " --haplotype 1 CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz |sed '/^>/ s/$/_hap1/' | bgzip -c > GRCm38_EiJ_snpsonly_hap1.fa.gz\n", + "bcftools consensus --fasta-ref Mus_musculus.GRCm38.68.dna_sm.toplevel.fa.bgz \\\n", + " --haplotype 1 CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz | sed -E 's/(>[^[:space:]]+).*/\\1_hap1/g' | bgzip -c > GRCm38_EiJ_snpsonly_hap1.fa.gz\n", "\n", - "bcftools consensus --fasta-ref GRCm38_68.fa.gz \\\n", - " --haplotype 2 CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz |sed '/^>/ s/$/_hap2/' | bgzip -c > GRCm38_EiJ_snpsonly_hap2.fa.gz\n" + "bcftools consensus --fasta-ref Mus_musculus.GRCm38.68.dna_sm.toplevel.fa.bgz \\\n", + " --haplotype 2 CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz | sed -E 's/(>[^[:space:]]+).*/\\1_hap2/g' | bgzip -c > GRCm38_EiJ_snpsonly_hap2.fa.gz\n" + ] + }, + { + "cell_type": "markdown", + "id": "af1f6027", + "metadata": {}, + "source": [ + "Note that some of these inserted variants may change the total number of nucleotides. This would introduce differences between the coordinate systems of two homologs and complicate downstream analyses. Thus, to make your analyses simpler.you may want insert only single-nuleotide variants and exclude \n", + "by using `--include` parameter of `bcftools consensus` (e.g. `--include '(STRLEN(REF)=1) & (STRLEN(ALT[0])=1)'`).\n", + "This will make sure that the genomic coorditates correspond between the haplotypes. \n", + "Correspondence of coordinates is not a requirement, but might be important for downstream analysis. " ] }, { @@ -213,19 +371,153 @@ "id": "99d28f6f-b754-4a95-95d5-9e5e51d14571", "metadata": {}, "source": [ - "Concatenate the genomes and index them together. Note that [bwa-mem2](https://github.com/bwa-mem2/bwa-mem2) produces [very similar results to bwa mem](https://github.com/open2c/pairtools/discussions/118), while being [x2-3 times faster](https://github.com/bwa-mem2/bwa-mem2#performance). We highly recommend to use it instead of bwa!" + "Concatenate the genomes and index them together. " ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 17, "id": "92ff8a4f-2115-4131-8c4a-cbd040dcdffb", "metadata": { "tags": [ "hide-output" ] }, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[bwa_index] Pack FASTA... 62.34 sec\n", + "[bwa_index] Construct BWT for the packed sequence...\n", + "[BWTIncCreate] textLength=10923487096, availableWord=780616804\n", + "[BWTIncConstructFromPacked] 10 iterations done. 99999992 characters processed.\n", + "[BWTIncConstructFromPacked] 20 iterations done. 199999992 characters processed.\n", + "[BWTIncConstructFromPacked] 30 iterations done. 299999992 characters processed.\n", + "[BWTIncConstructFromPacked] 40 iterations done. 399999992 characters processed.\n", + "[BWTIncConstructFromPacked] 50 iterations done. 499999992 characters processed.\n", + "[BWTIncConstructFromPacked] 60 iterations done. 599999992 characters processed.\n", + "[BWTIncConstructFromPacked] 70 iterations done. 699999992 characters processed.\n", + "[BWTIncConstructFromPacked] 80 iterations done. 799999992 characters processed.\n", + "[BWTIncConstructFromPacked] 90 iterations done. 899999992 characters processed.\n", + "[BWTIncConstructFromPacked] 100 iterations done. 999999992 characters processed.\n", + "[BWTIncConstructFromPacked] 110 iterations done. 1099999992 characters processed.\n", + "[BWTIncConstructFromPacked] 120 iterations done. 1199999992 characters processed.\n", + "[BWTIncConstructFromPacked] 130 iterations done. 1299999992 characters processed.\n", + "[BWTIncConstructFromPacked] 140 iterations done. 1399999992 characters processed.\n", + "[BWTIncConstructFromPacked] 150 iterations done. 1499999992 characters processed.\n", + "[BWTIncConstructFromPacked] 160 iterations done. 1599999992 characters processed.\n", + "[BWTIncConstructFromPacked] 170 iterations done. 1699999992 characters processed.\n", + "[BWTIncConstructFromPacked] 180 iterations done. 1799999992 characters processed.\n", + "[BWTIncConstructFromPacked] 190 iterations done. 1899999992 characters processed.\n", + "[BWTIncConstructFromPacked] 200 iterations done. 1999999992 characters processed.\n", + "[BWTIncConstructFromPacked] 210 iterations done. 2099999992 characters processed.\n", + "[BWTIncConstructFromPacked] 220 iterations done. 2199999992 characters processed.\n", + "[BWTIncConstructFromPacked] 230 iterations done. 2299999992 characters processed.\n", + "[BWTIncConstructFromPacked] 240 iterations done. 2399999992 characters processed.\n", + "[BWTIncConstructFromPacked] 250 iterations done. 2499999992 characters processed.\n", + "[BWTIncConstructFromPacked] 260 iterations done. 2599999992 characters processed.\n", + "[BWTIncConstructFromPacked] 270 iterations done. 2699999992 characters processed.\n", + "[BWTIncConstructFromPacked] 280 iterations done. 2799999992 characters processed.\n", + "[BWTIncConstructFromPacked] 290 iterations done. 2899999992 characters processed.\n", + "[BWTIncConstructFromPacked] 300 iterations done. 2999999992 characters processed.\n", + "[BWTIncConstructFromPacked] 310 iterations done. 3099999992 characters processed.\n", + "[BWTIncConstructFromPacked] 320 iterations done. 3199999992 characters processed.\n", + "[BWTIncConstructFromPacked] 330 iterations done. 3299999992 characters processed.\n", + "[BWTIncConstructFromPacked] 340 iterations done. 3399999992 characters processed.\n", + "[BWTIncConstructFromPacked] 350 iterations done. 3499999992 characters processed.\n", + "[BWTIncConstructFromPacked] 360 iterations done. 3599999992 characters processed.\n", + "[BWTIncConstructFromPacked] 370 iterations done. 3699999992 characters processed.\n", + "[BWTIncConstructFromPacked] 380 iterations done. 3799999992 characters processed.\n", + "[BWTIncConstructFromPacked] 390 iterations done. 3899999992 characters processed.\n", + "[BWTIncConstructFromPacked] 400 iterations done. 3999999992 characters processed.\n", + "[BWTIncConstructFromPacked] 410 iterations done. 4099999992 characters processed.\n", + "[BWTIncConstructFromPacked] 420 iterations done. 4199999992 characters processed.\n", + "[BWTIncConstructFromPacked] 430 iterations done. 4299999992 characters processed.\n", + "[BWTIncConstructFromPacked] 440 iterations done. 4399999992 characters processed.\n", + "[BWTIncConstructFromPacked] 450 iterations done. 4499999992 characters processed.\n", + "[BWTIncConstructFromPacked] 460 iterations done. 4599999992 characters processed.\n", + "[BWTIncConstructFromPacked] 470 iterations done. 4699999992 characters processed.\n", + "[BWTIncConstructFromPacked] 480 iterations done. 4799999992 characters processed.\n", + "[BWTIncConstructFromPacked] 490 iterations done. 4899999992 characters processed.\n", + "[BWTIncConstructFromPacked] 500 iterations done. 4999999992 characters processed.\n", + "[BWTIncConstructFromPacked] 510 iterations done. 5099999992 characters processed.\n", + "[BWTIncConstructFromPacked] 520 iterations done. 5199999992 characters processed.\n", + "[BWTIncConstructFromPacked] 530 iterations done. 5299999992 characters processed.\n", + "[BWTIncConstructFromPacked] 540 iterations done. 5399999992 characters processed.\n", + "[BWTIncConstructFromPacked] 550 iterations done. 5499999992 characters processed.\n", + "[BWTIncConstructFromPacked] 560 iterations done. 5599999992 characters processed.\n", + "[BWTIncConstructFromPacked] 570 iterations done. 5699999992 characters processed.\n", + "[BWTIncConstructFromPacked] 580 iterations done. 5799999992 characters processed.\n", + "[BWTIncConstructFromPacked] 590 iterations done. 5899999992 characters processed.\n", + "[BWTIncConstructFromPacked] 600 iterations done. 5999999992 characters processed.\n", + "[BWTIncConstructFromPacked] 610 iterations done. 6099999992 characters processed.\n", + "[BWTIncConstructFromPacked] 620 iterations done. 6199999992 characters processed.\n", + "[BWTIncConstructFromPacked] 630 iterations done. 6299999992 characters processed.\n", + "[BWTIncConstructFromPacked] 640 iterations done. 6399999992 characters processed.\n", + "[BWTIncConstructFromPacked] 650 iterations done. 6499999992 characters processed.\n", + "[BWTIncConstructFromPacked] 660 iterations done. 6599999992 characters processed.\n", + "[BWTIncConstructFromPacked] 670 iterations done. 6699999992 characters processed.\n", + "[BWTIncConstructFromPacked] 680 iterations done. 6799999992 characters processed.\n", + "[BWTIncConstructFromPacked] 690 iterations done. 6899999992 characters processed.\n", + "[BWTIncConstructFromPacked] 700 iterations done. 6999999992 characters processed.\n", + "[BWTIncConstructFromPacked] 710 iterations done. 7099999992 characters processed.\n", + "[BWTIncConstructFromPacked] 720 iterations done. 7199999992 characters processed.\n", + "[BWTIncConstructFromPacked] 730 iterations done. 7299999992 characters processed.\n", + "[BWTIncConstructFromPacked] 740 iterations done. 7399999992 characters processed.\n", + "[BWTIncConstructFromPacked] 750 iterations done. 7499999992 characters processed.\n", + "[BWTIncConstructFromPacked] 760 iterations done. 7599999992 characters processed.\n", + "[BWTIncConstructFromPacked] 770 iterations done. 7699999992 characters processed.\n", + "[BWTIncConstructFromPacked] 780 iterations done. 7799999992 characters processed.\n", + "[BWTIncConstructFromPacked] 790 iterations done. 7899999992 characters processed.\n", + "[BWTIncConstructFromPacked] 800 iterations done. 7999999992 characters processed.\n", + "[BWTIncConstructFromPacked] 810 iterations done. 8099999992 characters processed.\n", + "[BWTIncConstructFromPacked] 820 iterations done. 8199999992 characters processed.\n", + "[BWTIncConstructFromPacked] 830 iterations done. 8299999992 characters processed.\n", + "[BWTIncConstructFromPacked] 840 iterations done. 8399999992 characters processed.\n", + "[BWTIncConstructFromPacked] 850 iterations done. 8499999992 characters processed.\n", + "[BWTIncConstructFromPacked] 860 iterations done. 8599999992 characters processed.\n", + "[BWTIncConstructFromPacked] 870 iterations done. 8699999992 characters processed.\n", + "[BWTIncConstructFromPacked] 880 iterations done. 8799999992 characters processed.\n", + "[BWTIncConstructFromPacked] 890 iterations done. 8899999992 characters processed.\n", + "[BWTIncConstructFromPacked] 900 iterations done. 8999999992 characters processed.\n", + "[BWTIncConstructFromPacked] 910 iterations done. 9099999992 characters processed.\n", + "[BWTIncConstructFromPacked] 920 iterations done. 9199999992 characters processed.\n", + "[BWTIncConstructFromPacked] 930 iterations done. 9299999992 characters processed.\n", + "[BWTIncConstructFromPacked] 940 iterations done. 9399999992 characters processed.\n", + "[BWTIncConstructFromPacked] 950 iterations done. 9499999992 characters processed.\n", + "[BWTIncConstructFromPacked] 960 iterations done. 9599999992 characters processed.\n", + "[BWTIncConstructFromPacked] 970 iterations done. 9699999992 characters processed.\n", + "[BWTIncConstructFromPacked] 980 iterations done. 9799999992 characters processed.\n", + "[BWTIncConstructFromPacked] 990 iterations done. 9899999992 characters processed.\n", + "[BWTIncConstructFromPacked] 1000 iterations done. 9999999992 characters processed.\n", + "[BWTIncConstructFromPacked] 1010 iterations done. 10099999992 characters processed.\n", + "[BWTIncConstructFromPacked] 1020 iterations done. 10199999992 characters processed.\n", + "[BWTIncConstructFromPacked] 1030 iterations done. 10298557624 characters processed.\n", + "[BWTIncConstructFromPacked] 1040 iterations done. 10387491112 characters processed.\n", + "[BWTIncConstructFromPacked] 1050 iterations done. 10466531432 characters processed.\n", + "[BWTIncConstructFromPacked] 1060 iterations done. 10536778632 characters processed.\n", + "[BWTIncConstructFromPacked] 1070 iterations done. 10599210536 characters processed.\n", + "[BWTIncConstructFromPacked] 1080 iterations done. 10654696152 characters processed.\n", + "[BWTIncConstructFromPacked] 1090 iterations done. 10704007912 characters processed.\n", + "[BWTIncConstructFromPacked] 1100 iterations done. 10747832296 characters processed.\n", + "[BWTIncConstructFromPacked] 1110 iterations done. 10786779480 characters processed.\n", + "[BWTIncConstructFromPacked] 1120 iterations done. 10821391832 characters processed.\n", + "[BWTIncConstructFromPacked] 1130 iterations done. 10852151336 characters processed.\n", + "[BWTIncConstructFromPacked] 1140 iterations done. 10879486440 characters processed.\n", + "[BWTIncConstructFromPacked] 1150 iterations done. 10903777896 characters processed.\n", + "[BWTIncConstructFromPacked] 1160 iterations done. 10923487096 characters processed.\n", + "[bwt_gen] Finished constructing BWT in 1160 iterations.\n", + "[bwa_index] 4810.53 seconds elapse.\n", + "[bwa_index] Update BWT... 27.18 sec\n", + "[bwa_index] Pack forward-only FASTA... 48.19 sec\n", + "[bwa_index] Construct SA from BWT and Occ... 1602.31 sec\n", + "[main] Version: 0.7.17-r1188\n", + "[main] CMD: bwa index GRCm38_EiJ_snpsonly.fa.gz\n", + "[main] Real time: 6563.846 sec; CPU: 6550.547 sec\n" + ] + } + ], "source": [ "%%bash\n", "cat GRCm38_EiJ_snpsonly_hap1.fa.gz GRCm38_EiJ_snpsonly_hap2.fa.gz > GRCm38_EiJ_snpsonly.fa.gz\n", @@ -242,7 +534,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 18, "id": "69489018-edde-4aa0-b7ac-7c7b4351764c", "metadata": {}, "outputs": [], @@ -265,7 +557,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 2, "id": "f4e310c0-2d16-4e7d-87d7-44feec8e6256", "metadata": {}, "outputs": [], @@ -275,59 +567,99 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "id": "571e94fb-3dec-4042-9e21-6c39802ed8df", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "SRR8811373_1.fastq.gz SRR8811373_2.fastq.gz\n" + ] + } + ], "source": [ "! ls SRR8811373*.fastq.gz" ] }, { "cell_type": "markdown", - "id": "2ce00436-bbc7-4241-a41b-12c99c708180", - "metadata": { - "tags": [] - }, + "id": "a8a9fd19", + "metadata": {}, "source": [ - "## Map data with bwa mem to diploid genome\n", + "# Map data with bwa mem to diploid genome\n", + "In homolog-resolved Hi-C experiments, reads are first aligned against the reference genome and then parsed with pairtools, similar to the standard pairtools-based Hi-C pipeline. However, an additional challenge arises in distinguishing between un-phaseable reads (reads that map equally well to two homologous locations on two homologous chromosomes) and multimappers (reads that map to repeats in the genome).\n", "\n", - "Note that you may use [bwa mem2](https://github.com/bwa-mem2/bwa-mem2), which is x2 times faster. \n", - "It [proved to produce](https://github.com/open2c/pairtools/discussions/118) results very similar to bwa mem.\n", + "To differentiate between these cases, we can examine the top three candidate alignments for each read:\n", + "- If the top two alignments map to two homologs of the same chromosome with identical scores, and the third-best alignment has a lower score, the read is considered un-phaseable.\n", + "- Conversely, if the third-best alignment has the same score as the first two, the read is classified as a multimapper.\n", "\n", - "There are two modes to work with phasing. \n", + "bwa mem provides information on the top three alignments, but the exact usage depends on the version:\n", "\n", - "1. Github mode with XB bwa tag. This is the most precise algorithm that operates based on alignment scores of optimal alignment (best hit), and two suboptimal ones.\n", + "1. Regular bwa binary (using XA alignment tag):\n", "\n", - " Download and install [bwa](https://github.com/lh3/bwa) from GitHub.\n", - " Map with:\n", - " ```bash\n", - "./bwa/bwa mem -SPu -t 5 mm10_EiJ_snpsonly.fa.gz test.1.fastq.gz test.2.fastq.gz | samtools view -@ 8 -b > mapped.XB.bam\n", - " ```\n", + "The latest official bwa release (0.7.17-r1188, as of March 2024) includes the XA tag, which lists alternative (secondary) alignments, their CIGARs, and the number of mismatches relative to the reference. pairtools parse can parse this tag along with other alignment characteristics (AS, XS, and NM tags) to infer the scores of the top three alignments. In this case, no extra mapping flags are required besides the standard Hi-C flags -SP:\n", "\n", + "```bash\n", + "bwa mem -SP -t 5 mm10_EiJ_snpsonly.fa.gz est.1.fastq.gz test.2.fastq.gz | samtools view -@ 8 -b > mapped.XA.bam\n", + "```\n", "\n", - "2. Regular mode with XA bwa tag. \n", + "2. Cutting-edge bwa version from GitHub (using XB alignment tag):\n", "\n", - " This is simplified version that operates on number of mismatches for the suboptimal alignments.\n", + "The latest unreleased version of bwa, available on GitHub, can directly report the scores of all secondary alignments using the XB tag, providing more precise results. To use this option, download bwa's source code, compile it manually, and align reads with an additional -u flag:\n", "\n", - " ```bash\n", - "bwa mem -SP -t 5 mm10_EiJ_snpsonly.fa.gz est.1.fastq.gz test.2.fastq.gz | samtools view -@ 8 -b > mapped.XA.bam\n", - " ```\n", + "```bash\n", + "./bwa/bwa mem -SPu -t 5 mm10_EiJ_snpsonly.fa.gz test.1.fastq.gz test.2.fastq.gz | samtools view -@ 8 -b > mapped.XB.bam\n", + "```\n", "\n", + "In this tutorial, we will use the first, simpler option.\n", "\n", - "We will try the second option for the simplicity: " + "Note: [bwa-mem2](https://github.com/bwa-mem2/bwa-mem2) produces [very similar results to bwa mem](https://github.com/open2c/pairtools/discussions/118) while being [x2-3 times faster](https://github.com/bwa-mem2/bwa-mem2#performance). pairtools are compatible with bwa-mem2, and its use is highly recommended for improved performance." ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 4, "id": "12f8a13d-fba6-45f7-8112-291fb883d7d0", "metadata": { "tags": [ "hide-output" ] }, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[M::bwa_idx_load_from_disk] read 0 ALT contigs\n", + "[M::process] read 331126 sequences (50000026 bp)...\n", + "[M::process] read 127590 sequences (19265939 bp)...\n", + "[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1, 238, 0, 0)\n", + "[M::mem_pestat] skip orientation FF as there are not enough pairs\n", + "[M::mem_pestat] analyzing insert size distribution for orientation FR...\n", + "[M::mem_pestat] (25, 50, 75) percentile: (70, 98, 141)\n", + "[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 283)\n", + "[M::mem_pestat] mean and std.dev: (109.44, 53.37)\n", + "[M::mem_pestat] low and high boundaries for proper pairs: (1, 354)\n", + "[M::mem_pestat] skip orientation RF as there are not enough pairs\n", + "[M::mem_pestat] skip orientation RR as there are not enough pairs\n", + "[M::mem_process_seqs] Processed 331126 reads in 276.310 CPU sec, 55.358 real sec\n", + "[W::bseq_read] the 1st file has fewer sequences.\n", + "[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 98, 0, 0)\n", + "[M::mem_pestat] skip orientation FF as there are not enough pairs\n", + "[M::mem_pestat] analyzing insert size distribution for orientation FR...\n", + "[M::mem_pestat] (25, 50, 75) percentile: (73, 107, 164)\n", + "[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 346)\n", + "[M::mem_pestat] mean and std.dev: (118.69, 58.74)\n", + "[M::mem_pestat] low and high boundaries for proper pairs: (1, 437)\n", + "[M::mem_pestat] skip orientation RF as there are not enough pairs\n", + "[M::mem_pestat] skip orientation RR as there are not enough pairs\n", + "[mem_sam_pe] paired reads have different names: \"SRR8811373.22935\", \"SRR8811373.229358\"\n", + "\n" + ] + } + ], "source": [ "%%bash\n", "bwa mem -SP -t 5 GRCm38_EiJ_snpsonly.fa.gz SRR8811373_1.fastq.gz SRR8811373_2.fastq.gz \\\n", @@ -341,23 +673,56 @@ "source": [ "## pairtools parse\n", "\n", - "For phasing, we need additional tags and no filtering by mapq.\n", - "\n", - "`--min-mapq` is 1 by default, which removes all multiply mapped sequences. However, we need this information for phasing to distinguish true multiply mapped pairs from pairs mapped to both haplotypes:" + "In order to be phased, parsed Hi-C pairs need to (a) be parsed without mapq filtering and (b) contain a few additional tags (XA,NM,AS,XS). \n", + "The former modification (a) is needed, because, the default `--min-mapq` value of 1 removes all multiply mapped sequences. This also removes all un-phaseable reads, as they map equally well to both homologs and thus have mapq of 0. Since we would like to keep un-phaseable reads, we need to set --min-mapq 0." ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 5, "id": "efc63459-aa2f-44f5-804e-a2346d2b7820", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[E::idx_find_and_load] Could not retrieve index file for 'mapped.XA.bam'\n" + ] + } + ], "source": [ "%%bash\n", - "pairtools parse --add-columns XA,NM,AS,XS --min-mapq 0 --drop-sam --walks-policy all \\\n", + "pairtools parse --min-mapq 0 --add-columns XA,NM,AS,XS --drop-sam --walks-policy all \\\n", " -c GRCm38_EiJ_snpsonly.chromsizes mapped.XA.bam -o unphased.XA.pairs.gz" ] }, + { + "cell_type": "markdown", + "id": "c39c62f0", + "metadata": {}, + "source": [ + "count the number of pairs in the output file" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "e1788b32", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "253813\n" + ] + } + ], + "source": [ + "!zcat unphased.XA.pairs.gz | grep -v ^\\# | wc -l" + ] + }, { "cell_type": "markdown", "id": "c90ff16b-bb5b-4ceb-8fe3-feeae8ada021", @@ -365,12 +730,12 @@ "source": [ "## pairtools phase\n", "\n", - "Phasing will remove the tags \"\\_1\" and \"\\_2\" from chromosome names and add a separate field for the phase:" + "Phasing will remove the tags \"\\_hap1\" and \"\\_hap2\" from chromosome names and add a separate field for the phase:" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 8, "id": "6c8deaee-cb68-4b53-b306-bf223523ab45", "metadata": {}, "outputs": [], @@ -391,7 +756,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 9, "id": "6aabbc13-a8d4-43f2-b388-62e7b3b576ab", "metadata": {}, "outputs": [], @@ -410,7 +775,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 10, "id": "9fd3b266-4faa-4fc0-974d-b0ca9bbeb961", "metadata": {}, "outputs": [], @@ -421,6 +786,58 @@ " -o phased.sorted.XA.nodup.pairs.gz phased.sorted.XA.pairs.gz" ] }, + { + "cell_type": "code", + "execution_count": 17, + "id": "0e485084", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "total\t232761\n", + "total_unmapped\t29132\n", + "total_single_sided_mapped\t44444\n", + "total_mapped\t159185\n", + "total_dups\t5219\n", + "total_nodups\t153966\n", + "cis\t17712\n", + "trans\t136254\n", + "pair_types/MM\t21281\n", + "pair_types/NM\t7241\n", + "pair_types/NN\t610\n", + "pair_types/NU\t19507\n", + "pair_types/MU\t22736\n", + "pair_types/UU\t153966\n", + "pair_types/UM\t2201\n", + "pair_types/DD\t5219\n", + "cis_1kb+\t9470\n", + "cis_2kb+\t9225\n", + "cis_4kb+\t8839\n", + "cis_10kb+\t8161\n", + "cis_20kb+\t7590\n", + "cis_40kb+\t7162\n", + "summary/frac_cis\t0.11503838509800865\n", + "summary/frac_cis_1kb+\t0.06150708598002156\n", + "summary/frac_cis_2kb+\t0.05991582557187951\n", + "summary/frac_cis_4kb+\t0.05740877856150059\n", + "summary/frac_cis_10kb+\t0.053005208942234\n", + "summary/frac_cis_20kb+\t0.049296597950196794\n", + "summary/frac_cis_40kb+\t0.046516763441279245\n", + "summary/frac_dups\t0.032785752426422086\n", + "summary/complexity_naive\t2374298.3333298806\n", + "chrom_freq/1/1\t4333\n", + "chrom_freq/10/10\t7537\n", + "chrom_freq/1/6\t173\n", + "chrom_freq/1/7\t121\n" + ] + } + ], + "source": [ + "!cat phased.XA.dedup.stats | head -35" + ] + }, { "cell_type": "markdown", "id": "d7ae3575-aef8-4a8b-9707-b37627653ba9", @@ -441,7 +858,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 30, "id": "727a9d2b-5977-4763-81e5-64589c067688", "metadata": {}, "outputs": [], @@ -450,7 +867,7 @@ "pairtools select '(phase1==\"0\") and (phase2==\"0\")' phased.sorted.XA.nodup.pairs.gz -o phased.XA.phase0.pairs.gz\n", "pairtools select '(phase1==\"1\") and (phase2==\"1\")' phased.sorted.XA.nodup.pairs.gz -o phased.XA.phase1.pairs.gz\n", "pairtools select '(phase1==\".\") or (phase2==\".\")' phased.sorted.XA.nodup.pairs.gz -o phased.XA.unphased.pairs.gz\n", - "pairtools select '(phase1!=phase2) and (phase1!=\".\") and (phase2!=\".\")' phased.sorted.XA.nodup.pairs.gz \\\n", + "pairtools select '(phase1!=phase2) and (phase1!=\".\") and (phase2!=\".\") and (phase1!=\"!\") and (phase2!=\"!\")' phased.sorted.XA.nodup.pairs.gz \\\n", " -o phased.XA.trans-phase.pairs.gz" ] }, @@ -464,10 +881,26 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 31, "id": "1172f899-41d6-4ca2-ab21-a283340011f8", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "WARNING:py.warnings:/users/anton.goloborodko/src/pairtools/pairtools/cli/stats.py:192: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.\n", + " for chunk in pd.read_table(body_stream, names=cols, chunksize=100_000):\n", + "\n", + "WARNING:py.warnings:/users/anton.goloborodko/src/pairtools/pairtools/cli/stats.py:192: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.\n", + " for chunk in pd.read_table(body_stream, names=cols, chunksize=100_000):\n", + "\n", + "WARNING:py.warnings:/users/anton.goloborodko/src/pairtools/pairtools/lib/stats.py:880: RuntimeWarning: divide by zero encountered in double_scalars\n", + " complexity = float(nseq / seq_to_complexity) # clean np.int64 data type\n", + "\n" + ] + } + ], "source": [ "%%bash\n", "pairtools stats phased.XA.phase0.pairs.gz -o phased.XA.phase0.stats\n", @@ -476,6 +909,172 @@ "pairtools stats phased.XA.trans-phase.pairs.gz -o phased.XA.trans-phase.stats" ] }, + { + "cell_type": "markdown", + "id": "71a53db7", + "metadata": {}, + "source": [ + "These stats show that rather few reads end up being phased, and that the vast majority of reads are unphased. \n", + "\n", + "Furthermore, the number of phased reads is comparable between the two haplotypes. The minor mismatch rate is likely to be caused by imperfect annotation of SNVs." + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "2567ccea", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "total\t574\n", + "total_unmapped\t0\n", + "total_single_sided_mapped\t0\n", + "total_mapped\t574\n", + "total_dups\t13\n", + "total_nodups\t561\n", + "cis\t535\n", + "trans\t26\n", + "pair_types/UU\t561\n", + "pair_types/DD\t13\n", + "cis_1kb+\t68\n", + "cis_2kb+\t67\n", + "cis_4kb+\t60\n", + "cis_10kb+\t56\n", + "cis_20kb+\t48\n", + "cis_40kb+\t42\n", + "summary/frac_cis\t0.9536541889483066\n", + "summary/frac_cis_1kb+\t0.12121212121212122\n", + "summary/frac_cis_2kb+\t0.11942959001782531\n", + "summary/frac_cis_4kb+\t0.10695187165775401\n" + ] + } + ], + "source": [ + "!cat phased.XA.phase0.stats | head -20" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "8f521224", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "total\t485\n", + "total_unmapped\t0\n", + "total_single_sided_mapped\t0\n", + "total_mapped\t485\n", + "total_dups\t9\n", + "total_nodups\t476\n", + "cis\t394\n", + "trans\t82\n", + "pair_types/UU\t476\n", + "pair_types/DD\t9\n", + "cis_1kb+\t53\n", + "cis_2kb+\t51\n", + "cis_4kb+\t50\n", + "cis_10kb+\t44\n", + "cis_20kb+\t41\n", + "cis_40kb+\t38\n", + "summary/frac_cis\t0.8277310924369747\n", + "summary/frac_cis_1kb+\t0.11134453781512606\n", + "summary/frac_cis_2kb+\t0.10714285714285714\n", + "summary/frac_cis_4kb+\t0.10504201680672269\n" + ] + } + ], + "source": [ + "!cat phased.XA.phase1.stats | head -20" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "3bb8d589", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "total\t241\n", + "total_unmapped\t0\n", + "total_single_sided_mapped\t0\n", + "total_mapped\t241\n", + "total_dups\t0\n", + "total_nodups\t241\n", + "cis\t177\n", + "trans\t64\n", + "pair_types/UU\t241\n", + "cis_1kb+\t85\n", + "cis_2kb+\t83\n", + "cis_4kb+\t81\n", + "cis_10kb+\t75\n", + "cis_20kb+\t70\n", + "cis_40kb+\t64\n", + "summary/frac_cis\t0.7344398340248963\n", + "summary/frac_cis_1kb+\t0.35269709543568467\n", + "summary/frac_cis_2kb+\t0.34439834024896265\n", + "summary/frac_cis_4kb+\t0.3360995850622407\n", + "summary/frac_cis_10kb+\t0.3112033195020747\n" + ] + } + ], + "source": [ + "!cat phased.XA.trans-phase.stats | head -20" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "29e85a23", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "total\t200077\n", + "total_unmapped\t0\n", + "total_single_sided_mapped\t42192\n", + "total_mapped\t157885\n", + "total_dups\t5197\n", + "total_nodups\t152688\n", + "cis\t52294\n", + "trans\t100394\n", + "pair_types/UU\t152688\n", + "pair_types/MU\t21277\n", + "pair_types/NU\t18827\n", + "pair_types/UM\t2088\n", + "pair_types/DD\t5197\n", + "cis_1kb+\t29483\n", + "cis_2kb+\t28885\n", + "cis_4kb+\t27905\n", + "cis_10kb+\t26158\n", + "cis_20kb+\t24601\n", + "cis_40kb+\t23101\n", + "summary/frac_cis\t0.3424892591428272\n" + ] + } + ], + "source": [ + "!cat phased.XA.unphased.stats | head -20" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2369ced7", + "metadata": {}, + "outputs": [], + "source": [] + }, { "cell_type": "markdown", "id": "25fdebb4-24ca-4280-950e-baa9cc92d28e", @@ -497,7 +1096,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "id": "ed403d73-7b5f-432b-9e91-e8c70906d31b", "metadata": {}, "outputs": [ @@ -541,9 +1140,9 @@ ], "metadata": { "kernelspec": { - "display_name": "test", + "display_name": "main", "language": "python", - "name": "test" + "name": "main" }, "language_info": { "codemirror_mode": { @@ -555,7 +1154,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.8" + "version": "3.10.4" } }, "nbformat": 4,