Include pairtools protocol #162

nservant · 2023-05-05T16:27:44Z

Description of feature

Add a new parameter: --processing hicpro or --processing pairtools

Propose an alternative to HiC-Pro, following the analysis protocol proposed by Dovetail and based on bwa-mem2 and pairtools, with the following steps:

Mapping with bwa-mem (version 2 instead of 1)
pairtools parse - get valid ligation product / parameters --min_mapq
pairtools sort - sorting pairs
pairtools merge - merge pairs
pairtools dedup - remove PCR dups / parameters --keep_dups
pairtools split - generate pairs files (and final bam if useful ?)
pairtools select - filter pairs
pairtools stats - generate final stats
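The pairtools steps listed above can be sketched as a chain of commands. This is only an illustration, not the pipeline's actual implementation: the function name `pairtools_commands`, the intermediate file names and the default select expression are hypothetical, and the mapping step is assumed to have already produced one BAM file per chunk. Skipping `pairtools dedup` when `keep_dups` is set is my reading of the `--keep_dups` parameter.

```python
from typing import List


def pairtools_commands(chunk_bams: List[str], sample: str, chrom_sizes: str,
                       min_mapq: int = 0, keep_dups: bool = False,
                       select_expr: str = "(pair_type=='UU')") -> List[List[str]]:
    """Build the argv list of each pairtools step, in the order of the list above."""
    cmds: List[List[str]] = []
    sorted_chunks: List[str] = []
    for i, bam in enumerate(chunk_bams):
        parsed = f"{sample}.{i}.pairsam.gz"
        srt = f"{sample}.{i}.sorted.pairsam.gz"
        # parse: turn alignments into pairs, applying the MAPQ threshold
        cmds.append(["pairtools", "parse", "--min-mapq", str(min_mapq),
                     "-c", chrom_sizes, "-o", parsed, bam])
        # sort: required before merge and dedup
        cmds.append(["pairtools", "sort", "-o", srt, parsed])
        sorted_chunks.append(srt)
    merged = f"{sample}.merged.pairsam.gz"
    cmds.append(["pairtools", "merge", "-o", merged, *sorted_chunks])
    current = merged
    if not keep_dups:
        # dedup: mark and remove PCR duplicates (assumed to be skipped with keep_dups)
        dedup = f"{sample}.dedup.pairsam.gz"
        cmds.append(["pairtools", "dedup", "--mark-dups",
                     "--output-stats", f"{sample}.dedup.stats",
                     "-o", dedup, current])
        current = dedup
    # split: separate the pairs from the alignments
    split_pairs = f"{sample}.split.pairs.gz"
    cmds.append(["pairtools", "split", "--output-pairs", split_pairs,
                 "--output-sam", f"{sample}.bam", current])
    # select: keep pairs matching the filter, write the rest to a second file
    selected = f"{sample}.selected.pairs.gz"
    cmds.append(["pairtools", "select", select_expr, "-o", selected,
                 "--output-rest", f"{sample}.unselected.pairs.gz", split_pairs])
    # stats: final statistics on the selected pairs
    cmds.append(["pairtools", "stats", "-o", f"{sample}.selected.stats", selected])
    return cmds
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`, or joined into a shell line for inspection.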

To validate:

Remove multi-hits - parameter --keep_multi ?
Compatibility of pairtools with --digestion, --restriction_site, --ligation_site, --chromosome_size, --restriction_fragments parameters ?
Filtering on fragment size with pairtools ? options --max_insert_size --min_insert_size --max_restriction_fragment_size --min_restriction_fragment_size
Mode --dnase with pairtools ?
Filtering based on the distance (--min_cis_dist) for --dnase mode with --pairtools

Other ideas:

Replace the --dnase option with --no_digestion for DNase Hi-C, Micro-C, etc.

nservant · 2023-06-15T17:22:35Z

First test version available with --processing pairtools

nservant · 2023-06-15T17:25:19Z

The options --keep_multi / --min_mapq / --min_cis_dist / --save_interaction_bam are now handled

nservant · 2023-06-21T13:32:01Z

First version tested.

This needs further validation before release. I still need to find a way to support the options:

--min_insert_size / --max_insert_size
--min_restriction_fragment_size / --max_restriction_fragment_size

        ext.args = { [
            // both mates must pass the MAPQ threshold
            "(mapq1>${params.min_mapq} and mapq2>${params.min_mapq})",
            // drop contacts closer than --min_cis_dist
            params.min_cis_dist > 0 ? " and (abs(pos1-pos2) > ${params.min_cis_dist})" : '',
            // accepted pair types; multi-mapped (M) pairs only with --keep_multi
            params.keep_multi ? " and ((pair_type=='UU') or (pair_type=='UR') or (pair_type=='RU') or (pair_type=='MM') or (pair_type=='MU'))" :
                                " and ((pair_type=='UU') or (pair_type=='UR') or (pair_type=='RU'))",
            // without --dnase, require mates on non-adjacent restriction fragments
            params.dnase ? '' : " and (abs(int(rfrag1) - int(rfrag2)) > 1)",
            //params.min_insert_size > 0 ?  " and ( (rfrag_end1 - r1pos) + (rfrag_end2 - r2pos)) > ${params.min_insert_size}" : '',
            //params.max_insert_size > 0 ? " and ( (rfrag_end1 - r1pos) + (rfrag_end2 - r2pos)) < ${params.max_insert_size}" : '',
            //params.min_restriction_fragment_size > 0 ? " -t ${params.min_restriction_fragment_size}" : '',
            //params.max_restriction_fragment_size > 0 ? " -m ${params.max_restriction_fragment_size}" : '',
        ].join(' ').trim() }
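
The filter expression assembled in the closure above can be prototyped outside Nextflow to check what ends up being passed to `pairtools select`. A minimal Python sketch follows; the function name `build_select_expr` is hypothetical. It assumes that `--min_cis_dist` is meant to keep cis pairs farther apart than the threshold and to leave trans pairs untouched (the `chrom1!=chrom2` guard is my addition, not part of the config), and that the `mapq1`/`mapq2` and `rfrag1`/`rfrag2` columns are present in the pairs file.

```python
def build_select_expr(min_mapq: int = 0, min_cis_dist: int = 0,
                      keep_multi: bool = False, dnase: bool = False) -> str:
    """Assemble a pairtools-select style condition from the pipeline options."""
    # both mates must pass the MAPQ threshold
    parts = [f"(mapq1>{min_mapq} and mapq2>{min_mapq})"]
    if min_cis_dist > 0:
        # assumption: trans pairs always pass, cis pairs must be far enough apart
        parts.append(f"(chrom1!=chrom2 or abs(pos1-pos2)>{min_cis_dist})")
    # unique and rescued pairs by default, multi-mapped pairs only on request
    pair_types = ["UU", "UR", "RU"] + (["MM", "MU"] if keep_multi else [])
    parts.append("(" + " or ".join(f"pair_type=='{t}'" for t in pair_types) + ")")
    if not dnase:
        # digestion mode: mates must not fall on the same or adjacent fragments
        parts.append("(abs(int(rfrag1) - int(rfrag2)) > 1)")
    return " and ".join(parts)


if __name__ == "__main__":
    print(build_select_expr(min_mapq=10, min_cis_dist=1000, dnase=True))
```

Because the condition is a plain Python expression, it can also be evaluated against a hand-written row (a dict of column values) to sanity-check a filter before running the pipeline.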

jeremymsimon · 2025-01-16T21:12:47Z

Hi @nservant
I have been using this version as a means of processing HiChIP data, with the eventual goal of running FitHiChIP on the outputs here, consistent with what is described in this vignette.

The pipeline runs fine for me with the following command:

nextflow run nf-core-hic \
   -r dev \
   -c /jsimonlab/pipelines/nfcore/nextflow.config \
   --input HiChIP_nextflow_samplesheet.csv \
   --processing pairtools \
   --save_pairs_intermediates \
   --fasta GRCh38.primary_assembly.genome.fa \
   --bwa_index bwa_v0717/hg38 \
   --no_digestion \
   --min_cis_dist 1000 \
   --bin_size '5000,10000,20000,50000,100000,500000,1000000' \
   --outdir HiChIP_nextflow_hic_bwamem_pairtools

and in my pairtools directory, I have

*split.pairs.gz
*unselected.pairs.gz
*selected.pairs.gz.px2
*selected.pairs.gz
stats/

I thought that the *selected.pairs.gz file would work as input to FitHiChIP in the same way as the allValidPairs output from HiC-Pro, but I get an error when I try. I think it could be because the two files are formatted slightly differently:

$ head /path/to/output.allValidPairs
GWNJ-0842:984:GW2205304471st:4:1216:15483:48459	GL000008.2	998	+	GL000008.2	1000	-	NA	NA	NA	42	23	
GWNJ-0842:984:GW2205304471st:4:1202:27590:7023	GL000008.2	1561	-	GL000008.2	3229	-	NA	NA	NA	23	42	
GWNJ-0842:984:GW2205304471st:4:1218:16782:47123	GL000008.2	1819	+	GL000008.2	2246	-	NA	NA	NA	25	42	
GWNJ-0842:984:GW2205304471st:4:1216:27874:61169	GL000008.2	2089	-	GL000008.2	2103	+	NA	NA	NA	40	42	
GWNJ-0842:984:GW2205304471st:4:2212:12357:64843	GL000008.2	2315	-	GL000008.2	2499	-	NA	NA	NA	23	40	
GWNJ-0842:984:GW2205304471st:4:1121:3021:68535	GL000008.2	2382	-	GL000008.2	2759	-	NA	NA	NA	40	23	
GWNJ-0842:984:GW2205304471st:4:2219:15179:69203	GL000008.2	2440	-	GL000008.2	2838	-	NA	NA	NA	42	42	
GWNJ-0842:984:GW2205304471st:4:1204:7415:51711	GL000008.2	2466	+	KI270722.1	180699	+	NA	NA	NA	24	26	
GWNJ-0842:984:GW2205304471st:4:2118:24688:47843	GL000008.2	2491	+	GL000008.2	3247	-	NA	NA	NA	23	42	
GWNJ-0842:984:GW2205304471st:4:2120:14519:44925	GL000008.2	2518	-	GL000008.2	2720	-	NA	NA	NA	42	24

$ zcat pairtools/DTG-HiChIP-pooled.selected.pairs.gz | grep -v '#' | head
GWNJ-0842:984:GW2205304471st:4:1119:13717:67937	GL000008.2	24	GL000008.2	1209	+	-	UU	60	60
GWNJ-0842:984:GW2205304471st:4:2205:11231:42464	GL000008.2	139	GL000008.2	1787	-	+	uU	55	22
GWNJ-0842:984:GW2205304471st:4:1116:24089:7989	GL000008.2	447	GL000008.2	3216	+	-	uU	17	37
GWNJ-0842:984:GW2205304471st:4:1224:27255:72121	GL000008.2	567	GL000008.2	1754	+	-	UU	60	60
GWNJ-0842:984:GW2205304471st:4:2205:16782:23460	GL000008.2	699	GL000008.2	2404	+	-	UU	31	60
GWNJ-0842:984:GW2205304471st:4:2219:13565:61239	GL000008.2	841	GL000008.2	2937	-	-	Uu	42	25
GWNJ-0842:984:GW2205304471st:4:2208:12997:16340	GL000008.2	1030	GL000008.2	5498	-	+	Uu	36	29
GWNJ-0842:984:GW2205304471st:4:1222:3123:30668	GL000008.2	1045	GL000008.2	111679	-	-	UU	60	22
GWNJ-0842:984:GW2205304471st:4:2218:27762:32461	GL000008.2	1091	GL000008.2	3762	+	+	uU	15	60
GWNJ-0842:984:GW2205304471st:4:1121:8643:35168	GL000008.2	1182	GL000008.2	3302	-	+	UU	60	38

Note the '#' header lines in the pairs file. Aside from those, the number of columns differs (12 vs 10) and the strand columns are not in the same positions.
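
One way to bridge the two layouts shown above is to reorder the pairs columns into the allValidPairs order. The Python sketch below is only that, a sketch: the function names are hypothetical, the pairs column order (read ID, chrom1, pos1, chrom2, pos2, strand1, strand2, pair type, mapq1, mapq2) is inferred from the rows printed above and should be checked against the `#columns:` header line, and the three fragment fields are filled with `NA` as in the allValidPairs excerpt. Whether FitHiChIP accepts the result has not been verified.

```python
import gzip


def pairs_to_validpairs_line(line: str) -> str:
    """Reorder one .pairs record into the 12-column allValidPairs layout."""
    fields = line.rstrip("\n").split("\t")
    (read_id, chrom1, pos1, chrom2, pos2,
     strand1, strand2, _pair_type, mapq1, mapq2) = fields[:10]
    return "\t".join([
        read_id,
        chrom1, pos1, strand1,   # first mate
        chrom2, pos2, strand2,   # second mate
        "NA", "NA", "NA",        # fragment size and fragment names: not available
        mapq1, mapq2,
    ])


def convert(pairs_gz: str, out_path: str) -> None:
    """Convert a gzipped .pairs file, skipping the '#' header lines."""
    with gzip.open(pairs_gz, "rt") as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith("#"):
                continue
            dst.write(pairs_to_validpairs_line(line) + "\n")
```

For example, `convert("pairtools/DTG-HiChIP-pooled.selected.pairs.gz", "DTG-HiChIP-pooled.allValidPairs")` would write one reordered line per pair; the output file name here is just a placeholder.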

Is there a more appropriate file that the pipeline here creates that is equivalent to the allValidPairs format? Or do you have other suggestions for how to take the existing output here and run FitHiChIP?

Another option is to use the contact_maps/cool/*.cool files created here if the others are not compatible, which does seem to work if need be

Thanks!

nservant added the enhancement New feature or request label May 5, 2023

nservant changed the title ~~Include Dovotail protocol~~ Include pairtools protocol May 5, 2023

nservant added a commit to nservant/nf-core-hic that referenced this issue Jun 15, 2023

[MODIF] add pairtools nf-core#162

8d40b36

nservant pinned this issue Jun 15, 2023

nservant added a commit to nservant/nf-core-hic that referenced this issue Jun 15, 2023

[MODIF] add pairtools nf-core#162

a4a546e

nservant added this to the version-2.2.0 milestone Jan 26, 2024
