Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include pairtools protocol #162

Open
4 of 5 tasks
nservant opened this issue May 5, 2023 · 4 comments
Open
4 of 5 tasks

Include pairtools protocol #162

nservant opened this issue May 5, 2023 · 4 comments
Labels
enhancement New feature or request
Milestone

Comments

@nservant
Copy link
Collaborator

nservant commented May 5, 2023

Description of feature

Add a new paramater --processing hicpro or --processing pairtools

propose an alternative to HiC-Pro with the analysis protocol proposed by 'dovotail' and based on bwa-mem2 and pairtools with the following steps :

  • Mapping with bwa-mem (version 2 instead of 1)
  • pairtools parse - get valid ligation product / parameters --min_mapq
  • pairtools sort - sorting pairs
  • paritools merge - merge pairs
  • pairtools dedup - remove PCR dups / parameters --keep_dups
  • pairtools split - generate pairs files (and final bam if useful ?)
  • pairtools select - filter pairs
  • pairtools stats - generate final stats

To validate ;

  • Remove multi-hits - parameter --keep_multi ?
  • Compatibility of pairtools with --digestion, --restriction_site, --ligation_site, --chromosome_size, --restriction_fragments parameters ?
  • Filtering on fragment size with pairtools ? options --max_insert_size --min_insert_size --max_restriction_fragment_size --min_fragment_size
  • Mode --dnase with pairtools ?
  • Filtering based on the distance (--min_cis_dist) for --dnase mode with --pairtools

Other ideas:

  • Replace --dnase option by --no_digestion for DNAseq, microC, etc.
@nservant nservant added the enhancement New feature or request label May 5, 2023
@nservant nservant changed the title Include Dovotail protocol Include pairtools protocol May 5, 2023
nservant added a commit to nservant/nf-core-hic that referenced this issue Jun 15, 2023
@nservant
Copy link
Collaborator Author

First test version available with --processing pairtools

@nservant
Copy link
Collaborator Author

The options --keep_multi / --min_mapq / --min_cis_dist / --save_interaction_bam have been managed

@nservant nservant pinned this issue Jun 15, 2023
nservant added a commit to nservant/nf-core-hic that referenced this issue Jun 15, 2023
@nservant
Copy link
Collaborator Author

nservant commented Jun 21, 2023

First version tested.

To further validate before release ... still need to find a way to use the options :

  • --min_insert_size / --max_insertisize
  • --min_restriction_fragment_size / --max_restriction_fragment_size
       ext.args = { [
            "(mapq1>${params.min_mapq} and mapq2>${params.min_mapq})",
            params.min_cis_dist > 0 ? " and (abs(pos1-pos2) < ${params.min_cis_dist})" : '',
            params.keep_multi ? " and ((pair_type=='UU') or (pair_type=='UR') or (pair_type=='RU') or (pair_type=='MM') or (pair_type=='MU'))" : 
                                " and ((pair_type=='UU') or (pair_type=='UR') or (pair_type=='RU'))",
            params.dnase ? '' : " and (abs(int(rfrag1) - int(rfrag2)) > 1)",
            //params.min_insert_size > 0 ?  " and ( (rfrag_end1 - r1pos) + (rfrag_end2 - r2pos)) > ${params.min_insert_size}" : '',
            //params.max_insert_size > 0 ? " and ( (rfrag_end1 - r1pos) + (rfrag_end2 - r2pos)) < ${params.max_insert_size}" : '',
            //params.min_restriction_fragment_size > 0 ? " -t ${params.min_restriction_fragment_size}" : '',
            //params.max_restriction_fragment_size > 0 ? " -m ${params.max_restriction_fragment_size}" : '',
        ].join(' ').trim() }

@nservant nservant added this to the version-2.2.0 milestone Jan 26, 2024
@jeremymsimon
Copy link

jeremymsimon commented Jan 16, 2025

Hi @nservant
I have been using this version as a means of processing HiChIP data, with the eventual goal of running FitHiChIP on the outputs here, consistent with what is described in this vignette.

The pipeline runs fine for me with the following executable:

nextflow run nf-core-hic \
   -r dev \
   -c /jsimonlab/pipelines/nfcore/nextflow.config \
   --input HiChIP_nextflow_samplesheet.csv \
   --processing pairtools \
   --save_pairs_intermediates \
   --fasta GRCh38.primary_assembly.genome.fa \
   --bwa_index bwa_v0717/hg38 \
   --no_digestion \
   --min_cis_dist 1000 \
   --bin_size '5000,10000,20000,50000,100000,500000,1000000' \
   --outdir HiChIP_nextflow_hic_bwamem_pairtools

and in my pairtools directory, I have

*split.pairs.gz
*unselected.pairs.gz
*selected.pairs.gz.px2
*selected.pairs.gz
stats/

I thought that using the *selected.pairs.gz file as input to FitHiChIP would then work equivalently as the allValidPairs output from HiC-Pro, but I'm getting an error this way; I think it could be because these two files are formatted very slightly differently:

$ head /path/to/output.allValidPairs
GWNJ-0842:984:GW2205304471st:4:1216:15483:48459	GL000008.2	998	+	GL000008.2	1000	-	NA	NA	NA	42	23	
GWNJ-0842:984:GW2205304471st:4:1202:27590:7023	GL000008.2	1561	-	GL000008.2	3229	-	NA	NA	NA	23	42	
GWNJ-0842:984:GW2205304471st:4:1218:16782:47123	GL000008.2	1819	+	GL000008.2	2246	-	NA	NA	NA	25	42	
GWNJ-0842:984:GW2205304471st:4:1216:27874:61169	GL000008.2	2089	-	GL000008.2	2103	+	NA	NA	NA	40	42	
GWNJ-0842:984:GW2205304471st:4:2212:12357:64843	GL000008.2	2315	-	GL000008.2	2499	-	NA	NA	NA	23	40	
GWNJ-0842:984:GW2205304471st:4:1121:3021:68535	GL000008.2	2382	-	GL000008.2	2759	-	NA	NA	NA	40	23	
GWNJ-0842:984:GW2205304471st:4:2219:15179:69203	GL000008.2	2440	-	GL000008.2	2838	-	NA	NA	NA	42	42	
GWNJ-0842:984:GW2205304471st:4:1204:7415:51711	GL000008.2	2466	+	KI270722.1	180699	+	NA	NA	NA	24	26	
GWNJ-0842:984:GW2205304471st:4:2118:24688:47843	GL000008.2	2491	+	GL000008.2	3247	-	NA	NA	NA	23	42	
GWNJ-0842:984:GW2205304471st:4:2120:14519:44925	GL000008.2	2518	-	GL000008.2	2720	-	NA	NA	NA	42	24
$ zcat pairtools/DTG-HiChIP-pooled.selected.pairs.gz | grep -v '#' | head
GWNJ-0842:984:GW2205304471st:4:1119:13717:67937	GL000008.2	24	GL000008.2	1209	+	-	UU	60	60
GWNJ-0842:984:GW2205304471st:4:2205:11231:42464	GL000008.2	139	GL000008.2	1787	-	+	uU	55	22
GWNJ-0842:984:GW2205304471st:4:1116:24089:7989	GL000008.2	447	GL000008.2	3216	+	-	uU	17	37
GWNJ-0842:984:GW2205304471st:4:1224:27255:72121	GL000008.2	567	GL000008.2	1754	+	-	UU	60	60
GWNJ-0842:984:GW2205304471st:4:2205:16782:23460	GL000008.2	699	GL000008.2	2404	+	-	UU	31	60
GWNJ-0842:984:GW2205304471st:4:2219:13565:61239	GL000008.2	841	GL000008.2	2937	-	-	Uu	42	25
GWNJ-0842:984:GW2205304471st:4:2208:12997:16340	GL000008.2	1030	GL000008.2	5498	-	+	Uu	36	29
GWNJ-0842:984:GW2205304471st:4:1222:3123:30668	GL000008.2	1045	GL000008.2	111679	-	-	UU	60	22
GWNJ-0842:984:GW2205304471st:4:2218:27762:32461	GL000008.2	1091	GL000008.2	3762	+	+	uU	15	60
GWNJ-0842:984:GW2205304471st:4:1121:8643:35168	GL000008.2	1182	GL000008.2	3302	-	+	UU	60	38

Note the '#' header lines, but aside from those, the number of columns is different (12 vs 10) and the column with strand identifiers isn't the same

Is there a more appropriate file that the pipeline here creates that is equivalent to the allValidPairs format? Or do you have other suggestions for how to take the existing output here and run FitHiChIP?

Another option is to use the contact_maps/cool/*.cool files created here if the others are not compatible, which does seem to work if need be

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants