diploid, haplotype genome, output result #98
Hi @jinhua2024,
Yes, the Hi-C data can only be used for chromosome-level phasing, whereas the trio mode can phase the contigs into paternal and maternal sequences at the whole-genome level. Therefore, the results of your k-mer analysis are reasonable, and your choice to use trio mode for hifiasm is also appropriate in this case. A similar comparison can be found in hifiasm's paper:
HapHiC does not infer haplotypes right now, as the process could be problematic and misleading. Current Hi-C scaffolders cannot guarantee perfect chromosome assignment results. Some output scaffolds (groups) may originate from the same chromosome, while a single scaffold may contain sequences from different chromosomes. Therefore, it is essential to verify the results in Juicebox first.
In your case, since you have provided the GFA files for both haplotypes, the contigs from different parents should not be clustered together. Consequently, you can differentiate each scaffold as paternal or maternal based on the contig IDs within them.
This is an alternative approach. We often adopt it when the Hi-C data after filtering are insufficient for scaffolding due to low heterozygosity. This strategy can preserve many more Hi-C links when filtering alignments using the criterion MAPQ ≥ 1. However, aligning Hi-C links originating from two haplotypes to a single haplotype may lead to some problems if there are some structural variations between the haplotypes. I hope my answer can resolve your confusion. Best regards, |
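Thank you very much, Professor Zeng! Thank you for your guidance!
I would like to follow up with another question: can I split 04.build/scaffolds.fa into two haplotype genomes based on the contig IDs? Is there a tool you would recommend for splitting 04.build/scaffolds.fa into two haplotype genomes? Does HapHiC have a command for this? My output file is 04.build/scaffolds.agp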
Yes.
No. The AGP file is very easy to understand. For example, all the contigs in group1 start with 'h2', so this group should be assigned to haplotype 2. Similarly, group2 and group3 should be assigned to haplotype 1, as the contigs in these two groups start with 'h1'. |
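A minimal sketch of reading the AGP this way, assuming the contig IDs in column 6 start with `h1` or `h2` as described above and that scaffold names start with `group`. The function name `agp_haplotypes` is hypothetical and is not a HapHiC command; it labels a group `mixed` when its contigs carry both prefixes.

```shell
# Hypothetical helper (not part of HapHiC): label each scaffold in an AGP file
# as h1, h2, or mixed, based on the first two characters of its contig IDs.
# Usage: agp_haplotypes scaffolds.agp
agp_haplotypes() {
    awk -F'\t' '
        # Component lines ("W" in column 5) carry a contig ID in column 6.
        $1 ~ /^group/ && $5 == "W" {
            hap = substr($6, 1, 2)
            if (!($1 in label)) {
                label[$1] = hap
                order[++n] = $1          # remember first-seen order
            } else if (label[$1] != hap) {
                label[$1] = "mixed"      # contigs from both haplotypes
            }
        }
        END {
            for (i = 1; i <= n; i++) print order[i] "\t" label[order[i]]
        }
    ' "$@"
}
```

Calling `agp_haplotypes scaffolds.agp` prints one line per group; any group reported as `mixed` is worth inspecting in Juicebox before it is assigned to a haplotype.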
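Thank you very much, Professor Zeng! I am still a little confused, so I would like to ask a few more questions. So far I have not yet split 04.build/scaffolds.fa into hap1.fa and hap2.fa based on the contig IDs. 2. Running the HapHiC downstream analysis and visualization: after running juicebox.sh, I ran haphic plot out_JBAT.liftover.agp contact_matrix.pkl --bin_size 1000 --min_len 1. The results before and after juicebox.sh look different, and the earlier one seems to look better. Could you tell me how to adjust this? Thank you for your guidance!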
I have explained this in my previous answer. The issue is not with hifiasm; it simply arises because Hi-C data cannot be used to differentiate between paternal and maternal sequences. Therefore, if you have trio data, using it for phasing haplotypes would be a better choice, as trio data can achieve whole-genome phasing (and also a lower switch error rate).
If GFA files are not provided, HapHiC will perform chromosome-level phasing through chromosome clustering. However, when GFA files are provided and
Yes. There is no reason or necessity for HapHiC to modify the sequence IDs output by hifiasm.
You used an incorrect AGP file for the HapHiC plot (
# Generate the final FASTA file for the scaffolds
$ /path/to/HapHiC/utils/juicer post -o out_JBAT out_JBAT.review.assembly out_JBAT.liftover.agp asm.fa
Yes, but you might also detect and correct some phasing errors made by hifiasm during manual curation in Juicebox, which means that there could be groups containing contigs that start with both |
If you have changed your AGP file (from
$ grep "^group" scaffolds.agp.txt | awk '$5=="W"{print $1"\t"substr($6, 1, 2)"\t"$3-$2+1}' | awk '{sum[$1"\t"$2]+=$3}END{for(h in sum){print h"\t"sum[h]}}' | sort -k1.6,1n
Output:
I don't think there are many differences between these two strategies. It is important to ensure that both haplotypes are drawn within the same contact map, as this approach can help identify potential phasing errors between the haplotypes. |
Thank you, Professor Zeng! Thank you very much! I only recently started teaching myself, so I run into some problems.
These empty signals are often caused by the multiple mapping of Hi-C reads to repetitive sequences or very similar regions between haplotypes. Hi-C reads that map to multiple locations are typically filtered out using a MAPQ>=1 criterion. For an understanding of multiple mapping, please refer to: https://samtools.github.io/hts-specs/SAMv1.pdf. This issue arises from intrinsic genomic characteristics and limitations of sequencing technologies and should not be addressed by altering contigs, such as breaking them or making any other modifications. Stepping back, even if it is necessary to break contigs (to correct chimeric contigs), the |
Thank you, Professor Zeng! Thank you very much! I have already used the --correct_nrounds 2 parameter, and I will follow your advice and leave these blank regions unchanged. Does the command below output the required list of contig IDs? That is, can it produce the file needed for --corrected_ctgs corrected_ctgs.txt? However, you mentioned: "If assembly correction has been performed, use corrected_asm.fa as input FASTA file instead of the first asm.fa. Additionally, specify the corrected contig list corrected_ctgs.txt using the --corrected_ctgs parameter. Otherwise, the YaHS-style scaffolds.raw.agp generated may be incorrect." Thank you, Professor Zeng!
Your thoughts are all in a mess, so let's start from the beginning. You simply want to get two FASTA files and two contact maps corresponding to the two haplotypes, right?
(1) To check whether all the contigs in each group belong to the same haplotype:
$ grep "^group" scaffolds.agp.txt | awk '$5=="W"{print $1"\t"substr($6, 1, 2)"\t"$3-$2+1}' | awk '{sum[$1"\t"$2]+=$3}END{for(h in sum){print h"\t"sum[h]}}' | sort -k1.6,1n > stat.txt
The output should look like this:
$ cat stat.txt
group1 h2 227146140
group2 h1 221162681
group3 h1 214953513
group4 h2 207079995
group5 h1 175217546
...
If each group appears only once, this indicates that each group is assigned exclusively to either h1 or h2.
(2) Find haplotype-specific groups for each haplotype (e.g., for h1):
$ grep h1 stat.txt | cut -f 1 | xargs echo | sed 's/ /,/g'
Output:
(3) Extract haplotype-specific sequences for each haplotype (e.g., for h1) using tools such as seqkit:
$ seqkit grep scaffolds.fa -p "group2,group3,group5,..." > hap1.fa
(4) Generate a contact map for each haplotype (e.g., for h1):
$ haphic plot scaffolds.agp.txt HiC.filtered.bam --specified_scaffolds "group2,group3,group5,..."
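If seqkit is not available, step (3) can be approximated with plain awk. This is only a sketch under stated assumptions: `extract_scaffolds` is a hypothetical helper name, the group list is the comma-separated string produced in step (2), and sequence IDs are matched exactly against the first word of each FASTA header.

```shell
# Hypothetical seqkit-free alternative for step (3): keep only the FASTA
# records whose ID (first word after ">") is in a comma-separated list.
# Usage: extract_scaffolds "group2,group3,group5" scaffolds.fa > hap1.fa
extract_scaffolds() {
    list=$1
    shift
    awk -v list="$list" '
        BEGIN {
            n = split(list, names, ",")
            for (i = 1; i <= n; i++) keep[names[i]] = 1
        }
        # On each header line, decide whether the following record is kept.
        /^>/ { id = substr($1, 2); printing = (id in keep) }
        printing
    ' "$@"
}
```

Because the match is exact, `group1` does not also select `group10`. Running the function once per haplotype with the corresponding group list yields hap1.fa and hap2.fa.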
Thank you, Professor Zeng! Thank you for your guidance! group1 h2 227146140
Haplotype 1:
Haplotype 2:
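Hello, Professor Zeng! I am very glad that you developed HapHiC. I am a student who has just started learning genome assembly and pan-genome construction. I have run into some confusion while using HapHiC and would like to ask for your advice.
I. My information
1. Species: pig, 2n = 38 chromosomes (including the X and Y sex chromosomes)
2. HapHiC parameters used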
/HapHiC/haphic pipeline FM2_allhap.fa HiC_fm2.filtered.bam 38 --gfa "/FM2_l2/FM2_l2.asm.dip.hap1.p_ctg.gfa,/FM2_l2/FM2_l2.asm.dip.hap2.p_ctg.gfa" --threads 36 --processes 36 --correct_nrounds 2 --RE "AAGCTT"
3. Method used for Hi-C read mapping and filtering
bwa mem -5SP -t 36 FM2_allhap.fa FM2_1_clean.fq.gz FM2_2_clean.fq.gz | samblaster | samtools view - -@ 28 -S -h -b -F 3340 -o HiC_fm2.bam
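4. Method used for genome assembly
My data come from a family that includes the offspring and both parents. The offspring was assembled with hifiasm in trio mode, and the parents with hifiasm in Hi-C mode. hap1.p_ctg and hap2.p_ctg were concatenated with cat into FM2_allhap.fa, which was used as the input file for HapHiC and bwa.
5. Command used to filter the Hi-C BAM file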
HapHiC/utils/filter_bam HiC_fm2.bam 1 --nm 3 --threads 28 | samtools view - -b -@ 28 -o HiC_fm2.filtered.bam
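For reference, the `-F 3340` mask in the mapping command of item 3 can be decoded bit by bit using the FLAG values defined in the SAM specification. The sketch below is illustrative only: `decode_sam_flag` is a hypothetical helper, and the lowercase bit names are shorthand labels chosen here rather than official identifiers.

```shell
# Decode a SAM FLAG mask into the bits it contains (values from the SAM spec).
# Usage: decode_sam_flag 3340
decode_sam_flag() {
    flag=$1
    for entry in \
        1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
        16:reverse 32:mate_reverse 64:read1 128:read2 \
        256:secondary 512:qc_fail 1024:duplicate 2048:supplementary
    do
        bit=${entry%%:*}
        name=${entry#*:}
        # Print the bit and its label when the bit is set in the mask.
        if [ $(( flag & bit )) -ne 0 ]; then
            printf '%s\t%s\n' "$bit" "$name"
        fi
    done
}
```

For 3340 this lists the bits 4, 8, 256, 1024 and 2048, so `samtools view -F 3340` drops reads that are unmapped, have an unmapped mate, or are secondary, duplicate or supplementary alignments. `samtools flags 3340` gives the same breakdown if samtools is installed.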
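II. My goal
After assembling the genomes with hifiasm and anchoring them to chromosomes, I want to build a pan-genome of the individuals in the family and identify structural variants (SVs).
hifiasm has already assembled two haplotype genomes for each individual; I would like to scaffold two chromosome-level haplotype genomes for each individual.
III. My output
04.build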
HapHiC_build.log juicebox.sh scaffolds.agp scaffolds.fa scaffolds.raw.agp
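IV. My questions
The haplotypes produced by hifiasm in Hi-C mode are intermixed. Before running HapHiC I merged hap1.fa and hap2.fa, and HapHiC outputs only a single genome, scaffolds.fa.
If I want to obtain two haplotype genomes, can the output 04.build/scaffolds.fa be split?
If not, do I have to run HapHiC separately and scaffold hap1 and hap2 individually?
Thank you, Professor Zeng! I hope you can answer when it is convenient for you.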