Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

diploid, haplotype genome, output result #98

Open
jinhua2024 opened this issue Jan 7, 2025 · 13 comments
Open

diploid, haplotype genome, output result #98

jinhua2024 opened this issue Jan 7, 2025 · 13 comments

Comments

@jinhua2024
Copy link

jinhua2024 commented Jan 7, 2025

曾老师,您好!非常高兴您能开发出haphic。我是一名刚刚入门正在学习组装基因组和泛基因组构建的学生,使用haphic遇到疑惑,想咨询一下您。
一我的信息
1.物种:猪,染色体2n=38(包括XY性染色体)
2.运行haphic参数
/HapHiC/haphic pipeline FM2_allhap.fa HiC_fm2.filtered.bam 38 --gfa "/FM2_l2/FM2_l2.asm.dip.hap1.p_ctg.gfa,/FM2_l2/FM2_l2.asm.dip.hap2.p_ctg.gfa" --threads 36 --processes 36 --correct_nrounds 2 --RE "AAGCTT"
3.用于 Hi-C 读取映射和筛选的方法
bwa mem -5SP -t 36 FM2_allhap.fa FM2_1_clean.fq.gz FM2_2_clean.fq.gz | samblaster | samtools view - -@ 28 -S -h -b -F 3340 -o HiC_fm2.bam
4.用于基因组组装的方法
我的是一个家系包含子代和父母本。子代使用hifiasm trio模式,父母本则用hifiasm hic模式。hap1.p_ctg和hap2.p_ctg使用cat合并后FM2_allhap.fa,作为haphic和bwa的输入文件
5.过滤hic.bam的命令
HapHiC/utils/filter_bam HiC_fm2.bam 1 --nm 3 --threads 28 | samtools view - -b -@ 28 -o HiC_fm2.filtered.bam

二我的目的
hifiasm组装基因组,挂载染色体之后,构建家系个体的泛基因组,找结构变异SV。
hifiasm已经为每个个体组装出两个单倍型基因组,我希望每个个体挂载出两个染色体水平单倍型基因组。

三我的输出结果
0.4 build
HapHiC_build.log juicebox.sh scaffolds.agp scaffolds.fa scaffolds.raw.agp

四我的问题
hifiasm的hic模式的结果单倍型存在交错。运行haphic前我合并了hap1.fa和hap2.fa,运行haphic的结果仅输出一个基因组scaffolds.fa。
如果要获得两个单倍型基因组,输出结果0.4 build/scaffolds.fa能否分开呢?
如果不能,是不是只能分开运行haphic,分别对hap1和hap2单倍型进行挂载呢?

谢谢曾老师!希望您方便时解答一下!

@jinhua2024 jinhua2024 changed the title hifiasm trio or hic, diploid, haplotype genome, output result diploid, haplotype genome, output result Jan 8, 2025
@zengxiaofei
Copy link
Owner

Hi @jinhua2024,

hifiasm的hic模式的结果单倍型存在交错

Yes, the Hi-C data can only be used for chromosome-level phasing, whereas the trio mode can phase the contigs into paternal and maternal sequences at the whole-genome level. Therefore, the results of your k-mer analysis are reasonable, and your choice to use trio mode for hifiasm is also appropriate in this case. A similar comparison can be found in hifiasm's paper:

image

运行haphic前我合并了hap1.fa和hap2.fa,运行haphic的结果仅输出一个基因组scaffolds.fa

HapHiC does not infer haplotypes right now, as the process could be problematic and misleading. Current Hi-C scaffolders cannot guarantee perfect chromosome assignment results. Some output scaffolds (groups) may originate from the same chromosome, while a single scaffold may contain sequences from different chromosomes. Therefore, it is essential to verify the results in Juicebox first.

如果要获得两个单倍型基因组,输出结果0.4 build/scaffolds.fa能否分开呢?

In your case, since you have provided the GFA files for both haplotypes, the contigs from different parents should not be clustered together. Consequently, you can differentiate each scaffold as paternal or maternal based on the contig IDs within them.

如果不能,是不是只能分开运行haphic,分别对hap1和hap2单倍型进行挂载呢?

This is an alternative approach. We often adopt it when the Hi-C data after filtering are insufficient for scaffolding due to low heterozygosity. This strategy can preserve many more Hi-C links when filtering alignments using the criterion MAPQ ≥ 1. However, aligning Hi-C links originating from two haplotypes to a single haplotype may lead to some problems if there are some structural variations between the haplotypes.

I hope my answer can resolve your confusion.

Best regards,
Xiaofei

@jinhua2024
Copy link
Author

非常感谢曾老师!!感谢您的指导!!

In your case, since you have provided the GFA files for both haplotypes, the contigs from different parents should not be clustered together. Consequently, you can differentiate each scaffold as paternal or maternal based on the contig IDs within them.

我想继续询问一下曾老师,我是可以根据 contig IDs将04.build/scaffolds.fa分开为两个单倍型基因组吗?
这是我的04.build/scaffolds.agp部分信息,是根据这个文件进行划分吗?
image

对于04.build/scaffolds.fa划分为两个单倍型基因组,曾老师有什么工具建议进行划分吗?haphic有没有相关的命令?

我输出的结果文件04.build/scaffolds.agp
scaffolds.agp.txt

@zengxiaofei
Copy link
Owner

这是我的04.build/scaffolds.agp部分信息,是根据这个文件进行划分吗?

Yes.

对于04.build/scaffolds.fa划分为两个单倍型基因组,曾老师有什么工具建议进行划分吗?haphic有没有相关的命令?

No. The AGP file is very easy to understand. For example, all the contigs in group1 start with 'h2', so this group should be assigned to haplotype 2. Similarly, group2 and group3 should be assigned to haplotype 1, as the contigs in these two groups start with 'h1'.

@jinhua2024
Copy link
Author

jinhua2024 commented Jan 11, 2025

非常感谢曾老师!我还是有点疑惑,继续询问一下曾老师!目前我还没有根据contig ID 将04.build/scaffolds.fa区分为hap1.fa和hap2.fa。
1、hifiasm的hic模式并没有很好区分来自父本或母本的单倍型。但是haphic不推断单倍型,那么原来的单倍型来自父母本的片段还是会在同一个单倍型基因组里?04.build/scaffolds.fa中的的contigID和我输入的FM2_allhap.fa的ID是一致的?

2、运行haphic 下游分析可视化
代码::haphic plot scaffolds.raw.agp HiC_fm2.filtered.bam --bin_size 1000 --min_len 1
contact_map_raw.pdf

我运行juicebox.sh之后,再运行haphic plot out_JBAT.liftover.agp contact_matrix.pkl --bin_size 1000 --min_len 1
contact_map.pdf

感觉juicebox.sh前后的结果差异,好像之前更好看一些?您能告诉我该怎么调节吗?
3、我这里有点疑惑,是juicebox手动校正之后,再根据contig ID 将最终的scaffolds.fa区分为hap1.fa和hap2.fa吗?

感谢曾老师的指导!

@zengxiaofei
Copy link
Owner

hifiasm的hic模式并没有很好区分来自父本或母本的单倍型。

I have explained this in my previous answer. The issue is not with hifiasm; it simply arises because Hi-C data cannot be used to differentiate between paternal and maternal sequences. Therefore, if you have trio data, using it for phasing haplotypes would be a better choice, as trio data can achieve whole-genome phasing (and also a lower switch error rate).

但是haphic不推断单倍型,那么原来的单倍型来自父母本的片段还是会在同一个单倍型基因组里?

If GFA files are not provided, HapHiC will perform chromosome-level phasing through chromosome clustering. However, when GFA files are provided and --phasing_weight is set to 1 (the default value), HapHiC will fully trust the phasing results from hifiasm, whether they represent chromosome-level or whole-genome level phasing. This means that contigs starting with h1tg should not be clustered with those starting with h2tg; each group (chromosome) should contain only contigs from a single parent phased by hifiasm.

04.build/scaffolds.fa中的的contigID和我输入的FM2_allhap.fa的ID是一致的?

Yes. There is no reason or necessity for HapHiC to modify the sequence IDs output by hifiasm.

感觉juicebox.sh前后的结果差异,好像之前更好看一些?您能告诉我该怎么调节吗?

You used an incorrect AGP file for the HapHiC plot (out_JBAT.liftover.agp). You should use the final AGP file output by juicer post (*.final.agp). Please refer to our documentation for more information:

After manual curation in Juicebox, you will obtain the modified assembly file out_JBAT.review.assembly . To generate the final FASTA file for the scaffolds:

# Generate the final FASTA file for the scaffolds
$ /path/to/HapHiC/utils/juicer post -o out_JBAT out_JBAT.review.assembly out_JBAT.liftover.agp asm.fa

我这里有点疑惑,是juicebox手动校正之后,再根据contig ID 将最终的scaffolds.fa区分为hap1.fa和hap2.fa吗?

Yes, but you might also detect and correct some phasing errors made by hifiasm during manual curation in Juicebox, which means that there could be groups containing contigs that start with both h1tg and h2tg.

@jinhua2024
Copy link
Author

jinhua2024 commented Jan 12, 2025

谢谢曾老师,谢谢您的耐心指导,使我对haphic进一步认识和使用它。我想分享我的使用的命令于此处,希望能帮到像我一样什么都不懂的同学。同时,也想请曾老师看看我的命令是否合理呢。

1,hifiasm trio模式
hifiasm -l1 -o FM2_l2.asm -t 30 -1 pat.yak -2 mat.yak FM2_hifi.fq.gz --trio-dual
#猪基因组2.5g,当我使用-l0参数时,生成的hap1和hap2大约为3.5g。我使用-l1是,输出结果大约2.4g。

2,hifiasm hic模式
hifiasm -o Y72207_hic.asm -t 40 --h1 Y72207_L01_1_clean.fq.gz --h2 Y72207_L01_2_clean.fq.gz /Y72207_hifi.fq.gz
3,gfa2fa
awk '/^S/{print ">"$2;print $3}' hap1.gfa > hap1.fa
awk '/^S/{print ">"$2;print $3}' hap2.gfa > hap2.fa
4,合并hap1和hap2
cat hap1.fa hap2.fa > FM2_allhap.fa
5, align Hi-C data to the assembly, remove PCR duplicates and filter out secondary and supplementary alignments
bwa index FM2_allhap.fa
bwa mem -5SP -t 36 FM2_allhap.fa FM2_1_clean.fq.gz FM2_2_clean.fq.gz | samblaster | samtools view - -@ 28 -S -h -b -F 3340 -o HiC_fm2.bam
6,过滤bam
HapHiC/utils/filter_bam HiC_fm2.bam 1 --nm 3 --threads 28 | samtools view - -b -@ 28 -o HiC_fm2.filtered.bam
7,One-line command运行haphic
HapHiC/haphic pipeline FM2_allhap.fa HiC_fm2.filtered.bam 38 --gfa "FM2_hap1.p_ctg.gfa,FM2_hap2.p_ctg.gfa" --threads 36 --processes 36 --correct_nrounds 2 --RE "AAGCTT"
38为染色体数目
#--RE为HindIII对应的酶切位点

8,运行haphic生成的juicebox.sh,生成juicebox软件需要的输入文件
cd 04.build && bash juicebox.sh
9,生成手动调整前的hic互作热图
cd 04.build
HapHiC/haphic plot scaffolds.raw.agp HiC_fm2.filtered.bam --bin_size 5000 --min_len 5
10,juicebox软件手动调整的hic互作热图以及输出最终的结果,out_JBAT.review.assembly
11,HapHiC/utils/juicer post生成手动调整后的基因组FM2_FINAL.fa,画热图需要的文件
HapHiC/utils/juicer post -o out_JBAT out_JBAT.review.assembly out_JBAT.liftover.agp FM2_allhap.fa
12,画最终的hic互作热图
HapHiC/haphic plot out_JBAT.FINAL.agp contact_matrix.pkl --bin_size 1000 --min_len 1
13,根据ut_JBAT.FINAL.agp中的单倍型基因组contig ID信息 ,把FM2_FINAL.fa划分为最终的hap1和hap2。

我也想问一下曾老师,如何快速根据contig ID将hap1和hap2分开呢?我一直在思考。
我使用juicebox软件准备手动调整的hic互作热图,发现热图不是很好,我还没进行任何调整的热图,感觉是不是先把04.build/scaffolds.fa分开为两个单倍型,再进行手动调整比较好?我的输入文件是out_JBAT.hic和out_JBAT.assembly。
image

曾老师,谢谢您的指导!

@zengxiaofei
Copy link
Owner

12,画最终的hic互作热图
HapHiC/haphic plot out_JBAT.FINAL.agp contact_matrix.pkl --bin_size 1000 --min_len 1

If you have changed your AGP file (from scaffolds.raw.agp to out_JBAT.FINAL.agp), you need to input HiC_fm2.filtered.bam instead of contact_matrix.pkl at the first time. This is necessary because the contact matrix has also been changed.

如何快速根据contig ID将hap1和hap2分开呢?

$ grep "^group" scaffolds.agp.txt | awk '$5=="W"{print $1"\t"substr($6, 0, 2)"\t"$3-$2+1}' | awk '{sum[$1"\t"$2]+=$3}END{for(h in sum){print h"\t"sum[h]}}' | sort -k1.6,1n

Output:

group1	h2	227146140
group2	h1	221162681
group3	h1	214953513
group4	h2	207079995
group5	h1	175217546
...

我使用10,juicebox软件手动调整的hic互作热图,这个是还没分开为两个单倍型,还没进行任何调整的热图,感觉是不是先把04.build/scaffolds.fa分开为两个单倍型,再进行手动调整比较好?

I don't think there are many differences between these two strategies. It is important to ensure that both haplotypes are drawn within the same contact map, as this approach can help identify potential phasing errors between the haplotypes.

@jinhua2024
Copy link
Author

谢谢曾老师!非常感谢!我是刚刚自学的,会遇到一些问题。
下面是还没进行juicebox软件收动调整的热图,蓝色方框框出是我发现的同一个contig内会有一些没有互作区域,这些应该是组装错误?假如我将 --phasing_weight 参数设置为0或0.5,而不是默认值1,haphic是不是会打断这些contig,纠正这些contig呢?这样是不是说contig的序列是不是发生了改变了?

separate_plots_00
separate_plots.pdf

@zengxiaofei
Copy link
Owner

These empty signals are often caused by the multiple mapping of Hi-C reads to repetitive sequences or very similar regions between haplotypes. Hi-C reads that map to multiple locations are typically filtered out using a MAPQ>=1 criterion. For an understanding of multiple mapping, please refer to: https://samtools.github.io/hts-specs/SAMv1.pdf. This issue arises from intrinsic genomic characteristics and limitations of sequencing technologies and should not be addressed by altering contigs, such as breaking them or making any other modifications.

Stepping back, even if it is necessary to break contigs (to correct chimeric contigs), the --correct_nrounds parameter should be used rather than the --phasing_weight parameter (if you are unclear about the differences among various misassembled contigs, please refer to our paper and #79 (comment)). The --phasing_weight parameter only indicates the degree to which HapHiC trusts the phasing results from the upstream genome assembler (achieved by hard-modifying the contact matrix) and only affects the clustering stage without making any changes to the contig sequences themselves.

@jinhua2024
Copy link
Author

谢谢曾老师!十分感谢您!我已经使用了--correct_nrounds 2的参数,我将按照您的建议,不对这些空白的地方做出改变。
曾老师,我将最后的挂载基因组分离为单倍型基因组FINAL.hap1和FINAL.hap2,想针对两个单倍型基因组重新制作热图,
您在说明提供的这个命令适合吗?
/path/to/HapHiC/haphic build corrected_asm.fa asm.fa HiC.filtered.bam final_tours/group*.tour --corrected_ctgs corrected_ctgs.txt

下面这个命令输出了需要的contigID 清单?就是可以输出需要的 --corrected_ctgs corrected_ctgs.txt文件?
grep "^group" scaffolds.agp.txt | awk '$5=="W"{print $1"\t"substr($6, 0, 2)"\t"$3-$2+1}' | awk '{sum[$1"\t"$2]+=$3}END{for(h in sum){print h"\t"sum[h]}}' | sort -k1.6,1n

但是您提到If assembly correction has been performed, use corrected_asm.fa as input FASTA file instead of the first asm.fa. Additionally, specify the corrected contig list corrected_ctgs.txt using the --corrected_ctgs parameter. Otherwise, the YaHS-style scaffolds.raw.agp generated may be incorrect.
我不能使用这个scaffolds.raw.agp绘制热图了?

谢谢曾老师!

@zengxiaofei
Copy link
Owner

Your thoughts are all in a mess, so let's start from the beginning. You simply want to get two FASTA files and two contact maps corresponding to the two haplotypes, right?

(1) To check whether all the contigs in each group belong to the same haplotype:

$ grep "^group" scaffolds.agp.txt | awk '$5=="W"{print $1"\t"substr($6, 0, 2)"\t"$3-$2+1}' | awk '{sum[$1"\t"$2]+=$3}END{for(h in sum){print h"\t"sum[h]}}' | sort -k1.6,1n > stat.txt

The output should look like this:

$ cat stat.txt
group1	h2	227146140
group2	h1	221162681
group3	h1	214953513
group4	h2	207079995
group5	h1	175217546
...

If each group appears only once, this indicates that each group is assigned exclusively to either h1 or h2.

(2) Find haplotype-specific groups for each haplotype (e.g., for h1):

$ grep h1 stat.txt | cut -f 1 | xargs echo | sed 's/ /,/g'

Output:

group2,group3,group5,...

(3) Extract haplotype-specific sequences for each haplotype (e.g., for h1) using tools such as seqkit:

$ seqkit grep scaffolds.fa -p "group2,group3,group5,..." > hap1.fa

(4) Generate a contact map for each haplotype (e.g., for h1) using haphic plot:

$ haphic plot scaffolds.agp.txt HiC.filtered.bam --specified_scaffolds "group2,group3,group5,..."

@jinhua2024
Copy link
Author

jinhua2024 commented Jan 20, 2025

谢谢曾老师!感谢您的指导!
按照您的指导,我已经将scaffolds.fa跟据stat.tx分开两个单倍型基因组。但是我研究的物种是猪38条染色体。分开之后的单倍型基因组hap1中只有18个group,而hap2中却有20个group,这可能是异常的?

Image

Image

group1 h2 227146140
group2 h1 221162681
group3 h1 214953513
group4 h2 207079995
group5 h1 175217546
group6 h1 171146207
group7 h2 161151901
group8 h2 159963983
group9 h1 159275610
group10 h2 156662785
group11 h2 154595119
group12 h1 150422391
group13 h1 149796624
group14 h1 143814839
group15 h2 141602814
group16 h2 140543074
group17 h1 139236454
group18 h2 137004421
group19 h1 135283199
group20 h2 134407243
group21 h1 134401554
group22 h1 134248480
group23 h2 134197600
group24 h2 133128117
group25 h1 133102488
group26 h2 112538628
group27 h1 111495183
group28 h1 109867575
group29 h2 109162319
group30 h2 100198070
group31 h1 89968598
group32 h2 89161096
group33 h2 83481298
group34 h1 83454331
group35 h2 74424528
group36 h1 64431249
group37 h2 55641335
group38 h2 52829079

@zengxiaofei
Copy link
Owner

Haplotype 1:

  1. Split group3 into three chromosomes
  2. Combine groups 6 and 28 into a single chromosome

Haplotype 2:

  1. Combine groups 18, 30, and 38 into a single chromosome
  2. A very short chromosome can be reconstructed from groups 4, 10, and 26

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants