Out-of-Range Error #29
Comments
Hello, I have identified the original issue, but I now get a new error message: terminate called after throwing an instance of 'std::runtime_error' what(): ERROR: no alts found by Shasta node name convention, try using homology instead? After checking, I found that I was reading the r.utg.gfa file from Hifiasm, while GFAse seems to be designed for Shasta. Although both follow the GFA 1.0 format, their GFA files differ significantly. How can I run GFAse on the GFA file produced by Hifiasm?
Hi, in this case I would recommend using --use_homology and --skip_unzip. --use_homology is necessary because GFAse needs to infer where the bubbles are in the graph from scratch. I also recommend --skip_unzip because, unfortunately, Hifiasm has an unresolved issue with its GFA being incorrect, which makes chaining and unzipping impossible. As far as I know this correction was never added to Hifiasm, despite what the issue says. With these arguments, you will get a phase assignment for all the contigs, but they won’t be chained together into longer scaffolds.
Hi rlorigro,
Thank you for your detailed and insightful response. As you mentioned in your paper, I also noticed the issue with overly long overlaps in tools like Hifiasm and Verkko, which clearly justifies the --skip_unzip parameter. However, I am still confused about why --use_homology is needed. According to your paper, --use_homology is designed for cases where "Edges (L lines) are not required." The r.utg.gfa file I provided was generated by Hifiasm, and it should contain the proximity ligation data you referenced. The CSV file I submitted also meets the format requirements. Yet, I still encounter the following error:
After reviewing your code, it appears that this issue might stem from the lack of .0 or .1 information in the node names, but I hit the same error using Shasta’s test_data. This is the data content of the CSV file I provided:
Upon reviewing the various GFA files generated by Shasta, it seems that none of them meet the requirement for 'alts' mentioned in the GFAse error message. Could you advise me on how I should modify my input files, or what approach I should take, to generate a result that can proceed to scaffolding? Thank you very much for your guidance.
First, I should clarify that this is not the only use for the --use_homology option. It was actually designed for Verkko graphs, which do not have well-behaved bubble topology and therefore need a looser definition, which comes from the homology method. Shasta Mode 2 is the only type of assembly currently supported for node name parsing, and homology was used for every Verkko and Hifiasm result in our publication. If you provide a GFA, GFAse will still use the edges of the GFA as constraints on how the bubbles are constructed, but it will also be able to rescue "island" nodes, which are common in Verkko.
To explain in more detail: constructing our phasing model requires that the units are already defined. Each unit is "flipped" between orientation 0 and 1 to decide the membership of its constituent nodes in the 2 partitions. In this model the unit is a bubble, which means that GFAse must first identify partially phased units (bubbles) before optimization can begin. Since Shasta Mode 2 assemblies conveniently provide this information in the names of the nodes (S lines of the GFA), we can trust that information and pass it directly to the model building step. However, this form of node name parsing is not implemented for any other assembler. If you are suggesting that it is possible to obtain bubble information from Hifiasm node names, then it would be very easy to add that support to the repository.
In practice, a "bubble" may be any bipartite subgraph, defined by homology search. It may have an unequal number of nodes on each side. This matters when contiguity breaks are one-sided in the assembly (as with Verkko). The homology-based bubble finding step generally works well, and I recommend trying it. One case where homology may fall short of your expectations is when the assembly is highly fragmented and the node sequences cannot easily be compared to one another.
Only Shasta Mode 2 assemblies have well-defined bubble topology and node names. For Shasta Mode 3 we have dropped support for node names and use homology instead. The results of Mode 3 development and its use in combination with GFAse were presented at recent conferences and can be found here (in the Shasta README).
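The unit-flipping idea described above can be illustrated with a toy sketch. This is not GFAse's implementation: the function names, the scoring rule (contacts within a partition count for, contacts across partitions count against), and the greedy random-flip search are all assumptions chosen to keep the example short. Each bubble is a pair of node lists, and the two sides may have unequal sizes.

```python
import random

def phase_score(bubbles, orientation, contacts):
    """Score a phasing: +weight for contacts inside one partition,
    -weight for contacts that cross partitions (assumed scoring rule)."""
    side = {}
    for i, (left, right) in enumerate(bubbles):
        # Orientation 0 puts `left` in partition 0; orientation 1 swaps the sides.
        a, b = (0, 1) if orientation[i] == 0 else (1, 0)
        for node in left:
            side[node] = a
        for node in right:
            side[node] = b
    score = 0
    for (u, v), weight in contacts.items():
        if u in side and v in side:
            score += weight if side[u] == side[v] else -weight
    return score

def optimize(bubbles, contacts, n_iterations=1000, seed=0):
    """Greedy random-flip search: flip one bubble, keep the flip if the
    score does not get worse, otherwise undo it."""
    rng = random.Random(seed)
    orientation = [0] * len(bubbles)
    best = phase_score(bubbles, orientation, contacts)
    for _ in range(n_iterations):
        i = rng.randrange(len(bubbles))
        orientation[i] ^= 1
        score = phase_score(bubbles, orientation, contacts)
        if score >= best:
            best = score
        else:
            orientation[i] ^= 1  # revert the flip
    return orientation, best

# Hypothetical toy data: two bubbles, the second with unequal sides.
bubbles = [(["a1"], ["a2"]), (["b1", "b1x"], ["b2"])]
contacts = {("a1", "b2"): 5, ("a2", "b1"): 4, ("a1", "b1"): 1, ("a2", "b1x"): 3}
orientation, best = optimize(bubbles, contacts)
print(orientation, best)
```

The point of the sketch is only that the bubbles must be known before any flipping can happen, which is why bubble discovery (by node names or by homology) has to come first.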
I should also add that you likely want to use the p_utg GFA from Hifiasm, if I remember correctly, which is emitted after some rounds of bubble popping and pruning. Otherwise your graph will likely be very fragmented.
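One rough way to compare how fragmented two GFA files are is to count segments, links, and unlinked segments, and to compute a segment N50. The sketch below assumes standard GFA 1 S and L lines (tab-separated, with an optional LN:i: tag when the sequence is given as *); the function name and the returned keys are made up for illustration.

```python
def gfa_summary(lines):
    """Summarize a GFA 1 graph from an iterable of text lines."""
    lengths = {}
    linked = set()
    n_links = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            name, seq = fields[1], fields[2]
            length = len(seq)
            if seq == "*":
                # Sequence omitted: fall back to the LN:i: tag if present.
                length = 0
                for tag in fields[3:]:
                    if tag.startswith("LN:i:"):
                        length = int(tag[5:])
            lengths[name] = length
        elif fields[0] == "L" and len(fields) >= 5:
            n_links += 1
            linked.add(fields[1])
            linked.add(fields[3])
    # N50: length of the segment at which the running total of
    # length-sorted segments reaches half of the total length.
    total = sum(lengths.values())
    running, n50 = 0, 0
    for length in sorted(lengths.values(), reverse=True):
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "segments": len(lengths),
        "links": n_links,
        "isolated": sum(1 for name in lengths if name not in linked),
        "n50": n50,
    }

# Tiny in-memory example; for a real file, pass an open file handle instead.
example = [
    "H\tVN:Z:1.0",
    "S\ta\tACGTACGT",
    "S\tb\t*\tLN:i:4",
    "S\tc\tAC",
    "L\ta\t+\tb\t-\t0M",
]
print(gfa_summary(example))
```

Running this on both the r_utg and p_utg graphs from the same assembly (the exact file names depend on the Hifiasm output prefix) should show whether the p_utg graph has fewer, longer segments.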
Hello,
Thank you for providing this software! However, I encountered the following error when using phase_contacts_with_monte_carlo:
[0h 0m 0s] Loading GFA...
No index found, generating .gfai for "chr1.asm.bp.r_utg.gfa" ... done
Creating nodes...
Creating edges...
[0h 0m 12s] Writing IDs to file...
[0h 0m 12s] Loading alignments as contact map...
terminate called after throwing an instance of 'std::out_of_range'
what(): _Map_base::at
The GFA file I provided was generated by Hifiasm in default mode, resulting in an r.utg.gfa file. The data is from the diploid human chromosome 1, and the format of my CSV file is as follows:
utg000025l utg000015l 2
utg000034l utg000014l 2
utg000047l utg000029l 2
utg000047l utg000034l 3
Do you know what might be causing this error? I suspect it's an out-of-range issue, but after checking, I'm confident that my utg_id values match the sequence IDs in the GFA file perfectly with no discrepancies.
I would appreciate any help in resolving this issue. Thank you so much!
Best regards,
Yichen
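For anyone hitting the same message: with libstdc++, `_Map_base::at` is the text of the std::out_of_range thrown by std::unordered_map::at when a key is missing. Whether the missing key here is a contig name is an assumption, but it is cheap to rule out. The sketch below collects segment names from the S lines of a GFA and reports every name in the first two columns of the contact file that is not among them. The helper names are hypothetical, and the contact file is assumed to be comma- or whitespace-separated with the two IDs first, as in the excerpt above.

```python
import re

def gfa_segment_names(gfa_lines):
    """Collect segment names from the S lines of a GFA 1 file."""
    names = set()
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 2:
            names.add(fields[1])
    return names

def missing_contact_ids(contact_lines, segment_names):
    """Map each unknown ID to the first (1-based) line where it appears."""
    missing = {}
    for line_number, line in enumerate(contact_lines, start=1):
        # Accept commas or whitespace as separators (assumed format).
        fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
        if len(fields) < 2:
            continue  # blank or malformed line
        for name in fields[:2]:
            if name not in segment_names:
                missing.setdefault(name, line_number)
    return missing

# In-memory example; replace the lists with open file handles for real data.
gfa = ["S\tutg000025l\tACGT", "S\tutg000015l\tACGT", "L\tutg000025l\t+\tutg000015l\t-\t0M"]
contacts = ["utg000025l utg000015l 2", "utg000047l,utg000015l,3"]
print(missing_contact_ids(contacts, gfa_segment_names(gfa)))
```

A header row or a stray separator in the contact file would show up in the output as an unknown ID, which is itself a useful hint.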