Extracting blocks of unaligned reads from CRAM #316
- Adds a configuration option to explore alternatives to our current bgzip/grabix preparation of fastqs, to see if we can improve speed.
- Adds an indexing option using rtg SDF files, which improves speed at the cost of a 3x larger disk footprint.
On Tue, Jan 19, 2016 at 08:51:38AM -0800, Sven-Eric Schelhorn wrote:
It's doable in an around-the-houses sort of way. If you build an index of the CRAM file then the .crai file contains the location of each slice within the file. I've done this manually to produce small CRAM files demonstrating the idea. The .crai file is just gzipped text, although I wouldn't guarantee the format never changing.

James
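Since the .crai index is just gzipped tab-separated text, it can be poked at with ordinary shell tools. A minimal sketch with a toy one-line index (the values mirror the example discussed later in this thread; the six-column reading of ref id, alignment start, span, container byte offset, slice offset within the container, and slice size is my assumption, as is the made-up slice offset of 280):

```shell
# Toy one-line .crai; real ones come from indexing an actual CRAM file.
printf '19\t18868\t136639\t7362\t280\t103935\n' | gzip -c > toy.crai

# Columns (assumed): ref id, start, span, container offset, slice offset, slice size.
gzip -dc toy.crai | awk -F'\t' '{print "container at byte " $4 ", slice of " $6 " bytes"}'
```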
James -- thanks so much. This sounds promising, although the bits about producing valid CRAM streams with dd are beyond my CRAM skill level, in terms of producing a proof-of-principle implementation for bcbio. I could do the .crai parsing bits and test it if I had a better interface or guidance on how to do the retrieval. Thanks again for the ideas.
I admit it's esoteric! But here's a worked example.
These are reference numbers (counting from 0, so 19 being chr20 of this sample). Staden io_lib has a cram_dump utility (hideously written, primarily a debugging aid for me) that reveals a bit more about the structure of this file.
This is showing the reference locations as well as the file offsets. The reference location "id 19 pos 18868 + 136639" is the same data you're seeing in the first line of the index file. The file location "pos 7362 size 103935" is in there too, although it's a bit different for the container size: the difference in file offset between the 1st and 2nd containers is 111322-7362 == 103960, not 103935. The discrepancy comes down to whether or not the size includes the container header structure itself, so the easy solution is just to subtract start locations. To stitch a valid CRAM file together, we need the CRAM header and BAM header (i.e. all the bytes before the first container - the first 7362 bytes in my example), a series of containers, and the footer - the EOF block, which for CRAM v3 is 38 bytes long.
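The cut-n-shunt described here can be sketched with plain head/dd/tail. A random stand-in file replaces a real CRAM below so the byte arithmetic actually runs; the offsets 7362, 103960 and 38 are the example values from this comment, not constants - a real run would take them from cram_dump or the .crai:

```shell
set -e
# Random stand-in for a real CRAM, big enough to cover the example offsets.
head -c 120000 /dev/urandom > in.cram

head -c 7362 in.cram > slice.cram                                    # CRAM magic + CRAM/BAM headers
dd if=in.cram bs=1 skip=7362 count=103960 2>/dev/null >> slice.cram  # first container
tail -c 38 in.cram >> slice.cram                                     # CRAM v3 EOF block

wc -c < slice.cram   # 7362 + 103960 + 38 = 111360
```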
Yes, it's hideous, but it demonstrates that in principle cut-n-shunt of CRAM can work. However, I'm not sure we'll get time to make a proper API for doing container splitting in CRAM in a graceful way soon. I'm not promoting this as an official way of parallelising CRAM processing: there's no check here for the CRAM version number or whether the EOF is valid, and even the .crai file format may change at some stage in the future. Hence the need for an abstracted API (one day). But meanwhile a home-brew hacky solution is possible.
@jkbonfield Can I resurrect this thread? The 100Gbp+ data coming off the PacBio Sequel II and ONT PromethION means that we are getting large unaligned files to begin the assembly process (BAM for PacBio at the moment, and maybe CRAM with https://github.com/EGA-archive/ont2cram for ONT). Many operations that were fine to parallelise per-run for the old Sequel I and MinION yields would benefit from being able to shard these now much larger files into chunks of N reads. The thread was originally about CRAM, but ideally this would work for BAM as well.
This was implemented, but not in htslib. See io_lib's cram_filter. It's a bit clunky as it doesn't have an obvious way to tell you how many containers exist (other than when you get zero reads back, at which point I guess you know you're at the end), but for example:
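The example appears to have been lost in this copy of the thread; as far as I can tell the io_lib tool in question is cram_filter, and my assumption from io_lib's documentation is that its -n option takes a 0-based slice range. The sketch below echoes the commands one would run to shard a file into four 100-slice chunks, rather than executing them, since cram_filter may not be installed here:

```shell
# Emit the cram_filter invocations that would shard in.cram into
# four 100-slice chunks; pipe to sh (or a job scheduler) to run them.
for i in 0 1 2 3; do
    a=$((i * 100)); b=$((a + 99))
    echo "cram_filter -n ${a}-${b} in.cram chunk${i}.cram"
done > cmds.txt
cat cmds.txt
```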
Doing it in BAM is harder. We can't index it if it isn't sorted, and there's no way to do random access on unsorted files. In theory we could use the .gzi index (from bgzip), but the BAM format has no way to indicate which blocks start on read boundaries and which do not. It's basically a bad format for this type of work.

Edit: I take that back - an unaligned file is "sorted" in as far as it'll permit bai/csi indexing, so maybe it is possible. I know people (not us) wrote tools to do chunking of BAM based on genomic coordinates and the indices, but I don't think anyone managed it for arbitrary N-read slices. Technically possible, but a mess.
Re BAM, this would probably be a use case for the proposed splitting index (see samtools/hts-specs#321), which hasn't been finalised (or implemented in HTSlib / samtools) yet.
I was wondering if htslib/samtools support extraction of blocks of unaligned reads similar to what a bgzipped FASTQ file that has been indexed with grabix currently allows.
This is relevant for processing a single FASTQ across many compute nodes by allowing each process (such as a read mapper) to extract windows of reads (such as "the third block of 1M reads") without having to seek through the whole FASTQ file first, thanks to the index. For us, the ability to do so is highly relevant for replacing the aged FASTQ format with unaligned CRAM for WGS read mapping.
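The grabix window extraction mentioned above reduces to line arithmetic (a FASTQ record is four lines, and my reading of grabix's README is that `grabix grab` takes a 1-based line range); the command is echoed rather than run, since grabix may not be available here:

```shell
# "The third block of 1M reads" expressed as a grabix line range.
block=3
reads_per_block=1000000
start=$(( (block - 1) * reads_per_block * 4 + 1 ))
end=$(( block * reads_per_block * 4 ))
echo "grabix grab reads.fastq.gz $start $end"
```

For block 3 this prints the line range 8000001 to 12000000.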
Note that this is different from extracting reads that align to a window of the reference genome; I would rather expect to be able to extract the unaligned reads in the order of the FASTQ file that was used for generating the unaligned CRAM. A nice addition would be an option for filtering by mate ID so that one could specify which end (or both) should be returned. Might this be possible while also limiting IO by using the blocks/slices of the CRAM format?