Sharding Cram files and inputting multiple datasets simultaneously #29

Open
wants to merge 70 commits into master
Conversation

virajbdeshpande

Hello Vadim,

Sorry for the delay, I was distracted by some other stuff. The patch I have added is a very small one.
The only external change to CramTools is that if the name of a read is not preserved, then the name (number) of the output read is now prefixed with the shard number.
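The renaming behaviour described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual CramTools code: the class and method names are made up. When original read names are not preserved, restored reads get sequential numbers, and prefixing with a per-shard token keeps names unique after shards are merged.

```java
// Hypothetical sketch of per-shard read naming. When read names are
// not preserved, each restored read is given a sequential number;
// a unique per-shard prefix keeps the combined output collision-free.
public class ReadNamer {
    private final String prefix;
    private long counter = 0;

    public ReadNamer(String readNamePrefix) {
        this.prefix = readNamePrefix;
    }

    /** Generates the next read name, e.g. "shard2_0", "shard2_1", ... */
    public String next() {
        return prefix + "_" + (counter++);
    }
}
```

With this scheme, two shards using prefixes "shard1" and "shard2" can never emit the same name, regardless of how many reads each contains.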

The modifications to Bam2Cram.java are aimed at uniformly sharding the reads based on their position on the reference.
The reason I chose to shard the file independently on each work unit (processor) is that when a downstream application wants to process this data in parallel, it does not have to do an RPC to retrieve the relevant shard, and parallel disk copy to multiple nodes is easier.
There is no special modification for Cram2Bam, because the same result can be achieved by simply concatenating the decompressed BAM files.
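One simple way to shard uniformly by reference position is to give each work unit a contiguous slice of the reference. The following is a minimal sketch under that assumption; the class name and arithmetic are illustrative, not taken from the patch itself.

```java
// Hypothetical sketch: map a read's alignment position to one of N
// shards so that each work unit owns a contiguous slice of the
// reference. Illustrative only -- not the actual Bam2Cram.java code.
public class ShardAssigner {
    private final int numWorkUnits;
    private final long referenceLength;

    public ShardAssigner(int numWorkUnits, long referenceLength) {
        this.numWorkUnits = numWorkUnits;
        this.referenceLength = referenceLength;
    }

    /** Returns the 0-based shard index for a 0-based alignment position. */
    public int shardFor(long alignmentPosition) {
        long shard = alignmentPosition * numWorkUnits / referenceLength;
        // Clamp positions at the very end of the reference into the last shard.
        return (int) Math.min(shard, numWorkUnits - 1);
    }
}
```

Because shard boundaries depend only on the reference length and the shard count, each work unit can compute its own slice independently, with no coordination between processors.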

Let me know your comments on this and whether you would like a different design. Also looking forward to the release of the Cram C/Java API.

Viraj

PS: I hope I have added the copyright notice in the right fashion, though that should not be important, as it will be reverted back to EBI in the next revisions of the files.

PS: Comments from the commit:
cramtools command:
bam:
New input arguments:
--input-bamlist-file (optional alternative for --input-bam-file)
--output-cramlist-file
--read-annolist-file
--stats-outlist-file
--tram-outlist-file
--work-unit
--num-work-units

cram:
New input argument:
--read-name-prefix (should be unique for different shards)

New requirement:
A sequence dictionary for the reference is now required.
If ref.fa or ref.fasta is the reference-fasta-file, then a matching ref.dict should be created.
Use the Picard tools library to create the ref.dict file:
java -jar picard_tools/CreateSequenceDictionary.jar R=ref.fa O=ref.dict

vadim added 30 commits March 30, 2011 16:07
Fixed the way long and int values are written to bit stream
Added Subexp codec
Added readLongBits method to BitInputStream and DefaultBitInputStream
Added range tests to read methods of DefaultBitInputStream
Added NumberCodecFactory, which creates appropriate codec stub given
an encoding
contains unfinished stuff, watch out...
untracked junit
vadim and others added 30 commits February 8, 2012 16:15
removed 'experiment' from build.xml 
updated version in CramIndexer
rebuild the cramtools.jar
Picard 1.66
Cram2Bam performance
Huffman performance
Versioning and build number
Complete BAM header
Code cleanup and minor adjustments
Bam2Cram: added NCBI binning scheme
reads with no read features)
Fixed: "No real operator (M|I|D|N) in CIGAR" in restored BAMs
Fixed: read groups missing from header
Switched readLong() implementation in DBIS for faster.
Added Apache2.0 license
read names preservation
removed beanutils and compress jars from dependencies
option for direct byte streaming
minor changes to the format -> incompatible with previous versions
Added support for sharding each BAM file into multiple CRAM files in a distributed fashion.

(Commit message body repeats the option list and ref.dict requirement quoted in the comment above.)