Sharding Cram files and inputting multiple datasets simultaneously #29

virajbdeshpande · 2012-10-06T02:23:22Z

Hello Vadim,

Sorry for the delay; I was distracted by other work. The patch I have added is a very small one.
The only externally visible change to CramTools is that when a read's name is not preserved, the name (a number) of the output read is now prefixed with the shard number.
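
To illustrate the naming change, here is a minimal sketch of how a prefixed read name could be built. The class `ShardReadNames`, the method `readName`, and the `.` separator are assumptions made for illustration only; they are not taken from the patch, and the actual format produced with --read-name-prefix may differ.

```java
/** Hypothetical helper for illustration; not part of CramTools. */
final class ShardReadNames {

    private ShardReadNames() {
    }

    /**
     * Builds the name of a restored read whose original name was not preserved.
     * The "." separator is an assumption; the patch may join the parts differently.
     *
     * @param prefix    shard-specific prefix (for example the shard number); may be null or empty
     * @param readIndex running number of the read within its shard
     */
    static String readName(String prefix, long readIndex) {
        if (prefix == null || prefix.isEmpty()) {
            // No prefix given: fall back to the plain number.
            return Long.toString(readIndex);
        }
        return prefix + "." + readIndex;
    }
}
```

With a distinct prefix per shard, read number 42 in shard 3 and read number 42 in shard 4 get different names, so the restored BAM files can be concatenated without name collisions.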

The modifications to Bam2Cram.java aim to shard the reads uniformly based on their position on the reference.
I chose to shard the file independently on each work unit (processor) so that a downstream application that wants to process the data in parallel does not have to make an RPC to retrieve the relevant shard, and so that copying shards to the disks of multiple nodes in parallel is easier.
Cram2Bam needs no special modification, because the same result can be achieved by simply concatenating the decompressed BAM files.
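
For readers who want a concrete picture of position-based sharding, the following is a rough sketch of one way a work unit could work out which part of the reference it owns. It is not the code from the patch: the class `ShardRanges`, the 0-based work-unit index, and the equal split of the concatenated reference length are all assumptions. The sequence lengths would come from the reference dictionary (ref.dict).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of CramTools. */
final class ShardRanges {

    /** A 0-based, half-open interval [start, end) on one reference sequence. */
    static final class Interval {
        final int sequenceIndex;
        final long start;
        final long end;

        Interval(int sequenceIndex, long start, long end) {
            this.sequenceIndex = sequenceIndex;
            this.start = start;
            this.end = end;
        }
    }

    private ShardRanges() {
    }

    /**
     * Splits the concatenated reference into numWorkUnits nearly equal slices
     * and returns the per-sequence intervals owned by the given work unit.
     *
     * @param sequenceLengths lengths of the reference sequences, in dictionary order
     * @param workUnit        0-based index of this work unit (an assumption)
     * @param numWorkUnits    total number of work units
     */
    static List<Interval> intervalsFor(long[] sequenceLengths, int workUnit, int numWorkUnits) {
        if (numWorkUnits <= 0 || workUnit < 0 || workUnit >= numWorkUnits) {
            throw new IllegalArgumentException("workUnit must be in [0, numWorkUnits)");
        }
        long total = 0;
        for (long length : sequenceLengths) {
            total += length;
        }

        // Slice boundaries on the concatenated reference. The product fits in a
        // long for realistic genome sizes and work-unit counts.
        long globalStart = total * workUnit / numWorkUnits;
        long globalEnd = total * (workUnit + 1) / numWorkUnits;

        List<Interval> result = new ArrayList<>();
        long offset = 0;
        for (int i = 0; i < sequenceLengths.length; i++) {
            long sequenceStart = offset;
            long sequenceEnd = offset + sequenceLengths[i];
            // Intersect this sequence with the slice owned by the work unit.
            long lo = Math.max(globalStart, sequenceStart);
            long hi = Math.min(globalEnd, sequenceEnd);
            if (lo < hi) {
                result.add(new Interval(i, lo - sequenceStart, hi - sequenceStart));
            }
            offset = sequenceEnd;
        }
        return result;
    }
}
```

Each work unit can evaluate this independently, with no coordination, and then keep only the reads whose alignment start falls inside its intervals. Because the slices do not overlap and together cover the whole reference, every mapped read lands in exactly one shard under this scheme; unmapped reads would need a separate rule.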

Let me know your comments on this, and whether you would like a different design. I am also looking forward to the release of the CRAM C/Java API.

Viraj

PS: I hope I have added the copyright notice correctly, though that should not matter much, as it will be reverted to EBI with the next revisions of the files.

PS: Comments from the commit:
cramtools command:
bam:
New input arguments:
--input-bamlist-file (optional alternative for --input-bam-file)
--output-cramlist-file
--read-annolist-file
--stats-outlist-file
--tram-outlist-file
--work-unit
--num-work-units

cram:
New input argument:
--read-name-prefix (should be unique for different shards)

New requirement:
A sequence dictionary for the reference is now required.
If the reference-fasta-file is ref.fa or ref.fasta, then ref.dict must be created.
Use the Picard tools library to create the ref.dict file:
java -jar picard_tools/CreateSequenceDictionary.jar R=ref.fa O=ref.dict


vadim added 30 commits March 30, 2011 16:07

Initial commit

6e27fdb

Unary codec

0f366f6

BitOutputStream refactored to allow different write method for byte, int and long

d410072

Added GammaCodec

4c751c4

Fixed the way long and int values are written to bit stream

Added Golomb codec + tests

7c4cbd2

Added Subexp codec

Finished and tested SubexpCodec

7046c2d

Added readLongBits method to BitInputStream and DefaultBitInputStream
Added range tests to read methods of DefaultBitInputStream

Added NumberCodecStub and implementations for all Long codecs

49968cd

Added NumberCodecFactory, which creates appropriate codec stub given an encoding

version 0.2

65a692e

contains unfinished staff, watch out...

Minor change to Bam2Cram

791fbf6

Minor fixes

00f7371

Added build and run notes to readme

8fea3fa

Added a pre-built jar

ba52574

0.25

ef9360f

0.25

9f86847

Excluded TestArithmeticCode from compilation

ae5c23c

Initial commit of version 0.3

14a964a

Removed import from snappy lib, which was used for some experiments.

60d5edb

Removed snappy deps

a7739c2

untracked junit

Fixed the bug when soft-clipped bases (insertions to read) were beyond reference sequence.

ad91e38

prebuilt runnable jar for fixed 'soft-clipped off the sequence' bug

56e039f

Bug fix: unclosed output stream in Bam2Cram with --gzip option lead to broken CRAM file

bd6b7f2

Added SAMUtils and ReadAnnotationReader

13e2b19

Added ReadAnnotation

6d76026

Added ReadAnnotationCodec

b7a2a88

Removed read annotations from Bam2Cram, it was a mistake to push unfinished code.

57c6242

Still fixing it...

55bbcdf

Prebuilt runnable jar with recent fixes.

5d14633

pre 0.5, expect fixes

c115b21

adding CramIndexer

67b35f5

Added descriptions to parameters in Cram2Bam, CramIndexer and Bam2Cram

e92ad1e

vadim and others added 30 commits February 8, 2012 16:15

initial v0.6

702d94f

Updated README for 0.6

c0d7914

0.65, pass through tags support

e24a369

v0.7

d212079

updated README for v0.7

af3c766

Updated version number.

93e9ce0

Fixes:

6c780e7

removed 'experiment' from build.xml
updated version in CramIndexer
rebuild the cramtools.jar

Added 5 missing files to git index.

bd3abb7

Switching to Picard 1.63

f0ad692

Added sorting routines to Utils (for range codec in future)

61efed4

a test commit, please ignore

60e1e30

Version 0.8 RC:

5e369bd

Picard 1.66
Cram2Bam performance
Huffman performance
Versioning and build number
Complete BAM header
Code cleanup and minor adjustments

cramtools.jar and updated README

7303a68

Fixed NPE if RecordCodecFactory.CodecStats

2997828

Fix for issue 25

03ef09b

Random access fix (issue 25)

9eba2b4

Fix for issue 26, duplicate fields in BAM header.

fe9dba0

Added SliceView

3e06504

Added SliceView, rebuilt jar

c63a4af

Added BAM stdout stream to SliceView

afe73e2

Fix for BAM streaming out in SliceView

f646a09

SliceView: fixed missing SQ

8855f4e

Bam2Cram: added NCBI binning scheme

Fixed: zero quality scores in ViewSam2 and SliceView (appeared only for reads with no read features)

6181ce2

Fixed: "No real operator (M|I|D|N) in CIGAR" in restored BAMs
Fixed: read groups missing from header

Removed some garbage in the output

44f3a9b

Quick fix for reads mapped beyond sequence.

c8698f4

Switched readLong() implementation in DBIS for faster.

pre 0.9 pack

97768ad

Added Apache2.0 license
read names preservation
removed beanutils and compress jars from dependencies
option for direct byte streaming
minor changes to the format -> incompatible with previous versions

Soft clips as insertions

fd2f61e

Tagged as v0.9

Multi cram

fac3f14

Updating copyright notice

4786bfa
