15.0.68
Divon Lan committed Oct 13, 2024
1 parent 30a6d33 commit e055f22
Showing 97 changed files with 2,512 additions and 1,471 deletions.
9 changes: 7 additions & 2 deletions LICENSE.txt
@@ -19,6 +19,11 @@ outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean Legal Entity (possibly an individual) exercising permissions granted
by this License.

+"Recognized Academic Research Institution" shall mean a Legal Entity that contributes to the
+scientific record by regularly publishing papers in scientific journals AND that grants academic
+degrees which are recognized as such by the competent authority in the country in which said Legal
+Entity is organized.
+
"Derivative Works" shall mean any work that is based on (or derived from) Genozip and for which the
editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an
original work of authorship. For the purposes of this License, Derivative Works shall not include
@@ -46,7 +51,7 @@ purpose attached to that particular License Type, and subject to the terms and c
License agreement:

a. Academic License: Using Genozip Executables for academic research, educational or training
-purposes provided that You are a recognized academic research institution which is not a hospital,
+purposes provided that You are a Recognized Academic Research Institution which is not a hospital,
or a registered student at such an institution, but excluding use with Your Commercial Data, and
limited to a total of 10,000 files per institution.

@@ -159,5 +164,5 @@ ABOVE STATED REMEDY FAILS OF ITS ESSENTIAL PURPOSE.

END OF TERMS AND CONDITIONS

-Genozip license version: 15.0.67
+Genozip license version: 15.0.68

7 changes: 7 additions & 0 deletions RELEASE_NOTES.txt
@@ -3,6 +3,13 @@ Note on versioning:
- Minor version changes with bug fixes and minor feature updates
- Some minor versions are skipped due to failed deployment pipelines

+15.0.68 13/10/2024
+- Deep: reduction in memory in --test and genounzip of Deep files: typically 10-20% less RAM consumption
+- Deep: new option --deep=no-qual to Deep only seq and qname (not qual): it consumes drastically less RAM, and generates a file whose size is between that of compressing the FASTQ and BAM separately and that of full Deep.
+- License: clarify the meaning of "Recognized Academic Research Institution"
+- BAM: further reduction in RAM consumption when compressing and uncompressing files with many secondary or supplementary alignments.
+- New diagnostic: --show-huffman
+
15.0.67 23/9/2024
- Improvements in Deep.

5 changes: 3 additions & 2 deletions installers/LICENSE.html
@@ -5,13 +5,14 @@
"License" shall mean the terms and conditions for use as defined by Sections 1 through 11 of this document.<br><br>
"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.<br><br>
"You" (or "Your") shall mean Legal Entity (possibly an individual) exercising permissions granted by this License.<br><br>
+"Recognized Academic Research Institution" shall mean a Legal Entity that contributes to the scientific record by regularly publishing papers in scientific journals AND that grants academic degrees which are recognized as such by the competent authority in the country in which said Legal Entity is organized.<br><br>
"Derivative Works" shall mean any work that is based on (or derived from) Genozip and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from Genozip and Derivative Works thereof.<br><br>
"Your Commercial Data" shall mean data which You (the Legal Entity exercising permissions granted by this License) obtained with intention of using it in the development process of a product and/or for provisioning of any kind of service (including also clinical, diagnostic, DNA or RNA sequencing, bioinformatics and cloud services, but excluding education services) for which You get paid. Data derived from Your Commercial Data is also Your Commercial Data.<br><br>
"Your Computers" shall mean computers You own and/or cloud accounts You own at 3rd party cloud providers.<br><br>
"Genozip Executables" shall mean the executable files genozip, genounzip, genocat and genols (with or without an .exe file name suffix).<br><br>
Other words and terms in this License shall be interpreted as their usual meaning in the context of a software product.<br><br>
2. Grant of copyright license. Licensor hereby grants to You a limited non-exclusive, non-transferrable, non-sublicensable, revokable copyright license to use Genozip on Your Computers, if you meet the conditions attached to any of the License Types a through f below, for the limited purpose attached to that particular License Type, and subject to the terms and conditions of this License agreement:<br><br>
-a. Academic License: Using Genozip Executables for academic research, educational or training purposes provided that You are a recognized academic research institution which is not a hospital, or a registered student at such an institution, but excluding use with Your Commercial Data, and limited to a total of 10,000 files per institution.<br><br>
+a. Academic License: Using Genozip Executables for academic research, educational or training purposes provided that You are a Recognized Academic Research Institution which is not a hospital, or a registered student at such an institution, but excluding use with Your Commercial Data, and limited to a total of 10,000 files per institution.<br><br>
b. Academic License: Using Genozip Executables for another non-commercial purpose, if it has been pre-approved by Licensor in writing. Email [email protected] to seek such an approval.<br><br>
c. Standard, Enterprise or Premium License: Using Genozip Executables for any legal purpose, if the license was purchased and paid for, and for the duration that it is in effect. In addition, for Premium License only: Distributing Genozip Executables to others.<br><br>
d. Decompression License: Using a subset of Genozip Executables consisting of genounzip, genocat, genols for any legal purpose. A Decompression License is free of charge.<br><br>
@@ -34,4 +35,4 @@
10. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides Genozip on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Genozip and assume any risks associated with Your exercise of permissions under this License.<br><br>
11. LIMITATION OF LIABILITY. TO THE FULLEST EXTENT PERMITTED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER IN TORT (INCLUDING NEGLIGENCE), CONTRACT, STRICT LIABILITY OR OTHER LEGAL OR EQUITABLE THEORY, SHALL LICENSOR OR DEVELOPER BE LIABLE FOR DAMAGES, INCLUDING ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES OF ANY CHARACTER ARISING AS A RESULT OF THIS LICENSE OR OUT OF THE USE OR INABILITY TO USE GENOZIP (INCLUDING BUT NOT LIMITED TO DAMAGES FOR LOSS OF GOODWILL, WORK STOPPAGE, COMPUTER FAILURE OR MALFUNCTION, FILE CORRUPTION, DATA LOSS, OR ANY AND ALL OTHER COMMERCIAL DAMAGES OR LOSSES), EVEN IF LICENSOR OR DEVELOPER HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. IN NO EVENT WILL LICENSOR'S OR DEVELOPER'S TOTAL LIABILITY TO LICENSEE FOR ALL DAMAGES (OTHER THAN AS MAY BE REQUIRED BY APPLICABLE LAW IN CASES INVOLVING PERSONAL INJURY) EXCEED THE AMOUNT OF $500 USD. THE FOREGOING LIMITATIONS WILL APPLY EVEN IF THE ABOVE STATED REMEDY FAILS OF ITS ESSENTIAL PURPOSE.<br><br>
END OF TERMS AND CONDITIONS<br><br>
-Genozip license version: 15.0.67<br><br>
+Genozip license version: 15.0.68<br><br>
Binary file modified installers/genozip-installer.exe
Binary file not shown.
Binary file modified installers/genozip-linux-x86_64.tar
Binary file not shown.
Binary file modified installers/genozip-osx-arm.tar
Binary file not shown.
Binary file modified installers/genozip-osx-x86.tar
Binary file not shown.
15 changes: 8 additions & 7 deletions src/aligner.c
@@ -341,9 +341,10 @@ MappingType aligner_seg_seq (VBlockP vb, STRp(seq), bool is_pair_2, PosType64 pa

// PIZ: SEQ reconstruction - only for reads compressed with the aligner
void aligner_reconstruct_seq (VBlockP vb, uint32_t seq_len, bool is_pair_2, bool is_perfect_alignment, ReconType reconstruct,
-char *first_mismatch_base, // optional out: caller should initialize to 0
-uint32_t *first_mismatch_offset, // optional out
-uint32_t *num_mismatches) // optional out: caller should initialize to 0
+int max_deep_mismatches, // length of mismatch_base and mismatch_offset arrays
+char *mismatch_base, // optional out
+uint32_t *mismatch_offset, // optional out
+uint32_t *num_mismatches) // optional out
{
START_TIMER;
declare_seq_contexts;
@@ -428,10 +429,10 @@ void aligner_reconstruct_seq (VBlockP vb, uint32_t seq_len, bool is_pair_2, bool
: nonref_ctx;
RECONSTRUCT_NEXT (ctx, 1);

-if (first_mismatch_base) {
-if (! *first_mismatch_base) {
-*first_mismatch_base = is_forward ? *BLSTtxt : COMPLEM[(int)*BLSTtxt];
-*first_mismatch_offset = is_forward ? i : seq_len - i -1;
+if (num_mismatches) {
+if (*num_mismatches <= max_deep_mismatches) {
+mismatch_base[*num_mismatches] = *BLSTtxt;
+mismatch_offset[*num_mismatches] = i;
}
(*num_mismatches)++;
}
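
The hunk above switches from remembering only the first mismatch to recording several mismatches in caller-provided arrays while still counting all of them. A minimal standalone sketch of that pattern follows. The function name `record_mismatches`, its parameters, and the plain string comparison are illustrative assumptions, not Genozip code, and the sketch assumes the output arrays hold exactly `max_recorded` entries, hence the strict `<` bound.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper (not part of Genozip): compares a read against a
// reference of the same length, records up to max_recorded mismatches in the
// caller's arrays, and returns the total number of mismatches found.
static uint32_t record_mismatches (const char *read, const char *ref, uint32_t len,
                                   uint32_t max_recorded,     // capacity of both output arrays
                                   char *mismatch_base,       // out: read base at each recorded mismatch
                                   uint32_t *mismatch_offset) // out: offset of each recorded mismatch
{
    assert (read && ref);
    uint32_t num_mismatches = 0;

    for (uint32_t i = 0; i < len; i++) {
        if (read[i] == ref[i]) continue;

        // strict '<' keeps every write inside arrays of capacity max_recorded
        if (num_mismatches < max_recorded) {
            mismatch_base[num_mismatches]   = read[i];
            mismatch_offset[num_mismatches] = i;
        }

        num_mismatches++; // counted even when there is no room left to record it
    }

    return num_mismatches;
}
```

A caller can compare the returned count with `max_recorded` to tell whether the arrays describe every mismatch or only the first few.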
2 changes: 1 addition & 1 deletion src/aligner.h
@@ -25,4 +25,4 @@
typedef enum { MAPPING_NO_MAPPING, MAPPING_ALIGNED, MAPPING_PERFECT } MappingType;

extern MappingType aligner_seg_seq (VBlockP vb, STRp(seq), bool is_pair_2, PosType64 pair_gpos, bool pair_is_forward);
-extern void aligner_reconstruct_seq (VBlockP vb, uint32_t seq_len, bool is_pair_2, bool is_perfect_alignment, ReconType reconstruct, char *first_mismatch_base, uint32_t *first_mismatch_offset, uint32_t *num_mismatches);
+extern void aligner_reconstruct_seq (VBlockP vb, uint32_t seq_len, bool is_pair_2, bool is_perfect_alignment, ReconType reconstruct, int mismatches_len, char *mismatch_base, uint32_t *mismatch_offset, uint32_t *num_mismatches);
2 changes: 1 addition & 1 deletion src/arch.c
@@ -142,7 +142,7 @@ void arch_initialize (rom my_argv0)
ASSERT0 (sizeof (LocalType) == 1, "expecting sizeof (LocalType)==1");
ASSERT0 (sizeof (uint128_t) == 16, "expecting sizeof (uint128_t)==16");
ASSERT0 (sizeof (ReconPlanItem) == 12, "expecting sizeof (ReconPlanItem)==12");
-ASSERT0 (sizeof (void *) <= 8, "expecting sizeof (void *)<=8"); // important bc void* is a member of ValueType, and also counting on it in huffman_uncompress
+ASSERT0 (sizeof (void *) <= 8, "expecting sizeof (void *)<=8"); // important bc void* is a member of ValueType, and also counting on it in huffman_uncompress, str_pack_bases, bits_init_do
ASSERT0 (sizeof (ValueType) == 8, "expecting sizeof (ValueType)==8");

// Note: __builtin_clzl is inconsistent between Windows and Linux, even on the same host, so we don't use it
46 changes: 34 additions & 12 deletions src/bam_seg.c
@@ -8,6 +8,7 @@

#include "sam_private.h"
#include "txtfile.h"
+#include "mgzip.h"
#include "libdeflate_1.19/libdeflate.h"

void bam_seg_initialize (VBlockP vb)
@@ -18,7 +19,7 @@ void bam_seg_initialize (VBlockP vb)
buf_alloc (vb, &vb->txt_data, 1, 0, char, 0, 0); // add 1 character after the end of txt_data
*BAFTtxt = '*'; // missing qual;

-if (!segconf.running && line_textual_cigars_used)
+if (line_textual_cigars_used)
buf_alloc (vb, &VB_SAM->line_textual_cigars, 0, segconf.sam_cigar_len * vb->lines.len32 / (segconf.is_long_reads ? 4 : 1),/*divide in case sam_cigar_len is not representative*/
char, CTX_GROWTH, "line_textual_cigars");
}
@@ -53,7 +54,7 @@ static int32_t bam_unconsumed_scan_forwards (VBlockP vb)

uint32_t aln_size=0, i;
for (i=0 ; i < txt_len-3; i += aln_size)
-aln_size = GET_UINT32_((BAMAlignmentFixed *)&txt[i], block_size) + 4;
+aln_size = ((BAMAlignmentFixedP)&txt[i])->block_size + 4;

if (aln_size > txt_len)
return -1; // this VB doesn't even contain a single full alignment
@@ -65,14 +66,15 @@ static int32_t bam_unconsumed_scan_forwards (VBlockP vb)
return aln_size - (i - txt_len); // we pass the data of the final, partial, alignment to the next VB
}
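
The forward scan above walks length-prefixed records from the start of the buffer and reports how many trailing bytes belong to a final, partial record. A minimal standalone sketch of that idea follows, assuming a simplified record that is a 4-byte little-endian size followed by that many bytes. `get_le32` and `unconsumed_forwards` are hypothetical names, not Genozip functions.

```c
#include <assert.h>
#include <stdint.h>

// Read a little-endian uint32 byte by byte, so neither alignment nor host
// endianness matters.
static uint32_t get_le32 (const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

// Hypothetical sketch: returns the number of trailing bytes that are not part
// of a complete record (0 if the data ends exactly on a record boundary), or
// -1 if the data does not contain even one complete record.
static int64_t unconsumed_forwards (const uint8_t *data, uint64_t len)
{
    uint64_t i = 0;

    while (len - i >= 4) { // enough bytes left to read a size field
        uint64_t rec_size = (uint64_t)get_le32 (&data[i]) + 4; // payload + the size field itself
        if (rec_size > len - i) break; // final record is partial
        i += rec_size;
    }

    if (i == 0) return -1; // caller needs to supply more data
    return (int64_t)(len - i);
}
```

The arithmetic is done in 64 bits so that a corrupt size close to `UINT32_MAX` cannot wrap around when 4 is added to it.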

-static int32_t bam_unconsumed_scan_backwards (VBlockP vb, uint32_t first_i)
+static int32_t bam_unconsumed_scan_backwards (rom bam_data, uint64_t bam_data_len,
+const BAMAlignmentFixed **last_aln) // optional out
{
-int32_t last_i = Ltxt - sizeof(BAMAlignmentFixed);
+int32_t last_i = bam_data_len - sizeof(BAMAlignmentFixed);

// find the first alignment in the data (going backwards) that is entirely in the data -
// we identify an alignment by l_read_name and read_name
-for (; last_i >= (int32_t)first_i; (last_i)--) {
-const BAMAlignmentFixed *aln = (const BAMAlignmentFixed *)Btxt (last_i);
+for (; last_i >= 0; last_i--) {
+const BAMAlignmentFixed *aln = (const BAMAlignmentFixed *)&bam_data[last_i];

uint32_t block_size = LTEN32 (aln->block_size);
if (block_size > 100000000) continue; // quick short-circuit - more than 100M for one alignment - clearly wrong
@@ -81,13 +83,13 @@ static int32_t bam_unconsumed_scan_backwards (VBlockP vb, uint32_t first_i)
uint16_t n_cigar_op = LTEN16 (aln->n_cigar_op);

// test to see block_size makes sense
-if ((uint64_t)last_i + (uint64_t)block_size + 4 > (uint64_t)vb->txt_data.len || // 64 bit arith to catch block_size=-1 that will overflow in 32b
+if ((uint64_t)last_i + (uint64_t)block_size + 4 > bam_data_len || // 64 bit arith to catch block_size=-1 that will overflow in 32b
block_size + 4 < sizeof (BAMAlignmentFixed) + 4*n_cigar_op + aln->l_read_name + l_seq + (l_seq+1)/2)
continue;

// test to see l_read_name makes sense
if (LTEN32 (aln->l_read_name) < 2 ||
-&aln->read_name[aln->l_read_name] > BAFTtxt) continue;
+&aln->read_name[aln->l_read_name] > bam_data + bam_data_len) continue;

// test pos
int32_t pos = LTEN32 (aln->pos);
@@ -122,9 +124,11 @@ static int32_t bam_unconsumed_scan_backwards (VBlockP vb, uint32_t first_i)
// agree with our formula. see comment in bam_reg2bin

// all tests passed - this is indeed an alignment
-return Ltxt - (last_i + LTEN32 (aln->block_size) + 4); // everything after this alignment is "unconsumed"
+if (last_aln) *last_aln = aln;
+return bam_data_len - (last_i + LTEN32 (aln->block_size) + 4); // everything after this alignment is "unconsumed"
}

+if (last_aln) *last_aln = NULL;
return -1; // we can't find any alignment - need more data (lower first_i)
}
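
The backward scan above tries each byte offset from the end of the data and accepts the first offset whose fields pass a series of plausibility tests. A minimal standalone sketch of that technique follows. It uses a deliberately simplified, hypothetical record layout (4-byte little-endian `block_size`, 1-byte `l_name`, a NUL-terminated printable name, then payload) rather than the real BAM alignment structure, and the names `plausible_record` and `unconsumed_backwards` are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HDR_LEN 5 // hypothetical header: 4-byte LE block_size + 1-byte l_name

static uint32_t get_le32 (const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

// Does a record that would start at offset i look valid and fit entirely in the data?
static int plausible_record (const uint8_t *data, uint64_t len, uint64_t i)
{
    uint64_t block_size = get_le32 (&data[i]); // number of bytes after the size field
    uint64_t l_name     = data[i + 4];         // name length, including the terminating NUL

    if (i + block_size + 4 > len) return 0;              // 64-bit arithmetic: a huge block_size cannot wrap around
    if (l_name < 2 || 1 + l_name > block_size) return 0; // at least one character + NUL, and fits in the record

    const uint8_t *name = &data[i + HDR_LEN];
    if (name[l_name - 1] != 0) return 0;                 // must be NUL-terminated
    for (uint64_t k = 0; k + 1 < l_name; k++)
        if (name[k] < '!' || name[k] > '~') return 0;    // printable characters only

    return 1;
}

// Returns the number of bytes after the last complete record, or -1 if no
// record is found. *last_rec (optional out) points at that record, or is NULL.
static int64_t unconsumed_backwards (const uint8_t *data, uint64_t len, const uint8_t **last_rec)
{
    if (last_rec) *last_rec = NULL;
    if (len < HDR_LEN) return -1;

    for (uint64_t i = len - HDR_LEN + 1; i-- > 0; ) { // i runs from len-HDR_LEN down to 0
        if (!plausible_record (data, len, i)) continue;

        if (last_rec) *last_rec = &data[i];
        return (int64_t)(len - (i + get_le32 (&data[i]) + 4));
    }

    return -1;
}
```

Because the scan accepts the first offset that merely looks plausible, the quality of the result depends on how strict the tests are, which is why the original code checks many more fields than this sketch does.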

@@ -142,11 +146,29 @@ int32_t bam_unconsumed (VBlockP vb, uint32_t first_i)

// stringent -either CIGAR needs to match seq_len, or qname needs to match flavor
else
-result = bam_unconsumed_scan_backwards (vb, first_i);
+result = bam_unconsumed_scan_backwards (Btxt(first_i), Ltxt - first_i, NULL);

return result; // if -1 - we will be called again with more data
}

+bool bam_txt_file_is_last_alignment_unmapped (void)
+{
+// notes: 1. in sorted files, unmapped are at the end. 2. No implementation for SAM, as BGZF .sam.gz or .sam are very rare.
+if (!segconf.is_sorted || !IS_BAM_ZIP || !TXT_IS_BGZF ||
+txt_file->redirected || txt_file->is_remote || is_read_via_ext_decompressor (txt_file)/*CRAM*/)
+return false; // we cannot test
+
+STRli(uncomp, BGZF_MAX_BLOCK_SIZE);
+if (!bgzf_read_and_uncomp_final_block (txt_file->name, qSTRa(uncomp)))
+return false; // failed to find or read or uncompress final bgzf block
+
+const BAMAlignmentFixed *aln;
+bam_unconsumed_scan_backwards (STRa(uncomp), &aln);
+
+return aln && // last alignment found (note: possibly NULL (=not found) if by bad luck the final BGZF block is tiny)
+((LTEN16(aln->flag) & SAM_FLAG_UNMAPPED) || !aln->n_cigar_op || aln->ref_id == -1 || aln->pos == -1);
+}
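
The final test in the new function treats an alignment as unmapped if any one of four indicators holds. A minimal standalone sketch of that predicate follows. The struct `AlnSummary` and the function `aln_is_unmapped` are hypothetical and assume the fields were already converted to host byte order. The value 0x4 is the "segment unmapped" FLAG bit defined by the SAM specification.

```c
#include <assert.h>
#include <stdint.h>

#define FLAG_UNMAPPED 0x4 // SAM FLAG bit: segment unmapped

// Hypothetical subset of the BAM fixed-length alignment fields, in host byte order
typedef struct {
    int32_t  ref_id;     // -1 if the read has no reference sequence
    int32_t  pos;        // 0-based position, -1 if none
    uint16_t n_cigar_op; // number of CIGAR operations
    uint16_t flag;       // bitwise FLAG
} AlnSummary;

// Returns 1 if any indicator says the alignment is unmapped, 0 otherwise
static int aln_is_unmapped (const AlnSummary *aln)
{
    assert (aln);
    return (aln->flag & FLAG_UNMAPPED) || aln->n_cigar_op == 0 ||
           aln->ref_id == -1 || aln->pos == -1;
}
```

Checking several indicators rather than the flag alone makes the predicate tolerant of files in which only some of the fields were set for unmapped reads.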

static rom bam_dump_alignment (VBlockSAMP vb, rom alignment, rom after)
{
buf_free (vb->scratch); // feel free to use scratch bc this is called in an ASSERT before aborting
@@ -494,7 +516,7 @@ rom bam_seg_txt_line (VBlockP vb_, rom alignment /* BAM terminology for one line
(flag.has_biopsy_line && sam_seg_test_biopsy_line (VB, alignment, block_size + 4)) )
goto done;

-sam_cigar_binary_to_textual (vb, n_cigar_op, B1ST(BamCigarOp, vb->binary_cigar), // binary_cigar and not "cigar", as the latter is mis-aligned
+sam_cigar_binary_to_textual (vb, B1ST(BamCigarOp, vb->binary_cigar), n_cigar_op, false, // binary_cigar and not "cigar", as the latter is mis-aligned
&vb->textual_cigar); // re-write BAM format CIGAR as SAM textual format in vb->textual_cigar

// SEQ - calculate diff vs. reference (denovo or loaded)
@@ -574,7 +596,7 @@ rom bam_seg_txt_line (VBlockP vb_, rom alignment /* BAM terminology for one line
else if (IS_SAG_SOLO) sam_seg_prim_add_sag_SOLO (vb, dl);
}

-if (dl->SEQ.len > vb->longest_seq_len) vb->longest_seq_len = dl->SEQ.len;
+MAXIMIZE (vb->longest_seq_len, dl->SEQ.len);

if (segconf.running)
segconf.est_segconf_sam_size += bam_segconf_get_transated_sam_line_len (vb, dl, tlen);
13 changes: 0 additions & 13 deletions src/bits.c
@@ -895,19 +895,6 @@ void bits_2bit_to_byte (uint8_t *dst, ConstBitsP src_bits, uint64_t base_i, uint
*dst++ = BASE_NEXT_FWD;
}

-// convert a Bits containing a series of 2-bits, to a byte array of values 0-3
-void bits_2bit_to_ACGT (char *dst, ConstBitsP src_bits, uint64_t base_i, uint32_t num_bases)
-{
-ASSERT (2*(base_i + num_bases) <= src_bits->nbits, "Expecting 2*(base_is=%"PRIu64" + num_bases=%u) <= nbits=%"PRIu64,
-base_i, num_bases, src_bits->nbits);
-
-BASE_ITER_INIT (src_bits, base_i, num_bases, true);
-
-static char acgt[4] = { 'A', 'C', 'G', 'T' };
-for (uint32_t i=0; i < num_bases; i++)
-*dst++ = acgt[BASE_NEXT_FWD];
-}
-
//
// Logic operators
//