Skip to content

Commit

Permalink
Fixed an issue with merging alignment pairs with insertions upstream of
Browse files Browse the repository at this point in the history
the start of the second read causing an out of frame merging
  • Loading branch information
DavidAustinNix committed Jun 2, 2020
1 parent b2c7aa6 commit 7c373aa
Show file tree
Hide file tree
Showing 3 changed files with 109 additions and 31 deletions.
88 changes: 66 additions & 22 deletions Documentation/USeqDocumentation/cmdLnMenus.html
Original file line number Diff line number Diff line change
Expand Up @@ -375,6 +375,8 @@ <H1>Command Line Menus</H1> <p>

<a href="#VCF2Tsv">VCF2Tsv</a><br>

<a href="#VCFVarSeqParser">VCFVarSeqParser</a><br>

<a href="#Wig2Bar">Wig2Bar</a><br>

<a href="#Wig2USeq">Wig2USeq</a><br>
Expand Down Expand Up @@ -4188,32 +4190,40 @@ <H1>Command Line Menus</H1> <p>
</pre><br><p>

<a name="SimpleSomaticCaller"><pre>**************************************************************************************
** Simple Somatic Caller: Jan 2020 **
** Simple Somatic Caller: May 2020 **
**************************************************************************************
Takes vcf output from Bcftools mpileup and norm applications (http://www.htslib.org)
run on a paired tumor and normal bam file set and filters the variants for somatic
calls. Select -z to print an example bash script. A one-tailed Fisher's Exact test is
performed on each alt allele count to select somatic variants. Vt normalization isn't
needed for the output vcfs.
needed for the output vcfs. An indel scan is performed to down weight snvs and indels
that occur in noisy indel regions.

Required Params:
-v Full path file or directory containing xxx.vcf(.gz/.zip OK) file(s).
-s Directory in which to save the parsed files, defaults to parent dir of the vcfs.
-t Minimum tumor allele frequency (AF)
-x Maximum tumor AF
-n Maximum normal AF
-u Minimum tumor alignment depth (DP)
-a Minimum tumor alt count
-o Minimum normal DP
-d Minimum T-N AF difference
-r Minimum T/N AF ratio
-f Maximum Fisher's TN pvalue, defaults to 1, no threshold
-y Save failing records
-m For multi alt alleles, save just the biggest T-N AF difference.
-v Path to the bcftools xxx.vcf(.gz/.zip OK) file.
-b Normal BamPileup file, compressed and tabix indexed, see the USeq BamPileup app.

Default Params:
-s Directory in which to save the parsed file, defaults to parent dir of the vcf.
-c Normal BamPileup index, 0
-t Minimum tumor allele frequency (AF), 0
-x Maximum tumor AF, 1
-n Maximum normal AF, 1
-u Minimum tumor alignment depth (DP), 0
-a Minimum tumor alt count, 0
-o Minimum normal DP, 0
-i Maximum normal INDEL AF, 0
-j Maximum normal INDEL AF/ tumor AF ratio, 0
-d Minimum T-N AF difference, 0
-r Minimum T/N AF ratio, 0
-f Maximum Fisher's TN pvalue, defaults to 1
-m For multi alt alleles, save just the biggest T-N AF difference
-p BP padding for indel scan, default 3
-z Print example bcftools bash script.

Example: java -jar pathToUSeq/Apps/SimpleSomaticCaller -v VCFFiles/ -f 0.001
-t 0.005 -x 0.2 -n 0.1 -u 100 -a 3 -o 100 -d 0.0025 -r 3 -m
Example: java -jar pathToUSeq/Apps/SimpleSomaticCaller -v bcfTools.vcf.gz -f 0.001
-t 0.005 -x 0.2 -n 0.1 -u 100 -a 3 -o 100 -d 0.0025 -r 3 -m -s SSC/ -b norm.bp.txt.gz
-i 0.01 -j 0.5

**************************************************************************************
</pre><br><p>
Expand Down Expand Up @@ -4813,7 +4823,7 @@ <H1>Command Line Menus</H1> <p>
</pre><br><p>

<a name="VCFCallFrequency"><pre>**************************************************************************************
** VCF Call Frequency: Feb 2020 **
** VCF Call Frequency: May 2020 **
**************************************************************************************
Calculates a vcf call frequency for each variant when pointed at a genomic Query
service (https://github.com/HuntsmanCancerInstitute/GQuery) or the Data and Index
Expand All @@ -4835,11 +4845,11 @@ <H1>Command Line Menus</H1> <p>
-c Config txt file containing two tab delimited columns with host, queryUrl, realm,
userName, password, and (optionally) fileFilter and or maxCallFreq. 'chmod 600'
the file! e.g.:
host hci-clingen1.hci.utah.edu
queryUrl http://hci-clingen1.hci.utah.edu:8080/GQuery/
host example.hci.utah.edu
queryUrl http://example.hci.utah.edu:8080/GQuery/
realm GQuery
userName FColins
password g0QueryAP1
userName ExampleUser
password ExamplePW
-i (Alternative to -c), provide a path to the Index directory generated by the USeq
QueryIndexer app.
-d (Alternative to -c), provide a path to the Data directory used in creating the Index.
Expand Down Expand Up @@ -5353,6 +5363,40 @@ <H1>Command Line Menus</H1> <p>

Example: java -jar pathToUSeq/Apps/VCF2Tsv -v VCFss/ -q 3 -z -b 0.2 -i Foundation

**************************************************************************************
</pre><br><p>

<a name="VCFVarSeqParser"><pre>**************************************************************************************
** VCF VarSeq Parser: May 2020 **
**************************************************************************************
VVSP either creates vcf files for import into VarSeq by inserting a unique ID into the
INFO field of each record OR, VVSP takes vcf file(s) exported from VarSeq and pulls the
original unmodified vcf record via the unique ID. Use this tool when you want to
just filter variants using VarSeq and retain the unmodified original vcf records.
If you aren't going to filter on sample info, e.g. with t/n somatic variants, use -d
Be carefult to not modify the original vcfs prior to rematching with the VarSeq export
vcf(s)!

When exporting vcf files from VarSeq, include the VRID INFO tag AND where possible
only export Affected Samples, e.g. Field Selection-> Variant Info-> check VRID. On the
Options page select the Affected Samples bubble.

Params (either complete -o -i -x OR -o -e -f):
-o Directory containing original vcfs to modify for VarSeq import
(xxx.vcf(.gz/.zip OK))
-i Directory to save the modified vcf files for VarSeq import
-e Directory containing the exported VarSeq vcf file(s)
-f Directory to save the final filtered original vcf records
-x List of FILTER keys to exclude from the original vcfs, comma delimited, no spaces,
case sensitive
-d Drop sample info for VarSeq vcf file import.

Examples:
1) java -Xmx1G -jar pathTo/USeq/Apps/VCFVarSeqParser -o StartingOriVcfs/
-i VcfsForVarSeqImport -x CallFreq,VCFBkz -d
2) java -Xmx1G -jar pathTo/USeq/Apps/VCFVarSeqParser -o StartingOriVcfs/
-e VarSeqExportedVcfs/ -f FinalFilteredOriVcfs/

**************************************************************************************
</pre><br><p>

Expand Down
2 changes: 1 addition & 1 deletion Source/edu/utah/seq/parsers/MergePairedAlignments.java
Original file line number Diff line number Diff line change
Expand Up @@ -418,7 +418,7 @@ private void saveHeader() throws FileNotFoundException, IOException {
public static void printDocs(){
System.out.println("\n" +
"**************************************************************************************\n" +
"** Merge Paired Alignments: March 2020 **\n" +
"** Merge Paired Alignments: June 2020 **\n" +
"**************************************************************************************\n" +
"Merges proper paired alignments that pass a variety of checks and thresholds. Only\n" +
"unambiguous pairs will be merged. Increases base calling accuracy in overlap and helps\n" +
Expand Down
50 changes: 42 additions & 8 deletions Source/edu/utah/seq/parsers/PairedAlignmentChrParser.java
Original file line number Diff line number Diff line change
Expand Up @@ -161,27 +161,43 @@ public static int countIs (String cigar, int stop){
if (stop == 0) return 0;
int length = 0;
int numIs = 0;
//IO.pl("Stop/StartRight in 0 coordinates "+stop+" "+cigar);

//for each M D I or N block MDIN
Matcher mat = MergePairedAlignments.CIGAR_SUB.matcher(cigar);
while (mat.find()){
//pass
int num = Integer.parseInt(mat.group(1));
//IO.pl("\nAlignmentBlock "+mat.group(1)+mat.group(2)+" num "+num);
//is call I
if (mat.group(2).equals("I")){
if (mat.group(2).equals("I")){
//IO.pl("\tI found");
numIs+= num;
if (length >= stop) {
//IO.pl("\t\tLen >= stop Breaking");
break;
}
/*
for (int i=0; i< num; i++){
// shouldn't this not count to length!
length++;
numIs++;
if (length >= stop) break;
//numIs++;
}
}*/
}
//nope just addit
else {
//IO.pl("\tNot I found");
length += num;
if (length >= stop) break;
//IO.pl("\t\t"+numIs+" I Len "+length);
if (length >= stop) {
//IO.pl("\t\tLen >= stop Breaking");
break;
}
}

}
//IO.pl("Final num I "+numIs);
return numIs;
}

Expand Down Expand Up @@ -215,29 +231,44 @@ public SamAlignment mergePairedAlignments(SamAlignment first, SamAlignment secon
left = second;
}

//System.out.println("Name "+left.getName());
//System.out.println("\nName "+left.getName());

//fetch genomic space coordinates
//fetch genomic space coordinates, not for INDELS
int startLeft = left.getPosition();
int stopLeft = startLeft + countLengthOfCigar(left.getCigar());
//IO.pl("LeftStart "+startLeft);
//IO.pl("LeftStop "+stopLeft);


int startRight = right.getPosition();
int stopRight = startRight + countLengthOfCigar(right.getCigar());
int stop = stopRight;
if (stopLeft > stop) stop = stopLeft;
//IO.pl("RightStart "+startRight);
//IO.pl("RightStop "+stopRight);

//any Is in left that precede the start of right?
int numAdders = countIs(left.getCigar(), startRight-startLeft);
//IO.pl("Adders "+numAdders+" LeftCigar "+left.getCigar()+" R-L "+(startRight-startLeft));

//make arrays to hold sequence and qualities in cigar space
int size = numAdders + stop-startLeft;

SamLayout leftLayout = new SamLayout(size);
SamLayout rightLayout = new SamLayout(size);


//layout data
leftLayout.layoutCigar(startLeft, left);
rightLayout.layoutCigar(startLeft-numAdders, right);

//System.out.println("LeftLayout");
//leftLayout.print();
//System.out.println("RightLayout");
//rightLayout.print();



//increment overlap and insert size
int[] overNonOver = SamLayout.countOverlappingBases(leftLayout, rightLayout);
numberOverlappingBases+= (double)overNonOver[0];
Expand All @@ -251,8 +282,10 @@ public SamAlignment mergePairedAlignments(SamAlignment first, SamAlignment secon
//merge layouts, modifies original layouts so print first if you want to see em before mods.
SamLayout mergedSamLayout = SamLayout.mergeLayouts(leftLayout, rightLayout, minimumDiffQualScore, minimumFractionInFrameMismatch);

//System.out.println("MergedLayout");
//mergedSamLayout.print();
//System.out.println("\nMergedLayout");
//mergedSamLayout.print();

//System.exit(1);

if (mergedSamLayout == null) {
//add failed merge tag
Expand Down Expand Up @@ -522,6 +555,7 @@ private void processPair(SamAlignment first, SamAlignment second) throws Excepti

/**Attempts to merge a proper paired alignment. Increments counters and sends examples to the gzipper.*/
private void mergeAndScorePair(SamAlignment first, SamAlignment second) throws Exception{

//collect string rep since the merge method will modify the SamAlignment while merging
//so if it fails you can output the unmodified SamAlignment
String firstSamString = first.toString();
Expand Down

0 comments on commit 7c373aa

Please sign in to comment.