Skip to content

Commit

Permalink
Further dev of Caris dataset processing and the VariantPathwayComparator
Browse files Browse the repository at this point in the history
  • Loading branch information
u0028003 committed Mar 10, 2022
1 parent 42a01a8 commit b5e73d1
Show file tree
Hide file tree
Showing 13 changed files with 732 additions and 75 deletions.
39 changes: 32 additions & 7 deletions Documentation/USeqDocumentation/cmdLnMenus.html
Original file line number Diff line number Diff line change
Expand Up @@ -79,6 +79,8 @@ <H1>Command Line Menus</H1> <p>

<a href="#CalculatePerCycleErrorRate">CalculatePerCycleErrorRate</a><br>

<a href="#CarisVcfComparator">CarisVcfComparator</a><br>

<a href="#CarisXmlVcfParser">CarisXmlVcfParser</a><br>

<a href="#ChIPSeq">ChIPSeq</a><br>
Expand Down Expand Up @@ -1365,29 +1367,50 @@ <H1>Command Line Menus</H1> <p>
Example: java -Xmx1500M -jar pathTo/USeq/Apps/CalculatePerCycleErrorRate
-b /Data/Bam/ -f /Fastas/chrPhiX_Illumina.fasta.gz

**************************************************************************************
</pre><br><p>

<a name="CarisVcfComparator"><pre>**************************************************************************************
** Caris Vcf Comparator: March 2022 **
**************************************************************************************
CVC compares a Caris vcf generated with the CarisXmlVcfParser to a recalled vcf.
Exact recall vars are so noted and removed. Caris vcf with no exact but one
overlapping record can be merged with -k. Be sure to vt normalize each before running.
Recall variants failing FILTER are not saved.

Options:
-c Path to a the Caris xml parsed vcf file, see the CarisXmlVcfParser app.
-r Path to a recalled snv/indel vcf file.
-m Path to named vcf file for saving the results.

Example: java -Xmx2G -jar pathToUSeq/Apps/CarisVcfComparator -f TL-18-03CFD6.vcf
-r /F1/TL-18-03CFD6_recall.vcf.gz -g -c -m /F1/TL-18-93CFD6_merged.vcf.gz -k

**************************************************************************************
</pre><br><p>

<a name="CarisXmlVcfParser"><pre>**************************************************************************************
** Caris Xml Vcf Parser: March 2021 **
** Caris Xml Vcf Parser: March 2022 **
**************************************************************************************
This tool parses Caris paired xml and vcf report files to generate: new vcfs where xml
reported genomic alternations are annotated, bed files of copy number changes and gene
fusions as well as a summary spreadsheet of the non NGS test results and report
attributes.
attributes. Use Vt to normalize the vcf output files.

Options:
-d Path to a directory containing paired xml and vcf files from Caris.
-s Path to a directory for saving the results.
-s Path to a directory for saving the results. Only impacting vcf, cnv, fusions are
saved, if none present, no file(s) are created.
-u Path to a Hg38 UCSC RefFlat or RefSeq merged gene file for CNV and gene fusion
coordinate extraction. See: http://www.genome.ucsc.edu/FAQ/FAQformat.html#format9
(refSeqGeneName name2(optional) chr strand txStart txEnd cdsStart cdsEnd exonCount
exonStarts exonEnds). Be sure to merge transcripts with the USeq MergeUCSCGeneTable
app and remove non standard chromosomes.
-i Include PHI in spreadsheet output, defaults to excluding.
-a Include all vcf records in output, defaults to just those with an xml match.

Example: java -Xmx2G -jar pathToUSeq/Apps/CarisXmlVcfParser -d CarisReports/
-s ParsedCarisReports/ -i -u ~/GRCh38/hg38RefSeq9Dec2020_MergedStdChr.ucsc.gz
-s ParsedCarisReports/ -u ~/GRCh38/hg38RefSeq9Dec2020_MergedStdChr.ucsc.gz

**************************************************************************************
</pre><br><p>
Expand Down Expand Up @@ -4891,7 +4914,7 @@ <H1>Command Line Menus</H1> <p>
</pre><br><p>

<a name="TNRunner2"><pre>**************************************************************************************
** TNRunner2 : Feb 2022 **
** TNRunner2 : March 2022 **
**************************************************************************************
TNRunner is designed to execute several containerized snakmake workflows on tumor
normal datasets via a slurm cluster. Based on the availability of fastq,
Expand Down Expand Up @@ -5165,9 +5188,11 @@ <H1>Command Line Menus</H1> <p>
delimited, the first cell is the pathway ID and description (e.g.
'hsa05210 Colorectal cancer'), subsequent cells, the associated genes.
-r File to save results, should end with .xls
-k (Optional) Tab delimited file containing KEGG gene IDs and gene Names, one row
per gene, for highlighting impacted genes in each KEGG pathway hyperlink.

Example: java -Xmx10G -jar pathTo/USeq/Apps/VariantPathwayComparator -a earlyCRC.txt
-b lateCRC.txt -p keggHumanPathways.txt -r earlyVsLateVPC.xls
-b lateCRC.txt -p keggHumanPathways.txt -r earlyVsLateVPC.xls -k keggGeneIDName.txt

**************************************************************************************
</pre><br><p>
Expand Down Expand Up @@ -5335,7 +5360,7 @@ <H1>Command Line Menus</H1> <p>
-e Print verbose debugging output.

Example: java -jar pathToUSeq/Apps/VCFCallFrequency -f Vcf/ -s CFVcfs -v
Hg38/Somatic/Avatar/Vcf -b -m 0.05 -d ~/GQuery/Data
Hg38/Somatic/Avatar/Vcf -b -m 0.07 -d ~/GQuery/Data

**************************************************************************************
</pre><br><p>
Expand Down
Binary file modified Misc/JavaxComOrgInfoNetClasses.zip
Binary file not shown.
2 changes: 1 addition & 1 deletion Source/edu/utah/seq/run/TNRunner2.java
Original file line number Diff line number Diff line change
Expand Up @@ -864,7 +864,7 @@ public void processArgs(String[] args){
public static void printDocs(){
IO.pl("\n" +
"**************************************************************************************\n" +
"** TNRunner2 : Feb 2022 **\n" +
"** TNRunner2 : March 2022 **\n" +
"**************************************************************************************\n" +
"TNRunner is designed to execute several containerized snakmake workflows on tumor\n"+
"normal datasets via a slurm cluster. Based on the availability of fastq, \n"+
Expand Down
23 changes: 17 additions & 6 deletions Source/edu/utah/seq/run/TNSample2.java
Original file line number Diff line number Diff line change
Expand Up @@ -104,19 +104,30 @@ public TNSample2(File rootDir, TNRunner2 tnRunner) throws IOException{
private void parseMergeClinicalVars() throws IOException {
info.add("Checking clinical variant integration...");

//look for json file
//look for json file and xml vcfs
File[] jsonTestResults = IO.extractFiles(new File(rootDir, "ClinicalReport"), ".json");
File[] xmls = null;
File[] vcfs = null;
File[] toLink = null;

if (jsonTestResults == null || jsonTestResults.length !=1) {
info.add("\tJsonReport\tFAILED to find one xxx.json clinical test report file");
failed = true;
return;

//look for caris xml and vcf.gz
xmls = IO.extractFiles(new File(rootDir, "ClinicalReport"), ".xml");
vcfs = IO.extractFiles(new File(rootDir, "ClinicalReport"), ".vcf.gz");
if (xmls.length != 1 || vcfs.length != 1) {
info.add("\tJson/Xml/VcfReport\tFAILED to find one xxx.json or one xxx.vcf.gz and xxx.xml clinical test report file(s)");
failed = true;
return;
}
jsonTestResults = null;
toLink = new File[]{somaticVariants, new File(somaticVariants+".tbi"), xmls[0], vcfs[0]};
}
else toLink = new File[]{somaticVariants, new File(somaticVariants+".tbi"), jsonTestResults[0]};

//make dir, ok if it already exists
File jobDir = new File (rootDir, "SomaticVariantCalls/"+id+"_ClinicalVars");
jobDir.mkdirs();

File[] toLink = new File[]{somaticVariants, new File(somaticVariants+".tbi"), jsonTestResults[0]};

//any files?
HashMap<String, File> nameFile = IO.fetchNamesAndFiles(jobDir);
Expand Down
8 changes: 4 additions & 4 deletions Source/edu/utah/seq/vcf/VCFCallFrequency.java
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,6 @@
import edu.utah.hci.query.MasterQuery;
import edu.utah.hci.query.QueryRequest;
import edu.utah.hci.query.UserQuery;
import edu.utah.seq.vcf.anno.AnnotatedVcfParser;

/**Calculates call frequencies for each vcf record from a db of vcf files and callable region files.
* Most mutated single site in p53 are seen at 0.039 across all cancers in TCGA. 0.57 for BRAF V600E across all cancers.*/
Expand Down Expand Up @@ -292,9 +291,10 @@ public void query(UserQuery userQuery) throws IOException {
//make a single use QueryRequest object
QueryRequest qr = new QueryRequest(masterQuery, null, options, null);
//any errors?
if (qr.getErrTxtForUser() != null || qr.getWarningTxtForUser().size()!=0) {
if (qr.getErrTxtForUser() != null ) {
//if (qr.getErrTxtForUser() != null || qr.getWarningTxtForUser().size()!=0) {
if (qr.getErrTxtForUser() !=null) Misc.printErrAndExit("\nInvalid request, "+qr.getErrTxtForUser()+ ",aborting.");
if (qr.getWarningTxtForUser().size() !=0) Misc.printErrAndExit("\n"+qr.getWarningTxtForUser()+", aborting.");
// already printed! so skip, if (qr.getWarningTxtForUser().size() !=0) Misc.printErrAndExit("\n"+qr.getWarningTxtForUser()+", might be an issue.");
}
//set results
else userQuery.setResults(qr.getJsonResults());
Expand Down Expand Up @@ -639,7 +639,7 @@ public static void printDocs(){


"\nExample: java -jar pathToUSeq/Apps/VCFCallFrequency -f Vcf/ -s CFVcfs -v \n"+
" Hg38/Somatic/Avatar/Vcf -b -m 0.05 -d ~/GQuery/Data \n\n" +
" Hg38/Somatic/Avatar/Vcf -b -m 0.07 -d ~/GQuery/Data \n\n" +
"**************************************************************************************\n");

}
Expand Down
74 changes: 55 additions & 19 deletions Source/edu/utah/seq/vcf/VariantPathwayComparator.java
Original file line number Diff line number Diff line change
Expand Up @@ -11,10 +11,13 @@ public class VariantPathwayComparator {
private File groupAGeneHits;
private File groupBGeneHits;
private File pathwayGenes;
private File geneKeggIdName;

//internal
private ArrayList<Pathway> pathways = new ArrayList<Pathway>();
private File resultsSpreadsheet = null;
private HashMap<String,String> keggGeneNameID = null;
private double minKeggFreq = 0.025;


//constructor
Expand Down Expand Up @@ -52,14 +55,14 @@ private void sortSavePathways() {

try {
PrintWriter out = new PrintWriter( new FileWriter(this.resultsSpreadsheet));
out.println("NameHyperLink\tAdjPval\tAHits\tANoHits\tFracAHits\tAGeneHits\tBHits\tBNoHits\tFracBHits\tBGeneHits\tLog2(fracA/fracB)");
for (Pathway p: loadedPathways) out.println(p.toString());
if (geneKeggIdName == null) out.println("NameHyperLink\tPval\tAdjPval\tAHits\tANoHits\tFracAHits\tAGeneHits\tBHits\tBNoHits\tFracBHits\tBGeneHits\tLog2(fracA/fracB)");
else out.println("NameHyperLink\tPval\tAdjPval\tAHits\tANoHits\tFracAHits\tAGeneHits\tAGeneLinks\tBHits\tBNoHits\tFracBHits\tBGeneHits\tBGeneLinks\tLog2(fracA/fracB)");
for (Pathway p: loadedPathways) out.println(p.toString(keggGeneNameID));
out.close();
} catch (IOException e) {
e.printStackTrace();
}


}

private void comparePathways() {
Expand Down Expand Up @@ -173,6 +176,8 @@ private void loadPathways() {
e.printStackTrace();
Misc.printErrAndExit("\nFailed to parse the pathway file "+pathwayGenes);
}

if (geneKeggIdName!=null) keggGeneNameID = IO.loadFileIntoHash(geneKeggIdName, 1, 0);
}

private class Pathway implements Comparable<Pathway>{
Expand All @@ -197,32 +202,34 @@ private class Pathway implements Comparable<Pathway>{



public String toString() {
//name adjPval AHits, ANoHits, fracAHits, AGeneHits, BHits, BNoHits, fracBHits, BGeneHits, log2(fracA/fracB)
public String toString(HashMap<String,String>keggGeneNameId) {
//name pval, adjPval, AHits, ANoHits, fracAHits, AGeneHits, (ALinks,) BHits, BNoHits, fracBHits, BGeneHits, (BLinks), log2(fracA/fracB)
StringBuilder sb = new StringBuilder();
//attempt to parse the hsa0000 pathway number
Matcher mat = hsaPath.matcher(name);
String partialHyperLink = null;
if (mat.matches()) {
sb.append("=HYPERLINK(\"https://www.kegg.jp/pathway/");
sb.append(mat.group(1));
partialHyperLink = sb.toString();
sb.append("\",\"");
sb.append(name);
sb.append("\")");
}
else sb.append(name);
sb.append("\t");
else sb.append(name); sb.append("\t");
sb.append(Num.formatNumber(pval, 5)); sb.append("\t");
sb.append(Num.formatNumber(adjPval, 5)); sb.append("\t");
sb.append(groupAHits.size()); sb.append("\t");
sb.append(groupANoHits.size()); sb.append("\t");
double aFrac = (double)groupAHits.size()/ (double)(groupAHits.size()+groupANoHits.size());
sb.append(Num.formatNumber(aFrac, 5)); sb.append("\t");
sb.append(convertToSortedFrequencies(aGenes)); sb.append("\t");
sb.append(convertToSortedFrequencies(aGenes, partialHyperLink)); sb.append("\t");

sb.append(groupBHits.size()); sb.append("\t");
sb.append(groupBNoHits.size()); sb.append("\t");
double bFrac = (double)groupBHits.size()/ (double)(groupBHits.size()+groupBNoHits.size());
sb.append(Num.formatNumber(bFrac, 5)); sb.append("\t");
sb.append(convertToSortedFrequencies(bGenes)); sb.append("\t");
sb.append(convertToSortedFrequencies(bGenes, partialHyperLink)); sb.append("\t");

double rto = aFrac/bFrac;
sb.append(Num.formatNumber(Num.log2(rto), 5));
Expand All @@ -237,7 +244,7 @@ public int compareTo(Pathway o) {

}

public static String convertToSortedFrequencies(HashMap<String,Integer> keyCounts) {
public String convertToSortedFrequencies(HashMap<String,Integer> keyCounts, String pathwayName) {
double totalCounts = 0;
for (String key: keyCounts.keySet()) totalCounts+= keyCounts.get(key);
GeneCount[] gc = new GeneCount[keyCounts.size()];
Expand All @@ -249,17 +256,43 @@ public static String convertToSortedFrequencies(HashMap<String,Integer> keyCount
gc[count++] = new GeneCount(key, geneCount);
}
Arrays.sort(gc);
StringBuilder sb = new StringBuilder(gc[0].name);
sb.append("=");
sb.append(Num.formatNumber(gc[0].fraction, 3));
for (int i=1; i<gc.length; i++) {
sb.append(",");
sb.append(gc[i].name);
sb.append("=");
sb.append(Num.formatNumber(gc[i].fraction, 3));

StringBuilder namesToView = new StringBuilder();
ArrayList<String> namesToLink = new ArrayList<String>();
int last = gc.length-1;
for (int i=0; i<gc.length; i++) {
namesToView.append(gc[i].name);
namesToView.append("=");
namesToView.append(Num.formatNumber(gc[i].fraction, 3));
if (i!=last)namesToView.append(",");
//any for linking?
if (gc[i].fraction >= minKeggFreq) namesToLink.add(gc[i].name);
}
String geneNamesToView = namesToView.toString();
if (keggGeneNameID != null) return geneNamesToView+ "\t"+ createGeneHyperLink(pathwayName, namesToLink, geneNamesToView);
return geneNamesToView;
}
private String createGeneHyperLink(String pathwayLink, ArrayList<String> namesToLink, String geneNamesToView) {
//https://www.kegg.jp/pathway/hsa03440
//must watch size, max 250
StringBuilder sb = new StringBuilder(pathwayLink);
for (String geneName: namesToLink) {
String id = keggGeneNameID.get(geneName);
int size = sb.length() + id.length()+1;
if (size > 250) break;
if (id != null) {
sb.append("+");
sb.append(id);

}
}
sb.append("\",\"");
if (geneNamesToView.length()> 250) sb.append(geneNamesToView.subSequence(0, 247)+"...");
else sb.append(geneNamesToView);
sb.append("\")");
return sb.toString();
}

public static class GeneCount implements Comparable<GeneCount>{
double fraction = 0;
String name = null;
Expand Down Expand Up @@ -301,6 +334,7 @@ public void processArgs(String[] args){
case 'a': groupAGeneHits = new File(args[++i]); break;
case 'b': groupBGeneHits = new File(args[++i]); break;
case 'p': pathwayGenes = new File(args[++i]); break;
case 'k': geneKeggIdName = new File(args[++i]); break;
case 'r': resultsSpreadsheet = new File(args[++i]); break;
case 'h': printDocs(); System.exit(0);
default: Misc.printErrAndExit("\nProblem, unknown option! " + mat.group());
Expand Down Expand Up @@ -341,9 +375,11 @@ public static void printDocs(){
" delimited, the first cell is the pathway ID and description (e.g.\n"+
" 'hsa05210 Colorectal cancer'), subsequent cells, the associated genes.\n"+
"-r File to save results, should end with .xls\n"+
"-k (Optional) Tab delimited file containing KEGG gene IDs and gene Names, one row\n"+
" per gene, for highlighting impacted genes in each KEGG pathway hyperlink.\n"+

"\nExample: java -Xmx10G -jar pathTo/USeq/Apps/VariantPathwayComparator -a earlyCRC.txt\n"+
" -b lateCRC.txt -p keggHumanPathways.txt -r earlyVsLateVPC.xls \n"+
" -b lateCRC.txt -p keggHumanPathways.txt -r earlyVsLateVPC.xls -k keggGeneIDName.txt\n"+

"\n**************************************************************************************\n");

Expand Down
5 changes: 5 additions & 0 deletions Source/edu/utah/seq/vcf/xml/caris/CNVAlteration.java
Original file line number Diff line number Diff line change
Expand Up @@ -110,4 +110,9 @@ public SimpleBed toBed(HashMap<String, UCSCGeneLine[]> name2GeneModels) throws I
if (score == null) score = ".";
return new SimpleBed(model.getChrom(), model.getTxStart(), model.getTxEnd(), geneName+":"+result.toLowerCase().replace(" ", "_"), score, model.getStrand());
}


public String getResult() {
return result;
}
}
Loading

0 comments on commit b5e73d1

Please sign in to comment.