
Foundation Pipeline: XML to Staging Data Conversion

angelicaochoa edited this page Aug 11, 2016 · 11 revisions

Instructions

The Foundation Pipeline was written to convert XML data as provided by Foundation Medicine into staging files for cBioPortal. The pipeline requires 3 options on the command line:

  • source: Source directory containing the Foundation XML files
  • output: Output directory for writing the converted data
  • cancer_study_id: A cancer study identifier for writing the meta data files
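As a rough illustration of how the three required options could be validated, the sketch below parses `--name value` pairs with plain Java. The class name `OptionsSketch` and the error messages are hypothetical; the actual pipeline may well rely on a command-line parsing library instead.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the pipeline's actual option handling.
class OptionsSketch {

    // The three options the pipeline requires, as listed above.
    static final List<String> REQUIRED = Arrays.asList("source", "output", "cancer_study_id");

    /** Parses "--name value" pairs and checks that every required option is present. */
    public static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        // Walk the arguments two at a time; a trailing name without a value is ignored.
        for (int i = 0; i + 1 < args.length; i += 2) {
            if (!args[i].startsWith("--")) {
                throw new IllegalArgumentException("Expected an option name but found: " + args[i]);
            }
            options.put(args[i].substring(2), args[i + 1]);
        }
        for (String name : REQUIRED) {
            if (!options.containsKey(name)) {
                throw new IllegalArgumentException("Missing required option: --" + name);
            }
        }
        return options;
    }
}
```

Failing fast on a missing option avoids starting the batch job with an undefined source or output directory.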

The Spring Batch framework processes data in chunks and commits after each one. This commit interval (chunk size) can be customized through chunk.interval in application.properties (current default: 30). A higher chunk interval should be set for very large datasets.

Example run times for processing 159 Foundation cases:

Chunk Interval | Processing Time | Total Time (w/ startup context)
10             | 20.9s           | 32.2s
30             | 14.9s           | 26.5s
50             | 13.2s           | 24.6s
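To make the effect of the commit interval concrete, here is a minimal plain-Java sketch of how a list of cases is divided into chunks. It does not use Spring Batch itself, and `ChunkSketch` and `toChunks` are hypothetical names introduced for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of chunking; Spring Batch handles this internally.
class ChunkSketch {

    /** Splits items into consecutive chunks of at most chunkInterval elements. */
    public static <T> List<List<T>> toChunks(List<T> items, int chunkInterval) {
        if (chunkInterval <= 0) {
            throw new IllegalArgumentException("chunkInterval must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += chunkInterval) {
            int end = Math.min(start + chunkInterval, items.size());
            // Copy the sublist so each chunk is independent of the source list.
            chunks.add(new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }
}
```

With 159 cases, an interval of 10 yields 16 chunks, 30 yields 6, and 50 yields 4. Fewer chunks mean fewer commits, which is one plausible reason for the shorter processing times at larger intervals.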

Data Flow

XML File(s) → CNA Data Step → General Data Step → Meta Data Step → Staging Files

1. File reading and data extraction: Data from the XML file(s) in the source directory are extracted and injected into the job execution context as a list of Foundation cases, which makes them accessible to the subsequent steps.

2. CNA data: Copy number alteration data from each case are extracted and transformed into rows of data, where the first item in each row is the gene symbol and the remaining items are the discrete copy number alteration values of that gene for each sample. If a sample has no copy number alteration for a gene, it is assumed to be copy-number neutral for that gene. The rows are then passed to a writer that generates the CNA staging file.
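The transformation in the CNA step can be sketched roughly as follows. This is not the pipeline's actual code: `CnaMatrixSketch` and `toRows` are hypothetical names, and the sketch assumes that neutral is written as `0`, that the header column is named `Hugo_Symbol`, and that values are tab-delimited.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the gene-by-sample CNA matrix.
class CnaMatrixSketch {

    // Assumed discrete value for "no copy number change".
    static final String NEUTRAL = "0";

    /**
     * @param cnaBySample sample ID -> (gene symbol -> discrete CNA value)
     * @return tab-delimited rows: a header row, then one row per gene
     */
    public static List<String> toRows(Map<String, Map<String, String>> cnaBySample) {
        // Collect every gene seen in any sample, keeping first-seen order.
        Set<String> genes = new LinkedHashSet<>();
        for (Map<String, String> byGene : cnaBySample.values()) {
            genes.addAll(byGene.keySet());
        }
        List<String> rows = new ArrayList<>();
        rows.add("Hugo_Symbol\t" + String.join("\t", cnaBySample.keySet()));
        for (String gene : genes) {
            StringBuilder row = new StringBuilder(gene);
            for (Map<String, String> byGene : cnaBySample.values()) {
                // Samples with no alteration for this gene default to neutral.
                row.append('\t').append(byGene.getOrDefault(gene, NEUTRAL));
            }
            rows.add(row.toString());
        }
        return rows;
    }
}
```

Passing an insertion-ordered map such as `LinkedHashMap` keeps the sample columns in a predictable order.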

3. General data: Clinical, Mutation, and Fusion data are read, processed, and written during the general data step execution. A reader, composite processor, and composite writer convert the different datatypes to their corresponding staging file format and then the data are written to their respective staging files.
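The composite idea in the general data step can be illustrated with a small plain-Java sketch: one case goes in, one row per datatype comes out, and each row is routed to its own output. Everything here is hypothetical (the `FoundationCase` stand-in, its fields, and the row layouts), and the sketch deliberately avoids Spring Batch classes. A fusion converter would be registered the same way as the two shown in the usage note.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a composite processor and composite writer.
class GeneralDataSketch {

    /** Hypothetical, heavily simplified stand-in for one Foundation case. */
    public static class FoundationCase {
        public final String sampleId;
        public final String mutatedGene;

        public FoundationCase(String sampleId, String mutatedGene) {
            this.sampleId = sampleId;
            this.mutatedGene = mutatedGene;
        }
    }

    /** "Composite processor": applies every datatype converter to one case. */
    public static Map<String, String> process(
            FoundationCase fc, Map<String, Function<FoundationCase, String>> converters) {
        Map<String, String> rowsByType = new LinkedHashMap<>();
        for (Map.Entry<String, Function<FoundationCase, String>> e : converters.entrySet()) {
            rowsByType.put(e.getKey(), e.getValue().apply(fc));
        }
        return rowsByType;
    }

    /** "Composite writer": appends each row to the buffer for its datatype. */
    public static void write(Map<String, String> rowsByType, Map<String, List<String>> files) {
        for (Map.Entry<String, String> e : rowsByType.entrySet()) {
            // Each list stands in for one staging file.
            files.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
    }
}
```

For example, registering a `clinical` converter that returns `fc.sampleId` and a `mutation` converter that returns `fc.mutatedGene + "\t" + fc.sampleId` yields two separate row lists after every case has been processed and written.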

4. Meta data: The meta files for CNA, Mutation, and Fusion data are generated by replacing certain elements in each meta data resource with the given cancer study identifier. The current version of the importer does not require a meta_clinical.txt file for the data_clinical.txt staging file. Eventually the data_clinical.txt staging file might need to be split into data_clinical_patients.txt and data_clinical_samples.txt, both of which do require specific meta files.

Configuration:

application.properties

  • spring.batch.job.enabled=false: Always set to false, which prevents automatic job execution on startup.
  • chunk.interval: An integer commit interval for batch processing.

Meta data configuration

The meta data resources are stored in MetaDataConfiguration. The resources are linked hash maps containing meta file properties and values.

The given cancer study identifier is used to customize the following meta file properties for each study:

  • cancer_study_identifier (Mutation, CNA, Fusion)
  • stable_id (Mutation, CNA, Fusion)
  • profile_description (Mutation, Fusion)
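A minimal sketch of this customization, assuming the resources contain a placeholder token that is swapped for the study identifier. The token `<CANCER_STUDY>`, the template values, and the names `MetaDataSketch`, `customize`, and `render` are all hypothetical; the real contents of MetaDataConfiguration may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not the contents of MetaDataConfiguration.
class MetaDataSketch {

    // Hypothetical placeholder token marking where the study ID goes.
    static final String PLACEHOLDER = "<CANCER_STUDY>";

    /** Illustrative mutation meta template; the values are examples only. */
    public static Map<String, String> mutationTemplate() {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("cancer_study_identifier", PLACEHOLDER);
        meta.put("stable_id", PLACEHOLDER + "_mutations");
        meta.put("profile_description", "Mutation data for " + PLACEHOLDER + ".");
        return meta;
    }

    /** Returns a copy of the template with the placeholder replaced by the study ID. */
    public static Map<String, String> customize(Map<String, String> template, String cancerStudyId) {
        Map<String, String> meta = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : template.entrySet()) {
            meta.put(e.getKey(), e.getValue().replace(PLACEHOLDER, cancerStudyId));
        }
        return meta;
    }

    /** Renders the properties as "key: value" lines, one per property. */
    public static String render(Map<String, String> meta) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : meta.entrySet()) {
            sb.append(e.getKey()).append(": ").append(e.getValue()).append('\n');
        }
        return sb.toString();
    }
}
```

A linked hash map preserves insertion order, so the properties are written to the meta file in the same order in which they were defined.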

Usage:

$JAVA_HOME/bin/java -jar foundation/target/foundation-0.1.0.jar --cancer_study_id test_study_id --source ~/xml/source/directory/ --output ~/output/directory/

Logging, Exceptions, and Warnings:

The logging system integrated with Spring Batch logs job and step execution by default. Additional logging was added to cover the following:

Reading from the source directory:

  • [INFO] Which file(s) are being skipped (files are skipped if not XML)
  • [INFO] Which files are being read/processed
  • [INFO] The total number of cases extracted from the XML file(s)
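The file selection behind these messages might look like the sketch below. It assumes that a file counts as XML when its name ends in `.xml`, which the source does not confirm, and the message wording is illustrative rather than the pipeline's actual log output. `SourceFilterSketch` and `selectXml` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of source file selection with INFO-style messages.
class SourceFilterSketch {

    /** Assumption: a file is treated as XML if its name ends in ".xml" (case-insensitive). */
    public static boolean isXml(String fileName) {
        return fileName.toLowerCase(Locale.ROOT).endsWith(".xml");
    }

    /** Returns the XML file names and reports each file as read or skipped. */
    public static List<String> selectXml(List<String> fileNames) {
        List<String> selected = new ArrayList<>();
        for (String name : fileNames) {
            if (isXml(name)) {
                System.out.println("[INFO] Reading file: " + name);
                selected.add(name);
            } else {
                System.out.println("[INFO] Skipping non-XML file: " + name);
            }
        }
        return selected;
    }
}
```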

CNA data:

  • [INFO] The number of cases without CNA data

General data:

  • [ERROR] When a NullPointerException is thrown while converting XML data into staging file data for a sample
  • [WARN] When a reference allele cannot be resolved from a cds effect for a sample
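To show the kind of check that could trigger the warning above, the sketch below tries to resolve a reference allele from a cds effect string. It assumes HGVS-like values such as `1799T>A` (substitution) or `2235_2249delGGAAT` (deletion) and handles only those two forms; the patterns and the names `ReferenceAlleleSketch` and `resolve` are hypothetical and are not taken from the pipeline.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the assumed cds effect formats are not confirmed by the source.
class ReferenceAlleleSketch {

    // Substitution, e.g. "1799T>A": the reference allele precedes ">".
    private static final Pattern SUBSTITUTION = Pattern.compile("^\\d+([ACGT]+)>[ACGT]+$");
    // Deletion, e.g. "2235_2249delGGAAT": the reference allele follows "del".
    private static final Pattern DELETION = Pattern.compile("^\\d+(?:_\\d+)?del([ACGT]+)$");

    /** Returns the reference allele, or empty if the cds effect has an unsupported form. */
    public static Optional<String> resolve(String cdsEffect) {
        if (cdsEffect == null) {
            return Optional.empty();
        }
        for (Pattern pattern : new Pattern[] {SUBSTITUTION, DELETION}) {
            Matcher m = pattern.matcher(cdsEffect.trim());
            if (m.matches()) {
                return Optional.of(m.group(1));
            }
        }
        return Optional.empty();
    }
}
```

A caller could log a [WARN] message for the sample whenever `resolve` returns an empty result, instead of failing the whole step.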