Beginner's Guide to Data Import

This guide describes how to import instrument data into GMS, using the downsampled TST1 dataset (TST1ds) as an example. You can follow these steps to import TST1ds yourself. Before continuing, you must either have run the ./setup/prime-system.pl command as documented in the installation instructions or be using the Pre-configured Virtual Machine.

You can run the scripted version of these steps, ~/gms/downsampled-demo-data/example-import-data.sh, or follow the rest of this guide, which describes the commands that script executes. Even if you choose not to run example-import-data.sh, it provides a complete example of the commands needed to import data into GMS.

To simplify this process, a new tool is in development that will replace this series of commands with a single command taking a spreadsheet-style file as input.

Download the TST1ds Dataset

example-import-data.sh lines 31-36:

INSTRUMENT_DATA_DIRECTORY='.'

echo Downloading downsampled instrument data to $INSTRUMENT_DATA_DIRECTORY
wget --no-directories --recursive --continue --no-parent --accept='*.bam' \
  --directory-prefix "$INSTRUMENT_DATA_DIRECTORY" \
  https://xfer.genome.wustl.edu/gxfer1/project/gms/testdata/bams/hcc1395_1tenth_percent/
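
Before moving on, it can help to confirm that the download actually produced BAM files. The following is a minimal sketch, not part of example-import-data.sh; count_bams is a hypothetical helper name, and it only assumes that the files land directly in $INSTRUMENT_DATA_DIRECTORY with a .bam extension, as the wget options above suggest.

```shell
#!/bin/sh
# Sketch: count the BAM files in a directory so that a failed or partial
# download is noticed before the import steps run.
# count_bams is a hypothetical helper, not a GMS command.
count_bams() {
    dir=$1
    count=0
    for f in "$dir"/*.bam; do
        # When nothing matches, the glob stays literal; skip that case.
        [ -e "$f" ] || continue
        count=$((count + 1))
    done
    echo "$count"
}

INSTRUMENT_DATA_DIRECTORY=${INSTRUMENT_DATA_DIRECTORY:-.}
echo "Found $(count_bams "$INSTRUMENT_DATA_DIRECTORY") BAM file(s) in $INSTRUMENT_DATA_DIRECTORY"
```

If the count is zero, re-run the wget command; the --continue option lets it resume partially downloaded files.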

Import the TST1ds Dataset

In order to import an instrument data file, there must exist a record in GMS of the library which was sequenced, the sample from which the library was derived, and the individual from which the sample was collected. So, in order to import a set of instrument data files, an individual, one or more samples, and one or more libraries must first exist or be created in GMS.

Set Up the Individual

example-import-data.sh lines 48-54:

INDIVIDUAL='H_NJ-HCC1395ds'
genome individual create                                                        \
    --name="$INDIVIDUAL"                                                        \
    --upn='HCC1395ds'                                                           \
    --common-name="TST1ds"                                                      \
    --gender=female                                                             \
    --taxon="name=human"

The value of the --name argument is arbitrary, but it should be something representative of the individual because you'll reference this name later when adding samples for this individual to GMS.

Set Up the Samples

example-import-data.sh lines 105-112:

SAMPLE_RNA_NORMAL='H_NJ-HCC1395ds-HCC1395_BL_RNA'
genome sample create                                                            \
  --extraction-type='rna'                                                       \
  --source="name=$INDIVIDUAL"                                                   \
  --name="$SAMPLE_RNA_NORMAL"                                                   \
  --common-name='normal'                                                        \
  --extraction-label='HCC1395 BL_RNA'                                           \
  --tissue-desc='b lymphoblast'

The --source parameter references the individual created in the previous step. Notice how the value of the --source parameter here matches the --name parameter given to genome individual create. The argument to the --extraction-type parameter can be either rna or genomic dna.

Repeat the genome sample create command for each sample in your dataset. The TST1ds dataset has four samples. See example-import-data.sh lines 78-103 for the code to create the other three.
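
When the same command is repeated for several samples, a small wrapper can reduce copy-and-paste mistakes. The sketch below is an illustration only and is not taken from example-import-data.sh: run and create_sample are hypothetical helper names, and DRY_RUN is an assumed convention of this sketch. It uses only the genome sample create options shown above and, by default, prints the command instead of executing it.

```shell
#!/bin/sh
# Sketch: repeatable wrapper around "genome sample create".
# run and create_sample are hypothetical helpers, not GMS commands.

# Print the command when DRY_RUN=1 (the default here); execute it otherwise.
# Note: the printed form drops shell quoting, so it is for inspection only.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

create_sample() {
    name=$1
    extraction_type=$2
    common_name=$3
    label=$4
    tissue=$5
    # The guide names two accepted extraction types; reject anything else.
    case $extraction_type in
        rna|'genomic dna') ;;
        *) echo "unsupported extraction type: $extraction_type" >&2
           return 1 ;;
    esac
    run genome sample create \
        --extraction-type="$extraction_type" \
        --source="name=$INDIVIDUAL" \
        --name="$name" \
        --common-name="$common_name" \
        --extraction-label="$label" \
        --tissue-desc="$tissue"
}

INDIVIDUAL='H_NJ-HCC1395ds'
# Values for this call are the ones shown in the guide.
create_sample 'H_NJ-HCC1395ds-HCC1395_BL_RNA' 'rna' 'normal' 'HCC1395 BL_RNA' 'b lymphoblast'
```

Set DRY_RUN=0 to execute the commands for real. The values for the remaining samples should come from example-import-data.sh rather than from this sketch.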

Set Up the Libraries

example-import-data.sh lines 216-223:

LIBRARY_RNA_NORMAL='H_NJ-HCC1395ds-HCC1395_BL_RNA-lib1'
genome library create                                                           \
  --name="$LIBRARY_RNA_NORMAL"                                                  \
  --sample="$SAMPLE_RNA_NORMAL"                                                 \
  --protocol='Illumina Library Construction'                                    \
  --original-insert-size='364'                                                  \
  --library-insert-size='483'                                                   \
  --transcript-strand='unstranded'

The genome library create command creates a record of a library in GMS. The library links to the sample from which it was created, so it is important that the argument to --sample here matches the argument to --name given to the genome sample create command above, to indicate which sample record in GMS this library record should link to.

The --transcript-strand parameter is specified because the extraction type of this library's sample is rna. It accepts one of three values: 'unstranded', 'firststrand', or 'secondstrand'. Do not use this parameter for samples with extraction type genomic dna.
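
The rule above can be expressed as a small helper that emits the --transcript-strand argument only when it applies. This is a minimal sketch under the assumptions stated in this guide (rna libraries take one of three strand values, genomic dna libraries take none); transcript_strand_arg is a hypothetical name, not a GMS command.

```shell
#!/bin/sh
# Sketch: decide whether a library needs --transcript-strand.
# transcript_strand_arg is a hypothetical helper, not a GMS command.
transcript_strand_arg() {
    extraction_type=$1
    strand=${2:-}
    case $extraction_type in
        rna)
            case $strand in
                unstranded|firststrand|secondstrand)
                    echo "--transcript-strand=$strand" ;;
                *)
                    echo "invalid strand for rna: '$strand'" >&2
                    return 1 ;;
            esac ;;
        'genomic dna')
            # No strand argument for genomic dna samples: print nothing.
            : ;;
        *)
            echo "unknown extraction type: '$extraction_type'" >&2
            return 1 ;;
    esac
}

echo "rna library:         $(transcript_strand_arg rna unstranded)"
echo "genomic dna library: $(transcript_strand_arg 'genomic dna')"
```

The printed argument can then be appended to a genome library create call for rna samples and left out for genomic dna samples.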

If your library is a capture library, create it just as you would a library for genomic dna. When you import the instrument data, you will have the opportunity to specify a capture set during import.

Repeat the genome library create command for each library in your dataset. The TST1ds dataset has ten libraries in all. See example-import-data.sh lines 133-212 for the code to create the other nine.

Import the Instrument Data

example-import-data.sh lines 316-321:

INSTRUMENT_DATA_DIRECTORY='.'
LIBRARY_RNA_NORMAL='H_NJ-HCC1395ds-HCC1395_BL_RNA-lib1'
genome instrument-data import basic                                             \
    --description='normal rna 1'                                                \
    --import-source-name='TST1ds'                                               \
    --instrument-data-properties='clusters=170049877'                           \
    --source-files="$INSTRUMENT_DATA_DIRECTORY/gerald_C2DBEACXX_3.bam"          \
    --library="$LIBRARY_RNA_NORMAL"

The genome instrument-data import basic command imports sequence reads. This step depends on a library already existing, as created in the previous step, and it requires that the argument to --library match the argument to --name given to genome library create. In this example, the reads are imported from a BAM file named gerald_C2DBEACXX_3.bam in the current working directory.

Repeat the genome instrument-data import basic command for each instrument data file in your dataset. The TST1ds dataset has twelve instrument data BAM files. See example-import-data.sh lines 246-314 and 331-346 for the code to import the remaining files.
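
Because each import needs a source file that exists and a library that matches, a wrapper that checks the file first can catch a missing download early. The following is a sketch only: import_bam is a hypothetical helper name, DRY_RUN is an assumed convention of this sketch, and the options are limited to those shown in the example above. By default it prints the command rather than running it.

```shell
#!/bin/sh
# Sketch: check that a BAM file exists, then import it (or print the command).
# import_bam is a hypothetical helper, not a GMS command.
import_bam() {
    bam=$1          # path to the BAM file
    library=$2      # must match the --name given to "genome library create"
    description=$3  # free-text description
    properties=$4   # e.g. clusters=170049877
    if [ ! -f "$bam" ]; then
        echo "missing source file: $bam" >&2
        return 1
    fi
    # Rebuild the positional parameters as the full command line.
    set -- genome instrument-data import basic \
        --description="$description" \
        --import-source-name='TST1ds' \
        --instrument-data-properties="$properties" \
        --source-files="$bam" \
        --library="$library"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"   # inspection only; shell quoting is not preserved
    else
        "$@"
    fi
}

INSTRUMENT_DATA_DIRECTORY=${INSTRUMENT_DATA_DIRECTORY:-.}
# Values for this call are the ones shown in the guide.
import_bam "$INSTRUMENT_DATA_DIRECTORY/gerald_C2DBEACXX_3.bam" \
    'H_NJ-HCC1395ds-HCC1395_BL_RNA-lib1' 'normal rna 1' 'clusters=170049877' \
    || echo "Download the dataset before importing."
```

Set DRY_RUN=0 to perform the import. The file-to-library pairing and the clusters value for each remaining file should be taken from example-import-data.sh.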

Defining Models and Running Builds

To make use of the instrument data, genome models must be defined which use the instrument data, and builds must be executed for those models. Once data has been imported, the genome model clin-seq advise command can be used to guide you through the process of defining models and running builds on those models.

First, see example usage by typing:

genome model clin-seq advise --help

Next, let's see what samples and instrument data are available for the TST1ds individual:

genome model clin-seq advise --allow-imported --individual='common_name=TST1ds'

The above command will show details about the TST1ds individual, default processing profiles and model inputs, and available samples. You should see four samples available (tumor DNA, normal DNA, tumor RNA, and normal RNA). Run the clin-seq advise command again and provide all four sample ids as follows:

genome model clin-seq advise --allow-imported --individual='common_name=TST1ds' --samples='id in [??,??,??,??]'

NOTE: Replace each ?? with one of the four sample ids shown by the previous command.
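
To avoid editing the bracketed list by hand, the --samples expression can be assembled from a list of ids. This is a minimal sketch; samples_filter is a hypothetical helper name, and the ids 101 to 104 are placeholders, not real sample ids. The last line only prints the resulting command so that it can be inspected before running.

```shell
#!/bin/sh
# Sketch: build the "id in [...]" expression used by --samples.
# samples_filter is a hypothetical helper, not a GMS command.
samples_filter() {
    ids=''
    for id in "$@"; do
        # Append with a comma separator, except before the first id.
        ids="${ids:+$ids,}$id"
    done
    echo "id in [$ids]"
}

# Placeholder ids; substitute the ids reported by clin-seq advise.
FILTER=$(samples_filter 101 102 103 104)
echo "genome model clin-seq advise --allow-imported --individual='common_name=TST1ds' --samples='$FILTER'"
```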

NOTE: Depending on the resources available to your system or virtual machine, you may not be able to start all builds recommended by clin-seq advise simultaneously. See the Beginner's Guide to the Demonstration Analysis or Quick-VM-Tour for more details. You can repeat the clin-seq advise command as often as you wish; each time, it reports the current status of your models and builds and tells you how to proceed until the analysis is complete.
