Laurent edited this page Apr 14, 2014 · 1 revision

Read support and data/metadata extraction are the main goal. The first type of data to tackle is raw mass spectrometry data, in particular data in the mzML format. After that, it would be good to implement an interface to mzIdentML and mzQuantML, which store identification and quantitation data respectively (more details about these later).

Write support for any other raw data format would also be of interest.

Example data

Source http://proteome.sysbiol.cam.ac.uk/lgatto/RforProteomics/data/

  • TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mgf
  • TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mz5
  • TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML
  • TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML
  • TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01_zlib.mzXML

and

a raw mzML file and the identification results as an mzIdentML file.

Raw data formats

The example data files cover several raw data formats. The most important one is mzML, and that is where efforts should be invested, as it is the most complete format. The others should nevertheless also be supported by MSDataFile::Reader.

The only difference between TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01_zlib.mzXML and TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML is that zlib compression has been applied to the former and not to the latter. zlib compression has also been used for the mzML example file. Compression is not relevant for mgf, which is plain text. mz5 uses the HDF5 format; I am not sure whether the data inside is zlib compressed.
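To make concrete what zlib compression does to a peak array, here is a minimal base R sketch. It assumes 64-bit little-endian doubles and leaves out the base64 step that the XML formats apply on top; the actual precision and byte order depend on the file format and on how the file was written. The values are taken from the header output shown further down this page.

```r
# Three MZ values standing in for a (much longer) peak array.
mz <- c(100.0012, 129.1391, 2069.2793)

# Serialise to raw bytes (assumption: 64-bit, little-endian).
raw_bytes <- writeBin(mz, raw(), size = 8, endian = "little")

# In base R, type = "gzip" performs zlib (deflate) compression.
zipped <- memCompress(raw_bytes, type = "gzip")

# Decompress and decode; the round trip should be lossless.
restored <- readBin(memDecompress(zipped, type = "gzip"),
                    what = "double", n = length(mz),
                    size = 8, endian = "little")

identical(mz, restored)
```

On such a tiny vector the compressed form is not smaller than the input; the benefit only shows on full-size spectra.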

The PSM identification data format is mzIdentML (ref and specs).

Relevant R code

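library("mzR")

## Example files: mzML (zlib compressed), zlib-compressed mzXML and plain mzXML
fml <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML"
fmlz <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01_zlib.mzXML"
fxml <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML"

## Plain mzXML: header and index of the 3000th MS2 spectrum
xx <- openMSfile(fxml)
hdx <- header(xx)
k <- which(hdx$msLevel == 2)[3000]

## mzML: header and index of the 3000th MS2 spectrum
x <- openMSfile(fml)
hd <- header(x)
i <- which(hd$msLevel == 2)[3000]

## zlib-compressed mzXML: header and index of the 3000th MS2 spectrum
xz <- openMSfile(fmlz)
hdz <- header(xz)
j <- which(hdz$msLevel == 2)[3000]

## Compare the header and the peaks read from mzML and zlib-compressed mzXML
all.equal(hd[i, ], hdz[j, ])
all.equal(peaks(x, i), peaks(xz, j))

## Look up spectra by their number of peaks
hdz[which(hdz$peaksCount == 2118), ]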
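The data frames returned by header() above already hold part of the run-level information. As a rough sketch, they could be summarised in base R as follows. summarise_run() is a hypothetical helper, not part of mzR, and the toy data frame only mimics the column names used on this page with made-up values.

```r
# Toy stand-in for the data frame returned by header(); values are made up.
hd_toy <- data.frame(
    msLevel       = c(1, 2, 2, 1, 2),
    retentionTime = c(0.5, 1.2, 1.9, 2.6, 3.3),
    lowMZ         = c(100.1, 100.0, 101.3, 100.2, 100.4),
    highMZ        = c(1999.8, 1500.2, 1800.0, 2069.3, 1200.7)
)

# Hypothetical helper: collect run-level summaries from a header data frame.
summarise_run <- function(hd) {
    list(
        nScans      = nrow(hd),                          # total number of scans
        msLevels    = sort(unique(hd$msLevel)),          # MS levels present
        nPerMsLevel = table(msLevel = hd$msLevel),       # scans per MS level
        mzRange     = c(min(hd$lowMZ), max(hd$highMZ)),  # overall MZ range
        rtRange     = range(hd$retentionTime)            # retention time range
    )
}

run <- summarise_run(hd_toy)
str(run)
```

With a real file, the same function could be applied to hd directly. Instrument details (manufacturer, model and so on) are not in the header data frame and would have to come from elsewhere.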

Exercises

If any of the below is not clear and it seems to be essential to understand the mass spectrometry details to make progress from a programming point of view, please do not hesitate to ask questions.

  • Read the mzML file and return metadata, including general instrument information (manufacturer, model, ionisation, analyser, detector) and general run information (number of scans: 7534, MZ range: 99.99945 - 2069.279, retention time range: 0.4584 - 3601.983, MS levels available in the file: 1 and 2, number of MS1 and MS2 spectra: 1431 MS1 and 6103 MS2).

  • Read the mzML data and save it to mgf. The opposite would not make much sense, as mzML contains much more information than mgf. Writing to mz5 and mzXML could also be of interest. This is not possible in the current mzR.

  • Read the mzML file and write a new mzML that contains only the MS1 or MS2 spectra (ms level). Not possible in current mzR.

  • Extract the information for the 3000th MS2 spectrum. Below is the information extracted with the current version of mzR. Also extract the actual spectrum data, i.e. MZ and intensities (see p3000.csv).

acquisitionNum              4092.0000
msLevel                        2.0000
peaksCount                  2465.0000
totIonCurrent            5368603.0000
retentionTime               2269.6948
basePeakMZ                   129.1391
basePeakIntensity         210466.4531
collisionEnergy               45.0000
ionisationEnergy               0.0000
lowMZ                        100.0012
highMZ                      2069.2793
precursorScanNum            4083.0000
precursorMZ                 1111.0800
precursorCharge                2.0000
precursorIntensity       8771486.0000

  • Extract a chromatogram. A chromatogram is composed of two vectors of the same length, encoding retention time and intensity for a given precursor ion defined by an MZ value. This is not directly possible currently, and chromatograms need to be calculated (which takes some time in R). I believe it is possible to directly extract such chromatograms from mzML.
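To illustrate what calculating a chromatogram in R involves, here is a minimal base R sketch. extract_xic() is a hypothetical helper, not an mzR function. It assumes one retention time and one two-column (MZ, intensity) matrix per spectrum, and sums the intensities that fall within a tolerance window around the target MZ; the data are made up.

```r
# Hypothetical helper, not part of mzR.
# rt:  numeric vector of retention times, one per spectrum
# pks: list of two-column matrices (MZ, intensity), one per spectrum
# mz:  target MZ value
# tol: half-width of the MZ window, in the same unit as mz
extract_xic <- function(rt, pks, mz, tol = 0.01) {
    stopifnot(length(rt) == length(pks))
    intensity <- vapply(pks, function(p) {
        keep <- abs(p[, 1] - mz) <= tol  # peaks inside the MZ window
        sum(p[keep, 2])                  # 0 when no peak matches
    }, numeric(1))
    data.frame(retentionTime = rt, intensity = intensity)
}

# Toy data: three spectra with made-up peaks.
rt <- c(10.0, 10.5, 11.0)
pks <- list(
    cbind(mz = c(400.20, 500.25, 600.30), intensity = c(10, 200, 5)),
    cbind(mz = c(400.21, 500.26), intensity = c(12, 350)),
    cbind(mz = c(450.00, 600.31), intensity = c(7, 9))
)

xic <- extract_xic(rt, pks, mz = 500.25, tol = 0.02)
xic
```

With mzR, the inputs would presumably come from the retentionTime column of header() and from peaks() called on the scans of interest. Looping over every spectrum in this way is what makes the calculation slow on a full run.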

Ontology terms

You will see many MS:0000000 terms in the data files. These are ontology terms (a linked, curated vocabulary). At this point, it might not be critical to take them into account, but it is good to keep in mind that this is how the different pieces of information are officially defined to avoid ambiguities. pwiz provides its own copy of the ontology (see pwiz/data/common/psi-ms.obo and similar) and converters (probably CVTranslator.* in the same directory).

Access to the relevant ontologies is also available through the rols Bioconductor package.
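As a small illustration of what these terms look like in practice, the base R sketch below pulls the MS: accessions out of a line of text. The cvParam line is hand-written for illustration, and extract_accessions() is a hypothetical helper that assumes accessions are always "MS:" followed by seven digits.

```r
# A line as it might appear in an mzML file (hand-written example).
line <- '<cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>'

# Hypothetical helper: return every "MS:" + seven digits found in x.
extract_accessions <- function(x) {
    unlist(regmatches(x, gregexpr("MS:[0-9]{7}", x)))
}

extract_accessions(line)
```

The extracted accessions could then be looked up in psi-ms.obo or through rols to obtain their official definitions.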