Exercises
Read support and data/metadata extraction are the main goals. The first
type of data that should be tackled is raw mass spectrometry data, in
particular data in the mzML format. After that, it would be good to
implement an interface to mzIdentML and mzQuantML, used respectively
to store identification and quantitation data (more details about
these later).
Write support for any other raw data format would also be of interest.
Source: http://proteome.sysbiol.cam.ac.uk/lgatto/RforProteomics/data/
- TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mgf
- TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mz5
- TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML
- TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML
- TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01_zlib.mzXML
and a raw mzML file and the identification results as an mzIdentML file.
Example data files. The most important file format is mzML, and
that's the one efforts should be invested in. This format is the most
complete. The others should also be supported by MSDataFile::Reader,
though.
The difference between
TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01_zlib.mzXML and
TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML is that
zlib compression has not been applied to the latter. zlib compression
has been used for the mzML example file. This is not relevant for mgf
(text only). mz5 uses the HDF5 format; not sure if the data inside is
zlib compressed.
The PSM identification data format is mzIdentML (ref and specs).
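library("mzR")
## Example files: mzML (zlib), mzXML with zlib, mzXML without compression
fml <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML"
fmlz <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01_zlib.mzXML"
fxml <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML"
## Uncompressed mzXML: header and index of the 3000th MS2 spectrum
xx <- openMSfile(fxml)
hdx <- header(xx)
k <- which(hdx$msLevel == 2)[3000]
## mzML: same lookup
x <- openMSfile(fml)
hd <- header(x)
i <- which(hd$msLevel == 2)[3000]
## zlib-compressed mzXML: same lookup
xz <- openMSfile(fmlz)
hdz <- header(xz)
j <- which(hdz$msLevel == 2)[3000]
## Compare the header row and the peak data of that spectrum
## between the mzML and the zlib-compressed mzXML files
all.equal(hd[i, ], hdz[j, ])
all.equal(peaks(x, i), peaks(xz, j))
## Look up spectra in the zlib mzXML header by their peak count
hdz[which(hdz$peaksCount == 2118), ]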
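The comparison above can be wrapped in a small helper. This is only a sketch: it assumes both inputs are two-column peak matrices (m/z, intensity) of the kind returned by peaks(), and the name same_spectrum and its tolerance default are illustrative choices, not part of mzR.

```r
## Hypothetical helper (not part of mzR): TRUE if two peak matrices hold
## the same m/z and intensity values, ignoring column names.
same_spectrum <- function(p1, p2, tolerance = 1e-6) {
  ## Different numbers of peaks can never match
  if (!identical(dim(p1), dim(p2))) return(FALSE)
  isTRUE(all.equal(unname(p1), unname(p2),
                   tolerance = tolerance, check.attributes = FALSE))
}

## Possible use with the objects defined above, including the
## uncompressed mzXML file that the code above opens but does not compare:
## same_spectrum(peaks(x, i), peaks(xz, j))
## same_spectrum(peaks(x, i), peaks(xx, k))
```

Comparing with a tolerance rather than identical() allows for small floating point differences between encodings.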
If any of the below is not clear and it seems to be essential to understand the mass spectrometry details to make progress from a programming point of view, please do not hesitate to ask questions.
- Read the mzML file and return metadata, including general instrument information (manufacturer, model, ionisation, analyser, detector) and general run information (number of scans: 7534, MZ range: 99.99945 - 2069.279, retention time range: 0.4584 - 3601.983, MS levels available in the file: 1 and 2, number of MS1 and MS2 spectra: 1431 MS1 and 6103 MS2).
- Read the mzML data and save it to mgf. The opposite would not make much sense, as mzML contains much more information than mgf. Writing to mz5 and mzXML could also be of interest. Not possible in current mzR.
- Read the mzML file and write a new mzML that contains only the MS1 or MS2 spectra (ms level). Not possible in current mzR.
- Extract information of the 3000th MS2 spectrum. Below is the information extracted with the current version of mzR. Also extract the actual spectrum data, i.e. MZ and intensities (see p3000.csv).
acquisitionNum 4092.0000
msLevel 2.0000
peaksCount 2465.0000
totIonCurrent 5368603.0000
retentionTime 2269.6948
basePeakMZ 129.1391
basePeakIntensity 210466.4531
collisionEnergy 45.0000
ionisationEnergy 0.0000
lowMZ 100.0012
highMZ 2069.2793
precursorScanNum 4083.0000
precursorMZ 1111.0800
precursorCharge 2.0000
precursorIntensity 8771486.0000
- Extract a chromatogram. A chromatogram is composed of two vectors of
the same length, encoding retention time and intensity for a given
precursor ion defined by an MZ value. This is not directly possible
currently and chromatograms need to be calculated (which takes some
time in R). I believe it is possible to directly extract such
chromatograms from mzML.
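As a starting point for the exercises above, here is a rough sketch of what can already be done in R on top of the current mzR reader. The helper names (summarise_run, nth_spectrum, extract_xic), the 0.01 m/z tolerance and the use of the precursor MZ from the table above as the chromatogram target are assumptions made for illustration. The chromatogram is calculated by looping over MS1 spectra, which is the slow approach mentioned above, not a direct extraction from mzML.

```r
## Summarise a header data frame as returned by mzR's header().
## Only the columns msLevel and retentionTime are used.
summarise_run <- function(hd) {
  list(n_scans    = nrow(hd),
       ms_levels  = sort(unique(hd$msLevel)),
       n_by_level = table(hd$msLevel),
       rt_range   = range(hd$retentionTime))
}

## Row index (in the header) of the n-th spectrum of a given MS level.
nth_spectrum <- function(hd, level, n) {
  which(hd$msLevel == level)[n]
}

## Calculated ion chromatogram: for each spectrum, sum the intensities
## of the peaks whose m/z lies within mz +/- tol.
## peak_list: list of two-column matrices (m/z, intensity)
## rt:        retention times, one per element of peak_list
extract_xic <- function(peak_list, rt, mz, tol = 0.01) {
  stopifnot(length(peak_list) == length(rt))
  intensity <- vapply(peak_list, function(p) {
    keep <- abs(p[, 1] - mz) <= tol
    sum(p[keep, 2])
  }, numeric(1))
  data.frame(rt = rt, intensity = intensity)
}

## Use with mzR, guarded so that the block also runs where the package
## or the example file is not available.
f <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML"
if (requireNamespace("mzR", quietly = TRUE) && file.exists(f)) {
  library("mzR")
  ms <- openMSfile(f)
  hd <- header(ms)
  str(instrumentInfo(ms))  # manufacturer, model, ionisation, analyzer, detector
  str(runInfo(ms))         # scan count, MZ range, time range, MS levels
  str(summarise_run(hd))
  i <- nth_spectrum(hd, level = 2, n = 3000)
  hd[i, ]                  # header information of the 3000th MS2 spectrum
  p3000 <- peaks(ms, i)    # matrix of MZ and intensity values
  ms1 <- which(hd$msLevel == 1)
  xic <- extract_xic(peaks(ms, ms1), hd$retentionTime[ms1], mz = 1111.08)
  close(ms)
}
```

The three helpers take plain data frames and matrices, so they can be tried on small hand-made inputs without reading a file.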
You will see many MS:0000000 terms in the data files. These
represent ontology terms (linked curated vocabulary). At this point,
it might not be critical to take them into account, but it might be
good to keep in mind that this is how the different pieces of
information are officially defined to avoid ambiguities. pwiz provides
its own copy of the ontology (see pwiz/data/common/psi-ms.obo and
similar) and converters (probably CVTranslator.* in the same
directory). Access to the relevant ontologies is also available
through the rols Bioconductor package.
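To get a feel for these terms, the sketch below pulls accession/name pairs out of the cvParam elements of an mzML file using base R only. It assumes each cvParam tag sits on a single line with double-quoted attributes, and the function name extract_cv_terms is made up for illustration. Regular expressions are a crude way to look at XML, so treat this as an exploration aid rather than a parser.

```r
## Hypothetical helper: list the distinct ontology terms found in the
## <cvParam .../> elements of some mzML text lines.
extract_cv_terms <- function(lines) {
  tags <- as.character(unlist(regmatches(lines,
                                         gregexpr("<cvParam[^>]*>", lines))))
  ## Value of one attribute for each tag, NA if the attribute is absent
  get_attr <- function(tag, attr) {
    pattern <- paste0("[[:space:]]", attr, "=\"([^\"]*)\"")
    m <- regmatches(tag, regexec(pattern, tag))
    vapply(m, function(x) if (length(x) == 2) x[2] else NA_character_,
           character(1))
  }
  unique(data.frame(accession = get_attr(tags, "accession"),
                    name = get_attr(tags, "name"),
                    stringsAsFactors = FALSE))
}

## Two hand-written lines in the style of an mzML spectrum element
example <- c(
  '<cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>',
  '<cvParam cvRef="MS" accession="MS:1000580" name="MSn spectrum" value=""/>')
print(extract_cv_terms(example))

## On a real file, the first few hundred lines are usually enough:
## extract_cv_terms(readLines("some_file.mzML", n = 500))
```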