Evolving ImzML and mzML to an authentic open data format

Open Formats in Mass Spectrometry
Alfons Hester, Bernhard Spengler
Institute of Inorganic and Analytical Chemistry, Justus Liebig University, Giessen
Alfons.Hester@anorg.chemie.uni-giessen.de
Overview
GLP Statement of OECD [1]
Where computerised systems are used to capture, process, report or store raw data
electronically, system design should always provide for the retention of full audit
trails to show all changes to the data without obscuring the original data. It
should be possible to associate all changes to data with the persons making those
changes by use of timed and dated (electronic) signatures. Reasons for change
should be given.
OECD Series on Principles of Good Laboratory Practice and Compliance Monitoring,
Number 10, Page 10, Section 5. Data
mzML and imzML are common data formats for storing MS
and MS imaging data.
mzML and imzML documents can be digitally signed.
Such documents comply with requirements such as those of GLP
(OECD) (cf. “GLP Statement of OECD”).
Authentic open data formats make data reliable and
trustworthy.
Depending on the quality of the certificates used, the highest
security level as defined in GLP (OECD) is possible.
Introduction
Rapid progress in IT leads to rapid obsolescence of data formats, IT equipment and media
(Fig. 1). While technical improvement determines media and equipment, data formats are
independent of hardware. Data in mass spectrometry is mostly stored in proprietary data
formats (company, institute or private formats), which are often poorly documented or treated as
company secrets. Rapid changes in the IT business cause data formats to become obsolete and
unsupported. This is a severe risk to the accessibility of existing data.
Supporting many different formats for the same subject leads to considerable economic costs
(Fig. 2). Networking creates a demand for suitable data formats to exchange data automatically.
Results in science should be open, i.e. available without obstacles to everyone who is
interested in them.
Examples of some removable media used within the last five decades:
punch card, capacity 80 Byte, 1932-1975
compact disc, 10-870 MByte, since 1981
floppy disk, 80 kByte-2880 kByte, 1971-1999
Blu-ray Disc, 7.8-50 GByte, since 2006
Additional requirements for an open data format are reliability and trustworthiness.
Originators, authorities and legislators require quality assurance of studies, research assignments
and surveys. Whenever data is created, stored, processed, dispatched, received, translated or
converted, there is a risk that raw data is altered or forged. Such accidental or wilful changes
have to be made detectable.
Accidental changes can be detected using checksums (hash values) or error-correcting codes.
Since open formats should be writable by everyone, however, the stored data can be changed
and written back together with a matching checksum, and a receiver is not able to detect this “betrayal”.
Embedding asymmetric cryptography into the data format can protect the data against such manipulation.
The primary goal of a data format is to represent data! Therefore the authenticity part should be
optional.
Finally, an authentic open data format ought to be self-consistent, which is the reason to
implement authenticity within the definition of the format.
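The difference between a plain checksum and a signed checksum can be illustrated with a minimal sketch. It is not the software described on this poster: the key pair below is a toy built from two known Mersenne primes, the names `checksum`, `sign` and `verify` are hypothetical, and no padding scheme is applied, so the construction is insecure and serves only to show the principle.

```python
import hashlib

# Toy key pair from two known Mersenne primes (assumption for illustration
# only): such a key is NOT secure, and real signatures also use a padding
# scheme such as the ones specified in PKCS #1.
P = 2**521 - 1
Q = 2**607 - 1
N = P * Q                           # public modulus
E = 65537                           # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent (needs Python 3.8+)


def checksum(data: bytes) -> int:
    """SHA-256 hash of the data, interpreted as an integer."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def sign(data: bytes) -> int:
    """Transform the checksum with the private key (originator only)."""
    return pow(checksum(data), D, N)


def verify(data: bytes, signature: int) -> bool:
    """Recover the checksum with the public key and compare it."""
    return pow(signature, E, N) == checksum(data)


original = b"m/z 445.12 intensity 1000"
stored_signature = sign(original)

# A forger can change the data and recompute a plain checksum ...
forged = b"m/z 445.12 intensity 9999"
forged_checksum = checksum(forged)

# ... but cannot produce a matching signature without the private key.
print(verify(original, stored_signature))  # True
print(verify(forged, stored_signature))    # False
```

Anyone holding the public key (`N`, `E`) can run `verify`, while only the holder of `D` can create a valid signature. This is why the signed checksum, unlike the plain one, cannot simply be recomputed after a modification.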
Figure 1 : The rapid progress in IT during the last five decades demonstrated for
removable media.
Quadratic growth of the number of required converters N_c depending on the
number of different formats n :
\[ N_c = \frac{n(n-1)}{2} \]
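As a quick check of this growth, the formula can be evaluated for a few values of n. The sketch assumes one bidirectional converter per pair of formats; the function name `converters_needed` is hypothetical.

```python
def converters_needed(n: int) -> int:
    """Number of pairwise converters for n formats: N_c = n(n-1)/2."""
    return n * (n - 1) // 2


for n in (5, 7, 11):
    print(f"n={n}, Nc={converters_needed(n)}")
# n=5, Nc=10
# n=7, Nc=21
# n=11, Nc=55
```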
Format              Checksum  Authenticity  Purpose
mzML 1.1            x         -             MS data (HUPO-PSI)
ConsensusXML 1.3    -         -             OpenMS
FeatureXML 1.3      -         -             OpenMS
IdXML 1.2           -         -             OpenMS
ParamXML 1.3        -         -             OpenMS
TrafoXML 1.0        -         -             OpenMS
GelML               -         -             Gel data
mzData              -         -             former MS data
analysisXML         -         -             MS analysis (HUPO-PSI-PI)
Table 1 : Security features implemented in some open formats currently used in the MS community.
If a format defines a checksum in order to detect accidental changes, it is marked with “x” in the
2nd column, otherwise with “-”. If a format has authenticity features, it is marked with “x” in the
3rd column, otherwise with “-”.
n=5, Nc=10
n=7, Nc=21
n=11, Nc=55
Figure 2 : Each vertex represents a single data format, and each edge represents a
converter able to convert these two formats into each other.
Methods
The software for reading and writing the new authentic open data format was developed with
Object Pascal (Delphi 2009) and runs on MS Windows.
As a basis for implementing authenticity, the open data format imzML was used. imzML was developed within
the COMPUTIS project of the European Union in order to store MS imaging data. imzML itself is an
extension of mzML, which is an XML-based open data format for mass spectrometric data and which will
probably become the leading MS data format.
In contrast to mzML, imzML consists of two files: an XML file derived from mzML and a binary file.
Both files are linked by a UUID (universally unique identifier). The binary file contains the
measured spectra, whereas the XML file contains the remaining data necessary to describe an experiment.
The parameters for reading the binary data are stored in the XML file too (cf. poster PMM: 83,
“Imaging mzML (imzML) – a common format for the comparison and exchange of imaging mass
spectrometry data”).
The component responsible for reading and writing the binary file automatically calculates a checksum for
each spectrum. This checksum is stored in the XML part.
When all binary data has been written, an overall checksum of the binary file is calculated and stored in the XML file.
Several checksum algorithms are implemented (cf. Table 2) and tested (cf. Figure 3).
To calculate an encrypted version of each checksum stored in the XML file, the asymmetric
cryptographic system RSA is used.
For this purpose an originator (human or machine) must have two corresponding keys: a private one and a public
one. Such key pairs can be created with various software tools or obtained from a certificate authority (CA).
hash        digest size  secure on   developed  originator                                        remark
algorithm   [bit]        25.08.2009  in
Haval       256          X           1992       Josef Pieprzyk, Jennifer Seberry, Yuliang Zheng   http://labs.calyptix.com/haval.php
MD4         128          -           1990       Ronald Rivest                                     RFC 1320, successful collision attack
MD5         128          -           1991       Ronald Rivest                                     RFC 1321, successful collision attack in December 2008
RipeMD128   128          -           1996       Antoon Bosselaers, Hans Dobbertin, Bart Preneel   successful collision attack in August 2004
RipeMD160   160          X           1996       Antoon Bosselaers, Hans Dobbertin, Bart Preneel
SHA1        160                      1995       NSA                                               security weakness discovered in 2005
SHA256      256          X           2001       NSA
SHA384      384          X           2001       NSA
SHA512      512          X           2001       NSA
Tiger       192          X           1995       Ross Anderson, Eli Biham                          http://www.cs.technion.ac.il/~biham/Reports/T
Table 2 :
All implemented hash functions, their year of origin and their current security status.
Some algorithms have remained unbroken for nearly two decades.
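The checksum handling described under Methods (one checksum per spectrum plus one checksum over the complete binary file) can be sketched as follows. This is only an illustration under stated assumptions, not the Delphi implementation: the function name `write_spectra`, the binary layout (little-endian 32-bit floats written back to back) and the choice of SHA-256 are hypothetical, and storing the digests in the XML part is not shown.

```python
import hashlib
import struct


def write_spectra(spectra, algorithm="sha256"):
    """Pack spectra into one binary blob and collect checksums.

    Returns the binary content, the list of per-spectrum hex digests and
    the hex digest of the complete binary content.
    """
    blob = bytearray()
    file_hash = hashlib.new(algorithm)
    spectrum_digests = []
    for intensities in spectra:
        # Hypothetical layout: little-endian 32-bit floats, back to back.
        chunk = struct.pack(f"<{len(intensities)}f", *intensities)
        spectrum_digests.append(hashlib.new(algorithm, chunk).hexdigest())
        file_hash.update(chunk)  # the running hash covers the whole file
        blob += chunk
    return bytes(blob), spectrum_digests, file_hash.hexdigest()


spectra = [[1.0, 2.5, 0.0], [4.0, 4.0, 8.5]]
binary, per_spectrum, whole_file = write_spectra(spectra)

# Flipping a single bit in the first spectrum changes its checksum and the
# overall checksum, while the second spectrum's checksum still matches.
tampered = bytes([binary[0] ^ 0x01]) + binary[1:]
print(hashlib.sha256(tampered[:12]).hexdigest() == per_spectrum[0])    # False
print(hashlib.sha256(tampered[12:24]).hexdigest() == per_spectrum[1])  # True
print(hashlib.sha256(tampered).hexdigest() == whole_file)              # False
```

Per-spectrum checksums localise a modification to a single spectrum, whereas the overall checksum also covers the order and completeness of the binary file.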
Figure 3 : Screenshots of developed software
Results
Performance of implemented hash functions:
Figure 4 : Throughput in MByte per second (scale 0 to 120) for Haval 256, MD4, MD5, RipeMD-128,
RipeMD-160, SHA1, SHA256, SHA384, SHA512 and Tiger.
The computer which was used to determine these values was an AMD Athlon 2500+ with
2 GB of memory.
Implemented requirements of the authentic open data format :
open, platform independent, manufacturer independent, device independent, application-oriented,
modular, scalable, economic, powerful, future-proof, trustworthy, recognize modification, eligible for QA
Figure 5 : Features of mzML/imzML plus new and optional security features (marked “new!” and
“optional” in the figure).
A software module was developed which is able to implement authenticity both on an already existing mzML/imzML file and during initial file creation.
The performance is adequate, and the underlying mzML/imzML format is now usable e.g. for QA, monitored studies and surveys.
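Hash throughput in MByte per second can be estimated roughly with a few lines of code. The sketch below is not the benchmark used for the poster: it relies on Python's `hashlib` rather than the Delphi implementations, covers only the SHA algorithms available there by default, and the function name `throughput_mbyte_per_s` and the 8 MByte test size are hypothetical choices. Absolute numbers depend entirely on the machine.

```python
import hashlib
import time


def throughput_mbyte_per_s(algorithm: str, size_mbyte: int = 8) -> float:
    """Hash size_mbyte MByte of zero bytes once and return MByte per second."""
    data = bytes(size_mbyte * 1024 * 1024)
    start = time.perf_counter()
    hashlib.new(algorithm, data).digest()
    elapsed = time.perf_counter() - start
    return size_mbyte / elapsed


for name in ("sha1", "sha256", "sha384", "sha512"):
    print(f"{name}: {throughput_mbyte_per_s(name):.0f} MByte/s")
```

A single run per algorithm is noisy; repeating the measurement and taking the best or median value would give more stable figures.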
Conclusion
An authentic open data format based on mzML and imzML was defined.
This format allows the user to sign mzML / imzML documents.
Data is readable with or without the authenticity extension.
Authenticity is part of this open format.
It is planned to integrate these developments into the mzML/imzML standard.
Acknowledgement
The authors gratefully acknowledge financial support by the European Union
(STREP project LSHG-CT-2005-518194)
References
[1] OECD Series on Principles of Good Laboratory Practice and Compliance Monitoring,
http://www.oecd.org/document/63/0,3343,en_2649_34381_2346175_1_1_1_1,00.html
[2] PKCS #1 v2.1: RSA Cryptography Standard, RSA Laboratories,
ftp://ftp.rsasecurity.com/pub/pkcs/pkcs-1/pkcs-1v2-1.pdf
[3] PSI-MS: Mass Spectrometry Standards Working Group,
http://www.psidev.info/index.php?q=node/80
[4] OpenPGP Message Format, RFC 4880,
http://tools.ietf.org/html/rfc4880