Slides - eSI Wiki

Provenance Tracking in Climate Science Data Processing Systems
Curt Tilmes
NASA Goddard Space Flight Center
Curt.Tilmes@nasa.gov
Workshop on Principles of Provenance (PrOPr)
November 19-20, 2007
1 of 13
MODIS Adaptive Processing System (MODAPS) Level 1 and Atmosphere Archive and Distribution System (LAADS)
• Located: Goddard Space Flight Center
• MODIS Level 1 and atmosphere products
• Archive size (approx): 600 TB
• Ingest rate (approx): 100 GB/Day
• Distributes (approx): 5 TB/Day
• Provides access to MODIS Level 1 and Atmosphere products for 17,303 unique users since September 2006
• Subsetting, sub-sampling, mosaicing, masking, reprojection and format conversion options enable users to transform MODIS standard products
• http://ladsweb.nascom.nasa.gov/
2 of 13
* Courtesy Ed Masuoka, MODAPS Lead
MODIS Adaptive Processing System (MODAPS) Level 1 and Atmosphere Archive and Distribution System (LAADS)
• All MODIS Atmosphere products are available on disk
  - Enables rapid staging of orders, <5 seconds for 2,000 files
  - All data on over 20 servers organized in one directory tree by Reprocessing Collection, Product Name and Date
• Daily distribution in October 2007
  - 140,000 files to the public
  - 140,000 files to science team
  - 230,000 files to DAACs
• Remote Sensing Information Gateway server for subsets, aggregation, visualization and format conversion of MODIS and air quality data used by EPA
• Level 0 5 minute files for Ocean processing within 7 hours of acquisition [OCDPS]
• Level 1 for CERES production [LaRC]
• Atmosphere Level 3 for MOVAS [GES DISC]
• Custom products for carbon modelers delivered through LAADS with product generation in MODAPS
3 of 13
* Courtesy Ed Masuoka, MODAPS Lead
MODIS Adaptive Processing System (MODAPS) Level 1 and Atmosphere Archive and Distribution System (LAADS)
• 100 TB of MODIS Level 1, Atmosphere and Land products will be shipped to JAXA starting April 2008 from online archive, data pool and processing
• Online archive enables innovative services
  - SOAR [UMBC]: web Services for On-demand gridded multi-satellite Atmospheric Radiances (MODIS [LAADS], AIRS [GES DISC])
• AVHRR 5 km data products from 1981-2000 processed in MODAPS, archived in LAADS
• VIIRS data sets produced from MODIS for testing algorithms
• Comparisons with MODIS used to assess quality of VIIRS SDRs and EDRs for NASA science
4 of 13
* Courtesy Ed Masuoka, MODAPS Lead
MODIS Data Flow
[Figure: MODIS production flow]
5 of 13
2007-11-19
Provenance in an SDPS
 Just as a laboratory experimenter must control and capture everything about the experiment environment, so should a science data processing system…
• All ingested data, with the source
• Algorithm Theoretical Basis Documents (ATBD)
• Software Source Code, version
• Software Build Environment, version
 Static libraries, versions, Compiler versions
• Execution Environment
 Specific hardware, OS version, Dynamic libraries versions
• Execution Instance
 Runtime parameters, Input files and versions
 Very rigorous Configuration Management practices required
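The items on this slide can be read as the fields of one provenance record per execution. The following is a minimal sketch of such a record, not the MODAPS schema: the class names, field names and sample values are all hypothetical, chosen only to mirror the bullets above.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class InputFile:
    name: str      # input file name (hypothetical)
    version: str   # version of that input file
    source: str    # where the data was ingested from


@dataclass(frozen=True)
class ProvenanceRecord:
    atbd: str                    # Algorithm Theoretical Basis Document reference
    source_version: str          # software source code version
    build_environment: dict      # static library and compiler versions
    execution_environment: dict  # hardware, OS version, dynamic library versions
    runtime_parameters: dict     # parameters of this execution instance
    inputs: tuple = ()           # InputFile entries read by this run

    def to_json(self) -> str:
        """Serialize with sorted keys so identical records give identical text."""
        return json.dumps(asdict(self), sort_keys=True)


# All values below are placeholders for illustration only.
record = ProvenanceRecord(
    atbd="ATBD-EXAMPLE-01",
    source_version="1.2.3",
    build_environment={"compiler": "example-cc 1.0", "static_libs": {"libexample": "2.0"}},
    execution_environment={"os": "example-os 1.0", "hardware": "example-host"},
    runtime_parameters={"threshold": 0.5},
    inputs=(InputFile("granule_A.hdf", "005", "example-source"),),
)
print(record.to_json())
```

Serializing with sorted keys is one way to make two records comparable as text; a production system would more likely store them in a database under configuration management.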
6 of 13
Data Processing and Archiving
 Earth Science Data Archive volumes growing steadily
 Over time, the systems evolve:
• Spacecraft, sensors, data processing frameworks
• Science algorithms for transforming and analyzing data
 Tracking data provenance through processing systems and archives is a very complicated problem
• Across organizations / agencies this just gets worse
 Science data is being used in new ways not planned by originators
 Value Added Services release their own processed data from independent archives
7 of 13
Data Processing and Archiving
 Previous versions of data are often discarded in favor of newer ones
• Provenance information stored as metadata along with data is usually removed along with the data itself
 Provenance information is incomplete, and represented in non-standard forms that are difficult to follow
• Imagine a phone call to a researcher: “Where did you get this data, and what did you do to it?”
 Even if provenance is captured, some systems can’t (or won’t) reproduce older datasets
• Rely on an error-prone, manual process to attempt to reproduce data previously released
8 of 13
Provenance Roadblocks
 Proprietary information
• Hardware and software designs provide a competitive advantage; why share them?
 US International Traffic in Arms Regulations (ITAR)
• Broadly applied, default is to restrict
 Cost
• Capturing/distributing provenance isn't a priority
• A project that proposes comprehensive provenance is at a competitive disadvantage to one that doesn't.
 Competition
• Why should I share my system for reproducing my data, which would give my competitor a leg up?
9 of 13
Provenance Objectives
 Capturing complete and accurate provenance during data ingest and primary data processing
 Archiving provenance such that it can be easily retrieved and searched, even if the data are deleted
 Representing provenance to human users and providing tools for navigating the graph to search and explore data provenance
 Representing provenance semantically to other systems at cooperating institutions with standard ontologies
• Semantic Web for Earth and Environmental Terminology (SWEET)
 Allowing agents to traverse inter-system provenance graphs and answer provenance questions
 Allowing independent systems to mechanically reproduce data processing using the provenance information
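Traversing a provenance graph to answer a question such as "what was this product derived from?" can be sketched as a breadth-first walk. This is only an illustration under stated assumptions: the graph contents, the product names and the `lineage` function are hypothetical, and a real inter-system graph would be spread across institutions rather than held in one dictionary.

```python
from collections import deque

# Hypothetical provenance graph: each product maps to the inputs it was derived from.
DERIVED_FROM = {
    "L3_daily": ["L2_granule_1", "L2_granule_2"],
    "L2_granule_1": ["L1B_granule_1"],
    "L2_granule_2": ["L1B_granule_2"],
    "L1B_granule_1": ["L0_packet_1"],
    "L1B_granule_2": ["L0_packet_2"],
}


def lineage(product, graph):
    """Return every upstream product reachable from `product`, breadth-first."""
    seen = []
    queue = deque(graph.get(product, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited through another path
        seen.append(node)
        queue.extend(graph.get(node, []))
    return seen


ancestors = lineage("L3_daily", DERIVED_FROM)
print(ancestors)
```

A product with no recorded inputs, such as a Level 0 packet here, yields an empty lineage, which marks the point where the trail leaves this system.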
10 of 13
Reprocessing
 Forward processing is easy.
• Have a whole day to process each data day (1X)
 Science keeps marching forward
• MODIS had an average of one new science algorithm version update delivered per day for its first year!
 Do you start processing with the new software immediately each time you find a bug?
• Sometimes it is better to keep a dataset consistent with known problems than inconsistent.
 Periodically need to correct old data to make a new “baseline”
 At 1X reprocessing, 7 years of MODIS data would take 7 years – way too long. Even at 10X, it takes over 8 months.
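The reprocessing figures follow from simple arithmetic: at a rate of NX, one calendar day processes N data days. A small sketch (the function name is hypothetical) that reproduces the numbers quoted on this slide:

```python
def reprocessing_months(data_years, rate):
    """Calendar months needed to reprocess `data_years` of data at `rate`X.

    At 1X, one data day takes one calendar day to process.
    """
    return data_years * 12 / rate


print(reprocessing_months(7, 1))   # 84.0 months, i.e. 7 years
print(reprocessing_months(7, 10))  # 8.4 months, i.e. "over 8 months"
```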
11 of 13
Process On Demand
 For valid science and complete “scientific reproducibility”, you must capture sufficient information to trace back the provenance of each product.
 Given such provenance and the ability to use it, do you still need the files in the archive at all?
 “Extreme Compression”
• Instead of storing the data product, just store the provenance.
• When someone needs the file, just re-create it.
• Given periodic reprocessing, many files are never needed again anyway.
 Allows much larger “virtual archives”
• We make choices about which products to create, archive and distribute – intermediate products not always kept anyway
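The "store the provenance, re-create the file" idea can be sketched as an archive that keeps a recipe for every product but file contents for only some. This is a toy model under stated assumptions, not how MODAPS works: the `VirtualArchive` class, its methods and the sample products are hypothetical, and the recipes are plain Python functions standing in for versioned science software.

```python
class VirtualArchive:
    """Keeps a recipe for every product, but stored contents only for some."""

    def __init__(self):
        self.recipes = {}  # product name -> (function, input product names)
        self.stored = {}   # product name -> contents actually kept

    def register(self, name, func, inputs=()):
        """Record how `name` is produced from its inputs."""
        self.recipes[name] = (func, tuple(inputs))

    def store(self, name, contents):
        """Keep the contents of `name` in the archive."""
        self.stored[name] = contents

    def get(self, name):
        """Return stored contents, or re-create them from the recorded recipe."""
        if name in self.stored:
            return self.stored[name]
        func, inputs = self.recipes[name]
        # Inputs that were not archived are re-created the same way.
        return func(*(self.get(i) for i in inputs))


archive = VirtualArchive()
archive.store("L0", [1, 2, 3])  # only the raw data is kept
archive.register("L1", lambda raw: [x * 2 for x in raw], inputs=["L0"])
archive.register("L2", lambda cal: sum(cal), inputs=["L1"])

result = archive.get("L2")  # neither L1 nor L2 is stored
print(result)
```

Whether to call `store` for a given product is the archive-or-recreate choice; here it is made only for the raw input.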
12 of 13
Process On Demand Challenges
 Can I prove that the new file is the same as the old? What if it is “almost” the same...?
• Partial checksumming? Delta differences?
• “black box” the problem
 Do I still have the same hardware/OS/compiler/etc. 10 years from now, much less 100 or 1000?
• Maintain software validity (with formal science validation) on newer hardware
• We are already imposing System Engineering rigor onto a science programming world with much resistance... (CM, Regression Testing, etc.)
• We just reprocessed over 30 years of ozone data.
 Do I have the right input products online, or do I have to re-make them too? (and their inputs…)
• Cascading problem: make the “archive” or “process-on-demand” decision at every level.
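The "same" versus "almost the same" question can be illustrated with two checks: a checksum for bit-for-bit identity, and a tolerance comparison for values that differ only slightly, for example after a move to new hardware. This is a sketch only. The function names are hypothetical, and the tolerance of 1e-6 is an arbitrary placeholder; what counts as close enough is a science validation decision.

```python
import hashlib
import math


def checksum(data: bytes) -> str:
    """SHA-256 digest of the file contents, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def same_bytes(old: bytes, new: bytes) -> bool:
    """True only if the re-created file is bit-for-bit identical."""
    return checksum(old) == checksum(new)


def almost_same(old_values, new_values, rel_tol=1e-6):
    """True if both sequences have equal length and each pair agrees within rel_tol."""
    if len(old_values) != len(new_values):
        return False
    return all(
        math.isclose(a, b, rel_tol=rel_tol)
        for a, b in zip(old_values, new_values)
    )


print(same_bytes(b"abc", b"abc"))
print(almost_same([1.0, 2.0, 3.0], [1.0, 2.0000001, 3.0]))
```

Only the old checksum needs to be kept with the provenance to run the first check after the old file is deleted; the second check needs the old values, or some summary of them, to still be available.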
13 of 13