What are we doing???
Transcription
Faculty of Computer Science – Institute for System Architecture, Database Technology Group
Talk at the Faculty Colloquium – DB Group
What are we doing???
Wolfgang Lehner
17 November 2008

… famous past … the golden present … bright future … ???
[Figure: data volume stored in DBMSs, growing over time]

Bright Future … ???
• Java, SQL:1999, object orientation
• Multimedia, information retrieval
• Internet, XML: structures in tables, structures between tables
• Content management, data warehouses, OLAP / data mining

Bright Future … ???
• Search!!! Information retrieval

Bright Future … ???
• Where do I find my information, and how do I get at the information I am looking for?
• How far can I trust my data, and how can I control my data flow in a targeted way?
(B. Lindsay, © 2005 IBM Corporation)

Bright Future … ???
Ontology, XQuery, XML, model management, events, data analytics, data integration, Web 2.0, Info 2.0, knowledge discovery, sampling, data quality, personal information management, DWH models, web services, data streams, P2P systems, Semantic Web, query optimization, self-tuning

Data(base) Analytics
• Research areas: Math & Models, System Architecture, DB Infrastructures
• Projects: Open Service Process Platform; Approximate Query Processing (Sampling); XML Stream Optimization; Theseus Data Process Orchestration; Data Stream Quality Research; Data Mining Process Control; GCIP – Generation of Integration Processes; QStream (trading time for space); Data Mining Support for SAP Netweaver BIA; Planning Support for SAP Netweaver BIA; Large-Scale Reporting; Landscape-Model-Aware Physical DB Design

Sampling
Application level (external)
• Clustering
– Find similar groups
– Often superlinear in input size
• Procedure
– Run k-means
– Estimate mean and variance
– 99% confidence interval under a normal distribution
– Run on a 5% sample
System level (internal)
• Selectivity estimation
– Determine the percentage of tuples that satisfy a query
– Key to effective query optimization
• Procedure
– Exact computation vs. a 5% sample
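The sample-based selectivity estimation described above can be illustrated in Python. This is a minimal sketch, not the group's actual implementation: the table, predicate, 5% sampling rate, and the 95% normal-approximation confidence bound are all illustrative assumptions.

```python
import math
import random

def estimate_selectivity(table, predicate, rate=0.05, seed=42):
    """Estimate the fraction of rows satisfying `predicate` from a
    Bernoulli sample drawn at `rate`, plus a 95% confidence half-width
    under the normal approximation."""
    rng = random.Random(seed)
    sample = [row for row in table if rng.random() < rate]
    n = len(sample)
    if n == 0:
        return 0.0, 1.0  # an empty sample carries no information
    p = sum(1 for row in sample if predicate(row)) / n
    # 95% half-width: 1.96 * sqrt(p(1-p)/n)
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, half_width

# Illustrative table: 100,000 integers; the predicate has true selectivity 0.2
rng = random.Random(7)
data = [rng.randint(0, 999) for _ in range(100_000)]
est, half_width = estimate_selectivity(data, lambda v: v < 200)
print(f"estimated selectivity {est:.3f} ± {half_width:.3f}")
```

With a 5% sample of 100,000 rows the half-width lands near one percentage point, which matches the "≈20k items for 1% absolute error" rule of thumb on the next slide.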
How good is this?
• Arbitrary dataset, 1% absolute error, 95% confidence: ≈20k items suffice
• Exact: 1.1% vs. sample: ≈1.2%; exact: 83.8% vs. sample: ≈83.6%

Option 1: Query Sampling (a sampling step runs at query time over the base data; an estimation step turns approximate queries into approximate results)
• Advantages
– No impact on traditional query processing
– No storage requirements
• Disadvantages
– The sampling step is expensive
– Supports only simple queries
– Cannot handle data skew

Option 2: Materialized Sampling (sample data is maintained next to the base data; approximate queries run against the sample)
• Advantages
– Quick access to the sample
– Sophisticated preprocessing feasible
• Disadvantages
– Storage space
– Impact on updates

Sampling – Problem Dimensions

AMTC
• Development of high-end photomasks
• Immersion and extreme-ultraviolet (EUV) lithography for sub-90 nm semiconductor technologies
(Dresden, 17 June 2008, TransConnect® Info Day)

Overview of Project Goals
• Data mining: additional control data for the production process
• Transition from traditional pull-based analysis systems to push scenarios for production models
• Production process: process control, data preparation, product manufacturing
• Formulation of metrics for meta mining
• Online update of process models
• Integration of production data with minimal latency at the time of data creation
• Measurement data acquisition, integration, quality assurance (ADAMAS)

Clustering of Time Series
• Problem
– Capture the human similarity notion despite unequal length, amplitude scale, and phase
• Synopsis for series
– Relative frequency of subsequences of length n
• Preprocess (discretization)
– Represent equal-length segments by their mean (PAA)
– Transform each mean to a symbolic value (SAX)
• Clustering using
– Standard algorithms
– Euclidean distance
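The PAA/SAX preprocessing above can be sketched as follows. This is a minimal illustration, assuming a 4-symbol alphabet with breakpoints taken from the standard normal distribution; the segment count and input series are invented for the example.

```python
import statistics

# Breakpoints splitting N(0,1) into 4 equiprobable bins (illustrative alphabet size)
SAX_BREAKPOINTS = [-0.67, 0.0, 0.67]

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of equal-length segments."""
    seg_len = len(series) / n_segments
    return [
        statistics.fmean(series[round(i * seg_len):round((i + 1) * seg_len)])
        for i in range(n_segments)
    ]

def sax(series, n_segments, alphabet="abcd"):
    """z-normalize, reduce with PAA, then map each segment mean to a symbol."""
    mu = statistics.fmean(series)
    sd = statistics.pstdev(series) or 1.0  # guard against constant series
    z = [(x - mu) / sd for x in series]
    word = ""
    for m in paa(z, n_segments):
        idx = sum(m > b for b in SAX_BREAKPOINTS)  # bin index of the mean
        word += alphabet[idx]
    return word

print(sax(list(range(8)), 4))  # a rising ramp maps to "abcd"
```

Two series can then be compared by the distance between their SAX words (or, as on the slide, by relative subsequence frequencies), which is robust to amplitude scale because of the z-normalization.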
Clustering of Time Series (cont.)
• Current work
– Extension to map data

Mining on Complex Objects Using Monte Carlo Database Ideas
• Traditional systems perform mining on scalar data sets
• Complex objects
– Uncertain data: annotated data points with a PDF
– 3D maps
– Tree structures (from [Japni08])

MQO Techniques in DSMS on Modern Hardware
• Data stream management systems (DSMS)
– Many applications (sensor networks, financial analysis, …)
– Key requirements: low-latency and high-throughput processing
• Multi-query optimization in DSMS
– Huge number of queries over the same data sources
– Opportunity to share work among operator trees with complex operator semantics (e.g., data mining)
• Exploiting modern hardware (Cell BE)
– Heterogeneous multi-core architecture, plenty of parallelism
– Taking advantage of SIMD

Flash and Databases
• Flash has asymmetric read/write performance
[Figure: reading/writing 1 GB of data — write: 0.392 MB/s, read: 10.314 MB/s]
• Idea: trade write for read
– Example: sorting a data set larger than available memory
– Repeatedly read the data set instead of writing intermediate results
[Figure: sorting 64 MB of integer values with 16 MB of available memory — external merge sort: 264.641 s, read-only sort: 79.61 s]

SAP BIA: Main-Memory DBMS
• SAP Netweaver BIA (TREX)
– Main-memory-based decision support engine
– Supports aggregation queries only
– Uses massive parallelism
• Project subjects
– Adaptation of data mining techniques for BIA technology
– Support for next-generation planning scenarios with SAP Business Intelligence
[Figure: BIA analytic engine over InfoCubes/ETL — parallel indexing of InfoCube data via standard BI processes; vertical decomposition and compression of indexes; parallel in-memory processing of query results; on-the-fly aggregation; merging and result preparation for BI queries via Business Explorer (BEx) or any tool; speed and flexibility via attribute search technology; scalability via Adaptive Computing]

Exploit, Monitor, Improve …
• Exploitation loop (optimization time): query → sample view matching → probe query generation → injection of new estimates
• Feedback loop (execution time): probe query execution, reporting of actual cardinalities, monitoring the quality of estimates, sample view refresh
[Figure: estimation process when using sample views — (1) view matching against sample view SV2; (2) probe query generation; (3) probe query execution (evaluate predicates, compute aggregates); (4) report sample measures; (5) inject new cardinalities into the regular cardinality estimator]
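The scale-up step at the heart of sample-view-based cardinality estimation can be sketched as follows. The base table, its synthetic `price` attribute, the predicate, and the 5% systematic sample are all invented for illustration.

```python
def estimate_cardinality(sample_rows, predicate, sampling_rate):
    """Estimate a query's result cardinality from a materialized sample:
    count qualifying sample rows and scale by the inverse sampling rate."""
    qualifying = sum(1 for row in sample_rows if predicate(row))
    return round(qualifying / sampling_rate)

# Illustrative base table of 10,000 rows with a synthetic 'price' attribute
base = [{"id": i, "price": (i * 37) % 500} for i in range(10_000)]
# Materialized 5% sample (every 20th row, to keep the example deterministic)
sample = base[::20]

pred = lambda row: row["price"] < 50
exact = sum(1 for row in base if pred(row))
approx = estimate_cardinality(sample, pred, 0.05)
print(exact, approx)  # exact count 1000 vs. sample-based estimate 1200
```

A systematic every-20th-row sample is used only for reproducibility; real sample views draw randomized samples precisely because systematic picks can correlate with attribute values, and the deviation of the estimate (1200) from the exact count (1000) here partly reflects that correlation.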
• Internal synopses: sample views (SV1(R,S), SV2(R,S,T), SV3(U,V)), histograms, database statistics

Quality Control Process for Sample Views
• Optimization time: match the query against a sample view and compute an SV-based cardinality estimate, plus an alternative estimate from the regular cardinality estimator
• Execution time: observe the actual cardinality and apply statistical quality control (a control chart per sample view)
• InfoPack per sample view: SVId, VersionId, sample cardinality n, estimated cardinality K', effective sampling rate, alternative estimate, actual cardinality K
• If an estimate falls outside the control bounds: execute a refresh (delete/insert) of the sample view

Data Transitions
• Extended SOA framework
– Data-Grey-Box Web Service technology
– BPELDT – data transitions in BPEL (process execution language)
• Architecture: Open Service Process Platform (OSPP)
– Based on SOA and database technologies
– Extensible to include semantic schema matching technology

Streams in SOA – Towards a CEP Framework
• Stream-based execution – Web service level
– Stream-based processing of requests
– Intermediate results, lower resource consumption
• Stream-based execution – process level
– Standing processes
– Pipelined execution of incoming messages

GCIP Overview
• Vision ("invisible deployment")
– Transparent use of integration systems
– Optimality decision
– Heterogeneous load balancing

Model-Driven Generation and Optimization of Integration Processes
• Platform-independent modeling
• Rule- and cost-based optimization
• (Bi-directional) model transformations
• Code generation for different integration systems
• The GCIP project as a foundation for investigating optimization concepts in integration processes

GCIP Optimization Perspectives – Results and Challenges
• Workload-based optimization
– Cost-based optimization techniques
– Statistics maintenance
– Workload shift detection
– Inter-operator optimization
• Throughput maximization
– Vectorization of integration processes (pipelining)
– Multi-process optimization (MPO): message batch processing, horizontal partitioning
– Dynamic throughput maximization (workload-based)
• Heterogeneous load balancing
• Efficiency comparison
– DIPBench: a benchmark for integration systems, with the DIPBench tool suite

Specific Problems
• Message indexing for document-oriented integration processes
• Dynamic adapter / wrapper generation
• SIR transaction model: a compensation-based transaction model for integration processes
• Optimization under transactional execution restrictions

Real-Time Data Warehouse Scheduling
• Real-time data warehouse
– Continuous stream of write-only updates and read-only queries (push principle)
– Query (currency) vs. update (freshness) contention
• Multi-objective scheduling
– For hundreds of transactions, offline and online
– Stable schedules required
– Two utility functions: a QoS and a QoD objective
– Two user groups, each trying to maximize its own utility function
[Figure: scheduler draining an update queue (u1 … uj with weights w(uj)) and a DWH query queue (q1 … qi, each with QoS/QoD weights, e.g., 0.3/0.7)]
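The QoS/QoD tension above can be illustrated with a toy greedy scheduler. The utility values and equal weights are invented, and this greedy rule is only a stand-in for the real multi-objective, stability-aware scheduling problem.

```python
def schedule(query_utils, update_utils, w_qos=0.5, w_qod=0.5):
    """Greedily interleave queries and updates: at each step execute the
    pending task with the higher weighted utility (QoS weight applies to
    queries, QoD weight to updates)."""
    order, qi, ui = [], 0, 0
    while qi < len(query_utils) or ui < len(update_utils):
        q_gain = w_qos * query_utils[qi] if qi < len(query_utils) else float("-inf")
        u_gain = w_qod * update_utils[ui] if ui < len(update_utils) else float("-inf")
        if q_gain >= u_gain:
            order.append(("query", qi)); qi += 1
        else:
            order.append(("update", ui)); ui += 1
    return order

# The high-utility query runs first; the update overtakes the low-utility query
print(schedule([0.9, 0.2], [0.5]))
```

Raising `w_qod` shifts the schedule toward freshness (updates first), while raising `w_qos` favors query currency, which is exactly the contention between the two user groups described on the slide.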
Precision Dairy Farming
[Figure: word cloud of demands surrounding Precision Dairy Farming — mathematical models and methods, information systems, housing systems, market demands, consumer protection, quality assurance and quality demands, environmental demands, social demands, efficiency demands, process control, bio- and gene technology, sustainability, animal welfare]
(Spilke, J.; Büscher, W.; Doluschitz, R.; Fahr, R.; Lehner, W.; 2003)
"Precision Dairy Farming is an integrative approach for the sustainable production of milk with assured quality and a high degree of consumer protection and animal welfare."

Data Relationships between Farm Enterprises and Information Partners
[Figure: the farm enterprise (Landwirtschaftsbetrieb) exchanging data with computing centers (ZWS, HIT), the breeding association (Zuchtverband), the milk recording organization (LKV), and a feed quality laboratory (Futterqualität Labor)]

What's Next ???
• Math & Models
– Prediction
– FlashForeward queries
– Monte Carlo DB as a core concept for DM applications
• System Architecture
– Investigate flash and vectorization potential for main-memory DBs
• DB Infrastructures
– Standing processes
– Process deployment

The Team
Maik Thiele, Thomas Legler, Rainer Gemulla, Philipp Rösch, Dirk Habich, Simone Linke, Benjamin Schlegel, Hannes Voigt, Peter B. Volk, Martin Hamann, Frank Rosenthal, Steffen Preissler, Anja Klein, Matthias Böhm, Ines Funke, Bernd Keller