Chapter 1
Introduction
• Administrative details
• Information Integration
• Semester outlook
Slides and lecture based on material kindly provided by Prof. Felix Naumann, Hasso-Plattner-Institut Potsdam

Literature
• Ulf Leser and Felix Naumann. Informationsintegration: Architekturen und Methoden zur Integration verteilter und heterogener Datenquellen. dpunkt Verlag. ISBN 3898644006. (The lecture is primarily based on this book.)
• AnHai Doan, Alon Halevy, Zachary Ives. Principles of Data Integration. Morgan Kaufmann, 1st edition. ISBN 0124160441.
• Selected topics are also covered in
  ‣ Stefan Conrad. Föderierte Datenbanksysteme.
  ‣ M. Tamer Özsu, Patrick Valduriez. Principles of Distributed Database Systems.
• Additional resources will be referenced in class.
• All scientific articles can be accessed via Google Scholar, DBLP, CiteSeer, ACM Digital Library, or the authors' personal websites.

What is Information Integration?
Information integration is...
• ... the combination of data and content coming from different data sources to obtain a unified set of information.
• ... the correct, complete, and efficient unification of data and content from different, heterogeneous sources to obtain a unified and structured set of information that users and applications can interpret effectively.

Information Integration | WS 2014/15 | Melanie Herschel | Universität Stuttgart

Data Source Examples
• Excerpt of a SwissProt file
• Example of an HTML form
• Excerpt of a list of public Web Services
• Figure adapted from the Suchanek & Weikum tutorial at SIGMOD 2013

Where do we encounter Information Integration?
In a broad sense
• Business integration
• Application integration
• Process integration (workflow integration)
In a stricter sense (focus of this lecture)
• Databases and information systems
• Distributed
• Autonomous/heterogeneous

Integration = Abstraction
• In a single database (a single source), logical DB design abstracts from the physical DB design.
  ‣ Data independence
  ‣ Queries: procedural vs. declarative
• Information integration in turn abstracts from logical DB design.
  ‣ Source independence (where data is stored)
  ‣ Data model and syntax independence
  ‣ Independence from semantic differences (hopefully!)

Information Integration Examples
• Mashups (see e.g. a mashup repository)
• Mashup example 2
• Google Shopping

Application Areas [Halevy04]
• Figure adapted from the Suchanek & Weikum tutorial at SIGMOD 2013

Information Integration: An Old Problem
• On the research agenda for more than 50 years
  ‣ Early systems date back to the 1970s.
  ‣ Manual integration has been considered even earlier, of course.
• New problems
  ‣ Large number of data sources
  ‣ Heterogeneity
  ‣ New types of data (XML, RDF, GIS, OO, ...)
  ‣ New types of queries (search, UDFs, ...)
  ‣ New types of results (ranking, visualization, ...)
  ‣ New types of users (managers, admins, anybody, ...)
• Alon Halevy: "It's plain hard!" [Halevy04]

Why is it difficult? [Halevy04]
• System aspects
  ‣ Different systems
  ‣ Query processing over multiple systems
• Social aspects
  ‣ Finding relevant data in companies (on the Web)
  ‣ Accessing relevant data
  ‣ People need to be convinced to cooperate / sources must provide interfaces that can be used
• Logic-based reasons
  ‣ Schema and data heterogeneity
  ‣ These are independent of the chosen integration architecture.

Example
Web Service A
• Location: Tübingen
• Operations:
  ‣ getMovieByActor(firstName, lastName)
  ‣ getMovieByTitle(title)
• Output:
  <movie>
    <Titel> Troy </Titel>
    <Actors>
      <Actor> Eric Bana </Actor>
      <Actor> Brad Pitt </Actor>
    </Actors>
  </movie>
• Output structure:
  getMov
    movie
      title
      Actors
        Actor
Web Service B
• Location: Hamburg
• Operation: myMovies(Actor, Year)
• Output:
  <film>
    <name> Troy </name>
    <cast> Pitt & Cox</cast>
    <year> 2003 </year>
  </film>
• Output structure:
  myMov
    film
      name
      cast
      year

Integrating Web Services A & B
1. User interfaces
2. Schema integration / schema mapping
3. Query rewriting
4. Runtime estimation (optimization)
5. Sending requests to both services
6. Getting answers
7. Entity resolution
8. Data fusion
   8.1 Resolving conflicts etc.
   8.2 Determining the integrated result
   8.3 Executing the integration
9. Visualization to the user

Step 1: User Interfaces
• The user interfaces of Web Service A and Web Service B

Step 2: Schema Integration / Mapping
• Schema integration: the schema of Web Service A plus the schema of Web Service B yields the integrated schema.
• Schema mapping: correspondences relate the elements of the two source schemata to the elements of the integrated schema.
• Integrated schema:
  myMov
    movie
      title
      year
      Actors
        Actor

Step 3: Query Rewriting
• Query rewriting is based on the target representation.
  ‣ E.g. Concat(Firstname, Lastname) = Actor
• The query against the target representation is transformed so that it directly queries the sources (Web Services A and B).

Step 4: Query Optimization
• Optimization focus? A quick response or a complete answer?
  ‣ Web Service A is in Tübingen (local).
  ‣ Web Service B is in Hamburg (remote).
  ‣ Web Service B has more attributes and more entities.
  ‣ Web Service A has fewer attributes.
• In addition:
  ‣ Search by year is supported only by Web Service B.
  ‣ Data transformations can be expensive to compute.

Step 5: Sending Requests
• The query is posed with respect to the integrated schema (target representation).
• (1) Requests are sent to the sources (Web Services A and B).
• (2) The Web services send back their results.

Step 6: Get Answers
The query Title = "Troy" and year = "2003" returns the following results.
Web Service A
  <movie>
    <Titel> Troy </Titel>
    <Actors>
      <Actor> Eric Bana </Actor>
      <Actor> Brad Pitt </Actor>
    </Actors>
  </movie>
Web Service B
  <film>
    <name> Troy </name>
    <cast> Pitt & Cox</cast>
    <year> 2003 </year>
  </film>

Step 7: Entity Resolution
Is the movie returned by Web Service A the same movie as the one returned by Web Service B? To answer this question, we have to
(1) identify semantic equivalences among the result schemata (schema matching) and
(2) compare the data using a similarity measure.
Both steps operate on the two results shown in Step 6.

Step 8.1: Resolving Conflicts
Comparing the two results from Step 6:
• Identical titles ➙ no conflict
• Eric Bana, Cox and 2003 exist in only one source ➙ uncertainty
• Different data ➙ conflict

Step 8.2: Determining Integrated Result
The results of Web Service A and Web Service B are combined into the integrated result:
  <movie>
    <Titel> Troy </Titel>
    <Actors>
      <Actor> Bana </Actor>
      <Actor> Pitt </Actor>
      <Actor> Cox </Actor>
    </Actors>
    <year> 2003 </year>
  </movie>

Step 8.3: Executing Integration
• How to perform the data fusion?
• Declarative code?
  ‣ SQL, XQuery, XSLT
  ‣ Rarely possible
  ‣ Typically slow
• Procedural code?
  ‣ Java, C++
  ‣ Difficult to maintain
  ‣ Fast

Step 9: Visualization
Visualization of
• The result
• Data provenance
• Data quality
• Changed values
• Operators used
• ...
Example annotations on the integrated result from Step 8.2:
• Title: from Web Service A and B
• Actor Bana: from Web Service A, was "Eric Bana" before
• Conflict has been resolved.
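
Steps 2 and 3 of the walkthrough can be illustrated with a minimal Python sketch. It assumes the standard xml.etree.ElementTree module. The function names map_service_a, map_service_b and rewrite_actor_query and the integrated-schema keys title, year and actors are hypothetical and not part of the lecture material. The "&" in the cast element is escaped as "&amp;" so that the sample parses as XML.

```python
import xml.etree.ElementTree as ET

# Sample responses from the walkthrough ("&" escaped so the XML is well-formed).
RESPONSE_A = (
    "<movie><Titel> Troy </Titel><Actors>"
    "<Actor> Eric Bana </Actor><Actor> Brad Pitt </Actor>"
    "</Actors></movie>"
)
RESPONSE_B = (
    "<film><name> Troy </name><cast> Pitt &amp; Cox</cast>"
    "<year> 2003 </year></film>"
)


def map_service_a(xml_text):
    """Map a <movie> element of Web Service A to the integrated schema."""
    movie = ET.fromstring(xml_text)
    return {
        "title": movie.findtext("Titel").strip(),
        "year": None,  # Web Service A does not deliver a year
        "actors": [a.text.strip() for a in movie.findall("Actors/Actor")],
    }


def map_service_b(xml_text):
    """Map a <film> element of Web Service B to the integrated schema."""
    film = ET.fromstring(xml_text)
    return {
        "title": film.findtext("name").strip(),
        "year": int(film.findtext("year")),
        # Assumption: the cast is a single string with "&" as separator.
        "actors": [name.strip() for name in film.findtext("cast").split("&")],
    }


def rewrite_actor_query(actor, year):
    """Rewrite a query on the integrated schema into one call per source.

    Web Service A expects first and last name separately, while the
    integrated schema stores Actor = Concat(Firstname, Lastname), so the
    value is split again. Web Service A cannot filter by year.
    """
    first_name, _, last_name = actor.partition(" ")
    return {
        "A": ("getMovieByActor", (first_name, last_name)),
        "B": ("myMovies", (actor, year)),
    }


if __name__ == "__main__":
    print(map_service_a(RESPONSE_A))
    print(map_service_b(RESPONSE_B))
    print(rewrite_actor_query("Brad Pitt", 2003))
```

The sketch only produces descriptions of the calls to be made. Actually sending them to the services (Step 5) is left out, because the walkthrough does not specify the service endpoints.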

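Steps 7 and 8 of the walkthrough (entity resolution and data fusion) can be sketched in the same way. This is a simplified illustration under stated assumptions, not the method used in the lecture. The records are assumed to be already mapped to an integrated schema with the hypothetical keys title, year and actors. Similarity is computed with difflib.SequenceMatcher from the Python standard library, and the threshold 0.9 is arbitrary. Actor conflicts are resolved by reducing every name to its last name, which happens to reproduce the integrated result shown in Step 8.2.

```python
from difflib import SequenceMatcher

# Results of both services, assumed to be mapped to the integrated schema.
record_a = {"title": "Troy", "year": None, "actors": ["Eric Bana", "Brad Pitt"]}
record_b = {"title": "Troy", "year": 2003, "actors": ["Pitt", "Cox"]}


def similarity(x, y):
    """String similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()


def same_movie(a, b, threshold=0.9):
    """Entity resolution, simplified to a comparison of the titles only."""
    return similarity(a["title"], b["title"]) >= threshold


def last_name(actor):
    """Assumed conflict resolution: keep only the last name of an actor."""
    return actor.split()[-1]


def fuse(a, b):
    """Data fusion: combine two records that describe the same movie."""
    actors = []
    for actor in a["actors"] + b["actors"]:
        name = last_name(actor)
        if name not in actors:  # "Brad Pitt" and "Pitt" collapse to "Pitt"
            actors.append(name)
    return {
        "title": a["title"],  # titles were judged equivalent; keep source A
        # The year exists in only one source: take whichever value is present.
        "year": a["year"] if a["year"] is not None else b["year"],
        "actors": actors,
    }


if __name__ == "__main__":
    if same_movie(record_a, record_b):
        print(fuse(record_a, record_b))
```

This is procedural fusion code in the sense of Step 8.3: every conflict resolution rule is hard-wired, which illustrates why such code is difficult to maintain.
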
Semester Outlook
Problem setting
• Introduction to Information Integration
• Distribution, Autonomy and Heterogeneity
Architectures
• Materialized and Virtual Integration
• 5-Layer Architecture
• Mediator/Wrapper Architecture
Mapping
• Schema Mapping
• Schema Matching

Semester Outlook
Modelling
• Global-As-View modelling
• Local-As-View modelling
Query processing
• Global-As-View query processing
• Containment and Local-As-View query processing
• Bucket algorithm
Data Integration
• Entity Resolution
• Data Fusion