POLITECHNIKA ŁÓDZKA
WYDZIAŁ ELEKTROTECHNIKI, ELEKTRONIKI, INFORMATYKI I AUTOMATYKI
KATEDRA INFORMATYKI STOSOWANEJ

mgr inż. Jacek Wiślicki

Ph.D. Thesis
An object-oriented wrapper to relational databases with query optimisation

praca doktorska
Obiektowa osłona do relacyjnych baz danych z uwzględnieniem optymalizacji zapytań

Advisor: prof. dr hab. inż. Kazimierz Subieta

Łódź 2007

To my wife and son, for their love, belief and patience…

Index of Contents

ABSTRACT
ROZSZERZONE STRESZCZENIE (EXTENDED SUMMARY)
CHAPTER 1 INTRODUCTION
  1.1 Motivation
  1.2 Theses and Objectives
  1.3 History and Related Works
  1.4 Thesis Outline
CHAPTER 2 THE STATE OF THE ART AND RELATED WORKS
  2.1 The Impedance Mismatch
  2.2 Related Works
    2.2.1 Wrappers and Mediators
    2.2.2 ORM and DAO
    2.2.3 XML Views over Relational Data
    2.2.4 Applications of RDF
    2.2.5 Other Approaches
  2.3 The eGov-Bus Virtual Repository
  2.4 Conclusions
CHAPTER 3 RELATIONAL DATABASES
  3.1 Relational Optimisation Constraints
  3.2 Relational Calculus and Relational Algebra
  3.3 Relational Query Processing and Optimisation Architecture
    3.3.1 Space Search Reduction
    3.3.2 Planning
    3.3.3 Size-Distribution Estimator
  3.4 Relational Query Optimisation Milestones
    3.4.1 System-R
    3.4.2 Starburst
    3.4.3 Volcano/Cascades
CHAPTER 4 THE STACK-BASED APPROACH
  4.1 SBA Object Store Models
  4.2 SBQL
    4.2.1 SBQL Semantics
    4.2.2 Sample Queries
  4.3 Updateable Object-Oriented Views
  4.4 SBQL Query Optimisation
    4.4.1 Independent Subqueries
    4.4.2 Rewriting Views and Query Modification
    4.4.3 Removing Dead Subqueries
    4.4.4 Removing Auxiliary Names
    4.4.5 Low Level Techniques
CHAPTER 5 OBJECT-RELATIONAL INTEGRATION METHODOLOGY
  5.1 General Architecture and Assumptions
  5.2 Query Processing and Optimisation
    5.2.1 Naive Approach vs. Optimisation
  5.3 A Conceptual Example
CHAPTER 6 QUERY ANALYSIS, OPTIMISATION AND PROCESSING
  6.1 Proposed Algorithms
    6.1.1 Selecting Queries
    6.1.2 Deleting Queries
    6.1.3 Updating Queries
    6.1.4 SQL Query String Generation
    6.1.5 Stack-Based Query Evaluation and Result Reconstruction
  6.2 Query Analysis and Optimisation Examples
    6.2.1 Relational Test Schemata
    6.2.2 Selecting Queries
    6.2.3 Imperative Constructs
    6.2.4 Multi-Wrapper and Mixed Queries
    6.2.5 SBQL Optimisation over Multi-Wrapper Queries
  6.3 Sample Use Cases
    6.3.1 Rich Employees
    6.3.2 Employees with Departments
    6.3.3 Employees with Cars
    6.3.4 Rich Employees with White Cars
CHAPTER 7 WRAPPER OPTIMISATION RESULTS
  7.1 Relational Test Data
  7.2 Optimisation vs. Simple Rewriting
  7.3 Application of SBQL optimisers
CHAPTER 8 SUMMARY AND CONCLUSIONS
  8.1 Prototype Limitations and Further Works
  8.2 Additional Wrapper Functionalities
APPENDIX A THE EGOV-BUS PROJECT
APPENDIX B THE ODRA PLATFORM
  B.1 ODRA Optimisation Framework
APPENDIX C THE PROTOTYPE IMPLEMENTATION
  C.1 Architecture
    C.1.1 Communication protocol
  C.2 Relational Schema Wrapping
    C.2.1 Example
    C.2.2 Relational Schema Models
    C.2.3 Result Retrieval and Reconstruction
  C.3 Installation and Launching
    C.3.1 CD Contents
    C.3.2 Test Schemata Generation
    C.3.3 Connection Configuration
    C.3.4 Test Data Population
    C.3.5 Schema Description Generation
    C.3.6 Server
    C.3.7 Client
  C.4 Prototype Testing
    C.4.1 Optimisation Testing
    C.4.2 Sample batch files
INDEX OF FIGURES
INDEX OF LISTINGS
INDEX OF TABLES
BIBLIOGRAPHY

Abstract

This Ph.D. thesis is focused on the transparent and efficient integration of relational databases into an object-oriented distributed database system available to top-level users as a virtual repository.
The core of the presented solution is to provide a wrapper – a dedicated, generic piece of software capable of interfacing between the virtual repository structures (in the most common case, object-oriented updateable views) and the wrapped relational database, enabling bidirectional data exchange (i.e. retrieval and updates) with optimal query evaluation. The idea of integrating distributed, heterogeneous, fragmented and redundant databases dates back to the eighties of the last century and the concept of federated databases; nevertheless, the virtual repository approach is closer to the distributed mediation concept from the early nineties. Regardless of the origin, the goal of such systems is to transparently present final and complete business information, combined from data stored in bottom-level resources and exactly matching top-level user demands and requirements. In the most general case, the resources mean any data sources and feeds, including the most common relational databases (the focus of the thesis), object-oriented and object-relational databases, XML and RDF data stores, Web Services, etc. The need for integrating various resources representing completely different paradigms and models stems from the characteristics of today's data and information management systems (besides the resources' distribution, fragmentation, replication, redundancy and heterogeneity): the software is usually written in some high-level object-oriented programming language that is far removed from the resource-level query language (a phenomenon often referred to as the impedance mismatch). This integration process must be completely transparent so that an end user (a human or an application) is not aware of the actual data source model and structure.

On the other side, an object-oriented database communicating directly with such a wrapper must be opaque – it must work as a black box, and its underlying object-relational interface cannot be available directly from any system element located in upper layers. Another feature of a wrapper is its genericity – its action and reliability must be completely independent of the relational resource it wraps. Also, neither data materialisation nor replication is allowed at the virtual repository side (or in any other intermediate system module), of course except for current query results that must be returned to users querying the system. Similarly, updating the wrapped resource data from the level of the virtual repository and its object-oriented query language must be assured. Besides the transparency aspects, most of the effort has been devoted to efficient optimisation procedures enabling the action of powerful native relational resource query optimisers together with the object-oriented optimisation methods applicable in the virtual repository. The virtual repository for which the wrapper was designed relies on the Stack-Based Approach (SBA), the corresponding Stack-Based Query Language (SBQL) and updateable object-oriented views. Therefore, the wrapper must be capable of transforming object-oriented queries referring to the global schema, expressed in SBQL, into SQL-optimisable relational queries, whose results are returned to the virtual repository in the same way as actual object-oriented results.
The thesis has been developed under the eGov-Bus (Advanced eGovernment Information Service Bus) project supported by the European Community under the "Information Society Technologies" priority of the Sixth Framework Programme (contract number: FP6-IST-4-026727-STP). The idea with its different aspects (including the virtual repository it is a part of, as well as data fragmentation and integration issues) has been presented in over 20 research papers, e.g. [1, 2, 3, 4, 5, 6, 7].

Keywords: database, object-oriented, relational, query optimisation, virtual repository, wrapper, SBA, SBQL

Rozszerzone streszczenie (Extended Summary)

The number of currently available data sources is enormous. Many of them are accessible via the Internet, although they may not be public or may restrict access to a strictly defined group of users. Such resources are distributed, heterogeneous, fragmented and redundant. The idea partially developed and implemented in this dissertation consists in the virtual integration of such resources into a centralised, homogeneous, consistent whole, free of fragmentation and redundancy, constituting a virtual repository that provides some common functionalities and services, including a trust infrastructure (security, privacy, licensing, payments, etc.), Web Services, distributed transactions, workflow management, etc.

The main goal of the presented work is the integration of heterogeneous relational databases (resources) into an object-oriented database system in which such resources are seen as purely object-oriented models and stores that can be transparently queried with an object-oriented query language (in other words, they must be indistinguishable from actual object-oriented resources). Therefore, relational schemata must be "wrapped" with specialised software capable of bidirectional data exchange between the bottom-level RDBMS and the other elements of the system. This process must be transparent, so that an end user (a human or an application) is in no way aware of the actual resource model. On the other hand, an object-oriented database communicating directly with such a wrapper must be completely opaque (a black box), so that the wrapper is not directly accessible from any system element located in the upper layers of the architecture. Another feature of such a wrapper is its full genericity: its action, reliability and performance properties should be completely independent of the resource it hides. By assumption, materialisation and replication of relational data at the virtual repository side (or in any other intermediate system module) are also disallowed, of course except for the current query results that must be returned to the users querying the system.

The theses of the dissertation are defined as follows:
1. Legacy relational databases can be transparently integrated into an object-oriented virtual repository, and their data can be processed and updated with an object-oriented query language, indistinguishably from purely object-oriented data, without the need for materialisation or replication.
2. For such a system, mechanisms can be developed and implemented that enable the cooperation of the object-oriented optimisation of the virtual repository with the native optimisers of a relational resource.
The art of building object-oriented wrappers intended for integrating heterogeneous relational resources into homogeneous object-oriented database systems has been developed for about fifteen years. These works aim at joining the much older theory of relational databases (whose fundamentals were collected and presented by Edgar F. Codd almost forty years ago) with the relatively young object-oriented paradigm (the term "object-oriented database system" appeared in the mid-eighties of the last century and the basic concepts were manifested in [8], although the corresponding works had started more than ten years earlier).

The presented dissertation is focused on a novel approach to the integration of heterogeneous relational resources into a distributed object-oriented database system. These resources must be available to global users through a global object model and an object-oriented query language, so that the users are in no way aware of the actual resource model and storage. The developed and implemented integration process is completely transparent and enables bidirectional data exchange, i.e. querying a relational resource (retrieving data matching query criteria) and updating relational data. Moreover, the object-oriented structures located directly above a relational resource (created directly upon that resource) can be seamlessly transformed and filtered (through cascaded updateable object-oriented views) so that they correspond to the business model and the general schema they are to be a part of. Besides the transparency aspects, the greatest effort has been devoted to efficient query optimisation procedures enabling the action of the native optimisers of a relational resource. These functionalities have been achieved with a specialised module constituting an object-relational wrapper, referred to hereafter simply as the wrapper.

The prototype solution presented in the dissertation is based on the Stack-Based Approach (SBA) to query languages and databases, the resulting query language (SBQL, Stack-Based Query Language) and updateable object-oriented views, the JDBC interface, the TCP/IP protocol and the SQL language. The implementation has been written in the Java™ language. The concept with its various aspects (including the virtual repository it is a part of, as well as data fragmentation and integration issues) has been presented in over 20 papers, e.g. [1, 2, 3, 4, 5, 6, 7]. The dissertation has been prepared within the eGov-Bus (Advanced eGovernment Information Service Bus) project supported by the European Community under the "Information Society Technologies" priority of the Sixth Framework Programme (contract number: FP6-IST-4-026727-STP).

The text of the dissertation is divided into the following chapters, whose concise summaries are given below:

Chapter 1 Introduction
The chapter contains the motivation for taking up the subject, the theses, goals and assumptions of the dissertation, a description of the technical solutions employed, and a concise description of the state of the art and of the related works.

Chapter 2 The State of the Art and Related Works
This part of the thesis discusses the state of the art and other work conducted worldwide aiming at linking relational databases with object-oriented systems and programming languages.
The phenomenon of the "impedance mismatch" between query languages and programming languages, its consequences and the possible solutions to the problem are taken as the starting point. In particular, the architecture of mediators and wrappers proposed by G. Wiederhold in 1992 is considered, together with a number of solutions based on this very model. Subsequently, various ORM (Object-Relational Mapping) and DAO (Data Access Objects) solutions are described, aiming respectively at mapping existing data stored in relational systems onto object-oriented structures available from object-oriented programming languages, and at persisting programming-language-level objects in (relational) databases. Part of the chapter is also devoted to techniques of exposing relational data through XML views and RDF. A separate subchapter shortly describes the role of the wrapper in the overall architecture of the virtual repository developed within the eGov-Bus project.

Chapter 3 Relational Databases
This chapter presents the fundamentals of relational systems and of relational query optimisation, with the existing challenges and the developed methods, mainly with respect to SQL, currently the most popular language. The main part of the chapter concerns the architecture of an optimiser of SQL queries based on relational calculus and relational algebra, and the most popular techniques applied in contemporary relational systems. The milestones in the development of relational query optimisation, such as System-R, Starburst and Volcano, are also mentioned.

Chapter 4 The Stack-Based Approach
The chapter concentrates on the fundamentals of the Stack-Based Approach (SBA) and on the object-oriented query optimisation methods currently implemented in the virtual repository, mainly with respect to SBQL, the language used in the work on the wrapper prototype. The chapter covers the SBQL optimisation methods based on query rewriting (static optimisation), related to performing selections as early as possible, rewriting views (query modification), removing dead subqueries and auxiliary names, as well as using indices.

Chapter 5 Object-Relational Integration Methodology
This part of the thesis provides a conceptual description of the developed and implemented methodology for integrating relational resources into the object-oriented virtual repository. The assumptions of the system, the manner of mapping schemata with the application of updateable views, and the translation of object-oriented queries into relational ones are discussed. The fundamentals of query processing aimed at enabling both object-oriented and relational optimisation are also introduced. The chapter concludes with an abstract (implementation-independent) example of schema mapping and query optimisation.

Chapter 6 Query Analysis, Optimisation and Processing
The chapter contains a detailed discussion of the developed and implemented query analysis and transformation algorithms, which aim at achieving maximum system efficiency and reliability. Sample relational schemata are presented, along with the process of mapping them (by means of updateable object-oriented views) onto an object-oriented schema available to the virtual repository.
The discussed methods are supported with real examples of transformations of queries referring to the mentioned test schemata, in which the individual stages of analysis and optimisation are demonstrated in detail. The chapter ends with examples of further transformations of the created object-oriented schema that can be applied in the upper layers of the virtual repository.

Chapter 7 Wrapper Optimisation Results
The chapter presents and discusses the results of the optimisation performed by the wrapper, with respect to unoptimised queries and to queries optimised with the SBQL mechanisms. The wrapper's operation was tested on a relatively large set of test queries (referring to the sample schemata introduced in the previous chapter) exploiting various optimisation mechanisms of SQL, whose queries are executed directly on the relational resource. Tests were also carried out for various database sizes, in order to observe the dependence of the optimisation result on the number of retrieved records and on the number of performed joins.

Chapter 8 Summary and Conclusions
This chapter contains the experience gathered and the conclusions drawn while developing the wrapper and testing the prototype. The summary of the optimisation results produced by the prototype unambiguously proves the theses of the dissertation. A separate subchapter is devoted to the further work that can be performed to develop the prototype and extend its functionality.

The text of the dissertation is extended with three appendices discussing, in turn, the eGov-Bus project, the ODRA object-oriented database system responsible for running the virtual repository (within which the wrapper prototype has been implemented), and the implementation details of the wrapper prototype.

Chapter 1 Introduction

1.1 Motivation

Nowadays, we observe an enormous number of data sources. Many of them are available via the Internet (although they may not be public or may restrict access to a limited number of users). Such resources are distributed, heterogeneous, fragmented and redundant. Therefore, the information available to users (including governmental and administrative offices and agencies, as well as companies and industry) is incomplete and requires much effort and usually manual (or human-interactive, computer-aided) analysis to reveal its desired aspects. Even after such a time- and resource-consuming process, there is actually no guarantee that some important pieces of information are not lost or overlooked. The main idea, partially developed and implemented in the thesis, is to virtually integrate such resources into a centralised, homogeneous, integrated, consistent and non-redundant whole constituting a virtual repository providing some common functionalities and services, including a trust infrastructure (security, privacy, licensing, payments, etc.), web services, distributed transactions, workflow management, etc. The proposed solution is focused on relational databases, which are the most common in currently utilised (and still developed) data and information management systems. Contemporary system designers still choose them since they are mature, predictable, stable and widely available as either commercial or free products with professional maintenance and technical support.
On the other hand, programmers tend to use high-level object-oriented programming languages that collide with relational paradigms and philosophy (the object-relational impedance mismatch phenomenon). Global migration to object-oriented database systems is unlikely in the foreseeable future due to programmers' habits, the unpredictably high costs (and time required) and the lack of commonly accepted object-oriented standards and mature database systems (most of them are functional, yet still experimental, prototypes). These factors reveal a strong need for a smooth bridge between relational and object-oriented technologies and systems, enabling effective and seamless manipulation of relational data in an object-oriented manner, together with combining them with purely object-oriented data.

1.2 Theses and Objectives

The Ph.D. dissertation is focused on a novel approach to the integration of heterogeneous relational database resources into an object-oriented distributed database system. Such resources must be available to global (top-level) users via a common object data model (provided by the virtual repository) and an object-oriented query language, so that these users are not aware of the actual resource data model and storage. Relational databases are mapped as purely object-oriented models and stores that can be transparently queried with an object-oriented query language; in other words, they are to be indistinguishable from actual object-oriented resources. Therefore, relational schemata must be wrapped (enveloped) with dedicated pieces of software: interfaces capable of bidirectional data exchange between a bottom-level RDBMS and the other elements of the system. Such an interface is referred to as a relational-to-object data wrapper, or simply a wrapper. This integration process must be completely transparent so that an end user (a human or an application) is not aware of the actual data source model and structure. On the other side, an object-oriented database communicating directly with such a wrapper must be opaque: it must work as a black box, and its underlying object-relational interface cannot be available directly from any system element located in upper layers. Another feature of a wrapper is its genericity: its action and reliability must be completely independent of the resource it wraps. Also, neither data materialisation nor replication is allowed at the virtual repository side (or in any other intermediate system module), of course except for current query results that must be returned to users querying the system.

The summarised theses are:
1. Legacy relational databases can be transparently integrated into an object-oriented virtual repository, and their data can be processed and updated with an object-oriented query language indistinguishably from purely object-oriented data, without materialisation or replication.
2. Appropriate optimisation mechanisms can be developed and implemented for such a system in order to enable the cooperation of the object-oriented virtual repository optimisation with native relational resource optimisers.

The prototype solution accomplishing, verifying and proving the theses has been developed and implemented according to a modular, reusable software development methodology, where subsequent components are developed independently and combined according to predefined (primarily assumed) interfaces.
First, the state of the art in the field of integration of heterogeneous database and information systems was analysed. A set of related solutions was identified, together with their assumptions, strengths and weaknesses, considering the possibility of their adaptation to the designed solution (Chapter 2). The experience of previous works, the requirements of the virtual repository and the author's experience from earlier commercial projects allowed designing and implementing the general modular wrapper architecture (subchapter 5.1 General Architecture and Assumptions). The architecture assures genericity, flexibility and reliability in terms of communication and data transfer, and it was not subjected to any substantial revisions during verification and experiments. Based on the above, the first working prototype (without optimisation issues, but allowing transparent and reliable access to the wrapped resource, referred to as the naive approach) was implemented and experimentally tested. Since this part of the solution is independent of the virtual repository and the objectives were clear and well defined, a fast cascaded development model was used. The next stage concerned a thorough analysis of various optimisation methods used in relational database systems (Chapter 3) and in object-oriented database systems, especially those based on the stack-based approach (Chapter 4) implemented in the virtual repository (subchapter 2.3 The eGov-Bus Virtual Repository). The goal was to establish SBQL syntax patterns transformable to optimisable SQL queries. Subsequently improved prototypes, verified with experiments, led to a set of reliable query rewriting rules and to the design of the desired query optimiser. The evolutionary/spiral development model used at this stage was implied by the integration of the wrapper with the independently developed virtual repository and its continuously changing implementation. Also, the experimental results concerning query evaluation efficiency and reliability forced further improvement of the rewriting and optimisation methods.

The development process described briefly above consisted mainly of the analysis of related solutions in terms of the thesis objectives and of the integration requirements with the rest of the virtual repository. The conclusions allowed stating the primary assumptions, realised in subsequent prototypes and continuously verified with experiments. The resulting integration process is completely transparent and enables bidirectional data transfer, i.e. querying a relational resource (retrieving data according to query conditions) and updating relational data. Moreover, bottom-level object-oriented data structures based directly on a relational resource can be easily transformed and filtered (with multiple cascaded object-oriented updateable views, subchapter 6.3 Sample Use Cases) so that they comply with the business model and goals of the general schema they are a part of. Besides the transparency aspects, most of the effort has been devoted to efficient optimisation procedures enabling the action of powerful native relational resource query optimisers. The prototype solution realising the above goals and functionalities is implemented in the Java™ language.
It is based on:
• the Stack-Based Approach (SBA), providing SBQL (Stack-Based Query Language), a query language with the complete computational power of regular programming languages, and updateable object-oriented views,
• JDBC, used for connecting to relational databases and manipulating their data; the technology is supported by almost all currently used relational systems,
• TCP/IP sockets, allowing communication between distributed system components,
• SQL, used for low-level communication with the wrapped databases.

1.3 History and Related Works

The art of building object-relational wrappers for integrating heterogeneous relational database resources into homogeneous object-oriented database systems has been developed for about fifteen years, struggling to join the much older theory of relational databases (almost forty years since its fundamentals were presented by Edgar F. Codd) with the relatively young object-oriented database paradigm. The term "object-oriented database system" first appeared in the mid-eighties of the last century and the concepts were manifested in [8]; however, the corresponding works had started more than ten years earlier.

The concept of mediators and wrappers was first formulated by Wiederhold in [9] as a set of indications for developing future information systems. The inspiration was the growing common demand for information induced by broadband networks and rapid Internet development, simultaneously obstructed by inefficient, fragmented, heterogeneous and distributed data sources. The main idea was to support decision-making software with a flexible infrastructure (expressed in terms of mediators and wrappers) capable of retrieving complete information without human interference. A mediator was defined as an autonomous piece of software capable of processing data from an underlying data source according to the general system requirements. Since mediators were considered resource-independent, they were supplied with wrappers: other software modules transparently interfacing between the resource and the mediator itself. Another very important concept was that mediators are not to be accessible in a user-friendly language, but in a communication-friendly one: the mediation process (i.e. creating the actual information based on various mediators and their data sources) is realised in the background, and it must be reliable and efficient. The user interface is realised by top-level applications, and it is these that should be user-friendly. The approach and the methodology presented in [9] had several implementations, e.g. Pegasus (1993), Amos (1994) and Amos II (developed since 2002), and DISCO (1997), described briefly in subchapter 2.2.1 Wrappers and Mediators. In the virtual repository the described wrapper solution is a part of, the mediation process relies on updateable object-oriented views based on the Stack-Based Approach. There also exist many other approaches aiming to "objectify" relational data; they are described in subchapters 2.2.2 ORM and DAO, 2.2.3 XML Views over Relational Data and 2.2.4 Applications of RDF. Nevertheless, their application in the thesis was not considered due to completely different assumptions and goals.
1.4 Thesis Outline

The thesis is subdivided into the following chapters:

Chapter 1 Introduction
The chapter presents the motivation for the thesis subject, the theses and the objectives, the description of the solutions employed, and a short description of the state of the art and the related works.

Chapter 2 The State of the Art and Related Works
The state of the art and the related works aiming to combine relational and object-oriented databases and programming languages (including mediator/wrapper approaches, ORM/DAO and XML) are briefly discussed here. Further, the wrapper location in the overall virtual repository architecture is presented.

Chapter 3 Relational Databases
The fundamentals of relational systems and their query optimisation methods and challenges are given, focused mainly on SQL, the most common relational query language.

Chapter 4 The Stack-Based Approach
The stack-based approach and SBQL, the corresponding query language, are presented, with the concept fundamentals and meaningful query examples. Also, the SBQL optimisation methods are introduced.

Chapter 5 Object-Relational Integration Methodology
The chapter presents the concept and the assumptions of the developed and implemented methodology for integrating relational resources into virtual repository structures. The schema transformation procedures are given, followed by the query analysis, optimisation and processing steps. The chapter is concluded with a conceptual example of the query processing methods applied.

Chapter 6 Query Analysis, Optimisation and Processing
A detailed description of the developed and implemented query analysis and transformation methods is given, followed by demonstrative examples based on the two relational test schemata introduced. The chapter ends with more complex examples of further schema transformations with object-oriented updateable views applicable in the upper levels of the virtual repository.

Chapter 7 Wrapper Optimisation Results
The chapter provides a discussion of the optimisation results for the prototype, based on a relatively large set of sample queries and various database sizes (for the schemata presented in the previous chapter).

Chapter 8 Summary and Conclusions
The conclusions and the future work that can be performed for further wrapper prototype development.

The thesis text is extended with three appendices describing the eGov-Bus project, the ODRA platform responsible for the virtual repository environment, and the wrapper prototype implementation issues.

Chapter 2 The State of the Art and Related Works

The art of object-oriented wrappers built on top of relational database systems has been developed for years; the first papers on the topic date back to the late 1980s and were devoted to federated databases. The motivation for the wrappers is reducing the technical and cultural difference between traditional relational databases and novel technologies based on object-oriented paradigms, including analysis and design methodologies (e.g., based on UML), object-oriented programming languages (e.g. C++, Java, C#), object-oriented middleware (e.g., based on CORBA), object-relational databases and pure object-oriented databases. Recently, Web technologies based on XML/RDF have also come to require similar wrappers.
Despite the big pressure on object-oriented and XML-oriented technologies, people are quite happy with relational databases, and there is little probability that the market will massively change soon to other data store paradigms (the costs and time required for such a migration are unpredictably huge). Unfortunately, object-orientation has as many faces as there are existing systems, languages and technologies. Thus, the number of combinations of object-oriented options with relational systems and applications is very large. Additionally, wrappers can have different properties: in particular, they can be proprietary to applications or generic, they can deal with updates or be read-only, they can materialise objects on the wrapper side or deliver purely virtual objects, they can deal with an object-oriented query language or provide some iterative "one-object-at-a-time" API, etc. [10]. This results in an extremely large number of various ideas and technologies. For instance, Google reports more than 500 000 results as a response to the query "object relational wrapper".

2.1 The Impedance Mismatch

The problem of the impedance mismatch between query languages and programming languages is very well recognised and documented, for example see [11]. The impedance mismatch term is used here as a mere analogy to the electrical engineering phenomenon. It refers to a set of conceptual and technical difficulties which are often encountered when an RDBMS is used by a program written in an object-oriented programming language or style, particularly when objects and/or class definitions are mapped in a straightforward way to database tables and/or relational schemata. The basic technical differences and programming difficulties between programming and query languages (mainly the most popular one, SQL) refer to [11]:
• Syntax – a programmer is obliged to use two different language styles and obey two different grammars,
• Typology – a query language operates on types defined in a relational database schema, while a programming language is usually based on a completely different type system, where nothing like a relation exists. Most programming languages are supplied with embedded static (compile-time) type checking, while SQL does not provide it (dynamic binding),
• Semantics and language paradigms – the concepts of language semantics are completely different. A query language is declarative (what to retrieve, not how to retrieve), while programming languages are imperative (how to do, instead of what to do),
• Pragmatics – a query language frees a programmer from many data organisation and implementation details (collection organisation, presence of indices, etc.), while in programming languages these details must be coded explicitly,
• Binding phases and mechanisms – query languages are interpreted (late binding), while programming languages assume early binding during compilation and linking phases. This causes many problems, e.g. for debuggers,
• Namespaces and scoping rules – query and programming languages have their own namespaces that can contain the same names with different meanings. Mapping between these namespaces requires additional syntactic and semantic tools.
A programming language namespace is hierarchical and obeys stack-based scoping rules; these rules are ignored by a query language, which causes many inconveniences,
• Null values – databases and query languages are supplied with dedicated tools for storing and processing null values; these means are not available to programming languages (or null is treated completely differently from the one in a database),
• Iterative schemata – in a query language, iterations are embedded in its operators' semantics (e.g. selection, projection, join), while in a programming language iterations must be realised explicitly with loops (e.g. for, while, repeat). Query result processing with a programming language requires dedicated facilities like cursors and iterators,
• Data persistence – query languages process only persistent (physically stored) data, while programming languages operate only on volatile (located in operating memory) data. Combining these languages requires dedicated language constructs to parameterise queries with language variables, and other language and architectural means for transmitting stored data to memory and back,
• Generic programming means – in a query language these means are based on reflection (e.g. in dynamic SQL). Using something similar in a programming language is usually impossible because of early binding. Other means are used instead, e.g. higher-level functions, casting, switching to a lower language level, polymorphism or templates.

SBQL, the query language applied in the virtual repository (described in Chapter 4), has the complete functionality of an object-oriented programming language and does not involve the impedance mismatch unless it is embedded in some programming language.

The impedance mismatch is also very often defined in terms of relational and object-oriented database systems and their query and programming languages, respectively, regarding mainly their diverse philosophical and cultural aspects [12, 13]. The most important differences pointed out are found in interfaces (relational data only vs. object-oriented "behaviour"), schema binding (much more freedom in an object structure than in strict relational tables) and access rules (a set of predefined relational operators vs. object-specific interfaces). To solve this mismatch problem, supporters of each paradigm argue that the other one should be abandoned. But this does not seem to be a good choice, as relational systems are still extremely popular and useful in many commercial, business, industrial, scientific and administrative systems (and the situation does not seem likely to change soon), while programmers are willing to use object-oriented languages and methodologies. Some reasonable solutions have been proposed, however, that can minimise the impact of the impedance mismatch on the further development of information systems. Preferably, object-oriented programmers should realise that a very close mapping between relational data and objects is erroneous and leads to many complications. Their object-oriented code should be closer to the relational model, as with the application of JDBC for Java (applied in the wrapper prototype implementation) or ADO.NET [14] for C#.
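As an illustration of such relational-style coding in Java, consider the following minimal JDBC sketch (the Emp table, its columns and the connection data are hypothetical). The declarative query stays embedded as an SQL string, the parameter is bound explicitly, and the result is consumed with an explicit cursor, i.e. exactly the points where the mismatch manifests itself:

import java.sql.*;

public class RelationalStyleAccess {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection data; any JDBC-enabled RDBMS would do.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/test", "user", "password");
             // The declarative SQL query is embedded as a string (the syntax mismatch).
             PreparedStatement ps = con.prepareStatement(
                 "SELECT name, salary FROM Emp WHERE salary > ?")) {
            ps.setInt(1, 1000); // parameters bridge the volatile/persistent data boundary
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) { // explicit cursor iteration (the iterative schemata mismatch)
                    System.out.println(rs.getString("name") + ": " + rs.getInt("salary"));
                }
            }
        }
    }
}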
2.2 Related Works

Below are presented some of the most important solutions of roughly the last fifteen years aiming to access (mainly) relational databases with object-oriented tools (preferably in distributed environments), grouped and classified by their methodologies and main approaches. Many of them are not strictly related to the subject of the thesis; nevertheless, their importance in the field cannot be neglected, and some of their experiences and concepts proved useful for the prototype implementation.

2.2.1 Wrappers and Mediators

The concept of wrappers and mediators over heterogeneous distributed data sources was introduced in [9] as a vision and indication for developing the information systems of the following years (appropriate solutions were usually not yet available and were assumed to appear within the next ten years). The idea appeared as a response to a growing demand for data and information, triggered by the evolution of fast broadband networks and still limited by distributed, heterogeneous and inefficient data sources. The knowledge that information exists and is accessible raises users' expectations, while the real-life experience that this information is not available in a useful form and cannot be combined with other pieces of information causes much confusion and frustration. The main reason for designing and implementing the proposed mediation systems is the heterogeneity of single databases, where a complete lack of data abstractions exists and no common data representation is available. This issue makes combining information from various database systems very difficult and awkward. The main goal of this architecture was to design mediators as relatively simple software modules transparently keeping specific data knowledge and sharing their data abstractions with higher-level mediators or applications. Therefore, in large networks, mediators should be defined on top of primitive (bottom-level) mediators, which in turn can mediate for lower modules and data sources. Mediators described in [9] were provided with some technical and administrative "knowledge" concerning how to process the data behind them, in order to achieve an effective decision process supported by a distributed, modular and composable architecture. They must work completely automatically, without any interference of human experts, producing complete information from the available data. The described information processing model of the proposed mediation system was based on existing human-based or human-interactive solutions (without copying or mimicking them, however). The mediation process is introduced in order to overcome the limitation of a database-application interface, where only a communication protocol and formats are defined: the interface does not resolve abstraction and representation problems. A proper interface must be active, and this functionality is defined as mediation. Mediation includes all the processing needed to make an interface work, the knowledge structures driving data transformations, and any intermediate storage (if required). Mediators defined as above should be realised as explicit modules acting as an intermediate active layer between user applications and data resources, in a sharable architecture, independent of the data resources. Meta-mediators are also defined, responsible for managing, allocating and accessing the actual mediators and data sources.
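The layering described above can be sketched in Java as follows; the interface names and signatures are purely illustrative (they are not taken from [9]), and the query translation step is reduced to a placeholder:

import java.util.List;

// Hypothetical interfaces illustrating the Wiederhold-style layering.
interface Wrapper {
    String executeNative(String nativeQuery); // talks to one resource, e.g. SQL over JDBC
}

interface Mediator {
    String query(String internalQuery); // the communication-friendly internal language
}

// A primitive (bottom-level) mediator drives a single wrapped resource.
class ResourceMediator implements Mediator {
    private final Wrapper wrapper;
    ResourceMediator(Wrapper wrapper) { this.wrapper = wrapper; }
    public String query(String q) {
        // Translation of the internal query into the resource language is a placeholder here.
        return wrapper.executeNative(q);
    }
}

// A higher-level mediator composes the abstractions of lower mediators.
class CompositeMediator implements Mediator {
    private final List<Mediator> lower;
    CompositeMediator(List<Mediator> lower) { this.lower = lower; }
    public String query(String q) {
        StringBuilder combined = new StringBuilder();
        for (Mediator m : lower) {
            combined.append(m.query(q)).append('\n'); // naive combination of partial answers
        }
        return combined.toString();
    }
}

In this picture, a meta-mediator would be the component responsible for locating and instantiating such mediators for the requested data sources.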
The concept of splitting a database-application interface with a mediator layer makes two additional interfaces appear: between a database server and a mediator, and between a mediator and a user application. Since programmers tend to stick to one interface language and are unwilling to switch to any other unless the current one becomes somehow inconvenient or ineffective, access to a mediator must be defined in a universal, high-level, extensible and flexible language. The proposal does not suggest any particular language, only its recommended features. A strong assumption is made that such a language does not have to be user-friendly, but machine- and communication-friendly, as it works in the background: any user interaction is defined on the application level and is completely separated from the mediation process. This assumption allows avoiding the problems and inadequacies introduced to SQL [15]. On the other side of a mediator, the interaction with a data source takes place. The communication can be realised with any common interface language supported by a resource, e.g., SQL for relational databases. A mediator covering several data sources can also be equipped with appropriate knowledge on how to combine, retrieve and filter data. The most effective mediation can be achieved if a mediator serves a variety of applications, i.e. it is sharable. On the other hand, applications can compose their tasks based on a set of underlying mediators (unavailable information may motivate creating new mediators). The simplified mediation architecture is presented in Fig. 1.

[Fig. 1 Mediation system architecture: user applications communicate with top-level mediators in a user-friendly external language; mediators communicate among themselves in a communication-friendly internal language and reach the resources through wrappers speaking the respective resource languages.]

The general mediation concept presented briefly above has seen several implementations; the most important ones are described shortly in the following subsections. Its intentions and architectural aspects are also reflected in the virtual repository idea (subchapter 2.3 The eGov-Bus Virtual Repository).

2.2.1.1 Pegasus

Pegasus [16, 17] was a prototype multidatabase (federated) system. Its aim was to integrate heterogeneous resources into a common object-oriented data model. Pegasus accessed foreign systems with mapper and translator modules. Mappers were responsible for mapping a foreign schema into a Pegasus schema, while translators processed top-level queries expressed in an object-oriented query language (HOSQL, HP AllBase SQL, an Iris query language [18, 19]) into native queries applicable to an integrated resource. A mapping procedure generated a Pegasus schema covering a resource so that it was compatible with the common database model (a function realised by object-oriented updateable views in the thesis) and defined the data and operations available from the resource. Dedicated mappers and translators were designed for different resources and data models, including relational databases, and were available as separate modules (mapper/translator pairs). The mapping of relational databases served three different cases: with both primary and foreign keys given, without foreign keys, and without primary keys [20].
In the first case, a user-defined Pegasus type was created for each relation that had a simple primary key. Then, for each user-defined type and for every non-key attribute in the associated relation, a function was created with the type as argument and the attribute as result. For relations with composite primary keys, multi-argument functions were created with the primary key fields as arguments and the attribute as result; no type creation was needed in this mapping. If an attribute in the mapped relation was a foreign key, it was replaced with the type corresponding to that key. In the case of no foreign keys being available, user-defined types could not be created, since the attributes that refer to a primary key field could not be substituted by its associated type. However, a functional view of the underlying relations was created in a manner similar to the previous case. In the last case (no primary key information), as functional dependencies of any kind were not known, only a predicate function was created for each relation, where all the attributes of the relation formed the arguments of the function. A predicate function took one or more arguments and returned a boolean result.

An input query expressed in terms of a Pegasus schema with HOSQL was parsed into an intermediate syntax tree (F-tree) with the functional expressions (calls, variables, literals) of HOSQL as nodes. The F-tree was then transformed into a corresponding B-tree with catalogue information – for foreign schema queries (e.g., targeting a relational database) this information included all the necessary connection parameters (a database name, a network address, etc.). If such a query aimed at a single foreign system, it was optimised by Pegasus and sent directly to the resource (a minimum of processing at the Pegasus side). In the case of a query aiming at multiple foreign schemata, its parts were grouped by destination in order to reduce the number of resource invocations (an optimisation issue), a process which resulted in a D-tree structure. Based on statistical information, Pegasus performed a cost-based optimisation (join order, join methods, join sites, intermediate data routing and buffering, etc.) within such a D-tree. Once optimised, a query was translated into a foreign resource query language and executed.

2.2.1.2 Amos

Amos (Active Mediator Object System), another early work devoted to building object-oriented views over relational databases and querying these databases with object-oriented query languages (building on the Pegasus experience and the mediator approach [9]), has been described in [21]. In the system, a user interacts with a set of object-oriented views that translate input queries into a relational query (or a set of queries) executed on the covered resource. A returned result is converted back (possibly composed from a set of partial results of single relational queries) into an object-oriented form and presented to the user. Besides covering a relational resource, such a view was also capable of storing its own purely object-oriented data and methods; therefore, input queries could transparently combine relational and object references. One of the strengths of these views was a query optimisation based on an object and relational algebra and calculus. The most important weakness was a lack of support for relational data updates.
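To fix the intuition, the following is an illustrative Java sketch (not Amos code; the schema and all names are hypothetical) of the translation such a view performs: an object-level predicate is rewritten into SQL, executed on the wrapped resource, and the returned tuples are converted one-to-one into objects.

import java.sql.*;
import java.util.*;

// Hypothetical sketch of an object-oriented view over one relational table.
class EmployeeView {
    static class Employee {          // object-oriented representation of a tuple
        final int id; final String surname; final int salary;
        Employee(int id, String surname, int salary) {
            this.id = id; this.surname = surname; this.salary = salary;
        }
    }

    private final Connection connection;   // connection to the wrapped RDBMS
    EmployeeView(Connection connection) { this.connection = connection; }

    /** Object-level query "employees earning more than min", pushed down as SQL. */
    List<Employee> salaryAbove(int min) throws SQLException {
        String sql = "SELECT id, surname, salary FROM employees WHERE salary > ?";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setInt(1, min);
            try (ResultSet rs = ps.executeQuery()) {
                List<Employee> result = new ArrayList<>();
                while (rs.next()) {  // one-to-one mapping: tuple -> object
                    result.add(new Employee(rs.getInt("id"),
                            rs.getString("surname"), rs.getInt("salary")));
                }
                return result;
            }
        }
    }
}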
Object-oriented views were regarded as a means to integrate a relational resource into multidatabase (federated) systems based on an object-oriented common data model (CDM) [22]. This approach also assumed applying views to the general case of mapping any heterogeneous component database into such a system, an approach also reflected in the thesis' virtual repository architecture. The prototype was implemented with the Amos [23] object-oriented database system over the Sybase RDBMS. An object-oriented view was designed over an existing relational database to express its semantics in terms of an object-oriented data model. At this mapping stage, there was an assumption that a relational database was directly mappable to an object-oriented model; otherwise, a set of relational views (or other modifications) was required for a relational resource to enable such a mapping. After this mapping procedure, the system was ready for querying. A query evaluation methodology for the described system was developed as an extension of the conventional methodology [24] of comparing query cost plans, with a translator introduced. The translator was a piece of middleware responsible for replacing predicates during view expansion with their base relational expressions (in general, a one-to-one mapping between relational tuples and Amos objects).

2.2.1.3 Amos II

Amos II [25, 26] is the current continuation of the Amos project shortly described above: a prototype object-oriented peer mediator system with a functional data model and a functional query language (AmosQL), based on the wrapper/mediator concept introduced in [9]. The solution enables interoperation among heterogeneous, autonomous, distributed database systems. Transparent views used in Amos II mediators enable covering other mediators, wrapping data sources and native Amos objects in a modular and composable way. The Amos II data manager and its query processor enable defining additional data types and operators with some programming languages (e.g., Java or C).

Amos II is based on the mediator/wrapper approach [9] with mediator peers communicating over the Internet; it rejects a centralised architecture with a single server responsible for the resource integration process and for translating data into a CDM. Each mediator peer works as a virtual database with data abstraction and a query language. Applications access data from distributed data sources through queries to views in mediator peers. Logical composition of mediators is achieved when multidatabase views in mediators are defined in terms of views, tables, and functions in other mediators or data sources.

Due to the assumed heterogeneity of the integrated data sources, Amos II mediators can be supplied with one or more wrappers processing data from the covered resources, e.g., ODBC relational databases, XML repositories, CAD systems, Internet search engines. In terms of Amos II, a wrapper is a specialised facility for query processing and data translation between an external resource and the rest of the system. It contains an interface to a resource and information on how to efficiently translate and process queries targeting its resource. In the Amos II architecture, external peers are also regarded as external resources with a dedicated wrapper and special query optimisation methods based on the distribution, capabilities, costs, etc.
of the different peers [27].

Amos II assumes an optimisation of AmosQL queries prior to their execution. This procedure is based on an object calculus and an object algebra. First, a query is compiled and decomposed into algebra expressions. The object calculus is expressed in an internal, simple logic-based language called ObjectLog [28], which is an object-oriented dialect of Datalog [29, 30]. AmosQL optimisation rules rely on its functional and multidatabase properties. Distributed multidatabase queries are decomposed into local queries executed on the appropriate peers (load balancing is also taken into account). At each peer, a cost optimiser based on statistical estimates is further applied. Finally, the optimised algebra is interpreted to produce the final result of a query.

Fig. 2 Distributed mediation in Amos II [26]

Since Amos II is a distributed system, it introduces a distributed mediation mechanism, where peers communicate and interoperate with the TCP/IP protocol. The mediation process is illustrated in Fig. 2, where an application accesses data from two different mediators over heterogeneous distributed resources. Peer communication is depicted with thick lines (the arrows indicate peers acting as servers), and the dotted lines show the communication with a name server used to register a peer and to obtain information on a peer group. The name server is a mediator server keeping a simple meta-schema of a peer group (e.g., names, locations, etc.; the schema of each peer is kept and maintained locally at its side). The information in the name server is managed without explicit operator intervention; its content is managed through messages from the autonomous mediator peers. Mediator peers usually communicate directly, without involving the name server, which is used only when a new mediator peer connection is established. A peer can communicate directly with any other peer within its group (Fig. 2, however, shows communication between different topological peer levels). An optimal communication topology is established by the optimisation process for a particular query (individual peer optimisers can also exchange data and schema information to produce an optimised execution plan of a query).

2.2.1.4 DISCO

DISCO (Distributed Information Search COmponent) [31, 32] was designed as a prototype heterogeneous distributed database built over underlying data sources according to the mediator/wrapper paradigm [9]. It addressed common problems of such environments, i.e. inconsistencies and incompatibilities of data sources and the unavailability and disappearance of resources. Moreover, much attention was paid to cost-based query optimisation among wrappers. Besides regular databases, file systems and information retrieval systems (multimedia and search engines), DISCO was oriented towards WWW resources (e.g., HTML documents). Its main application was the search and integration of data stored in distributed heterogeneous resources; the solution was to provide uniform and optimised information access based on a common declarative query language. In the mediator-based architecture, end users interacted with an application which in turn accessed a uniform representation of the underlying resources with a SQL-like declarative query language. These resources were covered by mediators encapsulating a representation of multiple data sources for this query language. Mediators could be developed independently or combined, which made it possible to deal with the complexities of distributed data sources.
Queries entering mediators, expressed in an algebraic language supporting relational operations, were transformed into subqueries distributed to the appropriate data sources. Wrappers, interfacing between mediators and the actual data sources, represented resources as structured views. They accepted (sub)queries from mediators and translated them into the appropriate query language of a particular wrapped resource. In the other direction, query results were reformatted so that they were acceptable to the wrappers' mediators.

Each type of data source required implementing a dedicated wrapper. A wrapper had to contain a precise definition of the data source's capabilities so that a subset of the algebraic language supported by the wrapper could be chosen (resource heterogeneity). On registration of a wrapper by a mediator, the wrapper transmitted the description of the subset it could support to the mediator. This information was automatically incorporated by the mediator in the query transformation process.

As mentioned above, cost-based optimisation was one of the most important issues in DISCO. As in the case of defining the language subset available via a wrapper, optional cost information was also included at the wrapper implementation stage. This information concerned selected or all of the algebraic operations supported by a wrapper. Again, the cost information was transmitted to a mediator when registering a wrapper. This wrapper-specific cost information overrode the general cost information used by a mediator, so that a more accurate cost model could be used. DISCO's cost-based optimiser was used to produce the best possible query evaluation plan.

The problem of resource unavailability/disappearance was solved at the level of partial query result composition. A partial answer to a query, being a part of a final result, was produced by the currently available data sources. It also contained a query representing the finished or unfinished parts of the answer. When a previously unavailable data source became accessible, a partial answer could be resubmitted as a new query to obtain the final answer to the original query.

Wrapper-mediator interaction occurred in two phases: first, when a wrapper was registered by a mediator (a mediator could register various wrappers), and second, during query processing. During registration, a wrapper's local schema, its query processing capabilities and its specific cost information were supplied to the mediator. A mediator contained a global schema defined by its administrator (this global schema is seen by applications) and a set of views defining how to connect the global schema to the local schemata. During the query processing phase, a query from an application was passed to a mediator transforming it into a plan consisting of subqueries and a composition query (information on how to produce the final result at the mediator). This plan was optimised with respect to the information provided by the wrappers in the integration stage. Then, the resulting plan was executed by issuing subqueries to the appropriate available wrappers, which evaluated them on their underlying resources and returned partial results. In the ideal case, when all the wrappers were available, a mediator combined their partial answers according to the composition query and returned the final result to the application. If some wrappers were unavailable, however, a mediator returned only a partial answer.
The application could extract some information from it, depending on which wrappers were accessible.

2.2.1.5 ORDAWA

ORDAWA (Object-Relational Data Warehousing System) [33] was an international project whose aim was to develop techniques for the integration and consolidation of different external data sources in an object-relational data warehouse. The detailed issues consisted of the construction and maintenance of materialised relational and object-oriented views, index structures, query transformations and optimisations, and techniques of data mining. The ORDAWA architecture [33] is realised as a customised wrapper/mediator approach. The object-oriented views with wrappers constitute mediators transforming data from bottom-level resources. The integrator also works as a top-level mediator based on the underlying mediators. Any further description of ORDAWA is out of the scope of the thesis, since it assumes data materialisation, an approach strictly rejected here; it is mentioned only as another implementation of the wrapper/mediator approach.

2.2.2 ORM and DAO

ORM (Object-Relational Mapping) [34, 35, 36, 37] is a programming technique for converting data between incompatible type systems in databases and object-oriented programming languages. In effect, this creates a "virtual object database" which can be used from within the programming language. In common relational DBMSs, stored values are usually primitive scalars, e.g., strings or numeric values. Their structure (tables and columns) creates some logic that in object-oriented programming languages is commonly realised with complex variables. ORM aims to map these primitive relational data onto complex constructs accessed and processed from an object-oriented programming language. The core of the approach is to enable bidirectional, transparent transformations between persistent data (stored in a database) with their semantics and transient objects of a programming language, so that data can be retrieved from a database and stored there as results of programme executions. An object-relational mapping implementation needs to systematically and predictably choose which relational tables to use and generate the necessary SQL code. Unfortunately, most ORM implementations are not very efficient. The reasons are usually the extra mapping operations and memory consumption occurring in the additional middleware layer introduced between a DBMS and an end-user application.

On the other hand, DAO (Data Access Object) [38, 39] intends to map programming language objects to database structures and persist them there. Originally the concept was developed for Java, and most applications are created for this language. However, an early attempt to build a DAO architecture, referred to as an "object wrapper" (the term DAO appeared later), was shown in [40], while a general architecture and constraints for building object-oriented business applications over relational data stores with persistence issues were extensively described and discussed in [41, 42, 43].
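As a concrete illustration of the DAO direction, the following is a minimal Java sketch (the schema and all names are hypothetical; plain JDBC is used instead of any particular framework) persisting a transient object into a relational table.

import java.sql.*;

// Minimal DAO-style sketch: a transient Java object is persisted as a
// relational tuple, i.e. the objects-to-tables direction described above.
class EmployeeDao {
    static class Employee {          // transient programming-language object
        int id; String surname; int salary;
        Employee(int id, String surname, int salary) {
            this.id = id; this.surname = surname; this.salary = salary;
        }
    }

    private final Connection connection;
    EmployeeDao(Connection connection) { this.connection = connection; }

    /** Persists the object by generating the corresponding SQL statement. */
    void save(Employee e) throws SQLException {
        String sql = "INSERT INTO employees (id, surname, salary) VALUES (?, ?, ?)";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setInt(1, e.id);
            ps.setString(2, e.surname);
            ps.setInt(3, e.salary);
            ps.executeUpdate();
        }
    }
}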
The most important differences between ORM and DAO, with their advantages and disadvantages, are very well depicted in [44]. The crucial distinction appears at the very beginning of design: in ORM a database schema already exists and the application code is created according to it (tables-to-objects mapping), while for DAO the application logic is established first and then the appropriate database persistence structures are developed (objects-to-tables mapping). However, in real-life implementations both approaches are very often mixed in favour of API or application usability and features. Below, the most common and important approaches to ORM and DAO are shortly described, mainly for Java (due to the prototype implementation technology); a more complete list of ORM solutions for different platforms and languages can be found in [45].

2.2.2.1 DBPL Gateway

An early project concerning the ORM approach was DBPL (DataBase Programming Languages) [46], a Modula-2 [47] extension provided with a gateway to relational databases (Ingres [48] and Oracle [49]). The API extension concerned a new bulk data type constructor "relation", persistence, and high-level relational expressions (queries) based on the nested relational calculus, maintaining strong typing and orthogonality. The gateway itself was a mechanism enabling regular DBPL programmes to access relational databases, i.e. no SQL statement strings were embedded in the code, thus creating transparent access to databases. DBPL constructs referring to relational data were automatically converted into the corresponding SQL expressions and statements referring to relational tables. The motivation for designing such a solution was the well-known impedance mismatch between programming languages and query languages, the misleading user-friendliness of query languages (mainly SQL), and the bottom-up evolution of query languages resulting in awkward products like Oracle PL/SQL. A user worked on relational data as on persistent DBPL collections transparently mapped onto relational structures. The SQL queries resulting from the background transformations were optimised by SQL optimisers, therefore the general performance was acceptably good. The conceptual and implementation details of the gateway are described in [46].

2.2.2.2 EOF

EOF (Enterprise Objects Framework) [50], developed by NeXT Software, Inc. (taken over by Apple Computer in 1997), is often considered the best ORM implementation. However, the overall impact of EOF on the development of this technology was rather small, as it was strictly integrated with OpenStep [51], NeXT's object-oriented API specification for operating systems, realised within OPENSTEP. EOF was also implemented as a part of WebObjects, the first object-oriented Web application server. Currently, it provides a background technology for Apple's e-commerce solutions, e.g., the iTunes Music Store. Apple's EOF provides two current implementations: with Objective-C [52] (Apple Developer Tools) and with Java (WebObjects 5.2). EOF has also been used as a basis for the open-source Apache Cayenne, described shortly below.

2.2.2.3 Apache Cayenne

Cayenne [53] is an open-source ORM implementation for Java inspired by EOF and provided with remote services support. It is capable of transparently binding one or more database schemata to Java objects, managing transactions, generating SQL, performing joins, sequences, etc.
The Remote Object Persistence facility enables persisting Java objects with native XML serialisation or via Web Services – such persistent objects can be passed to non-Java clients, e.g., Ajax-supporting [54] browsers. Java object generation in Cayenne is based on the Velocity [55] templating engine, and the process can be managed with a GUI application.

2.2.2.4 IBM JDBC wrapper

A simple ORM solution proposed by IBM [56] is based on JDBC (Java DataBase Connectivity). Relational structures (databases, tables, rows, row sets and results) are mapped onto instances of appropriate Java classes. Particular structures are accessible by their names or by SQL-like predicates given as string parameters to Java methods (including constructors). This simplified approach again enables working with relational databases from within a programming language.

2.2.2.5 JDO

JDO (Java Data Objects) [57, 58] is a specification of Java object persistence. The main application of JDO is persisting Java objects in a database rather than accessing a database from a programming language (as described in the above approaches) – JDO is a DAO approach. Object persistence is defined in external XML metafiles, which may have vendor-specific extensions. Currently, JDO vendors offer several options for persistence, e.g., to RDBMSs, to OODBMSs, to files [59]. JDO is integrated with Java EE in several ways. First of all, the vendor implementation may be provided as a JEE Connector. Secondly, JDO may work in the context of JEE (Java Enterprise Edition) transaction services. The most common supporter and implementer of JDO is Apache JDO [60]; other commercial and non-commercial implementations (e.g., Apache OJB [61], XORM [62], Speedo [63]) are listed in [64].

2.2.2.6 EJB

The EJB (Enterprise Java Beans) [65, 66, 67] specification intends to provide a standard way to implement the back-end "business" code typically found in enterprise applications (as opposed to "front-end" user-interface code). Such code was frequently found to reproduce the same types of problems, whose solutions were often repeatedly re-implemented by programmers. Enterprise Java Beans were intended to handle such common concerns as persistence, transactional integrity and security in a standard way, leaving programmers free to concentrate on the particular problem at hand. Following EJB2, the EJB3 specification covered persistence issues; however, there were some crucial differences from the JDO specification. As a result, there are several implementations of JDO [64], while EJB3 was still under development. The situation became even more unfavourable for EJB, as a new standard for Java persistence was being introduced (JPA, described below) and persistence was excluded from the EJB3 core.

2.2.2.7 JPA

JPA (Java Persistence API) [68, 69] is specified in a separate document within the EJB specification [66]. The javax.persistence package does not require the EJB container and therefore can be applied in a J2SE (Java 2 Standard Edition) environment (still required by JDO), according to the POJO (Plain Old Java Object) concept. JPA differs from JDO even more: it is a classical ORM standard, not transparent object persistence, and it is independent of the technology of the underlying data store.

2.2.2.8 Hibernate

Hibernate [70] is a DAO solution for Java and .NET [71] (NHibernate).
Similarly to JDO, Hibernate is not a plain ORM solution, as its main application is persisting programming language objects. It allows developing persistent classes according to the object-oriented paradigm, including association, inheritance, polymorphism, composition, and collections. As for querying, Hibernate provides its own portable query language (HQL), an extension to SQL; it also supports native SQL and an object-oriented Criteria and Example API (QBC and QBE). Therefore, data source optimisation can be applied.

2.2.2.9 Torque

Torque [72] is an ORM for Java. It does not use the common approach of reflection to access user-provided classes – Java code and classes (including data objects) are generated automatically based on an XML relational schema description (generated or created manually). This XML schema can also be used to generate and execute SQL scripts creating a database schema. In the background, Torque transparently uses JDBC and database-specific implementation details; therefore, a Torque-based application is completely independent of a database – Torque does not use any DBMS-specific features and constructs.

2.2.2.10 CORBA

CORBA (Common Object Request Broker Architecture) [73, 74] is a standard defined by the OMG (Object Management Group) [75] enabling software components written in multiple computer languages and running on multiple computers to work together. Due to CORBA's application in distributed systems, object persistence (in terms of DAO) seems a very important issue, unfortunately not defined in the standard itself. Some attempts have been made to achieve persistence in object-oriented databases (e.g., [76]) and relational ones (e.g., [77]). Nevertheless, these solutions seem somewhat exotic and are not very popular. A detailed discussion of persistence and ORM issues for CORBA is contained in [78].

2.2.3 XML Views over Relational Data

Using XML (eXtensible Markup Language) [79] for processing relational databases in object-oriented systems can be regarded as an extension of ORM, but its technique is completely different from the ones mentioned above. The concept of presenting relational data as XML appeared because XML can reflect relational structures easily (with a single or multiple documents); it is portable and can be further queried (for example, with XQuery [80, 81]) or processed in an object-oriented manner.

The most straightforward approach aims to build XML views over relational databases (over 1 000 000 Google hits for "xml relational view"), preferably with data updating capabilities. There are many works devoted to this subject with successful implementations (much fewer realising relational data updates, however).

A dedicated solution for designing XML views over relational and object-relational data sources is XPERANTO (Xml Publishing of Entities, Relationships, ANd Typed Objects) [82, 83, 84] by IBM. The goal of XPERANTO was to provide XML views with XQuery querying capability, with no need to use SQL, i.e. without the user's knowledge of the actual data model. The system translated XML-based queries into SQL requests executed on its resource, received regular SQL results and transformed them into XML documents presented to an end-user application.
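Although XPERANTO itself used internal middleware operators, the general XML-publishing idea can be illustrated with the later-standardised SQL/XML functions (XMLELEMENT, XMLATTRIBUTES; SQL:2003), which let the RDBMS construct the XML so that its optimiser still sees an ordinary relational query. The sketch below is illustrative only; the schema is hypothetical and is not XPERANTO's actual interface.

import java.sql.*;

// Illustration of XML publishing of relational data via SQL/XML (SQL:2003).
class XmlPublishingExample {
    static void printEmployeesAsXml(Connection connection) throws SQLException {
        String sql =
            "SELECT XMLELEMENT(NAME \"employee\", " +
            "XMLATTRIBUTES(e.id AS \"id\"), " +
            "XMLELEMENT(NAME \"surname\", e.surname)) " +
            "FROM employees e WHERE e.salary > 1200";
        try (Statement st = connection.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // <employee id="...">...</employee>
            }
        }
    }
}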
XPERANTO's intention was to push as much of the query evaluation as possible down to the resource level, so that its SQL optimisers could work and XML-level processing was minimised. Another similar project was XTABLES [85]. Updating relational data from XML views was solved in UXQuery [86, 87], based on a subset of XQuery [80, 81] used for building views, so that an update to a view could be unambiguously translated into a set of updates on the underlying relational database, assuming that certain key and foreign key constraints held. Another implementation [88] of updateable XML views for relational databases was created for the CoastBase [89, 90] project within the IST programme of the European Commission (contract no: IST-1999-11406). A concept of triggers for updating was introduced in Quark [91] (built over IBM DB2) [92].

2.2.4 Applications of RDF

Alternative approaches based on RDF (Resource Description Framework) [93] and its triples (the subject-predicate-object concept) have also been adapted for accessing relational resources in an object-oriented manner.

2.2.4.1 SWARD

SWARD (Semantic Web Abridged Relational Databases) [94, 95] (a part of the Amos II project) is a wrapper to relational databases realised with the RDF format and technologies. A user defines a wrapper exporting a chosen part of a database as a view in terms of the RDF metadata model. This view, automatically generated from two simple mapping tables, can be queried with either SQL or RDQL [96]. SWARD is a system based on the virtual repository concept, enabling scalable SQL queries to RDF views of large relational databases storing government documents and life event data. Such a view enables querying both relational metadata and stored data, which supports scalable access to large repositories. An RDF view of a relational database is defined as a large disjunctive query. For such queries, optimisation is critical not only with respect to the data access time but also to the time needed to perform the query optimisation itself. SWARD is supplied with novel query optimisation techniques based on query rewriting and compile-time evaluation of subexpressions, enabling the execution of real-world queries on RDF views of relational databases.

The process of mapping relational structures to RDF reflects tables in classes and columns in properties, according to a given ontology. A relational database is represented as a single table of RDF triples called a universal property view (UPV). The UPV is internally defined as a union of property views, each representing one exported column of a relational database (a content view) and a representation of relational metadata (a schema view). The generation of the UPV is automatic, provided that for a given relational database and ontology a user specifies a property mapping table declaring how to map the exported relational columns to properties of the ontology. Further, the user specifies a class mapping table declaring the URIs of the exported relational tables.

2.2.4.2 SPARQL

SPARQL (SPARQL Protocol and RDF Query Language, a recursive acronym) [97, 98, 99] is an RDF query language and protocol for the Semantic Web. There are some works aiming to use its paradigms for accessing relational data (e.g., [100, 101, 102, 103, 104]); however, the concept and its real-life applications are currently difficult to evaluate. It is mentioned here as another approach to the issue.
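The table-to-triples mapping underlying these RDF approaches can be sketched as follows: each exported column value yields one (subject, predicate, object) triple, and the union of such property views forms a SWARD-like universal property view. All URIs and the schema in this Java sketch are hypothetical.

import java.sql.*;

// Sketch of exporting one relational column as RDF triples.
class TripleExportExample {
    static void exportSurnames(Connection connection) throws SQLException {
        String base = "http://example.org/";          // hypothetical namespace
        String sql = "SELECT id, surname FROM employees";
        try (Statement st = connection.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                String subject = base + "employee/" + rs.getInt("id");  // row URI
                String predicate = base + "ontology#surname";           // column property
                String object = "\"" + rs.getString("surname") + "\"";  // literal value
                System.out.println("<" + subject + "> <" + predicate + "> " + object + " .");
            }
        }
    }
}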
2.2.5 Other Approaches

There is also a set of other approaches to the issue of integrating object-oriented systems with relational data stores that cannot be classified into any of the above categories. Some of them are described below.

2.2.5.1 ICONS

The ICONS (Intelligent CONtent management System) project [105] was supported by the European Community's "Information Society Technology" research and realised under the Fifth Framework Programme (contract no: IST-2001-32429). The ICONS project focused on bringing together advanced research results, technologies and standards into a coherent, web-based system architecture, in order to develop and further exploit a knowledge-based, multimedia content management platform, integrating and extending known results from the AI and database management fields combined with advanced features of the emerging information architecture technologies. As ICONS's result, a prototype solution was developed and implemented in the Structural Fund Project Knowledge Portal [106, 107], published on the Internet.

The overall project objective was to integrate and extend the existing research results and standards in the area of knowledge representation as well as the integration of pre-existing, heterogeneous information sources. The knowledge representation research covered such paradigms as logic (disjunctive Datalog [29, 30]), semantic nets and conceptual modelling (UML [108] semantic data models and the RDF standard [93]), as well as procedural knowledge represented by directed graphs (WfMC [109]). The ICONS prototype managed an XML-based multimedia content repository, storing complex information objects and/or representations (proxies) of external information resources such as pre-existing, heterogeneous databases, information processing system outputs and Web pages, as well as the corresponding domain ontologies. ICONS assumed using existing data and knowledge resources, including heterogeneous databases, legacy information processing systems and Web information sources. The overall integration architecture with its information management and processing methodologies [110, 111, 112] is completely out of the scope of the thesis, except for the techniques of accessing heterogeneous database systems.

ICONS assumed a simplified object-oriented model having virtual non-nested objects with repeated attributes and UML-like association links used for navigation among objects. Both repeated attributes and links were derived from the primary-foreign key dependencies in the relational database by a parameterisation utility. The ICONS repository was available as an API for Java (without the use of JDBC). However, because Java has no query capabilities, all the programming had to be done through sequential scanning of collections of objects. Obviously, this gave no chance to the SQL optimiser, hence initially the performance was extremely bad. It was improved by some extensions, for instance, by special methods with conditions as parameters that were mapped into SQL where clauses, but in general the experience was disappointing.

2.2.5.2 SDO

SDO (Service Data Objects) [113, 114, 115] are a new approach to a programming model unifying data access and manipulation techniques for various data source types.
According to SDO, programming procedures, tools, frameworks and the resulting applications should therefore be "resource-insensitive", which much improves design, coding and maintenance. SDO are implemented as a graph-based framework similar to EMF (Eclipse Modelling Framework) [116] (its detailed description can be found in [115] and is out of the scope of the thesis). The framework defines data mediator services (DMS; not declared in the specification [117], however), which are responsible for interactions with data sources (e.g., JDBC, EJB entities [65], XML [79]). Data sources in terms of SDO are not restricted to persistent databases, and they contain their own data formats. A data source is accessed directly only by an appropriate DMS, not by the application itself, as the application works only with data objects in data graphs provided by a DMS. These data objects are the fundamental components of the framework, corresponding to service data objects in the specification [117]. Data objects are generic and provide a common view of structured data built by a DMS. While a JDBC DMS, for instance, needs to know about the persistence technology (for example, relational databases) and how to configure and access it, SDO clients need not know anything about it. Data objects hold their "data" in properties. Data objects provide convenience creation and deletion methods (like createDataObject() with various signatures and delete()) and reflective methods to get their types (instance class, name, properties, and namespaces). Data objects are linked together and contained in data graphs.

2.3 The eGov-Bus Virtual Repository

One of the goals of the eGov-Bus project (described in Appendix A) is to expose all the data as a virtual repository whose schema is shown in Fig. 3. The central part of the system is the ODRA database server, the only component accessible to the top-level users and applications, presenting the virtual repository as a global schema. The server virtually integrates data from the underlying resources, available here only as SBQL view definitions (described in subchapter 4.3 Updateable Object-Oriented Views), based on a global integration schema defined by the system administrator and a global index keeping resource-specific information (e.g., data fragmentation, redundancy, etc.). The integration schema is another SBQL view (or a set of views) combining the data from the particular resources (also virtually represented as views, as mentioned) according to the predefined procedures and the global index contents. These contents determine a resource's location and its role in the global schema available to the top-level users. The resource's virtual representation is referred to as a contributory view, i.e. another SBQL view covering it and transforming its data into a form compliant with the global schema. A contributory view must comply with the global view; actually, it is defined as its subset.

The user of the repository sees the data exposed by the systems integrated by means of the virtual repository through the global integration view. The main role of the integration view is to hide the complexities of the mechanisms involved in accessing the local data sources. The view implements CRUD behaviour, which can be augmented with logic responsible for dealing with horizontal and vertical fragmentation, replication, network failures, etc.
Thanks to the declarative nature of SBQL, these complex mechanisms can often be expressed in one line of code. The repository has a highly decentralised architecture. In order to get access to the integration view, clients do not send queries to any centralised location in the network. Instead, every client possesses its own copy of the global view, which is automatically downloaded from the integration server after successful authentication to the repository. A query executed on the integration view is to be optimised using such techniques as rewriting, pipelining, global indexing and global caching.

Fig. 3 eGov-Bus virtual repository architecture [205]

The currently considered bottom-level resources are relational databases (the thesis' focus), RDF resources, Web service applications and XML documents. Each type of such a resource requires an appropriate wrapper capable of communicating with both the upper ODRA database (described in Appendix B) and the resource itself (except for XML documents, currently imported into the system). Such a wrapper works according to the early concept originally proposed in [9], while the contributory views are used as mediators performing the appropriate data transformations and standing for the external resources' representations (visible only within the system, however, not above the virtual repository). Since SBQL views can be created only over SBA-compliant resources, the wrapper must be accessible in this way. This goal is achieved by utilising a regular ODRA instance whose metadata (schema) are created by the underlying wrapper based on some resource description, but without any physical data. The data are retrieved by the wrapper only when some query arrives and its result is to be returned (after transformations performed by the contributory view) to the upper system components. Similarly, the data disappear when they are not needed anymore (e.g., when a transaction finishes or dies).

2.4 Conclusions

The wide selection of solutions presented above contains both historical approaches with their implementations (focused mainly on the mediation concept) and contemporary techniques of integrating relational resources into modern object-oriented data processing systems. Actually, the issue of object-relational integration discussed in the dissertation can be divided into two separate, though in many fields overlapping, parts. The first one is focused on how to integrate heterogeneous, distributed and redundant resources into one consistent global structure (here: the virtual repository), which is the mediation process. The other part of the problem is how to access the resources so that they can transparently contribute their data to the global structure, which should be regarded as wrapping. The mediation procedure does not rely on bare resources but on their wrappers instead. Such a wrapped resource is provided with a common interface accessible to mediators, and its actual nature can be neglected in the upper-level mediation procedures. The mediation and integration procedures must rely on some effective resource description language used for building the low-level schemata combined into the top-level schema available to users, the only one seen by them. In the thesis, this feature is implemented by updateable object-oriented views based on SBQL.
SBQL is further employed for querying the resources constituting the virtual repository. The large set of solutions possibly applicable for wrapping relational resources can be referred to as ORM/DAO. Unfortunately, their relatively poor performance (proved by the author's previous experience and additional preliminary tests not included in the thesis) does not allow building an efficient wrapper for the virtual repository. Again, any intermediate data materialisation (to XML, for example) was rejected as unnecessary and contradictory to the eGov-Bus project requirements. Instead, a dedicated client-server architecture was designed and effectively implemented, based on relatively low-level wrapper-resource communication with the resource's native query language.

Chapter 3 Relational Databases

The fundamentals of current relational DBMSs were defined in [118] and [119] (System-R [120, 121]). The idea presented in [118] was intended to create a simple, logical, value-based data model where data were stored in tuples with named attributes and sets of tuples constituted relations. This made writing queries easier (compared to pre-relational systems) and independent of data storage, but it also introduced serious problems with evaluation efficiency and an urgent need for query optimisation. Developed for almost 40 years, relational query optimisation techniques are based both on static syntactic transformations (relational calculus and algebra) and on storage information about data structure and organisation (data pages, IO operations, various indices, etc.), resulting in various execution plans. The aim of a relational optimiser is to examine all the plans (in practice, their number is usually limited to the most promising and straightforward ones in order to minimise optimisation time and resource consumption) and to choose the cheapest one (again, in terms of time and resources); however, due to process simplification, a common practice is just to reject the worst of them and choose one of the remaining. Overviews of relational optimisation approaches are collected in [122, 123, 124].

Below, an overview of the most common relational optimisation approaches for single databases is given. Distribution issues are omitted, as they are not considered in the thesis; however, they are well documented and elaborated for relational systems (e.g., in [125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142]).

3.1 Relational Optimisation Constraints

Query optimisation tries to solve the problem of a complete lack of efficiency of query evaluation when handling the powerful operations offered by database systems (appearing especially in the case of content-based access to data) by integrating a large number of techniques and strategies, ranging from logical transformations of queries to the optimisation of access paths and the storage of data on the file system level [122]. The economic principle requires that optimisation procedures either attempt to maximise the output for a given amount of resources or to minimise the resource usage for a given output. Query optimisation tries to minimise the response time for a given query language and mix of query types in a given system environment. This general goal allows a number of different operational objective functions. The response time goal is reasonable only under the assumption that user time is the most important bottleneck resource.
Otherwise, direct cost minimisation of technical resource usage can be attempted. Fortunately, both objectives are largely complementary; when goal conflicts arise, they are typically resolved by assigning limits to the availability of technical resources (e.g., those of main memory buffer space) [122].

The main factors considered during relational query optimisation are [122]:
• Communication cost – the cost of transmitting data from the site where they are stored to the sites where computations are performed and results are presented; these costs are composed of the costs for the communication line, which are usually related to the time the line is open, and the costs for the delay in processing caused by the transmission; the latter, which is more important for query optimisation, is often assumed to be a linear function of the amount of data transmitted,
• Secondary storage access cost – the cost of (or time for) loading data pages from secondary storage into main memory; this is influenced by the amount of data to be retrieved (mainly by the size of intermediate results), the clustering of data on physical pages, the size of the available buffer space, and the speed of the devices used,
• Storage cost – the cost of occupying secondary storage and memory buffers over time; storage costs are relevant only if storage becomes a system bottleneck and if it can be varied from query to query,
• Computation cost – the cost for (or time of) using the central processing unit (CPU).

3.2 Relational Calculus and Relational Algebra

The relational calculus (in fact there are two calculi: the tuple relational calculus and the domain relational calculus) is non-procedural and refers to quasi-natural-language expressions used for declaratively formulating SQL (Structured Query Language) queries and statements (the relational calculus was also used for Datalog [29, 30]); on the other hand, the relational algebra is used for procedural transformations of SQL expressions. Both the calculus and the algebra are the core parts of the relational model. The relational algebra and the relational calculus are logically equivalent.

The tuple relational calculus (introduced by Codd [118, 143, 144]) is a notation for defining the result of a query through a description of its properties. The representation of a query in the relational calculus consists of two parts: a target list and a selection expression [122]. Together with the relational algebra, the calculus constituted the basis for SQL (formerly also for the already forgotten QUEL – QUEry Language). However, the relational model has never been implemented in complete compliance with its mathematical foundations. The relational model theory defines a domain (or a data type), a tuple (an ordered multiset of attributes, each being an ordered pair of a domain and a value), a relation variable (relvar) standing for an ordered pair of a domain and a name (a relation header), and a relation defined as a set of tuples. In relational databases, a relation is reflected in a table and a tuple corresponds to a row (a record). The tuple calculus involves atomic values (atoms), operators, formulae and queries. Its complete description is out of the scope of the thesis and can be found in [118].

The domain relational calculus was proposed in [145] as a declarative approach to query languages. It uses the same operators as the tuple calculus (and quantifiers) and is meant for expressing queries as formulae. Again, its detailed description is omitted, as it can be found in [145].
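As a small illustration (using the test schema from subchapter 6.2, not an example from the cited sources), the same query – the surnames of employees earning more than 1200 – can be written declaratively in the tuple calculus and procedurally in the equivalent relational algebra:

{ t.surname | t ∈ employees ∧ t.salary > 1200 } ≡ π surname (σ salary > 1200 (employees))

The calculus formula only describes the properties of the result, while the algebraic form already prescribes an order of operations, which is what makes the algebra the natural vehicle for optimising transformations.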
The relational algebra [118, 146] is based on mathematical logic and set theory, and it is equivalent to the domain calculus. The relational algebra was applied as a basis for a set of query languages, e.g., ISBL, Tutorial D, Rel, and SQL. However, in the case of SQL, as it does not comply strictly with the theory (tables cannot be regarded as real relations), its affiliation to the relational algebra is not very close, which causes difficulties in some applications, e.g., in query optimisers. Because a relation is interpreted as the extension of some predicate, each operator of the relational algebra has a counterpart in the predicate calculus. There is a set of primitive operations defined in the algebra, i.e. the selection σ, the projection π, the Cartesian product × (the cross product or the cross join), the set union ∪, the set difference –, and the rename ρ (introduced later). All other operators, including the set intersection, the division and the natural join, are expressed in terms of the above primitive ones.

In terms of optimisation, the relational algebra can express each query as a tree, where the internal nodes are operators, the leaves are relations and the subtrees are subexpressions. Such trees are transformed into their semantically equivalent forms, where the average sizes of the relations yielded by the subexpressions in the tree are smaller than they were before the optimisation (e.g., by avoiding the calculation of cross products). Further, the algebra optimisation aims to minimise the number of evaluations of a single subexpression (as its result can be calculated once and used for evaluating other (sub)expressions). The basic relational algebra optimisation techniques referring to selections, projections, cross products, etc. will be presented in further sections.

3.3 Relational Query Processing and Optimisation Architecture

Fig. 4 Query flow through a RDBMS [124]

Fig. 4 shows the general architecture of query processing in relational systems [124]. The particular component blocks denote:
• Query parser – performs no optimisation; it checks the validity of the query and then translates it into an internal form, usually a relational calculus expression or another equivalent form,
• Query optimiser – examines all algebraic expressions that are equivalent to the given query and chooses the one that is estimated to be the cheapest,
• Code generator or interpreter – transforms the access plan generated by the optimiser into calls to the query processor,
• Query processor – actually executes the optimised query.

Queries can be divided into so-called ad hoc (interactive) queries, issued by an end user or an end-user application, and stored (embedded) ones, "hardcoded" in an application and compiled with it (this refers mainly to low-level programming languages, where queries are not stored as common SQL strings). Ad hoc queries go through the first three steps shown in Fig. 4 each time they are passed to a RDBMS; embedded queries can go through them only once and then be stored in a database in their optimised form for further use (and called at runtime); nevertheless, the general optimisation process for both kinds of queries can be regarded as the same [124].

Fig. 5 presents an abstract relational query optimiser architecture. As stated above, query optimisation can be divided into two steps:
• Rewriting – based on the relational calculus and algebra,
• Planning – based on physical data storage.
Fig. 5 Abstract relational query optimiser architecture [124]

The rewriter, the only module responsible for static transformations, relies on the query syntax and rewrites the query to a semantically equivalent form that should be more efficient than the original one (and is certainly not worse). The basic rewriter's tasks (executed to enable further processing and actual evaluation) consist of substituting view calls with their definitions, flattening out nested queries, etc. All the transformations are declarative and deterministic; no physical structures or DBMS-specific features are taken into account. A rewritten query is passed for further optimisation to the planner; the rewriter may also produce more than one semantically equivalent query form and send all of them. The rewriting procedure for relational databases is often considered an advanced optimisation method and is rarely implemented [124] – this is probably caused by SQL irregularities and the difficulties in its transformations.

The planner is responsible for the cost estimation of different query execution plans. The goal is to choose the cheapest one, which is usually impossible (database statistics may be out of date, the number of all plans to consider and evaluate can be too large, etc.). Nevertheless, the planner attempts to find the best of the possible plans by applying a search strategy based on examining the space of execution plans. The space is determined by two modules of the cost optimiser: the algebraic space and the method-structure space. The costs of different plans are evaluated with the cost model and the size-distribution estimator.

The algebraic space module is used for determining the orders of actions for a query execution (semantics is preserved but performance may differ). These actions are usually represented as relational algebra formulae and/or syntax trees. Due to the algorithmic nature of the objects generated by this module, the overall planning stage operates at a procedural level. On the other hand, the method-structure space module is responsible for determining the implementation choices for the execution of each action sequence generated previously. Its operation is related to the available join methods (e.g., nested loops, merge scans, hash joins), the existence and cost of auxiliary data structures (e.g., indices that can be persistent or built on-the-fly), the elimination of duplicates and other RDBMS-specific features depending on its implementation and storage form. This module produces all the corresponding execution plans, specifying the implementations of each algebraic operator and the use of any index possible.

The cost model depicted in Fig. 5 specifies the arithmetic formulae used for estimating the costs of possible execution plans (for each join type, index type, etc.). The algorithms and formulae applied are simplified (in order to limit the optimisation cost itself) and based on certain assumptions concerning buffer management, CPU load, the number of IO operations, sequential and random IO operations, etc. The input parameters for these calculations are mainly the size of the buffer pool for each step (determined by a RDBMS for each query) and the sizes of relations with their data distributions (provided by the size-distribution estimator). The estimator module specifies how the sizes (and possibly the frequency distributions of attribute values) of database relations, indices and intermediate results are determined.
It also determines if and what statistics are to be maintained in the database catalogues.

3.3.1 Space Search Reduction

The search space for optimisation depends on the set of algebraic transformations that preserve equivalence and the set of physical operators supported in an optimiser [122]. The following section discusses methods based on the relational algebra used for reducing the search space size. As stated above, a query can be transformed into a syntax tree, where the leaves are relations (tables in the case of SQL) and the nodes are algebraic operators (selections σ, projections π and joins). For multitable queries (i.e. the ones referring to more than one table, resulting in joins), a query tree can be created in different ways corresponding to the order of subexpression evaluation. A simple query can result in a few completely different trees, while for more complex ones the number of possible trees can be enormous. Therefore, an appropriate search strategy is applied in order to minimise optimisation costs. The search space is usually limited by the restrictions [124] described shortly in the following subsections.

3.3.1.1 Selections and Projections

Selections and projections are processed on-the-fly and almost never generate intermediate relations. Selections are processed as relations are accessed for the first time. Projections are processed as the results of other operators are generated. This is irrelevant for queries without joins; for join queries it implies that all operations are performed as a part of join execution. This restriction eliminates only suboptimal query trees, since separate processing of selections and projections incurs additional costs. Hence, the algebraic space module specifies alternative query trees with join operators only, selections and projections being implicit. The set of alternative joins is determined by their commutativity:

R1 join R2 ≡ R2 join R1

and associativity:

(R1 join R2) join R3 ≡ R1 join (R2 join R3)

Due to them, the optimiser can determine which joins are to be executed as inner ones, and which as outer ones (based on the already calculated results of the inner joins). The number of possible join combinations is the factorial of the number of relations (N!); therefore, some additional restrictions are usually introduced to minimise the search space, e.g., the ones corresponding to cross products given below.

Consider a query aiming to retrieve the surnames of employees who earn more than 1200 together with the names of their departments (the query's target is the test schema presented in subchapter 6.2 Query Analysis and Optimisation Examples):

select employees.surname, departments.name
from employees, departments
where employees.department_id = departments.id
and employees.salary > 1200

The join is established between employees and departments on their department_id and id columns, respectively. The resulting possible syntax trees (generated by the algebraic space module shown in Fig. 5 above) are presented in Fig. 6. The plan resulting from tree 1 complies with the restriction – an index scan of employees finds the tuples satisfying the selection on employees.salary on-the-fly and the join is performed only on them; moreover, the projection of the result occurs as the join tuples are generated.
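The equivalence exploited by tree 1 can be stated in the algebra notation of subchapter 3.2 (with the join condition employees.department_id = departments.id implied) as pushing the selection below the join:

π surname, name (σ salary>1200 (employees join departments)) ≡ π surname, name (σ salary>1200 (employees) join departments)

Both sides yield the same result, but on the right-hand side the join processes only the employees that already satisfy the selection, which typically shrinks the intermediate relation considerably.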
3.3.1.2 Cross Products

Cross products are never formed, unless the query itself asks for them. Relations are always combined through the joins specified in the query, which eliminates the suboptimal join trees typically resulting from cross products. Exceptions to this restriction can occur when the cross-product relations are very small and the time for the calculation is negligibly short. To visualise the restriction, a more complex query should be taken into account – again, it refers to the test schema and aims to retrieve all employees' surnames with the names of their departments and the departments' locations:

select employees.surname, departments.name, locations.name
from employees, departments, locations
where employees.department_id = departments.id
and departments.location_id = locations.id

The syntax trees for this query are shown in Fig. 7. Tree 3 does not satisfy the restriction, as it involves a cross product (an unconditional join).

Fig. 7 Possible syntax trees for a sample query for cross products

3.3.1.3 Join Tree Shapes

This restriction is based on the shapes of join trees and it is omitted in some RDBMSs (e.g., Ingres, DB2-Client/Server) – the inner operand of each join is a database relation, never an intermediate result. Its description is given below, basing on an illustrative example. In order to explain this restriction, the test schema is not complex enough; an additional relation (table) should be introduced, e.g., countries related to locations with id and country_id columns, respectively. A sample query stands for all employees' surnames with their departments' names, the locations' names (cities) for the departments and the locations' countries:

select employees.surname, departments.name, locations.name, countries.name
from employees, departments, locations, countries
where employees.department_id = departments.id
and departments.location_id = locations.id
and locations.country_id = countries.id

Again, the possible syntax trees for the query are presented in Fig. 8 (without cross products). Tree 1 complies with the restriction; trees 2 and 3 do not, as they contain at least one join with an intermediate result as the inner relation.
Trees satisfying the restriction (e.g., tree 1) are referred to as left-deep, those with the outer relation always being a database relation (e.g., tree 2) are called right-deep, while trees with at least one join between two intermediate results (e.g., tree 3) are called bushy (in general, trees shaped like trees 1 and 2 are called linear). The restriction is more heuristic than the previous ones and in some cases it might eliminate the optimal plan, but it has been claimed that most often the optimal left-deep tree is not much more expensive than the overall optimal tree, since [124]:
• Having original database relations as inner ones increases the use of any preexisting indices,
• Having intermediate relations as outer ones allows sequences of nested loops joins to be executed in a pipelined fashion.

Both index usage and pipelining reduce the cost of join trees. Moreover, the restriction significantly reduces the number of alternative join trees, to O(2^N) for many queries with N relations (from N! previously). Hence, the algebraic space module of a typical query optimiser (provided it is implemented) specifies only join trees that are left-deep. [124]

Fig. 8 Possible syntax trees for a sample query for tree shapes

3.3.2 Planning

The planner module (Fig. 5) is responsible for analysing the set of alternative query plans developed by the algebraic space and the method-structure space modules in order to find "the cheapest" one. This decision process is supported by the cost model and the size-distribution estimator. There are different search strategies, described in the following subsections – many interesting alternative approaches to the problem (e.g., based on genetic programming or artificial intelligence) are omitted here, however; their summary with short descriptions and references is available in [124].

3.3.2.1 Dynamic Programming Algorithms

The dynamic programming approach (primarily proposed for System-R [119]) is currently implemented in most commercial RDBMSs. It is based mainly on a dynamic exhaustive search algorithm constructing a set of possible query trees complying with the restrictions described above (any tree recognised as suboptimal is rejected). [124]

A key notion of the algorithm is an interesting order. The merge-scan join method that is very often "suggested" by the method-structure module sorts join attributes prior to executing the join itself. This sorting is performed on the two input relations' join attributes, which are then merged with a synchronised scan. However, if any input relation is already sorted (e.g., due to sorting by previous merge-scan joins or some B+-tree index action), the sorting step can be skipped for this relation.
In such a situation, the costs of two partial plans cannot be evaluated and compared properly if the sort order is not considered. An apparently more expensive partial plan can turn out more beneficial if it generates a sorted result that can be used in the evaluation of some subsequent merge-scan join. Therefore any partial plan that produces a sorted result must be treated specially and its possible influence on the overall query plan examined. The dynamic programming algorithm can be described in the following steps [124]:

Step 1: For each relation in the query, all possible ways to access it, i.e., via all existing indices and including the simple sequential scan, are obtained (accessing an index takes into account any query selection on the index key attribute). These partial (single-relation) plans are partitioned into equivalence classes based on any interesting order in which they produce their result. An additional equivalence class is formed by the partial plans whose results are in no interesting order. Estimates of the costs of all plans are obtained from the cost model module, and the cheapest plan in each equivalence class is retained for further consideration. However, the cheapest plan of the no-order equivalence class is not retained if it is not cheaper than all other plans.

Step 2: For each pair of relations joined in the query, all possible ways to evaluate their join using all relation access plans retained after Step 1 are obtained. Partitioning and pruning of these partial (two-relation) plans proceeds as above.
…
Step i: For each set of i – 1 relations joined in the query, the cheapest plans to join them for each interesting order are known from the previous step. In this step, for each such set, all possible ways to join one more relation with it without creating a cross product are evaluated. For each set of i relations, all generated (partial) plans are partitioned and pruned as before.
…
Step N: All possible plans to answer the query (the unique set of N relations joined in the query) are generated from the plans retained in the previous step. The cheapest plan is the final output of the optimiser, to be used to process the query.

This algorithm guarantees finding the cheapest (optimal) plan of all those compliant with the search space restrictions presented above. It often avoids enumerating all plans in the space by dynamically pruning suboptimal parts of the space as partial plans are generated. In fact, although in general still exponential, there are query forms for which it generates only O(N^3) plans [147]. An illustrative example of the dynamic programming algorithm can be found in [124].

The possibilities offered by the method-structure space in addition to those of the algebraic space result in an extraordinary number of alternatives that the optimiser must search through. The memory requirements and running time of dynamic programming grow exponentially with query size (i.e. the number of joins) in the worst case, since all viable partial plans generated in each step must be stored to be used in the next one. In fact, many modern systems place a limit on the size of queries that can be submitted (usually about fifteen joins), because for larger queries the optimiser crashes due to its very high memory requirements. Nevertheless, most queries seen in practice involve fewer than ten joins, and the algorithm has proved to be very effective in such contexts. It is considered the standard among query optimisation search strategies. [124]
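The construction can be traced with a compact sketch (toy relation sizes and a deliberately crude cost model; interesting orders and access path selection are omitted, only left-deep plans without cross products are enumerated, and only the cheapest plan per relation set is retained):

from itertools import combinations

REL_SIZES = {"employees": 10000, "departments": 100, "locations": 20}
JOINABLE = {frozenset({"employees", "departments"}),
            frozenset({"departments", "locations"})}

def join_cost(left_size, right_size):
    # Toy cost model: nested loops, cost proportional to the product of sizes.
    return left_size * right_size

def best_plan(relations):
    # best[S] = (cost, estimated size, plan text) for each subset S of relations
    best = {frozenset({r}): (0, REL_SIZES[r], r) for r in relations}
    for k in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, k)):
            for r in subset:                  # r is joined as the inner relation
                rest = subset - {r}
                if rest not in best:
                    continue
                # skip cross products: r must join with some relation in rest
                if not any(frozenset({r, s}) in JOINABLE for s in rest):
                    continue
                cost, size, plan = best[rest]
                total = cost + join_cost(size, REL_SIZES[r])
                new_size = max(size, REL_SIZES[r])   # crude size estimate
                if subset not in best or total < best[subset][0]:
                    best[subset] = (total, new_size, f"({plan} join {r})")
    return best[frozenset(relations)]

print(best_plan(["employees", "departments", "locations"]))
# cost 1002000: the cheapest plan joins departments and locations first,
# then joins employees (the textual order of equal-cost plans may vary)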
3.3.2.2 Randomised Algorithms

Dynamic programming algorithms are not capable of analysing relatively large and complex queries (due to the memory consumption issues and the resulting limits on the number of joins mentioned above). In order to overcome these inconveniences, some randomised algorithms were proposed – a few of them are shortly described below. The most important class of these optimisation algorithms is based on plan transformations instead of the plan construction of dynamic programming, and includes algorithms like simulated annealing, iterative improvement and two-phase optimisation. These algorithms are generic and they can be applied to various optimisation issues. They operate on graphs whose nodes represent alternative execution plans with associated costs, in order to find a node with the minimum overall cost. Such algorithms perform random walks through a graph – the nodes that can be reached in one move from a node S are the neighbours of S. A move is called uphill (respectively, downhill) if the cost of the source node is lower (respectively, higher) than the cost of the destination node. A node is a global minimum if it has the lowest cost among all nodes. It is a local minimum if, in all paths starting at that node, any downhill move comes after at least one uphill move. [124]

The iterative improvement algorithm (II) [148, 149, 150] performs a large number of local optimisations. Each one starts at a random node and repeatedly accepts random downhill moves until it reaches a local minimum. II returns the local minimum with the lowest cost found. Simulated annealing (SA) performs a continuous random walk, always accepting downhill moves and accepting uphill moves with some probability, trying to avoid being caught in a high-cost local minimum [151, 152, 153]. This probability decreases as time progresses and eventually becomes zero, at which point execution stops. Like II, SA returns the node with the lowest cost visited. The two-phase optimisation (2PO) algorithm is a combination of II and SA [153]. In phase 1, II is run for a small period of time, i.e., a few local optimisations are performed. The output of that phase (the best local minimum found) is the initial node of the next phase. In phase 2, SA is run starting from a low probability for uphill moves. Intuitively, the algorithm chooses a local minimum and then searches the area around it, still being able to move in and out of local minima, but practically unable to climb up very high hills. [124]
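A generic sketch of the iterative improvement strategy (illustrative only; random_plan, neighbours and cost abstract over a real plan space and are supplied by the caller):

import random

def iterative_improvement(random_plan, neighbours, cost, local_runs=50):
    best, best_cost = None, float("inf")
    for _ in range(local_runs):
        plan = random_plan()
        while True:
            downhill = [n for n in neighbours(plan) if cost(n) < cost(plan)]
            if not downhill:
                break                       # a local minimum has been reached
            plan = random.choice(downhill)  # accept a random downhill move
        if cost(plan) < best_cost:
            best, best_cost = plan, cost(plan)
    return best, best_cost

# Toy usage: minimise a quadratic over integers, standing in for plan cost.
best, c = iterative_improvement(
    random_plan=lambda: random.randint(-100, 100),
    neighbours=lambda x: [x - 1, x + 1],
    cost=lambda x: (x - 7) ** 2)
print(best, c)   # converges to 7 with cost 0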
3.3.3 Size-Distribution Estimator

The size-distribution estimator module (Fig. 5) is responsible for the estimation of the sizes of (sub)query results and the frequency distributions of values in their attributes. Although frequency distributions can be generalised to combinations of arbitrary numbers of attributes, most RDBMSs deal with frequency distributions of individual attributes only, because considering all possible combinations of attributes is very expensive (the attribute value independence assumption). Below, the most commonly implemented method (e.g., in DB2, Informix, Ingres, Sybase, Microsoft SQL Server) for the estimation of query result sizes and frequency distributions is described, i.e. histograms; some other approaches could also be discussed (e.g., [154, 155, 156, 157, 158, 159]). [124]

3.3.3.1 Histograms

In a histogram on attribute a of relation R, the domain of a is partitioned into buckets, and a uniform distribution is assumed within each bucket. That is, for any bucket b in the histogram, the frequency fi of a value vi ∈ b is approximated by the average frequency over the bucket:

fi ≈ (1/|b|) · Σvj∈b fj

where |b| denotes the number of attribute values in bucket b. A histogram with a single bucket generates the same approximate frequency for all attribute values. Such a histogram is called trivial and corresponds to making the uniform distribution assumption over the entire attribute domain. In principle, any arbitrary subset of an attribute's domain may form a bucket, not necessarily consecutive ranges of its natural order. [124]

There are various classes of histograms that systems use or researchers have proposed for estimation. Most of the earlier prototypes, and still some of the commercial RDBMSs, use trivial histograms, i.e., make the uniform distribution assumption [119]. That assumption, however, rarely holds in real data and estimates based on it usually have large errors [160, 161]. Excluding trivial ones, the histograms that are typically used belong to the class of equi-width histograms [162]. In those, the number of consecutive attribute values or the size of the range of attribute values associated with each bucket is the same, independent of the frequency of each attribute value in the data. Since these histograms store much more information than trivial histograms (they typically have 10-20 buckets), their estimations are much better. Some other histogram classes have also been proposed (e.g., equi-height or equi-depth histograms [162, 163] or multidimensional histograms [164]); however, they are not implemented in any RDBMS. [124]

It has been proved that serial histograms are the optimal ones [165, 166, 167]. A histogram is serial if the frequencies of attribute values associated with each bucket are either all greater or all less than the frequencies of the attribute values associated with any other bucket (buckets of a serial histogram group frequencies that are close to each other, with no interleaving). Identifying the optimal histogram among all serial ones takes time exponential in the number of buckets. Moreover, since there is usually no order-correlation between attribute values and their frequencies, the storage of serial histograms essentially requires a regular index that leads to the approximate frequency of every individual attribute value. Because of all these complexities, the class of end-biased histograms has been introduced. In those, some number of the highest frequencies and some number of the lowest frequencies in an attribute are explicitly and accurately maintained in separate individual buckets, and the remaining (middle) frequencies are all approximated together in a single bucket. End-biased histograms are serial, since their buckets group frequencies with no interleaving. Identifying the optimal end-biased histogram, however, takes only slightly over linear time in the number of buckets. Moreover, end-biased histograms require little storage, since usually most of the attribute values belong in a single bucket and do not have to be stored explicitly. Finally, several experiments have shown that most often the errors in the estimates based on end-biased histograms are not too far from the corresponding (optimal) errors based on serial histograms. Thus, as a compromise between optimality and practicality, it has been suggested that optimal end-biased histograms should be used in real systems. [124]
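A minimal sketch (hypothetical data, equi-width buckets) of how such a histogram answers the selectivity question an optimiser actually asks, e.g. "how many employees earn more than 1200?", assuming a uniform distribution inside each bucket:

def make_equi_width(values, buckets=5):
    lo, hi = min(values), max(values)
    width = (hi - lo) / buckets or 1
    counts = [0] * buckets
    for v in values:
        counts[min(int((v - lo) / width), buckets - 1)] += 1
    return lo, width, counts

def estimate_greater_than(hist, threshold):
    lo, width, counts = hist
    total = 0.0
    for i, c in enumerate(counts):
        b_lo, b_hi = lo + i * width, lo + (i + 1) * width
        if threshold <= b_lo:
            total += c                            # the whole bucket qualifies
        elif threshold < b_hi:                    # uniformity inside the bucket
            total += c * (b_hi - threshold) / width
    return total

salaries = [800, 950, 1100, 1250, 1300, 1500, 2000, 2200, 2500, 3100]
print(estimate_greater_than(make_equi_width(salaries), 1200))
# ≈ 6.5, while the exact answer is 7 – accurate enough for cost estimation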
3.4 Relational Query Optimisation Milestones

The general (and the most common) relational query optimisation methods and approaches have been developed over almost 40 years. Below, a few historical "milestone" relational model implementations are enumerated, with emphasis on their query optimisers.

3.4.1 System-R

System-R is the first implementation of the relational model (developed between 1972 and 1981), whose solutions and experiences are still regarded as fundamental for many currently used relational databases (including commercial ones); its complete bibliography is available at [121]. System-R introduced a very important method for optimising select-project-join (SPJ) queries (a notion that also covers conjunctive queries), i.e. queries where multiple joins and multiple join attributes are involved (non-star joins), as well as group by, order by and distinct clauses. The search space of the System-R optimiser in the context of an SPJ query consists of operator trees that correspond to linear sequences of join operations (described in section 3.3.1.3 Join Tree Shapes). Such sequences are logically equivalent because of the associative and commutative properties of joins. A join operator can use either the nested loops or the sort-merge implementation. Each scan node can use either an index scan (using a clustered or non-clustered index) or a sequential scan. Finally, predicates are evaluated as early as possible. [122]

The cost model realised in System-R relied on [122]:
• A set of statistics maintained on relations and indexes, e.g., the number of data pages in a relation, the number of pages in an index, the number of distinct values in a column,
• Formulae to estimate the selectivity of predicates and to project the size of the output data stream for every operator node; for example, the size of the output of a join was estimated by taking the product of the sizes of the two relations and then applying the joint selectivity of all applicable predicates (see the worked instance below),
• Formulae to estimate the CPU and I/O costs of query execution for every operator; these formulae took into account the statistical properties of its input data streams, existing access methods over the input data streams, and any available order on the data stream (e.g., if a data stream was ordered, the cost of a sort-merge join on that stream could be significantly reduced); in addition, it was also checked whether the output data stream would have any order.

The query planning algorithm of the System-R optimiser used the concept of dynamic programming with interesting orders (described in section 3.3.2.1 Dynamic Programming Algorithms). Indeed, the System-R optimiser was novel and innovative; however, it did not generalise beyond join ordering.
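A worked instance (toy numbers) of the join size estimate quoted above – the product of the relation sizes multiplied by the joint selectivity of the applicable predicates:

employees, departments = 10_000, 100
# For an equi-join on departments' key, the usual selectivity estimate is
# 1 / (number of distinct key values), here 1/100.
selectivity = 1 / departments
print(employees * departments * selectivity)   # 10000.0 estimated joined tuples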
3.4.2 Starburst

Query optimisation in the Starburst project [168], developed at IBM Almaden [169] between 1984 and 1992 (eventually becoming DB2), used a structural representation of the SQL query maintained throughout the lifecycle of optimisation, called the Query Graph Model (QGM). In the QGM, a box represented a query block and labelled arcs between boxes represented table references across blocks. Each box contained information on the predicate structure as well as on whether the data stream was ordered.

In the query rewrite phase of optimisation [170], rules were used to transform a QGM into another equivalent QGM. These rules were modelled as pairs of arbitrary functions – the first one checked the condition for applicability and the second one enforced the transformation (a schematic sketch of such condition/action pairs is given at the end of this subchapter). A forward chaining rule engine governed the rules. Rules could be grouped in rule classes, and it was possible to tune the order of evaluation of rule classes to focus the search. Since any application of a rule resulted in a valid QGM, any set of rule applications guaranteed query equivalence (assuming the rules themselves were valid). The query rewrite phase did not have cost information available, which forced this module either to retain the alternatives obtained through rule applications or to use the rules in a heuristic way (and thus compromise optimality). [122]

In the second phase of query optimisation (plan optimisation), an execution plan (operator tree) was chosen for a given QGM. In Starburst, the physical operators (called LOLEPOPs28) were combined in a variety of ways to implement higher-level operators. Such combinations were expressed in a grammar production-like language [171]. The realisation of a higher-level operation was expressed by its derivation in terms of the physical operators. In computing such derivations, comparable plans that represented the same physical and logical properties but higher costs were pruned. Each plan had a relational description corresponding to the algebraic expression it represented, an estimated cost, and physical properties (e.g., order). These properties were propagated as plans were built bottom-up. Thus, with each physical operator, a function showing the effect of the physical operator on each of the above properties was associated. The join planner in this system was similar to that of System-R (described shortly above). [122]

28 LOw LEvel Plan OPerators

3.4.3 Volcano/Cascades

The Volcano [172] and Cascades [173] extensible architectures evolved from Exodus [174] (their current "incarnation" is MS SQL Server). In these systems, two kinds of rules were used universally to represent the knowledge of the search space: transformation rules mapped an algebraic expression into another one, and implementation rules mapped an algebraic expression into an operator tree. The rules might have conditions for applicability. Logical properties, physical properties and costs were associated with plans. The physical properties and the cost depended on the algorithms used to implement operators and their input data streams. For efficiency, Volcano/Cascades used dynamic programming in a top-down way (memoisation). When presented with an optimisation task, it checked whether the task had already been accomplished by looking up its logical and physical properties in the table of plans that had been optimised in the past. Otherwise, it applied a logical transformation rule, an implementation rule, or used an enforcer to modify properties of the data stream. At every stage, it used the promise of an action to determine the next move. The promise parameter was programmable and reflected cost parameters. The Volcano/Cascades framework differed from Starburst in its approach to planning: these systems did not use two distinct optimisation phases, because all transformations were algebraic and cost-based, and the mapping from algebraic to physical operators occurred in a single step. Further, instead of applying rules in a forward chaining fashion, as in the Starburst query rewrite phase, Volcano/Cascades applied a goal-driven application of rules. [122]
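The condition/action rule pairs mentioned for the Starburst rewrite phase can be sketched as follows (hypothetical structures; a real engine operates on QGM graphs, not dictionaries):

from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    applies: Callable[[dict], bool]      # condition: may the rule fire?
    transform: Callable[[dict], dict]    # action: produce an equivalent query

# A single toy rule: merge a nested block that has no predicate of its own.
merge_trivial_block = Rule(
    applies=lambda q: q.get("subquery") is not None
                      and q["subquery"].get("where") is None,
    transform=lambda q: {**q, "from": q["subquery"]["from"], "subquery": None},
)

def rewrite(query, rules):
    changed = True
    while changed:                       # forward chaining until a fixpoint
        changed = False
        for rule in rules:
            if rule.applies(query):
                query = rule.transform(query)
                changed = True
    return query

q = {"select": ["surname"], "from": None,
     "subquery": {"from": "employees", "where": None}}
print(rewrite(q, [merge_trivial_block]))
# {'select': ['surname'], 'from': 'employees', 'subquery': None}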
Chapter 4 The Stack-Based Approach

The Stack-Based Approach (SBA) is a formal methodology for both query and programming languages. Its query language (SBQL, described below) has been designed basing on concepts well known from programming languages. The main SBA idea is that there is no clear and final boundary between programming languages and query languages; therefore, there should arise a theory consistently describing both aspects. SBA offers a complete conceptual and semantic foundation for querying and programming with queries, including programmes with abstractions (e.g., procedures, functions, classes, types, methods, views).

SBA defines language semantics with an abstract implementation method, i.e. an operational specification that requires definitions of all runtime data structures followed by unambiguous definitions of the behaviour of any language construct on these structures. SBA introduces three such abstract structures:
• An object store,
• An environment stack (ENVS),
• A result stack (QRES).

Any operator used in queries (e.g., a selection, a projection, a join, quantifiers) must be precisely described with these three abstract structures, without referring to the classical notions and theories of relational and object algebras.

As stated above, SBA is based on a classic programming language mechanism modified with the required extensions. The substantial syntactic decision was the unification of a query language with a programming language, which results in expressing query languages as a form of programming languages. Hence, SBA does not distinguish between simple expressions (e.g., 2 + 2, (x + y) * x) and complex queries like Employee where salary = 1000 or (Employee where salary = (x + y) * z).surname. All these expressions can be used in any imperative constructs, as arguments of procedures, functions or methods, as a function's returned value, etc.

Any SBA object is represented by the following features:
• A unique internal identifier (OID, an object identifier),
• A name used for accessing the object (it does not have to be unique),
• Its contents: a value (a simple object), other object(s) (a complex object) or a reference to another object (a pointer object).

4.1 SBA Object Store Models

SBA is defined for a general object store model. Because various object models introduce a lot of incompatible notions, SBA assumes a family of object store models enumerated M0, M1, M2 and M3, in the order corresponding to their complexity. The simplest is M0, which covers relational, nested-relational and XML-oriented databases. M0 assumes hierarchical objects with no limitations concerning the nesting of objects and collections. M0 also covers binary links (relationships) between objects, referred to as pointers. Higher-level store models introduce classes and static inheritance (M1), object roles and dynamic inheritance (M2), and encapsulation (M3).

4.2 SBQL

SBA defines the Stack-Based Query Language (SBQL), which plays the same role for SBA as the relational algebra does for the relational model (nevertheless, SBQL is much more powerful). The language has been precisely designed from a practical point of view and it is completely neutral with respect to data models.
It can be successfully applied to relational and object databases, XML, RDF, and others, since its basic assumption is that it operates on data structures, not on models. Therefore, once a reflection of the appropriate structures onto the SBA abstract structures is defined, a precise definition of a language operating on them appears. This completely changes the approach to defining a language for a new data form – instead of defining a new language, one has to state how these data are to be processed with SBQL.

SBQL semantics is based on a name binding space paradigm. Any name in a query is bound to a runtime being (a persistent object, a procedure, a procedure parameter, etc.), according to the current name space. Another popular solution from programming languages employed in SBQL is that a naming scope is defined by an environment stack searched from top to bottom. Due to the differences between a programming environment and databases, the former concept has been extended – the stack does not contain the data itself but pointers to them; moreover, many objects can be bound to a single name (in this way, unified actions on collections and single objects are achieved). The detailed assumptions are as follows:
• Each name is bound to a runtime being (an object, an attribute, a procedure, a view, etc.),
• All data are processed in the same manner, regardless of whether they are persistent or transient,
• Procedures' results belong to the same category as query results, which implies that they can be arbitrarily combined and processed,
• All semantic and syntactic constructs are the same for objects and their subobjects (complete object relativism).

4.2.1 SBQL Semantics

The language is defined by the composition rule:
1. Let Q be the set of all queries.
2. If q ∈ Q (q is a query) and σ is a unary operator, then σ(q) ∈ Q.
3. If q1 ∈ Q and q2 ∈ Q (q1 and q2 are queries) and µ is a binary operator, then µ(q1, q2) ∈ Q.

Due to these assumptions, SBQL queries can be easily decomposed into subqueries, down to atomic ones (names, operators, literals). The result returned by a query q ∈ Q is calculated by eval(q). SBQL semantics is defined with the following notions:
• Binders,
• Binding names on ENVS,
• Keeping temporary results on QRES,
• The eval function for algebraic and nonalgebraic operators.

4.2.1.1 Name Binding

A binder is a construct allowing an object's name to be kept together with its identifier. For an arbitrary object named n with an identifier i, a binder is defined as n(i). The binder concept can be generalised so that in n(x), x can be an identifier, but also a literal or a complex structure, including a procedure pointer. The ENVS stack (separate from the object store) contains sections consisting of collections of binders. On startup, ENVS contains a single section with the binders of query root objects. Any object name n appearing in a query is bound on ENVS. The stack is searched in a top-down direction (i.e. starting from the newest sections) for n(ij) binders – the search is stopped on the first match (if no match is found, the next lower section is searched). The result of binding (provided binders hold only object identifiers) can be a single object identifier, a collection of identifiers or an empty collection.
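A toy executable model of this mechanism (illustrative only, not ODRA code) – the stack is searched top-down and a single name may bind to many identifiers:

ENVS = [
    # base section: binders of query root objects, as name/identifier pairs
    [("Employee", "i1"), ("Employee", "i2"), ("Department", "i7")],
]

def bind(name):
    for section in reversed(ENVS):      # top-down: the newest section first
        matches = [ident for n, ident in section if n == name]
        if matches:
            return matches              # stop at the first matching section
    return []                           # an empty collection: nothing bound

print(bind("Employee"))   # ['i1', 'i2'] – a collection bound to a single name
print(bind("Location"))   # [] – no match in any section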
4.2.1.2 Operators and the eval Function

The eval function pushes its result on QRES. For an arbitrary query q, the eval definition is:
• If q is a string denoting a simple value (a literal), eval returns this value,
• If q is an object name, ENVS is searched top-down for binders q(ij); all objects bound by the name q are returned,
• If q is a name of a procedure, this procedure is called, i.e. a new section with the procedure parameters, a local environment and a return point is pushed on ENVS; next, the procedure body is executed (it may result in pushing some result on QRES); when finished, the section is popped from ENVS,
• If q is of the form q1 ∆ q2 (∆ is an algebraic operator), q1 is evaluated and its result is pushed onto QRES, then q2 is evaluated and its result is pushed onto QRES; then ∆ is evaluated on the two elements popped from QRES and its result is finally pushed on QRES,
• If q is of the form q1 Θ q2 (Θ is a nonalgebraic operator), q2 is evaluated multiple times, once for each result returned by q1 and in the context of this result; the final result is pushed onto QRES; the procedure can be described as follows:
For each ij returned by eval(q1) do:
■ Push on ENVS a representation of the ij object's contents (subobjects' binders),
■ Evaluate q2 and store its partial result,
■ Pop from ENVS the representation of the ij object's contents,
Push all partial results onto QRES.

The main differences between nonalgebraic and algebraic operators are their influence on ENVS and the evaluation of subqueries in the context of (or without the context of) other subqueries. The common algebraic operators are:
• Arithmetic operators (+, -, *, /, etc.),
• Logical operators (and, or, not, etc.),
• The alias operator (as),
• The grouping operator (groupas).

Nonalgebraic operators are represented by:
• The navigational dot (.),
• Selection (where),
• Join (join),
• Quantifiers (all, exists).

4.2.2 Sample Queries

The following queries (with an explanation of their semantics) refer to the relational wrapper test schema described in subchapter 6.2.1. Each query is supplied with its syntax tree (without typechecking, for simplification). They are valid wrapper queries, i.e. they are expressed with the wrapper views' names.

Example 1: Retrieve surnames and names of employees earning more than 1200

(Employee where salary > 1200).(surname, name);

Fig. 9 Sample SBQL query syntax tree for example 1

Example 2: Retrieve first names of employees named Kowalski earning less than 2000

(Employee where surname = "Kowalski" and salary < 2000).name;

Fig. 10 Sample SBQL query syntax tree for example 2

Example 3: Retrieve surnames of employees and names of departments of employees named Nowak

((Employee where surname = "Nowak") as e join e.worksIn.Department as d join d.isLocatedIn.Location as l).(e.surname, l.name);

Fig. 11 Sample SBQL query syntax tree for example 3

Example 4: Retrieve surnames of employees and cities their departments are located in

(Employee as e join e.worksIn.Department as d join d.isLocatedIn.Location as l).(e.surname, l.name);

Fig. 12 Sample SBQL query syntax tree for example 4

Example 5: Retrieve surnames and birth dates of employees named Kowalski working in the production department

(Employee where surname = "Kowalski" and worksIn.Department.name = "Production").(surname, birthDate);

Fig. 13 Sample SBQL query syntax tree for example 5
Example 6: Retrieve surnames and birth dates of employees named Kowalski working in Łódź city

(Employee where surname = "Kowalski" and worksIn.Department.isLocatedIn.Location.name = "Łódź").(surname, birthDate);

Fig. 14 Sample SBQL query syntax tree for example 6

Example 7: Retrieve the sum of salaries of employees named Kowalski working in Łódź city

sum((Employee where surname = "Kowalski" and worksIn.Department.isLocatedIn.Location.name = "Łódź").salary);

Fig. 15 Sample SBQL query syntax tree for example 7
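The evaluation of the nonalgebraic where in Example 1 can be traced with a toy interpreter fragment (illustrative only, not ODRA code): for each Employee identifier, a section with the object's contents is pushed on ENVS, the predicate is evaluated in that scope, and the qualifying identifiers form the result:

STORE = {
    "i1": {"surname": "Nowak", "name": "Jan", "salary": 1000},
    "i2": {"surname": "Kowalski", "name": "Anna", "salary": 1500},
}
ENVS = [[("Employee", oid) for oid in STORE]]   # base section: root binders

def bind(name):
    for section in reversed(ENVS):
        matches = [v for n, v in section if n == name]
        if matches:
            return matches
    return []

def where(collection, predicate):
    result = []
    for oid in collection:
        # push a section with the object's contents (subobjects' binders)
        ENVS.append(list(STORE[oid].items()))
        if predicate():                          # evaluated in the new scope
            result.append(oid)
        ENVS.pop()                               # pop the section
    return result

# (Employee where salary > 1200).(surname, name)
qualifying = where(bind("Employee"), lambda: bind("salary")[0] > 1200)
print([(STORE[o]["surname"], STORE[o]["name"]) for o in qualifying])
# [('Kowalski', 'Anna')]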
4.3 Updateable Object-Oriented Views

In databases, a view is an arbitrarily defined image of the stored data. In distributed applications (e.g., Web applications), views can be used for resolving incompatibilities between heterogeneous data sources, enabling their integration [175, 176, 216], which corresponds to the mediation described in subchapter 2.2.1 Wrappers and Mediators. Database views are distinguished as materialised ones (representing copies of selected data) and virtual ones (standing only for definitions of data that can be accessed by calling such a view). A typical view definition is a procedure that can be invoked from within a query. One of the most important features of database views is their transparency, which means that a user issuing a query must not have to distinguish between a view and actual stored data (he or she must not be aware of using views); therefore the data model and the syntax of a query language for views must conform to the ones for physical data. Views should be characterised by the following features [177, 214]:
• Customisation, conceptualisation, encapsulation – a user (programmer) receives only data that are relevant to his/her interests and in a form that is suitable for his/her activity; this facilitates users' productivity and supports software quality through decreasing the probability of errors; views present the external layer in the three-layered architecture (commonly referred to as the ANSI29/SPARC30 architecture [178]),
• Security, privacy, autonomy – views give the possibility to restrict user access to the relevant parts of a database,
• Interoperability, heterogeneity, schema integration, legacy applications – views enable the integration of distributed/heterogeneous databases, allowing understanding and processing alien, legacy or remote databases according to a common, unified schema,
• Data independence, schema evolution – views enable the users to change the physical and logical database organisation and schema without affecting already written applications.

29 American National Standards Institute
30 Standards Planning And Requirements Committee

The idea of updateable object views [214] relies on augmenting the definition of a view with information on the users' intents with respect to updating operations. Only the view definer is able to express the semantics of view updating. To achieve it, a view definition is divided into two parts. The first part is a functional procedure which maps stored objects into virtual objects (similarly to SQL). The second part contains redefinitions of the generic operations on virtual objects. These procedures express the users' intents with respect to the update, delete, insert and retrieve operations performed on virtual objects. A view definition usually contains definitions of subviews, which are defined by the same rule, according to the relativism principle. Because a view definition is a regular complex object, it may also contain other elements, such as procedures, functions, state objects, etc. The above assumptions and the SBA semantics allow achieving the following properties [210]:
• Full transparency of views – after defining a view, users use the virtual objects in the same way as stored objects,
• Views are automatically recursive and (as procedures) can have parameters.

The first part of a view definition has the form of a functional procedure named virtual objects. It returns entities called seeds that unambiguously identify virtual objects (usually seeds are OIDs of stored objects). Seeds are then (implicitly) passed as parameters of the procedures that overload operations on virtual objects. These operations are determined in the other part of the view definition. There are four generic operations that can be performed on virtual objects:
• delete – removes the given virtual object,
• retrieve (dereference) – returns the value of the given virtual object,
• insert – puts an object being a parameter inside the given virtual object,
• update – modifies the value of the given virtual object according to a parameter (a new value).

Definitions of these overloading operations are procedures that are performed on stored objects. In this way the view definer can take full control over all operations that should happen on stored objects in response to an update of the corresponding virtual object. If some overloading procedure is not defined, the corresponding operation on virtual objects is forbidden. The procedures have fixed names: on_delete, on_retrieve, on_new and on_update, respectively. All procedures, including the function supplying the seeds of virtual objects, are defined in SBQL and may be arbitrarily complex.
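The two-part structure can be mirrored in a Python analogue (schematic only; actual view definitions are written in SBQL): a seed-producing function plus procedures overloading the generic operations.

class RichEmployeeView:              # a hypothetical view over Employee objects
    def __init__(self, store):
        self.store = store

    def virtual_objects(self):
        """Seeds: identifiers of the stored objects behind the virtual ones."""
        return [oid for oid, obj in self.store.items() if obj["salary"] > 1200]

    def on_retrieve(self, seed):     # dereference of a virtual object
        return self.store[seed]["surname"]

    def on_update(self, seed, new_value):
        self.store[seed]["surname"] = new_value

    # on_delete and on_new are not defined, so the corresponding operations
    # on the virtual objects are forbidden.

store = {"i1": {"surname": "Nowak", "salary": 1000},
         "i2": {"surname": "Kowalski", "salary": 1500}}
view = RichEmployeeView(store)
print([view.on_retrieve(s) for s in view.virtual_objects()])   # ['Kowalski']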
4.4 SBQL Query Optimisation

Issues of query optimisation for SBQL were deeply discussed and analysed in [179] and in over 20 research papers (e.g., [180, 181, 182, 183, 184, 185, 186]). A summary of the most important SBA optimisation techniques currently implemented in the virtual repository (ODRA) is presented below, with a short description of general object-oriented query optimisation given as a background.

The principal object-oriented query optimisation goals do not differ from the ones for relational queries. The process aims to produce a semantically equivalent form of an input query promising better (certainly not worse) evaluation efficiency. However, the techniques for reaching it are much more diverse, which corresponds directly to the variety of object-oriented models and approaches. Since object-oriented query languages were developed in a world strongly imbued with relational paradigms, similar solutions were produced also for them, including various relational algebras (e.g., [187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200]) – their description is out of the scope of this thesis, as SBQL does not rely on any algebra. Regardless of the algebra used, any object-oriented optimisation, similarly to the relational one, aims to:
• Avoid evaluating Cartesian products, which however is less important than in relational systems (in object-oriented databases relational joins are replaced with object links or pointers),
• Evaluate selections and projections as early as possible.

Moreover, some query transformations can be performed, e.g. concerning the order of operations (expensive operations should be performed after cheaper ones, to let them operate on smaller collections) and identifying constant and common expressions (they can be calculated only once and their result used many times, also in other queries evaluated in parallel). Finally, some low-level techniques (referring to the physical data organisation, e.g. data files, access through indices, etc.) are also utilised in object-oriented database systems.

A general SBQL query processing flow diagram and the basic system modules are shown in Fig. 16. The SBQL optimiser, in contrast to the relational one, relies mainly on static query transformations – mainly various rewriting methods that are easy to implement (due to the SBQL regularity), very efficient and reliable. These static transformations are performed directly on a query syntax tree, without the application of any object-oriented algebra as an intermediate query representation. The static query analysis begins with two basic steps:
• Type-checking (each name and each operator are checked against a database schema in order to determine whether the query is syntactically correct),
• Generation of a query syntax tree on which further rewriting (the actual optimisation) is performed.

Fig. 16 Architecture of query processing in SBQL [179]

The static optimisation is performed with auxiliary compile-time data structures corresponding to the runtime structures:
• A static QRES (corresponding to the runtime QRES) used for modelling the accumulation of query results,
• A static ENVS (corresponding to the runtime ENVS) used for modelling the opening of sections and binding operations,
• A metabase (corresponding to an object store, but kept unchanged during optimisation), standing for a description (model) of a database schema; this model contains information on object structures and types, procedure names with return and argument types, auxiliary structures (e.g., indices), possibly database statistics, etc.

Static stacks are only approximations of the runtime ones, hence their content and structure are different. The main concept used by the static query analysis is a signature, corresponding to runtime objects and ENVS sections – a signature stands for a definition of the runtime being it substitutes. All the static operations are performed with signatures only, e.g. queries (and subqueries) are assumed to return only single signatures reflecting their result types (signatures can also be complex, e.g. in the case of structures, but they do not reflect the quantities of the actually returned query result objects). A signature for a (sub)query is determined by its semantics, i.e.:
• If a query is a literal (integer, real, string, boolean, date), the signature corresponds to this literal's primitive type (a value signature: an integer, real, string, boolean or date signature, respectively),
• If a query is a name, a signature is created by taking a value from the static ENVS (the static binding procedure for the name) – a reference signature,
• If a query is q as name, a binder signature is created (a name for the actual signature of q, corresponding to a named result),
• If a query is q1 ∆ q2 (∆ is an algebraic operator), the final signature is a composition of the subquery signatures according to the ∆ type inference rules (e.g. some coercion),
• If a query is q1 Θ q2 (Θ is a nonalgebraic operator), the final signature is a composition of the subquery signatures built according to the Θ semantics, e.g. a structure signature for the , (comma) operator.

The static stack operations are executed by the static nested and static eval procedures, corresponding to the runtime ones, respectively. Some of the methods presented below are adapted from relational optimisers (e.g., pushing selections), some are designed from scratch for SBQL purposes and can be applied only due to its regularity and specific features.
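A simplified sketch of such bottom-up signature derivation (hypothetical structures; the real type checker works on a full metabase and also reports type errors):

def signature(node, metabase):
    kind = node[0]
    if kind == "literal":                   # e.g. ("literal", 1200)
        return type(node[1]).__name__       # value signature: 'int', 'str', ...
    if kind == "name":                      # e.g. ("name", "salary")
        return ("ref", metabase[node[1]])   # reference signature from the metabase
    if kind == "algebraic":                 # e.g. ("algebraic", ">", left, right)
        _, op, left, right = node
        signature(left, metabase)           # subquery signatures are derived,
        signature(right, metabase)          # then composed by the operator
        return "bool" if op in (">", "<", "=", "and", "or") else "number"
    if kind == "nonalgebraic":              # e.g. ("nonalgebraic", "where", q1, q2)
        _, op, q1, q2 = node
        s1 = signature(q1, metabase)
        signature(q2, metabase)             # checked in the scope opened by q1
        return s1 if op == "where" else ("struct", s1)
    raise ValueError(kind)

METABASE = {"Employee": "Employee object", "salary": "integer attribute"}
q = ("nonalgebraic", "where", ("name", "Employee"),
     ("algebraic", ">", ("name", "salary"), ("literal", 1200)))
print(signature(q, METABASE))   # ('ref', 'Employee object')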
4.4.1 Independent Subqueries

In SBQL, a subquery is called independent if it can be evaluated outside the loop (implied by nonalgebraic operators) it is called from within (which corresponds to the relational optimisation principle of evaluating selections as early as possible). The method for determining whether a subquery is independent consists of analysing in which static ENVS sections the names in the subquery are bound (in SBQL, any nonalgebraic operator opens its own scope on ENVS and each name is bound in some stack section). If none of the names is bound in the scope opened by the nonalgebraic operator currently being evaluated, then the subquery is independent and it can be evaluated earlier than implied by the original containing query. In order to perform this analysis, a nonalgebraic operator is assigned the number of the section it opens, and each name in the query is assigned two numbers: one for the ENVS size (the number of existing sections) when the name binding occurs, the other for the number of the section where the name is bound.

The query syntax tree is then modified so that the subquery can be evaluated as early as possible. The independent subqueries method is divided into two variants: pushing out and factoring out, described below.

4.4.1.1 Factoring Out

The factoring out technique is inspired by the relational method for optimising nested subqueries, but here it is much more general. The general rule for factoring out is formulated as follows. Given a (sub)query of the form q1 Θ q2, where Θ is a nonalgebraic operator and q2 is expressed as:

α1 q3 α2

where q3 is a syntactically correct subquery connected to the rest of q2 by arbitrary operators, the query is:

q1 Θ (α1 q3 α2)

If q3 is independent of the operator Θ and of the other nonalgebraic operators (if any are present) whose scopes are on ENVS above the scope opened by Θ, then q3 can be factored out and the general query becomes (an auxiliary name x is introduced):

(q3 as x).(q1 Θ (α1 x α2))

This holds if q3 returns a single result; if its result is more numerous (the general case), the groupas operator should be applied:

(q3 groupas x).(q1 Θ (α1 x α2))

There are still some cases in which an independent subquery cannot usefully be factored out. The main exception occurs if the independent subquery consists of a single name expression. The first reason (which, as shown, could be neglected) is that the number of evaluations of such a factored out expression is the same as without factoring out, while some additional operations for binding the auxiliary name are invoked. However, since their costs are negligibly small, this is not considered an issue (similarly, a factored out expression might be evaluated multiple times if each operator it is independent of performs one operation in its loop – again, the optimisation does not improve the query evaluation then, but neither does it considerably deteriorate it). This is not true if the factored out name is a procedure call – in this case it would indeed be evaluated only once, which should improve the general query performance.
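A schematic illustration (using the sample schema names; the aggregate avg is assumed to be available, as sum is in Example 7): in the query

Employee where salary > avg(Employee.salary);

the subquery avg(Employee.salary) is independent of where, so it can be factored out and evaluated once:

(avg(Employee.salary) as x).(Employee where salary > x);

Without the transformation, the average would be recomputed for every Employee object processed by the where loop.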
4.4.1.2 Pushing Out

The pushing out technique is based on the distributivity property of some operators. A nonalgebraic operator Θ is distributive if the query:

(q1 union q2) Θ q3

is equivalent to:

(q1 Θ q3) union (q2 Θ q3)

In SBQL, the where, . (dot) and join operators are distributive. For them, the following associativity rules can be derived:

(q1.q2).q3 ⇔ q1.(q2.q3)
(q1 join q2) join q3 ⇔ q1 join (q2 join q3)
(q1 join q2).q3 ⇔ q1 join (q2.q3)
(q1 join q2) where q3 ⇔ q1 join (q2 where q3)

The distributivity property allows a new optimisation method – pushing a selection before a join (also known in relational systems). If a whole predicate is not independent of where, but has the form:

p1 and p2 and … and pk-1 and pk and pk+1 and … and pn-1 and pn

and some subpredicate pk is independent, the predicate can be transformed into:

p1 and p2 and … and pk-1 and pk+1 and … and pn-1 and pn

provided pk is pushed out so that its selection is evaluated only once. Pushing out selections is similar to factoring out – it can be applied with respect to a set of nonalgebraic operators the predicate is independent of, and the predicate can be arbitrarily complex (including procedure invocations). However, there is a substantial difference between these techniques – a factored out expression is evaluated after some nonalgebraic operator Θ it is dependent on pushes its sections (opens its scope) on the ENVS, while in the case of pushing out, an independent predicate is evaluated before such an operator opens its scope (a pushed out predicate is pushed before the operator it is dependent on). Pushing out can be applied if all the nonalgebraic operators a predicate is independent of are distributive, and only if the predicate is connected to the rest of its containing predicate with the and operator (otherwise factoring out can be applied).

4.4.2 Rewriting Views and Query Modification

SBQL updateable views are stored as functional database procedures. Their execution (on invocation from a query) can be performed with the common macro-substitution technique; however, SBQL also uses a much better solution – query modification. A basic view execution means that when a view name is encountered in a query, its body is executed and the result is just pushed onto QRES (possibly for further processing as a sub-result of the query). However, if a view body is macro-substituted for its name in a query, new optimisation possibilities open up, as a nested query from the view becomes a regular subquery subject to all the SBQL rules and optimisation techniques. The names in the resulting subquery are bound like the other names in the query and therefore the whole query can be rewritten regardless of whether some name was entered in its textual form or macro-substituted for a view name (including nested view invocations). Views that can be macro-substituted must be characterised by the following features:
• They do not create their own local environments (i.e. they operate only on global data),
• They are dynamically bound,
• The only expression in the body is a single query.

The query modification is slightly more constrained for parameterised views (i.e. views with arguments). A basic method for passing arguments (parameters) to a procedure (here: a view) is call-by-value (or call-by-reference).
When using macro-substitution, call-by-name should be used instead. This means that when macro-substituting a view body, its formal parameters must be replaced with the actual parameters; the resulting query can then be optimised in the regular way. The limitation on using this technique is that a view cannot have side effects – precisely: its execution cannot affect its actual parameters in any way. The query modification method can be applied to any stored procedure, not only a view, provided it obeys the limitations listed above.

4.4.3 Removing Dead Subqueries

A dead subquery is a part of a query whose execution does not influence the final result and, of course, is not used as an intermediate result for evaluating the (sub)query it is contained within. The evaluation of such a dead subquery only consumes time and resources and should be avoided, which is what the described method achieves. Dead subqueries occur in user-typed queries, and they are very often introduced as a result of the macro-substitution of views. The procedure for the identification of a dead subquery is based on excluding subqueries that directly or indirectly contribute to the final result of the query. Queries that can contain dead parts are those with navigation operators and/or quantifiers, as these operators (projection and quantifiers) can use only a part of their left operand (a subquery), ignoring the rest of it, which becomes dead. Other nonalgebraic operators in SBQL (e.g., a navigational join or selection) do not have this property, as they always consume their whole operand queries' results. Where might use only a part of its left subquery; however, it always determines the result type and therefore no part of its left operand can be considered dead.

However, some dead subqueries cannot be removed. This rare situation occurs if removing a dead part affects the quantity of returned objects (the size of the result) – such a subquery is called partially dead. Another interesting case is that removing one dead subquery can make another subquery dead; therefore, the removing process should be repeated until all dead parts are removed.

4.4.4 Removing Auxiliary Names

Auxiliary names (aliases) can be introduced into a query by a programmer in order to make it clearer or to access precisely its particular parts (subquery results) in other parts. They are also introduced automatically by the macro-substitution of views (as they exist in views' definitions). An auxiliary name (resulting in a binder) is processed by a projection onto the alias, an operation which in many cases turns out unnecessary (the projected result can be accessed directly), and a new opportunity for optimisation appears. The operation of removing unnecessary auxiliary names is not straightforward, as it can change the query result. An auxiliary name n can be removed only if:
• The result of evaluating n is consumed by a dot operator,
• It is used only for navigation, i.e. the direct nonalgebraic operator using n is a dot and the result of evaluating n is not used by any other nonalgebraic operator.

Moreover, an auxiliary name cannot be removed if this operation would make the names in the "uncovered" subquery (previously "hidden" under the alias) bind in other sections than in the original query with the auxiliary name.
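A schematic example (sample schema names): in the query

(Employee as e).(e.surname);

the alias e is consumed only by the dot operator and is used purely for navigation, so it can be removed, yielding the equivalent query

Employee.surname;

provided that surname still binds in the same stack section after the alias is gone.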
4.4.5 Low Level Techniques

A set of low-level query optimisation techniques exists in object-oriented and relational query optimisers. In general, they rely mainly on some physical features of a database system (e.g. data storage and organisation). These techniques can be applied at both the data access and the query transformation (e.g., introducing index invocations) stages.

4.4.5.1 Indexing

Indices are auxiliary (redundant) database structures stored at the server side, accelerating access to particular data according to given criteria. In general, an index is a two-column table where the first column consists of unique key values and the other one holds non-key values, which in most cases are object references. Key values are used as an input for index search procedures. As a result, such a procedure returns the suitable non-key values from the same table row. Keys are usually collections of distinct values of specific attributes of database objects (dense indices) or represent ranges of these values (range indices); they can also be some complex structures or the results of some queries or procedures. The current ODRA implementation supports indices based on the linear hashing structure [201], which can be easily extended to its distributed version, SDDS31 [202], in order to optimally utilise data grid computational resources. In addition to the key types mentioned earlier (dense and range), an enumerated type was introduced to improve multiple-key indexing. ODRA supports local indexing, which ensures index transparency by providing a mechanism (an optimisation framework) to automatically utilise an index before query evaluation and therefore to take advantage of indices (distributed indexing is under development).

31 Scalable Distributed Data Structure

The index optimisation procedure (substituting an index call for a particular subquery) is performed during static query rewriting. Such an index invocation is a regular procedure (function) with parameters (corresponding to index key values) that could be used in a regular query (because of the index transparency feature, index functions' names are rejected at query typechecking). The function accesses the index data according to the given or calculated key value(s) and pushes the found object identifiers onto QRES for further regular processing. An index call is substituted as the left where operand, so that the other predicates (if any exist) are evaluated on the considerably smaller (sometimes by orders of magnitude) collection returned from the fast index execution. A cost model is used in order to determine which index is the most efficient for a given query (its selectivity is the issue) and which predicate can be replaced with an index function call.
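A schematic example (hypothetical index function name): assuming a dense index on Employee.salary exposed as an index function idx_Employee_salary(key), the optimiser can rewrite

Employee where salary = 1200 and surname = "Nowak";

into

idx_Employee_salary(1200) where surname = "Nowak";

so that the remaining predicate is evaluated only on the (usually much smaller) collection of identifiers returned by the index call.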
Chapter 5 Object-Relational Integration Methodology

5.1 General Architecture and Assumptions

Fig. 3 (page 43) presents the architecture of the eGov-Bus virtual repository and the place taken by the described wrapper. Fig. 17 presented below shows a much more general view of a virtual repository with its basic functional elements. The regarded resources can be any data and service sources; however, only relational databases are shown here for simplification.

[Fig. 17 Virtual repository general architecture: global clients access the global infrastructures (security, authentication, transactions, indices, workflow, web services) and the data and services global virtual store (virtual data integration and presentation), expressed in the object-oriented model and query language; the wrappers (schema mapping, query recognition and rewriting, communication and transport, result reconstruction, schema reading) connect the virtual repository to the underlying RDBMSs with their relational model and query language.]

The virtual repository provides global clients with the required functionalities, e.g. the trust and security infrastructure, communication mechanisms, etc. It is also responsible for the integration and management of virtual data and the presentation of the global schema. A set of wrappers (marked in red) interfaces between the virtual repository and the resources. The basic wrapper functionalities (shown in blue boxes) are:
• Providing communication and data transportation means (both between the wrapper and the virtual repository and between the wrapper and the resource),
• Enabling, preferably automated, relational schema reading,
• Mapping a relational schema to the object-oriented model,
• Analysing object-oriented queries so that appropriate results are retrieved from relational resources.
Another view of the virtual repository with its wrappers is presented in Fig. 18, where the schema integration stages are shown (colours correspond to the mediation architecture shown in Fig. 1, page 25).

[Fig. 18 Schema integration in the virtual repository: global clients see the object-oriented business model (global schema) built by the global views over the integration schema; contributory views, defined by the administrator/designer, expose contributory schemata over the object-oriented relational model representations (M0) provided by the wrappers for the local schemata of RDBMS 1 and RDBMS 2.]

Contributory views present relational schemata as simple M0 object-oriented models (subchapter 4.1 SBA Object Store Models). These views are parts of the global schema provided by the system administrator/designer, i.e. they must obey the names and object structures used in the integration schema. The integration schema is responsible for combining local schemata (according to known fragmentation rules and ontologies) into the global schema presented to global users – the only one available at the top of the virtual repository. In the simplest case, the integration schema can be used as the global schema; it can also be further modified to match the virtual repository requirements (in the most general case, separate global schemata can be applied according to clients' requirements and access rights). The integration schema and its mapping onto the global schema are expressed with the global views in the virtual repository. In this architecture, wrappers are responsible for reading the local schemata of the wrapped resources and presenting them as object-oriented schemata (simple M0 models). These object-oriented schemata are further enveloped by the appropriate contributory views.
5.2 Query Processing and Optimisation

Besides the schema mapping and translation (so that relational resources become available in the virtual repository), the wrapper is responsible for query processing. The schematic query processing diagram is presented in Fig. 19 (schema models shown in light green correspond to Fig. 18 above).

[Fig. 19 Query processing schema: a global SBQL query against the object-oriented business model passes through the parser and type checker (front-end SBQL syntax tree), the external wrapper (updateable views and query modification) working on the object-oriented relational model representation (M0), and the rewriting optimiser (back-end SBQL syntax tree); the internal wrapper converts SBQL subqueries into equivalent SQL executed as dynamic SQL (ODBC, JDBC, ADO, ...) against the RDBMS, whose relational schema information is available to the wrapper; the remaining processing is performed by the SBQL interpreter.]

The global ad-hoc query (referring to the global schema) issued by one of the global clients is regularly parsed and type-checked, which results in the front-end syntax tree (still expressed in terms of the global schema). This syntax tree is macro-substituted with the views' definitions (corresponding to the global, integration and contributory schemata shown in Fig. 18) and can be submitted to the query modification procedures (the views must obey some simple rules for the query modification to be applied – subchapter 4.4.2 – which must be ensured by the system administrator). The back-end SBQL syntax tree is extremely large due to the macro-substitution applied, and it refers to the primary objects, i.e. the ones exposed directly by the wrapper. Here the SBQL rewriting optimisers (regular SBQL optimisation, subchapter 4.4) can be applied together with the internal wrapper rewriter. The wrapper query rewriter analyses the syntax tree to find expressions corresponding to “relational” names, i.e. names corresponding to relational tables and columns – the analysis procedure is based on the relational schema information available to the wrapper's back-end, the metabase and the expressions' signatures provided by the type checker (details in Chapter 6 Query Analysis, Optimisation and Processing). If such names are found, their SBQL subqueries are substituted with corresponding dynamic SQL expressions (execute immediately) to be evaluated in the wrapped resource. According to the naive approach, each relational name should be substituted with simple SQL: select * from R, where all records are retrieved and processed in the virtual repository for the desired result. This approach, although always correct and reliable, is extremely inefficient and it introduces undesired data transportation and materialisation. Therefore, a strong need for optimisation appears, so that as much evaluation load as possible is pushed down to the wrapped relational database, where powerful query optimisers can work (details are discussed in the following subsection).
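To make the difference concrete (a minimal sketch using the patientR table of the conceptual example below), in the naive approach the query:

(patientR where surname = "Smith") . surname;

is evaluated by substituting select * from patientR for the table name, so all records are transported and materialised, and the selection and projection are performed by the virtual repository; the optimised translation issues:

select surname from patientR where surname = 'Smith'

instead, retrieving only the data actually needed.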
The general wrapper query optimisation and evaluation procedure consists of the following steps:
1. Query modification is applied to all view invocations in a query, which are macro-substituted with the seed definitions of the views. If an invocation is preceded by the dereference operator, the corresponding on_retrieve function is used instead of the seed definition (analogically, on_navigate for virtual pointers). The effect is a very large SBQL query referring to the M0 version of the relational model available at the back-end.
2. The query is rewritten according to the static optimisation methods defined for SBQL, such as removing dead subqueries, factoring out independent subqueries, pushing expensive operators (e.g. joins) down in the syntax tree, removing unused auxiliary names, etc. The resulting query is SBQL-optimised, but still no SQL optimisation is applied.
3. According to the available information about the relational schema, the back-end wrapper's mechanisms analyse the SBQL query in order to recognise patterns representing SQL-optimiseable queries. Then, execute immediately clauses are issued and executed by the resource driver interface (e.g. JDBC).
4. The results returned by execute immediately are pushed onto the SBQL result stack as collections of structures, which are then used for regular SBQL query evaluation.

5.2.1 Naive Approach vs. Optimisation

As stated above, the naive implementation can always be applied, but it gives relational optimisers no chance to work, and all processing is executed by the virtual repository mechanisms. SBQL expressions (subqueries) resulting from step 2 of the above procedure should be further analysed by the internal wrapper in order to find the largest possible subqueries (patterns) transformable to (preferably) optimiseable SQL (although SQL optimisers – subchapter 3.3 Relational Query Processing and Optimisation Architecture – are transparent, one can assume when they actually act, e.g. when evaluating joins or selections over indexed columns). The wrapper optimisation goals are as follows:
• Reduce the amount of data retrieved and materialised (in the most favourable case, only the final results are retrieved),
• Minimise processing by the virtual repository (in the most favourable case, the retrieved results match exactly the original query intention).
The wrapper optimisation is much more challenging than the simple rewriting applied in the naive approach presented above. The first issue is that many SBQL operators and expressions do not have relational counterparts (e.g. as, groupas), while the semantics of some others differ (e.g. the assignment operator in SBQL is not macroscopic). Hence, the set of SQL-transformable SBQL operators has been isolated first. According to the optimisation goals, the wrapper optimiser should attempt to find the largest SBQL subquery that can be expressed in equivalent SQL. Therefore the following search order has been assumed:
• Aggregate functions,
• Joins,
• Selections,
• Names (corresponding to relational tables).
This order is caused by the possible complex expression forms and arguments (internal expressions), i.e. aggregate functions can be evaluated over joins or selections, and joins can contain additional selection conditions. In the case of joins, selections and table names, another optimisation is also applied, since projections can be found and only the desired relational columns retrieved.
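For instance (a schematic illustration in the spirit of the examples of Chapter 6), in the query:

sum((Employee where salary > 1200) . salary);

the aggregate function pattern is matched first, so the whole expression can be transformed into a single SQL query of the form select sum(salary) from employees where salary > 1200, evaluated entirely by the relational resource. If the inner selection were matched first, the selected records would have to be retrieved and the aggregate evaluated by the virtual repository.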
The actual wrapper query processing algorithms are discussed in Chapter 6, followed by comprehensive examples, while the results of applying these algorithms are given in Chapter 7.

5.3 A Conceptual Example

This subsection presents an abstract, conceptual (implementation-independent) example of the presented relational schema wrapping and query processing approach. First, consider a simple two-table relational schema (Fig. 20). This medical database contains information on patients (the patientR table) and doctors (doctorR); “R” stands for “relational” and it is introduced just to increase the example's clarity. Each patient is treated by some doctor, which is realised with the primary-foreign key relationship on the doctorR.id and patientR.doctor_id columns. Besides the primary keys, there are non-unique (secondary) indices on the patientR.surname and doctorR.surname columns.

[Fig. 20 Relational schema for the conceptual example: doctorR (id (PK), name, surname, salary, specialty, ...) and patientR (id (PK), name, surname, doctor_id (FK)).]

This schema is imported by the wrapper and primary objects corresponding to relational tables and columns are created, i.e. a relational table becomes a complex object whose subobjects correspond to the table's columns (with their primitive data types); the relational names are preserved for these objects (one-to-one mapping). This simple object-oriented schema is ready for querying, but the relational constraints are still not reflected. Therefore it is enveloped with object-oriented updateable views. The resulting object-oriented schema appears (Fig. 21) – the primary-foreign key relationship is realised with the isTreatedBy virtual pointer.

[Fig. 21 Object-oriented view-based schema for the conceptual example: Doctor (id, name, surname, salary, specialty) and Patient (id, name, surname) with the isTreatedBy virtual pointer from Patient to Doctor.]

The simplified code of the enveloping views with no updates defined (just retrieval and navigation where necessary) is presented in Listing 1.

Listing 1 Simplified updateable views for the conceptual example

view DoctorDef {
    virtual objects Doctor: record {d: doctorR;}[0..*] { return (doctorR) as d; }
    /* on_retrieve skipped for Doctor */
    view idDef {
        virtual objects id: record {_id: doctorR.id;} { return d.id as _id; }
        on_retrieve: integer { return deref(_id); }
    }
    view nameDef {
        virtual objects name: record {_name: doctorR.name;} { return d.name as _name; }
        on_retrieve: string { return deref(_name); }
    }
    view surnameDef {
        virtual objects surname: record {_surname: doctorR.surname;} { return d.surname as _surname; }
        on_retrieve: string { return deref(_surname); }
    }
    view salaryDef {
        virtual objects salary: record {_salary: doctorR.salary;} { return d.salary as _salary; }
        on_retrieve: real { return deref(_salary); }
    }
    view specialtyDef {
        virtual objects specialty: record {_specialty: doctorR.specialty;} { return d.specialty as _specialty; }
        on_retrieve: string { return deref(_specialty); }
    }
}

view PatientDef {
    virtual objects Patient: record {p: patientR;}[0..*] { return (patientR) as p; }
    /* on_retrieve skipped for Patient */
    view idDef {
        virtual objects id: record {_id: patientR.id;} { return p.id as _id; }
        on_retrieve: integer { return deref(_id); }
    }
    view nameDef {
        virtual objects name: record {_name: patientR.name;} { return p.name as _name; }
        on_retrieve: string { return deref(_name); }
    }
    view surnameDef {
        virtual objects surname: record {_surname: patientR.surname;} { return p.surname as _surname; }
        on_retrieve: string { return deref(_surname); }
    }
    view isTreatedByDef {
        virtual objects isTreatedBy: record {_isTreatedBy: patientR.doctor_id;} { return p.doctor_id as _isTreatedBy; }
        on_retrieve: integer { return deref(_isTreatedBy); }
        on_navigate: Doctor { return Doctor where id = _isTreatedBy; }
    }
}

The next paragraphs discuss the query processing steps with the corresponding textual forms and visualised syntax trees. Consider an object-oriented query referring to this object-oriented view-based schema, aiming to retrieve surnames of doctors treating patients named Smith, whose salary is equal to the minimum salary of cardiologists (Fig. 22):

((Patient where surname = "Smith").isTreatedBy.Doctor as doc where doc.salary = min((Doctor where specialty = "cardiology").salary)).doc.surname;

Fig. 22 Conceptual example input query syntax tree

The first step in the query processing is introducing implicit deref calls where necessary (Fig. 23):
(((((((Patient where (deref(surname) = "Smith")) . isTreatedBy) . Doctor)) as doc where (deref((doc . salary)) = min(deref(((Doctor where (deref(specialty) = "cardiology")) . salary))))) . doc) . surname);

Fig. 23 Conceptual example query syntax tree with dereferences

Then, the deref calls are macro-substituted with the on_retrieve and on_navigate definitions for virtual objects and virtual pointers, respectively, and all view invocations are substituted with the queries from the seed definitions (the syntax tree illustration is skipped due to its large size). This step enables the query modification procedures (subsection 4.4.2 Rewriting Views and Query Modification) applied in the next steps, thanks to the simple single-command view bodies:

((((((((patientR) as p where ((((p . surname)) as _surname . deref(_surname)) = "Smith")) . ((p . doctor_id)) as _isTreatedBy) . ((doctorR) as d where ((((d . id)) as _id . deref(_id)) = deref(_isTreatedBy))))) as doc where (((doc . ((d . salary)) as _salary) . deref(_salary)) = min(((((doctorR) as d where ((((d . specialty)) as _specialty . deref(_specialty)) = "cardiology")) . ((d . salary)) as _salary) . deref(_salary))))) . doc) . ((d . surname)) as _surname);

Now, auxiliary names are removed where possible (i.e. p, d, _surname, _isTreatedBy, _id, _specialty and both occurrences of _salary) – the views' definitions macro-substituted in the previous step are now regular parts of the original query and syntactical transformations can be applied (Fig. 24):

((((((doctorR where (deref(id) = ((patientR where (deref(surname) = "Smith")) . deref(doctor_id))))) as doc where ((doc . deref(salary)) = min(((doctorR where (deref(specialty) = "cardiology")) . deref(salary))))) . doc) . surname)) as _surname;

Fig. 24 Conceptual example query syntax tree after removing auxiliary names

Next, the SBQL optimisation methods are applied – here two independent subqueries can be found (one evaluating the minimum cardiologists' salary, the other retrieving the doctor identifiers of patients named Smith). They are pulled in front of the query and their results are given the auxiliary names aux0 and aux1 (Fig. 25):

(((((min(((doctorR where (deref(specialty) = "cardiology")) . deref(salary)))) as aux0 . ((((((patientR where (deref(surname) = "Smith")) . deref(doctor_id))) as aux1 . (doctorR where (deref(id) = aux1)))) as doc where ((doc . deref(salary)) = aux0))) . doc) . surname)) as _surname;

Fig. 25 Conceptual example query syntax tree after SBQL optimisation

On the resulting query form, the wrapper analysis and wrapper optimisation are performed. Based on the available relational model information, the following SQL queries invoked with execute immediately can be created (Fig. 26):

exec_immediately("select min(salary) from doctorR where specialty = 'cardiology'") as aux0 .
exec_immediately("select doctor_id from patientR where surname = 'Smith'") as aux1 .
exec_immediately("select surname from doctorR where salary = '" + aux0 + "' and id = '" + aux1 + "'") as _surname;

Fig. 26 Conceptual example query syntax tree after wrapper optimisation

Each of these SQL queries is executed in the relational resource with the application of indices, where available. Only minimal processing is required from the virtual repository – the partial results returned from the first two execute immediately calls are pushed onto the stacks, as they parameterise the last SQL query.
The last execute immediately call retrieves the final result matching the original query semantics. Note that the above transformations are valid only if there is exactly one patient named Smith; nevertheless, the idea holds in more general cases, as proved by more complex examples.

Chapter 6 Query Analysis, Optimisation and Processing

The algorithms presented in this chapter refer to step 3 of the query optimisation and evaluation procedure described above (subchapter 5.2 Query Processing and Optimisation, page 89). They assume that the input SBQL query is view-rewritten (macro-substitution and query modification applied) and type-checked.

6.1 Proposed Algorithms

The algorithm shown in Fig. 27 is applied for the query analysis. First, the query tree is checked for relational names. If none exist, the analysis stops – the query does not refer to the wrapper and it should not be modified. Otherwise, the tree is checked to determine whether it is an update or a delete query. If so, a check is performed to establish whether the wrapper optimisation can be applied. Currently only range expressions are detected at this stage, since they do not have counterparts in SQL and pointing at some particular object for updating/deleting cannot be translated. If such an expression is found as the operation's target object, an error is returned and the algorithm stops. In all other cases, the delete/update tree is rewritten (the detailed procedures are presented in subsections 6.1.2 and 6.1.3, respectively) and returned. If the query is a selection (neither an update nor a delete), possibly SQL-transformable patterns are searched for (aggregate functions, joins, selections, table names). For each target pattern, all matches are found starting from the tree root, so that the most external ones (i.e. the ones corresponding to the largest subqueries) are processed first. Each match is checked to determine whether it can be optimised (i.e. whether all subexpressions of the corresponding tree branch are recognised and transformable). If so, the transformation is applied (details in subsection 6.1.1). Otherwise, the tree branch corresponding to the current match is returned unoptimised for further analysis and processing.

[Fig. 27 Query analysis algorithm – flowchart: get the tree; if no relational names exist, return the original tree; otherwise make a tree copy; for updates and deletes, check whether the method can be applied, returning an error otherwise, and rewrite the tree; for selections, iterate over the targets (aggregate functions, joins, selections, names), find all matches for each target and rewrite every branch for which the method can be applied, leaving the other branches unchanged; finally return the optimised tree.]

At the beginning of the query analysis and transformation, a copy of the tree is made. The copy is returned instead of the optimised tree if some unexpected error occurs (e.g. an unsupported expression signature) when some branches are already modified. This ensures that query evaluation is always correct, although it is then executed without the wrapper optimisation (the naive approach is applied in such a case).
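For example (a hypothetical query over the test schema of subchapter 6.2), the update:

((Employee where surname = "Nowak")[1] . salary) := 900;

contains the range expression [1]; it has no SQL counterpart, as there is no way of pointing at that particular record in the relational resource, so an error is returned and the algorithm stops.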
6.1.1 Selecting Queries

The following algorithm is applied to each wrapper-optimiseable subquery (tree branch) found according to the analysis procedure described above (Fig. 27).

[Fig. 28 Selecting query processing algorithm – flowchart: get the tree and remember the result signature; if the query is an aggregate function, remember the function type; recognise the table names, the selection conditions and the projections (column names), returning an error if the names or conditions cannot be established; build the SQL query, build the result pattern (from the signature), build the SBQL expression, reconstruct the typological information and return the SBQL expression.]

Selecting queries are optimised according to the algorithm shown in Fig. 28. The first step is remembering the input (sub)query signature for further use. The query is checked to determine whether it is an aggregate function (if so, the function type is remembered). The next steps are finding the table names invoked and the selection conditions; if either cannot be established (e.g. in the case of unsupported signatures), an error is returned and the procedure stops. Next, projections (column names) are found. Although the projected names are not required for most queries (projections can, however, seriously limit the amounts of data retrieved and materialised), they can be required by some aggregate functions (e.g. min, max, avg). The final steps consist of building a SQL query string (subsection 6.1.4) based on the analysis performed and enveloping the query string in the appropriate SBQL expression for regular stack-based evaluation (subsection 6.1.5). The expression is given back the signature of the original input query, and it is also provided with the description of the result to be reconstructed. The description is established from the original signature and enables pushing onto the stack a result matching the original query semantics. If this procedure returns an error, the overall query is processed in the unoptimised form.

6.1.2 Deleting Queries

The following algorithm is applied only if the query is recognised as deleting in the analysis process illustrated in Fig. 27. Processing deleting queries is similar to processing selecting ones, since the delete operators in SBQL and SQL are similar (in both languages they are macroscopic); the algorithm is presented in Fig. 29. The input query tree is checked for the table name from which the deletion should be performed and for the selection conditions pointing at the records to be deleted; again, an error is returned if either of these steps fails. There is no need to remember the input query signature, since in SBQL the delete operator does not return a value (an empty result). However, since the common practice in relational databases is to return the number of records affected by the deletion, an integer signature (and the corresponding result pattern) is assigned to the final SBQL expression. This mismatch between regular SBQL deleting and wrapper deleting does not affect the query semantics, as it does not enforce any further modifications by the programmer, but the additional information might be useful in some cases. If this procedure returns an error, it is propagated and returned instead of the optimised query, since there is no other way of expressing deletes, as there was for selects.

[Fig. 29 Deleting query processing algorithm – flowchart: get the tree; recognise the table name and the selection conditions, returning an error if either fails; build the SQL query, build the result pattern (an integer value), build the SBQL expression, reconstruct the typological information and return the SBQL expression.]
6.1.3 Updating Queries

The following algorithm is applied only if the query is recognised as updating in the analysis process illustrated in Fig. 27. Selecting and deleting queries are processed similarly; however, updates in SBQL substantially differ from those in SQL. The semantics of the assignment operator (:=) in SBQL requires a single object as the left-hand side expression, while in SQL the update operation is macroscopic. Further, the SBQL update returns a reference to the affected object, which cannot be realised directly in SQL. Due to these differences, the algorithm shown in Fig. 30 has been designed. First, the procedure analyses the left-hand side expression of the assignment operator to find the table and its column to be updated, and again the selection conditions are detected (in the case of an unrecognised expression, an error is returned and the procedure ends). Then the right-hand side expression is isolated to be used as the update value.

[Fig. 30 Updating query processing algorithm – flowchart: get the tree; recognise the table name, the selection conditions and the projection from the LHS expression, returning an error on failure; isolate the RHS expression; build the SBQL count check expression and the SQL update query; build the result pattern (an integer value), build the SBQL expression, reconstruct the typological information, build the IfThen expression and return it.]

According to the selection conditions found, the aggregate count function is realised in SQL in order to check the number of rows to be affected (so as to comply with the SBQL assignment semantics). The other SQL expression is the actual update constructed according to the analysis results. Both SQL query strings are embedded in an SBQL IfThen expression evaluating first the count check and then performing the SQL update (provided the count check result is 1); no operation is performed otherwise. The original reference signature is not available from the SQL evaluation and it is replaced with the integer value signature – on the query evaluation, the number of affected rows is returned (only the values 0 or 1 are possible). Changing the original typological information (an integer value instead of a reference) might unfortunately cause some problems, although these are detected already at the query typechecking stage, not at runtime. Some approaches to overcoming this issue were considered, like performing an additional selection after the actual update in order to return the reference. Such behaviour would not be very useful, however, since the reference to the same object (realising a wrapped relational column) is dynamically assigned on each selection and retrieval, and it does not provide any constant mapping between the object-oriented and relational stores. Similarly to deleting queries, if this procedure returns an error, it is propagated and returned instead of the optimised query, since there is no other way of expressing updates, as there was for selects.
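Schematically (a sketch only – the result pattern and wrapper identifier arguments of execsql are elided here), an update such as (Employee where surname = "Nowak") . salary := 900; results in a conditional pair of SQL statements of the form:

if (execsql("select count(*) from employees where (employees.surname = 'Nowak')", ...) = 1) then
    execsql("update employees set salary = 900 where (employees.surname = 'Nowak')", ...)

so that the update is performed only when exactly one row would be affected, in compliance with the SBQL assignment semantics.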
6.1.4 SQL Query String Generation

Each of the presented algorithms (for selecting, deleting and updating queries) requires building a SQL query string based on the conditions returned from the appropriate analysis process. The actual SQL generation algorithm is irrelevant here (in the prototype implementation, zql [203] is employed); nevertheless, some issues should be pointed out. Problems arise if the analysed SBQL query involves, besides relational names (referring to tables and columns) and literal values (strings, integers, etc.), pure object-oriented expressions (mixed queries), e.g. names not referring to the wrapper schema, procedure calls, or SBQL-specific expressions (e.g. now(), random()). Such expressions cannot simply be substituted into SQL query strings, as they must first be evaluated in the stack-based manner. Hence, the algorithm presented in Fig. 31 is proposed. This procedure should be performed around the actual query analysis, and its results should be restored in the final SQL query strings; for simplification, however, it is skipped in the diagrams presented in Fig. 27, Fig. 28, Fig. 29 and Fig. 30.

[Fig. 31 SQL generation and processing algorithm for mixed queries – flowchart: find the “forbidden” expressions in the tree; extract them and replace them with unique dummy expressions (e.g. strings), storing the original expressions with their dummies externally; restore correct signatures to the dummy expressions (e.g. string values); process the tree with the appropriate algorithm; search the generated SQL strings for the dummy expressions, split the strings over the matches, concatenate the SQL substrings with the original expressions using SBQL operators (+), replace the original SQL strings with the concatenated ones and return the modified tree.]

The input SBQL syntax tree is searched for “forbidden” expressions, i.e. the ones that cannot be mapped onto the relational schema or embedded in SQL strings. These expressions are extracted from the tree and their branches are replaced with unique dummy expressions (e.g. string expressions) with appropriate signatures. The original expressions are stored with the corresponding dummy ones for future reconstruction. The modified tree is submitted to the analysis and transformations described in the previous subsections. The resulting tree is then searched for SQL query strings. Each SQL string is searched for the dummy expressions and split into two substrings over each match. The substrings are then concatenated with the original expression related to the dummy one (using the regular SBQL + operator). The tree branch corresponding to the SQL query is type-checked again to introduce the typological information and the modifications required for the stack-based evaluation. The final SQL query appears at runtime, when all restored SBQL subexpressions have been evaluated on the stacks and the final string can be concatenated (also in a stack-based manner).
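As a minimal illustration (hypothetical, over the test schema of subchapter 6.2), in the mixed query:

(Employee where birthDate < now()) . surname;

the now() call is a “forbidden” expression: it is replaced with a unique dummy string for the analysis, and after the SQL generation the resulting string is split over the dummy and concatenated back with the original expression, yielding (schematically, with the result pattern and wrapper identifier elided):

execsql("select employees.surname from employees where (employees.birth_date < '" + now() + "')", ...)

so that now() is evaluated on the stacks at runtime and the complete SQL string is assembled just before being sent to the resource.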
6.1.5 Stack-Based Query Evaluation and Result Reconstruction

As stated above, the SQL query strings executed with dynamic SQL must be enveloped with appropriate SBQL expressions providing the typological information, so that the stack-based evaluation can be performed. These SBQL expressions are referred to as execsql, and they have to be provided with additional information, i.e. how to process the returned relational results so that the original SBQL query semantics is unchanged. This information has to be passed to the runtime environment, where the expression signatures available during the analysis process do not exist anymore. Another piece of information to be supplied for the wrapper query evaluation is a unique identifier of the particular wrapper – as shown in Fig. 17, the virtual repository integrates many resources.

During the evaluation of an SBQL query (the actual evaluation is performed on the compiled query byte code; this step is skipped in the procedure description for simplification), when an execsql expression is encountered, its arguments (the SQL query, the result pattern, the wrapper identifier) are pushed onto the stacks. If the SQL query string itself has to be evaluated first (e.g. concatenated from substrings for a mixed query), the regular expression evaluation procedures are applied. When the string is ready, it is sent to the appropriate wrapper pointed to by the identifier. The returned relational results are processed according to the result pattern and pushed onto the query result stack. If the execsql expression is a subexpression of a larger query, the results can be further processed; if not, they are returned directly to the client. In the prototype implementation (Appendix C), the wrapper identifier denotes the corresponding database module name, while the result patterns are expressed as simple string expressions.
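For instance (the form below follows the examples of subsection 6.2.2; the comments are added here only for explanation), an optimised query is finally represented as:

execsql(
    "select employees.name from employees where (employees.surname = 'Kowalski')", /* the dynamic SQL string */
    "<0 $employees | $name | _name | none | binder 0>",                            /* the result pattern */
    "admin.wrapper1")                                                              /* the wrapper (module) identifier */

The first argument is sent to the wrapped resource, the second one tells the runtime how to reconstruct an SBQL result (here a collection of _name binders) from the returned rows, and the third one points at the target wrapper.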
6.2 Query Analysis and Optimisation Examples

The examples of query processing performed by the wrapper presented below are based on two relational schemata, “employees” and “cars”, described in the following subsection.

6.2.1 Relational Test Schemata

The first schema (Fig. 32) presents some company employees' data. The main table (employees) contains personal data and is related by a one-to-many relation to the departments table, which in turn is related to the locations table. The other schema (Fig. 33) presents data on cars. The main table (cars) contains car data and is related by a one-to-many relation to the models table, which in turn is related to the makes table. By assumption, the schemata can be maintained on separate machines and they are only logically related by the employees.id and cars.owner_id columns (an employee can own a car), so that more general, real-life wrapper actions can be simulated and analysed.

[Fig. 32 The “employees” test relational schema: employees (id (PK), name, surname, sex, salary, info, birth_date, department_id (FK)), departments (id (PK), name, location_id (FK)), locations (id (PK), name).]

[Fig. 33 The “cars” test relational schema: cars (id (PK), owner_id (FK), model_id (FK), colour, year), models (id (PK), make_id (FK), name), makes (id (PK), name).]

Besides the primary keys (with the appropriate unique indices), secondary (non-unique) indices are maintained on the employees.surname, employees.salary, employees.sex, departments.name and locations.name columns. The schemata are automatically wrapped to simple internal object-oriented schemata, where a table corresponds to a complex object, while a column corresponds to its simple subobject with a corresponding primitive data type. The applied naming convention uses the original table and column names prefixed with the $ character (e.g. “employees” results in “$employees”). The complete generation procedure is described in Appendix C (subchapter Relational Schema Wrapping) and any intermediate implementation-dependent steps are skipped in this chapter. These simple object-oriented models are then enveloped with administrator-designed views (regarded as contributory schema views) realising virtual pointers responsible for the primary-foreign key pairs; the views can also control data access (e.g. in the example they disallow updating primary keys and virtual pointers – the commented code blocks). The code of the views used in the example is presented in Listing 2. The views transform the relational schemata into the object-oriented models shown in Fig. 34 (for simplification, the fields corresponding to virtual pointers are not shown, although their on_retrieve procedures are defined).

[Fig. 34 The resulting object-oriented schema: Employee (id, name, surname, sex, salary, info, birthDate) with the worksIn virtual pointer to Department (id, name), which has the isLocatedIn virtual pointer to Location (id, name); Car (id, colour, year) with the isOwnedBy virtual pointer to Employee and the isModel virtual pointer to Model (id, name), which has the isMake virtual pointer to Make (id, name).]

Listing 2 Code of views for the test schemata

view EmployeeDef {
    virtual objects Employee: record { e: employees; }[0..*] { return (employees) as e; }
    on_retrieve: record { id: integer; name: string; surname: string; sex: string; salary: real; info: string; birthDate: date; worksIn: integer; } {
        return (
            deref(e.id) as id, deref(e.name) as name, deref(e.surname) as surname,
            deref(e.sex) as sex, deref(e.salary) as salary, deref(e.info) as info,
            deref(e.birth_date) as birthDate, deref(e.department_id) as worksIn );
    }
    on_delete { delete e; }
    view idDef {
        virtual objects id: record { _id: employees.id; } { return e.id as _id; }
        on_retrieve: integer { return deref(_id); }
        /* do not update the primary key */
    }
    view nameDef {
        virtual objects name: record { _name: employees.name; } { return e.name as _name; }
        on_retrieve: string { return deref(_name); }
        on_update(newName: string) { _name := newName; }
    }
    view surnameDef {
        virtual objects surname: record { _surname: employees.surname; } { return e.surname as _surname; }
        on_retrieve: string { return deref(_surname); }
        on_update(newSurname: string) { _surname := newSurname; }
    }
    view sexDef {
        virtual objects sex: record { _sex: employees.sex; } { return e.sex as _sex; }
        on_retrieve: string { return deref(_sex); }
        on_update(newSex: string) { _sex := newSex; }
    }
    view salaryDef {
        virtual objects salary: record { _salary: employees.salary; } { return e.salary as _salary; }
        on_retrieve: real { return deref(_salary); }
        on_update(newSalary: real) { _salary := newSalary; }
    }
    view infoDef {
        virtual objects info: record { _info: employees.info; } { return e.info as _info; }
        on_retrieve: string { return deref(_info); }
        on_update(newInfo: string) { _info := newInfo; }
    }
    view birthDateDef {
        virtual objects birthDate: record { _birthDate: employees.birth_date; } { return e.birth_date as _birthDate; }
        on_retrieve: date { return deref(_birthDate); }
        on_update(newBirthDate: date) { _birthDate := newBirthDate; }
    }
    view worksInDef {
        virtual objects worksIn: record { _worksIn: employees.department_id; } { return e.department_id as _worksIn; }
        on_retrieve: integer { return deref(_worksIn); }
        /* do not update the virtual pointer */
        on_navigate: Department { return Department where id = _worksIn; }
    }
}

view DepartmentDef {
    virtual objects Department: record { d: departments; }[0..*] { return (departments) as d; }
    on_retrieve: record { id: integer; name: string; isLocatedIn: integer; } {
        return ( deref(d.id) as id, deref(d.name) as name, deref(d.location_id) as isLocatedIn );
    }
    on_delete { delete d; }
    view idDef {
        virtual objects id: record { _id: departments.id; } { return d.id as _id; }
        on_retrieve: integer { return deref(_id); }
        /* do not update the primary key */
    }
    view nameDef {
        virtual objects name: record { _name: departments.name; } { return d.name as _name; }
        on_retrieve: string { return deref(_name); }
        on_update(newName: string) { _name := newName; }
    }
    view isLocatedInDef {
        virtual objects isLocatedIn: record { _isLocatedIn: departments.location_id; } { return d.location_id as _isLocatedIn; }
        on_retrieve: integer { return deref(_isLocatedIn); }
        /* do not update the virtual pointer */
        on_navigate: Location { return Location where id = _isLocatedIn; }
    }
}

view LocationDef {
    virtual objects Location: record { l: locations; }[0..*] { return (locations) as l; }
    on_retrieve: record { id: integer; name: string; } {
        return ( deref(l.id) as id, deref(l.name) as name );
    }
    on_delete { delete l; }
    view idDef {
        virtual objects id: record { _id: locations.id; } { return l.id as _id; }
        on_retrieve: integer { return deref(_id); }
        /* do not update the primary key */
    }
    view nameDef {
        virtual objects name: record { _name: locations.name; } { return l.name as _name; }
        on_retrieve: string { return deref(_name); }
        on_update(newName: string) { _name := newName; }
    }
}

view CarDef {
    virtual objects Car: record { c: cars; }[0..*] { return (cars) as c; }
    on_retrieve: record { id: integer; isOwnedBy: integer; isModel: integer; colour: string; year: integer; } {
        return ( deref(c.id) as id, deref(c.owner_id) as isOwnedBy, deref(c.model_id) as isModel, deref(c.colour) as colour, deref(c.year) as year );
    }
    on_delete { delete c; }
    view idDef {
        virtual objects id: record { _id: cars.id; } { return c.id as _id; }
        on_retrieve: integer { return deref(_id); }
        /* do not update the primary key */
    }
    view isOwnedByDef {
        virtual objects isOwnedBy: record { _isOwnedBy: cars.owner_id; } { return c.owner_id as _isOwnedBy; }
        on_retrieve: integer { return deref(_isOwnedBy); }
        /* do not update the virtual pointer */
        on_navigate: Employee { return Employee where id = _isOwnedBy; }
    }
    view isModelDef {
        virtual objects isModel: record { _isModel: cars.model_id; } { return c.model_id as _isModel; }
        on_retrieve: integer { return deref(_isModel); }
        /* do not update the virtual pointer */
        on_navigate: Model { return Model where id = _isModel; }
    }
    view colourDef {
        virtual objects colour: record { _colour: cars.colour; } { return c.colour as _colour; }
        on_retrieve: string { return deref(_colour); }
        on_update(newColour: string) { _colour := newColour; }
    }
    view yearDef {
        virtual objects year: record { _year: cars.year; } { return c.year as _year; }
        on_retrieve: integer { return deref(_year); }
        on_update(newYear: integer) { _year := newYear; }
    }
}

view ModelDef {
    virtual objects Model: record { m: models; }[0..*] { return (models) as m; }
    on_retrieve: record { id: integer; isMake: integer; name: string; } {
        return ( deref(m.id) as id, deref(m.make_id) as isMake, deref(m.name) as name );
    }
    on_delete { delete m; }
    view idDef {
        virtual objects id: record { _id: models.id; } { return m.id as _id; }
        on_retrieve: integer { return deref(_id); }
        /* do not update the primary key */
    }
    view isMakeDef {
        virtual objects isMake: record { _isMake: models.make_id; } { return m.make_id as _isMake; }
        on_retrieve: integer { return deref(_isMake); }
        /* do not update the virtual pointer */
        on_navigate: Make { return Make where id = _isMake; }
    }
    view nameDef {
        virtual objects name: record { _name: models.name; } { return m.name as _name; }
        on_retrieve: string { return deref(_name); }
        on_update(newName: string) { _name := newName; }
    }
}
view MakeDef {
    virtual objects Make: record { m: makes; }[0..*] { return (makes) as m; }
    on_retrieve: record { id: integer; name: string; } {
        return ( deref(m.id) as id, deref(m.name) as name );
    }
    on_delete { delete m; }
    view idDef {
        virtual objects id: record { _id: makes.id; } { return m.id as _id; }
        on_retrieve: integer { return deref(_id); }
        /* do not update the primary key */
    }
    view nameDef {
        virtual objects name: record { _name: makes.name; } { return m.name as _name; }
        on_retrieve: string { return deref(_name); }
        on_update(newName: string) { _name := newName; }
    }
}

The following examples present the subsequent query forms corresponding to the substantial processing steps performed by the implemented prototype:
• Raw – the syntactically correct ad-hoc query from a client,
• Typechecked – the query after the typological control (dereferences and type casts introduced where necessary), submitted for the view macro-substitution and the query rewriting steps,
• View-rewritten – the query after the view macro-substitution and the query modification, ready for the wrapper analysis and optimisation; this query form refers directly to the lowest level relational names recognisable by the wrapper (relational names prefixed with “$”); this form of the query is re-typechecked,
• Optimised – the wrapper-optimised query form, where the best possible optimisation was performed based on the available relational model,
• Simply-rewritten – the query corresponding to the naive wrapper action.
The simply-rewritten query forms are not shown for imperative queries, as they are inapplicable there. Similarly, for multi-wrapper queries, completely optimised queries are not available due to the current virtual repository limitations (the justification for this wrapper behaviour is given in subsection 6.2.4, prior to the corresponding examples). The visualised syntax trees are provided for simple queries; however, they are skipped for more complex queries due to their excessive sizes.

6.2.2 Selecting Queries

Example 1: Retrieve surnames and names of employees earning more than 1200

Raw:
(Employee where salary > 1200).(surname, name);

Fig. 35 Raw (parsed) query syntax tree for example 1

Typechecked:
((Employee where (deref(salary) > (real)(1200))) . (surname , name))

Fig. 36 Typechecked query syntax tree for example 1

View-rewritten:
(((($employees) as _$employees) as e where (((e . (_$employees . $salary)) . deref(_VALUE)) > (real)(1200))) . (((e . (_$employees . $surname))) as _surname , ((e . (_$employees . $name))) as _name))

Fig. 37 View-rewritten query syntax tree for example 1

Optimised:
execsql("select employees.surname, employees.name from employees where (employees.salary > 1200)", "<0 | | | none | struct <1 $employees | $surname | _surname | none | binder 1> <1 $employees | $name | _name | none | binder 1> 0>", "admin.wrapper1")

Fig. 38 Optimised query syntax tree for example 1

Simply-rewritten:
((((execsql("select employees.info, employees.department_id, employees.surname, employees.salary, employees.id, employees.sex, employees.name, employees.birth_date from employees", "<0 $employees | | | none | ref 0>", "admin.wrapper1")) as _$employees) as e where (((e . (_$employees . $salary)) . deref(_VALUE)) > (real)(1200))) . (((e . (_$employees . $surname))) as _surname , ((e . (_$employees . $name))) as _name))
Based on the query form provided by the view-rewriter, which performs the view macro-substitution and query modification steps (Fig. 37), the wrapper optimiser searches the syntax tree to find patterns (expressions) transformable into SQL-optimiseable subqueries (finally enveloped with execsql expressions). The largest expression in the query is the where (marked with the red ellipse in Fig. 37) with the corresponding syntax tree branch. After analysing this expression for selection conditions, the projected columns are established based on the right-hand side of the root dot expression signature (the comma expression marked with the blue ellipse in Fig. 37, involving $surname and $name corresponding to relational columns). The optimised expression contains a single execsql expression where only the surnames and names of employees (projection) earning more than 1200 (selection) are retrieved, which exactly matches the initial query intention. No processing is required from the virtual repository; the result patterns included in the execsql allow the construction of a valid SBQL result compliant with the primary query signature. In the unoptimised form (Fig. 39), the SQL query retrieves all records and the actual processing (selection and projection) is completely performed by the virtual repository. The wrapper rewriting simply replaces the $employees name expression recognised as a relational table name (the corresponding black ellipses in Fig. 37 and Fig. 39).

Fig. 39 Simply-rewritten query syntax tree for example 1

Example 2: Retrieve first names of employees named Kowalski earning less than 2000

Raw:
(Employee where surname = "Kowalski" and salary < 2000).name;

Fig. 40 Raw (parsed) query syntax tree for example 2

Typechecked:
((Employee where ((deref(surname) = "Kowalski") and (deref(salary) < (real)(2000)))) . name)

Fig. 41 Typechecked query syntax tree for example 2

View-rewritten:
(((($employees) as _$employees) as e where ((((e . (_$employees . $surname)) . deref(_VALUE)) = "Kowalski") and (((e . (_$employees . $salary)) . deref(_VALUE)) < (real)(2000)))) . ((e . (_$employees . $name))) as _name)

Fig. 42 View-rewritten query syntax tree for example 2

Optimised:
execsql("select employees.name from employees where ((employees.surname = 'Kowalski') AND (employees.salary < 2000))", "<0 $employees | $name | _name | none | binder 0>", "admin.wrapper1")

Fig. 43 Optimised query syntax tree for example 2

Simply-rewritten:
((((execsql("select employees.info, employees.department_id, employees.surname, employees.salary, employees.id, employees.sex, employees.name, employees.birth_date from employees", "<0 $employees | | | none | ref 0>", "admin.wrapper1")) as _$employees) as e where ((((e . (_$employees . $surname)) . deref(_VALUE)) = "Kowalski") and (((e . (_$employees . $salary)) . deref(_VALUE)) < (real)(2000)))) . ((e . (_$employees . $name))) as _name)

Similarly to the previous example, the largest optimiseable subquery found is the where expression (the red ellipse in Fig. 42). This expression is analysed for selection conditions; here the complex and condition, mappable directly to the SQL AND operator, is recognised. The projection, searched for by navigating up to the tree root (the dot expression), reveals only a single column denoted by $name (contained within the unary as expression, the blue ellipse). Again, the resulting SQL string carries all query evaluation conditions.
The unoptimised query form (Fig. 44) is developed exactly as in the previous example, by replacing the $employees name expression recognised as a relational table name (the corresponding black ellipses in Fig. 42 and Fig. 44).

Fig. 44 Simply-rewritten query syntax tree for example 2

Example 3: Retrieve surnames of employees and names of their departments

Raw:
(Employee as e join e.worksIn.Department as d).(e.surname, d.name);

Typechecked:
(((Employee) as e join (((e . worksIn) . Department)) as d) . ((e . surname) , (d . name)))

View-rewritten:
((((($employees) as _$employees) as e) as e join (((e . ((e . (_$employees . $department_id))) as _worksIn) . ((($departments) as _$departments) as d where (((d . (_$departments . $id)) . deref(_VALUE)) = (_worksIn . deref(_VALUE)))))) as d) . ((e . ((e . (_$employees . $surname))) as _surname) , (d . ((d . (_$departments . $name))) as _name)))

Optimised:
execsql("select employees.surname, departments.name from employees, departments where (departments.id = employees.department_id)", "<0 | | | none | struct <1 $employees | $surname | _surname | none | binder 1> <1 $departments | $name | _name | none | binder 1> 0>", "admin.wrapper1")

Simply-rewritten:
(((((execsql("select employees.info, employees.department_id, employees.surname, employees.salary, employees.id, employees.sex, employees.name, employees.birth_date from employees", "<0 $employees | | | none | ref 0>", "admin.wrapper1")) as _$employees) as e) as e join (((e . ((e . (_$employees . $department_id))) as _worksIn) . (((execsql("select departments.name, departments.location_id, departments.id from departments", "<0 $departments | | | none | ref 0>", "admin.wrapper1")) as _$departments) as d where (((d . (_$departments . $id)) . deref(_VALUE)) = (_worksIn . deref(_VALUE)))))) as d) . ((e . ((e . (_$employees . $surname))) as _surname) , (d . ((d . (_$departments . $name))) as _name)))

The raw query does not introduce explicit selection conditions; they appear after the view definition macro-substitution and the query modification from the virtual pointers' on_navigate procedures. The largest expression recognised by the wrapper analyser is the join (with the relational names $employees and $departments as arguments). The join conditions are found (corresponding to a primary-foreign key relation in the relational schema), and finally the projections are established. The resulting SQL performs the join and retrieves only the requested column values; no processing is required from the virtual repository. In the case of the unoptimised query, two SQL selects are executed, retrieving all records from the employees and departments tables. The expensive join has to be evaluated by the virtual repository.

Example 4: Retrieve surnames of employees and cities their departments are located in

Raw:
(Employee as e join e.worksIn.Department as d join d.isLocatedIn.Location as l).(e.surname, l.name);

Typechecked:
((((Employee) as e join (((e . worksIn) . Department)) as d) join (((d . isLocatedIn) . Location)) as l) . ((e . surname) , (l . name)))
View-rewritten:
(((((($employees) as _$employees) as e) as e join (((e . ((e . (_$employees . $department_id))) as _worksIn) . ((($departments) as _$departments) as d where (((d . (_$departments . $id)) . deref(_VALUE)) = (_worksIn . deref(_VALUE)))))) as d) join (((d . ((d . (_$departments . $location_id))) as _isLocatedIn) . ((($locations) as _$locations) as l where (((l . (_$locations . $id)) . deref(_VALUE)) = (_isLocatedIn . deref(_VALUE)))))) as l) . ((e . ((e . (_$employees . $surname))) as _surname) , (l . ((l . (_$locations . $name))) as _name)))

Optimised:
execsql("select employees.surname, locations.name from employees, locations, departments where ((departments.id = employees.department_id) AND (locations.id = departments.location_id))", "<0 | | | none | struct <1 $employees | $surname | _surname | none | binder 1> <1 $locations | $name | _name | none | binder 1> 0>", "admin.wrapper1")

Simply-rewritten:
((((((execsql("select employees.info, employees.department_id, employees.surname, employees.salary, employees.id, employees.sex, employees.name, employees.birth_date from employees", "<0 $employees | | | none | ref 0>", "admin.wrapper1")) as _$employees) as e) as e join (((e . ((e . (_$employees . $department_id))) as _worksIn) . (((execsql("select departments.name, departments.location_id, departments.id from departments", "<0 $departments | | | none | ref 0>", "admin.wrapper1")) as _$departments) as d where (((d . (_$departments . $id)) . deref(_VALUE)) = (_worksIn . deref(_VALUE)))))) as d) join (((d . ((d . (_$departments . $location_id))) as _isLocatedIn) . (((execsql("select locations.id, locations.name from locations", "<0 $locations | | | none | ref 0>", "admin.wrapper1")) as _$locations) as l where (((l . (_$locations . $id)) . deref(_VALUE)) = (_isLocatedIn . deref(_VALUE)))))) as l) . ((e . ((e . (_$employees . $surname))) as _surname) , (l . ((l . (_$locations . $name))) as _name)))

Similarly to the previous example, the join conditions appear after the views' definitions are macro-substituted – the analysis procedure is the same, although three relational tables joined over primary-foreign key pairs are recognised. The optimised query executes only one SQL query, evaluating the join and retrieving only the required columns. The unoptimised one retrieves all records from these three tables (three separate SQL queries), and the join operations are evaluated and the projection performed by the virtual repository.

Example 5: Retrieve surnames and birth dates of employees named Kowalski working in the production department

Raw:
(Employee where surname = "Kowalski" and worksIn.Department.name = "Production").(surname, birthDate);

Typechecked:
((Employee where ((deref(surname) = "Kowalski") and (deref(((worksIn . Department) . name)) = "Production"))) . (surname , birthDate))
View-rewritten:
(((($employees) as _$employees) as e where ((((e . (_$employees . $surname)) . deref(_VALUE)) = "Kowalski") and ((((((e . (_$employees . $department_id))) as _worksIn . ((($departments) as _$departments) as d where (((d . (_$departments . $id)) . deref(_VALUE)) = (_worksIn . deref(_VALUE))))) . (d . (_$departments . $name))) . deref(_VALUE)) = "Production"))) . (((e . (_$employees . $surname))) as _surname , ((e . (_$employees . $birth_date))) as _birthDate))

Optimised:
execsql("select employees.surname, employees.birth_date from employees, departments where ((employees.surname = 'Kowalski') AND ((departments.name = 'Production') AND (departments.id = employees.department_id)))", "<0 | | | none | struct <1 $employees | $surname | _surname | none | binder 1> <1 $employees | $birth_date | _birthDate | none | binder 1> 0>", "admin.wrapper1")

Simply-rewritten:
((((execsql("select employees.info, employees.department_id, employees.surname, employees.salary, employees.id, employees.sex, employees.name, employees.birth_date from employees", "<0 $employees | | | none | ref 0>", "admin.wrapper1")) as _$employees) as e where ((((e . (_$employees . $surname)) . deref(_VALUE)) = "Kowalski") and ((((((e . (_$employees . $department_id))) as _worksIn . (((execsql("select departments.name, departments.location_id, departments.id from departments", "<0 $departments | | | none | ref 0>", "admin.wrapper1")) as _$departments) as d where (((d . (_$departments . $id)) . deref(_VALUE)) = (_worksIn . deref(_VALUE))))) . (d . (_$departments . $name))) . deref(_VALUE)) = "Production"))) . (((e . (_$employees . $surname))) as _surname , ((e . (_$employees . $birth_date))) as _birthDate))

The raw query contains two explicit selection conditions; the next one, corresponding to the primary-foreign key relationship, arises from the macro-substituted on_navigate procedure of the worksIn virtual pointer. The relational tables are detected, and the selection conditions and projected columns are established. The optimised query executes a single SQL query exactly matching the original intention, while the unoptimised one again requires evaluation by the virtual repository mechanisms.

Example 6: Retrieve surnames and birth dates of employees named Kowalski working in Łódź city

Raw:
(Employee where surname = "Kowalski" and worksIn.Department.isLocatedIn.Location.name = "Łódź").(surname, birthDate);

Typechecked:
((Employee where ((deref(surname) = "Kowalski") and (deref(((((worksIn . Department) . isLocatedIn) . Location) . name)) = "Łódź"))) . (surname , birthDate))
Example 6: Retrieve surnames and birth dates of employees named Kowalski working in Łódź city
Raw:
(Employee where surname = "Kowalski" and worksIn.Department.isLocatedIn.Location.name = "Łódź").(surname, birthDate);
Typechecked:
((Employee where ((deref(surname) = "Kowalski") and (deref(((((worksIn . Department) . isLocatedIn) . Location) . name)) = "Łódź"))) . (surname , birthDate))
View-rewritten:
(((($employees) as _$employees) as e where ((((e . (_$employees . $surname)) . deref(_VALUE)) = "Kowalski") and ((((((((e . (_$employees . $department_id))) as _worksIn . ((($departments) as _$departments) as d where (((d . (_$departments . $id)) . deref(_VALUE)) = (_worksIn . deref(_VALUE))))) . ((d . (_$departments . $location_id))) as _isLocatedIn) . ((($locations) as _$locations) as l where (((l . (_$locations . $id)) . deref(_VALUE)) = (_isLocatedIn . deref(_VALUE))))) . (l . (_$locations . $name))) . deref(_VALUE)) = "Łódź"))) . (((e . (_$employees . $surname))) as _surname , ((e . (_$employees . $birth_date))) as _birthDate))
Optimised:
execsql("select employees.surname, employees.birth_date from employees, locations, departments where ((employees.surname = 'Kowalski') AND ((locations.name = 'Łódź') AND ((departments.id = employees.department_id) AND (locations.id = departments.location_id))))", "<0 | | | none | struct <1 $employees | $surname | _surname | none | binder 1> <1 $employees | $birth_date | _birthDate | none | binder 1> 0>", "admin.wrapper1")
Simply-rewritten:
((((execsql("select employees.info, employees.department_id, employees.surname, employees.salary, employees.id, employees.sex, employees.name, employees.birth_date from employees", "<0 $employees | | | none | ref 0>", "admin.wrapper1")) as _$employees) as e where ((((e . (_$employees . $surname)) . deref(_VALUE)) = "Kowalski") and ((((((((e . (_$employees . $department_id))) as _worksIn . (((execsql("select departments.name, departments.location_id, departments.id from departments", "<0 $departments | | | none | ref 0>", "admin.wrapper1")) as _$departments) as d where (((d . (_$departments . $id)) . deref(_VALUE)) = (_worksIn . deref(_VALUE))))) . ((d . (_$departments . $location_id))) as _isLocatedIn) . (((execsql("select locations.id, locations.name from locations", "<0 $locations | | | none | ref 0>", "admin.wrapper1")) as _$locations) as l where (((l . (_$locations . $id)) . deref(_VALUE)) = (_isLocatedIn . deref(_VALUE))))) . (l . (_$locations . $name))) . deref(_VALUE)) = "Łódź"))) . (((e . (_$employees . $surname))) as _surname , ((e . (_$employees . $birth_date))) as _birthDate))
The navigation expressed in the raw query with virtual pointers reveals two additional selection conditions (primary-foreign key pairs) after the pointers' on_navigate procedures are macro-substituted. These conditions are recognised together with the explicit one, and the optimised query relies on a single SQL query exactly matching the original semantics.
Example 7: Retrieve the sum of salaries of employees named Kowalski working in Łódź city
Raw:
sum((Employee where surname = "Kowalski" and worksIn.Department.isLocatedIn.Location.name = "Łódź").salary);
Typechecked:
sum(deref(((Employee where ((deref(surname) = "Kowalski") and (deref(((((worksIn . Department) . isLocatedIn) . Location) . name)) = "Łódź"))) . salary)))
View-rewritten:
sum(((((($employees) as _$employees) as e where ((((e . (_$employees . $surname)) . deref(_VALUE)) = "Kowalski") and ((((((((e . (_$employees . $department_id))) as _worksIn . ((($departments) as _$departments) as d where (((d . (_$departments . $id)) . deref(_VALUE)) = (_worksIn . deref(_VALUE))))) . ((d . (_$departments . $location_id))) as _isLocatedIn) . ((($locations) as _$locations) as l where (((l . (_$locations . $id)) . deref(_VALUE)) = (_isLocatedIn . deref(_VALUE))))) . (l . (_$locations . $name))) . deref(_VALUE)) = "Łódź"))) . (e . (_$employees . $salary))) . deref(_VALUE)))
Optimised:
execsql("select sum(employees.salary) from employees, locations, departments where ((employees.surname = 'Kowalski') AND ((locations.name = 'Łódź') AND ((departments.id = employees.department_id) AND (locations.id = departments.location_id))))", "<0 $employees | $salary | | real | value 0>", "admin.wrapper1")
Simply-rewritten:
sum((((((execsql("select employees.info, employees.department_id, employees.surname, employees.salary, employees.id, employees.sex, employees.name, employees.birth_date from employees", "<0 $employees | | | none | ref 0>", "admin.wrapper1")) as _$employees) as e where ((((e . (_$employees . $surname)) . deref(_VALUE)) = "Kowalski") and ((((((((e . (_$employees . $department_id))) as _worksIn . (((execsql("select departments.name, departments.location_id, departments.id from departments", "<0 $departments | | | none | ref 0>", "admin.wrapper1")) as _$departments) as d where (((d . (_$departments . $id)) . deref(_VALUE)) = (_worksIn . deref(_VALUE))))) . ((d . (_$departments . $location_id))) as _isLocatedIn) . (((execsql("select locations.id, locations.name from locations", "<0 $locations | | | none | ref 0>", "admin.wrapper1")) as _$locations) as l where (((l . (_$locations . $id)) . deref(_VALUE)) = (_isLocatedIn . deref(_VALUE))))) . (l . (_$locations . $name))) . deref(_VALUE)) = "Łódź"))) . (e . (_$employees . $salary))) . deref(_VALUE)))
The analysis procedure is similar to the previous example, although the largest transformable expression found is the sum aggregate function. Again, the selection conditions (the explicit one and the implicit ones introduced by the macro-substituted on_navigate procedures) are expressed in a single SQL query that evaluates the aggregate function in the wrapped resource environment.
6.2.3 Imperative Constructs
Example 1: Delete employees named Nowak
Raw:
delete Employee where surname = "Nowak";
Fig. 45 Raw (parsed) query syntax tree for example 1 (imperative query)
Typechecked:
delete((Employee where (deref(surname) = "Nowak")))
Fig. 46 Typechecked query syntax tree for example 1 (imperative query)
View-rewritten:
delete(((((($employees) as _$employees) as e where (((e . (_$employees . $surname)) . deref(_VALUE)) = "Nowak")) . e) . _$employees))
Fig. 47 View-rewritten query syntax tree for example 1 (imperative query)
Optimised:
execsql("delete from employees where (employees.surname = 'Nowak')", "", "admin.wrapper1")
Fig. 48 Optimised query syntax tree for example 1 (imperative query)
As shown in the analysis procedure described in subsection 6.1.2, processing deleting queries is similar to processing selecting ones, but only selections are detected (projections are not applicable).
Example 2: Set salaries of all employees to 1000
Raw:
Employee.salary := 1000;
Fig. 49 Raw (parsed) query syntax tree for example 2 (imperative query)
Typechecked:
((Employee . salary) := (real)(1000))
Fig. 50 Typechecked query syntax tree for example 2 (imperative query)
View-rewritten:
(((($employees) as _$employees) as e . ((e . (_$employees . $salary))) as _salary) . (((real)(1000)) as newSalary . (_salary . ((newSalary) as new$salary . (_VALUE := new$salary)))))
Fig. 51 View-rewritten query syntax tree for example 2 (imperative query)
Optimised:
if ((execsql("select COUNT(*) from employees", "", "admin.wrapper1") = 1)) then (execsql("update employees set salary=1000", "", "admin.wrapper1"))
Fig. 52 Optimised query syntax tree for example 2 (imperative query)
Processing updating queries (subsection 6.1.3) requires introducing an additional check corresponding to the microscopic assignment operator in SBQL. Therefore, based on the common selection conditions, the number of rows to be modified is checked first (the red ellipse in Fig. 52), and then the actual update is executed conditionally (the blue ellipse).
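The same two-step pattern can be mirrored at the JDBC level. The fragment below is a minimal illustrative sketch assuming a plain PostgreSQL connection (the URL and credentials are invented); it is not the ODRA wrapper implementation, and it merely reproduces the COUNT-then-update shape of the optimised query above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch of the guarded update: a check query built from the common
// selection conditions is evaluated first, and the UPDATE is executed only
// when the check succeeds (the guard follows the COUNT(*) = 1 shape of the
// optimised query above). The connection data is an assumption.
public class GuardedUpdateSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/wrapper1", "admin", "admin");
             Statement st = con.createStatement()) {
            ResultSet rs = st.executeQuery("select COUNT(*) from employees");
            rs.next();
            long rows = rs.getLong(1);
            rs.close();
            if (rows == 1) {
                st.executeUpdate("update employees set salary=1000");
            }
        }
    }
}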
Example 3: Set salaries of all employees who earn less than 1000 to 1200
Raw:
(Employee where salary < 1000).salary := 1200;
Fig. 53 Raw (parsed) query syntax tree for example 3 (imperative query)
Typechecked:
(((Employee where (deref(salary) < (real)(1000))) . salary) := (real)(1200))
Fig. 54 Typechecked query syntax tree for example 3 (imperative query)
View-rewritten:
((((($employees) as _$employees) as e where (((e . (_$employees . $salary)) . deref(_VALUE)) < (real)(1000))) . ((e . (_$employees . $salary))) as _salary) . (((real)(1200)) as newSalary . (_salary . ((newSalary) as new$salary . (_VALUE := new$salary)))))
Fig. 55 View-rewritten query syntax tree for example 3 (imperative query)
Optimised:
if ((execsql("select COUNT(*) from employees where (employees.salary < 1000)", "", "admin.wrapper1") = 1)) then (execsql("update employees set salary=1200 where (employees.salary < 1000)", "", "admin.wrapper1"))
Fig. 56 Optimised query syntax tree for example 3 (imperative query)
The query presented in this example introduces an explicit selection condition. This condition is reflected both in the check query (the count, the red ellipse in Fig. 56) and in the actual update (the blue ellipse).
6.2.4 Multi-Wrapper and Mixed Queries
Multi-wrapper queries are queries invoking more than one wrapper instance; mixed queries, on the other hand, combine "relational" objects and expressions with pure object-oriented ones. Such queries are unlikely to occur in the assumed wrapper application and the virtual repository architecture, since any query entering the wrapper should refer only to that wrapper. If a subquery does not refer to the target wrapper, it must be evaluated separately and its result substituted into the wrapper query for local rewriting and evaluation; the partial results are then returned and stack-processed into the final result. The implementation nevertheless allows executing such queries, although full wrapper optimisation is not always performed – sometimes only simple rewriting. Of course, this does not exclude the application of native SBQL optimisers, e.g., the methods of independent subqueries, which also much improve the overall performance (unfortunately, more operations must then be executed by the virtual repository instead of the wrapped resources' engines). Multi-wrapper and mixed queries were tested with both test schemata (subchapter 6.2.1), where employees' and cars' data are stored in separate relational databases and integrated into the virtual repository by separate wrappers. These schemata are logically connected by columns of the employees and cars tables (this information should be used by the global administrator/designer when creating the global view and the integration rules, which is reflected in the sample views in Listing 2).
In the optimal environment, multi-wrapper queries should be processed in a single-wrapper context, similarly to mixed ones. This means that global multi-wrapper queries should be decomposed by the virtual repository mechanisms and sent to the appropriate wrappers. This ensures that relational names corresponding to the local wrapper are recognised correctly, while the remaining ones (evaluated by other wrappers) are regarded as external. Such queries can then be transformed like mixed ones, i.e. expressions evaluated by non-local wrappers can simply be used for parametrising SQL query strings.
Example 1: Retrieve all employees with ID value equal to the variable idTest
Raw:
Employee where id = idTest;
Fig. 57 Raw (parsed) query syntax tree for example 1 (mixed query)
Typechecked:
(Employee where (deref(id) = deref(idTest)))
Fig. 58 Typechecked query syntax tree for example 1 (mixed query)
View-rewritten:
((($employees) as _$employees) as e where (((e . (_$employees . $id)) . deref(_VALUE)) = deref(idTest)))
Fig. 59 View-rewritten query syntax tree for example 1 (mixed query)
Optimised:
execsql((("select employees.info, employees.department_id, employees.surname, employees.salary, employees.id, employees.sex, employees.name, employees.birth_date from employees where (employees.id = '" + (string)(deref(idTest))) + "')"), "<0 $employees | | e | none | binder 0>", "admin.wrapper1")
Fig. 60 Optimised query syntax tree for example 1 (mixed query)
The SQL query string is evaluated as a concatenation of constant SQL substrings with the pure SBQL expression (string)(deref(idTest)). The final (evaluated) form is executed by the wrapper.
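The concatenation visible in the optimised form above can be sketched as follows. evaluateSbql() and execsql() below are hypothetical stand-ins (introduced only for this illustration) for the repository-side evaluation of the pure SBQL subexpression and for the wrapper call, respectively.

// Sketch of the mixed-query pattern: constant SQL substrings are concatenated
// with the value of an SBQL expression evaluated outside the wrapper.
// Both helper methods are stubs, not ODRA APIs.
public class MixedQuerySketch {
    static String evaluateSbql(String sbql) { return "1"; } // stubbed deref(idTest)

    static void execsql(String sql, String meta, String wrapper) {
        System.out.println(wrapper + ": " + sql); // stand-in for wrapper execution
    }

    public static void main(String[] args) {
        String idTest = evaluateSbql("(string)(deref(idTest))");
        String sql = "select employees.info, employees.department_id, employees.surname,"
                + " employees.salary, employees.id, employees.sex, employees.name,"
                + " employees.birth_date from employees"
                + " where (employees.id = '" + idTest + "')";
        execsql(sql, "<0 $employees | | e | none | binder 0>", "admin.wrapper1");
    }
}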
(((execsql("select makes.id, makes.name from makes", "<0 $makes | | | none | ref 0>", "admin.wrapper2")) as _$makes) as m where (((m . (_$makes . $id)) . deref(_VALUE)) = (_isMake . deref(_VALUE)))))) as mm) . ((((((mm . (m . (_$makes . $name))) . deref(_VALUE)) + " ") + ((m . (m . (_$models . $name))) . deref(_VALUE))) + " ") + (string)(((c . (c . (_$cars . $year))) . deref(_VALUE))))) Queries presented in examples 3 and 4 unfortunately cannot be completely optimised as no single-wrapper context can be established and the names used are not recognised correctly. The query decomposition issues are to be performed by the virtual repository so that appropriate subqueries can be analysed by corresponding wrappers and the methods similar to mixed queries applied. Nevertheless, this situation is not hopeless as the SBQL optimisation can be applied, as shown in the following subsection. Page 143 of 235 Chapter 6 Query Analysis and Processing 6.2.5 SBQL Optimisation over Multi-Wrapper Queries The examples of multi-wrapper queries shown in subchapter 6.2.4 are simply rewritten only since in the current ODRA implementation the mechanism responsible for query decomposition and distributed processing are still under development (simulating decomposition mechanism is useless unless the mechanisms are verified by real-life tests). However, in such cases SBQL optimisers still hold and can much improve the overall performance. The following examples refer to the multi-wrapper schema (employees and cars) and they invoke extremely expensive joins, therefore methods of independent subqueries are effectively employed. The SBQL code is simplified as view rewriting and wrapper rewriting steps are not shown, so that only wrapper-related names from views exist (in the actual processing these names are macro-substituted with their view definition and then wrapper rewriting is applied). The results of the SBQL optimisation over simply-rewritten wrapper queries are presented in subchapter 7.3 Application of SBQL optimisers. Page 144 of 235 Chapter 6 Query Analysis and Processing Example 1: Retrieve cars owned by employees named Nowak Raw: (Car as c where c.isOwnedBy in (Employee as e where e.surname = "Nowak").(e.id)).c; Fig. 65 Raw query syntax tree for example 1 (multi-wrapper query) Typechecked: (((Car) as c where (deref((c . isOwnedBy)) in deref((((Employee) as e where (deref((e . surname)) = "Nowak")) . (e . id))))) . c) Page 145 of 235 Chapter 6 Query Analysis and Processing Fig. 66 Typechecked query syntax tree for example 1 (multi-wrapper query) SBQL-optimised: (((deref((((Employee) as e where (deref((e . surname)) = "Nowak")) . (e . id)))) groupas $aux0 . ((Car) as c where (deref((c . isOwnedBy)) in $aux0))) . c) Page 146 of 235 Chapter 6 Query Analysis and Processing Fig. 67 SBQL-optimised query syntax tree for example 1 (multi-wrapper query) The subquery recognised as an independent one within the type-checked query (the red ellipse in Fig. 66) is pushed out of the rest of the query (the red ellipse in Fig. 67). The similar operation is illustrated in the next (a little more complex) example. Page 147 of 235 Chapter 6 Query Analysis and Processing Example 2: Retrieve a string composed of a make name, a model name and a car production year for employees whose surname is Nowak Raw: (((Car as c where c.isOwnedBy in (Employee as e where e.surname = "Nowak").(e.id)) join c.isModel.Model as m) join m.isMake.Make as mm).(mm.name + " " + m.name + " " + c.year); Fig. 
6.2.5 SBQL Optimisation over Multi-Wrapper Queries
The examples of multi-wrapper queries shown in subchapter 6.2.4 are only simply rewritten, since in the current ODRA implementation the mechanisms responsible for query decomposition and distributed processing are still under development (simulating a decomposition mechanism is useless unless the mechanisms are verified by real-life tests). However, in such cases SBQL optimisers still apply and can much improve the overall performance. The following examples refer to the multi-wrapper schema (employees and cars) and invoke extremely expensive joins; therefore the methods of independent subqueries are effectively employed. The SBQL code is simplified, as the view rewriting and wrapper rewriting steps are not shown, so that only wrapper-related names from the views appear (in the actual processing these names are macro-substituted with their view definitions and then wrapper rewriting is applied). The results of SBQL optimisation over simply-rewritten wrapper queries are presented in subchapter 7.3, Application of SBQL optimisers.
Example 1: Retrieve cars owned by employees named Nowak
Raw:
(Car as c where c.isOwnedBy in (Employee as e where e.surname = "Nowak").(e.id)).c;
Fig. 65 Raw query syntax tree for example 1 (multi-wrapper query)
Typechecked:
(((Car) as c where (deref((c . isOwnedBy)) in deref((((Employee) as e where (deref((e . surname)) = "Nowak")) . (e . id))))) . c)
Fig. 66 Typechecked query syntax tree for example 1 (multi-wrapper query)
SBQL-optimised:
(((deref((((Employee) as e where (deref((e . surname)) = "Nowak")) . (e . id)))) groupas $aux0 . ((Car) as c where (deref((c . isOwnedBy)) in $aux0))) . c)
Fig. 67 SBQL-optimised query syntax tree for example 1 (multi-wrapper query)
The subquery recognised as an independent one within the type-checked query (the red ellipse in Fig. 66) is pushed out of the rest of the query (the red ellipse in Fig. 67). A similar operation is illustrated in the next (slightly more complex) example.
Example 2: Retrieve a string composed of a make name, a model name and a car production year for employees whose surname is Nowak
Raw:
(((Car as c where c.isOwnedBy in (Employee as e where e.surname = "Nowak").(e.id)) join c.isModel.Model as m) join m.isMake.Make as mm).(mm.name + " " + m.name + " " + c.year);
Fig. 68 Raw query syntax tree for example 2 (multi-wrapper query)
Typechecked:
(((((Car) as c where (deref((c . isOwnedBy)) in deref((((Employee) as e where (deref((e . surname)) = "Nowak")) . (e . id))))) join (((c . isModel) . Model)) as m) join (((m . isMake) . Make)) as mm) . ((((deref((mm . name)) + " ") + deref((m . name))) + " ") + (string)(deref((c . year)))))
Fig. 69 Typechecked query syntax tree for example 2 (multi-wrapper query)
SBQL-optimised:
(((((deref((((Employee) as e where (deref((e . surname)) = "Nowak")) . (e . id)))) groupas $aux0 . ((Car) as c where (deref((c . isOwnedBy)) in $aux0))) join (((c . isModel) . Model)) as m) join (((m . isMake) . Make)) as mm) . ((((deref((mm . name)) + " ") + deref((m . name))) + " ") + (string)(deref((c . year)))))
Fig. 70 SBQL-optimised query syntax tree for example 2 (multi-wrapper query)
6.3 Sample Use Cases
The simplest wrapper usage is presented in the above examples, where only basic views realising the relational models are applied. Since the main application of the wrapper is to contribute to the virtual repository, the wrapped relational schema must be transparently accessible in the global schema (the global view) according to the virtual repository model via a contributory view (or a set of cascaded views responsible for the integration and global schemata, Fig. 18). In this subchapter some possible views referring to the test schemata are described (query optimisation procedures are skipped due to the complexity of the corresponding examples; nevertheless, the presented rules and procedures still hold). The views below rely directly on the views shown in Listing 2, which corresponds to the schema translation within the virtual repository.
6.3.1 Rich Employees
The following SBQL view presents rich employees, i.e. employees who earn more than 2000:
Listing 3 SBQL view code for retrieving "rich employees"
view RichEmployeeDef {
  virtual objects RichEmployee: record { e: Employee; }[0..*] {
    return (Employee where salary > 2000) as e;
  }
  on_retrieve: record { fullname: string; sex: string; salary: real; } {
    return ((deref(e.name) + " " + deref(e.surname)) as fullname, deref(e.sex) as sex, deref(e.salary) as salary);
  }
  view fullnameDef {
    virtual objects fullname: record { _fullname: string; } {
      return (deref(e.name) + " " + deref(e.surname)) as _fullname;
    }
    on_retrieve: string { return _fullname; }
  }
  view sexDef {
    virtual objects sex: record { _sex: Employee.sex; } {
      return e.sex as _sex;
    }
    on_retrieve: string { return deref(_sex); }
  }
  view salaryDef {
    virtual objects salary: record { _salary: Employee.salary; } {
      return e.salary as _salary;
    }
    on_retrieve: real { return deref(_salary); }
  }
}
Queries issued to this view could be:
1. Retrieve the number of rich employees: count(RichEmployee);
2. Retrieve the minimum salary of rich employees: min(RichEmployee.salary);
3. Retrieve full names of rich employees earning 5000: (RichEmployee where salary = 5000).fullname;
4. Retrieve the sum of salaries of rich employees named Jan Kowalski (footnote 35): sum((RichEmployee where fullname = "Jan Kowalski").salary);
Footnote 35: Please notice that query 4 cannot be optimised by the wrapper, since its selection condition is evaluated over a complex expression (string concatenation), an operation not available in standard SQL.
6.3.2 Employees with Departments
This view presents selected employees' data with the names of the departments they work in (the join over the virtual pointer is used):
Listing 4 SBQL view code for retrieving employees with their departments
view EmployeeDepartmentDef {
  virtual objects EmployeeDepartment: record { e: Employee; d: Department; }[0..*] {
    return Employee as e join e.worksIn.Department as d;
  }
  on_retrieve: record { fullname: string; salary: real; department: string; } {
    return ((deref(e.name) + " " + deref(e.surname)) as fullname, deref(e.salary) as salary, deref(d.name) as department);
  }
  view fullnameDef {
    virtual objects fullname: record { _fullname: string; } {
      return (deref(e.name) + " " + deref(e.surname)) as _fullname;
    }
    on_retrieve: string { return _fullname; }
  }
  view salaryDef {
    virtual objects salary: record { _salary: real; } {
      return deref(e.salary) as _salary;
    }
    on_retrieve: real { return _salary; }
  }
  view departmentDef {
    virtual objects department: record { _department: string; } {
      return deref(d.name) as _department;
    }
    on_retrieve: string { return deref(d.name); }
  }
}
Queries issued to this view could be:
1. Retrieve a string composed of the employee's full name and the department's name: EmployeeDepartment.(fullname + "/" + department);
2. Retrieve the minimum salary of employees working in the production department: min((EmployeeDepartment where department = "Production").salary);
6.3.3 Employees with Cars
This view presents selected employees with their cars' data (joins over virtual pointers are applied; the data is retrieved from separate relational schemata represented by different wrappers):
Listing 5 SBQL view code for retrieving employees with their cars
view EmployeeCarDef {
  virtual objects EmployeeCar: record { e: Employee; c: Car; ma: Make; mo: Model; }[0..*] {
    return Car as c join c.isOwnedBy.Employee as e join c.isModel.Model as mo join mo.isMake.Make as ma;
  }
  on_retrieve: record { fullname: string; salary: real; make: string; model: string; colour: string; year: integer; } {
    return ((deref(e.name) + " " + deref(e.surname)) as fullname, deref(e.salary) as salary, deref(ma.name) as make, deref(mo.name) as model, deref(c.colour) as colour, deref(c.year) as year);
  }
  view fullnameDef {
    virtual objects fullname: record { _fullname: string; } {
      return (deref(e.name) + " " + deref(e.surname)) as _fullname;
    }
    on_retrieve: string { return _fullname; }
  }
  view salaryDef {
    virtual objects salary: record { _salary: real; } {
      return deref(e.salary) as _salary;
    }
    on_retrieve: real { return _salary; }
  }
  view makeDef {
    virtual objects make: record { _make: string; } {
      return deref(ma.name) as _make;
    }
    on_retrieve: string { return _make; }
  }
  view modelDef {
    virtual objects model: record { _model: string; } {
      return deref(mo.name) as _model;
    }
    on_retrieve: string { return _model; }
  }
  view colourDef {
    virtual objects colour: record { _colour: string; } {
      return deref(c.colour) as _colour;
    }
    on_retrieve: string { return _colour; }
  }
  view yearDef {
    virtual objects year: record { _year: integer; } {
      return deref(c.year) as _year;
    }
    on_retrieve: integer { return _year; }
  }
}
Queries issued to this view could be:
1. Retrieve full names of employees earning more than 2000 with the colours of their cars: (EmployeeCar where salary > 2000).(fullname, colour);
2. Retrieve full names and salaries of employees driving Hondas produced in 2007: (EmployeeCar where make = "Honda" and year = "2007").(fullname, salary);
6.3.4 Rich Employees with White Cars
This view presents selected data of rich employees owning white cars together with selected cars' data:
Listing 6 SBQL view code for retrieving rich employees with white cars
view RichEmployeeWhiteCarDef {
  virtual objects RichEmployeeWhiteCar: record { e: Employee; c: Car; ma: Make; mo: Model; }[0..*] {
    return (Car where colour = "white") as c join (c.isOwnedBy.Employee where salary > 2000) as e join c.isModel.Model as mo join mo.isMake.Make as ma;
  }
  on_retrieve: record { fullname: string; salary: real; make: string; model: string; year: integer; } {
    return ((deref(e.name) + " " + deref(e.surname)) as fullname, deref(e.salary) as salary, deref(ma.name) as make, deref(mo.name) as model, deref(c.year) as year);
  }
  view fullnameDef {
    virtual objects fullname: record { _fullname: string; } {
      return (deref(e.name) + " " + deref(e.surname)) as _fullname;
    }
    on_retrieve: string { return _fullname; }
  }
  view salaryDef {
    virtual objects salary: record { _salary: real; } {
      return deref(e.salary) as _salary;
    }
    on_retrieve: real { return _salary; }
  }
  view makeDef {
    virtual objects make: record { _make: string; } {
      return deref(ma.name) as _make;
    }
    on_retrieve: string { return _make; }
  }
  view modelDef {
    virtual objects model: record { _model: string; } {
      return deref(mo.name) as _model;
    }
    on_retrieve: string { return _model; }
  }
  view yearDef {
    virtual objects year: record { _year: integer; } {
      return deref(c.year) as _year;
    }
    on_retrieve: integer { return _year; }
  }
}
Queries issued to this view could be:
1. Retrieve the minimum salary of employees driving Hondas produced after 2005: min((RichEmployeeWhiteCar where make = "Honda" and year > 2005).salary);
2. Retrieve the count of white Honda Civic cars owned by rich employees: count(RichEmployeeWhiteCar where make = "Honda" and model = "Civic");
Chapter 7 Wrapper Optimisation Results
The wrapper optimisation results are averages collected over 10 subsequent measurements performed on the test schemata (subchapter 6.2.1) populated with random data with the distributions presented below. The measurements were taken in the following environment (a single machine):
Table 1 Optimisation testbench configuration
Property                   Value
Processor                  Intel Mobile Core 2 Duo T7400, 2.16 GHz, 4 MB cache
RAM                        2048 MB
HDD                        200 GB, 5400 rpm
OS                         MS Windows XP SP2, 32 bit
JVM                        Sun JRE SE 1.6.0_02-b06
ODRA server Java heap      1024 MB
Wrapper server Java heap   512 MB
RDBMS                      PostgreSQL 8.2
Due to the relatively high memory consumption on the ODRA side (both for the object store and for the stack-based processing), the tests were unfortunately limited to a maximum population of 1000 employees – greater values require disc memory swapping, which causes misleading results. Since the measurements were taken on a single machine, network throughput and traffic delays are not considered (only local loopback communication); however, in real-life applications they can seriously affect the overall query evaluation time, especially when huge amounts of data are transferred (e.g., in the case of unoptimised queries). The results may differ for various data distributions; they become more repeatable as the number of employee records increases and the actual data distribution approaches the assumed one. The results can also depend on the RDBMS used.
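The 10-run averaging procedure mentioned above can be sketched as the following micro-benchmark loop; runQuery() is a placeholder for issuing an SBQL query against the ODRA server, and the sketch only illustrates how the averages were taken.

// Sketch of the measurement loop: each query is executed 10 times and the
// mean wall-clock time is reported. runQuery() is a placeholder, not an
// actual ODRA client API.
public class BenchmarkSketch {
    static void runQuery(String sbql) { /* placeholder for the ODRA client call */ }

    public static void main(String[] args) {
        final int runs = 10;
        long total = 0;
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();
            runQuery("(Employee where salary > 1200).surname;");
            total += System.nanoTime() - t0;
        }
        System.out.printf("avg. time: %.2f ms%n", total / (runs * 1e6));
    }
}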
7.1 Relational Test Data
The data distribution in the "employees" schema is presented in the following figures. Employees' birth dates (not shown below) are distributed randomly between 1950-01-01 and 1987-12-31, and sexes are equally distributed. The employees.info column contains random text (lorem ipsum) between 200 bytes and 20 kilobytes long.
Fig. 71 Department's location distribution (Warszawa, Łódź, Kraków, Wrocław, Poznań, Gdańsk, Szczecin)
Fig. 72 Employee's department distribution
Fig. 73 Employee's salary distribution (values: 5000, 2000, 1500, 1200, 800, 500)
Fig. 74 Employee's info's length distribution (from 200 B to 20 kB)
Fig. 75 Female employee's first name distribution
Fig. 76 Female employee's surname distribution
Fig. 77 Male employee's first name distribution
Fig. 78 Male employee's surname distribution
In the "cars" schema, makes are randomly selected values from the following set: Audi, Ford, Honda, Mitsubishi, Toyota, and Volkswagen. For each make a random model is selected (the list of models is shown in the table below). A production year is a random value between 1990 and 2007.
Table 2 Test data for cars
Make         Models
Audi         A3, A4, A6, A08, TT
Ford         Escort, Focus, GT, Fusion, Galaxy
Honda        Accord, Civic, Prelude
Mitsubishi   3000GT, Diamante, Eclipse, Galant
Toyota       Camry, Celica, Corolla, RAV4, Yaris
Volkswagen   Golf, Jetta, Passat
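Populating the schemata with such data boils down to sampling discrete probability tables; the sketch below illustrates the idea for the salary distribution of Fig. 73 (the probability values are illustrative assumptions, not the exact generator used for the tests).

import java.util.Random;

// Illustrative sampler for a discrete distribution such as the salary
// distribution of Fig. 73. The probabilities are assumed example values.
public class DistributionSketch {
    static final double[] PROB = {0.10, 0.15, 0.20, 0.25, 0.20, 0.10};
    static final int[] SALARY = {5000, 2000, 1500, 1200, 800, 500};

    static int sampleSalary(Random rnd) {
        double u = rnd.nextDouble(), acc = 0.0;
        for (int i = 0; i < PROB.length; i++) {
            acc += PROB[i];
            if (u < acc) return SALARY[i];
        }
        return SALARY[SALARY.length - 1]; // guard against rounding error
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int i = 0; i < 5; i++) System.out.println(sampleSalary(rnd));
    }
}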
7.2 Optimisation vs. Simple Rewriting
Below, a set of representative queries is presented with their raw SBQL forms and the corresponding SQL query strings embedded in the optimised queries (the intermediate transformations are described with examples in the previous chapter). For each query, a plot is given of the average unoptimised and average optimised execution times for 10, 50, 100, 500 and 1000 employee records, together with the optimisation gain (the ratio of the evaluation times). The reference (unoptimised) queries are realised with the naive wrapper approach, i.e. all processing is performed by the virtual repository on materialised data retrieved from the wrapped relational databases. Short additional comments explain the optimisation gain for particular queries where needed. Please notice that the following optimisation results do not rely on referring to infrequent column values and the resulting unnaturally enforced reduction of the amount of data transported and materialised (e.g., "Nowak" is the most probable surname value regardless of an employee's sex).
Query 1: Retrieve surnames of all employees
SBQL: Employee.surname;
SQL: select employees.surname from employees
Fig. 79 Evaluation times and optimisation gain for query 1
The optimisation gain for this simple query is affected by the simple projection applied – no undesired data is transported and materialised.
Query 2: Retrieve employees earning more than 1200
SBQL: Employee where salary > 1200;
SQL: select employees.info, employees.department_id, employees.surname, employees.salary, employees.id, employees.sex, employees.name, employees.birth_date from employees where (employees.salary > 1200)
Fig. 80 Evaluation times and optimisation gain for query 2
In this query the selection is performed over the indexed salary column. However, no projection is applied – all columns of the selected rows are retrieved and materialised, which matches the original query semantics.
Query 3: Retrieve surnames of employees earning more than 1200
SBQL: (Employee where salary > 1200).surname;
SQL: select employees.surname from employees where (employees.salary > 1200)
Fig. 81 Evaluation times and optimisation gain for query 3
Query 3 introduces the projection applied on top of the selection of the previous case. This increases the gain about 10 times (for 1000 employees) compared to query 2; a similar projection was applied in query 1 with a corresponding gain value.
Query 4: Retrieve employees whose surname is Kowalski
SBQL: Employee where surname = "Kowalski";
SQL: select employees.info, employees.department_id, employees.surname, employees.salary, employees.id, employees.sex, employees.name, employees.birth_date from employees where (employees.surname = 'Kowalski')
Fig. 82 Evaluation times and optimisation gain for query 4
Query 5: Retrieve first names of employees whose surname is Kowalski
SBQL: (Employee where surname = "Kowalski").name;
SQL: select employees.name from employees where (employees.surname = 'Kowalski')
Fig. 83 Evaluation times and optimisation gain for query 5
Query 5 allows the selection over the indexed surname column to be combined with the single-column projection.
Query 6: Retrieve the employee with id equal 1
SBQL: Employee where id = 1;
SQL: select employees.info, employees.department_id, employees.surname, employees.salary, employees.id, employees.sex, employees.name, employees.birth_date from employees where (employees.id = 1)
Fig. 84 Evaluation times and optimisation gain for query 6
In this query only the selection is performed, but the unique index corresponding to the primary key on the employees table is used and only one record is retrieved and materialised.
Query 7: Retrieve the first name of the employee with id equal 1
SBQL: (Employee where id = 1).name;
SQL: select employees.name from employees where (employees.id = 1)
Fig. 85 Evaluation times and optimisation gain for query 7
Query 7 also introduces the projection, which slightly increases the gain compared to query 6.
Query 8: Retrieve all employees named Kowalski earning more than 1200
SBQL: Employee where salary > 1200 and surname = "Kowalski";
SQL: select employees.info, employees.department_id, employees.surname, employees.salary, employees.id, employees.sex, employees.name, employees.birth_date from employees where ((employees.salary > 1200) AND (employees.surname = 'Kowalski'))
Fig. 86 Evaluation times and optimisation gain for query 8
The optimisation gain of query 8 is based on the selections over the indexed columns surname and salary.
Query 9: Retrieve first names of employees named Kowalski earning more than 1200
SBQL: (Employee where salary > 1200 and surname = "Kowalski").name;
SQL: select employees.name from employees where ((employees.salary > 1200) AND (employees.surname = 'Kowalski'))
Fig. 87 Evaluation times and optimisation gain for query 9
Here, compared to query 8 (the same selection conditions), the projection is also applied. Therefore, the gain increases further.
Query 10: Retrieve employees named Kowalski earning more than 1200 or named Nowak
SBQL: Employee where salary > 1200 and surname = "Kowalski" or surname = "Nowak";
SQL: select employees.info, employees.department_id, employees.surname, employees.salary, employees.id, employees.sex, employees.name, employees.birth_date from employees where (((employees.salary > 1200) AND (employees.surname = 'Kowalski')) OR (employees.surname = 'Nowak'))
Fig. 88 Evaluation times and optimisation gain for query 10
The gain is relatively low compared to the previous queries containing selections over indexed columns. This is caused by the alternative (or) used in the selection, which makes the number of retrieved records large (according to the test data distribution).
Query 11: Retrieve first names of employees named Kowalski earning more than 1200 or named Nowak
SBQL: (Employee where salary > 1200 and surname = "Kowalski" or surname = "Nowak").name;
SQL: select employees.name from employees where (((employees.salary > 1200) AND (employees.surname = 'Kowalski')) OR (employees.surname = 'Nowak'))
Fig. 89 Evaluation times and optimisation gain for query 11
The projection introduced in query 11 increases the gain a few times relative to query 10 with the same selection conditions.
Query 12: Retrieve employees named Kowalski or Nowak and earning more than 1200
SBQL: Employee where salary > 1200 and (surname = "Kowalski" or surname = "Nowak");
SQL: select employees.info, employees.department_id, employees.surname, employees.salary, employees.id, employees.sex, employees.name, employees.birth_date from employees where ((employees.salary > 1200) AND ((employees.surname = 'Kowalski') OR (employees.surname = 'Nowak')))
Fig. 90 Evaluation times and optimisation gain for query 12
Query 13: Retrieve surnames and salaries of employees named Kowalski or Nowak and earning more than 1200
SBQL: (Employee where salary > 1200 and (surname = "Kowalski" or surname = "Nowak")).(surname, salary);
SQL: select employees.surname, employees.salary from employees where ((employees.salary > 1200) AND ((employees.surname = 'Kowalski') OR (employees.surname = 'Nowak')))
Fig. 91 Evaluation times and optimisation gain for query 13
Query 14: Retrieve employees named Kowalski whose salaries are between 800 and 2000
SBQL: ((Employee where surname = "Kowalski") where salary > 800) where salary < 2000;
SQL: select employees.info, employees.department_id, employees.surname, employees.salary, employees.id, employees.sex, employees.name, employees.birth_date from employees where (((employees.surname = 'Kowalski') AND (employees.salary > 800)) AND (employees.salary < 2000))
Fig. 92 Evaluation times and optimisation gain for query 14
In query 14 the logical correspondence of nested SBQL where operators to SQL and operators is used. The resulting selection refers to the indexed columns surname and salary.
Query 15: Retrieve first names of employees named Kowalski whose salaries are between 800 and 2000
SBQL: (((Employee where surname = "Kowalski") where salary > 800) where salary < 2000).name;
SQL: select employees.name from employees where (((employees.surname = 'Kowalski') AND (employees.salary > 800)) AND (employees.salary < 2000))
Fig. 93 Evaluation times and optimisation gain for query 15
The gain is increased compared to query 14 by the additional projection applied.
Query 16: Retrieve employees with departments they work in
SBQL: Employee join worksIn.Department;
SQL: select employees.info, employees.department_id, employees.surname, employees.salary, employees.id, employees.sex, employees.name, employees.birth_date, departments.name, departments.location_id, departments.id from employees, departments where (departments.id = employees.department_id)
Fig. 94 Evaluation times and optimisation gain for query 16
The gain in query 16 results from evaluating the join by the relational resource (the primary-foreign key relationship is applied). The relatively low value is caused by the large amount of data retrieved and materialised (all records from the employees table with the corresponding records from the departments table).
Query 17: Retrieve surnames of employees and names of departments they work in
SBQL: (Employee as e join e.worksIn.Department as d).(e.surname, d.name);
SQL: select employees.surname, departments.name from employees, departments where (departments.id = employees.department_id)
Fig. 95 Evaluation times and optimisation gain for query 17
The gain increases compared to query 16 due to the projection applied.
Query 18: Retrieve the surname and department's name of the employee with id equal 1
SBQL: ((Employee where id = 1) as e join e.worksIn.Department as d).(e.surname, d.name);
SQL: select employees.surname, departments.name from employees, departments where ((employees.id = 1) AND (departments.id = employees.department_id))
Fig. 96 Evaluation times and optimisation gain for query 18
The join conditions and the projection are the same as in query 17. However, the join is performed by the relational resource only for the single employees record pointed to by the primary key column. Therefore the query is evaluated faster and the amount of data retrieved and materialised substantially decreases. The unoptimised evaluation time also decreases considerably due to the selection limiting the number of joins.
Query 19: Retrieve the surname and department's name of the employee named Nowak
SBQL: ((Employee where surname = "Nowak") as e join e.worksIn.Department as d).(e.surname, d.name);
SQL: select employees.surname, departments.name from employees, departments where ((employees.surname = 'Nowak') AND (departments.id = employees.department_id))
Fig. 97 Evaluation times and optimisation gain for query 19
Similarly to query 18, an extra selection condition is given here. The gain is slightly lower, as the index on the surname column is not unique and more records are joined and retrieved. The surname selection also improves the unoptimised times (less than in the case of the unique index in query 18).
Query 20: Retrieve employees with departments they work in and departments' locations
SBQL: Employee join worksIn.Department join isLocatedIn.Location;
SQL: select employees.info, employees.department_id, employees.surname, employees.salary, employees.id, employees.sex, employees.name, employees.birth_date, departments.name, departments.location_id, departments.id, locations.id, locations.name from employees, departments, locations where ((departments.id = employees.department_id) AND (locations.id = departments.location_id))
Fig. 98 Evaluation times and optimisation gain for query 20
The explanation of the relatively low gain is the same as for query 16, as the triple join is evaluated without any additional selections (except for the primary-foreign key dependencies).
Query 21: Retrieve surnames of employees with names of the locations their departments are located in
SBQL: (Employee as e join e.worksIn.Department as d join d.isLocatedIn.Location as l).(e.surname, l.name);
SQL: select employees.surname, locations.name from employees, locations, departments where ((departments.id = employees.department_id) AND (locations.id = departments.location_id))
Fig. 99 Evaluation times and optimisation gain for query 21
The gain substantially increases compared to query 20 due to the projection realised in SQL.
Query 22: Retrieve surnames and names of locations where employees named Nowak work
SBQL: ((Employee where surname = "Nowak") as e join e.worksIn.Department as d join d.isLocatedIn.Location as l).(e.surname, l.name);
SQL: select employees.surname, locations.name from employees, locations, departments where (((employees.surname = 'Nowak') AND (departments.id = employees.department_id)) AND (locations.id = departments.location_id))
Fig. 100 Evaluation times and optimisation gain for query 22
The gain is slightly lower than in query 21, although an additional selection over the indexed surname column is introduced. This behaviour is caused by the limited number of join operations evaluated by the virtual repository due to this selection.
104 Evaluation times and optimisation gain for query 26 The gain is very similar as in query 25 – the selection conditions are the same, although another aggregate function is used. Query 27: Retrieve the sum of salaries of employees earning less than 2000 SBQL: SQL: sum((Employee where salary < 2000).salary); select sum(employees.salary) from employees where (employees.salary < 2000) ref. avg. time 6000 opt. avg. time gain 140,00 120,00 5000 100,00 80,00 gain time [ms] 4000 3000 60,00 2000 40,00 1000 20,00 0 10 100 no. of em ployees 0,00 1000 Fig. 105 Evaluation times and optimisation gain for query 27 Query 28: Retrieve employees working in the production department SBQL: SQL: Employee where worksIn.Department.name = "Production"; select employees.info, employees.department_id, employees.surname, employees.salary, employees.id, employees.sex, employees.name, employees.birth_date from employees, departments where ((departments.name = 'Production') AND (departments.id = employees.department_id)) Page 173 of 235 Chapter 7 Wrapper Optimisation Results ref. avg. time 50000 opt. avg. time gain 30,00 45000 25,00 40000 20,00 30000 25000 15,00 gain time [ms] 35000 20000 10,00 15000 10000 5,00 5000 0 10 0,00 1000 100 no. of em ployees Fig. 106 Evaluation times and optimisation gain for query 28 The join is evaluated with the additional selection condition. Therefore the gain is increased comparing to the similar query 16, where only primary-foreign key relationship was used. Query 29: Retrieve surnames and birth dates of employees in the production department SBQL: SQL: (Employee where worksIn.Department.name = "Production").(surname, birthDate); select employees.surname, employees.birth_date from employees, departments where ((departments.name = 'Production') AND (departments.id = employees.department_id)) ref. avg. time 50000 opt. avg. time gain 140,00 45000 120,00 40000 100,00 30000 80,00 gain time [ms] 35000 25000 60,00 20000 15000 40,00 10000 20,00 5000 0 10 100 no. of em ployees 0,00 1000 Fig. 107 Evaluation times and optimisation gain for query 29 The gain improves comparing to query 28 as the projection is introduced. Query 30: Retrieve surname and birth date of the employee with id equal 1 working in the production department SBQL: SQL: (Employee where id = 1 and worksIn.Department.name = "Production").(surname, birthDate); select employees.surname, employees.birth_date from employees, departments where ((employees.id = 1) AND ((departments.name = 'Production') AND (departments.id = employees.department_id))) Page 174 of 235 Chapter 7 Wrapper Optimisation Results ref. avg. time 6000 opt. avg. time gain 180,00 160,00 5000 140,00 120,00 100,00 3000 80,00 2000 gain time [ms] 4000 60,00 40,00 1000 20,00 0 10 0,00 1000 100 no. of em ployees Fig. 108 Evaluation times and optimisation gain for query 30 Again, the gain is increased as the selection with the uniquely indexed column is applied besides the one existing in query 29. Please notice that evaluation times for the unoptimised query is much shorter than previously. Query 31: Retrieve surnames and birth dates of employees named Kowalski working in the production department SBQL: SQL: (Employee where surname = "Kowalski" and worksIn.Department.name = "Production").(surname, birthDate); select employees.surname, employees.birth_date from employees, departments where ((employees.surname = 'Kowalski') AND ((departments.name = 'Production') AND (departments.id = employees.department_id))) ref. avg. time 6000 opt. avg. 
[Fig. 109 Evaluation times and optimisation gain for query 31]

The selection condition uses the non-uniquely indexed surname column. Hence the gain is slightly lower than for query 30.

Query 32: Retrieve employees whose department is located in Łódź

SBQL:
Employee where worksIn.Department.isLocatedIn.Location.name = "Łódź";

SQL:
select employees.info, employees.department_id, employees.surname, employees.salary, employees.id, employees.sex, employees.name, employees.birth_date from employees, locations, departments where ((locations.name = 'Łódź') AND ((departments.id = employees.department_id) AND (locations.id = departments.location_id)))

[Fig. 110 Evaluation times and optimisation gain for query 32]

The high gain for query 32 is caused by the complete evaluation of the triple join with the additional selection by the relational database. Further, the projection is used, as only employees records are retrieved and materialised.

Query 33: Retrieve surnames and birth dates of employees whose department is located in Łódź

SBQL:
(Employee where worksIn.Department.isLocatedIn.Location.name = "Łódź").(surname, birthDate);

SQL:
select employees.surname, employees.birth_date from employees, locations, departments where ((locations.name = 'Łódź') AND ((departments.id = employees.department_id) AND (locations.id = departments.location_id)))

[Fig. 111 Evaluation times and optimisation gain for query 33]

The gain increases compared to query 32, since the explicit projection is introduced (undesired employees columns are not retrieved).

Query 34: Retrieve the surname and birth date of the employee with id equal 1 whose department is located in Łódź

SBQL:
(Employee where id = 1 and worksIn.Department.isLocatedIn.Location.name = "Łódź").(surname, birthDate);

SQL:
select employees.surname, employees.birth_date from employees, locations, departments where ((employees.id = 1) AND ((locations.name = 'Łódź') AND ((departments.id = employees.department_id) AND (locations.id = departments.location_id))))

[Fig. 112 Evaluation times and optimisation gain for query 34]

Here the gain is much lower than in the previous queries – the selection with the unique index much improves the relational query evaluation time, but it also limits the number of joins performed by the virtual repository. The evaluation times for the unoptimised query are again improved.
Query 35: Retrieve surnames and birth dates of employees named Kowalski whose department is located in Łódź

SBQL:
(Employee where surname = "Kowalski" and worksIn.Department.isLocatedIn.Location.name = "Łódź").(surname, birthDate);

SQL:
select employees.surname, employees.birth_date from employees, locations, departments where ((employees.surname = 'Kowalski') AND ((locations.name = 'Łódź') AND ((departments.id = employees.department_id) AND (locations.id = departments.location_id))))

[Fig. 113 Evaluation times and optimisation gain for query 35]

Query 36: Retrieve the number of employees working in the production department

SBQL:
count(Employee where worksIn.Department.name = "Production");

SQL:
select COUNT(*) from employees, departments where ((departments.name = 'Production') AND (departments.id = employees.department_id))

[Fig. 114 Evaluation times and optimisation gain for query 36]

The aggregate function is evaluated over the double join directly in the relational resource without any data materialisation – only the function result (a number) is retrieved.

Query 37: Retrieve the number of employees named Kowalski working in the production department

SBQL:
count(Employee where surname = "Kowalski" and worksIn.Department.name = "Production");

SQL:
select COUNT(*) from employees, departments where ((employees.surname = 'Kowalski') AND ((departments.name = 'Production') AND (departments.id = employees.department_id)))

[Fig. 115 Evaluation times and optimisation gain for query 37]

The gain lower than in the case of query 36 is caused by the limited number of navigations (from virtual pointers) evaluated by the virtual repository (the additional selection condition). The evaluation times for the unoptimised query decrease substantially compared to the previous case.

Query 38: Retrieve the sum of salaries of employees in the production department

SBQL:
sum((Employee where worksIn.Department.name = "Production").salary);

SQL:
select sum(employees.salary) from employees, departments where ((departments.name = 'Production') AND (departments.id = employees.department_id))

[Fig. 116 Evaluation times and optimisation gain for query 38]

The gain arises from the aggregate function over the join evaluated completely by the relational database.

Query 39: Retrieve the number of employees working in Łódź

SBQL:
count(Employee where worksIn.Department.isLocatedIn.Location.name = "Łódź");

SQL:
select COUNT(*) from employees, locations, departments where ((locations.name = 'Łódź') AND ((departments.id = employees.department_id) AND (locations.id = departments.location_id)))
[Fig. 117 Evaluation times and optimisation gain for query 39]

The gain increases compared to query 38 due to the triple join evaluation avoided in the virtual repository for the aggregate function.

Query 40: Retrieve the sum of salaries of employees working in Łódź

SBQL:
sum((Employee where worksIn.Department.isLocatedIn.Location.name = "Łódź").salary);

SQL:
select sum(employees.salary) from employees, locations, departments where ((locations.name = 'Łódź') AND ((departments.id = employees.department_id) AND (locations.id = departments.location_id)))

[Fig. 118 Evaluation times and optimisation gain for query 40]

The query and the gain are similar to the previous example, although another aggregate function is used.

Query 41: Retrieve the sum of salaries of employees named Kowalski working in Łódź

SBQL:
sum((Employee where surname = "Kowalski" and worksIn.Department.isLocatedIn.Location.name = "Łódź").salary);

SQL:
select sum(employees.salary) from employees, locations, departments where ((employees.surname = 'Kowalski') AND ((locations.name = 'Łódź') AND ((departments.id = employees.department_id) AND (locations.id = departments.location_id))))

[Fig. 119 Evaluation times and optimisation gain for query 41]

The additional selection limits the number of joins to be performed by the virtual repository; therefore the gain is lower than in query 40. The evaluation times for the unoptimised query decrease substantially.

7.3 Application of SBQL optimisers

As presented in subchapters 6.2.4 and 6.2.5, in the current virtual repository implementation some wrapper-oriented queries are not optimally rewritten, but they can be optimised with SBQL methods. Due to the very large memory consumption necessary for the evaluation of unoptimised queries, the tests were performed only on 10, 50 and 100 records of employees and corresponding cars, so that disc memory swapping could be avoided. The plots included compare average raw and SBQL-optimised query evaluation times, and they present the average evaluation time ratio (the optimisation gain). For simplification, the wrapper execsql expressions substituted for each wrapper-related name and unconditionally retrieving all records are not shown. Unfortunately, with so few data points the shape of the gain curve is not informative; it is also affected by the low numbers of records, which do not reflect the assumed data distribution.

Query 1: Retrieve cars owned by employees named Nowak

Raw SBQL:
(Car as c where c.isOwnedBy in (Employee as e where e.surname = "Nowak").(e.id)).c;

SBQL-optimised:
(((deref((((Employee) as e where (deref((e . surname)) = "Nowak")) . (e . id)))) groupas $aux0 . ((Car) as c where (deref((c . isOwnedBy)) in $aux0))) . c)

[Fig. 120 Evaluation times and optimisation gain for query 1 (SBQL optimisation)]
Query 2: Retrieve a string composed of a make name, a model name and a car production year for employees whose surname is Nowak

Raw SBQL:
(((Car as c where c.isOwnedBy in (Employee as e where e.surname = "Nowak").(e.id)) join c.isModel.Model as m) join m.isMake.Make as mm).(mm.name + " " + m.name + " " + c.year);

SBQL-optimised:
(((((deref((((Employee) as e where (deref((e . surname)) = "Nowak")) . (e . id)))) groupas $aux0 . ((Car) as c where (deref((c . isOwnedBy)) in $aux0))) join (((c . isModel) . Model)) as m) join (((m . isMake) . Make)) as mm) . ((((deref((mm . name)) + " ") + deref((m . name))) + " ") + (string)(deref((c . year))))

[Fig. 121 Evaluation times and optimisation gain for query 2 (SBQL optimisation)]

The lower gain (compared to query 1) is caused by the remaining join, which is not rewritten according to the independent subqueries algorithm and has to be evaluated in an unoptimised form.

Chapter 8 Summary and Conclusions

The objectives assumed have been accomplished and the theses stated in the presented Ph.D. dissertation have been proved true:

1. Legacy relational databases can be transparently integrated into an object-oriented virtual repository, and their data can be processed and updated with an object-oriented query language indistinguishably from purely object-oriented data, without materialisation or replication.

The designed and implemented relational schema import procedure allows generic, automated and completely transparent integration of any number of legacy relational databases into the virtual repository structures. Relational schemata are presented, accessible and processed indistinguishably from real object-oriented data. An imported relational schema is enveloped with updateable object-oriented views defined in SBQL, used for the virtual repository management and maintenance and available as the top-level end-user interface language. Therefore, a wrapped relational database is processed transparently, as any other virtual repository resource. The actual distinction from object-oriented data occurs at the wrapper level, below the covering views, i.e. no intermediate virtual repository stage is "aware" of the actual data source.

The wrapper, responsible for interfacing between the virtual repository and the relational database, analyses SBQL queries for relational names and the (sub)queries involving them. The largest possible such (sub)queries are transformed into special SBQL expressions that redirect SQL query strings to the wrapper on SBQL query evaluation. The wrapper sends the SQL queries to the wrapped database and returns SBQL results to the query processing engine, so that the results are processed further like other object-oriented data (or returned directly to the end user). The wrapper query rewriter also supports imperative SBQL constructs, and relational databases can be updated with the object-oriented query language according to the SBQL semantics.
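As an illustration of this redirection, consider query 22 from chapter 7. Composing its SQL form with the result pattern notation explained in Appendix C (the pattern below is constructed by analogy to Listing 13, and any further execsql arguments are omitted, so the call is only a sketch), the whole query could be rewritten into a single wrapper call of the approximate form:

execsql("select employees.surname, locations.name from employees, locations, departments where (((employees.surname = 'Nowak') AND (departments.id = employees.department_id)) AND (locations.id = departments.location_id))", "<0 | | | none | struct <1 $employees | $surname | _surname | none | binder 1> <1 $locations | $name | _name | none | binder 1> 0>");

Here the selection, the joins and the projection are all shipped to the relational resource, and only the pattern-driven result reconstruction remains on the SBQL side.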
2. Appropriate optimisation mechanisms can be developed and implemented for such a system in order to enable the coaction of the object-oriented virtual repository optimisation with native relational resource optimisers.

At the query rewriting stage, the wrapper aims to find SBQL query patterns corresponding to SQL-optimisable queries that can be efficiently evaluated by the underlying relational database, so that minimum processing is required at the SBQL side. The currently implemented patterns correspond to aggregate functions, joins, where selections and projections; however, all optimisation-related resource information is available to the wrapper, so a cost model could be provided to enable even more efficient decision making. Besides transforming SBQL (sub)queries into their SQL counterparts, efficient SBQL optimisation is available during query processing. This optimisation includes the necessary procedures, like the view rewriting (query modification methods) required by the wrapper optimiser, but also ones that substantially improve SBQL query evaluation itself, e.g. the methods based on independent subqueries. Hence, after the SQL-oriented optimisation, the whole SBQL query can still be optimised so that the overall performance is increased.

8.1 Prototype Limitations and Further Works

The current wrapper implementation proves the theses and demonstrates the developed optimisation methods. The prototype ensures the complete functionality of the virtual repository based on both relational resources and object-oriented ones (including other integrated external data sources, e.g. Web Services, XML, etc.). However, some improvements can be introduced. The minor one concerns a completely automated primary wrapper view generation process – the current one creates only basic objects corresponding to the relational columns and tables, with primary on_retrieve, on_update and on_delete procedures. The improved view generator should also generate virtual pointers (on_navigate procedures) corresponding to relational primary-foreign key constraints. This feature is partially implemented, but it does not allow multi-column keys and multiple foreign keys for a single table, and it was skipped in the examples provided. For the most general case, a relational database schema with table relations must be expressed with additional object-oriented views written by a system administrator (the contributory and integration views); this approach has been used in the examples presented in the thesis.

The current considerations and implementation do not allow creating relational data. The SQL insert statements are irregular and they do not obey the general language syntax. Nevertheless, extending the wrapper views with on_create procedures and recognising the corresponding expressions seems possible, although rather challenging.

Yet another improvement, substantial to the virtual repository operation and not directly related to the wrapper itself, is processing distributed queries and decomposing them to appropriate resources. This complex and sophisticated feature is still under development for the virtual repository and it will be tested with the presented wrapper in the future. In the thesis, simple wrapper distribution functionalities were discussed with examples in subchapter 6.2.4 Multi-Wrapper and Mixed Queries. Distributed query decomposition and processing features are out of the scope of the thesis, however.

8.2 Additional Wrapper Functionalities

The wrapper designed and implemented for the thesis is focussed on relational databases.
During the eGov-Bus works, however, it has been extended with additional functionalities allowing integration of other types of resources, not necessarily available with JDBC and SQL. The flexible and modular wrapper prototype architecture allowed implementing wrapper modes dedicated to SD-SQL databases and SWARD repositories.

SD-SQL36 databases [204] operate on database stored procedures responsible for transparent distribution of data between separate database instances, together with automated management and querying of these instances. Queries targeting SD-SQL resources are not ordinary SQL strings but the appropriate stored procedure calls – the corresponding extension has been introduced to the wrapper query generator. Query strings and results are processed by JDBC, as in the case of regular relational databases.

SWARD (mentioned in subchapter 2.2.4.1), being a part of the Amos II project (subchapter 2.2.1.3), required extending the wrapper with an extra interface enabling communication with this data source instead of the standard JDBC. Further, an RDQL query generator was implemented instead of the standard SQL one. Therefore, the presented wrapper can also be used for accessing other RDF resources (probably with minor modifications), provided they can be queried with RDQL (this language is currently regarded as obsolete). The existing RDQL query generator could also be extended into a SPARQL query generator without much effort.

For either SD-SQL or SWARD, the wrapper action is the same as in the case of regular relational databases; the only difference is in the resource-specific query strings (preferably optimised) generated and sent to the resource. Result retrieval and reconstruction procedures are generic and they did not require modifications.

36 Scalable Distributed SQL

Appendix A The eGov-Bus Project

The thesis has been accomplished under the eGov-Bus project [205], which is an acronym for the Advanced eGovernment Information Service Bus project supported by the European Community under the "Information Society Technologies" priority of the Sixth Framework Programme (contract number: FP6-IST-4-026727-STP). The project is a 24-month international research project aiming at designing the foundations of a system providing citizens and businesses with improved access to virtual public services, which are based on existing national eGovernment services and which support cross-border "life events". The project participants are:

• Rodan Systems SA (Poland),
• Centre de Recherche en Informatique Appliquée – Universite Paris Dauphine (France),
• Europäisches EMIC Innovations Center GmbH (Germany),
• Department of Information Technology, Uppsala University (Sweden),
• Polish-Japanese Institute for Information Technology (Poland),
• Axway Software (France),
• Zentrum für Sichere Informationstechnologie (Austria),
• Ministry of Interior and Administration (MSWiA) (Poland).

The overall eGov-Bus project objective is to research, design and develop technology innovations which will create and support a software environment that provides user-friendly, advanced interfaces to support "life events" of citizens and businesses – administration interactions involving many different government organisations within the European Union. The "life-events" model organises services and allows users to access services in a user-friendly and seamless manner, by hiding the functional fragmentation and the organisational complexity of the public sector.
This approach transforms governmental portals into virtual agencies, which cluster functions related to the customer's everyday life, regardless of the responsible agency or branch of government. Such virtual agencies offer single points of entry to multiple governmental agencies (European, national, regional and local) and provide citizens and businesses with the opportunity to interact easily and seamlessly with several public agencies. "Life-events" lead to a series of transactions between users (citizens and enterprises) and various public sector organisations, often crossing traditional department boundaries. There are substantial information and service needs of the user that can span a range of organisations and be quite complicated. Even a straightforward life event such as "moving house" within a single country such as Poland may require complex interaction with a number of government information systems.

The detailed objectives are:

1. Create adaptable process management technologies by enabling virtual services to be combined dynamically from the available set of eGovernment functions.
2. Improve the effective usage of advanced web service technologies by eGovernment functions by means of service-level agreements, an audit trail, semantic representations, better availability and performance.
3. Exploit and integrate current and ongoing research results in the area of natural language processing to provide user-friendly, customisable interfaces to the eGov-Bus.
4. Organise currently available web services according to the specific life-event requirements, creating a comprehensive workflow process that provides clear instructions for end users and allows them to personalise services as required.
5. Research a secure, non-repudiable audit trail for combined Web services by promoting qualified electronic signature technology.
6. Support a virtual repository of data sources required by life-event processes, including meta-data, declarative rules, and procedural knowledge about governing life-event categories.
7. Provide these capabilities based on a highly available, distributed and secure architecture that makes use of existing systems.

Generally, citizens and businesses will profit from more accessible public services. The following concrete benefits will be achieved:

• Improved public services for citizens and businesses,
• Easier access to cross-border services and therefore a closer European Union,
• Improved quality of life and quality of communication,
• Reduced red tape and thus an increase in productivity.

To accomplish these challenging objectives, eGov-Bus researches advances in business process and Web service technologies. Virtual repositories provide data abstraction, and a security service framework ensures adequate levels of data protection and information security. Multi-channel interfaces allow citizens easy access using their preferred interface. The eGov-Bus architecture is to comprise three distinct classes of software components: the newly developed features resulting from the project research and development effort; the modified and extended pre-existing software components – either proprietary components licensed to the project by the project partners or open software; and the pre-existing information system features.
The latter category pertains to the eGovernment information systems to be rendered interoperable with the use of the eGov-Bus prototype, as well as to middleware software components such as workflow management engines.

Appendix B The ODRA Platform

ODRA (Object Database for Rapid Application development) [206] is a prototype object-oriented application development environment currently being constructed at the Polish-Japanese Institute of Information Technology under the eGov-Bus project. Its aim is to design a next-generation development tool for future database application programmers. The tool is based on SBQL. The SBQL execution environment consists of a virtual machine, a main memory DBMS and an infrastructure supporting distributed computing.

The main goal of the ODRA project is to develop new paradigms of database application development. This goal can be reached by increasing the level of abstraction at which a programmer works, through the application of a new, universal, declarative programming language, together with its distributed, database-oriented and object-oriented execution environment. Such an approach provides a functionality common to a variety of popular technologies (such as relational/object databases, several types of middleware, general-purpose programming languages and their execution environments) in a single universal, easy to learn, interoperable and effective to use application programming environment. The principal ideas implemented in order to achieve this goal are the following:

1. Object-oriented design. Despite the principal role of object-oriented ideas in software modelling and in programming languages, these ideas have not yet succeeded in the field of databases. The ODRA approach is different from the current ways of perceiving object databases, represented mostly by the ODMG standard [207] and database-related Java technologies (e.g., [208, 209]). The system is built upon the SBA methodology ([210, 211]). This allows introducing into database programming all the popular object-oriented mechanisms (like objects, classes, inheritance, polymorphism, encapsulation), as well as some mechanisms previously unknown (like dynamic object roles [212, 213] or interfaces based on database views [214, 215]).

2. Powerful query language extended to a programming language. The most important feature of ODRA is SBQL, an object-oriented query and programming language. SBQL differs from programming languages and from well-known query languages, because it is a query language with the full computational power of programming languages. SBQL alone makes it possible to create fully fledged database-oriented applications. A chance to use the same very-high-level language for most database application development tasks may greatly improve programmers' efficiency, as well as software stability, performance and maintenance potential.

3. Virtual repository as a middleware. In a networked environment it is possible to connect several hosts running ODRA. All systems tied in this manner can share resources in a heterogeneous and dynamically changing, but reliable and secure environment. This approach to distributed computing is based on object-oriented virtual updateable database views [216]. Views are used as wrappers (or mediators) on top of local servers, as a data integration facility for global applications, and as customisers that adapt global resources to the needs of particular client applications.
This technology can be perceived as a contribution to distributed databases, Enterprise Application Integration (EAI), Grid Computing and Peer-To-Peer networks.

The distributed nature of contemporary information systems requires highly specialised software facilitating communication and interoperability between applications in a networked environment. Such software is usually referred to as middleware and is used for application integration. ODRA supports information-oriented and service-oriented application integration. The integration can be achieved through several techniques known from research on distributed/federated databases. The key feature of ODRA-based middleware is the concept of transparency. Due to this transparency, many complex technical details of the distributed data/service environment need not be taken into account in an application code. ODRA supports the following transparency forms:

• Transparency of updating made from the side of a global client,
• Transparency of distribution and heterogeneity,
• Transparency of data fragmentation,
• Transparency of data/service redundancies and replications,
• Transparency of indexing,
• etc.

These forms of transparency have not been solved to a satisfactory degree by current technologies. For example, Web Services support only transparency of location and transparency of implementation. Transparency is achieved in ODRA through the concept of a virtual repository (Fig. 1). The repository seamlessly integrates distributed resources and provides a global view on the whole system, allowing one to utilise distributed software resources (e.g., databases, services, applications) and hardware (processor speed, disk space, network, etc.). It is responsible for the global administration and security infrastructure, global transaction processing, communication mechanisms, ontology and metadata management. The repository also facilitates data access by several redundant data structures (global indexes, global caches, replicas), and protects data against random system failures.

A user of the repository sees the data exposed by the systems integrated by means of the virtual repository through a global integration view. The main role of the integration view is to hide the complexities of the mechanisms involved in accessing local data sources. The view implements a CRUD behaviour which can be augmented with logic responsible for dealing with horizontal and vertical fragmentation, replication, network failures, etc. Thanks to the declarative nature of SBQL, these complex mechanisms can often be expressed in one line of code.

The repository has a highly decentralised architecture. In order to get access to the integration view, clients do not send queries to any centralised location in the network. Instead, every client possesses its own copy of the global view, which is automatically downloaded from the integration server after successful authentication to the repository. A query executed on the integration view is to be optimised using such techniques as rewriting, pipelining, global indexing and global caching. Local sites are fully autonomous, which means it is not necessary to change them in order to make their content visible to the global user of the repository. Their content is visible to global clients through a set of contributory views which must conform to the integration view (be a subset of it).
Non-ODRA data sources are available to global clients through a set of wrappers, which map the data stored in them to the canonical object model assumed for ODRA. There are wrappers developed for several popular databases, languages and middleware technologies. Despite their diversity, they can all be made available to global users of the repository. A global user may not only query local data sources, but also update their content using SBQL. Instead of exposing raw data, the repository designer may decide to expose only procedures. Calls to such procedures can be executed synchronously and asynchronously. Together with SBQL's support for semistructured data, this feature enables a document-oriented interaction, which is characteristic of current technologies supporting Service Oriented Architecture (SOA).

B.1 ODRA Optimisation Framework37

In terms of ODRA, an optimiser means any mechanism transforming a query into a semantically equivalent form, not only directly for better performance but also to enable other optimisers to work. For example, some optimisers may require specific transformations (e.g., macro-substituting view or procedure definitions) to be executed first. These operations do not improve performance themselves, but the subsequently applied optimisations do (e.g. the ones based on independent subquery methods). A common name is used for all these mechanisms because they all work in a similar way and they are served by the same framework.

The ODRA optimisation framework allows defining an optimisation sequence, i.e. a collection of subsequent optimisers influencing query evaluation (there is also a possibility to turn any optimisation off), for arbitrary performance tuning. Such a sequence is a session variable and does not affect other sessions. The supported and implemented optimisers are (a code name for a sequence definition is given in italics):

• None (none) – no optimisation performed,
• Independent subquery methods (independent),
• Dead subquery removal (dead),
• Union-distributive (union) – parallel execution of some distributive queries,
• Wrapper rewriting (wrapperrewrite) – simple relational wrapper query rewriting (the naive approach),
• Wrapper optimisation (wrapperoptimize) – relational wrapper query optimisation,
• Procedure rewriting (rewrite) – macro-substituting procedure calls for query modification,
• View rewriting (viewrewrite) – macro-substituting view calls for query modification,
• Indices (index) – substituting some subqueries with appropriate index calls.

The default optimisation sequence is empty (no optimisation is performed), which is correct for all queries (but not efficient, of course). Therefore, the optimisers to be used should be put in the current optimisation sequence; e.g., for relational wrapper queries the minimum optimisation sequence is view rewriting and wrapper rewriting38. Please notice that the sequence order is also important, as the simple wrapper rewriting would not succeed if the views used in a query were not macro-substituted with their definitions before. This sequence can be further expanded with some other optimisers, e.g., removing dead subqueries and using independent subquery methods. Similarly, for an optimised execution of a relational wrapper query the minimum sequence is view rewriting and wrapper optimisation. This sequence can also be extended with some other optimisers.
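For illustration, the minimum sequences named above can be written down with the optimisers' code names (only the sequence contents are shown here – the surrounding console command syntax is not reproduced, so the notation is a sketch only):

viewrewrite, wrapperrewrite (naive relational wrapper query execution)
viewrewrite, wrapperoptimize (optimised relational wrapper query execution)
viewrewrite, wrapperoptimize, dead, independent (the latter extended with SBQL optimisers)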
Preferably, a default optimisation sequence should contain the most common optimisers (or just the ones required for a proper execution of an arbitrary query). Another promising solution is to set the most appropriate sequence dynamically, based on the semantics of a particular query. However, these functionalities are out of the scope of the thesis and they may be realised in a further virtual repository development.

37 SBQL optimisation techniques have been described in detail in subchapter 4.4
38 Actually, there is no need to use the simple wrapper rewriter preceded by the view rewriter. In the current ODRA implementation, unoptimised (naive) wrapper calls are compiled into the views' binary code, and any wrapper-related query is evaluated correctly (but with no optimisation) even with no optimisation sequence defined.

Appendix C The Prototype Implementation

The following subchapters are focused on various implementation issues and solutions developed and implemented for the wrapper prototype.

C.1 Architecture

The wrapper is implemented in a client-server architecture schematically presented in Fig. 122, together with the fundamental virtual repository components.

[Fig. 122 Wrapper architecture – the virtual repository (object-oriented views over ODRA resources) contains the wrapper client, which communicates with the wrapper server; the server accesses the RDBMS through an embedded JDBC client]

The client is embedded in the ODRA instance (ODRA is described shortly in Appendix B), while the multithreaded server is an independent module located between ODRA and the wrapped relational database. The internal wrapper communication is realised with a dedicated protocol (described below) over TCP/IP sockets, while the communication with the resource relies on the embedded JDBC client.

C.1.1 Communication protocol

The client-server communication (based on a simple internal text protocol) is established on a server listener port. Listing 7 shows a sample communication procedure (single-thread, only one client connected) seen at the server side, while Listing 8 presents the same procedure at the client side. The example presents a simple procedure of retrieving the metabase from the wrapper server (the strings are URL-encoded for preserving non-ASCII characters; the example presents decoded values).

Listing 7 Client-server communication example (server side)

SBQL wrapper server thread #889677296 listening...
SBQL wrapper server thread #889677296 connected from /127.0.0.1
SBQL wrapper server thread #889677296 <- hello
SBQL wrapper server thread #889677296 -> hello
SBQL wrapper server thread #889677296 -> SBQL wrapper server thread #889677296@127.0.0.1
SBQL wrapper server thread #889677296 -> SBQL wrapper server is running under Java Service Wrapper
SBQL wrapper server thread #889677296 -> Big thanks to Tanuki Software <http://wrapper.tanukisoftware.org>
SBQL wrapper server thread #889677296 -> request.identify
SBQL wrapper server thread #889677296 <- identity: admin
SBQL wrapper server thread #889677296 -> welcome admin@127.0.0.1
SBQL wrapper server thread #889677296 -> ready
SBQL wrapper server thread #889677296 <- send.metabase
SBQL wrapper server thread #889677296 -> data.ready: XSD prepared
SBQL wrapper server thread #889677296 <- get.transfer.port
SBQL wrapper server thread #889677296 -> transfer.port: 3131
SBQL wrapper server thread #889677296 <- get.data.length
SBQL wrapper server thread #889677296 -> data.length: 3794
SBQL wrapper server thread #889677296 <- send.data
SBQL wrapper server thread #889677296 -> sending.data
SBQL wrapper server thread #889677296 -> transfer finished in 16 ms
SBQL wrapper server thread #889677296 -> transfer rate 231 kB/s
SBQL wrapper server thread #889677296 <- data.received
SBQL wrapper server thread #889677296 -> want.another
SBQL wrapper server thread #889677296 <- bye
SBQL wrapper server thread #889677296 -> bye
Listing 8 Client-server communication example (client side)

Connecting the server...
Connecting 'localhost' on port 2000 (1 of 10)...
Connection established...
admin -> hello
admin <- hello
admin <- SBQL wrapper server thread #889677296@127.0.0.1
admin <- SBQL wrapper server is running under Java Service Wrapper
admin <- Big thanks to Tanuki Software <http://wrapper.tanukisoftware.org>
admin <- request.identify
admin -> identity: admin
admin <- welcome admin@127.0.0.1
admin <- ready
admin -> send.metabase
admin <- data.ready: XSD prepared
admin -> get.transfer.port
admin <- transfer.port: 3131
admin -> get.data.length
admin <- data.length: 3794
admin -> send.data
admin <- sending.data
admin <- transfer finished in 16 ms
admin <- transfer rate 231 kB/s
admin -> data.received
admin <- want.another
admin -> bye
admin <- bye

All the communication protocol commands and messages for the server and the client, with their meanings, are shown in the tables below. Please notice that there are still some miscellaneous messages interpreted neither by the client nor by the server – they contain just some useful information that can be used for debugging. Besides the communication protocol, a binary data transfer is opened on another port whenever it is needed. In the sample communication procedure presented in the listings above, the port is dynamically assigned on the client's get.transfer.port command for sending a serialised XSD document.
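The exchange of Listing 8 can be driven by a few lines of socket code. The sketch below is an illustration of the message flow only – it is not the actual odra.wrapper client (the class name is hypothetical), it handles no errors, and it skips the binary transfer of the prepared XSD:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class ProtocolSketch {
    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket("localhost", 2000);
             BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
            out.println("hello");                     // open the session
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("request.identify")) {
                    out.println("identity: admin");   // simple authentication by user name
                } else if (line.startsWith("ready")) {
                    out.println("send.metabase");     // request the metabase XSD document
                } else if (line.startsWith("data.ready")) {
                    out.println("get.transfer.port"); // ask where the binary data will be served
                } else if (line.startsWith("transfer.port:")) {
                    // here the XSD would be fetched from the returned port and
                    // acknowledged with data.received; for brevity the session ends
                    out.println("bye");
                } else if (line.equals("bye")) {
                    break;                            // server confirmed the end of the session
                }
            }
        }
    }
}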
Table 3 Wrapper protocol server commands and messages

Command/message | Parameter(s) | Meaning
bye | – | Finish the communication session
close | – | Close the session immediately, currently unused
data.length | Data length in bytes | Length of data to be sent
data.ready | Variant information on what data is ready | Data is prepared and ready for retrieval
error | Error code according to the WrapperException caught | Error indication
hello | – | Start the communication session
ready | – | The server is ready for client requests
reject | – | Rejects a client request or identity, currently unused
request.identity | – | Asks the client for its name (for client authentication), currently only a database user name is retrieved
request.mode | – | Asks the client to establish its mode, currently supported modes are: SQL (1), SD-SQL (2) and SWARD/RDF (3)
sending.data | – | Data sending started
transfer.port | Port number | Returns to the client a port number for binary data transfer
want.another | – | Asks the client if another request will be sent in this communication session

Table 4 Wrapper protocol client commands and messages

Command/message | Parameter(s) | Meaning
hello | – | Start the communication session
bye | – | Finish the communication session
close | – | Close the session immediately, currently unused
identity | Client identity string | Send the client identity string or name used for authentication, currently only a database user name is sent
get.transfer.port | – | Asks the server for a port number for receiving binary data
query | Query string | Sends a query string for evaluation in the wrapper resource
send.data | – | Tells the server to start transferring binary data
get.data.length | – | Asks the server for the binary data length in bytes
data.received | – | Tells the server the binary data is received and the socket can be released
send.database | – | Requests a database model
send.metabase | – | Requests a metabase XSD document
mode | Mode number | Sends the client mode, currently supported modes are: SQL (1), SD-SQL (2) and SWARD/RDF (3)

C.2 Relational Schema Wrapping

The very first step, performed only once (unless the wrapped resource schema changes), is the generation of an XML document containing the description of a wrapped resource. The schema description is required by the wrapper; its details are skipped for now (the generation procedure and the document structure are described in the following parts of the appendix).

A virtual repository wrapper is instantiated when a wrapper module is created. A wrapper module is a regular database module, but it asserts that all the names within the module are "relational" (i.e. imported from a relational schema), except for the automatically generated views referring to these names (this procedure is described in the following paragraphs). A wrapper instance contains a wrapper client capable of communicating with the appropriate server. All the wrapper instances are stored in a global (static) session object and are available by a wrapper module name (this gives a simple way to check whether a module is a wrapper module – just test if the current session contains a wrapper for the module name). Thus, once a wrapper is created, it is available to any session (including the ones initialised in the future), as its module is. Each wrapper instance is provided with its local transient auto-expandable data store, so that intermediate query results do not affect the actual ODRA data store and do not interfere with other objects.
This means that a wrapper creates its own transient store for each session. The store is destroyed (its memory is released) when the session closes; similarly, all the stores for a particular wrapper are destroyed when its wrapper module is deleted from a database.

The CLI39 command for creating a wrapper module is:

add module <modulename> as wrapper on <host>:<port>

where <modulename> is the name of the module to add, <host> is a wrapper server host (IP or name), and <port> is its listener port. The command is parsed and sent to the ODRA server, where a module is created as a submodule of the current one and a new wrapper is instantiated. Once it is ready, it is stored in the current session wrapper pool under the global name of the created module. A wrapper instantiation consists of the following steps:

1. Initialise a local client,
2. Test communication with the server,
3. Retrieve and store locally a programmatic model,
4. Retrieve the XSD and create a metabase,
5. Create primary views enveloping the "relational" metaobjects.

Whenever a server process needs a wrapper (e.g., for query rewriting or execution), it retrieves it from the current session with the global name of the module for which the request is served. A wrapper client is used whenever communication with the server is required, e.g., when a "relational" query is executed.

39 Command Line Interface, the ODRA basic text console

C.2.1 Example

The relational schema used below is already described in subchapter 6.2.1 Relational Test Schemata; for the clearness of the example it is presented again in Fig. 123.

[Fig. 123 Wrapped legacy relational schema – tables employees (PK id, FK department_id), departments (PK id, FK location_id) and locations (PK id)]

The wrapper client retrieves the schema description from its server and creates the appropriate metadata in the ODRA metabase (one-to-one mapping applied, each table is represented as a single complex object). The corresponding object-oriented schema is presented in Fig. 124.

[Fig. 124 Lowest-level object-oriented wrapper schema – complex objects $employees, $departments and $locations with $-prefixed subobjects for the columns]

This schema import procedure employs the native ODRA XSD/XML importer. The drawback of this solution is that the importer creates complex objects (for the possibility of storing XML annotations) and the actual values are pushed down to subobjects named _VALUE (they appear in the examples of queries with macro-substituted views included in the thesis). Therefore, some primitive encapsulation is required at this stage to isolate users from this strictly implementation-dependent feature.

The names used in this schema (regarded as relational ones in the query analysis and optimisation procedures, subchapter 6.2 Query Analysis and Optimisation Examples) are simply the names of relational tables and columns prefixed with "$". This ensures that they are not available in ad-hoc queries, as such names are not valid identifiers recognised by the ODRA SBQL parser. Thus, the metaobjects with $-prefixed names are covered by automatically generated views (Fig. 125) referring to the original relational names of the wrapped tables and columns. This is the final automatically generated stage of the wrapped relational schema. It can already be queried, or covered by a further set of views (subchapters 6.2.1 Relational Test Schemata and 6.3 Sample Use Cases), so that it can contribute to the global schema of the virtual repository.
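For instance, an ad-hoc SBQL query over the view names visible in Fig. 125 (an illustrative query, not one taken from the test suite) could be:

(employees where salary > 2000).surname;

returning the surnames of the wrapped employees records with a salary exceeding 2000.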
[Fig. 125 Primary wrapper views – employees (id, name, surname, sex, salary, info, birth_date, department_id), departments (id, name, location_id), locations (id, name)]

As stated in the thesis conclusions, the wrapper could be extended with a fully automated view generator, so that virtual pointers are created without any administrator interference.

C.2.2 Relational Schema Models

The prototype uses different relational schema models depending on the module (the level in the architecture). Their structures depend on a particular application.

C.2.2.1 Internal Wrapper Server Model

A relational schema is stored at the server side as an XML file. When the server is started, it loads a schema model according to a database name specified as the default one in a properties file or given as a start-up parameter (described below). This model description is a base for all the other models.

A schema description file is an XML document based on the DTD listed below (a modified and extended version of the original Torque 3.2 DTD [217]). The DTD instance is available at http://jacenty.kis.p.lodz.pl/relational-schema.dtd for validation purposes.

Listing 9 Contents of relational-schema.dtd

<!ELEMENT database (table+)>
<!ATTLIST database
  name CDATA #REQUIRED
>
<!ELEMENT table (column+, best-row-id?, foreign-key*, index*)>
<!ATTLIST table
  name CDATA #REQUIRED
>
<!ELEMENT column EMPTY>
<!ATTLIST column
  name CDATA #REQUIRED
  nullable (true | false) "false"
  type (BIT | TINYINT | SMALLINT | INTEGER | BIGINT | FLOAT | REAL | NUMERIC | DECIMAL | CHAR | VARCHAR | LONGVARCHAR | DATE | TIME | TIMESTAMP | BINARY | VARBINARY | LONGVARBINARY | NULL | OTHER | JAVA_OBJECT | DISTINCT | STRUCT | ARRAY | BLOB | CLOB | REF | BOOLEANINT | BOOLEANCHAR | DOUBLE) #IMPLIED
  size CDATA #IMPLIED
  scale CDATA #IMPLIED
  default CDATA #IMPLIED
  description CDATA #IMPLIED
>
<!ELEMENT best-row-id (best-row-id-column+)>
<!ELEMENT best-row-id-column EMPTY>
<!ATTLIST best-row-id-column
  name CDATA #REQUIRED
>
<!ELEMENT foreign-key (reference+)>
<!ATTLIST foreign-key
  foreign-table CDATA #REQUIRED
  name CDATA #IMPLIED
>
<!ELEMENT reference EMPTY>
<!ATTLIST reference
  local CDATA #REQUIRED
  foreign CDATA #REQUIRED
>
<!ELEMENT index (index-column+)>
<!ATTLIST index
  name CDATA #REQUIRED
  unique (true | false) #REQUIRED
  type (1 | 2 | 3 | 4) #IMPLIED
  pages CDATA #IMPLIED
  cardinality CDATA #IMPLIED
  filter-condition CDATA #IMPLIED
>
<!ELEMENT index-column EMPTY>
<!ATTLIST index-column
  name CDATA #REQUIRED
>

The schema generation is actually based on Torque 3.2 [218]. Prior to Torque 3.3 there was no information gathered or processed on relational schema indices (in the current RC1 this feature is still missing), therefore the specialised SchemaGenerator class (package odra.wrapper.generator) was introduced as an extension to the standard TorqueJDBCTransformTask. The application of a Torque-based generator assures access to the most popular RDBMSs via standard JDBC drivers: Axion, Cloudscape, DB2, DB2/AS400, Derby, Firebird, Hypersonic, Informix, InstantDB, Interbase, MS Access, MS SQL, MySQL, Oracle, Postgres, SapDB, Sybase, Weblogic. An appropriate driver class should be available in the classpath prior to a schema generation.
Currently there are only three drivers provided in the project, for PostgreSQL 8.x [219] (postgresql-8.1-405.jdbc3.jar), Firebird 2.x [220] (jaybird-full2.1.1.jar) and MS SQL Server 2005 [221] (jtds-1.2.jar), as these RDBMSs were used for tests. A sample schema description (corresponding to the relational schema shown in Fig. 123) is provided in Listing 10.

Listing 10 Sample schema description

<?xml version="1.0"?>
<!DOCTYPE database SYSTEM "http://jacenty.kis.p.lodz.pl/relational-schema.dtd">
<!--SBQL Wrapper - relational database schema-->
<!--generated at 2007-03-25 21:10:30-->
<!--author: Jacek Wislicki, jacek.wislicki@gmail.com-->
<database name="wrapper">
  <table name="departments">
    <column name="id" nullable="false" type="INTEGER"/>
    <column name="name" nullable="false" size="64" type="VARCHAR"/>
    <column name="location_id" nullable="false" type="INTEGER"/>
    <best-row-id>
      <best-row-id-column name="id"/>
    </best-row-id>
    <foreign-key foreign-table="locations">
      <reference foreign="id" local="location_id"/>
    </foreign-key>
    <index cardinality="2" name="departments_pkey" pages="8" type="3" unique="true">
      <index-column name="id"/>
    </index>
  </table>
  <table name="employees">
    <column name="id" nullable="false" type="INTEGER"/>
    <column name="name" nullable="false" size="128" type="VARCHAR"/>
    <column name="surname" nullable="false" size="64" type="VARCHAR"/>
    <column name="sex" nullable="false" size="1" type="CHAR"/>
    <column name="salary" nullable="false" scale="65531" size="65535" type="NUMERIC"/>
    <column name="info" nullable="true" size="10240" type="VARCHAR"/>
    <column name="birth_date" nullable="false" type="DATE"/>
    <column name="department_id" nullable="false" type="INTEGER"/>
    <best-row-id>
      <best-row-id-column name="id"/>
    </best-row-id>
    <foreign-key foreign-table="departments">
      <reference foreign="id" local="department_id"/>
    </foreign-key>
    <index cardinality="12" name="employee_sex_ix" pages="10" type="3" unique="false">
      <index-column name="sex"/>
    </index>
    <index cardinality="10" name="employee_name_ix" pages="10" type="3" unique="false">
      <index-column name="surname"/>
    </index>
    <index cardinality="11" name="employee_salary_ix" pages="10" type="3" unique="false">
      <index-column name="salary"/>
    </index>
    <index cardinality="7" name="employees_pkey" pages="10" type="3" unique="true">
      <index-column name="id"/>
    </index>
  </table>
  <table name="locations">
    <column name="id" nullable="false" type="INTEGER"/>
    <column name="name" nullable="false" size="64" type="VARCHAR"/>
    <best-row-id>
      <best-row-id-column name="id"/>
    </best-row-id>
    <index cardinality="2" name="locations_pkey" pages="7" type="3" unique="true">
      <index-column name="id"/>
    </index>
    <index cardinality="2" name="locations_name_key" pages="7" type="3" unique="true">
      <index-column name="name"/>
    </index>
  </table>
  <table name="pg_logdir_ls">
    <column name="filetime" nullable="true" type="TIMESTAMP"/>
    <column name="filename" nullable="true" type="VARCHAR"/>
  </table>
</database>

A schema description XML file can also be created or edited manually if necessary, e.g., if an RDBMS does not offer all the required information via JDBC or only selected tables/views are to be exposed to the wrapper. In the example in Listing 10, an unnecessary element to be removed manually is the pg_logdir_ls table – a system object automatically read by the JDBC connection.
C.2.2.2 Programmatic Model

A programmatic model is built according to an XML schema description by the classes in the odra.wrapper.model package. The model structure is similar to the one used by Torque, but it was written from scratch in order to correctly reflect all the relational database structures, including indices and primary-foreign key dependencies. This model offers quick access to the structure of a relational database and all its features. The programmatic model is sent to the client on its request, i.e. on a wrapper initialisation. It is used at the client side for SBQL query analysis and rewriting.

C.2.2.3 Metabase

A metabase is a regular ODRA metabase created from an XSD import reflecting the relational schema. The schema file is generated on the fly and sent to a client by the server on its request, i.e. on a wrapper initialisation (just after the programmatic model retrieval). The metabase creation is based on the programmatic model stored at the server side. A sample XSD schema is shown in Listing 11.

Listing 11 Sample XSD for metabase

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:sql="http://jacenty.kis.p.lodz.pl" targetNamespace="http://jacenty.kis.p.lodz.pl" elementFormDefault="qualified">
  <xsd:element name="wrapper">
    <xsd:complexType>
      <xsd:all minOccurs="0">
        <xsd:element name="locations" minOccurs="0">
          <xsd:complexType>
            <xsd:all minOccurs="0">
              <xsd:element name="name" type="xsd:string" minOccurs="0" maxOccurs="1"/>
              <xsd:element name="id" type="xsd:integer" minOccurs="0" maxOccurs="1"/>
            </xsd:all>
          </xsd:complexType>
        </xsd:element>
        <xsd:element name="pg_logdir_ls" minOccurs="0">
          <xsd:complexType>
            <xsd:all minOccurs="0">
              <xsd:element name="filename" type="xsd:string" minOccurs="0" maxOccurs="1"/>
              <xsd:element name="filetime" type="xsd:date" minOccurs="0" maxOccurs="1"/>
            </xsd:all>
          </xsd:complexType>
        </xsd:element>
        <xsd:element name="employees" minOccurs="0">
          <xsd:complexType>
            <xsd:all minOccurs="0">
              <xsd:element name="name" type="xsd:string" minOccurs="0" maxOccurs="1"/>
              <xsd:element name="birth_date" type="xsd:date" minOccurs="0" maxOccurs="1"/>
              <xsd:element name="salary" type="xsd:double" minOccurs="0" maxOccurs="1"/>
              <xsd:element name="sex" type="xsd:string" minOccurs="0" maxOccurs="1"/>
              <xsd:element name="surname" type="xsd:string" minOccurs="0" maxOccurs="1"/>
              <xsd:element name="department_id" type="xsd:integer" minOccurs="0" maxOccurs="1"/>
              <xsd:element name="info" type="xsd:string" minOccurs="0" maxOccurs="1"/>
              <xsd:element name="id" type="xsd:integer" minOccurs="0" maxOccurs="1"/>
            </xsd:all>
          </xsd:complexType>
        </xsd:element>
        <xsd:element name="departments" minOccurs="0">
          <xsd:complexType>
            <xsd:all minOccurs="0">
              <xsd:element name="name" type="xsd:string" minOccurs="0" maxOccurs="1"/>
              <xsd:element name="location_id" type="xsd:integer" minOccurs="0" maxOccurs="1"/>
              <xsd:element name="id" type="xsd:integer" minOccurs="0" maxOccurs="1"/>
            </xsd:all>
          </xsd:complexType>
        </xsd:element>
      </xsd:all>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>

During the creation of a metabase from the XSD, a native ODRA mechanism is used with a slight modification – the root element is omitted (it is required for a well-formed XML document, but it does not reflect an actual relational schema).

C.2.2.4 Type Mapping

When wrapping relational resources, reliable type mapping between the relational and object-oriented systems is a strong necessity. In the prototype implementation there is also an intermediate stage, corresponding to passing the schema description via XSD documents. All these levels (relational, XSD, object-oriented) use their specific primitive data types, which are managed by the wrapper.

The type mapping procedures are implemented programmatically and stored in Java classes (possibly the maps' definitions could be stored in files, but this is not the issue). The table below contains the corresponding SQL, XSD and ODRA data types. The default type applied for an undefined relational data type is string (due to the enormous heterogeneity between various RDBMSs, there still might be some types not covered by the prototype definitions). The string type is also assumed for relational data types currently not implemented in ODRA (including binary data types like BLOB).

Table 5 Type mapping between SQL, XSD and SBQL

SQL | XSD | SBQL
varchar, varchar2, char, text, memo, clob | string | string
integer, int, int2, int4, int8, serial, smallint, bigint, byte | integer | integer
number, float, real, numeric, decimal | double | real
bool, boolean, bit | boolean | boolean
date, timestamp | date | date
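As the maps are kept in code, the mapping of Table 5 reduces to a dictionary lookup with a string fallback. The sketch below only illustrates this – the class and method names are assumptions and do not reproduce the actual odra.wrapper classes:

import java.util.HashMap;
import java.util.Map;

public class TypeMappingSketch {
    // SQL-to-SBQL map populated according to Table 5
    private static final Map<String, String> SQL_TO_SBQL = new HashMap<>();
    static {
        for (String t : new String[] {"varchar", "varchar2", "char", "text", "memo", "clob"})
            SQL_TO_SBQL.put(t, "string");
        for (String t : new String[] {"integer", "int", "int2", "int4", "int8", "serial", "smallint", "bigint", "byte"})
            SQL_TO_SBQL.put(t, "integer");
        for (String t : new String[] {"number", "float", "real", "numeric", "decimal"})
            SQL_TO_SBQL.put(t, "real");
        for (String t : new String[] {"bool", "boolean", "bit"})
            SQL_TO_SBQL.put(t, "boolean");
        for (String t : new String[] {"date", "timestamp"})
            SQL_TO_SBQL.put(t, "date");
    }

    /** Unknown or unimplemented relational types (e.g. BLOB) default to string. */
    public static String toSbql(String sqlType) {
        return SQL_TO_SBQL.getOrDefault(sqlType.toLowerCase(), "string");
    }
}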
In the prototype implementation there is also an intermediate stage corresponding to passing the schema description via XSD documents. All these levels (relational, XSD, object-oriented) use their own primitive data types, which must be managed by the wrapper. The type mapping procedures are implemented programmatically and stored in Java classes (the map definitions could alternatively be kept in external files, but this is a secondary issue). Table 5 below contains the corresponding SQL, XSD and ODRA data types. The default type applied to an unrecognised relational data type is string (due to the considerable heterogeneity among RDBMSs, there still might be types not covered by the prototype definitions). The string type is also assumed for relational data types currently not implemented in ODRA (including binary data types such as BLOB).

Table 5 Type mapping between SQL, XSD and SBQL

SQL                                                   XSD       SBQL
varchar, varchar2, char, text, memo, clob             string    string
integer, int, int2, int4, int8, serial, smallint,
  bigint, byte                                        integer   integer
number, float, real, numeric, decimal                 double    real
bool, boolean, bit                                    boolean   boolean
date, timestamp                                       date      date
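To make the mapping of Table 5 concrete, a minimal lookup sketch is shown below. The class and method names are invented for illustration and do not reproduce the prototype's actual odra.wrapper classes.

import java.util.HashMap;
import java.util.Map;

// Illustrative SQL-to-SBQL type lookup following Table 5; unrecognised
// and unimplemented relational types fall back to string.
public class TypeMap {
    private static final Map<String, String> SQL_TO_SBQL = new HashMap<String, String>();
    static {
        for (String t : new String[] {"varchar", "varchar2", "char", "text", "memo", "clob"})
            SQL_TO_SBQL.put(t, "string");
        for (String t : new String[] {"integer", "int", "int2", "int4", "int8",
                "serial", "smallint", "bigint", "byte"})
            SQL_TO_SBQL.put(t, "integer");
        for (String t : new String[] {"number", "float", "real", "numeric", "decimal"})
            SQL_TO_SBQL.put(t, "real");
        for (String t : new String[] {"bool", "boolean", "bit"})
            SQL_TO_SBQL.put(t, "boolean");
        for (String t : new String[] {"date", "timestamp"})
            SQL_TO_SBQL.put(t, "date");
    }

    public static String toSbql(String sqlTypeName) {
        String sbql = SQL_TO_SBQL.get(sqlTypeName.toLowerCase());
        return sbql != null ? sbql : "string"; // default, e.g. for BLOB
    }
}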
C.2.3 Result Retrieval and Reconstruction

The wrapper server transforms a SQL query result into an XML string (compliant with the imported module metabase) and sends its serialised form to the client. The deserialised object received by the client is returned to the requesting wrapper instance. The only exception to this rule (no XML used) applies when the result is a primitive value (integer, double or boolean) coming from an aggregate function or from an imperative statement (where the result is the number of rows affected in the relational database, subchapters 6.1.2 and 6.1.3). Regardless of the relational result structure (the order of retrieved columns within a result row), before an XML document is formed, the result is analysed and grouped into tuple sub-results, where column results are grouped by their tables. Therefore ODRA results can be easily created with structures corresponding to the metabase. A sample XML document with a query result (truncated to two result tuples) is presented in Listing 12. The corresponding query is:

(Employee as e join e.worksIn.Department as d join d.isLocatedIn.Location as l).(e.surname, e.name, d.name, l.name);

Listing 12 Sample result XML document

<?xml version="1.0" encoding="UTF-8"?>
<sql:wrapper xmlns:sql="http://jacenty.kis.p.lodz.pl">
  <sql:tuple>
    <sql:employees>
      <sql:surname>Pietrzak</sql:surname>
      <sql:name>Piotr</sql:name>
    </sql:employees>
    <sql:departments>
      <sql:name>Security</sql:name>
    </sql:departments>
    <sql:locations>
      <sql:name>Warszawa</sql:name>
    </sql:locations>
  </sql:tuple>
  <sql:tuple>
    <sql:employees>
      <sql:surname>Wojciechowska</sql:surname>
      <sql:name>Agnieszka</sql:name>
    </sql:employees>
    <sql:departments>
      <sql:name>Retail</sql:name>
    </sql:departments>
    <sql:locations>
      <sql:name>Warszawa</sql:name>
    </sql:locations>
  </sql:tuple>
</sql:wrapper>

A result pattern is responsible for carrying type-checker signature information to the runtime environment (subchapter 6.1.5). In the implementation the pattern is a regular string (provided as the second parameter of the execsql expression) that can be easily parsed and transformed into a corresponding class instance. The ResultPattern class provides methods for creating SBQL results from objects built from XML documents. A sample result pattern string (resulting from the signature of the query used above) is presented in Listing 13.

Listing 13 Sample result pattern string

<0 | | | none | struct
  <1 $employees | $surname | _surname | none | binder 1>
  <1 $employees | $name | _name | none | binder 1>
  <1 $departments | $name | _name | none | binder 1>
  <1 $locations | $name | _name | none | binder 1>
0>

A result pattern description is stored between <i and i> markups, where i stands for the pattern nesting level. The top pattern level is 0, the first nested patterns are given level 1, etc. The mechanism allows an arbitrary nesting level; however, flat relational results use only a two-level structure (more complex cases can occur for named results, i.e. binders). The example pattern above (Listing 13) is a simple structure consisting of four fields, binders corresponding to the view seeds' names (prefixed with "_", Listing 2) built over the bottom-level relational names (prefixed with "$"). In general, a pattern can be arbitrarily complex. The first two strings in a pattern element denote the table name and the column name, where applicable (the external structure in the example does not require these values). The pattern description can contain the following additional information:
• an alias name for a result (binder),
• a dereference mode (none stands for no dereference; the other modes are string, boolean, integer, real and date),
• a result type (possible values are ref, struct, binder and value).

Based on a result pattern, an SBQL result is reconstructed from an XML document containing SQL results. As stated above, the wrapper uses a local auto-expandable store for temporary object creation; therefore the references returned by queries are volatile (valid only within the current session). Moreover, the results are deleted before the next query execution. Object keeping, locking and deleting should be controlled by some transaction mechanism, but unfortunately none is currently available. For keeping references to actual relational data, the wrapper can expand each SQL query with best row identifiers (i.e. primary keys or unique indices) and foreign key column values, regardless of the actual query semantics (of course, the final result does not contain these data). Because this information is available in the materialised result, a query result can be further processed by an SBQL program. Currently this functionality is turned off, but it can be freely (de)activated by toggling the WRAPPER_EXPAND_WITH_IDS and WRAPPER_EXPAND_WITH_REFS flags in odra.wrapper.net.Server.java.

C.3 Installation and Launching

The following subsections describe the wrapper configuration and launching procedures. Prior to these activities, the Java Virtual Machine 1.6.x (standard edition) or newer must be installed in the system; it can be downloaded from http://java.sun.com/javase/downloads/index.jsp.

C.3.1 CD Contents

The included CD contains the ODRA project snapshot (revision 2157 with stripped local SVN information) and the preconfigured MS Windows version of the Java Service Wrapper (described in subsection C.3.6.2 Service Launch). The current version of ODRA is available via SVN from svn://odra.pjwstk.edu.pl:2401/egovbus.
The file structure of the CD is as follows:
• odra.zip – the ODRA project source code and resources (launching is described in section C.4 Prototype Testing),
• jsw.win.zip – the preconfigured Java Service Wrapper (described in subsection C.3.6.2 Service Launch),
• ph.d.thesis.jacek_wislicki.pdf – this thesis text in PDF format.

The ODRA project contained in odra.zip needs to be unzipped to the local file system. The project can be edited under Eclipse 3.2 (verified); other Java IDEs might require some modifications. The organisation of the project file structure is as follows (internal Eclipse project configuration entries are skipped):

EGB
|_build     – compiled classes
|_conf      – runtime configuration files
|_dist      – ODRA precompiled distribution
|_lib       – project libraries
|_res       – resources, including sample batch files
|_src       – Java source code
|_tools     – parser libraries used on build
|_xml2xml   – XML2XML mapper distribution (irrelevant)
|_build.xml – Ant build file

The project can be run directly from the provided distribution files (as shown in section C.4 Prototype Testing); alternatively, it can be built from source with Apache Ant [222] based on the build.xml configuration.

C.3.2 Test Schemata Generation

Any relational schema can be used with the prototype; still, it is possible to create the test schemata (described in subchapter 6.2.1) for prototype testing. The resources are available in res/wrapper of the included ODRA project. First, a database schema should be created manually according to schema.sql. The SQL used in the file is rather universal, but there might be some problems on particular RDBMSs (e.g., unrecognised data types). Unfortunately, the author was not able to provide an appropriate automated application (e.g., Torque still does not support indices).

C.3.3 Connection Configuration

The connection configuration file is connection.properties (a standard Torque configuration file); a sample can be found in the project /conf directory (Listing 14).
Listing 14 Sample contents of connection properties

torque.database.default = postgres_employees

#configuration for the postgres database (employees)
torque.database.postgres_employees.adapter = postgresql
torque.dsfactory.postgres_employees.factory = org.apache.torque.dsfactory.SharedPoolDataSourceFactory
torque.dsfactory.postgres_employees.connection.driver = org.postgresql.Driver
torque.dsfactory.postgres_employees.connection.url = jdbc:postgresql://localhost:5432/wrapper
torque.dsfactory.postgres_employees.connection.user = wrapper
torque.dsfactory.postgres_employees.connection.password = wrapper

#configuration for the firebird database (employees)
torque.database.firebird_employees.adapter = firebird
torque.dsfactory.firebird_employees.factory = org.apache.torque.dsfactory.SharedPoolDataSourceFactory
torque.dsfactory.firebird_employees.connection.driver = org.firebirdsql.jdbc.FBDriver
torque.dsfactory.firebird_employees.connection.url = jdbc:firebirdsql:localhost/3050:c:/tmp/wrapper.gdb
torque.dsfactory.firebird_employees.connection.user = wrapper
torque.dsfactory.firebird_employees.connection.password = wrapper

#configuration for the postgres database (cars)
torque.database.postgres_cars.adapter = postgresql
torque.dsfactory.postgres_cars.factory = org.apache.torque.dsfactory.SharedPoolDataSourceFactory
torque.dsfactory.postgres_cars.connection.driver = org.postgresql.Driver
torque.dsfactory.postgres_cars.connection.url = jdbc:postgresql://localhost:5432/wrapper2
torque.dsfactory.postgres_cars.connection.user = wrapper
torque.dsfactory.postgres_cars.connection.password = wrapper

#configuration for the ms sql database (SD-SQL)
torque.database.sdsql.adapter = mssql
torque.dsfactory.sdsql.factory = org.apache.torque.dsfactory.SharedPoolDataSourceFactory
torque.dsfactory.sdsql.connection.driver = net.sourceforge.jtds.jdbc.Driver
torque.dsfactory.sdsql.connection.url = jdbc:jtds:sqlserver://212.191.89.51:1433/SkyServer
torque.dsfactory.sdsql.connection.user = sa
torque.dsfactory.sdsql.connection.password =

The sample file contains four predefined data sources (named postgres_employees, firebird_employees, postgres_cars and sdsql) for different RDBMSs and schemata. The same configuration file can be used for different wrapped databases; however, a separate server must be started for each resource. The torque.database.default property defines the default database used if none is specified as an input to an application (e.g., the wrapper server). The meanings of the other properties are as follows (the xxx placeholder should be substituted with a unique data source name that is further used for pointing at the resource):
• torque.database.xxx.adapter – a JDBC adapter/driver name,
• torque.dsfactory.xxx.factory – a data source factory class,
• torque.dsfactory.xxx.connection.driver – a JDBC driver class,
• torque.dsfactory.xxx.connection.url – a JDBC resource-dependent connection URL,
• torque.dsfactory.xxx.connection.user – a database user name,
• torque.dsfactory.xxx.connection.password – a database user password.

A correct configuration entered into connection.properties is required for the next wrapper launching steps.
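As a side note, the following minimal sketch shows how such a configuration can be used to obtain a JDBC connection through the Torque runtime (assuming the Torque 3.x API bundled with the project; file and data source names as in Listing 14, error handling omitted):

import java.sql.Connection;
import org.apache.torque.Torque;

// Sketch: initialise Torque from connection.properties and obtain a
// connection to one of the configured data sources.
public class ConnectionTest {
    public static void main(String[] args) throws Exception {
        Torque.init("conf/connection.properties");
        // Named data source; Torque.getConnection() without arguments
        // would use the torque.database.default entry instead.
        Connection con = Torque.getConnection("postgres_employees");
        System.out.println("connected: " + !con.isClosed());
        con.close();
        Torque.shutdown();
    }
}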
C.3.4 Test Data Population

Once the relational schemata are created and their configuration entered into the connection.properties file, they can be populated with sample data (subchapter 7.1 Relational Test Data). In order to load the data into the database, run odra.wrapper.misc.testschema.Inserter. This application inserts data according to the schema integrity constraints and the assumed distributions. The inserter takes two parameters: mode – the target schema ("employees" or "cars" values allowed), and number_of_employees – the number of employee records (or corresponding cars). A sample startup command is:

java odra.wrapper.misc.testschema.Inserter employees 100

A sample output from the inserter is shown in Listing 15 (any previous data are deleted first):

Listing 15 Sample inserter output

Wrapper test data population started...
connected
100 records deleted from employees
8 records deleted from departments
7 records deleted from locations
7 locations created
8 departments created
100 employees created
Wrapper test data population finished in 1594 ms...

The inserter application automatically connects to the resource defined as the default one in connection.properties (the torque.database.default value).

C.3.5 Schema Description Generation

This step is necessary for the wrapper operation, as it provides the wrapper server with the description of the wrapped relational database (details in subsection C.2.2.1 Internal Wrapper Server Model). Once connection.properties contains the record defined for a wrapped relational schema, the schema generator odra.wrapper.generator.SchemaGeneratorApp can be launched. The application can run without parameters (a connection.properties file is then searched for in the application /conf directory and the default database name, torque.database.default, is used). One can also specify an optional parameter with a configuration file path; if it is specified, a database name can also be provided as the second parameter. A sample startup command is:

java odra.wrapper.generator.SchemaGeneratorApp conf/connection.properties postgres_employees

The schema generator's standard output is shown below (Listing 16):

Listing 16 Sample schema generator output

Schema generation started...
Schema generation finished in 5875 ms...

As a result, a schema description file is created in the application's /conf directory. The file name follows the pattern <dbname>-schema.generated.xml, where <dbname> is the database name specified as an application startup parameter or the default one from the properties file (e.g. postgres_employees-schema.generated.xml).

C.3.6 Server

The server (odra.wrapper.net.Server) is a multithreaded application (a separate parallel thread is invoked for each client request). It can be launched as a standalone application or as a system service.

C.3.6.1 Standalone Launch

The standalone launch should not be used in a production environment; in order to start the server as a system service, read the instructions in the next subsection. If the server is launched without startup parameters, it searches for the connection.properties file and schema description XML documents in the application /conf directory and uses the default database name declared in this file. The other default values are the port number to listen on (specified as 2000 with wrapper.net.Server.WRAPPER_SERVER_PORT) and the verbose mode (specified as true with wrapper.net.Server.WRAPPER_SERVER_VERBOSE).
If one needs to override these values, the following syntax is used:

odra.wrapper.net.Server -Ddbname -Vfalse -P2001 -C/path/to/config/

All the parameters are optional and their order is arbitrary:
• -D prefixes a database name (overriding the default one from the properties file),
• -V toggles the verbose mode (true/false),
• -P specifies a listener port,
• -C specifies a path to the server configuration files.

A path denoted with the -C parameter must be a valid directory where all the configuration files are stored, including connection.properties and schema description XML document(s). The server output at a successful startup is shown in Listing 17:

Listing 17 Wrapper server startup output

Database model successfully build from schema in 'F:/eclipse.projects/EGB/conf/postgres_employees-schema.generated.xml'
SBQL wrapper listener started in JDBC mode on port 2000...
SBQL wrapper listener is running under Java Service Wrapper
Big thanks to Tanuki Software <http://wrapper.tanukisoftware.org>

C.3.6.2 Service Launch

Running the server as a system service is realised with the Java Service Wrapper (JSW) [223]. The JSW can be downloaded as binaries or source code. It can be run on different platforms (e.g., MS Windows, Linux, Solaris, MacOS X) and the appropriate version must be installed in a system (the binary download should be enough). The following instructions refer to the MS Windows environment (they are the same on Linux). Detailed descriptions and examples of installation and configuration procedures on various platforms are available at the JSW web site. The main JSW configuration is defined in $JSW_HOME/conf/wrapper.conf ($JSW_HOME denotes the home directory of the JSW installation). An example file is shown in Listing 18.

Listing 18 Sample contents of wrapper.conf

#********************************************************************
# TestWrapper Properties
#
# NOTE - Please use src/conf/wrapper.conf.in as a template for your
#  own application rather than the values used for the
#  TestWrapper sample.
#********************************************************************

# Java Application
wrapper.java.command=java

# Java Main class.  This class must implement the WrapperListener interface
#  or guarantee that the WrapperManager class is initialized.  Helper
#  classes are provided to do this for you.  See the Integration section
#  of the documentation for details.
wrapper.java.mainclass=org.tanukisoftware.wrapper.WrapperSimpleApp

# Java Classpath (include wrapper.jar).  Add class path elements as
#  needed starting from 1
wrapper.java.classpath.1=../lib/wrapper.jar
wrapper.java.classpath.2=F:/eclipse.projects/EGB/dist/lib/odra-wrapper-1.0-dev.jar
wrapper.java.classpath.3=F:/eclipse.projects/EGB/dist/lib/odra-commons-1.0-dev.jar
wrapper.java.classpath.4=F:/eclipse.projects/EGB/lib/postgresql-8.1-405.jdbc3.jar
wrapper.java.classpath.5=F:/eclipse.projects/EGB/lib/jaybird-full-2.1.1.jar
wrapper.java.classpath.6=F:/eclipse.projects/EGB/lib/jtds-1.2.jar
wrapper.java.classpath.7=F:/eclipse.projects/EGB/lib/jdom.jar
wrapper.java.classpath.8=F:/eclipse.projects/EGB/lib/zql.jar
wrapper.java.classpath.9=F:/eclipse.projects/EGB/lib/commons-configuration-1.1.jar
wrapper.java.classpath.10=F:/eclipse.projects/EGB/lib/commons-collections-3.1.jar
wrapper.java.classpath.11=F:/eclipse.projects/EGB/lib/commons-lang-2.1.jar
wrapper.java.classpath.12=F:/eclipse.projects/EGB/lib/commons-logging-1.0.4.jar

# Java Library Path (location of Wrapper.DLL or libwrapper.so)
wrapper.java.library.path.1=../lib

# Java Additional Parameters
wrapper.java.additional.1=-ea

# Initial Java Heap Size (in MB)
#wrapper.java.initmemory=3

# Maximum Java Heap Size (in MB)
wrapper.java.maxmemory=512

# Application parameters.  Add parameters as needed starting from 1
wrapper.app.parameter.1=odra.wrapper.net.Server
wrapper.app.parameter.2=-C"F:/eclipse.projects/EGB/conf/"
wrapper.app.parameter.2.stripquotes=TRUE
#wrapper.app.parameter.3=-Dpostgres_employees
#wrapper.app.parameter.4=-P2000
#wrapper.app.parameter.5=-Vtrue

#********************************************************************
# Wrapper Logging Properties
#********************************************************************
# Format of output for the console.  (See docs for formats)
wrapper.console.format=PM

# Log Level for console output.  (See docs for log levels)
wrapper.console.loglevel=INFO

# Log file to use for wrapper output logging.
wrapper.logfile=../logs/wrapper.log

# Format of output for the log file.  (See docs for formats)
wrapper.logfile.format=LPTM

# Log Level for log file output.  (See docs for log levels)
wrapper.logfile.loglevel=INFO

# Maximum size that the log file will be allowed to grow to before
#  the log is rolled.  Size is specified in bytes.  The default value
#  of 0, disables log rolling.  May abbreviate with the 'k' (kb) or
#  'm' (mb) suffix.  For example: 10m = 10 megabytes.
wrapper.logfile.maxsize=1m

# Maximum number of rolled log files which will be allowed before old
#  files are deleted.  The default value of 0 implies no limit.
wrapper.logfile.maxfiles=10

# Log Level for sys/event log output.  (See docs for log levels)
wrapper.syslog.loglevel=NONE

#********************************************************************
# Wrapper Windows Properties
#********************************************************************
# Title to use when running as a console
wrapper.console.title=ODRA wrapper server

#********************************************************************
# Wrapper Windows NT/2000/XP Service Properties
#********************************************************************
# WARNING - Do not modify any of these properties when an application
#  using this configuration file has been installed as a service.
#  Please uninstall the service before modifying this section.  The
#  service can then be reinstalled.
# Name of the service
wrapper.ntservice.name=ODRAwrapper 1

# Display name of the service
wrapper.ntservice.displayname=ODRA wrapper server 1

# Description of the service
wrapper.ntservice.description=ODRA relational database wrapper server 1

# Service dependencies.  Add dependencies as needed starting from 1
wrapper.ntservice.dependency.1=

# Mode in which the service is installed.  AUTO_START or DEMAND_START
wrapper.ntservice.starttype=AUTO_START

# Allow the service to interact with the desktop.
wrapper.ntservice.interactive=false

The most important properties in wrapper.conf are:
• wrapper.java.command – which JVM to use (depending on a system configuration, one might need to specify a full path to the java program),
• wrapper.java.mainclass – a JSW integration method (with the value specified in the listing above, no JSW-specific implementation is required; do not modify this one),
• wrapper.java.classpath.N – Java classpath elements (do not modify the first classpath element, as it denotes the JSW JAR location; the other elements refer to libraries used by the ODRA wrapper server, including JDBC drivers),
• wrapper.java.additional.N – JVM startup parameters (in the example only -ea is used, for enabling assertions),
• wrapper.java.maxmemory – the JVM heap size; it would probably need to be more than the default 64 MB for real-life databases,
• wrapper.app.parameter.1 – the ODRA wrapper server main class (do not modify this one),
• wrapper.app.parameter.2 – a path to the ODRA wrapper server configuration files directory (i.e. connection.properties and <dbname>-schema.generated.xml) passed as a server startup parameter,
• wrapper.app.parameter.2.stripquotes – important when a parameter value contains extra quotes,
• wrapper.app.parameter.3 – a database name passed as a server startup parameter,
• wrapper.app.parameter.4 – a server listener port passed as a server startup parameter,
• wrapper.app.parameter.5 – a server verbose mode passed as a server startup parameter,
• wrapper.logfile.maxsize – the maximum size of a single log file before it is rolled,
• wrapper.logfile.maxfiles – the maximum number of log files kept before the old ones are deleted.

Notice that wrapper.app.parameter.[2...5] correspond to the server application startup parameters described above. They are optional and their order is arbitrary. Descriptions of the other configuration properties are available at the JSW web site.

In order to test a configuration, one can run $JSW_HOME/bin/test.bat. The JSW is then launched as a standalone application and runs the ODRA wrapper server (any misconfiguration can be easily detected). If the test succeeds, the JSW is ready to be installed as a system service. The service is installed with install.bat and uninstalled with uninstall.bat. A sample preconfigured JSW installation for MS Windows can be downloaded from http://jacenty.kis.p.lodz.pl/jsw.win.zip (only some paths need to be adjusted); it is also stored on the included CD.

C.3.7 Client

The client (odra.wrapper.net.Client) cannot be launched directly – its instance is a component of odra.wrapper.Wrapper and is used in the background. If one needs a verbose client (with a console output), set the wrapper.verbose property to true in /conf/odra-server.properties. The wrapper client output is then displayed in the standard ODRA CLI console.
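For reference, the relevant entry might look as follows (an assumed fragment; the remaining contents of odra-server.properties are omitted here):

# fragment of /conf/odra-server.properties
wrapper.verbose = true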
C.4 Prototype Testing

The ODRA server and the client (CLI) can be started up with /dist/easystart.bat(sh). The default database is created, the server is started, and the CLI console is opened and ready for input. If the wrapper server is not running yet, it can be started as an application with /dist/wrapper-server.bat(sh) (the default parameters assume the postgres_employees schema and listening on port 2000), provided that the schema description file is available in /dist/conf. A wrapper module is added with the following command:

add module <modulename> as wrapper on <host>:<port>

where the module name and the wrapper server listener socket have to be specified. A sample command is:

add module test as wrapper on localhost:2000

The module is added (provided the wrapper server is available on the socket) and the wrapper is instantiated (the procedures concerning the schema import and the metabase creation are executed in the background). Now switch to the wrapper module by executing cm test. The test module contains only the primary wrapper views available for querying. The module contents can be retrieved with ls. A sample CLI session is presented in Listing 19. The list of available CLI commands can be displayed with the help command.

Listing 19 Sample CLI session

Welcome to ODRA (J2)!
admin> add module test as wrapper on localhost:2000
admin> cm test
admin.test> compile .
admin.test> ls
D $employees
D $departments
D $locations
V employeesDef
VP employees
V locationsDef
VP locations
V departmentsDef
VP departments
admin.test> deref((employees where salary = 500).(surname, name)) as emp;
<?xml version="1.0" encoding="windows-1250"?>
<RESULT>
<emp>Pawłowski Stanisław</emp>
<emp>Kaczmarek Anna</emp>
<emp>Zając Zofia</emp>
</RESULT>
admin.test>

C.4.1 Optimisation Testing

For rewriting and optimisation testing use, respectively:

explain optimization viewrewrite | wrapperrewrite : <query>;

or

explain optimization viewrewrite | wrapperoptimize : <query>;

Other optimisation types can be switched off; only viewrewrite is essential if a query is based on wrapper views. For evaluation tests it is necessary to set a current optimisation sequence for the session. The syntax is shown below:

set optimization none | viewrewrite | wrapperoptimize

The none option is necessary to reset (clear) a previous sequence (by default it is empty, as the ODRA optimisers are still under construction). In order to check the current optimisation sequence, use:

show optimization

Another thing to configure is the test mode. There are four modes available:
• off – no tests are performed (default),
• plain – a query is optimised with the current optimisation sequence and the optimisation results (measured times) are prepended to its actual result,
• compare – no actual query result is retrieved; only unoptimised and optimised executions are compared ("unoptimised" means that simple rewriting is applied, as otherwise a query would not be evaluable via a wrapper),
• comparesimple – the same as compare, but dereferenced results are compared; a full comparison would reveal errors for wrapper queries, as their results consist of different references (each query creates its own set of temporary objects).

A test mode is set as an ordinary CLI variable, e.g.:

set test off

Similarly, it can be displayed with:

show test

Benchmarking optimisation results is available with the following syntax:

benchmark <n> <query>;

where <n> stands for the number of repeats. The results are written to CSV files in the current CLI directory.
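Putting these commands together, a typical optimisation-testing session might proceed as sketched below (the command forms follow the syntax summaries above and the query comes from Listing 19; the exact way optimisation sequences accumulate is a detail of the actual CLI):

set test compare
set optimization none
set optimization viewrewrite | wrapperoptimize
show optimization
benchmark 10 deref((employees where salary = 500).(surname, name)) as emp;
set test off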
C.4.2 Sample Batch Files

Sample batch files for wrapper testing are provided in the project's /res/wrapper/batch directory. The files require two wrapper servers running for the employees and cars test schemata (subchapter 6.2.1) on ports 2000 and 2001 (the port numbers can be changed).
• init.cli – creates two wrapper modules named wrapper1 and wrapper2 (one for each schema), creates a new module test, imports the wrapper modules and creates the views presented in Listing 2,
• explain-rewrite.cli – shows rewritten sample queries,
• explain-optimize.cli – shows optimised sample queries,
• execute-rewrite.cli – executes sample queries after simple rewriting,
• execute-optimize.cli – executes sample queries after optimisation,
• compare.cli – compares the results of execution of rewritten and optimised sample queries,
• benchmark10times.cli – performs a 10-repeat benchmark of the sample queries,
• benchmark10times.independent.cli – performs a 10-repeat benchmark of the sample multi-wrapper queries with SBQL optimisers,
• mixed.cli – executes test mixed queries,
• update.cli – executes test updates.

Additionally, the views presented in subchapter 6.3 Sample Use Cases are available from the batch files in /res/wrapper/batch/demo (they require the views created by init.cli). The syntax for running batch files is as follows:

batch <path/to/batch>

The related commands are cd and pwd, corresponding to the same operating system commands. The batch and cd commands interpret both absolute and relative paths.

Index of Figures

Fig. 1 Mediation system architecture .......... 25
Fig. 2 Distributed mediation in Amos II [26] .......... 29
Fig. 3 eGov-Bus virtual repository architecture [205] .......... 43
Fig. 4 Query flow through a RDBMS [124] .......... 48
Fig. 5 Abstract relational query optimiser architecture [124] .......... 49
Fig. 6 Possible syntax trees for a sample query for selections and projections .......... 52
Fig. 7 Possible syntax trees for a sample query for cross products .......... 53
Fig. 8 Possible syntax trees for a sample query for tree shapes .......... 55
Fig. 9 Sample SBQL query syntax tree for example 1 .......... 68
Fig. 10 Sample SBQL query syntax tree for example 2 .......... 69
Fig. 11 Sample SBQL query syntax tree for example 3 .......... 70
Fig. 12 Sample SBQL query syntax tree for example 4 .......... 71
Fig. 13 Sample SBQL query syntax tree for example 5 .......... 72
Fig. 14 Sample SBQL query syntax tree for example 6 .......... 73
Fig. 15 Sample SBQL query syntax tree for example 7 .......... 74
Fig. 16 Architecture of query processing in SBQL [179] .......... 78
Fig. 17 Virtual repository general architecture .......... 86
Fig. 18 Schema integration in the virtual repository .......... 87
Fig. 19 Query processing schema .......... 88
Fig. 20 Relational schema for the conceptual example .......... 91
Fig. 21 Object-oriented view-based schema for the conceptual example .......... 92
Fig. 22 Conceptual example input query syntax tree .......... 94
Fig. 23 Conceptual example query syntax tree with dereferences .......... 95
Fig. 24 Conceptual example query syntax tree after removing auxiliary names .......... 96
Fig. 25 Conceptual example query syntax tree after SBQL optimisation .......... 97
Fig. 26 Conceptual example query syntax tree after wrapper optimisation .......... 98
Fig. 27 Query analysis algorithm .......... 101
Fig. 28 Selecting query processing algorithm .......... 102
Fig. 29 Deleting query processing algorithm .......... 104
Fig. 30 Updating query processing algorithm .......... 105
Fig. 31 SQL generation and processing algorithm for mixed queries .......... 107
Fig. 32 The "employees" test relational schema .......... 109
Fig. 33 The "cars" test relational schema .......... 109
Fig. 34 The resulting object-oriented schema .......... 110
Fig. 35 Raw (parsed) query syntax tree for example 1 .......... 117
Fig. 36 Typechecked query syntax tree for example 1 .......... 117
Fig. 37 View-rewritten query syntax tree for example 1 .......... 118
Fig. 38 Optimised query syntax tree for example 1 .......... 118
Fig. 39 Simply-rewritten query syntax tree for example 1 .......... 120
Fig. 40 Raw (parsed) query syntax tree for example 2 .......... 121
Fig. 41 Typechecked query syntax tree for example 2 .......... 121
Fig. 42 View-rewritten query syntax tree for example 2 .......... 122
Fig. 43 Optimised query syntax tree for example 2 .......... 122
Fig. 44 Simply-rewritten query syntax tree for example 2 .......... 124
Fig. 45 Raw (parsed) query syntax tree for example 1 (imperative query) .......... 130
Fig. 46 Typechecked query syntax tree for example 1 (imperative query) .......... 130
Fig. 47 View-rewritten query syntax tree for example 1 (imperative query) .......... 131
Fig. 48 Optimised query syntax tree for example 1 (imperative query) .......... 131
Fig. 49 Raw (parsed) query syntax tree for example 2 (imperative query) .......... 132
Fig. 50 Typechecked query syntax tree for example 2 (imperative query) .......... 132
Fig. 51 View-rewritten query syntax tree for example 2 (imperative query) .......... 133
Fig. 52 Optimised query syntax tree for example 2 (imperative query) .......... 133
Fig. 53 Raw (parsed) query syntax tree for example 3 (imperative query) .......... 134
Fig. 54 Typechecked query syntax tree for example 3 (imperative query) .......... 134
Fig. 55 View-rewritten query syntax tree for example 3 (imperative query) .......... 135
Fig. 56 Optimised query syntax tree for example 3 (imperative query) .......... 136
Fig. 57 Raw (parsed) query syntax tree for example 1 (mixed query) .......... 138
Fig. 58 Typechecked query syntax tree for example 1 (mixed query) .......... 138
Fig. 59 View-rewritten query syntax tree for example 1 (mixed query) .......... 139
Fig. 60 Optimised query syntax tree for example 1 (mixed query) .......... 139
Fig. 61 Raw (parsed) query syntax tree for example 3 (mixed query) .......... 140
Fig. 62 Typechecked (parsed) query syntax tree for example 3 (mixed query) .......... 140
Fig. 63 View-rewritten query syntax tree for example 3 (mixed query) .......... 141
Fig. 64 Optimised query syntax tree for example 3 (mixed query) .......... 141
Fig. 65 Raw query syntax tree for example 1 (multi-wrapper query) .......... 145
Fig. 66 Typechecked query syntax tree for example 1 (multi-wrapper query) .......... 146
Fig. 67 SBQL-optimised query syntax tree for example 1 (multi-wrapper query) .......... 147
Fig. 68 Raw query syntax tree for example 2 (multi-wrapper query) .......... 148
Fig. 69 Typechecked query syntax tree for example 2 (multi-wrapper query) .......... 149
Fig. 70 SBQL-optimised query syntax tree for example 2 (multi-wrapper query) .......... 150
Fig. 71 Department's location distribution .......... 157
Fig. 72 Employee's department distribution .......... 157
Fig. 73 Employee's salary distribution .......... 157
Fig. 74 Employee's info's length distribution .......... 157
Fig. 75 Female employee's first name distribution .......... 158
Fig. 76 Female employee's surname distribution .......... 158
Fig. 77 Male employee's first name distribution .......... 158
Fig. 78 Male employee's surname distribution .......... 159
Fig. 79 Evaluation times and optimisation gain for query 1 .......... 160
Fig. 80 Evaluation times and optimisation gain for query 2 .......... 160
Fig. 81 Evaluation times and optimisation gain for query 3 .......... 161
Fig. 82 Evaluation times and optimisation gain for query 4 .......... 161
Fig. 83 Evaluation times and optimisation gain for query 5 .......... 162
Fig. 84 Evaluation times and optimisation gain for query 6 .......... 162
Fig. 85 Evaluation times and optimisation gain for query 7 .......... 163
Fig. 86 Evaluation times and optimisation gain for query 8 .......... 163
Fig. 87 Evaluation times and optimisation gain for query 9 .......... 164
Fig. 88 Evaluation times and optimisation gain for query 10 .......... 164
Fig. 89 Evaluation times and optimisation gain for query 11 .......... 165
Fig. 90 Evaluation times and optimisation gain for query 12 .......... 165
Fig. 91 Evaluation times and optimisation gain for query 13 .......... 166
Fig. 92 Evaluation times and optimisation gain for query 14 .......... 166
Fig. 93 Evaluation times and optimisation gain for query 15 .......... 167
Fig. 94 Evaluation times and optimisation gain for query 16 .......... 167
Fig. 95 Evaluation times and optimisation gain for query 17 .......... 168
Fig. 96 Evaluation times and optimisation gain for query 18 .......... 168
Fig. 97 Evaluation times and optimisation gain for query 19 .......... 169
Fig. 98 Evaluation times and optimisation gain for query 20 .......... 169
Fig. 99 Evaluation times and optimisation gain for query 21 .......... 170
Fig. 100 Evaluation times and optimisation gain for query 22 .......... 171
Fig. 101 Evaluation times and optimisation gain for query 23 .......... 171
Fig. 102 Evaluation times and optimisation gain for query 24 .......... 172
Fig. 103 Evaluation times and optimisation gain for query 25 .......... 172
Fig. 104 Evaluation times and optimisation gain for query 26 .......... 173
Fig. 105 Evaluation times and optimisation gain for query 27 .......... 173
Fig. 106 Evaluation times and optimisation gain for query 28 .......... 174
Fig. 107 Evaluation times and optimisation gain for query 29 .......... 174
Fig. 108 Evaluation times and optimisation gain for query 30 .......... 175
Fig. 109 Evaluation times and optimisation gain for query 31 .......... 175
Fig. 110 Evaluation times and optimisation gain for query 32 .......... 176
Fig. 111 Evaluation times and optimisation gain for query 33 .......... 176
Fig. 112 Evaluation times and optimisation gain for query 34 .......... 177
Fig. 113 Evaluation times and optimisation gain for query 35 .......... 178
Fig. 114 Evaluation times and optimisation gain for query 36 .......... 178
Fig. 115 Evaluation times and optimisation gain for query 37 .......... 179
Fig. 116 Evaluation times and optimisation gain for query 38 .......... 179
Fig. 117 Evaluation times and optimisation gain for query 39 .......... 180
Fig. 118 Evaluation times and optimisation gain for query 40 .......... 180
Fig. 119 Evaluation times and optimisation gain for query 41 .......... 181
Fig. 120 Evaluation times and optimisation gain for query 1 (SBQL optimisation) .......... 182
Fig. 121 Evaluation times and optimisation gain for query 2 (SBQL optimisation) .......... 182
Fig. 122 Wrapper architecture .......... 195
Fig. 123 Wrapped legacy relational schema .......... 200
Fig. 124 Lowest-level object-oriented wrapper schema .......... 200
Fig. 125 Primary wrapper views .......... 201

Index of Listings

Listing 1 Simplified updateable views for the conceptual example .......... 92
Listing 2 Code of views for the test schemata .......... 110
Listing 3 SBQL view code for retrieving "rich employees" .......... 151
Listing 4 SBQL view code for retrieving employees with their departments .......... 152
Listing 5 SBQL view code for retrieving employees with their cars .......... 153
Listing 6 SBQL view code for retrieving rich employees with white cars .......... 154
Listing 7 Client-server communication example (server side) .......... 196
Listing 8 Client-server communication example (client side) .......... 196
Listing 9 Contents of relational-schema.dtd .......... 201
Listing 10 Sample schema description .......... 202
Listing 11 Sample XSD for metabase .......... 204
Listing 12 Sample result XML document .......... 207
Listing 13 Sample result pattern string .......... 208
Listing 14 Sample contents of connection properties .......... 210
Listing 15 Sample inserter output .......... 212
Listing 16 Sample schema generator output .......... 212
Listing 17 Wrapper server startup output .......... 213
Listing 18 Sample contents of wrapper.conf .......... 214
Listing 19 Sample CLI session .......... 218

Index of Tables

Table 1 Optimisation testbench configuration .......... 156
Table 2 Test data for cars .......... 159
Table 3 Wrapper protocol server commands and messages .......... 197
Table 4 Wrapper protocol client commands and messages .......... 198
Table 5 Type mapping between SQL, XSD and SBQL .......... 206

Bibliography
1 Kuliberda K., Wiślicki J., Adamus R., Subieta K.: Object-Oriented Wrapper for Relational Databases in the Data Grid Architecture, On the Move to Meaningful Internet Systems 2005 Proceedings, LNCS 3762, Springer 2005, pp. 367-376
2 Wiślicki J., Kuliberda K., Adamus R., Subieta K.: Relational to Object-Oriented Database Wrapper Solution in the Data Grid Architecture with Query Optimization Issues, IBM Research Report RC23820 (W0512-007), Proceedings SOBPI'05 (ICSOC'05), Amsterdam, Holland, 2005, pp. 30-43
3 Adamus R., Kuliberda K., Wiślicki J., Subieta K.: Wrapping Relational Data Model to Object-Oriented Database in the Data Grid Architecture, SOFSEM SRF 2006 Proceedings, Merin, Czech Republic, 2006, pp. 54-63
4 Wiślicki J., Kuliberda K., Kowalski T., Adamus R.: Integration of Relational Resources in an Object-Oriented Data Grid, SiS 2006 Proceedings, Łódź, Poland, 2006, pp. 277-280
5 Wiślicki J., Kuliberda K., Kowalski T., Adamus R.: Implementation of a Relational-to-Object Data Wrapper Back-end for a Data Grid, SiS 2006 Proceedings, Łódź, Poland, 2006, pp. 285-288
6 Wiślicki J., Kuliberda K., Kowalski T., Adamus R.: Integration of relational resources in an object-oriented data grid with an example, Journal of Applied Computer Science (2006), Vol. 14 No. 2, Łódź, Poland, 2006, pp. 91-108
7 Wiślicki J., Kuliberda K., Adamus R., Subieta K.: Relational to object-oriented database wrapper solution in the data grid architecture with query optimization issues, International Journal of Business Process Integration and Management (IJBPIM), 2007/2008 (to appear)
8 Atkinson M., Bancilhon F., DeWitt D., Dittrich K., Maier D., Zdonik S.: The Object-Oriented Database System Manifesto, Proc. of 1st Intl. Conf. on Deductive and Object Oriented Databases 89, Kyoto, Japan, 1989, pp. 40-57
9 Wiederhold G.: Mediators in the Architecture of Future Information Systems, IEEE Computer, 25(3), 1992, pp. 38-49
10 Bergamaschi S., Garuti A., Sartori C., Venuta A.: Object Wrapper: An Object-Oriented Interface for Relational Databases, EUROMICRO 1997, pp. 41-46
11 Subieta K.: Obiektowość w bazach danych: koncepcje, nadzieje i fakty. Część 3. Obiektowość kontra model relacyjny, Informatyka, Marzec 1998, pp. 26-33
12 Object-Relational Impedance Mismatch, http://www.agiledata.org/essays/impedanceMismatch.html
13 Neward T.: The Vietnam of Computer Science, http://blogs.tedneward.com/2006/06/26/The+Vietnam+Of+Computer+Science.aspx
14 ADO.NET, http://msdn2.microsoft.com/en-us/data/aa937699.aspx
15 Stonebraker M.: Future Trends in Database Systems, IEEE Data Engineering Conf., Los Angeles, 1988
16 Ahmed R., Albert J., Du W., Kent W., Litwin W., Shan M-C.: An overview of Pegasus, In: Proceedings of the Workshop on Interoperability in Multidatabase Systems, RIDE-IMS'93, Vienna, Austria, 1993
17 Albert J., Ahmed R., Ketabchi M., Kent W., Shan M-C.: Automatic importation of relational schemas in Pegasus, In: Proceedings of the Workshop on Interoperability in Multidatabase Systems, RIDE-IMS'93, Vienna, Austria, 1993
18 Fishman D.H. et al.: Overview of the Iris DBMS, Object-Oriented Concepts, Databases, and Applications, Kim and Lochovsky, editors, Addison-Wesley, 1989
19 Important Features of Iris OSQL, Computer Standards & Interfaces 13 (1991) (OODB Standardization Workshop, Atlantic City, May 1990)
20 Ahmed R., DeSmedt P., Du W., Kent W., Ketabchi M., Litwin W., Rafii A., Shan M-C.: Using an Object Model in Pegasus to Integrate Heterogeneous Data, April 1991
21 Fahl G., Risch T.: Query processing over object views of relational data, The VLDB Journal (1997) 6: 261-281
22 Saltor F., Castellanos M., Garcia-Solaco M.: Suitability of data models as canonical models for federated databases, SIGMOD RECORD 20:4, 1991
23 Fahl G., Risch T., Sköld M.: AMOS – An architecture for active mediators, In: Proc. Int. Workshop on Next Generation Information Technologies and Systems, NGITS '93, Haifa, Israel, 1993
24 Jarke M., Koch J.: Query optimization in database systems, Comput. Surv. 16:2, 1984
25 Amos II, http://user.it.uu.se/~udbl/amos/
26 Risch T., Josifovski V., Katchaounov T.: Functional data integration in a distributed mediator system, In Functional Approach to Computing with Data, P. Gray, L. Kerschberg, P. King, and A. Poulovassilis, Eds., Springer, 2003
27 Josifovski V., Risch T.: Query Decomposition for a Distributed Object-Oriented Mediator System, Distributed and Parallel Databases J., 11(3), pp. 307-336, Kluwer, May 2002
28 Litwin W., Risch T.: Main Memory Oriented Optimization of OO Queries using Typed Datalog with Foreign Predicates, IEEE Transactions on Knowledge and Data Engineering, 4(6), 517-528, 1992
29 Datalog and Logic-Based Databases, http://cs.wwc.edu/~aabyan/415/Datalog.html
30 Datalog, http://en.wikipedia.org/wiki/Datalog
31 Tomasic A., Amouroux R., Bonnet P.: The Distributed Information Search Component Disco and the World Wide Web, SIGMOD Conference 1997, pp. 546-548, 1997
32 Tomasic A., Raschid L., Valduriez P.: Scaling Access to Heterogeneous Data Sources with DISCO, IEEE Transactions on Knowledge and Data Engineering, Volume 10, pp. 808-823, 1998
33 Czejdo B., Eder J., Morzy T., Wrembel R.: Designing and Implementing an Object-Relational Data Warehousing System, DAIS'01 Proceedings, Volume 198, 2001, pp. 311-316
34 Fussell M. L.: Foundations of Object-Relational Mapping, http://www.chimu.com/publications/objectRelational/
35 ORM, http://en.wikipedia.org/wiki/Object-relational_mapping
36 Mapping Objects to Relational Databases: O/R Mapping In Detail, http://www.agiledata.org/essays/mappingObjects.html
37 Object-relational mapping articles, http://www.service-architecture.com/object-relational-mapping/articles/
38 DAO, http://en.wikipedia.org/wiki/Data_Access_Object
39 DAO, http://java.sun.com/blueprints/corej2eepatterns/Patterns/DataAccessObject.html
40 Bergamaschi S., Garuti A., Sartori C., Venuta A.: Object Wrapper: An Object-Oriented Interface for Relational Databases, 23rd EUROMICRO Conference '97 New Frontiers of Information Technology, 1997, pp. 41-46
41 Keller W.: Object/Relational Access Layers, A Roadmap, Missing Links and More Patterns, EuroPLoP 1998
42 Keller W.: Mapping Objects to Tables: A Pattern Language, in Proceedings of the 1997 European Pattern Languages of Programming Conference, Irrsee, Germany, Siemens Technical Report 120/SW1/FB 1997
43 Keller W., Coldewey J.: Relational Database Access Layers: A Pattern Language, in Collected Papers from the PLoP'96 and EuroPLoP'96 Conferences, Washington University, Department of Computer Science, Technical Report WUCS 97-07, February 1997
44 Grove A.: Data Access Object (DAO) versus Object Relational Mapping (ORM), http://www.codefutures.com/weblog/andygrove/archives/2005/02/data_access_obj.html
45 ORM software list, http://en.wikipedia.org/wiki/List_of_object-relational_mapping_software
46 Matthes F., Rudloff A., Schmidt J.W., Subieta K.: A Gateway from DBPL to Ingres, Proc. of Intl. Conf. on Applications of Databases, Vadstena, Sweden, Springer LNCS 819, pp. 365-380, 1994
47 Modula-2, http://www.modula2.org/
48 Ingres, http://www.ingres.com/
49 Oracle, http://www.oracle.com/
50 EOF, http://en.wikipedia.org/wiki/Enterprise_Objects_Framework
51 OpenStep, http://en.wikipedia.org/wiki/OpenStep
52 The Objective-C Programming Language, http://developer.apple.com/documentation/Cocoa/Conceptual/ObjectiveC/ObjC.pdf
53 Apache Cayenne, http://cayenne.apache.org/
54 Ajax, http://www.ajaxgoals.com/
55 Velocity, http://velocity.apache.org/
56 IBM JDBC wrapper, http://www-128.ibm.com/developerworks/java/library/j-jdbcwrap/
57 JDO, http://java.sun.com/products/jdo/
58 JDO 2.0 specification, http://www.jcp.org/en/jsr/detail?id=243
59 JDO, http://en.wikipedia.org/wiki/Java_Data_Objects
60 Apache JDO, http://db.apache.org/jdo/index.html
61 Apache OJB, http://db.apache.org/ojb/
62 XORM, http://www.xorm.org
63 Speedo, http://speedo.objectweb.org/
64 JDO implementations, http://db.apache.org/jdo/impls.html
65 EJB, http://java.sun.com/products/ejb/
66 EJB JSR 220, http://jcp.org/en/jsr/detail?id=220
67 EJB, http://en.wikipedia.org/wiki/Enterprise_Java_Beans
68 JPA, http://java.sun.com/javaee/technologies/persistence.jsp
69 The Java Persistence API – A Simpler Programming Model for Entity Persistence, http://java.sun.com/developer/technicalArticles/J2EE/jpa/
70 Hibernate, http://www.hibernate.org/
71 .NET Framework, http://msdn2.microsoft.com/en-us/netframework/default.aspx
72 Apache Torque, http://db.apache.org/torque/
73 CORBA, http://www.omg.org/gettingstarted/corbafaq.htm
74 CORBA, http://en.wikipedia.org/wiki/CORBA
75 OMG, http://www.omg.org/
76 ORB/ODBMS Integration, http://www.ime.usp.br/~reverbel/orb_odbms.html
77 Liang K-C., Chyan D., Chang Y-S., Lo W., Yuan S-M.: Integration of CORBA and Object Relational Databases, Computer Standards and Interfaces, Vol. 25, No. 4, Sept. 2003, pp. 373-389
78 Sandholm T.: Object Caching in a Transactional, Object-Relational CORBA Environment, Master's Thesis, Stockholm University, October 1998, http://cis.cs.tu-berlin.de/Dokumente/Diplomarbeiten/1998/sandholm.ps.gz
79 XML, http://www.w3.org/XML/
80 XQuery, http://www.w3.org/XML/Query/
81 XQuery 1.0: An XML Query Language, http://www.w3.org/TR/xquery/
82 Carey M., Kiernan J., Shanmugasundaram J., Shekita E., Subramanian S.: XPERANTO: A Middleware for Publishing Object-Relational Data as XML Documents
83 Carey M., Florescu D., Ives Z., Lu Y., Shanmugasundaram J., Shekita E., Subramanian S.: XPERANTO: Publishing Object-Relational Data as XML
84 Shanmugasundaram J., Kiernan J., Shekita E., Fan C., Funderburk J.: Querying XML Views of Relational Data
85 Funderburk J. E., Kiernan G., Shanmugasundaram J., Shekita E., Wei C.: XTABLES: Bridging relational technology and XML, IBM Systems Journal 41, No. 4, 2002
86 Braganholo V. P., Davidson S. B., Heuser C. A.: From XML view updates to relational view updates: old solutions to a new problem
87 Braganholo V. P., Davidson S. B., Heuser C. A.: UXQuery: Building Updatable XML Views over Relational Databases
88 Valikov A., Kazakos W., Schmidt A.: Building updateable XML views on top of relational databases
89 CoastBase, http://www.netcoast.nl/tools/rikz/COASTBASE.htm
90 Kazakos W., Kramer R., Schmidt A.: Coastbase – The Virtual European Coastal and Marine Data Warehouse, Computer Science for Environmental Protection 2000, Vol 2 (ed. A. Cremers, K. Greve), Metropolis-Verlag, 2000, pp. 646-654
91 Shao F., Novak A., Shanmugasundaram J.: Triggers over XML Views of Relational Data
92 DB2, http://www-306.ibm.com/software/data/db2/
93 RDF, http://www.w3.org/RDF/
94 SWARD, http://user.it.uu.se/~udbl/sward.html
95 Petrini J., Risch T.: SWARD: Semantic Web Abridged Relational Databases, http://user.it.uu.se/~udbl/sward/SWARD.pdf
96 RDQL, http://www.w3.org/Submission/RDQL/
97 SPARQL query language, http://www.w3.org/TR/rdf-sparql-query/
98 SPARQL protocol, http://www.w3.org/TR/rdf-sparql-protocol/
99 SPARQL XML result format, http://www.w3.org/TR/rdf-sparql-XMLres/
100 Cyganiak R.: A relational algebra for SPARQL, HP Labs, Bristol, UK
101 Dokulil J.: Evaluation of SPARQL queries using relational databases
102 Harris S.: SPARQL query processing with conventional relational database systems
103 Perez de Laborda C., Conrad S.: Bringing Relational Data into the Semantic Web using SPARQL and Relational.OWL, Data Engineering Workshops, 2006, Proceedings, 2006, pp. 55-55
104 Newman A.: Querying the Semantic Web using a Relational Based SPARQL, http://jrdf.sourceforge.net/RelationalBasedSPARQL.pdf
105 ICONS, http://www.icons.rodan.pl/
106 Staniszkis E., Nowicki B.: ICONS based Knowledge Management in the Process of Structural Funds Projects Preparation, http://www.rodan.pl/badania/publikacje/publications/%5BStaniszkis2004a%5D.pdf
107 Staniszkis W., Staniszkis E.: Intelligent Agent-based Expert Interactions in a Knowledge Management Portal, http://www.icons.rodan.pl/presentations/S03.ppt
108 OMG UML, http://www.uml.org/
109 WfMC, http://www.wfmc.org/
110 Staniszkis W., Nowicki B.: Intelligent CONtent management System. Presentation of the IST ICONS project, 4th International Workshop on Distributed Data and Structures WDAS 2002, Paris, March 2002
111 Staniszkis W., Nowicki B.: Intelligent CONtent management System. Presentation of the IST ICONS Project, Conference TELBAT Teleworking for Business, Education, Research and e-Commerce, October 2002, Vilnius, Lithuania
112 Staniszkis W.: ICONS Knowledge Management for Structural Fund Projects. A Case Study, DIESIS – Driving Innovative Exploits for Sardinian Information Society Knowledge Management Case Study, Cagliari, Sardinia, 11-12 September 2003
113 Beatty J., Brodsky S., Nally M., Patel R.: Next-Generation Data Programming: Service Data Objects, A Joint Whitepaper with IBM and BEA, 2003, http://ftpna2.bea.com/pub/downloads/commonj/Next-Gen-Data-Programming-Whitepaper.pdf
114 Beatty J., Brodsky S., Ellersick R., Nally M., Patel R.: Service Data Objects, http://ftpna2.bea.com/pub/downloads/commonj/Commonj-SDO-Specification-v1.0.pdf
115 Portier B., Budinsky F.: Introduction to Service Data Objects, http://www.ibm.com/developerworks/java/library/j-sdo/
116 EMF, http://www.eclipse.org/emf/
117 SDO specification, http://www-128.ibm.com/developerworks/library/specification/ws-sdo/
118 Codd E. F.: A Relational Model of Data for Large Shared Data Banks, Communications of the ACM, Vol. 13, No. 6, June 1970, pp. 377-387
119 Selinger P. G., Astrahan M. M., Chamberlin D. D., Lorie R. A., Price T. G.: Access path selection in a relational database management system, SIGMOD Conference 1979, pp. 23-34
121. Chamberlin D.: Bibliography of the System R Project, http://www.mcjones.org/System_R/bib.html
122. Jarke M., Koch J.: Query Optimization in Database Systems, ACM Computing Surveys, 16(2), 1984, pp. 111-152
123. Chaudhuri S.: An Overview of Query Optimization in Relational Systems, Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Seattle, Washington, 1998, pp. 34-43
124. Ioannidis Y. E.: Query Optimization, ACM Computing Surveys, symposium issue on the 50th Anniversary of ACM, Vol. 28, No. 1, March 1996, pp. 121-123
125. Andler S., Ding I., Eswaran K., Hauser C., Kim W., Mehl J., Williams R.: System D: A distributed system for availability, In Proceedings of the 8th International Conference on Very Large Data Bases (Mexico City), VLDB Endowment, Saratoga, 1982, pp. 33-44
126. Apers P. M. G., Hevner A. R., Yao S. B.: Optimization algorithms for distributed queries, IEEE Trans. Softw. Eng., SE-9, 1, 1983, pp. 57-68
127. Bernstein P. A., Goodman N.: Concurrency control in distributed database systems, ACM Comput. Surv., 13, 2 (June), 1981, pp. 185-221
128. Bernstein P. A., Goodman N., Wong E., Reeve C. L., Rothnie J. B., Jr.: Query processing in a system for distributed databases (SDD-1), ACM Trans. Database Syst., 6, 4 (Dec.), 1981, pp. 602-625
129. Ceri S., Pelagatti G.: Allocation of operations in distributed database access, IEEE Trans. Comput., C-31, 2, 1982, pp. 119-128
130. Chang J.-M.: A heuristic approach to distributed query processing, In Proceedings of the 8th International Conference on Very Large Data Bases (Mexico City), VLDB Endowment, Saratoga, Calif., 1982, pp. 54-61
131. Cheung T.-Y.: A method for equijoin queries in distributed relational databases, IEEE Trans. Comput., C-31, 8, 1982, pp. 746-751
132. Chiu D. M., Bernstein P. A., Ho Y. C.: Optimizing chain queries in a distributed database system, Tech. Rep. TR-01-81, Computer Science Dept., Harvard University, Cambridge, Mass., 1981
133. Chu W. W., Hurley P.: Optimal query processing for distributed database systems, IEEE Trans. Comput., C-31, 9, 1982, pp. 835-850
134. Epstein R., Stonebraker M.: Analysis of distributed data base processing strategies, In Proceedings of the 6th International Conference on Very Large Data Bases (Montreal, Oct. 1-3), IEEE, New York, 1980, pp. 92-101
135. Epstein R., Stonebraker M., Wong E.: Distributed query processing in a relational data base system, In Proceedings of the ACM-SIGMOD International Conference on Management of Data (Austin, Tex., May 31-June 2), ACM, New York, 1978, pp. 169-180
136. Forker H. J.: Algebraical and operational methods for the optimization of query processing in distributed relational database management systems, In Proceedings of the 2nd International Symposium on Distributed Databases (Berlin), Elsevier North-Holland, 1982, pp. 39-59
137. Gavish B., Segev A.: Query optimization in distributed computer systems, In Management of Distributed Data Processing, J. Akoka (ed.), Elsevier North-Holland, New York, 1982, pp. 233-252
138. Hevner A. R.: The optimization of query processing on distributed database systems, Ph.D. dissertation, Computer Science Dept., Purdue University, West Lafayette, 1979
139. Kambayashi Y., Yoshikawa M., Yajima S.: Query processing for distributed databases using generalized semi-joins, In Proceedings of the ACM-SIGMOD International Conference on Management of Data (Orlando, Fla., June 2-4), ACM, New York, 1982, pp. 151-160
140. Sacco G. M., Yao S. B.: Query optimization in distributed database systems, In Advances in Computers, Vol. 21, Academic Press, New York, 1982, pp. 225-273
141. Wong E.: Dynamic rematerialization: Processing distributed queries using redundant data, IEEE Trans. Softw. Eng., SE-9, 3, 1983, pp. 228-232
142. Yu C. T., Chang C. C.: On the design of a query processing strategy in a distributed database environment, In SIGMOD '83, Proceedings of the Annual Meeting (San Jose, Calif., May 23-25), ACM, New York, 1983, pp. 30-39
143. Codd E. F.: A database sublanguage founded on the relational calculus, In Proceedings of the ACM-SIGFIDET Workshop, Data Description, Access, and Control (San Diego, Calif., Nov. 11-12), ACM, New York, 1971, pp. 35-68
144. Codd E. F.: Relational completeness of data base sublanguages, In Courant Computer Science Symposia No. 6: Data Base Systems, Prentice-Hall, New York, 1972, pp. 67-101
145. Lacroix M., Pirotte A.: Domain-Oriented Relational Languages, VLDB 1977, pp. 370-378
146. Codd E. F.: Relational Completeness of Data Base Sub-languages, In R. Rustin (ed.), Data Base Systems, Prentice Hall, 1972
147. Ono K., Lohman G.: Measuring the complexity of join enumeration in query optimization, In Proceedings of the 16th Int. VLDB Conference, Brisbane, Australia, August 1990, pp. 314-325
148. Nahar S., Sahni S., Shragowitz E.: Simulated annealing and combinatorial optimization, In Proc. 23rd Design Automation Conference, 1986, pp. 293-299
149. Swami A., Gupta A.: Optimization of large join queries, In Proc. ACM-SIGMOD Conference on the Management of Data, Chicago, 1988, pp. 8-17
150. Swami A.: Optimization of large join queries: Combining heuristics and combinatorial techniques, In Proc. ACM-SIGMOD Conference on the Management of Data, Portland, 1989, pp. 367-376
151. Kirkpatrick S., Gelatt C. D., Jr., Vecchi M. P.: Optimization by simulated annealing, Science, 220(4598), 1983, pp. 671-680
152. Ioannidis Y., Wong E.: Query optimization by simulated annealing, In Proc. ACM-SIGMOD Conference on the Management of Data, San Francisco, 1987, pp. 9-22
153. Ioannidis Y., Kang Y.: Randomized algorithms for optimizing large join queries, In Proc. ACM-SIGMOD Conference on the Management of Data, Atlantic City, 1990, pp. 312-321
154. Mannino M. V., Chu P., Sager T.: Statistical profile estimation in database systems, ACM Computing Surveys, 20(3), 1988, pp. 192-221
155. Christodoulakis S.: On the estimation and use of selectivities in database performance evaluation, Research Report CS-89-24, Dept. of Computer Science, University of Waterloo, 1989
156. Olken F., Rotem D.: Simple random sampling from relational databases, In Proc. 12th Int. VLDB Conference, Kyoto, 1986, pp. 160-169
157. Lipton R. J., Naughton J. F., Schneider D. A.: Practical selectivity estimation through adaptive sampling, In Proc. of the 1990 ACM-SIGMOD Conference on the Management of Data, Atlantic City, 1990, pp. 1-11
158. Haas P., Swami A.: Sequential sampling procedures for query size estimation, In Proc. of the 1992 ACM-SIGMOD Conference on the Management of Data, San Diego, 1992, pp. 341-350
159. Haas P., Swami A.: Sampling-based selectivity estimation for joins using augmented frequent value statistics, In Proc. of the 1995 IEEE Conference on Data Engineering, Taipei, 1995
160. Christodoulakis S.: Implications of certain assumptions in database performance evaluation, ACM TODS, 9(2), 1984, pp. 163-186
161. Ioannidis Y., Christodoulakis S.: On the propagation of errors in the size of join results, In Proc. of the 1991 ACM-SIGMOD Conference on the Management of Data, Denver, 1991, pp. 268-277
162. Kooi R. P.: The Optimization of Queries in Relational Databases, PhD thesis, Case Western Reserve University, 1980
163. Piatetsky-Shapiro G., Connell C.: Accurate estimation of the number of tuples satisfying a condition, In Proc. 1984 ACM-SIGMOD Conference on the Management of Data, Boston, 1984, pp. 256-276
164. Muralikrishna M., DeWitt D. J.: Equi-depth histograms for estimating selectivity factors for multi-dimensional queries, In Proc. of the 1988 ACM-SIGMOD Conference on the Management of Data, Chicago, 1988, pp. 28-36
165. Ioannidis Y., Christodoulakis S.: Optimal histograms for limiting worst-case error propagation in the size of join results, ACM TODS, 18(4), 1993, pp. 709-748
166. Ioannidis Y.: Universality of serial histograms, In Proc. 19th Int. VLDB Conference, Dublin, 1993, pp. 256-267
167. Ioannidis Y., Poosala V.: Balancing histogram optimality and practicality for query result size estimation, In Proc. of the 1995 ACM-SIGMOD Conference on the Management of Data, San Jose, 1995, pp. 233-244
168. Haas L., Freytag J. C., Lohman G. M., Pirahesh H.: Extensible Query Processing in Starburst, In Proc. of ACM SIGMOD, Portland, 1989
169. Starburst, http://www.almaden.ibm.com/cs/starwinds/starburst.html
170. Pirahesh H., Hellerstein J. M., Hasan W.: Extensible/Rule Based Query Rewrite Optimization in Starburst, In Proc. of ACM SIGMOD, 1992
171. Lohman G. M.: Grammar-like Functional Rules for Representing Query Optimization Alternatives, In Proc. of ACM SIGMOD, 1988
172. Graefe G., McKenna W. J.: The Volcano Optimizer Generator: Extensibility and Efficient Search, In Proc. of the IEEE Conference on Data Engineering, Vienna, 1993
173. Graefe G.: The Cascades Framework for Query Optimization, Data Engineering Bulletin, 1995
174. Graefe G., DeWitt D. J.: The EXODUS Optimizer Generator, In Proc. of ACM SIGMOD, San Francisco, 1987
175. Kozankiewicz H., Leszczyłowski J., Subieta K.: Implementing Mediators through Virtual Updateable Views, Engineering Federated Information Systems, Proceedings of the 5th Workshop EFIS 2003, July 17-18, 2003, Coventry, UK, pp. 52-62
176. Kozankiewicz H., Leszczyłowski J., Subieta K.: Updateable Views for an XML Query Language, CAiSE FORUM 2003, Klagenfurt/Velden, Austria
177. Kozankiewicz H., Leszczyłowski J., Subieta K.: Updateable XML Views, ADBIS'03, Dresden, Germany, 2003
178. Tsichritzis D. C., Klug A. (eds.): The ANSI/X3/SPARC DBMS Framework: Report of the Study Group on Data Base Management Systems, Information Systems 3, 1978
179. Płodzień J.: Optimization Methods in Object Query Languages, PhD Thesis, IPI PAN, Warszawa, 2000
180. Płodzień J., Kraken A.: Object Query Optimization in the Stack-Based Approach, Proc. ADBIS Conf., Springer LNCS 1691, 1999, pp. 303-316
181. Płodzień J., Subieta K.: Optimization of Object-Oriented Queries by Factoring Out Independent Subqueries, Institute of Computer Science Polish Academy of Sciences, Report 889, 1999
182. Płodzień J., Kraken A.: Object Query Optimization through Detecting Independent Subqueries, Information Systems, Pergamon Press, 2000
183. Płodzień J., Subieta K.: Applying Low-Level Query Optimization Techniques by Rewriting, Proc. DEXA Conf., Springer LNCS 2113, 2001, pp. 867-876
184. Płodzień J., Subieta K.: Query Optimization through Removing Dead Subqueries, Proc. ADBIS Conf., Springer LNCS 2151, 2001, pp. 27-40
185. Płodzień J., Subieta K.: Static Analysis of Queries as a Tool for Static Optimization, Proc. IDEAS Conf., IEEE Computer Society, 2001, pp. 117-122
186. Płodzień J., Subieta K.: Query Processing in an Object Data Model with Dynamic Roles, Proc. WSEAS Intl. Conf. on Automation and Information (ICAI), Puerto de la Cruz, Spain, CD-ROM, ISBN 960-8052-89-0, 2002
187. Shaw G. M., Zdonik S. B.: An object-oriented query algebra, Proceedings of DBPL Workshop, 1989, pp. 103-112
188. Beeri C., Kornatzky Y.: Algebraic optimization of object-oriented query languages, ICDT, 1990, pp. 72-88
189. Mitchell G., Zdonik S. B., Dayal U.: Object-oriented query optimization: what's the problem?, Department of Computer Science, Brown University, USA, Technical Report No. CS-91-41, 1991
190. Rich C., Scholl M. H.: Query optimization in an OODBMS, BTW, Informatik Aktuell, Springer, Heidelberg, 1993
191. Cluet S., Delobel C.: Towards a unification of rewrite-based optimization techniques for object-oriented queries, In Query Processing for Advanced Database Systems, Morgan Kaufmann, 1994, pp. 245-272
192. Kemper A., Moerkotte G.: Query optimization in object bases: exploiting relational techniques, In Query Processing for Advanced Database Systems, Morgan Kaufmann, 1994, pp. 101-137
193. Leu T. W.: Compiling object-oriented queries, Department of Computer Science, Brown University, USA, Technical Report No. CS-94-05, 1994
194. Cherniack M., Zdonik S. B., Nodine M. H.: To form a more perfect union (intersection, difference), International Workshop on Database Programming Languages, Gubbio, Italy, 1995
195. Cherniack M., Zdonik S. B.: Rule languages and internal algebras for rule-based optimizers, Proceedings of SIGMOD, 1996, pp. 401-412
196. Heuer A., Kröger J.: Query optimization in the CROQUE project, Proceedings of DEXA, Springer LNCS 1134, 1996, pp. 489-499
197. Abbas I., Boucelma O.: A framework for algebraic optimization of object-oriented query languages, Proceedings of DEXA, Springer LNCS 1308, 1997, pp. 478-487
198. Grust T., Kröger J., Gluche D., Heuer A., Scholl M. H.: Query evaluation in CROQUE – calculus and algebra coincide, Proceedings of the 14th British National Conference on Databases, Springer LNCS 1271, 1997, pp. 84-100
199. Cherniack M., Zdonik S. B.: Changing the rules: transformations for rule-based optimizers, Proceedings of SIGMOD, 1998, pp. 61-72
200. Kröger J., Illner R., Rost S., Heuer A.: Query rewriting and search in CROQUE, Proceedings of ADBIS, Springer LNCS 1691, 1999, pp. 288-302
201. Litwin W.: Linear Hashing: a new tool for file and table addressing, reprinted from VLDB-80 in Readings in Databases, 2nd ed., Morgan Kaufmann Publishers, Stonebraker M. (ed.), 1994
202. Litwin W., Neimat M. A., Schneider D. A.: LH*: A Scalable, Distributed Data Structure, ACM Trans. Database Syst., 21(4), 1996, pp. 480-525
203. Zql, Pierre-Yves Gibello, http://www.experlog.com/gibello/zql/
204. Sahri S., Litwin W., Schwartz T.: SD-SQL Server: a Scalable Distributed Database System, CERIA Research Report 2005-12-13, December 2005
205. eGov-Bus, http://www.egov-bus.org/web/guest/home
206. ODRA White Paper, http://iolab.pjwstk.edu.pl:8081/forum/image.aspx?a=95
207. Cattell R. G. G., Barry D. K. (eds.): The Object Data Standard: ODMG 3.0, Morgan Kaufmann, 2000
208. Cook W. R., Rosenberger C.: Native Queries for Persistent Objects: A Design White Paper, http://www.db4o.com/about/productinformation/whitepapers/Native%20Queries%20Whitepaper.pdf, 2006
209. Hibernate - Relational Persistence for Java and .NET, http://www.hibernate.org/, 2006
210. Subieta K.: Theory and Construction of Object-Oriented Query Languages, PJIIT Publishing House, ISBN 83-89244-28-4, 2004, 522 pages (in Polish)
211. Subieta K.: Stack-Based Approach (SBA) and Stack-Based Query Language (SBQL), http://www.sbql.pl, 2006
212. Albano A., Bergamini R., Ghelli G., Orsini R.: An Object Data Model with Roles, Proc. VLDB Conf., 1993, pp. 39-51
213. Jodłowski A., Habela P., Płodzień J., Subieta K.: Objects and Roles in the Stack-Based Approach, Proc. DEXA Conf., Springer LNCS 2453, 2002
214. Kozankiewicz H.: Updateable Object Views, PhD Thesis, 2005, http://www.ipipan.waw.pl/~subieta/ -> Finished PhD-s -> Hanna Kozankiewicz
215. Kozankiewicz H., Leszczyłowski J., Subieta K.: Updateable XML Views, Proc. of ADBIS'03, Springer LNCS 2798, 2003, pp. 385-399
216. Kozankiewicz H., Stencel K., Subieta K.: Integration of Heterogeneous Resources through Updatable Views, ETNGRID-2004, proceedings published by IEEE
217. Torque DTD, http://db.apache.org/torque/releases/torque-3.2/generator/database.dtd.txt
218. Apache Torque, http://db.apache.org/torque/
219. PostgreSQL, http://www.postgresql.org/
220. Firebird, http://www.firebirdsql.org/
221. MS SQL Server 2005, http://www.microsoft.com/sql/default.mspx
222. Apache Ant, http://ant.apache.org/
223. Java Service Wrapper, http://wrapper.tanukisoftware.org/doc/english/introduction.html