DIADEM: domain-centric, intelligent, automated data extraction
WWW 2012 – European Projects Track, April 16–20, 2012, Lyon, France

DIADEM: Domain-centric, Intelligent, Automated Data Extraction Methodology∗

Tim Furche, Georg Gottlob, Giovanni Grasso, Omer Gunes, Xiaonan Guo, Andrey Kravchenko, Giorgio Orsi, Christian Schallhart, Andrew Sellers, Cheng Wang
Department of Computer Science, Oxford University, Wolfson Building, Parks Road, Oxford OX1 3QD
firstname.lastname@cs.ox.ac.uk

ABSTRACT
Search engines are the sinews of the web. These sinews have become strained, however: where the web's function was once a mix of library and yellow pages, it has become the central marketplace for information of almost any kind. We search more and more for objects with specific characteristics: a car with a certain mileage, an affordable apartment close to a good school, or the latest accessory for our phones. Search engines all too often fail to provide reasonable answers, making us sift through dozens of websites with thousands of offers, never sure that a better offer isn't just around the corner. What search engines are missing is an understanding of the objects and their attributes published on websites. Automatically identifying and extracting these objects is akin to alchemy: transforming unstructured web information into highly structured data with near perfect accuracy. With DIADEM we present a formula for this transformation, but at a price: DIADEM identifies and extracts data from a website with high accuracy, provided we supply it with extensive knowledge about the ontology and phenomenology of the domain, i.e., about the entities (and relations) and about the representation of these entities in the textual, structural, and visual language of websites of this domain. With a first prototype of DIADEM, this demonstration shows that, in contrast to the alchemists, DIADEM has found a viable formula.
Categories and Subject Descriptors
H.3.5 [Information Storage and Retrieval]: On-line Information Services—Web-based services

General Terms
Languages, Experimentation

Keywords
data extraction, deep web, knowledge

∗The research leading to these results has received funding from the European Research Council under the European Community's Seventh Framework Programme (FP7/2007–2013) / ERC grant agreement DIADEM, no. 246858, http://diadem-project.info/.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2012 Companion, April 16–20, 2012, Lyon, France. ACM 978-1-4503-1230-1/12/04.

1. INTRODUCTION
Search is failing. In our society, search engines play a crucial role as information brokers. They allow us to find web pages and, by extension, businesses, articles, or information about products wherever they are published. Visibility on search engines has become mission critical for most companies and persons. However, for searches where the answer is not just a single web page, search engines are failing: What is the best apartment for my needs? Who offers the best price for this product in my area? Such object queries have become an exercise in frustration: we users need to manually sift through, compare, and rank the offers from dozens of websites returned by a search engine. At the same time, businesses have to turn to stopgap measures, such as aggregators, that provide customers with search facilities in a specific domain (vertical search). This endangers the just emerging universal information market, "the great equalizer" of the Internet economy: visibility depends on deals with aggregators, the barrier to entry is raised, and the market fragments.

Rather than further stopgap measures, the "publish on your website" model that has made the web such a success and so resilient against control by a few (government agents or companies) must be extended to object search: without additional technological burdens on publishers, users should be able to search and query objects such as properties or laptops based on attributes such as price, location, or brand. Increasing the burden on publishers is not an option, as it further widens the digital divide between those who can afford the necessary expertise and those who cannot.

DIADEM, an ERC advanced investigator grant, is developing a system that aims to provide this bridge: without human supervision, it finds, navigates, and analyses websites of a specific domain and extracts all contained objects using highly efficient, scalable, automatically generated wrappers. The analysis is parameterized with domain knowledge that DIADEM uses to replace the human annotators of traditional wrapper induction systems and to refine and verify the generated wrappers. This domain knowledge describes the ontology as well as the phenomenology of the domain: what the entities and their relations are, and how they occur on websites. The latter states, e.g., that real estate properties include a location and a price and that these are displayed prominently.

[Figure 1: Exploration: form understanding (screenshot from the current prototype)]
[Figure 2: Identification: result page analysis (screenshot from the current prototype)]

Figures 1 and 2 show examples (screenshots from the current prototype) of the kind of analysis required by DIADEM for fully automated data extraction: web sites need to be explored to locate relevant data, here real estate properties. This includes in particular forms, as in Figure 1, which DIADEM automatically classifies such that each form field is associated with a type from the domain ontology, e.g., a "minimum price" field in a "price range" segment. These form models are then used to fill the form for further analysis, but also to generate exhaustive queries for the later extraction of all relevant data.

Figure 2 illustrates a (partial) result of the analysis performed on a result page, i.e., a page containing relevant objects of the domain: DIADEM identifies the objects, their boundaries on the page, as well as their attributes. Objects and attributes are typed according to the domain ontology and verified against the domain constraints. In Figure 2, e.g., we identify real estate properties (for sale) together with price, location, legal status, number of bed and bathrooms, pictures, etc. The identified objects are generalised into a wrapper that can extract such objects from any page following the same template without further analysis.

In the first year and a half, DIADEM has progressed beyond our expectations: the current prototype is already able to generate wrappers for most UK real estate and used car websites with higher accuracy than existing wrapper induction systems. In this demonstration, we outline the DIADEM approach and first results, and describe the live demonstration of the current prototype.

2. DIADEM—THE METHODOLOGY
DIADEM is fundamentally different from previous approaches. The integration of state-of-the-art technology with reasoning over high-level expert knowledge at the scale envisaged by this project has not yet been attempted and has a chance to become the cornerstone of next generation web data extraction technology. Standard wrapping approaches [17, 6] are limited by their dependence on templates for web pages or sites and by their lack of understanding of the extracted information. They cannot be generalised, and their dependence on human interaction prevents scaling to the size of the web. Several approaches that move towards fully automatic or generic wrapping of the existing web have been proposed and are currently under active development. Those approaches often try to iteratively learn new instances from known patterns and new patterns from known instances [5, 19], or to find generalised patterns on web pages such as record boundaries. Although current web harvesting and automated information extraction systems show significant progress, they still lack the combined recall and precision necessary to allow for very robust queries. We intend to build upon those approaches, but add deep knowledge to the extraction process and combine them in a new manner into our knowledge-based approach: we propose to use them as low-level fact finders, whose output facts may be revised in case more accurate information arises from other sources.

DIADEM thus follows a dual approach: at the low level of web pattern recognition, we use machine learning complemented by linguistic analysis and basic ontological annotation. Several of these approaches, in addition to newly developed methods, are implemented as lower-level building blocks (fact finders) to extract as much knowledge as possible and integrate it into a common knowledge base. At a higher level, we use goal-directed, domain-specific rules for reasoning on top of all generated lower-level knowledge, e.g., in order to identify the main input mask and the main extraction data structure, and to finalize the navigation process.

[Figure 3: DIADEM overview]

DIADEM approaches data extraction as a two stage process: in a first stage, we analyse a small fraction of a web site to generate a wrapper that is then executed in the second stage to extract all the relevant data on the site at high speed and low cost. Figure 3 gives an overview of the high-level architecture of DIADEM; on the left, we show the analysis stage, on the right the execution stage.

(1) Sampling analysis: In the first stage, a sample of the web pages of a site is used to fully automatically generate wrappers (i.e., extraction programs). The analysis is based on domain knowledge that describes the objects of interest (the "ontology") and how they appear on the web (the "phenomenology"). The result of the analysis is a wrapper program, i.e., a specification of how to extract all the data from the website without further analysis. Conceptually, the analysis is divided into two major phases, though these are closely interwoven in practice:

(i) Exploration: DIADEM automatically explores a site to locate relevant objects. The major challenge here are web forms: DIADEM needs to understand such forms well enough to fill them for sampling, but also to generate exhaustive queries for the extraction stage such that all the relevant data is extracted (see [1]). DIADEM's form understanding engine OPAL [9, 10] uses a phenomenology of forms in the domain to classify the form fields. The exploration phase is supported by the page and block classification in MALACHITE, where we identify, e.g., next links in paginated results, navigation menus, and irrelevant data such as advertisements. We further cluster pages by structural and visual similarity to guide the exploration strategy and to avoid analysing many similar pages. Since such pages follow a common template, the analysis of one or two pages from a cluster usually suffices to generate a high confidence wrapper.

(ii) Identification: The exploration unearths those web pages that contain actual objects. But DIADEM still needs to identify the precise boundaries of these objects as well as their attributes. To that end, DIADEM's result page analysis AMBER [11] analyses the repeated structure within and among pages. It exploits the domain knowledge to distinguish noise from relevant data and is thus far more robust than existing data extraction approaches. AMBER is complemented by Oxtractor, which analyses the free text descriptions. It benefits in this task from the contextual knowledge in the form of attributes already identified by AMBER and of background knowledge from the ontology.

(2) Large-scale extraction: The wrapper generated by the analysis stage can be executed independently and repeatedly. We have developed a new wrapper language, called OXPath [13], the first wrapper language for large scale, repeated (or even continuous) data extraction. OXPath is powerful enough to express nearly any extraction task, yet as a careful extension of XPath it maintains XPath's low data and combined complexity. In fact, it is so efficient that page retrieval and rendering time by far dominate execution.

For a more detailed description of DIADEM's stages, see [12]. DIADEM's analysis uses a knowledge driven approach based on a domain ontology and phenomenology. To that end, most of the analysis is implemented in logical rules on top of a thin layer of fact finders. For this we are currently developing a reasoning language targeted at highly dynamic, modular, expressive reasoning on top of a live browser. This language, called GLUE, builds on Datalog± [4, 16, 2, 3, 18], DIADEM's ontological query language. Datalog± allows us to reason on top of fairly complex ontologies, including value invention and equality, with little performance penalty compared to basic datalog (see [7] for the current state of datalog engines). We are also working on probabilistic extensions [14, 15] of Datalog± for use in DIADEM.

3. FIRST RESULTS
To give an impression of the achievements of DIADEM in its first year, we briefly summarise results on three of its components: DIADEM's form understanding system, OPAL; its result page analysis, AMBER; and the OXPath extraction language.

[Figure 4: Exploration and execution: (a) OPAL accuracy, (b) comparative evaluation, (c) OXPath scaling]
[Figure 5: Identification: AMBER evaluation in the UK real estate domain]

Figures 4a and 5 report on the quality of form understanding and result page analysis in DIADEM's first prototype. Figure 4a [9] shows that OPAL is able to identify about 99% of all form fields in the UK real estate and used car domains correctly. We also show OPAL's results on the ICQ and Tel-8 form benchmarks, where OPAL achieves >96% accuracy (in contrast, recent approaches achieve at best 92% [8]). The latter result is without use of domain knowledge; with domain knowledge we could easily achieve close to 99% accuracy in these cases as well. Figure 5 [11] shows the results of AMBER for data area, record, and attribute identification on result pages in the UK real estate domain, reported for each attribute separately. AMBER achieves on average 98% accuracy on all these tasks, with a tendency to perform worse on attributes that occur less frequently (such as the number of reception rooms).

Figures 4b and 4c summarise the results for OXPath: it easily outperforms existing data extraction systems, often by a wide margin. Its high performance execution leaves page retrieval and rendering to dominate execution (>85%) and thus makes avoiding page rendering imperative. We minimize page rendering by buffering any page that may be needed in further processing, yet manage to keep memory consumption constant in nearly all cases, as evidenced by Figure 4c, where we show OXPath's memory use in a 12h extraction task: OXPath extracts over 7 million pieces of data from 140,000 web pages in 12 hours (194 pages/min) in constant memory. Where OXPath requires memory independent of the number of accessed pages, most other systems use linear memory. In an extensive comparison with commercial and academic data extraction tools, we show that OXPath outperforms previous approaches by at least an order of magnitude.

4. DEMONSTRATION
Like its approach, the demonstration of DIADEM is split into two parts. The first part illustrates the analysis using the current DIADEM prototype. The second part illustrates wrapper execution with OXPath, using a combination of web demos and its visual wrapper development tools. To give the audience a better understanding of wrappers and of the challenges in automatically identifying them, we start the demonstration with the execution.

[Figure 6: Demonstration UIs: (a) OPAL, (b) Visual OXPath]

The demonstration starts with Visual OXPath, shown in Figure 6b and further illustrated in the screencast at diadem-project.info/oxpath/visual. With this UI, we demonstrate how a human would construct a wrapper and how a supervised wrapper generator can generalize a wrapper from just a few examples. We provide a number of standard examples, but are also open to suggestions from the audience. Once the wrapper is created, we submit it and let it run in the background while we continue with the rest of the demonstration.

In the second part of the demonstration, we focus on the DIADEM prototype. A screencast of the demo is available from diadem-project.info/screencast/diadem-01-april-2011.mp4. In the demonstration, we let the prototype run on several pages from the UK real estate and used car domains. The prototype runs in interactive mode, i.e., it visualizes its deliberations and conclusions and prompts the user before continuing to a new page, so that the user can inspect the results on the current page. With this, we can demonstrate first how DIADEM analyses forms and explores web sites based on these analyses. In particular, we show how DIADEM deals with form constraints such as mandatory fields, multi stage forms, and other advanced navigation structures. The demo fully automatically explores the web site until it finds result pages, for which it performs a selective analysis sufficient to generate the intended wrapper. As before, the demonstration discusses a number of such result pages and shows in the interactive prototype how DIADEM extracts objects from these result pages as well as how it identifies further exploration steps.

We conclude the demonstration with a brief outlook on applications of DIADEM technology beyond data extraction: the screencast at diadem-project.info/opal shows how we use DIADEM's form understanding to provide advanced assistive form filling. Where current assistive technologies are limited to simple keyword matching or to filling already encountered forms, DIADEM's technology allows us to fill any form of a domain with values from a master form with high precision.

5. REFERENCES
[1] M. Benedikt, G. Gottlob, and P. Senellart. Determining relevance of accesses at runtime. In PODS, 2011.
[2] A. Calì, G. Gottlob, and A. Pieris. New expressive languages for ontological query answering. In AAAI, 2011.
[3] A. Calì, G. Gottlob, and A. Pieris. Querying conceptual schemata with expressive equality constraints. In ER, 2011.
[4] A. Calì, G. Gottlob, and A. Pieris. Query answering under non-guarded rules in datalog±. In RR, 2010.
[5] V. Crescenzi and G. Mecca. Automatic information extraction from large websites. J. ACM, 51(5), 2004.
[6] N. N. Dalvi, R. Kumar, and M. A. Soliman. Automatic wrappers for large scale web extraction. In VLDB, 2011.
[7] O. de Moor, G. Gottlob, T. Furche, and A. Sellers, eds. Datalog Reloaded, Revised Selected Papers, LNCS, 2011.
[8] E. C. Dragut, T. Kabisch, C. Yu, and U. Leser. A hierarchical approach to model web query interfaces for web source integration. In VLDB, 2009.
[9] T. Furche, G. Gottlob, G. Grasso, X. Guo, G. Orsi, and C. Schallhart. Real understanding of real estate forms. In WIMS, 2011.
[10] T. Furche, G. Gottlob, G. Grasso, X. Guo, G. Orsi, and C. Schallhart. OPAL: Automated form understanding for the deep web. In WWW, 2012.
[11] T. Furche, G. Gottlob, G. Grasso, G. Orsi, C. Schallhart, and C. Wang. Little knowledge rules the web: Domain-centric result page extraction. In RR, 2011.
[12] T. Furche, G. Gottlob, X. Guo, C. Schallhart, A. Sellers, and C. Wang. How the minotaur turned into Ariadne: Ontologies in web data extraction. In ICWE (keynote), 2011.
[13] T. Furche, G. Gottlob, G. Grasso, C. Schallhart, and A. Sellers. OXPath: A language for scalable, memory-efficient data extraction from web applications. In VLDB, 2011.
[14] G. Gottlob, T. Lukasiewicz, and G. I. Simari. Answering threshold queries in probabilistic datalog+/- ontologies. In SUM, 2011.
[15] G. Gottlob, T. Lukasiewicz, and G. I. Simari. Conjunctive query answering in probabilistic datalog+/- ontologies. In RR, 2011.
[16] G. Gottlob, G. Orsi, and A. Pieris. Ontological queries: Rewriting and optimization. In ICDE, 2011.
[17] N. Kushmerick. Wrapper induction: Efficiency and expressiveness. AI, 118, 2000.
[18] G. Orsi and A. Pieris. Optimizing query answering under ontological constraints. In VLDB, 2011.
[19] A. Yates, M. Cafarella, M. Banko, O. Etzioni, M. Broadhead, and S. Soderland. TextRunner: Open information extraction on the web. In NAACL, 2007.
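To give a concrete flavour of the wrapper programs produced by the analysis stage, the sketch below shows what an OXPath-style expression for a real-estate site might look like. The site structure (form field positions, class names, link text) is entirely hypothetical; the constructs are those introduced for OXPath [13]: `doc(...)` to load a page in a live browser, the `field()` node test for form fields, actions in braces (typing a value, clicking), a Kleene-starred step to page through "next" links, and extraction markers (`:<name>`, possibly with a value expression) that produce the extracted records and their nested attributes.

```
doc("http://www.example-estates.example/search")
  /descendant::field()[1]/{"Oxford"}
  /following::field()[last()]/{click /}
  /( descendant::a[contains(., "Next")]/{click /} )*
  /descendant::div[@class="property"]:<property>
      [ .//span[@class="price"]:<price=string(.)> ]
      [ .//span[@class="location"]:<location=string(.)> ]
```

Read over such a hypothetical site, this expression fills the location field, submits the form, follows all "next" links, and emits one property record with nested price and location attributes per result container. The point of Section 2's two-stage design is that DIADEM generates a program of this kind once per site, so that the bulk extraction runs without any further analysis.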