The CORPUSCLE corpus search system
The Corpuscle search system
Paul Meurer
Uni Research Computing

Corpuscle
Corpuscle: a corpus management and analysis system for annotated corpora.
• Newly developed corpus query engine
• Comparable to Corpus Workbench
• Coding of hierarchical structures, structured attributes
• Integrated Web interface
• Concordances, collocations, word and co-occurrence/distribution statistics

Corpuscle
• Parallel corpus support
• Searchable user annotations
• Multimedia support
• CLARIN CMDI metadata
• AAI, federated authentication
• REST/JSON API (useful e.g. for interfacing with R)

Highlights
• Searchable user annotations
• Multi-valued and set-valued attributes
• Coding of multi-word expressions (and more)
• Corpus text upload
• CLARIN Federated Content Search endpoint
• REST/JSON based API

Searchable user annotations

Attributes
Multi-valued attributes:
• useful for not fully disambiguated annotation
• transparent to the user: queried in the same way as single-valued attributes
• possible to query for single-valuedness
Set-valued attributes:
• used for grammatical feature sets
• can be queried efficiently, with special syntax
Combinations of both
• e.g., not fully disambiguated feature set annotation

Multi-valued attributes
Ex.: word = fisker, lemma = fisk | fiske
Possible queries:             does match?
• [ lemma = "fisk" ]          yes
• [ lemma == "fisk" ]         no
• [ lemma != "fisk" ]         no
• [ lemma !!= "fisk" ]        yes
• [ lemma == "fisk.*" ]       ? (no)

Set-valued attributes
Useful to code grammatical feature sets, or other types of non-atomic annotations
An attribute can be set-valued and multi-valued at the same time:
• word = fisker
• lemma = fisk|fisker
• morph = ( N m pl ) | ( N m sg )
Example queries with boolean expression syntax:
• [ morph = ("N" "sg") ]
• [ morph = ("N" "pl" | "A" !"sup") ]
The reversed index is implemented as a suffix array, which makes this type of query very natural and efficient. No complicated regular expressions have to be evaluated.
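
The matching behaviour of the operators above can be sketched in a few lines of Python. This is only an illustration of one reading of the slides, not Corpuscle's implementation: it assumes that "=" succeeds if any value matches, that "==" additionally requires the attribute to be single-valued, that "!=" and "!!=" are their negations, and that patterns are regular expressions matched against the whole value. The function names and the (required, forbidden) encoding of a feature query are hypothetical, and the real engine answers such queries from a suffix array rather than by scanning sets.

```python
import re


def any_match(values, pattern):
    """'=' : at least one value of the attribute matches the pattern."""
    return any(re.fullmatch(pattern, v) for v in values)


def single_match(values, pattern):
    """'==' (assumed reading): exactly one value exists and it matches."""
    return len(values) == 1 and re.fullmatch(pattern, values[0]) is not None


def feature_match(value_sets, alternatives):
    """Set-valued '=' : some value set satisfies some alternative.

    Each alternative is a (required, forbidden) pair of feature sets,
    e.g. ("N" "pl" | "A" !"sup") becomes
    [({"N", "pl"}, set()), ({"A"}, {"sup"})].
    """
    return any(
        required <= value_set and not (forbidden & value_set)
        for value_set in value_sets
        for required, forbidden in alternatives
    )


# Token from the slides: word = fisker
lemma = ["fisk", "fiske"]
morph = [frozenset({"N", "m", "pl"}), frozenset({"N", "m", "sg"})]

lemma_results = {
    '[ lemma = "fisk" ]': any_match(lemma, "fisk"),
    '[ lemma == "fisk" ]': single_match(lemma, "fisk"),
    '[ lemma != "fisk" ]': not any_match(lemma, "fisk"),
    '[ lemma !!= "fisk" ]': not single_match(lemma, "fisk"),
    '[ lemma == "fisk.*" ]': single_match(lemma, "fisk.*"),
}

morph_results = {
    '[ morph = ("N" "sg") ]': feature_match(morph, [({"N", "sg"}, set())]),
    '[ morph = ("N" "pl" | "A" !"sup") ]': feature_match(
        morph, [({"N", "pl"}, set()), ({"A"}, {"sup"})]
    ),
}

for query, matched in {**lemma_results, **morph_results}.items():
    print(query, "->", "yes" if matched else "no")
```

Under these assumptions the five lemma queries come out as yes, no, no, yes, no, which agrees with the table on the slide, including the tentative "(no)" for the last query.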

MWEs
MWE: Multi Word Expressions
difficult to handle because of conflicting needs:
• Want to treat a MWE as a unit (single lemma form, set of grammatical features)
  – e.g., “i dag” should be treated as an Adv, not a PP and an N; “Rio de Janeiro” is one place name.
• But we want to be able to search in a uniform way, without knowing in advance which words are MWEs and which are not, at least on the token level.
• Counts have to be correct

MWEs: Solution
• Every word of a MWE is a separate token
• Lemma and features span the whole MWE
• Counts always relate to corpus positions

Annotation spans
Annotation spans as an extension of MWE coding (under development):
• MWEs: An attribute value can have a span
• Extension: multiple values with multiple spans; directed spans (edges)
• This allows coding of, e.g., dependency relations, coreference chains, and more

Corpus Text Upload
Users can upload their texts via a Web form to build a corpus
• Plain text (UTF-8)
• XML text
Three steps:
• Corpus definition
• Text upload
• Indexing
The corpus is useable right away
Plans: include annotation workflow (e.g., LAP)

Federated Content Search
CLARIN-FCS
• Goal: search in heterogeneous, geographically spread resources in a unified manner
• Query language: CQL – Contextual Query Language
• Protocol: SRU – Search and Retrieve via URL
• Return format: XML (adhering to the CLARIN-CQL schema, extensible)
• Operations: explain, (scan,) searchRetrieve
• Use case: Weblicht Aggregator

Federated Content Search
• http://clarino.uib.no/corpuscle/fcs?operation=explain
• http://clarino.uib.no/corpuscle/fcs?operation=searchRetrieve

REST/JSON API
REST: REpresentational State Transfer
• An architectural style, no official standard (unlike SOAP)
• Web API, Client-server model
• Stateless (in theory, but a session-id token and authentication information is sent with every request)
• Uses HTTP GET or POST requests
• response can be XML, JSON, etc.
• in our case: JSON

REST/JSON API
JSON: JavaScript Object Notation
• Language independent data format
• light-weight, easy to parse, basic data types
• Parsers for most programming languages
As a result: Easy to implement clients for REST/JSON API based services (Need e.g. curl, json parser)

REST/JSON API
• Example calls:
• http://clarino.uib.no/corpuscle/rest?command=get-session
• http://clarino.uib.no/corpuscle/rest?session=1234&corpus=avis-plain&command=query&query='aske.*'

REST/JSON API
Problem: federated authentication via IdP to access restricted resources
Two solutions:
1. Let your program code replicate the user interaction with the IdP and the local SP (this involves parsing of returned HTML pages etc.)
2. Get an authentication session token from a Web login to the SP and use it in your code, e.g.:
curl "http://clarino.uib.no/corpuscle/rest?command=get-session&login-index=_90753bc613d96c2v19069332254ca1b8fee4f574d"
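
As a rough sketch of what a client for these calls might look like, the Python below builds the request URLs with the standard library. The endpoint, the command names and the session, corpus, query and login-index parameters are taken from the example calls on the slides; the helper names build_url and call are hypothetical, and nothing is assumed about the fields of the JSON that comes back, so call simply returns the parsed response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint as shown in the example calls on the slides.
BASE_URL = "http://clarino.uib.no/corpuscle/rest"


def build_url(command, **params):
    """Build a request URL for the given command (hypothetical helper).

    Keyword names use '_' where the API uses '-', e.g. login_index
    becomes the login-index parameter. Values are percent-encoded.
    """
    query = {"command": command}
    query.update({key.replace("_", "-"): value for key, value in params.items()})
    return BASE_URL + "?" + urlencode(query)


def call(command, **params):
    """Send a GET request and return the parsed JSON response.

    The structure of the response is not documented on the slides,
    so it is returned as-is for the caller to inspect.
    """
    with urlopen(build_url(command, **params)) as response:
        return json.loads(response.read().decode("utf-8"))


# The two example calls from the slides ("1234" is the slides' placeholder).
session_url = build_url("get-session")
query_url = build_url(
    "query", session="1234", corpus="avis-plain", query="'aske.*'"
)

print(session_url)
print(query_url)
```

For the second authentication option, the token obtained from a Web login would be passed as build_url("get-session", login_index=token). Only build_url is exercised here; call needs network access and a valid session.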