The Apertium platform: opportunities for research and business

Gema Ramírez Sánchez
gramirez@prompsit.com
Prompsit Language Engineering, S.L.
Campus UMH. Edifici Quorum III. Av. de la Universitat, s/n, Elx (Alacant), Spain
www.prompsit.com

UZ. Zagreb, 27th October 2014

Index
- Brief introduction
- Apertium: the machine translation (MT) platform
- apertium demo
- Prompsit: adding a business layer to Apertium
- aplica.prompsit.com

Transducens research group
- Origins: 2004, Department of Software and Computing Systems at Universitat d'Alacant
- Areas: machine translation, human language technology applications, mark-up languages and digital libraries, computer-supported education
- Staff: 2 full prof., 2 associate prof., 4 assistant prof., 1 teaching assistant, 4 PhD students and 20 technicians working on national and international research projects

Transducens research group
- Origins of MT in Transducens:
  - InterNOSTRUM: Spanish ←→ Catalan
  - Traductor Universia: Spanish ←→ Portuguese
- The Apertium free/open-source MT platform
  = research platform (5 master's theses, 2 PhD theses, around 70 publications, more than 700 citations, 6 publicly funded projects)
  = technology transfer platform

Prompsit
- Origins: 2006, spin-off from Transducens
- Motivation: reuse know-how, commercialise services (no licenses) around Apertium, revolutionise the translation and language technology markets following the collaborative development model powered by free software
- Expertise: machine translation and natural language processing for multilingual tasks
- Team: linguists and software engineers + Transducens + Apertium community

Some Prompsit/Transducens people
Apertium: the MT platform
- Rule-based, shallow-transfer MT platform that provides a free/open-source engine, data (38 pairs) and tools
- Philosophy:
  - clear and effective separation of engine and data
  - modularity (do one thing and do it well)
  - standards (C++ code, XML-based data, Unicode-compliant, multi-platform, Ubuntu repositories)
  - no rocket science: established and robust technologies, Unix pipeline, text-oriented processing
  - free/open-source, well documented, supported
  - fast, runs on a standard PC, easy to integrate

Apertium workflow
1. Source text
2. Deformatter: txt, xml, html, doc(x), ppt(x), xlsx, rtf, zip, QuarkXPress, etc.
3. Morphological analyser (monolingual dictionary)
4. PoS tagger (parameters)
5. Pre-transfer
6. 1- or 3-level structural and lexical transfer (transfer rules, bilingual dictionary)
7. Morphological generator (monolingual dictionary)
8. Post-generator (post-generation dictionary)
9. Reformatter
10. Target text
Optional modules: named-entity guesser, tmx-handler, language and encoding identifier, lexical selector

Apertium language pairs
- 38 stable language pairs
- Languages in the diagram: nl, af, slv, en, ms, ar, mt, nn, mk, nb, bg, ast, gl, es, eo, pt, fr, br, sv, sme, da, eu, ro, is, hbs, id, an, cy, kaz, ca, it, tat, oc, urd, hin

A stable language pair contains...
- Dictionaries:
  - 2 monolingual: xx.dix/.metadix and yy.dix/.metadix
  - 1 bilingual: xx-yy.dix
  - 2 post-generator: xx-post.dix and yy-post.dix
- Tagger definition set and probabilities, one per language: xx.tsx, xx.prob and yy.tsx, yy.prob
- Transfer rule files, one to three levels per translation direction: xx-yy.t[1-3]x and yy-xx.t[1-3]x

A stable language pair contains...
- Closed categories: determiners, pronouns, conjunctions, prepositions, numerals, etc.
- Open categories: nouns, verbs, adjectives, adverbs
- Basic operations between languages: gender, number and case agreement, tenses, local reorderings, etc.
- Coverage above 90%, word error rate below 30%
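The coverage and word error rate thresholds quoted for a stable pair can be made concrete with a small sketch. It assumes that coverage is the share of tokens the analyser recognises, with unknown words flagged by a leading asterisk as in Apertium output, and that word error rate is the word-level edit distance to a reference translation divided by the reference length. The function names and the sample sentence are illustrative only and are not part of Apertium.

```python
def coverage(tokens):
    """Share of tokens that are not flagged as unknown (leading '*')."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if not token.startswith("*"))
    return known / len(tokens)


def word_error_rate(hypothesis, reference):
    """Word-level Levenshtein distance divided by the reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must not be empty")
    # prev[j] holds the distance between the hypothesis prefix seen so far
    # and the first j words of the reference.
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # drop a hypothesis word
                            curr[j - 1] + 1,      # add a missing reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    output = "Which is that *nogomet ?".split()
    print(f"coverage: {coverage(output):.0%}")
    print(f"WER: {word_error_rate('What is this football', 'What is football'):.2f}")
```

Under these assumptions, a pair would meet the thresholds above when `coverage` exceeds 0.9 on a test corpus and `word_error_rate` against post-edited references stays below 0.3.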
Croatian in Apertium: the more...
In the TRUNK branch...
- apertium-hbs-slv: 16,607 lemmas; 14,742 bilingual equivalents; 47 (hbs→slv) and 98 (slv→hbs) transfer rules
- apertium-hbs-mkd: 12,638 lemmas; 10,452 bilingual equivalents; 71 (hbs→mkd) and 19 (mkd→hbs) transfer rules
- apertium-hbs-eng: 16,607 lemmas; 16,226 bilingual equivalents; 56 (hbs→eng) and 6 (eng→hbs) transfer rules

Croatian in Apertium: and more...
In the NURSERY branch...
- apertium-hbs-rus: 16,607 lemmas; 5,008 bilingual equivalents; 6 (hbs→rus) and 8 (rus→hbs) transfer rules
In the LANGUAGES branch...
- apertium-hbs: 33,451 lemmas!!! SETimes coverage: 92.6%!!!

Croatian in Apertium: the merrier!
- Who made it possible? Lots of contributors to thank: Hrvoje Peradin, Aleš Horvat, Francis Tyers, Filip Petkovski, Dejan Čabrilo, Ivica Dimitrijev, Kevin Brubeck Unhammer, Barbara Dujmic, Nikola Ljubešić, Filip Klubička and myself ;)
- Also Google Summer of Code!
- And of course the Abu-MaTran project

Powered by Abu-MaTran
What is Abu-MaTran?
- It is not a place in Saudi Arabia: Abu al Matran
- It stands for Automatic building of Machine Translation
- It is a European project (Marie Curie IAPP action) that connects companies and research institutions to work on subjects that matter to people
- www.abumatran.eu

A noun in the hbs dictionary: ljubica
It inflects as... djevojčic/a__n
  <e lm="ljubica"> <i>ljubic</i> <par n="djevojčic/a__n"/> </e>
This entry analyses / generates 14 forms:
  ljubica:ljubica<n><f><sg><nom>,
  ljubice:ljubica<n><f><sg><gen>,
  ljubicu:ljubica<n><f><sg><acc>,
  ljubici:ljubica<n><f><sg><dat>|<loc>,
  ljubice:ljubica<n><f><sg><voc>,
  ljubicom:ljubica<n><f><sg><ins>,
  etc.

A noun in the hbs-eng dictionary: ljubica
In English it is... violet (also ljubičica)
  <e> <p> <l>ljubičica<s n="n"/></l> <r>violet<s n="n"/></r> </p> </e>
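To illustrate how a monolingual entry such as the one for ljubica combines a stem with a paradigm, here is a minimal sketch. It is not Apertium's own compiler: the `PARADIGMS` table is a hypothetical, partial stand-in for the `djevojčic/a__n` paradigm, containing only the singular endings listed on the slide, and `expand` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Monolingual entry as shown on the slide.
ENTRY = '<e lm="ljubica"><i>ljubic</i><par n="djevojčic/a__n"/></e>'

# Hypothetical, partial paradigm table: only the singular endings
# listed on the slide, not the full paradigm from the real dictionary.
PARADIGMS = {
    "djevojčic/a__n": [
        ("a", "<n><f><sg><nom>"),
        ("e", "<n><f><sg><gen>"),
        ("u", "<n><f><sg><acc>"),
        ("i", "<n><f><sg><dat>"),
        ("i", "<n><f><sg><loc>"),
        ("e", "<n><f><sg><voc>"),
        ("om", "<n><f><sg><ins>"),
    ],
}


def expand(entry_xml, paradigms):
    """Return (surface form, lexical form) pairs for one dictionary entry."""
    entry = ET.fromstring(entry_xml)
    lemma = entry.get("lm")              # lemma attribute of <e>
    stem = entry.findtext("i")           # invariant part of the word
    paradigm = entry.find("par").get("n")
    return [(stem + suffix, lemma + tags) for suffix, tags in paradigms[paradigm]]


if __name__ == "__main__":
    for surface, lexical in expand(ENTRY, PARADIGMS):
        print(f"{surface}:{lexical}")
```

The same pairs serve both directions: read left to right they describe analysis, and right to left they describe generation.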
More on Apertium
- GPL license, available at Sourceforge.net
- The Apertium community: 265 developers
- Funding: public and private
- Effort: 1 pair = 4 to 8 person-months
- Testing: www.apertium.org on http://hr.wikipedia.org/wiki/Portal:Nogomet
- Step-by-step demo: apertium-viewer
- More info: http://wiki.apertium.org/wiki/Publications

Just in case the net doesn't work

apertium-hbs-eng
HR: PORTAL O NOGOMETU
EN: PORTAL On *NOGOMETU
HR: Što je to nogomet?
EN: Which is that *nogomet?
HR: Nogomet je ekipni šport koji se igra između dvije momčadi svaka sastavljena od 11 igrača.
EN: *Nogomet is *ekipni *šport who plays between two team every assembled from 11 players.

abumatran-hbs-eng (statistical MT)
HR: PORTAL O NOGOMETU
EN: Portal O football
HR: Što je to nogomet?
EN: What's this football?
HR: Nogomet je ekipni šport koji se igra između dvije momčadi svaka sastavljena od 11 igrača.
EN: Football is team Education and Sports which is played between two the two teams each consisting of 11 players.

Just in case the net doesn't work

apertium-hbs-slv (no *unknowns shown)
HR: PORTAL O NOGOMETU
SL: PORTAL O NOGOMETU
HR: Što je to nogomet?
SL: Kateri je to nogomet?
HR: Nogomet je ekipni šport koji se igra između dvije momčadi svaka sastavljena od 11 igrača.
SL: Nogomet je ekipni šport ki se igra vmes dva ekipe vsaka sestavi od 11 igralcev.

apertium-hbs_HR-hbs_SR (to be released)
HR: PORTAL O NOGOMETU
SR: PORTAL O FUDBALU
HR: Što je to nogomet?
SR: Što je to fudbal?
HR: Nogomet je ekipni šport koji se igra između dvije momčadi svaka sastavljena od 11 igrača.
SR: Fudbal je *ekipni šport koji se igra između dve momčadi svaka sastavljena od 11 igrača.

Future work for Apertium
- Adding lexical selection: done!
- Adding a deeper transfer module
- Improving morphology management
- Adding other closely related language families (Slavic and Baltic on the roadmap)
- Eliciting knowledge from users through user interfaces
- Automatically extracting data from available resources

Prompsit: adding a business layer to Apertium
In 2006 we had the most important ingredients:
- the license: GNU General Public Licence
- the team: a combination of know-how, the will to work together, a shared goal
- the business model: software is free (as in freedom but also as in free beer); we get money from services = our work + margin
- and all the rest had to be defined... still going on...

Apertium as a business

Prompsit today...

Prompsit: MT-related services
- Prompsit Integra
  - Made-to-measure machine translation services
  - Multilingual content management
- Prompsit Innova
  - Hybrid MT development and services
- Prompsit Informa
  - À la carte training (machine translation)
  - Consultancy services

Prompsit: MT technologies
- Apertium rule-based MT systems (+TM): closely related languages; 20,000 words/sec; more mechanical output; more than 38 systems already developed
- Apertium + Moses hybrid MT systems (+TM): more distant languages; 200 words/sec; more fluent output; 12 systems already developed

+ Marketing for MT technologies
Our MT technologies are:
- free/open-source: no cost per license = inexpensive
- customisable: each customer can ask for a particular need (domain, format, features)
- easy to integrate: within other systems or workflows
- combinable: with translation memories
- fast, scalable, ready for production environments
- wide format support: Office, LibreOffice, txt, html, latex and... PDF!!!

Some successful use cases
- DGT European Commission
Use case: Autodesk
- Goal: quick and cheap translation from English to Brazilian Portuguese
- Proposal: translate from Spanish to Brazilian Portuguese with Apertium
- Process: terminology customisation, integration with TMs, web service set-up
- Results: 66% improvement in translation speed, cheapest post-editors, glossary adherence

Beyond MT
- Extractium: an Apertium-based named-entity classifier
- Opinum: a statistical opinion classifier trained on domain-specific corpora
- Test them at aplica.prompsit.com!!
- Reverso Context: a bilingual concordancer developed by Prompsit for Softissimo: context.reverso.net
- AltLang: an Apertium-based service focused on generating language variants: www.altlang.net

Research results = opportunity for research for business!

The Apertium platform: opportunities for research and business
Hvala lijepa! (Thank you very much!)
- Visit us at www.prompsit.com
- Contact me at gramirez@prompsit.com
- Follow us at http://twitter.com/prompsit