The Apertium platform: opportunities
for research and business
Gema Ramírez Sánchez
gramirez@prompsit.com
Prompsit Language Engineering, S.L.
Campus UMH. Edifici Quorum III.
Av. de la Universitat, s/n
Elx (Alacant). Spain
www.prompsit.com
Index

Brief introduction

Apertium: the machine translation (MT) platform


apertium demo
Prompsit: adding a business layer to Apertium

aplica.prompsit.com
UZ. Zagreb, 27th October 2014
Transducens research group



Origins: 2004, Department of Software and
Computing Systems at Universitat d'Alacant
Areas: machine translation, human language
technology applications, mark-up languages
and digital libraries, computer-supported
education
Staff: 2 full professors, 2 associate professors, 4 assistant
professors, 1 teaching assistant, 4 PhD students and
20 technicians working on national and
international research projects
Transducens research group

Origins of MT in Transducens:

InterNOSTRUM: Spanish ←→ Catalan

Traductor Universia: Spanish ←→ Portuguese

The Apertium free/open-source MT platform =
research platform (5 master's theses, 2 PhD theses,
around 70 publications, more than 700 citations, 6
publicly funded projects) = technology transfer
platform
Prompsit




Origins: 2006, spin-off from Transducens
Motivation: reuse know-how, commercialise
services (no licenses) around Apertium, revolutionise the translation and language technology
markets according to the collaborative development model powered by free software
Expertise: machine translation and natural
language processing for multilingual tasks
Team: linguists and software engineers +
Transducens + Apertium community
Some Prompsit/Transducens people
Apertium: the MT platform


Rule-based MT platform: shallow-transfer, provides a
free/open-source engine, data (38 pairs) and tools
Philosophy:

clear and effective separation of engine and data

modularity (do one thing and do it well)


standards (C++ coding, XML-based data, Unicode-compliant, multi-platform, Ubuntu repositories)
no rocket science: established & robust technologies, Unix
pipeline, text-oriented processing

free/open-source + well documented + support

fast, runs on a standard PC, easy integration
Apertium workflow
Pipeline (each module reads the output of the previous one):

Source text
Deformatter: txt, xml, html, doc(x), ppt(x), xlsx, rtf, zip, QuarkXPress, etc.
Morphological analyser (monolingual dictionary)
PoS tagger (parameters)
Pre-transfer
1 or 3-level structural & lexical transfer (transfer rules, bilingual dictionary)
Morphological generator (monolingual dictionary)
Post-generator (post-gen dictionary)
Reformatter
Target text

Optional modules: named-entity, tmx-handler, guesser, language & encoding identifier, lexical selector
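The workflow chains small text-processing modules in the style of a Unix pipeline. The following minimal Python sketch only illustrates that idea: the three stages and the one-word dictionary are toy stand-ins invented for this example, not Apertium's actual modules or data.

```python
from functools import reduce
from typing import Callable, List

# A stage takes text and returns text, like a filter in a Unix pipe.
Stage = Callable[[str], str]


def run_pipeline(text: str, stages: List[Stage]) -> str:
    """Feed the text through each stage in order."""
    return reduce(lambda acc, stage: stage(acc), stages, text)


# Hypothetical one-entry bilingual dictionary, for illustration only.
TOY_BILINGUAL = {"nogomet": "football"}


def toy_analyser(text: str) -> str:
    # Lower-case the input and emit one token per line.
    return "\n".join(text.lower().split())


def toy_transfer(text: str) -> str:
    # Translate known tokens; prefix unknown ones with an asterisk.
    tokens = [tok for tok in text.split("\n") if tok]
    return "\n".join(TOY_BILINGUAL.get(tok, "*" + tok) for tok in tokens)


def toy_generator(text: str) -> str:
    # Join the tokens back into a single line of text.
    return " ".join(text.split("\n"))


if __name__ == "__main__":
    stages = [toy_analyser, toy_transfer, toy_generator]
    print(run_pipeline("Nogomet danas", stages))
```

Because every stage has the same text-in, text-out signature, a stage can be added, removed or replaced without touching the others.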
Apertium language pairs
38 stable language pairs
[Figure: graph of the stable language pairs. Languages shown: nl, af, slv, en, ms, ar, mt, nn, mk, nb, bg, ast, gl, es, eo, pt, fr, br, sv, sme, da, eu, ro, is, hbs, id, an, cy, kaz, ca, it, tat, oc, urd, hin]
A stable language pair contains...


Dictionaries:

2 monolingual: xx.dix/.metadix & yy.dix/.metadix

1 bilingual: xx-yy.dix

2 post-generator: xx-post.dix & yy-post.dix
Tagger definition set and probabilities:


one per language: xx.tsx, xx.prob & yy.tsx, yy.prob
Transfer rule files:

one to three levels, per translation direction: xx-yy.t[1-3]x & yy-xx.t[1-3]x
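The file inventory above can be sketched as a small Python helper that lists the expected file names for a pair. This is an illustration only: the function name `pair_files` is made up here, the names follow the xx/yy placeholders of the slide rather than any real package layout, and the .metadix alternative is left out.

```python
from typing import List


def pair_files(xx: str, yy: str, levels: int = 1) -> List[str]:
    """List the data files of a language pair, following the slide's naming."""
    if levels not in (1, 2, 3):
        raise ValueError("transfer has one to three levels")
    files = [
        f"{xx}.dix", f"{yy}.dix",            # monolingual dictionaries
        f"{xx}-{yy}.dix",                    # bilingual dictionary
        f"{xx}-post.dix", f"{yy}-post.dix",  # post-generator dictionaries
        f"{xx}.tsx", f"{xx}.prob",           # tagger definition and probabilities
        f"{yy}.tsx", f"{yy}.prob",
    ]
    # One set of transfer rule files per translation direction.
    for src, tgt in ((xx, yy), (yy, xx)):
        for level in range(1, levels + 1):
            files.append(f"{src}-{tgt}.t{level}x")
    return files


if __name__ == "__main__":
    for name in pair_files("hbs", "slv"):
        print(name)
```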
A stable language pair contains...

Closed categories:

determiners, pronouns, conjunctions, prepositions, numerals, etc.

Open categories:

nouns, verbs, adjectives, adverbs

Basic operations between languages:

gender, number, case agreement, tenses, local reorderings, etc.

Coverage above 90%, word error rate below 30%
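As a reminder of what the word error rate threshold measures, here is a minimal Python sketch of a common definition: the word-level edit distance between the MT output and a reference translation, divided by the reference length. The slide does not say which reference or tooling is used, so treat the function below as an assumption-laden illustration.

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must not be empty")
    # previous[j] holds the distance between the hypothesis words
    # processed so far and the first j reference words.
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            substitution = previous[j - 1] + (0 if hyp_word == ref_word else 1)
            deletion = previous[j] + 1      # extra word in the hypothesis
            insertion = current[j - 1] + 1  # word missing from the hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("a x c", "a b c"))
```

Under this definition, a WER below 30% means fewer than 3 word edits per 10 reference words.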
Croatian in Apertium: the more...

In the TRUNK branch...

apertium-hbs-slv

16,607 lemmas; 14,742 bilingual equivalents, 47 (hbs→slv) & 98 (slv→hbs) transfer rules

apertium-hbs-mkd

12,638 lemmas; 10,452 bilingual equivalents, 71 (hbs→mkd) & 19 (mkd→hbs) transfer rules

apertium-hbs-eng

16,607 lemmas; 16,226 bilingual equivalents, 56 (hbs→eng) & 6 (eng→hbs) transfer rules
Croatian in Apertium: and more...

In the NURSERY branch...

apertium-hbs-rus

16,607 lemmas; 5,008 bilingual equivalents, 6 (hbs→rus) & 8 (rus→hbs) transfer rules

In the LANGUAGES branch...

apertium-hbs

33,451 lemmas!!! SETimes coverage: 92.6%!!!
Croatian in Apertium: the merrier!

Who made it possible?



Lots of contributors to thank: Hrvoje Peradin, Aleš
Horvat, Francis Tyers, Filip Petkovski, Dejan
Čabrilo, Ivica Dimitrijev, Kevin Brubeck
Unhammer, Barbara Dujmic, Nikola Ljubešić,
Filip Klubička and myself ;)
Also Google Summer of Code!
And of course the Abu-MaTran project
Powered by Abu-MaTran

What is Abu-MaTran?



It is not a place in Saudi Arabia: Abu al Matran
It stands for Automatic building of Machine
Translation
It is a European project (Marie Curie IAPP action)
looking to connect companies and research
institutions to work on subjects of interest to people
www.abumatran.eu
A noun in the hbs dictionary:

ljubica inflects as... djevojčic/a__n
<e lm="ljubica">
<i>ljubic</i>
<par n="djevojčic/a__n"/>
</e>

This entry analyses / generates 14 forms:
ljubica:ljubica<n><f><sg><nom>, ljubice:ljubica<n><f><sg><gen>,
ljubicu:ljubica<n><f><sg><acc>, ljubici:ljubica<n><f><sg><dat>|<loc>,
ljubice:ljubica<n><f><sg><voc>, ljubicom:ljubica<n><f><sg><ins>, etc.
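How one entry yields many forms can be sketched in Python: the stem from the entry is combined with each ending of the paradigm, and the resulting pairs serve both analysis and generation. The ending list below contains only the singular analyses shown in this slide; the remaining forms are not listed here and are deliberately left out. Function names are made up for this illustration.

```python
from typing import List, Tuple

STEM = "ljubic"    # content of <i> in the entry
LEMMA = "ljubica"  # lm attribute of the entry

# (ending, tags) pairs copied from the singular analyses shown in the slide.
SHOWN_ENDINGS: List[Tuple[str, str]] = [
    ("a", "<n><f><sg><nom>"),
    ("e", "<n><f><sg><gen>"),
    ("u", "<n><f><sg><acc>"),
    ("i", "<n><f><sg><dat>"),
    ("i", "<n><f><sg><loc>"),
    ("e", "<n><f><sg><voc>"),
    ("om", "<n><f><sg><ins>"),
]


def expand(stem: str, lemma: str,
           endings: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return (surface form, analysis) pairs for one dictionary entry."""
    return [(stem + ending, lemma + tags) for ending, tags in endings]


def analyse(surface: str, pairs: List[Tuple[str, str]]) -> List[str]:
    """All analyses of a surface form (a form may be ambiguous)."""
    return [analysis for form, analysis in pairs if form == surface]


def generate(analysis: str, pairs: List[Tuple[str, str]]) -> str:
    """The surface form for one analysis."""
    for form, candidate in pairs:
        if candidate == analysis:
            return form
    raise KeyError(analysis)


if __name__ == "__main__":
    pairs = expand(STEM, LEMMA, SHOWN_ENDINGS)
    print(analyse("ljubice", pairs))
    print(generate("ljubica<n><f><sg><ins>", pairs))
```

The same table is read in both directions, which is why a single entry covers analysis and generation.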
A noun in the hbs-eng dictionary:

ljubica in English is... violet (also ljubičica)
<e>
<p>
<l>ljubičica<s n="n"/></l>
<r>violet<s n="n"/></r>
</p>
</e>
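A bilingual entry like this one can be read with Python's standard XML parser. The sketch below assumes straight quotes around attribute values and a single-word entry; multi-word entries and other dictionary features are not handled, and the helper names are invented for this example.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

ENTRY = """<e>
<p>
<l>ljubičica<s n="n"/></l>
<r>violet<s n="n"/></r>
</p>
</e>"""

Side = Tuple[str, List[str]]


def read_side(element: ET.Element) -> Side:
    """Return the word and its list of tags for one side of the pair."""
    word = element.text or ""
    tags = [symbol.get("n", "") for symbol in element.findall("s")]
    return word, tags


def read_entry(xml_text: str) -> Tuple[Side, Side]:
    """Parse one <e> entry into (left side, right side)."""
    entry = ET.fromstring(xml_text)
    pair = entry.find("p")
    if pair is None:
        raise ValueError("entry has no <p> element")
    left, right = pair.find("l"), pair.find("r")
    if left is None or right is None:
        raise ValueError("pair needs both <l> and <r>")
    return read_side(left), read_side(right)


if __name__ == "__main__":
    print(read_entry(ENTRY))
```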
More on Apertium

GPL license, available at Sourceforge.net

The Apertium community: 265 developers

Funding: public and private funding

1 pair = 4 to 8 person-months

Testing: www.apertium.org on
http://hr.wikipedia.org/wiki/Portal:Nogomet

Step-by-step demo: apertium-viewer

More info: http://wiki.apertium.org/wiki/Publications
Just in case the net doesn't work

apertium-hbs-eng

HR: PORTAL O NOGOMETU
EN: PORTAL On *NOGOMETU

HR: Što je to nogomet?
EN: Which is that *nogomet?

HR: Nogomet je ekipni šport koji se igra između dvije momčadi svaka sastavljena od 11 igrača.
EN: *Nogomet is *ekipni *šport who plays between two team every assembled from 11 players.
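In the rule-based output, an asterisk marks a word the system does not know. A rough Python sketch of how such marks can be counted to approximate coverage on a sample follows; it splits naively on whitespace and the function names are invented here, so it is a quick check rather than a proper coverage evaluation on a corpus.

```python
from typing import List


def unknown_words(mt_output: str) -> List[str]:
    """Tokens that the system marked as unknown with a leading asterisk."""
    return [tok[1:] for tok in mt_output.split() if tok.startswith("*")]


def naive_coverage(mt_output: str) -> float:
    """Share of whitespace-separated tokens that are not marked unknown."""
    tokens = mt_output.split()
    if not tokens:
        raise ValueError("empty output")
    return 1 - len(unknown_words(mt_output)) / len(tokens)


if __name__ == "__main__":
    sample = ("*Nogomet is *ekipni *šport who plays between two team "
              "every assembled from 11 players.")
    print(unknown_words(sample))
    print(round(naive_coverage(sample), 3))
```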
abumatran-hbs-eng (statistical MT)

HR: PORTAL O NOGOMETU
EN: Portal O football

HR: Što je to nogomet?
EN: What's this football?

HR: Nogomet je ekipni šport koji se igra između dvije momčadi svaka sastavljena od 11 igrača.
EN: Football is team Education and Sports which is played between two the two teams each consisting of 11 players.
Just in case the net doesn't work

apertium-hbs-slv (no *unknowns shown)

HR: PORTAL O NOGOMETU
SL: PORTAL O NOGOMETU

HR: Što je to nogomet?
SL: Kateri je to nogomet?

HR: Nogomet je ekipni šport koji se igra između dvije momčadi svaka sastavljena od 11 igrača.
SL: Nogomet je ekipni šport ki se igra vmes dva ekipe vsaka sestavi od 11 igralcev.
apertium-hbs_HR-hbs_SR (to be released)

HR: PORTAL O NOGOMETU
SR: PORTAL O FUDBALU

HR: Što je to nogomet?
SR: Što je to fudbal?

HR: Nogomet je ekipni šport koji se igra između dvije momčadi svaka sastavljena od 11 igrača.
SR: Fudbal je *ekipni šport koji se igra između dve momčadi svaka sastavljena od 11 igrača.
Future work for Apertium

Adding lexical selection: done!

Adding a deeper transfer module

Improving morphology management



Adding other closely related language families
(Slavic and Baltic on the roadmap)
Eliciting knowledge from users through user
interfaces
Automatically extracting data from available
resources
Prompsit: adding a business
layer to Apertium

In 2006 we had the most important ingredients:

the license: GNU General Public License
the team: combination of know-how, the will to
work together, a shared goal
the business model: software is free (as in freedom
but also as in free beer), we get money from
services = our work + margin
and all the rest had to be defined... still going on...
Apertium as a business
Prompsit today...
Prompsit: MT-related services


Prompsit Integra

Made-to-measure machine translation services

Multilingual content management
Prompsit Innova


Hybrid MT development and services
Prompsit Informa

À la carte training (machine translation)

Consultancy services
Prompsit: MT technologies


Apertium rule-based MT systems (+TM):

closely-related languages

20,000 words/sec, more mechanical

more than 38 systems already developed
Apertium + Moses hybrid MT systems (+TM):

more distant languages

200 words/sec, more fluent

12 systems already developed
Marketing for MT technologies

Our MT technologies are:



free/open-source: no cost per license = inexpensive
customisable: each customer can request adaptation to a
particular need (domain, format, features)
easy to integrate: within other systems or
workflows

combinable: with translation memories

fast, scalable, ready for production environments

wide format support: Office, LibreOffice, txt, html,
latex and... PDF!!!
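One simple way to picture the combination with translation memories is an exact-match lookup that falls back to MT. The slides do not describe how the combination is actually implemented, so the Python sketch below is purely an assumption: the function name, the in-memory dictionary and the stand-in MT function are all hypothetical.

```python
from typing import Callable, Dict, Tuple


def translate_with_tm(segment: str,
                      memory: Dict[str, str],
                      mt: Callable[[str], str]) -> Tuple[str, str]:
    """Return (translation, origin), preferring an exact TM match over MT."""
    if segment in memory:
        return memory[segment], "TM"
    return mt(segment), "MT"


if __name__ == "__main__":
    # Hypothetical memory with one previously validated segment.
    memory = {"Što je to nogomet?": "What is football?"}

    def fake_mt(text: str) -> str:
        # Stand-in for a call to a real MT engine.
        return "[MT] " + text

    print(translate_with_tm("Što je to nogomet?", memory, fake_mt))
    print(translate_with_tm("PORTAL O NOGOMETU", memory, fake_mt))
```

Returning the origin alongside the translation lets a later step treat human-validated TM segments differently from raw MT output.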
Some successful use cases
DGT
European Commission
Use case: Autodesk




Goal: quick and cheap translation from English
to Brazilian Portuguese
Proposal: translate from Spanish to Brazilian
Portuguese with Apertium
Process: terminology customisation, integration
with TMs, web service set-up
Results: 66% improvement in translation speed,
cheapest post-editors, glossary adherence
Beyond MT


Extractium: an Apertium-based named-entity classifier
Opinum: statistical opinion classifier trained on
domain-specific corpora
Test them at aplica.prompsit.com!!


Reverso Context: a bilingual concordancer developed
by Prompsit for Softissimo: context.reverso.net
AltLang: an Apertium-based service focused on
language variants generation: www.altlang.net
Research results = opportunity
for research for business!
The Apertium platform: opportunities
for research and business
Hvala lijepa! (Thank you very much!)
Visit us at www.prompsit.com
Contact me at gramirez@prompsit.com
Follow us at http://twitter.com/prompsit