The JANES project

The JANES project:
Tools and resources for linguistic analysis and automatic
processing of user-generated content in Slovene
Darja Fišer
University of Ljubljana
Zurich, 17 December 2015
Outline
1. Project background and goals
2. Construction & annotation of the Janes corpus
3. Construction & annotation of the subcorpus of tweets
4. Work in progress & future work
5. Project activities
1. Project background and goals
Background
•  Basic facts
   •  2 million inhabitants
   •  Mediterranean, Alpine, Pannonian and Dinaric regions
   •  7 dialectal groups, 42 dialects
   •  prescriptive, normativist culture
•  Language resources
   •  plenty of corpora (CLARIN.SI & CC)
   •  standard language
•  The Janes project
   •  national basic research project
   •  3 years, 2014-2017
   •  2 institutions, 8 team members
   •  development of resources, tools and methods for the analysis of UGC
   •  http://nl.ijs.si/janes
Research goals
§  WP1: Corpus construction
§  WP2: Linguistic analysis
–  Task 1: vs. standard Slovene
•  orthography (punctuation, capitalization, spelling)
•  regional variation (PhD)
•  syntax
–  Task 2: vs. spoken Slovene
•  discourse markers
•  interactive elements
–  Task 3: collocations
•  new collocations of old words
•  collocations of new words
–  Task 4: terminology
•  specialisation level & density
•  nonstandard elements in terminology
–  Task 5: semantic shifts
•  sense expansion / narrowing / shifting
•  amelioration / pejoration
–  Task 6: offensive language
•  refugee crisis
•  gay marriage
§  WP3: NLP tools
2. Construction & annotation of the Janes corpus
Corpus JANES 0.3
§  Tweets
–  TweetCat (Ljubešić et al. 2014)
–  Slovene-specific seed words -> Slovene users -> their network
–  metadata: username, timestamp, no. of retweets & favourites
§  Forum messages
–  3 forums: med.over.net, avtomobilizem.com, kvarkadabra.net
–  customized extractors
–  metadata: topic, post URL, post timestamp, username, post id
§  News comments
–  3 news portals: RTV Slo, Mladina, Reporter
–  customized extractors
–  metadata: article url, article id, username, post timestamp, post id
§  [Blogs]
–  slWaC 2.0 (Erjavec and Ljubešić 2014)
–  “blog” in domain name
–  no metadata, mixed blog text and comments
Corpus composition
by text source
Total: 161M tokens
[Pie chart: share of tokens by source. Sources: Twitter; Forums: Avtomobilizem, Kvarkadabra, Med.over.net; Blogs; Comments: RTV Slo, Mladina, Reporter. Slice labels: 38%, 24%, 16%, 8%, 8%, 5%, 1%, 0%. Group sizes: 61M, 47M, 38M and 15M tokens.]
Corpus composition
by authors, texts, words & tokens
85,500 authors
4.8M texts
161M tokens
135M words
[Four bar charts: authors, texts, tokens and words per source (tweet, forum, comment, blog).]
Corpus composition
by year
Corpus processing
§  Annotation
–  (almost standard) tokenization & sentence segmentation
–  lexical normalisation with CSMT
–  standard MSD tagging & lemmatization
§  Encoding
–  (currently) bespoke XML for metadata
–  annotated text in TEI P5
§  Exploration
–  (no)Sketch Engine
TEI text
noSketch Engine: “better”
noSketch Engine: “I”
3. Construction & annotation of the subcorpus of tweets
Corpus Tweet-sl 0.3.4
§  Additional tweets
–  tokens: 70M
–  words: 53M
–  tweets: 4.3M
§  Enriched metadata
–  at tweet level:
•  standardness (automatic)
•  sentiment (automatic)
–  at user level:
•  private / corporate (manual)
•  male / female (manual)
•  region (automatic)
Annotation of text standardness
(Ljubešić et al. 2015)
§  2 standardness levels
–  technical: T1–T3
–  linguistic: L1–L3
Annotation of text standardness
(Ljubešić et al. 2015)
§  Dataset development
–  50:50 standard & non-standard texts
–  ratio of no. of normalised tokens vs. no. of all tokens per text (0.1)
–  manual annotation
•  900 texts single-annotated (development set)
•  400 texts double-annotated (testing set)
§  Feature selection (29 features)
–  character-based:
•  repetitions of characters
•  ratio of alphabetic vs. non-alphabetic characters
•  ratio of vowels vs. consonants
–  token-based:
•  proportion of very short words
•  proportion of capitalised words
•  proportion of words not included in the Sloleks lexicon
§  Regressor
–  grid search hyperparameter optimisation via 10-fold cross-validation on the SVR regressor using RBF kernel
§  Evaluation
–  mean absolute error
•  0.377 T
•  0.424 L
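The feature families listed above can be illustrated with a small sketch. It is not the feature extractor of Ljubešić et al. (2015): it covers only the six features named on the slide (not all 29), and the function name, the vowel set, the "three or more repeats" rule and the "two characters or fewer" threshold for very short words are all assumptions made for illustration. The `lexicon` argument stands in for a word list such as Sloleks.

```python
import re

VOWELS = set("aeiou")  # assumed vowel inventory for this sketch


def standardness_features(text, lexicon):
    """Compute six illustrative character- and token-based features."""
    chars = [c for c in text if not c.isspace()]
    letters = [c for c in chars if c.isalpha()]
    non_alpha = len(chars) - len(letters)
    vowels = sum(c.lower() in VOWELS for c in letters)
    consonants = len(letters) - vowels
    tokens = re.findall(r"\w+", text)
    n_tok = len(tokens) or 1  # avoid division by zero on empty input
    return {
        # character-based: runs of 3+ identical characters (assumed rule)
        "char_repetitions": len(re.findall(r"(.)\1{2,}", text)),
        "alpha_vs_non_alpha": len(letters) / (non_alpha or 1),
        "vowels_vs_consonants": vowels / (consonants or 1),
        # token-based: "very short" means <= 2 characters here (assumed)
        "short_words": sum(len(t) <= 2 for t in tokens) / n_tok,
        "capitalised_words": sum(t[0].isupper() for t in tokens) / n_tok,
        "oov_words": sum(t.lower() not in lexicon for t in tokens) / n_tok,
    }


features = standardness_features("Jaaaz grem v LJ!!!", {"grem", "v"})
```

A feature dictionary of this kind would then be turned into a numeric vector and passed to the regressor; the SVR training itself is left out of the sketch.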
Corpus composition
by standardness
[Two bar charts: "T & L standardness" (levels T1–T3 and L1–L3) and "Combined standardness" (T1L1 to T3L3).]
Annotation of sentiment
(Jasmina Smailović)
§  Automatic annotation
–  large manually annotated dataset
–  SVM
§  Evaluation on 1,000 tweets (sports & politics)
–  Baseline = 37.7%
–  1-annotator ~ 57.3%
–  2-annotator = 62.1%
–  IAA = 76.5%
Corpus composition
by account type & gender
[Two bar charts: account type (corporate, private) and gender (neutral, female, male).]
Annotation of regions
(Čibej & Ljubešić 2015)
§  Data collection
–  harvesting by Aug 2015
–  130K geo-located tweets
–  1.7K unique users
§  Region annotation
–  ray-casting
–  7 regions + Lj & MB
–  strict filtering
•  private accounts
•  non-standard tweets
•  ≥3 tweets sent
•  ≥90% tweets sent from the same region
–  learning corpus
•  370 users (6.3%)
•  75K tokens
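The ray-casting step and the strict filter can be sketched as follows. This is a minimal illustration, not the pipeline of Čibej & Ljubešić (2015): the function names are hypothetical, regions are assumed to be simple polygons given as lists of (longitude, latitude) pairs, and the 90% share is assumed to be computed over all of a user's geo-located tweets. The account-type and standardness filters are left out.

```python
from collections import Counter


def point_in_polygon(lon, lat, polygon):
    """Ray casting: cast a ray east from the point and count edge crossings."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # The edge can only be crossed if its endpoints lie on
        # opposite sides of the horizontal line through the point.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside  # odd number of crossings = inside
    return inside


def region_of(lon, lat, regions):
    """Return the name of the first region containing the point, else None."""
    for name, polygon in regions.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


def assign_user_region(points, regions, min_tweets=3, min_share=0.9):
    """Keep a user only if enough tweets come from one and the same region."""
    if len(points) < min_tweets:
        return None
    labels = Counter(region_of(lon, lat, regions) for lon, lat in points)
    labels.pop(None, None)  # ignore tweets outside all regions
    if not labels:
        return None
    name, count = labels.most_common(1)[0]
    return name if count / len(points) >= min_share else None


# Toy square "regions"; real region outlines would come from geographic data.
REGIONS = {
    "A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "B": [(3, 0), (5, 0), (5, 2), (3, 2)],
}
user_region = assign_user_region([(1, 1)] * 9 + [(4, 1)], REGIONS)
```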
Corpus composition
by geographic region
[Bar chart by region: Maribor, Panonska, Rovtarska, Koroška, Primorska, Dolenjska, Tujina (abroad), Gorenjska, Štajerska, Ljubljana.]
Tweet-sl 0.3.4 metadata
id="tid.392972411765018626"
name="007_delic"
created="2013-10-23T11:14:58"
retrieved="2013-10-24T05:20:32.036451"
favorited="0"
retweeted="0"
in_reply="tid.392965997352591361"
lang="sl" lang_prob="0.996788622457"
standard_tech="T1" standard_tech_n="1.1"
standard_ling="L1" standard_ling_n="1.2"
sentiment="neutral"
source="private"
sex="female"
geo="-"
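As an illustration of how such metadata might be read, the sketch below parses a subset of the attributes shown above with Python's standard XML library. The slide lists only the attributes, so the `<tweet>` wrapper element, the function name and the reading of `geo="-"` as "no region assigned" are assumptions.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical wrapper element; only the attribute names and values
# are taken from the slide.
SAMPLE = (
    '<tweet name="007_delic" created="2013-10-23T11:14:58" '
    'lang="sl" lang_prob="0.996788622457" '
    'standard_tech="T1" standard_tech_n="1.1" '
    'standard_ling="L1" standard_ling_n="1.2" '
    'source="private" sex="female" geo="-"/>'
)


def read_tweet_metadata(xml_string):
    """Convert the string attributes of one tweet element to typed values."""
    attrs = ET.fromstring(xml_string).attrib
    return {
        "user": attrs["name"],
        "created": datetime.fromisoformat(attrs["created"]),
        # Combined label in the style of the composition chart, e.g. "T1L1"
        "standardness": attrs["standard_tech"] + attrs["standard_ling"],
        "standard_tech_n": float(attrs["standard_tech_n"]),
        "standard_ling_n": float(attrs["standard_ling_n"]),
        "source": attrs["source"],
        "sex": attrs["sex"],
        # Assumption: "-" marks a tweet without an assigned region
        "region": None if attrs["geo"] == "-" else attrs["geo"],
    }


metadata = read_tweet_metadata(SAMPLE)
```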
4. Work in progress & future work
Rediacritization
(Ljubešić et al., submitted)
§  Dataset development
§  Training set
–  Wikipedia, web texts & tweets
–  ≥100 characters per sentence
–  marked with L2 & L3 standardness level
–  ≥20% tokens with diacritics
•  token-aligned parallel dataset with original token & token with stripped diacritics
§  Approaches
–  lexicon approach
•  most frequent translation in training data
–  corpus approach
•  translation & language model combined
§  Results
–  baseline: 88%
–  charlifter: 93%
–  lexicon approach: 98.2%
–  corpus approach: 99.1%
§  Error analysis
–  ambiguities, tokenization errors, proper names
–  contextual (syntactic) information needed
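The lexicon approach can be sketched in a few lines: strip the diacritics from every training token and, for each stripped form, remember the most frequent original. This is a toy illustration under stated assumptions, not the system described in the submitted paper; the function names and the training tokens are made up, and the corpus approach with a language model is not covered.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(token):
    """Remove combining marks, e.g. 'šola' becomes 'sola'."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def train_lexicon(tokens):
    """Map each stripped form to its most frequent diacritized original."""
    counts = defaultdict(Counter)
    for tok in tokens:
        counts[strip_diacritics(tok)][tok] += 1
    return {plain: forms.most_common(1)[0][0] for plain, forms in counts.items()}


def rediacritize(tokens, lexicon):
    """Restore diacritics token by token; unknown tokens stay unchanged."""
    return [lexicon.get(tok, tok) for tok in tokens]


# Made-up training tokens, for illustration only
lexicon = train_lexicon("že že ze čas časa šola".split())
restored = rediacritize(["ze", "cas", "sola", "danes"], lexicon)
```

Because every token is looked up in isolation, a stripped form with several valid originals always receives the same answer, which is in line with the ambiguity errors and the need for contextual information noted above.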
Manual annotation
of the Janes reference corpus
§  Goal
1.  tokenization & sentence splitting
2.  standardization
3.  POS-tagging & lemmatization
§  WebAnno
–  Annotation guidelines
–  Annotator training
–  Curation
To-dos - Linguistics
§  Lexicology
–  annotate foreign & adopted words
–  compile a dictionary of twitterese
–  detect semantic shifts & neologisms
§  Sociolinguistics
–  profanity
–  offensive language
–  flaming
To-dos - NLP
§  Corpus development
–  add Wikipedia discussion & user pages
–  add blogs
–  create subcorpora (e.g. politicians, celebrities)
–  develop a monitor corpus
§  Corpus annotation
–  improve normalization, tagging & lemmatization
–  CMC-aware tagset extension
–  add metadata (e.g. age)
§  Corpus dissemination
–  promise
•  searchable through an on-line concordancer
•  downloadable as a dataset
–  problems
•  terms of use issues
•  copyright issues
•  privacy issues
–  solutions
•  anonymisation, shuffling, sampling
•  different access levels
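One way the anonymisation, shuffling and sampling ideas could fit together is sketched below. The slides do not say how the project plans to implement them, so everything here is an assumption: the function names, the salted-hash replacement of @mentions, the `[URL]` mask and the fixed random seed are all hypothetical.

```python
import hashlib
import random
import re


def pseudonymise(text, salt="janes"):
    """Replace @mentions with stable hashed handles and mask URLs."""
    def replace_mention(match):
        key = (salt + match.group(0).lower()).encode("utf-8")
        return "@user_" + hashlib.sha256(key).hexdigest()[:8]

    text = re.sub(r"@\w+", replace_mention, text)
    return re.sub(r"https?://\S+", "[URL]", text)


def shuffled_sample(texts, k, seed=0):
    """Pseudonymise all texts, shuffle them, and keep the first k."""
    pool = [pseudonymise(t) for t in texts]
    random.Random(seed).shuffle(pool)  # seeded for reproducibility
    return pool[:k]


sample = shuffled_sample(
    ["@Ana glej http://t.co/x", "dober dan", "@ana hvala"], k=2
)
```

Shuffling breaks up the original sequence of posts, which makes it harder to reconstruct conversations or complete texts from a released sample; whether that is enough to resolve the terms-of-use, copyright and privacy issues is a legal question the code cannot answer.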
5. Project activities
JANES Summer Camp
§  Uni Lj, 24-28 Aug 2015
§  25 high school students from all over Slovenia
§  Format
–  5 days, 5 topics
–  lecture, exercise, project
–  project presentation
§  Invited talks and evening events
§  On-line slides and other teaching materials
§  Good media coverage
JANES Conference
§  Uni Lj, 25-27 Nov 2015
§  15 reviewed papers, 23 authors, 50 delegates
§  Events
–  Best student paper award
–  Invited lecturer (Michael Beißwenger, TU Dortmund)
–  Panel discussion (What is Janes Slovene?)
–  Tutorial on statistics and R for linguists (Maja Miličević, Uni Belgrade)
JANES Express
§  Stops:
–  Zagreb, Croatia (4 Dec 2015)
–  Belgrade, Serbia (10 Dec 2015)
–  in cooperation with ReLDI
§  1-day event:
–  student workshop (noSkE)
–  annotation workshop (WebAnno)
–  evening lecture (annotating UGC corpora)
–  150 participants
Journal Special Issue 2016
§  Topic: computer-mediated communication
–  construction & distribution of CMC corpora
–  tools & resources for processing of CMC
–  corpus analyses of CMC
–  comparisons of CMC with standard and/or spoken discourse
–  sociolinguistic studies of CMC
–  code-switching in CMC
–  neologism & semantic shift detection in CMC
–  offensive language in CMC
§  Deadline: 31 March 2016
–  scientific, survey & position papers
–  reviews & project reports
§  CFP: click here
Int. CMC conference 2016
§  Topic: computer-mediated communication
–  development of CMC corpora
–  annotation & analysis of CMC corpora
–  NLP for CMC corpora
§  Deadline:
–  extended abstract: 1 March 2016
–  full paper: 1 June 2016
§  CFP: click here
http://nl.ijs.si/janes/
tenks :)
