The JANES project
The JANES project: Tools and resources for linguistic analysis and automatic processing of user-generated content in Slovene
Darja Fišer, University of Ljubljana
Zurich, 17 December 2015

Outline
1. Project background and goals
2. Construction & annotation of the Janes corpus
3. Construction & annotation of the subcorpus of tweets
4. Work in progress & future work
5. Project activities

1. Project background and goals

Background
• Basic facts
  – 2 million inhabitants
  – Mediterranean, Alpine, Pannonian and Dinaric regions
  – 7 dialectal groups, 42 dialects
  – prescriptive, normativist culture
• Language resources
  – plenty of corpora (CLARIN.SI & CC)
  – standard language
• The Janes project
  – national basic research project
  – 3 years, 2014-2017
  – 2 institutions, 8 team members
  – development of resources, tools and methods for the analysis of UGC
  – http://nl.ijs.si/janes

Research goals
§ WP1: Corpus construction
§ WP2: Linguistic analysis
  – Task 1: vs. standard Slovene
    • orthography (punctuation, capitalization, spelling)
    • regional variation (PhD)
    • syntax
  – Task 2: vs. spoken Slovene
    • discourse markers
    • interactive elements
  – Task 3: collocations
    • new collocations of old words
    • collocations of new words
  – Task 4: terminology
    • specialisation level & density
    • nonstandard elements in terminology
  – Task 5: semantic shifts
    • sense expansion/narrowing/shifting
    • amelioration/pejoration
  – Task 6: offensive language
    • refugee crisis
    • gay marriage
§ WP3: NLP tools

2. Construction & annotation of the Janes corpus

Corpus JANES 0.3
§ Tweets
  – TweetCat (Ljubešić et al. 2014)
  – Slovene-specific seed words -> Slovene users -> their network
  – metadata: username, timestamp, no. of retweets & favourites
§ Forum messages
  – 3 forums: med.over.net, avtomobilizem.com, kvarkadabra.net
  – customized extractors
  – metadata: topic, post URL, post timestamp, username, post id
§ News comments
  – 3 news portals: RTVSlo, Mladina, Reporter
  – customized extractors
  – metadata: article URL, article id, username, post timestamp, post id
§ [Blogs]
  – slWaC 2.0 (Erjavec and Ljubešić 2014)
  – “blog” in domain name
  – no metadata, mixed blog text and comments

Corpus composition by text source
[Figure: pie chart. Total: 161M tokens. Legend: Twitter; Forums: Avtomobilizem, Kvarkadabra, Med.over.net; Blogs; Comments: RTVSlo, Mladina, Reporter. Labelled slice sizes: 61M, 47M, 38M and 15M tokens.]

Corpus composition by authors, texts, words & tokens
[Figure: four bar charts by text type (tweet, forum, comment, blog). Totals: 85,500 authors, 4.8M texts, 161M tokens, 135M words.]

Corpus composition by year
[Figure: corpus composition by year.]

Corpus processing
§ Annotation
  – (almost standard) tokenization & sentence segmentation
  – lexical normalisation with CSMT
  – standard MSD tagging & lemmatization
§ Encoding
  – (currently) bespoke XML for metadata
  – annotated text in TEI P5
§ Exploration
  – (no)Sketch Engine

[Screenshots: TEI text; noSketch Engine concordances for “better” and “I”.]

3. Construction & annotation of the subcorpus of tweets

Corpus Tweet-sl 0.3.4
§ Additional tweets
  – tokens: 70M
  – words: 53M
  – tweets: 4.3M
§ Enriched metadata
  – at tweet level:
    • standardness (automatic)
    • sentiment (automatic)
  – at user level:
    • private/corporate (manual)
    • male/female (manual)
    • region (automatic)

Annotation of text standardness (Ljubešić et al. 2015)
§ 2 standardness levels
  – technical: T1–T3
  – linguistic: L1–L3
§ Dataset development
  – 50:50 standard & non-standard texts
  – ratio of no. of normalised tokens vs. no. of all tokens per text (0.1)
  – manual annotation
    • 900 texts single-annotated (development set)
    • 400 texts double-annotated (testing set)
§ Feature selection (29 features)
  – character-based:
    • repetitions of characters
    • ratio of alphabetic vs. non-alphabetic characters
    • ratio of vowels vs. consonants
  – token-based:
    • proportion of very short words
    • proportion of capitalised words
    • proportion of words not included in the Sloleks lexicon
§ Regressor
  – grid search hyperparameter optimisation via 10-fold cross-validation on the SVR regressor using RBF kernel
§ Evaluation
  – mean absolute error
    • T: 0.377
    • L: 0.424

Corpus composition by standardness
[Figure: two bar charts. “T & L standardness” with bars T1–T3 and L1–L3; “Combined standardness” with bars T1L1–T3L3.]

Annotation of sentiment (Jasmina Smailović)
§ Automatic annotation
  – large manually annotated dataset
  – SVM
§ Evaluation on 1,000 tweets (sports & politics)
  – Baseline = 37.7%
  – 1-annotator ~ 57.3%
  – 2-annotator = 62.1%
  – IAA = 76.5%

Corpus composition by account type & gender
[Figure: two bar charts. Account type: corporate, private. Gender: neutral, female, male.]

Annotation of regions (Čibej & Ljubešić 2015)
§ Data collection
  – harvesting by Aug 2015
  – 130K geo-located tweets
  – 1.7K unique users
§ Region annotation
  – ray-casting
  – 7 regions + Lj & MB
  – strict filtering
    • private accounts
    • non-standard tweets
    • ≥3 tweets sent
    • ≥90% tweets sent from the same region
  – learning corpus
    • 370 users (6.3%)
    • 75K tokens

Corpus composition by geographic region
[Figure: bar chart by region: Maribor, Panonska, Rovtarska, Koroška, Primorska, Dolenjska, Tujina (abroad), Gorenjska, Štajerska, Ljubljana.]

Tweet-sl 0.3.4 metadata
  id="tid.392972411765018626"
  name="007_delic"
  created="2013-10-23T11:14:58"
  retrieved="2013-10-24T05:20:32.036451"
  favorited="0"
  retweeted="0"
  in_reply="tid.392965997352591361"
  lang="sl" lang_prob="0.996788622457"
  standard_tech="T1" standard_tech_n="1.1"
  standard_ling="L1" standard_ling_n="1.2"
  sentiment="neutral"
  source="private"
  sex="female"
  geo="-"

4. Work in progress & future work

Rediacritization (Ljubešić et al., submitted)
§ Dataset development
  – Wikipedia, web texts & tweets
  – ≥100 characters per sentence
  – marked with L2 & L3 standardness level
  – ≥20% tokens with diacritics
§ Training set
  – token-aligned parallel dataset with original token & token with stripped diacritics
§ Approaches
  – lexicon approach
    • most frequent translation in training data
  – corpus approach
    • translation & language model combined
§ Results
  – baseline: 88%
  – charlifter: 93%
  – lexicon approach: 98.2%
  – corpus approach: 99.1%
§ Error analysis
  – ambiguities, tokenization errors, proper names
  – contextual (syntactic) information needed

Manual annotation of the Janes reference corpus
§ Goal
  1. tokenization & sentence splitting
  2. standardization
  3. POS-tagging & lemmatization
§ WebAnno
  – Annotation guidelines
  – Annotator training
  – Curation

To-dos - Linguistics
§ Lexicology
  – annotate foreign & adopted words
  – compile a dictionary of twitterese
  – detect semantic shifts & neologisms
§ Sociolinguistics
  – profanity
  – offensive language
  – flaming

To-dos - NLP
§ Corpus development
  – add Wikipedia discussion & user pages
  – add blogs
  – create subcorpora (e.g. politicians, celebrities)
  – develop a monitor corpus
§ Corpus annotation
  – improve normalization, tagging & lemmatization
  – CMC-aware tagset extension
  – add metadata (e.g. age)
§ Corpus dissemination
  – promise
    • searchable through an on-line concordancer
    • downloadable as a dataset
  – problems
    • terms of use issues
    • copyright issues
    • privacy issues
  – solutions
    • anonymisation, shuffling, sampling
    • different access levels

5. Project activities

JANES Summer Camp
§ Uni Lj, 24-28 Aug 2015
§ 25 high school students from all over Slovenia
§ Format
  – 5 days, 5 topics
  – lecture, exercise, project
  – project presentation
§ Invited talks and evening events
§ On-line slides and other teaching materials
§ Good media coverage

JANES Conference
§ Uni Lj, 25-27 Nov 2015
§ 15 reviewed papers, 23 authors, 50 delegates
§ Events
  – Best student paper award
  – Invited lecturer (Michael Beißwenger, TU Dortmund)
  – Panel discussion (What is Janes Slovene?)
  – Tutorial on statistics and R for linguists (Maja Miličević, Uni Belgrade)

JANES Express
§ Stops:
  – Zagreb, Croatia (4 Dec 2015)
  – Belgrade, Serbia (10 Dec 2015)
  – in cooperation with ReLDI
§ 1-day event:
  – student workshop (noSkE)
  – annotation workshop (WebAnno)
  – evening lecture (annotating UGC corpora)
  – 150 participants

Journal Special Issue 2016
§ Topic: computer-mediated communication
  – construction & distribution of CMC corpora
  – tools & resources for processing of CMC
  – corpus analyses of CMC
  – comparisons of CMC with standard and/or spoken discourse
  – sociolinguistic studies of CMC
  – code-switching in CMC
  – neologism & semantic shift detection in CMC
  – offensive language in CMC
§ Deadline: 31 March 2016
  – scientific, survey & position papers
  – reviews & project reports

Int. CMC conference 2016
§ Topic: computer-mediated communication
  – development of CMC corpora
  – annotation & analysis of CMC corpora
  – NLP for CMC corpora
§ Deadline:
  – extended abstract: 1 March 2016
  – full paper: 1 June 2016

http://nl.ijs.si/janes/
tenks :)
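
Illustration: the lexicon approach to rediacritization from section 4 ("most frequent translation in training data") could look roughly like the sketch below. This is not the project's implementation. The function names, the toy training tokens and the case-sensitive lookup are assumptions made for the example. Diacritics are stripped via Unicode decomposition, which handles Slovene č, š and ž but not letters such as đ that have no decomposed form.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(token: str) -> str:
    """Drop combining marks after NFD decomposition, e.g. 'čas' -> 'cas'."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def train_lexicon(tokens):
    """Map each stripped form to its most frequent original form.

    Pairing every token with its stripped version yields the token-aligned
    parallel data; counting the pairs gives the 'most frequent translation'.
    """
    counts = defaultdict(Counter)
    for token in tokens:
        counts[strip_diacritics(token)][token] += 1
    return {stripped: forms.most_common(1)[0][0]
            for stripped, forms in counts.items()}


def rediacritize(tokens, lexicon):
    """Replace each token by its lexicon entry; unknown tokens stay as they are."""
    return [lexicon.get(token, token) for token in tokens]


# Toy training data (hypothetical): 'že' occurs more often than 'ze'.
training_tokens = ["že", "že", "ze", "čas", "šola"]
lexicon = train_lexicon(training_tokens)
restored = rediacritize(["ze", "cas", "sola", "danes"], lexicon)
print(restored)
```

Because the lookup ignores context, it always returns the majority form of an ambiguous token such as "ze", which matches the error analysis on the slide: ambiguities need contextual information, and the corpus approach adds a language model to supply it.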