Challenges and Big Opportunities for Data Scientists at VCCORP

Transcription

Challenges and Big Opportunities for Data Scientists at VCCORP
Hoang Anh Tuan
CTO, Admicro - VCCORP
tuanhoanganh@vccorp.vn
1
Content
—  Company Overview
—  Big Data at VCCORP
—  Our main challenges
2
COMPANY OVERVIEW
3
VCCORP Milestone
4
FOUNDERS
Mr. Vuong Vu Thang is the Founder and Chairman of VCCorp
His success in developing VCC to be the leading new-media company in Vietnam began when he founded his first online community, TTVNOL, from a garage start-up in 2000. After founding this network, he founded the first online media & news portal, Tintucvietnam, in 2002. In 2005, he founded the first private company in mobile value-added services and internet content, which formed the initial foundation of the current-day VCC.
As a natural entrepreneur, Thang has been the first mover in all the emerging internet sectors in Vietnam, including media content, social network, mobile content & services, and ecommerce. Naturally, VCC became an innovative leader of Vietnam's internet & media industry. As the technology strategist, he has been the chief architect of disruptive technologies in Vietnam in the last 10 years, including CMS, key portal technologies, and cloud computing, to name a few.
Mr. Vuong Vu Thang
Founder, Chairman
Mr. Nguyen The Tan is the Co-Founder and CEO of VCCorp
With unparalleled know-how in internet monetization, Nguyen The Tan's leadership is driving the growth of VCC. His results are apparent: he oversaw 100% annual growth in advertising, mobile services and ecommerce revenue for each of his last 6 years at VCC. As a strategic visionary within the industry, Nguyen The Tan's work has made a tremendous impact within the Vietnamese marketplace.
Before joining VCC, Tan was Vice-Director of one of the largest telco companies in Vietnam, Viettel Fixed Telecom. Before Viettel Telecom, he was the division director at a leading software and system integration company, CMC. There he built an e-library solutions platform and expanded the distribution of CMC's software and solutions towards broader segments within the market.
Mr. Nguyen The Tan
Co-Founder, CEO
5
VCCORP Overview
Overview
✓  First mover DNA
✓  50% YoY growth
✓  43M web audience
✓  38M mobile audience
✓  1,700 employees
Investors
6
VCCORP MARKET COVERAGE
—  43M internet user reach (97.6% of VN internet population)
—  38M mobile user reach (95% of VN smartphone population)
—  10,000+ online & mobile advertisers
—  100,000+ small business merchants
—  12M e-marketplace visitors & buyers
—  Largest ad network in Vietnam with 1,000+ publishers, including 200+ top publishers, 30 of them exclusive
—  22 leading products with presence in 20+ verticals; 14 sites are in the top 100 websites in Vietnam (news, finance, family, teenage, auto, high-tech, online advertising, B2C and C2C, mobile content consumption)
7
BIG DATA AT VCCORP
8
Big Data in VCCORP
—  In 2007, Big Data was applied early in the Baamboo Search Engine
—  Since 2009, a Big Data platform has been in place to serve the ad system at VCCORP
—  Currently, the Big Data platform is being developed and improved in these major areas:
    —  Advertisement
    —  Digital Content
    —  Ecommerce
    —  Game
—  Current staff: 100 Data Engineers
9
The challenges in VCCORP
—  Big Data skill sets in-house
—  The large-scale data
—  A huge number of specific problems, spread over many areas, which require creative problem solving and self-motivated people
—  Not enough human resources
10
System Info
11
OUR MAIN CHALLENGES
12
—  User behaviors
—  Ad Optimization
—  Core NLP and its application
—  News Distribution
—  Recommendation Engine
—  Vccorp Analytic
13
USER BEHAVIOR
14
User behavior analytics
—  Three main projects:
    —  Demographic: gender, age
    —  User profile: behavior, interest
    —  Cross devices: tracking users on multiple devices
15
Demographic - User profiling
—  Detect the user profile, including gender, age and user interests (12 base interests), both long term and short term, based on this data:
    —  Browsing history
    —  Keyword search history
    —  Time usage, time on site
—  Data:
    —  43M users on PC
    —  38M users on mobile
    —  1 terabyte of logging data
—  Result (accuracy):
    —  Gender: 82.5%
    —  Age: 67.5%
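The slide does not say how the long-term and short-term profiles are computed. As a minimal sketch, assuming a profile is simply the share of page views per interest category and that "short term" means a recent time window (the 7-day window, the category names and the function name are all hypothetical):

```python
from collections import Counter
from datetime import datetime, timedelta

def interest_profile(events, now, window_days=None):
    """Return the share of page views per interest category.

    events: list of (timestamp, category) pairs from a browsing history.
    window_days: None for the long-term profile, or a number of days
    for a short-term profile (hypothetical cutoff, not from the slides).
    """
    if window_days is not None:
        cutoff = now - timedelta(days=window_days)
        events = [(ts, cat) for ts, cat in events if ts >= cutoff]
    counts = Counter(cat for _, cat in events)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}

# Illustrative browsing history for one user.
now = datetime(2016, 6, 1)
history = [
    (datetime(2016, 1, 10), "auto"),
    (datetime(2016, 2, 3), "auto"),
    (datetime(2016, 3, 15), "finance"),
    (datetime(2016, 5, 30), "high-tech"),
    (datetime(2016, 5, 31), "high-tech"),
]
long_term = interest_profile(history, now)
short_term = interest_profile(history, now, window_days=7)
print(long_term)
print(short_term)
```

Here the long-term profile still reflects the older interest in "auto", while the short-term profile only sees the recent "high-tech" views.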
16
System overview
[Diagram components: Action in website; Spark Streaming (filter new URL); Update URL content; Update user action for batch processing; Long term and short term; Our LDA model; SVM (predict user profile); User Profile]
17
Demographic - Behavior
18
Benchmark
Benchmarking data: 43M users, 200M documents, 30,000 × 10^6 actions, 230 topics
VCCORP cluster: 20 nodes, 640 cores, 640 GB RAM

Model                       Time   Recall   Gender accuracy   Age accuracy
Our Model                   18h    92%      82.5%             67.5%
Old classification model    16h    92%      79.5%             63.4%
LDA with Spark MLlib        36h    91%      75.1%             60.1%
19
Cross Device
20
Cross device
—  Information used:
    —  Both user IP and timestamp on their devices
    —  Website and category history
    —  User demographics and user interests
—  Result:
    —  Accuracy: 60%
    —  Number of detected users: 11M
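The slides list the signals but not the matching method. One way to read the first two signals is sketched below: devices seen on the same IP within a short time gap become candidate pairs, and overlap in visited categories can then confirm or reject them. The one-hour gap, the Jaccard score and all identifiers are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(events, max_gap_seconds=3600):
    """Pair devices seen on the same IP within max_gap_seconds.

    events: list of (device_id, ip, unix_timestamp) tuples.
    Returns {(device_a, device_b): co-occurrence count}, with each
    pair stored in sorted order.
    """
    by_ip = defaultdict(list)
    for device, ip, ts in events:
        by_ip[ip].append((ts, device))
    pairs = defaultdict(int)
    for seen in by_ip.values():
        # Quadratic per IP; fine for a sketch, not for production volumes.
        for (ts_a, dev_a), (ts_b, dev_b) in combinations(seen, 2):
            if dev_a != dev_b and abs(ts_a - ts_b) <= max_gap_seconds:
                pairs[tuple(sorted((dev_a, dev_b)))] += 1
    return dict(pairs)

def jaccard(a, b):
    """Overlap between two category histories, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

events = [
    ("pc-1", "10.0.0.5", 1000),
    ("phone-9", "10.0.0.5", 1600),
    ("phone-7", "10.0.0.8", 1700),
    ("pc-1", "10.0.0.5", 90000),
    ("phone-9", "10.0.0.5", 90500),
]
pairs = candidate_pairs(events)
print(pairs)
print(jaccard({"auto", "news"}, {"auto", "news", "game"}))
```

A pair that co-occurs repeatedly and also shares category history is a stronger candidate for being the same user than a pair that shared an IP once.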
21
AD OPTIMIZATION
22
Admicro Overview
—  #1 ad network in Vietnam: covers 38% market share
—  200+ top publishers in Vietnam
—  10,000+ advertisers
—  4B page views per month
—  1.5B impressions per day
—  22 leading ad products
—  43M internet user reach (97.6% of VN internet population)
—  38M mobile user reach (95% of VN smartphone population)
23
Ad Optimization
—  The following advanced techniques were implemented:
    —  Personalization
    —  Audience Targeting Platform
    —  Real Time Bidding
    —  Retargeting
    —  Contextual Targeting
    —  SSP/DSP/DMP
24
Personalization
—  In traditional advertising, ads are displayed to everyone in a fixed location
—  By contrast, the personalization technique chooses the best-fit ads for each user:
    —  43M internet users
    —  10,000 ads
    —  430B estimated operations each time
—  Using multiple technologies:
    —  High load capacity web server
    —  Optimization algorithms
    —  Estimation and prediction algorithms
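The 430B figure follows from scoring every ad against every user. The short sketch below reproduces that arithmetic and then shows a toy best-fit selection for a single user. The overlap-based score, the ad catalog and the function name are hypothetical, since the slides do not describe the actual scoring model.

```python
import heapq

users = 43_000_000   # internet users, from the slide
ads = 10_000         # candidate ads, from the slide

# One scoring operation per (user, ad) pair.
pairs = users * ads
print(f"{pairs:,} user-ad pairs")

def best_ads(user_interests, catalog, k=2):
    """Pick the k ads whose topics overlap most with the user's interests.

    The overlap count is a stand-in score for illustration only.
    """
    return heapq.nlargest(k, catalog, key=lambda ad: len(user_interests & ad["topics"]))

catalog = [
    {"id": "ad-1", "topics": {"game"}},
    {"id": "ad-2", "topics": {"auto", "finance"}},
    {"id": "ad-3", "topics": {"auto"}},
]
top = best_ads({"auto", "finance"}, catalog, k=1)
print(top[0]["id"])
```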
25
Audience Targeting Platform
—  Advertisers can target their particular audience
—  A subset of the audience can be prebuilt by "set operators" on user properties:
    —  Location
    —  Demographic (gender, age, relationship…)
    —  Interest/Behavior
—  In particular, an audience can be made up from:
    —  A list of emails/phone numbers
    —  Automatically found similar audiences (look-alike)
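The "set operators" idea maps directly onto set intersection, union and difference. A minimal sketch, assuming each user property value is indexed as a set of user ids (the segment keys, ids and the `build_audience` helper are made up for illustration):

```python
# Hypothetical index: property value -> set of user ids.
segments = {
    "location:hanoi": {1, 2, 3, 4},
    "gender:female": {2, 3, 5},
    "age:18-24": {3, 4, 5},
    "interest:auto": {1, 4},
}

def build_audience(all_of=(), any_of=(), none_of=()):
    """Combine segments with intersection, union and difference."""
    result = set().union(*segments.values())  # start from every known user
    for key in all_of:                        # AND: must be in each segment
        result &= segments[key]
    if any_of:                                # OR: must be in at least one
        result &= set().union(*(segments[k] for k in any_of))
    for key in none_of:                       # NOT: must be in none
        result -= segments[key]
    return result

# Women in Hanoi who are not in the "auto" interest segment.
print(build_audience(all_of=["location:hanoi", "gender:female"],
                     none_of=["interest:auto"]))
# Anyone aged 18-24 or interested in autos.
print(build_audience(any_of=["age:18-24", "interest:auto"]))
```

An uploaded list of emails or phone numbers would just be one more set to intersect with, once it has been resolved to user ids.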
26
Real Time Bidding
—  A transaction (sell/purchase) of ad impressions is processed immediately when an audience member triggers an ad zone
—  Challenges:
    —  80 ms is the maximum time of a transaction
    —  1,000 sites in VN
    —  4.5 billion requests/day
    —  Transactions: $200,000/day
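To make the constraints concrete: 4.5 billion requests per day is roughly 52,000 requests per second on average, each of which must clear within 80 ms. The sketch below drops bids that miss the deadline and settles the rest with a second-price rule. The second-price rule, the floor price and the bidder names are assumptions; the slides do not state which auction type is used.

```python
from dataclasses import dataclass

DEADLINE_MS = 80  # maximum transaction time, from the slide

@dataclass
class Bid:
    bidder: str
    price: float       # bid for one impression (hypothetical unit)
    latency_ms: float  # how long the bidder took to respond

def run_auction(bids, floor_price=0.0):
    """Return (winner, clearing_price), or None if no valid bid arrived in time."""
    valid = [b for b in bids if b.latency_ms <= DEADLINE_MS and b.price >= floor_price]
    if not valid:
        return None
    valid.sort(key=lambda b: b.price, reverse=True)
    winner = valid[0]
    # Assumed second-price rule: pay the runner-up's bid, or the floor if alone.
    clearing = valid[1].price if len(valid) > 1 else floor_price
    return winner.bidder, clearing

# Average load implied by the slide's daily request count.
avg_requests_per_second = 4_500_000_000 / 86_400
print(round(avg_requests_per_second))

bids = [Bid("dsp-a", 2.5, 40), Bid("dsp-b", 3.0, 95), Bid("dsp-c", 1.8, 60)]
result = run_auction(bids)
print(result)
```

In this example the highest bid loses because it arrived after the 80 ms deadline, which is why bidder latency matters as much as price.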
28
Retargeting
—  Retargeting is a powerful branding and conversion optimization tool
—  Ads follow customers after they have shopped
—  Ads are displayed on:
    —  Any web page
    —  Multiple devices
29
Contextual Targeting
—  Contextual targeting looks at the category or keywords of the current page a consumer is viewing and then serves them ads that are highly relevant to that content.
Category target
Keyword target
—  We implemented:
    —  Content classification systems (LDAP)
    —  Keyword index and search engine system
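Keyword targeting can be pictured as an inverted index from keywords to ads, queried with the words on the current page. The sketch below is an illustration under simple assumptions: whitespace tokenization (Vietnamese text would first need word segmentation), exact keyword matches, and made-up ad ids and function names.

```python
from collections import defaultdict

def build_index(ads):
    """Map each keyword to the ids of the ads that target it."""
    index = defaultdict(set)
    for ad_id, keywords in ads.items():
        for kw in keywords:
            index[kw.lower()].add(ad_id)
    return index

def match_ads(page_text, index):
    """Rank ads by how many of their keywords appear on the page."""
    hits = defaultdict(int)
    for token in set(page_text.lower().split()):
        for ad_id in index.get(token, ()):
            hits[ad_id] += 1
    # Most keyword hits first; ties broken by ad id for a stable order.
    return sorted(hits, key=lambda ad_id: (-hits[ad_id], ad_id))

ads = {
    "ad-car": ["car", "engine", "sedan"],
    "ad-bank": ["loan", "bank"],
    "ad-phone": ["smartphone"],
}
index = build_index(ads)
page = "New sedan review: the engine is quiet and the car loan offers are good"
ranked = match_ads(page, index)
print(ranked)
```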
30
SSP/DSP/DMP
31
NEWS DISTRIBUTION
32
News distribution
—  VCCORP owns many large online newspapers in VN
—  We are proud to be the first company in Vietnam able to implement an automatic method for publishing news
Analytic System
Auto publish
33
News distribution
—  A key challenge for news websites is helping users find articles that are interesting to read
—  Many techniques are applied, such as:
    —  Real-time engagement statistics
    —  Personalization
    —  NLP
    —  Event detection
    —  Trending detection
    —  Breaking news detection
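The slides name trending detection without describing the method. One common baseline, shown here purely as an assumption, compares the current window's view count for a topic with its own recent history and flags large positive deviations. The threshold, the topic names and the `trending` function are hypothetical.

```python
from statistics import mean, pstdev

def trending(history, current, min_sigma=3.0):
    """Flag topics whose current count is far above their usual level.

    history: dict topic -> list of past per-window view counts.
    current: dict topic -> view count in the current window.
    Returns [(topic, score)] sorted by score, highest first.
    """
    flagged = []
    for topic, now in current.items():
        past = history.get(topic, [])
        if len(past) < 2:
            continue  # not enough history to judge
        mu, sigma = mean(past), pstdev(past)
        score = (now - mu) / (sigma or 1.0)  # avoid division by zero
        if score >= min_sigma:
            flagged.append((topic, round(score, 2)))
    return sorted(flagged, key=lambda item: -item[1])

history = {"football": [100, 110, 90, 100], "storm": [5, 7, 6, 6]}
current = {"football": 105, "storm": 60}
hot = trending(history, current)
print(hot)
```

A busy topic with a normal count ("football") is not flagged, while a quiet topic with a sudden spike ("storm") is.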
34
35
36
CORE NLP
37
CORE NLP

System                        Accuracy (%) (VCCorp)     Others                         Speed (/s)
Word Segmentation             98.8                      97.0 (VLSp)                    47,855 tokens
POS tagging                   94.5                      93 (VLSp)                      ~50k tokens
NER                           87.0                      85.0 (Baomoi.com)              ~22k tokens
Chunking                      84.0                      81.0 (VLSp)                    800 tokens
Universal Dependency Parser   72.0 (UAS), 66.0 (LAS)    68.28% (UAS), 66.30% (LAS)     1200 sentences
Co-reference resolution       57.0                      N/A                            106 docs

VLSp: http://vlsp.hpda.vn:8080/demo/?page=resources
38
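Entity linking Application
question: "Ai là giám đốc Ngân hàng ACB?" ("Who is the director of ACB Bank?")
[entity1, relation, entity2] … [entityN, relation, entityN]
Knowledge base
answer: "Ông Nguyễn Văn Hòa là giám đốc Ngân hàng ACB" ("Mr. Nguyễn Văn Hòa is the director of ACB Bank")
Query and retrieve information from KB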
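The question-answering flow above can be pictured as pattern matching over [entity1, relation, entity2] triples. In the sketch below the knowledge base is a plain Python list and `query` is a hypothetical helper. The first triple is derived from the answer sentence on the slide (splitting it into relation and entities this way is an assumption), and the other two come from the extraction example in this deck.

```python
# Toy knowledge base of (entity1, relation, entity2) triples.
triples = [
    ("Ông Nguyễn Văn Hòa", "là giám đốc", "Ngân hàng ACB"),
    ("Ông Dũng", "là", "Thạc_sĩ ngành cơ_điện"),
    ("Ông Dũng", "là", "Thạc_sĩ Quan_hệ quốc_tế"),
]

def query(kb, entity1=None, relation=None, entity2=None):
    """Return every triple matching the given fields; None is a wildcard."""
    pattern = (entity1, relation, entity2)
    return [t for t in kb if all(p is None or p == v for p, v in zip(pattern, t))]

# "Ai là giám đốc Ngân hàng ACB?" becomes (?, là giám đốc, Ngân hàng ACB).
answers = [e1 for e1, _, _ in query(triples, relation="là giám đốc",
                                    entity2="Ngân hàng ACB")]
print(answers)
```

The hard part in practice is turning the free-text question into such a pattern, which is where the parsing and entity linking steps come in.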
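1
Input sentence <Raw text>:
"Ông Dũng là Thạc_sĩ ngành cơ_điện Trường Đại_học New_York (Mỹ) và Thạc_sĩ Quan_hệ quốc_tế Trường Đại_học Georgetown (Mỹ)."
(English: "Mr. Dũng holds a Master's degree in electromechanics from New York University (USA) and a Master's degree in International Relations from Georgetown University (USA).")
Dependency tree <CoNLL format>
Output links <Entity1, relation, Entity2>:
[Ông Dũng, là, Thạc_sĩ]
[Ông Dũng, là, Thạc_sĩ ngành cơ_điện]
[Ông Dũng, là, Thạc_sĩ Quan_hệ quốc_tế]
…
2
Input sentence <Raw text>:
"Trong năm 2015, lãi trước thuế của BIDV đạt 7.944 tỷ đồng (tăng 26,16%), lãi sau thuế đạt 6.382 tỷ đồng (tăng hơn 28%), vốn_điều_lệ của BIDV tăng lên 34.187 tỷ đồng, tổng tài_sản đạt 850.748 tỷ đồng (tăng 30,8%)."
(English: "In 2015, BIDV's pre-tax profit reached 7,944 billion dong (up 26.16%), after-tax profit reached 6,382 billion dong (up more than 28%), BIDV's charter capital rose to 34,187 billion dong, and total assets reached 850,748 billion dong (up 30.8%).")
Dependency tree <CoNLL format>
Output links <Entity1, relation, Entity2>:
[lãi trước thuế BIDV, đạt 7.944 tỷ trong, năm 2015]
[lãi sau thuế BIDV, đạt 6.381 tỷ đồng trong, năm 2015]
[vốn điều lệ BIDV, tăng 34.187 tỷ đồng trong, năm 2015]
[tổng tài sản BIDV, đạt 850.748 tỷ đồng trong, năm 2015]
…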
CORE NLP - Knowledge Network
—  Building Relevant Brand using Deep Learning and Core NLP
42
CORE NLP - NER
43
Sentiment Analysis
44
Sentiment Analysis
—  Level: document, sentence, entity, aspect
—  Data: ~1 billion records, 1 TB of processing data
    —  Facebook: 5M pages, 500k groups
    —  News: 500
    —  Forums: 200
—  Approach: NLP + Topic Modeling + Deep Learning
—  Accuracy: ~70%
45
Sentiment Lexicon (Social)
46
Aspect-based sentiment analysis
47
RECOMMENDATION ENGINE
48
Recommendation Engine
—  Building a purchase recommendation system for e-commerce sites
—  Our suggestions are based on:
    —  Purchase history and web browsing history
    —  Product and buyer knowledge
49
Recommendation Engine
—  The algorithms applied:
    —  NER + Deep Neural Network
    —  Network and product knowledge
    —  Collaborative filtering
    —  F-CTR: combines the collaborative algorithm and product knowledge
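Of the algorithms listed, collaborative filtering is the easiest to show in a few lines. The sketch below is a generic item-based variant with cosine similarity over binary purchase data. It is not VCCORP's implementation, and the users, items and function names are invented for the example.

```python
from collections import defaultdict
from math import sqrt

def item_similarity(purchases):
    """Cosine similarity between items, from binary user-item purchases.

    purchases: dict user -> set of purchased item ids.
    """
    buyers = defaultdict(set)
    for user, items in purchases.items():
        for item in items:
            buyers[item].add(user)
    sim = defaultdict(dict)
    for a in buyers:
        for b in buyers:
            if a != b:
                common = len(buyers[a] & buyers[b])
                if common:
                    sim[a][b] = common / sqrt(len(buyers[a]) * len(buyers[b]))
    return sim

def recommend(user, purchases, sim, k=3):
    """Score unseen items by their similarity to what the user already bought."""
    owned = purchases[user]
    scores = defaultdict(float)
    for item in owned:
        for other, s in sim.get(item, {}).items():
            if other not in owned:
                scores[other] += s
    return sorted(scores, key=lambda i: (-scores[i], i))[:k]

purchases = {
    "u1": {"phone", "case"},
    "u2": {"phone", "case", "charger"},
    "u3": {"phone", "charger"},
    "u4": {"laptop"},
}
sim = item_similarity(purchases)
suggestions = recommend("u1", purchases, sim)
print(suggestions)
```

Pure collaborative filtering cannot recommend "laptop" to anyone here because it has no co-purchases, which is the kind of gap that product knowledge is meant to fill.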
50
Recommendation Engine – deep learning
51
Recommendation Engine
52
Recommendation Engine – Collaborative Filtering
53
Recommendation Engine – Ranking
[Diagram: NER, Knowledge and History feed the Recommender]
Rank (knowledge, NER, history)
$r(i) = \sum_{k} \lambda_k \, r_k(i)$
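The ranking formula is a weighted sum of the per-source scores $r_k(i)$ with weights $\lambda_k$. A small sketch of that combination follows; the weights, scores and item names are made-up values, and an item missing from a source is treated as scoring 0 there (an assumption).

```python
def combined_rank(scores_by_source, weights):
    """Compute r(i) = sum_k lambda_k * r_k(i) and sort items by r(i).

    scores_by_source: dict source name k -> {item i: score r_k(i)}.
    weights: dict source name k -> weight lambda_k.
    """
    items = set().union(*(s.keys() for s in scores_by_source.values()))
    combined = {
        item: sum(weights[k] * scores_by_source[k].get(item, 0.0) for k in weights)
        for item in items
    }
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))

scores = {
    "knowledge": {"A": 0.9, "B": 0.2},
    "ner": {"A": 0.1, "B": 0.8, "C": 0.5},
    "history": {"B": 0.6, "C": 0.4},
}
weights = {"knowledge": 0.5, "ner": 0.3, "history": 0.2}  # hypothetical lambdas
ranking = combined_rank(scores, weights)
print(ranking)
```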
54
REC Performance
45% increase in traffic from the Recommendation Engine boxes
55
Recommendation Engine - News
56
VCCORP ANALYTIC
57
Vccorp Analytic
—  Developing an analytics tool for websites that performs well compared with Google Analytics (GA) in Vietnam
—  Technologies:
    —  NoSQL selected as the data warehouse for large-scale data analysis
    —  Real-time analytics: streaming logging
58
Vccorp Analytic - architecture
59
Vccorp Analytic – Framework
60
Vccorp Analytic
—  The algorithms applied:
    —  Sampling data
    —  Abnormal detection (remove fault clicks/sessions)
—  Results: a good candidate to replace GA, ensuring both accuracy and performance
61
Sampling Data
[Diagram: population (size = N) split into strata 1 to 4 (sizes N1, N2, N3, N4), with samples s1, s2, s3, s4 and the label RSWR]
-  N: size of the sampling data
-  n: total number of strata
-  $N_i$: size of the $i^{th}$ stratum
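The diagram suggests stratified sampling. The sketch below assumes that RSWR stands for random sampling with replacement and that each stratum contributes to the sample in proportion to its size. Both are assumptions, as are the function and stratum names.

```python
import random

def stratified_sample(strata, sample_size, seed=None):
    """Draw a proportional stratified sample, with replacement inside each stratum.

    strata: dict stratum name -> list of records.
    sample_size: target total number of sampled records.
    """
    rng = random.Random(seed)
    total = sum(len(rows) for rows in strata.values())
    sample = {}
    for name, rows in strata.items():
        if not rows:
            continue  # nothing to draw from an empty stratum
        # Proportional allocation, with at least one record per stratum.
        n_i = max(1, round(sample_size * len(rows) / total))
        sample[name] = rng.choices(rows, k=n_i)  # sampling with replacement
    return sample

strata = {
    "s1": list(range(600)),
    "s2": list(range(300)),
    "s3": list(range(100)),
}
sample = stratified_sample(strata, sample_size=100, seed=0)
print({name: len(rows) for name, rows in sample.items()})
```

Sampling per stratum keeps small but distinct traffic groups represented, which a single uniform sample over the whole log might miss.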
Vccorp Analytic – abnormal detection
Regression
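The slide only names regression as the technique. One plausible reading, sketched here as an assumption, is to fit a line relating two traffic metrics (for example page views and clicks per site) and flag points whose residual is unusually large. The metric choice, the threshold of two standard deviations and the function names are hypothetical.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def abnormal_points(xs, ys, threshold=2.0):
    """Return indices whose residual exceeds `threshold` standard deviations."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    sigma = pstdev(residuals) or 1.0  # avoid division by zero on a perfect fit
    return [i for i, r in enumerate(residuals) if abs(r) / sigma > threshold]

page_views = [100, 200, 300, 400, 500, 600, 700, 800]
clicks = [2, 4, 6, 8, 10, 60, 14, 16]  # index 5 has far too many clicks
suspects = abnormal_points(page_views, clicks)
print(suspects)
```

A large outlier also pulls the fitted line toward itself, so a production system would likely refit after removing flagged points or use a robust regression.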
63
Vccorp Analytic
64
One More Thing…
65
Thanks
66
