Dictionaries
Vocabulary management
and SKOS
Putting Business in the Lead
Jan Voskuil (Taxonic)
September 5th, 2014, Leipzig
SEMANTiCS 2014
Introduction
Jan Voskuil, Taxonic (co-founder)
Consultancy in Semantic Technology
“SKOS is used for findability, but it should also be used
for vocabulary management in organizations.
The business owns the dictionary, not IT”
• What are dictionaries and what for?
• SKOS: Tooling and benefits
• Practicalities

Dienst Justitiële Inrichtingen (DJI)
Custodial Institutions Agency
• Ca. 10,000 employees
• Ca. 70,000 inmates per year
• Ca. 50 facilities
Four groups of detainees
• Adult detainees
• Juvenile offenders
• Patients in forensic care
• Foreign nationals
Dictionaries: Benefits
• Knowledge management
• Quality of information
• Manageability
– If your systems contain 100K+ attribute names, then they
contain unstructured information (Dave McComb)
• Findability
– Document (DMS)
– Data (DBMS)
• Exchangeability
How many key words are enough?
Zipf’s Law
[Figure: word frequency plot, marking the frequency of the most frequent word and the frequency of the second most frequent word]
5000 words are enough to understand 95% of any corpus. For the other 5% you need to know the other 200,000 words
Source: Tiberius and Schoonheim, A Frequency Dictionary of Dutch, 2014
• Pocket dictionary: 5K
• General dictionary: 100K
• Lexicographic dictionary: 1M+
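The coverage claim on this slide can be checked against any corpus an organization has at hand. The following is a minimal sketch, not the method used by the cited source: the function name `coverage` and the toy corpus are assumptions made for illustration, and a real check would need a large, tokenized corpus.

```python
from collections import Counter

def coverage(tokens, k):
    """Return the fraction of all tokens covered by the k most frequent words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top_k = sum(n for _, n in counts.most_common(k))
    return top_k / total

# Toy corpus, purely illustrative; it says nothing about real coverage figures.
corpus = "the inmate is in the cell and the guard is in the hall".split()
for k in (1, 3, 5):
    print(k, round(coverage(corpus, k), 2))
```

Running the same function with k set to the size of a candidate pocket dictionary gives a rough indication of how much everyday text that dictionary would cover.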
The Real World
What is the correct definition of x?
Who decides this?
My project introduces new terms. How can I get these accepted?

Dictionary                        Owner
Begrippenwoordenboek DJI          Dept X
Begrippenlijst Project Y          Project Y
Mega Glossary                     ICT-Dept
Information chain dictionaries
Ketenwoordenboek Strafrecht       JustID
Ketenwoordenboek Vreemdelingen    JustID
Justitiethesaurus                 WODC
Data Dictionaries
Gegevenswoordenboek MITS          ICT-Dept
Datadictionary Tulp MIR           ICT-Dept

… It just does not work!
OLD SITUATION                           NEW SITUATION
Various lists                           Single source of truth
Various versions                        Single source of truth
Word-documents                          Intranet (Internet)
Distribution per mail                   Intranet (Internet)
Endless discussions                     Clear-cut governance
Responsibility of IT dept or project    Ownership by the business
Some How To’s
• Keep the dictionary lean and mean
– Create a “pocket dictionary”
– Example: 1200 key words
• Governance: be pragmatic
• Ownership within the business!
• Use clear, explanatory descriptions
– Language of the work force
– Avoid legal speak!
• Dictionary maintenance is a continuous process!
– Release cycle
– One major, four minor releases per year
– Major release is approved by senior executives
Why SKOS is so great: just enough semantics
• Semantic relations
– Compare one-dimensional lists
• A LIMITED number of STANDARDIZED semantic relations
• Intuitive, easy to understand
• “GENERALIZED CLASSIFICATION”
• Only most relevant info
– Broader, Narrower, Related Term
– Semantics is sufficiently vague
– Ideal for “pidginization”
– Use is far broader than Class Diagrams, ERDs and ontologies
[Diagram: concept hierarchy; each indented term is linked to the term above it by “narrower”]
Justitiabele (“Detainee”)
    Adult detainee
    Juvenile offender
    Patient in forensic care
    Foreign national
Criminal Law
    Penal Institution
Sex
    Male
    Female
    Unknown
    Undisclosed
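The hierarchy in the diagram can be captured with nothing more than the SKOS narrower and broader relations. The sketch below is a hypothetical in-memory model, not the tooling shown in the talk: the dictionary `NARROWER` and the helper names are assumptions, and plain labels stand in for the IRIs a real thesaurus would use.

```python
# Hypothetical in-memory model of two branches of the slide's hierarchy.
# skos:narrower is stored directly; skos:broader is derived as its inverse.
NARROWER = {
    "Justitiabele": [
        "Adult detainee",
        "Juvenile offender",
        "Patient in forensic care",
        "Foreign national",
    ],
    "Sex": ["Male", "Female", "Unknown", "Undisclosed"],
}

def broader_of(term):
    """Return the terms that list `term` as one of their narrower terms."""
    return sorted(parent for parent, children in NARROWER.items() if term in children)

def skos_triples(narrower):
    """Yield (subject, predicate, object) tuples, including the inverse relation."""
    for parent, children in narrower.items():
        for child in children:
            yield (parent, "skos:narrower", child)
            yield (child, "skos:broader", parent)

print(broader_of("Juvenile offender"))
for triple in skos_triples(NARROWER):
    print(triple)
```

Two relations and their inverse are enough to express both a taxonomy of detainees and a simple code list such as Sex, which is the "just enough semantics" point of the slide.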
Why SKOS is so great: tooling
Tooling: PoolParty Thesaurus Manager
End User View
SKOS is an Open Standard: Project Linking
http://vocabulary.wolterskluwer.de
prefLabel: Unfallverhütung (German: “accident prevention”)
From Wolters Kluwer: alternative labels, broaders, narrowers, related terms
From other thesauri on the web: DBPedia, lod.gesis.org, eurovoc.org
DJI and the POLICE have very different meanings for the word ARRESTANT
DO:
> RESPECT DIFFERENCES BETWEEN ORGANIZATIONS
> MAKE LEXICOGRAPHIC DIFFERENCES EXPLICIT USING LINKED THESAURI
DON’T:
> TRY MAKING ALL ORGANIZATIONS USE EXACTLY THE SAME LANGUAGE
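One way to make such a difference explicit is to give each organization its own concept and connect the two with a SKOS mapping property instead of merging them. The sketch below is illustrative only: the example.org IRIs, the data structures and the choice of skos:closeMatch are assumptions, not the actual DJI or police vocabularies.

```python
# Two hypothetical thesauri that both use the label "arrestant".
# The IRIs are placeholders; no real vocabulary is published under them.
DJI = "http://example.org/dji/"
POLICE = "http://example.org/police/"

concepts = {
    DJI + "arrestant": {"prefLabel": "arrestant", "scheme": "DJI"},
    POLICE + "arrestant": {"prefLabel": "arrestant", "scheme": "Police"},
}

# Mapping links between the thesauri: (subject, SKOS mapping property, object).
# skos:closeMatch is an assumption; the owners of both thesauri would decide this.
mappings = [
    (DJI + "arrestant", "skos:closeMatch", POLICE + "arrestant"),
]

def homonyms(concepts):
    """Group concept IRIs by label; labels shared by several concepts are
    candidates for an explicit mapping rather than a forced merge."""
    by_label = {}
    for iri, concept in concepts.items():
        by_label.setdefault(concept["prefLabel"].lower(), []).append(iri)
    return {label: iris for label, iris in by_label.items() if len(iris) > 1}

def is_interchangeable(a, b, mappings):
    """Treat two concepts as interchangeable only if linked by skos:exactMatch."""
    return any(p == "skos:exactMatch" and {s, o} == {a, b} for s, p, o in mappings)

print(homonyms(concepts))
print(is_interchangeable(DJI + "arrestant", POLICE + "arrestant", mappings))
```

Each organization keeps its own definition, while the link records that the two terms are related without claiming they mean the same thing.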
Conclusion and next step:
Linking Thesauri to Data Models
• Data models: not owned by business
– too detailed
– too complex
– NO ownership at the strategic level
• Thesauri
– Relatively abstract
– Relatively simple
– Ownership by the business
• SKOS bridges the gap
– With data models in RDF, the gap can be bridged!
THESAURUS AND DOMAIN MODELS: SCENARIO 1
THESAURUS
voc:4862 rdf:type skos:Concept
voc:4862 skos:prefLabel “Penitentiary Institution”
voc:4862 skos:definition “A prison, gaol or jail is a facility in which inmates are forcibly confined and denied a variety of freedoms under the authority of…
voc:4862 skos:broader “Detention Facility” (rdf:type skos:Concept)
voc:4862 skos:exactMatch eurovoc:C877
eurovoc:C877 skos:prefLabel “Penal Institution”@en
eurovoc:C877 skos:prefLabel “място за лишаване от свобода”@bg
voc:4862 owl:sameAs? skos:exactMatch? :penitentiaryInstitution
DOMAIN MODEL | Data dictionary
:pi_Dordrecht rdf:type :penitentiaryInstitution
:inmate#9818763 :isRegisteredAt :pi_Dordrecht
:inmate#9818763 :cell “B.23.a”
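Scenario 1 can be sketched as a small set of triples in which the thesaurus concept and the domain-model class are separate resources joined by one linking statement. This is one reading of the diagram, not a definitive model: the choice of skos:exactMatch as the link (the slide leaves the predicate open) and the function name are assumptions.

```python
# Scenario 1 as plain (subject, predicate, object) tuples.
# LINK is the open question on the slide; skos:exactMatch is picked here
# only to make the sketch concrete.
LINK = "skos:exactMatch"

triples = {
    ("voc:4862", "rdf:type", "skos:Concept"),
    ("voc:4862", "skos:prefLabel", "Penitentiary Institution"),
    ("voc:4862", LINK, ":penitentiaryInstitution"),
    (":pi_Dordrecht", "rdf:type", ":penitentiaryInstitution"),
}

def instances_for_concept(concept, triples, link=LINK):
    """Follow the link from a thesaurus concept to domain-model classes,
    then collect the instances of those classes."""
    classes = {o for s, p, o in triples if s == concept and p == link}
    return sorted(s for s, p, o in triples if p == "rdf:type" and o in classes)

print(instances_for_concept("voc:4862", triples))
```

The business-owned term and the IT-owned class keep separate identifiers, so each side can evolve on its own; the cost is one extra hop whenever data has to be found through the thesaurus.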
THESAURUS AND DOMAIN MODELS: SCENARIO 2
THESAURUS
:penitentiaryInstitution rdf:type skos:Concept
:penitentiaryInstitution skos:prefLabel “Penitentiary Institution”
:penitentiaryInstitution skos:definition “A prison, gaol or jail is a facility in which inmates are forcibly confined and denied a variety of freedoms under the authority of…
:penitentiaryInstitution skos:broader “Detention Facility” (rdf:type skos:Concept)
:penitentiaryInstitution skos:exactMatch eurovoc:C877
eurovoc:C877 skos:prefLabel “Penal Institution”@en
eurovoc:C877 skos:prefLabel “място за лишаване от свобода”@bg
DOMAIN MODEL | Data dictionary
:pi_Dordrecht rdf:type :penitentiaryInstitution
:inmate#9818763 :isRegisteredAt :pi_Dordrecht
:inmate#9818763 :cell “B.23.a”
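In Scenario 2 the same identifier appears to serve both as the thesaurus concept and as the class used in the data, so no linking statement is needed. The sketch below follows that reading of the diagram; the triple set and the function name are assumptions made for illustration.

```python
# Scenario 2 as plain (subject, predicate, object) tuples: one resource,
# :penitentiaryInstitution, is both a skos:Concept and the class of the data.
triples = {
    (":penitentiaryInstitution", "rdf:type", "skos:Concept"),
    (":penitentiaryInstitution", "skos:prefLabel", "Penitentiary Institution"),
    (":penitentiaryInstitution", "skos:exactMatch", "eurovoc:C877"),
    (":pi_Dordrecht", "rdf:type", ":penitentiaryInstitution"),
}

def labels_of_type(instance, triples):
    """Look up the business label for an instance's class directly,
    without following a separate mapping statement."""
    types = {o for s, p, o in triples if s == instance and p == "rdf:type"}
    return sorted(o for s, p, o in triples if s in types and p == "skos:prefLabel")

print(labels_of_type(":pi_Dordrecht", triples))
```

The lookup is one step shorter than with separate resources, but the thesaurus and the data model now share an identifier and therefore have to be governed together.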
jan.voskuil@taxonic.com
www.taxonic.com