

Semantic Technologies in MarkLogic
Stephen Buxton
Micah Dubinko
John Snelson
April 9, 2013
Today's Talk
• Semantics and MarkLogic – An Overview
Stephen Buxton
• APIs and Applications – Making It All Work
Micah Dubinko
• RDF and SPARQL in MarkLogic – The Details
John Snelson
Rich MarkLogic Applications .. Made Richer
Name: John Smith
Affiliation: IBM
Timezone: PST
Committer: Hadoop
Search With Real-World Context
Query Facts, Documents, Values Together
What is Semantics?
A different way of organizing and searching information
Data stored in triples
Expressed as Subject : Predicate : Object
"John Smith" : livesIn : "London"
"London" : isIn : "England"
Rules tell us something about the triples
If (A livesIn X) AND (X isIn Y) then (A livesIn Y)
Inference: "John Smith" : livesIn : "England"
Language: SPARQL is a language designed to query triples.
It looks a bit like SQL
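The rule and inference on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, using in-memory tuples; it says nothing about how MarkLogic evaluates rules, and the function name `infer_lives_in` is made up for this sketch.

```python
# Triples are (subject, predicate, object) tuples, as on the slide.
triples = {
    ("John Smith", "livesIn", "London"),
    ("London", "isIn", "England"),
}

def infer_lives_in(facts):
    """Apply: if (A livesIn X) and (X isIn Y) then (A livesIn Y).

    The rule is re-applied until no new triple appears, so chains of
    isIn facts are followed all the way up.
    """
    facts = set(facts)
    while True:
        new = {
            (a, "livesIn", y)
            for (a, p1, x) in facts if p1 == "livesIn"
            for (x2, p2, y) in facts if p2 == "isIn" and x2 == x
        } - facts
        if not new:
            return facts
        facts |= new

# Only the triples that were not stated explicitly.
inferred = infer_lives_in(triples) - triples
print(inferred)
```

The inferred set holds the one triple the slide names: "John Smith" : livesIn : "England".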
Why do you care about Semantics?
Companies and organizations across all verticals
• Publishing: Dynamic Semantic Publishing (BBC)
Manage and leverage facts + documents for a rich user experience
• Pharma: facts about drugs + reports on clinical trials
Find new cures for diseases
Make decisions about what to research next
• Financial Services: reduce risk, comply with regulations
Report on exposure
Know where each piece of data came from
• Government Agencies: facts on file + intelligence reports
Find bad guys
• Civilian Government: Open Data
Open Government through Open Data
Dynamic Semantic Publishing
BBC Sports
The Challenge
Size and Complexity:
• # of athletes
• # of teams
• # of assets (match reports, statistics, etc.)
• # of relations (facts)
• Rich user experience
See information in context
Personalize content
Easy navigation
Intelligently serve ads (outside of UK)
• Manageable
Static pages?
Too many, too fast-changing
• Limited number of journalists
Automate as much as possible
BBC page for "West Ham" shows:
• News story about Andy Carroll
Carroll playsFor West Ham
• Latest results for West Ham
West Ham isIn this match
• League table for Premier League
West Ham playsIn Premier League
• Video/audio/news related to
West Ham United Football Club
a West Ham player
West Ham's manager
West Ham's league
West Ham's venue
Dynamic Semantic Publishing: A Solution
Triple Store
• Store, manage documents
• Metadata about documents
Tagged by journalists
Added (semi-)automatically
• Store, manage values
• Full-Text search
• Performance, scalability
• Robustness
• Facts reported by journalists
• Real-world facts from the Open Data Web
Dynamic Semantic Publishing: A Solution
MarkLogic with Triple Store
• At query time, dynamically aggregate stories, blogs, feeds, images, profiles, results, statistics, videos for a particular concept such as "West Ham".
(See Jem Rayfield, BBC)
• "We are not publishing pages, but publishing content as assets which are then organized by the metadata dynamically into pages."
(John O'Donovan, BBC and PA)
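The query-time aggregation described above can be sketched as follows. The facts, asset titles, and function names are hypothetical stand-ins chosen for illustration; they are not taken from the BBC system or from MarkLogic.

```python
# Hypothetical facts and tagged assets, loosely modelled on the West Ham example.
facts = [
    ("Andy Carroll", "playsFor", "West Ham"),
    ("West Ham", "playsIn", "Premier League"),
]
assets = [
    {"title": "News story about Andy Carroll", "about": "Andy Carroll"},
    {"title": "Premier League table", "about": "Premier League"},
    {"title": "West Ham match report", "about": "West Ham"},
    {"title": "Cricket scores", "about": "Cricket"},
]

def related_concepts(concept, facts):
    """Return the concept plus everything one fact away from it, in either direction."""
    related = {concept}
    for s, _p, o in facts:
        if s == concept:
            related.add(o)
        elif o == concept:
            related.add(s)
    return related

def build_page(concept, facts, assets):
    """Select, at query time, every asset tagged with a related concept."""
    wanted = related_concepts(concept, facts)
    return [a["title"] for a in assets if a["about"] in wanted]

page = build_page("West Ham", facts, assets)
print(page)
```

Nothing here is a stored page: adding one new fact or one newly tagged asset changes what the next call to `build_page` returns.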
MarkLogic Semantics
New features under development
• RDF data store
• Special-purpose triples index
• MarkLogic Server includes a triple store!
• Query RDF with native SPARQL
• Query across triples, documents, values
XQuery and Triples
Load triples from an external source
Construct a triple in XQuery
Extract triples from a document…
A mini case study
<td class="ranking">1</td>
<td class="city-name"><a
<td class="price-index">267</td>
Extracting Facts
Case study continued
(: Fetch the page, then keep only the table rows that contain a ranking cell :)
let $html := xdmp:document-get(…)
let $rows := ($html//html:tr)[html:td/@class eq 'ranking']
(: Builder that turns (subject, predicate, object) into a triple :)
let $build := sem:rdf-builder(sem:prefixes("my:"))
for $row in $rows
(: One blank node per row, named after its rank :)
let $node := "_:" || $row/html:td[@class eq 'ranking']
return (
  $build($node, "my:rank", xs:decimal( $row/html:td[@class eq 'ranking'] )),
  $build($node, "rdfs:label", xs:string( $row/html:td[@class eq 'city-name'] )),
  $build($node, "my:cola", xs:int( $row/html:td[@class eq 'price-index'] ))
)
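For readers who want to try the same extraction outside MarkLogic, here is a rough Python equivalent using only the standard library. The sample HTML, the city name, and the class and function names are assumptions made for this sketch; the rank is also stored as an `int` here, where the XQuery above uses `xs:decimal`.

```python
from html.parser import HTMLParser

# Hypothetical input shaped like the slide's fragment; the city name is made up.
SAMPLE = (
    '<table><tr><td class="ranking">1</td>'
    '<td class="city-name"><a href="#">Example City</a></td>'
    '<td class="price-index">267</td></tr></table>'
)

class RowParser(HTMLParser):
    """Collect the text of each classed <td> cell, one dict per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = {}
        elif tag == "td" and self.row is not None:
            self.cell = dict(attrs).get("class")
            if self.cell:
                self.row[self.cell] = ""

    def handle_data(self, data):
        # Text inside nested tags (such as <a>) still belongs to the open cell.
        if self.cell and self.row is not None:
            self.row[self.cell] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None

def extract_triples(html):
    """Turn each ranked row into three (subject, predicate, object) tuples."""
    parser = RowParser()
    parser.feed(html)
    triples = []
    for row in parser.rows:
        if "ranking" not in row:
            continue  # mirrors the [html:td/@class eq 'ranking'] filter
        node = "_:" + row["ranking"]
        triples.append((node, "my:rank", int(row["ranking"])))
        triples.append((node, "rdfs:label", row.get("city-name", "")))
        triples.append((node, "my:cola", int(row["price-index"])))
    return triples

extracted = extract_triples(SAMPLE)
print(extracted)
```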
It’s all about the connections
Subjects/Objects can point to database URIs
@prefix db: <>.
@prefix foaf: <>.
_:Person1 foaf:name "Micah Dubinko".
_:Person1 foaf:mbox <>.
_:Person1 foaf:depiction <>.
Run a SPARQL query and, from the results, extract the image:
let $img-url := (…get from SPARQL…)
return fn:doc($img-url)
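The same "follow the triple to a document" pattern, sketched in Python with a dictionary standing in for the database. The IRIs were elided on the slide, so the URI below is a placeholder, and `depiction_of` is a hypothetical helper rather than any MarkLogic function.

```python
# Hypothetical stand-in for a document database keyed by URI.
documents = {
    "/images/person1.jpg": b"...image bytes...",
}

# Triples mirroring the Turtle above; the depiction URI is a placeholder.
person_triples = [
    ("_:Person1", "foaf:name", "Micah Dubinko"),
    ("_:Person1", "foaf:depiction", "/images/person1.jpg"),
]

def depiction_of(name, triples):
    """Find the subject with this foaf:name, then return its foaf:depiction URI."""
    subjects = {s for s, p, o in triples if p == "foaf:name" and o == name}
    for s, p, o in triples:
        if s in subjects and p == "foaf:depiction":
            return o
    return None

img_url = depiction_of("Micah Dubinko", person_triples)
image = documents.get(img_url)  # plays the role of fn:doc($img-url)
```

The object of the triple is just a string until it is used as a key into the document store, which is what lets facts point at documents.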
It’s all about the connections
Documents can contain triple markup
<title>News for April 9, 2013</title>
Query documents based on contained triples
cts:triple-range-query((), (), $match, "sameTerm")
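To make the idea of "query documents based on contained triples" concrete, here is a small Python sketch. The documents, the `mentions` predicate, and the function name are all hypothetical; `None` plays the role of the empty sequences in the query above, meaning "match anything".

```python
# Hypothetical documents, each carrying its own embedded triples.
docs = {
    "/news/2013-04-09.xml": {
        "title": "News for April 9, 2013",
        "triples": [("/news/2013-04-09.xml", "mentions", "MarkLogic")],
    },
    "/news/2013-04-10.xml": {
        "title": "News for April 10, 2013",
        "triples": [("/news/2013-04-10.xml", "mentions", "London")],
    },
}

def docs_with_triple(docs, subject=None, predicate=None, obj=None):
    """Return the URIs of documents containing a triple that matches the pattern.

    A None component is a wildcard.
    """
    pattern = (subject, predicate, obj)
    return sorted(
        uri
        for uri, doc in docs.items()
        if any(
            all(want is None or want == got for want, got in zip(pattern, triple))
            for triple in doc["triples"]
        )
    )

matching_docs = docs_with_triple(docs, obj="MarkLogic")
print(matching_docs)
```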
Blended queries
Semantic queries with Search API
A semantic query can drive the construction of a custom query
declare function semquery:parse(…) {
  (: 1. Run a SPARQL query to determine the meaning of "South Bay" :)
  (: 2. Use the SPARQL results to construct a particular polygon :)
  (: 3. Feed the polygon into the generated query that the Search API will use :)
  …
};
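The three steps above can be illustrated with a self-contained Python sketch: look up a place name, get a polygon back, and use it to filter documents by location. The coordinates are rough illustrative values rather than real boundaries, and every name in the code is an assumption made for this example.

```python
# Hypothetical "semantic" knowledge: a place name mapped to polygon vertices (lon, lat).
region_facts = {
    "South Bay": [(-122.2, 37.2), (-121.7, 37.2), (-121.7, 37.5), (-122.2, 37.5)],
}

# Hypothetical documents, each tagged with a point.
geo_docs = [
    {"uri": "/doc/a", "point": (-121.9, 37.3)},
    {"uri": "/doc/b", "point": (-122.4, 37.8)},
]

def point_in_polygon(point, polygon):
    """Ray casting: count the polygon edges crossed by a ray going right from the point."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y):  # edge spans the ray's y coordinate
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def search_region(term, documents):
    """Resolve the term to a polygon, then keep the documents that fall inside it."""
    polygon = region_facts.get(term)
    if polygon is None:
        return []
    return [d["uri"] for d in documents if point_in_polygon(d["point"], polygon)]

hits = search_region("South Bay", geo_docs)
print(hits)
```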
Three approaches: graphs, queries, and things
CRUD on graphs
GET /v1/graphs?default
PUT /v1/graphs?graph=
DELETE /v1/graphs?graph=
Interoperable query support
POST /v1/graphs/sparql
(…SPARQL in POST body…)
Wander through your data
GET /v1/things?iri=
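A minimal Python sketch of how a client might build these requests, using only the standard library. The paths and parameter names are copied from the slide; the host, port, and helper names are assumptions, and since the slide describes features under development, the shipped API may differ. The requests are only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://localhost:8000"  # hypothetical host and port

def graph_request(method, graph_iri=None):
    """Build a CRUD request for a graph; no IRI means the default graph."""
    query = "default" if graph_iri is None else urlencode({"graph": graph_iri})
    return Request(BASE + "/v1/graphs?" + query, method=method)

def sparql_request(sparql):
    """Build a POST whose body is the SPARQL text."""
    return Request(BASE + "/v1/graphs/sparql", data=sparql.encode("utf-8"), method="POST")

def thing_request(iri):
    """Build a GET for browsing a single resource by IRI."""
    query = urlencode({"iri": iri})
    return Request(BASE + "/v1/things?" + query, method="GET")

req = graph_request("PUT", "http://example.org/g1")
print(req.get_method(), req.full_url)
```

Passing any of these objects to `urllib.request.urlopen` would send it; authentication and content-type headers are left out because the slide does not specify them.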
What is RDF?
What is RDF?
Triple granularity
Open world assumption
Joins - the cost of granularity
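"Joins - the cost of granularity" can be seen in a toy Python example: one record becomes one triple per field, and rebuilding the record then takes one lookup per field. The record and function names are hypothetical, and the lookup counter is only a crude proxy for join work.

```python
# One denormalized record (hypothetical data).
record = {
    "id": "person/1",
    "first-name": "John",
    "last-name": "Smith",
    "birth-place": "London",
}

def to_triples(rec):
    """Split a record into one triple per field, with the id as subject."""
    return [(rec["id"], key, value) for key, value in rec.items() if key != "id"]

def reassemble(subject, triples, fields):
    """Rebuild a record; each requested field is a separate lookup joined on the subject."""
    out, lookups = {"id": subject}, 0
    for field in fields:
        lookups += 1
        for s, p, o in triples:
            if s == subject and p == field:
                out[field] = o
    return out, lookups

record_triples = to_triples(record)
rebuilt, lookups = reassemble(
    "person/1", record_triples, ["first-name", "last-name", "birth-place"]
)
print(rebuilt, lookups)
```

Under the open world assumption, a field with no matching triple is simply absent from `rebuilt`: unknown, not false.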
Why use RDF?
• Born or extracted to RDF
• Denormalize into XML by default
• Lift data into RDF if you need to:
combine it with disparate data sources
navigate it like a graph
use it for relationships or taxonomy
expose it as RDF to end users
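As an illustration of "navigate it like a graph", the sketch below walks outward from one node along triples, breadth first. The data reuses the earlier John Smith example plus one extra fact added for illustration; the function name is an assumption.

```python
from collections import deque

graph_triples = [
    ("John Smith", "livesIn", "London"),
    ("London", "isIn", "England"),
    ("England", "isIn", "United Kingdom"),  # extra fact added for this sketch
]

def reachable(start, triples):
    """Breadth-first walk from subject to object; returns nodes in visit order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for s, _p, o in triples:
            if s == node and o not in seen:
                seen.add(o)
                queue.append(o)
    return order

path = reachable("John Smith", graph_triples)
print(path)
```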
Copyright © 2013 MarkLogic® Corporation. All rights reserved.
Semantics Architecture
Triple Index
3 triple orders
Cached for performance
Works seamlessly with other indexes
350 bytes per triple on disk
1 billion+ triples per host
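One way to picture "3 triple orders" is a store that keeps the same triples sorted three different ways, so that any bound prefix can be answered with a range scan. The slide does not say which three orders MarkLogic keeps, so the permutations below are an assumption, and this in-memory class is only a teaching sketch (it also assumes all terms are strings so they sort cleanly).

```python
from bisect import bisect_left

# Assumed permutations of (subject, predicate, object) positions.
ORDERS = {"spo": (0, 1, 2), "pos": (1, 2, 0), "osp": (2, 0, 1)}

class TripleIndex:
    def __init__(self, triples):
        # One sorted copy of the triples per order.
        self.by_order = {
            name: sorted(tuple(t[i] for i in perm) for t in triples)
            for name, perm in ORDERS.items()
        }

    @staticmethod
    def _bound_prefix(pattern, perm):
        """Number of leading positions in this order that the pattern binds."""
        n = 0
        for i in perm:
            if pattern[i] is None:
                break
            n += 1
        return n

    def match(self, s=None, p=None, o=None):
        """Answer a pattern by range-scanning the order with the longest bound prefix."""
        pattern = (s, p, o)
        name, perm = max(ORDERS.items(), key=lambda kv: self._bound_prefix(pattern, kv[1]))
        k = self._bound_prefix(pattern, perm)
        prefix = tuple(pattern[i] for i in perm[:k])
        rows = self.by_order[name]
        results = []
        for row in rows[bisect_left(rows, prefix):]:
            if row[:k] != prefix:
                break  # left the contiguous run that shares the prefix
            triple = [None, None, None]
            for pos, i in enumerate(perm):
                triple[i] = row[pos]  # undo the permutation
            if all(w is None or w == g for w, g in zip(pattern, triple)):
                results.append(tuple(triple))
        return results

index = TripleIndex([
    ("John Smith", "livesIn", "London"),
    ("London", "isIn", "England"),
    ("Jane Doe", "livesIn", "London"),
])
londoners = index.match(p="livesIn", o="London")
print(londoners)
```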
select * where {
  ?person :birth-place ?place ;
          :first-name "John" .
}
Executed using the triple index
Cost-based optimization
Join ordering and algorithms
More in the lightning talks
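The effect of join ordering can be shown with a toy evaluator for the query above: estimate how many triples each pattern matches on its own, then join starting from the most selective pattern. This is a deliberately crude sketch with made-up data and function names; it is not a description of MarkLogic's optimizer.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def matches(pattern, triples, binding):
    """Yield extended bindings for every triple that fits the pattern under `binding`."""
    for triple in triples:
        new = dict(binding)
        for want, got in zip(pattern, triple):
            if is_var(want):
                if new.setdefault(want, got) != got:
                    break  # variable already bound to a different value
            elif want != got:
                break  # constant does not match
        else:
            yield new

def estimate(pattern, triples):
    """Crude cost: how many triples match the pattern on its own."""
    return sum(1 for _ in matches(pattern, triples, {}))

def evaluate(patterns, triples):
    """Join the patterns in order of increasing estimated cost."""
    ordered = sorted(patterns, key=lambda p: estimate(p, triples))
    bindings = [{}]
    for pattern in ordered:
        bindings = [b2 for b in bindings for b2 in matches(pattern, triples, b)]
    return ordered, bindings

# Hypothetical data for the query on the slide.
people = [
    ("p1", ":first-name", "John"),
    ("p1", ":birth-place", "London"),
    ("p2", ":first-name", "Jane"),
    ("p2", ":birth-place", "Paris"),
    ("p3", ":birth-place", "Rome"),
]
query = [
    ("?person", ":birth-place", "?place"),
    ("?person", ":first-name", "John"),
]
plan, rows = evaluate(query, people)
print(plan)
print(rows)
```

Starting with the `:first-name "John"` pattern leaves a single candidate person before the `:birth-place` pattern is joined, instead of carrying every birth place through the join.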
MarkLogic Semantics
New features under development
• RDF data store
• Special-purpose triples index
• MarkLogic Server includes a triple store!
• Query RDF with native SPARQL
• Query across triples, documents, values
• World-class Triple Store
• Horizontally scalable
• Not restricted by physical memory limits
• Enterprise hardened
• World-beating Information Store
• Triples + Documents + Values
• All in one Enterprise NoSQL database
Any Questions?
For More Information
Stephen Buxton
