XML eXtensible Markup Language 2005

http://www.cs.huji.ac.il/~dbi
Introduction and Motivation
XML vs. HTML
• HTML is a HyperText Markup language
– Designed for a specific application,
namely, presenting and linking hypertext
documents
• XML describes structure and content
(“semantics”)
– The presentation is defined separately
from the structure and the content
An Address Book as
an XML document
<addresses>
<person>
<name> Donald Duck</name>
<tel> 04-828-1345 </tel>
<email> donald@cs.technion.ac.il </email>
</person>
<person>
<name> Miki Mouse</name>
<tel> 03-426-1142 </tel>
<email>miki@yahoo.com</email>
</person>
</addresses>
Main Features of XML
• No fixed set of tags
– New tags can be added for new
applications
• An agreed upon set of tags can be used
in many applications
– Namespaces facilitate uniform and
coherent descriptions of data
• For example, a namespace for address books
determines whether to use <tel> or <phone>
Main Features of XML (cont’d)
• XML has the concept of a schema
– DTD and the more expressive XML
Schema
• XML is a data model
– Similar to the semistructured data model
• XML supports internationalization
(Unicode) and platform independence
(an XML file is just a character file)
XML is Self-Describing Data
• Traditionally, a data file is just a bit stream
• Only a program that reads or writes this file
has the details about
– How to break the bit stream into records
– How to break each record into fields
– The type of each data field
• Over the years, companies retained valuable
data (e.g., on magnetic tapes), but lost the
programs that have the above information
– As a result, the data was practically lost
• This cannot happen with XML data, because the tags that describe
the structure are stored together with the data
XML is the Standard for
Data Exchange
• Web services (e.g., ecommerce) require
exchanging data between various
applications that run on different
platforms
• XML (augmented with namespaces) is
the preferred syntax for data exchange
on the Web
XML is not Alone
• XML Schemas strengthen the data-modeling
capabilities of XML (in comparison to XML with
only DTDs)
• XPath is a language for accessing parts of
XML documents
• XLink and XPointer support cross-references
• XSLT is a language for transforming XML
documents into other XML documents
(including XHTML, for displaying XML files)
– Limited styling of XML can be done with CSS alone
• XQuery is a language for querying XML
documents
The Two Facets of XML
• Some XML files are just text documents with
tags that denote their structure and include
some metadata (e.g., an attribute that gives
the name of the person who did the
proofreading)
– See an example on the next slide
– XML is a subset of SGML (Standard Generalized
Markup Language)
• Other XML documents are similar to
database files (e.g., an address book)
XML can Describe
the Structure of a Document
<paper>
<title> Complexity of Computations </title>
<author>
<name> M. O. Rabin</name>
<institute> Hebrew University </institute>
</author>
<abstract> … </abstract>
<section> … </section>
<section> … </section>
<references> … </references>
</paper>
XML Syntax
W3Schools Resources on XML Syntax
The Structure of XML
• XML consists of tags and text
• Tags come in pairs <date> ... </date>
• They must be properly nested
– good
<date> ... <day> ... </day> ... </date>
– bad
<date> ... <day> ... </date>... </day>
(You can’t do <i> ... <b> ... </i> ...</b> in HTML)
A Useful Abbreviation
Abbreviating elements with empty contents:
• <br/> for <br></br>
• <hr width="10"/> for <hr width="10"></hr>
For example:
<family>
<person id="lisa">
<name> Lisa Simpson </name>
<mother idref="marge"/>
<father idref="homer"/>
</person>
...
</family>
Note that a tag may have a set of attributes, each consisting of a name and a value
XML Text
XML has only one “basic” type – text
It is bounded by tags, e.g.,
<title> The Big Sleep </title>
<year> 1935 </year> – 1935 is still text
• XML text is called PCDATA
– (for parsed character data)
• It uses Unicode; any character can also be written as a
character reference, e.g., &#x05DE; for the Hebrew letter Mem
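The two points above can be tried out with a short sketch. It uses Python's standard xml.etree.ElementTree module; the <letter> element is a made-up name used only for illustration.

```python
import unicodedata
import xml.etree.ElementTree as ET

# The content of <year> is text; converting it to a number is the
# application's job, not the parser's.
year = ET.fromstring("<year> 1935 </year>")
print(repr(year.text))     # the string ' 1935 ', spaces included
print(int(year.text) + 1)  # explicit conversion by the application

# A character reference is replaced by the character it names.
# <letter> is a hypothetical element, not taken from the slides.
mem = ET.fromstring("<letter>&#x05DE;</letter>")
print(unicodedata.name(mem.text))
```

The parser hands back strings in both cases; any typing beyond "text" has to come from a schema or from the application.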
XML Structure
• Nesting tags can be used to express
various structures, e.g., a tuple
(record):
<person>
<name> Lisa Simpson</name>
<tel> 02-828-1234 </tel>
<tel> 054-470-777 </tel>
<email> lisa@cs.huji.ac.il </email>
</person>
XML Structure (cont’d)
• We can represent a list by using the
same tag repeatedly:
<addresses>
<person> … </person>
<person> … </person>
<person> … </person>
<person> … </person>
…
</addresses>
XML Structure (cont’d)
<addresses>
<person>
<name> Donald Duck</name>
<tel> 04-828-1345 </tel>
<email> donald@cs.technion.ac.il </email>
</person>
<person>
<name> Miki Mouse</name>
<tel> 03-426-1142 </tel>
<email>miki@yahoo.com</email>
</person>
</addresses>
Terminology
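The segment of an XML document between an opening tag and the
corresponding closing tag is called an element
<person>
<name> Bart Simpson </name>
<tel> 02 – 444 7777 </tel>
<tel> 051 – 011 022 </tel>
<email> bart@tau.ac.il </email>
</person>
• The whole segment from <person> to </person> is an element
• <name> Bart Simpson </name> is an element, a sub-element of <person>
• A segment that does not run from an opening tag to its
corresponding closing tag is not an element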
An XML Document is a Tree
[Figure: a tree whose root is person, with four children]
person
  name: Bart Simpson
  tel: 02 – 444 7777
  tel: 051 – 011 022
  email: bart@tau.ac.il
• Leaves are either empty or contain PCDATA
• Note that semistructured data models typically put the labels on
the edges, and are arbitrary graphs and not just trees
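As a rough illustration of the tree view, the sketch below parses the <person> element from the previous slide with Python's standard xml.etree.ElementTree module and lists its nodes with indentation. The helper name show is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

doc = """<person>
<name> Bart Simpson </name>
<tel> 02 - 444 7777 </tel>
<tel> 051 - 011 022 </tel>
<email> bart@tau.ac.il </email>
</person>"""


def show(elem, depth=0):
    """Return one line per element and per non-blank text leaf."""
    lines = ["  " * depth + elem.tag]
    text = (elem.text or "").strip()
    if text:
        lines.append("  " * (depth + 1) + repr(text))
    for child in elem:
        lines.extend(show(child, depth + 1))
    return lines


print("\n".join(show(ET.fromstring(doc))))
```

Whitespace-only text between the child elements is skipped here to keep the listing short; a parser still reports it.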
Mixed Content
An element may contain a mixture of subelements and PCDATA
<airline>
<name> British Airways </name>
<motto>
World’s <dubious> favorite</dubious> airline
</motto>
</airline>
• How many leaves are there in the corresponding tree?
• How many leaves are empty?
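One way to inspect mixed content is sketched below with Python's standard xml.etree.ElementTree module. Note that this particular library does not create separate text nodes: it stores the text before the first sub-element in text and the text after a sub-element in that sub-element's tail, which is a design choice of the library rather than part of XML.

```python
import xml.etree.ElementTree as ET

motto = ET.fromstring(
    "<motto>World's <dubious>favorite</dubious> airline</motto>"
)
dubious = motto.find("dubious")

print(repr(motto.text))    # text before the sub-element
print(repr(dubious.text))  # text inside the sub-element
print(repr(dubious.tail))  # text after the sub-element

# All the character data, in document order:
print("".join(motto.itertext()))
```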
The Header Tag
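• <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
– The value of standalone is either "yes" or "no"
– standalone="no" means that the document may depend on external
declarations, e.g., an external DTD
– You can leave out the encoding attribute and
the processor will use the UTF-8 default
– The attributes must appear in the order version, encoding, standalone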
Processing Instructions
<?xml version="1.0"?>
<?xml-stylesheet href="doc.xsl" type="text/xsl"?>
<!DOCTYPE doc SYSTEM "doc.dtd">
<doc>Hello, world!<!-- Comment 1 --></doc>
<?pi-without-data?>
<!-- Comment 2 -->
<!-- Comment 3 -->
Using CDATA
<HEAD1>
Entering a Kennel Club Member
</HEAD1>
<DESCRIPTION>
Enter the member by the name on his or her papers. Use the
NAME tag. The NAME tag has two attributes. Common (all in
lowercase, please!) is the dog's call name. Breed (also in all
lowercase) is the dog's breed. Please see the breed reference
guide for acceptable breeds. Your entry should look something
like this:
</DESCRIPTION>
<EXAMPLE>
<![CDATA[<NAME common="freddy" breed="springerspaniel">Sir Fredrick of Ledyard's End</NAME>]]>
</EXAMPLE>
We want to see the text as is, even though it includes tags
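A small check of this behavior, sketched with Python's standard xml.etree.ElementTree module: the content of the CDATA section arrives as plain text, and no NAME element is created.

```python
import xml.etree.ElementTree as ET

doc = (
    "<EXAMPLE>"
    '<![CDATA[<NAME common="freddy" breed="springerspaniel">'
    "Sir Fredrick of Ledyard's End</NAME>]]>"
    "</EXAMPLE>"
)
example = ET.fromstring(doc)

print(example.text)          # the markup inside CDATA, as text
print(len(list(example)))    # number of sub-elements of <EXAMPLE>
```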
A Complete XML Document
<?xml version="1.0" encoding="UTF-8"
standalone="no"?>
<!DOCTYPE addresses SYSTEM
"http://www.cs.huji.ac.il/~dbi/dbi-addresses.dtd">
<addresses>
<person>
<name>Lisa Simpson</name>
<tel> 02-828-1234 </tel>
<tel> 054-470-777 </tel>
<email> lisa@cs.huji.ac.il </email>
</person>
</addresses>
Well-Formed XML Documents
• An XML document (with or without a DTD) is
well-formed if
– Tags are syntactically correct
– Every start tag has a matching end tag
– Tags are properly nested
– There is a single root element
– A start tag does not have two occurrences of the
same attribute
• An XML document must be well formed
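The conditions above can be tested with any XML parser, since a parser must reject a document that is not well formed. A minimal sketch with Python's standard xml.etree.ElementTree module follows; the function name is_well_formed is an assumption made for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as an XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<date><day>1</day></date>"))  # properly nested
print(is_well_formed("<date><day>1</date></day>"))  # badly nested
print(is_well_formed('<a x="1" x="2"/>'))           # repeated attribute
print(is_well_formed("<a/><b/>"))                   # two root elements
```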
DTD
(Document Type Definition)
Imposing Structure on
XML Documents
(W3Schools on DTDs)
Motivation
• A DTD adds syntactical requirements in
addition to the well-formed requirement
• It helps in eliminating errors when
creating or editing XML documents
• It clarifies the intended semantics
• It simplifies the processing of XML
documents
An Example
• In an address book, where can a phone
number appear?
– Under <person>, under <name> or under both?
• If we have to check for all possibilities,
processing takes longer and it may not be
clear to whom a phone belongs
– We would like to know that a phone number is
allowed to appear under both a department and the
manager of that department
– If we don’t know that and there is only one phone
number, we may not know whether it serves both
the department and its manager or just one of them
Document Type Definitions
• Document Type Definitions (DTDs)
impose structure on XML documents
• There is some relationship between a
DTD and a schema, but it is not close –
hence the need for additional “typing”
systems (XML schemas)
• The DTD is a syntactic specification
Example: An Address Book
<person>
<name> Homer Simpson </name>
<greet> Dr. H. Simpson </greet>
<addr>1234 Springwater Road </addr>
<addr> Springfield USA, 98765 </addr>
<tel> (321) 786 2543 </tel>
<fax> (321) 786 2544 </fax>
<tel> (321) 786 2544 </tel>
<email> homer@math.springfield.edu </email>
</person>
• name: exactly one name
• greet: at most one greeting
• addr: as many address lines as needed (in order)
• tel and fax: mixed telephones and faxes
• email: as many as needed
Specifying the Structure
• name: specifies a name element
• greet?: specifies an optional (0 or 1) greet element
• name, greet?: specifies a name followed by an optional greet
Specifying the Structure (cont’d)
• addr*: specifies 0 or more address lines
• tel | fax: a tel or a fax element
• (tel | fax)*: 0 or more repeats of tel or fax
• email*: 0 or more email elements
Specifying the Structure
(cont’d)
• So the whole structure of a person entry
is specified by
name, greet?, addr*, (tel | fax)*, email*
• This is known as a regular expression
• Why is it important?
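One practical consequence is that the sequence of child tags can be checked with ordinary regular-expression machinery. The sketch below is only an illustration: it writes each child tag followed by a space and matches the result with Python's re module. The names PERSON_MODEL and conforms are assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

# name, greet?, addr*, (tel | fax)*, email*
# Each child tag is written followed by one space, so the content
# model becomes a regular expression over that string.
PERSON_MODEL = re.compile(r"name (greet )?(addr )*((tel|fax) )*(email )*")


def conforms(person):
    """Check the sequence of child tags of a <person> element."""
    tags = "".join(child.tag + " " for child in person)
    return PERSON_MODEL.fullmatch(tags) is not None


homer = ET.fromstring(
    "<person><name>Homer Simpson</name><greet>Dr. H. Simpson</greet>"
    "<addr>1234 Springwater Road</addr><addr>Springfield USA, 98765</addr>"
    "<tel>(321) 786 2543</tel><fax>(321) 786 2544</fax>"
    "<tel>(321) 786 2544</tel>"
    "<email>homer@math.springfield.edu</email></person>"
)
misordered = ET.fromstring(
    "<person><email>x@example.org</email><name>X</name></person>"
)

print(conforms(homer))
print(conforms(misordered))
```

This checks only the order of the sub-elements of one element; a real validator also checks every descendant and all attributes.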
Summary of Regular Expressions
• A: the tag (i.e., element) A occurs
• e1,e2: the expression e1 followed by e2
• e*: 0 or more occurrences of e
• e?: optional, 0 or 1 occurrences
• e+: 1 or more occurrences
• e1 | e2: either e1 or e2
• (e): grouping
The Definition of an Element Consists of
Exactly One of the Following
• A regular expression (as defined
earlier)
• EMPTY means that the element has no
content
• ANY means that content can be any
mixture of PCDATA and elements
defined in the DTD
• Mixed content which is defined as
described on the next slide
• (#PCDATA) means that the element contains only text
The Definition of Mixed Content
• Mixed content is described by a
repeatable OR group
(#PCDATA | element-name | …)*
– Inside the group, no regular expressions –
just element names
– #PCDATA must be first followed by 0 or
more element names, separated by |
– The group can be repeated 0 or more
times
An Address-Book XML Document with an Internal DTD
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE addressbook [
<!ELEMENT addressbook (person*)>
<!ELEMENT person
(name, greet?, address*, (fax | tel)*, email*)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT greet (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT email (#PCDATA)>
]>
• The name of the DTD is addressbook
• The syntax of a DTD is not XML syntax
• “Internal” means that the DTD and the XML document are in the same file
The Rest of the
Address-Book XML Document
<addressbook>
<person>
<name> Jeff Cohen </name>
<greet> Dr. Cohen </greet>
<email> jc@penny.com </email>
</person>
</addressbook>
Regular Expressions
• Each regular expression determines a
corresponding finite-state automaton
• Let’s start with a simpler example:
name, addr*, email
[Figure: a finite-state automaton for this expression, with transitions labeled name, addr and email; a double circle denotes an accepting state]
• This suggests a simple parsing program
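Such a parsing program might look like the sketch below, which hand-codes an automaton for name, addr*, email as a transition table. The state names start, named and done are assumptions made for this example, not labels from the slide.

```python
# Transition table for: name, addr*, email
# (state, tag) -> next state
TRANSITIONS = {
    ("start", "name"): "named",
    ("named", "addr"): "named",   # loop for addr*
    ("named", "email"): "done",
}
ACCEPTING = {"done"}


def accepts(tags):
    """Run the automaton over a sequence of child-element tags."""
    state = "start"
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:      # no transition: reject immediately
            return False
    return state in ACCEPTING


print(accepts(["name", "addr", "addr", "email"]))
print(accepts(["name", "addr"]))
```

Because the table has at most one next state for each (state, tag) pair, the input is processed in a single left-to-right pass with no backtracking.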
Another Example
name,address*,(tel | fax)*,email*
[Figure: the corresponding finite-state automaton, with transitions labeled name, address, tel, fax and email]
Adding in the optional greet further
complicates things
Deterministic Requirement
• If element-type declarations are
deterministic, parsing is easier
• Formally, a declaration is deterministic if its
Glushkov automaton is deterministic
• The states of this automaton are the
positions of the regular expression
(semantic actions)
• The transitions are based on the
“follows set”
Deterministic Requirement
(cont’d)
• The associated automata are succinct
• A regular language may not have an
associated deterministic grammar, e.g.,
<!ELEMENT ndeter
((movie|director)*,movie,(movie|director))>
Some Things are Hard to Specify
Each employee element should contain name,
age and ssn elements in some order
<!ELEMENT employee
( (name, age, ssn) | (age, ssn, name) |
(ssn, name, age) | ...
)>
Suppose that there were many more fields!
Some Things are Hard to Specify (cont’d)
• There are n! different orders of n elements
• The size of the declaration is not even polynomial in n
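To get a feeling for the growth, the following sketch generates the "any order" content model for a list of fields by enumerating all permutations. The function name any_order_model is an assumption made for this example.

```python
from itertools import permutations
from math import factorial


def any_order_model(fields):
    """Build a DTD content model that allows the fields in any order."""
    groups = ["(" + ", ".join(order) + ")" for order in permutations(fields)]
    return "( " + " | ".join(groups) + " )"


print(any_order_model(["name", "age", "ssn"]))

# Number of alternatives for n fields is n!
for n in (3, 5, 10):
    print(n, factorial(n))
```

With three fields the declaration already has six alternatives; with ten fields it would need 3,628,800 of them.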
Specifying Attributes in the DTD
<!ELEMENT height (#PCDATA)>
<!ATTLIST height
dimension CDATA #REQUIRED
accuracy CDATA #IMPLIED >
The dimension attribute is required
The accuracy attribute is optional
CDATA is the “type” of the attribute – it means
“character data,” and may take any literal string
as a value
The Format of an Attribute Definition
• <!ATTLIST element-name attr-name
attr-type default-value>
• The default value is given inside quotes
Summary of Attribute Types
• CDATA
• (value | … | … ) is an enumeration of
allowed values
• ID, IDREF, IDREFS
– to be explained later
• ENTITY, ENTITIES
– to be explained later
• NMTOKEN, NMTOKENS, NOTATION
Summary of Attribute
Default Values
• #REQUIRED means that the attribute must
be included in the element
• #IMPLIED means that the attribute is optional
• #FIXED “value”
– The given value (inside quotes) is the only
possible one
• “value”
– The default value of the attribute if none is given
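The four kinds of default can be mimicked in a few lines. The sketch below is a hand-written illustration, not how a real validating parser is implemented: the dictionary is a made-up in-memory form of an ATTLIST, dimension and accuracy follow the height example, and the attributes rounded and system are hypothetical additions that show a plain default and a #FIXED value.

```python
# Hypothetical in-memory form of <!ATTLIST height ...>.
# "rounded" and "system" are invented for this illustration.
HEIGHT_ATTRS = {
    "dimension": ("#REQUIRED", None),
    "accuracy": ("#IMPLIED", None),
    "rounded": ("default", "no"),
    "system": ("#FIXED", "metric"),
}


def apply_attlist(attrs, declared):
    """Return attrs completed with defaults, or raise ValueError."""
    result = dict(attrs)
    for name, (kind, value) in declared.items():
        if kind == "#REQUIRED" and name not in result:
            raise ValueError(f"missing required attribute {name!r}")
        if kind == "#FIXED" and result.get(name, value) != value:
            raise ValueError(f"attribute {name!r} must be {value!r}")
        if kind in ("default", "#FIXED"):
            result.setdefault(name, value)  # fill in when not given
    return result


print(apply_attlist({"dimension": "cm"}, HEIGHT_ATTRS))
```

An #IMPLIED attribute that is not given simply stays absent, which is why accuracy does not show up in the result.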
Recursive DTDs
<!DOCTYPE genealogy [
<!ELEMENT genealogy (person*)>
<!ELEMENT person (
name,
dateOfBirth,
person,
person )>
...
]>
(The first nested person is the mother, the second is the father)
What is the problem with this?
• Each person should have a father and a mother. This leads to
either infinite data or a person that is a descendant of herself.
• A parser does not notice it!
Recursive DTDs (cont’d)
<!DOCTYPE genealogy [
<!ELEMENT genealogy (person*)>
<!ELEMENT person (
name,
dateOfBirth,
person?,
person? )>
...
]>
(The first nested person is the mother, the second is the father)
What is now the problem with this?
• If a person only has a father, how can you tell that he has a
father and does not have a mother?
Using ID and IDREF Attributes
<!DOCTYPE family [
<!ELEMENT family (person)*>
<!ELEMENT person (name)>
<!ELEMENT name (#PCDATA)>
<!ATTLIST person
id ID #REQUIRED
mother IDREF #IMPLIED
father IDREF #IMPLIED
children IDREFS #IMPLIED>
]>
IDs and IDREFs
• ID stands for identifier
– No two ID attributes may have the same value (the value
must be an XML name, not an arbitrary string)
• IDREF stands for identifier reference
– Every value associated with an IDREF attribute
must exist as an ID attribute value
• IDREFS specifies several (one or more)
identifier references, separated by whitespace
Some Conforming Data
<family>
<person id="lisa" mother="marge" father="homer">
<name> Lisa Simpson </name>
</person>
<person id="bart" mother="marge" father="homer">
<name> Bart Simpson </name>
</person>
<person id="marge" children="bart lisa">
<name> Marge Simpson </name>
</person>
<person id="homer" children="bart lisa">
<name> Homer Simpson </name>
</person>
</family>
ID References do not Have Types
• The attributes mother and father are
references to IDs of other elements
• However, those are not necessarily
person elements!
• The mother attribute is not necessarily a
reference to a female person
An Alternative Specification
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE family [
<!ELEMENT family (person)*>
<!ELEMENT person (name, mother?, father?, children?)>
<!ATTLIST person id ID #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT mother EMPTY>
<!ATTLIST mother idref IDREF #REQUIRED>
<!ELEMENT father EMPTY>
<!ATTLIST father idref IDREF #REQUIRED>
<!ELEMENT children EMPTY>
<!ATTLIST children idrefs IDREFS #REQUIRED>
]>
The Revised Data
<family>
<person id="marge">
<name> Marge Simpson </name>
<children idrefs="bart lisa"/>
</person>
<person id="homer">
<name> Homer Simpson </name>
<children idrefs="bart lisa"/>
</person>
<person id="bart">
<name> Bart Simpson </name>
<mother idref="marge"/>
<father idref="homer"/>
</person>
<person id="lisa">
<name> Lisa Simpson </name>
<mother idref="marge"/>
<father idref="homer"/>
</person>
</family>
Consistency of ID and IDREF
Attribute Values
• If an attribute is declared as ID
– The associated value must be distinct, i.e., different
elements (in the given document) must have
different values for the ID attribute (no confusion)
• Even if the two elements have different element names
• If an attribute is declared as IDREF
– The associated value must exist as the value of
some ID attribute (no dangling “pointers”)
• Similarly for all the values of an IDREFS
attribute
• ID, IDREF and IDREFS attributes are not typed
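The two consistency rules can be sketched as a small check over a parsed document. This is only an illustration with Python's standard xml.etree.ElementTree module, which does not validate against a DTD: the attribute names id, mother, father and children are hard-coded here as an assumption, whereas a validating parser would read the attribute types from the DTD.

```python
import xml.etree.ElementTree as ET


def check_references(root):
    """Return a list of ID/IDREF problems found below root."""
    problems, ids = [], set()
    for elem in root.iter():               # rule 1: IDs are distinct
        value = elem.get("id")
        if value is None:
            continue
        if value in ids:
            problems.append(f"duplicate ID {value!r}")
        ids.add(value)
    for elem in root.iter():               # rule 2: no dangling references
        refs = [elem.get(a) for a in ("mother", "father") if elem.get(a)]
        refs += elem.get("children", "").split()   # IDREFS: several values
        for ref in refs:
            if ref not in ids:
                problems.append(f"dangling reference {ref!r}")
    return problems


family = ET.fromstring(
    '<family>'
    '<person id="lisa" mother="marge" father="homer"><name>Lisa</name></person>'
    '<person id="marge" children="lisa"><name>Marge</name></person>'
    '</family>'
)
print(check_references(family))
```

In this small document homer is referenced but never declared as an ID, so the check reports one dangling reference. As the last bullet says, the check cannot tell whether a referenced element is of the expected kind.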
Adding a DTD to the Document
• A DTD can be internal
– The DTD is part of the document file
• or external
– The DTD and the document are on
separate files
– An external DTD may reside
• In the local file system
(where the document is)
• In a remote file system
Connecting a Document with its DTD
• An internal DTD:
<?xml version="1.0"?>
<!DOCTYPE db [<!ELEMENT ...> … ]>
<db> ... </db>
• A DTD from the local file system:
<!DOCTYPE db SYSTEM "schema.dtd">
• A DTD from a remote file system:
<!DOCTYPE db SYSTEM
"http://www.schemaauthority.com/schema.dtd">
Well-Formed XML Documents
• An XML document (with or without a DTD) is
well-formed if
– Tags are syntactically correct
– Every start tag has a matching end tag
– Tags are properly nested
– There is a single root element
– A start tag does not have two occurrences of the
same attribute
• An XML document must be well formed
Valid Documents
• A well-formed XML document is valid if
it conforms to its DTD, that is,
– The document conforms to the regular-expression grammar,
– The types of attributes are correct, and
– The constraints on references are satisfied
DTDs are CFGs
(Context-Free Grammars)
• Checking validity and parsing a document
according to a DTD is in polynomial time,
using a dynamic-programming algorithm
– A <lecturer> element has the same rules
regardless of whether it is under a <course>
element or a <seminar> element
• Note that XML Schemas are capable of
describing context-sensitive structures
– The complexity is higher
XML Schemas
W3Schools on XML Schemas
DTDs vs. Schemas (or Types)
• DTDs are rather weak specifications by DB &
programming-language standards
– Only one base type – PCDATA
– No useful “abstractions”, e.g., sets
– IDREFs are untyped – the type of the object being
referenced is not known
– No constraints, e.g., child is inverse of parent
– No methods
– Tag definitions are global
• Some extensions of XML impose a schema
or types on an XML document