Christopher Manning CS300 talk – Fall 2000

Transcription

Christopher Manning
CS300 talk – Fall 2000
manning@cs.stanford.edu
http://nlp.stanford.edu/~manning/
1
Research areas of interest:
NLP/CL
• Statistical NLP models: Combining linguistic and statistical sophistication
• NLP and ML methods for extracting meaning relations from webpages, medical texts, etc.
• Information extraction and text mining
• Lexical and structural acquisition from raw text
• Using robust NLP: dialect/style, readability, …
• Using pragmatics, genre, NLP in web searching
• Computational lexicography and the visualization of linguistic information
2
Models for language
• What is the motivation for statistical models for understanding language?
• From the beginning, logics and logical reasoning were invented for handling natural language understanding
• Logics have a language-like form that draws from and meshes well with natural languages
• Where are the numbers?
3
Sophisticated grammars for NL
• From NP → Det Adj* N there developed precise and sophisticated grammar formalisms (such as LFG, HPSG)
4
The Problem of Ambiguity
• Any broad-coverage grammar is hugely ambiguous (often hundreds of parses for 20+ word sentences).
• Making the grammar more comprehensive only makes the ambiguity problem worse.
• Traditional (symbolic) NLP methods don’t provide a solution.
– Selectional restrictions fail because creative/metaphorical use of language is everywhere:
• I swallowed his story
• The supernova swallowed up the planet
5
The problem of ambiguity, close up
• “The post office will hold out discounts and service concessions as incentives.”
• 12 words. Real language. At least 83 parses.
6
7
Statistical NLP methods
• P(to | Sarah drove)
• P(time is verb | Time flies like an arrow)
• P(NP → Det Adj N | mother = VP[drive])
• Statistical NLP methods:
– Estimate grammar parameters by gathering counts from texts or structured analyses of texts (a sketch follows this slide)
– Assign probabilities to various things to determine the likelihood of word sequences, sentence structure, and interpretation
8
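A minimal sketch of the count-gathering step described on the slide above, assuming a toy treebank encoded as nested tuples (the treebank, tree encoding, and all names here are illustrative, not the actual system):

    from collections import defaultdict

    # Toy "treebank": each tree is (category, child1, child2, ...);
    # a leaf is just a word string.
    treebank = [
        ("S", ("NP", "Sarah"), ("VP", ("V", "drove"), ("P", "to"))),
        ("S", ("NP", "Kim"), ("VP", ("V", "drove"))),
    ]

    rule_counts = defaultdict(int)
    parent_counts = defaultdict(int)

    def count_rules(tree):
        """Record one count for every local tree (rule instance) in a parse."""
        if isinstance(tree, str):      # a word at the frontier; no rule here
            return
        parent, children = tree[0], tree[1:]
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        rule_counts[(parent, rhs)] += 1
        parent_counts[parent] += 1
        for c in children:
            count_rules(c)

    for t in treebank:
        count_rules(t)

    # Relative-frequency estimate: P(parent -> rhs) = count(rule) / count(parent)
    for (parent, rhs), c in sorted(rule_counts.items()):
        print(f"P({parent} -> {' '.join(rhs)}) = {c / parent_counts[parent]:.2f}")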
Probabilistic Context-Free Grammars
NP → Det N      0.4
NP → NPposs N   0.1
NP → Pronoun    0.2
NP → NP PP      0.1
NP → N          0.2

Subtree: [NP [NP Det N] PP]
P(subtree above) = 0.1 x 0.4 = 0.04 (computed in the sketch below)
9
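The 0.1 x 0.4 product on this slide is just the product of the local rule probabilities in the subtree. A short sketch, with the rule table copied from the slide:

    # PCFG from the slide; bare strings stand for categories left unexpanded.
    rule_prob = {
        ("NP", ("Det", "N")): 0.4,
        ("NP", ("NPposs", "N")): 0.1,
        ("NP", ("Pronoun",)): 0.2,
        ("NP", ("NP", "PP")): 0.1,
        ("NP", ("N",)): 0.2,
    }

    def subtree_prob(tree):
        """Probability of a subtree = product of its local rule probabilities."""
        if isinstance(tree, str):      # unexpanded category at the frontier
            return 1.0
        parent, children = tree[0], tree[1:]
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        p = rule_prob[(parent, rhs)]
        for c in children:
            p *= subtree_prob(c)
        return p

    # The subtree on the slide: [NP [NP Det N] PP]
    print(subtree_prob(("NP", ("NP", "Det", "N"), "PP")))   # 0.1 * 0.4 = 0.04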
Why Probabilistic Grammars?
• The predictions about grammaticality and ambiguity of categorical grammars are not in accord with human perceptions or engineering needs.
• Categorical grammars aren’t predictive
– They don’t tell us what “sounds natural”
• Probabilistic grammars model error tolerance, online lexical acquisition, … and have been amazingly successful as an engineering tool
• They capture a lot of world knowledge for free
• Relevant to linguistic change and variation, too!
10
Example: near
• In Middle English, near was an adjective [Maling]
• But, today, is it an adjective or a preposition?
– The near side of the moon
– We were near the station
• Not just a word with multiple parts of speech! There is evidence of blending:
– We were nearer the bus stop than the train
– He has never been nearer the center of the financial establishment
11
Research aim
• Most current statistical models are quite simple (linguistically and also statistically)
• Aim: To combine the good features of statistical NLP methods with the sophistication of rich linguistic analyses.
12
Lexicalising a CFG
• A lexicalized CFG can capture probabilistic dependencies between words (a sketch of head propagation follows this slide)
Tree: [VP[looked] [V[looked] looked] [PP[inside] [P[inside] inside] [NP[box] [D[the] the] [N[box] box]]]]
13
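One way to read the tree above: each category carries the head word of its phrase, so rule probabilities can be conditioned on word pairs (looked–inside, inside–box). A sketch of the head-propagation step, with an illustrative head-child table (which child is the head per category is my assumption, not taken from the talk):

    # Index of the head child for each category (illustrative).
    head_child = {"VP": 0, "PP": 0, "NP": 1, "V": 0, "P": 0, "D": 0, "N": 0}

    def lexicalize(tree):
        """Turn (cat, children...) into (cat, head word, children), bottom-up."""
        if isinstance(tree, str):          # a word
            return tree
        cat, children = tree[0], [lexicalize(c) for c in tree[1:]]
        h = children[head_child[cat]]
        head = h if isinstance(h, str) else h[1]
        return (cat, head, children)

    tree = ("VP", ("V", "looked"),
                  ("PP", ("P", "inside"),
                         ("NP", ("D", "the"), ("N", "box"))))
    print(lexicalize(tree))   # VP[looked] over V[looked] and PP[inside], etc.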
Left-corner parsing
• The memory requirements of standard parsers do not match human linguistic processing. What humans find hardest – center embedding:
– *The man that the woman the priest met knows couldn’t help
• is really the bread-and-butter of standard CFG parsing:
– (((a + b)))
• As an alternative, left-corner parsing does capture this (a sketch of the left-corner relation follows this slide).
14
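A left-corner parser announces a node once its leftmost child is finished but before the rest is seen; its prediction step relies on the left-corner relation: B is a left corner of A if some rule A → B … exists, closed under transitivity. A small sketch computing that closure over a toy grammar (the grammar is illustrative):

    # Toy grammar: LHS -> list of right-hand-side alternatives.
    grammar = {
        "S":  [["NP", "VP"]],
        "NP": [["Det", "N"], ["NP", "PP"]],
        "VP": [["V", "NP"], ["V", "S"]],
        "PP": [["P", "NP"]],
    }

    # Direct left corners: the first symbol of each rule.
    left_corner = {a: {rhs[0] for rhs in alts} for a, alts in grammar.items()}

    # Transitive closure: if B is a left corner of A and C of B, then C of A.
    changed = True
    while changed:
        changed = False
        for corners in left_corner.values():
            for b in list(corners):
                for c in left_corner.get(b, ()):
                    if c not in corners:
                        corners.add(c)
                        changed = True

    print(left_corner["S"])   # {'NP', 'Det'}: the categories that can start an S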
Parsing and (stack) complexity
• She ruled that the contract between the union and company dictated that claims from both sides should be bargained over or arbitrated.
15
Tree geometry vs. stack depth
Maximum stack depth by strategy (TD = top-down, LC = left-corner, BU = bottom-up):

                                                     TD  LC  BU
• Kim thinks Sandy knows she likes green apples.      1   1   7
• The rat that the cat that Kim likes chased died.    3   3   7
• Kim’s friend’s mother’s car smells.                 5   1   1
16
Probabilistic Left-Corner Grammars
• Use richer probabilistic conditioning (estimation is sketched after this slide)
– Left corner and goal category rather than just parent
• P(NP → Det Adj N | Det, S)
(Figure: goal category S dominating an NP that expands to Det Adj N)
• Allow left-to-right online parsing (which can hope to explain how people build partial interpretations online)
• Easy integration with lexicalization, part-of-speech tagging models, etc.
17
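The richer conditioning can be estimated from counts just like plain PCFG rules, with the (left corner, goal) pair as the conditioning event. A hedged sketch; the counts below are invented for illustration:

    from collections import defaultdict

    # Counts of (rule, left corner, goal) events, as they might be gathered
    # by running a left-corner parser over treebank trees. Invented numbers.
    event_counts = defaultdict(int)
    event_counts[(("NP", ("Det", "Adj", "N")), "Det", "S")] = 30
    event_counts[(("NP", ("Det", "N")), "Det", "S")] = 60
    event_counts[(("NP", ("Det", "N", "PP")), "Det", "S")] = 10

    def p_rule(rule, left_corner, goal):
        """P(rule | left corner, goal) by relative frequency."""
        num = event_counts[(rule, left_corner, goal)]
        den = sum(c for (r, lc, g), c in event_counts.items()
                  if lc == left_corner and g == goal)
        return num / den if den else 0.0

    print(p_rule(("NP", ("Det", "Adj", "N")), "Det", "S"))   # 30/100 = 0.3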
Probabilistic Head-driven Grammars
• The heads of phrases are the source of the main constraining information about a sentence’s structure
• We work out from heads by following the dependency order of the sentence
• The crucial property is that we have always built – and have available to us for conditioning – all governing heads and all less oblique dependents of the same head
• We can also easily integrate phrase length (a head-outward scoring sketch follows this slide)
18
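A sketch of the scoring idea: dependents are attached in obliqueness order, so each attachment can condition on the governing head and on the dependents already built, which is the property the slide describes. The probability table is an invented stand-in, not the talk's model:

    # Illustrative P(next dependent | head word, dependents built so far);
    # in a real model these come from treebank counts plus smoothing.
    dep_prob = {
        ("drove", ()): {"Sarah": 0.2},
        ("drove", ("Sarah",)): {"to": 0.3},
    }

    def p_deps(head, deps):
        """Score a head's dependents, least oblique first; each attachment
        conditions on the head and the dependents already built."""
        p, so_far = 1.0, ()
        for d in deps:
            p *= dep_prob.get((head, so_far), {}).get(d, 0.0)
            so_far += (d,)
        return p

    # The subject 'Sarah' is built before the oblique PP word 'to':
    print(p_deps("drove", ["Sarah", "to"]))   # 0.2 * 0.3 = 0.06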
Information from the web:
The problem
• When people see web pages, they understand their meaning
– By and large. To the extent that they don’t, there’s a gradual degradation
• When computers see web pages, they get only character strings and HTML tags
19
The human view
20
The intelligent agent view
<HTML> <HEAD>
<TITLE>Ford Motor Company - Home Page</title>
<META NAME="Keywords" CONTENT="cars, automobiles, trucks, SUV,
mazda, volvo, lincoln, mercury, jaguar, aston martin, ford">
<META NAME="description" CONTENT="Ford Motor Company corporate
home page">
<SCRIPT LANGUAGE="JavaScript1.2"> … </SCRIPT>
<!-- Trustmark code --><DIV ID=trustmarkDiv>
<TABLE BORDER="0" CELLPADDING=0 CELLSPACING=0 WIDTH=768>
<TR><TD WIDTH=768 ALIGN=CENTER> <A HREF="default.asp?pageid=473"
onmouseover="logoOver('fordscript');rolloverText('ht0')"
onmouseout="logoOut('fordscript');rolloverText('ht0')"><img border="0"
src="images/homepage/fordscript.gif" ALT="Learn more about Ford
Motor Company" WIDTH="521" HEIGHT="39"></A><br>
… </TD></TR></TABLE></DIV> </BODY></HTML>
21
The problem (cont.)
• We'd like computers to see meanings as well, so that computer agents could more intelligently process the web
• These desires have led to XML, RDF, agent markup languages, and a host of other proposals and technologies which attempt to impose more syntax and semantics on the web – in order to make life easier for agents.
22
Thesis
• The problem can’t and won’t be solved by mandating a universal semantics for the web
• The solution is rather agents that can ‘understand’ the human web by text and image processing
23
(1) The semantics
• Are there adequate and adequately understood methods for marking up pages with such a consistent semantics, in such a way that it would support simple reasoning by agents?
• No.
24
What are some AI people saying?
“Anyone familiar with AI must realize that the study of knowledge representation—at least as it applies to the “commonsense” knowledge required for reading typical texts such as newspapers—is not going anywhere fast. This subfield of AI has become notorious for the production of countless non-monotonic logics and almost as many logics of knowledge and belief, and none of the work shows any obvious application to actual knowledge-representation problems. Indeed, the only person who has had the courage to actually try to create large knowledge bases full of commonsense knowledge, Doug Lenat …, is believed by everyone save himself to be failing in his attempt.”
(Charniak 1993:xvii–xviii)
25
(2) Pragmatics not semantics
pragmatic: relating to matters of fact or practical affairs, often to the exclusion of intellectual or artistic matters
pragmatics: linguistics concerned with the relationship of the meaning of sentences to their meaning in the environment in which they occur
• A lot of the meaning in web pages (as in any communication) derives from the context – what is referred to in the philosophy of language tradition as pragmatics
• Communication is situated
26
Pragmatics on the web
• Information supplied is incomplete – humans will interpret it
– Numbers are often missing units
– A “rubber band” for sale at a stationery site is a very different item to a rubber band on a metal lathe
– A “sidelight” means something different to a glazier than to a regular person
• Humans will evaluate content using information about the site, and the style of writing
– value filtering
27
(3) The world changes
• The way in which business is being done is changing at an astounding rate
– or at least that’s what the ads from e-business companies scream at us
• Semantic needs and usages evolve (like languages) more rapidly than standards (cf. the Académie française)
• People use words that aren’t in the dictionary.
• Their listeners understand them.
28
(4) Interoperation
Ontology: a shared formal conceptualization of a particular domain
• Meaning transfer frequently has to occur across the subcommunities that are currently designing *ML languages, and then all the problems reappear, and the current proposals don't do much to help
29
Many products cross industries
http://www.interfilm-usa.com/Polyester.htm
• Interfilm offers a complete range of SKC's Skyrol® brand polyester films for use in a wide variety of packaging and industrial processes.
• Gauges: 48 - 1400
• Typical End Uses: Packaging, Electrical, Labels, Graphic Arts, Coating and Laminating
– labels: milk jugs, beer/wine, combination forms, laminated coupons, …
30
(5) Pain but no gain
• A lot of the time people won't put in information according to standards for semantic/agent markup, even if they exist.
• Three reasons…
– Laziness: Only 0.3% of sites currently use the (simple) Dublin Core metadata standard.
– Profits: Having an easily robot-crawlable site is a recipe for turning what you sell into a commodity, and hence making little profit
– Cheats: There are people out there that will abuse any standard, if it’s profitable
31
(6) Less structure to come
• “the convergence of voice and data is creating the next key interface between people and their technology. By 2003, an estimated $450 billion worth of e-commerce transactions will be voice-commanded.*” (Intel ad, NYT, 28 Sep 2000; *Data Source: Forrester Research)
• Question: will these customers speak XML tags?
32
The connection to language
Decker et al., IEEE Internet Computing (2000):
• “The Web is the first widely exploited many-to-many data-interchange medium, and it poses new requirements for any exchange format:
– Universal expressive power
– Syntactic interoperability
– Semantic interoperability”
But human languages have all these properties, and maintain superior expressivity and interoperability through their flexibility and context dependence
33
NLP and information access
• Solution: use robust natural language processing and machine learning techniques
• NLP comes into its own when you want to do more than just standard IR.
• E.g., defined information needs over text:
– “An apartment with 2 bedrooms in Menlo Park for less than $1,500.”
– “Where was there an airline accident today?”
– “What proteins is this gene known to regulate?”
34
Example of extracting textual relations: Real Estate Ads
• System starts with plain text of ads
– These are hardly exactly “English”
• But an unstructured information source, close to English
– Chosen as lowest common denominator
• Output: database records
– A variety of tables giving information about:
• the property: bedrooms, garages, price
• the real estate agency
• inspection times
35
Real Estate Ads: Input
<ADNUM>2067206v1</ADNUM>
<DATE>March 02, 1998</DATE>
<ADTITLE>MADDINGTON $89,000</ADTITLE>
<ADTEXT>
OPEN 1.00 - 1.45<BR>
U 11 / 10 BERTRAM ST<BR>
NEW TO MARKET Beautiful<BR>
3 brm freestanding<BR>
villa, close to shops & bus<BR>
Owner moved to Melbourne<BR>
ideally suit 1st home buyer,<BR>
investor & 55 and over.<BR>
Brian Hazelden 0418 958 996<BR>
R WHITE LEEMING 9332 3477
</ADTEXT>
36
Real Estate Ads: Output
• Output is database tables
• But the general idea in slot-filler format (a toy extraction sketch follows this slide):
SUBURB:      MADDINGTON
ADDRESS:     (11,10,BERTRAM,ST)
INSPECTION:  (1.00,1.45,11/Nov/98)
BEDROOMS:    3
TYPE:        HOUSE
AGENT:       BRIAN HAZELDEN
BUS PHONE:   9332 3477
MOB PHONE:   0418 958 996
[Manning & Whitelaw, U. Sydney 1998; in daily use at News Corp.]
37
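A minimal, regex-based sketch of this kind of extraction, run on the ad text from the input slide. The patterns are invented for this one ad format and are nothing like the deployed system:

    import re

    ad = """OPEN 1.00 - 1.45
    U 11 / 10 BERTRAM ST
    NEW TO MARKET Beautiful
    3 brm freestanding
    villa, close to shops & bus
    Brian Hazelden 0418 958 996
    R WHITE LEEMING 9332 3477"""

    record = {}
    m = re.search(r"OPEN\s+([\d.]+)\s*-\s*([\d.]+)", ad)      # inspection times
    if m:
        record["INSPECTION"] = m.groups()
    m = re.search(r"U\s*(\d+)\s*/\s*(\d+)\s+(\w+)\s+(ST|RD|AVE?)", ad)  # unit address
    if m:
        record["ADDRESS"] = m.groups()
    m = re.search(r"(\d+)\s*brm", ad)                         # bedroom count
    if m:
        record["BEDROOMS"] = int(m.group(1))
    m = re.search(r"\b(\d{4} \d{3} \d{3})\b", ad)             # mobile phone
    if m:
        record["MOB PHONE"] = m.group(1)
    print(record)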
38
One needs a little NLP
• There is no semantic coding to use
• Standard IR doesn’t work:
– suburbs
• the Paddington of the west
• one hours drive from Sydney
• real estate agent
– prices
• recently sold for $x. Was $y now $z. Rent.
– bedrooms
– multi-property ads
40
Text Segmentation
Real-estate ads have a hierarchical text structure!! (segmentation is sketched after this slide)
SOUTHPORT UNIT SPECIALS
$58,900 o.n.o. 2 brm close to water and shops.
$114,000 "Grandview", excellent value, good returns
LJ Coleman Real Estate
Contact Steve 5527 0572
GLEBE 2br yd $250; 4br yd $430
COOGEE 3br yd $320; 1br $150
BALMAIN 1br $180
H.R. Licensed FEE 9516-3211
41
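A sketch of the segmentation idea on a flattened version of the second ad above: start a new segment at each ALL-CAPS suburb name, then split the properties listed under it (the patterns are simplified guesses, not the deployed segmenter):

    import re

    ad = ("GLEBE 2br yd $250; 4br yd $430 "
          "COOGEE 3br yd $320; 1br $150 "
          "BALMAIN 1br $180")

    # Split before each run of 2+ capitals (a suburb heading);
    # properties within a suburb are separated by ';'.
    for chunk in re.split(r"\s+(?=[A-Z]{2,}\s)", ad):
        suburb, rest = chunk.split(" ", 1)
        for prop in rest.split(";"):
            print(suburb, "->", prop.strip())

    # GLEBE -> 2br yd $250
    # GLEBE -> 4br yd $430
    # COOGEE -> 3br yd $320 ... etc.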
The End
42