Building Better Models with Less Data

All inquiries:
Paolo Gaudiano
Icosystem Corporation
10 Fawcett Street
Cambridge, MA 02138
USA
+1-617-520-1070
paolo@icosystem.com
Paolo Gaudiano, President & CTO
Icosystem Corporation
Prepared for:
Boston Hadoop Users Group
April 26, 2012
©2000-2012 Icosystem Corp., all rights reserved
Predictive modeling is the process
of creating or selecting a model to
predict the likelihood of an event
or outcome.
Why is this useful?
Use predictive modeling to figure
out what we are about to get into
before we actually get into it.
Traditional predictive models are data-centric:
• Collect observations (Data)
• Identify mathematical relationships (Model)
• Extrapolate to new conditions (Prediction)
[Diagram labels: Data, Model, Prediction]
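The three data-centric steps can be sketched in a few lines of Python. This is only an illustration: the observations are made-up numbers (hour of day against average speed), and the straight-line model is an assumption chosen for brevity, not a model from the talk.

```python
# Data: hypothetical observations of (hour of day, average speed in km/h).
observations = [(6, 95.0), (7, 80.0), (8, 62.0), (9, 50.0)]


def fit_line(points):
    """Model: ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Prediction: evaluate the fitted line at a new condition x."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line(observations)
print("speed at 10:00:", predict(model, 10))  # just outside the observed range
print("speed at 14:00:", predict(model, 14))  # far outside the observed range
```

With these numbers, extrapolating to 14:00 returns a negative speed: a line that fits the observed hours well can still give impossible values outside them.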
• Accurate fit to historical data does not guarantee predictive accuracy!
• Many types of data-centric models require the assumption of statistical stationarity
In other words:
Tomorrow will be just like yesterday!
• Only one case where this holds true:
• Simulate behavior of individuals (agents)
• Capture key elements of agents
• Simulate interactions between agents
• Let the simulation unfold over time
• Look for patterns and trends
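The five steps above map onto a very small program skeleton. The sketch below uses a hypothetical word-of-mouth scenario (agents adopt a product after meeting someone who already has it); the scenario, the `Agent` class, and every parameter value are assumptions for illustration and do not come from the talk.

```python
import random


class Agent:
    """One individual. The only 'key element' captured here is adoption."""

    def __init__(self, adopted=False):
        self.adopted = adopted


def step(agents, rng, persuade_prob):
    """Interactions: each agent meets one randomly chosen agent.

    The chosen agent may be the agent itself, which has no effect.
    Updates are applied immediately, so an agent that adopts early in
    a step can persuade others later in the same step.
    """
    for agent in agents:
        other = rng.choice(agents)
        if other.adopted and not agent.adopted and rng.random() < persuade_prob:
            agent.adopted = True


def run(n_agents=200, n_seed=5, steps=40, persuade_prob=0.3, seed=1):
    """Let the simulation unfold and record the adopter count per step."""
    rng = random.Random(seed)
    agents = [Agent(adopted=(i < n_seed)) for i in range(n_agents)]
    history = []
    for _ in range(steps):
        step(agents, rng, persuade_prob)
        history.append(sum(a.adopted for a in agents))
    return history


# Patterns and trends: inspect how the aggregate evolves over time.
history = run()
print(history)
```

No equation for the adoption curve is written anywhere in the code; whatever shape appears in `history` emerges from the individual rules.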
The data-centric approach:
• Collect data from current situation
• Identify correlations with key variables (time of day, number of lanes, weather, ...)
• Extrapolate to new conditions
The ABS (agent-based simulation) approach:
• Simulate behavior of individual drivers (driving style, start/end points, response to weather...)
• Adjust behaviors until overall traffic patterns are replicated accurately
• Test results of changing conditions
• Most of us are experts when it comes to driving: accelerate, decelerate, change lanes
• This domain expertise is sufficient to build a surprisingly accurate simulation of traffic jams!
[Figures: Real-world experiment; NetLogo simulation]
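A traffic simulation built only from "accelerate, decelerate" rules can be sketched as follows. This is not the NetLogo model shown in the talk: it is a Nagel-Schreckenberg-style cellular automaton on a single-lane ring road, swapped in here for illustration, and it omits lane changes. All parameter values are assumptions.

```python
import random


def step(positions, speeds, road_len, v_max, p_brake, rng):
    """Advance every car by one tick on a circular single-lane road."""
    n = len(positions)
    order = sorted(range(n), key=lambda i: positions[i])
    new_speeds = list(speeds)
    for k, i in enumerate(order):
        ahead = order[(k + 1) % n]
        # Number of empty cells between this car and the car in front.
        gap = (positions[ahead] - positions[i] - 1) % road_len
        v = min(speeds[i] + 1, v_max)   # accelerate toward the speed limit
        v = min(v, gap)                 # decelerate to avoid a collision
        if v > 0 and rng.random() < p_brake:
            v -= 1                      # occasional random braking
        new_speeds[i] = v
    new_positions = [(positions[i] + new_speeds[i]) % road_len for i in range(n)]
    return new_positions, new_speeds


def simulate(n_cars=30, road_len=100, v_max=5, p_brake=0.3, ticks=200, seed=0):
    """Run the road for a number of ticks and count stopped cars per tick."""
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(road_len), n_cars))
    speeds = [0] * n_cars
    stopped_per_tick = []
    for _ in range(ticks):
        positions, speeds = step(positions, speeds, road_len, v_max, p_brake, rng)
        stopped_per_tick.append(sum(1 for v in speeds if v == 0))
    return positions, speeds, stopped_per_tick


# Compare light and heavy random braking on the same road.
for p in (0.1, 0.5):
    _, _, stopped = simulate(p_brake=p)
    print(f"p_brake={p}: mean stopped cars per tick = {sum(stopped) / len(stopped):.2f}")
```

Stopped cars in this model are the "jam": no rule mentions jams, yet clusters of stationary cars can form and dissolve as the braking probability changes.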
• Build the structure of the simulation using domain expertise (e.g., people accelerate, decelerate, change lanes)
• Run the simulation to determine what factors really matter (e.g., deceleration causes more jams than acceleration)
• Gather quantitative data to improve the accuracy of the simulation, e.g.:
  • How hard do people decelerate?
  • How often do they change lanes?
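The last step, using measured data to tune a simulation, can be sketched as a simple parameter search. The simulator below is a compact single-lane ring-road model in the Nagel-Schreckenberg style, chosen as a stand-in; the observed speed of 2.1 cells per tick is a made-up measurement, and the grid of candidate braking probabilities is an assumption.

```python
import random


def mean_speed(p_brake, n_cars=20, road_len=100, v_max=5, ticks=300, seed=0):
    """Simulate the road and return the average speed in cells per tick."""
    rng = random.Random(seed)
    pos = sorted(rng.sample(range(road_len), n_cars))
    vel = [0] * n_cars
    total = 0
    for _ in range(ticks):
        new_vel = []
        for i in range(n_cars):
            # Cars never overtake, so car i + 1 is always the one ahead.
            gap = (pos[(i + 1) % n_cars] - pos[i] - 1) % road_len
            v = min(vel[i] + 1, v_max, gap)
            if v > 0 and rng.random() < p_brake:
                v -= 1
            new_vel.append(v)
        vel = new_vel
        pos = [(p + v) % road_len for p, v in zip(pos, vel)]
        total += sum(vel)
    return total / (ticks * n_cars)


def calibrate(observed_speed, candidates):
    """Pick the braking probability whose simulated speed is closest."""
    return min(candidates, key=lambda p: abs(mean_speed(p) - observed_speed))


candidates = [i / 10 for i in range(10)]   # 0.0, 0.1, ..., 0.9
observed = 2.1                             # hypothetical field measurement
best = calibrate(observed, candidates)
print("calibrated p_brake:", best, "simulated speed:", mean_speed(best))
```

The structure of the model (accelerate, keep a gap, sometimes brake) comes from domain expertise; only the single number `p_brake` is taken from data.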
Decision makers use simulations in
two primary ways:
What-if scenarios
•  Use simulation to test the impact of
assumptions, actions and external
factors
Strategy design
•  Use simulations to identify strategies
that will lead to success
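Both uses can be illustrated with one small simulation. The example below is hypothetical and not taken from the talk: a checkout line where at most one shopper arrives per tick and every checkout takes the same time. The what-if loop varies the number of open registers; the strategy function searches for the smallest number that meets a target wait.

```python
import heapq
import random


def average_wait(n_registers, arrival_prob=0.5, service_time=5, ticks=2000, seed=42):
    """Simulate a first-come, first-served checkout line; return mean wait in ticks."""
    rng = random.Random(seed)
    free_at = [0] * n_registers       # tick at which each register is next free
    heapq.heapify(free_at)
    total_wait, served = 0, 0
    for t in range(ticks):
        if rng.random() < arrival_prob:           # a shopper arrives this tick
            earliest = heapq.heappop(free_at)     # register that frees up first
            start = max(t, earliest)
            total_wait += start - t
            served += 1
            heapq.heappush(free_at, start + service_time)
    return total_wait / served if served else 0.0


def cheapest_plan(max_registers=6, target_wait=2.0, **scenario):
    """Strategy design: fewest registers whose mean wait meets the target."""
    for n in range(1, max_registers + 1):
        if average_wait(n, **scenario) <= target_wait:
            return n
    return None


# What-if scenarios: same arrivals, different numbers of open registers.
for n in range(1, 7):
    print(f"{n} registers: mean wait {average_wait(n):.1f} ticks")

print("fewest registers meeting the target:", cheapest_plan())
```

Because the random seed is fixed, every scenario sees the same stream of arrivals, so differences in the output come from the decision being tested rather than from chance.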
Client:
PepsiCo - a Fortune 500 consumer goods company
Challenge:
Understand behavior of shoppers moving through a supermarket
• How do they navigate?
• What do they purchase?
• Where to place products?
Complexities:
• Correlate cart tracking data with consumer behavior
• Predict behavior for novel supermarket configurations
[Figure: supermarket layout; labels: target, Deli, entrance, Produce, Entrance, Frozen food, CSD, Parking lot, Registers]
Client:
W.K. Kellogg
Foundation
Challenge:
• Identify non-traditional
skills and experience to
help disconnected
youth find and retain
entry-level positions
while pursuing a
successful career path.
• Demonstrate value of
non-traditional skills to
employers
Outcome:
Developed simulation
of employer “path”
through entry-level
position; identified
quantitative metrics to
maximize success.
Client:
Leading semiconductor
manufacturer
Challenge:
Allocate distributed
computing resources
across multiple data
centers to minimize cost
and maximize resource
availability
Outcome:
Developed scenario-testing tool that
integrates most aspects
of high-level decision
process while simulating
low-level details of
project flows, resource
distribution and
connectivity.
• Your knowledge and domain expertise are just as valuable as (or more than) quantitative data.
• Would you throw out a good data set that you had taken the time to collect?
• If not, then why would you accept data-centric models that ignore your knowledge?
• Agent-based simulations combine domain expertise and quantitative data, e.g.:
  • Drivers change lanes when they can [domain expertise]
  • Drivers change lanes every 3 minutes [quantitative data]
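The two lane-changing statements can sit side by side in one agent rule. In the sketch below, "when they can" becomes a gap check and "every 3 minutes" is read as an average rate, converted to a per-step probability. That reading, the `min_gap` threshold, and all function names are assumptions made for illustration.

```python
import random

MEAN_SECONDS_BETWEEN_CHANGES = 180.0   # "every 3 minutes" [quantitative data]


def wants_to_change(dt_seconds, rng):
    """Quantitative data: on average one lane-change attempt per 3 minutes.

    The per-step probability dt / 180 is a reasonable approximation only
    when the time step is much shorter than 3 minutes.
    """
    return rng.random() < dt_seconds / MEAN_SECONDS_BETWEEN_CHANGES


def can_change(gap_ahead, gap_behind, min_gap=2):
    """Domain expertise: change lanes only when the target lane has room."""
    return gap_ahead >= min_gap and gap_behind >= min_gap


def changes_lane(gap_ahead, gap_behind, dt_seconds, rng):
    """A driver changes lanes when it is possible and the timing calls for it."""
    return can_change(gap_ahead, gap_behind) and wants_to_change(dt_seconds, rng)


# One simulated hour of 1-second steps for a driver who always has room.
rng = random.Random(0)
changes = sum(changes_lane(5, 5, 1.0, rng) for _ in range(3600))
print("lane changes in one hour with open lanes:", changes)
```

Removing either function leaves a weaker rule: without the gap check drivers change lanes into occupied space, and without the rate they change lanes at every opportunity.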
Models that combine expertise and
quantitative data will ALWAYS do better
than those using only one or the other!