Building Better Models with Less Data
All inquiries:
Paolo Gaudiano
Icosystem Corporation
10 Fawcett Street
Cambridge, MA 02138 USA
+1-617-520-1070
paolo@icosystem.com

Paolo Gaudiano, President & CTO
Icosystem Corporation
Prepared for: Boston Hadoop Users Group
April 26, 2012

©2000-2012 Icosystem Corp., all rights reserved

Predictive modeling is the process of creating or selecting a model to predict the likelihood of an event or outcome.

Why is this useful? Use predictive modeling to figure out what we are about to get into before we actually get into it.

Traditional predictive models are data-centric:
• Collect observations (Data)
• Identify mathematical relationships (Model)
• Extrapolate to new conditions (Prediction)

[Diagram: Data, Model, Prediction]

• Accurate fit to historical data does not guarantee predictive accuracy!
• Many types of data-centric models require the assumption of statistical stationarity. In other words: Tomorrow will be just like yesterday!
• There is only one case where this holds true:

Agent-based simulation (ABS):
• Simulate behavior of individuals (agents)
• Capture key elements of agents
• Simulate interactions between agents
• Let the simulation unfold over time
• Look for patterns and trends

The data-centric approach:
• Collect data from current situation
• Identify correlations with key variables (time of day, number of lanes, weather, ...)
• Extrapolate to new conditions

The ABS approach:
• Simulate behavior of individual drivers (driving style, start/end points, response to weather...)
• Adjust behaviors until overall traffic patterns are replicated accurately
• Test results of changing conditions

• Most of us are experts when it comes to driving: accelerate, decelerate, change lanes
• This domain expertise is sufficient to build a surprisingly accurate simulation of traffic jams!

[Figure panels: Real-world experiment; NetLogo simulation]

• Build the structure of the simulation using domain expertise (e.g., people accelerate, decelerate, change lanes)
• Run the simulation to determine what factors really matter (e.g., deceleration causes more jams than acceleration)
• Gather quantitative data to improve the accuracy of the simulation, e.g.:
  • How hard do people decelerate?
  • How often do they change lanes?

Decision makers use simulations in two primary ways:
What-if scenarios
• Use simulation to test the impact of assumptions, actions and external factors
Strategy design
• Use simulations to identify strategies that will lead to success

Client: PepsiCo - a Fortune 500 consumer goods company
Challenge: Understand behavior of shoppers moving through a supermarket
• How do they navigate?
• What do they purchase?
• Where to place products?
Complexities:
• Correlate cart tracking data with consumer behavior
• Predict behavior for novel supermarket configurations
[Store layout figure labels: target, Deli entrance, Produce entrance, Frozen food, CSD, Parking lot, Registers]

Client: W.K. Kellogg Foundation
Challenge:
• Identify non-traditional skills and experience to help disconnected youth find and retain entry-level positions while pursuing a successful career path.
• Demonstrate value of non-traditional skills to employers
Outcome: Developed simulation of employer “path” through entry-level position; identified quantitative metrics to maximize success.

Client: Leading semiconductors manufacturer
Challenge: Allocate distributed computing resources across multiple data centers to minimize cost and maximize resource availability
Outcome: Developed scenario-testing tool that integrates most aspects of high-level decision process while simulating low-level details of project flows, resource distribution and connectivity.

• Your knowledge and domain expertise are just as valuable as (or more than) quantitative data.
• Would you throw out a good data set that you had taken the time to collect?
• If not, then why would you accept data-centric models that ignore your knowledge?
• Agent-based simulations combine domain expertise and quantitative data, e.g.:
  • Drivers change lanes when they can [domain expertise]
  • Drivers change lanes every 3 minutes [quantitative data]

Models that combine expertise and quantitative data will ALWAYS do better than those using only one or the other!
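
As a rough illustration of the traffic example, the sketch below simulates individual drivers on a single-lane ring road. It uses the Nagel-Schreckenberg rule set, a standard textbook traffic model, as a stand-in; it is not the NetLogo model shown in the talk, and it has no lane changes. The function names (step, simulate) and all parameter values (road_length, v_max, p_brake) are assumptions chosen for illustration, not measured data.

```python
import random


def step(positions, speeds, road_length, v_max, p_brake, rng):
    """Advance every car by one time step using a parallel update."""
    n = len(positions)
    # Sort car indices by position so each car can find the car ahead of it.
    order = sorted(range(n), key=lambda i: positions[i])
    new_speeds = list(speeds)
    for k, i in enumerate(order):
        ahead = order[(k + 1) % n]
        # Number of empty cells between this car and the one ahead (ring road).
        gap = (positions[ahead] - positions[i] - 1) % road_length
        v = min(speeds[i] + 1, v_max)  # 1. accelerate toward the speed limit
        v = min(v, gap)                # 2. slow down to avoid a collision
        if v > 0 and rng.random() < p_brake:
            v -= 1                     # 3. random slowdown (hypothetical rate)
        new_speeds[i] = v
    # 4. Move all cars at once, wrapping around the ring.
    new_positions = [
        (positions[i] + new_speeds[i]) % road_length for i in range(n)
    ]
    return new_positions, new_speeds


def simulate(n_cars, road_length=100, v_max=5, p_brake=0.3, steps=200, seed=0):
    """Run the simulation; return final positions, speeds, and mean-speed history."""
    rng = random.Random(seed)
    # Place cars on distinct cells, all initially stopped.
    positions = rng.sample(range(road_length), n_cars)
    speeds = [0] * n_cars
    mean_speed_history = []
    for _ in range(steps):
        positions, speeds = step(
            positions, speeds, road_length, v_max, p_brake, rng
        )
        mean_speed_history.append(sum(speeds) / n_cars)
    return positions, speeds, mean_speed_history


if __name__ == "__main__":
    # What-if scenario: how does the tendency to slow down affect traffic flow?
    for p in (0.0, 0.2, 0.5):
        _, _, history = simulate(n_cars=30, p_brake=p)
        tail = history[-50:]
        print(f"p_brake={p}: mean speed over last 50 steps = {sum(tail) / len(tail):.2f}")
```

Comparing mean speeds across p_brake values is a small what-if scenario. The structure of the model comes from domain expertise (accelerate, keep a safe gap, occasionally slow down), while p_brake is the kind of parameter that quantitative data would calibrate.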