How eHarmony Turns Big Data into True Love
Transcription
How eHarmony Turns Big Data into True Love
Sridhar Chiguluri, Lead ETL Developer, eHarmony
Grant Parsamyan, Director of BI & Data Warehousing, eHarmony

Agenda
• Company Overview
• What is Big Data?
• Challenges
• Implementation Phase 1
• Architecture

Company Overview
• eHarmony was founded in 2000 and pioneered the use of relationship science to match singles seeking long-term relationships. Today the company offers a variety of relationship services in the United States, Canada, Australia, the United Kingdom and Brazil, with members in more than 150 countries around the world.
• With more than 40 million registered users, eHarmony's highly regarded singles matching service is a market leader in online relationships.
• On average, 542 eHarmony members marry every day in the United States as a result of being matched on the site.*
• eHarmony also operates Jazzed.com, a casual and fun dating site where users can browse their matches directly.

Data Analytics Group
• Our team (DAG) is responsible for providing business analytics and reporting solutions to internal business users across all departments.
• Each person on the team is responsible for a specific business unit: Accounting, Finance, Marketing, Customer Care, Life Cycle Marketing and International.
• Business users have very limited direct data access; all data is provided through ad hoc SQL and MicroStrategy reports.

Big Data
• Gartner: "'Big Data' Is Only the Beginning of Extreme Information Management"
• McKinsey & Company: "'Big data' refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze."

Big Data Event: JSON
• JSON (JavaScript Object Notation) is widely hailed as the successor to XML in the browser. JSON aspires to be nothing more than a simple and elegant data format for the exchange of information between the browser and server, and in doing this simple task it will usher in the next version of the World Wide Web itself.
• JSON can be represented in two structures:
  o Object - an unordered set of name/value pairs
  o Array - an ordered collection of values
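To make the two structures concrete, here is a minimal Python sketch, not taken from the eHarmony codebase; the payload is a simplified, hypothetical combination of the samples shown later on the "Sections in a JSON" slide.

```python
import json

# Hypothetical event payload: an Object (unordered name/value pairs)
# whose "changes" field is an Array (ordered collection of values).
raw_event = '''
{
  "headers": {"id": "example-id", "X-category": "example.category"},
  "context": {"userId": 41440669, "locale": "en_US"},
  "changes": [
    {"name": "ageRangeMin", "newValue": 18, "oldValue": 0},
    {"name": "ageRangeMax", "newValue": 24, "oldValue": 0}
  ]
}
'''

event = json.loads(raw_event)          # Object -> Python dict
for change in event["changes"]:        # Array  -> Python list, order preserved
    print(change["name"], change["oldValue"], "->", change["newValue"])
```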
Sample JSON event
• (Screenshot of a raw JSON event with its Changes, Context and Header sections labeled.)

JSON rows as they appear in the database after being flattened out by HParser
• One event becomes many rows. Every row repeats the event-level columns CATEGORY, ENTITY_ID, ID, PRODUCER and EVENT_TIMESTAMP, while PROPERTY_NAME, PROPERTY_NEW_VALUE and PROPERTY_SOURCE vary per row. Sample event (ENTITY_ID 41440669, PRODUCER QAAS, EVENT_TIMESTAMP 2/16/2012 22:31):

  PROPERTY_NAME                           PROPERTY_NEW_VALUE       PROPERTY_SOURCE
  locale                                  en_US                    CONTEXT
  site                                    singles                  CONTEXT
  type                                    7                        CONTEXT
  userAnswers                             {"type":7,"version":1}   CONTEXT
  userId                                  41440669                 CONTEXT
  version                                 1                        CONTEXT
  userAnswers[singles7-1-6-63].ignored    TRUE                     CHANGE
  userAnswers[singles7-1-6-63].type       MULTISELECT              CHANGE
  userAnswers[singles7-1-6-63].answer     []                       CHANGE
  userAnswers[singles7-1-6-63].date       1329460263580            CHANGE
  userAnswers[singles7-1-6-63].desc                                CHANGE

Sections in a JSON
• Changes - contains the list of variables that changed, which resulted in this event's generation.
  o Sample row where a user chose the desired age range for their match:
    "changes":[{"name":"ageRangeMin","newValue":18,"oldValue":0},{"name":"ageRangeMax","newValue":24,"oldValue":0}]
• Context - provides contextual information for the changes, such as User Id, User Name, etc.
  o Sample row showing the user's name and match details:
    "context":{"userFirstName":"John","userLocation":"Santa Monica, CA","matchId":"353861","matchUserId":"2936522"}
• Header - provides header-level information.
  o Sample header row:
    "headers":{"id":"03c57fe3-21bd-4bde-8c5a-679b5fb3c38a","X-category":"mds_savematch.resource.post","X-instance":"matchdata01-i8","X-timestamp":"2012-01-18T00:46:35.648+0000"}

Challenges
• Millions of events generated every hour as JSON files.
• How to handle the large volume?
• No relational source database; how do we process the JSON?
• How do you create reporting that finds trends in that large amount of data?
• Quick turnaround time for prototypes.
• Create an analytics stack that can process large amounts of data and provide real-time reporting.
• Achieve a 3-week release cycle to provide reporting solutions on new event structures.

Phase 1 - Duration: 3 Months
Step 1: Processing the JSON event files each hour
Step 2: Flattening the JSON events (the trickiest part)
Step 3: Loading the flattened events into staging tables
Step 4: Finding the relationships
Step 5: Defining the data model
Step 6: ETL (Extract, Transform and Load)
Step 7: Building MicroStrategy reports and dashboards
Step 8: Storing historical data / events

Step 1, 2 & 3: Reading, Flattening and Loading Events
• Events are stored in text files.
• HParser and scripts process the files every hour, flattening each event into CSV files (also exposed as a Hive table).
• The PWX HDFS plug-in is used to load the CSV rows into Netezza staging tables.
• Using a PowerCenter mapping, the changed properties then become rows and the contextual information in the event becomes columns (see the sketch below).
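The flattening itself is done with Informatica HParser and PowerCenter. The following Python sketch is only a rough illustration of the idea, one flat row per property with the event-level fields repeated on every row; the column names follow the staging-table sample above, while the parsing logic and the line-per-event input format are assumptions, not the actual HParser/PowerCenter implementation.

```python
import csv
import json
import sys

def flatten_event(event: dict):
    """Yield one flat row per property; event-level fields repeat on every row."""
    headers = event.get("headers", {})
    base = {
        "ID": headers.get("id"),
        "CATEGORY": headers.get("X-category"),
        "EVENT_TIMESTAMP": headers.get("X-timestamp"),
    }
    # Context values become one row each, marked with PROPERTY_SOURCE = CONTEXT.
    for name, value in event.get("context", {}).items():
        yield {**base, "PROPERTY_NAME": name,
               "PROPERTY_NEW_VALUE": value, "PROPERTY_SOURCE": "CONTEXT"}
    # Changed values become one row each, marked with PROPERTY_SOURCE = CHANGE.
    for change in event.get("changes", []):
        yield {**base, "PROPERTY_NAME": change["name"],
               "PROPERTY_NEW_VALUE": change["newValue"], "PROPERTY_SOURCE": "CHANGE"}

if __name__ == "__main__":
    fields = ["ID", "CATEGORY", "EVENT_TIMESTAMP",
              "PROPERTY_NAME", "PROPERTY_NEW_VALUE", "PROPERTY_SOURCE"]
    writer = csv.DictWriter(sys.stdout, fieldnames=fields)
    writer.writeheader()
    # Assumes one JSON event per input line (e.g. an hourly log file on stdin).
    for line in sys.stdin:
        if line.strip():
            for row in flatten_event(json.loads(line)):
                writer.writerow(row)
```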
• Hparser & scripts process the files every hour, flattens each event into CSV files (also a Hive table) • PWX HDFS plug-in is used to load the CSV rows into Netezza staging tables • Using PowerCenter mapping properties are then changed become rows and Contextual Information in the event becomes columns 12 The Big Staging Table • Contains all events • Grows exponentially • 200 million new rows per day : 30 Billion so far • Current Size: 1.2 TB with 4x Compression • Basis for the whole Data Model • Needs to be archived 13 Finding Relationships • Top Down Approach • Get the Business Reporting Requirements • Analyze the Flattened events in Hadoop • Write Adhoc Hive queries directly on HDFS or Netezza staging tables • Outline the findings and define the relationships • Define the Data Model 14 Data Model • Define Logical Data Model based on: • Business and Analytics Requirements • Relationships and Findings from the last step Tips and Tricks o Only Define/Build what is needed for Reporting and Analytics, don’t model anything you don’t need right away o Easy to get lost in the amount of information o Keep it simple 15 ETL • Pass Logical Data Model and Relationships on to ETL team • PowerCenter reads the files in HDFS and loads into the individual tables using PWX HDFS plug-in • Data is loaded hourly and nightly • Goal: To process with in 2 hours, from the time event is fired to the data in tables. 16 Reporting • Keep the Reporting Requirements in mind • Define MicroStrategy Architecture : Attributes/ Facts and Hierarchies • Pass it on to team of BI Developers • Build MicroStrategy Intelligent Cubes and Dashboards based on these cubes • Triggers in place to run the Cubes hourly as soon as the data is updated in the tables 17 Storing Historical Data • Processed event logs are stored in local HDFS (< 1 year) and ins S3 for long term storage • Data can be reprocessed from the JSON event files in case an unused event has to be analyzed 18 Flow of Events : NFS HDFS Netezza Amazon S3 Oracle Event Server Network Drive Hadoop Copy Parse JSON’s in Informatica HParser Hive Staging Table Informatica PowerCenter Grid with PWX for HDFS In-house Hadoop Cluster MicroStrategy Reports Netezza 19 High Level Systems Overview & Data Flow 20 HParser – How Does It Work? hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt 1. Define JSON parser in HParser visual studio 2. Deploy the parser on Hadoop Distributed File System (HDFS) 3. Run HParser to extract data from JSON, flatten, and stage in Hadoop 21 Sample JSON to CSV Transformation in DT 22 Sample mapping that reads Hparser output to Netezza HDFS Application Connection Sample workflow that calls a Hparser script and parses the output data into Netezza 23 Workflow Controlled by Informatica Informatica HParser Staging Table Informatica PowerCenter Netezza 24 Next Steps • Phase 1 was about capturing huge volumes of data and creating MSTR architecture, Operational reports and dashboards. • Phase 2: Provide concise analytics anywhere and anytime 25 Business Benefit • Have a scalable infrastructure • Adding additional ETL and analytical capabilities without increasing overhead • Creating an agile environment to keep up with business expectations (2 to 3 day turnaround for new data) 26 Thank You 27