©2015 Copyright Fractal Analytics, Inc, all rights reserved
Confidential and proprietary information of Fractal Analytics Inc. Fractal is a registered trademark of Fractal Analytics Limited.

The Right Recommendation System for Big Data

Introduction

Recommendation systems predict consumer needs based on previous purchase history, online behavior, ratings, reviews, and other personalized attributes. They are therefore proving to be a key differentiator for businesses. Companies are investing in real-time technologies to understand consumer behavior at a more granular level across channels and devices. The objective is more personalized recommendations, more satisfied customers, and increased sales.

We conducted a study to address the challenges that traditional analytical tools face when building recommendation systems for big data. The study was prompted by market trends and by the experiences of our customers and internal teams, who deal with ever-growing data and changing contextual parameters. Our work is intended to help organizations dealing with big data make informed decisions on the basis of the findings outlined in this paper. Selecting an appropriate analytics approach will not only reduce the time to market of their offerings, but also increase customer satisfaction, leading to better business outcomes.

We used three different tools on the same dataset to compare analytics approaches for recommendation systems:

1. Apache Mahout
2. Neo4j
3. Apache Spark

Background

We wanted to solve common problems that recommendation systems face, such as the inability to scale, to generate relevant recommendations, and to provide flexible processing plans. More often than not, recommendation systems are built using single-threaded analytical tools that work only with smaller datasets.
On a standard computer in our labs with 16 gigabytes of memory, it took approximately one hour to generate recommendations for about 1,000 users. We then projected how long the same recommendation solution would take on a dataset of about 0.8 million users. The projected time at this scale came to three weeks of continuous computing, which was clearly unaffordable.

Recommendation systems predict user responses to options. These systems use content-based and collaborative filtering methods to make appropriate recommendations. Using various statistical models and algorithms, they predict the rating or preference that users would give to a product or service they have not yet considered.

Our Experiments and Findings

To address these common issues, we tried Apache Mahout, Apache Spark, and Neo4j separately in our custom recommendation system.

Apache Mahout

Apache Mahout provides a rich set of components from which one can construct a customized recommendation system from a selection of algorithms. We decided to try the most mature machine learning library, Apache Mahout. Mahout’s MapReduce execution mode on a Hadoop system provided the advantage of running algorithms in batch processing mode on a distributed system.

We modified Mahout’s out-of-the-box collaborative filtering (CF) algorithm and extended it to suit our needs. We enhanced it to run in multiple threads to exploit the number of cores available in a server. We were thus able to generate recommendations for a larger dataset (about 0.8 million users and about 600 items, effectively about 4.5 million rows) in approximately two hours of processing time, where traditional tools would have taken three weeks.
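
As a rough illustration of user-based collaborative filtering with the per-user work spread over a thread pool, here is a minimal Python sketch. It is not Mahout's API and not our modified algorithm: the toy data, the function names, the cosine similarity on binary purchase sets, and the neighborhood size are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt

# Hypothetical toy data: user id -> set of item ids the user has bought.
PURCHASES = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b", "d"},
    "u3": {"b", "c", "e"},
    "u4": {"f"},
}


def cosine(items_a, items_b):
    """Cosine similarity between two users' binary purchase vectors."""
    if not items_a or not items_b:
        return 0.0
    return len(items_a & items_b) / (sqrt(len(items_a)) * sqrt(len(items_b)))


def recommend(user, purchases, n_neighbors=2, n_items=3):
    """Score unseen items by the summed similarity of the nearest neighbors."""
    own = purchases[user]
    # User similarity: compare the target user with every other user.
    sims = [(cosine(own, items), other)
            for other, items in purchases.items() if other != user]
    # Nearest neighborhood: keep the most similar users with some overlap.
    neighbors = sorted((s for s in sims if s[0] > 0), reverse=True)[:n_neighbors]
    scores = {}
    for sim, other in neighbors:
        for item in purchases[other] - own:
            scores[item] = scores.get(item, 0.0) + sim
    # Recommender: highest score first, ties broken by item id.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:n_items]]


def recommend_all(purchases, workers=4):
    """Fan the independent per-user computations out to a thread pool."""
    users = sorted(purchases)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda u: recommend(u, purchases), users)
    return dict(zip(users, results))


if __name__ == "__main__":
    for user, items in recommend_all(PURCHASES).items():
        print(user, items)
```

Each user's recommendations depend only on read-only shared data, which is what makes the work easy to split across threads. In CPython the thread pool mainly shows the structure, since the interpreter lock limits CPU-bound speed-up; JVM threads, as used by Mahout, or a process pool would be needed to actually use multiple cores.
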
The shorter run time gave us the flexibility to try multiple variations of the algorithm and fine-tune them to obtain a suitable recommendation model. We customized the following Mahout components:

1. Data Model
2. User Similarity
3. Nearest Neighborhood
4. Rescorer
5. Recommender

We also used user similarity caching and changed the execution to multi-threaded mode.

As we were building the recommendation system on Mahout, we knew this method had a disadvantage: batch processing. It evaluates the algorithm and processes the entire dataset at one time, even if we need recommendations for only a few rows. It is not optimal for online recommendations, where the expectation is to provide real-time responses for only a few users at a given time.

To solve these problems, we took our experiment further and developed one recommendation system based on a scalable machine learning library (Apache Spark) and another on a graph database (Neo4j). Each of these methods provided distinct advantages over Mahout. Spark’s in-memory computation allowed us to drastically decrease the run time of our algorithm. The Neo4j graph database provided the flexibility of an online method for recommendations by allowing us to compute the heuristics for only the relevant subset of data.

Apache Spark

Spark provides a scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and underlying optimization primitives. Batch processing using MapReduce does not offer comparable speed and flexibility.
The Mahout community has started moving away from the MapReduce paradigm towards memory-based computing, which offers faster processing due to reduced disk input/output operations. It also offers more flexibility by allowing algorithms to use the directed acyclic graph (DAG) approach rather than the rigid MapReduce paradigm.

We leveraged the Alternating Least Squares (ALS) algorithm to find appropriate recommendations for users. The Spark-based algorithm was substantially faster than Mahout: ALS took 10 minutes to run, as opposed to two hours for CF in Mahout. We further enhanced our algorithm by running cross-validation to iterate over multiple parameters and heuristics.

Neo4j

Neo4j is a popular open-source graph database implemented in Java. It is an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables.

The Neo4j graph database provided a key advantage by processing only the subset of data we were interested in. This gave us the ability to scale to billions of rows and still generate recommendations for a smaller connected subset of the data. We were also able to incorporate more attributes into the graph and use richer information to get appropriate recommendations.

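
To illustrate why a graph lets a query touch only the connected subset, here is a minimal Python sketch that stands in for the database with two in-memory adjacency maps. It is a sketch of the idea, not Neo4j's API or our production code: the names and toy data are hypothetical, and the choices of cosine similarity over purchase counts, unnormalized summed weights, and skipping products the user already owns are assumptions for the example. Starting from one user, it walks user -> product -> other user, so users with no product in common are never visited.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy graph: (user, product) edges, e.g. "bought" relationships.
EDGES = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"), ("u2", "b"), ("u2", "c"),
    ("u3", "b"), ("u3", "d"),
    ("u4", "z"),  # not connected to u1, so never visited for u1
]

user_products = defaultdict(set)
product_users = defaultdict(set)
for user, product in EDGES:
    user_products[user].add(product)
    product_users[product].add(user)


def similar_users(user, top_k=20):
    """Walk user -> product -> other user and rank by cosine similarity."""
    shared = defaultdict(int)  # other user -> number of products in common
    for product in user_products[user]:
        for other in product_users[product]:
            if other != user:
                shared[other] += 1
    own_count = len(user_products[user])
    sims = {
        other: count / (sqrt(own_count) * sqrt(len(user_products[other])))
        for other, count in shared.items()
    }
    # Highest similarity first, ties broken by user id.
    ranked = sorted(sims.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


def recommend(user, top_k=20, top_n=10):
    """Weight each unseen product by the summed similarity of its buyers."""
    weights = defaultdict(float)
    for other, sim in similar_users(user, top_k):
        # Assumption: products the user already has are not recommended.
        for product in user_products[other] - user_products[user]:
            weights[product] += sim
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [product for product, _ in ranked[:top_n]]


if __name__ == "__main__":
    print(similar_users("u1"))
    print(recommend("u1"))
```

The cost of a request depends on the size of the user's neighborhood rather than on the total number of users, which is the property that makes this style of lookup suitable for online use.
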
The following are the two approaches we used for building an online recommendation system with Neo4j:

JDBC: We used a JDBC connection to query the database and generate the recommendations as follows:

• Find the top 20 similar users based on cosine similarity using a Cypher query:

    MATCH (u1:Users)-[d]->()
    USING INDEX u1:Users(id)
    WHERE u1.id = {1}
    WITH u1.id AS user1, count(d) AS user1_prod
    MATCH (u1:Users)-[]->()<-[prod]-others
    USING INDEX u1:Users(id)
    WHERE u1.id = {1}
    WITH user1, user1_prod, others, count(prod) AS intersect
    MATCH others-[b1]->()
    WITH user1, others.id AS user2, intersect, user1_prod, count(b1) AS user2_prod
    WITH user1, user2, intersect/(sqrt(user1_prod) * sqrt(user2_prod)) AS similarity
    RETURN user2, similarity
    ORDER BY similarity DESC
    LIMIT 20;

• Find all the products bought/liked by these top 20 similar users using a Cypher query:

    MATCH (u1:Users)-[]->prod
    USING INDEX u1:Users(id)
    WHERE u1.id IN [<top 20 similar users>]
    WITH prod, collect(u1.id) AS users
    RETURN prod.id, users

• Find the weight of each of those products by adding the similarity values of the users who have bought them (normalized similarity).

• Recommend the top 10 products based on the computed weights.

Neo4j server extension or traversal framework: The algorithm in this case remained the same, except that we could make further customizations to reduce query and evaluation execution time.

Conclusion

Building the right recommendation system involves choosing the right technology after assessing the business model, the appropriate algorithms, and the frequency and type of consumption of recommendations.

Mahout and Spark are useful for running a batch process and evaluating recommendations for the entire dataset. Both help in comparing recommendations based on multiple algorithms and heuristics. Spark is much faster and supports efficient cross-validation.
For online processing that provides real-time recommendations based on a selected algorithm, a NoSQL graph database like Neo4j is helpful. It uses only the relevant information for analysis, thereby leading to faster processing.

Authors

Suraj Amonkar
Associate Director
Suraj has over 10 years of experience in the data mining and analytics industry. He has led the deployment of enterprise-grade big data analytics platforms at Palantir Technologies, and has worked for multiple healthcare and life science analytics organizations such as Celera Genomics and the Mayo Clinic, focusing on pioneering algorithms for early detection of cancer. He has multiple publications and patents in the domain of cancer-research algorithms, and is a co-inventor of one of the first algorithms for early detection of cancer.

Vishal Rajpal
Associate Director
Vishal has close to 10 years of experience in building technology applications. He worked for over four years at Morgan Stanley Capital International, leading the development and support of real-time, high-availability, fault-tolerant systems. He has also worked for Accenture Consulting, serving clients worldwide in the energy trading and risk management domain.

Karan Gusani
Analyst, Big Data Engineer
Karan has experience in developing enterprise applications using open-source technologies and tools. He also has exposure to setting up the entire infrastructure, development, and deployment of a resource management web application system on the cloud.

Contributor

Vikas Arora
Technical Content Specialist

About Fractal Analytics

Fractal Analytics is a global analytics firm that helps Fortune 500 companies gain a competitive advantage by providing them with a deep understanding of consumers and with tools to improve business efficiency. Producing accelerated analytics that generate data-driven decisions, Fractal Analytics delivers insight, innovation, and impact through predictive analytics and visual storytelling.

Fractal Analytics was founded in 2000 and has 800 people in 13 offices around the world, serving clients in over 100 countries.

The company has earned recognition from industry analysts and has been named one of the top five “Cool Vendors in Analytics” by research advisor Gartner. Fractal Analytics has also been recognized for its rapid growth, being ranked on the exclusive Inc. 5000 list for the past three years and being named among the USPAACC’s Fast 50 Asian-American owned businesses for the past two years.

Learn more at www.fractalanalytics.com

For more information, contact us at:
+1 650 378 1284
info@fractalanalytics.com