©2015 Copyright Fractal Analytics, Inc, all rights reserved

©2015 Copyright Fractal Analytics, Inc, all rights reserved. Confidential and proprietary Information of Fractal Analytics Inc. Fractal is a registered trademark of Fractal Analytics Limited.
The Right Recommendation System for Big Data
Introduction
Recommendation systems predict consumer needs based on previous
purchase history, online behavior, ratings, reviews, and other personalized
attributes. They are therefore proving to be a key differentiator for
businesses. Companies are investing in real-time technologies to understand
consumer behavior at a more granular level across channels and devices.
The objective is to get more personalized recommendations, satisfied
customers, and increased sales.
We conducted a study to address the challenges that traditional analytical
tools face while building recommendation systems for big data. The market
trends and experiences of our customers and internal teams who deal with
ever-growing data and changing contextual parameters triggered this study.
Our work is intended to help organizations dealing with big data make
informed decisions on the basis of the findings outlined in this paper.
Selecting an appropriate analytics approach will not only reduce the time to
market of their offerings, but also increase customer satisfaction, leading to
better business outcomes.
We ran three different tools on the same dataset to compare how well each supports analytics for
recommendation systems:
1. Apache Mahout
2. Neo4j
3. Apache Spark
Background
We wanted to solve common problems that recommendation systems face
such as inability to scale, generate relevant recommendations, provide
flexible processing plans, etc. More often than not, recommendation systems
are built using single-threaded analytical tools that work only with smaller
datasets. On a standard computer in our labs having 16 gigabytes of
memory, it took approximately one hour to get recommendations for about
1,000 users. We then projected how long the same recommendation solution
would take on a dataset of about 0.8 million users. Extrapolating to this
scale gave an estimate of about three weeks of continuous computing time.
Such a time frame for generating recommendations was clearly unaffordable.
Recommendation systems predict user responses to options. These systems involve
content-based and collaborative filtering methods to make appropriate recommendations.
Using various statistical models and algorithms, they predict the rating or preference that users
would give to a product or service they have not yet considered.
Our Experiments and Findings
To address these kinds of common issues, we tried Apache Mahout, Apache
Spark, and Neo4j separately in our custom recommendation system.
Apache Mahout
Apache Mahout provides a rich set of components from which one can
construct a customized recommendation system from a selection of algorithms.
We decided to try the most mature machine learning library – Apache Mahout.
Mahout’s MapReduce execution mode on a Hadoop system provided the
advantage of running algorithms in batch processing mode on a distributed
system.
We modified Mahout’s out-of-the-box collaborative filtering (CF) algorithm and
extended it to suit our needs. We enhanced it to run in multiple threads to
exploit the number of cores available in a server. We were thus able to
generate recommendations for a larger dataset (about 0.8 million users and
about 600 items, effectively about 4.5 million rows) in approximately two
hours of processing time, where traditional tools would have taken three
weeks. This gave us the flexibility to try multiple variations of the algorithm
and fine-tune it to obtain the most suitable recommendation model.
We customized the following Mahout components:
1. Data Model
2. User Similarity
3. Nearest Neighborhood
4. Rescorer
5. Recommender
We also used user similarity caching and changed the execution to multi-threaded
mode.
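The way these components fit together can be sketched in a few lines of plain Python. This is not Mahout's API and not the code used in the study: the toy data, the function names, and the choice of cosine similarity are all assumptions made for illustration. The sketch only mirrors the roles listed above (data model, user similarity with caching, neighborhood, rescorer, recommender) and the multi-threaded fan-out over users.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from math import sqrt

# Data model: user -> {item: preference}. Toy values, purely illustrative.
DATA = {
    "u1": {"a": 5.0, "b": 3.0, "c": 4.0},
    "u2": {"a": 4.0, "b": 3.0, "d": 5.0},
    "u3": {"a": 1.0, "c": 2.0, "e": 4.0},
    "u4": {"b": 4.0, "d": 4.0, "e": 1.0},
}


@lru_cache(maxsize=None)
def _cached_similarity(u, v):
    """Cosine similarity between two users' preference vectors (cached)."""
    a, b = DATA[u], DATA[v]
    dot = sum(a[i] * b[i] for i in a.keys() & b.keys())
    norm = sqrt(sum(x * x for x in a.values())) * sqrt(sum(x * x for x in b.values()))
    return dot / norm if norm else 0.0


def user_similarity(u, v):
    """User similarity; arguments are ordered so (u, v) and (v, u) share a cache entry."""
    return _cached_similarity(*sorted((u, v)))


def nearest_neighbors(user, n=2):
    """Neighborhood: the n users most similar to `user`."""
    others = [v for v in DATA if v != user]
    return sorted(others, key=lambda v: (-user_similarity(user, v), v))[:n]


def rescore(item, score):
    """Rescorer hook: a place to boost or filter items. Identity in this sketch."""
    return score


def recommend(user, how_many=2, n_neighbors=2):
    """Recommender: similarity-weighted average of neighbors' preferences."""
    seen = DATA[user]
    totals, weights = {}, {}
    for v in nearest_neighbors(user, n_neighbors):
        sim = user_similarity(user, v)
        for item, pref in DATA[v].items():
            if item not in seen:
                totals[item] = totals.get(item, 0.0) + sim * pref
                weights[item] = weights.get(item, 0.0) + sim
    scores = {i: rescore(i, totals[i] / weights[i]) for i in totals if weights[i] > 0}
    return sorted(scores, key=lambda i: (-scores[i], i))[:how_many]


def recommend_all(users, workers=4):
    """Fan the per-user work out over a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(users, pool.map(recommend, users)))
```

Calling `recommend_all(list(DATA))` returns a dictionary of recommended items per user. In CPython the thread pool shows the structure only; CPU-bound scoring would need processes or a runtime with true multi-core threads, as on the JVM.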
As we were building the recommendation system on Mahout, we knew this method
had a disadvantage – batch processing. It evaluates the algorithm and
processes the entire dataset at one time, even if we needed to find the
recommendations for only a few rows. It was not optimal for online
recommendations where the expectation is to provide real-time responses for
only a few users at a given time.
To solve these problems, we took our experiment further and developed one
recommendation system based on a scalable machine learning library (Apache
Spark) and another on a graph database (Neo4j). Each of these methods
provided distinct advantages over Mahout. Spark’s in-memory computation
allowed us to drastically decrease the run-time of our algorithm. The Neo4j
graph database provided the flexibility of using an online method for
recommendations by allowing us to compute the heuristics for only the relevant
subset of data.
Apache Spark
Spark provides a scalable machine learning library consisting of common learning
algorithms and utilities – including classification, regression, clustering,
collaborative filtering, dimensionality reduction, and underlying optimization
primitives.
Batch processing using MapReduce offers neither speed nor flexibility. The
Mahout community has started moving away from the MapReduce paradigm
towards memory-based computing, which offers faster processing due to reduced
disk input/output operations. It also offers more flexibility by allowing algorithms to
use the directed acyclic graph (DAG) approach rather than the rigid MapReduce
paradigm. We leveraged the Alternating Least Squares (ALS) algorithm to find the
appropriate recommendations for users. The Spark-based algorithm was substantially
faster than Mahout (10 minutes to run ALS as opposed to two hours for CF in
Mahout). We further enhanced our algorithm by running cross-validation to iterate
over multiple parameters/heuristics.
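To make the ALS idea concrete, here is a minimal pure-Python sketch of regularized alternating least squares with a small parameter search. It is a stand-in for illustration, not Spark's implementation or API and not the code used in the study. The toy ratings, the function names, the parameter grid, and the single hold-out split (used here in place of full cross-validation) are all assumptions.

```python
import random
from itertools import product
from math import sqrt


def solve(A, b):
    """Solve the small dense system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]


def update(fixed, observed, rank, reg):
    """Re-fit one side's factors by regularized least squares, other side fixed."""
    out = {}
    for key, ratings in observed.items():
        A = [[reg if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for other, r in ratings.items():
            f = fixed[other]
            for i in range(rank):
                b[i] += r * f[i]
                for j in range(rank):
                    A[i][j] += f[i] * f[j]
        out[key] = solve(A, b)
    return out


def als(ratings, rank=2, reg=0.1, iterations=15, seed=0):
    """Alternate between solving for user factors and item factors."""
    rng = random.Random(seed)
    by_user, by_item = {}, {}
    for (u, i), r in ratings.items():
        by_user.setdefault(u, {})[i] = r
        by_item.setdefault(i, {})[u] = r
    items = {i: [rng.random() for _ in range(rank)] for i in by_item}
    users = {}
    for _ in range(iterations):
        users = update(items, by_user, rank, reg)
        items = update(users, by_item, rank, reg)
    return users, items


def rmse(users, items, ratings):
    """Root-mean-square error of predicted vs. given ratings."""
    err = [(r - sum(a * b for a, b in zip(users[u], items[i]))) ** 2
           for (u, i), r in ratings.items()]
    return sqrt(sum(err) / len(err))


# Toy ratings with two taste groups; two entries are held out for validation.
TRAIN = {
    ("u1", "a"): 5.0, ("u1", "b"): 5.0, ("u1", "c"): 1.0, ("u1", "d"): 1.0,
    ("u2", "a"): 5.0, ("u2", "b"): 5.0, ("u2", "c"): 1.0,
    ("u3", "a"): 1.0, ("u3", "b"): 1.0, ("u3", "c"): 5.0, ("u3", "d"): 5.0,
    ("u4", "a"): 1.0, ("u4", "c"): 5.0, ("u4", "d"): 5.0,
}
HELD_OUT = {("u2", "d"): 1.0, ("u4", "b"): 1.0}


def grid_search(train, held_out, ranks=(1, 2), regs=(0.01, 0.1, 1.0)):
    """Score every (rank, reg) pair on the held-out ratings; return the best."""
    scores = {(k, lam): rmse(*als(train, rank=k, reg=lam), held_out)
              for k, lam in product(ranks, regs)}
    return min(scores, key=scores.get), scores
```

`grid_search(TRAIN, HELD_OUT)` returns the best `(rank, reg)` pair along with all scores. In a distributed setting the same loop structure applies, with the per-user and per-item solves spread across workers and intermediate factors kept in memory between iterations.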
Neo4j
Neo4j is a popular open-source graph database implemented in Java. It is an
embedded, disk-based, fully transactional Java persistence engine that stores
data structured in graphs rather than in tables.
The Neo4j graph database provided a key advantage by processing only the subset
of data we were interested in. This gave us the ability to scale to billions of
rows and still generate recommendations for a smaller connected subset of data.
We were also able to incorporate more attributes into the graph and use richer
information to get appropriate recommendations.
The following are the two approaches we used for building an online
recommendation system with Neo4j:
JDBC: The following process depicts how we used a JDBC connection to query the database to generate
the recommendations:
• Find the top 20 similar users based on cosine similarity using a Cypher query:
MATCH (u1:Users)-[d]->() USING INDEX u1:Users(id) WHERE u1.id = {1}
WITH u1.id AS user1, count(d) AS user1_prod
MATCH (u1:Users)-[]->()<-[prod]-(others) USING INDEX u1:Users(id) WHERE u1.id = {1}
WITH user1, user1_prod, others, count(prod) AS intersect
MATCH (others)-[b1]->()
WITH user1, others.id AS user2, intersect, user1_prod, count(b1) AS user2_prod
WITH user1, user2, intersect / (sqrt(user1_prod) * sqrt(user2_prod)) AS similarity
RETURN user2, similarity ORDER BY similarity DESC LIMIT 20;
• Find all the products bought/liked by these top 20 similar users using a Cypher query:
MATCH (u1:Users)-[]->(prod) USING INDEX u1:Users(id) WHERE u1.id IN [<top 20 similar users>]
WITH prod, collect(u1.id) AS users RETURN prod.id, users
• Find the weight of each of those products by adding the similarity values of the users who have bought
them (normalized similarity).
• Recommend the top 10 products based on the weights evaluated.
Neo4j Server extension or traversal framework: The algorithm in this case remained the same, except
that we could make further customizations to reduce query and evaluation execution time.
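The four JDBC steps can be mirrored in memory with a short Python sketch, which may help when checking query results by hand. It follows the same formula as the first query (shared products divided by the product of the square roots of each user's product count). Several things are assumptions rather than details from the study: the toy purchase data, the function names, reading "normalized similarity" as dividing each similarity by the sum over the selected users, and the `exclude_owned` option.

```python
from math import sqrt


def top_similar_users(purchases, user, k=20):
    """Steps 1: cosine similarity on product sets, keeping the k closest users.

    Users who share no product with `user` are skipped, as in the graph query,
    which only reaches users connected through a common product.
    """
    mine = purchases[user]
    sims = {}
    for other, theirs in purchases.items():
        if other == user:
            continue
        common = len(mine & theirs)
        if common:
            sims[other] = common / (sqrt(len(mine)) * sqrt(len(theirs)))
    top = sorted(sims, key=lambda u: (-sims[u], u))[:k]
    return {u: sims[u] for u in top}


def recommend(purchases, user, k=20, n=10, exclude_owned=True):
    """Steps 2-4: weight products by the similarity of their buyers, take the top n.

    Assumption: "normalized" means each similarity is divided by the total
    similarity of the selected users. Assumption: products the user already
    owns are dropped when exclude_owned is True; the source does not say.
    """
    sims = top_similar_users(purchases, user, k)
    total = sum(sims.values())
    weights = {}
    for other, sim in sims.items():
        for prod in purchases[other]:
            weights[prod] = weights.get(prod, 0.0) + sim / total
    if exclude_owned:
        weights = {p: w for p, w in weights.items() if p not in purchases[user]}
    return sorted(weights, key=lambda p: (-weights[p], p))[:n]


# Toy purchase data: user id -> set of product ids (illustrative only).
PURCHASES = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"a", "d"},
    "u4": {"e"},
}
```

For example, `recommend(PURCHASES, "u1")` ranks the products bought by the users closest to `u1`. Because only users reachable through a shared product are considered, the work stays proportional to the user's neighborhood rather than the whole dataset, which is the property the graph approach relies on.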
Conclusion
Building the right recommendation system involves using the right technology
after assessing the business model, appropriate algorithms, and
frequency/type of consumption of recommendations.
Mahout and Spark are useful for running a batch process and evaluating
recommendations for the entire dataset. Both help in comparing
recommendations based on multiple algorithms and heuristics. Spark is much
faster and supports efficient cross-validation.
For online processing that provides real-time recommendations based on a
selected algorithm, a NoSQL graph database like Neo4j is helpful. It uses
only the relevant information for analysis, thereby leading to faster
processing.
Authors
Suraj Amonkar
Associate Director
Suraj has over 10 years of experience in the data mining and analytics industry.
He has led the deployment of enterprise-grade big data analytics platforms at
Palantir Technologies and has worked for multiple healthcare and life
science analytics companies, such as Celera Genomics and the Mayo Clinic,
focusing on pioneering algorithms for early detection of cancer. He has multiple
publications/patents in the domain of cancer-research algorithms and is a
co-inventor of one of the first algorithms for early detection of cancer.
Vishal Rajpal
Associate Director
Vishal has close to 10 years of rich experience in building technology
applications. He worked for over four years at Morgan Stanley Capital
International, leading the development and support of real-time,
high-availability, fault-tolerant systems. He has also worked for Accenture
Consulting, serving clients worldwide in the energy trading and risk
management domain.
Karan Gusani
Analyst – Big Data Engineer
Karan has rich experience in developing enterprise applications using
open-source technologies and tools. He also has experience in setting up the
entire infrastructure, development, and deployment of a resource management
web application system on the cloud.
Contributor
Vikas Arora
Technical Content Specialist
About Fractal Analytics
Fractal Analytics is a global analytics firm that helps Fortune 500 companies
gain a competitive advantage by providing them with a deep understanding of
consumers and tools to improve business efficiency. Producing accelerated
analytics that generate data-driven decisions, Fractal Analytics delivers
insight, innovation and impact through predictive analytics and visual storytelling.
Fractal Analytics was founded in 2000 and has 800 people in 13 offices
around the world serving clients in over 100 countries.
The company has earned recognition by industry analysts and has been
named one of the top five “Cool Vendors in Analytics” by research advisor
Gartner. Fractal Analytics has also been recognized for its rapid growth, being
ranked on the exclusive Inc. 5000 list for the past three years and also being
named among the USPAACC’s Fast 50 Asian-American owned businesses
for the past two years.
Learn more at www.fractalanalytics.com
For more information, contact us at:
+1 650 378 1284
info@fractalanalytics.com