1
Recommendation Engines & Accumulo - Sqrrl Data Science Group May 21, 2013
2
Agenda
6:15 - 6:30  More data trumps better algorithms by Michael Walker
6:30 - 7:30  Recommendation Engines by Tom Rampley
7:30 - 8:30  Accumulo - Sqrrl by John Dougherty
8:30 - 9:30  Networking at Old Chicago (14th and Market)
3
Data Science Group New Sponsors: Cloudera and O'Reilly Media
5
More data is better
6
Even if less exact or messier
8
One sensor = strict accuracy
9
Multiple sensors = less accurate & messy
10
One sensor = strict accuracy Multiple sensors = less accurate & messy More data points = greater value
11
One sensor = strict accuracy Multiple sensors = less accurate & messy More data points = greater value Aggregate = more comprehensive picture
13
Increase frequency of sensor readings
14
One measure per min = accurate
15
Increase frequency of sensor readings One measure per min = accurate 100 readings per second = less accurate
16
Increase frequency of sensor readings One measure per min = accurate 100 readings per second = less accurate > volume vs. exactitude
17
Increase frequency of sensor readings One measure per min = accurate 100 readings per second = less accurate > volume vs. exactitude Accept messiness to get scale
18
Sacrifice accuracy in return for knowing general trend
19
Big data = probabilistic (not precise)
20
Sacrifice accuracy in return for knowing general trend Big data = probabilistic (not precise) Good yet has problems
21
Internet of Things
27
"Data Science" means the scientific study of the creation, manipulation and transformation of data to create meaning.
28
Internet of Things "Data Scientist" means a professional who uses scientific methods to liberate and create meaning from raw data.
29
Internet of Things "Big Data" means large data sets that have different properties from small data sets and requires special data science methods to differentiate signal from noise to extract meaning and requires special compute systems and power.
30
Data Science
32
"Signal" means a meaningful interpretation of data based on science that may be transformed into scientific evidence and knowledge.
33
Data Science "Noise" means a competing interpretation of data not grounded in science that may not be considered scientific evidence. Yet noise may be manipulated into a form of knowledge (what does not work).
35
Machine Learning
37
Field of study that gives computers the ability to learn without being explicitly programmed.
39
Algorithms
41
Process or set of rules to be followed in calculations or other problem-solving operations to achieve a goal, especially a mathematical rule or procedure used to compute a desired result, produce the answer to a question or the solution to a problem in a finite number of steps.
42
More data trumps better algorithms
43
Microsoft Word Grammar Checker
44
More data trumps better algorithms Microsoft Word Grammar Checker Improve algorithms
45
More data trumps better algorithms Microsoft Word Grammar Checker Improve algorithms New techniques
46
More data trumps better algorithms Microsoft Word Grammar Checker Improve algorithms New techniques New features
47
More data trumps better algorithms Feed more data into existing methods
48
More data trumps better algorithms Feed more data into existing methods Most ML algorithms trained on one million words or less
49
More data trumps better algorithms Feed more data into existing methods Most ML algorithms trained on one million words or less Experiment: 10 million, 100 million, 1 billion words
50
More data trumps better algorithms Results: algorithms improved dramatically. A simple algorithm that was the worst performer with half a million words performed better than all the others with 1 billion words; the algorithm that worked best with half a million words performed worst with 1 billion words.
51
More data trumps better algorithms Conclusions: More trumps less More trumps smarter (not always)
52
More data trumps better algorithms Tradeoff between spending time and money on algorithm development versus spending it on data development
53
More data trumps better algorithms Google language translation: from 1 billion words to 1 trillion words, a larger yet messier data set (the entire internet)
57
Tom Rampley Recommendation Engines: an Introduction
58
A Brief History of Recommendation Engines
Today: Recommenders become core products. In addition to Amazon, companies like Pandora, Stitchfix, and Google (because what is a search engine other than a document recommender?) make recommendations a core value add of their services.
2000: Amazon joins the party. The introduction and vast success of the Amazon recommendation engine in the early 2000s led to wide acceptance of the technology as a way of increasing sales.
1992: Recommenders are older than you might think. GroupLens becomes the first widely used recommendation engine.
59
What Does a Recommender Do?
Recommendation engines use algorithms of varying complexity to suggest items based upon historical information:
Item ratings or content
Past user behavior/purchase history
Recommenders typically use some form of collaborative filtering
60
Collaborative Filtering
The name:
'Collaborative' because the algorithm takes the choices of many users into account to make a recommendation; it relies on user taste similarity
'Filtering' because you use the preferences of other users to filter out the items most likely to be of interest to the current user
Collaborative Filtering algorithms include:
K nearest neighbors
Cosine similarity
Pearson correlation
Bayesian belief nets
Markov decision processes
Latent semantic indexing methods
Association rules learning
61
Cosine Similarity Example
Let's walk through an example of a simple collaborative filtering algorithm, namely cosine similarity. Cosine similarity can be used to find similar items or similar individuals; in this case, we'll be trying to identify individuals with similar taste. Imagine individual ratings on a set of items as a [user, item] matrix. You can then treat the ratings of each individual as an N-dimensional vector of ratings on items: {r_1, r_2, …, r_N}. The similarity of two such vectors (individuals' ratings) u and v can be computed as the cosine of the angle between them:

cos(u, v) = (u · v) / (‖u‖ ‖v‖)

The closer the cosine is to 1, the more alike the two individuals' ratings are.
62
Cosine Similarity Example Continued
Let's say we have the following matrix of users and their ratings of TV shows:

            True Blood  CSI  JAG  Star Trek  Castle  The Wire  Twin Peaks
Bob              5       2    1       4        3        2          5
Mary             4       4    2       1        3        1          2
Jim              1       1    5       2        5        2          3
George           3       4    3       5        5        4          3
Jennifer         5       2    4       2        4        1          0
Natalie          0       5    0       4        4        1          4
Robin            5       5    0       0        4        2          2

And we encounter a new user, James, who has only seen and rated 5 of these 7 shows:

            True Blood  CSI  JAG  Star Trek  Castle
James            5       5    3       1        0

Of the two remaining shows, which one should we recommend to James?
63
Cosine Similarity Example Continued
To find out, we'll see who James is most similar to among the folks who have rated all the shows, by calculating the cosine similarity between the vectors of the 5 shows each individual has in common with James:

            Cosine similarity with James
Bob              0.73
Mary             0.89
Jim              0.47
George           0.69
Jennifer         0.78
Natalie          0.50
Robin            0.79

It seems that Mary is the closest to James in terms of show ratings among the group. Of the two remaining shows, The Wire and Twin Peaks, Mary slightly preferred Twin Peaks, so that is what we recommend to James.
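For concreteness, here is a minimal Python sketch of the calculation above, using the ratings from the two tables (numpy is assumed; the code is illustrative and not part of the original deck):

```python
import numpy as np

# Ratings on the five shows James has seen:
# [True Blood, CSI, JAG, Star Trek, Castle]
ratings = {
    "Bob":      [5, 2, 1, 4, 3],
    "Mary":     [4, 4, 2, 1, 3],
    "Jim":      [1, 1, 5, 2, 5],
    "George":   [3, 4, 3, 5, 5],
    "Jennifer": [5, 2, 4, 2, 4],
    "Natalie":  [0, 5, 0, 4, 4],
    "Robin":    [5, 5, 0, 0, 4],
}
james = np.array([5, 5, 3, 1, 0])

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

sims = {name: cosine_similarity(james, r) for name, r in ratings.items()}
for name, s in sorted(sims.items(), key=lambda kv: -kv[1]):
    print(f"{name:9s} {s:.2f}")   # Mary comes out on top at ~0.89

# Recommend whichever unseen show (The Wire or Twin Peaks) the most
# similar user, Mary, rated higher: Twin Peaks (2) over The Wire (1).
```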
64
Collaborative Filtering Continued
This simple cosine similarity example could be extended to extremely large datasets with hundreds or thousands of dimensions. You can also compute item-to-item similarity by treating the items as the vectors for which you're computing similarity, and the users as the dimensions. This allows for recommending similar items to a user after they've made a purchase; Amazon uses a variant of this algorithm. This is an example of item-to-item collaborative filtering.
65
Adding ROI to the Equation: an Example with Naïve Bayes
When recommending products, some may generate more margin for the firm than others. Some algorithms can take cost into account when making recommendations. Naïve Bayes is a commonly used classifier that allows for the inclusion of the marginal value of a product sale in the recommendation decision.
66
Naïve Bayes
Bayes' theorem tells us the probability of our beliefs being true given prior beliefs and evidence. Naïve Bayes is a classifier that utilizes Bayes' theorem (with simplifying assumptions) to generate a probability of an instance belonging to a class. Class likelihood can be combined with expected payoff to generate the optimal payoff from a recommendation.
67
Naïve Bayes Continued
How does the NB algorithm generate class probabilities, and how can we use the algorithmic output to maximize expected payoff? Let's say we want to figure out which of two products to recommend to a customer. Each product generates a different amount of profit for our firm per unit sold. We know the target customer's past purchasing behavior, and we know the past purchasing behavior of twelve other customers who have bought one of the two potential recommendation products. Let's represent our knowledge as a series of matrices and vectors.
68
Naïve Bayes Continued
69
NB uses (independent) probabilities of events to generate class probabilities. Using Bayes' theorem (and ignoring the scaling constant), the probability of a customer with past purchase history α (a vector of past purchases) buying item θ_j is:

P(α_1, …, α_i | θ_j) · P(θ_j)

where P(θ_j) is the frequency with which the item appears in the training data, and P(α_1, …, α_i | θ_j) is Π_i P(α_i | θ_j) over all i items in the training data. That P(α_1, …, α_i | θ_j) · P(θ_j) = Π_i P(α_i | θ_j) · P(θ_j) depends upon the assumption of conditional independence between past purchases.
70
Naïve Bayes Continued In our example, we can calculate the following probabilities:
71
Naïve Bayes Continued
Now that we can calculate P(α_1, …, α_i | θ_j) · P(θ_j) for all instances, let's figure out the most likely boat purchase for Eric:

            P(θ)   Toys        Games  Candy     Books         Boat
Eric               Squirt Gun  Life   Snickers  Harry Potter  ?
Sailboat    6/12   3/12        2/12   2/12      3/12          0.00086806
Speedboat   6/12   1/12        2/12   1/12      1/12          0.00004823

These probabilities may seem very low, but recall that we left out the scaling constant in Bayes' theorem, since we're only interested in the relative probabilities of the two outcomes.
72
Naïve Bayes Continued
So it seems like the sailboat is a slam dunk to recommend: it's much more likely (18 times!) for Eric to buy than the speedboat. But let's consider a scenario: say our hypothetical firm generates $20 of profit whenever a customer buys a speedboat, but only $1 when they buy a sailboat (outboard motors are apparently very high margin). In that case, it would make more sense to recommend the speedboat, because our expected payoff from the speedboat recommendation would be 11% greater ($20/$1 × 0.0000048/0.00087) than our expected payoff from the sailboat recommendation. This logic can be applied to any number of products by multiplying the set of purchase probabilities by the set of purchase payoffs and taking the maximum value as the recommended item.
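As a rough illustration of the arithmetic above, here is a small Python sketch that reproduces the unnormalized posteriors from the table and then folds in the $1/$20 margins; the dictionary layout and function names are invented for this example and are not from the original deck:

```python
# Per-class priors and conditional probabilities P(item | boat) from the
# table above (12 training customers: 6 sailboat buyers, 6 speedboat buyers).
models = {
    "sailboat":  {"prior": 6/12,
                  "likelihoods": {"Squirt Gun": 3/12, "Life": 2/12,
                                  "Snickers": 2/12, "Harry Potter": 3/12}},
    "speedboat": {"prior": 6/12,
                  "likelihoods": {"Squirt Gun": 1/12, "Life": 2/12,
                                  "Snickers": 1/12, "Harry Potter": 1/12}},
}
profit = {"sailboat": 1.0, "speedboat": 20.0}   # margin per unit sold

erics_history = ["Squirt Gun", "Life", "Snickers", "Harry Potter"]

def unnormalized_posterior(model, history):
    """Naive Bayes score P(alpha_1..alpha_i | theta) * P(theta),
    under the conditional-independence assumption."""
    p = model["prior"]
    for item in history:
        p *= model["likelihoods"][item]
    return p

scores = {boat: unnormalized_posterior(m, erics_history)
          for boat, m in models.items()}
print(scores)                          # ~0.000868 vs ~0.0000482

# Recommend by expected payoff rather than raw probability:
payoffs = {boat: scores[boat] * profit[boat] for boat in scores}
print(max(payoffs, key=payoffs.get))   # speedboat: 20 * 0.0000482 > 1 * 0.000868
```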
73
Challenges
While recommendation algorithms are in many cases relatively simple as machine learning goes, there are a couple of difficult problems that all recommenders must deal with:
Cold start problem: how do you make recommendations to someone for whom you have very little or no data?
Data sparsity: with millions of items for sale, most customers have bought very few individual items.
Grey and black sheep problem: some people have very idiosyncratic taste, and making recommendations to them is extremely difficult because they don't behave like other customers.
74
Dealing With Cold Start
Typically only a problem in the very early stages of a user-system interaction. Requiring creation of a profile for new users can mitigate the problem to a certain extent, by making early recommendations contingent upon supplied personal data. A recommender system can also start out using item-item recommendations based upon the first items a user buys, and gradually change over to a person-person system as the system learns the user's taste.
75
Dealing With Data Sparsity
Data sparsity can be dealt with primarily by two methods: data imputation and latent factor methods. Data imputation typically uses an algorithm like cosine similarity to impute the rating of an item based upon the ratings of similar users. Latent factor methods typically use some sort of matrix decomposition to reduce the rank of the large, sparse matrix while simultaneously adding ratings for unrated items based upon latent factors.
76
Dealing With Data Sparsity Techniques like principal components analysis/singular value decomposition allow for the creation of low rank approximations to sparse matrices with relatively little loss of information
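A minimal sketch of this idea, assuming numpy and reusing the TV-show rating matrix from the earlier example; the choice of rank 2 is arbitrary and purely illustrative:

```python
import numpy as np

# User x show rating matrix from the cosine-similarity example
# (0 standing in for "unrated").
R = np.array([
    [5, 2, 1, 4, 3, 2, 5],
    [4, 4, 2, 1, 3, 1, 2],
    [1, 1, 5, 2, 5, 2, 3],
    [3, 4, 3, 5, 5, 4, 3],
    [5, 2, 4, 2, 4, 1, 0],
    [0, 5, 0, 4, 4, 1, 4],
    [5, 5, 0, 0, 4, 2, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2                                   # keep the two largest singular values
R_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The rank-k reconstruction fills every cell, so previously unrated (zero)
# entries now carry estimates driven by the latent factors.
print(np.round(R_approx, 1))
```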
77
Dealing With Sheep of Varying Darkness
To a large extent, these cases are unavoidable. Feedback on recommended items post purchase, as well as the purchase rate of recommended items, can be used to learn even very idiosyncratic preferences, but this takes longer than for a typical user. Grey and black sheep are doubly troublesome because their odd tendencies can also weaken your engine's ability to make recommendations to the broad population of white sheep.
78
References A good survey of recommendation techniques Matrix factorization for use in recommenders Article on the BellKor solution to the Netflix challenge Article on Amazon's recommendation engine
80
There are a lot of places to fit sqrrl, and Accumulo.
82
Accumulo Meetup Presentation
87
Why Cell-Level Security Is Important: Many databases insufficiently implement security through row- and column-level restrictions. Column-level security is only sufficient when the data schema is static, well known, and aligned with security concerns. Row-level security breaks down when a single record conveys multiple levels of information. The flexible, fine-grained cell-level security within Sqrrl Enterprise (and the underlying Accumulo) supports flexible schemas, new indexing patterns, and greater analytic adaptability at scale.
88
An Accumulo key is a 5-tuple, consisting of:
Row: controls atomicity
Column Family: controls locality
Column Qualifier: controls uniqueness
Visibility Label: controls access
Timestamp: controls versioning
Keys are sorted:
Hierarchically: row first, then column family, and so on
Lexicographically: compare first byte, then second, and so on
(Values are byte arrays)
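The following toy Python snippet is not the Accumulo client API; it only illustrates how such 5-tuple keys sort, hierarchically by field and byte-wise within a field, with timestamps ordered newest-first (the sample rows and labels are made up):

```python
# Illustrative only: 5-tuple keys as (row, column family, column qualifier,
# visibility label, timestamp), all byte strings except the timestamp.
keys = [
    (b"user_002", b"purchases", b"2013-05-01", b"admin",  1368000000),
    (b"user_002", b"profile",   b"email",      b"public", 1368000000),
    (b"user_001", b"profile",   b"email",      b"public", 1368100000),
    (b"user_001", b"profile",   b"email",      b"public", 1368000000),
]

def sort_key(k):
    row, fam, qual, vis, ts = k
    # Hierarchical, lexicographic ordering; negate the timestamp so the
    # newest version of a cell comes first.
    return (row, fam, qual, vis, -ts)

for row, fam, qual, vis, ts in sorted(keys, key=sort_key):
    print(row.decode(), fam.decode(), qual.decode(), f"[{vis.decode()}]", ts)
```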
89
An example of column usage
90
Accumulo tablet servers utilize a multitude of big data technologies, but their layout is different from Map/Reduce, HDFS, MongoDB, Cassandra, etc. used alone:
Data is stored in HDFS
ZooKeeper is utilized for configuration management
Password-less SSH for node configuration
An emphasis, more of an imperative, on data model and data model design
91
Tablets
Partitions of tables: collections of sorted key/value pairs
Held and managed by Tablet Servers
92
Tablet Servers
Receive writes and respond to reads from clients
Write to a write-ahead log, sorting new key/value pairs in memory, while periodically flushing sorted key/value pairs to new files in HDFS
Master
Responsible for detecting and responding to Tablet Server failure, and for load balancing
Coordinates startup, graceful shutdown, and recovery of write-ahead logs
ZooKeeper
An Apache project, open source
Utilized as a distributed locking mechanism with no single point of failure
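A toy sketch of the write path just described, assuming nothing beyond the Python standard library; the class and variable names are invented for illustration and this is not Accumulo code:

```python
import bisect

class ToyTabletServer:
    """Toy model of the write path: append to a write-ahead log, keep new
    key/value pairs sorted in memory, and periodically flush the sorted
    buffer to a new immutable file."""

    def __init__(self, flush_threshold=3):
        self.wal = []          # stands in for the write-ahead log on disk
        self.memtable = []     # sorted (key, value) pairs in memory
        self.files = []        # stands in for sorted files in HDFS
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.wal.append((key, value))               # durability first
        bisect.insort(self.memtable, (key, value))  # keep memory sorted
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.files.append(list(self.memtable))      # new sorted file
        self.memtable.clear()
        self.wal.clear()                            # log no longer needed

srv = ToyTabletServer()
for k, v in [("c", 1), ("a", 2), ("b", 3), ("d", 4)]:
    srv.write(k, v)
print(srv.files, srv.memtable)
```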
93
1. Gather an organization's information security policies and dissect them into data-centric and user-centric components.
2. As data is ingested into Accumulo, a data labeler tags individual key/value pairs with the appropriate data-centric visibility labels based on these policies.
3. Data is then stored in Accumulo, where it is available for real-time queries by operational applications. End users are authenticated through these applications and authorized to access the underlying data.
4. As an end user performs an operation via the app (e.g., a search request), the visibility label on each candidate key/value pair is checked against his or her attributes, and only the data that he or she is authorized to see is returned.
The visibility labels are a feature that is unique to Accumulo. No other database can apply access controls at such a fine-grained level. Labels are generated by translating an organization's existing data security and information sharing policies into Boolean expressions.
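To make the idea of Boolean visibility expressions concrete, here is a small Python sketch of how a label such as "(analyst&us_citizen)|admin" might be checked against a user's authorizations; it is illustrative only and is not Accumulo's actual label parser:

```python
import re

def visible(expression, authorizations):
    """Toy check of an Accumulo-style visibility expression against a
    user's set of authorizations. Illustrative only."""
    def token_to_bool(match):
        return str(match.group(0) in authorizations)
    # Replace each label token with True/False, then map &,| onto and,or.
    expr = re.sub(r"[A-Za-z_][A-Za-z0-9_]*", token_to_bool, expression)
    expr = expr.replace("&", " and ").replace("|", " or ")
    return eval(expr)   # acceptable for a toy example with trusted input

cell_label = "(analyst&us_citizen)|admin"
print(visible(cell_label, {"analyst", "us_citizen"}))  # True
print(visible(cell_label, {"analyst"}))                # False
print(visible(cell_label, {"admin"}))                  # True
```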
94
sqrrl's extensions to Accumulo allow it to process millions of records per second, as either static or streaming objects. These records are converted into hierarchical JSON documents, giving document-store capabilities. Passing this data to the analytics layer is designed to make integration and development of real-time analytics possible and accessible. Combining cell-level access control in Accumulo, sqrrl integrates with Identity and Access Management (IAM) systems (LDAP, RADIUS, etc.).
95
Sqrrl process
Data Ingest: JSON or graph format
HDFS: file storage system, compatible with both open source (OSS) and commercial versions
Apache Accumulo: the core of transactional and online analytical data processing in sqrrl
Apache Thrift: enables development in diverse language choices
Apache Lucene: custom iterators, providing developers with real-time capabilities, such as full-text search, graph analysis, and statistics, for analytical applications and dashboards
96
CTOs/CIOs: unlock the value in fractured and unstructured datasets across your organization
Developers: more easily create apps on top of Big Data and distributed databases
Infrastructure Managers: simplify administration of Big Data through highly scalable and multitenant distributed systems
Data Analysts: dig deeper into your data using advanced analytical techniques, such as graph analysis
Business Users: use Big Data seamlessly via apps developed on top of sqrrl enterprise
97
Accumulo bridges the gap for the security requirements that restrict a large swath of industries.
Accumulo Setup:
1. HDFS and ZooKeeper must be installed and configured
2. Password-less SSH should be configured between all nodes (especially master <-> tablet servers)
3. Install Accumulo from http://accumulo.apache.org/downloads/ following http://accumulo.apache.org/1.4/user_manual/Administration.html#Installation
Or get started using their AMI (http://www.sqrrl.com/downloads#getting-started)
sqrrl combines the best of available technologies, develops and contributes their own, and designs big apps for big data.