ITERATIVE FILE-BASED ITEM:ITEM SIMILARITY COMPUTATION ● Will Holcomb – Vanderbilt University ● Project Aura Intern

Presentation transcript:

1 ITERATIVE FILE-BASED ITEM:ITEM SIMILARITY COMPUTATION
● Will Holcomb – Vanderbilt University
● Project Aura Intern

Sun Confidential: Internal Only
2 Presentation Overview
1. Recommender systems overview
2. The shape of the tail
3. The role of Project Aura
4. Item:item similarity
5. Programming for Project Caroline
6. Reworking item:item in terms of tuples
7. Parallelizability and computability
8. Computation in Project Aura

3 Recommender Systems
Exploiting The Long Tail
[Figure: The Theoretical Tail]

4 The Actual Tail (More Or Less)
Crawl of Last.fm Top 50 Artists for 11,985 Users
21,858 Total Artists
598,168 Artist/User Pairs
83,668,000 Listens

5 Collaborative Filtering in Project Aura
More Aura details coming later
Collaborative filtering is about adding the hybrid to the hybrid recommender system
Main concerns for filtering algorithms:
> Stability – How much can a recommendation change?
> Computability – How long does it take to find the answer?

6 Item:Item Collaborative Filtering
Users are dimensions
Items are vectors
Similarity is the cosine of the angle between two item vectors
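The slide's model can be sketched in a few lines. This is a minimal illustration, not the Project Aura implementation: the function names `item_vectors` and `cosine`, the `(user, item, count)` record layout, and the sample listen counts are all assumptions made for the example.

```python
import math
from collections import defaultdict

def item_vectors(listens):
    """Build one sparse vector per item; each user is a dimension.

    `listens` is an iterable of (user, item, count) tuples (assumed layout).
    """
    vectors = defaultdict(dict)
    for user, item, count in listens:
        vectors[item][user] = vectors[item].get(user, 0) + count
    return dict(vectors)

def cosine(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b[user] for user, weight in a.items() if user in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an item nobody listened to is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical data: two users, two artists.
listens = [("u1", "A", 3), ("u1", "B", 4), ("u2", "A", 4), ("u2", "B", 3)]
vectors = item_vectors(listens)
print(cosine(vectors["A"], vectors["B"]))
```

Only users who listened to both items contribute to the dot product, which is why a sparse dictionary per item is enough.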

7 Project Caroline
Designed for internet applications
Utility-style pricing – pay for what you use
Multiple processes distributed across multiple machines
Shared file storage
No shared memory

8 The Aura Datastore
Requests funneled through Data Store Head
Subtrees distributed to Partition Clusters running in separate processes
> Process coordination using Jini

9 Cosine Generation Overview

10 Join Methods
Composition
> For a single Record Set, perform an operation on a list of all records with a given key
> (Artist, )
Cartesian Join
> For n input Record Sets, permute all pairs of records with matching keys
> (Artist A, Length A) × (Artist B, Length B) = (Artist A.Artist B, Length A * Length B)
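The two join methods can be sketched as plain functions over in-memory records. This is an illustration under stated assumptions rather than the deck's code: the names `compose` and `cartesian_join`, the `(join_key, name, value)` record layout, and the choice of the user as the join key are all hypothetical.

```python
from collections import defaultdict

def compose(records, op):
    """Composition: group (key, value) records by key, then apply `op`
    to the list of all values that share a key."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {key: op(values) for key, values in groups.items()}

def cartesian_join(left, right):
    """Cartesian join: pair every left record with every right record
    that has the same join key.

    Records are (join_key, name, value) tuples (assumed layout). Each
    output record is ("nameA.nameB", valueA * valueB).
    """
    by_key = defaultdict(list)
    for key, name, value in right:
        by_key[key].append((name, value))
    joined = []
    for key, name_a, value_a in left:
        for name_b, value_b in by_key.get(key, []):
            joined.append((f"{name_a}.{name_b}", value_a * value_b))
    return joined

# Hypothetical listens keyed by user: (user, artist, count).
listens = [("u1", "A", 3), ("u1", "B", 4), ("u2", "A", 4), ("u2", "B", 3)]
pairs = cartesian_join(listens, listens)  # per-user products
dots = compose(pairs, sum)                # summed per artist pair
print(dots["A.B"])
```

Joining the listen records with themselves on the user and then composing with `sum` gives the dot product for each artist pair, the numerator of a cosine.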

11 Partitioning
Collect all matching keys in a single file
Run in m processes for m output files
Each processor puts records in a set of shared files as determined by a common hashing scheme
File locking necessary to prevent concurrent access
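A single-process sketch of the hashing step follows. It assumes tab-separated `(key, value)` records and hypothetical file names of the form `part-N.tsv`, and it omits the file locking that the slide says is needed once several processes append to the same files.

```python
import os
import tempfile
import zlib

def partition_of(key, m):
    """Map a key to one of m partitions with a hash that is stable across
    processes (Python's built-in hash() of a str varies between runs)."""
    return zlib.crc32(key.encode("utf-8")) % m

def partition(records, m, directory):
    """Append each (key, value) record to the file chosen by its key's hash,
    so all records with the same key end up in a single file."""
    paths = [os.path.join(directory, f"part-{i}.tsv") for i in range(m)]
    files = [open(path, "a", encoding="utf-8") for path in paths]
    try:
        for key, value in records:
            files[partition_of(key, m)].write(f"{key}\t{value}\n")
    finally:
        for handle in files:
            handle.close()
    return paths

# Hypothetical records; every "A" record lands in the same partition file.
with tempfile.TemporaryDirectory() as directory:
    paths = partition([("A", 3), ("B", 4), ("A", 4)], 3, directory)
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            print(os.path.basename(path), handle.read().split())
```

A process-independent hash such as CRC32 matters here because every process must agree on which file a key belongs to.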

12 Cosine Generation As Tuples

13 Computational Complexity
Optimizations
Exploit symmetry in output files to only do n! joins
Exploit symmetry in records to only do (.5)n joins
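The idea behind the symmetry optimizations can be illustrated with a short sketch. It assumes only that similarity is symmetric, so the join of partition i with partition j also gives the result for j with i. The function names are hypothetical, and the join counts it prints are those of this sketch, not the figures quoted on the slide.

```python
from itertools import combinations_with_replacement

def symmetric_pairs(n):
    """List each unordered pair (i, j) with i <= j exactly once."""
    return list(combinations_with_replacement(range(n), 2))

def mirror(results):
    """Copy each (i, j) result to (j, i) so lookups work in either order."""
    full = dict(results)
    for (i, j), value in results.items():
        full[(j, i)] = value
    return full

n = 4  # hypothetical number of partition files
pairs = symmetric_pairs(n)
print(len(pairs), "joins instead of", n * n)
```

With four partitions the sketch performs 10 joins rather than 16, and `mirror` restores the other half of the results afterwards.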

14 Any Questions?
● Will Holcomb
● hoenir.himinbi.org