Efficient Computation of Personal Aggregate Queries on Blogs Ka Cheung Sia 1 Junghoo Cho 1 Yun Chi 2 Belle L. Tseng 3 1 University of California, Los Angeles.

Presentation transcript:

Efficient Computation of Personal Aggregate Queries on Blogs Ka Cheung Sia 1 Junghoo Cho 1 Yun Chi 2 Belle L. Tseng 3 1 University of California, Los Angeles 2 NEC Labs America 3 Yahoo! Inc. ACM SIGKDD 2008

2 Motivation
- User-generated content in the blogosphere and Web 2.0 services contains rich information about recent events.
- Aggregating individual user opinions shows current popular trends.

3 Motivation
Global aggregation
- Recent news is picked up automatically: "Dark Knight" in the week of July 18; "Olympics"-related items in the week of August 8.
- Potential drawbacks: What if I am not interested in sports at all? Groups of bloggers collaborate to promote advertisement videos.
Personal aggregation
- Users selectively aggregate from different sources.
- An efficient strategy is needed to handle a large number of users and sources.

4 From Global to Personal Aggregation
Items (phrases): Dark Knight, Olympics, Michael Phelps, SIGKDD, Las Vegas
Bloggers' posts:
- "Dark Knight is great, more entertaining than watching Olympics and shows in Las Vegas!"
- "Um.. it will be good if there is a free show of Dark Knight in SIGKDD"
- "Michael Phelps performance in Olympics is awesome..."
- "Finished watching Michael Phelps in Olympics, got to attend SIGKDD now..."

5 Matrix formulation
Endorsement matrix (E)
- E(b_j, o_k): how much a blogger endorses an object
- Objects can be phrases or URLs

E       o1  o2  o3
b1       3   2   0
b2       0   3   0
b3       1   0   1
b4       1   2   3
Total    5   7   4

Trust matrix (T)
- T(u_i, b_j): how much a user trusts a blogger
- For example, whether a user reads the blog, or how often he reads it
- T has rows u1, u2, u3 and columns b1, b2, b3, b4 (cell values not preserved in the transcript)

6 Personal aggregation
The PersonalizedEndorsement score is the sum of endorsement scores weighted by a user's trust vector.
Tables: Endorsement (blog_id, item, score); Trust (user_id, blog_id, score)
Personal aggregate query as SQL (Q1):
SELECT e.item, sum(t.score*e.score) AS score FROM Endorsement e, Trust t WHERE e.blog_id = t.blog_id AND t.user_id = <user_id> GROUP BY e.item ORDER BY score DESC LIMIT <K>
Figure: the product TE, with rows u1, u2, u3 and columns o1, o2, o3 (cell values not preserved in the transcript)
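
As a rough illustration of what query Q1 computes, the sketch below evaluates the weighted sum in plain Python. The endorsement values follow the slide 5 example as far as it can be read from the transcript; the trust vector `trust_u1` and the function name `personal_aggregate` are assumptions made for this example, since the slide's trust values did not survive.

```python
# Endorsement matrix E: rows are bloggers b1..b4, columns are objects o1..o3.
# Zero entries are simply omitted.
E = {
    "b1": {"o1": 3, "o2": 2},
    "b2": {"o2": 3},
    "b3": {"o1": 1, "o3": 1},
    "b4": {"o1": 1, "o2": 2, "o3": 3},
}

# Hypothetical trust vector for one user (assumption, not from the slides).
trust_u1 = {"b1": 1.0, "b4": 0.5}


def personal_aggregate(trust, endorsement, k):
    """Return the top-k (item, score) pairs for one user's trust vector."""
    scores = {}
    for blog, t in trust.items():
        for item, e in endorsement.get(blog, {}).items():
            # Same weighted sum as sum(t.score * e.score) grouped by item.
            scores[item] = scores.get(item, 0.0) + t * e
    # Highest score first; ties broken by item name for a deterministic order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


top = personal_aggregate(trust_u1, E, k=2)
print(top)
```

This is the on-the-fly computation: every query touches all endorsement rows of every blog the user trusts.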

7 Two baseline approaches
OTF
- Maintain the two tables and compute the weighted sum on the fly for each personal aggregate query
- High query cost
VIEW
- Pre-compute the results for every user and store them as views
- High update cost

8 Best of both worlds
- Identify "template" users: typical users interested in sports / politics / technology / ...
- Results of template users are pre-computed
- Results of individual users are combined from these partially computed results

9 Trust matrix decomposition
- The trust matrix reflects users' interests
- Decompose T into two sub-matrices W and H using Non-negative Matrix Factorization (NMF)
- W: relationship between users and template users
- H: relationship between template users and blogs
- Example: user 2's trust vector is expressed as a linear combination of the trust vectors of template users 1 and 2
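
The slides do not say which NMF solver was used. The sketch below uses the standard Lee-Seung multiplicative updates for the squared-error objective, written in plain Python so it is self-contained. The toy matrix `T_toy`, the rank of 2, and all function names are assumptions made for illustration only.

```python
import random


def matmul(A, B):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(T, W, H):
    """Squared Frobenius norm of T - W*H."""
    WH = matmul(W, H)
    return sum((t - p) ** 2 for tr, pr in zip(T, WH) for t, p in zip(tr, pr))


def nmf(T, rank, iters=200, seed=0, eps=1e-9):
    """Factor T (users x blogs) into W (users x rank) and H (rank x blogs)."""
    rng = random.Random(seed)
    n, m = len(T), len(T[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T T) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, T), matmul(matmul(Wt, W), H)
        H = [[h * a / (d + eps) for h, a, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (T H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(T, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (d + eps) for w, a, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H


# Hypothetical 0/1 subscription matrix: 4 users x 4 blogs.
T_toy = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
]
W, H = nmf(T_toy, rank=2)
print(frobenius_error(T_toy, W, H))
```

Each row of `H` plays the role of a template user's trust vector, and each row of `W` gives the weights that combine those templates into one real user's trust vector.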

10 Reconstruction of results
- PersonalizedEndorsement scores of template users are precomputed; results of individual users are computed on request
- (HE) is maintained as sorted lists, one per template user
- W * (HE) is the personal aggregation result: a weighted linear combination of the sorted lists
- The top-K list is computed using the Threshold Algorithm
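
The reason the precomputed lists can be reused follows from associativity of the matrix product. A short derivation, using the slides' symbols plus two assumed ones (r for the number of template users and g for a template-user index):

```latex
\[
T \approx W H
\quad\Longrightarrow\quad
T E \approx (W H) E = W (H E),
\]
\[
\mathrm{PersonalizedEndorsement}(u_i, o_k)
\approx \sum_{g=1}^{r} W(u_i, g)\,(HE)(g, o_k).
\]
```

Because the entries of W are non-negative, the sum is monotone in each template score, which is the condition the Threshold Algorithm needs.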

11 Partition of trust matrix
- Decomposition is useful when the matrix is dense
- Real-life data is skewed
- Hybrid method: use decomposition only where it is effective
Figure annotations: users ordered by number of subscriptions, blogs ordered by number of subscribers; users with >30 subscriptions; feeds with >30 subscribers; 10k feeds, 24k users; ~1M subscription pairs; 2.7M subscription pairs; regions labelled 1. OTF, 2. VIEW, 3. NMF
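
One way to pick out the dense block is to count degrees once and keep only the pairs whose user and feed both exceed the threshold. This is a minimal sketch under that assumption; the function name `dense_region` is hypothetical, and the slide does not specify how the remaining pairs are divided between OTF and VIEW, so the sketch leaves them as a single remainder.

```python
def dense_region(subscriptions, min_degree=30):
    """Split (user, feed) pairs into a dense block and the remainder.

    A pair is dense when its user has more than min_degree subscriptions
    and its feed has more than min_degree subscribers, with degrees counted
    over the full input (a single pass, not an iterated pruning).
    """
    user_deg, feed_deg = {}, {}
    for user, feed in subscriptions:
        user_deg[user] = user_deg.get(user, 0) + 1
        feed_deg[feed] = feed_deg.get(feed, 0) + 1
    dense, rest = [], []
    for user, feed in subscriptions:
        if user_deg[user] > min_degree and feed_deg[feed] > min_degree:
            dense.append((user, feed))
        else:
            rest.append((user, feed))
    return dense, rest


# Tiny hypothetical example with a threshold of 1 instead of 30.
pairs = [("u1", "f1"), ("u1", "f2"), ("u2", "f1"), ("u2", "f2"), ("u3", "f1")]
dense, rest = dense_region(pairs, min_degree=1)
print(dense, rest)
```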

12 Experiments
Bloglines.com: online RSS reader
Trust matrix T (1-0 version): subscription profile
- 91,366 users
- 487,694 RSS feeds
Endorsement matrix E: blog-keyword occurrence
- Feed content collected between Nov 2006 and Jul 2007
- Keywords restricted to nouns with high tf-idf values in entries
Platform
- Python implementation of the proposed scheme
- MySQL server on Linux with data on a RAID disk

13 How different is personalization?
Week of 2007 Jan 7 – 2007 Jan 13; major event: iPhone released
Personal aggregation results differ from global aggregation

User 91017   User 90550   User 90439   Global
iran         google       quarter      phone
saddam       cathartik    prices       business
troops       video        companies    software
dept         kibbutz      apple        development
avenue       argentina    bush         management
views        vegas        iraq
president    search       chicago      manager
bush         reuters      iphone       apple
iraq         iguazu       beef         iphone
yorker       brazil       cattle       sales

14 How different is personalization?
Overlap comparison of global aggregation and personal aggregation
- L_G: global top 20 items; L_i: individual top 20 items of user i
- Personal aggregation results also differ among users
Figure panels: overlap degree with the global aggregation result; pair-wise overlap among users
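
The exact overlap formula did not survive in the transcript. A plausible reading, assumed here, is the fraction of one top list that also appears in the other, that is |L_G ∩ L_i| / |L_G|. The function name and the short example lists are hypothetical.

```python
def overlap_degree(list_a, list_b):
    """Fraction of distinct items in list_a that also appear in list_b."""
    items_a = set(list_a)
    if not items_a:
        return 0.0
    return len(items_a & set(list_b)) / len(items_a)


# Shortened example lists (the slides use top-20 lists).
global_top = ["phone", "business", "apple", "iphone"]
user_top = ["quarter", "prices", "apple", "iphone"]
print(overlap_degree(global_top, user_top))
```

The same function covers the pair-wise comparison among users by passing two users' top lists.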

15 Approximation accuracy
Dense region of the subscription matrix
- Feeds with >30 subscribers
- Users with >30 subscriptions
L2 norm comparison
- Sparsity of W (23%), H (13%)
- The NMF approximation is close to SVD, with the advantage of sparseness
Table columns: Rank, SVD, NMF (values not preserved in the transcript)

16 Approximation accuracy
How many items in the top 20 list are approximated by NMF?
- T_i: top 20 items of user i computed by OTF; A_i: top 20 items of user i computed by NMF
- About 70% of items are approximated, with higher accuracy for higher-ranked items
Figure: correlation with rank

17 Efficiency of proposed method
Update cost
- OTF (222K) < NMF (3.2M) < VIEW (23.6M)
Query response time
- Averaged over the 1000 users with the highest number of subscriptions
- OTF: execute SQL query Q1 on the MySQL server
- NMF: Python implementation of the Threshold Algorithm that interfaces with the MySQL server to load the NMF template users' tables
- Average query response time reduced by 75%; outliers with significant delay eliminated

Method  avg     std     max      min
OTF     2.05s   3.60s   84.42s   0.037s
NMF     0.46s   0.53s   2.84s    0.007s

18 Conclusion and future work
Conclusion
- Deliver tailored results to users through personal aggregation
- Proposed a model for personal aggregate queries, optimized with NMF and the Threshold Algorithm
- A study on a real-life dataset shows that query response time can be reduced significantly with acceptable approximation accuracy
Future work
- Handle updates to the trust matrix
- Parallelism
- Better phrase extraction (e.g. opinion orientation)

19 Thank you! Q and A

20 Threshold algorithm
- Proposed by Fagin et al. [2001]
- Efficient computation of the top-K items from multiple lists with a monotone aggregate function
Figure labels: users, blogs, user groups
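
A minimal sketch of the idea, specialised to a weighted sum, is given below. It assumes non-negative scores and weights, treats an item missing from a list as having score 0, and uses in-memory dictionaries for random access. The function name, the sample lists, and the weights are all hypothetical; this is not the authors' implementation.

```python
import heapq


def threshold_topk(lists, weights, k):
    """Top-k items under a weighted sum, in the style of the Threshold Algorithm.

    lists[g] holds (item, score) pairs sorted by score descending;
    weights[g] is the non-negative weight of list g.
    """
    lookup = [dict(lst) for lst in lists]  # random access into each list

    def total(item):
        return sum(w * d.get(item, 0.0) for w, d in zip(weights, lookup))

    seen = set()
    best = []  # min-heap of (score, item), kept at size <= k
    max_depth = max((len(lst) for lst in lists), default=0)
    for depth in range(max_depth):
        threshold = 0.0
        for w, lst in zip(weights, lists):
            if depth < len(lst):
                item, score = lst[depth]  # sorted access
                threshold += w * score
                if item not in seen:
                    seen.add(item)
                    heapq.heappush(best, (total(item), item))
                    if len(best) > k:
                        heapq.heappop(best)
        # No unseen item can score above the threshold, so stop early
        # once the k-th best score reaches it.
        if len(best) == k and best[0][0] >= threshold:
            break
    return [(item, score) for score, item in sorted(best, key=lambda p: (-p[0], p[1]))]


# Two hypothetical template-user lists (rows of HE) and one user's weights (a row of W).
HE = [
    [("olympics", 5.0), ("phelps", 4.0), ("dark knight", 1.0)],
    [("dark knight", 6.0), ("sigkdd", 2.0), ("olympics", 1.0)],
]
result = threshold_topk(HE, weights=[0.5, 0.5], k=2)
print(result)
```

In this example the scan stops after the second position of each list, without reading the third entries.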

21 Illustration of matrix partition
Figure: matrix ordered toward feeds with more subscribers and users with more subscriptions; example labels: 2 subscriptions, 8 subscriptions, 2 subscribers, 9 subscribers