Efficient Computation of Personal Aggregate Queries on Blogs Ka Cheung Sia 1 Junghoo Cho 1 Yun Chi 2 Belle L. Tseng 3 1 University of California, Los Angeles.

Efficient Computation of Personal Aggregate Queries on Blogs
Ka Cheung Sia (1), Junghoo Cho (1), Yun Chi (2), Belle L. Tseng (3)
(1) University of California, Los Angeles; (2) NEC Labs America; (3) Yahoo! Inc.
ACM SIGKDD 2008

2 Motivation
- User-generated content in the blogosphere and Web 2.0 services contains rich information about recent events
- Aggregating individual user opinions shows current popular trends

3 Motivation
Global aggregation
- Recent news is picked up automatically
  - "Dark Knight" in the week of July 18
  - "Olympics"-related items in the week of August 8
- Potential drawbacks
  - What if I am not interested in sports at all?
  - Groups of bloggers can collaborate to promote advertisement videos
Personal aggregation
- Users selectively aggregate from different sources
- Needs an efficient strategy to handle a large number of users and sources

4 From Global to Personal Aggregation
[Figure: bloggers and the items (phrases) they mention: Dark Knight, Olympics, Michael Phelps, SIGKDD, Las Vegas]
Example blog entries:
- "Dark Knight is great, more entertaining than watching Olympics and shows in Las Vegas!"
- "Um.. it will be good if there is a free show of Dark Knight in SIGKDD"
- "Michael Phelps performance in Olympics is awesome..."
- "Finished watching Michael Phelps in Olympics, got to attend SIGKDD now..."

5 Matrix formulation
Endorsement matrix (E)
- E(b_j, o_k): how much a blogger endorses an object
- Objects can be phrases or URLs

  E       o1   o2   o3
  b1       3    2    0
  b2       0    3    0
  b3       1    0    1
  b4       1    2    3
  Total    5    7    4

Trust matrix (T)
- T(u_i, b_j): how much a user trusts a blogger
- For example, whether a user reads the blog, or how often he reads it

  T      b1    b2    b3    b4
  u1    0.8   0.8     0     0
  u2    0.2   0.2   0.6   0.6
  u3      0     0   0.5   0.5
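A minimal sketch of how the two matrices might be held in memory, assuming a sparse dictionary layout (the layout and the function name `global_aggregate` are illustrative, not taken from the slides). The toy values follow the slide's example as far as it can be read from this transcript, and the column totals of E give the global aggregation.

```python
# Endorsement matrix E as a sparse dict: (blogger, object) -> endorsement score.
E = {
    ("b1", "o1"): 3, ("b1", "o2"): 2,
    ("b2", "o2"): 3,
    ("b3", "o1"): 1, ("b3", "o3"): 1,
    ("b4", "o1"): 1, ("b4", "o2"): 2, ("b4", "o3"): 3,
}

# Trust matrix T as a sparse dict: (user, blogger) -> trust score.
T = {
    ("u1", "b1"): 0.8, ("u1", "b2"): 0.8,
    ("u2", "b1"): 0.2, ("u2", "b2"): 0.2, ("u2", "b3"): 0.6, ("u2", "b4"): 0.6,
    ("u3", "b3"): 0.5, ("u3", "b4"): 0.5,
}


def global_aggregate(endorsements):
    """Sum each object's endorsement over all bloggers (the 'Total' row)."""
    totals = {}
    for (_blogger, obj), score in endorsements.items():
        totals[obj] = totals.get(obj, 0) + score
    return totals


global_totals = global_aggregate(E)
```

Missing keys stand for zero entries, which keeps the storage proportional to the number of non-zero endorsements and subscriptions.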

6 Personal aggregation
The PersonalizedEndorsement score is the sum of endorsement scores weighted by a user's trust vector.
Relations:
  Endorsement (blog_id, item, score)
  Trust (user_id, blog_id, score)
Personal Aggregate Query as SQL (Q1):
  SELECT e.item, sum(t.score*e.score) AS score
  FROM Endorsement e, Trust t
  WHERE e.blog_id = t.blog_id AND t.user_id = <user_id>
  GROUP BY e.item
  ORDER BY score DESC LIMIT 20

  TE     o1    o2    o3
  u1    2.4   4.0   0.0
  u2    1.8   2.2   2.4
  u3    1.0   1.0   2.0
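The query Q1 can be tried out with a small sketch like the following. It assumes an in-memory SQLite database in place of the MySQL server used in the experiments, selects the item from the Endorsement relation (the only relation that has an `item` column), and renames the result alias to `agg_score` to avoid clashing with the `score` columns. The helper name `personal_top_k` and the toy rows are illustrative.

```python
import sqlite3

# Q1 with the user id and the list length passed as parameters.
Q1 = """
    SELECT e.item, SUM(t.score * e.score) AS agg_score
    FROM Endorsement e, Trust t
    WHERE e.blog_id = t.blog_id AND t.user_id = ?
    GROUP BY e.item
    ORDER BY agg_score DESC
    LIMIT ?
"""


def personal_top_k(conn, user_id, k=20):
    """Run the personal aggregate query for one user on the fly."""
    return conn.execute(Q1, (user_id, k)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Endorsement (blog_id TEXT, item TEXT, score REAL)")
conn.execute("CREATE TABLE Trust (user_id TEXT, blog_id TEXT, score REAL)")
conn.executemany(
    "INSERT INTO Endorsement VALUES (?, ?, ?)",
    [("b1", "o1", 3), ("b1", "o2", 2), ("b2", "o2", 3), ("b3", "o1", 1),
     ("b3", "o3", 1), ("b4", "o1", 1), ("b4", "o2", 2), ("b4", "o3", 3)],
)
conn.executemany(
    "INSERT INTO Trust VALUES (?, ?, ?)",
    [("u1", "b1", 0.8), ("u1", "b2", 0.8),
     ("u2", "b1", 0.2), ("u2", "b2", 0.2), ("u2", "b3", 0.6), ("u2", "b4", 0.6),
     ("u3", "b3", 0.5), ("u3", "b4", 0.5)],
)

u2_result = personal_top_k(conn, "u2")
```

For user u2 the items come back ordered o3, o2, o1, matching the u2 row of the TE table up to floating-point rounding.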

7 Two baseline approaches
OTF
- Maintain the two tables and compute the weighted sum on the fly for each personal aggregate query
- High query cost
VIEW
- Pre-compute the results for every user and store them as views
- High update cost
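The difference in update cost between the two baselines can be illustrated with a rough sketch. It assumes that one new endorsement arrives for a single (blog, item) pair and simply counts the rows each approach has to write; the function names and the dictionary layout are hypothetical.

```python
# Trust stored per user: user -> {blog: trust weight}.
trust = {
    "u1": {"b1": 0.8, "b2": 0.8},
    "u2": {"b1": 0.2, "b2": 0.2, "b3": 0.6, "b4": 0.6},
    "u3": {"b3": 0.5, "b4": 0.5},
}


def otf_update(endorsement, blog, item, delta):
    """OTF: only the base Endorsement table changes. Returns rows written."""
    endorsement[(blog, item)] = endorsement.get((blog, item), 0) + delta
    return 1


def view_update(views, trust, blog, item, delta):
    """VIEW: every user who trusts `blog` has a stored result to refresh."""
    writes = 0
    for user, weights in trust.items():
        weight = weights.get(blog, 0.0)
        if weight > 0:
            user_view = views.setdefault(user, {})
            user_view[item] = user_view.get(item, 0.0) + weight * delta
            writes += 1
    return writes


endorsement, views = {}, {}
otf_writes = otf_update(endorsement, "b3", "o1", 1)
view_writes = view_update(views, trust, "b3", "o1", 1)
```

With this toy trust data the OTF update writes one row, while the VIEW update touches one stored result per subscriber of b3; the gap grows with the number of subscribers.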

8 Best of both worlds
- Identify "template" users: typical users interested in sports / politics / technology / ...
- Results of template users are pre-computed
- Results of individual users are assembled from these partially computed results

9 Trust matrix decomposition
- The trust matrix reflects users' interests
- Decompose T into two sub-matrices W and H using Non-negative Matrix Factorization (NMF)
  - W : relationship
  - H : relationship
- Example: user 2's trust vector is expressed as a linear combination of the trust vectors of template users 1 and 2
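A small sketch of the decomposition, assuming the multiplicative update rules of Lee and Seung for the squared-error objective; the slides do not say which NMF algorithm was used, so this choice, the function names, and the toy matrix are illustrative only. In the toy matrix the second row equals 0.25 times the first plus 1.2 times the third, so a rank-2 factorization can represent it closely.

```python
import math
import random


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(row) for row in zip(*m)]


def nmf(T, rank, iters=1000, seed=0, eps=1e-9):
    """Approximate T (n x m) by W (n x rank) times H (rank x m), all non-negative."""
    rng = random.Random(seed)
    n, m = len(T), len(T[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T T) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, T), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (T H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(T, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return W, H


def frobenius_error(T, W, H):
    """Square root of the summed squared differences between T and W*H."""
    approx = matmul(W, H)
    return math.sqrt(sum((t - a) ** 2
                         for t_row, a_row in zip(T, approx)
                         for t, a in zip(t_row, a_row)))


T = [
    [0.8, 0.8, 0.0, 0.0],
    [0.2, 0.2, 0.6, 0.6],
    [0.0, 0.0, 0.5, 0.5],
]
W, H = nmf(T, rank=2)
error = frobenius_error(T, W, H)
```

Each row of H plays the role of a template user's trust vector, and each row of W holds the non-negative weights that combine the templates into one real user's trust vector.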

10 Reconstruction of results
- PersonalizedEndorsement scores of template users are precomputed; results of individual users are computed on request
- (HE) is maintained as sorted lists, one per template user
- W * (HE), a weighted linear combination of those lists, is the personal aggregation result
  - The top-K list is computed using the Threshold Algorithm
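The reconstruction step can be sketched as a Threshold Algorithm over the template users' sorted lists. This is a simplified in-memory version under stated assumptions: scores and weights are non-negative (which NMF guarantees for W), an item missing from a list scores zero there, and the function name `threshold_top_k` and the toy lists are illustrative.

```python
import heapq


def threshold_top_k(template_scores, weights, k):
    """Top-k items under a weighted sum of per-template scores.

    template_scores: one dict per template user, item -> score (a row of HE).
    weights: the requesting user's row of W, one non-negative weight per template.
    """
    # Sorted access: each template list in descending score order.
    sorted_lists = [sorted(scores.items(), key=lambda kv: -kv[1])
                    for scores in template_scores]
    seen = {}
    max_depth = max(len(lst) for lst in sorted_lists)
    for depth in range(max_depth):
        threshold = 0.0
        for weight, lst in zip(weights, sorted_lists):
            if depth >= len(lst):
                continue  # exhausted list: unseen items score 0 here
            item, score = lst[depth]
            threshold += weight * score
            if item not in seen:
                # Random access: fetch the item's score in every list.
                seen[item] = sum(w * s.get(item, 0.0)
                                 for w, s in zip(weights, template_scores))
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # Stop once the k-th best seen score is at least the best score
        # any unseen item could still reach.
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])


# Precomputed results of two template users (rows of HE).
HE = [
    {"o1": 2.4, "o2": 4.0, "o3": 0.0},
    {"o1": 1.0, "o2": 1.0, "o3": 2.0},
]
top2 = threshold_top_k(HE, weights=[0.25, 1.2], k=2)
```

In this toy run the algorithm stops after reading two entries from each list, returning o3 and then o2 without having to scan the lists to the end.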

11 Partition of trust matrix
- Decomposition is useful when the matrix is dense
- Real-life data is skewed
- Hybrid method: use decomposition only where it is effective
[Figure: trust matrix ordered by users with more subscriptions and blogs with more subscribers. Labels: users with >30 subscriptions, feeds with >30 subscribers; 10k feeds, 24k users, ~1M subscription pairs; 2.7M subscription pairs; regions 1. OTF, 2. VIEW, 3. NMF]
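One way to carve out the dense region is sketched below, assuming a single pass that counts degrees over the full subscription list and applies the two thresholds once. The slides do not spell out the exact procedure or how the remaining regions are routed to OTF and VIEW, so the function name and this one-pass rule are assumptions.

```python
from collections import Counter


def dense_region(pairs, min_subscriptions=30, min_subscribers=30):
    """Select users and feeds whose degree exceeds the given thresholds.

    pairs: iterable of (user, feed) subscription pairs.
    Returns the dense users, the dense feeds, and the pairs inside the region.
    """
    pairs = list(pairs)
    user_degree = Counter(user for user, _ in pairs)
    feed_degree = Counter(feed for _, feed in pairs)
    dense_users = {u for u, d in user_degree.items() if d > min_subscriptions}
    dense_feeds = {f for f, d in feed_degree.items() if d > min_subscribers}
    dense_pairs = [(u, f) for u, f in pairs
                   if u in dense_users and f in dense_feeds]
    return dense_users, dense_feeds, dense_pairs


# Toy subscription data with thresholds lowered to 1 for illustration.
subscriptions = [
    ("u1", "f1"), ("u1", "f2"), ("u1", "f3"),
    ("u2", "f1"), ("u2", "f2"),
    ("u3", "f1"),
]
users, feeds, region = dense_region(subscriptions, 1, 1)
```

Only the sub-matrix spanned by `users` and `feeds` would then be handed to the factorization; everything outside it stays with the simpler methods.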

12 Experiments
Bloglines.com: online RSS reader
Trust matrix T (binary 0/1 version): subscription profile
- 91,366 users
- 487,694 RSS feeds
Endorsement matrix E: blog-keyword occurrences
- Feed content collected between Nov 2006 and Jul 2007
- Keywords restricted to nouns with high tf-idf values in entries
Platform
- Python implementation of the proposed scheme
- MySQL server on Linux with data on a RAID disk

13 How different is personalization?
- Week of 2007 Jan 7 to 2007 Jan 13; major event: iPhone announced
- Personal aggregation results differ from global aggregation

Top items, 2007-01-07 to 2007-01-13:

  User 91017   User 90550   User 90439   Global
  yorker       brazil       cattle       sales
  iraq         iguazu       beef         iphone
  bush         reuters      iphone       apple
  president    search       chicago      manager
  views        vegas        iraq
  avenue       argentina    bush         management
  dept         kibbutz      apple        development
  troops       video        companies    software
  saddam       cathartik    prices       business
  iran         google       quarter      phone

14 How different is personalization?
Overlap comparison of global aggregation and personal aggregation
- L_G: global top 20 items
- L_i: individual top 20 items of user i
Personal aggregation results also differ among users
[Figure panels: overlap degree with the global aggregation result; pair-wise overlap among users]
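The overlap comparison can be sketched as follows, assuming the overlap degree is the fraction of shared items between two equally long top lists; the exact formula is not given in this transcript, so that definition, the function names, and the shortened toy lists are assumptions.

```python
from itertools import combinations


def overlap(list_a, list_b):
    """Fraction of items in list_a that also appear in list_b."""
    a, b = set(list_a), set(list_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)


def mean_pairwise_overlap(lists):
    """Average overlap over all unordered pairs of users' top lists."""
    pairs = list(combinations(lists, 2))
    if not pairs:
        return 0.0
    return sum(overlap(x, y) for x, y in pairs) / len(pairs)


# Toy top-4 lists standing in for the top-20 lists on the slide.
L_G = ["sales", "iphone", "apple", "manager"]
user_lists = [
    ["yorker", "iraq", "bush", "president"],
    ["brazil", "iguazu", "reuters", "search"],
    ["cattle", "beef", "iphone", "iraq"],
]
overlap_with_global = [overlap(L_G, L_i) for L_i in user_lists]
pairwise = mean_pairwise_overlap(user_lists)
```

Because both lists in a comparison have the same length, the measure is symmetric and ranges from 0 (disjoint lists) to 1 (identical sets).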

15 Approximation accuracy
Dense region of the subscription matrix
- >30 subscribers: 10,152 feeds
- >30 subscriptions: 24,340 users
L2 norm comparison

  Rank    SVD      NMF
  80      848.5    856.9
  90      841.6    850.1
  100     835.1    844.6
  110     829.0    837.9
  120     823.2    833.0

Sparsity of W (23%), H (13%)
NMF approximation is close to SVD, with the added advantage of sparseness

16 Approximation accuracy
How many items in the top 20 list are recovered by the NMF approximation?
- T_i: top 20 items of user i computed by OTF
- A_i: top 20 items of user i computed by NMF
About 70% of items are recovered, with higher accuracy for higher-ranked items
[Figure: correlation with rank]
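One plausible way to measure this accuracy as a function of rank is sketched below: for each prefix of the exact list T_i, compute the fraction of its items that also appear in the approximate list A_i. The slides do not define the measure precisely, so this reading, the function name `prefix_recall`, and the toy lists are assumptions.

```python
def prefix_recall(exact_top, approx_top):
    """For each rank r, the fraction of the exact top-r items found in approx_top."""
    approx = set(approx_top)
    hits = 0
    recall = []
    for rank, item in enumerate(exact_top, start=1):
        if item in approx:
            hits += 1
        recall.append(hits / rank)
    return recall


# Toy top-4 lists standing in for the top-20 lists T_i (OTF) and A_i (NMF).
T_i = ["iphone", "apple", "sales", "phone"]
A_i = ["apple", "iphone", "phone", "price"]
curve = prefix_recall(T_i, A_i)
```

The last entry of the curve is the overall fraction of recovered items, and a curve that starts high and declines indicates that the highest-ranked items are the ones recovered most reliably.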

17 Efficiency of proposed method
Update cost
- OTF (222K) < NMF (3.2M) < VIEW (23.6M)
Query response time
- Averaged over the 1000 users with the highest number of subscriptions
- OTF: execute SQL query Q1 on the MySQL server
- NMF: Python implementation of the Threshold Algorithm that interfaces with the MySQL server to load the NMF template users' tables

  Method   avg      std      max       min
  OTF      2.05s    3.60s    84.42s    0.037s
  NMF      0.46s    0.53s    2.84s     0.007s

Average query response time is reduced by 75%, and outliers with significant delay are eliminated

18 Conclusion and future work
Conclusion
- Deliver tailored results to users by personal aggregation
- Proposed a model for personal aggregate queries, optimized with NMF and the Threshold Algorithm
- A real-life dataset study shows that query response time can be reduced significantly with acceptable approximation accuracy
Future work
- Handle updates when the trust matrix changes
- Parallelism
- Better phrase extraction (e.g. opinion orientation)

19 Thank you! Q and A

20 Threshold algorithm
- Proposed by Fagin et al. [2001]
- Efficient computation of top-K items from multiple lists with a monotone aggregate function
[Figure labels: users, blogs, user groups]

21 Illustration of matrix partition
[Figure: trust matrix with feeds ordered by number of subscribers and users ordered by number of subscriptions; example labels: 2 subscriptions, 8 subscriptions, 2 subscribers, 9 subscribers]
