1
Challenges and Opportunities in Building Personalized Online Content Aggregators
Ka Cheung Sia
Adviser: Prof. Junghoo Cho
Oral Defense, January 12, 2009
2
Outline
- Emergence of Web 2.0
- Online content aggregators
- Challenges and opportunities
- RSS monitoring
- Personalized recommendations
- Social annotations
- Conclusion
3
Web 1.0
- A few professional content creators
  - News
  - Corporate sites
  - Portals
- One-way consumption of information
4
Web 2.0
- Facilitators of content sharing
  - Wikipedia
  - Blogs
  - Media file sharing
  - Discussion groups
- Everyone can publish content easily
  - Handheld devices and innovative online applications
- Being Web 2.0 publishers
5
Growth of UGC / blogs
- In a 2007 study
  - Professional content: 2 GB / day
  - UGC (user-generated content): 8-10 GB / day
- Bloglines.com: 26% of users have >30 subscriptions
- TIME's 2006 Person of the Year
6
RSS
- Really Simple Syndication
  - XML
  - Contains the 10-15 latest posts
- Machine readable
  - Date and time of publication
  - Title / content
  - Permalink
- Subscription
  - RSS reader
  - Personalized homepage
7
How does RSS help readers?
- Without RSS: visit different URLs
- With RSS: centralized access
8
RSS usage
- High usage but low awareness
  - 27% consume
  - 4% aware
- Common usage
  - News feeds
  - Podcasting
  - My MSN / My Yahoo! / etc.
  - Google Reader / Bloglines
  - Indexing blogs
  - Time-sensitive content
Source: “RSS – Crossing into the Mainstream”, Yahoo! white paper by Joshua Grossnickle, Oct 2005
9
Online content aggregator
- Centralized access to subscribed content in executive-summary style
- Leverages collaborative filtering
- Ubiquitous access
- Collects useful social annotation data
10
Online content aggregator (Google Reader example)
- Subscription list
- Newly updated articles (Chapters 2 & 4)
- Recommendations (Chapter 3)
- Social annotations (Chapter 5)
11
Challenges and opportunities
- How to deliver up-to-date content?
  - New articles appear quickly, with recurring patterns
  - The significance of articles deteriorates quickly over time
- How to provide better personalization?
  - Ranking articles/topics based on user interest
  - Efficient computation to handle a large number of users
- What is the knowledge in Web 2.0 data?
  - Improve categorization of Web resources
  - Vocabulary usage
12
Outline
- Emergence of Web 2.0
- Online content aggregator
- Challenges and opportunities
- RSS monitoring
  - How to deliver “fresh” content
- Providing better personalization
- Web 2.0 knowledge mining
- Conclusion
13
The retrieval problem
- Research problem in proxies, search engines, …
  - Source cooperativeness [DKP01, OW02]
  - Priority of different content [CG03a]
  - Resource constraints
  - User satisfaction [PO05, WSY02]
  - Politeness issues, …
[Figure: data source -> (retrieval) -> aggregator -> (deliver) -> user]
14
Metrics
1. Evaluation at time u
   - Freshness
   - Age
   - Delay
   - Miss-penalty
2. Push vs. pull
   - Push: all updates are known (e.g. RSS ping services)
   - Pull: future updates are estimated
15
Refined model
- Commonly used Web page change model
  - Homogeneous Poisson model: λ(t) = λ at any t
- RSS content updates more frequently, with a recurring pattern
  - Periodic inhomogeneous Poisson model: λ(t) = λ(t - nT), n = 1, 2, …, where T is the period
[Figure: data source and user]
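The periodic inhomogeneous Poisson model can be illustrated with a small simulation. This is a minimal sketch, not the thesis implementation: the rate function `periodic_rate`, its parameters, and the use of thinning to sample posting times are all assumptions made for illustration.

```python
import math
import random

def periodic_rate(t, period=24.0, base=0.5, peak=2.0):
    """Hypothetical posting rate (posts/hour) that repeats every `period` hours."""
    phase = (t % period) / period
    return base + peak * max(0.0, math.sin(2 * math.pi * phase))

def simulate_posts(rate, horizon, rate_max, rng):
    """Sample posting times on [0, horizon) by thinning a homogeneous process."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate_max)           # next candidate at the upper-bound rate
        if t >= horizon:
            return times
        if rng.random() < rate(t) / rate_max:    # keep with probability rate(t)/rate_max
            times.append(t)

rng = random.Random(0)
# Two weeks of simulated posts; rate_max = base + peak bounds the rate from above.
posts = simulate_posts(periodic_rate, horizon=24.0 * 14, rate_max=2.5, rng=rng)
```

Because `periodic_rate(t)` equals `periodic_rate(t - nT)` for any whole number of periods, the simulated feed shows the same daily pattern every day, unlike a homogeneous model with constant λ.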
16
Optimization problem
- Resource allocation: how often to contact a data source?
  - O1 is more active and has more subscribers than O2; how much more often should we contact O1?
- Retrieval scheduling: when to contact a data source?
  - Given 2 retrievals allocated to O1, when should we retrieve from it?
  - Both in the morning, or one in the morning and one at night?
17
Retrieval schedule intuition
- t=1: no postings missed
- t=0 or 2: all postings (in the same period) missed
18
Necessary condition for optimality
- Given λ(t) and u(t), choose the retrieval times τ_j that minimize delay / miss-penalty
  - Delay: schedule right after a large number of new posts
  - Miss-penalty: schedule right before a lot of user accesses
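The delay intuition can be checked on a toy schedule. The sketch below assumes a simplified discrete model that is not taken from the slides: a period is split into hourly slots, `rates[h]` is the expected number of posts in slot `h`, there is one retrieval per period at slot `tau`, and a post waits `(tau - h) mod T` slots before it is retrieved.

```python
def expected_delay(rates, tau):
    """Expected total delay per period when the single retrieval happens at slot tau."""
    period = len(rates)
    return sum(rate * ((tau - slot) % period) for slot, rate in enumerate(rates))

def best_retrieval_slot(rates):
    """Brute-force search for the retrieval slot with the smallest expected delay."""
    return min(range(len(rates)), key=lambda tau: expected_delay(rates, tau))

# Hypothetical posting pattern: a quiet feed with a burst at 9:00 and 10:00.
rates = [0.1] * 24
rates[9], rates[10] = 5.0, 3.0
best = best_retrieval_slot(rates)
```

With this pattern the best slot falls at the end of the burst, which matches the slide's rule of scheduling right after a large number of new posts.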
19
Performance
- Reduces misses by 33% compared to CGM03 for 1 retrieval per day
- Reduces misses by a further 20% when the user access pattern is considered
20
Summary
- Better RSS content update model
- Significantly improves “content freshness” under the same resource constraint
- Analysis of typical posting patterns and access patterns
- “Efficient Monitoring Algorithm for Fast News Alert”, with Junghoo Cho and Hyun-Kyu Cho, in IEEE TKDE 2007
- “Monitoring RSS Feeds based on User Browsing Pattern”, with Junghoo Cho, Koji Hino, Yun Chi, Shenghuo Zhu and Belle L. Tseng, in ICWSM 2007
21
Outline
- Emergence of Web 2.0
- Online content aggregator
- Challenges and opportunities
- RSS monitoring
- Providing better personalization
  - Ranking articles/topics based on user interest
  - Efficient computation to support a large number of users
- Social annotations
- Conclusion
22
Learning user profile
- Users are reluctant to indicate their interests
- Cold-start problem
- Diversified recommendations [ZMK05]
- Drift of user interest [WPB01]
- Relevance feedback [Eft00, KDF05]
- Goal: improve the relevance of recommendations
[Figure: learning process linking recommendations, feedback, and click utility]
23
Ranking model 1
- Assumptions
  - K predefined topics
  - Every recommendation item belongs to one topic
- User profile: Θ_i = Pr(click | read, topic i)
  - Θ_i is drawn from a beta distribution with parameters α, β and is estimated by its mean α/(α+β)
[Table: example α and β values for topics 1-6; cell boundaries lost in extraction: α 202532, β 1001031]
24
Ranking model 2
- Ranking bias: g(j) = Pr(read | rank j)
  - Read probability decreases with rank
  - Borrowed from web search studies
- Utility function: U(R; Θ)
  - R: ranking of topics
  - Articles belonging to the same topic are chosen randomly
25
Ranking topics
- Updating the posterior distribution after each iteration
  - Not clicked: β_new = β_old + g(r_i)
  - Clicked: α_new = α_old + 1
- Ranking function of topics
  - Exploitation + λ * exploration
  - Mean + λ * variance
- Example (λ = 1)
  - α = 2, β = 2: ranking score 0.55
  - α = 5, β = 5: ranking score 0.52
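The update rule and the mean-plus-variance score can be written down directly. The following is a minimal sketch using the standard mean and variance of a beta distribution; the function names are hypothetical, and the read probability g(r_i) is passed in as a plain number.

```python
def beta_mean(alpha, beta):
    """Mean of Beta(alpha, beta): the exploitation term."""
    return alpha / (alpha + beta)

def beta_variance(alpha, beta):
    """Variance of Beta(alpha, beta): the exploration term."""
    total = alpha + beta
    return alpha * beta / (total * total * (total + 1))

def topic_score(alpha, beta, lam=1.0):
    """Ranking score of a topic: mean + lam * variance."""
    return beta_mean(alpha, beta) + lam * beta_variance(alpha, beta)

def update(alpha, beta, clicked, read_prob):
    """Posterior update from the slide: +1 to alpha on a click, +g(r_i) to beta otherwise."""
    if clicked:
        return alpha + 1, beta
    return alpha, beta + read_prob

print(round(topic_score(2, 2), 2))  # 0.55
print(round(topic_score(5, 5), 2))  # 0.52
```

Both topics have the same mean (0.5), but the topic with fewer observations has the larger variance and therefore ranks higher, which is how the score favors exploring less-known topics.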
26
Simulation
- Click utility improves in the long run
- Adapts to drift of interest
- More accurate estimation of user interest Θ
27
User studies
- 10 users from UCLA and NEC
- 45 categories from dmoz.org
  - Arts/Architecture
  - Computers/E-books
  - Science/Biology
  - …
- Survey of user interest before the experiment
- 7 articles (Web pages) per iteration
- 3 strategies interleaved
- First 25 iterations
- Drifted at the 25th iteration
28
Summary
- Learning framework
  - Exploitation: recommend items the user is interested in
  - Exploration: explore the user's other potential interests
- Shown to improve click utility and adapt to drift of user interest
- “Capturing User Interest by Both Exploitation and Exploration”, with Shenghuo Zhu, Yun Chi, Koji Hino, and Belle L. Tseng, in UM 2007
29
Outline
- Emergence of Web 2.0
- Online content aggregator
- Challenges and opportunities
- RSS monitoring
- Providing better personalization
  - Ranking articles/topics based on user interest
  - Efficient computation to support a large number of users
- Social annotations
- Conclusion
30
Aggregation as recommendation
- User-generated content in the Blogosphere and Web 2.0 services contains rich information about recent events
- Aggregation of individual opinions often reveals interesting popular topics
31
Personal recommendation
- Items (phrases): Dark Knight, Olympics, Michael Phelps, WALL-E, Las Vegas
- RSS sources (example posts)
  - “Dark Knight is great, more entertaining than watching Olympics and shows in Las Vegas!”
  - “Um.. it will be good if there is a free show of Dark Knight and WALL-E”
  - “Michael Phelps performance in Olympics is awesome...”
  - “Finished watching Michael Phelps in Olympics, let me watch the WALL-E DVD...”
32
Matrix formulation
- Reference matrix (E): the number of times a blogger mentions a phrase/link in his blog posts
- Subscription matrix (T): how often a user reads a blog
- Personalized score (TE)

E       o1    o2    o3
b1      3     2     0
b2      0     3     0
b3      1     0     1
b4      1     2     3
Total   5     7     4

T       b1    b2    b3    b4
u1      0.8   0.8   0     0
u2      0.2   0.2   0.6   0.6
u3      0     0     0.5   0.5

TE      o1    o2    o3
u1      2.4   4.0   0.0
u2      1.8   2.2   2.4
u3      1.0   1.0   2.0
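The personalized score is an ordinary matrix product. The sketch below reproduces the slide's small example in plain Python; the values of E and T are read from the example tables on the slide (duplicate cells were dropped in the extraction, so T is reconstructed so that T times E matches the TE table), and the helper `matmul` is a hypothetical name.

```python
# Reference matrix E: rows are blogs b1..b4, columns are items o1..o3.
E = [
    [3, 2, 0],
    [0, 3, 0],
    [1, 0, 1],
    [1, 2, 3],
]

# Subscription matrix T: rows are users u1..u3, columns are blogs b1..b4.
T = [
    [0.8, 0.8, 0.0, 0.0],
    [0.2, 0.2, 0.6, 0.6],
    [0.0, 0.0, 0.5, 0.5],
]

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# scores[u][o] is user u's personalized score for item o.
scores = matmul(T, E)
```

The global ranking (the Total row of E) puts o2 first for everyone, while the personalized scores put o3 first for users u2 and u3.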
33
Database operation of matrix
- Reference (rss-id, item, score)
  - Grows over time
- Subscription (user-id, rss-id, score)
  - Relatively stable

T       b1    b2    b3    b4
u1      0.8   0.8   0     0
u2      0.2   0.2   0.6   0.6
u3      0     0     0.5   0.5

E       o1    o2    o3
b1      3     2     0
b2      0     3     0
b3      1     0     1
b4      1     2     3
34
Baselines
- Aggregate query

    SELECT e.item, SUM(t.score * e.score) AS p_score
    FROM Endorsement e, Trust t
    WHERE e.blog_id = t.blog_id AND t.user_id = <user-id>
    GROUP BY e.item
    ORDER BY p_score DESC
    LIMIT 20

- On-the-fly (OTF)
- View (VIEW)
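The on-the-fly baseline can be tried out with Python's built-in `sqlite3` module. This is only a sketch: the experiments in the talk ran on MySQL, the table names follow the query on the slide, and the underscore column names (`blog_id`, `user_id`) and the sample rows are assumptions made so the example runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Endorsement (blog_id TEXT, item TEXT, score REAL);
CREATE TABLE Trust (user_id TEXT, blog_id TEXT, score REAL);
""")
# Non-zero cells of the example matrices E and T.
conn.executemany("INSERT INTO Endorsement VALUES (?, ?, ?)", [
    ("b1", "o1", 3), ("b1", "o2", 2), ("b2", "o2", 3),
    ("b3", "o1", 1), ("b3", "o3", 1),
    ("b4", "o1", 1), ("b4", "o2", 2), ("b4", "o3", 3),
])
conn.executemany("INSERT INTO Trust VALUES (?, ?, ?)", [
    ("u1", "b1", 0.8), ("u1", "b2", 0.8),
    ("u2", "b1", 0.2), ("u2", "b2", 0.2), ("u2", "b3", 0.6), ("u2", "b4", 0.6),
    ("u3", "b3", 0.5), ("u3", "b4", 0.5),
])

def top_items(conn, user_id, k=20):
    """Run the aggregate query for one user at request time (OTF baseline)."""
    return conn.execute(
        """
        SELECT e.item, SUM(t.score * e.score) AS p_score
        FROM Endorsement e JOIN Trust t ON e.blog_id = t.blog_id
        WHERE t.user_id = ?
        GROUP BY e.item
        ORDER BY p_score DESC
        LIMIT ?
        """,
        (user_id, k),
    ).fetchall()
```

Every call joins and aggregates all of the user's subscriptions, so nothing is paid at update time and everything is paid at query time; a materialized view shifts that cost to updates instead.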
35
Two-stage computation
- Support a large number of users and RSS sources
  - OTF: high query cost
  - VIEW: high update cost
- Identify “template” users
  - Users often share similar reading interests
  - Example: template users interested in sports / politics / technologies / …
- Results are pre-computed for template users and then combined in a second stage
36
Discover user groups by NMF
- Decompose subscription matrix T into sub-matrices W and H
  - Non-negative matrix factorization (NMF) [Hoy04]
  - W: [individual users : template users] relationship
  - H: [template users : blogs] relationship
- Example: user 2's subscription vector is expressed as a linear combination of two template users
- NMF as an approximation of the original subscription matrix
  - Accurate
  - Sparse
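To make the decomposition T ≈ WH concrete, here is a small plain-Python sketch. It uses the basic multiplicative-update rules for NMF rather than the sparseness-constrained variant of [Hoy04] that the slide cites, so it is a stand-in technique; the function names, iteration count, and random initialization are all assumptions.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw)) ** 0.5

def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Basic multiplicative-update NMF (not the sparse variant of [Hoy04])."""
    rng = random.Random(seed)
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in v]
    h = [[rng.random() + 0.1 for _ in v[0]] for _ in range(rank)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(len(h[0]))]
             for i in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(len(w))]
    return w, h

# Example subscription matrix: 3 users x 4 blogs, explained by 2 template users.
T = [
    [0.8, 0.8, 0.0, 0.0],
    [0.2, 0.2, 0.6, 0.6],
    [0.0, 0.0, 0.5, 0.5],
]
W, H = nmf(T, rank=2)
```

Each row of `W` gives one user's weights over the template users, and each row of `H` is one template user's subscription profile over the blogs.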
37
Reconstruction of results
- Personalized scores of template users are pre-computed
  - (HE) is maintained as sorted lists for template users
- W*(HE) gives the personalized scores of all users
- Top-K list computed using the Threshold Algorithm [FLN01]
  - (HE) are sorted lists
  - W*(HE) is a weighted linear combination
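The second stage can be sketched as a top-K merge over the template users' sorted lists. The code below is a simplified version of the Threshold Algorithm for a weighted sum with non-negative weights and scores; the list contents and the weights (0.25, 1.2) are hypothetical values chosen so that the combination reproduces user u2 from the matrix example, and the function name is an assumption.

```python
import heapq

def threshold_top_k(sorted_lists, weights, k):
    """Top-k items under a weighted sum of per-list scores.

    sorted_lists[i] is a list of (item, score) pairs sorted by score, descending.
    Items missing from a list count as score 0; weights and scores must be >= 0.
    """
    lookup = [dict(lst) for lst in sorted_lists]   # random access by item

    def total(item):
        return sum(w * scores.get(item, 0.0) for w, scores in zip(weights, lookup))

    seen = {}
    max_depth = max(len(lst) for lst in sorted_lists)
    for depth in range(max_depth):
        threshold = 0.0
        for w, lst in zip(weights, sorted_lists):
            if depth < len(lst):
                item, score = lst[depth]           # sorted access at this depth
                threshold += w * score
                if item not in seen:
                    seen[item] = total(item)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # Stop once the k-th best seen score is at least the best possible unseen score.
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

# Pre-computed, sorted scores (HE) for two hypothetical template users.
template_a = [("o2", 4.0), ("o1", 2.4), ("o3", 0.0)]
template_b = [("o3", 2.0), ("o1", 1.0), ("o2", 1.0)]
result = threshold_top_k([template_a, template_b], weights=[0.25, 1.2], k=2)
```

In this example the algorithm stops after reading two entries from each list, because no unseen item can score higher than the current second-best item.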
38
Experiments
- Bloglines.com: online RSS reader
- Subscription matrix T: (0 or 1) subscription profile
  - 91k users
  - 487k feeds
- Reference matrix E: blog-keyword occurrence
  - Feed content collected between Nov 2006 and Jul 2007
  - The top 20 nouns with the highest tf-idf in each post are selected as keywords
- Platform
  - Python implementation of the proposed method
  - MySQL server on Linux with data residing on RAID
39
The difference by personalization
- Week of 2007 Jan 7 to Jan 13
  - Major event: iPhone released
- 3 users with a large number of subscriptions
- Distinct difference between top-20 recommended words
  - Among users: 1.13
  - Between users and global: 1.12

2007-01-07 to 2007-01-13
User 91017   User 90550   User 90439   Global
yorker       brazil       cattle       sales
iraq         iguazu       beef         iphone
bush         reuters      iphone       apple
president    search       chicago      manager
views        vegas        iraq
avenue       argentina    bush         management
dept         kibbutz      apple        development
troops       video        companies    software
saddam       cathartik    prices       business
iran         google       quarter      phone
40
Efficiency of proposed method
- Update cost: OTF (222K) < NMF (3.2M) < VIEW (23.6M)
- Query response time
  - Averaged over the 1000 users with the highest number of subscriptions
  - OTF: executes the SQL query directly on the MySQL server
  - NMF: Python implementation that interfaces with the MySQL server
- Average query response time reduced by 75%; outliers with significant delay eliminated
- 70% approximation

Method   avg      std      max       min
OTF      2.05s    3.60s    84.42s    0.037s
NMF      0.46s    0.53s    2.84s     0.007s
41
Summary
- Provide personalized recommendations by selective aggregation
  - Proposed a matrix model for personalized aggregation
  - Optimization by NMF & the Threshold Algorithm
- A study on a real-life dataset shows that query response time can be reduced significantly with acceptable approximation accuracy
- “Efficient Computation of Personal Aggregation Queries on Blogs”, with Junghoo Cho, Yun Chi, and Belle L. Tseng, in SIGKDD 2008
42
Outline
- Emergence of Web 2.0
- Online content aggregator
- Challenges and opportunities
- RSS monitoring
- Providing better personalization
- Social annotations
  - Vocabulary usage
  - Effective advertising keyword selection
- Conclusion
43
Social annotations
- Bookmark tags, video/picture annotations, article tags
  - Evolving vocabularies (itouch, wow, w00t, …)
  - Emoticons (>_<, Orz, …)
  - Intensive human effort
- Latent Dirichlet Allocation [BNJ03]
  - Recover hidden topics z
  - Represent words p(z|w) and documents p(z|d) as distributions over hidden topics
- Improving information retrieval
  - Web document retrieval [WZY06, ZBZ08]
  - Social tagging usability [CM07]
[Figure: users (u), tags (w), documents (d)]
44
Topic categorization
45
Desired properties of effective advertising keywords
- Specific
  - Reach the target audience
  - e.g. automobiles > ford, good > programming
- Emerging
  - Developing vs. stable
  - Easier to attract user attention
- Time-(in)sensitive
  - Context changes over time
  - Watch for changes in the target audience
- How can these properties be learned from social annotations collected in aggregators?
46
Emerging
- Words correspond to emerging topics
- Users actively explore new pages and annotate evenly on different pages
- Examples (between December 2007 and March 2008)
  - rails2.0 (Ruby on Rails web application framework)
  - kindle (Amazon e-book reader)
  - itouch (unofficial nickname of the iPod touch)
  - eeepc (subnotebook by Asus)
  - obama (Barack Obama)
  - jailbreak (Apple iPhone crack software)
[Figure: change of entropy, emerging vs. stable]
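One way to read the "change of entropy" signal is sketched below. The slide does not give the exact definition, so this assumes the entropy of a tag is the Shannon entropy of its usage distribution over URLs within a time window, and that an emerging tag is one whose entropy grows between two windows; the function names and sample URLs are hypothetical.

```python
import math
from collections import Counter

def tag_entropy(urls):
    """Shannon entropy (bits) of one tag's usage distribution over URLs."""
    counts = Counter(urls)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_change(urls_before, urls_after):
    """Positive when the tag is spread more evenly over more pages in the later window."""
    return tag_entropy(urls_after) - tag_entropy(urls_before)

# Hypothetical annotations of the tag "kindle" in two consecutive windows.
december = ["amazon.com/kindle", "amazon.com/kindle", "amazon.com/kindle"]
march = ["amazon.com/kindle", "blog-a/review", "blog-b/hacks", "news-c/ebooks"]
change = entropy_change(december, march)
```

A tag used on a single page has entropy 0, and a tag spread evenly over four pages has entropy 2 bits, so users annotating evenly across new pages pushes the value up.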
47
Effective advertising keyword classifier
- 10+ features extracted from social annotations for each word
- User study performed on Amazon Mechanical Turk
- 10-fold cross-validation on different classifiers
  - SVM: 70.3%
  - Logistic regression: 69.8%
  - C4.5: 73.3%
  - Random forest: 73.3%
  - K-nn: 67.3%
  - Back-propagation neural nets: 63.9%
  - Naïve Bayes: 59.9%
  - Best-5 combined: 73.8%
48
Summary
- Leverage social annotations collected from online content aggregator users
- Social annotations differ significantly from general text corpora
  - New metrics / features
  - Usage in online advertising
- “Exploring Social Annotations for Effective Advertising Keyword Selection”, with Junghoo Cho, work in progress
49
Conclusion
- Web 2.0 phenomenon
  - More content sharing and diverse interests
- Personalized online content aggregator
  - Easier access to different information sources
  - Deliver up-to-date content
  - Deliver better personalized recommendations
  - Leverage human effort collected in the aggregator
50
References
[CG03a] Junghoo Cho and Hector Garcia-Molina. “Effective Page Refresh Policies for Web Crawlers.” ACM TODS 28(4), 2003
[DKP01] Pavan Deolasee, Amol Katkar, Ankur Panchbudhe, Krithi Ramamritham, and Prashant Shenoy. “Adaptive Push-Pull: Disseminating Dynamic Web Data.” WWW 2001
[OW02] Chris Olston and Jennifer Widom. “Best-Effort Cache Synchronization with Source Cooperation.” SIGMOD 2002
[PO05] Sandeep Pandey and Christopher Olston. “User-Centric Web Crawling.” WWW 2005
[WSY02] J.L. Wolf, M.S. Squillante, P.S. Yu, J. Sethuraman, and L. Ozsen. “Optimal Crawling Strategies for Web Search Engines.” WWW 2002
[FLN01] Ronald Fagin, Amnon Lotem, and Moni Naor. “Optimal Aggregation Algorithms for Middleware.” PODS 2001
[Hoy04] Patrik Hoyer. “Non-negative Matrix Factorization with Sparseness Constraints.” Journal of Machine Learning Research, 5:1457-1469, 2004
[LWL07] Chengkai Li, Ming Wang, Lipyeow Lim, Haixun Wang, and Kevin Chen-Chuan Chang. “Supporting Ranking and Clustering as Generalized Order-By and Group-By.” SIGMOD 2007
[PP07] Seung-Taek Park and David Pennock. “Applying Collaborative Filtering Techniques to Movie Search for Better Ranking and Browsing.” SIGKDD 2007
51
References
[Eft00] E.N. Efthimiadis. “Interactive Query Expansion: A User-based Evaluation in a Relevance Feedback Environment.” JASIS 51(11), 2000
[KDF05] Diane Kelly, Vijay Deepak Dollu, and Xin Fu. “The Loquacious User: A Document-Independent Source of Terms for Query Expansion.” SIGIR 2005
[WPB01] Geoffrey I. Webb, Michael J. Pazzani, and Daniel Billsus. “Machine Learning for User Modeling.” User Modeling and User-Adapted Interaction, 11(1-2):19-29, 2001
[ZMK05] Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen. “Improving Recommendation Lists Through Topic Diversification.” WWW 2005
[BNJ03] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation.” Journal of Machine Learning Research, 3:993-1022, 2003
[CM07] Ed H. Chi and Todd Mytkowicz. “Understanding Navigability of Social Tagging Systems.” CHI 2007
[WZY06] Xian Wu, Lei Zhang, and Yong Yu. “Exploring Social Annotation for the Semantic Web.” WWW 2006
[ZBZ08] Ding Zhou, Jiang Bian, Shuyi Zheng, Hongyuan Zha, and C. Lee Giles. “Exploring Social Annotations for Information Retrieval.” WWW 2008
52
Thank you Q & A
53
Additional slides
- RSS monitoring
  - Different data posting patterns
  - Optimal size of estimation window
  - Consistency of posting rate
- Providing personalized recommendations
  - Partition of trust matrix
  - Threshold algorithm
  - NMF approximation accuracy
  - Approximation accuracy
  - Multi-armed bandit problem
- Social annotations
  - Preferential attachment / usage
  - URL in photography category
  - Distribution of entropy change
  - Performance of different classifiers
54
Different data posting patterns
55
Optimal size of estimation window
- Resource constraint: 4 retrievals per day per feed on average
- 2 weeks seems an appropriate choice
56
Consistency of posting rate
- 90% of the RSS feeds post consistently
57
Partition of subscription matrix
- Decomposition is useful when the matrix is dense
- Real-life data is often skewed
- Hybrid method: uses NMF only in its effective region
  - Users with more subscriptions
  - Blogs with more subscribers
[Figure: partitioned subscription matrix; labels: users with >30 subscriptions; feeds with >30 subscribers; 10k feeds, 24k users; ~1M subscription pairs; 2.7M subscription pairs; regions 1. OTF, 2. VIEW, 3. NMF]
58
Threshold algorithm
- Proposed by Fagin et al. [FLN01]
- Efficient computation of the top-K items from multiple lists with a monotone aggregate function
[Figure: users, blogs, template users' recommendations, update, query]
59
NMF approximation accuracy
- Dense region of subscription matrix
  - >30 subscribers: 10152 feeds
  - >30 subscriptions: 24340 users
- L2 norm comparison
- Sparsity of W (23%), H (13%)
- NMF approximation is close to SVD and sparse

Rank   SVD     NMF
80     848.5   856.9
90     841.6   850.1
100    835.1   844.6
110    829.0   837.9
120    823.2   833.0
60
Approximation accuracy
- How many items are approximated by NMF in the top 20 list?
  - T_i: top 20 items of user i computed by OTF
  - A_i: top 20 items of user i computed by NMF
- 70% approximation, and more accurate for higher-ranked items
- Correlation with rank
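The accuracy formula itself did not survive in this transcript. A natural reading, assumed here, is the fraction of the exact top-20 list T_i that also appears in the approximate list A_i; the function name and sample lists are hypothetical.

```python
def top_k_overlap(exact_items, approx_items):
    """Fraction of the exact top-k list that also appears in the approximate list."""
    exact = set(exact_items)
    if not exact:
        return 0.0
    return len(exact & set(approx_items)) / len(exact)

# Hypothetical top-4 lists for one user: exact (OTF) vs. approximate (NMF).
otf_list = ["iphone", "apple", "sales", "manager"]
nmf_list = ["iphone", "sales", "software", "phone"]
accuracy = top_k_overlap(otf_list, nmf_list)
```

Under this reading, the reported 70% approximation would mean that on average 14 of a user's exact top 20 items also appear in the NMF-based top 20.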
61
Multi-armed bandit problem
- Well-studied problem in reinforcement learning / statistics
- Problem statement
  - Background: you are given n different choices
  - Decision: for each choice you receive a numerical reward drawn from an unknown stationary probability distribution
  - Goal: maximize the total reward over some time period
- Solutions
  - Action-value methods (greedy & ε-greedy)
  - Softmax action selection (decaying)
  - Pursuit methods
  - Associative search
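Of the listed solutions, the ε-greedy action-value method is the simplest to sketch. The code below assumes Bernoulli rewards (a click or no click) and uses hypothetical click probabilities; it illustrates the general technique and is not the ranking method from the talk.

```python
import random

def epsilon_greedy(reward_probs, steps, epsilon, rng):
    """Play a Bernoulli bandit with the epsilon-greedy action-value method."""
    n = len(reward_probs)
    counts = [0] * n        # how often each arm was chosen
    values = [0.0] * n      # running average reward per arm
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                        # explore: random arm
        else:
            arm = max(range(n), key=lambda a: values[a])  # exploit: best estimate
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total_reward += reward
    return total_reward, counts, values

# Three hypothetical topics with unknown click probabilities.
total, counts, values = epsilon_greedy([0.1, 0.5, 0.8], steps=2000,
                                       epsilon=0.1, rng=random.Random(0))
```

The parameter `epsilon` plays a role similar to λ in the ranking function: a larger value spends more recommendations on exploring topics whose click rate is still uncertain.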
62
Preferential attachment / usage
- URL / tag usage distribution
63
URL in photography category
- Documents ranked by p(d|z) values
64
Specific
- Stop word list
- Inverse document frequency (idf)
- Ontology based
- Entropy
- Least specific tags found
  - idf: [web, reference, software, design, …]
  - Entropy: [temp, for, important, good, …]
65
Time-sensitivity
- The usage / associated context changes over time
  - “holiday”
    - Travel packages: [travel, eclipse, europe, guide, …]
    - Christmas shopping: [christmas, gift, shopping, …]
  - “programmers”
    - Programming: [programming, development, code, patterns, …]
    - Job hunting: [work, jobs, career, job, …]
- KL-divergence of two distributions
- Jaccard coefficient of two sets of tagged URLs
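Both measures are easy to compute from tag data. The sketch below assumes that a tag's context in a time window is the distribution of tags that co-occur with it, and that its URL set is the set of pages it was applied to in that window; the additive smoothing, the function names, and the sample counts are assumptions, not details from the slides.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) for two count dictionaries, with light additive smoothing."""
    keys = set(p) | set(q)
    p_total = sum(p.values()) + eps * len(keys)
    q_total = sum(q.values()) + eps * len(keys)
    divergence = 0.0
    for key in keys:
        pk = (p.get(key, 0.0) + eps) / p_total
        qk = (q.get(key, 0.0) + eps) / q_total
        divergence += pk * math.log(pk / qk)
    return divergence

def jaccard(a, b):
    """Jaccard coefficient of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical co-occurring tag counts for "holiday" in two windows.
summer = {"travel": 10, "europe": 6, "guide": 4}
december = {"christmas": 12, "gift": 7, "shopping": 5, "travel": 1}
context_shift = kl_divergence(december, summer)
url_overlap = jaccard({"u1", "u2", "u3"}, {"u2", "u3", "u4"})
```

A large divergence together with a small URL overlap would suggest a time-sensitive keyword such as "holiday", whose audience changes between seasons.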
66
Distribution of entropy change
- Entropy increases over time (+0.1 over 3 months)
67
Performance of different classifiers
- 10-fold cross-validation

Classifier              Specific   Emerging   Stable
SVM                     65.7%      70.3%      63.9%
Logistic regression     66.3%      69.8%      60.7%
C4.5                    64.5%      73.3%      59.0%
Random forest           65.1%      73.3%      57.4%
Knn (k=5)               60.1%      67.3%      64.5%
Multilayer perceptron   60.3%      63.9%      60.1%
Naïve Bayes             67.4%      59.9%      63.4%
Best-5 combined         66.3%      73.8%      63.4%