On Frequent Chatters Mining. Claudio Lucchese, 1st HPC Lab Workshop, 6/15/12.


1 On Frequent Chatters Mining
Claudio Lucchese, 1st HPC Lab Workshop, 6/15/12

2 Frequent Patterns Mining
How many patterns do you see in the following dataset?
[Figure: dataset with columns A to M and rows 1 to 13]
Claudio Lucchese, Salvatore Orlando, Raffaele Perego: Mining Top-K Patterns from Binary Datasets in Presence of Noise. SDM 2010

3 Frequent Patterns Mining
[Figure: dataset with columns A to M and rows 1 to 13]

4 Frequent Patterns Mining
Usually rows and columns are not in a “good-looking” order.

5 State of the art
Most recent approaches try to discover the top-k patterns that optimize different cost functions:
- Minimize noise (“holes”)
- Minimize the MDL cost: encoding(Patterns) + encoding(Data|Patterns)
- Maximize the information ratio: number of bits of information w.r.t. the Maximum Entropy Model built on the row and column marginal distributions
- Minimize the length of the patterns and the amount of noise (our approach)
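The last cost function above (pattern length plus noise) can be sketched as follows. This is a minimal illustration, not the algorithm from the SDM 2010 paper: it assumes a pattern is a pair of row and column index sets, that its length is the number of indices it lists, and that noise is every cell where the data disagrees with the union of the patterns. The names `Pattern` and `pattern_cost` are hypothetical.

```python
from typing import List, Set, Tuple

# Assumption: a pattern is a (row indices, column indices) pair.
Pattern = Tuple[Set[int], Set[int]]


def pattern_cost(data: List[List[int]], patterns: List[Pattern]) -> int:
    """Total pattern length plus the number of noisy cells."""
    n_rows, n_cols = len(data), len(data[0])
    # Cells that at least one pattern claims to be 1.
    covered = {(r, c) for rows, cols in patterns for r in rows for c in cols}
    # Length of a pattern: how many row and column indices it lists.
    length = sum(len(rows) + len(cols) for rows, cols in patterns)
    # Noise: cells where the data and the patterns disagree
    # (uncovered 1s and covered 0s, i.e. the "holes").
    noise = sum(
        1
        for r in range(n_rows)
        for c in range(n_cols)
        if (data[r][c] == 1) != ((r, c) in covered)
    )
    return length + noise


# Toy dataset, invented for illustration only.
data = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
one_block = [({0, 1}, {0, 1})]
print(pattern_cost(data, one_block))  # 4 (length) + 2 (noise) = 6
print(pattern_cost(data, []))         # no patterns: 4 uncovered 1s = 4
```

On this toy dataset the single 2x2 block costs more than using no pattern at all, so a miner minimizing this cost would not select it.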

6 Evaluation
Unsupervised: measure how well the proposed algorithm optimizes the proposed cost function. What is the best cost function?
We are investigating supervised measures:
- Unsupervised extraction: extract patterns from a classification/clustering dataset without using the class/cluster labels
- Supervised evaluation: measure how well the patterns can predict/match the classes/clusters
Preliminary result: fancy cost functions might not be the best ones.
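One possible supervised measure is sketched below. The slides do not say which measure is used; purity (the share of a pattern's rows that belong to its majority class) is an assumption chosen here for illustration, and the function names and toy labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def pattern_purity(pattern_rows: Set[int], labels: Dict[int, str]) -> float:
    """Fraction of the rows in a pattern that share its majority class."""
    counts = Counter(labels[r] for r in pattern_rows)
    return counts.most_common(1)[0][1] / len(pattern_rows)


def mean_purity(patterns: List[Set[int]], labels: Dict[int, str]) -> float:
    """Average purity, weighted by the number of rows in each pattern."""
    total = sum(len(p) for p in patterns)
    return sum(pattern_purity(p, labels) * len(p) for p in patterns) / total


# Patterns are mined without labels; labels are used only for scoring.
labels = {0: "a", 1: "a", 2: "b", 3: "b", 4: "b"}
patterns = [{0, 1, 2}, {3, 4}]
print(mean_purity(patterns, labels))
```

A value close to 1 means the rows grouped by each pattern mostly belong to one class, even though the labels were hidden during extraction.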

7 Information Overload in News
Gianmarco De Francisci Morales, Aristides Gionis, Claudio Lucchese: From chatter to headlines: harnessing the real-time web for personalized news recommendation. WSDM 2012.

8 Can we exploit Twitter?
✓ Timeliness
✓ Personalization
[Figure: number of mentions of “Osama Bin Laden”]

9 News Get Old Soon
90% of the clicks happen within 2 days from publication. Only a few occur early!

10 T.Rex (Twitter-based news recommendation system)
- Builds a user model from Twitter
- Signals from user-generated content, social neighbors, and popularity across Twitter and news
- Entity-based representation (overcomes vocabulary mismatch)
- Learns a personalized news ranking function: picks candidates from a pool of related or popular fresh news, ranks them, and presents the top-k to the user

11 Recommendation Model
The ranking function is user and time dependent: Social model + Content model + Popularity model
- The popularity model tracks entity popularity by the number of mentions in Twitter and news (with exponential forgetting)
- The content model measures the relatedness of a bag-of-entities representation of a user's tweet stream and of a news article
- The social model weights the content model of every social neighbor by a truncated PageRank on the Twitter network
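The three components above can be sketched as follows. This is a rough illustration under several assumptions that are not in the slides: forgetting is parameterized by a half-life in hours, relatedness is cosine similarity between bags of entities, and the neighbor weights are passed in as plain numbers standing in for the truncated PageRank scores. All names (`PopularityModel`, `cosine`, `rank_score`) are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Bag = Dict[str, float]  # bag of entities: entity -> weight


class PopularityModel:
    """Mention counts per entity, decayed exponentially with age."""

    def __init__(self, half_life_hours: float = 24.0) -> None:
        # Assumption: forgetting is expressed as a half-life.
        self.rate = math.log(2.0) / half_life_hours
        self.count: Dict[str, float] = {}
        self.updated: Dict[str, float] = {}

    def popularity(self, entity: str, now: float) -> float:
        if entity not in self.count:
            return 0.0
        age = now - self.updated[entity]
        return self.count[entity] * math.exp(-self.rate * age)

    def mention(self, entity: str, now: float) -> None:
        # Decay the old count to the current time, then add one mention.
        self.count[entity] = self.popularity(entity, now) + 1.0
        self.updated[entity] = now


def cosine(a: Bag, b: Bag) -> float:
    """Cosine similarity, used here as a stand-in relatedness measure."""
    dot = sum(w * b.get(e, 0.0) for e, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def rank_score(
    user: Bag,
    neighbors: List[Tuple[Bag, float]],  # (neighbor bag, neighbor weight)
    article: Bag,
    pop: PopularityModel,
    now: float,
    weights: Tuple[float, float, float],
) -> float:
    """Weighted sum of the social, content, and popularity scores."""
    content = cosine(user, article)
    social = sum(w * cosine(bag, article) for bag, w in neighbors)
    popularity = sum(pop.popularity(e, now) for e in article)
    a, b, c = weights
    return a * social + b * content + c * popularity


pop = PopularityModel(half_life_hours=1.0)
pop.mention("Greece", now=0.0)
pop.mention("Greece", now=1.0)
print(pop.popularity("Greece", now=2.0))  # (0.5 + 1) decayed by half
```

Because each update only decays and increments a counter, the popularity model fits a streaming setting where mentions arrive one at a time.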

12 System Overview
✓ Designed to be streaming and lightweight (just counting)
✓ User model is updated continuously

13 Learning the Weights
Learning-to-rank approach with SVM.
Each time the user clicks on a news item, we learn a set of preferences (clicked_news > non_clicked_news).
Prune the number of constraints for scalability:
- only news published in the last 2 days
- only the top-k news for each ranking component
Can optionally include additional features for news articles: click count, age, etc. (T.Rex+)
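The preference generation and pruning described above can be sketched as follows. This covers only the construction of the (clicked, non-clicked) pairs; training the SVM on them is left out. The `News` record, its field names, and the `preference_pairs` function are hypothetical, and the exact pruning used in T.Rex may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class News:
    news_id: str
    age_days: float
    scores: Dict[str, float]  # one score per ranking component


def preference_pairs(
    clicked: News,
    shown: List[News],
    k: int = 2,
    max_age_days: float = 2.0,
) -> List[Tuple[str, str]]:
    """Return (preferred, other) pairs: clicked_news > non_clicked_news."""
    # Pruning 1: only news published in the last max_age_days days.
    fresh = [
        n
        for n in shown
        if n.news_id != clicked.news_id and n.age_days <= max_age_days
    ]
    # Pruning 2: only the top-k news for each ranking component.
    keep: Set[str] = set()
    for component in clicked.scores:
        ranked = sorted(
            fresh, key=lambda n: n.scores.get(component, 0.0), reverse=True
        )
        keep.update(n.news_id for n in ranked[:k])
    return [(clicked.news_id, n.news_id) for n in fresh if n.news_id in keep]


clicked = News("c", 0.5, {"social": 0.2, "content": 0.9, "popularity": 0.1})
shown = [
    clicked,
    News("a", 1.0, {"social": 0.9, "content": 0.1, "popularity": 0.1}),
    News("b", 1.0, {"social": 0.1, "content": 0.2, "popularity": 0.8}),
    News("d", 3.0, {"social": 1.0, "content": 1.0, "popularity": 1.0}),
    News("e", 0.2, {"social": 0.0, "content": 0.0, "popularity": 0.0}),
]
print(preference_pairs(clicked, shown, k=1))
```

In this example item `d` is dropped for being older than 2 days and item `e` for never reaching the top-1 of any component, so only two constraints remain.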

14 Predicting Clicked News
✓ User-generated content is a very good predictor, albeit very sparse
✓ Click count is a strong baseline but does not help T.Rex+

15 Predicting Clicked Entities

16 Future work (?)
Explain a set of news showing how the main topics interacted with each other over time.

17 Future work (?)
Explain a set of news showing how the main topics interacted with each other over time.
Example: European sovereign-debt crisis
[Figure: timeline with labels Merkel, Monti, France, Berlusconi, Greece, EU, New Italian government, Fiscal Compact, EuroBond, Obama, Loan]

18 Future work (?)
Explain a set of news showing how the main topics interacted with each other over time.
Applications:
- Given the news the user is currently reading, provide an explanation of the related facts that precede that news
- Given a query, provide an explanation of the documents related to that query
- Given a set of topics, explain their relations over time
- Browse a collection of news by changing the topics of interest, the time window, and the granularity

19 Future work (?)
Explain a set of news showing how the main topics interacted with each other over time.
- A topic is a named entity relevant over time
- An interaction is a cluster of news related to some event and relevant in a small time window
- It might be important to cover the given time window, but recent events might be more interesting

20 Future work (?)
Explain a set of news showing how the main topics interacted with each other over time.
Given a maximum number of main topics and interactions, maximize:
- Topic coverage and diversity
- Events time coverage
- Cluster similarity
- Main topics connectivity

21 Future work (?)
Explain a set of news showing how the main topics interacted with each other over time.
It is different from news clustering:
- Even if you had a good clustering, it might not be trivial to select which events and which topics to show in order to maximize the amount of information delivered to the user
- There is some interesting related work aimed at finding chains of news; we are more interested in topic evolution

22 Thank you!

