1
Pete Bohman, Adam Kunk
2
Outline
◦ Introduction
◦ Related Work
◦ System Overview
◦ Indexing Scheme
◦ Ranking
◦ Evaluation
◦ Conclusion
3
Requirements
◦ Contents searchable immediately after creation
◦ Scale to thousands of updates/sec (e.g., Osama bin Laden's death peaked at roughly 5,000 tweets/sec)
◦ Results relevant to the query via cost-efficient ranking
4
TI Rank vs. Time Rank
5
Real-time search of microblogging applications is provided via two components:
◦ Indexing mechanism – prunes tweets, indexing only a subset of all tweets (enables speed)
◦ Ranking mechanism – surfaces relevant tweets, weeding out tweets not deemed important enough
Main idea: look only at important tweets
6
Real-Time Search = Indexing + Ranking
TI Index
◦ Scalable indexing scheme based on partial indexing
◦ Only index tweets likely to appear in a query result
TI Rank
◦ User's PageRank
◦ Popularity of the topic
◦ Tweet-to-query similarity
7
Outline (next: Related Work)
8
The Case for Partial Indexes (Stonebraker, 1989)
◦ Index only a portion of a column
◦ User-specified index predicates (e.g., WHERE salary > 500)
◦ Build the index as a side effect of query processing
◦ Incremental index building
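As a minimal illustration of the partial-index idea (not the paper's implementation), the sketch below indexes only the rows that satisfy a user-specified predicate; the table, column names, and predicate are hypothetical.

```python
# Minimal partial-index sketch: index only rows satisfying a predicate.
# Table, predicate, and column names are hypothetical illustrations.

def build_partial_index(rows, key, predicate):
    """Map key value -> row ids, but only for rows matching the predicate."""
    index = {}
    for row_id, row in enumerate(rows):
        if predicate(row):                      # e.g., salary > 500
            index.setdefault(row[key], []).append(row_id)
    return index

employees = [
    {"name": "a", "salary": 300},
    {"name": "b", "salary": 800},
    {"name": "c", "salary": 900},
]
idx = build_partial_index(employees, "salary", lambda r: r["salary"] > 500)
# Queries restricted to 'salary > 500' can be answered from idx alone;
# other queries fall back to a full scan.
```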
9
Materialized views
◦ A materialized view can be thought of as a snapshot of a database: the result of a query stored as an object
◦ One application uses cost models to automatically select which views to materialize
The concept of indexing only essential tweets in real time was borrowed from this idea of view materialization.
10
Google and Twitter have both released real-time search engines
◦ Google's engine adaptively crawls the microblog
◦ Twitter's engine relies on Apache Lucene (a high-performance, full-featured text search engine library)
But both the Google and Twitter engines use only time in their ranking algorithms; TI's ranking algorithm takes much more than time into account.
11
TI clusters similar tweets together and offloads noisy tweets to reduce the computation cost of real-time search
◦ Tweets are grouped into topics by their reply relationships, organized in a tree structure: tweets replying to the same tweet, or belonging to the same thread, form one tree
◦ TI also maintains popular topics in memory
12
Outline (next: System Overview)
14
Twitter users have friend links to other users
A user graph captures this relationship: G_u = (U, E)
◦ U is the set of users in the system
◦ E is the set of friend links between them
15
◦ Nodes represent tweets
◦ Directed edges indicate replies or retweets
◦ Implemented by assigning each tweet a tree-encoding ID
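The slide does not specify the encoding scheme; a common choice is a Dewey-style prefix code, sketched below as an assumption: each tweet's ID extends its parent's, so the tree root and ancestry are recoverable from the ID alone.

```python
# Hypothetical Dewey-style tree encoding for tweet trees (the paper's exact
# encoding is not shown on the slide; this is an illustrative assumption).

class TweetNode:
    def __init__(self, tid, parent=None):
        self.tid = tid
        self.parent = parent
        self.children = []
        if parent is None:
            self.encoding = "1"                     # root of a new tweet tree
        else:
            parent.children.append(self)
            self.encoding = f"{parent.encoding}.{len(parent.children)}"

root = TweetNode(tid=100)             # original tweet starts a tree
r1 = TweetNode(tid=101, parent=root)  # reply -> encoding "1.1"
r2 = TweetNode(tid=102, parent=r1)    # reply to the reply -> "1.1.1"
# The root of any tweet's tree is identified from the first component of its encoding.
```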
16
Search is handled via an inverted index over tweets
◦ Given a keyword, the inverted index returns a tweet list T
◦ T contains the set of tweets sorted by timestamp
17
Each entry in a posting list stores:
◦ TID = tweet ID
◦ U-PageRank = used for ranking
◦ TF = term frequency
◦ tree = TID of the root node of the tweet's tree
◦ time = timestamp
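A minimal sketch of the inverted-index entries described above, assuming a simple in-memory dict keyed by keyword (the storage layout is an assumption; the field names follow the slide).

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class PostingEntry:
    tid: int           # tweet ID
    u_pagerank: float  # publisher's U-PageRank, used for ranking
    tf: int            # term frequency of the keyword in the tweet
    tree: int          # TID of the root of the tweet's tree
    time: float        # timestamp

# keyword -> posting list; appends keep live tweets in timestamp order
inverted_index = defaultdict(list)

def index_tweet(tid, text, u_pagerank, tree_root, timestamp):
    terms = text.lower().split()
    for term in set(terms):
        entry = PostingEntry(tid, u_pagerank, terms.count(term), tree_root, timestamp)
        inverted_index[term].append(entry)
```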
18
To support ranking, TI keeps a table of metadata for each tweet:
◦ TID = tweet ID
◦ RID = ID of the replied-to tweet (to find the parent)
◦ tree = TID of the root node of the tweet's tree
◦ time = timestamp
◦ count = number of tweets replying to this tweet
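A corresponding sketch of the tweet data table, again as a plain in-memory dict keyed by TID (the storage layout is an assumption).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TweetRecord:
    tid: int            # tweet ID
    rid: Optional[int]  # ID of the replied-to tweet (None for a root tweet)
    tree: int           # TID of the root of this tweet's tree
    time: float         # timestamp
    count: int = 0      # number of replies to this tweet

tweet_table = {}

def register_tweet(tid, rid, timestamp):
    # Root tweets start a new tree; replies inherit the parent's tree root.
    tree_root = tid if rid is None else tweet_table[rid].tree
    tweet_table[tid] = TweetRecord(tid, rid, tree_root, timestamp)
    if rid is not None:
        tweet_table[rid].count += 1   # bump the parent's reply count
```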
19
Certain structures are kept in memory to support indexing and ranking:
◦ Keyword threshold – records statistics of recent popular queries
◦ Candidate topic list – information about recent topics
◦ Popular topic list – information about highly discussed topics
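The slide names these structures without giving their layout; the sketch below is one plausible arrangement (the keyword threshold as the score of each popular query's current K-th result, and the topic lists as dicts keyed by tree root), offered as an assumption rather than the paper's design.

```python
# Hypothetical in-memory structures (layouts assumed, not specified on the slide).

# keyword-threshold table: for each recent popular query, the score of its
# current K-th ranked result (new tweets scoring below this are not "distinguished")
keyword_threshold = {}   # query string -> float score

# candidate topic list: recently seen topics (tree root TID -> last activity time)
candidate_topics = {}

# popular topic list: highly discussed topics (tree root TID -> popularity score)
popular_topics = {}
```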
20
Outline (next: Indexing Scheme)
22
Observation
◦ Users are only interested in the top-K results for a query
Given a tweet t and a user query set Q:
◦ if there exists q_i ∈ Q such that t is a top-K result for q_i under the ranking function F, then t is a distinguished tweet
What is the maintenance cost for the query set Q?
23
Observation
◦ 20% of queries represent 80% of user requests (Zipf's distribution)
◦ Suppose the n-th most frequent query appears with probability p(n) ∝ 1/n^α (Zipf's distribution)
◦ Let s be the number of queries submitted per second; the expected time interval between occurrences of the n-th query is then t(n) = 1 / (s · p(n))
◦ We keep the n-th query in Q only if t(n) < t'
◦ Batch processing occurs every t' seconds
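A small sketch of this retention rule, assuming the Zipf exponent α, the total number of distinct queries, and the batch interval t' as parameters (their values are not given on the slide).

```python
def expected_interval(n, s, alpha=1.0, num_queries=10_000):
    """Expected seconds between occurrences of the n-th most frequent query,
    assuming query popularity follows a Zipf distribution with exponent alpha."""
    norm = sum(1.0 / k**alpha for k in range(1, num_queries + 1))
    p_n = (1.0 / n**alpha) / norm          # Zipf probability of the n-th query
    return 1.0 / (s * p_n)

def keep_in_query_set(n, s, batch_interval, alpha=1.0, num_queries=10_000):
    # Keep the n-th query in Q only if it is expected to recur before the
    # next batch-indexing pass, i.e. t(n) < t'.
    return expected_interval(n, s, alpha, num_queries) < batch_interval
```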
24
Dominant set ds(q_i, t)
◦ The tweets that rank higher than t for query q_i
Performance problems
◦ A full scan of the tweet set is required to compute the dominant set
◦ Each tweet must be tested against every query
25
Observation
◦ The ranks of the lower results are stable
◦ Replace the dominant-set computation with a comparison against the score of the query's K-th result
26
Compare a tweet only to similar queries
◦ A keyword-query table records which keywords (k1-k4) each query contains, plus a count of its keywords (e.g., Query 1 contains 2 of them, Query 4 contains 3)
◦ Given a tweet t and its keyword set, t is compared only against the queries that share its keywords – here Q1, Q3, and Q4
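Combining the last three slides, here is a hedged sketch of the distinguished-tweet test: keywords map to the queries containing them, and a new tweet is compared only against those queries' current K-th scores. The keyword sets, threshold values, and scoring callback are illustrative placeholders, not the paper's exact definitions.

```python
from collections import defaultdict

# Illustrative query keyword sets (not the slide's exact table).
queries = {"Q1": {"k1", "k3"}, "Q2": {"k2"}, "Q3": {"k1", "k4"}, "Q4": {"k2", "k3", "k4"}}

# keyword -> queries that contain it (built from the query set Q)
keyword_to_queries = defaultdict(set)
for q, kws in queries.items():
    for kw in kws:
        keyword_to_queries[kw].add(q)

# score of each query's current K-th result (the keyword-threshold idea); values illustrative
kth_score = {"Q1": 0.42, "Q2": 0.77, "Q3": 0.18, "Q4": 0.55}

def is_distinguished(tweet_keywords, score_for_query):
    """A tweet is distinguished if it would enter the top-K of some candidate query."""
    candidates = set()
    for kw in tweet_keywords:
        candidates |= keyword_to_queries[kw]          # only queries sharing keywords
    return any(score_for_query(q) > kth_score[q] for q in candidates)
```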
27
New tweets categorized as distinguished are indexed immediately:
1. If the tweet belongs to an existing tweet tree, retrieve its parent tweet to get the root ID and generate the tree encoding; update the count field of the parent
2. Insert the tweet into the tweet data table
3. Insert the tweet into the inverted index
The main cost is updating the inverted index (one update per keyword in the tweet).
28
New tweets categorized as noisy are indexed later:
◦ Instead of updating the inverted index, the tweet is appended to a log file
◦ A batch indexing process periodically scans the log file and indexes the tweets found there
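A sketch of the two insertion paths, reusing the hypothetical helpers from the earlier sketches (`register_tweet`, `tweet_table`, `index_tweet`, `is_distinguished` are illustrative names from those sketches, not the paper's API; the log file location is likewise assumed).

```python
import json

LOG_FILE = "noisy_tweets.log"   # hypothetical log location

def insert_tweet(tid, rid, text, u_pagerank, timestamp, score_for_query):
    register_tweet(tid, rid, timestamp)          # steps 1-2: tree encoding + data table
    tweet = {"tid": tid, "rid": rid, "text": text,
             "pr": u_pagerank, "time": timestamp}
    if is_distinguished(set(text.lower().split()), score_for_query):
        # distinguished: pay the inverted-index update now (step 3)
        index_tweet(tid, text, u_pagerank, tweet_table[tid].tree, timestamp)
    else:
        # noisy: defer indexing, just append to the log
        with open(LOG_FILE, "a") as f:
            f.write(json.dumps(tweet) + "\n")

def batch_index():
    """Periodically index the deferred (noisy) tweets from the log."""
    with open(LOG_FILE) as f:
        for line in f:
            t = json.loads(line)
            index_tweet(t["tid"], t["text"], t["pr"], tweet_table[t["tid"]].tree, t["time"])
    open(LOG_FILE, "w").close()    # truncate the log after the batch pass
```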
29
Outline (next: Ranking)
30
“The ranking function must consider both the timestamp of the data and the similarity between the data and the query.”
◦ “The ranking function is composed of two independent factors, time and similarity.”
“The ranking function should be cost-efficient.”
31
The ranking functions are completely separate from the indexing mechanism
◦ New ranking functions could be used
TI's proposed ranking function is based on:
◦ The user's PageRank
◦ The popularity of the topic
◦ The timestamp
◦ The similarity between the tweet and the query
32
Twitter has two types of links between users
◦ f(u): the set of users who follow user u
◦ f⁻¹(u): the set of users whom user u follows
A matrix M_f[i][j] records the following links between users
Each user is given a weight factor: V = (w_1, w_2, …, w_n)
33
The PageRank formula is given as: P_u = V · M_f^x
So the user's PageRank is a combination of their user weight and how many followers they have
◦ The more popular the user, the higher the PageRank
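A minimal sketch of this computation, assuming M_f is a row-normalized follower matrix and x is a fixed number of propagation steps (both the normalization and the value of x are assumptions; the slide gives only P_u = V · M_f^x).

```python
import numpy as np

def u_pagerank(follow_matrix, weights, x=10):
    """Compute P_u = V · M_f^x.

    follow_matrix[i][j] = 1 if user i follows user j; each row is normalized so a
    user distributes its weight over the users it follows. weights is the vector V.
    """
    m = np.asarray(follow_matrix, dtype=float)
    row_sums = m.sum(axis=1, keepdims=True)
    m = np.divide(m, row_sums, out=np.zeros_like(m), where=row_sums > 0)
    p = np.asarray(weights, dtype=float)
    for _ in range(x):
        p = p @ m                      # one propagation step along follow links
    return p

# toy example: user 2 is followed by users 0 and 1, so it accumulates the most weight
follows = [[0, 0, 1],
           [0, 0, 1],
           [0, 0, 0]]
print(u_pagerank(follows, [1.0, 1.0, 1.0], x=1))
```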
34
Users can retweet or reply to tweets
◦ Popularity can be determined by looking at the largest tweet trees
◦ The popularity of a tree equals the sum of the U-PageRank values of all tweets in the tree
35
The similarity of a query q and a tweet t is computed as cosine similarity: sim(q, t) = (q · t) / (|q| |t|)
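A small sketch of this similarity over term-frequency vectors (the use of raw term frequencies, rather than e.g. TF-IDF weights, is an assumption).

```python
import math
from collections import Counter

def cosine_similarity(query_text, tweet_text):
    """sim(q, t) = (q · t) / (|q| |t|) over term-frequency vectors."""
    q = Counter(query_text.lower().split())
    t = Counter(tweet_text.lower().split())
    dot = sum(q[w] * t[w] for w in q)                      # q · t
    norm_q = math.sqrt(sum(v * v for v in q.values()))     # |q|
    norm_t = math.sqrt(sum(v * v for v in t.values()))     # |t|
    return dot / (norm_q * norm_t) if norm_q and norm_t else 0.0

print(cosine_similarity("obama speech", "obama gives a speech today"))
```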
36
The overall ranking score combines these components with weight factors:
◦ q.timestamp = query submission time
◦ tree.timestamp = timestamp of the tree that t belongs to (the timestamp of its root node)
◦ w_1, w_2, w_3 = weight factors for each component (all set to 1)
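The formula itself is not preserved in the transcript; the sketch below is only one plausible combination of the stated components (similarity, user PageRank plus topic popularity, and a time term) with w1 = w2 = w3 = 1. It reuses `cosine_similarity` from the previous sketch and should not be read as the paper's exact TI-Rank formula.

```python
import math

def ti_rank_sketch(tweet, query, query_timestamp,
                   w1=1.0, w2=1.0, w3=1.0, decay=1e-5):
    """Illustrative score: similarity + authority/popularity + freshness.

    `tweet` is a dict with keys 'text', 'u_pagerank', 'tree_popularity',
    'tree_timestamp'; these names and the exponential time decay are assumptions.
    """
    similarity = cosine_similarity(query, tweet["text"])
    authority = tweet["u_pagerank"] + tweet["tree_popularity"]
    freshness = math.exp(-decay * (query_timestamp - tweet["tree_timestamp"]))
    return w1 * similarity + w2 * authority + w3 * freshness
```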
37
The size of the inverted index limits the performance of tweet search
◦ The size of the inverted index grows with the number of tweets
To alleviate this problem, adaptive indexing is proposed:
38
The main idea:
◦ Iteratively read a block of the inverted index (rather than the entire list)
◦ Stop iterating over blocks once the timestamp value alone yields a score low enough to discard the remaining results
◦ Stop there because the rest of the tweets in the inverted index will have an even lower score
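A hedged sketch of this early-termination scan, assuming the posting blocks are ordered newest-first and that the time component gives an upper bound on any remaining tweet's score; `score_tweet` and `time_upper_bound` are illustrative callbacks, not the paper's API.

```python
def adaptive_scan(blocks, score_tweet, time_upper_bound, k):
    """Scan newest-first posting blocks; stop once no remaining tweet can beat the
    current K-th best score. `blocks` is a list of posting-entry lists."""
    results = []
    for block in blocks:
        for entry in block:
            results.append((score_tweet(entry), entry))
        results.sort(key=lambda pair: pair[0], reverse=True)
        results = results[:k]
        if len(results) == k:
            threshold = results[-1][0]
            # The oldest entry in this block bounds the score of every tweet in
            # the remaining (older) blocks via its time component alone.
            if time_upper_bound(block[-1]) < threshold:
                break
    return results
```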
39
Outline (next: Evaluation)
40
Evaluation performed on a real dataset
◦ Dataset collected over 3 years (October 2006 to November 2009)
◦ 500 random users picked as seeds (from which other users are integrated into the social graph)
◦ 465,000 total users
◦ 25,000,000 total tweets
Experiments typically run for 10 days
◦ 5 days of training, 5 days measuring performance
41
Query lengths are distributed as follows:
◦ ~60% are 1 word
◦ ~30% are 2 words
◦ ~10% are more than 2 words
Queries are submitted at random times; tweets are inserted into the system according to their original timestamps (from the dataset)
43
TimeBased represents ranking by tweet timestamp only (as Google does)
44
Outline (next: Conclusion)
45
Current search engines are unable to index social networking data in real time
◦ TI provides an adaptive indexing mechanism that reduces update cost
◦ TI provides a cost-efficient and effective ranking function
◦ Successful evaluation using a real dataset from Twitter