1
SEARCHING THE BLOGOSPHERE
Nilesh Bansal Nick Koudas University of Toronto
2
BLOGOSPHERE
4
67M KNOWN BLOGS
100K NEW EVERY DAY
DOUBLING EVERY 200 DAYS
5
WHAT ARE THEY WRITING ABOUT?
PERSONAL LIFE
PRODUCT REVIEWS
POLITICS
TECHNOLOGY
TOURISM
SPORTS
ENTERTAINMENT
6
WHY SHOULD WE CARE?
7
EXTRACT PUBLIC OPINION
HUGE DATA REPOSITORY
WILL CONTINUE TO GROW
VALUABLE INSIGHTS
The blogosphere is a huge repository of human-generated content. As our lives become increasingly dependent on the internet, this repository is expected to grow. Actionable information on a variety of topics can be extracted from this data.
8
KEY INSIGHTS
MARKET RESEARCH
PUBLIC RELATIONS STRATEGIES
CUSTOMER OPINION TRACKING
9
CHALLENGES AND OPPORTUNITIES
10
HUGE AMOUNTS OF UNSTRUCTURED TEXT
12
MACHINE-CREATED WEBLOGS
MORE THAN HALF OF BLOGSPOT IS SPAM
33% OF WEB SPAM HOSTED AT BLOGSPOT
13
TEMPORAL DIMENSION
14
GEOGRAPHICAL ASSOCIATION
15
CONVERSATION
16
Gruhl et al., The Predictive Power of Online Chatter, KDD 2005
Kumar et al., On the Bursty Evolution of Blogspace, WWW 2003
Chi et al., Eigen-Trend: Trend Analysis in the Blogosphere Based on Singular Value Decompositions, CIKM 2006
Mishne et al., MoodViews: Tools for Blog Mood Analysis, AAAI-CAAW 2006
Mei et al., Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs, WWW 2007
17
BLOGSCOPE
19
AGGREGATION AND PREPROCESSING INTERACTIVE SEARCH AND ANALYSIS
CRAWLER RUNNING 24x7
TRACKING 9M BLOGS
INDEXING 70M ARTICLES
20
ANY STREAMING TEXT SOURCE
NEWS
MAILING LISTS
FORUMS
SOCIAL MEDIA
21
Hot Keywords
22
Geo Search
Related Terms
Search Results
Popularity Curve
23
Taiwan Undersea Earthquake
Sumatra Earthquake
Hawaii Earthquake
24
December March
25
IPHONE ON JAN
26
Curves are usually correlated, except at one point
27
TECHNIQUES
28
CRAWLS RSS FEEDS
250 THOUSAND NEW POSTS DAILY
PING SERVER: WEBLOGS.COM
29
LINK-BASED ANALYSIS IS NOT EFFECTIVE
SPAMMERS ARE INTELLIGENT
WE USE HEURISTICS
ONGOING BATTLE
[Wang et al.] Spam Double-Funnel: Connecting Web Spammers with Advertisers, WWW 2007
[Gyöngyi et al.] Combating Web Spam with TrustRank, VLDB 2004
[Kolari et al.] Detecting Spam Blogs: A Machine Learning Approach, AAAI 2006
30
INTERACTIVE APPLICATION
TWO-SECOND RESPONSE TIME
HUGE AMOUNTS OF DATA
SEVEN THOUSAND UNIQUE IP ADDRESSES DAILY
SCALABILITY
32
BURST DETECTION
[Kleinberg] Bursty and Hierarchical Structure in Streams, DMKD 2003
[Fung et al.] Parameter Free Bursty Events Detection in Text Streams, VLDB 2005
33
POPULARITY = BASE + ZERO-MEAN GAUSSIAN
BURST = STATISTICAL OUTLIER
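The model on this slide can be illustrated with a minimal Python sketch. It assumes the base level and the noise scale are estimated by the sample mean and standard deviation of a keyword's daily post counts, and that a day counts as a burst when it lies more than `threshold` standard deviations above the base. The function name `find_bursts`, the threshold of 2.0, and the sample counts are hypothetical; the slides do not say how BlogScope estimates these quantities.

```python
from statistics import mean, stdev

def find_bursts(counts, threshold=2.0):
    """Return indices of days whose count is a high statistical outlier.

    Model: count = base + zero-mean Gaussian noise. The base and the noise
    scale are estimated by the sample mean and sample standard deviation
    (an assumption of this sketch). Needs at least two data points.
    """
    base = mean(counts)
    sigma = stdev(counts)
    if sigma == 0:
        return []  # perfectly flat series: nothing deviates from the base
    return [i for i, c in enumerate(counts) if (c - base) / sigma > threshold]

# Hypothetical daily post counts for one keyword; day 7 is a spike.
daily_posts = [100, 104, 98, 101, 97, 103, 99, 250, 102, 100]
print(find_bursts(daily_posts))  # [7]
```

A single large spike inflates both the mean and the standard deviation, so a more robust estimate of the base (for example a median over a sliding window) may behave better on real data.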
34
IDENTIFYING RELATED TERMS
35
POINTWISE MUTUAL INFORMATION
COLLOCATIONS
EXPENSIVE
[Ott and Longnecker] An Introduction to Statistical Methods and Data Analysis
[Manning and Schütze] Foundations of Statistical Natural Language Processing
[Church and Hanks] Word Association Norms, Mutual Information and Lexicography, ACL 1989
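For reference, pointwise mutual information between two terms x and y is log( P(x, y) / (P(x) P(y)) ). The sketch below estimates the probabilities as document frequencies over a small in-memory collection; the function name `pmi`, the base-2 logarithm, and the toy documents are assumptions made for illustration. It also hints at why the slide calls the measure expensive: each call scans every document, and finding related terms means repeating this for every candidate term.

```python
import math

def pmi(docs, x, y):
    """PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ).

    Probabilities are estimated as the fraction of documents that contain
    the term (or both terms). Each document is a set of tokens.
    """
    n = len(docs)
    n_x = sum(1 for d in docs if x in d)
    n_y = sum(1 for d in docs if y in d)
    n_xy = sum(1 for d in docs if x in d and y in d)
    if n_xy == 0:
        return float("-inf")  # the terms never co-occur
    # (n_xy / n) / ((n_x / n) * (n_y / n)) simplifies to n_xy * n / (n_x * n_y)
    return math.log2(n_xy * n / (n_x * n_y))

# Hypothetical toy collection.
docs = [
    {"iphone", "apple", "launch"},
    {"iphone", "apple"},
    {"apple", "pie"},
    {"earthquake", "taiwan"},
]
print(pmi(docs, "iphone", "apple"))   # positive: they co-occur more than chance
print(pmi(docs, "iphone", "taiwan"))  # -inf: they never co-occur
```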
36
FAST COMPUTATION OF RELATED TERMS
RANDOM SAMPLE
MUTUAL INFORMATION IN EXPECTATION
USE TF WITH PRECOMPUTED IDF
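One way to read this slide is sketched below: draw a random sample of the documents that match the query, count term frequencies (TF) in the sample only, and weight each count by an inverse document frequency (IDF) looked up from a table computed ahead of time. This is an illustration under stated assumptions, not BlogScope's actual implementation; `related_terms`, its parameters, and the IDF values are hypothetical.

```python
import random
from collections import Counter

def related_terms(matching_docs, idf, query, sample_size=100, top_k=3, seed=0):
    """Rank terms that co-occur with `query` by sampled TF times precomputed IDF.

    matching_docs: list of token lists, each already known to contain `query`.
    idf: dict mapping term -> precomputed inverse document frequency.
    """
    rng = random.Random(seed)
    if len(matching_docs) > sample_size:
        sample = rng.sample(matching_docs, sample_size)
    else:
        sample = list(matching_docs)

    tf = Counter()
    for doc in sample:
        tf.update(doc)

    # Terms missing from the IDF table get weight 0 (an assumption of this sketch).
    scores = {t: c * idf.get(t, 0.0) for t, c in tf.items() if t != query}
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]

# Hypothetical posts matching the query "iphone" and a made-up IDF table.
posts = [
    ["iphone", "apple", "the"],
    ["iphone", "apple", "launch", "the"],
    ["iphone", "the", "macworld"],
]
idf_table = {"the": 0.1, "apple": 2.0, "launch": 2.5, "macworld": 4.0, "iphone": 3.0}
print(related_terms(posts, idf_table, "iphone"))  # ['apple', 'macworld', 'launch']
```

The cost depends on the sample size rather than on the number of matching posts, which is what makes the approach suitable for an interactive application.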
37
COMPUTING HOT KEYWORDS
38
POPULAR DOES NOT MEAN HOT
INTERESTING = SURPRISING
MIXTURE OF DIFFERENT SCORING FUNCTIONS
DEVIATION FROM EXPECTED
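The difference between popular and hot can be shown with a short sketch that scores each keyword by how far today's count deviates from its expected count, measured in standard deviations of its own history. Only this one scoring function is shown; the slide mentions a mixture of several, which are not specified. The names `surprise` and `rank_hot` and the counts are hypothetical.

```python
from statistics import mean, stdev

def surprise(history, today):
    """Deviation of today's count from the expected count, in standard deviations."""
    expected = mean(history)
    spread = stdev(history) or 1.0  # assumption: fall back to 1.0 for flat histories
    return (today - expected) / spread

def rank_hot(keywords):
    """Order keywords from most to least surprising.

    keywords: dict mapping keyword -> (list of past daily counts, today's count).
    """
    return sorted(keywords, key=lambda k: surprise(*keywords[k]), reverse=True)

# Hypothetical counts: "the" is far more popular, but only "iphone" is surprising.
keywords = {
    "the": ([1000, 1010, 990, 1005], 1002),
    "iphone": ([20, 25, 18, 22], 400),
}
print(rank_hot(keywords))  # ['iphone', 'the']
```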
39
INTELLIGENT ALERT SERVICE
BURST SYNOPSIS
AUTHORITATIVE RANKING
40
JUST THE BEGINNING
Nilesh Bansal, Fei Chiang, Nick Koudas, Frank Wm. Tompa, Seeking Stable Clusters in the Blogosphere, to appear in VLDB 2007.
Nilesh Bansal, Nick Koudas, BlogScope: System for Online Analysis of High Volume Text Streams, to appear in VLDB 2007 (Demonstration Proposal).
41
THANK YOU. QUESTIONS? Source: xkcd.com