SEARCHING THE BLOGOSPHERE

1 SEARCHING THE BLOGOSPHERE
Nilesh Bansal, Nick Koudas, University of Toronto

2 BLOGOSPHERE

3

4 67M KNOWN BLOGS 100K NEW EVERY DAY DOUBLING EVERY 200 DAYS

5 WHAT ARE THEY WRITING ABOUT??
PERSONAL LIFE PRODUCT REVIEWS POLITICS TECHNOLOGY TOURISM SPORTS ENTERTAINMENT

6 WHY SHOULD WE CARE?

7 EXTRACT PUBLIC OPINION
HUGE DATA REPOSITORY WILL CONTINUE TO GROW EXTRACT PUBLIC OPINION VALUABLE INSIGHTS
The blogosphere is a huge repository of human-generated content. As our lives become increasingly dependent on the internet, this repository is expected to grow. Actionable information about a variety of topics can be extracted from this data.

8 KEY INSIGHTS MARKET RESEARCH PUBLIC RELATIONS STRATEGIES CUSTOMER OPINION TRACKING

9 CHALLENGES AND OPPORTUNITIES

10 HUGE AMOUNTS OF UNSTRUCTURED TEXT

11

12 MACHINE CREATED WEBLOGS MORE THAN HALF OF BLOGSPOT IS SPAM
33% OF WEBSPAM HOSTED AT BLOGSPOT

13 TEMPORAL DIMENSION

14 GEOGRAPHICAL ASSOCIATION

15 CONVERSATION

16 Gruhl et al., The Predictive Power of Online Chatter, KDD 2005
Kumar et al., On the Bursty Evolution of Blogspace, WWW 2003
Chi et al., Eigen-trend: trend analysis in the blogosphere based on singular value decompositions, CIKM 2006
Mishne et al., MoodViews: Tools for Blog Mood Analysis, AAAI-CAAW 2006
Mei et al., Topic sentiment mixture: modeling facets and opinions in weblogs, WWW 2007

17 BLOGSCOPE

18

19 AGGREGATION AND PREPROCESSING INTERACTIVE SEARCH AND ANALYSIS
CRAWLER RUNNING 24x7 TRACKING 9M BLOGS INDEXING 70M ARTICLES

20 ANY STREAMING TEXT SOURCE
NEWS MAILING LISTS FORUMS SOCIAL MEDIA

21 Hot Keywords

22 Geo Search Related Terms Search Results Popularity Curve

23 Taiwan Undersea Earthquake Sumatra Earthquake Hawaii Earthquake

24 December March

25 IPHONE ON JAN

26 Curves are usually correlated, except at one point

27 TECHNIQUES

28 250 THOUSAND NEW POSTS DAILY PING SERVER: WEBLOGS.COM
CRAWLS RSS FEEDS

29 LINK BASED ANALYSIS IS NOT EFFECTIVE SPAMMERS ARE INTELLIGENT
WE USE HEURISTICS ONGOING BATTLE
[Wang et al.] Spam Double-Funnel: Connecting Web Spammers with Advertisers, WWW 2007
[Gyöngyi et al.] Combating Web Spam with TrustRank, VLDB 2004
[Kolari et al.] Detecting Spam Blogs: A Machine Learning Approach, AAAI 2006

30 INTERACTIVE APPLICATION TWO SECOND RESPONSE TIME HUGE AMOUNTS OF DATA
SEVEN THOUSAND UNIQUE IP ADDRESSES DAILY SCALABILITY

31

32 BURST DETECTION
[Kleinberg] Bursty and Hierarchical Structure in Streams, DMKD 2003
[Fung et al.] Parameter Free Bursty Events Detection in Text Streams, VLDB 2005

33 POPULARITY = BASE + ZERO MEAN GAUSSIAN BURST = STATISTICAL OUTLIER
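
The model on this slide can be illustrated with a minimal sketch. It assumes the base popularity is estimated by the mean of the daily counts and the Gaussian noise by their sample standard deviation, and it flags a day as a burst when its count exceeds the base by more than k standard deviations. The function name `find_bursts`, the threshold `k`, and the sample counts are hypothetical and are not taken from the presentation.

```python
from statistics import mean, stdev

def find_bursts(counts, k=2.0):
    """Return indices of days whose count exceeds base + k * sigma.

    Assumption: base is the mean of the series and sigma its sample
    standard deviation; the slides do not say how either is estimated.
    """
    base = mean(counts)
    sigma = stdev(counts)
    return [i for i, c in enumerate(counts) if c > base + k * sigma]

# Hypothetical daily post counts for one keyword; day 5 is a spike.
daily_counts = [10, 12, 9, 11, 10, 48, 11, 10, 9, 12]
print(find_bursts(daily_counts))
```

A more robust variant would estimate the base from a window that excludes the day being tested, since a large spike inflates both the mean and the standard deviation.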

34 IDENTIFYING RELATED TERMS

35 POINTWISE MUTUAL INFORMATION
COLLOCATIONS EXPENSIVE
[Ott and Longnecker] An Introduction to Statistical Methods and Data Analysis
[Manning and Schütze] Foundations of Statistical Natural Language Processing
[Church and Hanks] Word Association Norms, Mutual Information, and Lexicography, ACL 1989
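
For reference, pointwise mutual information between two terms x and y is log(P(x, y) / (P(x) P(y))). The sketch below estimates those probabilities from document-level co-occurrence, treating each document as a set of terms. The function name `pmi` and the toy documents are hypothetical. Scanning every document for every candidate pair is what makes the exact computation expensive.

```python
import math

def pmi(docs, x, y):
    """Pointwise mutual information (base 2) of terms x and y.

    docs is a list of sets of terms. Probabilities are estimated as the
    fraction of documents containing a term (or both terms).
    """
    n = len(docs)
    n_x = sum(1 for d in docs if x in d)
    n_y = sum(1 for d in docs if y in d)
    n_xy = sum(1 for d in docs if x in d and y in d)
    if n_xy == 0:
        return float("-inf")  # the terms never co-occur
    return math.log2((n_xy * n) / (n_x * n_y))

# Hypothetical toy corpus.
docs = [
    {"iphone", "apple"},
    {"iphone", "apple", "jobs"},
    {"apple", "pie"},
    {"earthquake", "taiwan"},
]
print(pmi(docs, "earthquake", "taiwan"))
print(pmi(docs, "iphone", "apple"))
```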

36 FAST COMPUTATION OF RELATED TERMS RANDOM SAMPLE
MUTUAL INFORMATION IN EXPECTATION USE TF WITH PRECOMPUTED IDF
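
One way to read this slide is as follows: draw a random sample of the documents matching the query, count term frequencies in the sample only, and weight each count by an IDF value computed ahead of time. The sketch below follows that reading. It is an interpretation, not the BlogScope implementation, and the function name `related_terms`, its parameters, and the IDF values are hypothetical.

```python
import random
from collections import Counter

def related_terms(result_docs, idf, sample_size=100, top=5, seed=0):
    """Rank terms by sampled term frequency times precomputed IDF.

    result_docs: list of token lists for documents matching the query.
    idf: dict mapping term -> precomputed inverse document frequency.
    """
    rng = random.Random(seed)
    if len(result_docs) > sample_size:
        sample = rng.sample(result_docs, sample_size)
    else:
        sample = result_docs
    tf = Counter(term for doc in sample for term in doc)
    scores = {t: count * idf.get(t, 0.0) for t, count in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Hypothetical query results and IDF table.
result_docs = [
    ["iphone", "apple", "the"],
    ["iphone", "jobs", "the"],
    ["iphone", "apple", "the"],
]
idf = {"the": 0.01, "iphone": 2.0, "apple": 2.5, "jobs": 4.0}
print(related_terms(result_docs, idf, top=2))
```

Because only the sample is scanned, the cost depends on the sample size rather than on the number of matching documents.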

37 COMPUTING HOT KEYWORDS

38 POPULAR DOES NOT MEAN HOT INTERESTING = SURPRISING
MIXTURE OF DIFFERENT SCORING FUNCTIONS DEVIATION FROM EXPECTED
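
The distinction between popular and hot can be sketched with a single scoring function: how far today's count deviates from the expected count, measured in standard deviations of the keyword's own history. The slide mentions a mixture of scoring functions; this shows only one plausible member of such a mixture, and the names `hotness` and `hot_keywords` and all counts are hypothetical.

```python
from statistics import mean, pstdev

def hotness(history, today):
    """Deviation of today's count from the historical mean, in std devs."""
    expected = mean(history)
    spread = pstdev(history) or 1.0  # avoid dividing by zero on flat history
    return (today - expected) / spread

def hot_keywords(histories, today_counts, top=3):
    """Return keywords ordered by how surprising today's count is."""
    scores = {
        word: hotness(history, today_counts.get(word, 0))
        for word, history in histories.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Hypothetical counts: "weather" is popular but steady, "iphone" spikes.
histories = {
    "weather": [900, 950, 920, 930],
    "iphone": [20, 25, 22, 21],
}
today_counts = {"weather": 940, "iphone": 400}
print(hot_keywords(histories, today_counts))
```

Here "weather" has the higher raw count but "iphone" ranks first, because its count is far outside what its history predicts.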

39 INTELLIGENT ALERT SERVICE
BURST SYNOPSIS AUTHORITATIVE RANKING

40 JUST THE BEGINNING
Nilesh Bansal, Fei Chiang, Nick Koudas, Frank Wm. Tompa, Seeking Stable Clusters in the Blogosphere, to appear in VLDB 2007.
Nilesh Bansal, Nick Koudas, BlogScope: System for Online Analysis of High Volume Text Streams, to appear in VLDB 2007 (Demonstration Proposal).

41 THANK YOU. QUESTIONS? Source: xkcd.com

