SEARCHING THE BLOGOSPHERE

Presentation transcript:

SEARCHING THE BLOGOSPHERE Nilesh Bansal Nick Koudas University of Toronto

BLOGOSPHERE

67M KNOWN BLOGS 100K NEW EVERY DAY DOUBLING EVERY 200 DAYS

WHAT ARE THEY WRITING ABOUT? PERSONAL LIFE PRODUCT REVIEWS POLITICS TECHNOLOGY TOURISM SPORTS ENTERTAINMENT

WHY SHOULD WE CARE?

HUGE DATA REPOSITORY WILL CONTINUE TO GROW EXTRACT PUBLIC OPINION VALUABLE INSIGHTS The blogosphere is a huge repository of human-generated content. As our lives become more dependent on the internet, this repository is expected to keep growing. Actionable information about a variety of topics can be extracted from this data.

KEY INSIGHTS MARKET RESEARCH PUBLIC RELATIONS STRATEGIES CUSTOMER OPINION TRACKING

CHALLENGES AND OPPORTUNITIES

HUGE AMOUNTS OF UNSTRUCTURED TEXT

MACHINE CREATED WEBLOGS MORE THAN HALF OF BLOGSPOT IS SPAM 33% OF WEBSPAM HOSTED AT BLOGSPOT

TEMPORAL DIMENSION

GEOGRAPHICAL ASSOCIATION

CONVERSATION

Gruhl et al., The Predictive Power of Online Chatter, KDD 2005
Kumar et al., On the Bursty Evolution of Blogspace, WWW 2003
Chi et al., Eigen-Trend: Trend Analysis in the Blogosphere Based on Singular Value Decompositions, CIKM 2006
Mishne et al., MoodViews: Tools for Blog Mood Analysis, AAAI-CAAW 2006
Mei et al., Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs, WWW 2007

BLOGSCOPE

AGGREGATION AND PREPROCESSING INTERACTIVE SEARCH AND ANALYSIS CRAWLER RUNNING 24x7 TRACKING 9M BLOGS INDEXING 70M ARTICLES

ANY STREAMING TEXT SOURCE NEWS MAILING LISTS FORUMS SOCIAL MEDIA

www.blogscope.net Hot Keywords

Geo Search Related Terms Search Results Popularity Curve

Taiwan Undersea Earthquake Sumatra Earthquake Hawaii Earthquake

December 15 2006 March 06 2007

IPHONE ON JAN 09 2007

Curves are usually correlated, except at one point

TECHNIQUES

250 THOUSAND NEW POSTS DAILY PING SERVER: WEBLOGS.COM CRAWLS RSS FEEDS

LINK-BASED ANALYSIS IS NOT EFFECTIVE SPAMMERS ARE INTELLIGENT WE USE HEURISTICS ONGOING BATTLE
[Wang et al.] Spam Double-Funnel: Connecting Web Spammers with Advertisers, WWW 2007
[Gyöngyi et al.] Combating Web Spam with TrustRank, VLDB 2004
[Kolari et al.] Detecting Spam Blogs: A Machine Learning Approach, AAAI 2006

INTERACTIVE APPLICATION TWO SECOND RESPONSE TIME HUGE AMOUNTS OF DATA SEVEN THOUSAND UNIQUE IP ADDRESSES DAILY SCALABILITY

BURST DETECTION
[Kleinberg] Bursty and Hierarchical Structure in Streams, DMKD 2003
[Fung et al.] Parameter Free Bursty Events Detection in Text Streams, VLDB 2005

POPULARITY = BASE + ZERO MEAN GAUSSIAN BURST = STATISTICAL OUTLIER
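The outlier view of a burst on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the BlogScope implementation: it assumes the base popularity is estimated by the sample mean of the daily counts, the Gaussian noise by their sample standard deviation, and a day counts as a burst when it exceeds the mean by more than k standard deviations. The function name `find_bursts` and the threshold `k` are hypothetical.

```python
from statistics import mean, stdev

def find_bursts(counts, k=2.0):
    """Return indices of days whose count is a high statistical outlier.

    Assumption: popularity = base + zero-mean Gaussian noise, where the
    base is estimated by the mean and the noise scale by the standard
    deviation of the observed daily counts.
    """
    if len(counts) < 2:
        return []
    base = mean(counts)
    sigma = stdev(counts)
    if sigma == 0:
        # A perfectly flat curve has no outliers.
        return []
    return [day for day, c in enumerate(counts) if c > base + k * sigma]

# Hypothetical daily post counts for one keyword; day 5 is the spike.
daily_counts = [10, 12, 9, 11, 10, 50, 10, 11, 9, 10]
bursty_days = find_bursts(daily_counts)
```

A large spike inflates both the estimated base and the estimated noise, so a more robust variant might use the median and a trimmed deviation instead; that choice is left open here.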

IDENTIFYING RELATED TERMS

COLLOCATIONS POINTWISE MUTUAL INFORMATION EXPENSIVE
[Ott and Longnecker] An Introduction to Statistical Methods and Data Analysis
[Manning and Schütze] Foundations of Statistical Natural Language Processing
[Church and Hanks] Word Association Norms, Mutual Information and Lexicography, ACL 1989
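For reference, pointwise mutual information between two terms x and y is log( P(x, y) / (P(x) P(y)) ). The sketch below estimates the probabilities from document-level co-occurrence, which is one common choice and an assumption here; the slides do not say how BlogScope estimates them. The function name `pmi` and the toy documents are hypothetical.

```python
import math

def pmi(docs, x, y):
    """Pointwise mutual information of terms x and y, in bits.

    docs is a list of sets of terms. Probabilities are estimated as the
    fraction of documents containing a term (or both terms).
    """
    n = len(docs)
    n_x = sum(1 for d in docs if x in d)
    n_y = sum(1 for d in docs if y in d)
    n_xy = sum(1 for d in docs if x in d and y in d)
    if n_xy == 0:
        # Terms never co-occur: PMI is negative infinity by convention.
        return float("-inf")
    return math.log2((n_xy / n) / ((n_x / n) * (n_y / n)))

# Toy corpus: each post is reduced to its set of terms.
posts = [
    {"iphone", "apple"},
    {"iphone", "apple"},
    {"apple", "pie"},
    {"earthquake", "taiwan"},
]
score = pmi(posts, "iphone", "apple")  # positive: they co-occur more than chance
```

Every call scans all documents, so scoring every candidate term against a query this way is costly, which is the point of the EXPENSIVE label on the slide.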

FAST COMPUTATION OF RELATED TERMS RANDOM SAMPLE MUTUAL INFORMATION IN EXPECTATION USE TF WITH PRECOMPUTED IDF
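One way to read this slide is: draw a random sample of the posts that match the query, count term frequencies in the sample only, and weight each term by an IDF value computed ahead of time. The sketch below follows that reading. It is an assumption-laden illustration, not the authors' algorithm; the function name `related_terms`, the parameters `sample_size`, `top_k`, and `seed`, and the toy IDF table are all hypothetical.

```python
import random
from collections import Counter

def related_terms(matching_docs, idf, query, sample_size=100, top_k=3, seed=0):
    """Rank terms related to query by sampled TF times precomputed IDF.

    matching_docs: list of token lists for posts that match the query.
    idf: dict mapping term to a precomputed inverse document frequency.
    """
    rng = random.Random(seed)
    if len(matching_docs) > sample_size:
        sample = rng.sample(matching_docs, sample_size)
    else:
        sample = matching_docs
    # Term frequency over the sample only, skipping the query term itself.
    tf = Counter(t for doc in sample for t in doc if t != query)
    # Terms missing from the IDF table get weight 0 (assumption).
    scores = {t: count * idf.get(t, 0.0) for t, count in tf.items()}
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]

docs = [
    ["iphone", "apple", "the"],
    ["iphone", "apple", "the"],
    ["iphone", "jobs", "the"],
]
toy_idf = {"apple": 3.0, "jobs": 4.0, "the": 0.1}
top = related_terms(docs, toy_idf, "iphone", top_k=2)
```

Sampling bounds the work per query regardless of how many posts match, which fits the two-second response target mentioned earlier in the deck.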

COMPUTING HOT KEYWORDS

POPULAR DOES NOT MEAN HOT INTERESTING = SURPRISING MIXTURE OF DIFFERENT SCORING FUNCTIONS DEVIATION FROM EXPECTED
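The difference between popular and hot can be shown with a single deviation-from-expected score. This is a sketch under stated assumptions: the expected count is the mean of a keyword's recent history, and the deviation is scaled by the historical standard deviation plus a smoothing constant of 1.0. The slide mentions a mixture of scoring functions; only one is shown here, and the names `hotness` and `hot_keywords` are hypothetical.

```python
from statistics import mean, pstdev

def hotness(history, today):
    """Deviation of today's count from the expected (historical mean) count."""
    expected = mean(history)
    spread = pstdev(history)
    # The +1.0 smoothing keeps rare, flat keywords from dividing by zero.
    return (today - expected) / (spread + 1.0)

def hot_keywords(history_by_term, today_by_term, top_k=2):
    """Rank keywords by how surprising today's count is."""
    scores = {
        term: hotness(history, today_by_term.get(term, 0))
        for term, history in history_by_term.items()
    }
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]

# "the" is popular every day; "iphone" is rare until today.
history = {"the": [1000, 1010, 990, 1000], "iphone": [5, 6, 4, 5]}
today = {"the": 1005, "iphone": 60}
hottest = hot_keywords(history, today, top_k=1)
```

Here "the" has by far the larger raw count, but its count is exactly what history predicts, so it scores low; "iphone" is far above its own baseline and ranks first.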

INTELLIGENT ALERT SERVICE BURST SYNOPSIS AUTHORITATIVE RANKING

JUST THE BEGINNING Nilesh Bansal, Fei Chiang, Nick Koudas, Frank Wm. Tompa, Seeking Stable Clusters in the Blogosphere, to appear in VLDB 2007. Nilesh Bansal, Nick Koudas, BlogScope: System for Online Analysis of High Volume Text Streams, to appear in VLDB 2007 (Demonstration Proposal).

THANK YOU. QUESTIONS? Source: xkcd.com