1 Discovering Web Communities in the Blogspace Ying Zhou, Joseph Davis (HICSS 2007)

2 Introduction
• Average Internet users are becoming increasingly active in publishing on the web.
• Many loosely organized communities of bloggers have formed around common interests.
• A prominent feature of those communities is the speed and accuracy with which they capture the latest discussion and information.

3 Introduction
• Internet users now look for the latest information not only on traditional news sites but also on weblogs.
• Many users also subscribe to particular weblogs and receive updates through web feeds.
• This is like joining a community and keeping track of what is going on in it.

4 Introduction
• We propose a new way of collecting and preparing data for community search that emphasizes gaining information on the strength of social ties between weblogs.
• The social tie strength data is used as key knowledge in discovering closely knit communities within a large set of interwoven weblogs.

5 Blog Community Formal Definition
• Four distinct subtypes of hyperlinks:
  - Blogroll: many bloggers list a few other blogs they read in the sidebar.
  - Permalinks: bloggers often cite or reference other bloggers' writing in their own posts.
  - Comments: bloggers may respond to others' writing through other channels.
  - Trackback: a protocol that a responder can use to notify an author that he or she has cited a particular article of that author in his or her own blog.

6 Blog Community Formal Definition
• We consider each hyperlink as a communication instance between two bloggers.
• The strength of a social tie is computed simply as the number of such instances between the two bloggers.
• The blogspace could grow to an unmanageable size.
• We are therefore more interested in its strongly connected sub-networks.
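
As a minimal sketch of the tie-strength count described on this slide, the snippet below tallies hyperlink instances per ordered (source, target) pair. The function name `tie_strengths` and the `*.example` URLs are hypothetical, and dropping self-links here is an assumption made for illustration.

```python
from collections import Counter

def tie_strengths(links):
    """Count hyperlink instances for each ordered (source, target) pair.

    `links` is an iterable of (source_blog, target_blog) tuples, one per
    hyperlink found (blogroll entry, permalink, comment, trackback).
    Self-links are skipped (an assumption for this sketch).
    """
    return Counter((src, dst) for src, dst in links if src != dst)

# Hypothetical crawl output: one tuple per hyperlink instance.
links = [
    ("a.example", "b.example"),  # blogroll entry
    ("a.example", "b.example"),  # permalink citation
    ("b.example", "a.example"),  # comment
    ("a.example", "a.example"),  # self-link, ignored
]
strengths = tie_strengths(links)
```

Each value in `strengths` is the number of communication instances from one blog to another, which is the quantity the slide uses as tie strength.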

7 Blog Community Formal Definition

8 G = ( V, E )

9 Blog Community Formal Definition
• The betweenness of a vertex in a graph is defined as the number of paths between other vertices that pass through this vertex.
• A vertex that falls on the communication paths between other vertices has the potential to control their communication.
• The vertex that sits on the largest number of communication paths plays a central role in the whole graph.
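
The slide does not say which paths are counted. The sketch below assumes the usual shortest-path (geodesic) betweenness and computes it with a Brandes-style accumulation on an unweighted directed graph. The function name and the tiny example graph are hypothetical, and this is not necessarily the authors' implementation.

```python
from collections import deque

def betweenness(adj):
    """Shortest-path betweenness for an unweighted directed graph.

    `adj` maps each vertex to a collection of distinct successors.
    When several shortest paths connect a pair, each path contributes
    a fractional share to the vertices it passes through.
    """
    nodes = set(adj) | {w for targets in adj.values() for w in targets}
    score = {v: 0.0 for v in nodes}
    for s in nodes:
        dist = {s: 0}                      # BFS distance from s
        sigma = {v: 0 for v in nodes}      # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in nodes}     # predecessors on shortest paths
        order = []                         # vertices in BFS visiting order
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj.get(v, ()):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in nodes}    # dependency of s on each vertex
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score

# Hypothetical chain a -> b -> c: only b lies between two other vertices.
scores = betweenness({"a": ["b"], "b": ["c"], "c": []})
```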

10 Blog Community Formal Definition
• Community centrality is a normalized metric that measures the extent to which a single weblog is more central than all other weblogs in a community.
• It is computed as the average difference between the relative centrality of the most central point v* and that of all other points.
• The most central point v* is determined by its betweenness score.
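
The slide gives the definition only in words. One way to write it down, assuming it follows Freeman's graph centralization with betweenness as the underlying measure, is sketched below. The symbols CC, C_B, C'_B and n are introduced here for illustration, and the directed-graph normalizer (n-1)(n-2) is an assumption.

```latex
% Relative (normalized) betweenness of vertex v in a community of n vertices,
% assuming a directed graph, where the largest possible score is (n-1)(n-2):
C'_B(v) = \frac{C_B(v)}{(n-1)(n-2)}

% Community centrality: average difference between the most central
% vertex v^* and every other vertex:
CC = \frac{\sum_{i=1}^{n} \left[ C'_B(v^*) - C'_B(v_i) \right]}{n-1}
```

Under these assumptions CC is 1 for a star-shaped community, where every path runs through v*, and 0 when all members have equal betweenness.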

11 Blog Community Formal Definition
• The hub acts like a single point of access to various sources, while the authority is like a very reliable source.
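
The slides do not state how hub and authority scores are computed. As an illustration only, the sketch below assumes the standard iterative scheme from Kleinberg's HITS: a vertex's authority score is the sum of the hub scores pointing at it, and its hub score is the sum of the authority scores it points to. The function name and the example graph are hypothetical.

```python
def hits(adj, iterations=50):
    """Iterative hub/authority scores for a directed graph.

    `adj` maps each vertex to a collection of successors.
    Scores are rescaled to unit Euclidean length after every round.
    """
    nodes = set(adj) | {w for targets in adj.values() for w in targets}
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority update: collect hub scores of in-neighbours.
        auth = {v: 0.0 for v in nodes}
        for v in nodes:
            for w in adj.get(v, ()):
                auth[w] += hub[v]
        # Hub update: collect authority scores of out-neighbours.
        hub = {v: sum(auth[w] for w in adj.get(v, ())) for v in nodes}
        # Normalise both score vectors (guarding against an all-zero vector).
        for scores in (auth, hub):
            norm = sum(x * x for x in scores.values()) ** 0.5 or 1.0
            for v in scores:
                scores[v] /= norm
    return hub, auth

# Hypothetical community: h links to x and y, which link to nothing.
hub, auth = hits({"h": ["x", "y"], "x": [], "y": []})
```

In this toy graph `h` ends up as the only hub, while `x` and `y` share the authority score equally.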

12 Methodology – Blog acquisition
• The starting point of weblog community identification is a connected blogspace with relatively similar topics.
• The weblog crawler takes a weblog URL as its seed and incrementally adds ties between weblogs to the collection.
• The crawler identifies four different sorts of hyperlinks pointing from a source weblog to a target weblog as indicators of ties from source to target.

13 Methodology – Blog acquisition
• The basic algorithm has two steps:
  1. Construct a large candidate set of hyperlinks from all web pages belonging to the source weblog.
  2. Remove hyperlinks pointing back to the source weblog and hyperlinks not pointing to weblogs.
• The remaining set consists of hyperlinks pointing to other weblogs.
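
The filtering step can be sketched as follows. This is an illustration under stated assumptions: links back to the source are detected by URL prefix (the next slide mentions prefix matching but does not say where it is applied), and `is_weblog` is a hypothetical predicate standing in for whatever weblog test the crawler uses. All URLs are made up.

```python
def outgoing_weblog_links(source_prefix, hyperlinks, is_weblog):
    """Keep only hyperlinks that point to other weblogs.

    source_prefix: URL prefix shared by all pages of the source weblog.
    hyperlinks:    candidate links collected from the source's pages.
    is_weblog:     hypothetical predicate telling whether a URL is a weblog.
    """
    kept = []
    for link in hyperlinks:
        if link.startswith(source_prefix):
            continue  # points back to the source weblog
        if not is_weblog(link):
            continue  # points to something that is not a weblog
        kept.append(link)
    return kept

# Hypothetical data for illustration.
known_blogs = ("http://alice.example/blog/", "http://bob.example/")
candidates = [
    "http://alice.example/blog/2006/01/post.html",  # self link
    "http://bob.example/2006/02/reply.html",        # another weblog
    "http://news.example/story",                    # not a weblog
]
result = outgoing_weblog_links(
    "http://alice.example/blog/",
    candidates,
    lambda url: url.startswith(known_blogs),
)
```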

14 Methodology – Blog acquisition
[Figure: crawl depth. Seed's depth = 0; maxDepth = 6; prefix matching.]
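
A depth-limited crawl with the parameters on this slide might look like the breadth-first sketch below. It assumes that depth counts hops between weblogs starting from the seed, which the slide does not spell out. `get_targets` is a hypothetical callback that returns the weblogs a given weblog links to.

```python
from collections import deque

def crawl(seed, get_targets, max_depth=6):
    """Breadth-first crawl from `seed`, expanding weblogs below `max_depth`.

    Returns (depth, edges): the depth at which each weblog was first
    reached, and one (source, target) record per tie found.
    Weblogs reached at `max_depth` are recorded but not expanded.
    """
    depth = {seed: 0}
    edges = []
    queue = deque([seed])
    while queue:
        blog = queue.popleft()
        if depth[blog] >= max_depth:
            continue
        for target in get_targets(blog):
            edges.append((blog, target))
            if target not in depth:
                depth[target] = depth[blog] + 1
                queue.append(target)
    return depth, edges

# Hypothetical link structure: a chain a -> b -> c -> d.
toy_links = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
depth, edges = crawl("a", lambda blog: toy_links[blog], max_depth=2)
```

With `max_depth=2` the crawl stops expanding at `c`, so `d` is never reached.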

15 Methodology – Blog acquisition
• The blogroll list usually appears on the homepage and on every single entry page.
• To avoid counting blogroll links multiple times, the blogrolls are identified during the homepage crawl and stored separately.

16 Methodology – Blog acquisition
• The crawling result is a collection of relational records with two fields: the source weblog URL and the target weblog URL.
• It is the relational representation of the resulting blogspace.
• It does not contain self-pointing edges.
• All edges are lines connecting two different vertices.

17 Methodology – Blog acquisition
• Outgoing ties: the other weblogs that this blogger cites, and all comments other bloggers made on his or her entries.
• Incoming links: links we cannot get from a weblog itself, including comments this blogger leaves on other weblogs and the links that reference or cite its posts.

18 Methodology – Blog community extraction
• In constructing the directed graph, first remove the multiple lines between two vertices.
• We adopt the island partitioning algorithm to extract blog communities.
• An island is defined as a small connected sub-network of size [min, max] with stronger internal line weights than the line weights to vertices outside the sub-network.
• Island partitioning is a hierarchical clustering method.
• [min, max] = [5, 50]
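
The slide does not describe the island algorithm itself. The sketch below is a simplified stand-in, not the authors' procedure. It keeps only lines whose weight is at least a threshold t and takes the connected components that remain: every line leaving such a component is weaker than t, so the component satisfies the island definition above. Thresholds are tried from weakest to strongest so that the largest admissible islands are found first. Treating lines as undirected is an assumption, and the function names are hypothetical.

```python
from collections import defaultdict

def components(vertices, edges):
    """Connected components of an undirected graph given as (u, v) pairs."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, result = set(), []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            x = stack.pop()
            comp.add(x)
            for y in neighbours[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        result.append(comp)
    return result

def threshold_islands(weights, min_size=5, max_size=50):
    """Extract islands from `weights`, a dict {(u, v): line weight}.

    Components found at a higher threshold are nested inside those found
    at a lower one, so an accepted island blocks its own sub-islands.
    """
    vertices = sorted({x for pair in weights for x in pair})
    taken, islands = set(), []
    for t in sorted(set(weights.values())):
        strong = [pair for pair, w in weights.items() if w >= t]
        for comp in components(vertices, strong):
            if min_size <= len(comp) <= max_size and not (comp & taken):
                islands.append(comp)
                taken |= comp
    return islands

# Hypothetical weights: two tight triangles joined by one weak line.
toy = {("a", "b"): 5, ("b", "c"): 5, ("a", "c"): 5,
       ("c", "d"): 1,
       ("d", "e"): 4, ("e", "f"): 4, ("d", "f"): 4}
found = threshold_islands(toy, min_size=3, max_size=3)
```

In the toy data the weak line between `c` and `d` separates the two triangles, each of which is returned as one island.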

19 Experimental Results - Savas.parastatidist.name

20 Experimental Results - Savas.parastatidist.name

21 Experimental Results - Savas.parastatidist.name
[Figure labels: center, hub, authority]

22 Experimental Results - Savas.parastatidist.name

23 Experimental Results -

24 Discussion
• Quality of the Results: the experiments demonstrate that our algorithm can extract communities with focused themes related to a particular topic.
• Clustering Feature: a recurring theme in community construction is that many communities are formed by members from the same hosting service.

25 Discussion
• Seeding Effect: the selection of the seed may affect the efficiency and quality of the communities returned. Blog communities discovered from different blogspaces may overlap.

26 Discussion
• Metrics Evaluation: the community center is the most stable measure in member ranking. It is determined by a vertex's betweenness score and can always accurately represent the vertex that sits on the most communication paths.