Published by Avery Boyle. Modified over 11 years ago.
1
Tracking Information Epidemics in Blogspace
A paper synopsis
Alistair Wright, Ken Tan, Kisan Kansagra, Jenn Houston
2
Contents
Introduction
Terminology
Spread of URLs
Inferring Infection Routes
Visualisation
Discussion
Conclusion
3
Introduction
What is a blog?
–First appeared in 1994
–Term "blog" coined by Peter Merholz in early 1999
–60 million blogs as of November 2006
Information is often republished by other blog users
4
Introduction
Blogs form a complex social structure
Propagation of information can be visualised as infection
Paper aims to track infection through blogspace and determine the original source
Most closely related work is on the spread of foot-and-mouth disease
5
Terminology
Meme
Infected
Patient zero
Infection inference
Infection tree
6
Spread of URLs
Infection: www.giantmicrobes.com
Data source: www.blogpulse.com
7
Spread of URLs
Do not expect every blog that mentions a given URL to have seen it at the source
Aim is to determine the infection source for any given blog
Most URLs appearing on blogs are free-floating
–Arrive from external channels; different URLs can point to the same page
Timelines and infection inference cannot guarantee a link, but they can rule out some possibilities and identify the most plausible one
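The timeline argument can be illustrated with a small sketch: a blog can only have caught a URL from blogs that mentioned it earlier, so everything else is ruled out. This is a minimal illustration, not the paper's code; the function name `plausible_sources`, the blog names, and the dates are all hypothetical.

```python
from datetime import date


def plausible_sources(first_mention, blog):
    """Return the blogs that mentioned the URL strictly before `blog` did.

    `first_mention` maps a blog name to the date of its first mention of
    the URL. Blogs that mentioned it on the same day or later are ruled out.
    """
    cutoff = first_mention[blog]
    return sorted(b for b, d in first_mention.items() if d < cutoff)


# Hypothetical first-mention dates for one URL.
first_mention = {
    "blog_a": date(2003, 5, 1),
    "blog_b": date(2003, 5, 3),
    "blog_c": date(2003, 5, 2),
}

print(plausible_sources(first_mention, "blog_b"))
```

The earliest blog has no plausible source in the data, which is consistent with it being either patient zero or infected through an external channel.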
8
Spread of URLs
Blogrolls
–Two-way links to other blogs (e.g. trackbacks)
–One user links to another's blog, which automatically links back to the original
Frequently find no explicit links to explain infection
–Via links are very rare
9
Inferring Infection Routes
Where explicit links are not present, use 5 classifiers to infer likely routes
–Number of blog-blog links in common
–Number of blog-non-blog links in common
–Text similarity
–Order and frequency of repeated infections
–In- and out-link counts for both blogs
10
Inferring Infection Routes
Classify blogs' likelihood of being linked based on similarity
–Blog-blog and blog-non-blog links: [formula not recovered]
–Textual similarity: Term Frequency-Inverse Document Frequency (TF-IDF) weighted vector
Features obtained from full text and differential text crawls
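A rough sketch of how two of these similarity features might be computed is given here. It makes several assumptions that are not in the slides: the link feature is a raw count of shared link targets, the TF-IDF weight is raw term frequency times log(N / document frequency), and the two weighted vectors are compared with cosine similarity. All function names and sample data are hypothetical.

```python
import math
from collections import Counter


def common_links(links_a, links_b):
    """Count the link targets that two blogs share."""
    return len(set(links_a) & set(links_b))


def tfidf_vectors(docs):
    """Map each blog name to a sparse {term: weight} TF-IDF vector.

    `docs` maps a blog name to its list of tokens. The weight used here
    (tf * log(N / df)) is one common variant, chosen as an assumption.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    vectors = {}
    for name, tokens in docs.items():
        term_freq = Counter(tokens)
        vectors[name] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical token lists for three blogs.
docs = {
    "blog_a": ["microbes", "plush", "toys"],
    "blog_b": ["microbes", "plush", "gift"],
    "blog_c": ["election", "news"],
}
vectors = tfidf_vectors(docs)
print(cosine(vectors["blog_a"], vectors["blog_b"]))
print(cosine(vectors["blog_a"], vectors["blog_c"]))
```

In this toy data the first two blogs share vocabulary and so score above zero, while the third shares none and scores exactly zero.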
11
Inferring Infection Routes
Similarity features are often useful in predicting the existence of a link
12
Inferring Infection Routes
Classify explicit links' likelihood of participating in infection
Infection six times more likely to happen again where it has happened previously
% Blog Pairs Citing 1 Common URL
Link type   Same   A > B   A < B   Either
A B         17.4   24.5            45
A B         10.9   22.9    17.0    36
None        0.6    1.5     1.3     3
13
Inferring Infection Routes
Likelihood of a link participating in infection is not generally related to the similarity of the blogs
14
Inferring Infection Routes
First link classifier, a three-class SVM, achieved only 57% accuracy
–Difficult to distinguish reciprocated and unreciprocated links
Second link classifier performed better
–SVM: 91.2% accuracy
–Logistic regression: 91.9% accuracy, but based on fewer factors
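To make the logistic regression step concrete, a minimal two-class link classifier is sketched here in plain Python. It is not the authors' implementation: the two features (a link-similarity score and a text-similarity score), the training pairs, the learning rate, and the epoch count are all hypothetical, and the model is fitted with simple batch gradient descent.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(X, y):
            z = sum(w * f for w, f in zip(weights, features)) + bias
            error = sigmoid(z) - label
            for j, f in enumerate(features):
                grad_w[j] += error * f
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_link(weights, bias, features):
    """Return True if the model thinks the blog pair is linked."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return sigmoid(z) >= 0.5


# Hypothetical blog pairs: [link similarity, text similarity] and
# whether a link actually exists between the two blogs (1) or not (0).
X = [[0.8, 0.7], [0.6, 0.9], [0.7, 0.5], [0.1, 0.2], [0.2, 0.1], [0.0, 0.3]]
y = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(X, y)
print([predict_link(weights, bias, features) for features in X])
```

The toy data is cleanly separable, so it says nothing about the accuracy figures quoted on the slide; it only shows the shape of the computation.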
15
Inferring Infection Routes
Additional classifiers were created for plausible infection routes from links
–Logistic regression: up to 77% accuracy
–SVM: up to 71.5% accuracy
Accuracy depended on which subset of classifiers was selected
16
Visualisation
From inferred routes, can construct infection trees
Directed Acyclic Graph (DAG) created for each URL
Thinned out to make it more manageable
Label each link with an inference score and dynamically control the display
17
Visualisation
Sparse Tree Algorithm:
For blog A and URL x, collect sets of blogs, B
–indicated by A as explicit sources of URL x
–explicitly linked to A and also infected by the common URL x
–with an unreciprocated link to A that were infected by URL x prior to A
–inferred by the classifier with timing restrictions
18
Visualisation
For each blog A infected by URL x and for the first non-empty set, draw a link to each blog B in that set
If more than one link exists between A and a previously infected blog, use the classifier score to remove all but the highest-scoring link
Note: this does not guarantee an upward link for each blog
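The sparse tree steps can be sketched as follows. This is one reading of the slides, not the paper's code: it assumes the four candidate sets have already been computed for each blog and are supplied in priority order, and it interprets the pruning rule as keeping a single highest-scoring parent per blog. The function name, blog names, and scores are hypothetical.

```python
def sparse_tree(candidates, scores):
    """Pick at most one parent per infected blog.

    `candidates` maps each infected blog to a list of candidate-source
    sets in priority order (explicit sources first, classifier-inferred
    sources last). `scores` maps (source, target) pairs to a classifier
    score; missing pairs are treated as 0.0 (an assumption).
    Returns {blog: parent} for every blog that has a non-empty set.
    """
    parents = {}
    for blog, priority_sets in candidates.items():
        for sources in priority_sets:
            if sources:
                # Use only the first non-empty set, then keep the
                # highest-scoring link; sorting makes ties deterministic.
                parents[blog] = max(
                    sorted(sources),
                    key=lambda src: scores.get((src, blog), 0.0),
                )
                break
    return parents


# Hypothetical candidate sets for one URL, in the four priority tiers.
candidates = {
    "blog_a": [set(), set(), set(), set()],
    "blog_b": [{"blog_a"}, set(), set(), set()],
    "blog_c": [set(), {"blog_a", "blog_b"}, set(), set()],
    "blog_d": [set(), set(), set(), {"blog_c"}],
}
scores = {
    ("blog_a", "blog_b"): 0.9,
    ("blog_a", "blog_c"): 0.4,
    ("blog_b", "blog_c"): 0.7,
    ("blog_c", "blog_d"): 0.6,
}

print(sparse_tree(candidates, scores))
```

A blog whose four sets are all empty gets no parent, which matches the note that an upward link is not guaranteed for every blog.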
19
Visualisation
Further refinement uses via data to include hidden blogs
Both types of graph are available as a web service for any user
20
Visualisation
[Figures: Giant Microbes infection tree; CNN news story infection tree]
21
Discussion
Incompleteness of crawl
Small dataset
Unknown robustness of classifiers
Meme residing at multiple URLs
22
Discussion
Novel application of infection model to blogspace
Useful visualisation tool developed
Further research into influence of graph structure on spread of infection
Could be useful for blog search engines
23
Conclusion
Difficult objectives achieved to a limited extent
Problems with dataset affect significance of work
Further work required to fully determine usefulness of technique
24
Summary
Introduction
Terminology
Spread of URLs
Inferring Infection Routes
Visualisation
Discussion
25
Any questions?