Digging Deep for Hidden Information in the Web. Part 1: Automated blog analysis. Part 2: Automated hyperlink analysis.



Part 1: Automated Blog Analysis
Analysing Public Science Debates through Blogs and Online News Sources

Part 1 Contents
- Background
  - Blogs
  - Online news sources
  - RSS
- Tracking public science debates
- Detecting public science debates

Background: blogs, public opinion, online news, RSS

Background
- There are millions of bloggers
- Bloggers are almost normal human beings
- Automatically tracking bloggers’ postings may give insights into public opinion

Blog tracking companies
- IBM WebFountain
- Intelliseek BlogPulse: “Monitor, measure and leverage consumer-generated media”
- Others growing…

RSS
- Format
  - Rich Site Summary/Really Simple Syndication
  - XML technology
  - Used for frequently updated information sources (blogs, news, academic journals)
- RSS readers
  - Users subscribe to the RSS feeds of favourite blogs/sites/journals/searches
  - Notified when updates are available
  - User-controlled ‘push’ technology
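To make the RSS format concrete, here is a minimal sketch of reading an RSS 2.0 feed with Python's standard library. The feed text, blog name and URLs are invented for illustration, and the helper name `parse_items` is hypothetical; a real reader would fetch the XML over HTTP and poll it for updates.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, inlined so the example runs without network access.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Science Blog</title>
    <link>http://example.org/</link>
    <description>Illustrative feed</description>
    <item>
      <title>Stem cell research update</title>
      <link>http://example.org/posts/1</link>
      <pubDate>Mon, 05 Sep 2005 10:00:00 GMT</pubDate>
      <description>New funding announced.</description>
    </item>
    <item>
      <title>GM food labelling</title>
      <link>http://example.org/posts/2</link>
      <pubDate>Tue, 06 Sep 2005 09:30:00 GMT</pubDate>
      <description>Debate continues.</description>
    </item>
  </channel>
</rss>"""


def parse_items(feed_xml):
    """Return a list of dicts, one per <item> in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in parse_items(SAMPLE_FEED):
        print(entry["published"], "-", entry["title"])
```

Because each item carries a title, link and publication date, a monitoring tool can store the items it has already seen and treat anything new as an update.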

Tracking Public Science Debates

Blog keyword searches
- Technorati: “Searches weblogs by keyword and for links” (example search: stem cell research)
- Blogdigger (example search: stem cell research)
- IceRocket
  - Allows advanced searches
  - Allows genuine date range search (Google only allows “last updated” date range searches)

Track evolution over time
- What is changing about interest in stem cell research/GM food?
- Are experts good at identifying changes in public interest?
- How can experts be sure/can they be supported with quantitative information?
- Can blogs be used to generate time series reflecting changes in “public interest”?

Free science debate graphs
- Solves the trend identification problem?
- Blogpulse: offers free automatic blog searches and keyword-generated click-search graphs
- Example graphs: stem cell research, stem cell, GM food, GM, mobile phone radiation

Research graphs
- Time-consuming to collect data
- Give control over the data source

Detecting Public Science Debates

How to detect a new debate?
- Heuristic methods, e.g. read papers, scan relevant blogs
- Automatic methods, e.g. look for a sudden increase in usage of science-related words in blogs?
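The automatic approach can be sketched as a simple burst detector: flag any day on which a word's count is well above its recent average. The function name `find_bursts`, the three-day window and the factor of 2 are illustrative assumptions, not values from the talk, and the counts are invented.

```python
def find_bursts(daily_counts, window=3, factor=2.0):
    """Return indexes of days whose count exceeds `factor` times the mean
    of the preceding `window` days.

    `window` and `factor` are illustrative assumptions, not values from the talk.
    """
    bursts = []
    for day in range(window, len(daily_counts)):
        previous = daily_counts[day - window:day]
        baseline = sum(previous) / window
        # Skip days with an empty baseline to avoid flagging noise from zero.
        if baseline > 0 and daily_counts[day] > factor * baseline:
            bursts.append(day)
    return bursts


if __name__ == "__main__":
    # Invented daily counts of blog posts mentioning a science-related word.
    counts = [4, 5, 3, 4, 12, 11, 5]
    print(find_bursts(counts))  # prints [4]
```

Only the first day of the jump is flagged here, because the following day is compared against a baseline that already includes the spike.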

Free hot topic searches
- Blog keyword search (sort by date)
  - Technorati: “Searches weblogs by keyword and for links” (example search: stem cell research)
  - Blogdigger blog search
- Hot topic searches
  - Blogdex: top contagious information
  - Bloglines: today’s hot topics (most popular links)
- Searches find the really big science debates?

Specialist research tools
- Commercial software: Intelliseek/IBM
- Mozdeh RSS monitor
  - Generates sub-collections
  - Generates word time series
  - Allows keyword searches
  - Identifies hot topics

Mozdeh Science Concern Corpus
- A collection of blog postings containing a fear word AND a science word
- Trend detection used to identify hot “science fear” topics
- Data cleaning to remove spam
- Need manual scanning of the list of words experiencing the biggest usage increase
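The corpus selection rule (a fear word AND a science word) can be sketched as a set intersection test. The two word lists below are hypothetical placeholders, since the slides do not give the lists Mozdeh actually used, and the function names are invented for this example.

```python
import re

# Hypothetical word lists; the actual Mozdeh lists are not given in the slides.
FEAR_WORDS = {"fear", "worry", "concern", "risk", "danger"}
SCIENCE_WORDS = {"science", "research", "scientist", "gene", "radiation"}


def tokenize(text):
    """Lower-case a posting and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def in_concern_corpus(posting):
    """True if the posting has at least one fear word AND one science word."""
    words = tokenize(posting)
    return bool(words & FEAR_WORDS) and bool(words & SCIENCE_WORDS)


if __name__ == "__main__":
    postings = [
        "I worry about stem cell research",
        "Great research published today",
        "I fear the exam",
    ]
    for text in postings:
        print(in_concern_corpus(text), "-", text)
```

This sketch matches exact word forms only; a fuller version would probably stem words first, as the truncated forms in the slides (such as ‘orlean’) suggest was done.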

Classification of top 5 words

Word       Max. daily increase (feeds)   Classification
stem       19%                           Science fear (stem cell research)
orlean     16%                           Information (about hurricane)
hurricane  16%                           Duplicate of ‘orlean’
katrina    15%                           Duplicate of ‘orlean’
june       14%                           Temporal descriptor
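One plausible reading of the “max. daily increase (feeds)” column is the largest day-to-day rise in the share of feeds that use a word. The sketch below ranks words on that basis; this interpretation, the function names and the sample numbers are all assumptions, not details taken from the slides.

```python
def max_daily_increase(feeds_with_word, feeds_total):
    """Largest day-over-day rise in the share of feeds containing a word.

    feeds_with_word[i] is the number of feeds using the word on day i and
    feeds_total[i] is the number of feeds seen that day (assumed non-zero).
    """
    shares = [w / t for w, t in zip(feeds_with_word, feeds_total)]
    rises = [later - earlier for earlier, later in zip(shares, shares[1:])]
    return max(rises) if rises else 0.0


def rank_words(series_by_word, feeds_total):
    """Sort words by their maximum daily increase, biggest first."""
    scores = {
        word: max_daily_increase(series, feeds_total)
        for word, series in series_by_word.items()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Invented counts over three days, with 100 feeds monitored each day.
    totals = [100, 100, 100]
    series = {"stem": [1, 20, 15], "june": [5, 6, 20]}
    for word, score in rank_words(series, totals):
        print(f"{word}: {score:.0%}")
```

The ranked list still needs the manual scanning mentioned above, since a word such as ‘june’ can rise sharply for reasons that have nothing to do with science.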

Classification of top 200 words
- 7.5% of the top 200 words represent new public fears of science stories, e.g. new medical cure
- The words come from multiple stories

Unexpected results?
- Social science research
  - Sudden burst of discussion over fears of the economic theories of Karl Rove, an influential advisor to George Bush
- Computer security
  - Concern over spyware features in a software vendor’s products
  - Research showing that consumers’ PIN numbers could be revealed by poor printing

Conclusions
- Many free tools support exploration of Consumer Generated Media
- Also room for specialist research tools

References
workshop/
Thelwall, M., Prabowo, R. & Fairclough, R. (2006, to appear). Are raw RSS feeds suitable for broad issue scanning? A science concern case study. Journal of the American Society for Information Science and Technology.

Acknowledgement The work was supported by a European Union grant for activity code NEST Path-1. It is part of the CREEN project (Critical Events in Evolving Networks, contract ,

Part 2: Automated hyperlink analysis
Link analysis as a social science technique

Link Analysis Manifesto
- Links are:
  - A wonderful new source of information about relationships between people, organisations and information
  - An easy-to-collect data source
- But: results should be interpreted with care

Part 2 Contents
- Academic link analysis, mainly from an information science perspective
- A general social science link analysis methodology
- Commercial applications

Why Count Links?
- Individual hyperlinks may reflect connections between web page contents or creators
- Counts of large numbers of hyperlinks may reflect wider underlying social processes
- Links may reflect phenomena that have previously been difficult to study, e.g. informal scholarly communication

Why Count University Links?
- To map patterns of communication between researchers in a country
- Which universities collaborate a lot?
- Which universities collaborate with government or industry?
- Which universities are using the web effectively?

Counting links
- Search engines will count them for you!
- Yahoo! advanced queries, e.g. links from Wolves Uni. to Oxford Uni. or back:
  - domain:ox.ac.uk AND linkdomain:wlv.ac.uk
  - domain:wlv.ac.uk AND linkdomain:ox.ac.uk
- Google link queries: find links to specific URLs, e.g. links to the University home page: link:
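When many pairs of sites are involved, the query strings can be generated rather than typed. This sketch only builds the text of the two queries shown above, reading `domain:` as "pages within this domain" and `linkdomain:` as "pages linking to this domain"; it does not contact any search engine, and whether a given engine still accepts these operators is an assumption to check. The function name `link_queries` is hypothetical.

```python
def link_queries(site_a, site_b):
    """Build the pair of query strings for links between two domains.

    Keys are (source, target) tuples: the query under (a, b) asks for
    pages in domain a that link to domain b.
    """
    return {
        (site_a, site_b): f"domain:{site_a} AND linkdomain:{site_b}",
        (site_b, site_a): f"domain:{site_b} AND linkdomain:{site_a}",
    }


if __name__ == "__main__":
    for (source, target), query in link_queries("wlv.ac.uk", "ox.ac.uk").items():
        print(f"{source} -> {target}: {query}")
```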

Counting links
- Can use a special-purpose web crawler or robot
  - Visits all the pages in a web site
  - Counts the links in the site
- Can use “advanced” counting methods
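The counting step of such a crawler can be sketched with Python's standard HTML parser. To keep the example offline, the pages are passed in as text rather than downloaded; the class and function names and the sample pages are invented, and counting by target host is only one of several possible counting methods.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the absolute URL of every <a href=...> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def count_outlinks_by_host(pages):
    """Count links to other hosts, given {page_url: html_text}.

    A real crawler would download the pages; here they are passed in
    so the sketch stays offline.
    """
    counts = Counter()
    for url, html in pages.items():
        parser = LinkExtractor(url)
        parser.feed(html)
        source_host = urlparse(url).hostname
        for link in parser.links:
            target_host = urlparse(link).hostname
            # Ignore internal links and links without a host (e.g. mailto:).
            if target_host and target_host != source_host:
                counts[target_host] += 1
    return counts


if __name__ == "__main__":
    pages = {
        "http://www.wlv.ac.uk/index.html":
            '<a href="http://www.ox.ac.uk/">Oxford</a> <a href="/about.html">About</a>',
        "http://www.wlv.ac.uk/research.html":
            '<a href="http://www.ox.ac.uk/research">Research</a>',
    }
    print(count_outlinks_by_host(pages))
```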

Some Inter-University Hyperlink Patterns
Mainly for the UK and Europe

Links to UK universities against their research productivity
- The reason for the strong correlation is the quantity of Web publication, not its quality
- This is different to citation analysis

Most links are only loosely related to research
- 90% of links between UK university sites have some connection with scholarly activity, including teaching and research
- But less than 1% are equivalent to citations
- So link counts do not measure research dissemination but are more a natural by-product of scholarly activity
- Cannot use link counts to assess research
- Can use link counts to track an aspect of communication

UK universities tend to link to their neighbours

Universities cluster geographically

Language is a factor in international interlinking
- English is the dominant language for Web sites in the Western EU
- In a typical country, 50% of pages are in the national language(s) and 50% in English
- Non-English-speaking countries extensively interlink in English

Patterns of international communication
Counts of links between EU universities in Swedish are represented by arrow thickness.

Counts of links between EU universities in French are represented by arrow thickness.

Which language???

Which language? Who is isolated?

International link patterns
The next slide is a (Kamada-Kawai) network diagram of the interlinking of the “top” 5 universities in AEAN countries (Asia and Europe), with arrows representing at least 100 links and unconnected universities removed.

The rich get richer on the web
- Link creation obeys the ‘rich get richer’ law
- Sites which already have a lot of links attract the most new links
- Some sites have a huge number of links: most have one or none
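A small simulation can illustrate the ‘rich get richer’ idea: each new link chooses its target with probability proportional to the links the target already has. This is a generic preferential-attachment sketch, not the model behind the data in the talk; the function name, the site and link counts and the +1 smoothing are all assumptions.

```python
import random


def simulate_links(n_sites=50, n_links=1000, seed=0):
    """Assign links one at a time, favouring sites that already have links.

    Each new link picks a site with probability proportional to
    (current inlinks + 1); the +1 lets unlinked sites be chosen at all.
    """
    rng = random.Random(seed)
    inlinks = [0] * n_sites
    for _ in range(n_links):
        weights = [count + 1 for count in inlinks]
        target = rng.choices(range(n_sites), weights=weights, k=1)[0]
        inlinks[target] += 1
    return inlinks


if __name__ == "__main__":
    counts = sorted(simulate_links(), reverse=True)
    print("most linked sites:", counts[:5])
    print("least linked sites:", counts[-5:])
```

Runs of this kind typically produce a skewed distribution, with a few sites collecting far more links than the rest, although the exact figures depend on the seed and parameters.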

Rich get richer example: Links from Australian university pages
The anomalies are also interesting

Part 3: A General Social Science Link Analysis Methodology
- A general framework for using link counts in social sciences research
  - For research into link creation, or
  - Together with other sources, for research into other online or offline phenomena
- Applicable when there are enough links relevant to the research question to count
  - For collections of large web sites, or
  - For large collections of small web sites

Nine stages for a research project
1. Formulate an appropriate research question, taking into account existing knowledge of web structure
2. Conduct a pilot study
3. Identify web pages or sites that are appropriate to address the research question
4. Collect link data from a commercial search engine or a personal crawler, taking appropriate accuracy safeguards
5. Apply data cleansing techniques to the links, if possible, and select an appropriate counting method
6. Partially validate the link count results through correlation tests, if possible
7. Partially validate the interpretation of the results through a link classification exercise
8. Report results with an interpretation consistent with link classification exercise, including either a detailed description of the classification or exemplars to illustrate the categories
9. Report the limitations of the study and parameters used in data collection and processing
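Stage 6 can be sketched as a rank correlation between link counts and some independent measure, such as a research productivity score. The slides do not say which correlation test was used; Spearman's rank correlation is chosen here as an assumption because link counts tend to be highly skewed. The function names and the sample figures are invented.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start..end
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank lists.

    Assumes both lists have the same length and neither is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Invented data: inlink counts and a productivity score for four sites.
    inlinks = [10, 200, 35, 4000]
    productivity = [1, 5, 2, 9]
    print(round(spearman(inlinks, productivity), 3))
```

A significant correlation only partially validates the counts, which is why the methodology pairs it with the link classification exercise in stage 7.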

The theoretical perspective for link counting
In order to be able to reliably interpret link counts, all links should be created individually and independently, by humans, through equivalent gravity judgments (e.g., about the quality of the information in the target page). Additionally, links to a site should target pages created by the site owner or somebody else closely associated with the site.

Commercial applications of link analysis

Commercial applications
- Find out who links to your web site
  - More links mean more visitors
  - Check if your web site is being recognised
- Find out who isn’t linking to your site
  - But is linking to a competitor’s web site!
  - Gives ideas about where to get new customers or links from
- Takes an hour of advanced searches
- Simple but very valuable!
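Once the two lists of linking sites have been collected by advanced searches, finding the sites that link to a competitor but not to you is a set difference. The function name and the site names below are invented for illustration.

```python
def missed_linkers(links_to_me, links_to_competitor):
    """Sites linking to a competitor but not to you, sorted for stable output."""
    return sorted(set(links_to_competitor) - set(links_to_me))


if __name__ == "__main__":
    # Hypothetical lists of sites found by link searches.
    mine = ["directory.example", "blog.example"]
    theirs = ["directory.example", "news.example", "forum.example"]
    print(missed_linkers(mine, theirs))  # prints ['forum.example', 'news.example']
```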

Conclusion
There is a lot of hidden information in the web: in blogs and hyperlinks

Co-authors
- Ray Binns, Viv Cothey, Ruth Fairclough, Gareth Harries, Xuemei Li, Peter Musgrove, Teresa Page-Kennedy, Nigel Payne, Rudy Prabowo, Liz Price, David Stuart, David Wilkinson, Alesia Zuccala, University of Wolverhampton
- Rong Tang, Catholic University of America
- Han-Woo Park, YeungNam University, South Korea
- Paul Wouters, Andrea Scharnhorst, The Virtual Knowledge Studio for the Humanities and Social Sciences, Amsterdam, The Netherlands