Using Search Engines and Web Crawlers in Social Science Research
  Mike Thelwall
  Head, Statistical Cybermetrics Research Group
  University of Wolverhampton, UK
  RC33 August 2004
Link Analysis in Social Science Research
  Use to study web phenomena
    E.g. NGO web site interlinking
    E.g. university web site interlinking
  Use to study offline phenomena with web aspects
    E.g. scholarly communication
    E.g. the perception of news events
  The web is a free, accessible, massive data source for information about many aspects of life
What use is hyperlink data to qualitative researchers?
  Part of a mixed methodology
  Numbers to back up theories
  To obtain samples of types of Web pages for qualitative analyses
  Background information on how the Web is used
Quick example 1: UK university interlinking with geographic clusters indicated
Quick example 2: Asia-Pacific university interlinking. {Research with Alastair Smith, VUW, NZ}
Quick example 3: Geographic interlinking trends for UK universities.
Talk overview
  A social science approach for link analysis
  Data collection with commercial search engines
  Data collection and analysis with SocSciBot
A social science approach for link analysis 1: Preliminary steps
  1. Formulate an appropriate research question, taking into account existing knowledge of web structure
  2. Conduct a pilot study
  3. Identify web pages or sites that are appropriate to address a research question
  4. Collect link data from a commercial search engine or a personal crawler, taking appropriate safeguards to ensure that the results obtained are accurate
A social science approach for link analysis 2: Validation
  5. Partially validate the link count results through correlation tests
  6. Partially validate the interpretation of the results through a link classification exercise or web author interviews
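The correlation test in step 5 can be sketched as follows. This is only an illustration under stated assumptions: the slides do not say which correlation test is meant, so Spearman's rank correlation is assumed here because link counts tend to be heavily skewed. The function names and the example figures are hypothetical, not data from the study.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical example: inlink counts for four sites and an external
# measure (for instance a research rating) for the same four sites.
inlink_counts = [10, 200, 35, 4000]
external_measure = [1.0, 3.5, 2.0, 9.0]
rho = spearman(inlink_counts, external_measure)
```

A high coefficient would only partially validate the link counts, which is why step 6 adds a classification exercise or interviews.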
A social science approach for link analysis 3: Reporting
  8. Report results with an interpretation consistent with the link classification exercise; include either a detailed description of the classification or exemplars to illustrate the categories
  9. Report the limitations of the study and the parameters used in data collection and processing
Link data from commercial search engines
  Commercial search engines can give information about the existence of links in the web
  Can be used for data collection
  Advanced interfaces or special commands are usually needed
Google
  Can find all links to a given web page with the link: command
  E.g. link:
Yahoo! site-specific searches
  Yahoo! allows searching for links between pairs of web sites/web spaces
  E.g. linkdomain:db.dk +site:ac.uk returns web pages in the ac.uk domain that link to the db.dk site
  Diagram: a page at …ac.uk/… links to …db.dk/…
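For a study of several sites, one query is needed per ordered pair of sites. The sketch below only builds the query strings in the form shown on the slide (linkdomain:TARGET +site:SOURCE); it does not contact any search engine. The function names and the domain list are illustrative assumptions.

```python
from itertools import permutations


def linkdomain_query(target_domain, source_domain):
    """Query for pages in source_domain that link to target_domain."""
    return f"linkdomain:{target_domain} +site:{source_domain}"


def pairwise_queries(domains):
    """Map each ordered (source, target) pair of domains to its query."""
    return {
        (source, target): linkdomain_query(target, source)
        for source, target in permutations(domains, 2)
    }


# The example from the slide: pages in ac.uk that link to db.dk.
slide_query = linkdomain_query("db.dk", "ac.uk")

# Hypothetical set of web spaces to compare pairwise.
queries = pairwise_queries(["db.dk", "ac.uk", "vuw.ac.nz"])
```

Because links are directional, the pair (ac.uk, db.dk) and the pair (db.dk, ac.uk) give different queries, so three domains produce six queries.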
SocSciBot
  Personal crawler for link research
  Available free at socscibot.wlv.ac.uk
  Crawls sets of web sites and analyses the links between them, producing:
    Link lists
    Link counts
    Network diagrams
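The step from a link list to inter-site link counts can be illustrated with a short sketch. This is not SocSciBot's own code or file format: it assumes the link list is a sequence of (source URL, target URL) pairs, treats the host name (minus a leading "www.") as the site, and uses made-up example.com and example.org URLs.

```python
from collections import Counter
from urllib.parse import urlsplit


def site_of(url):
    """Return the host name of a URL, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def intersite_link_counts(links):
    """Count links between different sites; links within a site are ignored."""
    counts = Counter()
    for source_url, target_url in links:
        source, target = site_of(source_url), site_of(target_url)
        if source and target and source != target:
            counts[(source, target)] += 1
    return counts


# Hypothetical link list: (page containing the link, page linked to).
link_list = [
    ("http://www.example.com/a.html", "http://example.org/"),
    ("http://example.com/b.html", "http://www.example.org/staff"),
    ("http://example.com/", "http://example.com/about"),
    ("http://example.org/", "http://example.com/"),
]
counts = intersite_link_counts(link_list)
```

Each (source site, target site) count could serve as a weighted, directed edge in a network diagram. Other counting units, such as whole domains or directories, would need a different site_of function.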
Reprise: Link Analysis in Social Science Research
  Use to study web phenomena
    E.g. NGO web site interlinking
    E.g. university web site interlinking
  Use to study offline phenomena with web aspects
    E.g. scholarly communication
    E.g. the perception of news events
  The web is a free, accessible, massive data source for information about many aspects of life
  But don’t forget the need for validation!