Extracting Information from the Links in Academic Webs Mike Thelwall Statistical Cybermetrics Research Group University of Wolverhampton, UK An overview.


Extracting Information from the Links in Academic Webs Mike Thelwall Statistical Cybermetrics Research Group University of Wolverhampton, UK An overview of methods and results

Contents 1. Introduction to Webometrics 2. Computer Science uses for Web links 3. Main talk: analysing university Web links 1. Data collection 2. Data processing 3. Analysis 4. Results

Part 1: Introduction to Webometrics A new area of Information Science

infor-/biblio-/sciento-/cyber-/webo-/metrics informetrics bibliometrics scientometrics webometrics cybermetrics © Lennart Björneborn

Webometrics the study of quantitative aspects of the construction and use of info. resources, structures and technologies on the Web, drawing on bibliometric and informetric methods – LB def. four main research areas of Webometric concern: Web page contents link structures (e.g., Web Impact Factors, cohesion of link topologies, etc.) search engine performance users’ information behavior (searching, browsing, encountering, etc.) cybermetrics = quantitative studies of the whole Internet i.e. chat, mailing lists, news groups, MUDs, etc. - and Web © Lennart Björneborn

Part 2: Computer Science uses for Web links Search engine page ranking, topic identification and similarity matching

PageRank Assumptions: A page with many links to it is more likely to be useful than one with few links to it The links from a page that itself is the target of many links are likely to be particularly important

Example (diagram with pages X and Y): X seems to be the most important page since 2 important pages link to it

Simple voting model: round

Simple voting model: round

Simple voting model: round

Revised voting model: round Allocate 1 vote to each node after each voting round Remove votes from ‘leaf’ nodes

Revised voting model: round

Revised voting model: round The middle node only has one link to it, but this does not share its votes with other nodes

Revised voting model cycling problem 1 1 1

PageRank Use a proportion of vote, redistribute the rest If proportion is < 1 then no cycling will occur Voting can also be performed by a matrix Find votes from principal left eigenvector of matrix

PageRank: round votes in system: allocate 20% of vote, redistribute 80% of each, plus the lost votes from leaf nodes = 3.6 votes

PageRank: round x x 0.5 x 1

PageRank: round x x 0.5 x 1.1

PageRank summary The pages that get the highest PageRank are those that are linked to by many pages or by important pages Spammers try to exploit this by creating dummy sites to link to their main sites
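The voting idea behind PageRank can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the search engine's actual implementation: the function name `pagerank`, the `damping` parameter (the proportion of each vote passed along links, set here to the commonly quoted 0.85 rather than the split used on the slides), and the four-page example graph are all hypothetical.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iterative voting sketch. links maps each page to the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal votes
    for _ in range(iterations):
        # Votes held by pages with no outlinks ('leaf' nodes) are shared by all pages.
        leaf_votes = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * leaf_votes / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for t in targets:
                # Each page splits the passed-on share of its vote among its outlinks.
                new_rank[t] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank


# Hypothetical graph: A and B both link to X, X links to Y, Y links to A and B.
example_links = {"A": ["X"], "B": ["X"], "X": ["Y"], "Y": ["A", "B"]}
ranks = pagerank(example_links)
```

Because only part of each vote is passed along links and the rest is redistributed evenly, the totals settle down instead of cycling, and the result approximates the principal eigenvector of the corresponding vote matrix.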

Kleinberg’s HITS Also uses link structures, but also uses page content to identify pages that are useful for a coherent topic on the web An Authority is a page that is linked to by many other pages from the same topic A Hub is a page that links to many pages from the same topic

Hubs and authorities H A

The HITS algorithm Another iterative algorithm Each page has a hub value and an authority value Unlike PageRank, is topic specific, and potentially needs to be recomputed for each user query
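The hub and authority iteration can be sketched as follows. This is a simplified illustration, assuming the topic-specific set of pages has already been selected (the content-based selection step is omitted); the function name `hits` and the example graph are hypothetical.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) scores. links maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority value: sum of the hub values of the pages linking to the page.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for t in targets:
                auth[t] += hub[source]
        # Hub value: sum of the authority values of the pages the page links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalise both score sets so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical topic graph: three hub pages pointing at two authority pages.
topic_links = {"H1": ["A1", "A2"], "H2": ["A1", "A2"], "H3": ["A1"]}
hub_scores, auth_scores = hits(topic_links)
```

In this toy graph A1 ends up with the highest authority value, and H1 and H2 with the highest hub values, because they link to both authorities.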

Link Algorithms - Overview The success of HITS and PageRank indicates the importance of links as a new information source More needs to be known about patterns of linking But there is still no hard evidence that link approaches work – academic papers report unscientific experiments or inconclusive results

Small worlds short cuts or ‘weak ties’ between otherwise ‘distant’ web clusters (e.g., subject domains, interest communities) transversal link ’info. science’ ’creativity research’ © Lennart Björneborn

Part 3: Analysing University Link Structures Information science approaches

Why analyse university link structures? Analogies with citation studies Ensure that the Web is efficiently used for research communication Identify trends in informal scholarly communication Suggest improvements in search tools Exploratory research: the Web is important and a valid object for scientific study

Methodologies: Data collection Web crawler; AltaVista advanced queries, e.g. host:wlv.ac.uk AND link:albany.edu; AllTheWeb advanced queries; Google (does not support the same level of Boolean querying)

Methodologies: Data processing 1 Link counts to target universities (inter-site links only) Colink counts: B and C are colinked Couplings: D and E are coupled (diagram with nodes A, B, C, D, E, F)
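Colink and coupling counts can be derived from a table of inter-site links. The sketch below is illustrative only: the function names are hypothetical, and the example assumes that A links to both B and C and that D and E both link to F, which is one reading of the diagram labels.

```python
from collections import Counter
from itertools import combinations


def colink_counts(links):
    """For each pair of sites, count the sources that link to both of them."""
    counts = Counter()
    for source, targets in links.items():
        # Ignore self-links and repeated links from the same source.
        distinct = sorted(set(targets) - {source})
        for pair in combinations(distinct, 2):
            counts[pair] += 1
    return counts


def coupling_counts(links):
    """For each pair of sites, count the targets that both of them link to."""
    inlinks = {}
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                inlinks.setdefault(target, set()).add(source)
    counts = Counter()
    for sources in inlinks.values():
        for pair in combinations(sorted(sources), 2):
            counts[pair] += 1
    return counts


# Assumed reading of the diagram: A -> B, A -> C, D -> F, E -> F.
site_links = {"A": ["B", "C"], "D": ["F"], "E": ["F"]}
colinks = colink_counts(site_links)
couplings = coupling_counts(site_links)
```

Here B and C are colinked once (through A), and D and E are coupled once (through F).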

Methodologies: Data processing 2 Alternative Document Models E.g. count links between domains (ignoring multiple links) instead of pages P1 P2 P3 P4 P5 P6
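One way to apply a domain-based document model is to collapse page-to-page links into distinct domain pairs. This is a rough sketch: the function name `domain_links` and the page URLs are made up for illustration, and the full host name is used as the "domain", which is an assumption about how domains are delimited.

```python
from urllib.parse import urlsplit


def domain_links(page_links):
    """Collapse (source_url, target_url) page links into distinct domain pairs."""
    pairs = set()
    for source_url, target_url in page_links:
        source = urlsplit(source_url).hostname
        target = urlsplit(target_url).hostname
        # Keep only links between different domains; a set ignores multiple links.
        if source and target and source != target:
            pairs.add((source, target))
    return pairs


# Hypothetical page-level links: two cross-domain links and one internal link.
page_links = [
    ("http://www.wlv.ac.uk/a.html", "http://www.albany.edu/x.html"),
    ("http://www.wlv.ac.uk/b.html", "http://www.albany.edu/y.html"),
    ("http://www.wlv.ac.uk/a.html", "http://www.wlv.ac.uk/b.html"),
]
linked_domains = domain_links(page_links)
```

The two page-level links between the same pair of domains count once, so heavily replicated links have less influence than under a page-based count.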

Methodologies: Data analysis Statistical techniques for evaluating results Correlation with known research performance measures Factor analysis, Multi-Dimensional Scaling, Cluster analysis for patterns Simple graphical techniques Techniques from Communication Networks research / Geography
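As an illustration of correlating link counts with a research performance measure, the sketch below computes a Spearman rank correlation. The choice of Spearman is an assumption (the slides do not name a coefficient); it is a common option when counts are heavily skewed. The numbers are invented purely to show the call and are not results.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes both lists have some variation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank values."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented illustrative values, one entry per university.
link_counts = [120, 45, 300, 80]
research_scores = [3.1, 2.0, 4.5, 2.2]
rho = spearman(link_counts, research_scores)
```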

Results section 1 – Patterns of links between university Web sites

Results 1: Links associate with research Counts of links to universities within a country can correlate significantly with measures of research productivity

Links to UK universities counted by domain

Results 2: Links between universities in a country can be related to geography

Results 3: Universities cluster by geographic region This is clearest for Scotland but also for other groupings, including Manchester- based universities Coherent clusters are difficult to extract because of overlapping trends

A pathfinder network of UK university interlinking with geographic clusters indicated

Results section 2: Links and subject areas

Results 4: Links to departments associate with research In the US, links to chemistry and psychology departments from other departments associate with total research impact No evidence of a significant geographic trend Disciplinary differences in the extent of interlinking: history Web use is very low {Research with Rong Tang}

Results 5: Links for precision, colinks and couplings for recall For the UK academic Web, about 42% of domains connected by links alone are similar, and about 43% connected by links, colinks and couplings But over 100 times more domains are colinked or coupled than are directly linked Colinks and couplings can help the task of finding additional subject-based pages

Results 6: Most links are only loosely related to research A random sample of links between UK university sites revealed over 90% had some connection with scholarly activity, including teaching and research. Less than 1% were equivalent to citations

Results section 3: International academic links

Results 7: Linguistic factors in EU communication English is the dominant language for Web sites in the Western EU In a typical country, 50% of pages are in the national language(s) and 50% in English Non-English-speaking countries extensively interlink in English {Research with Rong Tang}

Results 8: Can map patterns of international communication Counts of links between Asia- Pacific universities are represented by arrow thickness. {Research with Alastair Smith, VUW, NZ}

Results section 4: The topology of national academic Webs

Results 9: “Power laws” in the Web Academic Webs have a topology dominated by power laws, including Counts of links to pages (inlink counts) Counts of links from pages (outlink counts) Groups of interconnected pages Directed component sizes Undirected component sizes
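A quick way to check whether a set of counts looks like a power law is to plot frequency against count on log-log axes and inspect the slope. The sketch below fits that slope by ordinary least squares. This is a crude exploratory check, not a rigorous estimator, and the function name and the synthetic data are assumptions for illustration.

```python
import math
from collections import Counter


def loglog_slope(counts):
    """Least-squares slope of log(frequency) against log(count).

    Needs at least two distinct positive count values.
    """
    frequency = Counter(c for c in counts if c > 0)
    xs = [math.log(value) for value in frequency]
    ys = [math.log(freq) for freq in frequency.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Synthetic inlink counts: 64 pages with 1 inlink, 16 with 2, 4 with 4, 1 with 8,
# so frequency is proportional to count ** -2.
synthetic_inlinks = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
slope = loglog_slope(synthetic_inlinks)
```

For the synthetic data the fitted slope is -2, matching the exponent used to build it; real link data are noisier, especially in the tail.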

Results 9: “Power laws” in the Web

Results 10: Academic Web topology A mess!

The future Results of research leading into: Improved Web-related policy making Improved Web information retrieval algorithms Improved understanding of informal scholarly communication on the Web More effective use of the Web by scholars, e.g. via PhD training