
Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence
Mike Thelwall, Professor of Information Science, University of Wolverhampton

Contents
– Introduction to Scientific Web Intelligence
– Introduction to the Vector Space Model
– Vocabulary Spectral Analysis
– Low frequency words

Part 1: Scientific Web Intelligence

Applying web mining and web intelligence techniques to collections of academic/scientific web sites, using both links and text
– Objective: to identify patterns and visualize relationships between web sites and subsites
– Objective: to report causal information about those relationships and patterns back to users

Academic Web Mining
Step 1: Cluster domains by subject content, using text and links
Step 2: Identify patterns and create visualizations for relationships
Step 3: Incorporate user feedback and reason reporting into the visualization
This presentation deals with Step 1: deriving subject-based clusters of academic webs from text analysis.

Part 2: Introduction to the Vector Space Model

Overview
– The Vector Space Model (VSM) is a way of representing documents through the words that they contain
– It is a standard technique in Information Retrieval
– The VSM allows decisions to be made about which documents are similar to each other and to keyword queries

How it works: Overview
– Each document is broken down into a word frequency table
– The tables are called vectors and can be stored as arrays
– A vocabulary is built from all the words in all documents in the system
– Each document is represented as a vector against the vocabulary

Example
Document A: "A dog and a cat."
Document B: "A frog."
Word frequency tables:
Document A:  a: 2, dog: 1, and: 1, cat: 1
Document B:  a: 1, frog: 1

Example, continued
– The vocabulary contains all words used: a, dog, and, cat, frog
– The vocabulary needs to be sorted: a, and, cat, dog, frog

Example, continued
Against the sorted vocabulary (a, and, cat, dog, frog):
Document A: "A dog and a cat." – vector (2, 1, 1, 1, 0)
Document B: "A frog." – vector (1, 0, 0, 0, 1)
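As a concrete illustration (not from the original slides), here is a minimal Python sketch that builds the vocabulary and frequency vectors for the two example documents:

```python
from collections import Counter

docs = {"A": "A dog and a cat.", "B": "A frog."}

def tokenize(text):
    # Lower-case the text and strip trailing punctuation from each token.
    return [w.strip(".,").lower() for w in text.split() if w.strip(".,")]

counts = {name: Counter(tokenize(text)) for name, text in docs.items()}

# The vocabulary is every word used in any document, sorted.
vocabulary = sorted({w for c in counts.values() for w in c})
print(vocabulary)                  # ['a', 'and', 'cat', 'dog', 'frog']

# Each document becomes a vector of frequencies against the vocabulary.
vectors = {name: [c[w] for w in vocabulary] for name, c in counts.items()}
print(vectors["A"], vectors["B"])  # [2, 1, 1, 1, 0] [1, 0, 0, 0, 1]
```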

Measuring inter-document similarity
For two vectors d and d', the cosine similarity between d and d' is given by:
cos(d, d') = (d · d') / (|d| |d'|)
Here d · d' is the inner (dot) product of d and d', calculated by multiplying corresponding frequencies together and summing. The cosine measure calculates the angle between the vectors in a high-dimensional virtual space.
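A direct Python translation of the formula (a sketch, continuing the toy example above):

```python
import math

def cosine_similarity(d1, d2):
    # Inner product: multiply corresponding frequencies and sum.
    dot = sum(x * y for x, y in zip(d1, d2))
    # Divide by the lengths (Euclidean norms) of the two vectors.
    return dot / (math.sqrt(sum(x * x for x in d1)) *
                  math.sqrt(sum(x * x for x in d2)))

# Only 'a' is shared between the example documents, so similarity is modest:
print(cosine_similarity([2, 1, 1, 1, 0], [1, 0, 0, 0, 1]))  # ~0.535
```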

Stopword lists
Commonly occurring words (e.g. "in", "a", "the") are unlikely to give useful information and may be removed from the vocabulary to speed processing.

Normalised term frequency (tf)
A normalised measure of the importance of a word to a document is its frequency divided by the maximum frequency of any term in the document:
tf_i = freq_i / max_j freq_j
This is known as the tf factor.
Document A: raw frequency vector (2, 1, 1, 1, 0); tf vector (1, 0.5, 0.5, 0.5, 0)

Inverse document frequency (idf)
A calculation designed to make rare words more important than common words. The idf of word i is given by:
idf_i = log(N / n_i)
where N is the total number of documents and n_i is the number of documents that contain word i.

tf-idf
The tf-idf weighting scheme multiplies the tf factor and the idf factor for each word. Words are important for a document if they are frequent relative to other words in the document and rare in other documents.
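A sketch of the whole weighting scheme in Python, applied to the toy vectors, using the tf and idf definitions from the previous slides (the natural log is an assumption; the slides do not fix the base):

```python
import math

def tf(freqs):
    # Normalised term frequency: frequency / max frequency in the document.
    max_f = max(freqs)
    return [f / max_f for f in freqs]

def idf(all_vectors):
    # idf_i = log(N / n_i): N documents, n_i of which contain word i.
    n_docs = len(all_vectors)
    return [math.log(n_docs / sum(1 for v in all_vectors if v[i] > 0))
            for i in range(len(all_vectors[0]))]

def tf_idf(freqs, idfs):
    return [t * i for t, i in zip(tf(freqs), idfs)]

vectors = [[2, 1, 1, 1, 0], [1, 0, 0, 0, 1]]
print(tf_idf(vectors[0], idf(vectors)))
# 'a' occurs in both documents, so idf = log(2/2) = 0 and its weight vanishes;
# 'and', 'cat' and 'dog' each get 0.5 * log 2 ~ 0.347.
```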

Part 3: Vocabulary Spectral Analysis

Subject-clustering academic webs through text similarity (1)
1. Create a collection of virtual documents, each consisting of all web pages sharing a common domain name within a university.
– Doc. 1 = cs.auckland.ac.uk, 14,521 pages
– Doc. 2 = 3,463 pages
– …
– Doc. 760 = 4,125 pages

Subject-clustering academic webs through text similarity (2)
2. Convert each virtual document into a tf-idf word vector
3. Identify clusters using k-means and VSM cosine measures (a sketch follows this list)
4. Rank words for importance in each 'natural' cluster using the Cluster Membership Indicator
5. Manually filter out high-ranking words in undesired clusters
This destroys the natural clustering of the data in order to uncover the weaker subject clustering.
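The slides do not give implementation details for steps 2-3. One standard way to combine k-means with the cosine measure is to L2-normalise the tf-idf vectors, after which ordinary Euclidean k-means behaves like cosine-based clustering (for unit vectors, |x - y|^2 = 2 - 2 cos(x, y)). A hedged sketch using scikit-learn (my tooling choice, with placeholder data standing in for the 760 domain vectors; the cluster count is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
tfidf = rng.random((760, 5000))     # placeholder: 760 domains x 5000 words
unit = normalize(tfidf, norm="l2")  # unit length, so Euclidean ~ cosine

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0)
labels = kmeans.fit_predict(unit)   # a 'natural' cluster id per domain
```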

Cluster Membership Indicator
For a cluster C of documents with tf-idf weights w_ij, the CMI ranks each word by its importance to the cluster. [The defining formula appeared on the original slide but was not preserved.] The next slide shows the top CMI weights for an undesired non-subject cluster.

[Table: Word / Frequency / Domains / CMI — top-ranked words for the undesired cluster: massey, palmerston, and, the, of, in, north, students, research, a; the numeric columns were not preserved.]

Eliminating low frequency words
– We can test whether removing low frequency words increases or decreases the subject clustering tendency (e.g. are they just spelling mistakes?)
– This needs partially correct subject clusters
– Compare the similarity of documents within a cluster to their similarity with documents outside the cluster (a sketch follows)
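A sketch of one way to implement this comparison (my reconstruction; the slides do not specify the exact measure): take the mean cosine similarity of document pairs inside a cluster minus the mean similarity of inside-outside pairs, and see how it changes when low frequency words are dropped.

```python
import itertools
import numpy as np

def mean_cosine(pairs, vecs):
    # Average cosine similarity over a collection of index pairs.
    sims = [np.dot(vecs[i], vecs[j]) /
            (np.linalg.norm(vecs[i]) * np.linalg.norm(vecs[j]))
            for i, j in pairs]
    return float(np.mean(sims))

def clustering_tendency(vecs, labels, cluster):
    inside = [i for i, l in enumerate(labels) if l == cluster]
    outside = [i for i, l in enumerate(labels) if l != cluster]
    intra = mean_cosine(itertools.combinations(inside, 2), vecs)
    inter = mean_cosine(itertools.product(inside, outside), vecs)
    return intra - inter  # higher = tighter, better-separated subject cluster
```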

Eliminating low frequency words
[Chart slide: the results figure was not preserved.]

Summary
For text-based academic subject web site clustering:
– vocabularies need to be selected to break the natural clustering and allow subject clustering to emerge
– consider ignoring low frequency words, because they do not have high clustering power
– the manual element needs to be automated as far as possible
The results can then form the basis of a visualization that gives the user feedback on inter-subject connections.