1 CS 430 / INFO 430 Information Retrieval Lecture 26 Classification 1.

2 Course Administration

3 Classification and Categorization
[Diagram: terms and documents, grouped into pre-defined classes or empirically-defined classes; labels: thesaurus, classification, text categorization, document clustering.]

4 Text categorization
Text categorization: The problem is to decide, for each document, which of a fixed set of pre-determined categories it belongs to.
Each document may belong to many categories.
Each category is treated as a separate binary classification problem.
Classification: The problem is to assign each document to exactly one pre-determined category.
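
The "separate binary classification problem" view can be sketched in a few lines of Python. This is only an illustration: the document ids, the category names, and the helper `binary_problems` are hypothetical and do not come from the lecture.

```python
# Hypothetical example data: document id -> set of assigned categories.
doc_categories = {
    "d1": {"sports", "politics"},
    "d2": {"sports"},
    "d3": set(),
}
categories = ["sports", "politics", "science"]


def binary_problems(doc_categories, categories):
    """Turn multi-category assignments into one yes/no labelling per category."""
    return {
        category: {doc: (category in assigned)
                   for doc, assigned in doc_categories.items()}
        for category in categories
    }


problems = binary_problems(doc_categories, categories)
print(problems["sports"])  # {'d1': True, 'd2': True, 'd3': False}
```

Each entry of `problems` is an independent two-class labelling, so a document such as d1 can be positive for several categories and a document such as d3 for none.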

5 Text categorization
Outline
Select a subject domain.
Choose a corpus of documents that cover the domain.
Obtain a training set of documents that have been assigned to a set of categories.
Create a vocabulary by extracting terms, normalization, precoordination of phrases, etc.
Use the terms in a document as a feature set that describes the document.
Scale the terms using idf or similar measure.
Use machine learning methods (e.g., support vector machine) to train an automatic classifier.
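
The feature and scaling steps of the outline can be sketched as follows. The slide says only "idf or similar measure", so the formula idf(t) = log(N / df(t)) used here is one common choice and an assumption, as are the function names and the toy training documents. Training the classifier itself is not shown.

```python
import math
from collections import Counter


def idf_weights(docs):
    """docs: list of token lists. Returns idf(t) = log(N / df(t)) per term."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def feature_vector(doc, idf):
    """Term frequency scaled by idf; terms unseen in training are ignored."""
    tf = Counter(doc)
    return {term: tf[term] * idf[term] for term in tf if term in idf}


# Hypothetical training documents, already tokenized and normalized.
training_docs = [
    ["ship", "vessel", "fleet"],
    ["ship", "book"],
    ["book", "volume", "tome"],
]
idf = idf_weights(training_docs)
vec = feature_vector(["ship", "ship", "fleet"], idf)
```

Here "ship" occurs in two of the three training documents and "fleet" in one, so "fleet" receives the larger idf weight. The resulting sparse vectors are the kind of input a learner such as a support vector machine would be trained on.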

6 Lexicon and Thesaurus
A lexicon contains information about words, their morphological variants, and their grammatical usage.
A thesaurus relates words by meaning:
  ship, vessel, sail; craft, navy, marine, fleet, flotilla
  book, writing, work, volume, tome, tract, codex
  search, discovery, detection, find, revelation
(From Roget's Thesaurus, 1911)

7 Thesaurus in Information Retrieval
Use of a thesaurus in indexing (precoordination)
A. Manual
A human indexer assigns standard terms and associations.
computer-aided instruction
  see also education
  UF (used for) teaching machines
  BT (broader term) educational computing
  TT (top term) computer applications
  RT (related term) education
  RT (related term) teaching
From: INSPEC Thesaurus

8 Thesaurus in Information Retrieval
Use of a thesaurus in indexing (precoordination)
B. Automatic
Divide terms into thesaurus classes.
Replace similar terms by a thesaurus class.
  408: dislocation, junction, minority-carrier, n-p-n, p-n-p, point-contact, recombine, transition, unijunction
  409: blast-cooled, heat-flow, heat-transfer
  410: anneal, strain
From: Salton and McGill
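
Replacing terms by their thesaurus class can be sketched as a dictionary lookup. The class membership below is read off the flattened table on this slide, so the assignment of terms to classes 408, 409, and 410 is an assumption; the function name `index_terms` and the sample tokens are hypothetical.

```python
# Thesaurus classes as read from the slide (class membership is an assumption).
THESAURUS_CLASSES = {
    408: ["dislocation", "junction", "minority-carrier", "n-p-n", "p-n-p",
          "point-contact", "recombine", "transition", "unijunction"],
    409: ["blast-cooled", "heat-flow", "heat-transfer"],
    410: ["anneal", "strain"],
}

# Invert the table: term -> class number.
TERM_TO_CLASS = {
    term: class_id
    for class_id, terms in THESAURUS_CLASSES.items()
    for term in terms
}


def index_terms(tokens):
    """Replace each token that belongs to a thesaurus class by its class number."""
    return [TERM_TO_CLASS.get(token, token) for token in tokens]


indexed = index_terms(["heat-flow", "in", "p-n-p", "junction"])
print(indexed)  # [409, 'in', 408, 408]
```

After replacement, "p-n-p" and "junction" are indistinguishable to the retrieval system, which is the intended effect of indexing by class rather than by individual term.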

9 Desirable Properties for Information Retrieval
The thesaurus is specific to a subject area.
It contains only terms of interest for identification within that subject area.
Ambiguous terms are coded only for the senses important for that field.
The target is that each thesaurus class should include terms of moderate frequency.
Ideally the classes should have similar frequency.

10 Alexandria Thesaurus: Example
canals
A feature type category for places such as the Erie Canal.
Used for: The category canals is used instead of any of the following.
  canal bends
  canalized streams
  ditch mouths
  ditches
  drainage canals
  drainage ditches
  ... more ...
Broader Terms: Canals is a sub-type of hydrographic structures.

11 Alexandria Thesaurus: Example (continued)
canals (continued)
Related Terms: The following is a list of other categories related to canals (non-hierarchical relationships).
  channels
  locks
  transportation features
  tunnels
Scope Note: Manmade waterway used by watercraft or for drainage, irrigation, mining, or water power.
» Definition of canals.

12 Art and Architecture Thesaurus
Controlled vocabulary for describing and retrieving information: fine art, architecture, decorative art, and material culture.
Almost 128,000 terms for objects, textual materials, images, architecture and culture from all periods and all cultures.
Used by archives, museums, and libraries to describe items in their collections.
Used as a database accessed via a Web site.
Used by computer programs, for information retrieval, and natural language processing.
A project of the J. Paul Getty Trust

13 Art and Architecture Thesaurus
Provides the terminology for objects, and the vocabulary necessary to describe them, such as style, period, shape, color, construction, or use, and scholarly concepts, such as theories, or criticism.
Concept: a cluster of terms, one of which is established as the preferred term, or descriptor.
Categories: associated concepts, physical attributes, styles and periods, agents, activities, materials, and objects.

14 Art and Architecture Thesaurus: Sample Record
Record ID:
Descriptor: rhyta
Note: Refers to vessels from Ancient Greece, eastern Europe, or the Middle East that typically have a closed form with two openings, one at the top for filling and one at the base so that liquid could stream out. They are often in the shape of a horn or an animal's head, and were typically used as a drinking cup or for pouring wine into another vessel.
Hierarchy: Containers [TQ]

15 Art and Architecture Thesaurus: Sample Record (continued)
Terms:
  rhyta
  rhyton (alternate, singular)
  protomai
  protome
  rhea
  rheon
  rheons
Related concepts:
  stirrup cups
  sturzbechers
  drinking vessels
  ceremonial vessels

16 Medical Subject Headings (MeSH)
National Library of Medicine's controlled vocabulary thesaurus
The library provides MeSH subject headings for each article in the 4,800 journals that it indexes every year and for every book, etc. acquired by the library.
23,000 primary headings.
Additional thesaurus of about 151,000 chemical terms.
Terms are organized in a hierarchy.
135,000 cross-references.
Users: experts who understand the field and are able to formulate queries using MeSH terms and the MeSH structures.

17 MeSH hierarchy
Biological Sciences [G]
  Biological Sciences [G01] +
  Health Occupations [G02] +
  Environment and Public Health [G03] +
  Biological Phenomena, Cell Phenomena, and Immunity [G04] +
  Genetics [G05] +
  Biochemical Phenomena, Metabolism, and Nutrition [G06] +
  Physiological Processes [G07] +
  Reproductive and Urinary Physiology [G08] +
  Circulatory and Respiratory Physiology [G09] +
  Digestive, Oral, and Skin Physiology [G10] +
  Musculoskeletal, Neural, and Ocular Physiology [G11] +
  Chemical and Pharmacologic Phenomena [G12] +

18 MeSH hierarchy (continued)
Physiological Processes [G07]
  Adaptation, Physiological [G07.062] +
  Aging [G07.168] +
  Body Constitution [G07.265] +
  Body Temperature [G07.315]
    Body Temperature Regulation [G ] +
    Skin Temperature [G ]
  Chronobiology [G07.450] +
  Electrophysiology [G07.453] +
  Fluid Shifts [G07.503]
  Growth and Embryonic Development [G07.553] +
  Homeostasis [G07.621] +
  Tensile Strength [G07.900]
  Tropism [G07.950] +
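
The tree numbers in brackets encode the hierarchy: dropping the last dot-separated component of a tree number gives the parent heading, as with G07.315 under G07. A minimal sketch of using that property follows. The helper names `ancestors` and `is_descendant` are hypothetical, the three-level tree number in the comments is made up for illustration, and the single-letter category (G) above the top-level headings is not handled.

```python
def ancestors(tree_number):
    """Return the ancestor tree numbers, nearest first.

    Example: 'G07.315' -> ['G07'].
    """
    parts = tree_number.split(".")
    return [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


def is_descendant(tree_number, candidate_ancestor):
    """True if candidate_ancestor lies on the path above tree_number."""
    return candidate_ancestor in ancestors(tree_number)


print(ancestors("G07.315"))             # ['G07']
print(is_descendant("G07.315", "G07"))  # True
print(is_descendant("G07.315", "G08"))  # False
```

A search system can use the same prefix test to expand a query on a heading to all headings beneath it.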

19 MeSH hierarchy (continued)
MeSH Heading: Body Temperature
Tree Number: E
Tree Number: G
Entry Term: Organ Temperature
See Also: Fever
See Also: Thermography
See Also: Thermometers
Allowable Qualifiers: DE GE IM PH RE
Unique ID: D001831

20 Automatic Thesaurus Construction
Outline
Select a subject domain.
Choose a corpus of documents that cover the domain.
Create vocabulary by extracting terms, normalization, precoordination of phrases, etc.
Devise a measure of similarity between terms and thesaurus classes.
Cluster terms into thesaurus classes, using a cluster method that generates compact clusters (e.g., complete linkage).
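
The last step of the outline names complete linkage as a method that produces compact clusters. The following is a small, unoptimized sketch of agglomerative complete-linkage clustering, not the procedure used in the lecture. The similarity values, the stopping threshold, and all names are hypothetical.

```python
def complete_linkage(items, sim, threshold):
    """Merge clusters greedily; stop when the best merge falls below threshold."""
    clusters = [[item] for item in items]

    def cluster_sim(a, b):
        # Complete linkage: two clusters are only as similar as their
        # least similar pair of members.
        return min(sim(x, y) for x in a for y in b)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cluster_sim(clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        if s < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical term-term similarities; unlisted pairs default to 0.1.
SIM = {
    frozenset(("alpha", "bravo")): 0.9,
    frozenset(("bravo", "charlie")): 0.8,
    frozenset(("alpha", "charlie")): 0.2,
}


def sim(x, y):
    return SIM.get(frozenset((x, y)), 0.1)


classes = complete_linkage(["alpha", "bravo", "charlie", "delta"], sim, 0.5)
print(classes)  # [['alpha', 'bravo'], ['charlie'], ['delta']]
```

Although charlie is close to bravo, it is not added to the first class because it is far from alpha. That requirement, that every member be similar to every other member, is what keeps complete-linkage classes compact.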

21 Normalization of vocabulary
Normalization rules map variant forms into base expressions.
Typical normalization rules for manual thesaurus construction are:
(a) Nouns only, or nouns and noun phrases.
(b) Singular nouns only.
(c) Spelling (e.g., U.S.).
(d) Capitalization, punctuation (e.g., hyphens), initials (e.g., IBM), abbreviations (e.g., Mr.).
Usually, many possible decisions can be made, but they should be followed consistently.
Which of these can be carried out automatically with reasonable accuracy?
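
As one possible answer to the closing question, rules (c) and (d) are easy to automate, rule (b) can be approximated crudely, and rule (a) needs part-of-speech information that simple string handling cannot supply. The sketch below is an illustration under those assumptions; the function `normalize` and its specific decisions (lowercasing, dropping periods, turning hyphens into spaces, stripping a final "s") are hypothetical choices, not rules from the lecture.

```python
import re


def normalize(term):
    """Map a variant form to a base expression using simple string rules."""
    t = term.lower()                        # (d) capitalization
    t = t.replace(".", "")                  # (c), (d) periods: "U.S." -> "us"
    t = re.sub(r"[-\s]+", " ", t).strip()   # (d) hyphens and extra spaces
    # (b) Very crude singularization; it will be wrong for many words.
    if len(t) > 3 and t.endswith("s") and not t.endswith(("ss", "us")):
        t = t[:-1]
    return t


print(normalize("U.S."))           # us
print(normalize("Heat-Transfer"))  # heat transfer
print(normalize("Canals"))         # canal
```

Whatever decisions are chosen, the point of the slide still applies: the same function must be applied consistently to both documents and queries.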

22 Phrase construction
In a precoordinated thesaurus, term classes may contain phrases.
Informal definitions:
pair-frequency(i, j) is the frequency with which a given pair of words occurs in context (e.g., in succession within a sentence).
A phrase is a pair of words, i and j, that occur in context with a higher frequency than would be expected from their overall frequency.
\[ \text{cohesion}(i, j) = \frac{\text{observed pair-frequency}(i, j)}{\text{expected pair-frequency if } i, j \text{ independent}} \]

23 Phrase construction: simple case
Example: a corpus of n terms.
$p_{i,j}$ is the observed frequency with which a given pair of terms occurs in succession (term i followed by term j).
$f_i$ is the number of occurrences of term i in the corpus.
There are n-1 pairs. If the terms are independent, the probability that a given pair begins with term i and ends with term j is $(f_i/n)(f_j/n)$, so the expected pair-frequency is $(n-1) f_i f_j / n^2$.
\[ \text{cohesion}(i, j) = \frac{n^2 \, p_{i,j}}{(n-1) \, f_i \, f_j} \]
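
A worked example may help fix the scale of the cohesion value. The numbers below are made up for illustration: a corpus of n = 1000 terms in which term i occurs 20 times, term j occurs 10 times, and the pair (i, j) occurs in succession 5 times.

```latex
\[
\text{expected pair-frequency}
  = (n-1)\,\frac{f_i}{n}\,\frac{f_j}{n}
  = 999 \times \frac{20}{1000} \times \frac{10}{1000}
  = 0.1998
\]
\[
\text{cohesion}(i, j)
  = \frac{n^2\, p_{i,j}}{(n-1)\, f_i\, f_j}
  = \frac{1000^2 \times 5}{999 \times 20 \times 10}
  = \frac{5}{0.1998}
  \approx 25.03
\]
```

A cohesion of about 25 means the pair occurs roughly 25 times more often than independence would predict, which suggests that the two words form a phrase.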

24 Phrase construction
Salton and McGill algorithm
1. Compute pair-frequency for all terms.
2. Reject all pairs that fall below a certain frequency threshold.
3. Calculate cohesion values.
4. If cohesion is above a threshold value, consider the word pair as a phrase.
There is promising research on phrase identification using methods of computational linguistics.
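
The four steps can be sketched as below, using the simple-case definition of cohesion for adjacent word pairs, n^2 p / ((n-1) f_i f_j). This is an illustrative sketch rather than Salton and McGill's implementation: the function name, the thresholds, and the toy corpus are hypothetical, and sentence boundaries are ignored.

```python
from collections import Counter


def find_phrases(tokens, min_pair_freq, min_cohesion):
    """Return {(i, j): cohesion} for adjacent word pairs accepted as phrases."""
    n = len(tokens)
    term_freq = Counter(tokens)                   # f_i
    pair_freq = Counter(zip(tokens, tokens[1:]))  # step 1: p_ij
    phrases = {}
    for (i, j), p in pair_freq.items():
        if p < min_pair_freq:                     # step 2: frequency threshold
            continue
        # Step 3: observed frequency over frequency expected under independence.
        cohesion = n * n * p / ((n - 1) * term_freq[i] * term_freq[j])
        if cohesion > min_cohesion:               # step 4: cohesion threshold
            phrases[(i, j)] = cohesion
    return phrases


# Hypothetical toy corpus of n = 9 terms.
text = "information retrieval uses information retrieval methods and information science"
phrases = find_phrases(text.split(), min_pair_freq=2, min_cohesion=2.0)
print(phrases)  # {('information', 'retrieval'): 3.375}
```

Step 2 matters in practice because rare pairs produce very large cohesion values from tiny counts; the frequency threshold removes them before cohesion is computed.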

25 Similarities
The vocabulary consists of a set of elements, each of which can be a single term or a phrase.
The next step is to calculate a measure of similarity between elements.
One measure of similarity is the number of documents that have terms j and k in common:
\[ S(t_j, t_k) = \sum_{i=1}^{n} t_{ij} \, t_{ik} \]
where $t_{ij} = 1$ if document i contains term j and 0 otherwise, and n is the number of documents.
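
The count-based similarity can be computed directly from an incidence array. The sketch below uses a small made-up collection of four documents over four terms; the contents of the documents and the variable names are hypothetical.

```python
terms = ["alpha", "bravo", "charlie", "delta"]

# Hypothetical documents, each given as the set of terms it contains.
docs = [
    {"alpha", "bravo"},
    {"alpha", "bravo", "charlie"},
    {"charlie", "delta"},
    {"alpha"},
]

# Incidence array: t[i][j] = 1 if document i contains term j, else 0.
t = [[1 if term in doc else 0 for term in terms] for doc in docs]


def similarity(j, k):
    """S(t_j, t_k): number of documents that contain both term j and term k."""
    return sum(row[j] * row[k] for row in t)


matrix = [[similarity(j, k) for k in range(len(terms))]
          for j in range(len(terms))]
for term, row in zip(terms, matrix):
    print(term, row)
# alpha [3, 2, 1, 0]
# bravo [2, 2, 1, 0]
# charlie [1, 1, 2, 1]
# delta [0, 0, 1, 1]
```

The matrix is symmetric, and each diagonal entry is the number of documents that contain that term.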

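26 Similarities: Incidence array
[Table not reproduced: incidence array with one row per document (D...), columns alpha, bravo, charlie, delta, echo, foxtrot, golf, and a final row labelled n.]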

27 Term similarity matrix
[Table not reproduced: term-by-term similarity matrix over alpha, bravo, charlie, delta, echo, foxtrot, golf.]
Using count of documents that have two terms in common

28 Similarity measures
Improved similarity measures can be generated by:
Using a term frequency matrix instead of an incidence matrix
Weighting terms by frequency:
cosine measure
\[ S(t_j, t_k) = \frac{\sum_{i=1}^{n} t_{ij} \, t_{ik}}{|t_j| \, |t_k|} \]
dice measure
\[ S(t_j, t_k) = \frac{\sum_{i=1}^{n} t_{ij} \, t_{ik}}{\sum_{i=1}^{n} t_{ik} + \sum_{i=1}^{n} t_{ij}} \]
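
Both measures can be sketched for a pair of term vectors, where each vector lists one term's values across the documents. The dice function follows the formula as it appears on this slide; many texts define the Dice coefficient with an extra factor of 2 in the numerator, so the values here are half of that variant. The function names and the two example vectors are hypothetical.

```python
import math


def cosine(tj, tk):
    """Cosine measure: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(tj, tk))
    norm = math.sqrt(sum(a * a for a in tj)) * math.sqrt(sum(b * b for b in tk))
    return dot / norm if norm else 0.0


def dice(tj, tk):
    """Dice measure as written on the slide (no factor of 2 in the numerator)."""
    dot = sum(a * b for a, b in zip(tj, tk))
    total = sum(tj) + sum(tk)
    return dot / total if total else 0.0


# Hypothetical incidence vectors for two terms over four documents.
alpha = [1, 1, 0, 1]
bravo = [1, 1, 0, 0]

cos_ab = cosine(alpha, bravo)   # 2 / (sqrt(3) * sqrt(2))
dice_ab = dice(alpha, bravo)    # 2 / (3 + 2)
```

Unlike the raw co-occurrence count, both measures divide by the overall frequency of the two terms, so a pair of very common terms no longer looks similar merely because each appears in many documents.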

29 Term similarity matrix
[Table not reproduced: term-by-term similarity matrix over alpha, bravo, charlie, delta, echo, foxtrot, golf.]
Using incidence matrix and dice weighting

30 Clustering terms to form concepts
The final stage is to group similar terms together into concepts.
This is done by cluster analysis.
Cluster analysis is the topic of the next lecture.