Taxonomies: Hidden but Critical Tools Marjorie M.K. Hlava President Access Innovations, Inc.


Industry in change Technology changes Evolving standards Mergers New buzzwords Hard to tell what is real

Popular Misconceptions Computers can do it all No need to index No need for thesauri or subject headings Full text gives all we need Automatic full text User friendly search engines Search engines are indexes User profiles provide the right context Data filters give right answers

Some of it is true What can we use? Automatic - semi - classification Depends….. Size of collection Cost of the effort

What’s in?? Taxonomies –thesauri –hierarchies - classification –categorization –browsing Wellformedness Bricks and mortar, i.e., profit

Options for Access/Control Keep track of the input –Thesaurus –Authority file Maximize the access –Search engine –Browse list Power of the word –McCain

What do we need? The basics... Authority file –People, places, things Taxonomy –Thesaurus* with authority file or document instance “Automatic” Classification

Thesaurus Construction Parts of a whole Noun and noun phrases People, places, things Actions and reactions Concepts and processes

Term Records - Thesaurus - format Main Entries Top Terms - TT Broader Terms - BT Narrower Terms - NT Scope Notes - SN History - HI Date Term - added/changed - DA

Thesaurus - Format Related Terms - RT See - S See Also - SA Use - U Use For - UF “Wellformedness” = W3C
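The term record fields on the two slides above could be held in a small Python structure, sketched here for illustration only. The class name `TermRecord`, the function `format_record`, and the sample entry are assumptions, not the format of any particular thesaurus product.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TermRecord:
    """One thesaurus entry; field names mirror the slide abbreviations."""
    term: str                                     # main entry
    tt: Optional[str] = None                      # Top Term
    bt: List[str] = field(default_factory=list)   # Broader Terms
    nt: List[str] = field(default_factory=list)   # Narrower Terms
    rt: List[str] = field(default_factory=list)   # Related Terms
    uf: List[str] = field(default_factory=list)   # Use For (non-preferred synonyms)
    sn: str = ""                                  # Scope Note
    hi: str = ""                                  # History
    da: str = ""                                  # Date term added/changed


def format_record(rec: TermRecord) -> str:
    """Render a record in a tagged, one-relationship-per-line layout."""
    lines = [rec.term]
    if rec.sn:
        lines.append(f"  SN {rec.sn}")
    if rec.tt:
        lines.append(f"  TT {rec.tt}")
    for tag, values in (("UF", rec.uf), ("BT", rec.bt), ("NT", rec.nt), ("RT", rec.rt)):
        for value in values:
            lines.append(f"  {tag} {value}")
    return "\n".join(lines)


# Hypothetical sample entry, invented for this sketch
record = TermRecord(
    term="Thesauri",
    tt="Information science",
    bt=["Controlled vocabularies"],
    nt=["Multilingual thesauri"],
    rt=["Taxonomies"],
    uf=["Thesauruses"],
    sn="Structured vocabularies used for indexing and retrieval",
)
print(format_record(record))
```

Keeping each relationship as a list lets one record carry several broader or related terms, which is what a poly-hierarchical thesaurus needs.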

What are the parts? Natural Language Processing Term forms Term Relationships Term Associations

Natural Language Processing Morphological Lexical Analysis Syntactic Numerical Phraseological Semantic Analysis Pragmatic

Seven Major Parts of NLP 1. Morphological – plural – past tense to present

Seven Major Parts of NLP 2. Lexical Analysis – part of speech tagging 3. Syntactic analysis – noun phrase id – proper name boundary

Seven Major Parts of NLP 4. Numeric concept boundary 5. Semantic analysis –Proper name concept categorization –Numeric concept categorization –Semantic relation extraction 6. Phraseological - discourse analysis –Text structure identification

Seven Major Parts of NLP 7. Pragmatic analysis –Cause and effect relationships –Nurse and nursing –Common sense reasoning (buy → possess) –Who has x ? –These are the people who brought you.....

Say it another way Term standardization Term forms Term relationships Term associations Rule building / domain creation

Word Standardization Split out chemical & drug terms – Separates chemical & drug terms for special treatment Split out homonyms, non-English terms, and authority terms – Separates objects, proper names, place names, and dates for special treatment Run spelling standardization program – Identifies variant spellings

Word Standardization Run word standardization program – -ie, -ing, -ed, -s, -es, pre-, non-, and “-” Match preferred terms and synonyms
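A minimal sketch of what a word standardization pass followed by preferred-term matching might look like. The slides name the affixes but not what the program does with them, so simply stripping them is an assumption here, as are the function names and the tiny synonym table.

```python
PREFIXES = ("pre-", "non-")

# Hypothetical synonym table: standardized variant -> preferred term
PREFERRED = {
    "automobile": "cars",
    "car": "cars",
}


def strip_suffix(w: str) -> str:
    """Remove one common English suffix; deliberately naive."""
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"          # taxonomies -> taxonomy
    if w.endswith("ing") and len(w) > 5:
        return w[:-3]                # indexing -> index
    if w.endswith("ed") and len(w) > 4:
        return w[:-2]                # indexed -> index
    if w.endswith(("ses", "xes", "zes", "ches", "shes")):
        return w[:-2]                # indexes -> index
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]                # automobiles -> automobile
    return w


def standardize(word: str) -> str:
    """Lowercase, drop a listed prefix, join hyphenated parts, strip a suffix."""
    w = word.lower().strip()
    for prefix in PREFIXES:
        if w.startswith(prefix):
            w = w[len(prefix):]      # assumption: the prefix is simply removed
            break
    w = w.replace("-", "")
    return strip_suffix(w)


def to_preferred(word: str) -> str:
    """Map a word to its preferred term, falling back to the standardized form."""
    key = standardize(word)
    return PREFERRED.get(key, key)


print(standardize("non-indexed"))
print(to_preferred("Automobiles"))
```

A rule set this small will mangle many words (for example, "coordinated" becomes "coordinat"), which is one reason such programs are paired with an exception list or a human review step.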

Term Forms Noun Adjective Verb, adverb Singular, plural Initial articles Spelling variants

Term Forms Punctuation Capitalization Abbreviations

Term Relationships Generic Hierarchical Systematic Alphabetic Instance Poly-hierarchical
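To make the poly-hierarchical case concrete, here is a small sketch in which one term has two broader terms and all of its ancestors are collected. The `BROADER` table and the function name `all_broader` are invented for illustration.

```python
from typing import Dict, List

# Hypothetical broader-term (BT) links; "Thesauri" sits in two hierarchies
BROADER: Dict[str, List[str]] = {
    "Thesauri": ["Controlled vocabularies", "Reference works"],
    "Controlled vocabularies": ["Knowledge organization"],
    "Reference works": ["Publications"],
}


def all_broader(term: str, broader: Dict[str, List[str]] = BROADER) -> List[str]:
    """Return every ancestor of a term, nearest first, without duplicates."""
    seen: List[str] = []
    queue = list(broader.get(term, []))
    while queue:
        current = queue.pop(0)           # breadth-first: closest ancestors first
        if current not in seen:
            seen.append(current)
            queue.extend(broader.get(current, []))
    return seen


print(all_broader("Thesauri"))
```

A search on any ancestor can then be expanded to include documents indexed with the narrower term, whichever branch the searcher came down.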

Term Associations Cross references All and some rule Associative terms Related terms

“Rule building”* process Put terms in context Group like categories Consider relationships Standardize variants Meld to a single concept rule How much is really automatic???
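One way to picture "put terms in context" and "meld to a single concept rule" is a rule that assigns a thesaurus term only when a trigger word appears alongside a context word. This is a sketch under that assumption; the `Rule` structure, the sample rules, and `index_text` are hypothetical and do not describe any vendor's rule language.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rule:
    trigger: str               # single word to look for in the text
    context: Tuple[str, ...]   # at least one must also appear; empty means unconditional
    assign: str                # thesaurus term to assign when the rule fires


# Hypothetical rules: the same trigger resolves to different concepts by context
RULES = [
    Rule("mercury", ("planet", "orbit", "solar"), "Mercury (planet)"),
    Rule("mercury", ("toxic", "metal", "thermometer"), "Mercury (element)"),
    Rule("thesauri", (), "Thesauri"),
    Rule("thesaurus", (), "Thesauri"),   # variant melded into one concept
]


def index_text(text: str, rules: List[Rule] = RULES) -> List[str]:
    """Return the thesaurus terms whose rules fire on the text, in rule order."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    assigned: List[str] = []
    for rule in rules:
        if rule.trigger not in words:
            continue
        if rule.context and not words.intersection(rule.context):
            continue
        if rule.assign not in assigned:
            assigned.append(rule.assign)
    return assigned


print(index_text("Mercury is the planet closest to the sun"))
print(index_text("A thesaurus can flag toxic mercury in water"))
```

Text that mentions "mercury" with no context word gets no term at all, which is where an editor would step in. That gap is one answer to the slide's question of how much is really automatic.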

Domains Taxonomy Term Record - thesaurus Hierarchical Browse-able list Handout in Booth 150

What else can we have? Proximity Stemming (lemmatization) Truncation Statistical clustering Bayesian and others

Other terms and tools Neural networks Word normalization Lexical (word) networks Distance mapping Pattern recognition

Moving toward the search engines Term weighting Frequency counts Relevance Precision Recall
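Precision and recall have standard definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The sketch below computes both, plus raw frequency counts; the function names and document IDs are made up for the example.

```python
from collections import Counter
from typing import Iterable, Tuple


def precision_recall(retrieved: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float]:
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def term_frequencies(text: str) -> Counter:
    """Raw frequency count of whitespace-separated, lowercased words."""
    return Counter(text.lower().split())


# Hypothetical search: 4 documents returned, 3 documents actually relevant
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(f"precision={p:.2f} recall={r:.2f}")
```

In this example two of the four retrieved documents are relevant (precision 0.50) and two of the three relevant documents were found (recall about 0.67).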

Classification of Evolving model… Noun Extractors Rule Based Systems Semantic Processors Fuzzy Search Systems Filtering Systems “Automatic Classification Systems”

(Semi) Automatic Indexing Basic theories Thesaurus construction Natural language processing Domain specific

Noun Extractors Use stop word list and frequency counts –Semio –WordPerfect 5.0 –Recon Prebuilt domains –Autonomy –Net Owl –Newsindexer
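The stop-word-plus-frequency approach can be sketched in a few lines. This is an illustration of the general technique, not of how any product listed above works; the stop list is a tiny hypothetical sample, and nothing here actually checks part of speech.

```python
import re
from collections import Counter
from typing import List, Tuple

# Tiny illustrative stop list; production lists run to hundreds of words
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "with"}


def candidate_terms(text: str, top_n: int = 5) -> List[Tuple[str, int]]:
    """Drop stop words, count what remains, and return the most frequent words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(top_n)


sample = "The thesaurus controls the terms. A thesaurus links terms to terms."
print(candidate_terms(sample, top_n=2))
```

Verbs such as "controls" survive the stop list just as nouns do, which shows why this kind of extraction only approximates noun identification.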

Rule-Based Systems –Data Harmony –API –DTIC –Mapit

Semantic Processors Synth Bank n-Stein - expected Quiver - beta

Fuzzy Search Systems Dr. Link Sovereign Hill

Filtering Systems Screaming Media Data Harmony

New Directions Topic Maps - TAO –Topic –Associations –Occurrences Relational Indexing Index Visualization Based on term records Add the search engines….
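The three TAO parts of a topic map can be modeled roughly as below: topics, associations between topics, and occurrences that point to where a topic is discussed. The class names, relation label, and document IDs are assumptions for illustration and do not follow the Topic Maps interchange syntax.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Topic:
    name: str
    occurrences: List[str] = field(default_factory=list)   # pointers to documents


@dataclass
class TopicMap:
    topics: Dict[str, Topic] = field(default_factory=dict)
    # Each association is (topic, relation, topic)
    associations: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_topic(self, name: str, *occurrences: str) -> None:
        """Create the topic if needed and attach any occurrences."""
        topic = self.topics.setdefault(name, Topic(name))
        topic.occurrences.extend(occurrences)

    def associate(self, a: str, relation: str, b: str) -> None:
        """Link two topics, creating either one if it does not exist yet."""
        self.add_topic(a)
        self.add_topic(b)
        self.associations.append((a, relation, b))

    def related(self, name: str) -> List[str]:
        """Topics linked to the given topic by any association."""
        out: List[str] = []
        for a, _, b in self.associations:
            if a == name:
                out.append(b)
            elif b == name:
                out.append(a)
        return out


# Hypothetical content
tm = TopicMap()
tm.add_topic("Thesauri", "doc-12", "doc-40")
tm.add_topic("Indexing", "doc-12")
tm.associate("Thesauri", "used-in", "Indexing")
print(tm.related("Indexing"))
print(tm.topics["Thesauri"].occurrences)
```

Seen this way, a topic resembles a term record, an association resembles a related-term link, and occurrences play the role of index postings.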

What’s a user to do? Enjoy the presentation What about a database producer? –Look at the options –Build from the basics –Evaluate the new tools –See it work before you buy

Give me your card I will send the presentation tonight

Thank You Marjorie M.K. Hlava President, Access Innovations, Inc. Chairman, Data Harmony Booth 150