Methods for the Automatic Construction of Topic Maps Eric Freese, Senior Consultant ISOGEN International.

Agenda
Automatic Construction from Structured Documents
Automatic Construction from Unstructured Documents

Contextual Harvesting
Markup can provide clues about the information within a document
Largely dependent on semantic markup
Takes advantage of nesting within elements
Rules can be developed for harvesting data to build topic map constructs
– Rules could then be applied to similar types of documents
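
As a rough illustration of how nesting can act as a harvesting clue, the sketch below reads a small XML fragment and derives topics plus one association from the parent/child structure. The element names, the sample document, and the "employs" association type are all invented for this example; the slides do not show a concrete rule.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with semantic markup.
SAMPLE = """
<company>
  <name>ISOGEN International</name>
  <employee>
    <name>Eric Freese</name>
    <title>Senior Consultant</title>
  </employee>
</company>
"""

def harvest(xml_text):
    """Turn nested semantic elements into topics and associations."""
    root = ET.fromstring(xml_text)
    topics = []        # (topic type, name)
    associations = []  # (association type, member, member)
    company = root.findtext("name")
    topics.append(("company", company))
    # Nesting is the clue: an <employee> inside <company> implies employment.
    for emp in root.findall("employee"):
        person = emp.findtext("name")
        topics.append(("person", person))
        associations.append(("employs", company, person))
    return topics, associations
```

The same function could be run over any document that follows the same markup, which is the sense in which a rule applies to "similar types of documents".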

DTD/Schema Development to Support Harvesting
DTDs like HTML are mostly useless to a harvesting system
Flat structures make associations between elements more difficult
New DTD/schema development should take possible knowledge harvesting into account

Content-Based Harvesting
Combination of contextual and natural language harvesting
Text is parsed and clues within the text are used to harvest knowledge
HTML documents where labels are included in the text could be processed this way

NLP Strategies
Named entity recognition
– A list of entities (people, companies, places, etc.) is defined
– Programs parse a corpus of information to identify entities
– Limited by the completeness of the entity list
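
A minimal sketch of list-based entity recognition follows. The entity list and the type labels are hypothetical; a real list would be far larger, and a name that is missing from the list is simply not found, which is the limitation the slide points out.

```python
import re

# Hypothetical entity list; a real one would be far larger.
ENTITIES = {
    "ISOGEN International": "company",
    "Eric Freese": "person",
    "St. Paul": "place",
}

def find_entities(text, entities=ENTITIES):
    """Return (start offset, entity, type) for every listed entity found in text."""
    hits = []
    for name, kind in entities.items():
        # re.escape keeps characters such as "." in "St. Paul" literal.
        for match in re.finditer(re.escape(name), text):
            hits.append((match.start(), name, kind))
    return sorted(hits)
```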

NLP Strategies – cont.
Concept extraction
– A list of key words can be defined much like the named entity strategy
– Common strings may also be identified and suggested as new concepts
– More processing intensive than named entity
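
One simple way to surface "common strings" is to count repeated word pairs and propose the frequent ones as candidate concepts. The sketch below assumes that approach; the function name and the threshold are illustrative, and the slides do not say which method was actually used.

```python
from collections import Counter
import re

def suggest_concepts(text, min_count=2):
    """Suggest two-word strings that occur at least min_count times."""
    words = re.findall(r"[a-z]+", text.lower())
    # Count every adjacent word pair (pairs may span sentence boundaries).
    pairs = Counter(zip(words, words[1:]))
    return sorted(" ".join(p) for p, n in pairs.items() if n >= min_count)
```

A human would still review the suggestions, since frequent pairs such as "of the" carry no meaning on their own.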

NLP Strategies – cont.
Taxonomic classification
– Documents are analysed and classified according to a human-defined taxonomy
– Specialized programs must be developed that are able to understand the taxonomy
– These programs must also be able to process synonyms and related concepts

NLP Strategies – cont.
Discourse analysis
– Programs are developed that attempt to understand the meaning of a text
– Analyze the parts of speech using a lexicon and rules to attempt to derive the meanings and usages of words

Steps in Processing Natural Language
Tokenization
Part of Speech Tagging
Bracketing
Identification of useful structures

Tokenization
GOAL: Prepare text for processing by a natural language processing system
Flowing paragraph text is broken into sentence units
Sentence units are broken into word tokens
Word tokens are prepared for part of speech processing
– Contractions and other constructs receive special processing, e.g. n’t, ’s
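
The tokenization steps above can be sketched as follows. This is a deliberately naive version: sentences are split on end punctuation, and n't and 's are detached as separate tokens. It assumes straight apostrophes and ignores abbreviations, so it is an illustration of the idea and not a production tokenizer.

```python
import re

def split_sentences(paragraph):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

def tokenize(sentence):
    """Split a sentence into word tokens, detaching n't and 's."""
    # Detach contractions first so they become tokens of their own.
    sentence = re.sub(r"n't\b", " n't", sentence)
    sentence = re.sub(r"'s\b", " 's", sentence)
    # n't, words (optionally led by an apostrophe), or single punctuation marks.
    return re.findall(r"n't|'?\w+|[^\w\s]", sentence)
```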

Part of Speech (POS) Tagging
GOAL: Identify the part(s) of speech for each word token
Lexicon – list of words with the possible POS tags for each word
EXAMPLE:
– sound - NNP JJ NN VB
NNP: The Sound of Music, Puget Sound
JJ: a sound decision
NN: the sound of silence
VB: sound the alarm
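
A lexicon of this kind can be modeled as a plain mapping from word to candidate tags. In the sketch below the lexicon entries reuse tags shown on the slides, while the "UNKNOWN" placeholder for words missing from the lexicon is an assumption made for the example.

```python
# Tiny hypothetical lexicon using tags shown on the slides.
LEXICON = {
    "the": ["DT"],
    "red": ["JJ", "NNP"],
    "ball": ["NN"],
    "sound": ["NNP", "JJ", "NN", "VB"],
}

def candidate_tags(tokens, lexicon=LEXICON):
    """Pair each token with every POS tag the lexicon allows for it."""
    # Lookup is case-insensitive; unknown words get a placeholder tag list.
    return [(tok, lexicon.get(tok.lower(), ["UNKNOWN"])) for tok in tokens]
```

The lookup only lists possibilities. Choosing the right tag for "sound" in a given sentence still depends on context, which is what later rules have to resolve.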

POS Tagging – cont.
POS tagging is difficult
Exception processing often required for phrases and grouped words
EXAMPLE:
– Time flies like an arrow.
– Fruit flies like an apple.

Bracketing
GOAL: Identify groupings of words into phrases and the hierarchical relationship of phrases to one another
A set of rules is used to identify how different parts of speech and phrases can be combined to form larger phrases
EXAMPLE:
– [NP, DT, JJ, NN]
– Noun phrase can consist of a determiner followed by adjective followed by a noun

Benefits of the phased approach
The separation of functions allows this approach to be applied to any language
– A lexicon is developed for the language
– Rules for language construction are defined
– Generic engine is able to process the data
The separation of the lexicon and the rules base allows the model to be modified/improved as the corpus of text grows

Putting it all together
EXAMPLE: The red ball rolled down the hill.
Tokenization and POS tagging
– The DT
– red JJ NNP
– ball NN
– rolled VBD VBN
– down RB IN RBR VBP JJ NN RP
– hill NN

Putting it all together – cont.
EXAMPLE: The red ball rolled down the hill.
Bracketing rules
– [S, NP, VP]
– [NP, DT, JJ, NN]
– [VP, VBD, PP]
– [PP, IN, NP]
– [NP, DT, NN]
RESULT (using XML for bracketing): The red ball rolled down the hill
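
To show how the five rules combine, the sketch below applies them bottom-up to the example sentence and serializes the result as XML. Two things are assumptions here: each word already carries a single chosen tag (the ambiguity from the previous slide is taken as resolved), and the XML element names are simply the tag and phrase labels, since the markup of the original result is not preserved in this transcript.

```python
import xml.etree.ElementTree as ET

# Rules from the slide: the first item is the phrase, the rest are its parts.
RULES = [
    ["S", "NP", "VP"],
    ["NP", "DT", "JJ", "NN"],
    ["VP", "VBD", "PP"],
    ["PP", "IN", "NP"],
    ["NP", "DT", "NN"],
]

def bracket(tagged):
    """Repeatedly replace a run of nodes matching a rule body with its phrase."""
    nodes = []
    for word, tag in tagged:
        leaf = ET.Element(tag)
        leaf.text = word
        nodes.append(leaf)
    changed = True
    while changed:
        changed = False
        for phrase, *parts in RULES:
            for i in range(len(nodes) - len(parts) + 1):
                if [n.tag for n in nodes[i:i + len(parts)]] == parts:
                    parent = ET.Element(phrase)
                    parent.extend(nodes[i:i + len(parts)])
                    nodes[i:i + len(parts)] = [parent]
                    changed = True
                    break
            if changed:
                break  # restart from the first rule after every reduction
    return nodes

TAGGED = [("The", "DT"), ("red", "JJ"), ("ball", "NN"), ("rolled", "VBD"),
          ("down", "IN"), ("the", "DT"), ("hill", "NN")]
```

Calling `ET.tostring(bracket(TAGGED)[0], encoding="unicode")` yields a single S element whose children are the NP "The red ball" and the VP "rolled down the hill". This greedy strategy is enough for the example but can mis-bracket ambiguous input; a real parser would need backtracking or a chart.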

Harvesting Considerations
GIGO rule in effect
The harvesting process only hastens topic map construction
Only some of the topic map merging rules are applicable
– Limited prospect of meaningful subject identities
Humans must still participate in the process of knowledge organization in order to maintain quality
– Selective inclusion in the topic map/knowledge base

Q & A
Questions or comments welcome at:
ISOGEN International
1611 W. County Road B, Suite 204
St. Paul, MN USA
Voice: Fax:

Demonstrations
SemanText – open source
Ontopia Knowledge Suite – commercial

Harvesting Knowledge using SemanText
Contextual Harvesting
Natural Language-Based Harvesting

Contextual Harvesting in SemanText
XML markup is used, as in NLP, to identify document structures
Users write rules for harvesting information into topic map structures

Harvesting XML Content
Data can be harvested from any XML document (including RDF) into a topic map without the need to develop specialized programs
Specification language is based on XPath
In the future, the location from which the data was harvested will also be recorded as an occurrence
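
The idea of path-based harvesting rules can be sketched as a table that maps a path expression to a topic type. This is not SemanText's actual specification language, which the slides do not show; the rule format, element names, and topic types are invented, and Python's ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: (path expression, topic type for each matching element).
RULES = [
    (".//author", "person"),
    (".//publisher", "organization"),
]

def apply_rules(xml_text, rules=RULES):
    """Create one (topic type, name) pair per element matched by a rule."""
    root = ET.fromstring(xml_text)
    topics = []
    for path, topic_type in rules:
        for element in root.findall(path):
            topics.append((topic_type, (element.text or "").strip()))
    return topics
```

Because the rules are data, harvesting a new document type means writing new rules, with no change to the program itself.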

Harvesting RDF Content
RDF structures can be harvested in order to build topic maps
Demonstrates the possibility of interoperability between the two models
Rules can be established for flavors of RDF (e.g. Dublin Core, DAML/OIL)
– Allows any document using the tagging scheme to be harvested
Only binary associations can be generated
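
The restriction to binary associations follows from the shape of RDF statements: each one links exactly two things. The sketch below makes that visible using plain tuples in place of a real RDF parser. Treating every subject and object as a topic is a simplification made for the example.

```python
def triples_to_topic_map(triples):
    """Map (subject, predicate, object) triples to topics and binary associations."""
    topics = set()
    associations = []
    for subject, predicate, obj in triples:
        topics.update((subject, obj))
        # One triple links exactly two things, so the association has two members.
        associations.append((predicate, subject, obj))
    return topics, associations

# Hypothetical statements using Dublin Core style property names.
TRIPLES = [
    ("doc1", "creator", "Eric Freese"),
    ("doc1", "publisher", "ISOGEN International"),
]
```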

NLP in SemanText
The initial lexicon included within SemanText is based on a lexicon derived from the Penn Treebank tagging of the Brown corpus (1 million+ words) and a very large sample from the Wall Street Journal (approx. 3 million words)
SemanText provides the ability to identify and add new words to the lexicon through a GUI

NLP in SemanText – cont.
SemanText uses a public domain parser to tokenize flowing text and identify the appropriate POS for each word token based on a set of bracketing rules
XML is used to denote the bracketing
This XML markup allows SemanText to process natural language using its contextual harvesting capability