
Text mining and the Semantic Web
Dr Diana Maynard, NLP Group, Department of Computer Science, University of Sheffield
Lecture given at the University of Manchester, 15 March

Structure of this lecture
- Text Mining and the Semantic Web
- Text Mining Components / Methods
- Information Extraction
- Evaluation
- Visualisation
- Summary

Introduction to Text Mining and the Semantic Web

What is Text Mining?
Text mining is about knowledge discovery from large collections of unstructured text. It is not the same as data mining, which is about discovering patterns in structured data stored in databases. Similar techniques are sometimes used, but text mining faces many additional constraints caused by the unstructured nature of the text and the use of natural language. Information extraction (IE) is a major component of text mining: IE is about extracting facts and structured information from unstructured text.

Challenge of the Semantic Web
The Semantic Web requires machine-processable, repurposable data to complement hypertext. Such metadata can be divided into two types of information: explicit and implicit. IE is mainly concerned with implicit (semantic) metadata. More on this later…

Text mining components and methods

Text mining stages
- Document selection and filtering (IR techniques)
- Document pre-processing (NLP techniques)
- Document processing (NLP / ML / statistical techniques)

Stages of document processing
- Document selection involves identification and retrieval of potentially relevant documents from a large set (e.g. the web) in order to reduce the search space. Standard or semantically-enhanced IR techniques can be used for this.
- Document pre-processing involves cleaning and preparing the documents, e.g. removal of extraneous information, error correction, spelling normalisation, tokenisation, POS tagging, etc. (sketched below).
- Document processing consists mainly of information extraction. For the Semantic Web, this is realised in terms of metadata extraction.
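As a rough illustration of the pre-processing stage, here is a minimal sketch using NLTK; the library choice, the sample sentence and the downloaded model names are assumptions for illustration, not part of the lecture itself (which uses GATE).

```python
# Minimal pre-processing sketch with NLTK (illustrative; the slides use GATE).
# Model names may differ slightly across NLTK versions.
import nltk

nltk.download("punkt", quiet=True)                       # sentence/word tokeniser models
nltk.download("averaged_perceptron_tagger", quiet=True)  # POS tagger model

raw = "Dr Maynard works at the University of Sheffield. She works on text mining."

for sentence in nltk.sent_tokenize(raw):     # sentence splitting
    tokens = nltk.word_tokenize(sentence)    # tokenisation
    tagged = nltk.pos_tag(tokens)            # POS tagging
    print(tagged)
```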

Metadata extraction
There are two types of metadata extraction:
- Explicit metadata extraction involves information describing the document, such as that contained in the header information of HTML documents (titles, abstracts, authors, creation date, etc.).
- Implicit metadata extraction involves semantic information deduced from the material itself, i.e. endogenous information such as names of entities and relations contained in the text. This essentially involves Information Extraction techniques, often with the help of an ontology.
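As an illustration of the explicit side, the sketch below pulls the title and meta tags out of an HTML header using only the Python standard library; the sample HTML is invented for illustration.

```python
# Illustrative sketch of explicit metadata extraction: read <title> and <meta>
# tags from an HTML header. The sample document is invented.
from html.parser import HTMLParser

class HeaderMetadataParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs and "content" in attrs:
            self.metadata[attrs["name"]] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = data.strip()

html = """<html><head><title>Text mining and the Semantic Web</title>
<meta name="author" content="Diana Maynard">
<meta name="date" content="15 March"></head><body>...</body></html>"""

parser = HeaderMetadataParser()
parser.feed(html)
print(parser.metadata)   # {'title': ..., 'author': ..., 'date': ...}
```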

Information Extraction (IE)

IE is not IR
- IE pulls facts and structured information from the content of large text collections. You analyse the facts.
- IR pulls documents from large text collections (usually the Web) in response to specific keywords or queries. You analyse the documents.

IE for Document Access
With traditional query engines, getting the facts can be hard and slow:
- Where has the Queen visited in the last year?
- Which places on the East Coast of the US have had cases of West Nile Virus?
- Which search terms would you use to get this kind of information? How can you specify that you want someone's home page?
IE returns information in a structured way; IR returns documents containing the relevant information somewhere (if you're lucky).

IE as an alternative to IR
- IE returns knowledge at a much deeper level than traditional IR.
- Constructing a database through IE and linking it back to the documents can provide a valuable alternative search tool.
- Even if results are not always accurate, they can be valuable if linked back to the original text.

Some example applications
- HaSIE
- KIM
- Threat Tracker

HaSIE
Application developed by the University of Sheffield which aims to find out how companies report health and safety information. It answers questions such as:
- "How many members of staff died or had accidents in the last year?"
- "Is there anyone responsible for health and safety?"
- "What measures have been put in place to improve health and safety in the workplace?"

HaSIE
- Identification of such information is too time-consuming and arduous to be done manually.
- IR systems can't cope with this because they return whole documents, which could be hundreds of pages long.
- The system identifies relevant sections of each document, pulls out sentences about health and safety issues, and populates a database with the relevant information.

HaSIE

KIM
- KIM is a software platform developed by Ontotext for semantic annotation of text.
- KIM performs automatic ontology population and semantic annotation for Semantic Web and KM applications.
- Indexing and retrieval (an IE-enhanced search technology).
- Query and exploration of formal knowledge.

KIM: Ontotext's KIM query and results

Threat Tracker
- Application developed by Alias-I which finds and relates information in documents.
- Intended for use by information analysts who use unstructured news feeds and standing collections as sources.
- Used by DARPA for tracking possible information about terrorists etc.
- Identification of entities, aliases, relations etc. enables you to build up chains of related people and things.

Threat Tracker

What is Named Entity Recognition?
Identification of proper names in texts, and their classification into a set of predefined categories of interest:
- Persons
- Organisations (companies, government organisations, committees, etc.)
- Locations (cities, countries, rivers, etc.)
- Date and time expressions
- Various other types as appropriate
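For a feel of what NE recognition produces, here is a minimal sketch using spaCy; spaCy and its small English model are assumptions for illustration, not the GATE/ANNIE tools discussed in these slides.

```python
# Minimal NE recognition sketch with spaCy (illustrative alternative to ANNIE).
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Dr Diana Maynard works at the University of Sheffield in the UK.")

for ent in doc.ents:
    # ent.label_ is one of the predefined categories (PERSON, ORG, GPE, DATE, ...)
    print(ent.text, "->", ent.label_)
```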

Why is NE important?
- NE provides a foundation from which to build more complex IE systems.
- Relations between NEs can provide tracking, ontological information and scenario building:
  – Tracking (co-reference): "Dr Head, John, he"
  – Ontologies: "Manchester, CT"
  – Scenario: "Dr Head became the new director of Shiny Rockets Corp"

Two kinds of approaches
Knowledge Engineering:
- rule based
- developed by experienced language engineers
- make use of human intuition
- require only a small amount of training data
- development can be very time consuming
- some changes may be hard to accommodate
Learning Systems:
- use statistics or other machine learning
- developers do not need LE expertise
- require large amounts of annotated training data
- some changes may require re-annotation of the entire training corpus
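To make the knowledge-engineering style concrete, the toy sketch below combines a tiny gazetteer with one hand-written rule; the word lists and the pattern are invented for illustration and are far simpler than a real rule-based system such as ANNIE.

```python
# Toy knowledge-engineering (rule-based) NE sketch: gazetteer lookup plus one
# hand-crafted pattern. Purely illustrative.
import re

GAZETTEER = {
    "Sheffield": "Location",
    "Manchester": "Location",
    "Ontotext": "Organisation",
}

# Hand-crafted rule: "Dr"/"Prof" followed by capitalised words is a Person.
PERSON_RULE = re.compile(r"\b(?:Dr|Prof)\.? (?:[A-Z][a-z]+ ?)+")

def annotate(text):
    entities = [(m.group(0).strip(), "Person") for m in PERSON_RULE.finditer(text)]
    for name, category in GAZETTEER.items():
        if name in text:
            entities.append((name, category))
    return entities

print(annotate("Dr Diana Maynard gave a lecture in Manchester."))
# [('Dr Diana Maynard', 'Person'), ('Manchester', 'Location')]
```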

Typical NE pipeline
- Pre-processing (tokenisation, sentence splitting, morphological analysis, POS tagging)
- Entity finding (gazetteer lookup, NE grammars)
- Coreference (alias finding, orthographic coreference, etc.)
- Export to database / XML
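The final export step might look roughly like the sketch below, which serialises extracted entities as simple standoff XML; the element and attribute names, and the example offsets, are invented, not an actual GATE output format.

```python
# Illustrative export step: write extracted entities to simple standoff XML.
import xml.etree.ElementTree as ET

entities = [
    {"text": "Diana Maynard", "type": "Person", "start": 3, "end": 16},
    {"text": "University of Sheffield", "type": "Organisation", "start": 30, "end": 53},
]

root = ET.Element("annotations")
for e in entities:
    node = ET.SubElement(root, "entity", type=e["type"],
                         start=str(e["start"]), end=str(e["end"]))
    node.text = e["text"]

ET.ElementTree(root).write("entities.xml", encoding="utf-8", xml_declaration=True)
print(ET.tostring(root, encoding="unicode"))
```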

GATE and ANNIE
- GATE (General Architecture for Text Engineering) is a framework for language processing.
- ANNIE (A Nearly-New Information Extraction system) is a suite of language processing tools which provides NE recognition.
- GATE also includes:
  – plugins for language processing, e.g. parsers, machine learning tools, stemmers, IR tools, IE components for various languages, etc.
  – tools for visualising and manipulating ontologies
  – ontology-based information extraction tools
  – evaluation and benchmarking tools

GATE

Information Extraction for the Semantic Web
- Traditional IE is based on a flat structure, e.g. recognising Person, Location, Organisation, Date, Time, etc.
- For the Semantic Web, we need information in a hierarchical structure.
- The idea is that we attach semantic metadata to the documents, pointing to concepts in an ontology.
- Information can be exported as an ontology annotated with instances, or as text annotated with links to the ontology.
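As a rough sketch of the "ontology annotated with instances" export, the snippet below uses rdflib to assert a recognised mention as an instance of an ontology class; the namespace, class and instance URIs are invented for illustration.

```python
# Sketch of exporting an extracted entity as an ontology instance with rdflib.
# The namespace, class and instance URIs are invented.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

ONTO = Namespace("http://example.org/ontology#")
g = Graph()
g.bind("onto", ONTO)

# "University of Sheffield" recognised in the text as an instance of onto:University
g.add((ONTO.University_of_Sheffield, RDF.type, ONTO.University))
g.add((ONTO.University_of_Sheffield, RDFS.label, Literal("University of Sheffield")))

print(g.serialize(format="turtle"))
```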

Richer NE Tagging
- Attachment of instances in the text to concepts in the domain ontology
- Disambiguation of instances, e.g. Cambridge, MA vs Cambridge, UK

Magpie
- Developed by the Open University.
- Plugin for a standard web browser.
- Automatically associates an ontology-based semantic layer to web resources, allowing relevant services to be linked.
- Provides a means for structured and informed exploration of web resources, e.g. looking at a list of publications, we can find information about an author such as the projects they work on, other people they work with, etc.

MAGPIE in action

MAGPIE in action

Evaluation

Evaluation metrics and tools
- Evaluation metrics mathematically define how to measure the system's performance against a human-annotated gold standard.
- A scoring program implements the metric and provides performance measures:
  – for each document and over the entire corpus
  – for each type of NE
  – it may also evaluate changes over time
- A gold standard reference set also needs to be provided; this may be time-consuming to produce.
- Visualisation tools show the results graphically and enable easy comparison.

Methods of evaluation
- Traditional IE is evaluated in terms of Precision and Recall:
  – Precision: how accurate were the answers the system produced? correct answers / answers produced
  – Recall: how good was the system at finding everything it should have found? correct answers / total possible correct answers
- There is usually a trade-off between precision and recall, so a weighted average of the two (F-measure) is generally also used.
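These definitions translate directly into code; the sketch below computes precision, recall and the balanced F-measure (F1) from raw counts, with invented example numbers.

```python
# Precision, Recall and balanced F-measure (F1) from raw counts.
# The example counts are invented for illustration.
def precision_recall_f1(correct, produced, possible):
    precision = correct / produced if produced else 0.0   # correct / answers produced
    recall = correct / possible if possible else 0.0      # correct / total possible correct
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# e.g. the system produced 120 answers, 90 were correct, 150 were possible
print(precision_recall_f1(correct=90, produced=120, possible=150))
# (0.75, 0.6, 0.666...)
```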

GATE AnnotationDiff Tool

Metrics for Richer IE
- Precision and Recall are not sufficient for ontology-based IE, because the distinction between right and wrong is less obvious: recognising a Person as a Location is clearly wrong, but recognising a Research Assistant as a Lecturer is not so wrong.
- Similarity metrics need to be integrated additionally, such that items closer together in the hierarchy are given a higher score if wrong.
- Also possible is a cost-based approach, where different weights can be given to each concept in the hierarchy and to different types of error, and combined to form a single score.
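One toy way to realise such a similarity-based score is to give partial credit that shrinks with the distance between the predicted and correct concepts in the hierarchy, as sketched below; the hierarchy and the 1/(1+distance) formula are invented for illustration and are not the specific metrics alluded to here.

```python
# Toy hierarchy-aware scoring: partial credit that shrinks with the number of
# steps between the predicted concept and the correct one.
PARENT = {
    "Lecturer": "AcademicStaff",
    "ResearchAssistant": "AcademicStaff",
    "AcademicStaff": "Person",
    "Person": "Entity",
    "Location": "Entity",
}

def ancestors(concept):
    chain = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def credit(predicted, correct):
    """1.0 for an exact match, less for near misses, little for distant errors."""
    pred_chain, corr_chain = ancestors(predicted), ancestors(correct)
    common = next(c for c in pred_chain if c in corr_chain)   # lowest shared ancestor
    distance = pred_chain.index(common) + corr_chain.index(common)
    return 1.0 / (1.0 + distance)

print(credit("Lecturer", "Lecturer"))           # 1.0  (exact match)
print(credit("Lecturer", "ResearchAssistant"))  # 0.33 (near miss)
print(credit("Location", "Lecturer"))           # 0.2  (badly wrong)
```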

Visualisation of Results

Visualisation of Results: Cluster Map example
- Traditionally used to show documents classified according to topic.
- Here it shows instances classified according to concept.
- Enables analysis, comparison and querying of results.
- Examples here were created by Marta Sabou (Free University of Amsterdam) using Aduna software.

The principle – Venn Diagrams: documents classified according to topic

Jobs by region: instances classified by concept

Concept distribution: shows the relative importance of different concepts

Correct and incorrect instances attached to concepts

Summary
- Introduction to text mining and the Semantic Web
- How traditional information extraction techniques, including visualisation and evaluation, can be extended to deal with the complexity of the Semantic Web
- How text mining can help the progression of the Semantic Web

Research questions
- Automatic annotation tools are currently mainly domain- and ontology-dependent, and work best on a small scale.
- Tools designed for large-scale applications lose out on accuracy.
- Ontology population works best when the ontology already exists, but how do we ensure accurate ontology generation?
- Large-scale evaluation programmes are needed.

Some useful links
- NaCTeM (National Centre for Text Mining)
- GATE
- KIM
- h-TechSight
- Magpie