Introduction to Web Science


1 Introduction to Web Science
Information Extraction for the SW

2 Six challenges of the Knowledge Life Cycle
Acquire, Model, Reuse, Retrieve, Publish, Maintain

3 What is Text Mining? Text mining is about knowledge discovery from large collections of unstructured text. It’s not the same as data mining, which is more about discovering patterns in structured data stored in databases. Information extraction (IE) is a major component of text mining. IE is about extracting facts and structured information from unstructured text.

4 IE is not IR IR pulls documents from large text collections (usually the Web) in response to specific keywords or queries. You analyse the documents. IE pulls facts and structured information from the content of large text collections. You analyse the facts.

5 The challenge of Web Science
The Web requires machine-processable, repurposable data to complement hypertext. Such metadata can be divided into two types of information: explicit and implicit. IE is mainly concerned with implicit (semantic) metadata. More on this later…

6 IE by example (1) "the seminar at 4 pm will ..." How can we learn a rule to extract the seminar time?
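As an illustration (not from the slides), a learned rule of this kind often amounts to a pattern over the tokens around a trigger word. A minimal hand-written stand-in for such a rule, sketched as a Python regular expression:

    import re

    # Stand-in for a learned rule: a time expression following "seminar".
    SEMINAR_TIME = re.compile(
        r"seminar.*?\b(\d{1,2}(?::\d{2})?\s*(?:am|pm))\b", re.IGNORECASE)

    match = SEMINAR_TIME.search("the seminar at 4 pm will cover IE")
    if match:
        print(match.group(1))  # -> "4 pm"

A real adaptive IE system would induce such patterns from annotated examples rather than have them written by hand.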

7 IE by example (2)

8 IE by example (3)

9 Adaptive Information Extraction
IE: systems capable of extracting information
AIE: the same as IE, but treats the usability and accessibility of the system as important, makes it easy to port the system to new domains, and exploits machine learning

10 What is adaptable?
New domain information: based upon an ontology, which can change
Different sub-language features: POS, noun chunks, etc.
Different text genres: free text, structured, semi-structured, etc.
Different types: Text, String, Date, Name, etc.

11 Shallow vs Deep Approaches
Shallow approach: uses syntax primarily (tokenisation, POS, etc.)
Deep approach: uses syntactic information, plus semantics (named entities, etc.), heuristics (world rules, e.g. a brother is male) and additional knowledge

12 Single vs Multi-Slot
Single slot: extract one element at a time, e.g. "The seminar is at 4pm."
Multi-slot: extract several concepts simultaneously, e.g. "Tom is the brother of Mary." → Brother(Tom, Mary), as in the sketch below
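A toy sketch (illustrative only) of the difference: a multi-slot rule fills several arguments of one relation in a single match.

    import re

    # Hypothetical multi-slot pattern: "X is the brother of Y" -> Brother(X, Y)
    BROTHER = re.compile(r"(\w+) is the brother of (\w+)")

    m = BROTHER.search("Tom is the brother of Mary.")
    if m:
        print(f"Brother({m.group(1)}, {m.group(2)})")  # -> Brother(Tom, Mary)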

13 Batch vs Incremental Learners
Batch: the examples are collected and the system is trained on them all at once; simpler
Incremental: rules are added and evaluated one at a time; more complex, and care must be taken to avoid local maxima

14 Interactive vs Non-Interactive
Interactive: uses an oracle to verify and validate results; the oracle can be a person or a simple program
Non-interactive: uses only the training data provided by the users, e.g. with 10-fold cross-validation (sketched below)
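For reference, a minimal sketch of 10-fold cross-validation (the setup is assumed, not from the slides): the annotated examples are split into ten folds, and the learner is trained on nine folds and scored on the held-out fold in turn.

    # train_and_score is a hypothetical callback: train on the first list
    # of examples, return a score measured on the second.
    def ten_fold(examples, train_and_score, k=10):
        scores = []
        for i in range(k):
            test = examples[i::k]  # every k-th example, starting at offset i
            train = [e for j, e in enumerate(examples) if j % k != i]
            scores.append(train_and_score(train, test))
        return sum(scores) / k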

15 Top-Down vs Bottom-Up
Top-down: starts from a generic rule and specialises it
Bottom-up: starts from a specific rule and relaxes it

16 Top Down

17 Bottom Up

18 Generalisation task The process of generating generic rules from domain-specific data
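A toy sketch of one simple generalisation step (illustrative only): merge two seed examples into a single rule by wildcarding the token positions where they differ.

    def generalise(example_a, example_b):
        # Replace differing tokens with a wildcard, keep shared tokens.
        return ["*" if a != b else a
                for a, b in zip(example_a.split(), example_b.split())]

    print(generalise("the seminar at 4 pm", "the meeting at 9 am"))
    # -> ['the', '*', 'at', '*', '*']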

19 Overfitting vs Underfitting
Underfitting: the learner does not manage to capture the full underlying model; produces excessive bias
Overfitting: the learner fits both the model and the noise

20 Text mining stages
Document selection and filtering (IR techniques)
Document pre-processing (NLP techniques)
Document processing (NLP / ML / statistical techniques)

21 Stages of document processing
Document selection involves the identification and retrieval of potentially relevant documents from a large set (e.g. the web) in order to reduce the search space. Standard or semantically-enhanced IR techniques can be used for this. Document pre-processing involves cleaning and preparing the documents, e.g. removal of extraneous information, error correction, spelling normalisation, tokenisation, POS tagging, etc. Document processing consists mainly of information extraction.
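A minimal pre-processing sketch, assuming NLTK with its sentence-splitting and POS-tagging models already downloaded:

    import nltk  # assumes the punkt and POS-tagger models are installed

    text = "Dr Head retired. The seminar is at 4pm."
    for sentence in nltk.sent_tokenize(text):   # sentence splitting
        tokens = nltk.word_tokenize(sentence)   # tokenisation
        print(nltk.pos_tag(tokens))             # POS tagging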

22 Metadata extraction Metadata extraction consists of two types:
Explicit metadata extraction involves information describing the document, such as that contained in the header information of HTML documents (titles, abstracts, authors, creation date, etc.) Implicit metadata extraction involves semantic information deduced from the material itself, i.e. endogenous information such as names of entities and relations contained in the text. This essentially involves Information Extraction techniques, often with the help of an ontology.
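A toy sketch of explicit metadata extraction (the HTML snippet is invented): pull the title and an author meta tag out of a document header.

    import re

    html = ('<html><head><title>Annual Report</title>'
            '<meta name="author" content="J. Smith"></head>'
            '<body>...</body></html>')

    title = re.search(r"<title>(.*?)</title>", html)
    author = re.search(r'<meta name="author" content="(.*?)"', html)
    print(title.group(1), "|", author.group(1))  # -> Annual Report | J. Smith

Implicit metadata, by contrast, requires the IE machinery described in the rest of these slides.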

23 IE for Document Access With traditional query engines, getting the facts can be hard and slow Where has the President visited in the last year? Which places in Europe have had cases of Bird Flu? Which search terms would you use to get this kind of information? How can you specify you want someone’s home page? IE returns information in a structured way IR returns documents containing the relevant information somewhere (if you’re lucky)

24 IE as an alternative to IR
IE returns knowledge at a much deeper level than traditional IR Constructing a database through IE and linking it back to the documents can provide a valuable alternative search tool. Even if results are not always accurate, they can be valuable if linked back to the original text

25 Some example applications
HaSIE, KIM, Threat Tracker

26 HaSIE Aims to find out how companies report about health and safety information Answers questions such as: “How many members of staff died or had accidents in the last year?” “Is there anyone responsible for health and safety?” “What measures have been put in place to improve health and safety in the workplace?”

27 HaSIE
Identification of such information is too time-consuming and arduous to be done manually. IR systems can't cope with this because they return whole documents, which could be hundreds of pages long. The system identifies the relevant sections of each document, pulls out sentences about health and safety issues, and populates a database with the relevant information.

28 HaSIE

29 KIM
KIM is a software platform developed by Ontotext for semantic annotation of text. KIM performs automatic ontology population and semantic annotation for Semantic Web and KM applications:
indexing and retrieval (an IE-enhanced search technology)
query and exploration of formal knowledge

30 KIM Ontotext’s KIM query and results

31 Threat Tracker
An application developed by Alias-I which finds and relates information in documents. Intended for use by information analysts who use unstructured news feeds and standing collections as sources. Used by DARPA for tracking possible information about terrorists, etc. Identification of entities, aliases, relations, etc. enables you to build up chains of related people and things.

32 Threat tracker

33 What is Named Entity Recognition?
Identification of proper names in texts, and their classification into a set of predefined categories of interest Persons Organisations (companies, government organisations, committees, etc) Locations (cities, countries, rivers, etc) Date and time expressions Various other types as appropriate

34 Why is NE important?
NE provides a foundation from which to build more complex IE systems. Relations between NEs can provide tracking, ontological information and scenario building. Tracking (co-reference): "Dr Head, John, he"

35 Two kinds of approaches
Knowledge engineering: rule-based; developed by experienced language engineers; makes use of human intuition; requires only a small amount of training data; development can be very time-consuming; some changes may be hard to accommodate.
Learning systems: use statistics or other machine learning; developers do not need expertise; require large amounts of annotated training data; some changes may require re-annotation of the entire training corpus.

36 Typical NE pipeline
Pre-processing (tokenisation, sentence splitting, morphological analysis, POS tagging)
Entity finding (gazetteer lookup, sketched below; NE grammars)
Coreference (alias finding, orthographic coreference, etc.)
Export to database / XML
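A toy sketch of the gazetteer-lookup step (the entries are invented): match known names against the token stream and tag them with their entity type.

    GAZETTEER = {"London": "Location", "Ontotext": "Organisation"}

    def lookup(tokens):
        # Single-token entries only, for brevity; real gazetteers also
        # handle multi-word names.
        return [(i, tok, GAZETTEER[tok])
                for i, tok in enumerate(tokens) if tok in GAZETTEER]

    print(lookup("Ontotext opened an office in London".split()))
    # -> [(0, 'Ontotext', 'Organisation'), (5, 'London', 'Location')]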

37 GATE and ANNIE
GATE (General Architecture for Text Engineering) is a framework for language processing.
ANNIE (A Nearly New Information Extraction system) is a suite of language processing tools which provides NE recognition.
GATE also includes:
plugins for language processing, e.g. parsers, machine learning tools, stemmers, IR tools, IE components for various languages, etc.
tools for visualising and manipulating ontologies
ontology-based information extraction tools
evaluation and benchmarking tools

38 GATE

39 Information Extraction for the Semantic Web
Traditional IE is based on a flat structure, e.g. recognising Person, Location, Organisation, Date, Time, etc. For the Semantic Web, we need information in a hierarchical structure. The idea is that we attach semantic metadata to the documents, pointing to concepts in an ontology. Information can be exported as an ontology annotated with instances, or as text annotated with links to the ontology.

40 Richer NE Tagging Attachment of instances in the text to concepts in the domain ontology Disambiguation of instances, e.g. Cambridge, MA vs Cambridge, UK
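A toy sketch of such disambiguation (the URIs and cue-word sets are invented): each candidate instance carries an ontology URI, and overlap with the surrounding words picks between them.

    CANDIDATES = {
        "Cambridge": [
            ("http://example.org/onto#Cambridge_UK", {"UK", "England"}),
            ("http://example.org/onto#Cambridge_MA", {"MA", "Boston"}),
        ]
    }

    def disambiguate(mention, context_words):
        # Prefer the candidate whose cue words overlap the context most.
        best = max(CANDIDATES[mention], key=lambda c: len(c[1] & context_words))
        return best[0]

    print(disambiguate("Cambridge", {"near", "Boston"}))
    # -> http://example.org/onto#Cambridge_MA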

41 Magpie Developed by the Open University
A plugin for a standard web browser. It automatically associates an ontology-based semantic layer with web resources, allowing relevant services to be linked. Provides the means for a structured and informed exploration of web resources, e.g. looking at a list of publications, we can find information about an author such as the projects they work on and the other people they work with.

42 Magpie in Action (1)

43 Magpie in Action (2)

44 Magpie in Action (3)

45 Evaluation metrics and tools
Evaluation metrics mathematically define how to measure the system's performance against a human-annotated gold standard. A scoring program implements the metric and provides performance measures for each document and over the entire corpus, for each type of NE; it may also evaluate changes over time. A gold standard reference set also needs to be provided; this may be time-consuming to produce. Visualisation tools show the results graphically and enable easy comparison.

46 Methods of evaluation
Traditional IE is evaluated in terms of precision and recall.
Precision: how accurate were the answers the system produced? correct answers / answers produced
Recall: how good was the system at finding everything it should have found? correct answers / total possible correct answers
There is usually a tradeoff between precision and recall, so a weighted average of the two (the F-measure) is generally also used.
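These definitions translate directly into code; a minimal sketch (the counts are invented):

    def evaluate(correct, produced, possible):
        precision = correct / produced          # correct / answers produced
        recall = correct / possible             # correct / total possible correct
        f1 = 2 * precision * recall / (precision + recall)  # balanced F-measure
        return precision, recall, f1

    print(evaluate(correct=80, produced=100, possible=120))
    # -> (0.8, 0.666..., 0.727...)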

47 Metrics for Richer IE Precision and Recall are not sufficient for ontology-based IE, because the distinction between right and wrong is less obvious Recognising a Person as a Location is clearly wrong, but recognising a Research Assistant as a Lecturer is not so wrong Similarity metrics need to be integrated additionally, such that items closer together in the hierarchy are given a higher score, if wrong Also possible is a cost-based approach, where different weights can be given to each concept in the hierarchy, and to different types of error, and combined to form a single score
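A toy sketch of a similarity-aware score (the hierarchy is a tiny invented ontology): full credit for an exact match, partial credit when the predicted concept is a sibling of the gold one, none otherwise.

    PARENT = {"Lecturer": "AcademicStaff", "ResearchAssistant": "AcademicStaff",
              "AcademicStaff": "Agent", "Person": "Agent",
              "Location": "Place", "Agent": "Entity", "Place": "Entity"}

    def score(gold, predicted):
        if gold == predicted:
            return 1.0
        if PARENT.get(gold) == PARENT.get(predicted):
            return 0.5   # siblings in the hierarchy: a near miss
        return 0.0       # unrelated concepts: plainly wrong

    print(score("ResearchAssistant", "Lecturer"))  # -> 0.5
    print(score("Person", "Location"))             # -> 0.0

A cost-based approach would replace the fixed 0.5 with per-concept and per-error-type weights.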

48 Visualisation of Results
Cluster Map example Traditionally used to show documents classified according to topic Here shows instances classified according to concept Enables analysis, comparison and querying of results

49 The principle – Venn Diagrams
Documents classified according to topic

50 Jobs by region Instances classified by concept

51 Concept distribution Shows the relative importance of different concepts

52 Correct and incorrect instances attached to concepts

53 Why is IE difficult?
"BNC Holdings Inc named Ms G Torretta as its new chairman."
"Nicholas Andrews was succeeded by Gina Torretta as chairman of BNC Holdings Inc."
"Ms. Gina Torretta took the helm at BNC Holdings Inc."
Hint: what are they referring to? They all refer to the same thing!

54 Try IE yourself ... (1)
Given a particular text, find all the successions.
Hint: there are 6, including the one below.
Hint: we do not have complete information.
E.g.
<SUCCESSION-1>
ORGANIZATION: "New York Times"
POST: "president"
WHO_IS_IN: "Russell T. Lewis"
WHO_IS_OUT: "Lance R. Primis"

55 <DOC>
<DOCID> wsj93_050 </DOCID>
<DOCNO> </DOCNO>
<HL> Marketing Noted.... </HL>
<DD> 02/19/93 </DD>
<SO> WALL STREET JOURNAL (J), PAGE B5 </SO>
<CO> NYTA </CO>
<IN> MEDIA (MED), PUBLISHING (PUB) </IN>
<TXT>
<p>
New York Times Co. named Russell T. Lewis, 45, president and general manager of its flagship New York Times newspaper, responsible for all business-side activities. He was executive vice president and deputy general manager. He succeeds Lance R. Primis, who in September was named president and chief operating officer of the parent.
</p>
</TXT>
</DOC>

56 Answer (1)
<SUCCESSION-2>
ORGANIZATION: "New York Times"
POST: "general manager"
WHO_IS_IN: "Russell T. Lewis"
WHO_IS_OUT: "Lance R. Primis"
<SUCCESSION-3>
POST: "executive vice president"
WHO_IS_IN:
WHO_IS_OUT: "Russell T. Lewis"

57 Answer (2)
<SUCCESSION-4>
ORGANIZATION: "New York Times"
POST: "deputy general manager"
WHO_IS_IN:
WHO_IS_OUT: "Russell T. Lewis"
<SUCCESSION-5>
ORGANIZATION: "New York Times Co."
POST: "president"
WHO_IS_IN: "Lance R. Primis"
WHO_IS_OUT:

58 Answer (3)
<SUCCESSION-6>
ORGANIZATION: "New York Times Co."
POST: "chief operating officer"
WHO_IS_IN: "Lance R. Primis"
WHO_IS_OUT:

59 Questions?

