Introduction to Web Science


1 Introduction to Web Science
Harvesting the SW

2 Six challenges of the Knowledge Life Cycle
Acquire, Model, Reuse, Retrieve, Publish, Maintain

3 Information Extraction vs. Retrieval
IR IE

4 A couple of approaches …
Active learning to reduce the annotation burden: supervised learning, adaptive IE, the Melita methodology. Automatic annotation of large repositories: largely unsupervised, Armadillo. Active learning gets the system involved in the activity rather than passively learning from examples.

5 The Seminar Announcements Task
Created by the Carnegie Mellon School of Computer Science. From seminar announcements received by e-mail, retrieve the Speaker, Location, Start Time and End Time.

6 Seminar Announcements Example
Dr. Steals presents in Dean Hall at one am. becomes <speaker>Dr. Steals</speaker> presents in <location>Dean Hall</location> at <stime>one am</stime>.
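To make the tagging concrete, here is a minimal sketch (in Python, not the CMU system or any tool from these slides) that produces the annotation above with hand-written regular expressions; the slot patterns are illustrative assumptions only.

```python
import re

def annotate(text):
    # Hypothetical hand-written patterns; a real extractor learns these from examples.
    patterns = {
        "speaker":  r"Dr\.\s+\w+",
        "location": r"(?<=in )[A-Z]\w+ Hall",
        "stime":    r"\w+ [ap]m",
    }
    for tag, pattern in patterns.items():
        text = re.sub(pattern, lambda m, t=tag: f"<{t}>{m.group(0)}</{t}>", text)
    return text

print(annotate("Dr. Steals presents in Dean Hall at one am."))
# <speaker>Dr. Steals</speaker> presents in <location>Dean Hall</location> at <stime>one am</stime>.
```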

7 Information Extraction Measures
Precision: how many of the retrieved documents are relevant? Recall: how many of all the relevant documents were retrieved? F-measure: the weighted harmonic mean of precision and recall.

8 IE Measures Examples If I ask the librarian to search for books on cars, there are 10 relevant books in the library, and of the 8 he found only 4 are relevant. What are his precision, recall and F-measure?

9 IE Measures Answers If I ask the librarian to search for books on cars, there are 10 relevant books in the library, and of the 8 he found only 4 are relevant. What are his precision, recall and F-measure? Precision = 4/8 = 50%; Recall = 4/10 = 40%; F = (2 × 50 × 40)/(50 + 40) = 44.4%
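The same arithmetic as a small code sketch (the function name is just illustrative):

```python
def precision_recall_f(correct, retrieved, relevant):
    # Precision: fraction of retrieved items that are correct.
    precision = correct / retrieved
    # Recall: fraction of all relevant items that were found.
    recall = correct / relevant
    # F-measure: harmonic mean of precision and recall (equal weights).
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

p, r, f = precision_recall_f(correct=4, retrieved=8, relevant=10)
print(f"Precision {p:.0%}, Recall {r:.0%}, F-measure {f:.1%}")
# Precision 50%, Recall 40%, F-measure 44.4%
```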

10 Adaptive IE What is IE? What is AIE?
IE: automated ways of extracting information from unstructured or partially structured machine-readable documents. AIE performs the tasks of traditional IE, exploits the power of Machine Learning to adapt to complex domains with large amounts of domain-dependent data, different sub-language features and different text genres, and considers the usability and accessibility of the system important.

11 Amilcare Tool for adaptive IE from Web-related texts
Specifically designed for document annotation. Based on the (LP)2 algorithm, a covering algorithm based on lazy NLP that learns linguistic patterns. Trains with a limited amount of examples. Effective on different text types: free texts, semi-structured texts, structured texts. Uses GATE and ANNIE for preprocessing.
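As a toy illustration only (this is not the (LP)2 implementation), the kind of rule such a covering algorithm induces can be pictured as "insert a tag boundary where a small window of token conditions matches"; every condition below is a made-up example.

```python
def rule_matches(tokens, i, left_conditions, right_conditions):
    """Check conditions on the tokens immediately left/right of insertion point i."""
    left = tokens[max(0, i - len(left_conditions)):i]
    right = tokens[i:i + len(right_conditions)]
    return (len(left) == len(left_conditions) and len(right) == len(right_conditions)
            and all(cond(tok) for cond, tok in zip(left_conditions, left))
            and all(cond(tok) for cond, tok in zip(right_conditions, right)))

# Hypothetical rule: insert <stime> after the token "at" and before a word followed by "am"/"pm".
tokens = "Dr. Steals presents in Dean Hall at one am .".split()
left = [lambda t: t == "at"]
right = [lambda t: t[0].isalnum(), lambda t: t in ("am", "pm")]

for i in range(len(tokens) + 1):
    if rule_matches(tokens, i, left, right):
        print(f"insert <stime> before token {i}: ...{tokens[i-1]} | {tokens[i]}...")
# insert <stime> before token 7: ...at | one...
```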

12 CMU: detailed results
Best overall accuracy; best result on the speaker field; no results below 75%.

13 GATE: General Architecture for Text Engineering
Provides a software infrastructure for researchers and developers working in NLP. Components include a Tokeniser, Gazetteers, a Sentence Splitter, a POS Tagger, a Semantic Tagger (ANNIE), Co-reference Resolution, multilingual support, and Protégé and WEKA integration; many more exist and can be added.
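GATE itself is a Java framework; purely as an illustration of the pipeline idea (each component adds annotations that later components reuse), here is a small Python sketch with made-up stand-in components, not the GATE API.

```python
def tokeniser(doc):
    doc["tokens"] = doc["text"].split()
    return doc

def gazetteer(doc):
    # Hypothetical lookup list standing in for GATE's gazetteer lists.
    locations = {"Dean Hall", "Sheffield"}
    doc["lookups"] = [loc for loc in locations if loc in doc["text"]]
    return doc

def pos_tagger(doc):
    # Crude stand-in: a real tagger uses a trained model, not a capitalisation heuristic.
    doc["pos"] = ["NNP" if tok[0].isupper() else "X" for tok in doc["tokens"]]
    return doc

pipeline = [tokeniser, gazetteer, pos_tagger]   # order matters: later stages build on earlier output

doc = {"text": "Dr. Steals presents in Dean Hall at one am."}
for component in pipeline:
    doc = component(doc)
print(doc["lookups"], doc["pos"][:3])   # ['Dean Hall'] ['NNP', 'NNP', 'X']
```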

14 Current practice of annotation
The current practice of annotation for knowledge identification and extraction needs annotation by experts, is complex, and is time consuming. Goal: reduce the burden of text annotation for Knowledge Management.

15 Different Annotation Systems
SGML, TEX, Xanadu, CoNote, ComMentor, JotBot, Third Voice, Annotate.net, The Annotation Engine, Alembic, the GATE Annotation Tool, iMarkup, Yawas, MnM, S-CREAM.
- SGML (Standard Generalised Markup Language, 1969): developed at IBM; the task was to integrate law office information systems and allow editing, formatting, and information retrieval subsystems to share documents carrying different meta-information; the ancestor of modern markup languages.
- TEX (1970s–80s): one of the initial typesetting systems, still used today with many enhancements such as LaTeX; reinforced the idea that layout information and content can be mixed in the same document, which lies at the base of modern web languages like HTML.
- Xanadu (1988): the original hypertext project; still alive; Cosmic Book (1990s).
- CoNote: supports collaborative work; users share documents and the notes (annotations) inserted in them.
- ComMentor: a meta viewer; users are allowed to enter meta-information in text.
- JotBot: an applet that retrieves annotations from specialised servers and presents an interface for reading and composing annotations; one of the first on-the-fly annotation tools.
- Third Voice: a commercial browser plugin for annotation, a sort of newsgroup service where users could add comments to any page; the original page was not altered, annotations were overlaid afterwards. It raised interesting issues: it was unpopular with many Web site owners, who were disturbed by the idea of people posting critical, off-topic or obscene material on top of their site, and legal action was discussed; another issue was privacy, since the annotations were centrally stored and controlled by Third Voice.
- Annotate.net: similar to Third Voice, but restricts who can add comments.
- The Annotation Engine: similar to the others; annotates pages by passing them through a proxy.
- Alembic: uses several strategies to bootstrap the annotation process (string matching, rule languages, gazetteers, statistical analysis such as frequency counts) and trains a learning algorithm; it does not cater for redundant information, so the user must retag.
- GATE Annotation Tool: allows the user to execute linguistic modules over the document.
- iMarkup: similar, but allows mark-up in the form of text, sound, drawings, etc.
- MnM: an ontology editor plus web browser; uses IE to help the user annotate; learning is done in phases, i.e. browse, markup, learn, test and extract.
- S-CREAM

16 Melita Tool for assisted automatic annotation
Uses an Adaptive IE engine to learn how to annotate (no rule writing is needed to adapt the system). The users annotate document samples; the IE system trains while users annotate, generalises over seen cases, provides preliminary annotation for new documents, and performs smart ordering of documents. Advantages: it annotates trivial or previously seen cases, focuses the slow and expensive user activity on unseen cases, and the user mainly validates extracted information. This is simpler and less error prone, it speeds up corpus annotation, and the system learns how to improve its capabilities.
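A schematic, runnable sketch of this cooperation loop: the `MemorisingLearner` below is a trivial stand-in for Amilcare (it only memorises exact strings it has seen tagged), and `user_annotates` stands in for the human; the point is the control flow of annotate, train in the background, then pre-annotate.

```python
import re

class MemorisingLearner:
    """Trivial stand-in for Amilcare: it only memorises strings it has seen tagged."""
    def __init__(self):
        self.known = set()
    def train(self, annotations):
        self.known.update(annotations)
    def suggest(self, text):
        return [k for k in self.known if k in text]

def user_annotates(text):
    # Stand-in for the human annotator: tags anything that looks like "Dr. <Name>".
    return re.findall(r"Dr\. \w+", text)

docs = ["Dr. Steals presents in Dean Hall.",
        "A talk by Dr. Steals and Dr. Jones.",
        "Dr. Jones speaks at noon."]

learner = MemorisingLearner()
for doc in docs:
    suggestions = learner.suggest(doc)   # pre-annotation offered to the user for validation
    confirmed = user_annotates(doc)      # the user corrects/completes the annotation
    learner.train(confirmed)             # background learning from the validated document
    print(doc, "| suggested:", suggestions, "| confirmed:", confirmed)
```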

17 Methodology: Melita Bootstrap Phase
The user annotates the bare text; Amilcare learns in the background.

18 Methodology: Melita Checking Phase
Both the user and Amilcare annotate the bare text; learning continues in the background from missing tags and mistakes.

19 Methodology: Melita Support Phase
Amilcare annotates the bare text; the user corrects; the corrections are used to retrain.

20 Smart ordering of Documents
The user annotates bare text; the system learns from the annotations, tries to annotate all the remaining documents, and selects a document with partial annotations as the next one to present.
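A minimal sketch of the idea: prefer documents that are only partially covered by the learner's suggestions, since validating them is likely to teach the system the most. The scoring used here (distance from half coverage) is an illustrative assumption, not the scheme used by Melita.

```python
def pick_next(documents, coverage):
    """coverage[d] = fraction of expected slots the learner could fill in document d."""
    return min(documents, key=lambda d: abs(coverage[d] - 0.5))

coverage = {"doc_a": 1.0,   # fully annotated already: little new to learn
            "doc_b": 0.4,   # partially annotated: good candidate
            "doc_c": 0.0}   # nothing found: may be too hard or off-topic
print(pick_next(list(coverage), coverage))   # doc_b
```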

21 Intrusivity
An evolving system is difficult to control. Goal: avoid unwelcome/unreliable suggestions and adapt proactivity to the user's needs. Method: allow users to tune proactivity and monitor user reactions to suggestions.
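One rough way to picture such tuning: suppress suggestions whose confidence falls below a threshold and move the threshold according to how often the user accepts them. The update rule and numbers below are assumptions for illustration, not Melita's actual mechanism.

```python
class ProactivityController:
    def __init__(self, threshold=0.8, step=0.05):
        self.threshold = threshold
        self.step = step

    def should_show(self, confidence):
        # Only surface suggestions the learner is sufficiently confident about.
        return confidence >= self.threshold

    def record_feedback(self, accepted):
        # Rejections make the system less intrusive, acceptances more proactive.
        self.threshold += -self.step if accepted else self.step
        self.threshold = min(max(self.threshold, 0.0), 1.0)

ctrl = ProactivityController()
for accepted in [False, False, True]:
    ctrl.record_feedback(accepted)
print(round(ctrl.threshold, 2))   # 0.85
```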

22 Methodology: Melita Control Panel
The interface shows the ontology defining the concepts and a document panel. An ontology is an explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest, and the relationships that hold among them.

23 Results: amount of texts needed for training
Per tag, the number of texts needed for training, precision and recall: stime 20 84 63; etime 96 72; location 30 82 61; speaker 100 75 70 30 60. Precision: the total correct out of the number of elements retrieved. Recall: the total correct out of all the existing elements.

24 Future Work
Research better ways of annotating concepts in documents. Optimise document ordering to maximise the discovery of new tags. Allow users to edit the rules. Learn to discover relationships. Not only suggest but also correct user annotations (co-training).

25 Annotation for the Semantic Web
The Semantic Web requires document annotation. Current approaches are manual (e.g. Ontomat) or semi-automatic (MnM, S-CREAM, Melita). But manual or semi-automatic annotation of large, diverse repositories containing different and sparse information is unfeasible, e.g. a Web site of 1,600 pages.

26 Redundancy
Information on the Web (and in large repositories) is redundant: the same information is repeated in different superficial formats. Databases/ontologies are publicly available (Citeseer, the Computer Science Bibliography, people searches, etc.) and can be accessed by agents that wrap the site. Structured pages (e.g. produced by databases) constitute a large majority of the pages available and act as a front end to the information in databases. Largely structured pages (e.g. bibliography pages) contain lists or data with some sort of structure and can be accessed by a smart combination of natural language processing and wrapping techniques. Unstructured pages (free texts) contain loads of information that is difficult to extract and can be accessed with natural language processing techniques. In synthesis, the more structured information is used to bootstrap the learning from the less structured sources.

27 The Idea
Largely unsupervised annotation of documents, based on Adaptive Information Extraction and bootstrapped using the redundancy of information. Method: use the structured information (easier to extract) to bootstrap learning on the less structured sources (more difficult to extract). The Semantic Web requires huge amounts of annotations, and annotation tools are used to support annotation, but the manual annotation of web sites is unfeasible (1,600 pages). Information on the web is redundant (multiple citations of the same facts in different formats) and dispersed (knowledge is found in different sources, different websites, databases, etc.). Extraction and integration exploit this redundancy: automatically obtain examples from well defined structured sources (such as databases), seek the most relevant source, train adaptive IE algorithms using the discovered examples, and use the AIE algorithm to extract further examples. For example: 1. extract the list of papers for an author from a database such as Citeseer; 2. use a search engine with the examples just found to locate a list of papers by that author; 3. train the AIE algorithm using the examples found on the page returned by the search engine; 4. extract new papers from that page.
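A runnable toy sketch of that cycle; all data and "services" are fabricated stand-ins (no real Citeseer or Google calls), and the "learner" simply notices how seed titles are formatted on the page and then treats every similarly formatted line as a paper.

```python
def seeds_from_structured_source(author):
    # Stand-in for querying a structured source such as Citeseer.
    return {"Mining Web Sites Using Adaptive Information Extraction"}

def fetch_candidate_page(author):
    # Stand-in for a search-engine lookup of the author's publication page.
    return ("Publications of A. Dingli\n"
            "- Mining Web Sites Using Adaptive Information Extraction\n"
            "- Another Paper Found Only On The Home Page\n"
            "Contact: room 101")

def extract(page, seeds):
    # "Training": observe that known papers appear as lines starting with "- ",
    # then treat every line with that format as a paper.
    lines = page.splitlines()
    if not any(line.startswith("- ") and any(s in line for s in seeds) for line in lines):
        return set()                     # seeds not found, so nothing is learned from this page
    return {line[2:] for line in lines if line.startswith("- ")}

author = "A. Dingli"
seeds = seeds_from_structured_source(author)
papers = extract(fetch_candidate_page(author), seeds)
print(papers - seeds)    # the newly discovered paper
```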

28 Example: Extracting Bibliographies
Mines web sites to extract bibliographies from personal pages. Tasks: finding people's names, finding home pages, finding personal bibliography pages, extracting bibliography references. Sources: NE recognition (GATE's ANNIE), Citeseer/Unitrier (largely incomplete bibliographies), Google, Homepagesearch. To understand better how Armadillo works, it is illustrated with an example from the Computer Science Department web site scenario: imagine we need to extract bibliographies; how would Armadillo go about doing so?

29 Mining Web sites (1)
Armadillo mines the site looking for people's names, using generic patterns (NER) and Citeseer for likely bigrams, and looks for structured lists of names; it annotates the known names, trains on those annotations to discover the HTML structure of the page, and recovers all names and hyperlinks. In other words: the user wants to obtain the bibliographies from a Computer Science department and gives the URL of the department to Armadillo. Armadillo mines the site to find people's names, using generic patterns to identify possible names and verifying them with external sources like Citeseer; it also stores the information found in Citeseer (papers, co-authors, etc.) in an internal database for further use. Once a list of names is obtained, Google is used to find pages containing larger lists of names that include those already found; Amilcare is trained on the already available names and used to extract further names from each such page.
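An illustrative sketch of the candidate-and-verify step: propose names with a generic capitalised-bigram pattern and keep only those confirmed by an external repository. The `known_in_citeseer` set is a placeholder for a real lookup against Citeseer or a similar service.

```python
import re

def candidate_names(text):
    # Generic pattern: two adjacent capitalised words.
    return set(re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", text))

def verify(candidates, known_in_citeseer):
    # Keep only candidates confirmed by the external repository.
    return {name for name in candidates if name in known_in_citeseer}

page = "Fabio Ciravegna and Yorick Wilks work here. Directions: Main Building, floor 2."
known_in_citeseer = {"Fabio Ciravegna", "Yorick Wilks"}
print(verify(candidate_names(page), known_in_citeseer))
# {'Fabio Ciravegna', 'Yorick Wilks'} -- 'Main Building' is filtered out
```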

30 Experimental Results II - Sheffield
People: discovering who works in the department using Information Integration. Total present in the site: 139. Using generic patterns + online repositories: 35 correct, 5 wrong. Precision 35/40 = 87.5%; Recall 35/139 = 25.2%; F-measure 39.1%. Errors: A. Schriffin, Eugenio Moggi, Peter Gray. (Precision = retrieved-and-relevant / retrieved; recall = retrieved-and-relevant / relevant; F-measure = 2PR/(P+R). Note that this is not plain named-entity recognition: the target is names of people working in this specific site.)

31 Experimental Results IE - Sheffield
People, using Information Extraction. Total present in the site: 139. 116 correct, 8 wrong. Precision 116/124 = 93.5%; Recall 116/139 = 83.5%; F-measure 88.2%. Errors: Speech and Hearing, European Network, Department Of, Position Paper, The Network To System. Enhancements: lists, a postprocessor.

32 Experimental Results - Edinburgh
People, using Information Integration: total present in the site 216; using generic patterns + online repositories, 11 correct, 2 wrong; Precision 11/13 = 84.6%; Recall 11/216 = 5.1%; F-measure 9.6%. Using Information Extraction: 153 correct, 10 wrong; Precision 153/163 = 93.9%; Recall 153/216 = 70.8%; F-measure 80.7%.

33 Experimental Results - Aberdeen
People, using Information Integration: total present in the site 70; using generic patterns + online repositories, 21 correct, 1 wrong; Precision 21/22 = 95.5%; Recall 21/70 = 30.0%; F-measure 45.7%. Using Information Extraction: 63 correct, 2 wrong; Precision 63/65 = 96.9%; Recall 63/70 = 90.0%; F-measure 93.3%.

34 Mining Web sites (2)
Armadillo annotates the known papers, trains on those annotations to discover the HTML structure, and recovers co-authoring information. Within Armadillo's internal database there is a small, incomplete list of papers per person; this list is used to search Google for pages with larger lists of papers for that particular person. Each such page is annotated by Amilcare, which trains on the few papers it already has and extracts further papers.

35 Experimental Results (1)
Papers: discovering publications in the department using Information Integration. Total present in the site: 320. Using generic patterns + online repositories: 151 correct, 1 wrong. Precision 151/152 = 99%; Recall 151/320 = 47%; F-measure 64%. Errors: garbage in the database, e.g.
@misc{ computer-mining, author = "Department Of Computer", title = "Mining Web Sites Using Adaptive Information Extraction Alexiei Dingli and Fabio Ciravegna and David Guthrie and Yorick Wilks", url = "citeseer.nj.nec.com/ html" }

36 Experimental Results (2)
Papers, using Information Extraction. Total present in the site: 320. 214 correct, 3 wrong. Precision 214/217 = 99%; Recall 214/320 = 67%; F-measure 80%. Errors: wrong boundaries in the detection of paper names; names of workshops mistaken for paper names.

37 Artists domain
Task: given the name of an artist, find all the paintings of that artist. Created for the ArtEquAKT project.

38 Artists domain Evaluation
(II = Information Integration, IE = Information Extraction)

Artist      Method  Precision  Recall  F-Measure
Caravaggio  II      100.0%     61.0%   75.8%
Caravaggio  IE      100.0%     98.8%   99.4%
Cezanne     II      100.0%     27.1%   42.7%
Cezanne     IE       91.0%     42.6%   58.0%
Manet       II      100.0%     29.7%   45.8%
Manet       IE      100.0%     40.6%   57.8%
Monet       II      100.0%     14.6%   25.5%
Monet       IE       86.3%     48.5%   62.1%
Raphael     II      100.0%     59.9%   74.9%
Raphael     IE       96.5%     86.4%   91.2%
Renoir      II       94.7%     40.0%   56.2%
Renoir      IE       96.4%     60.0%   74.0%

39 User Role
The user's role is very limited. The user provides: a URL; a list of services, either already wrapped (e.g. Google is in the default library) or wrappers trained using examples; and examples of fillers (e.g. project names). Occasionally the user corrects intermediate results or reactivates Armadillo when it pauses. A user can either use an already defined service or create a new one: in the first case very limited information is required (in the CS scenario, just the URL of the start site is enough); to create a new scenario or modify an existing one, Armadillo provides tools to do so. A user might be asked to give examples of concepts, which are then used to find site-independent patterns for those concepts; in the CS domain this was used to find generic patterns for identifying projects. There are no domains to describe.

40 Armadillo
Armadillo provides several facilities for information harvesting, information discovery and information integration: 1. a library of wrappers for known services (such as Google and Citeseer); 2. tools for training learners/wrappers for other structured sources; 3. tools for bootstrapping learning from structured and unstructured sources, with no user annotation required: the system creates the annotations, sets up an AIE learner, trains the algorithm and extracts the information without any user intervention. Acquisition of information is multi-strategy and exploits redundancy. At any stage the user can revise the results and feed them back to the system to bootstrap further learning (user-driven revision of results, with re-learning after user correction).

41 Rationale
Armadillo learns how to extract information from large repositories by integrating information from diverse and distributed resources. Uses: ontology population, information highlighting, document enrichment, enhancing the user experience, real-time enrichment.

42 Data Navigation (1) Armadillo GUI
From the Armadillo GUI the user can control the processes and manage the data, and can view Armadillo's output as a graph or as RDF triples.
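The triple output can be pictured with a tiny sketch; the slides do not say how Armadillo serialises its data, so the use of the Python rdflib library and the namespace/property names below are assumptions for illustration only.

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/armadillo/")   # invented namespace for the example

g = Graph()
person = EX["Fabio_Ciravegna"]
paper = EX["mining-web-sites"]
g.add((person, EX["authorOf"], paper))            # person --authorOf--> paper
g.add((paper, EX["hasTitle"],
       Literal("Mining Web Sites Using Adaptive Information Extraction")))

print(g.serialize(format="turtle"))               # the same facts as Turtle triples
```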

43 Data Navigation (2)

44 Data Navigation (3)

45 IE for SW: The Vision
Automatic annotation services for a specific ontology, constantly re-indexing/re-annotating documents, plus a semantic search engine. Effects: no annotation is stored in the document (just as today's search indexes are not stored in the documents), so there is no legacy with the past; annotation is always with the latest version of the ontology; multiple annotations are possible for a single document; and maintenance is simplified (e.g. when a page has changed but has not yet been re-annotated).

46 Questions?

