Introduction to Web Science

Introduction to Web Science: Harvesting the Semantic Web

Six challenges of the Knowledge Life Cycle:
- Acquire
- Model
- Reuse
- Retrieve
- Publish
- Maintain

Information Extraction vs. Retrieval
IR finds the documents that are relevant to a query; IE extracts specific facts from within those documents.

A couple of approaches
- Supervised learning with Adaptive IE, using active learning to reduce the annotation burden: the Melita methodology.
- Largely unsupervised automatic annotation of large repositories: Armadillo.
Active learning gets the system involved in the activity rather than passively learning from examples.

The Seminar Announcements Task
Created by the Carnegie Mellon School of Computer Science: from seminar announcements received by email, retrieve the Speaker, Location, Start Time and End Time.

Seminar Announcements Example Dr. Steals presents in Dean Hall at one am. becomes <speaker>Dr. Steals</speaker> presents in <location>Dean Hall</location> at <stime>one am</stime>.
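
The inline mark-up above can be mimicked with a few hand-written regular expressions. This is only a toy illustration of the target output format: the patterns below are invented for this one sentence and are not the CMU task's actual extractor (which is learned, not hand-coded).

```python
import re

def annotate(text, patterns):
    # Wrap every match of each pattern in an inline XML-style tag.
    for tag, pattern in patterns:
        text = re.sub(pattern,
                      lambda m, t=tag: f"<{t}>{m.group(0)}</{t}>",
                      text)
    return text

# Hypothetical hand-written patterns for the example sentence.
patterns = [
    ("speaker", r"Dr\.\s+\w+"),
    ("location", r"\w+ Hall"),
    ("stime", r"\b(?:one|two|\d{1,2}) [ap]m\b"),
]

print(annotate("Dr. Steals presents in Dean Hall at one am.", patterns))
# <speaker>Dr. Steals</speaker> presents in <location>Dean Hall</location> at <stime>one am</stime>.
```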

Information Extraction Measures
- Precision: how many of the retrieved documents are relevant?
- Recall: how many of all the relevant documents were retrieved?
- F-measure: the weighted harmonic mean of precision and recall.

IE Measures Example
I ask the librarian to search for books on cars. There are 10 relevant books in the library, and of the 8 he found only 4 are relevant. What are his precision, recall and f-measure?

IE Measures Answers
I ask the librarian to search for books on cars. There are 10 relevant books in the library, and of the 8 he found only 4 are relevant.
Precision = 4/8 = 50%
Recall = 4/10 = 40%
F = (2 * 50 * 40) / (50 + 40) = 44.4%
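
The arithmetic is easy to check in a few lines; this is a minimal sketch of the standard definitions, nothing system-specific:

```python
def precision_recall_f(retrieved, relevant_retrieved, relevant_total):
    # Precision: fraction of retrieved items that are relevant.
    precision = relevant_retrieved / retrieved
    # Recall: fraction of all relevant items that were retrieved.
    recall = relevant_retrieved / relevant_total
    # F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# The librarian example: 8 books found, 4 of them relevant, 10 relevant overall.
p, r, f = precision_recall_f(retrieved=8, relevant_retrieved=4, relevant_total=10)
print(f"P={p:.0%}  R={r:.0%}  F={f:.1%}")  # P=50%  R=40%  F=44.4%
```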

Adaptive IE
What is IE? Automated ways of extracting information from unstructured or partially structured machine-readable files.
What is Adaptive IE? It performs the tasks of traditional IE, but exploits the power of machine learning to adapt to complex domains with large amounts of domain-dependent data, different sub-language features and different text genres, and it treats the usability and accessibility of the system as important.

Amilcare
- Tool for adaptive IE from Web-related texts, specifically designed for document annotation.
- Based on the (LP)2 algorithm (Linguistic Patterns by Learning Patterns): a covering algorithm based on lazy NLP.
- Trains with a limited amount of examples.
- Effective on different text types: free texts, semi-structured texts, structured texts.
- Uses GATE and ANNIE for preprocessing.
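
The covering idea behind (LP)2 can be shown with a deliberately tiny sketch: take an uncovered annotated example, turn its immediate left/right context into a rule, and drop every other example the rule already covers. This is a didactic simplification, far cruder than the real algorithm (no generalisation over linguistic features, no correction rules):

```python
def learn_rules(examples):
    # Toy covering loop: each example is (left word, tagged token, right word).
    rules, uncovered = [], list(examples)
    while uncovered:
        left, _, right = uncovered.pop(0)
        rule = (left, right)            # "tag the token between these two words"
        if rule not in rules:
            rules.append(rule)
        # Remove every remaining example this rule covers.
        uncovered = [e for e in uncovered if (e[0], e[2]) != rule]
    return rules

def apply_rules(tokens, rules):
    # Tag any token whose immediate neighbours match a learned rule.
    return [tokens[i] for i in range(1, len(tokens) - 1)
            if (tokens[i - 1], tokens[i + 1]) in rules]

examples = [("at", "3pm", "in"), ("at", "noon", "in")]
rules = learn_rules(examples)                              # one rule: ("at", "in")
print(apply_rules(["seminar", "at", "4pm", "in", "room"], rules))  # ['4pm']
```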

CMU: detailed results
- Best overall accuracy.
- Best result on the speaker field.
- No results below 75%.

GATE (General Architecture for Text Engineering)
Provides a software infrastructure for researchers and developers working in NLP. Components include:
- Tokeniser
- Gazetteers
- Sentence Splitter
- POS Tagger
- Semantic Tagger (ANNIE)
- Co-reference Resolution
- Multilingual support
- Protégé and WEKA integration
Many more components exist and can be added. http://www.gate.ac.uk
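
GATE itself is a Java framework, but the pipeline idea (independent processing resources that each enrich a shared document with annotations) can be sketched in a few lines. The component names and the dict-based document below are illustrative, not GATE's actual API:

```python
# A minimal Python analogue of a GATE-style processing pipeline:
# each component reads and enriches a shared document, the way GATE
# processing resources add annotation sets to a document.

def tokeniser(doc):
    # Crude whitespace tokeniser that splits off final punctuation.
    doc["tokens"] = doc["text"].replace(".", " .").split()
    return doc

def gazetteer(doc):
    # Toy gazetteer lookup against a fixed list of place names.
    CITIES = {"Sheffield", "Edinburgh"}
    doc["lookups"] = [t for t in doc["tokens"] if t in CITIES]
    return doc

def run_pipeline(text, components):
    doc = {"text": text}
    for component in components:
        doc = component(doc)
    return doc

doc = run_pipeline("The NLP group is in Sheffield.", [tokeniser, gazetteer])
print(doc["lookups"])  # ['Sheffield']
```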

Current practice of annotation for knowledge identification and extraction:
- needs annotation by experts
- is complex
- is time consuming
Goal: reduce the burden of text annotation for Knowledge Management.

Different Annotation Systems
SGML, TEX, Xanadu, CoNote, ComMentor, JotBot, Third Voice, Annotate.net, The Annotation Engine, Alembic, the GATE Annotation Tool, iMarkup, Yawas, MnM, S-CREAM.
- SGML (Standard Generalised Markup Language, 1969): developed at IBM; the task was to integrate law office information systems, allowing editing, formatting and information-retrieval subsystems to share documents carrying different meta-information. The ancestor of modern markup languages.
- TEX (1970s-80s): one of the initial typesetting systems, still used today with many enhancements such as LATEX. TEX reinforced the idea that layout information and content can be mixed in the same document, which lies at the base of modern web languages like HTML.
- Xanadu (1988): the original hypertext project; still alive (Cosmic Book, 1990s).
- CoNote: supports collaborative work; users share documents and the notes (annotations) inserted in those documents.
- ComMentor: a meta-viewer; users are allowed to enter meta-information in text.
- JotBot: an applet that retrieves annotations from specialised servers and presents an interface for reading and composing annotations; one of the first on-the-fly annotation tools.
- Third Voice: a commercial browser plugin for annotation, a sort of newsgroup service where users could add comments to any page; the original page was not altered, annotations were overlaid afterwards. It raised interesting issues: it was unpopular with many web-site owners, who were disturbed by the idea of people posting critical, off-topic or obscene material on top of their site, and legal action was discussed. Another issue was privacy: the annotations were centrally stored and controlled by Third Voice.
- Annotate.net: similar to Third Voice, but it restricts who can add comments.
- The Annotation Engine: similar to the others; annotates pages by passing them through a proxy.
- Alembic: uses several strategies to bootstrap the annotation process (string matching, rule languages, gazetteers, statistical analysis such as frequency counts) and trains a learning algorithm; it does not cater for redundant information, so the user must re-tag.
- GATE annotation tool: allows the user to execute linguistic modules over the document.
- iMarkup: similar, but allows mark-up in the form of text, sound, drawing, etc.
- MnM: an ontology editor plus web browser; uses IE to help the user annotate; learning is done in phases, i.e. browse, markup, learn, test and extract.
- S-CREAM.

Melita
- Tool for assisted automatic annotation.
- Uses an Adaptive IE engine to learn how to annotate (no rule writing is needed to adapt the system).
- Users annotate document samples; the IE system trains while users annotate, generalises over seen cases, provides preliminary annotation for new documents, and performs smart ordering of documents.
Advantages:
- Trivial or previously seen cases are annotated automatically, focusing slow and expensive user activity on unseen cases.
- The user mainly validates extracted information: simpler, less error-prone, and faster corpus annotation.
- The system learns how to improve its capabilities.

Methodology: Melita, Bootstrap Phase
The user annotates bare text while Amilcare learns in the background.

Methodology: Melita, Checking Phase
The user and Amilcare both annotate bare text; Amilcare keeps learning in the background from missing tags and mistakes.

Methodology: Melita, Support Phase
Amilcare annotates bare text and the user corrects; the corrections are used to retrain.

Smart Ordering of Documents
While the user annotates bare text, the system learns from the annotations, tries to annotate all the remaining documents, and selects the document with partial annotations as the next one to present.
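
The selection step can be sketched as a toy active-learning heuristic: run the current extractor over the unannotated pool and prefer the document where it finds some, but not all, of the expected tags, i.e. the partially annotated case the user can correct most productively. Function and parameter names here are illustrative, not Melita's actual API:

```python
def pick_next_document(docs, extractor, expected_tags):
    # Score 0 for partially annotated documents (best to show the user),
    # score 1 for documents with no or complete annotations.
    def score(doc):
        found = len(extractor(doc))
        return 0 if 0 < found < expected_tags else 1
    return min(docs, key=score)

# Toy extractor that "annotates" capitalised words.
extractor = lambda doc: [w for w in doc.split() if w[0].isupper()]

docs = ["nothing here at all",           # no annotations found
        "Alice presents in Dean Hall",   # several found: partial coverage
        "meeting with Bob tomorrow"]     # one found
print(pick_next_document(docs, extractor, expected_tags=4))
# Alice presents in Dean Hall
```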

Intrusivity
An evolving system is difficult to control.
Goal: avoid unwelcome or unreliable suggestions by adapting proactivity to the user's needs.
Method: allow users to tune proactivity, and monitor user reactions to suggestions.

Methodology: Melita Control Panel
An ontology panel defining the concepts, and a document panel.
(Ontology: an explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest, and the relationships that hold among them.)

Results (amount of texts needed for training, precision and recall per tag):
- stime: 20 training texts; precision 84, recall 63
- etime: precision 96, recall 72
- location: 30 training texts; precision 82, recall 61
- speaker: 100 training texts; precision 75, recall 70
Precision: total correct out of the number of elements retrieved. Recall: total correct out of all the existing elements.

Future Work
- Research better ways of annotating concepts in documents.
- Optimise document ordering to maximise the discovery of new tags.
- Allow users to edit the rules.
- Learn to discover relationships.
- Not only suggest but also correct user annotations (co-training).

Annotation for the Semantic Web
The Semantic Web requires document annotation. Current approaches are manual (e.g. OntoMat) or semi-automatic (MnM, S-CREAM, Melita). BUT manual or semi-automatic annotation of large, diverse repositories containing different and sparse information is unfeasible, e.g. a web site of some 1,600 pages.

Redundancy
Information on the Web (or in large repositories) is redundant: the same information is repeated in different superficial formats.
- Databases and ontologies are publicly available (CiteSeer, the Computer Science Bibliography, people searches, etc.) and can be accessed by agents that wrap the site.
- Structured pages (e.g. produced by databases) constitute a big majority of the pages available; they are front ends to the information in databases.
- Largely structured pages (e.g. bibliography pages) contain lists or data with some sort of structure, and can be accessed by a smart combination of natural language processing and wrapping techniques.
- Unstructured pages (free texts) contain loads of information which is difficult to extract, and can be accessed with natural language processing techniques.
In synthesis, the more structured information is used to bootstrap the learning from the less structured sources.

The Idea
Largely unsupervised annotation of documents, based on Adaptive Information Extraction and bootstrapped using the redundancy of information.
The Semantic Web requires huge amounts of annotations, and the manual annotation of whole web sites (e.g. 1,600 pages) is unfeasible. Information on the web is redundant (multiple citations of the same facts in different formats) and dispersed (knowledge found in different sources, websites, databases, etc.). Extraction and integration exploit this redundancy: use the structured information (easier to extract) to bootstrap learning on less structured sources (more difficult to extract).
1. Automatically obtain examples from well-defined structured sources (such as databases), seeking the most relevant source.
2. Train Adaptive IE algorithms on the discovered examples.
3. Use the AIE algorithm to extract further examples.
E.g.: extract the list of papers of an author from a database such as CiteSeer; use a search engine with the examples just found to locate a page listing that author's papers; train the AIE algorithm on the examples found on the returned page; extract new papers from that page.
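
The bootstrapping loop can be sketched in miniature: facts harvested from a structured source (the seeds) locate occurrences in less structured text, the surrounding context becomes a crude extraction pattern, and the pattern yields new facts. Everything below (the prefix-as-pattern scheme, the toy data) is an illustrative simplification, not Armadillo's actual machinery:

```python
def bootstrap(seed_facts, corpus, rounds=2):
    # Seeds come from a structured source (e.g. a database of papers).
    known = set(seed_facts)
    for _ in range(rounds):
        # 1. Turn the left context of each known fact into a pattern.
        patterns = set()
        for sentence in corpus:
            for fact in known:
                if fact in sentence:
                    prefix = sentence.split(fact)[0]
                    if prefix:
                        patterns.add(prefix)
        # 2. Apply the patterns to harvest new facts.
        for sentence in corpus:
            for prefix in patterns:
                if sentence.startswith(prefix):
                    known.add(sentence[len(prefix):].rstrip("."))
    return known

corpus = ["X wrote Paper A.", "X wrote Paper B.", "Y cited Paper A."]
print(sorted(bootstrap({"Paper A"}, corpus)))  # ['Paper A', 'Paper B']
```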

Example: Extracting Bibliographies
Armadillo mines web sites to extract bibliographies from personal pages. Tasks: finding people's names, finding home pages, finding personal bibliography pages, extracting bibliography references. Sources: named-entity recognition (GATE's ANNIE), CiteSeer/UniTrier (largely incomplete bibliographies), Google, HomePageSearch.
To understand better how Armadillo works, consider the Computer Science Department web-site scenario: imagine we need to extract bibliographies; how would Armadillo go about doing so?

Mining Web sites (1)
Armadillo mines the site looking for people's names. It uses generic patterns (NER) and CiteSeer for likely bigrams, looks for structured lists of names, annotates known names, trains on the annotations to discover the HTML structure of the page, and recovers all names and hyperlinks.
A user who wants the bibliographies from a computer science department gives its URL to Armadillo. Armadillo mines the site to find people's names, using generic patterns to identify possible names and verifying them with external sources like CiteSeer; the information found in CiteSeer (papers, co-authors, etc.) is stored in an internal database for further use. Once a list of names is obtained, Google is used to find pages with a bigger list of names that includes those already found; Amilcare is trained on the already available names and used to extract further names from those pages.
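
The "train on known names to discover the HTML structure" step amounts to wrapper induction. A minimal sketch under strong assumptions (a single enclosing tag pair, regex-based matching; the real system trains Amilcare rather than doing this):

```python
import re

def induce_wrapper(html, known_items):
    # Find the HTML that encloses a known item (e.g. a name already
    # verified via CiteSeer), turn that context into a pattern, and
    # use it to recover every item sharing the same structure.
    for item in known_items:
        m = re.search(r"(<[^>]+>)" + re.escape(item) + r"(</[^>]+>)", html)
        if m:
            left, right = map(re.escape, (m.group(1), m.group(2)))
            return re.findall(left + r"([^<]+)" + right, html)
    return []

html = "<ul><li>Fabio Ciravegna</li><li>Yorick Wilks</li><li>Alexiei Dingli</li></ul>"
print(induce_wrapper(html, ["Yorick Wilks"]))
# ['Fabio Ciravegna', 'Yorick Wilks', 'Alexiei Dingli']
```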

Experimental Results (Information Integration) - Sheffield
People: discovering who works in the department using Information Integration.
Total present in the site: 139. Using generic patterns + online repositories: 35 correct, 5 wrong.
Precision: 35 / 40 = 87.5%. Recall: 35 / 139 = 25.2%. F-measure: 39.1%.
Errors: A. Schriffin, Eugenio Moggi, Peter Gray.
(Precision = retrieved-and-relevant / retrieved; recall = retrieved-and-relevant / relevant; F = 2RP/(R+P). Note that this is not plain named-entity recognition: the task is recognising the names of people working in this specific site.)

Experimental Results (Information Extraction) - Sheffield
People: using Information Extraction. Total present in the site: 139. 116 correct, 8 wrong.
Precision: 116 / 124 = 93.5%. Recall: 116 / 139 = 83.5%. F-measure: 88.2%.
Errors: Speech and Hearing, European Network, Department Of, Position Paper, The Network To System.
Enhancements: lists, postprocessor.

Experimental Results - Edinburgh
People, total present in the site: 216.
Using Information Integration (generic patterns + online repositories): 11 correct, 2 wrong. Precision: 11 / 13 = 84.6%. Recall: 11 / 216 = 5.1%. F-measure: 9.6%.
Using Information Extraction: 153 correct, 10 wrong. Precision: 153 / 163 = 93.9%. Recall: 153 / 216 = 70.8%. F-measure: 80.7%.

Experimental Results - Aberdeen
People, total present in the site: 70.
Using Information Integration (generic patterns + online repositories): 21 correct, 1 wrong. Precision: 21 / 22 = 95.5%. Recall: 21 / 70 = 30.0%. F-measure: 45.7%.
Using Information Extraction: 63 correct, 2 wrong. Precision: 63 / 65 = 96.9%. Recall: 63 / 70 = 90.0%. F-measure: 93.3%.

Mining Web sites (2)
Armadillo annotates known papers, trains on the annotations to discover the HTML structure, and recovers co-authoring information.
Within Armadillo's internal database there is a small, incomplete list of papers per person. This list is used to search Google for pages with larger lists of papers by that person; Amilcare trains on the few papers it already has, annotates the page, and extracts further papers.

Experimental Results (1): Papers
Discovering publications in the department using Information Integration. Total present in the site: 320. Using generic patterns + online repositories: 151 correct, 1 wrong.
Precision: 151 / 152 = 99%. Recall: 151 / 320 = 47%. F-measure: 64%.
Errors: garbage in the database, e.g.
@misc{ computer-mining,
  author = "Department Of Computer",
  title = "Mining Web Sites Using Adaptive Information Extraction Alexiei Dingli and Fabio Ciravegna and David Guthrie and Yorick Wilks",
  url = "citeseer.nj.nec.com/582939.html"
}

Experimental Results (2): Papers
Using Information Extraction. Total present in the site: 320. 214 correct, 3 wrong.
Precision: 214 / 217 = 99%. Recall: 214 / 320 = 67%. F-measure: 80%.
Errors: wrong boundaries in the detection of paper names; names of workshops mistaken for paper names.

Artists domain Task Given the name of an artist, find all the paintings of that artist. Created for the ArtEquAKT project

Artists domain results (II = Information Integration, IE = Information Extraction):

Artist      Method   Precision   Recall   F-Measure
Caravaggio  II       100.0%      61.0%    75.8%
            IE        98.8%      99.4%
Cezanne     II                   27.1%    42.7%
            IE        91.0%      42.6%    58.0%
Manet       II                   29.7%    45.8%
            IE                   40.6%    57.8%
Monet       II                   14.6%    25.5%
            IE        86.3%      48.5%    62.1%
Raphael     II                   59.9%    74.9%
            IE        96.5%      86.4%    91.2%
Renoir      II        94.7%      40.0%    56.2%
            IE        96.4%      60.0%    74.0%

User Role
The user provides a URL and a list of services, either already wrapped (e.g. Google is in the default library) or wrapped by training on examples, plus examples of fillers (e.g. project names). The user intervenes only to correct intermediate results or to reactivate Armadillo when it pauses.
The user's role is very limited: either use an already defined service or create a new one. In the first case very little information is required; in the CS scenario, just the URL of the start site is enough. To create a new scenario or modify an existing one, Armadillo provides tools; the user may be asked for examples of concepts, which are used to find site-independent patterns for those concepts (in the CS domain this was used to find generic patterns identifying projects). There are no domains to describe.

Armadillo
- A library of known services (e.g. Google, CiteSeer).
- Tools for training learners for other structured sources.
- Tools for bootstrapping learning from un/structured sources, with no user annotation.
- Multi-strategy acquisition of information using redundancy.
- User-driven revision of results, with re-learning after user correction.
Armadillo provides wrappers for known services, tools to facilitate the creation of such wrappers for new services, and tools to harvest information from structured sources and use it to discover information in unstructured sources. For this, no user annotation is required: the system creates annotations, sets up an AIE learner, trains the algorithm and extracts the information without any user intervention. At any stage the user can revise the results and feed them back to the system to bootstrap further learning.

Rationale
Armadillo learns how to extract information from large repositories by integrating information from diverse and distributed resources.
Uses: ontology population, information highlighting, document enrichment, enhancing the user experience, real-time enrichment.

Data Navigation (1)
Through the Armadillo GUI the user can control the processes and manage the data; Armadillo's output can be viewed as a graph or as RDF triples.

Data Navigation (2)

Data Navigation (3)

IE for SW: The Vision
Automatic annotation services for a specific ontology, constantly re-indexing and re-annotating documents, feeding a semantic search engine.
Effects:
- No annotation is stored in the document (just as today's search indexes are not stored in the documents), so there is no legacy with the past.
- Documents are always annotated with the latest version of the ontology.
- Multiple annotations are possible for a single document.
- Maintenance is simplified (e.g. a page that changed but was not yet re-annotated).

Questions?