1 Dr Alexiei Dingli Introduction to Web Science Harvesting the SW.

Presentation transcript:

1 Dr Alexiei Dingli Introduction to Web Science Harvesting the SW

2 Acquire Model Reuse Retrieve Publish Maintain Six challenges of the Knowledge Life Cycle

3 Information Extraction (IE) vs. Information Retrieval (IR)

4 A couple of approaches … Active learning to reduce annotation burden –Supervised learning –Adaptive IE –The Melita methodology Automatic annotation of large repositories –Largely unsupervised –Armadillo

5 Created by Carnegie Mellon School of Computer Science How to retrieve –Speaker –Location –Start Time –End Time From seminar announcements received by The Seminar Announcements Task

6 "Dr. Steals presents in Dean Hall at one am." becomes "<speaker>Dr. Steals</speaker> presents in <location>Dean Hall</location> at <stime>one am</stime>." Seminar Announcements Example
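An annotated announcement can be read back into its fields with a few lines of code. The following is a minimal sketch, assuming inline XML-style tags named `speaker`, `location`, `stime` and `etime` (the tag names are an assumption based on the four fields of the task, and `extract_fields` is a hypothetical helper, not part of any tool named in the slides).

```python
import re

# Assumed tag names, one per field of the seminar announcements task.
FIELDS = ("speaker", "location", "stime", "etime")


def extract_fields(annotated):
    """Map each field name to the list of text spans tagged with it."""
    found = {name: [] for name in FIELDS}
    # \1 forces the closing tag to match the opening tag name.
    for name, text in re.findall(r"<(\w+)>(.*?)</\1>", annotated):
        if name in found:
            found[name].append(text)
    return found


example = ("<speaker>Dr. Steals</speaker> presents in "
           "<location>Dean Hall</location> at <stime>one am</stime>.")
fields = extract_fields(example)
print(fields)
```

Fields that do not occur in the text, such as the end time here, come back as empty lists.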

7 Precision: how many of the retrieved documents are relevant? Recall: how many of all the relevant documents were retrieved? F-measure: the weighted harmonic mean of precision and recall. Information Extraction Measures

8 If I ask the librarian to search for books on cars, there are 10 relevant books in the library and out of the 8 he found, only 4 seem to be relevant books. What is his precision, recall and f-measure? IE Measures Examples

9 If I ask the librarian to search for books on cars, there are 10 relevant books in the library and out of the 8 he found, only 4 seem to be relevant books. What is his precision, recall and f-measure? Precision = 4/8 = 50% Recall = 4/10 = 40% F=(2*50*40)/(50+40) = 44.4% IE Measures Answers
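The librarian calculation can be reproduced in code. This is a small sketch; the function name `precision_recall_f1` is chosen here for illustration, and the F-measure is the balanced (equally weighted) harmonic mean used in the worked answer.

```python
def precision_recall_f1(correct, retrieved, relevant):
    """Return (precision, recall, F-measure) as fractions in [0, 1]."""
    precision = correct / retrieved if retrieved else 0.0
    recall = correct / relevant if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 10 relevant books exist, 8 were found, 4 of those are relevant.
p, r, f = precision_recall_f1(correct=4, retrieved=8, relevant=10)
print(f"P={p:.1%} R={r:.1%} F={f:.1%}")  # P=50.0% R=40.0% F=44.4%
```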

10 What is IE? –Automated ways of extracting unstructured or partially structured information from machine-readable files. What is AIE? –Performs the tasks of traditional IE –Exploits the power of Machine Learning to adapt to complex domains with large amounts of domain-dependent data, different sub-language features and different text genres –Treats the usability and accessibility of the system as important. Adaptive IE

11 Amilcare: tool for adaptive IE from Web-related texts –Specifically designed for document annotation –Based on the (LP)² algorithm *Linguistic Patterns by Learning Patterns. Covering algorithm based on Lazy NLP. Trains with a limited amount of examples. Effective on different text types –free texts –semi-structured texts –structured texts –Uses GATE and ANNIE for preprocessing

12 CMU: detailed results 1.Best overall accuracy 2.Best result on speaker field 3.No results below 75%

13 GATE: General Architecture for Text Engineering –provides a software infrastructure for researchers and developers working in NLP. Contains –Tokeniser –Gazetteers –Sentence Splitter –POS Tagger –Semantic Tagger (ANNIE) –Co-reference Resolution –Multilingual support –Protégé –WEKA –many more exist and can be added

14 Current practice of annotation for knowledge identification and extraction. Annotation: is time consuming; needs annotation by experts; is complex. Reduce the burden of text annotation for Knowledge Management

15 Different Annotation Systems: SGML, TeX, Xanadu, CoNote, ComMentor, JotBot, Third Voice, Annotate.net, The Annotation Engine, Alembic, The Gate Annotation Tool, iMarkup, Yawas, MnM, S-CREAM

16 Tool for assisted automatic annotation. Uses an Adaptive IE engine to learn how to annotate (no rule writing is needed to adapt the system). User: annotates document samples. IE system: –Trains while the user annotates –Generalizes over seen cases –Provides preliminary annotation for new documents. Performs smart ordering of documents. Advantages: –Annotates trivial or previously seen cases –Focuses slow/expensive user activity on unseen cases –User mainly validates extracted information: simpler and less error prone, speeds up corpus annotation –The system learns how to improve its capabilities. Melita

17 Methodology: Melita Bootstrap Phase Bare Text Amilcare Learns in background User Annotates

18 Methodology: Melita Checking Phase Bare Text Learning in background from missing tags, mistakes User Annotates Amilcare Annotates

19 Methodology: Melita Support Phase Bare Text Corrections used to retrain Amilcare Annotates User Corrects
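The three phases (the user annotates, the learner starts suggesting, the user's corrections retrain the learner) can be sketched as a single loop. This is an illustration only: `MemorisingLearner` is a deliberately trivial stand-in that remembers tagged strings, not Amilcare or the (LP)² algorithm, and `annotate_corpus` and the sample documents are hypothetical.

```python
class MemorisingLearner:
    """Toy stand-in for an adaptive IE engine: remembers tagged strings."""

    def __init__(self):
        self.known = {}  # surface string -> tag

    def train(self, annotations):
        for text, tag in annotations:
            self.known[text] = tag

    def suggest(self, document):
        return {(text, tag) for text, tag in self.known.items()
                if text in document}


def annotate_corpus(documents, user_annotate):
    """Run the suggest/correct/retrain loop.

    Returns, per document, how many suggestions the user accepted.
    """
    learner = MemorisingLearner()
    accepted_counts = []
    for doc in documents:
        suggestions = learner.suggest(doc)  # learner annotates first
        gold = user_annotate(doc)           # user validates or corrects
        accepted_counts.append(len(suggestions & gold))
        learner.train(gold)                 # corrections retrain the learner
    return accepted_counts


# Hypothetical corpus with the annotations a user would provide.
gold_standard = {
    "Dr. Steals presents in Dean Hall":
        {("Dr. Steals", "speaker"), ("Dean Hall", "location")},
    "Dr. Steals presents in Room 5":
        {("Dr. Steals", "speaker"), ("Room 5", "location")},
    "Prof. Kay presents in Dean Hall":
        {("Prof. Kay", "speaker"), ("Dean Hall", "location")},
}
counts = annotate_corpus(list(gold_standard), gold_standard.get)
print(counts)  # [0, 1, 1]
```

The first document receives no suggestions (bootstrap); from the second one onward the user only has to add what the learner has not seen before.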

20 Smart ordering of Documents Bare Text Tries to annotate all the documents and selects the document with partial annotations Learns annotations User Annotates
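One possible reading of "selects the document with partial annotations" is sketched below. It assumes that a document counts as partially annotated when the learner finds some, but not all, of the expected tags in it; that interpretation, the function `select_next` and the stub data are assumptions made for illustration.

```python
def select_next(documents, suggest, expected_tags):
    """Pick the next document to show the user.

    Assumption: "partial" means the learner finds some, but not all,
    of the expected tags in the document.
    """
    expected = set(expected_tags)
    fallback = None
    for doc in documents:
        found = {tag for _text, tag in suggest(doc)}
        if found and found < expected:  # non-empty proper subset
            return doc
        if fallback is None:
            fallback = doc
    return fallback  # no partial document: take the first one, if any


# Stub suggestions standing in for the output of a trained learner.
stub = {
    "doc_a": set(),
    "doc_b": {("Dr. Steals", "speaker"), ("Dean Hall", "location")},
    "doc_c": {("Dr. Steals", "speaker")},
}
chosen = select_next(["doc_a", "doc_b", "doc_c"], stub.get,
                     ["speaker", "location"])
print(chosen)  # doc_c
```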

21 An evolving system is difficult to control Goal: –Avoiding unwelcome/unreliable suggestions –Adapting proactivity to user’s needs Method: –Allow users to tune proactivity –Monitor user reactions to suggestions Intrusivity
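Tunable proactivity can be pictured as a confidence threshold that the user sets and that moves with the user's reactions. The sketch below is an assumption about how such a control might work, not a description of Melita's implementation; `SuggestionFilter`, its default values and the sample suggestions are hypothetical.

```python
class SuggestionFilter:
    """Show only suggestions whose confidence reaches a tunable threshold."""

    def __init__(self, threshold=0.5, step=0.05):
        self.threshold = threshold  # set by the user
        self.step = step            # how fast reactions move the threshold

    def visible(self, suggestions):
        """Keep (text, tag, confidence) triples at or above the threshold."""
        return [s for s in suggestions if s[2] >= self.threshold]

    def feedback(self, accepted):
        """Lower the threshold after an acceptance, raise it after a rejection."""
        delta = -self.step if accepted else self.step
        self.threshold = min(1.0, max(0.0, self.threshold + delta))


suggestions = [("Dr. Steals", "speaker", 0.52), ("Dean Hall", "location", 0.9)]
flt = SuggestionFilter()
flt.feedback(accepted=False)           # a rejected suggestion: be less intrusive
shown_after_reject = flt.visible(suggestions)
flt.feedback(accepted=True)
flt.feedback(accepted=True)            # two acceptances: be more proactive
shown_after_accepts = flt.visible(suggestions)
print(len(shown_after_reject), len(shown_after_accepts))  # 1 2
```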

22 Methodology: Melita Ontology defining concepts Control Panel Document Panel

23 Results. Table columns: Tag, Amount of Texts needed for training, Prec, Rec. Table rows: stime, etime, location, speaker.

24 Research better ways of annotating concepts in documents. Optimise document ordering to maximise the discovery of new tags. Allow users to edit the rules. Learn to discover relationships !! Not only suggest but also correct user annotations !! Future Work

25 Semantic Web requires document annotation –Current approaches: manual (e.g. Ontomat) or semi-automatic (MnM, S-CREAM, Melita) BUT: –Manual/semi-automatic annotation of large, diverse repositories containing different and sparse information is unfeasible, e.g. a Web site (So: 1,600 pages). Annotation for the Semantic Web

26 Information on the Web (or large repositories) is Redundant Information repeated in different superficial formats –Databases/ontologies –Structured pages (e.g. produced by databases) –Largely structured pages (bibliography pages) –Unstructured pages (free texts) Redundancy

27 Largely unsupervised annotation of documents –Based on Adaptive Information Extraction –Bootstrapped using redundancy of information Method –Use the structured information (easier to extract) to bootstrap learning on less structured sources ( more difficult to extract ) The Idea

28 –Mines web-sites to extract biblios from personal pages Tasks: Finding people’s names Finding home pages Finding personal biblio pages Extract biblio references –Sources NE Recognition (Gate’s Annie) Citeseer/Unitrier (largely incomplete biblios) Google Homepagesearch Example: Extracting Bibliographies

29 Mining Web sites (1) Mines the site looking for People’s names Uses Generic patterns (NER) Citeseer for likely bigrams Looks for structured lists of names Annotates known names Trains on annotations to discover the HTML structure of the page Recovers all names and hyperlinks
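The step "annotate known names, learn the HTML structure, recover all names" can be illustrated with a very small wrapper-induction sketch. It assumes that names sit directly inside one kind of HTML element and it uses regular expressions rather than a trained learner, so it is a simplification and not Armadillo's actual method; `learn_tag`, `extract_all` and the sample page are hypothetical.

```python
import re
from collections import Counter


def learn_tag(html, known_names):
    """Return the tag that most often directly encloses a known name."""
    votes = Counter()
    for name in known_names:
        pattern = r"<(\w+)\b[^<>]*>\s*" + re.escape(name) + r"\s*</\1>"
        for match in re.finditer(pattern, html):
            votes[match.group(1)] += 1
    return votes.most_common(1)[0][0] if votes else None


def extract_all(html, tag):
    """Return the text of every element that uses the learned tag."""
    t = re.escape(tag)
    pattern = r"<" + t + r"\b[^<>]*>\s*([^<>]+?)\s*</" + t + r">"
    return re.findall(pattern, html)


page = ('<ul><li><a href="/ann">Ann Lee</a></li>'
        '<li><a href="/bob">Bob Ray</a></li>'
        '<li><a href="/cy">Cy Poe</a></li></ul>')
seed_names = ["Ann Lee", "Bob Ray"]   # e.g. found via an online repository
tag = learn_tag(page, seed_names)
names = extract_all(page, tag)
print(tag, names)  # a ['Ann Lee', 'Bob Ray', 'Cy Poe']
```

A rule this general also picks up any other text wrapped in the same element, which is one way non-name strings can end up among the extracted names.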

30 Experimental Results II - Sheffield People –discovering who works in the department –using Information Integration. Total present in site: 139. Using generic patterns + online repositories –35 correct, 5 wrong –Precision: 35 / 40 = 87.5 % –Recall: 35 / 139 = 25.2 % –F-measure: 39.1 % Errors –A. Schriffin –Eugenio Moggi –Peter Gray

31 Experimental Results IE - Sheffield People –using Information Extraction. Total present in site: 139 –116 correct, 8 wrong –Precision: 116 / 124 = 93.5 % –Recall: 116 / 139 = 83.5 % –F-measure: 88.2 % Errors –Speech and Hearing –European Network –Department Of Enhancements – Lists, Postprocessor –Position Paper –The Network –To System

32 Experimental Results - Edinburgh People –using Information Integration. Total present in site: 216. Using generic patterns + online repositories –11 correct, 2 wrong –Precision: 11 / 13 = 84.6 % –Recall: 11 / 216 = 5.1 % –F-measure: 9.6 % –using Information Extraction –153 correct, 10 wrong –Precision: 153 / 163 = 93.9 % –Recall: 153 / 216 = 70.8 % –F-measure: 80.7 %

33 Experimental Results - Aberdeen People –using Information Integration. Total present in site: 70. Using generic patterns + online repositories –21 correct, 1 wrong –Precision: 21 / 22 = 95.5 % –Recall: 21 / 70 = 30.0 % –F-measure: 45.7 % –using Information Extraction –63 correct, 2 wrong –Precision: 63 / 65 = 96.9 % –Recall: 63 / 70 = 90.0 % –F-measure: 93.3 %

34 Mining Web sites (2) Annotates known papers Trains on annotations to discover the HTML structure Recovers co-authoring information

35 Experimental Results (1) Papers –discovering publications in the department –using Information Integration. Total present in site: 320. Using generic patterns + online repositories –151 correct, 1 wrong –Precision: 151 / 152 = 99 % –Recall: 151 / 320 = 47 % –F-measure: 64 % Errors - Garbage in computer-mining, author = "Department Of Computer", title = "Mining Web Sites Using Adaptive Information Extraction Alexiei Dingli and Fabio Ciravegna and David Guthrie and Yorick Wilks", url = "citeseer.nj.nec.com/ html" }

36 Experimental Results (2) Papers –using Information Extraction. Total present in site: 320 –214 correct, 3 wrong –Precision: 214 / 217 = 99 % –Recall: 214 / 320 = 67 % –F-measure: 80 % Errors –Wrong boundaries in detection of paper names! –Names of workshops mistaken as paper names!

37 Task –Given the name of an artist, find all the paintings of that artist. –Created for the ArtEquAKT project Artists domain

38 Artists domain Evaluation (II = Information Integration, IE = Information Extraction)
Artist      Method  Precision  Recall  F-Measure
Caravaggio  II      100.0%     61%     75.8%
Caravaggio  IE      100.0%     98.8%   99.4%
Cezanne     II      100.0%     27.1%   42.7%
Cezanne     IE      91.0%      42.6%   58.0%
Manet       II      100.0%     29.7%   45.8%
Manet       IE      100.0%     40.6%   57.8%
Monet       II      100.0%     14.6%   25.5%
Monet       IE      86.3%      48.5%   62.1%
Raphael     II      100.0%     59.9%   74.9%
Raphael     IE      96.5%      86.4%   91.2%
Renoir      II      94.7%      40.0%   56.2%
Renoir      IE      96.4%      60.0%   74.0%

39 –Providing … A URL List of services –Already wrapped (e.g. Google is in default library) –Train wrappers using examples Examples of fillers (e.g. project names) –In case … Correcting intermediate results Reactivating Armadillo when paused User Role

40 –Library of known services (e.g. Google, Citeseer) –Tools for training learners for other structured sources –Tools for bootstrapping learning From un/structured sources No user annotation Multi-strategy acquisition of information using redundancy –User-driven revision of results With re-learning after user correction Armadillo

41 Armadillo learns how to extract information –From large repositories By integrating information –from diverse and distributed resources Use: –Ontology population –Information highlighting –Document enrichment –Enhancing user experience Rationale

42 Data Navigation (1)

43 Data Navigation (2)

44 Data Navigation (3)

45 Automatic annotation services –For a specific ontology –Constantly re-indexing/re-annotating documents –Semantic search engine Effects: –No annotation in the document As today’s indexes are not stored in the documents –No legacy with the past Annotation with the latest version of the ontology Multiple annotations for a single document –Simplifies maintenance Page changed but not re-annotated IE for SW: The Vision
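The idea of keeping annotations outside the documents, as a search engine keeps its index, can be sketched as a small store keyed by document URL and ontology version. This is an illustration under assumptions: `AnnotationIndex`, the version labels and the example.org URL are hypothetical and do not describe an existing service.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotationIndex:
    """Annotations kept outside the documents, per URL and ontology version."""

    entries: dict = field(default_factory=dict)

    def annotate(self, url, ontology_version, annotations):
        """Store (or replace) the annotations of one document."""
        self.entries[(url, ontology_version)] = list(annotations)

    def lookup(self, url, ontology_version):
        return self.entries.get((url, ontology_version), [])

    def search(self, tag, ontology_version):
        """Return URLs annotated with the tag under this ontology version."""
        return sorted(
            url
            for (url, version), anns in self.entries.items()
            if version == ontology_version
            and any(t == tag for _text, t in anns)
        )


index = AnnotationIndex()
url = "http://example.org/seminar.html"  # hypothetical page
index.annotate(url, "v1", [("Dr. Steals", "speaker")])
# A newer ontology re-annotates the same, untouched page.
index.annotate(url, "v2", [("Dr. Steals", "person"), ("Dean Hall", "room")])
print(index.search("room", "v2"), index.search("room", "v1"))
```

Because the page itself is never modified, several annotation sets for one document can coexist and be replaced whenever the ontology or the page changes.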

46 Questions?