Ontology-based Information Extraction for Business Intelligence


Ontology-based Information Extraction for Business Intelligence
Horacio Saggion, Adam Funk, Diana Maynard, Kalina Bontcheva
Natural Language Processing Group, University of Sheffield, United Kingdom

Outline
- The MUSING Project
- Ontology-based IE
- MUSING Natural Language Processing Technology
- MUSING applications
- Customisation
- Results
- Conclusions & Future Work

MUSING project
- Business Intelligence (BI) is the process of finding, gathering, aggregating, and analysing information for decision making.
- Many BI systems are portals that give business analysts access to information.
- It is left to the business analyst to dig into the documents and extract the facts useful for decision making.
- MUSING is a 6th Framework Programme project of the European Commission that promotes the adoption of BI tools based on semantic-based knowledge and content systems.
- Analytical techniques traditionally used in BI rely on structured information and hardly ever use qualitative information (e.g. opinions), which the industry is keen to use.
- One of the goals of MUSING is to use structured as well as unstructured information for decision making.

Ontology-based Information Extraction (OBIE)
- Information extraction (IE) is a technology that extracts key pieces of information from text:
  - generic: identify specific name mentions in text (person names, location names, money, etc.)
  - specific: populate a structured representation (e.g. a template) with “strings” from the text (e.g. full information on a joint venture)
- OBIE is the process of finding, in text and other sources, the concepts, instances, and relations expressed in an ontology.

Ontology-based Information Extraction (OBIE)
- Extracting information about a company requires, for example, identifying the Company Name, Company Address, Parent Organization, Shareholders, etc.
- These associated pieces of information should be asserted as property values of the company instance.
- Statements for populating the ontology need to be created (“Alcoa Inc” hasAlias “Alcoa”; “Alcoa Inc” hasWebPage “http://www.alcoa.com”; etc.).
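
The statements on this slide can be pictured as subject/predicate/object triples. The following is a minimal sketch, assuming extracted property values arrive as a plain dictionary; the `Statement` class and `company_statements` function are hypothetical names for illustration and are not part of MUSING or GATE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """One ontology-population statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


def company_statements(name, properties):
    """Turn extracted property values into statements about one company.

    `properties` maps a property name (e.g. "hasAlias") to the list of
    values found in the text. This input shape is an assumption.
    """
    statements = []
    for predicate, values in properties.items():
        for value in values:
            statements.append(Statement(name, predicate, value))
    return statements


# Values taken from the slide's own example.
extracted = {
    "hasAlias": ["Alcoa"],
    "hasWebPage": ["http://www.alcoa.com"],
}
statements = company_statements("Alcoa Inc", extracted)
for s in statements:
    print(f'"{s.subject}" {s.predicate} "{s.obj}"')
```

Keeping each statement as a small immutable record makes it easy to deduplicate statements that are extracted from several documents before they are written to a knowledge base.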

Ontology-based Information Extraction in MUSING
[Architecture diagram. Labels: Data Source Provider; Domain Expert; Ontology Curator; User; Document; MUSING Ontology; Document Collector; User Input; MUSING Application; MUSING Data Repository; Ontology-based Document Annotation; Annotated Document; Ontology Population; Knowledge Base (Instances & Relations); Region Selection Model; Economic Indicators; Region Rank; Enterprise Intelligence; Company Information; Report]

Data Sources and Ontology
- Data sources are provided by MUSING partners and include balance sheets, company profiles, press data, web data, etc. (some of the data is private):
  - Il Sole 24 ORE, CreditReform data
  - Companies’ web pages (main, “about us”, “contact us”, etc.)
  - Wikipedia, CIA Fact Book, etc.
- The ontology is developed manually through interaction with domain experts and ontology curators.
- It extends the PROTON ontology and covers the financial, international, and IT operative risk domains.

Partial Ontology View

Natural Language Processing Technology
- The OBIE system for English is being developed using the GATE system (http://gate.ac.uk); the German and Italian systems are based on the SProUT tools developed by DFKI.
- GATE components used include a tokeniser, sentence splitter, part-of-speech tagger, morphological analyser, parsers, etc.
- GATE comes with an extraction system called ANNIE, but it targets only a small fraction of the MUSING application domain.

Natural Language Processing Technology
- The main components adapted for MUSING applications are the gazetteer lists and grammars used for named entity recognition.
- New components include:
  - an ontology mapping component, which maps entities to specific classes in the given ontology
  - a component that creates RDF statements for ontology population based on the application specification, for example creating a company instance with all its properties as found in the text
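
The ontology mapping step can be illustrated with a lookup from named-entity types to ontology classes. This is only a sketch under stated assumptions: the class names in `NE_TO_CLASS` are invented for the example and are not taken from the MUSING or PROTON ontologies, and `map_entities` is a hypothetical helper rather than a GATE API.

```python
# Hypothetical mapping from named-entity types to ontology class names.
# These class names are illustrative assumptions, not the real ontology.
NE_TO_CLASS = {
    "Organization": "Company",
    "Person": "Person",
    "Location": "Region",
}


def map_entities(annotations, mapping=NE_TO_CLASS):
    """Split (text, entity_type) pairs into mapped and unmapped entities.

    Mapped entities come back as (text, ontology_class); entities whose
    type has no class in the mapping are returned separately so they can
    be reviewed instead of being dropped silently.
    """
    mapped, unmapped = [], []
    for text, ne_type in annotations:
        cls = mapping.get(ne_type)
        if cls is None:
            unmapped.append((text, ne_type))
        else:
            mapped.append((text, cls))
    return mapped, unmapped


annotations = [("Alcoa Inc", "Organization"), ("25%", "Percent")]
mapped, unmapped = map_entities(annotations)
print(mapped)
print(unmapped)
```

Returning the unmapped entities explicitly is a design choice for the sketch: it shows where the ontology or the mapping table would need to be extended.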

Cross-source Coreference
- One important problem in extraction from multiple sources is deciding whether a person name (or any other name) occurring in two different sources refers to the same individual.
- Given a set of documents containing a given person name, we apply an agglomerative clustering algorithm; at the end, documents referring to the same individual belong to the same cluster.
- The algorithm uses vector representations of the documents (terms and weights).
- We experimented with two types of terms, words and entity names. Our results indicate that a representation using one specific type of name (i.e. Organization) achieves state-of-the-art performance; however, performance varies depending on the data set.
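
The clustering idea can be sketched in a few lines. The slide does not state the similarity measure, the linkage criterion, the term weighting, or the stopping rule, so the cosine similarity, single linkage, raw term counts, and fixed threshold below are all assumptions; the toy documents are invented for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerative(vectors, threshold):
    """Merge the most similar pair of clusters until none reaches `threshold`.

    Single linkage (assumed): cluster similarity is the similarity of the
    closest pair of member documents. Returns lists of document indices.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = max(
                    cosine(vectors[x], vectors[y])
                    for x in clusters[i]
                    for y in clusters[j]
                )
                if sim > best:
                    best, pair = sim, (i, j)
        if best < threshold:
            break  # no remaining pair is similar enough to merge
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy documents mentioning the same person name, represented only by the
# Organization names found in them (raw counts used as weights).
docs = [
    ["University of Sheffield", "GATE"],
    ["University of Sheffield"],
    ["Alcoa Inc"],
]
vectors = [Counter(d) for d in docs]
clusters = agglomerative(vectors, threshold=0.5)
print(clusters)
```

With these toy inputs the first two documents share an organization name and end up in one cluster, while the third stays on its own.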

MUSING Applications
A number of applications have been specified to demonstrate the use of semantic-based technology in BI. Some examples:
- Collecting company information from multiple multilingual sources (English, German, Italian) to provide up-to-date information on competitors
- Identifying chances of success in regions of a particular country
- Identifying appropriate partners to do business with
- Creating a joint ventures database from multiple sources

Region Selection Application
- Given information on a company and the desired form of internationalisation (e.g. export, direct investment, alliance), the application provides a ranking of regions that indicates the most suitable places for the type of business.
- A number of social, political, geographical, and economic indicators or variables of regions (such as surface area, labour costs, tax rates, population, literacy rates, etc.) have to be collected to feed a statistical model.

Region Selection Application
- Data sources used for the OBIE application are statistics from governmental sources and region profiles available on the Web (e.g. Wikipedia).
- Gazetteer lists contain location names and associated information, together with keywords that help identify the key information.
- Grammars use contextual information and named entities to identify the target variables, e.g. “unemployment rate of 25% (2001)”.
- Extraction performance obtained: F-score > 80%.
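
To make the grammar idea concrete, here is a minimal sketch of a contextual pattern for phrases such as “unemployment rate of 25% (2001)”. It uses a Python regular expression as a stand-in; the actual MUSING grammars run inside GATE and are not shown in the slides, and the keyword list and the `extract_indicators` function are assumptions made for this example.

```python
import re

# Keyword list (assumed) followed by "of <number>%" and an optional year
# in parentheses. The keywords play the role of gazetteer entries.
PATTERN = re.compile(
    r"(?P<indicator>unemployment rate|literacy rate|tax rate)"
    r"\s+of\s+(?P<value>\d+(?:\.\d+)?)\s*%"
    r"(?:\s*\((?P<year>\d{4})\))?",
    re.IGNORECASE,
)


def extract_indicators(text):
    """Return one dict per indicator mention found in `text`."""
    results = []
    for m in PATTERN.finditer(text):
        results.append({
            "indicator": m.group("indicator").lower(),
            "value": float(m.group("value")),
            "year": int(m.group("year")) if m.group("year") else None,
        })
    return results


found = extract_indicators("The region has an unemployment rate of 25% (2001).")
print(found)
```

The year is optional in the pattern because region profiles do not always date their figures; a mention without a year is returned with `year` set to `None`.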

Region Selection Application: example
Tamil Nadu ... Population (2001) 62,405,679 (6) Density 478/km²

<rdf>
  <indicator:Measurement rdf:ID="Measurement_91567">
    <indicator:hasValue>478</indicator:hasValue>
    <indicator:hasPoliticalRegion rdf:resource=".../int/region#TamilNadu" />
    <indicator:hasIndicator rdf:resource=".../int/indicator#DENS" />
    <time:hasTimeSlice xmlns:time=".../general/time#">
      <time:TimeSlice rdf:ID="TimeSlice_40715">
        <time:hasTemporalEntity>
          <time:ProperInstantYear rdf:ID="ProperInstantYear_57895">
            <time:year rdf:datatype="#int">2001</time:year>
          </time:ProperInstantYear>
        </time:hasTemporalEntity>
      </time:TimeSlice>
    </time:hasTimeSlice>
  </indicator:Measurement>
</rdf>

Conclusions and Future Work
- MUSING integrates ontology-based extraction as a useful tool for Business Intelligence.
- The NLP applications analyse documents and populate a knowledge base.
- A number of practical applications have been defined that will use the facts stored in the KB.
- Extraction technology and performance so far are promising.
- Our future work will concentrate on the full problem of ontology population, including:
  - a cross-source coreference mechanism
  - the identification of qualitative information (such as opinions), e.g. to model company reputation
  - moving from a rule-based system to a machine learning approach