Towards Constructing a Chinese Information Extraction System to Support Innovations in Library Services World Library and Information Congress: 72nd IFLA.

Presentation transcript:

Towards Constructing a Chinese Information Extraction System to Support Innovations in Library Services
World Library and Information Congress: 72nd IFLA General Conference and Council, August 2006, Seoul, Korea
Library of Chinese Academy of Sciences
Zhang Zhixiong, Li Sa, Wu Zhengxin, Lin Ying

Outline
1. Introduction
2. What is IE (Information Extraction)?
3. Potential functions in Innovations of Library Services
4. Constructing a Chinese Information Extraction System
5. Tests and Evaluation

1. Introduction
Library of Chinese Academy of Sciences
–Now changing its name to National Science Library of China
–About 400 staff; headquarters in Beijing, with 3 branches in Lanzhou, Chengdu and Wuhan
–Serves 90 CAS research institutes across the country
–In 2001, initiated the Chinese National Science Digital Library (CSDL) program

1. Introduction
CSDL (Chinese National Science Digital Library)
–Provided abundant digital information resources for users (e-journals: 6000 Western, 11000 Chinese, in one day)
–Developed information systems to support networked services

Union Catalogs & Document Delivery

Federated database search

Digital reference

remote authentication

1. Introduction
CSDL (Chinese National Science Digital Library)
–Provided abundant digital information resources for users (e-journals: 6000 Western, 11000 Chinese, in one day)
–Developed information systems to support networked services
–Carried out many training and publicity programs

1. Introduction
CSDL has become one of the key research facilities for researchers and graduate students of CAS.
However:
–The information requirements of researchers and graduate students change rapidly
–Traditional information retrieval methods are not sufficient

1. Introduction
The users of CSDL want to:
–get rid of information noise
–efficiently get a comprehensive view of recent developments in a domain
–uncover significant relationships between pieces of information
The librarians of CSDL want to:
–improve the service standard of CSDL
–turn the digital library into a knowledge repository

1. Introduction
Information Extraction (IE) is an emerging technology that serves these needs.

2. What is IE (Information Extraction)?
NLP Research Group, University of Sheffield
–Information extraction (IE) is a term that has come to be applied to the activity of automatically extracting pre-specified sorts of information from natural language texts

2. What is IE (Information Extraction)?
Dr. Hamish Cunningham
–IE is a process that takes texts (and sometimes speech) as input and produces fixed-format, unambiguous data as output
–Input: unstructured free text
–Output: fixed-format, unambiguous data

2. What is IE (Information Extraction)?
Output (structured information source) can be used for:
–searching
–analysis
–generating summaries
–constructing indices

IE, an example
##### ####### NHS TRUST - PATIENT CASE NOTE ########:######### ####### DOB: 1944 CLEF-RMH-Entry-Key: 52A4F6DB2B46E AB 1992 Seen in General Surgical
This lady who has had a mastectomy and left open capsulotomy and removal of her prosthesis was seen by me in the clinic today on behalf of XXXXXXXXXXX. She has extensive bony lymphoedema in her left arm which does not seem to be getting any better although she is more or less reconciled to the problem. The original problem was that she complained of shooting pain in the direction of ulna nerve and although there does not seem to be any evidence of local, regional or distant recurrence the pain itself warrants management in a pain clinic. XXXXXXXXX could be seen in the pain clinic at the XXXXXXX but as this would involve a lot of travelling would like to be treated nearer her home. I wonder whether it would be possible for you to investigate if there is a pain clinic available at XXXXXXXXXXX as I am sure XXXXX could be treated and benefit from its management. I have otherwise arranged for her to be seen in the clinic again in a year's time. There are no signs of recurrence at this time.
5213A4F612F1
[Figure: the case note with extracted spans highlighted by category: Interventions, Problems, Problem Site, Locations, Time]

IE, an example
Extracted information could be collected…
Problems: recurrence; no signs of recurrence; bony lymphoedema; shooting pain in the direction of ulna nerve; pain
Problem Site: left arm; local, regional or distant
Time: a year's time; today; at this time
Locations: pain clinic; clinic; pain clinic; General Surgical; pain clinic
Interventions: mastectomy; left open capsulotomy; removal of her prosthesis; management
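As a rough illustration of "collecting" extracted information, the sketch below groups (category, span) pairs into a fixed-format table. The `annotations` list and the `collect` helper are hypothetical, and the pairing of spans with categories is our reading of the slide, not data published by the authors.

```python
from collections import defaultdict

# Hypothetical output of an IE system for the case note: (category, span) pairs.
# Which span belongs to which category is an assumption made for illustration.
annotations = [
    ("Problems", "bony lymphoedema"),
    ("Problems", "shooting pain in the direction of ulna nerve"),
    ("Problem Site", "left arm"),
    ("Interventions", "mastectomy"),
    ("Interventions", "removal of her prosthesis"),
    ("Locations", "pain clinic"),
    ("Locations", "pain clinic"),  # the same span can be found more than once
    ("Time", "today"),
]

def collect(pairs):
    """Group spans by category, keeping first-seen order and dropping repeats."""
    table = defaultdict(list)
    for category, span in pairs:
        if span not in table[category]:
            table[category].append(span)
    return dict(table)

table = collect(annotations)
for category, spans in table.items():
    print(f"{category}: {'; '.join(spans)}")
```

Once the spans sit in a structure like `table`, the uses named earlier (searching, analysis, summaries, indices) become ordinary operations on structured data.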

2. What is IE (Information Extraction)?
5 kinds of Information Extraction tasks
–Named Entity recognition (NE)
–Coreference resolution (CO)
–Template Element construction (TE)
–Template Relation construction (TR)
–Scenario Template production (ST)

2. What is IE (Information Extraction)?
–NE is about finding entities
–CO is about which entities and references (such as pronouns) refer to the same thing
–TE is about what attributes entities have
–TR is about what relationships there are between entities
–ST is about the events that the entities participate in

2. What is IE (Information Extraction)?
Information Extraction will:
–play a very important role in coping with the huge collections of digital information
–bring innovations in library services

3. Potential functions in Innovations of Library Services
1. Automatic annotation and metadata creation
–automatic annotation of digital materials
–automatic acquisition of metadata
–for example: MnM, S-CREAM, AERODAML, SemTag, KIM, hTechsight
–ontology-based IE techniques

3. Potential functions in Innovations of Library Services
2. Improving data mining in information analysis
–large-scale data analysis
–detection of many types of evidence
–getting enough structured data for analysis

3. Potential functions in Innovations of Library Services
3. Developing knowledge bases from free text
–statistical and numeric databases
–terminological databases
–fact sheets
–SOBA (SmartWeb Ontology-Based Annotation)

3. Potential functions in Innovations of Library Services
4. Generating answers in digital reference systems
–Most research libraries have established digital reference services
–Can we get answers directly from information systems?
–Natural language QA (Question Answering)

So…
–IE is very important
–How to build an IE system (for Chinese)?
–CSDL is trying to find an effective way

4. Constructing a Chinese Information Extraction System
A Chinese IE solution
–makes full use of GATE
–develops a Chinese IE plug-in, based on the GATE framework, to process Chinese information resources

4. Constructing a Chinese Information Extraction System
GATE (General Architecture for Text Engineering)
–Open source, developed since 1995
–GATE, a framework: Language Resources (LRs), Processing Resources (PRs), Visual Resources (VRs)
–ANNIE (A Nearly-New IE system): tokeniser, sentence splitter, POS tagger, gazetteer, finite state transducer and orthomatcher

ANNIE Pipeline
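The pipeline idea can be sketched as a chain of processing resources that each add one layer of annotations to a shared document. This is a minimal illustration in plain Python, not GATE's API: the function names, the dictionary layout and the sample sentence are all assumptions, and only three of ANNIE's stages are imitated.

```python
import re

def tokeniser(doc):
    # Words or single punctuation marks, stored with character offsets.
    doc["Token"] = [(m.start(), m.end(), m.group())
                    for m in re.finditer(r"\w+|[^\w\s]", doc["text"])]
    return doc

def sentence_splitter(doc):
    # Naive rule: a sentence runs up to the next '.', '!' or '?'.
    doc["Sentence"] = [(m.start(), m.end())
                       for m in re.finditer(r"[^.!?]+[.!?]?", doc["text"])
                       if m.group().strip()]
    return doc

def make_gazetteer(entries):
    # entries maps a known string to its type, e.g. {"Sheffield": "location"}.
    def gazetteer(doc):
        doc["Lookup"] = [(start, end, text, entries[text])
                         for start, end, text in doc["Token"] if text in entries]
        return doc
    return gazetteer

# Order matters: the gazetteer reads the tokens produced by the tokeniser.
pipeline = [tokeniser, sentence_splitter, make_gazetteer({"Sheffield": "location"})]

doc = {"text": "Hamish Cunningham works in Sheffield. GATE is open source."}
for resource in pipeline:
    doc = resource(doc)

print(doc["Lookup"])
```

Because every stage only reads and writes annotation layers, a stage that does not suit a language (for example the whitespace-based tokeniser) can be swapped out without touching the rest.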

GATE: good for English

GATE: Not so good for Chinese

4. Constructing a Chinese Information Extraction System
Key difficulties for Chinese information extraction
–Chinese tokenizing
–Chinese gazetteers
–Chinese named entity recognition

Chinese tokenizing
English language
–words are separated by white space and punctuation
Chinese language
–no separation between words

A simple sentence (I am a Chinese) can be broken into several forms by a segmenter:
–(I am a Chinese)
–(I am China person)
–(I am center country person)
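The ambiguity can be reproduced with a toy lexicon. The sketch below assumes the sentence on the slide is 我是中国人 (the Chinese text did not survive in the transcript) and uses a made-up seven-word lexicon. It enumerates every dictionary-consistent segmentation and then applies forward maximum matching, a simple greedy baseline that is not the method used by ICTCLAS.

```python
# Toy lexicon (an assumption for illustration only).
LEXICON = {"我", "是", "中", "国", "人", "中国", "中国人"}

def all_segmentations(text, lexicon):
    """Return every way to split text into words that all occur in lexicon."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            for rest in all_segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results

def forward_maximum_match(text, lexicon):
    """Greedy baseline: take the longest lexicon word at each position,
    falling back to a single character when nothing matches."""
    max_len = max(map(len, lexicon))
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

sentence = "我是中国人"  # assumed to be the sentence glossed on the slide
candidates = all_segmentations(sentence, LEXICON)
for words in candidates:
    print(" / ".join(words))
best = forward_maximum_match(sentence, LEXICON)
print("forward maximum match:", " / ".join(best))
```

The three candidates correspond to the three glosses on the slide. Greedy matching happens to pick the intended reading here, but it has no way to recover when the longest match is wrong, which is one reason a dedicated segmenter is integrated instead.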

Chinese gazetteers
GATE gazetteer lists for English
–very abundant
GATE gazetteer lists for Chinese processing
–simple and short gazetteers such as date, time, organization, location, money, province, etc.
–for a flexible language like Chinese, the lists are very limited
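What a gazetteer contributes can be shown with a small longest-match lookup over raw text. The lists, the sample sentence and the function names below are hypothetical; this is a sketch of the general technique and does not reproduce GATE's gazetteer implementation or its list file format.

```python
# Hypothetical gazetteer lists, keyed by type.
GAZETTEER = {
    "location": ["北京", "兰州", "成都", "武汉"],
    "organization": ["中国科学院"],
}

def build_index(gazetteer):
    """Flatten the lists into one entry -> type mapping."""
    return {entry: kind for kind, entries in gazetteer.items() for entry in entries}

def lookup(text, index):
    """Scan left to right, preferring the longest entry that starts at each position."""
    max_len = max(map(len, index))
    hits, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            kind = index.get(text[i:i + size])
            if kind:
                hits.append((i, i + size, text[i:i + size], kind))
                i += size
                break
        else:
            i += 1  # no entry starts here, move on one character
    return hits

index = build_index(GAZETTEER)
hits = lookup("中国科学院在北京，分馆在兰州、成都和武汉。", index)
for hit in hits:
    print(hit)
```

Coverage is bounded by the lists: any name missing from `GAZETTEER` is simply not found, which is why short Chinese lists limit recognition so strongly.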

Chinese named entity recognition
The GATE system uses JAPE (Java Annotation Patterns Engine) rules to recognize NEs

JAPE rules
–The grammar of Chinese is quite different from that of English
–The JAPE rules provided by GATE are not suitable for Chinese texts
–We need to rewrite the JAPE rules to implement Chinese information extraction
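A JAPE rule pairs a pattern over existing annotations with an action that creates a new annotation. The sketch below imitates that idea in plain Python rather than JAPE syntax. The rule itself (a location word immediately followed by an organization keyword yields an Organization), the word lists and the `Annotation` class are all invented for illustration and are not the authors' rules.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    type: str
    start: int  # index of the first token covered
    end: int    # index just past the last token covered
    text: str

# Hypothetical word lists standing in for gazetteer lookups.
LOCATIONS = {"北京", "兰州", "成都", "武汉"}
ORG_KEYWORDS = {"大学", "图书馆", "研究所"}

def organization_rule(tokens):
    """Pattern: location token + organization keyword token.
    Action: emit one Organization annotation spanning both tokens."""
    found, i = [], 0
    while i < len(tokens) - 1:
        if tokens[i] in LOCATIONS and tokens[i + 1] in ORG_KEYWORDS:
            found.append(Annotation("Organization", i, i + 2, tokens[i] + tokens[i + 1]))
            i += 2
        else:
            i += 1
    return found

# Tokens as a word segmenter might produce them (assumed input).
tokens = ["他", "在", "武汉", "大学", "学习"]
entities = organization_rule(tokens)
print(entities)
```

The rule depends on segmented tokens and on the word lists, so its quality is tied directly to the segmentation and gazetteer steps that come before it.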

Solutions to the problems

Three main tasks we have done
1. Integrating ICTCLAS to perform word segmentation
2. Developing Chinese gazetteers to enrich GATE language resources
3. Rewriting JAPE rules to recognize Chinese NEs

Chinese JAPE rule

5. Tests and Evaluation

–After one year of work, we implemented the system
–We carried out experiments

Same piece of article

Our output

Conclusions
–put forward a solution for a Chinese information extraction system
–carried out a valuable experiment
–much work still needs to be done
–laid a good foundation for our future work

Thanks! 谢谢！