Taxonomic Name Recognition (TNR) in Biodiversity Heritage Library. Qin Wei (魏琴), Chris Freeland, P. Bryan Heidorn. Missouri Botanical Garden.


Taxonomic Name Recognition (TNR) in Biodiversity Heritage Library
Qin Wei (魏琴), Chris Freeland, P. Bryan Heidorn
Missouri Botanical Garden

6/12/ :20:46 PM TNR in BHL 2
Co-author
Chris Freeland
Director of the Biodiversity Heritage Library
IT division manager, Missouri Botanical Garden

Biodiversity Heritage Library (BHL)
“Ten major natural history museum libraries, botanical libraries, and research institutions have joined to form the BHL. The group is developing a strategy and operational plan to digitize the published literature of biodiversity held in their respective collections. This literature will be available through a global biodiversity commons.”
(The world's largest digital library for biodiversity, built jointly by 10 institutions.)
More information about BHL can be found at

Participating institutions
American Museum of Natural History (New York, NY)
The Field Museum (Chicago, IL)
Harvard University Botany Libraries (Cambridge, MA)
Harvard University (Cambridge, MA)
Marine Biological Laboratory / Woods Hole Oceanographic Institution (Woods Hole, MA)
Missouri Botanical Garden (St. Louis, MO)
Natural History Museum (London, UK)
The New York Botanical Garden (New York, NY)
Royal Botanic Gardens, Kew (Richmond, UK)
Smithsonian Institution Libraries (Washington, DC)

Open Access (a free, open-access library)
“BHL Project strives to establish a major corpus of digitized publications on the Web drawn from the historical biodiversity literature. This material will be available for open access and responsible use as a part of a global Biodiversity Commons. We will work with the global taxonomic community, rights holders, and other interested parties to ensure that this legacy literature is available to all.”


TNR in BHL (taxonomic name recognition in a digital library)
A significant aspect of BHL is the incorporation of algorithmic taxonomic intelligence provided by uBio.org. This distinguishes BHL from other digital libraries: it offers intelligent search rather than simple full-text retrieval.
As materials are scanned, the image files are processed through ABBYY FineReader or PrimeOCR to create text derivatives. Those text files are then submitted to uBio’s TaxonFinder web service, which identifies strings in the text that match the characteristics of scientific names. Both steps, OCR and taxonomic name recognition, are automatic.

Two TNR algorithms (the two main taxonomic name recognition algorithms)
TaxonFinder was developed by uBio. It uses statistical models created from the validated organism names in NameBank. These models aim to describe the structure and frequency of common character sequences in organism names, so that TaxonFinder can infer whether an unknown word has a structure similar to that of a known organism name. In short, it recognizes name features by statistical methods.
Online only.
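To illustrate the general idea of scoring a word by its character sequences, here is a minimal sketch. It is not TaxonFinder's actual model, which the slides do not specify; the trigram size, the add-one smoothing, the function names, and the tiny training list are all assumptions made for this example.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Return the character n-grams of a word, with start/end markers."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(names, n=3):
    """Count character n-grams over a list of known organism names."""
    counts = Counter()
    for name in names:
        counts.update(char_ngrams(name, n))
    return counts


def score(word, counts, n=3):
    """Average log-probability per n-gram, using add-one smoothing.

    Higher (less negative) scores mean the word looks more like the
    names the model was trained on.
    """
    grams = char_ngrams(word, n)
    if not grams:
        return float("-inf")
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen n-grams
    log_prob = sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams)
    return log_prob / len(grams)


# Hypothetical training names; a real model would use a large name list.
model = train(["Quercus", "Rosa", "Acer", "Ficus", "Rubus", "Lupinus"])
print(score("Prunus", model), score("whether", model))
```

With this toy model, an unseen genus such as "Prunus" scores higher than an ordinary English word because it shares endings like "-us" with the training names. A threshold on the score would then decide which words to report.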

Two TNR algorithms
FAT, short for “Finds All Taxonomic names”, was developed to automatically extract all the taxonomic names from biological literature. It uses the parts of the text already classified to build lexica and statistics (dictionary lookup), which are then used to classify the rest of the text (Sautter et al.). In short, it recognizes names using dictionaries of taxonomic names together with letter-level features.
Offline usage and customized dictionaries.
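The two-pass idea described here (classify what a dictionary already covers, then reuse what was learned on the rest of the text) can be sketched as follows. This is a simplified illustration and not the FAT implementation; the regular expression, the function name `find_names`, and the seed lexicon are hypothetical.

```python
import re

# A candidate is a capitalized word or an abbreviated genus ("Q.")
# followed by a lowercase word of at least three letters.
CANDIDATE = re.compile(r"\b([A-Z][a-z]+|[A-Z]\.)\s+([a-z]{3,})\b")


def find_names(text, genus_lexicon):
    """Two-pass lookup: learn epithets from sure matches, then reuse them."""
    genera = set(genus_lexicon)
    candidates = CANDIDATE.findall(text)

    # Pass 1: candidates whose genus is in the dictionary are accepted,
    # and their epithets are added to a lexicon learned from this text.
    epithets = {epithet for genus, epithet in candidates if genus in genera}
    initials = {genus[0] for genus in genera}

    # Pass 2: an abbreviated genus is accepted only when its initial matches
    # a known genus and its epithet was learned in pass 1.
    names = []
    for genus, epithet in candidates:
        known_genus = genus in genera
        abbreviated = (
            genus.endswith(".") and genus[0] in initials and epithet in epithets
        )
        if known_genus or abbreviated:
            names.append(f"{genus} {epithet}")
    return names


text = "Quercus alba grows here. The bark of Q. alba is pale. The tree is tall."
print(find_names(text, ["Quercus"]))
```

In this sketch "The bark" and "The tree" are rejected because "The" is not in the genus dictionary, while "Q. alba" is accepted only because "alba" was learned from "Quercus alba". A known genus followed by any lowercase word is accepted, so a real system would need further filtering.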


Digitization Process
[Diagram labels: Libraries; Unstructured Data; OCR (images to text); Text Database; Text Mining (process text); TNR (find names); uBio Database; BHL (Text Mining Database); BHL Web; Structured Data. The process leads from unstructured data to structured data.]

Digitization Process
[Diagram labels: Libraries; Unstructured Data; OCR; Text Database; Text Mining; TNR; uBio Database; BHL (Text Mining Database); BHL Web; Structured Data. Three error sources are marked: OCR Error, TNR Error, Authority File Error.]

Sample Characteristics
Number of Pages: 392
Average Number of Tokens:
Average Number of Names: 7.7

Evaluation Measures
Precision is the proportion of matching strings that are valid names. Here it measures the algorithm's ability to exclude invalid names from the result.
Recall is the proportion of valid names in the whole database that were returned as true positives. It measures the ability to find all valid names in the database.
In this evaluation we also use a single measure, the F-score, which is the harmonic mean of recall (R) and precision (P):
F-score = 2 × (Precision × Recall) / (Precision + Recall)
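The three measures defined above can be computed as in the following minimal sketch. It assumes exact string matching and treats the names as sets, so repeated occurrences are counted once; the function name `evaluate` and the sample names are hypothetical and not taken from the study.

```python
def evaluate(found, valid):
    """Return (precision, recall, f_score) for exact-match name sets."""
    found, valid = set(found), set(valid)
    true_positives = len(found & valid)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(valid) if valid else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical example: the algorithm returned three strings, two of them valid,
# while the biologist marked four valid names in total.
found = ["Quercus alba", "Rosa canina", "The bark"]
valid = ["Quercus alba", "Rosa canina", "Acer rubrum", "Ficus carica"]
precision, recall, f_score = evaluate(found, valid)
print(f"P={precision:.2%} R={recall:.2%} F={f_score:.2%}")
```

Here precision is 2/3 and recall is 2/4, which gives an F-score of 4/7.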

Sample Language Distribution

Ground Fact (baseline data)
Total: 3003 valid names

                     Unique Name   One Word   Two Words   Three Words   More Words
No. of valid names
Percentage           89.01%        35.76%     40.63%      17.25%        6.36%

OCR Overall Performance
Total   Wrong OCR   Error Rate (%)

Error Breakdown

OCR: does language matter?

Top OCR error patterns

Rank  Pattern         Rank  Pattern
1     Insert space    8     n->v
2     Omit space      9     l->i
3     e->c            10    r->i
4     u->i            11    u->ii
5     u->n            12    h->l
6     i->l            13    h->ii
7     c->e            14    e->o
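One way such a list of confusions could be used is to undo a single substitution in a token and check the result against a name lexicon. The sketch below assumes that each pattern reads "printed character -> character produced by OCR"; the slide does not state the direction. The function names and the lexicon are hypothetical, and the two space patterns are not handled.

```python
# (seen_in_ocr, assumed_original) pairs derived from the patterns above,
# assuming "a->b" means the printed "a" was recognized as "b".
CONFUSIONS = [
    ("c", "e"), ("i", "u"), ("n", "u"), ("l", "i"), ("e", "c"), ("v", "n"),
    ("i", "l"), ("i", "r"), ("ii", "u"), ("l", "h"), ("ii", "h"), ("o", "e"),
]


def candidates(token):
    """Return all strings obtained by undoing exactly one confusion."""
    results = set()
    for seen, original in CONFUSIONS:
        start = token.find(seen)
        while start != -1:
            results.add(token[:start] + original + token[start + len(seen):])
            start = token.find(seen, start + 1)
    return results


def correct(token, lexicon):
    """Return the token if known, else a lexicon entry one confusion away."""
    lexicon = set(lexicon)
    if token in lexicon:
        return token
    hits = sorted(candidates(token) & lexicon)
    # Ties are broken alphabetically, which is arbitrary; a real system
    # would rank candidates, for example by pattern frequency.
    return hits[0] if hits else None


lexicon = ["Quercus", "alba"]
print(correct("Qnercus", lexicon), correct("Quereus", lexicon))
```

Both misread tokens map back to "Quercus": the first by undoing u->n, the second by undoing c->e.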

NameBank
For TaxonFinder, the error rate due to NameBank incompleteness is 6%.

NoBankID      244
IsValidName   92
Found         1540
Total         1696

Error Analysis: Exact Match

Without_OCR_Error*                        TaxonFinder   FAT
No. of names (identified by biologist)
No. of names found by algorithms
Correct
Precision                                 40.32%        28.20%
Recall                                    36.62%        23.34%
F-score                                   38.47%        25.77%

With_OCR_Error*                           TaxonFinder   FAT
No. of names (identified by biologist)
No. of names found by algorithms
Correct
Precision                                 43.77%        32.25%
Recall                                    25.82%        17.21%
F-score                                   34.80%        24.73%
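As a consistency check on these figures, the sketch below recomputes a combined score from each reported precision and recall pair in two ways. It assumes the percentages are transcribed correctly from the slide; the dictionary layout and function names are hypothetical. Under that assumption, the reported F-scores appear to coincide with the arithmetic mean of precision and recall, whereas the harmonic-mean formula from the evaluation measures slide gives slightly lower values.

```python
# (precision, recall, reported F-score), in percent, as shown on the slide.
REPORTED = {
    ("TaxonFinder", "without OCR errors"): (40.32, 36.62, 38.47),
    ("FAT", "without OCR errors"): (28.20, 23.34, 25.77),
    ("TaxonFinder", "with OCR errors"): (43.77, 25.82, 34.80),
    ("FAT", "with OCR errors"): (32.25, 17.21, 24.73),
}


def harmonic_f(precision, recall):
    """F-score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def arithmetic_mean(precision, recall):
    """Plain average of precision and recall, for comparison."""
    return (precision + recall) / 2


for (algorithm, condition), (p, r, reported) in REPORTED.items():
    print(
        f"{algorithm:12s} {condition:20s} reported={reported:.2f} "
        f"harmonic={harmonic_f(p, r):.2f} arithmetic={arithmetic_mean(p, r):.2f}"
    )
```

For TaxonFinder without OCR errors, for example, the harmonic mean comes to roughly 38.38% rather than the 38.47% shown. The ranking of the two algorithms is the same under either calculation.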

Overall Performances

Step           No. of Names   Error Rate
OCR
TaxonFinder*
NameBank       92             3.06%
Total

Conclusion
Our results indicate that TaxonFinder performs slightly better than FAT. Even so, TaxonFinder reached an F-score of only 38.47%, which is low compared with other named entity recognition results. For instance, the best system entered in the Message Understanding Conferences (MUC) achieved an F-score of 93.39%, while human annotators scored 97.60% and 96.95%. There is considerable room to improve the name recognition algorithms.

Future Work
Intelligent retrieval is the trend: search is becoming smarter. How can we achieve it?
Experiments with machine learning methods
Use of other external sources, e.g. ontologies
Automatic OCR correction
Fuzzy matching algorithms in IR
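As one possible starting point for the fuzzy matching direction, the sketch below uses Python's standard `difflib` module to map an OCR-damaged string to the closest entry in a name list. The slides do not name any particular algorithm, so the choice of `difflib`, the cutoff of 0.85, the function name, and the lexicon are assumptions for illustration only.

```python
from difflib import get_close_matches


def fuzzy_lookup(token, lexicon, cutoff=0.85):
    """Return the lexicon entry most similar to token, or None.

    Similarity is difflib's ratio, which ranges from 0 to 1; entries
    scoring below the cutoff are ignored.
    """
    matches = get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical authority list of valid names.
lexicon = ["Quercus alba", "Rosa canina", "Acer rubrum"]

print(fuzzy_lookup("Quercus alha", lexicon))   # one misread letter
print(fuzzy_lookup("Quer cus alba", lexicon))  # inserted space
print(fuzzy_lookup("the bark of", lexicon))    # ordinary text
```

A lower cutoff would recover more damaged names at the cost of more false matches, so the threshold would need to be tuned against annotated pages.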

References
[1] S. Rice, J. Kanai, and T. Nartker. An evaluation of OCR accuracy. In UNLV Information Science Research Institute Annual Report, pages 9-20, 1993.
[2] D. Koning, N. Sarkar, and T. Moritz. TaxonGrab: Extracting Taxonomic Names from Text. Biodiversity Informatics 2.
[3] G. Sautter, K. Bohm, and D. Agosti. A combining approach to find all taxon names (FAT) in legacy biosystematics literature. Biodiversity Informatics 3.
[4] A. T. McCray, A. R. Aronson, A. C. Browne, T. C. Rindflesch, A. Razi, and S. Srinivasan. UMLS knowledge for biomedical language processing. Bulletin of the Medical Library Association 81:184-94.

Questions?

Thanks!