Taxonomic Name Recognition (TNR) in the Biodiversity Heritage Library
Qin Wei, Chris Freeland, P. Bryan Heidorn
Missouri Botanical Garden

6/12/ :20:46 PM TNR in BHL2
Co-author
Chris Freeland, Director of the Biodiversity Heritage Library and IT division manager at the Missouri Botanical Garden

Biodiversity Heritage Library (BHL)
“Ten major natural history museum libraries, botanical libraries, and research institutions have joined to form the BHL. The group is developing a strategy and operational plan to digitize the published literature of biodiversity held in their respective collections. This literature will be available through a global biodiversity commons.”
Ten institutions are collaborating to build the world's largest digital library of biodiversity literature.
More information about BHL can be found at

Participating institutions
American Museum of Natural History (New York, NY)
The Field Museum (Chicago, IL)
Harvard University Botany Libraries (Cambridge, MA)
Harvard University (Cambridge, MA)
Marine Biological Laboratory / Woods Hole Oceanographic Institution (Woods Hole, MA)
Missouri Botanical Garden (St. Louis, MO)
Natural History Museum (London, UK)
The New York Botanical Garden (New York, NY)
Royal Botanic Gardens, Kew (Richmond, UK)
Smithsonian Institution Libraries (Washington, DC)

Open Access (a free, open library)
“BHL Project strives to establish a major corpus of digitized publications on the Web drawn from the historical biodiversity literature. This material will be available for open access and responsible use as a part of a global Biodiversity Commons. We will work with the global taxonomic community, rights holders, and other interested parties to ensure that this legacy literature is available to all.”

TNR in BHL (taxonomic name recognition in a digital library)
A significant aspect of BHL is the incorporation of algorithmic taxonomic intelligence provided by uBio.org. This sets BHL apart from other digital libraries: it offers intelligent search rather than simple full-text retrieval.
As materials are scanned, the image files are processed through ABBYY FineReader or PrimeOCR to create text derivatives. Those text files are then submitted to uBio's TaxonFinder web service, which identifies strings in the text that match the characteristics of scientific names. Both the OCR step and the name recognition step are automatic.

Two TNR algorithms
TaxonFinder is developed by uBio and uses statistical models built from the validated organism names in NameBank. These models describe the structure and frequency of common character sequences in organism names, so that TaxonFinder can infer whether an unknown word has a structure similar to that of a known organism name.
Available online only.
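To make the idea of a character-sequence model concrete, here is a minimal sketch in Python. It is not TaxonFinder's actual implementation: the small `KNOWN_NAMES` list is a hypothetical stand-in for NameBank, and the trigram scoring and the function names `trigrams`, `build_model` and `name_likeness` are assumptions chosen only for illustration.

```python
from collections import Counter

# Hypothetical stand-in for the validated names held in NameBank.
KNOWN_NAMES = ["quercus", "rosaceae", "felidae", "canidae", "asteraceae",
               "fabaceae", "poaceae", "hominidae", "ursidae"]

def trigrams(word):
    """Return the character trigrams of a word, padded at both ends."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def build_model(names):
    """Count how often each trigram occurs in the known names."""
    counts = Counter()
    for name in names:
        counts.update(trigrams(name))
    return counts

def name_likeness(word, model):
    """Fraction of the word's trigrams that also occur in known names."""
    grams = trigrams(word)
    if not grams:
        return 0.0
    seen = sum(1 for g in grams if model[g] > 0)
    return seen / len(grams)

model = build_model(KNOWN_NAMES)
print(name_likeness("bovidae", model))  # unseen word with a name-like ending
print(name_likeness("window", model))   # ordinary English word
```

An unseen word such as "bovidae" shares its ending trigrams with the known family names and scores higher than "window", which shares none. A real model would be trained on far more names and would weight sequences by frequency rather than by simple presence.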

Two TNR algorithms
FAT, short for “Find All Taxon names”, was developed to automatically extract all taxonomic names from the biological literature. It uses the parts of the text that are already classified to build lexica and statistics (dictionary lookup), which are then used to classify the rest of the text (Sautter et al. [3]). Recognition relies on dictionaries of taxonomic names together with character features.
Supports offline use and customized dictionaries.
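The bootstrapping idea, where already classified parts feed a lexicon that helps classify the rest, can be sketched as follows. This is an illustration only and not FAT's code: the seed `GENUS_LEXICON`, the `BINOMIAL` pattern and the function `find_names` are hypothetical.

```python
import re

# Hypothetical seed lexicon; FAT's real dictionaries are not reproduced here.
GENUS_LEXICON = {"Quercus", "Panthera"}

# A capitalized word followed by a lowercase word, e.g. "Quercus alba".
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")

def find_names(text, genera):
    """Two-pass lookup: known genera first, then learned epithets."""
    candidates = BINOMIAL.findall(text)
    # Pass 1: accept candidates whose first word is a known genus,
    # and remember their second words as species epithets.
    accepted = {(g, s) for g, s in candidates if g in genera}
    epithets = {s for _, s in accepted}
    # Pass 2: accept remaining candidates that reuse a learned epithet.
    accepted |= {(g, s) for g, s in candidates if s in epithets}
    return sorted(f"{g} {s}" for g, s in accepted)

text = ("Quercus alba grows here. Populus alba was also recorded. "
        "The survey ended in May.")
print(find_names(text, GENUS_LEXICON))  # ['Populus alba', 'Quercus alba']
```

"Populus" is not in the seed lexicon, but the epithet "alba" learned from "Quercus alba" lets the second pass accept "Populus alba", while "The survey" is rejected.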

Digitization Process
[Diagram: Libraries (unstructured data) -> OCR (images to text) -> text database -> TNR with the uBio database (text mining: process text, find names) -> BHL text mining database (structured data) -> BHL Web]

Digitization Process
[Diagram: the digitization pipeline (Libraries -> OCR -> text database -> TNR with the uBio database -> BHL text mining database -> BHL Web), annotated with three error sources: OCR error, TNR error and authority file error]

Sample Characteristics
Number of pages: 392
Average number of tokens:
Average number of names: 7.7

Evaluation Measures
Precision (P) is the proportion of matching strings that are valid names. In our case, precision measures the ability of the algorithm to exclude non-valid names from the result.
Recall (R) is the proportion of valid names in the whole database that were returned as true positives. It measures the ability to find all valid names in the database.
In this evaluation, we also use a single measure, the F-score, which is the harmonic mean of R and P:
F-score = 2 * (Precision * Recall) / (Precision + Recall)
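A minimal sketch of how these three measures can be computed under exact matching. The function name `evaluate` and the example name sets are hypothetical and are not taken from the evaluation data.

```python
def evaluate(found, gold):
    """Return (precision, recall, f_score) for exact-match name sets."""
    found, gold = set(found), set(gold)
    true_positives = len(found & gold)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical example: names marked by a biologist vs. names an algorithm returned.
gold = {"Quercus alba", "Panthera leo", "Felis catus", "Canis lupus"}
found = {"Quercus alba", "Panthera leo", "Table one"}
p, r, f = evaluate(found, gold)
print(f"P={p:.2%} R={r:.2%} F={f:.2%}")  # P=66.67% R=50.00% F=57.14%
```

Two of the three returned strings are valid (precision 2/3) and two of the four valid names were found (recall 1/2), which gives an F-score of 4/7.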

Sample Language Distribution

Ground Truth
Total: 3003 valid names
                     Unique Name   One Word   Two Words   Three Words   More Words
No. of valid names
Percentage           89.01%        35.76%     40.63%      17.25%        6.36%

OCR Overall Performance
Total   Wrong OCR   Error Rate (%)

Error Breakdown

OCR: Does language matter?

Top OCR error patterns
1   Insert space
2   Omit space
3   e->c
4   u->i
5   u->n
6   i->l
7   c->e
8   n->v
9   l->i
10  r->i
11  u->ii
12  h->l
13  h->ii
14  e->o
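One way such patterns could be put to use is to undo a single confusion and check the result against a name list. The sketch below assumes each pattern reads "original -> OCR output", ignores the two space-related patterns, and uses a hypothetical `LEXICON`; it is an illustration, not a method evaluated in this study.

```python
# Character confusions from the list above, written as (correct, misread).
# Assumption: each pattern in the list reads "original -> OCR output".
CONFUSIONS = [("e", "c"), ("u", "i"), ("u", "n"), ("i", "l"), ("c", "e"),
              ("n", "v"), ("l", "i"), ("r", "i"), ("u", "ii"), ("h", "l"),
              ("h", "ii"), ("e", "o")]

def correction_candidates(token):
    """Undo one confusion at one position; yield each resulting string."""
    for correct, misread in CONFUSIONS:
        start = token.find(misread)
        while start != -1:
            yield token[:start] + correct + token[start + len(misread):]
            start = token.find(misread, start + 1)

def correct_token(token, lexicon):
    """Return the token if known, else the first known candidate, else None."""
    if token in lexicon:
        return token
    for candidate in correction_candidates(token):
        if candidate in lexicon:
            return candidate
    return None

LEXICON = {"Quercus", "Ursus"}  # hypothetical stand-in for a name authority file
print(correct_token("Qucrcus", LEXICON))  # Quercus
print(correct_token("Ursiis", LEXICON))   # Ursus
```

Only one substitution is undone per candidate, so tokens with several OCR errors are not recovered by this sketch.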

NameBank
For TaxonFinder, the error rate due to NameBank incompleteness is 6%.
NoBankID      244
IsValidName   92
Found         1540
Total         1696
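The 6% figure appears consistent with dividing the 92 valid names by the 1540 found names. The short check below rests on that reading, and on the assumption that IsValidName counts valid names that have no NameBank ID; neither is stated explicitly on the slide. The total of 3003 valid names comes from the ground truth slide.

```python
valid_without_id = 92   # IsValidName (assumed: valid names lacking a NameBank ID)
found = 1540            # Found
total_valid = 3003      # total valid names in the ground truth

rate_of_found = valid_without_id / found
rate_of_all = valid_without_id / total_valid
print(f"{rate_of_found:.2%} of found names, {rate_of_all:.2%} of all valid names")
# prints: 5.97% of found names, 3.06% of all valid names
```

The second ratio matches the 3.06% listed for NameBank in the overall performance table.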

Error Analysis (Exact Match)

Without OCR error*
                                          TaxonFinder   FAT
No. of names (identified by biologist)
No. of names found by algorithms
Correct
Precision                                 40.32%        28.20%
Recall                                    36.62%        23.34%
F-score                                   38.47%        25.77%

With OCR error*
                                          TaxonFinder   FAT
No. of names (identified by biologist)
No. of names found by algorithms
Correct
Precision                                 43.77%        32.25%
Recall                                    25.82%        17.21%
F-score                                   34.80%        24.73%
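As a consistency check, the F-score formula from the evaluation measures slide can be applied to the precision and recall values reported above. The helper name `harmonic_f` is an assumption. Computed this way, the harmonic mean comes out slightly below the reported F-scores (about 38.38% rather than 38.47% for TaxonFinder without OCR errors), while the reported values coincide with the arithmetic mean (P + R) / 2, so they may have been computed as a simple average.

```python
def harmonic_f(precision, recall):
    """F-score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (label, precision %, recall %, reported F-score %) copied from the table above.
REPORTED = [
    ("TaxonFinder, without OCR error", 40.32, 36.62, 38.47),
    ("FAT, without OCR error", 28.20, 23.34, 25.77),
    ("TaxonFinder, with OCR error", 43.77, 25.82, 34.80),
    ("FAT, with OCR error", 32.25, 17.21, 24.73),
]

for label, p, r, reported in REPORTED:
    print(f"{label}: harmonic={harmonic_f(p, r):.2f} "
          f"arithmetic={(p + r) / 2:.2f} reported={reported:.2f}")
```

The ranking of the two algorithms is the same under either mean.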

Overall Performance
Step           No. of Names   Error Rate
OCR                           %
TaxonFinder*                  %
NameBank       92             3.06%
Total                         %

Conclusion
Our results indicate that TaxonFinder performs slightly better than FAT. Even so, TaxonFinder achieved an F-score of only 38.47%, which is low compared with other named entity recognition results. For instance, the best system entered in the Message Understanding Conferences (MUC) achieved an F-score of 93.39%, while human annotators scored 97.60% and 96.95%. There is therefore considerable room to improve the name recognition algorithms.

Future Work
Intelligent retrieval is the trend: the direction is smarter search. How could we achieve it?
Experiments on machine learning methods
Using other external sources, e.g. ontologies
Automatic OCR correction
Fuzzy matching algorithms in IR
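As an illustration of the fuzzy matching direction, the sketch below uses `difflib.get_close_matches` from the Python standard library to map an OCR-damaged string to the closest entry in a name list. The `KNOWN_NAMES` list, the function `fuzzy_lookup` and the cutoff of 0.85 are hypothetical choices, not results from this study.

```python
import difflib

# Hypothetical stand-in for an authority list such as NameBank.
KNOWN_NAMES = ["Quercus alba", "Quercus rubra", "Panthera leo", "Ursus arctos"]

def fuzzy_lookup(candidate, names, cutoff=0.85):
    """Return the closest known name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(candidate, names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(fuzzy_lookup("Qucrcus alba", KNOWN_NAMES))       # Quercus alba
print(fuzzy_lookup("Table of contents", KNOWN_NAMES))  # None
```

A high cutoff keeps precision up at the cost of recall; choosing it would need the same kind of evaluation as the exact-match results above.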

References
[1] Rice, S., J. Kanai, and T. Nartker. An evaluation of OCR accuracy. In UNLV Information Science Research Institute Annual Report, pages 9-20, 1993.
[2] Koning, D., N. Sarkar, and T. Moritz. TaxonGrab: Extracting Taxonomic Names from Text. Biodiversity Informatics 2:
[3] Sautter, G., K. Bohm, and D. Agosti. A combining approach to find all taxon names (FAT) in legacy biosystematics literature. Biodiversity Informatics 3.
[4] McCray, A. T., A. R. Aronson, A. C. Browne, T. C. Rindflesch, A. Razi, and S. Srinivasan. UMLS knowledge for biomedical language processing. Bulletin of the Medical Library Association 81:184-94.

Questions?

Thanks!