Published by Archibald Phelps. Modified over 8 years ago.
1
Taxonomic Name Recognition (TNR) in the Biodiversity Heritage Library
Qin Wei (魏琴), Chris Freeland, P. Bryan Heidorn
Missouri Botanical Garden
2
6/12/2016 11:20:46 PM TNR in BHL
Co-author: Chris Freeland
Director of the Biodiversity Heritage Library
IT division manager, Missouri Botanical Garden
chris.freeland@mobot.org
3
Biodiversity Heritage Library (BHL)
“Ten major natural history museum libraries, botanical libraries, and research institutions have joined to form the BHL. The group is developing a strategy and operational plan to digitize the published literature of biodiversity held in their respective collections. This literature will be available through a global biodiversity commons.”
(The world's largest digital library for biodiversity, built jointly by 10 institutions.)
More information about BHL can be found at http://www.biodiversitylibrary.org
4
Participating institutions
American Museum of Natural History (New York, NY)
The Field Museum (Chicago, IL)
Harvard University Botany Libraries (Cambridge, MA)
Harvard University (Cambridge, MA)
Marine Biological Laboratory / Woods Hole Oceanographic Institution (Woods Hole, MA)
Missouri Botanical Garden (St. Louis, MO)
Natural History Museum (London, UK)
The New York Botanical Garden (New York, NY)
Royal Botanic Gardens, Kew (Richmond, UK)
Smithsonian Institution Libraries (Washington, DC)
5
Open Access (a free, open library)
“BHL Project strives to establish a major corpus of digitized publications on the Web drawn from the historical biodiversity literature. This material will be available for open access and responsible use as a part of a global Biodiversity Commons. We will work with the global taxonomic community, rights holders, and other interested parties to ensure that this legacy literature is available to all.”
7
TNR in BHL (taxonomic name recognition in a digital library)
A significant aspect of BHL is the incorporation of algorithmic taxonomic intelligence provided by uBio.org. (This sets BHL apart from other digital libraries: intelligent search rather than simple full-text retrieval.)
As materials are scanned, the image files are processed through ABBYY FineReader or PrimeOCR to create text derivatives. Those text files are then submitted to uBio’s TaxonFinder web service, which identifies strings in the text that match the characteristics of scientific names. (Automatic OCR followed by automatic taxonomic name recognition.)
8
Two TNR algorithms (two important taxonomic name recognition algorithms)
TaxonFinder was developed by uBio. It uses statistical models built from the validated organism names in NameBank. These models aim to describe the structure and frequency of common character sequences in organism names, so that TaxonFinder can infer whether an unknown word has a structure similar to that of a known organism name. (It uses statistical methods to recognize the characteristic features of names.)
Online only.
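The idea of scoring an unknown word by its character sequences can be illustrated with a toy character-trigram model. This is only a sketch of the general technique, not uBio's TaxonFinder implementation: the training list stands in for NameBank, and the function names, the padding symbols, and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter

def trigrams(word):
    # Pad the word so that its beginning and ending sequences are modelled too.
    padded = f"^^{word.lower()}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(names):
    # Count character trigrams over every word of every known name.
    counts = Counter()
    for name in names:
        for token in name.split():
            counts.update(trigrams(token))
    return counts

def score(word, counts, alphabet_size=28):
    # Mean log-probability per trigram with add-one smoothing (hypothetical choice).
    total = sum(counts.values())
    vocab = alphabet_size ** 3
    grams = trigrams(word)
    return sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams) / len(grams)

# Tiny stand-in for a validated name list such as NameBank.
KNOWN_NAMES = ["Quercus alba", "Quercus rubra", "Rosa canina", "Rosa rugosa",
               "Homo sapiens", "Felis catus", "Canis lupus", "Pinus sylvestris"]
model = train(KNOWN_NAMES)

# "Rubus" is not in the list but shares sequences with it; "which" does not.
print(score("Rubus", model) > score("which", model))
```

Scores are only meaningful relative to each other; a real system would need a far larger name list and a threshold tuned on annotated text.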
9
Two TNR algorithms
FAT, short for “Finds All Taxonomic names”, was developed to automatically extract all taxonomic names from the biological literature. It uses the parts of the text already classified to build lexica and statistics (dictionary lookup), which are then used to classify the rest of the text (Sautter et al.). (Recognition relies on dictionaries of taxonomic names and on letter-level features.)
Offline usage and customized dictionaries.
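The bootstrapping idea, where names already classified feed lexica that classify the rest of the text, can be sketched in a few lines. This is a much-simplified illustration and not the FAT implementation described by Sautter et al.: it handles only two-word names, and the regular expression, the seed genus list, and the function name are hypothetical.

```python
import re

# Candidate binomials: a capitalized word followed by a lowercase word.
PAIR = re.compile(r"\b([A-Z][a-z]+) ([a-z]+)\b")

def find_names(text, seed_genera):
    genera = set(seed_genera)
    candidates = PAIR.findall(text)
    # Pass 1: pairs whose first word is a known genus teach us their epithets.
    epithets = {species for genus, species in candidates if genus in genera}
    # Pass 2: a pair ending in a learned epithet teaches us a new genus.
    genera |= {genus for genus, species in candidates if species in epithets}
    # Final classification uses the enlarged lexicon.
    return sorted({f"{g} {s}" for g, s in candidates if g in genera})

text = ("The oak Quercus alba grows near Populus alba. "
        "Populus tremula is common here.")
print(find_names(text, seed_genera=["Quercus"]))
# ['Populus alba', 'Populus tremula', 'Quercus alba']
```

Only "Quercus" is in the seed dictionary; "Populus" is learned from the shared epithet "alba", which then lets "Populus tremula" be recognized, while "The oak" is rejected.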
11
Digitization Process
[Diagram. Labels: Libraries (unstructured data), OCR (images to text), Text Database, Text Mining (process text), TNR (find names), uBio Database, BHL Text Mining Database (structured data), BHL Web]
12
Digitization Process
[Diagram. Labels: Libraries (unstructured data), OCR, Text Database, Text Mining, TNR, uBio Database, BHL Text Mining Database (structured data), BHL Web. Annotated with three error sources: OCR error, TNR error, authority file error]
13
Sample Characteristics
Number of pages: 392
Average number of tokens per page: 446.8
Average number of names per page: 7.7
14
Evaluation Measures
Precision is the proportion of matching strings that are valid names. In our case, precision measures the algorithm's ability to exclude invalid names from the result.
Recall is the proportion of valid names in the whole database that were returned as true positives. It measures the ability to find all valid names in the database.
In this evaluation we also use a single measure, the F-score, which is the harmonic mean of recall (R) and precision (P):
F-score = 2 × (Precision × Recall) / (Precision + Recall)
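As a minimal sketch of these three measures, the function below compares the names returned by an algorithm against the names marked as valid by an annotator. For simplicity it treats both as sets of exact strings; the function name and the example names are made up for illustration.

```python
def evaluate(found, gold):
    """Return (precision, recall, F-score) for exact string matches."""
    found, gold = set(found), set(gold)
    true_positives = len(found & gold)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

gold = {"Quercus alba", "Rosa canina", "Homo sapiens", "Felis catus"}
found = {"Quercus alba", "Rosa canina", "New York"}

p, r, f = evaluate(found, gold)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
# precision=0.667 recall=0.500 F=0.571
```

Two of the three returned strings are valid (precision 2/3) and two of the four valid names were found (recall 2/4), giving a harmonic mean of 4/7.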
15
Sample Language Distribution
[Chart not preserved in this transcript]
16
Ground Truth (baseline data)
Total: 3003 valid names

                     Unique names   One word   Two words   Three words   More words
No. of valid names   2673           1074       1220        518           191
Percentage           89.01%         35.76%     40.63%      17.25%        6.36%
17
OCR Overall Performance

Total   Wrong OCR   Error rate
3003    1056        35.16%
18
Error Breakdown
[Chart not preserved in this transcript]
19
OCR: Does language matter?
[Chart not preserved in this transcript]
20
Top OCR error patterns

Rank   Pattern         Rank   Pattern
1      Insert space    8      n->v
2      Omit space      9      l->i
3      e->c            10     r->i
4      u->i            11     u->ii
5      u->n            12     h->l
6      i->l            13     h->ii
7      c->e            14     e->o
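Patterns like these could drive a simple dictionary-based correction step. The sketch below assumes each pattern reads "intended character -> character produced by OCR" (the slide does not state the direction), undoes one substitution at a time, and accepts a correction only when exactly one candidate is in the lexicon. The space patterns (ranks 1 and 2) are not handled, and the lexicon and function names are hypothetical.

```python
# (intended, recognized) pairs for ranks 3 to 14; the direction is an assumption.
PATTERNS = [("e", "c"), ("u", "i"), ("u", "n"), ("i", "l"), ("c", "e"),
            ("n", "v"), ("l", "i"), ("r", "i"), ("u", "ii"), ("h", "l"),
            ("h", "ii"), ("e", "o")]

def candidates(word):
    """All strings obtained by undoing one substitution at one position."""
    out = set()
    for intended, recognized in PATTERNS:
        start = word.find(recognized)
        while start != -1:
            out.add(word[:start] + intended + word[start + len(recognized):])
            start = word.find(recognized, start + 1)
    return out

def correct(word, lexicon):
    """Return the word itself, a unique lexicon correction, or None."""
    if word in lexicon:
        return word
    matches = sorted(candidates(word) & set(lexicon))
    return matches[0] if len(matches) == 1 else None

lexicon = {"Quercus", "alba", "Rosa"}
print(correct("Qucrcus", lexicon))  # Quercus
print(correct("aiba", lexicon))     # alba
print(correct("xyz", lexicon))      # None
```

Requiring a unique match keeps the step conservative; words with several OCR errors would need repeated application or a more general fuzzy match.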
21
NameBank
For TaxonFinder, the error rate due to NameBank incompleteness is 6%.

No NameBank ID   244
Is valid name    92
Found            1540
Total            1696
22
Error Analysis (exact match)

Without OCR errors*                       TaxonFinder   FAT
No. of names (identified by biologist)    1696          1937
No. of names found by algorithm           1540          1603
Correct                                   621           452
Precision                                 40.32%        28.20%
Recall                                    36.62%        23.34%
F-score                                   38.47%        25.77%

With OCR errors*                          TaxonFinder   FAT
No. of names (identified by biologist)    2610          3003
No. of names found by algorithm           1540          1603
Correct                                   674           517
Precision                                 43.77%        32.25%
Recall                                    25.82%        17.21%
F-score                                   34.80%        24.73%
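The precision and recall figures can be recomputed from the raw counts on this slide, as in the short check below (the helper name is hypothetical). One observation from doing so: the reported F-score values appear to coincide with the arithmetic mean of precision and recall, while the harmonic mean given in the evaluation measures comes out slightly lower, for example about 38.38% rather than 38.47% for TaxonFinder without OCR errors.

```python
def prf(correct, found, gold):
    """Precision, recall and harmonic-mean F-score from raw counts."""
    precision = correct / found
    recall = correct / gold
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# (correct, found by algorithm, identified by biologist), copied from the slide.
RUNS = {
    "TaxonFinder, without OCR errors": (621, 1540, 1696),
    "FAT, without OCR errors": (452, 1603, 1937),
    "TaxonFinder, with OCR errors": (674, 1540, 2610),
    "FAT, with OCR errors": (517, 1603, 3003),
}

for label, counts in RUNS.items():
    p, r, f = prf(*counts)
    print(f"{label}: P={p:.2%} R={r:.2%} "
          f"harmonic F={f:.2%} mean(P, R)={(p + r) / 2:.2%}")
```

The ranking of the two algorithms is the same under either mean; only the absolute F-score values shift.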
23
Overall Performance

Step           No. of names   Error rate
OCR            1056           35.16%
TaxonFinder*   1234           41.09%
NameBank       92             3.06%
Total          3003           79.32%
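The error rates in this table are consistent with dividing each step's count by the 3003 valid names, and the Total rate with summing the three counts first. The snippet below reproduces that arithmetic; it assumes the three error sets do not overlap, which the slide does not state explicitly.

```python
TOTAL_NAMES = 3003

# Names lost at each step, as given on the slide.
errors = {"OCR": 1056, "TaxonFinder": 1234, "NameBank": 92}

rates = {step: count / TOTAL_NAMES for step, count in errors.items()}
overall = sum(errors.values()) / TOTAL_NAMES

for step, rate in rates.items():
    print(f"{step}: {rate:.2%}")
print(f"Total: {overall:.2%}")
# OCR: 35.16%
# TaxonFinder: 41.09%
# NameBank: 3.06%
# Total: 79.32%
```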
24
Conclusion
Our results indicate that TaxonFinder performs better than FAT (F-score 38.47% versus 25.77% on exact match without OCR errors). But even TaxonFinder achieved an F-score of only 38.47%, which is low compared with other named entity recognition results. For instance, the best system entered in the Message Understanding Conferences (MUC) achieved an F-score of 93.39%, while human annotators scored 97.60% and 96.95%. There is considerable room to improve the name recognition algorithms.
25
Future Work
Intelligent, AI-driven retrieval is the trend. How could we achieve it?
Experiments with machine learning methods
Use of other external sources, e.g. ontologies
Automatic OCR correction
Fuzzy matching algorithms in IR
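As one possible starting point for the fuzzy matching item, Python's standard `difflib.get_close_matches` can map an OCR-damaged token to the most similar entry in a name list. This is a sketch only: the lexicon, the wrapper function, and the 0.8 similarity cutoff are assumptions, and a production system would need an index that scales to millions of names.

```python
from difflib import get_close_matches

def fuzzy_lookup(token, lexicon, cutoff=0.8):
    """Return the closest lexicon entry above the cutoff, or None."""
    matches = get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None

lexicon = ["Quercus", "Rosa", "Pinus"]
print(fuzzy_lookup("Qucrcus", lexicon))  # Quercus
print(fuzzy_lookup("table", lexicon))    # None
```

Unlike a correction step driven by known OCR error patterns, this approach needs no error model, but it may also accept ordinary words that happen to resemble a name, so the cutoff trades recall against precision.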
26
References
[1] Rice, S., J. Kanai, and T. Nartker. 1993. An evaluation of OCR accuracy. In UNLV Information Science Research Institute Annual Report, pp. 9-20.
[2] Koning, D., N. Sarkar, and T. Moritz. 2005. TaxonGrab: Extracting taxonomic names from text. Biodiversity Informatics 2: 79-82.
[3] Sautter, G., K. Bohm, and D. Agosti. 2006. A combining approach to find all taxon names (FAT) in legacy biosystematics literature. Biodiversity Informatics 3: 41-53. http://jbi.nhm.ku.edu/index.php/jbi/article/view/34/16
[4] McCray, A. T., A. R. Aronson, A. C. Browne, T. C. Rindflesch, A. Razi, and S. Srinivasan. 1993. UMLS knowledge for biomedical language processing. Bulletin of the Medical Library Association 81: 184-94.
27
Questions?
28
Thanks!