NLP Tools for Biology Literature Mining
Qiaozhu Mei, Jing Jiang, ChengXiang Zhai
Nov 3, 2004

What do we have?
- Biology literature (a huge amount of text)
- E.g. Mites in the genus Varroa are the primary parasites of honey bees … Ten of 22 transfer RNAs are in different locations relative to hard ticks, and the 12S ribosomal RNA subunit is inverted and separated from the 16S rRNA by a novel non-coding region, a trait not yet seen in other arthropods. … (from Biological Abstracts)

What do we want?
- Named entities: gene names, protein names, drugs, etc.
- Interaction events between entities: transcription, translation, post-translational modification, etc.
- Relationships between basic events: caused by, inhibited by, etc.
(from Hirschman et al. 02)

Preliminary System Structure
- Input: collections of raw textual data
- Text pre-processing (NLP): POS tagger (nouns, verbs, etc.), parser (NPs, VPs, relations), entity extractor (genes, proteins, other entities), …
- Output: pre-processed data ready to mine
- Text mining modules (TM)
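The structure above can be sketched as a chain of interchangeable pre-processing modules. This is only an illustration under assumptions: the function names (`run_pipeline`, `toy_pos_tagger`, `toy_chunker`, `toy_entity_extractor`) and the dictionary layout are hypothetical, and the toy modules stand in for real taggers, parsers, and extractors.

```python
from typing import Callable, Dict, List

# A module takes the shared document record and returns it with new annotations.
Module = Callable[[Dict], Dict]


def run_pipeline(text: str, modules: List[Module]) -> Dict:
    """Run each pre-processing module in order over one document."""
    doc = {"text": text}
    for module in modules:
        doc = module(doc)
    return doc


# Hypothetical stand-ins: a real tagger, parser, or extractor would replace these.
def toy_pos_tagger(doc: Dict) -> Dict:
    # Tags every token as NN; only the interface matters here.
    return {**doc, "pos": [(tok, "NN") for tok in doc["text"].split()]}


def toy_chunker(doc: Dict) -> Dict:
    # Puts all tokens in a single NP chunk.
    return {**doc, "chunks": [("NP", [tok for tok, _ in doc["pos"]])]}


def toy_entity_extractor(doc: Dict) -> Dict:
    # Treats any token ending in "RNA" as an entity mention.
    return {**doc, "entities": [tok for tok, _ in doc["pos"] if tok.endswith("RNA")]}


result = run_pipeline("the 16S rRNA", [toy_pos_tagger, toy_chunker, toy_entity_extractor])
print(result["entities"])
```

Because each module only reads and extends the shared record, swapping one tagger or extractor for another does not affect the rest of the chain.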

POS Taggers
- Tree Tagger
- Brill Tagger
- SNoW Tagger
- LT Chunk
- Stanford Tagger

Results of POS Tagging
Raw text: Mites in the genus Varroa are the primary parasites of honey bees … Ten of 22 transfer RNAs are in different locations relative to hard ticks, and the 12S ribosomal RNA subunit is inverted and separated from the 16S rRNA by a novel non-coding region, a trait not yet seen in other arthropods. … (from Biological Abstracts)

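Results of POS Tagging (cont.)
Tags are listed in tagger order (Tree, Brill, SNoW, LT, Stanford); where fewer than five appear, taggers that agree share a tag.
the: DT
12S: JJ, CD, NN, JJ, CD
ribosomal: JJ
RNA: NP, NNP
subunit: NN
is: VBZ
inverted: VVN, JJ, VBN, JJ
and: CC
separated: VVN, VBN, VBD, VBN, JJ
from: IN
the: DT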

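Results of POS Tagging (cont.)
Tags are listed in tagger order (Tree, Brill, SNoW, LT, Stanford); where fewer than five appear, taggers that agree share a tag.
16S: JJ, CD, NN, JJ, CD
rRNA: NN, NNP
by: IN
a: DT
novel: NN, JJ, NN
non-coding: VVG, JJ, NN, JJ
region: NN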
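A rough way to quantify how much the taggers disagree is to count pairwise matches on the words where all five tags are listed. The sketch below assumes the five tags for "12S", "separated", and "16S" map one-to-one onto Tree, Brill, SNoW, LT, and Stanford in that order; the function name `pairwise_agreement` is hypothetical. Tags are compared literally, so Tree Tagger's VVN does not count as a match for VBN.

```python
from itertools import combinations

TAGGERS = ["Tree", "Brill", "SNoW", "LT", "Stanford"]

# Only the rows with five distinct cells are used (assumed column order as above).
ROWS = {
    "12S": ["JJ", "CD", "NN", "JJ", "CD"],
    "separated": ["VVN", "VBN", "VBD", "VBN", "JJ"],
    "16S": ["JJ", "CD", "NN", "JJ", "CD"],
}


def pairwise_agreement(rows):
    """Count, for every pair of taggers, the words on which their tags match exactly."""
    counts = {pair: 0 for pair in combinations(TAGGERS, 2)}
    for tags in rows.values():
        by_tagger = dict(zip(TAGGERS, tags))
        for a, b in counts:
            if by_tagger[a] == by_tagger[b]:
                counts[(a, b)] += 1
    return counts


agreement = pairwise_agreement(ROWS)
for pair, n in sorted(agreement.items(), key=lambda kv: -kv[1]):
    print(pair, n)
```

Three words are far too few for a real evaluation; the point is only that the hard cases for general-purpose taggers are domain tokens such as "12S" and "16S".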

Comparison of POS Taggers
Values are listed in tagger order: Tree | Brill | SNoW | LT | Stanford
Src.: Stuttgart | Eric Brill | UIUC | Edinburgh | Stanford
Alg.: decision tree | transformation based | network of linear functions | HMM disambiguation | maximum entropy
Speed: 1 min/5M | < 1 min/5M | ~8 mins/5M | 40 mins/5M | 80 mins/5M
Adapt.: yes | source included | high | low | source & API included
Other: punctuation sensitive | commonly used | help available | 96-98% precision
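The speed figures can be turned into relative slowdowns and rough run-time estimates for a larger collection. This is a back-of-the-envelope sketch: it assumes running time scales linearly with input size, treats "5M" as a unit of 5 M of text, and omits Brill because only an upper bound (< 1 min) is given. The helper names are hypothetical.

```python
# Minutes per 5M of text, from the comparison above (SNoW's figure is approximate).
MINUTES_PER_5M = {"Tree": 1, "SNoW": 8, "LT": 40, "Stanford": 80}


def relative_slowdown(times, baseline="Tree"):
    """How many times slower each tagger is than the baseline tagger."""
    base = times[baseline]
    return {name: t / base for name, t in times.items()}


def hours_for(size_m, minutes_per_5m):
    """Estimated hours to tag size_m M of text, assuming linear scaling."""
    return size_m / 5 * minutes_per_5m / 60


slowdown = relative_slowdown(MINUTES_PER_5M)
print(slowdown)
print(round(hours_for(100, MINUTES_PER_5M["LT"]), 1), "hours for 100M with LT")
```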

Conclusions
- Existing general-purpose POS taggers work fine for our task: most nouns and verbs are correctly identified.
- There is still room to improve existing POS taggers for biology data, e.g. to identify gene and protein names.
- Speed and adaptability are important.

A Little Bit More on SNoW
- SNoW has a POS tagger and a shallow parser.
- Speed is reasonable.
- Software is adaptable as help is available from CCG.
- The network model can be trained if we have training data.

Result of SNoW Shallow Parser
[NP the 12 S ribosomal RNA subunit] [VP is] [ADJP inverted] and [VP separated] [PP from] [NP the 16 S rRNA] [PP by] [NP a novel non-coding region]
(from online demo)
Problems:
- Currently the package is not available for download from the new CCG page.
- There are still problems running the old package on our machine (compilation, path setting, etc.).
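Bracketed chunker output of this kind is easy to load for later mining steps. The sketch below reads the demo output shown above with a regular expression; the function names `parse_chunks` and `noun_phrases` are hypothetical, and the pattern assumes chunks are never nested.

```python
import re

SNOW_OUTPUT = (
    "[NP the 12 S ribosomal RNA subunit] [VP is] [ADJP inverted] and "
    "[VP separated] [PP from] [NP the 16 S rRNA] [PP by] "
    "[NP a novel non-coding region]"
)

# One chunk: "[", a label, a space, then everything up to the closing "]".
CHUNK_RE = re.compile(r"\[(\w+) ([^\]]+)\]")


def parse_chunks(text):
    """Return (label, words) pairs; tokens outside brackets (e.g. 'and') are skipped."""
    return [(label, words.split()) for label, words in CHUNK_RE.findall(text)]


def noun_phrases(text):
    """Return the NP chunks as plain strings."""
    return [" ".join(words) for label, words in parse_chunks(text) if label == "NP"]


print(noun_phrases(SNOW_OUTPUT))
```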

Parsers
- SNoW (already covered)
- LT-Chunk
- MiniPar
- Collins
- Stanford

Result of LT-Chunk
[[ the_DT 12S_JJ ribosomal_JJ RNA_NNP subunit_NN ]] (( is_VBZ inverted_VBN and_CC separated_VBN )) from_IN [[ the_DT 16S_JJ rRNA_NNP ]] by_IN [[ a_DT novel_JJ non-coding_JJ region_NN ]]
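The LT-Chunk line above can be split into groups with a small state machine. This sketch assumes `[[ ... ]]` marks noun groups and `(( ... ))` marks verb groups, and that every token has the form word_TAG; the labels "NG", "VG", and "O" and the function name `parse_lt_chunk` are hypothetical.

```python
LT_OUTPUT = (
    "[[ the_DT 12S_JJ ribosomal_JJ RNA_NNP subunit_NN ]] "
    "(( is_VBZ inverted_VBN and_CC separated_VBN )) from_IN "
    "[[ the_DT 16S_JJ rRNA_NNP ]] by_IN "
    "[[ a_DT novel_JJ non-coding_JJ region_NN ]]"
)

OPENERS = {"[[": "NG", "((": "VG"}  # assumed meaning: noun group, verb group
CLOSERS = {"]]", "))"}


def parse_lt_chunk(line):
    """Return a list of (group_label, [(word, tag), ...]); ungrouped tokens get label 'O'."""
    groups, current, kind = [], None, "O"
    for item in line.split():
        if item in OPENERS:
            kind, current = OPENERS[item], []
        elif item in CLOSERS:
            groups.append((kind, current))
            kind, current = "O", None
        else:
            # Split on the last underscore so words containing "_" stay intact.
            word, _, tag = item.rpartition("_")
            if current is None:
                groups.append(("O", [(word, tag)]))
            else:
                current.append((word, tag))
    return groups


for label, tokens in parse_lt_chunk(LT_OUTPUT):
    print(label, tokens)
```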

Result of MiniPar
16 (the ~ Det 20 det (gov subunit))
17 (12S ~ N 20 nn (gov subunit))
18 (ribosomal ~ A 20 mod (gov subunit))
19 (RNA ~ N 20 nn (gov subunit))
20 (subunit ~ N 22 s (gov invert))
21 (is be be 22 be (gov invert))
22 (inverted invert V E0 i (gov fin))
E4 (() subunit N 22 obj (gov invert)
23 (and ~ U 22 lex-mod (gov invert))
24 (separated separate V 22 lex-dep (gov invert))
25 (from ~ Prep 22 mod (gov invert))
26 (the ~ Det 28 det (gov rRNA))
27 (16S ~ N 28 nn (gov rRNA))
28 (rRNA ~ N 25 pcomp-n (gov from))
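Each MiniPar entry links a word to its governing word, so the output can be read as a list of dependency records. The sketch below parses a few entries from the result above with the field spacing restored by hand, which is an assumption about the original layout. It also assumes the fields are index, word, root form, category, head index, and relation, and that "~" means the root form equals the word. The names `ENTRY_RE` and `parse_entry` are hypothetical.

```python
import re

# A few entries from the output above, with spaces between fields restored by hand.
LINES = """\
16 (the ~ Det 20 det (gov subunit))
17 (12S ~ N 20 nn (gov subunit))
18 (ribosomal ~ A 20 mod (gov subunit))
19 (RNA ~ N 20 nn (gov subunit))
20 (subunit ~ N 22 s (gov invert))
""".splitlines()

ENTRY_RE = re.compile(r"^(\S+) \((\S+) (\S+) (\S+) (\S+) (\S+) \(gov (\S+)\)\)$")


def parse_entry(line):
    """Parse one entry into a dict, or return None if it does not fit the assumed layout."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    idx, word, root, cat, head, rel, gov = m.groups()
    return {
        "id": idx,
        "word": word,
        "root": word if root == "~" else root,  # assumed: "~" = same as word
        "cat": cat,
        "head": head,
        "rel": rel,
        "gov": gov,
    }


entries = [e for e in map(parse_entry, LINES) if e is not None]
# Everything attached directly to token 20 ("subunit"), with its relation.
modifiers = [(e["word"], e["rel"]) for e in entries if e["head"] == "20"]
print(modifiers)
```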

Results of Collins Parser
(S~is~2~2 (NPB~subunit~5~5 the/DT 12S/CD ribosomal/JJ RNA/NNP subunit/NN ) (VP~is~2~1 is/VBZ (UCP~inverted~3~1 (ADJP~inverted~1~1 inverted/JJ ) and/CC (VP~separated~3~1 separated/VBN (PP~from~2~1 from/IN (NPB~rRNA~3~3 the/DT 16S/CD rRNA/NN ) ) (PP~by~2~1 by/IN (NP~region~2~1 (NPB~region~4~4 a/DT novel/JJ non-coding/JJ region/NN ,/PUNC, )
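Head words and tagged leaves can be pulled out of the Collins output without a full tree parser. The sketch below runs two regular expressions over the first part of the parse shown above. It assumes node labels have the form label~headword~number~number and leaves have the form word/TAG; the names `NODE_RE` and `LEAF_RE` are hypothetical.

```python
import re

# First part of the parse shown above (the source output is cut off before it closes).
PARSE = (
    "(S~is~2~2 (NPB~subunit~5~5 the/DT 12S/CD ribosomal/JJ RNA/NNP subunit/NN ) "
    "(VP~is~2~1 is/VBZ (UCP~inverted~3~1 (ADJP~inverted~1~1 inverted/JJ ) and/CC "
    "(VP~separated~3~1 separated/VBN (PP~from~2~1 from/IN "
    "(NPB~rRNA~3~3 the/DT 16S/CD rRNA/NN ) )"
)

# Constituent label and head word, e.g. "(NPB~subunit~5~5" -> ("NPB", "subunit").
NODE_RE = re.compile(r"\((\w+)~([^~\s]+)~\d+~\d+")
# Leaf token, e.g. "RNA/NNP" -> ("RNA", "NNP"); must start after whitespace.
LEAF_RE = re.compile(r"(?<!\S)([^\s()/]+)/([A-Z]+)")

heads = NODE_RE.findall(PARSE)
leaves = LEAF_RE.findall(PARSE)

print(heads)
print(leaves)
```

The head words give a quick view of which token each phrase is built around, for example "subunit" for the first noun phrase and "rRNA" for the second.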

Comparison of Parsers
Values are listed in parser order: LT | MiniPar | Collins | Stanford
Src.: Edinburgh | U Alberta | M. Collins | Stanford
Prec.: Part of LT-POS | Slightly over 88% | ~85%
Speed: 40 min/5M | 14 min/5M | > 3 hrs/5M | very slow …
Adapt.: Low, training not allowed | High, provides API | Source included | Source & API included
Other: LT-Chunk is a part of LT-POS; readable output | Complex output of dependency and governing info. | Well-known; tagged input needed | Java based

Conclusion on Parsers
- MiniPar has advantages so far:
  - Fast
  - Outputs dependency and governing info and useful relations
  - Provides an API
- If SNoW is tuned for the task, we can easily plug it into the module.

Entity Extractors
- Abner: extracts protein, DNA, RNA, cell line, and cell type names
- Yagi: extracts only gene names; a sibling of Abner
- LingPipe: named entity extraction that can be trained for different domains

Result of Abner
Ten of 22 transfer RNAs are in different locations relative to hard ticks, and the 12 S ribosomal RNA subunit is inverted and separated from the 16 S rRNA by a novel non-coding region, …

Result of LingPipe
Ten of 22 transfer RNAs are in different locations relative to hard ticks, and the 12S ribosomal RNA subunit is inverted and separated from the 16S rRNA by a novel non-coding region, …

Comparison of Entity Extractors
Values are listed in extractor order: Abner | Yagi | LingPipe
Src.: U Wisconsin (Abner and Yagi) | Alias-i, Inc.
Alg.: CRF model (Abner and Yagi) | B-CUBED alg.
Prec.: 89.3%/69.9% (seen/unseen) data, 72% for protein | 75% on unseen data | Exact match: 64.9%
Recall: 65% | Exact: ~70%
Speed: 40 mins/5M | 3 mins/5M | 5 mins/5M (model 1), 3 hrs/5M (model 2)
Adapt.: Java based, pre-trained | Java based, pre-trained with BioCreative | Two trained models, training allowed
Other: Graphic interface; files <= 500k | Should split into small files <= 1M; can take directory as input | Command line & demo; also does co-referencing
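Because two of the extractors limit input size (<= 500k and <= 1M), a large collection has to be cut into smaller pieces first. The sketch below is one way to do that on line boundaries. The function name `split_text` is hypothetical, and reading "500k" as 500 * 1024 bytes of UTF-8 text is an assumption.

```python
def split_text(text, max_bytes):
    """Split text into pieces of at most max_bytes (UTF-8), breaking only between lines.

    A single line longer than max_bytes is kept whole, so that piece can exceed the limit.
    """
    pieces, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        n = len(line.encode("utf-8"))
        # Start a new piece when adding this line would cross the limit.
        if current and size + n > max_bytes:
            pieces.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        pieces.append("".join(current))
    return pieces


LIMIT_500K = 500 * 1024  # assumed meaning of "500k"

sample = "Mites in the genus Varroa are the primary parasites of honey bees\n" * 20000
parts = split_text(sample, LIMIT_500K)
print(len(parts), [len(p.encode("utf-8")) <= LIMIT_500K for p in parts])
```

Splitting between lines keeps each abstract or sentence intact as long as the collection stores one per line.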

Conclusion on Entity Extractors
- There is still a lot of room to improve.
- However, with existing extractors we can begin high-level text mining work.
- Performance on honeybee data needs to be evaluated.
- As soon as a better extractor is built, we can plug it in easily.

Summary
- Some existing NLP tools for supporting biology literature mining were evaluated: POS taggers, parsers, and entity extractors.
- Observations along two lines:
  - There is still considerable room for improvement beyond the existing NLP tools, especially in customizing them for special domains.
  - We can begin exploring higher-level text mining research with the support of these toolkits.
- Text pre-processing modules are independent and easy to plug and play.

References
- Hirschman, L. et al. Accomplishments and challenges in literature data mining for biology. Bioinformatics, 2002.
- Lin, D. Dependency-based evaluation of MiniPar. In Workshop on the Evaluation of Parsing Systems, 1998.

End of Talk Thank you!
