TM and NLP for Biology
Research Issues in HPSG Parsing
Junichi TSUJII
School of Computer Science, National Centre for Text Mining, University of Manchester, UK
Department of Computer Science, School of Information Science and Technology, University of Tokyo, JAPAN

2 Increase in Medline
[Chart: articles added per year (increments, 0 to 600,000) and cumulative total (accumulation, 0 to 14,000,000), by year, 1964 to 2002]
Papers on G-protein coupled receptor
– Before 1988: 9 papers
– 1992: 256 papers
– 2005: 14,000 papers
Articles added (MEDLINE alone)
– More than 0.5 million per year
– More than 1.3 thousand per day
Medline access [D.L.Banville 2006]
– 1997: 0.163 M accesses/month
– 2006: 82.027 M accesses/month (500 times more)

3 NaCTeM (www.nactem.ac.uk)
First such centre in the world
Funding: JISC, BBSRC, EPSRC
Consortium investment
Chair in TM (Prof. J. Tsujii, Univ. Tokyo)
Location: Manchester Interdisciplinary Biocentre (MIB, www.mib.ac.uk), funded by the Wellcome Trust
Initial focus: biomedical academic community
Extend services to industry
Extend focus to other domains (social sciences)

4 Consortium
Universities of Manchester, Liverpool
Service activity run by MIMAS (National Centre for Dataset Services), within MC (Manchester Computing)
Self-funded partners
– San Diego Supercomputing Center
– University of California, Berkeley
– University of Geneva
– University of Tokyo
Strong industrial & academic support
– IBM, AZ, EBI, Wellcome Trust, Sanger Institute, Unilever, NowGEN, MerseyBio, …

11 NLP and TM
Text Mining
– Text as a bag of words
– Words as surface strings
Natural Language Processing
– Language as a complex system linking surface strings of characters with their meanings
– Text and words as structured objects
NLP-based TM
– Linking text with knowledge

12 Non-Trivial Mappings
Language domain: linguistic expressions
Knowledge domain: domain concepts and relationships among them, motivated independently of language
Mapping between the two: terminology, parsing, paraphrasing
From surface diversities and ambiguities to conceptual invariants

13 Example

14 Non-trivial Mapping
Language domain / Knowledge domain (motivated independently of language)
Same relations with different structures: [A] protein activates [B] (pathway extraction)
– Full-strength Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to …..
– Since ……., we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene.
– Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription.
Retrieval using Region Algebra: [sentence] > ([arg1_activate] > [protein])
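
The query [sentence] > ([arg1_activate] > [protein]) can be read as "sentences that contain an arg1-of-activate region that in turn contains a protein region". A minimal sketch of that containment operator follows. The (start, end) region representation, the function name and the offsets are assumptions made for illustration; this is not MEDIE's implementation.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (start, end) character offsets, end exclusive


def containing(outer: List[Region], inner: List[Region]) -> List[Region]:
    """Region-algebra '>' operator: keep outer regions that contain some inner region."""
    return [o for o in outer
            if any(o[0] <= i[0] and i[1] <= o[1] for i in inner)]


# Hypothetical annotations over a two-sentence text (offsets are illustrative only).
sentences = [(0, 50), (51, 120)]
arg1_activate = [(60, 85)]        # span filling arg1 of "activate"
proteins = [(10, 20), (70, 85)]   # spans tagged as protein names

# [sentence] > ([arg1_activate] > [protein])
hits = containing(sentences, containing(arg1_activate, proteins))
print(hits)
```

Only the second sentence is returned: the first one mentions a protein, but not as arg1 of "activate".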

15 Predicate-argument structure
Parser based on Probabilistic HPSG (Enju)
Example: p53 has been shown to directly activate the Bcl-2 protein
[Figure: parse tree over the example (nodes S, NP, VP, ADVP) with predicate-argument links arg1, arg2, arg3]

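16 Predicate-argument structure: output of the probabilistic HPSG parser (Enju)
Semantic Retrieval System Using Deep Syntax: MEDIE
Example: The/DT protein/NN is/VBZ activated/VBN by/IN it/PRP
[Figure: parse tree over the example (nodes dt, np, vp, pp, s) with predicate-argument links arg1, arg2, mod]
Constructions shown: Passive; Passive and Infinitival Clause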

26 Demos: MEDIE, Info-PubMed

31 Performance of Semantic Parser
                          Penn Treebank   GENIA
Coverage                  99.7%           99.2%
F-Value (PA relations)    87.4%           86.4%
Sentence Precision        39.2%           31.8%
Processing Time           0.68 sec        1.00 sec

32 Scalability of TM Tools
Target Corpus: MEDLINE corpus
The number of papers        14,792,890
The number of abstracts     7,434,879
The number of sentences     70,815,480
The number of words         1,418,949,650
Compressed data size        3.2GB
Uncompressed data size      10GB
Suppose, for example, that it takes one second to parse one sentence: 70 million seconds, that is, about 2 years

33 TM and GRID
Solution
– The entire MEDLINE was parsed by distributed PC clusters consisting of 340 CPUs
– Parallel processing was managed by the grid platform GXP [Taura2004]
Experiments
– The entire MEDLINE was parsed in 8 days
Output
– Syntactic parse trees and predicate-argument structures in XML format
– The compressed/uncompressed output sizes were 42.5GB/260GB
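
The "about 2 years" estimate and the benefit of 340 CPUs can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch: the one-second-per-sentence figure is the slide's own working assumption, and the parallel figure assumes perfect load balancing with no overhead, which is an idealization added here.

```python
SENTENCES = 70_815_480        # sentences in the MEDLINE corpus (from the slide)
SECONDS_PER_SENTENCE = 1.0    # the slide's working assumption
CPUS = 340                    # size of the PC clusters used

SECONDS_PER_DAY = 24 * 60 * 60

serial_days = SENTENCES * SECONDS_PER_SENTENCE / SECONDS_PER_DAY
serial_years = serial_days / 365
# Idealized: perfect load balancing and no scheduling or I/O overhead (assumption).
ideal_parallel_days = serial_days / CPUS

print(f"serial: {serial_days:.0f} days (~{serial_years:.1f} years)")
print(f"ideal on {CPUS} CPUs: {ideal_parallel_days:.1f} days")
```

The reported 8 days is longer than this idealized figure; the slides do not break down where the difference comes from.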

34 Efficient Parsing for HPSG

35 Background: HPSG
Head-Driven Phrase Structure Grammar (HPSG) [Pollard and Sag, 1994]
– Lexicalized and constraint-based grammar
– A few rule schemata: general constraints on linguistic constructions
– Constraints embedded in the lexicon: word-specific constraints
– Constraints between phrase structures and semantic structures

36 Parsing by HPSG
Input: I like it

37 Parsing by HPSG: Assignment of Lexical Entries
[Figure: each word receives a lexical entry with HEAD, SUBJ and COMPS features; "I" and "it" are HEAD noun, "like" is HEAD verb]

38 Application of Rule Schema: Head-Complement
[Figure: "like" (HEAD verb, SUBJ tagged 1, COMPS tagged 2) combines with "it" (HEAD noun, tagged 2); the resulting phrase is HEAD verb with SUBJ tagged 1 and COMPS <>]

39 Application of Rule Schema: Subject-Head
[Figure: "I" (HEAD noun, tagged 1) combines with "like it" (HEAD verb, SUBJ tagged 1, COMPS <>); the result is a HEAD verb phrase covering "I like it"]
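
The three steps above (lexical entry assignment, Head-Complement, Subject-Head) can be sketched on a toy scale. The `Sign` class, the lexical entries and the two functions below are simplified assumptions for illustration: an equality check on atomic HEAD values stands in for unification of typed feature structures, and real HPSG entries carry far more information.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Sign:
    head: str                 # HEAD value, e.g. "noun" or "verb"
    subj: Tuple[str, ...]     # HEAD values still required as subject
    comps: Tuple[str, ...]    # HEAD values still required as complements


# Assumed toy lexical entries; not taken from an actual HPSG lexicon.
LEXICON = {
    "I": Sign("noun", (), ()),
    "it": Sign("noun", (), ()),
    "like": Sign("verb", ("noun",), ("noun",)),
}


def saturated(sign: Sign) -> bool:
    """A sign is saturated when it needs no further subject or complements."""
    return not sign.subj and not sign.comps


def head_complement(head: Sign, comp: Sign) -> Optional[Sign]:
    """Head-Complement schema: the head consumes its first COMPS requirement."""
    if head.comps and saturated(comp) and head.comps[0] == comp.head:
        return Sign(head.head, head.subj, head.comps[1:])
    return None


def subject_head(subj: Sign, head: Sign) -> Optional[Sign]:
    """Subject-Head schema: a COMPS-saturated head consumes its SUBJ requirement."""
    if head.subj and not head.comps and saturated(subj) and head.subj[0] == subj.head:
        return Sign(head.head, head.subj[1:], ())
    return None


vp = head_complement(LEXICON["like"], LEXICON["it"])   # "like it"
s = subject_head(LEXICON["I"], vp)                     # "I like it"
print(vp)
print(s)
```

Note how little the two schemata say: almost all of the constraints live in the lexical entry chosen for "like".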

40 Inefficiency of HPSG Parsing
Complex DAG: typed feature structures
– Abstract machine for unification (LiLFeS)
Unification: expensive operation (⇔ CFG approximation: CFG filtering)
Assignment of Lexical Entries
– High reduction of search space / supertagging

41 Filtering with CFG (1/5)
2-phased parsing
– Approximate HPSG with CFG while keeping important constraints.
– The obtained CFG might over-generate, but can be used for filtering.
– Rewriting in CFG is far less expensive than applying rule schemata, principles and so on.
[Diagram: HPSG is compiled into a CFG; input sentences go through the built-in CFG parser, then through LiLFeS unification over feature structures (parsing + unification); output: complete parse trees]
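
The idea of using a cheap, over-generating CFG as a filter before expensive unification can be sketched as follows. The two-rule grammar, the category names and the function names are assumptions for illustration; the actual CFG is compiled from the HPSG grammar and is much larger.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

# Assumed toy CFG approximation: nonterminals abstract away most feature values.
BINARY_RULES: Dict[Tuple[str, str], Set[str]] = {
    ("V_trans", "N"): {"VP"},   # approximates Head-Complement
    ("N", "VP"): {"S"},         # approximates Subject-Head
}


def cfg_accepts(categories: List[str], goal: str = "S") -> bool:
    """CKY recognition over a sequence of preterminal categories."""
    n = len(categories)
    if n == 0:
        return False
    # chart[i][j] holds the nonterminals that span positions i..j
    chart = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, cat in enumerate(categories):
        chart[i][i + 1].add(cat)
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for left, right in product(chart[start][mid], chart[mid][end]):
                    chart[start][end] |= BINARY_RULES.get((left, right), set())
    return goal in chart[0][n]


def survives_filter(candidates: List[List[str]]) -> List[List[str]]:
    """Keep only category sequences the CFG accepts; only these go on to unification."""
    return [c for c in candidates if cfg_accepts(c)]


candidates = [["N", "V_trans", "N"], ["N", "N", "N"], ["V_trans", "N", "N"]]
print(survives_filter(candidates))
```

Because the CFG over-generates, a sequence that passes may still fail under full unification, but a sequence that is rejected never needs the expensive check.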

43 HPSG and Parsing
[Figure: lexical entries (HEAD, SUBJ, COMPS) for the example words, including "like" with HEAD verb]
Most of the constraints are in LEs (lexical entries)
Assignment of LEs = parsing results are implicitly determined
Correct LE assignment: constructing parse trees is straightforward
However, errors in LE assignments: irrecoverable in later stages

44 Supertagging and Efficient Parsing [Clark and Curran, 2004; Ninomiya et al., 2006]
Supertagging: P(sequence of LEs | sequence of words)
Selection of lexical entry assignments [Bangalore and Joshi, 1999]
[Figure: candidate lexical entries for each word of "I like it", ordered from high probability and cut off at a threshold]
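
Selecting lexical entries by a probability threshold can be sketched as below. The probabilities and entry names are invented for illustration, and the relative threshold (keep entries within a factor `beta` of the best one) is one common variant assumed here; the slide does not say which thresholding rule is used.

```python
from typing import Dict, List

# Hypothetical supertagger output: P(lexical entry | word in context) per word.
tag_probs: List[Dict[str, float]] = [
    {"noun": 0.97, "verb_intrans": 0.03},                       # "I"
    {"verb_trans": 0.80, "prep": 0.15, "verb_intrans": 0.05},   # "like"
    {"noun": 0.99, "det": 0.01},                                # "it"
]


def select_entries(probs: Dict[str, float], beta: float) -> List[str]:
    """Keep entries whose probability is at least beta times the best one."""
    best = max(probs.values())
    kept = [le for le, p in probs.items() if p >= beta * best]
    return sorted(kept, key=lambda le: -probs[le])


tight = [select_entries(p, beta=0.5) for p in tag_probs]
loose = [select_entries(p, beta=0.1) for p in tag_probs]
print(tight)
print(loose)
```

A tight threshold leaves one entry per word, so the parser has almost nothing to search; loosening it adds alternatives such as the prepositional reading of "like".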

45 Chart parsing
[Figure: chart over "I like it" built from the lexical entries selected for the words (HEAD noun, HEAD verb, HEAD noun), with a HEAD verb phrase above them]

46 Efficient Parser
Smaller number of LE assignments
LE assignments that lead to complete parse trees
Previous methods:
1. Chart parsing by using initial LE assignment
2. Extend LE assignment when parsing fails
Claim 1: Assignments filtered by CFG: assignments with parse trees
Claim 2: Deterministic parsing with classifiers is good enough, if parse trees exist

47 System Overview
[Diagram: input sentence "I like it" → Supertagger (candidate LE assignments for the whole sentence, from high probability P) → CFG Filtering → Deterministic Shift/Reduce Parser → parse of "I like it"]
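
The last stage of this pipeline, deterministic shift/reduce parsing, can be sketched as follows. The reduction table and the hand-written `rule_policy` are stand-ins assumed for illustration: in the system described, reductions are HPSG schema applications and the action is chosen by a trained classifier.

```python
from typing import Callable, List, Optional, Tuple

Tree = Tuple  # (label, word) for leaves, (label, left, right) for phrases

# Assumed toy reductions standing in for HPSG schema applications.
REDUCE = {("V_trans", "N"): "VP", ("N", "VP"): "S"}


def rule_policy(stack: List[Tree], queue: List[Tree]) -> str:
    """Stand-in for a trained classifier: reduce whenever the top two items combine."""
    if len(stack) >= 2 and (stack[-2][0], stack[-1][0]) in REDUCE:
        return "reduce"
    return "shift"


def parse(tagged: List[Tuple[str, str]],
          policy: Callable[[List[Tree], List[Tree]], str] = rule_policy) -> Optional[Tree]:
    """Deterministic shift/reduce: one action per step, no backtracking, no chart."""
    stack: List[Tree] = []
    queue: List[Tree] = [(cat, word) for word, cat in tagged]
    while queue or len(stack) > 1:
        action = policy(stack, queue)
        if action == "reduce" and len(stack) >= 2:
            right, left = stack.pop(), stack.pop()
            label = REDUCE.get((left[0], right[0]))
            if label is None:
                return None          # the policy chose an impossible reduction
            stack.append((label, left, right))
        elif queue:
            stack.append(queue.pop(0))
        else:
            return None              # nothing left to shift and no reduction applies
    return stack[0] if stack else None


tree = parse([("I", "N"), ("like", "V_trans"), ("it", "N")])
print(tree)
```

The parser does a number of steps linear in the sentence length, which is where the speed comes from; the price is that a wrong action cannot be undone, hence the reliance on CFG filtering to supply assignments that are known to have a parse.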

Experiment Results
                                                LP(%)   LR(%)   F1(%)   Avg. time
Staged/Deterministic model                      86.93   86.47   86.70   30ms/snt
Previous method 1 (Supertagger + ChartParser)   87.35   86.29   86.81   183ms/snt
Previous method 2 (Unigram + ChartParser)       84.96   84.25   84.60   674ms/snt
6 times faster
20 times faster than the initial model
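
The two speed-up figures follow directly from the average times in the table. A quick check, using only the numbers reported above (the dictionary keys are shorthand labels chosen here):

```python
# Average parsing time per sentence in milliseconds, from the results table.
avg_ms = {
    "staged/deterministic": 30,
    "supertagger + chart parser": 183,
    "unigram + chart parser": 674,
}

base = avg_ms["staged/deterministic"]
speedup = {name: ms / base for name, ms in avg_ms.items() if ms != base}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.1f}x the time of the staged/deterministic model")
```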

49 Domain/Text Type Adaptation

50
                                         F-score   Training Time (Sec)
Baseline (PTB-trained, PTB-applied)      89.81     0
Baseline (PTB-trained, GENIA-applied)    86.39     0
Retraining (GENIA)                       88.45     14,695
Retraining (PTB+GENIA)                   89.94     238,576
Structure with RefDist                   88.18     21,833
Lexical with RefDist                     89.04     12,957
Lex/Structure with RefDist               90.15     31,637

51 Adaptation with Reference Distribution
[Equation with labelled parts: original model, feature function, feature weight; feature sets for lexical assignment and syntactic preference]
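
The equation on this slide did not survive in the transcript. As a reminder of what the labels refer to, the general form of a log-linear model with a reference distribution is given below. The symbols are chosen here for illustration and are an assumption, not a reconstruction of the slide: p_0 is the original model (for example, the PTB-trained parser), f_j are feature functions, and λ_j are feature weights estimated on the new domain.

```latex
p(T \mid s) \;=\; \frac{1}{Z_s}\; p_0(T \mid s)\; \exp\!\Big(\sum_j \lambda_j f_j(T, s)\Big),
\qquad
Z_s \;=\; \sum_{T'} p_0(T' \mid s)\; \exp\!\Big(\sum_j \lambda_j f_j(T', s)\Big)
```

With all λ_j = 0 the model reduces to the original one, so training on the new domain only has to learn corrections to it.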

52 [Plot: F-score (83 to 90) against number of sentences of the GENIA training set (0 to 8,000). Curves: Baseline (PTB), Simple Retraining (GENIA), Retraining (GENIA+PTB), Structure with RefDist, Lexical with RefDist, Lexical/Structure with RefDist]

53 [Plot: F-score (83 to 90) against training time in seconds (0 to 30,000). Curves: Retraining (GENIA), Structure with RefDist, Lexical with RefDist, Lex/Structure with RefDist]

55 Tool1: POS Tagger
General-purpose POS taggers, trained on WSJ
– Brill’s tagger, TnT tagger, MXPOST, etc.
– 97%
General-purpose POS taggers do not work well for MEDLINE abstracts
The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS …

56 Errors seen in TnT tagger (Brants 2000)
A/DT chromosomal/JJ translocation/NN in/IN …
… and/CC membrane/NN potential/NN after/IN mitogen/NN binding/JJ .
… two/CD factors/NNS , which/WDT bind/NN to/TO the/DT same/JJ kappa/NN B/NN enhancers/NNS …
… by/IN analysing/VBG the/DT Ag/VBG amino/JJ acid/NN sequence/NN .
… to/TO contain/VB more/RBR T-cell/JJ determinants/NNS than/IN …
Stimulation/NN of/IN interferon/JJ beta/JJ gene/NN transcription/NN in/IN vitro/NN by/IN

57 Performance of GENIA Tagger
GENIA tagger
Training corpus   WSJ    GENIA
WSJ               97.0   84.3
GENIA             75.2   98.1
WSJ+GENIA         96.9   98.1
No degradation of the tagger trained by the mixed corpus
TnT tagger (Ref.)
Training corpus   WSJ    GENIA
WSJ               96.7   84.3
GENIA             80.1   97.9
WSJ+GENIA         96.5   97.5
Some degradations (0.2 ~ 0.4) were observed, compared with the taggers trained by “pure” corpora

58 CRF-based POS + Active Learning (GENIA)
3,000 sentences: 98.4
20,000 sentences: 98.58

59 CRF-based POS + Active Learning (PTB)
10,000 sentences: 96.76
Best performance: 97.18

60 Applications

61 Our Policy for Information Extraction
Separate a domain/task-independent part from a domain/task-specific part.
IE System: Task-independent part + Task-specific part

62 Our Policy
Separate a domain/task-independent part from a domain/task-specific part.
IE System
– Task-independent: a full parser, which normalizes sentences into PASs
– Task-specific: extraction rules on PASs
PAS = Predicate-Argument Structure

63 Our Policy
Distinguish a domain-independent part from a domain-specific part.
IE System
– Task-independent: a full parser, which normalizes sentences into PASs
– Task-specific: extraction rules on PASs, learned automatically from a corpus
PAS = Predicate-Argument Structure
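
The split between a task-independent parser and task-specific rules can be sketched as follows. The `PAS` class, the hand-written parser output and the rule are assumptions for illustration; they do not reproduce Enju's output format, and in the approach described the rules are learned from a corpus rather than written by hand.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class PAS:
    predicate: str               # base form of the predicate, e.g. "activate"
    arg1: Optional[str] = None   # logical subject (assumed labelling)
    arg2: Optional[str] = None   # logical object (assumed labelling)


def extract_activations(pas_list: List[PAS], proteins: Set[str]) -> List[Tuple[str, str]]:
    """Task-specific rule: activate(arg1 = a known protein, arg2 = anything)."""
    return [(p.arg1, p.arg2) for p in pas_list
            if p.predicate == "activate" and p.arg1 in proteins and p.arg2]


# Hypothetical parser output. Active, passive and infinitival surface forms are
# all assumed to have been normalized to the same predicate-argument shape.
parsed = [
    PAS("activate", arg1="p53", arg2="the Bcl-2 protein"),
    PAS("activate", arg1="PHO2 protein", arg2="the transcription of PHO5 gene"),
    PAS("show", arg2="p53"),
    PAS("activate", arg1="it", arg2="the protein"),
]
known_proteins = {"p53", "PHO2 protein"}

events = extract_activations(parsed, known_proteins)
print(events)
```

Because the parser has already removed the surface variation, the rule is a single pattern; moving to a different relation means changing the rule, not the parser.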

GENIA Event Annotation - example
For an identified event in the given sentence,
– classify the type of the event and record the text span giving the clue to it (ClueType).
– identify the theme of the event and record the text span linking the theme to the event (LinkTheme).
– identify the cause of the event and record the text span linking the cause to the event (LinkCause).
– record the environment (location, time) of the event (ClueLoc, ClueTime).
[Figure: example sentence annotated with ClueType, LinkTheme, LinkCause and ClueLoc spans]

Gene_expression
Theme patterns observed (2,958)
– Protein: 2,308
– DNA: 591
– RNA: 25
– Peptide: 4
– Protein Protein: 2
– Erroneous: 27
Keywords
– coexpress, nonexpress, overexpress, express, biosynthesis, product, synthesize, constitute, …, coexpression

Transcription
Theme patterns observed (929)
– DNA: 449
– RNA: 272
– Protein: 167
– Peptide: 2
– Erroneous: 22
Keywords
– transcrib, transcript, synthesi, express, …

Localization
Theme patterns observed (730)
– Protein: 608
– Lipid: 31
– Atom: 29
– Other_organic_compound: 14
– DNA: 12
– Virus: 5
– Carbohydrate: 5
– RNA: 4
– Inorganic: 4
– Peptide: 3
Keywords
– translocation, secretion, release, localization, mobilization, uptake, secrete, import, transport, translocate, sequester, influx, migrate, localisation, move, delivery, export, …
ClueLoc
– NONE: 241
– nuclear: 140
– to the nucleus: 12
– into the nucleus: 11
– Cytoplasmic: 8
– in the cytoplasm: 7
– macrophages: 5
– nuclear … in t lymphocytes: 4
– monocytes: 4
– in the nucleus: 4
– in the cytosol: 4
– in colostrum: 4
– from the cytoplasm to the nucleus: 4

Localization
Keywords and Locations
– translocation (166): nuclear 108, NONE 38, …
– secretion (100): NONE 57, name_of_cells 43
– release (80): NONE 51, name_of_cells 19, …
– localization (30): nuclear 25, intracellular 3
– uptake (24): NONE 14, name_of_cells 20
Keywords and Themes
– translocation (166): Protein 161, Virus 4, RNA 1
– secretion (100): Protein 98, Lipid 1, Peptide 1
– release (80): Protein 67, Other_organic_compound 6, Lipid 3
– localization (30): Protein 30
– uptake (24): Lipid 15, Carbohydrate 5, Protein 4

69 Future Plan Kitano’s group, Kell’s group

72 Future Directions
Domain Adaptation + Inter-operability
– High performance can be obtained by using domain-specific characteristics and domain semantics
– Differences among abstracts, full papers, comments in DBs
– Standardized interfaces (API) of NLP tools
Text Archives
– Abstracts + Full Papers + Comments/Summary Descriptions in DBs
Combining NLP tools with Mining tools
– Knowledge Discovery (Disease Gene Association)
– Hypotheses Generation
– Automatic Data Interpretation
