TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

1 TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester, UK Department of Computer Science School of Information Science and Technology University of Tokyo, JAPAN

2 2 Increase in Medline
[Chart: yearly increments (axis 0 to 600,000) and accumulation (axis 0 to 14,000,000) of MEDLINE articles by year, 1964 to 2002]
Articles added, MEDLINE alone
–More than 0.5 million per year
–More than 1.3 thousand per day
G-protein coupled receptor
–Before 1988: 9 papers
–1992: 256 papers
–2005: 14,000 papers
Medline access [D.L.Banville 2006]
–1997: 0.163 M accesses/month
–2006: 82.027 M accesses/month (500 times more)

3 3 NaCTeM (www.nactem.ac.uk)
–First such centre in the world
–Funding: JISC, BBSRC, EPSRC
–Consortium investment
–Chair in TM (Prof. J. Tsujii, Univ. Tokyo)
–Location: Manchester Interdisciplinary Biocentre (MIB, www.mib.ac.uk), funded by the Wellcome Trust
–Initial focus: biomedical academic community
–Extend services to industry
–Extend focus to other domains (social sciences)

4 4 Consortium
–Universities of Manchester, Liverpool
–Service activity run by MIMAS (National Centre for Dataset Services), within MC (Manchester Computing)
–Self-funded partners: San Diego Supercomputing Center; University of California, Berkeley; University of Geneva; University of Tokyo
–Strong industrial & academic support: IBM, AZ, EBI, Wellcome Trust, Sanger Institute, Unilever, NowGEN, MerseyBio, …

11 11 NLP and TM
Text Mining
–Text as a bag of words
–Words as surface strings
Natural Language Processing
–Language as a complex system linking surface strings of characters with their meanings
–Text and words as structured objects
NLP-based TM
–Linking text with knowledge

12 12 Non-Trivial Mappings
Language Domain: linguistic expressions
Knowledge Domain: concepts and relationships among them, motivated independently of language
[Diagram labels between the two domains: Terminology, Parsing, Paraphrasing]
From surface diversities and ambiguities to conceptual invariants

13 13 Example

14 14 Non-trivial Mapping
Language Domain / Knowledge Domain (motivated independently of language)
Same relations with different structures: [A] protein activates [B] (Pathway extraction)
–Full-strength Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to …..
–Since ……., we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene.
–Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription.
Retrieval using Region Algebra: [sentence] > ([arg1_activate] > [protein])
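
The query on this slide can be read as "sentences that contain an arg1 region of activate that in turn contains a protein region". The following is a minimal sketch of that containment operator, assuming regions are character-offset spans; the `Region` type, the `containing` helper and all offsets are hypothetical and are not taken from the actual retrieval system.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def containing(outer: List[Region], inner: List[Region]) -> List[Region]:
    """Region-algebra '>': keep outer regions that contain at least one inner region."""
    return [o for o in outer
            if any(o.start <= i.start and i.end <= o.end for i in inner)]


# Hypothetical annotation layers over one document.
sentences = [Region(0, 50), Region(51, 120)]
arg1_activate = [Region(5, 20), Region(60, 75)]
proteins = [Region(8, 12), Region(100, 110)]

# [sentence] > ([arg1_activate] > [protein])
hits = containing(sentences, containing(arg1_activate, proteins))
print(hits)
```

Only the first sentence is returned: the second sentence has a protein mention, but not inside the arg1 of activate, which is the distinction a bag-of-words search cannot make.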

15 15 Predicate-argument structure
Parser based on Probabilistic HPSG (Enju)
Example: p53 has been shown to directly activate the Bcl-2 protein
[Parse tree diagram: nodes S, NP, VP, ADVP; predicate-argument links arg1, arg2, arg3]

16 16 Predicate-argument structure: output of the probabilistic HPSG parser (Enju)
Example: The/DT protein/NN is/VBZ activated/VBN by/IN it/PRP
[Parse tree diagram: nodes dt, np, vp, pp, s; links arg1, arg2, mod]
Semantic Retrieval System Using Deep Syntax: MEDIE
Constructions shown: Passive; Passive and Infinitival Clause

26 26 Demos MEDIE Info-PubMed

31 31 Performance of Semantic Parser
                          Penn Treebank   GENIA
Coverage                  99.7%           99.2%
F-value (PA relations)    87.4%           86.4%
Sentence precision        39.2%           31.8%
Processing time           0.68 sec        1.00 sec

32 32 Scalability of TM Tools
Target Corpus: MEDLINE corpus
The number of papers         14,792,890
The number of abstracts      7,434,879
The number of sentences      70,815,480
The number of words          1,418,949,650
Compressed data size         3.2GB
Uncompressed data size       10GB
Suppose, for example, that it takes one second to parse one sentence: 70 million seconds, that is, about 2 years

33 33 TM and GRID
Solution
–The entire MEDLINE was parsed by distributed PC clusters consisting of 340 CPUs
–Parallel processing was managed by grid platform GXP [Taura2004]
Experiments
–The entire MEDLINE was parsed in 8 days
Output
–Syntactic parse trees and predicate argument structures in XML format
–The data sizes of compressed/uncompressed output were 42.5GB/260GB.
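
The back-of-the-envelope estimate from the scalability slide can be checked with a few lines of arithmetic. This sketch uses only the figures quoted on the slides (70,815,480 sentences, an assumed one second per sentence, 340 CPUs); the perfect-scaling figure is an idealisation added here for comparison, not a number from the talk.

```python
SENTENCES = 70_815_480        # number of MEDLINE sentences, from the slide
SECONDS_PER_SENTENCE = 1.0    # the slide's "suppose" figure
CPUS = 340                    # size of the PC clusters, from the slide

SECONDS_PER_DAY = 24 * 60 * 60

sequential_days = SENTENCES * SECONDS_PER_SENTENCE / SECONDS_PER_DAY
sequential_years = sequential_days / 365
ideal_parallel_days = sequential_days / CPUS  # assumes perfect scaling

print(f"{sequential_years:.2f} years on one CPU")
print(f"{ideal_parallel_days:.2f} days on {CPUS} CPUs with perfect scaling")
```

Under these assumptions a single CPU needs a little over two years, and 340 CPUs would need roughly two and a half days if nothing were lost to overhead. The reported run took 8 days; the slides do not break down where the difference comes from.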

34 34 Efficient Parsing for HPSG

35 35 Background: HPSG
Head-Driven Phrase Structure Grammar (HPSG) [Pollard and Sag, 1994]
–Lexicalized and constraint-based grammar
–A few rule schemata → general constraints on linguistic constructions
–Constraints embedded in the lexicon → word-specific constraints
–Constraints between phrase structures and semantic structures

36 36 Parsing by HPSG
Input: I like it

37 37 Parsing by HPSG: Assignment of Lexical Entries
I, it: [HEAD noun, SUBJ, COMPS]
like: [HEAD verb, SUBJ, COMPS]
[Feature-structure diagram]

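38 38 Parsing by HPSG: Application of Rule Schema (Head-Complement)
[Feature-structure diagram: the COMPS value of "like", tagged 2, is identified with the entry for "it" (HEAD noun); the resulting phrase "like it" has HEAD verb, an empty COMPS list <>, and a SUBJ value tagged 1]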

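39 39 Parsing by HPSG: Application of Rule Schema (Subject-Head)
[Feature-structure diagram: the SUBJ value of "like it", tagged 1, is identified with the entry for "I" (HEAD noun); the resulting sentence has HEAD verb with empty SUBJ and COMPS lists <>]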
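
The three slides above walk through lexical entry assignment followed by the Head-Complement and Subject-Head schemata for "I like it". The following is a heavily simplified sketch of those two steps. It assumes signs carry only HEAD, SUBJ and COMPS, and it compares atomic HEAD values instead of unifying typed feature structures; the `Sign` class, the function names and the lexicon values are illustrative assumptions, not the grammar used in the talk.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sign:
    head: str
    subj: List[str] = field(default_factory=list)   # HEAD values still required as subject
    comps: List[str] = field(default_factory=list)  # HEAD values still required as complements


def saturated(sign: Sign) -> bool:
    return not sign.subj and not sign.comps


def head_complement(head: Sign, comp: Sign) -> Optional[Sign]:
    """Head-Complement schema: the head consumes its first COMPS requirement."""
    if head.comps and saturated(comp) and head.comps[0] == comp.head:
        return Sign(head.head, list(head.subj), head.comps[1:])
    return None


def subject_head(subj: Sign, head: Sign) -> Optional[Sign]:
    """Subject-Head schema: a head with no complements left consumes its subject."""
    if not head.comps and head.subj and saturated(subj) and head.subj[0] == subj.head:
        return Sign(head.head, head.subj[1:], [])
    return None


# Assumed lexical entries for the example sentence.
LEXICON = {
    "I": Sign("noun"),
    "like": Sign("verb", subj=["noun"], comps=["noun"]),
    "it": Sign("noun"),
}

vp = head_complement(LEXICON["like"], LEXICON["it"])  # "like it"
s = subject_head(LEXICON["I"], vp)                    # "I like it"
print(vp)
print(s)
```

The two schemata are generic; everything specific to "like" sits in its lexical entry, which is the sense in which the grammar is lexicalized.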

40 40 Inefficiency of HPSG Parsing
Complex DAG: typed feature structures
–Abstract machine for unification (LiLFeS)
Unification: expensive operation (⇔ CFG approximation: CFG filtering)
Assignment of lexical entries
–High reduction of search space / supertagging

41 41 Filtering with CFG (1/5)
2-phased parsing
–Approximate HPSG with CFG while keeping important constraints.
–The obtained CFG might over-generate, but can be used in filtering.
–Rewriting in CFG is far less expensive than applying rule schemata, principles and so on.
[Pipeline diagram: HPSG is compiled into a CFG; input sentences pass through the built-in CFG parser, then LiLFeS unification parsing over feature structures; output: complete parse trees]
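
The filtering phase can be illustrated with a small CKY recognizer that discards category sequences the CFG cannot derive, so that only the survivors reach the expensive unification phase. This is a sketch under strong assumptions: the three-rule grammar and the category names are hand-written placeholders, whereas in the approach described above the CFG is compiled from the HPSG itself.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

# Toy binary CFG standing in for the grammar compiled from HPSG (hypothetical).
RULES: Dict[Tuple[str, str], Set[str]] = {
    ("V_trans", "N"): {"VP"},    # roughly: head-complement
    ("N", "VP"): {"S"},          # roughly: subject-head
    ("N", "V_intrans"): {"S"},   # intransitive verb takes only a subject
}


def cfg_accepts(categories: List[str], goal: str = "S") -> bool:
    """CKY recognition: True if the category sequence derives `goal`."""
    n = len(categories)
    if n == 0:
        return False
    # chart[i][j] holds the symbols spanning positions i..j (j exclusive).
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, cat in enumerate(categories):
        chart[i][i + 1].add(cat)
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for left, right in product(chart[start][mid], chart[mid][end]):
                    chart[start][end] |= RULES.get((left, right), set())
    return goal in chart[0][n]


# Candidate category sequences for "I like it".
candidates = [
    ["N", "V_trans", "N"],
    ["N", "V_intrans", "N"],
    ["N", "N", "N"],
]
survivors = [c for c in candidates if cfg_accepts(c)]
print(survivors)
```

Because the approximation may over-generate, a surviving sequence is not guaranteed to unify in the second phase; the filter only removes sequences that certainly fail.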

43 43 HPSG and Parsing
Most of the constraints are in lexical entries (LEs)
Assignment of LEs = parsing results are implicitly determined
Correct LE assignment → constructing parse trees is straightforward
However: errors in LE assignments → irrecoverable in later stages
[Feature-structure diagram for "like": lexical entries with HEAD noun and HEAD verb; resulting sign HEAD verb, SUBJ <>, COMPS <>]

44 44 Supertagging and Efficient Parsing [Clark and Curran, 2004; Ninomiya et al., 2006]
Supertagging: P(Seq-Of-LEs | Seq-Of-Words)
Selection of lexical entry assignments [Bangalore and Joshi, 1999]
High probability / threshold
[Diagram: candidate lexical entries (HEAD noun or HEAD verb, with SUBJ and COMPS) for each word of "I like it"]
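
Threshold-based selection of lexical entries might look like the sketch below. It simplifies in two ways that should be read as assumptions: it uses an independent per-word distribution rather than a probability over whole sequences of lexical entries, and the entry names and probabilities are invented for illustration.

```python
from typing import Dict, List

SupertagDist = Dict[str, float]  # lexical entry name -> probability


def select_entries(dist: SupertagDist, threshold: float) -> List[str]:
    """Keep lexical entries with probability >= threshold, most probable first."""
    kept = [le for le, p in dist.items() if p >= threshold]
    return sorted(kept, key=lambda le: -dist[le])


sentence = ["I", "like", "it"]
# Hypothetical supertagger output, one distribution per word.
tagger_output = [
    {"noun": 0.98, "verb_trans": 0.02},
    {"verb_trans": 0.70, "prep": 0.25, "verb_intrans": 0.05},
    {"noun": 0.99, "verb_trans": 0.01},
]

strict = [select_entries(d, 0.5) for d in tagger_output]
relaxed = [select_entries(d, 0.1) for d in tagger_output]
print(strict)
print(relaxed)
```

A high threshold keeps the search space small but risks dropping the correct entry; lowering it admits more candidates per word at the cost of more parsing work.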

45 45 Chart parsing
[Diagram: chart over "I like it" built from the assigned lexical entries (HEAD noun or HEAD verb, with SUBJ and COMPS)]

46 46 Efficient Parser
Smaller number of LE assignments: LE assignments that lead to complete parse trees
Previous methods:
1. Chart parsing by using initial LE assignment
2. Extend LE assignment when parsing fails
Claim 1: Assignments filtered by CFG are assignments with parse trees
Claim 2: Deterministic parsing with classifiers is good enough, if parse trees exist

47 47 System Overview
[Pipeline diagram: input sentence "I like it" → Supertagger (candidate LE assignments, ordered from high probability P) → CFG Filtering → Deterministic Shift/Reduce Parser]

48 Experiment Results
                                                 LP(%)   LR(%)   F1(%)   Avg. time
Staged/Deterministic model                       86.93   86.47   86.70   30ms/snt
Previous method 1 (Supertagger + ChartParser)    87.35   86.29   86.81   183ms/snt
Previous method 2 (Unigram + ChartParser)        84.96   84.25   84.60   674ms/snt
6 times faster; 20 times faster than the initial model
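
The F1 column and the speed-up claims can be recomputed from the other columns. The sketch below assumes F1 is the usual harmonic mean of labelled precision (LP) and labelled recall (LR); the row names are shortened versions of the ones in the table.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (LP %, LR %, average ms per sentence), copied from the slide.
rows = {
    "Staged/Deterministic": (86.93, 86.47, 30),
    "Supertagger + ChartParser": (87.35, 86.29, 183),
    "Unigram + ChartParser": (84.96, 84.25, 674),
}

base_ms = rows["Staged/Deterministic"][2]
for name, (lp, lr, ms) in rows.items():
    print(f"{name}: F1 = {f1(lp, lr):.2f}, time ratio = {ms / base_ms:.1f}x")
```

The recomputed F1 values agree with the slide to within rounding of the published LP and LR figures, and the time ratios come out at about 6.1 and 22.5, in line with the "6 times" and "20 times" claims.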

49 49 Domain/Text Type Adaptation

50 50
                                          F-score   Training Time (Sec)
Baseline (PTB-trained, PTB-applied)       89.81     0
Baseline (PTB-trained, GENIA-applied)     86.39     0
Retraining (GENIA)                        88.45     14,695
Retraining (PTB+GENIA)                    89.94     238,576
Structure with RefDist                    88.18     21,833
Lexical with RefDist                      89.04     12,957
Lex/Structure with RefDist                90.15     31,637

51 51 Adaptation with Reference Distribution
[Formula with labelled parts: Lexical Assignment, Syntactic Preference, Original model, Feature function, Feature weight]
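
For orientation, a log-linear model with a reference distribution is commonly written in the form below. This is a generic sketch, not the formula from the slide: the symbols are assumptions, with p_0 standing for the original model, f_j for the feature functions and \lambda_j for the feature weights, and the slide's split into lexical assignment and syntactic preference parts may be factored differently.

```latex
p(t \mid s) = \frac{1}{Z_s}\, p_0(t \mid s)\, \exp\Big(\sum_j \lambda_j f_j(t, s)\Big),
\qquad
Z_s = \sum_{t'} p_0(t' \mid s)\, \exp\Big(\sum_j \lambda_j f_j(t', s)\Big)
```

With all weights at zero the adapted model reduces to the original model p_0, so training on the new domain only has to learn corrections to it.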

52 52 [Chart: F-score (83 to 90) against number of sentences of the GENIA training set (0 to 8,000). Series: Baseline (PTB), Simple Retraining (GENIA), Retraining (GENIA+PTB), Structure with RefDist, Lexical with RefDist, Lexical/Structure with RefDist]

53 53 [Chart: F-score (83 to 90) against training time in seconds (0 to 30,000). Series: Retraining (GENIA), Structure with RefDist, Lexical with RefDist, Lex/Str with RefDist]

55 55 Tool1: POS Tagger
General-purpose POS taggers, trained on WSJ
–Brill’s tagger, TnT tagger, MXPOST, etc.
–97%
General-purpose POS taggers do not work well for MEDLINE abstracts
Example: The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS …

56 56 Errors seen in TnT tagger (Brants 2000)
–A/DT chromosomal/JJ translocation/NN in/IN …
–… and/CC membrane/NN potential/NN after/IN mitogen/NN binding/JJ .
–… two/CD factors/NNS , which/WDT bind/NN to/TO the/DT same/JJ kappa/NN B/NN enhancers/NNS …
–… by/IN analysing/VBG the/DT Ag/VBG amino/JJ acid/NN sequence/NN .
–… to/TO contain/VB more/RBR T-cell/JJ determinants/NNS than/IN …
–Stimulation/NN of/IN interferon/JJ beta/JJ gene/NN transcription/NN in/IN vitro/NN by/IN

57 57 Performance of GENIA Tagger
Taggers compared: GENIA tagger (Ref.), TnT tagger
Training corpus   WSJ    GENIA
WSJ               97.0   84.3
GENIA             75.2   98.1
WSJ+GENIA         96.9   98.1
No degradation of the tagger trained by the mixed corpus

Training corpus   WSJ    GENIA
WSJ               96.7   84.3
GENIA             80.1   97.9
WSJ+GENIA         96.5   97.5
Some degradations (0.2 ~ 0.4) were observed, compared with the taggers trained by “pure” corpora

58 58 CRF-based POS + Active Learning: GENIA
–3,000 sentences: 98.4
–20,000 sentences: 98.58

59 59 CRF-based POS + Active Learning: PTB
–10,000 sentences: 96.76
–Best performance: 97.18

60 60 Applications

61 61 Our Policy for Information Extraction
Separate a domain/task-independent part from a domain/task-specific part.
[Diagram: IE System split into a task-independent part and a task-specific part]

62 62 Our Policy
Separate a domain/task-independent part from a domain/task-specific part.
IE System
–Task-independent: a full parser that normalizes sentences into PASs
–Task-specific: extraction rules on PASs
PAS = Predicate-Argument Structure

63 63 Our Policy
Distinguish a domain-independent part from a domain-specific part.
IE System
–Task-independent: a full parser that normalizes sentences into PASs
–Task-specific: extraction rules on PASs, learned automatically from a corpus
PAS = Predicate-Argument Structure
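
The division of labour can be sketched as follows: once a parser has normalized sentences into PASs, a task-specific extraction rule is a simple match on the predicate and its arguments. The `PAS` tuple and the hand-written structures are assumptions made for illustration and do not reproduce the parser's actual output format; in particular, the passive example assumes that normalization assigns the agent to arg1 and the surface subject to arg2.

```python
from typing import List, NamedTuple, Tuple


class PAS(NamedTuple):
    predicate: str  # lemma of the predicate
    arg1: str       # deep subject
    arg2: str       # deep object


# Hand-written PASs standing in for parser output (hypothetical).
parsed = [
    PAS("activate", "p53", "the Bcl-2 protein"),           # active, infinitival clause
    PAS("activate", "it", "the protein"),                  # from "The protein is activated by it"
    PAS("bind", "two factors", "the kappa B enhancers"),   # a different predicate
]


def extract_activation(structures: List[PAS]) -> List[Tuple[str, str]]:
    """Task-specific rule: [A] activates [B], regardless of surface form."""
    return [(p.arg1, p.arg2) for p in structures if p.predicate == "activate"]


pairs = extract_activation(parsed)
print(pairs)
```

The rule never mentions passives or infinitives: those variations are absorbed by the task-independent parser, which is what keeps the task-specific part small.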

64 GENIA Event Annotation - example
For an identified event in the given sentence,
–classify the type of the event and record the text span giving the clue to it (ClueType).
–identify the theme of the event and record the text span linking the theme to the event (LinkTheme).
–identify the cause of the event and record the text span linking the cause to the event (LinkCause).
–record the environment (location, time) of the event (ClueLoc, ClueTime).
[Annotated example sentence with labels: LinkCause, LinkTheme, ClueLoc, ClueType]
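
One way to picture the record produced by these annotation steps is a small data structure with one field per recorded span. This is a sketch only: the class, its field names and the example sentence are invented for illustration and do not reproduce the actual GENIA annotation format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EventAnnotation:
    event_type: str                   # e.g. Gene_expression, Transcription, Localization
    clue_type: str                    # text span giving the clue to the event type
    theme: str                        # entity the event acts on
    link_theme: Optional[str] = None  # span linking the theme to the event
    cause: Optional[str] = None       # entity causing the event, if any
    link_cause: Optional[str] = None  # span linking the cause to the event
    clue_loc: Optional[str] = None    # span giving the location
    clue_time: Optional[str] = None   # span giving the time


# Invented example sentence, not taken from the GENIA corpus.
sentence = "We observed nuclear translocation of NF-kappa B."

event = EventAnnotation(
    event_type="Localization",
    clue_type="translocation",
    theme="NF-kappa B",
    link_theme="of",
    clue_loc="nuclear",
)
print(event)
```

Fields that do not apply to a given event, such as the cause and time here, are simply left empty.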

65 Gene_expression
Theme patterns observed (2,958)
–Protein 2,308
–DNA 591
–RNA 25
–Peptide 4
–Protein Protein 2
–Erroneous 27
Keywords
–coexpress, nonexpress, overexpress, express, biosynthesis, product, synthesize, constitute, …
–coexpression

66 Transcription
Theme patterns observed (929)
–DNA 449
–RNA 272
–Protein 167
–Peptide 2
–Erroneous 22
Keywords
–Transcrib, transcript, synthesi, express, …

67 Localization
Theme patterns observed (730)
–Protein 608
–Lipid 31
–Atom 29
–Other_organic_compound 14
–DNA 12
–Virus 5
–Carbohydrate 5
–RNA 4
–Inorganic 4
–Peptide 3
Keywords
–Translocation, secretion, release, localization, mobilization, uptake, secrete, import, transport, translocate, sequester, influx, migrate, localisation, move, delivery, export, …
ClueLoc
–NONE 241
–nuclear 140
–to the nucleus 12
–into the nucleus 11
–Cytoplasmic 8
–in the cytoplasm 7
–macrophages 5
–nuclear … in t lymphocytes 4
–monocytes 4
–in the nucleus 4
–in the cytosol 4
–in colostrum 4
–from the cytoplasm to the nucleus 4

68 Localization
Keywords and Locations
–translocation (166): nuclear 108, NONE 38, …
–secretion (100): NONE 57, name_of_cells 43
–release (80): NONE 51, name_of_cells 19, …
–localization (30): nuclear 25, intracellular 3
–uptake (24): NONE 14, name_of_cells 20
Keywords and Themes
–translocation (166): Protein 161, Virus 4, RNA 1
–secretion (100): Protein 98, Lipid 1, Peptide 1
–release (80): Protein 67, Other_organic_compound 6, Lipid 3
–localization (30): Protein 30
–uptake (24): Lipid 15, Carbohydrate 5, Protein 4

69 69 Future Plan Kitano’s group, Kell’s group

72 72 Future Directions
Domain Adaptation + Inter-operability
–High performance can be obtained by using domain specific characteristics and domain semantics
–Differences among abstracts, full papers, comments in DBs
–Standardized Interfaces (API) of NLP tools
Text Archives
–Abstracts + Full Papers + Comments/Summary Descriptions in DBs
Combining NLP tools with Mining tools
–Knowledge Discovery (Disease Gene Association)
–Hypotheses Generation
–Automatic Data Interpretation
