1
TM and NLP for Biology: Research Issues in HPSG Parsing. Junichi Tsujii, School of Computer Science and National Centre for Text Mining, University of Manchester, UK; Department of Computer Science, School of Information Science and Technology, University of Tokyo, Japan.
2
2 Increase in MEDLINE. (Chart: yearly increments, up to 600,000 articles, and cumulative totals, up to 14,000,000, by year, 1964-2002.) Articles added: more than 0.5 million per year, more than 1.3 thousand per day. Example of accumulation: "G-protein coupled receptor" appeared in 9 papers before 1988, 256 papers by 1992, and 14,000 papers by 2005, in MEDLINE alone. MEDLINE access: 0.163 M accesses/month in 1997 vs. 82.027 M accesses/month in 2006, about 500 times more [D.L. Banville 2006].
3
3 NaCTeM (www.nactem.ac.uk): the first such centre in the world. Funding: JISC, BBSRC, EPSRC, plus consortium investment. Chair in TM: Prof. J. Tsujii (Univ. of Tokyo). Location: Manchester Interdisciplinary Biocentre (MIB, www.mib.ac.uk), funded by the Wellcome Trust. Initial focus: the biomedical academic community; services will be extended to industry, and the focus to other domains (social sciences).
4
4 Consortium: Universities of Manchester and Liverpool; the service activity is run by MIMAS (National Centre for Dataset Services) within MC (Manchester Computing). Self-funded partners: San Diego Supercomputing Center; University of California, Berkeley; University of Geneva; University of Tokyo. Strong industrial and academic support: IBM, AZ, EBI, Wellcome Trust, Sanger Institute, Unilever, NowGEN, MerseyBio, and others.
11
11 NLP and TM. Text Mining: text as a bag of words; words as surface strings. Natural Language Processing: language as a complex system linking surface strings of characters with their meanings; text and words as structured objects. NLP-based TM: linking text with knowledge.
12
12 Non-Trivial Mappings between the language domain and the knowledge domain. The knowledge domain contains concepts and the relationships among them, motivated independently of language; the language domain contains linguistic expressions. Terminology, parsing, and paraphrasing carry us from surface diversities and ambiguities to conceptual invariants.
13
13 Example
14
14 Non-trivial Mapping: the knowledge domain is motivated independently of language, so the same relation appears with different structures. All of the following express "[A] protein activates [B]" (pathway extraction): "Full-strength Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to ..."; "Since ..., we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene."; "Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription." Retrieval using regional algebra: [sentence] > ([arg1_activate] > [protein]).
15
15 Predicate-argument structure: a parser based on probabilistic HPSG (Enju). (Parse tree for "p53 has been shown to directly activate the Bcl-2 protein", with phrase labels S, NP, VP, ADVP and predicate-argument links arg1, arg2, arg3.)
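A predicate-argument structure like the one Enju outputs for this sentence can be represented as a small data object. This is only an illustrative sketch, not Enju's actual output format or API; all names here are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Predicate:
    lemma: str                                  # e.g. "activate"
    args: dict = field(default_factory=dict)    # semantic role -> filler phrase

# "p53 has been shown to directly activate the Bcl-2 protein"
pas = Predicate("activate", {"arg1": "p53", "arg2": "the Bcl-2 protein"})

def filler(p, role):
    """Look up the filler of a semantic role, or None if the role is absent."""
    return p.args.get(role)

print(filler(pas, "arg1"))   # p53
```

The point of the normalization is that surface variation (passives, relative clauses, control verbs) disappears: retrieval only needs to inspect the role fillers.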
16
16 Predicate/argument structure: output of the probabilistic HPSG parser (Enju) for "The protein is activated by it" (POS tags DT NN VBZ VBN IN PRP; phrase labels dt, np, vp, pp, s; links arg1, arg2, mod). MEDIE: a semantic retrieval system using deep syntax, handling the passive and the passive plus infinitival clause.
26
26 Demos: MEDIE, Info-PubMed.
27
27 Predicate-argument structure: a parser based on probabilistic HPSG (Enju). (The parse tree for "p53 has been shown to directly activate the Bcl-2 protein", with links arg1, arg2, arg3, as on slide 15.)
31
31 Performance of the Semantic Parser

                         Penn Treebank   GENIA
Coverage                 99.7%           99.2%
F-value (PA relations)   87.4%           86.4%
Sentence precision       39.2%           31.8%
Processing time          0.68 sec        1.00 sec
32
32 Scalability of TM Tools. Target corpus: MEDLINE.

Number of papers          14,792,890
Number of abstracts        7,434,879
Number of sentences       70,815,480
Number of words        1,418,949,650
Compressed data size      3.2 GB
Uncompressed data size    10 GB

Suppose, for example, that parsing one sentence takes one second: 70 million seconds, that is, about 2 years.
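The slide's back-of-envelope estimate can be checked directly from the sentence count given above:

```python
# Parsing all of MEDLINE at one second per sentence, as the slide supposes.
sentences = 70_815_480            # sentence count from the slide
seconds = sentences * 1.0         # 1 sec/sentence
years = seconds / (365 * 24 * 3600)
print(round(years, 2))            # about 2.25 years
```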
33
33 TM and GRID. Solution: the entire MEDLINE was parsed by distributed PC clusters consisting of 340 CPUs; the parallel processing was managed by the grid platform GXP [Taura 2004]. Experiments: the entire MEDLINE was parsed in 8 days. Output: syntactic parse trees and predicate-argument structures in XML format; the compressed/uncompressed output sizes were 42.5 GB / 260 GB.
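The cluster figures are self-consistent: 340 CPUs for 8 days over 70.8 million sentences implies a per-CPU throughput of a few seconds per sentence, which is plausible once I/O and scheduling overhead are added to the roughly one-second parse time. A quick check (my arithmetic, not a figure from the slides):

```python
# Implied wall-clock seconds of CPU time per sentence, on average,
# given the slide's cluster size, duration, and corpus size.
sentences = 70_815_480
cpus = 340
days = 8
sec_per_sentence = cpus * days * 24 * 3600 / sentences
print(round(sec_per_sentence, 2))   # roughly 3.32 sec/sentence per CPU
```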
34
34 Efficient Parsing for HPSG
35
35 Background: Head-Driven Phrase Structure Grammar (HPSG) [Pollard and Sag, 1994], a lexicalized, constraint-based grammar. A few rule schemata express general constraints on linguistic constructions; word-specific constraints are embedded in the lexicon; further constraints link phrase structures with semantic structures.
36
36 Parsing by HPSG: the example sentence "I like it".
37
37 Parsing by HPSG: assignment of lexical entries. "I" and "it" each receive [HEAD noun, SUBJ <>, COMPS <>]; "like" receives [HEAD verb, SUBJ <NP>, COMPS <NP>].
38
38 Application of rule schemata: the Head-Complement schema combines "like" [HEAD verb, SUBJ <1>, COMPS <2>] with "it" [HEAD noun, SUBJ <>, COMPS <>] (tagged 2), yielding a VP [HEAD verb, SUBJ <1>, COMPS <>].
39
39 Application of rule schemata: the Subject-Head schema then combines "I" [HEAD noun, SUBJ <>, COMPS <>] (tagged 1) with the VP [HEAD verb, SUBJ <1>, COMPS <>], yielding a saturated sentence [HEAD verb, SUBJ <>, COMPS <>].
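Both schema applications rest on unification of feature structures. Real HPSG unification operates on typed DAGs with structure sharing (handled by LiLFeS, as the next slide notes); the toy version below, assuming plain nested dictionaries and no reentrancies, only conveys the merge-or-clash behaviour:

```python
def unify(a, b):
    """Return the unification of two feature structures, or None on a clash.

    Dictionaries are merged recursively; atomic values must match exactly.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        out = dict(a)
        for k, v in b.items():
            if k in out:
                u = unify(out[k], v)
                if u is None:
                    return None        # clash somewhere below this feature
                out[k] = u
            else:
                out[k] = v
        return out
    return a if a == b else None

like = {"HEAD": "verb", "SUBJ": {"HEAD": "noun"}}
it = {"HEAD": "noun"}
print(unify(like["SUBJ"], it))                     # {'HEAD': 'noun'}
print(unify({"HEAD": "verb"}, {"HEAD": "noun"}))   # None (clash)
```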
40
40 Inefficiency of HPSG Parsing. Complex DAGs (typed feature structures): handled by an abstract machine for unification (LiLFeS). Unification is an expensive operation (hence CFG approximation: CFG filtering). Assignment of lexical entries: a large reduction of the search space via supertagging.
41
41 Filtering with CFG (1/5): two-phase parsing. Approximate the HPSG with a CFG while keeping the important constraints; the obtained CFG may over-generate, but can be used as a filter. Rewriting in the CFG is far less expensive than applying rule schemata, principles, and so on. Pipeline: compile the HPSG (feature structures) into a CFG; parse input sentences with a built-in CFG parser; complete the parse trees with LiLFeS unification parsing.
42
42 Inefficiency of HPSG Parsing (repeated). Complex DAGs (typed feature structures): handled by an abstract machine for unification (LiLFeS). Unification is an expensive operation (hence CFG approximation: CFG filtering). Assignment of lexical entries: a large reduction of the search space via supertagging.
43
43 HPSG and Parsing: most of the constraints are in the lexical entries, so an assignment of lexical entries implicitly determines the parsing result. Given the correct assignment, constructing parse trees is straightforward; however, errors in the assignment are irrecoverable in later stages.
44
44 Supertagging and Efficient Parsing [Clark and Curran, 2004; Ninomiya et al., 2006]. Supertagging estimates P(sequence of lexical entries | sequence of words) and selects lexical-entry assignments [Bangalore and Joshi, 1999]: for each word of "I like it", only entries with probability above a threshold are kept.
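The pruning step of supertagging reduces to a per-word filter over candidate lexical entries. A minimal sketch, with made-up entry names and probabilities (the actual supertagger's entry inventory and scores are not shown on the slides):

```python
def prune(candidates, threshold=0.1):
    """Keep only lexical entries whose tagger probability clears the cutoff.

    candidates: list of (lexical_entry, probability) pairs for one word.
    """
    return [le for le, p in candidates if p >= threshold]

# Illustrative candidates for the word "like".
like_entries = [("transitive_verb", 0.82),
                ("preposition", 0.15),
                ("noun", 0.02)]
print(prune(like_entries))   # ['transitive_verb', 'preposition']
```

The threshold trades coverage against speed: a lower cutoff keeps more entries and a larger search space for the later parsing stages.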
45
45 Chart parsing over the retained lexical entries for "I like it" (multiple candidate entries per word fill the chart cells).
46
46 Efficient Parser: use a smaller number of lexical-entry assignments, namely those that lead to complete parse trees. Previous methods: (1) chart parsing using the initial assignment; (2) extending the assignment when parsing fails. Claim 1: the assignments that survive CFG filtering are the assignments with parse trees. Claim 2: deterministic parsing with classifiers is good enough, provided parse trees exist.
47
47 System Overview. Input sentence ("I like it") → supertagger assigns candidate lexical entries with high probability → CFG filtering prunes assignments that cannot form a parse tree → deterministic shift/reduce parser builds the final analysis.
48
Experiment Results

                                             LP(%)   LR(%)   F1(%)   Avg. time
Staged/deterministic model                   86.93   86.47   86.70    30 ms/snt
Previous method 1 (supertagger + chart)      87.35   86.29   86.81   183 ms/snt
Previous method 2 (unigram + chart)          84.96   84.25   84.60   674 ms/snt

6 times faster than previous method 1; 20 times faster than the initial model.
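The F1 column is the harmonic mean of labelled precision (LP) and labelled recall (LR), which can be verified against the staged/deterministic row:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(86.93, 86.47), 2))   # 86.70, matching the table
```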
49
49 Domain/Text Type Adaptation
50
50
                                         F-score   Training time (sec)
Baseline (PTB-trained, PTB-applied)        89.81             0
Baseline (PTB-trained, GENIA-applied)      86.39             0
Retraining (GENIA)                         88.45        14,695
Retraining (PTB+GENIA)                     89.94       238,576
Structure with RefDist                     88.18        21,833
Lexical with RefDist                       89.04        12,957
Lex/Structure with RefDist                 90.15        31,637
51
51 Adaptation with Reference Distribution: the adapted model combines the original model, used as a reference distribution, with new feature functions and feature weights capturing lexical assignment and syntactic preference in the target domain.
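The idea can be sketched as a log-linear model with a reference distribution: p(y|x) ∝ p0(y|x) · exp(Σᵢ wᵢ fᵢ(x, y)), where p0 is the original (out-of-domain) model and only the in-domain features and weights need retraining, which is why the RefDist rows above train so much faster than full retraining. The numbers below are illustrative, not the actual model's:

```python
import math

def adapted(p0, feats, weights):
    """Renormalize reference probs p0 reweighted by in-domain features.

    p0: dict candidate -> reference probability
    feats: dict candidate -> feature vector (same length as weights)
    """
    scores = {y: p0[y] * math.exp(sum(w * f for w, f in zip(weights, feats[y])))
              for y in p0}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

# Two candidate parses; one in-domain feature fires only on parse_b.
p0 = {"parse_a": 0.7, "parse_b": 0.3}
feats = {"parse_a": [0.0], "parse_b": [1.0]}
p = adapted(p0, feats, [1.2])
print(round(p["parse_b"], 3))   # the in-domain feature boosts parse_b
```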
52
52 (Plot: F-score, 83-90, against the number of sentences in the GENIA training set, 0-8,000, comparing Baseline (PTB), simple retraining (GENIA), retraining (GENIA+PTB), Structure with RefDist, Lexical with RefDist, and Lexical/Structure with RefDist.)
53
53 (Plot: F-score, 83-90, against training time in seconds, 0-30,000, comparing retraining (GENIA), Structure with RefDist, Lexical with RefDist, and Lex/Structure with RefDist.)
54
54
                                         F-score   Training time (sec)
Baseline (PTB-trained, PTB-applied)        89.81             0
Baseline (PTB-trained, GENIA-applied)      86.39             0
Retraining (GENIA)                         88.45        14,695
Retraining (PTB+GENIA)                     89.94       238,576
Structure with RefDist                     88.18        21,833
Lexical with RefDist                       89.04        12,957
Lex/Structure with RefDist                 90.15        31,637
55
55 Tool 1: POS Tagger. General-purpose POS taggers trained on the WSJ (Brill's tagger, TnT tagger, MXPOST, etc.) reach about 97% there, but do not work well on MEDLINE abstracts. Example: "The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS ..."
56
56 Errors made by the TnT tagger (Brants 2000): "A/DT chromosomal/JJ translocation/NN in/IN ..."; "... and/CC membrane/NN potential/NN after/IN mitogen/NN binding/JJ."; "... two/CD factors/NNS, which/WDT bind/NN to/TO the/DT same/JJ kappa/NN B/NN enhancers/NNS ..."; "... by/IN analysing/VBG the/DT Ag/VBG amino/JJ acid/NN sequence/NN."; "... to/TO contain/VB more/RBR T-cell/JJ determinants/NNS than/IN ..."; "Stimulation/NN of/IN interferon/JJ beta/JJ gene/NN transcription/NN in/IN vitro/NN by/IN ..."
57
57 Performance of the GENIA Tagger (accuracy by training corpus, evaluated on WSJ / GENIA).

GENIA tagger
Training      WSJ    GENIA
WSJ           97.0   84.3
GENIA         75.2   98.1
WSJ+GENIA     96.9   98.1

TnT tagger (reference)
Training      WSJ    GENIA
WSJ           96.7   84.3
GENIA         80.1   97.9
WSJ+GENIA     96.5   97.5

The GENIA tagger shows no degradation when trained on the mixed corpus; for TnT, some degradations (0.2-0.4) were observed compared with the taggers trained on "pure" corpora.
58
58 CRF-based POS tagging + active learning, GENIA: 98.4 with 3,000 sentences; 98.58 with 20,000 sentences.
59
59 CRF-based POS tagging + active learning, PTB: 96.76 with 10,000 sentences; best performance 97.18.
60
60 Applications
61
61 Our Policy for Information Extraction: separate the domain/task-independent part of an IE system from the domain/task-specific part.
62
62 Our Policy: separate the domain/task-independent part (a full parser that normalizes sentences into PASs) from the domain/task-specific part (extraction rules over PASs). PAS = predicate-argument structure.
63
63 Our Policy: distinguish the domain-independent part (a full parser that normalizes sentences into PASs) from the domain-specific part (extraction rules over PASs, learned automatically from a corpus). PAS = predicate-argument structure.
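The division of labour above can be sketched as a tiny domain-specific rule running over normalized PASs; the tuple format, the protein list, and the rule itself are illustrative assumptions, not the actual system's representation:

```python
def extract_activations(pas_list, is_protein):
    """Task-specific rule: 'activate' with protein arg1 and arg2 -> relation.

    pas_list: list of (predicate_lemma, {role: filler}) pairs,
              as produced by a domain-independent full parser.
    """
    out = []
    for pred, args in pas_list:
        if pred == "activate" and all(r in args for r in ("arg1", "arg2")):
            a, b = args["arg1"], args["arg2"]
            if is_protein(a) and is_protein(b):
                out.append((a, "activates", b))
    return out

proteins = {"p53", "Bcl-2"}
pas_list = [("activate", {"arg1": "p53", "arg2": "Bcl-2"}),
            ("show", {"arg1": "p53"})]
print(extract_activations(pas_list, proteins.__contains__))
# [('p53', 'activates', 'Bcl-2')]
```

Because the parser has already normalized passives and control constructions into the same PAS, one rule covers all the surface variants shown on slide 14.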
64
GENIA Event Annotation (example). For an identified event in the given sentence: classify the type of the event and record the text span giving the clue for it (ClueType); identify the theme of the event and record the text span linking the theme to the event (LinkTheme); identify the cause of the event and record the text span linking the cause to the event (LinkCause); record the environment (location, time) of the event (ClueLoc, ClueTime).
65
Gene_expression. Theme patterns observed (2,958): Protein 2,308; DNA 591; RNA 25; Peptide 4; Protein Protein 2; erroneous 27. Keywords: coexpress, coexpression, nonexpress, overexpress, express, biosynthesis, product, synthesize, constitute, ...
66
Transcription. Theme patterns observed (929): DNA 449; RNA 272; Protein 167; Peptide 2; erroneous 22. Keywords: transcrib-, transcript, synthesi-, express, ...
67
Localization. Theme patterns observed (730): Protein 608; Lipid 31; Atom 29; Other_organic_compound 14; DNA 12; Virus 5; Carbohydrate 5; RNA 4; Inorganic 4; Peptide 3. Keywords: translocation, secretion, release, localization, mobilization, uptake, secrete, import, transport, translocate, sequester, influx, migrate, localisation, move, delivery, export, ... ClueLoc: NONE 241; nuclear 140; to the nucleus 12; into the nucleus 11; cytoplasmic 8; in the cytoplasm 7; macrophages 5; nuclear ... in T lymphocytes 4; monocytes 4; in the nucleus 4; in the cytosol 4; in colostrum 4; from the cytoplasm to the nucleus 4.
68
Localization. Keywords and locations: translocation (166): nuclear 108, NONE 38, ...; secretion (100): NONE 57, name_of_cells 43; release (80): NONE 51, name_of_cells 19, ...; localization (30): nuclear 25, intracellular 3; uptake (24): NONE 14, name_of_cells 20. Keywords and themes: translocation (166): Protein 161, Virus 4, RNA 1; secretion (100): Protein 98, Lipid 1, Peptide 1; release (80): Protein 67, Other_organic_compound 6, Lipid 3; localization (30): Protein 30; uptake (24): Lipid 15, Carbohydrate 5, Protein 4.
69
69 Future Plan Kitano’s group, Kell’s group
72
72 Future Directions. Domain adaptation + interoperability: high performance can be obtained by using domain-specific characteristics and domain semantics; differences among abstracts, full papers, and comments in DBs; standardized interfaces (APIs) for NLP tools. Text archives: abstracts + full papers + comments/summary descriptions in DBs. Combining NLP tools with mining tools: knowledge discovery (disease-gene association), hypothesis generation, automatic data interpretation.