Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst Computer.

Presentation transcript:

Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing
  Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst
  Computer Science Division and SIMS, University of California, Berkeley
  Supported by NSF DBI and a gift from Genentech

Plan
  - Overview
  - Noun compound (NC) bracketing
  - Problems with Web Counts
  - Layers of annotation
  - Applying LQL to NC bracketing
  - Evaluation

Overview
  - Motivation: NLP requires re-use of results for additional processing
      - pipeline for end applications: data mining, IR, etc.
  - Proposed solution:
      - layers of annotations over text
      - Layered Query Language (LQL)
  - Illustration: application to noun compound bracketing

Noun Compound Bracketing
  (a) [ [ liver cell ] antibody ]   (left bracketing)
  (b) [ liver [ cell line ] ]       (right bracketing)
  In (a), the antibody targets the liver cell.
  In (b), the cell line is derived from the liver.

Dependency vs. Adjacency
  [Figure: dependency model vs. adjacency model, with left and right bracketings]

Related Work
  - Marcus (1980), Pustejovsky et al. (1993), Resnik (1993)
      - adjacency model: Pr(w1|w2) vs. Pr(w2|w3)
  - Lauer (1995)
      - dependency model: Pr(w1|w2) vs. Pr(w1|w3)
  - Keller & Lapata (2004)
      - use the Web
      - unigrams and bigrams
  - Nakov & Hearst (2005): will be presented at CoNLL!
      - use the Web
      - n-grams
      - paraphrases
      - surface features
  Slide callouts: "or #, MI, χ²"; "Pr that w1 precedes w2"; figure labels: dependency model, adjacency model
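The two models on this slide differ only in which word pair is compared against (w1, w2). A minimal Python sketch of that difference, assuming a hypothetical `assoc` function that returns some association score for an ordered word pair (a probability, a raw count, MI, or χ²). The toy scores are invented for illustration, and breaking ties in favor of "left" is an assumption, not something stated in the talk:

```python
from typing import Callable, Dict, Tuple

Assoc = Callable[[str, str], float]

def bracket_adjacency(w1: str, w2: str, w3: str, assoc: Assoc) -> str:
    """Adjacency model: compare the pair (w1, w2) against (w2, w3)."""
    return "left" if assoc(w1, w2) >= assoc(w2, w3) else "right"

def bracket_dependency(w1: str, w2: str, w3: str, assoc: Assoc) -> str:
    """Dependency model: compare the pair (w1, w2) against (w1, w3)."""
    return "left" if assoc(w1, w2) >= assoc(w1, w3) else "right"

# Invented toy scores, for illustration only (not real corpus statistics).
TOY_SCORES: Dict[Tuple[str, str], float] = {
    ("liver", "cell"): 5.0,
    ("cell", "line"): 9.0,
    ("liver", "line"): 1.0,
}

def toy_assoc(a: str, b: str) -> float:
    """Hypothetical association function backed by the toy table."""
    return TOY_SCORES.get((a, b), 0.0)

adjacency_choice = bracket_adjacency("liver", "cell", "line", toy_assoc)
dependency_choice = bracket_dependency("liver", "cell", "line", toy_assoc)
```

With these invented scores the two models disagree on "liver cell line", which is the situation where the choice of model matters.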

Nakov & Hearst (2005)
  - Web page hits: proxy for n-gram frequencies
  - Sample surface features
      - amino-acid sequence → left
      - brain stem’s cell → left
      - brain’s stem cell → right
  - Majority vote to combine the different models
  - Accuracy 89.34% on Lauer’s set (baseline: 66.70%; previous best result: 80.70%): state of the art
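The majority-vote step can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: each model is assumed to emit "left", "right", or `None` (no evidence), and returning `None` on a tie is an assumption made here.

```python
from collections import Counter
from typing import Iterable, Optional

def majority_vote(votes: Iterable[Optional[str]]) -> Optional[str]:
    """Combine per-model bracketing votes; abstentions (None) are ignored.

    Returns "left" or "right", or None when the vote is tied or empty
    (the tie handling is an assumption of this sketch).
    """
    counts = Counter(v for v in votes if v in ("left", "right"))
    if counts["left"] == counts["right"]:
        return None
    return "left" if counts["left"] > counts["right"] else "right"

# Hypothetical votes from three models plus one abstention.
decision = majority_vote(["left", "right", "left", None])
```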

Web Counts: Problems
  - Page hits are inaccurate
      - maybe not that bad (Keller & Lapata, 2003)
  - The Web lacks linguistic annotation
      - Pr(health|care) = #("health care") / #(care)
          - health: noun
          - care: both verb and noun
          - can be adjacent by chance
          - can come from different sentences
      - Cannot find:
          - stem cells VERB PREPOSITION brain
          - protein synthesis’ inhibition
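To make the part-of-speech problem concrete, here is a small Python sketch of the estimate Pr(health|care) = #("health care") / #(care). All counts are invented for illustration. The point is only that a denominator mixing verb and noun uses of "care" gives a different estimate than one restricted to nouns, which is what annotated text makes possible:

```python
def conditional_estimate(pair_count: int, word_count: int) -> float:
    """Estimate Pr(w1|w2) as #("w1 w2") / #(w2); 0.0 if w2 was never seen."""
    return pair_count / word_count if word_count else 0.0

# Invented toy counts, for illustration only.
health_care = 30     # occurrences of the bigram "health care"
care_any_pos = 100   # all occurrences of "care" (noun and verb uses)
care_as_noun = 60    # occurrences of "care" tagged as a noun

web_style = conditional_estimate(health_care, care_any_pos)   # no POS filter
annotated = conditional_estimate(health_care, care_as_noun)   # nouns only
```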

Solution: MEDLINE+LQL
  - MEDLINE: ~13M abstracts
  - We annotated:
      - ~1.4M abstracts
      - ~10M sentences
      - ~320M annotations
  - Layered Query Language: demo at ACL!

The System
  - Built on top of an RDBMS
  - Supports layers of annotations over text
      - hierarchical, overlapping
      - cannot be represented by a single-file XML
  - Specialized query language
      - LQL (Layered Query Language)
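Why overlapping layers do not fit a single XML file can be seen with a small stand-off representation. The Python sketch below is an assumption-laden illustration, not the system's actual schema: the `Annotation` record, the layer names, and the character offsets are all hypothetical. Two annotations "cross" when they overlap but neither contains the other, and crossing spans cannot be written as properly nested XML elements:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: a labeled character span in some layer."""
    layer: str
    tag: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset

def overlaps(a: Annotation, b: Annotation) -> bool:
    return a.start < b.end and b.start < a.end

def contains(outer: Annotation, inner: Annotation) -> bool:
    return outer.start <= inner.start and inner.end <= outer.end

def crosses(a: Annotation, b: Annotation) -> bool:
    """True when the spans overlap but cannot be nested in one another."""
    return overlaps(a, b) and not contains(a, b) and not contains(b, a)

text = "liver cell line"
# Hypothetical layers and tags, chosen only to show crossing spans.
first = Annotation(layer="layer_a", tag="term", start=0, end=10)   # "liver cell"
second = Annotation(layer="layer_b", tag="term", start=6, end=15)  # "cell line"
```

Storing each annotation as a row of (layer, tag, start, end) is one natural fit for a relational database, which is consistent with the slide's remark that the system sits on top of an RDBMS.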

Annotated Example
  [Figure: annotated example; visible layer labels: LocusLink, MeSH]

Noun Compound Extraction (1)
    FROM
      [layer='shallow_parse' && tag_name='NP'
        ^ [layer='pos' && tag_name="noun"]
          [layer='pos' && tag_name="noun"]
          [layer='pos' && tag_name="noun"] $
      ] AS compound
    SELECT compound.text
  Notes:
  - ^ : layers’ beginnings should match
  - $ : layers’ endings should match
  - By default: nothing can go in between
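What this query asks for can be emulated over a toy in-memory structure. The Python sketch below is not how LQL is implemented. It assumes a hypothetical representation in which POS tags are a list indexed by token and NP chunks are token-index spans. An NP matches only if it starts at the first noun, ends at the third noun, and contains nothing else:

```python
from typing import List, NamedTuple

class Span(NamedTuple):
    """A chunk covering tokens[start:end] (end is exclusive)."""
    tag: str
    start: int
    end: int

def three_noun_nps(nps: List[Span], pos_tags: List[str]) -> List[Span]:
    """Keep NP chunks made of exactly three tokens, all tagged 'noun'."""
    matches = []
    for np in nps:
        tags = pos_tags[np.start:np.end]
        # Exactly three tokens, all nouns: the NP begins with the first noun
        # (^), ends with the third ($), and has no gaps in between.
        if np.tag == "NP" and len(tags) == 3 and all(t == "noun" for t in tags):
            matches.append(np)
    return matches

tokens = ["the", "liver", "cell", "line", "grows"]
pos = ["det", "noun", "noun", "noun", "verb"]

exact_np = Span("NP", 1, 4)     # "liver cell line"
with_det = Span("NP", 0, 4)     # "the liver cell line"
found = three_noun_nps([exact_np, with_det], pos)
```

The NP that includes the determiner is rejected, because its beginning does not coincide with the beginning of the first noun.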

Noun Compound Extraction (2)
    SELECT LOWER(comp.text) AS lc, COUNT(*) AS freq
    FROM (
      BEGIN_LQL
        FROM
          [layer='shallow_parse' && tag_name='NP'
            ^ [layer='pos' && tag_name="noun"]
              [layer='pos' && tag_name="noun"]
              [layer='pos' && tag_name="noun"] $
          ] AS compound
        SELECT compound.text
      END_LQL
    ) AS comp
    GROUP BY lc
    ORDER BY freq DESC
  BUT! Does not allow adjectives, determiners, etc.

Noun Compound Extraction (3)
    SELECT LOWER(comp.text) AS lc, COUNT(*) AS freq
    FROM (
      BEGIN_LQL
        FROM
          [layer='shallow_parse' && tag_name='NP'
            ^ ( { ALLOW GAPS }
                ![layer='pos' && tag_name="noun"]
                ( [layer='pos' && tag_name="noun"]
                  [layer='pos' && tag_name="noun"]
                  [layer='pos' && tag_name="noun"] ) $
              ) $
          ] AS compound
        SELECT compound.text
      END_LQL
    ) AS comp
    GROUP BY lc
    ORDER BY freq DESC
  Notes:
  - ![...] : layer negation
  - ( ... ) : artificial range
  For details: paper, online demo, demo at ACL

Finding Bigram Counts
    SELECT COUNT(*) AS freq
    FROM (
      BEGIN_LQL
        FROM
          [layer='shallow_parse' && tag_name='NP'
            [layer='pos' && tag_name="noun" && content="immunodeficiency"] AS word1
            [layer='pos' && tag_name="noun" && (content="virus" || content="viruses")]
          ]
        SELECT word1.content
      END_LQL
    ) AS word
    ORDER BY freq DESC
  Notes:
  - Inflections: UMLS Specialist lexicon
  - just count

Paraphrases
  Types of noun compounds (Warren, 1978):
  - Paraphrasable
      - Prepositional
          - immunodeficiency virus in humans → right
      - Verbal
          - virus causing human immunodeficiency → left
          - immunodeficiency virus found in humans → right
      - Copula
          - immunodeficiency virus that is human → right
  - Other

Prepositional Paraphrases
    SELECT LOWER(prp.content) lp, COUNT(*) AS freq
    FROM (
      BEGIN_LQL
        FROM
          [layer='sentence'
            [layer='pos' && tag_name="noun" && content = "immunodeficiency"]
            [layer='pos' && tag_name="noun" && content IN ("virus","viruses")]
            [layer='pos' && tag_name='IN'] AS prep
            ?[layer='pos' && tag_name='DT' && content IN ("the","a","an")]
            [layer='pos' && tag_name="noun" && content IN ("human", "humans")]
          ]
        SELECT prep.content
      END_LQL
    ) AS prp
    GROUP BY lp
    ORDER BY freq DESC
  Note: ?[...] : optional layer
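How a prepositional paraphrase turns into a bracketing vote can be sketched in Python. This is a simplified illustration under stated assumptions: it works on plain token lists rather than annotation layers, the preposition list is a small hypothetical stand-in for the POS tag 'IN', and inflections are not handled (the slides use the UMLS Specialist lexicon for that). For a compound w1 w2 w3, the pattern "w2 w3 PREP (DT)? w1" votes right and "w3 PREP (DT)? w1 w2" votes left:

```python
from typing import List

# Assumption: a tiny stand-in for tokens that a POS tagger would tag 'IN'.
PREPOSITIONS = {"in", "from", "of", "for", "on", "with"}
DETERMINERS = {"the", "a", "an"}

def _prep_then(tokens: List[str], i: int, tail: List[str]) -> bool:
    """True if tokens[i] is a preposition, optionally followed by a
    determiner, and then exactly the words in `tail`."""
    if i >= len(tokens) or tokens[i] not in PREPOSITIONS:
        return False
    i += 1
    if i < len(tokens) and tokens[i] in DETERMINERS:
        i += 1
    return tokens[i:i + len(tail)] == tail

def paraphrase_votes(tokens: List[str], w1: str, w2: str, w3: str) -> List[str]:
    """Collect 'left'/'right' votes for the compound w1 w2 w3 from one sentence."""
    tokens = [t.lower() for t in tokens]
    votes = []
    for i in range(len(tokens)):
        # "w2 w3 PREP (DT)? w1" groups w2 with w3: right bracketing.
        if tokens[i:i + 2] == [w2, w3] and _prep_then(tokens, i + 2, [w1]):
            votes.append("right")
        # "w3 PREP (DT)? w1 w2" groups w1 with w2: left bracketing.
        if tokens[i] == w3 and _prep_then(tokens, i + 1, [w1, w2]):
            votes.append("left")
    return votes

left_votes = paraphrase_votes(
    "cells from the bone marrow".split(), "bone", "marrow", "cells")
right_votes = paraphrase_votes(
    "immunodeficiency virus in human".split(), "human", "immunodeficiency", "virus")
```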

Evaluation
  - obtained 418,678 noun compounds (NCs)
  - annotated the top 232 NCs
      - agreement: 88%
      - kappa: 0.606
  - baseline (left): 83.19%
  - n-grams: Pr, #, χ²
  - prepositional paraphrases

Results
  [Figure: results chart; legend: correct, N/A, wrong]

Discussion
  - Semantics of bone marrow cells
      - top verbal paraphrases
          - cells derived from (the) bone marrow (22 instances)
          - cells isolated from (the) bone marrow (14 instances)
      - top prepositional paraphrases
          - cells in (the) bone marrow (456 instances)
          - cells from (the) bone marrow (108 instances)
  - Finding hard examples for NC bracketing
      - w1 w2 w3 such that both w1 w2 and w2 w3 are MeSH terms
  - Web cannot do it!
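The "hard examples" filter can be sketched in a few lines of Python. The term set and compounds below are placeholders invented for illustration; they are not real MeSH entries. In the system described here the check would presumably run against a MeSH annotation layer rather than an in-memory set:

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]

def hard_examples(compounds: Iterable[Triple],
                  terms: Set[Tuple[str, str]]) -> List[Triple]:
    """Keep compounds w1 w2 w3 where both (w1, w2) and (w2, w3) are known terms."""
    return [
        (w1, w2, w3)
        for (w1, w2, w3) in compounds
        if (w1, w2) in terms and (w2, w3) in terms
    ]

# Placeholder data, invented for illustration (not real MeSH terms).
TERMS = {("a", "b"), ("b", "c"), ("d", "e")}
COMPOUNDS = [("a", "b", "c"), ("d", "e", "f")]

ambiguous = hard_examples(COMPOUNDS, TERMS)
```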

The End
  - Thank you!
  - Layered Query Language: demo at ACL!