Exploiting domain and task regularities for robust named entity recognition Ph.D. thesis defense Andrew O. Arnold Machine Learning Department Carnegie Mellon University

Presentation transcript:

Exploiting domain and task regularities for robust named entity recognition Ph.D. thesis defense Andrew O. Arnold Machine Learning Department Carnegie Mellon University July 10, 2009 Thesis committee: William W. Cohen (CMU), Chair Tom M. Mitchell (CMU) Noah A. Smith (CMU) ChengXiang Zhai (UIUC) 1

Outline Overview – Thesis, problem definition, goals and motivation Contributions: – Feature hierarchies – Structural frequency features – Snippets – Graph relations Conclusions & future work 2

Thesis We attempt to discover regularities and relationships among various aspects of data, and exploit these to help create classifiers that are more robust across the data as a whole (both source and target). 3

Domain: Biological publications 4

Problem: Named entity recognition (NER) Protein-name extraction 5

6 Overview What we are able to do: – Train on large, labeled data sets drawn from the same distribution as the testing data What we would like to be able to do: – Make learned classifiers more robust to shifts in domain and task Domain: distribution from which data is drawn, e.g. abstracts, e-mails, etc. Task: goal of the learning problem (prediction type), e.g. proteins, people How we plan to do it: – Leverage data (both labeled and unlabeled) from related domains and tasks – Target: domain/task we’re ultimately interested in » data is scarce and labels are expensive, if available at all – Source: related domains/tasks » lots of labeled data available – Exploit stable regularities and complex relationships between different aspects of that data

What we are able to do: Supervised, non-transfer learning – Train on large, labeled data sets drawn from the same distribution as the testing data – Well-studied problem Train: The neuronal cyclin-dependent kinase p35/cdk5 comprises a catalytic subunit (cdk5) and an activator subunit (p35) Test: Reversible histone acetylation changes the chromatin structure and can modulate gene transcription. Mammalian histone deacetylase 1 (HDAC1) 7

What we would like to be able to do: Transfer learning (domain adaptation): – Leverage large, previously labeled data from a related domain Source: related domain we’ll be training on (with lots of data) Target: domain we’re interested in and will be tested on (data scarce) – [Ng ’06, Daumé ’06, Jiang ’06, Blitzer ’06, Ben-David ’07, Thrun ’96] Example domain pair: Train (source domain: E-mail), Test (target domain: IM) Train (source domain: Abstract): The neuronal cyclin-dependent kinase p35/cdk5 comprises a catalytic subunit (cdk5) and an activator subunit (p35) Test (target domain: Caption): Neuronal cyclin-dependent kinase p35/cdk5 (Fig 1, a) comprises a catalytic subunit (cdk5, left panel) and an activator subunit (p35, fmi #4) 8

What we’d like to be able to do: Transfer learning (multi-task): Same domain, but slightly different task Source: related task we’ll be training on (with lots of data) Target: task we’re interested in and will be tested on (data scarce) – [Ando ’05, Sutton ’05] Example task pair: Train (source task: Names), Test (target task: Pronouns) Train (source task: Proteins): The neuronal cyclin-dependent kinase p35/cdk5 comprises a catalytic subunit (cdk5) and an activator subunit (p35) Test (target task: Action Verbs): Reversible histone acetylation changes the chromatin structure and can modulate gene transcription. Mammalian histone deacetylase 1 (HDAC1) 9

10 How we’ll do it: Relationships

How we’ll do it: Related tasks [Figure labels: full protein name; abbreviated protein name; parenthetical abbreviated protein name; image pointers (non-protein parenthetical); genes; units] 11

12 Motivation Why is robustness important? – Often we violate non-transfer assumption without realizing. How much data is truly identically distributed (the i.d. from i.i.d.)? E.g. Different authors, annotators, time periods, sources Why are we ready to tackle this problem now? – Large amounts of labeled data & trained classifiers already exist Can learning be made easier by leveraging related domains and tasks? Why waste data and computation? Why is structure important? – Need some way to relate different domains to one another, e.g.: Gene ontology relates genes and gene products Company directory relates people and businesses to one another

Outline Overview – Thesis, problem definition, goals and motivation Contributions: – Feature hierarchies – Structural frequency features – Snippets – Graph relations Conclusions & future work 13

State-of-the-art features: Lexical 14

Feature Hierarchy Sample sentence: Give the book to Professor Caldwell Examples of the feature hierarchy: Hierarchical feature tree for ‘Caldwell’: 15 (Arnold, Nallapati and Cohen, ACL 2008)

Hierarchical prior model (HIER) Top level: z, hyperparameters, linking related features Mid level: w, feature weights for each domain Low level: x, y, training data and label pairs for each domain 16
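To make the three levels concrete, here is a minimal sketch of how shared hyperparameters z could tie together per-domain weights w. It assumes a simple Gaussian prior centered on z with a single variance; the function name `hier_penalty`, the feature name `is_capitalized`, and the domain names are illustrative assumptions, not the model as defined in the thesis.

```python
def hier_penalty(weights_by_domain, z, sigma2=1.0):
    """Negative log Gaussian prior (up to a constant) on per-domain weights.

    weights_by_domain: {domain: {feature: weight}}  (the mid level, w)
    z: {feature: shared prior mean}                 (the top level)
    sigma2: assumed prior variance, shared by all features
    """
    total = 0.0
    for w in weights_by_domain.values():
        for feature, value in w.items():
            # Features without a hyperparameter fall back to a zero mean.
            mean = z.get(feature, 0.0)
            total += (value - mean) ** 2 / (2.0 * sigma2)
    return total


# Hypothetical weights for one feature in two domains.
weights = {
    "abstracts": {"is_capitalized": 1.2},
    "captions": {"is_capitalized": 0.8},
}
z = {"is_capitalized": 1.0}
penalty = hier_penalty(weights, z)
```

Adding such a penalty to each domain's training loss pulls the domain-specific weights toward a common value, which is one way the sharing across domains could be realized.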

17 Relationship: feature hierarchies

Results: Baselines vs. HIER – Points below Y=X indicate HIER outperforming baselines HIER dominates non-transfer methods (GAUSS, CAT) HIER is closer to non-hierarchical transfer (CHELBA), but still outperforms it 18

Conclusions Hierarchical feature priors successfully – exploit structure of many different natural language feature spaces – while allowing flexibility (via smoothing) to transfer across various distinct, but related domains, genres and tasks New problem: – Exploit structure not only in feature space, but also in data space E.g.: transfer from abstracts to captions of papers, or from headers to bodies of e-mails 19

Outline Overview – Thesis, problem definition, goals and motivation Contributions: – Feature hierarchies – Structural frequency features – Snippets – Graph relations Conclusions & future work 20

Transfer across document structure: Abstract: summarizing, at a high level, the main points of the paper such as the problem, contribution, and results. Caption: summarizing the figure it is attached to. Especially important in biological papers (~ 125 words long on average). Full text: the main text of a paper, that is, everything else besides the abstract and captions. 21

Sample biology paper full protein name (red), abbreviated protein name (green) parenthetical abbreviated protein name (blue) non-protein parentheticals (brown) 22

Structural frequency features Insight: certain words occur more or less often in different parts of document – E.g. Abstract: “Here we”, “this work” Caption: “Figure 1.”, “dyed with” Can we characterize these differences? – Use them as features for extraction? 23 (Arnold and Cohen, CIKM 2008)
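One way to characterize these differences is sketched below, assuming whitespace tokenization and a plain "share of occurrences per section" statistic. The function name `structural_frequencies` and the toy section texts are illustrative assumptions and may differ from the features used in the CIKM 2008 paper.

```python
from collections import Counter


def structural_frequencies(sections):
    """Map each token to the share of its occurrences found in each section.

    sections: {section_name: text}
    """
    counts = {name: Counter(text.lower().split()) for name, text in sections.items()}
    vocab = set().union(*counts.values())
    features = {}
    for token in vocab:
        total = sum(c[token] for c in counts.values())
        features[token] = {name: c[token] / total for name, c in counts.items()}
    return features


# Toy document with two sections (made-up text).
sections = {
    "abstract": "here we show cdk5 binds p35",
    "caption": "figure 1 cdk5 dyed with cdk5",
}
features = structural_frequencies(sections)
```

Each token's per-section shares can then be attached to it as extra features alongside the lexical ones.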

YES! Characterizable difference between distribution of protein and non-protein words across sections of the document 24

25 Relationship: intra-document structure

Outline Overview – Thesis, problem definition, goals and motivation Contributions: – Feature hierarchies – Structural frequency features – Snippets – Graph relations Conclusions & future work 26

Snippets Tokens or short phrases taken from one of the unlabeled sections of the document and added to the training data, having been automatically positively or negatively labeled by some high confidence method. – Positive snippets: Match tokens from unlabelled section with labeled tokens Leverage overlap across domains Relies on one-sense-per-discourse assumption Makes target distribution “look” more like source distribution – Negative snippets: High confidence negative examples Gleaned from dictionaries, stop lists, other extractors Helps “reshape” target distribution away from source 27 (Arnold and Cohen, CIKM 2008)
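The snippet idea can be sketched as follows, assuming exact case-insensitive matching for positive snippets and a stop list for negative ones. The function name `make_snippets` and the toy tokens are illustrative assumptions; the slide also mentions dictionaries and other extractors as sources of negatives, which this sketch leaves out.

```python
def make_snippets(unlabeled_tokens, labeled_positives, stop_words):
    """Split tokens from an unlabeled section into positive and negative snippets.

    labeled_positives: lowercased tokens labeled as proteins elsewhere in the document
    stop_words: lowercased tokens treated as high-confidence negatives
    """
    positives, negatives = [], []
    for token in unlabeled_tokens:
        key = token.lower()
        if key in labeled_positives:
            # One-sense-per-discourse: a known protein name stays a protein.
            positives.append(token)
        elif key in stop_words:
            negatives.append(token)
    return positives, negatives


caption_tokens = ["The", "HDAC1", "complex", "was", "dyed", "with", "cdk5"]
positives, negatives = make_snippets(
    caption_tokens,
    labeled_positives={"cdk5", "p35"},
    stop_words={"the", "was", "with"},
)
```

Tokens matching neither list (here "HDAC1", "complex" and "dyed") stay unlabeled and are not added to the training data.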

28 Relationship: high-confidence predictions

Performance: abstract → abstract Precision versus recall of extractors trained on full papers and evaluated on abstracts using models containing: – only structural frequency features (FREQ) – only lexical features (LEX) – both sets of features (LEX+FREQ). 29

Performance: abstract → abstract Ablation study results for extractors trained on full papers and evaluated on abstracts – POS/NEG = positive/negative snippets 30

Performance: abstract → captions How to evaluate? – No caption labels – Need user preference study: Users preferred full (POS+NEG+FREQ) model’s extracted proteins over baseline (LEX) model (p = .00036, n = 182) 31

Outline Overview – Thesis, problem definition, goals and motivation Contributions: – Feature hierarchies – Structural frequency features – Snippets – Graph relations Conclusions & future work 32

Graph relations Represent data and features as a graph: – Nodes represent: Entities we are interested in: – E.g. words, papers, authors, years – Edges represent: Properties of and relationships between entities – E.g. isProtein, writtenBy, yearPublished Use graph-learning methods to answer queries – Redundant/orthogonal graph relations provide robust structure across domains 33 (Arnold and Cohen, ICWSM, SNAS 2009)

Relations – Mentions: paper → protein – Authorship: author → paper – Citation: paper → paper – Interaction: gene → gene 34

Relation: Mention (paper → protein) 35

Relation: Authorship (author → paper) 36

Relation: Citation (paper → paper) 37

Relation: Interaction (gene → gene) 38

All together: curated citation network 39

Data – PubMed Free, on-line archive Over 18 million biological abstracts published since 1948 – Including author list – PubMed Central (PMC) Subset of above papers for which full-text is available – Over one million papers » Including: abstracts, full text and bibliographies – The Saccharomyces Genome Database (SGD) Curated database of facts about yeast – Over 40,000 papers manually tagged with associated genes – The Gene Ontology (GO): Ontology describing properties of and relationships between various biological entities across numerous organisms 40

Nodes The nodes of our network represent the entities we are interested in: 44,012 Papers contained in SGD for which PMC bibliographic data is available. 66,977 Authors of those papers parsed from the PMC citation data. Each author’s position in the paper’s citation (i.e. first author, last author, etc.) is also recorded. 5,816 Genes of yeast, mentioned in those papers 41

Edges We likewise use the edges of our network to represent the relationships between and among the nodes, or entities. Authorship: 178,233 bi-directional edges linking author nodes and the nodes of the papers they authored. Mention: 160,621 bi-directional edges linking paper nodes and the genes they discuss. Cites: 42,958 uni-directional edges linking nodes of citing papers to the nodes of the papers they cite. Cited: 42,958 uni-directional edges linking nodes of cited papers to the nodes of the papers that cite them. RelatesTo: 1,604 uni-directional edges linking gene nodes to the nodes of other genes appearing in their GO description. RelatedTo: 1,604 uni-directional edges linking gene nodes to the nodes of other genes in whose GO description they appear. 42

Graph summary 43

Problem Predict which genes and proteins a biologist is likely to write about in the future Using: – Information contained in publication networks – Information concerning an individual’s own publications – No textual information (for now) Model as link prediction problem: – From: protein extraction P(token is a protein | token properties) – To: link prediction P(edge between gene and author| relation network) 44

More generally: Link prediction Given: historical curated citation network And: author/paper/gene query Predict: distribution over related authors/papers/genes Interpretation: related entities more likely to be written about 45

Method How to find related entities in graph? – Walk! Intuitively: – Start at query nodes – For search depth: » Simultaneously follow each out-edge with probability inversely proportional to current node’s total out-degree – At end of walk, return list of nodes, sorted by fraction of paths that end up on that node Practically: – Power method: » Exponentiate adjacency matrix 46
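The walk described above can be sketched in a few lines, assuming an unweighted graph stored as adjacency lists. Propagating the probability mass one step at a time is the same computation as repeatedly multiplying the start distribution by the row-normalized adjacency matrix. The function name `random_walk`, the handling of dead ends, and the toy graph are illustrative assumptions.

```python
from collections import defaultdict


def random_walk(out_edges, query_nodes, depth):
    """Return nodes ranked by the probability mass reaching them after `depth` steps.

    out_edges: {node: [neighbor, ...]}
    """
    # Start with the mass spread evenly over the query nodes.
    mass = {node: 1.0 / len(query_nodes) for node in query_nodes}
    for _ in range(depth):
        next_mass = defaultdict(float)
        for node, p in mass.items():
            neighbors = out_edges.get(node, [])
            if not neighbors:
                # Assumption: a node without out-edges keeps its mass.
                next_mass[node] += p
                continue
            # Each out-edge is followed with probability 1 / out-degree.
            share = p / len(neighbors)
            for neighbor in neighbors:
                next_mass[neighbor] += share
        mass = dict(next_mass)
    # Sort by descending mass, breaking ties by node name.
    return sorted(mass.items(), key=lambda item: (-item[1], item[0]))


# Toy author/paper/gene graph (made-up nodes).
out_edges = {
    "author_a": ["paper_1", "paper_2"],
    "paper_1": ["author_a", "gene_x"],
    "paper_2": ["author_a", "gene_x", "gene_y"],
    "gene_x": ["paper_1", "paper_2"],
    "gene_y": ["paper_2"],
}
ranked = random_walk(out_edges, ["author_a"], depth=2)
```

In this toy graph gene_x is reachable through both papers and so outranks gene_y; filtering the ranked list down to gene nodes gives the predicted genes for the query author.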

Example 47

Questions Which, if any, relations are most useful? – We can choose to ignore certain edge/node types in our walk – Compare predictive performance of various networks ablated in different ways – Protocol: Given: set of authors who published in And: Pre-2008 curated citation network Predict: New genes authors will write about in 2008 How much can a little information help? – We can selectively add one known 2008 gene to the query and see how much it helps 48
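Ignoring certain edge types amounts to filtering the edge list before walking. The sketch below assumes edges are stored as (source, relation, target) triples and uses the relation names from the Edges slide; the function name `ablate` and the toy edges are illustrative assumptions.

```python
def ablate(edges, dropped_relations):
    """Keep only the edges whose relation type is not in dropped_relations."""
    return [
        (source, relation, target)
        for source, relation, target in edges
        if relation not in dropped_relations
    ]


# Toy network with made-up nodes.
edges = [
    ("author_a", "Authorship", "paper_1"),
    ("paper_1", "Mention", "gene_x"),
    ("paper_1", "Cites", "paper_2"),
    ("gene_x", "RelatesTo", "gene_y"),
]
# A "social only" network: drop the GO-derived gene-gene relations.
social_only = ablate(edges, {"RelatesTo", "RelatedTo"})
```

Running the same query against differently ablated networks then shows how much each relation type contributes to predictive performance.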

Ablated networks: Baselines 49

Ablated networks: Social 50

Ablated networks: Social + Biological 51

Results: – First author most predictive Except: CITATIONS_CITED – Pure bio models do poorly On par with worst baseline Full model benefits from removing bio relations 52

Results: – Social models beat baselines Simple collaborative filtering RELATED_PAPERS does best* – FULL model benefits slightly* from removing coauthorship relations – Adding 1_GENE helps a lot 50% better than FULL Significantly* better than second best RELATED_PAPERS 53

On-line demo Web application implementing our algorithm – Includes links to data files 54

55

56

Graph-based priors for named entity extraction Social network features are informative but… – Do they replace lexical features (i.e. redundant)? – Or supplement them (i.e. orthogonal)? Combine network relations with lexical features – More robust model built from different views of data 57

Method How to combine these two representations: – Graph-based social features – Vector-based lexical features Use ranked predictions made by curated citation network as features in standard lexical classifier: – Query each test paper’s authors against citation graph Return ranked list of genes Add top-k genes to a dictionary of ‘related genes’ – For each token in the paper Test membership of token in ‘related genes’ dictionary Use membership as feature in classifier, along with lexical features 58
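The dictionary step can be sketched as follows, assuming the graph query has already returned a ranked list of (gene, score) pairs and that membership is tested by case-insensitive exact match. The function name `related_gene_features`, the feature key `in_related_genes`, and the toy scores are illustrative assumptions.

```python
def related_gene_features(tokens, ranked_genes, k):
    """Attach a 'related genes' membership feature to every token.

    ranked_genes: [(gene_name, score), ...] sorted best first
    k: number of top-ranked genes to put in the dictionary
    """
    related = {gene.lower() for gene, _score in ranked_genes[:k]}
    return [
        {"token": token, "in_related_genes": token.lower() in related}
        for token in tokens
    ]


# Made-up ranking for one paper's authors.
ranked_genes = [("CDC28", 0.4), ("ACT1", 0.3), ("HDA1", 0.1)]
token_features = related_gene_features(["cdc28", "binds", "HDA1"], ranked_genes, k=2)
```

The boolean would be passed to the classifier alongside each token's lexical features; here "HDA1" falls outside the top two genes, so it is not flagged.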

Experiment Evaluate contribution of curated citation network features to standard lexical-feature-based CRF extractor – Split data train/test 298 GENIA abstracts with PMC, SGD and GO data (method requires labeled abstracts and citation information) – Compare performances of two trained models: CRF_LEX: standard CRF model trained on lexical features CRF_LEX+GRAPH_SUPERVISED: standard CRF model trained on lexical features, augmented with curated citation network based features (i.e., membership in dictionaries of ‘related genes’) 59

Results Adding graph-relations to lexical model improves performance – CRF_LEX+GRAPH_SUPERVISED > CRF_LEX 60

Graph relations conclusions & future work Graphs offer natural way to integrate disparate sources of information Network-based social features alone can do well – Significant predictive information in these relations – Improved by combining with lexical features Social knowledge often trumps biological input – What other types of relations can we try? Sub-structure of a document (abstract, captions, etc)? First authors are most predictive – What other social hypotheses can we test? Impact factor? We can leverage high confidence predictions to improve overall performance – What other types of leverage can we find: Collective classification? Transfer across domains, document sections, organisms? 61

Outline Overview – Thesis, problem definition, goals and motivation Contributions: – Feature hierarchies – Structural frequency features – Snippets – Graph relations Conclusions & future work 62

Conclusions Robust learning across domains is possible but requires some other type of information joining the two domains: Linguistically-inspired feature hierarchies: – Exploit hierarchical relationship between lexical features – Allow for natural smoothing and sharing of information across features Structural frequency features take advantage of the information in: – Structure of data itself – Distribution of instances across that structure (e.g. words across the sections of an article) Snippets leverage the relationship of entities: – Among themselves – Across tasks and labels within a dataset. Graph relations allow us to tie together: – Disparate entities and sources of data Lets brittleness in one task be supported by robustness from another 63

Future work Hierarchical feature models – Other models besides linguistic hierarchy Correlation between features Snippets – Other ways of grouping besides exact match Back-off tree based matching Edit distance – Other positive and negative heuristics Better task-specific classifiers Better domain-specific taxonomies and grammars Graphical relations: – Incorporate temporal information – Represent structure of document text itself as graph – Full transfer experiment 64

65 ☺ ¡Thank you! ☺ No, really, thank you all… ¿ Questions ? For details and references please see thesis document: