Published by Charlotte Porter. Modified over 9 years ago.
1
Information Extraction from the Cancer Literature
The Pediatric Hematology/Oncology Seminar Series
Children’s Hospital of Philadelphia
March 8, 2005, Philadelphia, PA
2
A Global Challenge
Cell: DNA sequence, Genomic variation, Microarrays, RNAi, Protein interactions
Clinic: Patient records, Test results, Clinical reports, Procedures, Phone calls
MDS1, Leukemia
Text, Phenotype
Natural language understanding
3
Too Much Text
Biomedical text: 15 million articles, 1.5 billion words
Leukemia: 181,394 articles
Solution 1: Approximate
–What you can find
–What finds you
Solution 2: Read everything
–20/day = 25 years
–385,034 new articles by then
Solution 3: Impose structure on the descriptions
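The "read everything" figures can be checked with a few lines of arithmetic. The sketch below assumes a 365-day reading year and no days off; the function name `reading_time` is illustrative and not from the slides.

```python
def reading_time(articles: int, per_day: int, days_per_year: int = 365):
    """Return (days, years) needed to read `articles` at `per_day` articles a day."""
    days = articles / per_day
    return days, days / days_per_year

# Figures taken from the slide.
LEUKEMIA_ARTICLES = 181_394
NEW_ARTICLES_BY_THEN = 385_034

days, years = reading_time(LEUKEMIA_ARTICLES, per_day=20)

# Publication rate implied by the slide's "new articles by then" figure.
new_per_day = NEW_ARTICLES_BY_THEN / days

print(f"{years:.1f} years to read the backlog")
print(f"{new_per_day:.1f} new articles per day in the meantime")
```

Under these assumptions the backlog takes just under 25 years, and the implied publication rate is roughly twice the reading rate, so the reader never catches up.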
4
IE Process
Phase 1: Domain selection and definition
Phase 2: Manual annotation
Phase 3: Create and train machine-learning algorithms
Phase 4: “Active Annotation”
Phase 5: Utilization of annotations
5
Domain
Biological Domains
–Genomic variations in malignancy
–Neuroblastoma
Entity Classes
–Genes (genes, transcripts, proteins)
–Genomic variations (type, location, state)
–Malignant type
–Malignancy attributes: developmental state, clinical stage, histology, malignancy site, differentiation status, heredity status
6
Document Sets
MEDLINE: Abstracts --> Full Text
Annotation training set: 4,000 MEDLINE abstracts
–Genes commonly mutated in various malignancies
–Genes implicated in neuroblastoma
Abstracts are manually annotated (dual pass)
Results are used to train automated taggers
7
Workflow Management
8
Extraction Process
Example sentence: MDS1 gene alterations often cause leukemia
9
Parsing
Separate: MDS1 | gene | alterations | often | cause | leukemia
10
Part-of-speech Tagging
Separate: MDS1 | gene | alterations | often | cause | leukemia
Grammar: MDS1 (Noun), gene (Noun), alterations (Plural noun), often (Adverb), cause (Verb), leukemia (Noun)
11
Part-of-speech Tagging
12
Separate: MDS1 | gene | alterations | often | cause | leukemia
Grammar: MDS1 (Noun), gene (Noun), alterations (Plural noun), often (Adverb), cause (Verb), leukemia (Noun)
13
Named Entity Recognition
Grammar: MDS1 (Noun), gene (Noun), alterations (Plural noun), often (Adverb), cause (Verb), leukemia (Noun)
Label: MDS1 gene (Gene), alterations (Process), leukemia (Disease)
14
Definitions: Process
Initial Definitions: Domain Experts
–Analyze representative subset of text mentions
–Input of specific knowledge
Manual Annotation
–Tag text with initial definitions
–Iterative re-definition process
–More text: Tighter and more robust definitions
Widen Domain Expertise
Publication and Utilization
15
Definitions
Gene Entities: Genes, Transcripts, Proteins, Other
Genes: Individual Gene, Gene Family, Gene Superfamily
16
Definitions: Gene

The Gene-Entity category includes genes as well as their downstream products such as transcripts and proteins, in addition to the more general groups of gene and protein families, super-families, and so forth. Note that the category name 'Gene-Entity' is not a completely accurate description of the members of this class, since the category includes things other than genes. However, most things in this class are genes, and everything is either a gene or gene-derived (transcripts and proteins). The diagram that follows attempts to illustrate this point and provides some examples.

What Is and What Is Not Included?

There are two ways to think about genes.

1. Genes as conceptual entities. (This is what we want to capture.) Genes refer to segments of the genome which have been identified with a specific function or product (for example, the gene for eye color in a fly or a membrane receptor in humans). Although they are "things", they really represent abstract concepts. We can talk about the gene "K-Ras", but we are really referring to an abstract concept: an "ideal form" of the K-Ras gene, which has known attributes. We can’t point to K-Ras; we can only point to instances of K-Ras. Each of these instances (a specific manifestation of the gene as described in #2 below) has the attributes and characteristics of the abstract concept of K-Ras, but the different instances of K-Ras may vary slightly from one another. (This parallels the concept of "species". We all have an intuitive grasp of the species concept and can tell most species apart: a grizzly bear from a polar bear. However, when we visit the zoo we encounter instances of a species, individual bears, and not the concept itself.) Although this may seem pedantic, there is an important reason for making this distinction, which we’ll describe below.

Let’s consider some examples based upon this logic:
a. For genes: c-kit, CD117, and alpha-smooth muscle actin
b. A non-biology example: a 2003 Ferrari Modena. This is an abstract concept for a specific type of car. However, you can’t point to an abstract 2003 Ferrari Modena; you can only point to specific instances, which may vary, even if slightly, from one another.
c. K-Ras as investigated in Bob. This can be a tricky example since it would appear as though we are talking about a specific instance of K-Ras. But remember, in nearly all cases, genes are paired in humans (sometimes there are even more
17
Definitions: Confounding Issues
Levels of specificity
–Protein/enzyme/kinase/tyrosine kinase/NTRK1
–TRK antibody
–Colon cancer vs. cancer of the colon
Boundary issues
–Retinoblastoma
–Head and neck cancer
–MEN type 2B syndrome
18
Entity Annotation
19
Named Entity Recognition
Grammar: MDS1 (Noun), gene (Noun), alterations (Plural noun), often (Adverb), cause (Verb), leukemia (Noun)
Label: MDS1 gene (Gene), alterations (Process), leukemia (Disease)
20
Syntactic Analysis
Label: MDS1 gene (Gene), alterations (Process), leukemia (Disease)
Syntax: Noun phrase (MDS1 gene alterations), Adverb phrase (often), Verb phrase (cause leukemia), Noun phrase (leukemia)
21
Treebanking
22
Syntactic Analysis
Label: MDS1 gene (Gene), alterations (Process), leukemia (Disease)
Syntax: Noun phrase (MDS1 gene alterations), Adverb phrase (often), Verb phrase (cause leukemia), Noun phrase (leukemia)
23
Relation Tagging
Label: MDS1 gene (Gene), alterations (Process), leukemia (Disease)
Syntax: Noun phrase (MDS1 gene alterations), Adverb phrase (often), Verb phrase (cause leukemia), Noun phrase (leukemia)
Relationships
–Event: alterations
–Object: MDS1 gene
–Action: cause
–Frequency: often
–Result: leukemia
24
Relation Tagging
25
Annotation Viewer
26
Annotations
Annotation Task    Start Date   Annotated Documents   Words
Pre-tagging        11/3/03      3834                  1,456,000
Entity tagging     9/24/03      3829                  1,455,000
POS tagging        8/27/03      2332                  886,160
Treebanking        2/26/04      2300                  874,000
Relation tagging   10/31/04     618                   234,000
27
Automated Algorithms
Pretagger
–Assigns token, sentence, paragraph, section boundaries
–Nearly 100% accuracy
–Pipeline implementation: Finished
Bio Part-of-speech tagger
–Assigns part-of-speech tags to tokens
–Uses pretagging annotations
–Accuracy of 97.3%
–Pipeline implementation: Finished
28
Entity Taggers
Entity Taggers: Automated, machine-learning algorithms for named entity recognition in text
Goals
–Highly accurate, precision > recall
–Rapid deployment
–Flexible design
Technique
–Conditional random fields
–Text feature-based
–Uses pretagging, POS annotations
–Probabilistic maximization of feature weights
–Corrects for overfitting
29
Entity Taggers
GeneTaggerCRF
–Tags gene symbols, names, and descriptions
 KDR, VEGFR-2, VEGF receptor-2
 vascular endothelial growth factor receptor type 2
–86% precision/79% recall
–Pipeline implementation: Imminent
VTag
–Simultaneously tags variation types, locations, states
 point mutation, loss of heterozygosity
 codon 12, 11q23, base pair 17, Ki-ras
 GGT, glycine, Asp
–85% precision/79% recall
–Pipeline implementation: Imminent
30
Entity Taggers
Mtag
–Tags malignant type labels
 acute myeloid leukemias (AMLs)
 translocation t(9;11)-positive leukemia
 NB
 transitional cell carcinoma of the bladder
 Hypoplastic myelodysplastic syndrome
 predominantly cystic bilateral neuroblastomas
–85% precision/82% recall
–Pipeline implementation: Imminent
31
Entity Taggers
32
Relation Tagger
Relation Taggers: Identifying relationships between entities
Given this text: Missense mutation at codon 45 (TCT to TTT)
Can we automatically identify:
1. Pairwise associations [(codon 45 and TCT); (TCT and TTT); etc.]
2. The entire mutation event:
 VARIATION EVENT #60609
 Variation type: missense mutation
 Variation location: codon 45
 Variation state 1: TCT
 Variation state 2: TTT
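One way to picture the two targets on this slide is as a record for the whole event plus the pairs that can be drawn from it. The sketch below is only an illustration: the class name `VariationEvent` and its field names are assumptions, not the project's actual data model.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class VariationEvent:
    """Hypothetical container for one variation event (field names are assumed)."""
    variation_type: str
    location: str
    state_1: str
    state_2: str

    def entities(self):
        # The four entity mentions that make up the event, in slide order.
        return [self.variation_type, self.location, self.state_1, self.state_2]

    def pairwise_associations(self):
        # Every unordered pair of entities that belong to the same event.
        return list(combinations(self.entities(), 2))

# Values taken from the slide's example text.
event = VariationEvent("missense mutation", "codon 45", "TCT", "TTT")
pairs = event.pairwise_associations()

for left, right in pairs:
    print(f"({left} and {right})")
```

A four-entity event yields six pairwise associations, including the two the slide lists, (codon 45 and TCT) and (TCT and TTT).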
33
Relation Tagger
Goals: Accurate, rapid, flexible
Technique
–Maximum entropy
–Feature-based probabilistic model
–Events built upon binary associations
–Uses pretagging, POS, and entity annotations
Domain
–Genomic variation events
–Tested on 447 abstracts: 1218 relations, 4773 entities
–38% of relations were non-binary
–Baseline: Two entities within 5 words = related
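The baseline rule (two entities within 5 words are related) is simple enough to sketch directly. The slide does not say how distance is measured, so this version assumes it is the number of tokens between the end of the first entity and the start of the second; the tuple layout and the name `baseline_related` are likewise illustrative.

```python
from itertools import combinations

def baseline_related(entities, max_gap=5):
    """Pair up entities whose token gap is at most `max_gap`.

    entities: list of (text, start_token, end_token), end exclusive.
    The gap definition (next start minus previous end) is an assumption.
    """
    ordered = sorted(entities, key=lambda e: e[1])
    related = []
    for first, second in combinations(ordered, 2):
        gap = second[1] - first[2]
        if gap <= max_gap:
            related.append((first[0], second[0]))
    return related

# Tokens: Missense mutation at codon 45 ( TCT to TTT )
#         0        1        2  3     4  5 6   7  8   9
entities = [
    ("missense mutation", 0, 2),
    ("codon 45", 3, 5),
    ("TCT", 6, 7),
    ("TTT", 8, 9),
]

related = baseline_related(entities)
for pair in related:
    print(pair)
```

With this gap definition the rule links five of the six possible pairs and drops (missense mutation, TTT), which sits six tokens away. A fixed window cannot see that both belong to the same event, which is the kind of case a learned model is meant to handle.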
34
Relation Tagger
Results
Binary Tagger: 77% precision/82% recall
 Baseline: 66% precision/77% recall
Event-wide Tagger: 63% precision/77% recall
 Baseline: 43% precision/66% recall
Example: “most common base change was a A ->G transition at codon 12 or 13”
Manual annotation:
 (transition, codon 12, A, G)
 (transition, codon 13, A, G)
Automated annotation:
 (transition, codon 12, A, G)
 (transition, codon 13, A, G)
 (base change, codon 12, A, G)
 (base change, codon 13, A, G)
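The example on this slide can be scored by hand, which shows how extra automated events lower precision without touching recall. The helper below is a minimal sketch using exact tuple matching; the slides do not state the matching criterion actually used in the evaluation, so that is an assumption.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over sets of event tuples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Events copied from the slide's example sentence.
manual = [
    ("transition", "codon 12", "A", "G"),
    ("transition", "codon 13", "A", "G"),
]
automated = manual + [
    ("base change", "codon 12", "A", "G"),
    ("base change", "codon 13", "A", "G"),
]

precision, recall = precision_recall(automated, manual)
print(precision, recall)  # 0.5 1.0
```

For this one sentence the tagger recovers both manual events (recall 1.0) but also emits two "base change" events, so only half of its output matches (precision 0.5).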
35
Data Management
36
Annotation Pipeline
Stages: Document, Pretagging, POS tagging, Entity tagging, Treebanking, Propbanking, Relation tagging
Database, Normalization, Integration, Interface
37
Annotation Pipeline Carolyn Felix
38
Biomedical Annotation Database Annotation Retrieval
39
Applications: Entity Lists
What is this all good for, anyway?
Objective: To align the literature with genomic objects
Goal: Can we replicate a manually curated list of genes implicated in a biological process?
Domain: Angiogenesis
Rationale: To focus on the subset of genes implicated in the process of angiogenesis from whole-genome expression profiling
40
Applications: Entity Lists
The manual list
Genes represented on the Affy U133 chips
340 genes, identified through:
–Prior knowledge
–Literature reviews
–PubMed searches
–Gene Ontology codes
–Gene family-based inference
41
The automated list
Twelve partially specific angiogenic terms
Concordancy searching of MEDLINE: 41,276 abstracts
Trained GeneTaggerCRF with ~100 hand-annotated angiogenesis abstracts
Tagged the document set
–104,118 mentions
–22,662 non-redundant mentions
42
Applications: Entity Lists
Normalization
Human gene/alias/identifier list
–Compiled identifiers from 19 public databases
–302,976 entries
–156,860 non-redundant entries
–All entries mapped to 25,096 “official” gene symbols
Aligned normalized gene and tagged gene lists
–50.01% of entries matched a known gene term
–2,389 identified genes
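The normalization step amounts to looking up each tagged mention in an alias table that maps to official symbols. The sketch below uses a tiny hand-made table built from the KDR aliases mentioned earlier in the deck; the table contents, the case-folding rule, and the function name `normalize` are assumptions for illustration, not the project's actual resources.

```python
# Hypothetical alias table: lower-cased alias -> official gene symbol.
ALIAS_TO_SYMBOL = {
    "kdr": "KDR",
    "vegfr-2": "KDR",
    "vegf receptor-2": "KDR",
    "vegf": "VEGF",
}

def normalize(mention: str):
    """Return the official symbol for a tagged mention, or None if unmatched."""
    return ALIAS_TO_SYMBOL.get(mention.strip().lower())

# Illustrative tagger output; the last mention has no entry in the table.
mentions = ["KDR", "VEGFR-2", "VEGF receptor-2", "angiogenic factor"]

normalized = [normalize(m) for m in mentions]
matched = [symbol for symbol in normalized if symbol is not None]

match_rate = len(matched) / len(mentions)
identified_genes = sorted(set(matched))

print(f"{match_rate:.0%} of mentions matched a known gene term")
print("identified genes:", identified_genes)
```

Three surface forms collapse to one official symbol here, which is why the count of identified genes on the slide is far smaller than the count of non-redundant mentions.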
43
Applications: Entity Lists
Gene     Description                                          Frequency
VEGF     Vascular endothelial growth factor                   9688
NUDT6    Antisense basic fibroblast growth factor             1887
FGF2     Fibroblast growth factor 2 (basic)                   1463
KDR      Kinase insert domain receptor                        1287
TGFB1    Transforming growth factor, beta 1                   909
TNF      Tumor necrosis factor                                908
FLT1     Fms-related tyrosine kinase 1 (VEGF/VPF receptor)    880
MMP2     Matrix metalloproteinase 2                           598
IL8      Interleukin 8                                        571
IL28B    Interleukin 28B                                      559
PECAM1   Platelet/endothelial cell adhesion molecule          558
ECGF1    Endothelial cell growth factor 1                     545
EGF      Epidermal growth factor                              524
TP53     Tumor protein p53                                    524
THBS1    Thrombospondin 1                                     501
PTGS2    Prostaglandin-endoperoxide synthase 2                427
FN1      Fibronectin 1                                        407
IL6      Interleukin 6                                        407
44
Applications: Entity Lists
Accuracy:
–247 (72.6%) of manual genes on the automated list
–91 (26.8%) of manual genes had no literature support
–2 (0.6%) of manual genes were missed for technical reasons
–Overall, 99.2% recall
Prediction:
–Relevance ranked auto-tagged genes by number of mentions
–Evaluated the top 40 NOT on the manual list
–All 40 appear to be legitimate angiogenesis-related genes
Gene Ontology (GO): 42 human genes associated with “angiogenesis” or related terms
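Both parts of this slide reduce to small computations. The sketch below ranks genes by mention count and filters out those already on a manual list, then reproduces the recall figure under the assumption that the 91 genes without literature support are excluded from the denominator. The five counts come from the frequency table in this deck; the membership of `manual_list` is assumed for illustration.

```python
def rank_novel_genes(mention_counts, manual_list, top_n):
    """Rank genes by mention count, keeping only those not on the manual list."""
    ranked = sorted(mention_counts.items(), key=lambda item: item[1], reverse=True)
    return [gene for gene, _ in ranked if gene not in manual_list][:top_n]

# Mention counts from the frequency table; manual-list membership is assumed.
mention_counts = {"VEGF": 9688, "NUDT6": 1887, "FGF2": 1463, "KDR": 1287, "TNF": 908}
manual_list = {"VEGF", "FGF2", "KDR"}

novel = rank_novel_genes(mention_counts, manual_list, top_n=2)
print(novel)  # ['NUDT6', 'TNF']

# Recall over manual genes that had literature support (assumed denominator).
found, missed = 247, 2
recall = found / (found + missed)
print(f"recall = {recall:.1%}")
```

Counting only the 249 manual genes that the literature could have supported gives 247/249, which matches the 99.2% on the slide.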
45
Gene     Description                                  Frequency
NUDT6    Antisense basic fibroblast growth factor     1887
TNF      Tumor necrosis factor                        908
IL28B    Interleukin 28B                              559
EGF      Epidermal growth factor                      524
TP53     Tumor protein p53                            524
FN1      Fibronectin 1                                407
IL6      Interleukin 6                                407
CD34     CD34 antigen                                 384
EGFR     Epidermal growth factor receptor             373
IL1B     Interleukin 1, beta                          323
PCNA     Proliferating cell nuclear antigen           277
SOS1     Son of sevenless homolog 1                   243
FGF1     Fibroblast growth factor 1 (acidic)          239
TM7SF2   Transmembrane 7 superfamily member 2         230
GALGT2   4-GalNAc transferase                         229
PRAP1    Proline-rich acidic protein 1                219
BMP6     Bone morphogenetic protein 6                 202
BCL2     B-cell CLL/lymphoma 2                        201
46
Applications: Directed Retrieval
Locus-specific Databases: Repositories of recorded mutation information
–> 300 human genes
–> 100 databases
–Highly curated
–Limited resources
CDKN2A database: Somatic and germline p16 mutations
–Over 1400 mutation instances
–Primarily identified through manual literature perusal
–Large and inefficient effort
–< 20% of identified articles contain mutation instances
47
Applications: Directed Retrieval
Experiment: Identify mutation instance-containing articles from “relevant” articles
Literature search of PubMed using p16 key words:
–418 articles (1/2000 to 6/2002)
–78 articles contained mutation data (experts)
Training
–218 articles
–Logistic regression classifier
–Features: words and word pairs
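To make "words and word pairs" concrete, the sketch below extracts both from a text and scores them with a logistic function. It assumes word pairs are adjacent words, and the weights and bias are invented for illustration; the slides do not give the real tokenizer, pairing scheme, or trained parameters.

```python
import math
import re
from collections import Counter

def extract_features(text: str) -> Counter:
    """Count single words and adjacent word pairs (pairing scheme is assumed)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    pairs = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return Counter(tokens + pairs)

def predict_probability(features: Counter, weights: dict, bias: float) -> float:
    """Logistic regression score: sigmoid of the weighted feature sum."""
    z = bias + sum(weights.get(name, 0.0) * count for name, count in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights; a trained model would learn these from the 218 articles.
weights = {"mutation": 1.2, "germline mutation": 0.8}
bias = -1.0

features = extract_features("p16 germline mutation detected")
probability = predict_probability(features, weights, bias)

print(sorted(features))
print(f"P(contains mutation data) = {probability:.2f}")
```

Word pairs let the model weight a phrase such as "germline mutation" separately from its individual words, which single-word features alone cannot express.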
48
Applications: Directed Retrieval
Evaluation
Experts
–Identified 200 candidate articles
–32 articles contained mutation information
–16% precision; ~100%(?) recall; F-measure 0.28
Algorithm
–Predicted that 88 of the 200 articles contained relevant info
–29 of 32 with relevant info identified
–44% precision; 91% recall; F-measure 0.59
–Second random trial: comparable results
Relevance ranking: Associated with value
–In progress: refinement of relevance with text annotations
Conclusion: automation significantly reduces workload
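The F-measures on this slide follow from the stated precision and recall values if F is taken as their balanced harmonic mean (F1), which is an assumption since the slide does not name the variant. A short check:

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Experts: 32 of 200 candidate articles held mutation data; recall taken as 100%.
expert_precision = 32 / 200
expert_f = f_measure(expert_precision, 1.0)

# Algorithm: precision as stated on the slide; recall from 29 of 32 found.
algorithm_recall = 29 / 32
algorithm_f = f_measure(0.44, 0.91)

print(f"experts:   P={expert_precision:.2f} F={expert_f:.2f}")
print(f"algorithm: R={algorithm_recall:.2f} F={algorithm_f:.2f}")
```

Because the harmonic mean is pulled toward the lower of the two values, the experts' perfect recall cannot offset 16% precision, while the classifier's more balanced figures give roughly double the F-measure.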
49
The Global Challenge
What is MYCN? What is MYCN related to? How?
Genes, Proteins, Pathways, Cells, Tissues, Phenotypes, Traits, Diseases, Behaviors, Environment
50
Integration: Genome, Literature, Cell, Disease
MYCN: Genomic position, Genomic context, Known alteration, Cellular location, Protein function, Cell type, Disease association, Clinical observation, Symptom, Environmental factor
51
Resources
BioIE group: http://bioie.ldc.upenn.edu/
Resources: http://bioie.ldc.upenn.edu/index.jsp?page=doc_resources.html
Documentation: http://bioie.ldc.upenn.edu/index.jsp?page=doc_users.html
Software/Tools: http://bioie.ldc.upenn.edu/index.jsp?page=doc_soft_tools.htm
52
Contributors
University of Pennsylvania: Avik Basu, Ann Bies, Christine Brisson, Dan Caroff, Hareesh Chandrupatla, Melissa Demian, Jacqueline Ewing, Nadeene Francesco, Hubert Jin, Aravind Joshi, Sanipa Koetswawasdi, Seth Kulick, Jeremy LaCivita, Justin Lacasse, Matt Leger, Alexis Lerro, Mark Liberman, Mark Mandel, Mark Manocchio, Mitch Marcus, Ryan McDonald, Tom Morton, Grace Mrowicki, Sina Neshatian, Ben Newman, Michael Noda, Martha Palmer, Eric Pancoast, Anita Patel, Fernando Pereira, Ariel Richmond, Karen Rudo, Andrew Schein, Mike Schultz, Jonathan Schwartz, Amanda van Scoyoc, Nilay Shah, Sarah Stippich, Sabrina Sumner, Rachel Swetz, Partha Talukdar, Julie Wang, Colin Warner, Christopher Wright, Johanna Wright, Dalal Zakhary, Ramez Zakhary
University of Vermont: Claire Anduka, Mark Greenblatt, Joan Murphy, Amy Rodgers
Sanger Institute: Sally Bamford, Elisabeth Dawson, Jon Teague, Richard Wooster
CHOP: Shannon Davis, Jayanti Jagannathan, Yang Jin, Jessica Kim, Jeremy Lautman, Pete White, Scott Winters, Garrett Brodeur, Mike Hogarty, John Maris