Published by Jeffrey Gregory. Modified over 9 years ago.
1
Daejeon, 2005 Alfonso Valencia CNB-CSIC
Text mining in Bioinformatics
The First International Symposium on Languages in Biology and Medicine (LMB2005), KAIST, Daejeon, South Korea, Nov. 2005
Alfonso Valencia, Centro Nacional de Biotecnología - CSIC
2
Proteomics; Predicted networks; Literature; Functional Genomics
3
Specific proteins: Cdks, cyclins, kinases
(Mouse model) Cdks are not essential; Cdk1 can replace the others.
4
Residues that determine the dimerization of chemokine receptors
Two residues are able to control CCR5 chemokine receptor dimerization, both in vitro and in vivo, blocking CCL5-induced responses in human cell lines and in primary T cells.
Hernanz-Falcon, et al., Nat Immunol. 2004; Bioinform. 2005
Del Sol, Pazos, Valencia, JMB 2003
5
Buchnera aphidicola genome
http://www.pdg.cnb.uam.es/fabascal/Buch_ORFand_www/
6
Query protein > Standard sequence searches > Similar proteins > Protein groups
rab subfamily: rab (M. musculus), rab (C. elegans), rab (H. sapiens)
ras subfamily: ras (H. sapiens), ras (M. musculus), ras (C. elegans), ras2 (H. sapiens)
The two subfamilies are paralogous to each other.
Homologous proteins: similar functions?
by F. Abascal
7
Genequiz flowchart, by The Genequiz consortium 1995-2002
8
Annotation workflows (INB)
9
Sequence-based function prediction
[Plot: axes "Identity class (%)" and "% conservation"; series: 4th E.C. digit]
Del Pozo, Valencia 2004; Valencia, Curr. Op. Struc. Biol. 05
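The kind of curve this slide plots can be reproduced in outline as follows. This is only a minimal sketch, not the procedure of the cited papers: it assumes pairs of aligned proteins with a known percent identity and EC number each, bins them into 10% identity classes, and reports how often the EC numbers agree up to the 4th digit. The function name `ec_conservation_by_identity` and the toy pairs are hypothetical.

```python
from collections import defaultdict


def ec_conservation_by_identity(pairs, level=4, bin_width=10):
    """Return {bin_start: % of pairs whose EC numbers agree up to `level` digits}.

    `pairs` is an iterable of (percent_identity, ec_a, ec_b) tuples.
    """
    agree = defaultdict(int)
    total = defaultdict(int)
    for identity, ec_a, ec_b in pairs:
        # An identity of exactly 100% is placed in the top bin (90-100).
        bin_start = min(int(identity // bin_width) * bin_width, 100 - bin_width)
        total[bin_start] += 1
        if ec_a.split(".")[:level] == ec_b.split(".")[:level]:
            agree[bin_start] += 1
    return {b: 100.0 * agree[b] / total[b] for b in sorted(total)}


# Hypothetical toy data: (percent identity, EC of protein A, EC of protein B).
toy_pairs = [
    (95.0, "2.7.11.22", "2.7.11.22"),
    (92.0, "2.7.11.22", "2.7.11.22"),
    (45.0, "2.7.11.22", "2.7.11.1"),
    (42.0, "3.4.21.4", "3.4.21.4"),
    (25.0, "1.1.1.1", "1.1.1.2"),
]

if __name__ == "__main__":
    for bin_start, pct in ec_conservation_by_identity(toy_pairs).items():
        print(f"{bin_start}-{bin_start + 10}% identity: {pct:.0f}% conserved")
```

Lowering `level` to 3 or 2 gives the corresponding curves for the coarser E.C. digits.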
10
Sliding window approach
Krallinger, Valencia, Drug Discovery Today 2005
11
BioCreAtIvE
C. Blaschke and A. Valencia: CNB-CSIC
L. Hirschman and A. Yeh: MITRE
R. Apweiler, E. Camon, et al.: GOA (EBI)
C. Wu: PIR
J. Blake: MGI
J. Wilbur and L. Tanabe: NCBI
L. Grivell: EMBO
Full-text access: HighWire Press
EMBO Evaluation Workshop, April 2004, Granada, Spain
http://www.pdg.cnb.uam.es/BioLINK
Task 1: Extraction of gene or protein names from text, and their mapping into standardized gene identifiers for fly, mouse, yeast.
1a.- Gene list annotation (creating a list of genes mentioned in abstracts). Useful for indexing. “a number of systems (4) were able to extract general gene names from sentences of MEDLINE abstracts at over 80% balanced precision and recall”
1b.- Gene name mentions. Corresponds to the “named entity” task in natural language processing. “the results ranged from a high for yeast of 92% balanced precision and recall, to somewhat lower scores for fly (82%) and mouse (79%)”
BioCreAtIvE © Results, methods, and evaluation papers published in BMC Bioinformatics 2005.
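For readers unfamiliar with the scores quoted on this slide, the sketch below shows how precision, recall and their balanced combination are computed for a gene list. Reading "balanced precision and recall" as the balanced F-measure (harmonic mean) is an assumption here, and the gene lists are hypothetical toy data, not BioCreAtIvE submissions.

```python
def precision_recall_f1(predicted, gold):
    """Compare a predicted gene list against a gold-standard list.

    Returns (precision, recall, balanced F-measure), each in [0, 1].
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy example: genes annotated by curators vs. genes a system found.
gold_genes = ["CDC28", "CLN1", "CLN2", "SWI4", "SIC1"]
system_genes = ["CDC28", "CLN1", "CLN2", "SWI4", "ACT1"]

if __name__ == "__main__":
    p, r, f = precision_recall_f1(system_genes, gold_genes)
    print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Four of the five predictions are correct and four of the five gold genes are recovered, so both precision and recall are 0.8 in this toy case.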
12
Evolution of gene names
[Plot: gene names versus years]
The evolution of gene names over time is a “scale free” process:
- “critical state” system
- the evolution of a gene name cannot be predicted
- some gene names act as attractors of other names
Hoffmann, Valencia TIGs 2003
13
Text mining in a nutshell
1. Protein / gene names: 80% prec/recall (BioCreative)
   Interspecies: far less than that
   Linking to DBs: essential (not NLP)
2. Relations: easy on the surface
   Protein-protein: best known one (accessible?)
   Others (regulation, drugs): dictionaries
   Function: very difficult (i.e. GO in BioCreative)
3. Type of relation: semantic
   Proteins: summaries very difficult
   Metabolic pathways: new challenge, unexplored
Hoffmann et al., Science STKE 2005
Krallinger et al., Genome Biology 2005
Krallinger et al., DDToday 2005
14
SUISEKI
Selection of the text corpus: PubMed, 12M entries
Extraction of protein names
Selecting terms that indicate interaction. Action words are, for example: activate, associated with, bind, interact, phosphorylate, regulate
Rules (frames) to identify the interactions: [protein A] ... verb indicating an action ... [protein B]
“After extensive purification, Cdk2 was still bound to cyclin D1”
Extraction of the interactions
Human expert manipulation
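The frame "[protein A] ... verb indicating an action ... [protein B]" can be illustrated with a few regular expressions. This is a minimal sketch under stated assumptions, not the SUISEKI implementation: the protein dictionary `PROTEIN_NAMES` is a hypothetical stand-in for the protein-name extraction step, the inflection patterns in `ACTION_PATTERNS` are hand-written for the six action words listed on the slide, and the single rule accepts any action word located between two protein names in the same sentence.

```python
import re

# Hypothetical mini-dictionary standing in for the protein-name extraction step.
PROTEIN_NAMES = ["Cdk2", "cyclin D1", "cyclin E", "p27"]

# Action words from the slide, with hand-written inflections (an assumption).
ACTION_PATTERNS = {
    "activate": r"activat(?:e|es|ed|ing)",
    "associated with": r"associat(?:e|es|ed|ing) with",
    "bind": r"(?:bind|binds|binding|bound)(?: to)?",
    "interact": r"interact(?:s|ed|ing)?(?: with)?",
    "phosphorylate": r"phosphorylat(?:e|es|ed|ing)",
    "regulate": r"regulat(?:e|es|ed|ing)",
}


def find_mentions(sentence, patterns):
    """Return sorted (start, end, label) tuples for every pattern match."""
    hits = []
    for label, pattern in patterns.items():
        regex = r"\b" + pattern + r"\b"
        for match in re.finditer(regex, sentence, flags=re.IGNORECASE):
            hits.append((match.start(), match.end(), label))
    return sorted(hits)


def extract_interactions(sentence):
    """Apply the frame: protein A, then an action word, then protein B."""
    protein_patterns = {name: re.escape(name) for name in PROTEIN_NAMES}
    proteins = find_mentions(sentence, protein_patterns)
    actions = find_mentions(sentence, ACTION_PATTERNS)
    found = []
    for i, (_, a_end, protein_a) in enumerate(proteins):
        for b_start, _, protein_b in proteins[i + 1:]:
            for v_start, v_end, action in actions:
                # The action word must lie between the two protein mentions.
                if a_end <= v_start and v_end <= b_start:
                    found.append((protein_a, action, protein_b))
    return found


if __name__ == "__main__":
    example = "After extensive purification, Cdk2 was still bound to cyclin D1"
    print(extract_interactions(example))
```

A rule this permissive ignores negation and long-distance distractors, which is one reason the slide lists human expert manipulation as a separate step.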
15
Blaschke, Valencia, IEEE 2002
16
Hoffmann et al., Science STKE 2005
Krallinger et al., Genome Biology 2005
Krallinger et al., Drug Disc. Today 2005
17
VISIT: iHOP
Hoffmann, Valencia, Nat Genet 2004
22
iHOP inside
Hoffmann, Valencia, Bioinform. 2005
23
Hermjakob, et al., IntAct: an open source molecular interaction database. Nucleic Acids Res. 2004
25
HCAD Chromosomal Translocations Database
Hoffmann et al., NAR 2004
26
BioCreAtIvE
C. Blaschke and A. Valencia: CNB-CSIC
L. Hirschman and A. Yeh: MITRE
R. Apweiler, E. Camon, et al.: GOA (EBI)
C. Wu: PIR
J. Blake: MGI
J. Wilbur and L. Tanabe: NCBI
L. Grivell: EMBO
Full-text access: HighWire Press
EMBO Evaluation Workshop, April 2004, Granada, Spain
http://www.pdg.cnb.uam.es/BioLINK
Task 1: Extraction of gene / protein names from text, mapping to identifiers (fly, mouse, yeast)
1a.- Identification of a list of genes in text (indexing). 4 systems at 80% balanced precision and recall.
1b.- Gene name mentions linked to DB entries (entity identification task in NLP). Yeast 92% balanced precision and recall, fly (82%) and mouse (79%).
Task 2: GO to protein via text for a collection of human genes.
2a.- Text piece for a given GO and protein (identification). Best systems with large coverage: approx. 23% correct identification.
2b.- Find the GO and text for a list of proteins. Best systems cover most proteins with 20% correct identification.
BioCreAtIvE © Results, methods, and evaluation papers published in BMC Bioinformatics 2005.
27
BioCreative Task 2a
Krallinger et al., BMC Bioinfo. 05
28
Correlation GO - Protein spaces (sub-tags)
Krallinger, Padron, et al., 2005
29
Protein name sub-tags:
1- Original protein name
2- Heuristic typographical variants
3- Variants from external links to DBs
4- Protein name forming word types
5- External links forming word types
6- GOBO sequence terms
7- GOBO mutation event terms
GO term sub-tags:
1- Original GO term
2- NL variants of GO term
3- GO term forming word types
4- GO term definition word types
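Sub-tag 2 for protein names, heuristic typographical variants, can be pictured with a small generator. The slide does not say which heuristics were used, so the rules below (varying case, and joining letter and digit runs with nothing, a space or a hyphen) are assumptions chosen for illustration, and `typographical_variants` is a hypothetical helper name.

```python
import re


def typographical_variants(name):
    """Generate simple spelling variants of a protein name.

    Assumed heuristics: split the name into runs of letters and runs of
    digits, rejoin them with "", " " or "-", and vary the case.
    Characters other than ASCII letters and digits are dropped from the
    generated variants; the original name is always kept.
    """
    tokens = re.findall(r"[A-Za-z]+|\d+", name)
    variants = {name}
    for joiner in ("", " ", "-"):
        joined = joiner.join(tokens)
        variants.update({joined, joined.lower(), joined.upper()})
    return sorted(variants)


if __name__ == "__main__":
    for variant in typographical_variants("Cdk2"):
        print(variant)
```

Each variant can then be tagged with the sub-tag that produced it, so that matches on the original name can be weighted differently from matches on a generated variant.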
30
Correlation GO - Protein spaces (sub-tags)
Krallinger, Padron, et al., 2005
32
Interface for the EBI GO team during the BioCreative evaluation
33
Text mining in a nutshell
1. Protein / gene names: 80% prec/recall (BioCreative)
   1. Interspecies: far less than that
   2. Linking to DBs: essential (not NLP)
2. Relations: easy on the surface
   1. Protein-protein: best known one (accessible?)
   2. Others (regulation, drugs): dictionaries
   3. Function: very difficult (to GO, BioCreative)
3. Type of relation: semantic
   1. Proteins: summaries very difficult
   2. Metabolic pathways: new challenge, unexplored
4. Concepts for groups of genes: knowledge discovery
   1. Existing: summaries and generalization
   2. Creating new ones: not yet
Hoffmann et al., Science STKE 2005
Krallinger et al., Genome Biology 2005
Krallinger et al., DDToday 2005
34
Experiment: Iyer et al. (1999) Science 283, 83-87
17 genes (PCNA, CDC2, MSH2, LBR, TOP2A, ...): Cell cycle
24 genes (ABCA5, CAT, ELF2, PIM1, WNT2, ...): Unknown
Words: Meiosis, Cyclin, Checkpoint, Interphase, Nucleoplasma, Division, Histone, Replication, Chromatid
Words: Dipeptidyl, Prolyl, nmr, Collagen-binding
GO codes: DNA replication, DNA metabolism, Cell cycle control
Sentences:
PCNA-MSH2: The binding of PCNA to MSH2 may reflect linkage between mismatch repair and replication.
LBR-CDC2: LBR undergoes mitotic phosphorylation mediated by p34(cdc2) protein kinase.
Blaschke, et al., Funct. Integ. Genomics 2001
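One simple way to pick the words that characterize a gene cluster is to compare how often a word appears in the literature of that cluster with how often it appears across all clusters. The sketch below uses a z-score for this; it illustrates the general idea and is not claimed to be the scoring of the cited paper. The gene identifiers `g1` to `g8` and their word lists are hypothetical toy data.

```python
from statistics import mean, pstdev


def term_frequencies(gene_terms):
    """Fraction of genes in one cluster whose literature mentions each term."""
    counts = {}
    for terms in gene_terms.values():
        for term in set(terms):
            counts[term] = counts.get(term, 0) + 1
    return {term: count / len(gene_terms) for term, count in counts.items()}


def term_zscores(clusters):
    """Return {cluster: {term: z-score}}.

    z = (frequency in this cluster - mean over clusters) / std over clusters.
    Terms with identical frequency in every cluster get a score of 0.
    """
    freqs = {name: term_frequencies(genes) for name, genes in clusters.items()}
    vocabulary = {term for f in freqs.values() for term in f}
    scores = {name: {} for name in clusters}
    for term in vocabulary:
        values = [freqs[name].get(term, 0.0) for name in clusters]
        mu, sigma = mean(values), pstdev(values)
        for name, value in zip(clusters, values):
            scores[name][term] = (value - mu) / sigma if sigma > 0 else 0.0
    return scores


# Hypothetical toy data: words found in the abstracts linked to each gene.
toy_clusters = {
    "c1": {"g1": ["replication", "cyclin"], "g2": ["cyclin", "checkpoint"], "g3": ["replication"]},
    "c2": {"g4": ["prolyl"], "g5": ["cyclin"], "g6": ["collagen-binding"]},
    "c3": {"g7": ["histone"], "g8": ["histone", "prolyl"]},
}

if __name__ == "__main__":
    scores = term_zscores(toy_clusters)
    for cluster, term_scores in scores.items():
        best = sorted(term_scores.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
        print(cluster, [(term, round(z, 2)) for term, z in best])
```

With only a handful of clusters the z-scores are coarse; the ranking within a cluster is more informative than the absolute values.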
36
SOTA clustering versus significance of GEISHA terms
Oliveros, Blaschke, GIW 2000 ©
37
SOTA and GEISHA mixed information
Expression-based clustering
Weight (expression) + Weight (text)
Term (text) based clustering
Blaschke, Herrero, Dopazo, Valencia 2002
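The "Weight (expression) + Weight (text)" idea can be sketched as a weighted sum of two gene-to-gene distances that any clustering method could then consume. The slide does not specify the distances or the weights, so the choices here (Pearson correlation distance for expression profiles, Jaccard distance for term sets, equal default weights) are assumptions, and SOTA itself is not implemented.

```python
import math


def pearson_distance(x, y):
    """1 - Pearson correlation between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 1.0  # flat profile: treated as uncorrelated
    return 1.0 - cov / (norm_x * norm_y)


def jaccard_distance(terms_a, terms_b):
    """1 - |intersection| / |union| of two sets of literature terms."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def combined_distance(expr_a, expr_b, terms_a, terms_b, w_expr=0.5, w_text=0.5):
    """Weighted sum of the expression distance and the text distance."""
    return (w_expr * pearson_distance(expr_a, expr_b)
            + w_text * jaccard_distance(terms_a, terms_b))


if __name__ == "__main__":
    # Hypothetical toy genes: perfectly correlated expression, partly shared terms.
    d = combined_distance([1, 2, 3], [2, 4, 6],
                          {"cyclin", "replication"}, {"cyclin"})
    print(round(d, 3))
```

Setting `w_text=0` recovers expression-based clustering and `w_expr=0` recovers term-based clustering, the two extremes named on the slide.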
38
www.pdg.cnb.uam.es
- 14th ISMB: Fortaleza, Sept 06, http://www.iscb.org/ (Text mining SIGs, Biolink)
- The European School on Bioinformatics, BioSapiens: http://www.biosapiens.info
- Winter symposium, Bologna, Feb 2006
- Master in Bioinformatics, U. Complutense, January - July 2006: bbm1.ucm.es/masterbioinfo
www.cab.inta.es
www.inba.org
www.bioalma.com
39
Stable clusters > central processes where expression and functional information agree.
Unstable groups > contradictory information: “jumping” genes with divergent expression and functional classifications. (Genes with very unstable behavior > related to insufficient information.)