The University of Kansas, EECS 800 Research Seminar: Mining Biological Data. Instructor: Luke Huan. Fall 2006.


Mining Biological Data, KU EECS 800, Luke Huan, Fall’06, 10/16/2006: Text Mining

Slide 2: Administrative
Class presentation schedule is online.
First class presentation is “kernel based classification” by Han Bin on Nov 6th.
Project design is due Oct 30th.

Slide 3: Overview
Gene ontology
  Challenges
  What is gene ontology
  Construct gene ontology
Text mining, natural language processing and information extraction: an introduction
Summary

Slide 4: Ontology
A systematic account of existence (from philosophy).
An explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest, and the relationships that hold among them.
The hierarchical structuring of knowledge about things by subcategorising them according to their essential (or at least relevant and/or cognitive) qualities. This is an extension of the previous senses of "ontology" (above) which has become common in discussions about the difficulty of maintaining subject indices.
The philosophy of indexing everything in existence?

Slide 5: Aristotle’s ( BC) Ontology
Substance (plants, animals, ...), Quality, Quantity, Relation, Where, When, Position, Having, Action, Passion

Slide 6: Ontology and -informatics
In information sciences, ontology is better defined as “a domain of knowledge, represented by facts and their logical connections, that can be understood by a computer” (J. Bard, BioEssays, 2003).
“Ontologies provide controlled, consistent vocabularies to describe concepts and relationships, thereby enabling knowledge sharing” (Gruber, 1993).

Slide 7: Information Exchange in Bio-sciences
Basic challenges: definition, definition, definition.
What is a name?
What is a function?

Slides 8-12: Cell (figures)

Slide 13: What’s in a name?
The same name can be used to describe different concepts.

Slide 14: What’s in a name?
Glucose synthesis, glucose biosynthesis, glucose formation, glucose anabolism, gluconeogenesis: all refer to the process of making glucose from simpler components.

Slide 15: What’s in a name?
The same name can be used to describe different concepts.
A concept can be described using different names.
As a result, comparison is difficult, in particular across species or across databases.
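The naming problem on the slides above can be made concrete with a small sketch. It assumes a hand-written synonym table in which "gluconeogenesis" is treated as the canonical name; that choice, the function name `normalize_term`, and the two example label lists are illustrative assumptions, not part of the lecture.

```python
# Hypothetical synonym table built from the names listed on the slide.
# Treating "gluconeogenesis" as the canonical term is an assumption
# made for illustration only.
CANONICAL_TERM = "gluconeogenesis"
SYNONYMS = {
    "glucose synthesis": CANONICAL_TERM,
    "glucose biosynthesis": CANONICAL_TERM,
    "glucose formation": CANONICAL_TERM,
    "glucose anabolism": CANONICAL_TERM,
    "gluconeogenesis": CANONICAL_TERM,
}


def normalize_term(name: str) -> str:
    """Map a free-text name to its canonical term (or a cleaned copy)."""
    key = " ".join(name.lower().split())  # lowercase, collapse whitespace
    return SYNONYMS.get(key, key)


# Two made-up databases that label the same process differently.
labels_db_a = ["Glucose synthesis", "glucose anabolism"]
labels_db_b = ["Gluconeogenesis", "Glucose  formation"]

shared = ({normalize_term(n) for n in labels_db_a}
          & {normalize_term(n) for n in labels_db_b})
print(shared)
```

Without the normalization step the two label sets share no string at all, which is the comparison difficulty the slide describes.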

Slide 16: What is Function? The Hammer Example
Function (what): Process (why)
Drive nail (into wood): Carpentry
Drive stake (into soil): Gardening
Smash roach: Pest Control
Clown’s juggling object: Entertainment

Slide 17: Information Explosion (figure)

Slide 18: Entering the Genome Sequencing Era
Eukaryotic genome sequences (columns: genome, year, size in Mb, number of genes)
Yeast (S. cerevisiae) ,000
Worm (C. elegans) ,100
Fly (D. melanogaster) ,600
Plant (A. thaliana) ,500
Human (H. sapiens, 1st draft) 2001 ~3000 ~35,000

Slide 19: What is the Gene Ontology?
A Common Language for Annotation of Genes from Yeast, Flies and Mice ...and Plants and Worms ...and Humans ...and anything else!

Slide 21: What is the Gene Ontology?
Gene annotation system.
Controlled vocabulary that can be applied to all organisms.
Organism independent.
Used to describe gene products (proteins and RNA) in any organism.

Slide 22: The 3 Gene Ontologies
Molecular Function = elemental activity/task: the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity.
Biological Process = biological goal or objective: broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions.
Cellular Component = location or complex: subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme.

Slides 23-26: Cellular Component
Where a gene product acts.
Enzyme complexes in the component ontology refer to places, not activities.

Slide 27: Molecular Function
insulin binding
insulin receptor activity

Slide 28: Molecular Function
Activities or “jobs” of a gene product.
glucose-6-phosphate isomerase activity

Slide 29: Molecular Function
A gene product may have several functions; a function term refers to a single reaction or activity, not a gene product.
Sets of functions make up a biological process.

Slide 30: Biological Process
A commonly recognized series of events.
cell division

Slide 31: Biological Process
transcription

Slide 32: Biological Process
Metabolism: degradation or synthesis of biomolecules.

Slide 33: Biological Process
Development: how a group of cells becomes a tissue.

Slide 34: Biological Process
courtship behavior

Slide 35: Ontology applications
Can be used to:
Formalise the representation of biological knowledge.
Standardise database submissions.
Provide unified access to information through ontology-based querying of databases, both human and computational.
Improve management and integration of data within databases.
Facilitate data mining.

Slide 36: Gene Ontology Structure
Ontologies can be represented as directed acyclic graphs (DAGs), where the nodes are connected by edges.
Nodes = terms in biology.
Edges = relationships between the terms (is-a, part-of).

Slide 37: Parent-Child Relationships
Chromosome: cytoplasmic chromosome, mitochondrial chromosome, plastid chromosome, nuclear chromosome.
A child is a subset or an instance of its parent’s elements.

Slide 38: Parent-Child Relationships
cell membrane chloroplast mitochondrial chloroplast membrane (figure; edges are labeled is-a and part-of)
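The DAG structure described on these slides can be sketched in a few lines. This is a minimal illustration, not the Gene Ontology's own data model: the class name `OntologyDAG`, its methods, and the specific edges below (including the "nucleus" parent) are assumptions chosen to show that a term may have more than one parent and that ancestors are found by following is-a and part-of edges upward.

```python
from collections import defaultdict


class OntologyDAG:
    """Terms are nodes; each edge points from a child to a parent."""

    RELATIONS = ("is-a", "part-of")

    def __init__(self):
        # child term -> list of (relation, parent term)
        self.parents = defaultdict(list)

    def add_edge(self, child, relation, parent):
        if relation not in self.RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.parents[child].append((relation, parent))

    def ancestors(self, term):
        """Return every term reachable by following edges upward."""
        seen = set()
        stack = [term]
        while stack:
            current = stack.pop()
            for _relation, parent in self.parents.get(current, []):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Illustrative edges only; the exact hierarchy is an assumption.
dag = OntologyDAG()
dag.add_edge("cytoplasmic chromosome", "is-a", "chromosome")
dag.add_edge("mitochondrial chromosome", "is-a", "cytoplasmic chromosome")
dag.add_edge("nuclear chromosome", "is-a", "chromosome")
dag.add_edge("nuclear chromosome", "part-of", "nucleus")  # second parent

print(dag.ancestors("mitochondrial chromosome"))
print(dag.ancestors("nuclear chromosome"))
```

Because "nuclear chromosome" has two parents, the structure is a graph rather than a tree, which is why the slides describe it as a DAG.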

Slide 39: Annotation in GO
A gene product is usually a protein but can be a functional RNA.
An annotation is a piece of information associated with a gene product.
A GO annotation is a Gene Ontology term associated with a gene product.

Slide 40: Terms, Definitions, IDs
Term: MAPKKK cascade (mating sensu Saccharomyces)
Goid: GO:
Definition: OBSOLETE. MAPKKK cascade involved in transduction of mating pheromone signal, as described in Saccharomyces.
Evidence code: how annotation is done
Definition_reference: PMID:

Slide 41: Annotation Example
Gene product: nek2
GO term: centrosome (GO: )
Reference: PMID:
Evidence code: IDA (Inferred from Direct Assay)

Slides 42-44: GO Annotation (figures)

Slide 45: Evidence Code
Indicates the type of evidence in the cited source that supports the association between the gene product and the GO term.

Slide 46: Types of evidence codes
Experimental codes: IDA, IMP, IGI, IPI, IEP
Computational codes: ISS, IEA, RCA, IGC
Author statement: TAS, NAS
Other codes: IC, ND
Two types of annotation: manual annotation and electronic annotation.
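As a rough sketch, an annotation and its evidence code can be held in a small record. The grouping of codes follows the slide; the `Annotation` class, its field names, and the rule that only IEA counts as electronic are assumptions for illustration (the IEA slide notes "no manual checking"). The reference is left as a placeholder because the PMID is not given in the transcript.

```python
from dataclasses import dataclass

# Code groups as listed on the slide.
EVIDENCE_CATEGORIES = {
    "experimental": ("IDA", "IMP", "IGI", "IPI", "IEP"),
    "computational": ("ISS", "IEA", "RCA", "IGC"),
    "author statement": ("TAS", "NAS"),
    "other": ("IC", "ND"),
}
CODE_TO_CATEGORY = {
    code: category
    for category, codes in EVIDENCE_CATEGORIES.items()
    for code in codes
}


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record type; field names are illustrative."""
    gene_product: str
    go_term: str
    evidence_code: str
    reference: str

    def category(self) -> str:
        return CODE_TO_CATEGORY[self.evidence_code]

    def is_electronic(self) -> bool:
        # Assumption: IEA is the only code assigned without manual checking.
        return self.evidence_code == "IEA"


example = Annotation(
    gene_product="nek2",
    go_term="centrosome",
    evidence_code="IDA",
    reference="PMID:(not given in the transcript)",
)
print(example.category(), example.is_electronic())
```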

Slide 47: IDA: Inferred from Direct Assay
Direct assay for the function, process, or component indicated by the GO term.
Enzyme assays
In vitro reconstitution (e.g. transcription)
Immunofluorescence (for cellular component)
Cell fractionation (for cellular component)

Slide 48: IMP: Inferred from Mutant Phenotype
Variations or changes such as mutations or abnormal levels of a single gene product.
Gene/protein mutation
Deletion mutant
RNAi experiments
Specific protein inhibitors
Allelic variation

Slide 49: IGI: Inferred from Genetic Interaction
Any combination of alterations in the sequence or expression of more than one gene or gene product.
Traditional genetic screens: suppressors, synthetic lethals
Functional complementation
Rescue experiments
An entry in the ‘with’ column is recommended.

Slide 50: IPI: Inferred from Physical Interaction
Any physical interaction between a gene product and another molecule, ion, or complex.
2-hybrid interactions
Co-purification
Co-immunoprecipitation
Protein binding experiments
An entry in the ‘with’ column is recommended.

Slide 51: IEP: Inferred from Expression Pattern
Timing or location of expression of a gene.
Transcript levels: Northerns, microarray
Exercise caution when interpreting expression results.

Slide 52: ISS: Inferred from Sequence or structural Similarity
Sequence alignment, structure comparison, or evaluation of sequence features such as composition.
Sequence similarity
Recognized domains/overall architecture of protein
An entry in the ‘with’ column is recommended.

Slide 53: RCA: Inferred from Reviewed Computational Analysis
Non-sequence-based computational method.
Large-scale experiments: genome-wide two-hybrid, genome-wide synthetic interactions
Integration of large-scale datasets of several types
Text-based computation (text mining)

Slide 54: IGC: Inferred from Genomic Context
Chromosomal position.
Most often used for bacteria (operons).
Direct evidence for a gene being involved in a process is minimal, but for surrounding genes in the operon, the evidence is well-established.

Slide 55: IEA: Inferred from Electronic Annotation
Depends directly on computation or automated transfer of annotations from a database.
Hits from BLAST searches
InterPro2GO mappings
No manual checking.
Entry in ‘with’ column is allowed (e.g. sequence ID).

Slide 56: TAS: Traceable Author Statement
The publication used to support an annotation doesn't show the evidence.
Review article
Text mining!
Would be better to track down the cited reference and use an experimental code.

Slide 57: NAS: Non-traceable Author Statement
Statements in a paper that cannot be traced to another publication.

Slide 58: ND: No biological Data available
Can find no information supporting an annotation to any term.
Indicates that a curator has looked for info but found nothing.
Place holder
Date

Slide 59: IC: Inferred by Curator
The annotation is not supported by direct evidence, but can be reasonably inferred from other GO annotations for which evidence is available.
Example: evidence = transcription factor (function); IC = nucleus (component).

Slide 60: Choosing the correct evidence code
Ask yourself: what is the experiment that was done?
Text mining can help you review papers faster!

Slide 61: Beyond GO: Open Biomedical Ontologies
Orthogonal to existing ontologies to facilitate combinatorial approaches.
Share unique identifier space.
Include definitions.

Slide 62: Gene Ontology and Text Mining
Derive ontology from text data.
More general goal: understand text data automatically.

Slide 63: Finding GO terms
“In this study, we report the isolation and molecular characterization of the B. napus PERK1 cDNA, that is predicted to encode a novel receptor-like kinase. We have shown that like other plant RLKs, the kinase domain of PERK1 has serine/threonine kinase activity. In addition, the location of a PERK1-GFP fusion protein to the plasma membrane supports the prediction that PERK1 is an integral membrane protein…these kinases have been implicated in early stages of wound response…”
Process: response to wounding (GO: )
Function: protein serine/threonine kinase activity (GO: )
Component: integral to plasma membrane (GO: )
…for B. napus PERK1 protein (Q9ARH1)
PubMed ID:

Slide 64: Mining Text Data
Data Mining / Knowledge Discovery: Structured Data, Multimedia, Free Text, Hypertext
Structured data: HomeLoan ( Loanee: Frank Rizzo; Lender: MWF; Agency: Lake View; Amount: $200,000; Term: 15 years )
Free text: Frank Rizzo bought his home from Lake View Real Estate in He paid $200,000 under a 15-year loan from MW Financial.
Hypertext: Frank Rizzo Bought this home from Lake View Real Estate In
Multimedia: Loans($200K,[map],...)
(Taken from ChengXiang Zhai, CS 397cxz, UIUC, CS, Fall 2003)

Slide 65: Bag-of-Tokens Approaches
Documents: Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or …
Feature extraction
Token sets: nation: 5, civil: 1, war: 2, men: 2, died: 4, people: 5, Liberty: 1, God: 1, …
Loses all order-specific information! Severely limits context!
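A bag-of-tokens representation can be sketched with the standard library alone. The function name `bag_of_tokens` and the simple lowercase, letters-only tokenizer are assumptions for illustration; the counts printed here cover only the excerpt quoted on the slide, so they are smaller than the whole-speech counts shown there.

```python
import re
from collections import Counter


def bag_of_tokens(text: str) -> Counter:
    """Count lowercase alphabetic tokens, discarding word order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


excerpt = (
    "Four score and seven years ago our fathers brought forth on this "
    "continent, a new nation, conceived in Liberty, and dedicated to the "
    "proposition that all men are created equal. Now we are engaged in a "
    "great civil war, testing whether that nation"
)
bag = bag_of_tokens(excerpt)
print(bag["nation"], bag["civil"], bag["war"])

# Word order is lost: these two sentences get identical bags.
same = bag_of_tokens("dog chases boy") == bag_of_tokens("boy chases dog")
print(same)
```

The last comparison shows the limitation named on the slide: two sentences with opposite meanings become indistinguishable once order is dropped.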

Slide 66: Natural Language Processing
Sentence: A dog is chasing a boy on the playground
Lexical analysis (part-of-speech tagging): Det Noun Aux Verb Det Noun Prep Det Noun
Syntactic analysis (parsing): Noun Phrase, Complex Verb, Noun Phrase, Prep Phrase, Verb Phrase, Sentence
Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).
Inference: adding the rule Scared(x) if Chasing(_,x,_). yields Scared(b1).
Pragmatic analysis (speech act): A person saying this may be reminding another person to get the dog back…
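The inference step on this slide can be written out as a tiny rule application. Representing facts as tuples and hard-coding the single rule `Scared(x) if Chasing(_, x, _)` in a function called `infer_scared` are assumptions for illustration; a real system would use a general logic engine.

```python
# Facts from the semantic analysis, stored as (predicate, arg1, ...) tuples.
facts = {
    ("Dog", "d1"),
    ("Boy", "b1"),
    ("Playground", "p1"),
    ("Chasing", "d1", "b1", "p1"),
}


def infer_scared(known_facts):
    """Apply the rule Scared(x) if Chasing(_, x, _) to every fact."""
    derived = set()
    for fact in known_facts:
        if fact[0] == "Chasing" and len(fact) == 4:
            _predicate, _chaser, chased, _place = fact
            derived.add(("Scared", chased))
    return derived


print(infer_scared(facts))
```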

Slide 67: General NLP: Too Difficult!
Word-level ambiguity
  “design” can be a noun or a verb (ambiguous POS)
  “root” has multiple meanings (ambiguous sense)
Syntactic ambiguity
  “natural language processing” (modification)
  “A man saw a boy with a telescope.” (PP attachment)
Anaphora resolution
  “John persuaded Bill to buy a TV for himself.” (himself = John or Bill?)
Presupposition
  “He has quit smoking.” implies that he smoked before.
Humans rely on context to interpret (when possible). This context may extend beyond a given document!

Slide 68: Shallow Linguistics
Progress on useful sub-goals:
English lexicon
Part-of-speech tagging
Word sense disambiguation
Phrase detection / parsing

Slide 69: WordNet
An extensive lexical network for the English language.
Contains over 138,838 words.
Several graphs, one for each part-of-speech.
Synsets (synonym sets), each defining a semantic sense.
Relationship information (antonym, hyponym, meronym …)
Downloadable for free (UNIX, Windows).
Expanding to other languages (Global WordNet Association).
Funded >$3 million, mainly government (translation interest) to George Miller, National Medal of Science,
Example network (figure): wet, watery, moist, damp; dry, parched, anhydrous, arid; linked by synonym and antonym edges.

Slide 70: Part-of-Speech Tagging
Training data (annotated text): “This sentence serves as an example of annotated text…” tagged Det N V1 P Det N P V2 N
POS tagger input: “This is a new sentence.”
POS tagger output: This/Det is/Aux a/Det new/Adj sentence/N
Pick the most likely tag sequence. Approaches:
Partial dependency (HMM)
Independent assignment
Most common tag
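Of the approaches listed, "most common tag" is the simplest to sketch: each word receives the tag it carried most often in training, and unseen words fall back to the overall most frequent tag. The first training sentence is the one from the slide; the second training sentence and the function names are hypothetical additions so that the words of the new sentence have been seen. This is a baseline, not the HMM approach.

```python
from collections import Counter, defaultdict


def train_most_common_tag(tagged_sentences):
    """Learn each word's most frequent tag plus a fallback tag."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            per_word[word.lower()][tag] += 1
            overall[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    default_tag = overall.most_common(1)[0][0]
    return lexicon, default_tag


def tag_words(words, lexicon, default_tag):
    """Tag each word independently; unseen words get the fallback tag."""
    return [lexicon.get(w.lower(), default_tag) for w in words]


training = [
    # Annotated sentence from the slide.
    [("This", "Det"), ("sentence", "N"), ("serves", "V1"), ("as", "P"),
     ("an", "Det"), ("example", "N"), ("of", "P"), ("annotated", "V2"),
     ("text", "N")],
    # Hypothetical extra training sentence (not from the slide).
    [("Text", "N"), ("is", "Aux"), ("a", "Det"), ("new", "Adj"),
     ("example", "N")],
]
lexicon, default_tag = train_most_common_tag(training)
tags = tag_words(["This", "is", "a", "new", "sentence"], lexicon, default_tag)
print(tags)
```

Because every word is tagged in isolation, this baseline cannot use the neighboring tags that an HMM would exploit.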

Slide 71: Word Sense Disambiguation
Example: “The difficulties of computational linguistics are rooted in ambiguity.” (N Aux V P N; which sense?)
Supervised learning
Features:
  Neighboring POS tags (N Aux V P N)
  Neighboring words (linguistics are rooted in ambiguity)
  Stemmed form (root)
  Dictionary/thesaurus entries of neighboring words
  High co-occurrence words (plant, tree, origin, …)
  Other senses of word within discourse
Algorithms:
  Rule-based learning (e.g. IG guided)
  Statistical learning (e.g. Naïve Bayes)
  Unsupervised learning (e.g. Nearest Neighbor)

Slide 72: Parsing
Choose the most likely parse tree. The slide shows two candidate parse trees for “A dog is chasing a boy on the playground” (figure), each labeled “Probability of this tree=”.
Probabilistic CFG
Grammar: S → NP VP; NP → Det BNP; NP → BNP; NP → NP PP; BNP → N; VP → V; VP → Aux V NP; VP → VP PP; PP → P NP; …
Lexicon: V → chasing; Aux → is; N → dog; N → boy; N → playground; Det → the; Det → a; P → on; …
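Under a probabilistic CFG, the probability of a parse tree is the product of the probabilities of the rules it uses. The sketch below assumes made-up rule probabilities (the transcript does not give the slide's values) and a nested-tuple tree encoding; both are illustrative. It compares the two attachments of "on the playground": to the verb phrase or to the noun phrase "a boy".

```python
from fractions import Fraction as F

# Hypothetical rule probabilities; the slide's actual values are not
# available. Keys are (left-hand side, right-hand side symbols).
RULE_PROB = {
    ("S", ("NP", "VP")): F(1),
    ("NP", ("Det", "BNP")): F(5, 10),
    ("NP", ("BNP",)): F(3, 10),
    ("NP", ("NP", "PP")): F(2, 10),
    ("BNP", ("N",)): F(1),
    ("VP", ("V",)): F(2, 10),
    ("VP", ("Aux", "V", "NP")): F(5, 10),
    ("VP", ("VP", "PP")): F(3, 10),
    ("PP", ("P", "NP")): F(1),
    ("V", ("chasing",)): F(1),
    ("Aux", ("is",)): F(1),
    ("N", ("dog",)): F(1, 3),
    ("N", ("boy",)): F(1, 3),
    ("N", ("playground",)): F(1, 3),
    ("Det", ("the",)): F(1, 2),
    ("Det", ("a",)): F(1, 2),
    ("P", ("on",)): F(1),
}


def tree_probability(tree):
    """Multiply the probability of every rule used in the tree.

    A tree is (label, child, ...); a child is a subtree or a word string.
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = RULE_PROB[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            prob *= tree_probability(child)
    return prob


np_dog = ("NP", ("Det", "a"), ("BNP", ("N", "dog")))
np_boy = ("NP", ("Det", "a"), ("BNP", ("N", "boy")))
np_playground = ("NP", ("Det", "the"), ("BNP", ("N", "playground")))
pp = ("PP", ("P", "on"), np_playground)

# Reading 1: the chasing happens on the playground (PP attached to VP).
vp_attach = ("S", np_dog,
             ("VP", ("VP", ("Aux", "is"), ("V", "chasing"), np_boy), pp))
# Reading 2: the boy is on the playground (PP attached to NP).
np_attach = ("S", np_dog,
             ("VP", ("Aux", "is"), ("V", "chasing"), ("NP", np_boy, pp)))

p_vp = tree_probability(vp_attach)
p_np = tree_probability(np_attach)
print(p_vp, p_np, "choose VP attachment" if p_vp > p_np else "choose NP attachment")
```

With these assumed numbers the two trees differ only in one rule (VP → VP PP versus NP → NP PP), so the ratio of their probabilities equals the ratio of those two rule probabilities.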

Slide 73: Obstacles
Ambiguity: “A man saw a boy with a telescope.”
Computational intensity: imposes a context horizon.
Text Mining NLP approach:
1. Locate promising fragments using fast IR methods (bag-of-tokens).
2. Only apply slow NLP techniques to promising fragments.
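The two-step approach can be sketched as a cheap filter followed by an expensive analyzer. Everything here is an illustrative assumption: the function names, the token-overlap filter standing in for "fast IR methods", the query terms, and the stand-in `slow_nlp` function, which only records what it was asked to process.

```python
import re


def tokens(text):
    """Cheap bag-of-tokens view of a fragment."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def promising_fragments(fragments, query_terms, min_overlap=1):
    """Step 1: keep fragments that share enough tokens with the query."""
    query = {term.lower() for term in query_terms}
    return [f for f in fragments if len(tokens(f) & query) >= min_overlap]


def mine(fragments, query_terms, slow_analyzer):
    """Step 2: run the expensive analyzer only on the survivors."""
    return [slow_analyzer(f) for f in promising_fragments(fragments, query_terms)]


fragments = [
    "the kinase domain of PERK1 has serine/threonine kinase activity",
    "Frank Rizzo bought his home from Lake View Real Estate",
    "these kinases have been implicated in early stages of wound response",
]

analyzed = []


def slow_nlp(fragment):
    # Stand-in for an expensive parser; it only logs and splits.
    analyzed.append(fragment)
    return fragment.split()


results = mine(fragments, {"kinase", "wound"}, slow_nlp)
print(len(analyzed), "of", len(fragments), "fragments reached the slow step")
```

Note that the exact-match filter misses "kinases" for the query term "kinase"; the third fragment survives only because of "wound", which hints at why stemming appears among the shallow NLP techniques.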

Slide 74: Summary: Shallow NLP
Shallow NLP techniques are feasible and useful:
Lexicon: machine understandable linguistic knowledge (possible senses, definitions, synonyms, antonyms, typeof, etc.)
POS tagging: limit ambiguity (word/POS), entity extraction. “...research interests include text mining as well as bioinformatics.” (NP, N)
WSD: stem/synonym/hyponym matches (doc and query). Query: “Foreign cars”. Document: “I’m selling a 1976 Jaguar…”
Parsing: logical view of information (inference? translation?). “A man saw a boy with a telescope.”
Even without complete NLP, any additional knowledge extracted from text data can only be beneficial.
Ingenuity will determine the applications.

Slide 75: Reference for GO
Gene ontology teaching resources:

Slide 76: References for TM
1. C. D. Manning and H. Schütze, “Foundations of Statistical Natural Language Processing”, MIT Press.
2. S. Russell and P. Norvig, “Artificial Intelligence: A Modern Approach”, Prentice Hall.
3. S. Chakrabarti, “Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data”, Morgan Kaufmann.
4. G. Miller, R. Beckwith, C. Fellbaum, D. Gross, K. Miller, and R. Tengi, Five papers on WordNet, Princeton University, August.
5. C. Zhai, Introduction to NLP, Lecture Notes for CS 397cxz, UIUC, Fall.
6. M. Hearst, Untangling Text Data Mining, ACL’99, invited paper.
7. R. Sproat, Introduction to Computational Linguistics, LING 306, UIUC, Fall.
8. A Road Map to Text Mining and Web Mining, University of Texas resource page.
9. Computational Linguistics and Text Mining Group, IBM Research.