
1 BeeSpace Informatics Research. ChengXiang (“Cheng”) Zhai, Department of Computer Science, Institute for Genomic Biology, Statistics, Graduate School of Library & Information Science, University of Illinois at Urbana-Champaign. BeeSpace Workshop, May 22, 2009

2 Goal of Informatics Research
Develop general and scalable computational methods to enable:
– Semantic integration of data and information
– Effective information access and exploration
– Knowledge discovery
– Hypothesis formulation and testing
Reinforce research in biology and computer science:
– CS research to automate manual tasks of biologists
– Biology research to raise new challenges for CS

3 Overview of BeeSpace Technology
[Architecture diagram: the literature and metadata feed a text search engine and natural language understanding components (words/phrases, entities, relations), which populate a relational database; a text miner, a space/region manager for navigation support, a gene summarizer, and a function annotator then support content analysis, information access & exploration, question answering, and knowledge discovery & hypothesis testing for users.]

4 Informatics Research Accomplishments
– Biomedical information retrieval [Jiang & Zhai 07], [Lu et al. 08]
– Entity/relation extraction [Jiang & Zhai 06], [Jiang & Zhai 07a], [Jiang & Zhai 07b]
– Topic discovery and interpretation [Mei et al. 06a], [Mei et al. 07a], [Mei et al. 07b], [Chee & Schatz 08]
– Entity/gene summarization [Ling et al. 06], [Ling et al. 07], [Ling et al. 08]
– Automatic function annotation [He et al. 09/10]

5 Overview of BeeSpace Technology
[Architecture diagram from slide 3, annotated with the four parts of this talk:]
Part 1. Information Extraction
Part 2. Navigation Support
Part 3. Entity Summarization
Part 4. Function Analysis

6 Part 1. Information Extraction

7 Natural Language Understanding
Example: “…We have cloned and sequenced a cDNA encoding Apis mellifera ultraspiracle (AMUSP) and examined its responses to…” [sentence annotated with syntactic labels (NP, VP) and the entity label Gene]

8 Entity & Relation Extraction
Genetic interaction (Gene X, Gene Y): (Bcd, hb), …
Expression location (Gene X, Anatomy Y): (Bcd, embryo), (Hb, egg), …
[Figure: Lopes FJ et al., 2005, J. Theor. Biol.]

9 General Approach: Machine Learning
Computers learn from labeled examples to compute a function that predicts labels of new examples. Examples of predictions:
– Given a phrase, predict whether it is a gene name
– Given a sentence mentioning two gene names, predict whether there is a genetic interaction relation
Many learning methods are available, but training data isn’t always available.

10 Extraction Example 1: Gene Name Recognition
“…expression of terminal gap genes is mediated by the local activation of the Torso receptor tyrosine kinase (Tor). At the anterior, terminal gap genes are also activated by the Tor pathway but Bcd contributes to their activation.” [Which tokens are gene names?]

11 Features for Recognizing Genes
Syntactic clues:
– Capitalization (especially acronyms)
– Numbers (gene families)
– Punctuation: -, /, :, etc.
Contextual clues:
– Local: surrounding words such as “gene”, “encoding”, “regulation”, “expressed”, etc.
– Global: the same noun phrase occurs several times in the same article

12 Maximum Entropy Model for Gene Tagging
Given an observation (a token or a noun phrase) together with its context, denoted x, predict y ∈ {gene, non-gene}.
Maximum entropy model: P(y|x) = K exp(Σ_i λ_i f_i(x, y)), where K is a normalizing constant.
Typical features f:
– y = gene & candidate phrase starts with a capital letter
– y = gene & candidate phrase contains digits
Estimate the weights λ_i from training data.
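The model above can be sketched in a few lines: for a binary label the maximum entropy model P(y|x) = K exp(Σ_i λ_i f_i(x, y)) reduces to logistic regression, trained here by simple gradient ascent. The feature set and the tiny training set are illustrative, not from the actual tagger.

```python
# Minimal maximum-entropy gene tagger sketch (binary case = logistic
# regression). Features and training phrases are made-up examples.
import math
import re

def features(phrase, context):
    """Binary indicator features f_i(x) for a candidate phrase."""
    return [
        1.0,                                     # bias term
        float(phrase[:1].isupper()),             # capitalization clue
        float(bool(re.search(r"\d", phrase))),   # gene-family numbers
        float("-" in phrase or "/" in phrase),   # punctuation clue
        float("gene" in context or "expressed" in context),  # local context
    ]

def train(data, epochs=200, lr=0.5):
    lam = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-sum(l * f for l, f in zip(lam, x))))
            lam = [l + lr * (y - p) * f for l, f in zip(lam, x)]  # gradient step
    return lam

def predict(lam, x):
    return int(sum(l * f for l, f in zip(lam, x)) > 0)

# Toy labeled examples: (candidate phrase, surrounding words, is_gene)
examples = [("Tor", "receptor kinase gene", 1), ("Bcd", "gene expressed anterior", 1),
            ("hb", "zygotic gene expressed", 1), ("CD38", "expressed by neurons", 1),
            ("pathway", "activated by the", 0), ("anterior", "at the pole", 0),
            ("activation", "local of the", 0), ("egg", "half of the", 0)]
lam = train([(features(p, c), y) for p, c, y in examples])
print(predict(lam, features("Tor2", "the gene encoding")))  # 1 = tagged as gene
```

On this separable toy data the learned weights put positive mass on the capitalization, digit, and context features, so an unseen capitalized, digit-bearing phrase next to “gene” is tagged as a gene.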

13 Special Challenges
– Gene name disambiguation
– Domain adaptation

14 Gene Name Disambiguation
Gene names can be common English words: for (foraging), in (inturned), similar (sima), yellow (y), black (b), …
Solution:
– Disambiguate by looking at the context of the candidate word
– Train a classifier on that context

15 Discriminative Neighbor Words

16 Sample Disambiguation Results
[Figure: sentences containing “foraging”/“for” and “black”, with each candidate occurrence scored by the classifier; positive scores indicate a gene sense (e.g. “the foraging (for) gene encodes a pkg in drosophila melanogaster”, “assays of black mutants”) and negative scores an ordinary English sense (e.g. “behaviors such as locomotion and foraging”).]

17 Problem of Domain Overfitting
[Figure (slide dated Nov 27, 2007): a gene name recognizer trained on fly literature (names like wingless, daughterless, eyeless, apexless, …) achieves 54.1% in the ideal setting (test domain matches training domain) but only 28.1% in the realistic setting (new test domain).]

18 Solution: Learn Generalizable Features
“…decapentaplegic and wingless are expressed in analogous patterns in each primordium of…”
“…that CD38 is expressed by both neurons and glial cells… that PABPC5 is expressed in fetal brain and in a range of adult tissues.”
Generalizable feature: “w+2 = expressed” (the word two positions after the candidate is “expressed”)

19 Generalizability-Based Feature Ranking
[Figure: features such as “-less” and “expressed” are ranked (ranks 1–8 shown) within the training data of each source domain; a combined generalizability score (e.g. 0.125, 0.167) favors features that rank well in every domain.]
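The ranking idea above can be sketched as follows: score each context feature separately in every source domain, then rank features by their worst per-domain score, so only features that help everywhere rise to the top. The domain names, feature names, and scores below are made up for illustration.

```python
# Illustrative generalizability-based feature ranking: a feature's
# generalizability is its minimum usefulness score across domains.
def rank_generalizable(domain_scores):
    """domain_scores: {domain: {feature: usefulness score}} -> ranked list."""
    feats = set()
    for scores in domain_scores.values():
        feats |= set(scores)
    # Worst score over all domains; missing features count as 0.
    gen = {f: min(s.get(f, 0.0) for s in domain_scores.values()) for f in feats}
    return sorted(gen, key=gen.get, reverse=True)

scores = {
    "fly":   {"w+2=expressed": 0.40, "suffix=-less": 0.55, "w-1=gene": 0.30},
    "mouse": {"w+2=expressed": 0.35, "suffix=-less": 0.02, "w-1=gene": 0.25},
    "yeast": {"w+2=expressed": 0.30, "suffix=-less": 0.01, "w-1=gene": 0.20},
}
ranked = rank_generalizable(scores)
print(ranked)  # "expressed" generalizes; fly-specific "-less" ranks last
```

Taking the minimum (rather than the average) is the key design choice: a feature that is excellent in one domain but useless elsewhere, like the fly-specific “-less” suffix, is pushed to the bottom.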

20 Effectiveness of Domain Adaptation
[Figure: training on Fly + Mouse and testing on Yeast, the gene name recognizer achieves 63.3% with standard learning vs. 75.9% with domain adaptive learning.]

21 More Results on Domain Adaptation
Exp      Method     Precision  Recall   F1
F+M→Y    Baseline   0.557      0.466    0.508
         Domain     0.575      0.516    0.544
         % Imprv.   +3.2%      +10.7%   +7.1%
F+Y→M    Baseline   0.571      0.335    0.422
         Domain     0.582      0.381    0.461
         % Imprv.   +1.9%      +13.7%   +9.2%
M+Y→F    Baseline   0.583      0.097    0.166
         Domain     0.591      0.139    0.225
         % Imprv.   +1.4%      +43.3%   +35.5%
Text data from BioCreAtIvE (Medline); 3 organisms (Fly, Mouse, Yeast)

22 Extraction Example 2: Genetic Interaction Relation
“Bcd regulates the expression of the maternal and zygotic gene hunchback (hb) that shows a step-like-function expression pattern, in the anterior half of the egg.” Is there a genetic interaction relation between the two genes here?

23 Challenges
– No/little training data
– What features to use?

24 Solution: Pseudo Training Data
Example pseudo-positive sentence for the known interacting pair (Bcd, Hb): “These results uncovered an antagonism between hunchback and bicoid at the anterior pole, whereas the two genes are known to act in concert for most anterior segmented development.”
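The pseudo-training-data idea can be sketched as follows: a sentence co-mentioning a gene pair already known to interact (from a curated resource) is treated as a pseudo-positive relation example, while a co-mentioned pair not on the list is a pseudo-negative. The interaction list and sentences below are illustrative stand-ins.

```python
# Hedged sketch of pseudo training data for relation extraction.
# The curated interaction list and sentences are made-up examples.
known_interactions = {frozenset(("bicoid", "hunchback"))}

def pseudo_label(sentence, pair):
    """1 = pseudo-positive, 0 = pseudo-negative, None = pair not co-mentioned."""
    if all(g in sentence.lower() for g in pair):
        return int(frozenset(pair) in known_interactions)
    return None

sentences = [
    "These results uncovered an antagonism between hunchback and bicoid.",
    "Both torso and yellow were sequenced in this study.",
]
print(pseudo_label(sentences[0], ("bicoid", "hunchback")))  # pseudo-positive
```

A relation classifier can then be trained on these automatically labeled sentences in place of expensive hand annotation, at the cost of some label noise.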

25 Pseudo Training Data Works Reasonably Well
[Figure: precision–recall curves for different feature sets; using all features works the best.]

26 Large-Scale Entity/Relation Extraction
Entity annotation:
– Gene (NCBI, FlyBase, …): dictionary string search + machine learning
– Anatomy (FlyBase): dictionary string search
– Chemical (MeSH, Biosis, …): dictionary string search
– Behavior: “x x behavior” pattern search
Relation extraction:
– Regulatory: pre-defined patterns + machine learning
– Expressed In: co-occurrence + relevant keywords
– Gene–Behavior: co-occurrence
– Gene–Chemical: co-occurrence

27 Part 2: Semantic Navigation

28 Space-Region Navigation
[Diagram: literature spaces (Bee, Fly, Behavior, Bird, …) and topic regions (Bee Forager, Bird Singing, Fly Rover, …) connected by the MAP, EXTRACT, and SWITCHING operators; users build “My Spaces” and “My Regions/Topics” via intersection, union, etc.]

29 General Approach: Language Models
– Topic = word distribution
– Model the text in a space with mixture models of multinomial distributions
– Text mining = parameter estimation + inference
– Matching = computing similarity between word distributions
– Users can “control” a model by specifying topic preferences

30 A Sample Topic & Corresponding Space
Word distribution (language model): filaments 0.0410238, muscle 0.0327107, actin 0.0287701, z 0.0221623, filament 0.0169888, myosin 0.0153909, thick 0.00968766, thin 0.00926895, sections 0.00924286, er 0.00890264, band 0.00802833, muscles 0.00789018, antibodies 0.00736094, myofibrils 0.00688588, flight 0.00670859, images 0.00649626
Meaningful labels: actin filaments; flight muscle; flight muscles
Example documents: “actin filaments in honeybee-flight muscle move collectively”; “arrangement of filaments and cross-links in the bee flight muscle z disk by image analysis of oblique sections”; “identification of a connecting filament protein in insect fibrillar flight muscle”; “the invertebrate myosin filament: subfilament arrangement of the solid filaments of insect flight muscles”; “structure of thick filaments from insect flight muscle”

31 MAP: Topic/Region → Space
MAP: use the topic/region description as a query to search a given space.
Retrieval algorithm:
– Query word distribution: p(w|θ_Q)
– Document word distribution: p(w|θ_D)
– Score a document by the similarity of θ_Q and θ_D
Leverage existing retrieval toolkits: Lemur/Indri
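One common way to score the similarity of θ_Q and θ_D is cross entropy, which is equivalent (up to a query-only constant) to negative KL divergence between the two word distributions. The distributions below are toy values, not BeeSpace output.

```python
# Sketch of MAP's scoring step: rank documents in the target space by
# sum_w p(w|theta_Q) * log p(w|theta_D). Higher means a better match.
import math

def map_score(p_q, p_d, eps=1e-9):
    """Cross-entropy-style score of a query model against a document model."""
    # eps stands in for smoothing of unseen words in the document model.
    return sum(pq * math.log(p_d.get(w, eps)) for w, pq in p_q.items())

topic = {"foraging": 0.5, "nectar": 0.3, "colony": 0.2}
docs = {
    "doc_a": {"foraging": 0.4, "nectar": 0.3, "colony": 0.1, "hive": 0.2},
    "doc_b": {"muscle": 0.6, "filament": 0.4},
}
ranked = sorted(docs, key=lambda d: map_score(topic, docs[d]), reverse=True)
print(ranked)  # doc_a ranks above doc_b for the foraging topic
```

In practice a retrieval toolkit such as Lemur/Indri performs this kind of language-model scoring at scale, with proper smoothing instead of the fixed `eps` floor used here.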

32 EXTRACT: Space → Topic/Region
– Assume k topics, each represented by a word distribution p(w|θ_j)
– Use a k-component mixture model to fit the documents in a given space (EM algorithm)
– The estimated k component word distributions are taken as the k topic regions
Likelihood of a document d: p(d) = Π_w [ Σ_{j=1..k} π_j p(w|θ_j) ]^{c(w,d)}, where c(w,d) is the count of w in d. The parameters are fit with a maximum likelihood estimator (EM), or with a Bayesian estimator when a prior on θ_j encodes user preferences.
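The EM fitting step above can be sketched compactly. This is an illustrative document-level mixture of multinomials; real spaces are far larger, and the prior-guided Bayesian variant described on the following slides is omitted. The toy documents are made up.

```python
# Compact EM for a k-component multinomial mixture (EXTRACT sketch).
import math
import random
from collections import Counter

def em_mixture(docs, k, iters=30, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    # Random initialization of component word distributions theta_j
    theta = []
    for _ in range(k):
        t = {w: rng.random() + 0.1 for w in vocab}
        z = sum(t.values())
        theta.append({w: v / z for w, v in t.items()})
    pi = [1.0 / k] * k  # mixing weights
    for _ in range(iters):
        # E-step: posterior p(z = j | d) for each document d
        post = []
        for d in docs:
            logp = [math.log(pi[j]) +
                    sum(c * math.log(theta[j][w]) for w, c in d.items())
                    for j in range(k)]
            m = max(logp)
            p = [math.exp(l - m) for l in logp]
            s = sum(p)
            post.append([x / s for x in p])
        # M-step: re-estimate mixing weights and word distributions
        pi = [sum(p[j] for p in post) / len(docs) for j in range(k)]
        for j in range(k):
            counts = {w: 1e-6 for w in vocab}  # tiny smoothing
            for d, p in zip(docs, post):
                for w, c in d.items():
                    counts[w] += p[j] * c
            z = sum(counts.values())
            theta[j] = {w: c / z for w, c in counts.items()}
    return theta, pi

docs = [Counter("muscle filament actin muscle".split()),
        Counter("actin filament muscle myosin".split()),
        Counter("foraging nectar colony foraging".split()),
        Counter("nectar colony foraging food".split())]
theta, pi = em_mixture(docs, k=2)
print([max(t, key=t.get) for t in theta])  # top word of each learned topic
```

Each learned θ_j is a word distribution over the vocabulary; on data like this the components typically split the muscle vocabulary from the foraging vocabulary, and each becomes one topic region.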

33 User-Controlled Exploration: Sample Topic 1
Prior: labor 0.2, division 0.2
Resulting topic: age 0.0672687, division 0.0551497, labor 0.052136, colony 0.038305, foraging 0.0357817, foragers 0.0236658, workers 0.0191248, task 0.0190672, behavioral 0.0189017, behavior 0.0168805, older 0.0143466, tasks 0.013823, old 0.011839, individual 0.0114329, ages 0.0102134, young 0.00985875, genotypic 0.00963096, social 0.00883439

34 User-Controlled Exploration: Sample Topic 2
Prior: behavioral 0.2, maturation 0.2
Resulting topic: behavioral 0.110674, age 0.0789419, maturation 0.057956, task 0.0318285, division 0.0312101, labor 0.0293371, workers 0.0222682, colony 0.0199028, social 0.0188699, behavior 0.0171008, performance 0.0117176, foragers 0.0110682, genotypic 0.0106029, differences 0.0103761, polyethism 0.00904816, older 0.00808171, plasticity 0.00804363, changes 0.00794045

35 Exploit Prior for Concept Switching
Two foraging-related topic word distributions (the first includes the fly behavior terms sitter/rover):
Topic A: foraging 0.290076, nectar 0.114508, food 0.106655, forage 0.0734919, colony 0.0660329, pollen 0.0427706, flower 0.0400582, sucrose 0.0334728, source 0.0319787, behavior 0.0283774, individual 0.028029, rate 0.0242806, recruitment 0.0200597, time 0.0197362, reward 0.0196271, task 0.0182461, sitter 0.00604067, rover 0.00582791, rovers 0.00306051
Topic B: foraging 0.142473, foragers 0.0582921, forage 0.0557498, food 0.0393453, nectar 0.03217, colony 0.019416, source 0.0153349, hive 0.0151726, dance 0.013336, forager 0.0127668, information 0.0117961, feeder 0.010944, rate 0.0104752, recruitment 0.00870751, individual 0.0086414, reward 0.00810706, flower 0.00800705, dancing 0.00794827, behavior 0.00789228

36 Part 3: Entity Summarization

37 Automated Gene Summarization?
Multi-aspect gene summary: gene product, expression, sequence, interactions, mutations, general functions

38 A Two-Stage Approach

39 Text Summary of Gene Abl

40 General Entity Summarizer
Task: given any entity and k aspects to summarize, generate a semi-structured summary.
Assumption: training sentences are available for each aspect.
Method:
– Train a recognizer for each aspect
– Given an entity, retrieve sentences relevant to the entity
– Classify each sentence into one of the k aspects
– Choose the best sentences in each category
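The retrieve/classify/select loop above can be sketched as follows. The keyword-overlap “classifier” here is a deliberately simple stand-in for the trained per-aspect recognizers, and the aspect cues and sentences are illustrative.

```python
# Hedged sketch of the entity summarizer: retrieve sentences mentioning
# the entity, classify each into an aspect, keep the best per aspect.
ASPECT_CUES = {
    "expression":   {"expressed", "expression", "pattern"},
    "interactions": {"interacts", "binds", "regulates"},
    "mutations":    {"mutant", "mutation", "allele"},
}

def summarize(entity, sentences):
    best = {}
    for s in sentences:
        if entity not in s:            # retrieval step: entity must occur
            continue
        words = set(s.lower().split())
        # classification step: aspect with the largest cue-word overlap
        aspect, score = max(((a, len(words & cues))
                             for a, cues in ASPECT_CUES.items()),
                            key=lambda x: x[1])
        if score > best.get(aspect, ("", 0))[1]:
            best[aspect] = (s, score)  # keep the best sentence per aspect
    return {a: s for a, (s, _) in best.items()}

sents = [
    "Abl is expressed in the embryonic CNS pattern.",
    "Abl interacts with and regulates disabled.",
    "The Abl mutant allele shows axon defects.",
]
print(summarize("Abl", sents))
```

The output is the semi-structured summary itself: one best sentence per aspect, keyed by aspect name.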

41 Further Generalizations
Same task, assumption, and pipeline as the general entity summarizer (slide 40), with a new method based on a mixture model and regularized optimization.

42 Part 4. Function Analysis

43 Annotating Gene Lists: GO Terms vs. Literature Mining
Limitations of GO annotations: labor-intensive; limited coverage.
Literature mining: automatic; flexible exploration of the entire literature space.

44 Overview of Gene List Annotator
[Diagram: for each gene in the input group (Bcd, Cad, …, Tll), retrieve its relevant documents (via Entrez Gene); for any term, test its significance over the document sets, yielding enriched concepts for interactive analysis, e.g. segmentation 56.0, pattern 34.2, cell_cycle 25.6, development 22.1, regulation 20.4, …]

45 Intuition for Literature-based Annotation
Term occurrence counts per gene:
Term               TPI1  GPM1  PGK1  TDH3  TDH2
protein_kinase     0     0     2     0     0
decarboxylase      1     0     0     7     6
protein            39    26    65    44    33
stationary_phase   2     7     3     4     2
energy_metabolism  4     5     5     8     0
oscillation        0     0     0     0     1

46 Likelihood Ratio Test with 2-Poisson Mixture Model
Dataset distribution: Poisson(λ; d)
Reference distribution: Poisson(λ₀; d)
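A simplified sketch of the test: compare the term's Poisson rate λ in the gene group's documents against the shared rate λ₀ under the null hypothesis, via a likelihood ratio, and compare 2·log(LR) to a chi-square cutoff. The counts below are illustrative, and the full 2-Poisson mixture over documents is reduced here to two aggregate rates.

```python
# Simplified Poisson likelihood-ratio enrichment test (illustrative).
import math

def poisson_loglik(count, mean):
    """Log Poisson likelihood up to the count-only constant log(count!)."""
    return count * math.log(mean) - mean

def lr_statistic(k_group, n_group, k_ref, n_ref):
    """2 * log LR for 'the group's rate differs from the reference rate'."""
    lam = k_group / n_group                        # alternative: own rate
    lam0 = (k_group + k_ref) / (n_group + n_ref)   # null: one shared rate
    alt = poisson_loglik(k_group, lam * n_group)
    null = poisson_loglik(k_group, lam0 * n_group)
    return 2 * (alt - null)

# Hypothetical counts: a term with 56 hits in 100 group documents
# vs. 300 hits in 10,000 documents overall -> strongly enriched.
print(round(lr_statistic(56, 100, 300, 10000), 1))
```

A term whose group rate matches the background rate yields a statistic near zero, so the same function separates enriched concepts (like “segmentation” on the previous slide) from uninformative ones.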

47 Agreement with GO-based Method
Gene list: 93 genes up-regulated by the manganese treatment.
GO theme → related Annotator terms:
– neurogenesis → axon guidance, growth cone, commissural axon, proneural gene
– synaptic transmission → synaptic vesicle, neurotransmitter release, synaptic transmission, sodium channel
– cytoskeletal protein → alpha tubulin, actin filament
– cell communication → tight junction, heparan sulfate proteoglycan

48 Discovering Novel Themes
Gene list: 69 genes up-regulated by the methoprene treatment.
Theme → Annotator terms:
– muscle → flight muscle, muscle myosin, nonmuscle myosin, light chain, myosin ii, thick filament, thin filament, striated muscle
– synaptic transmission → neurotransmitter release, synaptic transmission, synaptic vesicle
– signaling pathway → notch signal

49 Summary
[Architecture diagram from slide 5, annotated with Part 1 (Information Extraction), Part 2 (Navigation Support), Part 3 (Entity Summarization), and Part 4 (Function Analysis).]
Machine learning + language models + minimum human effort: general and scalable, but there’s room for deeper semantics.

50 Looking Ahead…
– Knowledge integration and inference
– Support for hypothesis formulation and testing

51 Exploring Knowledge Space
[Diagram: a graph of genes (A1, A1’, A2, A3, A4, A4’, A5, A6) and behaviors (B1–B4) linked by typed edges such as isa, co-occur-fly, co-occur-mos, co-occur-bee, orth(ology), and reg(ulation).]
Example query sequence:
1. X = NeighborOf(B4, Behavior, {co-occur, isa}) → {B1, B2, B3}
2. Y = NeighborOf(X, Gene, {co-occur, orth}) → {A1, A1’, A2, A3}
3. Y = Y + {A5, A6} → {A1, A1’, A2, A3, A5, A6}
4. Z = NeighborOf(Y, Gene, {reg}) → {A4, A4’}
5. P = PathBetween(Z, B4, {co-occur, reg, isa})

52 Full-Fledged BeeSpace V5
[Diagram: biomedical literature and experiment data analysis feed entities (gene, behavior, anatomy, chemical) and relations (orthology, regulatory interaction, …); additional entities and relations plus expert knowledge support inferences and hypothesis formulation & testing.]

53 Thanks to: Xin He (UIUC), Jing Jiang (SMU), Yanen Li (UIUC), Xu Ling (UIUC), Yue Lu (UIUC), Qiaozhu Mei (UIUC/Michigan) & Bruce Schatz (PI, BeeSpace)

54 Thank You!

