Bioinformatics Seminar Department of Computer Science, UIUC February 25, 2005 Analysis Environments For Functional Genomics Bruce R. Schatz CANIS Laboratory.


1 Bioinformatics Seminar Department of Computer Science, UIUC February 25, 2005 Analysis Environments For Functional Genomics Bruce R. Schatz CANIS Laboratory Institute for Genomic Biology University of Illinois at Urbana-Champaign schatz@uiuc.edu, www.canis.uiuc.edu

2 What are Analysis Environments Functional Analysis Find the underlying Mechanisms Of Genes, Behaviors, Diseases Comparative Analysis Top-down data mining (vs Bottom-up) Multiple Sources especially literature

3 Building Analysis Environments
Manual by Humans
- Interaction: user navigation
- Classification: collection indexing
Automatic by Computers
- Federation: search bridges
- Integration: results links

4 Trends in Analysis Environments Central versus Distributed Viewpoints The 90s Pre-Genome Entrez (NIH NCBI) versus WCS (NSF Arizona) The 00s Post-Genome GO (NIH curators) versus BeeSpace (NSF Illinois)

5 Pre-Genome Environments Focused on Syntax pre-Web WCS (Worm Community System) Search words across sources Follow links across sources Words automatic, Links manual Towards Uniform Searching

6 Post-Genome Environments Focused on Semantics post-Web BeeSpace (Honey Bee Inter Space) Navigate concepts across sources Integrate data across sources Concepts automatic, Links automatic Towards Question Answering

7 Paradigm Shift Towards Dry-Lab Biology, Walter Gilbert (Jan 1991) “The new paradigm, now emerging, is that all the 'genes' will be known (in the sense of being resident in databases available electronically), and that the starting point of a biological investigation will be theoretical. An individual scientist will begin with a theoretical conjecture, only then turning to experiment to follow or test that hypothesis.... To use this flood of knowledge [the total sequence of the human and model organisms], which will pour across the computer networks of the world, biologists not only must become computer-literate, but also change their approach to the problem of understanding life....” The Coming of Informational Science: Correlation of Information across Sources

8 NCBI Entrez

9 Community Systems browse and share all the knowledge of a community
Formal | Informal
data (database management) | results (electronic mail)
literature (information retrieval) | news (bulletin boards)
knowledge (hypertext annotations)

10 Worm Community System
WCS Information
- Literature: BIOSIS, MEDLINE, newsletters, meetings
- Data: Genes, Maps, Sequences, strains, cells
WCS Functionality
- Browsing: search, navigation
- Filtering: selection, analysis
- Sharing: linking, publishing
WCS: 250 users at 50 labs across Internet (1991)

11 WCS Molecular

12 WCS Cellular

13 WCS Publishing

14 WCS Linking

15 WCS invokes gm

16 WCS vis-à-vis acedb

17 from Objects to Concepts from Syntax to Semantics Infrastructure is Interaction with Abstraction Internet is packet transmission across computers Interspace is concept navigation across repositories Towards the Interspace

18 THE THIRD WAVE OF NET EVOLUTION PACKETS OBJECTS CONCEPTS

19 Technology Engineering Electrical FORMAL INFORMAL (manual) (automatic) IEEE communities groups individuals LEVELS OF INDEXES

20 COMPUTING CONCEPTS ‘92: 4,000 (molecular biology) ‘93: 40,000 (molecular biology) ‘95: 400,000 (electrical engineering) ‘96: 4,000,000 (engineering) ‘98: 40,000,000 (medicine)

21 Simulating a New World Obtain discipline-scale collection MEDLINE from NLM, 10M bibliographic abstracts human classification: Medical Subject Headings Partition discipline into Community Repositories 4 core terms per abstract for MeSH classification 32K nodes with core terms (classification tree) Community is all abstracts classified by core term 40M abstracts containing 280M concepts concept spaces took 2 days on NCSA Origin 2000 Simulating World of Medical Communities 10K repositories with > 1K abstracts (1K w/ > 10K)
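The partitioning step on this slide can be sketched in a few lines. This is a minimal illustration, not the system used for the 1998 computation: the function names, the toy abstract ids, and the MeSH-style terms are all hypothetical. It shows how classifying each abstract under every one of its core terms places one abstract in several community repositories, which would be consistent with 10M abstracts at about 4 core terms each yielding about 40M classified abstracts.

```python
from collections import defaultdict


def partition_by_core_terms(abstracts):
    """Group abstract ids into one repository per core term.

    `abstracts` maps an abstract id to its list of core classification
    terms (hypothetical toy data, standing in for MeSH core terms).
    """
    repositories = defaultdict(list)
    for abstract_id, core_terms in abstracts.items():
        for term in core_terms:
            # An abstract joins every community named by its core terms.
            repositories[term].append(abstract_id)
    return dict(repositories)


def repositories_above(repositories, min_size):
    """Keep only repositories holding more than `min_size` abstracts."""
    return {t: ids for t, ids in repositories.items() if len(ids) > min_size}


# Toy collection: three abstracts, each with one or two core terms.
abstracts = {
    "a1": ["Arthritis, Rheumatoid", "Analgesics"],
    "a2": ["Analgesics", "Gastrointestinal Hemorrhage"],
    "a3": ["Arthritis, Rheumatoid"],
}
repos = partition_by_core_terms(abstracts)
large = repositories_above(repos, 1)
```

The size filter mirrors the slide's count of repositories with more than 1K abstracts, scaled down to the toy data.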

22 Interspace Remote Access Client

23 Navigation in MEDSPACE For a patient with Rheumatoid Arthritis Find a drug that reduces the pain (analgesic) but does not cause stomach (gastrointestinal) bleeding Choose Domain

24 Concept Search

25 Concept Navigation

26 Retrieve Document

27 Navigate Document

28 Retrieve Document

29 Informational Science Computational Science is widely accepted as The Third Branch of Science (beyond Experimental and Theoretical) Genes are Computed, Proteins are Computed, Sequence “equivalences” are Computed. Informational Science is coming to be accepted as The Fourth Branch of Science Based on Information Science technologies for Functional Analysis across Information Sources

30 Post-Genome Informatics I Comparative Analysis within the Dry Lab of Biological Knowledge Classical Organisms have Genetic Descriptions. There will be NO more classical organisms beyond Mice and Men, Worms and Flies, Yeasts and Weeds. Must use comparative genomics on classical organisms Via sequence homologies and literature analysis.

31 Post-Genome Informatics II Functional Analysis within the Dry Lab of Biological Knowledge Automatic annotation of genes to standard classifications, e.g. Gene Ontology via homology on computed protein sequences. Automatic analysis of functions to scientific literature, e.g. concept spaces via text extractions. Thus must use functions in literature descriptions.

32 Informatics: From Bases to Spaces data Bases support genome data e.g. FlyBase has sequences and maps Genes annotated by GO and linked to literature e.g. BeeBase has computed annotations Protein homologies for similar Genes via GO information Spaces support biomedical literature e.g. BeeSpace uses automatically generated conceptual relationships to navigate functions

33 Gene Ontology

34
Gene Symbol | Data Source | Full Name
…
Calca | MGI | calcitonin-related polypeptide
Cat-1 | Wormbase | None
Cat-2 | Wormbase | None
CCKR-Human | UniProt | Cholecystokinin receptor
CRF2-Rat | UniProt | Corticotropin releasing factor
Crhr2 | RGD | corticotrophin relse hormone
Egl-10 | Wormbase | None
Egl-30 | Wormbase | None
Feh-1 | Wormbase | None
For | FlyBase | None

35 Conceptual Navigation in BeeSpace

36 BeeSpace Analysis Environment Build Concept Space of Biomedical Literature for Functional Analysis of Bee Genes -Partition Literature into Community Collections -Extract and Index Concepts within Collections -Navigate Concepts within Documents -Follow Links from Documents into Databases Locate Candidate Genes in Related Literatures then follow links into Genome Databases

37 Question Answering
Behaviour | Organism | Gene | Molecular Function | Reference
Foraging
Rover vs sitter phenotype | Drosophila melanogaster | for | Protein kinase G | 8
Roamer vs dweller phenotype | C. elegans | egl-4 | Protein kinase G | 16
Division of labour: age at onset of foraging | Apis mellifera | for | Protein kinase G | 9
Division of labour: age at onset of foraging | Apis mellifera | mvl | Mn transporter | 19
Division of labour: foraging-related? | Apis mellifera | per | Transcription cofactor | 68
Division of labour: foraging-related? | Apis mellifera | ache | Acetylcholine esterase | 69
Division of labour: foraging-related? | Apis mellifera | IP(3)K | Inositol signaling | 70
Foraging specialization: nectar vs. pollen | Apis mellifera | pkc | Protein kinase C | 71
Social feeding | Drosophila melanogaster | dpnf | Neuropeptide Y (NPY) homolog | 21
Social feeding (aggregation) | C. elegans | npr-1 | Receptor for NPY | 22, 23

38 Functional Phrases encodes Sokolowski and colleagues demonstrated in Drosophila melanogaster that the foraging gene (for) encodes a cGMP dependent protein kinase (PKG). The dg2 gene encodes a cyclic guanosine monophosphate (cGMP)- dependent protein kinase (PKG). affects/causes Thus, PKG levels affected food-search behavior. cGMP treatment elevated PKG activity and caused foraging behavior. regulates Amfor, an ortholog of the Drosophila for gene, is involved in the regulation of age at onset of foraging in honey bees. This idea is supported by results for malvolio (mvl), which encodes a manganese transporter and is involved in regulating Drosophila feeding and age at onset of foraging in honey bees.
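As a rough illustration of how sentences like these could be grouped under the three relation headings, the sketch below tags each sentence by the verb family it contains. It is an assumption-laden toy: the regular expressions, the `RELATION_PATTERNS` table, and the `tag_functional_phrases` function are hypothetical, and keyword matching is a stand-in for whatever phrase analysis the BeeSpace software performs.

```python
import re

# Hypothetical verb-family patterns, one per relation heading on the slide.
RELATION_PATTERNS = {
    "encodes": re.compile(r"\bencod(?:es|ed|ing)\b", re.IGNORECASE),
    "affects/causes": re.compile(
        r"\b(?:affect(?:s|ed)?|caus(?:es|ed))\b", re.IGNORECASE
    ),
    "regulates": re.compile(r"\bregulat(?:es|ed|ing|ion)\b", re.IGNORECASE),
}


def tag_functional_phrases(sentence):
    """Return the relation labels whose verb pattern occurs in the sentence."""
    return [
        label
        for label, pattern in RELATION_PATTERNS.items()
        if pattern.search(sentence)
    ]


# Sentences taken from the slide.
sentences = [
    "The dg2 gene encodes a cyclic guanosine monophosphate (cGMP)-dependent "
    "protein kinase (PKG).",
    "Thus, PKG levels affected food-search behavior.",
    "Amfor, an ortholog of the Drosophila for gene, is involved in the "
    "regulation of age at onset of foraging in honey bees.",
]
tags = [tag_functional_phrases(s) for s in sentences]
```

A sentence can carry more than one label, which is why the function returns a list rather than a single relation.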

39 BeeSpace Software Implementation Natural Language Processing Identify noun and verb phrases Recognize biological entities Compute biological relations Statistical Information Retrieval Compute statistical contexts Support conceptual navigation
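One simple way to picture "statistical contexts" is a co-occurrence table: concepts that appear together in many documents are treated as related, and navigation follows the strongest associations. The slide does not say which statistic BeeSpace uses, so the sketch below is an assumption; raw co-occurrence counts, the function names, and the toy concept lists are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_concept_space(documents):
    """Count how often each pair of concepts shares a document.

    `documents` is a list of concept lists, one per document, as might be
    produced by an upstream phrase-extraction step (not shown here).
    """
    cooccur = {}
    for concepts in documents:
        for a, b in combinations(sorted(set(concepts)), 2):
            cooccur.setdefault(a, Counter())[b] += 1
            cooccur.setdefault(b, Counter())[a] += 1
    return cooccur


def related_concepts(space, concept, top_n=3):
    """Return up to `top_n` concepts that co-occur most with `concept`."""
    return [c for c, _ in space.get(concept, Counter()).most_common(top_n)]


# Toy documents, each reduced to its extracted concepts.
documents = [
    ["foraging", "protein kinase G", "cGMP"],
    ["foraging", "protein kinase G", "honey bee"],
    ["foraging", "manganese transporter", "honey bee"],
]
space = build_concept_space(documents)
neighbours = related_concepts(space, "foraging", top_n=2)
```

A production system would likely weight the counts (for example by concept frequency) rather than use them raw; that refinement is left out to keep the sketch short.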

40 Data Integration (FlyBase Gene) D. melanogaster gene foraging, abbreviated as for, is reported here. It has also been known in FlyBase as BcDNA:GM08338, CG10033 and l(2)06860. It encodes a product with cGMP-dependent protein kinase activity (EC:2.7.1.-) involved in protein amino acid phosphorylation which is a component of the cellular_component unknown. It has been sequenced and its amino acid sequence contains an eukaryotic protein kinase, a protein kinase C-terminal domain, a tyrosine kinase catalytic domain, a serine/Threonine protein kinase family active site, a cAMP- dependent protein kinase and a cGMP-dependent protein kinase. It has been mapped by recombination to 2-10 and cytologically to 24A2--4. It interacts genetically with Csr. There are 27 recorded alleles : 1 in vitro construct (not available from the public stock centers), 25 classical mutants ( 3 available from the public stock centers) and 1 wild-type. Mutations have been isolated which affect the larval nerve terminal and are behavioral, pupal recessive lethal, hyperactive, larval neurophysiology defective and larval neuroanatomy defective. for is discussed in 80 references (excluding sequence accessions), dated between 1988 and 2003. These include at least 6 studies of mutant phenotypes, 2 studies of wild-type function, 3 studies of natural polymorphisms and 7 molecular studies. Among findings on for function, for activity levels influence adult olfactory trap response to a food medium attractant. 
Among findings on for polymorphisms, the frequency of for R and for s strains in three natural populations are studied to determine the contribution of the local parasitoid community to the differences in for R and for s frequencies.
cGMP-dependent protein kinase activity; (EC:2.7.1.-); protein amino acid phosphorylation; cellular_component unknown; sequenced; amino acid sequence; eukaryotic protein kinase; protein kinase C-terminal domain; tyrosine kinase catalytic domain; serine/Threonine protein kinase family active site; cAMP- dependent protein kinase; cGMP-dependent protein kinase; 24A2--4; alleles; nerve terminal; references

41 BeeSpace Information Sources Biomedical Literature - Medline (medicine) - Biosis (biology) - Agricola, CAB Abstracts, Agris (agriculture) Model Organisms (heredity) -Gene Descriptions (FlyBase, WormBase) Natural Histories (environment) -BeeKeeping Books (Cornell Library, Harvard Press)

42 Medical Concept Spaces (1998) Medical Literature (Medline, 10M abstracts) Partition with Medical Subject Headings (MeSH) Community is all abstracts classified by core term 40M abstracts containing 280M concepts computation is 2 days on NCSA Origin 2000 Simulating World of Medical Communities 10K repositories with > 1K abstracts (1K with > 10K)

43 Biological Concept Spaces (2005) Compute concept spaces for All of Biology BioSpace across entire biomedical literature 50M abstracts across 50K repositories Use Gene Ontology to partition literature into biological communities for functional analysis GO same scale as MeSH but adequate coverage? GO light on social behavior (biological process)

44 Paradigm Shift Dissecting Human Disease, Victor McKusick (Feb 2001)
Structural genomics | Functional genomics
Genomics | Proteomics
Map-based gene discovery | Sequence-based gene discovery
Monogenic disorders | Multifactorial disorders
Specific DNA diagnosis | Monitoring susceptibility
Analysis of one gene | Analysis of multi-gene pathways
Gene action | Gene regulation
Etiology (mutation) | Pathogenesis (mechanism)
One species | Several species

45 Needles and Haystacks Genes Honey Bees have 13K genes Perhaps 100 have known functions Paths Perhaps 30K protein families exist KEGG has 200 known pathways Statistical Clustering for Interactive Discovery Across Two Orders of Magnitude!

46 Concept Switching In the Interspace… each Community maintains its own repository Switching is navigating Across repositories use your specialty vocabulary to search another specialty

47 CONCEPT SWITCHING “Concept” versus “Term” set of “semantically” equivalent terms Concept switching region to region (set to set) match term Semantic region Concept Space
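The region-to-region idea can be sketched as follows: take a term in one community's concept space, expand it to its semantic region (the term plus its related terms), find which of those terms also exist in another community's space, and return the regions around the matches there. This is a hypothetical illustration only; the dictionary representation of a concept space, the function names, and the toy bee and fly vocabularies are assumptions, not the Interspace implementation.

```python
def semantic_region(space, term):
    """Return the term together with its related terms in `space`.

    `space` maps each term to a list of related terms (a toy stand-in
    for a computed concept space).
    """
    return {term} | set(space.get(term, ()))


def switch_concept(source_space, target_space, term):
    """Map a source term to regions of the target space via shared terms."""
    source_region = semantic_region(source_space, term)
    # Terms of the source region that the target community also uses.
    shared = sorted(t for t in source_region if t in target_space)
    return {t: sorted(semantic_region(target_space, t)) for t in shared}


# Toy concept spaces for two communities.
bee_space = {
    "foraging": ["division of labour", "protein kinase G"],
    "protein kinase G": ["foraging", "cGMP"],
}
fly_space = {
    "protein kinase G": ["rover", "sitter", "cGMP"],
    "rover": ["protein kinase G", "sitter"],
}
switched = switch_concept(bee_space, fly_space, "foraging")
```

Here the bee term "foraging" has no entry in the fly vocabulary, yet it still reaches fly concepts through the shared term "protein kinase G", which is the point of matching regions rather than single terms.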

48 Biomedical Session

49 Categories and Concepts

50 Concept Switching

51 Document Retrieval

52 Future Technologies Concept Switching Spreading activation, type tagging Dynamic Indexing On-the-fly collections, during session Path Matching Aggregating indexes, many repositories

53 THE NET OF THE 21st CENTURY Beyond Objects to Concepts Beyond Search to Analysis Problem Solving via Cross-Correlating Multimedia Information across the Net Every community has its own special library Every community does semantic indexing The Interspace approximates Cyberspace

54 Interactive Functional Analysis BeeSpace will enable users to navigate a uniform space of diverse databases and literature sources for hypothesis development and testing, with a software system beyond a searchable database, using literature analyses to discover functional relationships between genes and behavior. Genes to Behaviors Behaviors to Genes Concepts to Concepts Clusters to Clusters Navigation across Sources

55 XSpace Information Sources Organize Genome Databases (XBase) Compute Gene Descriptions from Model Organisms Partition Scientific Literature for Organism X Compute XSpace using Semantic Indexing Boost the Functional Analysis from Special Sources Collecting Useful Data about Natural Histories e.g. CowSpace Leverage in AIPL Databases

56 Towards the Interspace The Analysis Environment technology is GENERAL ! BirdSpace? BeeSpace? PigSpace? CowSpace? BehaviorSpace? BrainSpace? BioSpace … Interspace

