Target selection strategies for the mouse genome

Target selection strategies for the mouse genome Adam Godzik

Q1: How far are we from complete coverage of a sample genome? ~200

The same, split into finer categories ~200 ~60, essential (predicted)

Why mouse?
- Everybody's favorite model organism
- Diseases (cancer, diabetes)
- Several large experimental collaborations target mouse
- Genome sequenced in 2002 by shotgun sequencing, but also cDNA
- Small (but significant) differences from the human genome

Basic statistics
- 2.5 Gb (14% shorter than human)
- ~30,000 protein-coding genes (about the same as human, with the same uncertainties)
- 20 chromosomes; significant rearrangements of conserved sequence elements compared with human
- Largest expansions/contractions seen in genes involved in apoptosis, immune response, and olfactory functions
- Fold repertoire is likely to be essentially identical

Differences are quantitative, rather than qualitative

We are facing many challenges
- There are several challenges we usually don't see for bacterial proteins
- Eukaryotic proteins do not express very well
  - HT E. coli expression: ~5% success rate; yes
  - Baculovirus: ~15% success rate; test
  - Cell-free expression: high (claimed) success rate; no
- They tend to have many domains
- Many approaches to domain recognition, fewer to defining precise domain boundaries
- Many form complexes: problems with solubility and crystallization
- Rational approach to mutation of surface residues

How bad can it get (NAC example)
- Three domains can be reliably modeled (one of them was solved), the fold of two additional ones can be predicted, and one is unknown
- Together with two other labs at TBI we spent three years studying the second domain
- Where exactly does it end? Is the mystery domain part of it?
- No structure so far, after several hundred constructs for five human paralogs and dozens of homologs from several species
- Four "linkers" of about 300 aa: domains? Unstructured linkers?
- 18 paralogs in mouse and 23 in human. What do they do?
- Different paralogs are involved in diseases as different as cancer, autoimmune diseases, and innate immunity disorders, despite identical domain structures!
- Domain diagram (residues 1 to 1480), labels: PAAD, AAA*, ?, LR, ?, (NB), CARD

Target selection principles
- Eliminate what is already known (or not interesting, or should be solved by other techniques)
  - Homologs of proteins with known fold
  - Transmembrane domains
  - Disordered regions
- Choose representatives of the unknown
  - Clustering and sampling strategies
- Figure labels: T1, T2, T3, known, other pipeline
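The two steps on this slide (exclude, then pick representatives) can be sketched roughly as follows. This is a minimal illustration, not the JCSG pipeline: the `Domain` fields, the precomputed `cluster_id` label, and the "first member represents the cluster" rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Domain:
    name: str
    has_known_fold_homolog: bool  # hypothetical flag: homolog of a protein with known fold
    is_transmembrane: bool        # hypothetical flag: predicted transmembrane domain
    is_disordered: bool           # hypothetical flag: predicted disordered region
    cluster_id: str               # hypothetical precomputed sequence-cluster label


def select_targets(domains):
    """Drop excluded domains, then keep one representative per cluster."""
    candidates = [
        d for d in domains
        if not (d.has_known_fold_homolog or d.is_transmembrane or d.is_disordered)
    ]
    representatives = {}
    for d in candidates:
        # The first surviving member seen stands in for its whole cluster.
        representatives.setdefault(d.cluster_id, d)
    return list(representatives.values())


# Toy input, invented for illustration only.
domains = [
    Domain("d1", True, False, False, "c1"),
    Domain("d2", False, True, False, "c2"),
    Domain("d3", False, False, False, "c3"),
    Domain("d4", False, False, False, "c3"),
    Domain("d5", False, False, True, "c4"),
    Domain("d6", False, False, False, "c5"),
]
targets = select_targets(domains)
print([d.name for d in targets])
```

In practice the representative would more likely be chosen by expected tractability (size, predicted solubility) than by input order; input order is used here only to keep the sketch short.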

Basic PSI target estimates
- Structures for ~1/2 (2/5 by length) can be reliably predicted using SCOP-type domain families with multiple representatives and a profile-profile algorithm
- Proteins with transmembrane domains account for ~1/4 of the proteome
- 25,000 to 60,000 domains are needed to cover the remaining ~1/4 at 30% sequence identity
  - Reliability of the excluded domain prediction
  - Uncertainty of disordered/low-complexity regions
  - Surprising number of ORFan regions (10,000 to 20,000)
- Clustering with profile-profile algorithms can lower this number to 5,000 to 10,000 + ORFans
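The target count above depends on how domains are grouped at a 30% sequence identity cutoff. As a rough illustration only, the sketch below runs greedy incremental clustering with a naive position-wise identity over pre-aligned toy sequences. It swaps in this simple measure for the profile-profile comparison the slide refers to, and the sequences and function names are invented for the example.

```python
def identity(a, b):
    """Fraction of identical positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold=0.30):
    """Assign each sequence to the first representative it matches at >= threshold."""
    clusters = {}  # representative name -> list of member names
    for name, seq in seqs.items():
        for rep in clusters:
            if identity(seq, seqs[rep]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters[name] = [name]
    return clusters


# Toy sequences, invented for illustration only.
seqs = {
    "a": "MKTAYIAKQR",
    "b": "MKTAYLAKQR",  # differs from "a" at one position
    "c": "GGSPWWHHDE",  # shares no position with "a"
    "d": "GGSPWWHHDA",  # differs from "c" at one position
}
clusters = greedy_cluster(seqs)
print(clusters)
print("targets needed:", len(clusters))
```

Each cluster then needs only one solved structure, which is why a more sensitive similarity measure (one that merges more distant relatives) lowers the number of targets.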

What could be a model for mouse?
- Mouse is a favorite model for human processes and diseases
- For fold survey and modeling purposes we can use proteins from lower organisms

Bacterialization of the mouse proteome
- Bacterial homologs can be found for most of the mouse proteins (54% with PSI-BLAST, 65% with profile-profile (FFAS)), and this could still be extended
- The distribution of bacterial homologs is an interesting problem in itself

JCSG eukaryotic protein pilot project
- Mouse: ~400 targets
- Homologs of mouse proteins in bacteria: ~1000 targets*
* Together about 20% of the effort; the rest was spent on Thermotoga and a few other pilot projects

Experiences so far
- Mouse: 400 selected, 222 expressed, 32 crystals, 4 structures
  - Effort per structure: about 10 times that of a bacterial protein
- Bacterialized mouse: 1025 selected, 380 expressed, 63 crystals, 12 structures
  - Effort per structure: about 1/10 that of a mouse protein
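The counts on this slide can be turned into stage-by-stage conversion rates. The sketch below does only that arithmetic; the function name and dictionary keys are chosen for the example and are not from the presentation.

```python
def funnel(selected, expressed, crystals, structures):
    """Stage-to-stage conversion rates for a structure-determination pipeline."""
    return {
        "expressed_per_selected": expressed / selected,
        "crystals_per_expressed": crystals / expressed,
        "structures_per_crystal": structures / crystals,
        "structures_per_selected": structures / selected,
        "targets_per_structure": selected / structures,
    }


# Counts as quoted on the slide.
mouse = funnel(selected=400, expressed=222, crystals=32, structures=4)
bacterialized = funnel(selected=1025, expressed=380, crystals=63, structures=12)

for label, rates in (("mouse", mouse), ("bacterialized mouse", bacterialized)):
    print(label)
    for stage, value in rates.items():
        print(f"  {stage}: {value:.3f}")
```

By these counts alone the two pipelines are similar, at roughly one structure per 85 to 100 selected targets. The roughly tenfold difference in effort per structure quoted on the slide therefore presumably reflects the cost of working each target, which the counts do not capture.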

The plan
- Some mouse proteins can be solved in a high-throughput mode
- Bacterial homologs are used as a "salvage" pathway for proteins that failed in the direct approach
  - Exact domain boundaries
  - Modeling mutations
  - MR (molecular replacement) on bacterial templates

Some observations on mouse "bacterialization"
- Prokaryotic genomes: common part ~300 proteins, ~5-10% of a typical genome
- Eukaryotic genomes: common part ~5000 proteins, ~20-30% of a typical genome

Conservation patterns between genomes
- Figure panels: prokaryotic genomes; eukaryotic genomes
- Groups of functionally related proteins are often found in specific organisms

Is mouse just a somewhat bigger bacterium?
- For some functional groups, bacterialization is a very natural approach
  - Mitochondrial proteins: mitochondria evolved from prokaryotic symbionts
  - Basic metabolic enzymes: the basic biochemistry and fundamental processes are the same in prokaryotes and eukaryotes
- Many homologies are completely puzzling

Not all homologies are trivial
- Periplasmic binding proteins
  - Used in Gram-negative bacteria to scavenge for food in the periplasm
  - Wide specificity
  - Closed conformation recognized by specialized transporters
- Human gated ion channels
  - Transport ions through membranes
  - Regulated by glutamate, glycine, and zinc (and other things)
  - Closed conformation opens the channel
- Distant homology, RMSD ~3 Å; models of human proteins built on bacterial templates have been used successfully in planning experiments

Bacteria as a catalog of spare parts
- Focus on fundamental processes in the eukaryotic cell
  - Energy production: mitochondria
  - Core machinery of life: set of fundamental pathways and processes

Mitochondrial proteome
- 618 proteins, including 32 from the mitochondrial genome and 586 imported from the nucleus
- 392 (predicted) soluble proteins
- 192 with less than 30% sequence identity to any known structure
- 65 with unknown folds

Central core machinery of life

Enzymes from the “central core” are mostly shared between eukaryotes (mouse) and bacteria

Structural coverage of the core set of metabolic pathways
- ~10,000 enzymes
- Most have homologs in multiple organisms, including bacteria
- Could be covered by ~1000 targets at the superfamily level

Missing genes?

Conclusions
- We are within striking distance of achieving complete fold coverage of selected bacterial genomes (~200 structures for T. maritima)
- Prokaryotic genomes look like a catalog of the spare parts that eukaryotic genomes were built from
- Thousands of structures are still needed to cover the remaining ~25% of the mouse (or almost any eukaryotic) genome, and most of them have bacterial homologs
- Functionally related groups of proteins could be attractive targets