Genomics and Bioinformatics The "new" biology

What is genomics?
- Genome
  - All the DNA contained in the cell of an organism
- Genomics
  - The comprehensive study of the interactions and functional dynamics of whole sets of genes and their products. (NIAAA, NIH)
  - A "scaled-up" version of genetics research in which scientists can look at all of the genes in a living creature at the same time. (NIGMS, NIH)
- Which organism’s genome was sequenced first?

Genome sequencing chronology
Year  Organism                   Significance                Genome size (bp)  Number of genes
1977  Bacteriophage φX174        First genome ever!          5,
      Human mitochondria         First organelle             16,
      Haemophilus influenzae Rd  First free-living organism  1,830,137         ~3,
      Saccharomyces cerevisiae   First eukaryote             12,086,000        ~6,

Genome sequencing chronology
Year  Organism                   Significance                  Genome size (bp)  Number of genes
1998  Caenorhabditis elegans     First multicellular organism  97,000,000        ~19,
      Human chromosome 22        First human chromosome        49,000,
      Arabidopsis thaliana       First plant genome            150,000,000       ~25,
      Human                      First human genome            3,000,000,000     ~30,

Genome sequencing projects (as of 1/26/2007)

Sequencing strategies: Hierarchical shotgun sequencing

Genome size range
- What is in the genomes? Why is there such a big difference in size?
[Figure: genome size ranges for viruses, plasmids, bacteria, fungi, plants, algae, insects, mollusks, reptiles, birds, mammals, bony fish, and amphibians]

Information contents in a genome
- Gene
  - Protein coding genes
  - RNA genes
- Regulatory elements
  - Gene expression control
  - Chromatin remodeling
  - Matrix attachment sites
- “Non-functional” elements
  - Selfish elements
  - “Junk” DNA
  - ??

The “central dogma” of molecular biology
- Central dogma: DNA → RNA → Protein
  - Replication: DNA → DNA
  - Transcription: DNA → RNA
  - Translation: RNA → Protein
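
The information flow in the central dogma can be illustrated with a minimal Python sketch. It assumes the input is the coding strand of a gene (so transcription is reduced to swapping T for U) and uses the standard genetic code; the function names `transcribe` and `translate` are illustrative, not from the slides.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """DNA -> RNA, assuming the coding strand is given (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """RNA -> protein, reading codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGAAATGGTAA")
protein = translate(mrna)
```

Real transcription reads the template strand and real genes have introns and untranslated regions; this sketch ignores all of that on purpose.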

Expanded “central dogma” of molecular biology
- A more comprehensive view: DNA → RNA → Protein → Metabolite → Phenotype
  - Replication, transcription, and translation as in the central dogma

New disciplines due to advances in genomics
- Omics along the flow DNA → RNA → Protein → Metabolite → Phenotype
- Disciplines: structural genomics, transcriptomics, proteomics, metabolomics
- Data types: genomic DNA sequences; transcript seq; microarray data; cis-elements; TF binding sites; epigenetic regulation; shotgun protein seq; subcellular location; post-translational modification; protein interaction; protein structure; metabolite concentration; metabolic flux; genetic interactions; systematic KO; disease information

Nature omics gateway

Three perspectives of our biological world
- The cellular level, the individual, the tree of life
- ~10^14 cells per individual; 2-100x10^6 species; ~3x10^4 genes
Rosenzweig et al., Conservation Biol.

Further complications
- Cell-cell interactions
- Cell types
- Environmental conditions
- Developmental programming
- Interactions at the organismal level
- Interactions at the population, ecosystem level

Definition of bioinformatics
- Bioinformatics
  - Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
- Computational biology
  - The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.
- Q: What kinds of data are we talking about?

Example: Sequence assembly
- Cut the genome into ~150 kb pieces
- Clone into Bacterial Artificial Chromosomes (BACs)
- Map to determine the order of the BAC clones (golden/tiling path)
- Shear a BAC clone randomly
- Sequence
- Assemble sequence reads

Sequence assembly
- Challenges
  - The presence of gaps
    - Due to incomplete coverage
  - Sequencing errors and quality issues: worse at the end of reactions
    - So can’t rely on perfectly identical sequences all the time
  - Sequences derived from one strand of DNA
    - Need to take orientations of reads into account
  - Non-random sequencing of DNA
  - Presence of repeats
[Figure: correct layout vs. mis-assembly]

Overlap-layout-consensus
- The relationships between reads can be represented as a graph
  - Nodes (vertices): reads
  - Edges (lines): connect “overlapping reads”
- Goal: identify a path through the graph that visits each node exactly once (a Hamiltonian path)
[Figure: reads laid out along the genome]
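
The overlap-layout-consensus idea can be sketched in a few lines of Python. This is a toy illustration under strong assumptions: reads are error-free, all come from the same strand, and overlaps are exact suffix-prefix matches of at least `min_len` bases (a made-up threshold). The brute-force search over read orders only works for a handful of reads; real assemblers use heuristics instead. All function names are illustrative.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def build_overlap_graph(reads, min_len=3):
    """Overlap step: directed edges (a, b) -> overlap length for overlapping reads."""
    graph = {}
    for a in reads:
        for b in reads:
            if a != b:
                k = overlap(a, b, min_len)
                if k > 0:
                    graph[(a, b)] = k
    return graph


def hamiltonian_layout(reads, graph):
    """Layout step: an ordering that visits every read exactly once along edges."""
    for order in permutations(reads):
        if all((a, b) in graph for a, b in zip(order, order[1:])):
            return list(order)
    return None


def consensus(order, graph):
    """Consensus step: merge the ordered reads, dropping the overlapping prefixes."""
    sequence = order[0]
    for a, b in zip(order, order[1:]):
        sequence += b[graph[(a, b)]:]
    return sequence


reads = ["CGTACGT", "ACGTTAGC", "ATGCGTA"]  # toy reads from ATGCGTACGTTAGC
graph = build_overlap_graph(reads)
layout = hamiltonian_layout(reads, graph)
assembled = consensus(layout, graph)
```

With repeats longer than the reads, several layouts can satisfy the graph, which is one way mis-assemblies arise.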

Example: Gene prediction
- How can we identify functional elements in the genomes?
- How can we assign functions to these elements?
- How can we determine/predict the structures of these elements?
- How can we reconstruct networks describing the relationships and dynamics between these elements?
- How can we link genotypes to phenotypes?

Characteristics of protein coding genes
- Similarity to other genes
  - Assuming there is some level of conservation.
- Substitutions that change amino acids (nonsynonymous) vs. those that don’t (synonymous).

Hidden Markov Model and gene finding
- Goal:
  - Choose a path that maximizes the probability that you will enjoy the trip (or the other way around if you wish)
- How is the probability determined?
  p = p(EL-CHI) * p(CHI-MAD) = 0.5 * 0.4 = 0.2
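
The path-probability calculation on this slide can be written as a short Python sketch. It covers only the transition part of a Markov model (a full HMM would also multiply in emission probabilities). The two probabilities for EL-CHI and CHI-MAD come from the slide; the alternative route through a stop called `X` and its probabilities are hypothetical values added only so there is something to compare against.

```python
from math import prod

# Transition probabilities between stops. EL-CHI and CHI-MAD are from the slide;
# the entries involving "X" are hypothetical, for illustration only.
TRANSITIONS = {
    ("EL", "CHI"): 0.5,
    ("CHI", "MAD"): 0.4,
    ("EL", "X"): 0.5,
    ("X", "MAD"): 0.1,
}


def path_probability(path, transitions):
    """Probability of a path = product of the transition probabilities along it."""
    return prod(transitions.get(step, 0.0) for step in zip(path, path[1:]))


def best_path(paths, transitions):
    """Pick the candidate path with the highest probability."""
    return max(paths, key=lambda p: path_probability(p, transitions))


candidates = [["EL", "CHI", "MAD"], ["EL", "X", "MAD"]]
p_slide = path_probability(["EL", "CHI", "MAD"], TRANSITIONS)
chosen = best_path(candidates, TRANSITIONS)
```

Enumerating every path is only feasible for tiny models; dynamic programming (for example the Viterbi algorithm) finds the most probable path without listing them all.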

Example: Sequence alignment
- Align retinol-binding protein (RBP) and β-lactoglobulin
[Figure: pairwise alignment of RBP (residues 1-185) with β-lactoglobulin, shown in blocks of 50 columns with gaps marked by dots]
>RBP
MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRL
LNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPN
GLPPEAQKIVRQRQEELCLARQYRLIV
>lactoglobulin
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN
GECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKA
LPMHIRLSFNPTQLEEQCHI

Goal of pairwise sequence alignment (PSA)
- Find the alignment between 2 sequences that has the maximum score
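
One common way to find a maximum-scoring global alignment is Needleman-Wunsch dynamic programming, sketched below in Python. The slide does not name a method or scoring scheme, so both the algorithm choice and the scores (match = 1, mismatch = -1, linear gap = -2) are assumptions for illustration; protein alignments normally use a substitution matrix instead.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for a global alignment."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best_score, aligned_a, aligned_b = needleman_wunsch("ACGT", "AGT")
```

When several alignments tie for the best score, the traceback above returns just one of them, preferring residue pairs over gaps.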

Extreme value distribution
- Normal vs. extreme value distribution
[Figure: probability vs. x for the normal distribution and the extreme value distribution]

Example: Microarray
- A solid support (e.g. a membrane or glass slide) on which DNA of known sequence is deposited in a grid-like fashion

Microarray data analysis  A simplified pipeline

What’s in the CEL files
- Intensities of perfect match and mismatch probes

    library(affy)     # assumed: M is an AffyBatch, e.g. M <- ReadAffy()

    #### Dimension of the data matrix
    nrow(M); ncol(M)

    ### Perfect match
    pm <- pm(M)       # perfect match intensities
    dim(pm)           # dimension of the pm matrix
    pm[1:5,]          # the first five rows
    summary(pm)       # summary stat for the pm matrix

[Output: pm intensities of the first five probes for each GSM CEL array, followed by per-array summary statistics (Min., 1st Qu., Median, Mean, 3rd Qu., Max.); the minima shown for the first three arrays are 56.3, 67.5, and 69.5]

Probe intensity behaviors between arrays
- Distributions vary widely between experiments

    ### Summarize the intensity
    par(mfrow=c(1,2))   # get a plotting region with 1 row, 2 col
    hist(M)             # generate log2 histograms
    boxplot(M)          # generate log2 boxplots

[Figure: histogram and boxplots of log intensity for each array]

Example: Identification of cis-elements
- The on-off switches and rheostats of a cell, operating at the gene level.
- They control whether and how vigorously genes are transcribed into RNAs.

Motif model: Position Frequency Matrix (PFM)
- f_{b,i}: frequency of base b at the i-th position
D’haeseleer (2006) Nature Biotech. 24:423

Motif model: Position Weight Matrix (PWM)
- Suppose p_A = p_T = 0.32 and p_G = p_C = 0.18 (Arabidopsis thaliana)
Position Frequency Matrix
Position  1  2  3  4  5
A         8  0  4  4  2
T         0  0  0  2  2
G         0  8  4  2  2
C         0  0  0  0  2
[Position Weight Matrix: rows A, T, G, C]
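
A PWM can be derived from a PFM as log-odds against the background base frequencies. The Python sketch below reads the slide's flattened counts as five positions over eight sites (an interpretation of the garbled table) and uses the stated background of 0.32 for A/T and 0.18 for G/C. The pseudocount of 1 per column, split according to the background, is an assumption added here to avoid taking the log of zero; the slide does not say how zero counts were handled.

```python
from math import log2

# Counts per base (rows) and motif position (columns), as read from the slide.
PFM = {
    "A": [8, 0, 4, 4, 2],
    "T": [0, 0, 0, 2, 2],
    "G": [0, 8, 4, 2, 2],
    "C": [0, 0, 0, 0, 2],
}
BACKGROUND = {"A": 0.32, "T": 0.32, "G": 0.18, "C": 0.18}


def pfm_to_pwm(pfm, background, pseudocount=1.0):
    """Log2 odds of the (pseudocount-smoothed) base frequency vs. the background."""
    n_positions = len(next(iter(pfm.values())))
    pwm = {base: [] for base in pfm}
    for i in range(n_positions):
        total = sum(pfm[base][i] for base in pfm)
        for base in pfm:
            freq = (pfm[base][i] + pseudocount * background[base]) / (total + pseudocount)
            pwm[base].append(log2(freq / background[base]))
    return pwm


def score_site(site, pwm):
    """Sum of position-specific weights for a candidate site of motif length."""
    return sum(pwm[base][i] for i, base in enumerate(site))


PWM = pfm_to_pwm(PFM, BACKGROUND)
strong = score_site("AGAAA", PWM)
weak = score_site("CCCCC", PWM)
```

Positive weights mark bases enriched over background at a position; a site's score is the log-odds that it came from the motif rather than from background sequence.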

Example: Cis-regulatory logic
- Based on a high-confidence set of binding sites: 3,353 interactions between 116 regulators and 1,296 promoters
Harbison et al. (2004) Nature 431:99

Identification of putative cis elements
- Pearson's correlation coefficient as the similarity measure.
- k-means clustering to identify co-regulated genes.
- Motifs identified only with AlignACE
Beer and Tavazoie (2004) Cell 117:185
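
The clustering step can be sketched in Python as k-means that assigns each expression profile to the center it correlates with best. This is an illustration of the idea only, not the procedure from the cited paper: the toy profiles, the choice of initial centers, and the function names are all made up here, and profiles with zero variance are assumed not to occur.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def kmeans_by_correlation(profiles, init_centers, n_iter=20):
    """k-means where similarity to a cluster center is Pearson correlation."""
    centers = [list(c) for c in init_centers]
    labels = [None] * len(profiles)
    for _ in range(n_iter):
        new_labels = [
            max(range(len(centers)), key=lambda k: pearson(p, centers[k]))
            for p in profiles
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in range(len(centers)):
            members = [p for p, lab in zip(profiles, labels) if lab == k]
            if members:  # center = mean profile of the cluster members
                centers[k] = [mean(col) for col in zip(*members)]
    return labels


# Toy expression profiles (4 hypothetical genes x 4 conditions).
profiles = [[1, 2, 3, 4], [2, 4, 6, 8.5], [4, 3, 2, 1], [8, 6.5, 4, 2]]
labels = kmeans_by_correlation(profiles, init_centers=[profiles[0], profiles[2]])
```

Genes that land in the same cluster have similarly shaped profiles regardless of absolute expression level, which is why correlation is a natural similarity measure when looking for co-regulation.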

Bayesian network
- Bayes' theorem
- Bayesian network
Charniak (1991) Bayesian networks without tears
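
For reference, Bayes' theorem and the factorization that a Bayesian network encodes can be written as follows. The symbols A, B, X_i and Pa(X_i) (the parents of node X_i in the network) are notation chosen here, not taken from the slide.

```latex
% Bayes' theorem: posterior = likelihood x prior / evidence
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

% Bayesian network: the joint distribution factorizes over the graph,
% each variable depending only on its parents
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```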

Final example: Relationships between sequences
- Sanger and colleagues (1950s): 1st sequence
  - Insulin from various mammals

Trees
- An acyclic, undirected graph with nodes and edges
[Figure (Li, Molecular Evolution, p. 101): example trees over taxa A-I (drawn against time) and A-E (branch lengths in units), labeling operational taxonomic units, ancestral taxonomic units, external branches, and internal branches]

Enumerating trees
- Suppose there are n OTUs (n ≥ 3)
  - Bifurcating rooted trees: N_R = (2n-3)! / [2^(n-2) (n-2)!]
  - Unrooted trees: N_U = (2n-5)! / [2^(n-3) (n-3)!]
- For 10 OTUs
  - 3.4x10^7 possible rooted trees
  - 2.0x10^6 possible unrooted trees
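
The counts for 10 OTUs can be checked with the standard formulas for the number of bifurcating trees, (2n-3)!/(2^(n-2)(n-2)!) for rooted and (2n-5)!/(2^(n-3)(n-3)!) for unrooted trees. The Python sketch below is a quick check; the function names are illustrative.

```python
from math import factorial


def count_rooted_trees(n):
    """Number of bifurcating rooted trees for n OTUs (n >= 2)."""
    if n < 2:
        raise ValueError("need at least 2 OTUs")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def count_unrooted_trees(n):
    """Number of bifurcating unrooted trees for n OTUs (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 OTUs")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


rooted_10 = count_rooted_trees(10)      # 34,459,425, about 3.4x10^7
unrooted_10 = count_unrooted_trees(10)  # 2,027,025, about 2.0x10^6
```

The number of unrooted trees for n OTUs equals the number of rooted trees for n-1 OTUs, and both grow faster than exponentially, which is why exhaustive tree search is only practical for small n.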

Impacts of genomics and bioinformatics
- New ways to ask and answer questions?
  - Hypothesis driven vs. data driven
  - A matter of scale
  - A matter of integration
  - Quantitative emphasis
  - Multidisciplinary approaches
- How is genomics different from genetics?
  - Whole-genome approach versus a few genes
  - Investigations into the structure and function of very large numbers of genes, undertaken simultaneously.
  - Genetics looks at single genes, one at a time, as a snapshot.
  - Genomics tries to look at all the genes as a dynamic system, over time, and determine how they interact and influence biological pathways and physiology, in a much more global sense.

The END ...