1 Genomics and Bioinformatics The "new" biology

2 What is genomics?
- Genome
  - All the DNA contained in the cell of an organism
- Genomics
  - The comprehensive study of the interactions and functional dynamics of whole sets of genes and their products. (NIAAA, NIH)
  - A "scaled-up" version of genetics research in which scientists can look at all of the genes in a living creature at the same time. (NIGMS, NIH)
- Which organism’s genome was sequenced first?

3 Genome sequencing chronology

Year  Organism                   Significance                 Genome size (bp)  Number of genes
1977  Bacteriophage φX174        First genome ever!           5,386             11
1981  Human mitochondria         First organelle              16,500            37
1995  Haemophilus influenzae Rd  First free-living organism   1,830,137         ~3,500
1996  Saccharomyces cerevisiae   First eukaryote              12,086,000        ~6,000

http://www.ncbi.nlm.nih.gov/ICTVdb/Images/Ackerman/Phages/Microvir/238-27_1.jpg
http://www.alsa.org/research/article.cfm?id=822
http://www.waterscan.co.yu/images/virusi-bakterije/Haemophilus%20influenzae.jpg
http://www.biochem.wisc.edu/yeastclub/buddingyeast(color).jpg

4 Genome sequencing chronology

Year  Organism                Significance                  Genome size (bp)  Number of genes
1998  Caenorhabditis elegans  First multicellular organism  97,000,000        ~19,000
1999  Human chromosome 22     First human chromosome        49,000,000        673
2000  Arabidopsis thaliana    First plant genome            150,000,000       ~25,000
2001  Human                   First human genome            3,000,000,000     ~30,000

http://www.sih.m.u-tokyo.ac.jp/chem1.gif
http://lter.kbs.msu.edu/Biocollections/Herbarium/Images/ARBTH3H.jpg

5 Genome sequencing projects (as of 1/26/2007)

6 Sequencing strategies: Hierarchical shotgun sequencing http://www.bio.davidson.edu/courses/GENOMICS/method/shotgun.html

7 Genome size range
- What is in the genomes? Why is there such a big difference?
Figure: genome size ranges on a logarithmic axis from 10^4 to 10^11 for viruses, plasmids, bacteria, fungi, plants, algae, insects, mollusks, bony fish, amphibians, reptiles, birds, and mammals.

8 Information contents in a genome
- Gene
  - Protein coding genes
  - RNA genes
- Regulatory elements
  - Gene expression control
  - Chromatin remodeling
  - Matrix attachment sites
- “Non-functional” elements
  - Selfish elements
  - “Junk” DNA
- ??

9 The “central dogma” of molecular biology
- Central dogma: DNA → RNA (transcription) → protein (translation); DNA → DNA (replication)

10 Expanded “central dogma” of molecular biology
- A more comprehensive view: DNA → RNA (transcription) → protein (translation) → metabolite → phenotype; DNA → DNA (replication)

11 New disciplines due to the advance in genomics
- Omics, laid over the expanded central dogma (DNA → RNA → protein → metabolite → phenotype)
- Disciplines: structural genomics, transcriptomics, proteomics, metabolomics
- Data types: genomic DNA sequences, transcript sequences, microarray data, cis-elements, TF binding sites, epigenetic regulation, shotgun protein sequences, subcellular location, post-translational modifications, protein interactions, protein structure, metabolite concentrations, metabolic flux, genetic interactions, systematic knockouts (KO), disease information

12 Nature omics gateway http://www.nature.com/omics/subjects/index.html

13 Three perspectives of our biological world
- The cellular level (~3x10^4 genes), the individual (~10^14 cells per individual), the tree of life (2-100x10^6 species)
Rosenzweig et al., 2002. Conservation Biol.
Image: http://www.tolweb.org/tree/
Image: http://www.olympusfluoview.com/gallery/cells/hela/helacells.html

14 Further complications
- Cell-cell interactions
- Cell types
- Environmental conditions
- Developmental programming
- Interactions at the organismal level
- Interactions at the population, ecosystem level

15 Definition of bioinformatics
- Bioinformatics: research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
- Computational biology: the development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.
- Q: What kinds of data are we talking about?
http://www.bisti.nih.gov/

16 Example: Sequence assembly
- Cut the genome into ~150 kb pieces
- Clone the pieces into Bacterial Artificial Chromosomes (BACs)
- Map the BAC clones to determine their order (golden/tiling path)
- Shear a BAC clone randomly
- Sequence the fragments
- Assemble the sequence reads
http://www.bio.davidson.edu/courses/GENOMICS/method/shotgun.html

17 Sequence assembly
- Challenges
  - The presence of gaps
    - Due to incomplete coverage
  - Sequencing errors and quality issues: worse at the end of reactions
    - So we can’t rely on perfectly identical sequences all the time
  - Each read is derived from one of the two DNA strands
    - Need to take the orientations of reads into account
  - Non-random sequencing of DNA
  - Presence of repeats
Figure panels: correct layout, mis-assembly. http://www.cbcb.umd.edu/research/assembly_primer.shtml
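Because a read can come from either strand, an assembler typically has to consider each read in both orientations. The following is a minimal Python sketch of that idea; the function names and the toy read are illustrative assumptions, not part of the original slides.

```python
# Complement table for unambiguous, upper-case DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def orientations(read):
    """Both ways a read can be placed in a layout: as given, or flipped."""
    return read, reverse_complement(read)


if __name__ == "__main__":
    # Hypothetical read, used only for illustration.
    print(orientations("ATGCCG"))
```

Applying `reverse_complement` twice returns the original read, which is a quick sanity check for any implementation.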

18 Overlap-layout consensus
- The relationships between reads can be represented as a graph
  - Nodes (vertices): reads
  - Edges (lines): connecting “overlapping reads”
- Goal: identifying a path through that graph that visits each node exactly once
Figure: reads 1 to 4 placed along the genome, and the corresponding path through nodes 1, 2, 3, 4. http://en.wikipedia.org/wiki/Image:Hamilton_path.gif
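The graph view above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: exact suffix/prefix overlaps only (no sequencing errors, no reverse strand), a hypothetical minimum overlap of 3 bases, made-up reads, and a brute-force search over all orderings that is only feasible for a handful of reads.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def build_graph(reads, min_len=3):
    """Map each directed edge (i, j) to the overlap length between read i and read j."""
    graph = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = overlap(a, b, min_len)
                if k:
                    graph[(i, j)] = k
    return graph


def hamiltonian_paths(n, graph):
    """Yield every ordering of the n reads in which consecutive reads share an edge."""
    for order in permutations(range(n)):
        if all((u, v) in graph for u, v in zip(order, order[1:])):
            yield order


def merge(reads, order, graph):
    """Collapse the reads along a path into one contig, dropping the overlapping prefixes."""
    contig = reads[order[0]]
    for u, v in zip(order, order[1:]):
        contig += reads[v][graph[(u, v)]:]
    return contig


if __name__ == "__main__":
    # Hypothetical reads sampled from a made-up 16-base "genome".
    reads = ["ATGGCGT", "GCGTACC", "TACCTGA", "CTGATTC"]
    graph = build_graph(reads)
    for order in hamiltonian_paths(len(reads), graph):
        print(order, merge(reads, order, graph))
```

Enumerating permutations grows factorially with the number of reads, so practical assemblers rely on heuristics rather than an exhaustive search like this one.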

19 Example: Gene prediction
- How can we identify functional elements in the genomes?
- How can we assign functions to these elements?
- How can we determine/predict the structures of these elements?
- How can we reconstruct networks describing the relationships and dynamics between these elements?
- How can we link genotypes to phenotypes?

20 Characteristics of protein coding genes
- Similarity to other genes
  - Assuming there is some level of conservation.
- Substitutions that change amino acids vs. those that don’t.
http://www.mun.ca/biology/scarr/MGA2_03-20.html
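The distinction between substitutions that change the encoded amino acid and those that do not can be made concrete with the standard genetic code. The sketch below is illustrative only: the function name and example codons are assumptions, and a change to or from a stop codon is simply reported as nonsynonymous.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon, position, new_base):
    """Label a single-base change in a codon as synonymous or nonsynonymous."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    same = CODON_TABLE[codon] == CODON_TABLE[mutated]
    return "synonymous" if same else "nonsynonymous"


if __name__ == "__main__":
    # CTT -> CTC keeps leucine; CTT -> CCT changes leucine to proline.
    print(classify_substitution("CTT", 2, "C"))
    print(classify_substitution("CTT", 1, "C"))
```

In a conserved protein coding region, changes of the first kind are expected to be tolerated more often than changes of the second kind, which is the signal this slide points to.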

21 Hidden Markov Model and gene finding
- Goal:
  - Choose a path that maximizes the probability that you will enjoy the trip (or the other way around if you wish)
- How is the probability determined?
  p = p(EL-CHI) * p(CHI-MAD) = 0.5 * 0.4 = 0.2
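The arithmetic on this slide, multiplying the probabilities along a path and preferring the path with the largest product, can be sketched as follows. Only the two probabilities for EL-CHI and CHI-MAD come from the slide; the alternative route through a stop called DET and its numbers are hypothetical placeholders. A full hidden Markov model would also include emission probabilities and would normally use dynamic programming (for example the Viterbi algorithm) instead of listing paths by hand.

```python
from math import prod

# Probabilities attached to each leg of the trip.
TRANSITIONS = {
    ("EL", "CHI"): 0.5,   # from the slide
    ("CHI", "MAD"): 0.4,  # from the slide
    ("EL", "DET"): 0.5,   # hypothetical placeholder
    ("DET", "MAD"): 0.1,  # hypothetical placeholder
}


def path_probability(path):
    """Multiply the probabilities of consecutive legs along the path."""
    return prod(TRANSITIONS[(a, b)] for a, b in zip(path, path[1:]))


def best_path(paths):
    """Return the candidate path with the highest probability."""
    return max(paths, key=path_probability)


if __name__ == "__main__":
    candidates = [["EL", "CHI", "MAD"], ["EL", "DET", "MAD"]]
    for path in candidates:
        print(path, path_probability(path))
    print("best:", best_path(candidates))
```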

22 Example: Sequence alignment
- Align retinol-binding protein (RBP) and β-lactoglobulin

RBP              1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG  50
lactoglobulin    1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD.  44

RBP             51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE  97
lactoglobulin   45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK  93

RBP             98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136
lactoglobulin   94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135

RBP            137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185
lactoglobulin  136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178

>RBP
MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRL
LNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPN
GLPPEAQKIVRQRQEELCLARQYRLIV
>lactoglobulin
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN
GECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKA
LPMHIRLSFNPTQLEEQCHI

23 Goal of pairwise sequence alignment (PSA)
- Find the alignment between two sequences that has the maximum score
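One common way to find a maximum-scoring global alignment is dynamic programming in the style of Needleman and Wunsch. The sketch below computes only the optimal score, not the alignment itself, and its scoring scheme (match +1, mismatch -1, linear gap penalty -2) is an illustrative assumption; protein alignments such as the RBP example would normally use a substitution matrix instead.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b under a simple linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diagonal, up, left)

    return score[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

Keeping a record of which of the three moves produced each cell would allow the alignment itself to be recovered by tracing back from the last cell.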

24 Extreme value distribution
- Normal vs. extreme value distribution
Figure: probability (y axis, 0 to 0.40) against x (x axis, -5 to 5) for the normal distribution and the extreme value distribution.
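The two curves compared on this slide can be evaluated directly from their density functions. The sketch below assumes the standard forms (normal with mean 0 and standard deviation 1; the extreme value, or Gumbel, distribution for maxima with location 0 and scale 1); the parameter names are illustrative. The extreme value density is asymmetric, with a heavier right tail, which matters when judging how unusual a high alignment score is.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def gumbel_pdf(x, mu=0.0, beta=1.0):
    """Density of the extreme value (Gumbel, maximum) distribution."""
    z = (x - mu) / beta
    return math.exp(-z - math.exp(-z)) / beta


if __name__ == "__main__":
    for x in range(-5, 6):
        print(f"{x:3d}  normal={normal_pdf(x):.4f}  extreme={gumbel_pdf(x):.4f}")
```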

25 Example: Microarray  A solid support (e.g. a membrane or glass slide) on which DNA of known sequence is deposited in a grid-like fashion http://shadygrove.umbi.umd.edu/microarray/Microarray.gif

26 Microarray data analysis  A simplified pipeline http://www.microarray.lu/images/overview_1.jpg

27 What’s in the CEL files
- Intensities of perfect match and mismatch probes

library(affy)  # assumption: pm() comes from the affy package and M is an AffyBatch, e.g. from ReadAffy()
#### Dimension of the data matrix
nrow(M); ncol(M)
### Perfect match
pm <- pm(M)    # perfect match intensities
dim(pm)        # dimension of the pm matrix
pm[1:5,]       # the first five rows
summary(pm)    # summary stat for the pm matrix

Output of pm[1:5,]:
     GSM131151.CEL GSM131152.CEL GSM131153.CEL GSM131160.CEL GSM131161.CEL GSM131162.CEL
[1,]         252.5         267.0         349.0         424.8         213.5         237.8
[2,]         138.0         129.8         147.5         335.5         215.3         142.3
[3,]         172.3         155.5         174.8         411.8         241.0         128.3
[4,]         163.3         142.8         155.5         494.3         225.5         119.5
[5,]         259.5         257.3         245.3         505.5         308.8         217.0

Output of summary(pm) (first four arrays):
 GSM131151.CEL     GSM131152.CEL     GSM131153.CEL     GSM131160.CEL
 Min.   :   56.3   Min.   :   67.5   Min.   :   69.5   Min.   :   96.0
 1st Qu.:  144.3   1st Qu.:  143.3   1st Qu.:  157.3   1st Qu.:  303.6
 Median :  212.5   Median :  215.0   Median :  234.8   Median :  414.5
 Mean   :  423.1   Mean   :  437.5   Mean   :  458.4   Mean   :  648.2
 3rd Qu.:  383.5   3rd Qu.:  397.8   3rd Qu.:  426.0   3rd Qu.:  637.0
 Max.   :39818.5   Max.   :39268.0   Max.   :28628.0   Max.   :24854.5

28 Probe intensity behaviors between arrays
- Distributions vary widely between experiments

### Summarize the intensity
par(mfrow=c(1,2))  # get a plotting region with 1 row, 2 col
hist(M)            # generate log2 histograms
boxplot(M)         # generate log2 boxplots

Figure axis label: log intensity

29 Example: Identification of cis-elements
- The on-off switches and rheostats of a cell operating at the gene level.
- They control whether and how vigorously genes are transcribed into RNAs.
http://genomicsgtl.energy.gov/science/generegulatorynetwork.shtml

30 Motif model: Position Frequency Matrix (PFM)
- f_{b,i}: frequency with which base b occurs at the i-th position
D’haeseleer (2006) Nature Biotech. 24:423

31 Motif model: Position Weight Matrix (PWM)
- Suppose p_A = p_T = 0.32 and p_G = p_C = 0.18 (Arabidopsis thaliana)

Position Frequency Matrix
     1     2     3     4     5
A    8     0     4     4     2
T    0     0     0     2     2
G    0     8     4     2     2
C    0     0     0     0     2

Position Weight Matrix
     1     2     3     4     5
A    1.1  -2.2   0.4   0.4  -0.2
T   -2.2  -2.2  -2.2  -0.2  -0.2
G   -2.2   1.6   1.0   0.3   0.3
C   -2.2  -2.2  -2.2  -2.2   0.3
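The slide does not state how the weights are derived from the counts. One formula that reproduces most of the displayed values to one decimal place is a natural-log odds ratio with a background-weighted pseudocount of 1, that is w = ln(((count + p_b) / (N + 1)) / p_b), where N is the number of sites (8 here). The sketch below uses that formula as an assumption; the function names are illustrative.

```python
import math

# Background base frequencies and counts as given on the slide.
BACKGROUND = {"A": 0.32, "T": 0.32, "G": 0.18, "C": 0.18}
PFM = {
    "A": [8, 0, 4, 4, 2],
    "T": [0, 0, 0, 2, 2],
    "G": [0, 8, 4, 2, 2],
    "C": [0, 0, 0, 0, 2],
}


def pwm_from_pfm(pfm, background, pseudocount=1.0):
    """Convert counts to log-odds weights (assumed formula, see the text above)."""
    n_sites = sum(row[0] for row in pfm.values())  # every column sums to N
    pwm = {}
    for base, counts in pfm.items():
        p = background[base]
        pwm[base] = [
            math.log((c + pseudocount * p) / (n_sites + pseudocount) / p)
            for c in counts
        ]
    return pwm


def score_site(site, pwm):
    """Sum the position-specific weights for a candidate site."""
    return sum(pwm[base][i] for i, base in enumerate(site))


if __name__ == "__main__":
    pwm = pwm_from_pfm(PFM, BACKGROUND)
    for base, weights in pwm.items():
        print(base, [round(w, 1) for w in weights])
    print(round(score_site("AGAAT", pwm), 2))
```

Scoring a candidate site is then a sum of one weight per position, so sites that resemble the motif score high and sites built from unobserved bases score strongly negative.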

32 Example: Cis-regulatory logic
- Based on a high confidence set of binding sites:
  - 3,353 interactions between
  - 116 regulators and
  - 1,296 promoters
Harbison et al. (2004) Nature 431:99

33 Identification of putative cis elements
- Pearson's correlation coefficient as the similarity measure.
- k-means clustering to identify co-regulated genes.
- Motifs identified only with AlignACE
Beer and Tavazoie (2004) Cell 117:185
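As an illustration of the similarity measure named above, the following sketch computes Pearson's correlation coefficient between two expression profiles. The profiles are made-up numbers, and the function is a plain textbook implementation, not the code used in the cited study.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical expression profiles across four conditions.
    gene_a = [1.0, 2.0, 3.0, 4.0]
    gene_b = [1.0, 3.0, 2.0, 4.0]
    print(pearson(gene_a, gene_b))
```

Genes whose profiles correlate strongly end up close together when a clustering method such as k-means is run on a correlation-based distance, for example 1 - r.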

34 Bayesian network  Bayes' theorem  Bayesian network Charniak (1991) Bayesian networks without tears
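The formulas on this slide did not survive in the transcript. For reference, the standard statements are given below: Bayes' theorem for two events A and B, and the factorization that a Bayesian network over variables X_1, ..., X_n encodes, where pa(X_i) denotes the parents of X_i in the network graph. The symbol names are the conventional ones and are not taken from the slide.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad P(B) > 0

P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{pa}(X_i)\bigr)
```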

35 Final example: Relationships between sequences  Sanger and colleagues (1950s): 1st sequence  Insulin from various mammals

36 Trees
- An acyclic, undirected graph with nodes and edges
Figure (Li 1997. Molecular Evolution. p101): example trees with labeled branch lengths, a time axis, and a scale bar of one unit. Labels: operational taxonomic unit, ancestral taxonomic units, external branch, internal branch.

37 Enumerating trees
- Suppose there are n OTUs (n ≥ 3)
  - Bifurcating rooted trees: (2n-3)! / (2^(n-2) (n-2)!)
  - Unrooted trees: (2n-5)! / (2^(n-3) (n-3)!)
- For 10 OTUs
  - 3.4x10^7 possible rooted trees
  - 2.0x10^6 possible unrooted trees
http://w3.uniroma1.it/cogfil/philotrees.jpg
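The counts quoted for 10 OTUs can be checked with the standard formulas for bifurcating trees: (2n-3)! / (2^(n-2) (n-2)!) rooted trees and (2n-5)! / (2^(n-3) (n-3)!) unrooted trees. A short sketch, with illustrative function names:

```python
from math import factorial


def rooted_trees(n):
    """Number of bifurcating rooted trees for n OTUs (n >= 2)."""
    if n < 2:
        raise ValueError("need at least 2 OTUs")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n):
    """Number of bifurcating unrooted trees for n OTUs (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 OTUs")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


if __name__ == "__main__":
    for n in (3, 4, 5, 10):
        print(n, rooted_trees(n), unrooted_trees(n))
```

For n = 10 this gives 34,459,425 rooted and 2,027,025 unrooted trees, which is why exhaustive tree search quickly becomes impractical as OTUs are added.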

38 Impacts of genomics and bioinformatics
- New ways to ask and answer questions?
  - Hypothesis driven vs. data driven
  - A matter of scale
  - A matter of integration
  - Quantitative emphasis
  - Multi-disciplinary approaches
- How is genomics different from genetics?
  - Whole genome approach versus a few genes
  - Investigations into the structure and function of very large numbers of genes undertaken in a simultaneous fashion.
  - Genetics looks at single genes, one at a time, as a snapshot.
  - Genomics is trying to look at all the genes as a dynamic system, over time, and determine how they interact and influence biological pathways and physiology, in a much more global sense.

39 The END ...

