Download presentation
Presentation is loading. Please wait.
Published byMonica Barker Modified over 9 years ago
1
BioSci D145 lecture 1 page 1 © copyright Bruce Blumberg 2010. All rights reserved BioSci D145 Lecture #4 Bruce Blumberg (blumberg@uci.edu) –4103 Nat Sci 2 - office hours Tu, Th 3:30-5:00 (or by appointment) –phone 824-8573 TA – Bassem Shoucri (bshoucri@uci.edu) –4351 Nat Sci 2, 824-6873, 3116 Office hours M 2-4 lectures will be posted on web pages after lecture –http://blumberg.bio.uci.edu/biod145-w2015http://blumberg.bio.uci.edu/biod145-w2015 –http://blumberg-lab.bio.uci.edu/biod145-w2015http://blumberg-lab.bio.uci.edu/biod145-w2015 –Last year’s midterm is now posted. It is useful to work through the problems to see what sort of questions I ask and how best to study. –Term paper outlines due Thursday (1/29) by midnight. Please upload to the drop box (1st choice) or e-mail to me (2nd choice).
2
Term paper outline Title of your proposal A paragraph introducing your topic and explaining why it is important; i.e., what impact will the knowledge gained have. –Why should any funding agency give you money to pursue this research? NIH now requires a statement of human health relevance for all grant applications NSF wants to know what is the intellectual merit of your proposed research and what broader impacts of your proposed research Present your hypothesis –A supposition or conjecture put forth to account for known facts; esp. in the sciences, a provisional supposition from which to draw conclusions that shall be in accordance with known facts, and which serves as a starting-point for further investigation by which it may be proved or disproved and the true theory arrived at. Enumerate 2-3 specific aims in the form of questions that test your hypothesis –At least one of these aims needs to have a strong “whole genome” component BioSci D145 lecture 4 page 2 © copyright Bruce Blumberg 2004-20014. All rights reserved
3
BioSci D145 lecture 4 page 3 © copyright Bruce Blumberg 2004-2014. All rights reserved Genome sequencing The problem –Genome sizes for most eukaryotes are large (10 8 -10 9 bp) –High quality sequences only about 600-800 bp per run The solution –Break genome into lots of bits and sequence them all –Reassemble with computer The benefit –Rapid increase in information about genome size, gene comparisons, etc The cost –3 x 10 9 bp(human haploid genome) ÷ 600 bp/reaction = 5 x 10 6 reactions for 1x coverage! –Need both strands (x2), need overlaps and need to be sure of sequences –~10 7 -10 8 reactions/runs required for a human-sized genome –About $1-2 per reaction these days, ~$8 commercially.
4
BioSci D145 lecture 4 page 4 © copyright Bruce Blumberg 2004-2007. All rights reserved Genome sequencing(contd)
5
BioSci D145 lecture 4 page 5 © copyright Bruce Blumberg 2004-2007. All rights reserved Genome sequencing (contd) Shotgun sequencing (contd) –How to minimize sequence redundancy? Best way to minimize redundancy is map before you start –C. elegans was done this way - when the sequence was finished, it was FINISHED »mapping took almost 10 years –mapping much too tedious and nonprofitable for Celera »who cares about redundancy, let’s sequence and make $$ »There is scientific value to draft genomes, too. why does redundancy matter? –Finished sequence today costs about $0.50/base –Note that at 10x, 99.995% coverage leaves at least 150 kb of the human genome unsequenced
6
BioSci D145 lecture 4 page 6 © copyright Bruce Blumberg 2004-2007. All rights reserved Genome sequencing (contd) –Mapping by fingerprinting –Mapping by hybridization
7
BioSci D145 lecture 4 page 7 © copyright Bruce Blumberg 2004-2007. All rights reserved Traditional (map first) vs STC (map as you go along) mapping
8
BioSci D145 lecture 4 page 8 © copyright Bruce Blumberg 2004-2007. All rights reserved The human genome In Feb 12 2001, Celera and Human Genome project published “draft” human genome sequencs –Celera -> 39114 –Ensembl -> 29691 –Consensus from all sources ~30K Number of genes –C. elegans – 19,000 –Arabidopsis - 25,000 Predictions had been from 50-140k human genes –What’s up with that? –Are we only slightly more complicated than a weed? –How can we possibly get a human with less than 2x the number of genes as C. elegans –Implications? UNRAVELING THE DNA MYTH: The spurious foundation of genetic engineering, Barry Commoner, Harpers Magazine Feb, 2002
9
BioSci D145 lecture 4 page 9 © copyright Bruce Blumberg 2004-2007. All rights reserved The human genome The answer – Gene sets don’t overlap completely (duh) –Floor is 42K –130056 build #236 UniGene Clusters (from EST and mRNA sequencing –http://www.ncbi.nlm.nih.gov/unigene –Up from 123,459 in 2013 (85,793, 105,680, 128,826, 123,891 previous years) Important questions to be answered about what constitutes a “gene” = 42113 –Crick genes? DNA-RNA-protein –How about RNAs? –miRNAs? –Antisense transcripts?
10
BioSci D145 lecture 4 page 10 © copyright Bruce Blumberg 2004-2007. All rights reserved Genome sequencing(contd) –Whole genome shotgun sequencing (Celera) premise is that rapid generation of draft sequence is valuable why bother trying to clone and sequence difficult regions? –Basically just forget regions of repetitive DNA - not cost effective using this approach, genomes rarely are completely finished –rule of thumb is that it takes at least as long to finish the last 5% as it took to get the first 95% problems –sequence may never be complete as is C. elegans –much redundant sequence with many sparse regions and lots of gaps. –Fragment assembly for regions of highly repetitive DNA is dubious at best –“Finished” fly and human genomes lack more than a few already characterized genes
11
BioSci D145 lecture 4 page 11 © copyright Bruce Blumberg 2004-2007. All rights reserved The human genome How finished is the human genome sequence? –Draft sequence to high coverage –Chromosome by chromosome finishing now Chr 22 – 1999 Chr 21 – 2000 Chr 20 – 2001 Chr 15 – 2003 Chr 6,7,Y-2003 Chr 13,19 -2004 May 2006 – all finished
12
BioSci D145 lecture 5 page 12 © copyright Bruce Blumberg 2004-2007. All rights reserved Genome sequencing (contd) Knowing what we know now – how to approach a large new genome? –Xenopus tropicalis 1.7 Gb (about ½ human) –BAC end sequencing –Whole genome shotgun –HAPPY mapping and radiation hybrid mapping to order scaffolds –Gaps closed with BACS –8.5 x coverage (but > 9000 scaffolds for 18 chromosomes) –Finishing now in process But how “finished” will it be? We need to wait and see 2011 – update. –Xenopus laevis – 454 sequencing to 4x and de novo assembly 2015 update – now version 8.0 – 10x coverage –FINALLY integrated BAC end sequences –Integrated genetic map –50% of contigs > 72 kb
13
BioSci D145 lecture 5 page 13 © copyright Bruce Blumberg 2004-2007. All rights reserved Other sequencing technologies Sequencing by hybridization –Construct a high-density microchip with all possible combinations of a short oligonucleotide Up to 25-mers By photolithography –Synthesized on chip directly –Label and hybridize fragment to be sequenced –Wash stringently –Read fluorescent spots –Reconstruct sequence by computer
14
BioSci D145 lecture 5 page 14 © copyright Bruce Blumberg 2004-2007. All rights reserved Other sequencing technologies (contd) Sequencing by hybridization rarely used for de novo sequencing –Extremely fast and useful to sequence something you already know the sequence of but want to identify mutation - resequencing –Disease causing changes e.g in mitochondrial DNA –SNP discovery –Works best for examining sequence of <10 kb
15
BioSci D145 lecture 5 page 15 © copyright Bruce Blumberg 2004-2007. All rights reserved Other sequencing technologies (contd) http://www.affymetrix.com/products/arrays/index.affx SNP discovery –Photo shows mitochondrial chip –Right panel shows pairs of normal (top) vs disease (bottom) (Leber’s Hereditary Optic Neuropathy) Top 3 disease mutations Bottom control with no change
16
BioSci D145 lecture 5 page 16 © copyright Bruce Blumberg 2004-2007. All rights reserved Other sequencing technologies – Next Generation sequencing 2 nd generation = high throughput, short sequences 3 rd generation = single molecule sequencing Small number of sequence templates (thousands) but very long reads (~10 5 bp) What is the immediate implication of this technology for genome assembly? Key review is Metzger, M.L. (2010) Sequencing technologies — the next generation, Nature Reviews Genetics 11, 31-46. We should now be able to completely sequence large insert clones directly and avoid fragmentation by repetitive elements!
17
3 rd generation
18
BioSci D145 lecture 5 page 18 © copyright Bruce Blumberg 2004-2007. All rights reserved Other sequencing technologies (contd) Pyrosequencing – –http://www.454.comhttp://www.454.com –Based on synthesis of complementary strand to a template (like Sanger) –Detection of elongation with chemiluminescence Fragment genome to appropriate size (depends on application) add adapters to each end Isolate those with different adapters on each end PCR to amplify
19
BioSci D145 lecture 5 page 19 © copyright Bruce Blumberg 2004-2007. All rights reserved Other sequencing technologies (contd) Pyrosequencing (contd) –PCR – capture template on micro beads such that each bead gets 1 molecule of DNA – how? –Emulsify in water/oil microreactors –Amplify DNA –Break and recover DNA containing beads Use a large ratio of beads to DNA
20
BioSci D145 lecture 5 page 20 © copyright Bruce Blumberg 2004-2007. All rights reserved Other sequencing technologies (contd) Pyrosequencing (contd) –Sequencing – load beads into picotiter wells Add enzymes (sulfurylase and luciferase) Run reaction – flow nucleotide/buffer solution across wells one at a time Complementary nucleotide addition leads to light output –light output is proportional to # consecutive nucleotides
21
BioSci D145 lecture 5 page 21 © copyright Bruce Blumberg 2004-2007. All rights reserved Other sequencing technologies (contd) Pyrosequencing (contd) –What is the point? Can generate 400,000 reads in parallel (FLX) Or > 1,000,000 (FLX Titanium) Each read is 200-400 bp (FLX), or 400-600 (FLX Titanium) So you can get –8 x10 7 bp per run! (FLX) –4-6 x 10 8 bp/run (FLX Titanium) What is massively parallel sequencing good for? –Rapid sequencing of genomes, or resequencing of known sequences –Ancient DNA (even dinosaurs? – Svante Pääbo says ~200K years is limit) –ChIP-sequencing (week 6) –Sequencing ESTs or other tags –Determining microbial diversity in field samples –Transcriptome sequencing –Identifying variations in Viral populations Gene sequences in mixed populations
22
BioSci D145 lecture 5 page 22 © copyright Bruce Blumberg 2004-2007. All rights reserved Amplicon sequencing Idea is to sequence many copies of the same thing –Gene sequence –mRNA transcript
23
BioSci D145 lecture 5 page 23 © copyright Bruce Blumberg 2004-2007. All rights reserved Amplicon sequencing (contd) What is amplicon sequencing good for? –Discovery of rare somatic mutations in complex samples (e.g., cancerous tumors - mixed with germline DNA) based on ultra-deep sequencing of amplicons –Sequencing collections of exons from populations of individuals to identify diversity –Sequencing collections of human exons from populations of individuals for the identification of rare alleles associated with disease –Analysis of viral quasispecies present within infected populations in the context of epidemiological studies –Evolutionary biology in populations
24
BioSci D145 lecture 6 page 24 © copyright Bruce Blumberg 2010. All rights reserved Comparative genomics Study of similarities and differences between genome structure and organization –How many genes? Chromosomes? –Genome duplications –Gene loss Driving forces –Understanding evolution in molecular terms –Sequence annotation and function identification Sequences with important functions tend to be conserved across evolution Orthology vs paralogy –Homolog – –Orthologs - –Paralogs - descended from a common ancestor (Hox genes) homologous genes in different organisms that encode proteins with the same function and which have evolved by direct vertical descent (frog and human Hoxa-1) homologous genes that encode proteins with related but non-identical functions (Hoxa-1, Hoxb-1, Hoxd-1) Derived by gene duplication
25
BioSci D145 lecture 6 page 25 © copyright Bruce Blumberg 2010. All rights reserved Comparative genomics (contd) Functional equivalency does not require homology, sequence similarity or even 3D structure –Same chemical reaction can be catalyzed by totally unrelated enzymes –Non-orthologous gene displacement – when non- orthologous genes encode the same essential cellular function Better term would be analogous gene Convergent evolution also sometimes used
26
BioSci D145 lecture 6 page 26 © copyright Bruce Blumberg 2010. All rights reserved Comparative genomics (contd) Genes with very different functions can be related –3-D structure may indicate that proteins are related (evolved from the same ancestral protein) but sequence identity too low to detect Expected when genes diverge from a distant common ancestor < 20% amino acid sequence identity too little to establish homology (although proteins may be homologous) –For example 3-D structures of –D-alanine ligase –Glutathione synthetase –ATP-binding domains of »Carbamoyl phosphate sythetase »Succinyl-CoA synthetase Are all so similar in 3D structure that homology is not in doubt but sequence comparisons do not detect homology Why should we care whether genes are related or not? Essential for understanding how evolution works at the molecular level
27
BioSci D145 lecture 6 page 27 © copyright Bruce Blumberg 2010. All rights reserved Comparative genomics (contd) stopped here Protein evolution –Observation – many proteins composed of discrete domains –Observation – many proteins have multiple domains shared with other proteins –Conclusion – domain shuffling must have occurred during evolution –Some correlation between exons and protein domains Protein domains tend to be encoded in 1 or two exons New combinations of protein domains can be created by recombination –LINEs –Between repetitive elements in introns Exon shuffling – process of transferring exons (and hence functional domains) between proteins
28
BioSci D145 lecture 6 page 28 © copyright Bruce Blumberg 2010. All rights reserved Comparative genomics (contd) Protein evolution (contd) –Haemostatic (aka blood clotting) proteins as an exon shuffling paradigm Family of proteases that are activated by proteolysis Protein domains show strong correlation with exons
29
BioSci D145 lecture 6 page 29 © copyright Bruce Blumberg 2010. All rights reserved Comparative genomics (contd) stopped here Protein evolution (contd) –What is horizontal gene transfer Fairly rare with eukaryotes Happens in prokaryotes all the time – Examples? –What does this suggest about nature of virulence? – – transfer of genes or protein domains across unrelated species Frequently identifiable by different patterns of codon usage from other genes, particularly ribosomal proteins –e.g., transfer of antiobiotic resistance among bacteria –Plasmid exchange, phage infections and transfer –Often associated with pathogenicity »Pathogenic variants of bacteria frequently have lots of inserted DNA »e.g., E. coli H0157 has 800 kb more than lab strains of E. coli, much of which is virulence factors, prophages and prophage like elements Virulence is acquired, i.e, transferred from one organism to another
30
BioSci D145 lecture 6 page 30 © copyright Bruce Blumberg 2010. All rights reserved Comparative genomics (contd) Is there a minimal genome? How would you define “minimal genome”? Some lessons from bacterial genomics –Nearly half of ORFs are of unknown function –About 25% of all ORFs are unique to a particular species! Suggests that many new protein families remain to be discovered Many new functions may be uncovered –Periodic re-evaluation of sequenced genomes is useful Compare with newly acquired data –Often find additional ORFs and genes –Much conservation of gene position Same genes found in many genomes at same positions (good for evolutionary studies –Encoding the essential set of proteins required for life? –Compare genomes of archebacteria, eubacteria and yeast Issues with how genes are classified but a reasonably good approximation can be made Can identify 322 clusters of orthologous groups required for all key biosynthetic pathways that might be required in free-living organisms –But remember about non-orthologous gene displacements!
31
BioSci D145 lecture 6 page 31 © copyright Bruce Blumberg 2010. All rights reserved Comparative genomics (contd) What do we get from comparative genomics? –Powerful new tools to identify conserved sequences important regulatory elements Unidentified genes Features (promoters, splice sites, etc) –Important information about genome evolution Where did related genes originate? When did genome duplications arise? What is the history of life on earth? –And by implication, life elsewhere What is the genetic diversity in wild populations –Environmental shotgun sequencing –Information required to identify gene function Protein sequence and structure comparisons
32
BioSci D145 lecture 3 page 32 © copyright Bruce Blumberg 2007. All rights reserved Construction of cDNA libraries What is a cDNA library? What are they good for? –Collection of DNA copies representing the expressed mRNA population of a cell, tissue, organ or embryo –Identifying and isolating expressed mRNAs –functional identification of gene products –cataloging expression patterns for a particular tissue EST sequencing and microarray analysis –Mapping gene boundaries Promoters Alternative splicing
33
BioSci D145 lecture 3 page 33 © copyright Bruce Blumberg 2007. All rights reserved Determinants of library quality What constitutes a full-length cDNA? –Strictly, it is an exact copy of the mRNA –full-length protein coding sequence considered acceptable for most purposes mRNA –full-length, capped mRNAs are critical to making full-length libraries –cytoplasmic mRNAs are best – WHY? 1st strand synthesis –complete first strand needs to be synthesized –issues about enzymes 2nd strand synthesis –thought to be less difficult than 1st strand (probably not) choice of vector –plasmids are best for EST sequencing and functional analysis –phages are best for manual screening They are processed, i.e., introns removed and poly A is added
34
BioSci D145 lecture 3 page 34 © copyright Bruce Blumberg 2007. All rights reserved cDNA synthesis (stopped here – 2015) Scheme –mRNA is isolated from source of interest –1-10 μg are denatured and annealed to primer containing d(T) n V To minimize length of poly A tail in libraries for sequencing –reverse transcriptase copies mRNA into cDNA –DNA polymerase I and Rnase H convert remaining mRNA into DNA –cDNA is rendered blunt ended –linkers or adapters are added for cloning –cDNA is ligated into a suitable vector –vector is introduced into bacteria Caveats –there is lots of bad information out there much is derived from vendors who want to increase sales of their enzymes or kits –all manufacturers do not make equal quality enzymes –most kits are optimized for speed at the expense of quality –small points can make a big difference in the final outcome
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.