BioSci D145 Lecture #4 Bruce Blumberg –4103 Nat Sci 2 - office hours Tu, Th 3:30-5:00 (or by appointment) –phone 824-8573 TA – Ron Leavitt.

BioSci D145 Lecture #4
Bruce Blumberg (blumberg@uci.edu)
  – 4103 Nat Sci 2; office hours Tu, Th 3:30-5:00 (or by appointment)
  – phone 824-8573
TA: Ron Leavitt (rleavitt@uci.edu)
  – 4351 Nat Sci 2, 824-6873
  – office hours M 2:30-3:30, 4206 Nat Sci 2
Check e-mail daily for announcements, etc.
Updated lectures will be posted on web pages after lecture
  – http://blumberg.bio.uci.edu/biod145-w2016
  – http://blumberg-lab.bio.uci.edu/biod145-w2016
  – Last year’s midterm is now posted.
  – Term paper outlines due Thursday (1/28) by midnight.
  – No office hours on Thursday 1/28
BioSci D145 lecture 4 page 1 © copyright Bruce Blumberg 2004-2016. All rights reserved

Term paper outline
Title of your proposal
A paragraph introducing your topic and explaining why it is important; i.e., what impact will the knowledge gained have?
  – Why should any funding agency give you money to pursue this research?
      – NIH now requires a statement of human health relevance for all grant applications
      – NSF wants to know the intellectual merit of your proposed research and its broader impacts
Present your hypothesis
  – A supposition or conjecture put forth to account for known facts; esp. in the sciences, a provisional supposition from which to draw conclusions that shall be in accordance with known facts, and which serves as a starting-point for further investigation by which it may be proved or disproved and the true theory arrived at.
Enumerate 2-3 specific aims in the form of questions that test your hypothesis
  – At least one of these aims needs to have a strong “whole genome” component
      – Genomic, transcriptomic, proteomic, metabolomic, etc.

DNA Sequence analysis
Complete DNA sequence (all nts both strands, no gaps)
  – complete sequence is desirable but takes time
      – how long depends on size and strategy employed
  – which strategy to use depends on various factors
      – how large is the clone?
          – cDNA? genomic?
      – how fast is the sequence required?
Sequencing strategies
  – Small-scale (not whole genome)
      – primer walking
      – cloning and sequencing of restriction fragments
      – progressive deletions
          – bidirectional, unidirectional
  – Genome sequencing – nearly always shotgun sequencing
      – whole genome (traditional vs. nextgen)
      – with mapping
          – map first (C. elegans)
          – map as you go (many)

DNA Sequence analysis (contd)
Primer walking - walk from the ends with oligonucleotides
  – sequence, back up ~50 nt from the end, make a primer and continue
Why back up?
  – Need to see overlap to be sure about the sequence you are reading

DNA Sequence analysis (contd)
Primer walking (contd)
  – advantages
      – very simple
      – no possibility of losing bits of DNA, a risk with
          – restriction mapping
          – deletion methods
      – no restriction map needed
      – best choice for short DNA
  – disadvantages
      – slowest method
          – about a week between sequencing runs
      – oligos are not free (and not reusable)
      – not feasible for large sequences
  – applications
      – cDNA sequencing when time is not critical
      – targeted sequencing
          – verification
          – closing gaps in sequences
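Why primer walking is the slowest method can be seen with a little arithmetic. The sketch below is only an illustration: the function name is hypothetical, the 600 nt read length is borrowed from the genome sequencing slide, the ~50 nt back-up comes from the primer walking slide, and it counts rounds for one strand only.

```python
import math

def primer_walking_rounds(insert_bp, read_len=600, backup=50, both_ends=True):
    """Estimate how many sequencing rounds are needed to walk across an insert.

    Assumptions (illustrative only): every read yields read_len usable bases,
    and each new primer is placed `backup` nt inside the previous read.
    """
    # Walking from both ends at once means each walk only covers half the insert.
    target = insert_bp / 2 if both_ends else insert_bp
    if target <= read_len:
        return 1
    # After the first read, each round adds (read_len - backup) new bases.
    step = read_len - backup
    return 1 + math.ceil((target - read_len) / step)

rounds = primer_walking_rounds(3000)
print(rounds, "rounds for a 3 kb insert, walking from both ends")
```

With about a week between sequencing runs, every extra round adds roughly a week, which is why the method does not scale to large sequences.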

DNA Sequence analysis (contd)
Cloning and sequencing of restriction fragments
  – once the most popular method
      – make a restriction map, subclone fragments
      – sequence
  – advantages
      – straightforward
      – directed approach
      – can go quickly
      – cloned fragments often useful otherwise
          – RNase protection, nuclease mapping, in situ hybridization
  – disadvantages
      – possible to lose small fragments
          – must run high quality analytical gels
      – depends on quality of restriction map
          – mistaken mapping -> wrong sequence
      – restriction site availability
  – applications
      – sequencing small cDNAs
      – isolating regions to close gaps

DNA Sequence analysis (contd)
Nested deletion strategies - sequential deletions from one end of the clone
Exonuclease III-mediated deletion
  – cut with polylinker enzyme
      – protect ends
          – 3’ overhang
          – phosphorothioate
  – cut with enzyme between first cut and the insert
      – can’t leave 3’ overhang
  – timed digestions with Exo III
  – stop reactions, blunt ends
  – ligate and size select recombinants
  – sequence
  – advantages
      – unidirectional
      – processivity of enzyme gives nested deletions

DNA Sequence analysis (contd)
Exonuclease III-mediated deletion (contd)
  – disadvantages
      – need two unique restriction sites flanking insert on each side
      – best used successively to get > 10 kb total deletions
      – may not get complete overlaps of sequences
          – fill in with restriction fragments or oligos
  – applications
      – method of choice for moderate size sequencing projects
          – cDNAs
          – genomic clones
      – good for closing larger gaps
Small-scale sequence analysis – how is it practiced today?
  – Primer walking
  – ExoIII-mediated deletion with primer walking

Genome sequencing
The problem
  – Genome sizes for most eukaryotes are large (10^8-10^9 bp)
  – High quality sequences only about 600-800 bp per run
      – Nextgen sequencing is ~75-400 bp
The solution
  – Break genome into lots of bits and sequence them all
  – Reassemble with computer
The benefit
  – Rapid increase in information about genome size, gene comparisons, etc.
The cost
  – 3 x 10^9 bp (human haploid genome) ÷ 600 bp/reaction = 5 x 10^6 reactions for 1x coverage!
  – Need both strands (x2), need overlaps and need to be sure of sequences
  – ~10^7-10^8 reactions/runs required for a human-sized genome
  – About $1-2 per reaction these days, ~$8 commercially.
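The cost arithmetic on this slide is easy to reproduce. The following is a minimal sketch using the slide's own numbers (3 x 10^9 bp, 600 bp per reaction, $1-2 per reaction); the function name and the choice of 10x coverage as an example are assumptions made for illustration.

```python
def reactions_needed(genome_bp, read_bp, coverage=1.0):
    """Number of sequencing reactions needed for a given fold coverage."""
    return genome_bp * coverage / read_bp

human_bp = 3e9   # human haploid genome, from the slide
read_bp = 600    # high quality bases per reaction, from the slide

one_x = reactions_needed(human_bp, read_bp)
ten_x = reactions_needed(human_bp, read_bp, coverage=10)

print(f"1x coverage:  {one_x:.1e} reactions")
print(f"10x coverage: {ten_x:.1e} reactions")
# Cost range at the quoted $1-2 per reaction
print(f"10x cost: ${ten_x * 1:.1e} to ${ten_x * 2:.1e}")
```

At 10x the count lands inside the 10^7-10^8 range quoted above, before allowing for failed reactions or finishing work.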

Genome sequencing (contd)
Shotgun sequencing NOT invented by Craig Venter
  – Messing 1981 first description of shotgun sequencing
  – Sanger lab developed current methods in 1983
  – approach
      – blast genome into small chunks
      – clone these chunks
          – 3-5 kb, 8 kb plasmid
          – 40 kb fosmid
      – jump repetitive sequences
      – sequence + assemble by computer
  – A priori difficulties
      – how to get nice uniform distribution
      – how to assemble fragments
      – what to do about repeats?
      – how to minimize sequence redundancy?

Genome sequencing (contd)
Shotgun sequencing (contd)
  – How to minimize sequence redundancy?
      – Best way to minimize redundancy is to map before you start
          – C. elegans was done this way - when the sequence was finished, it was FINISHED
              » mapping took almost 10 years
          – mapping much too tedious and nonprofitable for Celera
              » who cares about redundancy, let’s sequence and make $$
              » There is scientific value to draft genomes, too.
      – why does redundancy matter?
          – Finished sequence today costs about $0.50/base
          – Note that even at 10x redundancy, 99.995% coverage leaves at least 150 kb of the human genome unsequenced
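The 10x figure can be checked with a simple random-sampling model. The sketch below assumes reads land uniformly at random, so the number of reads covering a base is approximately Poisson and the chance that a base is never sequenced is e^(-coverage). This idealized model is an assumption of the example, not something stated on the slide, and real libraries (cloning bias, repeats) do worse.

```python
import math

def fraction_unsequenced(coverage):
    """Poisson approximation: probability that a base is hit by zero reads."""
    return math.exp(-coverage)

genome_bp = 3e9  # human haploid genome
missed_fraction = fraction_unsequenced(10)
missed_bp = genome_bp * missed_fraction

print(f"covered at 10x: {100 * (1 - missed_fraction):.3f}%")
print(f"expected unsequenced: {missed_bp / 1e3:.0f} kb")
```

The model gives about 99.995% coverage and on the order of 140 kb missed, in line with the rounded 150 kb on the slide.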

DNA sequence analysis
Landmarks in DNA sequencing
  – Sanger, Nicklen and Coulson. Sequencing with chain terminating inhibitors. Proc. Natl. Acad. Sci. 74, 5463-5467 (1977).
  – Sanger, F. et al. The nucleotide sequence of bacteriophage ΦX174. J Mol Biol 125, 225-46 (1978).
  – Sutcliffe, J. G. Complete nucleotide sequence of the Escherichia coli plasmid pBR322. Cold Spring Harb Symp Quant Biol 43, 77-90 (1979).
  – Sanger et al. Nucleotide sequence of bacteriophage lambda DNA. J Mol Biol 162, 729-73 (1982).
  – Messing, J., Crea, R. & Seeburg, P. H. A system for shotgun DNA sequencing. Nucl. Acids Res 9, 309-21 (1981).
  – Anderson, S. et al. Sequence and organization of the human mitochondrial genome. Nature 290, 457-65 (1981).
  – Deininger, P. L. Random subcloning of sonicated DNA: application to shotgun DNA sequence analysis. Anal Biochem 129, 216-23 (1983).
  – Baer et al. DNA sequence and expression of the B95-8 Epstein-Barr virus genome. Nature 310, 207-11 (1984). (189 kb)
  – Innis et al. DNA sequencing with Taq DNA polymerase and direct sequencing of PCR-amplified DNA. Proc. Natl. Acad. Sci. 85, 9436-9440 (1988).

DNA sequence analysis (contd)
Landmarks in DNA sequencing (contd)
  – 1995 - Haemophilus influenzae (1.83 Mb): first bacterium sequenced, human pathogen
  – 1995 - Mycoplasma genitalium (0.58 Mb): smallest free living organism
  – 1996 - Saccharomyces cerevisiae genome (13 Mb)
  – 1996 - Methanococcus jannaschii (1.66 Mb): first Archaebacterium
  – 1997 - Escherichia coli (4.6 Mb)
  – 1997 - Bacillus subtilis (4.2 Mb)
  – 1997 - Borrelia burgdorferi (1.44 Mb): Lyme disease
  – 1997 - Archaeoglobus fulgidus (2.18 Mb): first sulfur metabolizing bacterium
  – 1997 - Helicobacter pylori (1.66 Mb): first bacterium proven to cause cancer

DNA sequence analysis (contd)
Landmarks in DNA sequencing (contd)
  – 1998 - Treponema pallidum (1.14 Mb)
  – 1998 - Caenorhabditis elegans genome (97 Mb)
  – 1999 - Deinococcus radiodurans (3.28 Mb): resistant to radiation, starvation, ox stress
  – 2000 - Drosophila melanogaster (120 Mb)
  – 2000 - Arabidopsis thaliana (115 Mb)
  – 2001 - Escherichia coli O157:H7 (4.1 Mb): pathogenic variant of E. coli
  – 2001 - draft human “genome”
  – 2002 - mouse genome
  – 2002 - Ciona intestinalis: primitive chordate
  – 2003 - “complete” human genome
  – 2004 - rat genome
  – 2006 - human “genome”: complete sequence of all chromosomes
  – Many more genomes underway; check JGI, Sanger and other web sites

The human genome
On Feb 12, 2001, Celera and the Human Genome Project published “draft” human genome sequences
Number of genes
  – Celera -> 39,114
  – Ensembl -> 29,691
  – Consensus from all sources ~30K
  – C. elegans – 19,000
  – Arabidopsis – 25,000
Predictions had been from 50-140K human genes
  – What’s up with that?
  – Are we only slightly more complicated than a weed?
  – How can we possibly get a human with less than 2x the number of genes as C. elegans?
  – Implications?
      – UNRAVELING THE DNA MYTH: The spurious foundation of genetic engineering, Barry Commoner, Harper’s Magazine, Feb 2002

The human genome
The answer – Gene sets don’t overlap completely (duh)
  – Floor is 42K (= 42,113)
  – 130,056 UniGene clusters in build #236 (from EST and mRNA sequencing) (“final” count)
  – http://www.ncbi.nlm.nih.gov/unigene
  – Up from 123,459 in 2013 (85,793, 105,680, 128,826, 123,891 previous years)
Important questions to be answered about what constitutes a “gene”
  – Crick genes? DNA-RNA-protein
  – How about RNAs?
  – miRNAs?
  – Antisense transcripts?
  – lncRNAs?

Genome sequencing (contd)
  – Whole genome shotgun sequencing (Celera)
      – premise is that rapid generation of draft sequence is valuable
      – why bother trying to clone and sequence difficult regions?
          – Basically just forget regions of repetitive DNA - not cost effective
      – using this approach, genomes rarely are completely finished
          – rule of thumb is that it takes at least as long to finish the last 5% as it took to get the first 95%
      – problems
          – sequence may never be complete, as the C. elegans sequence is
          – much redundant sequence with many sparse regions and lots of gaps
          – fragment assembly for regions of highly repetitive DNA is dubious at best
          – “finished” fly and human genomes lack more than a few already characterized genes

Genome sequencing (contd)
Knowing what we know now – how to approach a large new genome?
  – Xenopus tropicalis 1.7 Gb (about ½ human)
  – BAC end sequencing
  – Whole genome shotgun
  – HAPPY mapping and radiation hybrid mapping to order scaffolds
  – Gaps closed with BACs
  – 8.5x coverage (but > 9000 scaffolds for 18 chromosomes)
  – Finishing now in process
      – But how “finished” will it be?
2016 update – now version 9.0
  – FINALLY integrated BAC end sequences
  – Integrated genetic map
  – 50% of contigs > 72 kb
  – Xenopus laevis – v9.1 – >90% of genome in chromosomal scaffolds
      – 2 “subgenomes” fully characterized.
      – annotation remains a big challenge.

Functional Genomics - Analysis of gene function on a whole genome basis
Genome projects
  – DNA sequencing
  – Human genome, mouse, rat, Drosophila, C. elegans “finished”
  – model organisms progressing rapidly
  – Lots of new genes, but many lack known function
Functional genomics
  – Identification of gene functions
      – associate functions with new genes coming from genome projects
      – function of genes identified from characterizing diseases or mutants
  – Identification of genes by their function
      – discovery of new genes

Methods of profiling gene expression – large scale to whole genome
What are the possibilities?
  – Array – micro or macro
  – Sequence sampling (EST generation)
  – SAGE – serial analysis of gene expression
  – Massively parallel signature sequencing (RNA-seq, Illumina, 454)
DNA microarray analysis was, until recently, the totally dominant method
  – Two basic flavors
      – Spotted (spot DNA onto support)
          – cDNA microarrays
          – Oligonucleotide arrays
          – Moderately expensive
      – Synthesized (use photolithography to synthesize oligos onto silicon or other suitable support)
          – Affymetrix Gene Chips dominate
          – VERY expensive
  – Both are in wide use and suitable for whole genome analysis

Spotted arrays
Source material is prepared
  – cDNAs are PCR amplified OR
  – Oligonucleotides synthesized
Spotted onto treated glass slides
RNA prepared from 2 sources
  – Test and control
Labeled probes prepared from RNAs
  – Incorporate label directly
  – Or incorporate modified NTP and label later
  – Or chemically label mRNA directly
Hybridize, wash, scan slide
Express as ratio of one channel to other after processing
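The final step, expressing each spot as a ratio of one channel to the other, is usually reported on a log2 scale so that up- and down-regulation are symmetric around zero. The sketch below is a minimal illustration: it assumes the intensities are already background-subtracted and normalized, and the function name, the intensity floor and the example values are all hypothetical.

```python
import math

def log2_ratios(test, control, floor=1.0):
    """Per-spot log2(test / control) for two-channel intensities.

    `floor` clamps very low intensities so the ratio never divides by zero
    (an illustrative choice, not a recommended processing step).
    """
    return [math.log2(max(t, floor) / max(c, floor))
            for t, c in zip(test, control)]

# Hypothetical intensities for three spots
test_channel = [1200.0, 300.0, 800.0]
control_channel = [300.0, 1200.0, 800.0]

ratios = log2_ratios(test_channel, control_channel)
print(ratios)  # positive = higher in test, negative = higher in control
```

A 4-fold increase and a 4-fold decrease come out as +2 and -2, which a raw ratio (4 vs. 0.25) would not show as symmetric.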

DNA microarray types
Stanford type microarrayer
  – http://cmgm.stanford.edu/pbrown/mguide/index.html
Printing method
  – Reminiscent of fountain pen

Strategy to identify RAR target genes
[Flow diagram] Agonist (TTNPB) and antagonist (AGN193109) -> harvest st 18 -> poly A+ RNA -> amino-allyl labeled 1st strand cDNA -> Alexa Fluor 555 (cy3) / Alexa Fluor 647 (cy5) -> probe microarrays -> upregulated / downregulated

DNA microarray
Statistical analysis of output – VERY IMPORTANT!
Replicates are very important
Preprocessing of data is needed
  – To remove spurious signals

DNA microarray
Advantages
  – Custom arrays possible and affordable
  – Ratio of fluorescence is robust and reproducible
Disadvantages
  – Availability of chips
  – Expense of production on your own
  – Technical details in preparation

Affymetrix GeneChips
High density arrays are synthesized directly on support
  – 4 masks required per cycle -> 100 masks per chip (25-mers)
  – Pentium IV requires about 30 masks
  – G.P. Li in Engineering directs a UCI facility that can make just about anything using photolithography

Affymetrix GeneChips
  – Each gene is represented by a series of oligonucleotide pairs
      – One perfect match
      – One with a single mismatch
  – Only hybridization to perfect match but not mismatch is considered to be real
  – Gene is considered “detected” if > ½ of oligo pairs are positive
  – Number of pairs depends on organism and how well characterized array behavior is
      – Human uses 8 pairs
      – Xenopus uses 16 pairs
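The detection rule on this slide can be written out in a few lines. This is a deliberately simplified sketch of the rule as stated here (a pair is positive when perfect match exceeds mismatch, and a gene is detected when more than half of its pairs are positive). The function name, the `min_diff` threshold and the intensities are hypothetical, and the vendor's actual software uses statistical detection calls rather than this simple count.

```python
def gene_detected(pairs, min_diff=0.0):
    """Apply the slide's rule to a list of (perfect_match, mismatch) intensities.

    A pair counts as positive when PM exceeds MM by more than `min_diff`;
    the gene is called detected when more than half of the pairs are positive.
    """
    positive = sum(1 for pm, mm in pairs if pm - mm > min_diff)
    return positive > len(pairs) / 2

# Hypothetical probe set with 8 pairs (the number the slide quotes for human)
probe_set = [(900, 200), (850, 300), (400, 420), (700, 150),
             (650, 640), (300, 310), (800, 100), (750, 250)]

print(gene_detected(probe_set))                # 5 of 8 pairs are positive
print(gene_detected(probe_set, min_diff=600))  # stricter threshold
```

Using the mismatch oligo as a per-probe control is what lets a single-color chip separate specific hybridization from cross-hybridization.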

Affymetrix GeneChips
Result is in single color
  – Always need two chips – control and experimental for each condition
  – Also need replicates for each condition
  – For diverse biological samples (e.g., humans) 10 replicates required!
  – For less diverse samples (cell lines) probably 5 replicates needed
Advantages
  – Commercially available
  – Standardized
Disadvantages
  – About $700 to buy, probe and process each chip (at UCI)! About $500 elsewhere
  – May not be available for your organism of interest
  – No ability to compare probes directly on the same chip
      – Must rely on technology

DNA microarrays
What are they good for?
  – Identifying genes expressed in one condition vs. another
      – One tissue vs. another (heart vs. liver)
      – Tissue vs. tumor (liver vs. hepatocarcinoma)
      – In response to a treatment (e.g., RA)
      – In response to disease (e.g., after viral infection)
  – Building expression profiles
      – Tissues
      – Cancers
      – Developmental stages
      – Expressed genes
  – Identifying organisms in food
      – Array can identify which animals are present in a mix
      – http://www.dnavision.com/files/FOODIDBrosh%20En.pdf

DNA microarrays
What are they good for? (contd)
  – Response of animal to drugs or chemicals
      – Toxicogenomics
      – Pharmacogenomics
  – Diagnostics
      – SNP analysis to identify disease loci
      – Specific testing for known diseases

DNA microarrays
What are the limitations of microarray technology?
What sorts of factors might confound the experiment?
  – Signal intensity (or signal/noise)
      – Improved dyes, label uniformly
  – Biological variation (samples are inherently different)
      – Sufficient # of replicates is key
      – keep individuals separate
  – Not all mRNAs will be present at sufficient levels to detect
      – Amplification, but beware of bias
  – Good statistical analysis is required
      – Bayesian statistics are best (Pierre Baldi is local expert)
          – calculating the probability of a new event on the basis of earlier probability estimates which have been derived from empiric data
          – i.e., don’t assume random distribution in datasets, calculate probability based on real data
          – Bayesian approach great for small number of replicates, converges on t-test at high number of replicates
      – http://cybert.microarray.ics.uci.edu/
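The claim that a Bayesian approach helps with few replicates and converges on the t-test with many can be illustrated with a regularized variance estimate. The sketch below is written in the spirit of regularized t-tests and is not a description of what the Cyber-T server computes: the weighting formula, the function name, the `prior_weight` parameter and the data are all assumptions for illustration. The idea is that a gene's own noisy variance is blended with a background variance estimated from other genes at similar expression levels.

```python
import statistics

def regularized_variance(values, background_var, prior_weight=10):
    """Blend a gene's sample variance with a background (prior) variance.

    With few replicates the background dominates; with many replicates the
    result approaches the ordinary sample variance used by a t-test.
    """
    n = len(values)
    sample_var = statistics.variance(values)
    return ((prior_weight * background_var + (n - 1) * sample_var)
            / (prior_weight + n - 2))

few = [1.0, 2.0, 3.0]                    # 3 replicates, sample variance 1.0
many = [1.0, 2.0, 3.0] * 200             # 600 replicates of the same spread

print(regularized_variance(few, background_var=0.5))
print(regularized_variance(many, background_var=0.5), statistics.variance(many))
```

With three replicates the estimate sits close to the background value, while with hundreds it is nearly identical to the plain sample variance.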

Other methods of transcriptome analysis - parallel
Microarray was once the dominant method
  – Direct RNA sequencing methods are rapidly displacing microarrays
  – SAGE (serial analysis of gene expression)
      – NanoString nCounter is modern implementation
      – Very short sequences
  – RNAseq
      – Directly sequence large numbers of RNAs
      – Longer sequences
SAGE
  – Relies on generating many very short sequences and matching these to the genome
  – 10 bp = short SAGE
  – 17 bp = “long” SAGE

Other methods of transcriptome analysis - parallel
SAGE (continued)
  – What is the obvious shortcoming of this method?
  – Sequences may not be unique and could have difficulty mapping to the genome
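The uniqueness problem follows from how few distinct tags a short sequence can encode. The sketch below compares 10 bp and 17 bp tags against a 3 x 10^9 bp genome. It assumes a random genome with equal base frequencies and counts one strand only, which is an idealization made for this example; real genomes are repetitive, so the true situation is worse.

```python
def possible_tags(tag_len):
    """Number of distinct DNA sequences of a given length."""
    return 4 ** tag_len

def expected_random_hits(tag_len, genome_bp=3e9):
    """Expected chance occurrences of one specific tag in a random genome."""
    return genome_bp / possible_tags(tag_len)

for tag_len in (10, 17):
    print(tag_len, "bp:", possible_tags(tag_len), "possible tags,",
          f"{expected_random_hits(tag_len):.3g} expected chance matches")
```

A 10 bp tag is expected to occur by chance thousands of times in a human-sized genome, whereas a 17 bp tag is expected less than once, which is the motivation for “long” SAGE.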

Other methods of transcriptome analysis - parallel
RNA seq – Ali Mortazavi is local expert
  – Use of massively parallel sequencing allows precise quantitation of transcript
  – Also allows discovery of rare splice forms
  – Discovery of unexpected transcripts
  – Main problem is in mapping sequence calls to genome
      – Sequencing has 1-2% errors which can make mapping to genome fail or induce “in silico cross-hybridization”
          – Mapping to incorrect genomic location
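A 1-2% per-base error rate sounds small, but it adds up over a whole read. The sketch below estimates the chance that a read contains no errors at all, assuming errors are independent and uniform along the read (an assumption of this example; real error profiles vary with position and platform). The function name and the 75 bp read length, taken from the low end of the nextgen range quoted earlier, are illustrative choices.

```python
def prob_error_free(read_len, error_rate):
    """Probability that every base in a read is called correctly,
    assuming independent per-base errors."""
    return (1 - error_rate) ** read_len

for rate in (0.01, 0.02):
    p = prob_error_free(75, rate)
    print(f"error rate {rate:.0%}: {p:.2f} of 75 bp reads are error-free")
```

Under these assumptions fewer than half of 75 bp reads are perfect even at 1% error, so aligners must tolerate mismatches, and tolerating mismatches is what allows a read to map to the wrong, similar location.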

Microarray vs. RNAseq
Microarray
  – Assumes you know all the transcripts that are expressed in the organism/tissue of interest
  – Any sequence you did not know was expressed will not be there
      – except whole genome tiling arrays
  – Detection limit issues
      – Signal-noise ratio
  – Well validated, expression analysis can be quantitative
      – Not usually performed quantitatively
RNAseq
  – No assumption re transcripts but best with genome sequence.
  – Works less well without
  – Can discover novel sequences or new splice forms not yet characterized (if you have genome)
  – Detection limits are not a problem – can detect small #
  – Getting better, expression analysis is quantitative with read depth ≥ 20 x 10^6 mapped reads
