Today’s Lecture Topics


Today’s Lecture Topics: Whole genome sequencing; Shotgun sequencing method; Sequencing the human genome; Functional/comparative genomics; Transcriptome & RNA-Seq; Proteomics

Shotgun DNA sequencing: Sequence the entire genome rapidly. No requirement for a high-resolution linkage or physical map. Just break the genome up into small pieces, sequence them, assemble, and find the gene of interest or do the bioinformatic analysis later. This reverses the way genetic studies proceed. It used to be that we had to find the gene first to study the cause of a disease. Now we can study genes we didn’t even know existed.

Fig. 8.13, Shotgun sequencing a genome

Shotgun DNA sequencing (dideoxy method): Begin with genomic DNA and/or a 200-300 kb BAC clone library. Mechanically shear DNA into ~2 kb overlapping fragments. Isolate on agarose, purify, and clone into standard plasmid vectors. Sequence ~500 bp from each end of each 2 kb insert. Sequence from the middle 1,000 bp of each insert is obtained from overlapping clones. Repeat the process so that 4-5x the total length of the genome is sequenced (dideoxy sequencing is 99.99% accurate). Results in a contig library with ~97% genome coverage (the missing 3% is composed mostly of repeated DNA sequence). Assemble hundreds of thousands of overlapping ~500 bp sequences with fast computers operating in parallel (supercomputer).
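The assembly step above can be illustrated with a toy example. This is a minimal sketch of greedy overlap merging, not the assembler used by any genome project; the function names `overlap` and `greedy_assemble`, the `min_len` threshold, and the example reads are all assumptions made for illustration. It ignores sequencing errors, reverse complements, and repeats.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    start = 0
    while True:
        # Look for the first min_len bases of b somewhere inside a.
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        # If the rest of a from that point matches the start of b, it is an overlap.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(contigs, 2):
            olen = overlap(a, b, min_len)
            if olen > best_len:
                best_len, best_a, best_b = olen, a, b
        if best_len == 0:
            return contigs  # nothing left to merge
        contigs.remove(best_a)
        contigs.remove(best_b)
        contigs.append(best_a + best_b[best_len:])


# Hypothetical 20 bp "genome" cut into four overlapping 8 bp reads.
genome = "ATGGCGTACCTTAGCAAGTC"
reads = [genome[i:i + 8] for i in range(0, 13, 4)]
contigs = greedy_assemble(reads)
```

With this toy input the four reads collapse into a single contig equal to the original sequence. A repeated stretch longer than the reads would leave the merge order ambiguous, which is the problem the 10 kb clone library addresses.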

How to deal with the repeated DNA - 2 kb clones present a problem, solved with 10 kb clones: Many repeated sequences in the genome are in regions spanning ~5 kb in size. So many 2 kb clones contain entirely repeated DNA. Results in a dead stop in the assembly, because there is ambiguity about where each clone goes. Repeated sequences occur all over the genome. On average, 10 kb clones contain less repeated DNA sequence. Solution is to create and sequence a 10 kb clone library derived from the same genomic DNA or BAC library. Complete genome coverage requires combining the sequences from the 2 kb & 10 kb libraries.

Genome | Date | Size | Institute | Method
Homo sapiens mtDNA | 1981 | 16,569 bp (1 circular) | - |
Haemophilus influenzae (bacteria) | 1995 | 1,830,137 bp | TIGR | Shotgun
Mycoplasma genitalium (bacteria) | | 580,070 bp | |
Escherichia coli (bacteria) | 1997 | 4,639,221 bp | University of Wisconsin-Madison |
Methanococcus jannaschii (Archaeon) | 1996 | 1,739,933 bp (3 circular) | DOE |
Saccharomyces cerevisiae (yeast) | | 12,067,280 bp (16 linear) | 100+ labs | Mapping
Caenorhabditis elegans (nematode) | 1998 | 97,000,000 bp (6 linear) | Consortium |

Genome | Date | Size | Institute | Method
Drosophila melanogaster (fruit fly) | 2000 | 180,000,000 bp | UC Berkeley, Celera Genomics | Shotgun w/BAC map
Arabidopsis thaliana (angiosperm) | | 125,000,000 bp (5 linear) | Consortium |
Homo sapiens (human) | | 3,400,000,000 bp | Human Genome Project & | Mapping & Shotgun

Sequencing the human genome: Two major players: Human Genome Project (HGP): Publicly funded international consortium (NIH, DOE, etc.) Francis Collins, National Human Genome Res. Inst. (NHGRI) Began in U.S. in 1990 with a goal of 15 years Genetic and physical mapping approach + dideoxy sequencing Celera Genomics Corporation (CRA): Spin-off of Applied Biosystems (ABI) J. Craig Venter, CEO Created in 1998 with a goal of 3 years Direct shotgun approach + dideoxy sequencing (+ HGP’s maps for validation) Both groups collected blood and sperm samples from anonymous male and female donors of different ethnic backgrounds.

J. Craig Venter Celera Genomics Francis Collins Human Genome Project

Milestone: 26 June 2000 - White House press conference with Bill Clinton: HGP: Started 1990 ~22.1 billion nucleotides of sequence data 7-fold coverage Unfinished (24% completely finished, 50% near-finished) Celera: Started 1998 ~14.5 billion nucleotides of sequence data 4.6-fold coverage Complete assembled genome with >99% coverage First assembled draft of human genome simultaneously published in Nature & Science 15 & 16 February 2001 (Nature published 1 day earlier).

How did Celera et al. assemble the sequences using shotgun methods? Method A: Assembly of 26.4 million 550 bp sequences (4.6-fold coverage), without reference to a physical map of any kind. Covered >99% of the genome. 500 million trillion base-to-base comparisons. 20,000 CPU hours (833 CPU days) on a year 2000 supercomputer. Method B: Used a BAC clone scaffold (combined lots of smaller maps) to validate the whole-genome direct shotgun assembly approach. Also helped resolve ambiguities resulting from the assembly of short repeated DNA fragments.
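The coverage figures quoted for Celera can be checked with simple arithmetic. The sketch below is illustrative only: the function names are assumptions, and the `expected_fraction_covered` formula is the idealized Lander-Waterman estimate (reads landing uniformly at random), which is not stated on the slides and ignores repeats and cloning bias.

```python
import math


def fold_coverage(n_reads: float, read_len: float, genome_size: float) -> float:
    """Fold coverage = total sequenced bases / genome size."""
    return n_reads * read_len / genome_size


def expected_fraction_covered(coverage: float) -> float:
    """Idealized (Lander-Waterman) fraction of the genome hit by at least one read."""
    return 1.0 - math.exp(-coverage)


# Figures from the slide: 26.4 million reads of 550 bp at 4.6-fold coverage.
total_bases = 26.4e6 * 550                     # about 14.5 billion bases
implied_genome_size = total_bases / 4.6        # about 3.16 billion bp
fraction = expected_fraction_covered(4.6)      # about 0.99 under the idealized model
```

The total of about 14.5 billion bases matches the figure announced in June 2000, and the idealized model is consistent with the reported >99% coverage at 4.6-fold depth.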

Features of the human genome: 32,000 genes estimated (50,000-100,000 were predicted). Not many more genes than Drosophila, and only 50% more genes than Caenorhabditis elegans (nematode worm). Only 1-1.5% of the genome codes for protein. 50% of the sequence is repeated DNA. Humans share 223 genes found in bacteria, but not yeast, nematodes, or fruit flies.

The first human genome required $1 billion USD and 13 years (or 2 years by Celera shotgun sequencing). Now it requires only a couple of thousand dollars and is done in 2 days.

Next-generation shotgun genome sequencing: The shotgun method is fundamentally the same, but uses newer chemistries (e.g. pyrosequencing on 454, sequencing by synthesis on Illumina) and shorter read lengths (~150 bp paired ends on Illumina). 300-800 bp fragments + mate pairs of 2-12 kb aid assembly and increase N50 (the scaffold length at or above which half of the assembled bases are found; not a simple average). The throughput has increased and the cost has decreased. Not uncommon to assemble trillions of sequence reads. Some things to consider: If error rates are high (454, Illumina), 30-50x genome sequencing is required to get a good genome. If error rates are low (SOLiD, Ion Torrent), 4-5x coverage is sufficient. Costs have been falling from $10K to $1K.
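N50 is easy to confuse with a mean length, so a small worked example may help. This is a sketch using the common definition (the length L such that scaffolds of length L or longer contain at least half of the assembled bases); the function name `n50` and the example lengths are assumptions for illustration.

```python
def n50(lengths: list[int]) -> int:
    """Smallest length L such that scaffolds >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("n50 is undefined for an empty list")


# Hypothetical scaffold lengths (in kb).
scaffolds = [2, 3, 4, 5, 6, 7, 8, 9, 10]
value = n50(scaffolds)                      # 8, while the mean length is 6
# Bridging the two largest scaffolds with a mate pair raises the N50.
merged = n50([2, 3, 4, 5, 6, 7, 8, 19])     # 8 again here? No: 19 + 8 = 27 >= 27, so 8
```

In this example the three longest scaffolds (10 + 9 + 8 = 27) already hold half of the 54 kb total, so the N50 is 8 kb even though the mean is 6 kb.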

Sequence → Contig → Scaffold (contig of contigs)

Scaffold with two small gaps and one large gap bridged by a mate pair with paired-end sequencing. The larger the scaffolds, the larger the N50.
[Diagram: paired-end reads (xxxxx---------------xxxxx) tiled along the scaffold, with one long mate pair (xxxxx------ ... ------xxxxx) spanning the large gap.]

How much data storage does 1 human genome require? Sequencing is no longer the primary need; data storage/retrieval and computational needs are outpacing everything else. About 1.5 GB (2 CDs) if you stored only one copy of each letter. For the raw format containing image files and base quality data, 2-30 TB are required. 30-50x coverage requires more data storage capacity. Sequence + quality scores are stored together in a format called FASTQ:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

FASTQ quality scores: '!' represents the lowest quality while '~' is the highest. Left-to-right increasing order of quality (ASCII 33 to 126, 94 characters):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
Illumina Sequence Identifiers:
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
A bar code at the front of each sequence allows you to label each sequence (e.g. by individual, population, etc.).
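To make the quality encoding concrete, the characters can be converted to numeric scores. This is a minimal sketch that assumes the Phred+33 convention ('!' = 0, '~' = 93) and the usual relation Q = -10 log10(p); older Illumina pipelines used a different offset, and the function names here are illustrative.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Convert a FASTQ quality string to Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Probability that a base call is wrong, given its Phred score."""
    return 10 ** (-q / 10)


# Sequence and quality lines from the example FASTQ record shown earlier.
seq = "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT"
qual = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"

scores = phred_scores(qual)
# The first base has quality '!' (Q0, no confidence); 'C' is Q34 under Phred+33.
worst, best = min(scores), max(scores)
```

A score of 20 corresponds to a 1% chance of a wrong base call and a score of 30 to 0.1%, which is why trimming thresholds are often set in that range.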

Sequence assembly & genotyping: Trimming and filtering sequences based on base quality scores → Aligning reads to a reference genome → Genotyping to determine homozygous & heterozygous SNPs. http://gatkforums.broadinstitute.org/
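The first step of this pipeline, trimming and filtering by base quality, can be sketched in a few lines. This is an illustration only, assuming Phred+33 qualities and arbitrary thresholds (`min_q=20`, `min_len=4`); real projects normally rely on dedicated trimming tools, and the function names below are hypothetical.

```python
def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> tuple[str, str]:
    """Remove bases from the 3' end until one reaches the quality threshold."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def keep_read(seq: str, min_len: int = 4) -> bool:
    """Filter step: discard reads that are too short after trimming."""
    return len(seq) >= min_len


# Toy read: four high-quality bases ('I' = Q40) followed by two poor ones ('#' = Q2).
trimmed_seq, trimmed_qual = trim_3prime("ACGTAC", "IIII##")
kept = keep_read(trimmed_seq)
```

Here the two low-quality bases at the 3' end are removed and the remaining four-base read passes the length filter.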

Post-genome sequencing era is very different: Classical genetics studies started with a phenotype and set out to identify the gene. But we now have the ability to start with a complete genome and set out to identify the phenotype. Large data sets require many computational and mathematical tools, which in turn require strong bioinformatics skills (Perl, Python, R, etc.). Lots of applications: Identify genes within genomic DNA sequences. Align and match homologous gene sequences in databases and seek to determine function. Predict structure of gene products. Describe interactions between genes and gene products. Study gene expression.

1. Identifying genes in DNA sequences: First step is annotation = identification and description of putative genes and other important sequences. Open reading frames (ORFs) ORF = potential protein coding sequence that begins with a start codon and ends with a stop codon. ORFs come in all sizes. Not all ORFs encode proteins (6-7% do not in yeast). ORFs with introns can require sophisticated computer algorithms to detect (especially if there are many introns or introns are particularly long).
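The ORF definition above (start codon through stop codon) translates directly into a simple scan. This is a minimal sketch: it searches only the three forward reading frames, assumes ATG as the only start codon, and does not handle introns, so it is closer to what works for bacterial genes than for eukaryotic ones. The function name `find_orfs` and the `min_codons` cutoff are assumptions for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int, int]]:
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    `start` is the index of the A in ATG; `end` is one past the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first ATG in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close it and keep scanning for the next ATG
    return orfs


# Toy sequence with one ORF (ATG AAA TGA) in the third reading frame.
example = "CCATGAAATGA"
orfs = find_orfs(example)
```

A fuller gene finder would also scan the reverse complement and apply a much larger minimum length, since short ORFs occur by chance in any sequence.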

2. Homology searches to assign gene function: Homology search = identify gene function by searching database. Similarities reflect evolutionary relationships and shared function. Homology searches are performed for nucleotides and amino acids using BLAST = Basic Local Alignment Search Tool. GenBank’s BLAST site: http://www.ncbi.nlm.nih.gov/BLAST/ Example, human mtDNA control region sequence: TTCTCTGTTCTTCATGGGGAAGCAGATTTGGGTACCACCCAAGTATTGACTCACCCACAACAACCGCTATGTATTTCGTACATTACTGCCAGCCACCATGAATATTGCACGGTACCATAAATACTTGACCACCTGTAGTACATAAAAACCCAATCCACATCAAAA
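The "local alignment" in BLAST's name can be illustrated with a small scoring example. BLAST itself uses a fast seeded heuristic; the sketch below instead uses the exact Smith-Waterman dynamic program, swapped in here because it is short and shows the same idea. The scoring values (match 2, mismatch -1, gap -2) and the function name are assumptions.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Two toy sequences that share only the motif ACG.
shared = local_alignment_score("TTTACGTTT", "GGACGGG")   # 3 matches x 2 = 6
identical = local_alignment_score("ACGT", "ACGT")
unrelated = local_alignment_score("AAAA", "TTTT")
```

A high local score between a query and a database sequence is the signal a homology search looks for; statistical significance (the E-value reported by BLAST) is a separate calculation not shown here.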

Jiawei Han & Micheline Kamber (2006)

Fig. 9.2, Summary of genes in the yeast genome.

3. Gene function can be identified and studied in other ways: Gene knockout approach = systematically delete different genes and observe the phenotypes (PCR + cloning is one method). Synthesize recombinant proteins with modified amino acid sequences and express them in E. coli. Test the effects of mutations that don’t exist in nature.

Study the transcriptome = complete set of mRNAs in a cell. mRNAs are not stable, and their types and levels change with different experimental conditions. Sample mRNA at experimental intervals and convert to cDNA using reverse transcriptase. Probe unknown cDNAs with a DNA microarray of PCR-generated ORF sequences (requires a known sequence for each probe). Or better yet, sequence the entire transcriptome using: RNA-Seq = Whole Transcriptome Shotgun Sequencing of all expressed RNAs. Sequencing of ribosome-bound mRNA for monitoring in vivo translation.

Fig. 9.7b, Microarray study of gene expression

http://www.nature.com/nbt/journal/v28/n5/images_article/nbt0510-421-F1.gif

The ribosome profiling strategy for monitoring translation in vivo by deep sequencing of ribosome-protected mRNA fragments Nicholas T Ingolia, Gloria A Brar, Silvia Rouskin, Anna M McGeachy & Jonathan S Weissman Nature Protocols 7, 1534–1550 (2012) doi:10.1038/nprot.2012.086

“Proteomics”: Proteome = complete set of expressed proteins in a cell. Major goals of proteomics: Identify every protein, then isolate and purify it. Determine the sequence and structure of each protein (and its function). Create a database with the sequence of each protein. Analyze protein levels and interactions in different cell types, at different times, and at different stages of development. Rationale: Genes are two steps removed from disease (DNA → mRNA → protein). Most gene products involved in disease are composed of protein. Understanding protein means understanding disease.

http://biol.lf1.cuni.cz/ucebnice/en/proteomics.htm

“Systems Biology”: Computational and mathematical modeling of complex biological systems (Wikipedia). Requires integration of genomic, proteomic, and metabolic data.