
1 Mid-term Examination Between October 16th and 30th, 2006
One hour, in class
Closed book
Essay-type questions
Will cover all chapters completed by then
Will count for 30% of the final grade
© 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey 07458

2 Chapter 4 Genome Sequencing
Strategies and procedures for sequencing entire genomes

This chapter covers the different strategies and procedures used to sequence entire genomes.

3 Contents
The Human Genome Project
Sequencing strategies
Large-scale sequencing
Accuracy and coverage
EST sequencing
Sequence annotation

The topics covered in this chapter include the Human Genome Project, the different strategies for sequencing genomes, how large-scale sequencing is carried out, how the parameters of accuracy and coverage affect genome-sequencing projects, EST sequencing, and sequence annotation.

4 Background
Field of genomics began with the decision to sequence the human genome
Size of the human genome is 3 billion base pairs, which necessitated new ways to do sequencing
Approaches to sequencing the human genome
  Scale up existing techniques
  Develop new sequencing techniques
  Start with smaller genomes as a warm-up project

The field of genomics, whose aim is to determine the structure and function of all genes in a genome, began with the decision to sequence the human genome. At the time, the size of the human genome, 3 billion base pairs, presented a colossal challenge necessitating new ways to do sequencing. Three approaches were proposed: to scale up existing sequencing technologies, to develop new sequencing technologies, and to start sequencing smaller genomes as a warm-up to sequencing the human genome.

5 Goals of the Human Genome Project
Sequence entire genome
  Not just transcribed genes
Sequencing should be performed with a high level of accuracy
  One error in 10,000 bases
Develop genomic resources that would be useful for all genes
  Example: collections of physical markers
Develop economies of scale

In addition to improving sequencing technology, the Human Genome Project set the following goals: to sequence the entire genome, not just transcribed genes or disease genes; to sequence the genome to a high level of accuracy, with fewer than one error in 10,000 bases (originally, this margin was set at fewer than one error in 100,000 bases); to develop genomic resources that would be useful for all genes (for example, collections of physical markers); and to develop economies of scale. This last goal has meant the concentration of sequencing in a few centers, where it is carried out on an industrial scale.

6 Scale-up of existing technologies
There has been remarkable improvement in sequencing efficiency since the invention of sequencing
The amount of sequencing that one person can perform has increased dramatically
  1980: 0.1–1 kb per year
  1985: 2–10 kb per year
  1990: 25–50 kb per year
  1996: 100–200 kb per year
  2000: 500–1,000 kb per year
Almost all large-scale sequencing is still based on Sanger chain-termination technology

The greatest success has come in scaling up existing technologies. From 1980 to 2000, sequencing efficiency increased over a thousandfold and is still improving. Recent estimates are that one of the major sequencing centers can sequence a 100-Mb genome (about the size of the Drosophila genome) at 5X coverage in two to three weeks. Remarkably, there has been no major change in the underlying technology. Almost all large-scale sequencing is still based on the Sanger chain-termination sequencing technology. This technology is described in detail in the chapter on the fundamentals of mapping and sequencing.

7 New technologies
A high-priority goal at the beginning of the Human Genome Project was to develop new mapping and sequencing technologies
To date, no major breakthrough technology has been developed
  Possible exception: whole-genome shotgun sequencing applied to large genomes

Although a high-priority goal of the Human Genome Project was to develop new mapping and sequencing technologies, to date there has been no major breakthrough in this area. Some promising new methods are described in the chapter on the fundamentals of mapping and sequencing. The one possible breakthrough is the application of whole-genome shotgun sequencing to large genomes (described later in this chapter), but this method was developed by a private company and thus was not part of the official Human Genome Project.

8 Automated sequencers
Perhaps the most important contribution to large-scale sequencing was the development of automated sequencers
Most use Sanger sequencing method
  Fluorescently labeled reaction products
  Capillary electrophoresis for separation
Most commonly used automated sequencers are the following:
  ABI 3700
  MegaBACE

The development of automated sequencers was perhaps the most important factor in making large-scale sequencing possible. All commercially available machines use the Sanger sequencing method and fluorescently labeled reaction products. The newer, high-throughput models all use capillary electrophoresis to separate the reaction products. The two most commonly used instruments are the ABI 3700 and the MegaBACE.

9 Automated sequencers: ABI 3700
Made by Applied Biosystems
Most widely used automated sequencer
  96 capillaries
  Robotic loading from 384-well plates
  Two to three hours per run
  600–700 bases per run
[Figure labels: robotic arm and syringe; 96 glass capillaries; 96-well plate; load bar]

The ABI 3700, made by Applied Biosystems, is probably the most widely used instrument for large-scale sequencing. It has 96 capillaries that are fed by robotic loading from two 384-well microtiter plates. It completes a sequencing run every two to three hours, and each capillary reads on average 600–700 bases per run. Celera, the company that produced a rough draft of the human genome in three years, used 200 of these machines running 24/7 to do so.
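The figures quoted on this slide can be turned into a rough throughput estimate. The following is a minimal sketch, assuming that every capillary yields a usable read on every run and that the machine never sits idle; real output would be lower. The function name `bases_per_day` is illustrative only.

```python
def bases_per_day(capillaries, bases_per_read, hours_per_run):
    """Raw bases read per machine per day, assuming back-to-back runs."""
    runs_per_day = 24 / hours_per_run
    return capillaries * bases_per_read * runs_per_day

# Slowest and fastest cases from the slide: 600 bases every 3 hours
# versus 700 bases every 2 hours, on 96 capillaries.
low = bases_per_day(96, 600, 3)
high = bases_per_day(96, 700, 2)

# A fleet of 200 machines, as in the Celera example.
fleet_low = 200 * low
fleet_high = 200 * high

print(f"one machine: {low:,.0f} to {high:,.0f} bases per day")
print(f"200 machines: {fleet_low:,.0f} to {fleet_high:,.0f} bases per day")
```

Under these idealized assumptions a single machine reads roughly half a million to just under a million raw bases per day, which shows why a fleet of machines running around the clock was needed for a 3-billion-base genome sequenced to several-fold coverage.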

10 Automated sequencers: MegaBACE
Made by Amersham
96 capillaries
Robotic loading from 384-well plate
Two to four hours per run
Can read up to 800 bases

The primary competitor of the ABI 3700 is the MegaBACE 4000, made by Amersham. It also has 96 capillaries and performs robotic loading from 384-well microtiter plates. It takes from two to four hours per run and can read up to 800 bases accurately.

11 Automatic gel reading
Top image: confocal detection by the MegaBACE sequencer of fluorescently labeled DNA
Bottom image: computer image of sequence read by automated sequencer

Both automated sequencers detect fluorescently labeled DNA strands as they pass through the capillaries. The MegaBACE sequencer uses the confocal imaging system shown in the top image. The readout is given as peaks of fluorescence, as shown in the lower image.

12 Steps in genomic sequencing
Library making
  Large-insert library from genome
Production sequencing
  Generate fragments to be sequenced
  Perform sequencing reactions
  Determine sequence
Finishing
  Assemble into continuous sequence
  Fill gaps

Genomic sequencing can be broken into three stages. First, libraries of large fragments of genomic DNA are made. The second stage, production sequencing, can itself be broken into smaller steps: generating the small DNA fragments to be sequenced, performing the actual sequencing reactions, and running the reactions through the automated sequencer. The third stage is called finishing, and it involves assembling the raw sequence reads into a continuous sequence and then filling any remaining gaps in the sequence.

13 Library making
Library of genomic fragments made in vector
  BAC, PAC, or YAC
Usually have several-fold coverage
  Every DNA sequence on five to eight different clones
Difficult and inefficient to sequence straight from large fragment
Need to break into manageable pieces
  Random shearing
  By nebulization or sonication

The first step in large-scale sequencing is normally to prepare a library of large fragments of genomic DNA. Today, the vectors most likely to be used are BACs or PACs, but occasionally YACs are still used. The libraries are usually made of a sufficient size to contain several-fold coverage of the genome, meaning that any sequence will be found in, for example, five to eight different clones. Because it is difficult and inefficient to sequence directly from large fragments, the next step is to break each clone into manageable pieces. This is done by randomly shearing the DNA from each clone by sonicating it or by using a nebulizer. Randomly sheared fragments of a particular size (e.g., 2 kb) can be produced in this way.

14 Fragments for sequencing
Generally use 2–10 kb pieces for sequencing
Clone into sequencing vector
  Contains binding sites for sequencing primers
  Can be single stranded or double stranded

Production sequencing is generally performed on fragments of 2–10 kb in size. These randomly sheared fragments must be subcloned into a sequencing vector that contains sites complementary to the primers used to initiate the sequencing reaction. Sequencing vectors can be either single stranded, like the M13 phage vector, or double stranded, like a plasmid.

15 Sequence assembly
Random sequences
  First assemble into overlapping sequence
  Then create one continuous sequence
Program used for this operation named PHRAP
Analyzes each position to determine the following:
  Quality of sequence
  Consistency of sequence of same region
    Acquired from different random fragments

The random sequences generated from the sheared fragments are first assembled into overlapping sequences. Depending on the desired coverage, this step can require up to 10 reads for each portion of the DNA. From these overlapping reads, one contiguous sequence is identified. This task is usually accomplished using an automated editing program known as PHRAP, developed by Phil Green. The program analyzes each position in the sequence and determines the quality of each read. It then generates a consensus sequence based on the different sequences acquired from the same region and their internal consistency (i.e., how frequently the same base was found at any given position).
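The idea of building a consensus from read quality and consistency can be illustrated with a toy example. The sketch below is not PHRAP's algorithm; it is a simplified quality-weighted vote, assuming the reads have already been aligned (each read carries a hypothetical `offset` into the contig) and that every position is covered by at least one read.

```python
from collections import defaultdict

def consensus(aligned_reads):
    """Call one base per position by summing quality scores per base.

    aligned_reads: list of (offset, bases, qualities) tuples, where
    offset is the read's start position in the contig.
    """
    votes = defaultdict(lambda: defaultdict(int))
    for offset, bases, quals in aligned_reads:
        for i, (base, qual) in enumerate(zip(bases, quals)):
            votes[offset + i][base] += qual
    # At each position, keep the base with the highest total quality.
    return "".join(max(votes[p], key=votes[p].get) for p in sorted(votes))

reads = [
    (0, "ACGT", [30, 30, 30, 30]),
    (2, "GTTA", [30, 30, 30, 30]),
    (2, "GATA", [30, 10, 30, 30]),  # low-quality disagreement at position 3
]
print(consensus(reads))
```

Here the low-quality A at position 3 is outvoted by two high-quality T calls. Positions where the vote is close are the ones a human finisher would need to inspect.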

16 Sequence assembly readout
The readout generated by the PHRAP program shows overlapping sequences lined up to form one contiguous sequence.

17 Finishing I
Process of assembling raw sequence reads into accurate contiguous sequence
  Required to achieve 1/10,000 accuracy
Manual process
  Look at sequence reads at positions where programs can’t tell which base is the correct one
  Fill gaps
  Ensure adequate coverage
[Figure labels: gap; single stranded]

Although automated editing programs like PHRAP have greatly increased the efficiency of sequencing, there remains a need for human judgment and intervention. This occurs during the finishing step, which is defined as the process of assembling the raw sequence reads into an accurate contiguous genomic sequence. For genomic sequencing with an accuracy of one error in 10,000 bases, a manual finishing step is essential. The finisher looks at positions where the automated editing program can’t tell which base is the correct one. By examining the various raw sequence reads, the finisher then makes a judgment call as to the correct base or sends the region back for additional sequencing. Similarly, when there are gaps in the sequence or insufficient coverage, the finisher will flag the region and send it back to the production sequencing team for more work.

18 Finishing II
To fill gaps in sequence, design primers and sequence from primer
To ensure adequate coverage, find regions where there is not sufficient coverage and use specific primers for those areas
[Figure labels: gap; primer; primer]

Gaps are usually filled by designing custom sequencing primers that are complementary to the regions adjacent to the gap. The sequencing reaction is then performed using these custom primers on a clone containing the problematic region of DNA as the template. A similar strategy is used for regions with insufficient coverage: custom primers are made and then used for directed sequencing of a particular region.
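The primer-design step can be sketched as follows. This is a minimal illustration, assuming both contigs are written 5' to 3' on the same strand and that a fixed primer length is acceptable; real primer design also weighs melting temperature, repeats, and secondary structure, none of which is modeled here. The function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def gap_primers(left_contig, right_contig, length=20):
    """Pick one primer on each side of a gap, both pointing into it."""
    # Forward primer: the last bases of the left contig, extending rightward.
    forward = left_contig[-length:]
    # Reverse primer: opposite strand of the first bases of the right
    # contig, extending leftward into the gap.
    reverse = reverse_complement(right_contig[:length])
    return forward, reverse

forward, reverse = gap_primers("AAACCCGG", "TTGACA", length=4)
print(forward, reverse)
```

Sequencing from both primers gives reads that enter the gap from opposite directions, which is the situation drawn on the slide.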

19 Verification
Region verified for the following:
  Coverage
  Sequence quality
  Contiguity
Determine restriction-enzyme cleavage sites
  Generate restriction map of sequenced region
  Must agree with fingerprint generated of clone during mapping step

The final step of finishing is to verify the sequence. All regions are checked for the extent of coverage (i.e., how many times the same region has been sequenced, and in what direction), for sequence quality (i.e., whether ambiguity has been removed for all positions in the sequence), and for contiguity (i.e., whether the sequence forms one uninterrupted stretch of DNA). A good test of sequence quality that is frequently used in the finishing stage is to determine the sites where restriction enzymes would cut in the newly acquired sequence. A restriction map is generated from the sequence and then compared with the known fingerprint generated from the clone during the mapping step. If both show the same pattern, then it is considered to be an indication that the sequence is of high quality and relatively error free.
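The restriction-map check can be sketched as an in silico digest. This is a simplified illustration, assuming a linear molecule, a single enzyme, and a fingerprint given as a list of fragment lengths; the 5% size tolerance is an arbitrary assumption standing in for gel measurement error. EcoRI (site GAATTC, cutting after the first G) is used as the example enzyme.

```python
def digest_fragments(sequence, site, cut_offset):
    """Fragment lengths produced by cutting a linear sequence at every site."""
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return sorted(b - a for a, b in zip(bounds, bounds[1:]))

def matches_fingerprint(predicted, observed, tolerance=0.05):
    """True if both digests have the same number of similarly sized fragments."""
    if len(predicted) != len(observed):
        return False
    pairs = zip(sorted(predicted), sorted(observed))
    return all(abs(p - o) <= tolerance * o for p, o in pairs)

assembled = "AAAA" + "GAATTC" + "CCCCC" + "GAATTC" + "TT"
predicted = digest_fragments(assembled, "GAATTC", cut_offset=1)
print(predicted, matches_fingerprint(predicted, [5, 7, 11]))
```

A missing or extra fragment in the predicted digest points to a misassembly or a sequence error that created or destroyed a restriction site.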

20 Map-based sequencing I
Human Genome Project adopted a map-based strategy
  Start with well-defined physical map
  Produce shortest tiling path for large-insert clones
  Assemble the sequence for each clone
  Then assemble the entire sequence, based on the physical map

The Human Genome Project adopted a map-based strategy for all of the genomes that it initially tackled. In this approach, one starts with a well-defined physical map, usually consisting of overlapping BAC or PAC clones. Among the overlapping clones, those that form what’s called the shortest tiling path are chosen. The shortest tiling path is the set of clones that spans the region with the shortest overlaps between neighboring clones, so that fewer clones need to be sequenced. Each clone is then randomly sheared, and its sequence is read and assembled. From the sequence of each BAC or PAC, the entire sequence of each chromosome is assembled based on the physical map. This procedure requires finding the sequence overlaps of each large-insert clone (in the shortest tiling path) and removing the overlapping portion of one of the two sequences in order to form a single contiguous sequence.
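Choosing a tiling path from a physical map can be sketched as an interval-cover problem. The example below is a minimal sketch, assuming each clone is described by hypothetical start and end coordinates on the map; it uses the standard greedy rule (always take the overlapping clone that reaches farthest), which minimizes the number of clones. Real projects also weigh clone quality and map confidence.

```python
def minimal_tiling_path(clones):
    """Pick the fewest clones that cover the mapped region end to end.

    clones: list of (name, start, end) positions on the physical map.
    """
    clones = sorted(clones, key=lambda c: c[1])
    target_end = max(c[2] for c in clones)
    covered = clones[0][1]
    path, i = [], 0
    while covered < target_end:
        best = None
        # Among clones that start inside the covered region,
        # keep the one that extends farthest to the right.
        while i < len(clones) and clones[i][1] <= covered:
            if best is None or clones[i][2] > best[2]:
                best = clones[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in physical map at position {covered}")
        path.append(best[0])
        covered = best[2]
    return path

bacs = [("A", 0, 100), ("B", 50, 180), ("C", 90, 200),
        ("D", 150, 300), ("E", 190, 320)]
print(minimal_tiling_path(bacs))
```

In this toy map, clones B and D are redundant, so only three of the five clones would be sent on for shearing and sequencing.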

21 Map-based sequencing II
Construct clone map and select mapped clones
Generate several thousand sequence reads per clone
Assemble

An analogy for the map-based sequencing strategy is to think of the genome as a set of books. Each volume contains one BAC’s worth of sequence. In the map-based strategy, the order of the volumes is first determined. Then the information in each of the individual volumes is determined. Finally, the knowledge of the order of the volumes and the material in each volume are combined to form a single assembly.

22 Whole-genome shotgun sequencing I
Developed by Celera
  Subsidiary of Applied Biosystems, maker of automated sequencers
No mapping
Instead, the whole genome is sheared
Randomly sequenced

As an alternative to map-based sequencing, the biotechnology company Celera applied shotgun sequencing to large genomes. Although this method had been applied to small genomes such as viruses, it was thought to be inappropriate for large genomes containing repetitive sequences. (Celera was formed as a subsidiary of Applied Biosystems, the maker of automated sequencing instruments.) In the whole-genome shotgun (wgs) approach, there is no physical map made. Instead, the whole genome is treated like a single large-insert clone. It is randomly sheared into fragments. (Celera used 2-kb, 10-kb, and 50-kb subclones.) These fragments are then randomly sequenced until there is sufficient coverage.

23 Whole-genome shotgun sequencing II
Generate tens of millions of sequence reads
Assemble

To understand whole-genome shotgun sequencing, let us use the book analogy again: The information on random fragments of pages from each of the volumes is determined. This procedure is done multiple times so that overlapping sentences can be found. Based on these overlaps, the whole work is then assembled.
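The "overlapping sentences" idea can be shown with a toy assembler. This is a minimal sketch of greedy overlap merging, assuming error-free reads that all come from the same strand and a genome without repeats; it is not the software used by any real project, and it would mis-join reads in exactly the repetitive regions discussed on the next slide.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps: the contigs stay separate
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

contigs = greedy_assemble(["GTTAGC", "ATGCGTAC", "GTACGTTA"])
print(contigs)
```

The three reads are merged into a single contig because each shares at least three bases with a neighbor. With tens of millions of reads, the all-against-all comparison above becomes the computational bottleneck, which is why assembly needed very fast computers.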

24 Whole-genome shotgun sequencing III
Major challenge: assembly
  Repetitive elements are the biggest problem
  Performed on very high-speed computers, using novel software
Key to assembly is paired reads
  Sequence both ends of each clone

The major challenge of whole-genome shotgun sequencing is the assembly of the huge number of individual sequence reads into one contiguous sequence. A computer program was written by a team headed by Gene Myers to assemble the DNA sequence. Running the program required some of the fastest computers available. One of the keys to assembling the sequence was the use of paired reads: during the production sequencing process, the clones were bar coded so that one could identify when the same clone was read from both ends.
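The value of paired reads can be sketched with a small example. Because the two end reads of one subclone must lie roughly one insert length apart, a read that matches several copies of a repeat can be placed by checking which copy is the right distance from its uniquely placed mate. The sketch below assumes a known insert size (10 kb, one of the subclone sizes mentioned earlier) and an arbitrary 20% tolerance; the function name is hypothetical.

```python
def choose_placement(anchor_pos, candidate_positions, insert_size, tolerance=0.2):
    """Keep candidate positions whose distance from the mate fits the insert size."""
    allowed = tolerance * insert_size
    return [
        pos for pos in candidate_positions
        if abs(abs(pos - anchor_pos) - insert_size) <= allowed
    ]

# One end read is placed uniquely at position 1,000. Its mate falls in a
# repeat and matches three places in the assembly.
placements = choose_placement(1_000, [3_100, 11_200, 52_000], insert_size=10_000)
print(placements)
```

Only one of the three repeat copies is consistent with a 10-kb insert, so the ambiguous read can be placed there.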

25 Controversy: Map-based sequencing vs. whole-genome shotgun sequencing
Celera used publicly funded sequence to produce its published draft of the human genome
Scientists who worked on the map-based effort claimed Celera couldn’t have produced a draft without access to the public sequence
Celera scientists claim that they could have produced an accurate draft even without the public sequence

The relative values of map-based sequencing and whole-genome shotgun sequencing are still highly controversial. To produce its published rough draft of the human genome, Celera had access to and used the publicly funded Human Genome Project sequence. Some of the scientists who worked on the publicly funded map-based effort claimed that Celera could never have produced an assembled sequence without access to the public data. The Celera scientists countered that they could indeed have produced an assembly without the public sequence and, in fact, the assembly would have been more accurate and only slightly less extensive.

26 Hybrid approach
Combines aspects of both map-based and whole-genome shotgun approaches
  Map clones
  Sequence some of the mapped clones
  Do whole-genome sequencing
  Combine information from both methods
    Use sequence from mapped clones as scaffold to assemble whole-genome shotgun reads
Used for sequencing the mouse genome

One way to resolve the controversy has been to combine the two approaches. This hybrid approach starts with a physical map as well as sequencing of some of the clones in the shortest tiling path. At the same time, whole-genome sequencing is performed. Then the information is combined, using the sequence from the mapped clones as a scaffold to help assemble the whole-genome shotgun reads. This approach was used by the publicly funded effort to sequence the mouse genome.

27 To use the book analogy again, in the hybrid approach, the order of the volumes is determined and a small amount of information from each volume is acquired. At the same time, random fragments from all the volumes are read. These random fragments are then aligned with the help of the information gained by ordering the volumes and of the small amount of information known about each volume. In essence, the random fragments are used to fill in the gaps that remain when the information from individual volumes is assembled. The graph in the lower right corner of this slide depicts the open question of what mix of whole-genome shotgun and map-based sequencing is optimal.

28 Completed genomes as of 2002
Organism                                Base pairs      Approach
> 40 bacteria                           0.8–6 million   whole-genome shotgun (most)
Yeast                                   15 million      map-based
C. elegans (roundworm)                  100 million     map-based
Drosophila (fruit fly)                  120 million     whole-genome shotgun
Arabidopsis (thale cress)               130 million     map-based
Rice                                    435 million     map-based
Human                                   3 billion       map-based (Human Genome Project); whole-genome shotgun (Celera)
Mouse                                   2.5 billion     hybrid (publicly funded effort); whole-genome shotgun (Celera)
Fugu (puffer fish)                      365 million     whole-genome shotgun
Anopheles (malaria-carrying mosquito)   278 million     whole-genome shotgun

The number of completely sequenced genomes is increasing at a dramatic pace. Over 60 bacterial genomes of between 800,000 and 6 million base pairs (bp) have been fully sequenced, most using the whole-genome shotgun approach. The first eukaryotic genome to be sequenced was baker’s yeast, Saccharomyces cerevisiae, with a size of 15 million bp, using a map-based approach. The sequence of another yeast, Schizosaccharomyces pombe, was recently completed. The roundworm, C. elegans, was the first multicellular organism to be sequenced. Its 100-million-bp genome was sequenced with a map-based approach. The fruit fly Drosophila was sequenced using the wgs approach in a collaboration between Celera and a publicly funded sequencing effort. This project was seen as a warm-up by Celera to test wgs before tackling the human genome. The genomes for two plants, Arabidopsis thaliana and rice, have been sequenced, both with a map-based approach. The 3-billion-bp human genome sequence was generated in a competition between the publicly funded Human Genome Project and Celera. Both also sequenced the mouse genome, Celera using wgs and the publicly funded effort using a hybrid approach. The genome of the puffer fish, Fugu rubripes, which has been touted as the vertebrate with the smallest genome (365 million base pairs), was sequenced using wgs. The genome for the malaria-bearing mosquito, Anopheles, was also sequenced using wgs.

29 About 1600 genomes are being sequenced
New genomes finished
Over 400 genomes have been sequenced (Science 313: 1897, 2006)
About 1600 genomes are being sequenced
  Poplar (black cottonwood), two weeks ago
  Rat
  Chicken
  Dog
  Chimpanzee

30 Sizes of genomes and numbers of genes
The chart in this slide shows the relationship between genome size and the number of genes thought to be found in each genome.

31 Sequencing parameters
Difficulty and cost of large-scale sequencing projects depend on the following parameters:
  Accuracy
    How many errors are tolerated
  Coverage
    How many times the same region is sequenced
The two parameters are related
  More coverage usually means higher accuracy
  Accuracy is also dependent on the finishing effort

The difficulty and cost of large-scale sequencing depend on two parameters: accuracy and coverage. Accuracy means how many errors can be tolerated in the final sequence. Coverage is a measure of the number of times that the same region is sequenced. Of course, the two parameters are related. With increasing coverage, one usually gets higher accuracy. However, accuracy is dependent on more than just coverage; the quality of the finishing effort also plays a critical role in determining the level of accuracy.

32 Sequence accuracy
Highly accurate sequences are needed for the following:
  Diagnostics
    e.g., forensics, identifying disease alleles in a patient
  Protein coding prediction
    One insertion or deletion changes the reading frame
Lower accuracy sufficient for homology searches
  Differences in sequence are tolerated by search programs

Not all projects need the same level of accuracy. Highly accurate sequencing is needed for diagnostics and forensics. If a single-base-pair change is diagnostic for a particular disease allele, one can’t tolerate any errors. The same should be the case in forensics when the outcome of a trial can turn on the accuracy of a DNA sequence. This type of sequencing is almost always “resequencing,” that is, determining the sequence of a particular individual and comparing it with a reference sequence. In genomic sequencing, a high level of accuracy is needed for protein coding prediction. A single insertion or deletion will change the reading frame and thus change the sequence of the protein. Much lower accuracy is usually sufficient for homology searches, when the main goal is to determine whether there is a similar gene known from another organism. This process is more flexible because the homology search programs are designed to tolerate differences in the two sequences being compared.
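The effect of a single inserted base on the predicted protein can be demonstrated directly. The sketch below builds the standard genetic code table and translates a short made-up coding sequence before and after one spurious base is inserted; the example sequence is invented for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna):
    """Translate in frame 1, ignoring any incomplete trailing codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

correct = "ATGGCCATTGTAATGGGCCGC"
with_error = correct[:3] + "A" + correct[3:]  # one spurious inserted base

print(translate(correct))
print(translate(with_error))
```

After the insertion, every amino acid downstream of the error changes, which is why one insertion or deletion error in a coding region is far more damaging to gene prediction than one substitution.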

33 Sequence accuracy and sequencing cost
Level of accuracy determines cost of project
  Increasing accuracy from one error in 100 to one error in 10,000 increases costs three- to fivefold
Need to determine appropriate level of accuracy for each project
  If reference sequence already exists, then a lower level of accuracy should suffice
  Can find genes in genome, but not their position

The cost of a genomic sequencing project rises with the level of accuracy needed. For example, increasing the accuracy from one error in 100 to one error in 10,000 will increase the costs by three- to fivefold. Therefore, it is important to determine the level of accuracy that is appropriate for each project. If a reference sequence (from a closely related organism) exists, then a lower level of accuracy will usually suffice. This level of accuracy allows one to identify the genes in a genome, but not necessarily their position on the chromosome.

34 Sequencing coverage
Coverage is the number of times the same region is sequenced
Ideally, one wants an equal number of sequences in each direction
To obtain accuracy of one error in 10,000 bases, one needs the following:
  10x coverage
  Stringent finishing
Complete sequence
Base-perfect sequencing

Coverage is the number of times that the same region is sequenced. Ideally, approximately the same number of sequences should be read from each direction, as ambiguities that show up when a region is sequenced in one direction can frequently be resolved by reading the sequence from the opposite direction. For what’s known as base-perfect sequencing, which is an accuracy of one error in 10,000 bases, at least tenfold coverage is needed in addition to stringent finishing. This method is also known as complete sequencing.

35 Rough-draft and skimming sequence
Rough-draft sequence refers to an average of 5x coverage
Skimming is 1–3x coverage
  Obtains 67%–97% of the sequence
  On average, 99% accurate
  Of greatest use when the sequence can be compared to a reference sequence
    For example, chimpanzee genome compared with human genome

Lower levels of coverage are appropriate for a number of uses. Fivefold coverage is referred to as a rough draft. The human genome sequence was first published when it was at this stage. One- to threefold coverage is known as skimming. This approach yields 67% to 97% of the sequence and is, on average, 99% accurate. This level of coverage can still be highly informative when there already exists a reference sequence from a closely related organism, for example, when comparing a chimpanzee sequence with a human sequence.
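The link between coverage and the fraction of the genome obtained can be approximated with the Lander-Waterman model, in which reads land at random and the expected fraction of bases covered at least once is 1 - e^(-c) for coverage c. This is an idealized sketch, not the source of the slide's figures: it assumes uniformly random reads and ignores cloning bias and read length effects.

```python
import math

def expected_fraction_covered(coverage):
    """Expected fraction of bases hit by at least one read (Poisson model)."""
    return 1 - math.exp(-coverage)

for c in (1, 3, 5, 10):
    print(f"{c:>2}x coverage: {expected_fraction_covered(c):.4%} of bases covered")
```

The model gives roughly 63% at 1x and 95% at 3x, close to but somewhat below the 67%–97% quoted for skimming, and it shows why going from 5x to 10x closes relatively few additional gaps: most of the remaining work is finishing rather than more random reads.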

36 Industrialization of sequencing
Most large-scale sequencing projects divide tasks among different teams
  Large-insert libraries
  Production sequencing
  Finishing
Sequencing machines run 24/7
Many tasks performed by robots

To gain economies of scale and increase efficiency, most large-scale sequencing projects divide the tasks among different teams of scientists. For example, one team will generate the large-insert libraries, another will perform the production sequencing, and a third will do the finishing. In large sequencing centers like those found at Baylor University, Washington University, and MIT, automated sequencing instruments run 24 hours a day, seven days a week. Many of the tasks, including picking clones, running sequencing reactions, and loading the automated sequencers, are performed by robots. Bar coding is frequently used to keep track of the large number of samples. These large sequencing centers resemble a high-tech factory more than they do a biology research laboratory.

37 EST sequencing I
Idea: sequence only “important” genes
  Those genes expressed in a particular tissue
Sequence random cDNAs made from RNA extracted from tissue of interest
[Figure labels: muscle mRNA; cDNA libraries; “New” Biolims; robotized stations; DNA sequencers]

In the early days of the Human Genome Project, when sequencing all 3 billion bases appeared to be a nearly impossible task, an alternative approach was proposed: to sequence just the “important” genes, that is, those expressed in particular tissues. Among those who tried this approach, J. Craig Venter, while at the NIH prior to becoming head of Celera, showed that it could yield a large amount of useful information. The basic idea was to sequence a portion of random cDNAs made from the RNA extracted from a tissue of interest. The image in this slide shows a cDNA library made from human muscle. The library is arrayed in microtiter wells, and individual clones are randomly sequenced.

38 EST sequencing II Make cDNA library Select clones at random
Sequence in from one or both ends; one-pass sequencing; the resulting sequence = expressed sequence tag (EST).
[Figure labels: 5’ cDNA 3’; Partial sequence = EST]
The first step in this alternative approach, called EST sequencing, is to make a cDNA library from RNA extracted from the tissue or cell type of interest. Then cDNA clones are selected at random and sequenced from one or both ends. The key is that there is no attempt to get the complete sequence of the clone, only a one-pass sequence from one or both ends. This sequence is known as an “expressed sequence tag,” or EST. It has been shown that in most cases, the one-pass sequencing yields sufficient information to perform a homology search. Once clones with homology to genes of interest are identified, the clones can be fully sequenced or used as probes to pull out the corresponding genomic region.
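The one-pass idea described above can be illustrated with a minimal Python sketch. It assumes the full cDNA insert is known as a string (in practice it is not, which is the point of EST sequencing), and the function names, the toy clone sequence, and the `read_length` value are all hypothetical choices made for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def one_pass_ests(cdna: str, read_length: int = 400) -> dict:
    """Simulate single-pass reads taken from both ends of a cDNA insert.

    read_length is a hypothetical value, not a figure from the slides.
    """
    cdna = cdna.upper()
    # The 5' read simply covers the first read_length bases of the insert.
    five_prime = cdna[:read_length]
    # The 3' read is primed from the other end and runs along the opposite
    # strand, so it is the reverse complement of the last read_length bases.
    three_prime = reverse_complement(cdna[-read_length:])
    return {"5prime_est": five_prime, "3prime_est": three_prime}


# Toy clone sequence, invented for the example.
clone = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
ests = one_pass_ests(clone, read_length=10)
print(ests)
```

Neither read covers the middle of the insert, which is why an EST is a tag for the clone and not its complete sequence.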

39 EST sequencing: pros and cons
Advantages: relatively inexpensive; certainty that sequence comes from a transcribed gene; information about tissue and developmental stage. Disadvantages: no regulatory information; usually less than 60% of genes found in EST collections; location of sequence in genome unknown.
[Figure labels: Snapshot of transcription activity; Mixture of cells]
An advantage EST sequencing has over genomic sequencing is that it is a relatively inexpensive way of finding a lot of genes that one knows are expressed. Moreover, because the cDNA library is made from a particular tissue at a specific developmental stage, the types of genes and the number of times each gene is sequenced give an indication of the relative abundance of these sequences in that tissue or developmental stage. The disadvantage of EST sequencing as compared with genomic sequencing is that it gives no information on the upstream or downstream regulatory sequences. Also, even with extensive sequencing from many different tissue sources, most EST collections contain fewer than 60% of the genes found in the genome. In addition, EST sequencing gives no information as to the location of the sequence on the chromosome.
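The point that EST counts indicate relative abundance can be sketched as a simple tally. This is only an illustration: it assumes each EST has already been assigned to a gene (for example by a homology search), and the list of hits below is made up.

```python
from collections import Counter


def relative_abundance(est_gene_hits):
    """Return the fraction of ESTs assigned to each gene."""
    counts = Counter(est_gene_hits)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}


# Invented gene assignments for eight ESTs from a muscle cDNA library.
hits = ["ACTA1", "MYH7", "ACTA1", "TTN", "ACTA1", "MYH7", "ACTA1", "CKM"]
abundance = relative_abundance(hits)
for gene, fraction in sorted(abundance.items(), key=lambda item: -item[1]):
    print(f"{gene}: {fraction:.3f}")
```

Rarely expressed genes may never appear in such a tally, which is one reason EST collections miss a large share of the genes in a genome.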

40 Sequence annotation Annotation performed on completed sequence
Computer programs used to find the following: genes; exons and introns; regulatory sequences; repetitive elements.
After a genome is completely sequenced, it can then be annotated. This operation is performed initially using computer programs that are designed to find genes and their exons and introns, regulatory sequences such as TATA boxes and polyA addition sites, and repetitive elements. Once genes are identified, other programs search for homologues in other species. The chapter on bioinformatics describes how these annotation programs work.
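As a toy illustration of scanning for regulatory signals, the sketch below searches a sequence for two commonly cited consensus motifs, a TATA-box pattern and the AATAAA polyA signal. Real annotation programs rely on statistical models, not exact pattern matches, so the patterns, function name, and test sequence here are simplifying assumptions.

```python
import re

# Simplified consensus patterns, used here only for illustration.
MOTIFS = {
    "TATA_box": re.compile(r"TATA[AT]A[AT]"),
    "polyA_signal": re.compile(r"AATAAA"),
}


def scan_motifs(sequence):
    """Return (motif name, start position, matched text) sorted by position."""
    sequence = sequence.upper()
    hits = []
    for name, pattern in MOTIFS.items():
        for match in pattern.finditer(sequence):
            hits.append((name, match.start(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])


# Invented sequence containing one copy of each motif.
toy = "GGCTATAAAAGGCATGCCGTTAGCAATAAAGCT"
motif_hits = scan_motifs(toy)
for name, start, text in motif_hits:
    print(name, start, text)
```

Short motifs like these occur by chance throughout a genome, so a match on its own is weak evidence for a gene.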

41 Progress in genome sequencing
Human Genome Project: 10 years to complete; billions of dollars. Current sequencing technology: $10-25 million to sequence a human genome; mammalian genomes sequenced in months ($10-25 million); microbial genomes sequenced in weeks ($20-50K). Massively parallel sequencing: 25 million base pairs in 4 hours (454 sequencing, Curagen).
By the time the Human Genome Project was completed, the effort had taken ten years and cost billions of dollars. Today mammalian genomes can be sequenced in months and bacterial genomes can be sequenced in weeks. The cost of sequencing the human genome using current technology is estimated to be $10-25 million, and a bacterial genome can be sequenced at a cost of $20,000 to $50,000. In September 2005, a group of corporate and academic researchers led by Jonathan M. Rothberg announced a method of massively parallel genome sequencing that promises to make sequencing orders of magnitude more efficient, allowing researchers to sequence 25 million base pairs in only 4 hours.
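To put the 25 million base pairs in 4 hours figure in perspective, here is a back-of-the-envelope calculation. It uses only the numbers quoted in this chapter (a 3 billion base pair human genome, 25 million bases per 4-hour run); the coverage value in the last lines is a hypothetical parameter, and the estimate ignores sample preparation and assembly.

```python
import math

HUMAN_GENOME_BP = 3_000_000_000  # genome size quoted in this chapter
BASES_PER_RUN = 25_000_000       # bases read in one run
HOURS_PER_RUN = 4


def runs_needed(genome_bp, coverage=1.0):
    """Number of runs required to read genome_bp at the given coverage."""
    return math.ceil(genome_bp * coverage / BASES_PER_RUN)


throughput = BASES_PER_RUN / HOURS_PER_RUN
print(f"Throughput: {throughput:,.0f} bases per hour")

single_pass_runs = runs_needed(HUMAN_GENOME_BP)
print(f"Runs for 1x coverage: {single_pass_runs}")
print(f"Instrument hours for 1x coverage: {single_pass_runs * HOURS_PER_RUN}")

# Hypothetical 8x coverage, chosen only to show how the estimate scales.
print(f"Runs for 8x coverage: {runs_needed(HUMAN_GENOME_BP, coverage=8)}")
```

The estimate scales linearly with coverage, so the choice of coverage dominates the instrument time for a large genome.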

42 Sample preparation Fragment DNA into short single strands
Use “adaptor” sequences to attach DNA to micron-scale beads; encase beads in oil droplets containing PCR reagents; amplify bead DNA; load beads into picoliter wells.
Massively parallel sequencing is made possible by the specific technique used to prepare the DNA samples. The first step is to break up the DNA into small single-stranded pieces. “Adaptor” sequences are then appended to the DNA strands, allowing them to be attached to beads that are 28 microns in diameter. Attachment of the DNA is done in such a way that only a single strand of DNA will attach to each bead. The beads are then surrounded by droplets of oil that contain all of the reagents needed for the polymerase chain reaction (PCR). Amplification via PCR results in approximately ten million copies of each DNA strand that is attached to a bead. The oil droplets are kept in an emulsion that prevents cross-contamination between individual droplets and the beads they encase. Finally, the beads are loaded into picoliter wells on a slide containing optic fibers. A schematic of the entire process is shown on the slide. Using this sample preparation technique, a single individual was able to prepare the genomic DNA of Mycoplasma genitalium in roughly four hours. The size of M. genitalium’s genome is 580 kb.
From Figure 1 in Margulies, M. et al. (2005) “Genome sequencing in microfabricated high-density picolitre reactors” Nature 437: Figures 1 and 2 used by permission of Jonathan Rothberg.

43 The instrument Load beads into slide containing 1.6 million picoliter wells
Sequentially pass A, G, C, and T into picoliter wells; presence of a particular base emits light from individual well; CCD reads emitted light from each picoliter well.
Once the beads have been prepared and the DNA has been amplified, the samples can be loaded into the device that will read the sequence. A 6.4 cm by 6.4 cm slide containing 1.6 million picoliter wells is loaded with the beads. Each well can only hold a single bead, thereby avoiding cross-contamination. The slide is attached to a pumping system that alternately sends reagents into the picoliter wells to indicate the presence of an A, G, C, or T on the DNA strands associated with each bead. The chemical reaction used is one that had already been developed and is common in pyrosequencing devices. It works by constructing a second strand of DNA based on the template that is already associated with each bead. If a base is added to the sister strand, light is emitted from the picoliter well. The light is detected by a charge-coupled device (CCD) that simultaneously reads the signals from all of the picoliter wells. The CCD is attached to a computer that is used to download and analyze the data. Sequences read from the picoliter wells are then used to reconstruct the genome sequence. An automated run of the system shown in the slide was able to read the full genome sequence of M. genitalium in four hours. A typical run reads approximately 25 million bases. In the accompanying figure, (a) shows the pumping system used to expose wells to different bases, (b) shows the slide containing the picoliter wells, and (c) shows the CCD that reads the optic signal from individual wells.
From Figure 2 in Margulies, M. et al. (2005) “Genome sequencing in microfabricated high-density picolitre reactors” Nature 437:
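The flow-by-flow readout described above can be sketched as a small simulation of a single well. This is an idealized model under stated assumptions: the flow order A, G, C, T is taken from the slide, the light signal is assumed to equal the number of identical bases incorporated during that flow, and there is no noise. The function names are hypothetical.

```python
from itertools import cycle

FLOW_ORDER = "AGCT"  # order named on the slide; an assumption for this sketch


def flowgram(synthesized, flow_order=FLOW_ORDER, max_flows=1000):
    """Return one (base, signal) pair per flow for a single well.

    `synthesized` is the strand being built on the bead's template.
    The idealized signal is the number of bases added during the flow.
    """
    signals = []
    position = 0
    for flow_number, base in enumerate(cycle(flow_order)):
        # Stop when the strand is complete; max_flows guards against
        # symbols that no flow can ever incorporate.
        if position >= len(synthesized) or flow_number >= max_flows:
            break
        run = 0
        while position < len(synthesized) and synthesized[position] == base:
            run += 1
            position += 1
        signals.append((base, run))  # a run of 0 means no light this flow
    return signals


def call_bases(signals):
    """Reconstruct the sequence from idealized flow signals."""
    return "".join(base * run for base, run in signals)


well_signals = flowgram("AAGTCCA")
print(well_signals)
print(call_bases(well_signals))
```

In this model a run of identical bases shows up as one stronger signal instead of several separate ones, so calling the length of such runs depends on measuring light intensity accurately.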

44 Caveats Only reads short lengths of sequence
bp; accuracy of individual reads is lower than conventional methods (though speed can compensate to some extent); cannot read paired ends (makes sequence assembly less efficient).
Though this method of massively parallel sequencing holds great promise, there are a number of ways in which the traditional Sanger method of sequencing is superior. The method developed by Rothberg and colleagues is only able to read short lengths of DNA ( bp), approximately 10% of what is typically possible with existing sequencing methods. Also, the accuracy of individual reads is lower, although the speed of massively parallel sequencing can compensate for this shortcoming to some extent. Lastly, because this technique reads only single strands of DNA, sequence assembly is less efficient than in existing sequencing methods that can exploit reading the paired ends of DNA, which provide more information for genome sequence assembly.
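One way to see how speed can offset lower per-read accuracy is that faster sequencing makes it cheap to read each base several times and take a consensus. The sketch below is a simplified model, not a description of the published method: it assumes errors in different reads are independent, counts a position as wrong only when more than half of the reads covering it are in error, and uses a hypothetical 4% per-base error rate.

```python
from math import comb


def majority_error(p, n):
    """Probability that more than half of n independent reads are wrong.

    p is the per-base error rate of a single read. Use odd n, since
    ties are not counted as errors in this simplified model.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Hypothetical per-read error rate chosen for illustration only.
error_rate = 0.04
for depth in (1, 3, 5, 9):
    print(depth, majority_error(error_rate, depth))
```

Under these assumptions the consensus error falls quickly as depth increases, though systematic errors that recur in every read would not be removed this way.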

45 Summary I
Human Genome Project: goals; automated sequencers; sequencing strategies: map-based, whole-genome shotgun, hybrid.
In summary, we have discussed the Human Genome Project’s goals and accomplishments in this chapter. We examined the importance of automated sequencing instruments to large-scale sequencing. We also described the different strategies for performing genomic sequencing, including map-based, whole-genome shotgun, and hybrid approaches.

46 Summary II
Steps in large-scale sequencing: large-insert libraries; production sequencing; finishing. Accuracy and coverage. EST sequencing. Sequence annotation.
We went through the steps of large-scale sequencing projects, including making large-insert libraries, production sequencing, and finishing. We then described the two parameters, accuracy and coverage, and the impact of changing each parameter on the cost and difficulty of completing the sequencing project. We described the use of EST sequencing to determine the sequence only of transcribed genes. Finally, we touched on the use of computer programs to annotate the completed sequence.

