Last lecture summary. Sequencing strategies Hierarchical genome shotgun HGS – Human Genome Project “map first, sequence second” clone-by-clone … cloning.


Last lecture summary

Sequencing strategies
Hierarchical genome shotgun (HGS): Human Genome Project
"map first, sequence second"
clone-by-clone … cloning is performed twice (BAC, plasmid)

Sequencing strategies
Whole genome shotgun (WGS): Celera
shotgun, no mapping
Coverage: the average number of reads representing a given nucleotide in the reconstructed sequence. HGS: 8, WGS: 20
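The coverage definition above reduces to simple arithmetic: total sequenced bases divided by genome length. A minimal Python sketch, assuming idealized reads of equal length. The function names are illustrative, and the 600 bp read length and 3 billion bp genome size are borrowed from other slides of this lecture, not from this one.

```python
import math


def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth: total sequenced bases divided by genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of equal-length reads that reaches the target depth."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    genome = 3_000_000_000  # assumed human genome size in bp
    print(reads_needed(8, 600, genome))   # reads for HGS-like 8x coverage
    print(reads_needed(20, 600, genome))  # reads for WGS-like 20x coverage
```

Under these assumptions, 8-fold coverage with 600 bp reads corresponds to about 40 million reads. Real projects need more, because reads vary in length and are not spread evenly over the genome.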

Genome assembly
reads, contigs, scaffolds
base calling, sequence assembly, PHRED/PHRAP
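PHRED is named above as the base caller. The slide gives no formula, but the quality score usually associated with it is defined as Q = -10 log10(P), where P is the estimated probability that a base call is wrong. A small Python sketch of that conversion (function names are illustrative):

```python
import math


def error_prob_to_phred(p_error: float) -> float:
    """Phred-style quality: Q = -10 * log10(P_error)."""
    if not 0 < p_error <= 1:
        raise ValueError("p_error must be in (0, 1]")
    return -10 * math.log10(p_error)


def phred_to_error_prob(q: float) -> float:
    """Inverse conversion: P_error = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(q, phred_to_error_prob(q))
```

A quality of 20 therefore means roughly one wrong call in 100 bases, and a quality of 30 roughly one in 1000.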

Human genome
3 billion bp, ~ – genes
Only 1.1–1.4% of the genome sequence codes for proteins.
State of completion: best estimate is that 92.3% is complete.
Problematic unfinished regions: centromeres and telomeres (both contain highly repetitive sequences), and some unclosed gaps.
It is likely that the centromeres and telomeres will remain unsequenced until new technology is developed.
The genome is stored in databases.
Primary database: GenBank
Additional data and annotation, tools for visualizing and searching: UCSC, Ensembl

1st and 2nd generation of sequencers
1st generation: ABI Prism 3700 (Sanger, fluorescence, 96 capillaries), used in the HGP and at Celera.
The Sanger method still beats NGS in read length (600 bp).
2nd generation: the birth of HT-NGS in … 454 Life Sciences developed the GS 20 sequencer, which combines PCR with pyrosequencing.
Pyrosequencing is sequencing-by-synthesis. It relies on detecting pyrophosphate release on nucleotide incorporation rather than on chain termination with ddNTPs. The release of pyrophosphate is detected as a flash of light (chemiluminescence).
Average read length: 400 bp
Roche GS-FLX 454 (successor of the GS 20) was used to sequence J. Watson's genome.

New stuff

3rd generation
The 2nd generation still uses PCR amplification, which may introduce base sequence errors or favor certain sequences over others.
To overcome this, the emerging 3rd generation of sequencers performs single-molecule sequencing (i.e. the sequence is determined directly from one DNA molecule, with no amplification or cloning).
Compared to the 2nd generation, these instruments offer higher throughput, longer reads (~1000 bp), higher accuracy, a smaller amount of starting material, and lower cost.

[Figure: cost chart marking the transition to the 2nd generation; values shown: $5,000, $0.057, $5,000]

Illumina HiSeq X Ten
Illumina announced the new HiSeq X Ten Sequencing System.
Illumina claims that it enables the $1,000 genome.
It uses Illumina SBS (sequencing-by-synthesis) technology.
It sells for at least $10 million.

Human Longevity
Human Longevity was founded by Craig Venter.
Its main aim: to slow ageing.
It is the largest human DNA sequencing operation in the world, capable of processing 40,000 human genomes a year.
DNA data will be combined with other data on the health and body composition of the people whose DNA is sequenced, in the hope of gleaning insights into the molecular causes of aging and age-related illnesses like cancer and heart disease.
Equipment: 2x Illumina HiSeq X Ten

Which genomes were sequenced?
GOLD (Genomes OnLine Database): information regarding complete and ongoing genome projects

Important genomics projects
The analysis of personal genomes has demonstrated how difficult it is to draw medically or biologically relevant conclusions from individual sequences.
More genomes need to be sequenced to learn how genotype correlates with phenotype.
1000 Genomes project (started in …): sequence the genomes of at least 1000 people from around the world to create a detailed and medically useful picture of human genetic variation.
The 2nd generation of sequencers is used in 1000 Genomes.
… Genomes will start soon.

Important genomics projects
ENCODE project (ENCyclopedia Of DNA Elements) by NHGRI
Aim: identify all functional elements in the human genome sequence.
Defined regions of the human genome corresponding to 30 Mb (1%) have been selected. These regions serve as the foundation on which to test and evaluate the effectiveness and efficiency of a diverse set of methods and technologies for finding various functional elements in human DNA.

Rapid Evolution of Next Generation Sequencing Technologies
2000: Human genome working drafts. Data unit of approximately 10x coverage of human. Took 10 years and cost about $3 billion.
2008: Major genome centers can sequence the same number of base pairs every 4 days. 1000 Genome project launched. World-wide capacity dramatically increasing.
2009: Every 4 hours
2010: Every 14 minutes. An Illumina HiSeq2000 machine produces 200 gigabases per 8 day run.
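The figures on this slide can be checked with a few lines of arithmetic. The sketch below takes the "data unit" to be 10x coverage of a 3 billion bp genome (the genome size comes from an earlier slide) and asks how long a single HiSeq2000 needs to produce it. The constant and function names are illustrative assumptions.

```python
GENOME_SIZE_BP = 3_000_000_000      # assumed human genome size
DATA_UNIT_BP = 10 * GENOME_SIZE_BP  # "approximately 10x coverage of human"

# Time per data unit quoted on the slide, converted to minutes.
MINUTES_PER_UNIT = {2008: 4 * 24 * 60, 2009: 4 * 60, 2010: 14}


def days_per_data_unit(run_yield_bp: float, run_days: float) -> float:
    """Days one machine needs to produce one data unit at a steady rate."""
    bases_per_day = run_yield_bp / run_days
    return DATA_UNIT_BP / bases_per_day


def speedup(year_from: int, year_to: int) -> float:
    """How many times faster a data unit is produced in year_to."""
    return MINUTES_PER_UNIT[year_from] / MINUTES_PER_UNIT[year_to]


if __name__ == "__main__":
    # HiSeq2000: 200 gigabases per 8-day run
    print(days_per_data_unit(200e9, 8))
    print(speedup(2008, 2009), speedup(2008, 2010))
```

With these numbers a single machine yields 25 gigabases per day, so one 30-gigabase data unit takes a little over a day, and the step from 2008 to 2009 alone is a 24-fold speed-up.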

cDNA
1. Isolate mRNA from suitable cells.
2. Convert it to complementary DNA (cDNA) using the enzyme reverse transcriptase (+ DNA polymerase).
cDNA contains only expressed genes: no intergenic regions, no introns (just exons).
Because the desired gene sequences usually still represent only a tiny proportion of the total cDNA population, the cDNA fragments are amplified by cloning/PCR.
cDNA library: a library is defined simply as a collection of different DNA sequences that have been incorporated into a vector.
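The two enzymatic steps can be mimicked on strings. This is a toy sketch, not a model of the chemistry: it assumes the mRNA is given 5' to 3' using only the letters A, C, G and U, and the function names are made up for illustration.

```python
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna: str) -> str:
    """Strand made by reverse transcriptase, written 5'->3'.

    It is the reverse complement of the mRNA in the DNA alphabet.
    """
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna.upper()))


def second_strand_cdna(mrna: str) -> str:
    """Strand made by DNA polymerase: the mRNA sequence with U replaced by T."""
    return mrna.upper().replace("U", "T")


if __name__ == "__main__":
    mrna = "AUGGCCUAA"
    print(first_strand_cdna(mrna))
    print(second_strand_cdna(mrna))
```

For the mRNA AUGGCC the first strand reads GGCCAT and the second strand reads ATGGCC, which is why a cDNA read can be compared directly with the coding sequence.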

ESTs
EST: Expressed Sequence Tag
Their use was promoted by Craig Venter. At that time (1991) it was a revolutionary way to identify genes.
An EST is a short subsequence ( bps) of a cDNA sequence.
ESTs are unedited, randomly selected single-pass sequence reads derived from cDNA libraries.
They can be generated from either the 5' or the 3' end.
[Figure: mRNA, cDNA, 5' ESTs, 3' ESTs]

ESTs
ESTs and cDNA sequences provide direct evidence of all sampled transcripts, and they are currently the most important resources for transcriptome exploration.
ESTs/cDNA sequences cover the genes expressed in the given tissue of the given organism under the given conditions.
Housekeeping genes: gene products required by the cell under all growth conditions (genes for DNA polymerase, RNA polymerase, rRNA, tRNA, …)
Tissue-specific genes: different genes are expressed in the brain and in the liver; enzymes responding to a specific environmental condition such as DNA damage, …

ESTs vs. whole genome
Whole genome sequencing is still impractical and expensive for organisms with a large genome.
Genome expansion, as a result of retrotransposon repeats, makes whole genome sequencing less attractive for plants such as maize.
Transposons: sequences of DNA that can move (transpose) themselves to new positions within the genome.
Retrotransposons: a subclass of transposons that can amplify themselves. Ubiquitous in eukaryotic organisms (45%-48% in mammals, 42% in human). Particularly abundant in plants (maize: 49-78%, wheat: 68%).
Genome expansion: an increase in genome size, one of the elements of genome evolution.

EST properties
An individual raw EST carries negligible biological information; it is just a very short copy of mRNA.
It is highly error prone, especially at the ends. The overall sequence quality is usually significantly better in the middle.
Nagaraj SH, Gasser RB, Ranganathan S. A hitchhiker's guide to expressed sequence tag (EST) analysis. Brief Bioinform (1):6-21. PMID:

Problems in ESTs
Redundancy
Under-representation and over-representation of selected host transcripts (i.e. sequence bias)
Base calling errors (as high as 5%)
Contamination from vector sequences
Repeats may pose problems
Natural sequence variations (e.g. SNPs): how to distinguish them from sequencing artifacts?

ESTs on the web
Largest repository: dbEST, with 75,134,573 ESTs from more than organisms
UniGene stores unique genes and represents a nonredundant set of gene-oriented clusters generated from ESTs.

EST analysis
[Figure: generic steps involved in EST analysis]
The aim of the analysis: augment weak signals, build a consensus and, when a multitude of ESTs is analysed, reconstruct the transcriptome of the organism.
Nagaraj SH, Gasser RB, Ranganathan S. A hitchhiker's guide to expressed sequence tag (EST) analysis. Brief Bioinform (1):6-21. PMID:

EST preprocessing
Reduces the overall noise in EST data to improve the efficacy of subsequent analyses.
Remove contaminating vector fragments: compare ESTs with non-redundant vector databases (UniVec, EMVEC).
Repeats must be detected and masked using RepeatMasker.
Resources for EST pre-processing: page 12 in Nagaraj SH, Gasser RB, Ranganathan S. A hitchhiker's guide to expressed sequence tag (EST) analysis. Brief Bioinform (1):6-21. PMID:
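What "masking" means can be shown with a deliberately naive sketch. Real pipelines find vector contamination by similarity search against databases such as UniVec and mask repeats with RepeatMasker; the code below only masks exact copies of a known contaminant string, and the function name, the `N` mask character and the example sequences are assumptions made for illustration.

```python
def mask_exact(seq: str, contaminant: str, mask_char: str = "N") -> str:
    """Replace every exact occurrence of `contaminant` in `seq` by mask_char.

    Comparison is case-insensitive; the result is returned in upper case.
    Overlapping occurrences are all masked.
    """
    seq_u, cont_u = seq.upper(), contaminant.upper()
    if not cont_u:
        return seq_u
    masked = list(seq_u)
    start = seq_u.find(cont_u)
    while start != -1:
        masked[start:start + len(cont_u)] = mask_char * len(cont_u)
        start = seq_u.find(cont_u, start + 1)
    return "".join(masked)


if __name__ == "__main__":
    est = "GAATTCACGTACGT"  # toy EST with a made-up vector fragment at the 5' end
    print(mask_exact(est, "GAATTC"))
```

Masked positions stay in the sequence, so coordinates are preserved, but later similarity searches ignore them.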

EST clustering
Collect overlapping ESTs from the same transcript of a single gene into a unique cluster to reduce redundancy.
Clustering is based on sequence similarity.
The steps of EST clustering are described in detail in Ptitsyn A, Hide W. CLU: a new algorithm for EST clustering. BMC Bioinformatics. 2005; 6 Suppl 2:S3. PubMed PMID:
The maximum informative consensus sequence is generated by 'assembling' these clusters, each of which could represent a putative gene.
This step elongates the sequence by culling information from several short EST sequences simultaneously.
Sequence clustering and assembly: CAP3
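The idea of similarity-based clustering can be illustrated with a toy single-linkage scheme: two ESTs land in the same cluster if they share at least one exact k-mer, directly or through a chain of other ESTs. This is not the CLU algorithm and not what CAP3 does; the k-mer criterion, the value of k and all names are assumptions chosen to keep the sketch short. Reverse-complement matches are ignored.

```python
def kmers(seq: str, k: int) -> set:
    """All substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests: list, k: int = 8) -> list:
    """Single-linkage clustering: ESTs sharing any exact k-mer are joined.

    Returns a sorted list of clusters, each a list of indices into `ests`.
    """
    parent = list(range(len(ests)))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first EST containing it
    for idx, seq in enumerate(ests):
        for kmer in kmers(seq.upper(), k):
            if kmer in first_seen:
                parent[find(idx)] = find(first_seen[kmer])
            else:
                first_seen[kmer] = idx

    clusters = {}
    for idx in range(len(ests)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())


if __name__ == "__main__":
    toy = ["AAACCCGGGTTT", "CCGGGTTTACGA", "TATATATATATA"]
    print(cluster_ests(toy, k=6))
```

In the toy input the first two sequences overlap and end up together, while the third forms a cluster of its own. Production tools replace the exact k-mer test with alignment-based overlap criteria that tolerate sequencing errors.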

Functional annotations
Database similarity searches (BLAST) are subsequently performed against relevant DNA databases, and a possible function is assigned to each query sequence if significant database matches are found.
Additionally, a consensus sequence can be conceptually translated to a putative peptide and then compared with protein sequence databases.
Protein-centric functional annotation, including domain and motif analysis, can be carried out using protein analysis tools.
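"Conceptual translation" means reading the consensus codon by codon with the genetic code. The sketch below builds the standard genetic code from the conventional TCAG ordering and translates the three forward reading frames; it assumes an upper- or lower-case DNA string, writes `X` for codons with ambiguous bases, and does not handle the reverse strand. Names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, frame: int = 0) -> str:
    """Translate one forward reading frame (0, 1 or 2); '*' marks a stop codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frames(dna: str) -> list:
    """Translations of the three forward reading frames."""
    return [translate(dna, frame) for frame in range(3)]


if __name__ == "__main__":
    print(three_frames("ATGGCCTAA"))
```

Because the correct frame of an EST consensus is usually unknown, all frames (and in practice both strands) are translated and the resulting peptides are searched against protein databases.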

EST analysis pipelines
Large-scale sequencing projects (thousands of ESTs generated daily): store, organize and annotate EST data in an automatic pipeline.
Database of raw chromatograms → clean, cluster, assemble, generate consensus, translate, assign putative function based on various DNA/protein similarity searches
Examples: TGI Clustering tools (TGICL), PartiGene

Sequence Alignment

What is sequence alignment?
Fragment overlaps:

CTTTTCAAGGCTTA
        GGCTTATTATTGC

CTTTTCAAGGCTTA
        GGCTATTATTGC

CTTTTCAAGGCTTA
        GGCT-ATTATTGC
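The first fragment pair on this slide overlaps exactly, which can be found by comparing the suffix of one fragment with the prefix of the other. A minimal sketch of that exact-match test follows; the function names and the `min_len` threshold are illustrative assumptions. It also shows why the second pair needs real alignment: one missing base is enough to make the exact test fail.

```python
def longest_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no exact overlap of at least min_len exists.
    """
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left[-n:] == right[:n]:
            return n
    return 0


def merge(left: str, right: str, min_len: int = 3) -> str:
    """Join two fragments on their longest exact overlap (or just concatenate)."""
    n = longest_overlap(left, right, min_len)
    return left + right[n:]


if __name__ == "__main__":
    a = "CTTTTCAAGGCTTA"
    print(longest_overlap(a, "GGCTTATTATTGC"))  # exact overlap GGCTTA
    print(merge(a, "GGCTTATTATTGC"))
    print(longest_overlap(a, "GGCTATTATTGC"))   # one base missing: no exact overlap
```

Allowing the gap shown in the third pair is exactly what a gapped alignment adds over this exact comparison.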

What is sequence alignment?
"EST clustering"

ESTs:
CCCCATGGTGGCGGCAGGTGACAG
CATGGGGGAGGATGGGGACAGTCCGG
TTACCCCATGGTGGCGGCTTGGGAAACTT
TGGCGGCTCGGGACAGTCGCGCATAAT
CCATGGTGGTGGCTGGGGATAGTA
TGAGGCAGTCGCGCATAATTCCG

Consensus:
TTACCCCATGGTGGCGGCTGGGGACAGTCGCGCATAATTCCG
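Once the ESTs are placed on a common coordinate system, a consensus can be taken as a per-column majority vote. The sketch below does this for the six ESTs of this slide. The start offsets are not given in the transcript; they are an assumption, inferred by sliding each EST along the consensus, and the function name is illustrative.

```python
from collections import Counter


def consensus(placed_reads: list) -> str:
    """Majority base per column for (offset, read) pairs; 'N' where no read covers."""
    length = max(offset + len(read) for offset, read in placed_reads)
    columns = [Counter() for _ in range(length)]
    for offset, read in placed_reads:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return "".join(
        col.most_common(1)[0][0] if col else "N" for col in columns
    )


# (offset, EST) pairs; offsets are inferred, not taken from the slide.
PLACED_ESTS = [
    (3, "CCCCATGGTGGCGGCAGGTGACAG"),
    (6, "CATGGGGGAGGATGGGGACAGTCCGG"),
    (0, "TTACCCCATGGTGGCGGCTTGGGAAACTT"),
    (11, "TGGCGGCTCGGGACAGTCGCGCATAAT"),
    (5, "CCATGGTGGTGGCTGGGGATAGTA"),
    (19, "TGAGGCAGTCGCGCATAATTCCG"),
]

if __name__ == "__main__":
    for offset, est in PLACED_ESTS:
        print(" " * offset + est)
    print(consensus(PLACED_ESTS))
```

Individual ESTs disagree at several positions, but each error is outvoted by the other reads covering that column, which is the sense in which clustering "augments weak signals".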

Sequence alphabet

Amino acids (side chain charge at physiological pH 7.4)

Name             3 letters   1 letter

Positively charged side chains
Arginine         Arg         R
Histidine        His         H
Lysine           Lys         K

Negatively charged side chains
Aspartic Acid    Asp         D
Glutamic Acid    Glu         E

Polar uncharged side chains
Serine           Ser         S
Threonine        Thr         T
Asparagine       Asn         N
Glutamine        Gln         Q

Special
Cysteine         Cys         C
Selenocysteine   Sec         U
Glycine          Gly         G
Proline          Pro         P

Hydrophobic side chains
Alanine          Ala         A
Leucine          Leu         L
Isoleucine       Ile         I
Methionine       Met         M
Phenylalanine    Phe         F
Tryptophan       Trp         W
Tyrosine         Tyr         Y
Valine           Val         V

Nucleotides

Adenine          A
Thymine          T
Cytosine         C
Guanine          G

Sequence alignment
Procedure of comparing sequences

Point mutations: easy
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT

More difficult example
ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT

However, gaps can be inserted to get something like this
ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT

Terms: gapless alignment, gapped alignment, insertion × deletion (indel)
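The two situations on this slide can be quantified by walking along the aligned columns. A short sketch, assuming both sequences are already aligned to equal length and that `-` marks a gap; the function names are illustrative.

```python
def count_mismatches(a: str, b: str) -> int:
    """Number of differing positions in a gapless alignment (Hamming distance)."""
    if len(a) != len(b):
        raise ValueError("gapless comparison needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def alignment_stats(a_aligned: str, b_aligned: str) -> tuple:
    """Return (matches, mismatches, gap columns) for a gapped alignment."""
    if len(a_aligned) != len(b_aligned):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    for x, y in zip(a_aligned, b_aligned):
        if x == "-" or y == "-":
            gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


if __name__ == "__main__":
    ref = "ACGTCTGATACGCCGTATAGTCTATCT"
    print(count_mismatches(ref, "ACGTCTGATTCGCCCTATCGTCTATCT"))
    print(alignment_stats(ref, "----CTGATTCGC---ATCGTCTATCT"))
```

For the slide's examples, the point-mutation pair differs at 3 positions, and the gapped alignment has 18 matches, 2 mismatches and 7 gap columns.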

Why align sequences – continuation
The draft human genome is available.
Automated gene finding is possible.
Gene: AGTACGTATCGTATAGCGTAA
What does it do?
One approach: is there a similar gene in another species?
Align the sequence with known genes.
Find the gene with the "best" match.

Flavors of sequence alignment
pair-wise alignment × multiple sequence alignment

Flavors of sequence alignment
global alignment × local alignment
Global: align the entire sequence.
Local: stretches of sequence with the highest density of matches are aligned, generating islands of matches or subalignments in the aligned sequences.
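The difference between the two flavors shows up clearly in the dynamic-programming score. The slide names no algorithm; the sketch below uses the classic textbook pair, Needleman-Wunsch for global and Smith-Waterman for local alignment, with a simple assumed scoring scheme (match +1, mismatch -1, gap -1). It computes only the optimal score, not the alignment itself, and the function name is illustrative.

```python
def alignment_score(a: str, b: str, local: bool = False,
                    match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Optimal global (Needleman-Wunsch) or local (Smith-Waterman) score."""
    # First row: leading gaps cost nothing locally, j * gap globally.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


if __name__ == "__main__":
    a, b = "GGGGACGTCCCC", "TTACGTAA"
    print(alignment_score(a, b))              # global: flanks are penalized
    print(alignment_score(a, b, local=True))  # local: only the ACGT island counts
```

The two toy sequences share only the island ACGT. The local score rewards that island alone, while the global score is dragged down by the unrelated flanks that must also be aligned.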

Evolution
[Figure: common ancestors; source: wikipedia.org]

Evolution of sequences
The sequences are the products of molecular evolution.
When sequences share a common ancestor, they tend to exhibit similarity in their sequences, structures and biological functions.
[Diagram: DNA1 and DNA2, Protein1 and Protein2; sequence similarity, similar 3D structure, similar function]
Similar sequences produce similar proteins.
However, this statement is not a rule. See Gerlt JA, Babbitt PC. Can sequence determine function? Genome Biol. 2000;1(5) PMID:

Homology
Over time, molecular sequences undergo random changes, some of which are selected during the process of evolution.
Selected sequences accumulate mutations and diverge over time.
Two sequences are homologous when they are descended from a common ancestor sequence.
Traces of evolution may still remain in certain portions of the sequences, allowing the common ancestry to be identified.
Residues performing key roles are preserved by natural selection; less crucial residues mutate more frequently.

Orthology, paralogy I
Orthologs: homologous proteins from different species that possess the same function (e.g. corresponding kinases in a signal transduction pathway in humans and mice)
Paralogs: homologous proteins that have different functions in the same species (e.g. two kinases in different signal transduction pathways of humans)
However, these terms are controversial: Jensen RA. Orthologs and paralogs - we need to get it right. Genome Biol. 2001;2(8), PMID: and references therein

Orthology, paralogy II
Orthologs: genes separated by a speciation event. The sequences are direct descendants of a common ancestor. They most likely have similar domain structure, 3D structure and biological function.
Paralogs: genes separated by a gene duplication event.
Gene duplication: an extra copy of a gene. Gene duplication is a key mechanism in evolution. Once a gene is duplicated, the identical genes can undergo changes and diverge to create two different genes.

Gene duplication
1. Unequal cross-over
2. Entire chromosome is replicated twice. This error will result in one of the daughter cells having an extra copy of the chromosome. If this cell fuses with another cell during reproduction, it may or may not result in a viable zygote.
3. Retrotransposition. Sequences of DNA are copied to RNA and then back to DNA instead of being translated into proteins, resulting in extra copies of DNA being present within the cell.

Unequal cross-over
Homologous chromosomes are misaligned during meiosis.
The probability of misalignment depends on how many repetitive elements the chromosomes share.

Comparing sequences through alignment
Patterns of conservation and variation can be identified.
The degree of sequence conservation in the alignment reveals the evolutionary relatedness of different sequences.
The variation between sequences reflects the changes that have occurred during evolution in the form of substitutions and/or indels.
Identifying the evolutionary relationships between sequences helps to characterize the function of unknown sequences.
Protein sequence comparison can identify homologous sequences that shared a common ancestor 1 billion years ago (BYA); DNA sequence comparison typically reaches back only 600 million years (MYA).