Last lecture summary.


Last lecture summary

Sequencing strategies: hierarchical genome shotgun (HGS), used by the Human Genome Project. “Map first, sequence second”; clone-by-clone … cloning is performed twice (BAC, plasmid).

Sequencing strategies: whole genome shotgun (WGS), used by Celera. Shotgun, no mapping. Coverage: the average number of reads representing a given nucleotide in the reconstructed sequence. HGS: 8, WGS: 20.
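As a rough illustration of the coverage definition, average coverage can be estimated as (number of reads × read length) / genome size. The sketch below assumes idealized reads of equal length spread evenly over the genome; the function names are invented for this example, and the 600 bp read length and 3 billion bp genome size are the round figures quoted elsewhere in this lecture.

```python
import math


def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average coverage: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    genome = 3_000_000_000  # human genome, about 3 billion bp
    for label, target in (("HGS", 8), ("WGS", 20)):
        print(label, reads_needed(target, 600, genome), "reads of 600 bp")
```

Real projects need more reads than this estimate, because reads are not evenly distributed and low-quality ends are trimmed.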

Genome assembly: reads, contigs, scaffolds; base calling, sequence assembly, PHRED/PHRAP.

Human genome: 3 billion bp, ~20 000 – 25 000 genes. Only 1.1 – 1.4 % of the genome sequence codes for proteins. State of completion: the best estimate is that 92.3% is complete. Problematic unfinished regions: centromeres and telomeres (both contain highly repetitive sequences) and some unclosed gaps. It is likely that the centromeres and telomeres will remain unsequenced until new technology is developed. The genome is stored in databases. Primary database: GenBank (http://www.ncbi.nlm.nih.gov/sites/entrez?db=nucleotide). Additional data and annotation, tools for visualizing and searching: UCSC (http://genome.ucsc.edu), Ensembl (http://www.ensembl.org).

New stuff

Next-generation sequencing (NGS). The completion of the human genome was just the start of the modern DNA sequencing era: “high-throughput next-generation sequencing” (NGS). New approaches reduce time and cost. The Holy Grail of sequencing: a complete human genome for less than $1000. Archon X Prize (http://genomics.xprize.org/): a $10 million prize is to be awarded to the private company that is able to sequence 100 human genomes within 10 days at a cost of no more than $10 000 per genome.

1st and 2nd generation of sequencers. 1st generation: ABI Prism 3700 (Sanger, fluorescence, 96 capillaries), used in the HGP and at Celera. The Sanger method still outperforms NGS in read length (600 bp). 2nd generation: birth of HT-NGS in 2005, when 454 Life Sciences developed the GS 20 sequencer, which combines PCR with pyrosequencing. Pyrosequencing is sequencing-by-synthesis: it relies on detection of pyrophosphate release on nucleotide incorporation rather than on chain termination with ddNTPs. The release of pyrophosphate is detected as a flash of light (chemiluminescence). Average read length: 400 bp. The Roche GS-FLX 454 (successor of the GS 20) was used to sequence J. Watson’s genome.

Pyrosequencing: "sequencing by synthesis" involves taking a single strand of the DNA to be sequenced and then synthesizing its complementary strand enzymatically. The pyrosequencing method is based on detecting the activity of DNA polymerase (a DNA-synthesizing enzyme) with another, chemiluminescent enzyme. Essentially, the method allows sequencing of a single strand of DNA by synthesizing the complementary strand along it, one base at a time, and detecting which base was actually added at each step. The template DNA is immobile, and solutions of A, C, G, and T nucleotides are sequentially added to and removed from the reaction. Light is produced only when the nucleotide solution complements the first unpaired base of the template. The sequence of solutions that produce chemiluminescent signals allows the sequence of the template to be determined.

Based on an excellent and up-to-date review: Pareek CS, Smoczynski R, Tretyn A. Sequencing technologies and genome sequencing. J Appl Genet. 2011 Jun 23. PubMed PMID: 21698376.
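The dispensing cycle described in the pyrosequencing notes can be mimicked with a toy simulation. This is only a sketch under simplifying assumptions: the template is given in the order the polymerase reads it, the dispensing order "TACG" is an arbitrary choice, there is no noise, and the light signal is taken to be proportional to the number of identical bases incorporated in one flow. The function names are invented for this example.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flowgram(template, flow_order="TACG", max_flows=1000):
    """Simulate nucleotide flows over a template.

    Returns a list of (nucleotide, signal) pairs, where signal is the number
    of bases incorporated during that flow (0 means no light).
    """
    flows = []
    pos = 0  # index of the first unpaired template base
    i = 0
    while pos < len(template) and i < max_flows:
        nucleotide = flow_order[i % len(flow_order)]
        signal = 0
        # Incorporate while the dispensed nucleotide complements the template.
        while pos < len(template) and COMPLEMENT[template[pos]] == nucleotide:
            signal += 1
            pos += 1
        flows.append((nucleotide, signal))
        i += 1
    return flows


def read_from_flowgram(flows):
    """Reconstruct the synthesized (complementary) strand from the flows."""
    return "".join(nucleotide * signal for nucleotide, signal in flows)


if __name__ == "__main__":
    flows = flowgram("ATTGC")
    print(flows)                      # [('T', 1), ('A', 2), ('C', 1), ('G', 1)]
    print(read_from_flowgram(flows))  # TAACG
```

The second flow shows why runs of identical bases are the weak point of the method: a run is read from the strength of a single light signal rather than from separate flashes.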

3rd generation. The 2nd generation still uses PCR amplification, which may introduce base sequence errors or favor certain sequences over others. To overcome this, the emerging 3rd generation of sequencers performs single-molecule sequencing (i.e. the sequence is determined directly from one DNA molecule, with no amplification or cloning). Compared to the 2nd generation, these instruments offer higher throughput, longer read lengths (~1000 bp), higher accuracy, a smaller amount of starting material, and lower cost.

NHGRI costs: transition to the 2nd generation. [Figure: NHGRI sequencing cost graphs compared with a hypothetical Moore's Law curve; value marked on the graphs: $0.19. Source: http://www.genome.gov/27541954] The National Human Genome Research Institute (NHGRI) tracks the costs associated with sequencing. Moore's Law: doubling of 'compute power' every two years; the Moore's Law curve shown here is hypothetical. In both graphs, note: (1) the use of a logarithmic scale on the Y axis; and (2) the sudden and profound out-pacing of Moore's Law beginning in January 2008. The latter represents the time when the sequencing centers transitioned from Sanger-based (dideoxy chain termination) sequencing to 'second generation' (or 'next-generation') DNA sequencing technologies.

Which genomes have been sequenced? http://www.ncbi.nlm.nih.gov/sites/genome. GOLD, the Genomes OnLine Database (http://www.genomesonline.org/): information regarding complete and ongoing genome projects.

Important genomics projects. The analysis of personal genomes has demonstrated how difficult it is to draw medically or biologically relevant conclusions from individual sequences. More genomes need to be sequenced to learn how genotype correlates with phenotype. The 1000 Genomes project (http://www.1000genomes.org/) started in 2009. Its goal is to sequence the genomes of at least 1000 people from around the world to create a detailed and medically useful picture of human genetic variation. The 2nd generation of sequencers is used in 1000 Genomes. 10 000 Genomes will start soon.

Important genomics projects. The ENCODE project (ENCyclopedia Of DNA Elements, http://www.genome.gov/ENCODE/), run by NHGRI, aims to identify all functional elements in the human genome sequence. Defined regions of the human genome corresponding to 30 Mb (1%) have been selected. These regions serve as the foundation on which to test and evaluate the effectiveness and efficiency of a diverse set of methods and technologies for finding various functional elements in human DNA.

Rapid evolution of next-generation sequencing technologies. 2000: human genome working drafts; a data unit of approximately 10x coverage of the human genome took 10 years and cost about $3 billion. 2008: major genome centers can sequence the same number of base pairs every 4 days; 1000 Genomes project launched; world-wide capacity dramatically increasing. 2009: every 4 hours ($25,000). 2010: every 14 minutes ($5,000). An Illumina HiSeq2000 machine produces 200 gigabases per 8-day run.

cDNA: isolate mRNA from suitable cells and convert it to complementary DNA (cDNA) using the enzyme reverse transcriptase (+ DNA polymerase). cDNA contains only expressed genes: no intergenic regions and no introns (just exons). Because the desired gene sequences usually still represent only a tiny proportion of the total cDNA population, the cDNA fragments are amplified by cloning/PCR. cDNA library: a library is defined simply as a collection of different DNA sequences that have been incorporated into a vector.

ESTs: Expressed Sequence Tags. Their use was promoted by Craig Venter. At that time (1991) it was a revolutionary way of identifying genes. An EST is a short subsequence (200-800 bp) of a cDNA sequence. ESTs are unedited, randomly selected single-pass sequence reads derived from cDNA libraries. They can be generated from either the 5’ or the 3’ end. [Figure: mRNA, cDNA, 5’ ESTs, 3’ ESTs] Drawn from Nagaraj SH, Gasser RB, Ranganathan S. A hitchhiker's guide to expressed sequence tag (EST) analysis. Brief Bioinform. 2007 Jan;8(1):6-21. PubMed PMID: 16772268.

ESTs. ESTs and cDNA sequences provide direct evidence for all the sampled transcripts, and they are currently the most important resources for transcriptome exploration. ESTs/cDNA sequences cover the genes expressed in the given tissue of the given organism under the given conditions. Housekeeping genes: gene products required by the cell under all growth conditions (genes for DNA polymerase, RNA polymerase, rRNA, tRNA, …). Tissue-specific genes: different genes are expressed in the brain and in the liver, enzymes responding to a specific environmental condition such as DNA damage, …

ESTs vs. whole genome. Whole genome sequencing is still impractical and expensive for organisms with large genomes. Genome expansion, as a result of retrotransposon repeats, makes whole genome sequencing less attractive for plants such as maize. Transposons: sequences of DNA that can move (transpose) themselves to new positions within the genome. Retrotransposons: a subclass of transposons that can amplify themselves. They are ubiquitous (present everywhere) in eukaryotic organisms (45%-48% in mammals, 42% in human) and particularly abundant in plants (maize: 49-78%, wheat: 68%). Genome expansion: an increase in genome size, one of the elements of genome evolution.

EST properties. An individual raw EST carries negligible biological information; it is just a very short copy of mRNA. It is highly error-prone, especially at the ends; the overall sequence quality is usually significantly better in the middle. Nagaraj SH, Gasser RB, Ranganathan S. A hitchhiker's guide to expressed sequence tag (EST) analysis. Brief Bioinform. 2007 8(1):6-21. PMID: 16772268.

Problems in ESTs: redundancy; under-representation and over-representation of selected host transcripts (i.e. sequence bias); base calling errors (as high as 5%); contamination from vector sequences; repeats may pose problems; natural sequence variations (e.g. SNPs): how to distinguish them from sequencing artifacts?

ESTs on the web. Largest repository: dbEST (http://www.ncbi.nlm.nih.gov/dbEST/). As of 1 July 2011 it held 69 992 536 ESTs from more than 1 000 organisms (http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html). UniGene (http://www.ncbi.nlm.nih.gov/unigene) stores unique genes and represents a nonredundant set of gene-oriented clusters generated from ESTs.

EST analysis. [Figure: generic steps involved in EST analysis] The aim of the analysis: augment weak signals, build a consensus and, when a multitude of ESTs is analysed, reconstruct the transcriptome of the organism. Nagaraj SH, Gasser RB, Ranganathan S. A hitchhiker's guide to expressed sequence tag (EST) analysis. Brief Bioinform. 2007 8(1):6-21. PMID: 16772268.

EST preprocessing reduces the overall noise in EST data to improve the efficacy of subsequent analyses. Remove contaminating vector fragments: compare the ESTs with non-redundant vector databases (UniVec - http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html, EMVEC – http://www.ebi.ac.uk/Tools/sss/ncbiblast/vectors.html). Repeats must be detected and masked using RepeatMasker (http://www.repeatmasker.org/). Resources for EST pre-processing: page 12 in Nagaraj SH, Gasser RB, Ranganathan S. A hitchhiker's guide to expressed sequence tag (EST) analysis. Brief Bioinform. 2007 8(1):6-21. PMID: 16772268.

EST clustering. Collect overlapping ESTs from the same transcript of a single gene into a unique cluster to reduce redundancy. Clustering is based on sequence similarity. The different steps of EST clustering are described in detail in Ptitsyn A, Hide W. CLU: a new algorithm for EST clustering. BMC Bioinformatics. 2005; 6 Suppl 2:S3. PubMed PMID: 16026600. The maximum informative consensus sequence is generated by ‘assembling’ these clusters, each of which could represent a putative gene. This step serves to elongate the sequence length by culling information from several short EST sequences simultaneously. Sequence clustering and assembly: CAP3.
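To make the idea of similarity-based clustering concrete, here is a deliberately naive sketch: two ESTs are put in the same cluster if they share at least one exact k-mer, and clusters are merged transitively with a small union-find. This is not how CLU or CAP3 work; the function names, the value of k and the example sequences are all assumptions made for illustration.

```python
from itertools import combinations


def kmers(seq, k):
    """Set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_by_shared_kmers(seqs, k=8):
    """Group sequence indices; sequences sharing any k-mer end up together."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the cluster representative (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    kmer_sets = [kmers(s, k) for s in seqs]
    for i, j in combinations(range(len(seqs)), 2):
        if kmer_sets[i] & kmer_sets[j]:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    # Made-up toy reads: the first two overlap, the third is unrelated.
    reads = ["AAACCCGGGTTT", "CCGGGTTTACGA", "GATTACAGATTACA"]
    print(cluster_by_shared_kmers(reads, k=5))  # [[0, 1], [2]]
```

A single shared k-mer is a weak criterion: repeats and vector contamination would glue unrelated ESTs together, which is one reason the preprocessing step comes first.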

Functional annotations. Database similarity searches (BLAST) are subsequently performed against relevant DNA databases, and a possible function is assigned to each query sequence if significant database matches are found. Additionally, a consensus sequence can be conceptually translated into a putative peptide and then compared with protein sequence databases. Protein-centric functional annotation, including domain and motif analysis, can be carried out using protein analysis tools.
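Conceptual translation simply reads the nucleotide sequence three bases at a time and looks each codon up in the genetic code. A minimal sketch, assuming the standard genetic code and a known reading frame (real pipelines usually try all six frames); the function name and the use of "X" for unknown codons are choices made for this example. The input below is the example gene string from the "Why align sequences" slide.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate a DNA sequence in the given forward frame (0, 1 or 2)."""
    dna = dna.upper()
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        # Codons containing anything other than A, C, G, T become 'X'.
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


if __name__ == "__main__":
    print(translate("AGTACGTATCGTATAGCGTAA"))  # STYRIA*
```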

EST analysis pipelines. Large-scale sequencing projects (thousands of ESTs generated daily) store, organize and annotate EST data in an automatic pipeline. Database of raw chromatograms → clean, cluster, assemble, generate consensus, translate, assign putative function based on various DNA/protein similarity searches. Examples: TGI Clustering tools (TGICL), http://compbio.dfci.harvard.edu/tgi/software/; PartiGene, http://nebc.nerc.ac.uk/tools/other-tools/est

Sequence Alignment

What is sequence alignment? Fragment overlaps.

Exact overlap:
CTTTTCAAGGCTTA
        GGCTTATTATTGC

Approximate overlap, two possible alignments (the alignment with an inserted gap is the better one):
CTTTTCAAGGCTTA
        GGCTATTATTGC

CTTTTCAAGGCTTA
        GGCT-ATTATTGC

Note: sequence alignment is introduced here through examples in which it already appeared in the preceding material.
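Finding an exact fragment overlap amounts to finding the longest suffix of one fragment that equals a prefix of the other. A brute-force sketch (the function name and the `min_len` parameter are assumptions for this example) applied to the two fragment pairs from the slide:

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no exact overlap of at least min_len characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


if __name__ == "__main__":
    print(overlap("CTTTTCAAGGCTTA", "GGCTTATTATTGC"))  # 6  (GGCTTA)
    print(overlap("CTTTTCAAGGCTTA", "GGCTATTATTGC"))   # 0  (no exact overlap)
```

The second call returns 0 even though the fragments clearly belong together, which is exactly why approximate matching with gaps is needed.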

What is sequence alignment? “EST clustering”.

EST sequences:
CCCCATGGTGGCGGCAGGTGACAG
CATGGGGGAGGATGGGGACAGTCCGG
TTACCCCATGGTGGCGGCTTGGGAAACTT
TGGCGGCTCGGGACAGTCGCGCATAAT
CCATGGTGGTGGCTGGGGATAGTA
TGAGGCAGTCGCGCATAATTCCG

The same sequences arranged by their overlaps, with the consensus below:
   CCCCATGGTGGCGGCAGGTGACAG
      CATGGGGGAGGATGGGGACAGTCCGG
TTACCCCATGGTGGCGGCTTGGGAAACTT
           TGGCGGCTCGGGACAGTCGCGCATAAT
     CCATGGTGGTGGCTGGGGATAGTA
                   TGAGGCAGTCGCGCATAATTCCG
TTACCCCATGGTGGCGGCTGGGGACAGTCGCGCATAATTCCG   consensus

Note: this is only a demonstration alignment. The EST sequences are at the top; clustering leads to the consensus sequence, but the sequences first have to be arranged somehow like this. There are of course errors, frequently at the ends of the ESTs; two are shown (though the example contains more).
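Once the ESTs have been placed relative to each other, a consensus can be obtained by a majority vote in each column. The sketch below is one simple way to do this, not the method used by any particular assembler. The reads and the consensus are those on the slide; the offsets are an assumption, inferred by sliding each read along the slide's consensus, and the function name is invented.

```python
from collections import Counter

# (offset, read) pairs: offset is the assumed start column of each EST.
PLACED_ESTS = [
    (3, "CCCCATGGTGGCGGCAGGTGACAG"),
    (6, "CATGGGGGAGGATGGGGACAGTCCGG"),
    (0, "TTACCCCATGGTGGCGGCTTGGGAAACTT"),
    (11, "TGGCGGCTCGGGACAGTCGCGCATAAT"),
    (5, "CCATGGTGGTGGCTGGGGATAGTA"),
    (19, "TGAGGCAGTCGCGCATAATTCCG"),
]


def consensus(placed_reads):
    """Majority base per column; 'N' where no read covers the column."""
    length = max(offset + len(read) for offset, read in placed_reads)
    columns = [Counter() for _ in range(length)]
    for offset, read in placed_reads:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return "".join(
        column.most_common(1)[0][0] if column else "N" for column in columns
    )


if __name__ == "__main__":
    print(consensus(PLACED_ESTS))
    # TTACCCCATGGTGGCGGCTGGGGACAGTCGCGCATAATTCCG
```

With these offsets the vote reproduces the consensus on the slide: the errors sit in different reads at different columns, so the correct base always wins. This is the sense in which clustering "augments weak signals".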

Sequence alphabet

Nucleotides: Adenine A, Thymine T, Cytosine C, Guanine G

Amino acids, grouped by side chain charge at physiological pH 7.4

Name             3 letters   1 letter
Positively charged side chains
Arginine         Arg         R
Histidine        His         H
Lysine           Lys         K
Negatively charged side chains
Aspartic Acid    Asp         D
Glutamic Acid    Glu         E
Polar uncharged side chains
Serine           Ser         S
Threonine        Thr         T
Asparagine       Asn         N
Glutamine        Gln         Q
Special
Cysteine         Cys         C
Selenocysteine   Sec         U
Glycine          Gly         G
Proline          Pro         P
Hydrophobic side chains
Alanine          Ala         A
Leucine          Leu         L
Isoleucine       Ile         I
Methionine       Met         M
Phenylalanine    Phe         F
Tryptophan       Trp         W
Tyrosine         Tyr         Y
Valine           Val         V

Sequence alignment: the procedure of comparing sequences.

Point mutations are easy (gapless alignment):
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT

A more difficult example:
ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT

However, gaps can be inserted to get something like this (gapped alignment):
ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT

Gaps correspond to an insertion in one sequence or a deletion in the other (an indel). When comparing two genes it is generally impossible to tell whether an indel is an insertion in one gene or a deletion in the other, unless ancestry is known:
ACGTCTGATACGCCGTATCGTCTATCT
ACGTCTGAT---CCGTATCGTCTATCT
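A given alignment can be summarized by counting its matching columns, mismatching columns and gap columns. A small sketch (the function name is made up; it expects two already aligned rows of equal length with "-" as the gap symbol), applied to the alignments from this slide:

```python
def alignment_stats(row_a, row_b, gap="-"):
    """Return (matches, mismatches, gaps) for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


if __name__ == "__main__":
    # Gapless alignment with point mutations.
    print(alignment_stats("ACGTCTGATACGCCGTATAGTCTATCT",
                          "ACGTCTGATTCGCCCTATCGTCTATCT"))  # (24, 3, 0)
    # Gapped alignment of the more difficult example.
    print(alignment_stats("ACGTCTGATACGCCGTATAGTCTATCT",
                          "----CTGATTCGC---ATCGTCTATCT"))  # (18, 2, 7)
```

Turning these three counts into a single number requires a scoring scheme, i.e. a decision on how much a match is worth and how heavily mismatches and gaps are penalized.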

Why align sequences (continued). The draft human genome is available and automated gene finding is possible. Gene: AGTACGTATCGTATAGCGTAA. What does it do? One approach: is there a similar gene in another species? Align the sequence with known genes and find the gene with the “best” match.

Flavors of sequence alignment: pairwise alignment × multiple sequence alignment.

Flavors of sequence alignment: global alignment × local alignment. Global: the entire sequences are aligned. Local: stretches of sequence with the highest density of matches are aligned, generating islands of matches or subalignments in the aligned sequences. Sequences that are quite similar and approximately the same length are suitable candidates for global alignment. Local alignments are more suitable for aligning sequences that are similar along some of their lengths but dissimilar in others, sequences that differ in length, or sequences that share a conserved region or domain.
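The difference between the two flavors shows up clearly in the standard dynamic-programming formulations (Needleman-Wunsch for global, Smith-Waterman for local alignment), which the slide does not name. The sketch below computes only the optimal score, not the alignment itself, and assumes a very simple scoring scheme (match +1, mismatch -1, gap -1) chosen purely for illustration; the function names are invented.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score (Needleman-Wunsch, linear gap cost)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal local alignment score (Smith-Waterman, linear gap cost)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may start anywhere, so scores never drop below 0.
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


if __name__ == "__main__":
    print(global_score("ACGT", "AGT"))             # 2: three matches, one gap
    print(local_score("TTTACGTTT", "GGACGGG"))     # 3: the shared island ACG
```

In the second call the flanking T and G runs are simply ignored by the local alignment, whereas a global alignment would have to pay for every one of them.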

Protein vs. DNA sequences. Given the choice of aligning DNA or protein, it is often more informative to compare protein sequences. There are several reasons for this. Many changes in DNA do not change the amino acid that is specified; in particular, a change at the 3rd codon position often leaves the encoded amino acid unchanged. Many amino acids share related biophysical properties: though these amino acids are not identical, they can be more easily substituted for each other. These relationships can be accounted for using scoring systems. When is it appropriate to compare nucleotide sequences? When confirming the identity of a DNA sequence in a database search, searching for polymorphisms, or confirming the identity of a cloned cDNA. When a coding nucleotide sequence is analyzed, it is usually preferable to study the protein sequence.
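The claim about the 3rd codon position can be checked directly against the genetic code. The sketch below assumes the standard genetic code and counts, over all 64 codons, how many of the possible single-base changes at the 3rd position leave the encoded amino acid (or stop signal) unchanged. The function name is made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_third_position_changes():
    """Return (synonymous, total) counts of single-base 3rd-position changes."""
    same = total = 0
    for codon, aa in CODON_TABLE.items():
        for base in BASES:
            if base == codon[2]:
                continue  # not a change
            total += 1
            if CODON_TABLE[codon[:2] + base] == aa:
                same += 1
    return same, total


if __name__ == "__main__":
    same, total = synonymous_third_position_changes()
    print(f"{same} of {total} third-position changes are silent")
    # All four GCx codons encode alanine:
    print({CODON_TABLE["GC" + b] for b in BASES})  # {'A'}
```

About two thirds of the third-position changes are silent, so two DNA sequences can differ noticeably while encoding the same protein.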

Evolution of sequences. The sequences are the products of molecular evolution. When sequences share a common ancestor, they tend to exhibit similarity in their sequences, structures and biological functions. [Figure: DNA1, DNA2, Protein1, Protein2] Sequence similarity → similar 3D structure → similar function. Similar sequences produce similar proteins: this is probably the most powerful idea of bioinformatics, because it enables us to make predictions. Often little is known about the function of a new sequence from a genome sequencing program, but if similar sequences can be found in a database for which functional or structural information is available, then this can be used as the basis of a prediction of function or structure for the new sequence. However, this statement is not a rule. See Gerlt JA, Babbitt PC. Can sequence determine function? Genome Biol. 2000;1(5) PMID: 11178260