DNA Sequence Analysis


5.1 Introduction
1. Terms in common use are defined, and the genetic code is reviewed.
2. The Expressed Sequence Tag (EST) is introduced as a unit of sequence data, derived from rapid sequencing of cDNA libraries.
3. Three examples of producers of EST databases are profiled.

5.2 Why analyse DNA?
The most sensitive comparisons between sequences are made at the protein level. Distantly related sequences are easier to detect in protein translation, because the 64 codons of the genetic code are reduced to 20 distinct amino acids. However, the loss of degeneracy at this level is accompanied by a loss of information about the evolutionary process, because proteins are a functional abstraction of genetic events in DNA.

Table 5.1 The Genetic Code

Box 5.1 Family Analysis at DNA Level

5.3 Gene structure and DNA sequences
1. DNA sequence databases contain genomic sequence data, which includes information at the level of the untranslated sequence, introns and exons, mRNA, cDNA, and translations.
2. Untranslated regions (UTRs) occur in both DNA and RNA; they are the portions of the sequence flanking the CDS that are not translated into protein. The 3' UTR is highly specific both to the gene and to the species from which the sequence is derived.

Box 5.2 The Central Dogma

3. Six-frame translation: There are three forward frames, obtained by beginning to translate at the first, second and third bases respectively. The three reverse frames are obtained by taking the reverse complement of the DNA sequence and again beginning at the first, second and third bases. Thus, for any piece of DNA, the result of a six-frame translation is six potential protein sequences.
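The six-frame translation described above can be sketched in a few lines of Python. This is only an illustration: it assumes the standard genetic code, the function names (`reverse_complement`, `translate`, `six_frame_translation`) are made up for this example, and any codon containing an ambiguous base is written as `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from the first base; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Return the three forward (+1..+3) and three reverse (-1..-3) translations."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in sorted(six_frame_translation("ATGGCCTAA").items()):
        print(name, protein)
```

Trailing bases that do not fill a complete codon are ignored in each frame.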

Fig. 5.1 Six-Frame Translation

5.4 Features of DNA sequence analysis
1. Detecting open reading frames (ORFs):
   Initiation codon: ATG
   Stop codons: TGA, TAA, TAG
2. Several features may be used as indicators of potential protein coding regions in DNA:
   a. Sufficient ORF length
   b. Recognition of a flanking Kozak sequence
   c. Patterns of codon usage
   d. A general preference for G/C over A/T in the third base (wobble) position of a codon
   e. Ribosome binding sites
   f. Alignment with a homologous protein sequence
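As a rough illustration of ORF detection with a length threshold, the sketch below scans the three forward frames for an ATG followed in-frame by a stop codon. The function name `find_orfs` and the `min_codons` parameter are assumptions for this example; a real gene finder would also scan the reverse strand and weigh the other indicators listed above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Find ORFs (ATG ... stop) in the three forward frames.

    Returns (start, end, frame) tuples with 0-based start, exclusive end,
    and frame numbered 1-3. min_codons counts the start and stop codons.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame + 1))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGGCCTGATT"))
```

Raising `min_codons` is the simplest way to apply the "sufficient ORF length" criterion, since short ORFs arise by chance in any sequence.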

Table 5.2 Percentage use of codons for serine in a variety of model organisms
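A table of this kind can be derived from any in-frame coding sequence by counting codons. The sketch below is a minimal example, assuming the input is a CDS read from its first base; the function name `serine_codon_usage` is hypothetical.

```python
from collections import Counter

# The six codons that encode serine in the standard genetic code.
SERINE_CODONS = ("TCT", "TCC", "TCA", "TCG", "AGT", "AGC")


def serine_codon_usage(cds):
    """Return the percentage use of each serine codon in an in-frame CDS."""
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts[codon] for codon in SERINE_CODONS)
    if total == 0:
        return {codon: 0.0 for codon in SERINE_CODONS}
    return {codon: 100.0 * counts[codon] / total for codon in SERINE_CODONS}


if __name__ == "__main__":
    for codon, percent in serine_codon_usage("TCTTCTAGCATG").items():
        print(codon, round(percent, 1))
```

The same counting approach extends to any amino acid by swapping in its set of synonymous codons.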

3. DNA sequence assembly: The rapid accumulation of DNA sequence data has been expedited by the introduction of fluorescent sequencing technology. The output consists of a series of color-coded peaks, beneath which is a string of base symbols; the base shown at each position is the one with the highest peak at that position of the trace.
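The "highest peak wins" rule can be illustrated with a toy base caller. This is a sketch under stated assumptions, not how any particular sequencing software works: the trace is assumed to be a list of per-position intensity dictionaries, and the `min_ratio` threshold for emitting `N` when two peaks are too close is an invented parameter.

```python
def call_bases(trace, min_ratio=1.5):
    """Call one base per trace position from four channel intensities.

    trace: list of dicts mapping 'A', 'C', 'G', 'T' to peak heights.
    A position is called 'N' unless the highest peak is at least
    min_ratio times the second highest (min_ratio is an assumed cutoff).
    """
    called = []
    for peaks in trace:
        ranked = sorted(peaks.items(), key=lambda item: item[1], reverse=True)
        best_base, best_height = ranked[0]
        second_height = ranked[1][1]
        if best_height > 0 and best_height >= min_ratio * second_height:
            called.append(best_base)
        else:
            called.append("N")  # ambiguous position
    return "".join(called)


if __name__ == "__main__":
    example_trace = [
        {"A": 900, "C": 50, "G": 30, "T": 20},
        {"A": 10, "C": 400, "G": 380, "T": 5},
        {"A": 0, "C": 20, "G": 15, "T": 700},
    ]
    print(call_bases(example_trace))
```

Positions where two peaks overlap are one source of the `N` symbols that appear in single-pass reads.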

Box 5.3 Fluorescent sequence chromatogram interpretation

5.5 Issues in the interpretation of EST searches
1. A large part of currently available DNA data is made up of partial sequences, the majority of which are Expressed Sequence Tags (ESTs).
2. In analyzing ESTs the following points should be borne in mind:
   The EST alphabet is five characters: A, C, G, T and N.
   There may be phantom INDELs resulting in translation frameshifts.
   The EST will often be a sub-sequence of another sequence in the databases.
   The EST may not represent part of the CDS of any gene.
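A simple pre-search check on an EST follows from the five-character alphabet. The sketch below is illustrative only: the function name `check_est` and the 5% cutoff on the fraction of `N` symbols are assumptions, not values taken from the slides.

```python
EST_ALPHABET = set("ACGTN")


def check_est(seq, max_n_fraction=0.05):
    """Report symbols outside the EST alphabet and the fraction of N calls.

    max_n_fraction is an assumed quality cutoff for this example.
    """
    seq = seq.upper()
    invalid = sorted(set(seq) - EST_ALPHABET)
    n_fraction = seq.count("N") / len(seq) if seq else 0.0
    usable = bool(seq) and not invalid and n_fraction <= max_n_fraction
    return {"invalid_symbols": invalid, "n_fraction": n_fraction, "usable": usable}


if __name__ == "__main__":
    print(check_est("ACGTNACGTA"))
    print(check_est("ACGTRACGT"))
```

A check like this does not detect phantom INDELs; those usually show up only when a translated search produces a match that switches frame part-way through.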

3. The EST alphabet

4. The existence of splice variants has particular consequences for database searches with EST queries.

5.6 Two approaches to gene hunting
Positional cloning: The chromosome linked to the disease in question is established by analyzing a population of subjects. Once a link to a chromosomal region has been established, a large part of the chromosome in the vicinity of this region (locus) is sequenced, yielding several megabases of DNA. Such a locus can contain many individual genes, only one of which is likely to be involved in the disease.

Ultimately, several genes will need to be expressed, and further experimentation will be required to confirm which gene is actually involved in the disease. Although genes discovered in this way can be illuminating from an academic point of view, they do not necessarily represent good drug targets. The whole process is lengthy and labor intensive.

RNA transcript analysis: This approach, which requires much less sequencing effort and relies more heavily on the powerful search capabilities of current computer systems, examines the genes that are actually expressed in healthy and diseased tissue. The mRNA from the two states is analysed and compared, and a process of reasoning is applied to arrive at a potential drug target in a more direct way.

The hierarchy of genomic information: The human genome is complex, consisting of about 3 billion base pairs of DNA, yet only about 3% of the DNA is coding sequence. Thus, in simple terms, we have three levels of genomic information:
1. The chromosomal genome: the genetic information common to every cell in the organism.
2. The expressed genome: the part of the genome that is expressed in a cell at a specific stage in its development.
3. The proteome: the protein molecules that interact to give the cell its individual character.

5.7 cDNA libraries and ESTs
1. Obtain a sample of cells.
2. Extract the RNA.
3. Reverse transcribe the RNA to cDNA.
4. Construct a cDNA library.
5. Sequence clones from the library.
The sequences that emerge successfully from this process are called ESTs. Good libraries contain at least 1 million clones, whereas the actual number of distinct genes expressed in a cell may be only a few thousand; the number varies according to cell type.

5.8 Different approaches to EST analysis
There are three major sources of EST information. Much of the publicly available data is collected together into the EST sections of the EMBL Data Library and GenBank (dbEST).
1. Merck/IMAGE: In 1994, Merck & Co. funded a research project to sequence 300,000 ESTs from a variety of normalised libraries. As of May 1997, 484,421 ESTs had been submitted by the project to dbEST (Table 5.4).
2. Incyte: Incyte Pharmaceuticals Inc. produces a database, LifeSeq, emphasizing the quantitative information derived by sequencing standard cDNA libraries. The goal is to provide information on transcribed genes in healthy and diseased tissues, to facilitate the elucidation of potential therapeutic targets. In April 1998, the size of LifeSeq was 2.5 million ESTs, representing 80,000-120,000 different genes.
3. TIGR: The Institute for Genomic Research is a research organization with interests in structural, functional and comparative analysis of genomes and gene products.

TIGR Human Gene Index(HGI)

5.9 EST analysis tools
There are three kinds of publicly available tools for the analysis of ESTs:
1. Sequence similarity search tools: The BLAST series of programs has variants that will translate DNA databases (TBLASTN), translate the input sequence (BLASTX), or both (TBLASTX). FastA provides a similar suite of options.
2. Sequence assembly tools: When a search of the databases reveals several ESTs matching a probe sequence, the ESTs must be aligned with each other to reveal the consensus sequence.
3. Sequence clustering tools: Programs that take a large set of sequences and divide them into subsets, or clusters, based on the extent of shared sequence identity in a minimum overlap region. A reliable mechanism for clustering ESTs will reduce redundancy in the dataset and save search time.
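The clustering idea can be sketched with a deliberately simplified rule: two ESTs join the same cluster if they share an exact substring of at least `min_overlap` bases. This stands in for the alignment-based identity measure that real clustering tools use; the function name `cluster_ests` and the parameter are assumptions for this example, and only the forward strand is considered.

```python
def cluster_ests(ests, min_overlap=20):
    """Group ESTs that share an exact substring of min_overlap bases.

    Uses single-linkage clustering via union-find. Returns a sorted list
    of clusters, each a list of indices into the input list.
    """
    parent = list(range(len(ests)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first EST that contained it
    for idx, seq in enumerate(ests):
        seq = seq.upper()
        for pos in range(len(seq) - min_overlap + 1):
            kmer = seq[pos:pos + min_overlap]
            if "N" in kmer:
                continue  # ambiguous bases cannot support a link
            if kmer in first_seen:
                parent[find(idx)] = find(first_seen[kmer])
            else:
                first_seen[kmer] = idx

    clusters = {}
    for idx in range(len(ests)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())


if __name__ == "__main__":
    reads = ["ACGTACGTGG", "GTACGTGGAA", "TTTTTTCCCC"]
    print(cluster_ests(reads, min_overlap=6))
```

Because linkage is transitive, one chimeric or repeat-containing EST can merge unrelated clusters, which is one reason production tools add identity thresholds and repeat masking.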

Clustering an EST library

5.10 A practical example of EST analysis