Gene architecture and sequence annotation Week 2
Last week: how to search genomic databases such as NCBI and Ensembl, and how to obtain sequence files
This week we will learn to identify genetic architecture within sequence files. Example: sequence of the cystic fibrosis gene, CFTR
This week we will learn the differences between the two types of nucleic acid sequences. Genomic: the sequence of nucleotides on a chromosome. Expressed sequences: the sequence of nucleotides in mRNA/cDNA
The expression of genomic information: DNA → RNA → protein. Bioinformatics and Functional Genomics, 2nd Edition. http://www.bioinfbook.org (2014).
DNA → RNA → protein corresponds to genome → transcriptome → proteome. Bioinformatics and Functional Genomics, 2nd Edition. http://www.bioinfbook.org (2014).
DNA → RNA → protein → phenotype. Databases at each level: genomic DNA databases (DNA); cDNA, ESTs, UniGene (RNA); protein sequence databases (protein). Bioinformatics and Functional Genomics, 2nd Edition. http://www.bioinfbook.org (2014).
Learning Objectives: (1) Understand sequence differences between genomic and expressed sequences. (2) Use programs to determine the correct open reading frame (ORF) of an expressed sequence. (3) Annotate sequence files.
Genomic DNA is one source of nucleic acid sequence Strachan, T. & Read, A.P. Human Molecular Genetics. (New York; Wiley-Liss, 1999).
The chemical properties of DNA are important for sequence analysis Cooper, G.M. The Cell: A Molecular Approach (Sunderland; Sinauer Associates, 2000).
DNA is composed of two anti-parallel strands. 5’ is the beginning of the sequence and 3’ is the end of the sequence. By convention, a DNA sequence is written with 5’ at the left side and 3’ at the right side. Strand 1: 5’ GAT… Strand 2: 5’ AGT… Cooper, G.M. The Cell: A Molecular Approach (Sunderland; Sinauer Associates, 2000).
DNA has strict base pairing rules that determine the sequence of the complementary strand Cooper, G.M. The Cell: A Molecular Approach (Sunderland; Sinauer Associates, 2000).
Transcription is the process of making RNA from a DNA template. Bioinformatics and Functional Genomics, 2nd Edition. http://www.bioinfbook.org (2014).
During transcription an RNA molecule is synthesized from genomic DNA. Cooper, G.M. The Cell: A Molecular Approach (Sunderland; Sinauer Associates, 2000).
RNA polymerase adds bases to the 3’ end of the growing RNA molecule Cooper, G.M. The Cell: A Molecular Approach (Sunderland; Sinauer Associates, 2000).
The rules of complementary base pairing are followed for RNA transcription. During RNA transcription, uracil (U) is incorporated instead of thymine (T). Uracil base pairs with adenine. In bioinformatics we ignore this fact: every U is written as T in sequence files. Cooper, G.M. The Cell: A Molecular Approach (Sunderland; Sinauer Associates, 2000).
The template strand is anti-parallel to the growing mRNA molecule. Template strand = antisense strand: it is read 3’ to 5’ while the mRNA grows 5’ to 3’. Non-template strand = sense strand: this strand has the same sequence as the mRNA molecule (with T in place of U). Cooper, G.M. The Cell: A Molecular Approach (Sunderland; Sinauer Associates, 2000).
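The relationship between the sense strand, the template strand and the mRNA can be checked with a few lines of Python. This is a minimal sketch, not a model of RNA polymerase: the 12-nucleotide sense strand and the function names are made up for illustration.

```python
# Base pairing used when RNA is built against a DNA template:
# template A pairs with U, C with G, G with C, T with A.
DNA_TO_RNA_PAIRING = str.maketrans("ACGT", "UGCA")
DNA_PAIRING = str.maketrans("ACGT", "TGCA")

def template_strand(sense: str) -> str:
    """Return the template (antisense) strand, written 5' to 3'."""
    return sense.translate(DNA_PAIRING)[::-1]

def transcribe(template: str) -> str:
    """Build the mRNA 5' to 3' by pairing against the template read 3' to 5'."""
    return template[::-1].translate(DNA_TO_RNA_PAIRING)

sense = "ATGGTCTTACGC"             # hypothetical sense strand, 5' to 3'
template = template_strand(sense)  # antisense strand, 5' to 3'
mrna = transcribe(template)

print(template)                # GCGTAAGACCAT
print(mrna)                    # AUGGUCUUACGC
print(mrna.replace("U", "T"))  # same letters as the sense strand
```

The last line shows why the sense strand is the one stored in sequence files: swapping U back to T recovers it exactly.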
Genes can be found on both strands of a chromosome: the forward strand and the reverse strand, each read from its own 5’ end
The original RNA molecule undergoes processing that changes the sequence Lodish, H. et al. Molecular Cell Biology (New York; W.H. Freeman, 2000).
The original RNA molecule is processed Exons are segments of DNA that are found in mature mRNA Lodish, H. et al. Molecular Cell Biology (New York; W.H. Freeman, 2000).
The original RNA molecule is processed Introns are segments of DNA that are removed through splicing. They are not found in mRNA Lodish, H. et al. Molecular Cell Biology (New York; W.H. Freeman, 2000).
The original RNA molecule is processed The sequence in red is the coding sequence (often abbreviated CDS) Lodish, H. et al. Molecular Cell Biology (New York; W.H. Freeman, 2000).
In the mRNA the exons are joined together as one continuous sequence Lodish, H. et al. Molecular Cell Biology (New York; W.H. Freeman, 2000).
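Joining exons into one continuous sequence is simple to sketch in Python. The genomic sequence below is a toy example (exons in upper case, introns in lower case), and the exon coordinates are hypothetical; they are given as 1-based, inclusive positions, the same way sequence records number nucleotides.

```python
def splice(genomic: str, exons: list[tuple[int, int]]) -> str:
    """Join exons given as 1-based, inclusive (start, end) positions."""
    return "".join(genomic[start - 1:end] for start, end in exons)

# Toy gene: three exons (upper case) separated by two introns (lower case).
genomic = "ATGGTC" + "gtaagtttttag" + "TTACGC" + "gtgagtccccag" + "TGGTAA"
exons = [(1, 6), (19, 24), (37, 42)]  # hypothetical exon coordinates

mature = splice(genomic, exons)
print(mature)  # ATGGTCTTACGCTGGTAA
```

Only upper-case (exonic) letters survive in the output, which mirrors the slide: the introns are gone and the exons form one uninterrupted sequence.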
Translation is the process by which an mRNA molecule is used to make a protein. +1 is the first translated nucleotide, usually the A of the ATG start codon (ATG = methionine)
Translation is the process by which an mRNA molecule is used to make a protein The red indicates all the sequence within the mRNA that will be used during translation to code for protein
The sequences within an mRNA that do not directly code for protein are called untranslated regions (UTRs). 5’ UTR: untranslated region before the start codon; does not code for protein. 3’ UTR: untranslated region after the stop codon; does not code for protein.
mRNA is converted to cDNA using reverse transcription Alberts, B. et al. Molecular Biology of the Cell (New York; Garland, 1994).
Because it is cDNA, not mRNA, that is sequenced, we use T, not U, in sequence files. Alberts, B. et al. Molecular Biology of the Cell (New York; Garland, 1994).
How do we identify introns/exons in our sequence files?
We will use KRAS as an example
The KRAS gene produces 4 transcripts (splice variants) Table
This is the transcript diagram for this gene region
The Transcript Diagram shows the organization of the transcripts generated from the gene locus
Use the link under the “Transcript ID” column to identify the exons and introns in a specific transcript
The exon/intron map for a specific transcript: lines are intronic sequence and bars are exonic sequence. Filled bars mean coding sequence and unfilled bars are UTR sequence
The exon/intron map for a specific transcript: the number of introns is always the number of exons minus 1. For example, 5 exons means 4 introns
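The "exons minus 1" rule follows from the fact that each intron sits between two neighbouring exons. A short Python sketch makes this concrete; the exon coordinates are invented for illustration and are not taken from any real transcript.

```python
def introns_between(exons: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Given sorted 1-based, inclusive exon (start, end) pairs,
    return the (start, end) of each intron lying between neighbours."""
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]

# Hypothetical transcript with 5 exons.
exons = [(1, 120), (501, 620), (901, 1000), (1501, 1650), (2001, 2300)]
introns = introns_between(exons)

print(len(exons), "exons,", len(introns), "introns")
print(introns)
```

Pairing each exon with the next one (`zip(exons, exons[1:])`) always yields one fewer pair than there are exons, which is the counting rule on the slide.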
The RefSeq link will direct you to the NCBI nucleotide record for that gene
NCBI nucleotide record
NCBI nucleotide record continued
NCBI nucleotide record also contains the sequence
Every nucleotide within the sequence has an exact position: each nucleotide has a number associated with its position
NCBI nucleotide contains the annotation of the sequence
The numbers refer to nucleotide positions
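Annotation positions count from 1 and include both end points, whereas Python slices count from 0 and exclude the end. The sketch below shows the conversion for a simple `start..end` location. The function names, the toy sequence and the `4..39` location are assumptions for illustration; compound locations such as `join(...)` or `complement(...)` are deliberately not handled.

```python
def parse_location(location: str) -> tuple[int, int]:
    """Parse a simple 'start..end' feature location into two integers."""
    start, end = location.split("..")
    return int(start), int(end)

def feature_sequence(sequence: str, location: str) -> str:
    """Return the nucleotides covered by a 1-based, inclusive location."""
    start, end = parse_location(location)
    # Python slices are 0-based and exclude the end index.
    return sequence[start - 1:end]

record = "CGCATGGTCTTACGCTGGAGCTCTCATGGATCGGTTTAA"  # toy sequence, 39 nt
cds = feature_sequence(record, "4..39")              # hypothetical CDS location

print(cds[:3], cds[-3:], len(cds))  # ATG TAA 36
```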
Viewing features within the sequence file
Once you select a sequence feature, the nucleotide sequence of the feature becomes highlighted
CDS stands for coding sequence; this feature also shows you the translation of the nucleotide sequence into an amino acid sequence
The genetic code DNA RNA protein Bioinformatics and Functional Genomics, 2nd Edition. http://www.bioinfbook.org (2014).
The genetic code is based on codons: three nucleotides “coding” for one amino acid. Korf, I., Yandell, M. & Bedell, J. BLAST: An Essential Guide to the Basic Local Alignment Search Tool (Sebastopol; O’Reilly, 2003).
An Open Reading Frame (ORF) begins with ATG and ends with TAA, TAG or TGA. Korf, I., Yandell, M. & Bedell, J. BLAST: An Essential Guide to the Basic Local Alignment Search Tool (Sebastopol; O’Reilly, 2003).
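The codon table itself fits in a few lines of Python. This sketch builds the standard genetic code from the conventional T, C, A, G ordering and translates a sequence codon by codon; the function name `translate` and the use of `*` for a stop codon are choices made here, not something prescribed by the slides. T is used instead of U, matching how sequence files are written.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TTT, TTC, TTA, TTG, TCT, ... order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(seq: str) -> str:
    """Translate complete codons from the first base; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3)
    )

print(CODON_TABLE["ATG"])                                  # M
print([c for c, aa in CODON_TABLE.items() if aa == "*"])   # the three stop codons
print(translate("ATGGTCTTACGCTGGAGCTCTCATGGATCGGTTTAA"))   # MVLRWSSHGSV*
```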
To find the coding sequence you must identify the start and stop codons within the sequence
Which start codon is right? The correct ORF is usually the one that gives the longest translated sequence
Any sequence has 6 possible reading frames: two strands of DNA × three starting positions in the triplet code (three nucleotides in a codon)
Any sequence has 6 possible reading frames. The first three are read from the sequence as written:
5’ CGCATGGTCTTACGCTGGAGCTCTCATGGATCGGTTTAA 3’
5’ CGC ATG GTC TTA CGC TGG AGC TCT CAT GGA TCG GTT TAA 3’ FRAME +1
5’ C GCA TGG TCT TAC GCT GGA GCT CTC ATG GAT CGG TTT AA 3’ FRAME +2
5’ CG CAT GGT CTT ACG CTG GAG CTC TCA TGG ATC GGT TTA A 3’ FRAME +3
The next three reading frames are based on the reverse complement sequence. Generating the reverse complement sequence:
5’ CGCATGGTCTTACGCTGGAGCTCTCATGGATCGGTTTAA 3’
3’ GCGTACCAGAATGCGACCTCGAGAGTACCTAGCCAAATT 5’ Complement Sequence
5’ TTAAACCGATCCATGAGAGCTCCAGCGTAAGACCATGCG 3’ Reverse Complement
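The two steps on this slide (pair every base, then reverse the result so it reads 5’ to 3’) can be reproduced with the slide's own sequence. A minimal Python sketch; the function names are chosen here for illustration.

```python
# Base pairing rules: A<->T and G<->C.
PAIRING = str.maketrans("ACGT", "TGCA")

def complement(seq: str) -> str:
    """Pair every base, keeping left-to-right order (so the result reads 3' to 5')."""
    return seq.translate(PAIRING)

def reverse_complement(seq: str) -> str:
    """Return the complementary strand written 5' to 3'."""
    return complement(seq)[::-1]

sequence = "CGCATGGTCTTACGCTGGAGCTCTCATGGATCGGTTTAA"
print(complement(sequence))          # GCGTACCAGAATGCGACCTCGAGAGTACCTAGCCAAATT
print(reverse_complement(sequence))  # TTAAACCGATCCATGAGAGCTCCAGCGTAAGACCATGCG
```

Applying `reverse_complement` twice returns the original sequence, a quick sanity check when working with either strand.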
The 6 possible reading frames: frames -1 to -3 are read from the reverse complement
5’ CGCATGGTCTTACGCTGGAGCTCTCATGGATCGGTTTAA 3’
3’ GCGTACCAGAATGCGACCTCGAGAGTACCTAGCCAAATT 5’ Complement Sequence
5’ TTAAACCGATCCATGAGAGCTCCAGCGTAAGACCATGCG 3’ Reverse Complement
5’ TTA AAC CGA TCC ATG AGA GCT CCA GCG TAA GAC CAT GCG 3’ FRAME -1
5’ T TAA ACC GAT CCA TGA GAG CTC CAG CGT AAG ACC ATG CG 3’ FRAME -2
5’ TT AAA CCG ATC CAT GAG AGC TCC AGC GTA AGA CCA TGC G 3’ FRAME -3
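All six frames can be generated programmatically by shifting the start position by 0, 1 or 2 bases on each strand. The sketch below is one way to do it in Python; the frame labels follow the slides, while the function names are made up here. Leftover bases that do not form a complete codon are dropped rather than shown.

```python
PAIRING = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the complementary strand written 5' to 3'."""
    return seq.translate(PAIRING)[::-1]

def codons(seq: str, offset: int) -> list[str]:
    """Split seq into complete codons, starting at a 0-based offset of 0, 1 or 2."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]

def six_frames(seq: str) -> dict[str, list[str]]:
    """Map frame labels (+1..+3, -1..-3) to their lists of complete codons."""
    frames = {}
    for sign, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = codons(strand, offset)
    return frames

example = "CGCATGGTCTTACGCTGGAGCTCTCATGGATCGGTTTAA"
for label, frame in six_frames(example).items():
    print(label, " ".join(frame))
```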
The correct reading frame will have the largest ORF
5’ CGCATGGTCTTACGCTGGAGCTCTCATGGATCGGTTTAA 3’
5’ CGC ATG GTC TTA CGC TGG AGC TCT CAT GGA TCG GTT TAA 3’ FRAME +1
5’     M   V   L   R   W   S   S   H   G   S   V   Ter 3’ (amino acids)
Always begins with ATG: ATG (M) is the start codon.
Always ends with a stop codon: TAA, TAG or TGA are the three stop codons; they do not code for an amino acid.
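The "largest ORF" rule can be turned into a small search over all six frames. The following Python sketch is a simplified illustration, not a reimplementation of ORF-finder: it only reports ORFs that have both an ATG and an in-frame stop codon, and the function names are assumptions made here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
PAIRING = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the complementary strand written 5' to 3'."""
    return seq.translate(PAIRING)[::-1]

def orfs_in_frame(seq, offset):
    """Yield each ORF (ATG through the stop codon, inclusive) in one frame."""
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i                    # first ATG opens the ORF
        elif start is not None and codon in STOP_CODONS:
            yield seq[start:i + 3]       # first in-frame stop closes it
            start = None

def longest_orf(seq):
    """Return (frame label, ORF) for the longest complete ORF, or None."""
    best = None
    for sign, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            for orf in orfs_in_frame(strand, offset):
                if best is None or len(orf) > len(best[1]):
                    best = (f"{sign}{offset + 1}", orf)
    return best

example = "CGCATGGTCTTACGCTGGAGCTCTCATGGATCGGTTTAA"
print(longest_orf(example))
```

For the slide's sequence, frame -1 also contains a short complete ORF (ATG AGA GCT CCA GCG TAA), but the frame +1 ORF is longer, so it is the one reported.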
Using the ORF-finder program to identify ORFs: http://www.ncbi.nlm.nih.gov/gorf/gorf.html or search Google for “ORF-finder”
Using ORF-finder
Results from ORF-finder
There are 6 possible reading frames
For our purposes, the largest ORF is the correct one
Selecting an ORF gives you the translation
ORFs begin with a start codon and end with a stop codon
ORF-finder results match with NCBI nucleotide
Some sequences found in the genomic DNA are removed from the mRNA. Introns are the sequences that are removed. The mature mRNA sequence contains only exonic sequence
An mRNA sequence includes 5’ UTR, ORF and 3’ UTR. 5’ UTR: untranslated region before the start codon; does not code for protein. Coding sequence (red). 3’ UTR: untranslated region after the stop codon; does not code for protein
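Once the start and stop of the coding sequence are known, annotating an mRNA comes down to cutting it into three pieces. A minimal Python sketch, assuming 1-based inclusive CDS coordinates; the mRNA below and its coordinates are invented for illustration.

```python
def annotate_mrna(mrna: str, cds_start: int, cds_end: int) -> dict[str, str]:
    """Split an mRNA/cDNA sequence into 5' UTR, CDS and 3' UTR.

    cds_start and cds_end are 1-based and inclusive; cds_end is the
    last base of the stop codon.
    """
    return {
        "5' UTR": mrna[:cds_start - 1],
        "CDS": mrna[cds_start - 1:cds_end],
        "3' UTR": mrna[cds_end:],
    }

# Hypothetical transcript: 6 nt 5' UTR, 18 nt CDS, 11 nt 3' UTR.
mrna = "GGCACC" + "ATGGTCTTACGCTGGTAA" + "TTTGCAATAAA"
regions = annotate_mrna(mrna, cds_start=7, cds_end=24)

for name, seq in regions.items():
    print(f"{name}: {seq} ({len(seq)} nt)")
```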
There are 6 possible reading frames in a nucleic acid sequence
The correct ORF is usually the largest
ORFs start with ATG and end with a stop codon
Worksheet