Genome Analysis & Gene Prediction

Overview of Genes
Gene: the whole nucleic acid sequence necessary for the synthesis of a functional protein (or functional RNA). A human cell contains approximately 23,000 genes.
- Some of these are expressed in all cells all the time. These so-called housekeeping genes are responsible for the routine metabolic functions (e.g. respiration) common to all cells.
- Some are expressed all the time, but only in cells that have differentiated in a particular way. For example, a liver cell continuously expresses the genes for the metabolizing enzymes.
- Some are expressed only as conditions around and in the cell change. For example, the arrival of a hormone (due to environmental or other factors) may turn certain genes in that cell on or off.

How Is Gene Expression Regulated?
To understand gene expression, we first look at the basic structure of a gene.

Terminology
- Genome: the entire genetic material of an individual
- Transcriptome: the set of transcribed sequences
- Proteome: the set of proteins encoded by the genome

Prokaryotic Gene Structure
[Figure: prokaryotic gene layout. Labels: genomic DNA (promoter, CDS, terminator), transcription, mRNA (UTR), translation, protein.]

Eukaryotic Gene Structure
[Figure: eukaryotic gene layout, 5' to 3'. Labels: promoter, exon 1, intron 1, exon 2, terminator; transcription; UTRs and poly-A tail on the transcript; translation; protein.]

Three types of RNA polymerase are employed in the transcription of eukaryotic genes:
- RNA polymerase I transcribes rRNA.
- RNA polymerase II transcribes all genes coding for polypeptides.
- RNA polymerase III transcribes small cytoplasmic RNAs, such as tRNA.

[Figure: genomic DNA, 5' to 3', divided into upstream, primary transcript, and downstream regions.]

About the Upstream Region of a Gene
[Figure: genomic DNA, 5' to 3', divided into upstream, primary transcript, and downstream regions. The upstream region is split into the upstream promoter/regulatory region and the promoter. Labels: distal, central, proximal; distal (GC box), central (CAAT box), core/basal promoter (TATA box).]

About Core Promoter
- The basal or core promoter is located within about 40 base pairs (bp) of the transcription start site (TSS).
- It is found in all protein-coding genes. This is in sharp contrast to the upstream promoter, whose structure and associated binding factors differ from gene to gene.
- It contains a TATA box sequence (either the canonical TATA box or a TATA variant). It is bound by a large complex of some 50 different proteins, including:
  - Transcription Factor IID (TFIID), which is a complex of
    - TATA-binding protein (TBP), which recognizes and binds to the TATA box
    - 14 other protein factors, which bind to TBP and to each other, but not to the DNA
  - Transcription Factor IIB (TFIIB), which binds both the DNA and pol II

About Upstream Promoter/Regulatory Regions
- The "upstream" promoter may extend over as many as 200 bp or farther upstream.
- It has three regions:
  - Proximal region: insulators may be present in this region. Insulators are stretches of DNA (as few as 42 base pairs) located between the enhancer(s) and promoter, or silencer(s) and promoter, of adjacent genes or clusters of adjacent genes. Their function is to prevent a gene from being influenced by the enhancer (or silencer) of its neighbors.
  - Central region: silencers may be present in this region. Silencers are control regions of DNA that may be located thousands of base pairs away from the gene they control. When transcription factors (repressors) bind to them, expression of the gene they control is repressed.
  - Distal region: enhancers may be present in this region. Enhancers are regions of DNA that may be thousands of base pairs away from the gene they control. Binding of transcription factors to an enhancer increases the rate of transcription of the gene. Enhancers can be located upstream, downstream, or even within the gene they control.


About Primary Transcript
[Figure: genomic DNA, 5' to 3', divided into upstream, primary transcript, and downstream regions, with the TSS marked. Signals along the primary transcript: ATG ... GT ... AG ... GT ... AG ... TGA. Labels: start codon, exons, introns, donor sites, acceptor sites, stop codon. The mRNA is shown as ATG ... TGA.]

About Primary Transcript
The primary transcript consists of:
- Cap region: the 5' cap is a specially altered nucleotide on the 5' end of precursor messenger RNA.
- 5'-UTR: regions of the gene outside of the CDS are called UTRs (untranslated regions). They are mostly ignored by gene finders, though they are important for regulatory functions.
- Coding sequence (CDS): the CDS of a gene is delimited by four types of signals: start codons (ATG in eukaryotes), stop codons (usually TAG, TGA, or TAA), donor sites (usually GT), and acceptor sites (AG).
- 3'-UTR: the three prime untranslated region (3' UTR) is the section of messenger RNA (mRNA) that follows the stop codon.
- Poly-A tail: polyadenylation is the addition of a poly(A) tail to an RNA molecule. The poly(A) tail consists of multiple adenosine monophosphates.
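The four CDS signal types listed above can be located with a simple motif scan. The following is a minimal sketch, not a gene finder: the function name `find_signals` and the example sequence are made up for illustration, and most hits for two-letter motifs such as GT and AG are spurious, which is why real tools combine these signals with other evidence.

```python
import re

# Signal motifs named in the text (DNA alphabet, coding strand).
SIGNALS = {
    "start": ["ATG"],
    "stop": ["TAG", "TGA", "TAA"],
    "donor": ["GT"],
    "acceptor": ["AG"],
}

def find_signals(seq):
    """Return {signal_type: sorted list of 0-based motif start positions}."""
    seq = seq.upper()
    hits = {}
    for name, motifs in SIGNALS.items():
        positions = []
        for motif in motifs:
            # A lookahead reports every occurrence, including overlapping ones.
            positions += [m.start() for m in re.finditer(f"(?={motif})", seq)]
        hits[name] = sorted(positions)
    return hits

# Hypothetical toy sequence, chosen only to contain each signal type.
example = "CCATGGCGGTAAGTCCTTAGGCTTGA"
for name, positions in find_signals(example).items():
    print(name, positions)
```

Every position is only a candidate; deciding which candidates are real start codons or splice sites is the actual gene-prediction problem.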

About Intron and Exon
- Intron: the term is derived from "intragenic region", i.e. a region inside a gene. Introns are sometimes called intervening sequences, which refers to any of several families of internal nucleic acid sequences that are not present in the final gene product.
- Exon: these sequences are present in the mature form of an RNA molecule after the introns have been removed. The mature RNA molecule can be a messenger RNA or a functional form of a non-coding RNA such as rRNA or tRNA.

More about Exon
Four types of exons are defined:
- initial exons extend from a start codon to the first donor site;
- internal exons extend from one acceptor site to the next donor site;
- final exons extend from the last acceptor site to the stop codon;
- single exons (which occur only in intronless genes) extend from the start codon to the stop codon.
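The exon types above follow directly from an exon's position in the gene model. As a small sketch, assuming coding exons are given as 0-based, end-exclusive `(start, end)` intervals ordered 5' to 3' (a representation chosen here for illustration), the labels and the spliced coding sequence can be derived as follows. The function names and the toy gene are hypothetical.

```python
def classify_exons(exons):
    """Label each coding exon as single, initial, internal, or final.

    `exons` is a list of (start, end) pairs, 0-based and end-exclusive,
    ordered 5' to 3' on the coding strand.
    """
    if len(exons) == 1:
        return ["single"]
    labels = ["internal"] * len(exons)
    labels[0] = "initial"   # start codon to first donor site
    labels[-1] = "final"    # last acceptor site to stop codon
    return labels

def splice(seq, exons):
    """Concatenate the exon intervals, dropping the introns between them."""
    return "".join(seq[start:end] for start, end in exons)

# Toy gene: the intron begins with GT (donor) and ends with AG (acceptor).
exon1, intron, exon2 = "ATGGCC", "GTAAGTTTCAG", "GATTGA"
gene = exon1 + intron + exon2
exons = [(0, len(exon1)), (len(exon1) + len(intron), len(gene))]

print(classify_exons(exons))   # labels for the two exons
print(splice(gene, exons))     # coding sequence with the intron removed
```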

Structure of a Gene

A Hypothetical Example Gene Parse Tree

Gene Prediction
- Analysis by sequence similarity can reliably identify only about 30% of the protein-coding genes in a genome.
- 50-80% of newly identified genes have only a partial, marginal, or unidentified homolog.
- Frequently expressed genes tend to be more easily identifiable by homology than rarely expressed genes.

Gene finding is species-specific
- Codon usage patterns vary by species.
- Functional regions (promoters, translation initiation sites, termination signals) vary by species.
- Common repeat sequences are species-specific.
- Gene finding programs rely on this information to identify coding regions.
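Codon usage, the first item above, is straightforward to measure from known coding sequences. The sketch below computes relative codon frequencies for one in-frame CDS; the function name `codon_usage` and the example sequence are illustrative assumptions, and a real training set would pool many annotated genes from the species of interest.

```python
from collections import Counter

def codon_usage(cds):
    """Return the relative frequency of each codon in an in-frame CDS."""
    cds = cds.upper()
    # Ignore any trailing bases that do not form a complete codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

# Hypothetical CDS: ATG GCT GCT GCC TAA
usage = codon_usage("ATGGCTGCTGCCTAA")
for codon, freq in sorted(usage.items()):
    print(codon, round(freq, 2))
```

Comparing such tables between species, or between candidate ORFs and known genes of one species, is one way a gene finder can exploit codon bias.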

Protein Coding Gene
- Ab initio prediction using computational methods is best suited to protein-coding genes.
- Protein-coding genes have recognizable features:
  - open reading frames (ORFs)
  - codon bias
  - known transcription and translation start and stop motifs (promoters, 3' poly-A sites)
  - splice consensus sequences at intron-exon boundaries

ab initio gene discovery
- Protein-coding genes have recognizable features.
- We can design software to scan the genome and identify these features.
- Some of these programs work quite well, especially in bacteria and simpler eukaryotes with smaller and more compact genomes.
- It is a lot harder for the higher eukaryotes, where there are many long introns, genes can be found within introns of other genes, etc.
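As an illustration of scanning for one such feature, the sketch below finds open reading frames (ATG to the next in-frame stop codon) in the three forward reading frames. It is a deliberately simplified assumption-laden example: the name `find_orfs` and the `min_codons` cutoff are hypothetical, the reverse strand is not scanned, and introns are ignored, so it is closer to the compact-genome case than to higher eukaryotes.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive; the span includes the stop
    codon, and `min_codons` counts the start and stop codons as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open an ORF at the first ATG
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it at the first stop
    return orfs

# Hypothetical toy sequence containing one short ORF in frame 2.
print(find_orfs("CCATGGCTGCTTAAGG"))
```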

ab initio gene discovery: validating predictions and refining gene models
Standard types of evidence for validation of predictions include:
- match to a previously annotated cDNA
- match to an EST from the same organism
- similarity of the nucleotide or conceptually translated protein sequence to sequences in GenBank
- protein structure prediction match to a PFAM domain
- association with recognized promoter sequences, e.g. TATA box, CpG island
- known phenotype from mutation of the locus
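One of the promoter-associated features mentioned above, the CpG island, is usually screened for with two simple statistics: GC content and the ratio of observed to expected CpG dinucleotides. The sketch below computes both for a window. The thresholds in `looks_like_cpg_island` (length of at least 200 bp, GC fraction above 0.5, observed/expected above 0.6) are commonly cited criteria that are assumed here, not taken from these slides, and the function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA window."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C count * G count) / length.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp

def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed length, GC, and observed/expected thresholds."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp > min_obs_exp

print(cpg_stats("CGCGCGCG"))
print(looks_like_cpg_island("CG" * 100), looks_like_cpg_island("AT" * 100))
```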

Finding Non-protein Coding Genes
Non-protein coding genes (tRNA, rRNA, snoRNA, siRNA, miRNA, various other ncRNAs) are harder to find than protein-coding genes, because:
- they are often not poly-A tailed, so they do not end up in cDNA libraries;
- they have no ORF;
- the constraint on sequence divergence acts at the nucleotide level, not the protein level, so homology is harder to detect.

Finding Non-protein Coding Genes
To find non-protein coding genes, we can use:
- secondary structure
- homology, especially alignment of related species
- experimental isolation through non-poly-A-dependent cloning methods
- microarrays

ab initio gene discovery: approaches
- Most gene-discovery programs make use of some form of machine learning algorithm. A machine learning algorithm requires a training set of input data that the computer uses to "learn" how to find a pattern.
- Common approaches used in gene discovery (and many other bioinformatics applications) are:
  - dynamic programming models
  - artificial neural networks (ANNs)
  - hidden Markov models (HMMs)
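To make the HMM idea concrete, here is a minimal sketch of Viterbi decoding, itself a dynamic programming algorithm, on a two-state toy model that labels each base as "coding" or "noncoding". Every probability below is an invented illustrative value, not a trained parameter, and the two-state design is an assumption for brevity; real gene-finding HMMs use many more states (exons by frame, introns, splice sites) with parameters learned from a training set.

```python
import math

STATES = ("coding", "noncoding")
# All probabilities are made-up illustrative values, not trained ones.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def viterbi(seq):
    """Return the most probable state path for `seq` under the toy HMM."""
    seq = seq.upper()
    if not seq:
        return []
    # score[s] = best log-probability of any path ending in state s so far.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

print(viterbi("ATATATATATGCGCGCGCGC"))
```

With these toy parameters the GC-rich stretch is labeled "coding" only because the invented emission table favors G and C in that state; trained models capture far richer signals such as codon periodicity.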

Transcription Factors

Control of Gene Expression: Transcription Factors
- Transcription factors (TFs) are proteins that bind to the DNA and help to control gene expression. The sequences to which they bind are transcription factor binding sites (TFBSs), which are a type of cis-regulatory sequence.
- Most transcription factors can bind to a range of similar sequences. These can be described in either of two ways: as a consensus sequence, or as a position weight matrix (PWM).
- Once we know the binding site, we can search the genome to find all of the (predicted) binding sites.
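The PWM approach can be sketched in a few lines: build a log-odds matrix from aligned example sites, then slide it along a sequence and report windows that score above a threshold. The aligned sites, the pseudocount, the uniform 0.25 background, the threshold, and the function names below are all assumptions made for illustration, not data for any real transcription factor.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned sites."""
    pwm = []
    total = len(sites) + 4 * pseudocount
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm

def score(pwm, window):
    """Sum the per-position log-odds for one window of the same width."""
    return sum(col[base] for col, base in zip(pwm, window))

def scan(pwm, seq, threshold):
    """Return (position, score) for every window scoring at least `threshold`."""
    seq = seq.upper()
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        s = score(pwm, seq[i:i + width])
        if s >= threshold:
            hits.append((i, s))
    return hits

# Hypothetical TATA-like sites, used only to populate the matrix.
sites = ["TATAAA", "TATATA", "TATAAA", "TATAAG"]
pwm = build_pwm(sites)
for position, s in scan(pwm, "GGCGTATAAAGGC", threshold=5.0):
    print(position, round(s, 2))
```

Unlike a consensus string, the matrix scores near-matches on a continuous scale, so the threshold controls the trade-off between missed sites and false predictions.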

Evidence-based Approaches
- Comparative or similarity-based gene prediction
- Combine gene models with alignment to known ESTs and protein sequences

Gene Prediction Tools
- SNAP
- TwinScan
- Gnomon (NCBI)
- GeneWise
- Jigsaw
- GLEAN
- Grail
- BLAST
- FASTAX
- BLAT
- WABA
- MZEF
- MZEF-SPC
- FGENESH

Genome Annotation: Much Work Remains
- Despite good progress in identifying both protein-coding and non-protein coding genes, much work remains to be done before even the best-studied genomes are fully annotated.
- For the higher eukaryotes, only a tiny percentage of features such as TFBSs and other non-gene features have so far been identified.
