1 Computational Molecular Biology MPI for Molecular Genetics DNA sequence analysis Gene prediction Gene prediction methods Gene indices Mapping cDNA on.

Presentation transcript:

1 Computational Molecular Biology MPI for Molecular Genetics DNA sequence analysis Gene prediction Gene prediction methods Gene indices Mapping cDNA on genomic DNA Applications

2 DNA sequence analysis: Gene prediction. Figure labels: promoter; 5' UTR; exon 1, exon 2, ..., exon n-1, exon n; 3' UTR; protein coding sequence.

3 Gene prediction: Strategies for detecting ORFs / exons. Distribution of stop codons; codon usage; hexamer frequencies; prediction of the coding frame; splice site recognition (eukaryotes only).
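The first strategy on this slide, the distribution of stop codons, can be sketched as a scan for stop-free stretches in the three forward reading frames. This is a minimal illustration, not the method of any program named in this deck; the function name `open_reading_frames` and the `min_codons` cutoff are assumptions introduced here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_reading_frames(seq, min_codons=3):
    """Return (frame, start, end) for stop-free stretches on the forward strand.

    frame is 0, 1 or 2; start and end are 0-based, end-exclusive offsets.
    min_codons is an illustrative cutoff, not a value from the slides.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        run_start = frame
        # Walk codon by codon; a stop codon closes the current open stretch.
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - run_start) // 3 >= min_codons:
                    orfs.append((frame, run_start, pos))
                run_start = pos + 3
        # A stretch may still be open at the end of the sequence.
        last = frame + ((len(seq) - frame) // 3) * 3
        if (last - run_start) // 3 >= min_codons:
            orfs.append((frame, run_start, last))
    return orfs


if __name__ == "__main__":
    for frame, start, end in open_reading_frames("ATGAAACCCGGGTAAATG"):
        print(f"frame {frame + 1}: {start}..{end}")
```

A full gene finder would also scan the reverse strand and combine this signal with the other criteria listed on the slide.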

4 Gene prediction: Codon usage (single exon). Figure labels: Frame 1, Frame 2, Frame 3; coding, non-coding.

5 Gene prediction: Codon usage (single exon). Figure labels: Frame 1, Frame 2, Frame 3; coding, non-coding; correct start; coding sequence.
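One way to picture how codon usage separates the coding frame from the other two is to score every frame with log-odds values learned from known coding sequences. The sketch below assumes a uniform background of 1/64 per codon and add-one smoothing; both choices, and the names `codon_log_odds` and `frame_scores`, are illustrative assumptions rather than anything specified on the slides.

```python
import math
from collections import Counter
from itertools import product

ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_log_odds(coding_seqs):
    """Log-odds of each codon in the training set versus a uniform background."""
    counts = Counter()
    for s in coding_seqs:
        s = s.upper()
        counts.update(s[i:i + 3] for i in range(0, len(s) - 2, 3))
    total = sum(counts[c] for c in ALL_CODONS)
    # Add-one smoothing keeps unseen codons at a finite negative score.
    return {c: math.log((counts[c] + 1) / (total + 64) * 64) for c in ALL_CODONS}


def frame_scores(seq, log_odds):
    """Mean per-codon log-odds for frames 1, 2 and 3 (list indices 0, 1, 2)."""
    seq = seq.upper()
    scores = []
    for frame in range(3):
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        values = [log_odds.get(c, 0.0) for c in codons]
        scores.append(sum(values) / len(values) if values else 0.0)
    return scores


if __name__ == "__main__":
    table = codon_log_odds(["GCTGCTGCTGCT"])  # toy training data
    scores = frame_scores("AGCTGCTGCTGCTGC", table)
    print("best frame:", scores.index(max(scores)) + 1)
```

In practice the score would be computed in a sliding window along the sequence, so that the frame with the highest score can change from region to region.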

6 Gene prediction: Codon usage (multiple exons). Figure labels: Frame 1, Frame 2, Frame 3; coding, non-coding; splice sites; exons.

7 Gene prediction: Codon usage (multiple exons). Figure labels: Frame 1, Frame 2, Frame 3; coding, non-coding; splice sites; exons.

8 Gene prediction: Additional criteria. Detection of start codons; detection of potential promoter elements; detection of repetitive sequences (mostly untranslated); homology to known genes of related organisms.

9 Gene prediction: Software. GENSCAN (C. Burge & S. Karlin); Grail (neural network; Uberbacher et al.); MZEF (M. Zhang, 1997); FGeneH, Hexon (V. Solovyev et al., 1994); Genie, etc. All programs use dynamic programming to find the optimal solution.

10 DNA sequences in public databases. Human: ~4 million ESTs, RNAs. Mouse: ~2.7 million ESTs, RNAs.

11 Expressed sequence tags (EST). Figure labels: mRNA with poly(A) tail (AAAAAA...); oligo-dT primer (TTTTTT...); cDNA. cDNA synthesis is usually primed with oligo-dT or with random primers. Reverse transcriptase stops 'randomly'. Several cDNAs may be generated for the same mRNA.

12 Expressed sequence tags (EST). Figure labels: average 1500 bp; <700 bp; vector (known sequence); clone = mRNA fragment; deciphered sequence (EST); 3' primer.

13 Expressed sequence tags (EST). Isolation of mRNAs from tissue(s); generation of cDNAs reflecting parts of the RNAs; cloning of cDNAs into a vector (often in random orientation); end sequencing of the clones.

14 Generation of ESTs. Figure labels: basecalling problems; close to 3' end of EST; close to 5' end of EST; missing bases.

15 Coverage of an mRNA by ESTs. Figure labels: putative mRNA (5' UTR, exon 1, exon 2, 3' UTR, poly(A) tail AAAAAA...); expressed sequence tags (ESTs).

16 Characteristics of ESTs. Highly redundant; low sequence quality (cheap); reflect expressed genes; may be tissue/stage specific.

17 Gene indices. UniGene (NCBI); TIGR Gene Indices; STACK (SANBI); GeneNest (DKFZ, MPI). Clustering of the EST and mRNA sequences of an organism to reduce redundancy in the sequence data. Goal: each cluster represents one gene or mRNA.

18 Gene indices: GeneNest workflow. Input: EMBL database, UniGene database. Steps: quality clipping; BLAST/QUASAR search and clustering; assembly and consensus sequences; visualization.

19 Gene indices: Quality clipping. To cluster on the basis of gene-specific sequence data, the following steps have to be performed: removal of vector sequence; masking of repetitive sequences (e.g. Alu); removal of terminal sequences of low quality.
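The last clipping step, removal of low-quality terminal sequence, can be sketched with a sliding window over per-base quality values. The slides do not say how GeneNest implements it; the Phred-style scores, the threshold `min_q=20`, the window size and the function name are all assumptions made for illustration.

```python
def clip_low_quality_ends(seq, quals, min_q=20, window=5):
    """Trim both ends until a window of bases reaches the mean quality min_q.

    seq is the read, quals a list of per-base quality scores of equal length.
    Returns the clipped (sequence, qualities); empty if no window qualifies.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list differ in length")

    def window_ok(i):
        w = quals[i:i + window]
        return sum(w) / window >= min_q

    # Start positions of all windows that pass the quality threshold.
    good = [i for i in range(len(seq) - window + 1) if window_ok(i)]
    if not good:
        return "", []
    left, right = good[0], good[-1] + window
    return seq[left:right], quals[left:right]


if __name__ == "__main__":
    read = "ACGTACGTAC"
    quality = [5, 5, 30, 30, 30, 30, 30, 30, 5, 5]
    print(clip_low_quality_ends(read, quality, window=2))
```

Vector removal and repeat masking need reference sequences (the cloning vector, a repeat library) and are typically done by similarity search before this step.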

20 Gene indices: Clustering. Sequences are usually clustered if the matching part between two sequences fulfills several (empirical) criteria: minimal % identity (e.g. > 95%); minimal length of match (e.g. > 40 bp); no internal matches (TIGR Gene Indices); same tissue of origin (STACK only).
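The first two criteria can be sketched as a filter on pairwise matches followed by single-linkage grouping with a union-find structure. Treating clustering as transitive closure over accepted matches is an assumption made here; the slides give only the example thresholds (> 95% identity, > 40 bp). The hit tuples stand in for the output of a similarity search such as BLAST, and the function name is hypothetical.

```python
def cluster_sequences(ids, hits, min_identity=95.0, min_length=40):
    """Group sequence ids that are linked by sufficiently good pairwise matches.

    hits is an iterable of (id_a, id_b, percent_identity, match_length).
    Returns a sorted list of sorted clusters.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, length in hits:
        if identity > min_identity and length > min_length:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    ests = ["e1", "e2", "e3", "e4"]
    matches = [
        ("e1", "e2", 98.0, 120),
        ("e2", "e3", 96.5, 55),
        ("e3", "e4", 99.0, 30),   # too short
        ("e1", "e4", 90.0, 200),  # identity too low
    ]
    print(cluster_sequences(ests, matches))
```

Single linkage is simple but lets one chimeric or repeat-containing sequence merge unrelated clusters, which is one reason quality clipping and repeat masking come first.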

21 Gene indices: Assembly. Sequences in a cluster are assembled to group those sequences that are globally similar, resulting in: contigs, reflecting parts of different transcripts; one consensus sequence per contig; a relative order of the sequences (alignment).

22 Gene indices: Consensus sequences. Reduced error rate; consensus often longer than any single contributing sequence; efficient database search; detection of exon/intron boundaries and alternative splice variants.

23 Gene indices: Alignment. Figure label: consensus.
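How a consensus is read off an alignment can be illustrated with a column-wise majority vote. This is a simplified stand-in for what assemblers do (they typically also weight by base quality); the row encoding, with `.` for positions a read does not cover and `-` for alignment gaps, and the function name are assumptions.

```python
from collections import Counter


def consensus(aligned_rows):
    """Majority-vote consensus of equally long, already aligned reads."""
    length = len(aligned_rows[0])
    if any(len(row) != length for row in aligned_rows):
        raise ValueError("all rows must have the same length")
    out = []
    for col in range(length):
        # Ignore reads that do not cover this column.
        bases = [row[col] for row in aligned_rows if row[col] != "."]
        if not bases:
            out.append("N")
            continue
        base, _count = Counter(bases).most_common(1)[0]
        if base != "-":  # a majority gap contributes no base
            out.append(base)
    return "".join(out)


if __name__ == "__main__":
    rows = [
        "ACGTAC....",
        "..GTTCGG..",  # T at column 4 is outvoted by the two A's
        "....ACGGTT",
    ]
    print(consensus(rows))
```

The toy example shows both effects listed for consensus sequences: the single discordant base is outvoted, and the result is longer than any contributing read.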

24 Gene indices: Alignment software. Phrap (Phil Green); CAP3 (X. Huang); TIGR Assembler; GAP4 (R. Staden).

25 GeneNest visualization

26 GeneNest visualization

27 TIGR Gene Indices: Alignment scheme

28 UniGene

29 UniGene

30 Mapping of consensus sequences on genomic DNA. Figure labels: genomic sequence; exons; consensus sequence (corresponding to the mRNA); missing intron.

31 Mapping cDNA on genomic DNA

32 Gene indices: Applications. Detection of exon/intron boundaries; detection of alternative splicing; detection of single nucleotide polymorphisms; genome annotation; analysis of gene expression; genome-genome comparison.

33 Alternative Splicing. Figure labels: hnRNA (5' UTR, exon 1, exon 2, exon 3); two mature transcripts, mRNA 1 and mRNA 2, one with 5' UTR, exon 1, exon 2 and one with 5' UTR, exon 1, exon 3.

34 Alignment of EST consensus sequences and genomic target. Figure label: genomic sequence.

35 Detection of the appropriate genomic target sequence. Local similarity of EST consensus and genomic DNA: > 96% identity. Figure label: genomic sequence.

36 Cutting out genomic target sequence. Figure label: genomic sequence.

37 Alternative Splicing (mapping on genomic DNA). Figure labels: genomic sequence; exons; consensus sequence (corresponding to the mRNA); splice variant.
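Once two consensus sequences have been mapped to the same genomic region, a splice variant of the exon-skipping type shows up as an exon of one transcript lying inside an intron of the other. The sketch below assumes exon coordinates are already available as sorted, half-open (start, end) intervals on the genomic sequence; it covers only skipped exons, not alternative donor or acceptor sites, and the function name is hypothetical.

```python
def skipped_exons(exons_a, exons_b):
    """Exons of transcript A that fall entirely inside an intron of transcript B.

    Both arguments are lists of (start, end) genomic intervals, sorted by
    position, with end exclusive.
    """
    # Introns of B are the gaps between consecutive exons of B.
    introns_b = [
        (left_end, right_start)
        for (_, left_end), (right_start, _) in zip(exons_b, exons_b[1:])
    ]
    return [
        (start, end)
        for start, end in exons_a
        if any(i_start <= start and end <= i_end for i_start, i_end in introns_b)
    ]


if __name__ == "__main__":
    variant_1 = [(100, 200), (300, 400), (500, 600)]
    variant_2 = [(100, 200), (500, 600)]
    print(skipped_exons(variant_1, variant_2))
```

Real alignments have fuzzy boundaries, so a practical version would allow a few bases of tolerance when comparing coordinates.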

38 SpliceNest. Figure labels: putative exons; genomic sequence; aligned GeneNest consensus; alternative exon.

39 Alternative Splicing (additional exon). Splice variants of the adenylosuccinate lyase gene. Figure labels: skipped exon; prediction errors?; unspliced?

40 Alternative Splicing: Splice variants of the APECED gene. Figure labels: number of sequences; genomic sequence; alternative variants.

41 Alternative splicing

42 Alternative Splicing (alternative donor site)

43 Alternative Splicing

44 Alternative Splicing (alternative exons)

45 SpliceNest (hypothetical gene Hs16936)

46 Single Nucleotide Polymorphisms (SNP). SNPs are single-base differences within one species. Several million SNPs have been detected in human. SNPs may be related to diseases.

47 Single Nucleotide Polymorphisms (SNP): SNP or basecalling error?
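A common heuristic for telling the two apart is to require that the minor allele in an alignment column is supported by more than one independent sequence, since a single deviating base is more likely a basecalling error. The slides do not state which rule was used; the `min_support=2` threshold, the row encoding (`.` for uncovered positions) and the function name are assumptions for this sketch.

```python
from collections import Counter


def candidate_snps(aligned_rows, min_support=2):
    """Columns where a second allele is seen in at least min_support reads.

    aligned_rows are equally long aligned reads; characters other than
    A, C, G, T (gaps, uncovered positions) are ignored.
    Returns a list of (column, major_allele, minor_allele).
    """
    length = len(aligned_rows[0])
    snps = []
    for col in range(length):
        counts = Counter(row[col] for row in aligned_rows if row[col] in "ACGT")
        if len(counts) < 2:
            continue
        ranked = counts.most_common()
        (major, _), (minor, minor_n) = ranked[0], ranked[1]
        if minor_n >= min_support:
            snps.append((col, major, minor))
    return snps


if __name__ == "__main__":
    reads = ["ACGTA", "ACGTA", "ATGTA", "ATGTC", "ACGTA"]
    print(candidate_snps(reads))
```

In the toy data, column 1 (C/T, with T seen twice) is reported, while the lone C in the last column is treated as a probable sequencing error. Base quality values, where available, would sharpen the decision further.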

48 Genome Annotation / Ensembl

49 Analysis of gene expression: tissue specificity. Counting the frequency of ESTs derived from a specific tissue within one sequence cluster; searching for clusters/contigs that are tissue specific (e.g. tumor); searching for alternative splice variants that are potentially tissue specific.
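Counting ESTs per tissue within a cluster can be sketched as follows. The tissue annotation of each EST is assumed to be available as a plain label; the cutoffs `min_fraction=0.8` and `min_ests=5` and the function names are illustrative assumptions, not values from the slides.

```python
from collections import Counter


def tissue_profile(est_tissues):
    """Fraction of a cluster's ESTs that come from each tissue."""
    counts = Counter(est_tissues)
    total = sum(counts.values())
    return {tissue: n / total for tissue, n in counts.items()}


def is_tissue_specific(est_tissues, tissue, min_fraction=0.8, min_ests=5):
    """True if the cluster is large enough and dominated by one tissue."""
    if len(est_tissues) < min_ests:
        return False  # too few ESTs to say anything
    return tissue_profile(est_tissues).get(tissue, 0.0) >= min_fraction


if __name__ == "__main__":
    cluster = ["liver"] * 8 + ["kidney"] * 2
    print(tissue_profile(cluster))
    print(is_tissue_specific(cluster, "liver"))
```

Raw fractions are biased by how many ESTs were sequenced per tissue library, so a more careful analysis would normalize by library size or apply a statistical test.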

50 Analysis of gene expression: PDZ-domain containing protein PDZK1 (Hs.15456). Tissue labels: liver tumor kidney

51 Analysis of gene expression: small muscular protein, SMPX (Hs.88492). Tissue labels: heart muscle

52 Analysis of gene expression: hypothetical protein (Hs.32343). Tissue labels: thyroid tumor heart ovary

53 Analysis of gene expression: non-redundant gene set. Selection of 'optimal' clones; generation of gene-specific PCR products.

54 Analysis of gene expression: 'optimal' clones. Clone availability; type of clone library; length of the clone; position relative to the consensus sequence; homology to other genes; existence of repetitive elements.

55 Analysis of gene expression: gene-specific PCR products. Figure labels: putative gene (consensus sequence) with exons A, B and C; repetitive sequence; similarity to another gene; potential gene-specific fragments.

56 Analysis of gene expression: optimal gene-specific PCR product. Minimal similarity to other genes; minimal content of repetitive sequences; not spanning several exons; roughly constant length of PCR products across different genes.
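These criteria can be read as an interval problem: within each exon, subtract the regions that are repetitive or similar to other genes, then take a fragment of fixed length from what remains. The sketch assumes such regions have already been located (for example by a repeat masker and a similarity search) and are given as half-open (start, end) intervals on the consensus sequence; the function name and the choice of the leftmost fitting window are assumptions.

```python
def gene_specific_fragments(exons, excluded, length):
    """Pick one window of the given length per usable exon segment.

    exons and excluded are lists of (start, end) intervals, end exclusive.
    A window never crosses an exon boundary and never overlaps an
    excluded region (repeat or similarity to another gene).
    """
    fragments = []
    for exon in exons:
        free = [exon]
        # Cut every excluded region out of the remaining free segments.
        for bad_start, bad_end in excluded:
            remaining = []
            for start, end in free:
                if bad_end <= start or bad_start >= end:
                    remaining.append((start, end))
                    continue
                if start < bad_start:
                    remaining.append((start, bad_start))
                if bad_end < end:
                    remaining.append((bad_end, end))
            free = remaining
        for start, end in free:
            if end - start >= length:
                fragments.append((start, start + length))
    return fragments


if __name__ == "__main__":
    exons = [(0, 300), (300, 500), (500, 900)]
    excluded = [(100, 250), (600, 650)]  # e.g. a repeat and a paralog match
    print(gene_specific_fragments(exons, excluded, length=200))
```

Using one fixed `length` for all genes reflects the last criterion on the slide, that PCR products of different genes should have roughly the same length.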