Generic substitution matrix-based sequence similarity evaluation


Generic substitution matrix-based sequence similarity evaluation
Alignment 1 (with a gap):
Q:     M   A   T   W   L   I
A:     M   A   -   W   T   V
Scr:   5   4  -?  11  -1   3
Total = 22 - ?
Alignment 2 (no gap):
Q:     M   A   T   W   L   I
A:     M   A   W   T   V   A
Scr:   5   4  -2  -2   1  -1
Total: 5
BLOSUM62: gap opening: -6 ~ -15; gap extension: -2 ~ -6
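The scoring on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a full aligner: it scores two strings that are already aligned. The dictionary holds only the BLOSUM62 entries needed for the slide's example, and the function names and the default penalties (-10 to open, -2 to extend, picked from the ranges on the slide) are assumptions made for the example.

```python
# Subset of BLOSUM62 covering only the residue pairs in the slide's example.
BLOSUM62_SUBSET = {
    ("M", "M"): 5, ("A", "A"): 4, ("W", "W"): 11,
    ("L", "T"): -1, ("I", "V"): 3,
    ("T", "W"): -2, ("L", "V"): 1, ("A", "I"): -1,
}


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def alignment_score(query, subject, gap_open=-10, gap_extend=-2):
    """Score two pre-aligned, equal-length strings; '-' marks a gap.

    Assumed convention: the first position of a gap run costs gap_open,
    and every further position in the same run costs gap_extend.
    """
    if len(query) != len(subject):
        raise ValueError("aligned strings must have the same length")
    total = 0
    in_gap = False
    for a, b in zip(query, subject):
        if a == "-" or b == "-":
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += pair_score(a, b)
            in_gap = False
    return total


gapped = alignment_score("MATWLI", "MA-WTV")    # 22 plus the gap penalty
ungapped = alignment_score("MATWLI", "MAWTVA")  # no gap, substitutions only
```

With these assumed penalties the gapped alignment still outscores the ungapped one; a harsher gap opening penalty narrows that margin.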

Position-specific matrices reflect the structure-function relationship of a given protein family
BID_MOUSE    I A R H L A Q I G D E M
BAD_MOUSE    Y G R E L R R M S D E F
BAK_MOUSE    V G R Q L A L I G D D I
BAXB_HUMAN   L S E C L K R I G D E L
BimS         I A Q E L R R I G D E F
HRK_HUMAN    T A A R L K A L G D E L
Egl-1        I G S K L A A M C D D F
Statistical representation (column 9): G: 5 -> 71%; S: 1 -> 14%; C: 1 -> 14%
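The statistical representation of one alignment column can be reproduced with a short Python sketch. The variable and function names are illustrative assumptions; the sequences are the seven aligned segments from the slide with the spaces removed.

```python
from collections import Counter

# The seven aligned segments from the slide, spaces removed.
ALIGNED_FAMILY = {
    "BID_MOUSE": "IARHLAQIGDEM",
    "BAD_MOUSE": "YGRELRRMSDEF",
    "BAK_MOUSE": "VGRQLALIGDDI",
    "BAXB_HUMAN": "LSECLKRIGDEL",
    "BimS": "IAQELRRIGDEF",
    "HRK_HUMAN": "TAARLKALGDEL",
    "Egl-1": "IGSKLAAMCDDF",
}


def column_frequencies(sequences, position):
    """Return {residue: (count, fraction)} for a 1-based alignment column."""
    residues = [seq[position - 1] for seq in sequences]
    counts = Counter(residues)
    total = len(residues)
    return {res: (n, n / total) for res, n in counts.items()}


column9 = column_frequencies(list(ALIGNED_FAMILY.values()), 9)
for residue, (count, fraction) in sorted(column9.items()):
    print(f"{residue}: {count} -> {fraction:.0%}")
```

Running the same function over every column gives the raw material for a position-specific matrix; a practical version would usually also add pseudocounts so that unseen residues do not get a probability of zero.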

Genomic sequence analysis
- Genome organization / gene structure.
- Comparing genome organization.
- Identifying regulatory modules.

Genome Browser / Map Viewer
- NCBI, Ensembl, species databases.
- Range selection, zoom in/out.
- Retrieving genomic sequences.
- Fastacmd
- Python script

Practice: retrieve a genomic sequence using the genome browser
1. Identify the range that you would like to retrieve (start and end positions) by clicking on the features in the map. It helps to use rounded positions (e.g. xx,xxx,000) for easy mapping back.
2. Enter the numbers in the data retrieval window.

Practice/observe: retrieve genomic sequences using fastacmd or a Python script
1. Fastacmd is a program distributed with BLAST for sequence retrieval. It takes input files of sequence IDs and has strict requirements on the database format.
2. FindSeq_WithID.py and FindSeq_Partialmatch.py are simple Python scripts that retrieve sequences based on the FASTA identification line (the text following the ">").
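The idea behind ID-based retrieval can be sketched as follows. This is not the course's FindSeq_WithID.py or FindSeq_Partialmatch.py, whose contents are not shown here; it is a minimal stand-in with hypothetical function names that treats the first word after ">" as the sequence ID and optionally allows partial matches against the whole identification line.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_sequences(lines, wanted_ids, partial=False):
    """Return {sequence_id: sequence} for records matching wanted_ids.

    Exact mode compares the first word of the header; partial mode
    accepts any header that contains one of the wanted strings.
    """
    wanted = set(wanted_ids)
    hits = {}
    for header, seq in read_fasta(lines):
        if not header:
            continue  # skip records with an empty identification line
        seq_id = header.split()[0]
        if partial:
            matched = any(w in header for w in wanted)
        else:
            matched = seq_id in wanted
        if matched:
            hits[seq_id] = seq
    return hits


# Usage with an open file (the path is a placeholder):
# with open("genome.fasta") as handle:
#     records = find_sequences(handle, ["seq0"])
```

Reading the file line by line keeps memory use low, which matters once the FASTA file is a whole genome rather than a handful of records.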

Practice: gene structure analysis using GENSCAN
1. Identify and save the DNA sequence file.
2. Upload it to the GENSCAN server at MIT or the Pasteur Institute.

GENSCAN result
Columns: Gn.Ex  Type  S  .Begin  ...End  .Len  Fr  Ph  I/Ac  Do/T  CodRg  P....  Tscr
Exon types listed: Term, Intr, Intr, Intr, Init

Basis of gene structure prediction
- GC content
- Promoter signal (e.g. TATA box)
- Splicing signal
- Translation initiation signal
- ...
Probability modeling; weighted scoring scheme; detection
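GC content is the simplest of the signals listed above, and a sliding-window version shows how a per-position feature can be derived from raw sequence. The function names, window size and step below are illustrative assumptions; gene finders combine such a feature with the other signals rather than using it alone.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window, step):
    """Return (start, gc_fraction) for each full window along the sequence."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


profile = windowed_gc("GGCCAATTGCGC", window=4, step=4)
```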

From data to model
>seq0
gtcttttttttaaCTTATTTGAAGGgcctcggtaaccg
>seq1
gaatataatgctttcttggtggtgggatcattttagggatt
ccgccctccTTTATAAAATACgcctagt
>seq2
gcgctttacttaaCGTACTAGAAGCtaga
>seq3
gttgtttgggttgaatccgTGCCTGAAAGTGaataattaga
cagaactat
actttggggactaagtcg
>seq4
gctttCATATGAATTCCtcttcgtcggtaatcatgtataag
gtaaattct
taacacgg
>seq5
caactacaagAGCGTATAAGGGctcgggaacccgaagacgg
tgagacatt
>...
TATA-containing core promoter sequences
Model (probability values not preserved):
###MATCH_STATE
# Symbol A probability
# Symbol C probability
# Symbol G probability
# Symbol T probability
###MATCH_STATE
# Symbol A probability
# Symbol C probability
# Symbol G probability
# Symbol T probability
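One way to picture the step from sequences to per-position symbol probabilities is to count symbols column by column over the upper-case motif segments of the six records shown. This is a simplified sketch under several assumptions: the upper-case stretches are taken as already aligned, no pseudocounts are added, and the function names are hypothetical. A real profile HMM also has insert and delete states, which are not modeled here.

```python
# The six example records, written chunk by chunk as they appear on the slide.
PROMOTERS = [
    "gtcttttttttaaCTTATTTGAAGGgcctcggtaaccg",
    "gaatataatgctttcttggtggtgggatcattttagggatt" "ccgccctccTTTATAAAATACgcctagt",
    "gcgctttacttaaCGTACTAGAAGCtaga",
    "gttgtttgggttgaatccgTGCCTGAAAGTGaataattaga" "cagaactat" "actttggggactaagtcg",
    "gctttCATATGAATTCCtcttcgtcggtaatcatgtataag" "gtaaattct" "taacacgg",
    "caactacaagAGCGTATAAGGGctcgggaacccgaagacgg" "tgagacatt",
]


def uppercase_motif(seq):
    """Extract the upper-case (annotated motif) characters of a record."""
    return "".join(ch for ch in seq if ch.isupper())


def match_state_probabilities(motifs, alphabet="ACGT"):
    """Return one {symbol: probability} dict per motif column."""
    length = len(motifs[0])
    if any(len(m) != length for m in motifs):
        raise ValueError("motifs must all have the same length")
    table = []
    for i in range(length):
        column = [m[i] for m in motifs]
        table.append({s: column.count(s) / len(column) for s in alphabet})
    return table


motifs = [uppercase_motif(seq) for seq in PROMOTERS]
model = match_state_probabilities(motifs)
for position, probs in enumerate(model, start=1):
    print(position, {s: round(p, 2) for s, p in probs.items()})
```

With only six sequences many cells come out as exactly zero, which is why practical models add pseudocounts before taking logarithms for scoring.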

From data to model
>SNR17A_15_780119_780275_INTRON
GUAUGUAAUAUACCCCAAACAUUUUACCCACAAAAAACCAGGAUUUGAAA
ACUAUAGCAUCUAAAAGUCUUAGGUACUAGAGUUUUCAUUUCGGAGCAGG
CUUUUUGAAAAAUUUAAUUCAACCAUUGCAGCAGCUUUUGACUAACACAU
UCUACAG
>SNR17B_16_281502_281373_INTRON
GUAUGUUUUAUACCAUAUACUUUAUUAGGAAUAUAACAAAGCAUACCCAA
UAAUUAGGCAAUGCGAUUGUCGUAUUCAACAACCAUCUUCUAUUUCACCA
GCUUCAGGUUUUGACUAACACAUUCAACAG
>YAL001C_1_151163_147591_INTRON_71_160
GUAUGUUCAUGUCUCAUUCUCCUUUUCGGCUCCGUUUAGGUGAUAAACGU
ACUAUAUUGUGAAAGAUUAUUUACUAACGACACAUUGAAG
>YAL003W_1_142172_143158_INTRON_81_446
GUAUGUUCCGAUUUAGUUUACUUUAUAGAUCGUUGUUUUUCUUUCUUUUU
UUUUUUUCCUAUGGUUACAUGUAAAGGGAAGUUAACUAAUAAUGAUUACU
UUUUUUCGCUUAUGUGAAUGAUGAAUUUAAUUCUUUGGUCCGUGUUUAUG
AUGGGAAGUAAGACCCCCGAUAUGAGUGACAAAAGAGAUGUGGUUGACUA
UCACAGUAUCUGACGAUAGCACAGAGCAGAGUAUCAUUAUUAGUUAUCUG
UUAUUUUUUUUUCCUUUUUUGUUCAAAAAAAGAAAGACAGAGUCUAAAGA
>...
500 verified exon sequences
Modeling

Basis of gene structure prediction
- GC content
- Promoter signal (e.g. TATA box)
- Splicing signal
- Translation initiation signal
- ...
Final score / p value

Accuracy
- ME: missing exons
- WE: wrong exons
- Sn: sensitivity (fraction of the true exons that are found)
- Sp: specificity (fraction of the predicted exons that are true positives)
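These four measures can be computed from two lists of exon coordinates. The sketch below assumes the exon-level convention commonly used when evaluating gene finders: an exon counts as correct only if both boundaries match exactly, a missing exon is a true exon that overlaps no prediction, and a wrong exon is a prediction that overlaps no true exon. The function name and the coordinates in the usage example are made up for illustration.

```python
def exon_accuracy(actual, predicted):
    """Exon-level Sn, Sp, ME and WE from (start, end) pairs, ends inclusive."""
    actual, predicted = set(actual), set(predicted)
    if not actual or not predicted:
        raise ValueError("both exon sets must be non-empty")

    def overlaps(exon, others):
        start, end = exon
        return any(start <= o_end and o_start <= end for o_start, o_end in others)

    correct = actual & predicted  # both boundaries match exactly
    missing = [ex for ex in actual if not overlaps(ex, predicted)]
    wrong = [ex for ex in predicted if not overlaps(ex, actual)]
    return {
        "Sn": len(correct) / len(actual),
        "Sp": len(correct) / len(predicted),
        "ME": len(missing) / len(actual),
        "WE": len(wrong) / len(predicted),
    }


# Made-up coordinates, for illustration only.
true_exons = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted_exons = [(100, 200), (300, 420), (900, 950)]
scores = exon_accuracy(true_exons, predicted_exons)
```

Note that "specificity" in this convention is what other fields call precision: it is measured against the predictions, not against the true negatives.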

DNA pattern: transcription factor binding sites
[Position frequency matrix with columns A, C, G, T and Consensus; most counts were lost in extraction. Consensus column: N G R C W T G Y C Y]

Stringency of the matrices
[Two position frequency matrices with columns A, C, G, T and Consensus, labeled P53_01 and P53_02; most counts were lost in extraction.]
Consensus, 10 bp: N G R C W T G Y C Y
Consensus, 20 bp: G G A C A T G C C C G G G C A T G T C Y
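The difference in stringency can be made concrete by counting how many distinct DNA sequences each consensus string accepts. The sketch below uses the two consensus strings legible on the slide and the standard IUPAC nucleotide codes. Matching a consensus string is a simplification of scoring with the full matrix, only the forward strand is scanned, and the function names are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degeneracy(consensus):
    """Number of distinct sequences that match an IUPAC consensus."""
    total = 1
    for code in consensus.upper():
        total *= len(IUPAC[code])
    return total


def find_sites(consensus, sequence):
    """Return 0-based start positions of forward-strand consensus matches."""
    pattern = re.compile("".join("[" + IUPAC[c] + "]" for c in consensus.upper()))
    sequence = sequence.upper()
    last_start = len(sequence) - len(consensus)
    return [i for i in range(last_start + 1) if pattern.match(sequence, i)]


SHORT_CONSENSUS = "NGRCWTGYCY"           # 10 bp
LONG_CONSENSUS = "GGACATGCCCGGGCATGTCY"  # 20 bp

print(degeneracy(SHORT_CONSENSUS), degeneracy(LONG_CONSENSUS))
print(find_sites(SHORT_CONSENSUS, "ttAGACATGCCCtt"))
```

The short, degenerate pattern accepts many more sequences than the long one, so it finds more candidate sites in a genome, including more false positives.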

Comparing genomes
- For understanding genome organization.
- For identifying functionally conserved regions / sequences.
  - 3’ and 5’ UTRs (e.g. microRNA binding sites)
  - Transcription factor binding sites / regulatory modules.

VISTA Genome Browser
Practice & observe: cross-genome comparison using the VISTA browser

Cautions with genome browsers and descriptions of genomic sequences
- Coordinates change with every release/build of a genome; refer to the genome release in your work and publications.
- Predicted gene structure ≠ verified gene structure.

Identifying conserved regulatory modules
Regulatory module: a set of TF binding sites that controls a particular aspect of transcriptional regulation.
Functional requirement -> conservation at the binding site (sequence) level.

Ways to identify conserved regulatory modules
Based on sequence similarity:
- MEME
- rVista, whole-genome rVista for model organisms
- ...
Based on binding site identity:
- BLISS

Practice: Identify conserved TFBSs upstream of the human TNF gene.

Practice: Identify conserved TFBSs upstream of the human TNF gene.
- Use precompiled TFBS conservation data.
- Load genomic sequence.

Practice: Load the BED file of TF binding sites into the UCSC Genome Browser.
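For reference, a BED file is tab-separated text with chromosome, start, end and optional name, score and strand columns. Start positions are 0-based and end positions are exclusive, so 1-based inclusive coordinates need their start shifted by one. The sketch below writes such a file; the function names, track name and example coordinates are placeholders, not real TNF binding sites.

```python
def to_bed_line(chrom, start_1based, end_1based, name, score=0, strand="+"):
    """Convert 1-based inclusive coordinates to one six-column BED line."""
    fields = [chrom, str(start_1based - 1), str(end_1based), name, str(score), strand]
    return "\t".join(fields)


def write_bed(path, sites, track_name="TFBS"):
    """Write a custom-track header followed by one BED line per site."""
    with open(path, "w") as handle:
        handle.write(f'track name="{track_name}"\n')
        for site in sites:
            handle.write(to_bed_line(*site) + "\n")


# Placeholder coordinates, for illustration only.
example_sites = [
    ("chr6", 1001, 1010, "site_1"),
    ("chr6", 1051, 1070, "site_2", 0, "-"),
]
# write_bed("tfbs_sites.bed", example_sites)
```

Because coordinates depend on the genome build, the build selected in the browser has to match the one the sites were computed on.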

Large data set analysis. Hardware considerations:
1.) Data storage.
- FASTA record of a protein (1,000 aa): ~1 KB.
- Human proteome, or chromosome 21: ~50 MB.
- Human genome: ~1.5 GB.
- HTS transcriptome analysis (4 x 40 million reads each), original and derived data sets: ~200 GB.

Large data set analysis. Hardware considerations:
2.) Processors and RAM.
- Comparison: tblastn of 5 protein sequences against a 1.2 GB genome takes ~15 sec of CPU time; mapping a single 10 M read Illumina run to the human genome takes ~15,000 CPU sec (> 4 hours).
- RAM < data size will greatly slow down the process.

Large data set analysis. Hardware considerations:
3.) Operating system determines the availability of tools.
- Linux is the default development system for most bioinformatics groups. It is also the OS of the UFHPC.
- Easy control and automation.
- Portable to Mac OS X, but often requires recompiling the source code.

Observe: demanding computation for large data set analysis.

Practice: log into UFHPC. First step