Generic substitution matrix-based sequence similarity evaluation
Alignment with a gap:
Q:   M  A  T  W  L  I
A:   M  A  -  W  T  V
Scr: 5  4  -? 11 -1 3
Total = 22 - ? (the gap penalty)
Alignment without a gap:
Q:   M  A  T  W  L  I
A:   M  A  W  T  V  A
Scr: 5  4  -2 -2 1  -1
Total: 5
BLOSUM62: Gap opening: -6 ~ -15; Gap extension: -2 ~ -6
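The scoring of the two alignments above can be sketched in a few lines of Python. This is only an illustration: the dictionary holds just the BLOSUM62 entries needed for these residue pairs, the function name `score_alignment` is made up for this example, and the default gap penalties (-10 to open, -2 to extend) are arbitrary picks from the ranges quoted on the slide.

```python
# Subset of BLOSUM62 limited to the residue pairs used in the two example
# alignments (keys are alphabetically sorted residue pairs).
BLOSUM62_SUBSET = {
    ("M", "M"): 5, ("A", "A"): 4, ("W", "W"): 11,
    ("L", "T"): -1, ("I", "V"): 3, ("T", "W"): -2,
    ("L", "V"): 1, ("A", "I"): -1,
}


def score_alignment(query, subject, gap_open=-10, gap_extend=-2):
    """Sum substitution scores over aligned columns; '-' marks a gap.

    Assumed convention: the first position of a gap costs gap_open and
    each further position of the same gap costs gap_extend.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    in_gap = False
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += BLOSUM62_SUBSET[tuple(sorted((q, s)))]
            in_gap = False
    return total


gapped = score_alignment("MATWLI", "MA-WTV")    # 22 plus the gap penalty
ungapped = score_alignment("MATWLI", "MAWTVA")  # no gap, substitutions only
print(gapped, ungapped)
```

With any gap opening penalty in the quoted range the gapped alignment still scores higher than the ungapped one, which is why the aligner prefers to open a gap here.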
Position-specific matrices reflect the structure-function relationship of a given protein family
BID_MOUSE   I A R H L A Q I G D E M
BAD_MOUSE   Y G R E L R R M S D E F
BAK_MOUSE   V G R Q L A L I G D D I
BAXB_HUMAN  L S E C L K R I G D E L
BimS        I A Q E L R R I G D E F
HRK_HUMAN   T A A R L K A L G D E L
Egl-1       I G S K L A A M C D D F
Statistical representation (ninth alignment column): G: 5 -> 71%; S: 1 -> 14%; C: 1 -> 14%
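The statistical representation above is a per-column residue frequency. A minimal sketch, assuming the seven aligned segments are typed in exactly as shown on the slide (the names `BH3_ALIGNMENT` and `column_frequencies` are illustrative, and no pseudocounts or background correction are applied, unlike a real position-specific scoring matrix):

```python
from collections import Counter

# The seven aligned segments from the slide, one string per sequence.
BH3_ALIGNMENT = {
    "BID_MOUSE":  "IARHLAQIGDEM",
    "BAD_MOUSE":  "YGRELRRMSDEF",
    "BAK_MOUSE":  "VGRQLALIGDDI",
    "BAXB_HUMAN": "LSECLKRIGDEL",
    "BimS":       "IAQELRRIGDEF",
    "HRK_HUMAN":  "TAARLKALGDEL",
    "Egl-1":      "IGSKLAAMCDDF",
}


def column_frequencies(alignment):
    """Return one {residue: frequency} dict per alignment column."""
    rows = list(alignment.values())
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for i in range(length):
        counts = Counter(row[i] for row in rows)
        profile.append({aa: n / len(rows) for aa, n in counts.items()})
    return profile


profile = column_frequencies(BH3_ALIGNMENT)
# Ninth column (index 8) is the one summarized on the slide.
for residue, freq in profile[8].items():
    print(f"{residue}: {freq:.0%}")
```

Columns that are identical in every sequence (for example the fifth column, always L) come out with a single residue at frequency 1.0, which is the kind of position a family-specific matrix weights most heavily.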
Genomic sequence analysis Genome organization/gene structure. Comparing genome organization. Identifying regulatory modules.
Genome Browser / Map Viewer
o NCBI, Ensembl, species databases.
o Range selection, zoom in/out.
o Retrieving genomic sequences.
o fastacmd
o Python script
Practice: retrieve genomic sequence using the genome browser
1. Identify the range that you would like to retrieve (start and end positions) by clicking on the features in the map. It helps to use rounded positions (e.g. xx,xxx,000) for easy mapping back.
2. Enter the start and end positions in the data retrieval window.
Practice/observe: retrieve genomic sequence using fastacmd or a Python script
1. fastacmd is a program distributed together with BLAST for sequence retrieval. It takes input files of sequence IDs and has strict requirements on the database format.
2. FindSeq_WithID.py and FindSeq_Partialmatch.py are simple Python scripts for retrieving sequences based on the FASTA identification line (the text following the ">").
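The idea behind retrieving records by their identification line can be sketched as follows. This is not the code of FindSeq_WithID.py or FindSeq_Partialmatch.py, which is not shown in these slides; `read_fasta` and `find_sequences` are hypothetical names, and the in-memory demo records are invented so the example runs without any file.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_sequences(handle, wanted_ids, partial=False):
    """Return {header: sequence} for records whose ID matches.

    Exact mode compares the first word after '>' with the wanted IDs;
    partial mode accepts any header that contains a wanted string.
    """
    wanted = set(wanted_ids)
    hits = {}
    for header, seq in read_fasta(handle):
        words = header.split()
        seq_id = words[0] if words else ""
        if partial:
            matched = any(w in header for w in wanted)
        else:
            matched = seq_id in wanted
        if matched:
            hits[header] = seq
    return hits


# Invented demo records standing in for a real FASTA file.
demo = io.StringIO(">seq0 demo record\nACGT\nACGT\n>seq1\nGGCC\n>seq10\nTTAA\n")
exact = find_sequences(demo, ["seq1"])
demo.seek(0)
partial = find_sequences(demo, ["seq1"], partial=True)
print(exact)
print(partial)
```

The demo shows why the two modes differ: an exact match on "seq1" returns one record, while a partial match also picks up "seq10". For a real file, pass `open("genome.fa")` (a placeholder name) instead of the `io.StringIO` object.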
Practice: Gene structure analysis using GENSCAN
1. Identify and save the DNA sequence file.
2. Upload it to a GENSCAN server (MIT or the Pasteur Institute).
GENSCAN Result
Output columns: Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr
Exon types in the example output: Term, Intr, Intr, Intr, Init
Basis of Gene structure prediction
o GC content
o Promoter signal (e.g. TATA box)
o Splicing signal
o Translation initiation signal
o …..
Probability modeling
Weighted scoring scheme
Detection
From data to model
>seq0
gtcttttttttaaCTTATTTGAAGGgcctcggtaaccg
>seq1
gaatataatgctttcttggtggtgggatcattttagggatt
ccgccctccTTTATAAAATACgcctagt
>seq2
gcgctttacttaaCGTACTAGAAGCtaga
>seq3
gttgtttgggttgaatccgTGCCTGAAAGTGaataattaga
cagaactat
actttggggactaagtcg
>seq4
gctttCATATGAATTCCtcttcgtcggtaatcatgtataag
gtaaattct
taacacgg
>seq5
caactacaagAGCGTATAAGGGctcgggaacccgaagacgg
tgagacatt
>………
……………………….
TATA-containing core promoter sequences
###MATCH_STATE
# Symbol A probability
# Symbol C probability
# Symbol G probability
# Symbol T probability
###MATCH_STATE
# Symbol A probability
# Symbol C probability
# Symbol G probability
# Symbol T probability
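The step from the annotated sequences to per-position symbol probabilities can be sketched as below. It assumes the uppercase run in each record marks the motif (as in the six example sequences) and uses a simple add-one pseudocount, which is a choice made for this illustration and not necessarily what the modeling software on the slide does. The names `PROMOTERS`, `extract_motif` and `position_probabilities` are hypothetical.

```python
import re

# The six example records from the slide, with wrapped lines joined.
PROMOTERS = {
    "seq0": "gtcttttttttaaCTTATTTGAAGGgcctcggtaaccg",
    "seq1": "gaatataatgctttcttggtggtgggatcattttagggatt"
            "ccgccctccTTTATAAAATACgcctagt",
    "seq2": "gcgctttacttaaCGTACTAGAAGCtaga",
    "seq3": "gttgtttgggttgaatccgTGCCTGAAAGTGaataattaga"
            "cagaactatactttggggactaagtcg",
    "seq4": "gctttCATATGAATTCCtcttcgtcggtaatcatgtataag"
            "gtaaattcttaacacgg",
    "seq5": "caactacaagAGCGTATAAGGGctcgggaacccgaagacgg"
            "tgagacatt",
}


def extract_motif(seq):
    """Return the single uppercase run that marks the motif."""
    runs = re.findall(r"[ACGT]+", seq)
    if len(runs) != 1:
        raise ValueError("expected exactly one uppercase motif per sequence")
    return runs[0]


def position_probabilities(motifs, pseudocount=1.0):
    """Return one {base: probability} dict per motif position."""
    length = len(motifs[0])
    if any(len(m) != length for m in motifs):
        raise ValueError("all motif instances must have the same length")
    matrix = []
    for i in range(length):
        column = [m[i] for m in motifs]
        total = len(column) + 4 * pseudocount
        matrix.append(
            {b: (column.count(b) + pseudocount) / total for b in "ACGT"}
        )
    return matrix


motifs = [extract_motif(s) for s in PROMOTERS.values()]
matrix = position_probabilities(motifs)
for pos, probs in enumerate(matrix, start=1):
    row = "  ".join(f"{b}={probs[b]:.2f}" for b in "ACGT")
    print(f"position {pos:2d}: {row}")
```

Each printed row corresponds to one match state: four symbol probabilities that sum to 1. With only six training sequences the pseudocount has a large effect; with hundreds of verified sequences it matters much less.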
From data to model
>SNR17A_15_780119_780275_INTRON
GUAUGUAAUAUACCCCAAACAUUUUACCCACAAAAAACCAGGAUUUGAAA
ACUAUAGCAUCUAAAAGUCUUAGGUACUAGAGUUUUCAUUUCGGAGCAGG
CUUUUUGAAAAAUUUAAUUCAACCAUUGCAGCAGCUUUUGACUAACACAU
UCUACAG
>SNR17B_16_281502_281373_INTRON
GUAUGUUUUAUACCAUAUACUUUAUUAGGAAUAUAACAAAGCAUACCCAA
UAAUUAGGCAAUGCGAUUGUCGUAUUCAACAACCAUCUUCUAUUUCACCA
GCUUCAGGUUUUGACUAACACAUUCAACAG
>YAL001C_1_151163_147591_INTRON_71_160
GUAUGUUCAUGUCUCAUUCUCCUUUUCGGCUCCGUUUAGGUGAUAAACGU
ACUAUAUUGUGAAAGAUUAUUUACUAACGACACAUUGAAG
>YAL003W_1_142172_143158_INTRON_81_446
GUAUGUUCCGAUUUAGUUUACUUUAUAGAUCGUUGUUUUUCUUUCUUUUU
UUUUUUUCCUAUGGUUACAUGUAAAGGGAAGUUAACUAAUAAUGAUUACU
UUUUUUCGCUUAUGUGAAUGAUGAAUUUAAUUCUUUGGUCCGUGUUUAUG
AUGGGAAGUAAGACCCCCGAUAUGAGUGACAAAAGAGAUGUGGUUGACUA
UCACAGUAUCUGACGAUAGCACAGAGCAGAGUAUCAUUAUUAGUUAUCUG
UUAUUUUUUUUUCCUUUUUUGUUCAAAAAAAGAAAGACAGAGUCUAAAGA
>………
……………………….
500 verified exon sequences
Modeling
Basis of Gene structure prediction
o GC content
o Promoter signal (e.g. TATA box)
o Splicing signal
o Translation initiation signal
o …..
Final score / p value
Accuracy
ME: Missing Exons
WE: Wrong Exons
Sn: Sensitivity (find the right one)
Sp: Specificity (true positive)
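The four accuracy measures can be made concrete with a small sketch. It assumes the exon-level definitions commonly used when evaluating gene finders: a predicted exon is correct only if both boundaries match an annotated exon, Sn is the fraction of annotated exons predicted correctly, Sp is the fraction of predicted exons that are correct, ME is the fraction of annotated exons with no overlapping prediction, and WE is the fraction of predicted exons with no overlapping annotation. The slide does not spell these out, so treat the definitions, the function names and the coordinates as assumptions.

```python
def overlaps(a, b):
    """True if closed intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_level_accuracy(actual, predicted):
    """Compute Sn, Sp, ME and WE from lists of (start, end) exon coordinates."""
    actual, predicted = set(actual), set(predicted)
    if not actual or not predicted:
        raise ValueError("need at least one annotated and one predicted exon")
    correct = actual & predicted  # exact match on both boundaries
    missing = [a for a in actual
               if not any(overlaps(a, p) for p in predicted)]
    wrong = [p for p in predicted
             if not any(overlaps(p, a) for a in actual)]
    return {
        "Sn": len(correct) / len(actual),
        "Sp": len(correct) / len(predicted),
        "ME": len(missing) / len(actual),
        "WE": len(wrong) / len(predicted),
    }


# Invented coordinates for illustration only.
annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 410), (900, 950)]
result = exon_level_accuracy(annotated, predicted)
print(result)
```

In this toy case the second prediction overlaps a real exon but misses its end boundary, so it counts neither as correct nor as wrong; that is why Sn and ME (or Sp and WE) do not simply add up to 1.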
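DNA Pattern – Transcription factor binding sites
[Figure: position weight matrix with columns A, C, G, T and Consensus; the per-position counts are not recoverable. Consensus column as read: N G R C W T G Y C Y]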
Stringency of the matrices
[Figure: two position weight matrices, labeled P53_01 and P53_02, each with columns A, C, G, T and Consensus; most per-position counts are not recoverable. Consensus columns as read: N G R C W T G Y C Y (10 bp) and G G A C A T G C C C G G G C A T G T C Y (20 bp)]
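One way to see why a longer matrix is more stringent is to reduce each matrix to its IUPAC consensus and ask how often a random sequence would match it. The sketch below takes the two consensus strings as read off the Consensus columns of the slide (NGRCWTGYCY and GGACATGCCCGGGCATGTCY), assumes equal base frequencies, and ignores the graded scores a real weight matrix would give; all function names are illustrative.

```python
import re

# IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "K": "GT", "M": "AC", "N": "ACGT",
}

CONSENSUS_10BP = "NGRCWTGYCY"
CONSENSUS_20BP = "GGACATGCCCGGGCATGTCY"


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus into a compiled regular expression."""
    return re.compile("".join(f"[{IUPAC[c]}]" for c in consensus))


def random_match_probability(consensus):
    """Chance that a random site matches, assuming equal base frequencies."""
    p = 1.0
    for c in consensus:
        p *= len(IUPAC[c]) / 4
    return p


def find_sites(consensus, sequence):
    """Return 0-based start positions of all (overlapping) matches."""
    pattern = consensus_to_regex(consensus)
    k = len(consensus)
    return [i for i in range(len(sequence) - k + 1)
            if pattern.fullmatch(sequence, i, i + k)]


for name, cons in [("10 bp", CONSENSUS_10BP), ("20 bp", CONSENSUS_20BP)]:
    p = random_match_probability(cons)
    print(f"{name}: about 1 random match per {1 / p:,.0f} positions")

hits = find_sites(CONSENSUS_10BP, "TTAGACATGCCCTT")
print(hits)
```

Under these assumptions the 10 bp consensus is expected to match by chance many times in a genome-sized sequence, while the 20 bp consensus almost never does, so the short matrix finds more candidate sites at the cost of more false positives.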
Comparing genomes
o For understanding genome organization.
o For identifying functionally conserved regions / sequences:
  3' and 5' UTRs (e.g. microRNA binding sites)
  Transcription factor binding sites / regulatory modules.
Vista Genome Browser
Practice & Observe: cross-genome comparison using the Vista browser
Cautions with genome browsers and descriptions of genomic sequences
o Coordinates change with every release/build of the genome: refer to the genome release in your work and publications.
o Predicted gene structure ≠ verified gene structure.
Identifying conserved regulatory modules
o Regulatory module: a set of TF binding sites that controls a particular aspect of transcriptional regulation.
o Functional requirements lead to conservation at the binding site (sequence) level.
Ways to identify conserved regulatory modules
o Based on sequence similarity: MEME, rVista, whole-genome rVista for model organisms …
o Based on binding site identity: BLISS
Practice: Identify conserved TFBSs upstream of the human TNF gene.
Practice: Identify conserved TFBSs upstream of the human TNF gene.
o Use precompiled TFBS conservation data.
o Load genomic sequence.
Practice: Load the BED file of TF binding sites into the UCSC Genome Browser.
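For reference, a BED file for a custom track is a tab-delimited text file with an optional track line, where each feature gives chromosome, 0-based start, end (exclusive), name, score (0 to 1000) and strand. The sketch below writes such a file from a list of sites. The site names, scores and coordinates are placeholders invented for illustration (they are not real TNF binding sites), and `format_bed` is a hypothetical helper.

```python
def format_bed(sites, track_name="TFBS_demo", description="TF binding sites"):
    """Return BED text (track line plus six-column features) for the sites.

    Each site is (chrom, start, end, name, score, strand) with a 0-based
    start and an end that is not included in the feature.
    """
    lines = [f'track name="{track_name}" description="{description}"']
    for chrom, start, end, name, score, strand in sites:
        if not 0 <= start < end:
            raise ValueError(f"bad coordinates for {name}: {start}-{end}")
        if strand not in ("+", "-"):
            raise ValueError(f"bad strand for {name}: {strand}")
        lines.append(
            "\t".join([chrom, str(start), str(end), name, str(score), strand])
        )
    return "\n".join(lines) + "\n"


# Placeholder sites: coordinates and names are made up for this example.
example_sites = [
    ("chr6", 1000, 1010, "site_A", 800, "+"),
    ("chr6", 1250, 1262, "site_B", 650, "-"),
]
bed_text = format_bed(example_sites)
print(bed_text, end="")

# To save the track for upload, write the text to a file, for example:
# with open("tfbs_demo.bed", "w") as out:
#     out.write(bed_text)
```

Because coordinates change between genome builds, the BED file only lines up with the browser display when it is loaded against the same assembly the coordinates were taken from.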
Large Data Set Analysis. Hardware considerations:
1.) Data storage.
FASTA record of a protein (1,000 aa) ~ 1 KB.
Human proteome, or Chromosome 21 ~ 50 MB
Human genome ~ 1.5 GB
HTS transcriptome analysis (4 sets of 40 million reads each), original and derived data sets ~ 200 GB
Large Data Set Analysis. Hardware considerations:
2.) Processors and RAM.
Comparison: tblastn of 5 protein sequences against a 1.2 GB genome, ~15 sec CPU time. Mapping a single Illumina run of 10 M reads to the human genome, ~15,000 CPU sec (> 4 hours).
RAM smaller than the data size will greatly slow down the process.
Large Data Set Analysis. Hardware considerations:
3.) Operating system determines the availability of tools.
Linux is the default development system for most bioinformatics groups. It is also the OS of the UFHPC.
Easy control and automation.
Tools are portable to Mac OS X, but this often requires recompiling the source code.
Observe: demanding computation for large data set analysis.
Practice: log into UFHPC. First step