3.31.2005BIOL497 Undergraduate Presentation, Stanislav Luban, Member of Kihara Lab, Purdue Univ. 1 Stanislav Luban 1,2 Daisuke Kihara 2,1 1. Department.

Presentation transcript:

BIOL497 Undergraduate Presentation, Stanislav Luban, Member of Kihara Lab, Purdue Univ.

Slide 1: Comparative Study of Small RNA and Small Peptides in Complete Genome Sequences
Stanislav Luban (1,2), Daisuke Kihara (2,1)
1. Department of Computer Sciences
2. Department of Biological Sciences
Purdue University, West Lafayette, IN

Slide 2: Introduction: Structural Small RNA (sRNA)
sRNA genes produce non-coding transcripts that function directly as structural, regulatory, or catalytic RNAs.
They include rRNAs, tRNAs, small nucleolar RNAs, spliceosomal RNAs, viral associated RNAs, microRNAs, ctRNAs, and others.
The Rfam (RNA families) database stores sRNA entries distributed among 352 known families.
In E. coli, about 50 sRNAs are known.
(Figure from the Rfam database)

Slide 3: Methods: QRNA
QRNA models three distinctive patterns of mutation:
Conserved structural RNA: pattern of compensatory mutations consistent with a base-paired secondary structure. Modeled by a pair stochastic context-free grammar.
Conserved coding region: pattern of synonymous codon substitutions. Modeled by a pair hidden Markov model.
Other types of conserved regions: approximated by a "null hypothesis" that mutations occur position-independently, without pattern. Modeled by a pair hidden Markov model.
The scores are log likelihoods, which are used to calculate a final log-odds score for the RNA model compared to the other two models.
(Figure: Rivas et al., Current Biol. 2001)
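The slide does not give the formula that turns the three log likelihoods into the final score. As a rough sketch only, the following assumes the RNA model is compared against whichever competing model fits better; the function names and the example values are hypothetical and are not taken from QRNA itself.

```python
def rna_log_odds(loglik_rna, loglik_cod, loglik_oth):
    """Log-odds of the RNA model against the better of the two other models.

    Assumption: all three inputs are log likelihoods in the same units.
    This is an illustrative combination, not QRNA's exact computation.
    """
    return loglik_rna - max(loglik_cod, loglik_oth)


def best_model(loglik_rna, loglik_cod, loglik_oth):
    """Name of the model with the highest log likelihood."""
    scores = {"RNA": loglik_rna, "COD": loglik_cod, "OTH": loglik_oth}
    return max(scores, key=scores.get)
```

With hypothetical inputs of -100.0 (RNA), -108.0 (coding), and -105.0 (null), the RNA model wins by 5.0 units.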

Slide 4: Procedure for Extracting sRNAs
1. Extract intergenic regions from 30 sequenced genomes.
2. Perform all vs. all nucleotide-nucleotide BLAST.
3. Select significant alignments, then concatenate and format them into QRNA program input.
4. Run QRNA and extract alignments scoring as sRNAs vs. coding and null hypothesis regions.
5. Eliminate alignment regions which overlap more than 50% with E. coli regulatory regions.
6. Extend regions that lie within 25 nt of other regions so that they include each other.
7. Merge sRNA regions which align or exactly overlap into families.
8. Eliminate family regions not found using both the query and the database organism as source.
9. Verify results computationally and experimentally (yet to be done).
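The overlap filter and the 25 nt extension step can be illustrated with plain interval arithmetic. This is a minimal sketch, assuming regions are (start, end) pairs in inclusive genome coordinates on one sequence and that the 50% overlap is measured against each regulatory region separately; the function names are hypothetical and this is not the presenter's code.

```python
def overlap_fraction(region, other):
    """Fraction of `region` that is covered by `other` (inclusive coordinates)."""
    start, end = region
    o_start, o_end = other
    shared = min(end, o_end) - max(start, o_start) + 1
    return max(shared, 0) / (end - start + 1)


def drop_regulatory(regions, regulatory, max_fraction=0.5):
    """Keep regions that overlap no regulatory region by more than max_fraction."""
    return [
        r for r in regions
        if all(overlap_fraction(r, reg) <= max_fraction for reg in regulatory)
    ]


def merge_nearby(regions, max_gap=25):
    """Join regions whose separation is at most max_gap nt (overlaps join too)."""
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] <= max_gap:
            # Extend the previous region so it includes this one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Following the order of the steps, `drop_regulatory` would be applied first and `merge_nearby` afterwards.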

Slide 5: Genome Data Set
30 microbial genomes used as queries and databases:
Gammaproteobacteria: Acinetobacter calcoaceticus, Blochmannia floridanus, Buchnera aphidicola, Coxiella burnetii, Erwinia carotovora, Escherichia coli, Haemophilus ducreyi, Haemophilus influenzae, Pasteurella multocida, Photorhabdus luminescens, Pseudomonas aeruginosa, Pseudomonas putida, Pseudomonas syringae, Salmonella enterica, Salmonella typhimurium, Shewanella oneidensis, Shigella flexneri, Vibrio cholerae, Vibrio parahaemolyticus, Vibrio vulnificus, Wigglesworthia brevipalpis, Xanthomonas campestris, Xanthomonas citri, Xylella fastidiosa, Yersinia pestis
Alphaproteobacteria: Agrobacterium tumefaciens, Brucella melitensis, Caulobacter crescentus, Mesorhizobium loti
Deinococci: Deinococcus radiodurans

Slide 6: Result Statistics
Total number of intergenic regions:
Average number of intergenic regions per organism:
Total combined length of intergenic regions: nt
Average length of intergenic region: nt
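Statistics of this kind follow directly from the gene annotations of each genome. The sketch below is one hedged way to derive them, assuming genes are given as (start, end) pairs in inclusive coordinates and treating the chromosome as linear (wrap-around on circular chromosomes is ignored); `intergenic_regions` and `summarize` are hypothetical names.

```python
def intergenic_regions(genes, genome_length):
    """Return the gaps between annotated genes as (start, end) pairs."""
    gaps = []
    prev_end = 0
    for start, end in sorted(genes):
        if start > prev_end + 1:
            gaps.append((prev_end + 1, start - 1))
        # Overlapping or nested genes must not shrink the covered span.
        prev_end = max(prev_end, end)
    if prev_end < genome_length:
        gaps.append((prev_end + 1, genome_length))
    return gaps


def summarize(gaps):
    """Count, total length, and mean length of a list of regions."""
    lengths = [end - start + 1 for start, end in gaps]
    total = sum(lengths)
    mean = total / len(lengths) if lengths else 0.0
    return {"count": len(lengths), "total_nt": total, "mean_nt": mean}
```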

Slide 7: sRNA Length vs. Score Plot
Total: sRNAs

Slide 8: Number of sRNA Entries by Organism
Total: sRNAs
1 - Pseudomonas putida
2 - Shigella flexneri
3 - Xanthomonas citri
4 - Shewanella oneidensis
5 - Wigglesworthia brevipalpis
6 - Haemophilus ducreyi
7 - Pseudomonas syringae
8 - Erwinia carotovora
9 - Escherichia coli
10 - Vibrio parahaemolyticus
11 - Mesorhizobium loti
12 - Buchnera aphidicola
13 - Brucella melitensis
14 - Yersinia pestis
15 - Xylella fastidiosa
16 - Pseudomonas aeruginosa
17 - Salmonella enterica
18 - Caulobacter crescentus
19 - Agrobacterium tumefaciens
20 - Blochmannia floridanus
21 - Pasteurella multocida
22 - Deinococcus radiodurans
23 - Vibrio cholerae
24 - Photorhabdus luminescens
25 - Coxiella burnetii
26 - Vibrio vulnificus
27 - Salmonella typhimurium
28 - Acinetobacter calcoaceticus
29 - Xanthomonas campestris
30 - Haemophilus influenzae

Slide 9: Conservation of sRNAs
Total: 3768 families

Slide 10: Conservation of sRNAs
Total: 3768 families
E. coli total: 554 families
Statistics for families containing at least one entry from E. coli are shown alongside the statistics for all entries, for comparison.

Slide 11: Common Organism Combinations in Families
Top 5 most frequent combinations of 4 and 7 organisms (combination: occurrences):
Combinations of 4 organisms:
Ecoli, Senterica, Sflexneri, Styphimurium: 117
Ecarotovora, Ecoli, Senterica, Styphimurium: 26
Ecoli, Senterica, Styphimurium, Ypestis: 20
Ecarotovora, Ecoli, Sflexneri, Styphimurium: 18
Ecoli, Sflexneri, Styphimurium, Ypestis: 17
Combinations of 7 organisms:
Ecarotovora, Ecoli, Pluminescens, Senterica, Sflexneri, Styphimurium, Ypestis: 4
Acalcoaceticus, Ccrescentus, Mloti, Paeruginosa, Pputida, Psyringae, Xcampestris: 2
Acalcoaceticus, Atumefaciens, Ccrescentus, Mloti, Pputida, Psyringae, Xcampestris: 2
Acalcoaceticus, Atumefaciens, Ccrescentus, Mloti, Paeruginosa, Psyringae, Xcampestris: 2
Acalcoaceticus, Atumefaciens, Ccrescentus, Mloti, Paeruginosa, Pputida, Xcampestris: 2
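Combination counts like those on this slide can be tallied by reducing each family to the set of organisms it contains. The sketch below assumes each family is available as a list of organism names, one per entry; the function name and the input layout are hypothetical.

```python
from collections import Counter


def top_combinations(families, size, n=5):
    """Most frequent organism combinations among families with `size` organisms.

    Each family is reduced to a sorted tuple of distinct organism names so
    that entry order and repeated organisms do not affect the count.
    """
    counts = Counter(
        tuple(sorted(set(members)))
        for members in families
        if len(set(members)) == size
    )
    return counts.most_common(n)
```

Calling `top_combinations(families, 4)` and `top_combinations(families, 7)` would give the two halves of such a table.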

Slide 12: Result Verification
A total of 71 sRNAs related to E. coli that were already annotated in the Rfam database were used as a benchmark. Of those:
15: found by the computational method, also listed in Rfam, and not tRNAs
6: not found due to shortcomings of the method
29: tRNAs already annotated as gene loci in the E. coli genome sequence used
10: E. coli plasmid loci not found in the full E. coli genome sequence used
2: 4.5S RNAs already annotated as gene loci in the E. coli genome sequence used
2: E. coli reverse transcriptase loci not found in the full E. coli genome sequence used
1: E. coli insertion sequence not found in the full E. coli genome sequence used
1: E. coli small RNA annotated separately, not found in the full E. coli genome sequence used
1: antisense RNA already annotated as a gene locus in the E. coli genome sequence used
1: cloning vector with E. coli promoter, not found in the full E. coli genome sequence used
1: E. coli transposable element not found in the full E. coli genome sequence used
1: reporter vector not found in the full E. coli genome sequence used
1: E. coli retron not found in the full E. coli genome sequence used

Slide 13: Candidates for Experimental Verification of Findings
For the following 2 slides:
The family designation is expressed as [Organism name] [locus absolute start location] [locus absolute end location] and is synonymous with the first (header) entry of that family.
Entries is the number of sRNA entries from different organisms in the family (2 chromosomes of one organism are counted separately).
Length (nt) and score refer only to the header entry of the family.
Scores are calculated by the QRNA program as the log odds post for the RNA likelihood as opposed to the null hypothesis.

Slide 14: Candidates for Experimental Verification of Findings
Top 10 highest statistically scoring E. coli sRNA loci found by the computational method:
Family designation: Ecoli  Length: 133  Score:
Family designation: Ecoli  Length: 100  Score:
Family designation: Ecoli  Length: 193  Score:
Family designation: Ecoli  Length: 152  Score:
Family designation: Ecoli  Length: 200  Score:
Family designation: Ecoli  Length: 63  Score:
Family designation: Ecoli  Length: 63  Score:
Family designation: Ecoli  Length: 28  Score:
Family designation: Ecoli  Length: 69  Score:
Family designation: Ecoli  Length: 26  Score:

Slide 15: Candidates for Experimental Verification of Findings
Top 10 largest sRNA families found by the computational method:
Family designation: Styphimurium  Entries: 18  Length: 38  Score:
Family designation: Ecarotovora  Entries: 15  Length: 37  Score:
Family designation: Ecarotovora  Entries: 12  Length: 20  Score:
Family designation: Styphimurium  Entries: 12  Length: 95  Score:
Family designation: Ecarotovora  Entries: 10  Length: 59  Score:
Family designation: Paeruginosa  Entries: 9  Length: 18  Score:
Family designation: Styphimurium  Entries: 8  Length: 28  Score:
Family designation: Styphimurium  Entries: 8  Length: 17  Score:
Family designation: Ecarotovora  Entries: 8  Length: 31  Score:
Family designation: Ecarotovora*  Entries: 7  Length: 146  Score:
*This last entry was used as a sample for detailed study and is discussed subsequently.

Slide 16: Detailed Study of Located Sample sRNA
Columns: Organism, Location (in genome), Length (nt), Score, Neighboring Genes
Ecarotovora  rpsM - rpmJ
Pluminescens  rpsM - secY
Ypestis  rpmJ - rpsM
Styphimurium  rpsM - rpmJ
Senterica  rpmJ - rpsM
Ecoli  rpsM - rpmJ
Sflexneri  rpsM - rpmJ
Hit to Alpha_RBS RNA (Rfam: RF00140) (115 nt)
Rfam sequence:
GUCCUUGAUAUUCUGUUUGAGUAUCCUGAAAACGGGCUUUUCAAGAUCAGAAUAUCAAAUUA
AUUAAAAUAUAGGAGUGCAUAGUGGCCCGUAUUGCAGGCAUUAACAUUCCUGAU

Slide 17: Detailed Study of Located Sample sRNA
Most likely (lowest free energy) predicted fold of an 80 nt segment of the sequence.
Mfold by Zuker et al., 2004 was used.

Slide 18: Another Approach to Finding sRNAs in E. coli: Paper Summary

Slide 19: Method Used in Paper to Find Putative sRNAs
A database of all E. coli intergenic DNA sequences was created based on gene annotations in an early release of the EcoGene database. It was used as input to a profile search program (pftools2.2, Swiss Institute of Bioinformatics) set to find sigma-70 promoters.
The terminator motif was searched for in the database using the following criteria: (1) an 11-nt A-rich region; (2) a variable-length hairpin; (3) a variable-length spacer; (4) a 5-nt T-rich region nearest the hairpin; and (5) a 7-nt distal extra T-rich region.
Predicted promoter and terminator pairs were combined to generate putative sRNAs if (1) the pair was on the same strand; and (2) the pair was greater than 45 but less than 350 nt apart.
To verify, open reading frames and possible ribosome binding sites were searched for downstream of each promoter.
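The pairing rule for promoters and terminators can be sketched as follows. This is an illustration under stated assumptions: each predicted signal is reduced to a hypothetical (position, strand) pair, and the terminator is additionally required to lie downstream of the promoter on that strand, which the slide implies but does not state.

```python
def pair_signals(promoters, terminators, min_dist=45, max_dist=350):
    """Combine same-strand promoter/terminator predictions into candidates.

    promoters, terminators: lists of (position, strand) with strand "+" or "-".
    Returns (promoter_pos, terminator_pos, strand) for every pair whose
    separation is greater than min_dist and less than max_dist.
    """
    pairs = []
    for p_pos, p_strand in promoters:
        for t_pos, t_strand in terminators:
            if p_strand != t_strand:
                continue
            # Assumption: "downstream" means increasing coordinates on "+"
            # and decreasing coordinates on "-".
            dist = t_pos - p_pos if p_strand == "+" else p_pos - t_pos
            if min_dist < dist < max_dist:
                pairs.append((p_pos, t_pos, p_strand))
    return pairs
```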

Slide 20: Synopsis of Method Used in Paper
Using the E. coli MG1655 genome, the paper searched for DNA regions that contained a sigma-70 promoter within a short distance of a rho-independent terminator.
The paper predicted 227 putative sRNAs between 80 and 400 nt in length in E. coli, 32 of which were already known to be sRNAs.
Transcripts of some of the candidate loci were verified using Northern hybridization.
The approach may be usable for annotating sRNA loci in other bacterial genomes.

Slide 21: Verification of Paper Results with Results Using Our Method
Along with other results, the paper gives a detailed listing of the 227 sRNAs predicted, including the designation, strand orientation (forward or reverse), left and right boundaries (nt from genome start position), and length (nt) of each sRNA.
The left and right boundary positions given by the paper were compared with the left and right boundary positions of putative sRNAs found by our method.
If an sRNA candidate from the paper was within 100 nt of any sRNA predicted by our method, that sRNA was scored as 'found'.
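The 'found' criterion amounts to a distance test between intervals. The slide does not say exactly how the distance between boundaries was measured, so this sketch assumes it is the gap between the two regions (zero when they overlap); the function names are hypothetical.

```python
def interval_distance(a, b):
    """Gap in nt between two (left, right) regions; 0 if they overlap or touch."""
    (a_left, a_right), (b_left, b_right) = a, b
    return max(0, max(a_left, b_left) - min(a_right, b_right))


def count_found(candidates, predictions, threshold=100):
    """Number of candidates lying within `threshold` nt of any prediction."""
    return sum(
        any(interval_distance(c, p) <= threshold for p in predictions)
        for c in candidates
    )
```

Re-running `count_found` with a different `threshold` shows how sensitive the comparison is to the cutoff.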

Slide 22: Results of Verification
The paper predicted 227 candidate sRNAs in E. coli.
Among them, 150 (66.1%) were located by our method under the 100 nt criterion described previously.
The test was re-run with a 50 nt threshold, yielding 140 hits (61.7%); a 10 nt threshold, yielding 128 hits (56.4%); and a 1000 nt threshold, yielding 187 hits (82.4%).
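The percentages on this slide are each hit count divided by the 227 candidates. A short check of that arithmetic, using only the numbers quoted above (the variable and function names are illustrative):

```python
TOTAL_CANDIDATES = 227

# Distance threshold (nt) -> number of candidates scored as found.
HITS_BY_THRESHOLD = {10: 128, 50: 140, 100: 150, 1000: 187}


def recovery_rates(hits, total):
    """Percentage of candidates recovered at each threshold, to one decimal."""
    return {thr: round(100 * n / total, 1) for thr, n in hits.items()}


rates = recovery_rates(HITS_BY_THRESHOLD, TOTAL_CANDIDATES)
for threshold in sorted(rates):
    print(f"{threshold:>5} nt: {rates[threshold]}%")
```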

Slide 23: Preliminary Procedure for Extracting Small Peptides
1. Extract intergenic regions from 30 sequenced genomes.
2. Perform all vs. all nucleotide-nucleotide BLAST.
3. Select significant alignments, then concatenate and format them into QRNA program input.
4. Run QRNA and extract alignments scoring as coding vs. sRNA and null hypothesis regions.
5. Extend regions that lie within 25 nt of other regions so that they include each other.
6. Merge sRNA regions which align or exactly overlap into families.
7. BLAST the resulting family entries against the SwissProt database.
8. Score regions based on quality of fit inside a nearby open reading frame.
9. Observe results and refine the extraction method.

Slide 24: Preliminary Results of Small Peptide Search
Query sequence information
Columns: Organism, Location (in genome), Length (nt), E-Value
Organism: Erwinia carotovora
Aligns to: gb|AAF | flagelliform silk protein [Nephila madagascariensis]
Sequence:
aattccgtcgcatgttctctggtgagtacgacagcgcggattgctatctggatattcaggcgggatctggcggtacgg
aagcgcaggactgggccagcatgctggtacgtatgtacctgcgttgggcggaagc
Tblastx alignment
Query: 133 LPPNAGTYVPACWPSPALPYRQIPPEYPDSNP 38
Subject: 1373 LPPLXTSXXPPPPPPPSXPLXSLPPSXPPSLP 1278

Slide 25: Preliminary Results of Small Peptide Search
Query sequence information
Columns: Organism, Location (in genome), Length (nt), E-Value
Organism: Pseudomonas syringae
Aligns to: emb|CAD | C. elegans GRL-25 protein (corresponding sequence ZK 643.8)
Sequence:
tgagttccggcagctcgtcatccagcttctgacgcaaccgcccggtcagaaacgcaaagccctcgagcaaccgct
ccacatccggatcccgtccggcctgccccagaaacggcgccaacgccggactacgctcggcgaagcgacgac
caagctggcgcagtgcagtgagttcgctctggtagtaatggttaaaggacacgggttacctgc
Tblastx alignment
Query: 62 PRATAPHPDPVRPAPETAPTP 124
Subject: 90 PPAPAPRPPPVAPAPRPLPPP 28

Slide 26: Conclusions
Possible sRNAs are found in 20 to 39% of the intergenic regions in each organism.
Among them, about 31% of the sRNAs have a log-odds score of 5.0 or higher.
137 "families" are conserved in 5 or more organisms.
Being well conserved, sRNAs may be responsible for fundamental functions of living organisms.

Slide 27: Future Direction
The search for sRNAs will be expanded to a larger number of more diverse genomes.
Secondary structure prediction will later be employed in greater detail to verify well-conserved sRNA regions among multiple evolutionarily distant organisms.
Experimental verification of the findings of this study is under way (particularly for Shewanella oneidensis).
Comparative genomics will be used to discover the function associated with each sRNA and possibly to learn its part in a pathway.