Basic Overview of Bioinformatics Tools and Biocomputing Applications I
Dr Tan Tin Wee, Director, Bioinformatics Centre
Software Tools
Data generated by machines: DNA/protein sequencers, automated systems
Data stored in retrievable forms in database systems
[Diagram labels: Biological Data; Automated Machines; Research Labs; Databases; Analytical Tools; New Knowledge]
Common Computational Analyses
Sequence assembly
Simple sequence analysis
  – Translation and reverse complement, ORF
  – Composition statistics (protein and DNA)
  – Molecular mass
  – Total charge and pI; local hydropathy
  – Simple determination of secondary structures
  – Restriction site analysis
  – Internal repeat analysis
Detection of active sites, functional residues, characteristic structures, substrates, and processing signals
Common Computational Analyses
Database sequence search
Multiple alignment
Secondary (2°) and tertiary (3°) structure prediction; transmembrane helix detection
Structure modeling
Docking prediction and design
Hidden Markov model searches
Sequence Assembly
Fragmented data from DNA sequencers
Detection of overlap
Merging of contigs
Assembly into continuous sequence
[Figure: fragments assembled into one sequence, oriented 5' to 3']
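The overlap-and-merge steps above can be sketched as a greedy procedure. This is a minimal illustration, assuming error-free reads that all lie on the same strand; the function names and the `min_len` parameter are hypothetical, and real assemblers also handle sequencing errors, reverse complements, and repeats.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of contigs with the longest overlap."""
    contigs = list(reads)
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # no overlaps left: these are the final contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


reads = ["ATCTCGATAC", "GATACTACTA", "CTACTACG"]
print(greedy_assemble(reads))
```

With these three made-up reads the overlaps chain into a single contig.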
Sequence Format Interconversion
DNA, protein and other sequence data come in different formats
Annotations
Different programs use different formats
Interconversion utility tools, e.g. READSEQ, TOGCG, TOSTADEN, etc.
Simple Sequence Analysis
1. Take a linear sequence, e.g. DNA or protein
2. Open a window: n = 1, n = variable, or a sliding window
3. Calculate based on a list of criteria
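The three steps above can be sketched as a generic sliding-window scan. This is only an illustration: GC fraction stands in for the "list of criteria", and the function names and the `size` and `step` parameters are assumptions, not taken from any particular package.

```python
def sliding_windows(seq: str, size: int, step: int = 1):
    """Yield (start, window) pairs for every full window along seq."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


def gc_fraction(window: str) -> float:
    """Example criterion: fraction of G or C bases in the window."""
    return sum(base in "GC" for base in window) / len(window)


profile = [(start, gc_fraction(w))
           for start, w in sliding_windows("ATGCGCATTA", size=4, step=2)]
print(profile)
```

Any other per-window calculation (hydropathy, charge, structure propensity) can be swapped in for `gc_fraction`.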
Some Simple Sequence Analysis Applications
DNA complementary strand, e.g. COMPLEMENT and REVERSE
  – Open a window of size 1
  – A ---> T
  – C ---> G
  – T ---> A
  – G ---> C
  – Slide to the next window of 1
  – Proceed to the end of the sequence
  – Reverse the order of the complement
5'...ATCTCGATACTACTACG...3'
     |||||||||||||||||
3'...TAGAGCTATGATGATGC...5'
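A minimal sketch of the complement and reverse steps in Python, using the slide's example sequence. The function names are illustrative, and ambiguity codes such as N are left unchanged here.

```python
# Base-by-base lookup table, equivalent to a window of size 1.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq: str) -> str:
    """Complement each base, keeping the original left-to-right order."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Complement, then reverse, so the result reads 5' to 3'."""
    return complement(seq)[::-1]


top = "ATCTCGATACTACTACG"
print(complement(top))          # TAGAGCTATGATGATGC (written 3' to 5')
print(reverse_complement(top))  # CGTAGTAGTATCGAGAT (written 5' to 3')
```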
Some Simple Sequence Analysis Applications
DNA to protein sequence translation, e.g. TRANSLATE
  – Open a window of 3 bases
  – Look up the codon in the genetic code table
  – Assign the amino acid residue
  – Slide the window to the next 3 bases
  – Proceed until a stop codon is detected
  – Repeat the whole procedure for all six frames
ATACTACTGAGATCTAGGCTAGTACTGCGTGCG
Frames 1, 2 and 3 are read from the given strand; frames 4 to 6 are read from the complementary strand
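The translation procedure can be sketched as follows, assuming the standard genetic code and an uppercase DNA sequence. The function names and the `to_stop` flag are illustrative; codons containing unexpected characters are reported as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq: str, frame: int = 0, to_stop: bool = False) -> str:
    """Translate seq from offset `frame` (0, 1 or 2), three bases at a time."""
    protein = []
    for i in range(frame, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*" and to_stop:
            break  # stop codon detected
        protein.append(aa)
    return "".join(protein)


def six_frame_translation(seq: str) -> list[str]:
    """Frames 1-3 on the given strand, frames 4-6 on the reverse complement."""
    rev = seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    return [translate(s, f) for s in (seq, rev) for f in range(3)]


dna = "ATACTACTGAGATCTAGGCTAGTACTGCGTGCG"
for number, protein in enumerate(six_frame_translation(dna), start=1):
    print(f"Frame {number}: {protein}")
```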
Some Simple Sequence Analysis Applications
Detect open reading frames, e.g. ORF
  – Translate the sequence and report long stretches that run from a start codon to the next in-frame stop codon
Compositional analysis
  – e.g. calculate total A, T, G, C
  – e.g. calculate the total molecular mass of a protein and the percentage of each amino acid
  – e.g. total charge composition, pI
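A minimal sketch of ORF detection and base composition. It assumes the standard start codon ATG and the three standard stop codons, and scans only the three frames of the given strand; the function name and the `min_codons` cutoff are hypothetical.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 2) -> list[tuple[int, int, int]]:
    """Return (start, end, frame) for ATG...stop stretches; end is exclusive."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame + 1))
                start = None  # look for the next ORF in this frame
    return orfs


dna = "CCATGGCTTGGTAACC"
for start, end, frame in find_orfs(dna):
    print(f"frame {frame}: {dna[start:end]}")

# Composition statistics: total count of each base.
print(Counter(dna))
```

Scanning the reverse complement in the same way would cover frames 4 to 6.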
Some Simple Sequence Analysis Applications
Simple prediction of the secondary structure of a protein sequence
  – Decide on a window size
  – For each window of amino acids, compute the statistical potential to form helix, beta sheet, turn, etc. (Chou-Fasman, GOR and similar algorithms)
  – Use a statistical potential chart
  – Plot the potentials in graphical or pictorial format
Some Simple Sequence Analysis Applications
Restriction mapping, e.g. MAP, MAPPLOT, MAPSORT, PLASMIDMAP, etc.
  – Table of restriction enzymes and their recognition sites, e.g. EcoRI (GAATTC, leaving an AATT overhang), BamHI (GGATCC), AluI (AGCT)
  – Take a DNA sequence
  – Pattern match against the list of cut sites
  – For each match, assign the restriction enzyme
  – Calculate the distance between cut sites
  – Display as a table, graphically, or as a restriction map, etc.
[Figures: plasmid map; gel]
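The restriction mapping steps can be sketched as below. This is an illustration for a linear molecule only: the three enzymes are listed with their recognition sequence and the top-strand cut offset within it (G^AATTC, G^GATCC, AG^CT), the example sequence is made up, and the function names are hypothetical.

```python
# Enzyme name -> (recognition site, cut offset within the site on the top strand)
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
    "AluI": ("AGCT", 2),
}


def find_cuts(seq: str, enzymes=ENZYMES) -> list[tuple[int, str]]:
    """Pattern match every site and return sorted (cut position, enzyme)."""
    seq = seq.upper()
    cuts = []
    for name, (site, offset) in enzymes.items():
        pos = seq.find(site)
        while pos != -1:
            cuts.append((pos + offset, name))
            pos = seq.find(site, pos + 1)  # allow overlapping sites
    return sorted(cuts)


def fragment_lengths(seq: str, cuts: list[tuple[int, str]]) -> list[int]:
    """Distances between consecutive cut positions on a linear molecule."""
    edges = [0] + [pos for pos, _ in cuts] + [len(seq)]
    return [b - a for a, b in zip(edges, edges[1:])]


dna = "AAGAATTCTTAGCTTTGGATCCAA"
cuts = find_cuts(dna)
print(cuts)
print(fragment_lengths(dna, cuts))
```

For a circular plasmid the first and last fragments would be joined into one.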
Some Simple Sequence Analysis Applications
Protein sequence motif pattern matching, e.g. PROSITEMAP, MOTIFS, BLOCKS, etc.
  – Table or database of sequence patterns/motifs and their signature sequences, e.g. Arg-Gly-Asp (RGD), or consensus sequences (e.g. the PROSITE and BLOCKS databases)
  – Take a protein sequence
  – Pattern match against the list of signature sites
  – For each match, assign a potential function according to the database
  – Display in a table, graphically, or hyperlinked
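Motif scanning of this kind maps naturally onto regular expressions. The sketch below converts a subset of the PROSITE pattern syntax (x, [..], {..}, repeat counts, and the < and > anchors) into a Python regex; the two-entry motif table and the example protein are made up for illustration, and the function names are hypothetical.

```python
import re

# Illustrative motif table: name -> PROSITE-style pattern.
MOTIFS = {
    "RGD cell attachment": "R-G-D",
    "N-glycosylation": "N-{P}-[ST]-{P}",
}


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into regex syntax."""
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("x", ".")                      # any residue
    regex = regex.replace("{", "[^").replace("}", "]")   # forbidden residues
    regex = regex.replace("(", "{").replace(")", "}")    # repeat counts
    return regex.replace("<", "^").replace(">", "$")     # terminus anchors


def scan(protein: str, motifs=MOTIFS) -> list[tuple[int, str, str]]:
    """Return sorted (1-based position, motif name, matched residues)."""
    hits = []
    for name, pattern in motifs.items():
        # A lookahead lets overlapping occurrences be reported too.
        regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
        for match in regex.finditer(protein):
            hits.append((match.start() + 1, name, match.group(1)))
    return sorted(hits)


for position, name, residues in scan("MKRGDLNGSAAA"):
    print(position, name, residues)
```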
Some Simple Sequence Analysis Applications
Peptide cleavage maps, e.g. PEPTIDESORT, PEPTIDEMAP
  – Table of proteases and their cleavage sites, e.g. trypsin, chymotrypsin, and chemical cleavage agents such as cyanogen bromide
  – Pattern match against the entire protein sequence
  – Calculate the size of the peptide fragments
  – Sort and map; plot as electrophoretic patterns on a log-linear simulated digest
  – Compute partial digest patterns
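A complete digest can be sketched with one small helper. The rules used here are the commonly quoted ones (trypsin cuts after K or R unless the next residue is P; cyanogen bromide cuts after M); fragment size is reported in residues rather than mass, and the function names and example peptides are hypothetical.

```python
def cleave_after(protein: str, residues: str, blocked_by: str = "") -> list[str]:
    """Cut after any residue in `residues` unless the next one is in `blocked_by`."""
    fragments, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else None
        if aa in residues and (next_aa is None or next_aa not in blocked_by):
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])  # C-terminal remainder
    return fragments


def trypsin_digest(protein: str) -> list[str]:
    return cleave_after(protein, "KR", blocked_by="P")


def cnbr_digest(protein: str) -> list[str]:
    return cleave_after(protein, "M")


peptides = trypsin_digest("MAKRPLLKGGR")
# Sort by size, largest first, as a simulated gel would order them.
for peptide in sorted(peptides, key=len, reverse=True):
    print(len(peptide), peptide)
```

Partial digests could be simulated by joining runs of adjacent fragments.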
Some Simple Sequence Analysis Applications
DOTPLOT: self-comparison
  – Take a window size
  – Compare against the entire length of the sequence itself
  – Report matches above a threshold
  – Plot on a graph
  – Slide the window, repeat until the end of the sequence
  – Detection of internal repeats
Pairwise comparison: detection of homology
[Figure: dot plot; axis label: Sequence A]
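The windowed comparison behind a dot plot can be sketched as follows. The function name, the `window` and `threshold` parameters, and the example sequence are illustrative; a real program would draw the dots on a graph rather than print them.

```python
def dotplot(a: str, b: str, window: int = 4, threshold: int = 4) -> list[tuple[int, int]]:
    """Return (i, j) for each window pair with at least `threshold` identical positions."""
    dots = []
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(x == y for x, y in zip(a[i:i + window], b[j:j + window]))
            if matches >= threshold:
                dots.append((i, j))
    return dots


seq = "ACGTACGT"
dots = dotplot(seq, seq)  # self-comparison

# Crude text rendering: rows index the first sequence, columns the second.
for i in range(len(seq) - 3):
    print("".join("*" if (i, j) in dots else "." for j in range(len(seq) - 3)))
```

In a self-comparison the main diagonal is always filled; dots off the diagonal point to internal repeats. Passing two different sequences gives the pairwise comparison.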
Some Simple Sequence Analysis Applications
RNA secondary structure analysis
Mfold, PlotFold, FoldRNA, Squiggles, Circles, Domes, Mountains, StemLoop
Folding of RNA into stems and loops
Calculation of energy: prediction of the stability of the structure
Display of the structure and alternatives
[Figure: RNA sequence ...AUCGAAUCUC... folded into a stem-loop structure]
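As a rough illustration of how folding can be computed, the sketch below uses Nussinov-style base-pair maximization. This is a deliberately simplified stand-in: it counts base pairs instead of calculating free energy, so it is not the method used by Mfold or the other programs listed above. The function name and the `min_loop` parameter are assumptions.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs, with at least min_loop unpaired bases per hairpin."""
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]  # dp[i][j]: best pair count for seq[i..j]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


print(max_base_pairs("GGGAAAUCCC"))
```

Energy-based programs follow the same dynamic programming idea but score stacked pairs and loops with measured thermodynamic parameters.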
Database Searching
Text-based database searching: using a text string to match an annotation in a sequence database record, i.e. a keyword search
Sequence-based database searching: using a biological sequence to match all or part of it against the sequence in every record of a sequence database
Text-Based Database Searching
Examples: Entrez, SRS, DBGET, AceDB (common integrated database systems)
Search concepts
  – Boolean search: AND, OR, NOT
  – Broadening the search
  – Narrowing the search
  – Proximity searching, soundex
  – Wild card, stemming, e.g. Thala* for thalasemia, thalassemia, thalassemic
These use standard string search algorithms, Boolean operations, and vocabulary matches
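The Boolean and wildcard concepts can be sketched on a tiny in-memory set of annotation strings. The records below are made up for illustration, and the function names and the `all_of`, `any_of` and `none_of` parameters are hypothetical; they do not reflect the query syntax of Entrez, SRS, or any other system.

```python
from fnmatch import fnmatchcase

# Made-up annotation records, keyed by a hypothetical record ID.
RECORDS = {
    "rec1": "human period gene RIGUI",
    "rec2": "Drosophila per gene",
    "rec3": "beta thalassemia mutation in human globin",
    "rec4": "thalassemic patient cohort",
}


def term_matches(term: str, text: str) -> bool:
    """Case-insensitive whole-word match; * in the term acts as a wildcard."""
    term = term.lower()
    return any(fnmatchcase(word, term) for word in text.lower().split())


def boolean_search(records, all_of=(), any_of=(), none_of=()) -> list[str]:
    """all_of terms are ANDed, any_of terms are ORed, none_of terms are NOT."""
    hits = []
    for record_id, text in records.items():
        if not all(term_matches(t, text) for t in all_of):
            continue
        if any_of and not any(term_matches(t, text) for t in any_of):
            continue
        if any(term_matches(t, text) for t in none_of):
            continue
        hits.append(record_id)
    return hits


print(boolean_search(RECORDS, all_of=["human", "per"]))    # too narrow
print(boolean_search(RECORDS, all_of=["human", "per*"]))   # broadened with a wildcard
print(boolean_search(RECORDS, any_of=["thala*"]))
```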
Text-based Database Searching
Example: to find the human homolog of the Drosophila per gene
Procedure
  – Go to Entrez on the Web
  – All Fields: enter "human" "per"
  – The hits returned are irrelevant, so broaden the search
  – "human" "period": more hits
  – Check every one and find the human RIGUI gene
Hit and miss, clever guesswork; free-form or controlled vocabulary (MeSH terms)? Use Boolean searches?
Sequence-based Database Searching
Homology search
Global or local sequence alignment
Needleman-Wunsch algorithm
Smith-Waterman algorithm
Lipman-Pearson FASTA
Altschul's BLAST
Take a sequence and compare it pairwise with each sequence in the database
Sequence-based Database Searching
Basic assumptions:
Sequences of homologous genes/proteins diverge over time, even though structure and/or function change little
Significant sequence similarity is taken to imply potential structural/functional similarity or a common evolutionary origin
From a well-characterised protein, infer the function of an unknown sequence at the gene or protein sequence level
Sequence-based Database Searching
Global alignment forces a complete, end-to-end alignment of the two input sequences
Local alignment looks for local stretches of similarity and tries to align the most similar segments
The algorithms used may be similar, but the output differs; statistics are needed to assess the results
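The difference between the two modes shows up clearly in a small dynamic programming sketch. It computes only the optimal score (no traceback), in the style of Needleman-Wunsch for the global case and Smith-Waterman for the local case. The simple match/mismatch scores and the linear gap penalty are illustrative assumptions; real searches use substitution matrices such as PAM or BLOSUM and affine gap costs.

```python
def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap: int = -2, local: bool = False) -> int:
    """Optimal global score, or best local score when local=True."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment charges for gaps at the sequence ends.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


query, subject = "ACGT", "TTACGTTT"
print("global:", alignment_score(query, subject))
print("local: ", alignment_score(query, subject, local=True))
```

The same recurrence fills both tables; only the boundary conditions, the floor at zero, and where the final score is read differ. Here the flanking T residues drag the global score down, while the local score reflects only the shared ACGT segment.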
Sequence-based Database Searching
Alignment scoring
Substitution scores and substitution matrices: PAM, BLOSUM
Gap scores: gap penalties, affine gap costs
Optimal alignments by dynamic programming: Needleman-Wunsch algorithm, Smith-Waterman algorithm (SSEARCH)
Additional heuristics: FASTA, BLAST