Sequence Similarity The bioinformatics for molecular biologists lecture series
Sequence similarity: one of the two major database searching strategies. Types of sequence similarity comparison: accurate alignment of two sequences; heuristic comparison between one query sequence and a database of target sequences; alignment of multiple sequences of the same biological function or biological origin.
Why similarity: Similarity and Homology. Similarity refers to the likeness or % identity between two sequences. Similarity means sharing a statistically significant number of bases or amino acids. Similarity does not imply homology. Homology refers to shared ancestry. Two sequences are homologous if they are derived from a common ancestral sequence. Homology implies similarity.
Similarity vs. Homology: Similarity can be quantified. It is correct to say that two sequences are 30% identical. It is generally incorrect to say that two sequences are 30% similar. It is correct to say that two sequences have a similarity score of 200. The definition of the similarity score determines the “best” or “correct” alignment.
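Percent identity, as used above, can be computed directly from a pair of aligned sequences. The following is a minimal sketch, assuming that gaps are written as "-" and that columns containing a gap are excluded from the denominator (other conventions exist); the function name is illustrative and not taken from any library.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0   # columns with a residue in both sequences
    identical = 0  # columns where the two residues are the same
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


if __name__ == "__main__":
    # 7 of 8 aligned positions are identical.
    print(percent_identity("ACGTACGT", "ACGAACGT"))
```

Because the denominator depends on the chosen convention, a percent identity is only comparable between analyses that count columns the same way.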
Similarity vs. Homology: Homology cannot be quantified. If two sequences have a high percentage identity, it is reasonable to infer that they are homologous. It is incorrect to say two sequences are 40% homologous. It is incorrect to say two sequences have a homology score of 150. Two types of homology: orthologous and paralogous. Homology is usually inferred rather than observed.
Orthologs: Ortho = exact. Orthologs are the result of speciation, for example hemoglobin A in human and in mouse. When genes are orthologous, the history of the gene reflects the history of the species. Orthology usually implies conserved function.
Paralogs: Para = in parallel. Paralogs are homologous sequences that arose by a mechanism such as gene duplication. For example, hemoglobin A and B are paralogous: both copies have descended side by side during the history of an organism. Paralogs have distinct but related functions.
Analogous
Homology vs. Analogy. Homology: similarity in characteristics resulting from shared ancestry. Analogy: similarity of structure between two species that are not closely related, attributable to convergent evolution.
Summary of concepts: Similarity is an observation. Homology is a biological relationship. Similarity is a result of homology or analogy. Molecular similarity is likely a result of homology. Molecular similarity is frequently used to infer a biological relationship, namely homology.
How to measure similarity Score an alignment Scoring matching position Scoring gaps Total score = score of matching positions + score of gaps
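The rule "total score = score of matching positions + score of gaps" can be sketched as follows. This is a minimal illustration, assuming a simple match/mismatch scheme and a fixed cost per gap position; the default values (+1, -1, -2) are hypothetical and are not taken from the lecture.

```python
def score_alignment(aligned_a: str, aligned_b: str,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum per-column scores of an alignment in which gaps are written as '-'."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            total += gap        # gap column
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # substitution
    return total


if __name__ == "__main__":
    # One internal gap of length 3 lets all 13 letters of the shorter word match.
    print(score_alignment("APPLESANDORANGES", "APPLES---ORANGES"))
    # Padding at the end instead leaves 7 mismatches after "APPLES".
    print(score_alignment("APPLESANDORANGES", "APPLESORANGES---"))
```

With real sequences the match and mismatch values would come from a scoring matrix rather than two constants.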
Scoring matrix For DNA
Scoring protein alignment E: Glutamic Acid Q: Glutamine T: Threonine
Point Accepted Mutation (PAM): Margaret Dayhoff et al. (1970s) were the first to assemble sequences into a protein sequence atlas, grouping them into families and superfamilies based on sequence similarity. Tables of the frequency of changes (mutations) observed in the sequences of a family were derived. PAM tables record the percentage of amino acid mutations accepted by evolutionary selection, and show the probability that one amino acid changes into any other in these families. A score above zero assigned to two amino acids indicates that they replace each other more often than expected by chance alone, i.e., they are functionally exchangeable. A negative score (below zero) indicates that the two amino acids are rarely interchangeable, e.g., a basic amino acid for an acidic one, or one with an aromatic side chain for one with an aliphatic side chain. 1 PAM corresponds to an average change in 1% of amino acid positions. 100 PAMs (the 1 PAM matrix raised to the power of 100) does not mean every residue is changed, because some positions change more than once or change back.
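To see why 100 PAMs does not mean that every residue has changed, the repeated application of a 1 PAM step can be sketched with a toy model. The model below is an assumption made for illustration only: all 20 amino acids are treated as equally likely to mutate, and each mutates to any of the other 19 with equal probability. The real Dayhoff matrices use observed, residue-specific frequencies, so the numbers here are not the published PAM values.

```python
def fraction_unchanged(n_pam: int, n_residue_types: int = 20,
                       rate: float = 0.01) -> float:
    """Probability that a site shows its original residue after n_pam steps
    of a toy uniform mutation model (not the real Dayhoff matrix)."""
    p_same = 1.0
    # Chance that a site which already differs mutates back to the original.
    back = rate / (n_residue_types - 1)
    for _ in range(n_pam):
        p_same = p_same * (1.0 - rate) + (1.0 - p_same) * back
    return p_same


if __name__ == "__main__":
    for n in (1, 100, 250):
        print(n, round(fraction_unchanged(n), 3))
```

Even in this simplified model a substantial fraction of positions still shows the original residue after 100 steps, because repeated and reverse changes at the same position do not add new differences.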
PAM250
BLOSUM (Blocks Substitution Matrix): These are substitution matrices derived from the observed frequencies of amino acid replacements in highly conserved regions of ungapped local alignments (Henikoff and Henikoff, PNAS 1992). The number indicates the percent identity threshold within the set, e.g., BLOSUM62 is built from blocks in which sequences sharing 62% identity or more are clustered together. The data for the substitution scores in these matrices come from about 2000 blocks of aligned sequence segments characterizing more than 500 groups of related proteins [ Ref: Henikoff, S., and Henikoff, J. G. (1992) Proc. Natl. Acad. Sci. USA 89: ]. The BLAST server from NCBI and the search servers from EBI use different versions of the BLOSUM matrix for protein similarity searches and alignments.
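Substitution scores of this kind are log-odds values: the observed frequency of a residue pair is compared with the frequency expected if the two residues were paired by chance. The sketch below assumes scores in half-bit units (scale factor 2 with a base-2 logarithm) and uses made-up frequencies; the function name and all numbers are hypothetical and are not values from the BLOSUM data.

```python
import math


def log_odds_score(observed_pair_freq: float, freq_a: float, freq_b: float,
                   scale: float = 2.0) -> int:
    """Rounded log-odds score for an ordered residue pair (a, b).

    observed_pair_freq: how often a is aligned with b in the training alignments.
    freq_a, freq_b: background frequencies of the two residues.
    """
    expected = freq_a * freq_b  # pairing frequency expected by chance alone
    return round(scale * math.log2(observed_pair_freq / expected))


if __name__ == "__main__":
    # Pair seen 4 times more often than chance: positive score.
    print(log_odds_score(0.01, 0.05, 0.05))
    # Pair seen 4 times less often than chance: negative score.
    print(log_odds_score(0.000625, 0.05, 0.05))
```

A positive score therefore means "observed more often than expected by chance", and a negative score means "observed less often".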
Choice of scoring matrices For DNA Identity matrix For protein
Scoring gaps: Introducing gaps may improve the alignment.
Without gaps:
APPLESANDORANGES
||||||
APPLESORANGES
With one gap:
APPLESANDORANGES
||||||   |||||||
APPLES---ORANGES
Introducing too many gaps is not meaningful.
Without gaps:
ATCCTACTCATCAT
|||     |  |
ATCTACTACTACTG
With many gaps:
ATCCTACTCA-T-CAT-
||| |||| | | | |
ATC-TACT-ACTAC-TG
Affine gap penalty: Penalty = a + bx, where a and b are constants, x is the gap length, and a is usually larger than b. Typical (a, b): protein (11, 1), DNA (5, 2).
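The affine penalty a + bx can be written out as a short sketch. The function names are illustrative; the parameter pairs (11, 1) and (5, 2) are the typical values quoted above, and gaps are assumed to be written as runs of "-".

```python
import re


def affine_gap_penalty(gap_length: int, gap_open: int, gap_extend: int) -> int:
    """Penalty a + b*x for a single gap of length x (0 if there is no gap)."""
    if gap_length <= 0:
        return 0
    return gap_open + gap_extend * gap_length


def total_gap_penalty(aligned: str, gap_open: int, gap_extend: int) -> int:
    """Sum the affine penalties of every run of '-' in one aligned sequence."""
    return sum(affine_gap_penalty(len(run), gap_open, gap_extend)
               for run in re.findall(r"-+", aligned))


if __name__ == "__main__":
    # One gap of length 3 with protein-style parameters: 11 + 1*3.
    print(total_gap_penalty("APPLES---ORANGES", 11, 1))
    # Three separate gaps of length 1 with DNA-style parameters: 3 * (5 + 2*1).
    print(total_gap_penalty("ATCCTACTCA-T-CAT-", 5, 2))
```

Because the opening cost a is paid once per gap, one long gap is cheaper than several short gaps covering the same number of positions, which discourages the scattered gaps shown in the second example.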
Sequence alignment: Find the best alignment in terms of score. Types of sequence similarity comparison: accurate alignment of two sequences; heuristic comparison between one query sequence and a database of target sequences; alignment of multiple sequences of the same biological function or biological origin.
Accurate alignment: Global alignment (Needleman-Wunsch) aligns two complete sequences. Local alignment (Smith-Waterman) aligns the most similar fragments of two sequences. Both are based on dynamic programming, have O(MN) complexity for sequences of lengths M and N, and give the optimal solution.
Global Alignment: The earliest pairwise alignment method. Suited to easily detectable similarity along the entire length of the sequences, e.g., trypsin and quinone oxidoreductase / zeta crystallin alignments. Needleman-Wunsch algorithm (1970). Optimised over the entire length of the query sequence.
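A global alignment by dynamic programming can be sketched as below. This is a simplified illustration in the spirit of Needleman-Wunsch, assuming a match/mismatch score and a linear (per-position) gap cost instead of a substitution matrix and affine gaps; the default values are hypothetical.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    best, top, bottom = needleman_wunsch("APPLESANDORANGES", "APPLESORANGES")
    print(best)
    print(top)
    print(bottom)
```

The table has (M+1) x (N+1) cells and each cell is filled in constant time, which is where the O(MN) cost comes from.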
Local Alignment: Many proteins are composed of mosaics of domains. The gene sequences match at the domain level rather than over the entire length of the protein; they share sequence similarity in localised regions, primarily at the domains. Local alignment algorithm used: Smith-Waterman (1981). Optimised for locally optimal alignments. Most useful for database searching.
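Local alignment differs from global alignment in two ways: a cell score is never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The sketch below is a simplified illustration in the spirit of Smith-Waterman, again assuming match/mismatch scores and a linear gap cost with hypothetical default values; the example strings are invented.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                          # start a new local alignment
                              score[i - 1][j - 1] + sub,  # extend diagonally
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero-score cell is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    # Only the shared "domain" in the middle of each string is reported.
    print(smith_waterman("XXXXDOMAINYYYY", "ZZDOMAINZZZ"))
```

The unrelated flanking regions are simply left out of the result, which is the behaviour wanted when two proteins share a single domain.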
Heuristic algorithm BLAST – Basic local alignment search tool
Types of BLAST: blastn searches a nucleotide database using a nucleotide query; blastp searches a protein database using a protein query; blastx searches a protein database using a translated nucleotide query; tblastn searches a translated nucleotide database using a protein query; tblastx searches a translated nucleotide database using a translated nucleotide query.
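The five program types can be summarised as a small lookup table keyed by the kind of query and the kind of database. This is only a sketch of the table above: the dictionary and function names are hypothetical and do not correspond to any BLAST API.

```python
# (query type, database type) as supplied by the user, before any translation.
BLAST_PROGRAMS = {
    "blastn": ("nucleotide", "nucleotide"),
    "blastp": ("protein", "protein"),
    "blastx": ("nucleotide", "protein"),      # query is translated
    "tblastn": ("protein", "nucleotide"),     # database is translated
    "tblastx": ("nucleotide", "nucleotide"),  # both are translated
}


def candidate_programs(query_type: str, database_type: str) -> list:
    """List the BLAST programs that accept this query/database combination."""
    return sorted(name for name, types in BLAST_PROGRAMS.items()
                  if types == (query_type, database_type))


if __name__ == "__main__":
    print(candidate_programs("nucleotide", "protein"))
    print(candidate_programs("nucleotide", "nucleotide"))
```

For a nucleotide query against a nucleotide database there are two candidates; the choice depends on whether the comparison should be made at the nucleotide level (blastn) or at the level of the translated proteins (tblastx).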
Blastn: blastn is the general algorithm. Megablast compares a query to closely related sequences; it is very fast and works best when the target percent identity is 95% or more. Discontiguous megablast is intended for cross-species comparisons of the same gene.
Blastp: blastp is the general algorithm. PSI-BLAST allows the user to build a PSSM (position-specific scoring matrix) from the results of the first blastp run. PHI-BLAST combines BLAST with a motif scan. DELTA-BLAST first searches against CDD (the Conserved Domain Database) to get a PSSM, then uses it to search the entire database.
Typical blast result Domain hit
Typical blast result Sequence hit
Typical blast result Sequence hit
Blast Search Parameters A BLAST search can be limited to the result of an Entrez query against the database chosen.
Search scope limit by query
protease NOT hiv1[organism]
1000:2000[slen] (sequence length)
Mus musculus[organism] AND biomol_mrna[properties]
10000:100000[mlwt] (molecular weight)
all[filter] NOT environmental sample[filter] NOT metagenomes[orgn]
BLAST heads up: For short amino acid sequences of length 20-40, 50% identity can occur by chance. Similarity can be present even in the absence of homology, for example in low-complexity, transmembrane, and coiled-coil regions.
More details Choice of programs The blast document The statistics behind blast scores
Next generation sequencing: Millions of short reads. The entire human genome for $5000. Wide applications. 454: longer reads. Illumina: shorter reads but higher throughput. Single-end / paired-end.
Genome re-sequencing: Genetic variation detection. Single nucleotide polymorphism. Copy number variation. 1000 human genomes and 1001 Arabidopsis genomes. Focused re-sequencing. Exon capture chips. ChIP-seq. Methylomes.
Transcriptome sequencing: Expression level. ncRNA transcripts. Novel transcripts. Alternative splicing. ENCODE project (The Encyclopedia of DNA Elements).
Next generation sequencing analysis: Aligning short but similar sequence reads to a reference genome. BWA and Bowtie. FASTQ (fq) and SAMtools. Tablet.
Fastq: First line: sequence ID, starting with “@”. Second line: sequence. Third line: “+”, optionally followed by anything. Fourth line: quality, one character per base. For Illumina, base-call accuracy is Q30: 99.9%; Q20: 99%; Q10: 90%.
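The four-line record layout and the quality scale can be sketched as follows. A Phred score Q corresponds to an error probability of 10^(-Q/10), which gives the accuracies listed above. The sketch assumes the Phred+33 character encoding (quality character = Q + 33), that each record occupies exactly four lines, and an invented example read; the function names are illustrative.

```python
import io


def phred_to_accuracy(q: int) -> float:
    """Base-call accuracy for Phred quality q: 1 - 10^(-q/10)."""
    return 1.0 - 10 ** (-q / 10.0)


def read_fastq(handle, offset: int = 33):
    """Yield (read_id, sequence, qualities) from a four-line-per-record FASTQ file."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        seq = handle.readline().strip()
        plus = handle.readline().strip()
        qual = handle.readline().strip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, [ord(c) - offset for c in qual]


if __name__ == "__main__":
    example = io.StringIO("@read1\nACGT\n+\nI5+?\n")  # hypothetical record
    for read_id, seq, quals in read_fastq(example):
        print(read_id, seq, quals)
        print([round(phred_to_accuracy(q), 4) for q in quals])
```

Older Illumina pipelines used a different offset, so the `offset` argument should be checked against the data source before trusting the decoded values.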
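SAM tools: SAM: Sequence Alignment/Map. BAM: compressed binary SAM file. Fields of an alignment line: read name (QNAME); flag for matching status (FLAG); target name (RNAME); position (POS); mapping confidence (MAPQ); CIGAR string, e.g. 49M for 49 bp matched; RNEXT (reference name of the paired read); PNEXT (position of the paired read); TLEN (template length); Seq; Quality; optional tags.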
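The fields listed above are tab-separated on each alignment line, so a line can be split into a small record. The sketch below assumes the eleven mandatory SAM columns followed by optional tags and uses an invented alignment line; the class and function names are illustrative and are not part of SAMtools.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag describing the match status
    rname: str   # target (reference) name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping confidence
    cigar: str   # e.g. "49M" = 49 bases aligned
    rnext: str   # reference name of the paired read
    pnext: int   # position of the paired read
    tlen: int    # template length
    seq: str     # read sequence
    qual: str    # per-base quality string
    tags: list   # optional fields, left unparsed


def parse_sam_line(line: str) -> SamRecord:
    """Split one SAM alignment line (not a '@' header line) into its fields."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])


def is_unmapped(rec: SamRecord) -> bool:
    return bool(rec.flag & 0x4)   # flag bit 0x4: read is unmapped


def is_reverse_strand(rec: SamRecord) -> bool:
    return bool(rec.flag & 0x10)  # flag bit 0x10: read maps to the reverse strand


if __name__ == "__main__":
    # Hypothetical 49 bp read mapped to the reverse strand of "chr1".
    line = "\t".join(["read1", "16", "chr1", "100", "60", "49M", "*", "0", "0",
                      "A" * 49, "I" * 49, "NM:i:0"])
    rec = parse_sam_line(line)
    print(rec.qname, rec.rname, rec.pos, rec.cigar, is_reverse_strand(rec))
```

In practice an established SAM/BAM library would normally be used instead of a hand-written parser; the sketch is only meant to make the column layout concrete.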
Tablet
Summary Similarity search