Bioinformatics Workshop 1 Sequences and Similarity Searches Open a web browser and type in the URL: –informatics.gurdon.cam.ac.uk/online/workshops –Bookmark.

Bioinformatics Workshop 1 Sequences and Similarity Searches Open a web browser and type in the URL: –informatics.gurdon.cam.ac.uk/online/workshops –Bookmark this page Click on the link to the file: –useful-websites.html –Bookmark this page too –It also contains links to the example sequence files used in the workshop, and the presentations themselves

The Basic Questions Where, and how, do I find something? How do I know it’s real? Exercise 0: Write a concise definition of what a gene is.

Part 1: Structural Genomics DNA arranged in chromosomes Vertebrate ~ 10 9 base pairs

Chromosomes and Genes Total of ~30,000 genes on ~20 chromosomes 1000 – 2000 genes per chromosome

locus Gene to Protein ~ gene mRNA protein genome primary transcript

CTACCATCCATGCTAACCATTCTACTAGCATAACTGGCTA Sequence Signals CTACCATCCATGCTAACCATTCTACTAGCATAACTGGCTA mRNA MLTILAL

Genomic Signals transcription start site ===CGCTATAAGCG=== ================= ===CGCAATAAAGCG== ================= polyadenylation signal ===CACGATCGAGTC== ================= promoters enhancers ==ACGTA…………CAGTA== ================== splice sites

Derivative Sequences mRNA capture by cloning into cDNA library 3’ EST 5’ EST cDNA sequence EST: single pass sequence from each end of the clone cDNA: multiple pass sequencing over whole length of the clone 5’3’

Gene Models gene model exons

Sequences and Genes (Accession Numbers and Names) AAB22970.1 AAP21245.1 CAA41545.1 NP_187759.2 proteins S43105.1 mRNAs/cDNAs ‘similar to Cyclin B1 [mus musculus]’ gene BT006437.1 ‘Cyclin B1, isoform 1 [mus musculus]’ X58708.1 NM_111985.3 ‘CCNB1, Cyclin B1 [mus musculus]’ ‘Cyclin B1, isoform 2 [mus musculus]’

Gene Symbols, Names, Etc. Gene Symbol: CCNB1 Gene Name: cyclin B1 [Homo sapiens] Description: G2/mitotic-specific cyclin B1 Aliases:CCNB, CYCB1

A Gene-Centric View Entrez Gene http://www.ncbi.nlm.nih.gov/ Cyclin B1 S43105.1 BT006437.1 X58708.1 NM_111985.3 AAB22970.1 AAP21245.1 CAA41545.1 NP_187759.2 Exercise 1: Go to Entrez Gene and look for your favourite gene or genes. genomic location expression data

Sequences and Accession Numbers NM_001015922.1 gi=62860271 GATCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA BC009638.1 gi=16307106 GTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAA NM_001015922.2 gi=62860589 GACCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAA NP_001015922.1 protein translated from mRNA XM_001102567.1 predicted mRNA XP_001089765.1 predicted protein translated from predicted mRNA

mRNA Splicing Signals gene model genome CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGTAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAGCATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA GTAAG. donor.TTTCAG acceptor mRNA exon intron exon intron exon splice sites

Gene Predictions Given: - coding sequence must run from ATG – STOP codon in-frame - introns GT...... AG can be spliced out Also take a statistical approach: - coding and non-coding sequence are slightly different in composition - some ‘possible’ splice sites are more likely than others...CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA.....CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA......CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA......CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA.....CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA......CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA... scan genomic sequence …...CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA.. most likely gene model

Supporting Evidence! EST evidence genome gene model We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even then…). So predicted genes based on computational gene models alone will usually lack UTR regions, which has some important consequences. exons: 1 2 3 4

Theoretical/Predicted Sequences genome predicted gene model exons: 1 2 3 4 We’ve now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn’t lose sight of the fact that we don’t really know if these predicted proteins exists – especially where supporting EST evidence is weak, or non-existent. predicted transcript predicted protein

Sequences for a model organism ESTs – millions @ £10 each Cheap to sequence – so we get millions per organism But lots of errors And incomplete gene sequences Can give us relative expression levels cDNAs – tens of thousands @ £1000 each Expensive – but only need to do one (or a small number) per gene Few errors with multipass sequencing Gives us protein sequences Genomes – one ! @ £30,000,000 Extremely expensive But the only way to get the whole picture Gives us gene regulation

So What’s in the Databases Now? 15,000,000 ESTs 3,300,000 cDNAs NCBI July 2005 2,700,000 proteins 950,000 proteins nr RefSeq DNA Proteins

Part 2: Comparative Genomics ATGAAGGCTGCCTACGACTGCCGTG ATGCAGGCTGCCTACGACTGCCGTG ATGCAGGCTGCCAACGACTGCCGTG ATGCATGCTGCCAACGACTGCCGTG ATGCATGCTGCCAACGACTGCCCTG ATGCATGCTGCCAACGGCTGCCCTG ATGCATGCTGCCAACGGATGCCCTG ATGCATGCCGCCAACGGATGCCCTG ATGCATGCCGCCAACGGATGTCCTG Imagine one mutation gets fixed every 100,000 years in this gene sequence… Gene sequence Evolution by sequence mutation

Speciation ATGAAGGCTGCCTACGACTGCCGTG ATGCAGGCTGCCTACGACTGCCGTG ATGCAGGCTGCCAACGACTGCCGTG ATGCATGCTGCCAACGACTGCCGTG ATGCATGCTGCCAACGACTGCCCTG ATGCATGCTGCCAACGGCTGCCCTG ATGCATGCTGCCAACGGATGCCCTG Gene A ATGAAGGCTGCCTACGACTGCCGTG ATGAAGGCCGCCTACGACTGCCGTG ATGAAGGCCGCCAACGACTGTCGTG ATGAAAGCCGCCAACGACTGTCGTG ATGAAAGCCGCCAACGACAGTCGTG ATGAAAGCCGCCTACGACAGTCGTG ATGAAAGCCGCCTACGACAGTCCTG ATGCATGCTGCCAACGGATGCCCTG ATGAAAGCCGCCTACGACAGTCCTG ||| | || ||| ||| | |||| If the genetic difference means they can no longer interbreed, with fertile offspring – then we have a new species…

Residual Similarity ATGCATGCTGCCAACGGATGCCCTG ATGAAAGCCGCCTACGACAGTCCTG ||| | || ||| ||| | |||| ATGCATGCTGCCAACGGATGCCCTG ATGGAAGGCGCTTAGGATAGTCCAG ||| | | || | | | || | After longer periods of evolution, homology may no longer be detectable in the DNA sequence… We can still easily detect residual similarity between these sequences, this is what we call homology – detectable similarity because of common evolutionary origin.

Computers Can Detect Homology In fact computers are very good at this task – the two primary challenges are: (a) performing the search fast enough to look through millions of sequence in a timescale compatible with a lab scientist’s attention span (b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches… ATGCATGCTGCCAACGGATGCCCTG ATGAAAGCCGCCTACGACAGTCCTG ||| | || ||| ||| | |||| GCTGACTCGTAGCGCTTAGCTAGCT CCAACATCTAGCCAGATTAGTTAGT | || | | | |

Orthologs A A A Gene duplication though speciation The two copies of Gene A will now evolve independently, but will continue to have the ~same function They are ORTHOLOGS

Paralogs A Gene duplication though internal genome duplication The two copies of Gene A will now evolve independently, but will probably not continue to have exactly the same function They are PARALOGS A A A’ A

‘Other’-logs What about gene duplication after speciation ? How can we describe the relationship(s) between the various copies of gene A in the two frogs? Bear in mind that understanding gene function is more important than semantics… The two copies of A in the orange frog are sometimes called IN-PARALOGS. If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS. A A A A’ A

The Essential Paradigm 1. any group of modern species can be traced back to some extinct common ancestor A A 2. in all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor 3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function A A cyclin b1

Function Conserved Longer than Detectable Similarity start from first self-replicating sequence same function detectable similarity living organisms whole genome duplication local duplication

Redundancy in the Genetic Code GCA Aalanine GCC A GCG A GCT A TGC Ccystine TGT C GAC Daspartate GAT D GGA Gglycine GGC G GGG G GGT G ‘Synonymous’ or ‘silent’ mutations in the third position of the codon triplets have no effect on the amino acid coded for – so there is no evolutionary pressure against this…

Protein Similarity Persists Longer CTATCACGAGAACCTGTG CTATCCCGAGAACCTGTG CTATCCCGAGAACCAGTG CTATCCCGTGAACCAGTG CTATCCCGTGAGCCAGTG CTATCCCGTGAGCCAGTT CTGTCCCGTGAGCCAGTT CTATCACGAGAACCTGTG CTGTCCCGTGAGCCAGTT || || || LSREPV |||||| CTATCACGAGAACCTGTG TTGTCCCGGTCGCCAGTT | || | || || LSREPV LSRFPV ||| || 67% 100% 44% 80%

Always Compare Protein Sequences ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG MNAAYDCRARMLR ||||| || || || || || || || ||||| || || | ||||||||+|| ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA MKAAYDCRARILR DNA comparison amino acid comparison The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.

Exercise 1 nucleotide vs amino acid search Go to the file example-sequences.html, and locate the section for this exercise. There should be two sequences: ‘surfeit1’ for frog and fly. Go to NCBI Blast home page, then ‘Align two sequences’ (bottom left ‘special’ panel), paste one sequence into each window and hit ‘Align’ – this will do a direct DNA/DNA comparison. Now find the open reading frames of the two genes, and translate them into amino acid protein sequences, then repeat the two sequences comparison. Go to NCBI ORF Finder – paste sequence – hit OrfFind – identify longest ORF – click on it – next screen, hit Accept – change View to Fasta protein – hit View – copy sequence to Blast2Seqs. Do the same with the other sequence. Before you hit ‘Align’ change the ‘Program’ (top left) to blastp…

Answers: Exercise 1

The Essential Task experimentdata mining gene sequence what is its function? database of proteins in other species Cyclin-A FoxA1 cdc25 alpha-tubulin Predicted protein Gravin-like Sprouty-2 calmodulin KIAA10786568 frizzled Wint8 Troponin T3 Gravin-like we can only do this because of implied function based on orthology

Functional Orthologs ? function known, annotation ‘Gravin’ available Human gene Xenopus gene function unknown sequence similarity orthologs same function ? But we know that function is largely determined by shape similar shape? Which in general we cannot determine – but it is probably SHAPE not SEQUENCE that is conserved! We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence

Finding Orthologs So how do we find orthologs, and can we know when we have? The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of you own organism, and the one you wish to find an ortholog in. frog protein database of human proteins best match human protein database of frog proteins x

Using Synteny is Better We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another. And we find the same genes (i.e. orthologs!) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged. Human chromosome 5 Mouse chromosome 10 Mouse chromosome 2

Metazome Fortunately someone has done all the hard work for us…. Dan Rokhsar http://www.metazome.net/

Metazome Exercise Go back to Entrez Gene and look for your favourite gene again. Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space. Go to Metazome (http://www.metazome.net/), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node. See if you get the same cluster ID as best top hit and have a look at the Metazome alignment(s)…

Part 3: Finding Sequence Similarities We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance. But first we have to consider the implication of gaps… Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments: ATGCATGCTGCCAACGGATGTCCTG ATGAAAGCCGCCTACGAAAGTCCTG ||| | || ||| ||| | |||||| ATGCATGCTGGCCAACGGATGTCCTG ATGAAAGCCGCCTACGAAAGTCCTG ||| | || ||| | | | |

Gaps in Alignments Consider these two obviously similar sequences: TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA | || | || |||||||||||||||||||| ||||||||| ||| ||| | ||| | | | TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence: TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA |||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| |||| TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA So in general we allow ourselves to insert gaps, until we find the optimal alignment. But where should this process stop?

The Downside of Gaps Take two random sequences, with no ‘real’ similarity: GACACTAGGTCGATGCGTGGTGGCGAGA ACGCATCCGGATGTGCACCGTGGAACTG And allow ‘cost free’ gaps: GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA || | | | | | ||| |||| || ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful! To prevent this we assign a cost to adding gaps which is offset against the benefit of finding matches – and this is the essence of ‘finding gapped alignments’. We want to find the ‘alignment’ between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps …

BLAST >query AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAG CTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGA GTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACG GTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCA CGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGC AACGAA There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known, and easily most widely used is BLAST, developed by Stephen Altschul and others, and continuously refined over the last 10-15 years. The essential idea is to compare your query sequence against a collection or ‘database’ of target sequences, looking for the one(s) that match the query sequence the best. >target1 AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG >target2 CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC TTTTTAATAAAATCTATTTCTATAG >target3 GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT GGAAAAAAAAATACTGGCTTCC >target4 CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG query database COMPARE LIST MATCHES

Flavours of BLAST ACGATAGATCCCATCCATAAAT ATGACGATAGATCCCATCAT CGATAGGACCACCACA GATAGACCAGGATACATAGGATAATTA AGCTCGCTTGGCTCGATGGCT MKJLSPWERSYTRGHYTWER MGHTVNBZY MKLPWRHGDBKJGMNDFD MBKLRPIUHDFRTASGSLKWWRTVBN ATGACGATAGATCCCATCAT CGATAGGACCACCACA GATAGACCAGGATACATAGGATAATTA AGCTCGCTTGGCTCGATGGCT query sequenceother operation?database sequences ATGACGATAGATCCCATCAT CGATAGGACCACCACA GATAGACCAGGATACATAGGATAATTA AGCTCGCTTGGCTCGATGGCT BLASTn BLASTp BLASTx tBLASTn tBLASTx ACGATAGATCCCATCCATAAAT MQWCGYRWTYQGYRW FAST SLOW SLOWER HORRIBLY SLOW! 6 frame translation

How does it work? The main task of any sequence comparison program is to test all possible mutual alignments of two sequence and see how good the match is: CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC | | | | | ||||||||||||||||||||||||| CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT || | | | | | | | | |||||||||||||||||||||||| | | | | | | CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC ||||||||||||||||||||||||| |||||||||||||||||||||||| query 1 st database sequence This would actually be a very slow search process if implemented like this… BLAST achieves its speed through two strategies: - it takes a WORD based approach - it pre-INDEXES database sequences

BLAST: WORDS and INDEXING 1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA 2 TAAGCAAATTTAATTTTGTTTACATTTTC 3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA AAAAAAAA 00001 AAAAAAAC 00002 AAAAAAAG 00003 : ACAAATCC 07967 ACAAATCC 07968 ACAAATCC 07979 : GACAAATC 33568 GACAAATG 33569 : TCCAAACC 64321 TCCAAACC 64322 : sequence position word 1 133658 1207967 1316210 : 315 33568 3 16 07967 : Database of sequences Numbered list of all possible ‘words’ Build a position index of all words in the database

Analyse the Query Sequence >query AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA AAAAAAAA 00001 AAAAAAAC 00002 AAAAAAAG 00003 : ACAAATCC 07967 ACAAATCC 07968 ACAAATCC 07979 : GACAAATC 33568 GACAAATG 33569 : TCCAAACC 64321 TCCAAACC 64322 : QUERY SEQUENCE Numbered list of all possible ‘words’ position word 1 14236 2 33658 3 07967 : Analyse QUERY SEQUENCE sequence position word 1 133658 1207967 1316210 : 315 33568 3 16 07967 : Index of database

Expand from Word Based Matches We ‘instantly’ know which sequences in the database have at least a word length match with our query sequence, and at what relative position. Next, the potential alignments are expanded, adding up a score for (total matches – mismatches – gap penalties), to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database – so overall it is much quicker. The highest scoring alignments are reported. But we can potentially miss alignments with no word-size bits in common, consider BLASTn with a default word-size of 11: TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC ||||||||| ||||| |||||||||| |||||||||| |||| ||||| ||||||| TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC Care is sometimes needed…

BLAST –Typical Output INPUT: >partial cDNA sequence, Xenopus tropicalis CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCC CCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAA GAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA OUTPUT: Query= (311 letters) Database: NCBI Protein Reference Sequences 954,378 sequences; 347,895,532 total letters >gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]gi|41055060|ref|NP_957420.1| Length=691 Score = 133 bits (335) Expect = 6e-31 Identities = 76/98 (77%) Positives = 82/98 (83%) Gaps = 4/98 (4%) Frame = +2 Query 26 MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL 196 MSGKIE K +SQ+SHLSSFTMKL KFHSPKIKRTPSKKGK + VKT EKPVNK + Sbjct 1 MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV 59 Query 197 SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA 310 SRLEEQEK+VV+ALRYFKTIVDKM VD VL MLPGSA Sbjct 60 SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA 97

When is a match significant? RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS Here is a ‘typical’ weak alignment from BLASTp: In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values The number of matches like the discovered match that I would expect to find by chance. An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore… An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value… Also “expect value“ or “expectation”

E-values From First Principles Some database statistics (23 rd July 2005): Database: NCBI RefSeq mRNA 272,619 sequences; 503,566,580 total letters (~5.0 x 10 8 ) Database: NCBI nr 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10 10 ) Notation: 1.2e-35 = 1.2 x 10 -35 4.8 x 10 6 = 4,800,000 We will consider first searching a nucleotide sequence (‘ACGTAGACGT’) against a nucleotide database, e.g. the RefSeq mRNA above. Then we will consider the more complex case of amino acid sequence (protein) searches. Which is of course what we mostly do.

Calculating an E-value The RefSeq mRNA database has ~ 5.0 x 10 8 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance? CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG AAAAAAAAAAAAAA Query = ‘A’ CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG Query = ‘AC’ AC CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG Query = ‘ACG’ ACG Expected number of matches = (5.0 x 10 8 ) / 4 = ~1.2 x 10 8 Expected number of matches = (5.0 x 10 8 ) / (4x 4) = ~3.1 x 10 7 Expected number of matches = (5.0 x 10 8 ) / (4 x 4 x 4) = ~8.1 x 10 6 Query = ‘ACGTCGA…..CTGATTCG’ - 60-mer Expected number of matches = (5.0 x 10 8 ) / (4 x 4 x 4 x 4 … 60 times ) = (5.0 x 10 8 ) / 10 36 = 5.0 x 10 -28 E-value = 5.0 x 10 -28

E-values In Practice So if I take a 60 nt sequence: >sequence ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database, I get: BLAST OUTPUT: >gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds Length=6060gi|27469838|gb|BC041710.1| Score = 119 bits (60), Expect = 2e-26 Identities = 60/60 (100%), Gaps = 0/60 (0%) Strand=Plus/Plus Query 1 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036 What do I get if I BLAST it against the larger nr database? BLAST OUTPUT: >gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds Length=6060gi|27469838|gb|BC041710.1| Score = 119 bits (60), Expect = 6e-25 Identities = 60/60 (100%), Gaps = 0/60 (0%) Strand=Plus/Plus Query 1 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036 theoretical value was 5.0e -28 - !?

E-value Exercise Given a transcription factor binding site: ACC[T/G]TA How many would you expect to find by chance in a 10k promoter sequence? How would this differ if there was an optional additional base between the 4 th and 5 th positions? I.e. ACC[T/G]TA OR ACC[T/G]?TA

E-value Exercise: Answer ACC[T/G]TA Expect ‘A’ every 4 nt Expect ‘ACC’ every 4x4x4 = 64 nt Expect ‘T or G’ every 2 nd nt Expect ‘ACC[T/G]’ every 64x2 nt = 128 nt Expect ‘TA’ every 4x4 = 16 nt Expect ‘ACC[T/G]TA’ every 128x16 nt = 2048 nt (4x4x4x2x4x4) We would expect ~5 of these promoter sites every 10k by chance If also ACC[T/G]?TAA allowed? The two motifs independently have the same E-value. To allow either means we expect twice as many.

E-values: Effect of Database Size The nr mRNA database has ~ 1.4 x 10 10 letters (was RefSeq and 5.0 x10 8 ) There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance? CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG AAAAAAAAAAAAAA Query = ‘A’ CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG Query = ‘AC’ AC CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG Query = ‘ACG’ ACG Expected number of matches = (1.4 x 10 10 ) / 4 = ~1.2 x 10 8 Expected number of matches = (1.4 x 10 10 ) / (4x 4) = ~3.1 x 10 7 Expected number of matches = (1.4 x 10 10 ) / (4 x 4 x 4) = ~8.1 x 10 6 Query = ‘ACGTCGA…..CTGATTCG’ - 60-mer Expected number of matches = (1.4 x 10 10 ) / (4 x 4 x 4 x 4 … 60 times ) = (1.4 x 10 10 ) / 10 36 = 1.4 x 10 -26 E-value = 1.4 x 10 -26 (was E-value = 5.0 x 10-28)

E-values: Effect of Database Size The E-value is simply dependent on database size. RefSeq nr 1.4 x 10 10 letters 5.0 x10 8 letters 30 x bigger BLAST the same sequence against each E-value = 1.4e -26 E-value = 5.0e -28 The database was ~30 times bigger and so the E-value was ~30 times bigger.

Why were the values different? Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10 -28. But our actual BLAST search at NCBI gave a value of 2.0 x 10 -26 - about 40x larger - why is this? Gapped alignments If we were expecting N matches for a query sequence ‘ACGTACGTACGT’, imagine what would happen to N if we allowed gaps in our matches. ACGTAC?GTACGT This would now give us additional possible alignments that would meet our ‘match’ criteria: ACGTACGTACGT ACGTACAGTACGT ACGTACCGTACGT etc. |||||||||||| |||||| |||||| |||||| |||||| ACGTACGTACGT ACGTAC-GTACGT ACGTAC-GTACGT We will expect many more matches in a given database, if we allow our alignments to have gaps. The E-value will be larger.

E-values: Effect of Query Length Biologically it’s the same match! Does it mean we are any less sure that this match didn’t occur by chance? The E-value is simply dependent on match length. database BLAST 500 nt sequence against a database BLASTn Get a full length match with sequence XYZ at an E-value = 5.0e -160 >sequence ACTAGTCTAGCTAGACATCG ATCGATGATGCTACACAGAT AGACGATAGATAGTAAGTCG ATCGATCGCGCATCGATCGT CTAGATCGATCGCTCGCTGT GTAGATAGATCGGCGATAGA database BLAST half of the same sequence against the same database BLASTn >sequence ACTAGTCTAGCTAGACATCG ATCGATGATGCTACACAGAT AGACGATAGATAGTAAGTCG Get a match with sequence XYZ again, but at an E-value = 5.0e -80

Why not just use % identity? At some levels this a good question. But consider two very different searches, both of which give a 75% identity match Query1 was 60 nt long: CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG ||||||||||| || | || | || || |||| | | | |||||| | |||||||||| CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG Which would have an E-value ~ 5.0 x 10 -19 And, Query2 only 16 nt long: ACGTACGTACGTACGT ||| || | |||| || ACGCACCTTCGTAGGT Which would have an E-value ~ 30 And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…

So what’s the real problem? Basically you are usually trying to answer the question: Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism? Are there any useful guidelines though, at least for biological meaningfulness? Basically you are usually trying to answer the question: Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism? BLAST The difficulty is because: ORTHOLOGY BLASTSimilarity + Probability biological knowledge nature of query sequence phylogenetic relationship match length, PI, size of database…

Rules of Thumb How good does an E-value have to be before we might even think we have an ortholog?  larger/worse smaller/better  E-values 10 -5 10 -10 10 -40 10 -100 0.0 fantasy borderline encouraging pretty good can’t get better But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then % identity may be more sensitive. Also bear in mind, in cases like this, that ideas of ‘functional’ orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST It’s (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently? If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value by 1/20 th (there are 20 different amino acids). And as there are 347,895,532 letters in that database, E-value = ~3.5 x 10 8 / (20 x 20 x 20 …20 times) = ~3.5 x 10 -18. But this is what we get if we run the blast at NCBI: Score = 43.1 bits (100), Expect = 8e-04 Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%) Frame = +3 Query 3 SSSSFRAYRAALSEVEPPCI 62 SSSSFRAYRAALSEVEPPCI Sbjct 972 SSSSFRAYRAALSEVEPPCI 991 Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions A S C F LWY G I LMV L IMFV M ILV P V ILM W FY NDHS Q REHK SANT T S Y HFW H NQY K RQE R QK DNE E DQK In fact we need to take into account both amino acid substitutability, as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about 1/7 th chance of ‘matching’ rather than 1/20 th. So now we get: E-value = ~3.5 x 10 8 / (7 x 7 x 7 …20 times) = ~4.4 x 10 -9, which is much closer to the actual BLAST value. These substitutabilities are dealt with by the BLOSUM and PAM matrices

Exercises Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database. Did you find any ‘significant’ hits? Repeat with a second sequence. What conclusions might you draw from this exercise? Try the same sequence(s) against the nr nucleotide database. Is there any general difference?

Part 4: Tweaking BLAST Although you normally see BLAST as a web page with boxes to place data in and tick boxes, etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g. prompt> blastall –p –i –d This is the simplest form: where the basic program ‘blastall’ takes a number of different options, or parameters, indicated by the –x and followed by its value. -p -i -d

Not All Parameters are … There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for –W blastn uses –W 11, where blastx uses –W 3. There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me! Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary. Here are some of the ones that I have used: -e max expected value -moutput format(graphical or tabular/spreadsheet) -F filter query sequence for low complexity(default TRUE) -U use only upper case regions of query (default FALSE) -Ggap opening cost -E gap extension cost -q nucleotide mismatch penalty (BLASTx uses matrices) -r nucleotide match reward -b number of matching sequences to report -g allow gaps (default TRUE) -W word size -z effective database size (removes effect of actual database size!) -S query strands to search(default both directions) -lrestrict database sequences to given list of ‘gi‘ numbers

BLAST Parameters Exercises 1. BLASTn vs. BLASTx Open the file example-sequences.html, copy the sequence: >blastn-vs-blastx This is a Xenopus tropicalis cDNA sequence. Go to the NCBI BLAST Home Page/Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.Nucleotide-nucleotide BLAST (blastn) Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (hint if you paste the sequence definition line ‘>name’ into the box as well, your results will be labelled accordingly, which can be useful) Now repeat but go to the TRANSLATED BLAST section, and BLAST against the nr protein database using BLASTx. How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1. BLASTn BLASTx

BLAST Parameters Exercises 2. Low complexity filtering Open the file example-sequences.html, copy the sequence: >low-complexity-filtering-A This sequence contains a long AT tandem repeat. Go to the NCBI BLAST Home Page/TRANSLATED BLAST section/BLASTx. Paste your sequence into the box. Carefully UNTICK the “Choose filter [ ] Low complexity” BOX in the second section. And then run BLASTx against the nr database.Choose filter What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches? Now continue with >low-complexity-filtering-B and –C. C is an especially interesting case – what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF) BLASTn – low complexity filtering OFF

Results for Exercise 2A (ON) BLASTn – low complexity filtering ON

Results for Exercise 2B ONOFF

Results for Exercise 2C There is a sequence error, an extra G at position 117 in the sequence: cDNA (117) AGAAAAGAAGAAACATGGCAATGGATCAGAA |||||||||||||||| |||||||||||||| AGAAAAGAAGAAACAT-GCAATGGATCAGAA Genomic sequence ON OFF

BLAST Parameters Exercises 3. Limit by Entrez query Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance to find only matching sequences in fruit fly, enter ‘Drosophila melanogaster[ORGN]’ in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list).Limit by entrez query To combine items use logical AND, OR or NOT. Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page/ TRANSLATED BLAST section/BLASTx, and paste the sequence. Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises 4. BLASTn vs tBLASTx and nucleotide mismatch penalties Open the file example-sequences.html. Also open the NCBI BLAST Home Page/SPECIAL – Align two sequences section. There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window (i) Run the default comparison, should be BLASTn. Note the alignment. Now run again using tBLASTx – what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs – or none of these? (ii) Revert to BLASTn, and try varying the values for mismatch penalties and gapping – start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties…. What do we learn from this? (iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i) BLASTntBLASTx

Results for Exercise 4 (ii) Mismatch penalty = -2 (default)Mismatch penalty = -1

BLAST Parameters Exercises 5. E-Value maximum for reporting Open the file example-sequences.html. Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence. Run the search with the default values. Now re-run the search, setting the maximum E-value in the box: Expect 100 What difference does this make?

BLAST Parameters Exercises 6. Word Size Open the file example-sequences.html. Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence. Check OFF the low complexity filter, and then run the search. Now re-run the search, setting the following parameters: Low complexity OFF Expect 100 Word Size7 Other advanced -q-1 (mismatch penalty -1 instead of default -3) What difference does this make?

Bioinformatics Workshop 1 Sequences and Similarity Searches Open a web browser and type in the URL: –informatics.gurdon.cam.ac.uk/online/workshops –Bookmark.

Similar presentations

Presentation on theme: "Bioinformatics Workshop 1 Sequences and Similarity Searches Open a web browser and type in the URL: –informatics.gurdon.cam.ac.uk/online/workshops –Bookmark."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Bioinformatics Workshop 1 Sequences and Similarity Searches Open a web browser and type in the URL: –informatics.gurdon.cam.ac.uk/online/workshops –Bookmark.

Similar presentations

Presentation on theme: "Bioinformatics Workshop 1 Sequences and Similarity Searches Open a web browser and type in the URL: –informatics.gurdon.cam.ac.uk/online/workshops –Bookmark."— Presentation transcript:

Similar presentations

About project

Feedback