Published by Colin Potter. Modified over 9 years ago.
1
MBG305 Applied Bioinformatics Week 2 (02.10.2010) Jens Allmer
2
Quiz 10 min
3
Databases Bioinformatics needs data –Where is this data? –Is there any organization? –How should I cite data?
4
Where is the data? Many targeted resources exist
–miRBase (http://www.mirbase.org/): contains microRNAs
–PDB (http://www.rcsb.org/pdb/home/home.do): contains protein structures
–PeptideAtlas (http://www.peptideatlas.org/): contains mass spectrometric measurements
–KEGG (http://www.genome.jp/kegg/): contains regulatory and biochemical pathways
–PubMed (http://www.ncbi.nlm.nih.gov/pubmed/): contains indexed journals
–...
5
Where is the data? Sequence Databases
–EBI (www.ebi.ac.uk/)
–Ensembl (www.ensembl.org)
–GenBank (www.ncbi.nlm.nih.gov/Genbank)
–SwissProt (www.tigr.org/tdb)
–...
Make these pages bookmarks
–Are your bookmarks where you are? Try: http://www.delicious.com
–Or bring your own browser: http://portableapps.com/apps/internet/google_chrome_portable
6
How is Data Organized?
Flat Text Files
–FASTA Format
Structured Text Files
–ASN.1 and XML-based formats
Databases
–Structure
–Index
–Users
–Details in MBG403
7
Flat Text Files
FASTA Format (Pearson and Lipman, 1988)
–Allows multiple sequences per file
–Requires identifiers for each sequence
–Some special characters and formatting rules
  > introduces the definition line (sequence identifier)
  80 characters per sequence line
  Only supported characters (IUPAC)
  –http://www.bioinformatics.org/sms/iupac.html
Example
>gi|189443480|gb|FG602538.1|FG602538 PF_T3_37R_G02_08AUG2003_004 Opium poppy root cDNA library Papaver somniferum cDNA, mRNA sequence
GAACGAAGGGAGAGAACGAAAAAGAAGGAGAGAATGTGTGAGGGTCGGTTTCATACGTTTGGTGTTAACTGAGTTATGCA
ATCTGCAAAAGAGGAGAGATTAGATAGAAGATGAGAAGAATTATGACAACCTAGTCAAGTATGGATCATTGCTCTAATTC...
>gi|189457344|gb|FG613049.1|FG613049 stem_S093_F08.SEQ Opium poppy stem cDNA library Papaver somniferum cDNA, mRNA sequence
CTTTCTCTAGGTTTCTCCGCAATTTTCAAGTGGACGAATCCAAATAGAATTTGCCAAGCTTTTCTTGATTTATCCTACTC
GGTGTAAAAATGGCGACAATAGGAGCTTCCTCAGCTTGCTGCATGATCAGAAGCACACCCCAGAACAGTGGTAAAATTGC...
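The rules above (a ">" definition line followed by one or more sequence lines) can be turned into a small reader. The following is only a minimal sketch: the function name `parse_fasta` and the two short example records are made up for illustration, and no IUPAC character check is performed.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each definition line (without '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]  # a new record starts here
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence may span several lines
        else:
            raise ValueError("sequence data found before the first '>' line")
    return records


# Hypothetical two-record input, not taken from a real database entry.
example = """>seq1 hypothetical first record
GAACGAAGGG
AGAGAACG
>seq2 hypothetical second record
CTTTCTCTAG
""".splitlines()

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

Because sequence lines are simply concatenated, the reader does not depend on the 80-character line width.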
8
FASTA Tools
FASTA Viewer and DNA Translator
–http://www.biolnk.com/
Some FASTA Tools
–http://bioinformatics.iyte.edu.tr/index.php?n=Softwares.FastaTools
FASTA Validator / Converter to CSV file
–http://mbg305.allmer.de/tools/
9
FASTA Usage
Most programs that accept sequence input accept FASTA format
–BLAST (partially)
–FastA (obviously)
–Multiple sequence alignment tools (most)
–MS-based database search engines (some; only for the database, not for queries)
–Most online forms
10
FASTA Definition Line Formats (http://en.wikipedia.org/wiki/Fasta_format)
–GenBank: gi|gi-number|gb|accession|locus
–EMBL Data Library: gi|gi-number|emb|accession|locus
–DDBJ, DNA Database of Japan: gi|gi-number|dbj|accession|locus
–NBRF PIR: pir||entry
–Protein Research Foundation: prf||name
–SWISS-PROT: sp|accession|name
–Brookhaven Protein Data Bank (1): pdb|entry|chain
–Brookhaven Protein Data Bank (2): entry:chain|PDBID|CHAIN|SEQUENCE
–Patents: pat|country|number
–GenInfo Backbone Id: bbs|number
–General database identifier: gnl|database|identifier
–NCBI Reference Sequence: ref|accession|locus
–Local Sequence identifier: lcl|identifier
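As an illustration of the GenBank-style layout gi|gi-number|gb|accession|locus, a definition line can be split on the "|" character. This is a sketch only: the helper name `parse_gi_defline` is hypothetical, and it handles just the five-field "gi|...|db|...|..." pattern, not the other formats in the list.

```python
def parse_gi_defline(defline: str) -> dict:
    """Split a 'gi|gi-number|db|accession|locus description' definition line."""
    # The identifier ends at the first space; the rest is free-text description.
    identifier, _, description = defline.lstrip(">").partition(" ")
    fields = identifier.split("|")
    if len(fields) < 5 or fields[0] != "gi":
        raise ValueError("not a gi|gi-number|db|accession|locus definition line")
    return {
        "gi": fields[1],
        "db": fields[2],  # gb, emb or dbj in the list above
        "accession": fields[3],
        "locus": fields[4],
        "description": description,
    }


# Definition line of the first record in the FASTA example slide.
defline = (">gi|189443480|gb|FG602538.1|FG602538 "
           "PF_T3_37R_G02_08AUG2003_004 Opium poppy root cDNA library")
parsed = parse_gi_defline(defline)
print(parsed["db"], parsed["accession"], parsed["locus"])
```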
11
GenBank Flat Text File
GenBank
–Sample record and explanation: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord
–FAQs: http://www.ncbi.nlm.nih.gov/books/NBK49541/#NucProtFAQ.Section_A_GenBank_nucleotide
12
Structured Text Files Different ways to structure text files –ASN.1 –XML –JSON –Wait for MBG403 for details
13
Structured Text Files
ASN.1 Example
–http://www.ncbi.nlm.nih.gov/nuccore/NC_003622.1?report=asn1&log$=seqview
–http://www.ncbi.nlm.nih.gov/nuccore/NC_003622 (select Display Settings, then ASN.1)
14
Databases
Unlike the previous formats, databases are not easily human-readable
–Special tools and languages are used to add, edit, retrieve, and view data
Advantages
–Secure
–Stable
–Distributed
–Fast access
–Huge sizes supported
Ever tried to search in 100 TB of text for something?
http://www.freerepublic.com/focus/f-chat/2508670/posts
15
Scientific Data Source: http://www.bioinformatics.wsu.edu/bioinfo_course/notes/Lecture12.pdf
16
Characteristics of Scientific Data
Highly complex
–Images, sequences, time series, ...
–Strong interdependence of data
In science
–Outliers are of interest
–Focus of interest changes rapidly
–Data is usually shared
–Data must be secure: never change data, only add; many viewers, few creators
Collections
–Large collections must be shared via strong servers
–Small collections (e.g. SwissProt, 63 MB) can be shared more easily
–New methodologies (MS, NGS, ...) have expanded the size of databases
17
Desired Features for Databases Efficiency Scalability Concurrency Security Integrity Stability Cross references to other databases Universally accessible Query Language Data mining Data Warehouse
18
How Many Bioinformatics Databases? Source: http://www.bioinformatics.wsu.edu/bioinfo_course/notes/Lecture12.pdf
19
An Abundance of Databases
Databases and Collections on http://www.hsls.pitt.edu/obrc/ (2012 -> 2014)
–DNA Sequence Databases and Analysis Tools (499) -> 463
–Enzymes and Pathways (281) -> 242
–Gene Mutations, Genetic Variations and Diseases (303) -> 257
–Genomics Databases and Analysis Tools (703) -> 636
–Immunological Databases and Tools (61) -> 49
–Microarray, SAGE, and other Gene Expression (215) -> 166
–Organelle Databases (29) -> 25
–Other Databases and Tools (Literature Mining, Lab Protocols, Medical Topics, and others) (179) -> 147
–Plant Databases (159) -> 146
–Protein Sequence Databases and Analysis Tools (492) -> 408
–Proteomics Resources (74) -> 58
–RNA Databases and Analysis Tools (257) -> 222
–Structure Databases and Analysis Tools (452) -> 384
Sum: 3704 -> 3203
20
Data Warehouses
Are resources like NCBI and EBI databases?
–No, they are larger than what is generally called a database
–They can be called data warehouses
–They consist of many interlinked databases
21
Need for Improvement
Anyone can submit data to online resources
Rigorous data checking is necessary
–Saçar and Allmer (http://journal.imbio.de/index.php?paper_id=215)
–Bağcı and Allmer (http://dx.doi.org/10.1109/HIBIT.2012.6209038)
Data must be standardized
Quality of data must be specified
22
How to Cite Data
It is rarely necessary to present a sequence in any writing
In general it suffices to give
–Accession number of sequence
–Database where sequence is located
If database is not given try
–Accession Parser (www.biolnk.com)
In case you have a new sequence
–Generally required to deposit it in a database
–E.g.: http://www.ncbi.nlm.nih.gov/genbank/submit/
–Then cite the assigned accession number(s)
23
End of Theoretical Part 1 Mind mapping 10 min break
24
Practical Part 1
25
Where is the data? Turn on your computers and let’s find out
EBI (www.ebi.ac.uk/)
Ensembl (www.ensembl.org)
GenBank (www.ncbi.nlm.nih.gov/Genbank)
SwissProt (www.tigr.org/tdb)
Make these pages bookmarks
–Are your bookmarks where you are?
–Try: http://www.delicious.com
26
Retrieve Data
You want the DNA sequence of some human hemoglobin. How do you get it?
Try to achieve this goal for a few minutes.
29
Ctrl-F
30
No results
31
Where have we gone wrong? Language! Database!
33
GenBank
34
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
35
GenBank Accession number –Applies to full record –X00000 –XX000000 –Never changes
36
GenBank Version
–Identifies a single sequence
–Adds a version to the accession number; format X00000.0
–The version (e.g. .0 -> .1) changes if even a single nucleotide in the sequence is changed
–Other versions are referenced: http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi
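The accession and version formats from these two slides (X00000, XX000000, and the ".version" suffix) can be checked with a regular expression. The sketch below covers only those two patterns; real GenBank accessions come in further formats, and the function name `split_accession_version` is an assumption made for illustration.

```python
import re

# One letter + five digits, or two letters + six digits, optionally ".version".
ACCESSION_VERSION = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.(\d+))?$")


def split_accession_version(identifier: str):
    """Return (accession, version); version is None when no suffix is present."""
    match = ACCESSION_VERSION.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a recognised accession: {identifier!r}")
    accession, version = match.group(1), match.group(2)
    return accession, (int(version) if version is not None else None)


# FG602538.1 is the accession.version from the FASTA example earlier in the deck.
print(split_accession_version("FG602538.1"))
print(split_accession_version("X00000"))
```

The accession part stays the same across updates, so comparing only the first element of the returned pair tells whether two identifiers refer to the same record.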
37
GenBank GenInfo identifier (GI)
–Any change to the sequence forces a new GI number
–Translations get separate GI numbers
–GI:00000
38
GenBank
39
Sequence?
40
GenBank Eukaryotic
41
Retrieving Sequences By Example Basic Local Alignment Search Tool BLAST
45
http://www.ebi.ac.uk/
48
What did we do? We wanted to find one of the human hemoglobins –The nucleotide sequence in FASTA format We wanted to find similar sequences –BLAST (ncbi) –FASTA (ebi) Who got lost in the jungle of LINKS? –That is normal –Bioinformatics is a quickly growing field –Consolidation not any time soon
49
End of Practical Part 1 15 min break
50
Theoretical Part 2
And now for something completely different
–http://en.wikipedia.org/wiki/And_Now_for_Something_Completely_Different
How can we find sequences? Can the algorithm we found last week be used?
51
Similarity Searching Search Algorithms –BLAST –FASTA –... This is at the heart of bioinformatics It demands a lot of attention
52
Similarity Searching Exact pattern matching Approximate pattern matching
53
String Matching Math Remember the string matching we did last week? Today we will look at the math of finding EXACT matches between queries and databases If time allows we will look into substitution matrices
54
Probability for perfect matches
Query (Q): ATTGCC
Target (T): CGATTGCCCG
$L_Q$ = length of query (number of nucleotides)
$L_T$ = length of target sequence (number of nucleotides)
55
Element Probability
Probability of finding a nucleotide: very roughly 0.25
Given the sequence: ATTTCCGGGGTAGCTAGCTAGTATATTATCGGCGCTAA
What are the probabilities for A, C, G, and T?
Nucleotide   Number   Frequency
A            9        0.24
C            7        0.18
G            10       0.26
T            12       0.32
N (total)    38       1.00
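The frequency table can be reproduced by counting characters. A minimal sketch using only the Python standard library (the variable names are arbitrary):

```python
from collections import Counter

sequence = "ATTTCCGGGGTAGCTAGCTAGTATATTATCGGCGCTAA"

counts = Counter(sequence)  # number of occurrences per nucleotide
total = len(sequence)       # N, the sequence length
frequencies = {nt: counts[nt] / total for nt in "ACGT"}

print("Nucleotide\tNumber\tFrequency")
for nt in "ACGT":
    print(f"{nt}\t{counts[nt]}\t{frequencies[nt]:.2f}")
print(f"N\t{total}\t{sum(frequencies.values()):.2f}")
```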
56
Sequence Probability
$p = P_A \cdot P_C^2 \cdot P_G \cdot P_T^2$
What is p? p = the probability of randomly generating the sequence given the frequency and number of its elements (e.g.: $P_A$). There is no sequential dependency assumed in this model.
What is the probability of generating AAAAGTTT given the probabilities that we just calculated?
$p = 0.24^4 \cdot 0.26 \cdot 0.32^3 \approx 0.003 \cdot 0.260 \cdot 0.033 \approx 0.000026$ (about 0.000028 if the intermediate factors are not rounded)
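Under the independence assumption stated on this slide, the probability of a sequence is the product of the per-nucleotide probabilities. A short sketch, using the rounded frequencies from the previous slide (the function name `sequence_probability` is chosen here for illustration):

```python
from math import prod


def sequence_probability(seq: str, freqs: dict) -> float:
    """Product of element probabilities; no sequential dependency is modelled."""
    return prod(freqs[nt] for nt in seq)


# Rounded frequencies from the element probability slide.
freqs = {"A": 0.24, "C": 0.18, "G": 0.26, "T": 0.32}

p = sequence_probability("AAAAGTTT", freqs)
print(f"{p:.7f}")  # roughly 2.8e-05 when no intermediate value is rounded
```

The slide's 0.000026 comes from rounding 0.24^4 and 0.32^3 before multiplying, which is why the unrounded product is slightly larger.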
57
How Often do we Expect to Find the Query
The number of matches is restricted by the database size
How often can we shift Q (Query) against T (Target)? This defines the number of possible matching operations
$n = L_T - L_Q + 1$
Example: $L_Q = 6$, $L_T = 10$, so $n = 10 - 6 + 1 = 5$
Query: ATTGCC
Target: CGATTGCCCG
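The shifting of the query along the target can be written as a sliding window. This sketch counts both the number of possible matching operations n and the exact matches actually found; the function name `count_exact_matches` is an illustrative choice.

```python
def count_exact_matches(query: str, target: str):
    """Return (n, hits): window positions tried and exact matches found."""
    n = len(target) - len(query) + 1  # n = L_T - L_Q + 1
    if n < 1:
        return 0, 0  # the query is longer than the target
    hits = sum(
        1 for start in range(n)
        if target[start:start + len(query)] == query
    )
    return n, hits


n, hits = count_exact_matches("ATTGCC", "CGATTGCCCG")
print(n, hits)
```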
58
The probability distribution of the number of matches is approximately binomial:
$p(x) = \frac{n!}{x!\,(n - x)!}\, p^x q^{n-x}$, with $q = 1 - p$
p: probability for being true
q: probability for being false
n: number of trials
x: number of successes
What is p? What is n? What is q?
Figure: binomial distributions for n = 20 with p = 0.1, p = 0.5, and p = 0.8 (http://en.wikipedia.org/wiki/Binomial_distribution)
59
Problem
Factorials lead to overflows in computer programs
With large n and n*p < 1, the distribution can be approximated by a Poisson distribution
–Much easier to calculate for a computer
60
Poisson vs. Binomial Distributions
Poisson: $p(x) = e^{-\lambda} \frac{\lambda^x}{x!}$, with $\lambda = n p$
Binomial: $p(x) = \frac{n!}{x!\,(n - x)!}\, p^x q^{n-x}$
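To see how close the two distributions are when n is large and n*p is small, both formulas can be evaluated side by side. The values n = 1000 and p = 0.0005 below are arbitrary illustration values, not taken from the slides.

```python
from math import comb, exp, factorial


def binomial_pmf(x: int, n: int, p: float) -> float:
    """p(x) = n! / (x! (n - x)!) * p^x * q^(n - x), with q = 1 - p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def poisson_pmf(x: int, lam: float) -> float:
    """p(x) = e^(-lambda) * lambda^x / x!"""
    return exp(-lam) * lam**x / factorial(x)


n, p = 1000, 0.0005  # hypothetical values with n * p = 0.5 < 1
lam = n * p

for x in range(4):
    print(x, f"{binomial_pmf(x, n, p):.6f}", f"{poisson_pmf(x, lam):.6f}")
```

`math.comb` works on Python integers of arbitrary size, so the overflow concern applies mainly to languages with fixed-width numbers; the Poisson form still needs far less arithmetic for large n.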
61
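Partial matches
So far we considered matching the complete query
Partial match of length $L$ ($L \le L_Q \wedge L \le L_D$), where $L_D$ is the length of the database (target) sequence
$p = 2^{-2L}$ (each nucleotide has probability $1/4 = 2^{-2}$)
$m = L_Q - L + 1$ (number of start positions of a length-$L$ window in the query)
$n = L_D - L + 1$ (number of start positions of a length-$L$ window in the database)
$E = m\,n\,2^{-2L}$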
62
BLAST E-Value
$E = m\,n\,2^{-S}$
$E = m\,n\,2^{-2L}$ (with score $S = 2L$)
Describes the number of expected matches which are equally good or better
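The simplified formula on this slide is a single multiplication. The sketch below implements only this simplified form, not the statistics used by the actual BLAST program; the function name `expected_matches` and the example numbers are assumptions made for illustration.

```python
def expected_matches(m: int, n: int, match_length: int) -> float:
    """E = m * n * 2^(-2L): expected number of chance matches of length L."""
    return m * n * 2.0 ** (-2 * match_length)


# Hypothetical values: m = 20, n = 1 000 000, L = 12.
e_value = expected_matches(20, 1_000_000, 12)
print(f"{e_value:.3f}")

# Each additional matched nucleotide divides E by four.
print(expected_matches(20, 1_000_000, 13) / e_value)
```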
63
End of Theoretical Part 2 Mind mapping 10 min break
64
Practical Part 2
65
Practice Poisson vs Binomial
Q: ATG
D: CGATTGCCCG
Calculate p(0), p(1) and p(3)
Note: P(at least one match) = 1 – p(0)
66
$E = m\,n\,2^{-2L}$
Assuming a database size of 10 000 000 and a query length of 10, calculate the number of matches that would happen by chance.
67
Practical Concerns
Human genome: 3 billion nucleotides
Dogma: 14 nucleotides are enough to uniquely identify a gene
Verify this using the Poisson distribution
Poisson: $p(x) = e^{-\lambda} \frac{\lambda^x}{x!}$, with $\lambda = n p$
68
BLAST Interface Setting a cutoff E-value –Consider the calculation you just did –If someone was to set the cutoff to 0.01 with the same assumptions How many results would you expect? What would you advise the user? Topic will be revisited later
69
Amino Acid Sequences What changes when instead of nucleotide sequences we were to use amino acid sequences?
70
Practise this
Determine how long a query must be so that it can uniquely identify a gene in the human genome
–p < 0.05
71
Assignments Go to GenBank and inspect all parameters –Find their meaning (even if you think you know what it means) –Sometimes definitions are surprising Collect information about parameters that pose problems to you –Submit this information to us so that we can discuss in the following week
72
Homework 1
Make a table showing the E-value against $L_Q$ (10..100) with $L_D$ = 3 000 000 000
Use Excel to do this
Send the results to bioinformatics@allmer.de