MBG305 Applied Bioinformatics Week 2 (02.10.2010) Jens Allmer.

Quiz 10 min

Databases Bioinformatics needs data –Where is this data? –Is there any organization? –How should I cite data?

Where is the data?
Many targeted resources exist
–miRBase (http://www.mirbase.org/): contains microRNAs
–PDB (http://www.rcsb.org/pdb/home/home.do): contains protein structures
–PeptideAtlas (http://www.peptideatlas.org/): contains mass spectrometric measurements
–KEGG (http://www.genome.jp/kegg/): contains regulatory and biochemical pathways
–PubMed (http://www.ncbi.nlm.nih.gov/pubmed/): contains indexed journals
–...

Where is the data?
Sequence Databases
–EBI (www.ebi.ac.uk/)
–Ensembl (www.ensembl.org)
–GenBank (www.ncbi.nlm.nih.gov/Genbank)
–SwissProt (www.tigr.org/tdb)
–...
Make these pages bookmarks
–Are your bookmarks where you are? Try: http://www.delicious.com
–Or bring your own browser: http://portableapps.com/apps/internet/google_chrome_portable

How is Data Organized?
Flat Text Files
–FASTA Format
Structured Text Files
–E.g.: ASN.1 and XML-based formats
Databases
–Structure
–Index
–Users
–Details in MBG403

Flat Text Files
FASTA Format (Pearson and Lipman, 1988)
–Allows multiple sequences per file
–Requires identifiers for each sequence
–Some special characters and formatting rules
  > introduces the definition line (sequence identifier)
  80 characters per sequence line
  Only supported characters (IUPAC): http://www.bioinformatics.org/sms/iupac.html
Example
>gi|189443480|gb|FG602538.1|FG602538 PF_T3_37R_G02_08AUG2003_004 Opium poppy root cDNA library Papaver somniferum cDNA, mRNA sequence
GAACGAAGGGAGAGAACGAAAAAGAAGGAGAGAATGTGTGAGGGTCGGTTTCATACGTTTGGTGTTAACTGAGTTATGCA
ATCTGCAAAAGAGGAGAGATTAGATAGAAGATGAGAAGAATTATGACAACCTAGTCAAGTATGGATCATTGCTCTAATTC...
>gi|189457344|gb|FG613049.1|FG613049 stem_S093_F08.SEQ Opium poppy stem cDNA library Papaver somniferum cDNA, mRNA sequence
CTTTCTCTAGGTTTCTCCGCAATTTTCAAGTGGACGAATCCAAATAGAATTTGCCAAGCTTTTCTTGATTTATCCTACTC
GGTGTAAAAATGGCGACAATAGGAGCTTCCTCAGCTTGCTGCATGATCAGAAGCACACCCCAGAACAGTGGTAAAATTGC...
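The formatting rules above can be turned into a small reader. The following is a minimal sketch, not a full validator: it assumes that every record starts with a ">" definition line and that all following lines up to the next ">" belong to that sequence. The function name read_fasta and the two short records are made up for illustration.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (definition_line, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                # A new definition line closes the previous record.
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data before the first '>' line")
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record file; the sequence lines are wrapped on purpose.
example = """>seq1 hypothetical first record
GAACGAAGGG
AGAGAACG
>seq2 hypothetical second record
CTTTCTCTAG
"""

records = list(read_fasta(example.splitlines()))
for header, sequence in records:
    print(header, len(sequence))
```

Wrapped sequence lines are joined, so the line length of the file (80 characters or otherwise) does not affect the result.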

FASTA Tools
FASTA Viewer and DNA Translator
–http://www.biolnk.com/
Some FASTA Tools
–http://bioinformatics.iyte.edu.tr/index.php?n=Softwares.FastaTools
FASTA Validator/ Converter to CSV file
–http://mbg305.allmer.de/tools/

FASTA Usage
Most programs that accept sequence input accept FASTA format
–BLAST (partially)
–FastA (obviously)
–Multiple Sequence Alignment Tools: most
–MS-based Database Search Engines: some (only database, not queries)
–Most Online Forms

FASTA Definition Line Formats
http://en.wikipedia.org/wiki/Fasta_format
–GenBank: gi|gi-number|gb|accession|locus
–EMBL Data Library: gi|gi-number|emb|accession|locus
–DDBJ, DNA Database of Japan: gi|gi-number|dbj|accession|locus
–NBRF PIR: pir||entry
–Protein Research Foundation: prf||name
–SWISS-PROT: sp|accession|name
–Brookhaven Protein Data Bank (1): pdb|entry|chain
–Brookhaven Protein Data Bank (2): entry:chain|PDBID|CHAIN|SEQUENCE
–Patents: pat|country|number
–GenInfo Backbone Id: bbs|number
–General database identifier: gnl|database|identifier
–NCBI Reference Sequence: ref|accession|locus
–Local Sequence identifier: lcl|identifier
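As an illustration of the GenBank pattern gi|gi-number|gb|accession|locus, the sketch below splits such a definition line into its fields. It handles only this one pattern and assumes that free-text description follows the first space; the function name parse_genbank_defline is hypothetical, and the example line is the first record of the FASTA example slide, shortened.

```python
def parse_genbank_defline(defline: str) -> dict:
    """Split a 'gi|gi-number|gb|accession|locus description' definition line."""
    text = defline.lstrip(">")
    # Everything up to the first space is the identifier, the rest is free text.
    identifier, _, description = text.partition(" ")
    fields = identifier.split("|")
    if len(fields) < 5 or fields[0] != "gi" or fields[2] != "gb":
        raise ValueError("not a GenBank-style gi|...|gb|...|... definition line")
    return {
        "gi": fields[1],
        "accession": fields[3],
        "locus": fields[4],
        "description": description,
    }


parsed = parse_genbank_defline(
    ">gi|189443480|gb|FG602538.1|FG602538 PF_T3_37R_G02_08AUG2003_004 Opium poppy root cDNA library"
)
print(parsed["gi"], parsed["accession"], parsed["locus"])
```

The other database patterns in the list would need their own branches (for example on the tag in the third field).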

GenBank Flat Text File
GenBank
–Sample record and explanation: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord
–FAQs: http://www.ncbi.nlm.nih.gov/books/NBK49541/#NucProtFAQ.Section_A_GenBank_nucleotide

Structured Text Files Different ways to structure text files –ASN.1 –XML –JSON –Wait for MBG403 for details

Structured Text Files
ASN.1 Example
–http://www.ncbi.nlm.nih.gov/nuccore/NC_003622.1?report=asn1&log$=seqview
–http://www.ncbi.nlm.nih.gov/nuccore/NC_003622
  Select Display Settings, then ASN.1

Databases Unlike the previous formats not easily readable –Special tools and languages are used to add, edit, retrieve, and view data Advantages –Secure –Stable –Distributed –Fast Access –Huge sizes supported http://www.freerepublic.com/focus/f-chat/2508670/posts Ever tried to search in 100 TB of text for something?

Scientific Data Source: http://www.bioinformatics.wsu.edu/bioinfo_course/notes/Lecture12.pdf

Characteristics of Scientific Data Highly Complex –Images, sequences, time series,... –Strong interdependence of data In Science –Outliers are of interest –Focus of interest changes rapidly –Data is usually shared –Data must be secure Never change data only add Many viewers few creators Collections –Large collections must be shared via strong servers –Small collections (e.g. SwissProt 63MB) can be shared more easily –New methodologies (MS, NGS,...) have expanded size of databases

Desired Features for Databases Efficiency Scalability Concurrency Security Integrity Stability Cross references to other databases Universally accessible Query Language Data mining Data Warehouse

How Many Bioinformatics Databases? Source: http://www.bioinformatics.wsu.edu/bioinfo_course/notes/Lecture12.pdf

An Abundance of Databases
Databases and Collections on http://www.hsls.pitt.edu/obrc/ (number of entries, 2012 -> 2014)
–DNA Sequence Databases and Analysis Tools: 499 -> 463
–Enzymes and Pathways: 281 -> 242
–Gene Mutations, Genetic Variations and Diseases: 303 -> 257
–Genomics Databases and Analysis Tools: 703 -> 636
–Immunological Databases and Tools: 61 -> 49
–Microarray, SAGE, and other Gene Expression: 215 -> 166
–Organelle Databases: 29 -> 25
–Other Databases and Tools (Literature Mining, Lab Protocols, Medical Topics, and others): 179 -> 147
–Plant Databases: 159 -> 146
–Protein Sequence Databases and Analysis Tools: 492 -> 408
–Proteomics Resources: 74 -> 58
–RNA Databases and Analysis Tools: 257 -> 222
–Structure Databases and Analysis Tools: 452 -> 384
Sum: 3704 -> 3203

Data Warehouses Are resources like NCBI and EBI databases? –No they are larger than what is generally called a database –They can be called data warehouses –They consist of many interlinked databases

Need for Improvement
Anyone can submit data to online resources
Rigorous data checking is necessary
–Saçar and Allmer (http://journal.imbio.de/index.php?paper_id=215)
–Bağcı and Allmer (http://dx.doi.org/10.1109/HIBIT.2012.6209038)
Data must be standardized
Quality of data must be specified

How to Cite Data
It is rarely necessary to present a sequence in any writing
In general it suffices to give
–Accession number of sequence
–Database where sequence is located
If database is not given try
–Accession Parser (www.biolnk.com)
In case you have a new sequence
–Generally required to deposit it in a database
–E.g.: http://www.ncbi.nlm.nih.gov/genbank/submit/
–Then cite the assigned accession number(s)

End of Theoretical Part 1 Mind mapping 10 min break

Practical Part 1

Where is the data?
Turn on your computers and let’s find out
–EBI (www.ebi.ac.uk/)
–Ensembl (www.ensembl.org)
–GenBank (www.ncbi.nlm.nih.gov/Genbank)
–SwissProt (www.tigr.org/tdb)
Make these pages bookmarks
–Are your bookmarks where you are?
–Try: http://www.delicious.com

Retrieve Data You want the DNA sequence of a human hemoglobin. How do you get it? Try to achieve this goal for a few minutes

Ctrl-F

No results

Where have we gone wrong? Language! Database!

GenBank

http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

GenBank Accession number –Applies to full record –X00000 –XX000000 –Never changes

GenBank Version
–Identifies a single sequence
–Adds a version to the accession number, format X00000.0
–The version changes (e.g. .0 -> .1) if even a single nucleotide in the sequence is changed
–Other versions are referenced: http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi
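The accession and version conventions can be checked mechanically. This is a minimal sketch that covers only the two accession shapes named on the slides (one letter plus five digits, two letters plus six digits) with an optional ".version" suffix; other accession shapes exist and would be rejected here. The names ACCESSION_RE and split_accession are hypothetical.

```python
import re

# One letter + five digits, or two letters + six digits,
# optionally followed by ".<version>". Only the two shapes from the slides.
ACCESSION_RE = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.(\d+))?$")


def split_accession(text: str):
    """Return (accession, version); version is None if no suffix is present."""
    match = ACCESSION_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized accession format: {text!r}")
    accession, version = match.groups()
    return accession, (int(version) if version is not None else None)


print(split_accession("FG602538.1"))
print(split_accession("X00000"))
```

Keeping the accession and the version apart mirrors the rule above: the accession never changes, while the version counts sequence revisions.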

GenBank GenInfo Identifier (GI)
–Any change to the sequence forces a new GI number
–Translations get separate GI numbers
–GI:00000

GenBank

Sequence?

GenBank Eukaryotic

Retrieving Sequences By Example Basic Local Alignment Search Tool BLAST

http://www.ebi.ac.uk/

What did we do? We wanted to find one of the human hemoglobins –The nucleotide sequence in FASTA format We wanted to find similar sequences –BLAST (ncbi) –FASTA (ebi) Who got lost in the jungle of LINKS? –That is normal –Bioinformatics is a quickly growing field –Consolidation not any time soon

End of Practical Part 1 15 min break

Theoretical Part 2
And now for something completely different
–http://en.wikipedia.org/wiki/And_Now_for_Something_Completely_Different
How can we find sequences?
Can the algorithm we found last week be used?

Similarity Searching Search Algorithms –BLAST –FASTA –... This is at the heart of bioinformatics It demands a lot of attention

Similarity Searching Exact pattern matching Approximate pattern matching

String Matching Math Remember the string matching we did last week? Today we will look at the math of finding EXACT matches between queries and databases If time allows we will look into substitution matrices

Probability for perfect matches
Query (Q): ATTGCC
Target (T): CGATTGCCCG
$L_Q$ = length of query (number of nucleotides)
$L_T$ = length of target sequence (number of nucleotides)

Element Probability
Probability of finding a nucleotide: very roughly 0.25
Given the sequence: ATTTCCGGGGTAGCTAGCTAGTATATTATCGGCGCTAA
What are the probabilities for A, C, G, and T?

Nucleotide   Number   Frequency
A             9       0.24
C             7       0.18
G            10       0.26
T            12       0.32
N (total)    38       1.00
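The counts in the table can be reproduced with a few lines of code. This is a small sketch that assumes the sequence contains only the letters A, C, G, and T; the function name nucleotide_frequencies is made up for illustration.

```python
from collections import Counter


def nucleotide_frequencies(sequence: str) -> dict:
    """Return {base: (count, relative frequency)} for A, C, G, and T."""
    counts = Counter(sequence.upper())
    total = len(sequence)
    return {base: (counts[base], counts[base] / total) for base in "ACGT"}


sequence = "ATTTCCGGGGTAGCTAGCTAGTATATTATCGGCGCTAA"
table = nucleotide_frequencies(sequence)
for base, (count, frequency) in table.items():
    print(base, count, round(frequency, 2))
```

The frequencies are kept unrounded in the returned dictionary and rounded only for display, so later calculations do not accumulate rounding error.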

Sequence Probability
$p = P_A \cdot P_C^2 \cdot P_G \cdot P_T^2$
What is p? p = the probability of randomly generating the sequence given the frequency and number of its elements (e.g.: $P_A$). There is no sequential dependency assumed in this model.
What is the probability of generating AAAAGTTT given the probabilities that we just calculated?
$p = 0.24^4 \cdot 0.26 \cdot 0.32^3 \approx 0.003 \cdot 0.260 \cdot 0.033 \approx 0.000026$
(Without rounding the intermediate factors the product is approximately 0.000028.)
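The same product can be computed programmatically. The sketch below assumes the independence model stated above (no sequential dependency) and uses the rounded frequencies 0.24, 0.18, 0.26, and 0.32; the function name sequence_probability is hypothetical. Because the code does not round intermediate factors, it returns about 0.000028 rather than the 0.000026 of the hand calculation.

```python
import math


def sequence_probability(sequence: str, base_probability: dict) -> float:
    """Probability of generating the sequence if positions are independent."""
    return math.prod(base_probability[base] for base in sequence)


# Rounded frequencies from the element probability table.
frequencies = {"A": 0.24, "C": 0.18, "G": 0.26, "T": 0.32}
p = sequence_probability("AAAAGTTT", frequencies)
print(f"{p:.6f}")
```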

How Often do we Expect to Find the Query
The number of matches is restricted by the database size
How often can we shift Q (Query) against T (Target)?
This defines the number of possible matching operations
$n = L_T - L_Q + 1$
Example: $L_Q = 6$, $L_T = 10$
$n = 10 - 6 + 1 = 5$
Query: ATTGCC
Target: CGATTGCCCG
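The shifting of the query against the target can be written out directly. This is a naive sketch of exact matching, shown only to make the count of matching operations concrete; the function names number_of_shifts and exact_match_positions are made up for illustration.

```python
def number_of_shifts(query_length: int, target_length: int) -> int:
    """n = L_T - L_Q + 1, or 0 if the query is longer than the target."""
    return max(target_length - query_length + 1, 0)


def exact_match_positions(query: str, target: str) -> list:
    """Return all 0-based start positions where query occurs exactly in target."""
    n = number_of_shifts(len(query), len(target))
    return [start for start in range(n)
            if target[start:start + len(query)] == query]


positions = exact_match_positions("ATTGCC", "CGATTGCCCG")
print(number_of_shifts(6, 10), positions)  # prints: 5 [2]
```

Each of the n shifts is one comparison of the query against a window of the target, which is why n is the number of trials in the probability model.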

The probability distribution of the number of matches is approximately binomial:
Definition: $q = 1 - p$
$p(x) = \frac{n!}{x!\,(n - x)!}\, p^x q^{n - x}$
What is p? What is n? What is q?
http://en.wikipedia.org/wiki/Binomial_distribution
(Figure: binomial distributions for n = 20 with p = 0.1, p = 0.5, and p = 0.8)
p: probability for being true
q: probability for being false
n: number of trials
x: number of successes

Problem Factorial leads to overflows in computer programming With n*p < 1 and large n The distribution can be approximated by a Poisson distribution –Much easier to calculate for a computer

Poisson vs. Binomial Distributions
Poisson: $p(x) = e^{-\lambda} \frac{\lambda^x}{x!}$, with $\lambda = n p$
Binomial: $p(x) = \frac{n!}{x!\,(n - x)!}\, p^x q^{n - x}$
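The two formulas can be compared numerically. The sketch below uses illustrative values that are not from the slides (n = 1000 trials, p = 0.001, so lambda = 1) to show how close the Poisson approximation is when n is large and p is small; the function names are hypothetical.

```python
import math


def binomial_pmf(x: int, n: int, p: float) -> float:
    """p(x) = n! / (x! (n - x)!) * p^x * q^(n - x) with q = 1 - p."""
    return math.comb(n, x) * p**x * (1.0 - p) ** (n - x)


def poisson_pmf(x: int, lam: float) -> float:
    """p(x) = e^(-lambda) * lambda^x / x!"""
    return math.exp(-lam) * lam**x / math.factorial(x)


n, p = 1000, 0.001          # illustrative values, lambda = n * p = 1
lam = n * p
for x in range(4):
    print(x, round(binomial_pmf(x, n, p), 5), round(poisson_pmf(x, lam), 5))
```

The Poisson form needs only x! for the small number of matches x, whereas the binomial form involves n!, which is what becomes unmanageable for database-sized n in languages with fixed-width numbers.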

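Partial matches
So far we considered matching the complete query
Partial match of length $L$ ($L \le L_Q \wedge L \le L_T$)
$p = 2^{-2L}$ (each nucleotide has probability $1/4 = 2^{-2}$)
$m = L_Q - L + 1$ (number of start positions of a window of length $L$ in the query)
$n = L_D - L + 1$ (number of start positions in the database of length $L_D$)
$E = m\, n\, 2^{-2L}$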

BLAST E-Value
$E = m\, n\, 2^{-S}$
$E = m\, n\, 2^{-2L}$
Describes the number of expected matches which are equally good or better
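The E-value formula is easy to evaluate once m and n are fixed. The sketch below assumes that m and n are passed in as the numbers of possible start positions in the query and in the database, and that every nucleotide has probability 1/4, so an exact match of length L contributes 2^(-2L). The function name expected_matches is hypothetical, and the numbers reuse the small ATTGCC versus CGATTGCCCG example (complete query, so m = 1 and n = 5).

```python
def expected_matches(m: int, n: int, match_length: int) -> float:
    """E = m * n * 2^(-2L) for an exact match of length L."""
    return m * n * 2.0 ** (-2 * match_length)


# Complete query of length 6 against a target of length 10:
# m = 1 start position in the query, n = 10 - 6 + 1 = 5 in the target.
e_value = expected_matches(m=1, n=5, match_length=6)
print(e_value)  # 5 / 4096
```

Each additional matching nucleotide divides E by four, while E grows only linearly with the database size.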

End of Theoretical Part 2 Mind mapping 10 min break

Practical Part 2

Practice Poisson vs Binomial Q: ATG D: CGATTGCCCG Calculate p(0), p(1) and p(3) Note: at least one match = 1 – p(0)

$E = m\, n\, 2^{-2L}$
Assuming a database size of 10 000 000 and a query length of 10, calculate the number of matches that would happen by chance.

Practical Concerns
Human genome: 3 billion nucleotides
Dogma: 14 nucleotides are enough to uniquely identify a gene
Verify this using the Poisson distribution
Poisson: $p(x) = e^{-\lambda} \frac{\lambda^x}{x!}$, with $\lambda = n p$

BLAST Interface Setting a cutoff E-value –Consider the calculation you just did –If someone was to set the cutoff to 0.01 with the same assumptions How many results would you expect? What would you advise the user? Topic will be revisited later

Amino Acid Sequences What changes when instead of nucleotide sequences we were to use amino acid sequences?

Practise this Determine how long a query must be so that it can uniquely identify a gene in the human genome –p < 0.05

Assignments Go to GenBank and inspect all parameters –Find their meaning (even if you think you know what it means) –Sometimes definitions are surprising Collect information about parameters that pose problems to you –Submit this information to us so that we can discuss in the following week

Homework 1 Make a table showing the E-value against $L_Q$ (10..100) with $L_D$ = 3 000 000 000. Use Excel to do this. Send the results to bioinformatics@allmer.de
