Sequence Analysis
Dilvan Moreira (based on a presentation by Prof. André Carvalho)
Reading: Introduction to Computational Genomics: A Case Studies Approach, Chapter 1
Introduction
- Cells
- Molecular Biology
- Probabilistic Sequence Models
  - Multinomial Models
  - Markov Models
- Genome Annotation
- Databases
André de Carvalho - ICMC/USP 1/27/2018
Cells
Cells
- Cell: the basic unit of all living beings
- A compartment enclosed by a membrane and filled with an aqueous solution
- May contain organelles with specific functions
  - Mitochondria: energy generation
  - Golgi complex: accumulation of secretions
  - Among others
Cells
- Cell doctrine
  - All living beings are made of cells and their products
  - Cells have structure and function
  - All cells arise from pre-existing cells
  - A cell can make copies of itself by replication and division
Cells
- By number of cells, an organism is classified as:
  - Unicellular (bacteria, protozoa)
  - Multicellular (worms, mammals)
- By the presence of a nucleus in its cells, an organism is classified as:
  - Eukaryote: has a nucleus enclosed by a membrane
  - Prokaryote: does not have a nucleus
Cells
- Being a prokaryote does not necessarily mean being unicellular
  - Most prokaryotes live as unicellular organisms
  - Some species group into clusters, chains, or other multicellular arrangements
- Many unicellular organisms are eukaryotes
An animal cell
- Nucleus: contains DNA and RNA
- Rough endoplasmic reticulum (ER): produces proteins
- Smooth ER: produces lipids
- Golgi complex: cellular digestion as a basic function
- Mitochondria: produce energy; they have their own DNA and can replicate themselves
Cells
- All cells of the same organism have the same genes
- Not all cells have the same organelles in equal proportions
- Cells vary in form and function
  - Normally, form is related to function
  - A cell's specific function and form are defined by the genes it expresses
Cells
- The chemical processes that occur in a cell are basically the same
  - For all cell types and organisms
  - Even though those cells have different forms and functions
- DNA replication in a bacterium is similar to DNA replication in a mammal
- This makes scientific advances easier
  - Experiments on simpler organisms can be used to infer results for other organisms
Cytology vs. Molecular Biology
- Cytology: the science that studies the cell (fixed cells)
  - Studies cellular organization, types, functioning, division mechanisms, etc.
- With scientific advances, it became possible to analyze living cells (in vivo)
  - At the molecular level
  - This gave rise to the term Molecular Biology
DNA
- Deoxyribonucleic acid
- May be single- or double-stranded
- Double-stranded DNA: two strands twisted around each other to form a double helix
- The long polymer is compacted into a chromosome
- DNA is composed of four different nucleotides (bases)
  - Adenine, Cytosine, Guanine, and Thymine (Uracil replaces Thymine in RNA)
- The double strand is held together by base pairing
DNA
- The DNA strands are held together by bonds that connect each nucleotide of one strand to its complement on the other strand
DNA
- By convention, a DNA strand is read from the 5' end to the 3' end, the direction in which RNA is synthesized during transcription
- The two strands run in opposite directions:
  5' ATTTAGGCC 3'
  3' TAAATCCGG 5'
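The base pairing in the two-strand example can be sketched in a few lines of Python. This is a minimal illustration, assuming plain A/C/G/T input; the function names `complement` and `reverse_complement` are chosen here for illustration and do not come from the slides.

```python
# Watson-Crick pairing: A<->T and C<->G (unambiguous DNA symbols only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the paired bases, position by position (reads 3'->5')."""
    return strand.upper().translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Return the opposite strand written in the usual 5'->3' direction."""
    return complement(strand)[::-1]


top = "ATTTAGGCC"                # 5' ATTTAGGCC 3' from the slide
print(complement(top))           # TAAATCCGG, the lower strand as drawn (3'->5')
print(reverse_complement(top))   # GGCCTAAAT, the lower strand read 5'->3'
```

Reversing the complement matters because sequence files store each strand in its own 5' to 3' direction.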
DNA
- 5' end
  - At one end is the first nucleotide, with a phosphate group projecting from carbon C5
- 3' end
  - At the other end is the last nucleotide added to the DNA strand; it is the only one that still has a free C3-OH group
Molecular Biology
- The genome is the set of all the DNA in a cell (organism)
  - Including the genes
- Genes carry the information needed to produce the proteins an organism requires
- Proteins determine
  - The organism's appearance
  - How the body metabolizes food or defends itself against infections
  - Sometimes, the organism's behavior
Fraction of yeast genome CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTTACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAACCACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATCCAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATACTGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAAATAACACACACGTGCTTACCCTACCACTTTATACCACCACCACATGCCATACTCACCCTCACTTGTATACTGATTTTACGTACGCACACGGATGCTACAGTATATACCATCTCAAACTTACCCTACTCTCAGATTCCACTTCACTCCATGGCCCATCTCTCACTGAATCAGTACCAAATGCACTCACATCATTATGCACGGCACTTGCCTCAGCGGTCTATACCCTGTGCCATTTACCCATAACGCCCATCATTATCCACATTTTGATATCTATATCTCATTCGGCGGTCCCAAATATTGTATAACTGCCCTTAATACATACGTTATACCACTTTTGCACCATATACTTACCACTCCATTTATATACACTTATGTCAATATTACAGAAAAATCCCCACAAAAATCACCTAAACATAAAAATATTCTACTTTTCAACAATAATACATAAACATATTGGCTTGTGGTAGCAACACTATCATGGTATCACTAACGTAAAAGTTCCTCAATATTGCAATTTGCTTGAACGGATGCTATTTCAGAATATTTCGTACTTACACAGGCCATACATTAGAATAATATGTCACATCACTGTCGTAACACTCTTTATTCACCGAGCAATAATACGGTAGTGGCTCAAACTCATGCGGGTGCTATGATACAATTATATCTTATTTCCATTCCCATATGCTAACCGCAATATCCTAAAAGCATAACTGATGCATCTTTAATCTTGTATGTGACACTACTCATACGAAGGGACTATATCTAGTCAAGACGATACTGTGATAGGTACGTTATTTAATAGGATCTATAACGAAATGTCAAATAATTTTACGGTAATATAACTTATCAGCGGCGTATACTAAAACGGACGTTACGATATTGTCTCACTTCATCTTACCACCCTCTATCTTATTGCTGATAGAACACTAACCCCTCAGCTTTATTTCTAGTTACAGTTACACAAAAAACTATGCCAACCCAGAAATCTTGATATTTTACGTGTCAAAAAATGAGGGTCTCTAAATGAGAGTTTGGTACCATGACTTGTAACTCGCACTGCCCTGATCTGCAATCTTGTTCTTAGAAGTGACGCATATTCTATACGGCCCGACGCGACGCGCCAAAAAATGAAAAACGAAGCAGCGACTCATTTTTATTTAAGGACAAAGGTTGCGAAGCCGCACATTTCCAATTTCATTGTTGTTTATTGGACATACACTGTTAGCTTTATTACCGTCCACGTTTTTTCTACAATAGTGTAGAAGTTTCTTTCTTATGTTCATCGTATTCATAAAATGCTTCACGAACACCGTCATTGATCAAATAGGTCTATAATATTAATATACATTTATATAATCTACGGTATTTATATCATCAAAAAAAAGTAGTTTTTTTATTTTATTTTGTTCGTTAATTTTCAATTTCTATGGAAACCCGTTCGTAAAATTGGCGTTTGTCTCTAGTTTGCGATAGTGTAGATACCGTCCTTGGATAGAGCACTGGAGATGGCTGGCTTTAATCTGCTGGAGTACCATGGAACACCGGTGATCATTCTGGTCACTTGGTCTGGAGCAATACCGGTCAACATGGTGGTGAAGTCACCGTAGTTGAAAACGGCTTCAGCAACT
TCGACTGGGTAGGTTTCAGTTGGGTGGGCGGCTTGGAACATGTAGTATTGGGCTAAGTGAGCTCTGATATCAGAGACGTAGACACCCAATTCCACCAAGTTGACTCTTTCGTCAGATTGAGCTAGAGTGGTGGTTGCAGAAGCAGTAGCAGCGATGGCAGCGACACCAGCGGCGATTGAAGTTAATTTGACCATTGTATTTGTTTTGTTTGTTAGTGCTGATATAAGCTTAACAGGAAAGGAAAGAATAAAGACATATTCTCAAAGGCATATAGTTGAAGCAGCTCTATTTATACCCATTCCCTCATGGGTTGTTGCTATTTAAACGATCGCTGACTGGCACCAGTTCCTCATCAAATATTCTCTATATCTCATCTTTCACACAATCTCATTATCTCTATGGAGATGCTCTTGTTTCTGAACGAATCATAAATCTTTCATAGGTTTCGTATGTGGAGTACTGTTTTATGGCGCTTATGTGTATTCGTATGCGCAGAATGTGGGAATGCCAATTATAGGGGTGCCGAGGTGCCTTATAAAACCCTTTTCTGTGCCTGTGACATTTCCTTTTTCGGTCAAAAAGAATATCCGAATTTTAGATTTGGACCCTCGTACAGAAGCTTATTGTCTAAGCCTGAATTCAGTCTGCTTTAAACGGCTTCCGCGGAGGAAATATTTCCATCTCTTGAATTCGTACAACATTAAACGTGTGTTGGGAGTCGTATACTGTTAGGGTCTGTAAACTTGTGAACTCTCGGCAAATGCCTTGGTGCAATTACGTAATTTTAGCCGCTGAGAAGCGGATGGTAATGAGACAAGTTGATATCAAACAGATACATATTTAAAAGAGGGTACCGCTAATTTAGCAGGGCAGTATTATTGTAGTTTGATATGTACGGCTAACTGAACCTAAGTAGGGATATGAGAGTAAGAACGTTCGGCTACTCTTCTTTCTAAGTGGGATTTTTCTTAATCCTTGGATTCTTAAAAGGTTATTAAAGTTCCGCACAAAGAACGCTTGGAAATCGCATTCATCAAAGAACAACTCTTCGTTTTCCAAACAATCTTCCCGAAAAAGTAGCCGTTCATTTCCCTTCCGATTTCATTCCTAGACTGCCAAATTTTTCTTGCTCATTTATAATGATTGATAAGAATTGTATTTGTGTCCCATTCTCGTAGATAAAATTCTTGGATGTTAAAAAATTATTATTTTCTTCATAAAGAAGCTTTCAAGATATAAGATACGAAATAGGGGTTGATAATTGCATGACAGTAGCTTTAGATCAAAAAGGAAAGCATGGAGGGAAACAGTAAACAGTGAAAATTCTCTTGAGAACCAAAGTAAACCTTCATTGAAGAGCTTCCTTAAAAAATTTAGAATCTCCCATGTCAACGGGTTTCCATACCTCCCCAGCATCATACATCTTTTTTCAAAGAAACTTCAAATGCCTCTTTTATGCAAGGGGCAAAATCCTGAAATGACTTAAACTTAGCAGTTTCGTCTTTTTTCAAAGAGAATGGTTGAAGAAGAATTGTTTTGGACGCTTATTGACAATCTGTTGCATTGATAAAGTACCTACTATCCCAGACTATATTTGTATACAAGTACAAAATTAGGTTTGTTGAAACAACTTTCCGATCATTGGTGCCCGTATCTGATGTTTTTTTAGTAATTTCTTTGTAAATACAGGGAGTTGTTTCGAAAGCTTATGAGAAAAATACATGAATGACAGGTAAAAATATTGGCTCGAAAAAGAGGACAAAAAGAGAAATCATAAATGAGTAAACCCACTTGCTGGACATTATCCAGTAAAGGCTTGGTAGTAACCATAATATTACCCAGGTACGAAACGCTAAGAACCTTGAAAGACTCATAAAACTTCCAGGTTAAGCTATTTTTGAAAATATTCTGAGGTAAAAGCCATTAAGGTCCAGATAACCAAGGGACAATAAACCTATGCTTTTCTTGTCTTCAATTTCAGTATCTTTCCATTTTGATAA
TGAGCATGTGATCCGGAAAGCTACTTTATGATGTTTCAAGGCCTGAAGTTTGAATATTTATGTAGTTCAACATCAAATGTGTCTATTTTGTGATGAGGCAACCGTCGACAACCTTATTATCGAAAAAGAACAACAAGTTCACATGCTTGTTACTCTCTATAACTAGAGAGTACTTTTTTTGGAAGCAAGTAAGAATAAGTCAATTTCTACTTACCTCATTAGGGAAAAATTTAATAGCAGTTGTTATAACGACAAATACAGGCCCTAAAAAATTCACTGTATTCAATGGTCTACGAATCGTCAATCGCTTGCGGTTATGGCACGAAGAACAATGCAATAGCTCTTACAAGCCACTACATGACAAGCAACTCATAATTTAA André de Carvalho - ICMC/USP 27/01/2018
Molecular Biology
- Haploid cells: 1 set of chromosomes
- Diploid cells: 2 sets of chromosomes (pairs)
Molecular Biology
- Genes
  - Subsequences of DNA
  - Found in the chromosomes
  - They are the templates for protein or RNA production
- Between the genes there are segments called non-coding regions
Not all DNA codes for genes

Organism                   Base pairs     Genes      Description
ФX-174                     5386           10         E. coli virus
Human mitochondrion        16569          37         Subcellular organelle
Mycoplasma pneumoniae      816394         680        Pneumonia
Hemophilus influenzae      1830138        1738       Ear infection
E. coli                    4639221        4406
Saccharomyces cerevisiae   12.1 x 10^6    5885       Yeast
C. elegans                 95.5 x 10^6    19099      Worm
Drosophila melanogaster    180 x 10^6     13601      Fruit fly
Human                      3200 x 10^6    22,000 ?   Humans
Non-coding DNA
- It is not part of protein/RNA synthesis
- It was once considered genetic "junk"
- It binds to a DNA strand
- One of its functions: blocking the transcriptional process
  - The bound gene is not read
  - This prevents expression of the associated protein
- Inhibiting genes may prevent the growth of tumor cells
  - Researchers were able to bind genes not related to tumor growth
Molecular Biology
- Scientists identified a gene related to breast cancer (SATB1)
  - Paper published in Nature (March 2008)
- In a healthy organism: an organizer of other genes
- In a cancerous organism: drives tumor growth, controlling around 1000 other genes
  - Like a gang leader directing a gang
- Active role in the formation of new cancer foci (metastasis)
  - The most common cause of death in patients with the disease
Bioinformatics
- Experiments in rats:
  - After the gene is inactivated, the exacerbated proliferation of tumor cells stops
  - The cancer loses its aggressive potential
- Allows earlier and more accurate diagnosis
- [Figure: breast cancer cells with a defective gene]
Molecular Biology
- Proteins
  - Define the structure, function, and regulatory mechanisms of cells
    - Examples of regulatory mechanisms: control of the cell cycle, gene transcription
  - Linear sequences built from 20 different amino acids
  - Three consecutive nucleotides (a codon) encode one amino acid
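The codon-to-amino-acid mapping can be sketched as a lookup table. This is a minimal illustration that assumes the standard genetic code, reading frame 1, and a DNA (not RNA) input; the names `CODON_TABLE` and `translate` are hypothetical helpers, and `*` marks a stop codon.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate a coding DNA string codon by codon, starting at position 1.

    Trailing bases that do not fill a complete codon are ignored.
    """
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


print(translate("ATGGCCTAA"))  # M (ATG), A (GCC), then a stop codon (TAA)
```

The table has 64 entries for 20 amino acids plus stop, which is why several codons map to the same amino acid.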
Size of Genomes
- Prokaryotes
  - 0.5 to 12 megabases (Mb; 1 Mb = 1,000,000 bp)
- Viruses
  - 5 to 50 kilobases (kb; 1 kb = 1,000 bp)
- Eukaryotes
  - 8 megabases to 670 gigabases (Gb; 1 Gb = 1,000,000,000 bp)
  - Large amount of repetitive DNA
- Organelles
  - Most eukaryotes also have a genome outside the nucleus
  - Probably remnants of prokaryotes that lived in symbiosis
Virus vs. Bacteria
- Bacteria
  - Unicellular prokaryotes
  - Free-living
  - May be found isolated or in colonies
  - Generally have a single circular chromosome
- Viruses
  - Smaller than bacteria
  - Obligate parasites
  - Single- or double-stranded genome
  - Basically made of proteins
  - Reproduce by invading a cell and taking control of its replication machinery
Probabilistic Models of Sequences
- Most computational genomics studies use statistical methods
  - E.g., finding structures of interest in sequences of millions of bp
  - Most of the sequence carries no relevant information
- Hence the need for probabilistic models of DNA sequences
Probabilistic Models of Sequences
- Abstraction of the 3D molecule into a linear sequence of symbols
  - Alphabet {A, C, T, G}
- Allows the use of powerful mathematical tools
- Information about the three-dimensional structure is ignored
Probabilistic Models of Sequences
- Definition 1.1
  - A DNA sequence s is a finite string over the alphabet N = {A, C, T, G} of nucleotides
  - A genome is the set of all DNA sequences of an organism or organelle
- This allows us to use statistical models of:
  - Sequence evolution, sequence similarity, etc.
Probabilistic Models of Sequences
- Definition 1.2
  - The elements of a sequence s are denoted s = s_1 s_2 ... s_n, where each s_i is one element
  - Given a set of indices K, the elements of s can be concatenated in their original order
    - s(K) = s_i s_j s_k if K = {i, j, k}
  - An interval can also be used: K = [i, j] = (i:j)
  - A single symbol is denoted with K = {i}: s_i = s(i)
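The index notation of Definition 1.2 is 1-based, while Python strings are 0-based, so a small sketch may help avoid off-by-one mistakes. The helper names `subsequence` and `interval` are illustrative assumptions, and the example string is arbitrary.

```python
def subsequence(s: str, indices) -> str:
    """s(K): concatenate the elements of s at the 1-based indices in K, in original order."""
    return "".join(s[i - 1] for i in sorted(indices))


def interval(s: str, i: int, j: int) -> str:
    """s(i:j): the elements from position i to position j, both inclusive (1-based)."""
    return s[i - 1:j]


s = "GATTACA"
print(subsequence(s, {1, 3, 7}))  # GTA
print(interval(s, 2, 4))          # ATT
print(subsequence(s, {5}))        # A, the single symbol s(5)
```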
Exercise
- Given the DNA sequence s = ATATGTCGTGCA, find:
  - s(7) =
  - s(2:6) =
Probabilistic Models of Sequences
- Almost all probabilistic methods of sequence analysis fall into two categories:
  - Multinomial models
  - Markov models
Multinomial Models
- The simplest models
- Assume a probability distribution p over the alphabet
  - Nucleotides are independent and identically distributed (i.i.d.) along the sequence
- E.g., for a DNA sequence, p = (p_A, p_C, p_G, p_T), where p_x = p(s_i = x)
  - Independent of the position i
  - p_A + p_C + p_G + p_T = 1 (normalization constraint)
- The probabilities can be set equal or estimated from the frequency of each nucleotide
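As a rough sketch of the multinomial model, the probabilities can be estimated from nucleotide frequencies and then used to score a sequence. The function names are hypothetical, natural logarithms are used to avoid underflow on long sequences, and symbols outside A, C, G, T are simply ignored during estimation (an assumption made here, not something the slides specify).

```python
import math
from collections import Counter


def estimate_multinomial(seq: str, alphabet: str = "ACGT") -> dict:
    """Estimate p_x as the relative frequency of each symbol x in seq."""
    counts = Counter(seq.upper())
    total = sum(counts[x] for x in alphabet)
    return {x: counts[x] / total for x in alphabet}


def log_prob_multinomial(seq: str, p: dict) -> float:
    """log p(s) under the i.i.d. model: the sum of log p_x over all positions."""
    return sum(math.log(p[x]) for x in seq.upper())


p = estimate_multinomial("AACGTT")
print(p)                              # A and T get 2/6 each, C and G get 1/6 each
print(log_prob_multinomial("AC", p))  # log(2/6) + log(1/6)
```

A symbol with estimated probability zero would make the logarithm fail, so real data usually needs some smoothing before scoring.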
Multinomial Models
- DNA sequences are not expected to be truly random
- The validity of the model can be tested on real sequences
  - Estimate symbol frequencies in regions of the sequence
  - Test for violations of independence by checking correlations between neighboring symbols
  - Regions where the frequencies change, and the correlations themselves, are of interest
Markov Models
- Provide a more complex model of DNA sequences
- The probability of observing a symbol depends on the preceding symbols in the sequence
  - Last symbol: order 1
  - Last two symbols: order 2
  - No previous symbol: order 0 (multinomial)
- Can model local correlations between nucleotides
Markov Models
- Transition matrix
  - Rows are the "from" state and columns the "to" state, over A, C, G, T; e.g., p_CA is the probability of moving from C to A
  - [Figure: state diagram and 4 x 4 transition matrix over A, C, G, T; the values visible in the figure are 0.99, 0.002, and 0.006]
- Equal probabilities = multinomial model
- Probabilities of each initial state
Markov Models
- Transition matrix entries are defined by:
  - p_xy = p(s_{i+1} = y | s_i = x)
- Probability of a sequence s:
  - p(s) = p(s_1 s_2 ... s_n)
  - Order 0: p(s) = p(s_1) p(s_2) ... p(s_n)
  - Order 1: p(s) = p(s_n | s_{n-1}) p(s_{n-1} | s_{n-2}) ... p(s_2 | s_1) p(s_1)
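The order-1 formula can be sketched as follows: estimate the transition probabilities p_xy from adjacent pairs, then add up log probabilities along the sequence. The function names and the `pseudocount` parameter are assumptions introduced for illustration (the pseudocount only keeps unseen transitions from getting probability zero); the input is assumed to contain only A, C, G, T.

```python
import math


def estimate_transitions(seq: str, alphabet: str = "ACGT", pseudocount: float = 1.0) -> dict:
    """Estimate p_xy = p(s_{i+1} = y | s_i = x) from adjacent pairs in seq."""
    counts = {x: {y: pseudocount for y in alphabet} for x in alphabet}
    for x, y in zip(seq, seq[1:]):
        counts[x][y] += 1
    trans = {}
    for x in alphabet:
        row_total = sum(counts[x].values())
        trans[x] = {y: counts[x][y] / row_total for y in alphabet}
    return trans


def log_prob_markov1(seq: str, initial: dict, trans: dict) -> float:
    """log p(s) = log p(s_1) + sum over i of log p(s_{i+1} | s_i)."""
    logp = math.log(initial[seq[0]])
    for x, y in zip(seq, seq[1:]):
        logp += math.log(trans[x][y])
    return logp


trans = estimate_transitions("ATATATGCGCGC")
initial = {x: 0.25 for x in "ACGT"}  # assumed uniform initial-state probabilities
print(trans["A"]["T"])               # high: in this toy string A is always followed by T
print(log_prob_markov1("ATAT", initial, trans))
```

When every row of the transition matrix equals the same distribution, the score reduces to the order-0 (multinomial) case.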
Genome Annotation
- Simple statistics may describe important characteristics

Basic statistics of H. influenzae (1,830,138 bp)
Base   Number    Frequency
A      567,623   0.3102
C      350,723   0.1916
G      347,436   0.1898
T      564,241   0.3083
Genome Annotation
- Base composition
  - Base frequencies differ between the genomes of different organisms
  - Frequencies may also vary between different parts of a genome
    - This violates the assumption of the multinomial model
Genome Annotation
- Look at only one of the strands
- Slide a window of size k along the sequence
Genome Annotation
- GC content (frequency of C and G)
  - The most cited measure in papers
  - C and G (and likewise A and T) have similar frequencies
  - Aggregate frequency of GC versus AT (AT = 1 - GC)

GC content for different organisms
Organism          GC content (%)
H. influenzae     38.8
M. tuberculosis   65.8
S. Enteritidis    49.5
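GC content is straightforward to compute. The sketch below uses a hypothetical helper `gc_content` and, as an assumption, counts only A, C, G, T in the denominator so that ambiguous symbols such as N do not distort the fraction.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the A/C/G/T symbols of seq (0.0 for no valid symbols)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


seq = "GGCCAATT"
gc = gc_content(seq)
print(gc)      # 0.5
print(1 - gc)  # 0.5, the AT fraction, since AT = 1 - GC
```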
Genome Annotation
- GC content (frequency)
  - May be used to detect foreign genetic material in part of a genome
  - Species may acquire subsequences from other organisms (e.g., viruses)
    - Horizontal gene transfer
Genome Annotation
- Change-point analysis
  - Detects where the base distribution (or GC content) changes
  - The change points divide the sequence into more uniform regions
  - They may help identify important biological signals
- Simplest approach: use a threshold
  - Choosing the threshold value (like the window size) is a statistical problem
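One simple way to apply a threshold is to compute GC content in consecutive windows and report where it jumps. This is only a sketch of that idea, not the specific procedure from the course; `windowed_gc`, `change_points`, the window size, and the threshold are all illustrative assumptions.

```python
def windowed_gc(seq: str, k: int, step: int = 0) -> list:
    """Return (start, GC fraction) for each window of size k; starts are 0-based."""
    step = step or k  # default: non-overlapping windows
    seq = seq.upper()
    result = []
    for start in range(0, len(seq) - k + 1, step):
        window = seq[start:start + k]
        result.append((start, sum(b in "GC" for b in window) / k))
    return result


def change_points(values: list, threshold: float) -> list:
    """Window starts where GC differs from the previous window by more than threshold."""
    return [
        cur_start
        for (_, prev_gc), (cur_start, cur_gc) in zip(values, values[1:])
        if abs(cur_gc - prev_gc) > threshold
    ]


seq = "ATATATAT" + "GCGCGCGC"  # an AT-rich region followed by a GC-rich one
values = windowed_gc(seq, k=4)
print(values)                      # [(0, 0.0), (4, 0.0), (8, 1.0), (12, 1.0)]
print(change_points(values, 0.5))  # [8]
```

Small windows give noisy estimates and large windows blur the boundary, which is why choosing k and the threshold is itself a statistical question.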
Genome Annotation
- k-mer frequency and motif bias
  - Another useful measure is the frequency of words of length 2 or more
    - Dimers, trimers, k-mers
  - Unusual k-mers: any word that occurs in the genome with a frequency higher or lower than expected
  - Bias in the position or frequency of these words may reveal important information about their function
  - k-mers are counted by sliding a window of size k along the sequence
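Counting k-mers with a sliding window takes only a few lines. The helper name `count_kmers` is an assumption; windows overlap, so a sequence of length n yields n - k + 1 words.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


counts = count_kmers("ATATA", 2)
print(counts)  # AT occurs twice and TA occurs twice
```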
Genome Annotation
- k-mer frequency and motif bias
  - It is also possible to plot the frequency of only a few k-mers of interest
    - E.g., the dimers (dinucleotides) AT and CG
  - There are examples of statistical bias in nucleotide usage
    - E.g., low frequency of CG in some organisms
  - These biases are easy to see in "genome signatures"
    - Chaos Game Representation (CGR)
Genome Signature
- CGR shown as colors representing the observed frequencies of k-mers
  - The darker the color, the more frequent the k-mer
- [Figure: CGR panels for 2-mers, 5-mers, and 8-mers]
Genome Signature
- CGR displays the frequencies of the 4^k words (strings) of length k
- The square image is divided into 4 quadrants q1, one for each nucleotide
- The pixels giving the frequencies of all strings of length k that end in a given nucleotide fall in that nucleotide's quadrant
Example: 2-mers
- [Figure: CGR square with quadrants labeled C, G, A, and T; the A quadrant is subdivided into CA, GA, AA, and TA]
- The A quadrant holds the frequencies of all the words that end with nucleotide A
Genome Signature
- Each quadrant q1 is divided into 4 quadrants q2
  - One q2 for each nucleotide that can appear in the second-to-last position of the word
  - Each q2 sits in the same relative position as the same nucleotide's q1 quadrant
- Continue until the image has the required number of pixels (4^k)
- Color each square in proportion to the frequency of its k-mer
- [Figure: quadrant layout labeled C, G, A, T]
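The recursive subdivision can be sketched by treating each nucleotide as one (row, column) bit pair, with the last nucleotide of the word choosing the largest quadrant. The corner layout below (C top left, G top right, A bottom left, T bottom right) is an assumption read off the order of the labels in the slides, and `cgr_cell` and `cgr_grid` are hypothetical names.

```python
# Assumed corner layout: (row bit, column bit) for each nucleotide.
CORNER = {"C": (0, 0), "G": (0, 1), "A": (1, 0), "T": (1, 1)}


def cgr_cell(word: str) -> tuple:
    """Return the (row, col) of a k-mer in the 2^k x 2^k CGR grid.

    The last nucleotide selects the q1 quadrant, the second-to-last the q2
    quadrant inside it, and so on, so later symbols are more significant bits.
    """
    row = col = 0
    for base in reversed(word):
        r, c = CORNER[base]
        row = 2 * row + r
        col = 2 * col + c
    return row, col


def cgr_grid(seq: str, k: int) -> list:
    """Grid of relative k-mer frequencies; words with other symbols are skipped."""
    size = 2 ** k
    grid = [[0.0] * size for _ in range(size)]
    n = len(seq) - k + 1
    for i in range(n):
        word = seq[i:i + k]
        if all(b in CORNER for b in word):
            row, col = cgr_cell(word)
            grid[row][col] += 1 / n
    return grid


print(cgr_cell("CA"))  # (2, 0): inside the A quadrant, in its C (top-left) position
for row in cgr_grid("CACA", 2):
    print(row)
```

Each cell value could then be mapped to a gray level, darker for higher frequency, to produce the image.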
Genome Annotation
- The frequency of motifs (k-mers) can provide relevant information
  - Motif: a frequent sequence of nucleotides that may have biological relevance
- A simple statistical analysis can take the nucleotide frequencies into account
  - It can find over- or under-represented motifs
  - This helps decide whether a bias is significant
- Unusual motifs may have biological relevance
  - E.g., frequent motifs may be associated with repetitive elements
Genome Annotation
- Look for unusual dimers (dinucleotides) in H. influenzae

Observed frequency
     A        C        G        T
A    1.2491   0.8496   0.8210   0.9535
C    1.1182   1.0121   1.0894   0.8190
G    0.8736   1.4349   1.0076   0.8526
T    0.7541   0.8763   1.1204   1.2505
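The values in the table cluster around 1, which suggests they are ratios of observed dimer frequency to the frequency expected under the multinomial model; that reading is an assumption, not something the slide states. Under that assumption, a sketch of the computation could look like this (`dimer_odds_ratios` is a hypothetical name):

```python
from collections import Counter


def dimer_odds_ratios(seq: str) -> dict:
    """For each observed dimer xy, return f(xy) / (f(x) * f(y)).

    f(x) is the relative frequency of nucleotide x and f(xy) the relative
    frequency of the dimer among all overlapping dimers in seq.
    """
    seq = seq.upper()
    n = len(seq)
    base_freq = {b: c / n for b, c in Counter(seq).items()}
    dimers = Counter(seq[i:i + 2] for i in range(n - 1))
    total = n - 1
    return {
        d: (c / total) / (base_freq[d[0]] * base_freq[d[1]])
        for d, c in dimers.items()
    }


ratios = dimer_odds_ratios("ATAT")
print(ratios)  # AT is about 2.67 and TA about 1.33; AA and TT never occur
```

A ratio well above 1 marks an over-represented dimer and a ratio well below 1 an under-represented one.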
Important to Distinguish
- Pattern matching
  - Given a motif, find its occurrences in a sequence
- Pattern discovery
  - Find patterns of interest in a sequence
    - Useful
    - Novel
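Pattern matching, the simpler of the two tasks, can be sketched with Python's built-in `str.find`. The helper `find_motif` is an illustrative name; it reports 1-based positions to match the sequence notation used earlier and allows overlapping occurrences.

```python
def find_motif(seq: str, motif: str) -> list:
    """Return the 1-based start positions of every (possibly overlapping) occurrence."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start + 1)         # convert 0-based index to 1-based position
        start = seq.find(motif, start + 1)  # step by one so overlaps are found
    return positions


print(find_motif("ATATAT", "ATA"))  # [1, 3]
print(find_motif("ATATAT", "GG"))   # []
```

Pattern discovery is harder because the motif is not given in advance; it has to be inferred from statistics such as the k-mer frequencies above.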
Genome Databases
- The knowledge acquired in this course will be applied to real sequences
  - DNA and proteins
  - Stored in databases (DBs) available on the internet
- It is necessary to know how to access, manipulate, and process these data
- The steps involved are standardized
Genome Databases
- General
  - DNA, proteins and carbohydrates, three-dimensional structures, ...
- Specialized
  - EST, STS, SNP, RNA, genomes, protein families, pathways, microarray data, ...
Genome Databases
- All published genome sequences must be made available in a public DB
- The members of the International Nucleotide Sequence Database Collaboration are the main repositories
  - A consortium formed by 3 large DBs
    - EMBL (European Molecular Biology Laboratory nucleotide sequence database at EBI, Hinxton, UK)
    - GenBank (at the National Center for Biotechnology Information, NCBI, Bethesda, MD, USA)
    - DDBJ (DNA Data Bank of Japan at CIB, Mishima, Japan)
GenBank
- Each sequence
  - Is identified by a unique accession number
  - Includes a quantity of metadata
    - Data about data, or annotation
    - E.g., the species of the sequenced organism
Data Format and Annotation
- There are several formats for providing a sequence and its annotation
  - EMBL, GenBank, and DDBJ each have their own standard format
  - There are also formats not associated with a DB
    - Generally they are associated with a sequence analysis program instead
    - E.g., FASTA
FASTA Format
- First line: ">" followed by the annotation
  - Any format, without line breaks
- The sequence itself starts on the following line
  - It continues until another ">" appears as the first character of a line

>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
FASTA Format
- Accepted by most sequence analysis programs
- Provided by most online DBs
- Limits the amount of annotation allowed
  - Other formats are used to include more metadata
    - Information about the sequence
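The FASTA rules (a ">" header line, then sequence lines until the next ">") translate directly into a small parser. This is a minimal sketch using a hypothetical function `parse_fasta` that works on text already read into memory; for real work an established library is usually a better choice.

```python
def parse_fasta(text: str) -> list:
    """Parse FASTA text into a list of (annotation, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 toy example
ACGT
TTAA
>seq2
GG
"""
print(parse_fasta(example))  # [('seq1 toy example', 'ACGTTTAA'), ('seq2', 'GG')]
```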
GenBank Format
- An entry has several sections
  - LOCUS: identifies the sequence
  - DEFINITION: describes the sequence
  - ACCESSION: uniquely identifies the sequence
    - Cited in publications and used to cross-reference other DBs
  - SOURCE and ORGANISM: identify the biological origin of the sequence
  - REFERENCE: lists articles related to the sequence
  - ORIGIN: lists all the nucleotides
  - Among others
GenBank Format
- ORIGIN
  - The sequence is organized in lines of 6 blocks, each with 10 bases
- The symbol "//" marks the end of the entry
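The layout of the ORIGIN section (a position number followed by blocks of 10 bases, ending at "//") makes the sequence easy to extract. The sketch below is a simplified reader with a hypothetical name, `parse_origin`; it ignores every other section of the entry.

```python
def parse_origin(lines) -> str:
    """Extract the nucleotide sequence from the ORIGIN section of a GenBank entry."""
    bases = []
    in_origin = False
    for line in lines:
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break                        # end of the entry
        elif in_origin:
            parts = line.split()
            bases.extend(parts[1:])      # drop the leading position number
    return "".join(bases).upper()


entry = [
    "LOCUS       TOY",
    "ORIGIN",
    "        1 gatcctccat atacaacggt",
    "       21 acgt",
    "//",
]
print(parse_origin(entry))  # GATCCTCCATATACAACGGTACGT
```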
GenBank Format

LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  MEDLINE   95176709
  PUBMED    7871890
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel
            plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
  MEDLINE   96194260
  PUBMED    8846915
REFERENCE   3  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New
            Haven, CT, USA
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF
                     TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN
                     VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE
                     VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPE
                     TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYV
                     YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYG
                     DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQ
                     DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSA
                     NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIA
                     CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLN
                     NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQ
                     SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDS
                     YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK
                     HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL
                     VDFSNKSNVNVGQVKDIHGRIPEML"
BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
      121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
      181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg
      241
Standard Alphabet
- Sequences in the different repositories follow the standard nucleotide alphabet
  - It includes symbols for ambiguous nucleotides
- Most common symbols:

  A  Adenine     N  any base (aNy)
  C  Cytosine    R  A or G (puRine)
  G  Guanine     Y  C or T (pYrimidine)
  T  Thymine     M  A or C (aMino)
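Ambiguity symbols are convenient for searching: a pattern written in this alphabet can be turned into a regular expression. The sketch below covers only the eight symbols listed on the slide (the full standard alphabet has more), and `iupac_to_regex` is a hypothetical helper name.

```python
import re

# Only the symbols listed on the slide; other ambiguity codes are omitted here.
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "N": "ACGT",
}


def iupac_to_regex(pattern: str) -> str:
    """Convert a pattern with ambiguity symbols into a regular expression."""
    parts = []
    for symbol in pattern.upper():
        bases = AMBIGUITY[symbol]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


regex = iupac_to_regex("GAYR")
print(regex)                              # GA[CT][AG]
print(bool(re.fullmatch(regex, "GATG")))  # True
print(bool(re.fullmatch(regex, "GAAG")))  # False: A is not a pyrimidine
```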
Conclusion
- Cells
- Molecular Biology
- Probabilistic Sequence Models
  - Multinomial Models
  - Markov Models
- Genome Annotation
- Databases
Questions?
NCBI: National Center for Biotechnology Information. Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information, all for the better understanding of molecular processes affecting human health and disease.
The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) constitutes Europe's primary nucleotide sequence resource. Main sources for DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications.
DDBJ (DNA Data Bank of Japan) began DNA data bank activities in earnest in 1986 at the National Institute of Genetics (NIG). DDBJ has been functioning as the international nucleotide sequence database in collaboration with EBI/EMBL and NCBI/GenBank.
Fasta Protein Database Query: provides sequence similarity searching against nucleotide and protein databases using the Fasta programs. Fasta can be very specific when identifying long regions of low similarity, especially for highly diverged sequences. You can also conduct sequence similarity searching against complete proteome or genome databases using the Fasta programs.