Dilvan Moreira (based on Prof. André Carvalho presentation)

1 Sequence Analysis
Dilvan Moreira (based on a presentation by Prof. André Carvalho)

2 Reading
Introduction to Computational Genomics: A Case Studies Approach, Chapter 1

3 Introduction
Cells
Molecular Biology
Probabilistic Sequence Models
  Multinomial Models
  Markov Models
Genome Annotation
Databases
André de Carvalho - ICMC/USP 1/27/2018

4 Cells

5 Cells
The cell is the basic unit of all living beings
A compartment enclosed by a membrane and filled with an aqueous solution
May have organelles with specific functions
  Mitochondria: energy generation
  Golgi complex: accumulation of secretions
  Among others

6 Cells
Cell doctrines
  All living beings are made of cells and their products
  Cells have structure and function
  All cells arise from pre-existing cells
  A cell can make copies of itself by replication and division

7 Cells
Depending on the number of cells, an organism is classified as:
  Unicellular (bacteria, protozoa)
  Multicellular (worms, mammals)
According to the presence of a nucleus in its cells, an organism is classified as:
  Eukaryote: has a nucleus bounded by a membrane
  Prokaryote: does not have a nucleus

8 Cells
Being a prokaryote does not by itself mean that an organism is unicellular
  Most prokaryotes live as unicellular organisms
  Some species, however, group into clusters, chains or other multicellular arrangements
Many unicellular organisms are eukaryotes

9 An animal cell
Nucleus: contains DNA and RNA
Rough endoplasmic reticulum (ER): produces proteins
Smooth ER: produces lipids
Golgi complex: processes and packages secretions; it also forms the lysosomes responsible for cellular digestion
Mitochondria: produce energy; they have their own DNA and can replicate on their own

10 Cells
All cells of the same organism have the same genes
Not all cells have the same organelles in equal proportions
Cells vary in form and function
  Normally, form is related to function
  A cell's specific function and form are defined by the genes it expresses

11 Cells
The chemical processes that occur in a cell are basically the same
  For all cell types and organisms
  Even though those cells have different forms and functions
  DNA replication in a bacterium is similar to DNA replication in a mammal
This makes scientific advances easier
  Results of experiments on simpler organisms can be used to infer results for other organisms

12 Cytology X Molecular Biology
Cytology: the science that studies the (fixed) cell
  Studies cellular organization, types, functioning, division mechanisms, etc.
With scientific advances, it became possible to analyze living cells (in vivo)
  At the molecular level
  This gave rise to the term Molecular Biology

13 DNA
Deoxyribonucleic acid
May be single- or double-stranded
  Double-stranded DNA: two strands twisted around each other to form a double helix
The long polymer is compacted into a chromosome
DNA is composed of four different nucleotides (bases)
  Adenine, Cytosine, Guanine and Thymine (Uracil replaces Thymine in RNA)
The double strand is held together by base pairing

14 DNA
The DNA strands are held together by hydrogen bonds that connect each nucleotide of one strand to its complement on the other strand (A pairs with T, C pairs with G)

15 DNA
By convention, a DNA sequence is written and read from the 5' end to the 3' end, the direction in which a new strand is synthesized during transcription
5' ATTTAGGCC 3'
3' TAAATCCGG 5'
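The pairing of the two strands in the example above can be checked with a short Python sketch. The function names `complement` and `reverse_complement` are illustrative choices, not part of the original slides, and the sketch assumes an unambiguous upper-case A/C/G/T sequence.

```python
# Watson-Crick pairing: A with T, C with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(seq):
    """Return the paired base for each position, in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in seq)

def reverse_complement(seq):
    """Return the opposite strand written in its own 5' to 3' direction."""
    return complement(seq)[::-1]

top = "ATTTAGGCC"                 # 5' ATTTAGGCC 3' from the slide
print(complement(top))            # TAAATCCGG (the lower strand, read 3' to 5')
print(reverse_complement(top))    # GGCCTAAAT (the lower strand, read 5' to 3')
```

Because sequences are conventionally reported 5' to 3', the reverse complement is usually the form stored or searched for when working with the opposite strand.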

16 DNA
5' end: at one end is the first nucleotide of the strand; its phosphate group, attached to carbon C5, projects outward.
3' end: at the other end is the last nucleotide added to the strand; it is the only one that still has a free C3-OH group.

17 Molecular Biology
The genome is the set of all the DNA of a cell (organism)
  Including the genes
Genes carry the information needed to produce the proteins an organism requires
Proteins determine
  The organism's appearance
  How the body metabolizes food or defends itself from infections
  Sometimes, the organism's behavior

18 Fraction of yeast genome
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTTACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAACCACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATCCAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATACTGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAAATAACACACACGTGCTTACCCTACCACTTTATACCACCACCACATGCCATACTCACCCTCACTTGTATACTGATTTTACGTACGCACACGGATGCTACAGTATATACCATCTCAAACTTACCCTACTCTCAGATTCCACTTCACTCCATGGCCCATCTCTCACTGAATCAGTACCAAATGCACTCACATCATTATGCACGGCACTTGCCTCAGCGGTCTATACCCTGTGCCATTTACCCATAACGCCCATCATTATCCACATTTTGATATCTATATCTCATTCGGCGGTCCCAAATATTGTATAACTGCCCTTAATACATACGTTATACCACTTTTGCACCATATACTTACCACTCCATTTATATACACTTATGTCAATATTACAGAAAAATCCCCACAAAAATCACCTAAACATAAAAATATTCTACTTTTCAACAATAATACATAAACATATTGGCTTGTGGTAGCAACACTATCATGGTATCACTAACGTAAAAGTTCCTCAATATTGCAATTTGCTTGAACGGATGCTATTTCAGAATATTTCGTACTTACACAGGCCATACATTAGAATAATATGTCACATCACTGTCGTAACACTCTTTATTCACCGAGCAATAATACGGTAGTGGCTCAAACTCATGCGGGTGCTATGATACAATTATATCTTATTTCCATTCCCATATGCTAACCGCAATATCCTAAAAGCATAACTGATGCATCTTTAATCTTGTATGTGACACTACTCATACGAAGGGACTATATCTAGTCAAGACGATACTGTGATAGGTACGTTATTTAATAGGATCTATAACGAAATGTCAAATAATTTTACGGTAATATAACTTATCAGCGGCGTATACTAAAACGGACGTTACGATATTGTCTCACTTCATCTTACCACCCTCTATCTTATTGCTGATAGAACACTAACCCCTCAGCTTTATTTCTAGTTACAGTTACACAAAAAACTATGCCAACCCAGAAATCTTGATATTTTACGTGTCAAAAAATGAGGGTCTCTAAATGAGAGTTTGGTACCATGACTTGTAACTCGCACTGCCCTGATCTGCAATCTTGTTCTTAGAAGTGACGCATATTCTATACGGCCCGACGCGACGCGCCAAAAAATGAAAAACGAAGCAGCGACTCATTTTTATTTAAGGACAAAGGTTGCGAAGCCGCACATTTCCAATTTCATTGTTGTTTATTGGACATACACTGTTAGCTTTATTACCGTCCACGTTTTTTCTACAATAGTGTAGAAGTTTCTTTCTTATGTTCATCGTATTCATAAAATGCTTCACGAACACCGTCATTGATCAAATAGGTCTATAATATTAATATACATTTATATAATCTACGGTATTTATATCATCAAAAAAAAGTAGTTTTTTTATTTTATTTTGTTCGTTAATTTTCAATTTCTATGGAAACCCGTTCGTAAAATTGGCGTTTGTCTCTAGTTTGCGATAGTGTAGATACCGTCCTTGGATAGAGCACTGGAGATGGCTGGCTTTAATCTGCTGGAGTACCATGGAACACCGGTGATCATTCTGGTCACTTGGTCTGGAGCAATACCGGTCAACATGGTGGTGAAGTCACCGTAGTTGAAAACGGCTTCAGCAACTTCGACTGGGTAGGTTTCAGTTGGGT
GGGCGGCTTGGAACATGTAGTATTGGGCTAAGTGAGCTCTGATATCAGAGACGTAGACACCCAATTCCACCAAGTTGACTCTTTCGTCAGATTGAGCTAGAGTGGTGGTTGCAGAAGCAGTAGCAGCGATGGCAGCGACACCAGCGGCGATTGAAGTTAATTTGACCATTGTATTTGTTTTGTTTGTTAGTGCTGATATAAGCTTAACAGGAAAGGAAAGAATAAAGACATATTCTCAAAGGCATATAGTTGAAGCAGCTCTATTTATACCCATTCCCTCATGGGTTGTTGCTATTTAAACGATCGCTGACTGGCACCAGTTCCTCATCAAATATTCTCTATATCTCATCTTTCACACAATCTCATTATCTCTATGGAGATGCTCTTGTTTCTGAACGAATCATAAATCTTTCATAGGTTTCGTATGTGGAGTACTGTTTTATGGCGCTTATGTGTATTCGTATGCGCAGAATGTGGGAATGCCAATTATAGGGGTGCCGAGGTGCCTTATAAAACCCTTTTCTGTGCCTGTGACATTTCCTTTTTCGGTCAAAAAGAATATCCGAATTTTAGATTTGGACCCTCGTACAGAAGCTTATTGTCTAAGCCTGAATTCAGTCTGCTTTAAACGGCTTCCGCGGAGGAAATATTTCCATCTCTTGAATTCGTACAACATTAAACGTGTGTTGGGAGTCGTATACTGTTAGGGTCTGTAAACTTGTGAACTCTCGGCAAATGCCTTGGTGCAATTACGTAATTTTAGCCGCTGAGAAGCGGATGGTAATGAGACAAGTTGATATCAAACAGATACATATTTAAAAGAGGGTACCGCTAATTTAGCAGGGCAGTATTATTGTAGTTTGATATGTACGGCTAACTGAACCTAAGTAGGGATATGAGAGTAAGAACGTTCGGCTACTCTTCTTTCTAAGTGGGATTTTTCTTAATCCTTGGATTCTTAAAAGGTTATTAAAGTTCCGCACAAAGAACGCTTGGAAATCGCATTCATCAAAGAACAACTCTTCGTTTTCCAAACAATCTTCCCGAAAAAGTAGCCGTTCATTTCCCTTCCGATTTCATTCCTAGACTGCCAAATTTTTCTTGCTCATTTATAATGATTGATAAGAATTGTATTTGTGTCCCATTCTCGTAGATAAAATTCTTGGATGTTAAAAAATTATTATTTTCTTCATAAAGAAGCTTTCAAGATATAAGATACGAAATAGGGGTTGATAATTGCATGACAGTAGCTTTAGATCAAAAAGGAAAGCATGGAGGGAAACAGTAAACAGTGAAAATTCTCTTGAGAACCAAAGTAAACCTTCATTGAAGAGCTTCCTTAAAAAATTTAGAATCTCCCATGTCAACGGGTTTCCATACCTCCCCAGCATCATACATCTTTTTTCAAAGAAACTTCAAATGCCTCTTTTATGCAAGGGGCAAAATCCTGAAATGACTTAAACTTAGCAGTTTCGTCTTTTTTCAAAGAGAATGGTTGAAGAAGAATTGTTTTGGACGCTTATTGACAATCTGTTGCATTGATAAAGTACCTACTATCCCAGACTATATTTGTATACAAGTACAAAATTAGGTTTGTTGAAACAACTTTCCGATCATTGGTGCCCGTATCTGATGTTTTTTTAGTAATTTCTTTGTAAATACAGGGAGTTGTTTCGAAAGCTTATGAGAAAAATACATGAATGACAGGTAAAAATATTGGCTCGAAAAAGAGGACAAAAAGAGAAATCATAAATGAGTAAACCCACTTGCTGGACATTATCCAGTAAAGGCTTGGTAGTAACCATAATATTACCCAGGTACGAAACGCTAAGAACCTTGAAAGACTCATAAAACTTCCAGGTTAAGCTATTTTTGAAAATATTCTGAGGTAAAAGCCATTAAGGTCCAGATAACCAAGGGACAATAAACCTATGCTTTTCTTGTCTTCAATTTCAGTATCTTTCCATTTTGATAATGAGCATGTGATCCGGAAAGCTACT
TTATGATGTTTCAAGGCCTGAAGTTTGAATATTTATGTAGTTCAACATCAAATGTGTCTATTTTGTGATGAGGCAACCGTCGACAACCTTATTATCGAAAAAGAACAACAAGTTCACATGCTTGTTACTCTCTATAACTAGAGAGTACTTTTTTTGGAAGCAAGTAAGAATAAGTCAATTTCTACTTACCTCATTAGGGAAAAATTTAATAGCAGTTGTTATAACGACAAATACAGGCCCTAAAAAATTCACTGTATTCAATGGTCTACGAATCGTCAATCGCTTGCGGTTATGGCACGAAGAACAATGCAATAGCTCTTACAAGCCACTACATGACAAGCAACTCATAATTTAA André de Carvalho - ICMC/USP 27/01/2018

19 Molecular Biology
Haploid cells: 1 set of chromosomes
Diploid cells: 2 sets of chromosomes (pairs)

20 Molecular Biology
Genes
  Subsequences of DNA
  Found in the chromosomes
  They are the template for protein or RNA production
Between the genes there are segments called non-coding regions

21 Not all DNA codes for genes
Organism                   Number of bp    Genes   Description
ФX-174                     5386            10      E. coli virus
Human mitochondrion        16569           37      Subcellular organelle
Mycoplasma pneumoniae      816394          680     Pneumonia
Hemophilus influenzae                      1738    Ear infection
E. coli                                    4406
Saccharomyces cerevisiae   12.1 x 10^6     5885    Yeast
C. elegans                 95.5 x 10^6     19099   Worm
Drosophila melanogaster    180 x 10^6      13601   Fruit fly
Human                      3200 x 10^6     ?       Humans

22 Non-coding DNA
It does not take part in protein/RNA synthesis
  It was once considered genetic "junk"
It can bind to a DNA strand
One of its functions: blocking the transcription process
  The bound gene is not read
  This prevents expression of the associated protein
Inhibiting genes may prevent the growth of tumor cells
  Researchers were able to bind genes not related to tumor growth

23 Molecular Biology
Scientists identified a gene related to breast cancer (SATB1)
  Paper published in Nature (March 2008)
In a healthy organism: an organizer of other genes
In a cancerous organism: drives tumor growth by controlling other genes
  Like a gang leader directing a gang or mob
Active role in the formation of other cancer foci (metastasis)
  The most common cause of death in cancer patients

24 Bioinformatics
Experiments in rats:
  After the gene is inactivated, the exacerbated proliferation of tumor cells stops
  The cancer loses its aggressive potential
Allows earlier and more accurate diagnosis
Figure: breast cancer cells with a defective gene

25 Molecular Biology
Proteins
  Define the structure, function and regulatory mechanisms of cells
    Examples of regulatory mechanisms: control of the cell cycle, gene transcription
  Linear sequences built from combinations of 20 different amino acids
  Three consecutive nucleotides (a codon) encode one amino acid
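As a minimal illustration of reading a sequence three nucleotides at a time, the sketch below splits a coding strand into codons and translates it with a deliberately partial codon table. The helper names and the toy sequence are assumptions made for this example; only four entries of the standard genetic code are included, and unknown codons are shown as "?".

```python
def codons(seq):
    """Split a sequence into consecutive, non-overlapping triplets (a trailing partial codon is dropped)."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

# Partial table for illustration only: four entries of the standard genetic code.
CODON_TABLE = {"ATG": "M", "GCC": "A", "AAA": "K", "TAA": "*"}  # "*" marks a stop codon

def translate(seq):
    """Translate codon by codon; codons missing from the partial table become '?'."""
    return "".join(CODON_TABLE.get(codon, "?") for codon in codons(seq))

toy = "ATGGCCAAATAA"
print(codons(toy))      # ['ATG', 'GCC', 'AAA', 'TAA']
print(translate(toy))   # MAK*
```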

26 Size of Genomes
Prokaryotes
  0.5 to 12 megabases, Mb (1 Mb = 1,000,000 bp)
Viral
  5 to 50 kilobases, kb (1 kb = 1,000 bp)
Eukaryotes
  8 megabases to 670 gigabases, Gb (1 Gb = 1,000,000,000 bp)
  High amount of repetitive DNA
Organelles
  Most eukaryotes also have a genome outside the nucleus
  Probably remnants of prokaryotes that lived in symbiosis

27 Virus X Bacteria
Bacteria
  Unicellular prokaryotes
  Free living
  May be found isolated or in colonies
  Generally have a single circular chromosome of double-stranded DNA
Virus
  Smaller than bacteria
  Obligate parasites
  Genome is single- or double-stranded
  Basically made of proteins
  Reproduce by invading a cell and taking control of its replication machinery

28 Probabilistic Models of Sequences
Most computational genomics studies use statistical methods
  E.g., finding structures of interest in sequences millions of bp long
  Most of the sequence carries no relevant information
We need probabilistic models of DNA sequences

29 Probabilistic Models of Sequences
Abstraction of the 3D molecule into a (linear) sequence of symbols
  Alphabet {A, C, T, G}
Allows the use of powerful mathematical tools
Ignores information about the three-dimensional structure

30 Probabilistic Models of Sequences
Definition 1.1
  A DNA sequence s is a finite string over the alphabet N = {A, C, T, G} of nucleotides
  A genome is the set of all DNA sequences of an organism or organelle
This allows us to use statistical models of:
  Sequence evolution, sequence similarity, etc.

31 Probabilistic Models of Sequences
Definition 1.2
  The elements of a sequence s are denoted s = s1 s2 ... sn, where each si represents one element
  Given a set of indices K, the elements of s that it selects can be concatenated in their original order
    s(K) = si sj sk if K = {i, j, k}
  A range of indices can also be used: K = [i, j] = (i:j)
  A single symbol is denoted by K = {i}, si = s(i)
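The indexing notation of Definition 1.2 is 1-based and inclusive, unlike Python's 0-based, half-open slices. The sketch below shows one way to bridge the two; the helper names `select` and `span` and the toy sequence GATTACA are assumptions made for this illustration.

```python
def select(s, K):
    """s(K): concatenate the symbols at the 1-based positions in K, in their original order."""
    return "".join(s[i - 1] for i in sorted(K))

def span(s, i, j):
    """s(i:j): the symbols at positions i through j, both 1-based and inclusive."""
    return s[i - 1:j]

s = "GATTACA"
print(select(s, {1, 3, 5}))  # GTA
print(span(s, 2, 4))         # ATT
print(select(s, {7}))        # A
```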

32 Exercise
Given the DNA sequence s = ATATGTCGTGCA, find:
  s(7) =
  s(2:6) =

33 Probabilistic Models of Sequences
Almost all probabilistic methods of sequence analysis can be grouped into two categories:
  Multinomial models
  Markov models

34 Multinomial Models
The simplest models
Assume a probability distribution p over the alphabet
  Nucleotides are independent and identically distributed (i.i.d.) along the sequence
  E.g., for a DNA sequence, p = (pA, pC, pG, pT), where px = p(si = x)
    Independent of the position i
    pA + pC + pG + pT = 1 (normalization constraint)
  The probabilities can be set equal or based on the frequency of each nucleotide
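A minimal sketch of the multinomial model: the distribution p is estimated from nucleotide frequencies, and the probability of a sequence is the product of the per-symbol probabilities (computed here in log space to avoid underflow on long sequences). The function names are illustrative assumptions, and the sketch assumes every symbol being scored has non-zero probability.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def estimate_multinomial(seq):
    """Estimate p = (pA, pC, pG, pT) as the relative frequency of each nucleotide."""
    counts = Counter(seq)
    return {x: counts[x] / len(seq) for x in ALPHABET}

def log_prob_multinomial(seq, p):
    """log p(s) under the i.i.d. assumption: the sum of log p(s_i) over all positions."""
    return sum(math.log(p[x]) for x in seq)

seq = "ATATGTCGTGCA"
p = estimate_multinomial(seq)
print(p)                              # frequencies sum to 1 (normalization constraint)
print(log_prob_multinomial(seq, p))   # log-probability of seq under its own frequencies

uniform = {x: 0.25 for x in ALPHABET}  # the "equal probabilities" choice
print(log_prob_multinomial(seq, uniform))
```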

35 Multinomial Models
DNA sequences are not expected to be truly random
Model validity can be tested on real sequences
  Estimate symbol frequencies in different regions of the sequence
  Test for violations of independence by checking correlations between neighboring symbols
  Regions where such changes occur, and the correlations themselves, are of interest

36 Markov Models
Provide a more complex model of DNA sequences
The probability of observing a symbol depends on the previous symbols in the sequence
  Last symbol: order 1
  Last two symbols: order 2
  No previous symbol: order 0 (multinomial)
Can capture local correlations between nucleotides

37 Markov Models
Transition matrix (figure): rows are the "from" state and columns the "to" state, both over A, C, G, T; for example, pCA is the probability of moving from C to A
  The example figure shows transition probabilities of 0.99, 0.006 and 0.002
  Equal probabilities = multinomial model
Probabilities of each initial state

38 Markov Models
The entries of the transition matrix are defined by:
  p_xy = p(s_{i+1} = y | s_i = x)
The probability of a sequence s = s_1 s_2 ... s_n is:
  p(s) = p(s_1) p(s_2) ... p(s_n)  (order 0)
  p(s) = p(s_1) p(s_2 | s_1) ... p(s_{n-1} | s_{n-2}) p(s_n | s_{n-1})  (order 1)
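The order-1 formula can be sketched in Python as follows. The transition matrix is estimated by counting adjacent pairs; the pseudocount of 1 is an assumption added here so that unseen transitions do not get probability zero, and the function names are illustrative.

```python
import math

ALPHABET = "ACGT"

def estimate_transitions(seq, pseudocount=1.0):
    """Estimate p_xy = p(s_{i+1} = y | s_i = x) from counts of adjacent pairs."""
    counts = {x: {y: pseudocount for y in ALPHABET} for x in ALPHABET}
    for x, y in zip(seq, seq[1:]):
        counts[x][y] += 1
    return {
        x: {y: counts[x][y] / sum(counts[x].values()) for y in ALPHABET}
        for x in ALPHABET
    }

def log_prob_order1(seq, initial, trans):
    """log p(s) = log p(s_1) + sum over i of log p(s_{i+1} | s_i)."""
    lp = math.log(initial[seq[0]])
    for x, y in zip(seq, seq[1:]):
        lp += math.log(trans[x][y])
    return lp

seq = "ATATGTCGTGCA"
trans = estimate_transitions(seq)
initial = {x: 0.25 for x in ALPHABET}   # assumed uniform initial-state probabilities
print(trans["A"])                        # one row of the transition matrix; it sums to 1
print(log_prob_order1(seq, initial, trans))
```

When every row of the matrix equals the same distribution, this reduces to the order-0 (multinomial) case.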

39 Genome Annotation
Simple statistics may describe important characteristics
Table: basic statistics of H. influenzae (1,830,138 bp), with columns Base, Number and Frequency for A, C, G and T

40 Genome Annotation
Base composition
  Base frequencies differ between the genomes of different organisms
  Frequencies may also vary between different parts of a genome
    This violates the assumption of the multinomial model

41 Genome Annotation
Look at only one of the strands
Window of size k

42 Genome Annotation
Look at only one of the strands
Window of size k

43 Genome Annotation
GC content (frequency of C and G)
  The most cited measure in papers
  C and G (and likewise A and T) have similar frequencies
  Aggregate frequency of GC versus AT (AT = 1 - GC)
GC content for different organisms
  Organism          GC content (%)
  H. influenzae     38.8
  M. tuberculosis   65.8
  S. Enteritidis    49.5
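GC content over a window of size k can be computed with a few lines of Python. This is a sketch under simple assumptions: one strand, upper-case letters, and a window that advances by `step` positions (non-overlapping by default); the names are illustrative.

```python
def gc_content(seq):
    """Fraction of positions that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def windowed_gc(seq, k, step=None):
    """Return (start, GC fraction) for each window of size k; start is 0-based."""
    step = step or k
    return [(i, gc_content(seq[i:i + k])) for i in range(0, len(seq) - k + 1, step)]

seq = "GGCCAATTGCAT"
print(gc_content(seq))            # overall GC fraction
print(windowed_gc(seq, 4))        # three non-overlapping windows
print(windowed_gc(seq, 4, 2))     # overlapping windows, step of 2
```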

44 Genome Annotation
GC content (frequency)
  May be used to detect foreign genetic material in part of a genome
  Species may acquire subsequences from other organisms (e.g., viruses)
    Horizontal gene transfer

45 Genome Annotation
Change-point analysis
  Used to detect where the base distribution (or GC content) changes
  The change points divide the sequence into more uniform parts
    They may help to identify important biological signals
  Simplest approach: use a threshold
    Choosing the threshold value (like the window size) is a statistical problem
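The threshold idea can be sketched as follows: compute GC content in consecutive windows and report the positions where it jumps by more than a chosen amount. This is only one simple reading of the approach; the function name, the use of non-overlapping windows, and the comparison with the previous window are assumptions made for this example, and choosing `k` and `threshold` remains the statistical problem noted above.

```python
def gc_fraction(seq):
    """Fraction of positions that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def change_points(seq, k, threshold):
    """Return 0-based start positions of windows whose GC content differs
    from the previous window by more than `threshold`."""
    starts = list(range(0, len(seq) - k + 1, k))
    values = [gc_fraction(seq[i:i + k]) for i in starts]
    return [
        starts[n]
        for n in range(1, len(values))
        if abs(values[n] - values[n - 1]) > threshold
    ]

seq = "GCGCGCGC" + "ATATATAT"       # GC-rich stretch followed by an AT-rich stretch
print(change_points(seq, 4, 0.5))   # [8]
```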

46 Genome Annotation

47 Genome Annotation
k-mer frequency and motif bias
Another useful measure is the frequency of subsequences of length 2 or more
  Dimers, trimers, k-mers
  Unusual k-mers: any word that occurs in the genome with a frequency higher or lower than expected
  Bias in the position or frequency of these words may reveal important information about their function
  k-mers are counted with a window of size k sliding along the sequence
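Counting k-mers with a sliding window of size k takes one line with `collections.Counter`. The function name `count_kmers` is an illustrative choice; the window advances one position at a time, so overlapping occurrences are all counted.

```python
from collections import Counter

def count_kmers(seq, k):
    """Count every length-k word seen by a window sliding one position at a time."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

seq = "ATATGTCGTGCA"
dimers = count_kmers(seq, 2)
print(dimers.most_common(3))          # the most frequent dimers
print(sum(dimers.values()))           # len(seq) - k + 1 windows in total
print(count_kmers(seq, 3)["GTC"])     # count of one specific trimer
```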

48 Genome Annotation
k-mer frequency and motif bias
It is also possible to plot the frequency of only certain k-mers of interest
  E.g., the dimers (dinucleotides) AT and CG
There are examples of statistical bias in nucleotide usage
  E.g., the low frequency of CG in some organisms
These biases are easy to see in "genome signatures"
  Chaos Game Representation (CGR)

49 Genome Signature
CGR represents the observed frequencies of k-mers as colors
  The darker the color, the more frequent the k-mer
Panels: 2-mers, 5-mers, 8-mers

50 Genome Signature
CGR displays the frequencies of the 4^k words (strings) of length k
The square image is divided into 4 quadrants q1, one for each nucleotide
  The pixels giving the frequencies of all the strings of length k that end in a given nucleotide lie in that nucleotide's quadrant

51 Example: 2-mers
Figure: grid of 2-mers with quadrants labelled C and G (top) and A and T (bottom)
The A quadrant holds the frequencies of all the words that end with the nucleotide A (CA, GA, AA, TA)

52 Genome Signature
Each quadrant q1 is divided into 4 quadrants q2
  One q2 for each nucleotide that can occupy the next-to-last position of the word
  Each q2 sits in the same relative position as that nucleotide's q1 quadrant
Continue until there is the appropriate number of pixels (4^k)
Fill each square with a color proportional to the frequency of its k-mer
Quadrant labels in the figure: C G A T
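The recursive subdivision can be sketched as a mapping from each k-mer to a cell of a 2^k by 2^k grid. The corner assignment below (C top left, G top right, A bottom left, T bottom right) is an assumption read off the figure labels; the function names are illustrative, and k-mers containing other symbols are skipped.

```python
from collections import Counter

# Assumed quadrant layout as (row, column): C G on top, A T on the bottom.
CORNER = {"C": (0, 0), "G": (0, 1), "A": (1, 0), "T": (1, 1)}

def cgr_cell(kmer):
    """Map a k-mer to (row, col): the last symbol picks q1, the next-to-last picks q2, and so on."""
    row = col = 0
    for symbol in reversed(kmer):
        r, c = CORNER[symbol]
        row, col = row * 2 + r, col * 2 + c
    return row, col

def cgr_matrix(seq, k):
    """Return a 2^k x 2^k grid of relative k-mer frequencies."""
    size = 2 ** k
    grid = [[0.0] * size for _ in range(size)]
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(w for w in kmers if set(w) <= set(CORNER))
    total = sum(counts.values())
    for word, n in counts.items():
        row, col = cgr_cell(word)
        grid[row][col] = n / total
    return grid

print(cgr_cell("CA"))   # (2, 0): inside the A quadrant, in the position C has among the q1 quadrants
for line in cgr_matrix("ATATGTCGTGCA", 2):
    print(line)
```

Plotting the grid as an image with darker cells for larger values would give the genome signature.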

53 Genome Annotation
The frequency of motifs (k-mers) can carry relevant information
  A motif is a frequent sequence of nucleotides that may have biological relevance
A simple statistical analysis can take the nucleotide frequencies into account
  It can find over- or under-represented motifs
  This helps to decide whether a bias is significant
Unusual motifs may have biological relevance
  E.g., motifs are frequently associated with repetitive elements
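One simple way to look for over- or under-represented dimers is to compare each observed frequency with the frequency expected under the multinomial model, p(x) p(y). The ratio used below is an illustrative statistic chosen for this sketch, not necessarily the one used in the course; values well above 1 suggest over-representation and values well below 1 suggest under-representation.

```python
from collections import Counter

def dimer_odds(seq):
    """Observed / expected frequency for every dimer that occurs in seq.
    Expected frequency assumes independent positions: p(x) * p(y)."""
    n = len(seq)
    base_freq = {b: c / n for b, c in Counter(seq).items()}
    dimers = Counter(seq[i:i + 2] for i in range(n - 1))
    total = n - 1
    return {
        d: (c / total) / (base_freq[d[0]] * base_freq[d[1]])
        for d, c in dimers.items()
    }

odds = dimer_odds("ATATGTCGTGCA")
for dimer, ratio in sorted(odds.items(), key=lambda item: item[1]):
    print(dimer, round(ratio, 2))
```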

54 Genome Annotation
Look for unusual dimers (dinucleotides) in H. influenzae
Table: observed frequency of each dimer, with rows and columns A, C, G, T

55 Important to Distinguish
Pattern matching
  Given a motif, find its occurrences in a sequence
Pattern discovery
  Find patterns of interest in a sequence
    Useful
    Novel

56 Genome Database
The knowledge acquired in this course will be applied to real sequences
  DNA and proteins
  Stored in databases available on the internet
It is necessary to know how to access, manipulate and process these data
  The steps involved are standardized

57 Genome Database
General
  DNA, proteins and carbohydrates, 3-dimensional structures, ...
Specialized
  EST, STS, SNP, RNA, genomes, protein families, pathways, microarray data, ...

58 Genome Database
Every published genome sequence has to be available in a public database
The members of the International Nucleotide Sequence Database Collaboration are the main repositories
  A consortium formed by 3 large databases
    EMBL (European Molecular Biology Laboratory nucleotide sequence database, at EBI, Hinxton, UK)
    GenBank (at the National Center for Biotechnology Information, NCBI, Bethesda, MD, USA)
    DDBJ (DNA Data Bank of Japan, at CIB, Mishima, Japan)

59 Genome Database

60 GenBank
Each sequence
  Is identified by a unique accession number
  Includes a quantity of metadata
    Data about the data, or annotation
    E.g., the species of the sequenced organism

61 Data Format and Annotation
There are several different formats for providing a sequence and its annotation
  EMBL, GenBank and DDBJ each have their own standard format
  There are also formats that are not associated with a database
    Generally they are tied to a sequence analysis program instead
    E.g., FASTA

62 FASTA Format

63 FASTA Format
>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
First line: ">" followed by the annotation
  Any format, without a line break
The sequence itself starts on the following line
  And continues until another ">" appears as the first character of a line

64 FASTA Format
Accepted by most sequence analysis programs
Provided by most online databases
Limits the amount of annotation allowed
  Other formats are used to include more meta-information
    Information about the sequence
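The FASTA rules described above (a ">" annotation line, then sequence lines until the next ">") are simple enough to parse by hand. The following is a minimal sketch with an illustrative function name and toy records; it is not a replacement for an established parser.

```python
def parse_fasta(text):
    """Return a list of (annotation, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:                 # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)                    # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

toy = """>seq1 first toy record
ATATGT
CGTGCA
>seq2 second toy record
GATTACA
"""
for annotation, sequence in parse_fasta(toy):
    print(annotation, len(sequence))
```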

65 GenBank Format
An entry has several sections
  LOCUS: identifies the sequence
  DEFINITION: describes the sequence
  ACCESSION: uniquely identifies the sequence
    Cited in publications and used to cross-reference other databases
  SOURCE and ORGANISM: identify the biological origin of the sequence
  REFERENCE: lists articles related to the sequence
  ORIGIN: lists all the nucleotides
  Among others

66 GenBank Format
ORIGIN
  The sequence is organized in lines containing 6 blocks of 10 bases each
  The symbol "//" marks the end of the entry
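The ORIGIN layout described above (a position number followed by blocks of 10 bases, ending at "//") can be read back into a plain string with a short sketch like the one below. The function name and the toy entry are assumptions made for illustration; real records contain many more sections, which this sketch simply skips.

```python
def parse_origin(text):
    """Extract the nucleotide sequence from the ORIGIN section of a GenBank-style entry."""
    bases = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):       # end of the entry
            break
        if in_origin:
            parts = line.split()
            bases.extend(parts[1:])     # drop the leading position number
    return "".join(bases)

toy_entry = """LOCUS       TOY
ORIGIN
        1 gatcctccat atacaacggt atctccacct
       31 caggtttaga
//
"""
print(parse_origin(toy_entry))          # gatcctccatatacaacggtatctccacctcaggtttaga
```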

67 GenBank Format

68 GenBank Format

69 GenBank Format
LOCUS       SCU bp DNA PLN JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U GI:
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1 (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), (1994)
  MEDLINE
  PUBMED
REFERENCE   2 (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), (1996)
  MEDLINE
  PUBMED
REFERENCE   3 (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New Haven, CT, USA
FEATURES             Location/Qualifiers
     source          /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
     gene            /gene="AXL2"
     CDS             /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
cerevisiae" /product="Axl2p" /protein_id="AAA " /db_xref="GI: " /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPE TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYV YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYG DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQ DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSA NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIA CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLN NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQ SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDS YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL VDFSNKSNVNVGQVKDIHGRIPEML BASE COUNT a c g t ORIGIN 1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg 61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct 121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa 181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg 241 André de Carvalho - ICMC/USP 27/01/2018

70 Standard Alphabet
Sequences in the different repositories follow the standard nucleotide alphabet
  It includes symbols for ambiguous nucleotides
Most common symbols:
  A  Adenine     N  any base (aNy)
  C  Cytosine    R  A or G (puRine)
  G  Guanine     Y  C or T (pYrimidine)
  T  Thymine     M  A or C (aMino)
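The ambiguity symbols can be used directly in pattern matching by expanding each one into the set of bases it stands for. The sketch below covers only the eight symbols listed on this slide and translates a pattern into a regular expression with Python's `re` module; the function names are illustrative assumptions.

```python
import re

# Only the symbols listed above; the full standard alphabet has more ambiguity codes.
SYMBOLS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "[ACGT]",   # any base
    "R": "[AG]",     # purine
    "Y": "[CT]",     # pyrimidine
    "M": "[AC]",     # amino
}

def to_regex(pattern):
    """Translate a pattern with ambiguity symbols into a regular-expression string."""
    return "".join(SYMBOLS[ch] for ch in pattern.upper())

def find_ambiguous(pattern, seq):
    """Return 1-based start positions of all (possibly overlapping) matches."""
    lookahead = re.compile("(?=" + to_regex(pattern) + ")")   # lookahead keeps overlaps
    return [m.start() + 1 for m in lookahead.finditer(seq)]

print(to_regex("ARNY"))                  # A[AG][ACGT][CT]
print(find_ambiguous("RY", "ACGT"))      # [1, 3]
print(find_ambiguous("ANA", "ATATA"))    # [1, 3]
```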

71 Conclusion
Cells
Molecular Biology
Probabilistic Sequence Models
  Multinomial Models
  Markov Models
Genome Annotation
Databases

72 Questions?

73 NCBI: National Center for Biotechnology Information
Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information, all for the better understanding of molecular processes affecting human health and disease.

74 The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) constitutes Europe's primary nucleotide sequence resource. Main sources for DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications.

75 DDBJ (DNA Data Bank of Japan) began DNA data bank activities in earnest in 1986 at the National Institute of Genetics (NIG). DDBJ has been functioning as the international nucleotide sequence database in collaboration with EBI/EMBL and NCBI/GenBank.

76 Fasta Protein Database Query
Provides sequence similarity searching against nucleotide and protein databases using the Fasta programs. Fasta can be very specific when identifying long regions of low similarity especially for highly diverged sequences. You can also conduct sequence similarity searching against complete proteome or genome databases using the Fasta programs.

