Dilvan Moreira (based on Prof. André Carvalho presentation)

Presentation transcript:

Sequence Analysis

Reading: Introduction to Computational Genomics: A Case Studies Approach, Chapter 1

Introduction
- Cells
- Molecular Biology
- Probabilistic Sequence Models
  - Multinomial Models
  - Markov Models
- Genome Annotation
- Databases
André de Carvalho - ICMC/USP 1/27/2018

Cells

Cells
- Cell: the basic unit of all living beings
- A compartment enclosed by a membrane and filled with an aqueous solution
- May have organelles with specific functions
  - Mitochondria: energy generation
  - Golgi complex: accumulation of secretions
  - Among others

Cells: cell doctrine
- All living beings are made of cells and their products
- Cells have structure and function
- All cells arise from pre-existing cells
- A cell can make copies of itself by replication and division

Cells
- By number of cells, an organism is classified as:
  - Unicellular (bacteria, protozoa)
  - Multicellular (worms, mammals)
- By the presence of a nucleus in its cells, an organism is classified as:
  - Eukaryote: has a nucleus enclosed by a membrane
  - Prokaryote: does not have a nucleus

Cells
- Being a prokaryote does not necessarily mean being unicellular
  - Most prokaryotes live as unicellular organisms
  - Some species, however, group into clusters, chains or other multicellular structures
- Many unicellular organisms are eukaryotes

An animal cell
- Nucleus: DNA and RNA
- Rough endoplasmic reticulum (ER): produces proteins
- Smooth ER: produces lipids
- Golgi complex: cellular digestion as a basic function
- Mitochondria: produce energy; they have their own DNA and can replicate themselves

Cells
- All cells of the same organism have the same genes
- Not all cells have the same organelles in the same proportions
- Cells vary in form and function
  - Normally, form is related to function
- A cell's specific function and form are defined by the genes it expresses

Cells
- The chemical processes that occur in a cell are basically the same
  - For all cell types and organisms
  - Even though the cells have different forms and functions
- DNA replication in a bacterium is similar to DNA replication in a mammal
- This makes scientific advances easier
  - Results of experiments on simpler organisms can be used to infer results for other organisms

Cytology vs. Molecular Biology
- Cytology: the science that studies the cell (fixed)
  - Studies cellular organization, types, functioning, division mechanisms, etc.
- With advances in science, it became possible to analyze living cells (in vivo)
  - At the molecular level
  - This gave rise to the term molecular biology

DNA
- Deoxyribonucleic acid
- May be single- or double-stranded
  - Double-stranded DNA: two strands twisted around each other to form a double helix
- The long polymer is compacted into a chromosome
- DNA is composed of four different nucleotides (bases)
  - Adenine, cytosine, guanine and thymine (uracil replaces thymine in RNA)
- The double strand is held together by base pairing

DNA
- The two strands are held together by bonds that connect each nucleotide of one strand to its complement on the other strand (A with T, C with G)

DNA
- By convention, a DNA sequence is read from the 5' end to the 3' end, the direction in which the new strand is synthesized during transcription
  5' ATTTAGGCC 3'
  3' TAAATCCGG 5'
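
The pairing shown on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `complement` and `reverse_complement` are not from the slides, and the sketch assumes an unambiguous A/C/G/T sequence.

```python
# Complement table for the four DNA bases (A pairs with T, C with G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(strand):
    """Return the paired strand, position by position (it reads 3'->5')."""
    return strand.upper().translate(COMPLEMENT)

def reverse_complement(strand):
    """Return the paired strand rewritten in the conventional 5'->3' direction."""
    return complement(strand)[::-1]

top = "ATTTAGGCC"                # 5' ATTTAGGCC 3' from the slide
print(complement(top))           # TAAATCCGG, the 3'->5' strand on the slide
print(reverse_complement(top))   # GGCCTAAAT
```

Reversing the complement gives the second strand as it would normally be written in a database, 5' to 3'.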

DNA
- 5' end: at one end is the first nucleotide, whose phosphate group on carbon C5 projects out
- 3' end: at the other end is the last nucleotide added to the strand; it is the only one that still has a free C3-OH group

Molecular Biology
- The genome is the set of all the DNA of a cell (organism)
  - Including the genes
- Genes carry the information needed to produce the proteins an organism requires
- Proteins determine
  - The organism's appearance
  - How the body metabolizes food or defends itself from infections
  - Sometimes, the organism's behavior

Fraction of yeast genome CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTTACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAACCACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATCCAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATACTGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAAATAACACACACGTGCTTACCCTACCACTTTATACCACCACCACATGCCATACTCACCCTCACTTGTATACTGATTTTACGTACGCACACGGATGCTACAGTATATACCATCTCAAACTTACCCTACTCTCAGATTCCACTTCACTCCATGGCCCATCTCTCACTGAATCAGTACCAAATGCACTCACATCATTATGCACGGCACTTGCCTCAGCGGTCTATACCCTGTGCCATTTACCCATAACGCCCATCATTATCCACATTTTGATATCTATATCTCATTCGGCGGTCCCAAATATTGTATAACTGCCCTTAATACATACGTTATACCACTTTTGCACCATATACTTACCACTCCATTTATATACACTTATGTCAATATTACAGAAAAATCCCCACAAAAATCACCTAAACATAAAAATATTCTACTTTTCAACAATAATACATAAACATATTGGCTTGTGGTAGCAACACTATCATGGTATCACTAACGTAAAAGTTCCTCAATATTGCAATTTGCTTGAACGGATGCTATTTCAGAATATTTCGTACTTACACAGGCCATACATTAGAATAATATGTCACATCACTGTCGTAACACTCTTTATTCACCGAGCAATAATACGGTAGTGGCTCAAACTCATGCGGGTGCTATGATACAATTATATCTTATTTCCATTCCCATATGCTAACCGCAATATCCTAAAAGCATAACTGATGCATCTTTAATCTTGTATGTGACACTACTCATACGAAGGGACTATATCTAGTCAAGACGATACTGTGATAGGTACGTTATTTAATAGGATCTATAACGAAATGTCAAATAATTTTACGGTAATATAACTTATCAGCGGCGTATACTAAAACGGACGTTACGATATTGTCTCACTTCATCTTACCACCCTCTATCTTATTGCTGATAGAACACTAACCCCTCAGCTTTATTTCTAGTTACAGTTACACAAAAAACTATGCCAACCCAGAAATCTTGATATTTTACGTGTCAAAAAATGAGGGTCTCTAAATGAGAGTTTGGTACCATGACTTGTAACTCGCACTGCCCTGATCTGCAATCTTGTTCTTAGAAGTGACGCATATTCTATACGGCCCGACGCGACGCGCCAAAAAATGAAAAACGAAGCAGCGACTCATTTTTATTTAAGGACAAAGGTTGCGAAGCCGCACATTTCCAATTTCATTGTTGTTTATTGGACATACACTGTTAGCTTTATTACCGTCCACGTTTTTTCTACAATAGTGTAGAAGTTTCTTTCTTATGTTCATCGTATTCATAAAATGCTTCACGAACACCGTCATTGATCAAATAGGTCTATAATATTAATATACATTTATATAATCTACGGTATTTATATCATCAAAAAAAAGTAGTTTTTTTATTTTATTTTGTTCGTTAATTTTCAATTTCTATGGAAACCCGTTCGTAAAATTGGCGTTTGTCTCTAGTTTGCGATAGTGTAGATACCGTCCTTGGATAGAGCACTGGAGATGGCTGGCTTTAATCTGCTGGAGTACCATGGAACACCGGTGATCATTCTGGTCACTTGGTCTGGAGCAATACCGGTCAACATGGTGGTGAAGTCACCGTAGTTGAAAACGGCTTCAGCAACT
TCGACTGGGTAGGTTTCAGTTGGGTGGGCGGCTTGGAACATGTAGTATTGGGCTAAGTGAGCTCTGATATCAGAGACGTAGACACCCAATTCCACCAAGTTGACTCTTTCGTCAGATTGAGCTAGAGTGGTGGTTGCAGAAGCAGTAGCAGCGATGGCAGCGACACCAGCGGCGATTGAAGTTAATTTGACCATTGTATTTGTTTTGTTTGTTAGTGCTGATATAAGCTTAACAGGAAAGGAAAGAATAAAGACATATTCTCAAAGGCATATAGTTGAAGCAGCTCTATTTATACCCATTCCCTCATGGGTTGTTGCTATTTAAACGATCGCTGACTGGCACCAGTTCCTCATCAAATATTCTCTATATCTCATCTTTCACACAATCTCATTATCTCTATGGAGATGCTCTTGTTTCTGAACGAATCATAAATCTTTCATAGGTTTCGTATGTGGAGTACTGTTTTATGGCGCTTATGTGTATTCGTATGCGCAGAATGTGGGAATGCCAATTATAGGGGTGCCGAGGTGCCTTATAAAACCCTTTTCTGTGCCTGTGACATTTCCTTTTTCGGTCAAAAAGAATATCCGAATTTTAGATTTGGACCCTCGTACAGAAGCTTATTGTCTAAGCCTGAATTCAGTCTGCTTTAAACGGCTTCCGCGGAGGAAATATTTCCATCTCTTGAATTCGTACAACATTAAACGTGTGTTGGGAGTCGTATACTGTTAGGGTCTGTAAACTTGTGAACTCTCGGCAAATGCCTTGGTGCAATTACGTAATTTTAGCCGCTGAGAAGCGGATGGTAATGAGACAAGTTGATATCAAACAGATACATATTTAAAAGAGGGTACCGCTAATTTAGCAGGGCAGTATTATTGTAGTTTGATATGTACGGCTAACTGAACCTAAGTAGGGATATGAGAGTAAGAACGTTCGGCTACTCTTCTTTCTAAGTGGGATTTTTCTTAATCCTTGGATTCTTAAAAGGTTATTAAAGTTCCGCACAAAGAACGCTTGGAAATCGCATTCATCAAAGAACAACTCTTCGTTTTCCAAACAATCTTCCCGAAAAAGTAGCCGTTCATTTCCCTTCCGATTTCATTCCTAGACTGCCAAATTTTTCTTGCTCATTTATAATGATTGATAAGAATTGTATTTGTGTCCCATTCTCGTAGATAAAATTCTTGGATGTTAAAAAATTATTATTTTCTTCATAAAGAAGCTTTCAAGATATAAGATACGAAATAGGGGTTGATAATTGCATGACAGTAGCTTTAGATCAAAAAGGAAAGCATGGAGGGAAACAGTAAACAGTGAAAATTCTCTTGAGAACCAAAGTAAACCTTCATTGAAGAGCTTCCTTAAAAAATTTAGAATCTCCCATGTCAACGGGTTTCCATACCTCCCCAGCATCATACATCTTTTTTCAAAGAAACTTCAAATGCCTCTTTTATGCAAGGGGCAAAATCCTGAAATGACTTAAACTTAGCAGTTTCGTCTTTTTTCAAAGAGAATGGTTGAAGAAGAATTGTTTTGGACGCTTATTGACAATCTGTTGCATTGATAAAGTACCTACTATCCCAGACTATATTTGTATACAAGTACAAAATTAGGTTTGTTGAAACAACTTTCCGATCATTGGTGCCCGTATCTGATGTTTTTTTAGTAATTTCTTTGTAAATACAGGGAGTTGTTTCGAAAGCTTATGAGAAAAATACATGAATGACAGGTAAAAATATTGGCTCGAAAAAGAGGACAAAAAGAGAAATCATAAATGAGTAAACCCACTTGCTGGACATTATCCAGTAAAGGCTTGGTAGTAACCATAATATTACCCAGGTACGAAACGCTAAGAACCTTGAAAGACTCATAAAACTTCCAGGTTAAGCTATTTTTGAAAATATTCTGAGGTAAAAGCCATTAAGGTCCAGATAACCAAGGGACAATAAACCTATGCTTTTCTTGTCTTCAATTTCAGTATCTTTCCATTTTGATAA
TGAGCATGTGATCCGGAAAGCTACTTTATGATGTTTCAAGGCCTGAAGTTTGAATATTTATGTAGTTCAACATCAAATGTGTCTATTTTGTGATGAGGCAACCGTCGACAACCTTATTATCGAAAAAGAACAACAAGTTCACATGCTTGTTACTCTCTATAACTAGAGAGTACTTTTTTTGGAAGCAAGTAAGAATAAGTCAATTTCTACTTACCTCATTAGGGAAAAATTTAATAGCAGTTGTTATAACGACAAATACAGGCCCTAAAAAATTCACTGTATTCAATGGTCTACGAATCGTCAATCGCTTGCGGTTATGGCACGAAGAACAATGCAATAGCTCTTACAAGCCACTACATGACAAGCAACTCATAATTTAA

Molecular Biology
- Haploid cells: 1 set of chromosomes
- Diploid cells: 2 sets of chromosomes (pairs)

Molecular Biology
- Genes
  - Subsequences of DNA
  - Found in the chromosomes
  - They are the template for protein or RNA production
  - Between the genes there are segments called non-coding regions

Not all DNA codes for genes
Organism                    Number of bp    Genes       Description
ФX-174                      5386            10          E. coli virus
Human mitochondrion         16569           37          Subcellular organelle
Mycoplasma pneumoniae       816394          680         Pneumonia
Hemophilus influenzae       1830138         1738        Ear infection
E. coli                     4639221         4406
Saccharomyces cerevisiae    12.1 x 10^6     5885        Yeast
C. elegans                  95.5 x 10^6     19099       Worm
Drosophila melanogaster     180 x 10^6      13601       Fruit fly
Human                       3200 x 10^6     22,000 ?    Humans

Non-coding DNA
- It does not take part in protein/RNA synthesis
- It was once considered genetic "junk"
- It binds to a DNA strand
- One of its functions: blocking transcription
  - The bound gene is not read
  - This prevents expression of the associated protein
- Inhibiting genes may prevent the growth of tumor cells
  - Researchers were able to bind genes not related to tumor growth

Molecular Biology
- Scientists identified a gene related to breast cancer (SATB1)
  - Paper published in Nature (March 2008)
- Healthy organism: organizer of other genes
- Cancerous organism: drives tumor growth, controlling around 1000 other genes
  - Like a gang leader directing a gang or mob
- Active role in the formation of other cancer foci (metastasis)
  - The most common cause of death among patients

Bioinformatics
- Experiments in rats:
  - After the gene is inactivated, the exacerbated proliferation of tumor cells stops
  - The cancer loses its aggressive potential
- Allows earlier and more accurate diagnosis
[Figure: breast cancer cells with a defective gene]

Molecular Biology
- Proteins
  - Define the structure, function and regulatory mechanisms of cells
    - Examples of regulatory mechanisms: control of the cell cycle, gene transcription
  - Linear sequences
    - Combinations of 20 different amino acids
    - Three consecutive nucleotides (a codon) encode one amino acid

Size of Genomes
- Prokaryotes
  - 0.5 to 12 megabases (Mb; 1 Mb = 1,000,000 bp)
- Viruses
  - 5 to 50 kilobases (kb; 1 kb = 1,000 bp)
- Eukaryotes
  - 8 megabases to 670 gigabases (Gb; 1 Gb = 1,000,000,000 bp)
  - Large amount of repetitive DNA
- Organelles
  - Most eukaryotes also have a genome outside the nucleus
  - Probably remnants of prokaryotes that lived in symbiosis

Viruses vs. Bacteria
- Bacteria
  - Unicellular prokaryotes
  - Free-living
  - May be found isolated or in colonies
  - Generally have a single circular genome
- Viruses
  - Smaller than bacteria
  - Obligate parasites
  - Single- or double-stranded genome
  - Basically made of proteins
  - Reproduce by invading a cell and taking control of its replication machinery

Probabilistic Models of Sequences
- Most computational genomics studies use statistical methods
  - E.g., finding structures of interest in sequences millions of bp long
  - Most of the sequence carries no relevant information
- Probabilistic models of DNA sequences are therefore needed

Probabilistic Models of Sequences
- The 3D molecule is abstracted as a linear sequence of symbols
  - Alphabet {A, C, T, G}
- Allows the use of powerful mathematical tools
- Discards information about the three-dimensional structure

Probabilistic Models of Sequences
- Definition 1.1
  - A DNA sequence s is a finite string over the alphabet N = {A, C, T, G} of nucleotides
  - A genome is the set of all DNA sequences of an organism or organelle
- This allows the use of statistical models of:
  - Sequence evolution, sequence similarity, etc.

Probabilistic Models of Sequences
- Definition 1.2
  - The elements of a sequence s are denoted s = s_1 s_2 ... s_n, where each s_i is one element
  - Given a set of indices K, the elements of s can be concatenated in their original order
    - s(K) = s_i s_j s_k if K = {i, j, k}
  - A range can also be used: K = [i, j] = (i:j)
  - A single symbol is denoted with K = {i}: s_i = s(i)
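
The notation of Definition 1.2 counts positions from 1 and includes both ends of a range, while Python strings count from 0 and exclude the end of a slice. A small sketch of the translation, with hypothetical helper names `select` and `span` and a made-up sequence:

```python
def select(s, K):
    """Return s(K): the symbols of s at the 1-based indices in K, in original order."""
    return "".join(s[i - 1] for i in sorted(K))

def span(s, i, j):
    """Return s(i:j): the symbols from position i to position j, both inclusive."""
    return s[i - 1:j]

s = "GATTACA"                  # hypothetical sequence, not from the slides
print(select(s, {1, 4, 7}))    # GTA
print(span(s, 2, 4))           # ATT
print(select(s, {3}))          # T
```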

Exercise
- Given the DNA sequence s = ATATGTCGTGCA, find:
  - s(7) =
  - s(2:6) =

Probabilistic Models of Sequences
- Almost all probabilistic methods of sequence analysis fall into two categories:
  - Multinomial models
  - Markov models

Multinomial Models
- The simplest models
- Assume a probability distribution p over the alphabet
  - Nucleotides are independent and identically distributed (i.i.d.) along the sequence
- E.g., for a DNA sequence, p = (p_A, p_C, p_G, p_T), where p_x = p(s_i = x)
  - Independent of the position i
  - p_A + p_C + p_G + p_T = 1 (normalization constraint)
- The probabilities can be set equal or based on the frequency of each nucleotide
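
As a rough illustration of the multinomial model, the sketch below estimates p from base frequencies and scores a sequence under the i.i.d. assumption. The function names are illustrative, and the code assumes the sequence contains only A, C, G and T and that every symbol being scored has non-zero probability.

```python
import math
from collections import Counter

def estimate_multinomial(seq):
    """Estimate p = (p_A, p_C, p_G, p_T) from the relative frequency of each base."""
    counts = Counter(seq)
    n = len(seq)
    return {base: counts[base] / n for base in "ACGT"}

def log_prob_multinomial(seq, p):
    """log p(s) under the i.i.d. assumption: the sum of log p_x over all positions."""
    return sum(math.log(p[base]) for base in seq)

uniform = {base: 0.25 for base in "ACGT"}   # the "equal probabilities" choice
s = "ATATGTCGTGCA"                          # sequence from the exercise slide
p = estimate_multinomial(s)                 # the "based on frequency" choice
print(p)
print(log_prob_multinomial(s, uniform))     # equals 12 * log(0.25)
print(log_prob_multinomial(s, p))
```

Working with log-probabilities avoids numerical underflow, which would otherwise appear quickly for sequences of realistic length.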

Multinomial Models
- DNA sequences are not expected to be truly random
- The validity of the model can be tested on real sequences
  - Estimate symbol frequencies in regions of the sequence
  - Test for violations of independence by checking correlations between neighboring symbols
- Regions where the frequencies change, or where correlations appear, are of interest

Markov Models
- Provide a more complex model of DNA sequences
- The probability of observing a symbol depends on the preceding symbols in the sequence
  - Previous symbol only: order 1
  - Previous two symbols: order 2
  - No previous symbols: order 0 (multinomial)
- Can model local correlations between nucleotides

Markov Models
[Figure: state diagram and transition matrix over A, C, G and T (rows: from; columns: to), with entries such as p_CA and values including 0.99, 0.006 and 0.002, plus the probabilities of each initial state. Equal probabilities correspond to the multinomial model.]

Markov Models
- The entries of the transition matrix are defined by:
  - p_xy = p(s_{i+1} = y | s_i = x)
- p(s) = p(s_1 s_2 ... s_n)
  - Order 0: p(s) = p(s_1) p(s_2) ... p(s_n)
  - Order 1: p(s) = p(s_n | s_{n-1}) p(s_{n-1} | s_{n-2}) ... p(s_2 | s_1) p(s_1)
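
The order-1 formula can be sketched as follows. The transition probabilities are estimated here by counting adjacent pairs, which is one common choice and not necessarily the one used in the course; the uniform initial distribution and the function names are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def estimate_transitions(seq):
    """Estimate p_xy = p(s_{i+1} = y | s_i = x) from counts of adjacent pairs."""
    pair_counts = defaultdict(Counter)
    for x, y in zip(seq, seq[1:]):
        pair_counts[x][y] += 1
    return {x: {y: c / sum(row.values()) for y, c in row.items()}
            for x, row in pair_counts.items()}

def log_prob_markov1(seq, initial, trans):
    """log p(s) = log p(s_1) + sum over i of log p(s_{i+1} | s_i)."""
    logp = math.log(initial[seq[0]])
    for x, y in zip(seq, seq[1:]):
        logp += math.log(trans[x][y])
    return logp

s = "ATATGTCGTGCA"                          # sequence from the exercise slide
trans = estimate_transitions(s)
initial = {base: 0.25 for base in "ACGT"}   # assumed uniform initial probabilities
print(trans["T"])                           # T is followed by A, G or C in s
print(log_prob_markov1(s, initial, trans))
```

If every row of the transition matrix is the same distribution, the result no longer depends on the previous symbol and the model reduces to the multinomial case.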

Genome Annotation
- Simple statistics may describe important characteristics
Basic statistics of H. influenzae (1,830,138 bp)
Base    Count      Frequency
A       567,623    0.3102
C       350,723    0.1916
G       347,436    0.1898
T       564,241    0.3083

Genome Annotation
- Base composition
  - Base frequencies differ between the genomes of different organisms
  - Frequencies may vary between different parts of a genome
    - This violates the assumption of the multinomial model

Genome Annotation
- Look at only one of the strands
- Sliding window of size k

Genome Annotation
- GC content (combined frequency of C and G)
  - The most cited measure in papers
  - C and G (and likewise A and T) have similar frequencies
  - Aggregate frequency of GC versus AT (AT = 1 - GC)
GC content for different organisms
Organism           GC content (%)
H. influenzae      38.8
M. tuberculosis    65.8
S. Enteritidis     49.5
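
GC content is simple to compute. A minimal sketch, assuming the sequence is given as a plain string of bases (the function name `gc_content` is illustrative):

```python
def gc_content(seq):
    """Fraction of positions that are G or C (AT content is 1 minus this value)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

print(gc_content("ATATGTCGTGCA"))   # 5 of the 12 bases are G or C
```

Multiplying the result by 100 gives a percentage comparable to the values in the table.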

Genome Annotation
- GC content (frequency)
  - May be used to detect foreign genetic material in part of a genome
  - Species may acquire subsequences from other organisms (e.g., viruses)
    - Horizontal gene transfer

Genome Annotation
- Change-point analysis
  - Used to detect where the base distribution (or GC content) changes
  - These change points divide the sequence into more uniform parts
  - They may help identify important biological signals
- Simplest approach: use a threshold
  - Choosing the threshold value (like the window size) is a statistical problem
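
The threshold approach can be sketched by computing GC content in a sliding window and reporting where it crosses a chosen value. The window size, step, threshold and test sequence below are arbitrary illustrative choices, and this is a naive sketch rather than a proper statistical change-point method.

```python
def windowed_gc(seq, k, step=1):
    """GC fraction in each window of size k, moved along the sequence by `step`."""
    seq = seq.upper()
    values = []
    for start in range(0, len(seq) - k + 1, step):
        window = seq[start:start + k]
        values.append((window.count("G") + window.count("C")) / k)
    return values

def threshold_crossings(values, threshold):
    """Indices of windows whose GC fraction is on the other side of the threshold
    from the previous window."""
    above = [v >= threshold for v in values]
    return [i for i in range(1, len(above)) if above[i] != above[i - 1]]

seq = "ATATATATAT" + "GCGCGCGCGC"   # hypothetical AT-rich region followed by a GC-rich one
gc = windowed_gc(seq, k=4, step=2)
print(gc)
print(threshold_crossings(gc, threshold=0.5))
```

Smaller windows react faster to a change but give noisier estimates, which is why the slide treats both the window size and the threshold as statistical choices.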

Genome Annotation
- k-mer frequency and motif bias
  - Another useful measure is the frequency of subsequences of length 2 or more
    - Dimers, trimers, k-mers
  - Unusual k-mers: any word that occurs in the genome more or less often than expected
  - Bias in the position or frequency of these words may reveal important information about their function
  - k-mers are counted by sliding a window of size k along the sequence
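
Counting k-mers with a sliding window takes only a few lines. A minimal sketch, with an illustrative function name and the window advancing one position at a time, so overlapping k-mers are all counted:

```python
from collections import Counter

def count_kmers(seq, k):
    """Count every k-mer by sliding a window of size k one position at a time."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = count_kmers("ATATGTCGTGCA", 2)   # sequence from the exercise slide
print(counts["AT"], counts["CG"])
print(sum(counts.values()))               # a sequence of length n has n - k + 1 windows
```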

Genome Annotation
- k-mer frequency and motif bias
  - It is also possible to plot the frequency of only a few k-mers of interest
    - E.g., the dimers (dinucleotides) AT and CG
  - There are examples of statistical bias in nucleotides
    - E.g., the low frequency of CG in some organisms
  - These biases are easy to see in "genome signatures"
    - Chaos Game Representation (CGR)

Genome Signature
- CGR: observed k-mer frequencies represented by colors
  - The darker the pixel, the more frequent the k-mer
[Figure: CGR panels for 2-mers, 5-mers and 8-mers]

Genome Signature
- A CGR displays the frequencies of all 4^k words (strings) of length k
- The square image is divided into 4 quadrants q1, one for each nucleotide
- The pixels giving the frequencies of all strings of length k that end in a given nucleotide fall in that nucleotide's quadrant

Example: 2-mers
[Figure: CGR grid for 2-mers with quadrants labeled C, G, A and T; the A quadrant holds the frequencies of all the words that end with the nucleotide A (CA, GA, AA, TA)]

Genome Signature
- Each quadrant q1 is divided into 4 quadrants q2
  - One q2 for each nucleotide in the second-to-last position of the word
  - Each q2 sits in the same relative position as its nucleotide's quadrant q1
- Continue until the image has the appropriate number of pixels (4^k)
- Fill each square with a color proportional to the frequency of its k-mer
[Figure: quadrants labeled C, G, A and T]
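
The recursive subdivision can be turned into a direct computation of the pixel that belongs to each k-mer. The sketch below assumes a particular layout (C top left, G top right, A bottom left, T bottom right); the slides label the quadrants but the exact placement used here is an assumption, as are the function names.

```python
# Assumed quadrant layout as (row, column) offsets: C G on top, A T below.
QUADRANT = {"C": (0, 0), "G": (0, 1), "A": (1, 0), "T": (1, 1)}

def cgr_cell(kmer):
    """Return the (row, col) of a k-mer in a 2^k x 2^k CGR grid.

    The last base selects the top-level quadrant, the second-to-last base the
    sub-quadrant inside it, and so on back to the first base.
    """
    row = col = 0
    for base in reversed(kmer):
        r, c = QUADRANT[base]
        row = 2 * row + r
        col = 2 * col + c
    return row, col

def cgr_counts(seq, k):
    """Fill the grid with k-mer counts; k-mers with other symbols are skipped."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(base in QUADRANT for base in kmer):
            row, col = cgr_cell(kmer)
            grid[row][col] += 1
    return grid

print(cgr_cell("CA"))   # ends in A: lower-left quadrant, its upper-left cell
for line in cgr_counts("ATATGTCGTGCA", 2):
    print(line)
```

Mapping the counts to gray levels or colors, for example with a plotting library, would give an image like the genome signatures described above.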

Genome Annotation
- The frequency of motifs (k-mers) can carry relevant information
  - A motif is a frequent sequence of nucleotides that may have biological relevance
- A simple statistical analysis may take the nucleotide frequencies into account
  - It can find over- or under-represented motifs
  - This helps to decide whether a bias is significant
- Unusual motifs may have biological relevance
  - E.g., frequent motifs may be associated with repetitive elements

Genome Annotation
- Look for unusual dimers (dinucleotides) in H. influenzae
Observed frequency
       A         C         G         T
A    1.2491    0.8496    0.8210    0.9535
C    1.1182    1.0121    1.0894    0.8190
G    0.8736    1.4349    1.0076    0.8526
T    0.7541    0.8763    1.1204    1.2505
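
The values in this table are centered on 1, which is consistent with ratios of observed to expected dimer frequency rather than raw frequencies. The sketch below assumes that reading and takes the expected frequency of a dimer xy to be the product of the single-base frequencies, as under a multinomial model; the function name is illustrative.

```python
from collections import Counter

def dimer_odds_ratios(seq):
    """Observed dimer frequency divided by the frequency expected if bases were independent."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        return {}
    base_freq = {base: count / n for base, count in Counter(seq).items()}
    dimers = Counter(seq[i:i + 2] for i in range(n - 1))
    total = n - 1
    return {d: (count / total) / (base_freq[d[0]] * base_freq[d[1]])
            for d, count in dimers.items()}

ratios = dimer_odds_ratios("ATATGTCGTGCA")   # sequence from the exercise slide
for dimer, ratio in sorted(ratios.items()):
    print(dimer, round(ratio, 3))
```

A ratio well above 1 marks an over-represented dimer and a ratio well below 1 an under-represented one; on a sequence this short the numbers are only illustrative.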

Important to Distinguish
- Pattern matching
  - Given a motif, find its occurrences in a sequence
- Pattern discovery
  - Find patterns of interest in a sequence
    - Useful
    - Novel

Genome Databases
- The knowledge acquired in this course will be applied to real sequences
  - DNA and proteins
  - Stored in databases available on the internet
- It is necessary to know how to access, manipulate and process these data
- The steps involved are standardized

Genome Databases
- General
  - DNA, proteins and carbohydrates, three-dimensional structures, ...
- Specialized
  - EST, STS, SNP, RNA, genomes, protein families, pathways, microarray data, ...

Genome Databases
- Every published genome sequence has to be available in a public database
- The members of the International Nucleotide Sequence Database Collaboration are the main repositories
  - A consortium formed by 3 large databases
    - EMBL (European Molecular Biology Laboratory nucleotide sequence database at EBI, Hinxton, UK)
    - GenBank (at the National Center for Biotechnology Information, NCBI, Bethesda, MD, USA)
    - DDBJ (DNA Data Bank of Japan at CIB, Mishima, Japan)

GenBank
- Each sequence
  - Is identified by a unique accession number
  - Includes a certain amount of metadata
    - Data about the data, or annotation
    - E.g., the species of the sequenced organism

Data Format and Annotation
- There are several different formats for providing a sequence and its annotation
- EMBL, GenBank and DDBJ each have their own standard format
- There are also formats not associated with a database
  - Generally associated with a sequence analysis program instead
  - E.g., FASTA

FASTA Format
- First line: ">" followed by the annotation
  - Any format, without line breaks
- The sequence itself starts on the following line
  - It continues until another ">" appears as the first character of a line
>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
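
Because the format is so simple, a FASTA reader fits in a few lines. The sketch below is a hand-written illustration of the rules above (function name and sample records are hypothetical); in practice an established library would normally be used.

```python
def parse_fasta(text):
    """Return a list of (annotation, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new ">" line closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = """>seq1 hypothetical record
ATATGT
CGTGCA
>seq2 another hypothetical record
GATTACA
"""
for annotation, sequence in parse_fasta(example):
    print(annotation, len(sequence))
```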

FASTA Format
- Accepted by most sequence analysis programs
- Provided by most online databases
- Limits the amount of annotation allowed
  - Other formats are used to include more metadata
    - Information about the sequence

GenBank Format
- An entry has several sections
  - LOCUS: identifies the sequence
  - DEFINITION: describes the sequence
  - ACCESSION: uniquely identifies the sequence
    - Cited in publications and used to cross-reference other databases
  - SOURCE and ORGANISM: identify the biological origin of the sequence
  - REFERENCE: lists articles related to the sequence
  - ORIGIN: lists all the nucleotides
  - Among others

GenBank Format
- ORIGIN
  - The sequence is organized in lines containing 6 blocks of 10 bases each
  - The symbol "//" marks the end of the entry

GenBank Format
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  MEDLINE   95176709
   PUBMED   7871890
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel
            plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
  MEDLINE   96194260
   PUBMED   8846915
REFERENCE   3  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New
            Haven, CT, USA
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF
                     TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN
                     VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE
                     VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPE
                     TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYV
                     YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYG
                     DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQ
                     DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSA
                     NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIA
                     CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLN
                     NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQ
                     SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDS
                     YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK
                     HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL
                     VDFSNKSNVNVGQVKDIHGRIPEML"
BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
      121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
      181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg
      241
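
The layout of the ORIGIN section (a position number followed by blocks of 10 bases, closed by "//") makes the sequence easy to recover. A minimal sketch, assuming the entry is available as text with one flat-file line per line; the function name and the shortened sample entry are illustrative, and a real parser would handle many more sections.

```python
def parse_origin(text):
    """Return the sequence stored in the ORIGIN section of a GenBank flat-file entry."""
    bases = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            # Drop the leading position number and keep the blocks of bases.
            blocks = line.split()[1:]
            bases.append("".join(blocks))
    return "".join(bases)

# Shortened, hypothetical entry reusing the first ORIGIN blocks shown on the slide.
entry = """LOCUS       EXAMPLE
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag
//
"""
sequence = parse_origin(entry)
print(len(sequence), sequence[:10])
```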

Standard Alphabet
- Sequences in the different repositories follow the standard nucleotide alphabet
  - It includes symbols for ambiguous nucleotides
- Most common symbols:
Symbol  Meaning     Symbol  Meaning
A       Adenine     N       Any base (aNy)
C       Cytosine    R       A or G (puRine)
G       Guanine     Y       C or T (pYrimidine)
T       Thymine     M       A or C (aMino)
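
The ambiguity symbols are convenient for pattern matching: a motif written with them can be expanded into a regular expression. The sketch below covers only the eight symbols listed on this slide (the full standard alphabet has more), reports 1-based positions to match the notation used earlier, and uses an illustrative function name.

```python
import re

# Only the symbols listed on the slide; the full alphabet has further codes.
SYMBOLS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "[ACGT]", "R": "[AG]", "Y": "[CT]", "M": "[AC]",
}

def find_motif(seq, motif):
    """Return the 1-based start positions of every (possibly overlapping) match."""
    pattern = "".join(SYMBOLS[symbol] for symbol in motif.upper())
    # A lookahead lets overlapping occurrences be reported as well.
    return [m.start() + 1 for m in re.finditer(f"(?=({pattern}))", seq.upper())]

s = "ATATGTCGTGCA"              # sequence from the exercise slide
print(find_motif(s, "RT"))      # a purine (A or G) followed by T
print(find_motif(s, "GTN"))     # GT followed by any base
```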

Conclusion
- Cells
- Molecular Biology
- Probabilistic Sequence Models
  - Multinomial Models
  - Markov Models
- Genome Annotation
- Databases

Questions?

NCBI: National Center for Biotechnology Information. Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information, all for the better understanding of molecular processes affecting human health and disease.

The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) constitutes Europe's primary nucleotide sequence resource. Main sources for DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications.

DDBJ (DNA Data Bank of Japan) began DNA data bank activities in earnest in 1986 at the National Institute of Genetics (NIG). DDBJ has been functioning as the international nucleotide sequence database in collaboration with EBI/EMBL and NCBI/GenBank.

Fasta Protein Database Query: provides sequence similarity searching against nucleotide and protein databases using the Fasta programs. Fasta can be very specific when identifying long regions of low similarity, especially for highly diverged sequences. Sequence similarity searches can also be run against complete proteome or genome databases using the Fasta programs.