Bioinformatics Basics
Cyrus Chan, Peter Lo, David Lam
Courtesy of LO Leung Yau’s original presentation.

Biological Background

Outline
- Biological Background
  - Cell
  - Protein
  - DNA & RNA
  - Central Dogma
  - Gene Expression
- Bioinformatics
  - Sequence Analysis
  - Phylogenetic Trees
  - Data Mining

Biological Background – Cell
- Basic unit of organisms
  - Prokaryotic (lacks a cell nucleus)
  - Eukaryotic
- A bag of chemicals
- Metabolism controlled by various enzymes
- Correct working needs
  - Suitable amounts of various proteins

Biological Background – Protein
- Polymer of 20 types of amino acids
- Folds into a 3D structure
- Shape determines the function
- Many types
  - Transcription factors
  - Enzymes
  - Structural proteins
  - …

Biological Background – DNA & RNA
- DNA
  - Double stranded
  - Bases: Adenine, Cytosine, Guanine, Thymine
  - Base pairing: A-T, G-C
  - Those parts coding for proteins are called genes
- RNA
  - Single stranded
  - Bases: Adenine, Cytosine, Guanine, Uracil
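The pairing rule on this slide is easy to check in code. The following is a minimal sketch, not part of the original slides: the function names `reverse_complement` and `transcribe` are illustrative, and the input is assumed to contain only the letters A, C, G and T.

```python
# Watson-Crick pairing from the slide: A-T and G-C.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna):
    """Return the opposite strand, read in its own 5'->3' direction."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna.upper()))


def transcribe(coding_strand):
    """Write the coding strand as RNA: thymine (T) becomes uracil (U)."""
    return coding_strand.upper().replace("T", "U")


print(reverse_complement("ACATGG"))  # CCATGT
print(transcribe("ACATGG"))          # ACAUGG
```

Applying `reverse_complement` twice returns the original strand, which is a quick sanity check for the pairing table.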

Chromatin Structure
- Super compact packaging
- Figure labels: euchromatin, heterochromatin

Biological Background – Genes
- Genes: protein coding regions
- 3 nucleotides (one codon) code for one amino acid
- There are also start and stop codons
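To make the codon idea concrete, here is a small sketch that translates a DNA string using the standard genetic code. It is an illustration only: the helper name `translate` and the example sequence are made up, and the sketch assumes the reading frame starts at the first ATG and that the input contains only A, C, G and T.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed with the first, second and third base each
# cycling through T, C, A, G; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate from the first ATG (start codon) up to the first stop codon."""
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ccATGGCCGATCAGTAAgg"))  # MADQ
```

The example reads ATG (M), GCC (A), GAT (D), CAG (Q) and stops at TAA.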

Biological Background – in a nutshell
Abstractions: the Central Dogma
- Functional units: Proteins
- Templates: RNAs
- Blueprints: DNAs
- DNA carries not only the information (data), but also the control signals about what and how much data is to be sent
- Proteins (TFs) also help

Biological Background
…acatggccgatca…tcaccctgaacatgtcgctttaacctactggtgatgcacct…atgatcaggg…atactggatacagggcata….
- Figure labels: gene, intergenic region (“non-coding region”), RNA, protein

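Biological Background
…acatgggcgatca…tcaccctgaacatgtcgctttaacctactggtgatgcacct…atgatcaggg…atactggatacagggcata….
- Figure labels: gene, intergenic region (“non-coding region”), RNA, protein, protein (malfunctioning)
- A genetic disease can be caused by a single mutation (here one base differs from the previous slide: acatggccgatca becomes acatgggcgatca)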

Biological Background
- There can be multiple mutations that cause diseases (or increase the risk of diseases)
- SNP (single nucleotide polymorphism)
- Figure: DNA from different people, labelled “Normal” and “Disease!”, differing at single bases

Biological Background – Sequences
Abstractions: sequences, annotations, visualizations
Sequences:
…acatggccgatcaggctgtttttgtgtgcctgtttttctattttacgtaaatcaccctgaacatgtTTGCATCAaccta
ctggtgatgcacctttgatcaatacattttagacaaacgtggtttttgagtccaaagatcagggctgggttgacctgaata
ctggatacagggcatataaaacaggggcaaggcacagactc…
Annotations:
FT   intron          <1..28
FT                   /gene="CREB"
FT                   /number=3
FT                   /experiment="experimental evidence …
FT                   recorded"
FT   exon
FT                   /gene="CREB"
FT                   /number=4
FT                   /experiment="experimental evidence …
FT                   recorded"
FT   intron          175..>189
FT                   /gene="CREB"
FT                   /number=4

Biological Background – DNA → RNA → Protein
- Figure label: gene

Biological Background – DNA → RNA → Protein
- A transcriptional regulatory network is the complex interaction between genes, transcription factors (TFs) and transcription factor binding sites (TFBSs).
- Figure labels: transcription factors, binding sites, genes, promoter regions, other functions

Complex Interactions between Genes, TFs and TFBSs

Gene Expression Microarray Data
- High throughput
- Measures RNA levels
- Relies on A-T, G-C pairing
- Can monitor the expression of many genes

Gene Expression Microarray Data
- Figure axes: genes; time points/conditions
- Colors: expression (RNA) levels

Bioinformatics

Bioinformatics – Sequence Analysis
- Alignments
  - A way of arranging DNA, RNA, or protein sequences to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences

Bioinformatics – Sequence Analysis
- Pair-wise alignments
  - Method: dynamic programming!
  - No penalty for the consecutive ‘-’s before and after the sequence to be aligned
- Source: \\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC3220 Lectures
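The dynamic-programming idea with free end gaps can be sketched as follows. This is an illustrative version, not the lecture's implementation: the function name `semiglobal_score` and the scoring values (match +1, mismatch -1, gap -2) are assumptions, and only the best score is returned, not the alignment itself.

```python
def semiglobal_score(query, target, match=1, mismatch=-1, gap=-2):
    """Best score for aligning all of `query` somewhere inside `target`.

    Gaps placed before and after the query cost nothing: the first row
    stays at zero and the answer is the maximum of the last row.
    """
    rows, cols = len(query) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # query bases aligned against nothing
    for i in range(1, rows):
        for j in range(1, cols):
            same = query[i - 1] == target[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            up = score[i - 1][j] + gap    # gap inserted in the target
            left = score[i][j - 1] + gap  # gap inserted in the query
            score[i][j] = max(diagonal, up, left)
    return max(score[-1])


print(semiglobal_score("GATCA", "ACATGGCCGATCAGG"))  # 5
print(semiglobal_score("GATTA", "CCGATCAGG"))        # 3
```

The first call finds an exact occurrence (five matches). The second aligns GATTA against GATCA with four matches and one mismatch.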

Bioinformatics – Sequence Analysis
- Multiple (global) sequence alignment
  - Also dynamic programming, but it can’t scale up: the exact method’s time and memory grow exponentially with the number of sequences

Bioinformatics – Sequence Analysis
- Multiple local sequence alignment
  - i.e. motif (pattern) discovery
>seq1
acatggccgatcagctggtttttgtgtgcctgtttctgaatc
>seq2
ttctattttacgtaaatcagcttgaacatgtacctactggtg
>seq3
atgcacctttgatcaataccagctagacaaacgtgtgttg
>seq4
agtccaaagatcagggctggctgaatactggatcagct
>seq5
cagctacagggcatataaaggggcaaggcacagactc
- Such overrepresented patterns are often important components (e.g. TFBSs if the sequences are promoters of similar genes).
- TFBSs are the controlling keyholes in gene regulation!
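One very simple way to look for an overrepresented pattern in the five example sequences is to count which short substrings (k-mers) occur in every sequence. This is only a sketch under strong assumptions: it uses exact matching and a fixed length k = 5, whereas real motif discovery tools tolerate mismatches; the name `kmer_support` is illustrative.

```python
from collections import Counter

# The five example sequences from the slide.
sequences = {
    "seq1": "acatggccgatcagctggtttttgtgtgcctgtttctgaatc",
    "seq2": "ttctattttacgtaaatcagcttgaacatgtacctactggtg",
    "seq3": "atgcacctttgatcaataccagctagacaaacgtgtgttg",
    "seq4": "agtccaaagatcagggctggctgaatactggatcagct",
    "seq5": "cagctacagggcatataaaggggcaaggcacagactc",
}


def kmer_support(seqs, k):
    """Count, for every k-mer, how many sequences contain it at least once."""
    support = Counter()
    for seq in seqs:
        # A set so that each sequence contributes at most once per k-mer.
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return support


support = kmer_support(sequences.values(), k=5)
shared = sorted(kmer for kmer, n in support.items() if n == len(sequences))
print(shared)  # the 5-mers present in all five sequences, including "cagct"
```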

DNA motifs
- Similar DNA fragments across individuals and/or species
  - TFBS motifs: DNA fragments similar to “TATAA” are common because they recruit the polymerase to initiate transcription in eukaryotes
  - Expensive and time-consuming to try a large set of candidates in biological experiments
- Figure labels: TF (transcription factor); DNA with a TFBS (controlling, e.g. TATAA) and a gene (functioning); transcription → RNA; translation → protein

Motif discovery
- Figure labels: TFBS motif discovery; CGATTGA; f (maximized); similar controlled functions, e.g. cancer gene activities
- Motif discovery usually refers to TFBS motifs
- But “motif” is a general term meaning “pattern”: sequence motifs, structural motifs, network motifs…

ChIP-Seq motif discovery
- Same as traditional TFBS motif discovery in principle
- The precision and scale of the input data are different
  - Genome-wide: tens of thousands of sequences
  - Short: bp
  - Each sequence is measured by some enrichment score (a peak)

Introduction
- ChIP-Seq technology → peak-calling → …
- High-resolution sequences from more direct binding evidence
- The enriched regions are likely to contain motifs coupled with peak signals
- Genome-wide sequences; in vivo
- Too many sequences for older methods

Introduction
- ChIP-Seq technology → motifs? …
- Figure label: enrichment
- Older methods reapplied

Phylogenetic Trees (Phylogenies)
- Preliminaries
- Distance-based methods
- Parsimony methods
Adapted from: Fundamental Concepts of Bioinformatics, Michael L. Raymer, Computer Science, Biomedical Sciences, Wright State University, birg.cs.wright.edu/text/Tutorial.ppt

Phylogenetic Trees
- Hypothesis about the relationship between organisms
- Can be rooted or unrooted
- Figure: a rooted tree (with root and time axis) and an unrooted tree, both over A, B, C, D, E

Tree proliferation
Species   Number of Rooted Trees             Number of Unrooted Trees
10        34,459,425                         2,027,025
15        213,458,046,767,875                7,905,853,580,625
20        8,200,794,532,637,891,559,375      221,643,095,476,699,771,875

An ongoing debate
- Pheneticists tend to prefer distance-based metrics, as they emphasize relationships among data sets rather than the paths they have taken to arrive at their current states.
- Cladists are generally more interested in evolutionary pathways, and tend to prefer more evolutionarily based approaches such as maximum parsimony.

Parsimony methods
- Belong to the broader class of character-based methods of phylogenetics
- Emphasize simpler, and thus more likely, evolutionary pathways
- Enumerate all possible trees
- Note the number of substitution events invoked by each possible tree
  - Can be weighted by transition/transversion probabilities, etc.
- Select the most parsimonious tree
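Counting the substitution events a given tree invokes can be sketched with Fitch's algorithm, a standard technique for unweighted parsimony (the slides do not name a specific algorithm, so this choice is an assumption). The toy alignment, the two candidate trees and the function names are hypothetical; trees are written as nested pairs of species names.

```python
def fitch(tree, states):
    """Return (possible ancestral states, substitutions) for one site."""
    if isinstance(tree, str):          # a leaf: its observed base
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:                         # children agree: no extra substitution
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of substitutions over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[site] for name, seq in alignment.items()})[1]
        for site in range(length)
    )


# Hypothetical toy data: four species, three aligned sites.
alignment = {"A": "AAG", "B": "AAG", "C": "ATC", "D": "GTC"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, alignment))  # 3
print(parsimony_score(tree_2, alignment))  # 5
```

Here `tree_1` needs fewer substitutions, so it is the more parsimonious of the two candidates. A full search would score every possible tree in the same way.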

Branch and Bound methods
- Key problem: the number of possible trees grows enormous as the number of species gets large
- Branch and bound: a technique that allows large numbers of candidate trees to be rapidly disregarded
- Requires a “good guess” at the cost of the best tree

Parsimony – Branch and Bound
- Use the UPGMA tree for an initial best estimate of the minimum cost (most parsimonious) tree
- Use branch and bound to explore all feasible trees
- Replace the best estimate as better trees are found
- Choose the most parsimonious tree

Bioinformatics – Data mining
- Clustering (unsupervised learning)
  - Similar things go together
  - Similarity measure is critical
  - Types:
    - Hierarchical clustering (UPGMA)
    - Partitional clustering (K-means)
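As an illustration of hierarchical clustering, here is a compact UPGMA sketch: repeatedly merge the two closest clusters and define the distance to the new cluster as the size-weighted average of the old distances. The distance values and the function name `upgma` are made up for the example; branch lengths are not computed, and ties are broken by whichever pair is encountered first.

```python
from itertools import combinations


def upgma(names, distances):
    """Build a UPGMA tree as nested tuples.

    `distances` maps frozenset({a, b}) to the distance between items a and b.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves
    dist = dict(distances)
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: dist[frozenset(pair)])
        merged = (a, b)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        for other in sizes:
            # Average distance, weighted by cluster size.
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


# Hypothetical pairwise distances between four items.
toy_distances = {
    frozenset(("A", "B")): 2, frozenset(("A", "C")): 6, frozenset(("B", "C")): 6,
    frozenset(("A", "D")): 10, frozenset(("B", "D")): 10, frozenset(("C", "D")): 10,
}
print(upgma(["A", "B", "C", "D"], toy_distances))  # ('D', ('C', ('A', 'B')))
```

A and B are merged first (distance 2), then C joins them (average distance 6), and D joins last.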

Bioinformatics – Data mining
- Classification (supervised learning)
  - To predict!
  - Pre-processing: tidy up your materials!
  - Feature selection: the key points to go over
  - Classifier: the thinking style/manner of how to combine the key points and get some answer
  - Training: practising your way of thinking with the answers known
  - Validation: a mock quiz to evaluate what you’ve learnt from the training
  - Testing: your examination!
- Underfitting & overfitting
- Source: \\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class1.pdf

Bioinformatics – Data mining
- Evaluation (scores!)
  - Confusion matrix
  - Binary classification performance evaluation metrics
    - Accuracy
    - Sensitivity/Recall/TP rate
    - Specificity/TN rate
    - Precision/PPV
    - …
- Source: \\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class3.pdf
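The listed metrics all derive from the four cells of a binary confusion matrix. A minimal sketch follows, with a made-up function name and made-up counts; it assumes every denominator is non-zero.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute common scores from confusion-matrix counts.

    tp/fp: true/false positives, tn/fn: true/false negatives.
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # recall, TP rate
        "specificity": tn / (tn + fp),  # TN rate
        "precision": tp / (tp + fp),    # PPV
    }


# Hypothetical counts: 45 real positives and 55 real negatives.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.85 and the precision is 0.8, while sensitivity (40/45) and specificity (45/55) differ, which shows why a single score rarely tells the whole story.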

Bioinformatics – Data mining
- Evaluation
  - ROC (Receiver Operating Characteristic) curve
  - Trade-off between positive hits (TP) and false alarms (FP)
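The trade-off can be traced by sorting the examples by classifier score and lowering the decision threshold one example at a time. The sketch below uses made-up scores and labels and illustrative function names; it assumes all scores are distinct and that both classes are present.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points of the ROC curve."""
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    # Highest score first: each step accepts one more example as positive.
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve (AUC)."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Hypothetical classifier scores; label 1 = positive, 0 = negative.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
print(points)
print(area_under_curve(points))  # about 0.889 (8/9)
```

An AUC of 8/9 means that in 8 of the 9 positive-negative pairs the positive example received the higher score.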

Statistical Tests
- Many different kinds of tests
- You should choose the appropriate ones

Where to get data
- Databases
  - TRANSFAC: TF and TFBS sequence data
  - Protein Data Bank: 3D structures of proteins and of protein-DNA and protein-ligand complexes (sequences and atoms included as well)
  - There are thousands more… find the ones that fit your topic

Where to get data
- Typical format:
  - Tags + descriptions in plain text

Where to get data
- We have to parse and pre-process data before using it
  - Tedious and time-consuming process
  - Some packages can help accelerate this: BioPerl, BioJava, Biopython…
  - Besides data, sometimes evaluation has to be done with literature evidence (manual!)
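To give a feel for the parsing step, here is a hand-written reader for FASTA text (the `>name` plus sequence-lines layout used earlier in these slides). It is a minimal sketch with an illustrative function name; for real projects the packages mentioned above provide tested parsers and handle many more formats.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""  # first word is the identifier
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequences may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


example = """>seq1 demo record
acatggcc
gatcagct
>seq2
ttctattt
"""
print(parse_fasta(example))
# {'seq1': 'acatggccgatcagct', 'seq2': 'ttctattt'}
```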

Where to get papers (published)
- A difficult question…
  - Your research quality, your writing and organization, plus some luck…
  - “Know yourself, know your opponent”: learn from the published papers and compare your research topic and level to them
- Where to find papers to read
  - Play on the CS side: IEEE Transactions, ACM Transactions; IEEE and ACM top conferences
  - Play on the Bioinformatics side: Bioinformatics, BMC Bioinformatics, Nucleic Acids Research, PLoS Computational Biology…
  - Aim high: Nature (series), Science, PNAS, Cell, …

Roadmap

Not The End
- Your corresponding tutor will have more project-specific stuff to tell you
- Thanks
- Q & A