1
Bioinformatics Basics Cyrus Chan, Peter Lo, David Lam Courtesy of LO Leung Yau’s original presentation
2
Biological Background
3
Outline Biological Background Cell Protein DNA & RNA Central Dogma Gene Expression Bioinformatics Sequence Analysis Phylogenetic Trees Data Mining
4
Biological Background – Cell Basic unit of organisms Prokaryotic (lacks a cell nucleus) Eukaryotic A bag of chemicals Metabolism controlled by various enzymes Correct working needs Suitable amounts of various proteins Picture taken from http://en.wikipedia.org/wiki/Cell_(biology)
5
Biological Background – Protein Polymer of 20 types of Amino Acids Folds into 3D structure Shape determines the function Many types Transcription Factors Enzymes Structural Proteins … Picture taken from http://en.wikipedia.org/wiki/Protein http://en.wikipedia.org/wiki/Amino_acid
6
Biological Background – DNA & RNA DNA Double stranded Adenine, Cytosine, Guanine, Thymine A-T, G-C Those parts coding for proteins are called genes RNA Single stranded Adenine, Cytosine, Guanine, Uracil Picture taken from http://en.wikipedia.org/wiki/Gene Chromosome
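The pairing rules on this slide (A-T, G-C in DNA; U in place of T in RNA) can be sketched in a few lines of Python. The function names below are illustrative choices, not taken from the presentation, and the sketch assumes the input contains only the letters A, C, G and T.

```python
# Watson-Crick pairing for DNA: A pairs with T, G pairs with C.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna.upper()))


def transcribe(coding_strand: str) -> str:
    """Write the coding strand as RNA: same letters, with U in place of T."""
    return coding_strand.upper().replace("T", "U")
```

For example, `reverse_complement("ATGC")` gives the strand that would pair with `ATGC`, and `transcribe("ATGC")` gives the matching RNA letters.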
7
Chromatin Structure Super compact packaging euchromatin heterochromatin
8
Biological Background – Genes Genes – protein coding regions 3 nucleotides code for one amino acid There are also start and stop codons
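As a rough illustration of "3 nucleotides code for one amino acid", the sketch below translates a DNA coding sequence using the standard genetic code. It assumes the input uses only A, C, G and T, starts reading at the first ATG, and stops at the first stop codon; the names `CODON_TABLE` and `translate` are hypothetical and not from the slides.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code: one letter per codon, with codons enumerated in
# TCAG order (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate from the first ATG (start codon) up to the first stop codon."""
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon reached
        protein.append(amino_acid)
    return "".join(protein)
```

Real genes add complications this sketch ignores, such as introns and alternative reading frames.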
9
Biological Background – in a nutshell Abstractions: the Central Dogma Functional Units: Proteins Templates: RNAs Blueprints: DNAs Not only the information (data), but also the control signals about what and how much data is to be sent Proteins (TFs) also help
10
Biological Background …acatggccgatca…tcaccctgaacatgtcgctttaacctactggtgatgcacct…atgatcaggg…atactggatacagggcata…. RNA Protein Intergenic region “Non-coding region” Gene
11
Biological Background …acatgggcgatca…tcaccctgaacatgtcgctttaacctactggtgatgcacct…atgatcaggg…atactggatacagggcata…. RNA Protein (malfunctioning) Protein Intergenic region “Non-coding region” Gene Genetic Disease caused by a single mutation
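The single mutation on this slide can be located by comparing the first fragment shown here with the corresponding fragment on the previous slide, position by position. This is only a sketch for equal-length sequences with substitutions; the helper name `point_differences` is made up for illustration, and insertions or deletions would need an alignment step first.

```python
def point_differences(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """List (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must have equal length for a direct comparison")
    return [
        (i + 1, ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# Leading fragments from the two slides (normal vs. mutated).
normal = "acatggccgatca"
mutated = "acatgggcgatca"
differences = point_differences(normal, mutated)
```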
12
Biological Background There can be multiple mutations that cause diseases (increase risks of diseases) … DNA from different people Normal Disease! A A A C C C T T T G G G AT CG … … … … SNP (single nucleotide polymorphism)
13
Biological Background – Sequences Abstractions Sequences …acatggccgatcaggctgtttttgtgtgcctgtttttctattttacgtaaatcaccctgaacatgtTTGCATCAaccta ctggtgatgcacctttgatcaatacattttagacaaacgtggtttttgagtccaaagatcagggctgggttgacctgaata ctggatacagggcatataaaacaggggcaaggcacagactc… FT intron <1..28 FT /gene="CREB" FT /number=3 FT /experiment="experimental evidence … FT recorded" FT exon 29..174 FT /gene="CREB" FT /number=4 FT /experiment="experimental evidence … FT recorded" FT intron 175..>189 FT /gene="CREB" FT /number=4 Annotations Visualizations
14
Biological Background – DNA RNA Protein Picture taken from http://en.wikipedia.org/wiki/Gene gene
15
Biological Background – DNA RNA Protein Transcriptional Regulatory Network is the complex interaction between genes, transcription factors (TF) and transcription factor binding sites (TFBS). Other functions Transcription Factors Binding sites Genes Promoter regions
16
Complex Interactions between Genes, TFs and TFBSs
17
Biological Background – DNA RNA Protein Transcriptional Regulatory Network is the complex interaction between genes, transcription factors (TF) and transcription factor binding sites (TFBS). Other functions Transcription Factors Binding sites Genes Promoter regions
18
Gene Expression Microarray Data High throughput Measures RNA level Relies on A-T, G-C pairing Can monitor expression of many genes Picture taken from http://en.wikipedia.org/wiki/DNA_microarray_experiment
19
Gene Expression Microarray Data Picture taken from http://en.wikipedia.org/wiki/DNA_microarray Genes Time points/Conditions Colors: Expression (RNA) Levels
20
Bioinformatics
21
Bioinformatics — Sequence Analysis Alignments a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences http://en.wikipedia.org/wiki/Sequence_alignment
22
Bioinformatics — Sequence Analysis Pair-wise alignments Method: dynamic programming! No penalty for the consecutive ‘-’s before and after the sequence to be aligned \\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC3220 Lectures
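A minimal sketch of the dynamic-programming idea, using the classic Needleman-Wunsch recurrence for a global alignment score. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, and this plain version does penalize leading and trailing gaps, unlike the free-end-gap variant mentioned on the slide.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch score of the best global alignment of a and b."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept, which is enough for the score; recovering the alignment itself would need the full table and a traceback.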
23
Bioinformatics — Sequence Analysis Multiple (global) sequence alignment Also dynamic programming (but can’t scale up!)
24
Bioinformatics — Sequence Analysis Multiple local sequence alignment i.e. Motif (pattern) discovery >seq1 acatggccgatcagctggtttttgtgtgcctgtttctgaatc >seq2 ttctattttacgtaaatcagcttgaacatgtacctactggtg >seq3 atgcacctttgatcaataccagctagacaaacgtgtgttg >seq4 agtccaaagatcagggctggctgaatactggatcagct >seq5 cagctacagggcatataaaggggcaaggcacagactc Such overrepresented patterns are often important components (e.g. TFBSs if the sequences are promoters of similar genes). TFBSs are the controlling key holes in gene regulation!
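One very simple way to look for overrepresented patterns is to count, for every k-mer, how many of the input sequences contain it. The sketch below applies this to the five sequences on the slide; the choice of k = 5 and the function name are assumptions, and practical motif finders also allow mismatches and compare against a background model, which this does not.

```python
from collections import Counter


def kmer_sequence_support(sequences: list[str], k: int) -> Counter:
    """Count, for each k-mer, the number of sequences in which it occurs."""
    support = Counter()
    for seq in sequences:
        seq = seq.lower()
        # A set ensures each k-mer is counted once per sequence.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        support.update(seen)
    return support


SLIDE_SEQUENCES = [
    "acatggccgatcagctggtttttgtgtgcctgtttctgaatc",
    "ttctattttacgtaaatcagcttgaacatgtacctactggtg",
    "atgcacctttgatcaataccagctagacaaacgtgtgttg",
    "agtccaaagatcagggctggctgaatactggatcagct",
    "cagctacagggcatataaaggggcaaggcacagactc",
]

support = kmer_sequence_support(SLIDE_SEQUENCES, k=5)
# k-mers present in every input sequence are candidate motifs.
shared = sorted(kmer for kmer, n in support.items() if n == len(SLIDE_SEQUENCES))
```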
25
DNA motifs Similar DNA fragments across individuals and/or species TFBS Motifs: DNA fragments similar to “TATAA” are common in order to recruit the polymerase to initiate transcription in eukaryotes Expensive and time-consuming to try a large set of candidates in biological experiments Transcription RNA Translation Protein TATAA TFBS (controlling) Gene (functioning) TF Transcription Factor DNA
26
Motif discovery CGATTGA f Similar controlled functions e.g. cancer gene activities Maximized TFBS Motif Discovery Motif discovery usually refers to TFBS motifs But motif is a general term meaning “pattern”: Sequence motifs, structural motifs, network motifs…
27
ChIP-Seq motif discovery Same as traditional TFBS motif discovery in principle Data input precision and scale are different Genome-wide: tens of thousands of sequences Short: 50-100bp Each sequence measured by some enrichment score (a peak)
28
Introduction ChIP-Seq technology Peak-calling … High-resolution sequences from more direct binding evidence; The enriched regions are likely to contain motifs coupled with peak signals; genome-wide sequences; in vivo Too many sequences for old-day methods
29
Enrichment Introduction ChIP-Seq technology Motifs? … Old-day methods reapplied
30
Phylogenetic Trees (Phylogenies) Preliminaries Distance-based methods Parsimony Methods Adopted from: Fundamental Concepts of Bioinformatics Michael L. Raymer Computer Science, Biomedical Sciences Wright State University birg.cs.wright.edu/text/Tutorial.ppt
31
Phylogenetic Trees Hypothesis about the relationship between organisms Can be rooted or unrooted ABCDE AB C D E Time Root birg.cs.wright.edu/text/Tutorial.ppt
32
Tree proliferation
Species | Number of Rooted Trees | Number of Unrooted Trees
2 | 1 | 1
3 | 3 | 1
4 | 15 | 3
5 | 105 | 15
10 | 34,459,425 | 2,027,025
15 | 213,458,046,676,875 | 7,905,853,580,625
20 | 8,200,794,532,637,891,559,375 | 221,643,095,476,699,771,875
birg.cs.wright.edu/text/Tutorial.ppt
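For strictly bifurcating trees with labeled leaves, these counts follow the standard double-factorial formulas: (2n-3)!! rooted trees and (2n-5)!! unrooted trees for n species. A short sketch, with function names chosen here for illustration:

```python
def odd_double_factorial(m: int) -> int:
    """Product of the odd numbers 1 * 3 * 5 * ... * m (1 when m < 3)."""
    result = 1
    for k in range(3, m + 1, 2):
        result *= k
    return result


def rooted_trees(n_species: int) -> int:
    """Number of rooted bifurcating trees: (2n - 3)!!"""
    if n_species < 2:
        raise ValueError("need at least two species")
    return odd_double_factorial(2 * n_species - 3)


def unrooted_trees(n_species: int) -> int:
    """Number of unrooted bifurcating trees: (2n - 5)!!"""
    if n_species < 2:
        raise ValueError("need at least two species")
    return odd_double_factorial(2 * n_species - 5)
```

The rooted count for n species equals the unrooted count for n + 1 species, since rooting a tree is equivalent to attaching one extra leaf.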
33
An ongoing didactic Pheneticists tend to prefer distance based metrics, as they emphasize relationships among data sets, rather than the paths they have taken to arrive at their current states. Cladists are generally more interested in evolutionary pathways, and tend to prefer more evolutionarily based approaches such as maximum parsimony. birg.cs.wright.edu/text/Tutorial.ppt
34
Parsimony methods Belong to the broader class of character-based methods of phylogenetics Emphasize simpler, and thus more likely, evolutionary pathways Enumerate all possible trees Note the number of substitution events invoked by each possible tree Can be weighted by transition/transversion probabilities, etc. Select the most parsimonious birg.cs.wright.edu/text/Tutorial.ppt
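Counting the substitution events that a given tree requires can be done with Fitch's algorithm, a standard technique the slide does not name explicitly. The sketch below handles one site (one alignment column) on a rooted binary tree written as nested tuples, with every substitution costing 1; the transition/transversion weighting mentioned above is not included, and the tree encoding is an assumption made for this example.

```python
def fitch(tree):
    """Return (possible ancestral states, minimum substitutions) for one site.

    A leaf is a single-character string; an internal node is a (left, right) tuple.
    """
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    left_states, left_cost = fitch(left)
    right_states, right_cost = fitch(right)
    common = left_states & right_states
    if common:
        # The children agree on at least one state: no extra substitution.
        return common, left_cost + right_cost
    # The children disagree: one substitution is needed on this branch pair.
    return left_states | right_states, left_cost + right_cost + 1


# Two candidate trees for four species whose states at this site are A, A, C, C.
grouped = (("A", "A"), ("C", "C"))
mixed = (("A", "C"), ("A", "C"))
grouped_cost = fitch(grouped)[1]
mixed_cost = fitch(mixed)[1]
```

Summing this cost over all sites gives the parsimony score of a tree; the tree with the lowest total is the most parsimonious.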
35
Branch and Bound methods Key problem – the number of possible trees grows enormously as the number of species increases Branch and bound – a technique that allows large numbers of candidate trees to be rapidly disregarded Requires a “good guess” at the cost of the best tree birg.cs.wright.edu/text/Tutorial.ppt
36
Parsimony – Branch and Bound Use the UPGMA tree for an initial best estimate of the minimum cost (most parsimonious) tree Use branch and bound to explore all feasible trees Replace the best estimate as better trees are found Choose the most parsimonious birg.cs.wright.edu/text/Tutorial.ppt
37
Bioinformatics — Data mining Clustering (Unsupervised learning) Similar things go together Similarity measure is critical Types: Hierarchical clustering (UPGMA) Partitional clustering (K-means)
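A compact sketch of UPGMA-style hierarchical clustering: repeatedly merge the two closest clusters and average their distances to the rest, weighted by cluster size. The input format (a dictionary of pairwise leaf distances) and the nested-tuple output are assumptions for this example, and branch lengths are omitted for brevity.

```python
def upgma(distances: dict) -> tuple:
    """Build a UPGMA tree topology from {(a, b): distance} over leaf pairs."""
    size = {}  # cluster -> number of leaves it contains
    d = {}     # frozenset({cluster1, cluster2}) -> distance
    for (a, b), value in distances.items():
        d[frozenset((a, b))] = value
        size[a] = size[b] = 1
    while len(size) > 1:
        closest = min(d, key=d.get)
        # Sort by string form so the result does not depend on set ordering.
        a, b = sorted(closest, key=str)
        merged = (a, b)
        for other in size:
            if other == a or other == b:
                continue
            # Size-weighted average of the distances from a and b to 'other'.
            d[frozenset((merged, other))] = (
                size[a] * d[frozenset((a, other))]
                + size[b] * d[frozenset((b, other))]
            ) / (size[a] + size[b])
        # Drop every distance that still refers to the old clusters.
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))


example_tree = upgma({
    ("A", "B"): 2, ("C", "D"): 4,
    ("A", "C"): 10, ("A", "D"): 10, ("B", "C"): 10, ("B", "D"): 10,
})
```

The result depends entirely on the distance measure supplied, which is why the similarity measure is described as critical.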
38
Bioinformatics — Data mining Classification (Supervised Learning) To predict! Pre-processing—tidy up your materials! Feature selection—the key points to go over Classifier—the thinking style/manner of how to combine the key points and get some answer Training—your practice of your thinking manner with answers known Validation—mock quiz to evaluate what you’ve learnt from the training Testing—your examination! \\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class1.pdf Underfitting & Overfitting
39
Bioinformatics — Data mining Evaluation (scores!) Confusion Matrix Binary Classification Performance Evaluation Metrics Accuracy Sensitivity/Recall/TP Rate Specificity/TN Rate Precision/PPV … \\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class3.pdf
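The metrics listed here are all simple ratios of the four confusion-matrix counts. A minimal sketch, assuming every denominator is non-zero; the function name and the example counts are made up for illustration.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Common binary-classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # recall, true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": tp / (tp + fp),    # positive predictive value
    }


# Hypothetical counts: 40 true positives, 10 false positives,
# 45 true negatives, 5 false negatives.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```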
40
Bioinformatics — Data mining Evaluation ROC (Receiver Operating Characteristics) Trade-off between positive hits (TP) and false alarms (FP)
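An ROC curve can be traced by sorting predictions from highest to lowest score and recording the false positive rate and true positive rate as the threshold is lowered one prediction at a time. The sketch below assumes distinct scores, labels of 1 (positive) or 0 (negative), and at least one example of each class; the area under the curve is computed with the trapezoid rule.

```python
def roc_points(scores: list[float], labels: list[int]) -> list[tuple[float, float]]:
    """Return (false positive rate, true positive rate) points of the ROC curve."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _score, label in ranked:
        if label:
            tp += 1  # one more positive hit
        else:
            fp += 1  # one more false alarm
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: list[tuple[float, float]]) -> float:
    """Area under the curve by the trapezoid rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

A classifier that ranks every positive above every negative gets an area of 1.0, while random ranking gives about 0.5.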
41
Statistical Tests Many different kinds of tests You should choose the appropriate ones
42
Where to get data Databases Transfac—TF and TFBS sequence data Protein Data Bank—protein and protein-DNA, protein-ligand complexes 3D structures (sequences and atoms included as well) There are thousands more… find the ones that fit your topic
43
Where to get data Typical format: tags + descriptions in plain text
44
Where to get data We have to parse and pre-process data before using Tedious and time-consuming process Some packages can help accelerate this: BioPerl, BioJava, BioPython… Besides data, sometimes evaluation has to be done with literature evidence (manual!)
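As a small taste of the parsing step, here is a plain-Python sketch that reads FASTA-style text, the ">name" plus sequence layout used for the motif example earlier. Choosing FASTA is an assumption made for this illustration; the databases named above each have their own formats, and packages such as BioPython provide tested parsers that are preferable in practice.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into {record name: sequence}."""
    records: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            # The record name is the first word of the header line.
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequences may span several lines
    return {record: "".join(parts) for record, parts in records.items()}


example = parse_fasta(">seq1 demo record\nacat\nggcc\n>seq2\nttct\n")
```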
45
Where to get papers (published) A difficult question… Your research quality, your writing and organization, plus some luck… Know yourself and know your opponent: learn from the published papers and compare your research topic and level to them Where to find papers to read Play on the CS side: IEEE Transactions, ACM Transactions IEEE and ACM top conferences Play on the Bioinformatics side: Bioinformatics, BMC Bioinformatics, Nucleic Acids Research PLoS Computational Biology… Aim high: Nature (series), Science PNAS, Cell, …
46
Roadmap
47
Not The End Your corresponding tutor will have more project-specific stuff to tell you Thanks Q & A