1
Bioinformatics Basics
  Cyrus
  Courtesy of LO Leung Yau’s original presentation
2
Outline
  Biological Background
    Cell
    Protein
    DNA & RNA
    Central Dogma
    Gene Expression
  Bioinformatics
    Sequence Analysis
    Phylogenetic Trees
    Data Mining
3
Biological Background – Cell
  Basic unit of organisms
    Prokaryotic
    Eukaryotic
  A bag of chemicals
  Metabolism controlled by various enzymes
  Correct working needs suitable amounts of various proteins
  Picture taken from http://en.wikipedia.org/wiki/Cell_(biology)
4
Biological Background – Protein
  Polymer of 20 types of amino acids
  Folds into a 3D structure
  Shape determines the function
  Many types
    Transcription factors
    Enzymes
    Structural proteins
    …
  Pictures taken from http://en.wikipedia.org/wiki/Protein and http://en.wikipedia.org/wiki/Amino_acid
5
Biological Background – DNA & RNA
  DNA
    Double stranded
    Bases: Adenine, Cytosine, Guanine, Thymine
    Base pairing: A-T, G-C
    The parts that code for proteins are called genes
  RNA
    Single stranded
    Bases: Adenine, Cytosine, Guanine, Uracil
  Picture taken from http://en.wikipedia.org/wiki/Gene
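The pairing rules on this slide (A-T, G-C, and U replacing T in RNA) can be illustrated with a minimal Python sketch. The function names reverse_complement and transcribe are illustrative choices, not part of the original presentation, and the sketch assumes the input contains only the letters A, C, G and T.

```python
# Watson-Crick pairing from the slide: A pairs with T, G pairs with C.
DNA_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    return dna.translate(DNA_COMPLEMENT)[::-1]


def transcribe(coding_strand: str) -> str:
    """Return the RNA copy of the coding strand (T is replaced by U)."""
    return coding_strand.upper().replace("T", "U")


print(reverse_complement("ATGC"))  # GCAT
print(transcribe("ATGC"))          # AUGC
```

Applying reverse_complement twice gives back the original strand, which is a quick sanity check on the pairing table.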
6
Biological Background – Genes
  Genes – protein coding regions
  3 nucleotides (a codon) code for one amino acid
  There are also start and stop codons
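The codon idea above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it uses the standard genetic code, starts at the first ATG it finds, stops at the first in-frame stop codon, and ignores introns and the reverse strand. The names CODON_TABLE and translate are hypothetical helpers, not from the slides.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
# "*" marks the three stop codons (TAA, TAG, TGA).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate from the first start codon (ATG) to the first stop codon.

    Assumes the sequence contains only A, C, G and T.
    """
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    # Read non-overlapping triplets; a trailing partial codon is ignored.
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ccATGGCCTAAgg"))  # MA
```

Here ATG (start, methionine) and GCC (alanine) are translated, and TAA ends the protein.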
7
Biological Background – in a nutshell
  Abstractions
    Functional Units: Proteins
    Templates: RNAs
    Blueprints: DNAs
  DNA carries not only the information (data), but also the control signals about what and how much data is to be sent
  Proteins (TFs) also help
8
Biological Background – Sequences
  Abstractions
  Sequences
    acatggccgatcaggctgtttttgtgtgcctgtttttctattttacgtaaatcaccctgaacatgtTTGCATCAacctact
    ggtgatgcacctttgatcaatacattttagacaaacgtggtttttgagtccaaagatcagggctgggttgacctgaatact
    ggatacagggcatataaaacaggggcaaggcacagactc
  Annotations
    FT   intron          <1..28
    FT                   /gene="CREB"
    FT                   /number=3
    FT                   /experiment="experimental evidence …
    FT                   recorded"
    FT   exon            29..174
    FT                   /gene="CREB"
    FT                   /number=4
    FT                   /experiment="experimental evidence …
    FT                   recorded"
    FT   intron          175..>189
    FT                   /gene="CREB"
    FT                   /number=4
  Visualizations
9
Biological Background – DNA → RNA → Protein
  Picture taken from http://en.wikipedia.org/wiki/Gene
  Figure label: gene
10
Biological Background – DNA → RNA → Protein
  A transcriptional regulatory network is the complex interaction between genes, transcription factors (TFs) and transcription factor binding sites (TFBSs).
  Diagram labels: Other functions, Transcription factors, Binding sites, Genes, Promoter regions
11
Complex Interactions between Genes, TFs and TFBSs
13
Gene Expression
  Microarray Data
    High throughput
    Measures RNA level
    Relies on A-T, G-C pairing
    Can monitor expression of many genes
  Picture taken from http://en.wikipedia.org/wiki/DNA_microarray_experiment
14
Gene Expression
  Microarray Data
  Picture taken from http://en.wikipedia.org/wiki/DNA_microarray
  Figure labels: Genes; Time points/Conditions; Colors: Expression (RNA) levels
15
Bioinformatics – Sequence Analysis
  Alignments: a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences
  http://en.wikipedia.org/wiki/Sequence_alignment
16
Bioinformatics – Sequence Analysis
  Pair-wise alignments
    Method: dynamic programming!
    No penalty for the consecutive ‘-’s before and after the sequence to be aligned
  \\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC3220 Lectures
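The dynamic-programming idea with free end gaps can be sketched as follows. This is a minimal illustration, not the method from the referenced lecture notes: the scoring values (match=1, mismatch=-1, gap=-2) and the function name semiglobal_score are assumptions chosen for the example, and only the best score is returned, not the alignment itself.

```python
def semiglobal_score(query, target, match=1, mismatch=-1, gap=-2):
    """Best score for aligning all of `query` somewhere inside `target`.

    The unaligned target letters before and after the query (the runs of
    '-' placed around the query) cost nothing; gaps inside the alignment
    are penalised with `gap`.
    """
    n = len(target)
    # Row 0: skipping any prefix of the target is free.
    prev = [0] * (n + 1)
    for i in range(1, len(query) + 1):
        # Column 0: i query letters aligned against nothing cost i gaps.
        curr = [prev[0] + gap] + [0] * n
        for j in range(1, n + 1):
            pair = match if query[i - 1] == target[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + pair,  # align query[i-1] with target[j-1]
                prev[j] + gap,       # query[i-1] against a gap
                curr[j - 1] + gap,   # target[j-1] against a gap
            )
        prev = curr
    # Last row: skipping any suffix of the target is free.
    return max(prev)


print(semiglobal_score("ACGT", "TTACGTTT"))  # 4
print(semiglobal_score("ACGT", "ACT"))       # 1
```

In the first call the query matches exactly inside the target, so the flanking letters do not reduce the score. In the second, the best alignment leaves G against a gap (3 matches, 1 gap).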
17
Bioinformatics – Sequence Analysis
  Multiple (global) sequence alignment
    Also dynamic programming (but can’t scale up!)
18
Bioinformatics – Sequence Analysis
  Multiple local sequence alignment, i.e. motif (pattern) discovery
    >seq1 acatggccgatcagctggtttttgtgtgcctgtttctgaatc
    >seq2 ttctattttacgtaaatcagcttgaacatgtacctactggtg
    >seq3 atgcacctttgatcaataccagctagacaaacgtgtgttg
    >seq4 agtccaaagatcagggctggctgaatactggatcagct
    >seq5 cagctacagggcatataaaggggcaaggcacagactc
  Such overrepresented patterns are often important components (e.g. TFBSs if the sequences are promoters of similar genes).
  TFBSs are the controlling keyholes in gene regulation!
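A very simple way to look for overrepresented patterns in the five sequences above is to count, for each k-mer, how many sequences contain it. The sketch below is only an illustration: it uses exact matching, whereas real motif discovery tools allow mismatches and use statistical models. The function name kmer_support and the choice k=5 are assumptions made for the example.

```python
from collections import Counter


def kmer_support(sequences, k):
    """For every k-mer, count how many sequences contain it at least once."""
    support = Counter()
    for seq in sequences:
        seq = seq.lower()
        # A set so that each k-mer counts once per sequence.
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return support


# The five sequences shown on the slide.
promoters = [
    "acatggccgatcagctggtttttgtgtgcctgtttctgaatc",
    "ttctattttacgtaaatcagcttgaacatgtacctactggtg",
    "atgcacctttgatcaataccagctagacaaacgtgtgttg",
    "agtccaaagatcagggctggctgaatactggatcagct",
    "cagctacagggcatataaaggggcaaggcacagactc",
]

support = kmer_support(promoters, k=5)
shared_by_all = sorted(kmer for kmer, n in support.items() if n == len(promoters))
print(shared_by_all)
```

The 5-mer cagct occurs in all five sequences, so it appears in the printed list.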
19
DNA motifs
  Similar DNA fragments across individuals and/or species
  TFBS motifs: DNA fragments similar to “TATAA” are common because genes need them in order to function
  Expensive and time-consuming to try a large set of candidates in biological experiments
  Diagram labels: TF (Transcription Factor), DNA, TFBS “TATAA” (controlling), Gene (functioning), Transcription, RNA, Translation, Protein
20
Motif discovery
  TFBS motif discovery: a motif (e.g. CGATTGA) whose score f is maximized over sequences with similar controlled functions, e.g. cancer gene activities
  SNP (single nucleotide polymorphism) motif discovery: patterns in DNA from different people whose score f is maximized to distinguish Normal from Disease
21
Bioinformatics – Data mining
  Classification: to predict!
    Pre-processing: tidy up your materials!
    Feature selection: the key points to go over
    Classifier: the thinking style/manner of how to combine the key points and get some answer
    Training: your practice of your thinking manner with answers known
    Validation: mock quiz to evaluate what you’ve learnt from the training
    Testing: your examination!
  Underfitting & Overfitting
  \\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class1.pdf
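The training / validation / testing steps depend on keeping the three sets of samples separate. A minimal sketch of such a split is shown below; the function name split_dataset, the 60/20/20 proportions and the fixed seed are assumptions for illustration, not values from the lecture notes.

```python
import random


def split_dataset(samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the samples and cut them into train, validation and test sets.

    Whatever remains after the training and validation parts becomes the
    test set, so every sample lands in exactly one of the three sets.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed: reproducible split
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test


train, validation, test = split_dataset(range(10))
print(len(train), len(validation), len(test))  # 6 2 2
```

The classifier is fitted on train, tuned on validation (the mock quiz), and scored once on test (the examination). Reusing test data during tuning gives an over-optimistic estimate and hides overfitting.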
22
TRANSFAC Project
  TF: Transcription Factors, important regulators
  TFBS: Transcription Factor Binding Sites, major regulatory elements
  TRANSFAC: the most representative DB for TFs and TFBSs
  1. Modeling: statistical models, representations, Markov chains; Discovery: stochastic searching, indexing (suffix trees)
  2. Relationship: TF-TFBS; TFBS-Gene… (understanding, prediction); Mining: text mining, approximate matching
  3. Annotations: accurate wet-lab candidates (reduced labor and costs); Computation: large scale data processing; parallel computing
  Representative Publications
  [1] Gang Li, Tak-Ming Chan, Kwong-Sak Leung and Kin-Hong Lee, A Cluster Refinement Algorithm for Motif Discovery, IEEE/ACM Transactions on Computational Biology and Bioinformatics (accepted)
  [2] Tak-Ming Chan, Kwong-Sak Leung and Kin-Hong Lee, TFBS identification based on genetic algorithm with combined representations and adaptive post-processing, Bioinformatics, 2008, 24(3), pp. 341-349
23
Bioinformatics – Data mining
  Evaluation (scores!)
    Confusion Matrix
    Binary Classification Performance Evaluation Metrics
      Accuracy
      Sensitivity/Recall/TP Rate
      Specificity/TN Rate
      Precision/PPV
      …
  \\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class3.pdf
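The metrics listed above all follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function names and the toy labels are made up for illustration, with 1 meaning positive and 0 meaning negative.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels where 1 = positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    """Divide, returning NaN when the metric is undefined."""
    return numerator / denominator if denominator else float("nan")


def evaluate(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),  # recall, TP rate
        "specificity": ratio(tn, tn + fp),  # TN rate
        "precision": ratio(tp, tp + fp),    # PPV
    }


# Toy example: 3 true positives in the data, 5 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(evaluate(y_true, y_pred))
```

For this toy data TP=2, FP=1, TN=4 and FN=1, giving accuracy 0.75, sensitivity 2/3, specificity 0.8 and precision 2/3.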
24
Bioinformatics – Data mining
  Evaluation
    ROC (Receiver Operating Characteristic)
    Trade-off between positive hits (TP) and false alarms (FP)
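The trade-off can be made concrete by sweeping a decision threshold over classifier scores and recording the false positive rate and true positive rate at each step. The following is a minimal sketch with made-up scores; roc_points and auc are illustrative names, and the code assumes both classes are present in the labels.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points as the decision threshold is lowered.

    `labels` holds 1 for positives and 0 for negatives; a higher score
    means "more likely positive".
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
print(curve)
print(auc(curve))
```

For these toy scores the area is 8/9 (about 0.89): of the 9 positive-negative pairs, 8 are ranked correctly. A perfect ranking gives 1.0 and a random one about 0.5.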
25
Not The End
  Your corresponding tutor will have more project-specific stuff to tell you
  Thanks
  Q & A