Bioinformatics Basics Cyrus Chan, Peter Lo, David Lam Courtesy from LO Leung Yau’s original presentation.


1 Bioinformatics Basics Cyrus Chan, Peter Lo, David Lam Courtesy from LO Leung Yau’s original presentation

2 Biological Background

3 Outline Biological Background: Cell, Protein, DNA & RNA, Central Dogma, Gene Expression. Bioinformatics: Sequence Analysis, Phylogenetic Trees, Data Mining.

4 Biological Background – Cell Basic unit of organisms  Prokaryotic (lacks a cell nucleus)  Eukaryotic A bag of chemicals Metabolism controlled by various enzymes Correct working needs  Suitable amounts of various proteins Picture taken from http://en.wikipedia.org/wiki/Cell_(biology)

5 Biological Background – Protein Polymer of 20 types of Amino Acids Folds into 3D structure Shape determines the function Many types  Transcription Factors  Enzymes  Structural Proteins  … Pictures taken from http://en.wikipedia.org/wiki/Protein and http://en.wikipedia.org/wiki/Amino_acid

6 Biological Background – DNA & RNA DNA  Double stranded  Adenine, Cytosine, Guanine, Thymine  A-T, G-C  Those parts coding for proteins are called genes RNA  Single stranded  Adenine, Cytosine, Guanine, Uracil Picture taken from http://en.wikipedia.org/wiki/Gene Chromosome
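
The pairing rules on this slide (A-T, G-C, and uracil replacing thymine in RNA) can be sketched in a few lines of Python. The function names are illustrative choices, not part of the original slides, and the sketch assumes the input contains only the letters A, C, G and T.

```python
# Watson-Crick pairing from the slide: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(dna: str) -> str:
    """Return the opposite DNA strand, read in its own 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))

def transcribe(coding_strand: str) -> str:
    """RNA copy of the coding strand: thymine (T) is replaced by uracil (U)."""
    return coding_strand.upper().replace("T", "U")

print(reverse_complement("ATGGCC"))  # GGCCAT
print(transcribe("ATGGCC"))          # AUGGCC
```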

7 Chromatin Structure Super compact packaging: euchromatin, heterochromatin

8 Biological Background – Genes Genes – protein coding regions 3 nucleotides code for one amino acid There are also start and stop codons
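
A minimal sketch of how three nucleotides map to one amino acid, using the standard genetic code. The choice to start reading at the first ATG and the function name `translate` are assumptions made for illustration. The two input fragments are the opening bases shown on slides 10 and 11, so the output only illustrates the mechanism and says nothing about a real gene.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order (TTT, TTC, TTA, ...).
# '*' marks the three stop codons TAA, TAG and TGA.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna: str) -> str:
    """Translate from the first start codon (ATG) until a stop codon or the end.

    Assumes the sequence contains only A, C, G and T.
    """
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("acatggccgatca"))  # MAD
print(translate("acatgggcgatca"))  # MGD (one changed base, one changed amino acid)
```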

9 Biological Background – in a nutshell Abstractions: the Central Dogma. Functional units: proteins. Templates: RNAs. Blueprints: DNAs. Not only the information (data) matters, but also the control signals for what data is sent and how much. Proteins (TFs) help with this control.

10 Biological Background …acatggccgatca…tcaccctgaacatgtcgctttaacctactggtgatgcacct…atgatcaggg…atactggatacagggcata…. RNA Protein Intergenic region “Non-coding region” Gene

11 Biological Background …acatgggcgatca…tcaccctgaacatgtcgctttaacctactggtgatgcacct…atgatcaggg…atactggatacagggcata…. RNA Protein (malfunctioning) Protein Intergenic region “Non-coding region” Gene Genetic Disease caused by a single mutation

12 Biological Background There can be multiple mutations that cause diseases (or increase the risk of disease). Figure: aligned DNA from different people, labelled Normal and Disease, differing at single positions. SNP (single nucleotide polymorphism)
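
As a rough illustration, single-base differences between two already aligned sequences of equal length can be listed as below. The function name is hypothetical, and the inputs are the opening fragments from slides 10 and 11. A true SNP is defined at the population level, so this sketch only finds candidate positions.

```python
def single_base_differences(reference: str, sample: str):
    """List (position, reference_base, sample_base) for each mismatch.

    Assumes both sequences are already aligned and have equal length.
    Positions are 0-based.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length")
    return [(i, r, s)
            for i, (r, s) in enumerate(zip(reference, sample))
            if r != s]

normal = "acatggccgatca"   # fragment shown on slide 10
mutated = "acatgggcgatca"  # fragment shown on slide 11
print(single_base_differences(normal, mutated))  # [(6, 'c', 'g')]
```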

13 Biological Background – Sequences Abstractions Sequences …acatggccgatcaggctgtttttgtgtgcctgtttttctattttacgtaaatcaccctgaacatgtTTGCATCAaccta ctggtgatgcacctttgatcaatacattttagacaaacgtggtttttgagtccaaagatcagggctgggttgacctgaata ctggatacagggcatataaaacaggggcaaggcacagactc… FT intron <1..28 FT /gene="CREB" FT /number=3 FT /experiment="experimental evidence … FT recorded" FT exon 29..174 FT /gene="CREB" FT /number=4 FT /experiment="experimental evidence … FT recorded" FT intron 175..>189 FT /gene="CREB" FT /number=4 Annotations Visualizations

14 Biological Background – DNA → RNA → Protein Picture taken from http://en.wikipedia.org/wiki/Gene gene

15 Biological Background – DNA → RNA → Protein Transcriptional Regulatory Network is the complex interaction between genes, transcription factors (TF) and transcription factor binding sites (TFBS). Other functions Transcription Factors Binding sites Genes Promoter regions

16 Complex Interactions between Genes, TFs and TFBSs

17 Biological Background – DNA → RNA → Protein Transcriptional Regulatory Network is the complex interaction between genes, transcription factors (TF) and transcription factor binding sites (TFBS). Other functions Transcription Factors Binding sites Genes Promoter regions

18 Gene Expression Microarray Data High throughput Measures RNA level Relies on A-T, G-C pairing Can monitor expression of many genes Picture taken from http://en.wikipedia.org/wiki/DNA_microarray_experiment

19 Gene Expression Microarray Data Picture taken from http://en.wikipedia.org/wiki/DNA_microarray Genes Time points/Conditions Colors: Expression (RNA) Levels

20 Bioinformatics

21 Bioinformatics – Sequence Analysis Alignments  a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences http://en.wikipedia.org/wiki/Sequence_alignment

22 Bioinformatics — Sequence Analysis Pair-wise alignments  Method: dynamic programming! No penalty for the consecutive ‘-’s before and after the sequence to be aligned \\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC3220 Lectures
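
A minimal sketch of dynamic-programming pairwise alignment in which runs of '-' before and after either sequence cost nothing, as the slide describes. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration and are not taken from the lecture notes.

```python
def semiglobal_align(a: str, b: str, match=1, mismatch=-1, gap=-2):
    """Align a and b with free end gaps; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j]: best score for a[:i] vs b[:j]. Row 0 and column 0 stay 0,
    # which makes gaps before either sequence free.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two letters
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Free gaps after either sequence: take the best cell on the last row/column.
    candidates = [(score[n][j], n, j) for j in range(m + 1)]
    candidates += [(score[i][m], i, m) for i in range(n + 1)]
    best, i, j = max(candidates)
    tail_a = a[i:] + "-" * (m - j)
    tail_b = "-" * (n - i) + b[j:]
    # Trace back until one sequence is exhausted.
    row_a, row_b = [], []
    while i > 0 and j > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    head_a = "-" * j + a[:i]   # unpenalised leading overhang
    head_b = b[:j] + "-" * i
    return (best,
            head_a + "".join(reversed(row_a)) + tail_a,
            head_b + "".join(reversed(row_b)) + tail_b)

print(semiglobal_align("ttacgtcc", "acgt"))  # (4, 'ttacgtcc', '--acgt--')
```

The table has one cell per pair of prefixes, so time and memory grow with the product of the two sequence lengths.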

23 Bioinformatics — Sequence Analysis Multiple (global) sequence alignment  Also dynamic programming (but can’t scale up!)

24 Bioinformatics — Sequence Analysis Multiple local sequence alignment  i.e. Motif (pattern) discovery >seq1 acatggccgatcagctggtttttgtgtgcctgtttctgaatc >seq2 ttctattttacgtaaatcagcttgaacatgtacctactggtg >seq3 atgcacctttgatcaataccagctagacaaacgtgtgttg >seq4 agtccaaagatcagggctggctgaatactggatcagct >seq5 cagctacagggcatataaaggggcaaggcacagactc Such overrepresented patterns are often important components (e.g. TFBSs if the sequences are promoters of similar genes). TFBSs are the controlling key holes in gene regulation!
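
One very simple way to look for overrepresented patterns is to count, for every substring of length k, how many of the sequences contain it. The sketch below applies this to the five sequences on the slide. Exact k-mer counting and the choice k = 5 are simplifying assumptions; practical motif finders tolerate mismatches.

```python
from collections import Counter

def kmer_support(sequences, k):
    """For each k-mer, count the number of sequences containing it at least once."""
    support = Counter()
    for seq in sequences:
        # A set, so a k-mer repeated within one sequence is counted once.
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return support

seqs = [
    "acatggccgatcagctggtttttgtgtgcctgtttctgaatc",  # seq1
    "ttctattttacgtaaatcagcttgaacatgtacctactggtg",  # seq2
    "atgcacctttgatcaataccagctagacaaacgtgtgttg",    # seq3
    "agtccaaagatcagggctggctgaatactggatcagct",      # seq4
    "cagctacagggcatataaaggggcaaggcacagactc",       # seq5
]
support = kmer_support(seqs, 5)
in_all = sorted(kmer for kmer, count in support.items() if count == len(seqs))
print(in_all)  # the list includes 'cagct', which occurs in all five sequences
```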

25 DNA motifs Similar DNA fragments across individuals and/or species  TFBS Motifs: DNA fragments similar to “TATAA” are common in order to recruit the polymerase to initiate transcription in eukaryotes  Expensive and time-consuming to try a large set of candidates in biological experiments Transcription RNA Translation Protein TATAA TFBS (controlling) Gene (functioning) TF Transcription Factor DNA

26 Motif discovery CGATTGA f Similar controlled functions e.g. cancer gene activities Maximized TFBS Motif Discovery Motif discovery usually refers to TFBS motifs But motif is a general term meaning “pattern”: Sequence motifs, structural motifs, network motifs…

27 ChIP-Seq motif discovery Same as traditional TFBS motif discovery in principle, but the precision and scale of the input data differ  Genome-wide: tens of thousands of sequences  Short: 50-100 bp  Each sequence measured by some enrichment score (a peak)

28 Introduction ChIP-Seq technology  Peak-calling … High-resolution sequences from more direct binding evidence; The enriched regions are likely to contain motifs coupled with peak signals; genome-wide sequences; in vivo Too many sequences for old-day methods

29 Enrichment Introduction ChIP-Seq technology  Motifs? … Old-day methods reapplied

30 Phylogenetic Trees (Phylogenies) Preliminaries Distance-based methods Parsimony Methods Adapted from: Fundamental Concepts of Bioinformatics, Michael L. Raymer, Computer Science, Biomedical Sciences, Wright State University birg.cs.wright.edu/text/Tutorial.ppt

31 Phylogenetic Trees Hypothesis about the relationship between organisms Can be rooted or unrooted ABCDE AB C D E Time Root birg.cs.wright.edu/text/Tutorial.ppt

32 Tree proliferation
Species   Number of Rooted Trees               Number of Unrooted Trees
2         1                                    1
3         3                                    1
4         15                                   3
5         105                                  15
10        34,459,425                           2,027,025
15        213,458,046,767,875                  7,905,853,580,625
20        8,200,794,532,637,891,559,375        221,643,095,476,699,771,875
birg.cs.wright.edu/text/Tutorial.ppt

33 An ongoing didactic Pheneticists tend to prefer distance based metrics, as they emphasize relationships among data sets, rather than the paths they have taken to arrive at their current states. Cladists are generally more interested in evolutionary pathways, and tend to prefer more evolutionarily based approaches such as maximum parsimony. birg.cs.wright.edu/text/Tutorial.ppt

34 Parsimony methods Belong to the broader class of character-based methods of phylogenetics Emphasize simpler, and thus more likely, evolutionary pathways Enumerate all possible trees Note the number of substitution events invoked by each possible tree  Can be weighted by transition/transversion probabilities, etc. Select the most parsimonious birg.cs.wright.edu/text/Tutorial.ppt
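
To make "the number of substitution events invoked by each possible tree" concrete, here is a minimal sketch that scores one site on a fixed rooted binary tree with Fitch's algorithm. The slides do not name a scoring method, so Fitch's algorithm is an assumption, as are the taxa A to D and their nucleotides. Substitutions are unweighted here.

```python
def fitch(tree, states):
    """Return (possible_ancestral_states, minimum_substitutions) for one site.

    tree   : a leaf name (str) or a pair (left_subtree, right_subtree)
    states : dict mapping each leaf name to its observed nucleotide
    """
    if isinstance(tree, str):            # leaf: its state is known, no cost
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    shared = left_set & right_set
    if shared:                           # children can agree: no new event
        return shared, left_cost + right_cost
    # Children disagree: one substitution is needed on this branch pair.
    return left_set | right_set, left_cost + right_cost + 1

site = {"A": "G", "B": "G", "C": "T", "D": "T"}   # hypothetical data
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(fitch(tree_1, site)[1])  # 1
print(fitch(tree_2, site)[1])  # 2
```

Under this toy data the first tree needs fewer substitutions, so it is the more parsimonious of the two.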

35 Branch and Bound methods Key problem – the number of possible trees grows enormously as the number of species gets large Branch and bound – a technique that allows large numbers of candidate trees to be rapidly disregarded Requires a “good guess” at the cost of the best tree birg.cs.wright.edu/text/Tutorial.ppt

36 Parsimony – Branch and Bound Use the UPGMA tree for an initial best estimate of the minimum cost (most parsimonious) tree Use branch and bound to explore all feasible trees Replace the best estimate as better trees are found Choose the most parsimonious birg.cs.wright.edu/text/Tutorial.ppt

37 Bioinformatics — Data mining Clustering (Unsupervised learning)  Similar things go together  Similarity measure is critical  Types: Hierarchical clustering (UPGMA) Partitional clustering (K-means)
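
A small sketch of hierarchical clustering with UPGMA (average linkage): repeatedly merge the two clusters whose members are, on average, closest. The item names and the distance values are made up for illustration, and the implementation favours clarity over speed.

```python
from itertools import combinations

def average_distance(c1, c2, dist):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(dist[frozenset((x, y))] for x in c1[1] for y in c2[1])
    return total / (len(c1[1]) * len(c2[1]))

def upgma(names, dist):
    """Return the merge order as a nested tuple.

    dist maps frozenset({a, b}) to the distance between items a and b.
    Each cluster is stored as (tree, list_of_member_names).
    """
    clusters = [(name, [name]) for name in names]
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(pair[0], pair[1], dist))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]

# Hypothetical distance matrix for four items.
pairs = {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
         ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10}
dist = {frozenset(p): d for p, d in pairs.items()}
print(upgma(["A", "B", "C", "D"], dist))  # ('D', ('C', ('A', 'B')))
```

The result depends entirely on the distances supplied, which is why the slide calls the similarity measure critical.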

38 Bioinformatics — Data mining Classification (Supervised Learning)  To predict!  Pre-processing—tidy up your materials!  Feature selection—the key points to go over  Classifier—the thinking style/manner of how to combine the key points and get some answer  Training—your practice of your thinking manner with answers known  Validation—mock quiz to evaluate what you’ve learnt from the training  Testing—your examination! \\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class1.pdf Underfitting & Overfitting

39 Bioinformatics — Data mining Evaluation (scores!)  Confusion Matrix  Binary Classification Performance Evaluation Metrics  Accuracy  Sensitivity/Recall/TP Rate  Specificity/TN Rate  Precision/PPV  … \\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class3.pdf
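
The metrics listed on this slide all derive from the four cells of the confusion matrix. A minimal sketch, assuming labels are coded 1 (positive) and 0 (negative), that both classes are present, and that at least one positive prediction is made; the example labels are invented:

```python
def binary_metrics(y_true, y_pred):
    """Compute common binary classification metrics from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # recall, TP rate
        "specificity": tn / (tn + fp),   # TN rate
        "precision": tp / (tp + fp),     # PPV
    }

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(binary_metrics(y_true, y_pred))
```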

40 Bioinformatics — Data mining Evaluation  ROC (Receiver Operating Characteristics)  Trade-off between positive hits (TP) and false alarms (FP)
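
An ROC curve can be traced by sorting predictions by score and lowering the decision threshold one example at a time. The sketch below assumes 0/1 labels with both classes present and no tied scores; the labels and scores are made up.

```python
def roc_points(y_true, scores):
    """Return the (false positive rate, true positive rate) points of the curve."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Highest score first: each step accepts one more example as "positive".
    for _, label in sorted(zip(scores, y_true), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(y_true, scores)
print(curve[-1])                # (1.0, 1.0)
print(area_under_curve(curve))  # about 0.889 (8/9)
```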

41 Statistical Tests Many different kinds of tests You should choose the appropriate ones

42 Where to get data Databases  Transfac—TF and TFBS sequence data  Protein Data Bank—protein and protein-DNA, protein-ligand complexes 3D structures (sequences and atoms included as well)  There are thousands more… find the ones that fit your topic

43 Where to get data Typical format:  tags + descriptions in plain text

44 Where to get data We have to parse and pre-process data before using  Tedious and time-consuming process  Some packages can help accelerate this: BioPerl, BioJava, BioPython…  Besides data, sometimes evaluation has to be done with literature evidence (manual!)
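
As an illustration of the parsing step, here is a minimal hand-written parser for "tag + description" lines like the FT annotation lines shown on slide 13. The exact spacing of `SAMPLE` is an assumption modelled on that slide, and the function is a sketch rather than a replacement for the packages named above.

```python
import re

# Assumed layout, modelled on the FT lines of slide 13.
SAMPLE = '''\
FT   intron          <1..28
FT                   /gene="CREB"
FT                   /number=3
FT   exon            29..174
FT                   /gene="CREB"
FT                   /number=4
'''

def parse_features(text):
    """Turn FT lines into a list of dicts with type, start, end and qualifiers."""
    features = []
    for line in text.splitlines():
        if not line.startswith("FT"):
            continue                      # ignore other tags in this sketch
        body = line[2:].strip()
        if body.startswith("/"):          # qualifier for the latest feature
            key, _, value = body[1:].partition("=")
            features[-1]["qualifiers"][key] = value.strip('"')
        else:                             # new feature: "<type> <location>"
            kind, location = body.split()
            start, end = (int(x) for x in re.findall(r"\d+", location))
            features.append({"type": kind, "start": start, "end": end,
                             "qualifiers": {}})
    return features

for feature in parse_features(SAMPLE):
    print(feature)
```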

45 Where to get papers (published) A difficult question…  Your research quality, your writing and organization, plus some luck…  Know yourself and know the other side (知己知彼): learn from the published papers and compare your research topic and level to them Where to find papers to read  Play on the CS side: IEEE Transactions, ACM Transactions, IEEE and ACM top conferences  Play on the Bioinformatics side: Bioinformatics, BMC Bioinformatics, Nucleic Acids Research, PLoS Computational Biology…  Aim high: Nature (series), Science, PNAS, Cell, …

46 Roadmap

47 Not The End Your corresponding tutor will have more project-specific stuff to tell you Thanks Q & A

