STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.


Outline
Course information
Computational biology problems, which revolve around the Central Dogma of Molecular Biology
Course structure (syllabus)
Q&A

STAT115 Lectures
Instructors:
– Jun Liu
– Xiaole Shirley Liu
Lecture: Tuesdays and Thursdays 11:30-1
– NWB, B-108 (Cambridge); Kresge 213 (Boston)
– Selected lecture notes available online after lecture
Office hours
– J Liu: Tu 1-3 PM, SC 715
– XS Liu: Thu 2-4 PM, CLSB (3 Blackfan Circle) 11022, Boston

STAT115 Labs and Web
Teaching Fellows:
– Alejandro Zarat
– Daniel Fernandes
– Lab in Science Center FL 418D, Harvard Yard, W 6-8 pm (Google Maps link in the course syllabus)
Course website:
Lecture notes (also on the course website):

STAT115 Recommended Texts

STAT115 Grading
Homework: 80 pts
– 6 HW, 14*5+10=80 pts in total
– Problems to be solved by hand, running some software online to obtain results, and some coding (Python and R)
– 6 total late days, <= 3 days for a single HW
Quiz at selected lectures: 2*10=20 pts
– 10 highest normalized scores, 2 pts each
– All short answers, true/false, multiple choice

Genome and gene

Nucleic acid and proteins

RPS6 (ribosomal protein S6) gene
The information in a gene is encoded by its DNA sequence

  1 cctcttttcc gtggcgcctc ggaggcgttc agctgcttca agatgaagct gaacatctcc
 61 ttcccagcca ctggctgcca gaaactcatt gaagtggacg atgaacgcaa acttcgtact
121 ttctatgaga agcgtatggc cacagaagtt gctgctgacg ctctgggtga agaatggaag
181 ggttatgtgg tccgaatcag tggtgggaac gacaaacaag gtttccccat gaagcagggt
241 gtcttgaccc atggccgtgt ccgcctgcta ctgagtaagg ggcattcctg ttacagacca
301 aggagaactg gagaaagaaa gagaaaatca gttcgtggtt gcattgtgga tgcaaatctg
361 agcgttctca acttggttat tgtaaaaaaa ggagagaagg atattcctgg actgactgat
421 actacagtgc ctcgccgcct gggccccaaa agagctagca gaatccgcaa acttttcaat
481 ctctctaaag aagatgatgt ccgccagtat gttgtaagaa agcccttaaa taaagaaggt
541 aagaaaccta ggaccaaagc acccaagatt cagcgtcttg ttactccacg tgtcctgcag
601 cacaaacggc ggcgtattgc tctgaagaag cagcgtacca agaaaaataa agaagaggct
661 gcagaatatg ctaaactttt ggccaagaga atgaaggagg ctaaggagaa gcgccaggaa
721 caaattgcga agagacgcag actttcctct ctgcgagctt ctacttctaa gtctgaatcc
781 agtcagaaat aagatttttt gagtaacaaa taaataagat cagactctg

RPS6 (ribosomal protein S6) protein sequence
The structure of a protein is encoded by its amino acid sequence

  1 mklnisfpat gcqklievdd erklrtfyek rmatevaada lgeewkgyvv risggndkqg
 61 fpmkqgvlth grvrlllskg hscyrprrtg erkrksvrgc ivdanlsvln lvivkkgekd
121 ipgltdttvp rrlgpkrasr irklfnlske ddvrqyvvrk plnkegkkpr tkapkiqrlv
181 tprvlqhkrr rialkkqrtk knkeeaaeya kllakrmkea kekrqeqiak rrrlsslras
241 tsksessqk
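The two RPS6 listings are linked by the genetic code: reading the DNA three bases at a time from the start codon gives the protein. The sketch below is an illustration, not part of the original slides. It builds the standard codon table and translates the first 80 bases of the RPS6 listing. The function name `translate_from_first_atg` is hypothetical, and starting at the first ATG is a simplifying assumption that happens to hold for this sequence; real gene prediction is more involved.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_from_first_atg(seq):
    """Translate from the first ATG until a stop codon or the end of seq.

    Assumption for this sketch: the first ATG is the true start codon.
    Whitespace is ignored and a trailing incomplete codon is dropped.
    """
    dna = "".join(seq.upper().split())
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


# First 80 bases of the RPS6 sequence shown on the slide.
rps6_start = (
    "cctcttttcc gtggcgcctc ggaggcgttc agctgcttca "
    "agatgaagct gaacatctcc ttcccagcca ctggctgcca"
)
print(translate_from_first_atg(rps6_start))
```

The printed peptide can be compared by eye with the beginning of the protein listing (mklnisfpat gc...).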

Nucleotide codes

DNA is built from nucleotides
The Four Nucleosides of DNA: dA, dG, dC, dT
A nucleoside is a sugar, here deoxyribose, plus a base (dA = deoxyadenosine, etc.)
Purines: dA, dG. Pyrimidines: dC, dT.

Structure of DNA: Double helix

Base Pairing
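Base pairing (A with T, G with C) means one strand of the double helix determines the other. As a small illustration that is not in the original slides, the sketch below computes the reverse complement of a strand; the function name `reverse_complement` is an assumption made for this example.

```python
# Watson-Crick pairs: A-T and G-C (upper and lower case handled).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


# First ten bases of the RPS6 sequence from the earlier slide.
print(reverse_complement("cctcttttcc"))
```

Applying the function twice returns the original strand, which is a quick sanity check on any complement table.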

A nucleotide is a phosphate, a sugar, and a purine or a pyrimidine base. The monomeric units of nucleic acids are called nucleotides.

Amino acid codes
Proteins are built from amino acids

/proteins/peptidebond.html

The diversity of protein structure

Anfinsen 1961 ribonuclease re-naturing experiments: Sequence determines structure

Central Dogma of Molecular Biology
DNA → DNA: replication
DNA → RNA: transcription
RNA → Protein: translation
Protein: folded with function → physiology

Central Dogma of Molecular Biology
DNA → RNA → Protein
Genome sequencing, assembly and annotation
– Sequence alignment (pairwise & multiple)
– Gene prediction
Genome variation:
– Single base difference (SNP) and big copy number duplication / deletions
– Association studies
Comparative genomics and phylogenies

Case Study I: The Human Genome Race
Human Genome Project:
– Originally
– Boosted by technology improvement (automation improved throughput and quality with reduced cost)
– Competition from Celera
Informatics essential for both the public and private sequencing efforts
– Sequence assembly and gene prediction
– Working draft finished simultaneously in spring 2000

Competing Sequencing Strategies
Clone-by-clone and whole-genome shotgun

Retail DNA Test
TIME's Best Inventions (2008)
“Your genome used to be a closed book. Now a simple, affordable (399 USD) test can shed new light on everything from your intelligence to your biggest health risks. Say hello to your dna - if you dare” -- time.com

1000 Genome Project
Sequencing the genomes of at least a thousand people from around the world to create the most detailed and medically useful picture to date of human genetic variation

Central Dogma of Molecular Biology
DNA → RNA → Protein
RNA structure prediction
Differential gene expression:
– Gene expression microarray and analysis, normalization, clustering, gene ontology and classification
Transcription regulation
– Transcription factor motif finding, epigenetic regulation, transcription regulatory network
Post-transcriptional regulation: mi/siRNA

Case Study II: Cancer Classifications Using Microarrays
Microarray contains hundreds to millions of tiny probes
Simultaneously detect how much each gene is “on”
Cancer type classification
– AML: acute myeloid leukemia
– ALL: acute lymphoblastic leukemia
– Check multiple samples of each type on microarrays
– Find good gene markers
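One simple way to "find good gene markers" is to score each gene by how well its expression separates the two classes. The sketch below ranks genes by a signal-to-noise score, (mean_ALL - mean_AML) / (sd_ALL + sd_AML), a statistic of the kind associated with the Golub et al. analysis. The expression values, gene names, and function names are invented for illustration, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def signal_to_noise(class_a, class_b):
    """Difference of class means scaled by the sum of class standard deviations."""
    return (mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))


def rank_marker_genes(expression, labels, class_a="ALL", class_b="AML"):
    """Return (gene, score) pairs sorted by decreasing absolute score."""
    scores = {}
    for gene, values in expression.items():
        a = [v for v, lab in zip(values, labels) if lab == class_a]
        b = [v for v, lab in zip(values, labels) if lab == class_b]
        scores[gene] = signal_to_noise(a, b)
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy data (made up): three samples of each leukemia type, three genes.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expression = {
    "geneA": [9.0, 10.0, 11.0, 1.0, 2.0, 3.0],  # high in ALL
    "geneB": [5.0, 6.0, 7.0, 5.5, 6.5, 6.0],    # uninformative
    "geneC": [2.0, 3.0, 4.0, 8.0, 9.0, 10.0],   # high in AML
}

for gene, score in rank_marker_genes(expression, labels):
    print(gene, round(score, 2))
```

Genes with large positive scores are candidate ALL markers and genes with large negative scores are candidate AML markers. With thousands of genes, the top scores would still need a multiple-testing or permutation check before being trusted.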

ALL vs AML
Golub et al, Science 1999.

Central Dogma of Molecular Biology
DNA → RNA → Protein
Protein sequence motifs
Protein structure prediction
Mass spectrometry proteomics
Protein interaction networks

Case Study III: Is Tamiflu for you?
Roche’s Oseltamivir (Tamiflu) is the only orally administered drug available for avian influenza (bird flu)
75 pediatric severe adverse events
– Fatalities, neuropsychiatric, and skin
– 69 in Japan
Inhibits neuraminidase of flu
– The structure of its active site is homologous to human sialidases (HsNEU2)
– An Asian-specific SNP (~10%) changes R41 to Q

Is Tamiflu for you?
Tamiflu binds to R41Q much more strongly
– Molecular simulations
– Decreased sialidase activity → severe side effect
– Li et al, Cell Res, 2007

Study of HIV drug resistance
Protease Inhibitors (PIs) target the HIV-1 protease enzyme, which is responsible for the posttranslational processing of the viral gag- and gag-pol-encoded polyproteins to yield the structural proteins and enzymes of the virus.

Data: can we detect drug resistance mutations?
Protease sequences from treated patients (949 cases)
VVTIRIGGQLKEALLDTGAD
IVTIRIGGQLKEALLDTGAD
RVTIRIGGQLREALLDTGAD
Sequences from untreated patients (4146 controls)
LVTIRIGGQLREALLDTGAD
IVTIRIGGQLKEALLDTGAD
LVTIRIGGQLKEALLDTGAD
Which ones contribute to drug resistance?
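A first pass at this question is to compare, position by position, how often treated and untreated sequences differ from the common untreated residue. The sketch below does this with a two-sided Fisher exact test on the six example fragments from the slide. This is a deliberately simple stand-in and not the method of the study discussed in these slides. All function names are hypothetical, and six sequences are far too few for any real conclusion.

```python
from collections import Counter
from math import comb


def variable_positions(sequences):
    """1-based positions where the aligned sequences do not all agree."""
    return [i + 1 for i, col in enumerate(zip(*sequences)) if len(set(col)) > 1]


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    p_obs = prob(a)
    # Sum all tables at least as unlikely as the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def scan_positions(treated, untreated):
    """Map each variable position to (reference residue, p-value)."""
    results = {}
    for pos in variable_positions(treated + untreated):
        # Reference = most common residue among untreated sequences.
        reference = Counter(s[pos - 1] for s in untreated).most_common(1)[0][0]
        a = sum(s[pos - 1] != reference for s in treated)    # treated, mutated
        c = sum(s[pos - 1] != reference for s in untreated)  # untreated, mutated
        p = fisher_exact_two_sided(a, len(treated) - a, c, len(untreated) - c)
        results[pos] = (reference, p)
    return results


treated = ["VVTIRIGGQLKEALLDTGAD", "IVTIRIGGQLKEALLDTGAD", "RVTIRIGGQLREALLDTGAD"]
untreated = ["LVTIRIGGQLREALLDTGAD", "IVTIRIGGQLKEALLDTGAD", "LVTIRIGGQLKEALLDTGAD"]

for pos, (ref, p) in scan_positions(treated, untreated).items():
    print(f"position {pos}: reference {ref}, p = {p:.2f}")
```

Testing one position at a time ignores interactions between mutations, so a per-position scan like this can only be a starting point.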

Drug resistance mutations
The IAS-USA Drug Resistance Mutations list in HIV-1, updated in Fall 2006
For IDV, mutations on the list are 10, 20, 24, 32, 36, 46, 54, 71, 73, 77, 82, 84, 90
The ones we detect: 10, 24, 32, 46, 54, 71, 73, 82, 90

Interactions
What is known:
– The occurrence of changes at L10, L24, M46, I54, A71, V82, I84, L90 was highly significantly correlated with phenotypic resistance.
– Minor mutations influence drug resistance only in combination with other mutations, e.g. 32+47, 84+90, 88+90
Our results are consistent with the above.
The story about the mutation combination {46,54,82}
– Conditional independence: 46 – 82 – 54.
– A single mutation at 54 has no effect
– The V82A mutation is the key; without it the others have small effect

Zhang et al. (2010, PNAS)

Human genome sequencing
Human Genome Project: 13 years, $3 billion, 6 countries, thousands of researchers and technicians
2011: 4 genomes in 8 days, costing $3000 each
In 2-3 years: each genome in 1-2 days for hundreds of dollars, with huge data
Bioinformatics: turn data to knowledge

Gene expression microarrays
In the 90s: gene chip, $2000/sample
2011: chips for multiple copies of 1000 genes, $5-10/sample
Using a computational approach to infer the expression of ~20K genes from the observed expression of the 1000 genes
Used for medical diagnosis, large-scale drug target screening

Statistics?

Quotes
True logic of this world is in the calculus of probabilities --- J. C. Maxwell
What we see is the solution to a computational problem, our brains compute the most likely causes from the photon absorptions within our eyes --- H. Helmholtz

Beauty, Mathematics, Statistics, and Science
Statistics: the only systematic way (that I know of) to connect mathematics with ordinary life activities
Focus: studying and quantifying uncertainty; optimally extracting information; prediction
Models: All models are wrong, but
– Even those imperfect ones are very useful!
– Used as a powerful mathematical framework for organizing our thoughts and integrating information
Mathematicians and physicists take care of the “beauty-only” part, and we take care of the rest

Recent Success Stories
Mapping disease genes: genetics and genomics
Random walk, Markov, page rank and Jim Simons making many billions of $$$
Compressive sensing, sparsity, random matrix and …
Obama

Two schools of thought in statistics
Bayesian: using probability distribution as a direct measure of uncertainty
– Bayes Theorem: P(θ | data) = P(data | θ) P(θ) / P(data)
Frequentist: embedding the observed event in a sequence of “imaginary replications”, like a false positive / false negative evaluation
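The two views can be seen side by side in a diagnostic-test setting. Sensitivity and specificity are frequentist error rates (how often the test errs over repeated use), while Bayes' theorem turns them, together with a prior, into the probability that one particular positive result is real. The sketch below is an illustration with invented numbers; the function name and parameters are assumptions, not part of the slides.

```python
def posterior_probability(prior, sensitivity, specificity):
    """P(condition | positive test) by Bayes' theorem.

    prior       = P(condition)
    sensitivity = P(positive | condition)     = 1 - false negative rate
    specificity = P(negative | no condition)  = 1 - false positive rate
    """
    true_positive = sensitivity * prior
    false_positive = (1.0 - specificity) * (1.0 - prior)
    return true_positive / (true_positive + false_positive)


# Invented example: 1% prevalence, 99% sensitivity, 95% specificity.
print(posterior_probability(prior=0.01, sensitivity=0.99, specificity=0.95))
```

Even with a fairly accurate test, a low prior keeps the posterior modest, because false positives from the large unaffected group outnumber true positives.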

Q&A
Is this course for me?
– Upper undergraduate and entry graduate students interested in computational biology
Do I have the background?
– Biology knowledge is easy to accumulate
– Statistics: basic stats tests, probability, some linear algebra helps
– Programming: prior programming helps, although good logic and willingness to learn and work for it are more important

Q&A
STAT115 or STAT215?
– STAT215 if:
– You want to work on an exploratory research problem (either from the professors or on your own)
– You have better coding skills

All biology is becoming computational, much the same way it has become molecular … Otherwise “low input, high throughput and no output science” --- Sydney Brenner, 2002 Nobel Prize