STAT115 STAT215 BIO512 BIST298 Introduction to Computational Biology and Bioinformatics Spring 2015 Xiaole Shirley Liu Please Fill Out Student Sign In Sheet
Bioinformatics and Computational Biology Interdisciplinary Statistics, Biology, Computer Science Applied From freshman to postdocs Useful training for many The more you practice, the better you get Moves with technology development
The Protein Sequence and Structure Wave 1955: Sanger sequenced bovine insulin 1971: PDB 1981: Smith-Waterman algorithm 1990: BLAST 1994: BLOCKS database 1994-: CASP 1997-: Proteomics
The Microarray Wave Microarray contains hundreds to millions of tiny probes Simultaneously detect how much each gene is expressed
ALL vs AML Golub et al, Science 1999.
ALL vs AML
“Microarrays” Today Infer the expression value of all the genes from 1000 probes High throughput drug screen
The DNA Sequencing Wave 1953: DNA structure 1972: Recombinant DNA 1977: Sanger sequencing 1985: PCR 1988: NCBI 1990: BLAST
Sequencing in the 1970s
The Human Genome Race Human Genome Project: 1990-2003 Originally 1990-2005 Boosted by technology improvement and automation Competition from Celera
Human Genome Sequencing Clone-by-clone and whole-genome shotgun
The Human Genome Race Human Genome Project: 1990-2003 Originally 1990-2005 Boosted by technology improvement and automation Competition from Celera Informatics essential for both the public and private sequencing efforts: sequence assembly and gene prediction Working drafts from both efforts announced together in June 2000
Sequencing in 2001
Sequencing in 2007
Sequencing Today Personal genome sequencing HiSeq X: ~900 Gb of data per flow cell in < 3 days, enough for 10 human genomes at 30X, at ~$1.5-2K per sample
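As a rough check on the flow cell figures, the sketch below works out how many 30X genomes one flow cell can hold. It assumes that "900GB" means about 900 gigabases of sequence and that a haploid human genome is roughly 3 gigabases. Both values, and the function name `genomes_per_flow_cell`, are illustrative assumptions rather than anything stated on the slide.

```python
# Rough coverage arithmetic for one sequencing flow cell.
# Assumptions (not from the slide): "900GB" is read as ~900 gigabases,
# and the haploid human genome is taken as ~3 gigabases.

def genomes_per_flow_cell(yield_bases: float, genome_size: float, coverage: float) -> float:
    """Number of genomes that can be sequenced at the given mean coverage."""
    return yield_bases / (genome_size * coverage)

FLOW_CELL_YIELD = 900e9   # bases per flow cell (assumed reading of "900GB")
HUMAN_GENOME = 3e9        # bases, approximate haploid genome size
TARGET_COVERAGE = 30      # 30X mean depth

n_genomes = genomes_per_flow_cell(FLOW_CELL_YIELD, HUMAN_GENOME, TARGET_COVERAGE)
print(f"{n_genomes:.0f} genomes at {TARGET_COVERAGE}X per flow cell")
```

Under these assumptions the result agrees with the "10 * 30X human genomes" figure: mean coverage is simply total sequenced bases divided by genome size.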
Personalized Disease Susceptibility Test and Treatment
Big Data Challenges
All biology is becoming computational, much the same way it has become molecular … Otherwise “low input, high throughput and no output science” --- Sydney Brenner, 2002 Nobel Prize
Class Information Course website: http://stat115.org/ Video recording / slides online Office hours, auditing Background: CS, Stats, Biology Roughly 3 modules (2 HW each): Transcriptome (microarrays and RNA-seq); Gene regulation (transcriptional & epigenetic regulation); Human genetics and disease (GWAS / cancer)
Class Information Teaching Fellows: Yang Li, Stephanie Chan Labs: Wed 6-8pm, Science Center B09; Tue 6-8pm, HSPH Kresge 209, Boston First Lab: Fri 1/30 3-5pm (Odyssey)!
HW and Grading Discussion forum: stat115.slack.com Submission email: harvard.stat115@gmail.com HW: 6 * 10 or 6 * 12 Final exams: 20 Class participation: 20 Algorithm videos: 5 Lecture notes: extra 5 points Late days
Gene Expression Microarrays
Expression Microarrays Grow cells under a given condition, collect the mRNA population, and label it. A microarray has a high density (thousands to millions) of sequence-specific probes, each at a known location, for each gene/RNA. The sample hybridizes to the microarray probes by DNA base pairing (A-T, G-C); non-specific binding is washed away. Measure each mRNA's abundance from the labeled signal at its probe locations.
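The base-pairing step can be illustrated with a minimal sketch: a probe binds a target when it is the reverse complement of some stretch of that target. The sequences and the helper names `reverse_complement` and `probe_binds` are hypothetical and chosen only for illustration. Real hybridization also tolerates partial matches and depends on temperature and salt, which this toy model ignores.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq: str) -> str:
    """Return the strand that would base-pair with `seq`."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

def probe_binds(probe: str, target: str) -> bool:
    """True if the probe is perfectly complementary to some stretch of the target."""
    return reverse_complement(probe) in target.upper()

# Hypothetical sequences, for illustration only.
target = "ATGGCCATTGTAATGGGCCGC"
probe = reverse_complement("GCCATTGTAATG")    # designed against part of the target
off_target_probe = "AAAAAAAAAAAA"             # no complementary stretch in the target

print(probe, probe_binds(probe, target))
print(off_target_probe, probe_binds(off_target_probe, target))
```

Exact substring matching keeps the idea visible: the probe sequence alone determines which transcript it reports on.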
Affymetrix GeneChip Arrays
Labeled Samples Hybridize to DNA Probes on GeneChip
Shining Laser Light Causes Tagged Fragments to Glow
Perfect Match (PM) vs MisMatch (MM) (control for cross hybridization)
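One way to see how the PM/MM pair controls for cross hybridization is the sketch below. It assumes the usual GeneChip design, in which the MM probe is a copy of the 25-mer PM probe with its middle base complemented, and it summarizes a gene by the mean of PM minus MM over its probe pairs. This "average difference" is a simple early summary, not necessarily the method used later in the course. The probe sequence, the intensities, and the function names are made up for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def mismatch_probe(pm: str) -> str:
    """Copy the PM probe but complement its middle base."""
    mid = len(pm) // 2
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]

def average_difference(pm_signals, mm_signals):
    """Mean of PM - MM over the probe pairs of one gene."""
    diffs = [pm - mm for pm, mm in zip(pm_signals, mm_signals)]
    return sum(diffs) / len(diffs)

# Hypothetical 25-mer probe and hypothetical intensities for four probe pairs.
pm_seq = "ACGTACGTACGTACGTACGTACGTA"
mm_seq = mismatch_probe(pm_seq)
pm = [1200.0, 950.0, 1430.0, 800.0]
mm = [300.0, 410.0, 380.0, 290.0]

print(pm_seq)
print(mm_seq)
print(average_difference(pm, mm))
```

The MM signal estimates how much of the PM signal comes from non-specific binding, so subtracting it leaves an estimate of the specific signal.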
NimbleGen Arrays
Agilent Arrays
Microarrays Array comparison: # probes / array, # probes / gene, probe length Flexibility vs data reuse Why do we bother learning about microarrays now? RNA-seq is probably preferred in new expression experiments The amount of useful public data The data analysis techniques