JMC CGEMS SUMMER GENOMICS TRAINING WORKSHOPS Genomics in Education Jason Williams – Education, Outreach, Training Lead, Cold Spring Harbor Laboratory williams@cshl.edu @JasonWilliamsNY
CyVerse evolution. iPlant 2008: Empowering a New Plant Biology. iPlant 2013: Cyberinfrastructure for Life Sciences (funding renewal). CyVerse 2016: Transforming Science Through Data-Driven Discovery. Timeline axis: 2006, 2010, 2015, 2017, with the public launch marked.
CyVerse vision: Transforming science through data-driven discovery. More than 40K users, petabytes of data, and hundreds of publications, courses, and discoveries.
What is Cyberinfrastructure? Data storage, software, high-performance computing, and people, organized into systems that solve problems of a size and scope that would not otherwise be solvable.
What is Cyberinfrastructure? Platforms, tools, and datasets; storage and compute; training and support.
Genomics in Education
Big data biology – Education and Research. A 100,000-fold decrease in sequencing costs; hand-held sequencers; drones; biological sensors. Biology is swimming in data. Image Credits: Genome sequencing costs: http://www.genome.gov/images/content/costpergenome2015_4.jpg; Oxford Nanopore sequencer: https://www.nanoporetech.com/; Fitbit: http://www.fitbit.com/force; Agricultural drone: http://purdue.imodules.com/s/1461/images/gid1001/editor/alumnus/2014_mar/drones_main.jpg
Big data biology – Too fast to keep up? “Essentially, all models are wrong, but some are useful” – George E.P. Box
Big data biology – Too fast to keep up? 1866: Mendel publishes work on inheritance; 1869: DNA discovered; 1915: Thomas Hunt Morgan describes linkage and recombination; 1953: structure of DNA described; 1956: human chromosome number determined; 1968: first gene mapped to an autosome; 1977: dideoxy sequencing; 1983: PCR; 1986: Human Genome Project proposed.
Big data biology – Too fast to keep up? 1993: first microRNAs described; 2003: first ‘Gold Standard’ human genome sequence; 2005: first draft of the human haplotype map (HapMap); 2007: ENCODE project.
Challenge – bringing students into the fold. Research and education: students can work with the same data at the same time and with the same tools as research scientists. How do scientists share their data and make it publicly available? How do scientists extract maximum value from the datasets they generate? How can students and educators (who will need to come to grips with data-intensive biology) be brought into the fold?
Can you navigate the tools? What are your challenges in teaching bioinformatics in the classroom?
Take the Subway
DNA Subway: Classroom-friendly bioinformatics. Faculty identified guiding requirements that shaped the development of CyVerse educational platforms: (1) Mix lecture and lab: have a wet-bench “hook”. (2) Student-scientist partnerships: someone has to care about the data. (3) Co-investigation: projects should potentially lead to publications. (4) Scale: platforms should support projects that multiple classrooms can join.
DNA Subway Red Line: Genome annotation. Analyze up to 150 kb of DNA sequence; de novo gene prediction; construct evidence-based gene models; visualize genome sequence in a browser.
DNA Subway Yellow Line: Genome prospecting. Analyze DNA or protein sequence; search plant genomes using TARGeT; explore gene duplications, transposons, and non-coding sequences not detectable in conventional BLAST searches.
DNA Subway Blue Line: DNA barcoding and phylogenetics. Analyze DNA or protein sequence.
DNA Subway Green Line: Transcriptome analysis. Examine RNA-Seq data for differential expression; use high-performance computing to analyze complete datasets; generate lists of genes and fold-changes; add results to Red Line projects.
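For classroom use, the “fold-changes” named on the Green Line slide can be illustrated with a few lines of Python. This is a minimal sketch and not the Green Line's actual pipeline: the function name, the pseudocount of 1, and the gene counts are all hypothetical, chosen only to show how a log2 fold-change is computed from replicate expression values.

```python
import math

def log2_fold_change(control, treated, pseudocount=1.0):
    """Log2 ratio of mean treated expression to mean control expression.

    The pseudocount (an assumption of this sketch) avoids dividing by zero
    when a gene has no reads in one condition.
    """
    mean_control = sum(control) / len(control)
    mean_treated = sum(treated) / len(treated)
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))

# Hypothetical normalized read counts: gene -> (control replicates, treated replicates)
counts = {
    "geneA": ([7, 7, 7], [31, 31, 31]),
    "geneB": ([15, 15], [3, 3]),
    "geneC": ([10, 10], [10, 10]),
}

# Build a list of (gene, log2 fold-change), largest absolute change first.
results = sorted(
    ((gene, log2_fold_change(ctrl, trt)) for gene, (ctrl, trt) in counts.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)

for gene, lfc in results:
    print(f"{gene}\t{lfc:+.2f}")
```

A positive value means higher expression in the treated samples, a negative value means lower expression, and a value of 2 corresponds to a four-fold increase. Real differential expression tools also test whether a change is statistically significant, which this sketch does not attempt.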
CyVerse Executive Team