Published by Natalie Holmes. Modified over 8 years ago.
1
Big Data: Why it matters
Patrice KOEHL
Department of Computer Science, Genome Center, UC Davis
2
The three I’s of Big Data
Big Data is:
- Ill-defined (what is it?)
- Immediate (we need to do something about it now)
- Intimidating (what if we don’t)
(loosely adapted from Forbes)
5
Big Data: Volume

Unit       Symbol  Size        Time scale        Example
Byte
Kilobyte   KB      1000 bytes  1 s               30 KB: one page of text
Megabyte   MB      1000 KB     20 mins           5 MB: one song
Gigabyte   GB      1000 MB     11 days           5 GB: one movie
Terabyte   TB      1000 GB     30 years          1 TB: 6 million books
Petabyte   PB      1000 TB     300 centuries     1 PB: 55 storeys of DVD
Exabyte    EB      1000 PB     30 million years  5 EB: data up to 2003
Zettabyte  ZB      1000 EB     30 billion years  1.8 ZB: data in 2011
Yottabyte  YB      1000 ZB     ….                1 YB: NSA data center
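The time scale on this slide (1 s, 20 mins, 11 days, 30 years, ...) is not explained in the transcript. A minimal sketch follows, assuming it simply maps one kilobyte to one second, so that each step of 1000 in size is a step of 1000 in time. The function names and the choice of a 365.25-day year are illustrative and not taken from the slides.

```python
# Decimal storage units from the slide; each is 1000 times the previous one.
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, an illustrative choice


def unit_size_bytes(symbol):
    """Size of one unit in bytes, e.g. 'MB' -> 1000**2."""
    return 1000 ** (UNITS.index(symbol) + 1)


def seconds_at_one_kb_per_second(symbol):
    """ASSUMPTION: the slide's time row maps 1 KB to 1 second."""
    return unit_size_bytes(symbol) // 1000


def describe(seconds):
    """Express a duration in the largest unit that gives a value >= 1."""
    for size, name in [(SECONDS_PER_YEAR, "years"), (86400, "days"), (60, "min")]:
        if seconds >= size:
            return f"{seconds / size:.1f} {name}"
    return f"{seconds:.1f} s"


if __name__ == "__main__":
    for symbol in UNITS:
        print(symbol, describe(seconds_at_one_kb_per_second(symbol)))
```

Under this assumption a megabyte corresponds to about 17 minutes, a gigabyte to about 11.6 days and a terabyte to about 32 years, which is close to the rounded values on the slide.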
6
Big Data: Volume, Velocity
One minute in the digital world:
- 3+ million searches launched
- 6 million users connected
- 1.3 million videos viewed
- 30 hours of videos uploaded
- 204 million e-mails sent
- 50 GB of data generated at the Large Hadron Collider
- 640 TB of IP data transferred
(Intel, 2013)
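To get a feel for velocity, the per-minute figures on this slide can be rescaled to per-second and per-day rates. The sketch below is only arithmetic on three of the quoted numbers; it assumes the rates stay constant over a day, which the slide does not claim, and the dictionary keys and function names are illustrative.

```python
# Per-minute figures quoted on the slide.
PER_MINUTE = {
    "searches launched": 3_000_000,      # slide says "3+ million": a lower bound
    "e-mails sent": 204_000_000,
    "TB of IP data transferred": 640,
}


def per_second(per_minute):
    """Convert a per-minute count to a per-second rate."""
    return per_minute / 60


def per_day(per_minute):
    """Convert a per-minute count to a per-day total (ASSUMES a constant rate)."""
    return per_minute * 60 * 24


if __name__ == "__main__":
    for label, value in PER_MINUTE.items():
        print(f"{label}: {per_second(value):,.1f} per second, "
              f"{per_day(value):,} per day")
```

At a constant rate, 640 TB per minute adds up to 921,600 TB per day, a little under one exabyte.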
7
Big Data: Volume, Velocity, Variety
Text, Numbers, Images, Sound
8
Big Data: Challenges
- Volume and Velocity
- Variety: Structured, Unstructured, …; Images, Sound, Numbers, Tables, …
- Security
- Reliability, Integrity, Validity
9
Big Data: Challenges
Large N: “Any dataset that is collected by a scientist whose data collection skills are far superior to her analysis skills”
Computing issues:
- Data transfer
- Scalability of algorithms
- Memory limitations
- Distributed computing
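As one small illustration of the memory limitation listed above, a statistic can often be computed in a single pass over the data without holding the dataset in memory. The sketch below uses Welford's online update for the mean and variance; this is a choice made here for illustration and not a method named in the slides, and the function name is hypothetical.

```python
def streaming_mean_variance(values):
    """Single-pass mean and population variance (Welford's online update).

    `values` can be any iterable, including a generator that reads a large
    file record by record, so memory use does not grow with the data size.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / count if count else float("nan")
    return count, mean, variance


if __name__ == "__main__":
    # A generator stands in for a data source too large to load at once.
    print(streaming_mean_variance(x for x in range(1, 1_000_001)))
```

Only three numbers are kept between records, so the same function works unchanged whether the input has five values or five billion.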
10
Big Data: Challenges
Visualization issues: the black screen problem (Matloff, 2013)
11
Big Data: Challenges and Opportunities
- Fourth Paradigm: data-driven science
- Holistic approaches to major research efforts
- New paradigms in computing
- Digital Humanities
Data -> Knowledge -> Societal Benefit
Basic -> Translational
12
Big Data Dreams: Genomics
14
Genomics: Sequencing costs
[Figures: cost per Mbase; cost per human genome. Source: http://www.genome.gov]
15
Genomics: Game changing technologies
Illumina HiSeq 2000
- Capable of 600 Gb per run -> 1,000+ Gb
- 55 Gb/day
- 6 billion paired-end reads
- <$4,000 per human/plant genome
- <$200 per transcriptome
- Multiplex 384 pathogen isolates/lane: $10 (+ $50 library construction)/isolate
Gary Schroth (Illumina): “A single lab with one HiSeq is able to generate more sequences than was in GenBank in 2009, every four days.”
Challenges: Library preparation & data analysis
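The throughput and cost figures on this slide can be combined with simple arithmetic. The sketch below assumes a steady 55 Gb/day over a 600 Gb run and reads "$10 (+ $50 library construction)/isolate" as $10 of sequencing plus $50 of library preparation per isolate; both readings are assumptions, and the constant and function names are illustrative.

```python
# Figures quoted on the slide for the Illumina HiSeq 2000.
RUN_OUTPUT_GB = 600              # gigabases per run
DAILY_OUTPUT_GB = 55             # gigabases per day
ISOLATES_PER_LANE = 384          # multiplexed pathogen isolates per lane
SEQUENCING_COST_PER_ISOLATE = 10  # USD
LIBRARY_COST_PER_ISOLATE = 50     # USD, library construction


def days_per_run():
    """Run duration, ASSUMING output accrues at a steady 55 Gb/day."""
    return RUN_OUTPUT_GB / DAILY_OUTPUT_GB


def cost_per_isolate():
    """Sequencing plus library construction for one isolate, in USD."""
    return SEQUENCING_COST_PER_ISOLATE + LIBRARY_COST_PER_ISOLATE


def lane_cost(include_library=True):
    """Cost of one fully multiplexed lane, in USD."""
    per_isolate = cost_per_isolate() if include_library else SEQUENCING_COST_PER_ISOLATE
    return ISOLATES_PER_LANE * per_isolate


if __name__ == "__main__":
    print(f"Days per run: {days_per_run():.1f}")
    print(f"Cost per isolate: ${cost_per_isolate()}")
    print(f"Lane cost, sequencing only: ${lane_cost(include_library=False):,}")
    print(f"Lane cost, with libraries: ${lane_cost():,}")
```

On these numbers library construction, not sequencing, dominates the per-isolate cost, which fits the slide's closing point that library preparation is one of the remaining challenges.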
16
Genomics @ UC Davis
Massively parallel DNA sequencing
- 2 Illumina Genome Analyzers
- 1 Illumina HiSeq 2000, 2 MiSeq
- 1 Roche 454 Junior
- 1 Pacific Biosciences RS
GoldenGate SNP genotyping
- iScan, BeadArray & BeadExpress
17
Cancer Genomics: Molecular Diagnostics
19
Genomics: actual costs
“A single lab with one HiSeq is able to generate more sequences than was in GenBank in 2009, every four days.” Gary Schroth (Illumina)
Assembling 22 Gb conifer genome:
Data:
- 16 billion paired-end reads (100 bases)
Processing:
- 10 days for error correction
- 11 days for assembling “super-reads”
- 60 days to build contigs/scaffolds
- 8 days to fill in gaps
http://www.homolog.us/blogs/2013/05/11/steven-salzberg-at-bog13-assembling-22gb-conifer-genome/
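The processing times listed for the conifer genome can be totalled to show how analysis compares with the "every four days" pace of data generation. The sketch below assumes the four stages run one after another with no overlap, which the slide does not state; the names are illustrative.

```python
# Processing stages and durations (in days) as listed on the slide.
STAGE_DAYS = {
    "error correction": 10,
    "assembling super-reads": 11,
    "building contigs/scaffolds": 60,
    "filling in gaps": 8,
}


def total_days(stages):
    """Total processing time, ASSUMING the stages run strictly in sequence."""
    return sum(stages.values())


def share_of_total(stages, name):
    """Fraction of the total time spent in one named stage."""
    return stages[name] / total_days(stages)


if __name__ == "__main__":
    print(f"Total: {total_days(STAGE_DAYS)} days")
    for name in STAGE_DAYS:
        print(f"{name}: {share_of_total(STAGE_DAYS, name):.0%}")
```

Run sequentially, the stages add up to 89 days, roughly three months of computing, with contig and scaffold building taking about two thirds of that time.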
20
Social Consequences of Commodity Sequencing
The danger of misuse
- predict sensitivities to various industrial or environmental agents
- discrimination by employers?
The impact of information that is likely to be incomplete
- an indication of a 25 percent increase in the risk of cancer?
Reversal of knowledge paradigm
Are the "products" of the Human Genome Project to be patented and commercialized?
- Myriad Genetics and BRCA1/2
How to educate about genetic research and its implications?
21
Social Consequences of Commodity Sequencing
23
How to Approach Big Data