Published by Eleanor Goodman. Modified over 8 years ago.
1
BigData@Chalmers: Big Data Analytics with Hadoop and Spark
Introduction: Fourth Paradigm
Devdatt Dubhashi, LAB (Machine Learning, Algorithms, Computational Biology), Computer Science and Engineering, Chalmers
2
Fourth Paradigm
Data-driven scientific discovery, as opposed to classical hypothesis-driven discovery.
Enabled by dramatic advances in data-generating technologies.
3
Astronomy
The Australian Square Kilometre Array Pathfinder (ASKAP) project currently acquires 7.5 terabytes/second of sample image data, a rate projected to increase 100-fold to 750 terabytes/second (~25 zettabytes per year) by 2025.
Scientific questions: What is “dark matter”? Galaxy classification and morphology.
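The yearly figure on this slide can be checked with a short back-of-the-envelope calculation. The sketch below assumes decimal units (1 ZB = 10^9 TB) and acquisition sustained around the clock for a full year; the function name is illustrative and not from the talk.

```python
# Back-of-the-envelope check of the ASKAP projection quoted on the slide.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 s, ignoring leap years
TB_PER_ZB = 10**9                      # decimal units: 1 ZB = 10^21 B, 1 TB = 10^12 B


def yearly_volume_zb(rate_tb_per_s: float) -> float:
    """Convert a sustained data rate in TB/s into a yearly volume in ZB."""
    return rate_tb_per_s * SECONDS_PER_YEAR / TB_PER_ZB


current_rate = 7.5                    # TB/s, from the slide
projected_rate = current_rate * 100   # the projected 100-fold increase

print(f"{yearly_volume_zb(current_rate):.2f} ZB/year at the current rate")
print(f"{yearly_volume_zb(projected_rate):.1f} ZB/year at the projected rate")
```

Under these assumptions the projected rate works out to roughly 23.7 ZB per year, in line with the slide's rounded "~25 zettabytes per year".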
4
Social Media: YouTube & Twitter
YouTube currently has 300 hours of video being uploaded every minute, and this could grow to 1,000–1,700 hours per minute (1–2 exabytes of video data per year) by 2025 if we extrapolate from current trends (S1 Note).
Today, Twitter generates 500 million tweets/day, each about 3 kilobytes including metadata.
Scientific question: social habits and behaviour.
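The quoted figures can be turned into volumes with simple arithmetic. This sketch assumes decimal byte units, and it pairs the low ends (1,000 hours/minute with 1 exabyte/year) and the high ends (1,700 hours/minute with 2 exabytes/year) of the YouTube ranges; that pairing and the function names are assumptions made for illustration.

```python
KB, GB, TB, EB = 10**3, 10**9, 10**12, 10**18  # decimal byte units
MINUTES_PER_YEAR = 365 * 24 * 60               # 525,600, ignoring leap years


def twitter_bytes_per_day(tweets_per_day: int, bytes_per_tweet: int) -> int:
    """Daily Twitter volume: number of tweets times size per tweet."""
    return tweets_per_day * bytes_per_tweet


def implied_gb_per_video_hour(hours_per_minute: float, exabytes_per_year: float) -> float:
    """Storage per hour of video implied by an upload rate and a yearly volume."""
    hours_per_year = hours_per_minute * MINUTES_PER_YEAR
    return exabytes_per_year * EB / hours_per_year / GB


tweets = twitter_bytes_per_day(500_000_000, 3 * KB)
print(tweets / TB, "TB of tweets per day")
print(round(implied_gb_per_video_hour(1_000, 1), 2), "GB per hour of video (low end)")
print(round(implied_gb_per_video_hour(1_700, 2), 2), "GB per hour of video (high end)")
```

So the Twitter numbers amount to about 1.5 TB per day, and the YouTube projection implies roughly 2 GB of storage per uploaded hour of video.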
7
Why language is difficult
“He sat on the river bank and counted his dough.”
“She went to the bank and took out some money.”
Lexical layer vs. concept layer: “dough” and “money” are synonymous (two words, one concept); “bank” is polysemous (one word, two concepts).
8
Word Embeddings
9
Deep Learning (Neural Networks)
Revolutionized vision and speech systems.
Dramatic improvements in image classification, approaching human-level accuracy.
Skype real-time translation from English to Chinese.
10
Word Embeddings capture meaning
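One common way to read "capture meaning" is that words with similar meanings end up with vectors pointing in similar directions, which cosine similarity measures. The sketch below uses hand-made 3-dimensional vectors invented purely for illustration; real embeddings are learned from text and have hundreds of dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, made up for this example (not taken from any trained model).
toy_vectors = {
    "money": [0.9, 0.1, 0.0],
    "dough": [0.8, 0.2, 0.1],
    "river": [0.0, 0.1, 0.9],
}


def most_similar(word, vectors):
    """Return the other word whose vector has the highest cosine similarity."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))


print(most_similar("money", toy_vectors))
```

With these toy vectors, "money" is closest to "dough" and far from "river".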
11
Word sense induction
M. Kågebäck, F. Johansson et al., “Neural context embeddings for automatic discovery of word senses”, NAACL 2015 Workshop on Vector Space Modeling for NLP.
Used an innovative clustering technique.
Exploited word and context vectors.
Ongoing work using LSTMs.
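The general idea of inducing senses by clustering context vectors can be illustrated with a toy example. The slide does not describe the paper's clustering technique, so this sketch substitutes plain 2-means with a deterministic initialization; the 2-dimensional context vectors for occurrences of "bank" are invented for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def two_means(points, iterations=10):
    """Plain 2-means; starts from the first point and the point farthest from it."""
    centers = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    labels = []
    for _ in range(iterations):
        # Assign every point to its nearest center.
        labels = [min((0, 1), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Move each center to the mean of its assigned points.
        for c in (0, 1):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = mean(members)
    return labels


# Each entry: (context of one occurrence of "bank", made-up context vector).
contexts = [
    ("took out some money", [0.9, 0.1]),
    ("opened an account", [0.8, 0.2]),
    ("sat on the river", [0.1, 0.9]),
    ("fished from the muddy", [0.2, 0.8]),
]

labels = two_means([vector for _, vector in contexts])
for (text, _), label in zip(contexts, labels):
    print(f"sense {label}: bank ... {text}")
```

Each resulting cluster plays the role of one induced sense; here the financial and the riverside contexts fall into separate clusters.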
12
Adaptive Natural Language Processing
Adaptive to the collection and to the human user.
Diagram labels: Text data; WWW (enhance background by focused crawling); Structure Discovery Algorithms (find regularities, annotate regularities in data); Adaptive Machine Learning (use annotations as features); adaptive annotation; rich representation; semantic model.
Semantic model example for “jaguar”:
Sense 1: ISA vehicle, car, brand, company; SIM mercedes, bmw, cadillac, jeep.
Sense 2: ISA animal, species, wildlife, predator; SIM tiger, leopard, lion, cougar.
13
Dealing with information overload
14
Document summarization
Word vectors + multiple kernel learning + submodular optimization.
M. Kågebäck, O. Mogren et al., “Extractive Summarization using Continuous Vector Space Models”, Workshop on Continuous Vector Space Models and their Compositionality (CVSC), EACL 2014.
O. Mogren et al., “Extractive Summarization by Aggregating Multiple Similarities”, RANLP 2015.
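The submodular-optimization ingredient can be sketched as greedy sentence selection under a coverage objective. This is not the method from the cited papers: word-vector similarities and multiple kernel learning are replaced here by simple Jaccard word overlap, the objective is a standard facility-location score, and the sentences are made up.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two sentences (stand-in for vector similarity)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def coverage(selected, sentences):
    """Facility-location score: how well the selected sentences cover all sentences."""
    if not selected:
        return 0.0
    return sum(max(jaccard(s, sentences[j]) for j in selected) for s in sentences)


def greedy_summary(sentences, budget):
    """Greedily add the sentence that raises coverage most until the budget is used."""
    selected = []
    while len(selected) < budget:
        candidates = [i for i in range(len(sentences)) if i not in selected]
        if not candidates:
            break
        best = max(candidates, key=lambda i: coverage(selected + [i], sentences))
        selected.append(best)
    return [sentences[i] for i in sorted(selected)]


docs = [
    "spark processes big data in memory",
    "spark processes data in memory quickly",
    "genomics produces petabytes of sequencing data",
    "sequencing data in genomics grows fast",
]
summary = greedy_summary(docs, budget=2)
for sentence in summary:
    print(sentence)
```

Because the coverage score is monotone and submodular, the greedy choice carries the usual (1 - 1/e) approximation guarantee; on the toy input it picks one sentence per topic rather than two near-duplicates.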
16
Genomics: Next Gen Sequencing
Next Generation Sequencing (NGS) technologies.
The Cancer Genome Atlas: one of the largest and most complete cancer genomics datasets, a petabyte in size.
Single-cell RNA sequencing (scRNA-seq).
17
Fig 1. Growth of DNA sequencing. Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical? PLoS Biol 13(7): e1002195. doi:10.1371/journal.pbio.1002195 http://journals.plos.org/plosbiology/article?id=info:doi/10.1371/journal.pbio.1002195
Genomics: the genetic basis of diseases like cancer.
18
Table 1. Four domains of Big Data in 2025. Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical? PLoS Biol 13(7): e1002195. doi:10.1371/journal.pbio.1002195 http://journals.plos.org/plosbiology/article?id=info:doi/10.1371/journal.pbio.1002195
19
Exercise (for Application Folks)
What is your field and scientific question?
Can “Big Data” help you? If so, how?
What are the data acquisition technologies? Data sources?
What is the nature of the data? Structured or unstructured?
What is the size of your data (ballpark)?
What kind of analytics on the data would be useful?
20
Exercise (for Methods Folks)
What kind of methods do you develop? Algorithms, systems, hardware?
What application domains do you target?
What size of data have you worked with, and what size do you anticipate in the future?
In what way do you think Spark analytics could help your methods?