1
Big Data: A Quick Review of Analytical Tools
Jie Ding, Oct. 2, 2017 (Note: this review covers technical tools, not methodology)
2
Languages: SAS, R, Python
  R
    Designed by statisticians
    Good at statistical modeling because of its abundant packages
    Not good at computational efficiency, as it deviates a lot from classical CS-type languages
  Python
    As flexible as Matlab and C++
    Favored by data analysts in industry
    Not good at analytics
  SAS
    Classical in data analysis
    Commercial software, especially good at computing outside internal memory
    Not good at parallel computing and graphics
3
Database
  Relational database
    (commercial) Oracle, MS SQL Server, DB2, Teradata
    (open source) MySQL, MariaDB
  NoSQL database (triggered by the needs of Web 2.0 companies)
    MongoDB
    Cassandra
    Redis
  Hadoop database
    HBase
    Hive
4
Big Data Infrastructure
  Server
    PC servers based on the x86 architecture are the most popular
    Customized chips seem to be the future trend?
  Software
    Open source OS such as CentOS, Ubuntu
    CPU-based
      Hadoop + Hive as database, and MapReduce as data-analytical framework
      Spark + SparkR for real-time data analysis
    GPU-based
      CUDA
      TensorFlow as a third-party framework for deep learning
5
Hadoop
  A framework that allows for the distributed processing of large data sets across clusters of computers
  Designed to scale up from single servers to thousands of machines
Docker
  An open-source project that automates the deployment of code inside software containers, for easy setup
R-Hadoop
  R-based tools
Cloudera
  Commercialized Hadoop service provider (2008)
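The MapReduce model that Hadoop distributes across a cluster can be sketched on a single machine. The following is a minimal illustration in plain Python, not Hadoop's own API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample documents are assumptions made up for this example, and the map calls run sequentially here, whereas a cluster would run them on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(documents):
    # Each document is mapped independently, which is the property that
    # lets a framework spread the map calls over many machines.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


# Hypothetical input, invented for this sketch.
docs = ["big data tools", "big data infrastructure", "data analysis"]
counts = word_count(docs)
print(counts)
```

Because the map step touches one document at a time and the reduce step touches one key at a time, neither step needs to see the whole data set, which is what makes the pattern easy to scale out.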
6
Parallel Computing
  A type of computation in which many calculations or the execution of processes are carried out simultaneously
  Explicit parallelism
    User-specified; usually requires adapting the algorithm
    Under some frameworks (e.g. MapReduce) it is easy to adapt the algorithm
  Implicit parallelism
    Parallelism automatically implemented by the system
    E.g. “foreach” in R and “parfor” in Matlab
  Parallel computing using R
    Communication using sockets or MPI
    Using a platform such as Hadoop or Spark
    GPU-based
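Explicit parallelism, where the user decides how the work is split, might look like the sketch below. It uses Python's standard `concurrent.futures` module rather than R or Matlab; the helper names (`split`, `sum_of_squares`, `parallel_sum_of_squares`) and the choice of four workers are assumptions for illustration only. A thread pool is the default here so the example runs anywhere; for CPU-bound work in CPython, passing `ProcessPoolExecutor` as `executor_cls` is the usual way to get true parallel execution.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(chunk):
    """Work done independently on one chunk of the data."""
    return sum(x * x for x in chunk)


def split(data, n_chunks):
    """Split data into n_chunks contiguous pieces of nearly equal size."""
    size, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def parallel_sum_of_squares(data, n_workers=4, executor_cls=ThreadPoolExecutor):
    # The user adapts the algorithm explicitly: split, compute partial
    # results on each worker, then combine the partial results.
    chunks = split(data, n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)


data = list(range(1, 101))
total = parallel_sum_of_squares(data)
print(total)
```

The split and combine steps are the part the user has to design by hand; implicit constructs such as “foreach” or “parfor” hide exactly these steps.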
7
Artificial Intelligence
  “Design machines that behave like humans”
  Neural network = Deep learning ⊂ Machine learning ⊂ Applied statistics ⊂ Artificial intelligence
  Biological neuron model
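The artificial counterpart of the biological neuron model is commonly written as a weighted sum of inputs passed through an activation function. The sketch below is a generic textbook formulation, not something taken from the slides: the sigmoid activation, the function names, and the example weights are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs plus bias, then activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


# Hypothetical inputs and weights, picked so the weighted sum is exactly zero.
output = neuron(inputs=[1.0, 0.0], weights=[2.0, -1.0], bias=-2.0)
print(output)
```

A neural network stacks many such units in layers, and training adjusts the weights and biases.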
8
Deep Learning Infrastructure
  Theano
    Python package for deep learning
    Good at symbolic computation
    Not good at debugging and performance
  TensorFlow
    Deep learning library developed by Google and released in November 2015
    Python and C++ APIs
    Plans to add a Java API
    RStudio released an R API
  Customized chips
9
Visualization
  The combination of science and art
  Concise
  Interactive
10
Visualization
  Open source tools
    (Baidu)
    (Google)
    (JavaScript based)
  UI/UX design tool in R
    Shiny: a product of RStudio, a web-based interface
  HTML 5
    The newest markup language used for structuring and presenting content on the web
11
Text Analysis
  Extract useful information from text data
  NLP: the intersection of linguistics and machine learning
    Language understanding
    Language generation
    Translation
    Dialog system
  Word2Vec
    Released by Google in 2013
    Represents a word with a vector
    Reduces storage and computation
    R package: wordVectors
  RNN
    Recurrent neural network (recurrent structure along time)
    Recursive neural network (recurrent structure along space)
    LSTM: long short-term memory
    Voice recognition, modeling, translation, image description
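What "represent a word with a vector" buys in practice can be illustrated with a toy lookup table. The sketch below does not train Word2Vec and does not use the wordVectors package; the three-dimensional vectors are hand-picked, hypothetical values, and the function names are assumptions for this example. It only shows how similarity between words reduces to arithmetic on short dense vectors, whose length stays fixed however large the vocabulary grows.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration.
# Real Word2Vec vectors are learned from text and typically have
# a few hundred dimensions.
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; close to 1 means similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, table):
    """Return the other word in the table whose vector is closest to `word`."""
    return max(
        (other for other in table if other != word),
        key=lambda other: cosine_similarity(table[word], table[other]),
    )


print(most_similar("king", embeddings))
print(cosine_similarity(embeddings["king"], embeddings["apple"]))
```

With these made-up values, "king" sits much closer to "queen" than to "apple", which is the kind of structure learned embeddings are meant to capture.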
12
Discussions
  Where Do We Come From?
  Where Are We Going?