CS 292: Special Topics on Big Data
Yuan Xue
It All Starts with Data
- Big data: a growing torrent
Source: "Big data: The next frontier for innovation, competition, and productivity", McKinsey & Company
What is Big Data?
- Volume: size of the data
- Velocity: latency of data processing relative to the growing demand for interactivity
- Variety: diversity of sources, formats, quality, structures
- Veracity: uncertainty, imprecision of data
Source: "Big data: The next frontier for innovation, competition, and productivity", McKinsey & Company
Put Data To Use
- Help domain scientists achieve new discoveries
- Help companies provide better services
- Help governments become more efficient
- And more...
The transformative potential of big data in five domains:
  3a. Health care (United States)
  3b. Public sector administration (European Union)
  3c. Retail (United States)
  3d. Manufacturing (global)
  3e. Personal location data (global)
Source: "Big data: The next frontier for innovation, competition, and productivity", McKinsey & Company
Need computer scientists and engineers to help manage the data
Data Management and Analytics
- Data management (data engineering)
  - Storage, access, manipulation, integration
  - Real-time update and access
  - Ad hoc query
  - Batch processing
  - Distributed system design
- Data analytics (data science)
  - Extraction of knowledge from data
  - Automatic, semi-automatic
  - Structured, unstructured
  - Statistical estimation and prediction
  - Machine learning, data mining
  - Visualization and communication
[Diagram labels: Data, Data Management, Data Analysis, "support"]
This Course
- Learn how to use data management systems
- Understand how to build scalable data management systems
- Hands-on: learn interesting facts from data
[Diagram labels: Data, Data Management, Data Analysis, "support"]
This Course: Along Multiple Dimensions
- From small to big (in scale)
  - SQL to NoSQL
- From simple to complex (in data modeling)
  - Key-value
  - Column family
  - Document
  - Graph? (no plan to cover for now)
- From disk to in-memory
  - Redis, Memcached
  - MapReduce, Spark
- Method: top down
  - How to use
  - How it works
  - When to use
[Diagram labels: SQL, NoSQL, NewSQL; Data Model, Operations, System Design, Performance Optimization]
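The three data models listed above can be compared by storing the same record in each shape. The sketch below uses plain Python dictionaries and lists only to mirror the shape of the data; it does not use the API of any real system, and every name in it (the `user:42` key, the field names, the helper functions) is hypothetical.

```python
import json

# Key-value: the store sees only a key and an opaque value.
# Any structure inside the value is the application's business.
kv_store = {"user:42": json.dumps({"name": "Ada", "city": "Nashville"})}

# Column family: row key -> column family -> column -> value.
# Columns are grouped into families, and rows may be sparse.
cf_store = {
    "user:42": {
        "profile": {"name": "Ada", "city": "Nashville"},
        "activity": {"last_login": "2014-01-15"},
    }
}

# Document: a nested, self-describing record that can be matched by field.
doc_store = [
    {"_id": 42, "name": "Ada", "address": {"city": "Nashville"}, "tags": ["student"]},
]


def kv_get(key):
    """Fetch by key, then decode the opaque value in the application."""
    return json.loads(kv_store[key])


def cf_get(row, family, column):
    """Address a single cell by row key, column family, and column."""
    return cf_store[row][family][column]


def doc_find(field, value):
    """Return every document whose top-level field equals value."""
    return [doc for doc in doc_store if doc.get(field) == value]
```

The access pattern is the point of the comparison: the key-value lookup needs the exact key, the column-family lookup addresses one cell, and the document lookup filters on a field inside the record.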
Tools and Systems
- Hands-on systems (items that you can put on your resume!)
  - MySQL
  - MapReduce (YARN), HDFS
  - HBase, DynamoDB, Cassandra
  - Memcached, Redis
  - MongoDB
  - Pig, Hive, Impala
  - Mahout
  - Spark
- Design knowledge
  - BigTable, Dynamo, Dremel, Spanner, Storm
[Stack diagram]
- Data processing and analysis
  - Batch processing/analysis: MapReduce, Pig, Hive
  - Interactive access: Impala/Drill
  - Real-time stream: Storm
  - Mahout
- Data storage
  - File system (HDFS)
  - Database (SQL, NoSQL, NewSQL)
- Resource management: YARN
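As a preview of the batch-processing layer, the MapReduce programming model can be sketched in a few lines of single-process Python. This is only an illustration of the map, shuffle, and reduce phases using word count; it is not Hadoop code, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; here everything runs in one process so the data flow is easy to follow.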
Putting This Course in the Big Data Landscape
[Figure legend]
- Lecture
- Lab
- (Guest) lecture
- Project (defined by you)
Background Required
- Strong programming and hands-on capability
  - Lots of time-consuming system setup, development, debugging, etc.
- Solid data structure and algorithm knowledge
  - Hash table, B-tree, etc.
- Operating system
  - Concurrency (e.g., race condition, lock, synchronization)
- Network
  - Network delay, loss, bandwidth
  - How data is transferred from one host to another
  - Basic concepts in network programming (i.e., socket programming)
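The concurrency background listed above (race condition, lock, synchronization) can be illustrated with a shared counter. The sketch below assumes Python's standard `threading` module; the names `counter`, `add_many`, and `run` are hypothetical. The update `counter += 1` is a read-modify-write sequence, so without the lock two threads could interleave and lose updates.

```python
import threading

counter = 0
lock = threading.Lock()


def add_many(n):
    """Increment the shared counter n times, holding the lock for each update."""
    global counter
    for _ in range(n):
        with lock:  # only one thread at a time may run the read-modify-write
            counter += 1


def run(num_threads=4, n=10000):
    """Start num_threads workers, wait for all of them, and return the total."""
    global counter
    counter = 0
    threads = [threading.Thread(target=add_many, args=(n,)) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

With the lock in place, `run(4, 10000)` returns 40000 on every run. Removing the `with lock:` line makes the result depend on thread scheduling, which is what makes race conditions hard to reproduce and debug.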
Course Information
- Check out our website:
- Presentation (team work)
  - Comprehensive and concise introduction
  - Demonstration based on an example application
  - Review and revision by me
- 4 labs (team work)
  - Pick an application/data set
- 2 quizzes
- Project (team work)
  - Pick your own topic
  - Start early
- Start teaming ASAP!
Logistics
- Development platform
  - Local environment: your choice, but Eclipse is recommended
  - Code repository: GitHub
- Experiment platform
  - Your own machine
  - EECS Linux system
  - Amazon Web Services