CSCE 4013/5013 Big Data Analytics and Management Fall 2015.

Overview
- Class hours: 11:00am - 12:15pm, Tuesday & Thursday, JBHT 239
- Office hours: 3:00 - 5:00pm, Tuesday, JBHT 516
- Instructor: Dr. Xintao Wu
- Email: xintaowu@uark.edu
- Office: JBHT 516
- Webpage: http://csce.uark.edu/~xintaowu/BDAM/bdam.htm
- Textbook: no textbook is required; reading materials are posted on the course website.

Topic Description
- Traditional DBMS/DW revisited
- Classic data mining revisited
- Hadoop, AWS
- NoSQL
- NewSQL
- Big data analytics and machine learning (Spark)
- Real-time streaming (Storm)
- Crowdsourcing and human computation

Course Prerequisites
- CSCE 3193 Programming Paradigms and either INEG 2313 or STAT 3103
- Familiarity with programming in Java or C++ is assumed
- Scripting languages (Ruby, Python) are preferred
- Basic concepts of probability and statistics
- Knowledge of data mining or machine learning is a plus

Grading Composition
- Homework & quiz: 10%
- Group project: 30%
- Midterm: 20%
- Final: 40%

Project Reports
- Late policy: late submissions are not accepted
- Hard copy is preferred
- Electronic submission (Word or PDF) is accepted

Project
- Big Data Management or Analytics Project
  - Each group consists of 2-4 students
  - Develop/implement/apply big data management and analytics systems on real large data sets
- Individual Research Project
- More information: http://csce.uark.edu/~xintaowu/BDAM/proj.htm

Midterm & Final
- Open books/notes/internet
- No discussion
- No help from any entity, e.g., by posting/uploading your questions on the Web
- Cumulative
- No makeup
- Class attendance is not required
- Bonus is expected

Textbook & Reading Materials
- Textbook: none is required
- Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman, Mining of Massive Datasets, 2014 (PDF download).
- Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd edition, Morgan Kaufmann, 2011. ISBN: 978-0-12-381479-1
- Recommended reading materials:
  - http://csce.uark.edu/~xintaowu/BDAM/schedule.htm
  - http://csce.uark.edu/~xintaowu/BDAM/reading.htm

Big Data Era
- Google: every 2 days we create as much data as we did up to 2003.
- Facebook:
  - 500+ TB of new data every day, including 2.5 billion items shared, 2.7 billion Likes, and 300 million photos
  - 100+ PB Hadoop cluster
- Twitter: 500 million tweets per day
- Many applications for streaming data, e.g., sensors

Drivers of Data Computing
- 6A's: Anytime, Anywhere Access to Anything by Anyone Authorized
- 4V's: Volume, Velocity, Variety, Veracity
- Reliability, Security, Privacy, Usability

Big Data Computing
- Drowning in data
  - Volume, Velocity, Variety, and Veracity
  - 2.5 exabytes every day
  - Web data, healthcare, e-commerce, social networks
- Advancing technology
  - Cheap storage/processing power
  - Growth in huge data centers
  - Data is in the "cloud": Amazon AWS, Hadoop, Azure
  - Computing is in the "cloud"

AVC Denial Log Analysis
- Volume and velocity: 1 million log files per day, each with thousands of entries
- S3, Hive and EMR

NoSQL
- http://nosql-database.org/
- Non-relational
- Scalability
- Collection of structures
- No pre-defined schema
- No join operations
- CAP, not ACID
  - CAP: Consistency, Availability and Partition tolerance (but not all three at once!)
  - ACID: Atomicity, Consistency, Isolation and Durability
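To make "no pre-defined schema" concrete, here is a minimal in-memory sketch. The `Collection` class, its methods, and the sample documents are hypothetical names invented for illustration; they are not the API of any real NoSQL product.

```python
from typing import Any, Dict, List, Optional


class Collection:
    """Hypothetical schemaless collection: each document is a dict stored under a key."""

    def __init__(self) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}

    def put(self, key: str, doc: Dict[str, Any]) -> None:
        # No schema check: documents in the same collection may carry different fields.
        self._docs[key] = dict(doc)

    def get(self, key: str) -> Optional[Dict[str, Any]]:
        return self._docs.get(key)

    def find(self, field: str, value: Any) -> List[str]:
        # Full scan; documents that lack the field simply do not match.
        return sorted(k for k, d in self._docs.items() if d.get(field) == value)


users = Collection()
users.put("u1", {"name": "Ada", "city": "Fayetteville"})
users.put("u2", {"name": "Lin", "city": "Fayetteville", "langs": ["Python", "Ruby"]})
users.put("u3", {"name": "Sam"})  # no "city" field at all

matches = users.find("city", "Fayetteville")
print(matches)
```

Because nothing enforces a common set of fields, any integrity rule (for example, "every user has a city") would have to live in the application code.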

Advantages of NoSQL
- Cheap, easy to implement
- Data are replicated and can be partitioned
- Easy to distribute
- Don't require a schema
- Can scale up and down
- Can handle web-scale data
- Quickly process large amounts of data
- Relax the data consistency requirement (CAP)
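"Replicated and can be partitioned" can be sketched with simple hash partitioning. This is a toy illustration under stated assumptions: the node names, the replication factor of 2, and the "primary plus next node" placement rule are all made up for the example, and real systems typically use more elaborate schemes such as consistent hashing.

```python
import hashlib
from typing import Dict, List

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names
REPLICAS = 2  # assumed replication factor


def owners(key: str, nodes: List[str] = NODES, replicas: int = REPLICAS) -> List[str]:
    """Return the nodes holding `key`: a primary chosen by hash, plus its successors."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]


def put(store: Dict[str, Dict[str, str]], key: str, value: str) -> None:
    # Write the value to every replica that owns the key.
    for node in owners(key):
        store[node][key] = value


store: Dict[str, Dict[str, str]] = {n: {} for n in NODES}
for k in ["alice", "bob", "carol", "dave"]:
    put(store, k, k.upper())

for node, data in store.items():
    print(node, sorted(data))
```

Each key lives on two of the three nodes, so losing one node loses no data. The cost is the one the disadvantages slide names: copies can diverge if a write reaches only some replicas.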

Disadvantages of NoSQL
- Data is generally duplicated, with potential for inconsistency
- Lack of standards
  - No standardized schema
  - No standard format for queries
  - No standard language
- Difficult to impose complicated structures
- Depend on the application layer to enforce data integrity
- No guarantee of support

NewSQL
- It is more of a movement than a specific product
- The "New" refers to the vendors and not the SQL
- Seeks to provide the same scalable performance as NoSQL for OLTP read-write workloads while maintaining ACID
- Transactions are short-lived, access a small set of data, and are repetitive
- Examples: H-Store, VoltDB, Amazon RDS, Microsoft SQL Azure, Google Spanner, SAP HANA
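The kind of transaction described above (short-lived, touching a small set of rows, ACID) can be sketched with Python's built-in `sqlite3` module. SQLite is a single-node embedded database, not a NewSQL system; it is used here only because it runs anywhere. The `account` table and `transfer` function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "id TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 50)])
conn.commit()


def transfer(db: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move `amount` from src to dst atomically; return False if it would overdraw."""
    try:
        with db:  # commits on success, rolls back if an exception escapes
            db.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
            db.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; the credit was undone too.
        return False


ok = transfer(conn, "a", "b", 30)       # succeeds
failed = transfer(conn, "a", "b", 500)  # violates the CHECK, so nothing is applied
balances = dict(conn.execute("SELECT id, balance FROM account"))
print(ok, failed, balances)
```

The second transfer credits `b` before the debit fails, yet the rollback leaves both rows unchanged. NewSQL systems aim to keep that all-or-nothing guarantee while scaling out across nodes.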

The World of Big Data Tools (figure, from Bingjing Zhang)
- Models: MapReduce Model, DAG Model, Graph Model, BSP/Collective Model
- Purposes: For Iterations/Learning, For Streaming, For Query
- Tools shown: Hadoop, MPI, Dryad/DryadLINQ, Pig/PigLatin, Spark, Shark, Spark Streaming, MRQL, Hive, Tez, Giraph, Hama, GraphLab, Harp, GraphX, HaLoop, Twister, Storm, S4, Samza, Drill, Stratosphere, Reef
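As a rough illustration of the MapReduce model named in this tool landscape, the three phases (map, shuffle, reduce) can be sketched in plain Python on a word-count task. This runs in a single process and uses no Hadoop API; the function names and the two sample lines are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    # Map: emit a (word, 1) pair for every word in one input record.
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Shuffle: group all emitted values by key, as the framework would between phases.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    # Reduce: combine all values for one key into a single result.
    return key, sum(values)


lines = ["big data big tools", "data tools for big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts)
```

In a real cluster the map calls run in parallel on separate input splits and the shuffle moves data across the network. The per-record and per-key functions are the parts a user would typically write.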

Big data software stack, upper layers (figure, from Geoffrey Fox)
- Orchestration & Workflow: Oozie, ODE, Airavata and OODT (Tools); NA: Pegasus, Kepler, Swift, Taverna, Trident, ActiveBPEL, BioKepler, Galaxy
- Data Analytics Libraries
  - Machine Learning: Mahout, MLlib, MLbase, CompLearn (NA)
  - Linear Algebra: ScaLAPACK, PETSc (NA)
  - Statistics, Bioinformatics: R, Bioconductor (NA)
  - Imagery: ImageJ (NA)
- High Level (Integrated) Systems for Data Processing
  - MRQL (SQL on Hadoop, Hama, Spark)
  - Hive (SQL on Hadoop)
  - Pig (Procedural Language)
  - Shark (SQL on Spark, NA)
  - HCatalog (Interfaces)
  - Impala (NA), Cloudera (SQL on HBase)
  - Sawzall (Log Files, Google, NA)
- Parallel Horizontally Scalable Data Processing
  - Categories shown: Batch, Iterative MR, Graph, Stream
  - Hadoop (MapReduce), Pegasus on Hadoop (NA), Spark (Iterative MR), NA: Twister, Stratosphere, Giraph ~Pregel, Tez (DAG), Hama (BSP), Storm, S4 (Yahoo), Samza (LinkedIn)
- Pub/Sub Messaging: Netty (NA)/ZeroMQ (NA)/ActiveMQ/Qpid/Kafka
- Inter-process Communication (ABDS and HPC): Hadoop, Spark Communications, MPI (NA) & Reductions, Harp Collectives (NA)
- Cross Cutting Capabilities
  - Distributed Coordination: ZooKeeper, JGroups
  - Message Protocols: Thrift, Protobuf (NA)
  - Security & Privacy
  - Monitoring: Ambari, Ganglia, Nagios, Inca (NA)

Big data software stack, lower layers (figure, from Geoffrey Fox)
- In memory distributed databases/caches: GORA (general object from NoSQL), Memcached (NA), Redis (NA) (key value), Hazelcast (NA), Ehcache (NA)
- ABDS Cluster Resource Management: Mesos, Yarn, Helix, Llama (Cloudera)
- HPC Cluster Resource Management: Condor, Moab, Slurm, Torque (NA) ...
- ABDS File Systems: HDFS, Swift, Ceph (Object Stores)
- User Level: FUSE (NA) (POSIX Interface)
- HPC File Systems (NA): Gluster, Lustre, GPFS, GFFS (Distributed, Parallel, Federated)
- Interoperability Layer: Whirr / JClouds, OCCI, CDMI (NA)
- DevOps/Cloud Deployment: Puppet/Chef/Boto/CloudMesh (NA)
- Cross Cutting Capabilities
  - Distributed Coordination: ZooKeeper, JGroups
  - Message Protocols: Thrift, Protobuf (NA)
  - Security & Privacy
  - Monitoring: Ambari, Ganglia, Nagios, Inca (NA)
- SQL: MySQL (NA); SciDB (NA) (Arrays, R, Python); Phoenix (SQL on HBase)
- Extraction Tools: UIMA (Entities) (Watson); Tika (Content)
- NoSQL: Column: Cassandra (DHT); HBase (Data on HDFS); Accumulo (Data on HDFS); Solandra (Solr + Cassandra) +Document; Azure Table
- NoSQL: Document: MongoDB (NA), CouchDB, Lucene, Solr
- NoSQL: Key Value (all NA): Riak ~Dynamo; Dynamo (Amazon); Voldemort ~Dynamo; Berkeley DB
- NoSQL: General Graph and TripleStore (RDF, SPARQL): Neo4J Java Gnu (NA); RYA (RDF on Accumulo); AllegroGraph (Commercial); Sesame (NA); Yarcdata Commercial (NA); Jena
- ORM Object Relational Mapping: Hibernate (NA), OpenJPA and JDBC Standard
- File Management: iRODS (NA)
- IaaS System Manager (Open Source, Commercial Clouds): OpenStack, OpenNebula, Eucalyptus, CloudStack, vCloud, Amazon, Azure, Google; Bare Metal
- Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP)
