Big Data, Data Mining, Tools
N = ALL
CORRELATION vs CAUSATION
Data Sources...
Data Creation, Storage, Costs
Infrastructure
NoSQL Flavors https://www.youtube.com/watch?v=qI_g07C_Q5I
NoSQL
  Not Only SQL (sort of)
  Greater scalability
  Designed for distributed computing on commodity (not cheap) hardware
  Variety of flavors
  https://www.youtube.com/watch?v=qI_g07C_Q5I
Topic: Algorithms
Tools
Speaking of the Cloud
High Level Flow Example
Hadoop MapReduce
HDFS
  Distributed file system
  Write once, read many
  Fault tolerance / redundancy
  Processing logic close to the data
  http://www.ibm.com/developerworks/library/wa-introhdfs/
Traditional word count in Java http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Source+Code
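The Java source linked above is long, so here is a minimal single-process sketch of the same map / shuffle / reduce idea in Python. It is not the Hadoop API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample lines are made up for illustration, and splitting on whitespace is an assumption.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every whitespace-separated token."""
    for word in line.split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle: collect all emitted values under their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return (word, sum(counts))


def word_count(lines):
    """Run the three phases in order and return {word: count}."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(w, c) for w, c in sorted(grouped.items()))


if __name__ == "__main__":
    sample = ["the fog in the river", "fog on the marshes"]  # made-up input
    print(word_count(sample))
```

In a real cluster the mappers and reducers run on different nodes and the framework performs the shuffle; here everything runs in one process so the phases are easy to see.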
Hive
  CREATE TABLE docs (line STRING);
  CREATE TABLE word_counts AS
  SELECT word, count(1) AS count
  FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
  GROUP BY word
  ORDER BY word;
Hive with Some Structured Data
  Data (tab-delimited: p_id, gender)
    123  F
    456  M
    789  M
    111  M
    222  M
    333  F
    444  F
    555  M
  create table if not exists p_genders (
    p_id string,
    gender string
  ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
  SELECT * from p_genders;
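To make "FIELDS TERMINATED BY '\t'" concrete, here is a small Python sketch that reads the same rows as tab-separated text and maps them onto the two columns. It only imitates how the file lines up with the table definition; the RAW string and the helper name load_p_genders are assumptions for illustration, not part of Hive.

```python
import csv
import io

# The rows from the slide, with a tab between p_id and gender.
RAW = "123\tF\n456\tM\n789\tM\n111\tM\n222\tM\n333\tF\n444\tF\n555\tM\n"


def load_p_genders(text):
    """Split each line on tabs and label the fields like the table columns."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [{"p_id": row[0], "gender": row[1]} for row in reader if row]


rows = load_p_genders(RAW)

if __name__ == "__main__":
    # Rough equivalent of: SELECT * from p_genders;
    for r in rows:
        print(r["p_id"], r["gender"])
```

Both columns stay strings, matching the string types in the table definition.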
Pig Latin
  A = load 's3://pmb4bucket/input/bleakhouse/bleakhouse.txt';
  B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
  C = group B by word;
  D = foreach C generate COUNT(B), group;
  store D into 's3://pmb4hadoop/output/bleakhouse';
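One way to read the script is relation by relation. The Python sketch below mirrors A through D on an in-memory list of lines. It is an approximation under stated assumptions: the helper name pig_word_count is hypothetical, tokens are split on whitespace only (Pig's TOKENIZE also splits on some punctuation), and the sort exists only so that groupby can form the groups; Pig does not promise that output order.

```python
from itertools import groupby


def pig_word_count(lines):
    """Mirror relations A..D of the Pig script on a list of text lines."""
    # A = load ...            -> one record per line of text
    A = list(lines)
    # B = foreach A generate flatten(TOKENIZE(...)) as word
    #                         -> one record per token
    B = [word for line in A for word in line.split()]
    # C = group B by word     -> one (group, bag) record per distinct word
    C = [(word, list(bag)) for word, bag in groupby(sorted(B))]
    # D = foreach C generate COUNT(B), group
    #                         -> (count, word), count first as in the script
    D = [(len(bag), word) for word, bag in C]
    return D


if __name__ == "__main__":
    print(pig_word_count(["fog everywhere", "fog up the river"]))  # made-up input
```

Note that D puts the count before the word, which is the column order the stored output would have.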
Complex Event Processing
Tools
Data Scientist
  Not just a bean counter - it’s about modeling
  General skill set:
    Math (linear algebra, statistics, calculus, discrete math)
    Business sense
    Programming skills
    Communication
    etc., etc., etc.
  https://www.youtube.com/watch?v=ceeiUAmbfZk
Our Schedule
  Setting the goals for a data mining project
  Setting up KNIME
  Gathering and preparing data
  Visualization
  Machine Learning
  Naïve Bayes
  Clustering and Classification
  Dimension reduction
But first…