1
Big Data, Data Mining, Tools
2
N = ALL
3
CORRELATION vs CAUSATION
5
Data Sources...
6
Data Creation, Storage, Costs
7
Infrastructure
8
NoSQL Flavors
9
NoSQL
https://www.youtube.com/watch?v=qI_g07C_Q5I
Not Only SQL (sort of)
Greater scalability
Designed for distributed computing on commodity (not cheap) hardware
Variety of flavors
10
Topic: Algorithms
11
Tools
12
Speaking of the Cloud
13
High Level Flow Example
14
Hadoop MapReduce
15
HDFS
Distributed file system
Write-once/read-many
Fault tolerance / redundancy
Processing logic close to data
16
Traditional word count in Java
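The Java code from this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of the same idea in plain Java, with no Hadoop dependency: a map step emits a (word, 1) pair per token, and a reduce step groups the pairs by word and sums them. The class and method names (WordCountSketch, map, reduce, wordCount) are illustrative assumptions, not the original slide's code and not the Hadoop MapReduce API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the slide's example; runs in a single JVM, not on a cluster.
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every whitespace-separated token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle + reduce phase: group the pairs by word and sum their counts.
    // A TreeMap keeps the words sorted, similar to ORDER BY word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    // Driver: run the map step over every line, then reduce all emitted pairs.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        List<String> lines = List.of("the fog the mud", "fog everywhere");
        wordCount(lines).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

In a real Hadoop job the map and reduce steps would run on different nodes, close to the data blocks, and the framework would handle the grouping between them; this sketch only shows the logic of each step.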
17
Hive
CREATE TABLE docs (line STRING);
CREATE TABLE word_counts AS
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
18
Hive with Some Structure
Data
123	F
456	M
789	M
111	M
222	M
333	F
444	F
555	M

create table if not exists p_genders (
  p_id string,
  gender string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

SELECT * from p_genders;
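To make the ROW FORMAT DELIMITED clause concrete, the sketch below shows roughly what tab-delimited parsing into the two columns of p_genders looks like in plain Java. The DelimitedRows class and its parse method are hypothetical names for illustration. Filling a missing field with null and ignoring extra fields is an assumption about how Hive's default text handling behaves, not something stated on the slide.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of FIELDS TERMINATED BY '\t' for a two-column table.
public class DelimitedRows {

    // Split each non-empty line on tabs and map the fields onto (p_id, gender).
    // Assumption: a missing field becomes null and extra fields are dropped.
    static List<String[]> parse(String data) {
        List<String[]> rows = new ArrayList<>();
        for (String line : data.split("\n")) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.split("\t", -1);
            String[] row = new String[2]; // row[0] = p_id, row[1] = gender
            for (int i = 0; i < row.length && i < fields.length; i++) {
                row[i] = fields[i];
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        String data = "123\tF\n456\tM\n789\tM\n";
        for (String[] row : parse(data)) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```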
19
Pig Latin
A = load 'S3://pmb4bucket/input/bleakhouse/bleakhouse.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
store D into 's3://pmb4hadoop/output/bleakhouse';
20
Complex Event Processing
21
Tools
22
Data Scientist
Not just a bean counter - it’s about modeling
General skill set:
Math (linear algebra, statistics, calculus, discrete math)
Business sense
Programming skills
Communication
etc.
23
Our Schedule
Setting the goals for a data mining project
Setting up KNIME
Gathering and preparing data
Visualization
Machine Learning
Naïve Bayes
Clustering and Classification
Dimension reduction
24
But first…