Published by Donald Potter. Modified over 9 years ago.
1
Big Data Analytics with R and Hadoop
Chapter 5: Learning Data Analytics with R and Hadoop
Data Mining Lab, Jiyeon Kim (김지연)
2
Content
- Understanding the data analytics project life cycle
- Understanding data analytics problems
  - Exploring web pages categorization
  - Computing the frequency of stock market change
  - Predicting the sale price of blue book for bulldozers
3
Understanding the data analytics project life cycle
- Identifying the problem
- Designing data requirement
- Preprocessing data
- Performing analytics over data
- Visualizing data
4
Understanding the data analytics project life cycle
Identifying the problem
- Business analytics trends change as data analytics is performed over the web datasets of a growing business.
- Data analytical applications need to be scalable so that insights can be collected from these datasets.
- Example: we want to know how to increase the business.
  - Identify the important pages of our website by categorizing them.
  - Based on these popular pages, their types, their traffic sources, and their content, we can decide the roadmap to improve the business by improving web traffic and content.
5
Understanding the data analytics project life cycle
Designing data requirement
- To perform data analytics on a specific problem, datasets from related domains are needed.
- Example: social media analytics (problem specification)
  - Use Facebook or Twitter as the data source.
  - To identify user characteristics, we need user profile information, likes, and posts as data attributes.
6
Understanding the data analytics project life cycle
Preprocessing data
- Common operations: data cleansing, data aggregation, data augmentation, data sorting, data formatting
- Big Data: the datasets need to be formatted and uploaded to HDFS, where they are used by the various nodes running Mappers and Reducers in Hadoop clusters.
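The preprocessing operations listed above can be illustrated on a tiny in-memory table. This is only a sketch in base R: the `raw` data frame, its column names, and the output file name are made up for illustration and do not come from the slides.

```r
# Hypothetical raw page-visit records; column names are illustrative only.
raw <- data.frame(
  date   = c("2013-01-02", "2013-01-01", "2013-01-01", NA, "2013-01-02"),
  page   = c("/home", "/home", "/about", "/home", "/about"),
  visits = c(10, 5, NA, 7, 3),
  stringsAsFactors = FALSE
)

# Cleansing: drop rows with missing values.
clean <- raw[complete.cases(raw), ]

# Formatting: parse the date strings into Date objects.
clean$date <- as.Date(clean$date)

# Augmentation: derive a new attribute from an existing one.
clean$weekday <- weekdays(clean$date)

# Aggregation: total visits per page.
per_page <- aggregate(visits ~ page, data = clean, FUN = sum)

# Sorting: most visited pages first.
per_page <- per_page[order(-per_page$visits), ]

# Write a header-less CSV, a convenient shape for line-oriented Hadoop input.
out_file <- file.path(tempdir(), "pages.csv")
write.table(per_page, out_file, sep = ",", row.names = FALSE,
            col.names = FALSE, quote = FALSE)
```

For a real Big Data job, the resulting file would then be uploaded to HDFS rather than kept in a temporary directory.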
7
Understanding the data analytics project life cycle
Performing analytics over data
- Various machine learning and custom algorithmic concepts: regression, classification, clustering, model-based recommendation
- Big Data: the same algorithms can be translated to MapReduce algorithms by expressing their analytics logic as MapReduce jobs that run over Hadoop clusters.
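To make the idea of translating analytics logic into map and reduce steps concrete, here is a minimal local sketch in base R. It computes a per-key mean without any Hadoop dependency. The function names (`map_fn`, `reduce_fn`) and the toy data are assumptions for illustration, not part of the original slides.

```r
# Toy input: (key, value) records split across two hypothetical "nodes".
chunks <- list(
  data.frame(key = c("a", "b", "a"), value = c(1, 4, 3), stringsAsFactors = FALSE),
  data.frame(key = c("b", "a"), value = c(6, 5), stringsAsFactors = FALSE)
)

# Map: each chunk independently emits one (key, c(sum, count)) pair per record.
map_fn <- function(chunk) {
  lapply(seq_len(nrow(chunk)), function(i) {
    list(key = chunk$key[i], value = c(chunk$value[i], 1))
  })
}

# Reduce: add up the partial sums and counts, then divide to get the mean.
reduce_fn <- function(values) {
  total <- Reduce(`+`, values)
  total[1] / total[2]
}

# Run the map step on every chunk and pool the emitted pairs.
pairs <- do.call(c, lapply(chunks, map_fn))

# Shuffle: group the emitted values by key (Hadoop does this between phases).
keys   <- vapply(pairs, function(p) p$key, character(1))
groups <- lapply(split(seq_along(pairs), keys), function(idx) {
  lapply(pairs[idx], function(p) p$value)
})

# Run the reduce step once per key.
means <- vapply(groups, reduce_fn, numeric(1))
```

Emitting a (sum, count) pair instead of a raw value keeps the reduce step associative, which is what allows partial results from different nodes to be merged safely.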
8
Understanding the data analytics project life cycle
Visualizing data
- ggplot2
- rCharts
9
Understanding data analytics problems - Exploring web pages categorization
Identifying the problem
- To identify the category of a web page of a website based on the visit count of the pages.
- To identify the importance of the web pages designed for a website. Based on this information, the content, design, or visits of the less popular pages can be improved or increased.
10
Understanding data analytics problems - Exploring web pages categorization
Designing data requirement
- Use a Google Analytics dataset with the following attributes:
  - date: the date of the day when the web page was visited
  - source: the referral to the web page
  - pageTitle: the title of the web page
  - pagePath: the URL of the web page
11
Understanding data analytics problems - Exploring web pages categorization
Designing data requirement
- The code for the extraction process from Google Analytics
12
Understanding data analytics problems - Exploring web pages categorization
Designing data requirement
- The code for the extraction process from Google Analytics
13
Understanding data analytics problems - Exploring web pages categorization
Preprocessing data
14
Understanding data analytics problems - Exploring web pages categorization
Performing analytics over data
- Initialize by setting the Hadoop variables and loading the RHadoop library
- Upload the datasets to HDFS
15
Understanding data analytics problems - Exploring web pages categorization
Performing analytics over data
MapReduce 1
16
Understanding data analytics problems - Exploring web pages categorization
Performing analytics over data
MapReduce 1
17
Understanding data analytics problems - Exploring web pages categorization
Performing analytics over data
MapReduce 2
18
Understanding data analytics problems - Exploring web pages categorization
Performing analytics over data
MapReduce 2
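The code for the two MapReduce jobs is not part of this transcript. As a rough local stand-in, the sketch below shows one way the two stages could work in base R: first total the visits per page, then bin the pages into three categories by visit count. The `visits` column, the sample rows, the category labels, and the cut points (thirds of the observed range) are all assumptions for illustration, not the original implementation.

```r
# Hypothetical extract: pagePath is from the dataset description; the visits
# column and all values are assumed for this sketch.
ga <- data.frame(
  pagePath = c("/", "/blog", "/", "/contact", "/blog", "/", "/pricing"),
  visits   = c(120, 100, 80, 5, 50, 100, 30),
  stringsAsFactors = FALSE
)

# Stage 1 (stands in for "MapReduce 1"): total visits per page.
totals <- aggregate(visits ~ pagePath, data = ga, FUN = sum)

# Stage 2 (stands in for "MapReduce 2"): assign each page to one of three
# categories. Splitting the observed range into thirds is an assumption.
rng    <- range(totals$visits)
breaks <- seq(rng[1], rng[2], length.out = 4)
totals$category <- cut(totals$visits, breaks = breaks,
                       labels = c("low", "medium", "high"),
                       include.lowest = TRUE)

print(totals)
```

On a Hadoop cluster, stage 1 maps naturally to a job keyed by pagePath, and stage 2 to a second job that needs the minimum and maximum totals before it can assign categories.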
19
Understanding data analytics problems - Exploring web pages categorization
Visualizing data
- Plot the web page categorization output using the three categories.
- If we have more information, such as sources, we can represent the web pages as nodes of a graph, colored by popularity, with directed edges where users follow the links.
20
Understanding data analytics problems - Computing the frequency of stock market change
Identifying the problem
- Calculate the frequency of past changes for one particular symbol of the stock market, similar to a Fourier transformation.
- Based on this information, the investor can get more insight into changes over different time periods.
- Goal: to calculate the frequencies of percentage change.
21
Understanding data analytics problems - Computing the frequency of stock market change
Designing data requirement
- Use Yahoo! Finance as the input dataset, with the following parameters:
  - From month
  - From day
  - From year
  - To month
  - To day
  - To year
  - Symbol
22
Understanding data analytics problems - Computing the frequency of stock market change
Preprocessing data
- To perform the analytics over the extracted dataset, read it into R and save it as a CSV file:
    stock_BP <- read.csv("
    write.csv(stock_BP, "table.csv", row.names=FALSE)
- Upload table.csv to HDFS:
    bin/hadoop dfs -put /usr/jyk/table.csv /input/
23
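Understanding data analytics problems - Computing the frequency of stock market change
Performing analytics over data
Mapper: stock_mapper.R
    # Suppress warnings (for example, from the non-numeric header row)
    options(warn=-1)
    # Read the input line by line from standard input
    input <- file("stdin", "r")
    while (length(currentLine <- readLines(input, n=1, warn=FALSE)) > 0) {
      fields <- unlist(strsplit(currentLine, ","))
      # Field 2 is the opening price, field 5 is the closing price
      open_price <- as.double(fields[2])
      close_price <- as.double(fields[5])
      # Emit the change as the key, with a count of 1 as the value
      change <- (close_price - open_price)
      write(paste(change, 1, sep="\t"), stdout())
    }
    close(input)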
24
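Understanding data analytics problems - Computing the frequency of stock market change
Performing analytics over data
Reducer: stock_reducer.R
    # Running key and its accumulated count
    current.key <- NA
    current.val <- 0.0
    conn <- file("stdin", "r")
    # Input arrives sorted by key, one "key<TAB>value" record per line
    while (length(next.line <- readLines(conn, n=1)) > 0) {
      split.line <- strsplit(next.line, "\t")
      key <- split.line[[1]][1]
      val <- as.numeric(split.line[[1]][2])
      if (is.na(current.key)) {
        # First record
        current.key <- key
        current.val <- val
      } else {
        if (current.key == key) {
          # Same key: keep accumulating
          current.val <- current.val + val
        } else {
          # Key changed: emit the finished key and start the next one
          write(paste(current.key, current.val, sep="\t"), stdout())
          current.key <- key
          current.val <- val
        }
      }
    }
    # Emit the last key
    write(paste(current.key, current.val, sep="\t"), stdout())
    close(conn)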
25
Understanding data analytics problems - Computing the frequency of stock market change
Performing analytics over data
MapReduce: running the job with Hadoop streaming
    /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-streaming mr1-cdh5.3.3.jar \
      -input input/table.csv \
      -output outputs \
      -file /home/jyk/Documents/stock_mapper.R \
      -mapper /home/jyk/Documents/stock_mapper.R \
      -file /home/jyk/Documents/stock_reducer.R \
      -reducer /home/jyk/Documents/stock_reducer.R
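Before submitting a streaming job, it can help to dry-run the same map, sort, and reduce logic locally. The sketch below does this in base R on a few made-up rows. The helper names `map_lines` and `reduce_lines` and the sample prices are hypothetical; only the field positions (Open = field 2, Close = field 5) follow the mapper on the earlier slide.

```r
# A few rows in the column order the mapper expects; the prices are made up.
csv_lines <- c(
  "2014-01-02,10.0,10.6,9.9,10.5,1000",
  "2014-01-03,10.5,10.9,10.4,10.0,1200",
  "2014-01-06,20.0,20.7,19.8,20.5,900",
  "2014-01-07,20.5,20.6,19.9,20.0,1100"
)

# Map step: one "change<TAB>1" record per input line.
map_lines <- function(lines) {
  vapply(strsplit(lines, ","), function(fields) {
    change <- as.double(fields[5]) - as.double(fields[2])
    paste(change, 1, sep = "\t")
  }, character(1))
}

# Shuffle step: Hadoop sorts mapper output by key before the reducer sees it.
shuffled <- sort(map_lines(csv_lines))

# Reduce step: sum the counts per key. The streaming reducer does the same
# thing incrementally over the sorted input.
reduce_lines <- function(lines) {
  parts  <- strsplit(lines, "\t")
  keys   <- vapply(parts, `[`, character(1), 1)
  counts <- as.numeric(vapply(parts, `[`, character(1), 2))
  totals <- tapply(counts, keys, sum)
  paste(names(totals), as.vector(totals), sep = "\t")
}

result <- reduce_lines(shuffled)
print(result)
```

Each output record pairs a change value with the number of days on which it occurred, which is the frequency table the job is meant to produce.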
26
Understanding data analytics problems - Computing the frequency of stock market change
Performing analytics over data
27
Understanding data analytics problems - Computing the frequency of stock market change
Performing analytics over data
28
Understanding data analytics problems - Computing the frequency of stock market change
Visualizing data
    library(ggplot2)
    # Read the MapReduce output: V1 is the change, V2 is its frequency
    myStockData <- read.delim("stock_output.txt", header=FALSE, sep="", dec=".")
    ggplot(myStockData, aes(x=V1, y=V2)) + geom_smooth() + geom_point()
29
Understanding data analytics problems - Predicting the sale price of blue book for bulldozers
Identifying the problem
- Shows how large datasets can be resampled and fitted with the random forest model using R and Hadoop.
- Goal: to predict the sale price of a particular piece of heavy equipment at auction, based on its usage, equipment type, and configuration.
30
Understanding data analytics problems - Predicting the sale price of blue book for bulldozers
Designing data requirement
- Use the datasets from the Kaggle competition:

File name                     Description
Train                         Training set that contains data for 2011.
Valid                         Validation set that contains data from January 1, 2012 to April 30, 2012.
Data dictionary               Metadata of the training dataset variables.
Machine_Appendix              Correct year of manufacturing for a given machine, along with the make, model, and product class details.
Test                          Test dataset.
random_forest_benchmark_test  Benchmark solution provided by the host.
31
Understanding data analytics problems - Predicting the sale price of blue book for bulldozers
Preprocessing data
- Load the Train.csv and Machine_Appendix.csv datasets
32
Understanding data analytics problems - Predicting the sale price of blue book for bulldozers
Preprocessing data
- Add a few features and merge the datasets
33
Understanding data analytics problems - Predicting the sale price of blue book for bulldozers
Performing analytics over data
Random sampling
- N data points in our initial training set
- A set of M different models for an ensemble classifier
- Each of the M models will be fitted with K data points
Poisson sampling
- KM < N: we are not using the full amount of data available to us
- KM = N: we can exactly partition our dataset to produce totally independent samples
- KM > N: we must resample some of our data with replacement
34
Understanding data analytics problems - Predicting the sale price of blue book for bulldozers
Performing analytics over data
Poisson sampling
- Generates independent samples from the N training input points.
- Three parameters: N, M, and K, where K is fixed.
- T = K/N is used to eliminate the need to know the value of N in advance.
- K/N is the average fraction of input data in each model, here 10%: T = frac.per.model = 0.1
- Number of models: M = num.models = 50
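The Poisson sampling step can be sketched locally in base R as follows. The parameter names `frac.per.model` and `num.models` are taken from the slide. The `chunk` data frame and the helper `poisson_subsample` are hypothetical stand-ins, and this is an illustration of the technique rather than the original mapper.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

frac.per.model <- 0.1  # T = K/N, expected fraction of rows per model
num.models     <- 50   # M

# Stand-in for one chunk of training rows; the global N is never needed.
chunk <- data.frame(id = 1:1000, x = rnorm(1000))

# For one model, draw a Poisson count per row and repeat each row that many
# times: a count of 0 leaves the row out, a count above 1 resamples it.
poisson_subsample <- function(rows, frac) {
  draws <- rpois(n = nrow(rows), lambda = frac)
  rows[rep(seq_len(nrow(rows)), times = draws), , drop = FALSE]
}

# One independent sample per model.
samples <- lapply(seq_len(num.models), function(i) {
  poisson_subsample(chunk, frac.per.model)
})

# The average sample size should be close to frac.per.model * nrow(chunk).
sample_sizes <- vapply(samples, nrow, integer(1))
print(mean(sample_sizes) / nrow(chunk))
```

Because each row's count is drawn independently, every mapper can sample its own chunk without coordinating with the others, which is why the method suits MapReduce.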
35
Understanding data analytics problems - Predicting the sale price of blue book for bulldozers
Performing analytics over data
Fitting random forests
- Underfitting
- Normal fitting
- Overfitting
36
Understanding data analytics problems - Predicting the sale price of blue book for bulldozers
Performing analytics over data
Mapper
37
Understanding data analytics problems - Predicting the sale price of blue book for bulldozers
Performing analytics over data
Reducer
38
Understanding data analytics problems - Predicting the sale price of blue book for bulldozers
Performing analytics over data
MapReduce
39
Understanding data analytics problems - Predicting the sale price of blue book for bulldozers
Performing analytics over data
- Each of the 50 samples produced a random forest with 10 trees, so the final random forest is a collection of 500 trees, fitted in a distributed fashion over a Hadoop cluster.
40
Understanding data analytics problems - Predicting the sale price of blue book for bulldozers
Visualizing data
41
Thank you