1
R + R Tools for Visual Studio = Data Science
Bahrudin Hrnjica, MVP, daenet @bhrnjica
2
Agenda
  Quick intro to R Language
  Statistics and graphs
  Microsoft R Server
  Microsoft R Server and RevoScaleR
3
What is R
  A programming language, a statistical package, and an interpreter
  Open source and free
  Highly extensible
  Focused on statistics and machine learning
  Transparent and reproducible
  Single-threaded
  In-memory data
4
What R is not
  Not a database, but it connects to DBMSs
  No point-and-click user interface, but it connects to Java and Tcl/Tk
  The language interpreter can be very slow, but it allows calling your own C/C++ code
  No spreadsheet view of data, but it connects to Excel/MS Office
5
Two distributions of R
  CRAN project: The R Project for Statistical Computing
    Original distribution since the beginning
    Open source project written in C, C++, and Fortran
  MRAN project: Microsoft R Application Network
    Enhanced distribution of R
    Support for BLAS and LAPACK libraries and multi-core processing
  R code is fully compatible with both distributions.
6
Microsoft R Open
  Free, open source, cross-OS project
  Enhanced distribution of R
  Supports a variety of big data statistics, predictive modeling, and machine learning capabilities
  Single-threaded
  In-memory data manipulation
7
Microsoft R portfolio
8
Microsoft R Server – MRS
MRS extends open-source R to allow:
  Multi-threading
    Matrix operations, linear algebra, and many other math operations run on all available cores.
  Parallel processing
    ScaleR functions utilize all available resources, local or distributed.
  On-disk data storage
    RAM limitation lifted. Break through your memory barrier!
  Working with data too big to fit in memory
  Building models that take too long to run
  Working with clusters and distributed systems
9
MRS’s Native Data Format: XDF File
  Chunk-oriented
    Easy to distribute to nodes
    Fast to append
  Column-oriented
    Fast retrieval of variables
  Pre-computed metadata
10
Tools to write, edit, debug, and run R code (scripts)
  RStudio: free, cross-OS tool for R
  R Tools for Visual Studio
    Visual Studio vNext will include this tool
    Available today as a Preview download
    Suited for developers who run R scripts within VS
    Same development experience for editing, debugging, and running R code
11
Input and Output
  Input & Output
    inData: CSV, SAS, SPSS, ODBC connection…
    outFile: XDF file; returns a data frame if left blank
  Subset of variables
    varsToKeep
    varsToDrop
  Subset of rows
    numRows
    rowSelection
12
Data Sources
  Text files
  SAS, SPSS
  Teradata
  HDFS
  Databases via ODBC
  Runs in-database in SQL Server 2016
13
Importing from Databases
  Set up ODBC first
  Each data source (RxOdbcData) is one query (not a database)
  SQL Server 2016 can run MRS internally; no ODBC required
  rxDataStep
    Subset rows by criteria (rowSelection)
    Select columns by name (varsToKeep, varsToDrop)
    Create and modify variables (transforms)
    Pull data into an in-memory data.frame
14
Working with Data
  Subsetting rows
    rowSelection takes a logical vector, just like subset()
    Chain multiple criteria together with & and |
    numRows = N to get the first N rows of a dataset
  Selecting columns
    varsToKeep, varsToDrop
    One quirk: can't keep/drop when inData == outFile
  Transformation
    Create new variables
    Modify existing variables
    Change the variable type
    Takes a list of named elements, each a new variable
  Complex transformations
    Simple transformations depend on a single row of data
    Complex transformations depend on multiple rows
    In a distributed context, that means moving results between nodes
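The options above can be tried out with plain base R, since the slide compares rowSelection to subset(). This is only a sketch of the ideas, not RevoScaleR code: mtcars stands in for a real data source, the variable names (efficient, kept, dropped, first_five, simple) are made up for illustration, and the comments map each step to the argument named on the slide.

```r
# Base R analogue of the subsetting and transformation options above.
# Assumption: mtcars stands in for real data; no RevoScaleR calls are made.

# Row subsetting with a logical condition, criteria chained with & and |
# (the role rowSelection plays).
efficient <- subset(mtcars, mpg > 25 & (cyl == 4 | gear == 5))

# Column selection by name (the role of varsToKeep / varsToDrop).
kept    <- mtcars[, c("mpg", "hp", "wt")]
dropped <- mtcars[, !(names(mtcars) %in% c("vs", "am"))]

# First N rows (the role of numRows).
first_five <- head(mtcars, 5)

# Simple transformations: named elements, each a new or modified variable,
# each computed from a single row at a time.
simple <- transform(mtcars,
  kpl = mpg * 0.4251,     # new variable: approximate km per litre
  cyl = as.factor(cyl)    # change the variable type
)
```

A complex transformation would be something like centering mpg on its overall mean, because the mean depends on every row and, in a chunked setting, on every chunk.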
15
Managing Factors
  Factors count as complex because levels, level order, and level encoding can vary across chunks
  Use rxFactors to create and modify factors
  The F() shortcut
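Why factors are a chunk-level problem can be seen in base R alone. The sketch below uses two hypothetical chunks of a weekday column (values invented for illustration) and shows that converting each chunk on its own yields different level sets. It does not call rxFactors; it only illustrates the issue that function is meant to handle.

```r
# Two chunks of the same column, as a chunked reader might see them
# (hypothetical values for illustration).
chunk1 <- c("Mon", "Tue", "Mon")
chunk2 <- c("Wed", "Mon", "Wed")

# Converting each chunk independently gives different level sets, so the
# same integer code can mean different things in different chunks.
f1 <- factor(chunk1)   # levels: Mon, Tue
f2 <- factor(chunk2)   # levels: Mon, Wed
codes1 <- as.integer(f1)
codes2 <- as.integer(f2)

# Fixing the full level set up front makes the encoding consistent.
all_levels <- c("Mon", "Tue", "Wed")
g1 <- factor(chunk1, levels = all_levels)
g2 <- factor(chunk2, levels = all_levels)
```

In f1 the code 2 stands for "Tue", while in f2 it stands for "Wed"; with a shared level set, g1 and g2 agree on every code.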
16
How Algorithms Work in MS R Server:
  Chunk by chunk, aka Parallel External Memory Algorithms (PEMAs)
  Data just needs to fit on disk
  Chunks of data are distributed to all available cores/nodes
  Intermediate results are calculated in memory for each chunk
  The final result is assembled in memory
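The chunk-by-chunk idea can be illustrated with a small base R sketch of least squares. It is not how MRS is implemented internally; it only shows the pattern described above under some assumptions: mtcars is split into four chunks of eight rows to stand in for blocks read from disk, the chunks are processed sequentially rather than in parallel, and the names chunk_ids, partials, and beta_chunked are invented for the example.

```r
# Chunked least squares for mpg ~ hp + wt.
# Assumption: four 8-row chunks of mtcars stand in for blocks on disk.
chunk_ids <- split(seq_len(nrow(mtcars)), rep(1:4, each = 8))

# Intermediate results per chunk: X'X and X'y, both small and fixed-size
# regardless of how many rows the chunk holds.
partials <- lapply(chunk_ids, function(rows) {
  X <- cbind(Intercept = 1, as.matrix(mtcars[rows, c("hp", "wt")]))
  y <- mtcars$mpg[rows]
  list(xtx = crossprod(X), xty = crossprod(X, y))
})

# Final result assembled in memory: sum the partials, then solve.
xtx <- Reduce(`+`, lapply(partials, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(partials, `[[`, "xty"))
beta_chunked <- drop(solve(xtx, xty))

# Reference fit on the whole data set at once, for comparison.
beta_full <- coef(lm(mpg ~ hp + wt, data = mtcars))
same <- isTRUE(all.equal(unname(beta_chunked), unname(beta_full)))
```

Because each chunk contributes only a 3 x 3 matrix and a 3 x 1 vector, the per-chunk step could be handed to separate cores or nodes, and only those small summaries would need to be moved and added.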
17
Available ML Algorithms
  Linear regression: rxLinMod
  Generalized linear models: rxLogit, rxGlm
  Decision tree: rxDTree
  Gradient boosted decision trees: rxBTrees
  Decision forest: rxDForest
  K-means: rxKmeans
  Naïve Bayes: rxNaiveBayes
18
PEMAs in Context
  On laptops
    Chunks pulled from local disk
    All cores process chunks in parallel
  Computer cluster
    Chunks partitioned across nodes
    All cores on nodes process local chunks in parallel
19
Analyzing Data with MRS
  Pre-computed metadata: rxGetInfo, rxGetVarInfo
  Summary statistics: rxSummary, rxCube
  Predictive modeling
    Regressions: rxLogit, rxGlm
    Decision tree and forest: rxBTrees, rxDTree, rxDForest
    K-means and Naïve Bayes: rxKmeans, rxNaiveBayes
20
Metadata Retrieval - numeric
  One variable: rxSummary(~ arr_delay, data = flightsxdf)
  Two variables: rxSummary(~ arr_delay + dep_delay, data = myXdf)
  Groupwise: rxSummary(arr_delay ~ day_of_week, data = myXdf)
Metadata Retrieval – categorical
  rxCrossTabs for frequency tables
  rxCube for long tables
  Formula interface: rxCrossTabs(~ day_of_week : dest_F, data = myXdf)
21
Modeling Workflow in MRS
  Load data (rxImport)
  Exploratory analysis (rxGetInfo, rxSummary, rxCube)
  Clean data (rxDataStep, rxFactors)
  Build model (rxLinMod, rxGlm, etc.)
  Evaluate and predict (rxPredict)
22
Using Formula Syntax in Models
  One predictor: rxLinMod(y ~ x, data = myXdf)
  Two predictors: rxLinMod(y ~ x + z, data = myXdf)
  Two predictors with interaction term: rxLinMod(y ~ x * z, data = myXdf)
  Samples
    rxLinMod(mpg ~ hp + wt, data = mtcars)
    rxLinMod(delayed ~ dep_time + dayofweek, data = flightsXdf)
    rxNaiveBayes(Species ~ Sepal.Length + Sepal.Width, data = iris)
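What the three formula forms expand to can be checked in base R without fitting anything. The sketch below uses terms() on the slide's generic formulas and then fits the mtcars sample with base R lm() rather than rxLinMod, so it runs without MRS; the variable names one, two, inter, and fit are chosen only for this example.

```r
# Term labels produced by each formula form (base R, no data needed).
one   <- attr(terms(y ~ x), "term.labels")
two   <- attr(terms(y ~ x + z), "term.labels")
inter <- attr(terms(y ~ x * z), "term.labels")  # main effects plus x:z

# The mtcars sample from the slide, fitted with lm() instead of rxLinMod.
fit <- lm(mpg ~ hp + wt, data = mtcars)
coef_names <- names(coef(fit))
```

Printing inter shows that x * z is shorthand for x + z + x:z, which is why the interaction model estimates one more coefficient than the two-predictor model.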
23
Model Evaluation and Prediction
  MRS models don't include fitted values or residuals by default
  Generate fitted values, residuals, and predictions with rxPredict:
    # Fitted values: data used to fit model
    rxPredict(modelObject = delayedMod, data = flightsxdf, outData = flightsxdf)
  Other options
    Residuals: computeResiduals = TRUE
    Standard errors: computeStdErrors = TRUE
    Confidence intervals: interval = "confidence"
    Prediction intervals: interval = "prediction"
  For binary classifiers: rxRocCurve
    Compares actual values to one or more predictions generated by rxPredict
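The quantities listed above (fitted values, residuals, standard errors, confidence and prediction intervals) have direct base R counterparts, which may help when reading rxPredict output. This sketch uses lm() and predict() on mtcars instead of an MRS model; new_cars is a hypothetical input data frame invented for the example.

```r
# Base R counterpart of the prediction outputs, using lm() on mtcars.
fit <- lm(mpg ~ hp + wt, data = mtcars)

# Hypothetical new observations to score.
new_cars <- data.frame(hp = c(110, 200), wt = c(2.6, 3.5))

# Fitted values on the training data, and residuals derived from them.
fitted_vals <- predict(fit, newdata = mtcars)
resid_vals  <- mtcars$mpg - fitted_vals

# Standard errors of the predicted means.
with_se <- predict(fit, newdata = new_cars, se.fit = TRUE)

# Confidence interval for the mean response versus prediction interval
# for a single new observation (columns: fit, lwr, upr).
conf_int <- predict(fit, newdata = new_cars, interval = "confidence")
pred_int <- predict(fit, newdata = new_cars, interval = "prediction")
```

The prediction interval is wider than the confidence interval for the same input, because it also accounts for the noise of an individual observation.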