Intro to Data Mining for Data Science

Intro to Data Mining for Data Science Peter Fox, Greg Hughes Data Science – ITWS/CSCI/ERTH-4350/6350 Module 6, Week 8, October 25, 2016

Contents: Data Mining – what it is, is not, and types; distributed applications – modern data mining; a science example; a specific toolkit and two examples (a classifier; image analysis – clouds); Week 9 reading – note this is PRE-READING (only two articles)

Types of data http://hsc.uwe.ac.uk/dataanalysis/quantIssuesTypes.asp

Data Mining – What it is: Extracting knowledge from large amounts of data. Motivation: our ability to collect data has expanded rapidly; it is impossible to analyze all of the data manually; data contains valuable information that can aid in decision making. Uses techniques from pattern recognition, machine learning, statistics, high-performance database systems, and OLAP, plus techniques unique to data mining (association rules). Data mining methods must be efficient and scalable. (Rahul Ramachandran, Information Technology and Systems Center, University of Alabama in Huntsville / National Space Science and Technology Center, 256-824-6064, rramachandran@itsc.uah.edu, http://www.itsc.uah.edu)

Data Mining – What it isn’t: Small scale – data mining methods are designed for large data sets; scale is one of the characteristics that distinguishes data mining applications from traditional machine learning applications. Foolproof – data mining techniques will discover patterns in any data, but the patterns discovered may be meaningless; it is up to the user to determine how to interpret the results (“Make it foolproof and they’ll just invent a better fool”). Magic – data mining techniques cannot generate information that is not present in the data; they can only find the patterns that are already there.

Data Mining – Types of Mining: Classification (supervised learning) – classifiers are created using labeled training samples; training samples are created by ground truth / experts; the classifier is later used to classify unknown samples. Clustering (unsupervised learning) – grouping objects into classes so that similar objects are in the same class and dissimilar objects are in different classes; discovers overall distribution patterns and relationships between attributes. Association rule mining – initially developed for market basket analysis; the goal is to discover relationships between attributes; uses include decision support, classification and clustering. Other types of mining: outlier analysis, concept / class description, time series analysis.
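
As a quick illustration (not from the original deck), here is a minimal Python/scikit-learn sketch contrasting the first two types: the same toy data run through a supervised classifier and an unsupervised clusterer.

# Minimal sketch (not part of the original deck): supervised vs. unsupervised
# learning on the same toy data.
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Classification (supervised): learns from the labeled training samples.
clf = GaussianNB().fit(X, y)
print("predicted class of first sample:", clf.predict(X[:1]))

# Clustering (unsupervised): groups the samples without using the labels.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster assignment of first sample:", km.labels_[0])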

Data Mining in the ‘new’ Distributed Data/Services Paradigm (Rahul Ramachandran, Peter Fox, Chris Lynnes, Robert Wolf, U.S. Nair). [Speaker note, 10/23/15: mining is iterative – changing parameters and rerunning.]

Science Motivation: Study the impact of a natural iron fertilization process (such as a dust storm) on plankton growth and subsequent dimethyl sulfide (DMS) production. Plankton plays an important role in the carbon cycle; plankton growth is strongly influenced by nutrient availability (Fe/Ph); dust deposition is an important source of Fe over the ocean; satellite data is an effective tool for monitoring the effects of dust fertilization. (Rahul Ramachandran, Peter Fox, Chris Lynnes, Robert Wolf, U.S. Nair)

Hypotheses: In remote ocean locations there is a positive correlation between area-averaged atmospheric aerosol loading and oceanic chlorophyll concentration; there is a time lag between oceanic dust deposition and photosynthetic activity. (Rahul Ramachandran, Peter Fox, Chris Lynnes, Robert Wolf, U.S. Nair)

Primary sources of ocean nutrients [diagram: ocean upwelling; wind-blown dust from the Sahara; sediments from rivers]. (Rahul Ramachandran, Peter Fox, Chris Lynnes, Robert Wolf, U.S. Nair)

Factors modulating the dust-ocean photosynthetic effect [diagram: clouds, SST (sea surface temperature), chlorophyll, dust, nutrients, Sahara]. (Rahul Ramachandran, Peter Fox, Chris Lynnes, Robert Wolf, U.S. Nair)

Objectives: Use satellite data to determine whether atmospheric dust loading and phytoplankton photosynthetic activity are correlated, and determine the physical processes responsible for the observed relationship. (Rahul Ramachandran, Peter Fox, Chris Lynnes, Robert Wolf, U.S. Nair)

Data and Method: Data sets obtained from two instruments, SeaWiFS (Sea-Viewing Wide Field-of-View Sensor) and MODIS (Moderate Resolution Imaging Spectroradiometer), during 2000–2006 are employed, including MODIS-derived AOT (Aerosol Optical Thickness). (Rahul Ramachandran, Peter Fox, Chris Lynnes, Robert Wolf, U.S. Nair)

The areas of study [Figure: annual SeaWiFS chlorophyll image for 2001]: 1 – Tropical North Atlantic Ocean; 2 – West coast of Central Africa; 3 – Patagonia; 4 – South Atlantic Ocean; 5 – South coast of Australia; 6 – Middle East; 7 – Coast of China; 8 – Arctic Ocean. (Rahul Ramachandran, Peter Fox, Chris Lynnes, Robert Wolf, U.S. Nair)

Tropical North Atlantic Ocean – dust from the Sahara Desert [Figure: chlorophyll and AOT time series; correlation coefficients are negative: -0.17504, -0.0902, -0.328, -0.4595, -0.14019, -0.7253, -0.1095 and -0.68497, -0.15874, -0.85611, -0.4467, -0.75102, -0.66448, -0.72603]. (Rahul Ramachandran, Peter Fox, Chris Lynnes, Robert Wolf, U.S. Nair)

Arabian Sea – dust from the Middle East [Figure: chlorophyll and AOT time series; correlation coefficients are positive: 0.59895, 0.66618, 0.37991, 0.45171, 0.52250, 0.36517, 0.5618 and 0.65211, 0.76650, 0.69797, 0.4412, 0.75071, 0.708625, 0.8495]. (Rahul Ramachandran, Peter Fox, Chris Lynnes, Robert Wolf, U.S. Nair)

Summary: Dust impacts the ocean’s photosynthetic activity – positive correlations in some areas, NEGATIVE correlations in other areas, especially in the Saharan basin. Hypothesis for explaining the observations of negative correlation: in areas that are not nutrient limited, dust reduces photosynthetic activity. But the effects of clouds and ocean currents also need to be considered, and the effects of dust need to be isolated – the MODIS AOT product includes contributions from dust, DMS, biomass burning, etc. (Rahul Ramachandran, Peter Fox, Chris Lynnes, Robert Wolf, U.S. Nair)

Models / types: Trade-off between accuracy and understandability – models range from “easy to understand” to incomprehensible: decision trees, rule induction, regression models, neural networks (increasingly harder to interpret).

Qualitative and Quantitative: Qualitative – provides insight into the data you are working with (“If city = New York and 30 < age < 35 …”; the important age demographic was previously 20 to 25; change the print campaign from Village Voice to New Yorker); requires interaction capabilities and good visualization. Quantitative – an automated process (“score new gene chip datasets with the error model every night at midnight”); bottom-line orientation. http://www.thearling.com/dmintro/dmintro_2.htm

Management: Creation of logical collections; physical data handling; interoperability support; security support; data ownership; metadata collection, management and access; persistence; knowledge and information discovery; data dissemination and publication. Derived from: Reagan Moore, “Data Management Systems for Scientific Applications”, Proceedings of the IFIP TC2/WG2.5 Working Conference on the Architecture of Scientific Software (IFIP Conference Proceedings Vol. 188), pp. 273-284, Kluwer, 2000, ISBN 0-7923-7339-1.

Provenance*: The origin or source from which something comes, the intention for its use, who/what it was generated for, the manner of manufacture, the history of subsequent owners, and a sense of the place and time of manufacture, production or discovery – documented in detail sufficient to allow reproducibility. (* Fox 2007)

ADaM – System Overview: Developed by the Information Technology and Systems Center at the University of Alabama in Huntsville. Consists of over 75 interoperable mining and image processing components. Each component is provided with a C++ application programming interface (API) and an executable that supports scripting tools (e.g. Perl, Python, Tcl, shell). ADaM components are lightweight and autonomous, and have been used successfully in a grid environment. ADaM has several translation components that provide data-level interoperability with other mining systems (such as WEKA and Orange) and point tools (such as libSVM and svmLight). Future versions will include Python wrappers and possibly web service interfaces.

ADaM 4.0 Components

ADaM Classification - Process: (1) Identify potential features which may characterize the phenomenon of interest; (2) generate a set of training instances, where each instance consists of a set of feature values and the corresponding class label; (3) describe the instances using the ARFF file format; (4) preprocess the data as necessary (normalize, sample, etc.); (5) split the data into training / test set(s) as appropriate; (6) train the classifier using the training set; (7) evaluate classifier performance using the test set (k-fold cross validation, leave-one-out, or other more sophisticated methods may also be used). The first three steps are problem / domain specific, while the others remain more or less the same.

ADaM Classification - Example: Starting with an ARFF file, the ADaM system will be used to create a Naïve Bayes classifier and evaluate it. The source data is an ARFF version of the Wisconsin breast cancer data from the University of California Irvine (UCI) Machine Learning Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html. The Naïve Bayes classifier will be trained to distinguish malignant vs. benign tumors based on nine characteristics.
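
As a rough modern analogue of this walkthrough (a sketch, not the original demo), here is the same split/train/evaluate cycle in scikit-learn. Note that sklearn’s built-in breast cancer dataset is the 30-feature WDBC variant of the Wisconsin data, not the 9-attribute set used on these slides.

# Sketch only: scikit-learn analogue of the ADaM walkthrough.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = load_breast_cancer(return_X_y=True)

# Split 2/3 training, 1/3 test (mirrors ITSC_Sample with -p 0.66).
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, train_size=0.66, random_state=0)

clf = GaussianNB().fit(X_trn, y_trn)   # train the classifier
y_pred = clf.predict(X_tst)            # apply it to the test set

print("accuracy:", accuracy_score(y_tst, y_pred))
print(confusion_matrix(y_tst, y_pred))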

Naïve Bayes Classification: A classification problem with m classes C1, C2, … Cm. Given an unknown sample X, the goal is to choose the class that is most likely based on statistics from the training data. P(Ci | X) can be computed using Bayes’ Theorem: P(Ci | X) = P(X | Ci) P(Ci) / P(X). [1] Equations from J. Han and M. Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann, 2001.

Naïve Bayes Classification: P(X) is constant for all classes, so finding the most likely class amounts to maximizing P(X | Ci) P(Ci). P(Ci) is the prior probability of class i; if the probabilities are not known, equal probabilities can be assumed. Assuming the attributes are conditionally independent: P(X | Ci) = P(x1 | Ci) P(x2 | Ci) … P(xn | Ci), where P(xk | Ci) is the probability density function for attribute k. [1] Equation from J. Han and M. Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann, 2001.

Naïve Bayes Classification: P(xk | Ci) is estimated from the training samples. Categorical attributes (non-numeric attributes): estimate P(xk | Ci) as the percentage of samples of class i with value xk; training involves counting the percentage of occurrence of each possible value for each class. Numeric attributes: also use statistics of the sample data to estimate P(xk | Ci); the actual form of the density function is generally not known, so a Gaussian density is often assumed; training involves computing the mean and variance of each attribute for each class.

Naïve Bayes Classification: Gaussian distribution for numeric attributes: P(xk | Ci) = (1 / (sqrt(2π) σik)) exp(-(xk - μik)² / (2 σik²)), where μik is the mean of attribute k observed in samples of class Ci, and σik is the standard deviation of attribute k observed in samples of class Ci. [1] Equation from J. Han and M. Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann, 2001.
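
A minimal sketch of the training and prediction steps just described (illustrative, not ADaM’s implementation): training reduces to computing a mean, standard deviation, and prior per class; prediction maximizes the log posterior.

# Sketch: Gaussian naive Bayes from scratch, per the formulas above.
import numpy as np

def nb_train(X, y):
    stats = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # (mean, std, prior) per class; small epsilon avoids zero variance
        stats[c] = (Xc.mean(axis=0), Xc.std(axis=0) + 1e-9, len(Xc) / len(X))
    return stats

def nb_predict(stats, x):
    def log_post(c):
        mu, sigma, prior = stats[c]
        # log P(Ci) + sum_k log P(xk | Ci), with Gaussian densities
        log_dens = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
        return np.log(prior) + log_dens.sum()
    return max(stats, key=log_post)

# Tiny usage example with invented data:
X = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 9.0], [7.8, 9.1]])
y = np.array([0, 0, 1, 1])
model = nb_train(X, y)
print(nb_predict(model, np.array([7.9, 9.0])))  # -> 1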

Sample Data Set – ARFF Format [the slide shows the breast cancer ARFF file itself, with its @relation and @attribute declarations followed by the @data rows]
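
The ARFF file is not reproduced in the transcript; as a hedged illustration of the format (attribute names follow the UCI documentation; the data rows shown are illustrative only), it looks roughly like this:

% illustrative sketch of the breast cancer data in ARFF format
@relation breast-cancer-wisconsin
@attribute clump_thickness numeric
@attribute cell_size_uniformity numeric
@attribute cell_shape_uniformity numeric
@attribute marginal_adhesion numeric
@attribute single_epi_cell_size numeric
@attribute bare_nuclei numeric
@attribute bland_chromatin numeric
@attribute normal_nucleoli numeric
@attribute mitoses numeric
@attribute class {benign, malignant}
@data
5,1,1,1,2,1,3,1,1,benign
8,10,10,8,7,10,9,7,1,malignant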

Data management Metadata? Data? File naming? Documentation?

Splitting the Samples ADaM has utilities for splitting data sets into disjoint groups for training and testing classifiers The simplest is ITSC_Sample, which splits the source data set into two disjoint subsets

Splitting the Samples: For this demo, we will split the breast cancer data set into two groups, one with 2/3 of the patterns and another with 1/3 of the patterns: ITSC_Sample -c class -i bcw.arff -o trn.arff -t tst.arff -p 0.66. The -i argument specifies the input file name; the -o and -t arguments specify the names of the two output files (-o = output one, -t = output two); the -p argument specifies the portion of data that goes into output one (trn.arff), the remainder going to output two (tst.arff); the -c argument tells the sample program which attribute is the class attribute.

Provenance? For this demo, we will split the breast cancer data set into two groups, one with 2/3 of the patterns and another with 1/3 of the patterns: ITSC_Sample -c class -i bcw.arff -o trn.arff -t tst.arff -p 0.66. What needs to be recorded and why? What about intermediate files, and why? How are they logically organized? ITSC_Sample -c class -i bcw.arff -o train_p.66.arff -t bcw_test.33.arff -p 0.66

Training the Classifier ADaM has several different types of classifiers Each classifier has a training method and an application method ADaM’s Naïve Bayes classifier has the following syntax:

Training the Classifier: For this demo, we will train a Naïve Bayes classifier: ITSC_NaiveBayesTrain -c class -i trn.arff -b bayes.txt. The -i argument specifies the input file name; the -c argument specifies the name of the class attribute; the -b argument specifies the name of the classifier file.

Applying the Classifier Once trained, the Naïve Bayes classifier can be used to classify unknown instances The syntax for ADaM’s Naïve Bayes classifier is as follows:

Applying the Classifier: For this demo, the classifier is run as follows: ITSC_NaiveBayesApply -c class -i tst.arff -b bayes.txt -o res_tst.arff. The -i argument specifies the input file name; the -c argument specifies the name of the class attribute; the -b argument specifies the name of the classifier file; the -o argument specifies the name of the result file.

Evaluating Classifier Performance By applying the classifier to a test set where the correct class is known in advance, it is possible to compare the expected output to the actual output. The ITSC_Accuracy utility performs this function:

Confusion matrix (http://en.wikipedia.org/wiki/Confusion_matrix):

Classified \ Actual |  1               |  0
1                   |  TRUE POSITIVES  |  FALSE POSITIVES
0                   |  FALSE NEGATIVES |  TRUE NEGATIVES

Gives a guide to accuracy, but the samples (i.e. bias) are important to take into account.
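
A small worked sketch of the scores commonly derived from these four counts (the example counts are invented purely for illustration):

# Sketch: deriving common scores from confusion-matrix counts.
def scores(tp, fp, fn, tn):
    accuracy  = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)    # of those classified 1, how many really were 1
    recall    = tp / (tp + fn)    # of the actual 1s, how many were found
    return accuracy, precision, recall

print(scores(tp=90, fp=10, fn=5, tn=95))   # -> (0.925, 0.9, 0.947...)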

Evaluating Classifier Performance: For this demo, ITSC_Accuracy is run as follows: ITSC_Accuracy -c class -t res_tst.arff -v tst.arff -o acc_tst.txt

Python Script for Classification
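
The script itself is not reproduced in the transcript. A plausible reconstruction (assuming the ADaM ITSC_* executables are on the PATH, and using only the flags shown on the preceding slides):

# Hedged reconstruction of the classification script (the original is not
# in the transcript). Assumes the ITSC_* tools are on the PATH.
import subprocess

def run(cmd):
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

run(["ITSC_Sample", "-c", "class", "-i", "bcw.arff",
     "-o", "trn.arff", "-t", "tst.arff", "-p", "0.66"])
run(["ITSC_NaiveBayesTrain", "-c", "class", "-i", "trn.arff", "-b", "bayes.txt"])
run(["ITSC_NaiveBayesApply", "-c", "class", "-i", "tst.arff",
     "-b", "bayes.txt", "-o", "res_tst.arff"])
run(["ITSC_Accuracy", "-c", "class", "-t", "res_tst.arff",
     "-v", "tst.arff", "-o", "acc_tst.txt"])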

How would you modify this?

What is the provenance?

ADaM Image Classification: Classification of image data is a bit more involved, as an additional set of steps must be performed to extract useful features from the images before classification can be performed. In addition, it is useful to transform the data back into image format for visualization purposes. As an example problem, we will consider detection of cumulus cloud fields in GOES satellite images. GOES satellites produce a 5-channel image every 15 minutes; the classifier must label each pixel as either belonging to a cumulus cloud field or not, based on the GOES data. Algorithms based on spectral properties often miss cumulus clouds because of the low resolution of the IR channels and the small size of the clouds; texture features computed from the GOES visible image provide a means to detect cumulus cloud fields.

GOES Images - Preprocessing: Segmentation is based only on the high-resolution (1 km) visible channel. In order to remove the effects of light reflected from the Earth’s surface, a visible reference background image is constructed for each time of day; the reference image is subtracted from the visible image before it is segmented. GOES image patches containing cumulus cloud regions, other cloud regions, and background were selected. Independent experts labeled each pixel of the selected image patches as cumulus cloud or not, and the expert labels were combined to form a single “truth” image for each of the original image patches. In cases where the experts disagreed, the truth image was given a “don’t know” value.

GOES Images - Example [figures: GOES visible image and expert labels]

Image Quantization Some texture features perform better when the image is quantized to some small number of levels before the features are computed. ITSC_RelLevel performs local image quantization

Image Quantization: For this demo, we will reduce the number of levels from 256 to just three using local image statistics: ITSC_RelLevel -d -s 30 -i src.bin -o q4.bin -k. The -i argument specifies the input file name; the -o argument specifies the output file name; the -d argument tells the program to use the standard deviation to set the cutoffs instead of a fixed value; the -k option tells the program to keep values in the range 0, 1, 2 rather than normalizing to 0..1; the -s argument indicates the size of the local area used to compute statistics.
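
As a sketch of the idea (an illustration, not ADaM’s exact algorithm), local three-level quantization with standard-deviation cutoffs might look like this:

# Sketch of local three-level quantization (illustrative only). Pixels more
# than one local standard deviation below/above the local mean become 0/2;
# everything else becomes 1.
import numpy as np
from scipy.ndimage import uniform_filter

def rel_level(img, size=30):
    img = img.astype(float)
    mean = uniform_filter(img, size)                  # local mean
    var = uniform_filter(img**2, size) - mean**2      # local variance
    std = np.sqrt(np.clip(var, 0, None))
    out = np.ones_like(img, dtype=np.uint8)           # middle level = 1
    out[img < mean - std] = 0
    out[img > mean + std] = 2
    return out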

Computing Texture Features ADaM is currently able to compute five different types of texture features: gray level cooccurrence, gray level run length, association rules, Gabor filters, and MRF models The syntax for gray level run length computation is:

Computing Texture Features: For this demo, we will compute gray level run length features using a tile size of 25: ITSC_Glrl -i q4.bin -o glrl.arff -l 3 -B -t 25. The -i argument specifies the input file name; the -o argument specifies the output file name; the -l argument tells the program the number of levels in the input image; the -B option tells the program to write a binary version of the ARFF file (the default is ASCII); the -t argument indicates the size of the tiles used to compute the gray level run length features.
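
For intuition (illustrative only; ITSC_Glrl computes a fuller gray-level run-length feature set over tiles), run-length statistics along image rows can be sketched as:

# Sketch: run-length statistics along image rows (illustrative only).
import numpy as np

def run_lengths(tile):
    runs = []
    for row in tile:
        start = 0
        for j in range(1, len(row) + 1):
            if j == len(row) or row[j] != row[start]:
                runs.append(j - start)   # length of a constant-value run
                start = j
    return np.array(runs)

def glrl_features(tile):
    r = run_lengths(tile)
    sre = np.mean(1.0 / r**2)   # short-run emphasis
    lre = np.mean(r.astype(float)**2)   # long-run emphasis
    return sre, lre, len(r)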

Provenance alert! For this demo, we will compute gray level run length features using a tile size of 25: ITSC_Glrl -i q4.bin -o glrl.arff -l 3 -B -t 25. What needs to be documented here and why? ITSC_Glrl ???? Tell me.

Converting the Label Images Since the labels are in the form of images, it is necessary to convert them to vector form ITSC_CvtImageToArff will do this:

Converting ?????? Since the labels are in the form of images, it is necessary to convert them to vector form Consequences? Do you save them? Discussion?

Converting the Label Images: The labels can be converted to vector form using: ITSC_CvtImageToArff -i lbl.bin -o lbl.arff -B. The -i argument specifies the input file name; the -o argument specifies the output file name; the -B argument tells the program to write the output file in binary form (the default is ASCII).

Labeling the Patterns Once the labels are in vector form, they can be appended to the patterns produced by ITSC_Glrl ITSC_LabelPatterns will do this:

Labeling the Patterns: The labels are assigned to patterns as follows: ITSC_LabelPatterns -i glrl.arff -c class -l lbl.bin -L lbl.arff -o all.arff -B. The -i argument specifies the input file name (patterns); the -o argument specifies the output file name; the -c argument specifies the name of the class attribute in the pattern set; the -l argument specifies the name of the label attribute in the label set; the -L argument specifies the name of the input label file; the -B argument tells the program to write the output file in binary form (the default is ASCII).

Eliminating “Don’t Know” Patterns Some of the original pixels were classified differently by different experts and marked as “don’t know” The corresponding patterns can be removed from the training set using ITSC_Subset:

Eliminating “Don’t Know” Patterns: ITSC_Subset is used to remove patterns with an unclear class assignment. The subset is generated based on the value of the class attribute: ITSC_Subset -i all.arff -o subset.arff -a class -r 0 1 -B. The -i argument specifies the input file name; the -o argument specifies the output file name; the -a argument tells which attribute to test; the -r argument gives the legal range of the attribute; the -B argument tells the program to write the output file in binary form (the default is ASCII).

Selecting Random Samples: Random samples are selected from the original training data using the same ITSC_Sample program shown in the previous demo, used in a slightly different way: ITSC_Sample -i subset.arff -c class -o s1.arff -n 2000. The -i argument specifies the input file name; the -o argument specifies the output file name; the -c argument specifies the name of the class attribute; the -n option tells the program to select an equal number of random samples (in this case 2000) from each class.

Python Script for Sample Creation
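
The script is not reproduced in the transcript. A plausible reconstruction of the per-image pipeline, using only the commands and flags shown on the preceding slides:

# Hedged reconstruction of the sample-creation script for one image
# (the original is not in the transcript).
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

def make_samples(src, lbl, out, n=2000):
    run(["ITSC_RelLevel", "-d", "-s", "30", "-i", src, "-o", "q4.bin", "-k"])
    run(["ITSC_Glrl", "-i", "q4.bin", "-o", "glrl.arff", "-l", "3", "-B", "-t", "25"])
    run(["ITSC_CvtImageToArff", "-i", lbl, "-o", "lbl.arff", "-B"])
    run(["ITSC_LabelPatterns", "-i", "glrl.arff", "-c", "class", "-l", lbl,
         "-L", "lbl.arff", "-o", "all.arff", "-B"])
    run(["ITSC_Subset", "-i", "all.arff", "-o", "subset.arff",
         "-a", "class", "-r", "0", "1", "-B"])
    run(["ITSC_Sample", "-i", "subset.arff", "-c", "class", "-o", out, "-n", str(n)])

make_samples("src.bin", "lbl.bin", "s1.arff")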

What modifications here??

Merging Samples / Multiple Images The procedure up to this point has created a random subset of points from a particular image. Subsets from multiple images can be combined using ITSC_MergePatterns:

Merging Samples / Multiple Images: Multiple pattern sets are merged using the following command: ITSC_MergePatterns -c class -o merged.arff -i s1.arff s2.arff. The -i argument specifies the input file names; the -o argument specifies the output file name; the -c argument specifies the name of the class attribute.

Python Script for Training
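
Again not reproduced in the transcript; a plausible reconstruction of the final merge-and-train step, using only the commands shown above:

# Hedged reconstruction: merge per-image samples, then train the classifier.
import subprocess

subprocess.run(["ITSC_MergePatterns", "-c", "class", "-o", "merged.arff",
                "-i", "s1.arff", "s2.arff"], check=True)
subprocess.run(["ITSC_NaiveBayesTrain", "-c", "class", "-i", "merged.arff",
                "-b", "bayes.txt"], check=True)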

Results of Classifier Evaluation: The results of running this procedure using five sample images of size 500x500 are as follows: [results table shown on the slide; not reproduced in the transcript]

Applying the Classifier to Images: Once the classifier is trained, it can be applied to segment images. One further program is required at the end to convert the classified patterns back into an image:

Python Function for Segmentation
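
Not reproduced in the transcript. A sketch, with "ITSC_CvtArffToImage" as a purely hypothetical name for the pattern-to-image converter mentioned on the previous slide:

# Hedged sketch of a segmentation function (the original is not in the
# transcript). "ITSC_CvtArffToImage" is a hypothetical converter name.
import subprocess

def segment(image_arff, model="bayes.txt", out_image="seg.bin"):
    subprocess.run(["ITSC_NaiveBayesApply", "-c", "class", "-i", image_arff,
                    "-b", model, "-o", "res.arff"], check=True)
    subprocess.run(["ITSC_CvtArffToImage", "-i", "res.arff", "-o", out_image],
                   check=True)  # hypothetical pattern-to-image converter
    return out_image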

Sample Image Results [figures: expert labels and segmentation result]

Remarks: The procedure illustrated here is one specific example of ADaM’s capabilities; there are many other classifiers, texture features and other tools that could be used for this problem. Since all of the algorithms of a particular type work in more or less the same way, the same general procedure could be used with other tools. DOWNLOAD the ADaM toolkit: http://datamining.itsc.uah.edu/adam/

NumPy, SciPy http://scikit-learn.org/stable/ http://orange.biolab.si

R http://www.rdatamining.com RStudio

Management What did you learn? Provenance elements? How to deal with both?

Reading this week – (week 4)

Project Teams (A4) Byrne, Corey; Cheng, Jie; Deb, Arijit; De Los Santos, Hannah; Zednik, Stephan Cattarin, Lee; Del Priore, James; Dwivedi, Aayush; Gadomski, Paula; Wang, Jixuan; Xu, Zhe Grivas, Genevieve; Gentyala, Vijay; Ho, Chien-Wei; Khachaturyan, David; Renus, Christopher Falk, Jeremy; Kolankowski, Sophia; Kuzmin, Konstantin; Peshin, Ankur; Yu, Guo Li, Lilli; Liu, Guohui; Russo, Robert; Sharma, Sidharth; Thorne, Brandon Francis, Kyle; Liu, Yi; Llewellyn, Maxwell; Sedlacek, Aaron; Troughia, Chandan Singh; You, Charles Gass, Damon; Lu, Yongqian; Machado, Ylonka; Norris, Spencer; Subhedar Shradda, Arun Maeser, Adrienne; Minster, Zachary; Mukherjee Partha, Sarathi; Pan, Xiaoman; Tse, Timothy Boulter, James; Nunez, Nicole; Prabhakaran, Sidharth; Qui, Yichen; Ward, David Anyone missing?