Data Science: applications and skills required
A Society that is “Always On”
• Society, organizations, and people are “Always On”.
• Your “data plan” keeps you (always) in touch.
• Data are collected about anything, at any time, and at any place.
• Examples: registering for courses, checking or posting on social media, paying tolls with your PeachPass, tracking your exercise with FitBit.
Internet of Events
Examples of Big Data
• Bit (0 or 1) and byte (8 bits: big enough for one character)
• Kilobyte ≈ 1000 bytes (1024 to be exact)
• Mega-, giga-, and tera- are common now
• Peta-, exa-, and zetta- are the next prefixes up
• An IDC study estimates that the amount of digital information stored in 2014 already exceeded 4 zettabytes and predicts that the “digital universe” will grow to 44 zettabytes in 2020. The study characterizes 44 zettabytes as “6.6 stacks of iPads from Earth to the Moon”.
• Twitter produces over 90 million tweets per day.
• eBay uses two data warehouses, at 7.5 PB and 40 PB, as well as a 40 PB Hadoop cluster for search, consumer recommendations, and merchandising.
• Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress.
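The unit arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes the 1024-based convention the slide uses ("1024 to be exact"); the names `PREFIXES`, `bytes_in`, and `convert` are illustrative and not part of the original deck.

```python
# Binary (1024-based) prefixes, in increasing order of size.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta"]


def bytes_in(prefix: str) -> int:
    """Number of bytes in one <prefix>byte, with each step being a factor of 1024."""
    return 1024 ** (PREFIXES.index(prefix) + 1)


def convert(value: float, from_prefix: str, to_prefix: str) -> float:
    """Convert a quantity between two prefixes, e.g. petabytes to terabytes."""
    return value * bytes_in(from_prefix) / bytes_in(to_prefix)


# The Walmart figure: 2.5 petabytes expressed in terabytes.
print(convert(2.5, "peta", "tera"))  # 2560.0
```

With 1000-based (decimal) prefixes the same conversion would give 2500 terabytes, so the slide's 2560 figure implies the binary convention.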
Big Data – characteristics
• Volume: the quantity of generated and stored data.
• Variety: the type and nature of the data.
• Velocity: the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.
• Veracity: the quality of captured data can vary greatly, affecting accurate analysis.
• Variability: inconsistency of the data set can hamper processes to handle and manage it.
The Big Data Mindset
• Design marketing processes with data in mind: reengineer marketing processes to collect relevant data.
• Engage in R&D everywhere: promote a culture of testing throughout the organization.
• Use predictive analytics: identify customer patterns and generate targeted offers.
• Challenge conventional wisdom: data analytics can provide definitive answers, so there is no excuse for using the status quo as a default.
A Big Challenge … One of the main challenges for today’s organizations is to extract information and value from the data stored in their information systems.
Data Science: definition
Data Science: illustration
Data science aims to turn data into real value.
[Figure labels: Data, Value; Extraction, Transformation, Learning; Structured (DB, spreadsheet), Unstructured (email, text); Big, Small; Static, Streaming; any type of visualization delivering insights]
Contributing Disciplines
Courses required in the CSC/CPS programs:
• Programming
• STA/MAT courses
• Database
• DM/ML methods
• Graphics & Visualization
Data Scientists: what they do
Assist organizations in turning data into value. A data scientist answers questions such as:
• (Reporting) What happened?
• (Diagnosis) Why did it happen?
• (Prediction) What will happen?
• (Recommendation) What is the best that can happen?
Positions in a DS Team
• Data analyst
• Data engineer (data wrangler)
• Data scientist
• Specialists: algorithms & performance, visualization, big data tools
8 Skills You Need to Be a Data Scientist http://blog.udacity.com/2014/11/data-science-job-skills.html
Machine Learning/Data Mining Tasks
• Classification: map data into predefined groups
• Regression: map a data item to a real-valued prediction variable
• Prediction: similar to classification, but deals with a future state
• Clustering: similar to classification, but the groups are defined by the data
• Association rules: identify associations among data
• Sequence discovery: determine sequential patterns in data
What can we do in class?
• Data science awareness in CSC 125: formulas, functions, charts (with Excel), queries (Access); importing, transforming, sorting, and filtering
• CSC/CPS courses: programming, DB, software engineering, visualization, HPC
• Intro to Data Science: data exploration and processing; machine learning
Sample Application: Customer Attrition
• Customer attrition analysis (telecom): churn or not churn
• Dataset with 3333 rows (customers) and 21 columns:
  • State, area code, phone number
  • AccountLength, IntlPlan, VMailPlan, VMailMessage
  • Minutes, number of calls, and charge for Day/Eve/Night time
  • IntlMins, IntlCalls, IntlCharge
  • CustServCalls
Going Through the Process
• Data exploration: correlations between DayMins, DayCalls, and DayCharge
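A correlation check of this kind can be sketched in plain Python. The values below are made-up toy data, not rows from the 3333-customer dataset, and the flat per-minute rate used to derive `day_charge` is an assumption made purely so that the example has a known answer (a charge that is a fixed multiple of minutes correlates perfectly with minutes).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Toy values (hypothetical, for illustration only).
day_mins = [120.0, 180.5, 240.0, 95.3, 310.2]
day_calls = [110, 95, 102, 88, 120]
day_charge = [0.17 * m for m in day_mins]  # assumed flat rate per minute

print("DayMins vs DayCharge:", pearson(day_mins, day_charge))
print("DayMins vs DayCalls: ", pearson(day_mins, day_calls))
```

A correlation near 1 between two columns suggests that one of them is redundant for modeling purposes.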
Going Through the Process
• Data exploration: ratios
  • Impact of international plan
  • Impact of number of service calls
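The ratio analysis can be sketched as a churn rate per group. The records below are hypothetical toy data, and the column name `Churn` is an assumption (the deck only says "churn or not churn"); `IntlPlan` and `CustServCalls` follow the column names listed earlier.

```python
from collections import defaultdict


def churn_rate_by(records, key):
    """Fraction of churned customers within each distinct value of `key`."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        churned[group] += int(record["Churn"])
    return {group: churned[group] / totals[group] for group in totals}


# Hypothetical customers, for illustration only.
customers = [
    {"IntlPlan": "yes", "CustServCalls": 4, "Churn": True},
    {"IntlPlan": "yes", "CustServCalls": 1, "Churn": False},
    {"IntlPlan": "no", "CustServCalls": 0, "Churn": False},
    {"IntlPlan": "no", "CustServCalls": 1, "Churn": False},
    {"IntlPlan": "no", "CustServCalls": 4, "Churn": True},
    {"IntlPlan": "no", "CustServCalls": 0, "Churn": False},
]

print(churn_rate_by(customers, "IntlPlan"))
print(churn_rate_by(customers, "CustServCalls"))
```

Comparing these per-group rates against the overall churn rate shows which attribute values are associated with higher attrition.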
Visual Data Mining
• Impact of multiple variables: number of service calls and day minutes
Machine Learning Models
• Decision tree
• Divide the data into training and test groups (with similar distributions)
• Training group (80%, or 2666/3333): build the model
• Test group (the remaining 20%): evaluate the model
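One way to obtain training and test groups with similar distributions is a stratified split, which samples each class separately. The sketch below is an assumption about how such a split could be done, not the procedure used for the deck's results; `stratified_split` and the toy labels are illustrative.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.8, seed=0):
    """Split row indices so each class keeps roughly the same share in both groups."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_frac)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Toy labels (hypothetical): 20 churners and 80 non-churners.
labels = [True] * 20 + [False] * 80
train, test = stratified_split(labels)

print(len(train), len(test))            # 80 20
print(sum(labels[i] for i in train))    # 16 churners in the training group
```

Both groups keep the 20% churn share of the full toy set, which is what "similar distributions" asks for.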
Machine Learning Models
• The tree model
• Evaluation: correlation is 0.7496375
Sample Application: process mining
• Adds a process perspective to ML and DM
• Seeks confrontation between event data (observed) and process models (hand-made or discovered)
Merging Framework
Affinity Function
How to evaluate the strength of the binding between antigen and antibody:
• occurrence frequency (AOF)
• temporal relation (OLT)
• event attribute value (EAV)
Occurrence Frequency <𝑎,𝑏,𝑐,𝑑,𝑒,𝑓> An execution sequence
Occurrence Frequency <𝑎,𝑏,𝑐,𝑑,𝑒,𝑓> <𝑎,𝑏,𝑐,𝑑,𝑒,𝑓> The occurrence frequency of <𝑎,𝑏,𝑐,𝑑,𝑒,𝑓> is 2 in this fragment of the log. (18 June 2018)
Occurrence Frequency The occurrence frequency of <𝑎,𝑏,𝑐,𝑑,𝑒,𝑓> is 2. The occurrence frequency of <𝐼,𝐽,𝐿,𝑀,𝑄,𝑅> is 2. When two cases match, the occurrence frequencies of their execution sequences are statistically ‘equivalent’.
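Counting how often each execution sequence appears in a log can be sketched with a `Counter`. The log fragment below is a toy example: the two <a,b,c,d,e,f> traces mirror the slide, while the third trace is hypothetical and only there to show that different sequences are counted separately.

```python
from collections import Counter


def occurrence_frequencies(log):
    """Map each execution sequence (as a tuple of activities) to its count in the log."""
    return Counter(tuple(trace) for trace in log)


log_fragment = [
    ["a", "b", "c", "d", "e", "f"],
    ["a", "b", "c", "e", "d", "f"],  # hypothetical variant, not from the slides
    ["a", "b", "c", "d", "e", "f"],
]

frequencies = occurrence_frequencies(log_fragment)
print(frequencies[("a", "b", "c", "d", "e", "f")])  # 2
```

Two cases from different logs are then candidates for a match when the frequencies of their execution sequences are comparable.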
Temporal Relation If two cases match, there is some time overlap between them. IN000001 2016-06-06T23:08:04 2016-06-07T14:03:02 TK00005 2016-06-07T09:18:32 2016-06-07T16:55:37
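The overlap test can be sketched with the standard `datetime` module, using the two cases shown on this slide. The function name `overlap` is illustrative; it returns the length of the shared time window, or zero when the cases do not overlap.

```python
from datetime import datetime, timedelta


def overlap(start_a, end_a, start_b, end_b):
    """Length of the time window shared by two cases (zero if they do not overlap)."""
    latest_start = max(start_a, start_b)
    earliest_end = min(end_a, end_b)
    return max(earliest_end - latest_start, timedelta(0))


# Case IN000001 and case TK00005, with the timestamps from the slide.
in_start = datetime.fromisoformat("2016-06-06T23:08:04")
in_end = datetime.fromisoformat("2016-06-07T14:03:02")
tk_start = datetime.fromisoformat("2016-06-07T09:18:32")
tk_end = datetime.fromisoformat("2016-06-07T16:55:37")

shared = overlap(in_start, in_end, tk_start, tk_end)
print(shared)  # 4:44:30
```

A positive result means the two cases were active at the same time, which is the condition the slide states for a possible match.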
Temporal Relation
Event Attribute Value In real-life processes, values are often passed from event to event between two cases that belong to two different logs but to the same overall process.
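One simple way to look for values passed between two cases is to collect the attribute values of every event in each case and intersect them. This is only a sketch under assumptions: the attribute names and values below are hypothetical, and the deck does not say how its affinity function actually scores shared values.

```python
def attribute_values(case):
    """All attribute values that occur in any event of a case."""
    return {value for event in case for value in event.values()}


def shared_values(case_a, case_b):
    """Attribute values that appear in both cases, regardless of attribute name."""
    return attribute_values(case_a) & attribute_values(case_b)


# Hypothetical events from two different logs.
case_from_log_1 = [
    {"activity": "register", "customer": "C-42"},
    {"activity": "assign", "team": "support"},
]
case_from_log_2 = [
    {"task": "open ticket", "client_ref": "C-42"},
    {"task": "close ticket", "resolver": "ops"},
]

print(shared_values(case_from_log_1, case_from_log_2))  # {'C-42'}
```

Comparing values rather than attribute names matters here because the two logs may label the same piece of information differently.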
Process Mining Apps in Healthcare
• What happened?
  • What is the typical treatment of patients having acute myeloid leukemia?
  • What is the typical working day of a surgeon?
• Why did it happen?
  • What caused the unusual amount of incidents in the department?
  • Why was the service level agreement not reached?
  • What caused the long waiting list?
• What will happen?
  • Is this patient likely to deviate from the normal treatment plan?
  • How many beds are needed tomorrow?
  • Is it possible to handle these five new cases in time?
• What is the best that can happen?
  • Which check should be done first to reduce flow time?
  • How many physicians are needed to reduce the waiting list by 50%?
Processes for Experiment