CS910: Foundations of Data Analytics Graham Cormode Introduction.



Agenda
• Introductions
• Introduction to Foundations of Data Analytics
• Course Admin
• Marketplace Survey

Data Analytics
• What is Data Analytics?
  – The science of studying data to draw conclusions
• Why?
  – More organizations are collecting more data than ever before
    (business, government, healthcare, charity – everything!)
  – This data holds many insights into their operations and beyond
  – Data Analytics is required to extract these insights
    (requires analytical, statistical and computational skills)
  – A lot of focus (investment) on analytics/big data/data science

Analytics in Action: Flu Trends

Digging deeper into Flu Trends
• Privacy concerns raised: users did not consent to this use of data
  – The slippery slope argument: what else will data be used for?
• Accuracy concerns raised: will it remain accurate?
  – Initial report of 0.97 correlation with official CDC data
  – Prevalence overestimated by 50% in 2013 flu season
  – Possible explanation: media speculation about flu epidemic caused more searches for related terms
  – Models need to be continually tuned and refined
• Lesson: data analytics can be a moving target…
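The two accuracy figures on this slide (a correlation with official data, and a percentage overestimate) can be illustrated with a minimal sketch. The numbers below are hypothetical and are not Flu Trends or CDC data; the function names `pearson` and `relative_overestimate` are illustrative only.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def relative_overestimate(estimate, official):
    """Fraction by which an estimate exceeds the official figure."""
    return (estimate - official) / official

# Hypothetical weekly prevalence figures (made up for illustration).
official = [1.0, 1.5, 2.5, 4.0, 3.0, 2.0]
estimated = [1.1, 1.6, 2.4, 4.3, 3.2, 1.9]

r = pearson(estimated, official)
# An estimate of 6.0 against an official 4.0 is a 50% overestimate.
peak_over = relative_overestimate(6.0, 4.0)
print(f"correlation = {r:.3f}, overestimate = {peak_over:.0%}")
```

A high correlation on historical data does not guarantee that later estimates stay close to the official figures, which is the point of the slide's "moving target" lesson.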

This Module
• What is Foundations of Data Analytics about?
  – The tools to manipulate and aggregate data
  – Dealing with data problems (missing values, changing format)
  – Models to represent data
  – Building and testing hypotheses about the data
  – Algorithms to analyze data
  – Ways to scale up analytics to big data
  – Privacy consequences of data analytics
• This module emphasizes the foundations
  – Will focus on the theoretical underpinnings of these methods
  – Will require some mathematical and computational thinking

Topics in Detail
1. Data basics, the different kinds of data
  – Refresher on probability, distributions, significance tests
2. Introduction to analytics, case studies
  – How analytics is used in practice. Examples from YouTube, Facebook, Kaggle, and Twitter.
3. Basic tools: command line, plotting, programming tools
4. Modeling data via regression
  – Linear regression, least squares, logistic regression
5. Classification to predict values
  – Decision tree, Naive Bayes, Support Vector Machines
6. Clustering methods
  – Finding clusters in data (hierarchical, k-means, k-center)

Topics in Detail (continued)
7. Recommender systems
  – Making recommendations (movies, music, products) for people
8. Time series data
  – Predicting data from a sequence of observations
9. Database issues
  – Data quality, data cleaning, relational data, SQL, NoSQL
10. Data structures for big data and data streams
  – The Bloom filter and sketch data structures
11. Graphs and networks
  – Graph representations of data (application to social networks)
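As a taste of topic 10, here is a minimal sketch of a Bloom filter, a compact structure for approximate set membership. It is an illustration only, not the version taught in the module: the class name, the default sizes, and the use of salted SHA-256 as the hash family are all assumptions made for this example.

```python
import hashlib

class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit position, for readability rather than compactness.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True if every position is set; the item may still be absent.
        return all(self.bits[pos] for pos in self._positions(item))

seen = BloomFilter()
seen.add("cs910")
print(seen.might_contain("cs910"))  # an added item is always reported present
```

The design trade-off is space against accuracy: more bits or a better-chosen number of hash functions lowers the false-positive rate, but an item that was added is never reported as missing.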

Course Administration
• Lectures start at 5 past the hour, should finish by 5 to the hour
  – To allow time to get to next lecture/get held up by traffic
• Attendance is not taken
• Phones off/silent in lectures
  – No one wants to hear your “wacky” ringtone
• Laptops/tablets/phones permitted but not recommended
  – Too easy to get distracted messaging/surfing
• Questions welcomed in lectures
  – Quick clarifications at any point
  – Detailed queries best saved for the end, or via

Course Assessment
• Exam in 2015
  – 2 hours, contributes 50% to final grade
• Project worth 40%, due 12 January 2015 (Wk 2 Term 2)
  – Project briefing lecture in a couple of weeks
• 5 assessed homeworks applying skills from lectures (10%)
  – Due dates: noon, Weeks 2, 4, 6, 8, 10
  – Lab drop-in sessions: 10am, Weeks 2, 4, 6, 8, 10
  – Lab tutors: Faiz Sayyid and Bo Gao
• Updates/news on course webpage and via
  – www2.warwick.ac.uk/fac/sci/dcs/teaching/modules/cs910
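The assessment weights above combine as a simple weighted sum. A minimal sketch follows; the marks are hypothetical, and treating the five homeworks as equally weighted within their 10% is an assumption, since the slide does not say how they are combined.

```python
# Weights taken from the slide: exam 50%, project 40%, homeworks 10%.
WEIGHTS = {"exam": 0.50, "project": 0.40, "homework": 0.10}

def final_mark(exam, project, homeworks):
    """Weighted final mark on a 0-100 scale.

    Assumption: homeworks are averaged with equal weight.
    """
    homework_avg = sum(homeworks) / len(homeworks)
    return (WEIGHTS["exam"] * exam
            + WEIGHTS["project"] * project
            + WEIGHTS["homework"] * homework_avg)

# Hypothetical marks: 60 in the exam, 70 for the project, 80 on each homework.
print(final_mark(60, 70, [80, 80, 80, 80, 80]))
```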

First piece of coursework
• Warm-up exercise in using Weka
  – Load a data set, explore it, make observations
  – Hopefully will not be taxing
  – Can do whenever you like, wherever you like
  – Lab session: tutors on hand to help and advise
• Submission: complete the worksheet, hand in to CS reception
  – Deadline: next Wednesday 12 noon

Course Material
• A developing topic, so no textbook covers everything
• Slides will be put on the course webpage after lectures
  – Handouts available at the start of each section
• Plenty of material on the web on each topic
  – Wikipedia is a good place to start (but not to finish)
• Data Mining: Concepts and Techniques, 3rd ed., Han, Kamber, Pei
  – Good coverage of many core data analytic ideas
  – Text available online via Warwick Library (ebook)
  – Also useful for CS909 Data Mining
• Other sources will be linked to from slides, course page

DESIRABLE SKILLS IN DATA ANALYTICS

Senior Data Scientist - Expedia
The successful candidate will have the following skills and experience:
• A (Masters or PhD) background in computer science or statistics with a strong machine learning component.
• Expert knowledge of at least one of the following programming languages or equivalents: Ruby, Python, R, and/or functional languages such as Lisp, Haskell or Erlang.
• A very good understanding of database technologies (Hadoop, Mongo or equivalent) and standard relational database structures, along with query languages such as Hive, Pig and SQL.
• As well as these programming skills, the candidate should be able to demonstrate a very good understanding of one of the following: Bayesian networks, neural networks, heuristics, support vector machines, genetic algorithms, or PAC learning; along with good knowledge of statistical classification techniques such as k-means and hierarchical clustering, partition trees, and logistic regression.

Yahoo! Experienced Data Analyst
• We are looking for a Data Analyst with industry experience who is able to take large datasets and analyze them using statistical methods to draw out insights and data trends. They will be experienced at analysis techniques using Excel or R and comfortable using Unix, working with big data, scripting with Perl/Python, and able to quickly construct SQL queries to interrogate databases.
• Independence, logical reasoning, and motivation are important. Being able to work in an Agile environment is very important. The candidate should demonstrate the ability to learn new technologies and be happy to take on responsibility. They should have excellent communication skills and be able to present their findings in a clear and concise way.

Google Statistician/Engineering Analyst
• MS or PhD in Statistics or other quantitative disciplines such as Engineering, Applied Mathematics, etc.
• Broad work experience with large data sets.
• Considerable practical experience in quantitative analysis.
• Specific positions can benefit from experience in one or more of:
  – Operations Research
  – Online advertising, search, commerce
  – Machine Learning
  – Languages such as Python, JavaScript
  – Forecasting, Time-series modeling
  – Proficiency in foreign languages
• Excellent written and verbal presentation skills.

“One of the largest global tech companies”
The company buys ad impressions in real-time auctions and algorithmically delivers the most relevant ad possible.
• Identify and work with large datasets from multiple sources
• Visualize and analyse data, developing hypotheses and ideas for experiments.
• Run experiments to improve the relevance and efficiency of all advertising.
• Identify relevant research from industry and academia.
Preferred Qualifications
• Masters in a relevant field and/or experience is highly regarded
• Iteratively analysing data, integrating new data, experimenting and optimizing
• Near real-time data analysis, feeding into decisioning systems.
• Practical experience in a variety of machine learning and modelling techniques including time series forecasting, decision trees, multi-linear/logistic regression and Bayesian analysis.
• Presenting data effectively.
• Experience using R, SAS or equivalent.

Facebook Quantitative Engineer
Requirements
• MS/PhD in computer science, computational statistics, computational econometrics, operations research or related field.
• Hands-on, deep knowledge of Python as a user of scientific libraries (numpy, scipy, pandas, scikit-learn, etc.) and as a generalist. Alternatively, R or MATLAB with strong C++ or Java experience.
• 2+ years experience and an excellent understanding of machine learning techniques (classification, clustering, dimensionality reduction)
• 2+ years hands-on experience working with large datasets (>10TB) on distributed systems.
• Good understanding of fundamentals of statistics.
• Good understanding of fundamentals of SQL.

Recommended Reading
• Data Mining: Concepts and Techniques, Chapter 1: Introduction
• “Detecting influenza epidemics using search engine query data”, Jeremy Ginsberg, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski & Larry Brilliant
  – cting-influenza-epidemics.pdf
• “When Google got flu wrong”, Nature (news)

Related Modules
(Diagram: modules arranged along an axis from low-level/systems to high-level/software)
• CS910: Foundations of Data Analytics (T1)
• CS912: Sensor Networks and Mobile Data (T2): data collection
• CS909: Data Mining (T2): more advanced analytics tools
• CS911: OR and optimization (T1): decisions from data
• CS411: Dynamic Web (T1): putting data on the web
• CS915: Advanced Computer Security (T2): protecting data
• CS402: High Performance Computing (T2): scaling to big data
• CS413: Image & Video Processing (T1): multimedia data
• CS404: Agent Based Systems (T1): users and data as entities
• CS916: Social Informatics (T2): data and society

Picking MSc Modules
For Data Analytics: need 4 optional modules to meet 180 CATS
• Give broad coverage of data collection/processing/usage
• Suggest 2 T1, 2 T2 to balance workload
• Engineering options: data communications (optical/radio/sensor)
  – Suit physics/engineering background, interest in hardware
• CS options:
  – User oriented/modeling: Agent based systems, dynamic web
  – Systems-level: HPC, Security, Multimedia
• Usual advice: sit in on a few lectures to get a feel for course
  – Contact course organizer if really stuck: