A Journey into the Dark Side Kevin Li


Big Data Fallacies: A Journey into the Dark Side. Kevin Li

Big Data: Information Visualization, Databases, Artificial Intelligence, Statistical Learning, Optimization, Data Structures, Massive Data Sets, Data Mining, Machine Learning, Modeling, Cloud Computing

CS 46N CS 145 CS 448B STATS 202 CS 229 CS 124 CS 221 CS 166 CS 341 CS 229T CME 375 CS 166 CS 341 STATS 202 CS 229 CS 264 CS 309A

Could it ever go wrong?

Bigger ≠ Better Source: http://www.smartdatacollective.com/charles-settles/199906/big-data-big-money-roi-business-intelligence

Source: http://techcrunch

Source: http://siliconangle

How to find the best model? Task: find out if a student will major in CS. Given 50,000 student profiles with their major, construct a major "predictor". How should we use the data? How complex should the model be? How do we tell if our model is good?

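Rote Learning: a zero-error algorithm (on the training data). Training: store the data set. Model: if the student is in the data, return their major; otherwise, crash. The lesson: focus on improving performance on unforeseen future data, not on the data already seen.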
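The rote learner described above can be sketched in a few lines of Python. This is only an illustration: the class name, the profile fields (intro CS grade, number of CS clubs) and the three toy students are all hypothetical and do not come from the slides.

```python
class RoteLearner:
    """Memorizes the training set; it cannot generalize to unseen students."""

    def fit(self, profiles, majors):
        # "Training" is nothing more than storing every (profile, major) pair.
        self.memory = dict(zip(profiles, majors))
        return self

    def predict(self, profile):
        # Known student: return the stored major.
        # Unknown student: the lookup raises KeyError (the "crash" case).
        return self.memory[profile]


# Hypothetical toy profiles: (intro CS grade, number of CS clubs joined)
train_profiles = [("A", 2), ("B", 0), ("A", 1)]
train_majors = ["CS", "History", "CS"]

model = RoteLearner().fit(train_profiles, train_majors)

# Count mistakes on the very data the model memorized.
train_errors = sum(
    model.predict(p) != m for p, m in zip(train_profiles, train_majors)
)
print("training errors:", train_errors)

# Ask about a student who was never in the data set.
try:
    model.predict(("C", 3))
    crashed = False
except KeyError:
    crashed = True
print("crashed on unseen student:", crashed)
```

The model is perfect on the students it has stored and useless on anyone else, which is why training error alone says little about future performance.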

How to prevent overfitting? Focus on the relevant parts of the data: select fewer features. Keep the model simple: restrict the predictor's complexity. Test your model: use validation sets.
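A minimal sketch of the validation-set idea, using only the Python standard library. Everything in it is an assumption made for illustration: the synthetic students, the single relevant feature (`took_intro_cs`), the 90%/10% probabilities, and the 80/20 split are invented here and are not taken from the slides.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable


def make_student(student_id):
    # Hypothetical profile: one relevant feature plus an irrelevant one (the ID).
    took_intro_cs = random.random() < 0.5
    # Assumed ground truth: intro-CS students major in CS 90% of the time,
    # everyone else 10% of the time.
    p_cs = 0.9 if took_intro_cs else 0.1
    major = "CS" if random.random() < p_cs else "Other"
    return (student_id, took_intro_cs), major


data = [make_student(i) for i in range(1000)]
random.shuffle(data)
train, validation = data[:800], data[800:]  # hold out 20% for validation

# Complex model: memorize every full profile, ID included.
memory = dict(train)


def memorizer(profile):
    # Unseen students get an arbitrary fallback guess instead of a crash.
    return memory.get(profile, "Other")


# Simple model: one rule on the single relevant feature.
def simple_rule(profile):
    _, took_intro_cs = profile
    return "CS" if took_intro_cs else "Other"


def accuracy(model, rows):
    return sum(model(profile) == major for profile, major in rows) / len(rows)


memorizer_train = accuracy(memorizer, train)
memorizer_val = accuracy(memorizer, validation)
simple_train = accuracy(simple_rule, train)
simple_val = accuracy(simple_rule, validation)

print(f"memorizer:   train={memorizer_train:.2f}  validation={memorizer_val:.2f}")
print(f"simple rule: train={simple_train:.2f}  validation={simple_val:.2f}")
```

The memorizer scores perfectly on the training set but falls back to guessing on the held-out students, while the one-feature rule scores about the same on both sets. Comparing the two columns is what the validation set is for.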

Say... we processed the data correctly, what else can go wrong?

Statistics can lie. A study collected data on income and education. It found that white Americans need a higher level of education to achieve the same level of income as black Americans. Conclusion: reverse discrimination??

Graphs can also lie. Source: http://data.heapanalytics.com/how-to-lie-with-data-visualization/

Graphs can also lie. Source: https://en.wikipedia.org/wiki/Misleading_graph

Source: http://www. politifact

Source: http://www. politifact

Source: http://www. politifact

Source: http://www. politifact

Perfect data + Correct analysis = Happy ending?

No.

Twitter auto-tagging

Can machines be racist? Princeton Review uses big data to determine price quotes. Pricing is determined by ZIP code. Asians are nearly twice as likely to be offered a higher price, even in lower-income neighborhoods. Source: http://www.propublica.org/article/asians-nearly-twice-as-likely-to-get-higher-price-from-princeton-review Racist? Preventable?

What is the takeaway? Big Data is not easy to use. Big Data isn't always trustworthy. Big Data can't immediately solve everything.