Knowledge Discovery and Data Mining (COMP 5318) S1, 2013.

The Lecturing Team
Coordinator: Sanjay Chawla, SIT
Lecturers: Sanjay Chawla and Wei Liu (NICTA)
Tutors: Didi Surian, Linsey Pang and Fei Wang (PhD students)

Material and Lectures
● Lectures will be posted on usyd.edu.au/~comp5318
● We will mainly follow the textbook by Rajaraman, Leskovec and Ullman from Stanford (Mining of Massive Datasets), which is available online
● However, the ordering will be different

Assessment Package
In-Class Test (15%): Week 6
Group Assignment (20%): Week 10
Research Paper Presentation (15%): Week [not given]
Final Exam (50%): see Exam Calendar
To pass the class you must get at least 50% in the final exam and at least 40% (combined) in the other assessments.

So what is this class about?
1. Building large knowledge discovery pipelines to solve real-world problems (a.k.a. "data analytics" pipelines)
2. Learning about techniques to analyze data and algorithms to mine data
3. Learning how to read original research in data mining and machine learning
4. Learning how to solve large data problems in the cloud

Coordinates of Data Mining
[Diagram: Data Mining at the center, surrounded by related fields: Algorithms / Machine Learning; Statistics and Linear Algebra; Distributed & Parallel Computing; Database Management Systems; Information Retrieval; Computer Vision]

Abstract Tasks in Data Mining
● 1. Clustering and Segmentation: how to automatically group objects into clusters
  ○ Take photographs from Flickr and automatically create categories
● 2. Classification and Regression: how to build statistical models for prediction
  ○ Predict whether an online user will click on a banner advertisement
  ○ Predict tomorrow's currency exchange rate (AUD/USD)
  ○ Predict who will win the NBA championship in 2013

Abstract Tasks in Data Mining ... cont
● 3. Anomaly Detection: identify entities which are different from the rest of the group
  ○ Which galaxy is different in an astronomical database?
  ○ Which area has an unusual flu rate?
  ○ Is this credit card transaction fraudulent?
  ○ Identify cyber attacks: Denial of Service (DoS) and port scans
  ○ Identify genes which are likely to cause a certain disease
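As a rough illustration of the "unusual flu rate" question, the sketch below flags areas whose rate lies far from the mean. The area names, the rates and the z-score threshold of 2 are all hypothetical, and the z-score rule is only one simple choice of anomaly detector, not a method taken from the slides.

```python
import statistics

# Hypothetical flu rates (cases per 1,000 people) for six made-up areas.
flu_rates = {"A": 2.1, "B": 1.9, "C": 2.3, "D": 2.0, "E": 7.5, "F": 2.2}

def find_anomalies(rates, threshold=2.0):
    """Return the keys whose value is more than `threshold` standard
    deviations away from the mean (a simple z-score rule)."""
    mean = statistics.mean(rates.values())
    std = statistics.pstdev(rates.values())
    return [name for name, value in rates.items()
            if abs(value - mean) / std > threshold]

anomalies = find_anomalies(flu_rates)
print(anomalies)
```

With such a small sample a single extreme value inflates the standard deviation, so in practice a more robust rule (for example, one based on the median) may be preferable.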

Knowledge Discovery Pipeline
[Diagram: Data Sources → Data Integration → Data Mining Task → Presentation of Results]

Example: Large Scale Advertising Systems
[Diagram: a Publisher's Web Page, an Ad Exchange, a Demand Side Platform, an Ad Server and Advertisers 1 to n, connected by numbered steps]

Let's do something tangible...
Underlying all data mining tasks is the notion of similarity.
1. When are two images similar?
2. When are two documents similar?
3. When are two patients similar?
4. When are two shopping baskets similar?
5. When are two job candidates similar?
6. When are two galaxies similar?
7. When is network traffic similar?

Data Vector
In data mining, data is often transformed into a vector of numbers.
e.g., D1: computer science and physics have a lot in common. In the former, we build models of computation and in the latter, models of the physical world.
Each dictionary word is one coordinate of the vector: a, the, brain, latter, of, world, cheese, in, ...
What is the length of this vector?
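A minimal sketch of how D1 could be turned into a data vector. It assumes raw word counts (the slide may have used a different weighting, such as 0/1 presence) and a tiny eight-word vocabulary built from the words in the example; a real dictionary would have one coordinate per word, which is what determines the length of the vector.

```python
import re
from collections import Counter

# Document D1 from the slide.
D1 = ("computer science and physics have a lot in common. "
      "In the former, we build models of computation and in the latter, "
      "models of the physical world.")

# Toy vocabulary (an assumption): the example words, in the order listed.
VOCAB = ["a", "the", "brain", "latter", "of", "world", "cheese", "in"]

def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in `text`
    (case-insensitive); one coordinate per vocabulary word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vector = bag_of_words(D1, VOCAB)
print(dict(zip(VOCAB, vector)))
print("number of coordinates:", len(vector))
```

Words that never occur in D1 ("brain", "cheese") still occupy a coordinate with value 0, so the vector is as long as the vocabulary regardless of the document.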

Similarity
● Once we have data vectors, we can start the computation process. For example: when are two data vectors similar? Which pair of currency trades is more similar?

Similarity Computation
Suppose we want to compute the similarity between two vectors x and y.
Step 1: compute the length of each vector: ||x|| = sqrt(30) = 5.48; ||y|| = sqrt(15) = 3.87
Step 2: compute the dot product: x.y = 16
Step 3: compute x.y/(||x|| ||y||) = 16/((5.48)(3.87)) = 0.75
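The three steps can be sketched in a few lines of Python. The vectors below are hypothetical: they were chosen only because they reproduce the numbers quoted on the slide (lengths 5.48 and 3.87, dot product 16), and they are not necessarily the vectors the lecturer used.

```python
import math

def norm(v):
    """Step 1: Euclidean length of a vector."""
    return math.sqrt(sum(a * a for a in v))

def dot(u, v):
    """Step 2: dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Step 3: dot product divided by the product of the lengths."""
    return dot(u, v) / (norm(u) * norm(v))

# Hypothetical example vectors (assumption, see the note above).
x = [1, 2, 3, 4]
y = [3, 1, 1, 2]

print(round(norm(x), 2))                  # 5.48
print(round(norm(y), 2))                  # 3.87
print(dot(x, y))                          # 16
print(round(cosine_similarity(x, y), 2))  # 0.75
```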

Cosine Similarity
1. Thus the similarity sim(x,y) between two data vectors x and y is given by x.y/(||x|| ||y||)
2. This is called cosine similarity (Why?)
3. This is a very general concept and underpins much of data-driven computation
4. We will be coming back to it over and over again
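One way to see where the name comes from, offered as a hint rather than as the lecturer's own answer: by the standard geometric identity for the dot product, with θ denoting the angle between x and y (θ is a symbol introduced here, not on the slide),

```latex
\mathbf{x}\cdot\mathbf{y} = \|\mathbf{x}\|\,\|\mathbf{y}\|\cos\theta
\quad\Longrightarrow\quad
\mathrm{sim}(\mathbf{x},\mathbf{y})
  = \frac{\mathbf{x}\cdot\mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|}
  = \cos\theta
```

so the similarity depends only on the angle between the two vectors and not on their lengths.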

More examples
First pair of vectors x, y: sim(x,y) = 0
Second pair of vectors x, y: sim(x,y) = 1
If all elements of the data vectors are non-negative, then 0 <= sim(x,y) <= 1
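The two extreme cases can be reproduced with hypothetical vectors (the ones below are illustrative, not the slide's): vectors with no shared non-zero coordinate give 0, and vectors pointing in the same direction give 1.

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the Euclidean lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    length_u = math.sqrt(sum(a * a for a in u))
    length_v = math.sqrt(sum(b * b for b in v))
    return dot / (length_u * length_v)

# Hypothetical vectors chosen for illustration.
orthogonal = cosine_similarity([1, 0], [0, 1])  # nothing in common
parallel = cosine_similarity([3, 4], [6, 8])    # same direction, different length
in_between = cosine_similarity([1, 2], [2, 1])  # non-negative entries

print(orthogonal, parallel, in_between)
```

For non-negative vectors the dot product cannot be negative, which is why the similarity stays between 0 and 1.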

Cost of Computation
Step 1: compute the length of each vector:
||x|| = sqrt(30) [4 mults, 3 adds, 1 sqrt]
||y|| = sqrt(15) [4 mults, 3 adds, 1 sqrt]
Step 2: compute the dot product: x.y = 16 [4 mults, 3 adds]
Step 3: compute x.y/(||x|| ||y||) = 16/(sqrt(30) sqrt(15)) = 0.75 [1 mult, 1 divide]
Total FLOPs for vectors of length d (assuming 1 FLOP per operation): (d + d) + (d + d) + (d + (d-1)) + 2 = 6d + 1
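The operation count can be checked with a small sketch that tallies each step separately, under the slide's assumption of 1 FLOP per operation. The function name is illustrative.

```python
def cosine_flops(d):
    """Count the operations needed for one cosine similarity
    between two vectors of length d (1 FLOP per operation)."""
    norm_x = d + (d - 1) + 1   # d mults, d-1 adds, 1 sqrt
    norm_y = d + (d - 1) + 1   # same cost for the second vector
    dot_xy = d + (d - 1)       # d mults, d-1 adds
    final = 2                  # 1 mult (of the lengths), 1 divide
    return norm_x + norm_y + dot_xy + final

print(cosine_flops(4))  # 25, which equals 6*4 + 1
```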

Cost of Computation
The cost of computing the similarity between two vectors of length d is 6d + 1, or ~6d.
Suppose we want to find the similarity between all pairs of Wikipedia documents.
Number of articles: ~4,000,000
Length of data vector: ~10^5 [# of words in dictionary]
Number of pairwise combinations: ~8 x 10^12
Number of flops: ~8 x 10^12 x 6 x 10^5 = 48 x 10^17, i.e. on the order of 10^18
World's fastest computer (2012), Titan at Oak Ridge National Laboratory: 27 petaflops (27 thousand trillion flops per second, on the order of 10^16), so about 100 seconds.
Current desktop: 3 GHz, ~10^9 flops per second, so about 10^9 seconds, or roughly 32 years.
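The same back-of-envelope estimate, written out without rounding. The input figures are the slide's approximations, with the vector length taken as 10^5 (the value implied by the flop count); because the slide rounds everything to the nearest power of ten, the unrounded results below differ from its "100 seconds" and "10^9 seconds" by small constant factors.

```python
n_articles = 4_000_000      # approximate number of Wikipedia articles
d = 10**5                   # assumed length of each data vector
flops_per_pair = 6 * d      # ~6d flops per cosine similarity

# Number of unordered pairs of articles: n choose 2.
pairs = n_articles * (n_articles - 1) // 2
total_flops = pairs * flops_per_pair

titan_flops_per_sec = 27e15    # 27 petaflops, as quoted on the slide
desktop_flops_per_sec = 1e9    # ~10^9 flops per second

seconds_on_titan = total_flops / titan_flops_per_sec
seconds_per_year = 365.25 * 24 * 3600
years_on_desktop = total_flops / desktop_flops_per_sec / seconds_per_year

print(f"pairs:            {pairs:.2e}")
print(f"total flops:      {total_flops:.2e}")
print(f"Titan:            {seconds_on_titan:.0f} seconds")
print(f"desktop:          {years_on_desktop:.0f} years")
```

Either way, the gap between minutes on a supercomputer and decades on a desktop is the point of the exercise.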

Summary
● We use data mining to build knowledge discovery pipelines.
● Data mining is the process of applying algorithms to data.
● A key concept is that of defining similarity between entities.