Presentation transcript:

Streaming Algorithms for Clustering and Learning
Kevin L. Chang, Max-Planck-Institute for Informatics
MADALGO – Center for Massive Data Algorithmics, a Center of the Danish National Research Foundation

Outline
- What types of problems we study, Part 1: Statistical problems
- Highlights of the theoretical results
- What types of problems we study, Part 2: Clustering
- Future directions

Trade-off between memory and passes
We focus on algorithms that solve statistical and clustering problems with a strong trade-off between the number of passes over the data stream and the amount of memory required. Example: if a 2-pass algorithm requires, say, 100 MB of memory, then a 4-pass algorithm may only require about 10 MB. (A code sketch of this kind of pass/memory trade-off appears after the transcript.)

Optimality of trade-off
For many of our algorithms, we can mathematically prove that our trade-off is optimal. So, it is impossible to achieve a significantly better trade-off!

Motivation
Data mining and machine learning techniques have become very important for automatic data analysis. Our goal is to design algorithms for performing these computations that are suitable for massive data sets. The streaming model of computation captures many of the important constraints imposed by computation on large data sets.

Streaming algorithms
(Figure: a CPU with a small read/write working memory reads a read-only data stream from left to right.)

Details of the model
- The large input is a read-only array, which may only be accessed through a few sequential passes.
- Main memory is modeled as extra space that supports random-access reads and writes.

Important resources to minimize
- Ideally, the number of passes should be small.
- Ideally, the amount of memory required should be small.

Clustering
Suppose you have many data points and want to find a small number of points that "represent" the large set of data points. (A one-pass clustering sketch appears after the transcript.)
(Figure: the input points, and 2 representative points with their "clusters".)

Two natural applications
- Document clustering: points represent web pages of news web sites. Clustering is used to group news pages with similar topics, e.g. a cluster for all pages with European Central Bank news, a cluster for the US election, etc.
- Intrusion detection: points represent internet connections to computers in your network. Clustering is used to group connections into different categories, e.g. clusters could represent "denial of service attack", "password guessing", or "normal connection".

Statistical problems
Suppose the data in your data stream consist of random numbers drawn from some probability distribution, but we don't know what the distribution is. Can we "learn" the distribution from the data?
(Figure: a one-dimensional data sample on the interval [0, 1]; we want to recover the distribution that generated the data.)
We focus on the special case where the data are known to be drawn from a mixture of distributions.

Mixtures of Distributions
- There are k different probability distributions, but we don't know exactly what they are.
- Each data point is drawn from one of these distributions, but you don't know which one.
- Can you reconstruct the k different distributions from the data?
- These are very popular models. (An offline illustration of this learning task appears after the transcript.)

Future directions
For the statistical problem, can we design streaming algorithms for learning distributions that are more general? One approach may be mathematical programming over data streams. A linear program is an example of a mathematical program: find numbers x and y that maximize 50x + 7y subject to 100x + 20y ≤ 10 and x, y ≥ 0. The distribution learning problem in more general contexts can be cast as a mathematical program. (A sketch of solving this small linear program offline appears after the transcript.)
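
The slides describe the pass/memory trade-off only qualitatively. The following is a minimal Python sketch, written for this transcript and not taken from the talk, of one classical way such a trade-off arises: finding the exact median of a stream of integers by narrowing a candidate range over several sequential passes. The function name, parameters, and the integer-valued, bounded-range assumption are all illustrative.

```python
def median_multipass(read_stream, n, lo, hi, buckets_per_pass):
    """Exact median of n integers in [lo, hi] via repeated sequential passes.

    Illustrative sketch, not one of the talk's algorithms.  Each pass splits
    the current candidate range into buckets_per_pass buckets, counts how many
    stream elements land in each, and keeps only the bucket that contains the
    median.  Memory is O(buckets_per_pass) and the number of passes is roughly
    log base buckets_per_pass of (hi - lo): a pass/memory trade-off.
    read_stream() must return a fresh iterator over the same read-only data.
    """
    target = n // 2        # 0-based rank of the (lower) median
    below = 0              # elements already known to lie below the range
    while lo < hi:
        width = -(-(hi - lo + 1) // buckets_per_pass)   # ceiling division
        counts = [0] * buckets_per_pass
        for x in read_stream():                         # one sequential pass
            if lo <= x <= hi:
                counts[(x - lo) // width] += 1
        for i, c in enumerate(counts):                  # locate the median's bucket
            if below + c > target:
                new_lo = lo + i * width
                hi = min(hi, new_lo + width - 1)
                lo = new_lo
                break
            below += c
    return lo


data = [7, 3, 9, 1, 5, 8, 2, 6, 4]
print(median_multipass(lambda: iter(data), n=len(data), lo=1, hi=9,
                       buckets_per_pass=3))   # -> 5
```

Squaring the number of counters roughly halves the number of passes, which is the flavor of trade-off described in the slides.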
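
For the clustering slides, here is a minimal one-pass sketch in the spirit of greedy doubling heuristics for k-center; it is an illustration under assumptions (points are coordinate tuples, Euclidean distance, an initial radius guess), not the algorithm analyzed in the talk. It keeps at most k representatives in memory and coarsens its radius whenever the budget is exceeded.

```python
import math

def stream_k_center(points, k, init_radius=1.0):
    """One-pass greedy clustering sketch (doubling-style heuristic).

    Illustrative only.  Keeps at most k representative points in memory.  A
    new point that is not within `radius` of any representative is added; if
    that exceeds the budget k, the radius is doubled and representatives lying
    within the new radius of an earlier survivor are merged away.
    """
    reps, radius = [], init_radius
    for x in points:                                    # single sequential pass
        if any(math.dist(x, r) <= radius for r in reps):
            continue                                    # already represented
        reps.append(x)
        while len(reps) > k:                            # over budget: coarsen
            radius *= 2
            merged = []
            for r in reps:
                if all(math.dist(r, m) > radius for m in merged):
                    merged.append(r)
            reps = merged
    return reps, radius


pts = [(0.1, 0.2), (0.15, 0.22), (5.0, 5.1), (5.2, 4.9), (0.12, 0.18)]
reps, r = stream_k_center(pts, k=2)
print(reps, r)   # two representatives: one near (0.1, 0.2), one near (5, 5)
```

Memory stays proportional to k regardless of the stream length; the price is that the final radius, and hence the quality of the representatives, may be worse than optimal.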
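
The mixture-of-distributions slide defines the learning target without giving an algorithm. The snippet below is only an offline, in-memory illustration of that target using scikit-learn's EM-based GaussianMixture (a standard batch estimator, not the streaming method from the talk): points are drawn from two unknown one-dimensional Gaussians, the component labels are hidden, and the two components are recovered from the samples alone. The synthetic parameters are arbitrary.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic "stream": each point comes from one of k = 2 unknown Gaussians,
# and the component labels are never revealed to the learner.
data = np.concatenate([rng.normal(0.2, 0.05, 5000),
                       rng.normal(0.7, 0.10, 5000)]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=0).fit(data)
print("weights:", gm.weights_)                       # roughly 0.5 and 0.5
print("means:  ", gm.means_.ravel())                 # roughly 0.2 and 0.7
print("stddevs:", np.sqrt(gm.covariances_).ravel())  # roughly 0.05 and 0.10
```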
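
The future-directions slide uses a small linear program as its example of a mathematical program. As a sanity check of that example, here is how it could be solved offline with SciPy; this says nothing about how one would solve it over a data stream. SciPy's linprog minimizes, so the objective is negated.

```python
from scipy.optimize import linprog

# Maximize 50x + 7y  subject to  100x + 20y <= 10,  x >= 0,  y >= 0.
# linprog minimizes, so we minimize -(50x + 7y).
res = linprog(c=[-50, -7],
              A_ub=[[100, 20]],
              b_ub=[10],
              bounds=[(0, None), (0, None)])
print(res.x, -res.fun)   # optimal point (0.1, 0.0) with objective value 5.0
```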