Sensors, networks, and massive data Michael W. Mahoney Stanford University May 2012 (For more info, see: cs.stanford.edu/people/mmahoney/ or Google on “Michael Mahoney”)


Lots of types of “sensors”

Examples:
- Physical/environmental: temperature, air quality, oil, etc.
- Consumer: RFID chips, SmartPhones, store video, etc.
- Health care: patient records, images & surgery videos, etc.
- Financial: transactions for regulation, HFT, etc.
- Internet/e-commerce: clicks, etc., for user modeling
- Astronomical/HEP: images, experiments, etc.

Common theme: it is easy to generate A LOT of data.

Questions: What are the similarities/differences, in terms of funding drivers, customer demands, questions of interest, time sensitivity, etc., of “sensing” in these different applications? What can we learn from one area and apply to another?

BIG data??? MASSIVE data????

NYT, Feb 11, 2012, “The Age of Big Data”: “What is Big Data? A meme and a marketing term, for sure, but also shorthand for advancing trends in technology that open the door to a new approach to understanding the world and making decisions. …”

Why are big data big?
- Data are generated at different places/times and at different resolutions.
- A factor of 10 more data is not just more data, but different data.

BIG data??? MASSIVE data????

MASSIVE data: Internet, customer transactions, astronomy/HEP = “petascale.” One petabyte:
- = watching 20 years of movies (HD)
- = listening to 20,000 years of MP3 (128 kbits/sec)
- = way too much to browse or comprehend

massive data: 10^5 people typed at 10^6 DNA SNPs; 10^6 or 10^9 node social networks; etc.

In either case, the main issues are:
- Memory management, e.g., pushing the computation to the data.
- It is hard to answer even basic questions about what the data “look like.”

How do we view BIG data?

Algorithmic vs. Statistical Perspectives

Computer scientists:
- Data: a record of everything that happened.
- Goal: process the data to find interesting patterns and associations.
- Methodology: develop approximation algorithms under different models of data access, since the goal is typically computationally hard.

Statisticians (and natural scientists):
- Data: a particular random instantiation of an underlying process describing unobserved patterns in the world.
- Goal: extract information about the world from noisy data.
- Methodology: make inferences (perhaps about unseen events) by positing a model that describes the random variability of the data around the deterministic model.

Lambert (2000), Mahoney (2010)

Thinking about large-scale data

Data generation is the modern version of the microscope/telescope:
- We see things we couldn’t see before: e.g., movement of people, clicks and interests; tracking of packages; fine-scale measurements of temperature, chemicals, etc.
- Those inventions ushered in new scientific eras, new understanding of the world, and new technologies to do stuff.

Easy things become hard and hard things become easy:
- It is easier to see the other side of the universe than the bottom of the ocean.
- Computing means, sums, medians, and correlations is easy with small data.

Our ability to generate data far exceeds our ability to extract insight from data.

Many challenges...
- Tradeoffs between prediction & understanding
- Tradeoffs between computation & communication
- Balancing heat dissipation & energy requirements
- Scalable, interactive, & inferential analytics
- Temporal constraints in real-time applications
- Understanding “structure” and “noise” at large scale (*)

(*) Even meaningfully answering “What does the data look like?”

Micro-markets in sponsored search

[Figure: an advertiser-by-keyword matrix, roughly 10 million keywords by 1.4 million advertisers, with labeled blocks such as “Gambling,” “Sports,” “Movies,” “Media,” and “Sport videos.” Example query: what is the CTR and advertiser ROI of sports-gambling keywords?]

Goal: find isolated markets/clusters (in an advertiser-bidded-phrase bipartite graph) with sufficient money/clicks and sufficient coherence.

Question: Is this even possible?

What about sensors?

Vector space model, analogous to the “bag-of-words” model for documents/terms:
- Each sensor is a “document”: a vector in a high-dimensional Euclidean space.
- Each measurement is a “term,” describing the elements of that vector.

(Advertisers and bidded phrases, and many other things, are also analogous.)

Can also define sensor-measurement graphs: sensors are nodes, and edges connect sensors with similar measurements.

[Figure: an m-by-n matrix A with m documents (sensors) as rows and n terms (measurements) as columns; A_ij = frequency of the j-th term in the i-th document (value of the j-th measurement at the i-th sensor).]
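A minimal sketch of the vector-space analogy above, assuming toy data: each “sensor” is a row of a matrix A, each “measurement” a column, and a sensor-measurement graph follows by connecting rows with high cosine similarity. The data, the 0.9 threshold, and the function name are illustrative assumptions, not from the slides.

```python
import numpy as np

# Assumed toy data: 4 "sensors" (rows) by 3 "measurements" (columns),
# in direct analogy with an m-documents-by-n-terms matrix.
A = np.array([
    [1.0, 0.0, 2.0],   # sensor 0
    [0.9, 0.1, 2.1],   # sensor 1, similar to sensor 0
    [0.0, 3.0, 0.0],   # sensor 2
    [0.1, 2.9, 0.2],   # sensor 3, similar to sensor 2
])

def sensor_graph(A, threshold=0.9):
    """Edges between sensors whose measurement vectors point the same way
    (cosine similarity of rows at or above the threshold)."""
    U = A / np.linalg.norm(A, axis=1, keepdims=True)   # unit-normalize rows
    S = U @ U.T                                        # pairwise cosine similarities
    m = A.shape[0]
    return [(i, j) for i in range(m) for j in range(i + 1, m) if S[i, j] >= threshold]

edges = sensor_graph(A)
```

With these toy rows, only the two near-duplicate sensor pairs end up connected.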

Cluster-quality score: conductance

How cluster-like is a set of nodes? Idea: balance the “boundary” of the cluster against the “volume” of the cluster. We need a natural, intuitive measure: conductance (normalized cut), φ(S) ≈ # edges cut / # edges inside. Small φ(S) corresponds to better clusters of nodes.

[Figure: a node set S and its complement S’, with the cut edges between them highlighted.]
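Conductance can be sketched directly from its definition; the slide’s φ(S) ≈ # edges cut / # edges inside matches the standard cut(S) / min(vol(S), vol(S’)) when S is the smaller side. The toy graph below (two triangles joined by one edge) is an illustrative assumption.

```python
def conductance(adj, S):
    """phi(S) = cut(S) / min(vol(S), vol(rest)); small phi = good cluster."""
    S = set(S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)   # edges leaving S
    vol_S = sum(len(adj[u]) for u in S)                     # sum of degrees in S
    vol_rest = sum(len(adj[u]) for u in adj if u not in S)
    return cut / min(vol_S, vol_rest)

# Assumed toy graph: two triangles joined by one edge; {0, 1, 2} is a good cluster.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
phi = conductance(adj, {0, 1, 2})   # one cut edge, volume 7
```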

Graph partitioning

A family of combinatorial optimization problems: partition a graph’s nodes into two sets such that:
- Not much edge weight crosses the cut (cut quality)
- Both sides contain a lot of nodes

Standard formalizations of this bi-criterion are NP-hard!

Approximation algorithms:
- Spectral methods* (compute eigenvectors)
- Local improvement (important in practice)
- Multi-resolution (important in practice)
- Flow-based methods* (mincut-maxflow)

* Comes with strong underlying theory to guide heuristics.
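Of the listed methods, the spectral one is the easiest to sketch: compute the eigenvector of the graph Laplacian with the second-smallest eigenvalue (the Fiedler vector) and split the nodes by its sign. A minimal illustration, with a toy barbell graph as an assumed input:

```python
import numpy as np

def spectral_bisection(A):
    """Split nodes by the sign of the Fiedler vector: the eigenvector of the
    graph Laplacian L = D - A with the second-smallest eigenvalue."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    # np.linalg.eigh returns eigenpairs in ascending eigenvalue order
    _, vecs = np.linalg.eigh(L)
    return vecs[:, 1] < 0          # boolean mask for one side of the cut

# Assumed toy input: two triangles joined by the single edge (2, 3).
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[u, v] = A[v, u] = 1.0
side = sorted(map(int, np.flatnonzero(spectral_bisection(A))))
```

Since an eigenvector’s sign is arbitrary, either triangle may come out as the “negative” side; the split itself is the same.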

Comparison of “spectral” versus “flow”

Spectral:
- Compute an eigenvector
- “Quadratic” worst-case bounds
- Worst case achieved on “long stringy” graphs
- Embeds you on a line (or in K_n)

Flow:
- Solve an LP
- O(log n) worst-case bounds
- Worst case achieved on expanders
- Embeds you in L1

The two methods have complementary strengths and weaknesses. What we compute will depend on the approximation algorithm as well as on the objective function.
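The “flow” side rests on max-flow/min-cut duality. Below is a minimal Edmonds-Karp sketch, not the partitioners’ actual machinery, just the underlying primitive, run on an assumed toy graph:

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp max-flow on a capacity matrix; by max-flow/min-cut
    duality the returned value equals the minimum s-t cut weight."""
    n = len(cap)
    flow = [[0] * n for _ in range(n)]
    total = 0
    while True:
        parent = [-1] * n
        parent[s] = s
        q = deque([s])
        while q and parent[t] == -1:          # BFS for an augmenting path
            u = q.popleft()
            for v in range(n):
                if parent[v] == -1 and cap[u][v] - flow[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if parent[t] == -1:                   # no augmenting path left
            return total
        bottleneck, v = float("inf"), t
        while v != s:                         # find the path's bottleneck
            u = parent[v]
            bottleneck = min(bottleneck, cap[u][v] - flow[u][v])
            v = u
        v = t
        while v != s:                         # push flow along the path
            u = parent[v]
            flow[u][v] += bottleneck
            flow[v][u] -= bottleneck
            v = u
        total += bottleneck

# Assumed toy input: two unit-capacity triangles joined by the edge (2, 3).
n = 6
cap = [[0] * n for _ in range(n)]
for u, v in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    cap[u][v] = cap[v][u] = 1
mincut = max_flow(cap, 0, 5)   # the single bridge edge is the bottleneck
```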

Analogy: What does a protein look like?

Experimental procedure:
- Generate a bunch of output data by using the unseen object to filter a known input signal.
- Reconstruct the unseen object given the output signal and what we know about the artifactual properties of the input signal.

[Figure: three possible representations (all-atom; backbone; and solvent-accessible surface) of the three-dimensional structure of the protein triose phosphate isomerase.]

Popular small networks
- Zachary’s karate club
- Newman’s Network Science
- Meshes and RoadNet-CA

Large Social and Information Networks

Typical example of our findings

General relativity collaboration network (pretty small: 4,158 nodes, 13,422 edges).

[Figure: community score plotted against community size.]

Leskovec, Lang, Dasgupta, and Mahoney (WWW 2008, 2010 & IM 2009)

Large Social and Information Networks

[Figure: network community profile plots for LiveJournal and Epinions.]

Focus on the red curves (local spectral algorithm); the blue (Metis+Flow), green (Bag of whiskers), and black (randomly rewired network) curves are shown for consistency and cross-validation.

Leskovec, Lang, Dasgupta, and Mahoney (WWW 2008, 2010 & IM 2009)
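A community-profile curve of the kind plotted here can be approximated with a spectral sweep cut: order the nodes by the Fiedler vector and score every prefix set by conductance. This is only a global-spectral stand-in for the local spectral algorithm the slide refers to, and the toy graph is an assumption.

```python
import numpy as np

def sweep_cut_profile(A):
    """Conductance-vs-size curve from a spectral sweep cut: sort nodes by the
    Fiedler vector and score every prefix set by conductance."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    L = np.diag(deg) - A
    # eigenvector with the 2nd-smallest Laplacian eigenvalue (Fiedler vector)
    fiedler = np.linalg.eigh(L)[1][:, 1]
    order = np.argsort(fiedler)
    total_vol = deg.sum()
    in_S = np.zeros(len(A), dtype=bool)
    profile = []
    for k in range(len(A) - 1):
        in_S[order[k]] = True
        cut = A[in_S][:, ~in_S].sum()    # weight of edges leaving S
        vol = deg[in_S].sum()            # volume of S
        profile.append((k + 1, cut / min(vol, total_vol - vol)))
    return profile

# Assumed toy input: two triangles joined by a single edge.
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[u, v] = A[v, u] = 1.0
size, phi = min(sweep_cut_profile(A), key=lambda t: t[1])
```

On this toy graph the profile dips exactly at the size-3 triangle, the analogue of the minimum in the plotted curves.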

Interpretation: “whiskers” and the “core” of large informatics graphs

“Whiskers”:
- maximal sub-graphs that can be detached from the network by removing a single edge
- contain 40% of the nodes and 20% of the edges

“Core”:
- the rest of the graph, i.e., the 2-edge-connected core

The global minimum of the NCPP is a whisker. BUT the core itself has nested whisker-core structure.

Leskovec, Lang, Dasgupta, and Mahoney (WWW 2008, 2010 & IM 2009)
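The whisker definition above suggests a direct sketch: whiskers hang off bridge edges, so find the bridges (Tarjan’s low-link rule) and, for each, take the smaller piece left when that edge is removed. The toy graph and both function names are illustrative assumptions.

```python
def find_bridges(adj):
    """Bridge edges of an undirected graph via Tarjan's low-link DFS: a tree
    edge (u, v) is a bridge iff no back edge climbs above u from v's subtree."""
    disc, low, bridges = {}, {}, []
    timer = [0]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:                      # back edge
                low[u] = min(low[u], disc[v])
            else:                              # tree edge
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:
                    bridges.append((u, v))

    for u in adj:
        if u not in disc:
            dfs(u, None)
    return bridges

def whisker(adj, bridge):
    """The smaller piece detached by deleting one bridge edge."""
    def reach(start, blocked):
        seen, stack = {start}, [start]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if {x, y} == blocked or y in seen:
                    continue
                seen.add(y)
                stack.append(y)
        return seen
    u, v = bridge
    return min(reach(u, set(bridge)), reach(v, set(bridge)), key=len)

# Assumed toy input: a triangle "core" {0, 1, 2} with the path 2-3-4 hanging off it.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
bridges = find_bridges(adj)
w = whisker(adj, (2, 3))
```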

Local “structure” and global “noise”

Many (most? all?) large informatics graphs (and massive data in general?) have local structure that is meaningfully geometric/low-dimensional, but do not have analogous meaningful global structure.

Intuitive example: What does the graph of you and your 10^2 closest Facebook friends “look like”? What does the graph of you and your 10^5 closest Facebook friends “look like”?

Many lessons...

This is problematic for MANY things people want to do:
- statistical analysis that relies on asymptotic limits
- recursive clustering algorithms
- analysts who want a few meaningful clusters

More data need not be better if you:
- don’t have control over the noise
- want “islands of insight” in the “sea of data”

How does this manifest itself in your “sensor” application? Needles in a haystack, correlations, time series: “scientific” apps. Historically, CS and database apps did more summaries and aggregates.

Big changes in the past... and future

Consider the creation of:
- Modern Physics
- Computer Science
- Molecular Biology
- OR and Management Science
- Transistors and Microelectronics
- Biotechnology

These were driven by new measurement techniques and technological advances, but they led to:
- big new (academic and applied) questions
- new perspectives on the world
- lots of downstream applications

We are in the middle of a similarly big shift!

Conclusions

A HUGE range of “sensors” is generating A LOT of data: this will lead to a very different world in many ways.

Large-scale data are very different than small-scale data:
- Easy things become hard, and hard things become easy.
- The types of questions that are meaningful to ask are different.
- Structure, noise, etc. properties are often deeply counterintuitive.

Different applications are driven by different considerations: next-user-interaction, qualitative insight, failure modes, false positives versus false negatives, time sensitivity, etc.

Algorithms can compute answers to known questions, but algorithms can also be used as “experimental probes” of the data to form questions!

MMDS Workshop on “Algorithms for Modern Massive Data Sets,” at Stanford University, July 10-13, 2012

Objectives:
- Address algorithmic, statistical, and mathematical challenges in modern statistical data analysis.
- Explore novel techniques for modeling and analyzing massive, high-dimensional, and nonlinearly-structured data.
- Bring together computer scientists, statisticians, mathematicians, and data analysis practitioners to promote cross-fertilization of ideas.

Organizers: M. W. Mahoney, A. Shkolnik, G. Carlsson, and P. Drineas

Registration is available now!