Progress in New Computer Science Research Directions John Hopcroft Cornell University Ithaca, New York CAS May 21, 2010.



Time of change
- The information age is undergoing a fundamental revolution that is changing all aspects of our lives.
- Those individuals and nations who recognize this change and position themselves for the future will benefit enormously.

Drivers of change
- Merging of computing and communications
- Data available in digital form
- Networked devices and sensors
- Computers becoming ubiquitous

Internet search engines are changing
- Query: When was Einstein born?
- Answer: Einstein was born at Ulm, in Württemberg, Germany, on March 14, 1879.
- List of relevant web pages

Internet queries will be different
- Which car should I buy?
- What are the key papers in Theoretical Computer Science?
- Construct an annotated bibliography on graph theory.
- Where should I go to college?
- How did the field of CS develop?

Which car should I buy?
Search engine response: Which criteria below are important to you?
- Fuel economy
- Crash safety
- Reliability
- Performance
- Etc.

Make           Cost     Reliability   Fuel economy   Crash safety   Links to photos/articles
Toyota Prius   23,780   Excellent     44 mpg         Fair
Honda Accord   28,695   Better        26 mpg         Excellent
Toyota Camry   29,839   Average       24 mpg         Good
Lexus 350      38,615   Excellent     23 mpg         Good
Infiniti M35   47,650   Excellent     19 mpg         Good


2010 Toyota Camry - Auto Shows
Toyota sneaks the new Camry into the Detroit Auto Show.
Usually, redesigns and facelifts of cars as significant as the hot-selling Toyota Camry are accompanied by a commensurate amount of fanfare. So we were surprised when, right about the time that we were walking by the Toyota booth, a chirp of our Blackberries brought us the press release announcing that the facelifted 2010 Toyota Camry and Camry Hybrid mid-sized sedans were appearing at the 2009 NAIAS in Detroit.
We’d have hardly noticed if they hadn’t told us: the headlamps are slightly larger, the grilles on the gas and hybrid models go their own way, and taillamps become primarily LED. Wheels are also new, but overall, the resemblance to the Corolla is downright uncanny. Let’s hear it for brand consistency!
Four-cylinder Camrys get Toyota’s new 2.5-liter four-cylinder with a boost in horsepower to 169 for LE and XLE grades, 179 for the Camry SE, all of which are available with six-speed manual or automatic transmissions. Camry V-6 and Hybrid models are relatively unchanged under the skin.
Inside, changes are likewise minimal: the options list has been shaken up a bit, but the only visible change on any Camry model is the Hybrid’s new gauge cluster and softer seat fabrics. Pricing will be announced closer to the time it goes on sale this March.
Toyota Camry: Overview, Specifications, Price with Options, Get a Free Quote
News & Reviews: 2010 Toyota Camry - Auto Shows
Top Competitors: Chevrolet Malibu, Ford Fusion, Honda Accord sedan

Which are the key papers in Theoretical Computer Science?
- Hartmanis and Stearns, “On the computational complexity of algorithms”
- Blum, “A machine-independent theory of the complexity of recursive functions”
- Cook, “The complexity of theorem proving procedures”
- Karp, “Reducibility among combinatorial problems”
- Garey and Johnson, “Computers and Intractability: A Guide to the Theory of NP-Completeness”
- Yao, “Theory and Applications of Trapdoor Functions”
- Shafi Goldwasser, Silvio Micali, Charles Rackoff, “The Knowledge Complexity of Interactive Proof Systems”
- Sanjeev Arora, Carsten Lund, Rajeev Motwani, Madhu Sudan, and Mario Szegedy, “Proof Verification and the Hardness of Approximation Problems”

FedEx package tracking

[Screenshot: NEXRAD Radar, Binghamton, Base Reflectivity, 0.50 Degree Elevation, Range 124 NMI]

Collective Inference on Markov Models for Modeling Bird Migration
[Figure axes: space, time]

Daniel Sheldon, M. A. Saleh Elmohamed, Dexter Kozen

Science base to support activities
- Track flow of ideas in scientific literature
- Track evolution of communities in social networks
- Extract information from unstructured data sources

Tracking the flow of ideas in scientific literature Yookyung Jo

Tracking the flow of ideas in the scientific literature Yookyung Jo
[Figure: clusters of topic terms, in the order shown: Page rank, Web, Link, Graph, Retrieval, Query, Search, Text, Web, Page, Search, Rank, Web, Chord, Usage, Index, Probabilistic, Text, File, Retrieve, Text, Index, Discourse, Word, Centering, Anaphora]

Original papers

Original papers cleaned up

Referenced papers

Referenced papers cleaned up. Three distinct categories of papers


Tracking communities in social networks Liaoruo Wang

“Statistical Properties of Community Structure in Large Social and Information Networks”, Jure Leskovec, Kevin Lang, Anirban Dasgupta, Michael Mahoney
- Studied over 70 large sparse real-world networks.
- Best communities are of approximate size 100 to 150.

Our most striking finding is that in nearly every network dataset we examined, we observe tight but almost trivial communities at very small scales, and at larger size scales, the best possible communities gradually "blend in" with the rest of the network and thus become less "community-like."

[Figure: conductance plotted against size of community, with size 100 marked]
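Conductance, the quantity plotted on this slide, is commonly defined for a vertex set S as the number of edges leaving S divided by the smaller of the total degree inside S and the total degree outside S. A minimal sketch under that standard definition follows; the adjacency-dict representation and the function name `conductance` are illustrative assumptions, not taken from the talk.

```python
def conductance(adj, community):
    """Conductance of `community` in an undirected graph.

    adj: dict mapping each vertex to the set of its neighbours (assumed format).
    Returns cut(S) / min(vol(S), vol(V - S)); lower means more community-like.
    """
    inside = set(community)
    cut = sum(1 for u in inside for v in adj[u] if v not in inside)
    vol_in = sum(len(adj[u]) for u in inside)
    vol_out = sum(len(adj[u]) for u in adj if u not in inside)
    denom = min(vol_in, vol_out)
    return cut / denom if denom else 0.0


# Hypothetical toy graph: two triangles joined by the single edge 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
print(conductance(adj, {0, 1, 2}))  # one cut edge over a volume of 7
```

A single triangle scores far better (lower) than a single vertex here, which matches the intuition that good communities have few edges leaving them relative to their size.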

Giant component

Whisker: A component with v vertices connected by edges

Our view of a community
[Figure labels: Me, TCS, Colleagues at Cornell, Classmates, Family and friends]
More connections outside than inside

Should we remove all whiskers and search for communities in the core?
[Figure label: Core]

Should we remove whiskers?
- Does there exist a core in social networks? Yes. Experimentally it appears at p=1/n in the G(n,p) model.
- Is the core unique? Yes and no.
- In the G(n,p) model, should we require that a whisker have only a finite number of edges connecting it to the core?
Laura Wang

Algorithms
- How do you find the core?
- Are there communities in the core of social networks?

How do we find whiskers?
- NP-complete to determine if a graph has a whisker
- There exist graphs with whiskers for which neither the union of two whiskers nor the intersection of two whiskers is a whisker

Graph with no unique core

What is a community? How do you find them?

Communities
- Conductance
- Can we compress graph?
  Rosvall and Bergstrom, “An information-theoretic framework for resolving community structure in complex networks”
- Hypothesis testing
  Yookyung Jo

Description of graph with community structure
- Specify which vertices are in which communities.
- Specify the number of edges between each pair of communities.

Information necessary to specify graph given community structure
- m = number of communities
- n_i = number of vertices in the i-th community
- l_ij = number of edges between the i-th and j-th communities
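The formula on this slide did not survive in the transcript. One natural count, which to my knowledge matches the formulation in the Rosvall and Bergstrom paper cited on the Communities slide, is the logarithm of the number of graphs consistent with the structure: choose l_ii of the n_i(n_i-1)/2 possible edges inside each community and l_ij of the n_i n_j possible edges between each pair. The sketch below computes that quantity in bits; the function name and the list-of-lists input format are assumptions made for illustration.

```python
from math import comb, log2


def bits_given_structure(sizes, links):
    """Bits needed to pick the exact graph among all graphs consistent
    with the community structure.

    sizes[i]    : n_i, number of vertices in community i
    links[i][j] : l_ij, number of edges between communities i and j
                  (links[i][i] counts edges inside community i)
    """
    m = len(sizes)
    bits = 0.0
    for i in range(m):
        pairs_inside = sizes[i] * (sizes[i] - 1) // 2
        bits += log2(comb(pairs_inside, links[i][i]))
        for j in range(i + 1, m):
            bits += log2(comb(sizes[i] * sizes[j], links[i][j]))
    return bits


# Hypothetical example: two triangles joined by one edge.
structured = bits_given_structure([3, 3], [[3, 1], [1, 3]])
unstructured = log2(comb(15, 7))  # 7 edges among all 15 vertex pairs
print(structured, unstructured)
```

When the community structure fits the graph well, far fewer bits remain to be specified than for an unstructured graph with the same number of edges, which is the sense in which the graph can be compressed.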

Description of graph consists of description of community structure plus specification of graph given structure.
- Specify community for each edge and the number of edges between each pair of communities.
- Can this method be used to specify more complex community structure where communities overlap?

Hypothesis testing
- Null hypothesis: All edges generated with some probability p_0.
- Hypothesis: Edges in communities generated with probability p_1, other edges with probability p_0.
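One common way to compare two such hypotheses is a log-likelihood ratio with the probabilities set to their maximum-likelihood values. The sketch below is an illustration under that assumption, not the procedure used in the talk; the function names and the four count arguments are hypothetical.

```python
from math import log


def log_likelihood(edges, pairs, p):
    """Log-likelihood of seeing `edges` edges among `pairs` vertex pairs
    when each pair is joined independently with probability p."""
    if p in (0.0, 1.0):
        # Degenerate cases: likelihood is 1 only if the data agree with p.
        return 0.0 if edges == p * pairs else float("-inf")
    return edges * log(p) + (pairs - edges) * log(1 - p)


def log_likelihood_ratio(in_edges, in_pairs, out_edges, out_pairs):
    """Community hypothesis (p1 inside, p0 elsewhere) versus the null
    hypothesis (a single probability for every pair)."""
    p_null = (in_edges + out_edges) / (in_pairs + out_pairs)
    p1 = in_edges / in_pairs
    p0 = out_edges / out_pairs
    null = log_likelihood(in_edges + out_edges, in_pairs + out_pairs, p_null)
    alt = (log_likelihood(in_edges, in_pairs, p1)
           + log_likelihood(out_edges, out_pairs, p0))
    return alt - null


# Dense inside (6 of 6 pairs), sparse outside (1 of 9 pairs).
print(log_likelihood_ratio(6, 6, 1, 9))
```

A large ratio favours the community hypothesis; when the density inside and outside is the same, the ratio is zero.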

Clustering social networks
Mishra, Schreiber, Stanton, and Tarjan
- Each member of the community is connected to at least a beta fraction of the community.
- No member outside the community is connected to more than an alpha fraction of the community.
- Some connectivity constraint
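The two degree conditions above can be checked directly for a candidate vertex set. The following sketch assumes an adjacency-dict graph and does not count a vertex as its own neighbour (a convention choice); the function name is hypothetical, and the slide's additional connectivity constraint is not checked.

```python
def is_alpha_beta_community(adj, community, alpha, beta):
    """Check the two degree conditions for an (alpha, beta) community.

    adj: dict vertex -> set of neighbours (assumed representation).
    Each member must be adjacent to at least beta * |C| members, and each
    non-member to at most alpha * |C| members.
    """
    members = set(community)
    size = len(members)
    for v in adj:
        links_into_c = len(adj[v] & members)
        if v in members and links_into_c < beta * size:
            return False
        if v not in members and links_into_c > alpha * size:
            return False
    return True


# Hypothetical toy graph: two triangles joined by the single edge 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
print(is_alpha_beta_community(adj, {0, 1, 2}, alpha=0.34, beta=0.6))
```

This only verifies a given set; finding such sets efficiently is the harder question the following slides raise.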

In sparse graphs
- How do you find alpha-beta communities?
- What if each person in the community is connected to more members outside the community than inside?

Structure of social networks
- Different networks may have quite different structure.
- An algorithm for finding alpha-beta communities in sparse graphs.
- How many communities of a given size are there in a social network?
- Results on exploring a Twitter network of approximately 100,000 nodes.
Liaoruo Wang, Jing He, Hongyu Liang

- 200 randomly chosen nodes
- Apply alpha-beta algorithm
- Resulting alpha-beta community
- How many alpha-beta communities?

Massively overlapping communities
- Are there a small number of massively overlapping communities that share a common core?
- Are there massively overlapping communities in which one can move from one community to a totally disjoint community?

Massively overlapping communities with a common core

Massively overlapping communities

Define the core of a set of overlapping communities to be the intersection of the communities. There are a small number of cores in the Twitter data set.

[Table: size of initial set versus number of cores; the values are not preserved in the transcript]


- What is the graph structure that causes certain cores to merge and others to simply vanish?
- What is the structure of cores as they get larger?
- Do they consist of concentric layers that are less dense at the outside?

Sucheta Soundarajan Transmission paths for viruses, flow of ideas, or influence

Trivial model
[Figure labels: half of contacts, two thirds of contacts]

[Figure axes: time of first item, time of second item]

Theory to support new directions
- Random graphs
- High dimensions and dimension reduction
- Spectral analysis
- Massive data sets and streams
- Collaborative filtering
- Extracting signal from noise
- Learning and VC dimension

Science base
- What do we mean by science base?
- Example: High dimensions

High dimension is fundamentally different from 2 or 3 dimensional space

High dimensional data is inherently unstable
Given n random points in d dimensional space, essentially all (n choose 2) pairwise distances are equal.
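The claim can be observed numerically. The sketch below assumes the random points are drawn from a spherical Gaussian with unit variance per coordinate (the slide does not say which distribution is meant) and reports the ratio of the largest to the smallest pairwise distance; the function name and the sample sizes are illustrative.

```python
import math
import random


def distance_spread(n, d, seed=0):
    """Ratio of largest to smallest pairwise distance among n random
    Gaussian points in d dimensions (unit variance per coordinate)."""
    rng = random.Random(seed)
    pts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    dists = [math.dist(pts[i], pts[j])
             for i in range(n) for j in range(i + 1, n)]
    return max(dists) / min(dists)


spread_low = distance_spread(20, 2)      # two dimensions
spread_high = distance_spread(20, 2000)  # two thousand dimensions
print(spread_low, spread_high)
```

In two dimensions the ratio is large because some pairs happen to be close together; in high dimension it approaches 1, which is the sense in which all distances become essentially equal.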

High Dimensions
- Intuition from two and three dimensions is not valid for high dimensions.
- Volume of the unit cube is one in all dimensions.
- Volume of the unit sphere goes to zero as the dimension grows.
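The shrinking volume can be checked with the standard closed form for the radius-1 ball in d dimensions, pi^(d/2) / Gamma(d/2 + 1). The function name below is an illustrative choice.

```python
from math import gamma, pi


def unit_ball_volume(d):
    """Volume of the radius-1 ball in d dimensions: pi^(d/2) / Gamma(d/2 + 1)."""
    return pi ** (d / 2) / gamma(d / 2 + 1)


for d in (2, 3, 5, 10, 20, 50):
    print(d, unit_ball_volume(d))
```

The volume rises until about five dimensions and then falls off faster than exponentially, because the Gamma function in the denominator outgrows the power of pi in the numerator.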

Gaussian distribution
Probability mass concentrated between dotted lines

Gaussian in high dimensions

Two Gaussians

Distance between two random points from same Gaussian
- Points on thin annulus of radius [formula not preserved in the transcript]
- Approximate by sphere of radius [formula not preserved in the transcript]
- Average distance between two points is [formula not preserved in the transcript]
  (Place one point at the North Pole, the other at random. Almost surely the second point is near the equator.)

Expected distance between points from two Gaussians separated by δ: [formula not preserved in the transcript]

Can separate points from two Gaussians if [condition not preserved in the transcript]
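The formulas on the three slides above were images and did not survive in the transcript. As a hedged reconstruction, assuming spherical Gaussians with unit variance in each of the d coordinates (an assumption; the talk's normalization cannot be recovered from the text), the standard calculation runs as follows. Here x and y are independent samples, each measured relative to a common origin, and δ is the distance between the two centers; in the first line x is measured from the center of its own Gaussian.

```latex
\begin{align*}
\mathbb{E}\,|x|^2 &= d
  &&\Rightarrow\quad |x| \approx \sqrt{d}, \\
\mathbb{E}\,|x-y|^2 &= 2d
  &&\Rightarrow\quad |x-y| \approx \sqrt{2d}
  \quad\text{(same Gaussian)}, \\
\mathbb{E}\,|x-y|^2 &= \delta^2 + 2d
  &&\Rightarrow\quad |x-y| \approx \sqrt{\delta^2 + 2d}
  \quad\text{(different Gaussians)}.
\end{align*}
```

Because distances concentrate within an additive constant of these values, the two cases can be told apart when the gap $\sqrt{\delta^2 + 2d} - \sqrt{2d} \approx \delta^2 / (2\sqrt{2d})$ exceeds a suitable constant, that is, when δ is at least a constant multiple of $d^{1/4}$.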

Dimension reduction
- Project points onto subspace containing centers of Gaussians
- Reduce dimension from d to k, the number of Gaussians

- Centers retain separation
- Average distance between points reduced by a factor of √(d/k)

Can separate Gaussians provided the separation δ > some constant involving k and γ, independent of the dimension

Finding centers of Gaussians
First singular vector v_1 minimizes the perpendicular distance to points

The first singular vector goes through center of Gaussian and minimizes distance to the points.
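Why the first singular vector, usually defined as the direction of maximum projection, also minimizes perpendicular distance follows from the Pythagorean theorem. In this sketch the rows $a_1, \dots, a_n$ of a matrix $A$ are the data points and $v$ is a unit vector spanning a line through the origin; this notation is assumed here and is not taken from the slides. The quantity minimized is the sum of squared perpendicular distances.

```latex
\begin{align*}
|a_i|^2 &= (a_i \cdot v)^2 + \operatorname{dist}(a_i, \operatorname{span}(v))^2, \\
\sum_{i=1}^{n} \operatorname{dist}(a_i, \operatorname{span}(v))^2
  &= \sum_{i=1}^{n} |a_i|^2 - |Av|^2, \\
v_1 &= \arg\max_{|v|=1} |Av|
     = \arg\min_{|v|=1} \sum_{i=1}^{n} \operatorname{dist}(a_i, \operatorname{span}(v))^2 .
\end{align*}
```

Since $\sum_i |a_i|^2$ does not depend on $v$, maximizing $|Av|$ and minimizing the total squared distance select the same line.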

Best k-dimensional space for Gaussian is any space containing the line through the center of the Gaussian.

Given k Gaussians, the top k singular vectors define a k dimensional space that contains the k lines through the centers of the k Gaussians and hence contains the centers of the Gaussians

Johnson-Lindenstrauss
- A random projection of n points to O(log n) dimensional space approximately preserves all pairwise distances with high probability.
- Clearly one cannot project n points to a line and approximately preserve all pairwise distances.
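A small experiment illustrates both statements. The sketch below uses a random Gaussian matrix scaled by 1/sqrt(k), which is one standard way to realize such a projection; the function names, the point set, and the dimensions chosen are all illustrative assumptions.

```python
import math
import random


def random_projection(points, k, seed=0):
    """Project points to k dimensions with a random Gaussian matrix,
    scaled by 1/sqrt(k) so squared lengths are preserved in expectation."""
    rng = random.Random(seed)
    d = len(points[0])
    rows = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(d)]
            for _ in range(k)]
    return [[sum(r[t] * p[t] for t in range(d)) for r in rows] for p in points]


def worst_distortion(points, projected):
    """Largest relative error over all pairwise distances."""
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            before = math.dist(points[i], points[j])
            after = math.dist(projected[i], projected[j])
            worst = max(worst, abs(after / before - 1.0))
    return worst


rng = random.Random(1)
pts = [[rng.gauss(0.0, 1.0) for _ in range(300)] for _ in range(15)]
high_k = worst_distortion(pts, random_projection(pts, 100))
low_k = worst_distortion(pts, random_projection(pts, 1))
print(high_k, low_k)
```

Projecting to 100 dimensions keeps every pairwise distance within a modest relative error, while projecting to a single line distorts at least some pairs badly.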

We have just seen what a science base for high dimensional data might look like. What other areas do we need to develop a science base for?

- Extracting information from large data sources
  - Communities in social networks
  - Tracking flow of ideas
- Ranking is important
  - Restaurants, movies, books, web pages
  - Multi-billion-dollar industry
- Collaborative filtering
  - When a customer buys a product, what else is he likely to buy?
- Dimension reduction

Sparse vectors
There are a number of situations where sparse vectors are important:
- Tracking the flow of ideas in scientific literature
- Biological applications
- Signal processing

Sparse vectors in biology
[Figure labels: plants; Genotype (internal code); Phenotype (observables, outward manifestation)]

Signal processing: reconstruction of sparse vector
[Figure labels: y; A (random orthonormal matrix, Fourier transform); x]

If x is a sparse vector, one can recover x from a few random samples of y. Take a set of random samples of y. The set of x consistent with these samples is an affine space V and [remainder not preserved in the transcript]

A vector and its Fourier transform cannot both be sparse.

Lemma: Let n_x be the number of nonzero elements of x and let n_y be the number of nonzero elements of y. Then n_x n_y ≥ n.
Proof: Let x have s nonzero elements. Consider the submatrix of Z consisting of the columns corresponding to nonzero elements of x and s consecutive rows of Z. This submatrix is nonsingular.


Lemma: The Vandermonde matrix is non singular. Proof:

One can construct a polynomial of degree n-1 through any n points
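The bound n_x n_y ≥ n in the lemma above can be met with equality. A minimal sketch, assuming the transform in question is the discrete Fourier transform: a spike train with period 4 in a signal of length 16 has 4 nonzero entries, and so does its transform. The helper names are illustrative.

```python
import cmath


def dft(x):
    """Plain O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def count_nonzero(v, tol=1e-9):
    """Number of entries whose magnitude exceeds a small tolerance."""
    return sum(1 for c in v if abs(c) > tol)


n = 16
x = [1.0 if t % 4 == 0 else 0.0 for t in range(n)]  # spike train, period 4
y = dft(x)
print(count_nonzero(x), count_nonzero(y), n)
```

At the other extreme, a single spike has a transform with all n entries nonzero, so the product of the two counts is again n.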

Representation in composite basis cannot have two sparse representations

Data streams
- Record the WWW on your laptop
- Census data
- Find number of distinct elements in a sequence.
- Find all frequently occurring elements in a sequence.
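For the last problem, one well-known one-pass technique is the Misra-Gries summary, shown here as an illustration; the talk does not say which algorithm it has in mind. With at most k - 1 counters it retains every element that occurs more than a 1/k fraction of the time.

```python
def misra_gries(stream, k):
    """One pass, at most k - 1 counters. Every element occurring more than
    len(stream) / k times is guaranteed to be among the returned keys
    (the result may also contain some infrequent elements)."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = list("abacabadabaeafa")  # 'a' occurs 8 times out of 15
print(misra_gries(stream, 3))
```

The memory used depends only on k, not on the length of the stream, which is what makes the approach suitable when the data cannot be stored. A second pass can remove the false positives if exact counts are needed.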

Multiplication of two very large sparse matrices
C = A B
- Sample rows of A and columns of B.
- What if most rows of A are very light and a few are very heavy?
- Sample rows with probability proportional to square of weight.
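The sampling idea can be sketched as follows. The slide speaks of rows of A and columns of B; the standard analysis of this estimator pairs column k of A with row k of B, so the sketch adopts that reading and takes "weight" to mean Euclidean length. Both are assumptions, as are the function name and the plain list-of-lists matrices.

```python
import random


def sampled_product(A, B, samples, seed=0):
    """Estimate A @ B from `samples` randomly chosen column-row pairs.

    Column k of A (paired with row k of B) is drawn with probability
    proportional to the squared length of that column, and each draw is
    rescaled by 1 / (samples * p_k) so the estimate is unbiased.
    """
    rng = random.Random(seed)
    rows, inner, cols = len(A), len(B), len(B[0])
    weights = [sum(A[i][k] ** 2 for i in range(rows)) for k in range(inner)]
    total = sum(weights)
    estimate = [[0.0] * cols for _ in range(rows)]
    for k in rng.choices(range(inner), weights=weights, k=samples):
        scale = total / (samples * weights[k])  # equals 1 / (samples * p_k)
        for i in range(rows):
            for j in range(cols):
                estimate[i][j] += scale * A[i][k] * B[k][j]
    return estimate


# Hypothetical example: one heavy column and one light column in A.
A = [[10.0, 0.1], [20.0, 0.2]]
B = [[1.0, 2.0], [3.0, 4.0]]
print(sampled_product(A, B, samples=50))
```

Sampling in proportion to squared length means the heavy pairs, which contribute most to the product, are almost always picked, while the rescaling keeps the expected value equal to the exact product.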

Random graphs: the emergence of the giant component in the G(n,p) model. To compute the size of a connected component of G(n,p), do a search of a component starting from a random vertex and generate an edge only when the search process needs to know if the edge exists.

Algorithm
- Mark start vertex discovered.
- Select an unexplored vertex from the set of discovered vertices, add all vertices that are connected to this vertex to the set of discovered vertices, and mark the vertex as explored.
- Discovered but unexplored vertices are called the frontier.
- Process terminates when frontier becomes empty.
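The search above can be sketched in a few lines. Edges are generated lazily: when a vertex is explored, each still-undiscovered vertex is joined to it with probability p, and edges to already discovered vertices are never generated because they cannot change the component size. The function name and the choice of vertex 0 as the start are illustrative assumptions.

```python
import random


def component_size(n, p, seed=0):
    """Size of the component containing vertex 0 in G(n, p), generating
    each potential edge only when the search first asks about it."""
    rng = random.Random(seed)
    undiscovered = set(range(1, n))
    frontier = [0]          # discovered but unexplored vertices
    size = 1
    while frontier:         # stop when the frontier is empty
        frontier.pop()      # explore one frontier vertex
        found = {v for v in undiscovered if rng.random() < p}
        undiscovered -= found
        frontier.extend(found)
        size += len(found)
    return size


n = 1000
print(component_size(n, 0.5 / n), component_size(n, 2.0 / n))
```

Running this for many seeds with p below and above 1/n shows the contrast the talk refers to: below the threshold the component stays small, while above it the search often reaches a constant fraction of all vertices.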

Let z_i be the number of vertices discovered in the first i steps of the search. The distribution of z_i is binomial. To see this, observe that, with p the edge probability, the probability that a given vertex
- is not adjacent to the vertex being searched in a given step is 1 - p
- is not discovered in i steps is (1 - p)^i
- is discovered in i steps is 1 - (1 - p)^i


This argument used average values. Can we make it rigorous?

Belief propagation
Graphical model or Markov random field

Factor graph

If factor graph is a tree
- Send messages up the tree
- Computes correct marginals
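The simplest tree is a chain, and on a chain the message passing can be written in a few lines. The sketch below uses made-up factor values for three binary variables and checks the marginal obtained by messages against brute-force enumeration; the variable names and factor tables are purely illustrative.

```python
from itertools import product

# A chain x0 - x1 - x2 of binary variables (a tree-shaped factor graph).
# unary[i][a] scores x_i = a; pair[i][a][b] scores (x_i, x_{i+1}) = (a, b).
unary = [[1.0, 2.0], [1.0, 1.0], [3.0, 1.0]]
pair = [[[2.0, 1.0], [1.0, 2.0]], [[1.0, 3.0], [3.0, 1.0]]]


def marginal_by_messages(unary, pair):
    """Marginal of the last variable, sending messages along the chain."""
    msg = [1.0, 1.0]  # nothing arrives at x0 from the left
    for i in range(len(pair)):
        # Sum out x_i: unary factor times incoming message times pair factor.
        msg = [sum(unary[i][a] * msg[a] * pair[i][a][b] for a in (0, 1))
               for b in (0, 1)]
    belief = [unary[-1][b] * msg[b] for b in (0, 1)]
    z = sum(belief)
    return [v / z for v in belief]


def marginal_by_enumeration(unary, pair):
    """Same marginal by summing over all joint assignments."""
    belief = [0.0, 0.0]
    for xs in product((0, 1), repeat=len(unary)):
        w = 1.0
        for i, a in enumerate(xs):
            w *= unary[i][a]
        for i in range(len(pair)):
            w *= pair[i][xs[i]][xs[i + 1]]
        belief[xs[-1]] += w
    z = sum(belief)
    return [v / z for v in belief]


print(marginal_by_messages(unary, pair))
print(marginal_by_enumeration(unary, pair))
```

The two results agree, and the message version does work linear in the chain length, whereas enumeration grows exponentially with the number of variables.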

If the factor graph is not a tree, we could use the message passing algorithm anyway.
- No guarantee of convergence
- No guarantee of correct marginals even if the algorithm converges
Important issue: correlation of variables

Phase transition in correlation between spin at leaves and spin at root. p q

[Figure labels: high temperature, low temperature, phase transition]

Learning theory and VC-dimension
[Figure: census data, 250 million data points plotted by age and salary]
Theorem: If a shape has finite VC-dimension, an estimate based on a random sample will not deviate from the true value by much.

Discrepancy theory

Conclusions
- We are in an exciting time of change.
- Information technology is a big driver of that change.
- The computer science theory needs to be developed to support this information age.