Efficient summarization framework for multi-attribute uncertain data
Jie Xu, Dmitri V. Kalashnikov, Sharad Mehrotra



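The Summarization Problem
Uncertain data set: objects O1, O2, …, On, each described by attributes such as
- location (e.g. LA)
- face (e.g. Jeff, Kate)
- visual concepts (e.g. water, plant, sky)
Extractive summary: a subset of the objects, e.g. O1, O8, O11, O25.
Abstractive summary: a generated description, e.g. "Kate Jeff wedding at LA".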

Summarization Process
Modeling information: what information does this image (information object) contain?
Extract the best subset of the dataset as the summary.
Metrics?
- Coverage (Agrawal, WSDM’09; Li, WWW’09; Liu, SDM’09; Sinha, WWW’11)
- Diversity (Vee, ICDE’08; Ziegler, WWW’05)
- Quality (Sinha, WWW’11)

Existing Techniques
- Image: Kennedy et al. WWW’08; Simon et al. ICCV’07; Sinha et al. WWW’11
- Customer review: Hu et al. KDD’04; Ly et al. CoRR’11
- Doc/micro-blog: Inouye et al. SocialCom’11; Li et al. WWW’09; Liu et al. SDM’09
These techniques do not consider information in multiple attributes and do not deal with uncertain data.

Challenges
Design a summarization framework for:
- Multi-attribute data: visual concept, face, tags, location, time, event.
- Uncertain/probabilistic data: data processing (e.g. vision analysis) yields attribute values with probabilities, e.g. visual concepts with P(sky) = 0.7, P(people) = 0.9.

Limitations of existing techniques – 1
Existing techniques typically model and summarize a single information dimension.
- Summarize only information about visual content (Kennedy et al. WWW’08, Simon et al. ICCV’07)
- Summarize only information about review content (Hu et al. KDD’04, Ly et al. CoRR’11)

What information is in the image?
Information is modeled as information units (IUs).
- Elemental IUs: {sky}, {plant}, …, {Kate}, {Jeff}, {wedding}, {12/01/2012}, {Los Angeles}
- Is that all? Intra-attribute IUs: {Kate, Jeff}, {sky, plant}, …
- Even more information from attributes? Inter-attribute IUs: {Kate, LA}, {Kate, Jeff, wedding}, …
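The three kinds of IU can be illustrated with a short enumeration. This is only a sketch: the `photo` dictionary, the function names, and the representation of an IU as a frozenset of (attribute, value) pairs are assumptions made for illustration, not the authors' data model.

```python
from itertools import combinations, product

# Hypothetical object: attribute name -> set of values (values echo the slide's example).
photo = {
    "face": {"Kate", "Jeff"},
    "visual concept": {"sky", "plant"},
    "event": {"wedding"},
    "location": {"Los Angeles"},
}

def elemental_ius(obj):
    """One IU per single attribute value, e.g. {Kate}."""
    return {frozenset([(attr, v)]) for attr, vals in obj.items() for v in vals}

def intra_attribute_ius(obj, size=2):
    """Combinations of values drawn from one attribute, e.g. {Kate, Jeff}."""
    return {
        frozenset((attr, v) for v in combo)
        for attr, vals in obj.items()
        for combo in combinations(sorted(vals), size)
    }

def inter_attribute_ius(obj, attr_a, attr_b):
    """Pairs of values that span two different attributes, e.g. {Kate, Los Angeles}."""
    return {
        frozenset([(attr_a, a), (attr_b, b)])
        for a, b in product(obj[attr_a], obj[attr_b])
    }
```

Enumerating every combination grows exponentially with the number of values, which is why the following slides restrict attention to interesting IUs.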

Are all information units interesting?
- Is {Sharad, Mike} an interesting intra-attribute IU? Yes, they often have coffee together and appear frequently in other photos.
- Is {Liyan, Ling} interesting? Yes from my perspective, because they are both my close friends.
- Are all of the 2^n combinations of people interesting? Shall we select a summary that covers all of this information? Well, probably not! I don’t care about person X and person Y who happen to be together in a photo of a large group.

Mine for interesting information units
The face attribute of each object is treated as a transaction:
- O1: face {Jeff, Kate} → T1
- O2: face {Tom} → T2
- O3: face {Jeff, Kate, Tom} → T3
- O4: face {Kate, Tom} → T4
- O5: face {Jeff, Kate} → T5
- …
- On: face {Jeff, Kate} → Tn
A modified item-set mining algorithm finds frequent, correlated IUs, e.g. {Jeff, Kate}.
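As a rough illustration of the mining step, the sketch below counts the support of small itemsets over the five listed transactions and keeps those that meet a threshold. It is a plain support count, not the modified algorithm mentioned on the slide, and it ignores the correlation criterion; `frequent_itemsets`, `min_count`, and `max_size` are hypothetical names.

```python
from collections import Counter
from itertools import combinations

# Face sets of O1..O5 as listed on the slide, one transaction per object.
transactions = [
    {"Jeff", "Kate"},
    {"Tom"},
    {"Jeff", "Kate", "Tom"},
    {"Kate", "Tom"},
    {"Jeff", "Kate"},
]

def frequent_itemsets(transactions, min_count, max_size=2):
    """Return {itemset: support count} for itemsets seen in at least min_count transactions."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(t), size):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_count}

# min_count=3 is an arbitrary threshold chosen for this toy input.
frequent = frequent_itemsets(transactions, min_count=3)
```

With this threshold the only pair that survives is {Jeff, Kate}, which matches the IU highlighted on the slide.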

Mine for interesting information units (social context)
- O1: face {Jeff, Kate}
- O2: face {Jeff}
- O3: face {Jeff, Kate, Tom}
- O4: face {Kate, Tom}
- O5: face {Jeff, Kate}
- …
- On: face {Jeff, Kate}
Mine from social context (e.g. Jeff is a friend of Kate, Tom is a close friend of the user): {Jeff, Kate}, {Tom}.

Limitations of existing techniques – 2
Existing techniques cannot handle probabilistic attributes.
- Example: one object has P(Jeff) = 0.8 and another has P(Jeff) = 0.6.
- It is not certain whether an object in the summary covers an IU in another object of the dataset.

Deterministic Coverage Model – Example
Coverage = 8 / 14
[Figure: information objects in the dataset and in the summary]
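A deterministic coverage ratio such as 8 / 14 can be sketched as the share of the dataset's IUs that appear in at least one summary object. The counting rule (distinct, unweighted IUs), the toy `dataset`, and the function name are assumptions; the slide's figure is not available to confirm how its 14 units were counted.

```python
# Hypothetical toy dataset: object id -> set of IUs it contains (no uncertainty).
dataset = {
    "o1": {"Kate", "Jeff", "sky"},
    "o2": {"Kate", "Tom"},
    "o3": {"plant", "sky"},
}

def deterministic_coverage(summary, dataset):
    """Fraction of the dataset's distinct IUs found in at least one summary object."""
    all_ius = set().union(*dataset.values())
    covered = set().union(*(dataset[o] for o in summary))
    return len(covered) / len(all_ius) if all_ius else 0.0

# Summary {o1} covers Kate, Jeff and sky out of five distinct IUs.
coverage = deterministic_coverage({"o1"}, dataset)
```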

Probabilistic Coverage Model
Cov(S, O) = (expected amount of information covered by S) / (expected amount of total information)
- Simplified so that it can be computed efficiently.
- Can be computed in polynomial time.
- The function is submodular.
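One way to make the ratio concrete is sketched below. It assumes that each object contains an IU independently with a known probability and that an IU counts as present when at least one object contains it. That independence model, the toy probabilities, and the function names are assumptions for illustration; the exact formula on the slide did not survive in the transcript.

```python
from math import prod

# Hypothetical data: object id -> {IU: probability that the object contains it}.
dataset = {
    "o1": {"Jeff": 0.8, "sky": 0.7},
    "o2": {"Jeff": 0.6, "people": 0.9},
}

def prob_present(iu, objects, dataset):
    """P(at least one of `objects` contains `iu`), assuming independent objects."""
    return 1.0 - prod(1.0 - dataset[o].get(iu, 0.0) for o in objects)

def expected_coverage(summary, dataset):
    """Expected information covered by the summary over expected total information."""
    ius = {iu for probs in dataset.values() for iu in probs}
    total = sum(prob_present(iu, dataset, dataset) for iu in ius)
    covered = sum(prob_present(iu, summary, dataset) for iu in ius)
    return covered / total if total else 0.0

# Summary {o1}: covered = 0.8 + 0.7 = 1.5; total = 0.92 + 0.7 + 0.9 = 2.52.
cov_o1 = expected_coverage({"o1"}, dataset)
```

Under these assumptions the numerator is a sum of "at least one" probabilities, which is monotone and submodular in the summary, in line with the property stated on the slide.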

Optimization Problem for Summarization
Parameters:
- dataset O = {o1, o2, …, on}
- positive number K
Finding the summary with maximum expected coverage is NP-hard. We developed an efficient greedy algorithm to solve it.

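Basic Greedy Algorithm
- Initialize S = empty set.
- For each object o in O \ S, compute Cov(S ∪ {o}, O).
- Select o* with the maximum value and add it to S.
- Done? If yes, stop; if no, repeat from the second step.
Two bottlenecks:
- Computing Cov is expensive (addressed by the object-level optimization).
- Cov is computed too many times (addressed by the iteration-level optimization).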
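The loop above can be sketched as follows. To keep the block self-contained it uses a simple deterministic set coverage as a stand-in for Cov and stops once the summary holds `k` objects; both choices, along with the toy `dataset`, are assumptions rather than details taken from the slides.

```python
# Hypothetical toy dataset: object id -> set of IUs.
dataset = {
    "o1": {"a", "b", "c", "g"},
    "o2": {"c", "d", "f"},
    "o3": {"a", "b"},
    "o4": {"e"},
}

def cov(summary, dataset):
    """Stand-in coverage: fraction of distinct IUs covered by the summary."""
    all_ius = set().union(*dataset.values())
    covered = set().union(*(dataset[o] for o in summary))
    return len(covered) / len(all_ius)

def basic_greedy(dataset, k):
    """Repeatedly add the object whose inclusion gives the highest coverage."""
    summary = []
    while len(summary) < min(k, len(dataset)):
        candidates = [o for o in dataset if o not in summary]
        # Recomputes the full coverage for every candidate: the costly part.
        best = max(candidates, key=lambda o: cov(summary + [o], dataset))
        summary.append(best)
    return summary

selected = basic_greedy(dataset, k=2)
```

Each iteration evaluates the coverage once per remaining object, so picking `k` objects from `n` costs on the order of `n * k` coverage computations.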

Efficiency Optimization – Object-level
Reduce the time required to compute the coverage for one object.
- Instead of directly computing and optimizing the coverage in each iteration, compute the gain of adding one object o to summary S: gain(S, o) = Cov(S ∪ {o}, O) − Cov(S, O)
- Updating gain(S, o) is much more efficient.

Submodularity of Coverage
Expected coverage Cov(S, O) is submodular: for S ⊆ T ⊆ O and an object o not in T,
Cov(S ∪ {o}, O) − Cov(S, O) ≥ Cov(T ∪ {o}, O) − Cov(T, O)
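The diminishing-returns property can be checked by brute force on a small example. The sketch below tests it for an unnormalized deterministic set coverage, which is an assumed stand-in and not the expected coverage defined in the talk; `is_submodular` and the toy `dataset` are hypothetical.

```python
from itertools import combinations

# Hypothetical toy dataset: object id -> set of IUs.
dataset = {
    "o1": {"a", "b", "c"},
    "o2": {"c", "d"},
    "o3": {"a", "e"},
}

def covered(summary):
    """Number of distinct IUs covered by the summary (unnormalized coverage)."""
    return len(set().union(*(dataset[o] for o in summary)))

def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = sorted(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_submodular(f, ground):
    """Check f(S + o) - f(S) >= f(T + o) - f(T) for all S within T and o outside T."""
    for t in subsets(ground):
        for s in subsets(t):
            for o in ground - t:
                if f(s | {o}) - f(s) < f(t | {o}) - f(t):
                    return False
    return True

ground = frozenset(dataset)
coverage_is_submodular = is_submodular(covered, ground)
```

The exhaustive check is exponential in the number of objects, so it is only useful as a sanity test on tiny inputs.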

Efficiency Optimization – Iteration-level
Reduce the number of object-level computations (i.e. gain(S, o)) in each iteration of the greedy process.
While traversing objects in O \ S, we maintain:
- the maximum gain so far, gain*;
- an upper bound Upper(S, o) on gain(S, o), which holds by definition and by submodularity and is updated in constant time.
Prune an object o if Upper(S, o) < gain*.
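The pruning rule can be sketched with a common lazy-evaluation pattern: the most recently computed gain of an object serves as its upper bound, because submodularity guarantees the gain can only shrink as the summary grows. Using the last computed gain as Upper(S, o) is an assumption, since the slide's exact bound is not recoverable from the transcript; the coverage function and `dataset` are also illustrative stand-ins.

```python
# Hypothetical toy dataset: object id -> set of IUs.
dataset = {
    "o1": {"a", "b", "c", "g"},
    "o2": {"c", "d", "f"},
    "o3": {"a", "b"},
    "o4": {"e"},
}

def pruned_greedy(dataset, k):
    """Greedy selection that skips objects whose upper bound is below gain*."""
    def covered(objs):
        return len(set().union(*(dataset[o] for o in objs)))

    summary, evaluations = [], 0
    # Assumed bound: infinity until an object's gain has been computed once.
    upper = {o: float("inf") for o in dataset}
    while len(summary) < min(k, len(dataset)):
        base = covered(summary)
        best, best_gain = None, -1  # best_gain plays the role of gain*
        for o in dataset:
            if o in summary or upper[o] < best_gain:
                continue  # pruned: this object cannot beat the current best
            gain = covered(summary + [o]) - base
            evaluations += 1
            upper[o] = gain  # stays a valid bound for larger summaries (submodularity)
            if gain > best_gain:
                best, best_gain = o, gain
        summary.append(best)
    return summary, evaluations

selected, evaluations = pruned_greedy(dataset, k=2)
```

On this toy input the second iteration skips `o4`, so 6 gains are computed instead of the 7 an unpruned traversal would need, while the selected objects stay the same.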

Experiment – Datasets
- Facebook Photo Set: 200 photos uploaded by 10 Facebook users.
- Review Dataset: reviews about 10 hotels from TripAdvisor; each hotel has about 250 reviews on average.
- Flickr Photo Set: 20,000 photos from Flickr.
Attribute labels shown on the slide: visual concept, event, time, face, facets, rating.

Experiment – Quality

Experiment – Efficiency
The basic greedy algorithm without optimization runs for more than 1 minute.

Summary
Developed a new extractive summarization framework that:
- handles multi-attribute data;
- handles uncertain/probabilistic data;
- generates high-quality summaries;
- is highly efficient.
