Data Mining for Information Retrieval (NTUST talk). Chun-Nan Hsu, Institute of Information Science, Academia Sinica, Taipei, Taiwan.

Presentation transcript:

Slide 1: Data Mining for Information Retrieval
Chun-Nan Hsu, Institute of Information Science, Academia Sinica, Taipei, Taiwan
Copyright © 1998 Chun-Nan Hsu. All rights reserved.

Slide 2: The formation of the field “data mining”
• Statistics, ~1800?
• Pattern recognition, ~1970
• Rule induction, machine learning, ~1980
• Expert systems, ~1970
• Relational databases, triggers, ~1980
• Knowledge Discovery in Databases (KDD), ~1990
• MIS decision support, ~1990
• Data mining, ~1995

Slide 3: Taxonomies of data mining
• Based on underlying technologies
  » decision trees, rule-based, example-based, nonlinear regression, neural networks, Bayesian networks, rough sets, ...
• Based on tasks at hand (due to Fayyad et al. 1997)
  » classification, regression, clustering, summarization, dependency modeling, change and deviation detection
• Based on data (formalizing these ideas)
  » collection of similarities
  » time series
  » image: a snapshot of a state
  » collection of images

Slide 4: Collection of similarities
• Characterize classes by generating classifiers (supervised learning)
• Cluster objects into classes (clustering, unsupervised learning)
• Many techniques are available, and most are well understood

Slide 5: Time series
• Forecasting: predicting the next (few) states
• Characterizing the “trend” to detect changes and deviations
• Can usually be reformulated as a supervised learning problem

Slide 6: Collection of images
• Extracting dependencies and correlations
• Example: a collection of shopping lists of supermarket customers
• Example: a collection of symptom lists of patients taking a new medicine
• Techniques
  » Association rules
  » Bayesian networks and other probabilistic graphical models

Slide 7: Image
• Summarization
• Key feature extraction
• Not much is known
• Example: a snapshot of an inventory database

Slide 8: Issue: Consistency of Machine-generated Rules
[Figure: rules are produced from database state (t) by learning / data mining / discovery; transactions (insert / delete / update) turn it into database state (t+1). Are the rules still consistent?]

Slide 9: Dealing with Inconsistent Rules
• Delete them?
  » Simple, but the system might have no rule to use
• Modify them?
  » Smart, but the system might be busy modifying rules
• Learn rules that are unlikely to become inconsistent
  » Yes, but how does it know which rule to learn?
• Need a way to measure the “likelihood of not becoming inconsistent”: the robustness of knowledge

Slide 10: Robustness vs. Predictive Accuracy
Given a rule A → C:
• Closed-world assumption on databases: BOTH insertions and deletions affect inconsistency
• Robustness of a rule is measured with regard to entire database states D: Pr(A → C | D)
• Predictive accuracy of a rule is measured with regard to data tuples d: Pr(C | A, d)

Slide 11: Definition of Robustness of Knowledge (1)
• A rule is robust if it is unlikely to become inconsistent with a database state
• Intuitively, this probability can be estimated as
  (# of database states consistent with the rule) / (# of possible database states)
• However:
  » database states are not equally probable
  » the number of database states is intractably large

Slide 12: Definition of Robustness of Knowledge (2)
• A rule is robust given a current database state if transactions that invalidate the rule are unlikely to be performed
• The likelihood of database states depends on
  » the current database state
  » the probability of transactions performed on that state
• The new definition of robustness is 1 - Pr(t|d)
  » t: a transaction that invalidates the rule is performed
  » d: the current database state

Slide 13: Robustness Estimation
• Step 1: Find transactions that invalidate the input rule
• Step 2: Decompose the probabilities of invalidating transactions into local probabilities
• Step 3: Estimate local probabilities

Slide 14: Step 1: Find Transactions that Invalidate the Input Rule
• R1: The latitude of a Maltese geographic location is greater than or equal to 35.89
  geoloc(_,_,?country,?latitude,_) & (?country = “Malta”) → ?latitude ≥ 35.89
• Transactions that invalidate R1:
  » T1: One of the existing tuples of geoloc with country = “Malta” is updated such that its latitude < 35.89
  » T2: An inconsistent tuple is inserted
  » T3: A tuple whose latitude < 35.89 is updated so that its country becomes “Malta”
• Robust(R1) = 1 - Pr(t|d) = 1 - (Pr(T1|d) + Pr(T2|d) + Pr(T3|d))
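The three invalidating transactions can be made concrete with a small sketch. Everything below is illustrative: the `Geoloc` fields, the row values and the probabilities passed to `robustness` are made up, and the bound 35.89 is taken from the worked example on slide 17. Only the rule R1 and the formula Robust(R1) = 1 - (Pr(T1|d) + Pr(T2|d) + Pr(T3|d)) come from the slide.

```python
from typing import NamedTuple

BOUND = 35.89  # latitude bound, taken from the worked example on slide 17


class Geoloc(NamedTuple):
    name: str        # hypothetical field, for readability only
    country: str
    latitude: float


def r1_holds(rows):
    """R1: every tuple whose country is Malta has latitude >= BOUND."""
    return all(r.latitude >= BOUND for r in rows if r.country == "Malta")


def robustness(invalidating_probs):
    """Robust(R) = 1 - (Pr(T1|d) + Pr(T2|d) + ...), as on the slide."""
    return 1.0 - sum(invalidating_probs)


# Made-up database state d in which R1 is consistent.
state = [
    Geoloc("site_a", "Malta", 35.95),
    Geoloc("site_b", "Malta", 36.02),
    Geoloc("site_c", "Tunisia", 34.10),
]

# T1: update the latitude of an existing Malta tuple to below the bound.
after_t1 = [state[0]._replace(latitude=35.00)] + state[1:]
# T2: insert a new tuple that violates the rule.
after_t2 = state + [Geoloc("site_d", "Malta", 35.10)]
# T3: update the country of a tuple whose latitude is below the bound.
after_t3 = state[:2] + [state[2]._replace(country="Malta")]

for label, rows in [("d", state), ("T1", after_t1), ("T2", after_t2), ("T3", after_t3)]:
    print(label, r1_holds(rows))
```

Each of the three modified states breaks R1, which is why the probabilities of T1, T2 and T3 are the only quantities the robustness estimate needs.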

Slide 15: Step 2: Decompose the Probabilities of Invalidating Transactions
x1: type of transaction?
x2: on which relation?
x3: on which tuple?
x4: on which attribute?
x5: what new attribute value?
Pr(t|d) = Pr(x1,x2,x3,x4,x5|d)
        = Pr(x1|d) Pr(x2|x1,d) Pr(x3|x2,x1,d) Pr(x4|x2,x1,d) Pr(x5|x4,x2,x1,d)
        = p1 * p2 * p3 * p4 * p5
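The factorization can be read as a small Bayesian network in which each variable depends only on the parents listed in its factor. The sketch below encodes that parent structure and multiplies the matching local probabilities. The conditional probability tables (`cpts`), the value names and the numbers are hypothetical placeholders, not values from the talk.

```python
# Parents of each variable, read off the factorization on the slide.
PARENTS = {
    "x1": (),
    "x2": ("x1",),
    "x3": ("x2", "x1"),
    "x4": ("x2", "x1"),
    "x5": ("x4", "x2", "x1"),
}


def transaction_probability(assignment, cpts):
    """Pr(x1..x5 | d) as the product of the five local probabilities."""
    p = 1.0
    for var, parents in PARENTS.items():
        # Key = (value of var, values of its parents in the order above).
        key = (assignment[var],) + tuple(assignment[q] for q in parents)
        p *= cpts[var][key]
    return p


# Hypothetical local probabilities for one invalidating transaction.
cpts = {
    "x1": {("update",): 0.3},
    "x2": {("geoloc", "update"): 0.5},
    "x3": {("malta_tuple", "geoloc", "update"): 0.05},
    "x4": {("latitude", "geoloc", "update"): 0.2},
    "x5": {("below_bound", "latitude", "geoloc", "update"): 0.5},
}
assignment = {
    "x1": "update",
    "x2": "geoloc",
    "x3": "malta_tuple",
    "x4": "latitude",
    "x5": "below_bound",
}

print(transaction_probability(assignment, cpts))
```

Keeping the parent sets explicit makes it easy to see that x3 and x4 do not condition on each other, which is what keeps each local probability small enough to estimate.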

Slide 16: Step 3: Estimate Local Probabilities
• Estimate local probabilities using Laplace's Law of Succession (Laplace 1820):
  (r + 1) / (n + k)
  where r is the number of times the outcome was observed, n is the total number of observations, and k is the number of possible outcomes
• Useful information for robustness estimation:
  » transaction log
  » expected size of tables
  » information about attribute ranges, value distributions
• When no information is available, use database schema information
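A minimal sketch of the estimator follows. The function name `laplace_estimate` and the log counts in the second call are assumptions for illustration; the formula (r + 1) / (n + k) is the one on the slide.

```python
from fractions import Fraction


def laplace_estimate(r, n, k):
    """Laplace's Law of Succession: (r + 1) / (n + k).

    r: times the outcome was observed
    n: total observations
    k: number of possible outcomes
    """
    if k < 1 or r < 0 or n < r:
        raise ValueError("need k >= 1 and 0 <= r <= n")
    return Fraction(r + 1, n + k)


# No transaction log: with three transaction types (insert, delete,
# update) the estimate falls back to the uniform value 1/3.
print(laplace_estimate(0, 0, 3))

# Hypothetical log: 10 updates among 100 recorded transactions.
print(laplace_estimate(10, 100, 3))
```

With no observations the estimate depends only on k, which is how schema information alone (number of relations, number of attributes) can supply a usable local probability.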

Slide 17: Example of Robustness Estimation
R1: geoloc(_,_,?country,?latitude,_) & (?country = “Malta”) → ?latitude ≥ 35.89
• T1: One of the existing tuples of geoloc with country = “Malta” is updated such that its latitude < 35.89
  » p1: update? 1/3 = 0.33
  » p2: geoloc? 1/2 = 0.50
  » p3: geoloc, country = “Malta”? 4/80 = 0.05
  » p4: geoloc, latitude to be updated? 1/5 = 0.20
  » p5: latitude updated to < 35.89? 1/2 = 0.50
• Pr(T1|d) = p1 * p2 * p3 * p4 * p5 ≈ 0.0008
• Pr(T2|d) and Pr(T3|d) can be estimated similarly
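The product for T1 can be checked with exact fractions. This sketch uses only the five local probabilities listed on the slide; the dictionary labels are informal. Since Pr(T2|d) and Pr(T3|d) are not given, the last line reports only an upper bound on Robust(R1), not the robustness itself.

```python
from fractions import Fraction
from math import prod

# Local probabilities for T1, as listed on the slide.
local = {
    "p1 update": Fraction(1, 3),
    "p2 relation is geoloc": Fraction(1, 2),
    "p3 tuple has country Malta": Fraction(4, 80),
    "p4 attribute is latitude": Fraction(1, 5),
    "p5 new latitude below 35.89": Fraction(1, 2),
}

pr_t1 = prod(local.values())
print(pr_t1, float(pr_t1))

# Robust(R1) = 1 - (Pr(T1|d) + Pr(T2|d) + Pr(T3|d)) <= 1 - Pr(T1|d).
upper_bound = 1 - pr_t1
print(upper_bound)
```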

Slide 18: Example (cont.): When additional information is available
• Naive
  » p1: update? 1/3 = 0.33
• Laplace
  » p1: update? (# of previous updates + 1) / (# of previous transactions + 3)
• m-Probability (Cestnik & Bratko 1991)
  » p1: update? (# of previous updates + m * Pr(U)) / (# of previous transactions + m)
  » m is an expected number of future transactions
  » Pr(U) is a prior probability of updates
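The three estimators for p1 can be compared side by side. This is a sketch: the function name `m_estimate`, the log counts and the choice m = 20 with prior 0.5 are made up for illustration. Setting m = 3 and Pr(U) = 1/3 reproduces the Laplace form on the slide.

```python
from fractions import Fraction


def m_estimate(r, n, m, prior):
    """m-probability estimate: (r + m * prior) / (n + m)."""
    if n + m <= 0:
        raise ValueError("need n + m > 0")
    return (r + m * prior) / (n + m)


updates, transactions = 10, 100   # hypothetical transaction-log counts

naive = Fraction(1, 3)                                   # no log used
laplace = Fraction(updates + 1, transactions + 3)        # slide formula
same_as_laplace = m_estimate(updates, transactions, 3, Fraction(1, 3))
informed = m_estimate(updates, transactions, 20, Fraction(1, 2))

print(naive, laplace, same_as_laplace, informed)
```

A larger m pulls the estimate toward the prior Pr(U); with an empty log the estimate equals the prior exactly.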

Slide 19: Applying Robustness Estimation
• Robustness may not be the only desirable property of target rules
• Need to combine robustness and other utility measures to guide learning
  » Tautologies are the most robust
• Using many measures to guide rule generation could be difficult

Slide 20: Pruning Rule Literals with Robustness Estimation
• Use existing algorithms to generate rules
• Prune literals of an output rule based on its applicability and estimated robustness
• Example: if wharf in Malta, depth < 50ft, with one or more crane → its length > 1200ft
  » the shortest rule consistent with the database:
    if wharf in Malta → its length > 1200ft
  » the most robust:
    if wharf in Malta with one or more crane → its length > 1200ft
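One way to picture the pruning step is an exhaustive search over subsets of the antecedent literals, keeping those that stay consistent with the data and then ranking them. This is a sketch under stated assumptions, not the algorithm from the talk: the wharf rows are invented, and the `ROBUSTNESS` scores are made-up placeholders standing in for a real robustness estimator, chosen so that the outcome matches the slide's example.

```python
from itertools import combinations

# Invented rows: (country, depth_ft, cranes, length_ft).
ROWS = [
    ("Malta", 40, 2, 1500),
    ("Malta", 60, 0, 1300),
    ("Italy", 30, 1, 900),
]

# Antecedent literals of the original rule, by name.
LITERALS = {
    "malta": lambda r: r[0] == "Malta",
    "depth": lambda r: r[1] < 50,
    "crane": lambda r: r[2] >= 1,
}


def consequent(row):
    return row[3] > 1200


def consistent(names):
    """True if every row matching all named literals satisfies the consequent."""
    matching = [r for r in ROWS if all(LITERALS[n](r) for n in names)]
    return all(consequent(r) for r in matching)


def consistent_subsets():
    names = list(LITERALS)
    return [
        subset
        for size in range(1, len(names) + 1)
        for subset in combinations(names, size)
        if consistent(subset)
    ]


# Placeholder scores; a real system would call a robustness estimator here.
ROBUSTNESS = {
    frozenset({"malta"}): 0.90,
    frozenset({"malta", "depth"}): 0.85,
    frozenset({"malta", "crane"}): 0.97,
    frozenset({"malta", "depth", "crane"}): 0.92,
}

candidates = consistent_subsets()
shortest = min(candidates, key=len)
most_robust = max(candidates, key=lambda s: ROBUSTNESS[frozenset(s)])
print(shortest, most_robust)
```

Exhaustive enumeration is only practical for a handful of literals; it is used here to keep the example short.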

Slide 21: Applications
• Learning rules for semantic query optimization (Hsu & Knoblock ML94; Siegel, Boston U. thesis 89; Shekhar et al. IEEE TKDE 94)
• Learning functional dependencies (Mannila & Raiha KDD94)
• Discovering models to reconcile/integrate heterogeneous databases (Dao, Son et al. KDD95)
• Learning to answer intensional queries (Chu et al. 91)
• Discovering knowledge for decision support

Slide 22: Summary
• Data mining from an “image” needs to estimate the robustness of extracted knowledge
• Robustness can be defined based on the probability of invalidating transactions
• Robustness can be estimated efficiently
• Rule pruning guided by robustness and other utility measures may yield robust and interesting rules
• Discovering robust knowledge can enhance database functionalities

Slide 23: Data Mining for IR?
• Different tasks need different ways to collect and prepare data
• Data preparation and cleaning are important

Slide 24: Data Mining for IR? Issues
• Potential applications
  » text categorization (a.k.a. classification, routing, filtering)
  » fact extraction (a.k.a. template filling)
  » clustering
  » text summarization (a.k.a. abstracting, gisting)
  » user profiling and modeling
  » interactive query formulation
• Issues
  » scaling up to large volumes of data
  » feature selection (a.k.a. dimensionality reduction)

Slide 25: Projects
• Recent projects
  » Template filling: inducing information extractors from labeled semi-structured documents (J of Info Systems, 1999)
  » Feature selection: feature selection for backprop neural networks (IEEE Tools with AI, 1998)
• To-be-proposed projects
  » Alias mining for digital libraries (NSC)
  » Classifying NL diagnosis records to ICD-9-CM codes (NHI)
• More projects to come; plans for collaboration are very welcome!