Data Mining CS 341, Spring 2007 Final Project: presentation & report & codes.


2 Judgment File for Testing Data is Ready to Download project/test04-complete-rel.txt

3 What’s Due on Monday (May 7th)
Your output files for the 25 testing queries
–by 12:00am Monday (two electronic files via )
15-minute PowerPoint presentation
–In class
A written (typed) report
–In class
Codes (a hard copy of the programs you wrote for this project)
–In class

4 sameple_output.txt 76 Q0 NYT Q0 NYT Q0 APW Q0 APW Q0 APW Q0 NYT Q0 APW Q0 APW Q0 APW Q0 NYT Q0 APW

5 A Written Report on the Final Project
Problem Statement
Problem Analysis
Techniques/Approaches (what, why and how)
Implementation
Experimental Results and Analysis
Lessons Learned

6 Good Luck!

7

8 Final project
Goal: Apply available data mining techniques to solve a real-world problem.
Requirement:
–Apply two techniques/algorithms and implement at least one algorithm. (Find existing codes online or team up with your classmates)
Problem: Relevant sentence retrieval
–Retrieve the set of relevant sentences given a query and a collection of documents

9 An Example Query: India and Pakistan Nuclear Tests
Description:
»On May 11 and 13, 1998 India conducted five nuclear tests; Pakistan responded by detonating six nuclear tests on May 28 and 30th. This nuclear testing was condemned by the international community.
Narrative:
»Relevant documents mention the nuclear testing conducted in May 1998 by both India and Pakistan. Historical information about the antagonism and rivalry between the two countries is not relevant. Mention of the furor created around the world by these detonations is relevant.

10 An Example Document
XIE XIE XIE XIE
France, Canada Condemn Pakistani Nuclear Tests
XIE NEW YORK, May 28 (Xinhua) -- More countries have come out to condemn Pakistan's nuclear tests.

11 An Example Document
XIE The French Foreign Ministry issued a communique Thursday to deplore and condemn the nuclear tests conducted by Pakistan on the same day.
XIE France calls on both India and Pakistan not to conduct any more nuclear tests but to sign the Comprehensive Test Ban Treaty and join talks on the banning of production of fissile materials that can be used to produce nuclear arms, said the communique.

12 An Example Document
XIE Canadian Foreign Affairs Minister Lloyd Axworthy said in a statement released Thursday, "We continue to urge Pakistan and India to renounce their nuclear weapons programs and to sign the Nuclear Non-Proliferation Treaty and the Comprehensive Test Ban Treaty."
XIE He also announced a series of sanctions against Pakistan, which he said are consistent with those imposed on India after its nuclear tests.

13 Training and Testing
Training set
–25 queries
–Collections of documents
–Relevance judgment of each sentence
Testing set
–Another 25 queries
–Collections of documents

14 Questions to Be Considered:
How do you represent queries and sentences?
What features may be considered for the task?
What data mining techniques can be used and how?
How do you evaluate the performance of your system?
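One possible answer to the representation question, offered only as a starting point: treat the query and each sentence as a bag of words and rank sentences by cosine similarity to the query. The function names (tokenize, cosine, rank_sentences) and the simple tokenizer are assumptions made for this sketch, not part of the project handout.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_sentences(query, sentences):
    """Return (score, sentence_index) pairs, best match first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(s))), i)
              for i, s in enumerate(sentences)]
    # Sort by descending score; ties keep the original sentence order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored


if __name__ == "__main__":
    sentences = [
        "The weather was mild on Thursday.",
        "More countries have come out to condemn Pakistan's nuclear tests.",
    ]
    for score, idx in rank_sentences("India and Pakistan Nuclear Tests", sentences):
        print(round(score, 3), sentences[idx])
```

Raw term counts are the simplest choice; term weighting, stop-word removal, or use of the description and narrative fields are natural features to try next.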

15 Evaluation:
Precision and Recall
The F measure
–F = 2*Precision*Recall/(Precision + Recall)
Average Precision for a query
–Calculated by averaging the precision values obtained each time recall increases, i.e. at each rank where a relevant sentence is retrieved.
Mean Average Precision
–The mean of the Average Precision values over all queries.
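The measures on this slide can be sketched in a few lines. This is a minimal illustration that assumes each ranked list contains no duplicate sentence IDs and that average precision is divided by the total number of relevant sentences; the function names are made up for the example and are not the official scoring script.

```python
def precision_recall_f(retrieved, relevant):
    """Set-based precision, recall and F measure for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f


def average_precision(ranked, relevant):
    """Average the precision at every rank where a relevant item appears."""
    relevant = set(relevant)
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs is a list of (ranked, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    ranked = ["s1", "s2", "s3", "s4"]
    relevant = {"s1", "s3", "s5"}
    print(precision_recall_f(ranked, relevant))
    print(average_precision(ranked, relevant))
```

In the usage example, relevant sentences are found at ranks 1 and 3, so the precision values 1/1 and 2/3 are summed and divided by the three relevant sentences.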

16 Data Mining Techniques
Classification
Clustering
Association Rules

17 Fuzzy c-means clustering
Each point has a degree of belonging to clusters, rather than belonging completely to just one cluster
For each point x we have a coefficient u_k(x) giving the degree of being in the kth cluster

18 Fuzzy c-means clustering
The centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster.
The degree of belonging is related to the inverse of the distance to the cluster centroid.
The coefficients are normalized with a real parameter m > 1 so that their sum is 1.
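The formulas themselves did not survive in this transcript. For reference, the textbook form of fuzzy c-means is usually written as below; this is the standard formulation, not a recovery of the original slide, and the symbols c_k (centroid of cluster k) and d (distance function) are notation assumed here.

```latex
c_k = \frac{\sum_{x} u_k(x)^m \, x}{\sum_{x} u_k(x)^m},
\qquad
u_k(x) = \frac{1}{\sum_{j} \left( \dfrac{d(c_k, x)}{d(c_j, x)} \right)^{2/(m-1)}}
```

With this update rule the coefficients of each point sum to 1 over the clusters, and a larger m gives softer (more evenly spread) memberships.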

19 Fuzzy c-means clustering
Choose a number of clusters.
Assign randomly to each point coefficients for being in the clusters.
Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold):
–Compute the centroid for each cluster, using the formula above.
–For each point, compute its coefficients of being in the clusters, using the formula above.
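The loop on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions: Euclidean distance, the standard membership update with exponent 2/(m-1), and a point that coincides with a centroid receiving full membership in that cluster. The function name and parameters (fuzzy_c_means, eps, max_iter, seed) are chosen for the example.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, eps=1e-4, max_iter=100, seed=0):
    """Return (centroids, u) where u[i][k] is the membership of point i in cluster k."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Assign random coefficients to each point, normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centroids = []
    power = 2.0 / (m - 1.0)
    for _ in range(max_iter):
        # Centroid of each cluster: mean of all points weighted by u^m.
        centroids = []
        for k in range(c):
            weights = [u[i][k] ** m for i in range(n)]
            wsum = sum(weights)
            centroids.append([
                sum(w * p[d] for w, p in zip(weights, points)) / wsum
                for d in range(dim)
            ])

        # Recompute the coefficients of every point from its distances.
        change = 0.0
        for i, p in enumerate(points):
            dists = [math.dist(p, ck) for ck in centroids]
            if any(d == 0.0 for d in dists):
                # Point sits exactly on a centroid: split membership among those.
                zeros = [d == 0.0 for d in dists]
                new = [1.0 / sum(zeros) if z else 0.0 for z in zeros]
            else:
                new = [1.0 / sum((dk / dj) ** power for dj in dists)
                       for dk in dists]
            change = max(change, max(abs(a - b) for a, b in zip(new, u[i])))
            u[i] = new

        # Converged when no coefficient moved by more than eps.
        if change <= eps:
            break
    return centroids, u


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, memberships = fuzzy_c_means(data, c=2)
    for point, row in zip(data, memberships):
        print(point, [round(v, 3) for v in row])
```

For sentence retrieval, the points would be feature vectors for sentences; a hard assignment can be read off by taking the cluster with the largest coefficient for each point.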

20 Next Class (Wednesday April 11th)
A 10-minute presentation on your project design, including
–(1) Problem statement(s)
–(2) Your technical approaches (I & II)
–(3) A work plan
–(4) Anticipated results
–(5) Evaluation metrics