Matching Similarity for Keyword-Based Clustering. Mohammad Rezaei, Pasi Fränti. Speech and Image Processing Unit, University of Eastern Finland.

Presentation transcript:

Matching Similarity for Keyword-Based Clustering
Mohammad Rezaei, Pasi Fränti
Speech and Image Processing Unit, University of Eastern Finland
August 2014

Keyword-Based Clustering
- An object such as a text document, website, movie, or service can be described by a set of keywords.
- Objects can have different numbers of keywords.
- The goal is to cluster objects based on the semantic similarity of their keywords.

Similarity Between Word Groups
- How should similarity between objects, the main requirement for clustering, be defined?
- Assuming a similarity between two words is available, the task is to define similarity between word groups.

Similarity of Words
- Lexical: Car ≠ Automobile
- Semantic:
  - Corpus-based
  - Knowledge-based
  - Hybrid of corpus-based and knowledge-based
  - Search engine based

Wu & Palmer
[Figure: example taxonomy with the nodes animal, amphibian, reptile, mammal, fish, horse, stallion, mare, cat, wolf, dog, hunting dog, dachshund, terrier]
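The Wu & Palmer measure scores two concepts by the depth of their lowest common ancestor relative to their own depths: 2 · depth(LCS) / (depth(a) + depth(b)). The sketch below illustrates that formula on a toy taxonomy. The parent links in `PARENT` are an assumption made for illustration (the tree edges are not recoverable from the slide), and the root is taken to have depth 1.

```python
# Toy is-a taxonomy, child -> parent. These edges are ASSUMED for
# illustration; only the node names come from the slide.
PARENT = {
    "mammal": "animal",
    "fish": "animal",
    "reptile": "animal",
    "amphibian": "animal",
    "horse": "mammal",
    "cat": "mammal",
    "wolf": "mammal",
    "dog": "mammal",
    "stallion": "horse",
    "mare": "horse",
    "hunting dog": "dog",
    "dachshund": "hunting dog",
    "terrier": "hunting dog",
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def depth(node):
    """Depth of a node, counting the root as depth 1."""
    return len(path_to_root(node))


def wu_palmer(a, b):
    """Wu & Palmer similarity: 2 * depth(LCS) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # The first node on b's path to the root that is also on a's path
    # is the lowest common subsumer (LCS).
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


if __name__ == "__main__":
    for pair in [("stallion", "mare"), ("wolf", "dog"), ("fish", "dachshund")]:
        print(pair, round(wu_palmer(*pair), 3))
```

Under these assumed edges, siblings deep in the tree (stallion, mare) score higher than concepts that only meet at the root (fish, dachshund).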

Similarity Between Word Groups
- Minimum: similarity of the two least similar words
- Maximum: similarity of the two most similar words
- Average: sum of all pairwise similarities divided by the number of pairs
- We have used the Wu & Palmer measure for the similarity of two words.
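The three traditional measures can be sketched as reductions over all cross-group word pairs. The word similarity `toy_sim` below is a stand-in, not Wu & Palmer itself: it assumes identical words score 1.0 and that café and lunch score 0.32, which is the value the minimum of 0.32 in the café/lunch example implies under that assumption.

```python
from itertools import product


def toy_sim(a, b):
    """Stand-in word similarity (ASSUMED values, for illustration only)."""
    if a == b:
        return 1.0
    table = {frozenset(("cafe", "lunch")): 0.32}
    return table.get(frozenset((a, b)), 0.0)


def pairwise(group1, group2, sim):
    """Similarities of every word in group1 against every word in group2."""
    return [sim(a, b) for a, b in product(group1, group2)]


def min_similarity(group1, group2, sim):
    return min(pairwise(group1, group2, sim))


def max_similarity(group1, group2, sim):
    return max(pairwise(group1, group2, sim))


def average_similarity(group1, group2, sim):
    values = pairwise(group1, group2, sim)
    return sum(values) / len(values)


if __name__ == "__main__":
    g1 = ["cafe", "lunch"]
    g2 = ["cafe", "lunch"]
    print("min:", min_similarity(g1, g2, toy_sim))
    print("max:", max_similarity(g1, g2, toy_sim))
    print("avg:", round(average_similarity(g1, g2, toy_sim), 2))
```

With these assumed values, two identical keyword sets do not reach 1.0 under the minimum or the average, because the cross pairs (café, lunch) pull both down.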

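Issues of Traditional Measures
- Service 1: Café, lunch
- Service 2: Café, lunch
- Min: 0.32, Max: 1.00, Average: [value missing]
- The keyword sets are identical, so the services should be fully similar. Is the maximum measure then a good choice?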

Issues of Traditional Measures
- Service 1: Book, store
- Service 2: Cloth, store
- Max: 1.00
- These are different services, yet the maximum measure considers them exactly similar.

Issues of Traditional Measures
- Service 1: Restaurant, lunch, pizza, kebab, café, drive-in
- Service 2: Restaurant, lunch, pizza, kebab, café
- These are two very similar services, yet Min: 0.03 (between drive-in and pizza).

Matching Similarity
Greedy pairing of words:
- The two most similar unpaired words are paired, and this is repeated iteratively.
- The remaining unpaired keywords are matched to their most similar words in the other group.
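The greedy pairing can be sketched as follows. This is one reading of the two steps above, not necessarily the authors' implementation: ties are broken by list order, and the values in `TOY` are hypothetical apart from café/lunch, which reuses 0.32 from the earlier example.

```python
# Hypothetical word similarities, for illustration only.
TOY = {
    frozenset(("cafe", "lunch")): 0.32,
    frozenset(("pizza", "cafe")): 0.2,
    frozenset(("pizza", "lunch")): 0.6,
    frozenset(("book", "cloth")): 0.4,
}


def toy_sim(a, b):
    """Stand-in word similarity: 1.0 for identical words, table otherwise."""
    return 1.0 if a == b else TOY.get(frozenset((a, b)), 0.0)


def greedy_pairs(group1, group2, sim):
    """Return (word1, word2, similarity) triples from greedy pairing."""
    free1, free2 = list(group1), list(group2)
    pairs = []
    # Step 1: repeatedly pair the two most similar unpaired words.
    while free1 and free2:
        a, b = max(
            ((x, y) for x in free1 for y in free2),
            key=lambda p: sim(p[0], p[1]),
        )
        pairs.append((a, b, sim(a, b)))
        free1.remove(a)
        free2.remove(b)
    # Step 2: leftover words (from the larger group) are matched to
    # their most similar word in the other group, paired or not.
    for a in free1:
        b = max(group2, key=lambda y: sim(a, y))
        pairs.append((a, b, sim(a, b)))
    for b in free2:
        a = max(group1, key=lambda x: sim(x, b))
        pairs.append((a, b, sim(a, b)))
    return pairs


if __name__ == "__main__":
    print(greedy_pairs(["book", "store"], ["cloth", "store"], toy_sim))
    print(greedy_pairs(["cafe", "lunch", "pizza"], ["cafe", "lunch"], toy_sim))
```

In the second call the groups differ in size, so "pizza" is left over after the two exact matches and is attached to its closest word in the other group.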

Matching Similarity
The similarity between two objects with N_1 and N_2 words, where N_1 ≥ N_2, is defined in terms of S(w_i, w_p(i)), the similarity between word w_i and its pair w_p(i).
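The formula itself is not present in this transcript. One form consistent with the definitions on this slide, offered as an assumption rather than as the authors' exact expression, averages the paired similarities over the words of the larger group:

```latex
S(O_1, O_2) = \frac{1}{N_1} \sum_{i=1}^{N_1} S\bigl(w_i, w_{p(i)}\bigr),
\qquad N_1 \ge N_2 ,
```

where O_1 and O_2 (names introduced here for illustration) denote the two objects, w_i ranges over the N_1 words of the larger group, and p(i) is the index of the word paired with w_i by the greedy matching. Under this form, two identical keyword sets pair every word with itself and score exactly 1.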

Examples
- Example 1: (1) Café, lunch; (2) Café, lunch
- Example 2: (1) Book, store; (2) Cloth, store
- Example 3: (1) Restaurant, lunch, pizza, kebab, café, drive-in; (2) Restaurant, lunch, pizza, kebab, café

Experiments
- Data: location-based services from Mopsi
  - English and Finnish words: Finnish words were converted to English using Microsoft Bing Translator, followed by manual refinement to eliminate automatic translation issues
  - 378 services
- Similarity measures: Minimum, Average, and Matching
- Clustering algorithms: Complete-link and average-link
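Complete-link and average-link clustering only need a similarity between objects, so any of the word-group measures can be plugged in. The sketch below is a minimal agglomerative loop under stated assumptions: the service names and keyword sets are hypothetical, and Jaccard overlap is used as a simple stand-in for the object similarity (it is not one of the measures compared in the experiments).

```python
# Hypothetical services and keywords, for illustration only.
KEYWORDS = {
    "salon1": {"barber", "hair", "salon"},
    "salon2": {"barber", "hair", "salon"},
    "salon3": {"barber", "hair", "salon", "shop"},
    "cafe1": {"cafe", "coffee"},
    "cafe2": {"cafe", "lunch"},
}


def jaccard(a, b):
    """Stand-in object similarity: keyword overlap between two services."""
    ka, kb = KEYWORDS[a], KEYWORDS[b]
    return len(ka & kb) / len(ka | kb)


def agglomerative(items, sim, k, linkage="complete"):
    """Merge the two most similar clusters until k clusters remain."""
    if linkage not in ("complete", "average"):
        raise ValueError("linkage must be 'complete' or 'average'")
    clusters = [[x] for x in items]

    def cluster_sim(c1, c2):
        values = [sim(a, b) for a in c1 for b in c2]
        if linkage == "complete":
            return min(values)  # complete link: the least similar pair
        return sum(values) / len(values)  # average link: mean over pairs

    while len(clusters) > k:
        i, j = max(
            (
                (i, j)
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters))
            ),
            key=lambda ij: cluster_sim(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    names = list(KEYWORDS)
    print(agglomerative(names, jaccard, k=2, linkage="complete"))
    print(agglomerative(names, jaccard, k=2, linkage="average"))
```

Because the loop works on similarities rather than distances, complete link merges the pair of clusters whose least similar members are most similar. This naive version recomputes all cluster similarities at every step, which is acceptable for a few hundred objects.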

Similarity Between Services
Mopsi services:
- A1: Parturi-kampaamo Nona
- A2: Parturi-kampaamo Platina
- A3: Parturi-kampaamo Koivunoro
- B1: Kielo
- B2: Kahvila Pikantti
Keywords: barber hair salon barber hair salon barber hair salon shop cafe cafeteria coffee lunch restaurant

Similarity Between Services
[Table: pairwise similarities among services A1, A2, A3, B1, and B2 under minimum, average, and matching similarity]

Evaluation Based on the SC Criterion
- Run clustering for every number of clusters from K = 378 down to 1.
- Calculate the SC criterion for every resulting clustering.
- The minimum SC indicates the best number of clusters.

SC – Complete Link

SC – Average Link

The Sizes of the Four Largest Clusters
Complete link (similarity: sizes of 4 biggest clusters)
- Minimum
- Average
- Matching
Average link (similarity: sizes of 4 biggest clusters)
- Minimum
- Average
- Matching 272317

Conclusion and Future Work
- A new measure called matching similarity was proposed for comparing two groups of words.
- Future work:
  - Generalize matching similarity to other clustering algorithms such as k-means and k-medoids
  - Theoretical analysis of similarity measures for word groups