Presentation transcript:

Minimum Spanning Trees Displaying Semantic Similarity
Włodzisław Duch & Paweł Matykiewicz
Department of Informatics, UMK Toruń; School of Computer Engineering, NTU Singapore; Cincinnati Children’s Hospital Research Foundation, OH, USA
Google: Duch

The Problem
Finding people who share some of our interests, in large organizations or worldwide, is difficult. Analyzing people’s homepages and their publication lists is a good way to find groups and individuals with common scientific interests. Maps should display both individuals and groups. The structure of the graphical representation depends strongly on the selection of keywords and on the dimensionality reduction used.

The Data
Reuters dataset, with 5 categories and 1–176 documents per category. 124 personal web pages from the School of Electrical and Electronic Engineering (EEE) of the Nanyang Technological University (NTU) in Singapore, with 5 categories (control, microelectronics, information, circuit, power) and 14–41 documents per category.

Document-word matrix
Document1: word1 word2 word3. word4 word3 word5.
Document2: word1 word3 word5. word1 word3 word6.

The matrix holds documents × word frequencies:

              word1  word2  word3  word4  word5  word6
  Document1     1      1      2      1      1      0
  Document2     2      0      2      0      1      1
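As an illustration, a minimal Python sketch of building such a matrix (the function and variable names are ours, not from the talk):

```python
from collections import Counter

def doc_word_matrix(documents):
    """Build a documents x word-frequencies matrix from whitespace-tokenized texts."""
    # Fix a global vocabulary so every document vector has the same length.
    vocab = sorted({w for doc in documents for w in doc.split()})
    matrix = []
    for doc in documents:
        counts = Counter(doc.split())
        matrix.append([counts.get(w, 0) for w in vocab])
    return vocab, matrix

docs = ["word1 word2 word3 word4 word3 word5",
        "word1 word3 word5 word1 word3 word6"]
vocab, F = doc_word_matrix(docs)
print(F)   # [[1, 1, 2, 1, 1, 0], [2, 0, 2, 0, 1, 1]]
```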

Methods used
Inverse document frequency and term weighting. Simple selection of relevant terms. Latent Semantic Analysis (LSA) for dimensionality reduction. Minimum Spanning Trees for visual representation. TouchGraph XML visualization of the MST trees.

Data Preparation
Normalize the columns of the frequency matrix F by dividing by the highest word frequencies:

$$\mathrm{tf}_{ij} = \frac{f_{ij}}{\max_i f_{ij}}$$

Among n documents, term j occurs in $d_j$ of them; the inverse document frequency $\mathrm{idf}_j$ measures the uniqueness of term j:

$$\mathrm{idf}_j = \log\frac{n}{d_j}$$

tf × idf term weights:

$$w_{ij} = \mathrm{tf}_{ij}\,\mathrm{idf}_j$$
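A short NumPy sketch of this weighting, assuming the standard max-normalized tf and logarithmic idf forms (the formulas on the original slides are not in the transcript):

```python
import numpy as np

def tfidf(F):
    """F: documents x words frequency matrix -> tf-idf weight matrix W."""
    F = np.asarray(F, dtype=float)
    n = F.shape[0]                          # number of documents
    tf = F / F.max(axis=0, keepdims=True)   # normalize each word column by its peak
    d = (F > 0).sum(axis=0)                 # d_j: documents containing term j
    idf = np.log(n / d)                     # idf_j = log(n / d_j)
    return tf * idf                         # w_ij = tf_ij * idf_j

W = tfidf([[1, 1, 2, 1, 1, 0],
           [2, 0, 2, 0, 1, 1]])
```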

Simple selection
Take the $w_{ij}$ weights above a certain threshold $\theta$, binarize, and remove zero rows:

$$b_{ij} = \begin{cases} 1 & \text{if } w_{ij} > \theta \\ 0 & \text{otherwise} \end{cases}$$

Calculate similarity using the cosine measure:

$$s(\mathbf{w}_a, \mathbf{w}_b) = \frac{\mathbf{w}_a \cdot \mathbf{w}_b}{\|\mathbf{w}_a\|\,\|\mathbf{w}_b\|}$$
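A sketch of both steps; the threshold value is an assumption, since the slides do not state it. With a documents × words orientation, the slide’s "zero rows" correspond to all-zero word columns here:

```python
import numpy as np

def simple_selection(W, theta=0.1):
    """Binarize weights above a threshold and drop terms that never pass it."""
    B = (np.asarray(W) > theta).astype(float)
    return B[:, B.any(axis=0)]      # keep only terms selected in some document

def cosine(a, b):
    """Cosine similarity between two document vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Example: similarity of two documents after selection.
# B = simple_selection(W); s = cosine(B[0], B[1])
```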

Dimensionality reduction
Latent Semantic Analysis (LSA): use Singular Value Decomposition of the weight matrix W,

$$W = U \Sigma V^T$$

where the columns of U are eigenvectors of $WW^T$ and the columns of V are eigenvectors of $W^T W$. Discard the small singular values, recreate the reduced matrix

$$W_k = U_k \Sigma_k V_k^T$$

and calculate similarity in the reduced space.
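A minimal LSA sketch with NumPy; k, the number of singular values kept, is a free parameter (the talk reports cutting the spectrum at fractions 0.8 and 0.6):

```python
import numpy as np

def lsa_reduce(W, k):
    """Recreate W from its k largest singular values (rank-k approximation)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]   # W_k = U_k Sigma_k V_k^T
```

Cosine similarity is then computed between rows of the reduced $W_k$ exactly as before.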

Kruskal’s Algorithm and Top-Down Clusterization
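A standard union-find sketch of Kruskal’s algorithm; the slide’s top-down clusterization step is not detailed in the transcript, though cutting the heaviest MST edges is one common way to obtain clusters. For semantic maps the edge weight would be a distance such as 1 − cosine similarity:

```python
def kruskal_mst(n, edges):
    """Kruskal's MST: n nodes labeled 0..n-1, edges as (weight, u, v) tuples."""
    parent = list(range(n))

    def find(x):
        # Union-find with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):          # lightest edges first
        ru, rv = find(u), find(v)
        if ru != rv:                       # accept edge unless it closes a cycle
            parent[ru] = rv
            mst.append((u, v, w))
    return mst

# Example: edges weighted by semantic distance, e.g. 1 - cosine similarity.
edges = [(0.1, 0, 1), (0.4, 1, 2), (0.3, 0, 2), (0.8, 2, 3)]
print(kruskal_mst(4, edges))   # [(0, 1, 0.1), (0, 2, 0.3), (2, 3, 0.8)]
```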

Modified Kruskal’s Algorithm and Bottom-Up Clusterization

Reuters results

  Method                     topics   clusters   accuracy
  No dim. reduction             –        –          – %
  LSA dim. red. 0.8 (476)       –        –          – %
  LSA dim. red. 0.6 (357)       –        –          – %
  Simple Selection              –        –          – %

Rank of W in SVD = 595.

Results for EEE NTU Web pages

  Method                     topics   clusters   accuracy
  No dim. reduction             –        –          – %
  LSA dim. red. 0.8 (467)       –        –          – %
  LSA dim. red. 0.6 (350)       –        –          – %
  Simple Selection              –        –          – %

Examples: TouchGraph LinkBrowser

Results for Summary Discharges
New experiments on medical texts, with 10 classes and 10 documents per class:

  Plain Doc-Word matrix                  ≈ 23%
  Stop-list, TF-IDF, simple selection    ≈ 64%
  Concept space                          ≈ 64%
  After transformation                   ≈ 93%

Simple Word-Doc Vector Space

Meta-Map Concept Vector Space

Concept Vector Space after transformation

Summary
In real applications a knowledge-based approach is needed to select only useful words and to parse people’s web pages. Other visualization methods (such as MDS) may be explored. People have many interests and thus may belong to several topic groups. This could be a very useful tool for creating new shared-interest groups on the Internet.