Vector Space Information Retrieval Using Concept Projection Presented by Zhiguo Li

Similar presentations
The Mathematics of Information Retrieval 11/21/2005 Presented by Jeremy Chapman, Grant Gelven and Ben Lakin.
Text mining Gergely Kótyuk Laboratory of Cryptography and System Security (CrySyS) Budapest University of Technology and Economics
The 5th annual UK Workshop on Computational Intelligence London, 5-7 September 2005 Department of Electronic & Electrical Engineering University College.
Multi-Label Prediction via Compressed Sensing By Daniel Hsu, Sham M. Kakade, John Langford, Tong Zhang (NIPS 2009) Presented by: Lingbo Li ECE, Duke University.
Similarity/Clustering, Artificial Intelligence Lab, 문홍구. Content: What is Clustering; Clustering Methods; Distance-based (Hierarchical, Flat); Geometric embedding.
Dimensionality Reduction PCA -- SVD
Information Retrieval Lecture 6bis. Recap: tf-idf weighting The tf-idf weight of a term is the product of its tf weight and its idf weight. Best known.
Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching ANSHUL VARMA FAISAL QURESHI.
From last time: What's the real point of using vector spaces? A user's query can be viewed as a (very) short document. Query becomes a vector in the same.
Comparison of information retrieval techniques: Latent semantic indexing (LSI) and Concept indexing (CI) Jasminka Dobša Faculty of organization and informatics,
What is missing? Reasons that ideal effectiveness hard to achieve: 1. Users’ inability to describe queries precisely. 2. Document representation loses.
Latent Semantic Analysis
Hinrich Schütze and Christina Lioma
DIMENSIONALITY REDUCTION BY RANDOM PROJECTION AND LATENT SEMANTIC INDEXING Jessica Lin and Dimitrios Gunopulos Ângelo Cardoso IST/UTL December
Dimensionality reduction. Outline From distances to points : – MultiDimensional Scaling (MDS) – FastMap Dimensionality Reductions or data projections.
Principal Component Analysis
CS347 Lecture 4 April 18, 2001 ©Prabhakar Raghavan.
Latent Semantic Indexing via a Semi-discrete Matrix Decomposition.
Latent Semantic Indexing Jieping Ye Department of Computer Science & Engineering Arizona State University
Dimensional reduction, PCA
Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 4 March 30, 2005
Optimization of Sparse Matrix Kernels for Data Mining Eun-Jin Im and Katherine Yelick U.C.Berkeley.
TFIDF-space  An obvious way to combine TF-IDF: the coordinate of document in axis is given by  General form of consists of three parts: Local weight.
Problems for classical IR models: Introduction & Background (LSI, SVD, etc.); Example; Standard query method; Analysis of standard query method; Seeking.
The Terms that You Have to Know! Basis, Linear independent, Orthogonal Column space, Row space, Rank Linear combination Linear transformation Inner product.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 18: Latent Semantic Indexing 1.
Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 6 May 7, 2006
Multimedia Databases LSI and SVD. Text - Detailed outline text problem full text scanning inversion signature files clustering information filtering and.
E.G.M. Petrakis, Dimensionality Reduction. Given N vectors in n dims, find the k most important axes to project them; k is user defined (k < n). Applications:
DATA MINING LECTURE 7 Dimensionality Reduction PCA – SVD
Dimensionality Reduction
Non Negative Matrix Factorization
Beyond Co-occurrence: Discovering and Visualizing Tag Relationships from Geo-spatial and Temporal Similarities. Date: 2012/8/6 Resource: WSDM'12 Advisor.
Document retrieval Similarity –Vector space model –Multi dimension Search –Range query –KNN query Query processing example.
Feature selection LING 572 Fei Xia Week 4: 1/29/08
On Scaling Latent Semantic Indexing for Large Peer-to-Peer Systems Chunqiang Tang, Sandhya Dwarkadas, Zhichen Xu University of Rochester; Yahoo! Inc. ACM.
CpSc 881: Information Retrieval. Recall: Term-document matrix This matrix is the basis for computing the similarity between documents and queries. Today:
Pseudo-supervised Clustering for Text Documents Marco Maggini, Leonardo Rigutini, Marco Turchi, Department of Information Engineering, Università.
Katrin Erk Vector space models of word meaning. Geometric interpretation of lists of feature/value pairs In cognitive science: representation of a concept.
Latent Semantic Indexing: A probabilistic Analysis Christos Papadimitriou Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala.
June 5, 2006University of Trento1 Latent Semantic Indexing for the Routing Problem Doctorate course “Web Information Retrieval” PhD Student Irina Veredina.
SINGULAR VALUE DECOMPOSITION (SVD)
Progress Report (Concept Extraction) Presented by: Mohsen Kamyar.
A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining Farial Shahnaz.
Latent Concepts and the Number of Orthogonal Factors in Latent Semantic Analysis Georges Dupret
Computer-assisted essay assessment: Similarity scores by Latent Semantic Analysis; Comparison material based on relevant passages from textbook; Defining.
Latent Dirichlet Allocation
V. Clustering, Artificial Intelligence Lab, 이승희. Text: Text mining, pages 82-93.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Mining massive document collections by the WEBSOM method. Presenter: Yu-hui Huang. Authors: Krista Lagus,
Web Search and Text Mining Lecture 5. Outline Review of VSM More on LSI through SVD Term relatedness Probabilistic LSI.
Hierarchical Segmentation: Finding Changes in a Text Signal Malcolm Slaney and Dulce Ponceleon IBM Almaden Research Center.
Yue Xu, Shu Zhang. A person has already rated some movies; which movies might he/she be interested in, too? If we have huge data on users and movies, this.
Information-Theoretic Co- Clustering Inderjit S. Dhillon et al. University of Texas, Austin presented by Xuanhui Wang.
A Patent Document Retrieval System Addressing Both Semantic and Syntactic Properties Liang Chen*,Naoyuki Tokuda+, Hisahiro Adachi+ *University of Northern.
DATA MINING LECTURE 8 Sequence Segmentation Dimensionality Reduction.
Intelligent Database Systems Lab Presenter: CHANG, SHIH-JIE Authors: Tao Liu, Zheng Chen, Benyu Zhang, Wei-ying Ma, Gongyi Wu 2004.ICDM. Improving Text.
A Self-organizing Semantic Map for Information Retrieval Xia Lin, Dagobert Soergel, Gary Marchionini presented by Yi-Ting.
Image Retrieval and Ranking using L.S.I and Cross View Learning Sumit Kumar Vivek Gupta
Packing to fewer dimensions
Document Clustering Based on Non-negative Matrix Factorization
LSI, SVD and Data Management
Packing to fewer dimensions
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
Concept Decomposition for Large Sparse Text Data Using Clustering
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Packing to fewer dimensions
Restructuring Sparse High Dimensional Data for Effective Retrieval
Latent semantic space: Iterative scaling improves precision of inter-document similarity measurement. Rie Kubota Ando.
Latent Semantic Analysis
Presentation transcript:

Vector Space Information Retrieval Using Concept Projection Presented by Zhiguo Li

Text Documents 156,000 periodicals were in print worldwide in 1999, with approximately 12,000 added each year (Ulrich's International Periodical Directory). The Library of Congress maintains a collection of 17 million books and receives new books at the rate of 7,000 per working day.

Why Dimensional Reduction? The term-by-document matrix is high-dimensional and sparse: –It requires a large amount of computational resources –It makes the underlying concepts difficult to capture
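To make this concrete, here is a minimal sketch (my illustration, not part of the slides; the toy corpus is invented) that builds a term-by-document matrix with scikit-learn:

```python
# Even a three-document toy corpus yields a sparse term-by-document matrix;
# real corpora have on the order of 10^5 terms and documents.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "latent semantic indexing uses the singular value decomposition",
    "random projection approximately preserves pairwise distances",
    "spherical k-means clusters documents on the unit sphere",
]
A = TfidfVectorizer().fit_transform(docs).T  # terms x documents, SciPy sparse
print(A.shape, f"density {A.nnz / (A.shape[0] * A.shape[1]):.1%}")
```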

Desired Properties of Dimensional Reduction –Preserve distances between vectors (an orthogonal projection matrix) –Capture underlying concepts and bring semantically related documents together

Dimensional Reduction –Latent Semantic Indexing (LSI) uses the SVD and is computationally expensive. –Random projection avoids the cost of LSI, but cannot capture underlying semantics the way LSI does. –How about using both random projection and LSI? It turns out the combination does not retain the desired properties of LSI. –This paper: random projection of concept vectors, called "concept projection". Much faster, with retrieval quality comparable to LSI.
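For reference, a minimal LSI sketch (my own illustration, not the paper's code): a rank-k truncated SVD of the term-by-document matrix A gives each document k latent coordinates. The cost of this factorization on large corpora is exactly what motivates the cheaper alternatives above.

```python
# Minimal LSI sketch: rank-k truncated SVD of the (terms x documents) matrix A.
import numpy as np
from scipy.sparse.linalg import svds

def lsi_embed(A, k):
    """Return a k x n_documents matrix of coordinates in the latent space."""
    U, s, Vt = svds(A.asfptype(), k=k)  # truncated SVD; costly for large A
    return np.diag(s) @ Vt              # scaled document coordinates
```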

Concept Projection –Concepts: concept vectors obtained by the spherical K-means algorithm. –Projection: a randomly chosen orthogonal projection matrix, under which distances between vectors are approximately preserved.
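A sketch of the projection step, assuming a standard Johnson-Lindenstrauss-style construction (the transcript does not show the paper's exact choice of random matrix): draw a matrix with orthonormal rows and verify that a pairwise distance survives.

```python
# A random k x d matrix with orthonormal rows, scaled so squared lengths are
# preserved in expectation; pairwise distances are then nearly preserved.
import numpy as np

def random_orthogonal_projection(d, k, seed=0):
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))  # d x k, orthonormal columns
    return np.sqrt(d / k) * Q.T                       # k x d projection matrix

d, k = 10_000, 200
R = random_orthogonal_projection(d, k)
x, y = np.random.default_rng(1).standard_normal((2, d))
print(np.linalg.norm(x - y), np.linalg.norm(R @ (x - y)))  # nearly equal
```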

Spherical K-means algorithm
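The algorithm itself did not survive the transcript; below is a compact sketch of the standard spherical K-means iteration (assign by cosine similarity, renormalize each cluster centroid to unit length), which is one standard way to compute the concept vectors. All identifiers are mine:

```python
import numpy as np

def spherical_kmeans(X, k, iters=20, seed=0):
    """X: n x d document vectors (assumed nonzero). Returns k unit concept vectors."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # points on the unit sphere
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]   # initialize from the data
    for _ in range(iters):
        assign = (X @ C.T).argmax(axis=1)              # nearest concept by cosine
        for j in range(k):
            members = X[assign == j]
            if len(members):                           # leave empty clusters as-is
                m = members.sum(axis=0)
                C[j] = m / np.linalg.norm(m)           # renormalized mean = concept
    return C
```

On my reading of the slides, the document vectors are then mapped through the random orthogonal projection above; the paper's exact pipeline is not recoverable from this transcript.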

Conclusion Concept projection combines random projection with concept vectors to perform dimensional reduction, giving faster retrieval with results comparable to LSI.

Questions The paper's experiments use only 1,033 documents, which is too small a data set. The results become good only when there are at least 500 clusters; with 1,033 documents, that means roughly every 2 document vectors form their own cluster!