Progress Report (Concept Extraction)
Presented by: Mohsen Kamyar

Outline
- Motivations of “Semantic Web”
- “Semantic Web” structure
- Motivations of “Automatic Ontology Extraction”
- “Concept Extraction”
- Problems
- Main ideas

Motivations of “Semantic Web”
W3C definition: “The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries.”
[Figure: timeline of computation history: first computers, first applications, inventing the Web, the Semantic Web, alongside huge computations, gathering human data, sharing data, and processing huge data.]

Motivations of “Semantic Web”
Some efforts:
- DBpedia (RDF-enabled Wikipedia data)
- FOAF (Friend of a Friend)
- SIOC (Semantically-Interlinked Online Communities)
- SIMILE (Semantic Interoperability of Metadata and Information in unLike Environments)

Motivations of “Semantic Web”
[Figure: different ontologies and metadata, and their overlaps.]

Motivations of “Semantic Web”
[Figure: different projects and their overlaps.]


“Semantic Web” structure
[Figure: the W3C’s suggested Semantic Web stack.]

“Semantic Web” structure
The road from “zero” to the Semantic Web:
- For existing data: have an ontology, annotate the documents, then submit queries on the semantics-enabled data.
- For new data: have an ontology, generate the data according to the ontology, then submit queries on the semantics-enabled data.


Motivations of “Automatic Ontology Extraction”
- An ontology is domain-specific: in each domain we need a team of experts to create one.
- Existing ontologies cover only small ranges of knowledge:
  - WordNet
  - Data about members, projects, and … in universities
  - DBLP, …

Motivations of “ Automatic Ontology Extraction ”  DailyMed for drugs  FOAF  MySpace  BBC  Widely used one is AudioSrobbler Based on event ontology with 600 million entry but it is a simple ontology. Problem: Representing Knowledge in a Machine Readable Format.


“Concept Extraction”
An ontology is a set of “concepts”, properties of concepts (and constraints on them), and “rules” between concepts. Even if we cannot generate a general ontology in a domain, the extracted “concepts” can still be used to determine the relevance of documents and their ranks.


Problems
The main components of existing methods are as follows:
- Similarity measures
  - Euclidean measures: cosine distance.
  - Problem: if a document “B” is a subset of a document “A”, the cosine distance between them can still be large (illustrated in the sketch below).
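A minimal Python sketch of this effect, assuming a made-up six-term vocabulary and term-frequency vectors:

```python
import numpy as np

def cosine_distance(a, b):
    # 1 minus the cosine similarity of two term-frequency vectors.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up six-term vocabulary; B uses only a subset of A's terms.
doc_a = np.array([3.0, 2.0, 4.0, 1.0, 2.0, 5.0])
doc_b = np.array([3.0, 2.0, 0.0, 0.0, 0.0, 0.0])

print(cosine_distance(doc_a, doc_b))  # ~0.53: large, although B lies inside A
```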

- Importance measures
  - Frequency: has a large false-positive error (on technical words) and a large false-negative error (on common words).
  - TF-IDF (term frequency times inverse document frequency; a minimal computation is sketched below):
    - It is more accurate.
    - But if we have two semantically correlated words, this measure misjudges the main topic of the document.
    - It also fails for long documents.
    - Information about the order of words is lost.
    - If we consider stems, synonyms, or other related sets of words it can work well; otherwise it has many problems.
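For reference, a minimal sketch of the standard TF-IDF weighting (the toy documents are made up); note that nothing in it captures word order or synonymy, which is exactly the weakness listed above:

```python
import math
from collections import Counter

def tf_idf(docs):
    # docs: list of token lists; returns one {term: weight} dict per document.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: (c / len(doc)) * math.log(n / df[t])
             for t, c in Counter(doc).items()}
            for doc in docs]

docs = [["semantic", "web", "ontology", "web"],
        ["web", "search", "ranking"],
        ["ontology", "concept", "extraction"]]
for weights in tf_idf(docs):
    print(weights)  # "web" is down-weighted because it occurs in two documents
```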

Problems  LSI (Latent Semantic Indexing/Analysis) This method uses SVD (singular value decomposition) After applying SVD we have term×concept (U matrix), eigenvalues, concept×document (V matrix). Matrices U and V are eigenvectors of terms correlation matrix and document correlation matrix. Eigenvectors have property of clustering the nodes of a graph (for example in PageRank algorithm we use same property) But there are problems.

Problems  This method assumes some probabilistic properties for term-document matrix that may be not hold, for example Gaussian Model properties.  This method is not customized for sparse matrices  This method is offline  This method has big computation complexity  Some efforts have been made such as SDD  This method again is based on cosine distance and similar measures


Main ideas
The main ideas of this work are as follows:
- Using more general spaces instead of Euclidean space, such as Banach spaces, and defining a new distance measure that preserves the desirable properties of the term-document space. Banach spaces have interesting properties for high-dimensional problems because of their generality.
- Using results from linguistics, instead of TF-IDF, to determine the importance of a term in a document.

Main ideas  Customizing LSI for sparse matrices and creating a framework for any method for matrix decomposing.  Giving an online version of LSI  LSI (SVD) basically is an approximation for term- document matrix but also have big computation complexity and can be approximated to.  We use clustering properties of eigenvectors in LSI, so we can substitute it with any clustering that is not based on Euclidian as mentioned.