Classification and clustering methods development and implementation for unstructured documents collections, by Osipova Nataly, St. Petersburg State University.


Classification and clustering methods development and implementation for unstructured documents collections. Osipova Nataly, St. Petersburg State University, Faculty of Applied Mathematics and Control Processes, Department of Programming Technology.

Contents: Introduction; Methods description; Information Retrieval System; Experiments.

Contextual Document Clustering was developed in a joint project of the Faculty of Applied Mathematics and Control Processes, St. Petersburg State University, and the Northern Ireland Knowledge Engineering Laboratory (NIKEL), University of Ulster.

Definitions: document; terms dictionary; dictionary; cluster; word context; context or document conditional probability distribution; entropy.

Document conditional probability distribution. Document x:

y        word1   word2   word3   …   wordn
p(y|x)   5/m     10/m    6/m     …   16/m

y: words
tf(y): frequency of y
p(y|x) = tf(y)/m: conditional probability of y in document x
m: size of document x
(5/m, 10/m, 6/m, …, 16/m): document conditional probability distribution
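The definition above can be sketched in a few lines of Python. This is an illustration only: the function name and the toy document are hypothetical, and it assumes that the document size m is the total number of word occurrences, so that the probabilities sum to 1.

```python
from collections import Counter


def document_distribution(words):
    """Return p(y|x) = tf(y) / m for every word y of document x."""
    tf = Counter(words)   # tf(y): how often y occurs in the document
    m = len(words)        # m: document size (assumed: total word occurrences)
    return {y: count / m for y, count in tf.items()}


# Hypothetical toy document, already split into words.
doc_x = ["cat", "dog", "cat", "fish"]
p = document_distribution(doc_x)
```

Here `p["cat"]` is 2/4, matching the tf(y)/m pattern of the slide.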

Word context. Word w occurs in documents x1, x2, …, xk.

Document x1:
y         word1   word2   …   wordn1
p(y|x1)   5/m1    10/m1   …   16/m1

Document x2:
y         word1   word3   …   wordn2
p(y|x2)   7/m2    12/m2   …   4/m2

…

Document xk:
y         word1   word4   …   wordnk
p(y|xk)   20/mk   9/mk    …   3/mk

Context conditional probability distribution of w:
y         word1   word2   word3   …   wordnk
p(y|w)    32/m    10/m    12/m    …   3/m
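The aggregation in the word context slide (for example 5 + 7 + 20 = 32 for word1) can be sketched as follows. The function name and toy documents are hypothetical. The sketch assumes that m is the combined size of all documents containing w, and that w itself is counted in its own context; the slides do not state either point.

```python
from collections import Counter


def word_context(w, documents):
    """Sum term frequencies over all documents that contain w, then normalize."""
    tf = Counter()
    for words in documents:
        if w in words:           # only documents containing w contribute
            tf.update(words)
    m = sum(tf.values())         # assumed: m = m1 + m2 + ... + mk
    return {y: count / m for y, count in tf.items()}


# Hypothetical toy collection; the third document does not contain "a".
docs = [["a", "b", "a"], ["a", "c"], ["b", "c"]]
ctx = word_context("a", docs)
```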


Methods: document clustering method; dictionary build methods; document classification method using a training set. Information retrieval methods: keyword search method; cluster-based search method; similar documents search method.

Contextual Document Clustering. Diagram stages: documents; dictionary; narrow context words; distances calculation; clusters.

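Entropy. Let $(p_1, p_2, \ldots, p_n)$, with $p_1 + p_2 + \ldots + p_n = 1$, be the context conditional probability distribution of word y. Its entropy is $H(y) = -\sum_{i=1}^{n} p_i \log p_i$. Entropy is an uncertainty measure; here it is used to characterize the commonness (narrowness) of the word context.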
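A minimal sketch of the entropy computation, assuming Shannon entropy with base-2 logarithms (the slides do not state the base). The toy distributions are hypothetical; a distribution concentrated on few words gets a lower entropy, which is the sense in which a context is "narrow".

```python
import math


def entropy(p):
    """Shannon entropy of a probability vector; zero entries contribute nothing."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


narrow = entropy([0.9, 0.05, 0.05])   # mass concentrated on one word
broad = entropy([1 / 3, 1 / 3, 1 / 3])  # uniform: the maximum for n = 3
```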

Contextual Document Clustering: $\max H(y) = H\left(\frac{1}{n}, \ldots, \frac{1}{n}\right)$

Entropy [figure not preserved in the transcript; surviving labels: α, H]

Word Context - Document Distance. $p_1$: context conditional probability distribution of word y; $p_2$: conditional probability distribution of document x; $\frac{p_1 + p_2}{2}$: average conditional probability distribution.

Word Context - Document Distance: $JS[p_1, p_2] = H\left(\frac{p_1 + p_2}{2}\right) - 0.5\,H(p_1) - 0.5\,H(p_2)$

Jensen-Shannon divergence
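The Jensen-Shannon distance between a word context and a document can be sketched as below. The sketch assumes equal weights of 0.5, base-2 logarithms, and distributions stored as word-to-probability dictionaries; the function names and toy distributions are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (base 2) of a word -> probability mapping."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def js_divergence(p1, p2):
    """JS[p1, p2] = H((p1 + p2) / 2) - 0.5 * H(p1) - 0.5 * H(p2)."""
    words = set(p1) | set(p2)
    # Average distribution; a word missing from one side has probability 0 there.
    avg = {w: 0.5 * p1.get(w, 0.0) + 0.5 * p2.get(w, 0.0) for w in words}
    return entropy(avg) - 0.5 * entropy(p1) - 0.5 * entropy(p2)


context = {"court": 0.7, "tax": 0.3}    # hypothetical word context
document = {"court": 0.6, "tax": 0.4}   # hypothetical document distribution
distance = js_divergence(context, document)
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (no shared words).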

Dictionary construction. Why:
- large volumes: 60,000 documents, 50,000 words => 15,000 words in a context
- importance of narrow context words

Dictionary construction. Delete words with:
1. High or low frequency
2. High or low document frequency
3. Both 1 and 2
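A minimal sketch of such a filter, applying both criteria at once. The thresholds, function name, and toy collection are hypothetical; the slides do not give the actual cut-off values.

```python
from collections import Counter


def build_dictionary(documents, tf_range, df_range):
    """Keep words whose frequency and document frequency both lie in range."""
    tf = Counter()   # total occurrences of each word in the collection
    df = Counter()   # number of documents containing each word
    for words in documents:
        tf.update(words)
        df.update(set(words))
    tf_lo, tf_hi = tf_range
    df_lo, df_hi = df_range
    return {
        y for y in tf
        if tf_lo <= tf[y] <= tf_hi and df_lo <= df[y] <= df_hi
    }


docs = [["the", "cat"], ["the", "dog"], ["the", "cat", "rare"]]
# Hypothetical thresholds: drop the too-frequent "the" and the rare words.
kept = build_dictionary(docs, tf_range=(2, 2), df_range=(2, 2))
```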

Retrieval algorithms: keyword search method; cluster-based search method; search by example method.

Keyword search method.
Document 1: word 1, word 2, word 3, …, word n1
Document 2: word 10, word 25, word 30, …, word n2
Document 3: word 15, word 2, word 32, …, word n3
Document 4: word 11, word 21, word 3, …, word n4
Request: word 2. Result set: document 1, document 3.
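The keyword search example can be reproduced with an inverted index. This is a sketch under the assumption that the search is a plain word-to-documents lookup; the slides do not describe the actual implementation, and the function names are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map every word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, words in documents.items():
        for word in words:
            index[word].add(doc_id)
    return index


def keyword_search(index, word):
    """Return the identifiers of all documents containing the word."""
    return sorted(index.get(word, set()))


# The example from the slide, without the elided words.
documents = {
    "document 1": ["word 1", "word 2", "word 3"],
    "document 2": ["word 10", "word 25", "word 30"],
    "document 3": ["word 15", "word 2", "word 32"],
    "document 4": ["word 11", "word 21", "word 3"],
}
index = build_index(documents)
result = keyword_search(index, "word 2")
```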

Cluster based search method. Each cluster holds documents and a list of cluster context words.
Cluster 1: word 1, word 2, …, word n1
Cluster 2: word 12, word 26, …, word n2
Cluster 3: word 1, word 23, …, word n3
Request: word 1. Result set: cluster 1, cluster 3.
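A sketch of the cluster based lookup: the request word is matched against cluster context words, and the documents of the matching clusters are returned. The document assignments and function name below are hypothetical; only the cluster context words come from the slide.

```python
def cluster_search(cluster_contexts, cluster_documents, word):
    """Return matching clusters and the documents they contain."""
    matches = sorted(
        name for name, context in cluster_contexts.items() if word in context
    )
    return {name: cluster_documents[name] for name in matches}


cluster_contexts = {
    "cluster 1": {"word 1", "word 2"},
    "cluster 2": {"word 12", "word 26"},
    "cluster 3": {"word 1", "word 23"},
}
# Hypothetical assignment of documents to clusters.
cluster_documents = {
    "cluster 1": ["document 1", "document 4"],
    "cluster 2": ["document 2"],
    "cluster 3": ["document 3"],
}
found = cluster_search(cluster_contexts, cluster_documents, "word 1")
```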

Similar documents search. A cluster (cluster name) stores a minimal spanning tree over its documents: document 1, document 2, document 3, document 4, document 5, document 6, document 7.
Request: document 3. Result set: document 6, document 7.
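One way to read the minimal spanning tree slide is that the documents adjacent to the requested document in the tree are returned as similar. That reading is an assumption, as is the use of Prim's algorithm; the sketch below uses hypothetical one-dimensional "positions" in place of real document distances.

```python
def minimum_spanning_tree(nodes, dist):
    """Prim's algorithm over a non-empty node list; returns an adjacency map."""
    nodes = list(nodes)
    tree = {n: set() for n in nodes}
    in_tree = {nodes[0]}
    while len(in_tree) < len(nodes):
        # Cheapest edge that connects the tree to a node outside it.
        a, b = min(
            ((u, v) for u in in_tree for v in nodes if v not in in_tree),
            key=lambda edge: dist(edge[0], edge[1]),
        )
        tree[a].add(b)
        tree[b].add(a)
        in_tree.add(b)
    return tree


def similar_documents(tree, doc):
    """Assumed rule: the tree neighbours of a document are its similar documents."""
    return sorted(tree[doc])


# Hypothetical positions standing in for document distributions.
position = {"d1": 0.0, "d2": 1.0, "d3": 2.5, "d4": 3.0}
tree = minimum_spanning_tree(
    position, lambda a, b: abs(position[a] - position[b])
)
neighbours = similar_documents(tree, "d3")
```

In the described system the distance would presumably be the Jensen-Shannon divergence between documents rather than a difference of positions.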

Document classification: method 1. Inputs: clusters; list of topics; training set; test documents. Topic contexts are built from the training set, and distances between topic contexts and cluster contexts are calculated. Classification result: cluster 1 – topic 10; cluster 2 – topic 3; …; cluster n – topic 30.
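Method 1 can be sketched as assigning each cluster to the topic whose context is closest. The sketch assumes that the distance is the Jensen-Shannon divergence used elsewhere in the talk, and that a topic context is a word distribution aggregated from the training documents of that topic; the slides confirm neither detail, and all names and numbers below are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (base 2) of a word -> probability mapping."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def js_divergence(p1, p2):
    """Jensen-Shannon divergence with equal weights."""
    words = set(p1) | set(p2)
    avg = {w: 0.5 * p1.get(w, 0.0) + 0.5 * p2.get(w, 0.0) for w in words}
    return entropy(avg) - 0.5 * entropy(p1) - 0.5 * entropy(p2)


def classify_clusters(cluster_contexts, topic_contexts):
    """Map every cluster to the topic with the nearest context."""
    return {
        cluster: min(
            topic_contexts,
            key=lambda topic: js_divergence(context, topic_contexts[topic]),
        )
        for cluster, context in cluster_contexts.items()
    }


# Hypothetical topic contexts (from a training set) and cluster contexts.
topics = {
    "law": {"court": 0.7, "tax": 0.3},
    "water": {"river": 0.8, "tax": 0.2},
}
clusters = {
    "cluster 1": {"court": 0.6, "tax": 0.4},
    "cluster 2": {"river": 1.0},
}
assignment = classify_clusters(clusters, topics)
```

Test documents would then inherit the topic of the cluster they fall into.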

Document classification: method 2. Inputs: clusters; topics list; training set; test documents; all documents set. Classification result: cluster 1 – topic 10; cluster 2 – topic 3; …; cluster n – topic 30.


Information Retrieval System: architecture; features; use.

Information Retrieval System architecture: database; server; client.

IRS architecture: database; database server (MS SQL Server 2000); local area network; "thick" client (C#).

IRS architecture. DBMS MS SQL Server 2000: high-performance; scalable; secure; handles huge volumes of data; T-SQL; stored procedures.

IRS features. The IRS solves the following problems: document clustering; keyword search; cluster-based search; similar documents search; document classification using a training set.

DB structure. The database of the IRS consists of the following tables: documents; all words dictionary; dictionary; table of relations between documents and words (document-word); words contexts; words with narrow contexts; clusters; intermediate tables for building the main tables and for implementing retrieval.

Algorithms implementation. Diagram elements: documents; all words dictionary; dictionary; table "document-word"; words contexts; words with narrow contexts; clusters; centroid; keyword search; cluster-based search; similar documents search.

Similar documents search [figure: a cluster containing document1, document2, document3, document4, document5 with pairwise distances; the distance values are not preserved in the transcript]

Minimal Spanning Tree. Cluster (cluster name): document 1, document 2, document 3, document 4, document 5.

Similar documents search uses: clusters table; tree table; distances table.

IRS use


Test goals: testing algorithm accuracy; comparing different classification methods; evaluating algorithm efficiency.

Experiments: 60,000 documents; 100 topics; training set volume = 5% of the collection size.

Experiments

Result analysis:
- Russian Information Retrieval Evaluation Seminar
- Macro-averaged recall, precision, and F-measure were calculated.

Recall

Precision

F-measure
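The formulas on the recall, precision, and F-measure slides are not preserved in the transcript. The sketch below uses the standard definitions, assuming one category per document and a macro F-measure computed as the mean of per-category F values (another common convention derives it from the macro precision and recall). The function name and labels are hypothetical.

```python
def macro_average_scores(true_labels, predicted_labels):
    """Return macro-averaged (recall, precision, F-measure)."""
    categories = sorted(set(true_labels) | set(predicted_labels))
    pairs = list(zip(true_labels, predicted_labels))
    recalls, precisions, f_scores = [], [], []
    for c in categories:
        tp = sum(1 for t, p in pairs if t == c and p == c)   # correct hits
        fp = sum(1 for t, p in pairs if t != c and p == c)   # wrong assignments
        fn = sum(1 for t, p in pairs if t == c and p != c)   # missed documents
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        total = precision + recall
        f_scores.append(2 * precision * recall / total if total else 0.0)
        recalls.append(recall)
        precisions.append(precision)
    n = len(categories)
    return sum(recalls) / n, sum(precisions) / n, sum(f_scores) / n


true_labels = ["a", "a", "b", "b"]        # hypothetical ground truth
predicted_labels = ["a", "b", "b", "b"]   # hypothetical classifier output
recall, precision, f_measure = macro_average_scores(true_labels, predicted_labels)
```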

Result analysis. Some of the topics into which test documents were classified:

№    Category
1    Family law
2    Inheritance law
3    Water industry
4    Catering
5    Inhabitants' consumer services
6    Rent truck
7    International law of the space
8    Territory in international law
9    Off-economic relations fellows
10   Off-economic dealerships
11   Economy free trade zones. Customs unions.

Result analysis. Recall results for every category. The best result for each category is shown in bold. All results are percentages. [Results table: values not preserved in the transcript.]

Thank you for your attention!