Web Mining: Phrase-based Document Indexing and Document Clustering Khaled Hammouda, Ph.D. Candidate Mohamed Kamel, Supervisor, PI PAMI Research Group University.


Web Mining: Phrase-based Document Indexing and Document Clustering Khaled Hammouda, Ph.D. Candidate Mohamed Kamel, Supervisor, PI PAMI Research Group University of Waterloo Waterloo, Ontario, Canada

2 Phrase-based Document Indexing

Document Index Graph Structure
  - A model based on a digraph representation of the phrases in the document set
  - Nodes correspond to unique terms
  - Edges maintain phrase representation
  - A phrase is a path in the graph
  - The model is an inverted list (terms → documents)
  - Nodes carry term weight information for each document in which they appear
  - Shared phrases can be matched efficiently

Phrase-based Features
  - Phrases: more informative feature than individual words → local context matching
  - Represent sentences rather than words
  - Facilitate phrase-matching between documents
  - Achieves accurate document pair-wise similarity
  - Avoid high-dimensionality of vector space model
  - Allow incremental processing

[Figure: Document Index Graph]
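The graph structure described on this slide can be illustrated with a small sketch. This is not the authors' implementation: the class name `DocumentIndexGraph`, the whitespace tokenizer, and the choice to store (sentence number, position) pairs on each edge are all assumptions made for illustration. Nodes keep per-document term counts (the inverted-list part), edges link consecutive terms, and a shared phrase is found by following matching edges as a new document is added, which is what allows incremental processing.

```python
class DocumentIndexGraph:
    """Toy phrase index: nodes are unique terms, edges link consecutive terms."""

    def __init__(self):
        # term -> {doc_id: term count}; acts as an inverted list (hypothetical layout)
        self.nodes = {}
        # (term, next_term) -> {doc_id: set of (sentence_no, position)}
        self.edges = {}

    def _positions(self, a, b, doc_id):
        """Positions at which edge a->b occurs in doc_id (empty set if none)."""
        return self.edges.get((a, b), {}).get(doc_id, set())

    def _shared_phrases(self, words):
        """Yield (doc_id, phrase) for maximal paths shared with indexed documents."""
        for i in range(len(words) - 1):
            first_edge = self.edges.get((words[i], words[i + 1]), {})
            for doc_id, positions in first_edge.items():
                for s_no, pos in positions:
                    # Skip matches that could be extended to the left.
                    if i > 0 and (s_no, pos - 1) in self._positions(
                            words[i - 1], words[i], doc_id):
                        continue
                    # Extend the path to the right while edges keep matching.
                    k = 1
                    while (i + k + 1 < len(words)
                           and (s_no, pos + k) in self._positions(
                               words[i + k], words[i + k + 1], doc_id)):
                        k += 1
                    yield doc_id, " ".join(words[i:i + k + 1])

    def add_document(self, doc_id, sentences):
        """Index a document; return {other_doc_id: set of shared phrases}."""
        tokenized = [s.lower().split() for s in sentences]
        shared = {}
        # Match against what is already in the graph before inserting.
        for words in tokenized:
            for other, phrase in self._shared_phrases(words):
                shared.setdefault(other, set()).add(phrase)
        # Insert the new document's terms and edges.
        for s_no, words in enumerate(tokenized):
            for pos, term in enumerate(words):
                counts = self.nodes.setdefault(term, {})
                counts[doc_id] = counts.get(doc_id, 0) + 1
                if pos + 1 < len(words):
                    edge = self.edges.setdefault((term, words[pos + 1]), {})
                    edge.setdefault(doc_id, set()).add((s_no, pos))
        return shared
```

For example, after indexing "river rafting is fun" as one document, adding "wild river rafting trips" as a second document reports "river rafting" as a phrase shared with the first. The shared phrases returned here would then feed a phrase-based similarity measure, which this sketch does not attempt to reproduce.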

3 Phrase-based Document Indexing
[Figures: Document Index Graph (internal structure); Document Index Graph (size scalability); Document Index Graph (time performance)]

4 Document Clustering using Cluster Similarity Histograms

Similarity Histogram-based Clustering (SHC)
  - Clusters are represented using a concise statistical representation called similarity histograms.
  - Maximize cluster coherency by maintaining high similarity distributions in the cluster histograms.
  - Enhance a cluster at any time by redistributing documents among clusters.
  - Both the original and the receiving clusters benefit from tighter similarity distributions.
  - The SHC algorithm is incremental.
  - Positive (+ve) documents contribute to cluster cohesiveness; negative (-ve) documents contribute to cluster looseness.
  - -ve documents in one cluster could be +ve documents in another.
  - Redistribute documents among clusters such that the number of -ve documents is reduced in each cluster.
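A rough sketch of the incremental assignment step may help make the histogram idea concrete. The slide does not give the acceptance rule, so everything specific here is an assumption: the "histogram ratio" (share of pairwise similarities at or above a threshold), the parameters `threshold`, `hr_min`, and `eps`, and the Jaccard word-overlap similarity used as a stand-in for the phrase-based measure. The redistribution of -ve documents described above is not covered.

```python
def jaccard(a, b):
    """Stand-in similarity: word-set overlap (the slides use phrase-based similarity)."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)


def histogram_ratio(sims, threshold):
    """Share of pairwise similarities at or above threshold (1.0 if there are no pairs)."""
    if not sims:
        return 1.0
    return sum(s >= threshold for s in sims) / len(sims)


class Cluster:
    def __init__(self):
        self.docs = []
        self.sims = []  # every pairwise similarity inside the cluster


def shc_assign(doc, clusters, sim=jaccard, threshold=0.5, hr_min=0.5, eps=0.1):
    """Add doc to each cluster whose similarity distribution stays tight.

    Assumed rule: accept if the ratio does not drop, or if it drops by less
    than eps while staying at or above hr_min. Otherwise open a new cluster.
    """
    placed = False
    for cluster in clusters:
        new_sims = [sim(doc, other) for other in cluster.docs]
        old = histogram_ratio(cluster.sims, threshold)
        new = histogram_ratio(cluster.sims + new_sims, threshold)
        if new >= old or (new >= hr_min and old - new < eps):
            cluster.docs.append(doc)
            cluster.sims.extend(new_sims)
            placed = True
    if not placed:
        fresh = Cluster()
        fresh.docs.append(doc)
        clusters.append(fresh)
    return clusters
```

Because each document is compared only against the current cluster members and the stored similarities are extended in place, documents can be processed one at a time as they arrive, which matches the incremental behaviour the slide claims for SHC.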

5 Document Clustering using Cluster Similarity Histograms (cont’d)
[Figure: SHC (time performance)]

6 Phrases as Document Features
[Figures: Effect of Phrase Similarity (F-measure); Effect of Phrase Similarity (Entropy)]
Document Clustering using Similarity Histograms
[Figures: SHC Clustering Improvement (F-measure); SHC Clustering Improvement (Entropy)]

7 Current Research

Web Mining Multi-Agent System
  - Cooperative agents work on mining web content
  - Agents can negotiate and exchange data to achieve better solutions
  - Implemented distributed clustering
  - Based on multiple standards including XML, Web Services. Later will incorporate XML Topic Maps (XTM), Semantic Web and Ontologies to represent discovered clusters.

8 Publications

Journal Publications
  - K. Hammouda and M. Kamel, “Efficient Phrase-based Document Indexing for Web Document Clustering”, IEEE Transactions on Knowledge and Data Engineering. Accepted, September
  - K. Hammouda and M. Kamel, “Document Similarity Using a Phrase Indexing Graph Model”, Knowledge and Information Systems. Springer. Accepted, May

Conference Publications
  - K. Hammouda and M. Kamel, “Incremental Document Clustering Using Cluster Similarity Histograms”, The 2003 IEEE/WIC International Conference on Web Intelligence (WI 2003), pp , Halifax, Canada, October 2003
  - K. Hammouda and M. Kamel, “Phrase-based Document Similarity Based on an Index Graph Model”, The 2002 IEEE International Conference on Data Mining (ICDM'02), pp , Maebashi, Japan, December

Available at: