Jong Y. Choi, Joshua Rosen, Siddharth Maini, Marlon E. Pierce, and Geoffrey C. Fox Community Grids Laboratory Indiana University.

Presentation transcript:

Jong Y. Choi, Joshua Rosen, Siddharth Maini, Marlon E. Pierce, and Geoffrey C. Fox Community Grids Laboratory Indiana University

Delicious example (screenshot; labeled elements: bookmark, tags, social networks, people-generated)

Collaborative Tagging
- Online bookmarking with annotations
- Creates social networks
- Utilizes the power of people's knowledge
Pros and cons
- High-quality classifier built on human intelligence
- But a lack of control or authority

Collective Collaborative Tagging (CCT) System (architecture diagram; labels only)
- Input: distributed tagging data; bookmarks/tags populated via RDF, RSS, Atom, HTML
- CCT System components: Data Importer, Data Coordinator, User Service, Repository
- Access: SOAP, REST, …; query with various options; search result

- 1st: Service and algorithm development
  - Identify services and algorithms
- 2nd: Interface development
  - Web 2.0-style interface
  - REST, SOAP, …
- 3rd: Export/import service development
  - Merging distributed data sets
  - Exporting data to build mash-up sites
- So far, we are mainly in the 1st stage and are running some experiments in the 2nd stage

(Screenshot; labeled features: different data sources, various IR algorithms, flexible options, result comparison)

Type | Service | Description | Algorithm
I | Searching | Given input tags, return the most relevant X (X = URLs, tags, or users) | Latent Semantic Indexing (LSI), FolkRank
II | Recommendation | Given indirect input tags, return undiscovered X |
III | Clustering | Community discovery: find a group or community with similar interests | K-Means, Deterministic Annealing Clustering
IV | Trend detection | Analyze tagging activities as a time series and detect abnormalities | Time Series Analysis

- Vector-space model (bag-of-words model)
  - Assume n URLs and q tags
  - A URL can be represented by a q-dimensional vector, d_i = (t_1, t_2, …, t_q)
  - The whole data set can be represented by an n-by-q matrix
- Pairwise dissimilarity matrix
  - n-by-n symmetric matrix
  - Distance (Euclidean, Manhattan, …)
  - Angles, cosine, sine, …
  - O(n^2) complexity
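
The pairwise dissimilarity matrix described above can be sketched in a few lines. This is a minimal illustration, not the CCT implementation: the function name `pairwise_dissimilarity`, the `metric` argument and the toy URL-by-tag matrix are assumptions made for the example.

```python
import numpy as np

def pairwise_dissimilarity(X, metric="euclidean"):
    """Return the n-by-n dissimilarity matrix for the rows of an n-by-q matrix X.

    Hypothetical helper: each row is one URL, each column one tag.
    """
    X = np.asarray(X, dtype=float)
    if metric in ("euclidean", "manhattan"):
        # Broadcasting builds all n*n row differences at once: O(n^2) pairs.
        diff = X[:, None, :] - X[None, :, :]
        if metric == "euclidean":
            return np.sqrt((diff ** 2).sum(axis=2))
        return np.abs(diff).sum(axis=2)
    if metric == "cosine":
        norms = np.linalg.norm(X, axis=1)
        norms[norms == 0] = 1.0          # avoid dividing by zero for empty rows
        unit = X / norms[:, None]
        return 1.0 - unit @ unit.T       # cosine dissimilarity = 1 - cosine similarity
    raise ValueError(f"unknown metric: {metric}")

# Toy data (assumed): 3 URLs described by 3 tags.
urls_by_tags = [[1, 1, 0],
                [1, 0, 0],
                [0, 0, 1]]
D = pairwise_dissimilarity(urls_by_tags, metric="euclidean")
```

The result is symmetric with a zero diagonal, so only the upper triangle needs to be stored when n is large.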

- Graph model (figure source: MSI-CIEC)
  - Build a graph of nodes and edges
  - Edges indicate relationships
  - Becomes a complex network (tag graph)
- Dissimilarity
  - Related to path distance
  - Finding paths is important (shortest-path problem)
  - Naive approach: O(n^3) complexity
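
One way to obtain path-based dissimilarities at the O(n^3) cost quoted above is the Floyd-Warshall all-pairs shortest-path algorithm. The slides do not name the algorithm they have in mind, so this choice is an assumption, as are the function name and the toy undirected edge list.

```python
import math

def all_pairs_shortest_paths(n, edges):
    """Floyd-Warshall on an undirected weighted graph with nodes 0..n-1.

    `edges` is a list of (u, v, weight) tuples (hypothetical input format).
    Returns an n-by-n list of shortest path lengths; math.inf means unreachable.
    """
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
    for u, v, w in edges:
        # Keep the lightest edge if the same pair is listed more than once.
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
    # Three nested loops over all nodes: O(n^3) time.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist

# Toy tag graph (assumed): node 3 is isolated.
paths = all_pairs_shortest_paths(4, [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 5.0)])
```

The resulting matrix can be used as the dissimilarity input for pairwise clustering or MDS, in the same way as the vector-space distances.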

Latent Semantic Indexing
- Using the vector-space model, find the URLs most similar to the user's query tags
- Dimension reduction from high q to low d (q >> d)
- Removes noisy terms and extracts latent concepts
(Figure: precision vs. recall; legend: 2 terms, 4 terms, 8 terms, 20% dim. reduction, None, Ideal Line)
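
A small sketch of the LSI idea follows, assuming a truncated SVD of the n-by-q URL-tag matrix and cosine similarity in the reduced d-dimensional space. The function name `lsi_rank`, the query folding (plain projection onto the top d right singular vectors, one common variant) and the toy data are illustrative assumptions, not the system's actual code.

```python
import numpy as np

def lsi_rank(X, query, d):
    """Rank the rows (URLs) of X against a q-dimensional tag query in a d-dim latent space."""
    X = np.asarray(X, dtype=float)
    query = np.asarray(query, dtype=float)
    # Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Vd = Vt[:d].T                 # q-by-d basis of the latent concept space
    docs = X @ Vd                 # n-by-d reduced URL vectors
    q_vec = query @ Vd            # reduced query vector
    denom = np.linalg.norm(docs, axis=1) * np.linalg.norm(q_vec)
    sims = np.divide(docs @ q_vec, denom,
                     out=np.zeros(len(docs)), where=denom > 0)
    order = np.argsort(-sims)     # most similar URL first
    return order, sims

# Toy data (assumed). Columns: grid, cluster, music, jazz.
X = [[1, 1, 0, 0],   # URL 0: grid, cluster
     [1, 0, 0, 0],   # URL 1: grid only
     [0, 0, 1, 1],   # URL 2: music, jazz
     [0, 0, 0, 1]]   # URL 3: jazz only
order, sims = lsi_rank(X, query=[0, 1, 0, 0], d=2)   # query tag: cluster
```

In this toy case URL 1 shares no tag with the query, so its cosine similarity in the original space is zero, yet it scores highly in the latent space because "grid" and "cluster" co-occur in URL 0.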

- Discover the group structures of URLs
- Non-parametric learning algorithm
- Non-trivial optimization problem
- Should avoid local minima/maxima solutions
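
The local-minimum problem noted above is easy to reproduce with plain K-Means (Lloyd's algorithm). The sketch below is illustrative only; the function names and the four-point data set are assumptions chosen so that two different initializations converge to different solutions.

```python
import numpy as np

def kmeans(X, init_centers, n_iter=100):
    """Lloyd's algorithm starting from explicit initial centers."""
    X = np.asarray(X, dtype=float)
    centers = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its members (empty clusters stay put).
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = X[labels == k]
            if len(members):
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

def sse(X, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    X = np.asarray(X, dtype=float)
    return float(((X - centers[labels]) ** 2).sum())

# Four corners of a 4-by-1 rectangle (assumed toy data).
points = [[0, 0], [0, 1], [4, 0], [4, 1]]
bad_centers, bad_labels = kmeans(points, [[2, 0], [2, 1]])        # splits top/bottom
good_centers, good_labels = kmeans(points, [[1, 0.5], [3, 0.5]])  # splits left/right
```

Both runs converge, but the first one is stuck with a much larger sum of squared errors than the second, which is the kind of initialization-dependent solution a clustering service should avoid.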

- Deterministically avoid local minima
- Trace the global solution by changing the level of energy
- Analogy to the physical annealing process (high → low)
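
The annealing idea can be sketched as a soft version of K-Means in which a temperature T controls how fuzzy the assignments are and is lowered step by step. This is a simplified illustration with a fixed number of clusters; the function name, the cooling schedule, the small perturbation used to let coincident centers split, and all parameter defaults are assumptions, not the authors' implementation.

```python
import numpy as np

def da_clustering(X, k, t_start, t_end=1e-2, cooling=0.8, inner_iters=30, seed=0):
    """Simplified deterministic-annealing clustering with k centers."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # All centers start at the data mean (the single solution at high temperature).
    centers = np.repeat(X.mean(axis=0)[None, :], k, axis=0)
    T = t_start
    while T > t_end:
        # Tiny perturbation so that coincident centers can separate once T is low enough.
        centers = centers + 1e-3 * rng.standard_normal(centers.shape)
        for _ in range(inner_iters):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            # Soft assignments p(k|x) proportional to exp(-d2 / T), computed stably.
            p = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / T)
            p /= p.sum(axis=1, keepdims=True)
            # Each center becomes the weighted mean of all points.
            weight = p.sum(axis=0)
            ok = weight > 1e-12
            centers[ok] = (p.T @ X)[ok] / weight[ok, None]
        T *= cooling                     # cool down: high -> low temperature
    return centers

# Two compact groups around (0, 0) and (10, 10) (assumed toy data).
offsets = np.array([[-0.3, 0], [0.3, 0], [0, -0.3], [0, 0.3]])
data = np.vstack([offsets, offsets + 10.0])
centers = da_clustering(data, k=2, t_start=200.0)
```

At high T every point contributes almost equally to every center, so the centers sit together at the data mean; as T drops the assignments harden and the update approaches the ordinary K-Means step.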

- Classification
  - To respond more quickly to users' requests
  - Train on users' input and answer questions based on the training results
  - Artificial Neural Network, Support Vector Machine, …
- Trend detection
  - Can be used for prediction/forecasting
  - Time-series analysis of tagging activities
  - Markov chain model, Fourier transform, …

- The goals of our Collective Collaborative Tagging (CCT) system
  - Utilize various data sets
  - Provide various information retrieval (IR) algorithms
  - Help to utilize people-powered knowledge
- Various models and algorithms are currently being investigated
- Service interfaces and import/export functions will be added soon

 | Vector-space Model | Graph Model
Representation | q-dimensional vector; q-by-n matrix | G(V, E); V = {URL, tags, users}
Dissimilarity | Distances, cosine, …; O(N^2) complexity | Paths, hops, connectivity, …; O(N^3) complexity
Algorithm | Latent Semantic Indexing; dimension reduction schemes; PCA | PageRank, FolkRank, …; pairwise clustering; MDS

- Pairwise clustering
  - Input from the vector-based model vs. the graph model
  - How to avoid local minima/maxima (e.g., in K-Means)?
(Figure panels: graph model, vector-space model)