Automatic Discovery of Shared Interest Minimum Spanning Trees Displaying Semantic Similarity Włodzisław Duch & Co Department of Informatics, Nicolaus Copernicus University, Torun, Poland, & School of Computer Engineering, Nanyang Technological University, Singapore Google: Duch

The Vision
In 1945 Vannevar Bush imagined linked text/film information, a kind of Wikipedia; his Memex was one of the first hypertext systems. Ted Nelson, in "Computer Lib and Dream Machines" (1974), extended this vision to all kinds of information integrated in Project Xanadu. In essence: unbreakable two-way links, connected to the origin of the information, facilitating incremental publishing, deep version management & comparison. The WWW is not yet Xanadu: there are no links to the origin of information, little maintenance, searches are frequently tedious, and linking information about any given subject is done manually. The Xanadu vision is not sufficient; all knowledge should be organized in the form of ideas supported by evidence, with links between related pieces of information created automatically (QED project). If only computers could analyze and present it in a coherent way, linking major ideas to papers, data, software, experiments...

The Problem
Finding all people who share similar interests in large organizations, or worldwide, is difficult (NTU experience). Who is related to me, and in which way? Each individual may have many different interests, so the search process should be topic-oriented, not people-oriented. The process should be automatic, using information from people's homepages and their lists of publications. Relations are visualized as graphs, with individuals as nodes and different types of relations as edges. The structure of the graphical representation depends strongly on the selection of key entities for the nodes – the text should first be projected onto a domain ontology.

Steps
WWW spiders collect documents from some domain (NTU home pages were used for tests).
Convert the HTML documents to text, clean them using stop-words, apply stemming, etc.
Final filtering & dimensionality reduction give a vector representation of the term-document matrix (a minimal sketch of this pipeline follows below).
Cluster the information in some way (try Clusty, Vivisimo or Carrot2).
Visualize related nodes that represent individual homepages and link them by estimates of shared-interest similarity: see WEBSOM and its applications in digital libraries, the astronomical VizieR service, etc.
This goes beyond visualization of Google link analysis or the "brain interface" used in Britannica's BrainStormer.
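A minimal sketch of the document-preparation steps listed above. The library choices (BeautifulSoup, NLTK, scikit-learn) and the toy pages are illustrative assumptions, not the tools or data of the original system:

```python
# Sketch of the pipeline: HTML -> text -> stop-word removal -> stemming -> weighted term-document matrix.
# Library choices and the toy pages below are assumptions for illustration only.
from bs4 import BeautifulSoup
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def html_to_terms(html):
    """Strip HTML tags, lower-case, drop stop-words, and stem the remaining tokens."""
    text = BeautifulSoup(html, "html.parser").get_text(" ")
    tokens = [t for t in text.lower().split() if t.isalpha()]
    return " ".join(stemmer.stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS)

# html_pages would come from the WWW spider (e.g. crawled home pages)
html_pages = ["<html><body>Neural networks for control systems</body></html>",
              "<html><body>Power electronics and control of circuits</body></html>"]

cleaned = [html_to_terms(p) for p in html_pages]
vectorizer = TfidfVectorizer()            # tf-idf weighted term-document matrix
W = vectorizer.fit_transform(cleaned)     # documents x terms, sparse matrix
print(vectorizer.get_feature_names_out())
print(W.toarray())
```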

Implementation and Design

Document-word matrix
Document1: word1 word2 word3. word4 word3 word5.
Document2: word1 word3 word5. word1 word3 word6.
The matrix (documents x word frequencies):
             w1  w2  w3  w4  w5  w6
Document 1    1   1   2   1   1   0
Document 2    2   0   2   0   1   1
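The same toy matrix can be reproduced with a term-count vectorizer; scikit-learn's CountVectorizer is an assumed convenience here, not part of the original implementation:

```python
# Sketch: building the toy document-word count matrix.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["word1 word2 word3. word4 word3 word5.",
        "word1 word3 word5. word1 word3 word6."]

vectorizer = CountVectorizer()         # default tokenizer keeps word1, word2, ...
F = vectorizer.fit_transform(docs)     # documents x word frequencies

print(vectorizer.get_feature_names_out())   # ['word1' 'word2' 'word3' 'word4' 'word5' 'word6']
print(F.toarray())                          # [[1 1 2 1 1 0]
                                            #  [2 0 2 0 1 1]]
```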

First shot: methods used
Inverse document frequency and term weighting.
Simple selection of relevant terms, or Latent Semantic Analysis (LSA) for dimensionality reduction – a standard method in information retrieval.
Minimum Spanning Trees for visual representation.
TouchGraph XML visualization of the MST trees.

Data Preparation
Normalize the columns of F by dividing by the highest word frequencies:
tf_ij = f_ij / max_k f_kj
Among n documents, term j occurs d_j times; the inverse document frequency idf_j measures the uniqueness of term j:
idf_j = log(n / d_j)
tf x idf term weights:
w_ij = tf_ij * idf_j
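A short numeric sketch of these weighting steps, assuming F holds the raw counts from the toy matrix above; numpy only, variable names are illustrative:

```python
# Sketch of the tf-idf weighting defined above, applied to the toy count matrix F
# (rows = documents, columns = terms).
import numpy as np

F = np.array([[1, 1, 2, 1, 1, 0],
              [2, 0, 2, 0, 1, 1]], dtype=float)
n = F.shape[0]                      # number of documents

tf  = F / F.max(axis=0)             # normalize each column by its highest word frequency
d   = (F > 0).sum(axis=0)           # d_j: number of documents in which term j occurs
idf = np.log(n / d)                 # inverse document frequency
W   = tf * idf                      # tf-idf term weights w_ij
# with only two toy documents many idf values are 0, so W is very sparse
print(W.round(3))
```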

Simple selection
Take the w_ij weights above a certain threshold, binarize, and remove zero rows:
b_ij = 1 if w_ij > theta, 0 otherwise
Calculate similarity using the cosine measure (which takes care of vector length normalization):
S(D_a, D_b) = (D_a . D_b) / (||D_a|| ||D_b||)
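A sketch of the simple-selection step; the matrix values, its orientation, and the threshold are illustrative assumptions:

```python
# Sketch of simple selection: threshold the weights, binarize, remove all-zero rows.
import numpy as np

W = np.array([[0.00, 0.35],
              [0.41, 0.00],
              [0.00, 0.00],     # this row falls below the threshold everywhere
              [0.52, 0.33]])
theta = 0.3
B = (W > theta).astype(int)     # binarize
B = B[B.any(axis=1)]            # remove zero rows
print(B)
```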

Similarity using cosine measure
Using the same vectors V_1, V_2, V_3, the similarity of vectors 1 and 2 is S_12 = 0.615, while S_13 is smaller, so Document 1 and Document 2 are more likely to be related. For visualization, a threshold value (e.g. S_ij > 0.3) determines which links are shown.
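The cosine measure and the link threshold in code; the vectors below are placeholders, since the slide's V_1–V_3 are not reproduced in the transcript (only S_12 = 0.615 is quoted above), and only the 0.3 threshold comes from the slide:

```python
# Sketch of cosine similarity used to decide which links to display.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

V1 = np.array([1, 0, 2, 1])   # placeholder document vectors
V2 = np.array([0, 1, 2, 0])
V3 = np.array([1, 1, 0, 0])

S12, S13 = cosine(V1, V2), cosine(V1, V3)
show_link = S12 > 0.3         # visualization threshold S_ij > 0.3
print(round(S12, 3), round(S13, 3), show_link)
```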

Dimensionality reduction
Latent Semantic Analysis (LSA): apply Singular Value Decomposition to the weight matrix W,
W = U S V^T,
where the columns of U are eigenvectors of W W^T and the columns of V are eigenvectors of W^T W. Remove the small singular values from S, recreate the reduced W, and calculate the similarity as before.
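A sketch of the LSA step with numpy's SVD; the data is random placeholder data, and the 0.8 truncation fraction mirrors the "0.8 · W_rank" setting reported in the results slides:

```python
# Sketch of LSA: SVD of the weight matrix W, keep the largest singular values,
# rebuild a reduced-rank W, then compare documents with the cosine measure.
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((20, 50))                     # documents x terms (placeholder data)

U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = int(0.8 * np.linalg.matrix_rank(W))      # retain 0.8 * W_rank singular values
W_red = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # reduced-rank reconstruction

# cosine similarity between documents in the reduced space
norms = np.linalg.norm(W_red, axis=1, keepdims=True)
S = (W_red / norms) @ (W_red / norms).T
print(S.shape)
```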

Kruskal's MST Algorithm and Top-Down Clusterization
A minimum spanning tree is a subgraph of a weighted graph that connects all nodes with minimum total edge cost; it is built by a simple greedy algorithm (sketched below).
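A sketch of Kruskal's algorithm with a union-find structure over the document-similarity graph. Edge weights are taken as dissimilarities (1 − S_ij); the edge list and all names are illustrative, not the original code. Stopping before the most expensive edges are added leaves exactly the components used for cluster identification on the next slide:

```python
# Sketch of Kruskal's MST over document-similarity data.
def kruskal(n, edges):
    """edges: list of (weight, u, v); returns the MST edge list."""
    parent = list(range(n))

    def find(x):                       # path-compressed find
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):      # greedy: cheapest edge first
        ru, rv = find(u), find(v)
        if ru != rv:                   # skip edges that would close a cycle
            parent[ru] = rv
            mst.append((w, u, v))
    return mst

# toy example: 4 documents, weights = 1 - cosine similarity
edges = [(0.2, 0, 1), (0.9, 0, 2), (0.3, 1, 2), (0.7, 2, 3), (0.8, 1, 3)]
print(kruskal(4, edges))   # [(0.2, 0, 1), (0.3, 1, 2), (0.7, 2, 3)]
```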

Cluster identification during MST construction.

Some experiments
Reuters data, with 5 categories, 1–176 elements per category, 600 documents in total: can we see the categories?
124 personal Web pages from the School of Electrical and Electronic Engineering (EEE) of the Nanyang Technological University (NTU) in Singapore. The 5 department names may be used as categories: control, microelectronics, information, circuits, power, with 14–41 documents per category. Can we discover the department structure?

Reuters results
For the 600 documents the rank of W in SVD is W_rank = 595.
[Table: number of topics, number of clusters and accuracy (%) for: no dimensionality reduction; LSA dim. reduction 0.8 (476); LSA dim. reduction 0.6 (357); simple selection. Here 0.8 means that 0.8 * W_rank eigenvectors are retained.]

Results for EEE NTU Web pages
[Table: number of topics, number of clusters and accuracy (%) for: no dimensionality reduction; LSA dim. reduction 0.8 (467); LSA dim. reduction 0.6 (350); simple selection.]

Examples
Live demo links: EEE full, EEE selected, EEE selected small clusters, EEE selected small.

Limitations
Keywords have been derived only from what is found on the web pages: too many keywords, too sparse matrices. Synonymous concepts should be treated as a single feature, producing larger frequency counts. People working on "architecture in mechanical design" who are interested in "computer art" get associated with someone working on "computer architecture". Web pages contain much irrelevant information, and abbreviations of all sorts are used. No topic labels are given, therefore only a single category is used.

Adding ontologies
Select relevant terms using engineering ontologies (derived from keywords used in library classification).
Add medical concepts (UMLS) and use MetaMap to discover these concepts in text.
Processing: term weighting, stemming, etc.; simple selection of relevant terms; TouchGraph XML visualization.

EEE: Simple Word-Doc Vector Space

EEE: Transformed Concept Vector Space

Med: Simple Word-Doc Vector Space

Med: Meta-Map Concept Vector Space

Med: after Metamap transformation

Results for Summary Discharges
New experiments on medical texts: short (~half a page) hospital summary discharges, 10 classes (the main disease treated) with 10 documents per class.
Plain Doc-Word matrix: ≈ 23%
Stop-list, TF-IDF, simple selection: ≈ 64%
MetaMap transformation: ≈ 93%

Summary
In real applications a knowledge-based approach is needed to select relevant concepts and to parse web pages, but problems with acronyms, abbreviations, synonyms etc. should be solved first. Other visualization and clusterization methods should be explored. People have many interests and thus may belong to several topic groups – topics are related to concepts that should be high in the ontology, but have no simple description. This could be a very useful tool for creating new shared-interest groups for social networks on the Internet, and could point to potential collaborators or interesting research from an individual's point of view.

Similar attempts
Flink presents the scientific work and social connectivity of Semantic Web researchers, displaying homepages of experts who have contributed to the International Semantic Web Conference (ISWC) series. Kartoo is a metasearch engine that displays topic maps.

Related work in my group
Neural basis of language: creating networks of concepts instead of vector models. Medical text analysis using UMLS ontologies. Instead of clustering, formulating the minimum number of questions needed to make a search more precise. Creativity – inventing new names.

Words in the brain
The cell assembly model of language has strong experimental support: F. Pulvermüller (2003), The Neuroscience of Language. On Brain Circuits of Words and Serial Order, Cambridge University Press. Acoustic signal => phonemes => words => semantic concepts. Semantic activations are seen about 90 ms after phonological activations in N200 ERPs. Phonological density of a word = the number of words that sound similar to it, i.e. create similar activations in phonological areas. Semantic density of a word = the number of words that have similar meaning, or a similar extended activation network. Perception/action networks; results from ERP & fMRI.

Words: simple model
Goals: make the simplest testable model of creativity; create interesting novel words that capture some features of products; understand new words that cannot be found in the dictionary. The model is inspired by the putative brain processes at work when new words are being invented. Start from keywords priming the auditory cortex. Phonemes (allophones) are resonances; ordered activation of phonemes activates both known words and their combinations; context plus inhibition in a winner-takes-most process leaves one or a few words. Creativity = imagination (fluctuations) + filtering (competition). Imagination: many chains of phonemes activate, in parallel, both word and non-word representations, depending on the strength of synaptic connections. Filtering: associations, emotions, phonological/semantic density.

Beyond ontologies
Neurocognitive approach to language understanding: use recognition, semantic and episodic memory models; create graphs of consistent concepts for interpretation; use spreading activation and inhibition to simulate the effect of semantic priming; annotate and disambiguate text. For medical texts UMLS has >2M concepts and 15M relations... See: Unambiguous Concept Mapping in a Medical Domain, Thursday 11:45 (Matykiewicz, Duch, Pestian).

Humanized interface
[Diagram components: semantic memory store; applications, e.g. a 20-questions game; query interface; parser; part-of-speech tagger & phrase extractor; on-line dictionaries; manual verification.]

DREAM modules
[Diagram components: natural input modules, cognitive functions, affective functions, web/text/databases interface, behavior control, control of devices, talking head, text-to-speech, NLP functions, specialized agents.]
The DREAM project is focused on perception (visual, auditory and text inputs), cognitive functions (reasoning based on perceptions), natural language communication in well-defined contexts, and real-time control of the simulated/physical head.

Thank you for lending your ears... Google: Duch => Papers