Bogdan Vrusias © 2003 Scene of Crime Information System: Playing at St. Andrews 22nd August 2003 Bogdan Vrusias, Mariam Tariq, Lee Gillam Department of.


Scene of Crime Information System: Playing at St. Andrews
22nd August 2003
Bogdan Vrusias, Mariam Tariq, Lee Gillam
Department of Computing, University of Surrey, England

Talk Outline
• SoCIS
• Handling Multilinguality
• Synonymy and Morphology
• Relevance Ranking
• Performance Issues
• Results and Evaluation
• Conclusions & Future

SoCIS
• The EPSRC-funded Scene of Crime Information System (SoCIS) project ran from October 1999 to March [year missing].
• The aim was to study the link between images and texts within a specialist domain.
• The project outlined a method for developing an intelligent content-based image retrieval (CBIR) system that stores and retrieves images based on their linguistic descriptions.
• The corpus-based method uses the lexical and semantic properties of specialist texts to extract key terms and to discover the ontological organisation of those terms.

SoCIS
• The system is based on a 3-tier architecture of client, server, and database, and can be accessed via a local intranet.
• SoCIS is an intelligent CBIR system that automatically:
  – labels (and indexes) images by keywords as well as by relational facts extracted from the descriptions provided by domain experts;
  – extracts physical features of an image;
  – populates a database comprising domain-specific terminology, together with the semantic relationships between terms, starting from a random selection of collateral texts of the domain; and
  – learns to link image and text by using neural networks.

SoCIS
• SoCIS has integrated modules from:
  – System Quirk (Ahmad & Rogers, 2001), a set of tools for building and managing multilingual term bases using powerful text analysis techniques, and
  – GATE (Cunningham et al., 2002), a framework and graphical development environment comprising robust NLP tools.
• The main advantage of SoCIS over other text-based and CBIR systems is its ability to extract information from both texts and images, to encode this information for indexing, and to build thesauri, all automatically.

Adapting SoCIS
• SoCIS was specifically targeted at the use of specialist languages.
• The system was built on knowledge gathered from Scene of Crime experts, from the testing and evaluation sessions performed with them, and from a domain-specific text corpus.
• For the ImageCLEF collection, the system had to be adapted to deal with multilinguality and with structured data from a more general domain.
• The indexing module was used to extract single and compound terms from the output of the parser.
• We used WordNet for query expansion, but the indexing had to be carried out without a terminology dictionary to filter out invalid terms.
• A relevance ranking mechanism was adopted to handle the expanded terms retrieved from WordNet.

Handling Multilinguality
• We relied upon translation engines available on the Internet.
• Google’s translation tools encountered difficulties.
• Altavista’s Babelfish was selected as the principal translation engine.
• However, since Altavista’s Babelfish does not translate Dutch, FreeTranslation.com was used for Dutch queries.
• To translate the queries, Java code was used to wrap definitions of the query syntax used by these sites.
• Using the Java JTidy utility, the resulting HTML was converted to XML and processed with XSLT.

Handling Multilinguality
• Some of these translations cause problems for retrieval. For example, "Golf course bunkers":

  Query language   English translation returned
  German           Golf course shelter
  French           Bunkers of ground of gulf
  Italian (1)      A bunker in a distance of golf
  Italian (2)      bunkers in a golf course
  Spanish (1)      B??nkers in a golf course
  Spanish (2)      Track of golf
  Dutch            Bunkers on a wave job

• The quality of the returned translation therefore has a significant impact on the results retrieved.

Synonymy and Morphology
• A program was written to query a WordNet database to provide a set of synonyms and hyponyms for each of the query terms.
• Given a query term, the program returns all the words in the synset to which the term belongs, as well as all the hyponyms of each synset element down to a specified level in the hierarchy.
• Initially we planned to go down 2 levels in the hierarchy, but we ended up using just the synonyms because of system performance issues caused by the large number of expanded terms returned.
• Some basic morphological analysis was also carried out for each query term to account for variants such as singular and plural forms, as well as verb and adjective forms.
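The expansion step can be sketched as follows. This is a minimal illustration, not the project's program: the two dictionaries are a hypothetical stand-in for a WordNet database, and the names `SYNSETS`, `HYPONYMS`, and `expand` are assumptions made for the example. With `max_depth=0` only synonyms are returned, which mirrors the setting finally used.

```python
from collections import deque

# Hypothetical miniature lexical database; the real system queried WordNet.
SYNSETS = {
    "boat": ["boat", "sauceboat", "gravy boat"],
    "loch": ["loch", "lough"],
}
HYPONYMS = {
    "boat": ["motorboat", "mailboat", "gondola", "yacht"],
    "motorboat": ["speedboat"],
}


def expand(term, max_depth=0):
    """Return synonyms of `term` plus hyponyms down to `max_depth` levels."""
    term = term.lower()
    synonyms = [w for w in SYNSETS.get(term, []) if w != term]
    expanded = list(synonyms)
    seen = {term, *synonyms}
    # Breadth-first walk over hyponym links, starting from every synset element.
    queue = deque((w, 0) for w in [term, *synonyms])
    while queue:
        word, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not descend below the requested level
        for hypo in HYPONYMS.get(word, []):
            if hypo not in seen:
                seen.add(hypo)
                expanded.append(hypo)
                queue.append((hypo, depth + 1))
    return expanded
```

Calling `expand("boat")` yields only the synonyms, while `expand("boat", 1)` adds the first level of hyponyms. A term that is absent from the database, such as a proper noun, yields an empty list.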

Synonymy and Morphology
• Taking the query “Boats on Loch Lomond” as an example:
• The term ‘boat’ returned 53 expanded words going down one level in the hierarchy.
  – Synonyms returned were: travel on water, sauceboat, gravy boat.
  – Hyponyms returned were: motorboat, mail boat, mailboat, gondola, propel by oars, propel by paddles, yacht, and so on.
• ‘Loch’ returned one synonym: lough.
• ‘Lomond’ was not present in WordNet since it is a proper noun.
• The very common term ‘man’ had 131 expanded words going down one level and 344 expanded words going down two levels, with words such as:
  – private, make swollen, belly out, candy striper, Homo erectus, clothes horse, ridicule with a satire, and gentleman.

Relevance Ranking
• Each keyword contributed its frequency in an annotation divided by the total number of terms in that annotation.
• This proportion was then multiplied by a weight: 1 for an original keyword, 0.9 for each expanded term (synonym) returned by WordNet, and 0.1 for words containing substrings of the original keywords.
• The total ranking of a document d was then given by:

  \[ R_d = \frac{1}{N_d} \sum_{t} w_t \, f_{td} \]

• Here f_{td} is the term frequency of term t in document d, w_t is the weight of term t as described above, and N_d is the total number of words in document d.
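The weighting scheme can be sketched in a few lines. This is an illustrative reading of the description, not the system's code: the function names are hypothetical, and the substring rule is interpreted here as "an annotation word that contains an original keyword", which is an assumption.

```python
def term_weight(word, keywords, synonyms):
    """Weight of one annotation word: 1 for an original keyword,
    0.9 for a synonym, 0.1 for a word containing a keyword, else 0."""
    if word in keywords:
        return 1.0
    if word in synonyms:
        return 0.9
    # Assumed interpretation of the substring rule (e.g. "boats" contains "boat").
    if any(k in word for k in keywords):
        return 0.1
    return 0.0


def rank(annotation_words, keywords, synonyms):
    """Sum of w_t * f_td over terms, divided by N_d (the annotation length)."""
    n_d = len(annotation_words)
    if n_d == 0:
        return 0.0
    # Summing the weight of every word occurrence equals summing w_t * f_td per term.
    return sum(term_weight(w, keywords, synonyms) for w in annotation_words) / n_d


# Hypothetical annotation and query terms, for illustration only.
annotation = "boats and a small boat on the lough".split()
score = rank(annotation, keywords={"boat", "loch"}, synonyms={"lough"})
```

Because the score is normalised by annotation length, a short annotation that mentions a keyword once can outrank a long annotation that mentions it several times.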

Performance Issues
• The system was designed for the analysis of free text in specialist domains, whereas with the ImageCLEF collection we were dealing with structured texts in a general domain. The indices produced were relatively unreliable because the syntactic structure of the ImageCLEF text differs from that of free text.
• Because we used WordNet for query expansion, we encountered problems associated with polysemous words as well as with different word forms.
• Because of the time it took to process the expanded queries (sometimes reaching up to 300 words), we had to limit the expansion to just the synonyms of the original query terms. We had six computers running in parallel to finish the processing, which took approximately 8 hours per language.

Results and Evaluation
• We prototyped a system that, in principle, lets a user query a collection of images annotated in English using a query in any one of six languages.
• Across all languages, the following sets of results were obtained (missing topics, and quantities for each topic, are given in the third column; ? marks a value lost from the table):

  Language   Results     Missing topics (quantity)
  Spanish    105 / ?     ? (3), 33 (1), 34 (1), 36 (1), 39 (2), 43 (3), 47 (1)
  English    48 / 50     40, 46
  French     47 / 50     7, 17, 25
  Italian    91 / ?      ? (2), 17 (1), 27 (3), 29 (2), 31 (1), 39 (1), 43 (1), 45 (1)
  German     43 / 50     4, 7, 13, 27, 40, 46, 48
  Dutch      38 / 50     5, 7, 13, 17, 18, 20, 27, 29, 36, 39, 40, 43
  Total      372 / 421

Results and Evaluation
• For Topic 14, the top 5 results were taken once for each language, and a similarity matrix between these results was computed.

  [Similarity matrix: rows and columns En, Fr, De, Es, It, Nl, plus a Total column; values not preserved]

• The results show degrees of similarity between the English, Italian and Dutch results, with German and Spanish also resembling each other, and French showing the most marked difference in behaviour.
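One simple way to build such a matrix is to count the images shared between each pair of top-5 lists. The slides do not say which similarity measure was used, so the overlap count below is an assumption, and the image identifiers are invented for illustration.

```python
def overlap_matrix(top_results):
    """Count shared items between every pair of per-language result lists."""
    langs = list(top_results)
    sets = {lang: set(items) for lang, items in top_results.items()}
    return {a: {b: len(sets[a] & sets[b]) for b in langs} for a in langs}


# Hypothetical top-5 image identifiers per query language (not real data).
top5 = {
    "En": ["img01", "img02", "img03", "img04", "img05"],
    "It": ["img01", "img02", "img03", "img09", "img10"],
    "Fr": ["img11", "img12", "img13", "img14", "img15"],
}
matrix = overlap_matrix(top5)
```

Each diagonal entry equals the list length, and the matrix is symmetric, so only one triangle needs to be inspected when comparing languages.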

Results and Evaluation
• Taking a list of the exemplar images for retrieval, the ranking (where it exists) of each image within the 1000 results for each language was considered.
• For each language, if the exemplar image was retrieved within the first 1000, this was counted. If it was retrieved within the top 100 results, this was also noted.

  [Table: columns High, Low, Ave, In top 100; rows Nl, De, En, Fr, It, Es; values not preserved]
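The per-language summary described above could be computed along these lines. This is a sketch under assumptions: the function name and the input format (one rank per exemplar image, `None` when the image was not retrieved within the first 1000) are hypothetical, and the keys `best` and `worst` are used because the slides do not say whether "High" means the numerically smallest or largest rank.

```python
def summarise_ranks(ranks, cutoff=100):
    """Summarise exemplar-image ranks for one language.

    `ranks` holds one entry per topic: the rank of the exemplar image,
    or None if it was not retrieved at all.
    """
    found = [r for r in ranks if r is not None]
    if not found:
        return {"retrieved": 0, "best": None, "worst": None,
                "average": None, "in_top": 0}
    return {
        "retrieved": len(found),                     # exemplars found at all
        "best": min(found),                          # numerically smallest rank
        "worst": max(found),                         # numerically largest rank
        "average": sum(found) / len(found),
        "in_top": sum(1 for r in found if r <= cutoff),
    }


# Hypothetical ranks for four topics (not real data).
example = summarise_ranks([3, 250, None, 40])
```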

Note on Text & Image Retrieval
• Increasingly, images are being indexed and retrieved both by their visual content and by related texts, such as captions, that describe the image.
• Image descriptors extracted directly from image data (colour, texture and shape) tend to capture little of an image’s semantic content. Hence there is a need to extract information about the image content from collateral texts.
• Information fusion has been shown, in some cases, to increase the retrieval of relevant data.

Future Work
• Developing grid-enabled systems in which the different processing modules, as well as instances of the same module, could be run as services in parallel, which would significantly improve the processing time.
• The ranking mechanism needs to be further refined and tuned by carrying out more trial runs.
• To improve the query expansion, one suggestion is to use part-of-speech information from the query sentence to filter out some of the irrelevant expanded terms.
• We are investigating methods of effectively combining text-based with image-based retrieval techniques.
• Incorporating this technique into a system that learns how to index would improve performance.
• We are investigating the creation of multimedia thesauri.
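The part-of-speech suggestion can be illustrated with a toy filter. Everything here is hypothetical: the tags in `EXPANSIONS` are hand-assigned for the example rather than taken from WordNet or a tagger, and `filter_by_pos` is not part of the system described.

```python
# Hypothetical expanded terms for 'boat', each with a hand-assigned
# part of speech ("n" = noun sense, "v" = verb sense).
EXPANSIONS = {
    "boat": [
        ("sauceboat", "n"),
        ("gravy boat", "n"),
        ("travel on water", "v"),
        ("motorboat", "n"),
        ("propel by oars", "v"),
    ],
}


def filter_by_pos(term, pos):
    """Keep only the expanded terms whose sense matches the query term's part of speech."""
    return [word for word, tag in EXPANSIONS.get(term, []) if tag == pos]


# In "Boats on Loch Lomond" the term 'boat' is used as a noun,
# so the verb-sense expansions are dropped.
noun_terms = filter_by_pos("boat", "n")
```

A filter like this removes expansions from the wrong word class, but it does not resolve polysemy within a class: 'sauceboat' is still a noun and survives the filter.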

Questions?