Database & Information Systems Group, University of Basel
October 10th, 2007
Similarity Search
Michael Springmann
PhD Seminar, October 11th, 2007
Projects
  I. DELOS (EU FP6): Network of Excellence on Digital Libraries
    Task 1.6: Management of and Access to Virtual Electronic Health Records
    Task 1.8: DelosDLMS
  II. DILIGENT (EU FP6): A Digital Library Infrastructure on Grid Enabled Technology
    Work Package 1.4: Index & Search, Feature Extraction
    ARTE Scenario
What is similarity search?
  From a collection, return a list of items ranked by their similarity to a given reference object.
  (Figure label: Reference Object)
Steps to compute similarity
  I. Define the query (reference object)
  II. Select the feature to use for comparison, e.g. Color Histogram
  III. Extract the feature of the reference object
  IV. Compare the feature with each element of the collection
  V. Return (a subset of) the ranked list, e.g. 5-NN
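The five steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ISIS implementation: images are given as flat lists of RGB tuples, the feature is a coarse normalized color histogram, the comparison uses the L1 distance, and the names `color_histogram`, `l1_distance` and `knn_search` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255


def color_histogram(pixels: Sequence[Pixel], bins: int = 4) -> List[float]:
    """Step III: normalized RGB histogram with bins**3 cells (assumed feature)."""
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        ri, gi, bi = r * bins // 256, g * bins // 256, b * bins // 256
        hist[(ri * bins + gi) * bins + bi] += 1.0
    total = len(pixels) or 1  # avoid division by zero for empty images
    return [count / total for count in hist]


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Step IV: sum of absolute differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_search(query: Sequence[Pixel],
               collection: Dict[str, Sequence[Pixel]],
               k: int = 5) -> List[Tuple[str, float]]:
    """Steps I-V: rank the whole collection, return the k nearest items."""
    q = color_histogram(query)
    scored = [(l1_distance(q, color_histogram(pixels)), name)
              for name, pixels in collection.items()]
    scored.sort()  # smallest distance first
    return [(name, dist) for dist, name in scored[:k]]


collection = {
    "red": [(250, 10, 10)] * 4,
    "mixed": [(200, 20, 20)] * 2 + [(10, 10, 200)] * 2,
    "blue": [(10, 10, 250)] * 4,
}
ranking = knn_search([(255, 0, 0)] * 4, collection, k=2)
```

Here `ranking` lists the all-red image first and the half-red image second; the blue image falls outside the two nearest neighbours.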
Similarity Search: Media Types
  I. Image: Color, Texture, Shape
  II. Text: TF/IDF, Edit Distance
  III. Audio: Spectrum, Rhythm, Beat, Pitch
  IV. Video Sequences: Visual, Subtitles / Audio Transcripts, (rich) Meta Data
  Combinations of several types
  Complex Documents
  High-dimensional feature vectors
Goals
  Effectiveness
    Theme: Find good/better results!
    Measure: Quality, e.g. for benchmark collections: Precision, Recall, MAP
    Question: How can we find better results w.r.t. the information need of the user?
  Efficiency
    Theme: Retrieve the results fast!
    Measure: Execution time
    Question: How can we achieve this with algorithmic optimizations?
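The effectiveness measures named above can be made concrete with a short sketch. It uses the standard textbook definitions of precision, recall and (mean) average precision; the function names and the toy ranking are illustrative assumptions and do not come from any benchmark in the talk.

```python
from typing import Iterable, List, Sequence, Tuple


def precision_recall(retrieved: Sequence[str],
                     relevant: Iterable[str]) -> Tuple[float, float]:
    """Precision = hits / retrieved, recall = hits / relevant."""
    relevant = set(relevant)
    hits = sum(1 for item in retrieved if item in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked: Sequence[str], relevant: Iterable[str]) -> float:
    """Mean of the precision values at each rank where a relevant item occurs."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(
        runs: List[Tuple[Sequence[str], Iterable[str]]]) -> float:
    """MAP: average precision averaged over several queries."""
    scores = [average_precision(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


ranked = ["a", "x", "b", "y"]   # hypothetical result list for one query
relevant = {"a", "b", "c"}      # hypothetical ground truth
p, r = precision_recall(ranked, relevant)
ap = average_precision(ranked, relevant)
```

For this toy query two of the four results are relevant and two of the three relevant items are found, so precision is 1/2 and recall is 2/3.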
Similarity Search: What it is...
  A way to order / rank things
  May help to group objects
  Limitations:
    1. The feature must match the categorization criterion
    2. No sharp borders
ISIS (Interactive Similarity Search)
  I. Originated at ETH Zurich, continued at UMIT and UNIBAS
  II. VA-File can handle collections of size > images while still achieving interactive answering times
  III. Image features used: Color Moments, Texture Moments
  IV. Global and 5 Fuzzy Regions
5 Fuzzy Regions
Similarity Search: What it is... and what it ain't?
  A way to order / rank things
  May help to group objects
  Limitations:
    1. The feature must match the categorization criterion
    2. No sharp borders
  Feature extraction will not find out: One person sleeping... at least not without application-specific adjustments / training
Domain Dependent: Face Detection
  CMU Face Detector in ISIS
What is CLEF?
  Cross Language Evaluation Forum
  Started in 2000
  Continuation of the CLIR Track at TREC, which last ran in 2002
  Workshop held each year directly following ECDL
  Is a DELOS activity
  Several Tracks:
    Multilingual Document Retrieval on News Collections (Ad-Hoc)
    Scientific Data Retrieval (Domain-Specific)
    Interactive Cross-Language Information Retrieval (iCLEF)
    Multiple Language Question Answering
    Cross-Language Image Retrieval (ImageCLEF)
    Cross-Language Speech Retrieval (CL-SR)
    CLEF Web Track (WebCLEF)
    Cross-Language Geographical Information Retrieval (GeoCLEF)
ImageCLEF
  I. Ad-hoc photographic retrieval task: IAPR TC-12 Benchmark, (tourist) images, multilingual descriptions. Main challenge: Short annotations.
  II. Object Retrieval Task: PASCAL Visual Object, 2617 images, 4754 objects in realistic scenes. Main challenge: Pure visual, not pre-segmented.
  III. Medical Image Retrieval: PEIR, MIR, PathoPic, mypacs.net: > images, heterogeneous case notes in XML.
  IV. Medical Automatic Annotation Task: IRMA Database, medical images, annotated with IRMA Code (116 classes). Main challenge: Pure visual, classification domain specific.
IRMA Code Classification Example
  4 independent axes:
    Technical code (T) describes the image modality, e.g. 1 = x-ray, 11 = plain radiography, 112 = analog, 1123 = high beam energy
    Directional code (D) models body orientation, here: anteroposterior (AP, coronal), supine
    Anatomical code (A) refers to the body region examined, here: chest
    Biological code (B) describes the biological system examined
  0 always means unspecific and is therefore always followed by other 0s or -.
Medical Automatic Annotation Task
  I. Training images and development evaluation images: correct classification known. Test images: correct classification secret.
  II. Hierarchical classification: results are scored not only as true / false, but based on the number of choices in the hierarchy.
  III. BSc project by Andreas Dander (UMIT), „Bildähnlichkeitssuche mit Medizinischen Bildern“ ("Image similarity search with medical images"): implementation of the Image Distortion Model (IDM)
Image Distortion Model (IDM)
  Uses reduced size images of at most 32 pixels width/height
  (Figure label: Corresponding pixels)
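The slide itself only names the ingredients, so the following is a hedged sketch of how an image distortion model distance is commonly formulated, not the implementation discussed in the talk. For each pixel of the query, the best-matching pixel within a small warp window around the corresponding position in the reference is searched, comparing small context patches. Assumptions made here: both images are already scaled to the same size, plain gray values are used instead of gradients, borders are handled by clamping, and the defaults `warp=2` (5×5 window) and `context=1` (3×3 context) mirror the parameters mentioned elsewhere in the talk.

```python
from typing import List

Image = List[List[float]]  # gray values, row-major, same size for both images


def _clamp(value: int, low: int, high: int) -> int:
    return max(low, min(high, value))


def _patch_cost(a: Image, b: Image, ai: int, aj: int,
                bi: int, bj: int, context: int) -> float:
    """Squared difference of the context patches around a[ai][aj] and b[bi][bj]."""
    h, w = len(a), len(a[0])
    cost = 0.0
    for di in range(-context, context + 1):
        for dj in range(-context, context + 1):
            pa = a[_clamp(ai + di, 0, h - 1)][_clamp(aj + dj, 0, w - 1)]
            pb = b[_clamp(bi + di, 0, h - 1)][_clamp(bj + dj, 0, w - 1)]
            cost += (pa - pb) ** 2
    return cost


def idm_distance(query: Image, reference: Image,
                 warp: int = 2, context: int = 1) -> float:
    """Sum over all query pixels of the cheapest patch match in the warp window."""
    h, w = len(query), len(query[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            best = float("inf")
            for bi in range(max(0, i - warp), min(h, i + warp + 1)):
                for bj in range(max(0, j - warp), min(w, j + warp + 1)):
                    best = min(best, _patch_cost(query, reference,
                                                 i, j, bi, bj, context))
            total += best
    return total


def blank(size: int = 5) -> Image:
    return [[0.0] * size for _ in range(size)]


dot = blank()
dot[2][2] = 1.0        # single bright pixel in the centre
shifted = blank()
shifted[2][3] = 1.0    # same pixel, displaced by one column
```

A plain squared pixel difference between `dot` and `shifted` is 2, whereas the distortion-tolerant distance is smaller because the displaced pixel lies within the warp window.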
Edge Detection (Sobel Filter)
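As a reminder of what the Sobel filter computes, here is a minimal sketch using the standard 3×3 Sobel kernels on a gray-value image. Border handling by clamping and the function name `sobel_magnitude` are assumptions made for this illustration; the talk does not specify them.

```python
import math
from typing import List

Image = List[List[float]]

# Standard Sobel kernels for horizontal (x) and vertical (y) gradients.
KERNEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
KERNEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(img: Image) -> Image:
    """Gradient magnitude per pixel; coordinates outside the image are clamped."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            gx = gy = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii = max(0, min(h - 1, i + di))
                    jj = max(0, min(w - 1, j + dj))
                    value = img[ii][jj]
                    gx += KERNEL_X[di + 1][dj + 1] * value
                    gy += KERNEL_Y[di + 1][dj + 1] * value
            out[i][j] = math.hypot(gx, gy)
    return out


# A vertical edge: two dark columns followed by two bright columns.
edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
magnitude = sobel_magnitude(edge)
```

The response is non-zero only in the two columns adjacent to the edge, which is the behaviour that makes gradient images useful as input for a pixel-based distance.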
Efficiency: Speeding up IDM
  Algorithmic optimization
  Idea: Only the k ≤ 5 nearest reference images are used for the subsequent kNN classification.
  Early termination of the distance computation for images that will not be used
  Base the decision on a threshold derived from the best k images seen so far
  (Figure label: Pixels not evaluated due to exceeded threshold)
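The early termination idea can be sketched as follows. To keep the example short it uses a simple sum of squared pixel differences as a stand-in for IDM (an assumption; the argument only needs a distance that accumulates non-negative per-pixel terms, so a partial sum is a lower bound on the final distance). The names `bounded_distance` and `knn_early_termination` are hypothetical.

```python
import heapq
from typing import Dict, List, Optional, Sequence, Tuple


def bounded_distance(query: Sequence[float], reference: Sequence[float],
                     threshold: float) -> Tuple[Optional[float], int]:
    """Accumulate squared differences; give up once the threshold is exceeded.

    Returns (distance, pixels_evaluated); distance is None on early termination.
    """
    total = 0.0
    evaluated = 0
    for q, r in zip(query, reference):
        total += (q - r) ** 2
        evaluated += 1
        if total > threshold:
            return None, evaluated
    return total, evaluated


def knn_early_termination(query: Sequence[float],
                          references: Dict[str, Sequence[float]],
                          k: int = 5) -> Tuple[List[Tuple[float, str]], int]:
    """k nearest references plus the number of pixels actually evaluated."""
    best: List[Tuple[float, str]] = []  # max-heap via negated distances
    evaluated = 0
    for name, ref in references.items():
        # Threshold = distance of the k-th best image seen so far.
        threshold = -best[0][0] if len(best) == k else float("inf")
        dist, n = bounded_distance(query, ref, threshold)
        evaluated += n
        if dist is None:
            continue  # cannot be among the k nearest, skip the rest
        if len(best) < k:
            heapq.heappush(best, (-dist, name))
        else:
            heapq.heapreplace(best, (-dist, name))
    return sorted((-d, name) for d, name in best), evaluated


references = {
    "a": [0.0] * 8,
    "b": [1.0] + [0.0] * 7,
    "far": [10.0] * 8,
}
neighbours, pixels = knn_early_termination([0.0] * 8, references, k=2)
```

In this toy run the image `far` is abandoned after its first pixel, so 17 instead of 24 pixels are evaluated while the two nearest neighbours stay the same.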
Early Termination Strategy - Experimental results
  For IDM: Less than 30% of all pixels need to be evaluated
Speaking of numbers…
  I. The original RWTH Aachen implementation of IDM requires, for X×32 images with IDM (gradients, 5×5 window, 3×3 context), about 190 seconds per sample (= comparison) on a standard Pentium 4 PC running at 2.6 GHz.
  II. Using the L2 distance in a sieve function, they reduced this to 16.8 seconds, but this causes a slight degradation of results.
  III. Our Java implementation takes, for the same window and context area on a standard Pentium 4 PC at 2.4 GHz, only 16.0 seconds using the threshold (no degradation). The L2 distance can also benefit from the threshold: our sieve function implementation takes only 2.0 seconds per sample.
  IV. We cached all features in main memory (only 60 MB). Reading directly from disk takes less than 5 seconds in total. Since reading is performed in parallel to the computation, the penalty for IDM is only about 0.3 seconds; the sieve function becomes I/O-bound.
Multithreading - Implementation
  I. Several Java worker threads; each computes the similarity between one reference image and the query.
  II. A dispatcher keeps track of the distance threshold for early termination.
  III. IDM with early termination takes 4.3 seconds on a Fujitsu-Siemens Celsius M450 workstation (Intel Core 2 Duo E6600, 2.4 GHz) and only 1.5 seconds on an IBM xSeries 445 with 8x Intel Xeon MP 2.8 GHz.
  IV. Opens the possibility of optimizing the second goal…
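The worker / dispatcher arrangement might look roughly like the sketch below. It is written in Python only to illustrate the coordination pattern; the implementation described in the talk is in Java, and CPython threads would not give the reported speedup for CPU-bound work. The class `Dispatcher`, the functions `worker` and `parallel_knn`, and the squared-difference stand-in for IDM are all assumptions for illustration.

```python
import heapq
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence, Tuple


class Dispatcher:
    """Keeps the k best results and hands out the current threshold."""

    def __init__(self, k: int) -> None:
        self.k = k
        self._best: List[Tuple[float, str]] = []  # max-heap via negation
        self._lock = threading.Lock()

    def threshold(self) -> float:
        with self._lock:
            if len(self._best) < self.k:
                return float("inf")
            return -self._best[0][0]

    def report(self, name: str, dist: float) -> None:
        with self._lock:
            if len(self._best) < self.k:
                heapq.heappush(self._best, (-dist, name))
            elif dist < -self._best[0][0]:
                heapq.heapreplace(self._best, (-dist, name))

    def result(self) -> List[Tuple[float, str]]:
        with self._lock:
            return sorted((-d, name) for d, name in self._best)


def worker(dispatcher: Dispatcher, query: Sequence[float],
           name: str, reference: Sequence[float]) -> None:
    """Compare one reference image with the query, with early termination."""
    limit = dispatcher.threshold()  # snapshot; the threshold only ever shrinks
    total = 0.0
    for q, r in zip(query, reference):
        total += (q - r) ** 2
        if total > limit:
            return  # cannot be among the k nearest
    dispatcher.report(name, total)


def parallel_knn(query: Sequence[float],
                 references: Dict[str, Sequence[float]],
                 k: int = 5, threads: int = 4) -> List[Tuple[float, str]]:
    dispatcher = Dispatcher(k)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(worker, dispatcher, query, name, ref)
                   for name, ref in references.items()]
        for future in futures:
            future.result()  # re-raise any worker exception
    return dispatcher.result()


references = {
    "a": [0.0] * 8,
    "b": [1.0] + [0.0] * 7,
    "far": [10.0] * 8,
}
nearest = parallel_knn([0.0] * 8, references, k=2)
```

Because the threshold can only decrease, a worker that stops early on a stale, larger threshold never discards an image that belongs in the final result; it merely does a little more work than strictly necessary.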
Effectiveness: Adjusting IDM
  Uses reduced size images of at most 32 pixels width/height
  (Figure label: Corresponding pixels)
Multithreading - Results
ImageCLEF 2007 Results
  Group         Approach
  RWTH_mi       KNN: IDM + CCF + TTF
  BLOOM         SVM: SIFT + Pixels
  RWTHi6        SVM/ME: Image Patches
  UFR           SVM: Color Moments + Texture (DWT) + Edge Orientation
  UNIBAS_DBIS   KNN: IDM
  OHSU          Neural Network: GIST; SVM: SIFT
  BIOMOD        Decision Trees: Random Subwindow
  Callouts on the slide: Use Machine Learning; Experts on Domain – Provided Dataset, won 2005; Use Machine Learning; No Machine Learning… yet
What’s next?
  Blobworld
  More expressive query definition!
  Region of interest
iPaper demo
Query by Sketch (SNF Project)
Compound Document Matching
  E.g. patient records
Conclusion
  I. Similarity Search allows for a variety of applications: new means for browsing, data mining, classification
  II. It is computationally intensive
    Algorithmic optimization can speed up IDM by large factors
    Multithreading / distributed execution
  III. A query requires an example object
    Query by Sketch (QbS) may help