Database & Information Systems Group, University of Basel, October 10th, 2007. Similarity Search. Michael Springmann, PhD Seminar, October 11th, 2007.


1 Database & Information Systems Group, University of Basel, October 10th, 2007. Similarity Search. Michael Springmann, PhD Seminar, October 11th, 2007

2 Projects
I. DELOS (EU FP6)
- Network of Excellence on Digital Libraries
- http://www.delos.info/
- Task 1.6 Management of and Access to Virtual Electronic Health Records
- Task 1.8 DelosDLMS
II. DILIGENT (EU FP6)
- A Digital Library Infrastructure on Grid Enabled Technology
- Work Package 1.4 Index & Search – Feature Extraction
- ARTE Scenario

3 What is similarity search?
From a collection, return a ranked list of items for a given reference object.
[Figure: reference object and ranked results with scores: 1. 0.999, 2. 0.873, 3. 0.722, 4. 0.712, 5. 0.503, 6. 0.442, 7. 0.392]

4 Steps to compute similarity
I. Define query (reference object)
II. Select feature to use for comparison, e.g. Color Histogram
III. Extract feature of reference object
IV. Compare feature with each element of collection
V. Return (subset of) ranked list, e.g. 5-NN
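
The five steps can be sketched in a few lines of Python. This is a minimal illustration only, not the ISIS implementation: the 4-bin intensity histogram, the Euclidean distance, and the toy collection are all assumptions made for the example. It ranks by distance (smaller is better) rather than by a similarity score.

```python
from math import sqrt

def color_histogram(pixels, bins=4):
    """Step III: quantize 0..255 intensities into bins, normalized to sum 1.
    (Assumption: a real color histogram would bin color channels, not one intensity.)"""
    hist = [0] * bins
    for value in pixels:
        hist[min(value * bins // 256, bins - 1)] += 1
    return [count / len(pixels) for count in hist]

def euclidean(a, b):
    """Step IV: compare two feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank(query_pixels, collection, k=5):
    """Steps IV and V: score every item, return the k closest as (distance, name)."""
    q = color_histogram(query_pixels)
    scored = [(euclidean(q, color_histogram(p)), name)
              for name, p in collection.items()]
    return sorted(scored)[:k]

# Hypothetical toy collection: each "image" is just a list of intensities.
collection = {
    "dark": [10, 20, 30, 40],
    "bright": [200, 220, 240, 250],
    "mixed": [10, 100, 150, 250],
}
result = rank([15, 25, 35, 45], collection, k=2)
print(result)
```

With `k=2` this returns the two nearest neighbours of the query, which corresponds to the "e.g. 5-NN" cut-off on the slide with a smaller k.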

5 Similarity Search: Media Types
I. Image – Color, Texture, Shape
II. Text – TF/IDF, Edit Distance
III. Audio – Spectrum, Rhythm, Beat, Pitch
IV. Video Sequences – Visual, Subtitles / Audio Transcripts, (rich) Meta Data
- Combinations of several types
- Complex Documents
High dimensional feature vectors

6 Goals
Effectiveness
- Theme: Find good/better results!
- Measure: Quality, e.g. for benchmark collections Precision, Recall, MAP
- Question: How can we find better results w.r.t. the information need of the user?
Efficiency
- Theme: Retrieve the results fast!
- Measure: Execution time
- Question: How can we achieve this with algorithmic optimizations?
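
The effectiveness measures named above have standard definitions, sketched here for a single ranked result list. The function names and the toy query are illustrative assumptions; MAP is simply the mean of the average precision over several queries.

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved items that are relevant.
    Recall: fraction of relevant items that were retrieved."""
    hits = sum(1 for item in retrieved if item in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over a list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical query: items "a" and "c" are relevant.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
p, r = precision_recall(ranked, relevant)
ap = average_precision(ranked, relevant)
print(p, r, ap)
```

Here precision is 2/4, recall is 2/2, and average precision is (1/1 + 2/3) / 2.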

7 Similarity Search: What it is...
A way to order / rank things
May help to group objects
Limitations:
1. Feature matches categorization criterion
2. No sharp borders


10 ISIS (Interactive Similarity Search)
I. Originated at ETH Zurich, continued at UMIT and UNIBAS
II. VA-File can handle collections of size > 600,000 images while still achieving interactive answering times
III. Used image features: Color Moments, Texture Moments
IV. Global and 5 Fuzzy Regions

11 5 Fuzzy Regions

12 Similarity Search: What it is... and what it ain't?
A way to order / rank things
May help to group objects
Limitations:
1. Feature matches categorization criterion
2. No sharp borders
Feature extraction will not find out: One person sleeping... at least not without application specific adjustments / training

13 Domain Dependent: Face Detection
CMU Face Detector in ISIS

14 What is CLEF?
Cross Language Evaluation Forum (http://www.clef-campaign.org/)
Started in 2000
Continuation of the CLIR Track at TREC, which last ran in 2002
Workshop held each year directly following ECDL
Is a DELOS activity
Several Tracks:
- Multilingual Document Retrieval on News Collections (Ad-Hoc)
- Scientific Data Retrieval (Domain-Specific)
- Interactive Cross-Language Information Retrieval (iCLEF)
- Multiple Language Question Answering (QA@CLEF)
- Cross-Language Image Retrieval (ImageCLEF)
- Cross-Language Speech Retrieval (CL-SR)
- CLEF Web Track (WebCLEF)
- Cross-Language Geographical Information Retrieval (GeoCLEF)

15 ImageCLEF (http://www.imageclef.org)
I. Ad-hoc photographic retrieval task: IAPR TC-12 Benchmark, 20,000 (tourist) images, multilingual descriptions. Main challenge: Short annotations.
II. Object Retrieval Task: PASCAL Visual Object, 2617 images, 4754 objects in realistic scenes. Main challenge: Pure visual, not pre-segmented.
III. Medical Image Retrieval: c@simage, PEIR, MIR, PathoPic, mypacs.net: > 70,000 images, heterogeneous case notes in XML
IV. Medical Automatic Annotation Task: IRMA Database, 11,000 medical images, annotated with IRMA Code (116 classes). Main challenge: Pure visual, classification domain specific.
Example IRMA Code: 1123-127-500-000

16 IRMA Code Classification
Example: 1123-127-500-000
4 independent axes:
- Technical code (T) describes the image modality, e.g. 1 = x-ray, 11 = plain radiography, 112 = analog, 1123 = high beam energy
- Directional code (D) models body orientation, here: anteroposterior (AP, coronal), supine
- Anatomical code (A) refers to the body region examined, here: chest
- Biological code (B) describes the biological system examined.
0 always means unspecific and therefore is always followed by other 0s or -.
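
To make the structure of the code concrete, the following sketch splits the example into its four axes and lists the hierarchy levels within one axis. The function names are made up for this illustration; only the T-D-A-B axis order and the example code come from the slide.

```python
AXES = ("technical", "directional", "anatomical", "biological")

def parse_irma(code):
    """Split an IRMA code such as '1123-127-500-000' into its four axes."""
    parts = code.split("-")
    if len(parts) != len(AXES):
        raise ValueError(f"expected {len(AXES)} axes, got {len(parts)}")
    return dict(zip(AXES, parts))

def hierarchy_levels(axis_code):
    """Each additional character refines the previous level:
    '1123' -> ['1', '11', '112', '1123']."""
    return [axis_code[:i] for i in range(1, len(axis_code) + 1)]

axes = parse_irma("1123-127-500-000")
levels = hierarchy_levels(axes["technical"])
print(axes)
print(levels)
```

The levels of the technical axis line up with the slide's reading: 1 = x-ray, 11 = plain radiography, 112 = analog, 1123 = high beam energy.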

17 Medical Automatic Annotation Task
I. 10,000 training images, 1,000 development evaluation images – correct classification known. 1,000 test images – correct classification secret.
II. Hierarchical classification: results are scored not only as true / false, but based on how many choices there are in the hierarchy.
III. BSc project by Andreas Dander (UMIT)
- „Bildähnlichkeitssuche mit Medizinischen Bildern“ ("Image Similarity Search with Medical Images")
- Implementation of Image Distortion Model (IDM)

18 Image Distortion Model (IDM)
Uses reduced size images of at most 32 pixels width/height
[Figure: corresponding pixels]
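
A strongly simplified sketch of the idea behind IDM: each query pixel may be matched to the best-fitting reference pixel within a small warp range, so small local displacements are not penalized. This sketch assumes both images have the same size and compares raw intensities of single pixels; the IDM variant discussed in these slides works on gradients and compares a local context around each pixel, which is omitted here.

```python
def idm_distance(query, reference, warp=1):
    """Sum over all query pixels of the smallest squared difference to any
    reference pixel within `warp` pixels of the same position.
    Assumption: both images are equally sized lists of rows."""
    height, width = len(query), len(query[0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            best = float("inf")
            # Search the (2*warp+1) x (2*warp+1) window, clipped at the borders.
            for ry in range(max(0, y - warp), min(height, y + warp + 1)):
                for rx in range(max(0, x - warp), min(width, x + warp + 1)):
                    diff = (query[y][x] - reference[ry][rx]) ** 2
                    if diff < best:
                        best = diff
            total += best
    return total

# The bright pixel is shifted by one column between the two toy images.
a = [[0, 0, 9], [0, 0, 0], [0, 0, 0]]
b = [[0, 9, 0], [0, 0, 0], [0, 0, 0]]
print(idm_distance(a, b, warp=0))  # plain pixel-wise squared distance
print(idm_distance(a, b, warp=1))  # the shift is absorbed by the warp range
```

With `warp=0` the function degenerates to the pixel-wise squared Euclidean distance, which shows what the deformation tolerance adds.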

19 Edge Detection (Sobel Filter)
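
For reference, the Sobel filter convolves the image with two 3×3 kernels to estimate the horizontal and vertical gradient. The sketch below uses the standard kernels; leaving border pixels at zero is a simplification chosen for this example, not something the slides specify.

```python
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))

def sobel(image):
    """Return (gx, gy) gradient images for a list-of-rows grayscale image.
    Border pixels are left at 0 (assumption made for brevity)."""
    height, width = len(image), len(image[0])
    gx = [[0] * width for _ in range(height)]
    gy = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            sx = sy = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    value = image[y + dy][x + dx]
                    sx += SOBEL_X[dy + 1][dx + 1] * value
                    sy += SOBEL_Y[dy + 1][dx + 1] * value
            gx[y][x], gy[y][x] = sx, sy
    return gx, gy

# Toy image with a vertical edge between columns 1 and 2.
image = [[0, 0, 10, 10]] * 3
gx, gy = sobel(image)
print(gx[1], gy[1])
```

A vertical edge produces a strong horizontal gradient and no vertical gradient.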

20 Efficiency: Speeding up IDM
Algorithmic optimization
Idea: Only k ≤ 5 of the 10,000 reference images are used for subsequent kNN classification.
- Early termination of distance computation of unused images
- Base decision on threshold derived from best k images seen so far
[Figure: pixels not evaluated due to exceeded threshold]
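
The early termination idea can be sketched as follows. The distance here is a plain sum of squared differences over feature vectors, used as a stand-in for IDM; the function names and toy data are assumptions. The point is the abort condition: once the partial sum exceeds the distance of the k-th best image seen so far, the image cannot enter the result and the remaining pixels are skipped.

```python
import heapq

def bounded_distance(a, b, threshold):
    """Sum of squared differences; returns None as soon as the partial
    sum exceeds `threshold`, because the final sum can only grow."""
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
        if total > threshold:
            return None
    return total

def knn_early_termination(query, references, k=5):
    """Return the k nearest references as sorted (distance, index) pairs."""
    best = []  # max-heap via negated distances: (-distance, index)
    for index, ref in enumerate(references):
        # Threshold = distance of the current k-th best, or infinity until
        # k candidates have been collected.
        threshold = -best[0][0] if len(best) == k else float("inf")
        d = bounded_distance(query, ref, threshold)
        if d is None:
            continue
        if len(best) < k:
            heapq.heappush(best, (-d, index))
        else:
            heapq.heapreplace(best, (-d, index))
    return sorted((-neg, idx) for neg, idx in best)

references = [[1, 1, 1], [5, 5, 5], [0, 0, 1], [9, 9, 9], [2, 0, 0]]
nearest = knn_early_termination([0, 0, 0], references, k=2)
print(nearest)
```

Because every term of the sum is non-negative, aborting never changes the k nearest neighbours; it only avoids work on images that would be discarded anyway.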

21 Early Termination Strategy - Experimental results
For IDM: Less than 30% of all pixels need to be evaluated

22 Speaking of numbers…
I. The original RWTH Aachen implementation of IDM requires, for X×32 images and IDM (gradients, 5×5 window, 3×3 context), about 190 seconds per sample (= comparison) on a standard Pentium 4 PC running at 2.6 GHz.
II. Using the L2 distance in a Sieve function, they reduced this to 16.8 seconds, but this causes a slight degradation of results.
III. Our Java implementation takes only 16.0 seconds for the same window & context area on a standard Pentium 4 PC at 2.4 GHz using the threshold (no degradation). The L2 distance can also benefit from the threshold: our Sieve function implementation takes only 2.0 seconds per sample.
IV. We cached all features in main memory (only 60 MB). Reading directly from disk takes less than 5 seconds in total. Since this is performed in parallel to the computation, the penalty for IDM is only about 0.3 seconds; the Sieve function becomes I/O-bound.
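
One plausible reading of the Sieve function is a two-stage search: a cheap L2 distance selects a small candidate set, and only those candidates are compared with the expensive distance. The sketch below follows that reading; the parameter `sieve_size`, the function names, and the toy "expensive" distance are assumptions, not details taken from the slides. It also shows where the degradation comes from: an image filtered out by L2 can never be recovered.

```python
def l2_squared(a, b):
    """Cheap filter distance (squared L2, which preserves the ordering)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sieve_knn(query, references, expensive_distance, k=5, sieve_size=500):
    """Keep the `sieve_size` references closest by L2, then rerank only
    those with `expensive_distance` and return the best k."""
    by_l2 = sorted(range(len(references)),
                   key=lambda i: l2_squared(query, references[i]))
    candidates = by_l2[:sieve_size]
    reranked = sorted((expensive_distance(query, references[i]), i)
                      for i in candidates)
    return reranked[:k]

# Hypothetical stand-in for an expensive distance such as IDM; it records
# how often it is called so the saving is visible.
calls = []
def toy_expensive(a, b):
    calls.append(1)
    return sum(abs(x - y) for x, y in zip(a, b))

refs = [[0, 0], [1, 1], [5, 5], [9, 9]]
top = sieve_knn([0, 0], refs, toy_expensive, k=1, sieve_size=2)
print(top, len(calls))
```

Only two of the four references reach the expensive comparison in this toy run.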

23 Multithreading - Implementation
I. Several Java Worker Threads, each computes similarity between one reference image and query.
II. Dispatcher keeps track of distance threshold for early termination.
III. IDM with early termination takes 4.3 seconds on Fujitsu-Siemens Celsius M450 Workstation (Intel Core 2 Duo E6600, 2.4 GHz) – and only 1.5 seconds on IBM xSeries 445, 8x Intel Xeon MP 2.8 GHz.
IV. Opens possibility for optimizing second goal…
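
The worker / dispatcher structure can be sketched as below. The original is a Java implementation; this Python version only illustrates the structure, and the class and function names are invented for the example. Note that CPython threads do not run pure-Python computation in parallel, so this sketch demonstrates the shared-threshold design rather than the speed-up reported on the slide.

```python
import heapq
import threading
from concurrent.futures import ThreadPoolExecutor

class Dispatcher:
    """Holds the best k results and the threshold shared by all workers."""
    def __init__(self, k):
        self.k = k
        self._best = []  # max-heap via negated distances
        self._lock = threading.Lock()

    def threshold(self):
        with self._lock:
            if len(self._best) == self.k:
                return -self._best[0][0]
            return float("inf")

    def report(self, distance, index):
        with self._lock:
            if len(self._best) < self.k:
                heapq.heappush(self._best, (-distance, index))
            elif distance < -self._best[0][0]:
                heapq.heapreplace(self._best, (-distance, index))

    def results(self):
        with self._lock:
            return sorted((-neg, idx) for neg, idx in self._best)

def worker(dispatcher, query, index, reference):
    """Compare one reference image with the query, aborting early when the
    partial distance exceeds the current shared threshold."""
    total = 0.0
    for x, y in zip(query, reference):
        total += (x - y) ** 2
        if total > dispatcher.threshold():
            return
    dispatcher.report(total, index)

def parallel_knn(query, references, k=5, workers=4):
    dispatcher = Dispatcher(k)
    # Leaving the with-block waits until all submitted tasks have finished.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for index, reference in enumerate(references):
            pool.submit(worker, dispatcher, query, index, reference)
    return dispatcher.results()

references = [[1, 1, 1], [5, 5, 5], [0, 0, 1], [9, 9, 9], [2, 0, 0]]
nearest = parallel_knn([0, 0, 0], references, k=2)
print(nearest)
```

A stale threshold is harmless here: the threshold only ever decreases, so a worker that reads an old value merely evaluates a few more pixels than strictly necessary.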

24 Effectiveness: Adjusting IDM
Uses reduced size images of at most 32 pixels width/height
[Figure: corresponding pixels]

25 Multithreading - Results

26 ImageCLEF 2007 Results
RWTH_mi: KNN: IDM + CCF + TTF
BLOOM: SVM: SIFT + Pixels
RWTHi6: SVM/ME: Image Patches
UFR: SVM: Color Moments + Texture (DWT) + Edge Orientation
UNIBAS_DBIS: KNN: IDM
OHSU: Neural Network: GIST SVM: SIFT
BIOMOD: Decision Trees: Random Subwindow
Callouts on the slide: Use Machine Learning / Experts on Domain – Provided Dataset, won 2005 / Use Machine Learning / No Machine Learning… yet

27 What's next?
Blobworld (http://elib.cs.berkeley.edu/blobworld/)
More expressive query definition!
Region of interest

28 iPaper demo

29 Query by Sketch (SNF Project)

30 Compound Document Matching
E.g. patient records

31 Conclusion
I. Similarity Search allows for a variety of applications: New means for browsing, data mining, classification
II. Is computationally intensive
- Algorithmic optimization can speed up IDM by factors 3.5-4.9
- Multithreading / distributed execution
III. Query requires example object
- QbS may help

