A Web-based System for Collaborative Annotation of Large Image and Video Collections
An Evaluation and User Study
MULTIMEDIA '05: Proceedings of the 13th Annual ACM International Conference on Multimedia
Authors: Timo Volkmer, John R. Smith, and Apostol (Paul) Natsev
Presented by: Thay Setha
Outline
1. Introduction
2. Related work
3. The IBM EVA Annotation System
4. Application and Evaluation
5. Conclusion and Future Work
Introduction
Research and development of video and image search systems has become very popular. Annotated collections of images and videos are a necessary basis for the successful development of multimedia retrieval systems. Large training datasets need to be annotated both completely and accurately.
Introduction
In this paper, the authors describe and evaluate a web-based system for collaborative annotation of large collections of images or temporally pre-segmented videos: the IBM Efficient Video Annotation (EVA) system.
– Optimized for collaborative annotation
– Features workload sharing
– Supports inter-annotator analysis
EVA was developed for the 2005 TRECVID Annotation Forum, to annotate approximately 80 hours of video for the 2005 TRECVID Video Retrieval Evaluation benchmark.
Introduction
The major focus in the design of this system was on usability:
– Simplifying and speeding up the annotation process
– Maintaining configuration and customization options
– Supporting different annotation styles, which is a necessity for a large user base of annotators:
– Customizable number, size, and layout of thumbnails displayed per page (annotating only a few images per page without scrolling vs. scrolling and annotating many images at a time)
– Using the mouse or keyboard for navigation and annotation
– Selecting one or more concepts to annotate at a time
Introduction
The EVA tool was designed with a few simplifying assumptions to promote consistency, simplicity, and speed of annotation:
– All annotations use terms from a small controlled-term vocabulary; no free-text annotations are allowed.
– All annotations are for static visual concepts only and can be inferred from a single key frame, without requiring users to play back video clips.
– All annotations are assigned at the global frame level only and are assumed to apply to the entire shot. No object identification or regional annotation is required.
Related work
Informedia Image Classifier, by the Informedia team at Carnegie Mellon University:
– Semi-supervised image classification in a standalone Microsoft Windows application
– Does not provide statistics during collaborative annotation
VIPER Annotation Tool, by the Computer Vision Group at the University of Geneva:
– Features temporal segmentation, browsing, and event characterization
ESP Game:
– Web-based system that annotates images with custom concepts in a game-like environment
– Confidence of each annotation is computed from how many users agree on a particular concept for an image
Ricoh MovieTool:
– MPEG-7-based system for video annotation
– Automatic shot segmentation and hierarchical annotation
– Complicated user interface
The IBM EVA Annotation System
The EVA system is a new web-based image and video annotation application that users access through a web browser. During annotation, users navigate page by page through the entire set of images. Users can specify parameters such as:
– the number of thumbnails per page
– their organization in columns
– the thumbnail size, where a thumbnail can be either an image or a representative frame of a video segment
An annotator selects a video and one or more concepts to use in a session.
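The paper gives no code; as a minimal sketch under that caveat, the per-session display parameters could be modeled as a small configuration object (all names and defaults below are hypothetical):

```python
from dataclasses import dataclass

# Hypothetical sketch of EVA's per-session display settings; the paper
# describes these parameters but does not give an implementation.
@dataclass
class SessionConfig:
    thumbnails_per_page: int = 24  # how many thumbnails to show per page
    columns: int = 6               # thumbnails are organized in columns
    thumbnail_px: int = 120        # thumbnail edge length in pixels

# Example: a user who prefers fewer, larger thumbnails without scrolling
config = SessionConfig(thumbnails_per_page=12, columns=4, thumbnail_px=160)
```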
The IBM EVA Annotation System
The IBM EVA Annotation System
Each image can be assigned one of four labels with regard to the currently selected concept:
– Positive: The image can clearly be classified with the given concept.
– Negative: The image can clearly be classified as not possessing the given concept.
– Ignore: The semantics of the image are not clearly expressed, and it should not be used for classification with the given concept (e.g., blurred images and imperfect frames).
– Skip: The image remains unannotated for now and will be reviewed later. This is the default state.
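A minimal sketch of these four label states as a simple enum; the names mirror the slide, not the paper's actual code:

```python
from enum import Enum

# Hypothetical sketch of EVA's four per-image labels (not from the paper's code).
class Label(Enum):
    POSITIVE = "positive"  # image clearly shows the concept
    NEGATIVE = "negative"  # image clearly does not show the concept
    IGNORE = "ignore"      # semantics unclear (e.g., blurred frame); exclude from training
    SKIP = "skip"          # not yet annotated; the default state

# Every thumbnail starts in the default state:
default_label = Label.SKIP
```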
The IBM EVA Annotation System
A custom concept lexicon can be loaded, and each annotator can be assigned either the full lexicon or a part thereof, then choose one or more of the assigned concepts for an annotation session. Labels are assigned for one concept at a time; the user decides which and how many concepts are used in the current session. This leads to more accurate and complete annotations, as opposed to annotating all available concepts at once.
The IBM EVA Annotation System
Bulk annotation buttons:
– Bulk-positive: All thumbnails on the page currently marked as "skip" are set to "positive".
– Bulk-negative: All thumbnails on the page currently marked as "skip" are set to "negative".
– Bulk-ignore: All thumbnails on the page currently marked as "skip" are set to "ignore".
– Bulk-skip: All thumbnails on the current page are reset to "skip", regardless of their current state.
Bulk buttons act only on previously unlabelled thumbnails, except for bulk-skip, which clears all annotations for a given concept and page of thumbnails. This enables efficient annotation of very rare or very frequent concepts by assuming a default of "negative" or "positive" labels, which is easier than going through each image and assigning its label separately.
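A minimal sketch of the bulk-button semantics described above; function and variable names are assumptions, not the paper's implementation:

```python
# Hypothetical sketch of the bulk-button logic; labels are plain strings here.
def apply_bulk(page_labels, action):
    """Apply one bulk action to a page of thumbnail labels."""
    if action == "skip":
        # Bulk-skip resets every thumbnail on the page, regardless of state.
        return ["skip"] * len(page_labels)
    # Bulk-positive/-negative/-ignore act only on thumbnails still at "skip".
    return [action if label == "skip" else label for label in page_labels]

page = ["skip", "positive", "skip", "negative"]
print(apply_bulk(page, "negative"))  # ['negative', 'positive', 'negative', 'negative']
print(apply_bulk(page, "skip"))      # ['skip', 'skip', 'skip', 'skip']
```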
The IBM EVA Annotation System
Besides using the mouse, annotation can be performed more efficiently with the keyboard once a user has gone through a brief training phase. Labels can be assigned with a single keystroke, and when a label is assigned via keyboard, the cursor automatically advances to the next thumbnail. The system tracks annotation progress, with statistics for each video and each concept assigned to the current user.
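A minimal sketch of single-keystroke labelling with auto-advance; the key bindings are invented for illustration, since the paper does not list them:

```python
# Hypothetical key bindings; the paper does not specify EVA's actual keys.
KEYMAP = {"p": "positive", "n": "negative", "i": "ignore", "s": "skip"}

def handle_keystroke(labels, cursor, key):
    """Assign a label at the cursor and advance to the next thumbnail."""
    action = KEYMAP.get(key)
    if action is None:
        return cursor  # unknown key: no label change, cursor stays put
    labels[cursor] = action
    return min(cursor + 1, len(labels) - 1)  # auto-advance after labelling

labels = ["skip"] * 4
cursor = 0
for key in "ppn":
    cursor = handle_keystroke(labels, cursor, key)
print(labels)  # ['positive', 'positive', 'negative', 'skip']
```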
The IBM EVA Annotation System
They added the ability to collect aggregate-level user data during annotation:
– time spent on each page
– number and size of thumbnails
– statistics about keyboard and mouse usage
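A minimal sketch of the kind of per-page usage record this implies; the field names are invented, as the paper only lists the categories of data collected:

```python
from dataclasses import dataclass

# Hypothetical per-page usage record; not the paper's actual schema.
@dataclass
class PageUsageRecord:
    user_id: str
    seconds_on_page: float   # time spent on each page
    thumbnails_on_page: int  # number of thumbnails shown
    thumbnail_px: int        # size of the thumbnails shown
    keyboard_events: int     # keystrokes used for labelling/navigation
    mouse_events: int        # mouse clicks used for labelling/navigation
```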
Application and Evaluation
The collection of video clips provided for annotation consisted of 137 television news and entertainment broadcasts in Chinese, Arabic, and English. Each video was temporally segmented into shots, and one representative frame was selected for each shot, resulting in 61,904 frames.
Application and Evaluation
During the annotation effort, they monitored statistics such as inter-user agreement, average annotation time, concept frequency, and progress per concept to study the manual annotation process. Along 7 semantic dimensions, the concepts in the lexicon are grouped into 7 categories:
– Category A: Program Category (7 concepts)
– Category B: Setting/Scene/Site (15 concepts)
– Category C: People (8 concepts)
– Category D: Objects (8 concepts)
– Category E: Activities (2 concepts)
– Category F: Events (2 concepts)
– Category G: Graphics (2 concepts)
Application and Evaluation
Figure 3: Correlation between concept frequency and inter-user disagreement. Pearson's correlation coefficient is r = 0.73 for all concepts and r = 0.78 when omitting "Urban", "Vegetation", "Entertainment", and "Police/Security".
The inter-annotator disagreement is based on shots with redundant annotations, i.e., shots that two or more users annotated with the same concept, considering only "positive" and "negative" labels. They observe a high correlation between concept frequency and inter-user disagreement; as could be expected, frequent concepts tend to cause more disagreement.
Application and Evaluation
Figure 4: Inter-user disagreement for all concepts, normalized by concept frequency. Concepts such as "Urban", "Vegetation", "Entertainment", and "Police/Security" clearly stand out.
Figure 4 is obtained by normalizing the inter-annotator disagreement by concept frequency. This confirms that some concepts stand out with relatively high disagreement. They conclude that this might be caused by an unclear specification of these concepts.
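A rough sketch of how the statistics behind Figures 3 and 4 could be computed; the paper gives no formulas, so the disagreement measure below is one plausible reading, run on toy data invented for illustration:

```python
from statistics import correlation  # Pearson's r; available in Python 3.10+

# Hypothetical sketch of the disagreement analysis; not the paper's exact measure.
def concept_stats(shots):
    """shots: per-shot label lists for one concept, e.g. [['positive', 'negative'], ...]."""
    pos = conflicts = redundant = 0
    for labels in shots:
        votes = [l for l in labels if l in ("positive", "negative")]
        if "positive" in votes:
            pos += 1
        if len(votes) >= 2:            # shot annotated redundantly by 2+ users
            redundant += 1
            if "positive" in votes and "negative" in votes:
                conflicts += 1         # users disagreed on this shot
    frequency = pos / len(shots)
    disagreement = conflicts / redundant if redundant else 0.0
    return frequency, disagreement

# Toy data for three concepts (invented for illustration):
concepts = {
    "Person":  [["positive", "positive"], ["positive", "negative"], ["positive"]],
    "Urban":   [["positive", "negative"], ["negative", "positive"], ["negative"]],
    "Weather": [["negative", "negative"], ["negative", "negative"], ["positive"]],
}
freqs, disags = zip(*(concept_stats(s) for s in concepts.values()))
r = correlation(freqs, disags)                               # Figure 3: Pearson's r
norm = [d / f if f else 0.0 for d, f in zip(disags, freqs)]  # Figure 4: normalized
print(f"Pearson r = {r:.2f}")
```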
Application and Evaluation
Figure 5: Average concept inter-user disagreement, frequency, and annotation time as a function of concept category. Concept categories are ordered by annotation time per frame.
Figure 5 shows average concept frequency, average annotation time per frame, and normalized inter-user disagreement for all concept categories. The categories "Objects" and "Events" require a long time to annotate but show low disagreement, which leads to the conclusion that these concepts are generally more complex to annotate: they are well defined and can be identified, but require much attention.
Application and Evaluation
Figure 6: Average annotation time per frame as a function of concept frequency. Rare and frequent concepts on average required more time to label (computed variance σ² = 0.22).
They grouped all concepts according to their frequency and evaluated the average annotation time for each group. Figure 6 shows that rare and frequent concepts required more time to annotate than concepts of medium frequency. The bulk annotation buttons likely contributed to annotation times not differing greatly across concepts. This is supported by user feedback suggesting that this feature was generally appreciated.
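A rough sketch of the Figure 6 analysis, binning concepts by frequency and averaging annotation time per bin; the bin count and data are invented for illustration:

```python
# Hypothetical sketch: group concepts by frequency, average the time per group.
def bin_by_frequency(concepts, n_bins=5):
    """concepts: list of (frequency, seconds_per_frame) pairs; frequency in [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for freq, secs in concepts:
        idx = min(int(freq * n_bins), n_bins - 1)  # map frequency to a bin index
        bins[idx].append(secs)
    # Average annotation time per bin; None where a bin is empty.
    return [sum(b) / len(b) if b else None for b in bins]

toy = [(0.02, 1.9), (0.05, 1.7), (0.3, 1.1), (0.5, 1.2), (0.85, 1.8)]
print(bin_by_frequency(toy))  # rare and frequent bins show higher average times
```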
Application and Evaluation
Figure 9: Average annotation time per frame, grouped by the primary input device of users.
Finally, they evaluated how the input device users choose affects the efficiency of annotation. The largest group of users (44%) primarily used the keyboard, 17% primarily used the mouse, and the remaining 39% could not be clearly classified and used both.
Conclusion and Future Work
They presented a new web-based system for collaborative image and video annotation, the IBM Efficient Video Annotation (EVA) system:
– Provides an efficient user interface and powerful back-end features such as annotation statistics and user-level workload distribution
– Initial evaluation through analysis of annotation time and quality
– High inter-annotator agreement
– Data security, as it allows scheduled backups of all data on the server side
Future work:
– Conduct an in-depth quantitative evaluation
– Use the system as a research platform to study human-computer interaction
– Add more features, such as machine learning techniques and improved browsing and filtering methods
Thank You!