Shin’ichi Satoh National Institute of Informatics
Nowadays, abundant multimedia information is available: Web, broadband networks, CATV, satellite... digital cameras, mobile phones, ...
YouTube: 35 hours of video uploaded every minute
Flickr: 5 billion photos
Facebook: 3 billion photos per month
How can we utilize such huge amounts of multimedia? Search could be one promising option. Any technical problems? It seems that multimedia search is already available: Google, Yahoo!, Bing image search, Flickr, YouTube, etc...
Multimedia search is possible only via text search technology. This problem is especially prominent for visual media (audio can be converted into text via ASR).
But a major part of multimedia data has no text data. We checked a number of photos on Flickr and found that around 85% of photos have no tags or description. As long as we rely on text-search-based technologies, such large amounts of multimedia are not accessible at all!
Moreover, text-based multimedia search is NOT perfect: when searching for images of "people playing drums," some results are good but some are very strange.
Multimedia semantic content analysis is required. However, it is difficult:
◦ Multimedia is difficult for computers to handle
◦ Inherently difficult due to the “Semantic Gap”
Query: Lion
Multimedia data is huge
◦ text: 1 kb/s (10 words), audio: 100 kb/s (MP3), video: 10 Mb/s (MPEG-2)
◦ computers since the 1940s (ENIAC, 1946); text processing by computers since the 1950s! (Turing test 1950; ELIZA and SHRDLU in the 1960s); Project Gutenberg since 1971
◦ CD-ROM (1985), DVD (1996), larger memory, external storage (hard disk drives)
◦ multimedia data (audio/image/video) became manageable only after the 1990s
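The scale gap between the media can be made concrete with a quick back-of-the-envelope calculation using the rates quoted above (treated uniformly as bytes per second for simplicity; the exact units on the slide are approximate):

```python
# Approximate data rates quoted above, treated as bytes/sec for simplicity.
rates = {"text": 1_000, "audio (MP3)": 100_000, "video (MPEG-2)": 10_000_000}

def gigabytes_per_hour(rate_bytes_per_sec):
    """Volume of one hour of data at the given rate, in GB."""
    return rate_bytes_per_sec * 3600 / 1e9

for medium, rate in rates.items():
    print(f"{medium}: {gigabytes_per_hour(rate):.4f} GB/hour")
# One hour of video is roughly 10,000x the size of one hour of text.
```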
Please guess what this is. Water Lilies, Monet
Please guess what this is.
Computers are good at handling text, but not so good at handling multimedia. Text is an artificial medium, symbolic by nature; multimedia is ambiguous, cognition-dependent, a natural medium, not symbolized, etc... Humans can easily “see” or perceive, but we cannot explain how we “see.” The quick brown fox jumps over the lazy dog
1980s: Landsat images, medical images, stock photos. Search used relational DBs, only via statistics and text. The issue was how to handle “huge” amounts of image data; less attention was paid to content analysis.
CBIR: image retrieval based on “content.” T. Kato, TRADEMARK & ART MUSEUM (1989); IBM QBIC (1990s). Take an image as a query and return “similar” images, using “features” such as color histograms, edges, shapes, etc. It worked for images without metadata, assuming that images similar in the feature space are semantically similar as well. But this is not always true.
Feature space Semantic Gap
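As a sketch of the CBIR idea above, here is a minimal color-histogram feature with histogram-intersection similarity (both standard early-CBIR choices; the 8-bin-per-channel quantization and the toy random "images" are arbitrary assumptions for illustration):

```python
import numpy as np

def color_histogram(image, bins=8):
    """Quantize each RGB channel into `bins` levels and count pixels
    in the resulting bins x bins x bins color cube."""
    quantized = (image // (256 // bins)).reshape(-1, 3).astype(np.int64)
    idx = quantized[:, 0] * bins * bins + quantized[:, 1] * bins + quantized[:, 2]
    hist = np.bincount(idx, minlength=bins ** 3).astype(float)
    return hist / hist.sum()  # normalize so histograms of different-size images compare

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical color distributions."""
    return np.minimum(h1, h2).sum()

# Toy query-by-example: rank random "images" against a query image.
rng = np.random.default_rng(0)
query = rng.integers(0, 256, (32, 32, 3), dtype=np.uint8)
database = [rng.integers(0, 256, (32, 32, 3), dtype=np.uint8) for _ in range(3)]
scores = [histogram_intersection(color_histogram(query), color_histogram(img))
          for img in database]
```

Note that this similarity lives entirely in feature space: two semantically unrelated images with similar color statistics score high, which is exactly the semantic gap the slide depicts.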
Let’s take a look at face detection as an example... Face detection is now a very stable technology. Before the 1990s, face detection was very unstable
◦ Shapes of facial features and their geometric relations were hard-coded
After the late 1990s, face detectors using machine learning achieved very stable performance
◦ Simply provide a lot of face image examples (a few thousand) to the system and let it learn
Early face detection method / Machine learning
Following the success of machine-learning-based approaches in face detection, OCR, ASR, etc., researchers decided to “train” computers for media semantic content analysis:
◦ build a corpus: tens, hundreds, or thousands of images/video shots per concept, with manual annotation
◦ extract features (low-level, though recently “local” features are known to be more effective)
◦ train computers to automatically map low-level features to semantic categories using machine learning
Several corpora are available
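The training pipeline above (features in, concept scores out) can be illustrated with synthetic feature vectors and logistic regression trained by gradient descent. The two concepts, their feature distributions, and all parameters are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: each "image" is a feature vector (e.g., a color histogram);
# the two concepts occupy different regions of feature space.
def sample(concept, n=200, dim=16):
    center = np.zeros(dim)
    if concept == "lion":
        center[:8] = 1.0
    else:  # "sky"
        center[8:] = 1.0
    return center + 0.3 * rng.standard_normal((n, dim))

X = np.vstack([sample("lion"), sample("sky")])
y = np.concatenate([np.ones(200), np.zeros(200)])  # 1 = "lion"

# Logistic regression via gradient descent: the learned mapping from
# low-level features to a semantic concept score.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 1.0 * (X.T @ (p - y)) / len(y)
    b -= 1.0 * (p - y).mean()

def concept_score(feature):
    """Estimated probability that the image depicts 'lion'."""
    return 1.0 / (1.0 + np.exp(-(feature @ w + b)))
```

The whole difficulty discussed in the following slides is hidden in the training set: the classifier is only as good as the annotated corpus it learns from.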
Caltech 101 (2003), Caltech 256 (2007): 101/256 concepts. Define the set of concepts first, then collect images (via an image search engine). Manual selection, so clean annotation; up to a few hundred images per concept. Standard benchmark datasets. “Small world effect” anticipated; questionable selection of concepts.
airplane, chair, elephant, faces, leopards, rhino bonsai, brain, scorpion, trilobite, yin_yang...
Large number of concepts, large number of images
#concepts: 10,000+
#images: 10,000,000+
concepts are systematically selected from WordNet (computer-readable thesaurus)
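To illustrate "systematic selection from WordNet," here is a toy hand-coded fragment of a hypernym (is-a) hierarchy; the real hierarchy comes from WordNet itself (e.g., via NLTK's WordNet interface), and these particular entries are only illustrative:

```python
# Hand-coded fragment of a WordNet-style hypernym hierarchy:
# each concept maps to its direct hypernym (is-a parent).
hypernyms = {
    "dalmatian": "dog", "beagle": "dog",
    "dog": "canine", "wolf": "canine",
    "canine": "mammal", "mammal": "animal",
}

def ancestors(concept):
    """Walk up the hierarchy, collecting all hypernyms of a concept."""
    chain = []
    while concept in hypernyms:
        concept = hypernyms[concept]
        chain.append(concept)
    return chain
```

Selecting concepts from such a hierarchy (rather than ad hoc, as in Caltech 101/256) gives systematic coverage and known semantic relations between the categories.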
Manual annotation by Amazon Mechanical Turk
◦ Hard to control quality
◦ Scalability issue
Currently researchers are focusing on the issue of how to effectively learn semantic concepts from a GIVEN training media corpus. Corpus: the larger, the better. But how can we obtain a large corpus?
◦ CGM (Flickr, Web): noisy
◦ Manual annotation (AMT): costly, less scalable
◦ Other approaches, such as the ESP game, could be interesting
Milestones by medium:
Text: Project Gutenberg, bag-of-words, TF/IDF, WSJ, TREC, PageRank
Audio/Speech: MFCC, Viterbi/HMM, 1000-word LVCSR, IBM ViaVoice
Image: USPS OCR (single digit), CMU-MIT Face DB, V-J Face Det., Caltech101, Pascal VOC, ImageNet
Video: TRECVID
Multimedia content analysis research has “just started.” More advanced results are yet to come. Business value? Killer applications?