Enabling Access to Sound Archives through Integration, Enrichment and Retrieval WP3 – Retrieval systems
12 Month Review Meeting Project #
Introduction to Workpackage: overview
Objectives:
- To provide retrieval systems that can search by various musical similarity measures
- To search for spoken words or phrases
- To search across different media for associated content
- Queries may be text-based, spoken or audio examples
Tasks:
- T3.1: Music retrieval
- T3.2: Speech retrieval
- T3.3: Cross-media retrieval
- T3.4: Vocal query interface
Introduction to Workpackage: participants & schedule
Participants and their contributions:
- QMUL: 22 mm, music & cross-media retrieval
- DIT: 5 mm, music retrieval
- ALL: 24 mm, speech retrieval & vocal queries
- LFUI: 3 mm, integration of retrieval engines
- NICE: 12 mm, retrieval for attributes of speech
Schedule:
- T3.1 Music retrieval: month 9 to month 20
- T3.2 Speech retrieval: month 8 to month 20
- T3.3 Cross-media retrieval: month 1 to month 6, and month 21 to month 26
- T3.4 Vocal query interface: month 7 to month 20
Introduction to Workpackage: Task, Music retrieval
Searching and organizing music collections:
- Adequate representation for the audio in the query
- Textual and keyword: e.g. author, title, date, genre, etc.
- Automatic feature extraction
  - Low-level: acoustic similarity measures
  - Mid-level features: characterize the rhythmic structure
  - High-level features: musically relevant parameters; visualisation of key events along the assets' timeline
Introduction to Workpackage: Task, Speech retrieval
Retrieving the content of speech corpora in English and Hungarian
- Building test corpora
- Levels of recognition
  - Phoneme level recognition
  - Pronunciation dictionary filter or morphological analysis
  - Text corpus based language model
- Phoneme and word level indexing
- Fast retrieval
- Improving performance
Introduction to Workpackage: Task, Cross-media retrieval
Searching media in various formats (audio recordings, video recordings, notated scores, images)
- Using metadata
- Feature extraction
- Similarity measures
- Optimised multidimensional search methods
- Video analysis
The user enters a piece of media as a query and may retrieve an entirely different type of media
Introduction to Workpackage: Task, Vocal query interface
Voice-initiated media retrieval (without natural language processing)
- Recording of the query
- Phoneme level recognition
- Pronunciation dictionary based identification of word(s)
- Speaker adaptation
Deliverables
- D3.1: Report outlining retrieval system functionality and specification (Month 6)
- D3.2: Prototype on speech and music retrieval systems with vocal query interface (Month 20)
- D3.3: Prototype on cross-media retrieval system (Month 20)
Deliverable D3.1: Report outlining retrieval system functionality and specification
Topics described:
- User requirements
- Relations to other work packages
- Music retrieval
- Speech retrieval
- Indexing and speaker retrieval
- Cross-media retrieval
- Vocal query interface
- Retrieval system integration and knowledge management: role of ontology
- Example user interfaces
Contributors: ALL, DIT, NICE, QMUL, RSAMD
Milestones
- M3.1: Initial vocal query system tested, initial speech and music retrieval algorithms developed (Month 8)
- M3.2: Vocal query is fully functional, speech and music retrieval implemented, cross-media retrieval method finalized (Month 14)
- M3.3: Vocal query finished, speech and music retrieval systems established, basic cross-media retrieval implemented (Month 20)
- M3.4: Cross-media retrieval fully functional, further work is only refinement and optimization (Month 26)
Milestones: M3.1, Speech retrieval & Vocal query
Vocal query system:
- Ready for demonstration in Hungarian and in English
- Phoneme level recognition implemented and tested; performance improvement in progress
- Hungarian triphone management is under development
- English pronunciation dictionary embedded, with multiple pronunciation variants
- Morphological analyzer implemented for Hungarian pronunciation
- The Hungarian version performs better than the English one; the reason is under investigation
Speech retrieval:
- See above
- Language model established for the Hungarian version; for English we are still looking for a good text corpus
- Word level recognition implemented and under testing; performance depends on the content/domain of the speech
- Phoneme-based search is finished; word-based search is being implemented
Milestones: M3.1, Music retrieval
Music retrieval:
- Extractors for tempo, key changes and mode detection have been implemented as VAMP plugins.
- SoundBite similarity retrieval is fully functional and available as a Mac OS application.
- A segmenter based on SoundBite has been implemented as a VAMP plugin.
- A framework for the automatic extraction of audio features has been built. It uses VAMP plugins and outputs descriptors directly in RDF format, allowing easy integration with the ontology.
Workpackage Progress
- Parallel development of the separate retrieval engines is progressing well
- Results are in line with the schedule of the technical annex
- We aim to demonstrate our results with running software modules
- Scientific research introduces risk when the target is improved performance
- Integrating the different retrieval engines into a common architecture and user interface will be challenging
- Utilization of the power of the ontology-centric approach
Workpackage Progress: Music retrieval - 1
Music retrieval system
- Searches assets according to their relevance to music-related queries, using various metadata and automatically extracted features to produce a ranked list of audio files.
Search methods:
- Textual and keyword: e.g. author, title, date, genre, etc.
- Similarity based on automatically extracted low-level features
- Music-related parameters using automatically extracted descriptors: instrument, orchestration, tempo, key, etc.
Workpackage Progress: Music retrieval - 2, Music Analysis module
[Diagram of the Music Analysis module in the archive application (musical audio). Labels: Input Audio File (PCM); Manual Entry Tags/Data; De-Noising / Restoration; Source Separation; Mid-Level Feature Extractors; High-Level Feature Extractors; Compression; Reliability Metric; To PCM & compressed audio assets repository; To Metadata Repository; High-level features (parametric search); Mid-level descriptors (similarity search); Optimal source separation & denoising parameters; Manual Tags & Manual High Level Features.]
Workpackage Progress: Music retrieval - 3, Music Analysis module
Musically relevant descriptors are automatically extracted by a module running a series of "VAMP" plugins. Descriptors returned by the plugins can be classified as:
- Mid-level features, used by the retrieval system to search for similar audio assets (e.g. timbre profile).
- High-level features, which enable the user to search for audio assets using musically relevant parameters.
High-level descriptors are also used for the visualisation of key events along the assets' timeline (e.g. position of beats, bars, key changes and instruments), providing a considerable aid in the analysis of a piece of music.
Workpackage Progress: Music retrieval - 4
Similarity Search
- Based on the SoundBite algorithm, which allows simultaneous segmentation, thumbnailing and modelling of an audio asset
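As a rough illustration of similarity-based ranking (this is not the SoundBite algorithm, whose internals are not described on these slides), the sketch below ranks assets by cosine similarity between fixed-length feature vectors. The feature vectors, asset names and the choice of cosine similarity are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, library):
    """Return (asset name, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical 3-dimensional "timbre profiles"; real descriptors would be longer.
library = {
    "asset_a": [0.9, 0.1, 0.0],
    "asset_b": [0.1, 0.8, 0.1],
    "asset_c": [0.7, 0.3, 0.0],
}
ranking = rank_by_similarity([1.0, 0.0, 0.0], library)
```

The ranked list has the same shape as the output described for the music retrieval system: assets ordered by relevance to the query.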
Workpackage Progress: Speech retrieval - 1
Research and testing are carried out on well-prepared corpora
- Hours of text with corresponding speech are needed
- Quality is very important
  - Low noise
  - Mixed sound sources omitted
  - Accent sensitive
  - Field interviews, phone conversations
  - Quality of the recording
- Silence-to-silence segmentation, automated and manual
- Half of the corpus for training
- Half of the corpus for testing
Hungarian corpus
- Hungarian radio station 'Kossuth', broadcast quality
- More than 20 hours, segmentation enhanced manually
English corpus
- TIMIT for research purposes
- US Supreme Court recordings
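Automated silence-to-silence segmentation can be sketched with a simple energy threshold. This is only one possible approach and is not taken from the project: the frame length, the threshold and the use of mean absolute amplitude are assumptions chosen to keep the example small.

```python
def frame_energies(samples, frame_len):
    """Mean absolute amplitude of each complete, non-overlapping frame."""
    return [
        sum(abs(s) for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def segment_by_silence(samples, frame_len=4, threshold=0.1):
    """Return (start, end) sample indices of the stretches between silences."""
    segments = []
    start = None
    energies = frame_energies(samples, frame_len)
    for idx, energy in enumerate(energies):
        if energy >= threshold and start is None:
            start = idx * frame_len                      # speech begins
        elif energy < threshold and start is not None:
            segments.append((start, idx * frame_len))    # speech ends
            start = None
    if start is not None:                                # signal ends mid-speech
        segments.append((start, len(energies) * frame_len))
    return segments


# Toy signal: silence, two loud frames, silence, one quieter frame, silence.
signal = ([0.0] * 4 + [0.5, -0.4, 0.6, -0.5] * 2 + [0.0] * 4
          + [0.3, -0.3, 0.2, -0.2] + [0.0] * 4)
segments = segment_by_silence(signal)
```

Segments produced this way would still need the manual checking and correction mentioned above.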
Workpackage Progress: Speech retrieval - 2
Preparation for speech recognition
- Training on segmented corpora
- Importance of the same accent and the same speech content domain
Layers of speech recognition
- Phoneme level recognition -> acoustic score
- Pronunciation dictionary / morphological analysis based recognition -> word exists or not
- Language model -> final probability
Mixed phoneme- and word-based indexing, keeping probabilities
Index-based fast retrieval with score value
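One way to picture index-based retrieval with kept probabilities is an inverted index from each recognized term (a word or a phoneme string) to the segments it was recognized in. The data layout, segment identifiers and probabilities below are hypothetical; the project's binary index format is not described here.

```python
from collections import defaultdict


def build_index(recognized_segments):
    """Map each term to a list of (segment id, probability), best first.

    recognized_segments: {segment_id: [(term, probability), ...]}
    """
    index = defaultdict(list)
    for segment_id, hypotheses in recognized_segments.items():
        for term, probability in hypotheses:
            index[term].append((segment_id, probability))
    for postings in index.values():
        postings.sort(key=lambda posting: posting[1], reverse=True)
    return index


def search(index, term):
    """Look up a term; the result is already ordered by score."""
    return index.get(term, [])


# Hypothetical recognizer output mixing word-level and phoneme-level terms.
recognized = {
    "seg_001": [("court", 0.82), ("k ao r t", 0.90)],
    "seg_002": [("court", 0.41), ("caught", 0.47)],
}
index = build_index(recognized)
hits = search(index, "court")
```

Because the sorting is done once at indexing time, a query is a single dictionary lookup.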
Workpackage Progress: Speech retrieval - 3
Phoneme level recognition
- ALL spent many months on research to refine its algorithms
- Successful results:
  - Gaussian mixture models in HMM nodes
  - Triphone and allophone identification and management
  - Speaker clustering
- Our phoneme level recognition exceeded a 60% hit rate
- This solution alone was not good enough to build reliable speech retrieval on
- The workflow:
  - Input: silence-to-silence wave segments
  - Output: probable phoneme sequences with acoustic score
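To make "Gaussian mixture models in HMM nodes" concrete, the sketch below scores a single observation against the mixture attached to each of two states. It is a textbook simplification, not the project's recognizer: the observation is one-dimensional, and the state names and mixture parameters are invented for the example.

```python
import math


def log_gaussian(x, mean, variance):
    """Log density of a 1-D Gaussian at x."""
    return -0.5 * (math.log(2.0 * math.pi * variance) + (x - mean) ** 2 / variance)


def gmm_log_likelihood(x, components):
    """Log likelihood of x under a mixture of (weight, mean, variance) components.

    Uses the log-sum-exp trick so that very small densities do not underflow.
    """
    log_terms = [math.log(w) + log_gaussian(x, m, v) for w, m, v in components]
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


# Hypothetical emission models for two HMM states.
state_models = {
    "state_a": [(0.6, 0.0, 1.0), (0.4, 3.0, 0.5)],
    "state_b": [(1.0, 5.0, 1.0)],
}
observation = 0.2
best_state = max(
    state_models,
    key=lambda name: gmm_log_likelihood(observation, state_models[name]),
)
```

A real acoustic model would use multi-dimensional feature vectors and combine these emission scores with state transition probabilities to obtain the acoustic score of a phoneme sequence.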
Workpackage Progress: Speech retrieval - 4
Dictionary level
- Feasible phoneme sequences are filtered by
  - a pronunciation dictionary in English (custom-based pronunciation)
  - a morphological analyzer in Hungarian (rule-based pronunciation)
- The workflow:
  - Input: phoneme sequences with acoustic score
  - Output: feasible word sequences, keeping the acoustic score
Language level
- Word sequences are ranked by probability on the basis of large text corpora
- The solution performs much better on domain-specific speech (legal, medical domain)
- The workflow:
  - Input: word sequences with acoustic score
  - Output: word sequences with modified acoustic score
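The language level step can be sketched as rescoring: each word sequence keeps its acoustic score and receives an added language model score. The bigram model, its probabilities, the fallback value for unseen pairs and the weighting are all assumptions for illustration; the slides do not say which kind of language model is used.

```python
import math

# Hypothetical bigram log probabilities, as if estimated from a legal text corpus.
BIGRAM_LOG_PROB = {
    ("the", "court"): math.log(0.05),
    ("the", "caught"): math.log(0.0001),
    ("court", "ruled"): math.log(0.02),
    ("caught", "ruled"): math.log(0.00001),
}
UNSEEN_LOG_PROB = math.log(1e-6)  # assumed fallback for word pairs not in the model


def language_model_score(words):
    """Sum of bigram log probabilities over adjacent word pairs."""
    return sum(
        BIGRAM_LOG_PROB.get(pair, UNSEEN_LOG_PROB) for pair in zip(words, words[1:])
    )


def rescore(hypotheses, lm_weight=1.0):
    """Add the weighted language model score to each acoustic score, best first."""
    rescored = [
        (words, acoustic + lm_weight * language_model_score(words))
        for words, acoustic in hypotheses
    ]
    return sorted(rescored, key=lambda hypothesis: hypothesis[1], reverse=True)


# Input: word sequences with acoustic (log) scores from the dictionary level.
hypotheses = [
    (["the", "caught", "ruled"], -11.0),  # slightly better acoustic score
    (["the", "court", "ruled"], -11.5),
]
ranked = rescore(hypotheses)
```

In this toy case the acoustically weaker hypothesis wins after rescoring, which is the effect a domain-specific text corpus is meant to provide.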
Workpackage Progress: Speech retrieval - 5
Integration of speech retrieval into the EASAIER architecture
- Speech retrieval works as a black box
- Relies on binary indexes
- Business logic layer needed
- Temporal RDF triples generated
- User-initiated retrieval performed
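A temporal RDF triple for a retrieval hit could look like the N-Triples lines produced below. The namespace, property names and identifiers are placeholders invented for this sketch; they are not the EASAIER ontology.

```python
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"
EX = "http://example.org/easaier-sketch/"  # hypothetical namespace, not the real ontology


def hit_to_triples(asset_id, hit_no, word, start_s, end_s):
    """Describe one spoken-word hit as N-Triples lines with start and end times.

    The word is assumed to contain no characters that need escaping.
    """
    subject = f"<{EX}asset/{asset_id}/hit/{hit_no}>"
    return [
        f'{subject} <{EX}word> "{word}" .',
        f'{subject} <{EX}startsAt> "{start_s}"^^<{XSD_DECIMAL}> .',
        f'{subject} <{EX}endsAt> "{end_s}"^^<{XSD_DECIMAL}> .',
    ]


triples = hit_to_triples("rec42", 1, "court", 12.5, 13.1)
```

A business logic layer could generate such lines from the binary index results, so that the rest of the architecture only sees RDF.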
Workpackage Progress: Cross media retrieval - 1
Cross-media retrieval
- The CM retrieval engine and its functionalities were specified in Deliverable 3.1
- The video analysis modules necessary for CMR are specified in the internal EASAIER software modules document and consist of:
  - Video Transcoding Module
  - Shot Detection and Key-Frame Extraction Module
  - Low-Level Feature Extraction Module
Workpackage Progress: Cross media retrieval - 2, Video Analysis module
[Diagram of the Video Analysis module. Labels: Input Video File; Compression; Audio Stream Extraction; Video Segmentation and Keyframe Extraction; Keyframe Analysis; Manual Annotation; Audio stream analysis; Keyframes; PCM; Original video file; Streaming video file (e.g. MPEG-4); Multimedia assets repository; KF temporal data; Video segments temporal data; Metadata; Features; Temporal data; Video segments metadata; KF Extracted Features; Metadata repository.]
Workpackage Progress: Cross media retrieval - 3
- For transcoding purposes we chose and tested the ffmpeg software
- A first version of Shot Detection and Key-Frame Extraction was developed at QMUL and is ready for integration
- The MPEG-7 eXperimentation Model (XM) will be integrated for Low-Level Feature Extraction
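Shot detection is often introduced through histogram differences between consecutive frames. The sketch below follows that generic idea and is not the QMUL module: frames are plain lists of 8-bit grey values, and the bin count, threshold and "first frame of each shot" key-frame rule are assumptions.

```python
def histogram(frame, bins=4, max_value=256):
    """Normalised grey-level histogram of a frame given as a flat pixel list."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    return [count / len(frame) for count in counts]


def detect_shot_boundaries(frames, threshold=0.5):
    """Return the indices of frames that start a new shot.

    A boundary is declared when the L1 distance between the histograms of
    two consecutive frames exceeds the threshold.
    """
    boundaries = []
    previous = histogram(frames[0])
    for i in range(1, len(frames)):
        current = histogram(frames[i])
        distance = sum(abs(a - b) for a, b in zip(previous, current))
        if distance > threshold:
            boundaries.append(i)
        previous = current
    return boundaries


# Toy "video": two dark frames, two bright frames, one dark frame.
dark = [10, 20, 30, 40]
bright = [200, 210, 220, 230]
frames = [dark, dark, bright, bright, dark]
cuts = detect_shot_boundaries(frames)
key_frames = [0] + cuts  # take the first frame of every shot as its key frame
```

The selected key frames would then be passed on to low-level feature extraction.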
Workpackage Progress: Vocal query
Technology based on the first two layers of speech recognition (see above)
- Phoneme level recognition
- Pronunciation dictionary or morphological analyzer
Process
- Phoneme level recognition with acoustic score
- Matching to the dictionary
- The solution is the item with the best acoustic score that is also found in the dictionary
Technology will be demonstrated
- Speed has to be tuned
- Performance under evaluation
- Performance difference between English and Hungarian
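The selection rule of the vocal query process (the best-scoring candidate that also exists in the dictionary) can be sketched as follows. The dictionary entries, phoneme symbols and scores are made up for the example, and higher scores are assumed to be better.

```python
# Hypothetical pronunciation dictionary; a word may have several pronunciations.
PRONUNCIATION_DICTIONARY = {
    "tomato": [("t", "ax", "m", "ey", "t", "ow"), ("t", "ax", "m", "aa", "t", "ow")],
    "potato": [("p", "ax", "t", "ey", "t", "ow")],
}


def build_lookup(dictionary):
    """Invert the dictionary: phoneme tuple -> word."""
    return {pron: word for word, prons in dictionary.items() for pron in prons}


def best_dictionary_match(candidates, lookup):
    """Return (word, score) for the best-scoring candidate found in the dictionary.

    candidates: list of (phoneme sequence, acoustic score); returns None if
    no candidate matches any dictionary pronunciation.
    """
    for phonemes, score in sorted(candidates, key=lambda c: c[1], reverse=True):
        word = lookup.get(tuple(phonemes))
        if word is not None:
            return word, score
    return None


lookup = build_lookup(PRONUNCIATION_DICTIONARY)
candidates = [
    (("t", "ax", "m", "aa", "d", "ow"), -3.1),  # best score, but not a dictionary entry
    (("t", "ax", "m", "aa", "t", "ow"), -3.4),
    (("p", "ax", "t", "ey", "t", "ow"), -7.9),
]
result = best_dictionary_match(candidates, lookup)
```

For Hungarian, the dictionary lookup would be replaced by a call to the morphological analyzer.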
Contributions and Connections with Other Workpackages
Upcoming Work Plan Months - Music related
Up to month 20
- Music retrieval: prototype available; the remaining time will go to integration, testing, and continued development of music retrieval based on other similarity measures
- Month 20: deliver D3.2
Between month 20 and 26
- Music retrieval: the only further work is refinement based on user studies (WP7)
- Month 26: deliver D3.3
Upcoming Work Plan Months - Speech related
Before D3.2 (Month 20)
- Phoneme level recognition
  - Speed tuning
  - Language-specific performance improvement
  - Triphone implementation in English
  - Speaker clustering
- Dictionary level
  - More pronunciation variants in English
- Language model
  - Larger text corpora in English and in Hungarian
- Indexing and retrieval
  - Reimplementation for mixed phoneme and word search
After Month 20
- Testing and refinement
- Performance tuning
Upcoming Work Plan Months - Cross media related
- All cross-media retrieval effort will be focused on coding and testing the software.
- Material produced for the working documents (System Architecture, Software Modules, Metadata) specifies the software to be developed and integrated, and how this will be done.
- The required routines have for the most part been developed at a 'proof-of-concept' level (ontological relationships between different media, feature extraction for images and video, key frame display for retrieved video).
- Month 26: M3.4, cross-media retrieval fully functional, further work is only refinement and optimization; D3.3, prototype on cross-media retrieval system.
- Between month 20 and 26, QMUL will work closely with SILOGIC to integrate cross-media retrieval into the EASAIER system, which we intend to use as the prototype for this.
Demonstration: Overview
Retrieval engines and tools are demonstrated separately
Speech related demo
- Segmentation user interface
- Vocal query (isolated word search) in Hungarian
- Vocal query (isolated word search) in English
Music related demo
- SoundBite: timbre-based music similarity embedded in iTunes
Demonstration: Segmentation
User interface for manual segmentation
- Synchronizing silence-to-silence speech segments to the text
- Checking automated silence-to-silence segmentation
- Synchronizing word boundaries to the text
- Synchronizing phoneme boundaries to the text
Demonstration: Segmentation [screenshot]
Demonstration: Vocal query
Searching for a spoken word in a dictionary
- Recording the input
- Phoneme level recognition
- Matching probable phoneme sequences to the content of the pronunciation dictionary
- Display of the most probable solution from the dictionary
Demonstration: Vocal query [screenshot]
Demonstration: Music retrieval [screenshot]