1
Audio Fingerprinting
Wes Hatch, MUMT-614, March 13, 2003
2
What is Audio Fingerprinting?
A small, unknown segment of audio (possibly as short as a couple of seconds) is used to identify the original audio file from which it came.
- The segment is converted into a low-dimensional trace (a vector). This trace is then compared against a large set of stored, pre-computed traces.
- Equivalently: a feature-extraction process extracts certain spectral features from the signal. These are summarized to generate a condensed representation of the essence of each audio item, a so-called fingerprint, which is stored in a fingerprint database.
(A minimal pipeline sketch follows.)
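The two stages above can be made concrete in a few lines. This is only a conceptual sketch under invented assumptions: the coarse spectral-band extract_fingerprint and the nearest-neighbour identify function are illustrations, not any published scheme.

```python
import numpy as np

def extract_fingerprint(audio, n_bands=16):
    """Toy feature extractor: summarize the magnitude spectrum into a few bands."""
    spectrum = np.abs(np.fft.rfft(audio))
    bands = np.array_split(spectrum, n_bands)
    fp = np.array([b.mean() for b in bands])
    return fp / (np.linalg.norm(fp) + 1e-12)   # normalize so excerpt length matters less

def identify(query_fp, database):
    """Return the name of the stored fingerprint closest to the query."""
    names = list(database)
    dists = [np.linalg.norm(query_fp - database[name]) for name in names]
    return names[int(np.argmin(dists))]

# Pre-compute fingerprints for known items, then match a short unknown excerpt.
sr = 8000
t = np.arange(4 * sr) / sr
songs = {f"song_{f}Hz": np.sin(2 * np.pi * f * t) for f in (220, 440, 880)}
db = {name: extract_fingerprint(x) for name, x in songs.items()}
excerpt = songs["song_440Hz"][: 2 * sr]            # two-second "unknown" excerpt
print(identify(extract_fingerprint(excerpt), db))  # -> song_440Hz
```

A real system replaces the toy extractor with perceptually motivated features and the linear scan with a scalable database search, as described on the later slides.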
3
Applications: Broadcast monitoring, Connected Audio, Other
Fuelled by the digital revolution, an overwhelming amount of audio material has become available to today's consumers; finding desired content efficiently has become a key issue in this context.
- Broadcast monitoring: playlist generation, royalty collection, program verification, ad verification. This is currently a manual process, with people listening and filling out cards.
- Connected Audio: a general term for consumer applications. For example, a song playing over a car radio can be identified from a cell phone at the push of a button, so that the consumer may be directed to purchase it (corporate greed).
- Other uses: Napster's use of fingerprinting systems to prohibit the transmission of copyrighted material.
4
“Benefits” Automated search of illegal content on the Internet
- Examines the real audio information rather than just tag information: today's search restrictions on file names and extensions (such as .mp3) belong to the past, since audio identification systems look at the audio itself. Automated search for illegal content on the Internet via fingerprinting methods is thus an efficient way of securing audio-related intellectual property.
- For the consumer: assuming the fingerprint database contains correct meta-data, fingerprinting can make the meta-data of the songs in a library consistent, allowing easy organization based on, for example, album or artist.
- It can guarantee that what is downloaded is actually what it says it is.
- Automatic audio identification will allow consumers to record signatures of sound and music on small handheld devices.
5
Two principal components
1. Compute the fingerprint.
2. Compare it to a database of previously computed fingerprints.
A text analogy: identify the book from the excerpt "…in a box. I will not eat them with a fox. I…" (a toy text-matching sketch follows).
- In stream audio fingerprinting, a fixed-length segment of the incoming audio stream is converted into a low-dimensional trace (a vector). This input trace is then compared against a large set of stored, pre-computed traces, each of which was previously extracted from a particular audio segment (for example, a song). Input traces are computed at repeated intervals and compared with the database. The pre-computed traces are called 'fingerprints', since they are used to uniquely identify the audio segment.
- The feature-extraction process extracts certain spectral features from the signal based on psychoacoustic considerations. These are summarized to generate a condensed representation of the essence of each audio item, a so-called fingerprint, which is stored in a fingerprint database.
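The text analogy can be made concrete. The stored "books" below are invented for illustration, and exact substring search stands in for the fingerprint comparison, which in the audio case must also tolerate noise and distortion.

```python
# Toy version of the text analogy: find which stored text a short excerpt came from.
library = {
    "Green Eggs and Ham": "... I do not like them in a box. I will not eat them "
                          "with a fox. I do not like them here or there ...",
    "Some Other Book":    "It was a dark and stormy night ...",
}

def identify_excerpt(excerpt, library):
    """Return the titles whose text contains the excerpt (exact match, unlike audio)."""
    return [title for title, text in library.items() if excerpt in text]

print(identify_excerpt("in a box. I will not eat them with a fox.", library))
# -> ['Green Eggs and Ham']
```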
6
Details to worry about Robustness (to noise, distortion) Reliability
Fingerprint size (reduced dimensionality), granularity, search speed and scalability, computational efficiency. The resulting features must be informative about the audio content. Semantic or non-semantic features? Hash table or vector representation?
- Robustness: can an audio clip still be identified after severe signal degradation? To achieve high robustness, the fingerprint should be based on perceptual features that are invariant (at least to a certain degree) with respect to signal degradations; preferably, severely degraded audio still leads to very similar fingerprints. The false negative rate is generally used to express robustness: a false negative occurs when the fingerprints of perceptually similar audio clips are too different to lead to a positive match.
- Reliability: how often is a song incorrectly identified (e.g. "Rolling Stones – Angie" identified as "Beatles – Yesterday")? The rate at which this occurs is usually referred to as the false positive rate.
- Fingerprint size: how much storage is needed for a fingerprint? To enable fast searching, fingerprints are usually stored in RAM, so the fingerprint size, usually expressed in bits per second or bits per song, largely determines the memory resources needed for a fingerprint database server. (A back-of-the-envelope calculation follows this list.)
- Granularity: how many seconds of audio are needed to identify an audio clip? This can depend on the application: some applications can use the whole song for identification, while others prefer to identify a song from only a short excerpt.
- Search speed and scalability: how long does it take to find a fingerprint in a fingerprint database, and what if the database contains hundreds of thousands of songs? For commercial deployment of audio fingerprint systems, these are key parameters: search speed should be on the order of milliseconds for a database containing over 100,000 songs, using only limited computing resources (e.g. a few high-end PCs).
What kind of features are most useful (i.e. salient) or suitable?
- Semantic (bpm, mood, genre): easier for people to "understand", but have ambiguous meanings, may change over time, are not universally applicable, and are more difficult to compute.
- Non-semantic: mathematical in nature, e.g. AudioFlatness from MPEG-7.
How to represent the fingerprint? Hash functions, or vectors of real numbers.
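As a rough illustration of the size question, here is a back-of-the-envelope calculation using the parameters quoted later in this deck for the Haitsma and Kalker scheme (one 32-bit sub-fingerprint every 11.6 ms) and the 10,000-song database mentioned on the search slides.

```python
# Rough fingerprint-database size estimate (parameters taken from later slides in this deck).
bits_per_subfp = 32          # one 32-bit sub-fingerprint ...
subfp_interval_s = 0.0116    # ... every 11.6 ms
song_length_s = 5 * 60       # 5-minute songs
n_songs = 10_000

bits_per_second = bits_per_subfp / subfp_interval_s      # ~2.8 kbit per second of audio
bits_per_song = bits_per_second * song_length_s
total_bytes = bits_per_song * n_songs / 8

print(f"{bits_per_second:.0f} bits/s, "
      f"{bits_per_song / 8 / 1024:.0f} KiB per song, "
      f"{total_bytes / 2**20:.0f} MiB for {n_songs} songs")   # small enough to hold in RAM
```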
7
Computing the fingerprint
Compare to hash functions? A computed hash value is compared with values stored in a database.
Drawback: we need to worry about perceptual similarity, not mathematical similarity.
- PCM audio vs. MP3: both sound alike, but mathematically (i.e. in their exact spectral content) they are quite different.
- Perceptual similarity is not transitive.
- It is therefore not possible to design a system that computes identical mathematical fingerprints for all perceptually similar objects.
So how do we construct a fingerprint for perceptually similar objects? Choose a threshold T such that, with very high probability, ||F(X) - F(Y)|| ≤ T if objects X and Y are similar, and ||F(X) - F(Y)|| > T when they are dissimilar. (A sketch of this threshold test follows.)
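A minimal sketch of this threshold test, assuming binary fingerprint blocks compared by bit error rate (Hamming distance divided by length) as the distance ||F(X) - F(Y)||; the threshold value and error rates here are placeholders, not figures from any published system.

```python
import numpy as np

def bit_error_rate(fp_x, fp_y):
    """Fraction of differing bits between two equal-length binary fingerprints."""
    return float(np.mean(fp_x != fp_y))

def perceptually_match(fp_x, fp_y, threshold=0.35):
    """Declare a match when the distance stays below the threshold T."""
    return bit_error_rate(fp_x, fp_y) <= threshold

rng = np.random.default_rng(0)
original = rng.integers(0, 2, size=256 * 32, dtype=np.uint8)    # fingerprint block of a clip
degraded = original.copy()
degraded[rng.random(original.size) < 0.1] ^= 1                  # ~10% bit errors (e.g. MP3 coding)
unrelated = rng.integers(0, 2, size=256 * 32, dtype=np.uint8)   # fingerprint of a different clip

print(perceptually_match(original, degraded))    # True  (BER ~ 0.10 <= T)
print(perceptually_match(original, unrelated))   # False (BER ~ 0.50 >  T)
```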
8
Techniques (general)
Any 'x' seconds of audio may be used to compute the fingerprint.
The audio is separated into frames, and features are computed for each frame:
- Fourier coefficients
- MFCC, LPC
- Spectral flatness
- Sharpness
- Also derivatives, means, and variances of these features
The "features [are] mapped into a more compact representation by using … HMM, or quantization"; the results are called sub-fingerprints. A stream of audio data is thus converted into a stream of sub-fingerprints. One sub-fingerprint is not enough for identification; several are usually required. (A frame-based feature-extraction sketch follows.)
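A hedged sketch of this general frame-based approach, computing one of the listed features (spectral flatness) per frame; the frame length, hop size, and choice of a single feature are illustrative assumptions rather than a specific published scheme.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def spectral_flatness(frame):
    """Geometric mean over arithmetic mean of the power spectrum (one of the listed features)."""
    power = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2 + 1e-12
    return np.exp(np.mean(np.log(power))) / np.mean(power)

sr = 8000
t = np.arange(2 * sr) / sr
tone = np.sin(2 * np.pi * 440 * t)                         # tonal -> flatness near 0
noise = np.random.default_rng(0).standard_normal(2 * sr)   # noise-like -> much higher flatness

for name, sig in [("tone", tone), ("noise", noise)]:
    frames = frame_signal(sig, frame_len=1024, hop=256)
    features = np.array([spectral_flatness(f) for f in frames])   # one value per frame
    print(name, round(float(features.mean()), 3))
```

Quantizing such per-frame features (or their derivatives) into a compact code is what produces the sub-fingerprints mentioned above.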
9
Techniques (Haitsma, Kalker)
- One 32-bit sub-fingerprint is extracted every 11.6 ms. A fingerprint block consists of 256 subsequent sub-fingerprints, corresponding to a granularity of only 3 seconds.
- The audio signal is first segmented into overlapping frames of 0.37 s, weighted by a Hanning window, with an overlap factor of 31/32; this yields one sub-fingerprint every 11.6 ms. Because of the large overlap, subsequent sub-fingerprints are very similar and vary slowly in time.
- Worst-case scenario: the frame boundaries used during identification are 5.8 ms (11.6 ms / 2) off with respect to the boundaries used in the database of pre-computed fingerprints. The large overlap ensures that even in this worst case the sub-fingerprints of the clip to be identified are still very similar to those of the same clip in the database, and the algorithm has proven robust enough to withstand this sort of misalignment. (A framing sketch with these parameters follows.)
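A small sketch of just the framing stage with the parameters from this slide (0.37 s Hanning-windowed frames, 31/32 overlap, hence a hop of roughly 11.6 ms); the sampling rate is an assumption, and the 32 bits per sub-fingerprint are not computed here.

```python
import numpy as np

sr = 5000                      # assumed sampling rate (not stated on the slide)
frame_len = int(0.37 * sr)     # 0.37 s frames
hop = frame_len // 32          # 31/32 overlap -> hop of roughly 11.6 ms
window = np.hanning(frame_len)

def overlapping_frames(x):
    """Overlapping Hanning-windowed frames, one per ~11.6 ms, as described on the slide."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([window * x[i * hop : i * hop + frame_len] for i in range(n_frames)])

# Enough audio for one fingerprint block of 256 sub-fingerprints (~3 s).
audio = np.random.default_rng(0).standard_normal(frame_len + 255 * hop)
frames = overlapping_frames(audio)
print(frames.shape)                       # (256, frame_len): one frame per sub-fingerprint
print(f"hop = {1000 * hop / sr:.1f} ms")  # close to the 11.6 ms quoted above
```

Because consecutive frames share 31/32 of their samples, consecutive sub-fingerprints change slowly, which is what makes the 5.8 ms worst-case misalignment tolerable.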
10
Techniques (Haitsma, Kalker)
Data from each frame is sent through a filterbank: 33 filters, logarithmically spaced (to correspond roughly to the Bark scale) between 300 and 2000 Hz.
- Phase is neglected, for perceptual reasons: the most important features reside in the spectral domain, and features are extracted based on how they are perceived rather than on the exact mathematical content. In effect, the energy in the spectral domain is measured in discrete quantities (bands).
- A threshold T must be computed such that the number of bit errors between the fingerprints of two versions of the same audio signal stays below it.
(A band-energy sketch follows.)
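A hedged sketch of the band-energy step: 33 logarithmically spaced bands between 300 and 2000 Hz computed from the magnitude spectrum of each windowed frame, so phase is discarded. The sampling rate and FFT details are assumptions; the slide does not spell out how the 32 bits are then derived from the 33 band energies (in the published Haitsma and Kalker scheme they come from the signs of energy differences across adjacent bands and frames).

```python
import numpy as np

sr = 5000                                          # assumed sampling rate
n_bands = 33
edges = np.geomspace(300.0, 2000.0, n_bands + 1)   # logarithmically spaced band edges (Hz)

def band_energies(frame):
    """Energy in 33 log-spaced bands; taking magnitudes discards phase."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return np.array([spectrum[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(edges[:-1], edges[1:])])

frame = np.random.default_rng(0).standard_normal(int(0.37 * sr))   # one 0.37 s frame
energies = band_energies(frame)
print(energies.shape)   # (33,) band energies; the 32 fingerprint bits are derived from these
```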
11
System overview
12
Techniques (Burges, Platt)
- The audio is first converted to mono and downsampled.
- The signal is split into fixed-length frames which overlap by half (in the first experiment the frame length is 23.2 ms, while for the other experiments it is 372 ms).
- An MCLT is then applied to each frame; a 128-sample log spectrum is generated by taking the log modulus of each MCLT coefficient.
- In short, the front end converts the signal into a feature vector of 128 values per frame (with 23.2 ms frames and half overlap, that is 128 values every 11.6 ms), which is then passed to the dimensionality-reduction stage described on the next slide. (A sketch of this front end follows.)
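A hedged sketch of this front end. NumPy has no MCLT, so an ordinary FFT magnitude spectrum stands in for it here, and the sampling rate and frame length are chosen only so that each frame yields a 128-value log spectrum; this mimics the description rather than reproducing the published system.

```python
import numpy as np

sr = 11025               # assumed mono, downsampled rate
frame_len = 256          # chosen so the half-spectrum below has 128 bins
hop = frame_len // 2     # frames overlap by half

def log_spectra(x):
    """Per-frame log-modulus spectra (FFT used as a stand-in for the MCLT)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    coeffs = np.fft.rfft(frames * np.hanning(frame_len), axis=1)[:, :128]
    return np.log(np.abs(coeffs) + 1e-12)       # 128 log-spectrum values per frame

audio = np.random.default_rng(0).standard_normal(sr)   # one second of audio
features = log_spectra(audio)
print(features.shape)    # (n_frames, 128): the input to the DDA / OPCA layers
```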
13
Techniques (Burges, Platt)
- Prior knowledge is used to define the parametric form of the feature extractor.
- The features are computed by a "linear, convolutional" neural network.
- The goal is to convert the signal (the per-frame log spectra) into a compact feature vector: a dimensionality-reduction method (i.e. lots of math) that uses oriented Principal Component Analysis to find a set of projections.
- From the paper: "we first use prior knowledge to define the parametric form of the feature extractor. Then we use a new algorithm, called Distortion Discriminant Analysis (DDA), that sets the parameters of the feature extractor. Feature extractors learned with DDA fulfill all four requirements listed above. DDA features are computed by a linear, convolutional neural network, where each layer performs a version of Oriented Principal Components Analysis (OPCA) dimensional reduction." (An OPCA sketch follows.)
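A hedged sketch of a single OPCA layer, under the common reading of oriented PCA as a generalized eigenproblem: find projection directions that maximize signal variance relative to distortion variance. The training data here are synthetic, and this is only an interpretation, not the authors' code.

```python
import numpy as np
from scipy.linalg import eigh

def opca_directions(clean, distorted, n_dirs):
    """Directions maximizing signal variance over distortion variance.

    clean, distorted: (n_samples, n_features) arrays, where each distorted row is a
    degraded version of the corresponding clean row.
    """
    signal = clean - clean.mean(axis=0)
    noise = distorted - clean                    # distortion vectors
    c_signal = signal.T @ signal / len(signal)
    c_noise = noise.T @ noise / len(noise) + 1e-6 * np.eye(clean.shape[1])
    vals, vecs = eigh(c_signal, c_noise)         # generalized eigenproblem: C_s v = lambda C_n v
    return vecs[:, np.argsort(vals)[::-1][:n_dirs]]

rng = np.random.default_rng(0)
clean = rng.standard_normal((1000, 128))                     # stand-in for 128-value log spectra
distorted = clean + 0.1 * rng.standard_normal(clean.shape)   # stand-in for degraded versions
W = opca_directions(clean, distorted, n_dirs=10)             # first layer: 128 -> 10
print(W.shape, (clean @ W).shape)                            # (128, 10) (1000, 10)
```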
14
Techniques (Burges, Platt)
The system uses 3 layers of Oriented PCA (OPCA):
- Layer 1 operates on a frame of 128 log-spectrum values and generates 10 values per frame (every 11.6 ms).
- Layer 2 takes 42 layer-1 outputs (a window of roughly 487.2 ms) and produces 20 values; it is evaluated every 243.6 ms (i.e. the windows overlap).
- Layer 3 takes 40 layer-2 outputs and produces the final 64 values (roughly 11K inputs reduced to 64 outputs).
Multiple layers are aggregated in order to enforce shift invariance and to reduce computation time. Still, misalignment problems are greater with this method (worst case about 125 ms off). (A shape-only sketch of the cascade follows.)
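A shape-only sketch of how the three layers chain together, using random matrices as placeholders for the trained OPCA projections (the previous sketch shows how such projections might be learned). Only the layer sizes and the layer-2 step follow the slide; the layer-3 step is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((128, 10))        # layer 1: one 128-value frame -> 10 values
W2 = rng.standard_normal((42 * 10, 20))    # layer 2: 42 layer-1 outputs  -> 20 values
W3 = rng.standard_normal((40 * 20, 64))    # layer 3: 40 layer-2 outputs  -> 64 values

def windows(x, size, step):
    """Stack sliding windows of `size` rows taken every `step` rows, flattened."""
    return np.stack([x[i : i + size].ravel()
                     for i in range(0, len(x) - size + 1, step)])

log_spectra = rng.standard_normal((2000, 128))       # stand-in front-end output, one per 11.6 ms
layer1 = log_spectra @ W1                            # (2000, 10)
layer2 = windows(layer1, size=42, step=21) @ W2      # evaluated every 21 frames (~243.6 ms)
layer3 = windows(layer2, size=40, step=20) @ W3      # final 64-value fingerprint vectors
print(layer1.shape, layer2.shape, layer3.shape)
```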
15
Searching the Database
- Look for the most similar (not necessarily identical) fingerprint.
- 10,000 5-minute songs correspond to roughly 250 million sub-fingerprints.
- A brute-force search computes the bit error rate at every possible position in the database and takes in excess of 20 minutes on a very fast PC. (A toy version of this brute-force search follows.)
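A hedged toy version of that brute-force baseline: slide a 256-sub-fingerprint query block over every position in a database of 32-bit sub-fingerprints and compute the bit error rate at each offset. The database here is shrunk drastically so the example runs in about a second; at roughly 250 million positions this is the search that takes over 20 minutes.

```python
import numpy as np

BLOCK = 256   # sub-fingerprints per query block (3 seconds of audio)

def bit_errors(a, b):
    """Number of differing bits between equal-length arrays of 32-bit sub-fingerprints."""
    return int(np.unpackbits(np.bitwise_xor(a, b).view(np.uint8)).sum())

def brute_force_search(query, database):
    """Compute the bit error rate at every possible position; return the best one."""
    total_bits = 32 * BLOCK
    best_pos, best_ber = -1, 1.0
    for pos in range(len(database) - BLOCK + 1):
        ber = bit_errors(query, database[pos : pos + BLOCK]) / total_bits
        if ber < best_ber:
            best_pos, best_ber = pos, ber
    return best_pos, best_ber

# Tiny stand-in database (a real one would hold ~250 million sub-fingerprints).
rng = np.random.default_rng(0)
database = rng.integers(0, np.iinfo(np.uint32).max, size=20_000,
                        dtype=np.uint32, endpoint=True)
query = database[7_000 : 7_000 + BLOCK].copy()
query[::8] ^= np.uint32(1)                    # mild degradation: flip one bit in every 8th entry
print(brute_force_search(query, database))    # -> (7000, ~0.004)
```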
16
Searching the Database
- Make the assumption that at least 1 of the 256 sub-fingerprints in a block is error-free.
- Then a hash table can be used (as opposed to a more memory-intensive look-up table): about 800,000 times faster.
- If no sub-fingerprint is error-free (chances are there is one, though), the search can also check sub-fingerprints that differ by one bit; this increases search time only by a factor of about 33, which is acceptable.
- Furthermore, a "reliability estimate" can be computed for each bit, so that the algorithm does not need to check every case where each bit is, in turn, reversed.
- KD-trees?
(A hash-table lookup sketch follows.)
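A hedged sketch of that look-up idea under the stated assumption (at least one sub-fingerprint in the block is error-free): an index maps each 32-bit sub-fingerprint value to the positions where it occurs, and every hit triggers only one full block comparison. The data structures and threshold are illustrative, not the published implementation.

```python
import numpy as np
from collections import defaultdict

BLOCK = 256   # sub-fingerprints per query block

def build_index(database):
    """Map each 32-bit sub-fingerprint value to the positions where it occurs."""
    index = defaultdict(list)
    for pos, subfp in enumerate(database):
        index[int(subfp)].append(pos)
    return index

def lookup(query, database, index, max_ber=0.35):
    """Use any (assumed error-free) sub-fingerprint as a hash key, then verify the whole block."""
    for offset in range(BLOCK):
        for pos in index.get(int(query[offset]), []):
            start = pos - offset
            if 0 <= start <= len(database) - BLOCK:
                candidate = database[start : start + BLOCK]
                ber = float(np.unpackbits((query ^ candidate).view(np.uint8)).mean())
                if ber <= max_ber:
                    return start, ber
    return None

rng = np.random.default_rng(1)
database = rng.integers(0, np.iinfo(np.uint32).max, size=20_000,
                        dtype=np.uint32, endpoint=True)
index = build_index(database)
query = database[7_000 : 7_000 + BLOCK].copy()
query[1::2] ^= np.uint32(0b1011)       # degrade half the entries; the other half stay error-free
print(lookup(query, database, index))  # -> (7000, ~0.05)
```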
17
Results
- False-positive rate of 3.6×10⁻² (Haitsma, Kalker).
- On tests with a large (500,000) set of input traces, the system has a "low" false-positive and false-negative rate (Burges, Platt).
- Time compression and expansion were not tested.
- The system can withstand distortions occurring from transmission over mobile phones.