Speech and Audio Coding Heejune AHN Embedded Communications Laboratory Seoul National Univ. of Technology Fall 2013 Last updated 2013. 9. 31.



Heejune AHN: Image and Video Compression p. 2
Audio Coding
- Audio signal classes
  - Speech signal: from 300 Hz up to just under 4 kHz. For telephone service, e.g. VoIP, mobile voice telephony.
  - Wideband speech signal: high-quality voice telephony, e.g. Skype, W-AMR in 3.5G/4G.
  - Wideband audio signal: up to 20 kHz. Entertainment, e.g. CD, MP3, DVD, broadcasting, cinema.
- CD example
  - Sampling frequency: 44.1 kHz, 16 bits/sample, two channels (stereo).
  - Net bit rate = 2 x 16 x 44.1 x 10^3 = 1.41 Mbit/s.
  - Real bit rate = 1.41 x 49/16 Mbit/s = 4.32 Mbit/s, because 49 bits are stored for every 16-bit audio sample (synchronization and error correction).
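
The CD arithmetic above can be checked with a few lines of Python. This is only a sketch of the slide's calculation; the constant and function names are illustrative, and the 49-bits-per-16-bit-sample overhead is taken from the slide as given.

```python
# Values taken from the slide; names are illustrative.
SAMPLE_RATE_HZ = 44_100
BITS_PER_SAMPLE = 16
CHANNELS = 2
STORED_BITS_PER_SAMPLE = 49  # bits on disc per 16-bit audio sample (slide's figure)

def cd_bit_rates():
    """Return (net audio bit rate, real on-disc bit rate) in bit/s."""
    net = CHANNELS * BITS_PER_SAMPLE * SAMPLE_RATE_HZ
    real = net * STORED_BITS_PER_SAMPLE / BITS_PER_SAMPLE
    return net, real

net, real = cd_bit_rates()
print(f"net = {net / 1e6:.2f} Mbit/s, real = {real / 1e6:.2f} Mbit/s")
```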

Audio coding techniques
- Speech coding: the sound generation model is well studied.
  - Waveform coding (time domain): linear PCM, nonlinear PCM, DPCM, ADPCM.
  - Vocoder: uses a speech-specific production model, with analysis at the encoder and synthesis at the decoder. Examples: LP vocoder, RPE (regular pulse excited) vocoder (GSM), QCELP (Qualcomm code-excited linear prediction) (CDMA), AMR (Adaptive Multi-Rate; algebraic code-excited LP; 3G, 4G).
- Audio coding: no sound generation model yet; relies on the human auditory system and psychoacoustic perception.
  - MPEG-1 Layers 1, 2, 3; MPEG-2 AAC; Dolby AC-3.

Speech Coding Signals

Linear PCM
- Linear: uniform quantizer, i.e., quantization levels are evenly spaced.
- SNR = 6 x (number of bits) + C dB.
  - At least 6 x 12 = 72 dB is required for speech to be intelligible.
  - 16-bit samples provide plenty of dynamic range.
- Applications: CD; file formats WAV (MS) and AIFF (Unix and Mac).
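
As a rough illustration of the "6 dB per bit" rule, the sketch below evaluates the textbook SNR of a full-scale sine wave through a uniform quantizer. The constant C depends on the signal; the value used here (about 1.76 dB) assumes a full-scale sine and is not taken from the slide. The function name is illustrative.

```python
import math

def uniform_quantizer_snr_db(bits):
    """Theoretical SNR (dB) of a full-scale sine after uniform quantization."""
    per_bit = 20 * math.log10(2)      # about 6.02 dB per bit
    constant = 10 * math.log10(1.5)   # about 1.76 dB for a full-scale sine
    return per_bit * bits + constant

for b in (8, 12, 16):
    print(b, "bits:", round(uniform_quantizer_snr_db(b), 1), "dB")
```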

Nonlinear PCM
- Nonlinear: non-uniform quantization.
  - The quantization step size grows roughly logarithmically with signal level: fine steps for quiet samples, coarse steps for loud ones.
  - Exploits the fact that human loudness perception is roughly logarithmic.
- Companding: compress before uniform quantization, expand after decoding.
  - μ-law and A-law.

μ-law (mu-law)
- Provides 14-bit quality (dynamic range) with an 8-bit encoding.
- Compression factor of about 2:1.
- Used in .au (Sun audio file format) and in North American and Japanese ISDN voice service.
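
A minimal sketch of μ-law companding, assuming the continuous formula with μ = 255 and samples normalized to [-1, 1]. It is not the exact segment-based encoding used by telephony codecs, and the function names are illustrative.

```python
import math

MU = 255.0  # common choice for mu-law telephony (assumption of this sketch)

def mu_compress(x):
    """Map x in [-1, 1] to [-1, 1] with logarithmic compression."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def encode_8bit(x):
    """Compress, then quantize uniformly to a signed integer in [-127, 127]."""
    return round(mu_compress(x) * 127)

def decode_8bit(code):
    """Undo the uniform quantization, then expand."""
    return mu_expand(code / 127)

quiet = 0.01
print(decode_8bit(encode_8bit(quiet)))  # close to 0.01
print(round(quiet * 127) / 127)         # plain uniform 8-bit: noticeably coarser
```

Quiet samples keep a small relative error after the round trip, which is the point of spending more quantizer levels near zero.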


DPCM
- Predictive coding: transmit the difference between each sample and its prediction rather than the sample itself.
- The difference between two x-bit samples can usually be represented with significantly fewer than x bits.
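
The idea can be sketched with a previous-sample predictor and a uniform quantizer. This is an illustrative closed-loop DPCM, not the scheme of any particular standard; the step size and the test signal are made up.

```python
def dpcm_encode(samples, step=4):
    """Closed-loop DPCM: quantize the difference to the decoder's last output."""
    codes, prediction = [], 0
    for s in samples:
        code = round((s - prediction) / step)  # quantized difference
        codes.append(code)
        prediction += code * step              # mirror what the decoder rebuilds
    return codes

def dpcm_decode(codes, step=4):
    """Accumulate the dequantized differences."""
    out, prediction = [], 0
    for code in codes:
        prediction += code * step
        out.append(prediction)
    return out

signal = [0, 3, 9, 14, 18, 20, 19, 15, 10, 6]
codes = dpcm_encode(signal)
rebuilt = dpcm_decode(codes)
print(codes)
print(rebuilt)
```

Because the encoder predicts from the reconstructed value, the error per sample stays within half a step and does not accumulate.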

Slope-overload problem in DPCM
- The prediction differences become too large for the quantizer to handle.
- The encoder fails to track rapidly changing signals, especially near the Nyquist frequency.

ADPCM
- Adaptive step size: a larger step size for fast-varying (high-frequency) samples, a smaller step size for slowly varying samples.
- The step-size parameter is not transmitted; previous sample values are used to estimate how the signal will change in the near future.
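
A minimal sketch of backward-adaptive step sizing follows. The adaptation rule (double the step after a saturated code, shrink it by 0.8 otherwise) and the step limits are assumptions chosen for illustration, not the rule of a specific ADPCM standard. Since the rule depends only on already-transmitted codes, the decoder can repeat it without any side information.

```python
def _adapt(step, code, max_code):
    """Grow the step after a saturated code, shrink it otherwise (assumed rule)."""
    factor = 2.0 if abs(code) == max_code else 0.8
    return min(max(step * factor, 1.0), 1024.0)

def adpcm_encode(samples, bits=3, step=1.0):
    max_code = 2 ** (bits - 1) - 1
    codes, prediction = [], 0.0
    for s in samples:
        code = round((s - prediction) / step)
        code = max(-max_code, min(max_code, code))  # clip to the code range
        codes.append(code)
        prediction += code * step
        step = _adapt(step, code, max_code)
    return codes

def adpcm_decode(codes, bits=3, step=1.0):
    max_code = 2 ** (bits - 1) - 1
    out, prediction = [], 0.0
    for code in codes:
        prediction += code * step
        out.append(prediction)
        step = _adapt(step, code, max_code)         # same rule as the encoder
    return out

jump = [0] * 5 + [100] * 20   # a sudden jump that would overload a fixed small step
decoded = adpcm_decode(adpcm_encode(jump))
print([round(v, 1) for v in decoded])
```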

Speech Production Model
- Lungs
- Glottis / vocal cords (pulse source)
- Vocal tract (filter)
- Cavities (articulation)
- Organ model: the articulators change the vocal tract filter.
- Two modes
  - Voiced sound: the vocal cords are vibrating (quasi-periodic excitation).
  - Unvoiced sound: the vocal cords are open (noise-like excitation).
(Figure: lungs -> glottis / vocal cords -> vocal tract -> lips/tongue cavities)

Signal and system modeling: Voiced Speech
- Excitation generation: a train of glottal pulses, where T is the pitch period (1 / pitch frequency).
- Vocal tract + radiation effect: a time-varying linear system.
- Formant frequencies: lambda = 4L and c = 340 m/s, so with L = 0.17 m, f1 = 340 / (4 x 0.17) = 500 Hz, f3 = 1500 Hz, f5 = 2500 Hz.
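
The formant figures follow from treating the vocal tract as a uniform tube closed at one end, whose resonances are the odd multiples of c / (4L). A small sketch, with an illustrative function name and the slide's values (L = 0.17 m, c = 340 m/s) as defaults:

```python
def tube_formants(length_m=0.17, speed_of_sound=340.0, count=3):
    """Resonances of a quarter-wavelength tube: odd multiples of c / (4 L)."""
    base = speed_of_sound / (4.0 * length_m)
    return [(2 * k - 1) * base for k in range(1, count + 1)]

print([round(f) for f in tube_formants()])
```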

Unvoiced Speech
- No tonal (periodic) excitation, no formant frequencies.

Linear Predictive Coding (LPC)
- A generalization of ADPCM.
- Linear prediction model.
- Prediction parameters change slowly relative to the data rate.
(Figure: input at N/sec -> analysis encoder -> parameters at M/sec -> synthesis decoder -> output at N/sec)
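
One common way to obtain the prediction parameters of a frame is the autocorrelation method solved with the Levinson-Durbin recursion. The slides do not say which method is intended, so the sketch below is an assumption; the function names are illustrative and silent frames (zero energy) are not handled.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return (a, error) where x[n] is predicted as sum(a[k-1] * x[n-k])."""
    a = [0.0] * (order + 1)
    error = r[0]                      # must be non-zero (non-silent frame)
    for i in range(1, order + 1):
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        k_i = acc / error             # reflection coefficient
        new_a = a[:]
        new_a[i] = k_i
        for k in range(1, i):
            new_a[k] = a[k] - k_i * a[i - k]
        a = new_a
        error *= (1.0 - k_i * k_i)    # remaining prediction error energy
    return a[1:], error

# Autocorrelation of an ideal first-order process with coefficient 0.9.
coeffs, residual = levinson_durbin([1.0, 0.9, 0.81], order=2)
print(coeffs, residual)
```

For this input the recursion recovers a first coefficient near 0.9 and a second near zero, as expected for a first-order process.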

Speech analysis and synthesis
- Performed every 5 to 40 ms.
- Analysis
  - Extract pitch frequencies.
  - Spectral analysis using multiple band-pass filters.
- Synthesis
  - Based on the mode (voiced or unvoiced), generate an excitation signal and feed it into the vocal tract filter.

(Block diagram: s(n) -> LPC analyzer, pitch detector (period, gain), voiced/unvoiced decision -> coder -> decoder -> LPC synthesizer)

LPC Vocoder

Bandwidth efficiency

Psychoacoustic Principles
- Psychoacoustics: properties of human auditory perception, similar to the HVS (human visual system) in video coding.
  - The average human does not hear all frequencies the same way.
  - Limitations of the human sensory system can be used to cut unnecessary data out of an audio signal.
- Key properties
  - Critical band property
  - Masking property
  - Absolute threshold of hearing
  - Auditory masking

Critical Bands
- Human hearing has limited, frequency-dependent resolution.
- 25 critical bands.
- 1 Bark = the width of one critical band.
- Critical band number (Bark) for a given frequency, z(f):
  - f < 500 Hz => z(f) ≈ f/100
  - f >= 500 Hz => z(f) ≈ 9 + 4 log2(f/1000)
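
A small sketch of the frequency-to-Bark mapping, assuming the common textbook approximation z = f/100 below 500 Hz and z = 9 + 4 log2(f/1000) from 500 Hz upward. The two branches meet at z = 5 for f = 500 Hz. The function name is illustrative.

```python
import math

def bark(frequency_hz):
    """Approximate critical-band number (Bark) for a frequency in Hz."""
    if frequency_hz < 500.0:
        return frequency_hz / 100.0
    return 9.0 + 4.0 * math.log2(frequency_hz / 1000.0)

for f in (100, 500, 1000, 4000, 16000):
    print(f, "Hz ->", round(bark(f), 2), "Bark")
```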

Absolute threshold of hearing
- Range: 20 Hz to 20 kHz, most sensitive at 2 kHz to 4 kHz.
- Dynamic range (quietest to loudest) is about 96 dB.
- Normal voice range is about 500 Hz to 2 kHz.

Simultaneous Masking
- Masking effect: the presence of tones at certain frequencies makes us unable to perceive tones at other "nearby" frequencies.
- Humans cannot distinguish between tones that lie within about 100 Hz of each other at low frequencies, or within about 4 kHz at high frequencies.

The masking pattern approximates a triangular function, modeled by a spreading function SF(x) in dB, where x is in Bark.
(Figure labels: steeper slope on one side, less steep slope on the other)

Example
- Hopefully we can build a demo.

Temporal Masking
- After a loud sound stops, it takes a little while before we can hear a soft tone at a nearby frequency.
- Masking can extend as much as 50 ms before and 200 ms after the masker.
- Example: play a 1 kHz masking tone at 60 dB and a 1.1 kHz test tone at 40 dB.

MPEG-1 Audio
- MP3
  - The most popular audio format at present.
  - Denotes MPEG-1 Audio Layer 3, not MPEG-3 (there is no such standard).
- Part of the MPEG-1 standard
  - About 1.2 Mbps for video, with a separate allocation for audio.
  - Sampling frequencies: 32, 44.1 and 48 kHz.
  - One or two audio channels: monophonic, dual-monophonic, stereo, joint stereo.
- Compression ratio from 2.7:1 to 42:1
  - Uncompressed CD audio: 44,100 samples/s x 16 bits/sample x 2 channels > 1.4 Mbps.
  - 16-bit stereo sampled at 48 kHz is reduced to 256 kbps.

MPEG-1 Audio Layers
- The layers differ in level of complexity.
- Layer 1
  - DCT-type filter with one frame and equal frequency spread per band.
  - The psychoacoustic model uses only frequency masking.
- Layer 2
  - Uses three frames in the filter (previous, current, next; 1152 samples in total).
  - This models a little of the temporal masking.
- Layer 3 (known as MP3)
  - A better critical-band filter is used (non-equal frequency bands).
  - The psychoacoustic model includes temporal masking effects and takes stereo redundancy into account.
  - Huffman coder.

Block Diagram for Layer 1

Sub-band Filter Part

Subband Bank
- Sub-band filter bank: 32 PCM samples yield 32 sub-band samples.
  - The sub-bands are evenly spaced, unlike the critical bands.
  - For a 48 kHz sampling rate, each sub-band is 750 Hz wide.
- Frame: samples out of each filter are grouped into blocks, called frames.
  - Blocks of 12 for Layer 1 (384 samples).
  - Blocks of 36 for Layers 2 and 3 (1152 samples).
  - At 48 kHz, 32 x 12 = 384 samples represent 8 ms of audio.
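
The sub-band width and frame duration can be reproduced as below. The sketch assumes a 48 kHz sampling rate, which is the rate consistent with the 750 Hz sub-band width and the 8 ms frame quoted on the slide; the function names are illustrative.

```python
SUBBANDS = 32

def subband_width_hz(sample_rate_hz):
    """Each of the 32 equal sub-bands covers 1/32 of the band up to Nyquist."""
    return (sample_rate_hz / 2) / SUBBANDS

def frame_info(sample_rate_hz, samples_per_subband):
    """Return (PCM samples per frame, frame duration in ms)."""
    samples = SUBBANDS * samples_per_subband
    return samples, 1000.0 * samples / sample_rate_hz

print(subband_width_hz(48_000))   # sub-band width in Hz
print(frame_info(48_000, 12))     # Layer 1
print(frame_info(48_000, 36))     # Layers 2 and 3
```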

Psychoacoustic analysis
- An FFT gets detailed spectral information about the signal.
  - 512-point FFT for Layer 1, 1024-point for Layers 2 and 3.
  - Do not confuse these with the sub-band filter input samples.
- Identify tonal and non-tonal maskers in each band.
- Determine the minimum masking threshold in each sub-band (using the global threshold and the masking thresholds).
- Calculate the signal-to-mask ratio (SMR) in each sub-band; this can be considered a quantization margin.

Segment the signal into blocks of 512 samples => about 12 ms (at 44.1 kHz).

Masking example
Suppose the levels of the first 16 of the 32 sub-bands are:

Band:        1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
Level (dB):  0   8  12  10   6   2  10  60  35  20  15   2   3   5   3   1

- The level of the 8th band is 60 dB; the pre-computed masking model specifies a masking of 12 dB in the 7th band and 15 dB in the 9th.
- The signal level in the 7th band is 10 dB (< 12 dB), so ignore it.
- The signal level in the 9th band is 35 dB (> 15 dB), so send it.
- Only the signals above the masking level need to be sent.
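
The rule in this example (send a band only if its level exceeds the masking level imposed on it) can be sketched as follows. The levels and the two masking values are the ones from the slide; the function name and the default threshold of 0 dB for bands without a listed masking value are illustrative assumptions.

```python
levels_db = [0, 8, 12, 10, 6, 2, 10, 60, 35, 20, 15, 2, 3, 5, 3, 1]
# Masking caused by the 60 dB signal in band 8, as given on the slide (band -> dB).
masking_db = {7: 12, 9: 15}

def bands_to_send(levels, masking, bands):
    """Keep only the listed bands whose level exceeds their masking level."""
    return [b for b in bands if levels[b - 1] > masking.get(b, 0)]

print(bands_to_send(levels_db, masking_db, [7, 9]))
```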

Scaling
- Purpose: use the full range of the quantizer.
  - Encoder: divide the signal by the scale factor before quantization.
  - Decoder: multiply by the scale factor after dequantization.
- Scale factor: the largest signal in the block is quantized using a 6-bit scale factor.
- The receiver needs to know the scale factor and the quantization levels used, so this information is included along with the samples.
- The resulting overhead is very small compared with the compression gains.
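
A minimal sketch of block scaling. As a simplifying assumption it uses the block's peak magnitude directly as the scale factor, whereas the codec picks the scale factor from a table indexed by the 6-bit code. The function names are illustrative.

```python
def encode_block(samples, bits):
    """Normalize a block by its peak magnitude, then quantize uniformly."""
    scale = max(abs(s) for s in samples) or 1.0   # avoid dividing by zero
    levels = 2 ** (bits - 1) - 1
    codes = [round(s / scale * levels) for s in samples]
    return scale, codes

def decode_block(scale, codes, bits):
    """Dequantize, then multiply by the scale factor."""
    levels = 2 ** (bits - 1) - 1
    return [c / levels * scale for c in codes]

block = [0.0008, -0.002, 0.0013]          # a quiet block
scale, codes = encode_block(block, bits=4)
print(scale, codes)
print(decode_block(scale, codes, bits=4))
```

Without the scale factor, a 4-bit quantizer spanning [-1, 1] would round every sample of this quiet block to zero.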

Bit allocation and Quantization
- Constraint: the target bit rate.
  - 192 kbps target rate with 8 ms frames => ~1.5 kbits/frame.
- Objective: minimize the noise-to-mask ratio (NMR) over all sub-bands, i.e., minimize audible quantization noise.
- Mission: for each audio frame, distribute bits across the sub-bands from a predetermined number of bits.
(Figure: 12 samples per frame per band -> multiply by 1/scale factor -> quantizer, controlled by the bit allocation)

A solution for bit allocation
- More bits for sub-bands with a low mask level (since noise there is easily noticed).
- Iterative process
  - For each sub-band, determine NMR = SMR - SNR (in dB).
  - Allocate bits one at a time to the sub-band with the largest NMR.
  - Recalculate the NMR values.
  - Iterate until all bits are used.
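
The iterative process can be sketched as a greedy loop. This sketch takes NMR as SMR minus SNR in dB, so that the largest NMR marks the sub-band with the most audible noise, and it approximates SNR as 6 dB per allocated bit. It ignores the cost of scale factors and other side information; all names and the example SMR values are illustrative.

```python
def allocate_bits(smr_db, total_bits, samples_per_band=12, max_bits=15):
    """Greedy allocation: give one more bit per sample to the sub-band whose
    noise-to-mask ratio (SMR - SNR) is currently the largest."""
    bits = [0] * len(smr_db)
    remaining = total_bits
    while remaining >= samples_per_band:
        # SNR is approximated as 6 dB per allocated bit (assumption).
        nmr = [smr - 6.0 * b if b < max_bits else float("-inf")
               for smr, b in zip(smr_db, bits)]
        worst = max(range(len(nmr)), key=nmr.__getitem__)
        if nmr[worst] == float("-inf"):
            break                      # every sub-band is at its maximum
        bits[worst] += 1
        remaining -= samples_per_band  # one extra bit for each of the 12 samples
    return bits

example_smr = [20.0, 5.0, -3.0, 12.0]  # hypothetical SMR values in dB
print(allocate_bits(example_smr, total_bits=72))
```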

Output Frame
- Frame format
  - Framelen: frame length in bytes.
  - Bit allocation: a 4-bit code for each sub-band (0-31); '0000' indicates no samples.
  - Scale factors: 6 bits each; a multiplier that sizes the samples to fully use the range of the quantizer. This field has variable length, depending on N = the number of sub-bands with non-zero bit allocation.
  - Audio data: sub-bands 0-31 stored in order, with each sub-band including 12 samples.



Layer 3 Block Diagram

Filter-bank Improved (MDCT)
- Equally spaced sub-bands do not accurately reflect the ear's critical bands:
  - f < 500 Hz => z(f) ≈ f/100
  - f >= 500 Hz => z(f) ≈ 9 + 4 log2(f/1000)
- Each sub-band is further analyzed using the modified DCT (MDCT), creating 18 samples per sub-band (576 samples in total).
- Better tracking of the masking threshold.
- MP3 also specifies an MDCT block length of 6.
- Lots of bit allocation options for quantizing frequency coefficients.
- Huffman coding for the quantized coefficients.