Gist of Achieving Human Parity in Conversational Speech Recognition
Suggested Major Project Topic Areas
- Image recognition: category of image, handwriting recognition, object identification, image to text
- Reinforcement learning
- Unsupervised learning: generative adversarial networks, autoencoders
- Semi-supervised learning
- Transfer learning
- Other
- Speech: speech recognition, speech synthesis, other synthesis (music, painting, text)
- Natural language: machine translation, word embeddings, summarization, text understanding, information retrieval, other
- LSTM and gated network applications
- Health and medicine
- Preference prediction
- Sequence prediction: time series, stocks
- Very deep networks: highway networks, residual learning
- Interpretability and human-supplied knowledge and control

Each team should post their tentative topic choice and begin posting gists. If you are undecided, post gists on multiple topics. Each person should post at least 4 gists. You can post as many as you want. There will be a running competition for the best gists. You may rate and comment on any posted gist.
Gist Format
Quick gist of the paper:
- What is the significant result? How major?
- What is the premise?
- What is the main prior work?
- What are the new methodologies?
- What techniques are assumed known?
They are Proud of This Work
Most of the abstract is spent bragging about the result.
Repeats the claim in the title.
Announces that what follows is important.
Things to look for in the paper:
- Convolutional and LSTM networks (existing techniques)
- Novel spatial smoothing (what's new: the novel technique)
- Lattice-free MMI acoustic training (something else that may be non-standard)
- Why did they emphasize "systematic" use?
- Of course, also look for the results (compare to human)!
Modest about Their Method
Best practices: you should do this, too!
This is explaining the emphasis on “systematic” use. While they are proud of their results, they are being modest about how they achieved them. They are attributing the results mainly to careful engineering rather than to the novel techniques that they have added to the old stand-bys. Yes, every course in deep learning should cover CNNs and RNNs.
FYI: More complete history, not necessary for gist
What’s new?
CNNs
Notice that they only reference, but do not describe, the prior work. Can you tell which of these references have Microsoft authors? You will need to read these references to understand the techniques used in this paper, and this is just to understand the CNNs. This gist should list the full title of each cited reference as required prior-work reading.
Final CNN Variant: LACE
More prior-work references. Also, ResNet is an implicit prior-work reference.
At least the LACE architecture itself is shown in detail.
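Since LACE (layer-wise context expansion with attention) is only shown as an architecture diagram, here is a rough numpy sketch of what I take to be its core idea: stacked convolutions whose dilation doubles at each layer, so the temporal context expands layer by layer. The function names and kernels are my own, and the attention weighting is omitted; this is an illustration, not the paper's implementation.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Causal 1-D convolution with the given dilation.
    x: (T,) input signal; w: (K,) kernel."""
    K = len(w)
    pad = dilation * (K - 1)
    xp = np.concatenate([np.zeros(pad), x])  # zero-pad the past
    return np.array([
        sum(w[k] * xp[t + pad - k * dilation] for k in range(K))
        for t in range(len(x))
    ])

def lace_like_stack(x, kernels):
    """Stack of dilated convolutions; dilation doubles each layer,
    so the receptive field (temporal context) expands layer by layer."""
    for layer, w in enumerate(kernels):
        x = dilated_conv1d(x, w, dilation=2 ** layer)
    return x
```

Feeding an impulse through three two-tap layers shows the receptive field growing to 1 + (1 + 2 + 4) = 8 frames, which is the "context expansion" the name refers to.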
LSTMs
Although LSTMs are only “a close second”, they are used in combination with the convolutional networks, so you need to know how to implement both. More prior-work references.
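For anyone who has not implemented an LSTM before, a minimal single-step numpy sketch of a standard LSTM cell may help (this is the textbook formulation, not the paper's exact variant; the gate ordering and shapes are my own convention):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step.
    x: (D,) input; h, c: (H,) previous hidden and cell state.
    W: (4H, D), U: (4H, H), b: (4H,) hold all four gates stacked
    in the order: input, forget, cell candidate, output."""
    z = W @ x + U @ h + b
    H = len(h)
    i = sigmoid(z[0:H])          # input gate
    f = sigmoid(z[H:2 * H])      # forget gate
    g = np.tanh(z[2 * H:3 * H])  # candidate cell update
    o = sigmoid(z[3 * H:4 * H])  # output gate
    c_new = f * c + i * g        # blend old cell state with candidate
    h_new = o * np.tanh(c_new)   # expose gated view of cell state
    return h_new, c_new
```

Unrolling this step over a sequence (carrying `h` and `c` forward) gives the recurrent network; the paper stacks several such layers.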
Spatial Smoothing
The part that is new. There are no prior-work references, but there is quite a bit of jargon.
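Since the paper describes spatial smoothing only in prose, here is a toy numpy illustration of the general idea of a smoothness regularizer: penalize differences between neighbouring activations arranged on a 2-D grid. The paper's exact formulation differs; treat this as a sketch of the concept, not their method.

```python
import numpy as np

def smoothness_penalty(a):
    """Toy roughness penalty on a 2-D grid of activations:
    sum of squared differences between adjacent units.
    A term like this, added to the training loss, pushes
    neighbouring units toward similar activations."""
    dh = a[:, 1:] - a[:, :-1]   # horizontal neighbour differences
    dv = a[1:, :] - a[:-1, :]   # vertical neighbour differences
    return float((dh ** 2).sum() + (dv ** 2).sum())
```

A constant activation map costs nothing, while a checkerboard pattern is maximally penalized, which is the sense in which the regularizer prefers "smooth" spatial patterns.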
Speaker Adaptive Modeling
The abstract didn’t even mention speaker-adaptive modeling, but you’ll need to know how to implement it. More prior-work references. Do you understand why CNN models are treated differently from the LSTM models with regard to the appended i-vector?
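A minimal sketch of the basic i-vector trick as I understand it: the same utterance-level speaker vector is appended to every frame's acoustic feature vector, so the network can condition on the speaker. The shapes and names here are assumptions for illustration, not the paper's.

```python
import numpy as np

def append_ivector(frames, ivector):
    """Append one utterance-level i-vector to each frame.
    frames: (T, D) acoustic features; ivector: (I,) speaker vector.
    Returns (T, D + I) speaker-adapted input features."""
    T = frames.shape[0]
    tiled = np.tile(ivector, (T, 1))  # repeat the i-vector per frame
    return np.hstack([frames, tiled])
```

The question in the slide, why CNNs handle the appended i-vector differently from LSTMs, comes down to the fact that a convolution slides over the feature dimension, so a constant appended block cannot simply sit at the end of the input the way it can for a fully connected LSTM input layer.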
Lattice-Free Sequence Training
This is only the initial paragraph. There are several more paragraphs describing “our implementation”. However, this clip shows the prior-work references. This section is a mix of a new implementation and prior work. They did not call it “novel” as they did the spatial smoothing.
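To make the MMI criterion concrete, here is a toy sketch of the sequence-level objective: the log-probability of the reference hypothesis relative to all competing hypotheses. The "lattice-free" part, replacing an N-best lattice of competitors with a full denominator graph, is far beyond this sketch; this only shows the shape of the objective being maximized.

```python
import math

def mmi_objective(scores, ref_index):
    """Toy sequence-level MMI criterion.
    scores: raw log-scores of all competing hypotheses (reference
    included); ref_index: which one is the reference transcript.
    Returns log( exp(ref) / sum_i exp(score_i) ), computed stably."""
    m = max(scores)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[ref_index] - log_denominator
```

Raising the reference score (or lowering the competitors') increases the objective, which is the discriminative pressure that plain cross-entropy frame training lacks.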
LM Rescoring and System Combination
This is how they combine the acoustic analysis with the language models. In addition to the explicit references, you will need to look at prior references for RNN LMs and LSTM LMs, but we will not look at the details in the next two sections.
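A toy sketch of N-best rescoring: each hypothesis from the first-pass decoder keeps its acoustic score, a stronger (here, assumed neural) language model adds its own score, and the interpolated total picks the winner. The scores and weight below are made up for illustration.

```python
def rescore_nbest(hypotheses, lm_weight=0.5):
    """Toy N-best rescoring.
    hypotheses: list of (text, acoustic_log_score, lm_log_score).
    Returns the text whose weighted combined score is highest."""
    def combined(h):
        _, am, lm = h
        return am + lm_weight * lm  # interpolate acoustic and LM scores
    return max(hypotheses, key=combined)[0]
```

The point of rescoring is visible even in a two-entry list: a hypothesis that the acoustics slightly prefer can lose once the language model weighs in.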
Results
There are many tables of results, showing tests of individual components and various combinations. There are also comparisons with prior work. This table is a sufficient representative for the gist. It shows results for various versions of the Microsoft system and compares the final system with human performance.
Summary of Gist
A “break-through” paper announcing the first result exceeding human performance on a well-known, heavily researched benchmark. The result was mainly achieved by “careful engineering and optimization” based on accumulated prior art.

If you have already implemented all of the prior art, there is a relatively small amount of new work to implement; even then, however, there will be a lot of tuning and optimizing. If you are starting from scratch, there is a very large amount of prior work that you will need to implement as a prerequisite.

This is an important paper on a major piece of work. If you want to understand state-of-the-art speech recognition very thoroughly, reading this paper and its references would be a good start. Many sections of the paper were skipped in this summary, such as the descriptions of the testing of human performance and the implementation on Microsoft’s CNTK framework.

Conclusion: Worth reading this paper if you want to be up-to-date in speech recognition. Would be extremely ambitious as a student team project.