Slide 1
Module Recognition Algorithms
Seminar Speech Recognition Project – Module Recognition Algorithms – Presentation 2: progress

This presentation gives an indication of the progress of the second project group for "module recognition algorithms", with René Verhage and Leo Yang, as of November 14th. The slides only give an overview of the research done on the current RES software, not of the other part of the research: a literature search on alternative algorithms suitable for speech recognisers.

The research on the RES software concerns an internal timing analysis of the search algorithm. The software has been modified, but only to include code for timing analysis; no parts of the actual recognition code have been changed, nor have any parameters been adjusted. The additional code does, of course, have an impact on the execution time, but this impact is negligible compared to the total execution time. The timing output on screen has been (manually) imported into Excel, and from these data several graphs have been made. The timing analysis gives some insight into the bottlenecks of the RES software: where an adjustment to the code is beneficial in terms of faster execution, and where it is not worth the effort.

The timing analysis has been done on a notebook running Windows XP (Professional), with a 1 GHz Pentium III processor and 224 MB of RAM. The clock function, which returns an integer number of clock ticks since the start of the program, has a nominal resolution of 1 ms. However, during the tests it appeared that the usable resolution is about 10 ms: several clock calls made in quick succession returned the same value until roughly 10 ms had passed. On the other hand, the clock function does accurately accumulate the time since the start of the program, so it can be used in tight loops for total or average calculations.

Note: shortly after this presentation was prepared, it turned out that the software had been compiled as a debug version, so the timing analysis presented here is for that version. Compiled as a release version the recognition is significantly faster, but not yet real-time… to be continued…
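For illustration, a minimal sketch of this kind of clock-based instrumentation; the structure and the names (TimingSlot, addInterval, g_timers) are assumptions for this sketch, not the actual code added to RES:

```cpp
#include <cstdio>
#include <ctime>

struct TimingSlot {
    const char*  label;
    std::clock_t total;  // accumulated clock ticks
    long         count;  // number of measured intervals
};

static TimingSlot g_timers[] = {
    { "Total time",                          0, 0 },
    { "main: initialisation & closing time", 0, 0 },
    // ... one slot per numbered entry in the timing output ...
};

// Accumulate one measured interval. A single short interval is only ~10 ms
// accurate here, but totals and averages over many iterations stay meaningful.
void addInterval(int slot, std::clock_t begin, std::clock_t end) {
    g_timers[slot].total += end - begin;
    g_timers[slot].count += 1;
}

void report() {
    for (int i = 0; i < 2; ++i) {
        double tot = double(g_timers[i].total) / CLOCKS_PER_SEC;
        std::printf("[%2d] %s: total: %.2f s average: %g s (%ld calls)\n",
                    i, g_timers[i].label, tot,
                    g_timers[i].count ? tot / g_timers[i].count : 0.0,
                    g_timers[i].count);
    }
}

int main() {
    std::clock_t t0 = std::clock();
    // ... work to be measured goes here ...
    addInterval(0, t0, std::clock());
    report();
    return 0;
}
```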
Slide 2
Two sound files, delivered ready for use with the RES software, have been used for the timing analysis. The analysis is performed on the word recognizer, from the TEST_ME/WORD_REC directory; the res.ini file in this directory has not been changed.

First of all, the execution time has been split between initialization and the actual recognition task. As shown in the diagram above, the initialization takes about 9.4 seconds for both sound files; as expected, the initialization is independent of the sound file used. The 4y0011 sound file is the shorter of the two, taking about 11.3 seconds to be recognized; the 4y0021 sound file is recognized in about 19.0 seconds. For the rest of the analysis the initialization time is left out: the pure recognition time is what matters for determining whether real-time recognition with RES is feasible.
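A hypothetical sketch of how this top-level split can be measured with two extra clock() samples; the function names are placeholders, not the actual RES entry points:

```cpp
#include <cstdio>
#include <ctime>

void initialise_res() { /* read res.ini, load models, ... */ }
void recognise()      { /* run the word recognizer on one sound file */ }

int main() {
    std::clock_t t0 = std::clock();
    initialise_res();
    std::clock_t t1 = std::clock();
    recognise();
    std::clock_t t2 = std::clock();

    // Measured values on the test notebook: initialisation ~9.4 s for both
    // files; recognition ~11.3 s (4y0011) and ~19.0 s (4y0021).
    std::printf("initialisation: %.1f s\n", double(t1 - t0) / CLOCKS_PER_SEC);
    std::printf("recognition:    %.1f s\n", double(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}
```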
Slide 3
The diagram above shows the relative performance of the RES system in recognizing both sound files. The recognition time is expressed as a percentage of the actual duration of the respective sound file. With the real time of the sound files set at 100%, it is clear that the recognition cannot be done in real time: the recognition time is much higher than the duration of the sound files, with both taking over 300% of their real time to recognize. Only if the recognition time were below 100% could the recognition be done in real time.
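As a cross-check, using the sound-file durations given later in these notes (3.6 s for 4y0011 and 6.1 s for 4y0021):

    4y0011: 11.3 s / 3.6 s ≈ 3.14, i.e. about 314% of real time
    4y0021: 19.0 s / 6.1 s ≈ 3.11, i.e. about 311% of real time

Both figures are indeed well above the 100% real-time threshold.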
Slide 4
    [ 0] Total time:                                      total: … s
    [ 1] main: initialisation & closing time:             total: … s
    [ 2] main: Viterbi beam/window:                       total: … s
    [ 3] HypoTree::Viterbi_Beam:                          average: … s (1 call)
    [ 4] HypoTree::Viterbi_Beam: while part:              total: … s  average: … s (1 iteration)   <-- highlighted
    [  ] HypoTree::Viterbi_Beam: initialisation,
         backtracking, and closing:                       total: … s  average: 0.01 s
    [ 5] HypoTree::Viterbi_Beam: double for-loop:         total: … s  average: … s (265 iterations)
    [ 6] HypoTree::Viterbi_Beam: dynamic Viterbi part:    total: … s  average: … s (265 iterations)
    [12] HypoTree::Viterbi_Beam: new observation:         total: … s  average: …e-005 s (265 iterations)
    [ 7] HypoTree::Viterbi_Beam: start in outer for-loop: total: … s  average: … s (49389 iterations)
    [ 8] HypoTree::Viterbi_Beam: inner for-loop:          total: … s  average: … s (49389 iterations)
    [ 9] HypoTree::Viterbi_Beam: inner for-loop, begin:   total: … s  average: …e-005 s (… iterations)
    [10] HypoTree::Viterbi_Beam: inner for-loop, if:      total: … s  average: …e-007 s (… iterations)
    [11] HypoTree::Viterbi_Beam: inner for-loop, else:    total: … s  average: …e-007 s (… iterations)

The default res.ini file uses the Viterbi_Beam function for the recognition of the sound files. Because there is one sound file per test run, there is only one call to this function. Within this function a split-up has been made to identify the parts contributing most to the recognition time.

The listing above gives an example of the timing-analysis output from the RES software, from a run with the sound file 4y0021 (the numeric values did not survive this transcript and are shown as "…"). After the different parts of the main routine, the Viterbi_Beam function is analyzed. The highlighted entry, [4] "while part", indicates the first split-up: in the Viterbi_Beam function first some initialization is done; next there is a while section in which the actual recognition of the sound file is performed, i.e. a tree is built with possible recognition results; then this tree is backtracked from the leaf with the highest probability, and the file is closed. As shown, the while section takes up almost all of the time, so much that it is pointless to show in a graph. Therefore the next level of the timing analysis zooms in on the while section.
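For orientation, the first split-up can be pictured as the following skeleton; a hedged reconstruction from the timing entries, not the actual RES code:

```cpp
// Skeleton of HypoTree::Viterbi_Beam as suggested by the measured split-up;
// every body here is a placeholder comment, not real code.
void Viterbi_Beam_sketch() {
    // initialisation: open the sound file and set up the dynamic tree
    // (small; lumped with backtracking/closing at ~0.01 s average)

    bool observationsRemain = false;  // placeholder condition
    while (observationsRemain) {
        // while part: the actual recognition, building the tree of possible
        // recognition results; 265 iterations for 4y0021, and nearly all of
        // the recognition time is spent here
    }

    // backtracking: follow the leaf with the highest probability back to
    // the root to obtain the recognized word sequence
    // closing: close the sound file
}
```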
Slide 5
The while section can be split into three parts: a double for-loop, a section here called the "dynamic part", and a call to get a new observation (i.e. a new portion of the sound file). Strictly speaking the term "dynamic part" is not quite right, since the whole recognition revolves around building up the dynamic tree, in the double for-loop as well. In the part labelled "dynamic part", actions are performed on the dynamic tree to reduce its size, such as pruning and the Viterbi beam reduction. Although the reduction of the dynamic tree shows up very clearly in the diagram above, it is also clear that the double for-loop makes by far the largest contribution to the recognition time. Therefore the next level of the timing analysis concentrates on the double for-loop.
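As a rough illustration, a hedged sketch of this three-part structure; ActiveState, getObservation and the other names are assumptions for this sketch, not RES identifiers:

```cpp
#include <cstddef>
#include <vector>

struct ActiveState { /* a leaf of the dynamic hypothesis tree */ };
struct Observation { /* one segment of the sound file */ };

bool getObservation(Observation&) { return false; } // placeholder

void whileSection() {
    std::vector<ActiveState> active;  // current leaves of the dynamic tree
    Observation obs;
    bool more = getObservation(obs);
    while (more) {                    // one iteration per segment
        // 1. double for-loop: extend every active state with each of its
        //    candidate successors; the dominant cost in the measurements
        for (std::size_t i = 0; i < active.size(); ++i) {
            // inner loop over the candidates of active[i]
        }
        // 2. "dynamic part": shrink the tree again, e.g. pruning and the
        //    Viterbi beam reduction
        // 3. new observation: fetch the next portion of the sound file
        more = getObservation(obs);
    }
}
```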
Slide 6
Inside the double for-loop the differences between the parts are not as large. The double for-loop is split into the start of the outer for-loop and the inner for-loop. The inner for-loop is analyzed in more detail by separating its beginning from the if-else statement. As the diagram shows, this last split-up reveals that almost all of the inner loop's time is spent in the (single) function call at the beginning. This timing analysis can (and should) be taken to further levels of detail to find the bottlenecks at the lowest level.

Summarizing the bottlenecks found so far: to get better performance we should concentrate on the double for-loop. Inside it there are two functions that each contribute enough to the total time to make it worthwhile to look for improvements in both. The next slides give some insight into the two for-loops by counting their numbers of iterations.
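The iteration counts shown on the next slides could be collected with instrumentation along these lines; a minimal sketch, with hypothetical names rather than the actual RES identifiers:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

static std::vector<long> g_statesPerSegment;   // outer-loop bound, per while iteration
static std::vector<long> g_candidatesPerState; // inner-loop bound, per outer iteration

// Called once per while-loop iteration (i.e. per segment):
void recordSegment(long nActiveStates) { g_statesPerSegment.push_back(nActiveStates); }

// Called once per outer-loop iteration (i.e. per active state):
void recordState(long nCandidates) { g_candidatesPerState.push_back(nCandidates); }

void dumpCounts() {
    // one value per line, so the output can be pasted into Excel by hand
    for (std::size_t i = 0; i < g_statesPerSegment.size(); ++i)
        std::printf("%lu\t%ld\n", (unsigned long)i, g_statesPerSegment[i]);
}
```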
Slide 7
This diagram, and the next one, shows the number of active states for each while-loop iteration. The while loop iterates once per segment of the sound file; for each segment, first the number of active states in the dynamic tree built so far is determined. One can see that the number of recognition possibilities to analyze for a certain segment varies with the number of possible recognitions of the previous segment: the current number of active leaves. From the number of while-loop iterations it can also be seen that the diagram shown above is from the smaller sound file: its 3.6 seconds are split into 166 segments, whereas the larger sound file, 6.1 seconds, is split into 265 segments.
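As a quick sanity check on the segmentation (assuming segments of equal duration):

    4y0011: 3.6 s / 166 segments ≈ 21.7 ms per segment
    4y0021: 6.1 s / 265 segments ≈ 23.0 ms per segment

so both files appear to be cut into segments of roughly 22 ms.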
Slide 8
No effort has been put into correlating these diagrams with the contents of the sound files, but one can imagine that the actual speech has its impact on the shape of the active-state curve over the while-loop iterations. It may be beneficial to correlate the two; perhaps this can reveal some ideas for reducing the number of active states. If that can be achieved, the number of outer-loop iterations in the next part of the function is reduced as well, which will (probably) result in less time spent in the double for-loop and hence in better performance.
Slide 9
Next, the number of inner for-loop iterations is shown: these are the numbers of possible candidates per iteration of the outer for-loop. The total number of outer for-loop iterations is the sum of all the active-state counts shown in the previous slides. For the smaller file this results in … outer for-loop iterations; the larger sound file has 49,389 (the count given in the timing output of slide 4), of which only part can be shown in an Excel graph at once. If the numbers of possible candidates in these diagrams are summed in turn, the total number of inner for-loop iterations can be determined: … for the smaller sound file and … for the larger one. So the reason for the "high" recognition time might lie more in the total number of iterations than in the execution time of any single function. That would mean the recognition time can be reduced by adjusting some parameters of the algorithm, although this would probably also lower the recognition accuracy. This investigation is still open…
Slide 10
These diagrams, too, seem to show some correlation with the sound files (which has not been investigated). It can also be seen that an (approximately) constant number of possible candidates keeps recurring; who knows, maybe the cause of this reveals some interesting details of the recognition algorithm…