Presentation on theme: "Parallelizing Dynamic Time Warping"— Presentation transcript:

1 Parallelizing Dynamic Time Warping
Hello, everyone. Last time, I talked about parallelizing Dynamic Time Warping. Today, I am going to extend this topic a bit further, both theoretically and practically. Jieyi Hu

2 First, let's review what we covered last time about dynamic time warping. This algorithm measures how similar two sequences are. It is usually used to compare speech, gestures, or other temporal sequences, and therefore often appears in real-time applications, where run time matters.

3 Costly It is also costly; therefore, we want to parallelize it.

4 Cost[i, j]
[Slide shows two 3-axis sequences (samples x0..x24, y0..y24, z0..z24) and the cost table Cost[i, j].] Quick review of the algorithm: given two sequences, construct a cost table, and find the cost for every cell by calculating the distance between the corresponding samples and adding the minimum of certain previous cells. However, last time I had not found a way to parallelize this part of the algorithm, because each cost value depends on previous costs: you cannot start calculating a cell until the previous ones are done.
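A minimal sequential sketch of this recurrence (the function name and the use of Euclidean distance as the per-cell cost are my assumptions, not from the slides):

```python
import math

def dtw_cost(a, b):
    """Sequential DTW: a and b are sequences of m-dimensional samples."""
    n, k = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * k for _ in range(n)]
    for i in range(n):
        for j in range(k):
            # distance between two m-dimensional samples (m steps)
            d = math.sqrt(sum((p - q) ** 2 for p, q in zip(a[i], b[j])))
            if i == 0 and j == 0:
                prev = 0.0
            else:
                # minimum over the cells the recurrence depends on
                prev = min(
                    cost[i - 1][j] if i > 0 else INF,
                    cost[i][j - 1] if j > 0 else INF,
                    cost[i - 1][j - 1] if i > 0 and j > 0 else INF,
                )
            cost[i][j] = d + prev
    return cost[n - 1][k - 1]
```

The dependency on `cost[i-1][j]`, `cost[i][j-1]`, and `cost[i-1][j-1]` is exactly what blocks naive parallelization of this loop nest.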

5 Right after we learned about red-black ordering, I wondered why not use a similar strategy on dynamic time warping. However, it cannot work exactly like red-black ordering: without calculating the red cells here first, you cannot calculate the orange, and without those you cannot calculate the yellow, and so on. So the color-coded lines still have to go sequentially, but inside each color-coded line, every calculation is independent of the others, so they can go in parallel.
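The color-coded lines are the anti-diagonals of the cost table: cells with the same i + j. A small sketch of this ordering (an n×n table assumed for simplicity):

```python
def diagonals(n):
    """Yield the cells of each anti-diagonal of an n x n cost table.
    Diagonals must be processed in order, but all cells within one
    yielded diagonal are mutually independent."""
    for d in range(2 * n - 1):          # there are 2n - 1 diagonals
        yield [(i, d - i) for i in range(n) if 0 <= d - i < n]
```

Each yielded list is a wavefront: every cell in it depends only on cells from earlier diagonals, so the cells in one list can all be computed in parallel.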

6 Test case of width m, height n
[Slide shows a test case laid out as an n-by-m grid of samples, and the cost table.] For each cell, the calculation takes m steps: computing the distance between two test-case samples (see the summation). To calculate the first red line, we need only 1 processor for its m steps; for the first yellow line we need two, then three, and so on up to n processors for the longest (blue) diagonal, which is the maximum. Therefore, with p = n processors, each color-coded line takes m steps, and there are 2n - 1 of these color-coded lines. Finding the minimum cost on the fly while calculating all cells, we need O((2n - 1) * m), which is O(nm).
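As a quick sanity check of the step counts above (a toy calculation only, not the real algorithm; function names are mine):

```python
def sequential_steps(n, m):
    """The sequential version does m steps for each of the n * n cells."""
    return n * n * m

def parallel_steps(n, m):
    """With p = n processors, each of the 2n - 1 anti-diagonals
    costs m steps, so the wavefront schedule takes (2n - 1) * m steps."""
    return (2 * n - 1) * m
```

For n = 30 (the Microsoft Band case discussed later) and a 3-axis sample, that is 2700 sequential steps versus 177 wavefront steps.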

7 The sequential version takes m steps per cell over n-by-n cells, so its time complexity is O(n2m). The parallel one is O(nm), so the speedup is n and the efficiency is 1: ideal linear speedup with n processors, and cost-optimal as well. But without n processors, I think the time complexity is still close to O(n2m), even though I do not yet know how to prove it.

8 However, is it the same case in real life?
In my experience, when comparing hand gestures captured by the Microsoft Band, whose sensors have a low sampling frequency, the sequence is not that long: around 30 samples, so n is 30. And while 30 processors is achievable on a PC, it is not currently achievable on a mobile device. If the gesture is instead captured by an Apple Watch, whose sensors have a high frequency and therefore produce long sequences, it is hard to keep the number of processors up with the length of the sequence. Therefore, this ideal speedup is not so practical in real life.

9 To verify that, I coded it in Python and tested it on a PC
On the first try, the parallel version came out slower than the sequential one. So I suspected that I had coded it wrong or used the wrong Python method, since this parallelization seemed so perfect theoretically.

10 Then I found out I had used the wrong Python method
When I distributed the task, I distributed it among processes. Since processes do not share memory space, a lot of data copying happens before the real task begins, which causes a lot of overhead.

11 Cost table
[Slide shows the cost table again.] Looking back at the algorithm: across the color-coded lines, operations go one by one sequentially, and inside each color-coded line, operations are independent. Therefore, even with a shared memory space, there is no conflict.

12 So I switched to distributing the task across threads instead of processes
And this did give me a better result in terms of run time.
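A minimal sketch of a thread-based wavefront version, assuming a `ThreadPoolExecutor` over the anti-diagonals (all names here are mine, not from the original code):

```python
import math
from concurrent.futures import ThreadPoolExecutor

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def dtw_threaded(a, b, workers=4):
    n, k = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * k for _ in range(n)]  # shared by all threads

    def fill(cell):
        i, j = cell
        prev = 0.0 if i == 0 and j == 0 else min(
            cost[i - 1][j] if i > 0 else INF,
            cost[i][j - 1] if j > 0 else INF,
            cost[i - 1][j - 1] if i > 0 and j > 0 else INF,
        )
        cost[i][j] = dist(a[i], b[j]) + prev

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for d in range(n + k - 1):        # diagonals go in order
            cells = [(i, d - i) for i in range(n) if 0 <= d - i < k]
            # cells on one diagonal are independent; list() waits
            # for the whole diagonal before moving to the next one
            list(pool.map(fill, cells))
    return cost[n - 1][k - 1]
```

Note that in CPython the global interpreter lock keeps pure-Python threads from running CPU-bound work truly in parallel, which is likely part of why this version still disappoints, on top of the overhead discussed on the next slides.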

13 However, it was still disappointing
The parallel versions were always slower than the sequential one, which I did not understand at first. So I asked people on Stack Overflow.

14 Cost table
[Slide shows the cost table again.] I was told that when the operation being parallelized is not expensive, the overhead can negate the benefits of parallelizing it. But later I realized that the operation being inexpensive is only one reason; the other is that the computer I used to run this code has only 2 processors and 4 logical cores. Therefore the time complexity never reaches O(nm); I think it stays close to O(n2m). In any case, looking back at my red-black-ordering-style DTW algorithm, the operation I am parallelizing is just finding the cost of one cell, which is m steps and therefore fairly inexpensive.

15 Processes
[Slide shows several cost tables, one per process.] But what is expensive? The whole algorithm itself: it is O(n2m). And in real life, we always have to compare one captured test case against several different predefined test cases.

16 If we have l predefined test cases and p processors
This gives linear speedup, since it is SPMD embarrassingly parallel work: each processor runs a whole DTW comparison on its own.

17 Anyway, I coded it in Python to get real-life results
For this one I used processes instead of threads, because the operations are not only independent of each other but also do not need a shared memory space.
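A minimal sketch of this coarse-grained, per-template parallelization with a process pool (a plain sequential `dtw_cost` is included so the sketch is self-contained; all names are mine):

```python
import math
from functools import partial
from multiprocessing import Pool

def dtw_cost(a, b):
    """Plain sequential DTW between two sequences of n-D samples."""
    n, k = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * k for _ in range(n)]
    for i in range(n):
        for j in range(k):
            d = math.sqrt(sum((p - q) ** 2 for p, q in zip(a[i], b[j])))
            prev = 0.0 if i == j == 0 else min(
                cost[i - 1][j] if i > 0 else INF,
                cost[i][j - 1] if j > 0 else INF,
                cost[i - 1][j - 1] if i > 0 and j > 0 else INF,
            )
            cost[i][j] = d + prev
    return cost[n - 1][k - 1]

def best_match(captured, templates, workers=2):
    """One whole DTW per task: tasks are fully independent (SPMD),
    so processes with no shared memory are a natural fit."""
    with Pool(processes=workers) as pool:
        costs = pool.map(partial(dtw_cost, captured), templates)
    return min(range(len(costs)), key=costs.__getitem__)
```

Each process gets one captured-vs-template comparison, which costs O(n2m) and is therefore expensive enough to amortize the process startup and copying overhead.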

18 From the data and the graph, we can see that the parallelized version is actually better than the sequential one, because the operation we parallelized is genuinely expensive. If we overlook the overhead, it is close to linear speedup.

19 Conclusion
For shared memory space, use threads. For independent operations that do not require shared memory space, use processes. Red-black-ordering-style DTW achieves ideal linear speedup theoretically, but it would require p = n processors to achieve it.

20 References
Fikret Ercal
Greg Hewgill, Stack Overflow
dano, Stack Overflow

21 Thank you

