Parallel Implementation of a Word Alignment Model: IBM Model 1
Professor: Dr. Azimi
Fateme Ahmadi-Fakhr, Afshin Arefi, Saba Jamalian
Dept. of Electrical and Computer Engineering, Shiraz University
General-Purpose Programming of Massively Parallel Graphics Processors
Machine Translation
Suppose we are asked to translate a foreign sentence f into an English sentence e:
    f: f_1 … f_m
    e: e_1 … e_l
What should we do?
For each word in the foreign sentence f, we find its most suitable English word:
    f_1 f_2 f_3 … f_m
    e_1 e_2 e_3 … e_m
Based on our knowledge of English, we change the order of the generated English words. We might also need to change the words themselves:
    e_1 e_3 e_2 e_{m+1} … e_l
Example
Foreign sentence: امروز صبح به مدرسه رفتم
Translation model (finding the most suitable English word for each foreign word): today morning to school went
Language model (reordering and changing the words): today morning went to school, then: this morning I went to school
Translation: This morning I went to school.
Statistical Translation Models
Foreign sentence: امروز صبح به مدرسه رفتم
Translation model (finding the most suitable English word for each foreign word): today morning to school went
For example, t(go | رفتم) > t(x | رفتم) for every other English word x.
The machine must know t(e|f) for all possible e and f to find the maximum.
The machine must therefore be trained: IBM Models 1-5 calculate t(f|e).
IBM Model 1 (Brown et al. [1993])
Input: a corpus (a large body of text).
Output of Model 1: t(f|e) for all e and f that occur in the corpus.
IBM Model 1 (Brown et al. [1993])
Choose an initial value for t(f|e) for all f and e, then repeat the following steps until convergence:
IBM Model 1 (Brown et al. [1993])
t(f|e) is a table with one row per English word e_i and one column per foreign word f_j.
The problem is to find t(f|e) for all e and f: how probable it is that f_j is the translation of e_i.
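IBM Model 1 (Brown et al. [1993])
t(f|e): table with rows e_i and columns f_j, set to its initial value.
c(f|e): table with rows e_i and columns f_j, initialized to zero.
total(e): one entry per e_i, the sum of each row of c(f|e), initialized to zero.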
IBM Model 1 (Brown et al. [1993])
In each sentence pair, for each f in the foreign sentence, we calculate ∑ t(f|e) over all e in the English sentence. This sum is called total_s.
Suppose we are given a sentence pair whose English words are e_1 … e_4 and whose foreign sentence contains f_2:
    total_s[2] = t(f|e)[1,2] + t(f|e)[2,2] + t(f|e)[3,2] + t(f|e)[4,2]
    c(f|e)[1,2] += t(f|e)[1,2] / total_s[2]
    total(e)[1] += t(f|e)[1,2] / total_s[2]
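The accumulation step above can be sketched on the CPU as follows. This is a minimal illustration, not the authors' code: the `Tables` struct, the function name `accumulatePair`, the dense row-major layout (row = English word, column = foreign word), and the 0-based word ids are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense tables: row i = English word e_i, column j = foreign word f_j.
struct Tables {
    std::size_t numE, numF;
    std::vector<double> t;       // t(f|e), numE * numF entries, uniform start
    std::vector<double> c;       // c(f|e), same shape, starts at zero
    std::vector<double> totalE;  // total(e), numE entries, starts at zero
    Tables(std::size_t nE, std::size_t nF)
        : numE(nE), numF(nF), t(nE * nF, 1.0 / nF), c(nE * nF, 0.0), totalE(nE, 0.0) {}
};

// Accumulate the counts of one sentence pair, given as lists of 0-based word ids.
void accumulatePair(Tables& tb, const std::vector<std::size_t>& eWords,
                    const std::vector<std::size_t>& fWords) {
    for (std::size_t f : fWords) {
        assert(f < tb.numF);
        // total_s[f]: sum of t(f|e) over all English words of this sentence.
        double totalS = 0.0;
        for (std::size_t e : eWords) totalS += tb.t[e * tb.numF + f];
        // Each English word receives its share of the foreign word f.
        for (std::size_t e : eWords) {
            double share = tb.t[e * tb.numF + f] / totalS;
            tb.c[e * tb.numF + f] += share;
            tb.totalE[e] += share;
        }
    }
}
```

With a uniform t(f|e), four English words, and one foreign word, each English word receives a share of 0.25, which is what `accumulatePair(tb, {0, 1, 2, 3}, {1})` adds to the corresponding entries of `c` and `totalE`.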
IBM Model 1 (Brown et al. [1993])
After processing all sentence pairs in the corpus, update the value of t(f|e) for all e and f:
    t(f|e)[i,j] = c(f|e)[i,j] / total(e)[i]
Then process the sentence pairs again, calculating c(f|e) and total(e) from the updated t(f|e).
Continue the process until the values of t(f|e) have converged.
IBM Model 1 (Pseudocode)
    initialize t(f|e)                                  // initialization
    do until convergence
        c(f|e) = 0 for all e and f                     // initialize to zero
        total(e) = 0 for all e
        for all sentence pairs s do
            // calculate total(s,f) for each f in f(s)
            total(s,f) = 0 for all f
            for all f in f(s) do
                for all e in e(s) do
                    total(s,f) += t(f|e)
            // calculate c(f|e) and total(e)
            for all e in e(s) do
                for all f in f(s) do
                    c(f|e) += t(f|e) / total(s,f)
                    total(e) += t(f|e) / total(s,f)
        // update t(f|e) using c(f|e) and total(e)
        for all e do
            for all f do
                t(f|e) = c(f|e) / total(e)
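As a point of comparison for the GPU version, the pseudocode can be written as a sequential C++ reference. This is a sketch under stated assumptions: word ids are 0-based integers, the tables are dense and row-major, the number of iterations is fixed instead of testing for convergence, and the names `trainModel1`, `Sentence`, and `SentencePair` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Sentence = std::vector<std::size_t>;           // 0-based word ids
using SentencePair = std::pair<Sentence, Sentence>;  // (English, foreign)

// Returns t(f|e) as a row-major numE x numF table after a fixed number of iterations.
std::vector<double> trainModel1(const std::vector<SentencePair>& corpus,
                                std::size_t numE, std::size_t numF, int iterations) {
    assert(numE > 0 && numF > 0);
    std::vector<double> t(numE * numF, 1.0 / numF);  // uniform initialization
    for (int it = 0; it < iterations; ++it) {
        std::vector<double> c(numE * numF, 0.0);     // c(f|e) = 0
        std::vector<double> totalE(numE, 0.0);       // total(e) = 0
        for (const SentencePair& sp : corpus) {
            const Sentence& eWords = sp.first;
            const Sentence& fWords = sp.second;
            for (std::size_t f : fWords) {
                double totalS = 0.0;                 // total(s,f)
                for (std::size_t e : eWords) totalS += t[e * numF + f];
                for (std::size_t e : eWords) {
                    double share = t[e * numF + f] / totalS;
                    c[e * numF + f] += share;
                    totalE[e] += share;
                }
            }
        }
        for (std::size_t e = 0; e < numE; ++e) {
            if (totalE[e] == 0.0) continue;          // word absent from corpus: keep old row
            for (std::size_t f = 0; f < numF; ++f)
                t[e * numF + f] = c[e * numF + f] / totalE[e];
        }
    }
    return t;
}
```

On a toy corpus of three two-word sentence pairs in which one word pair co-occurs twice, the probability of that pair rises from 0.25 to 0.5 after the first iteration and keeps growing in later iterations, while each row of the table continues to sum to one.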
Parallelizing IBM Model 1
    initialize t(f|e)                                  // each (f,e) entry is independent of the others
    do until convergence
        c(f|e) = 0 for all e and f                     // each (f,e) entry is independent of the others
        total(f) = 0 for all f
        for all sentence pairs s do                    // the processing of each sentence pair is independent of the others
            total(s,f) = 0 for all f
            for all e in e(s) do
                for all f in f(s) do
                    total(s,f) += t(f|e)
            for all e in e(s) do
                for all f in f(s) do
                    c(f|e) += t(f|e) / total(s,f)
                    total(f) += t(f|e) / total(s,f)
        for all e do
            for all f do
                t(f|e) = c(f|e) / total(f)             // the update of each t(f|e), for all e and f, is independent of the others
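Initialize t(f|e)
Each thread initializes one entry of t(f|e) to a specified value:
    __global__ void initialize(float* device_t_f_e){
        // One thread per table entry.
        int pos = blockIdx.x * blockDim.x + threadIdx.x;
        device_t_f_e[pos] = 1.0f / NUM_F;
    }
With this initial value, underflow is possible. The value is therefore scaled by 100000:
    __global__ void initialize(float* device_t_f_e){
        // One thread per table entry.
        int pos = blockIdx.x * blockDim.x + threadIdx.x;
        // Float literal: 100000/NUM_F would be an integer division if NUM_F is an integer constant.
        device_t_f_e[pos] = 100000.0f / NUM_F;
    }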
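One reason a constant scale factor on t(f|e) does no harm is that the accumulation step only uses the ratio t(f|e) / total(s,f), and a common factor cancels in that ratio. The following sketch checks this numerically; the function name `shares` and the sample values are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// For one foreign word, compute each English word's share t[k] / sum(t)
// after multiplying every t value by the same positive scale factor.
std::vector<double> shares(const std::vector<double>& tColumn, double scale) {
    assert(scale > 0.0);
    double totalS = 0.0;
    for (double v : tColumn) totalS += scale * v;
    std::vector<double> out;
    out.reserve(tColumn.size());
    for (double v : tColumn) out.push_back(scale * v / totalS);
    return out;
}
```

Calling `shares` on the same values with a scale of 1 and a scale of 100000 gives the same shares up to rounding, so the counts c(f|e) are unaffected by the scaling.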
Process of Each Sentence Pair
Each thread processes one sentence pair:
    for all sentence pairs s do
        total(s,f) = 0 for all f
        for all e in e(s) do
            for all f in f(s) do
                total(s,f) += t(f|e)
        for all e in e(s) do
            for all f in f(s) do
                c(f|e) += t(f|e) / total(s,f)
                total(f) += t(f|e) / total(s,f)
Shared memory is used.
No reduction is used. Why? Which entries a thread adds to is data dependent: it depends on the words in its sentence pair.
atomicAdd() is used instead, as two or more threads may add a value to the same entry of c(f|e) or total(f) simultaneously.
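To see why concurrent additions can collide, it helps to count how many sentence pairs touch each entry of c(f|e). The sketch below does this sequentially on the CPU; the function name `writersPerEntry`, the integer word ids, and the dense row-major layout are assumptions for illustration. Any entry with more than one writer is an entry that several threads could try to update at the same time when each thread processes one sentence pair.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Sentence = std::vector<std::size_t>;           // 0-based word ids
using SentencePair = std::pair<Sentence, Sentence>;  // (English, foreign)

// For each c(f|e) entry, count how many sentence pairs would add to it.
std::vector<int> writersPerEntry(const std::vector<SentencePair>& corpus,
                                 std::size_t numE, std::size_t numF) {
    std::vector<int> writers(numE * numF, 0);
    for (const SentencePair& sp : corpus) {
        // Mark the entries this sentence pair touches (each counted once per pair).
        std::vector<bool> touched(numE * numF, false);
        for (std::size_t e : sp.first)
            for (std::size_t f : sp.second) {
                assert(e < numE && f < numF);
                touched[e * numF + f] = true;
            }
        for (std::size_t k = 0; k < touched.size(); ++k)
            if (touched[k]) ++writers[k];
    }
    return writers;
}
```

Which entries end up with several writers depends entirely on the corpus, which is why the additions cannot be arranged into a fixed reduction pattern ahead of time.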
Updating t(f|e)
Each thread updates one entry of t(f|e) and sets one entry of c(f|e) to zero for the next iteration:
    __global__ void update(float* device_t_f_e, float* device_count_f_e,
                           float* device_total_f, int block_size, int Col) {
        // One thread per table entry; pos / Col is the row of this entry.
        int pos = blockIdx.x * block_size + threadIdx.x;
        float total = device_total_f[pos / Col];
        float count = device_count_f_e[pos];
        // Keep the same 100000 scale factor as in the initialization.
        device_t_f_e[pos] = 100000.0f * count / total;
        // Reset the count for the next iteration.
        device_count_f_e[pos] = 0;
    }
Here it is not possible to set total(f) to zero, as there is no synchronization between threads of different blocks.
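The reason for keeping the reset of the totals out of this kernel can be seen in a sequential mirror of the update. In the sketch below, which is a CPU illustration and not the authors' code, every entry of a row reads the same total, so the total may only be cleared after all entries of that row have been updated. The function name `updateTable` and the `scale` parameter (standing in for the factor 100000) are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU mirror of the update followed by the separate reset of the totals.
void updateTable(std::vector<float>& t, std::vector<float>& count,
                 std::vector<float>& total, std::size_t col, float scale) {
    assert(col > 0 && t.size() == count.size() && t.size() == total.size() * col);
    // Pass 1 (the update kernel): one "thread" per table entry.
    for (std::size_t pos = 0; pos < t.size(); ++pos) {
        float rowTotal = total[pos / col];  // shared by all entries of the row
        t[pos] = scale * count[pos] / rowTotal;
        count[pos] = 0.0f;                  // safe: only this entry reads count[pos]
    }
    // Pass 2 (the separate kernel): clear the totals once every entry has read them.
    for (std::size_t row = 0; row < total.size(); ++row) total[row] = 0.0f;
}
```

On the GPU, the end of one kernel launch and the start of the next acts as the barrier between the two passes, which is consistent with the slide's use of a second kernel for the reset.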
Setting total(f) to Zero
Each thread sets one entry of total(f) to zero:
    __global__ void total(float* device_total_f){
        // One thread per entry of total(f).
        int pos = threadIdx.x + blockDim.x * blockIdx.x;
        device_total_f[pos] = 0;
    }
Results
NUM_F    NUM_E    #SENTPAIR    CPU-Time    GPU-Time    Speed-Up
Future Goals
Convergence condition: we currently repeat the iterations that calculate c(f|e) and t(f|e) a fixed 5 times, but the stopping condition should be derived from the values of t(f|e). We wish to add this to our code, as it can itself be parallelized.
Model 1 is only one of IBM Models 1-5, which are implemented in the GIZA++ package. We wish to parallelize the other four models.
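One possible form of such a convergence condition is to stop when no entry of t(f|e) has changed by more than a small threshold since the previous iteration. The sketch below is only one option among several; the function name `hasConverged` and the threshold `epsilon` are hypothetical, and the threshold would have to be chosen relative to the scale used for t(f|e).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// True when the largest absolute change of any t(f|e) entry is below epsilon.
bool hasConverged(const std::vector<double>& previousT,
                  const std::vector<double>& currentT, double epsilon) {
    assert(previousT.size() == currentT.size());
    double maxChange = 0.0;
    for (std::size_t k = 0; k < currentT.size(); ++k) {
        double change = std::fabs(currentT[k] - previousT[k]);
        if (change > maxChange) maxChange = change;
    }
    return maxChange < epsilon;
}
```

The per-entry differences are independent of each other, and taking their maximum is a standard reduction, so a check of this kind appears compatible with a parallel implementation.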
We Want to Express Our Appreciation to:
For her useful comments and valuable notifications.
For his kindness and full support.