Parallel Implementation Of Word Alignment Model: IBM MODEL 1 Professor: Dr.Azimi Fateme Ahmadi-Fakhr Afshin Arefi Saba Jamalian Dept. of Electrical and.

Presentation transcript:

Parallel Implementation of Word Alignment Model: IBM Model 1
Professor: Dr. Azimi
Fateme Ahmadi-Fakhr, Afshin Arefi, Saba Jamalian
Dept. of Electrical and Computer Engineering, Shiraz University
General-purpose Programming of Massively Parallel Graphics Processors 1

Machine Translation 2
Suppose we are asked to translate a foreign sentence f into an English sentence e:
f = f_1 … f_m
e = e_1 … e_l
What should we do?
- For each word in the foreign sentence f, we find the most suitable English word.
- Using our knowledge of the English language, we reorder the generated English words.
- We might also need to change the words themselves.
f_1 f_2 f_3 … f_m → e_1 e_2 e_3 … e_m → e_1 e_3 e_2 e_{m+1} … e_l

Example 3
Foreign sentence: امروز صبح به مدرسه رفتم
1. Translation model (finding the most suitable English word for each foreign word): today, morning, to, school, went
2. Language model (reordering and changing the words): today, morning, went, to, school; then: this, morning, went, to, school, I
3. Translation

Statistical Translation Models 4
Foreign sentence: امروز صبح به مدرسه رفتم
Translation model (finding the most suitable English word for each foreign word): today, morning, to, school, went
t(go | رفتم) > t(x | رفتم), where x ranges over all other English words.
- The machine must know t(e|f) for all possible e and f to find the maximum.
- The machine must be trained: IBM Models 1-5 calculate t(f|e).

IBM Model 1 (Brown et al. [1993]) 5
Corpus (large body of text) → Model 1 → t(f|e) for all e and f that occur in the corpus

IBM Model 1 (Brown et al. [1993]) 6
Choose an initial value for t(f|e) for all f and e, then repeat the following steps until convergence:

IBM Model 1 (Brown et al. [1993]) 7
t(f|e): a table with one row per English word e_i and one column per foreign word f_j.
The problem is to find t(f|e) for all e and f: how probable it is that f_j is the translation of e_i.

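IBM Model 1 (Brown et al. [1993]) 8
t(f|e): table with rows e_i and columns f_j; initialized.
c(f|e): table with the same layout; initialized to zero.
total(e): one entry per e_i, holding the sum of the corresponding row of c(f|e).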

IBM Model 1 (Brown et al. [1993]) 9
In each sentence pair, for each f in the foreign sentence, we calculate ∑ t(f|e) over all e in the English sentence; this sum is called total_s.
Suppose we are given a sentence pair whose English sentence has four words (rows 1 to 4). For the foreign word in column 2:
total_s[2] = t(f|e)[1,2] + t(f|e)[2,2] + t(f|e)[3,2] + t(f|e)[4,2]
c(f|e)[1,2] += t(f|e)[1,2] / total_s[2]
total(e)[1] += t(f|e)[1,2] / total_s[2]

IBM Model 1 (Brown et al. [1993]) 10
Start processing the sentence pairs, calculating c(f|e) and total(e) using t(f|e).
After processing all sentence pairs in the corpus, update the value of t(f|e) for all e and f:
t(f|e)[i,j] = c(f|e)[i,j] / total(e)[i]
Continue the process until t(f|e) has converged to a desired value.
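The updates on slides 9 and 10 can be restated as formulas. This is only a restatement under the slides' definitions; the corpus symbol S and the word-position indices i and j are notation introduced here, not taken from the slides.

```latex
\begin{align*}
\mathrm{total}_s(f_j) &= \sum_{i=1}^{l} t(f_j \mid e_i)
  && \text{(per sentence pair } s\text{)} \\
c(f \mid e) &= \sum_{s \in S} \; \sum_{j:\, f_j = f} \; \sum_{i:\, e_i = e}
  \frac{t(f \mid e)}{\mathrm{total}_s(f)}
  && \text{(accumulated over the corpus)} \\
\mathrm{total}(e) &= \sum_{f} c(f \mid e)
  && \text{(row sum of } c\text{)} \\
t_{\text{new}}(f \mid e) &= \frac{c(f \mid e)}{\mathrm{total}(e)}
  && \text{(update after all pairs)}
\end{align*}
```

Because total(e) is the row sum of c(f|e), each row of the updated table sums to 1 over f.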

IBM Model 1 (Pseudocode) 11
    initialize t(f|e)                          // initialization
    do until convergence
        c(f|e) = 0 for all e and f             // initialize to zero
        total(e) = 0 for all e
        for all sentence pairs s do
            total(s,f) = 0 for all f
            for all f in f(s) do               // calculating total_s for each f in f(s)
                for all e in e(s) do
                    total(s,f) += t(f|e)
            for all e in e(s) do               // calculating c(f|e) and total(e)
                for all f in f(s) do
                    c(f|e) += t(f|e) / total(s,f)
                    total(e) += t(f|e) / total(s,f)
        for all e do                           // updating t(f|e) using c(f|e) and total(e)
            for all f do
                t(f|e) = c(f|e) / total(e)
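The pseudocode can be turned into a sequential CPU baseline along the following lines. This is a minimal sketch, not the authors' code: the `SentencePair` struct, the function name `ibm_model1_cpu`, the fixed iteration count, and the row-major table layout (row = English word, column = foreign word) are assumptions made for illustration. It is host-only code, written so that it compiles with nvcc alongside the kernels.

```cuda
// Host-only reference; no device code is involved.
#include <algorithm>
#include <vector>

// Hypothetical corpus type: word ids of one sentence pair.
struct SentencePair {
    std::vector<int> e;  // English word ids in [0, num_e)
    std::vector<int> f;  // foreign word ids in [0, num_f)
};

// Runs `iterations` passes and returns t(f|e) as a row-major
// num_e x num_f table (row = English word, column = foreign word).
std::vector<float> ibm_model1_cpu(const std::vector<SentencePair>& corpus,
                                  int num_e, int num_f, int iterations)
{
    std::vector<float> t(num_e * num_f, 1.0f / num_f);  // uniform start
    std::vector<float> count(num_e * num_f);
    std::vector<float> total_e(num_e);

    for (int it = 0; it < iterations; ++it) {
        std::fill(count.begin(), count.end(), 0.0f);
        std::fill(total_e.begin(), total_e.end(), 0.0f);

        for (const SentencePair& s : corpus) {
            for (int f : s.f) {
                // total(s,f): sum of t(f|e) over the English sentence
                float total_s = 0.0f;
                for (int e : s.e) total_s += t[e * num_f + f];
                if (total_s <= 0.0f) continue;  // guard against underflow

                // accumulate c(f|e) and total(e)
                for (int e : s.e) {
                    float delta = t[e * num_f + f] / total_s;
                    count[e * num_f + f] += delta;
                    total_e[e] += delta;
                }
            }
        }

        // update t(f|e) = c(f|e) / total(e)
        for (int e = 0; e < num_e; ++e) {
            if (total_e[e] <= 0.0f) continue;  // word never seen: keep old row
            for (int f = 0; f < num_f; ++f)
                t[e * num_f + f] = count[e * num_f + f] / total_e[e];
        }
    }
    return t;
}
```

The f loop is placed outside the e loops here so that total(s,f) can live in a single scalar; the accumulated values are the same as in the pseudocode.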

Parallelizing IBM Model 1 12
    initialize t(f|e)                          // each (f, e) entry is independent of the others
    do until convergence
        c(f|e) = 0 for all e and f             // each (f, e) entry is independent of the others
        total(e) = 0 for all e
        for all sentence pairs s do            // each sentence pair is processed independently of the others
            total(s,f) = 0 for all f
            for all e in e(s) do
                for all f in f(s) do
                    total(s,f) += t(f|e)
            for all e in e(s) do
                for all f in f(s) do
                    c(f|e) += t(f|e) / total(s,f)
                    total(e) += t(f|e) / total(s,f)
        for all e do                           // each t(f|e) update, for all e and f, is independent of the others
            for all f do
                t(f|e) = c(f|e) / total(e)
Here total(e) is the per-English-word normalizer from the previous slides; the kernels store it in the array named device_total_f.

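Initialize t(f|e) 13
Each thread initializes one entry of t(f|e) to a specified value:
    __global__ void initialize(float* device_t_f_e) {
        // One thread per table entry; there is no bounds check, so the
        // launch must cover the table exactly.
        int pos = blockIdx.x * blockDim.x + threadIdx.x;
        device_t_f_e[pos] = 1.0f / NUM_F;   // uniform start
    }
Underflow is possible with values this small, so the initial value is scaled by 100000:
    __global__ void initialize(float* device_t_f_e) {
        int pos = blockIdx.x * blockDim.x + threadIdx.x;
        // The float literal avoids integer division if NUM_F is an integer constant.
        device_t_f_e[pos] = 100000.0f / NUM_F;
    }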

Process Of Each Sentence Pair 14
    for all sentence pairs s do
        total(s,f) = 0 for all f
        for all e in e(s) do
            for all f in f(s) do
                total(s,f) += t(f|e)
        for all e in e(s) do
            for all f in f(s) do
                c(f|e) += t(f|e) / total(s,f)
                total(e) += t(f|e) / total(s,f)
- Each thread processes one sentence pair.
- Shared memory is used.
- No reduction is used. Why? Use atomicAdd(), as it is possible that two or more threads add a value to the same c(f|e) or total(e) entry simultaneously. This is data dependent.
Here total(e) is the per-English-word normalizer, stored in the array the kernels call device_total_f.
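The slide describes this step without showing its kernel, so the following is a hedged sketch of what a one-thread-per-sentence-pair kernel could look like. The kernel name `e_step`, the host helper `run_e_step`, and the packed corpus layout (word ids stored back to back, with `e_off` and `f_off` giving `num_pairs + 1` start offsets) are all assumptions introduced here. The sketch keeps total(s,f) in a register rather than in shared memory, which differs from the slide's note about shared memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per sentence pair. Tables are row-major with num_f columns
// (row = English word id, column = foreign word id).
__global__ void e_step(const float* t, float* count, float* total_e,
                       const int* e_words, const int* e_off,
                       const int* f_words, const int* f_off,
                       int num_pairs, int num_f)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= num_pairs) return;

    for (int j = f_off[s]; j < f_off[s + 1]; ++j) {
        int f = f_words[j];
        float total_s = 0.0f;  // total(s,f) for this foreign word
        for (int i = e_off[s]; i < e_off[s + 1]; ++i)
            total_s += t[e_words[i] * num_f + f];
        if (total_s <= 0.0f) continue;  // guard against underflow

        for (int i = e_off[s]; i < e_off[s + 1]; ++i) {
            int e = e_words[i];
            float delta = t[e * num_f + f] / total_s;
            // Other threads may hit the same entries, hence atomicAdd.
            atomicAdd(&count[e * num_f + f], delta);
            atomicAdd(&total_e[e], delta);
        }
    }
}

template <typename T>
static T* to_device(const std::vector<T>& v)
{
    T* d = nullptr;
    cudaMalloc((void**)&d, v.size() * sizeof(T));
    cudaMemcpy(d, v.data(), v.size() * sizeof(T), cudaMemcpyHostToDevice);
    return d;
}

// Hypothetical host helper: runs one pass and returns c(f|e)
// (num_e * num_f values) followed by total(e) (num_e values).
std::vector<float> run_e_step(const std::vector<float>& t,
                              const std::vector<int>& e_words,
                              const std::vector<int>& e_off,
                              const std::vector<int>& f_words,
                              const std::vector<int>& f_off,
                              int num_e, int num_f)
{
    int num_pairs = (int)e_off.size() - 1;
    size_t n = (size_t)num_e * num_f;
    float* d_t = to_device(t);
    int* d_ew = to_device(e_words);
    int* d_eo = to_device(e_off);
    int* d_fw = to_device(f_words);
    int* d_fo = to_device(f_off);
    float* d_out = nullptr;
    cudaMalloc((void**)&d_out, (n + num_e) * sizeof(float));
    cudaMemset(d_out, 0, (n + num_e) * sizeof(float));  // all-zero bits are 0.0f

    int threads = 128;
    int blocks = (num_pairs + threads - 1) / threads;
    e_step<<<blocks, threads>>>(d_t, d_out, d_out + n,
                                d_ew, d_eo, d_fw, d_fo, num_pairs, num_f);

    std::vector<float> out(n + num_e);
    // The blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(out.data(), d_out, out.size() * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_t); cudaFree(d_ew); cudaFree(d_eo);
    cudaFree(d_fw); cudaFree(d_fo); cudaFree(d_out);
    return out;
}
```

Since t(f|e)/total(s,f) is a ratio, the result is unchanged by the constant 100000 scale that the initialization and update kernels apply to t(f|e).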

Updating t(f|e) 15
Each thread updates one entry of t(f|e) and sets one entry of c(f|e) to zero for the next iteration:
    __global__ void update(float* device_t_f_e, float* device_count_f_e,
                           float* device_total_f, int block_size, int Col) {
        // block_size is expected to equal blockDim.x; Col is the number of table columns.
        int pos = blockIdx.x * block_size + threadIdx.x;
        float total = device_total_f[pos / Col];     // normalizer of this entry's row
        float count = device_count_f_e[pos];
        device_t_f_e[pos] = 100000 * count / total;  // same 100000 scale as in initialize
        device_count_f_e[pos] = 0;
    }
Here it is not possible to set total(e) (the device_total_f array) to zero, as there is no synchronization between threads of different blocks: other threads that share the same row may not have read the value yet.

Setting total(e) to Zero 16
Each thread sets one entry of total(e) (the device_total_f array) to zero:
    __global__ void total(float* device_total_f) {
        int pos = threadIdx.x + blockDim.x * blockIdx.x;
        device_total_f[pos] = 0;
    }
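Splitting the update and the zeroing into two kernels works because kernels launched into the same stream run in order: the second launch starts only after every thread of the first has finished, so the kernel boundary acts as a grid-wide barrier. The sketch below illustrates this under a few assumptions: the names `update_t`, `zero_total`, and `m_step` are hypothetical, and the bounds checks and the `total > 0` guard are additions that the slides' kernels do not have.

```cuda
#include <cuda_runtime.h>

// Update one entry of t(f|e) and clear the matching count entry.
// Tables are row-major with num_f columns; total_e has one entry per row.
__global__ void update_t(float* t, float* count, const float* total_e,
                         int num_e, int num_f, float scale)
{
    int pos = blockIdx.x * blockDim.x + threadIdx.x;
    if (pos >= num_e * num_f) return;      // threads past the table do nothing

    float total = total_e[pos / num_f];    // normalizer of this entry's row
    if (total > 0.0f)                      // skip rows with no counts
        t[pos] = scale * count[pos] / total;
    count[pos] = 0.0f;                     // ready for the next iteration
}

// Clear the per-row normalizers.
__global__ void zero_total(float* total_e, int num_e)
{
    int pos = blockIdx.x * blockDim.x + threadIdx.x;
    if (pos < num_e) total_e[pos] = 0.0f;
}

// Host side: both launches go to the default stream, so zero_total
// cannot start before all update_t threads have read total_e.
void m_step(float* d_t, float* d_count, float* d_total_e,
            int num_e, int num_f, float scale)
{
    const int threads = 256;
    int n = num_e * num_f;
    update_t<<<(n + threads - 1) / threads, threads>>>(
        d_t, d_count, d_total_e, num_e, num_f, scale);
    zero_total<<<(num_e + threads - 1) / threads, threads>>>(
        d_total_e, num_e);
}
```

Passing `scale = 100000.0f` reproduces the scaling used on the slides; `scale = 1.0f` gives plain probabilities.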

Results 17
NUM_F | NUM_E | #SENTPAIR | CPU-Time | GPU-Time | Speed-Up

Future Goals 18
- Convergence condition:
  - We currently repeat the iterations of calculating c(f|e) and t(f|e) 5 times.
  - The stopping point should instead be derived from the value of t(f|e).
  - We wish to add this to our code, as it can be parallelized.
- This is just one of IBM Models 1-5, which are implemented in the GIZA++ package.
  - We wish to parallelize the other 4 models.
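One way a convergence test driven by t(f|e) might be parallelized is to compare the table before and after an update and stop when the largest change falls below a tolerance. The slides do not specify a criterion, so everything here is an assumption: the names `max_abs_diff` and `has_converged`, the max-absolute-change criterion, the need to keep a copy of the previous table, and the tolerance itself.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Each thread folds |t_new - t_old| for one entry into a global maximum.
// For non-negative floats the IEEE-754 bit pattern orders like the value,
// so the integer atomicMax can be applied to the reinterpreted bits.
__global__ void max_abs_diff(const float* t_new, const float* t_old,
                             int n, int* max_diff_bits)
{
    int pos = blockIdx.x * blockDim.x + threadIdx.x;
    if (pos >= n) return;
    float d = fabsf(t_new[pos] - t_old[pos]);
    atomicMax(max_diff_bits, __float_as_int(d));
}

// Returns true when no entry changed by `tol` or more.
// d_t_new and d_t_old are device pointers to tables of n floats.
bool has_converged(const float* d_t_new, const float* d_t_old,
                   int n, float tol)
{
    int* d_bits = nullptr;
    cudaMalloc((void**)&d_bits, sizeof(int));
    cudaMemset(d_bits, 0, sizeof(int));  // bit pattern of 0.0f

    const int threads = 256;
    max_abs_diff<<<(n + threads - 1) / threads, threads>>>(
        d_t_new, d_t_old, n, d_bits);

    int bits = 0;
    // The blocking copy waits for the kernel to finish.
    cudaMemcpy(&bits, d_bits, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_bits);

    float max_diff;
    std::memcpy(&max_diff, &bits, sizeof(float));  // reinterpret bits as float
    return max_diff < tol;
}
```

If t(f|e) is stored with the 100000 scale used on the slides, the tolerance has to be scaled the same way. Having every thread update one address is simple but contended; a per-block reduction followed by a single atomic per block would likely be faster.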

We Want to Express Our Appreciation to: 19
For her useful comments and valuable notifications.
For his kindness and full support.
