Arabic WordNet: Semi-automatic Extensions using Bayesian Inference


1 LREC 2008 AWN
Arabic WordNet: Semi-automatic Extensions using Bayesian Inference
H. Rodríguez (1), D. Farwell (1), J. Farreres (1), M. Bertran (1), M. Alkhalifa (2), M.A. Martí (2)
(1) TALP Research Center, UPC, Barcelona, Spain
(2) UB, Barcelona, Spain

2 Index of the talk
- The AWN project
- Semi-automatic Extensions of AWN
- Intuitive basis
- Previous work using heuristics
- Using Bayesian Networks
- Empirical evaluation
- Conclusions

3 The AWN project
- Funded by the USA REFLEX program (2005-2007)
- Partners:
  - Universities: Princeton, Manchester, UPC, UB
  - Companies: Articulate Software, Irion
- Description: Black et al., 2006; Elkateb et al., 2006; Rodríguez et al., 2008

4 The AWN project
Objectives
- 10,000 synsets, including some amount of domain-specific data
- linked to PWN 2.0, finally to PWN 3.0
- linked to SUMO
- plus 1,000 named entities (NE)
- manually built (or revised)
- vowelized entries
- including the root of each entry

5 The AWN project
Current figures
- Arabic synsets: 11270
- Arabic words: 23496

POS      DB content
adj      661
nouns    7961
adv      110
verbs    2538

Named entities
- Synsets that are named entities: 1142
- Synsets that are not named entities: 10028
- Words in synsets that are named entities: 1656

6 Semi-automatic Extensions of AWN
Intuitive basis
In Arabic (and other Semitic languages), many words having a common root (i.e. a sequence of typically three consonants) have related meanings and can be derived from a base verbal form by means of a reduced set of lexical rules.

7 Semi-automatic Extensions of AWN

8 Semi-automatic Extensions of AWN
Lexical rules
- regular verbal derivative forms
- regular nominal and adjectival derivative forms
  - masdar (verbal noun)
  - masculine and feminine active and passive participles
- inflected verbal forms
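The root-and-pattern idea behind these rules can be illustrated with a small sketch. The templates below are a few textbook patterns written in a simplified, loosely Buckwalter-style transliteration. They are chosen for illustration only and are not the rule set used in the AWN project; `apply_pattern` and `derive` are hypothetical helper names.

```python
# Illustrative only: simplified transliteration, not the AWN rule set.
# The digits 1, 2, 3 in a template stand for the three root consonants.
PATTERNS = {
    "perfective verb (form I)": "1a2a3a",
    "active participle": "1A2i3",
    "passive participle": "ma12uw3",
}


def apply_pattern(root, template):
    """Fill a template with the consonants of a three-consonant root."""
    if len(root) != 3:
        raise ValueError("expected a three-consonant root")
    slots = {"1": root[0], "2": root[1], "3": root[2]}
    # Characters that are not slot digits are copied unchanged.
    return "".join(slots.get(ch, ch) for ch in template)


def derive(root):
    """Return every pattern in PATTERNS applied to the given root."""
    return {name: apply_pattern(root, tpl) for name, tpl in PATTERNS.items()}


if __name__ == "__main__":
    for name, form in derive("drs").items():
        print(f"{name}: {form}")
```

Each generated form is only a candidate: whether it is an attested word still has to be checked against lexical resources.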

9 Semi-automatic Extensions of AWN
Procedure for generating a set of likely word-synset associations:
1. produce an initial list of candidate word forms
2. filter out the less likely candidates from this list
3. generate an initial list of attachments
4. score the reliability of these candidates
5. manually review the best scored candidates and include the valid associations in AWN

10 Semi-automatic Extensions of AWN
Resources
- PWN
- AWN
- LOGOS database of conjugated Arabic verbs
- NMSU bilingual Arabic-English lexicon
- Arabic Gigaword Corpus
- UN (2000-2002) bilingual Arabic-English Corpus

11 Semi-automatic Extensions of AWN
Score the reliability of the candidates
- build a graph representing the words, synsets and their associations
  - synset-synset associations:
    - explicit in WN2.0
    - path-based
- apply a set of heuristic rules that directly use the structure of the graph (GWC 2008)
- apply Bayesian inference (LREC 2008)
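One possible reading of a path-based synset-synset association is "reachable within a few relation steps". The sketch below works under that assumption. The adjacency-dictionary representation, the `max_hops` parameter, and the synset identifiers are all hypothetical, and the slides do not state how path-based associations were actually defined.

```python
from collections import deque


def synsets_within(relations, start, max_hops):
    """Return {synset: hop distance} for the synsets reachable from `start`
    in at most `max_hops` relation steps (breadth-first search)."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in relations.get(current, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    del distance[start]  # the start synset is not associated with itself
    return distance


# Hypothetical toy graph; the identifiers are made up for illustration.
RELATIONS = {
    "study.v": ["learn.v", "examine.v"],
    "learn.v": ["acquire.v"],
    "acquire.v": ["get.v"],
}

if __name__ == "__main__":
    print(synsets_within(RELATIONS, "study.v", 2))
```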

12 Using Bayesian Inference

13 Using Bayesian Inference

14 Using Bayesian Inference
Building the CPT for each node in the BN
- EW-AW edges
  - probabilities from statistical translation models built from the UN corpus using GIZA++ (word-word probabilities)
  - filtered to avoid pairs having Arabic expressions with invalid Buckwalter encodings
  - all the probability mass is distributed among the pairs occurring in the BN
- other edges (EW-S, S-S)
  - linear distribution on priors
  - noisy-OR model
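Two of the ingredients named on this slide can be sketched in a few lines: redistributing the translation probability mass over the pairs that survive in the network, and the standard noisy-OR combination. This is a minimal sketch, not the authors' implementation. It assumes that EW and AW denote English and Arabic words, that renormalization is done per source word, and that parents are treated as independent. The function names and link strengths are hypothetical.

```python
def renormalize(translation_probs, words_in_network):
    """Keep only the translations present in the network and rescale
    their probabilities so that they sum to 1."""
    kept = {w: p for w, p in translation_probs.items() if w in words_in_network}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {w: p / total for w, p in kept.items()}


def noisy_or(parent_probs, link_strengths):
    """Probability that the child is true when each parent independently
    triggers it with its own link strength:
    1 - prod(1 - strength_i * p(parent_i))."""
    not_triggered = 1.0
    for p, strength in zip(parent_probs, link_strengths):
        not_triggered *= 1.0 - strength * p
    return 1.0 - not_triggered


if __name__ == "__main__":
    # Made-up word-word probabilities for one source word.
    probs = {"study": 0.5, "lesson": 0.3, "unrelated": 0.2}
    print(renormalize(probs, {"study", "lesson"}))
    print(noisy_or([1.0, 1.0], [0.5, 0.5]))
```

With noisy-OR, each node's table needs only one parameter per parent instead of one entry per combination of parent values, which keeps the CPTs small for nodes with many parents.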

15 Using Bayesian Inference
Performing Bayesian inference in the BN
1. Assign probability 1 to the nodes in layer 1
2. Infer the probabilities of the nodes in layer 3
3. For each word in layer 1, select as candidates the synsets in layer 3 that are connected to it and have a probability over a threshold
4. Score each candidate pair with this probability
5. Select the candidates scored over a threshold
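The selection steps that follow inference can be sketched as below. The sketch takes the inferred layer-3 probabilities as given and does not perform the inference itself. The function name, the data layout, and the threshold values are hypothetical, since the slides give no threshold values.

```python
def select_candidates(synset_prob, connected, link_threshold, final_threshold):
    """Return (word, synset, score) triples, best first.

    synset_prob: inferred probability of each layer-3 synset
    connected:   for each layer-1 word, the layer-3 synsets connected to it
    """
    candidates = []
    for word, synsets in connected.items():
        for synset in synsets:
            score = synset_prob.get(synset, 0.0)
            # Keep connected synsets whose probability exceeds the threshold;
            # the pair is scored with that same probability.
            if score > link_threshold:
                candidates.append((word, synset, score))
    # Final selection over the scored pairs.
    selected = [c for c in candidates if c[2] > final_threshold]
    return sorted(selected, key=lambda c: c[2], reverse=True)


if __name__ == "__main__":
    # Made-up inferred probabilities and connections.
    probs = {"s1": 0.9, "s2": 0.4, "s3": 0.7}
    links = {"w1": ["s1", "s2"], "w2": ["s3"]}
    print(select_candidates(probs, links, 0.5, 0.8))
```

The surviving triples are the candidates handed to the manual review step.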

16 Empirical Evaluation
10 verbs randomly selected from AWN + درس

17 Empirical Evaluation
Results

18 Conclusions
- The BN approach doubles the number of candidates of the previous HEU approach (554 vs 272).
- The sample is clearly insufficient.
- The overlap of HEU + BN seems to improve the results.
- An analysis of the errors shows that a substantial number were due to the lack of the shadda diacritic or the feminine ending form (ta marbuta, ة).
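A small example shows why a missing shadda can cause such errors. The form I verb darasa ("to study") and the form II verb darrasa ("to teach") differ, once the short vowels are dropped, only by the shadda on the middle consonant, so both collapse to the bare string درس. This illustrates the kind of ambiguity involved and is not an analysis of the errors in the evaluation itself; `strip_diacritics` is a hypothetical helper.

```python
SHADDA = "\u0651"
# Arabic marks from fathatan (U+064B) to sukun (U+0652), shadda included.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(word):
    """Remove short-vowel marks, shadda and sukun from an Arabic string."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


# dal + fatha, ra + fatha, sin + fatha: form I, "to study"
DARASA = "\u062F\u064E\u0631\u064E\u0633\u064E"
# dal + fatha, ra + shadda + fatha, sin + fatha: form II, "to teach"
DARRASA = "\u062F\u064E\u0631\u0651\u064E\u0633\u064E"

if __name__ == "__main__":
    print(DARASA != DARRASA)
    print(strip_diacritics(DARASA) == strip_diacritics(DARRASA))
```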

19 Further work
- Repeat the entire procedure relying, when possible, on dictionaries containing diacritics
- Refine the scoring procedure by assigning different weights to the different relations
- Include additional relations (e.g. path-based)
- Use additional knowledge sources for weighting the relations:
  - related entries already included in AWN
  - SUMO
  - Magnini's domain codes

20 Thank you for your attention

