
Progress updates on dependency parsing


1 Progress updates on dependency parsing
David Ling

2 Wiki parsing
Wikipedia 2018 dump, ~3 GB (zipped)
Parsed using spaCy
About 4 days with 2 computers
Parsed dependency count: 226,078,650 rows
Database columns: word1, pos1, word2, pos2, dependency, freq
Building the database via SQLite (~11 GB)
Build the entire database in RAM first, then copy it to hard disk afterward (3 hours vs 48 hours)
Use the “apsw” Python library instead of the default “sqlite3” library, which lacks direct copying from RAM to hard disk
To improve searching and building, a large SSD can be considered
Figure: sorted query result for word2 = programmers, dependency = nsubj
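The build-in-RAM-then-copy step can be sketched roughly as follows. This is a minimal illustration, not the code used for the slides: it swaps in the standard-library sqlite3 module, whose Connection.backup method (available since Python 3.7) performs the copy, rather than apsw. The table name deps, the helper names, and the sample rows are assumptions made for illustration; the rows are invented and are not real counts from the Wikipedia dump.

```python
import sqlite3

# Invented rows for illustration only; not real counts from the parsed dump.
SAMPLE_ROWS = [
    ("write", "VBP", "programmers", "NNS", "nsubj", 3),
    ("use", "VBP", "programmers", "NNS", "nsubj", 7),
    ("think", "VBP", "people", "NNS", "nsubj", 4),
]


def build_in_memory(rows):
    """Create the dependency table in RAM and bulk-insert all rows."""
    mem = sqlite3.connect(":memory:")
    mem.execute(
        "CREATE TABLE deps ("
        "word1 TEXT, pos1 TEXT, word2 TEXT, pos2 TEXT, "
        "dependency TEXT, freq INTEGER)"
    )
    mem.executemany("INSERT INTO deps VALUES (?, ?, ?, ?, ?, ?)", rows)
    mem.commit()
    return mem


def copy_to_disk(mem, path):
    """Copy the finished in-memory database to a file on disk."""
    disk = sqlite3.connect(path)
    try:
        # Connection.backup copies every page of `mem` into `disk`.
        mem.backup(disk)
    finally:
        disk.close()
```

A typical call would be copy_to_disk(build_in_memory(SAMPLE_ROWS), "deps.db"), where deps.db is a hypothetical output path. All inserts happen in RAM, and the disk is written only once at the end.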

3 Wiki parsing
Even a single word pair can have different dependencies
Example query: word1 = think, word2 = people
Can check the prepositions used with a word
Example: word2 = floor, pos1 = IN, dep = pobj
May be useful for statistics on HSMC students
Problem: searching is fast if all the quantities are specified, but is still slow if some quantities are left arbitrary
Tag legend:
VB: verb, base form
VBP: verb, non-3rd-person singular present
dobj: direct object
dep: unspecified dependency
pobj: object of preposition
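The preposition lookup above can be sketched as a SQLite query. The slides do not say which indexes the real database has, so the composite index here is only an assumption: one common way to keep a partially specified search fast is to index the columns that are usually fixed. The table name deps, the index name idx_word2_dep, the helper names, and the sample rows are all hypothetical, and the counts are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE deps ("
    "word1 TEXT, pos1 TEXT, word2 TEXT, pos2 TEXT, "
    "dependency TEXT, freq INTEGER)"
)
# Invented rows for illustration only; not real corpus counts.
conn.executemany(
    "INSERT INTO deps VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("on", "IN", "floor", "NN", "pobj", 5),
        ("across", "IN", "floor", "NN", "pobj", 2),
        ("clean", "VB", "floor", "NN", "dobj", 1),
        ("think", "VBP", "people", "NNS", "nsubj", 4),
        ("think", "VB", "people", "NNS", "dep", 1),
    ],
)
# Assumed index: covers searches that fix word2 and dependency
# while leaving word1 arbitrary.
conn.execute("CREATE INDEX idx_word2_dep ON deps (word2, dependency)")

PREP_SQL = (
    "SELECT word1, freq FROM deps "
    "WHERE word2 = ? AND pos1 = 'IN' AND dependency = 'pobj' "
    "ORDER BY freq DESC"
)


def prepositions_of(conn, noun):
    """Return (preposition, freq) pairs governing `noun`, most frequent first."""
    return conn.execute(PREP_SQL, (noun,)).fetchall()


def query_plan(conn, sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]
```

Calling prepositions_of(conn, "floor") lists the prepositions by frequency, and query_plan(conn, PREP_SQL, ("floor",)) shows whether SQLite chose the index or fell back to a full table scan. A search that fixes none of the indexed columns would still scan all rows.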

4 Wiki parsing
Another preposition example: bus

5 LDC – Linguistic Data Consortium
Received data from LDC
After purchasing data, a download link becomes available in your account
Web 1T corpus: ~24 GB

6 LDC – Linguistic Data Consortium
Data samples from the Web 1T corpus

