Progress updates on dependency parsing 2018-01-29 David Ling
Wiki parsing
Wikipedia 2018 dump, ~3 GB (zipped)
Parsed using spaCy: about 4 days with 2 computers
Row format: word1, pos1, word2, pos2, dependency, freq
Parsed dependency count: 226,078,650 rows
Building the database with SQLite (~11 GB)
  Build the entire database in RAM first, then copy it to the hard disk (3 hours vs 48 hours)
  Use the “apsw” Python library instead of the default “sqlite3”, which lacks direct copying from RAM to the hard disk
  A large SSD could be considered to speed up searching and building
Sorted query result for ‘word2 = programmers, dependency = nsubj’
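The build-in-RAM-then-copy step could look roughly like the sketch below. The slide uses apsw for the copy. This sketch instead swaps in the standard library's sqlite3.Connection.backup (added in Python 3.7), which wraps the same SQLite online backup mechanism. The table name "dependencies", the function names, and the sample rows are illustrative assumptions, not the actual build script.

```python
import os
import sqlite3
import tempfile

# Column layout from the slide; the table name "dependencies" is hypothetical.
SCHEMA = """
CREATE TABLE dependencies (
    word1 TEXT, pos1 TEXT, word2 TEXT, pos2 TEXT,
    dependency TEXT, freq INTEGER
)
"""


def build_in_memory(rows):
    """Create the table in RAM and bulk-insert the counted rows."""
    mem = sqlite3.connect(":memory:")
    mem.execute(SCHEMA)
    mem.executemany("INSERT INTO dependencies VALUES (?, ?, ?, ?, ?, ?)", rows)
    mem.commit()
    return mem


def copy_to_disk(mem, path):
    """Copy the whole in-memory database to a file in one pass."""
    disk = sqlite3.connect(path)
    try:
        mem.backup(disk)  # stand-in for the apsw copy mentioned on the slide
    finally:
        disk.close()


# Illustrative rows only, not taken from the real parse.
sample_rows = [
    ("write", "VBP", "programmers", "NNS", "nsubj", 3),
    ("think", "VBP", "people", "NNS", "nsubj", 5),
]

db_path = os.path.join(tempfile.mkdtemp(), "wiki_dep.sqlite")
mem = build_in_memory(sample_rows)
copy_to_disk(mem, db_path)
mem.close()

# Re-open the file on disk to confirm the rows arrived.
check = sqlite3.connect(db_path)
row_count = check.execute("SELECT COUNT(*) FROM dependencies").fetchone()[0]
check.close()
print(row_count)
```

All inserts go to memory, so the disk is written only once, during the final copy.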
Wiki parsing
Even a single word pair can have different dependencies
Example query: word1 = think, word2 = people
  VB: verb, base form
  VBP: verb, non-3rd person singular present
  dobj: direct object
  dep: unspecified dependency
Can check the prepositions of a word
Example: word2 = floor, pos1 = IN, dep = pobj
  pobj: object of preposition
May be useful for statistics on HSMC students
Problem: searching is fast when all the quantities are specified, but is still slow when some are left arbitrary
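One possible way to speed up partially specified searches is a composite index on the columns that are usually fixed. The slides do not say which indexes exist, so the cause of the slowness is an assumption here. The sketch below indexes (word2, dependency) and asks SQLite whether the preposition query uses that index. The table name, index name, helper function, and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dependencies ("
    "word1 TEXT, pos1 TEXT, word2 TEXT, pos2 TEXT, dependency TEXT, freq INTEGER)"
)

# Illustrative rows only, not taken from the real parse.
conn.executemany(
    "INSERT INTO dependencies VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("on", "IN", "floor", "NN", "pobj", 7),
        ("to", "IN", "floor", "NN", "pobj", 2),
        ("think", "VBP", "people", "NNS", "nsubj", 5),
        ("think", "VB", "people", "NNS", "dobj", 1),
    ],
)

# Composite index for queries that fix word2 and dependency, leaving word1 free.
conn.execute("CREATE INDEX idx_word2_dep ON dependencies (word2, dependency)")


def top_heads(conn, word2, dependency, limit=10):
    """Return (word1, pos1, freq) rows for a given word2 and dependency, most frequent first."""
    return conn.execute(
        "SELECT word1, pos1, freq FROM dependencies "
        "WHERE word2 = ? AND dependency = ? "
        "ORDER BY freq DESC LIMIT ?",
        (word2, dependency, limit),
    ).fetchall()


# Ask the query planner how it would run the lookup; the last column is the description.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT word1, pos1, freq FROM dependencies "
    "WHERE word2 = ? AND dependency = ?",
    ("floor", "pobj"),
).fetchall()
uses_index = any("idx_word2_dep" in row[-1] for row in plan)

prepositions = top_heads(conn, "floor", "pobj")
print(prepositions, uses_index)
```

An index helps only when its leading columns are the ones fixed in the query, so each common query shape (for example word1 fixed, or pos1 and dependency fixed) would need its own index, at the cost of extra disk space and build time.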
Wiki parsing Another preposition example: bus
LDC – Linguistic Data Consortium
Received data from LDC
After purchasing data, a download link becomes available in your account
Web 1T corpus: ~24 GB
LDC – Linguistic Data Consortium
Data samples of the Web 1T corpus
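A minimal sketch of reading such n-gram samples. It assumes each line holds the space-separated tokens, then a tab, then the count. The function name and the sample lines (including their counts) are made up for illustration and are not taken from the corpus.

```python
from typing import Iterable, Iterator, Tuple


def parse_ngram_lines(lines: Iterable[str]) -> Iterator[Tuple[Tuple[str, ...], int]]:
    """Yield (tokens, count) pairs from lines shaped like 'w1 w2 ... wn<TAB>count'."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so the tokens stay together.
        ngram, _, count = line.rpartition("\t")
        yield tuple(ngram.split(" ")), int(count)


# Made-up sample lines in the assumed layout.
sample = ["serve as the\t100\n", "serve as a\t42\n"]
parsed = list(parse_ngram_lines(sample))
print(parsed)
```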