CC5212-1 Procesamiento Masivo de Datos, Otoño 2016. Lab 1: Wikipedia Word Count. Aidan Hogan


1 CC5212-1 Procesamiento Masivo de Datos, Otoño 2016. Lab 1: Wikipedia Word Count. Aidan Hogan, aidhog@gmail.com

2 PRACTICAL SESSION

3 Instructions http://aidanhogan.com/teaching/cc5212-1-2016/

4 RunWordCountInMemory -i C:\Users\ahogan\Documents\Teaching\Data\wikipedia\es\es-wiki-abstracts.txt.gz -igz -k 100
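The lab's RunWordCountInMemory code is not shown in these slides, so the following is only a minimal sketch of what an in-memory word count might look like. It assumes that -k means "report the k most frequent words" and that words are split on whitespace; the function name top_k_words and the sample lines are hypothetical.

```python
from collections import Counter


def top_k_words(lines, k):
    """Count whitespace-separated, lower-cased tokens and return the k most frequent."""
    counts = Counter()
    for line in lines:
        # Every distinct word lives in this one in-memory dictionary.
        counts.update(line.lower().split())
    return counts.most_common(k)


# Hypothetical sample input standing in for the Wikipedia abstracts.
sample = ["tengo que aprender", "tengo que practicar", "que"]
top = top_k_words(sample, 2)
print(top)
```

Memory use grows with the number of distinct words, not with the size of the input, which is why this approach can handle a large file.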

5 Why did it work in memory? We processed a lot of data, but there are not so many unique words, even though lots of new proper nouns keep appearing.
– Heaps' law: U(n) ≈ K·n^β, where U(n) is the number of unique words in a text of n words
– For English text: K ≈ 10, β ≈ 0.6
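To get a feel for Heaps' law, the estimate can be evaluated for a few input sizes. This is a rough sketch using the English-text constants from the slide (K ≈ 10, β ≈ 0.6); the function name heaps_estimate is hypothetical, and real corpora (including the Spanish abstracts) will have somewhat different constants.

```python
def heaps_estimate(n, K=10.0, beta=0.6):
    """Estimated number of unique words in a text of n words: U(n) = K * n**beta."""
    return K * n ** beta


for n in (10**6, 10**9):
    print(f"n = {n:>13,}  estimated unique words = {heaps_estimate(n):,.0f}")
```

With β = 0.6, a thousand times more text gives only about 63 times more unique words (1000^0.6 ≈ 63), so the dictionary of counts grows much more slowly than the input.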

6 What if it doesn’t work in memory? How could we implement a word count (or a bigram count) using the hard disk for storage?

7 Most generic method: use sorting
Input: tengo que aprender más español tan pronto que puedo y tengo que tomar cada oportunidad para practicar como ahora
Sorted (distinct words): ahora aprender cada como español más oportunidad para practicar pronto puedo que tan tengo tomar y
Counted: ahora 1, aprender 1, cada 1, como 1, español 1, más 1, oportunidad 1, para 1, practicar 1, pronto 1, puedo 1, que 3, tan 1, tengo 2, tomar 1, y 1
Ordered by count: que 3, tengo 2, ahora 1, aprender 1, cada 1, como 1, español 1, más 1, oportunidad 1, para 1, practicar 1, pronto 1, puedo 1, tan 1, tomar 1, y 1
How can we use the disk to sort?
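The sort-then-count idea can be tried on the slide's sentence. This sketch does everything in memory purely for illustration; the point is that once the words are sorted, equal words sit next to each other, so counting needs only one pass. Variable names are hypothetical.

```python
from itertools import groupby

text = ("tengo que aprender más español tan pronto que puedo y tengo que "
        "tomar cada oportunidad para practicar como ahora")

# Step 1: sort, so that repeated words become adjacent.
words = sorted(text.split())

# Step 2: count each run of equal words in a single pass.
counts = [(word, sum(1 for _ in run)) for word, run in groupby(words)]

# Step 3: sort again, this time by descending count (ties broken by word).
by_count = sorted(counts, key=lambda wc: (-wc[1], wc[0]))

print(by_count[:3])
```

Only the two sorting steps need to see all the data at once, which is why the remaining question is how to sort on disk.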

8 -i C:\Users\ahogan\Documents\Teaching\Data\wikipedia\es\es-wiki-abstracts.txt.gz -igz -n 4 -o C:\Users\ahogan\Documents\Teaching\Data\wikipedia\es\es-wiki-abstracts-4grams.txt.gz -ogz
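The class that takes these arguments is not named on the slide. Judging by the output file name, it writes the 4-grams of the input, with -n presumably giving the n-gram length. As a rough sketch under that assumption, n-grams over whitespace-separated tokens could be produced as follows (the function name ngrams is hypothetical):

```python
def ngrams(tokens, n):
    """Yield every run of n consecutive tokens, joined by single spaces."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


line = "tengo que aprender más español"
four_grams = list(ngrams(line.split(), 4))
print(four_grams)
```

A line with fewer than n tokens yields nothing, and there are many more distinct 4-grams than distinct words, which is what makes an in-memory count impractical here.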

9 -i C:\Users\ahogan\Documents\Teaching\Data\wikipedia\es\es-wiki-abstracts-4grams.txt.gz -igz -o C:\Users\ahogan\Documents\Teaching\Data\wikipedia\es\es-wiki-abstracts-4grams-s.txt.gz -ogz -tmp C:\Users\ahogan\Documents\Teaching\Data\wikipedia\es\tmp\ -b 1000000

10 External Merge-Sort 1: Batch
Sort in batches.
Input on-disk (input size: n): bigram121 bigram42 bigram732 bigram42 | bigram123 bigram149 bigram42 bigram1294 | bigram123 bigram42 bigram6 bigram123
In-memory sort (batch size: b)
Output batches on-disk (⌈n/b⌉ batches):
– bigram42 bigram42 bigram121 bigram732
– bigram42 bigram123 bigram149 bigram1294
– bigram6 bigram42 bigram123 bigram123
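The batch phase can be sketched as follows. This is not the lab's implementation, only a minimal illustration that assumes one item per line and items without newline characters; the function name write_sorted_batches and the batch file names are hypothetical.

```python
import itertools
import os
import tempfile


def write_sorted_batches(items, batch_size, tmp_dir):
    """Read items batch_size at a time, sort each batch in memory, write it to its own file."""
    paths = []
    it = iter(items)
    while True:
        batch = list(itertools.islice(it, batch_size))  # at most batch_size items in memory
        if not batch:
            break
        batch.sort()
        path = os.path.join(tmp_dir, f"batch-{len(paths)}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(item + "\n" for item in batch)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    batch_paths = write_sorted_batches(["d", "b", "a", "c", "f", "e", "g"], 3, tmp)
    print(len(batch_paths), "batches written")
```

Seven items with a batch size of 3 give ⌈7/3⌉ = 3 batch files, the last one holding a single item.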

11 External Merge-Sort 2: Merge
Input batches on-disk (⌈n/b⌉ batches):
– bigram42 bigram42 bigram121 bigram732
– bigram42 bigram123 bigram149 bigram1294
– bigram6 bigram42 bigram123 bigram123
In-memory sort
Sorted output (output size: n): bigram6 bigram42 bigram42 bigram42 bigram42 bigram121 bigram123 bigram123 bigram123 bigram149 bigram732 bigram1294
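The merge phase can be sketched with Python's heapq.merge, which lazily merges already-sorted inputs while holding only one pending item per input in memory. Again this is an illustration rather than the lab's code; merge_sorted_batches, demo, and the file names are hypothetical, and the letters stand in for bigrams.

```python
import heapq
import os
import tempfile


def merge_sorted_batches(paths, out_path):
    """Merge sorted one-item-per-line files into a single sorted file."""
    files = [open(p, encoding="utf-8") for p in paths]
    try:
        # Strip newlines so items compare exactly as they did when the batches were sorted.
        streams = [(line.rstrip("\n") for line in f) for f in files]
        with open(out_path, "w", encoding="utf-8") as out:
            for item in heapq.merge(*streams):
                out.write(item + "\n")
    finally:
        for f in files:
            f.close()


def demo():
    """Write three small sorted batches, merge them, and return the merged items."""
    batches = [["a", "a", "c", "f"], ["a", "d", "e", "g"], ["a", "b", "d", "d"]]
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i, batch in enumerate(batches):
            path = os.path.join(tmp, f"batch-{i}.txt")
            with open(path, "w", encoding="utf-8") as f:
                f.writelines(item + "\n" for item in batch)
            paths.append(path)
        out_path = os.path.join(tmp, "sorted.txt")
        merge_sorted_batches(paths, out_path)
        with open(out_path, encoding="utf-8") as f:
            return f.read().split()


merged = demo()
print(merged)
```

All batch files are open at the same time here, which is exactly what stops working once there are too many batches.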

12 Counting bigrams is then easy?
Sorted input: bigram6 bigram42 bigram42 bigram42 bigram42 bigram121 bigram123 bigram123 bigram123 bigram149 bigram732 bigram1294
Counts: bigram6, 1; bigram42, 4; bigram121, 1; bigram123, 3; bigram149, 1; bigram732, 1; bigram1294, 1
Could use merge-sort again to order by occurrence!
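Counting over sorted input needs only the current item and its running count in memory, so it works as a stream over a file of any size. A minimal sketch, using integers as stand-ins for the slide's bigram identifiers (the generator name count_runs is hypothetical):

```python
def count_runs(sorted_items):
    """Yield (item, count) for each run of equal items in an already-sorted stream."""
    it = iter(sorted_items)
    try:
        current = next(it)
    except StopIteration:
        return  # empty input: nothing to count
    count = 1
    for item in it:
        if item == current:
            count += 1
        else:
            yield current, count
            current, count = item, 1
    yield current, count


sorted_ids = [6, 42, 42, 42, 42, 121, 123, 123, 123, 149, 732, 1294]
counts = list(count_runs(sorted_ids))

# Ordering by occurrence: done in memory here; at scale this would be another external sort.
by_occurrence = sorted(counts, key=lambda pair: -pair[1])
print(by_occurrence[:2])
```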

13 Does external merge-sorting scale?
If you have too many batches to read simultaneously, the disk will go nuts:
– Use lots of main memory to reduce the batch count
– Only merge k at a time
Any problem with external merge-sorting as we scale really high? If we have n batches and merge them k at a time, how many passes will we need? Any solution(s)?
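For the question about passes: each pass merges groups of k batches into one, so n batches become ⌈n/k⌉, and repeating until a single batch remains takes ⌈log_k n⌉ passes. A small sketch that counts the passes by simulation, which avoids floating-point logarithms (the function name merge_passes is hypothetical):

```python
def merge_passes(num_batches, k):
    """Number of passes needed to merge num_batches sorted batches, k at a time."""
    if k < 2:
        raise ValueError("k must be at least 2")
    passes = 0
    while num_batches > 1:
        num_batches = -(-num_batches // k)  # ceiling division: batches left after this pass
        passes += 1
    return passes


for n, k in [(1000, 10), (1000, 2), (12, 4)]:
    print(f"{n} batches, merging {k} at a time: {merge_passes(n, k)} passes")
```

Every pass reads and rewrites all of the data, so a smaller k reduces seeking within a pass but multiplies the total disk I/O.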

14 Does external merge-sorting scale? – Use multiple machines!

