
1 Labs 3: Bi-Grams

2 Step 1: Get Started
Login:
- Username: nombre\cc5212
- Password on board
http://aidanhogan.com/teaching/cc5212-1/mdp-lab3.zip
- C:/Program Files (x86)/eclipse/ (in Spanish)
- File > Import > …
http://aidanhogan.com/teaching/cc5212-1/ExternalMergeSort.java
- Only if you weren’t here last week (half marks)
Use es-abstracts.txt.gz from the last time

3 Scale!
… knowing how to build a scalable system over many machines requires knowing how to build a scalable system on one machine first.
How can we count a large set of bi-grams on one machine?
They won’t fit in memory, so what do we do?

4 Phrasing
Bi-grams!
- Phrase of two adjacent words
When we counted words …
- Counting done in memory
- Merging done in memory
- Faster on one machine!
More bi-grams than single words!
- So how can we scale the computation?
- Won’t fit in memory! (or will it?)
Tengo a? Tengo de? Tengo que?
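To make "phrase of two adjacent words" concrete, here is a minimal sketch of pairing up neighbouring words. The class and method names are hypothetical and are not part of the lab code (the lab uses org.mdp.cli.ExtractBigrams, whose internals are not shown in these slides).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; NOT the lab's org.mdp.cli.ExtractBigrams.
class BigramSketch {
    // Returns every pair of adjacent words, joined by a single space.
    static List<String> bigrams(List<String> words) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            out.add(words.get(i) + " " + words.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [tengo que, que ir]
        System.out.println(bigrams(List.of("tengo", "que", "ir")));
    }
}
```

A text of w words yields w - 1 bi-grams, and the number of distinct bi-grams is typically much larger than the number of distinct words, which is why the memory question arises.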

5 Step 2: Fix Some Noise
…
org.mdp.wc.WordParserIterator
loadNext()

6 Step 2: Extract Bigrams to a File
org.mdp.cli.ExtractBigrams
- Small file for testing:
    -i [path]\es-abstracts.txt.gz -igz -o [path]\bigrams-10k.txt -n 10000
- Large file for real run (GZipped):
    -i [path]\es-abstracts.txt.gz -igz -o [path]\bigrams.txt.gz -ogz
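The -igz / -ogz flags indicate GZipped input and output. How the lab code handles them is not shown on the slides; the following is only a sketch, using the standard java.util.zip classes, of how a reader or writer can be wrapped depending on such a flag. The class name, method names and the boolean parameter are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative sketch only; names are hypothetical, not the lab's API.
class GzipIoSketch {
    // Opens a UTF-8 line reader, decompressing on the fly if gzipped is true.
    static BufferedReader openReader(Path file, boolean gzipped) throws IOException {
        InputStream is = Files.newInputStream(file);
        if (gzipped) {
            is = new GZIPInputStream(is);
        }
        return new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
    }

    // Opens a UTF-8 writer, compressing on the fly if gzipped is true.
    // Closing the writer also finishes the GZip stream.
    static PrintWriter openWriter(Path file, boolean gzipped) throws IOException {
        OutputStream os = Files.newOutputStream(file);
        if (gzipped) {
            os = new GZIPOutputStream(os);
        }
        return new PrintWriter(new BufferedWriter(new OutputStreamWriter(os, StandardCharsets.UTF_8)));
    }
}
```

Forgetting the flag on a .gz input means reading compressed bytes as text, which usually shows up as garbage lines rather than an error.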

7 Step 3: Try In-memory Count
org.mdp.cli.RunBigramCountInMemory
    -i [path]\bigrams.txt.gz -igz -k 500
Will it run for the big file?
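For comparison with the external approach, an in-memory count can be sketched with a HashMap. This is not the code of RunBigramCountInMemory, which the slides do not show; it assumes -k means "report the k most frequent bi-grams", and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; NOT the lab's RunBigramCountInMemory.
class InMemoryCountSketch {
    // One map entry per DISTINCT bigram: this map is what has to fit in memory.
    static Map<String, Integer> count(Iterable<String> bigrams) {
        Map<String, Integer> counts = new HashMap<>();
        for (String b : bigrams) {
            counts.merge(b, 1, Integer::sum);
        }
        return counts;
    }

    // Returns the k entries with the highest counts (fewer if the map is smaller).
    static List<Map.Entry<String, Integer>> topK(Map<String, Integer> counts, int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries.subList(0, Math.min(k, entries.size()));
    }
}
```

The memory used grows with the number of distinct bi-grams, not with the file size, so whether the big file runs depends on the heap available and on how many distinct bi-grams it contains.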

8 External Merge-Sort 1: Batch
Sort in batches
Input on-disk (Input size: n):
    bigram121 bigram42 bigram732 bigram42 bigram123 bigram149 bigram42 bigram1294 bigram123 bigram42 bigram6 bigram123
In-memory sort (Batch size b)
Output batches on-disk (⌈n/b⌉ batches):
    bigram42 bigram121 bigram732
    bigram42 bigram123 bigram149 bigram1294
    bigram6 bigram42 bigram123

9 Step 4: Implement Batching
org.mdp.cli.ExternalMergeSort
Implement writeSortedBatches()
- Load batchSize lines into memory
    ArrayList<String> list
- When list.size() == batchSize
    Sort the list, then dump the data to a batch
      String batchName = getBatchFileName(tmpFolder, batchId);
      PrintWriter batch = openBatchFileForWriting(batchName);
    Clear the list and close the batch file
    Add the batch-name to batchNames()
- Do some logging!
- Forget about reverseOrder for now
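The batching steps above can be sketched as follows. This is not the lab solution: it does not use the lab helpers getBatchFileName() and openBatchFileForWriting(), whose signatures are not shown here, and substitutes plain java.nio.file calls. The batch file naming ("batch-0.txt", ...) is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only; NOT the lab's ExternalMergeSort.
class BatchSketch {
    // Reads the input in chunks of batchSize lines, sorts each chunk in memory
    // and writes it to its own file. Returns the batch files in creation order.
    static List<Path> writeSortedBatches(BufferedReader in, Path tmpFolder, int batchSize)
            throws IOException {
        List<Path> batchNames = new ArrayList<>();
        List<String> lines = new ArrayList<>(batchSize);
        String line;
        while ((line = in.readLine()) != null) {
            lines.add(line);
            if (lines.size() == batchSize) {
                batchNames.add(dump(lines, tmpFolder, batchNames.size()));
                lines.clear();
            }
        }
        // The last batch is usually smaller than batchSize; do not lose it.
        if (!lines.isEmpty()) {
            batchNames.add(dump(lines, tmpFolder, batchNames.size()));
        }
        return batchNames;
    }

    // Sorts the lines and writes them to a new batch file (hypothetical naming).
    private static Path dump(List<String> lines, Path tmpFolder, int batchId) throws IOException {
        Collections.sort(lines);
        Path batch = tmpFolder.resolve("batch-" + batchId + ".txt");
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(batch, StandardCharsets.UTF_8))) {
            for (String l : lines) {
                out.println(l);
            }
        }
        return batch;
    }
}
```

Only one batch is held in memory at a time, so peak memory depends on batchSize rather than on the input size n.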

10 Step 5: Implement Merging
org.mdp.cli.ExternalMergeSort
Implement mergeSortedBatches()
- Open files for reading
    BufferedReader[] brs = new BufferedReader[batches.size()];
- Read a line from each file into memory
- Select the lowest line (from file i), write to out
    Load the next line from file i
- Do some logging!
- Forget about reverseOrder for now
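The merge steps above can be sketched as follows. This is not the lab solution: for the sake of a self-contained example it assumes the batches are passed in as already-opened readers, whereas the lab method presumably opens the batch files itself. It uses a simple linear scan over the current lines; a java.util.PriorityQueue would be an alternative when there are many batches.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;

// Illustrative sketch only; NOT the lab's ExternalMergeSort.
class MergeSketch {
    // Merges already-sorted inputs into one sorted output, holding
    // only one line per input in memory at any time.
    static void mergeSortedBatches(List<BufferedReader> batches, PrintWriter out)
            throws IOException {
        BufferedReader[] brs = batches.toArray(new BufferedReader[0]);
        String[] heads = new String[brs.length];
        // Read the first line from each batch (null means the batch is empty).
        for (int i = 0; i < brs.length; i++) {
            heads[i] = brs[i].readLine();
        }
        while (true) {
            // Select the batch i whose current line is the lowest.
            int min = -1;
            for (int i = 0; i < heads.length; i++) {
                if (heads[i] != null && (min == -1 || heads[i].compareTo(heads[min]) < 0)) {
                    min = i;
                }
            }
            if (min == -1) {
                break; // every batch is exhausted
            }
            out.println(heads[min]);
            // Load the next line from the batch that was just consumed.
            heads[min] = brs[min].readLine();
        }
        out.flush();
    }
}
```

Duplicates across batches come out next to each other, which is what makes the later counting step a single pass.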

11 External Merge-Sort 2: Merge
Input batches on-disk (⌈n/b⌉ batches):
    bigram42 bigram121 bigram732
    bigram42 bigram123 bigram149 bigram1294
    bigram6 bigram42 bigram123
In-memory sort
Sorted output (Output size: n):
    bigram6 bigram42 bigram121 bigram123 bigram149 bigram732 bigram1294

12 Step 6: Try Sorting 10k Bigrams
org.mdp.cli.ExternalMergeSort
    -i [path]\bigrams-10k.txt -o [path]\bigrams-10k-sorted.txt -b 3000
If successful, try sorting the large file! Use batches of size 250000. (Don’t forget -igz / -ogz)
If not successful, try debugging. If stuck, ask me.

13 Counting bigrams is then easy?
Sorted input:
    bigram6 bigram42 bigram121 bigram123 bigram149 bigram732 bigram1294
Counts:
    bigram6, 1
    bigram42, 4
    bigram121, 1
    bigram123, 3
    bigram149, 1
    bigram732, 1
    bigram1294, 1
Could use merge-sort again to order by occurrence!

14 Step 7: Implement Counting
org.mdp.cli.CountDuplicates
Implement countDuplicates()
- Store two lines: current and last
- If current line same as last line, increment counter
- If current line different from last line, print count and line to a file, reset count
Use String sortNum = StringWithNumber.getSortableNumber(dupes);
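The counting steps above can be sketched as a single pass over sorted input. This is not the lab solution: the output format of StringWithNumber.getSortableNumber() is not shown on the slides, so the sketch substitutes a hypothetical zero-padded count (so that counts compare correctly as strings), and the tab-separated output layout is likewise an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Locale;

// Illustrative sketch only; NOT the lab's CountDuplicates.
class CountSketch {
    // Input must be sorted so that equal lines are adjacent.
    // Writes one "count<TAB>line" record per distinct line.
    static void countDuplicates(BufferedReader in, PrintWriter out) throws IOException {
        String last = null;
        int dupes = 0;
        String current;
        while ((current = in.readLine()) != null) {
            if (current.equals(last)) {
                dupes++;
            } else {
                if (last != null) {
                    out.println(sortable(dupes) + "\t" + last);
                }
                last = current;
                dupes = 1;
            }
        }
        // The final run has no following line to trigger its output.
        if (last != null) {
            out.println(sortable(dupes) + "\t" + last);
        }
        out.flush();
    }

    // Hypothetical stand-in for the lab helper: zero-pads to ten digits.
    static String sortable(int n) {
        return String.format(Locale.ROOT, "%010d", n);
    }
}
```

Padding matters because a plain string comparison would otherwise place "10" before "9".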

15 Step 8: Try Counting 10k Bigrams
org.mdp.cli.CountDuplicates
    -i [path]\bigrams-10k-sorted.txt -o [path]\bigrams-10k-counts.txt
If successful, try counting the large file! (Don’t forget -igz / -ogz)
If not successful, try debugging. If stuck, ask me.

16 Step 9: Implement Reverse Order
org.mdp.cli.ExternalMergeSort
In writeSortedBatches() & externalMergeSort()
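One way to support reverse order, sketched here under assumptions, is to derive a single Comparator from the flag and use it both when sorting a batch and when choosing the next line during the merge. If only one of the two places is reversed, the merged output will not be sorted. The class and method names are hypothetical and not taken from the lab code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only; names are hypothetical, not the lab's API.
class ReverseOrderSketch {
    // A single source of truth for the ordering.
    static Comparator<String> order(boolean reverseOrder) {
        return reverseOrder ? Comparator.<String>reverseOrder() : Comparator.<String>naturalOrder();
    }

    // Batch phase: sort a copy of the lines with the chosen ordering.
    static List<String> sortBatch(List<String> lines, boolean reverseOrder) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(order(reverseOrder));
        return sorted;
    }

    // Merge phase: index of the line to write next, or -1 if all are null.
    static int pickNext(String[] heads, boolean reverseOrder) {
        Comparator<String> cmp = order(reverseOrder);
        int best = -1;
        for (int i = 0; i < heads.length; i++) {
            if (heads[i] != null && (best == -1 || cmp.compare(heads[i], heads[best]) < 0)) {
                best = i;
            }
        }
        return best;
    }
}
```

With zero-padded counts at the start of each line, reverse string order puts the most frequent bi-grams first.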

17 Step 10: Merge-Sort the Counts
org.mdp.cli.ExternalMergeSort
    -i [path]\bigrams-10k-counts.txt -o [path]\bigrams-10k-counts-sorted.txt -b 3000 -r
If successful, try sorting the large file! Use batches of size 250000. (Don’t forget -igz / -ogz)
If not successful, try debugging. If stuck, ask me.

18 Step 11: Get the top 500
org.mdp.cli.CopyLinesFromFile
    -i [path]\bigrams-counts-sorted.txt.gz -igz -o [path]\bigrams-counts-sorted-top500.txt -n 500
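Because the counts file is already sorted in descending order, the top 500 are simply its first 500 lines. A minimal sketch of that idea follows; it assumes -n means "copy at most n lines" and is not the code of CopyLinesFromFile, which the slides do not show.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;

// Illustrative sketch only; NOT the lab's CopyLinesFromFile.
class HeadSketch {
    // Copies at most n lines from in to out; returns how many were copied.
    static int copyFirstLines(BufferedReader in, PrintWriter out, int n) throws IOException {
        int copied = 0;
        String line;
        while (copied < n && (line = in.readLine()) != null) {
            out.println(line);
            copied++;
        }
        out.flush();
        return copied;
    }
}
```

The loop stops reading as soon as n lines are copied, so the rest of the large file is never decompressed.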

19 Final Step: Profiling (Optional)
Java Interactive Profiler
Run ExternalMergeSort for a large file
Use VM arguments: -javaagent:lib\profile.jar -noverify
When finished, check profile.txt in your project’s root directory
See if you can optimise something in “Most Expensive Methods”

20 Final Final Steps
Remove tmp/ folder from mdp-lab3/ folder and recycle bin (Shift + Del)
I set up tareas.

