CC5212-1 P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2016 Lab 1: Wikipedia Word Count Aidan Hogan

Slides:



Advertisements
Similar presentations
CS 400/600 – Data Structures External Sorting.
Advertisements

Sorting Really Big Files Sorting Part 3. Using K Temporary Files Given  N records in file F  M records will fit into internal memory  Use K temp files,
P3- Represent how data flows around a computer system
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture VII: 2014/04/21.
Sorting. Sorting Considerations We consider sorting a list of records, either into ascending or descending order, based upon the value of some field of.
1 Merge Sort Review of Sorting Merge Sort. 2 Sorting Algorithms Selection Sort uses a priority queue P implemented with an unsorted sequence: –Phase 1:
Conjunctions. Conjunctions that require the Indicative como - given that puesto que - since ya que - due to the fact that como - given that puesto que.
MergeSort (Example) - 1. MergeSort (Example) - 2.
Disk Access Model. Using Secondary Storage Effectively In most studies of algorithms, one assumes the “RAM model”: –Data is in main memory, –Access to.
CPSC 231 Sorting Large Files (D.H.)1 LEARNING OBJECTIVES Sorting of large files –merge sort –performance of merge sort –multi-step merge sort.
Computer Hardware.
Quick Review of Apr 22 material Sections 13.1 through 13.3 in text Query Processing: take an SQL query and: –parse/translate it into an internal representation.
External Sorting Access to secondary storage is orders of magnitude slower than memory access. Minimize access to secondary storage (tape or disk).
Using Secondary Storage Effectively In most studies of algorithms, one assumes the "RAM model“: –The data is in main memory, –Access to any item of data.
CSE 326: Data Structures Sorting Ben Lerner Summer 2007.
1 CSE 326: Data Structures: Sorting Lecture 17: Wednesday, Feb 19, 2003.
External Sorting Chapter 13.. Why Sort? A classic problem in computer science! Data requested in sorted order  e.g., find students in increasing gpa.
CSC 213 – Large Scale Programming. Today’s Goals  Review past discussion of data sorting algorithms  Weaknesses of past approaches & when we use them.
Sorting in Linear Time Lower bound for comparison-based sorting
CSE 373 Data Structures Lecture 15
External Sorting Problem: Sorting data sets too large to fit into main memory. –Assume data are stored on disk drive. To sort, portions of the data must.
Storage units. We need Storage Units to save our data.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2015 Lecture 2: Distributed Systems I Aidan Hogan
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Wrap-Up.
Data Flow Diagram A method used to analyze a system in a structured way Used during: Analysis stage: to describe the current system Design stage: to describe.
Configuration.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2015 Lecture 8: Information Retrieval II Aidan Hogan
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture VI: 2014/04/14.
GIANFRANCO BARBALACE Y FRANCO CAVIGLIA CATENAZZI1ºB Types and components of a computer systems.
Binary Merge-Sort Merge-Sort(A,i,j) 01 if (i < j) then 02 m = (i+j)/2; 03 Merge-Sort(A,i,m); 04 Merge-Sort(A,m+1,j); 05 Merge(A,i,m,j) Merge-Sort(A,i,j)
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture IV: 2014/03/31.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2015 Lecture 11: Conclusion Aidan Hogan
Presenting! Chapter #3. What are we Going to Cover Today? Nuts / Bolts Get to the Meat & Potatoes The Summary of the Chapter Q & A.
1 External Sorting. 2 Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing gpa order.
Labs 3: Bi-Grams. Step 1: Get Started Login: – Username: nombre\cc5212 – Password on board – C:/Program.
Heapsort Idea: two phases: 1. Construction of the heap 2. Output of the heap For ordering number in an ascending sequence: use a Heap with reverse.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan
MapReduce Algorithm Design Based on Jimmy Lin’s slides
Lecture 28 CSE 331 Nov 9, Mini project report due WED.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture IX: 2014/05/05.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture V: 2014/04/07.
Chapter 15 A External Methods. © 2004 Pearson Addison-Wesley. All rights reserved 15 A-2 A Look At External Storage External storage –Exists beyond the.
CompSci 100E 30.1 Other N log N Sorts  Binary Tree Sort  Basic Recipe o Insert into binary search tree (BST) o Do Inorder Traversal  Complexity o Create:
1 External-Memory Sorting External-memory algorithms When data do not fit in main-memory External-memory sorting Rough idea: sort peaces that fit in main-
CISC105 – General Computer Science Class 8 – 06/28/2006.
Parts and Operation of a Computer
PageRank. Un Motor de Búsqueda “obama” PageRank Model: Final Version The Web: a directed graph Vertices (pages) Edges (links) fa eb dc.
1 External-Memory Sorting External-memory algorithms When data do not fit in main-memory External-memory sorting Rough idea: sort peaces that fit in main-
1 Algorithms Searching and Sorting Algorithm Efficiency.
بسم الله الرحمن الرحيم شرح جميع طرق الترتيب باللغة العربية
Sorting.
CC Procesamiento Masivo de Datos Otoño Lecture 8: Apache Spark (Core)
Teachers use the IRLA to determine the reading behaviors that should be expected at each color level. These reading behaviors are in line with national.
Design Patterns for SSIS Performance
Entry Ticket: Algorithms and Program Construction
Dear Parents and Guardians: Ms. Hubbard, ENL Teacher, PKHS
Chapter 2 (16M) Sorting and Searching
CC Procesamiento Masivo de Datos Otoño Lecture 4: MapReduce/Hadoop II
MapReduce Algorithm Design Adapted from Jimmy Lin’s slides.
Lecture 29 CSE 331 Nov 8, 2017.
Merge Sort 2/23/ :15 PM Merge Sort 7 2   7  2   4  4 9
English Language Development II
Lecture 30 CSE 331 Nov 12, 2012.
CENG 351 Data Management and File Structures
CSE 589 Applied Algorithms Spring 1999
Merge Sort Procedure MergeSort (L = a1, a2, , an) if (n > 1)
Merge Sort 5/30/2019 7:52 AM Merge Sort 7 2   7  2  2 7
Computer Electronic device Accepts data - input
Calibration Method.
External Sorting Dina Said
Presentation transcript:

CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2016 Lab 1: Wikipedia Word Count Aidan Hogan

PRÁCTICA

Instructions

RunWordCountInMemory -i C:\Users\ahogan\Documents\Teaching\Data\ wikipedia\es\es-wiki-abstracts.txt.gz -igz -k 100

Why did it work in memory? We processed a lot of data. Why did it work in memory? Not so many unique words … – but lots of new proper nouns – Heap’s law: – U(n) ≈ Kn β – English text K ≈ 10 β ≈ 0.6

What if it doesn’t work in memory? How could we implement a word- count (or a bi-gram count) using the hard disk for storage?

Most generic method: use sorting tengo que aprender más español tan pronto que puedo y tengo que tomar cada oportunidad para practicar como ahora aprender cada como español más oportunidad para practicar pronto puedo que tan tengo tomar y ahora1 aprender1 cada1 como1 español1 más1 oportunidad1 para1 practicar1 pronto1 puedo1 que3 tan1 tengo2 tomar1 y1 que3 tengo2 ahora1 aprender1 cada1 como1 español1 más1 oportunidad1 para1 practicar1 pronto1 puedo1 tan1 tomar1 y1 How can we use the disk to sort?

-i C:\Users\ahogan\Documents\Teaching\Data\wi kipedia\es\es-wiki-abstracts.txt.gz -igz -n 4 -o C:\Users\ahogan\Documents\Teaching\Data\wi kipedia\es\es-wiki-abstracts-4grams.txt.gz -ogz

-i C:\Users\ahogan\Documents\Teaching\Data\ wikipedia\es\es-wiki-abstracts-4grams.txt.gz - igz -o C:\Users\ahogan\Documents\Teaching\Data\ wikipedia\es\es-wiki-abstracts-4grams-s.txt.gz -ogz -tmp C:\Users\ahogan\Documents\Teaching\Data\ wikipedia\es\tmp\ -b

External Merge-Sort 1: Batch Sort in batches bigram121 bigram42 bigram732 bigram42 bigram123 bigram149 bigram42 bigram1294 bigram123 bigram42 bigram6 bigram123 bigram42 bigram121 bigram732 Input on-disk (Input size: n) In-memory sort (Batch size b) Output batches on-disk ( ⌈ n/b ⌉ batches) bigram42 bigram121 bigram732 bigram42 bigram123 bigram149 bigram1294 bigram42 bigram123 bigram149 bigram1294 bigram6 bigram42 bigram123 bigram6 bigram42 bigram123

External Merge-Sort 2: Merge bigram6 bigram42 bigram121 bigram123 bigram149 bigram732 bigram1294 In-memory sortInput batches on-disk ( ⌈ n/b ⌉ batches) bigram42 bigram121 bigram732 bigram42 bigram123 bigram149 bigram1294 bigram6 bigram42 bigram123 Sorted output (Output size: n)

Counting bigrams is then easy? bigram6 bigram42 bigram121 bigram123 bigram149 bigram732 bigram1294 bigram6, 1 bigram42, 4 bigram121, 1 bigram 123, 3 bigram149, 1 bigram732, 1 bigram1294, 1 Could use merge-sort again to order by occurrence!

Does external merge-sorting scale? If you have too many batches to read simultaneously, disk will go nuts – Use lots of main-memory to reduce batch count – Only merge k at a time Any problem with external merge-sorting as we scale really high? If we have n batches and merge them k at a time, how many passes will we need? Any solution(s)?

Does external merge-sorting scale? – Use multiple machines!