

Labs 3: Bi-Grams

Step 1: Get Started
- Login:
  - Username: nombre\cc5212
  - Password: on the board
- C:/Program Files (x86)/eclipse/ (in Spanish)
- File > Import > …
- Only if you weren't here last week (half marks)
- Use es-abstracts.txt.gz from last time

Scale!
… knowing how to build a scalable system over many machines requires knowing how to build a scalable system on one machine first.
How can we count a large set of bi-grams on one machine? They won't fit in memory, so what do we do?

Phrasing
- Bi-grams!
  - Phrase of two adjacent words
- When we counted words …
  - Counting done in memory
  - Merging done in memory
  - Faster on one machine!
- More bi-grams than single words!
  - So how can we scale the computation?
  - Won't fit in memory! (or will it?)
- Example bi-grams: Tengo a? Tengo de? Tengo que?
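To make "phrase of two adjacent words" concrete, here is a minimal sketch of bi-gram extraction. It assumes lower-casing and whitespace tokenisation, which may differ from what the lab's WordParserIterator does. The class and method names are hypothetical and not part of the lab code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BigramSketch {
    // Returns every pair of adjacent words in the sentence, in reading order.
    // Assumption: words are lower-cased and separated by whitespace.
    static List<String> bigrams(String sentence) {
        List<String> result = new ArrayList<>();
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] words = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            result.add(words[i] + " " + words[i + 1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [tengo que, que estudiar]
        System.out.println(bigrams("Tengo que estudiar"));
    }
}
```

A sentence of w words yields w - 1 bi-grams, and there are far more distinct word pairs than distinct words, which is why the in-memory approach becomes doubtful.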

Step 2: Fix Some Noise …
org.mdp.wc.WordParserIterator
loadNext()

Step 2: Extract Bigrams to a File
org.mdp.cli.ExtractBigrams
- Small file for testing:
    -i [path]\es-abstracts.txt.gz -igz -o [path]\bigrams-10k.txt -n …
- Large file for real run (GZipped):
    -i [path]\es-abstracts.txt.gz -igz -o [path]\bigrams.txt.gz -ogz

Step 3: Try In-memory Count
org.mdp.cli.RunBigramCountInMemory
    -i [path]\bigrams.txt.gz -igz -k 500
Will it run for the big file?
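For comparison with the external approach, an in-memory count can be sketched as a hash map from bi-gram to count, followed by a sort of the entries. This is only an illustration and not the code of RunBigramCountInMemory. The names are hypothetical, and reading -k as "number of top bi-grams to report" is an assumption the slides do not confirm.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InMemoryCountSketch {
    // Counts each distinct bigram. Memory use grows with the number of
    // distinct bigrams, which is what may fail on the large file.
    static Map<String, Integer> count(Iterable<String> bigrams) {
        Map<String, Integer> counts = new HashMap<>();
        for (String bigram : bigrams) {
            counts.merge(bigram, 1, Integer::sum);
        }
        return counts;
    }

    // Returns the k most frequent entries; ties are broken alphabetically.
    static List<Map.Entry<String, Integer>> topK(Map<String, Integer> counts, int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        int limit = Math.max(0, Math.min(k, entries.size()));
        return new ArrayList<>(entries.subList(0, limit));
    }
}
```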

External Merge-Sort 1: Batch
Sort in batches.
[Figure: input on disk (input size: n) -> in-memory sort (batch size: b) -> output batches on disk (⌈n/b⌉ batches). Example input: bigram121, bigram42, bigram732, bigram42, bigram123, bigram149, bigram42, bigram1294, bigram123, bigram42, bigram6, bigram123.]

Step 4: Implement Batching
org.mdp.cli.ExternalMergeSort
Implement writeSortedBatches()
- Load batchSize lines into memory: ArrayList list
- When list.size() == batchSize:
  - Sort the list and dump the data to a batch:
      String batchName = getBatchFileName(tmpFolder, batchId);
      PrintWriter batch = openBatchFileForWriting(batchName);
  - Clear the list and close the batch file
  - Add the batch-name to batchNames()
- Do some logging!
- Forget about reverseOrder for now
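The batching step can be sketched as follows. This is a self-contained illustration and not the lab's ExternalMergeSort: the class name, the dumpBatch helper, and the batch file naming are hypothetical stand-ins for getBatchFileName and openBatchFileForWriting, whose signatures the slides do not give. Note the final partial batch, which is easy to forget.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class BatchSortSketch {
    // Reads lines from input, sorts them in memory batchSize lines at a time,
    // and writes each sorted batch to its own file in tmpFolder.
    static List<Path> writeSortedBatches(BufferedReader input, Path tmpFolder, int batchSize)
            throws IOException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Path> batchNames = new ArrayList<>();
        List<String> list = new ArrayList<>();
        String line;
        while ((line = input.readLine()) != null) {
            list.add(line);
            if (list.size() == batchSize) {
                batchNames.add(dumpBatch(list, tmpFolder, batchNames.size()));
                list.clear();
            }
        }
        if (!list.isEmpty()) {
            // The last batch is usually smaller than batchSize.
            batchNames.add(dumpBatch(list, tmpFolder, batchNames.size()));
        }
        return batchNames;
    }

    // Hypothetical helper: sorts the batch and writes it to batch-<id>.txt.
    static Path dumpBatch(List<String> list, Path tmpFolder, int batchId) throws IOException {
        Collections.sort(list);
        Path batchFile = tmpFolder.resolve("batch-" + batchId + ".txt");
        try (PrintWriter batch = new PrintWriter(
                Files.newBufferedWriter(batchFile, StandardCharsets.UTF_8))) {
            for (String s : list) {
                batch.println(s);
            }
        }
        System.err.println("Wrote batch " + batchId + " with " + list.size() + " lines");
        return batchFile;
    }
}
```

With n input lines and batch size b, this writes ⌈n/b⌉ files, and only one batch is ever held in memory.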

Step 5: Implement Merging
org.mdp.cli.ExternalMergeSort
Implement mergeSortedBatches()
- Open files for reading:
    BufferedReader[] brs = new BufferedReader[batches.size()];
- Read a line from each file into memory
- Select the lowest line (from file i), write to out
  - Load the next line from file i
- Do some logging!
- Forget about reverseOrder for now
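The merge steps above can be sketched like this, keeping one current line per batch and scanning for the lowest. It is an illustration under the assumption that every reader yields lines already in sorted order. The class name is hypothetical and the method is not the lab's actual mergeSortedBatches().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;

class MergeSketch {
    // Merges already-sorted inputs into out, holding one line per input in memory.
    static void mergeSortedBatches(BufferedReader[] brs, PrintWriter out) throws IOException {
        // heads[i] is the next unwritten line of batch i, or null when batch i is exhausted.
        String[] heads = new String[brs.length];
        for (int i = 0; i < brs.length; i++) {
            heads[i] = brs[i].readLine();
        }
        while (true) {
            // Select the lowest current line across all batches.
            int lowest = -1;
            for (int i = 0; i < heads.length; i++) {
                if (heads[i] != null
                        && (lowest == -1 || heads[i].compareTo(heads[lowest]) < 0)) {
                    lowest = i;
                }
            }
            if (lowest == -1) {
                break; // every batch is exhausted
            }
            out.println(heads[lowest]);
            // Load the next line from the batch that was just written.
            heads[lowest] = brs[lowest].readLine();
        }
        out.flush();
    }
}
```

The linear scan costs one pass over the batches per output line. With many batches, a java.util.PriorityQueue keyed on the current line is a common alternative.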

External Merge-Sort 2: Merge
[Figure: input batches on disk (⌈n/b⌉ batches) -> in-memory sort -> sorted output (output size: n). Sorted output shown in the example: bigram6, bigram42, bigram121, bigram123, bigram149, bigram732, bigram1294.]

Step 6: Try Sorting 10k Bigrams
org.mdp.cli.ExternalMergeSort
    -i [path]\bigrams-10k.txt -o [path]\bigrams-10k-sorted.txt -b 3000
- If successful, try sorting the large file!
  - Use batches of size … (don't forget -igz / -ogz)
- If not successful, try debugging. If stuck, ask me.

Counting bigrams is then easy?
[Figure: sorted input (bigram6, bigram42, bigram121, bigram123, bigram149, bigram732, bigram1294) -> counts]
    bigram       count
    bigram6      1
    bigram42     4
    bigram121    1
    bigram123    3
    bigram149    1
    bigram732    1
    bigram1294   1
Could use merge-sort again to order by occurrence!

Step 7: Implement Counting
org.mdp.cli.CountDuplicates
Implement countDuplicates()
- Store two lines: current and last
- If the current line is the same as the last line, increment the counter
- If the current line is different from the last line, print the count and the last line to a file, then reset the count
- Use:
    String sortNum = StringWithNumber.getSortableNumber(dupes);
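Because the input is sorted, equal lines are adjacent, so counting needs only the last line and a counter. The sketch below illustrates this. It does not use the lab's StringWithNumber.getSortableNumber(), whose output format the slides do not show. As a hypothetical stand-in, it zero-pads the count to ten digits so that sorting the output lines as strings also sorts them by count. The tab-separated output format is likewise an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;

class CountDuplicatesSketch {
    // Reads sorted lines and writes "<zero-padded count>\t<line>" once per distinct line.
    static void countDuplicates(BufferedReader sortedInput, PrintWriter out) throws IOException {
        String last = sortedInput.readLine();
        if (last == null) {
            out.flush();
            return; // empty input, nothing to count
        }
        int dupes = 1;
        String current;
        while ((current = sortedInput.readLine()) != null) {
            if (current.equals(last)) {
                dupes++;
            } else {
                writeCount(out, dupes, last);
                last = current;
                dupes = 1;
            }
        }
        // The final group has no following line to trigger the write.
        writeCount(out, dupes, last);
        out.flush();
    }

    // Hypothetical stand-in for a sortable number: fixed-width, zero-padded.
    static void writeCount(PrintWriter out, int dupes, String line) {
        out.println(String.format("%010d", dupes) + "\t" + line);
    }
}
```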

Step 8: Try Counting 10k Bigrams
org.mdp.cli.CountDuplicates
    -i [path]\bigrams-10k-sorted.txt -o [path]\bigrams-10k-counts.txt
- If successful, try counting the large file! (Don't forget -igz / -ogz)
- If not successful, try debugging. If stuck, ask me.

Step 9: Implement Reverse Order
org.mdp.cli.ExternalMergeSort
In writeSortedBatches() & externalMergeSort()
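One way to support reverse order is to route every comparison, both when sorting a batch and when picking the next line during the merge, through a single Comparator chosen by the flag. The sketch below is an assumption about how this could be organised, not the lab's implementation. The names are hypothetical, and the example lines assume counts written as zero-padded prefixes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ReverseOrderSketch {
    // Chooses the line order once, so batching and merging cannot disagree.
    static Comparator<String> lineOrder(boolean reverseOrder) {
        return reverseOrder
                ? Comparator.<String>reverseOrder()
                : Comparator.<String>naturalOrder();
    }

    // Batch phase: sort a copy of the lines with the chosen order.
    static List<String> sortBatch(List<String> lines, boolean reverseOrder) {
        List<String> copy = new ArrayList<>(lines);
        copy.sort(lineOrder(reverseOrder));
        return copy;
    }

    // Merge phase: true if line a should be written before line b.
    static boolean comesFirst(String a, String b, boolean reverseOrder) {
        return lineOrder(reverseOrder).compare(a, b) < 0;
    }
}
```

If only the batch phase is reversed, the merge would interleave descending batches as if they were ascending and the output would not be sorted, so both places need the same order.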

Step 10: Merge-Sort the Counts
org.mdp.cli.ExternalMergeSort
    -i [path]\bigrams-10k-counts.txt -o [path]\bigrams-10k-counts-sorted.txt -b … -r
- If successful, try sorting the large file!
  - Use batches of size … (don't forget -igz / -ogz)
- If not successful, try debugging. If stuck, ask me.

Step 11: Get the top 500
org.mdp.cli.CopyLinesFromFile
    -i [path]\bigrams-counts-sorted.txt.gz -igz -o [path]\bigrams-counts-sorted-top500.txt -n 500

Final Step: Profiling (Optional)
Java Interactive Profiler
- Run ExternalMergeSort for a large file
- Use VM arguments:
    -javaagent:lib\profile.jar -noverify
- When finished, check profile.txt in your project's root directory
- See if you can optimise something in "Most Expensive Methods"

Final Final Steps
- Remove the tmp/ folder from the mdp-lab3/ folder and from the recycle bin (Shift + Del)
- I set up tareas.