Week 6 Discussion Word Cloud
Word Cloud Steps to completing Word Cloud: (1) Read in command line args. (2) Parse the common.txt file. (3) Parse the input file and record word frequencies. (4) Account for “common” words. (5) Sort list. (6) Print output.
String[] args Have you ever wondered what the formal parameter “String[] args” in the main method does? EX: public static void main(String[] args) This array holds the String values that the user passed in on the command line, allowing us to get user input without having to use a Scanner. For example, after running: java WordCloud poems.txt 10 args[0] is the filename to read from (poems.txt) and args[1] is the number of words to print (10).
Command Line Arguments These are arguments that can be passed in via the command line. WordCloud will take two command line arguments: the input file and the number of words to print out. EX: java WordCloud input_file.txt 10 This will print out the 10 most common words in input_file.txt. These arguments are stored in an array of Strings, which is passed into the main method as String[] args. Here args[0] = "input_file.txt" and args[1] = "10". NOTE THAT THE NUMBER IS BEING STORED AS A STRING.
C.L.A. for WordCloud WordCloud will be expecting 2 command line arguments, the filename and the # of words. You should save these values in local variables in order to use them. The filename should be saved as a String, but the #words should be saved as an int. But wait… all command line arguments arrive in a String array… Get the numeric value of the #words by using the Integer.parseInt() method. EX: String str = "45"; int i = Integer.parseInt(str); // i = 45
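The argument handling described above can be sketched as follows. This is only a minimal illustration: the class name ArgsDemo and the helper parseWordCount are hypothetical, and the usage message is an assumption rather than required output.

```java
public class ArgsDemo {
    // Hypothetical helper: turns the "#words" argument into an int.
    static int parseWordCount(String arg) {
        // Throws NumberFormatException if arg is not a whole number.
        return Integer.parseInt(arg);
    }

    public static void main(String[] args) {
        // Expect exactly two arguments: the filename and the #words.
        if (args.length != 2) {
            System.out.println("Usage: java ArgsDemo <file> <#words>");
            return;
        }
        String filename = args[0];              // stays a String
        int numWords = parseWordCount(args[1]); // converted to an int
        System.out.println("File: " + filename);
        System.out.println("Words to print: " + numWords);
    }
}
```

Checking args.length first avoids an ArrayIndexOutOfBoundsException when the user forgets an argument.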
Parsing the Input File By now, you all should be familiar with using filestreams/Scanners. This assignment will require reading in and parsing two files, one of which will be passed in from the command line. The other file will always be common.txt, which is located in the directory: /home/linux/ieng6/cs11wb/public/HW6/ Scanner scnr = new Scanner(new File("___insert file name here___")); This constructor can throw a FileNotFoundException, so it must be in a try/catch block (or the enclosing method must declare "throws FileNotFoundException"), or the code will not compile. Use Scanner.next() to get each String delimited by whitespace. Use Scanner.nextLine() to get each String delimited by a newline.
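A minimal sketch of opening a file and reading it token by token is shown below. The class name ReadWordsDemo and the helper readTokens are hypothetical, and the error message is an assumption; the helper takes a Scanner so it works the same way for any file.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;

public class ReadWordsDemo {
    // Hypothetical helper: collects every whitespace-delimited token.
    static ArrayList<String> readTokens(Scanner scnr) {
        ArrayList<String> tokens = new ArrayList<String>();
        while (scnr.hasNext()) {
            tokens.add(scnr.next());
        }
        return tokens;
    }

    public static void main(String[] args) {
        if (args.length < 1) {
            System.out.println("Usage: java ReadWordsDemo <file>");
            return;
        }
        try {
            // new Scanner(File) may throw FileNotFoundException.
            Scanner scnr = new Scanner(new File(args[0]));
            System.out.println(readTokens(scnr));
            scnr.close();
        } catch (FileNotFoundException e) {
            System.out.println("Could not open " + args[0]);
        }
    }
}
```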
Formatting the String Scanner.next() reads in every character until it reaches whitespace (spaces, tabs, newlines), so it will also read in punctuation. For our program, we don’t want to keep anything that is not a letter. Also, we only want to use lowercase letters. Two methods found in the String class will be helpful for formatting: replaceAll() and toLowerCase().
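replaceAll() and toLowerCase() replaceAll(String s1, String s2) - returns a new String in which every part of the original that matches the pattern (regular expression) s1 has been replaced with s2. EX: String str = "ToDAY!?/@$"; String str2 = str.replaceAll("[^a-zA-Z]", ""); // str2 = "ToDAY" toLowerCase() - takes no arguments and returns a copy of the String with all letters changed to lowercase. EX: String str3 = str2.toLowerCase(); // str3 = "today" Strings are immutable, so neither method changes the original String; you must save the returned value.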
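The two calls can be chained into one small helper, sketched below. The names CleanWordDemo and clean are hypothetical. Note that a token with no letters at all becomes the empty string, which you will probably want to skip when counting.

```java
public class CleanWordDemo {
    // Hypothetical helper: keeps only letters, then lowercases them.
    static String clean(String token) {
        return token.replaceAll("[^a-zA-Z]", "").toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(clean("ToDAY!?/@$"));     // prints today
        System.out.println(clean("don't"));          // prints dont
        System.out.println(clean("1234").isEmpty()); // prints true
    }
}
```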
Ignoring “Common” Words There are 3 strategies your program can use to ignore words found in common.txt: (1) Ignore common words while you are scanning the input file - RECOMMENDED. (2) Remove all common words from the HashMap after you scan the input file. (3) Don’t print them.
common.txt In this program, we will be ignoring “common” words (like “the”, “an”, “is”, etc.). We have given you a file named “common.txt” in the public directory which contains a list of all these words. We must therefore read through “common.txt” and save all of its words in a data structure. Options: ArrayList, or HashSet (RECOMMENDED). Remember, we want to go for SPEED.
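HashSets What are HashSets? HashSets are a data structure for holding a collection of elements, like ArrayLists and arrays. There are a few differences, however. Each element of a HashSet is known as a KEY, and each key is stored at most once. A HashSet has no order and no indices, so you cannot access elements by position like you would in an array (you can still visit every element with a for-each loop). HashSets are really fast. Checking whether an element is in the set takes O(1) time on average.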
Reading/Storing “common.txt” We only care about whether a word EXISTS in the list, and not about anything else, so use a HashSet: HashSet<String> common = new HashSet<String>(); Use add() to add elements and contains() to check whether an element exists. Go through each word in “common.txt” and store it in the HashSet.
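A minimal sketch of filling the HashSet follows. The names CommonWordsDemo and loadCommon are hypothetical, and the Scanner in main reads from a short String as a stand-in for the real common.txt so that the example runs anywhere.

```java
import java.util.HashSet;
import java.util.Scanner;

public class CommonWordsDemo {
    // Hypothetical helper: stores every token from the Scanner.
    static HashSet<String> loadCommon(Scanner scnr) {
        HashSet<String> common = new HashSet<String>();
        while (scnr.hasNext()) {
            common.add(scnr.next());
        }
        return common;
    }

    public static void main(String[] args) {
        // Stand-in for common.txt (assumed contents).
        Scanner scnr = new Scanner("the an is");
        HashSet<String> common = loadCommon(scnr);
        System.out.println(common.contains("the"));  // prints true
        System.out.println(common.contains("poem")); // prints false
    }
}
```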
Reading/Storing Poems You will need to both store the word you parsed from the file AND the number of times that word appeared in the file. We recommend putting this logic in its own method. You can use two arrays or ArrayLists (one for words, one for frequencies), but it is very annoying to retrieve a word’s frequency from the other list We recommend you use a HashMap for this. (You could also use an array or ArrayList of Pairs. However, we will be discussing HashMaps today). Info on HashMaps can be found @ https://docs.oracle.com/javase/7/docs/api/java/util/HashMap.html
HashMap, Keys, and Values What are HashMaps? HashMaps are a data structure for holding a collection of elements, like ArrayLists and arrays. There are a few differences, however. Each element of a HashMap is a pair containing a Key and a Value, and each key appears at most once. A HashMap has no order and no indices, so you cannot access elements by position like you would in an array (you can still visit every key with a for-each loop over keySet()). HashMaps are really fast. Looking up the value for a key takes O(1) time on average.
Using HashMaps You can create a HashMap using the constructor like so: HashMap<String, Integer> map = new HashMap<String, Integer>(); The values are Integer objects, but Java converts int values to Integer automatically, so you can pass a plain number. To put things into the map, you can use the put() method. If you wanted to store the pair ("Blaze", 42) into the map, it would look like: map.put("Blaze", 42); // this stores the pair in map You can later change the value that a key is associated with using put() as well. map.put("Blaze", 420); // now "Blaze" is mapped to 420
Accessing Elements of the HashMap The most important part of HashMaps is getting back the value that is associated with the key. For example, let’s say we want to get the value for the key "Blaze". int i = map.get("Blaze"); // i = 42, assuming "Blaze" is still mapped to 42 If the key is not in the map, get() returns null, so check first. You can check whether a key is already in the map with containsKey(). boolean b = map.containsKey("Blaze"); // b = true
HashMaps in WordCloud In order to associate each word with its frequency, we recommend you all use HashMaps. Not only will they make your program faster, but also easier to write. Your HashMap should map a String to an Integer (aka it should look like: HashMap<String, Integer>; a primitive type such as int cannot be used as a type parameter). Make it a class variable so all your methods will have access to it. As you are reading in from the input file, check if the current word you scanned is already in the map via the containsKey() method. If it isn’t in the map, put it in there! Its frequency should be 1, since it’s the first occurrence of that word in the file. If it is already in the map, then you want to update the frequency. Increase the value that the word is associated with by 1.
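The counting logic above can be sketched like this. The names CountWordsDemo and countWords are hypothetical. To keep the example self-contained it returns the map instead of using a class variable, and it ignores common words while scanning (the recommended strategy), which is an assumption about how you structure your own program.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Scanner;

public class CountWordsDemo {
    // Hypothetical helper: counts each cleaned, non-common word.
    static HashMap<String, Integer> countWords(Scanner scnr,
                                               HashSet<String> common) {
        HashMap<String, Integer> map = new HashMap<String, Integer>();
        while (scnr.hasNext()) {
            String word = scnr.next();
            word = word.replaceAll("[^a-zA-Z]", "").toLowerCase();
            // Skip tokens with no letters and words from common.txt.
            if (word.isEmpty() || common.contains(word)) {
                continue;
            }
            if (map.containsKey(word)) {
                map.put(word, map.get(word) + 1); // seen before
            } else {
                map.put(word, 1);                 // first occurrence
            }
        }
        return map;
    }

    public static void main(String[] args) {
        HashSet<String> common = new HashSet<String>();
        common.add("the");
        Scanner scnr = new Scanner("The cat saw the other cat.");
        HashMap<String, Integer> map = countWords(scnr, common);
        System.out.println(map.get("cat"));         // prints 2
        System.out.println(map.containsKey("the")); // prints false
    }
}
```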
Printing Output map.keySet() will give you a Set object containing all of the keys (Strings). Use this Set to construct an ArrayList to hold all the Strings. Then: (1) Iterate through this ArrayList to find out which word has the highest frequency (you can check this using your map). (2) Print the most frequent word and its frequency, then remove this word from the ArrayList. Repeat steps 1 and 2 #words times (#words is the command line argument passed in).
Pseudo-code
ArrayList<String> words = new ArrayList<String>(map.keySet());
for (i = 0 to #words) {
    int maxFrequency = 0;
    String mostFrequent = null;
    for (each word in the array list named words) {
        if (word isn’t common && frequency of word > maxFrequency) {
            maxFrequency = frequency of word;
            mostFrequent = current word;
        }
    }
    print mostFrequent and maxFrequency
    remove mostFrequent from the array list words
}
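One possible Java version of the pseudo-code is sketched below. The names PrintTopDemo and topWords are hypothetical, and the "word frequency" output format is an assumption, so check the assignment for the required format. It assumes common words were already left out of the map, and it stops early if the file has fewer than #words distinct words. When two words tie, which one comes first depends on the HashMap's internal order.

```java
import java.util.ArrayList;
import java.util.HashMap;

public class PrintTopDemo {
    // Hypothetical helper: returns "word frequency" lines, most
    // frequent first, for at most numWords words.
    static ArrayList<String> topWords(HashMap<String, Integer> map,
                                      int numWords) {
        ArrayList<String> words = new ArrayList<String>(map.keySet());
        ArrayList<String> lines = new ArrayList<String>();
        for (int i = 0; i < numWords && !words.isEmpty(); i++) {
            int maxFrequency = 0;
            String mostFrequent = null;
            for (String word : words) {
                if (map.get(word) > maxFrequency) {
                    maxFrequency = map.get(word);
                    mostFrequent = word;
                }
            }
            lines.add(mostFrequent + " " + maxFrequency);
            words.remove(mostFrequent); // so it is not picked again
        }
        return lines;
    }

    public static void main(String[] args) {
        HashMap<String, Integer> map = new HashMap<String, Integer>();
        map.put("rose", 3);
        map.put("sky", 1);
        map.put("sea", 2);
        for (String line : topWords(map, 2)) {
            System.out.println(line); // prints rose 3, then sea 2
        }
    }
}
```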
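Runtime of Other Solutions The printing algorithm we gave you scans the whole list once for every word it prints, so it takes O(n * #words) time, which is O(n^2) in the worst case. Can you make it faster? HINT: Efficient sorts take O(n log n). Look up merge sort. ALTERNATIVE SOLUTION: Create a class that has custom sorting.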
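One way to get the O(n log n) behavior is to sort the words once by frequency and then print the first #words entries. The sketch below is only one possible approach, not the required one: the names SortedTopDemo and sortedByFrequency are hypothetical, it uses the library's Collections.sort with a Comparator rather than a hand-written merge sort, and breaking ties alphabetically is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;

public class SortedTopDemo {
    // Hypothetical helper: returns all words, most frequent first.
    static ArrayList<String> sortedByFrequency(
            final HashMap<String, Integer> map) {
        ArrayList<String> words = new ArrayList<String>(map.keySet());
        Collections.sort(words, new Comparator<String>() {
            public int compare(String a, String b) {
                // Higher frequency first.
                int diff = map.get(b).compareTo(map.get(a));
                if (diff != 0) {
                    return diff;
                }
                return a.compareTo(b); // assumed tie-break: A to Z
            }
        });
        return words;
    }

    public static void main(String[] args) {
        HashMap<String, Integer> map = new HashMap<String, Integer>();
        map.put("rose", 3);
        map.put("sky", 1);
        map.put("sea", 2);
        System.out.println(sortedByFrequency(map));
        // prints [rose, sea, sky]
    }
}
```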
Style: Split Your Logic into Methods Code looks nicer when it’s readable. Help make it readable by using methods. For example, write separate methods for printing the words, removing the common words, parsing the input file, parsing common.txt, etc.
Style: Other Guidelines (again) Have class/file/method headers (method headers MUST be in javadoc format. look it up) Have inline comments (comments explaining how your code works) Have meaningful variable names (don’t just name them all a,b,c,d…) Don’t use “magic numbers” (Use constants. If you’re not sure if a number is magic, make it a constant anyways) Have logical blank space separators (space out your code so it’s readable) Don’t go over 80 characters in ANY line for ANY FILE Indent properly (use curly braces to make sure your indentations are correct)
Questions?