Calculating Word Frequency in a Document
Thu 11/6: quiz this Thursday. Topic 5, Threaded Binary Tree, will not be covered.
Sat 11/15, 10:10~12:00: midterm exam!
About the extra-line problem
>> version
◦ ifstream input(argv[1]);
◦ while (!input.eof() && input.peek() > 0) {
◦     input >> buf;
◦     cout << buf;
◦     input >> buf;
◦     input.get(); /* consume the '\n' character */
◦     cout << " " << buf << endl;
◦ }
Getline version
◦ ifstream input(argv[1]);
◦ while (!input.eof()) {
◦     input.getline(buf, 500);
◦     if (input.gcount() > 0) /* check whether anything was actually read */
◦         cout << buf << endl;
◦ }
Another one
◦ ifstream input(argv[1]);
◦ while (input.getline(buf, 500)) {
◦     cout << buf << endl;
◦ }
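About a stray character appearing in the output
◦ If a stray character shows up during the demo, you have written '\0' (that is, 0) to the file.
◦ From now on, a program that produces this extra output will not pass the demo and will be counted as wrong.
How to fix?
◦ The most common cause is writing a buffer/string to the file without computing its length correctly.
◦ int i; FILE* fw; const char *a = "123";
◦ fw = fopen(argv[1], "w");
◦ /* this does not write '\0' */
◦ for(i=0; i<3; i++) fprintf(fw, "%c", a[i]);
◦ /* this does write '\0' */
◦ for(i=0; i<4; i++) fprintf(fw, "%c", a[i]);
◦ fclose(fw);
With the second loop the file contains: 123\0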
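One way to avoid the hard-coded count is to take the length from `strlen`, which does not count the terminating '\0'. The following is a minimal sketch under that assumption; the helper name `write_text` is hypothetical and not part of the assignment.

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical helper: writes the characters of a NUL-terminated string
// to fw, excluding the terminating '\0'.
// Returns the number of bytes actually written.
std::size_t write_text(std::FILE* fw, const char* s) {
    std::size_t len = std::strlen(s);   // strlen stops before '\0'
    return std::fwrite(s, 1, len, fw);  // write exactly len bytes
}
```

Calling `write_text(fw, "123")` writes three bytes, so no '\0' ends up in the file.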
Make-up demo for project 1
◦ Please upload your code first to ftp://mpc.cs.nctu.edu.tw, in a directory named after your student ID.
First demo scores:
Input: a text file and a stop words list
◦ Using argc and argv
◦ ./a.out stopword textfile
Output: pairs of a word and its number of occurrences
◦ To stdout (the screen)
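The argument handling could look roughly like the sketch below. It assumes exactly two arguments in the order shown on the slide (stop word file first, text file second); the `Args` struct and `parse_args` function are hypothetical names introduced for illustration.

```cpp
#include <string>

// Hypothetical container for the two command-line arguments.
struct Args {
    std::string stopword_path;  // argv[1]
    std::string text_path;      // argv[2]
};

// Returns true and fills `out` when exactly two file arguments are given,
// as in: ./a.out stopword textfile
bool parse_args(int argc, char* argv[], Args& out) {
    if (argc != 3) return false;   // argv[0] is the program name
    out.stopword_path = argv[1];
    out.text_path = argv[2];
    return true;
}
```

In `main`, a `false` return would typically lead to printing a usage message and exiting with a non-zero status.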
Text file (without stop words)
Hello, I’m Billy, not bi|ly or 6illy or b.
Output
◦ Hello,:1
◦ I’m:1
◦ Billy,:
◦ not:1
◦ bi|ly: 1
◦ or: 2
◦ 6illy: 1
◦ b.: 1
Text file (same)
Stop word list
◦ and
◦ not
◦ or
Output
◦ Hello,:1
◦ I’m:1
◦ Billy,:
◦ bi|ly: 1
◦ 6illy: 1
◦ b.: 1
Text file
◦ a b c d e f g h i j a b c d e
Stop words list
◦ a b c d
Output
◦ e:2 ; f:1 ; g:1 ; h:1 ; i:1 ; j:1
Input
◦ Text file
Words are separated by ' ', '\t', or '\n'.
Case sensitive: "Do" and "do" are different words.
There are at most 2000 characters in one line.
There will be no Chinese input.
A text file may contain more than one line.
There may be consecutive '\t', ' ', or '\n'.
Program execution time is limited.
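The splitting rules above can be sketched with stream extraction, which skips runs of whitespace and leaves case untouched. This is only an illustration: `split_words` is a hypothetical helper, and it assumes the line contains no whitespace other than ' ', '\t', and '\n' (`operator>>` would also split on characters such as '\r').

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits one line into words.
// operator>> skips consecutive whitespace, so repeated ' ' or '\t'
// never produce empty words.
std::vector<std::string> split_words(const std::string& line) {
    std::vector<std::string> words;
    std::istringstream in(line);
    std::string word;
    while (in >> word) {
        words.push_back(word);   // case is preserved: "Do" != "do"
    }
    return words;
}
```

For example, `split_words("Do  do\t\tb.")` yields the three words `Do`, `do`, and `b.`.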
Input
◦ Stop words list
One word per line.
No spaces or '\t' within a line.
No more than 2000 characters per line.
Correct
◦ Haha
◦ Hehe
◦ kerker
Incorrect
◦ 囧 oo
◦ A b
Word occurrence
◦ String + ' ' + number + '\n'
A 3
B 5
String order does not matter.
B 5
A 3
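Since the output is judged by a program, it may help to build each line in one place. A minimal sketch of the format above, with `format_entry` as a hypothetical helper name:

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: formats one (word, occurrence) pair as
// String + ' ' + number + '\n'.
std::string format_entry(const std::string& word, int count) {
    std::ostringstream out;
    out << word << ' ' << count << '\n';
    return out.str();
}
```

Writing `std::cout << format_entry("A", 3);` prints the line `A 3` followed by a newline.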
You can use any data structure to store the (word, occurrence) pairs, such as an array (watch out for the large case).
◦ One array for your strings, another for the occurrences
Your data structure must be fast in insertion and selection (search).
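As one possible choice that is fast in both insertion and search, a hash table could replace the pair of arrays. The sketch below assumes a C++11 compiler and that standard containers are allowed for the assignment, which the slides do not state; `WordCounter` is a hypothetical name.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical wrapper around a hash table keyed by word.
// Insertion and lookup take expected constant time per word.
class WordCounter {
public:
    // Inserts the word with count 1, or increments its existing count.
    void add(const std::string& word) { ++counts_[word]; }

    // Returns the occurrence count, or 0 if the word was never added.
    int count(const std::string& word) const {
        auto it = counts_.find(word);
        return it == counts_.end() ? 0 : it->second;
    }

    // Number of distinct words stored.
    std::size_t distinct() const { return counts_.size(); }

private:
    std::unordered_map<std::string, int> counts_;
};
```

A sorted array with binary search or a balanced tree (`std::map`) would be alternatives with logarithmic search instead.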
We'll use a program to judge your homework.
◦ Please take care with the I/O format
You cannot read the whole file at once.
◦ You may read at most one line at a time
We'll release some test data.
Due: 11/21
Your bonus will depend on the efficiency of your program.
Large case
◦ A lot of different words (more than )
◦ A lot of words in a text file
◦ 30%
◦ One of them will be released
10% per test case
We will release 2 normal test cases and 1 large test case for testing.
A simple algorithm
Assume STOPWORD has N words and TEXTFILE has M words.
We build SW_LIST to store the stop words and TXT_LIST to store the text file words.
Read in STOPWORD, store it as SW_LIST
foreach ( word read from TEXTFILE ) {
    if ( the word is in SW_LIST ) then
        continue to read another word
    else ( the word is not in SW_LIST ) then
        if ( the word is in TXT_LIST ) then
            add 1 to the count of the word
        else ( the word is not in TXT_LIST ) then
            insert the word into TXT_LIST
}
Cost annotations: O(N), O(M), O(N)
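The pseudocode above can be sketched in C++ as follows. This is a direct, unoptimized translation that uses plain vectors with linear search for both lists, reading the text one line at a time; the names `TxtList` and `count_words` are hypothetical, and taking `std::istream&` parameters (rather than file names) is an assumption made so the function is easy to try out.

```cpp
#include <algorithm>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// TXT_LIST: (word, occurrence) pairs in order of first appearance.
typedef std::vector<std::pair<std::string, int> > TxtList;

TxtList count_words(std::istream& stopword_in, std::istream& text_in) {
    // Read STOPWORD into SW_LIST (one word per line, no inner spaces).
    std::vector<std::string> sw_list;
    std::string word;
    while (stopword_in >> word) sw_list.push_back(word);

    TxtList txt_list;
    std::string line;
    while (std::getline(text_in, line)) {        // one line at a time
        std::istringstream words(line);
        while (words >> word) {
            // Linear search of SW_LIST: skip stop words.
            if (std::find(sw_list.begin(), sw_list.end(), word) != sw_list.end())
                continue;
            // Linear search of TXT_LIST over the distinct words seen so far.
            TxtList::iterator it = txt_list.begin();
            while (it != txt_list.end() && it->first != word) ++it;
            if (it != txt_list.end())
                ++it->second;                              // seen before
            else
                txt_list.push_back(std::make_pair(word, 1)); // new word
        }
    }
    return txt_list;
}
```

With the stop words `a b c d` and the text `a b c d e f g h i j a b c d e`, the result holds `e` with count 2 and `f` through `j` with count 1 each. The linear searches are the slow part on large inputs; replacing both lists with a faster search structure is where the efficiency bonus would likely be earned.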
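Faster programs for this assignment will earn a bonus. Everyone's program will be run on a certain mystery workstation to see whose is fast and whose is slow. If you have concerns about the fairness of the bonus, please raise them before class on Thu 11/6.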
First go to ftp://mpc.cs.nctu.edu.tw and create a folder named after your student ID.
Upload C/C++ source code files that compile and run to ftp://mpc.cs.nctu.edu.tw.
Any questions?