
1 Calculating Word Frequency in a Document

2 http://mpc.cs.nctu.edu.tw/forum/
  11/6 (Thu): quiz this Thursday; 5. Threaded Binary Tree will not be on it.
  11/15 (Sat) 10:10~12:00: midterm exam!

3 About the extra-line problem..
  >> version
  ◦ ifstream input(argv[1]);
  ◦ while (!input.eof() && input.peek() > 0) {
  ◦     input >> buf;
  ◦     cout << buf;
  ◦     input >> buf;
  ◦     input.get(); /* consume the '\n' character */
  ◦     cout << " " << buf << endl;
  ◦ }

4 Getline version
  ◦ ifstream input(argv[1]);
  ◦ while (!input.eof()) {
  ◦     input.getline(buf, 500);
  ◦     if (input.gcount() > 0) /* check whether anything was actually read */
  ◦         cout << buf << endl;
  ◦ }
  Another one
  ◦ ifstream input(argv[1]);
  ◦ while (input.getline(buf, 500)) {
  ◦     cout << buf << endl;
  ◦ }

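5 About the ^@ problem
  ◦ If ^@ shows up during the demo, you have written '\0' (that is, 0) to the output file..
  ◦ From now on, a demo program that produces this extra output will not pass and will be counted as an error.
  How to fix?
  ◦ The most common cause is writing a buffer/string to the file without working out its length correctly.
  ◦ int i; FILE* fw; const char *a = "123";
  ◦ fw = fopen(argv[1], "w");
  ◦ /* this does not output ^@ */
  ◦ for(i=0; i<3; i++) fprintf(fw, "%c", a[i]);
  ◦ /* this does output ^@ */
  ◦ for(i=0; i<4; i++) fprintf(fw, "%c", a[i]);
  ◦ fclose(fw);
  Contents of a: 123\0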
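One way to avoid the hard-coded loop bound on this slide is to let strlen decide how many bytes to write. The following is a minimal sketch, not the course's required fix; the function name write_c_string is hypothetical.

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical helper: write a NUL-terminated C string to an open file
// without its terminating '\0'. Returns the number of bytes written.
std::size_t write_c_string(std::FILE* out, const char* text) {
    // strlen counts only the characters before '\0', so the terminator
    // is never sent to the file and no ^@ appears in the output.
    return std::fwrite(text, 1, std::strlen(text), out);
}
```

With a = "123", write_c_string(fw, a) writes the three characters 1, 2, 3 and stops before the terminator.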

6 Make-up demo for project 1: please upload your code first to ftp://mpc.cs.nctu.edu.tw, and create a directory named after your own student ID.
  First demo scores: http://www.cs.nctu.edu.tw/~hhyou/ds.php

7 Input: a text file and a stop words list
  ◦ Using argc and argv
  ◦ ./a.out stopword textfile
  Output: pairs of a word and the number of its occurrences
  ◦ To stdout (the screen)

8 Text file (no stop words)
    Hello, I'm Billy, not bi|ly or 6illy or b.
  Output
  ◦ Hello,:1
  ◦ I'm:1
  ◦ Billy,:1
  ◦ not:1
  ◦ bi|ly:1
  ◦ or:2
  ◦ 6illy:1
  ◦ b.:1

9 Text file (same)
  Stop word list
  ◦ and
  ◦ not
  ◦ or
  Output
  ◦ Hello,:1
  ◦ I'm:1
  ◦ Billy,:1
  ◦ bi|ly:1
  ◦ 6illy:1
  ◦ b.:1

10 Text file
  ◦ a b c d e f g h i j a b c d e
  Stop words list
  ◦ a b c d
  Output
  ◦ e:2 ; f:1 ; g:1 ; h:1 ; i:1 ; j:1

11 Input
  ◦ Text file
      Words are separated by ' ', '\t', or '\n'.
      Case sensitive.
      Do and do are different words.
      There are at most 2000 chars in one line.
      There will be no Chinese input.
      A text file can have more than one line.
      There might be consecutive '\t' or ' ' or '\n'.
      Program execution time is limited.
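The separator rules above can be sketched as a small line splitter. This is an illustration under the stated rules only; split_words is a hypothetical name, and the function expects to be handed one line at a time.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split one line into words. Only ' ', '\t' and '\n'
// count as separators, and runs of consecutive separators yield no empty words.
std::vector<std::string> split_words(const std::string& line) {
    const char* separators = " \t\n";
    std::vector<std::string> words;
    std::string::size_type start = line.find_first_not_of(separators);
    while (start != std::string::npos) {
        std::string::size_type end = line.find_first_of(separators, start);
        // When end is npos the word runs to the end of the line.
        std::string::size_type length =
            (end == std::string::npos) ? std::string::npos : end - start;
        words.push_back(line.substr(start, length));
        // Skip the separator run; yields npos once the line is exhausted.
        start = line.find_first_not_of(separators, end);
    }
    return words;
}
```

Because the comparison is byte-for-byte, Do and do stay distinct words, and punctuation such as the comma in Hello, remains part of the word.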

12 Input
  ◦ Stop words list
      One word per line
      No space or '\t' within a line
      No more than 2000 chars per line
  Correct
  ◦ Haha
  ◦ Hehe
  ◦ kerker
  Incorrect
  ◦ 囧 oo
  ◦ A b

13 Word occurrence
  ◦ String + ' ' + number + '\n'
      A 3
      B 5
  The order of the strings does not matter.
      B 5
      A 3

14 You can use any data structure to store the pair (word, occurrence), such as an array. (Watch out for the large case.)
  One array for your strings, another for the occurrences
  Your data structure must be fast in insertion and selection (search).
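One possible way to keep the two parallel arrays from this slide while still getting fast insertion and search is to add a hash index from each word to its array position. This is a sketch under that assumption, not the required design; WordTable and its members are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical structure: two parallel arrays plus a hash index.
struct WordTable {
    std::vector<std::string> words;  // one array for the strings
    std::vector<long> counts;        // parallel array for the occurrences
    std::unordered_map<std::string, std::size_t> index;  // word -> array position

    // Count one occurrence of `word`, inserting it on first sight.
    void add(const std::string& word) {
        auto found = index.find(word);
        if (found != index.end()) {
            ++counts[found->second];
        } else {
            index.emplace(word, words.size());
            words.push_back(word);
            counts.push_back(1);
        }
    }

    // Return the count for `word`, or 0 if it was never added.
    long count(const std::string& word) const {
        auto found = index.find(word);
        return found == index.end() ? 0 : counts[found->second];
    }
};
```

A linear scan of a plain array costs time proportional to the number of distinct words on every lookup, which is what makes the large case slow; the hash index brings the expected cost of add and count down to constant time per word.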

15 We'll use a program to judge your homework
  ◦ Please take care with the I/O format
  You cannot read the whole file at once
  ◦ You may read at most one line at a time
  We'll release some test data.
  Due: 11/21
  Your bonus will depend on the efficiency of your program

16 Large case
  ◦ A lot of different words (more than 1,000,000)
  ◦ A lot of words in a text file
  ◦ 30%
  ◦ One of them will be released
  10% per test case
  We will release 2 normal test cases and 1 large test case for testing.

17 A simple algorithm
  Assume STOPWORD has N words and TEXTFILE has M words.
  We build SW_LIST to store the stop words and TXT_LIST to store the text file words.

18 Read in STOPWORD, store it as SW_LIST
   foreach ( word read from TEXTFILE )
   {
       if ( the word is in SW_LIST )
           then continue to read another word.
       else ( the word is not in SW_LIST )
           then
               if ( the word is in TXT_LIST )
                   then add 1 to the count of the word
               else ( the word is not in TXT_LIST )
                   then insert the word into TXT_LIST with count 1
   }
   Cost annotations shown on the slide: O(N), O(M), O(N)
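The pseudocode on this slide could be sketched in C++ roughly as follows. This is only one possible approach, not the reference solution: it assumes hash containers stand in for SW_LIST and TXT_LIST, and count_words is a hypothetical name. It reads one line at a time and splits with operator>>, which treats every whitespace character as a separator (slightly broader than ' ', '\t' and '\n').

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical function: count the words in `text` that are not stop words.
std::unordered_map<std::string, long> count_words(std::istream& stopwords,
                                                  std::istream& text) {
    // Read in STOPWORD (one word per line) and store it as SW_LIST.
    std::unordered_set<std::string> sw_list;
    std::string line;
    while (std::getline(stopwords, line)) {
        if (!line.empty()) sw_list.insert(line);
    }

    // Read TEXTFILE one line at a time, never the whole file at once.
    std::unordered_map<std::string, long> txt_list;
    while (std::getline(text, line)) {
        std::istringstream tokens(line);
        std::string word;
        while (tokens >> word) {
            if (sw_list.count(word) != 0) continue;  // stop word: skip it
            ++txt_list[word];  // inserts with count 0 on first sight, then adds 1
        }
    }
    return txt_list;
}
```

To use it, open the two files named in argv as std::ifstream objects, pass them in, and print each entry as the word, a space, the count, and a newline. With hash containers each membership test takes expected constant time, so the whole run is expected to take time proportional to N + M words.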

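19 Faster programs for this assignment will earn a bonus.
  Everyone's program will be run on a certain mystery workstation to see whose is fast and whose is slow.
  If you have concerns about the fairness of the bonus, please raise them before class on 11/6 (Thu).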

20 First, create a folder named after your own student ID on ftp://mpc.cs.nctu.edu.tw.
  Upload C/C++ source code files that compile and run to ftp://mpc.cs.nctu.edu.tw

21 Any questions?

