Published by Paula Doyle. Modified over 9 years ago.
1 Centroid-Based Multi-Document Summarization: Efficient Sentence Extraction Method
Presenter: Chen Yi-Ting
2 Introduction
Summaries save readers' time.
This is not a new phenomenon.
A system that summarizes a large amount of news from different sources has been developed.
This paper describes how multi-document summaries are built and evaluated.
Text can be summarized by selecting the most important sentences of the documents.
To do that, one measures the centroid value of the words in each sentence.
3 Corpus Development Scheme
Algorithm:
Get the user's input: the starting URL and the desired file type.
Add the URL to the currently empty list of URLs to search.
While the list of URLs to search is not empty {
  1. Get the first URL in the list.
  2. Move the URL to the list of URLs already searched.
  3. Check the URL to make sure its protocol is HTTP (if not, skip the rest of this iteration and go back to "While").
  4. See whether there is a robots.txt file at this site that includes a "Disallow" statement (if so, go back to "While"). Try to "open" the URL (that is, retrieve that document from the Web). If it is not an HTML file, go back to "While".
  5. Step through the HTML file. While the HTML text contains another link {
       Validate the link's URL and make sure robots are allowed (just as in the outer loop).
       If it is an HTML file: if the URL is not present in either the to-search list or the already-searched list, add it to the to-search list.
       Else if it is the type of file the user requested: add it to the list of files found.
     }
}
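The crawl loop above can be sketched in Python as follows. This is a minimal illustration, not the presenters' program. The names `crawl`, `fetch`, `robots_allow`, and `LinkExtractor` are hypothetical. Network access is injected through `fetch` so that the sketch runs offline. Deciding whether a link is an HTML page from its file extension is an assumption, since the slide does not say how file types are detected.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def looks_like_html(url):
    # Assumption: the file type is guessed from the URL path alone.
    path = urlparse(url).path
    return path == "" or path.endswith(("/", ".html", ".htm"))


def crawl(start_url, wanted_ext, fetch, robots_allow=lambda url: True):
    """Breadth-first crawl following the outline on the slide.

    fetch(url) is a hypothetical callable returning (content_type, text),
    or None when the URL cannot be opened. robots_allow(url) stands in
    for the robots.txt "Disallow" check.
    """
    to_search = deque([start_url])
    searched = set()
    found = []
    while to_search:
        url = to_search.popleft()          # step 1
        searched.add(url)                  # step 2
        if urlparse(url).scheme != "http":  # step 3
            continue
        if not robots_allow(url):          # step 4
            continue
        page = fetch(url)
        if page is None or page[0] != "text/html":
            continue
        parser = LinkExtractor()           # step 5
        parser.feed(page[1])
        for href in parser.links:
            link = urljoin(url, href)
            if urlparse(link).scheme != "http" or not robots_allow(link):
                continue
            if looks_like_html(link):
                if link not in searched and link not in to_search:
                    to_search.append(link)
            elif urlparse(link).path.endswith(wanted_ext):
                if link not in found:
                    found.append(link)
    return found
```

A real run would pass a `fetch` built on an HTTP client and a `robots_allow` built on `urllib.robotparser`; both are left out here to keep the sketch self-contained.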
4 Working principle of the system
The program we developed does not accept search keywords.
It takes only the address of the Yahoo News home page as input, searches that page and the links it contains, and continues until no addresses are left to search.
It first loads the HTML source code into a string variable and then searches for the marker "class=storyheadline", which identifies where the main theme of a news item is kept.
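As an illustration of the headline search described above, the sketch below collects the text of every element carrying the attribute `class=storyheadline`. The slide gives only the marker string, so the surrounding markup is an assumption: the sketch assumes the headline is the text inside the marked element and that marked elements are not nested inside elements of the same tag. `HeadlineExtractor` is a hypothetical name.

```python
from html.parser import HTMLParser


class HeadlineExtractor(HTMLParser):
    """Collects the text of elements whose class is 'storyheadline'."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._tag = None   # tag name of the headline element being read
        self._buf = []     # text fragments of the current headline

    def handle_starttag(self, tag, attrs):
        if self._tag is None and dict(attrs).get("class") == "storyheadline":
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._tag is not None and tag == self._tag:
            self.headlines.append("".join(self._buf).strip())
            self._tag = None


def extract_headlines(html_source):
    parser = HeadlineExtractor()
    parser.feed(html_source)
    return parser.headlines
```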
5 Design of an application based on Corpus Development
An application has been designed that searches all Yahoo News URL addresses and downloads each news item together with its title, its writer, and its time of occurrence.
Special tags are kept.
Template of the documents in the corpus: 、 、 、 、 、
6 Design of an application based on Corpus Development
The corpus application we designed can resume downloading news from the point where a previous download was interrupted.
It can download news only from Yahoo's news website.
A system that summarizes a set of related documents, mainly on the basis of the centroid, is introduced.
The information about the words of the sentences (DF and count) can be stored in a database.
CIDR computes Count*IDF in an iterative fashion, updating its values as more articles are inserted into a given cluster.
7 Centroid-based algorithm
INPUT: A collection of related documents.
OUTPUT: A summary.
STEPS TO SUMMARIZE:
–a. Finding Cluster Centroid:
    Count * idf(w) = count(w) * log(DN ⁄ df(w))
    where df(w) = document frequency of word w
          DN = number of documents in the corpus
–b. Finding Sentence Position Score: The score of the ith sentence (Si) is computed as:
    Pscore(Si) = max(1 ⁄ i, 1 ⁄ (n-i-1))
    where i = sentence number
          n = number of sentences
–c. Finding Sentence Length Score: The length here means the number of characters in the sentence.
    Lscore(Si) = 0 if Li ≤ Lmin
    Lscore(Si) = (Li-Lmin) ⁄ Li otherwise
    where Li = length of sentence Si
          Lmin = 20
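Steps a to c can be sketched in Python as below. This is an illustration under stated assumptions, not the presenters' implementation. The function names are hypothetical, count(w) is taken as the total count of w over the cluster, and the logarithm is natural. The position score uses the denominator n - i + 1, an assumption that makes the first and last sentences both score 1; a denominator of n - i - 1 would be zero at i = n - 1.

```python
import math
from collections import Counter

L_MIN = 20  # minimum sentence length in characters, as given on the slide


def centroid(cluster_docs, df, total_docs):
    """Return {word: count(word) * log(DN / df(word))} for a cluster.

    cluster_docs: list of token lists, one per document in the cluster.
    df: mapping word -> document frequency in the corpus.
    total_docs: DN, the number of documents in the corpus.
    Words with no known document frequency are skipped (assumption).
    """
    counts = Counter(w for doc in cluster_docs for w in doc)
    return {
        w: c * math.log(total_docs / df[w])
        for w, c in counts.items()
        if df.get(w)
    }


def position_score(i, n):
    """Score of the i-th sentence (1-indexed) out of n sentences."""
    # Assumed denominator n - i + 1; see the note above the code.
    return max(1 / i, 1 / (n - i + 1))


def length_score(sentence, l_min=L_MIN):
    """0 for sentences of at most l_min characters, else (L - l_min) / L."""
    length = len(sentence)
    if length <= l_min:
        return 0.0
    return (length - l_min) / length
```

For example, a 40-character sentence gets a length score of (40 - 20) / 40 = 0.5, and a word that occurs in every corpus document contributes log(1) = 0 to the centroid.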
8 Centroid-based algorithm
STEPS TO SUMMARIZE:
–d. Finding Headline Score:
    Hscore(Si) = t ⁄ N
    where t = number of words in the sentence that match words in the headline
          N = number of words in the sentence
–e. Compute Sentence Score:
    SCORE(Si) = wc*Ci + wp*Pi + wf*Fi + wl*Li, for 1 ≤ i ≤ n
    where n = number of sentences within the cluster
          Ci = centroid value of the sentence
          Pi = sentence position score
          Fi = headline score
          Li = sentence length score
          wc = wp = wf = wl = 1
–f. Extract Sentences:
    d = r * n
    where r = compression rate
          n = total number of sentences taken from the input documents
          d = number of sentences to extract
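Steps d to f can be sketched as follows. The function names are hypothetical. The sketch assumes that each sentence is scored individually as the weighted sum of its four feature scores, that d is truncated to an integer, and that the selected sentences are returned in their original order; the slides do not state these details.

```python
def headline_score(sentence_words, headline_words):
    """Fraction of the sentence's words that also occur in the headline."""
    if not sentence_words:
        return 0.0
    headline = set(headline_words)
    matches = sum(1 for w in sentence_words if w in headline)
    return matches / len(sentence_words)


def sentence_score(c, p, f, l, wc=1.0, wp=1.0, wf=1.0, wl=1.0):
    """Weighted sum of centroid, position, headline, and length scores."""
    return wc * c + wp * p + wf * f + wl * l


def extract(sentences, scores, rate):
    """Keep the d = rate * n highest-scoring sentences.

    Assumptions: d is truncated to an integer, and the kept sentences
    are returned in their original order.
    """
    d = int(rate * len(sentences))
    ranked = sorted(range(len(sentences)), key=lambda k: scores[k], reverse=True)
    keep = sorted(ranked[:d])
    return [sentences[k] for k in keep]
```

With four sentences scored 0.1, 0.9, 0.5, and 0.7 and a compression rate of 0.5, d = 2 and the second and fourth sentences are kept.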
9 Conclusion
Many other text summarization techniques are based on the position or the length of the sentences in the documents.
The method would be more reliable if sentences were parsed at the phrase level using the Link Grammar parser.
The parse gives the role of each word, such as 'subject', 'time', 'space/location', or 'action' (i.e. verb).
Using this information, sentences are clustered on the basis of a shared 'subject', 'action', and so on.
Clusters are extracted from the top of the ranking until the required summary length is reached.
Experiments on several other sentence features are also under way.
The system will be very useful for busy people who have no time to go through all the news.