
1 By: Asef Poormasoomi, autumn 2009

2 Introduction Summary: a brief but accurate representation of the contents of a document.

3 Motivation Abstracts for scientific and other articles; news summarization (mostly multi-document summarization); classification of articles and other written data; web pages for search engines; web access from PDAs and cell phones; question answering and data gathering.

4 Genres Extract vs. abstract: lists fragments of text vs. re-phrases content coherently. Example: "He ate a banana, an orange and an apple" => "He ate fruit". Generic vs. query-oriented: presents the author's view vs. reflects the user's interest. Example: question answering systems. Personal vs. general: considers the reader's prior knowledge vs. a general audience. Single-document vs. multi-document: based on one source text vs. fuses together many texts. Indicative vs. informative: used for quick categorization vs. content processing.

5 Summarization in 3 Steps (Lin and Hovy, 1997) Content/topic identification. Goal: find/extract the most important material. Techniques: methods based on position, cue phrases, concept counting, word frequency. Conceptual/topic interpretation. Applies to abstract summaries only. Methods: merging or fusing related topics into more general ones, removing redundancies, etc. Example: "He sat down, read the menu, ordered, ate and left" => "He visited the restaurant". Summary generation: say it in your own words; simple if extraction is performed.

6 Methods Statistical scoring methods (pseudo); higher semantic/syntactic structures; network (graph) based methods; other methods (rhetorical analysis, lexical chains, co-reference chains); AI methods.

7 Statistical Scoring (Pseudo) General method: 1. score each entity (sentence, word); 2. combine scores; 3. choose the best sentence(s). Scoring techniques: word frequencies throughout the text (Luhn 58); position in the text (Edmundson 69, Lin & Hovy 97); title method (Edmundson 69); cue phrases in sentences (Edmundson 69); Bayesian classifier (Kupiec et al. 95).

8 Word Frequencies (Luhn 58) The very first work in automated summarization. Claim: words which are frequent in a document indicate the topic discussed. Frequent words indicate the topic; clusters of frequent words indicate summarizing sentences. Stemming should be used; "stop words" (e.g. "the", "a", "for", "is") are ignored.

9 Word Frequencies (Luhn 58) Calculate the term frequency in the document: f(term). Calculate the inverse log-frequency in the corpus: if(term). Words with a high f(term) * if(term) are indicative. The sentence with the highest sum of weights is chosen.
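A minimal sketch of this weighting in Python (the function name and the +1 smoothing are illustrative assumptions, not Luhn's exact formulation):

```python
import math
from collections import Counter

def luhn_like_scores(doc_sentences, corpus, stop_words):
    """Score each sentence by summing f(term) * if(term) over its words:
    f(term) is the term frequency in the document, if(term) an inverse
    log-frequency over the corpus (a TF-IDF-style weighting)."""
    words_of = lambda s: [w for w in s.lower().split() if w not in stop_words]
    f = Counter(w for s in doc_sentences for w in words_of(s))
    n = len(corpus)

    def inv_freq(w):
        # number of corpus documents containing the word, smoothed by 1
        df = sum(1 for d in corpus if w in d.lower().split())
        return math.log(n / (1 + df))

    return [sum(f[w] * inv_freq(w) for w in words_of(s)) for s in doc_sentences]
```

The highest-scoring sentence would then be extracted into the summary.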

10 Position in the Text (Edmundson 69, Lin & Hovy 97) Claim: important sentences occur in specific positions. Position depends on the type (genre) of the text; the inverse of position in the document works well for news. Important information occurs in specific sections of the document (introduction/conclusion). Assign scores to sentences according to their location in the paragraph, and to paragraphs and sentences according to their location in the entire text.
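The position heuristic can be sketched as follows (the exact decay function is an illustrative assumption):

```python
def position_score(index, n_sentences, genre="news"):
    """Position heuristic: for news text, score decays with the inverse
    of sentence position (lead sentences matter most); for other genres,
    favour both the introduction and the conclusion."""
    if genre == "news":
        return 1.0 / (index + 1)
    # favour whichever end of the document the sentence is closest to
    return max(1.0 / (index + 1), 1.0 / (n_sentences - index))
```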

11 Title Method (Edmundson 69) Claim: the title of a document indicates its content (Duh!). Words in the title help find relevant content: create a list of title words, remove "stop words", and use those as keywords to find important sentences.

12 Cue Phrases Method (Edmundson 69) Claim: important sentences contain cue words/indicative phrases. "The main aim of the present paper is to describe…" (IND); "The purpose of this article is to review…" (IND); "In this report, we outline…" (IND); "Our investigation has shown that…" (INF). Some words are considered bonus, others stigma. Bonus: comparatives, superlatives, conclusive expressions, etc. Stigma: negatives, pronouns, etc. Implemented for French (Lehman '97). Paice implemented a dictionary/grammar of indicative expressions, e.g. in + skip(0) + this + skip(2) + paper + skip(0) + we + ... Cue words can also be learned (Teufel '98).
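One hypothetical way to encode Paice-style skip patterns is as a regex compiler (the encoding below is an assumption; Paice's actual grammar formalism is richer):

```python
import re

def compile_cue_pattern(spec):
    """Compile a Paice-style indicative-expression pattern such as
    in + skip(0) + this + skip(2) + paper + skip(0) + we
    into a regex. `spec` is a list of (word, max_words_skipped_after)
    pairs."""
    parts = []
    for i, (word, skip) in enumerate(spec):
        parts.append(re.escape(word))
        if i < len(spec) - 1:
            # allow up to `skip` intervening words before the next token
            parts.append(r"\s+(?:\w+\s+){0,%d}" % skip)
    return re.compile("".join(parts), re.IGNORECASE)
```

For example, the pattern above would match "In this short survey paper we outline" (two skipped words) as well as "In this paper we describe".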

13 Feature Combination (Edmundson '69) Linear combination of 4 features: title, cue, keyword, position. The weights are adjusted using training data with any minimization technique. The best system combined cue + title + position.
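The combination itself is a plain weighted sum (values below are illustrative; in practice the weights come from training data):

```python
def combined_score(features, weights):
    """Edmundson-style linear combination of the four feature scores
    (title, cue, keyword, position)."""
    return sum(weights[name] * score for name, score in features.items())
```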

14 Bayesian Classifier (Kupiec et al. 95) Uses a Bayesian classifier: P(s ∈ S | F1, …, Fk) = P(F1, …, Fk | s ∈ S) · P(s ∈ S) / P(F1, …, Fk). Assuming statistical independence of the features: P(s ∈ S | F1, …, Fk) = [Π_j P(Fj | s ∈ S)] · P(s ∈ S) / Π_j P(Fj). Higher-probability sentences are chosen to be in the summary. Performance: for 25% summaries, 84% precision.
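A sketch of this posterior computation (the probability tables are illustrative; Kupiec et al. estimate them from a training corpus of document/summary pairs):

```python
def kupiec_posterior(features, p_f_given_s, p_f, prior):
    """Naive-Bayes sentence score in the spirit of Kupiec et al. (1995):
    P(s in S | F1..Fk) = prior * prod_j P(Fj | s in S) / prod_j P(Fj),
    assuming the features Fj are statistically independent."""
    score = prior
    for f in features:
        score *= p_f_given_s[f] / p_f[f]
    return score
```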

15 Methods Problems with statistical scoring methods: Synonymy: one concept can be expressed by different words (e.g. cycle and bicycle refer to the same kind of vehicle). Polysemy: one word can have several meanings (e.g. cycle could mean life cycle or bicycle). Phrases: a phrase may have a meaning different from the words in it; an alleged murderer is not a murderer (Lin and Hovy 1997). Remaining methods: higher semantic/syntactic structures; network (graph) based methods; other methods (rhetorical analysis, lexical chains, co-reference chains); AI methods.

16 Higher Semantic/Syntactic Structures Claim: important sentences/paragraphs are the most highly connected entities in more or less elaborate semantic structures. Classes of approaches: lexical similarity (WordNet, lexical chains); word co-occurrences; co-reference; combinations of the above.

17 Lexical Chain Lexical cohesion (Halliday and Hasan): reiteration (synonym, antonym, hypernym) and collocation (co-occurrence). Example: او به عنوان معلم در مدرسه کار می کند ("He works as a teacher at the school"). Lexical chain: a sequence of words which have lexical cohesion (reiteration/collocation).

18 Lexical Chain Method for creating chains: Select a set of candidate words from the text. For each candidate word, find an appropriate chain, relying on a relatedness criterion among members of the chains and the candidate word. If such a chain is found, insert the word into it and update the chain accordingly; else create a new chain. Scoring the chains: synonym = 10, antonym = 7, hyponym = 4. Strong chains must be selected. Sentence selection for the summary: H1: select the first sentence that contains a member of a strong chain. Example chain: AI = 2; Artificial Intelligence = 1; Field = 7; Technology = 1; Science = 1. H2: select the first sentence that contains a "representative" (by frequency) member of the chain. H3: identify a text segment where the chain is highly dense (density is the proportion of words in the segment that belong to the chain).
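The chain-building steps above can be sketched greedily; `related(a, b)` stands in for a real relatedness criterion (e.g. WordNet synonym/antonym/hyponym relations), which is the part this toy version assumes away:

```python
def build_chains(candidates, related):
    """Greedy lexical-chain construction: for each candidate word, join
    the first chain holding a related word, else start a new chain."""
    chains = []
    for word in candidates:
        for chain in chains:
            if any(related(word, member) for member in chain):
                chain.append(word)
                break
        else:  # no existing chain accepted the word
            chains.append([word])
    return chains
```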

19 Lexical Chain Mr. Kenny is the person that invented the anesthetic machine which uses micro-computers to control the rate at which an anesthetic is pumped into the blood. Such machines are nothing new. But his device uses two micro-computers to achieve much closer monitoring of the pump feeding the anesthetic into the patient.

20 Network Based Method (Salton et al. '97) Vector space model: each text unit is represented as a vector, compared with a standard similarity metric. Construct a graph of paragraphs or other entities; the strength of a link is the similarity metric. Use a threshold to decide which paragraphs or entities count as similar (pruning the graph). Paragraph selection heuristics: Bushy path: select paragraphs with many connections to other paragraphs and present them in text order. Depth-first path: select one paragraph with many connections; select a connected paragraph (in text order) which is also well connected; continue.
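The bushy-path heuristic can be sketched over a similarity matrix (a minimal sketch; the real method works on vector-space similarities between paragraphs):

```python
def bushy_paragraphs(sim, threshold, k):
    """Bushy-path selection: link paragraphs whose pairwise similarity
    exceeds `threshold`, pick the k best-connected ('bushiest')
    paragraphs, and return them in text order. `sim` is a symmetric
    matrix of paragraph similarities."""
    n = len(sim)
    degree = [sum(1 for j in range(n) if j != i and sim[i][j] > threshold)
              for i in range(n)]
    bushiest = sorted(range(n), key=lambda i: degree[i], reverse=True)[:k]
    return sorted(bushiest)  # present selected paragraphs in text order
```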

21 Text Relation Map (figure): paragraphs A–F as nodes, with links drawn where similarity exceeds the threshold thr; connection counts A=3, B=1, C=2, D=1, E=3, F=2.


23 Motivation Summaries which are generic in nature do not cater to the user's background and interests; results show that each person has a different perspective on the same text. Marcu (1997) found that the percent agreement of 13 judges over 5 texts from Scientific American was 71 percent. Rath (1961) found that extracts selected by four different human judges had only 25 percent overlap. Salton (1997) found that the 20 most important paragraphs extracted by 2 subjects had only 46 percent overlap.

24 User Feedback Click: when a user clicks on a document, that document is considered to be of more interest to the user than other, unclicked ones. Query history: the most widely used implicit user feedback at present. Example: http://www.google.com/psearch. Attention time: often referred to as display time or reading time. Other types of implicit user feedback include scrolling, annotation, bookmarking and printing behaviors.

25 Summarization Using Click Data Use the extra knowledge in clickthrough data to improve Web-page summarization. A collection of clickthrough data can be represented by a set of triples. Typically, a user's query words reflect the true meaning of the target Web-page content. Problems: the incomplete click problem; noisy click data.

26 Attention Time Main idea: rely on the attention (reading) time individual users spend on single words in a document. The prediction of user attention over every word in a document is based on the user's attention during previous reads. The algorithm tracks a user's attention times over individual words using a vision-based commodity eye-tracking mechanism: a simple web camera and an existing eye-tracking algorithm (the Opengazer project). The error of the detected gaze location on the screen is between 1–2 cm, depending on which area of the screen the user is looking at (on a 19" monitor).

27 Attention Time Anchoring gaze samples onto individual words: the detected gaze central point is positioned at (x, y) on the screen; compute the central display point of each word, denoted (xi, yi). For each gaze sample detected by the eye-tracking module, assign it to the words in the document in this manner. The overall attention that a word receives is the sum of all the fractional gaze samples assigned to it in the above process. During processing, stop words are removed.
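One way to assign a gaze sample fractionally over nearby words is a distance-based kernel. The Gaussian weighting below is an assumption; the slide only states that samples are distributed fractionally over words:

```python
import math

def gaze_fractions(gaze, word_centers, sigma=1.0):
    """Distribute one gaze sample (x, y) fractionally over the words on
    screen, weighting each word center (xi, yi) by a Gaussian of its
    distance to the gaze point; the fractions sum to 1."""
    weights = [math.exp(-((gaze[0] - xi) ** 2 + (gaze[1] - yi) ** 2)
                        / (2 * sigma ** 2))
               for (xi, yi) in word_centers]
    total = sum(weights)
    return [w / total for w in weights]
```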

28 Attention Time Attention time prediction for an unseen word is based on semantic similarity between words: for an arbitrary word w which is not among the observed words, calculate the similarity between w and every wi (i = 1, …, n), then select the k words which share the highest semantic similarity with w. Predicting user attention for sentences follows from these word-level predictions.
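A sketch of the word-level prediction step; the similarity-weighted average over the k nearest known words is an illustrative assumption about how the neighbours are combined:

```python
def predict_attention(word, known_times, similarity, k=2):
    """Predict the attention time of an unseen word from the k known
    words most similar to it, weighting each neighbour's observed
    attention time by its similarity to the new word."""
    nearest = sorted(known_times, key=lambda w: similarity(word, w),
                     reverse=True)[:k]
    total = sum(similarity(word, w) for w in nearest)
    if total == 0:
        return 0.0
    return sum(similarity(word, w) * known_times[w] for w in nearest) / total
```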

29 Attention Time

30 Attention Time

31 Other Types of Implicit User Feedback Extract the personal information of the user using information available on the web: put the person's full name into a search engine (the name quoted with double quotation marks, such as "Albert Einstein"). The top n documents are retrieved; after stop-word removal and stemming, a unigram language model is learned on the extracted text content. User-specific sentence scoring: each sentence is scored against this user model.
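A sketch of this personalisation step; the transcript omits the slide's scoring formula, so the average-log-probability score below is an assumption:

```python
import math
from collections import Counter

def unigram_lm(text):
    """Learn an add-one-smoothed unigram language model from the text
    extracted for a user."""
    words = text.lower().split()
    counts, vocab = Counter(words), set(words)
    total = len(words)
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def sentence_score(sentence, model, floor=1e-6):
    """Score a sentence under the user model by the average
    log-probability of its words (unseen words get a small floor)."""
    words = sentence.lower().split()
    return sum(math.log(model.get(w, floor)) for w in words) / max(len(words), 1)
```

Sentences that use the user's characteristic vocabulary score higher and are preferred for the personalised summary.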

32 Example Topic of summary generation: "Microsoft to open research lab in India". 8 articles published in different news sources form the news cluster. User A is from the NLP domain and user B from the network security domain. Generic summary: The new lab, called Microsoft Research India, goes online in January, and will be part of a network of five research labs that Microsoft runs worldwide, said Padmanabhan Anandan, managing director of Microsoft Research India. Microsoft's Mission India, formally inaugurated Jan. 12, 2005, is Microsoft's third basic research facility established outside the United States. In line with Microsoft's research strategy worldwide, the Bangalore lab will collaborate with and fund research at key educational institutions in India, such as the Indian Institutes of Technology, Anandan said. Although Microsoft Research doesn't engage in product development itself, technologies researchers create can make their way into the products the company…

33 Other Types of Implicit User Feedback User A specific summary: The new lab, called Microsoft Research India, goes online in January, and will be part of a network of five research labs that Microsoft runs worldwide, said Padmanabhan Anandan, managing director of Microsoft Research India. Microsoft's Mission India, formally inaugurated Jan. 12, 2005, is Microsoft's third basic research facility established outside the United States. Microsoft will collaborate with the government of India and the Indian scientific community to conduct research in Indic language computing technologies; this will include areas such as machine translation between Indian languages and English, search and browsing, and character recognition. In line with Microsoft's research strategy worldwide, the Bangalore lab…

34 Other Types of Implicit User Feedback User B specific summary: The new lab, called Microsoft Research India, goes online in January, and will be part of a network of five research labs that Microsoft runs worldwide, said Padmanabhan Anandan, managing director of Microsoft Research India. The newly announced India research group focuses on cryptography, security, algorithms and multimedia security; Ramarathnam Venkatesan, a leading cryptographer at Microsoft Research in Redmond, Washington, in the US, will head the new group. Microsoft Research India will conduct a four-week summer school featuring lectures by leading experts in the fields of cryptography, algorithms and security. The program is aimed at senior undergraduate students, graduate students and faculty.

35 FarsiSum A Persian text summarizer. By: Nima Mazdak and Martin Hassel, Department of Linguistics, Stockholm University, 2004.

36 FarsiSum Tokenizer: sentence boundaries are found by searching for periods, exclamation marks, question marks, the HTML new-line tag and the Persian question mark (؟), as well as ".", ",", "!", "?", ":", spaces, tabs and new lines. Sentence scoring: text lines are put into a data structure for storing key/value pairs, called a text table.

37 FarsiSum Sentence scoring: Word score = (word frequency) * (a keyword constant). Sentence score = Σ word score (over all words in the current sentence). Average sentence length (ASL) = word count / line count. Final sentence score = (ASL * sentence score) / (number of words in the current sentence).
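The scoring formulas above can be sketched directly; the keyword constant and the word/line totals below are illustrative values, not FarsiSum's actual defaults:

```python
def farsisum_sentence_score(sentence_words, word_freq, keywords,
                            keyword_const=2.0, total_words=100, total_lines=10):
    """FarsiSum-style scoring as given on the slide: word score =
    frequency * keyword constant, summed over the sentence, then scaled
    by the average sentence length (ASL) and divided by the sentence's
    own length."""
    def word_score(w):
        return word_freq.get(w, 0) * (keyword_const if w in keywords else 1.0)
    raw = sum(word_score(w) for w in sentence_words)
    asl = total_words / total_lines  # average sentence length
    return (asl * raw) / max(len(sentence_words), 1)
```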

38 FarsiSum Notes on the current implementation: Word boundary ambiguity: the full stop (.) marks a sentence boundary, but it may also appear in abbreviations or acronyms; compound words and light verb constructions may also appear with or without a space. Ambiguity in morphology. Word order: the canonical word order in Persian is SOV, but Persian is a free word order language. Possessive construction.

39 Thanks

