Content Extraction in Majordome


1 Content Extraction in Majordome
Overall objective: quick detection of short information elements for message filtering and reporting to the user.
Functional position of this processing phase:
– Server-side, event-oriented background task
– Subsequent and/or parallel to speech recognition (voice messages) or image processing (faxes); prior to text summarization

2 Useful applications (1)
Name/date/subject identification. This task is especially useful for fax and voice messages, which have no standardized fields for storing this information.
– “You have 1 fax message from Mrs Diaconu about ‘attending the Barcelona meeting’…”
Backup information: the user’s address book (PABX information yields the sender’s phone number).
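As a rough illustration of the address-book backup, the sketch below resolves a sender's phone number (as the PABX might supply it) to a name and builds the spoken report. The address-book contents, the phone number format, and the function name `announce_fax` are assumptions made for this example, not part of Majordome.

```python
from typing import Dict, Optional

# Hypothetical address book: phone number -> display name.
ADDRESS_BOOK: Dict[str, str] = {"+33123456789": "Mrs Diaconu"}


def announce_fax(sender_phone: str,
                 subject: Optional[str],
                 address_book: Dict[str, str] = ADDRESS_BOOK) -> str:
    """Build a one-line report for a fax, using the address book as backup."""
    # Fall back to the raw phone number when the sender is unknown.
    sender = address_book.get(sender_phone, sender_phone)
    text = f"You have 1 fax message from {sender}"
    if subject:
        text += f" about '{subject}'"
    return text + "."


print(announce_fax("+33123456789", "attending the Barcelona meeting"))
```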

3 Useful applications (2)
Message filtering:
– “You have received 14 personal e-mail messages, including 3 messages from friends, 6 requests from students or colleagues, and 5 spam messages; you have received 26 mailing list messages, including 3 calls for papers, 11 conference announcements, and 12 others.”
Backup information: RFC-822 “From” and “Subject” fields.
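A minimal sketch of how the RFC-822 “From” and “Subject” fields could feed such a report, using Python's standard `email` package. The friend list, the keyword rules, and the category labels are hypothetical placeholders; a real filter would need far richer evidence than subject keywords.

```python
from collections import Counter
from email import message_from_string
from email.utils import parseaddr

# Hypothetical user data and keyword rules, for illustration only.
FRIENDS = {"ana@example.org"}
LIST_KEYWORDS = {
    "call for papers": "call for papers",
    "conference": "conference announcement",
}


def classify(raw_message: str) -> str:
    """Assign a coarse category from the From and Subject header fields."""
    msg = message_from_string(raw_message)
    _, address = parseaddr(msg.get("From", ""))
    subject = (msg.get("Subject") or "").lower()
    for keyword, label in LIST_KEYWORDS.items():
        if keyword in subject:
            return label
    if address.lower() in FRIENDS:
        return "friend"
    return "other"


def summarize(raw_messages) -> Counter:
    """Count messages per category, ready to be turned into a report."""
    return Counter(classify(raw) for raw in raw_messages)
```

The counts returned by `summarize` can then be formatted into a sentence like the one quoted above.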

4 Techniques (1)
Text statistics measures:
– Frequency of occurrence of certain words, morphological categories, or syntactic structures in different types of messages
E.g. a higher noun/verb frequency ratio in technical texts; style markers specific to some text genres (e.g. frequent use of ‘!’ or ‘$’ in advertisements; ‘loose style’ abbreviations such as ‘CU’ and ‘IMHO’ in English, or ‘A+’ in French).
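The style markers mentioned above can be measured with very little machinery. The following sketch counts ‘!’ and ‘$’ characters and ‘loose style’ abbreviations per token; the abbreviation list and the choice of per-token rates are assumptions for illustration. The noun/verb ratio is not covered here, since it would require a part-of-speech tagger.

```python
# Hypothetical list of 'loose style' abbreviations (lower-cased).
ABBREVIATIONS = {"cu", "imho", "a+"}


def style_markers(text: str) -> dict:
    """Return simple per-token rates for two kinds of style markers."""
    # Split on whitespace, then strip common punctuation from each token.
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return {"marks_per_token": 0.0, "abbrev_per_token": 0.0}
    marks = text.count("!") + text.count("$")
    abbrevs = sum(1 for t in tokens if t in ABBREVIATIONS)
    return {
        "marks_per_token": marks / len(tokens),
        "abbrev_per_token": abbrevs / len(tokens),
    }


print(style_markers("Win $$$ now!!! CU"))
```

High values of either rate would be weak evidence for an advertisement or an informal message, to be combined with other measures rather than used alone.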

5 Techniques (2)
Text skimming:
– Spotting “good candidates” for specific word types (e.g. proper names): selecting capitalized words…
– … comparing them with entries in a database of common first names and family names, and/or…
– … using local grammars to disambiguate other cases.
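The three skimming steps can be sketched as follows. The first-name set stands in for the names database, and the single “title + capitalized word” rule stands in for a local grammar; both are hypothetical and far smaller than anything usable in practice.

```python
import re

# Hypothetical stand-ins for the names database and the local grammar.
FIRST_NAMES = {"maria", "jean", "anna"}
TITLES = {"mr", "mrs", "ms", "dr", "prof"}


def name_candidates(text: str):
    """Return (word, evidence) pairs for capitalized words judged to be names."""
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters, Unicode-aware
    found = []
    for i, word in enumerate(words):
        # Step 1: only capitalized words are candidates; titles are skipped.
        if not word[0].isupper() or word.lower() in TITLES:
            continue
        previous = words[i - 1].lower() if i > 0 else ""
        # Step 2: compare with the names database.
        if word.lower() in FIRST_NAMES:
            found.append((word, "name database"))
        # Step 3: local grammar, here a title followed by a capitalized word.
        elif previous in TITLES:
            found.append((word, "local grammar"))
    return found


print(name_candidates("Please call Mrs Diaconu or Maria before the Barcelona meeting."))
```

Capitalized words that match neither rule (such as “Please” or “Barcelona” here) remain unidentified and are left for other clues to resolve.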

6 Techniques (3)
Merging visual clues and textual clues for mutual reinforcement of identification probability.
E.g. the probability that an unidentified capitalized character string is the proper name of a fax’s sender increases if it stands alone on a line at the top of the image.
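One simple way to model this reinforcement, offered only as an assumption since the slides do not say how the clues are merged, is a Bayesian odds update in which each clue contributes a likelihood ratio and the clues are treated as independent. The prior and the ratios below are invented numbers.

```python
def combine_clues(prior: float, likelihood_ratios) -> float:
    """Update a prior probability with independent clues (naive Bayes odds form)."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        # ratio = P(clue | string is the sender's name) / P(clue | it is not)
        odds *= ratio
    return odds / (1.0 + odds)


# Hypothetical values: textual clue (capitalized, unknown word) gives the prior;
# visual clues are "alone on its line" and "near the top of the page".
textual_only = combine_clues(0.2, [])
with_visual = combine_clues(0.2, [3.0, 4.0])
print(textual_only, with_visual)
```

Ratios above 1 raise the probability and ratios below 1 lower it, so a contradicting visual clue can also weaken a textual guess.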

7 Content Extraction: Current Developments
– Toolbox for text statistics (word frequency, contextual windows, co-occurrence frequency…)
– Tool for determining fuzzy membership in a given class of words
– Tool for determining document language and segmenting multilingual documents
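The text statistics in the first item can be illustrated in a few lines. This is a sketch of the general measures (word frequency and co-occurrence within a contextual window), not of the Majordome toolbox itself; the function names and the window definition are assumptions.

```python
from collections import Counter


def word_frequencies(tokens) -> Counter:
    """Count how often each token occurs."""
    return Counter(tokens)


def cooccurrences(tokens, window: int = 2) -> Counter:
    """Count unordered pairs of distinct tokens at most `window` positions apart."""
    pairs = Counter()
    for i, word in enumerate(tokens):
        # Look only forward so that each pair of positions is counted once.
        for other in tokens[i + 1:i + 1 + window]:
            if other != word:
                pairs[tuple(sorted((word, other)))] += 1
    return pairs


tokens = "call for papers call for demos".split()
print(word_frequencies(tokens).most_common(2))
print(cooccurrences(tokens, window=1))
```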

8 Content Extraction: Future Developments
– Text categorization module for message sorting and filtering
– Text genre database with (user-controlled) learning capabilities

