Published by Myrtle Lesley Campbell. Modified over 9 years ago.
1
Content Extraction in Majordome
Overall objective: quick detection of short information elements for message filtering and reporting to the user.
Functional position of this processing phase:
– Server-side, event-oriented background task
– Subsequent and/or parallel to speech recognition (voice messages) or image processing (faxes); prior to text summarization
2
Useful applications (1)
Name/Date/Subject identification (especially useful for fax and voice messages, which have no standardized fields for storing this information):
– “You have 1 fax message from Mrs Diaconu about ‘attending the Barcelona meeting’…”
Backup information: the user’s address book (PABX information yields the sender’s phone number).
3
Useful applications (2)
Message filtering:
– “You have received 14 personal e-mail messages, among which 3 messages from friends, 6 requests from students or colleagues, and 5 spam messages; you have received 26 mailing list messages, among which 3 calls for papers, 11 conference announcements, and 12 others.”
Backup information: RFC-822 “From” and “Subject” fields.
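As a rough illustration of how the RFC-822 “From” and “Subject” fields could back up such a report, here is a minimal sketch using Python's standard `email` package. The category names, the `friend_addresses` set, and the "call for papers" subject rule are assumptions made for this example; the slides do not say how Majordome actually assigns categories.

```python
from collections import Counter
from email import message_from_string
from email.utils import parseaddr


def sender_and_subject(raw_message):
    """Return (sender address, subject) read from the RFC-822 headers."""
    msg = message_from_string(raw_message)
    _display_name, address = parseaddr(msg.get("From", ""))
    return address.lower(), msg.get("Subject", "")


def categorize(raw_message, friend_addresses):
    """Toy rules (assumed, not from the slides): known friend, else subject keyword."""
    address, subject = sender_and_subject(raw_message)
    if address in friend_addresses:
        return "friend"
    if "call for papers" in subject.lower():
        return "call for papers"
    return "other"


def summarize(raw_messages, friend_addresses):
    """Count messages per category, as input for a spoken or written report."""
    return Counter(categorize(m, friend_addresses) for m in raw_messages)


# Hypothetical inbox used only to exercise the functions.
inbox = [
    "From: Ana <ana@example.org>\nSubject: Lunch?\n\nSee you at noon.",
    "From: chairs@example.org\nSubject: Call for Papers: a workshop\n\nDetails inside.",
]
print(summarize(inbox, {"ana@example.org"}))
```

Header lookup alone only covers e-mail; for fax and voice messages the same categories would have to come from extracted content instead.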
4
Techniques (1)
Text statistics measures:
– Frequency of occurrence of certain words, morphological categories, or syntactic structures in different types of messages
E.g. a higher noun/verb frequency ratio in technical texts; style markers specific to some text genres (e.g. frequent use of ‘!’ or ‘$’ in advertisements; ‘loose style’ abbreviations such as ‘CU’ and ‘IMHO’ in English, or ‘A+’ in French).
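The style markers mentioned above can be counted with a few lines of Python. This is a minimal sketch under assumptions: the function name `style_marker_rates`, the per-word normalization, and the tiny abbreviation set (taken from the slide's examples) are illustrative choices, not the toolbox's actual design. The noun/verb ratio is left out because it would need a part-of-speech tagger.

```python
import re

# Abbreviations quoted on the slide; a real list would be much longer.
LOOSE_ABBREVIATIONS = {"cu", "imho", "a+"}


def style_marker_rates(text):
    """Return per-word rates of '!'/'$' marks and of loose-style abbreviations."""
    # A word is a run of letters, optionally followed by '+' so that 'A+' survives.
    words = re.findall(r"[A-Za-z]+\+?", text)
    if not words:
        return {"marks_per_word": 0.0, "abbreviations_per_word": 0.0}
    marks = text.count("!") + text.count("$")
    abbreviations = sum(1 for w in words if w.lower() in LOOSE_ABBREVIATIONS)
    return {
        "marks_per_word": marks / len(words),
        "abbreviations_per_word": abbreviations / len(words),
    }


print(style_marker_rates("Buy now!!! Only $9, save $$$!"))
print(style_marker_rates("IMHO the draft is fine. CU tomorrow"))
```

Rates like these would serve as features for a genre decision; the thresholds separating, say, advertisements from personal mail would have to be estimated from labelled messages.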
5
Techniques (2)
Text skimming:
– Spotting “good candidates” for specific word types (e.g. proper names) by selecting capitalized words…
– … comparing them with entries in a database of common first names and family names, and/or…
– … using local grammars to disambiguate other cases.
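The three skimming steps can be sketched as follows. Everything specific here is an assumption for illustration: the two small name sets stand in for the name database, and the only "local grammar" shown is a single rule that accepts a capitalized word directly preceded by a courtesy title.

```python
import re

# Hypothetical stand-ins for the first-name / family-name database.
FIRST_NAMES = {"john", "maria"}
FAMILY_NAMES = {"campbell"}
# One toy local-grammar rule: title followed by a capitalized word.
TITLES = {"mr", "mrs", "ms", "dr", "prof"}


def name_candidates(text):
    """Return (word, evidence) pairs for capitalized words that look like names."""
    tokens = re.findall(r"[A-Za-z]+", text)
    results = []
    for i, token in enumerate(tokens):
        if not token[0].isupper() or token.lower() in TITLES:
            continue  # step 1: only capitalized, non-title words are candidates
        previous = tokens[i - 1].lower() if i > 0 else ""
        if token.lower() in FIRST_NAMES or token.lower() in FAMILY_NAMES:
            results.append((token, "database"))  # step 2: database match
        elif previous in TITLES:
            results.append((token, "title rule"))  # step 3: local grammar
    return results


print(name_candidates("You have a fax from Mrs Diaconu and John about the Barcelona meeting."))
```

In this sketch "You" and "Barcelona" are capitalized but rejected, since neither the database nor the title rule supports them; "Diaconu" is accepted only through the title rule.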
6
Techniques (3)
Merging visual clues and textual clues for mutual reinforcement of identification probability.
E.g. the probability that an unidentified, capitalized character string is the proper name of a fax’s sender increases if it stands alone on a line at the top of the image.
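One common way to make clues reinforce each other is a Bayesian odds update that treats the clues as independent. The slides do not say which combination rule Majordome uses, so the sketch below is only an assumed stand-in, and the prior and likelihood ratios are invented numbers.

```python
def combine_clues(prior, likelihood_ratios):
    """Update a prior probability with independent clues given as likelihood ratios.

    A ratio above 1 means the clue is more likely when the string really is the
    sender's name; a ratio below 1 counts against it.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)


# Invented figures: a capitalized, unidentified string starts at 0.2.
textual_only = combine_clues(0.2, [4.0])        # textual clue: capitalized, follows "From"
with_layout = combine_clues(0.2, [4.0, 3.0])    # plus visual clue: alone on a top line
print(textual_only, with_layout)                # roughly 0.5, then roughly 0.75
```

The independence assumption is a simplification: layout and wording clues on a fax cover page are often correlated, which would make the combined probability overconfident.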
7
Content Extraction: Current Developments
– Toolbox for text statistics (word frequency, contextual windows, co-occurrence frequency…)
– Tool for determining fuzzy membership in a given class of words
– Tool for determining document language and segmenting multilingual documents
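To make the text statistics item concrete, here is a minimal sketch of word frequency and of co-occurrence counting inside a contextual window. The function names, the tokenization rule, and the window semantics (the next `window` tokens to the right, unordered pairs) are assumptions; the slides give no details of the actual toolbox.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def cooccurrences(tokens, window=2):
    """Count unordered pairs of distinct words at most `window` tokens apart."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return pairs


tokens = tokenize("Call for papers, call for posters")
print(word_frequencies(tokens).most_common(2))
print(cooccurrences(tokens, window=1))
```

With `window=1` only adjacent words are paired; a wider window picks up looser associations at the cost of more noise.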
8
Content Extraction: Future Developments
– Text categorization module for message sorting and filtering
– Text genre database with (user-controlled) learning capabilities