Content Extraction in Majordome
Overall objective: quick detection of short information elements for message filtering and reporting to the user
Functional position of this processing phase:
–Server-side, event-oriented, background task
–Subsequent and/or parallel to speech recognition (voice messages) or image processing (faxes); prior to text summarization

Useful applications (1)
Name/date/subject identification (especially useful for fax and voice messages, which have no standardized fields for storing this information):
–“You have 1 fax message from Mrs Diaconu about ‘attending the Barcelona meeting’…”
Backup information: user’s address book (PABX info yields the sender’s phone number)

Useful applications (2)
Message filtering:
–“You have received 14 personal messages, among which 3 messages from friends, 6 requests from students or colleagues, and 5 spam messages; you have received 26 mailing list messages, among which 3 calls for papers, 11 conference announcements, and 12 others.”
Backup information: RFC-822 “From” and “Subject” fields.

Techniques (1)
Text statistics measures:
–Frequency of occurrence of certain words, morphological categories, or syntactic structures in different types of messages
E.g. the noun/verb frequency ratio is higher in technical texts; some style markers are specific to certain text genres (e.g. frequent use of ‘!’ or ‘$’ in advertisements; ‘loose style’ abbreviations such as ‘CU’ and ‘IMHO’ in English, or ‘A+’ in French)
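
As an illustration of the style-marker idea, the sketch below counts a few markers per 100 tokens. It is only a minimal sketch: the marker sets, the tokenizer, and the function name `style_marker_rates` are assumptions made for this example, not part of Majordome.

```python
import re
from collections import Counter

# Hypothetical marker sets, taken from the examples on the slide.
PUNCT_MARKERS = {"!", "$"}
ABBREVIATIONS = {"CU", "IMHO", "A+"}
MARKERS = PUNCT_MARKERS | ABBREVIATIONS


def style_marker_rates(text):
    """Return, for each marker found, its number of occurrences per 100 tokens.

    A token is a run of letters (optionally followed by '+', so that 'A+'
    stays in one piece) or a single '!' or '$'. Other punctuation is ignored.
    """
    tokens = re.findall(r"[^\W\d_]+\+?|[!$]", text)
    if not tokens:
        return {}
    counts = Counter(t for t in tokens if t in MARKERS)
    return {marker: 100.0 * n / len(tokens) for marker, n in counts.items()}


if __name__ == "__main__":
    print(style_marker_rates("Win $$$ now!!! Buy now!"))
    print(style_marker_rates("IMHO this is fine, CU tomorrow. A+"))
```

A high rate of ‘!’ or ‘$’ would then count as one piece of evidence for the advertisement genre; where to put the threshold is left open here.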

Techniques (2)
Text skimming:
–Spotting “good candidates” for specific word types (e.g. proper names): selecting capitalized words…
–… comparing them with entries in a database of common first names and family names, and/or…
–… using local grammars to disambiguate other cases.
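
The first two skimming steps can be sketched as follows. The tiny `KNOWN_NAMES` set stands in for a real name database, and the labels `known`, `candidate`, and `ambiguous` are hypothetical. Sentence-initial capitals are only flagged as ambiguous; resolving them with local grammars is not attempted here.

```python
import re

# Hypothetical stand-in for a database of common first and family names.
KNOWN_NAMES = {"maria", "diaconu", "john", "smith"}


def name_candidates(text):
    """Return (word, status) pairs for every capitalized word in the text.

    status is 'known'     if the word appears in the name database,
              'ambiguous' if it is capitalized only at the start of a sentence,
              'candidate' otherwise (capitalized in mid-sentence).
    """
    results = []
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[^\W\d_]+", sentence)
        for position, word in enumerate(words):
            if not word[0].isupper():
                continue
            if word.lower() in KNOWN_NAMES:
                status = "known"
            elif position == 0:
                status = "ambiguous"
            else:
                status = "candidate"
            results.append((word, status))
    return results


if __name__ == "__main__":
    for word, status in name_candidates(
        "Yesterday Maria Diaconu called Barcelona. She left."
    ):
        print(word, status)
```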

Techniques (3)
Merging visual clues and textual clues for mutual reinforcement of identification probability.
E.g. the probability that an unidentified, capitalized character string is the proper name of a fax’s sender increases if it stands alone on a line at the top of the image.
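
The slide does not say how the two kinds of clues are merged. One simple possibility, shown here purely as an assumption, is a naive-Bayes-style update in odds form: each clue contributes a likelihood ratio, and the clues are treated as independent. The prior and the ratios below are invented for illustration.

```python
def combine_clues(prior, likelihood_ratios):
    """Update a prior probability with independent clues (odds form of Bayes' rule).

    prior             -- probability, strictly between 0 and 1, before any clue
    likelihood_ratios -- one ratio per clue: P(clue | is sender name) /
                         P(clue | is not sender name); values above 1 raise
                         the probability, values below 1 lower it
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)


if __name__ == "__main__":
    prior = 0.2            # invented prior for "this string is the sender's name"
    textual_clue = 3.0     # invented ratio: unidentified capitalized string
    visual_clue = 4.0      # invented ratio: alone on a line at the top of the page
    print(combine_clues(prior, [textual_clue]))
    print(combine_clues(prior, [textual_clue, visual_clue]))
```

With these invented numbers the textual clue alone gives about 0.43, and adding the visual clue raises it to 0.75, which is the mutual reinforcement the slide describes.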

Content Extraction: Current Developments
–Toolbox for text statistics (word frequency, contextual windows, co-occurrence frequency…)
–Tool for determining fuzzy membership in a given class of words
–Tool for determining document language and segmenting multilingual documents
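
To make the text-statistics items concrete, here is a minimal sketch of word frequency and co-occurrence frequency within a contextual window. It is not the Majordome toolbox itself; the function name `cooccurrence_counts` and the `window` parameter are assumptions for this example.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count unordered pairs of distinct tokens that occur close together.

    Two tokens co-occur when the second one is at most `window` positions
    after the first. Pairs are stored in alphabetical order so that
    ('for', 'call') and ('call', 'for') are counted together.
    """
    pairs = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return pairs


if __name__ == "__main__":
    tokens = "call for papers call for participation".split()
    print(Counter(tokens).most_common(2))       # word frequency
    print(cooccurrence_counts(tokens, window=1))  # adjacent-word co-occurrence
```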

Content Extraction: Future Developments
–Text categorization module for message sorting and filtering
–Text genre database with (user-controlled) learning capabilities