Final Project Presentation: Information Extraction
Learning to Extract Signature and Reply Lines from Email
Vitor R. Carvalho
Sig Lines Reply lines Idea:
Directions
Motivation: text-to-speech, automatic personal address management, anonymization of corpora, preprocessing for classification experiments
Related work:
Sproat, Chen & Hu: “Emu: An e-mail preprocessor for text-to-speech”; “Integrating geometrical and linguistic analysis for signature block parsing”
Pinto et al., McCallum et al.: classification of lines on FAQ pages and of tables in text documents using machine learning algorithms
2 tasks: sig detection and line extraction
Compare state-of-the-art algorithms
Supervised learning
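A minimal sketch of how a message could be turned into a sequence of lines for supervised learning, so that each line can later receive a label such as sig, reply, or other. The function names and every feature below are illustrative assumptions, not the features used in this project; they only show the shape of the data a line classifier would consume.

```python
import re

def line_features(lines, i):
    """Build a feature dict for line i of a message.

    All features are hypothetical examples chosen for illustration.
    """
    line = lines[i]
    return {
        # Quoted reply text often starts with ">" (assumed convention).
        "starts_with_quote": line.lstrip().startswith(">"),
        "is_blank": not line.strip(),
        # Crude pattern for something that looks like an e-mail address.
        "has_email_like_token": bool(re.search(r"\S+@\S+", line)),
        # "--" alone on a line is a common signature delimiter (assumption).
        "is_sig_separator": line.strip() == "--",
        # Distance from the end of the message; 0 means the last line.
        "lines_from_end": len(lines) - 1 - i,
    }

def to_line_sequence(message):
    """Represent a message as a list of (line, features) pairs, in order."""
    lines = message.splitlines()
    return [(line, line_features(lines, i)) for i, line in enumerate(lines)]
```

With this shape, sig detection is a decision about the whole message, while line extraction assigns one label per element of the sequence.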
Data
20 Newsgroups dataset
Searched for pairs of messages from the same sender whose last K lines were identical.
K ≤ 1: unlikely to have a sig. Manually checked: 586 messages without sigs.
K ≥ 6: likely to have a sig. Manually checked, with sig and reply-to lines annotated: 617 messages.
Total: lines (3321 sig lines, 5587 reply-to lines)
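The candidate search described above can be sketched roughly as follows. This is an assumed reconstruction, not the script used for the project: the helper names, the `max_k` cap, and the choice to ignore blank lines are all hypothetical. Only the thresholds K ≤ 1 and K ≥ 6 come from the slide.

```python
from collections import defaultdict
from itertools import combinations

def last_lines(body, k):
    """Return the last k non-blank lines of a message body (blank lines skipped by assumption)."""
    lines = [ln.rstrip() for ln in body.splitlines() if ln.strip()]
    return lines[-k:] if k > 0 else []

def shared_tail_length(body_a, body_b, max_k=20):
    """Count how many trailing lines two messages share (K on the slide)."""
    a = last_lines(body_a, max_k)
    b = last_lines(body_b, max_k)
    k = 0
    while k < min(len(a), len(b)) and a[-1 - k] == b[-1 - k]:
        k += 1
    return k

def bucket_messages(messages, max_k=20):
    """messages: list of (sender, body) pairs.

    Returns two sorted lists of message indices: those likely to have a sig
    (K >= 6) and those unlikely to have one (K <= 1). Messages whose sender
    appears only once form no pair and land in neither list.
    """
    by_sender = defaultdict(list)
    for idx, (sender, body) in enumerate(messages):
        by_sender[sender].append((idx, body))

    best_k = {}
    for group in by_sender.values():
        for (i, body_i), (j, body_j) in combinations(group, 2):
            k = shared_tail_length(body_i, body_j, max_k)
            best_k[i] = max(best_k.get(i, 0), k)
            best_k[j] = max(best_k.get(j, 0), k)

    likely_sig = sorted(i for i, k in best_k.items() if k >= 6)
    unlikely_sig = sorted(i for i, k in best_k.items() if k <= 1)
    return likely_sig, unlikely_sig
```

Both buckets would still need the manual check mentioned on the slide, since identical trailing lines are only a heuristic signal.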
Sig Detection Features
Sig Detection Results Sproat et al. (1999): “SIG fields are rarely longer than ten lines”.
Sig Extraction Features
Sig Extraction Results
Reply Extraction Results
Sig & Reply Extraction Results
Last Lines
Efficient method to extract sig and reply-to lines in messages, based on a sequence-of-lines representation
Comparison of state-of-the-art learning algorithms
References:
R. Sproat, J. Hu, and H. Chen. Emu: An e-mail preprocessor for text-to-speech. In 1998 Workshop on Multimedia Signal Processing, Redondo Beach, CA, December.
H. Chen, J. Hu, and R. Sproat. Integrating geometrical and linguistic analysis for signature block parsing. ACM Transactions on Information Systems, 17(4), October.
A. McCallum, D. Freitag, and F. Pereira. Maximum Entropy Markov Models for Information Extraction and Segmentation. Proceedings of ICML-2000, 2000.
D. Pinto, A. McCallum, X. Wei, and W. B. Croft. Table Extraction Using Conditional Random Fields. SIGIR, ACM, Toronto, Canada, 2003.