Scanned Documents LBSC 796/INFM 718R Douglas W. Oard Week 8, October 29, 2007
Expanding the Search Space Scanned Docs Identity: Harriet “… Later, I learned that John had not heard …”
High Payoff Investments
[Figure: axes labeled Searchable Fraction and Transducer Capabilities; items plotted: OCR, MT, Handwriting, Speech]
The Big Picture
Find the words
Index the words
Do ranked retrieval
Use that system to find what you want
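The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the recognized text is taken as already available for each page, the sample pages are invented (with "rnemo" standing in for a misrecognized "memo"), and the helper names `tokenize`, `build_index`, and `rank` and the TF-IDF weighting are hypothetical choices, not the system used in the course.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def rank(query, index, n_docs):
    """Score documents by a simple TF-IDF sum and return them best first."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += (1 + math.log(tf)) * idf
    # Sort by descending score, breaking ties by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Invented sample pages; "rnemo" simulates an OCR error for "memo".
docs = {
    "page1": "Later, I learned that John had not heard from Harriet",
    "page2": "The rnemo was faxed to John on Monday",
    "page3": "Harriet filed the memo",
}

index = build_index(docs)
results = rank("john memo", index, len(docs))
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

In this toy collection page2 matches only on "john", because the misrecognized "rnemo" is indexed as a different term from "memo". That is one small way recognition errors can push a relevant page down the ranking.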
Some Issues
Language-based search without language!
  – Shape codes
Accuracy-selection effect of ranked retrieval
  – Poor recognition scatters in the query-term space
Blind relevance feedback
  – Based on clean text
Image-domain summaries
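To make the shape-code idea concrete, here is a minimal sketch that maps each character to a coarse shape class (ascender, descender, x-height, and so on), so that words can be compared without recognizing individual letters. The class assignments and the `shape_code` function are simplified assumptions for illustration; published shape-code schemes differ in their details.

```python
# Hypothetical, simplified shape classes (an assumption for illustration):
#   "A" = tall characters (capitals, digits, lowercase ascenders)
#   "g" = lowercase descenders
#   "i", "j" = dotted characters kept as their own classes
#   "x" = remaining lowercase letters (x-height)
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")


def shape_code(word):
    """Map a word to its string of coarse character shape classes."""
    codes = []
    for ch in word:
        if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
            codes.append("A")
        elif ch in DESCENDERS:
            codes.append("g")
        elif ch in ("i", "j"):
            codes.append(ch)
        elif ch.isalpha():
            codes.append("x")
        else:
            codes.append(ch)  # leave punctuation unchanged
    return "".join(codes)


for word in ["John", "heard", "board", "learned"]:
    print(word, shape_code(word))
```

Under this mapping "heard" and "board" both become "AxxxA". Matching on shape codes therefore trades precision for the ability to search without full character recognition.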
Some Applications
Case management for litigation
Duplicate detection for declassification productivity and anti-tiling
Knowledge management from everything I have ever xeroxed or faxed
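As one possible illustration of duplicate detection over recognized text, the sketch below compares pages by the overlap of their word shingles (Jaccard similarity). This is a generic near-duplicate technique chosen for the example, not necessarily what any of the applications above use; the sample strings and the names `shingles` and `jaccard` are hypothetical.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (as tuples) in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Invented sample pages: the first two differ in a single word.
page_a = "the memo was faxed to john on monday"
page_b = "the memo was faxed to john on tuesday"
page_c = "harriet filed the report last week"

sim_ab = jaccard(shingles(page_a), shingles(page_b))
sim_ac = jaccard(shingles(page_a), shingles(page_c))
print(round(sim_ab, 3), round(sim_ac, 3))
```

A threshold on the similarity would then flag likely duplicates. With noisy OCR, shorter shingles or character n-grams may tolerate recognition errors better, at the cost of more false matches.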
Some Applications
Legacy Tobacco Documents Library
Google Books
George Washington’s Papers