1
Document Data Mining
Design Review
November 18, 2010
Team Members: Dallas Stinger, Wenlong Huang, Aaron Phillips
Advisor: Gregory Donohoe, Ph.D.
2
The Problem
The State Board collects meeting minutes and other documents that record its decisions
Board members want to retrieve text from old documents that relates to current issues
– May not recall when the issue was discussed
– May not know the exact keywords to search for
3
The Existing Solution
Currently, all files exist on a large, unorganized shared network drive.
Finding information recorded in documents requires knowing when it was recorded, and in which document.
4
Requirements / Design Decisions
5
Multiple File Types
System limited to the major file types
– Word documents (.doc, .docx)
– PDF files (.pdf)
– Excel (.xls, .xlsx)
– Text (.txt)
Lacking
– WordPerfect (.wpd)
– PDF files that were scanned in
– Open Office document types
6
Multi-User Access
Web Based
Pros:
– Information searchable anywhere
– Only one index required
– Index can be rebuilt on a regular basis without interrupting users
Cons:
– File permissions
Individual User Application
Pros:
– Can be programmed to learn user behavior
– Apply more emphasis to files the user has used before (looks at search history to aid in new searches)
Cons:
– Software package must be installed on each user's machine
7
Search Collection of Documents Efficiently
Real Time Searching
– Pros:
    Easy
    No initial overhead
– Cons:
    Time consuming (> 100,000 words)
    Unable to find non-exact search results
Reverse Indexing
– Pros:
    Fast and efficient
    Able to find useful information without exact search text known
– Cons:
    Large initial overhead (pre-analyze all documents)
    Keep index file up to date
    Storage space necessary
Results displayed in less than a second
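The reverse-indexing trade-off can be made concrete with a minimal Python sketch. It is not the team's implementation: the function names `tokenize`, `build_index`, and `lookup`, and the two sample documents, are invented for illustration. The index is built once up front (the initial overhead), after which a lookup never rescans the files.

```python
from collections import defaultdict
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each word to {document name: [word positions]}."""
    index = defaultdict(dict)
    for name, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(name, []).append(position)
    return index

def lookup(index, word):
    """Return the documents containing `word` without rescanning any file."""
    return sorted(index.get(word.lower(), {}))

# Hypothetical sample documents, invented for illustration.
documents = {
    "minutes_2009.txt": "The board approved the license renewal policy",
    "minutes_2010.txt": "License fees were discussed by the board",
}
index = build_index(documents)
```

Storing word positions, not just document names, costs extra storage but makes it possible to measure how far apart the matched words are.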
9
Find Useful Information Without Exact String Specification (A: Stemming)
Create our own
– Pros:
    Pay attention to details that may be lacking in existing algorithms (aglet vs. readable)
    More efficient
    Define special cases
– Cons:
    Requires a lot of time
Use existing algorithm
– Pros:
    Readily available
    Spend more time on other important details
– Cons:
    Special cases incorrect
    Some root words are truncated
10
Porter Stemming Algorithm
Large set of suffix-stripping steps based on English natural language to determine the root of a word
Extensively used in programs
Outdated: results not always correct
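To give a feel for what one of those steps looks like, here is a Python sketch of step 1a only (plural suffixes) of the original Porter algorithm. The function name `porter_step_1a` is made up for this example, and the full algorithm has several more steps that are not shown.

```python
def porter_step_1a(word):
    """Apply only step 1a of the Porter stemmer (plural suffixes).

    Rules are tried in order and the first matching suffix wins:
    SSES -> SS, IES -> I, SS -> SS, S -> (removed).
    """
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word
```

The `ponies -> poni` case shows why stems can look truncated: the output is a matching key, not necessarily an English word.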
11
Find Useful Information Without Exact String Specification (B: Thesaurus)
Own Model
– Pros:
    Fine tune thesaurus to have only relevant terms (terms that exist inside our index file)
– Cons:
    Very time consuming and complex
Using pre-built Thesaurus
– Pros:
    Quick and easy to use
    Very extensive
– Cons:
    Has irrelevant search term results
    Unnecessary terms for State Board
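One possible middle ground between the two options is to start from a pre-built thesaurus and drop every synonym that never occurs in the index. The Python sketch below assumes a thesaurus stored as a dictionary of word lists; `trim_thesaurus` and the sample data are hypothetical and do not come from the project.

```python
def trim_thesaurus(thesaurus, indexed_words):
    """Keep only synonyms that actually occur in the index."""
    trimmed = {}
    for term, synonyms in thesaurus.items():
        kept = [s for s in synonyms if s in indexed_words]
        if kept:  # drop terms left with no usable synonym
            trimmed[term] = kept
    return trimmed

# Hypothetical data, invented for illustration.
thesaurus = {
    "fee": ["charge", "toll", "tariff"],
    "rule": ["regulation", "canon"],
    "vote": ["ballot"],
}
indexed_words = {"fee", "charge", "rule", "regulation", "policy"}
trimmed = trim_thesaurus(thesaurus, indexed_words)
```

A synonym that appears in no indexed document cannot produce a match, so removing it shrinks the search without changing the results.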
12
Searching
User types in search criteria
– Determine whether they want narrow or broad search results
    Broad search may retrieve too many results
Search algorithm converts each typed word into a list of possible stems and synonyms
Tries all possible permutations of the words, looking for the closest match to the search
Calculates the standard deviation of the distance between all of the words
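The slide does not spell out the details, so the following Python sketch is only one reading of these steps. `expand_query` combines each word with its candidate stems and synonyms, and `spacing_spread` takes the standard deviation of the gaps between consecutive matched word positions. Both function names and the `variants` table format are assumptions.

```python
from itertools import product
from statistics import pstdev

def expand_query(words, variants):
    """Return every combination of each word's candidate stems and synonyms."""
    candidate_lists = [[w] + variants.get(w, []) for w in words]
    return [list(combo) for combo in product(*candidate_lists)]

def spacing_spread(positions):
    """Standard deviation of the gaps between consecutive matched positions."""
    ordered = sorted(positions)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return pstdev(gaps) if gaps else 0.0

# Hypothetical variants table, invented for illustration.
variants = {"license": ["licens"], "fee": ["charge"]}
queries = expand_query(["license", "fee"], variants)
```

Under this reading, words that sit at even spacing give a spread of zero, while a match with one distant word gives a large spread and can be ranked lower.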
13
Searching (cont.)
Each file is ranked based on the number of matches it contains
– Exact matches rank highest
– Reordering of exact match is ranked next
– Stems, synonyms, partial matches, and large spacing between searched words rank lowest
All rank values found inside a file are summed
Highest ranked files considered most relevant
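The ranking rule can be sketched in a few lines of Python. The slides give only the ordering of the match types, so the numeric weights below are placeholders, and `score_file` and `rank_files` are hypothetical names.

```python
# Hypothetical weights: only their ordering comes from the slide.
WEIGHTS = {"exact": 3.0, "reordered": 2.0, "loose": 1.0}

def score_file(matches):
    """Sum the rank value of every match found inside one file."""
    return sum(WEIGHTS[kind] for kind in matches)

def rank_files(matches_by_file):
    """Order files from most to least relevant, ties broken by name."""
    scores = {name: score_file(kinds) for name, kinds in matches_by_file.items()}
    return sorted(scores, key=lambda name: (-scores[name], name))

# Hypothetical match data, invented for illustration.
matches_by_file = {
    "a.doc": ["exact", "loose"],
    "b.pdf": ["reordered", "reordered", "loose"],
    "c.txt": ["loose"],
}
ranking = rank_files(matches_by_file)
```

Because the values are summed, a file with several weaker matches can outrank a file with a single exact match, as `b.pdf` does here.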
21
Unit Testing
22
Unit Testing
Benefits
– Goal
– Facilitates change
Limitations
– Not omnipotent
– Low cost performance
23
Unit Testing
DocumentTest:

    /// Returns the document location
    public void getFileLocationTest()
    {
        convertPDF converpdf = new convertPDF("D:\\Class\\test.pdf");
        string expected = "D:\\Class\\test.pdf";
        string actual = converpdf.getFileLocation();
        // Expected value goes first so that failure messages read correctly.
        Assert.AreEqual(expected, actual);
    }
24
Unit Testing

    /// Creates word count in alphabetical order for all words located inside PDF
    public void createDictionaryTest()
    {
        convertPDF converpdf = new convertPDF("D:\\Class\\test.pdf");
        string toDictionary = "this is test code code code";
        converpdf.createDictionary(toDictionary);
        int actual;
        // "code" appears three times in the input string.
        converpdf.WordCounts.TryGetValue("code", out actual);
        Assert.AreEqual(3, actual);
    }
25
End of Semester Status
Goals:
– Working, tested prototype
– Documentation for future teams
Plenty of areas open for extension or improvement
26
Future Possibilities: File Types
Currently supported file types
– Microsoft Word
– Microsoft Excel
– PDF
No optical character recognition
Our system will allow for easy extension
28
Future Possibilities: Indexing
We have a relatively simple indexing scheme
A more complex indexing scheme could reduce search time
Our indexing scheme is very general
– Could be made specific to the State Board
– Could lead to more relevant results
29
Future Possibilities: Searching
Search time increases quickly as search terms are added
Thesaurus is broad
– Large number of synonyms can slow search
– Could be trimmed to fit domain
Porter stemming algorithm could be replaced
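A rough count shows why search time grows so quickly. The Python sketch below assumes that every search word is tried together with each of its stems and synonyms, and optionally in every word order. That is an interpretation of the search description, not a measurement of the actual software, and `candidate_queries` is a hypothetical helper.

```python
from math import factorial, prod

def candidate_queries(variant_counts, include_reorderings=False):
    """Count candidate queries when each word has the given number of variants.

    Each word contributes itself plus its variants, and the choices multiply.
    """
    combos = prod(1 + n for n in variant_counts)
    if include_reorderings:
        combos *= factorial(len(variant_counts))
    return combos

two_words = candidate_queries([4, 4])                 # 5 * 5
three_words = candidate_queries([4, 4, 4])            # 5 * 5 * 5
three_reordered = candidate_queries([4, 4, 4], True)  # 125 * 3!
```

With four variants per word, going from two search words to three raises the count from 25 to 125, and to 750 if reorderings are also tried. Trimming the thesaurus reduces every factor in that product.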
30
Future Possibilities: Correlation
Related documents should be correlated
– By date?
– Using a tagging system?
31
Future Possibilities: Decision Database
A client need that is not addressed by our software
Many board decisions have been passed, with varying lifetimes
A database could track all board decisions and lifespan
Possible connection to our search engine?
32
Future Possibilities: Web-Based Interface
Software will be installed on each user’s computer
GUI could be web based, with access restricted to State Board employees
Users could search from home or while on the road, not just in the office
Indexing would be simplified
33
Questions?