Search for personal information using Yahoo BOSS by Evgeny Dosychev Dmitry Kichin Supervisor: Eddie Bortnikov.

1 Search for personal information using Yahoo BOSS by Evgeny Dosychev Dmitry Kichin Supervisor: Eddie Bortnikov

2 HomePage Project Finding personal information on the web is not an easy task. We want to create an automatic tool that finds and presents personal information about a requested person.

3 Technical Issues We need an effective way to find information on the web. We will use Yahoo! BOSS. Personal information on the web is not in a standard format. We focus on working with IEEE PDF documents.

4 Technical Issues How will we parse the information and identify the different details? Convert PDF to text using a dedicated Java package, and rely on the standard structure of IEEE documents. How will we avoid confusing different people who share the same name (name ambiguity)? Divide the information into clusters and let the user choose between the clusters*.
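Once a PDF has been converted to plain text, identifying one of the details (the email address) can be sketched roughly as below. This is only an illustration, not the project's actual parser: the class name `EmailExtractor`, the method `extractEmails`, and the regular expression are assumptions, and the pattern is deliberately simple rather than a full address grammar.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls email addresses out of text extracted from a PDF.
class EmailExtractor {
    // Simplified pattern (an assumption); it does not cover every valid address form.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+");

    // Returns the distinct addresses found, lower-cased, in order of first appearance.
    static Set<String> extractEmails(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group().toLowerCase(Locale.ROOT));
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "J. Doe (jdoe@example.org), A. Roe (A.Roe@example.com)";
        System.out.println(extractEmails(sample));
    }
}
```

Lower-casing the addresses makes them usable as stable keys later, when items are grouped by email.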

5 Technologies Java: used to build the Windows desktop application. Yahoo! BOSS: provides free access to the Yahoo! search index. PDFBox: a Java library used for extracting text from PDF documents.

6 BOSS Yahoo! Search BOSS (Build your Own Search Service) is a Yahoo! initiative that gives developers free access to the Yahoo! Search index. The results can be fed into the developer's application and processed according to its needs. Up to 500 results can be retrieved. (Based on Wikipedia.)
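Because only up to 500 results can be retrieved, a client that fetches results in pages has to clip its requests to that cap. The sketch below shows one way to compute the (start, count) windows. It makes no claim about the real BOSS request format: the class `ResultPaging`, the method `pages`, and the page size of 50 used in the example are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a request for N results into page windows.
class ResultPaging {
    // Overall cap taken from the slide ("up to 500 results").
    static final int MAX_RESULTS = 500;

    // Returns {start, count} pairs covering the first `wanted` results,
    // clipped to MAX_RESULTS. The page size is chosen by the caller.
    static List<int[]> pages(int wanted, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int total = Math.min(Math.max(wanted, 0), MAX_RESULTS);
        List<int[]> pages = new ArrayList<>();
        for (int start = 0; start < total; start += pageSize) {
            pages.add(new int[] { start, Math.min(pageSize, total - start) });
        }
        return pages;
    }

    public static void main(String[] args) {
        for (int[] p : pages(120, 50)) {
            System.out.println("start=" + p[0] + " count=" + p[1]);
        }
    }
}
```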

7 HomePage functionality Desktop Java application. Gets the search target from the user. Searches the web using Yahoo! BOSS. Downloads and parses PDF documents and images, and produces an HTML page with the information that was found (currently: email, publication titles, short publication summaries, images, and links to the full documents).
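The last step, producing an HTML page from the extracted details, might look roughly like the following. It is a minimal sketch, not the application's actual output format: the class `HtmlReport`, its methods, and the page layout are assumptions, and only the email, publication titles, and links are rendered.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical renderer: builds a small HTML page from extracted details.
class HtmlReport {
    // Escapes the characters that are unsafe in HTML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // `publications` maps a publication title to the link of the full document.
    static String render(String name, String email, Map<String, String> publications) {
        StringBuilder html = new StringBuilder();
        html.append("<html><head><title>").append(escape(name)).append("</title></head><body>\n");
        html.append("<h1>").append(escape(name)).append("</h1>\n");
        html.append("<p>Email: ").append(escape(email)).append("</p>\n<ul>\n");
        for (Map.Entry<String, String> e : publications.entrySet()) {
            html.append("<li><a href=\"").append(escape(e.getValue())).append("\">")
                .append(escape(e.getKey())).append("</a></li>\n");
        }
        html.append("</ul>\n</body></html>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        Map<String, String> pubs = new LinkedHashMap<>();
        pubs.put("A <B> study", "http://example.org/a.pdf");
        System.out.print(render("J. Doe", "jdoe@example.org", pubs));
    }
}
```

Escaping matters here because titles and summaries come from arbitrary documents and may contain characters such as `<` or `&`.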

8 HomePage functionality Divides the information into clusters (keyed by email). Asks the user to choose which clusters fit. Produces an HTML page with all the details.

9 Screenshots

10 Clustering algorithm It is very hard for a computer to resolve name ambiguity, so we leave this task to the user. Each group of information items (cluster) is defined by its key (email), and the user makes the choice. The result page is produced from the chosen clusters.
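Grouping items by their email key and then keeping only the clusters the user picked can be sketched as follows. The class `EmailClustering`, the `InfoItem` type, and its fields are hypothetical names for illustration; the real class design may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of clustering information items by email key.
class EmailClustering {
    // One extracted piece of information; the fields are illustrative.
    static final class InfoItem {
        final String email;
        final String title;

        InfoItem(String email, String title) {
            this.email = email;
            this.title = title;
        }
    }

    // Groups items into clusters keyed by email, preserving first-seen order.
    static Map<String, List<InfoItem>> clusterByEmail(List<InfoItem> items) {
        Map<String, List<InfoItem>> clusters = new LinkedHashMap<>();
        for (InfoItem item : items) {
            clusters.computeIfAbsent(item.email, k -> new ArrayList<>()).add(item);
        }
        return clusters;
    }

    // Collects the items of the clusters whose keys the user selected.
    static List<InfoItem> select(Map<String, List<InfoItem>> clusters, List<String> chosenKeys) {
        List<InfoItem> result = new ArrayList<>();
        for (String key : chosenKeys) {
            List<InfoItem> cluster = clusters.get(key);
            if (cluster != null) {
                result.addAll(cluster);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<InfoItem> items = Arrays.asList(
                new InfoItem("jdoe@example.org", "Paper A"),
                new InfoItem("john.doe@example.com", "Paper B"),
                new InfoItem("jdoe@example.org", "Paper C"));
        Map<String, List<InfoItem>> clusters = clusterByEmail(items);
        System.out.println(clusters.keySet());
        System.out.println(select(clusters, Arrays.asList("jdoe@example.org")).size());
    }
}
```

Keying on email keeps the grouping trivial, at the cost of splitting one person who uses several addresses into several clusters. Letting the user select more than one cluster covers that case.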

11 Workflow

12 Class Diagram

13 Flow Diagram

14 Challenges PDFBox proved to be unreliable and problematic; it is not the best solution for PDF parsing. Perhaps the main challenge was semantic parsing (finding information in the text). We discovered that semantic parsing is by itself a very difficult task that requires time and resources beyond the project scope.

15 Conclusions We learned the principles of the BOSS project and used the power that it provides. We prepared a well-designed object-oriented infrastructure for the task. HomePage can serve as a good infrastructure for adding algorithms that find additional information in the texts. Extracting and identifying information from the text requires specific algorithms and methods.
