Practical Project of the 2006 Joint International Master’s Degree
Agenda
- Introduction
- Technologies in use
- Architecture
- Demonstration
- Remaining Issues
- Work packages for Semester II
- Questions & Comments
Introduction
- Practical project during the course of studies
- Timeframe: two terms
- Topic: Prototype of a semantic search engine using UIMA
- Objectives of the first semester
  - Study the UIMA framework and the OpenNLP library
  - Search for players, teams, matches and dates
  - Semantic search for goal events
  - Implement an executable prototype
Technologies in Use
- UIMA framework
- OpenNLP
- Java / Java Server Pages
- Tomcat server
- Python (web crawler)
Architecture Overview
Architecture Webcrawler
- Web crawler used to preselect texts
- Implemented in Python
- Crawls ca pages in 20 minutes
- Preselection presently based on keywords
- Transfer of results to JIMGLE still manual
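A keyword-based preselection of this kind could be sketched as follows. This is only an illustration: the keyword list, the `min_hits` threshold and the function names are assumptions, since the slides do not say which keywords or scoring the crawler actually uses.

```python
import re

# Hypothetical keyword list; the slides do not name the keywords actually used.
KEYWORDS = {"goal", "match", "world cup"}

def keyword_score(text, keywords=KEYWORDS):
    """Count how many distinct keywords occur in the text (case-insensitive)."""
    lowered = text.lower()
    return sum(
        1 for kw in keywords
        if re.search(r"\b" + re.escape(kw) + r"\b", lowered)
    )

def preselect(pages, min_hits=2):
    """Return the URLs of pages whose text mentions at least min_hits keywords.

    pages maps a URL to the already downloaded page text.
    """
    return [url for url, text in pages.items() if keyword_score(text) >= min_hits]
```

Fetching the pages is left out on purpose; the sketch only covers the filtering step that decides which texts are passed on.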
Architecture NLP-Annotator
- Uses the OpenNLP tools & API
- Rule-based approach
- Tagging of paragraphs, sentences and words
- Part-of-speech tagging
- Implemented in UIMA as a separate annotator
- Results are used by subsequent annotators
- Internal usage only, not displayed in the search index
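The idea of tagging sentences and words as offset-based annotations that later annotators can build on can be illustrated with a small stand-in. It does not use OpenNLP or UIMA; the `Annotation` type and the naive splitting rules are assumptions made for this sketch.

```python
import re
from typing import NamedTuple

class Annotation(NamedTuple):
    kind: str    # "Sentence" or "Token"
    begin: int   # offset of the first character
    end: int     # offset one past the last character

def annotate(text):
    """Return sentence and token annotations as character spans over text."""
    annotations = []
    # Naive rule: a sentence starts at a non-space character and runs
    # up to and including the next '.', '!' or '?'.
    for m in re.finditer(r"\S[^.!?]*[.!?]?", text):
        annotations.append(Annotation("Sentence", m.start(), m.end()))
    # Naive rule: a token is a run of word characters.
    for m in re.finditer(r"\w+", text):
        annotations.append(Annotation("Token", m.start(), m.end()))
    return annotations

def covered_text(text, annotation):
    """Return the piece of text an annotation refers to."""
    return text[annotation.begin:annotation.end]
```

Storing only offsets keeps the source text unchanged, so any number of downstream annotators can add their own spans over the same document.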
Architecture Player-Annotator
- Identification of players of the 2006 World Cup
- Rule-based implementation
- Uses the OpenNLP word annotations
- Matching against the player database (XML file)
- Considers last names and nicknames
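Matching word annotations against an XML player database might look roughly like this. The XML layout (`player`, `lastname`, `nickname`) and the two sample entries are assumptions for illustration; the slides only state that the database is an XML file containing last names and nicknames.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical database layout and sample entries.
PLAYER_XML = """
<players>
  <player team="Germany">
    <lastname>Klose</lastname>
    <nickname>Miro</nickname>
  </player>
  <player team="Germany">
    <lastname>Schweinsteiger</lastname>
    <nickname>Schweini</nickname>
  </player>
</players>
"""

def load_lookup(xml_text):
    """Map every lower-cased last name and nickname to the player's last name."""
    lookup = {}
    for player in ET.fromstring(xml_text.strip()).iter("player"):
        last = player.findtext("lastname")
        names = [last] + [n.text for n in player.findall("nickname")]
        for name in names:
            lookup[name.lower()] = last
    return lookup

def annotate_players(text, lookup):
    """Return (begin, end, last_name) for every word that names a known player."""
    return [
        (m.start(), m.end(), lookup[m.group().lower()])
        for m in re.finditer(r"\w+", text)
        if m.group().lower() in lookup
    ]
```

The sketch matches single words only, so multi-word nicknames would need an extra step.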
Architecture Date & Time-Annotator
- Identification of time and date information
- Uses the OpenNLP word annotations
- Presently a custom, rule-based implementation
- Detects time and date information in standard formats
- Detection of relative or colloquial time information not implemented yet
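A rule-based detector for standard date and time formats can be sketched with regular expressions plus a validity check. Which formats the prototype actually accepts is not stated on the slide; the two date formats and the 24-hour time format below are assumptions.

```python
import re
from datetime import datetime

# Assumed formats: ISO dates (2006-06-09) and dotted dates (09.06.2006).
DATE_FORMATS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),
    (re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b"), "%d.%m.%Y"),
]
# Assumed format: 24-hour clock times such as 18:00.
TIME_PATTERN = re.compile(r"\b([01]?\d|2[0-3]):([0-5]\d)\b")

def annotate_dates(text):
    """Return (begin, end, iso_date) for every valid date found in text."""
    found = []
    for pattern, fmt in DATE_FORMATS:
        for m in pattern.finditer(text):
            try:
                value = datetime.strptime(m.group(), fmt).date()
            except ValueError:
                continue  # looks like a date but is not one, e.g. 31.02.2006
            found.append((m.start(), m.end(), value.isoformat()))
    return sorted(found)

def annotate_times(text):
    """Return (begin, end, matched_text) for every clock time found in text."""
    return [(m.start(), m.end(), m.group()) for m in TIME_PATTERN.finditer(text)]
```

Relative expressions such as "yesterday" are not covered, in line with the open point on the slide.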
Architecture Match-Annotator
- Identification of matches
- Based on 3 components
  - Detection of locality
  - Detection of participating teams
  - Detection of the match result
- Usage of upstream annotators
  - OpenNLP word annotations
  - Player annotations
  - Date & time annotations
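How the three components could be combined into one match annotation is sketched below. The team and venue lists, the result pattern and the rule "two teams plus a result make a match" are assumptions for illustration, not the prototype's actual rules.

```python
import re

# Hypothetical sample lists; a real annotator would use complete data.
TEAMS = {"Germany", "Costa Rica", "Poland"}
VENUES = {"Munich", "Dortmund", "Berlin"}
# Assumed result notation "4:2". A real pipeline would skip spans already
# annotated as clock times, which this pattern would also match.
RESULT = re.compile(r"\b(\d{1,2}):(\d{1,2})\b")

def find_terms(text, terms):
    """Return the terms that occur in text, ordered by first occurrence."""
    hits = [(text.find(term), term) for term in terms if term in text]
    return [term for _, term in sorted(hits)]

def annotate_match(sentence):
    """Combine teams, result and locality into one match record, or None."""
    teams = find_terms(sentence, TEAMS)
    result = RESULT.search(sentence)
    venues = find_terms(sentence, VENUES)
    if len(teams) < 2 or result is None:
        return None  # not enough evidence for a match
    return {
        "teams": (teams[0], teams[1]),
        "result": (int(result.group(1)), int(result.group(2))),
        "venue": venues[0] if venues else None,
    }
```

Treating the locality as optional is a design choice of this sketch: many match reports name the teams and the score without repeating the venue.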
Architecture Goal-Event Annotator
- Descriptions of goals are too complex for rule-based detection
- Therefore: machine learning
- Uses the OpenNLP library
- Based on statistical information about sentences
- Comprehensive training necessary
- Implemented as an OpenNLP component
- Integrated into UIMA by wrapper classes
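To illustrate what learning goal events from sentence statistics means, here is a small sentence classifier. It is a plain naive Bayes model written from scratch, used as a stand-in: it is not the OpenNLP component of the prototype, and the class and function names are assumptions.

```python
import math
import re
from collections import Counter

def tokens(sentence):
    """Lower-case a sentence and split it into words."""
    return re.findall(r"\w+", sentence.lower())

class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over sentence words."""

    def __init__(self):
        self.word_counts = {}          # label -> Counter of word frequencies
        self.label_counts = Counter()  # label -> number of training sentences
        self.vocab = set()

    def train(self, examples):
        """examples is an iterable of (sentence, label) pairs."""
        for sentence, label in examples:
            self.label_counts[label] += 1
            counter = self.word_counts.setdefault(label, Counter())
            for tok in tokens(sentence):
                counter[tok] += 1
                self.vocab.add(tok)

    def classify(self, sentence):
        """Return the label with the highest log-probability for the sentence."""
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokens(sentence):
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

With only a handful of training sentences such a model is brittle, which matches the slide's point that comprehensive training is necessary.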
Architecture Persistent Indexing
- Functionality
  - Import of all files in a specific directory
  - Annotation of all available texts
  - Compilation of XML files with the CAS data of every source text
  - Subsequent creation of a search index
  - Provision of index files for the web server
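The step from annotated texts to a search index can be pictured as an inverted index keyed by annotation type and value. The data layout below is an assumption; the slides do not describe the index format the prototype writes.

```python
from collections import defaultdict

def build_index(annotated_docs):
    """Build an inverted index from (annotation type, value) to document ids.

    annotated_docs maps a document id to a list of (type, value) pairs,
    standing in for the annotations stored with each source text.
    """
    index = defaultdict(set)
    for doc_id, annotations in annotated_docs.items():
        for ann_type, value in annotations:
            index[(ann_type, value.lower())].add(doc_id)
    return index

def search(index, *terms):
    """Return the sorted ids of documents that contain ALL given (type, value) terms."""
    postings = [index.get((t, v.lower()), set()) for t, v in terms]
    return sorted(set.intersection(*postings)) if postings else []
```

Because every term maps to a set of documents, a combined query is just a set intersection, for example `search(index, ("Player", "Klose"), ("Team", "Germany"))`.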
Architecture Graphical User Interface
- Linux server with Tomcat installation
- Simple operation via web-based GUI
- Search queries are handled by Java Server Pages
- Requests are processed by Java beans
Demonstration Search engine
Open Issues
How to proceed…?
- Search for attributes, e.g. Player AND Germany (presently only via OmniFind)
- Automate processing of search engine results
- Further training of the components
- Usability improvements at front end and back end
New scenarios…
…for the second semester
- Automated analysis of s
  - Search for phone numbers
  - Search for customer contacts of employees
  - Find employees with specific skills
  - Find links & relations between employees
- Competitive analysis
  - Compare own products with those of competitors
  - Find out about customer opinions in internet portals
- Further ideas??
Ideas…
…for the second semester
- Natural-language search queries
- Design templates for customizable annotators
- Machine learning for the web crawler
- Mark annotations in the search results
- Automated processing of search results
- Implement more annotators via OpenNLP
- Provide annotators as web services
- Further ideas??
JIMGLE JIM Master-Project Questions? Suggestions?
JIMGLE JIM Master-Project Thanks for your attention…