AFTERCOLLEGE SELF-SERVICE SCRAPE CONFIGURATION AND POSTING UTILITY
Kai Hu, Haiyan Wu
March 17, Cowell 416
Midterm Presentation
PRESENTATION OUTLINE
- Background and Motivation
- Goals
- Design
- Challenges
- Timeline and Milestones
- Current Progress
12/4/2015 AfterCollege Scrape Utility
AFTERCOLLEGE BACKGROUND
- Customized career network for colleges and professional organizations across the country
- Goal: create a better way for job-seeking students and alumni to connect with the right employer
WHAT’S ALREADY THERE?
- AfterCollege staff manually create configuration files
- A simple crawler runs periodically
- The crawler's output is posted on AfterCollege’s website
LIMITATIONS
- Scalability
- Unable to handle POST requests
- Unable to handle dynamic websites
- Expensive to maintain
- Requires technical knowledge
DESIGN OVERVIEW
- A new GUI tool guides staff through the configuration process
- Web proxy captures user activities
- Crawler uses pattern matching driven by the new configuration file
GOALS: GUI TOOL
- Guide users through the configuration process
- Deal with dynamic websites
GOALS: WEB PROXY
- Capture user activities
- Generate configuration files
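The two proxy goals above can be illustrated with a minimal sketch, assuming the proxy has already recorded each request's method, URL, and form parameters. The class name, the `CapturedRequest` type, and the `<config>`, `<request>`, and `<param>` element names are all hypothetical; the slides do not specify the actual configuration file layout.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: turns requests captured by a proxy into an XML configuration.
class CaptureConfigSketch {

    // One captured request: HTTP method, target URL, and any form parameters.
    static final class CapturedRequest {
        final String method;
        final String url;
        final Map<String, String> params;

        CapturedRequest(String method, String url, Map<String, String> params) {
            this.method = method;
            this.url = url;
            this.params = params;
        }
    }

    // Escapes the five characters that are special inside XML attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
                .replace("\"", "&quot;").replace("'", "&apos;");
    }

    // Serializes the requests in capture order; element names are assumptions.
    static String toConfigXml(List<CapturedRequest> requests) {
        StringBuilder xml = new StringBuilder("<config>\n");
        for (CapturedRequest r : requests) {
            xml.append("  <request method=\"").append(escape(r.method))
               .append("\" url=\"").append(escape(r.url)).append("\">\n");
            for (Map.Entry<String, String> p : r.params.entrySet()) {
                xml.append("    <param name=\"").append(escape(p.getKey()))
                   .append("\" value=\"").append(escape(p.getValue())).append("\"/>\n");
            }
            xml.append("  </request>\n");
        }
        return xml.append("</config>\n").toString();
    }

    public static void main(String[] args) {
        Map<String, String> form = new LinkedHashMap<>();
        form.put("keyword", "R&D");
        List<CapturedRequest> captured = new ArrayList<>();
        captured.add(new CapturedRequest("GET", "http://jobs.example.com/search",
                new LinkedHashMap<String, String>()));
        captured.add(new CapturedRequest("POST", "http://jobs.example.com/results", form));
        System.out.print(toConfigXml(captured));
    }
}
```

Recording POST parameters alongside the URL is what would let a later crawl replay form submissions, one of the limitations listed for the existing crawler.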
GOALS: CRAWLER
- Scrape job posts
- Check result integrity
[Flow diagram; labels: Crawl Job List page, Get Configuration file, Pattern-Match Application, Generate Job List Result]
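As a rough illustration of the crawler goals above, the following sketch applies a configured pattern to a job-list page and then runs a simple integrity check. It assumes the configuration supplies a regular expression with named groups; the class name, the field names `url` and `title`, and the integrity rule (at least one job, no blank fields) are hypothetical and not taken from the project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of config-driven pattern matching over a job-list page.
class PatternMatchSketch {

    // Applies the configured regex to the page and returns one field map per match.
    static List<Map<String, String>> extractJobs(String html, Pattern jobPattern,
                                                 List<String> fields) {
        List<Map<String, String>> jobs = new ArrayList<>();
        Matcher m = jobPattern.matcher(html);
        while (m.find()) {
            Map<String, String> job = new LinkedHashMap<>();
            for (String field : fields) {
                job.put(field, m.group(field).trim());
            }
            jobs.add(job);
        }
        return jobs;
    }

    // Assumed integrity rule: at least one job was found and no field is blank.
    static boolean passesIntegrityCheck(List<Map<String, String>> jobs) {
        if (jobs.isEmpty()) {
            return false;
        }
        for (Map<String, String> job : jobs) {
            for (String value : job.values()) {
                if (value.isEmpty()) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String html = "<ul>"
                + "<li><a href=\"/jobs/1\">Software Engineer</a></li>"
                + "<li><a href=\"/jobs/2\"> QA Analyst </a></li>"
                + "</ul>";
        // In the real tool this pattern would come from the configuration file.
        Pattern jobPattern = Pattern.compile("<a href=\"(?<url>[^\"]+)\">(?<title>[^<]*)</a>");
        List<Map<String, String>> jobs =
                extractJobs(html, jobPattern, Arrays.asList("url", "title"));
        System.out.println(jobs);
        System.out.println("integrity ok: " + passesIntegrityCheck(jobs));
    }
}
```

Regular expressions are only one way to express the patterns; a parser-based selector would be an equally plausible reading of "pattern matching" here.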
DESIGN ISSUES
- Firefox Plugin vs. Web Proxy
  - Integration with back-end
  - Ability to add functionality
- Dojo vs. YUI
  - Fade-In/Out, Drag & Drop
  - Deals with different browsers
- XML vs. JSON
  - Simplicity & efficiency of parsing
  - Availability of wrapper methods in YUI
DESIGN OVERVIEW
[Architecture diagram; component labels: Browser, Rendered HTML page, Injected YUI JavaScript, Web Proxy, Apache HTTP Client, Tomcat Web/App Server, HTML Parser, Job List Sites, Crawler, Loader/Scheduler, Parser, HTTP Client, Config.xml, JobFeed.xml, Feed Generator]
CHALLENGES
- Runtime analysis of DOM objects for websites that use AJAX to generate DOM objects dynamically on the client side
- Dealing with tricky JavaScript
- Embedded HTML pages
MILESTONES
- GUI Tool (March 20)
  - Workflow support
  - Capture job information
- Web Proxy (March 20)
  - Render HTML pages
  - Capture HTTP communications
- Web Crawler (April 13)
  - Pattern matching ability given configuration file
  - Integrity check
- Integration Test (April 20)
- Testing (April 27)
CURRENT FOCUS
- Web Proxy
  - Ability to deal with JavaScript
  - Session/Cookie support
- GUI Tool
  - Embedded web pages
  - Allow user modifications
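For the session/cookie item under Web Proxy, a minimal sketch of one possible approach uses the JDK's `java.net.CookieManager` to remember cookies from a site's response and replay them on follow-up requests. The class name, method names, and example URLs are hypothetical; the slides do not say how the proxy actually tracks sessions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: keeps a site's session cookies between proxied requests.
class SessionCookieSketch {
    // Default policy accepts cookies only from the server that set them.
    private final CookieManager cookies = new CookieManager();

    // Stores any Set-Cookie headers found in a response from the given URI.
    void rememberResponse(URI uri, Map<String, List<String>> responseHeaders) {
        try {
            cookies.put(uri, responseHeaders);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the Cookie header value for a follow-up request, or "" if none apply.
    String cookieHeaderFor(URI uri) {
        try {
            Map<String, List<String>> headers =
                    cookies.get(uri, new HashMap<String, List<String>>());
            List<String> values = headers.get("Cookie");
            return values == null ? "" : String.join("; ", values);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SessionCookieSketch session = new SessionCookieSketch();
        // Simulated login response carrying a session cookie.
        session.rememberResponse(
                URI.create("http://jobs.example.com/login"),
                Collections.singletonMap("Set-Cookie",
                        Collections.singletonList("JSESSIONID=abc123; Path=/")));
        System.out.println(session.cookieHeaderFor(
                URI.create("http://jobs.example.com/list?page=2")));
    }
}
```

The returned string would be attached as the `Cookie` header of the next outgoing request to the same site, so that pages behind a login or a server-side session remain reachable.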
CURRENT PROGRESS
- Demo
RESOURCES
- Course Instructor
  - Dr. Jeff Buckwalter
- Sponsor
  - Steve Girolami, Perry Lee, & Saan Saeteurn
- Source Code Control System
  - Dasidae SVN from Perry
- Wiki Site
  - Knowledge share, work log, resource portal
- Google Group
  - Discussion and information exchange medium
Questions?
Thank You