Published by Leo Bryan. Modified over 9 years ago.
1
AFTERCOLLEGE SELF-SERVICE SCRAPE CONFIGURATION AND POSTING UTILITY
Midterm Presentation
Kai Hu, Haiyan Wu
March 17, 2009 @ Cowell 416
2
PRESENTATION OUTLINE
- Background and Motivation
- Goals
- Design
- Challenges
- Timeline and Milestones
- Current Progress
12/4/2015 AfterCollege Scrape Utility
3
AFTERCOLLEGE BACKGROUND
- Customized career network for colleges and professional organizations across the country
- Goal: create a better way for job-seeking students and alumni to connect with the right employer
5
WHAT’S ALREADY THERE?
- AfterCollege staff manually create configuration files
- A simple crawler runs periodically
- Crawler output is posted on AfterCollege’s website
6
LIMITATIONS
- Scalability
- Unable to handle POST requests
- Unable to handle dynamic websites
- Expensive to maintain
- Requires technical knowledge
7
DESIGN OVERVIEW
- A new GUI tool assists staff through the configuration process
- Web proxy captures user activities
- Crawler uses pattern matching based on the new configuration file
8
GOALS: GUI TOOL
- Guide users through the configuration process
- Deal with dynamic websites
9
GOALS: WEB PROXY
- Capture user activities
- Generate configuration files
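One way the proxy's two goals could fit together is sketched below: each captured request is recorded as a small object, and the recorded sequence is serialized into an XML configuration file. This is only an illustration. The `CapturedRequest` class, the `build_config` function, the element names (`scrape-config`, `step`, `param`), and the example URLs are all assumptions, not the project's actual schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class CapturedRequest:
    """One HTTP request observed by the proxy (hypothetical record type)."""
    method: str
    url: str
    form: dict = field(default_factory=dict)


def build_config(site_name, requests):
    """Serialize captured requests into a config document (assumed schema)."""
    root = ET.Element("scrape-config", {"site": site_name})
    for order, req in enumerate(requests, start=1):
        step = ET.SubElement(
            root, "step",
            {"order": str(order), "method": req.method, "url": req.url},
        )
        # Form fields matter for POST requests, which the old crawler could not replay.
        for name, value in req.form.items():
            ET.SubElement(step, "param", {"name": name, "value": value})
    return ET.tostring(root, encoding="unicode")


# Example capture: the user opens a careers page, then submits a search form.
captured = [
    CapturedRequest("GET", "http://example.com/careers"),
    CapturedRequest("POST", "http://example.com/careers/search", {"keyword": "engineer"}),
]
config_xml = build_config("example-employer", captured)
print(config_xml)
```

Recording the form fields alongside the method is what would let a crawler replay POST requests later.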
10
GOALS: CRAWLER
- Scrape job posts
- Check result integrity
[Diagram labels: Crawl Job List page; Get Configuration file; Pattern-Match Application; Generate Job List Result]
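The crawler stages named on this slide can be sketched roughly as follows: read a pattern from a configuration file, apply it to a job list page, and run a basic integrity check on the result. This is a minimal sketch under stated assumptions. The config layout (`job-pattern` element), the use of a regular expression as the pattern, the sample HTML, and all function names are hypothetical. The page is an inline string here, so no HTTP fetch is shown.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical configuration file holding a regex with named groups.
CONFIG_XML = """
<scrape-config site="example-employer">
  <job-pattern><![CDATA[<li class="job"><a href="(?P<url>[^"]+)">(?P<title>[^<]+)</a></li>]]></job-pattern>
</scrape-config>
"""

# Stand-in for a fetched job list page; the last entry has a blank title.
SAMPLE_PAGE = """
<ul>
  <li class="job"><a href="/jobs/1">Software Engineer</a></li>
  <li class="job"><a href="/jobs/2">QA Analyst</a></li>
  <li class="job"><a href="/jobs/3">   </a></li>
</ul>
"""


def load_pattern(config_xml):
    """Get the configuration file's pattern and compile it."""
    root = ET.fromstring(config_xml)
    return re.compile(root.findtext("job-pattern").strip())


def extract_jobs(pattern, html):
    """Pattern-match the page and build the raw job list."""
    return [
        {"title": m.group("title").strip(), "url": m.group("url")}
        for m in pattern.finditer(html)
    ]


def check_integrity(jobs):
    """Split jobs into valid and rejected; a job needs a non-empty title and URL."""
    valid = [job for job in jobs if job["title"] and job["url"]]
    rejected = [job for job in jobs if not (job["title"] and job["url"])]
    return valid, rejected


pattern = load_pattern(CONFIG_XML)
jobs = extract_jobs(pattern, SAMPLE_PAGE)
valid, rejected = check_integrity(jobs)
print(valid)
print(rejected)
```

The integrity check here only rejects empty fields; a real check might also compare job counts between runs or validate URLs.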
11
DESIGN ISSUES
- Firefox Plugin vs. Web Proxy
  - Integration with back-end
  - Ability to add functionality
- Dojo vs. YUI
  - Fade-In/Out, Drag & Drop
  - Handling of different browsers
- XML vs. JSON
  - Simplicity and efficiency of parsing
  - Availability of wrapper methods in YUI
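To make the XML vs. JSON trade-off concrete, the sketch below parses the same job record from both formats into an identical dictionary. The record fields (`title`, `url`) are assumptions chosen for illustration, and Python stands in for the YUI JavaScript used on the client side.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical job record in both formats.
JOB_XML = "<job><title>Software Engineer</title><url>/jobs/1</url></job>"
JOB_JSON = '{"title": "Software Engineer", "url": "/jobs/1"}'


def job_from_xml(text):
    """Walk the element tree and collect each child tag and its text."""
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}


def job_from_json(text):
    """JSON maps directly onto a dictionary, so no tree walk is needed."""
    return json.loads(text)


xml_job = job_from_xml(JOB_XML)
json_job = job_from_json(JOB_JSON)
print(xml_job == json_job)
```

For a flat record the two are equivalent; XML needs an explicit traversal step, while JSON deserializes straight into native structures.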
12
DESIGN OVERVIEW
[Architecture diagram labels: Browser; Rendered HTML page; Injected YUI JavaScript; Web Proxy; Apache HTTP Client; Tomcat Web/App Server; HTML Parser; Job List Sites; Crawler; Loader/Scheduler; Parser; HTTP Client; Config.xml; JobFeed.xml; Feed Generator]
13
CHALLENGES
- Analyzing DOM objects at runtime for websites that use AJAX to generate DOM objects dynamically on the client side
- Dealing with tricky JavaScript
- Embedded HTML pages
14
MILESTONES
- GUI Tool (March 20)
  - Work flow support
  - Capture job information
- Web Proxy (March 20)
  - Render HTML pages
  - Capture HTTP communications
- Web Crawler (April 13)
  - Pattern-matching ability given a configuration file
  - Integrity check
- Integration Test (April 20)
- Testing (April 27)
15
CURRENT FOCUS
- Web Proxy
  - Ability to deal with JavaScript
  - Session/Cookie support
- GUI Tool
  - Embedded web pages
  - Allow user modifications
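Session/cookie support in a proxy usually means remembering the cookies a site sets and sending them back on later requests. The sketch below shows that idea in its simplest form, using Python's standard `http.cookies` module. The `SessionStore` class is hypothetical, and it deliberately ignores domain, path, and expiry scoping, which a real proxy would need to honor.

```python
from http.cookies import SimpleCookie


class SessionStore:
    """Remember cookies from responses and replay them (hypothetical helper)."""

    def __init__(self):
        self._cookies = {}

    def remember(self, set_cookie_header):
        """Parse one Set-Cookie header value and keep its name/value pairs."""
        parsed = SimpleCookie()
        parsed.load(set_cookie_header)
        for name, morsel in parsed.items():
            self._cookies[name] = morsel.value

    def cookie_header(self):
        """Build the Cookie header value to attach to the next request."""
        return "; ".join(f"{name}={value}" for name, value in self._cookies.items())


store = SessionStore()
store.remember("JSESSIONID=abc123; Path=/")
store.remember("lang=en; Path=/")
print(store.cookie_header())
```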
16
CURRENT PROGRESS
- Demo
17
RESOURCES
- Course Instructor: Dr. Jeff Buckwalter
- Sponsor: Steve Girolami, Perry Lee, & Saan Saeteurn
- Source code control system: Dasidae SVN from Perry
- Wiki site: knowledge share, work log, resource portal
- Google group: discussion and information exchange medium
18
Questions?
19
Thank You