Published by Oscar Leon O’Brien. Modified over 9 years ago.
1
Kenny Trytek, Joe Briggie, Abby Birkett, Derek Woods
Advisor: Simanta Mitra
Client: Matt Good, Kingland Systems
2
Problem Statement
- Large companies have many layers of corporate hierarchy.
- Financial and data records sometimes conflict between layers and entities.
- Accurate and comprehensive company records are needed for auditing and stock conflict resolution.
- There is a need for “Data Mastering”: taking multiple conflicting sources of data and determining which values are actually correct.
3
Basic Requirements
- The system shall autonomously traverse publicly available websites and collect information.
- The system shall store parsed information in a flat file.
- The system shall maintain a normalized database.
- The system shall expose functionality through web services.
- A single run of the system shall complete execution in less than six hours.
4
Design Decisions
- Implementation in C#
- ASP.NET GUI with jQuery UI widgets
- Operable in a Windows environment (XP or later)

Risks
- Site data structures or hierarchies can change at any time
- Reliance on third-party PDF text parser, grid control, and AJAX library
- Inconsistencies in data
5
System Diagram
[Figure: system diagram. Labeled components: WWW; Data Scraper Tool (HTML Parser, PDF Parser); Flat File; ETL Tool; DAL; Normalized Database; Web Svcs. (Create, Read, Update, Delete); Kingland Data Analyst UI; External Client UI; decision node "No Conflicts?"]
6
Harvester Module
- The harvester performs the work of gathering data from the external sites.
- After the data is scraped and parsed, the harvester constructs XML files for each data source.
- Finally, the ETL is notified that the data is ready.
[Figure: World Wide Web, Scraper, Parser (HTML Parser, PDF Parser), Flat File (XML)]
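The step of constructing one XML file per data source can be sketched as follows. The project itself was written in C#; this illustration uses Python for brevity, and the element names (`source`, `record`) and the sample fields are assumptions, not the project's actual schema.

```python
import xml.etree.ElementTree as ET


def build_source_xml(source_name, records):
    """Build one XML document for a single data source.

    `records` is a list of dicts. Element and attribute names are
    hypothetical; the real flat-file schema is not given in the slides.
    """
    root = ET.Element("source", {"name": source_name})
    for record in records:
        entry = ET.SubElement(root, "record")
        for field, value in record.items():
            child = ET.SubElement(entry, field)
            child.text = str(value)  # ElementTree escapes &, <, > on output
    return ET.tostring(root, encoding="unicode")


xml_text = build_source_xml(
    "example-site",
    [{"company": "Acme & Sons", "city": "Ames"}],
)
```

Letting the XML library serialize the text means special characters such as `&` are escaped at write time, which reduces the cleanup the ETL stage has to do later.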
7
Harvester Difficulties
- Constructing a POST request to retrieve the PDFs required extracting a complex view state
- Difficult to extract text from PDF
- Inconsistencies in extracted text
  - City names were occasionally malformed
  - Extra formatting characters were present in extracted text
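The view-state difficulty can be illustrated with a minimal sketch. It assumes the target site is an ASP.NET Web Forms page that carries its state in hidden `<input>` fields such as `__VIEWSTATE`; the slides do not name the site or its fields. The `reportId` parameter and the sample HTML are hypothetical, and Python stands in for the project's C#.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def build_post_body(page_html, extra_fields):
    """Echo the page's hidden fields back, plus the fields we want to set."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    form = dict(parser.fields)
    form.update(extra_fields)
    return urlencode(form)


# Hypothetical page fragment; a real view state is far longer.
sample_page = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIzNDU2Nzg5Ozs+" />
  <input type="text" name="query" />
</form>
"""
body = build_post_body(sample_page, {"reportId": "42"})
```

The view state has to be URL-encoded before it is posted back, because its Base64 text can contain `+`, `/`, and `=`.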
8
ETL (Extract, Transform, Load)
- The ETL performs cleanup operations on the data from the harvester.
- If there are malformed tags or invalid characters, they are escaped here.
- Maintains an error log.
- Loads data into the database through the DAL (Data Access Layer).
[Figure: Flat File (XML), ETL Tool, DAL]
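A cleanup pass of the kind described on this slide might look like the sketch below. It is an assumption about one reasonable approach, not the team's code: control characters that XML 1.0 forbids are removed (and the removal is logged), and markup characters are escaped. The function name `clean_field` and the log format are hypothetical.

```python
import re
from xml.sax.saxutils import escape

# C0 control characters that XML 1.0 does not allow (tab, LF, CR are legal).
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_field(raw_text, error_log):
    """Return text that is safe to embed in XML, logging anything removed."""
    stripped = _INVALID_XML_CHARS.sub("", raw_text)
    removed = len(raw_text) - len(stripped)
    if removed:
        error_log.append(f"removed {removed} invalid character(s)")
    # escape() replaces &, <, and > with their XML entities.
    return escape(stripped)


log = []
cleaned = clean_field("Des\x0c Moines <b>", log)
```

Here a stray form feed, a typical leftover from PDF text extraction, is dropped, and the stray `<b>` is escaped so it cannot be read as a tag.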
9
ETL Difficulties
- Implementing multi-threaded execution for better performance
- Dealing with malformed input
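Both difficulties can be shown together in a small sketch: files are transformed on a thread pool, and a failure on one malformed file is recorded without stopping the others. This is an illustration in Python under assumed names (`run_etl`, `fake_transform`, the file names); the slides do not describe how the C# implementation handled threading.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_etl(paths, transform, max_workers=4):
    """Apply `transform` to every path in parallel, isolating failures."""
    loaded, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(transform, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                loaded[path] = future.result()
            except Exception as error:  # one bad file must not abort the run
                failed[path] = str(error)
    return loaded, failed


def fake_transform(path):
    """Stand-in for the real cleanup step; rejects files marked as bad."""
    if path.endswith(".bad"):
        raise ValueError("malformed input")
    return path.upper()


loaded, failed = run_etl(["a.xml", "b.xml", "c.bad"], fake_transform)
```

Collecting failures per file gives the error log something concrete to record while the remaining files still load.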
10
DAL (Data Access Layer)
- Maintains a normalized MySQL database
- Provides CRUD operations (Create, Read, Update, Delete)
[Figure: DAL between the Database and its callers (User Interface, ETL Tool); operations Add(), Find(), Update(), Delete()]

DAL Difficulties
- No particular difficulties encountered in database creation
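The four operations named on this slide (Add, Find, Update, Delete) can be sketched as a thin class over a database connection. The sketch uses Python's built-in `sqlite3` with an in-memory database purely so it runs anywhere; the project used MySQL and C#, and the `company` table and its columns are hypothetical.

```python
import sqlite3


class CompanyDal:
    """Minimal data access layer exposing add/find/update/delete."""

    def __init__(self, connection):
        self.connection = connection
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS company ("
            "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
        )

    def add(self, name, city):
        cursor = self.connection.execute(
            "INSERT INTO company (name, city) VALUES (?, ?)", (name, city)
        )
        self.connection.commit()
        return cursor.lastrowid

    def find(self, company_id):
        return self.connection.execute(
            "SELECT id, name, city FROM company WHERE id = ?", (company_id,)
        ).fetchone()

    def update(self, company_id, city):
        cursor = self.connection.execute(
            "UPDATE company SET city = ? WHERE id = ?", (city, company_id)
        )
        self.connection.commit()
        return cursor.rowcount  # number of rows changed

    def delete(self, company_id):
        cursor = self.connection.execute(
            "DELETE FROM company WHERE id = ?", (company_id,)
        )
        self.connection.commit()
        return cursor.rowcount


dal = CompanyDal(sqlite3.connect(":memory:"))
company_id = dal.add("Acme", "Ames")
dal.update(company_id, "Des Moines")
```

Parameterized queries (`?` placeholders) keep scraped text from being interpreted as SQL, which matters when the input comes from external websites.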
11
Web Services
- Expose the DAL for access from external web apps
- Accessed by HTTP GET or POST requests
- Return JSON objects containing data
[Figure: Services: Read(), Progress(), Write(), Update(), Delete()]

Web Services Difficulties
- Returning large JSON objects to the UI
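One common way to ease the large-JSON problem is to return results a page at a time. The slides do not say whether the team did this, so the sketch below is only an illustration of the idea; `read_page`, the field names in the response, and the page size are assumptions.

```python
import json


def read_page(records, page, page_size=100):
    """Return one page of records as a JSON string, with paging metadata."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    start = (page - 1) * page_size
    chunk = records[start:start + page_size]
    total_pages = (len(records) + page_size - 1) // page_size  # ceiling division
    return json.dumps(
        {"page": page, "totalPages": total_pages, "records": chunk}
    )


all_records = [{"id": i} for i in range(250)]
last_page = json.loads(read_page(all_records, 3))
```

The UI can then request further pages on demand, for example as the user scrolls the grid, instead of waiting for one very large response.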
12
GUI (Graphical User Interface)
13
GUI Difficulties
- Implementing autocomplete functionality for query efficiency
- Progress bar updates
- Grid configuration and updating
- Retrieving large amounts of data from web services
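Autocomplete is usually backed by a fast prefix lookup on the server. A minimal sketch, assuming the candidate names are kept lowercased and sorted, uses binary search to find where the matches begin. The function name `suggest` and the sample names are hypothetical, and the deck does not describe the actual lookup.

```python
import bisect


def suggest(sorted_names, prefix, limit=10):
    """Return up to `limit` names starting with `prefix`.

    `sorted_names` must be lowercased and sorted ascending.
    """
    key = prefix.lower()
    start = bisect.bisect_left(sorted_names, key)  # first name >= prefix
    matches = []
    for name in sorted_names[start:]:
        if not name.startswith(key) or len(matches) >= limit:
            break
        matches.append(name)
    return matches


names = sorted(["kingland", "king", "kite", "acme"])
suggestions = suggest(names, "Kin")
```

Because matches are contiguous in a sorted list, the lookup stops at the first name that no longer shares the prefix.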
14
Overall Test Plan
- Test each module individually to ensure independent functionality.
- As modules are completed, test them in integration pairs to ensure the channel between them works.
- When all modules are integrated, test the system end-to-end using the web app.
15
Harvester / Parser Test Plan
- Ensure the harvester can connect to the site for scraping and retrieve the appropriate data
- Maintain a list of input files that produce specific output after parsing
- Define corner cases for sub-function robustness evaluation and testing
- Ensure errors are caught and handled appropriately
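The idea of keeping inputs paired with their expected outputs can be sketched as a small table-driven check. The function under test, `normalize_city`, is a hypothetical stand-in for one parser sub-function (it collapses whitespace and fixes capitalization), and the cases are invented examples, not the team's test data.

```python
import re


def normalize_city(raw):
    """Hypothetical parser helper: collapse whitespace and title-case."""
    collapsed = re.sub(r"\s+", " ", raw).strip()
    return collapsed.title()


# Each case pairs a raw input with the output it must produce.
GOLDEN_CASES = [
    ("  des   moines ", "Des Moines"),
    ("AMES\n", "Ames"),
    ("", ""),  # corner case: empty input
]


def run_golden_cases(parse, cases):
    """Return (input, expected, actual) for every case that fails."""
    failures = []
    for raw, expected in cases:
        actual = parse(raw)
        if actual != expected:
            failures.append((raw, expected, actual))
    return failures


failures = run_golden_cases(normalize_city, GOLDEN_CASES)
```

New corner cases found in scraped data can be appended to the list, so a regression in the parser shows up as a non-empty `failures` result.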
16
ETL Test Plan
- Maintain a list of input files that produce specific output after data cleanup
- Ensure errors are caught and handled appropriately
- Confirm the ETL can communicate with the DAL
17
DAL Test Plan
- Ensure records can be created, read, updated, and deleted in the database
- Define corner cases and error handling for invalid database operations
- Create a list of operations with expected results
18
Web Services Test Plan
- Call each web service with expected input and check return values
- Call web services with invalid input and check return values
19
Project Future
- The database model can be generalized to include any number of data sources
- The harvester can be separated from the ETL so that additional data sources will not require ETL changes
- Optimization and multithreading of the harvester and parser for greater efficiency
- User access control features in the web application
- Two separate GUIs: one for Kingland clients, and one for Kingland data analysts
20
Questions?