Web-Based Data Collection and Analysis

Web-Based Data Collection and Analysis Andy Leone June 2014

Course Overview
Tools for extracting and analyzing text from:
- SEC filings
- Websites
- PDF files
Software:
- Perl (ActivePerl)
- Perl “front end”: Komodo
- MySQL

Why Perl?
- There are alternatives to Perl (e.g., Python, Ruby, R).
- Perl: Practical Extraction and Report Language.
- It is a mature language with considerable support.
- Countless packages (add-ons).
- About the best there is when it comes to regular expressions.

The big picture: What I do
[Diagram. Labels: SAS Connect, WRDS, SAS Pass-through, Perl, Edgar, SQL, ODBC, ODS, Proc Export, Excel]

SEC FTP
- Files can be obtained from the SEC’s FTP site: ftp.sec.gov
- Files are not made available on the FTP site until 24 hours after they are filed.
- It is possible to get real-time feeds.
- Large downloads are supposed to be done after 9:00 PM Eastern.

Index Files
- Index files are archived each quarter, and the most recent one is updated every day: /edgar/full-index/2006/QTR1/company.zip
- The zip file contains a file named “company.idx”.
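The quarterly path pattern lends itself to generating one path per quarter. A minimal sketch in Python (not the Perl used in the course), assuming every quarter follows the same layout as the 2006 QTR1 example; the function name `index_paths` is hypothetical:

```python
def index_paths(first_year, last_year):
    """Return one company.zip path per quarter for the given range of years."""
    paths = []
    for year in range(first_year, last_year + 1):
        for quarter in range(1, 5):
            # Assumed layout, copied from the 2006/QTR1 example path.
            paths.append(f"/edgar/full-index/{year}/QTR{quarter}/company.zip")
    return paths
```

The resulting list can be handed to whatever download routine is in use; the most recent quarter would need to be re-fetched, since that index is still being updated.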

Index File, continued
- First 10 lines: description, then data
- Company Name
- Form Type
- CIK Number
- Filing Date
- Directory and File Name

Description:        Master Index of EDGAR Dissemination Feed by Company Name
Last Data Received: September 6, 2006
Comments:           webmaster@sec.gov
Anonymous FTP:      ftp://ftp.sec.gov/edgar/

Company Name                 Form Type   CIK       Date Filed   File Name
---------------------------------------------------------------------------------------------------------------------------------------------
033 ASSET MANAGEMENT LLC /   13F-HR      1114831   2006-08-11   edgar/data/1114831/0001110550-06-000042.txt
1 800 CONTACTS INC           10-Q        1050122   2006-08-10   edgar/data/1050122/0001104659-06-053544.txt
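Reading the index comes down to skipping the description block and splitting each data row into its five fields. A minimal sketch in Python (not the course's Perl), assuming fields are separated by runs of two or more spaces and that the row of dashes marks the start of the data; the function name, dictionary keys, and `SAMPLE` text are illustrative only:

```python
import re

def parse_company_index(text):
    """Parse the data rows of a company.idx file into dictionaries."""
    rows = []
    in_data = False
    for line in text.splitlines():
        if not in_data:
            # Everything up to the row of dashes is description and header.
            if line.startswith("-----"):
                in_data = True
            continue
        # Assumption: two or more spaces separate fields, single spaces
        # occur only inside a field (e.g. inside a company name).
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) != 5:
            continue  # skip blank or malformed rows
        name, form_type, cik, date_filed, file_name = fields
        rows.append({
            "company_name": name,
            "form_type": form_type,
            "cik": cik,
            "date_filed": date_filed,
            "file_name": file_name,
        })
    return rows

SAMPLE = """Description:  Master Index of EDGAR Dissemination Feed by Company Name

Company Name                 Form Type   CIK       Date Filed   File Name
--------------------------------------------------------------------------
033 ASSET MANAGEMENT LLC /   13F-HR      1114831   2006-08-11   edgar/data/1114831/0001110550-06-000042.txt
1 800 CONTACTS INC           10-Q        1050122   2006-08-10   edgar/data/1050122/0001104659-06-053544.txt
"""

rows = parse_company_index(SAMPLE)
```

Splitting on whitespace runs rather than fixed column offsets avoids hard-coding widths, at the cost of misreading any row where a long company name leaves only one space before the form type; such rows are dropped by the field-count check and would need a second pass.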

Data Files
- SGML format
- All files contain basic header information tagged by <SEC-HEADER>.
- This is followed by each file submitted with the filing.
- Example: AMREP Corp. submitted a 10-K/A (Text Version)
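Pulling the header out of a filing can be sketched as follows. This is Python rather than the course's Perl, and it assumes the header is closed by a matching `</SEC-HEADER>` tag and consists of `LABEL: value` lines; the field labels and values in `SAMPLE` are made up for illustration and are not taken from the AMREP filing:

```python
import re

HEADER = re.compile(r"<SEC-HEADER>(.*?)</SEC-HEADER>", re.DOTALL)

def parse_header(filing_text):
    """Return the 'LABEL: value' pairs found inside the SEC-HEADER block."""
    match = HEADER.search(filing_text)
    if match is None:
        return {}
    fields = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        # Keep only lines that have both a label and a value.
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields

# Hypothetical filing text; labels and values are placeholders.
SAMPLE = """<SEC-HEADER>
CONFORMED SUBMISSION TYPE:   10-K/A
COMPANY CONFORMED NAME:      EXAMPLE CORP
CENTRAL INDEX KEY:           0000000000
</SEC-HEADER>
<DOCUMENT>
body of the first submitted file
</DOCUMENT>
"""

header = parse_header(SAMPLE)
```

A label that appears more than once in a header keeps only its last value here, which is one reason a real loader would likely need per-section handling.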

Challenge
- Lack of uniformity
- Another example: Smart-tek Solutions Inc. (Text)

Perl
- Practical Extraction and Report Language
- A powerful scripting language for working with unformatted text.

Perl
- The key feature of Perl is regular expressions.
- It is relatively easy to search through text files and match words or word patterns.
- For example, to identify firms with an internal control deficiency when the wording differs:
  - “We have identified a material weakness”
  - “A material weakness exists”
  - “A control deficiency exists”
  - “We identified a control deficiency”
  - “No material weakness exists” (Don’t want to match this one)
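The five phrases above can be handled with two patterns: one for the deficiency wording and one for its negated form. A minimal sketch in Python (the regular expressions would read almost the same in Perl); the pattern and function names are hypothetical, and only the negation “no ...” from the slide is covered:

```python
import re

DEFICIENCY = re.compile(
    r"\b(?:material\s+weakness|control\s+deficiency)\b", re.IGNORECASE
)
NEGATED = re.compile(
    r"\bno\s+(?:material\s+weakness|control\s+deficiency)\b", re.IGNORECASE
)

def flags_deficiency(sentence):
    """True if the sentence mentions a deficiency that is not negated by 'no'."""
    # Blank out negated mentions first, then look for any that remain.
    remaining = NEGATED.sub(" ", sentence)
    return bool(DEFICIENCY.search(remaining))

PHRASES = [
    "We have identified a material weakness",
    "A material weakness exists",
    "A control deficiency exists",
    "We identified a control deficiency",
    "No material weakness exists",
]
results = [flags_deficiency(p) for p in PHRASES]
```

Other negations (“did not identify any material weakness”, “without a material weakness”) would slip through and are the kind of case that gets added by trial and error.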

Expectations
- Using Perl can be very helpful:
  - More refined measures (e.g., modified audit opinions).
  - Large samples relative to hand collection.
- But extracting data from free-form text can be messy, especially when formatting varies so much from file to file.
  - A lot of trial and error.
  - It would take an infinite amount of time to get success rates up to 99-100%.
  - Need to accept 90-95% and live with some “noise.”

Example: Audit Opinions
- How do you find the opinion?
  - Look for something like “We have audited.”
  - But PWC has to start with “In our opinion.”
- How do you find the end of the opinion?
  - Look for a date (e.g., March 15, 2005) at the start of a line.
  - But some dates in the opinion happen to fall at the start of a line (e.g., for year ended December 31, 2005).
  - Then require the date to be followed by spaces.
  - But some opinions are dated with “except for.”
  - Foreign auditors might date the report in day-month-year order (15 March, 2005).
- Some firms have multiple opinions.
- Sometimes there are typos.
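The start and end rules above can be sketched with two patterns. This is Python rather than the course's Perl, and it covers only the simple cases: an opening of “We have audited” or “In our opinion”, and a closing date that stands alone on its line in either month-day-year or day-month-year order. “Except for” dating, multiple opinions, and typos are not handled; all names are hypothetical:

```python
import re

MONTHS = (
    r"(?:January|February|March|April|May|June|July|"
    r"August|September|October|November|December)"
)

# Opinion start: a line beginning with one of the two opening phrases.
START = re.compile(
    r"^[ \t]*(?:We have audited|In our opinion)", re.IGNORECASE | re.MULTILINE
)

# Opinion end: a date at the start of a line, followed only by spaces.
END = re.compile(
    rf"^[ \t]*(?:{MONTHS}[ \t]+\d{{1,2}},[ \t]+\d{{4}}"
    rf"|\d{{1,2}}[ \t]+{MONTHS},?[ \t]+\d{{4}})[ \t]*$",
    re.MULTILINE,
)

def extract_opinion(text):
    """Return the text from the opening phrase through the closing date, or None."""
    start = START.search(text)
    if start is None:
        return None
    end = END.search(text, start.end())
    if end is None:
        return None
    return text[start.start():end.end()].strip()

# Made-up report text for illustration.
SAMPLE = (
    "REPORT OF INDEPENDENT AUDITORS\n\n"
    "We have audited the balance sheet for the year ended\n"
    "December 31, 2005 and the related statements.\n"
    "In our opinion, the statements present fairly.\n\n"
    "Some Auditor LLP\n"
    "March 15, 2006   \n\n"
    "Notes to the financial statements follow.\n"
)
opinion = extract_opinion(SAMPLE)
```

Requiring nothing but spaces after the date is what keeps the “December 31, 2005 and the related statements” line from ending the opinion early.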

When is Perl most useful?
- Extracting specific text and looking for keywords and phrases:
  - Audit opinions
  - Internal controls
  - Footnotes
- Not great for unstructured financial information (for example, use of proceeds in an IPO prospectus).
- Sometimes you still need to code the data by hand.

What I do
- Download all index files and write them to a MySQL database (via Perl).
- Download filings I work with fairly often (e.g., 10-K, 10-Q, 8-K).
- Create header tables in MySQL for each filing type.
- Create project-specific tables as needed.
- Note: MySQL allows for easy access from SAS, Stata, Excel, etc. via ODBC.
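Writing index rows to a database and querying them back might look like the sketch below. It deliberately swaps in Python's built-in `sqlite3` module with an in-memory database so the example runs without a server; the workflow described here uses MySQL via Perl. The table name, column names, and helper functions are all hypothetical:

```python
import sqlite3

def load_index(conn, rows):
    """Create the index table if needed and insert (or refresh) the given rows."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS filing_index (
               company_name TEXT,
               form_type    TEXT,
               cik          INTEGER,
               date_filed   TEXT,
               file_name    TEXT PRIMARY KEY)"""
    )
    # file_name is the key, so reloading an updated index replaces old rows.
    conn.executemany(
        "INSERT OR REPLACE INTO filing_index VALUES (?, ?, ?, ?, ?)", rows
    )
    conn.commit()

def files_for_form(conn, form_type):
    """Return the file names of all filings of one form type, oldest first."""
    cursor = conn.execute(
        "SELECT file_name FROM filing_index WHERE form_type = ? ORDER BY date_filed",
        (form_type,),
    )
    return [row[0] for row in cursor]

conn = sqlite3.connect(":memory:")  # stand-in for a MySQL connection
load_index(conn, [
    ("033 ASSET MANAGEMENT LLC /", "13F-HR", 1114831, "2006-08-11",
     "edgar/data/1114831/0001110550-06-000042.txt"),
    ("1 800 CONTACTS INC", "10-Q", 1050122, "2006-08-10",
     "edgar/data/1050122/0001104659-06-053544.txt"),
])
quarterly = files_for_form(conn, "10-Q")
```

Keying on the file name means the daily-updated index can be reloaded repeatedly without creating duplicate rows.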

MySQL, Perl, and SAS
- DBI Module
- This makes it really easy to work with data via SQL or statistics software (SAS, Stata).
- SAS example:

    proc sql;
      connect to odbc (datasrc=Compustat user=me password=xxxx);
      create table myrestates as
        select * from connection to odbc
          (select gvkey, file, sic, hlink, company_name,
                  to_CFO, TO_CFO_Date, cik, irreg, to_CEO, to_CEO_Date
           from 8kdata.restate_sample_jul2006 a);
    quit;

Software Installation Checklist
- ActivePerl 5.10
  - Let’s go to the Package Manager and install a couple of modules:
    - DBD::mysqlPP
    - HTML::Format
- Komodo Edit 4.4
- MySQL 5.0
- MySQL GUI Tools
- SAS 9.1.3 (or higher)