Published by Jeffrey Doyle. Modified over 9 years ago.
1
Working With Large Datasets in Corporate Settings
Ed Bassin
www.profsoft-health.com
2
Background: About ProfSoft
Medical/pharma claims analysis software
  – Main uses are provider profiling, quality analysis
14 clients range from 15K to 2.6M members
  – Databases from 900K to 110M claim lines
Compete with Fortune 100 companies by stressing content, task-appropriate technology
Stata is the core of our product
  – 25,000 lines of ado files
  – Stata do file generators
3
Challenges We Face
Managing that quantity of data
End-users are not statisticians
  – Want point-and-click tools
  – Do not understand complicated techniques
Stata is largely unknown at our clients. SAS is the standard “heavy duty” data package.
Integrating Stata with the technology of corporate America
4
Why We Chose Stata
Performance
Relative ease of programming
Chose for analytic capabilities, not UI
I knew it reasonably well
5
Interfacing with Databases
Create_table ado reads Stata structure and writes appropriate SQL to build, load, and index tables
  – Write delimited text files with DBMS/Copy
  – Call native DBMS tools to load gigabytes of data
  – Support Oracle, Microsoft SQL Server, MySQL
Execsql ado calls native DBMS tools to run SQL scripts
Process is fast, easy, invisible
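The slide describes reading a dataset's structure and writing the SQL that builds and indexes a table. The ado code itself is not shown, so the following is only a minimal Python sketch of that idea, not the Create_table implementation. The function names, the type mapping, and the example table are all assumptions made for illustration, and the bulk-load step is left out because native loaders are vendor-specific.

```python
# Sketch only: turn a list of (variable name, Stata storage type) pairs
# into CREATE TABLE / CREATE INDEX statements. The mapping below is an
# assumed, generic one; a real deployment would adjust it per DBMS dialect.

STATA_TO_SQL = {
    "byte": "SMALLINT",
    "int": "SMALLINT",
    "long": "INTEGER",
    "float": "REAL",
    "double": "DOUBLE PRECISION",
}


def sql_type(stata_type):
    """Map a Stata storage type such as 'long' or 'str12' to a SQL type."""
    if stata_type.startswith("str"):
        # strN holds up to N characters
        return f"VARCHAR({int(stata_type[3:])})"
    return STATA_TO_SQL[stata_type]


def create_table_sql(table, variables, index_on=()):
    """Return SQL that creates `table` and one index per column in `index_on`."""
    cols = ",\n".join(f"    {name} {sql_type(t)}" for name, t in variables)
    statements = [f"CREATE TABLE {table} (\n{cols}\n);"]
    for col in index_on:
        statements.append(
            f"CREATE INDEX idx_{table}_{col} ON {table} ({col});"
        )
    return "\n".join(statements)


# Hypothetical claims table, for illustration only.
example = create_table_sql(
    "claims",
    [("member_id", "long"), ("paid_amt", "double"), ("dx_code", "str5")],
    index_on=["member_id"],
)
print(example)
```

Generating the DDL from the dataset's own structure keeps the table definition and the data file from drifting apart, which is presumably part of why the process can stay invisible to end-users.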
6
Web-Based, Point-and-Click Stata
Use PHP to write do files
PHP executes Stata, calls do file
Stata writes HTML and closes
PHP page displays output
End-user doesn’t know Stata in background
Process can be both synch and asynch
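The flow above (a web script writes a do file, then launches Stata on it) can be sketched as follows. The deck uses PHP; Python is used here purely as a stand-in. The variable names, the whitelist, the log file name, and the `stata` executable name are assumptions, and the command is only built, never run. `stata -b do <file>` is Stata's batch-mode invocation on Unix-like systems; the executable name and flag differ by platform and edition.

```python
import re
import tempfile
from pathlib import Path

# Assumed whitelist of measures a web form may request (hypothetical names).
ALLOWED_MEASURES = {"paid_amt", "allowed_amt"}

# Stata variable names: letters, digits, underscore; at most 32 characters.
STATA_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,31}")


def build_do_file(dataset, measure, by_var):
    """Return do-file text for a mean-by-group report.

    `dataset` is assumed to come from server-side configuration, not from
    the user; `measure` and `by_var` are validated because they originate
    in a web form.
    """
    if measure not in ALLOWED_MEASURES:
        raise ValueError(f"measure not allowed: {measure!r}")
    if not STATA_NAME.fullmatch(by_var):
        raise ValueError(f"invalid variable name: {by_var!r}")
    if '"' in dataset or "\n" in dataset:
        raise ValueError("dataset path contains an unsafe character")
    lines = [
        f'use "{dataset}", clear',
        'log using "result.log", text replace',
        f"tabstat {measure}, by({by_var}) statistics(mean)",
        "log close",
        "exit, clear",
    ]
    return "\n".join(lines) + "\n"


def stata_command(do_path):
    """Build (but do not run) a batch-mode command line for the do file."""
    return ["stata", "-b", "do", str(do_path)]


do_path = Path(tempfile.mkdtemp()) / "job.do"
do_path.write_text(build_do_file("claims.dta", "paid_amt", "provider_id"))
print(stata_command(do_path))
```

Validating every form value before it reaches the do file matters here because the generated text is executed as Stata code. A synchronous page would wait for the process to finish; an asynchronous one would start it and poll for the output file.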
7
Integrating Stata with Excel
Excel is everyday app for our users
Use Excel web queries to get to Stata
  – Build URL through forms or user actions
  – Two ways of getting Stata output to Excel
      Store Stata output in DBMS
      Run Stata jobs through PHP
  – Create HTML table & return results to Excel
  – Excel manipulates & formats Stata output
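An Excel web query needs two things from the server side: a URL that encodes the user's choices, and an HTML table it can import. A minimal sketch of both pieces is below, again in Python rather than the PHP the deck mentions. The endpoint, parameter names, and sample rows are hypothetical.

```python
from html import escape
from urllib.parse import urlencode


def report_url(base, **params):
    """Build a query URL from form values; keys are sorted for stable output."""
    return f"{base}?{urlencode(sorted(params.items()))}"


def html_table(header, rows):
    """Render rows as a bare HTML table, escaping every cell."""
    head = "".join(f"<th>{escape(str(h))}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


# Hypothetical endpoint and values, for illustration only.
print(report_url("http://example.com/report.php", provider="123", year=2004))
print(html_table(["provider", "mean_paid"], [["A&B Clinic", 152.4]]))
```

Returning a plain table and leaving formatting to Excel matches the division of labor on the slide: Stata computes, Excel presents.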
8
What Works Well
Analytic flexibility
Performance
Calling Stata from a web server is easy
Getting Stata datasets to HTML
Integration with DBMS systems
Hiding Stata from end-users
9
Lessons Learned
Segment data as much as possible
  – Be prepared to write special programs to run routine statistical procedures
  – Stata statistical programs work with raw data, not aggregated data
If missing data is not an issue, write your own egen or collapse routines
Automate memory setting by examining structure of dataset you want to use
DBMS/Copy to handle reading/writing of large datasets
Version control with CVS
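The point about automating the memory setting comes down to arithmetic: bytes per observation times number of observations, plus some working room. The sketch below illustrates that calculation in Python. The storage widths are Stata's documented ones (byte 1, int 2, long 4, float 4, double 8, strN N bytes); the headroom factor and the function names are assumptions. `set memory` applies to older Stata releases such as the Stata 8 mentioned in the deck; later versions manage memory automatically.

```python
import math

# Bytes per value for Stata's numeric storage types.
STATA_BYTES = {"byte": 1, "int": 2, "long": 4, "float": 4, "double": 8}


def var_width(stata_type):
    """Bytes one value of this storage type occupies (strN takes N bytes)."""
    if stata_type.startswith("str"):
        return int(stata_type[3:])
    return STATA_BYTES[stata_type]


def memory_needed_mb(types, n_obs, headroom=1.5):
    """Estimate megabytes needed to hold the data, with assumed headroom.

    `headroom` is a guess at the extra room needed for temporary
    variables and sorting; it is not a documented Stata figure.
    """
    row_bytes = sum(var_width(t) for t in types)
    return math.ceil(row_bytes * n_obs * headroom / 2**20)


def set_memory_command(types, n_obs, headroom=1.5):
    """Format the estimate as a `set memory` line for a generated do file."""
    return f"set memory {memory_needed_mb(types, n_obs, headroom)}m"


# Hypothetical structure: two variables, about one million observations.
print(set_memory_command(["long", "double"], 1_048_576))
```

Computing the setting from the dataset's structure avoids both failure modes of a fixed value: too little memory for the largest client, or too much reserved for the smallest.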
10
Problems
Integration with other data formats
  – Infile, outfile are very slow for large datasets
  – DBMS/Analyst was not maintained for Stata 8
Limitations of merge command
Abbreviations drive us nuts
No IDE (integrated development environment)
Stata datasets aren’t indexed
Stata has no name in corporate America
Recruiting Stata programmers