Merging data using Excel & Stata
Mark Bruyneel & Matthijs de Zwaan
Research Data Services
- Welcome to the course / this web lecture on Data retrieval skills
- My name is . . .
Program:
- Background (30 min.)
  - Getting your data
  - Working with data
- Exercise 1: Excel (30 min.)
- Exercise 2: Stata (40 min.)
Getting your data
Rule 1: think now, download later
Getting your data
- Formulate a research question: What was the influence of using governance reporting standards on earnings management of companies?
- What data do I need?
  - Variables?
  - Sample: geography? Time period?
- Is the data available?
What data do I need?
Research question: What was the influence of using governance reporting standards on earnings management of companies?
Variables
- What reporting standards?
- How to measure earnings management?
- Control variables: firm size, board members, …
Sample
- Which countries: USA, Europe, the Netherlands?
- Time period: recent, historical?
- Company type: public, SMEs?
Model? Relationships?
Remarks:
- Is the data for each database in a single currency?
- What control variables do I need?
- Do I need to download components to calculate variables I could not download (ratios etc.)?
- Is the data for each database comparable in time?
- If you need more than 1 database: do you need company identifiers? (see: Blackboard)
Which databases are relevant?
- Do I need several databases?
- Do I need to combine datasets?
- Do I need just one database?
Which databases are available?
Research Data Services
Data Center on Blackboard
Research Data Services on Blackboard
Research Data Services on Blackboard
Help on software: manuals/websites
Using several databases
[Diagram: Search 1, Data 1, Search 2 and Data 2, connected through company identifiers]
Company identifiers: codes that uniquely identify a company in 1 or more databases
Combining data(sets)
[Diagram: Data 1 and Data 2, connected through company identifiers]
Find out which (company) identification codes are available in all relevant databases!
Examples: ISIN, SEDOL, CUSIP, tickers.
Research Data Services on Blackboard
Additional information or tools
Blackboard file:
Working with data
- There are many different ways to organize data; the best way depends on what you want to do.
- For example: the clearest way to present data in a thesis is not the best way to organize data for analysis.
- Software requires a certain structure in the data. This can differ between software packages or even versions.
- Try to keep the actual data/observations separate from comments etc.
- For analysis, use “tidy” data: one line (row) = one observation, one column = one variable.
- Stata expects data as one observation in each line, and variables in columns: ‘tidy data’.
“Untidy” data: example 1
Untidy:
  Name               y2001  y2002
  Alphabet           -      2
  Johnson & Johnson  16     11
  Pfizer             3      1
Tidy:
  Name               Y(ear)  Result
  Alphabet           2001    -
  Johnson & Johnson  2001    16
  Pfizer             2001    3
  Alphabet           2002    2
  Johnson & Johnson  2002    11
  Pfizer             2002    1
Headers have names and data in one cell: ‘y’ = the variable year; 2001 and 2002 are values.
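The step from the untidy to the tidy layout above can be sketched in a few lines of Python. This is only an illustration of the idea, not part of the course material: the function name `wide_to_long` is made up, and the "-" on the slide is treated as a missing value. In Stata, the corresponding step would typically be a `reshape long` on the `y` stub.

```python
# Wide layout from the slide: one row per company, one column per year.
# Assumption: the "-" on the slide means "missing", stored here as None.
wide = [
    {"Name": "Alphabet", "y2001": None, "y2002": 2},
    {"Name": "Johnson & Johnson", "y2001": 16, "y2002": 11},
    {"Name": "Pfizer", "y2001": 3, "y2002": 1},
]


def wide_to_long(rows, id_col="Name", prefix="y"):
    """Turn every year column into its own (Name, Year, Result) row."""
    long_rows = []
    for row in rows:
        for col, value in row.items():
            if col == id_col or not col.startswith(prefix):
                continue
            year = int(col[len(prefix):])  # "y2001" -> 2001
            long_rows.append({id_col: row[id_col], "Year": year, "Result": value})
    # Sort by year, then name, to match the order on the slide.
    return sorted(long_rows, key=lambda r: (r["Year"], r[id_col]))


tidy = wide_to_long(wide)
for record in tidy:
    print(record)
```

Each output row now holds exactly one observation, and the year is a value in its own column instead of being part of a header.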
“Untidy” data: example 2
Untidy:
  Company    Result
  T-2001     -
  MSFT-2001  16
  GOOG-2001  3
  T-2002     2
  MSFT-2002  11
  GOOG-2002  1
Still untidy:
  Company  Year 2001  Year 2002
  T        -          2
  MSFT     16         11
  GOOG     3          1
Tidy:
  Company  Year  Result
  T        2001  -
  MSFT     2001  16
  GOOG     2001  3
  T        2002  2
  MSFT     2002  11
  GOOG     2002  1
One column, two separate variable values: Name and Year
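Splitting a combined cell such as "T-2001" into its two variables can be sketched as follows. This is a minimal Python illustration under the assumption that the year always comes after the last hyphen; the helper name `split_company_year` is hypothetical, and "-" results are stored as None.

```python
# Untidy input from the slide: company and year share one cell.
# Assumption: "-" on the slide means "missing", stored here as None.
raw = [
    ("T-2001", None),
    ("MSFT-2001", 16),
    ("GOOG-2001", 3),
    ("T-2002", 2),
    ("MSFT-2002", 11),
    ("GOOG-2002", 1),
]


def split_company_year(label):
    """Split 'MSFT-2001' into ('MSFT', 2001).

    Splitting from the right keeps company codes that themselves
    contain a hyphen in one piece.
    """
    company, year = label.rsplit("-", 1)
    return company, int(year)


tidy = []
for label, result in raw:
    company, year = split_company_year(label)
    tidy.append({"Company": company, "Year": year, "Result": result})

for record in tidy:
    print(record)
```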
Working with data
Basics of data merges:
- Merging data ≠ appending data
- Merging = adding variables
- Appending = adding observations
- Merge data on key variables (ID / codes), which:
  - must be available in all data files / datasets
  - uniquely identify observations (can be a combination of items)
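The difference between appending and merging, and the requirement that key variables uniquely identify observations, can be illustrated with a small Python sketch. The helper `duplicate_keys` and the `Listed` variable are made up for this illustration; the Pfizer results are taken from the earlier examples. In Stata, `append` and `merge` are separate commands, and `isid` can be used to check that variables uniquely identify the observations.

```python
from collections import Counter


def duplicate_keys(rows, key_cols):
    """Return the key values that occur more than once in rows."""
    counts = Counter(tuple(row[c] for c in key_cols) for row in rows)
    return [key for key, n in counts.items() if n > 1]


data_2001 = [{"Name": "Pfizer", "Year": 2001, "Result": 3}]
data_2002 = [{"Name": "Pfizer", "Year": 2002, "Result": 1}]

# Appending = adding observations: same variables, more rows.
appended = data_2001 + data_2002

# 'Name' alone does not identify a row, 'Name' + 'Year' does.
print(duplicate_keys(appended, ["Name"]))          # Pfizer occurs twice
print(duplicate_keys(appended, ["Name", "Year"]))  # no duplicates

# Merging = adding variables: same rows, one more column.
# 'Listed' is a hypothetical variable with made-up values.
extra = {("Pfizer", 2001): {"Listed": True}, ("Pfizer", 2002): {"Listed": True}}
merged = [{**row, **extra[(row["Name"], row["Year"])]} for row in appended]
print(merged)
```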
A tidy dataset: example 1
‘Name’ and ‘Year’ together uniquely identify a single observation. The ‘Result’ column gives variable values.
  Name               Year  Result
  Alphabet           2001  -
  Johnson & Johnson  2001  16
  Pfizer             2001  3
  Alphabet           2002  2
  Johnson & Johnson  2002  11
  Pfizer             2002  1
N.B.: Unique company ID codes are often better than the name!
Working with data
Warning: make sure to keep key ID codes intact! Leading zeros are lost when an ID is stored as a number:
  ID (original)  ID (zeros lost)
  00001324       1324
  00021234       21234
  03441234       3441234
Restoring the ID
Restore the original length with Excel: REPT & LEN
- REPT() = repeat: repeats a text a given number of times
- LEN() = length: the number of characters in a cell
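The REPT & LEN idea is: count how many characters are missing, then put that many zeros in front. In Excel this is commonly written along the lines of `=REPT("0",8-LEN(A2))&A2`, assuming the damaged ID sits in cell A2 and the original length is 8 (both assumptions; some regional settings use `;` instead of `,`). The same logic as a Python sketch, with a hypothetical function name `restore_id`:

```python
def restore_id(value, length=8):
    """Pad an ID with leading zeros up to its original length.

    Mirrors the Excel approach: LEN gives the current length,
    REPT builds the missing zeros. The default length of 8 is an
    assumption based on the example IDs.
    """
    text = str(value)
    if len(text) > length:
        raise ValueError(f"ID {text!r} is longer than {length} characters")
    missing = length - len(text)  # what LEN() is used for
    return "0" * missing + text   # what REPT() is used for


for damaged in (1324, 21234, 3441234):
    print(restore_id(damaged))
```

For non-negative IDs like these, Python's built-in `str(value).zfill(8)` gives the same result; the explicit version is shown only because it maps directly onto REPT and LEN.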
Merging data
Data 1:
  Auditor  Year  GRI Score
  John     2001  -
  Jane     2001  16
  Mary     2001  3
  John     2002  2
  Jane     2002  11
  Mary     2002  1
Data 2:
  Auditor  Year  Big4 ?
  John     2001  “Yes”
  Jane     2001
  Mary     2001  “No”
  John     2002
  Jane     2002
  Mary     2002
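Working with data: merging
Data 1:
  Auditor  Year  GRI Score
  John     2001  -
  Jane     2001  16
  Mary     2001  3
  John     2002  2
  Jane     2002  11
  Mary     2002  1
Data 2:
  Auditor  Year  Big4
  John     2001  “Yes”
  Jane     2001
  Mary     2001  “No”
  John     2002
  Jane     2002
  Mary     2002
“1-to-1” merge result:
  Auditor  Year  GRI Score  Big4
  John     2001  -          “Yes”
  Jane     2001  16
  Mary     2001  3          “No”
  John     2002  2
  Jane     2002  11
  Mary     2002  1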
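A 1-to-1 merge matches each row in the first dataset with at most one row in the second, using the key (here Auditor and Year together). The Python sketch below shows the mechanics under stated assumptions: the function name `merge_one_to_one` is hypothetical, "-" is stored as None, and the Big4 values are made up for illustration wherever the slide shows none. In Stata this kind of merge is written with `merge 1:1` followed by the key variables and `using`.

```python
def merge_one_to_one(master, using, key_cols):
    """Add the variables from 'using' to 'master', matching on key_cols.

    The key must be unique in both datasets; otherwise the merge is
    not 1-to-1 and a ValueError is raised.
    """
    def key(row):
        return tuple(row[c] for c in key_cols)

    lookup = {}
    for row in using:
        if key(row) in lookup:
            raise ValueError(f"key {key(row)} is not unique in the using data")
        lookup[key(row)] = row

    seen, merged = set(), []
    for row in master:
        if key(row) in seen:
            raise ValueError(f"key {key(row)} is not unique in the master data")
        seen.add(key(row))
        # Unmatched master rows are kept, without the new variable.
        merged.append({**row, **lookup.get(key(row), {})})
    return merged


gri = [
    {"Auditor": "John", "Year": 2001, "GRI": None},
    {"Auditor": "Jane", "Year": 2001, "GRI": 16},
    {"Auditor": "Mary", "Year": 2001, "GRI": 3},
    {"Auditor": "John", "Year": 2002, "GRI": 2},
    {"Auditor": "Jane", "Year": 2002, "GRI": 11},
    {"Auditor": "Mary", "Year": 2002, "GRI": 1},
]
# Illustrative Big4 values (assumption, not taken from the slide).
big4 = [
    {"Auditor": "John", "Year": 2001, "Big4": "Yes"},
    {"Auditor": "Jane", "Year": 2001, "Big4": "Yes"},
    {"Auditor": "Mary", "Year": 2001, "Big4": "No"},
    {"Auditor": "John", "Year": 2002, "Big4": "Yes"},
    {"Auditor": "Jane", "Year": 2002, "Big4": "No"},
    {"Auditor": "Mary", "Year": 2002, "Big4": "No"},
]

merged = merge_one_to_one(gri, big4, ["Auditor", "Year"])
for record in merged:
    print(record)
```

The number of rows stays the same; only a column is added. Raising an error on duplicate keys is a deliberate choice here, since a silent duplicate match is one of the most common ways a merge goes wrong.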
Working with data: merging
Data 1:
  Auditor  Year  GRI Score
  John     2001  -
  Jane     2001  16
  Mary     2001  3
  John     2002  2
  Jane     2002  11
  Mary     2002  1
Data 2:
  Name  Gender
  John  “M”
  Jane  “F”
  Mary
Working with data: merging
Data 1:
  Auditor  Year  GRI Score
  John     2001  -
  Jane     2001  16
  Mary     2001  3
  John     2002  2
  Jane     2002  11
  Mary     2002  1
Data 2:
  Name  Gender
  John  “M”
  Jane  “F”
  Mary
“many-to-1” merge result:
  Auditor  Year  GRI Score  Gender
  John     2001  -          “M”
  Jane     2001  16         “F”
  Mary     2001  3
  John     2002  2
  Jane     2002  11
  Mary     2002  1
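In a many-to-1 merge, several rows in the first dataset (one per auditor per year) share a single row in the second (one per person). A minimal Python sketch, assuming a hypothetical helper `merge_many_to_one`; "-" is stored as None, and Mary's gender is left as None because the slide shows no value for her. Note that the key has a different name in each table (Auditor versus Name), which the sketch handles explicitly. In Stata the key variable would need the same name in both files before using `merge m:1`.

```python
def merge_many_to_one(master, lookup_rows, master_key, lookup_key):
    """Attach person-level variables to every matching master row.

    The key must be unique in lookup_rows (the '1' side); it may
    repeat in master (the 'many' side).
    """
    lookup = {}
    for row in lookup_rows:
        k = row[lookup_key]
        if k in lookup:
            raise ValueError(f"key {k!r} is not unique on the '1' side")
        # Keep every variable except the key itself.
        lookup[k] = {c: v for c, v in row.items() if c != lookup_key}
    return [{**row, **lookup.get(row[master_key], {})} for row in master]


gri = [
    {"Auditor": "John", "Year": 2001, "GRI": None},
    {"Auditor": "Jane", "Year": 2001, "GRI": 16},
    {"Auditor": "Mary", "Year": 2001, "GRI": 3},
    {"Auditor": "John", "Year": 2002, "GRI": 2},
    {"Auditor": "Jane", "Year": 2002, "GRI": 11},
    {"Auditor": "Mary", "Year": 2002, "GRI": 1},
]
gender = [
    {"Name": "John", "Gender": "M"},
    {"Name": "Jane", "Gender": "F"},
    {"Name": "Mary", "Gender": None},  # no value shown in the source
]

merged = merge_many_to_one(gri, gender, master_key="Auditor", lookup_key="Name")
for record in merged:
    print(record)
```

The result still has six rows: each person's gender is simply repeated for every year in which that person appears.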
Working with data: merging
Data 1:
  Auditor  Year  GRI Score
  John     2001  -
  Jane     2001  16
  Mary     2001  3
  John     2002  2
  Jane     2002  11
  Mary     2002  1
Data 2:
  Name  Gender
  John  “M”
  Jane  “F”
  Mary
“1-to-m” merge result:
  Auditor  Year  GRI Score  Gender
  John     2001  -          “M”
  Jane     2001  16         “F”
  Mary     2001  3
  John     2002  2
  Jane     2002  11
  Mary     2002  1
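A 1-to-m merge is the same combination seen from the other side: the dataset with one row per person is the starting point, and each of its rows is matched with all the yearly rows for that person. The Python sketch below assumes a hypothetical helper `merge_one_to_many`; "-" is stored as None and Mary's gender is None because the slide shows no value. In Stata this direction is written with `merge 1:m`.

```python
from collections import defaultdict


def merge_one_to_many(master, using, master_key, using_key):
    """Expand each master row into one row per matching 'using' row.

    The key must be unique in master (the '1' side); it may repeat
    in using (the 'many' side).
    """
    groups = defaultdict(list)
    for row in using:
        rest = {c: v for c, v in row.items() if c != using_key}
        groups[row[using_key]].append(rest)

    seen, merged = set(), []
    for row in master:
        k = row[master_key]
        if k in seen:
            raise ValueError(f"key {k!r} is not unique on the '1' side")
        seen.add(k)
        # A master row without any match is kept once, unchanged.
        for match in groups.get(k, [{}]):
            merged.append({**row, **match})
    return merged


gender = [
    {"Name": "John", "Gender": "M"},
    {"Name": "Jane", "Gender": "F"},
    {"Name": "Mary", "Gender": None},  # no value shown in the source
]
gri = [
    {"Auditor": "John", "Year": 2001, "GRI": None},
    {"Auditor": "Jane", "Year": 2001, "GRI": 16},
    {"Auditor": "Mary", "Year": 2001, "GRI": 3},
    {"Auditor": "John", "Year": 2002, "GRI": 2},
    {"Auditor": "Jane", "Year": 2002, "GRI": 11},
    {"Auditor": "Mary", "Year": 2002, "GRI": 1},
]

merged = merge_one_to_many(gender, gri, master_key="Name", using_key="Auditor")
for record in merged:
    print(record)
```

Here the three person rows grow into six rows, one per person per year. The content is the same as in the many-to-1 case; only the starting dataset, and therefore the row order and key column name, differ.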
Exercise 1: Combining data using Excel
- Compustat Global data
- Preparing the Datastream data
- Combining both datasets
Exercise 2: Combining data using Stata
- Introduction
- Exercise: Compustat Global & Datastream
Exercise 2: Combining data using Stata https://download.vu.nl/
Exercise 2: Stata – Command line
Exercise 2: Stata – Scripts / Do files
Basics about .do files:
- Text files with the .do file extension
- Commands are handled as if they were typed in on the Command line interface
- Typing “doedit” opens the do-file editor
Advantages of scripting:
- It documents what you have done
- It makes finding mistakes and repairing them easier
- You can add comments to your script(s) (your future self & your supervisor will be grateful)
Exercise 2: Stata – Combining the data
Let’s get to work!
- Go to the Data Center Blackboard course
- Download the data files
- Start Stata
Need help? The library is there for you!
- Website: http://ub.vu.nl
- Blackboard: http://bb.vu.nl
- Email: ResearchDataServices.ub@vu.nl