10-K filing annual report word and document statistics

1 10-K filing annual report word and document statistics
David Ling

2 Document statistics
Downloaded 10-K filings of S&P 500 companies: 1 filing per year, up to 6 reports per company (some companies have fewer because they joined the index recently).
A regexp is used to extract Item 7.
Extracted items are stored as separate files.
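The regexp used for the extraction is not given in the slides. A minimal sketch of the idea, assuming plain-text filings in which the section starts with "Item 7." and ends at "Item 8.", might look as follows. The pattern `ITEM7_RE` and the helper `extract_item7` are hypothetical names for illustration, and real filings vary enough in formatting that a pattern like this will miss some of them.

```python
import re

# Hypothetical pattern: capture everything between an "Item 7." heading
# and the next "Item 8." heading. Illustrative only, not the authors' regexp.
ITEM7_RE = re.compile(
    r"item\s+7\.(.*?)(?=item\s+8\.)",
    re.IGNORECASE | re.DOTALL,
)

def extract_item7(text):
    """Return the longest Item 7 candidate, or None if the pattern never matches."""
    candidates = [m.group(1).strip() for m in ITEM7_RE.finditer(text)]
    # The table of contents usually mentions "Item 7." as well,
    # so the longest candidate is taken to be the real section body.
    return max(candidates, key=len) if candidates else None
```

Choosing the longest candidate is one simple way to skip the table-of-contents entry. A `None` result corresponds to the "regexp cannot be found" failure case.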

3 Documents with fewer than 4000 words may be considered failed extractions:
Incomplete extraction (only part of the item is extracted)
The item refers to content located somewhere else
The regexp cannot be found in the filing

4 Extracted document statistics
Total documents: 2859
Documents with words > 4000: 2459 (valid extraction)
Companies with valid extraction for the most recent 3 years: 409
Companies with valid extraction for the most recent 6 years: 369
We can rank those 409 companies.
Extracted number of words for some companies:
[CIK, 2016, 2015, 2014, 2013, 2012, 2011]
['93751' ]
['9389' ]
['940944' ]
['943819' ]
['96021' ]
['97476' ]
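The validity rule and the per-company counts above could be computed roughly as in the sketch below. It assumes word counts are kept in a nested dictionary keyed by CIK and then by year. That layout and the names `MIN_WORDS`, `count_words` and `companies_valid_for` are assumptions for illustration, not part of the original pipeline.

```python
MIN_WORDS = 4000  # threshold from the slides: more than 4000 words counts as valid

def count_words(text):
    """Count whitespace-separated words in an extracted item."""
    return len(text.split())

def companies_valid_for(word_counts, years):
    """Return the CIKs whose extraction is valid in every one of the given years.

    word_counts: {cik: {year: number_of_words}} (assumed layout).
    A missing year is treated as a failed extraction.
    """
    return sorted(
        cik
        for cik, by_year in word_counts.items()
        if all(by_year.get(year, 0) > MIN_WORDS for year in years)
    )
```

Calling `companies_valid_for(word_counts, [2016, 2015, 2014])` would then give the companies with valid extractions for the most recent 3 years.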

5 Top 50 frequent words among valid extractions
59290 distinct words in the valid extractions.
Stemming and lemmatization were not applied (e.g. cat and cats, play and played, company and company’s are counted as distinct words).
These forms are also distinct in the downloaded GloVe data.
[Chart: frequency in valid extractions]

6 Frequency percentile
About 10% of words appear only once.
Total frequency is heavily dominated by the most frequent 1% of words.
[Chart: percentage (59290 words) vs. frequency in valid extractions]
Frequency in valid extractions, Percentile
1 20 2 40 4 60 11 80 51 95 781 96 1148 97 1954 98 3617 99 9369 99.5 20254 100
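Word frequencies, document frequencies and a frequency-at-percentile lookup of this kind can be sketched with `collections.Counter`. The lowercasing and whitespace tokenization are assumptions (the slides only state that no stemming or lemmatization was applied), and the function names are hypothetical.

```python
import math
from collections import Counter

def word_stats(docs):
    """Return (term frequency, document frequency) Counters for a list of texts."""
    tf, df = Counter(), Counter()
    for text in docs:
        tokens = text.lower().split()  # assumed tokenization, no stemming
        tf.update(tokens)              # every occurrence counts
        df.update(set(tokens))         # each document counts at most once per word
    return tf, df

def frequency_at_percentile(tf, pct):
    """Frequency of the word at percentile pct of distinct words (nearest-rank method)."""
    freqs = sorted(tf.values())
    rank = max(1, math.ceil(len(freqs) * pct / 100))
    return freqs[rank - 1]
```

`tf.most_common(50)` gives the top 50 words, and `frequency_at_percentile(tf, 99)` gives the frequency reached by the word at the 99th percentile of distinct words.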

7 Some selected uncommon words
Rank, word, freq., doc freq.
58783,lncome,1,1
58784,quality.,1,1
58785,2.53x,1,1
58786,amrisc,1,1
58787,1.85x,1,1
58788,2.09x,1,1
58789,1.36x,1,1
58790,mid-fifties,1,1
58951,padding-bottom,1,1
58952,post-january,1,1
58953,disappear,1,1
58954,low-point,1,1
58955,-balance,1,1
58956,earnings.we,1,1
58957,non- deductible.our,1,1
58958,decemberr,1,1
Some are due to:
Numbers without spaces
Full stop not followed by a capital letter (‘…quality. table of …’)
Missing space (blue)
Hyphen
Wrong spelling
As they appear very rarely, we may simply ignore them or regard them as noise at this stage.

8 Discussions
Next step: term weighting and stop words
Filtering stop words with a stop word list from the internet (Bill McDonald)
Examples: A ABOUT ABOVE ACROSS AFOREMENTIONED AFORESAID AFTER AFTERWARDS AGAIN AGAINST ALL ALMOST ALONE ALONG ALREADY ALSO ALTHOUGH ALWAYS AMONG AMONGST AN AND ANOTHER ANY ANYHOW ANYONE ANYTHING ANYWHERE ARE AROUND AS AT BE BECAME BECAUSE
Filtering stop words by inverse document frequency
idf = log(1 / document frequency)
Because the documents are long, idf alone cannot distinguish meaningful frequent words from stop words: e.g. both ‘the’ and ‘income’ appear in all documents (same idf), but ‘income’ is much more meaningful than ‘the’.
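The limitation of idf can be seen in a small sketch. It assumes the document frequency is the fraction of documents that contain the term, so that idf = log(N / n) for N documents of which n contain it. The three toy documents are invented for illustration.

```python
import math

def idf(term, docs):
    """idf = log(1 / document frequency), with document frequency as a fraction of docs.

    docs: list of token sets, one per document.
    """
    n_containing = sum(1 for tokens in docs if term in tokens)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log(len(docs) / n_containing)

# Toy documents (invented): 'the' and 'income' occur in every one of them.
toy_docs = [
    {"the", "income", "rose"},
    {"the", "income", "fell"},
    {"the", "income", "tax"},
]
```

Here `idf("the", toy_docs)` and `idf("income", toy_docs)` are both 0, so idf cannot rank ‘income’ above ‘the’, while a word confined to one document such as ‘tax’ gets log(3).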

