Published by Theodore Melton. Modified over 8 years ago.
1
Data Mining of USPTO – Technology Investment Trends
SAS Paper 101-2013
Kenneth M. Potter, SAIC
Robert N. Hatton, SAIC
2
Study Purpose and Background
PURPOSE: Analyze technology trends within major information technology companies via intellectual property holdings in the form of granted patents.
BACKGROUND: U.S. Patent and Trademark Office (U.S. PTO)
Single authority for granting patents in the United States
Patent types
» Utility (only utility patent grants were considered for this paper)
» Design
» Other types (plant patents, reissued patents, etc.)
Patent documents published every Tuesday
» Patent applications
» Patent grants
Patent data available via U.S. PTO web site and via Google®
Google is a registered trademark of Google Inc. in the U.S. and/or other countries.
3
Ground Rules and Constraints
All patent grants between January 1, 1990, and September 18, 2012, were captured
» Only patent grants (patent applications not considered)
» Only utility patents (design patents do not have an abstract)
Large quantity of data to process
» 1,187 Tuesdays in target timeframe = 1,187 files to download
» 45.4 GB total compressed file size
» 241 GB total uncompressed file size
Computing resource limitations
» Single virtual machine
» Two processor cores
» 2 GB RAM
4
A Decision
Parse the entire corpus one time
» Pro: No need to reprocess the same document more than once
» Con: Requires a large amount of storage space for data that will never be used
Parse the target documents on demand
» Pro: Does not require significant additional storage space beyond the size of the corpus of raw documents
» Con: Potentially requires reprocessing the same document numerous times
5
Download Source Files
Generate list of target URLs (we used Visual Basic® for Applications and exported this list from Excel® to a simple text file)
cURL for Windows® and Batch Script to download files
Each line of 1990_Google_Links.txt holds two columns: the target URL and the publication date.
Batch script using cURL for Windows to download all of the target files listed in 1990_Google_Links.txt:
    for /F "tokens=1,2" %i in (E:\PatentAnalysis\1990_Google_Links.txt) do curl.exe --output "E:\PatentAnalysis\PatentDownloads\%j.zip" "%i"
As written, the command runs at the command prompt; inside a .bat file the loop variables must be written %%i and %%j.
The %i parameter refers to the first column from 1990_Google_Links.txt, which is the target URL.
The %j parameter refers to the second column from 1990_Google_Links.txt, which is the publication date and is used for naming the .ZIP file that results from the download.
All downloads are saved to the PatentDownloads folder.
Visual Basic, Excel, and Windows are registered trademarks of Microsoft Corporation in the U.S. and/or other countries.
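The download loop above can also be sketched in Python. This is only an illustration under stated assumptions, not the authors' method: the function names read_link_list and download_all are hypothetical, and the link file is assumed to be whitespace-delimited with the URL first and the publication date second, as the slide describes.

```python
import urllib.request
from pathlib import Path


def read_link_list(text):
    """Parse whitespace-delimited lines into (url, publication_date) pairs."""
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:              # skip blank or malformed lines
            pairs.append((fields[0], fields[1]))
    return pairs


def download_all(pairs, dest_dir, fetch=urllib.request.urlretrieve):
    """Save each URL as <publication_date>.zip in dest_dir; return the paths.

    `fetch` is injectable (hypothetical design choice) so the loop can be
    exercised without network access.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for url, pub_date in pairs:
        target = dest / f"{pub_date}.zip"
        if not target.exists():           # lets an interrupted run resume
            fetch(url, str(target))
        saved.append(target)
    return saved
```

Skipping files that already exist is a small addition that may matter for a corpus of this size, since a failed run can be restarted without downloading everything again.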
6
Extract Source Files
Generate list of downloaded .ZIP files and desired folders for extracted files (we used Visual Basic® for Applications and exported this list from Excel® to a simple text file)
WinZip® command line and Batch Script to extract files
Each line of Zip_file_Links_1990-1999.txt holds two columns: the .ZIP file to be decompressed and the destination folder.
Batch script using WinZip Command Line to extract all of the target files listed in Zip_file_Links_1990-1999.txt:
    for /F "tokens=1,2" %i in (E:\PatentAnalysis\Zip_file_Links_1990-1999.txt) do "C:\Program Files\WinZip\wzunzip" "%i" "%j"
As written, the command runs at the command prompt; inside a .bat file the loop variables must be written %%i and %%j.
The %i parameter refers to the first column from Zip_file_Links_1990-1999.txt, which is the .ZIP file to be decompressed.
The %j parameter refers to the second column from Zip_file_Links_1990-1999.txt, which is the destination folder for the extraction.
Visual Basic and Excel are registered trademarks of Microsoft Corporation in the U.S. and/or other countries. WinZip is a registered trademark of WinZip International, LLC in the U.S. and/or other countries.
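For comparison, the extraction step can be sketched with Python's standard zipfile module instead of the WinZip command line. The function name extract_all is hypothetical, and recording failures rather than stopping is an assumption about what would be convenient, not something the slides describe.

```python
import zipfile
from pathlib import Path


def extract_all(pairs):
    """Extract each (zip_path, dest_folder) pair; return the paths that failed."""
    failed = []
    for zip_path, dest_folder in pairs:
        Path(dest_folder).mkdir(parents=True, exist_ok=True)
        try:
            with zipfile.ZipFile(zip_path) as archive:
                archive.extractall(dest_folder)
        except (zipfile.BadZipFile, FileNotFoundError):
            failed.append(zip_path)       # e.g. a truncated or missing download
    return failed
```

Returning the failed paths makes it possible to download only those files again before parsing begins.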
7
Grant Documents Exist in Three Basic Formats
1990 – 2000: .txt files
2001: .sgm files *
2002 – present: .xml files **
* Standard Generalized Markup Language
** XML files for 2002 – 2004 maintain same structure as 2001 SGML files
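The year-to-format mapping on this slide can be written as a small lookup, which is how a parser dispatcher might choose its input handler. The function names and the format labels "text", "sgml", and "xml" are hypothetical; only the year boundaries come from the slide.

```python
def format_for_year(year):
    """Return the grant-file format label for a publication year."""
    if year < 1990:
        raise ValueError("year is before the study timeframe")
    if year <= 2000:
        return "text"                     # .txt files
    if year == 2001:
        return "sgml"                     # .sgm files
    return "xml"                          # .xml files, 2002 onward


def shares_sgml_structure(year):
    """True for years whose files follow the 2001 SGML document structure."""
    return 2001 <= year <= 2004
```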
8
Parsing a Source Document
DATA Step for capturing assignee information from pre-2001 patent grant data
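The DATA step itself is not reproduced in this transcript. As a rough stand-in, the Python sketch below shows one way assignee names could be pulled from a tagged text file. It rests on a loud assumption that does not come from the slides: each line starts with a four-character field tag, a bare tag such as PATN or ASSG opens a new section, WKU carries the patent number, and NAM inside an ASSG section carries the assignee name. The function name parse_assignees is also hypothetical.

```python
def parse_assignees(lines):
    """Yield (patent_number, assignee_name) pairs from tagged text lines.

    ASSUMED layout (not taken from the slides): columns 1-4 hold a field
    tag; a tag with no value starts a section; continuation lines are blank
    in the tag columns.
    """
    patent_number = None
    section = None
    for raw in lines:
        line = raw.rstrip("\n")
        tag, value = line[:4].strip(), line[4:].strip()
        if not tag:
            continue                      # continuation of the previous field
        if not value:
            section = tag                 # bare tag opens a new section
            if tag == "PATN":
                patent_number = None      # a new patent record begins
        elif tag == "WKU" and section == "PATN":
            patent_number = value
        elif tag == "NAM" and section == "ASSG":
            yield patent_number, value    # NAM in other sections is ignored
```

Tracking the current section matters under this assumed layout because the same NAM tag would also appear for inventors, which should not be reported as assignees.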
9
Build Patents-to-Assignees Lookup Datasets
Parse all documents in the corpus to extract assignees
» ParseAssignees.sas iterates through the folders of downloaded files, passing each folder path to GetFileNamesMacro.sas
» GetFileNamesMacro.sas finds each file in the target folder and, based on the year found in the folder name, passes the target file to one of three macros used to parse the three formats (text, SGML, or XML)
Result is a lookup dataset per year consisting of patent number, assignee name, and the path to the source file in which the patent can be found
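The folder walk and dispatch described above can be sketched as follows. This is a Python illustration of the control flow only, not a translation of ParseAssignees.sas or GetFileNamesMacro.sas. The names year_from_folder and build_lookup are hypothetical, and the parsers argument is assumed to map the labels "text", "sgml", and "xml" to callables that return (patent_number, assignee) pairs for one file.

```python
import re
from pathlib import Path


def year_from_folder(folder_name):
    """Return the first four-digit year (19xx or 20xx) in a folder name."""
    match = re.search(r"(19|20)\d{2}", folder_name)
    if match is None:
        raise ValueError(f"no year found in {folder_name!r}")
    return int(match.group(0))


def build_lookup(root, parsers):
    """Return {year: [(patent_number, assignee, source_path), ...]}."""
    lookup = {}
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue
        year = year_from_folder(folder.name)
        # Year boundaries follow the three-format slide.
        fmt = "text" if year <= 2000 else "sgml" if year == 2001 else "xml"
        rows = lookup.setdefault(year, [])
        for source in sorted(folder.iterdir()):
            if source.is_file():
                for patent_number, assignee in parsers[fmt](source):
                    rows.append((patent_number, assignee, str(source)))
    return lookup
```

Storing the source path next to each patent number is what later allows a single company's patents to be parsed on demand without rescanning the whole corpus.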
10
Parse Patents for a Specific Company
FindCompanySubset.sas consolidates all of the lookup datasets created in the previous step
» Uses a subsetting IF statement to only keep observations where the Assignee value contains a literal value entered manually by the user (for example, “Company X”)
Result is a dataset containing only the patent numbers (both design and utility) and source file paths for patents granted to the target company
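The consolidate-then-filter step can be illustrated in Python as a substring test over the per-year lookup rows. The function name find_company_subset and the row layout (patent_number, assignee, source_path) are assumptions made for this sketch; the actual SAS program is not shown in the slides.

```python
def find_company_subset(lookups, company):
    """Keep (patent_number, source_path) for rows whose assignee contains `company`.

    `lookups` is an iterable of per-year row lists; each row is assumed to be
    (patent_number, assignee, source_path).
    """
    subset = []
    for rows in lookups:
        for patent_number, assignee, source_path in rows:
            if company in assignee:       # case-sensitive literal match
                subset.append((patent_number, source_path))
    return subset
```

A literal match like this is case-sensitive, so spelling variants of the same company (for example "COMPANY X") would need separate passes or normalization.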
11
Parse Patents for a Specific Company
ParsePatents.sas creates a unique list of source file paths to be processed and passes each to one of three macros to parse the three formats (text, SGML, or XML)
» Each of the three macros extracts basic information such as patent number, issue date, and abstract
The results are separate datasets for the company’s design and utility patents. The utility patent datasets for each company form the basis for the analysis portion of the study.
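Building the unique list of source files can be sketched in Python by grouping the target patent numbers under the file that contains them, so that each file is opened only once per company. The function name group_targets_by_file is hypothetical, and the input is assumed to be (patent_number, source_path) pairs such as those produced by the company filter.

```python
from collections import defaultdict


def group_targets_by_file(subset):
    """Map each source path to the set of patent numbers wanted from it."""
    targets = defaultdict(set)
    for patent_number, source_path in subset:
        targets[source_path].add(patent_number)
    return dict(targets)                  # keys are the unique source paths
```

The keys of the returned dictionary play the role of the unique path list, and the value for each key tells a parser which records in that file to keep.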
12
Results
14
Questions
Contact Information:
Kenneth M. Potter, SAIC (w) 256-319-8440 kenneth.m.potter@saic.com
Robert N. Hatton, SAIC (w) 256-319-8403 robert.n.hatton@saic.com
15
Paper 101-2013