Data Mining of USPTO – Technology Investment Trends. SAS Paper 101-2013. Kenneth M. Potter, SAIC; Robert N. Hatton, SAIC.

Study Purpose and Background
PURPOSE: Analyze technology trends within major information technology companies via intellectual property holdings in the form of granted patents.
BACKGROUND:
• U.S. Patent and Trademark Office (U.S. PTO)
• Single authority for granting patents in the United States
• Patent types
  » Utility (only utility patent grants were considered for this paper)
  » Design
  » Other types (plant patents, reissued patents, etc.)
• Patent documents published every Tuesday
  » Patent applications
  » Patent grants
• Patent data available via U.S. PTO web site and via Google®
Google is a registered trademark of Google Inc. in the U.S. and/or other countries.

Ground Rules and Constraints
• All patent grants between January 1, 1990, and September 18, 2012, were captured
• Only patent grants (patent applications not considered)
• Only utility patents (design patents do not have an abstract)
• Large quantity of data to process
  » 1,187 Tuesdays in target timeframe = 1,187 files to download
  » 45.4 GB total compressed file size
  » 241 GB total uncompressed file size
• Computing resource limitations
  » Single virtual machine
  » Two processor cores
  » 2 GB RAM

A Decision
• Parse the entire corpus one time
  » Pro: No need to reprocess the same document more than once
  » Con: Requires a large amount of storage space for data that will never be used
• Parse the target documents on demand
  » Pro: Does not require significant additional storage space beyond the size of the corpus of raw documents
  » Con: Potentially requires reprocessing the same document numerous times

Download Source Files
• Generate list of target URLs (we used Visual Basic® for Applications and exported this list from Excel® to a simple text file)
• cURL for Windows® and Batch Script to download files
[Figure: first few lines of file 1990_Google_Links.txt]
Batch script using cURL for Windows to download all of the target files listed in 1990_Google_Links.txt:
    for /F "tokens=1,2" %i in (E:\PatentAnalysis\1990_Google_Links.txt) do curl.exe --output "E:\PatentAnalysis\PatentDownloads\%j.zip" "%i"
The %i parameter refers to the first column from 1990_Google_Links.txt, which is the target URL. The %j parameter refers to the second column from 1990_Google_Links.txt, which is the publication date and is used for naming the .ZIP file that results from the download. All downloads are saved to the PatentDownloads folder. As written, the command runs at an interactive command prompt; inside a .bat file the loop variables must be written %%i and %%j.
Visual Basic, Excel, and Windows are registered trademarks of Microsoft Corporation in the U.S. and/or other countries.
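The download loop can also be sketched in Python. This is an illustration only, not the authors' code: the function names plan_downloads and download_all are hypothetical, and any URL or date format fed to them is an assumption, since the contents of 1990_Google_Links.txt are not shown. Like for /F "tokens=1,2", the sketch splits each line on whitespace, takes the first token as the URL and the second as the publication date, and names the output file after the date.

```python
from pathlib import PureWindowsPath
from urllib.request import urlretrieve


def plan_downloads(lines, out_dir):
    """Turn 'URL date' lines into (url, destination path) pairs."""
    jobs = []
    for line in lines:
        tokens = line.split()  # split on spaces and tabs
        if len(tokens) < 2:
            continue  # assumption: skip blank or incomplete lines
        url, pub_date = tokens[0], tokens[1]
        dest = PureWindowsPath(out_dir) / f"{pub_date}.zip"
        jobs.append((url, str(dest)))
    return jobs


def download_all(jobs):
    """Fetch every planned file (requires network access)."""
    for url, dest in jobs:
        urlretrieve(url, dest)
```

Separating the planning step from the network call keeps the file-naming logic checkable without downloading anything.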

Extract Source Files
• Generate list of downloaded .ZIP files and desired folders for extracted files (we used Visual Basic® for Applications and exported this list from Excel® to a simple text file)
• WinZip® command line and Batch Script to extract files
[Figure: first few lines of file Zip_file_Links_ txt]
Batch script using WinZip Command Line to extract all of the target files listed in Zip_file_Links_ txt:
    for /F "tokens=1,2" %i in (E:\PatentAnalysis\Zip_file_Links_ txt) do "C:\Program Files\WinZip\wzunzip" "%i" "%j"
The %i parameter refers to the first column from Zip_file_Links_ txt, which is the .ZIP file to be decompressed. The %j parameter refers to the second column from Zip_file_Links_ txt, which is the destination folder for the extraction. As written, the command runs at an interactive command prompt; inside a .bat file the loop variables must be written %%i and %%j.
Visual Basic and Excel are registered trademarks of Microsoft Corporation in the U.S. and/or other countries. WinZip is a registered trademark of WinZip International, LLC in the U.S. and/or other countries.
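For readers without the WinZip command line, the same extraction step can be sketched with Python's standard zipfile module. This swaps in a different tool from the one the authors used; the function name extract_all is hypothetical, and the input is assumed to be the same two columns described above (archive path, destination folder).

```python
import zipfile
from pathlib import Path


def extract_all(pairs):
    """Extract each (zip_path, dest_folder) pair; return the folders written."""
    done = []
    for zip_path, dest in pairs:
        dest_dir = Path(dest)
        # Create the destination folder if it does not exist yet.
        dest_dir.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(dest_dir)
        done.append(dest_dir)
    return done
```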

Grant Documents Exist in Three Basic Formats
• 1990 – 2000: .txt files
• 2001: .sgm files *
• After 2001 – present: .xml files **
* Standard Generalized Markup Language
** XML files for 2002 – 2004 maintain same structure as 2001 SGML files

Parsing a Source Document
[Figure: DATA step for capturing assignee information from pre patent grant data]

Build Patents-to-Assignees Lookup Datasets
• Parse all documents in the corpus to extract assignees
• ParseAssignees.sas iterates through the folders of downloaded files, passing each folder path to GetFileNamesMacro.sas
• GetFileNamesMacro.sas finds each file in the target folder and, based on the year found in the folder name, passes the target file to one of three macros used to parse the three formats (text, SGML, or XML)
• Result is a lookup dataset per year consisting of patent number, assignee name, and the path to the source file in which the patent can be found
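The year-based dispatch can be sketched as follows. This is a Python illustration rather than the authors' SAS macro: the function names are hypothetical, and it assumes the folder path contains a standalone four-digit year, since the actual folder naming scheme is not shown. The year ranges follow the three formats described in this presentation (text for 1990 – 2000, SGML for 2001, XML afterwards).

```python
import re


def format_for_year(year):
    """Map a grant year to the parser family that handles it."""
    if 1990 <= year <= 2000:
        return "text"
    if year == 2001:
        return "sgml"
    if year >= 2002:
        return "xml"
    raise ValueError(f"year {year} is outside the study window")


def format_for_folder(folder_path):
    """Find the first standalone four-digit year in a folder path."""
    match = re.search(r"(?<!\d)(19|20)\d{2}(?!\d)", folder_path)
    if match is None:
        raise ValueError(f"no year found in {folder_path!r}")
    return format_for_year(int(match.group(0)))
```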

Parse Patents for a Specific Company
• FindCompanySubset.sas consolidates all of the lookup datasets created in the previous step
• Uses a subsetting IF statement to only keep observations where the Assignee value contains a literal value entered manually by the user (for example, “Company X”)
• Result is a dataset containing only the patent numbers (both design and utility) and source file paths for patents granted to the target company
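The consolidate-and-filter step can be sketched in Python. This is not FindCompanySubset.sas itself: the LookupRow type and the find_company_subset function are hypothetical names, and the match is assumed to be a case-sensitive substring test, which the slides do not specify.

```python
from typing import NamedTuple


class LookupRow(NamedTuple):
    patent_number: str
    assignee: str
    source_path: str


def find_company_subset(yearly_lookups, company):
    """Keep (patent number, source path) for rows whose assignee contains `company`."""
    subset = []
    for rows in yearly_lookups:  # one list of rows per year
        for row in rows:
            if company in row.assignee:  # case-sensitive substring match (assumption)
                subset.append((row.patent_number, row.source_path))
    return subset
```

Because the test is a plain substring match, a short literal can also pick up unrelated assignees whose names happen to contain it, so the literal should be chosen with care.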

Parse Patents for a Specific Company
• ParsePatents.sas creates a unique list of source file paths to be processed and passes each to one of three macros to parse the three formats (text, SGML, or XML)
• Each of the three macros extracts basic information such as patent number, issue date, and abstract
• The results are separate datasets for the company’s design and utility patents. The utility patent datasets for each company form the basis for the analysis portion of the study.
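Two pieces of this step, de-duplicating the source file paths and separating design from utility patents, can be sketched in Python. The function names are hypothetical, and the split rule is an assumption: it relies on the USPTO convention that design patent numbers begin with "D" while utility patent numbers are purely numeric. How the authors' macros make the distinction is not stated.

```python
def unique_paths(subset):
    """Return each source path once, in first-seen order."""
    # dict.fromkeys keeps insertion order while dropping duplicates
    return list(dict.fromkeys(path for _, path in subset))


def split_by_type(patent_numbers):
    """Split patent numbers into (design, utility) lists."""
    design = [n for n in patent_numbers if n.upper().startswith("D")]
    utility = [n for n in patent_numbers if n[:1].isdigit()]
    # Other prefixes (for example plant or reissue patents) land in neither list.
    return design, utility
```

Processing each source file once per company is what limits the cost of the on-demand parsing approach, since one weekly file may contain many of the company's patents.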

Results

Questions
Contact Information:
Kenneth M. Potter, SAIC
Robert N. Hatton, SAIC
