This repository contains the code used in the paper "Estimating Technology Performance Improvement Rates by Mining Patent Data", coauthored with Jeff Alstott and Chris Magee, published in Technological Forecasting and Social Change and freely available at https://www.sciencedirect.com/science/article/pii/S0040162520309264.
The patent data used by the code is available at http://dx.doi.org/10.17632/f4fj887y67.1. Processed patent citation data is available at https://zenodo.org/record/3902550#.Xu-DlWhKiUk. Raw data from the USPTO is available at https://www.patentsview.org/download/. The content of the data files and the methods used to create them are described in the article G. Triulzi, C.L. Magee, "Functional performance improvement data and patent sets for 30 technology domains with measurements of patent centrality and estimations of the improvement rate", published in Data in Brief and available at https://doi.org/10.1016/j.dib.2020.106257. If you use the data, please cite both articles.
The code is structured as a series of Jupyter Notebooks, which use two datasets available on Mendeley Data and Zenodo.
- Notebook: Compute entropy of assignees in domains
  - What it does: for each of the 30 technology domains, it computes the normalized entropy of the share of patents by assignee over time, from 1976 to 2015. Entropy is normalized as explained in the paper.
  - Inputs:
    - Domains_patent_info.csv (available on the Mendeley Data repository)
    - PERFORMANCE_DOMAINS_K_Kr2.csv (available on the Mendeley Data repository)
    - assignee.tsv (available at https://www.patentsview.org/download/)
    - rawassignee.tsv (available at https://www.patentsview.org/download/)
  - Output:
    - DF_Normalized_Entropy_Domains_over_time.csv
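
To illustrate the quantity computed by this notebook, here is a minimal sketch of a normalized Shannon entropy over assignee patent shares for a single domain and year. It assumes the entropy is normalized by its maximum, the log of the number of distinct assignees; the paper describes the normalization actually used, which may differ. The function name and the assignee labels are hypothetical and do not come from the notebook.

```python
import math
from collections import Counter


def normalized_entropy(assignees):
    """Shannon entropy of assignee patent shares, scaled to [0, 1].

    `assignees` holds one assignee label per patent.
    Assumption: normalization divides by log(number of distinct assignees).
    """
    counts = Counter(assignees)
    n_assignees = len(counts)
    if n_assignees <= 1:
        # A single assignee (or no patents) means no dispersion at all.
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(n_assignees)


# Hypothetical data: one assignee label per patent in a domain-year.
concentrated = ["A"] * 9 + ["B"]      # one assignee holds most patents
dispersed = ["A", "B", "C", "D"]      # patents spread evenly

print(normalized_entropy(concentrated))
print(normalized_entropy(dispersed))
```

Values close to 0 indicate that a few assignees hold most of the patents, while values close to 1 indicate that patents are spread evenly across assignees.
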
- Notebook: Compute Normalized Knowledge Obsolescence Index at patent and domain levels
  - What it does: it computes a normalized index of knowledge obsolescence based on the age profile of citations made by patents belonging to each of the 30 domains. The normalization procedure is explained in the paper.
  - Inputs:
    - Domains_patent_info.csv (available on the Mendeley Data repository)
    - All_patents_info.csv (available on the Mendeley Data repository)
    - CITATION_INFO_no_neg_citlag.csv (available on the Zenodo repository)
  - Output:
    - observed_vs_expected_citation_age_patent_level.csv
    - DataFrame_Normalized_Obsolescence_Domains_over_time.csv
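
The general idea of comparing an observed citation age with a baseline can be sketched as follows. This is only an illustration under strong assumptions: the "expected" age here is simply the mean citation lag over all citations made by patents granted in the same year, whereas the paper defines the normalization the notebook actually applies. The record layout, patent labels, and function name are hypothetical.

```python
from statistics import mean

# Hypothetical records: (citing_patent, grant_year, citation_lag_in_years).
citations = [
    ("P1", 2000, 2), ("P1", 2000, 4),
    ("P2", 2000, 10), ("P2", 2000, 14),
    ("P3", 2001, 5),
]


def observed_vs_expected(citations):
    """Map each citing patent to (observed mean lag, expected mean lag).

    Assumption: the expected lag is the mean lag of all citations made
    by patents granted in the same year as the citing patent.
    """
    lags_by_patent, lags_by_year, year_of = {}, {}, {}
    for patent, year, lag in citations:
        lags_by_patent.setdefault(patent, []).append(lag)
        lags_by_year.setdefault(year, []).append(lag)
        year_of[patent] = year
    return {
        patent: (mean(lags), mean(lags_by_year[year_of[patent]]))
        for patent, lags in lags_by_patent.items()
    }


result = observed_vs_expected(citations)
for patent, (observed, expected) in result.items():
    print(patent, observed, expected)
```

Under this simplified baseline, a patent whose observed mean lag is below the expected value cites more recent knowledge than its same-year peers.
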
- Notebook: Monte Carlo Cross Validation (with paper figures)
  - What it does: for a set of candidate patent-based predictors of the technology performance improvement rate (TIR), the code computes the predictor using data only up to a given year (from 1980 to 2015), randomly samples half of the technology domains, trains a regression to predict the TIR, and tests it on the remaining half. It then produces a figure showing the correlation over time between the observed log of the TIR (a.k.a. the parameter “K”) and the predictor, the coefficient of the predictor and the intercept. Note that the notebooks “Compute entropy of assignees in domains” and “Compute Normalized Knowledge Obsolescence Index at Patent and Domain Levels” should be run before the notebook “Monte Carlo Cross Validation (with paper figures)”, as the latter uses inputs produced by the former two.
  - Inputs:
    - Domains_patent_info.csv (available on the Mendeley Data repository)
    - All_patents_info.csv (available on the Mendeley Data repository)
    - PERFORMANCE_DOMAINS_K_Kr2.csv (available on the Mendeley Data repository)
    - DF_Normalized_Entropy_Domains_over_time.csv (output of code “Compute entropy of assignees in domains”)
    - DataFrame_Normalized_Obsolescence_Domains_over_time.csv (output of code “Compute Normalized Knowledge Obsolescence Index at Patent and Domain Levels”)
    - CITATIONS_DOMAINS.csv (available on the Zenodo repository)
  - Output:
    - DF_stability_prediction_over_time_MONTE_CARLO_COMPARISON.csv
    - Figures 3 and S6 from the paper.
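
The split-half procedure described for this notebook can be sketched as below for a single predictor and a single cutoff year. This is a minimal illustration on synthetic data, not the notebook's code: the domain names, predictor values, number of splits, and helper functions are all assumptions, and a simple one-variable least-squares fit stands in for whatever regression the notebook runs.

```python
import math
import random
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return slope, my - slope * mx


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def monte_carlo_cv(predictor, log_tir, n_splits=1000, seed=0):
    """Repeatedly train on a random half of the domains, test on the rest.

    Returns one (test correlation, slope, intercept) tuple per split.
    """
    rng = random.Random(seed)
    domains = sorted(predictor)
    half = len(domains) // 2
    results = []
    for _ in range(n_splits):
        shuffled = domains[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:half], shuffled[half:]
        slope, intercept = fit_line(
            [predictor[d] for d in train], [log_tir[d] for d in train]
        )
        predicted = [intercept + slope * predictor[d] for d in test]
        observed = [log_tir[d] for d in test]
        results.append((pearson(predicted, observed), slope, intercept))
    return results


# Synthetic stand-in for 30 domains (hypothetical values, not the paper's data).
data_rng = random.Random(42)
predictor = {f"domain_{i}": data_rng.uniform(0, 1) for i in range(30)}
log_tir = {d: -3 + 2 * x + data_rng.gauss(0, 0.3) for d, x in predictor.items()}

results = monte_carlo_cv(predictor, log_tir, n_splits=200)
print("mean test correlation:", mean(r for r, _, _ in results))
print("mean slope:", mean(s for _, s, _ in results))
```

Repeating this for each cutoff year would yield the over-time series of correlations, coefficients, and intercepts that the notebook plots.
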
- Notebook: Estimate TIR for new domains
  - What it does: it runs a regression that uses the best predictor identified by the Monte Carlo cross validation exercise to estimate TIRs for the 30 domains based on that predictor only. It then uses the estimated regression coefficients and the values of the predictor to estimate the TIR of 5 out-of-sample domains related to Bio-electronic Medicine, for which we only have patent data (no available observation of the empirical TIR). This notebook can be used to estimate the yearly performance improvement rate for any new technology domain. You only need a list of US patent numbers for the new domain. If you input a patent dataset that has more than one domain, it will also compute the likelihood that one is faster than the other(s).
  - Inputs:
    - Domains_patent_info.csv (available on the Mendeley Data repository)
    - All_patents_info.csv (available on the Mendeley Data repository)
    - PERFORMANCE_DOMAINS_K_Kr2.csv (available on the Mendeley Data repository)
    - PATENT_SET_BIOEL_MED.csv (available here)
  - Output:
    - REGRESSION_DATA.xlsx
    - TABLE_estimated_k_meanSPNPcited_1year_before_randomized_zscore_RPbyYear.xlsx
    - Figure “density_estimated_rate_new_domains” in PDF and TIFF formats
    - Figure “likelihood_domain_faster_than_other_domain” in PDF and TIFF formats
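
As a rough illustration of the last two steps, the sketch below applies a fitted line to a new domain's predictor value and compares two domains. It assumes the estimated log TIR of each domain is normally distributed around the regression prediction with a known standard deviation; the notebook's actual procedure for the likelihood may differ. The coefficients, predictor values, standard deviation, and function names are hypothetical.

```python
from math import exp, sqrt
from statistics import NormalDist


def estimate_log_tir(predictor_value, intercept, slope):
    """Point estimate of the log TIR from a one-predictor regression."""
    return intercept + slope * predictor_value


def prob_faster(mu_a, sigma_a, mu_b, sigma_b):
    """P(domain A improves faster than domain B).

    Assumption: the two estimated log rates are independent and normal,
    so their difference is normal with the variances added.
    """
    difference = NormalDist(mu_a - mu_b, sqrt(sigma_a**2 + sigma_b**2))
    return 1.0 - difference.cdf(0.0)


# Hypothetical regression coefficients and prediction uncertainty.
intercept, slope, sigma = -3.0, 2.0, 0.5

# Hypothetical predictor values for two new domains.
mu_a = estimate_log_tir(0.6, intercept, slope)
mu_b = estimate_log_tir(0.4, intercept, slope)

p_a_faster = prob_faster(mu_a, sigma, mu_b, sigma)
print("estimated yearly rate, domain A:", exp(mu_a))
print("estimated yearly rate, domain B:", exp(mu_b))
print("P(A faster than B):", p_a_faster)
```

Because the regression predicts the log of the rate, exponentiating the prediction gives the estimated rate itself.
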
Accompanying code for computing patent centrality indicators, normalized by randomizing the US patent citation network, is available on my co-author's GitHub page at: https://github.com/jeffalstott/patent_centralities.