non-JSON formatted simpler summary data desired #142

rossmounce · 2016-12-06T11:09:17Z

The nested structure of the eupmc_results.json output makes it a little tricky to get human-readable summaries of the results. The journal title is a particular problem: for each result it is nested within journalInfo, and then within journal -> title. (Incidentally, title is a non-unique key: it is also used for the article title.) I'm not suggesting a change to the structure of the results JSON files, just that getpapers could offer, as a non-default option, a simpler overview CSV for people who find JSON hard or intimidating.
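To make the nesting concrete, here is a minimal sketch in R. The inline JSON is a made-up, single-record sample that only assumes the shape implied by the description above (every value wrapped in an array, with the journal title under journalInfo -> journal -> title); it is not real getpapers output. Parsing with simplifyVector = FALSE keeps everything as plain nested lists, so each level has to be stepped through explicitly.

```r
library(jsonlite)

# Hypothetical one-record sample; the field layout is an assumption
sample_json <- '[{
  "title": ["An example article"],
  "journalInfo": [{"journal": [{"title": ["An Example Journal"]}]}]
}]'

# simplifyVector = FALSE returns nested lists rather than data frames
records <- fromJSON(sample_json, simplifyVector = FALSE)
rec <- records[[1]]

# The same key name, "title", appears at two different depths
article_title <- rec$title[[1]]
journal_title <- rec$journalInfo[[1]]$journal[[1]]$title[[1]]

cat("article:", article_title, "\n")
cat("journal:", journal_title, "\n")
```

The two lookups show why a flat summary is easier to read: the article title sits at the top level, while the journal title needs three extra steps.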

As a workaround I have written a short R script that creates this non-interactively from the JSON. It's far from ideal, as it requires an R package (jsonlite) that users probably won't have installed.

Script below:

#!/usr/bin/env Rscript
args = commandArgs(trailingOnly=TRUE)
if (length(args)==0) {
  stop("At least one argument must be supplied (input file).\n", call.=FALSE)
} else if (length(args)==1) {
  # default output file
  args[2] = "summary.csv"
}
#install.packages('jsonlite')
library(jsonlite)
mymatrix <- fromJSON(args[1])
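# One row per result, filled in below with the journal title of each record
journals <- data.frame(rep(NA, nrow(mymatrix)))
for (i in seq_len(nrow(mymatrix))) {
  # Records such as patents have no journal title, so guard against NULL
  if (is.null(mymatrix$journalInfo[[i]]$journal[[1]]$title)) {
    journals[i,1] <- "not published in a journal"
  } else {
    journals[i,1] <- (mymatrix$journalInfo[[i]]$journal[[1]]$title)
  }
}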
zzz <- cbind(as.character(mymatrix$pmcid),
             as.character(mymatrix$title),
             journals[1],
             as.character(mymatrix$pubYear),
             as.character(mymatrix$authorString),
             as.character(mymatrix$doi),
             as.character(mymatrix$hasPDF),
             as.character(mymatrix$hasSuppl),
             as.character(mymatrix$isOpenAccess),
             as.character(mymatrix$citedByCount),
             as.character(mymatrix$electronicPublicationDate))
colnames(zzz) <- c("pmcid","article.title","journal","pubYear","authorString","doi","hasPDF","hasSuppl","isOpenAccess","citedByCount","electronicPublicationDate")
write.csv(zzz,file=args[2])

Example command-line usage:

Rscript json-to-csv.R eupmc_results.json output.csv

This creates an overview CSV file with a much-reduced set of fields, including everything that 90% of users are most likely to want to know, e.g. journal, article title and year of publication: the basics.

csvcut -n output.csv 
  1: 
  2: pmcid
  3: article.title
  4: journal
  5: pubYear
  6: authorString
  7: doi
  8: hasPDF
  9: hasSuppl
 10: isOpenAccess
 11: citedByCount
 12: electronicPublicationDate
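The unnamed first column in that listing comes from write.csv, which writes row names by default. If that column is not wanted, a possible tweak (a sketch, not part of the script above) is to pass row.names = FALSE. The small data frame here is a made-up stand-in for zzz, and the temporary files are only for illustration.

```r
# Hypothetical stand-in for the summary table built by the script
zzz <- data.frame(pmcid = "PMC0000000",
                  journal = "An Example Journal",
                  stringsAsFactors = FALSE)

with_names <- tempfile(fileext = ".csv")
without_names <- tempfile(fileext = ".csv")

# Default behaviour: row names are written as an unnamed first column
write.csv(zzz, file = with_names)
# row.names = FALSE drops that column
write.csv(zzz, file = without_names, row.names = FALSE)

header_with <- readLines(with_names, n = 1)
header_without <- readLines(without_names, n = 1)
cat(header_with, "\n")
cat(header_without, "\n")
```

With this change csvcut -n would start numbering at pmcid rather than at an empty column name.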

blahah · 2016-12-06T13:22:00Z

Right now we just take the eupmc API response object and serialise it to JSON. My personal opinion is that it might be out of scope for getpapers to do more with it, and that there's space for more tools in the ecosystem that do more. We could link to other tools that handle the output, including linking to this issue.

rossmounce · 2016-12-06T14:34:28Z

I've made a minor update to my script, adding an is.null check to the for loop. Patents do not have a journal title, and the resulting NULLs broke my simple for loop.
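The same guard can be sketched as a small helper that works on plain nested lists. The helper name journal_title and the two hand-built records are hypothetical, and the record layout is assumed rather than taken from real output. It relies on the fact that indexing into NULL in R yields NULL, so the whole chain collapses to NULL when journalInfo is absent.

```r
# Hypothetical helper: return the journal title, or a placeholder when absent
journal_title <- function(rec) {
  # Each missing level yields NULL, which propagates down the chain
  title <- rec$journalInfo[[1]]$journal[[1]]$title[[1]]
  if (is.null(title)) "not published in a journal" else title
}

# Hand-built records in an assumed layout: an article and a patent
article <- list(
  title = list("An example article"),
  journalInfo = list(list(journal = list(list(title = list("An Example Journal")))))
)
patent <- list(title = list("An example patent"))

titles <- vapply(list(article, patent), journal_title, character(1))
print(titles)
```

Using vapply with character(1) also makes the function fail loudly if a record ever yields something other than a single string.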

rossmounce added the feature request label Dec 6, 2016
