Amazing work. Finding these functions saved me a lot of time and effort - thanks so much for making these available.
One small issue: when I use efetch, my XML results have a declaration that includes the encoding, e.g.
<?xml version="1.0" encoding="UTF-8"?>
clean_api_xml does not strip this form of the declaration, so the value it returns contains two XML declarations.
The following adds another gsub to handle cases where the encoding is included in the XML declaration. It also lets clean_api_xml work on XML results held in memory, without reading from or writing to disk.
## Clean PubMed XML returned by either the reutils or rentrez packages.
## With usedisk = TRUE, infile is a file path and the cleaned XML is also written to outfile;
## otherwise infile is the XML itself, held in memory.
clean_api_xml <- function(infile, outfile = NULL, usedisk = FALSE) {
  if (isTRUE(usedisk)) {
    if (is.null(outfile)) stop("outfile must be supplied when usedisk = TRUE")
    theData <- readChar(infile, file.info(infile)$size, useBytes = TRUE)
  } else {
    theData <- as.character(infile)
  }
  # Strip the existing declaration, DOCTYPE and root tags so they can be re-added exactly once below
  theData <- gsub("<?xml version=\"1.0\" encoding=\"UTF-8\"?>", "", theData, fixed = TRUE) # efetch xml results sometimes have encoding in the xml declaration
  theData <- gsub("<?xml version=\"1.0\" ?>", "", theData, fixed = TRUE)
  theData <- gsub("<!DOCTYPE PubmedArticleSet PUBLIC \"-//NLM//DTD PubMedArticle, 1st January 2019//EN\" \"https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd\">", "", theData, fixed = TRUE, useBytes = TRUE)
  theData <- gsub("<PubmedArticleSet>", "", theData, fixed = TRUE)
  theData <- gsub("</PubmedArticleSet>", "", theData, fixed = TRUE)
  theData <- gsub("<U\\+\\w{4}>", "", theData) ## note: with some files this doesn't catch everything; potential issue with <OtherAbstract> tags especially
  # Re-wrap the cleaned records in a single declaration, DOCTYPE and root element
  theData <- paste("<?xml version=\"1.0\" encoding=\"UTF-8\"?>", "<!DOCTYPE PubmedArticleSet PUBLIC \"-//NLM//DTD PubMedArticle, 1st January 2019//EN\" \"https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd\">", "<PubmedArticleSet>", theData, "</PubmedArticleSet>", sep = "\n")
  #theData <- paste(theData, "</PubmedArticleSet>")
  theData <- iconv(theData, to = "UTF-8", sub = "")
  if (isTRUE(usedisk)) {
    writeLines(theData, outfile, sep = " ")
  }
  return(theData)
}
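As a possible alternative to adding one fixed-string gsub per declaration variant, a single regular expression could strip any XML declaration regardless of its attributes. The sketch below is only an illustration: strip_xml_declaration is a hypothetical helper name, not part of the existing functions, and it assumes the declaration contains no ">" character before its closing "?>".

```r
# Hypothetical helper (not part of the original functions): remove any XML
# declaration, whether or not it carries an encoding attribute.
strip_xml_declaration <- function(x) {
  # "<?xml", one whitespace character, any run of non-">" characters, then "?>"
  gsub("<\\?xml\\s[^>]*\\?>", "", x, perl = TRUE)
}

with_encoding    <- "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<PubmedArticleSet></PubmedArticleSet>"
without_encoding <- "<?xml version=\"1.0\" ?>\n<PubmedArticleSet></PubmedArticleSet>"

# The fixed pattern for the short declaration does not match the encoding
# variant, which is why a second declaration survives in the output.
missed <- !grepl("<?xml version=\"1.0\" ?>", with_encoding, fixed = TRUE)

# The regular expression handles both variants.
cleaned_with    <- strip_xml_declaration(with_encoding)
cleaned_without <- strip_xml_declaration(without_encoding)
```

Requiring whitespace after "xml" keeps the pattern from also matching other processing instructions such as xml-stylesheet.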