Changed XML declaration? #1

gc5011 · 2021-02-10T02:10:51Z

Amazing work. Finding these functions saved me a lot of time and effort - thanks so much for making these available.

Just one small issue: when using efetch, my XML results have a declaration that includes the encoding, e.g.

<?xml version="1.0" encoding="UTF-8"?>

When running clean_api_xml on these results, this XML declaration is not matched and so is not removed. As a result, the value returned by the function contains two XML declarations.
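The mismatch can be reproduced in a few lines. This is only a sketch: the `xml_in` string is a shortened, made-up stand-in for an efetch response, and `old_pattern` is the declaration pattern without an encoding attribute.

```r
# Hypothetical, shortened stand-in for an efetch response (not real PubMed output)
xml_in <- "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<PubmedArticleSet></PubmedArticleSet>"

# Pattern that only matches a declaration without an encoding attribute
old_pattern <- "<?xml version=\"1.0\" ?>"

# fixed = TRUE requires an exact literal match, so nothing is removed here
stripped <- gsub(old_pattern, "", xml_in, fixed = TRUE)

# The original declaration survives, so prepending a fresh one would duplicate it
declaration_survives <- grepl("<?xml", stripped, fixed = TRUE)
print(declaration_survives)
```

Because the literal pattern differs from the declaration actually returned, the input comes back unchanged.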

The following adds another gsub to handle cases where encoding is included in the xml declaration. It also allows clean_api_xml to work on xml results in memory without needing to read/write to disk.

## Clean PubMed XML returned from either the reutils or rentrez packages.
## With usedisk = TRUE, infile is a path to read from and the cleaned XML is saved to outfile;
## otherwise infile is treated as XML already held in memory. The cleaned XML is returned either way.
clean_api_xml <- function(infile, outfile = NULL, usedisk = FALSE) {
  if (isTRUE(usedisk)) {
    theData <- readChar(infile, file.info(infile)$size, useBytes = TRUE)
  } else {
    theData <- as.character(infile)
  }
  ## strip the existing declaration, DOCTYPE and root tags so they are not duplicated when re-added below
  theData <- gsub("<?xml version=\"1.0\" encoding=\"UTF-8\"?>", "", theData, fixed = TRUE) # efetch xml results sometimes have encoding in the xml declaration
  theData <- gsub("<?xml version=\"1.0\" ?>", "", theData, fixed = TRUE)
  theData <- gsub("<!DOCTYPE PubmedArticleSet PUBLIC \"-//NLM//DTD PubMedArticle, 1st January 2019//EN\" \"https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd\">", "", theData, fixed = TRUE, useBytes = TRUE)
  theData <- gsub("<PubmedArticleSet>", "", theData, fixed = TRUE)
  theData <- gsub("</PubmedArticleSet>", "", theData, fixed = TRUE)
  theData <- gsub("<U\\+\\w{4}>", "", theData) ## note: with some files this doesn't catch everything; potential issue with <OtherAbstract> tags especially
  ## re-wrap the cleaned records in a single declaration, DOCTYPE and root element
  theData <- paste("<?xml version=\"1.0\" encoding=\"UTF-8\"?>", "<!DOCTYPE PubmedArticleSet PUBLIC \"-//NLM//DTD PubMedArticle, 1st January 2019//EN\" \"https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd\">", "<PubmedArticleSet>", theData, "</PubmedArticleSet>", sep = "\n")
  theData <- iconv(theData, to = "UTF-8", sub = "")
  if (isTRUE(usedisk)) {
    writeLines(theData, outfile, sep = " ")
  }
  return(theData)
}
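As a possible alternative to adding one fixed-string gsub per declaration variant, a single regular expression could remove any XML declaration regardless of its attributes. This is only a sketch: `strip_xml_declaration` is a hypothetical helper name, not part of the existing functions, and it assumes the declaration contains no `>` character inside its attribute values.

```r
# Hypothetical helper (not part of the original code): remove any XML declaration,
# whatever version/encoding attributes it carries.
strip_xml_declaration <- function(x) {
  # "<?xml" followed by whitespace, then anything up to the closing "?>"
  # The required whitespace avoids matching other processing instructions
  # such as <?xml-stylesheet ...?>
  gsub("<\\?xml\\s[^>]*\\?>", "", x, perl = TRUE)
}

with_encoding    <- strip_xml_declaration("<?xml version=\"1.0\" encoding=\"UTF-8\"?><a/>")
without_encoding <- strip_xml_declaration("<?xml version=\"1.0\" ?><a/>")
print(with_encoding)
print(without_encoding)
```

Both calls leave just `<a/>`. The trade-off is that a regex is less explicit than the fixed strings above, but it should not need updating if the declaration changes again.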
