Changed XML declaration? #1

Open
gc5011 opened this issue Feb 10, 2021 · 0 comments
gc5011 commented Feb 10, 2021

Amazing work. Finding these functions saved me a lot of time and effort - thanks so much for making these available.

Just one small issue: when using efetch, my XML results have a declaration that includes the encoding, e.g.

<?xml version="1.0" encoding="UTF-8"?>

When running clean_api_xml on these results, this XML declaration is not matched and so is not removed. The function then prepends its own declaration, and the returned XML ends up with two XML declarations.
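To make the mismatch concrete, here is a minimal sketch. The sample payload `xml_in` is made up for illustration, and `old_pattern` assumes the function only strips the declaration form without an encoding attribute; it is not output from a real efetch call.

```r
# Hypothetical sample: a tiny efetch-style payload whose declaration carries an encoding.
xml_in <- "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<PubmedArticleSet></PubmedArticleSet>"

# Assumed existing pattern: a fixed string that only matches the declaration without encoding.
old_pattern <- "<?xml version=\"1.0\" ?>"
unchanged <- gsub(old_pattern, "", xml_in, fixed = TRUE)

# The fixed-string match fails, so the input comes back untouched.
identical(unchanged, xml_in)

# Prepending a fresh declaration, as the function does, leaves two of them.
rewrapped <- paste("<?xml version=\"1.0\" encoding=\"UTF-8\"?>", unchanged, sep = "\n")
n_decl <- lengths(regmatches(rewrapped, gregexpr("<?xml", rewrapped, fixed = TRUE)))
n_decl
```

Because `fixed = TRUE` compares the pattern character for character, any extra attribute in the declaration is enough to make the match fail.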

The following adds another gsub to handle cases where the encoding is included in the XML declaration. It also allows clean_api_xml to work on XML results held in memory, without needing to read from or write to disk.

## Clean PubMed XML returned by either the reutils or rentrez packages.
## With usedisk = TRUE, infile is a file path and the cleaned XML is also written to outfile;
## otherwise infile is the XML itself, held in memory. The cleaned XML is returned either way.
clean_api_xml <- function(infile, outfile = NULL, usedisk = FALSE) {
	if (isTRUE(usedisk)) {
		theData <- readChar(infile, file.info(infile)$size, useBytes = TRUE)
	} else {
		theData <- as.character(infile)
	}
	## strip the existing XML declaration, DOCTYPE and root tags so they are not duplicated below
	theData <- gsub("<?xml version=\"1.0\" encoding=\"UTF-8\"?>", "", theData, fixed = TRUE) # efetch xml results sometimes have encoding in the xml declaration
	theData <- gsub("<?xml version=\"1.0\" ?>", "", theData, fixed = TRUE)
	theData <- gsub("<!DOCTYPE PubmedArticleSet PUBLIC \"-//NLM//DTD PubMedArticle, 1st January 2019//EN\" \"https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd\">", "", theData, fixed = TRUE, useBytes = TRUE)
	theData <- gsub("<PubmedArticleSet>", "", theData, fixed = TRUE)
	theData <- gsub("</PubmedArticleSet>", "", theData, fixed = TRUE)
	## remove literal <U+XXXX> escape sequences
	theData <- gsub("<U\\+\\w{4}>", "", theData) ## note: with some files this doesn't catch everything; potential issue with <OtherAbstract> tags especially
	## re-wrap the content in a single declaration, DOCTYPE and root element
	theData <- paste("<?xml version=\"1.0\" encoding=\"UTF-8\"?>", "<!DOCTYPE PubmedArticleSet PUBLIC \"-//NLM//DTD PubMedArticle, 1st January 2019//EN\" \"https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd\">", "<PubmedArticleSet>", theData, "</PubmedArticleSet>", sep = "\n")
	#theData <- paste(theData, "</PubmedArticleSet>")
	## drop any bytes that cannot be converted to UTF-8
	theData <- iconv(theData, to = "UTF-8", sub = "")
	if (isTRUE(usedisk)) {
		writeLines(theData, outfile, sep = " ")
	}
	return(theData)
}
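As a possible alternative to listing each declaration variant as its own fixed string, a single regular expression could strip any XML declaration regardless of its attributes. This is only a sketch: `strip_xml_declaration` is a hypothetical helper, not part of the package, and the sample strings are made up.

```r
# Hypothetical helper: remove any <?xml ... ?> declaration, whatever attributes it has.
strip_xml_declaration <- function(x) {
  gsub("<\\?xml[^>]*\\?>", "", x)
}

# Both declaration forms mentioned above reduce to the same content.
with_encoding    <- strip_xml_declaration("<?xml version=\"1.0\" encoding=\"UTF-8\"?><PubmedArticleSet/>")
without_encoding <- strip_xml_declaration("<?xml version=\"1.0\" ?><PubmedArticleSet/>")
identical(with_encoding, without_encoding)
```

The trade-off is that a regex is a little less explicit than the fixed strings, but it would not need updating if a result used a different encoding or attribute order.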