This executable report describes how data on publishers' global market shares were retrieved from the in-house Web of Science (WoS) database hosted by the German Competence Centre for Bibliometrics (KB). The document is written in R Markdown. If you have access to the WoS-KB data infrastructure, you can replicate the methods by calling:
require(rmarkdown)
require(tidyverse)
require(RJDBC)
require(rJava)
require(writexl)
rmarkdown::render("004_kb_fetch_publisher_global.Rmd")
Obtain publication data from the Web of Science database hosted by the German Competence Centre for Bibliometrics
# deal with rJava memory allocation
# https://stackoverflow.com/questions/34624002/r-error-java-lang-outofmemoryerror-java-heap-space
options(java.parameters = "-Xmx1024m")
require(tidyverse)
require(RJDBC)
require(rJava)
.jinit()
jdbcDriver <-
  JDBC(driverClass = "oracle.jdbc.OracleDriver", classPath = "../inst/jdbc_driver/ojdbc8.jar") # you may need to change the path to your db driver
jdbcConnection <-
  dbConnect(
    jdbcDriver,
    "jdbc:oracle:thin:@//biblio-p-db01:1521/bibliodb01.fiz.karlsruhe",
    # login credentials are stored in the .Rprofile file
    Sys.getenv("kb_user"),
    Sys.getenv("kb_pwd")
  )
select
  wos_b_2019.items.pubyear,
  wos_b_2019.issues.issn,
  count(distinct(ut_eid)) as pubs
from
  wos_b_2019.items
inner join
  wos_b_2019.databasecollection
  on wos_b_2019.databasecollection.fk_items = wos_b_2019.items.pk_items
inner join
  wos_b_2019.issues
  on wos_b_2019.issues.pk_issues = wos_b_2019.items.fk_issues
where
  wos_b_2019.databasecollection.edition_value in (
    'WOS.SCI', 'WOS.SSCI', 'WOS.AHCI'
  )
  and wos_b_2019.items.doctype in (
    'Article', 'Review'
  )
  and wos_b_2019.items.pubyear in (
    2014, 2015, 2016, 2017, 2018
  )
group by
  wos_b_2019.items.pubyear,
  wos_b_2019.issues.issn
head(rp_publisher) %>%
  knitr::kable()
PUBYEAR | ISSN | PUBS |
---|---|---|
2016 | 0045-6535 | 1645 |
2018 | 0013-4651 | 1527 |
2017 | 0016-5107 | 223 |
2018 | 0167-4544 | 418 |
2016 | 2163-0755 | 310 |
2018 | 0736-4679 | 325 |
# dump
write_csv(rp_publisher, "../data/publisher_league_14_18.csv")
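For readers without access to the KB infrastructure, the logic of the SQL query above can be tried on toy data. The following is a minimal sketch, assuming the `DBI` and `RSQLite` packages are installed; the in-memory SQLite database, the unqualified table names, and all rows are made up for illustration and only mirror the columns the query uses. It shows why `count(distinct ...)` is needed: an item indexed in more than one edition appears several times after the join.

```r
library(DBI)

# In-memory SQLite database standing in for the Oracle instance (assumption)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Toy tables that mirror only the columns used in the query
dbWriteTable(con, "items", data.frame(
  pk_items  = 1:4,
  fk_issues = c(10L, 10L, 11L, 10L),
  ut_eid    = c("A", "B", "C", "D"),
  pubyear   = c(2016L, 2016L, 2017L, 2016L),
  doctype   = c("Article", "Review", "Article", "Letter"),
  stringsAsFactors = FALSE
))
dbWriteTable(con, "issues", data.frame(
  pk_issues = c(10L, 11L),
  issn      = c("0045-6535", "0016-5107"),
  stringsAsFactors = FALSE
))
# Item 1 is indexed in two editions, so the join yields two rows for it
dbWriteTable(con, "databasecollection", data.frame(
  fk_items      = c(1L, 1L, 2L, 3L, 4L),
  edition_value = c("WOS.SCI", "WOS.SSCI", "WOS.SCI", "WOS.SCI", "WOS.SCI"),
  stringsAsFactors = FALSE
))

toy_counts <- dbGetQuery(con, "
  select items.pubyear, issues.issn, count(distinct ut_eid) as pubs
  from items
  inner join databasecollection
    on databasecollection.fk_items = items.pk_items
  inner join issues
    on issues.pk_issues = items.fk_issues
  where databasecollection.edition_value in ('WOS.SCI', 'WOS.SSCI', 'WOS.AHCI')
    and items.doctype in ('Article', 'Review')
    and items.pubyear in (2014, 2015, 2016, 2017, 2018)
  group by items.pubyear, issues.issn
  order by items.pubyear, issues.issn
")
dbDisconnect(con)

# The 2016 journal gets 2 publications (items A and B); the letter is excluded.
# A plain count(*) would report 3 because item A joins to two editions.
toy_counts
```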
Publisher names used in the Web of Science are ambiguous, and journals sometimes
change publisher. To obtain the most frequent publisher and journal names,
Crossref was queried for every distinct journal by ISSN, as described in
fetch_data.R.
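The script fetch_data.R is not reproduced in this report, so the following is only a sketch of how a "most frequent publisher name per ISSN" could be derived once article-level metadata is available. The data frame `cr_articles`, its columns, and all values are hypothetical.

```r
library(dplyr)

# Hypothetical article-level Crossref metadata: one row per article
cr_articles <- tibble(
  issn      = c("1111-1111", "1111-1111", "1111-1111", "2222-2222"),
  publisher = c("Publisher A", "Publisher A", "Publisher B", "Publisher C")
)

cr_top <- cr_articles %>%
  # number of articles per ISSN and publisher name
  count(issn, publisher, name = "articles") %>%
  group_by(issn) %>%
  # keep the name with the most articles; ties are broken by row order
  slice_max(articles, n = 1, with_ties = FALSE) %>%
  ungroup()

cr_top
```

In this toy example the journal with two publisher names is assigned "Publisher A", the name attached to most of its articles.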
As a result, a matching table between Crossref and the Web of Science journals was created. In the following, this table is joined with the WoS data.
cr_journals <- readr::read_csv("../data/cr_wos.csv") %>%
  # merge springer nature brands
  mutate(publisher = ifelse(grepl("Springer", publisher, fixed = FALSE), "Springer Nature", publisher))
issn_l <- readr::read_tsv("../data/20190818.ISSN-to-ISSN-L.txt") %>%
  # manual fix Journal - American Water Works
  add_row(ISSN = "2164-4535", `ISSN-L` = "0003-150X")
rp_df <- rp_publisher %>%
  left_join(issn_l, by = "ISSN") %>%
  left_join(cr_journals, by = c(ISSN = "issn")) %>%
  distinct()
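A left join repeats rows whenever a key occurs more than once in the lookup table, and `distinct()` only removes rows that are identical in every column. Whether the real lookup tables contain such duplicates is not stated here; the sketch below uses made-up data to show the effect and a simple check one might run after the join.

```r
library(dplyr)

counts <- tibble(ISSN = c("1111-1111", "2222-2222"), PUBS = c(10, 5))

# Hypothetical lookup with a duplicated key: two titles for the same ISSN
lookup <- tibble(
  issn          = c("1111-1111", "1111-1111", "2222-2222"),
  journal_title = c("Title A", "Title A (old)", "Title B")
)

joined <- counts %>%
  left_join(lookup, by = c(ISSN = "issn")) %>%
  distinct()

# distinct() cannot collapse the repeated rows because the titles differ
total_before <- sum(counts$PUBS)  # 15
total_after  <- sum(joined$PUBS)  # 25

# Keys that occur more than once in the lookup table
dup_keys <- lookup %>%
  count(issn) %>%
  filter(n > 1)
```

Comparing the publication total before and after the join is a cheap way to detect inflated counts.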
Crossref indexed 90.35 % of Web of Science journals.
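The expression behind this figure is not shown in the rendered report. One way to compute such a coverage share, consistent with the `<NA>` row in the table that follows, is the proportion of distinct ISSN and publisher pairs with a non-missing publisher. The sketch below uses made-up data.

```r
library(dplyr)

# Toy stand-in for rp_df: the second journal has no Crossref match
toy <- tibble(
  ISSN      = c("1111-1111", "1111-1111", "2222-2222", "3333-3333"),
  PUBYEAR   = c(2016, 2017, 2016, 2016),
  publisher = c("Publisher A", "Publisher A", NA, "Publisher B")
)

coverage <- toy %>%
  # one row per journal and publisher, regardless of publication year
  distinct(ISSN, publisher) %>%
  summarise(prop = mean(!is.na(publisher)) * 100) %>%
  pull(prop)

round(coverage, 2)
```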
The following table shows the number of journals per publisher.
rp_df %>%
  distinct(ISSN, publisher) %>%
  count(publisher, sort = TRUE) %>%
  mutate(prop = n / sum(n) * 100)
## # A tibble: 1,186 x 3
## publisher n prop
## <chr> <int> <dbl>
## 1 Elsevier BV 1862 13.5
## 2 Springer Nature 1813 13.2
## 3 Informa UK Limited 1394 10.1
## 4 <NA> 1328 9.65
## 5 Wiley 1292 9.39
## 6 SAGE Publications 661 4.80
## 7 Oxford University Press (OUP) 328 2.38
## 8 Cambridge University Press (CUP) 301 2.19
## 9 Ovid Technologies (Wolters Kluwer Health) 213 1.55
## 10 Walter de Gruyter GmbH 201 1.46
## # … with 1,176 more rows
Next, the number and proportion of original articles and reviews published per publisher are presented:
rp_df %>%
  group_by(publisher) %>%
  summarise(n = sum(PUBS)) %>%
  mutate(prop = n / sum(n) * 100) %>%
  arrange(desc(prop))
## # A tibble: 1,186 x 3
## publisher n prop
## <chr> <dbl> <dbl>
## 1 Elsevier BV 1954416 24.0
## 2 Springer Nature 1137230 14.0
## 3 Wiley 712835 8.74
## 4 Informa UK Limited 415088 5.09
## 5 <NA> 346172 4.25
## 6 American Chemical Society (ACS) 219525 2.69
## 7 SAGE Publications 192969 2.37
## 8 Royal Society of Chemistry (RSC) 189669 2.33
## 9 Institute of Electrical and Electronics Engineers (IEEE) 179006 2.20
## 10 Oxford University Press (OUP) 177365 2.18
## # … with 1,176 more rows
Bruns et al. provide a matching table for fully open access journals, joining multiple sources into one dataset.
Bruns, A., Lenke, C., Schmidt, C., & Taubert, N. C. (2019). ISSN-Matching of Gold OA Journals (ISSN-GOLD-OA) 3.0. Bielefeld University. doi:[10.4119/unibi/2934907](https://doi.org/10.4119/unibi/2934907)
The linking ISSN is used for matching.
u <- "https://pub.uni-bielefeld.de/download/2934907/2934908/ISSN_Gold-OA_3.0.csv"
bie_oa <- readr::read_csv(u) %>%
  # remove entries with missing ISSN_L
  filter(!is.na(ISSN_L))
# add info to rp_df
rp_df_oa <- rp_df %>%
  mutate(oa_journal = `ISSN-L` %in% bie_oa$`ISSN_L`)
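Filtering out missing linking ISSNs before matching matters because `%in%` treats `NA` as a value that can match another `NA`. A short illustration with made-up vectors:

```r
# Toy vectors; the values are made up for illustration
journal_issn_l <- c("1111-1111", NA, "2222-2222")
oa_list_raw    <- c("2222-2222", NA)

# Without filtering, the journal with a missing ISSN-L matches the NA entry
flag_raw <- journal_issn_l %in% oa_list_raw
flag_raw
# FALSE  TRUE  TRUE

# After dropping NA from the list, only the true match remains
oa_list_clean <- oa_list_raw[!is.na(oa_list_raw)]
flag_clean <- journal_issn_l %in% oa_list_clean
flag_clean
# FALSE FALSE  TRUE
```

Without the filter, every journal lacking a linking ISSN would be flagged as open access if the list contained a single missing entry.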
The following table shows the number and proportion of open access journals:
rp_df_oa %>%
  distinct(`ISSN-L`, oa_journal) %>%
  count(oa_journal) %>%
  mutate(prop = n / sum(n))
## # A tibble: 2 x 3
## oa_journal n prop
## <lgl> <int> <dbl>
## 1 FALSE 11609 0.854
## 2 TRUE 1981 0.146
CSV-based dump
rp_df_oa %>%
  select(
    issn_wos = ISSN,
    publication_year = PUBYEAR,
    articles = PUBS,
    journal_title, publisher, oa_journal,
    issn_l = `ISSN-L`
  ) %>%
  write_csv("../data/journal_publisher_14_18.csv")
Dump to Excel
readr::read_csv("../data/journal_publisher_14_18.csv") %>%
  writexl::write_xlsx("../data/journal_publisher_14_18.xlsx")
Data Schema:
Variable | Description | Source |
---|---|---|
issn_wos | ISSN, a standardized journal identifier. | KB Web of Science: wos_b_2019.issues.issn |
publication_year | Year of publication. | KB Web of Science: wos_b_2019.items.pubyear |
articles | Number of original articles and reviews published. | KB Web of Science: Grouped counts over wos_b_2019.issues.issn and wos_b_2019.items.pubyear |
journal_title | Most frequently used journal title in terms of articles published between 2014 and 2018. If missing, the journal was not indexed in Crossref. | Crossref |
publisher | Most frequently used publisher name in terms of articles published between 2014 and 2018. If missing, the journal was not indexed in Crossref. | Crossref |
oa_journal | Does the journal publish all articles open access without delay (full open access)? | Bielefeld GOLD OA List V3 |
issn_l | Linking ISSN, a journal identifier that groups the different media of the same serial publication, e.g. the print ISSN with the electronic ISSN. | CIEPS |
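When reading the CSV dump back into R, the schema can be made explicit so that ISSNs stay character strings and the open access flag is parsed as a logical. The sketch below writes a small made-up file to a temporary path and reads it with column types that follow the table above; for the real file, only the path would change.

```r
library(readr)

# Two-row toy file that follows the schema (all values are made up)
tmp <- tempfile(fileext = ".csv")
write_lines(c(
  "issn_wos,publication_year,articles,journal_title,publisher,oa_journal,issn_l",
  "1111-1111,2016,10,Journal A,Publisher A,FALSE,1111-1111",
  "2222-2222,2017,5,,,TRUE,2222-2222"
), tmp)

journals <- read_csv(tmp, col_types = cols(
  issn_wos         = col_character(),
  publication_year = col_integer(),
  articles         = col_integer(),
  journal_title    = col_character(),
  publisher        = col_character(),
  oa_journal       = col_logical(),
  issn_l           = col_character()
))

# Empty title and publisher fields are read as NA,
# i.e. the journal was not indexed in Crossref
journals
```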