Fragments by Underwood Sellers
Assumptions/hypotheses made in the figshare article that could be tested, or rather: do we see evidence in the code that this happened?
This is based on the figshare, because…
In the final publication the methodological section is removed and no mention in the body text is made of code or open process etc. That is relegated to a footnote on page 326: "5 We used the logistic regression functions from Pedregosa et al. 2011 and visualized results using Wickham 2009. For our own code, see Underwood 2015."
That is very interesting: we see two levels of 'hiding' (need better word) of the method(ological) aspects. First, there are remarks like the eye-roll in the code, which means that certain things aren't spelled out in the figshare version/article. Second, there are the marked changes between figshare and publication: there's no methodology section, and no references to the repo, the data in the repo, or the code (apart from the bibliographic reference). Of course, we don't know whether that's the authors, editors, or reviewers.
- Peer review problem (lit. background)
- Approach (what is defactoring?)
- also (I think) levels of review: content/literature-theoretical (no), methodological (yes), code (yes)
- Overview of methodological clues and hooks in figshare
- Defactoring (actual notebook) + interweave methodological 'clues' with notebook
- Discussion of marked differences between code, figshare, publication, on the methodological level
- Discussion of defactoring and this type of peer review as a whole
-
If we choose venues of review that turn out to have no selection standards in common, the works in our reviewed sample won’t share many characteristics, and it will be hard to distinguish them from the random sample. (3)
-
Likewise, if it turns out that the nature of literary prestige changed completely over the course of this century, we will discover that it’s impossible to model the century as a single unit. (3)
-
They (want to) distinguish 4 periods of 20 years between 1840-1919. However: "the notion that standards would remain stable even for twenty years was not an assumption we made: it was a hypothesis we set out to test."
Assumption
being reviewed at all is a mark of literary prestige. They select journals that are selective by default (i.e.: being selected for these is a social marker for literariness). (3) "[W]e winnowed that list by choosing journals that seemed especially selective in their literary reviewing." [ JZ_20161214_1557: Hmm.. it is of course debatable what they then selected as a sample. They might have mistaken the capacity of editorial boards for literary judgement. Controlling for this type of co-variable is however not what we are after(?); also(?): they list their samples on page 5, are there reasons to question the longitudinal spread? ]
"We also needed a sample that would be likely to contain books reviewed less often." They sample from Hathi Trust's 1820-1919 corpus (758,400 books in English), about 53,200 poetry containing books.
Can we discover whether there’s any relationship between poetic language and reception? We test models’ accounts of that relationship by asking them to make predictions about volumes they haven’t yet seen. (6)
They gathered 360 reviewed and 360 random volumes.
Avoid circularity: do not train on authors that are tested. Therefore they say they need 636 models, each one leaving out one of the authors in the corpus. (Hence there are 636 authors in the 720-volume corpus.)
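To make that leave-one-author-out setup concrete, here is a minimal sketch of the cross-validation scheme, assuming synthetic placeholder data (feature matrix, labels, author ids); the actual paceofchange code may organize this loop quite differently.

```python
# Minimal sketch of leave-one-author-out cross-validation.
# X, y and authors are synthetic placeholders for the real corpus.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_volumes, n_features = 720, 3200
X = rng.random((n_volumes, n_features))      # word frequencies (placeholder)
y = rng.integers(0, 2, n_volumes)            # 1 = reviewed, 0 = random sample
authors = np.arange(n_volumes) % 636         # 636 author ids over 720 volumes (placeholder)

predictions = np.empty(n_volumes)
for author in np.unique(authors):            # one model per held-out author: 636 models
    test = authors == author
    model = LogisticRegression(max_iter=1000).fit(X[~test], y[~test])
    predictions[test] = model.predict_proba(X[test])[:, 1]
```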
Can we reproduce the result?
Including:
77.5% right when a flat 50% probability threshold divides right from wrong.
79.2% if publication date is taken into the equation as a factor (a sloping, 'skewed' decision line).
(i.e. are those calculations right?)
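One way to check those two figures would be something like the sketch below: accuracy at a flat 0.5 probability cutoff versus accuracy when the cutoff is allowed to tilt with publication date. The inputs are synthetic, and fitting a second logistic regression on probability plus year is only one guess at how the 'skewed line' could be operationalized, not necessarily what the authors did.

```python
# Sketch of the two accuracy figures questioned above: a flat 0.5 cutoff on the
# per-volume probabilities, versus a date-dependent ("skewed") boundary.
# All inputs are synthetic placeholders for the model output and metadata.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_volumes = 720
y = rng.integers(0, 2, n_volumes)                          # 1 = reviewed, 0 = random
probs = 0.35 * y + 0.65 * rng.random(n_volumes)            # fake per-volume probabilities
years = rng.integers(1840, 1920, n_volumes)                # publication year

flat_accuracy = np.mean((probs > 0.5) == y)                # the 77.5%-style figure

features = np.column_stack([probs, years - years.mean()])  # centre year for the solver
skewed = LogisticRegression(max_iter=1000).fit(features, y)
skewed_accuracy = skewed.score(features, y)                # the 79.2%-style figure

print(f"flat 0.5 cutoff:   {flat_accuracy:.3f}")
print(f"date-aware cutoff: {skewed_accuracy:.3f}")
```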
Is the skewing interesting to look at from a methodological point of view? Are there alternative explanations? One might be: there are simply more words later in the century that qualify as 'reviewed'? A growing vocabulary? A growing sample?
Continuity aspect
But "models trained only on a quarter-century of the evidence are still right (on average) about 76.8% of the volumes in the whole dataset. That’s already strong evidence for continuity." (11) Can we find evidence for this in the code, was it tested?
Data aspect
"The model we’re training here uses 3200 variables — the frequencies of the 3200 words most common in the collection." (13)
Can we concur with an inference such as:
"For instance, although the model generally loves gloom, it does roll its eyes at heavy-handed signals like “bleak” and “dire.”" (15)
Lacks methodological accountability
"Nothing about the modeling process itself compels this chronological pattern to appear. We don’t see a strongly-marked, consistent tilt if we model other social boundaries, like authorial gender." (18) "A model is a simplified representation of the world: the whole point is to leave some things out. Problems arise, however, when a social variable is not simply left out of a model, but used in unacknowledged ways to shape the model’s conclusions. For instance, if poems by women rarely got reviewed, and women also disproportionately used a particular vocabulary — say, a language of sentiment we’ve seen the model reject (“homes”, first-person plural, etc.) — then our model might be confounding literary prestige with gender. […] In the case of gender, we’ve checked carefully and can say confidently that we’re not seeing large distortions." (21–22)
Also lacking methodological accountability
"nationality probably is a confounding factor in this model, although it doesn’t by any means explain away all the effects we’re observing."
Selection
They argue how they selected journals and explain at length why they threw out Tait's Edinburgh Magazine (it reviewed much more, and more indiscriminately, than the others). [ JZ_20161215_1135: Hmm... isn't that rather biasing your data? ]. There's an eerie clue in the last part of this discussion: "But it's a debatable choice, which is why we've explained it at length here. In particular, taking out Tait's made it difficult to achieve our original plan of distributing volumes evenly across the timeline; coverage now gets less dense as you move backward in time."
Sampling and data preparation
Seems to have been done responsibly and sensibly. The random sample from Hathi Trust was cleaned of front and back matter etc. based on automated genre metadata added by the Trust. This metadata is 97% correct. They don't seem to correct for that. So how does this influence the data and analysis? (33)
Training and interpreting models
They used "regularized logistic regression" (34) "we’ve limited our interpretive flexibility here by choosing a constant, and a number of features, that maximized predictive accuracy on the data" (34)
"A regression model with 3200 variables is not guaranteed to be transparent. The coefficient assigned to a word tells you only how variation in that word’s frequency will affect a prediction. It’s not necessarily a measure of statistical significance, because variables can interact in odd ways. A group of strongly predictive words that always appeared together could end up with small coefficients because predictive weight got “shared” across the group. For this reason, among others, it’s risky to place a lot of interpretive emphasis on single words that happen to be near the top or bottom of a list; instead, we’ve tried to emphasize broad patterns." (35)
"In training the model we “normalize” word frequencies by the standard deviation for each word (across the whole dataset). So when we use the model to illuminate specific passages, we also divide coefficients by the standard deviation. This tells us, roughly, how much a single occurrence of a given word would affect the model’s prediction, which is what we’re trying to dramatize when we quote a passage. We’ve rendered words in red if they’re in the top 1300 features by this metric, and colored them blue if they’re in the bottom 1300. The main weakness of this strategy is that it understates the aggregate importance of common words. We mentioned that feminine pronouns contribute to a poem’s odds of being reviewed, and that first- person plural pronouns detract. But there are a number of other syntactic preferences latent in the model. A paratactic style is prestigious (“and,” “but,” “or”). The future tense is not. These rhetorical patterns are harder to interpret than the thematic patterns we’ve foregrounded, but they could be at least as important." (35)
Bibliographic data
"Bibliographic information about the volumes we used is available on Github at github.com/tedunderwood/paceofchange. That site also includes our model’s predictions for all volumes, and the weights it assigned to different words. We’ve also shared our code and raw word-frequency data for the volumes, so readers who have Python 3 (with the scikit-learn module) can replicate our results." (36)
--JZ_20161215_1249