Corrupted / blank page PDF downloads #145

Open
rossmounce opened this issue Dec 6, 2016 · 5 comments

Comments

@rossmounce
Member

Very bizarre. Getpapers appears to be downloading PDF files of the right size for me (they are not 0-byte files), but when I open them they are completely blank: the right number of pages, but every page is blank. Nor is it a problem with my local PDF viewing software: cloud PDF viewing services also show these PDF files as blank pages despite their MB file sizes.

I have zipped up the entire output project folder so you can inspect the files yourself (only 12 'hits' for the search): https://github.com/rossmounce/tmpfilestorage/raw/master/testaardvark.zip

ross@ross-envy:~/workspace/contentmine/teststuff$ node --version
v4.0.0
ross@ross-envy:~/workspace/contentmine/teststuff$ npm version
{ npm: '3.10.8',
  ares: '1.10.1-DEV',
  http_parser: '2.5.0',
  modules: '46',
  node: '4.0.0',
  openssl: '1.0.2d',
  uv: '1.7.3',
  v8: '4.5.103.30',
  zlib: '1.2.8' }
ross@ross-envy:~/workspace/contentmine/teststuff$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 16.04 LTS
Release:	16.04
Codename:	xenial
ross@ross-envy:~/workspace/contentmine/teststuff$ getpapers -V
0.4.10
ross@ross-envy:~/workspace/contentmine/teststuff$ getpapers -q 'aardvark AND FIRST_PDATE:[2016-01-01 TO 2016-12-01]' -o testaardvark --pdf
info: Searching using eupmc API
info: Found 12 open access results
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
info: Individual EUPMC result metadata records written
info: Extracting fulltext HTML URL list (may not be available for all articles)
info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt
info: Downloading fulltext PDF files
Downloading files [=======================] 100% (12/12) [1.6s elapsed, eta 0.0]
info: All downloads succeeded!
ross@ross-envy:~/workspace/contentmine/teststuff$ tree testaardvark
testaardvark
├── eupmc_fulltext_html_urls.txt
├── eupmc_results.json
├── PMC4731086
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC4798954
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC4841245
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC4920337
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC4924314
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC4965448
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC4973251
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC4982594
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC5025827
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC5028775
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC5061548
│   ├── eupmc_result.json
│   └── fulltext.pdf
└── PMC5089389
    ├── eupmc_result.json
    └── fulltext.pdf

12 directories, 26 files

@petermr
Member

petermr commented Dec 6, 2016

Are they all from EPMC?
And is this true of all the files, or are some correct?

I have downloaded PMC4841245 and it gives a PDF of 38 MB which doesn't open.
So it looks like there is corruption somewhere.

@petermr
Member

petermr commented Dec 6, 2016

The header shows it to be a PDF:

"fulltext.pdf" may be a binary file.  See it anyway? 
%PDF-1.5
%����
28 0 obj
<<
/Length 3925      
/Filter /FlateDecode
>>
stream
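
A valid header alone does not show that the stream data survived the download. One possible cause of well-formed but blank PDFs (an assumption on my part, not something confirmed in this thread) is that the binary body was decoded as UTF-8 text somewhere and written back out, which turns every invalid byte into U+FFFD (bytes EF BF BD). The `�` characters in the dump above may simply be how the pager or browser rendered the binary comment bytes, so they prove nothing by themselves. A minimal sketch that counts those sequences in the raw file; `inspect_pdf_bytes` and `simulate_text_roundtrip` are hypothetical helper names, not part of getpapers:

```python
REPLACEMENT = b"\xef\xbf\xbd"  # UTF-8 encoding of U+FFFD, the replacement character


def inspect_pdf_bytes(data: bytes) -> dict:
    """Return simple integrity hints for a PDF held in memory."""
    return {
        "size": len(data),
        "has_pdf_header": data.startswith(b"%PDF-"),
        # The end-of-file marker normally sits in the last few bytes.
        "has_eof_marker": b"%%EOF" in data[-1024:],
        # A large count here hints at a lossy text round trip (assumption).
        "replacement_sequences": data.count(REPLACEMENT),
    }


def simulate_text_roundtrip(data: bytes) -> bytes:
    """Mimic the suspected bug: binary -> UTF-8 string -> binary."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


# Synthetic example: a tiny PDF-like byte string with a binary comment line.
sample = b"%PDF-1.5\n%\xd0\xd4\xc5\xd8\nstream\n\x9c\xff\x00\nendstream\n%%EOF\n"
print(inspect_pdf_bytes(sample))
print(inspect_pdf_bytes(simulate_text_roundtrip(sample)))
```

To try it on a real download, read the file with `open(path, "rb").read()` and pass the bytes to `inspect_pdf_bytes`. A handful of matches can occur by chance in compressed data; thousands of them would make the text round-trip explanation more plausible.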

@rossmounce
Member Author

rossmounce commented Dec 7, 2016

Are they all from EUPMC? Yes.

The header shows it to be a PDF: Yes.

Some of the PDFs open (for me, with evince) and show the correct number of pages: some have 3 pages, some have, say, 27. But every page is white/blank. Ordinarily I would assume that something is wrong with my local PDF viewing software, so I also tried viewing these getpapers-downloaded files in the cloud. The cloud software also "sees" them as blank pages, so I think the problem is real.

I have this problem on two independent machines too. Reproducible.
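
One way to pin this down further would be to compare a getpapers download byte-for-byte against a copy of the same PDF saved manually from a browser. The sketch below is only an illustration of that comparison; the function names and the two file paths in the usage note are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest, handy for a quick same-or-different check."""
    return hashlib.sha256(data).hexdigest()


def first_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if identical.

    If one input is a strict prefix of the other, the offset returned is
    the length of the shorter input.
    """
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def compare_files(path_a: str, path_b: str) -> dict:
    """Read both files in binary mode and summarise how they differ."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.read(), fb.read()
    return {
        "sizes": (len(a), len(b)),
        "same_hash": sha256_hex(a) == sha256_hex(b),
        "first_difference": first_difference(a, b),
    }
```

For example, `compare_files("testaardvark/PMC4841245/fulltext.pdf", "manual/PMC4841245.pdf")`, where the second path is a hypothetical browser-saved copy. If the first difference falls just after the `%PDF-` line and the getpapers copy is noticeably larger, that would point at the bytes being altered in transit rather than at the source.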

It's not just that specific query either.

Other EUPMC API queries (this one with just 3 open access hits) also give the same problem:

getpapers -q 'Gasteria AND FIRST_PDATE:[2015-01-01 TO 2016-08-20]' -o gasteria --pdf

The downloading of fulltext XML (--xml) and SI (--supp) is unaffected/working fine.

This bug also affects PDFs downloaded from the arXiv API. I tried both sample queries, and both return corrupted PDFs, all about the same size (~2.1 kB):

getpapers --api arxiv --query 'all:transcriptome' -o arxiv --pdf
getpapers --api arxiv --query 'au:"del maestro" AND ti:checkerboard' -o arxiv --pdf 
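
Files that all come out at roughly 2.1 kB might not be PDFs at all; they could, for instance, be a small error or redirect page saved under a `.pdf` name. That is a guess, not something verified here. A rough sketch for checking it: classify each download by its first bytes and group files by size. `sniff_kind` and `group_by_size` are hypothetical helper names.

```python
from collections import defaultdict


def sniff_kind(head: bytes) -> str:
    """Classify a download by its first bytes: 'pdf', 'html', 'empty' or 'unknown'."""
    stripped = head.lstrip()  # ignore leading ASCII whitespace
    if not stripped:
        return "empty"
    if stripped.startswith(b"%PDF-"):
        return "pdf"
    # Look for an HTML tag near the start, ignoring case.
    if b"<html" in stripped[:256].lower() or stripped[:256].lower().startswith(b"<!doctype html"):
        return "html"
    return "unknown"


def group_by_size(sizes: dict) -> dict:
    """Map each file size to the sorted list of file names that have it."""
    groups = defaultdict(list)
    for name, size in sizes.items():
        groups[size].append(name)
    return {size: sorted(names) for size, names in groups.items()}
```

To use it, read the first 512 bytes or so of each `fulltext.pdf` in binary mode and pass them to `sniff_kind`. Several files landing in the same `group_by_size` bucket and sniffing as `html` would suggest the same non-PDF response was saved each time.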

@rossmounce
Member Author

Just to say, I also appear to be getting blank-page PDFs with getpapers on Windows 8.1. This problem is not confined to Linux installations.

@sedimentation-fault

For those who still have this issue: take a look at #152 and the commit 99b93d8 that resolved it - it may help resolve this bug too. Please give it a try and post your experiences.
