-
Notifications
You must be signed in to change notification settings - Fork 0
samsmith/pdf2excel
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
license: BSD For programmers: - json of the raw parse is dumped into all the excel comment fields where there's stuff in the text - you can just ignore the actual content - if you're careful, you can, possibly, try getting with a full size bounding box for page 0 and it might work (for some PDFs), which means you can automate this in 2 get requests. We'll work on this simple. The code you get back is the MD5 hex checksum of the pdf you ask it for. -It requires the pdftohtml to be in /usr/local/bin (the one with the -xml option - pdf<i>2</i>html wont do), and the following perl modules: - use CGI qw/param/; - use LWP::Simple; - use File::Temp; - use Digest::MD5 qw/md5_hex/; - use XML::Simple; - use JSON qw/to_json/; - use Spreadsheet::WriteExcel; - Please don't hit this script with automated scrapers yet. At some point soon, I'll stick it on github so you can run it yourself once I'm happy it's actually substantially ok and passes the minimum stuff you need it to pass (at the moment, it has a couple of major missing bits). contact: [email protected]
About
PDF2excel converter
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published