Document scraper for getting invoices automagically as PDF (useful for taxes or a DMS)
🏠 Homepage
All settings can be changed via CLI flags or environment variables (even when using Docker).
Setting | Description | Default value |
---|---|---|
AMAZON_USERNAME | Your Amazon username | null |
AMAZON_PASSWORD | Your Amazon password | null |
AMAZON_TLD | Amazon top level domain | de |
AMAZON_YEAR_FILTER | Only extracts invoices from this year (e.g. 2023) | 2023 |
AMAZON_PAGE_FILTER | Only extracts invoices from this page (e.g. 2) | null |
ONLY_NEW | Tracks already scraped documents and starts a new run at the last scraped one | true |
FILE_DESTINATION_FOLDER | Destination path for all scraped documents | ./documents/ |
FILE_FALLBACK_EXTENSION | Fallback extension when no extension can be determined | .pdf |
DEBUG | Debug flag (sets the loglevel to DEBUG) | false |
SUBFOLDER_FOR_PAGES | Creates subfolders for every scraped page/plugin | false |
LOG_PATH | Sets the log path | ./logs/ |
LOG_LEVEL | Log level (see https://github.com/winstonjs/winston#logging-levels) | info |
RECURRING | Flag for executing the script periodically. Needs 'RECURRING_PATTERN' to be set. Defaults to true when using the Docker container | false |
RECURRING_PATTERN | Cron pattern to execute periodically. Needs RECURRING to be set to true | */30 * * * * |
TZ | Timezone used for Docker environments | Europe/Berlin |
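As a rough illustration of the table above, the settings could be exported as environment variables before starting a run. This is only a sketch: all values are placeholders, it assumes the locally installed CLI picks the variables up the same way the Docker image does, and the `command -v` guard is there only so the snippet does nothing when `docudigger` is not installed.

```shell
# Placeholder values; replace them with your own.
export AMAZON_USERNAME='you@example.com'
export AMAZON_PASSWORD='change-me'
export AMAZON_TLD='de'
export AMAZON_YEAR_FILTER='2024'
export ONLY_NEW='true'
export FILE_DESTINATION_FOLDER='./documents/'
export LOG_LEVEL='debug'

# Start a run only if the CLI is available on this machine.
if command -v docudigger >/dev/null 2>&1; then
  docudigger scrape all
fi
```

Settings that are not exported keep the default values listed in the table.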
npm install
$ npm install -g @disane-dev/docudigger
$ docudigger COMMAND
running command...
$ docudigger (--version)
@disane-dev/docudigger/2.0.7 linux-x64 node-v20.18.0
$ docudigger --help [COMMAND]
USAGE
$ docudigger COMMAND
...
Important
Don't forget to include --ignore-scripts in your install command.
Scrapes all websites periodically (default for docker environment)
USAGE
$ docudigger scrape all [--json] [--logLevel trace|debug|info|warn|error] [-d] [-l <value>] [-c <value> -r]
FLAGS
-c, --recurringCron=<value> [default: * * * * *] Cron pattern to execute periodically
-d, --debug
-l, --logPath=<value> [default: ./logs/] Log path
-r, --recurring
--logLevel=<option> [default: info] Specify level for logging.
<options: trace|debug|info|warn|error>
GLOBAL FLAGS
--json Format output as json.
DESCRIPTION
Scrapes all websites periodically
EXAMPLES
$ docudigger scrape all
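For illustration, a recurring run with a custom schedule could combine the flags listed above. The cron pattern (every day at 06:00) and the `CRON_PATTERN` variable name are arbitrary example choices, not values taken from the project.

```shell
# Example cron pattern: every day at 06:00 (illustrative value).
CRON_PATTERN='0 6 * * *'

if command -v docudigger >/dev/null 2>&1; then
  # -r enables recurring mode, -c sets the schedule, -l sets the log path.
  docudigger scrape all -r -c "$CRON_PATTERN" --logLevel debug -l ./logs/
else
  echo "docudigger is not installed; see the install section" >&2
fi
```

The usage line groups `-c` and `-r` together, so the sketch passes both.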
Used to get invoices from amazon
USAGE
$ docudigger scrape amazon -u <value> -p <value> [--json] [--logLevel trace|debug|info|warn|error] [-d] [-l
<value>] [-c <value> -r] [--fileDestinationFolder <value>] [--fileFallbackExentension <value>] [-t <value>]
[--yearFilter <value>] [--pageFilter <value>] [--onlyNew]
FLAGS
-c, --recurringCron=<value> [default: */30 * * * *] Cron pattern to execute periodically
-d, --debug
-l, --logPath=<value> [default: ./logs/] Log path
-p, --password=<value> (required) Password
-r, --recurring
-t, --tld=<value> [default: de] Amazon top level domain
-u, --username=<value> (required) Username
--fileDestinationFolder=<value> [default: ./data/] Destination path for all scraped documents
--fileFallbackExentension=<value> [default: .pdf] Fallback extension when no extension can be determined
--logLevel=<option> [default: info] Specify level for logging.
<options: trace|debug|info|warn|error>
--onlyNew Gets only new invoices
--pageFilter=<value> Filters a page
--yearFilter=<value> Filters a year
GLOBAL FLAGS
--json Format output as json.
DESCRIPTION
Used to get invoices from amazon
Scrapes amazon invoices
EXAMPLES
$ docudigger scrape amazon
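A fuller invocation might look like the sketch below, which uses only flags from the list above. The credentials and the `YEAR` variable are placeholders; the fallback values are assigned only when the variables are not already set in your environment.

```shell
# Use credentials from the environment, falling back to placeholders.
: "${AMAZON_USERNAME:=you@example.com}"
: "${AMAZON_PASSWORD:=change-me}"
YEAR='2024'

if command -v docudigger >/dev/null 2>&1; then
  # Fetch only new invoices from the given year on amazon.de.
  docudigger scrape amazon \
    -u "$AMAZON_USERNAME" \
    -p "$AMAZON_PASSWORD" \
    -t de \
    --yearFilter "$YEAR" \
    --onlyNew \
    --fileDestinationFolder ./documents/
fi
```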
docker run \
-e AMAZON_USERNAME='[YOUR MAIL]' \
-e AMAZON_PASSWORD='[YOUR PW]' \
-e AMAZON_TLD='de' \
-e AMAZON_YEAR_FILTER='2024' \
-e AMAZON_PAGE_FILTER='1' \
-e LOG_LEVEL='info' \
-v "C:/temp/docudigger/:/home/node/docudigger" \
ghcr.io/disane87/docudigger
npm install
[Change created .env for your needs]
npm run start
👤 Marco Franke
- Website: http://byte-style.de
- Github: @Disane87
- LinkedIn: @marco-franke-799399136
Contributions, issues and feature requests are welcome!
Feel free to check the issues page. You can also take a look at the contributing guide.
Give a ⭐️ if this project helped you!
This README was generated with ❤️ by readme-md-generator