帮助文档

概述

这是一个用来处理PDF文件的命令行工具，其目标是无缝地将PDF集成到各种笔记工作流中。目前，该工具提供以下功能：

*管理目录*：从PDF导出目录到纯文本列表格式，并在修改后将其导入回PDF。该格式用户易读，方便修改，可以在此处查看。
注释管理
- *导出格式化的文本注释*：从PDF中提取如高亮、文本、方框等类型的注释，存储相关的图像，并支持从这些图像中OCR提取文本。使用 Mako 模板，您可以自由定制文本格式，导入到您喜欢的笔记系统中。
- *管理XFDF注释*：从PDF中导出XFDF注释并将其导入回PDF。该文件与XChange等PDF阅读器兼容。由于某些OCR软件在OCR过程中会压平注释，因此可以在OCR之前导出XFDF，然后在OCR之后导入XFDF以保留完整的注释。
- *删除注释*：方便分享原始PDF。
*页面标签和页码的转换*：有时笔记中记录的是页码，但阅读器需要基于页面标签进行导航。此功能处理此类差异。

开始

git clone --depth 1 https://github.com/yuchen-lea/pdfhelper.git
cd pdfhelper
make install
pip install -r requirements.txt

然后开始使用 pdfhelper 命令行工具吧！

对于后续的更新，只需从git拉取最新的代码并运行=make update=。

使用方法

pdfhelper -h

Some useful functions to process a PDF file.

positional arguments: {export-toc,import-toc,delete-annot,export-xfdf-annot,import-xfdf-annot,export-annot,export-info,import-info,page-label-to-number,page-number-to-label} export-toc Export the TOC of the PDF. import-toc Import TOC from a file into the PDF. delete-annot Delete annotations from the PDF. export-xfdf-annot Export XFDF annotations of the PDF. import-xfdf-annot Import XFDF annotations of the PDF. export-annot Export formatted text annotations of the PDF. export-info Export information of the PDF. import-info Import information of the PDF. page-label-to-number Convert page label to page number. page-number-to-label Convert page number to page label. INFILE PDF file to process

options: -h, –help show this help message and exit –version, -v show program’s version number and exit

目录格式

实例：

@label 1=A
@label 8=[p-]II
@label 16=1
- Cover#1
- The Ten Commandments#2
- The Five Rules#3
- Contents#8
- Foreword#10
- Preface#12
# first = 16
- 1. Toys#2
- 2. Do It, Do It Again, and Again, and Again ...#14
- 3. Cons the Magnificent#32
# +2
- 4. Numbers Games#58

共有以下几种设置方式：

设置页码: 为第3页设置目录“The Five Rules”
```
- The Five Rules#3
    
```
- 🙋‍ 列表缩进即目录的缩进
设置起始页: 为第17页设置目录“1. Toys” （因为首页从16开始，所以2指向物理第17页）
```
# 1 = 16
- 1. Toys#2
    
```
- 🙋‍注意：使用模式“# number1=number2”可以将PDF的物理页面number2视为number1。任何后续设置为’x’的页面编号实际上将指向=x+number2-number1=的物理页面。这同样可以设置第二卷的起始页，例如“# 250=5”。
设置页面 gap: 为第75页（58 + (16-1) + 2）设置目录“4. Numbers Games”
```
# +2
- 4. Numbers Games#58
    
```
- 当有缺失或额外的页面时很有用。在缺失页面的位置（例如，计入页码的空白页已被删除），设置“# -[缺失的页面数]”。在额外页面的位置（例如，插图页面未计入页码），设置“# +[添加的页面数]”。
设置label：
- 从第1页开始，页码样式为大写字母，第1页的页码会显示为 “A”，第2页为 “B”，依此类推，直到第7页。
- 从第8页开始，页码样式切换为带有前缀 “p-” 的大写罗马数字。第8页的页码会显示为 “p-II”，第9页为 “p-III”，依此类推，直到第15页。
- 从第16页开始，第16页的页码会显示为 “1”，第17页为 “2”，依此类推，直到文档结束。

导出注释

目前，支持以下注释类型：

Type	Result
便签 Text	comment
文本 FreeText	comment
方框 Square	comment + picture (set the zoom factor by `--image-zoom`) + text (extract from the PDF, or use the `--ocr-service` and `--ocr-language` to recognize text within images.)
高亮 Highlight	comment + text (extract from the PDF)
下划线 Underline	comment + text (extract from the PDF)
波浪线 Squiggly	comment + text (extract from the PDF)
删除线 StrikeOut	comment + text (extract from the PDF)
手写 Ink	comment + picture (保存文档中标记高度内的内容，而不仅仅是标记本身。 set the zoom factor by `--image-zoom`) + text (extract from the PDF, or use the `--ocr-service` and `--ocr-language` to recognize text within images.)
线条/箭头 Line	comment + picture (保存文档中标记高度内的内容，而不仅仅是标记本身。 set the zoom factor by `--image-zoom`) + text (extract from the PDF, or use the `--ocr-service` and `--ocr-language` to recognize text within images.)

You can customize the note format by

--with-toc
--toc-list-item-format
--annot-list-item-format

Credits

此项目受到以下工具的启发：

0xabu/pdfannots: Extracts and formats text annotations from a PDF file: based on pdfminer and format as markdown text. It deals with hyphens but donot extract rectangle annot.
PDFPatcher(Chinese) a great pdf utility tool.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README_CN.org

README_CN.org

帮助文档

概述

开始

使用方法

目录格式

导出注释

Credits

Files

README_CN.org

Latest commit

History

README_CN.org

File metadata and controls

帮助文档

概述

开始

使用方法

目录格式

导出注释

Credits