Add Widget Support in Method "Document.insert_pdf"
We previously omitted form fields in source PDFs when merging PDFs via "target.insert_pdf(source)".
This feature has frequently been requested.
This fix now adds the feature as an optional category of page objects, alongside the already supported annotations and links.
JorjMcKie committed Jan 26, 2025
1 parent 518c213 commit 6c8ef05
Showing 8 changed files with 196 additions and 37 deletions.
14 changes: 7 additions & 7 deletions docs/document.rst
@@ -1155,7 +1155,7 @@ For details on **embedded files** refer to Appendix 3.

Please consider that annotations are complex objects and may consist of more data "underneath" their visual appearance. Examples are "Text" and "FileAttachment" annotations. When "baking in" annotations / widgets with this method, all this underlying information (attached files, comments, associated PopUp annotations, etc.) will be lost and be removed on next garbage collection.

Use this feature for instance for methods :meth:`Document.insert_pdf` (which supports no copying of widgets) or :meth:`Page.show_pdf_page` (which supports neither annotations nor widgets) when the source pages should look exactly the same in the target.
Use this feature for instance for :meth:`Page.show_pdf_page` (which supports neither annotations nor widgets) when the source pages should look exactly the same in the target.


:arg bool annots: convert annotations.
@@ -1293,13 +1293,12 @@ For details on **embedded files** refer to Appendix 3.
pair: rotate; Document.insert_pdf
pair: links; Document.insert_pdf
pair: annots; Document.insert_pdf
pair: widgets; Document.insert_pdf
pair: show_progress; Document.insert_pdf

.. method:: insert_pdf(docsrc, from_page=-1, to_page=-1, start_at=-1, rotate=-1, links=True, annots=True, show_progress=0, final=1)
.. method:: insert_pdf(docsrc, from_page=-1, to_page=-1, start_at=-1, rotate=-1, links=True, annots=True, widgets=True, show_progress=0, final=1)

* Changed in v1.19.3 - as a fix to issue `#537 <https://github.com/pymupdf/PyMuPDF/issues/537>`_, form fields are always excluded.

PDF only: Copy the page range **[from_page, to_page]** (including both) of PDF document *docsrc* into the current one. Inserts will start with page number *start_at*. Value -1 indicates default values. All pages thus copied will be rotated as specified. Links and annotations can be excluded in the target, see below. All page numbers are 0-based.
PDF only: Copy the page range **[from_page, to_page]** (including both) of PDF document *docsrc* into the current one. Inserts will start with page number *start_at*. Value -1 indicates default values. All pages thus copied will be rotated as specified. Links, annotations and widgets can be excluded in the target, see below. All page numbers are 0-based.

:arg docsrc: An opened PDF *Document* which must not be the current document. However, it may refer to the same underlying file.
:type docsrc: *Document*
@@ -1313,13 +1312,14 @@ For details on **embedded files** refer to Appendix 3.
:arg int rotate: All copied pages will be rotated by the provided value (degrees, integer multiple of 90).

:arg bool links: Choose whether (internal and external) links should be included in the copy. Default is `True`. *Named* links (:data:`LINK_NAMED`) and internal links to outside the copied page range are **always excluded**.
:arg bool annots: *(new in v1.16.1)* choose whether annotations should be included in the copy. Form **fields can never be copied** -- see below.
:arg bool annots: choose whether annotations should be included in the copy.
:arg bool widgets: choose whether form fields (widgets) should be included in the copy. If `True` and at least one of the source pages contains form fields, the target PDF will be turned into a Form PDF (if it is not one already).
:arg int show_progress: *(new in v1.17.7)* specify an interval size greater zero to see progress messages on `sys.stdout`. After each interval, a message like `Inserted 30 of 47 pages.` will be printed.
:arg int final: *(new in v1.18.0)* controls whether the list of already copied objects should be **dropped** after this method, default *True*. Set it to 0 except for the last one of multiple insertions from the same source PDF. This saves target file size and speeds up execution considerably.

.. note::

1. This is a page-based method. Document-level information of source documents is therefore ignored. Examples include Optional Content, Embedded Files, `StructureElem`, `AcroForm`, table of contents, page labels, metadata, named destinations (and other named entries) and some more. As a consequence, specifically, **Form Fields (widgets) can never be copied** -- although they seem to appear on pages only. Look at :meth:`Document.bake` for converting a source document if you need to retain at least widget **appearances.**
1. This is a page-based method. Document-level information of source documents is therefore mostly ignored. Examples include Optional Content, Embedded Files, `StructureElem`, table of contents, page labels, metadata, named destinations (and other named entries) and some more.

2. If `from_page > to_page`, pages will be **copied in reverse order**. If `0 <= from_page == to_page`, then one page will be copied.
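
The inclusive range and the reverse-order rule of note 2 can be illustrated with a small pure-Python sketch. The helper name `copy_order` is hypothetical and not part of PyMuPDF, and the sketch assumes both page numbers have already been resolved to valid 0-based values (no -1 defaults).

```python
def copy_order(from_page: int, to_page: int) -> list[int]:
    """Return source page numbers in the order they would be copied.

    Hypothetical helper for illustration only; both arguments are
    assumed to be valid 0-based page numbers.
    """
    if from_page <= to_page:
        # ascending, both ends included
        return list(range(from_page, to_page + 1))
    # from_page > to_page: descending, both ends included
    return list(range(from_page, to_page - 1, -1))


print(copy_order(2, 5))  # [2, 3, 4, 5]
print(copy_order(5, 2))  # [5, 4, 3, 2]
print(copy_order(3, 3))  # [3]
```

The same two `range` expressions appear in the new `do_widgets` helper of this commit, which suggests that widgets are matched to pages in the same order in which the pages were copied.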

2 changes: 1 addition & 1 deletion docs/the-basics.rst
@@ -198,7 +198,7 @@ With :meth:`Document.insert_file` you can invoke the method to merge :ref:`suppo

**Taking it further**

It is easy to join PDFs with :meth:`Document.insert_pdf` & :meth:`Document.insert_file`. Given open |PDF| documents, you can copy page ranges from one to the other. You can select the point where the copied pages should be placed, you can revert the page sequence and also change page rotation. This Wiki `article <https://github.com/pymupdf/PyMuPDF/wiki/Inserting-Pages-from-other-PDFs>`_ contains a full description.
It is easy to join PDFs with :meth:`Document.insert_pdf` & :meth:`Document.insert_file`. Given open |PDF| documents, you can copy page ranges from one to the other. You can select the point where the copied pages should be placed, you can revert the page sequence and also change page rotation.

The GUI script `join.py <https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/examples/join-documents/join.py>`_ uses this method to join a list of files while also joining the respective table of contents segments. It looks like this:

50 changes: 27 additions & 23 deletions src/__init__.py
@@ -4589,12 +4589,14 @@ def insert_file(self,
def insert_pdf(
self,
docsrc,
*,
from_page=-1,
to_page=-1,
start_at=-1,
rotate=-1,
links=1,
annots=1,
widgets=1,
show_progress=0,
final=1,
_gmap=None,
@@ -4609,6 +4611,7 @@ def insert_pdf(
rotate: (int) rotate copied pages, default -1 is no change.
links: (int/bool) whether to also copy links.
annots: (int/bool) whether to also copy annotations.
widgets: (int/bool) whether to also copy form fields.
show_progress: (int) progress message interval, 0 is no messages.
final: (bool) indicates last insertion from this source PDF.
_gmap: internal use only
@@ -4626,6 +4629,26 @@ def insert_pdf(
sa = start_at
if sa < 0:
sa = self.page_count
outCount = self.page_count
srcCount = docsrc.page_count

# local copies of page numbers
fp = from_page
tp = to_page
sa = start_at

# normalize page numbers
fp = max(fp, 0) # -1 = first page
fp = min(fp, srcCount - 1) # but do not exceed last page

if tp < 0:
tp = srcCount - 1 # -1 = last page
tp = min(tp, srcCount - 1) # but do not exceed last page

if sa < 0:
sa = outCount # -1 = behind last page
sa = min(sa, outCount) # but that is also the limit

if len(docsrc) > show_progress > 0:
inname = os.path.basename(docsrc.name)
if not inname:
@@ -4663,25 +4686,6 @@ def insert_pdf(
else:
pdfout = _as_pdf_document(self)
pdfsrc = _as_pdf_document(docsrc)
outCount = mupdf.fz_count_pages(self)
srcCount = mupdf.fz_count_pages(docsrc.this)

# local copies of page numbers
fp = from_page
tp = to_page
sa = start_at

# normalize page numbers
fp = max(fp, 0) # -1 = first page
fp = min(fp, srcCount - 1) # but do not exceed last page

if tp < 0:
tp = srcCount - 1 # -1 = last page
tp = min(tp, srcCount - 1) # but do not exceed last page

if sa < 0:
sa = outCount # -1 = behind last page
sa = min(sa, outCount) # but that is also the limit

if not pdfout.m_internal or not pdfsrc.m_internal:
raise TypeError( "source or target not a PDF")
@@ -4692,7 +4696,9 @@ def insert_pdf(
self._reset_page_refs()
if links:
#log( 'insert_pdf(): calling self._do_links()')
self._do_links(docsrc, from_page = from_page, to_page = to_page, start_at = sa)
self._do_links(docsrc, from_page=fp, to_page=tp, start_at=sa)
if widgets:
self._do_widgets(docsrc, _gmap, from_page=fp, to_page=tp, start_at=sa)
if final == 1:
self.Graftmaps[isrt] = None
#log( 'insert_pdf(): returning')
@@ -20150,9 +20156,6 @@ def page_merge(doc_des, doc_src, page_from, page_to, rotate, links, copy_annots,
continue
if mupdf.pdf_name_eq( subtype, PDF_NAME('Popup')):
continue
if mupdf.pdf_name_eq( subtype, PDF_NAME('Widget')):
mupdf.fz_warn( "skipping widget annotation")
continue
if mupdf.pdf_name_eq(subtype, PDF_NAME('Widget')):
continue
mupdf.pdf_dict_del( o, PDF_NAME('Popup'))
@@ -21295,6 +21298,7 @@ def _atexit():
Annot.get_textbox = utils.get_textbox

Document._do_links = utils.do_links
Document._do_widgets = utils.do_widgets
Document.del_toc_item = utils.del_toc_item
Document.get_char_widths = utils.get_char_widths
Document.get_oc = utils.get_oc
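
The page-number normalization that the `insert_pdf` hunks above relocate can be condensed into a single helper, shown here as a sketch on plain integers. `normalize_page_numbers` is a hypothetical name and not a PyMuPDF function; it only mirrors the clamping rules visible in the diff.

```python
def normalize_page_numbers(from_page, to_page, start_at, src_count, out_count):
    """Resolve -1 defaults and clamp page numbers.

    Hypothetical helper mirroring the rules in the diff:
    src_count = pages in the source, out_count = pages in the target.
    """
    # -1 means first page; never exceed the last source page
    fp = min(max(from_page, 0), src_count - 1)
    # -1 means last page; never exceed the last source page
    tp = src_count - 1 if to_page < 0 else min(to_page, src_count - 1)
    # -1 means "behind the last target page", which is also the upper limit
    sa = out_count if start_at < 0 else min(start_at, out_count)
    return fp, tp, sa


print(normalize_page_numbers(-1, -1, -1, src_count=4, out_count=10))  # (0, 3, 10)
print(normalize_page_numbers(2, 99, 50, src_count=4, out_count=10))   # (2, 3, 10)
```

Because `_do_links` and `_do_widgets` now receive the normalized values `fp`, `tp` and `sa`, both helpers work on the same pages that were actually copied, even when the caller relied on the -1 defaults.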
5 changes: 0 additions & 5 deletions src/extra.i
@@ -346,11 +346,6 @@ static void page_merge(
mupdf::PdfObj subtype = mupdf::pdf_dict_get(o, PDF_NAME(Subtype));
if (mupdf::pdf_name_eq(subtype, PDF_NAME(Link))) continue;
if (mupdf::pdf_name_eq(subtype, PDF_NAME(Popup))) continue;
if (mupdf::pdf_name_eq(subtype, PDF_NAME(Widget)))
{
mupdf::fz_warn("skipping widget annotation");
continue;
}
if (mupdf::pdf_name_eq(subtype, PDF_NAME(Widget))) continue;
mupdf::pdf_dict_del(o, PDF_NAME(Popup));
mupdf::pdf_dict_del(o, PDF_NAME(P));
127 changes: 126 additions & 1 deletion src/utils.py
@@ -1675,6 +1675,131 @@ def set_toc(
return toclen


def do_widgets(
    tar: pymupdf.Document,
    src: pymupdf.Document,
    graftmap,
    from_page: int = -1,
    to_page: int = -1,
    start_at: int = -1,
) -> None:
    """Insert widgets contained in copied page range into destination PDF.

    Parameter values **must** equal those of method insert_pdf(), which
    must have been executed previously.
    """
    if not src.is_form_pdf:  # nothing to do: source PDF has no fields
        return

    def get_acroform(doc):
        """Retrieve the AcroForm dictionary from a PDF."""
        pdf = mupdf.pdf_document_from_fz_document(doc)
        # AcroForm (= central form field info)
        return mupdf.pdf_dict_getp(mupdf.pdf_trailer(pdf), "Root/AcroForm")

    tarpdf = mupdf.pdf_document_from_fz_document(tar)
    srcpdf = mupdf.pdf_document_from_fz_document(src)

    if tar.is_form_pdf:
        # target is a Form PDF, so use its AcroForm to include source fields
        acro = get_acroform(tar)
        # Important arrays of indirect objects
        tar_fields = mupdf.pdf_dict_get(acro, pymupdf.PDF_NAME("Fields"))
        tar_co = mupdf.pdf_dict_get(acro, pymupdf.PDF_NAME("CO"))
        if not mupdf.pdf_is_array(tar_co):
            tar_co = mupdf.pdf_dict_put_array(acro, pymupdf.PDF_NAME("CO"), 5)
    else:
        # target is no Form PDF, so copy over source AcroForm
        acro = mupdf.pdf_deep_copy_obj(get_acroform(src))  # make a copy

        # Clear "Fields" and "CO" arrays: will be populated by page fields.
        # This is required to avoid copying unneeded objects.
        mupdf.pdf_dict_del(acro, pymupdf.PDF_NAME("Fields"))
        mupdf.pdf_dict_put_array(acro, pymupdf.PDF_NAME("Fields"), 5)
        mupdf.pdf_dict_del(acro, pymupdf.PDF_NAME("CO"))
        mupdf.pdf_dict_put_array(acro, pymupdf.PDF_NAME("CO"), 5)

        # Enrich AcroForm for copying to target
        acro_graft = mupdf.pdf_graft_mapped_object(graftmap, acro)

        # Insert AcroForm into target PDF
        acro_tar = mupdf.pdf_add_object(tarpdf, acro_graft)
        tar_fields = mupdf.pdf_dict_get(acro_tar, pymupdf.PDF_NAME("Fields"))
        tar_co = mupdf.pdf_dict_get(acro_tar, pymupdf.PDF_NAME("CO"))

        # get its xref and insert it into target catalog
        tar_xref = mupdf.pdf_to_num(acro_tar)
        acro_tar_ind = mupdf.pdf_new_indirect(tarpdf, tar_xref, 0)
        root = mupdf.pdf_dict_get(mupdf.pdf_trailer(tarpdf), pymupdf.PDF_NAME("Root"))
        mupdf.pdf_dict_put(root, pymupdf.PDF_NAME("AcroForm"), acro_tar_ind)

    if from_page <= to_page:
        src_range = range(from_page, to_page + 1)
    else:
        src_range = range(from_page, to_page - 1, -1)

    for i in range(len(src_range)):
        # read first page that was copied over
        tar_page = tar[start_at + i]

        # convert it to a formal PDF page
        tar_page_pdf = mupdf.pdf_page_from_fz_page(tar_page)

        # extract its annotations array
        tar_annots = mupdf.pdf_dict_get(tar_page_pdf.obj(), pymupdf.PDF_NAME("Annots"))
        if not mupdf.pdf_is_array(tar_annots):
            tar_annots = mupdf.pdf_dict_put_array(
                tar_page_pdf.obj(), pymupdf.PDF_NAME("Annots"), 5
            )

        # read the original page in the source PDF
        src_page = src[src_range[i]]

        # now walk through source page widgets and copy over
        w_xrefs = [  # widget xrefs of the source page
            xref
            for xref, wtype, _ in src_page.annot_xrefs()
            if wtype == pymupdf.PDF_ANNOT_WIDGET  # pylint: disable=no-member
        ]

        # Remove page references from widgets to prevent duplicate copies
        # of the page in the target.
        for xref in w_xrefs:
            w_obj = mupdf.pdf_load_object(srcpdf, xref)
            mupdf.pdf_dict_del(w_obj, pymupdf.PDF_NAME("P"))

        for xref in w_xrefs:
            w_obj = mupdf.pdf_load_object(srcpdf, xref)

            # check if field is a member of inter-field validations
            temp = mupdf.pdf_dict_getp(w_obj, "AA/C")
            if mupdf.pdf_is_dict(temp):
                is_aac = True
            else:
                is_aac = False

            # recursively complete the widget object with all referenced objects
            w_obj_graft = mupdf.pdf_graft_mapped_object(graftmap, w_obj)

            # add the completed widget object to the target PDF
            w_obj_tar = mupdf.pdf_add_object(tarpdf, w_obj_graft)

            # extract its generated target xref number
            tar_xref = mupdf.pdf_to_num(w_obj_tar)

            # create an indirect object from it
            w_obj_tar_ind = mupdf.pdf_new_indirect(tarpdf, tar_xref, 0)

            # insert this xref reference into the page,
            mupdf.pdf_array_push(tar_annots, w_obj_tar_ind)

            # and also into "AcroForm/Fields",
            mupdf.pdf_array_push(tar_fields, w_obj_tar_ind)
            # and also into "AcroForm/CO" if a computation field.
            if is_aac:
                mupdf.pdf_array_push(tar_co, w_obj_tar_ind)


def do_links(
doc1: pymupdf.Document,
doc2: pymupdf.Document,
@@ -5354,7 +5479,7 @@ def has_annots(doc: pymupdf.Document) -> bool:
for i in range(doc.page_count):
for item in doc.page_annot_xrefs(i):
# pylint: disable=no-member
if not (item[1] == pymupdf.PDF_ANNOT_LINK or item[1] == pymupdf.PDF_ANNOT_WIDGET):
if not (item[1] == pymupdf.PDF_ANNOT_LINK or item[1] == pymupdf.PDF_ANNOT_WIDGET): # pylint: disable=no-member
return True
return False
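
The per-widget bookkeeping in `do_widgets` above can be modelled with plain Python containers. This is a toy sketch under loud assumptions: widgets are dictionaries, the three PDF arrays are lists, and `copy_page_widgets` is a hypothetical name. Unlike the real code, it copies each dictionary instead of grafting PDF objects through a graft map.

```python
def copy_page_widgets(src_widgets, page_annots, fields, co):
    """Toy model: register each widget of one source page in the target.

    Hypothetical helper; dicts stand in for PDF widget objects and
    lists stand in for the "Annots", "Fields" and "CO" arrays.
    """
    for widget in src_widgets:
        # Drop the page back-reference "P" so the copy does not drag the
        # source page along (the real code deletes the key in the source).
        copied = {key: value for key, value in widget.items() if key != "P"}

        page_annots.append(copied)  # target page "Annots" array
        fields.append(copied)       # "AcroForm/Fields" array

        # A dictionary under "AA/C" marks a field taking part in calculations.
        actions = copied.get("AA")
        if isinstance(actions, dict) and isinstance(actions.get("C"), dict):
            co.append(copied)       # "AcroForm/CO" array


page_annots, fields, co = [], [{"T": "existing"}], []
src_widgets = [
    {"T": "price", "P": "page 0"},
    {"T": "total", "P": "page 0", "AA": {"C": {"JS": "..."}}},
]
copy_page_widgets(src_widgets, page_annots, fields, co)
print(len(page_annots), len(fields), len(co))  # 2 3 1
```

In this model every copied widget lands in both the page array and the fields array, while only calculation fields reach the third list. That is the same additive relationship that the new test `test_widget_insert` asserts on the real "Fields" and "CO" arrays.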

Binary file added tests/resources/cms-etc-filled.pdf
Binary file not shown.
Binary file added tests/resources/interfield-calculation.pdf
Binary file not shown.
35 changes: 35 additions & 0 deletions tests/test_insertpdf.py
@@ -186,3 +186,38 @@ def test_3789():
# If this is the last split file, exit the loop
if to_page == -1:
break


def test_widget_insert():
    """Confirm copy of form fields / widgets."""
    from pymupdf import mupdf
    tar = pymupdf.open(os.path.join(resources, "cms-etc-filled.pdf"))
    pc0 = tar.page_count  # for later assertion
    src = pymupdf.open(os.path.join(resources, "interfield-calculation.pdf"))
    pc1 = src.page_count  # for later assertion

    tarpdf = pymupdf._as_pdf_document(tar)
    tar_field_count = mupdf.pdf_array_len(
        mupdf.pdf_dict_getp(mupdf.pdf_trailer(tarpdf), "Root/AcroForm/Fields")
    )
    tar_co_count = mupdf.pdf_array_len(
        mupdf.pdf_dict_getp(mupdf.pdf_trailer(tarpdf), "Root/AcroForm/CO")
    )
    srcpdf = pymupdf._as_pdf_document(src)
    src_field_count = mupdf.pdf_array_len(
        mupdf.pdf_dict_getp(mupdf.pdf_trailer(srcpdf), "Root/AcroForm/Fields")
    )
    src_co_count = mupdf.pdf_array_len(
        mupdf.pdf_dict_getp(mupdf.pdf_trailer(srcpdf), "Root/AcroForm/CO")
    )

    tar.insert_pdf(src)
    new_field_count = mupdf.pdf_array_len(
        mupdf.pdf_dict_getp(mupdf.pdf_trailer(tarpdf), "Root/AcroForm/Fields")
    )
    new_co_count = mupdf.pdf_array_len(
        mupdf.pdf_dict_getp(mupdf.pdf_trailer(tarpdf), "Root/AcroForm/CO")
    )
    assert tar.page_count == pc0 + pc1
    assert new_field_count == tar_field_count + src_field_count
    assert new_co_count == tar_co_count + src_co_count
