Add Widget Support in Method "Document.insert_pdf"

We previously omitted form fields in source PDFs when merging PDFs via "target.insert_pdf(source)". This feature has frequently been requested. This fix now adds the feature as an optional category of page objects, alongside the already supported annotations and links.
pymupdf · Jan 26, 2025 · 6c8ef05 · 6c8ef05
1 parent 518c213
commit 6c8ef05

Showing 8 changed files with 196 additions and 37 deletions.
diff --git a/docs/document.rst b/docs/document.rst
@@ -1155,7 +1155,7 @@ For details on **embedded files** refer to Appendix 3.
 
     Please consider that annotations are complex objects and may consist of more data "underneath" their visual appearance. Examples are "Text" and "FileAttachment" annotations. When "baking in" annotations / widgets with this method, all this underlying information (attached files, comments, associated PopUp annotations, etc.) will be lost and be removed on next garbage collection.
 
-    Use this feature for instance for methods :meth:`Document.insert_pdf` (which supports no copying of widgets) or :meth:`Page.show_pdf_page` (which supports neither annotations nor widgets) when the source pages should look exactly the same in the target.
+    Use this feature for instance for :meth:`Page.show_pdf_page` (which supports neither annotations nor widgets) when the source pages should look exactly the same in the target.
 
 
     :arg bool annots: convert annotations.
@@ -1293,13 +1293,12 @@ For details on **embedded files** refer to Appendix 3.
      pair: rotate; Document.insert_pdf
      pair: links; Document.insert_pdf
      pair: annots; Document.insert_pdf
+     pair: widgets; Document.insert_pdf
      pair: show_progress; Document.insert_pdf
 
-  .. method:: insert_pdf(docsrc, from_page=-1, to_page=-1, start_at=-1, rotate=-1, links=True, annots=True, show_progress=0, final=1)
+  .. method:: insert_pdf(docsrc, from_page=-1, to_page=-1, start_at=-1, rotate=-1, links=True, annots=True, widgets=True, show_progress=0, final=1)
 
-    * Changed in v1.19.3 - as a fix to issue `#537 <https://github.com/pymupdf/PyMuPDF/issues/537>`_, form fields are always excluded.
-
-    PDF only: Copy the page range **[from_page, to_page]** (including both) of PDF document *docsrc* into the current one. Inserts will start with page number *start_at*. Value -1 indicates default values. All pages thus copied will be rotated as specified. Links and annotations can be excluded in the target, see below. All page numbers are 0-based.
+    PDF only: Copy the page range **[from_page, to_page]** (including both) of PDF document *docsrc* into the current one. Inserts will start with page number *start_at*. Value -1 indicates default values. All pages thus copied will be rotated as specified. Links, annotations and widgets can be excluded in the target, see below. All page numbers are 0-based.
 
     :arg docsrc: An opened PDF *Document* which must not be the current document. However, it may refer to the same underlying file.
     :type docsrc: *Document*
@@ -1313,13 +1312,14 @@ For details on **embedded files** refer to Appendix 3.
     :arg int rotate: All copied pages will be rotated by the provided value (degrees, integer multiple of 90).
 
     :arg bool links: Choose whether (internal and external) links should be included in the copy. Default is `True`. *Named* links (:data:`LINK_NAMED`) and internal links to outside the copied page range are **always excluded**. 
-    :arg bool annots: *(new in v1.16.1)* choose whether annotations should be included in the copy. Form **fields can never be copied** -- see below.
+    :arg bool annots: choose whether annotations should be included in the copy.
+    :arg bool widgets: choose whether form fields (widgets) should be included in the copy. If `True` and at least one of the source pages contains form fields, the target PDF will be turned into a Form PDF (if it is not one already).
     :arg int show_progress: *(new in v1.17.7)* specify an interval size greater zero to see progress messages on `sys.stdout`. After each interval, a message like `Inserted 30 of 47 pages.` will be printed.
     :arg int final: *(new in v1.18.0)* controls whether the list of already copied objects should be **dropped** after this method, default *True*. Set it to 0 except for the last one of multiple insertions from the same source PDF. This saves target file size and speeds up execution considerably.
 
   .. note::
 
-     1. This is a page-based method. Document-level information of source documents is therefore ignored. Examples include Optional Content, Embedded Files, `StructureElem`, `AcroForm`, table of contents, page labels, metadata, named destinations (and other named entries) and some more. As a consequence, specifically, **Form Fields (widgets) can never be copied** -- although they seem to appear on pages only. Look at :meth:`Document.bake` for converting a source document if you need to retain at least widget **appearances.**
+     1. This is a page-based method. Document-level information of source documents is therefore mostly ignored. Examples include Optional Content, Embedded Files, `StructureElem`, table of contents, page labels, metadata, named destinations (and other named entries) and some more.
 
      2. If `from_page > to_page`, pages will be **copied in reverse order**. If `0 <= from_page == to_page`, then one page will be copied.
 

diff --git a/docs/the-basics.rst b/docs/the-basics.rst
@@ -198,7 +198,7 @@ With :meth:`Document.insert_file` you can invoke the method to merge :ref:`suppo
 
     **Taking it further**
 
-    It is easy to join PDFs with :meth:`Document.insert_pdf` & :meth:`Document.insert_file`. Given open |PDF| documents, you can copy page ranges from one to the other. You can select the point where the copied pages should be placed, you can revert the page sequence and also change page rotation. This Wiki `article <https://github.com/pymupdf/PyMuPDF/wiki/Inserting-Pages-from-other-PDFs>`_ contains a full description.
+    It is easy to join PDFs with :meth:`Document.insert_pdf` & :meth:`Document.insert_file`. Given open |PDF| documents, you can copy page ranges from one to the other. You can select the point where the copied pages should be placed, you can reverse the page sequence and also change page rotation.
 
     The GUI script `join.py <https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/examples/join-documents/join.py>`_ uses this method to join a list of files while also joining the respective table of contents segments. It looks like this:
 

diff --git a/src/__init__.py b/src/__init__.py
@@ -4589,12 +4589,14 @@ def insert_file(self,
     def insert_pdf(
             self,
             docsrc,
+            *,
             from_page=-1,
             to_page=-1,
             start_at=-1,
             rotate=-1,
             links=1,
             annots=1,
+            widgets=1,
             show_progress=0,
             final=1,
             _gmap=None,
@@ -4609,6 +4611,7 @@ def insert_pdf(
             rotate: (int) rotate copied pages, default -1 is no change.
             links: (int/bool) whether to also copy links.
             annots: (int/bool) whether to also copy annotations.
+            widgets: (int/bool) whether to also copy form fields.
             show_progress: (int) progress message interval, 0 is no messages.
             final: (bool) indicates last insertion from this source PDF.
             _gmap: internal use only
@@ -4626,6 +4629,26 @@ def insert_pdf(
         sa = start_at
         if sa < 0:
             sa = self.page_count
+        outCount = self.page_count
+        srcCount = docsrc.page_count
+
+        # local copies of page numbers
+        fp = from_page
+        tp = to_page
+        sa = start_at
+
+        # normalize page numbers
+        fp = max(fp, 0) # -1 = first page
+        fp = min(fp, srcCount - 1)  # but do not exceed last page
+
+        if tp < 0:
+            tp = srcCount - 1   # -1 = last page
+        tp = min(tp, srcCount - 1)  # but do not exceed last page
+
+        if sa < 0:
+            sa = outCount   # -1 = behind last page
+        sa = min(sa, outCount)  # but that is also the limit
+
         if len(docsrc) > show_progress > 0:
             inname = os.path.basename(docsrc.name)
             if not inname:
@@ -4663,25 +4686,6 @@ def insert_pdf(
         else:
             pdfout = _as_pdf_document(self)
             pdfsrc = _as_pdf_document(docsrc)
-            outCount = mupdf.fz_count_pages(self)
-            srcCount = mupdf.fz_count_pages(docsrc.this)
-
-            # local copies of page numbers
-            fp = from_page
-            tp = to_page
-            sa = start_at
-
-            # normalize page numbers
-            fp = max(fp, 0) # -1 = first page
-            fp = min(fp, srcCount - 1)  # but do not exceed last page
-
-            if tp < 0:
-                tp = srcCount - 1   # -1 = last page
-            tp = min(tp, srcCount - 1)  # but do not exceed last page
-
-            if sa < 0:
-                sa = outCount   # -1 = behind last page
-            sa = min(sa, outCount)  # but that is also the limit
 
             if not pdfout.m_internal or not pdfsrc.m_internal:
                 raise TypeError( "source or target not a PDF")
@@ -4692,7 +4696,9 @@ def insert_pdf(
         self._reset_page_refs()
         if links:
             #log( 'insert_pdf(): calling self._do_links()')
-            self._do_links(docsrc, from_page = from_page, to_page = to_page, start_at = sa)
+            self._do_links(docsrc, from_page=fp, to_page=tp, start_at=sa)
+        if widgets:
+            self._do_widgets(docsrc, _gmap, from_page=fp, to_page=tp, start_at=sa)
         if final == 1:
             self.Graftmaps[isrt] = None
         #log( 'insert_pdf(): returning')
@@ -20150,9 +20156,6 @@ def page_merge(doc_des, doc_src, page_from, page_to, rotate, links, copy_annots,
                     continue
                 if mupdf.pdf_name_eq( subtype, PDF_NAME('Popup')):
                     continue
-                if mupdf.pdf_name_eq( subtype, PDF_NAME('Widget')):
-                    mupdf.fz_warn( "skipping widget annotation")
-                    continue
                 if mupdf.pdf_name_eq(subtype, PDF_NAME('Widget')):
                     continue
                 mupdf.pdf_dict_del( o, PDF_NAME('Popup'))
@@ -21295,6 +21298,7 @@ def _atexit():
 Annot.get_textbox           = utils.get_textbox
 
 Document._do_links          = utils.do_links
+Document._do_widgets        = utils.do_widgets
 Document.del_toc_item       = utils.del_toc_item
 Document.get_char_widths    = utils.get_char_widths
 Document.get_oc             = utils.get_oc

diff --git a/src/extra.i b/src/extra.i
@@ -346,11 +346,6 @@ static void page_merge(
                 mupdf::PdfObj subtype = mupdf::pdf_dict_get(o, PDF_NAME(Subtype));
                 if (mupdf::pdf_name_eq(subtype, PDF_NAME(Link))) continue;
                 if (mupdf::pdf_name_eq(subtype, PDF_NAME(Popup))) continue;
-                if (mupdf::pdf_name_eq(subtype, PDF_NAME(Widget)))
-                {
-                    mupdf::fz_warn("skipping widget annotation");
-                    continue;
-                }
                 if (mupdf::pdf_name_eq(subtype, PDF_NAME(Widget))) continue;
                 mupdf::pdf_dict_del(o, PDF_NAME(Popup));
                 mupdf::pdf_dict_del(o, PDF_NAME(P));

diff --git a/src/utils.py b/src/utils.py
@@ -1675,6 +1675,131 @@ def set_toc(
     return toclen
 
 
+def do_widgets(
+    tar: pymupdf.Document,
+    src: pymupdf.Document,
+    graftmap,
+    from_page: int = -1,
+    to_page: int = -1,
+    start_at: int = -1,
+) -> None:
+    """Insert widgets contained in copied page range into destination PDF.
+
+    Parameter values **must** equal those of method insert_pdf(), which
+    must have been executed previously.
+    """
+    if not src.is_form_pdf:  # nothing to do: source PDF has no fields
+        return
+
+    def get_acroform(doc):
+        """Retrieve the AcroForm dictionary from a PDF."""
+        pdf = mupdf.pdf_document_from_fz_document(doc)
+        # AcroForm (= central form field info)
+        return mupdf.pdf_dict_getp(mupdf.pdf_trailer(pdf), "Root/AcroForm")
+
+    tarpdf = mupdf.pdf_document_from_fz_document(tar)
+    srcpdf = mupdf.pdf_document_from_fz_document(src)
+
+    if tar.is_form_pdf:
+        # target is a Form PDF, so use its AcroForm to include source fields
+        acro = get_acroform(tar)
+        # Important arrays of indirect objects
+        tar_fields = mupdf.pdf_dict_get(acro, pymupdf.PDF_NAME("Fields"))
+        tar_co = mupdf.pdf_dict_get(acro, pymupdf.PDF_NAME("CO"))
+        if not mupdf.pdf_is_array(tar_co):
+            tar_co = mupdf.pdf_dict_put_array(acro, pymupdf.PDF_NAME("CO"), 5)
+    else:
+        # target is no Form PDF, so copy over source AcroForm
+        acro = mupdf.pdf_deep_copy_obj(get_acroform(src))  # make a copy
+
+        # Clear "Fields" and "CO" arrays: will be populated by page fields.
+        # This is required to avoid copying unneeded objects.
+        mupdf.pdf_dict_del(acro, pymupdf.PDF_NAME("Fields"))
+        mupdf.pdf_dict_put_array(acro, pymupdf.PDF_NAME("Fields"), 5)
+        mupdf.pdf_dict_del(acro, pymupdf.PDF_NAME("CO"))
+        mupdf.pdf_dict_put_array(acro, pymupdf.PDF_NAME("CO"), 5)
+
+        # Enrich AcroForm for copying to target
+        acro_graft = mupdf.pdf_graft_mapped_object(graftmap, acro)
+
+        # Insert AcroForm into target PDF
+        acro_tar = mupdf.pdf_add_object(tarpdf, acro_graft)
+        tar_fields = mupdf.pdf_dict_get(acro_tar, pymupdf.PDF_NAME("Fields"))
+        tar_co = mupdf.pdf_dict_get(acro_tar, pymupdf.PDF_NAME("CO"))
+
+        # get its xref and insert it into target catalog
+        tar_xref = mupdf.pdf_to_num(acro_tar)
+        acro_tar_ind = mupdf.pdf_new_indirect(tarpdf, tar_xref, 0)
+        root = mupdf.pdf_dict_get(mupdf.pdf_trailer(tarpdf), pymupdf.PDF_NAME("Root"))
+        mupdf.pdf_dict_put(root, pymupdf.PDF_NAME("AcroForm"), acro_tar_ind)
+
+    if from_page <= to_page:
+        src_range = range(from_page, to_page + 1)
+    else:
+        src_range = range(from_page, to_page - 1, -1)
+
+    for i in range(len(src_range)):
+        # read the i-th page that was copied over
+        tar_page = tar[start_at + i]
+
+        # convert it to a formal PDF page
+        tar_page_pdf = mupdf.pdf_page_from_fz_page(tar_page)
+
+        # extract its annotations array
+        tar_annots = mupdf.pdf_dict_get(tar_page_pdf.obj(), pymupdf.PDF_NAME("Annots"))
+        if not mupdf.pdf_is_array(tar_annots):
+            tar_annots = mupdf.pdf_dict_put_array(
+                tar_page_pdf.obj(), pymupdf.PDF_NAME("Annots"), 5
+            )
+
+        # read the original page in the source PDF
+        src_page = src[src_range[i]]
+
+        # now walk through source page widgets and copy over
+        w_xrefs = [  # widget xrefs of the source page
+            xref
+            for xref, wtype, _ in src_page.annot_xrefs()
+            if wtype == pymupdf.PDF_ANNOT_WIDGET  # pylint: disable=no-member
+        ]
+
+        # Remove page references from widgets to prevent duplicate copies
+        # of the page in the target.
+        for xref in w_xrefs:
+            w_obj = mupdf.pdf_load_object(srcpdf, xref)
+            mupdf.pdf_dict_del(w_obj, pymupdf.PDF_NAME("P"))
+
+        for xref in w_xrefs:
+            w_obj = mupdf.pdf_load_object(srcpdf, xref)
+
+            # check if field is a member of inter-field validations
+            temp = mupdf.pdf_dict_getp(w_obj, "AA/C")
+            if mupdf.pdf_is_dict(temp):
+                is_aac = True
+            else:
+                is_aac = False
+
+            # recursively complete the widget object with all referenced objects
+            w_obj_graft = mupdf.pdf_graft_mapped_object(graftmap, w_obj)
+
+            # add the completed widget object to the target PDF
+            w_obj_tar = mupdf.pdf_add_object(tarpdf, w_obj_graft)
+
+            # extract its generated target xref number
+            tar_xref = mupdf.pdf_to_num(w_obj_tar)
+
+            # create an indirect object from it
+            w_obj_tar_ind = mupdf.pdf_new_indirect(tarpdf, tar_xref, 0)
+
+            # insert this xref reference into the page,
+            mupdf.pdf_array_push(tar_annots, w_obj_tar_ind)
+
+            # and also into "AcroForm/Fields",
+            mupdf.pdf_array_push(tar_fields, w_obj_tar_ind)
+            # and also into "AcroForm/CO" if a computation field.
+            if is_aac:
+                mupdf.pdf_array_push(tar_co, w_obj_tar_ind)
+
+
 def do_links(
     doc1: pymupdf.Document,
     doc2: pymupdf.Document,
@@ -5354,7 +5479,7 @@ def has_annots(doc: pymupdf.Document) -> bool:
     for i in range(doc.page_count):
         for item in doc.page_annot_xrefs(i):
             # pylint: disable=no-member
-            if not (item[1] == pymupdf.PDF_ANNOT_LINK or item[1] == pymupdf.PDF_ANNOT_WIDGET):
+            if not (item[1] == pymupdf.PDF_ANNOT_LINK or item[1] == pymupdf.PDF_ANNOT_WIDGET):  # pylint: disable=no-member
                 return True
     return False
 

diff --git a/tests/resources/cms-etc-filled.pdf b/tests/resources/cms-etc-filled.pdf
diff --git a/tests/resources/interfield-calculation.pdf b/tests/resources/interfield-calculation.pdf
diff --git a/tests/test_insertpdf.py b/tests/test_insertpdf.py
@@ -186,3 +186,38 @@ def test_3789():
         # If this is the last split file, exit the loop
         if to_page == -1:
             break
+
+
+def test_widget_insert():
+    """Confirm copy of form fields / widgets."""
+    from pymupdf import mupdf
+    tar = pymupdf.open(os.path.join(resources, "cms-etc-filled.pdf"))
+    pc0 = tar.page_count  # for later assertion
+    src = pymupdf.open(os.path.join(resources, "interfield-calculation.pdf"))
+    pc1 = src.page_count  # for later assertion
+
+    tarpdf = pymupdf._as_pdf_document(tar)
+    tar_field_count = mupdf.pdf_array_len(
+        mupdf.pdf_dict_getp(mupdf.pdf_trailer(tarpdf), "Root/AcroForm/Fields")
+    )
+    tar_co_count = mupdf.pdf_array_len(
+        mupdf.pdf_dict_getp(mupdf.pdf_trailer(tarpdf), "Root/AcroForm/CO")
+    )
+    srcpdf = pymupdf._as_pdf_document(src)
+    src_field_count = mupdf.pdf_array_len(
+        mupdf.pdf_dict_getp(mupdf.pdf_trailer(srcpdf), "Root/AcroForm/Fields")
+    )
+    src_co_count = mupdf.pdf_array_len(
+        mupdf.pdf_dict_getp(mupdf.pdf_trailer(srcpdf), "Root/AcroForm/CO")
+    )
+
+    tar.insert_pdf(src)
+    new_field_count = mupdf.pdf_array_len(
+        mupdf.pdf_dict_getp(mupdf.pdf_trailer(tarpdf), "Root/AcroForm/Fields")
+    )
+    new_co_count = mupdf.pdf_array_len(
+        mupdf.pdf_dict_getp(mupdf.pdf_trailer(tarpdf), "Root/AcroForm/CO")
+    )
+    assert tar.page_count == pc0 + pc1
+    assert new_field_count == tar_field_count + src_field_count
+    assert new_co_count == tar_co_count + src_co_count
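
The page-number handling in this commit (the normalization block moved ahead of the link and widget steps in `insert_pdf`, plus the forward or reverse `src_range` in `do_widgets`) can be illustrated without any PDF at all. The following is a minimal pure-Python sketch of that arithmetic only. The function name `plan_copy` and its return shape are hypothetical and not part of PyMuPDF; the clamping rules are taken from the diff above.

```python
def plan_copy(from_page, to_page, start_at, src_count, out_count):
    """Hypothetical helper: map source page numbers to target page numbers.

    Mirrors the normalization shown in the diff:
    - from_page below 0 means the first source page
    - to_page below 0 means the last source page
    - start_at below 0 means "behind the last target page"
    All values are clamped to the valid range.
    """
    fp = min(max(from_page, 0), src_count - 1)
    tp = src_count - 1 if to_page < 0 else min(to_page, src_count - 1)
    sa = out_count if start_at < 0 else min(start_at, out_count)

    # from_page > to_page means: copy in reverse order
    if fp <= tp:
        src_pages = list(range(fp, tp + 1))
    else:
        src_pages = list(range(fp, tp - 1, -1))

    # the i-th copied page lands at target page number sa + i
    tar_pages = [sa + i for i in range(len(src_pages))]
    return src_pages, tar_pages


# Defaults: append all 3 source pages to a 5-page target.
print(plan_copy(-1, -1, -1, src_count=3, out_count=5))
# Reverse order, inserted at the front of the target.
print(plan_copy(2, 0, 0, src_count=3, out_count=5))
```

Passing the normalized values `fp`, `tp` and `sa` (rather than the raw arguments) to both `_do_links` and `_do_widgets` keeps the source-to-target page mapping identical across the two steps, which is what the pairing `tar[start_at + i]` with `src[src_range[i]]` relies on.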