Add to python-mammoth a capability to output Tracked Changes from Word docx into HTML #152

Open
BogdanChernyachuk opened this issue Jan 5, 2025 · 3 comments

Comments

@BogdanChernyachuk

I would like to propose the following feature, which I need for one of my work projects.

I need the ability to convert a document that has tracked changes turned on into HTML so that all insertions are wrapped in an <ins> tag and all deletions in a <del> tag.

For example:
Word (with "John" as a tracked deletion and "Jack" as a tracked insertion):
This is the house that John Jack built
Html:
<p>This is the house that <del>John</del><ins>Jack</ins> built</p>

It should be an optional feature that the client can control through an additional parameter of the convert_to_html() function or through a specific style map, similar to how python-mammoth can currently show or hide comments based on the style map.

Implementation details:

In the OpenXML format, tracked changes are represented as follows:

<w:del w:author="John Doe" w:date="2023-10-25T14:18:00Z" w:id="1">
    <w:r>
        <w:delText>Deleted text</w:delText>
    </w:r>
</w:del>
<w:ins w:author="John Doe" w:date="2023-10-25T14:18:10Z" w:id="2">
    <w:r>
        <w:t>Inserted text</w:t>
    </w:r>
</w:ins>

The current version of mammoth ignores the <w:del> tag, and for the <w:ins> tag it reads all child nodes as if the wrapper were not there.
I propose introducing Insertion and Deletion elements in the document model to hold the data of these nodes.
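
The proposal above can be illustrated with a minimal, standalone sketch. It does not use mammoth's internal reader; it parses the XML with the standard library and builds toy Run, Insertion and Deletion objects. All class and function names here are hypothetical, and the ignore_tracked_changes flag simply mirrors the one used in the unit tests further down.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


# Toy document model (hypothetical, not mammoth's own classes).
@dataclass
class Run:
    text: str


@dataclass
class Insertion:
    children: list
    author: Optional[str] = None


@dataclass
class Deletion:
    children: list
    author: Optional[str] = None


def read_run(run_element):
    # Inside w:del the text is stored in w:delText rather than w:t,
    # so both are treated the same way here.
    parts = [
        child.text or ""
        for child in run_element
        if child.tag in (W + "t", W + "delText")
    ]
    return Run("".join(parts))


def read_paragraph_children(paragraph_element, ignore_tracked_changes=True):
    result = []
    for child in paragraph_element:
        if child.tag == W + "r":
            result.append(read_run(child))
        elif child.tag == W + "ins":
            runs = [read_run(r) for r in child.findall(W + "r")]
            if ignore_tracked_changes:
                # Behaviour described in the issue: unwrap the insertion.
                result.extend(runs)
            else:
                result.append(Insertion(runs, child.get(W + "author")))
        elif child.tag == W + "del":
            # Behaviour described in the issue: deletions are dropped
            # unless tracked changes are requested.
            if not ignore_tracked_changes:
                runs = [read_run(r) for r in child.findall(W + "r")]
                result.append(Deletion(runs, child.get(W + "author")))
    return result


SAMPLE = f"""
<w:p xmlns:w="{W_NS}">
  <w:r><w:t>This is </w:t></w:r>
  <w:del w:author="John Doe" w:id="1"><w:r><w:delText>deleted</w:delText></w:r></w:del>
  <w:ins w:author="John Doe" w:id="2"><w:r><w:t>inserted</w:t></w:r></w:ins>
</w:p>
"""

paragraph = ET.fromstring(SAMPLE.strip())
print(read_paragraph_children(paragraph, ignore_tracked_changes=True))
print(read_paragraph_children(paragraph, ignore_tracked_changes=False))
```

With the flag set to True the result contains only plain runs, which matches the behaviour described for the current version. With it set to False the wrappers survive in the model, so a later conversion step can decide how to render them.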

P.S. I already have this implemented in my local repo, and if the feature looks interesting, I can make a pull request.
However, I would leave it to the author of the library to define what the public interface for this option should look like. It could be a parameter of convert_to_html:
mammoth.convert_to_html(fileobj=fileobj, ignore_tracked_changes=True)
or it could be a specific style in style_map.
Using style_map looks preferable because this parameter is also passed from https://github.com/microsoft/markitdown into mammoth, so the change in mammoth would not require a change in markitdown.

Here are some unit tests that I used to verify my implementation

def _run_element_with_deleted_text(text):
    return xml_element("w:r", {}, [_deleted_text_element(text)])

def _deleted_text_element(value):
    return xml_element("w:delText", {}, [xml_text(value)])


def test_insertion_element():
    element = xml_element("w:p", {}, [
        _run_element_with_text("This is "),
        xml_element("w:ins", {}, [
            _run_element_with_text("inserted")
        ])
    ])
    
    assert_equal(
        documents.paragraph([
            documents.run([documents.text("This is ")]),
            documents.run([documents.text("inserted")])]),
        _read_and_get_document_xml_element(element, ignore_tracked_changes=True)
    )

    assert_equal(
        documents.paragraph([
            documents.run([documents.text("This is ")]),
            documents.insertion([documents.run([documents.text("inserted")])])]),
        _read_and_get_document_xml_element(element, ignore_tracked_changes=False)
    )


def test_deletion_element():
    element = xml_element("w:p", {}, [
        _run_element_with_text("This is "),
        xml_element("w:del", {}, [
            _run_element_with_deleted_text("deleted")
        ])
    ])
    
    assert_equal(
        documents.paragraph([
            documents.run([documents.text("This is ")])]),
        _read_and_get_document_xml_element(element, ignore_tracked_changes=True)
    )

    assert_equal(
        documents.paragraph([
            documents.run([documents.text("This is ")]),
            documents.deletion([documents.run([documents.text("deleted")])])]),
        _read_and_get_document_xml_element(element, ignore_tracked_changes=False)
    )

@mwilliamson
Owner

Having insertions and deletions controlled by the style map probably makes the most sense. I probably won't have time to work on this any time soon, but a minimal example document and corresponding expected HTML would be helpful.

@BogdanChernyachuk
Author

Hi again, and thanks for your attention to my request.

If you think this feature would be valuable for the library, I can work on it and submit a PR.

Please find attached a document and the expected result:
track_changes.docx

I know what changes need to be made, at a high level:

  1. In body_xml, we need to create readers for the w:del and w:ins tags so that they are packed into Insertion and Deletion elements in the output document model. When w:delText is inside w:del, it should be processed the same way as w:t.
  2. In DocumentConverter, add visit_deletion and visit_insertion functions so that Insertion and Deletion elements are output as HTML ins and del tags.
    The code should be backward compatible: if the style map does not explicitly specify styles for del and ins, del tags will be ignored and the inner contents of ins tags will be processed with read_child_elements, as is done now.
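
The conversion step and its backward-compatible default can be sketched in isolation as follows. This is a toy visitor over a toy document model, not mammoth's DocumentConverter. The class names and the tracked_change_tags dictionary are assumptions made for illustration; the dictionary stands in for whatever style map syntax is eventually chosen.

```python
from dataclasses import dataclass
from html import escape


# Toy document model (hypothetical, not mammoth's own classes).
@dataclass
class Text:
    value: str


@dataclass
class Insertion:
    children: list


@dataclass
class Deletion:
    children: list


@dataclass
class Paragraph:
    children: list


class HtmlConverter:
    def __init__(self, tracked_change_tags=None):
        # Hypothetical stand-in for a style map, for example
        # {"insertion": "ins", "deletion": "del"}.
        # When empty, the behaviour described for the current version applies.
        self._tags = tracked_change_tags or {}

    def visit(self, node):
        # Dispatch to visit_text, visit_paragraph, visit_insertion, ...
        method = getattr(self, "visit_" + type(node).__name__.lower())
        return method(node)

    def visit_children(self, node):
        return "".join(self.visit(child) for child in node.children)

    def visit_text(self, node):
        return escape(node.value)

    def visit_paragraph(self, node):
        return "<p>" + self.visit_children(node) + "</p>"

    def visit_insertion(self, node):
        tag = self._tags.get("insertion")
        inner = self.visit_children(node)
        # Without a mapping, the inserted content is emitted unwrapped.
        return f"<{tag}>{inner}</{tag}>" if tag else inner

    def visit_deletion(self, node):
        tag = self._tags.get("deletion")
        # Without a mapping, the deleted content is dropped.
        if not tag:
            return ""
        return f"<{tag}>{self.visit_children(node)}</{tag}>"


document = Paragraph([
    Text("This is the house that "),
    Deletion([Text("John")]),
    Insertion([Text("Jack")]),
    Text(" built"),
])

print(HtmlConverter().visit(document))
# <p>This is the house that Jack built</p>
print(HtmlConverter({"insertion": "ins", "deletion": "del"}).visit(document))
# <p>This is the house that <del>John</del><ins>Jack</ins> built</p>
```

The sketch covers only the two modes listed in the plan. Restoring the previous version of the document (keeping deleted text and dropping inserted text) would need an additional rule that this toy mapping does not express.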

Expected HTML output:

def test_track_changes_are_converted_to_ins_and_del_elements():
    with open(generate_test_path("track_changes.docx"), "rb") as fileobj:
        # The existing version keeps insertions and removes deletions.
        result = mammoth.convert_to_html(fileobj=fileobj)
        assert_equal("<p>This is the house that Jack built</p>", result.value)

        # style_map_with_changes: TO BE DEFINED. How new styles for ins and del
        # are provided is still open.
        result = mammoth.convert_to_html(fileobj=fileobj, style_map=style_map_with_changes)
        assert_equal("<p>This is the house that <del>John</del><ins>Jack</ins> built</p>", result.value)

        # style_map_without_changes: TO BE DEFINED. The user may want a style that
        # restores the previous version of the document, i.e. keeps the deleted
        # text and removes the inserted text.
        result = mammoth.convert_to_html(fileobj=fileobj, style_map=style_map_without_changes)
        assert_equal("<p>This is the house that John built</p>", result.value)

@mwilliamson
Owner

Thanks for the offer, but I'm afraid I'm not currently accepting pull requests.
