TIMX 454 - parse BS4 in isolated thread to avoid dangling memory pointers #231
Conversation
Why these changes are being introduced:

A bug was discovered where the application did not exit cleanly for XML sources when the number of records transformed resulted in 2+ batches of writing to the parquet dataset. The cause of this memory fault was the combination of three factors:

1. parsing of XML strings into BeautifulSoup4 Tags
2. running Transmogrifier via 'pipenv run transform ...'
3. transformed records are written to a parquet dataset, requiring more than one batch of writes

Under the hood, BeautifulSoup4 uses lxml to parse the XML string it receives. Somehow, through a combination of these three factors, there were memory pointers left dangling when the application attempted to tear itself down. This resulted in C code throwing an abort() call, which bubbled up. It was confirmed that no one of these factors was enough on its own: non-XML sources work fine, removing bs4 parsing works fine, even changing bs4's parsing to use 'html.parser' works fine, and running the transformation without 'pipenv run ...' works fine. There are documentation and forum posts out there suggesting that bs4 is somewhat messy with memory management, and it appears this combination of factors triggers that.

How this addresses that need:

When producing a bs4 'source_record' for transformation, the bs4 parsing is handled in a short-lived, isolated thread. By doing so, we do not see any memory management issues when the application exits. It is less than ideal to spawn a thread just to parse a bs4 Tag, but the performance hit is minimal. Furthermore, there is active discussion of removing BeautifulSoup4 altogether from Transmogrifier, so this patch is seen as temporary until that time.

Side effects of this change:

* XML sources of decent size, that will trigger multiple batches of writing to the output parquet dataset, will not throw error exit codes.

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TIMX-454
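As a minimal sketch of the "short-lived, isolated thread" approach described above (the helper name parse_in_isolated_thread is illustrative, not necessarily the actual Transmogrifier code):

import threading

from bs4 import BeautifulSoup


def parse_in_isolated_thread(record_string: bytes) -> BeautifulSoup:
    # The parse happens in a short-lived thread that is joined immediately,
    # so the call stays synchronous while keeping bs4/lxml parsing state
    # out of the main thread's teardown path.
    result = {}

    def _parse():
        result["soup"] = BeautifulSoup(record_string, "xml")

    thread = threading.Thread(target=_parse)
    thread.start()
    thread.join()
    return result["soup"]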
Works as expected and I was able to recreate the error. Good catch on this and there is more than enough evidence to convict and exile bs4 at this point 🙃
Thanks for confirming you can recreate! Very helpful. I did try two machines (work and personal), but one never knows.
@jonavellecuerdo, @ehanson8 - here is a small script that can be created in output/memory_debugging.py to recreate the error:

"""
Trigger via pipenv:
    pipenv run python output/memory_debugging.py

Trigger via python (when in pipenv shell):
    python output/memory_debugging.py

Trigger via subprocess:
    from output.memory_debugging import *
    trigger_error_via_subprocess()

Trigger via IPython on teardown:
    from output.memory_debugging import *
    main()
    exit()  # <------ here the error is thrown
"""
import itertools
import subprocess
import threading

from bs4 import BeautifulSoup
from lxml import etree
import pyarrow as pa
import pyarrow.dataset as ds

XML_FILE = (
    "tests/fixtures/dataset/libguides-2025-01-09-full-extracted-records-to-index.xml"
)
BATCH_SIZE = 1000
PARSE_BS4_IN_THREAD = False  # flip to True to use the thread-isolated parse


def get_bs4_tags(xml_filepath=XML_FILE):
    """Yield each <record> element from the XML file as a BeautifulSoup object."""
    with open(xml_filepath, "rb") as file:
        for _, element in etree.iterparse(
            file,
            tag="{*}record",
            encoding="utf-8",
            recover=True,
        ):
            record_string = etree.tostring(element, encoding="utf-8")
            if not PARSE_BS4_IN_THREAD:
                record = BeautifulSoup(record_string, "xml")
            else:
                record = parse_bs4_in_thread(record_string)
            yield record
            element.clear()


def parse_bs4_in_thread(source_record: bytes):
    """Parse the record with bs4 inside a short-lived, immediately-joined thread."""

    def parse():
        nonlocal source_record
        source_record = BeautifulSoup(source_record, "xml")

    thread = threading.Thread(target=parse)
    thread.start()
    thread.join()
    return source_record


def get_pa_batches(records_iter):
    """Yield pyarrow RecordBatches of BATCH_SIZE stringified records."""
    for i, record_batch in enumerate(itertools.batched(records_iter, BATCH_SIZE)):
        pa_batch = pa.RecordBatch.from_pylist(
            [{"record": str(record)} for record in record_batch]
        )
        print(f"processed batch: {i}, length: {len(pa_batch)}")
        yield pa_batch


SCHEMA = pa.schema(
    (pa.field("record", pa.string()),),
)


def write_dataset(record_batches_iter):
    """Write the streaming record batches to a parquet dataset."""
    ds.write_dataset(
        record_batches_iter,
        base_dir="/tmp/dataset-memory",
        existing_data_behavior="overwrite_or_ignore",
        format="parquet",
        max_open_files=500,
        max_rows_per_file=100_000,
        max_rows_per_group=1_000,
        schema=SCHEMA,
        use_threads=True,
    )


def main():
    tags_iter = get_bs4_tags()
    pa_batches_iter = get_pa_batches(tags_iter)
    write_dataset(pa_batches_iter)


def trigger_error_via_subprocess():
    result = subprocess.run(  # noqa: S603
        ["pipenv", "run", "python", "output/memory_debugging.py"],  # noqa: S607
        capture_output=True,
        text=True,
        check=False,
    )
    print(f"subprocess return code (where '-6' is C memory fault): {result.returncode}")
    return result


if __name__ == "__main__":
    main()
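Flipping PARSE_BS4_IN_THREAD to True exercises the thread-isolated parsing approach from this PR and should exit cleanly; leaving it False and invoking via pipenv run reproduces the abort.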
@ehanson8, @jonavellecuerdo - after a considerably deep dive on this, and learning quite a bit about the interplay of bs4, pyarrow, and pipenv run ... subprocesses: interestingly, we could put a gc.freeze() call in place before teardown, which also appears to avoid the error. But this feels like a hack more than a solution. The approach to parse BS4 in an isolated thread -- the current approach in this PR -- appears to be the most thorough option with the constraints we have: using BS4, writing streaming batches via pyarrow, etc.
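For reference, a minimal sketch of the gc.freeze alternative mentioned above (the atexit placement is an assumption for illustration only; this is the hack we decided against, not what the PR implements):

import atexit
import gc


def freeze_before_teardown():
    # gc.freeze() moves every currently tracked object into a "permanent"
    # generation that the collector ignores, so the dangling bs4/lxml objects
    # are never touched during interpreter shutdown.
    gc.freeze()


# Placement via atexit is an assumption; the call just needs to happen before
# the interpreter begins tearing itself down.
atexit.register(freeze_before_teardown)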
Though I would pose the question of which is easier to take out in the future? I think gc.freeze might be.
I almost added this to my comment. Interestingly, I think the PR approach is easier to remove. It's specific to BS4 parsing, so if and when we do remove BS4, it'll naturally get removed. That was actually a concern: if we added gc.freeze, it wouldn't be tied to BS4 and could easily outlive the problem it was meant to address.
Thank you for doing such thorough research into this error! Can also confirm that I was able to recreate the error following the instructions in the PR.
I also like the current solution as opposed to gc.freeze, as it feels a little more intuitive/easier to follow and allows Python to carry on with its automatic garbage collection processes.
Purpose and background context
Pasting the git commit verbatim:
As noted in the Jira ticket, an error (a C-level abort() during application teardown) was cropping up for Transmogrifier runs of a decent size (specifically: 1k+ records, which results in multiple batches of 1k writes to the parquet dataset, with a total not evenly divisible by 1k (unknown how this factors in)).

The last piece in the puzzle was learning that invoking Transmog via pipenv run transform ... was a necessary piece of this error getting thrown. Running the transformations via pure, interactive Python was not problematic, nor was even a CLI runner test like we do for other unit tests. It is believed this is related to pipenv run ... spawning a subprocess / dedicated shell for the work. Even when running Transmog via a Docker container, this ends up calling pipenv run ..., which creates a subprocess within that container. It is not recommended that we revisit this approach at this time, but it's worth remembering for the future that virtually all of our CLIs are actually running as a subprocess in the container.

It does feel a little suspect that code from the timdex-dataset-api library is involved, where only multiple batches of writes trigger the memory fault. But recall it's a) only when BS4 is involved, b) only when run via a pipenv run ... context, etc. All other writing to the dataset has been successful: up to millions of records for sources like gisogm, removing BS4 and only using lxml, etc. My hunch is that deep in the C-code weeds there is some interplay between iterators and BS4 memory management, but it is likely far too low-level to address directly, and we may just need to work around it as this PR proposes.

TL/DR

BeautifulSoup4 parsing was conflicting with parquet dataset writing when invoked via pipenv run .... The workaround solution at this time is to parse BeautifulSoup4 Tags in an isolated thread within the application, which avoids memory faults on application teardown. BeautifulSoup4 remains a performance bottleneck, and has now introduced memory management considerations with the new parquet dataset writing. It's proposed that we continue to explore removing it entirely from Transmogrifier.
How can a reviewer manually see the effects of these changes?
Confirm that current code results in clean exit codes

On this branch, run a transformation (e.g. the pipenv run transform ... command shown below), then run echo $? and observe a 0 exit code.

Recreate error exit codes by running against main branch

First, checkout the main branch.

Second, download this fixture from this PR branch (which was removed when checking out main) to a known location (e.g. ~/Downloads). This fixture has 1,001 records, where the single record beyond 1k is enough to trigger a second batch of parquet dataset writing, and this memory fault.

Lastly, run Transmog against this fixture:

pipenv run transform \
    -s libguides \
    -i ~/Downloads/libguides-2025-01-09-full-extracted-records-to-index.xml \
    -o /tmp/dataset

Note the application teardown output after the transformation and writing is successful: the memory fault / abort() is thrown here, producing a non-zero exit code.
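Optionally, the same exit-code check can be scripted instead of using echo $?. This is just a convenience sketch that assumes the command and fixture path shown above; a negative subprocess return code means the process was terminated by a signal, and -6 maps to SIGABRT (the abort() thrown during C-level teardown):

import os
import signal
import subprocess

fixture = os.path.expanduser(
    "~/Downloads/libguides-2025-01-09-full-extracted-records-to-index.xml"
)

# Run the transform as a subprocess and inspect how it exited.
result = subprocess.run(
    ["pipenv", "run", "transform", "-s", "libguides", "-i", fixture, "-o", "/tmp/dataset"],
    check=False,
)

if result.returncode < 0:
    print(f"terminated by signal: {signal.Signals(-result.returncode).name}")
else:
    print(f"exit code: {result.returncode}")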
Includes new or updated dependencies?
YES: all dependencies were updated to rule out outdated libraries as a source of this error
Changes expectations for external applications?
NO
What are the relevant tickets?

TIMX-454: https://mitlibraries.atlassian.net/browse/TIMX-454
Developer
Code Reviewer(s)