
TIMX 454 - parse BS4 in isolated thread to avoid dangling memory pointers #231

Merged

merged 2 commits into main on Jan 13, 2025

Conversation

@ghukill (Contributor) commented Jan 9, 2025

Purpose and background context

Pasting the git commit verbatim:

    Why these changes are being introduced:
    
    A bug was discovered where the application did not exit cleanly
    for XML sources when the number of records transformed resulted
    in 2+ batches of writing to the parquet dataset.  The cause of this
    memory fault was the combination of three factors:
    
    1. parsing of XML strings into BeautifulSoup4 Tags
    2. running Transmogrifier via 'pipenv run transform...'
    3. transformed records are written to a parquet dataset, requiring
    more than one batch of writes
    
    Under the hood, BeautifulSoup4 uses lxml to parse the XML string
    it receives.  Somehow, through a combination of these three factors,
    there were memory pointers left dangling when the application
    attempted to tear itself down.  This resulted in C code throwing
    an abort() call, which bubbled up.
    
    It was confirmed that not one of these factors was enough on its
    own.  Non XML sources work fine, removing bs4 parsing works fine,
    even changing bs4's parsing to use 'html.parser' works fine,
    and running the transformation without 'pipenv run ...' works fine.
    
    There is some documentation and forum posts out there that bs4
    is somewhat messy with memory management, and it appears this
    combination of factors triggers that.
    
    How this addresses that need:
    
    When producing a bs4 'source_record' for transformation, the
    bs4 parsing is handled in a short-lived, isolated thread.  By doing
    so, we do not see any memory management issues when the application
    exits.  It is less than ideal to spawn a thread just to parse a
    bs4 Tag, but the performance hit is minimal.
    
    Furthermore, there is active discussion of removing BeautifulSoup4
    altogether from Transmogrifier, so this patch is seen as temporary
    until that time.
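
In practice, the fix looks roughly like the following (a minimal sketch; the helper name is illustrative, not necessarily the exact code in the diff, and the same pattern appears in the debugging script further down this thread):

import threading

from bs4 import BeautifulSoup

def parse_in_isolated_thread(record_string: bytes) -> BeautifulSoup:
    result: list[BeautifulSoup] = []

    def parse() -> None:
        # bs4 (and its lxml backend) do all of their allocation inside this
        # short-lived thread rather than the main thread
        result.append(BeautifulSoup(record_string, "xml"))

    thread = threading.Thread(target=parse)
    thread.start()
    thread.join()  # block until parsing finishes, so the call stays synchronous
    return result[0]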

As noted in the Jira ticket, an error like the following was cropping up for Transmogrifier runs of a decent size (specifically, 1k+ records, which results in multiple batches of 1k writes to the parquet dataset; record counts not evenly divisible by 1k also appeared relevant, though it's unknown how that factors in):

...
...
2025-01-07 15:47:06,193 INFO transmogrifier.cli.main(): Total time to complete transform: 0:03:01.773851
python(22003,0x1eb444c00) malloc: *** error for object 0x11f988457: pointer being freed was not allocated
python(22003,0x1eb444c00) malloc: *** set a breakpoint in malloc_error_break to debug
zsh: abort      PYTHONMALLOC=debug pipenv run transform -s alma -i  -o /tmp/dataset

The last piece of the puzzle was learning that invoking Transmog via pipenv run transform ... was a necessary part of triggering this error. Running the transformations via pure, interactive Python was not problematic, nor was a CLI runner test like we use for other unit tests.

It is believed this is related to pipenv run ... spawning a subprocess / dedicated shell for the work. Even when running Transmog via a Docker container, this ends up calling pipenv run ..., which creates a subprocess within that container. It is not recommended that we revisit this approach at this time, but it's worth remembering for the future that virtually all of our CLIs actually run as a subprocess in the container.

It does feel a little suspect that this code from the timdex-dataset-api library is involved, given that only multiple batches of writes trigger the memory fault. But recall it only happens a) when BS4 is involved, b) when run via a pipenv run ... context, etc. All other writing to the dataset has been successful: up to millions of records for sources like gisogm, runs with BS4 removed and only lxml used, etc. My hunch is that deep in the C-code weeds there is some interplay between iterators and BS4 memory management, but that is likely far too low-level to address directly, and we may just need to work around it as this PR proposes.

TL;DR

BeautifulSoup4 parsing was conflicting with parquet dataset writing when invoked via pipenv run .... The workaround solution at this time is to parse BeautifulSoup4 Tags in an isolated thread within the application, which avoids memory faults on application teardown.

BeautifulSoup4 remains a performance bottleneck, and has now introduced memory management considerations with the new parquet dataset writing. It's proposed that we continue to explore removing it entirely from Transmogrifier.
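
For reference, the longer-term direction of dropping BeautifulSoup4 and using lxml directly might look roughly like this (a hypothetical sketch, not code from this PR; downstream field extraction would need to move from bs4's find/find_all to lxml's findall/xpath):

from lxml import etree

def parse_source_record_with_lxml(record_string: bytes) -> etree._Element:
    # parse the XML record with lxml alone, avoiding bs4 entirely
    return etree.fromstring(record_string)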

How can a reviewer manually see the effects of these changes?

Confirm that current code results in clean exit codes

pipenv run transform \
-s libguides \
-i tests/fixtures/dataset/libguides-2025-01-09-full-extracted-records-to-index.xml \
-o /tmp/dataset

Then run echo $? and observe a 0 exit code.

Recreate error exit codes by running against main branch

First, checkout main branch.

Second, download this fixture from this PR branch (which was removed when checking out main) to a known location (e.g. ~/Downloads).

This fixture has 1,001 records, where the single record beyond 1k is enough to trigger a second batch of parquet dataset writing, and this memory fault.

Lastly, run Transmog against this fixture:

pipenv run transform \
-s libguides \
-i ~/Downloads/libguides-2025-01-09-full-extracted-records-to-index.xml \
-o /tmp/dataset

Note the application teardown output after the transformation and writing is successful:

...
...
python(88147,0x1eb444c00) malloc: *** error for object 0x1474fe457: pointer being freed was not allocated
python(88147,0x1eb444c00) malloc: *** set a breakpoint in malloc_error_break to debug
zsh: abort      pipenv run transform -s libguides -i  -o /tmp/dataset
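
As an aside, for intuition on why the 1,001-record fixture forces a second (partial) batch, here is a quick illustration assuming the same 1k batch size used in the debugging script further down (itertools.batched requires Python 3.12+):

import itertools

batches = list(itertools.batched(range(1_001), 1_000))
print(len(batches))                       # 2
print(len(batches[0]), len(batches[1]))   # 1000 1 <- the single extra record becomes its own batch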

Includes new or updated dependencies?

YES: all dependencies were updated to rule out outdated libraries as a source of this error

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

Side effects of this change:
* XML sources of decent size, that will trigger multiple batches
of writing to output parquet dataset, will not throw error exit codes.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-454
@ghukill marked this pull request as ready for review January 9, 2025 16:46
@ehanson8 (Contributor) left a comment

Works as expected and I was able to recreate the error. Good catch on this and there is more than enough evidence to convict and exile bs4 at this point 🙃

@ghukill (Contributor, Author) commented Jan 9, 2025

Works as expected and I was able to recreate the error. Good catch on this and there is more than enough evidence to convict and exile bs4 at this point 🙃

Thanks for confirming you can recreate! Very helpful. I did try two machines (work and personal), but one never knows.

@ghukill (Contributor, Author) commented Jan 10, 2025

@jonavellecuerdo, @ehanson8 - here is a small script that can be created in output/memory_debugging.py that replicates the error with much less jumping around in code:

"""
Trigger via pipenv:
pipenv run python output/memory_debugging.py

Trigger via python (when in pipenv shell):
python output/memory_debugging.py

Trigger via subprocess:
from output.memory_debugging import *
trigger_error_via_subprocess()

Trigger via Ipython on teardown:
from output.memory_debugging import *
main()
exit()  # <------ here the error is thrown
"""

import itertools
import subprocess
import threading
import uuid

from bs4 import BeautifulSoup
from lxml import etree
import pyarrow as pa
import pyarrow.dataset as ds

XML_FILE = (
    "tests/fixtures/dataset/libguides-2025-01-09-full-extracted-records-to-index.xml"
)

BATCH_SIZE = 1000

PARSE_BS4_IN_THREAD = False

def get_bs4_tags(xml_filepath=XML_FILE):
    with open(xml_filepath, "rb") as file:
        for _, element in etree.iterparse(
            file,
            tag="{*}record",
            encoding="utf-8",
            recover=True,
        ):
            record_string = etree.tostring(element, encoding="utf-8")
            if not PARSE_BS4_IN_THREAD:
                record = BeautifulSoup(record_string, "xml")
            else:
                record = parse_bs4_in_thread(record_string)
            yield record
            element.clear()

def parse_bs4_in_thread(source_record: bytes):
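    # NOTE: parsing happens in a short-lived, isolated thread; joining immediately
    # below keeps the call synchronous while avoiding the teardown memory fault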
    def parse():
        nonlocal source_record
        source_record = BeautifulSoup(source_record, "xml")

    thread = threading.Thread(target=parse)
    thread.start()
    thread.join()
    return source_record

def get_pa_batches(records_iter):
    for i, record_batch in enumerate(itertools.batched(records_iter, BATCH_SIZE)):
        pa_batch = pa.RecordBatch.from_pylist(
            [{"record": str(record)} for record in record_batch]
        )
        print(f"processed batch: {i}, length: {len(pa_batch)}")
        yield pa_batch


SCHEMA = pa.schema(
    (pa.field("record", pa.string()),),
)


def write_dataset(record_batches_iter):
    ds.write_dataset(
        record_batches_iter,
        base_dir="/tmp/dataset-memory",
        existing_data_behavior="overwrite_or_ignore",
        format="parquet",
        max_open_files=500,
        max_rows_per_file=100_000,
        max_rows_per_group=1_000,
        schema=SCHEMA,
        use_threads=True,
    )


def main():
    tags_iter = get_bs4_tags()
    pa_batches_iter = get_pa_batches(tags_iter)
    write_dataset(pa_batches_iter)


def trigger_error_via_subprocess():
    result = subprocess.run(  # noqa: S603
        ["pipenv", "run", "python", "output/memory_debugging.py"],  # noqa: S607
        capture_output=True,
        text=True,
        check=False,
    )
    print(f"subprocess return code (where '-6' is C memory fault): {result.returncode}")
    return result


if __name__ == "__main__":
    main()

@ghukill (Contributor, Author) commented Jan 13, 2025

@ehanson8, @jonavellecuerdo - after a considerably deep dive on this, and learning quite a bit about the interplay of bs4, lxml, pyarrow, and Python garbage collection, I continue to believe this is the best option at the moment.

Interestingly, we could put a gc.freeze() call right here before the CLI application exits, which also avoids this error. It does so by marking all objects currently tracked by garbage collection as permanent, so they are never cleaned up by Python's garbage collector and the memory error is never thrown.
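
For illustration, a minimal sketch of what that alternative would look like (hypothetical placement; the actual CLI code differs):

import gc

def main():
    run_transformation()  # hypothetical stand-in for the transform and parquet writing work
    # freeze moves all objects currently tracked by gc into a permanent generation,
    # so the collector never touches them during interpreter teardown
    gc.freeze()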

But this feels more like a hack than a solution. Parsing BS4 in an isolated thread -- the current approach in this PR -- appears to be the most thorough option given the constraints we have: using BS4, writing streaming batches via pyarrow, etc.

@ehanson8 (Contributor) commented

Though I would pose the question of which is easier to take out in the future? I think bs4 is nearing eviction from this app so I wouldn't consider it something we need to accommodate for much longer and a one-line workaround may be fine

@ghukill (Contributor, Author) commented Jan 13, 2025

Though I would pose the question of which is easier to take out in the future? I think bs4 is nearing eviction from this app so I wouldn't consider it something we need to accommodate for much longer and a one-line workaround may be fine

I almost added this to my comment. Interestingly, I think the PR approach is the easier one to remove. It's specific to BS4 parsing, so if and when we do remove BS4, it'll naturally get removed.

That was actually a concern: if we added gc.freeze() in the CLI, we may forget why, and it'd stick around even after BS4 was removed.

@jonavellecuerdo (Contributor) left a comment

Thank you for doing such thorough research into this error! Can also confirm that I was able to recreate the error following the instructions in the PR.

I also like the current solution as opposed to gc.freeze as it feels a little more intuitive/easier to follow and allows Python to carry on with its automatic garbage collection processes.

@ghukill merged commit 48979ba into main Jan 13, 2025
4 checks passed