TIMX 454 - parse BS4 in isolated thread to avoid dangling memory pointers #231
Conversation
Why these changes are being introduced:

A bug was discovered where the application did not exit cleanly for XML sources when the number of records transformed resulted in 2+ batches of writing to the parquet dataset. The cause of this memory fault was the combination of three factors:

1. parsing of XML strings into BeautifulSoup4 Tags
2. running Transmogrifier via 'pipenv run transform ...'
3. transformed records are written to a parquet dataset, requiring more than one batch of writes

Under the hood, BeautifulSoup4 uses lxml to parse the XML string it receives. Somehow, through a combination of these three factors, there were memory pointers left dangling when the application attempted to tear itself down. This resulted in C code throwing an abort() call, which bubbled up. It was confirmed that no one of these factors was enough on its own: non-XML sources work fine, removing bs4 parsing works fine, even changing bs4's parsing to use 'html.parser' works fine, and running the transformation without 'pipenv run ...' works fine. There are documentation and forum posts out there suggesting that bs4 is somewhat messy with memory management, and it appears this combination of factors triggers that.

How this addresses that need:

When producing a bs4 'source_record' for transformation, the bs4 parsing is handled in a short-lived, isolated thread. By doing so, we do not see any memory management issues when the application exits. It is less than ideal to spawn a thread just to parse a bs4 Tag, but the performance hit is minimal. Furthermore, there is active discussion of removing BeautifulSoup4 altogether from Transmogrifier, so this patch is seen as temporary until that time.

Side effects of this change:

* XML sources of decent size, that will trigger multiple batches of writing to the output parquet dataset, will not throw error exit codes.

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TIMX-454
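As a minimal sketch of the "short-lived, isolated thread" approach described above (the helper name parse_in_isolated_thread is illustrative, not necessarily the actual Transmogrifier code):

import threading

from bs4 import BeautifulSoup


def parse_in_isolated_thread(record_string: bytes) -> BeautifulSoup:
    # The parse happens in a short-lived thread that is joined immediately,
    # so the call stays synchronous while keeping bs4/lxml parsing state
    # out of the main thread's teardown path.
    result = {}

    def _parse():
        result["soup"] = BeautifulSoup(record_string, "xml")

    thread = threading.Thread(target=_parse)
    thread.start()
    thread.join()
    return result["soup"]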
Works as expected and I was able to recreate the error. Good catch on this and there is more than enough evidence to convict and exile bs4 at this point 🙃
Thanks for confirming you can recreate! Very helpful. I did try two machines (work and personal), but one never knows.
@jonavellecuerdo, @ehanson8 - here is a small script that can be created in output/memory_debugging.py to recreate the error:

"""
Trigger via pipenv:
    pipenv run python output/memory_debugging.py

Trigger via python (when in pipenv shell):
    python output/memory_debugging.py

Trigger via subprocess:
    from output.memory_debugging import *
    trigger_error_via_subprocess()

Trigger via IPython on teardown:
    from output.memory_debugging import *
    main()
    exit()  # <------ here the error is thrown
"""
import itertools
import subprocess
import threading

from bs4 import BeautifulSoup
from lxml import etree
import pyarrow as pa
import pyarrow.dataset as ds

XML_FILE = (
    "tests/fixtures/dataset/libguides-2025-01-09-full-extracted-records-to-index.xml"
)
BATCH_SIZE = 1000
PARSE_BS4_IN_THREAD = False  # flip to True to use the thread-isolated parse


def get_bs4_tags(xml_filepath=XML_FILE):
    """Yield each <record> element from the XML file as a BeautifulSoup object."""
    with open(xml_filepath, "rb") as file:
        for _, element in etree.iterparse(
            file,
            tag="{*}record",
            encoding="utf-8",
            recover=True,
        ):
            record_string = etree.tostring(element, encoding="utf-8")
            if not PARSE_BS4_IN_THREAD:
                record = BeautifulSoup(record_string, "xml")
            else:
                record = parse_bs4_in_thread(record_string)
            yield record
            element.clear()


def parse_bs4_in_thread(source_record: bytes):
    """Parse the record with bs4 inside a short-lived, immediately-joined thread."""

    def parse():
        nonlocal source_record
        source_record = BeautifulSoup(source_record, "xml")

    thread = threading.Thread(target=parse)
    thread.start()
    thread.join()
    return source_record


def get_pa_batches(records_iter):
    """Yield pyarrow RecordBatches of BATCH_SIZE stringified records."""
    for i, record_batch in enumerate(itertools.batched(records_iter, BATCH_SIZE)):
        pa_batch = pa.RecordBatch.from_pylist(
            [{"record": str(record)} for record in record_batch]
        )
        print(f"processed batch: {i}, length: {len(pa_batch)}")
        yield pa_batch


SCHEMA = pa.schema(
    (pa.field("record", pa.string()),),
)


def write_dataset(record_batches_iter):
    """Write the streaming record batches to a parquet dataset."""
    ds.write_dataset(
        record_batches_iter,
        base_dir="/tmp/dataset-memory",
        existing_data_behavior="overwrite_or_ignore",
        format="parquet",
        max_open_files=500,
        max_rows_per_file=100_000,
        max_rows_per_group=1_000,
        schema=SCHEMA,
        use_threads=True,
    )


def main():
    tags_iter = get_bs4_tags()
    pa_batches_iter = get_pa_batches(tags_iter)
    write_dataset(pa_batches_iter)


def trigger_error_via_subprocess():
    result = subprocess.run(  # noqa: S603
        ["pipenv", "run", "python", "output/memory_debugging.py"],  # noqa: S607
        capture_output=True,
        text=True,
        check=False,
    )
    print(f"subprocess return code (where '-6' is C memory fault): {result.returncode}")
    return result


if __name__ == "__main__":
    main()
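Flipping PARSE_BS4_IN_THREAD to True exercises the thread-isolated parsing approach from this PR and should exit cleanly; leaving it False and invoking via pipenv run reproduces the abort.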
@ehanson8, @jonavellecuerdo - after a considerably deep dive on this, and learning quite a bit about the interplay of bs4, pyarrow, and pipenv run ... subprocesses: interestingly, we could put a gc.freeze() call in place before teardown, which also appears to avoid the error. But this feels like a hack more than a solution. The approach to parse BS4 in an isolated thread -- the current approach in this PR -- appears to be the most thorough option with the constraints we have: using BS4, writing streaming batches via pyarrow, etc.
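For reference, a minimal sketch of the gc.freeze alternative mentioned above (the atexit placement is an assumption for illustration only; this is the hack we decided against, not what the PR implements):

import atexit
import gc


def freeze_before_teardown():
    # gc.freeze() moves every currently tracked object into a "permanent"
    # generation that the collector ignores, so the dangling bs4/lxml objects
    # are never touched during interpreter shutdown.
    gc.freeze()


# Placement via atexit is an assumption; the call just needs to happen before
# the interpreter begins tearing itself down.
atexit.register(freeze_before_teardown)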
Though I would pose the question of which is easier to take out in the future? I think gc.freeze might be.
I almost added this to my comment. Interestingly, I think the PR approach is easier to remove. It's specific to BS4 parsing, so if and when we do remove BS4, it'll naturally get removed. That was actually a concern: if we added gc.freeze, it wouldn't be tied to BS4 and could easily outlive the problem it was meant to address.
Thank you for doing such thorough research into this error! Can also confirm that I was able to recreate the error following the instructions in the PR.
I also like the current solution as opposed to gc.freeze, as it feels a little more intuitive/easier to follow and allows Python to carry on with its automatic garbage collection processes.
Purpose and background context
Pasting the git commit verbatim:
As noted in the Jira ticket, an error (a C-level abort() during application teardown) was cropping up for Transmogrifier runs of a decent size (specifically: 1k+ records, which results in multiple batches of 1k writes to the parquet dataset, with a total not evenly divisible by 1k (unknown how this factors in)).

The last piece in the puzzle was learning that invoking Transmog via pipenv run transform ... was a necessary piece of this error getting thrown. Running the transformations via pure, interactive Python was not problematic, nor was even a CLI runner test like we do for other unit tests. It is believed this is related to pipenv run ... spawning a subprocess / dedicated shell for the work. Even when running Transmog via a Docker container, this ends up calling pipenv run ..., which creates a subprocess within that container. It is not recommended that we revisit this approach at this time, but it's worth remembering for the future that virtually all of our CLIs are actually running as a subprocess in the container.

It does feel a little suspect that code from the timdex-dataset-api library is involved, where only multiple batches of writes trigger the memory fault. But recall it's a) only when BS4 is involved, b) only when run via a pipenv run ... context, etc. All other writing to the dataset has been successful: up to millions of records for sources like gisogm, removing BS4 and only using lxml, etc. My hunch is that deep in the C-code weeds there is some interplay between iterators and BS4 memory management, but it is likely far too low-level to address directly, and we may just need to work around it as this PR proposes.

TL/DR

BeautifulSoup4 parsing was conflicting with parquet dataset writing when invoked via pipenv run .... The workaround solution at this time is to parse BeautifulSoup4 Tags in an isolated thread within the application, which avoids memory faults on application teardown. BeautifulSoup4 remains a performance bottleneck, and has now introduced memory management considerations with the new parquet dataset writing. It's proposed that we continue to explore removing it entirely from Transmogrifier.
How can a reviewer manually see the effects of these changes?
Confirm that current code results in clean exit codes

On this branch, run a transformation (e.g. the pipenv run transform ... command shown below), then run echo $? and observe a 0 exit code.

Recreate error exit codes by running against main branch

First, checkout the main branch.

Second, download this fixture from this PR branch (which was removed when checking out main) to a known location (e.g. ~/Downloads). This fixture has 1,001 records, where the single record beyond 1k is enough to trigger a second batch of parquet dataset writing, and this memory fault.

Lastly, run Transmog against this fixture:

pipenv run transform \
    -s libguides \
    -i ~/Downloads/libguides-2025-01-09-full-extracted-records-to-index.xml \
    -o /tmp/dataset

Note the application teardown output after the transformation and writing is successful: the memory fault / abort() is thrown here, producing a non-zero exit code.
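Optionally, the same exit-code check can be scripted instead of using echo $?. This is just a convenience sketch that assumes the command and fixture path shown above; a negative subprocess return code means the process was terminated by a signal, and -6 maps to SIGABRT (the abort() thrown during C-level teardown):

import os
import signal
import subprocess

fixture = os.path.expanduser(
    "~/Downloads/libguides-2025-01-09-full-extracted-records-to-index.xml"
)

# Run the transform as a subprocess and inspect how it exited.
result = subprocess.run(
    ["pipenv", "run", "transform", "-s", "libguides", "-i", fixture, "-o", "/tmp/dataset"],
    check=False,
)

if result.returncode < 0:
    print(f"terminated by signal: {signal.Signals(-result.returncode).name}")
else:
    print(f"exit code: {result.returncode}")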
Includes new or updated dependencies?
YES: all dependencies were updated to rule out outdated libraries as a source of this error
Changes expectations for external applications?
NO
What are the relevant tickets?

TIMX-454: https://mitlibraries.atlassian.net/browse/TIMX-454
Developer
Code Reviewer(s)