support lazy batching again, support general iterators #76
Conversation
Looks good to me but I don't have any good test case to see what difference it makes.
I hope that @sauzher will check this with his large catalog.
@d-maurer
I've checked with ~50.000 objects populated catalog.
It still takes ~7000ms to load. The call that takes time is [this one](https://github.com/zopefoundation/DocumentTemplate/blob/d67f09646acf7cc7d4a1d75bc1e3fbd1b824cee7/src/DocumentTemplate/DT_In.py#L463)
Alessandro Ceglie wrote at 2024-9-30 09:39 -0700:
@d-maurer
I've checked with ~50.000 objects populated catalog.
It still takes ~7000ms to load. The call that takes time is [this one](https://github.com/zopefoundation/DocumentTemplate/blob/d67f09646acf7cc7d4a1d75bc1e3fbd1b824cee7/src/DocumentTemplate/DT_In.py#L463)
This is surprising:
according to your finding, the expensive line is
`sequence = sequence_ensure_subscription(md[name])`.
The `md[name]` is a simple lookup (very fast); the only expensive thing
inside `sequence_ensure_subscription` is `sequence[0]`,
i.e. the access to the first search result element.
The time for this access should not depend on the search result size;
i.e. it would still be necessary even if a filter had been applied.
Please dig into this further.
I suggest profiling, in an interactive Python interpreter,
the call `sequence_ensure_subscription(catalog.searchAll())`.
The `searchAll` itself should be very fast (the operation returns a lazy
result).
I expect that almost all time is spent in `Lazy.__getitem__`
(which implements the `sequence[0]`). Look at what the time in this
call is spent on. I expect ZODB reads; if this is true, then ZODB caching
(and, below it, maybe OS caching) will highly affect the timing.
Alessandro Ceglie wrote at 2024-9-30 09:39 -0700:
@d-maurer
I've checked with ~50.000 objects populated catalog.
It still takes ~7000ms to load. The call that takes time is [this one](https://github.com/zopefoundation/DocumentTemplate/blob/d67f09646acf7cc7d4a1d75bc1e3fbd1b824cee7/src/DocumentTemplate/DT_In.py#L463)
I used the following code in an interactive interpreter:
```python
from cProfile import Profile
from pstats import Stats, SortKey
from DocumentTemplate.DT_Util import sequence_ensure_subscription
p = ...  # the portal object
c = p.portal_catalog
r = c.searchAll()
with Profile() as pr:
    x = sequence_ensure_subscription(r)
ps = Stats(pr)
ps.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(10)
```
With my catalog, containing about 7,000 objects, this resulted in:
```
239274 function calls in 0.316 seconds
Ordered by: cumulative time
List reduced from 107 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.316 0.316 DT_Util.py:479(sequence_ensure_subscription)
1 0.000 0.000 0.316 0.316 DT_Util.py:455(sequence_supports_subscription)
2 0.000 0.000 0.316 0.158 Lazy.py:194(__getitem__)
87 0.001 0.000 0.316 0.004 Connection.py:759(setstate)
87 0.001 0.000 0.253 0.003 serialize.py:633(setGhostState)
87 0.000 0.000 0.252 0.003 serialize.py:623(getState)
174 0.035 0.000 0.251 0.001 {method 'load' of '_pickle.Unpickler' objects}
5404 0.009 0.000 0.215 0.000 DateTime.py:460(__setstate__)
5404 0.038 0.000 0.205 0.000 DateTime.py:475(_parse_args)
5404 0.007 0.000 0.119 0.000 DateTime.py:214(_calcDependentSecond)
```
As expected, almost all time is spent in ZODB reads
(--> `Connection.py:759(setstate)`) caused by `Lazy.py:194(__getitem__)`.
The profile shows 2 calls but the second one is in fact very fast:
almost all time is spent by the first call.
The cost for this `__getitem__` is independent of the search result size.
While you can avoid the `__getitem__` by showing results only
when a search filter has been specified, you will still pay the cost
once a filter is specified.
The profile above has been obtained with an empty ZODB cache.
When the cache already contains the accessed objects, the
profile looks like
```
30 function calls in 0.000 seconds
Ordered by: cumulative time
List reduced from 21 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 DT_Util.py:479(sequence_ensure_subscription)
1 0.000 0.000 0.000 0.000 DT_Util.py:455(sequence_supports_subscription)
2 0.000 0.000 0.000 0.000 Lazy.py:194(__getitem__)
1 0.000 0.000 0.000 0.000 Catalog.py:120(__getitem__)
1 0.000 0.000 0.000 0.000 Catalog.py:436(instantiate)
1 0.000 0.000 0.000 0.000 Catalog.py:431(_maintain_zodb_cache)
1 0.000 0.000 0.000 0.000 __init__.py:25(__new__)
1 0.000 0.000 0.000 0.000 ZCatalog.py:468(maintain_zodb_cache)
1 0.000 0.000 0.000 0.000 __init__.py:33(__setstate__)
2 0.000 0.000 0.000 0.000 __init__.py:81(__setattr__)
```
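To reproduce the cold-cache numbers above in the same interpreter session, the connection cache can be emptied between runs. A minimal sketch (assuming `p` is the portal object from the snippet above):
```python
# Evict unmodified objects from the ZODB connection ("pickle") cache so that
# a second profiling run in the same interpreter starts from a cold cache
# again.  Lower-level caches -- a ZEO client cache and the OS file system
# cache -- are not affected, so this only approximates a truly cold start.
conn = p._p_jar          # the connection the portal object was loaded from
conn.cacheMinimize()
```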
I will change the heuristics so that no sequence element is actually
accessed. This will make `sequence_ensure_subscription` very fast.
Of course, in your `catalogView` case you will still pay the ZODB cost
when the batch elements are rendered.
If you suppress the initial (empty filter) rendering, you may need to
pay it only once (after a filter has been defined).
@d-maurer You're right, in fact I modified my comment just 5 minutes after the post to point at the right expensive call: `sequence[0]` costs time.
Zope caching works well: the second time I call `manage_catalogView` it renders very fast. But we are still at the same point: loading that ZMI page in production for the first time (in a load-balanced environment) could lead to proxy errors for every ZEO client. ... Reading your last comment just right now, it seems we are on the same page. Paying the batch loading time only when the path filter is set is something a manager could take into account, but as @drfho pointed out here, there are catalogs where a path index is not defined, and one may want to check some brains immediately to test catalog consistency (which makes sense). Perhaps the `manage_catalogView` is to be refactored at least for the first opening call showing only, and only those, the first 20 brains and use an action button (or a path search if it's there) to load next ones. We can still have the page showing how much items are in the catalog by `len(lazyMap)` call.
Alessandro Ceglie wrote at 2024-10-1 00:01 -0700:
...
Perhaps the `manage_catalogView` is to be refactored at least for the first opening call showing only, and only those, the first 20 brains and use an action button (or a path search if it's there) to load next ones. We can still have the page showing how much items are in the catalog by `len(lazyMap)` call
The `DocumentTemplate` of this PR already ensures this,
as it makes the batching happen lazily, i.e.
you pay only for the rendered entries (maybe plus one)
and not for all search results.
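To illustrate what "paying only for the rendered entries" means, here is a small standalone sketch (assuming `Products.ZCatalog` is importable) using `LazyMap`, the kind of lazy result type discussed above; the `load` function stands in for an expensive per-record ZODB read and is only invoked for the items actually accessed:
```python
from Products.ZCatalog.Lazy import LazyMap

loaded = []

def load(rid):
    # Stand-in for an expensive per-record ZODB load.
    loaded.append(rid)
    return f"brain-{rid}"

# A "result" of 100000 records; nothing has been loaded yet.
results = LazyMap(load, list(range(100000)))

batch = [results[i] for i in range(20)]   # render only the first 20 entries
print(len(results), len(loaded))          # 100000 20 -- we paid for 20 loads only
```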
Refactoring `manage_catalogView` may help, but it would need to
follow a different route -- operating on the low-level catalog API.
Observations:
* We do not need a "path" index to support elementary path filtering;
we can instead use `_catalog.uids`, an `OIBTree` mapping
uids (which actually are paths) to rids ("Record IDs").
Its `keys` method supports `min` and `max` to efficiently
specify a range of keys to be accessed.
Using `min`, we can, for example, specify from which key onward keys
should be enumerated. The result is lazy, i.e. we pay only for
the keys actually accessed (thus lazy batching will be efficient);
see the sketch below.
* If we drop the "Type" display, there is no need to access the metadata
anymore.
Using these two observations, we can make `manage_catalogView` very fast
even for huge catalogs.
We pay (significantly) only when we have selected a specific entry
and want to get detailed information for it.
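A small standalone sketch of the first observation (the paths are made up; in a real catalog `_catalog.uids` plays the role of `uids` below). `keys(min, max)` returns a lazy range of keys, so enumerating a path prefix does not touch the rest of the tree:
```python
from BTrees.OIBTree import OIBTree

# Toy stand-in for _catalog.uids: path (uid) -> rid
uids = OIBTree({
    '/site/docs/a': 1,
    '/site/docs/b': 2,
    '/site/images/logo.png': 3,
    '/site/news/today': 4,
})

# All paths below /site/docs: min is the prefix, max is the prefix followed
# by a character sorting above any ASCII path character (a simplification).
prefix = '/site/docs'
matches = uids.keys(min=prefix, max=prefix + '\x7f')
print(list(matches))                       # ['/site/docs/a', '/site/docs/b']

# Enumerating "from a given key onward", e.g. for lazy batching:
print(list(uids.keys(min='/site/images'))[:2])
```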
…bscription avoiding actual item access (as it may be expensive)
I have changed the heuristics to distinguish sequence from mapping subscription to avoid a potentially costly sequence access. It now uses …
Fixes #75.

`dtml-in`'s lazy batching was broken by listifying the incoming sequence in order to support some iterators (e.g. `dict` key views).

This PR replaces the listifying by a `sequence_ensure_subscription` wrapper ensuring that the returned object has sequence-like behavior sufficient for `dtml-in`. It uses the original object if this already fulfills the requirement, otherwise a wrapper which lazily emulates sequence behavior for an arbitrary iterator.

The main problem with the approach is recognizing whether the incoming sequence already fulfills the requirements. For this it uses the following heuristics: it accesses `seq[0]`. If this raises `IndexError`, it assumes "sufficiently sequence like"; if it raises `AttributeError`, `TypeError` or `KeyError`, it assumes "not sufficiently sequence like". It also verifies that `seq` is not a mapping. For this it accesses `seq[None]`. Most sequence types will raise an exception if the index is not an integer or a slice. The heuristics check for `TypeError`, `ValueError` and `RuntimeError` (the latter because `ZTUtils.Lazy.Lazy` used this exception). If one of those exceptions is raised, it assumes "sufficiently sequence like".

The approach is not very robust as it makes assumptions about the raised exceptions. Nevertheless, the PR relies on explicit exception enumeration because in the context of `Zope` temporary exceptions can arise which we do not want to catch. I am open to other heuristics.
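For readers who prefer code to prose, here is a condensed sketch of the heuristics as described above. It is an illustration, not the PR's actual implementation; the function name is made up:
```python
def _looks_like_sequence(seq):
    """Heuristically decide whether *seq* already behaves like a sequence.

    Illustration only: it condenses the two checks described above.
    """
    # Check 1: integer subscription must work, or fail with IndexError
    # (which an empty sequence would legitimately raise).
    try:
        seq[0]
    except IndexError:
        pass                                   # empty, but sequence like
    except (AttributeError, TypeError, KeyError):
        return False                           # no usable integer subscription
    # Check 2: make sure it is not a mapping; real sequences reject None
    # as an index.  RuntimeError is included because ZTUtils.Lazy.Lazy
    # used to raise it.
    try:
        seq[None]
    except (TypeError, ValueError, RuntimeError):
        return True
    except KeyError:
        return False                           # a mapping keyed by integers
    return False                               # seq[None] succeeded: a mapping


print(_looks_like_sequence([1, 2, 3]))    # True
print(_looks_like_sequence({}.keys()))    # False -- dict key view
print(_looks_like_sequence({0: "x"}))     # False -- mapping with integer keys
```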