Optimize CPU Blitting by Skipping Non-Contributing Pixels #3270
Comments
Questions. What happens when the surface is mutated? Someone else blits to it, someone draws on it, anything. How does this interact with vectorization? Does it clump into groups of 4/8 contiguous pixels when possible? Does it include "irrelevant pixels" in the clump to reach that? Could it automatically clump on alignment boundaries for aligned loads of the data?
Why not? If you're saying the rest of the surface is irrelevant why not pack all the data into a smaller form and bring down the memory usage?
If the surface is modified, the calculated clumps become invalid and must be recomputed. This can be handled in one of two ways:
An automatic recalculation immediately after modifications to the surface seems preferable to me, as it avoids cluttering the user’s code.
Packing data into a smaller form wouldn’t reduce memory usage, as the For example, storing the This approach would also introduce limitations similar to those of colorkey operations, where it only optimizes for contiguous same-colored pixels. More complex pixel arrangements would lead to slower blitting. Overall, I think it’s better to keep the implementation agnostic to pixel arrangements by default. These kinds of features could be explored later as opt-in enhancements.
If you can automatically detect when a surface has changed, you could apply this optimization internally and automatically. You could even automatically convert() or convert_alpha() into special internal surfaces if a surface is blitted repeatedly onto something with an incompatible pixel format. In terms of clumping mechanics in your example, I would expect it to be faster to clump to
Well you see I was sort of thinking this would create a read only surface, which I guess is not the case given your other comment. This idea makes sense if you're creating a read only surface, you wouldn't need to keep around SDL_Surface pixels at all.
Also the elephant in the room here is that we should probably focus on GPU rendering rather than further improvements to software rendering.
Blitting with the CPU is expensive. While some workarounds exist to maintain optimal performance for certain visual effects, many scenarios leave little room for improvement.
TL;DR
This issue proposes a setup that optimizes CPU blitting by skipping large, irrelevant pixel regions ("clumps") in surfaces, reducing wasted processing time.
The Problem
Consider the following surface:
It’s a small 16x13 pixel surface containing black and white pixels. The same logic applies to larger surfaces. Surfaces with irregular shapes—typically curved or arched—result in large unused portions of the image.
Now, let’s say we want to blit this surface using a blend flag, such as BLEND_ADD. For reference, the BLEND_ADD flag adds each source color channel to the corresponding destination channel.
Here’s the key insight: most of the pixels in this surface are black (0 for all channels). When adding black pixels to any destination, the result is unchanged because destination + 0 = destination. Therefore, black pixels contribute nothing, and processing them is effectively wasted effort.
The issue lies in the blitting algorithm:
To illustrate:
In this case, 60% of the time is spent blitting nothing.
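The observation that black pixels contribute nothing under BLEND_ADD can be checked with a small sketch. It is plain Python rather than the library's C blitter, and `blend_add` is a hypothetical helper name; it assumes 8-bit channels and that the additive blend saturates at 255.

```python
def blend_add(src, dst):
    """Add each source channel to the destination channel, clamping at 255.

    Hypothetical helper for illustration only; assumes 8-bit channels.
    """
    return tuple(min(s + d, 255) for s, d in zip(src, dst))


dst = (120, 200, 40)
black = (0, 0, 0)
white = (255, 255, 255)

# A black source pixel leaves the destination untouched.
print(blend_add(black, dst))  # (120, 200, 40)

# A white source pixel saturates every channel.
print(blend_add(white, dst))  # (255, 255, 255)
```

Every black source pixel therefore costs a read, an add, and a write that produce the value already stored in the destination.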
The Solution: Skip Non-Contributing Pixels
The proposed optimization skips over large, contiguous blocks of irrelevant pixels (referred to as "clumps") in two steps:
Implementation Details
Precomputing Clumps
This can be implemented as an additional surface setup function (e.g., similar to convert() or premul_alpha()). It introduces a lightweight data structure for clump metadata. If no clumps are identified, the structure remains minimal.
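The metadata layout itself is not shown in this thread, so the following is only one possible shape for it, sketched in Python. It assumes the precompute step stores, per row, the spans that still need processing (the gaps between spans are the skipped clumps). The name `find_spans` and the `(start_x, length)` tuple layout are hypothetical.

```python
def find_spans(pixels, is_irrelevant):
    """Return, for each row, a list of (start_x, length) spans of
    contributing pixels. Everything outside a span is a skipped clump.

    Hypothetical layout; the real proposal would live in C.
    """
    spans = []
    for row in pixels:
        row_spans = []
        start = None  # x of the span currently being built, if any
        for x, px in enumerate(row):
            if is_irrelevant(px):
                if start is not None:
                    row_spans.append((start, x - start))
                    start = None
            elif start is None:
                start = x
        if start is not None:  # span runs to the end of the row
            row_spans.append((start, len(row) - start))
        spans.append(row_spans)
    return spans


B, W = (0, 0, 0), (255, 255, 255)
surface = [
    [B, B, W, W, B, B],
    [B, W, W, W, W, B],
    [B, B, B, B, B, B],
]

# For an additive blend, black is the non-contributing color.
spans = find_spans(surface, lambda px: px == (0, 0, 0))
print(spans)  # [[(2, 2)], [(1, 4)], []]
```

A fully black row yields an empty list, which matches the point that the structure stays minimal when there is little to record.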
Optimized Blitting
A single optimized blit function processes the precomputed clumps. This can be generalized for any blend mode using macros, ensuring minimal maintenance cost.
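As a rough illustration of such a blit, the sketch below walks precomputed spans and never tests individual pixels. It assumes a hypothetical per-row list of `(start_x, length)` spans covering the contributing pixels; `blit_spans` and `blend_add` are illustrative names, and passing the blend as a function stands in for the macro-based generalization mentioned above.

```python
def blend_add(src, dst):
    """Saturating per-channel add (assumes 8-bit channels)."""
    return tuple(min(s + d, 255) for s, d in zip(src, dst))


def blit_spans(src, spans, dst, blend):
    """Blend src onto dst in place, visiting only the precomputed spans.

    Returns the number of pixels actually processed.
    """
    visited = 0
    for y, row_spans in enumerate(spans):
        for start, length in row_spans:
            for x in range(start, start + length):
                dst[y][x] = blend(src[y][x], dst[y][x])
                visited += 1
    return visited


B, W = (0, 0, 0), (255, 255, 255)
src = [
    [B, B, W, W, B, B],
    [B, W, W, W, W, B],
    [B, B, B, B, B, B],
]
# Hypothetical precomputed metadata for src: (start_x, length) per row.
spans = [[(2, 2)], [(1, 4)], []]
dst = [[(10, 20, 30)] * 6 for _ in range(3)]

visited = blit_spans(src, spans, dst, blend_add)
print(visited)    # 6 of the 18 pixels were processed
print(dst[0][2])  # (255, 255, 255)
print(dst[0][0])  # (10, 20, 30), never touched
```

The inner loop over a span is a plain contiguous range, so in a C implementation it would be the natural place for the existing vectorized blend code.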
Comparison to Similar Approaches
Colorkey Blitting
Colorkey blits skip drawing pixels matching a specific color but still check each pixel individually.
In contrast, this approach skips entire clumps of irrelevant pixels without any per-pixel checks.
For example, if a row has 1000 pixels and only one is relevant:
Additionally, colorkey is limited to blitcopy and basic alpha blending, whereas this optimization works for any blend mode.
RLE Acceleration
Run-Length Encoding (RLE) compresses data by storing contiguous pixels of the same color (e.g., (100, red), (50, green)). While RLE excels in best-case scenarios (large uniform regions), it struggles when pixels are similar but not identical: (1, brown), (1, yellow), (1, green), (1, lightgreen)…
In such cases, RLE degenerates to standard blitting performance, or worse.
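The two RLE cases can be reproduced with a minimal encoder. This is a generic run-length sketch using color names as stand-ins for pixel values, not SDL's RLE implementation; `rle_encode` is a hypothetical name.

```python
from itertools import groupby


def rle_encode(pixels):
    """Collapse consecutive equal pixels into (run_length, color) pairs."""
    return [(len(list(group)), color) for color, group in groupby(pixels)]


uniform = ["red"] * 100 + ["green"] * 50
noisy = ["brown", "yellow", "green", "lightgreen"]

# Best case: 150 pixels collapse into two runs.
print(rle_encode(uniform))  # [(100, 'red'), (50, 'green')]

# Worst case: every pixel becomes its own run, so nothing is saved.
print(rle_encode(noisy))
# [(1, 'brown'), (1, 'yellow'), (1, 'green'), (1, 'lightgreen')]
```

In the noisy case the encoded form has as many entries as the input has pixels, plus the per-run length overhead.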
This optimization does not focus on compression. Instead, it skips over large portions of irrelevant pixels, delivering significant performance gains across any surface pixel arrangement, not just in ideal cases.