
Optimize CPU Blitting by Skipping Non-Contributing Pixels #3270

Open
itzpr3d4t0r opened this issue Dec 17, 2024 · 5 comments
Labels: Performance (related to the speed or resource usage of the project), Surface (pygame.Surface)


itzpr3d4t0r (Member) commented Dec 17, 2024

Blitting with the CPU is expensive. While some workarounds exist to maintain optimal performance for certain visual effects, many scenarios leave little room for improvement.

TL;DR

This issue proposes a setup that optimizes CPU blitting by skipping large, irrelevant pixel regions ("clumps") in surfaces, reducing wasted processing time.

  • Problem: Certain pixel colors, depending on the blend mode, contribute nothing but are still processed, wasting CPU cycles.
  • Solution: Precompute relevant pixel "clumps" during surface setup and skip unnecessary regions during blitting.
  • Benefits:
    • Eliminates per-pixel checks (unlike colorkey).
    • Works with any blend mode (generalized approach).
    • Outperforms RLE in real-world cases with non-uniform pixels.

The Problem

Consider the following surface:
[image: example zero]

It’s a small 16x13 pixel surface containing black and white pixels. The same logic applies to larger surfaces. Surfaces with irregular shapes—typically curved or arched—result in large unused portions of the image.
Now, let’s say we want to blit this surface using a blend flag, such as BLEND_ADD.

For reference, the BLEND_ADD flag adds each source color channel to the corresponding destination channel:

  • If the destination red channel is 100 and the source is 50, the result will be 150.
  • If the destination is black (0, 0, 0) and the source is white (255, 255, 255), the result is (255, 255, 255).

Here’s the key insight: most of the pixels in this surface are black (0 for all channels). When adding black pixels to any destination, the result is unchanged because destination + 0 = destination. Therefore, black pixels contribute nothing, and processing them is effectively wasted effort.
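To make the "contributes nothing" property concrete, a scalar BLEND_ADD on one channel boils down to a saturating add. The following is a minimal sketch of that idea, not pygame's actual blitter code:

```c
#include <stdint.h>

/* Saturating per-channel add, which is what BLEND_ADD does conceptually.
   add_channel(100, 50) == 150; add_channel(200, 100) saturates to 255;
   and add_channel(dst, 0) == dst, so an all-black source pixel leaves
   the destination untouched. */
static uint8_t add_channel(uint8_t dst, uint8_t src) {
    unsigned sum = (unsigned)dst + (unsigned)src;
    return (uint8_t)(sum > 255 ? 255 : sum);
}
```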

The issue lies in the blitting algorithm:

  • It processes every pixel—either individually or in vectorized chunks of 4 or 8—without regard to whether they contribute to the result.
  • Adding branching to skip irrelevant pixels would slow down vectorized code, which relies on consistent execution paths for performance.

To illustrate:
[image: example one]

In this case, 60% of the time is spent blitting nothing.

The Solution: Skip Non-Contributing Pixels

The proposed optimization skips over large, contiguous blocks of irrelevant pixels (referred to as "clumps") in two steps:

  1. Precompute clumps: Identify the positions and lengths of relevant pixel regions during surface setup (performed once).
  2. Blit only the relevant clumps: Use the precomputed data to skip unnecessary work during blitting.

[image: example two]

Implementation Details

  1. Precomputing Clumps
    This can be implemented as an additional surface setup function (e.g., similar to convert() or premul_alpha()). It introduces a lightweight data structure for clump metadata:

    • A pointer to an array of clumps (64 bits).
    • The number of clumps (32 bits).

    If no clumps are identified, the structure remains minimal (see the sketch after this list).

  2. Optimized Blitting
    A single optimized blit function processes the precomputed clumps. This can be generalized for any blend mode using macros, ensuring minimal maintenance cost.
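To make step 1 concrete, the clump metadata could look roughly like the following C sketch (the names pgClump and pgClumpMap are invented for illustration, not existing pygame structures):

```c
#include <stdint.h>

/* One contiguous run of contributing pixels on a single row.
   Field names and exact widths are illustrative only. */
typedef struct {
    uint32_t x;      /* starting column of the run */
    uint32_t y;      /* row of the run */
    uint32_t length; /* number of contributing pixels */
} pgClump;

/* Per-surface clump metadata: a pointer to the clump array (64 bits on
   most platforms) plus the number of clumps (32 bits). If no clumps were
   computed, clumps is NULL and count is 0, keeping the structure minimal. */
typedef struct {
    pgClump *clumps;
    uint32_t count;
} pgClumpMap;
```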

Comparison to Similar Approaches

  1. Colorkey Blitting
    Colorkey blits skip drawing pixels matching a specific color but still check each pixel individually.
    In contrast, this approach skips entire clumps of irrelevant pixels without any per-pixel checks.

    For example, if a row has 1000 pixels and only one is relevant:

    • Colorkey blitting performs 1000 checks.
    • This approach skips all checks and directly blits the relevant pixel.

    Additionally, colorkey is limited to blitcopy and basic alpha blending, whereas this optimization works for any blend mode (see the sketch after this list).

  2. RLE Acceleration
    Run-Length Encoding (RLE) compresses data by storing contiguous pixels of the same color (e.g., (100, red), (50, green)). While RLE excels in best-case scenarios (large uniform regions), it struggles when pixels are similar but not identical:

    • Example: (1, brown), (1, yellow), (1, green), (1, lightgreen)
      In such cases, RLE degenerates to standard blitting performance—or worse.

    This optimization does not focus on compression. Instead, it skips over large portions of irrelevant pixels, delivering significant performance gains across any surface pixel arrangement, not just in ideal cases.
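For illustration, the per-pixel check that a colorkey blit performs on each row looks conceptually like this sketch (not SDL's actual implementation); a clump-based blit would instead jump straight to the relevant run without testing anything:

```c
#include <stdint.h>

/* The per-pixel test a colorkey blit performs on every source pixel of a
   row: with 1000 pixels and only one relevant pixel, this is still 1000
   checks. A clump-based blit skips directly to the one relevant run. */
static void colorkey_blit_row(uint32_t *dst, const uint32_t *src,
                              int w, uint32_t colorkey) {
    for (int x = 0; x < w; x++) {
        if (src[x] != colorkey)  /* checked for every pixel, drawn or not */
            dst[x] = src[x];
    }
}
```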

itzpr3d4t0r (Member, Author) commented:

PS.
This is based on tests I made with a tentative implementation. These are the results:
[image]

=== blit_colorkey ===
Total: 239.3913035001351
Mean: 0.004776
Median: 0.003755
Stdev: 0.004270

=== blit_colorkey_rle ===
Total: 206.75287910009502
Mean: 0.002941
Median: 0.002340
Stdev: 0.002685

=== blit_normal_add ===
Total: 166.4761044000079
Mean: 0.003322
Median: 0.002535
Stdev: 0.003101

=== blit_drawmap ===
Total: 89.3658430999742
Mean: 0.001774
Median: 0.001411
Stdev: 0.001690

Using the following surface (scaled to the appropriate size before testing each dimension):
[image: logo_img]

Starbuck5 (Member) commented:

Questions.

What happens when the surface is mutated? Someone else blits to it, someone draws on it, anything.

How does this interact with vectorization? Does it clump into groups of 4/8 contiguous pixels when possible? Does it include "irrelevant pixels" in the clump to reach that? Could it automatically clump on alignment boundaries for aligned loads of the data?

This optimization does not focus on compression. Instead, it skips over large portions of irrelevant pixels, delivering significant performance gains across any surface pixel arrangement, not just in ideal cases.

Why not? If you're saying the rest of the surface is irrelevant, why not pack all the data into a smaller form and bring down the memory usage?

itzpr3d4t0r (Member, Author) commented:

What happens when the surface is mutated? Someone else blits to it, someone draws on it, anything.

If the surface is modified, the calculated clumps become invalid and must be recomputed. This can be handled in one of two ways:

  • Automatically: Recompute clumps when:

    • The blit/draw operation on this surface has finished.
    • Lazily, right before blitting this surface (but only once). This would involve using a flag to track whether the clumps are still valid.
  • Explicitly: Require the user to call a function like update_clumps() after any surface mutations.

An automatic recalculation immediately after modifications to the surface seems preferable to me, as it avoids cluttering the user’s code.
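A minimal sketch of the lazy variant, assuming a per-surface validity flag; the names pgClumpState, pg_mark_clumps_dirty, pg_ensure_clumps and pg_recompute_clumps are all made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Clump metadata as sketched earlier in the thread (illustrative names). */
typedef struct { uint32_t x, y, length; } pgClump;
typedef struct { pgClump *clumps; uint32_t count; } pgClumpMap;

/* Hypothetical per-surface bookkeeping for the lazy approach. */
typedef struct {
    pgClumpMap map;    /* the precomputed "drawmap" */
    bool clumps_valid; /* cleared whenever the pixels are mutated */
} pgClumpState;

/* Assumed to exist: the row-by-row scan that rebuilds the drawmap
   (the scan itself is sketched further down in this thread). */
void pg_recompute_clumps(pgClumpState *state, const uint32_t *pixels,
                         int w, int h);

/* Called from every code path that writes to the surface's pixels
   (blits onto it, draw calls, direct pixel access, ...). */
static void pg_mark_clumps_dirty(pgClumpState *state) {
    state->clumps_valid = false;
}

/* Called right before this surface is used as a blit source: recompute
   at most once, no matter how many mutations happened in between. */
static void pg_ensure_clumps(pgClumpState *state, const uint32_t *pixels,
                             int w, int h) {
    if (!state->clumps_valid) {
        pg_recompute_clumps(state, pixels, w, h);
        state->clumps_valid = true;
    }
}
```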


How does this interact with vectorization? Does it clump into groups of 4/8 contiguous pixels when possible? Does it include "irrelevant pixels" in the clump to reach that? Could it automatically clump on alignment boundaries for aligned loads of the data?

  • Clumping mechanics:
    Each clump contains the starting x and y positions on the destination surface and the number of pixels to draw. The algorithm processes each clump sequentially, regardless of whether the drawing uses AVX/SSE instructions or scalar code. Clumps are calculated row by row (see the sketch after this list).

    For example, if a row has:

    • 10 irrelevant pixels -> 5 relevant pixels -> 20 irrelevant pixels -> 15 relevant pixels,
      the clump precalculation will produce two clumps:
      (x=10, y=0, length=5) and (x=35, y=0, length=15).
      These clumps are added to a single array (internally referred to as the "drawmap") for blitting. During blitting, the algorithm:

      1. Reads the first clump.
      2. Moves to the source's (x, y) pixel position.
      3. Draws the source pixels up to the clump's length. This will use AVX/SSE or scalar code.
      4. Repeats for each clump within the clip rectangle.
  • Including irrelevant pixels:
    The algorithm does not include non-relevant pixels in the clumps.

  • Alignment boundaries:
    While the algorithm could clump on alignment boundaries for aligned loads, it’s not currently implemented. Achieving alignment would add complexity and isn’t guaranteed to yield significant performance gains. The current implementation is already highly efficient, so alignment concerns are more of a potential future optimization.
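Putting those clumping mechanics into code, here is a scalar-only sketch of both halves (building the drawmap and blitting from it). It assumes 32-bit ARGB pixels where an all-zero RGB value is the "irrelevant" case, as in the BLEND_ADD example; all names are invented and clipping is omitted:

```c
#include <stdint.h>
#include <stdlib.h>

/* One clump: a contiguous run of contributing pixels on one source row. */
typedef struct { uint32_t x, y, length; } pgClump;

/* A pixel "contributes" under BLEND_ADD when any RGB bit is set
   (0x00FFFFFF masks out the alpha byte of an ARGB pixel). */
static int contributes(uint32_t px) { return (px & 0x00FFFFFFu) != 0; }

/* Build the "drawmap": scan each row left to right and emit one clump per
   run of contributing pixels. Worst case is one clump per two pixels
   (alternating runs), which bounds the allocation. */
static pgClump *build_drawmap(const uint32_t *src, int w, int h,
                              uint32_t *out_count) {
    pgClump *clumps =
        malloc((size_t)h * (size_t)((w + 1) / 2) * sizeof(pgClump));
    uint32_t count = 0;
    if (!clumps) { *out_count = 0; return NULL; }
    for (int y = 0; y < h; y++) {
        int x = 0;
        while (x < w) {
            while (x < w && !contributes(src[y * w + x]))
                x++;                            /* skip an irrelevant run */
            int start = x;
            while (x < w && contributes(src[y * w + x]))
                x++;                            /* measure a relevant run */
            if (x > start) {
                clumps[count].x = (uint32_t)start;
                clumps[count].y = (uint32_t)y;
                clumps[count].length = (uint32_t)(x - start);
                count++;
            }
        }
    }
    *out_count = count;
    return clumps;
}

/* Saturating per-channel add on one ARGB pixel (scalar BLEND_ADD);
   the destination alpha is kept as-is. */
static uint32_t blend_add_pixel(uint32_t d, uint32_t s) {
    uint32_t out = d & 0xFF000000u;
    for (int shift = 0; shift < 24; shift += 8) {
        uint32_t sum = ((d >> shift) & 0xFFu) + ((s >> shift) & 0xFFu);
        out |= (sum > 255u ? 255u : sum) << shift;
    }
    return out;
}

/* Blit only the clumps: jump straight to each run and never touch the
   pixels in between. dx/dy are the blit offsets on the destination and
   are assumed to keep every clump inside it (clipping omitted). */
static void blit_add_clumps(uint32_t *dst, int dst_w,
                            const uint32_t *src, int src_w,
                            const pgClump *clumps, uint32_t count,
                            int dx, int dy) {
    for (uint32_t i = 0; i < count; i++) {
        const pgClump *c = &clumps[i];
        const uint32_t *s = src + (int)c->y * src_w + (int)c->x;
        uint32_t *d = dst + ((int)c->y + dy) * dst_w + ((int)c->x + dx);
        for (uint32_t k = 0; k < c->length; k++)
            d[k] = blend_add_pixel(d[k], s[k]);  /* AVX/SSE would go here */
    }
}
```

A vectorized backend would simply replace the inner loop of blit_add_clumps with AVX/SSE adds over each clump.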


Why not pack the relevant data into a smaller structure and reduce memory usage, if the rest of the surface is irrelevant?

Packing data into a smaller form wouldn’t reduce memory usage, as the SDL_Surface pixels would still be allocated. This acceleration structure is supplemental and adds to memory usage.

For example, storing the x, y, and length of each clump (as Uint32 values) costs 12 bytes per clump, while a single pixel is only 4 bytes. If we expanded clumps to include arrays of "sub-clumps" tracking runs of the same color within a clump, it would require even more memory (an additional 8 bytes per color run: 4 for the color and 4 for the length).

This approach would also introduce limitations similar to those of colorkey operations, where it only optimizes for contiguous same-colored pixels. More complex pixel arrangements would lead to slower blitting.

Overall, I think it’s better to keep the implementation agnostic to pixel arrangements by default. These kinds of features could be explored later as opt-in enhancements.

Starbuck5 (Member) commented:

If you can automatically detect when a surface has changed, you could apply this optimization internally and automatically. You could even automatically convert() or convert_alpha() into special internal surfaces if a surface is blit repeatedly onto something with an incompatible pixelformat.

In terms of clumping mechanics in your example, I would expect it to be faster to clump to (x=10, y=0, length=8) and (x=34, y=0, length=16), especially on an SSE backend, without the masked load/store strategy we use for AVX.
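One way to realize that widening suggestion, sketched under the assumption of a power-of-two SIMD width; this is illustrative only, not something pygame implements, and it reproduces the (x=34, length=16) example above:

```c
#include <stdint.h>

/* As sketched earlier in the thread (illustrative names). */
typedef struct { uint32_t x, y, length; } pgClump;

/* Widen a clump so its length is a multiple of the SIMD width (e.g. 4 for
   SSE, 8 for AVX2), absorbing a few non-contributing pixels. Those extra
   pixels add zero under BLEND_ADD, so the result is unchanged. If the
   widened clump would run past the end of the row, slide its start back
   instead: (x=10, len=5) -> (x=10, len=8), (x=35, len=15) -> (x=34, len=16)
   on a 50-pixel row. simd_width must be a power of two. */
static void widen_clump(pgClump *c, uint32_t simd_width, uint32_t row_width) {
    uint32_t len = (c->length + simd_width - 1) & ~(simd_width - 1);
    if (len > row_width)
        len = row_width;            /* clump can't exceed the row */
    if (c->x + len > row_width)
        c->x = row_width - len;     /* slide the start back to fit */
    c->length = len;
}
```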

Packing data into a smaller form wouldn’t reduce memory usage, as the SDL_Surface pixels would still be allocated. This acceleration structure is supplemental and adds to memory usage.

Well, you see, I was sort of thinking this would create a read-only surface, which I guess is not the case given your other comment. The idea makes sense if you're creating a read-only surface: you wouldn't need to keep the SDL_Surface pixels around at all.

Starbuck5 (Member) commented:

Also the elephant in the room here is that we should probably focus on GPU rendering rather than further improvements to software rendering.
