Dither and Quantize API and Performance Improvements. #1138
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1138 +/- ##
==========================================
+ Coverage 82.23% 82.27% +0.03%
==========================================
Files 678 678
Lines 29192 29216 +24
Branches 3284 3278 -6
==========================================
+ Hits 24007 24038 +31
+ Misses 4488 4483 -5
+ Partials 697 695 -2
Is the "all memory in for speed" approach really what we want with Gif? Just pushed a change to also run the benchmark against Note that the theoretical maximum There are alternative approaches that would sacrifice some of the speed but allocate much less. The most naive thing I can think of: binary search, hopefully less than 10x slower, could be done without going down the rabbit hole researching literature. Do you remember how much slower was the linear search without caching? Can't find the PR introducing the cache, must be older than history itself :) |
Ok, I realized binary search won't be enough, so I guess we can't find a good solution in a reasonable time :(
Looks like this could be the perfect use case scenario for a [...]
These micro-optimizations are not worth the effort right now IMO; we should rather focus on finding a better algorithm. An efficient k-d tree implementation may do the job. This one looks promising. We can actually combine it with other techniques. I have an idea now, and I may be able to code it in a couple of hours. But in this PR I'll focus on API review from now on.
Regarding locking vs ConcurrentDictionary: we can get rid of locking entirely; we just need a first pass to fill the dictionary.
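A rough sketch of that two-pass idea; the helper and parameter names are hypothetical, not code from this PR:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using SixLabors.ImageSharp.PixelFormats;

// Hypothetical helper illustrating the two-pass approach; findClosestPaletteIndex
// stands in for whatever palette search is used (linear, k-d tree, ...).
public static class TwoPassRemapper
{
    public static void Remap(Rgba32[] sourcePixels, byte[] indexedPixels, Func<Rgba32, byte> findClosestPaletteIndex)
    {
        // First pass (single-threaded): resolve every distinct color once.
        var map = new Dictionary<Rgba32, byte>();
        foreach (Rgba32 pixel in sourcePixels)
        {
            if (!map.ContainsKey(pixel))
            {
                map[pixel] = findClosestPaletteIndex(pixel);
            }
        }

        // Second pass: the dictionary is no longer mutated, so concurrent reads
        // are safe and the remapping can run in parallel without any locking.
        Parallel.For(0, sourcePixels.Length, i => indexedPixels[i] = map[sourcePixels[i]]);
    }
}
```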
@antonfirsov @Sergio0694 I would personally love to see a better implementation. I deliberately kept the [...]. Looking forward to seeing what you can come up with!
antonfirsov left a comment:
Mostly good, only one remark
    [Theory]
    [WithFile(TestImages.Bmp.Car, PixelTypes.Rgba32)]
    public void EncodeAllocationCheck<TPixel>(TestImageProvider<TPixel> provider)
I usually prefer to put these under ImageSharp.Tests/ProfilingBenchmarks, and keep them skipped.
I didn't actually mean to check that in!
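For reference, a sketch of how such a profiling test is typically kept skipped; the Skip reason text and generic constraint here are assumptions:

```csharp
// Illustrative only: xunit's Skip property keeps the theory out of normal test runs,
// so it can be re-enabled locally when profiling the encoder.
[Theory(Skip = "Profiling benchmark, enable manually.")]
[WithFile(TestImages.Bmp.Car, PixelTypes.Rgba32)]
public void EncodeAllocationCheck<TPixel>(TestImageProvider<TPixel> provider)
    where TPixel : struct, IPixel<TPixel>
{
    // ... encode with the GifEncoder and inspect allocations ...
}
```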
    - /// <returns>The <see cref="ReadOnlyMemory{TPixel}"/> palette.</returns>
    - ReadOnlyMemory<TPixel> BuildPalette(ImageFrame<TPixel> source, Rectangle bounds);
    + /// <returns>The <see cref="ReadOnlySpan{TPixel}"/> palette.</returns>
    + ReadOnlySpan<TPixel> BuildPalette(ImageFrame<TPixel> source, Rectangle bounds);
Can't find any related guideline point, but I have a bad feeling about returning a Span as the "result" of an operation, because the ownership of the backing memory is not clear from the API.
I think we should return void here, and publish a separate Palette property instead.
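A minimal sketch of the shape being suggested here, showing only the relevant members; this is illustrative, not a final API, and the generic constraint and namespaces are assumptions:

```csharp
using System;
using SixLabors.ImageSharp;
using SixLabors.ImageSharp.PixelFormats;

// Illustrative alternative shape: BuildPalette does the work, and the result is
// exposed via a property, so the quantizer clearly owns the backing memory.
public interface IFrameQuantizer<TPixel>
    where TPixel : struct, IPixel<TPixel>
{
    // Populated by BuildPalette; empty (or throwing) before it has run.
    ReadOnlyMemory<TPixel> Palette { get; }

    void BuildPalette(ImageFrame<TPixel> source, Rectangle bounds);
}
```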
You've hit upon a point that led to me coding about 5 different versions of this.
I struggled with a Palette property for a couple of reasons:
- The result is temporally coupled since it is not assigned on construction. The palette could be null (if using ReadOnlyMemory<TPixel>) or an empty ReadOnlySpan<TPixel>.
- Using ReadOnlyMemory<TPixel> can lead to a larger palette length than actually required.
The OctreeFrameQuantizer doesn't return the final number of entries until you build the palette. You have to build it, then slice the result, reducing the length by the determined index count; returning ReadOnlySpan allowed this.
If you don't do this you can end up with a larger Gif than you actually need. The WuFrameQuantizer is similar, but at least you know the length early on.
I'll give everything one last look over and see if I can improve things.
> The palette could be null (if using ReadOnlyMemory<TPixel>) or an empty ReadOnlySpan<TPixel>.
We can also throw an InvalidOperationException if the palette has not yet been built.
> Using ReadOnlyMemory<TPixel> can lead to a larger palette length than actually required. [...] You have to build it then slice the result, reducing the length by the determined index count; returning ReadOnlySpan allowed this.
You can also slice a ReadOnlyMemory<T>.
This is not a strong opinion, however. Alternatively we can clarify the ownership model in the docs of BuildPalette() without changing the API.
> You can also slice a ReadOnlyMemory.
Why oh why has that never occurred to me before?
Quick question...
Would you consider it OK, API-wise, for the QuantizedFrame to reference the FrameQuantizer palette rather than copying it? Currently I pass the span, copy it, and use it for the lifetime of the frame.
Scratch that. It leads to messy code in the Gif Encoder.
@antonfirsov I gave it another pass and cleaned it up quite a lot. Still not happy with perf and allocations but it's incrementally improving.
antonfirsov left a comment:
Looks good now! Let's just be more careful with stackalloc, maybe.
    // Bulk convert our palette to RGBA to allow assignment to tables.
    // Palette length maxes out at 256 so safe to stackalloc.
    Span<Rgba32> rgbaPaletteSpan = stackalloc Rgba32[paletteLength];
In theory, yes, but I can't spot any safety mechanism to enforce this. We should at least Guard against it; stackallocking a non-const buffer is super unsafe.
Another option is to stackalloc when paletteLength < 64 and create an array otherwise. We should not stackalloc 256 * sizeof(Rgba32) == 1KB of data anyway.
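A sketch of the suggested guard, assuming C# 8 expression-level stackalloc is available; the threshold, method shape and names are illustrative rather than taken from the PR:

```csharp
using System;
using SixLabors.ImageSharp.PixelFormats;

// Illustrative only: small palettes stay on the stack, larger ones fall back to a
// heap array, so an unexpected paletteLength can never blow the stack.
static void WritePalette<TPixel>(ReadOnlySpan<TPixel> palette)
    where TPixel : struct, IPixel<TPixel>
{
    const int StackAllocThreshold = 64;
    int paletteLength = palette.Length;

    Span<Rgba32> rgbaPaletteSpan = paletteLength <= StackAllocThreshold
        ? stackalloc Rgba32[StackAllocThreshold] // constant-sized stack buffer
        : new Rgba32[paletteLength];             // heap fallback for larger palettes
    rgbaPaletteSpan = rgbaPaletteSpan.Slice(0, paletteLength);

    // ... bulk-convert 'palette' into rgbaPaletteSpan and write it out ...
}
```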
Also, a small sidenote: keep in mind that stackalloc zeroes out memory right now, and the current implementation is quite inefficient since it's basically just a busy loop with a series of push 0x0, plus additional code to handle stack overflows etc.
Eg. doing stackalloc int[256] produces (among other things) this loop:

    L001c: mov eax, 0x40
    L0021: push 0x0
    L0023: push 0x0
    L0025: dec rax
    L0028: jnz L0021

Check out the complete sample code here, it actually doesn't look that great right now: https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIGYACMhgYQYG8aHunHjykDAJYA7DAwCyACgCUXHp2o9lDAMoAHbCIA8ojAD4GuTSIYBeIxmxgA1tgA29iGGFiA2qQCsSALoBueRUGQJViAHYjEzcABn8QgF8aeKA==
You might want to consider just renting an array (which also skips the memory clear), especially if using stackalloc could introduce other security/safety concerns.
There's a proposal for a [SkipLocalsInit] attribute for C# 9.0 I think, but it's still under prototyping. And the zeroing optimization Ben Adams is doing is still an open PR possibly coming with .NET 5.
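And a sketch of the renting alternative, again with illustrative names; ArrayPool.Rent does not clear the returned buffer, so the zeroing cost disappears as well:

```csharp
using System;
using System.Buffers;
using SixLabors.ImageSharp.PixelFormats;

// Illustrative only: rent a buffer at least paletteLength long, slice it to the
// exact size, and return it when done. Rent does not clear the array.
static void WritePaletteRented(int paletteLength)
{
    Rgba32[] rented = ArrayPool<Rgba32>.Shared.Rent(paletteLength);
    try
    {
        Span<Rgba32> rgbaPaletteSpan = rented.AsSpan(0, paletteLength);
        // ... bulk-convert the palette into rgbaPaletteSpan and write it out ...
    }
    finally
    {
        ArrayPool<Rgba32>.Shared.Return(rented);
    }
}
```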
Palette size is limited in a couple of places...
- In the setter for QuantizerOptions.
- In the constructor for IndexedImageFrame.

I can use the constant again and slice, though, if you want...
I thought 1Kb was the general rule for a maximum; should we be more conservative?
From the linked blog post:
> I won’t prescribe anything specific, but anything larger than a kilobyte is a point of concern.
Ok, I see now that the IndexedImageFrame constructor should prevent anything from going wrong.
1Kb is the maximum, but probably not an optimum, or at least not always. But it doesn't really matter in WritePaletteChunk, it's not a hot path, just converting & copying a short slice of data.
I think I'll just revert stackalloc for now looking at the codegen.
I knew that would've scared you 🤣
You can always make things even more scary with the Legacy JIT 😆
I've run that legacy asm through a decompiler, and I got this:

    public static int M()
    {
        Span<int> span = stackalloc int[256];
        CalculateMassOfTheSun();
        return span[0];
    }

    private void BuildCube()
    {
    -   Span<double> vv = stackalloc double[this.colors];
    +   Span<double> vv = stackalloc double[this.maxColors];
See previous comment.
Ah yes, that's too big! 2048!!
We should probably merge this as-is, and continue further experiments from here. It's likely that we can introduce all necessary improvements without breaking the current API.
Description
The IDither interface contained implementation-specific parameters in the method signatures. This has been simplified to the absolute bare minimum of parameters. I also managed to remove boxing from the palette dither type.

In addition I simplified the IFrameQuantizer interface to remove palette reference ambiguity in the API.

I noticed some performance problems with the GifEncoder during development, and as part of my refactoring I tackled some of those issues. Mean performance is around 3-4x faster with 25% fewer allocations on 2 out of 3 targets. (I have no idea what is so different about .NET Core 2.1 memory usage.)

Note: Allocations are due to the usage of ConcurrentDictionary<TPixel, int> in the pixel map, which cannot be avoided. I normalized the concurrency count across target frameworks to at least ensure allocations do not go wild on NETFX.
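For illustration only (the exact values the PR uses are not shown here, so these are assumptions), pinning the concurrency level and capacity explicitly is the usual way to keep ConcurrentDictionary's per-framework defaults from diverging:

```csharp
using System;
using System.Collections.Concurrent;
using SixLabors.ImageSharp.PixelFormats;

// Assumed values for illustration: an explicit concurrencyLevel and capacity keep
// the ConcurrentDictionary's internal lock/bucket allocations consistent across
// .NET Framework and .NET Core instead of relying on per-framework defaults.
var cache = new ConcurrentDictionary<Rgba32, int>(
    concurrencyLevel: Environment.ProcessorCount,
    capacity: 64);
```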
I've introduced some Unsafe.Add in the dithering and quantization loops, but safe in the knowledge that bounds are carefully constrained long before those methods are called.

Before

After
A slight fix to the color matching code was also implemented, which led to more accurate matching and distribution of error.
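For context on the Unsafe.Add remark above, here is a hedged sketch, not the actual ImageSharp loop, of the pattern: bounds are validated once up front, after which indexing through a reference avoids per-element bounds checks.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public static class RowOps
{
    // Illustrative only: the real dithering/quantization loops do more work per
    // pixel, but the bounds-check-once-then-Unsafe.Add shape is the same.
    public static void CopyRow(ReadOnlySpan<byte> source, Span<byte> destination, int width)
    {
        if ((uint)width > (uint)source.Length || (uint)width > (uint)destination.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(width));
        }

        ref byte sourceBase = ref MemoryMarshal.GetReference(source);
        ref byte destinationBase = ref MemoryMarshal.GetReference(destination);

        for (int x = 0; x < width; x++)
        {
            // Safe because width was constrained against both spans above.
            Unsafe.Add(ref destinationBase, x) = Unsafe.Add(ref sourceBase, x);
        }
    }
}
```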