Skip to content

Conversation

@JimBobSquarePants
Copy link
Member

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following matches the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

The IDither interface contained implementation specific parameters in the method signatures. This has been simplified to the absolute bare minimum possible parameters. I also managed to remove boxing from the palette dither type.

In addition I simplified the IFrameQuantizer interface to remove palette reference ambiguity in the API.

I noticed some performance problems with the GifEncoder during development and as part of my refactoring I tackled some of those issues. Mean performance is around 3-4X faster with 25% less allocations on 2 out of 3 targets. (I have no idea what is so different in NET Core 2.1 memory usage)

Note: Allocations are due to usage of ConcurrentDictionary<TPixel,int> in the pixel map which cannot be avoided. I normalized the concurrency count across target frameworks to at least ensure allocations do not go wild on NETFX.

I've introduces some Unsafe.Add in the dithering and quantization loops but safe in the knowledge that bounds are carefully constrained long before those methods are called.

Before

Method Runtime Mean Error StdDev Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
'System.Drawing Gif' .NET 4.7.2 12.93 ms 23.910 ms 1.311 ms 1.00 0.00 234.3750 - - 1093.11 KB
'ImageSharp Gif' .NET 4.7.2 86.65 ms 14.582 ms 0.799 ms 6.74 0.60 1142.8571 714.2857 285.7143 8155.4 KB
'System.Drawing Gif' .NET Core 2.1 12.09 ms 2.494 ms 0.137 ms 1.00 0.00 234.3750 - - 1089.88 KB
'ImageSharp Gif' .NET Core 2.1 97.67 ms 128.094 ms 7.021 ms 8.08 0.55 800.0000 200.0000 - 737.43 KB
'System.Drawing Gif' .NET Core 3.1 12.01 ms 1.054 ms 0.058 ms 1.00 0.00 15.6250 - - 198.06 KB
'ImageSharp Gif' .NET Core 3.1 71.03 ms 26.512 ms 1.453 ms 5.91 0.13 1142.8571 714.2857 285.7143 8123.59 KB

After

Method Runtime Mean Error StdDev Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
'System.Drawing Gif' .NET 4.7.2 10.66 ms 1.325 ms 0.073 ms 1.00 0.00 234.3750 - - 1093.01 KB
'ImageSharp Gif' .NET 4.7.2 27.68 ms 7.814 ms 0.428 ms 2.60 0.05 1031.2500 625.0000 281.2500 6509.24 KB
'System.Drawing Gif' .NET Core 2.1 11.01 ms 10.751 ms 0.589 ms 1.00 0.00 234.3750 - - 1089.88 KB
'ImageSharp Gif' .NET Core 2.1 24.18 ms 3.775 ms 0.207 ms 2.20 0.10 1031.2500 656.2500 312.5000 740.78 KB
'System.Drawing Gif' .NET Core 3.1 10.49 ms 1.687 ms 0.092 ms 1.00 0.00 15.6250 - - 197.95 KB
'ImageSharp Gif' .NET Core 3.1 27.27 ms 41.280 ms 2.263 ms 2.60 0.23 1031.2500 625.0000 281.2500 6453.76 KB

A slight fix to the color matching code was implemented also with led to more accurate matching and distribution of error.

@codecov
Copy link

codecov bot commented Mar 3, 2020

Codecov Report

Merging #1138 into master will increase coverage by 0.03%.
The diff coverage is 96.59%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1138      +/-   ##
==========================================
+ Coverage   82.23%   82.27%   +0.03%     
==========================================
  Files         678      678              
  Lines       29192    29216      +24     
  Branches     3284     3278       -6     
==========================================
+ Hits        24007    24038      +31     
+ Misses       4488     4483       -5     
+ Partials      697      695       -2
Flag Coverage Δ
#unittests 82.27% <96.59%> (+0.03%) ⬆️
Impacted Files Coverage Δ
src/ImageSharp/Formats/Gif/LzwEncoder.cs 98.19% <ø> (ø) ⬆️
...g/Processors/Dithering/OrderedDither.KnownTypes.cs 100% <ø> (ø) ⬆️
...ImageSharp/Formats/Png/PngEncoderOptionsHelpers.cs 86.07% <ø> (ø) ⬆️
...y/Allocators/ArrayPoolMemoryAllocator.Buffer{T}.cs 80% <0%> (ø) ⬆️
...ocessors/Quantization/QuantizeProcessor{TPixel}.cs 100% <100%> (ø) ⬆️
...cessing/Processors/Quantization/OctreeQuantizer.cs 100% <100%> (ø) ⬆️
...essing/Processors/Quantization/PaletteQuantizer.cs 100% <100%> (ø) ⬆️
...sors/Quantization/PaletteFrameQuantizer{TPixel}.cs 100% <100%> (ø) ⬆️
.../Processing/Processors/Quantization/WuQuantizer.cs 100% <100%> (ø) ⬆️
...p/Processing/Processors/Dithering/OrderedDither.cs 98.76% <100%> (-0.14%) ⬇️
... and 16 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ce8c213...71e5a6d. Read the comment docs.

@antonfirsov
Copy link
Member

antonfirsov commented Mar 4, 2020

PaletteFrameQuantizer<T> and EuclideanPixelMap<T> were being around here for a while, but unfortunately I did not have a chance to have a deeper look at the implementation before, so sorry for bringing this question to the table this late, but we need to carefully think this over.

Is the "all memory in for speed" approach really what we want with Gif? Just pushed a change to also run the benchmark against rgb-48bpp.png, which is an image of wider color variety:

|               Method |         TestImage | Ratio |     Gen 0 |     Gen 1 |    Gen 2 |   Allocated |
|--------------------- |------------------ |------:|----------:|----------:|---------:|------------:|
| 'System.Drawing Gif' |       Bmp/Car.bmp |  1.00 |   15.6250 |         - |        - |   197.95 KB |
|     'ImageSharp Gif' |       Bmp/Car.bmp |  2.33 | 1093.7500 |  656.2500 | 312.5000 |  6453.27 KB |
|                      |                   |       |           |           |          |             |
| 'System.Drawing Gif' | Png/rgb-48bpp.png |  1.00 |         - |         - |        - |   393.23 KB |
|     'ImageSharp Gif' | Png/rgb-48bpp.png |  2.91 | 3250.0000 | 1500.0000 | 500.0000 | 23105.69 KB |

Note that the theoretical maximum distanceCache for Rgba32 is 256^4 ~= 4 billion (!) elements.

There are alternative approaches that would sacrifice some of the speed but allocate much less. The most naive thing I can think of: binary search, hopefully less than 10x slower, could be done without going down the rabbit hole researching literature. Do you remember how much slower was the linear search without caching? Can't find the PR introducing the cache, must be older than history itself :)

@antonfirsov
Copy link
Member

antonfirsov commented Mar 4, 2020

Ok, I realized binary search won't be enough, so I guess we can't find a good solution in a reasonable time :(

@Sergio0694
Copy link
Member

Looks like this could be the perfect use case scenario for a ConcurrentDictionarySlim<TKey, TValue> collection, if it existed. If we care about memory usage, I'm wondering whether maybe it'd be worth it to run some tests with a DictionarySlim<TKey, TValue> (see here) that is accessed either by a lock, or maybe even with just a shared SpinLock? Memory usage should go down, especially because DictionarySlim<TKey, TValue> doesn't box value types, plus it's meant for types with a fast GetHashCode implementation, so it doesn't store a copy of the code in the internal table, which saves an additional 4 bytes per entry. Could be worth looking into maybe? 🤔

@antonfirsov
Copy link
Member

antonfirsov commented Mar 4, 2020

These micro-optimizations do not worth the efforts now IMO, we should rather focus on finding a better algorithm. An efficient k-d tree implementagion may do the job. This one looks promising. We can actually combine it with other techniques. I have an idea now, and I may be able to code it in a couple of hours. But in this PR I'll focus on API review from now on.

@antonfirsov
Copy link
Member

Regarding locking vs ConcurrentDictionary: we can get rid of locking entirely, we just need a first pass to fill the dictionary.

@JimBobSquarePants
Copy link
Member Author

@antonfirsov @Sergio0694 I would personally love to see a better implementation. I deliberately kept the EuclideanPixelMap<T> internal because I was not happy with the approach.

Looking forward to seeing what you can come up with!

Copy link
Member

@antonfirsov antonfirsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly good, only one remark


[Theory]
[WithFile(TestImages.Bmp.Car, PixelTypes.Rgba32)]
public void EncodeAllocationCheck<TPixel>(TestImageProvider<TPixel> provider)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually prefer to put these under ImageSharp.Tests/ProfilingBenchmarks, and keep them skipped.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't actually mean to check that in!

/// <returns>The <see cref="ReadOnlyMemory{TPixel}"/> palette.</returns>
ReadOnlyMemory<TPixel> BuildPalette(ImageFrame<TPixel> source, Rectangle bounds);
/// <returns>The <see cref="ReadOnlySpan{TPixel}"/> palette.</returns>
ReadOnlySpan<TPixel> BuildPalette(ImageFrame<TPixel> source, Rectangle bounds);
Copy link
Member

@antonfirsov antonfirsov Mar 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't find any related guideline point, but I have a bad feeling returning a Span as a "result" of an operation, because the ownership of the backing memory is not clear from the API.

I think we should return void here, and publish a separate Palette property instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've hit upon a point that led to me coding about 5 different version of this.

I struggled with a Palette property for a couple of reasons:

  1. The result is temporally coupled since it is not assigned on construction. The palette could be null (If using ReadOnlyMemory<TPixel>) or empty ReadOnlySpan<TPixel>
  2. Using ReadOnlyMemory<TPixel> can lead to a larger palette length than actually required.

The OctreeFrameQuantizer doesn't return the final number of entries until you build the palette. You have to build it then slice the result reducing the length by the determined index count, returning ReadOnlySpan allowed this.

https://github.com/SixLabors/ImageSharp/pull/1138/files#diff-2e327e0f312f1aae943b78b1145bd8adR90

If you don't do this you can end up with a larger Gif than you actually need. The WuFrameQuantizer is similar, but at least you know the length early on.

I'll give everything on last look over and see if I can improve things.

Copy link
Member

@antonfirsov antonfirsov Mar 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The palette could be null (If using ReadOnlyMemory<TPixel>) or empty ReadOnlySpan<TPixel>

We can also throw an InvalidOperationException if the palette has not yet been built.

Using ReadOnlyMemory<TPixel> can lead to a larger palette length than actually required. [...] You have to build it then slice the result reducing the length by the determined index count, returning ReadOnlySpan allowed this.

You can also slice a ReadOnlyMemory<T>.

This is no strong opinion however. Alternatively we can clarify the ownership model in the docs of BuildPalette() without changing the API.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also slice a ReadOnlyMemory.

Why oh why has that never occurred to me before?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick question....

Would you consider is ok API wise for the QuantizedFrame to reference the FrameQuantizer palette rather than copying it? Currently I pass the span, copy it and use it for the lifetime of the frame.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scratch that. It leads to messy code in the Gif Encoder.

@JimBobSquarePants
Copy link
Member Author

@antonfirsov I gave it another pass and cleaned it up quite a lot.

Still not happy with perf and allocations but it's incrementally improving.

Copy link
Member

@antonfirsov antonfirsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now! Just let's be more careful with stackalloc maybe.

Comment on lines 568 to 570
// Bulk convert our palette to RGBA to allow assignment to tables.
// Palette length maxes out at 256 so safe to stackalloc.
Span<Rgba32> rgbaPaletteSpan = stackalloc Rgba32[paletteLength];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, yes, but I can't spot any safety mechanism to enforce this. We should at least Guard against it, stackallocking a non-const buffer is super unsafe.

Another option is to stackalloc when paletteLength < 64 and create an array otherwise. We should not stackalloc 256 * sizeof(Rgba32) == 1KB data anyways.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also small sidenote, keep in mind that stackalloc zeroes out memory right now, plus the current implementation is quite inefficient since it's basically just a busy loop with a series of push 0x0, plus additional code to handle stack overflows etc.

Eg. doing stackalloc int[256] produces (among other things) this loop:

L001c: mov eax, 0x40
L0021: push 0x0
L0023: push 0x0
L0025: dec rax
L0028: jnz L0021

Check out the complete sample code here, it actually doesn't look that great right now: https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIGYACMhgYQYG8aHunHjykDAJYA7DAwCyACgCUXHp2o9lDAMoAHbCIA8ojAD4GuTSIYBeIxmxgA1tgA29iGGFiA2qQCsSALoBueRUGQJViAHYjEzcABn8QgF8aeKA==

You might want to consider just renting an array (which also skips the memory clear), especially if using stackalloc could introduce other security/safety concerns.

There's a proposal for a [SkipLocalsInit] attribute for C# 9.0 I think, but it's still under prototyping. And the zeroing optimization Ben Adams is doing is still an open PR possibly coming with .NET 5.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Palette size is limited in a couple of places...

  1. In the setter for QuantizerOptions.
  2. In the constructor for IndexedImageFrame.

I can use the constant again and slice though if you want...

I thought 1Kb was the general rule for a max amount, should we be more conservative?

From the linked blog post.

I won’t prescribe anything specific, but anything larger than a kilobyte is a point of concern.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see now that IndexedImageFrame constructor should prevent anything going wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1Kb is the maximum, but probably not an optimum, or at least not always. But it doesn't really matter in WritePaletteChunk, it's not a hot path, just converting & copying a short slice of data.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'll just revert stackalloc for now looking at the codegen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I knew that would've scared you 🤣

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can always make things even more scary with the Legacy JIT 😆

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've run that legacy asm through a decompiler, I got this:

public static int M()
{
    Span<int> span = stackalloc int[256];
    CalculateMassOfTheSun();
    return span[0];
}

private void BuildCube()
{
Span<double> vv = stackalloc double[this.colors];
Span<double> vv = stackalloc double[this.maxColors];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, that's too big! 2048!!

@antonfirsov
Copy link
Member

antonfirsov commented Mar 9, 2020

We should probably merge this as-is, and continue further experiments from here. It's likely that we can introduce all necessary improvements without breaking the current API.

@JimBobSquarePants JimBobSquarePants merged commit fab07ab into master Mar 9, 2020
@JimBobSquarePants JimBobSquarePants deleted the js/dither-quantize-updates branch March 9, 2020 08:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants