Performance Optimization Guided By Benchmarks

This post is just a summary of a few things I learned about performance optimization of CPU-bound / memory-bound processes in C#/.NET.
I was able to accelerate two different low-level image processing algorithms by a factor of 10 while learning a lot and enjoying the process.

The scope is optimizing C# code (.NET Framework 4.6.2) on x64 CPUs by hand.
It is based on my experiences while optimizing a couple of 2D image processing algorithms for a client.
I am no expert though on low-level programming or performance optimization.
If you have done performance optimizations on numerical algorithms before, you will probably not find anything here you did not know already (however, you are very welcome to correct me on everything I got wrong).
If you do not care about my experiences and are only looking for further resources, skip right to the Resources.

And yes, I know GPUs are fast, and yes, I know that there are tools like Hybridizer.
And also yes, I asked myself multiple times during the process if verbosely switching off C#’s safety features and making it look like C++ was better than just rewriting the thing in C++ in the first place.
But those are topics for another time.

So here we go.

The Right Tools Enable A Systematic Approach

When doing performance optimization in .NET there is little need for guess-work.
Many tools for a systematic approach are readily available:

  1. find the bottleneck using Visual Studio’s own profiler and optionally PerfView
  2. use functional tests to guard against regression using your unit testing framework of choice
  3. benchmark the improvements with BenchmarkDotNet
  4. optionally check the generated assembly instructions with Visual Studio’s disassembly viewer or with one of BenchmarkDotNet’s Diagnosers

I already wrote a whole blog post stressing the necessity of a good benchmarking tool (spoiler: no, Stopwatch does not suffice) and praising BenchmarkDotNet.
So all I will say here is: use BenchmarkDotNet to make benchmarking easy, reliable and fun!

Find The Bottleneck With A Profiler

I had never used a profiler before, and this time I used the Visual Studio profiler only to make sure that the assumption about the image processing algorithm itself being the bottleneck was correct.
So I will not go into detail on how to use a profiler and interpret the results.
There are plenty of resources for that.
This section is just a reminder to actually do this sanity check to make sure that performance improvements are made in the most promising places.
Start the application in Release mode, turn on the profiler, trigger the slow process (in my case by clicking through the UI), wait until it ends, stop the profiler and look at the results.
Make sure to profile a release build, not a debug build, and that the profiled projects are built with the same compiler optimizations as an actual production build.

Guard Against Regression With Zero-Tolerance Tests

When optimizing for performance, the rules of clean coding are sacrificed on the altar of speed; safety features like memory-safety and even type-safety may be forgone.
The algorithm itself may be refactored beyond recognition.
A strong suite of functional tests and regression tests provides rapid feedback on regression and helps you to avoid mistakes.
This is one reason to keep the original algorithm around and modify a copy.
In my case I was not only allowed but even forced to use the original algorithm as the infallible source of truth, i. e. the oracle.
Hence, regression and functional tests were the same thing.

Both algorithms had two layers: an outer layer that managed the image tiling and the 2D looping, and an inner layer doing the actual processing on the pixel data provided by the outer layer.
One of the two image processing algorithms contained both layers in a single class.
The other had the outer layer in its base class and contained only the inner layer.
I expected performance optimization to involve modifying both layers.

Everything that is expected to change during performance optimization must be copied and the references adjusted accordingly.
A sufficiently separated copy can be changed at will without affecting the original algorithm.

Thus, for each algorithm I made a copy of everything that belonged to either layer (including the mentioned base class), so that I was free to change everything in both layers.

I then created two layers of regression tests for each algorithm: an inner layer using raw random pixel data for testing the in-tile processing and an outer layer using real whole example images for testing the whole thing including the tiling and looping.
The inner tests used a small rectangular array and ran in seconds; the outer tests used big images and ran in minutes.

When the original algorithm serves as the oracle, the tests must compare the result of the to-be-optimized copy to the result of the original.
They must make sure that the results are … well, what exactly should the assertion be here?

Should the result be the exact same as before?
Or would it suffice to be “close enough”?
How much deviation should “close enough” allow to still fail on actual regression?
When are we forced to allow small epsilon deviations?

TL;DR: Floating Point Is Not Magic

My take-away from doing optimization is that it is really nice to have a regression test not allowing any deviations, i. e. testing for bit-wise identical results.
A lot of image processing algorithms treat the edges and corners of an image differently via special cases and it is easy to break those special cases when refactoring.
Having a no-deviations-allowed test will definitely catch those regressions.
Floating point is not magic[citation needed].
If the optimized version uses the same precision internally as the original algorithm, you can expect identical results.
If the optimized version uses a different precision from the original (e. g. by utilizing single precision SIMD), you can test against a carefully checked modified version of the original algorithm that also uses that precision.
Allowing some carefully chosen small deviations should be the last measure.
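
As an illustration, such a zero-tolerance test for the inner layer could look roughly like the following sketch (NUnit is just one possible framework here, and OriginalTileProcessor, OptimizedTileProcessor and their Process method are hypothetical stand-ins for the original algorithm and its to-be-optimized copy):

    using System;
    using NUnit.Framework;

    [TestFixture]
    public class TileProcessingRegressionTests
    {
        [Test]
        public void OptimizedVersion_MatchesOriginalExactly()
        {
            // Deterministic seed, so that a failing case is reproducible.
            var random = new Random(42);
            var pixels = new float[64 * 48];
            for (int i = 0; i < pixels.Length; i++)
                pixels[i] = (float)random.NextDouble();

            // The original algorithm is the oracle.
            float[] expected = new OriginalTileProcessor().Process(pixels, width: 64, height: 48);
            float[] actual = new OptimizedTileProcessor().Process(pixels, width: 64, height: 48);

            // Zero tolerance: every value must match exactly, no epsilon allowed.
            Assert.That(actual, Is.EqualTo(expected));
        }
    }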

Floating point arithmetic is still tricky

There is no reason to allow any kind of small epsilon deviations from the original just because the algorithm internally uses floating point arithmetic.
Floating point is not magic, not random and not arbitrary.
The floating point format and floating point arithmetic are very much deterministic, well-defined and usually well-implemented.
As long as the original and the optimized version use the same precision (both use double precision aka double or both use single precision aka float), there need not be any deviation.
In fact, the optimized version of one of the real-life algorithms yielded a result that was bit-identical to the original (with performance improved ten-fold), because both used float.
Such an outcome guarantees that there is no regression and makes the refactoring untouchable by rational criticism, at least within the scope of the provided test cases.

There is an element of truth to the fear of non-determinism and the resulting epsilon obsession, though: intermediate precision.
Intermediate precision can become a problem when using the x87 instruction set.
The x87 instruction set was and can still be used for floating point arithmetic and enables calculations to be done in 80 bit registers (“extended precision”).
Before being written to memory, the result of a calculation is rounded to double (64 bit) precision.
The problem with this is that the compiler may choose to keep intermediate results enregistered in the 80 bit register and keep the precision, or it may choose to write intermediate results to memory with 64 bit precision and read them back out later.
Depending on these compiler decisions at various places in the algorithm, the end results may vary.
From the programmer’s point of view this decision process is not deterministic, which makes the end result non-deterministic within the rounding margins.
There are various CPU switches and compiler options to alleviate this problem, but floating point arithmetic still got a bad reputation.

Intermediate precision may also be a problem when using compiler optimizations allowing the reordering of calculation terms.
Using FMA instructions (e. g. through .NET intrinsics) will also change intermediate precision and thus possibly the end result.
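
The following sketch illustrates the effect of fusing; it uses the hardware intrinsics introduced with .NET Core 3.0 (so it is outside the .NET Framework 4.6.2 scope of this post) and a value chosen so that the rounding of the intermediate product becomes visible:

    using System;
    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    class FmaDemo
    {
        static void Main()
        {
            double a = 1.0 + Math.Pow(2, -30);

            // Unfused: a * a is rounded to double precision before the subtraction.
            double unfused = a * a - 1.0;

            if (Fma.IsSupported)
            {
                // Fused: the full-precision product enters the subtraction,
                // only the end result is rounded.
                double fused = Fma.MultiplyAdd(
                    Vector128.Create(a), Vector128.Create(a), Vector128.Create(-1.0)).ToScalar();

                Console.WriteLine(unfused == fused); // prints False on FMA-capable hardware
            }
        }
    }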

The x64 CIL JIT compiler will by default not issue x87 instructions, not issue FMA instructions and not reorder calculation terms.
It will use SIMD instructions for floating point calculations, all of which operate in single or double precision and give the same results for scalar and packed (vectorized) calculations.
This is of course not the whole story as the CLI in principle allows calculations with arbitrarily high intermediate precision (see CLI Spec (PDF), Partition 1 Section 12.3.1, also discussed here and here).
I have not encountered any problems with that during my optimizations, though, and will assume that within common x64 processors floating point arithmetic in .NET is deterministic and we are thus free to expect bit-identical results from our refactored algorithm as long as we do not use certain compiler optimizations or intrinsics.

There is still a discrepancy in C# between scalar and vectorized floating point calculations.
When writing scalar calculations it is natural to use double precision by default.
A lot of math library functions naturally take and return double.
Using single precision will then not only be awkward and annoying, but may very well be slower than double precision because of the necessary implicit and explicit conversions.
When writing vectorized calculations, however, the situation is different.
Many performance optimizations involve vectorization.
The SIMD instruction set available on modern CPUs (AVX2) is virtually the same for single and double precision.
Consequently, C#’s Vector<T> has the same arithmetic capabilities for both.
For a lot of algorithms single precision is sufficient.
Thus, there is an incentive to put twice as many single precision values into a register and (in principle) do twice the number of calculations in the same amount of time.
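
For example, with System.Numerics.Vector<T> the same simple kernel handles twice as many pixels per operation in single precision; the scaling kernel below is a made-up illustration of the usual vectorized-loop-plus-scalar-tail pattern, not one of the real algorithms:

    using System;
    using System.Numerics;

    static class PixelKernels
    {
        // Scales every pixel by a constant factor.
        // With 256 bit AVX registers, Vector<float> has 8 lanes and Vector<double> only 4,
        // so the float version processes twice as many pixels per instruction.
        public static void Scale(float[] src, float[] dst, float factor)
        {
            int lanes = Vector<float>.Count; // typically 8 with AVX2
            int i = 0;
            for (; i <= src.Length - lanes; i += lanes)
            {
                var v = new Vector<float>(src, i); // load 'lanes' pixels
                (v * factor).CopyTo(dst, i);       // vectorized multiply and store
            }
            for (; i < src.Length; i++)            // scalar tail for the remainder
                dst[i] = src[i] * factor;
        }
    }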

If the original algorithm has double precision arithmetic in it, it may be worth creating a single precision version of it and then doing the zero-deviation regression tests against that version.

So, if

  • the original algorithm uses double precision calculations (check for implicit conversion e. g. when using Math.Sqrt()!)
  • single precision would suffice across the algorithm while still fulfilling its precision requirements

Then create another copy of the original algorithm and change all the double precision math to single precision math:

  • Use explicit float type declarations
  • Make sure that after each calculation that implicitly converts to double, the result is explicitly converted back to float
  • In nested calculations, make sure that each conversion back to float is placed at the innermost possible level of parentheses

These changes are likely to be limited, easy to check with a diff-viewer and rather fail-safe.
They are well worth the effort if they enable regression testing with zero deviation.
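
As a made-up example of such a change (the gradient calculation is purely illustrative), note the cast placed directly around the double-returning Math.Sqrt() call:

    using System;

    static class Gradient
    {
        // Original, double precision:
        static double Magnitude(double dx, double dy, double scale)
            => Math.Sqrt(dx * dx + dy * dy) * scale;

        // Single precision copy: explicit float declarations, and the result of the
        // double-returning Math.Sqrt() is converted back to float in the innermost
        // possible position, so the multiplication by 'scale' also runs in single precision.
        static float Magnitude(float dx, float dy, float scale)
            => (float)Math.Sqrt(dx * dx + dy * dy) * scale;
    }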

Benchmark Your Improvements Incrementally

Whether or not you created a single precision version of the algorithm, benchmarking should be done against the original.
In case I have not yet sufficiently often recommended BenchmarkDotNet, here we go again.
Benchmark often, ideally after each small change, so you get some data for what works and what does not.
Prepare to be surprised.
If you change too much before benchmarking again, chances are that some of those changes lowered performance without you noticing.
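
A minimal benchmark comparing the copy against the original could look like the sketch below (again with the hypothetical OriginalTileProcessor and OptimizedTileProcessor standing in for the two versions); the Baseline flag makes BenchmarkDotNet report the speed-up ratio directly:

    using System;
    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;

    [MemoryDiagnoser]
    public class TileProcessingBenchmarks
    {
        private float[] _pixels;

        [GlobalSetup]
        public void Setup()
        {
            var random = new Random(42);
            _pixels = new float[4096 * 4096];
            for (int i = 0; i < _pixels.Length; i++)
                _pixels[i] = (float)random.NextDouble();
        }

        [Benchmark(Baseline = true)]
        public float[] Original() => new OriginalTileProcessor().Process(_pixels, 4096, 4096);

        [Benchmark]
        public float[] Optimized() => new OptimizedTileProcessor().Process(_pixels, 4096, 4096);
    }

    public static class Program
    {
        public static void Main() => BenchmarkRunner.Run<TileProcessingBenchmarks>();
    }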

Resources

There are many optimization guides, tutorials and look-up resources out there.
In particular, I found the following helpful: