Oodle 2.8.6 continues small fixes and tweaks in the 2.8.x major version. Leviathan decompression is now 5-10% faster on modern processors.
I have a new standard test machine, so I want to use this space to leave myself some checkpoint reference numbers. The new machine is :
AMD Ryzen 9 3950X (CPU locked at 3393 MHz)
Zen2, 16 cores (32 hyper), TSC at 3490 MHz
I always down-clock my test machines and disable turbo (or "boost") on the core clocks. This gives me very reliable profiling (of single-threaded work, anyway). If you don't do this, you will see not just run-to-run variability, but also that the first thing you test usually looks faster than later tests (the CPU starts cool and holds turbo longer early on), so your testing can be order dependent. If you don't lock your cores at low clocks, then you should repeat your tests many times, run them in different orders, and check that your results are stable.
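For reference, here's a minimal sketch in C++ of that fallback method : time each candidate many times, shuffling the order every repetition, and look at the spread. All the names here (candidate_a, time_one, etc.) are illustrative placeholders, not anything from Oodle :

// minimal order-shuffled repetition harness (illustrative sketch)
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

using Clock = std::chrono::steady_clock;

static double time_one(void (*fn)())
{
    auto t0 = Clock::now();
    fn();
    auto t1 = Clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

static void candidate_a() { /* work being timed */ }
static void candidate_b() { /* work being timed */ }

int main()
{
    struct Candidate { const char * name; void (*fn)(); std::vector<double> times; };
    std::vector<Candidate> cands = { {"A", candidate_a}, {"B", candidate_b} };

    std::mt19937 rng(12345);
    for (int rep = 0; rep < 20; rep++)
    {
        // shuffle so turbo/thermal state doesn't always favor whoever runs first
        std::shuffle(cands.begin(), cands.end(), rng);
        for (auto & c : cands)
            c.times.push_back(time_one(c.fn));
    }

    for (auto & c : cands)
    {
        double mn = *std::min_element(c.times.begin(), c.times.end());
        double mx = *std::max_element(c.times.begin(), c.times.end());
        // a large spread means the numbers aren't stable enough to compare
        printf("%s : min %.6f s , max %.6f s , spread %.1f%%\n",
               c.name, mn, mx, 100.0 * (mx - mn) / mn);
    }
    return 0;
}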
new machine :
ect seven Oodle 2.8.6
AMD Ryzen 9 3950X (CPU locked at 3393 MHz)
ooSelkie7 : 2.19:1 , 7.7 enc MB/s , 4205.6 dec MB/s
ooMermaid7 : 2.85:1 , 4.3 enc MB/s , 1985.1 dec MB/s
ooKraken7 : 3.08:1 , 3.5 enc MB/s , 1258.7 dec MB/s
ooLeviathan7 : 3.22:1 , 2.2 enc MB/s , 778.4 dec MB/s
zlib9 : 2.33:1 , 9.4 enc MB/s , 328.1 dec MB/s
old machine :
ect seven Oodle 2.8.6
Core i7 3770 (locked at 3.4 GHz)
ooSelkie7 : 2.19:1 , 7.5 enc MB/s , 3682.9 dec MB/s
ooMermaid7 : 2.85:1 , 3.9 enc MB/s , 1722.3 dec MB/s
ooKraken7 : 3.08:1 , 3.0 enc MB/s , 1024.9 dec MB/s
ooLeviathan7 : 3.22:1 , 1.9 enc MB/s , 679.4 dec MB/s
zlib9 : 2.33:1 , 8.0 enc MB/s , 310.2 dec MB/s
Speeds are all single threaded, except the Oodle Optimal-level encoders, which use 2 threads for encoding (Jobify).
All reports on my blog before this post, where the run platform was not explicitly identified, were on the Core i7 3770. All reports in the future will be on the Ryzen.
Here's an "example_lz_chart" run on the new machine :
AMD Ryzen 9 3950X (CPU locked at 3393 MHz)
Oodle 2.8.6 example_lz_chart
<file>
lz_chart loading r:\testsets\lztestset\lzt99...
file size : 24700820
------------------------------------------------------------------------------
Selkie : super fast to encode & decode, least compression
Mermaid: fast decode with better-than-zlib compression
Kraken : good compression, fast decoding, great tradeoff!
Leviathan : very high compression, slowest decode
------------------------------------------------------------------------------
chart cell shows | raw/comp ratio : encode MB/s : decode MB/s |
All compressors run at various encoder effort levels (SuperFast - Optimal).
Many repetitions are run for accurate timing.
------------------------------------------------------------------------------
| HyperFast4| HyperFast3| HyperFast2| HyperFast1| SuperFast |
Selkie |1.41:834:4353|1.45:742:4355|1.53:557:4112|1.68:465:4257|1.70:412:4232|
Mermaid|1.54:702:3119|1.66:535:2591|1.79:434:2450|2.01:350:2429|2.04:324:2395|
Kraken |1.55:702:2247|1.71:532:1432|1.88:421:1367|2.10:364:1399|2.27:241:1272|
------------------------------------------------------------------------------
compression ratio (raw/comp):
| HyperFast4| HyperFast3| HyperFast2| HyperFast1| SuperFast |
Selkie | 1.412 | 1.447 | 1.526 | 1.678 | 1.698 |
Mermaid| 1.542 | 1.660 | 1.793 | 2.011 | 2.041 |
Kraken | 1.548 | 1.711 | 1.877 | 2.103 | 2.268 |
------------------------------------------------------------------------------
encode speed (MB/s):
| HyperFast4| HyperFast3| HyperFast2| HyperFast1| SuperFast |
Selkie | 834.386 | 742.003 | 557.065 | 465.025 | 412.442 |
Mermaid| 701.818 | 534.711 | 433.517 | 350.444 | 324.358 |
Kraken | 701.792 | 531.799 | 420.887 | 364.245 | 240.661 |
------------------------------------------------------------------------------
decode speed (MB/s):
| HyperFast4| HyperFast3| HyperFast2| HyperFast1| SuperFast |
Selkie | 4352.567 | 4355.253 | 4111.801 | 4256.927 | 4231.549 |
Mermaid| 3118.633 | 2590.950 | 2449.676 | 2429.461 | 2394.976 |
Kraken | 2247.102 | 1431.774 | 1366.672 | 1399.416 | 1272.313 |
------------------------------------------------------------------------------
| VeryFast | Fast | Normal | Optimal1 | Optimal3 |
Selkie |1.75:285:3847|1.83:127:4121|1.86: 55:4296|1.93: 10:4317|1.94:7.2:4297|
Mermaid|2.12:226:2307|2.19:115:2533|2.21: 52:2661|2.37:5.5:2320|2.44:4.2:2256|
Kraken |2.32:152:1387|2.39: 30:1483|2.44: 23:1469|2.55:9.8:1350|2.64:3.5:1292|
Leviath|2.48: 58: 899|2.56: 23: 937|2.62: 11: 968|2.71:3.9: 948|2.75:2.4: 932|
------------------------------------------------------------------------------
compression ratio (raw/comp):
| VeryFast | Fast | Normal | Optimal1 | Optimal3 |
Selkie | 1.748 | 1.833 | 1.863 | 1.933 | 1.943 |
Mermaid| 2.118 | 2.194 | 2.207 | 2.370 | 2.439 |
Kraken | 2.320 | 2.390 | 2.435 | 2.553 | 2.640 |
Leviath| 2.479 | 2.557 | 2.616 | 2.708 | 2.749 |
------------------------------------------------------------------------------
encode speed (MB/s):
| VeryFast | Fast | Normal | Optimal1 | Optimal3 |
Selkie | 284.979 | 127.375 | 55.468 | 10.398 | 7.168 |
Mermaid| 226.279 | 114.597 | 52.334 | 5.457 | 4.229 |
Kraken | 152.400 | 29.891 | 22.928 | 9.849 | 3.530 |
Leviath| 58.356 | 23.379 | 10.845 | 3.927 | 2.380 |
------------------------------------------------------------------------------
decode speed (MB/s):
| VeryFast | Fast | Normal | Optimal1 | Optimal3 |
Selkie | 3846.881 | 4121.199 | 4296.318 | 4317.344 | 4297.364 |
Mermaid| 2307.345 | 2532.950 | 2660.551 | 2320.415 | 2255.556 |
Kraken | 1387.219 | 1483.488 | 1469.246 | 1350.332 | 1292.404 |
Leviath| 899.052 | 937.473 | 968.337 | 948.179 | 932.194 |
------------------------------------------------------------------------------
I'm currently on an i7-3xxx too, and looking into the Ryzen 9 3xxx as well. However, recent info about inter-CCX latency (this chiplet thing) concerns me. The cache latency between core 1 and core 16 is far greater than between core 1 and core 2. I think some of these >8-core Ryzens are actually more like 2x6 cores or 2x8 cores, so any thread that migrates between those two sets of 6/8 cores pays a far higher cost. I was wondering if your tests take this into consideration. Or are your tests short enough (in time) that a thread is unlikely to be moved to another core?
It's actually more like 4x4 cores. There are 4 CCX's, each with its own set of 4 cores and 16 MB of L3. Access to memory in another CCX has higher latency than access within your own. My tests are single threaded so they won't show this. Thread switching also won't show it significantly: the OS is pretty good at scheduling a thread back onto the same core it was on, if your system is not overloaded. Thread switches are also extremely rare (per second or per milli, as opposed to per nano), and some cache misses after a thread switch are normal. Communication between CCX's is vastly faster than in the old NUMA multi-processor days.
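If you do want to take scheduler migration out of the picture entirely, so a timing run can never cross a CCX, you can pin the timing thread to one core. A minimal Win32 sketch; which core to pin to, and whether to pin at all, is the caller's choice :

#include <windows.h>

// pin the calling thread to a single core so it can never migrate
// across CCX's mid-benchmark
void pin_current_thread_to_core(int core)
{
    SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << core);
}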
ReplyDeleteWhere you would see it is if you have all the cores running simultaneously and touching the same cache lines to share data. Then those cache lines would have to be constantly moving between CCX's.
This is bad programming, and everyone should stop doing it and learn how to write efficient code for the modern highly threaded world. For example, lock/gate variables should always be on their own cache lines. Use SPSC queues heavily. Don't poll on shared variables. Use true waitable events, not poll loops. Size your work items appropriately for threading. etc. etc.
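As a concrete version of the "own cache line" rule, here's a minimal C++ sketch, assuming a 64-byte cache line (C++17's std::hardware_destructive_interference_size is the portable way to get that number) :

#include <atomic>
#include <cstddef>

// assumed line size; prefer std::hardware_destructive_interference_size in C++17
constexpr std::size_t kCacheLine = 64;

// alignas pads and aligns the struct to a full line, so two cores
// touching two different flags never invalidate each other's cache line
struct alignas(kCacheLine) PaddedFlag
{
    std::atomic<int> value{0};
};

static_assert(sizeof(PaddedFlag) == kCacheLine, "one flag per cache line");

PaddedFlag g_worker_flags[16]; // hypothetical: one gate flag per worker thread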
I don't see this being an issue for most real world work loads.
Anyway, it's just win-win. The 3950X is so cheap you can just imagine you got a 4-core processor on a single CCX for Intel prices, and they threw in 12 more cores for free.
We actually hit this recently in production.
Cache contention is a really big problem for multi-threaded programs, and it can crop up easily and have a huge effect.
We had a random number generator with global state that we were hitting from lots of threads. (We use per-thread random number generators for real production work, but to make it easy to hack things up we still have the one with global state.) The random number generator code itself is only a few instructions, and it's not called super often, so we didn't think twice about it. But profiling showed it taking a hugely disproportionate amount of time, something like 10% of total runtime for 0.1% of the instructions, just because the cache line holding the global random number state had to be passed between all the cores that were fighting over it.
Global variables that are read-write from multiple threads are a huge no-no for high-performance code. You have to stay on top of it. (And watch out for "false sharing", where two variables you think are independent wind up on the same cache line and cause contention.)
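Here's a minimal sketch of the problem and the fix, with an xorshift-style generator standing in for whatever RNG is actually in use (illustrative only, not our production code) :

#include <cstdint>

// BAD : one mutable state shared by every calling thread. Each call is a
// read-modify-write of the same cache line, so the line ping-pongs between
// cores (and across CCX's). It's also a data race as written.
static uint64_t g_rng_state = 0x123456789ABCDEF0ull;

uint64_t rand_global()
{
    uint64_t x = g_rng_state;
    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    g_rng_state = x;
    return x;
}

// BETTER : per-thread state, so there is no sharing at all
uint64_t rand_per_thread()
{
    thread_local uint64_t state = 0x123456789ABCDEF0ull; // seed per thread in real code
    uint64_t x = state;
    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    state = x;
    return x;
}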