There is reason to be concerned about running a lot of Kraken (or Mermaid/Selkie) decodes simultaneously. On most modern systems, like the PS4, the many cores share caches, and perhaps share memory buses or TLBs. That means that while you have N times the compute performance, you may get cache conflicts, and you could wind up bottlenecking on some part of the memory subsystem. (Generally we don't run into bandwidth bottlenecks, but there are lots of other limited resources, like queue sizes, etc.)
Anyhoo, onto the testing -
I ran N threaded decodes of the same file. The buffers are copied for each thread so they can't share any cache for input or output buffers. I wiped caches before runs, then waited on all N decodes being done and timed that.
The graphs show total time for all N decodes, and time per decode (total/N).
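For anyone who wants to try the same kind of measurement, here is a rough sketch of the harness shape. It is not the harness used for these numbers: zlib stands in for Kraken (an assumption, picked only because it ships with Python and CPython's zlib drops the GIL while inflating, so decodes can overlap), the name time_parallel_decodes is made up for illustration, and it does not wipe caches between runs.

```python
import threading
import time
import zlib


def time_parallel_decodes(compressed, n_threads):
    """Run n_threads decodes of the same data at once.

    Returns (total_seconds, seconds_per_decode), where per = total / N.
    """
    # Private copy of the input per thread, so no input buffer is shared.
    inputs = [bytes(bytearray(compressed)) for _ in range(n_threads)]
    outputs = [None] * n_threads

    # The barrier action runs once, just before the waiting threads are
    # released, so it records a common start time for the whole batch.
    start = []
    gate = threading.Barrier(
        n_threads + 1, action=lambda: start.append(time.perf_counter())
    )

    def worker(i):
        gate.wait()
        # Stand-in decoder call; each call allocates its own output buffer.
        outputs[i] = zlib.decompress(inputs[i])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    gate.wait()           # release every worker together
    for t in threads:
        t.join()          # wait on all N decodes being done
    total = time.perf_counter() - start[0]
    return total, total / n_threads


if __name__ == "__main__":
    payload = zlib.compress(b"some moderately repetitive text " * 100000)
    for n in (1, 2, 4, 8):
        total, per = time_parallel_decodes(payload, n)
        print(f"{n:2d} threads: total {total:.4f}s, per decode {per:.4f}s")
```

The two numbers printed per row correspond to the two curves in the graphs: total time for all N decodes, and total/N.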
If you had infinite compute resources, then "total time" (orange) would be a flat line: any number of threads would take the same total time.
Once you hit the limits of the system, the "time per" (blue) should be constant, and total time should then go up linearly. (Actually not quite, because when the thread count is not a multiple of the core count, the threads don't all complete at the same time, so you get wasted idle time; see the jump on lappy from 4-6 threads then how flat it is from 6-8, and the same on PS4 from 6-8 threads then flat from 9-12.) If you have the threads to spare, then you can maximize throughput by minimizing "time per".
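The "off the core count" effect can be put in a toy model. This is my own simplification, not something measured: it assumes every decode takes the same time t_single on a core by itself, and that nothing in the memory system slows down as threads are added. The best case is a scheduler that never leaves a core idle; the worst case is decodes running in whole waves of n_cores, where a partial last wave still costs a full decode time. The function names are made up for illustration.

```python
import math


def total_time_bounds(n_threads, n_cores, t_single):
    """Return (best_case, worst_case) total time for n_threads decodes."""
    # Best case: perfect time slicing, no core ever sits idle.
    best = max(1.0, n_threads / n_cores) * t_single
    # Worst case: whole waves of n_cores decodes; a partial last wave
    # takes a full t_single while the remaining cores idle.
    worst = math.ceil(n_threads / n_cores) * t_single
    return best, worst


def time_per_decode(total, n_threads):
    """The blue curve: total time divided by the number of decodes."""
    return total / n_threads


if __name__ == "__main__":
    # 6 cores, times in units of one single-threaded decode.
    for n in range(1, 13):
        best, worst = total_time_bounds(n, 6, 1.0)
        print(f"{n:2d} threads: total between {best:.2f} and {worst:.2f}")
```

With n_cores = 6 the two bounds only coincide at multiples of 6, which lines up with the PS4 "time per" hitting its minimum at 6 and again at 12 threads in the raw data.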
Conclusion :
No problem with lots of simultaneous Kraken decodes. Even when heavily over-subscribed, there's no major perf inversion due to overloading cache or memory subsystems.
Kraken on PS4 has near perfect threading up to 6 threads (total time goes from 0.0099 to 0.0111); on lappy it's not as good, but threading still benefits the time per decode up to 4 threads (total time goes from 0.0060 to 0.0095 with turbo enabled).
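To turn those total times into a throughput number: N decodes finishing in total_n versus one decode finishing in total_1 is a speedup of N * total_1 / total_n. A quick sketch using only the lzt24 total times quoted above (the helper names are made up for illustration):

```python
def throughput_speedup(total_1, total_n, n_threads):
    """Decodes-per-second gain of n_threads over a single thread."""
    return n_threads * total_1 / total_n


def scaling_efficiency(total_1, total_n):
    """1.0 means total time did not grow at all as threads were added."""
    return total_1 / total_n


# lzt24 total times from this post.
ps4 = throughput_speedup(0.0099, 0.0111, 6)
lappy = throughput_speedup(0.0060, 0.0095, 4)

print(f"PS4,   6 threads: {ps4:.2f}x throughput")
print(f"lappy, 4 threads: {lappy:.2f}x throughput")
```

That comes out to roughly 5.35X on PS4 with 6 threads and 2.53X on lappy with 4 threads.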
It's a surprise to me that the PS4 scales so well despite sharing cache & memory bus for the first 4 cores. It's also a surprise that lappy scales less well; I thought it would be near perfect on the first 4 cores, but maybe that's just Windows not giving me the whole machine? That was backward from my expectation.
Charts :
Kraken on PS4 (6 cores; 4 cores per 2MB L2) :
lzt24 :
lzt99:
Almost perfect threading from 1-6 cores (total time constant) even with large binary file.
webster:
webster is a large text file that uses a lot of long distance matches (offset > 1M). Text files have a very different character from binary files like the lzt's. We can see that the large hot memory region used by webster does put some stress on the shared L2; there's falloff in perf from 1-4 cores.
webster Selkie :
Selkie is much faster than Kraken (2.75X faster on webster PS4) so all else being equal it should be affected a lot more by thread contention hurting memory latency. But, Selkie has some unique cleverness that makes it immune to this drawback. Threading even on webster from 1-6 cores is near perfect.
Kraken on my laptop (4 cores) (Core i7 Q820) (4x256 KB L2 , 8 MB L3) (+4 hypercores) no turbo :
lzt24 :
lzt99 :
webster :
Similar to PS4, lappy has almost perfect threading on binary files from 1-4 cores. On webster there is falloff in perf due to the
Kraken on my laptop (4 cores) (Core i7 Q820) (4x256 KB L2 , 8 MB L3) (+4 hypercores) WITH TURBO :
lzt24 :
lzt99 :
I initially mistakenly posted lappy timings with turbo enabled. I usually turn it off for perf testing on my laptop so that timings are more reliable. I think it's interesting actually to look at how the perf falloff is different with turbo.
Without turbo, total time is constant on lzt24 and lzt99 from 1-4 cores, but with turbo it steadily falls off, as adding more cores causes the laptop to reduce its clock rate. Despite that there's still a solid gain to throughput (the blue "time per" is going down despite the clock rate also going down).
raw data : (lzt24)
(columns : threads, total time, time per decode)

lappy : no turbo : (*1000)
1, 9.1360, 9.1360
2, 9.5523, 4.7761
3, 9.7850, 3.2617
4, 10.1901, 2.5475
5, 14.6867, 2.9373
6, 16.6759, 2.7793
7, 19.1105, 2.7301
8, 20.1687, 2.5211
9, 23.6391, 2.6266
10, 25.9279, 2.5928
11, 27.7395, 2.5218
12, 27.6459, 2.3038
13, 30.7935, 2.3687
14, 31.8541, 2.2753
15, 33.7883, 2.2526
16, 34.8252, 2.1766

lappy : with turbo :
1, 0.0060, 0.0060
2, 0.0070, 0.0035
3, 0.0087, 0.0029
4, 0.0095, 0.0024 <- 4
5, 0.0133, 0.0027
6, 0.0170, 0.0028
7, 0.0175, 0.0025
8, 0.0193, 0.0024 <- 8
9, 0.0228, 0.0025
10, 0.0252, 0.0025
11, 0.0262, 0.0024
12, 0.0278, 0.0023 <- 12
13, 0.0318, 0.0024
14, 0.0310, 0.0022
15, 0.0325, 0.0022
16, 0.0346, 0.0022 <- 16

PS4 :
1, 0.0099, 0.0099
2, 0.0102, 0.0051
3, 0.0104, 0.0035
4, 0.0106, 0.0027
5, 0.0110, 0.0022
6, 0.0111, 0.0018 <- min
7, 0.0147, 0.0021
8, 0.0180, 0.0022
9, 0.0204, 0.0023
10, 0.0214, 0.0021
11, 0.0217, 0.0020
12, 0.0220, 0.0018 <- same min again
13, 0.0257, 0.0020
14, 0.0297, 0.0021
15, 0.0310, 0.0021
16, 0.0319, 0.0020

comparing just lappy turbo to no-turbo :

lappy : no turbo :
1, 9.1360, 9.1360
2, 9.5523, 4.7761
3, 9.7850, 3.2617
4, 10.1901, 2.5475

lappy : with turbo :
1, 6.0, 6.0
2, 7.0, 3.5
3, 8.7, 2.9
4, 9.5, 2.4

You can see with only 1 core, turbo is 1.5X faster (9.13/6.0) than no turbo. With 4 cores they are getting close to the same speed (10.2 vs 9.5); the turbo has almost completely clocked down.
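The same turbo vs. no-turbo comparison can be done for every thread count from 1 to 4, not just the two endpoints. A small sketch using only the lappy lzt24 total times from the raw data (both in the *1000 units); the dictionary and function names are just for illustration:

```python
# Total times for lzt24 on lappy, threads -> total time (*1000).
no_turbo = {1: 9.1360, 2: 9.5523, 3: 9.7850, 4: 10.1901}
turbo = {1: 6.0, 2: 7.0, 3: 8.7, 4: 9.5}


def turbo_advantage(n_threads):
    """How many times faster the turbo run is at this thread count."""
    return no_turbo[n_threads] / turbo[n_threads]


for n in sorted(no_turbo):
    print(f"{n} threads: turbo is {turbo_advantage(n):.2f}X faster")
```

The ratio slides from about 1.52X at 1 thread down to about 1.07X at 4 threads, which is the clock-down showing up in the numbers.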
The customer's actual issue was decoding into write-combined graphics memory. This is an absolute killer for decoder perf because Kraken (like any LZ decoder) needs to read back the buffers it writes.
On the PS4 I think the best way to decode to graphics memory (garlic) is to allocate the memory as writeback onion, do the decompress, then change it to wb_garlic with sceKernelBatchMap (which will cause a CPU cache flush; several of these changes could be combined together, eg. for level loading you only need to do it once at the end of all the resource decoding, don't do it per resource).