cbloom rants: Kraken Perf with Simultaneous Threaded Decodes

I had a report from a customer of poor Kraken decode performance on PS4 when using 10 simultaneous threads for decoding, and it occurred to me I had never tested that thoroughly. (Their issue turned out to be something else; see end of post).

There is reason to be concerned about running a lot of Kraken (or Mermaid/Selkie) decodes simultaneously. On most modern systems, like the PS4, the many cores share caches, perhaps share memory busses or TLBs. That means while you have N* the compute performance, you may have cache conflicts, and you could wind up bottlenecking on some of the memory subsystem. (generally we don't run into bandwidth bottlenecks, but there are lots of other limitted resources, like queue sizes, etc.)

Anyhoo, onto the testing -

I ran N threaded decodes of the same file. The buffers are copied for each thread so they can't share any cache for input or output buffers. Wiped caches before runs. I then wait on all N decodes being done and time that.

The graphs show total time for all N decodes, and time per decode (total/N).

If you had infinite compute resources, then "total time" (orange) would be a flat line. Any number of threads would take the same total time, it would not change.

Once you hit the limits of the system, the "time per" (blue) should be constant, and total then should go up linearly. (actually not quite, because when you are off the core # modulo, the threads don't all complete at the same time so you get wasted idle time; see the jump on lappy from 4-6 cores then how flat it is from 6-8, same on PS4 from 6-8 cores then flat from 9-12). If you have the threads to spare, then you can maximize throughput by minimizing "time per".

Conclusion :

No problem with lots of simultaneous Kraken decodes. Even when heavily over-subscribed, there's no major perf inversion due to overloading cache or memory subsystems.

Kraken on PS4 has near perfect threading up to 6 threads (total time goes from 0.0099 - 0.0111) ; on lappy it's not as good but still provides benefit to the time per decode up to 4 threads (total time from 0.0060 - 0.0095).

It's a surprise to me that the PS4 scales so well despite sharing cache & memory bus for the first 4 cores. It's also a surprise that lappy scales less well, I thought it would be near perfect on the first 4 cores, but maybe that's just Windows not giving me the whole machine? That was backward from my expectation.

Charts :

Kraken on PS4 (6 cores; 4 cores per 2MB L2) :

lzt24 :

lzt99:

Almost perfect threading from 1-6 cores (total time constant) even with large binary file.

webster:

webster is a large text file that uses a lot of long distance matches (offset > 1M). Text files have very different character than binary files like the lzt's. We can see that the large hot memory region used by webster does put some stress on the shared L2, there's falloff in perf from 1-4 cores.

webster Selkie :

Selkie is much faster than Kraken (2.75X faster on webster PS4) so all else being equal it should be affected a lot more by thread contention hurting memory latency. But, Selkie has some unique cleverness that makes it immune to this drawback. Threading even on webster from 1-6 cores is near perfect.

Kraken on my laptop (4 cores) (Core i7 Q820) (4x256 kb L2 , 8 MB L3) (+4 hypercores) no turbo :

lzt24 :

lzt99 :

webster :

Similar to PS4, lappy has almost perfect threading on binary files from 1-4 cores. On webster there is falloff in perf due to the

Kraken on my laptop (4 cores) (Core i7 Q820) (4x256 kb L2 , 8 MB L3) (+4 hypercores) WITH TURBO :

lzt24 :

lzt99 :

I initially mistakenly posted lappy timings with turbo enabled. I usually turn it off for perf testing on my laptop so that timings are more reliable. I think it's interesting actually to look at how the perf falloff is different with turbo.

Without turbo, total time is constant on lzt24 and lzt99 from 1-4 cores, but with turbo it steadily falls off, as adding more cores causes the laptop to reduce its clock rate. Despite that there's still a solid gain to throughput (the blue "time per" is going down despite the clock rate also going down).

raw data : (lzt24)


lappy : no turbo : (*1000)
1,   9.1360,   9.1360
2,   9.5523,   4.7761
3,   9.7850,   3.2617
4,  10.1901,   2.5475
5,  14.6867,   2.9373
6,  16.6759,   2.7793
7,  19.1105,   2.7301
8,  20.1687,   2.5211
9,  23.6391,   2.6266
10,  25.9279,   2.5928
11,  27.7395,   2.5218
12,  27.6459,   2.3038
13,  30.7935,   2.3687
14,  31.8541,   2.2753
15,  33.7883,   2.2526
16,  34.8252,   2.1766

lappy : with turbo :
1,   0.0060,   0.0060
2,   0.0070,   0.0035
3,   0.0087,   0.0029
4,   0.0095,   0.0024 <- 4
5,   0.0133,   0.0027
6,   0.0170,   0.0028
7,   0.0175,   0.0025
8,   0.0193,   0.0024 <- 8
9,   0.0228,   0.0025
10,   0.0252,   0.0025
11,   0.0262,   0.0024
12,   0.0278,   0.0023 <- 12
13,   0.0318,   0.0024
14,   0.0310,   0.0022
15,   0.0325,   0.0022
16,   0.0346,   0.0022 <- 16

PS4 :
1,   0.0099,   0.0099
2,   0.0102,   0.0051
3,   0.0104,   0.0035
4,   0.0106,   0.0027
5,   0.0110,   0.0022
6,   0.0111,   0.0018 <- min
7,   0.0147,   0.0021
8,   0.0180,   0.0022
9,   0.0204,   0.0023
10,   0.0214,   0.0021
11,   0.0217,   0.0020
12,   0.0220,   0.0018 <- same min again
13,   0.0257,   0.0020
14,   0.0297,   0.0021
15,   0.0310,   0.0021
16,   0.0319,   0.0020



comparing just lappy turbo to no-turbo :

lappy : no turbo :
1,   9.1360,   9.1360
2,   9.5523,   4.7761
3,   9.7850,   3.2617
4,  10.1901,   2.5475

lappy : with turbo :
1,   6.0,   6.0
2,   7.0,   3.5
3,   8.7,   2.9
4,   9.5,   2.4

You can see with only 1 core, turbo is 1.5X faster (9.13/6.0) than no turbo
With 4 cores they are getting close to the same speed, (10.2 vs 9.5), the turbo
has almost completely clocked down

The customer's actual issue was decoding into write-combined graphics memory. This is an absolute killer for decoder perf because Kraken (like any LZ decoder) needs to read back the buffers it writes.

On the PS4 I think the best way to decode to graphics memory (garlic) is to allocate the memory as writeback onion, do the decompress, then change it to wb_garlic with sceKernelBatchMap (which will cause a CPU cache flush; several of these changes could be combined together, eg. for level loading you only need to do it once at the end of all the resource decoding, don't do it per resource).

cbloom rants

3/08/2017

Kraken Perf with Simultaneous Threaded Decodes

No comments:

Post a Comment