tag:blogger.com,1999:blog-5246987755651065286.post7381758415327733091..comments2024-02-22T16:15:42.388-08:00Comments on cbloom rants: 02-18-14 - ans_fast implementation notescbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.comBlogger24125tag:blogger.com,1999:blog-5246987755651065286.post-58092512737922231312014-03-04T16:08:31.730-08:002014-03-04T16:08:31.730-08:00Ok, Charles code ( http://pastebin.com/qqyLNXFA ) ...Ok, Charles code ( http://pastebin.com/qqyLNXFA ) clearly shows that decoding should have practically the same speed in both cases for extremely accurate entropy coding. Encoding could be vectorized for rANS to get a few times speedup, but it is not an essential advantage.<br /><br />However, for a bit less accurate cases like CABAC, tANS seems much simpler. For example 16 tables for 16 states (qANS from paper) to operate deltaH~0.001 bits/symbol would need 16*16=256 bytes (4 bits for new state, 1 bit for symbol, 2 bits for nbBits).Jarek Dudahttps://www.blogger.com/profile/11358050996148333936noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-74403385547834005322014-03-04T14:13:28.442-08:002014-03-04T14:13:28.442-08:00A fair comparison is what Charles just posted, or ...A fair comparison is what Charles just posted, or Matt Mahoney's fpaq[abc].Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-53725626200783276662014-03-04T12:55:55.371-08:002014-03-04T12:55:55.371-08:00Charles, so what is a fair comparison?
regarding r...Charles, so what is a fair comparison?<br />regarding read_bits, e.g.:<br />read_bits(k)<br />{bits = bitBuffer & (1 < < k - 1);<br /> bitBuffer = bitBuffer > > k;<br />}<br />Using 64 bit BitBuffer, every let say 4 symbols you check if you didn't get below 32 bits - if so, read 32 bits.<br />You need one branch per a few symbols.Jarek Dudahttps://www.blogger.com/profile/11358050996148333936noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-9739252655198662462014-03-04T12:54:19.880-08:002014-03-04T12:54:19.880-08:00This comment has been removed by the author.Jarek Dudahttps://www.blogger.com/profile/11358050996148333936noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-11562095221886296312014-03-04T12:46:50.912-08:002014-03-04T12:46:50.912-08:00Fabian, single tABS is not interesting, but exactl...Fabian, single tABS is not interesting, but exactly like for CABAC, we can use multiple small tables for different quantized probabilities of less probable symbol.<br />For example use 32 small tables: t=decodingTable[m][x], where m = p/64 <= 1/2.Jarek Dudahttps://www.blogger.com/profile/11358050996148333936noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-66768059987154485912014-03-04T12:41:52.608-08:002014-03-04T12:41:52.608-08:00Please look at the code I posted at pastebin. Sto...Please look at the code I posted at pastebin. Stop comparing to CABAC, it's not a fair comparison, it's designed for hardware. You need to compare equivalent things.<br /><br />In tANS, "read_bits" is not free. It's actually hiding the "if" for renormalization.<br />cbloomhttps://www.blogger.com/profile/10714564834899413045noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-35980405478440718472014-03-04T12:34:58.696-08:002014-03-04T12:34:58.696-08:00Charles, range coder is for large alphabets, while...Charles, range coder is for large alphabets, while now I want to say that e.g. tABS has also advantages for the binary case - CABAC seems to be best for comparison here(?).<br /><br />tANS step:<br />t = DecodingTable[m][x];<br />value = valMPS XOR t.symbol;<br />x = t.NewX + read_bits(t.nbBits);<br /><br />CABAC decoding step from article:<br />R_LPS = RTAB[m][(R > > 6) & 3];<br />R_MPS = R - R_LPS;<br />if (V < (R_MPS < < BitsLeft))<br />{R=R_MPS; value=valMPS;<br /> Rnorm = (R_MPS > > 8) XOR 1;}<br />else<br />{V = V – (R_MPS < < BitsLeft)<br /> R = R_LPS; value = !valMPS<br /> Rnorm = RnormTAB[R_LPS > > 3];}<br />R = R << Rnorm;<br />BitsLeft = BitsLeft - Rnorm;<br />if (BitsLeft < = 0)<br />{V = (V < < M_BITS) | read_bits(M_BITS);<br /> BitsLeft = BitsLeft + M_BITS;}Jarek Dudahttps://www.blogger.com/profile/11358050996148333936noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-76999508938115404872014-03-04T12:26:16.095-08:002014-03-04T12:26:16.095-08:00tABS by itself means binary coder with static mode...tABS by itself means binary coder with static model though. That's not a very useful primitive.Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-72197023283463867042014-03-04T12:18:16.687-08:002014-03-04T12:18:16.687-08:00For the record, here is rABS and binary arithmetic...For the record, here is rABS and binary arithmetic code :<br /><br />http://pastebin.com/qqyLNXFA<br /><br />including renorm and modeling. The speed will be nearly identical and depend on quirks of usage.<br /><br />Yes tABS might be a little faster than rABS.cbloomhttps://www.blogger.com/profile/10714564834899413045noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-90525851829709838812014-03-04T12:00:00.392-08:002014-03-04T12:00:00.392-08:00Jarek, you're being silly. CABAC is not desig...Jarek, you're being silly. CABAC is not designed to be the fastest arithmetic encoder in software. To do fast arithmetic coding in software you would just use a range coder with byte or dword renormalization.<br />cbloomhttps://www.blogger.com/profile/10714564834899413045noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-56696824816151047172014-03-04T11:58:06.586-08:002014-03-04T11:58:06.586-08:00Fabian, I couldn't find a better paper, but lo...Fabian, I couldn't find a better paper, but look at its relatively complicated code (fig 2B) - especially bit-by-bit renormalization.<br />In comparison tANS is just a single table use and BitOr taking all bits at once - it is at least two "if"s less per step.Jarek Dudahttps://www.blogger.com/profile/11358050996148333936noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-56482838593238238632014-03-04T11:32:03.027-08:002014-03-04T11:32:03.027-08:00Did you post the wrong link? The only graph in tha...Did you post the wrong link? The only graph in that paper is about the bit rate improvement of CABAC over CAVLC. There is no mention of absolute perf numbers that I can see, and their measurements were done on a 10-year old Pentium 4 - which is notable for having variable bit shifts that are 12x slower than integer adds in terms of latency.Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-74445391085794391512014-03-04T10:33:05.626-08:002014-03-04T10:33:05.626-08:00Here is an article about CABAC performance: http:/...Here is an article about CABAC performance: http://iphome.hhi.de/marpe/download/iwk06_marpe_et_al.pdf<br />Its graphs end on 40MB/s, while directly using Charles or Yann's tANS implementation for binary case we would get ~70MB/s.<br />One reason is that CABAC renormalization is bit by bit, while tANS just takes the required number of bits once per step.Jarek Dudahttps://www.blogger.com/profile/11358050996148333936noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-48193655080675326592014-03-04T09:18:44.823-08:002014-03-04T09:18:44.823-08:00Fabian, sure there is usually dependence in the bi...Fabian, sure there is usually dependence in the binary case, but still e.g. tABS has a few advantages. For example let say 64 tables for different quantized probability of the least probable symbol on 64 states - a few kilobytes of tables for deltaH~0.0001 bits/symbol (or less than a kilobyte for perfect for cheap hardware implementation qABS):<br /><br />Encoding: buffering (symbol,probability) we can use SIMD version - getting a few times speedup.<br /><br />Decoding: tABS step is just a table use and BitOr - CABAC M-coder has additionally renormalization and addition - tABS should be at least 20% faster, and both Charles and Yann's implementations seem to be.<br />Additionally, sometimes there can appear independence of a few succeeding probabilities, what allows to use a few times faster SIMD variant - there can be a few separate decoding step variants depending on the number of bits to decode in this step.Jarek Dudahttps://www.blogger.com/profile/11358050996148333936noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-88245513960445963092014-02-25T12:40:01.479-08:002014-02-25T12:40:01.479-08:00Yes, you can SIMD/GPU-decode e.g. AC coefficient m...Yes, you can SIMD/GPU-decode e.g. AC coefficient magnitudes. But you wouldn't want to use a binary coder there. Use rANS/tANS to code log(|x|) as one symbol then send sign+suffix bits raw.<br /><br />One of the reasons to use binary coders over multi-symbol coders (aside from the speed) is that adaptive modeling of bits is easy and simple. But with ANS you lose that simplicity, on the encoder side anyway (with static models that's less of a concern because you have a two-pass split between gathering stats and encoding anyway).<br /><br />Also, adaptive binary coders do a *lot* of model updates that need to be visible to everyone using that model afterwards, which forces sequence points and kills parallelism.<br /><br />Finally, note that the symbols/models in a SIMD decoder needn't be completely independent;<br />there are *certain* kinds of dependencies between symbols that can be handled efficiently in a SIMD model.<br /><br />For example, you can switch models in the middle of a SIMD "run", if all lanes can determine the location of model switch points (and thus which model they should be using) with a parallel prefix operation that only depends on the initial state vector. So you can do things like "use model 0 until you've seen three consecutive zero symbols, then switch to model 1".<br /><br />But the sweet spot for that is still switching between a handful of multisymbol contexts, not between tons of binary contexts (since the switching step is comparatively expensive).Fabian 'ryg' Giesenhttps://www.blogger.com/profile/13685994980026854143noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-43254089295871791802014-02-25T11:49:24.718-08:002014-02-25T11:49:24.718-08:00Charles, you are right - it can speedup (many time...Charles, you are right - it can speedup (many times) encoding as encoder knows all probabilities, but there is a big issue with decoding - there cannot be any dependencies between probabilities of such e.g. 8 SIMD simultaneously decoded symbols.<br />The question is when such independence appears in real applications, especially video compression, where such speedup would translate into longer battery life of mobile devices.<br />As you have mentioned, there are used independent regions like macroblocks (or slices in h.265), what is indeed difficult to use here ... but there are also DCT coefficients, which, after subtracting predictions, are encoded independently(?) - couldn't we decode 8 of them simultaneously this way?<br />Such decoder would rather need a pass mode for some positions.Jarek Dudahttps://www.blogger.com/profile/11358050996148333936noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-12576668818705700282014-02-25T09:38:24.057-08:002014-02-25T09:38:24.057-08:00Jarek, I just don't see a usage where that'...Jarek, I just don't see a usage where that's compelling. ABS is not any faster than binary arithmetic coding, which is already fast. Yes you can SIMD it, which would make it very fast indeed, but you have to gather the probability model for each bit - AND those models can't be dependent on those bits. eg. you can't use it for coding the 8 bits in a byte, since each bit decode would be dependent on the previous. It could only be used for doing like many macroblock mode bits at the same time in a video coder. The usability of it is just vanishingly rare and not even a big speedup when it is usable.cbloomhttps://www.blogger.com/profile/10714564834899413045noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-54524204719964622212014-02-25T05:53:28.195-08:002014-02-25T05:53:28.195-08:00Hi Charles,
As in many applications there is neede...Hi Charles,<br />As in many applications there is needed bitwise entropy coder - with probabilities varying from bit to bit, I wonder how ABS compares to arithmetic coding? <br />uABS or tABS should be accurate enough even for 8bit * 16 SIMD.<br />Thanks,<br />Jarek Jarek Dudahttps://www.blogger.com/profile/11358050996148333936noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-18843406852408974432014-02-19T15:47:24.471-08:002014-02-19T15:47:24.471-08:00@Ethatron - I don't see how to make that work ...@Ethatron - I don't see how to make that work here. Maybe someone can figure it out.<br />cbloomhttps://www.blogger.com/profile/10714564834899413045noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-504985230079830982014-02-19T12:01:40.620-08:002014-02-19T12:01:40.620-08:00"The fastest way I could find to do the bit I..."The fastest way I could find to do the bit IO was "big endian style""<br /><br />Can't you just invert the bit-sequence and the tables? With huffman coding you can convert the prefix-code into a postfix-code this way and do little-endian style decoding (from least to most significant bit and byte).Ethatronhttps://www.blogger.com/profile/10014459901168712472noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-35411563927953221782014-02-18T23:54:36.612-08:002014-02-18T23:54:36.612-08:00The paper went live a bit more than a day ago: htt...The paper went live a bit more than a day ago: http://arxiv.org/abs/1402.3392<br /><br />I'll also go over this in my blog but the key ideas are all in there.<br /><br />I thought ANS was kinda neat before, but the ability to do SIMD/GPU encoding/decoding (without the resulting bit stream being dependent on the SIMD width!) is fairly unique, and not something I ever expected to see in an entropy coder. (Note that the general interleaving construction given in the paper will not, generally, result in a bitstream that's SIMD-width independent; one-renormalize ANS is special in that regard.)Fabian 'ryg' Giesenhttps://www.blogger.com/profile/13685994980026854143noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-48500979158214379072014-02-18T23:12:11.917-08:002014-02-18T23:12:11.917-08:00Interestingly :) the current "beta" bran...Interestingly :) the current "beta" branch of FSE feature some equivalent speed trick (the "ILP" mode store fractional bits into multiple accumulators), with approximately equivalent impact on speed. It's the same concept.<br /><br />The rest of the code though is different, since your version seems more accurate during "preparation steps" (while fse favor speed for the time being).<br /><br />Beta is expected to be pushed to master later this week.Cyanhttps://www.blogger.com/profile/02905407922640810117noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-15721977708230160312014-02-18T19:37:40.655-08:002014-02-18T19:37:40.655-08:00Yeah, fse and "huf" ; revising to make t...Yeah, fse and "huf" ; revising to make that a little clearer...<br />cbloomhttps://www.blogger.com/profile/10714564834899413045noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-50286539125211216522014-02-18T19:06:36.069-08:002014-02-18T19:06:36.069-08:00"The tANS file sizes here are much smaller th..."The tANS file sizes here are much smaller than fse or rANS. If I'm fair and transmit the counts they should go up a bit. The rest of the win comes from the improvements to normalizing counts and making the sort order."<br />Well, your comparison vs. rANS in your code is fair since it uses the same normalized counts (and your rANS doesn't store them either); it's really just FSE that's penalized.Fabian 'ryg' Giesenhttps://www.blogger.com/profile/13685994980026854143noreply@blogger.com