Previous parts :
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 4
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 3
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 2
cbloom rants 10-24-11 - LZ Optimal Parse with A Star Part 1
I'm delirious with fever right now so I might write something inane, but I'm so bored of lying in bed so I'm trying to wrap this up. Anyhoo..
So first of all we have to talk a bit about what we're comparing the A Star parse to.
"Normal" is a complex forward lazy parse using heuristics to guide parsing, as described in Part 1. "Fast" is like Normal but uses simpler heuristics and simpler match finder.
"Chain" is more interesting. Chain is a complex "lazy"-type parser which considers N decisions ahead (eg. Chain 4 considers 4 decisions ahead). It works thusly :
Chain Parse : first do a full parse of the file using some other parser; this provides with a baseline cost to end from each point. Now do a forward parse. At each position, consider all match and literal options. For each option, step ahead by that option and consider all the options at the next position. Add up the cost of each coding step. After N steps (for chain N) add on the cost to end from the first baseline parse. Go back to the original position and finalize the choice with the lowest cost. Basically it's a full graph walk for N steps, then use an estimate of the cost to the end from the final nodes of that sub-graph.
To make Chain parsing viable you have to reduce the number of match options to a maximum of 8 or so. Still Chain N has a complexity of 8^N , so it becomes slow very quickly as N grows.
Chain forward parse is significantly better than LZSS style backwards optimal parse for these LZ coders that have important adaptive state. The baseline parse I use for Chain actually is a backwards LZSS optimal parse, so you can see how it does by looking at the "Chain 0" results.
First overall results. Chain 6 is the most amount of steps I can run in reasonable time, and AStar 2048 means the quantum length for dividing up the file for AStar was 2048.
raw | Fast | Normal | Chain 6 | AStar 2048 | |
lzt00 | 16914 | 5179 | 5016 | 4923 | 4920 |
lzt01 | 200000 | 198313 | 198321 | 198312 | 198312 |
lzt02 | 755121 | 181109 | 177792 | 173220 | 173315 |
lzt03 | 3471552 | 1746443 | 1713023 | 1698949 | 1690655 |
lzt04 | 48649 | 13088 | 12412 | 10407 | 10249 |
lzt05 | 927796 | 368346 | 367598 | 355804 | 354230 |
lzt06 | 563160 | 352827 | 351051 | 344721 | 343173 |
lzt07 | 500000 | 226533 | 215996 | 209133 | 208566 |
lzt08 | 355400 | 250503 | 249987 | 230541 | 230220 |
lzt09 | 786488 | 302927 | 287479 | 268544 | 265525 |
lzt10 | 154624 | 11508 | 10958 | 10307 | 10291 |
lzt11 | 58524 | 20553 | 19628 | 19139 | 19087 |
lzt12 | 164423 | 29001 | 26488 | 23966 | 23622 |
lzt13 | 1041576 | 935484 | 931415 | 924510 | 922745 |
lzt14 | 102400 | 47690 | 47298 | 46417 | 46350 |
lzt15 | 34664 | 10832 | 10688 | 10269 | 10260 |
lzt16 | 21504 | 10110 | 10055 | 9952 | 9927 |
lzt17 | 53161 | 19526 | 18514 | 17971 | 17970 |
lzt18 | 102400 | 64280 | 63251 | 59772 | 59635 |
lzt19 | 768771 | 322951 | 288872 | 269132 | 269162 |
lzt20 | 1179702 | 888881 | 872315 | 856369 | 855588 |
lzt21 | 679936 | 91677 | 88011 | 83529 | 83184 |
lzt22 | 400000 | 287715 | 284378 | 279674 | 279459 |
lzt23 | 1048576 | 807253 | 804048 | 798369 | 798334 |
lzt24 | 3471552 | 1418076 | 1411387 | 1399197 | 1388105 |
lzt25 | 1029744 | 113085 | 107882 | 97320 | 100175 |
lzt26 | 262144 | 212445 | 210836 | 207701 | 207552 |
lzt27 | 857241 | 237253 | 235137 | 222023 | 220837 |
lzt28 | 1591760 | 332660 | 308940 | 260547 | 252808 |
lzt29 | 3953035 | 1193914 | 1180823 | 1147160 | 1135603 |
lzt30 | 100000 | 100001 | 100001 | 100001 | 100001 |
10800163 | 10609600 | 10337879 | 10289860 |
Now number of Chain steps for the chain parser : (that's O0 - O6)
U | N | O0 | O1 | O2 | O3 | O4 | O5 | O6 | |
lzt00 | 16914 | 5016 | 5024 | 4922 | 4922 | 4922 | 4922 | 4923 | 4923 |
lzt01 | 200000 | 198321 | 198321 | 198312 | 198312 | 198312 | 198312 | 198312 | 198312 |
lzt02 | 755121 | 177792 | 177877 | 175905 | 174835 | 174073 | 173759 | 173509 | 173220 |
lzt03 | 3471552 | 1713023 | 1712337 | 1704417 | 1703873 | 1702651 | 1701635 | 1700282 | 1698949 |
lzt04 | 48649 | 12412 | 11315 | 10516 | 10481 | 10457 | 10427 | 10416 | 10407 |
lzt05 | 927796 | 367598 | 368729 | 365743 | 364332 | 360630 | 356403 | 355968 | 355804 |
lzt06 | 563160 | 351051 | 350995 | 346856 | 345500 | 344778 | 344739 | 344702 | 344721 |
lzt07 | 500000 | 215996 | 215644 | 211336 | 209481 | 209259 | 209244 | 209138 | 209133 |
lzt08 | 355400 | 249987 | 249372 | 239375 | 237320 | 231554 | 231435 | 233324 | 230541 |
lzt09 | 786488 | 287479 | 284875 | 280683 | 275679 | 270721 | 269754 | 269107 | 268544 |
lzt10 | 154624 | 10958 | 10792 | 10367 | 10335 | 10330 | 10311 | 10301 | 10307 |
lzt11 | 58524 | 19628 | 19604 | 19247 | 19175 | 19225 | 19162 | 19159 | 19139 |
lzt12 | 164423 | 26488 | 25644 | 24217 | 24177 | 24094 | 24108 | 24011 | 23966 |
lzt13 | 1041576 | 931415 | 931415 | 929713 | 927841 | 926162 | 924515 | 924513 | 924510 |
lzt14 | 102400 | 47298 | 47300 | 46518 | 46483 | 46461 | 46437 | 46429 | 46417 |
lzt15 | 34664 | 10688 | 10656 | 10317 | 10301 | 10275 | 10278 | 10267 | 10269 |
lzt16 | 21504 | 10055 | 10053 | 9960 | 9966 | 9959 | 9952 | 9948 | 9952 |
lzt17 | 53161 | 18514 | 18549 | 17971 | 17970 | 17974 | 17971 | 17973 | 17971 |
lzt18 | 102400 | 63251 | 63248 | 59863 | 59850 | 59799 | 59790 | 59764 | 59772 |
lzt19 | 768771 | 288872 | 281959 | 277661 | 273316 | 269157 | 269141 | 269133 | 269132 |
lzt20 | 1179702 | 872315 | 872022 | 868088 | 865376 | 863236 | 859727 | 856408 | 856369 |
lzt21 | 679936 | 88011 | 88068 | 84848 | 83851 | 83733 | 83674 | 83599 | 83529 |
lzt22 | 400000 | 284378 | 284297 | 281902 | 279711 | 279685 | 279689 | 279696 | 279674 |
lzt23 | 1048576 | 804048 | 804064 | 802742 | 801324 | 799891 | 798367 | 798368 | 798369 |
lzt24 | 3471552 | 1411387 | 1410226 | 1404736 | 1403314 | 1402345 | 1401064 | 1400193 | 1399197 |
lzt25 | 1029744 | 107882 | 107414 | 99839 | 100154 | 99710 | 98552 | 98132 | 97320 |
lzt26 | 262144 | 210836 | 210855 | 207775 | 207763 | 207738 | 207725 | 207706 | 207701 |
lzt27 | 857241 | 235137 | 236568 | 233524 | 228073 | 223123 | 222884 | 222540 | 222023 |
lzt28 | 1591760 | 308940 | 295072 | 286018 | 276905 | 273520 | 269611 | 264726 | 260547 |
lzt29 | 3953035 | 1180823 | 1183407 | 1180733 | 1177854 | 1170944 | 1162310 | 1152482 | 1147160 |
lzt30 | 100000 | 100001 | 100001 | 100001 | 100001 | 100001 | 100001 | 100001 | 100001 |
10609600 | 10585703 | 10494105 | 10448475 | 10404719 | 10375899 | 10355030 | 10337879 |
Some notes : up to 6 (the most I can run) more chain steps is better - for the sum, but not for all files. In some cases, more steps is worse, which should never really happen, but it's an issue of approximate optimal parsers I'll discuss later. (*)
On most files, going past 4 chain steps helps very little, but on some files it seems to monotonically keep improving. For example lzt29 stands out. Those files are ones that get helped the most by AStar.
Now the effect on quantum size on AStar. In all cases I only output codes from the first 3/4 of each quantum.
raw | 256 | 512 | 1024 | 2048 | 4096 | 8192 | 16384 | |
lzt00 | 16914 | 4923 | 4923 | 4920 | 4920 | 4920 | 4921 | 4921 |
lzt01 | 200000 | 198312 | 198312 | 198312 | 198312 | 198312 | 198314 | 198314 |
lzt02 | 755121 | 175242 | 173355 | 173368 | 173315 | 173331 | 173454 | 173479 |
lzt03 | 3471552 | 1699795 | 1691530 | 1690878 | 1690655 | 1690594 | 1690603 | 1690617 |
lzt04 | 48649 | 10243 | 10245 | 10234 | 10249 | 10248 | 10241 | 10241 |
lzt05 | 927796 | 357166 | 354629 | 354235 | 354230 | 354233 | 354242 | 354257 |
lzt06 | 563160 | 346663 | 343202 | 343139 | 343173 | 343194 | 343263 | 343238 |
lzt07 | 500000 | 209934 | 208669 | 208584 | 208566 | 208556 | 208553 | 208562 |
lzt08 | 355400 | 228389 | 229447 | 229975 | 230220 | 230300 | 230374 | 230408 |
lzt09 | 786488 | 266571 | 265564 | 265487 | 265525 | 265559 | 265542 | 265527 |
lzt10 | 154624 | 10701 | 10468 | 10330 | 10291 | 10273 | 10273 | 10272 |
lzt11 | 58524 | 19139 | 19123 | 19096 | 19087 | 19085 | 19084 | 19084 |
lzt12 | 164423 | 23712 | 23654 | 23616 | 23622 | 23628 | 23630 | 23627 |
lzt13 | 1041576 | 923258 | 922853 | 922747 | 922745 | 922753 | 922751 | 922753 |
lzt14 | 102400 | 46397 | 46364 | 46351 | 46350 | 46350 | 46348 | 46350 |
lzt15 | 34664 | 10376 | 10272 | 10260 | 10260 | 10251 | 10258 | 10254 |
lzt16 | 21504 | 9944 | 9931 | 9926 | 9927 | 9927 | 9927 | 9927 |
lzt17 | 53161 | 17937 | 17970 | 17968 | 17970 | 17969 | 17969 | 17969 |
lzt18 | 102400 | 59703 | 59613 | 59632 | 59635 | 59637 | 59640 | 59640 |
lzt19 | 768771 | 269213 | 269151 | 269128 | 269162 | 269193 | 269218 | 269229 |
lzt20 | 1179702 | 855992 | 855580 | 855478 | 855588 | 855671 | 855685 | 855707 |
lzt21 | 679936 | 83882 | 83291 | 83215 | 83184 | 83172 | 83171 | 83169 |
lzt22 | 400000 | 279803 | 279368 | 279414 | 279459 | 279605 | 279630 | 279647 |
lzt23 | 1048576 | 798325 | 798319 | 798321 | 798334 | 798354 | 798357 | 798358 |
lzt24 | 3471552 | 1393742 | 1388636 | 1388031 | 1388105 | 1388317 | 1388628 | 1388671 |
lzt25 | 1029744 | 97910 | 101246 | 101302 | 100175 | 100484 | 100272 | 100149 |
lzt26 | 262144 | 207779 | 207563 | 207541 | 207552 | 207559 | 207577 | 207576 |
lzt27 | 857241 | 222229 | 220832 | 220770 | 220837 | 220773 | 220756 | 220757 |
lzt28 | 1591760 | 256404 | 253257 | 252933 | 252808 | 252737 | 252735 | 252699 |
lzt29 | 3953035 | 1136193 | 1135442 | 1135543 | 1135603 | 1135710 | 1135689 | 1135713 |
lzt30 | 100000 | 100001 | 100001 | 100001 | 100001 | 100001 | 100001 | 100001 |
10319878 | 10292810 | 10290735 | 10289860 | 10290696 | 10291106 | 10291116 |
The best sum is at 2048, but 1024 is a lot faster and almost the same.
Again, as the previous note at (*), we should really see just improvement with larger quantum sizes, but past 2048 we start seeing it go backwards in some cases.
Lastly a look at where the AStar parse is spending its time. This is for a 1024 quantum.
The x axis here is the log2 of the number of nodes visited to parse a quantum. So, log2=20 means a million nodes were needed to parse that quantum. So for speed purposes a cell one to the right is twice as bad. The values in the cells are the percentage of quanta in the file that needed that number of nodes.
(note : log2=20 means one million nodes were visited to output 768 bytes worth of codes, so it's quite a lot)
log2 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 |
lzt00 | 0 | 0 | 0 | 18.18 | 59.09 | 18.18 | 4.55 | ||||||||
lzt01 | 3.75 | 0.75 | 41.2 | 34.08 | 13.86 | 5.62 | 0.75 | ||||||||
lzt02 | 1.81 | 1.36 | 25.37 | 34.09 | 13.59 | 13.02 | 8.15 | 1.93 | 0.23 | 0.23 | |||||
lzt03 | 1.46 | 1.18 | 17.51 | 18.46 | 14.16 | 13.17 | 6.95 | 4.81 | 3.66 | 4.81 | 9.54 | 2.79 | 0.96 | 0.11 | 0.03 |
lzt04 | 1.67 | 0 | 0 | 1.67 | 0 | 21.67 | 5 | 18.33 | 3.33 | 10 | 16.67 | 16.67 | 5 | ||
lzt05 | 0.59 | 0.25 | 4.41 | 10.77 | 9.92 | 18.32 | 13.23 | 10.09 | 9.67 | 6.02 | 12.47 | 3.22 | 0.51 | 0.08 | 0.08 |
lzt06 | 0.8 | 0.93 | 6.81 | 23.77 | 14.69 | 16.96 | 21.09 | 11.48 | 2.67 | 0.8 | |||||
lzt07 | 0.46 | 0.46 | 8.66 | 7.88 | 6.8 | 15.3 | 17 | 14.53 | 5.56 | 9.58 | 8.19 | 4.79 | 0.31 | 0.31 | |
lzt08 | 0 | 0 | 0 | 0 | 0 | 0 | 1.68 | 1.68 | 1.47 | 27.67 | 53.88 | 11.95 | 1.68 | ||
lzt09 | 0.29 | 0.48 | 0.76 | 0.86 | 0.95 | 3.9 | 28.07 | 47.76 | 16.18 | 0.38 | |||||
lzt10 | 0 | 0.56 | 10.17 | 12.99 | 9.04 | 9.04 | 10.17 | 41.24 | 4.52 | 0.56 | 1.13 | ||||
lzt11 | 0 | 0 | 7.89 | 10.53 | 14.47 | 17.11 | 6.58 | 9.21 | 21.05 | 10.53 | 2.63 | ||||
lzt12 | 0 | 0 | 0 | 0 | 0 | 4.27 | 28.91 | 59.24 | 7.58 | ||||||
lzt13 | 0 | 0 | 0.07 | 0.14 | 0.57 | 1.72 | 3.36 | 5.72 | 39.24 | 42.03 | 7.08 | 0.07 | |||
lzt14 | 0 | 0.83 | 0 | 2.5 | 8.33 | 34.17 | 42.5 | 5 | 2.5 | 1.67 | 0.83 | 0 | 0.83 | ||
lzt15 | 0 | 2.27 | 4.55 | 15.91 | 13.64 | 15.91 | 13.64 | 6.82 | 11.36 | 11.36 | 4.55 | ||||
lzt16 | 0 | 0 | 3.57 | 0 | 14.29 | 42.86 | 32.14 | 3.57 | |||||||
lzt17 | 1.39 | 1.39 | 2.78 | 1.39 | 4.17 | 75 | 13.89 | ||||||||
lzt18 | 0 | 0 | 0 | 0 | 0 | 0.72 | 0 | 2.17 | 2.9 | 11.59 | 56.52 | 23.19 | 2.9 | ||
lzt19 | 0 | 0 | 1.26 | 2.81 | 0.39 | 7.56 | 87.11 | 0.87 | |||||||
lzt20 | 0 | 0.13 | 2.08 | 2.02 | 4.29 | 67.07 | 24.29 | 0.06 | |||||||
lzt21 | 0.2 | 0.78 | 6.07 | 6.07 | 5.28 | 19.77 | 35.62 | 22.9 | 1.96 | 0.2 | 0.2 | ||||
lzt22 | 0 | 0.56 | 2.98 | 5.59 | 26.82 | 62.94 | 1.12 | ||||||||
lzt23 | 0 | 0 | 0 | 0 | 0 | 0.07 | 1.35 | 2.63 | 0.92 | 70.88 | 23.15 | 0.14 | 0.36 | 0.5 | |
lzt24 | 0.44 | 0.61 | 4.14 | 37.41 | 7.62 | 12.68 | 12.72 | 8.52 | 6.11 | 5.19 | 3.11 | 0.94 | 0.31 | 0.04 | |
lzt25 | 0.22 | 0.43 | 1.52 | 1.74 | 2.68 | 6.44 | 15.69 | 27.19 | 30.22 | 13.09 | 0.72 | ||||
lzt26 | 0 | 0 | 0 | 1.15 | 3.15 | 2.58 | 77.65 | 14.61 | 0.57 | ||||||
lzt27 | 0.61 | 0.1 | 7.55 | 6.53 | 1.22 | 4.39 | 5 | 4.08 | 7.76 | 44.8 | 16.43 | 1.43 | |||
lzt28 | 0.25 | 0.1 | 3.71 | 0.94 | 0.74 | 6.77 | 15.56 | 10.08 | 10.97 | 14.82 | 18.68 | 11.41 | 4.05 | 1.24 | 0.1 |
lzt29 | 0.3 | 0.73 | 1.61 | 22.37 | 5.28 | 6.16 | 26.34 | 2.97 | 0.48 | 0.85 | 19.63 | 12.47 | 0.73 | ||
lzt30 | 3.7 | 0.74 | 47.41 | 34.07 | 12.59 | 0.74 |
Well there's no easy answer, the character of the files are all very different.
In many cases the A Star parse is reasonably fast (comparable to Chain 3 or something). But in some cases it's quite slow, eg. lzt04, lzt08, lzt28.
Okay, I think that's all the data. We have one point to discuss :
(*) = in all these type of endeavors, we see these anomolies where as we give the optimizer more space to make decisions, it gets better for a while, then starts getting worse. I saw the same thing, but more extreme, with video coding.
Basically what causes this is that you aren't optimizing for your real final goal. If you were optimizing for the total output size, then giving it more freedom should never hurt. But you aren't. With Chain N or with A Star in both cases you are optimizing just some local portion, and it turns out that if you let it make really aggressive decisions trying to optimize the local bit, that can hurt overall.
A similar issue happens with an Huffman optimal parse, becuase you are using the huffman code lengths from the previous parse to do the current parse. That's fine as long as your parse is reasonably similar, but if you let the optimal parser really go nuts, it can start to get pretty far off those statistics, which makes it wrong, so that more optimizing actually gives worse results.
With video coding the main issue I had was that the optimization was generally local (eg. just on one macro block at a time or some such), but it of course affects the future as a source for motion compensation (and in other ways), and it turns out if you do really aggressive optimization on the local decisions, that can wind up hurting overall.
A similar thing can happen in image and video coding if you let optimization proceed very aggressively, because you have to use some simple analytic criterion (such as RMSE - though even if you use a fancier metric the same problems arise). The issue is that the coder can wind up finding strange states that are a good trade-off for RMSE, but wind up looking just horrible visually.
Obviously the correct solution is to optimize with the true final goal in mind. But that's not always possible, either computationally, or because the final goal is subjective.
Generally the solution is to moderate the optimization in some way. You have some heuristic idea of what kind of solutions will provide good globally optimal solutions. (for example, in image/video coding, you might require that the bit rate allocation not create too big of a difference between adjacent blocks). So you sort of want to guide your optimization to start around where you suspect the answer to be, and then you tune it so that you don't allow it to be too aggressive in making whatever decision it thinks is locally optimal.
What's the relative speed of the Chain* variants compared to A*, and how much slower is ChainN compared to Chain1 (assuming a Suffix Tree or similarly fast match finder)?
ReplyDelete"What's the relative speed of the Chain* variants compared to A*"
ReplyDeleteA* is a lot more variable. It can be as fast as Chain3 or so, but sometimes as bad as Chain6 or so.
"and how much slower is ChainN compared to Chain1 (assuming a Suffix Tree or similarly fast match finder)? "
First of all, something I maybe didn't point out is that both for ChainN and A* to be reasonably fast you have to pre-find all matches. So I decide in advance that I will consider at most 8 matches per position, and I go find all those matches and save them in an array. So you never redo match finding as you parse.
If it's 8 matches, then Chain N+1 is 8X slower than Chain N.
I should make clear that even if Chain was fast enough, it's not super awesome because it uses a pretty rough approximation for the cost to end after the N steps which doesn't include the current state at all. A* doesn't have this drawback.
At the moment Chain is more usable in production because I haven't found a good way to control the occasional very long time that A* takes.
So it basically means that current LZMA could get even better compression performance by using one of your methodologies ?
ReplyDeleteYes!
ReplyDeleteThough only something like 1-3%
ReplyDelete