cbloom rants: 01-09-12 - LZ Optimal Parse with A Star Part 5

Wrapping up the series with lots of numbers.

Previous parts :

cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 4
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 3
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 2
cbloom rants 10-24-11 - LZ Optimal Parse with A Star Part 1

I'm delirious with fever right now so I might write something inane, but I'm so bored of lying in bed so I'm trying to wrap this up. Anyhoo..

So first of all we have to talk a bit about what we're comparing the A Star parse to.

"Normal" is a complex forward lazy parse using heuristics to guide parsing, as described in Part 1. "Fast" is like Normal but uses simpler heuristics and simpler match finder.

"Chain" is more interesting. Chain is a complex "lazy"-type parser which considers N decisions ahead (eg. Chain 4 considers 4 decisions ahead). It works thusly :

Chain Parse : first do a full parse of the file using some other parser; this provides with a baseline cost to end from each point. Now do a forward parse. At each position, consider all match and literal options. For each option, step ahead by that option and consider all the options at the next position. Add up the cost of each coding step. After N steps (for chain N) add on the cost to end from the first baseline parse. Go back to the original position and finalize the choice with the lowest cost. Basically it's a full graph walk for N steps, then use an estimate of the cost to the end from the final nodes of that sub-graph.

To make Chain parsing viable you have to reduce the number of match options to a maximum of 8 or so. Still Chain N has a complexity of 8^N , so it becomes slow very quickly as N grows.

Chain forward parse is significantly better than LZSS style backwards optimal parse for these LZ coders that have important adaptive state. The baseline parse I use for Chain actually is a backwards LZSS optimal parse, so you can see how it does by looking at the "Chain 0" results.

First overall results. Chain 6 is the most amount of steps I can run in reasonable time, and AStar 2048 means the quantum length for dividing up the file for AStar was 2048.

	raw	Fast	Normal	Chain 6	AStar 2048
lzt00	16914	5179	5016	4923	4920
lzt01	200000	198313	198321	198312	198312
lzt02	755121	181109	177792	173220	173315
lzt03	3471552	1746443	1713023	1698949	1690655
lzt04	48649	13088	12412	10407	10249
lzt05	927796	368346	367598	355804	354230
lzt06	563160	352827	351051	344721	343173
lzt07	500000	226533	215996	209133	208566
lzt08	355400	250503	249987	230541	230220
lzt09	786488	302927	287479	268544	265525
lzt10	154624	11508	10958	10307	10291
lzt11	58524	20553	19628	19139	19087
lzt12	164423	29001	26488	23966	23622
lzt13	1041576	935484	931415	924510	922745
lzt14	102400	47690	47298	46417	46350
lzt15	34664	10832	10688	10269	10260
lzt16	21504	10110	10055	9952	9927
lzt17	53161	19526	18514	17971	17970
lzt18	102400	64280	63251	59772	59635
lzt19	768771	322951	288872	269132	269162
lzt20	1179702	888881	872315	856369	855588
lzt21	679936	91677	88011	83529	83184
lzt22	400000	287715	284378	279674	279459
lzt23	1048576	807253	804048	798369	798334
lzt24	3471552	1418076	1411387	1399197	1388105
lzt25	1029744	113085	107882	97320	100175
lzt26	262144	212445	210836	207701	207552
lzt27	857241	237253	235137	222023	220837
lzt28	1591760	332660	308940	260547	252808
lzt29	3953035	1193914	1180823	1147160	1135603
lzt30	100000	100001	100001	100001	100001
		10800163	10609600	10337879	10289860

Now number of Chain steps for the chain parser : (that's O0 - O6)

	U	N	O0	O1	O2	O3	O4	O5	O6
lzt00	16914	5016	5024	4922	4922	4922	4922	4923	4923
lzt01	200000	198321	198321	198312	198312	198312	198312	198312	198312
lzt02	755121	177792	177877	175905	174835	174073	173759	173509	173220
lzt03	3471552	1713023	1712337	1704417	1703873	1702651	1701635	1700282	1698949
lzt04	48649	12412	11315	10516	10481	10457	10427	10416	10407
lzt05	927796	367598	368729	365743	364332	360630	356403	355968	355804
lzt06	563160	351051	350995	346856	345500	344778	344739	344702	344721
lzt07	500000	215996	215644	211336	209481	209259	209244	209138	209133
lzt08	355400	249987	249372	239375	237320	231554	231435	233324	230541
lzt09	786488	287479	284875	280683	275679	270721	269754	269107	268544
lzt10	154624	10958	10792	10367	10335	10330	10311	10301	10307
lzt11	58524	19628	19604	19247	19175	19225	19162	19159	19139
lzt12	164423	26488	25644	24217	24177	24094	24108	24011	23966
lzt13	1041576	931415	931415	929713	927841	926162	924515	924513	924510
lzt14	102400	47298	47300	46518	46483	46461	46437	46429	46417
lzt15	34664	10688	10656	10317	10301	10275	10278	10267	10269
lzt16	21504	10055	10053	9960	9966	9959	9952	9948	9952
lzt17	53161	18514	18549	17971	17970	17974	17971	17973	17971
lzt18	102400	63251	63248	59863	59850	59799	59790	59764	59772
lzt19	768771	288872	281959	277661	273316	269157	269141	269133	269132
lzt20	1179702	872315	872022	868088	865376	863236	859727	856408	856369
lzt21	679936	88011	88068	84848	83851	83733	83674	83599	83529
lzt22	400000	284378	284297	281902	279711	279685	279689	279696	279674
lzt23	1048576	804048	804064	802742	801324	799891	798367	798368	798369
lzt24	3471552	1411387	1410226	1404736	1403314	1402345	1401064	1400193	1399197
lzt25	1029744	107882	107414	99839	100154	99710	98552	98132	97320
lzt26	262144	210836	210855	207775	207763	207738	207725	207706	207701
lzt27	857241	235137	236568	233524	228073	223123	222884	222540	222023
lzt28	1591760	308940	295072	286018	276905	273520	269611	264726	260547
lzt29	3953035	1180823	1183407	1180733	1177854	1170944	1162310	1152482	1147160
lzt30	100000	100001	100001	100001	100001	100001	100001	100001	100001
		10609600	10585703	10494105	10448475	10404719	10375899	10355030	10337879

Some notes : up to 6 (the most I can run) more chain steps is better - for the sum, but not for all files. In some cases, more steps is worse, which should never really happen, but it's an issue of approximate optimal parsers I'll discuss later. (*)

On most files, going past 4 chain steps helps very little, but on some files it seems to monotonically keep improving. For example lzt29 stands out. Those files are ones that get helped the most by AStar.

Now the effect on quantum size on AStar. In all cases I only output codes from the first 3/4 of each quantum.

	raw	256	512	1024	2048	4096	8192	16384
lzt00	16914	4923	4923	4920	4920	4920	4921	4921
lzt01	200000	198312	198312	198312	198312	198312	198314	198314
lzt02	755121	175242	173355	173368	173315	173331	173454	173479
lzt03	3471552	1699795	1691530	1690878	1690655	1690594	1690603	1690617
lzt04	48649	10243	10245	10234	10249	10248	10241	10241
lzt05	927796	357166	354629	354235	354230	354233	354242	354257
lzt06	563160	346663	343202	343139	343173	343194	343263	343238
lzt07	500000	209934	208669	208584	208566	208556	208553	208562
lzt08	355400	228389	229447	229975	230220	230300	230374	230408
lzt09	786488	266571	265564	265487	265525	265559	265542	265527
lzt10	154624	10701	10468	10330	10291	10273	10273	10272
lzt11	58524	19139	19123	19096	19087	19085	19084	19084
lzt12	164423	23712	23654	23616	23622	23628	23630	23627
lzt13	1041576	923258	922853	922747	922745	922753	922751	922753
lzt14	102400	46397	46364	46351	46350	46350	46348	46350
lzt15	34664	10376	10272	10260	10260	10251	10258	10254
lzt16	21504	9944	9931	9926	9927	9927	9927	9927
lzt17	53161	17937	17970	17968	17970	17969	17969	17969
lzt18	102400	59703	59613	59632	59635	59637	59640	59640
lzt19	768771	269213	269151	269128	269162	269193	269218	269229
lzt20	1179702	855992	855580	855478	855588	855671	855685	855707
lzt21	679936	83882	83291	83215	83184	83172	83171	83169
lzt22	400000	279803	279368	279414	279459	279605	279630	279647
lzt23	1048576	798325	798319	798321	798334	798354	798357	798358
lzt24	3471552	1393742	1388636	1388031	1388105	1388317	1388628	1388671
lzt25	1029744	97910	101246	101302	100175	100484	100272	100149
lzt26	262144	207779	207563	207541	207552	207559	207577	207576
lzt27	857241	222229	220832	220770	220837	220773	220756	220757
lzt28	1591760	256404	253257	252933	252808	252737	252735	252699
lzt29	3953035	1136193	1135442	1135543	1135603	1135710	1135689	1135713
lzt30	100000	100001	100001	100001	100001	100001	100001	100001
		10319878	10292810	10290735	10289860	10290696	10291106	10291116

The best sum is at 2048, but 1024 is a lot faster and almost the same.

Again, as the previous note at (*), we should really see just improvement with larger quantum sizes, but past 2048 we start seeing it go backwards in some cases.

Lastly a look at where the AStar parse is spending its time. This is for a 1024 quantum.

The x axis here is the log2 of the number of nodes visited to parse a quantum. So, log2=20 means a million nodes were needed to parse that quantum. So for speed purposes a cell one to the right is twice as bad. The values in the cells are the percentage of quanta in the file that needed that number of nodes.

(note : log2=20 means one million nodes were visited to output 768 bytes worth of codes, so it's quite a lot)

log2	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22
lzt00	0	0	0	18.18	59.09	18.18	4.55
lzt01	3.75	0.75	41.2	34.08	13.86	5.62	0.75
lzt02	1.81	1.36	25.37	34.09	13.59	13.02	8.15	1.93	0.23	0.23
lzt03	1.46	1.18	17.51	18.46	14.16	13.17	6.95	4.81	3.66	4.81	9.54	2.79	0.96	0.11	0.03
lzt04	1.67	0	0	1.67	0	21.67	5	18.33	3.33	10	16.67	16.67	5
lzt05	0.59	0.25	4.41	10.77	9.92	18.32	13.23	10.09	9.67	6.02	12.47	3.22	0.51	0.08	0.08
lzt06	0.8	0.93	6.81	23.77	14.69	16.96	21.09	11.48	2.67	0.8
lzt07	0.46	0.46	8.66	7.88	6.8	15.3	17	14.53	5.56	9.58	8.19	4.79	0.31	0.31
lzt08	0	0	0	0	0	0	1.68	1.68	1.47	27.67	53.88	11.95	1.68
lzt09	0.29	0.48	0.76	0.86	0.95	3.9	28.07	47.76	16.18	0.38
lzt10	0	0.56	10.17	12.99	9.04	9.04	10.17	41.24	4.52	0.56	1.13
lzt11	0	0	7.89	10.53	14.47	17.11	6.58	9.21	21.05	10.53	2.63
lzt12	0	0	0	0	0	4.27	28.91	59.24	7.58
lzt13	0	0	0.07	0.14	0.57	1.72	3.36	5.72	39.24	42.03	7.08	0.07
lzt14	0	0.83	0	2.5	8.33	34.17	42.5	5	2.5	1.67	0.83	0	0.83
lzt15	0	2.27	4.55	15.91	13.64	15.91	13.64	6.82	11.36	11.36	4.55
lzt16	0	0	3.57	0	14.29	42.86	32.14	3.57
lzt17	1.39	1.39	2.78	1.39	4.17	75	13.89
lzt18	0	0	0	0	0	0.72	0	2.17	2.9	11.59	56.52	23.19	2.9
lzt19	0	0	1.26	2.81	0.39	7.56	87.11	0.87
lzt20	0	0.13	2.08	2.02	4.29	67.07	24.29	0.06
lzt21	0.2	0.78	6.07	6.07	5.28	19.77	35.62	22.9	1.96	0.2	0.2
lzt22	0	0.56	2.98	5.59	26.82	62.94	1.12
lzt23	0	0	0	0	0	0.07	1.35	2.63	0.92	70.88	23.15	0.14	0.36	0.5
lzt24	0.44	0.61	4.14	37.41	7.62	12.68	12.72	8.52	6.11	5.19	3.11	0.94	0.31	0.04
lzt25	0.22	0.43	1.52	1.74	2.68	6.44	15.69	27.19	30.22	13.09	0.72
lzt26	0	0	0	1.15	3.15	2.58	77.65	14.61	0.57
lzt27	0.61	0.1	7.55	6.53	1.22	4.39	5	4.08	7.76	44.8	16.43	1.43
lzt28	0.25	0.1	3.71	0.94	0.74	6.77	15.56	10.08	10.97	14.82	18.68	11.41	4.05	1.24	0.1
lzt29	0.3	0.73	1.61	22.37	5.28	6.16	26.34	2.97	0.48	0.85	19.63	12.47	0.73
lzt30	3.7	0.74	47.41	34.07	12.59	0.74

Well there's no easy answer, the character of the files are all very different.

In many cases the A Star parse is reasonably fast (comparable to Chain 3 or something). But in some cases it's quite slow, eg. lzt04, lzt08, lzt28.

Okay, I think that's all the data. We have one point to discuss :

(*) = in all these type of endeavors, we see these anomolies where as we give the optimizer more space to make decisions, it gets better for a while, then starts getting worse. I saw the same thing, but more extreme, with video coding.

Basically what causes this is that you aren't optimizing for your real final goal. If you were optimizing for the total output size, then giving it more freedom should never hurt. But you aren't. With Chain N or with A Star in both cases you are optimizing just some local portion, and it turns out that if you let it make really aggressive decisions trying to optimize the local bit, that can hurt overall.

A similar issue happens with an Huffman optimal parse, becuase you are using the huffman code lengths from the previous parse to do the current parse. That's fine as long as your parse is reasonably similar, but if you let the optimal parser really go nuts, it can start to get pretty far off those statistics, which makes it wrong, so that more optimizing actually gives worse results.

With video coding the main issue I had was that the optimization was generally local (eg. just on one macro block at a time or some such), but it of course affects the future as a source for motion compensation (and in other ways), and it turns out if you do really aggressive optimization on the local decisions, that can wind up hurting overall.

A similar thing can happen in image and video coding if you let optimization proceed very aggressively, because you have to use some simple analytic criterion (such as RMSE - though even if you use a fancier metric the same problems arise). The issue is that the coder can wind up finding strange states that are a good trade-off for RMSE, but wind up looking just horrible visually.

Obviously the correct solution is to optimize with the true final goal in mind. But that's not always possible, either computationally, or because the final goal is subjective.

Generally the solution is to moderate the optimization in some way. You have some heuristic idea of what kind of solutions will provide good globally optimal solutions. (for example, in image/video coding, you might require that the bit rate allocation not create too big of a difference between adjacent blocks). So you sort of want to guide your optimization to start around where you suspect the answer to be, and then you tune it so that you don't allow it to be too aggressive in making whatever decision it thinks is locally optimal.

5 comments:

rygJanuary 11, 2012 at 8:20 PM
What's the relative speed of the Chain* variants compared to A*, and how much slower is ChainN compared to Chain1 (assuming a Suffix Tree or similarly fast match finder)?
cbloomJanuary 12, 2012 at 2:28 PM
"What's the relative speed of the Chain* variants compared to A*"

A* is a lot more variable. It can be as fast as Chain3 or so, but sometimes as bad as Chain6 or so.

"and how much slower is ChainN compared to Chain1 (assuming a Suffix Tree or similarly fast match finder)? "

First of all, something I maybe didn't point out is that both for ChainN and A* to be reasonably fast you have to pre-find all matches. So I decide in advance that I will consider at most 8 matches per position, and I go find all those matches and save them in an array. So you never redo match finding as you parse.

If it's 8 matches, then Chain N+1 is 8X slower than Chain N.

I should make clear that even if Chain was fast enough, it's not super awesome because it uses a pretty rough approximation for the cost to end after the N steps which doesn't include the current state at all. A* doesn't have this drawback.

At the moment Chain is more usable in production because I haven't found a good way to control the occasional very long time that A* takes.
CyanJanuary 15, 2012 at 4:00 AM
So it basically means that current LZMA could get even better compression performance by using one of your methodologies ?
cbloomJanuary 15, 2012 at 11:00 AM
Yes!
cbloomJanuary 15, 2012 at 11:01 AM
Though only something like 1-3%

1/09/2012

01-09-12 - LZ Optimal Parse with A Star Part 5

5 comments: