Comments on cbloom rants: 09-14-10 - Challenges in Data Compression 2.5 - More on Sparsity

cbloom (2010-09-17 10:11):
I believe LZ-NTFS is the infamous "Stac" algorithm, which they sued MS over (and won) despite the fact that it was almost identical to the well-known prior works LZSS and LZRW1.

At the time MS was much hated, so lots of people cheered that David was punching Goliath, but in reality it was a travesty of patent abuse.

Anonymous (2010-09-17 02:49):
I didn't think my crappy stb.h compressor was actually interesting, but I saw that Matt Mahoney's "DCE" book mentioned the NTFS compression file format along with some numbers for it, and they seemed suspicious, so I went and tested mine against it and found mine was actually better, despite having been written mainly to keep the source code small.

So I wrote mine up just for thoroughness (http://nothings.org/stb/stb_compress.txt), not that I think it really matters.

Anonymous (2010-09-15 23:57):
M1 optimizes most of its model parameters: context masks, weights, non-linear quantization buckets, even the FSM construction parameters. Many recent context mixing compressors are already tweaked by an automated process (including all of Shelwien's, Toffer's, and my compressors).

BTW, Shelwien has an idea to compress bit histories in a lossy way so they can serve as a useful context clustering.

cbloom (2010-09-15 18:18):
Yep, absolutely. My point is that that lossy quantization really should be adaptive, and we should be using some system that finds it for us based on what's optimal for the data. That doesn't exist yet, so we usually wind up hand-tweaking it (or sometimes training it offline; I believe M1 does this).

There's not much of a theoretical formalism behind all this context merging and formation of the lossy FSMs, which means it's hard to tell how close they are to optimal and what's being left on the table.
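A minimal sketch of the kind of offline parameter training being discussed (the general approach of M1's optimizer, epmopt, or beeopt, not their actual code): greedy hill climbing over a context mask, keeping any bit flip that shrinks the coded size of a training sample. coded_size() is a hypothetical hook, assumed to run the real model over the sample and return the output length in bits.

/* Greedy search for a context mask that minimizes coded size.
   coded_size() is a hypothetical stand-in for running the actual
   compressor over a training sample with the given mask. */
#include <stddef.h>
#include <stdint.h>

extern double coded_size(uint32_t context_mask,
                         const uint8_t *sample, size_t n);

uint32_t optimize_mask(uint32_t mask, const uint8_t *sample, size_t n)
{
    double best = coded_size(mask, sample, n);
    int improved = 1;
    while (improved) {
        improved = 0;
        for (int b = 0; b < 32; b++) {
            uint32_t trial = mask ^ (1u << b); /* toggle one context bit */
            double c = coded_size(trial, sample, n);
            if (c < best) { best = c; mask = trial; improved = 1; }
        }
    }
    return mask; /* locally optimal for this sample */
}

One obvious caveat: the result is tuned to the training sample, so this is inherently an offline, asymmetric kind of optimization.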
Anonymous (2010-09-15 17:58):
Increasing the context length, as you mentioned, sometimes hurts compression. Moreover, it affects both speed and memory usage. If we instead approach context determination in a "lossy" way, we would definitely improve compression: "similar" statistics (context clustering) tend to give better predictions, and "lossy context compression" enables exactly that. That's one of the reasons why we use FSMs in our compressors (BIT, M1, CCM, PAQ, etc.).
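A toy illustration of that lossy clustering, assuming nothing about the actual state tables in BIT/M1/CCM/PAQ (which are constructed far more carefully): each context keeps a small bounded (n0, n1) pair instead of its exact bit history, and the count for the opposite bit decays on every update, so many distinct exact histories deliberately collapse into the same state.

/* PAQ-like nonstationary counter, reduced to a toy: the state is a
   pair of bounded counts, and the count for the opposite bit decays
   on each update, merging histories that predict alike. */
#include <stdint.h>

typedef struct { uint8_t n0, n1; } BitHistory;

BitHistory bh_update(BitHistory s, int bit)
{
    if (bit) {
        if (s.n1 < 255) s.n1++;
        s.n0 /= 2;              /* lossy: forget old contrary evidence */
    } else {
        if (s.n0 < 255) s.n0++;
        s.n1 /= 2;
    }
    return s;
}

/* Many distinct exact histories map to the same (n0,n1) state; that
   is the clustering. The prediction depends only on the state. */
double bh_p1(BitHistory s)
{
    return (s.n1 + 0.5) / (s.n0 + s.n1 + 1.0);
}

In a real coder the state fits in one byte and the transitions are precomputed into a table; the decay rule is what makes the mapping lossy.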
Shelwien (2010-09-14 17:22):
> In these sensitive parts of the coder, you obviously want to use as much context as possible, but if you use too much your statistics become too sparse and you start making big coding mistakes.

In other words, the context quantization function is a lossy compression function, but unlike lossy media coders, we have a well-defined quality metric here.

> For example something that hasn't been explored very much in general in text compression is severely asymmetric coders.

I guess you just missed it completely. In text compression there are preprocessors, like http://www.ii.uni.wroc.pl/~inikep/ , and in special cases, like the Hutter Prize, people put a lot of work into the selection and arrangement of words in the dictionary.

There are also quite a few projects with a parameter optimization pass. For example, http://sites.google.com/site/toffer86/m1-project has an "o" processing mode which builds a model profile for given data samples (including context masks, counter update constants, and the like).

Many other projects take a similar approach (e.g. epmopt and beeopt), and most of my coders are made like that; it just makes more sense to use in the development stage than in public utilities.
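That quality metric can be made concrete. A sketch under a simplifying assumption (static order-0 coding per bucket, rather than the adaptive coding a real compressor would use): the quality of a candidate context quantizer is just the total code length it yields on the data, in bits.

#include <math.h>
#include <stdint.h>
#include <stdlib.h>

/* Code length, in bits, of a bit stream when each raw context is
   mapped through quantize() into one of nbuckets buckets, with each
   bucket coded using its own empirical order-0 statistics.
   Lower is better: this is the quantizer's quality metric. */
double quantization_cost(const uint32_t *ctx, const uint8_t *bit, size_t n,
                         uint32_t (*quantize)(uint32_t), size_t nbuckets)
{
    unsigned (*cnt)[2] = calloc(nbuckets, sizeof *cnt);
    if (!cnt) return -1.0;

    for (size_t i = 0; i < n; i++)
        cnt[quantize(ctx[i]) % nbuckets][bit[i] & 1]++;

    double bits = 0.0;
    for (size_t b = 0; b < nbuckets; b++) {
        double total = cnt[b][0] + cnt[b][1];
        for (int v = 0; v < 2; v++)
            if (cnt[b][v])
                bits -= cnt[b][v] * log2(cnt[b][v] / total);
    }
    free(cnt);
    return bits;
}

Comparing two quantize() functions then reduces to comparing two numbers, which is what makes automated search over quantizers possible at all.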
Matt Mahoney (2010-09-14 16:49):
Hey, you just gave me an idea to improve DMC. You build a state machine like regular DMC, but instead of predicting from the ratio of counts, feed the counts to a secondary model.
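A sketch of how that might look, with all names hypothetical (this is not from any published DMC variant): keep the usual DMC graph and its per-state counts n0 and n1, but quantize (n0, n1) into a bucket index and take the prediction from an adaptively trained secondary table, in the style of an SSE/APM stage.

#include <math.h>
#include <stdint.h>

static uint16_t sec[256]; /* secondary model: learned P(bit=1), 16-bit fixed point */

/* roughly logarithmic bucketing of a DMC count into 0..15 */
static int quantize_count(double c)
{
    int q = (int)(2.0 * log2(c + 1.0));
    return q > 15 ? 15 : q;
}

static int sec_index(double n0, double n1)
{
    return (quantize_count(n0) << 4) | quantize_count(n1);
}

/* start each bucket at the prediction plain DMC would have made,
   so the secondary stage only diverges where the data demands it */
void sec_init(void)
{
    for (int i0 = 0; i0 < 16; i0++)
        for (int i1 = 0; i1 < 16; i1++)
            sec[(i0 << 4) | i1] =
                (uint16_t)(65535.0 * (i1 + 0.5) / (i0 + i1 + 1.0));
}

/* prediction for the current DMC state's counts */
double sec_predict(double n0, double n1)
{
    return sec[sec_index(n0, n1)] / 65536.0;
}

/* after coding a bit, nudge the bucket toward what actually occurred */
void sec_update(double n0, double n1, int bit)
{
    uint16_t *p = &sec[sec_index(n0, n1)];
    *p += ((bit ? 65535 : 0) - *p) / 32;
}

Initializing each bucket to the plain count-ratio prediction means the secondary stage starts out behaving exactly like ordinary DMC and only learns corrections where the statistics warrant them.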