cbloom rants

1/23/2015

01-23-15 - LZA New Optimal Parse

Oodle LZA has a new optimal parse.

It's described at encode.ru and I don't feel like repeating myself today. So go read about it there. Thanks to Bulat Ziganshin for explaining the LZMA-style parse in a way that I could finally understand it.

First see Bulat's earlier post [LZ] Optimal parsing which is a description of Tornado's optimal parse and probably the clearest place to get started.

This style of parse is also described in the paper "Bit-Optimal Lempel-Ziv compression".

Some previous rants on the topic :

cbloom rants 10-10-08 - 7
cbloom rants 01-09-12 - LZ Optimal Parse with A Star Part 5
cbloom rants 09-04-12 - LZ4 Optimal Parse
cbloom rants 09-11-12 - LZ MinMatchLen and Parse Strategies
cbloom rants 09-24-12 - LZ String Matcher Decision Tree
cbloom rants 06-12-14 - Some LZMA Notes
cbloom rants 06-16-14 - Rep0 Exclusion in LZMA-like coders
cbloom rants 06-21-14 - Suffix Trie Note

I should note that the advantage of this optimal parse over my previous LZA optimal parse (backward-forward-chain-N) is that it can achieve roughly the same compression in much less time. The chain-N forward parse can get more compression if N is increased, but the time taken is exponential in N, so it becomes very slow for N over 2, and N over 4 is unusable in practice.

LZA Optimal level 1 uses the straightforward version of this parse. I flush the parse to update statistics any time there are no matches that cross a position (either because no match is possible, or a long match is possible in which case I force it to be chosen, ala LZMA). This looks like :

LZA Optimal level 2 stores the 4 cheapest arrivals to each position. This allows you to arrive from a previous parse which was not the cheapest on its prior subgraph, but results in a cheaper total parse to your current subgraph. You get this sort of braided 4-line thing.
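
To make the data structure concrete, here's a minimal sketch of what this kind of arrival table could look like (hypothetical C++ with made-up names; not Oodle's actual code) :

// Hypothetical sketch of a multi-arrival parse table (not the real LZA code).
// Each position keeps its 4 cheapest arrivals; an arrival records how we got
// here and enough state (the last-offset set) to price future choices.

#include <cstdint>
#include <vector>

enum { kMaxArrivals = 4 };

struct Arrival
{
    int32_t  cost;           // total price in bits (scaled) to arrive at this position
    int32_t  fromPos;        // position we came from
    int32_t  fromSlot;       // which arrival slot at fromPos (the "thread")
    int32_t  len;            // match/literal length used to arrive here (1 = literal)
    int32_t  offset;         // offset used (0 = literal)
    int32_t  lastOffsets[4]; // last-offset (rep) set in this arrival's context
};

struct ParsePos
{
    Arrival arrivals[kMaxArrivals]; // slot 0 is the cheapest
    int     numArrivals;
};

std::vector<ParsePos> parseTable; // one entry per position in the current chunk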

Here's a drawing showing the trace backwards to flush the parse after it's been filled out :

At the flush pos, you take arrival slot 0 (the cheapest) and discard the other arrivals. When you resume from that spot to continue the parse you must resume from only slot 0. It sort of reminds me of the earlier QM post - the parse is uncertain, not set, with these 4 different realities that are possible at each position, until you make a Copenhagen measurement and actually flush the parse, at which point the alternative histories are discarded and you snap to just one. When you resume the parse, the number of arrivals very rapidly grows from 1 up to the maximum of 4, and then stays at 4 through the parse until you snap back to 0.

Unlike the level 1 parse, I do not sync at unambiguous points, because you want to keep the possibilities alive for a while. There's a tradeoff between up-to-date statistics and longer flexible parse intervals. (the reason this tradeoff exists is that I don't bring the statistics forward in the parse; it is possible to do so, as described by Bulat in the encode.ru thread, but my tests indicate it's not worth the speed & memory use hit)

I'm going to show some prints of an actual parse. The columns are the 4 arrivals, and positions increase going down.

The first number is how many positions you step back to reach your previous point, and the next number is which thread (arrival context) you came from at that position.

So a literal is a step back of 1 :


  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3

This is a chunk of parse where only literals are possible, and the costs are about the same in each context so the states just carry forward.

  1|0   1|1   1|2   1|3
  1|0   1|2   1|1   1|3
  1|0   1|1   1|2   1|3

Literals don't always just carry forward, because the cost to code a literal depends on the last-offset context which can be different in each arrival. Here threads 1 & 2 swapped because of literal cost differences.

Through the vast majority of the parse, slot 0 (cheapest) arrives from slot 0. Through all that range, the alternate parses are playing no role. But once in a while they swap threads. Here's an example with the trace-back highlighted :


 [3|0]  3|1   3|2   1|0
 [1|0]  1|1   1|2   1|3
 [1|0]  1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  5|0  [6|0]  5|1   6|1
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|2   1|1   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  8|0  [8|1]  8|2   8|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  4|0  [4|1]  4|2   4|3
  1|0  [1|1]  1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  3|0  [3|1]  3|2   3|3
  1|0  [1|1]  1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
  1|0   1|1   1|2   1|3
 [7|1]  7|3   7|0   7|2

If the last pos there was a flush point, that would be the parse we trace back to output. Remember that the length is not the length coded at that pos, but the length to *arrive* at that pos (the length coded at the previous pos).
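
For clarity, here's a hypothetical sketch of that trace-back, reusing the Arrival/ParsePos layout sketched earlier (again made-up names, not the real code) :

#include <algorithm>
#include <vector>

// Walk backwards from the flush position, always starting in slot 0 (the cheapest).
// The len/offset stored at an arrival is the step that *arrives* at that position,
// ie. the thing coded at fromPos.

struct ParseStep { int pos, len, offset; };

std::vector<ParseStep> TraceBack(const std::vector<ParsePos> & table, int flushPos)
{
    std::vector<ParseStep> steps;
    int pos = flushPos, slot = 0;
    while (pos > 0) // position 0 is the chunk start, nothing arrives there
    {
        const Arrival & a = table[pos].arrivals[slot];
        steps.push_back({ a.fromPos, a.len, a.offset });
        slot = a.fromSlot;
        pos  = a.fromPos;
    }
    std::reverse(steps.begin(), steps.end()); // forward order for output
    return steps;
}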

A more detailed printout showing how this type of thing happens. Here I'm also printing match vs. last-offset match, and the offset used in the match :


 [4L+708|0] 3L+708|0  4L+708|1  4L+708|2
 [1 +  0|0] 1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
 [3L+708|0] 3L+708|1  3L+708|2  3L+708|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
 [4M+228|0] 4M+228|1  4M+228|2  3M+228|0
  1 +  0|0  1 +  0|1  1 +  0|3  1 +  0|2
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
 [4M+732|0] 4M+732|1  1 +  0|0  3M+732|0
 [1 +  0|0] 1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|3  1 +  0|2
  1 +  0|0  1 +  0|1  1 +  0|3  1 +  0|2
  4M+ 12|0 [3L+732|0] 5L+ 12|0  4L+ 12|2
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  4M+252|0 [4M+252|1] 4M+252|2  4M+252|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
 [4L+708|1] 4L+708|2  4L+708|3  4M+708|0

In the last position here we want to code a match of length 4 at offset 708. Arrivals 1-3 can code it as a last-offset, but arrival 0 (the cheapest) does not have 708 in the last-offset set (because it previously coded 252,12,732,228 and knocked the 708 out of its history).

You can see that way back in the past, 708 was used twice, and threads 1-3 have only used three other offsets since (my last-offset set keeps four offsets), so they still have 708.

This is really what the multiple arrivals are all about. You get an extremely non-local effect: by keeping that 708 in the last-offset set, you get a cheaper parse way off in the future when it's used again.
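
For reference, here's a minimal sketch of an LZMA-style last-offset ("rep") set update, which is the mechanism that knocked 708 out of arrival 0's history (hypothetical code, not the actual LZA implementation) :

#include <cstdint>

struct LastOffsets
{
    int32_t offs[4]; // most recent first

    // repeat match : move the matched slot to the front
    void UseRep(int slot)
    {
        int32_t o = offs[slot];
        for (int i = slot; i > 0; i--) offs[i] = offs[i-1];
        offs[0] = o;
    }

    // normal match with a new offset : shift everything down, the oldest falls off
    void UseNewOffset(int32_t o)
    {
        for (int i = 3; i > 0; i--) offs[i] = offs[i-1];
        offs[0] = o;
    }
};

So after arrival 0 codes 252, 12, 732, 228, its four slots are full of those offsets and 708 is gone, while the threads that coded fewer distinct offsets still hold it.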

Another one because I find this amusing :


  1 +  0|0  1 +  0|1  1 +  0|2  1 +  0|3
 15M+360|0 12M+360|0[16M+720|0]12M+360|1
  1 +  0|0 [1 +  0|2] 1 +  0|1  1 +  0|3
 [1 +  0|1] 1 +  0|0  1 +  0|2  1 +  0|3
 [1 +  0|0] 1 +  0|1  1 +  0|2  1 +  0|3

What's happened here is the literal coding is cheaper with offset 720 as the last-offset. It's not until two literals later that it becomes the winner. The first literal after the match uses LAM exclusion, and then the next literal after that uses LAM as coding context. ( as described here )


I've been highlighting interesting bits of parse. It should be noted that the vast majority of every parse is just staying on the channel-0 (cheapest) thread. Like this :


     1688: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1689: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1690: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1691: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1692: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1693: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1694: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1695:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1696:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1697: [  3M+167|0]   3M+167|1    3M+167|2    1 +  0|0 
     1698:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1699:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1700: [  3M+ 45|0]   4M+ 45|0    3M+ 45|1    3M+ 45|2 
     1701:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1702:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1703:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1704:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1705:    3M+ 55|0    3M+ 55|1    3M+ 55|2    3M+ 55|3 
     1706:    6M+ 58|0    6M+ 58|1    6M+ 58|2    6M+ 58|3 
     1707:    7M+251|0    7M+251|1    1 +  0|0    1 +  0|1 
     1708: [  8M+330|0]   8M+330|1    8M+330|2    1 +  0|0 
     1709:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1710:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1711:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 

Anyway, here's one I spotted which exhibits unusual jumping around :

     3642: [  4L+612|0]
     3643: [  1 +  0|0]
     3644: [  1 +  0|0]
     3645: [  1 +  0|0]
     3646: [  1 +  0|0]
     3647: [  1 +  0|0]
     3648:    1 +  0|0 
     3649:    1 +  0|0 
     3650: [  3M+192|0]   1 +  0|0 
     3651:    1 +  0|0    1 +  0|1 
     3652:    1 +  0|0    1 +  0|1 
     3653:    1 +  0|0    1 +  0|1 
     3654:    5L+ 12|0 [  4M+ 12|0]   4L+ 12|1    3M+ 12|0 
     3655:    1 +  0|0 [  1 +  0|1]   1 +  0|2    1 +  0|3 
     3656:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3657:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3658:    3L+612|0 [  3L+612|1]   3L+612|2    3L+612|3 
     3659:    1 +  0|0 [  1 +  0|1]   1 +  0|2    1 +  0|3 
     3660:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3661:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3662:    3L+228|0 [  3L+192|1]   3L+228|2    3L+192|3 
     3663:    1 +  0|0 [  1 +  0|1]   1 +  0|2    1 +  0|3 
     3664:    1 +  0|0    1 +  0|2    1 +  0|1    1 +  0|3 
     3665:    1 +  0|0    1 +  0|2    1 +  0|1    1 +  0|3 
     3666:    3L+228|0    4M+624|0    4M+624|1 [  3L+612|1]
     3667:    1 +  0|0    1 +  0|1 [  1 +  0|3]   1 +  0|2 
     3668:    1 +  0|0    1 +  0|1    1 +  0|3    1 +  0|2 
     3669:    1 +  0|0    1 +  0|1    1 +  0|3    1 +  0|2 
     3670:    3M+576|0    3M+576|1 [  3M+576|2]   3M+576|3 
     3671:    1 +  0|0    1 +  0|1 [  1 +  0|2]   1 +  0|3 
     3672:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3673:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3674: [  3L+192|2]   3L+192|3    3M+192|0    3M+192|1 
     3675:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3676:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3677:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     3678: [  4L+ 12|0]   4M+ 12|1    3L+ 12|0    3L+624|1 

All the previous examples have been from highly structured binary files. Here's an example on text :


     1527: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1528:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1529:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1530:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1531: [  4M+100|0]   4M+100|1    4M+100|2    4M+100|3 
     1532:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1533:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1534:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1535:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1536:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1537:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1538:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1539:   10M+ 32|0   10M+ 32|1 [  8M+ 32|0]   8M+ 32|1 
     1540:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1541:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1542:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1543:    3M+ 47|0    3M+ 47|1    3M+ 47|2    3M+ 47|3 
     1544:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1545:    5M+ 49|0    5M+ 49|1    5M+ 49|2    5M+ 49|3 
     1546:    5M+  1|0    5M+  1|1    5M+  1|2    5M+  1|3 
     1547:    8M+ 50|0 [  8M+ 50|2]   8M+ 50|1    8M+ 50|3 
     1548:    1 +  0|0 [  1 +  0|1]   1 +  0|2    1 +  0|3 
     1549:    1 +  0|0    1 +  0|1    1 +  0|2    1 +  0|3 
     1550: [  2L+999|1]   2L+999|3    2L+999|0    2L+999|2 
     1551: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 
     1552: [  1 +  0|0]   1 +  0|1    1 +  0|2    1 +  0|3 

The benefit of the N-arrival parse is far far lower on text than on binary.


text :

enwik8 -z5 : 25384698
enwik8 -z6 : 25358366

binary :

fez_essentials -z5 : 9780036
fez_essentials -z6 : 9512780

lzt02 -z5 : 146098
lzt02 -z6 : 141141

lzt25 -z5 : 63376
lzt25 -z6 : 56982


Some random notes :

In LZA I do not have the N^2 problem like Tornado. For each match, I only consider the maximum possible match length. The reason I can do this is because I enforce strict LAM exclusion. Shorter lengths are not allowed because they would not exclude the LAM.

This is not an "optimal" parse in the sense of being exactly right. To be a true optimal parse, you would have to carry the statistics forward through the tree, and you would have to store *all* arrivals to a given position, not just the best 4.

The A-Star parse I previously wrote about is pretty similar to this, actually. Rather than only storing 4 arrivals at each position, it can store an unbounded # of arrivals, and it tries to discard paths that cannot ever beat the best path (with a fuzzy tolerance to make that viable).

Storing the 4 cheapest arrivals is maybe not right. You want the 1 cheapest, but then what you want for the other 3 is for them to be interesting alternatives. For example you want their last-offset set to be different than the cheapest arrival, you don't just want a more expensive way to wind up with the same thing. I've done some limited experiments with this but haven't found anything definitive yet.

There's also the issue of deciding when to reset the statistical state.

Because statistics aren't carried forward, there's the issue of parse-statistics feedback. In particular, some heuristic guidance does help (like don't ever take a normal match when a longer repeat-match is possible). Also if the beginning of the file is unusual, it can lead you into a bad local minimum.

1/22/2015

01-22-15 - Nexus 5 Usability

The physical design of these phones is so rotten.

First of all, here's a fundamental principle of good device design :

It should not look symmetric unless it actually *is* symmetric.

I frequently pick it up and try to tap the on button, only to discover that I have it upside down. I shouldn't have to spend those seconds staring at the thing to figure out which end is up.

Since it's in fact behaviorally *not* symmetric, they shouldn't have tried so hard to disguise the buttons and make it look symmetric. It should have a strong different-color strip on one end to make it obvious which end is up. This whole "make it look like a single color brick" that came from Apple is fucking retarded design.

Also the side-buttons should not be black and blend in to the device. They should be a contrast color.

My grilling process would be like this : "why are the buttons the same color as the body?" "uhh, because it looks cool?" "you're fired". "which end is up?" "well, it looks cooler all the same color" "you're fired".

A related symmetry complaint is that it doesn't rotate the screen when you turn it upside down. Again if it looks fucking symmetric, then act symmetric. Don't leave me there going "why the fuck are you not rotating the screen? Oh it's fucking upside down". If I turn it upside down then fucking rotate the screen the way I'm holding it.

And of course the fucking volume-up-down should be a physical slider so that I can tell the level at a glance and change it even when the thing is off. And tapping it by accident doesn't mute it or send the volume through the roof or any shit like that. Like all fucking knobs and sliders should *always* be physical not fucking digital shuttles OMG. Fucking AC controls and volume knobs in cars that are just digital shuttles make me want to murder everyone.

Maybe the neatest symmetry solution would be to make it slightly tapered. Make the bottom slightly narrower than the top, which would fit the hand better anyway. More curved at the bottom and more square at the top.

There's a huge hardware problem in that if you have a finger on the screen it won't register other taps. I don't understand how this is acceptable. If I ham-fist it and accidentally touch the screen with my gripping hand, it just doesn't work. WTF. Perhaps worse, if my baby has her finger on it, then I can no longer operate it to change to a video for her or whatever.

Many of the standard Google Apps have major GUI flaws.

For example in the Camera it's way too hard to change the basic function, such as switching between pictures and video or switching the camera direction; you have to go through some fucking settings menu. They should just be buttons.

In the phone, it's way too easy to call someone when you're just trying to get info on your call history. (I had to add a "call confirm" app because I kept dialing numbers when I scrolled around)

In all the apps, if you have them set to only sync on wifi or only sync on charger, there's no override button to just "fucking sync now anyway". You have to dig into the menus and change the setting then change it back after syncing, very annoying.

In gmail it's nearly fucking impossible to change between threaded & non-threaded view, which would be handy to do frequently. (the threaded view also makes it really hard to find the newest mail)

But there's a worse problem in general with apps.

It's just completely hidden what you can actually do in them. What areas can I click on? What are buttons and what do they do? And then some spots are different actions if I hold a tap there. And some spots are a different action if I drag there. WTF WTF. This is such rotten usability GUI design.

One option to fix it would be to obviously color all GUI elements based on how they can be used. Red lines are draggable, blue elements are tappable, green have a tap-and-hold extra action.

Even aside from making it more obvious, there needs to be a way to highlight all GUI actions. Somewhere I can tap that will make them all pop up tool-tips, or animate "this is holdable, this is draggable" etc.

Why do I have to lecture people on fucking 1950's GUI principles. Make fucking buttons look like buttons! Don't hide them as graphical elements! Function over form please!

The copy-paste interface is horrific. The main way I run into it is if I get something like an address in an email and it fails to auto-link to the maps app, and I want to manually get that address over there. For fucks sake it's impossible. Or like if I find a URL and it doesn't auto-link to the browser, and I have to try to select it and copy it, I want to kill myself. Which I guess is just a general phone problem - as long as the function you want is hard-coded in, it's okay, but god help you actually make apps interact in a nonstandard way. (like take a photo that's on your phone and upload it as an attachment in a web forum post)

Some other random complaints while I'm at it :

There's no fucking missed calls & texts notification on the home page or lock page. Uh, hello. Make it work as a basic fucking phone please. It's like the most basic fucking thing you need in a phone, it should tell you what calls and texts you got. I have to fucking browse around various apps just to find out if I got any calls or texts? WTF. (yes I know I can buy an app to add these features, but seriously WTF).

(addendum : seriously WTF. Recent missed calls and texts need to be on the home page. This is one complaint I actually need to fix because clicking around every time I turn the phone on is ridiculous.)

There seems to be no way to cache my entire local region in maps. For a real world where we have spotty cell coverage in the country, this is pretty awful. In general offline use (as you might want in areas of terrible reception, eg. everywhere) is extremely poor.

There seems to be no way to clear all private data on every boot. I guess nobody in the world but me even tries to have any privacy these days. I'd like to have no saved cookies, no web passwords, and zero app-sign-ins on every boot.

I should be able to get on the internet over the USB cable.

I should be able to remote control the phone from a PC app over a USB cable. I'd like to do all my setup and management from a PC with a fucking mouse and keyboard please, not by poking with my ham-fist. I'd like to have just a big tree-view of all the options so I can clearly see them and not have to poke around to find them.

I fucking hate the way maps is always trying to twist on me. Fucking up is north. There seems to be no setting to lock north, so I have to constantly tap the fucking icon. Very annoying.

There needs to be a global option to turn off all starring of passwords. Because starred-out passwords and ham-fingers don't mix.

1/14/2015

01-14-15 - Spooky Action at a Distance

I loved "Only Lovers Left Alive". There's not much to the plot, but as just a bit of mood it's lovely. The design, the music, everything is perfect. It reminds me of my youth, actually. An aesthetic and lifestyle that's no longer fashionable.

Adam's song - Spooky Action At A Distance - is fantastic.

BUT - there is no fucking spooky action at a distance!

Ooo, you say, Einstein was confused about it. It must be all mysterious and nobody understands it. NO!

Quantum Mechanics is not some crazy mysterious metaphysical thing that science can't explain and therefore allows all sorts of quackery.

First of all, the average person shouldn't read thoughts about Quantum Mechanics that were written in the 20's and 30's when it was still new and not very well understood yet. Hey, obviously they were a little confused back then. Read something modern!

Second of all, I think the way QM is taught in schools is still largely terrible. I don't know how it's done now, but when I went through it, it was still being taught in a "historical" way. That is, going through the progression in roughly the same way it was discovered. First you learn quantization of photons, then you get a wave function, you get the Copenhagen interpretation and measurement collapse, and then you finally start to get bra-ket and Hilbert spaces, and only much later (grad school) do you get decoherence theory.

The right way to teach QM is to jump straight to the modern understanding without confusing people with all that other gunk :

1. The universe is a linear vector space of *amplitudes* (this is the source of all the weirdness and is the fundamental postulate) (the universe seems to work in the square roots of things, and double-covers, and this leads to a lot of the non-classical weirdness)

2. The entire universe is QM, there is no classical "observer" that lies outside the QM system.

3. The universe takes all possible outcomes all the time. However, an observer in any given state is not in all of those outcomes.

4. Measurement is the process of entangling the QM state of the measurement device with the QM state of the observed process.

In a very brief nutshell measurement decoherence (entanglement of the measurement device / outside world with a quantum system) goes like this :


You have a measurement device which is initially in an undetermined state |M>

You have something to measure.  Let's take a quantum state which is either 0 or 1 
(eg. a single particle that's either spin up or down) 

|0> + |1>

Initially they are not entangled :

|M> ( |0> + |1> )

And measurement changes them to be entangled :

( |M0>|0> + |M1>|1> )

M0 means the measurement device has observed the state 0

Note that there is no "collapse". Both possibilities still exist, but if the measurement device observes a 0 then the particle must be in state 0.

Now, in a book teaching QM we'd have to go into more detail about what "measurement decoherence" actually is. Because the entire universe is governed by QM, we'd have to show that that process happens by QM as well. In fact that has been done and the details worked out over the past 40 years, so it's relatively new.

Measurement decoherence doesn't happen at the single-particle level. Rather it's a large-scale phenomenon that happens due to entropy, much like thermodynamics. This is why "measurement collapse" could be seen as a connection to an outside classical system for large objects, because it only happens at large system scales. It only happens one way - you can entangle the measurement device with the quantum particles, but you can never de-entangle them (this is why in the old copenhagen interpretation you could get away with talking about "collapse", because if the measurement device is classical-scale then it can never go backwards). The basic laws of QM are time-reversible, but decoherence is not. It's exactly like if you took a bucket of red water and a bucket of blue water and mixed them in one larger bucket - it is now effectively impossible to ever separate out all the particles and restore the two different buckets.

So. The so-called "EPR Paradox" is not a paradox at all. (well it is if you assume a deterministic physics with hidden variables, which is just wrong; it should be called the EPR Proof that Einstein was Wrong Sometimes). The EPR thought experiment goes like this :

You have two particles that are either in state 0 or 1, and are entangled so that they are either both 0 or both 1 :


|0>|0> + |1>|1>

You separate them to opposite ends of the universe, where lie Alice and Bob :

|A>|B>( |0>|0> + |1>|1> )

Bob then measures his particle and becomes entangled with it, in state B0 or B1 :

|A>( |B0>|0>|0> + |B1>|1>|1> )

If Alice now measures her particle, she will get either a 0 or 1, and see the same thing as Bob. However, nothing has moved faster than light. Her entanglement happens entirely on her local particle :

|A>|0> -> |A0>|0>

it has nothing to do with the state of Bob far away. The initial entangled state of the particle simply guarantees they will get the same result. But the physical interaction is still local and slower than light, and the information about the result must travel slower than light.
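
To make the "no information has moved" point concrete, here's the same final state with the normalization that the bra-kets above leave out, and Alice's local probabilities computed from it :

\[
|\Psi\rangle = \tfrac{1}{\sqrt{2}}\, |A\rangle \big( |B_0\rangle|0\rangle|0\rangle + |B_1\rangle|1\rangle|1\rangle \big)
\]
\[
P_A(0) = \Big| \tfrac{1}{\sqrt{2}} \Big|^2 = \tfrac{1}{2} , \qquad P_A(1) = \tfrac{1}{2}
\]

Those are exactly the probabilities Alice had before Bob measured anything, so nothing Bob does changes what Alice can observe locally. The correlation only shows up when the two results are later brought together and compared, and that comparison travels slower than light.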

There is never any "action at a distance" or any "paradox" or anything "spooky".

Sheesh. Come on vampires, you've been alive long enough to get your physics right.

1/05/2015

01-05-15 - Daala PVQ Summary

A much belated attempt at summarizing the Daala PVQ coder. This post is in the wrong temporal order, it's sort of an introduction before the other rambling.


First consider Intra, that is the case of no prediction.

We have a block of some variable size (4x4 to 16x16) that has been transformed. The DC is sent by some other mechanism, so we are left with some AC's.

The AC's are grouped into subbands. These are just collections of likely-similar AC coefficients. Ideally they should be statistically similar. The current subbands in Daala are :

The AC coefficients in each subband are considered as a vector. We first send the "gain" of the vector, which is the L2 norm (the "length" of the vector). (*1)

The gain can be sent with a non-uniform scalar quantizer. This allows you to build variance-adaptive quantization into the codec without side data. (H264 does this by sending per-block quantizer changes based on activity, which costs extra bits in signalling.) Essentially you want to use larger quantization bins for larger gains, because specifying the exact amount of a large energy is not as important. Daala seems to take gain to the 2/3 power and uniformly quantize that. (*2) (*3)
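
As a sketch of what that companding could look like (hypothetical code and names, inferred from the description above; not Daala's actual implementation) :

#include <cmath>

// Quantize gain^(2/3) uniformly; larger gains get larger effective bins,
// which is the built-in activity masking described above.

int QuantizeGain(double gain, double qstep)
{
    double companded = std::pow(gain, 2.0 / 3.0);
    return (int)std::floor(companded / qstep + 0.5);
}

double DequantizeGain(int qg, double qstep)
{
    return std::pow(qg * qstep, 3.0 / 2.0); // invert the 2/3 power
}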

Once we've sent the gain, we need to send how that energy is distributed among the coefficients. I won't go into the details of how it's done. Of course you have already sent the total energy, so you don't want to just send the coefficient values; that would be highly redundant. You may think of it as dividing the vector by the gain; we are left with a unit vector. The Fischer "Pyramid Vector Quantizer" is one good starting point.

Note that Fischer PVQ is, strictly speaking, not quite applicable here, because it is only optimal if the coefficients in the vector are independent and have the same distribution, which is not the case in images. Because of that, just sending the index of the PVQ codebook without entropy coding is probably wrong, and a way of coding a PVQ codebook selection that uses the properties of the AC coefficients is preferable. (eg. code one by one in a z-scan order so you know likelihood is geometrically decreasing, etc. Daala has some proposals along these lines).

A key issue is determining the PVQ codebook resolution. With PVQ this is K, the L1 norm of the unnormalized (integer coefficients) codebook vectors. Daala computes K without sending it. K is computed such that error due to the PVQ quantization is equal to error due to gain quantization. (*4). This makes K a function of the overall quantization level, and also of the gain of that block - more gain = more K = more resolution for where that gain goes.
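
Here's a hedged sketch of the Pyramid VQ quantization step itself, given a subband vector and a K (an approximate projection with made-up names, not Daala's actual search) :

#include <cmath>
#include <vector>

// Find an integer vector y with L1 norm exactly K that points roughly the same
// way as x. Together with the separately coded gain, y reconstructs the subband.

std::vector<int> PvqQuantize(const std::vector<double> & x, int K)
{
    int n = (int)x.size();
    double l1 = 0;
    for (double v : x) l1 += std::fabs(v);
    if (l1 == 0) return std::vector<int>(n, 0);

    std::vector<int> y(n);
    int used = 0;
    for (int i = 0; i < n; i++)
    {
        y[i] = (int)std::floor(std::fabs(x[i]) * K / l1); // conservative rounding
        used += y[i];
    }
    while (used < K) // hand out the remaining pulses to the coefficients that want them most
    {
        int best = 0; double bestWant = -1;
        for (int i = 0; i < n; i++)
        {
            double want = std::fabs(x[i]) * K / l1 - y[i];
            if (want > bestWant) { bestWant = want; best = i; }
        }
        y[best]++; used++;
    }
    for (int i = 0; i < n; i++) if (x[i] < 0) y[i] = -y[i]; // restore signs
    return y;
}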

A non-uniform quantization matrix is a bit murky in this scheme (ala the JPEG CSF factors) because it changes the scaling of each axis in the vector, as well as the average magnitude, which violates the assumptions of Pyramid VQ. Applying a different quantizer per subband is an easy compromise, but pretty coarse. (*5)

And that's it for intra.

Advantages vs. traditional residual coding :

1. You send the gain up front, which is a good summary of the character of the block. This allows for good modeling of that gain (eg. from neighbors, previous gains). It also allows that gain to be used as a context or coding model selector for the rest of the block.

2. Because you send the gain explicitly, it is better preserved than with traditional coding (especially with Trellis Quant which tends to kill gain). (deadzone quantizers also tend to kill gain; here a deadzone quantizer might be appropriate for the total gain, but that's better than doing it for each coefficient independently). Preserving gain is good perceptually. It also allows for separate loss factor for gain and distribution of energy which may or may not be useful.

3. Because you send the gain explicitly, you can use a non-linear quantizer on it.

4. You have a simple way of sending large subbands = zero without coding an extra redundant flag.

5. No patents! (unbelievable that this is even an issue, all you software patenters out there plz DIAGF)

asides :

*1 = In my studies I found preserving L1 norm of AC activity to be more visually important than preserving L2 norm. That doesn't mean it's better to code L1 norm though.

*2 = One advantage of this scheme is having VAQ built in. A disadvantage is that it's hard-coded into the codec, so the encoder isn't free to do adaptive quantization based on better human visual studies. Of course the encoder can send corrections to the built-in quantizer, but you'd better be careful about tweaking the VAQ that's in the standard!

*3 = I always found variable quantization of log(gain) to be appealing.

*4 = My perceptual studies indicate that there should probably be more error due to the distribution quantization (K) than from gain quantization.

*5 = Of course, applying a CSF-like matrix to *residuals* as is done in most video coders is also very murky.


Okay, but now we have prediction, from motion compensation or intra prediction or whatever. You have a current block to encode and a predicted block that you think is likely similar.

The problem is we can't just subtract off the predicted block and make a residual the way normal video codecs do. If you did that, you would no longer have a "gain" which was the correct energy level of the block - it would be an energy level of the *residual*, which is not a useful thing either for perceptual quality control or for non-uniform quantization to mimic VAQ.

Sending the gain is easy. You have the gain for each predicted subband, so you can just predict the gain for the current subband to be similar to that (send the delta, context code, etc.). You want to send the delta in quantized space (after the power mapping). The previous block might have more information than you can preserve at the current quantization level (it may have a finely specified gain which your quantizer cannot represent). With a normal residual method, we could just send zero residuals and keep whatever detail was in the previous frame. Daala makes it possible to carry this forward by centering a quantization bucket exactly on the previous gain.

Now we need the distribution of coefficients. The issue is you can't just send the delta using Pyramid VQ. We already have the gain, which is the length of the coefficient N-vector; it's not the length of the delta vector.

Geometrically, we are on an N-sphere (since we know the length of the current vector) and we have a point on that sphere where the predicted block was. So we need to send our current location on the sphere relative to that known previous position. Rather than mess around with spherical coordinate systems, we can take the two vectors, and then essentially just send the parallel part (the dot product, or angle between them), and then the perpendicular part.

JM Valin's solution for Daala using the Householder reflection is just a way of getting the "perpendicular part" in a convenient coordinate system, where you can isolate the N-2 degrees of freedom. You have the length of the perpendicular part (it's gain*sin(theta)), so what remains is a unit vector, and we can use Pyramid VQ to send it.

So, to transmit our coefficients we send the gain (as a delta from the previous gain), we send the extent to which the vectors are parallel (eg. by sending theta (*6)), and then we know the length of the perpendicular part and just need to send the remaining N-2 directional DOF using Pyramid VQ.

As in the intra case, the K (quantizer resolution) of the Pyramid VQ is computed from other factors. Here obviously rather than being proportional to the total gain, it should be proportional to the length of the perpendicular part, eg. gain*sin(theta). In particular if you send a theta near zero, K goes toward zero.
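
Here's a rough sketch of that analysis step under the assumptions above. Note it computes the perpendicular part directly rather than via the actual Householder reflection (which, as described, is just a convenient coordinate system for the same split), and it ignores the degenerate near-zero cases discussed below (all names are made up) :

#include <cmath>
#include <vector>

struct PredictiveVq
{
    double gain;                  // |x|, sent via the companded gain quantizer
    double theta;                 // angle between input and prediction
    std::vector<double> residual; // unit perpendicular part, sent with Pyramid VQ
};

// x = current subband coefficients, r = predicted subband coefficients.
// Assumes both are nonzero and x is not exactly parallel to r.
PredictiveVq AnalyzePvq(const std::vector<double> & x, const std::vector<double> & r)
{
    int n = (int)x.size();
    double nx = 0, nr = 0, dot = 0;
    for (int i = 0; i < n; i++) { nx += x[i]*x[i]; nr += r[i]*r[i]; dot += x[i]*r[i]; }
    nx = std::sqrt(nx); nr = std::sqrt(nr);

    PredictiveVq out;
    out.gain = nx;
    double c = dot / (nx * nr);
    if (c > 1) c = 1; if (c < -1) c = -1;
    out.theta = std::acos(c);

    // perpendicular part : remove the component of x along the prediction direction
    out.residual.resize(n);
    double perp = 0;
    for (int i = 0; i < n; i++)
    {
        out.residual[i] = x[i] - (dot / (nr * nr)) * r[i];
        perp += out.residual[i] * out.residual[i];
    }
    perp = std::sqrt(perp);                              // equals gain * sin(theta)
    for (int i = 0; i < n; i++) out.residual[i] /= perp; // normalize for Pyramid VQ

    // K for the Pyramid VQ stage would then be chosen proportional to gain * sin(theta).
    return out;
}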

One funny thing caused by the Householder reflection (coordinate system change to get the perpendicular part) is that you've scrambled up the axes in a way you can't really work with. So custom trained knowledge of the axes, like expected magnitudes and Z-scan order and things like that are gone. eg. with a normal coefficient delta, you know that it's very likely the majority of the delta is in the first few AC's, but after the rotation that's lost. (you can still build that in at the subband level, just not within the subbands).

Another funny thing is the degeneracy of polar conversion around the origin. When two vectors are very close (by Euclidean distance) they have a small angle between them, *if* they are long enough. Near the origin, the polar conversion has a pole (ha ha punny). This occurs for subbands near zero, eg. nearly flat, low energy blocks. Since the gain has previously been sent, it's possible that could be used to change to a different coder for gains near zero (eg. just use the Intra coder). (in Daala you would just send a "noref" flag). To be clear, the issue is that the vectors might be very close in Euclidean distance, and thus seem like good matches based on SAD motion search, but could easily be vectors pointing in completely opposite directions, hence be very bad to code using this theta scheme.

And I think that's about it.

The big goal of this funny business is to be able to send the gain (length) of the current subband vector, rather than sending the length of the delta. This gives you the advantages as discussed previously in the simpler Intra case.

asides :

*6 = send theta, or sin(theta) ? They have slightly different quantization bucket scaling. Most of this assumes that we have a reasonably good prediction so theta is small and theta ~= sin(theta).


Geometrically with white board drawing :

Traditional video coders just form the Euclidean "Delta" vector and send its components.

Daala Intra (no prediction) takes the "Current" vector, sends its length, and then its unit vector (direction) using Pyramid VQ.

Daala Predictive VQ sends the length of "Current", the angle (theta) from Predicted to Current, and the unit vector (direction) of the "Perpendicular" vector.

(of course Daala doesn't send the "Perpendicular" vector in the coordinate space drawn above; it's reflected into the space where "Predicted" is aligned with an axis, that way Perpendicular is known to have a zero in that direction and is simply a vector in N-1 dimensions (and you've already sent the length so it has N-2 DOF))

12/27/2014

12-27-14 - Lagrange Rate Control Part 4

I wrote previously about Lagrange Rate control for video :

01-12-10 - Lagrange Rate Control Part 1
01-12-10 - Lagrange Rate Control Part 2
01-13-10 - Lagrange Rate Control Part 3

I was thinking about it recently, and I wanted to do some vague rambling about the overall issue.

First of all, the lagrange method in general for image & video coding is just totally wrong. The main problem is that it assumes every coding decision is independent. That distortions are isolated and additive, which they aren't.

The core of the lagrange method is that you set a "bit usefulness" value (lambda) and then you make each independent coding decision based on whether more bits improve D by lambda or more.
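
As a minimal sketch, that per-decision rule is equivalent to picking the candidate that minimizes J = D + lambda * R (hypothetical names) :

#include <vector>

struct Candidate { double D; double R; }; // distortion and rate (bits) of one coding choice

int PickLagrange(const std::vector<Candidate> & cands, double lambda)
{
    int best = 0;
    double bestJ = cands[0].D + lambda * cands[0].R;
    for (int i = 1; i < (int)cands.size(); i++)
    {
        double J = cands[i].D + lambda * cands[i].R;
        if (J < bestJ) { bestJ = J; best = i; }
    }
    return best;
}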

But that's just wrong, because distortions are *not* localized and independent. I've mentioned a few times recently the issue of quality variation; if you make one block in an image blurry, and leave others at high detail, that looks far worse than the localized D value tells you, because it's different and stands out. If you have a big patch of similar blocks, then making them different in any way is very noticeable and ugly. There are simple non-local effects, like if the current block is part of a smooth/gradient area, then blending smoothly with neighbors in the output is crucial, and the localized D won't tell you that. There are difficult non-local effects, like if the exact same kind of texture occurs in multiple parts of the image, then coding them differently makes the viewer go "WTF"; it's a quality penalty worse than the local D would tell you.

In video, the non-local D effects are even more extreme due to temporal coherence. Any change of quality over time that's due to the coder (and not due to motion) is very ugly (like I frames coming in with too few bits and then being corrected over time, or even worse the horrible MPEG pop if the I-frame doesn't match the cut). Flickering of blocks if they change coding quality over time is horrific. etc. etc. None of this is measurable in a localized lagrangian decision.

(I'm even ignoring for the moment the fact that the encoding itself is non-local; eg. coding of the current block affects the coding of future blocks, either due to context modeling or value prediction or whatever; I'm just talking about the fact that D is highly non-local).

The correct thing to do is to have a total-image (or total-video) perceptual quality metric, and make each coding decision based on how it affects the total quality. But this is impossible.

Okay.

So the funny thing is that the lagrange method actually gets you some global perceptual quality by accident.

Assume we are using quite a simple local D metric like SSD or SAD possibly with SATD or something. Just in images, perceptually what you want is for smooth blocks to be preserved quite well, and very noisy/random blocks to have more error. Constant quantizer doesn't do that, but constant lambda does! Because the random-ish blocks are much harder to code, they cost more bits per unit of quality, so they will be coded at lower quality.

In video, it's even more extreme and kind of magical. Blocks with a lot of temporal change are not as important visually - it's okay to have high error where there's major motion, and they are harder to code so they get worse quality. Blocks that stay still are important to have high quality, but they are also easier to code so that happens automatically.

That's just within a frame, but frame-to-frame, which is what I was talking about as "lagrange rate control" the same magic sort of comes out. Frames with lots of detail and motion are harder to code, so get lower quality. Chunks of the video that are still are easier to code, so get higher quality. The high-motion frames will still get more bits than the low-motion frames, just not as many more bits as they would at constant-quality.

It can sort of all seem well justified.

But it's not. The funny thing is that we're optimizing a non-perceptual local D. This D is not taking into account things like the fact that high motion block errors are less noticeable. It's just a hack that by optimizing for a non-perceptual D we wind up with a pretty good perceptual optimization.

Lagrange rate control is sort of neat because it gets you started with pretty good bit allocation without any obvious heuristic tweakage. But that goes away pretty fast. You find that using L1 vs. L2 norm for D makes a big difference in perceptual quality; maybe L1 squared? other powers of D change bit allocation a lot. And then you want to do something like MB-tree to push bits backward; for example the I frame at a cut should get a bigger chunk of bits so that quality pops in rather than trickles in, etc.


I was thinking of this because I mentioned to ryg the other day that I never got B frames working well in my video coder. They worked, and they helped in terms of naive distortion measures, but they created an ugly perceptual quality problem - they had a slightly different look and quality than the P frames, so in a PBPBP sequence you would see a pulsing of quality that was really awful.

The problem is they didn't have uniform perceptual quality. There were a few nasty issues.

One is that at low bit rates, the "B skip" block becomes very desirable in B frames. (for me "B skip" = send no movec or residual; use predicted movec to future and past frames to make an interpolated output block). The "B skip" is very cheap to send, and has pretty decent quality. As you lower bit rate, suddenly the B frames start picking "B skip" all over, and they actually have lower quality than the P frames. This is an example of a problem I mentioned in the PVQ posts - if you don't have a very smooth continuum of R/D choices, then an RD optimizing coder will get stuck in some holes and there will be sudden pops of quality that are very ugly.

At higher bit rates, the B frames are easier to code to high quality, (among other things, the P frame is using mocomp from further in the past), so the pulsing of quality is high quality B's and lower quality P's.

It's just an issue that lagrange rate control can't handle. You either need a very good real perceptual quality metric to do B-P rate control, or you just need well tweaked heuristics, which seems to be what most people do.

12/17/2014

12-17-14 - PVQ Vector Distribution Note

So, PVQ, as in "Pyramid VQ" is just a way of making a VQ codebook for a unit vector that has a certain probability model (each value is laplacian (the same laplacian) and independent).

You have a bunch of values; you send the length separately, so you're left with a unit vector. Independent laplacian values are not uniformly distributed on the unit sphere, so you don't want a normal unit vector quantizer; you want this one, which is built from scaled squares (integer points on an L1 sphere, projected onto the unit sphere).

Okay, that's all fine, but the application to images is problematic.

In images, we have these various AC residuals. Let's assume 8x8 blocks for now for clarity. In each block, you have ACij (ij in 0-7 and AC00 excluded). The ij are frequency coordinates for each coefficient. You also have spatial neighbors in the adjacent blocks.

The problem is that the AC's are not independent - they're not independent in either frequency or spatial coordinates. AC42 is strongly correlated to AC21 and also to AC42 in the neighboring blocks. They also don't have the same distribution; lower frequency-index AC's have much higher means.

In order to use Pyramid VQ, we need to find a grouping of AC's into a vector, such that the values we are putting in that vector are as uncorrelated and equally distributed as possible.

One paper I found, which I linked in the previous email, forms a vector by taking all the coefficients in the same frequency slot in a spatial region (for each ij, take all ACij in the 4x4 neighborhood of blocks). This is appealing in the sense that it gathers AC's of the same frequency subband, so they have roughly the same distribution. The problem is there are strong spatial correlations.

In Daala they form "subbands" of the coefficients by grouping together chunks of ACs that are in similar frequency groups.

The reason why correlation is bad is that it makes the PVQ codebook not optimal. For correlated values you should have more codebook vectors with neighboring values similar. eg. more entries around {0, 2, 2, 0} and fewer around {2, 0, 0, 2}. The PVQ codebook assumes those are equally likely.

You can however, make up for this a bit with the way you encode the codebook index. It doesn't fix the quantizer, but it does extract the correlation.

In classical PVQ (P = Pyramid) you would simply form an index to the vector and send it with an equiprobable code. But in practice you might do it with binary subdivision or incremental enumeration schemes, and then you need not make all codebook vectors equiprobable.
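
For reference, the size of that codebook (the number of integer vectors of dimension n with L1 norm exactly k) has a simple recurrence; a uniform index costs log2 of this count. A small sketch with hypothetical names :

#include <cstdint>
#include <vector>

// V(n,k) = V(n-1,k) + V(n,k-1) + V(n-1,k-1), with V(n,0) = 1 and V(0,k>0) = 0.

uint64_t PvqCodebookSize(int n, int k)
{
    std::vector<std::vector<uint64_t>> V(n + 1, std::vector<uint64_t>(k + 1, 0));
    for (int i = 0; i <= n; i++) V[i][0] = 1;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= k; j++)
            V[i][j] = V[i-1][j] + V[i][j-1] + V[i-1][j-1];
    return V[n][k];
}

An entropy-coded scheme can instead skew the probabilities across this set, which is what the non-equiprobable coding above is about.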

For example in Daala one of the issues for the lower subbands is that the vectors that have signal in the low AC's are more probable than the high AC's. eg. for a subband that spans AC10 - AC40, {1,0,0,0} is much more likely than {0,0,0,1}.

Of course this becomes a big mess when you consider Predictive VQ, because the Householder transform scrambles everything up in a way that makes it hard to model these built-in skews. On the other hand, if the Predictive VQ removes enough of the correlation with neighbors and subband, then you are left with a roughly evenly distributed vector again, which is what you want.

12/16/2014

12-16-14 - Daala PVQ Emails

First, I want to note that the PVQ Demo page has good links at the bottom with more details in them, worth reading.

Also, the Main Daala page has more links, including the "Intro to Video" series, which is rather more than an intro and is a good read. It's a broad survey of modern video coding.

Now, a big raw dump of emails between me, ryg, and JM Valin. I'm gonna try to color them to make it a bit easier to follow. Thusly :

cbloom
ryg
valin

And this all starts with me being not very clear on PVQ so the beginning is a little fuzzy.

I will be following this up with a "summary of PVQ as I now understand it" which is probably more useful for most people. So, read that, not this.

(also jebus the internet is rindoculuos. Can I have BBS's back? Like literally 1200 baud text is better than the fucking nightmare that the internet has become. And I wouldn't mind playing a little Trade Wars once a day...)


Hi, I'm the author of the latest Daala demo on PVQ on which you commented recently. Here's some comments on your comments. I wasn't able to submit this to your blog, but feel free to copy them there.

> 1. I've long believed that blocks should be categorized into "smooth", "detail" and "edge". For most of this discussion we're going to ignore smooth and edge and just talk about detail. That is, blocks that are not entirely smooth, and don't have a dominant edge through them (perhaps because that edge was predicted).

I agree here, although right now we're treating "smooth" and "detail" in the same way (smooth is just low detail). Do you see any reason to treat those separately?

> 2. The most important thing in detail blocks is preserving the amount of energy in the various frequency subbands. This is something that I've talked about before in terms of perceptual metrics.

This is exactly what I had in mind with this PVQ work. Before Daala, I worked on the CELT part of the Opus codec, which has strict preservation of the energy. In the case of Daala, it looks so far like we want to relax the energy constraint a little. Right now, the codebook has an energy-preserving structure, but the actual search is R/D optimized with the same weight given to the amount of energy and its location. It's pretty easy to change the code to give more weight to energy preservation. I could even show you how to play with it if you're interested.

> 3. You can take a standard type of codec and optimize the encoding towards this type of perceptual metric, and that helps a bit, but it's the wrong way to go. Because you're still spending bits to exactly specify the noise in the high frequency area.

Correct, hence the gain-shape quantization in my post.

> 4. What you really want is a joint quantizer of summed energy and the distribution of that energy. At max bit rate you send all the coefficients exactly. As you reduce bitrate, the sum is preserved pretty well, but the distribution of the lower-right (highest frequency) coefficients becomes lossy. As you reduce bit rate more, the total sum is still pretty good and the overall distribution of energy is mostly right, but you get more loss in where the energy is going in the lower frequency subbands, and you also get more scalar quantization of the lower frequency subbands, etc.

Well, my experience with both Opus and CELT is that you want the same resolution for the energy as you use for the "details". That being said, having an explicit energy still means you can better preserve it in the quantization process (i.e. it won't increase or decrease too much due to the quantization).

> 5. When the energy is unspecified, you'd like to restore in some nice way. That is, don't just restore to the same quantization vector every time ("center" of the quantization bucket), since that could create patterns. I dunno. Maybe restore with some randomness; restore based on prediction from the neighborhood; restore to maximum likelihood? (ML based on neighborhood/prediction/image not just a global ML)

I experimented a little bit with adding noise at the correct energy and while it slightly improved the quality on still images, it wasn't clear how to apply it to video because then you have the problem of static vs dynamic noise.

> 6. An idea I've tossed around for a while is a quadtree/wavelet-like coding scheme. Take the 8x8 block of coefficients (and as always exclude DC in some way). Send the sum of the whole thing. Divide into four children. So now you have to send a (lossy) distribution of that sum onto the 4 child slots. Go to the upper left (LL band) and do it again, etc.

I considered something along these lines, but it would not be easy to do because the lowest frequencies would kind of drown out the high frequencies.

> 7. The more energy you have, the less important its exact distribution, due to masking. As you have more energy to distribute, the number of vectors you need goes up a lot, but the loss you can tolerate also goes up. In terms of the bits to send a block, it should still increase as a function of the energy level of that block, but it should increase less quickly than naive log2(distributions) would indicate.

Yes, that is exactly what the PVQ companding does.

> 8. Not all AC's are equally likely or equally perceptually important. Specifically the vector codebook should contain more entries that preserve values in the upper-left (low frequency) area.

This is the equivalent of the quantization matrix, which PVQ has as well (though I didn't really talk about it).

> 9. The interaction with prediction is ugly. (eg. I don't know how to do it right). The nature of AC values after mocomp or intra-prediction is not the same as the nature of AC's after just transform (as in JPEG). Specifically, ideas like variance masking and energy preservation apply to the transformed AC values, *not* to the deltas that you typically see in video coding.

Handling the prediction is exactly what the whole Householder reflection in PVQ is about (see the 6 steps figure). The PVQ gain encoding scheme is always done on the input and not on the prediction. So the activity masking is applied on the input energy and not based on the energy of the residual.

> 10. You want to send the information about the AC in a useful order. That is, the things you send first should be very strong classifiers of the entropy of that block for coding purposes, and of the masking properties for quantization purposes.

Well, coding the energy first achieves most of this.

> You don't want sending the "category" or "masking" information to be separate side-band data. It should just be the first part of sending the coefficients. So your category is maybe something like the bit-vector of which coefficient groups have any non-zero coefficients. Something like that which is not redundant with sending them, it's just the first gross bit of information about their distribution.

Well, the masking information is tied to the gain. For now, the category information is only tied to the block size decision (based on the assumption that edges will be 4x4), but it's not ideal and it's something I'd like to improve.

On the topic of lapped transform, this has indeed been causing us all sorts of headaches, but it also has interesting properties. Jury's still out on that one, but so far I think we've managed to make reasonably good use of it.

Cheers, Jean-Marc

Thanks for writing! Before I address specific points, maybe you can teach me a bit about PVQ and how you use it? I can't find any good resources on the web (your abstract is rather terse). Maybe you can point me at some relevant reference material. (the CELT paper is rather terse too!)

Are you constructing the PVQ vector from the various AC's within a single block? Or gathering the same subband from spatial neighbors? (I think the former, but I've seen the latter in papers)

Assuming the former - Isn't it just wrong? The various AC's have different laplacian distributions (lower frequencies more likely) so using PVQ just doesn't seem right. In particular PVQ assumes all coefficients are equally likely and equally distributed.

In your abstract you seem to describe a coding scheme which is not a uniform length codeword like traditional PVQ. It looks like it assigns shorter codes to vectors that have their values early on in some kind of z-scan order.

How is K chosen?

Hi,

On 02/12/14 08:52 PM, Charles Bloom wrote:

> Thanks for writing! Before I address specific points, maybe you can teach me a bit about PVQ and how you use it? I can't find any good resources on the web (your abstract is rather terse). Maybe you can point me at some relevant reference material. (the CELT paper is rather terse too!)

I'm currently writing a longer paper for a conference in February, but for now there isn't much more than the demo and the abstract I link to at the bottom. I have some notes that describe some of the maths, but it's a bit all over the place right now: http://jmvalin.ca/video/video_pvq.pdf

> Are you constructing the PVQ vector from the various AC's within a single block? Or gathering the same subband from spatial neighbors? (I think the former, but I've seen the latter in papers)
>
> Assuming the former -

Correct. You can see the grouping (bands) in Fig. 1 of: http://jmvalin.ca/video/spie_pvq_abstract.pdf

> Isn't it just wrong? The various AC's have different laplacian distributions (lower frequencies more likely) so using PVQ just doesn't seem right.
>
> In particular PVQ assumes all coefficients are equally likely and equally distributed.
>
> In your abstract you seem to describe a coding scheme which is not a uniform length codeword like traditional PVQ. It looks like it assigns shorter codes to vectors that have their values early on in some kind of z-scan order.

One thing to keep in mind is that the P in PVQ now stands for "perceptual". In Daala we are no longer using the indexing scheme from CELT (which does assume identical distribution). Rather, we're using a coding scheme based on Laplace distribution of unequal variance. You can read more about the actual encoding process in another document: http://jmvalin.ca/video/pvq_encoding.pdf

> How is K chosen?

The math is described (poorly) in section 6.1 of http://jmvalin.ca/video/video_pvq.pdf

Basically, the idea is to have the same resolution in the direction of the gain as in any other direction. In the no prediction case, it's roughly proportional to the gain times the square root of the number of dimensions. Because K only depends on values that are available to the decoder, we don't actually need to signal it.

Hope this helps, Jean-Marc

Thanks for the responses and the early release papers, yeah I'm figuring most of it out.

K is chosen so that distortion from the PVQ (P = Pyramid) quantization is the same as distortion from gain quantization. Presumably under a simple D metric like L2. The actual PVQ (P = Pyramid) part is the simplest and least ambiguous.

The predictive stuff is complex. Let me make sure I understand this correctly -

You never actually make a "residual" in the classic sense by subtracting the prediction off.

You form the prediction in transformed space. (perhaps by having a motion vector, taking the pixels it points to and transforming them, dealing with lapping, yuck!)

The gain of the current block is sent (for each subband). Not the gain of the delta. The gain of the prediction in the same band is used as coding context? (the delta of the quantized gains could be sent).

The big win that you guys were after in sending the gain seems to have been the non-linear quantization levels; essentially you're getting "variance adaptive quantization" without explicitly sending per block quantizers.

The Householder reflection is the way that vectors near the prediction are favored. This is the only way that the predicted block is used!? Madness!

(presumably if the prediction had detail that was finer than the quantization level of the current block that could be used to restore within the quantization bucket; eg. for "golden frames")

On 03/12/14 12:18 AM, Charles Bloom wrote:
> K is chosen so that distortion from the PVQ (P = Pyramid) quantization
> is the same as distortion from gain quantization. Presumably under a
> simple D metric like L2.

Yes, it's an L2 metric, although since the gain is already warped, the distortion is implicitly weighted by the activity masking, which is exactly what we want.

> You never actually make a "residual" in the classic sense by subtracting
> the prediction off.

Correct.

> You form the prediction in transformed space. (perhaps by having a
> motion vector, taking the pixels it points to and transforming them,
> dealing with lapping, yuck!)

We have the input image and we have a predicted image. We just transform both. Lapping doesn't actually cause any issues there (unlike many other places). As far as I can tell, this part is similar to what a wavelet coder would do.

> The gain of the current block is sent (for each subband). Not the gain
> of the delta.

Correct.

> The gain of the prediction in the same band is used as
> coding context? (the delta of the quantized gains could be sent).

Yes, the gain is delta-coded, so coding "same gain" is cheap. Especially, there's a special symbol for gain=0,theta=0, which means "skip this band and use prediction as is".

> The big win that you guys were after in sending the gain seems to have
> been the non-linear quantization levels; essentially you're getting
> "variance adaptive quantization" without explicitly sending per block
> quantizers.

Exactly. Not only that but it's adaptive based on the variance of the current band, not just an entire macroblock.

> The Householder reflection is the way that vectors near the prediction
> are favored. This is the only way that the predicted block is used!?
> Madness!

Well, the reference is used to compute the reflection *and* the gain. In the end, we're using exactly the same amount of information, just in a different space.

> (presumably if the prediction had detail that was finer than the
> quantization level of the current block that could be used to restore
> within the quantization bucket; eg. for "golden frames")

Can you explain what you mean here?

Jean-Marc
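
(Aside: a minimal sketch of what "the gain is already warped" buys you, assuming a pure power-law compander; the 2/3 exponent is the guess that comes up later in this thread, not a confirmed Daala constant. Uniform steps in the warped domain become coarser steps at high gain, which is where the implicit activity masking comes from.)

  #include <math.h>

  // Hypothetical gain companding before uniform scalar quantization.
  // g is a (nonnegative) band gain; q_step is the uniform step in the
  // warped domain.
  static int quantize_gain(double g, double q_step)
  {
      double warped = pow(g, 2.0 / 3.0);
      return (int)floor(warped / q_step + 0.5);
  }

  static double dequantize_gain(int gq, double q_step)
  {
      return pow((double)gq * q_step, 3.0 / 2.0);   // undo the warp
  }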

So one thing that strikes me is that at very low bit rate, it would be nice to go below K=1. In the large high-frequency subbands, the vector dimension N is very large, so even at K=1 it takes a lot of bits to specify where the energy should go. It would be nice to be more lossy with that location.

It seems that for low K you're using a zero-runlength coder to send the distribution, with a kind of Z-scan order, which makes it very similar to standard MPEG.

(maybe you guys aren't focusing on such low bit rates; when I looked at low bit rate video the K=1 case dominated)

At 09:42 PM 12/2/2014, you wrote:
> > (presumably if the prediction had detail that was finer than the
> > quantization level of the current block that could be used to restore
> > within the quantization bucket; eg. for "golden frames")
>
> Can you explain what you mean here?

If you happen to have a very high quality previous block (much better than your current quantizer / bit rate should give you) - with normal mocomp you can easily carry that block forward, and perhaps apply corrections to it, but the high detail of that block is preserved. With the PVQ scheme it's not obvious to me that that works.

When you send the quantized gain of the subbands you're losing precision (it looks like you guys have a special fudge to fix this, by offsetting the gain based on the prediction's gain?)

But for the VQ part, you can't really "carry forward" detail in the same way. I guess the reflection vector can be higher precision than the quantizer, so in a sense that preserves detail, but it doesn't carry forward the same values, because they drift due to rotation and staying a unit vector, etc.

Some more questions -

Is the Householder reflection method also used for Intra prediction? (do you guys do the directional Intra like H26x ?)

How much of this scheme is because you believe it's the best thing to do vs. you have to avoid H26x patents?

If you're not sending any explicit per-block quantizer, it seems like that removes a lot of freedom for future encoders to do more sophisticated perceptual optimization. (ROI bit allocation or whatever)

On 03/12/14 02:17 PM, Charles Bloom wrote:
> So one thing that strikes me is that at very low bit rate, it would be
> nice to go below K=1. In the large high-frequency subbands, the vector
> dimension N is very large, so even at K=1 it takes a lot of bits to
> specify where the energy should go. It would be nice to be more lossy
> with that location.

Well, for large N, the first gain step already has K>1, which I believe is better than K=1. I've considered adding an extra gain step with K=1 or below, but never had anything that was really worth it (didn't try very hard).

> It seems that for low K you're using a zero-runlength coder to send the
> distribution, with a kind of Z-scan order, which makes it very similar
> to standard MPEG.
>
> (maybe you guys aren't focusing on such low bit rates; when I looked at
> low bit rate video the K=1 case dominated)

We're also targeting low bit-rates, similar to H.265. We're not yet at our target level of performance though.

> Is the Householder reflection method also used for Intra prediction?
> (do you guys do the directional Intra like H26x ?)

We also use it for intra prediction, though right now our intra prediction is very limited because of the lapped transform. Except for chroma which we predict from the luma. PVQ makes this particularly easy. We just use the unit vector from luma as chroma prediction and code the gain.

> How much of this scheme is because you believe it's the best thing to
> do vs. you have to avoid H26x patents?

The original goal wasn't to avoid patents, but it's a nice added benefit.

> If you're not sending any explicit per-block quantizer, it seems like
> that removes a lot of freedom for future encoders to do more
> sophisticated perceptual optimization. (ROI bit allocation or
> whatever)

We're still planning on adding some per-block/macroblock/something quantizers, but we just won't need them for activity masking.

Cheers,

Jean-Marc
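
(Aside: the chroma-from-luma trick is simple enough to sketch. This is not Daala's code, just the idea as stated: reuse the decoded luma band's direction and only signal a chroma gain.)

  #include <math.h>

  // Hypothetical decoder-side prediction of a chroma band's shape from the
  // decoded luma band; only chroma_gain needs to be coded. n is the band size.
  static void predict_chroma_from_luma(const double *luma_band, int n,
                                       double chroma_gain, double *chroma_out)
  {
      double len = 0.0;
      for (int i = 0; i < n; i++) len += luma_band[i] * luma_band[i];
      len = sqrt(len);
      for (int i = 0; i < n; i++)
          chroma_out[i] = (len > 0) ? chroma_gain * luma_band[i] / len : 0.0;
  }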

Hi,

Just read your "smooth blocks" post and I thought I'd mention one thing we do in Daala to improve the quality of smooth regions. It's called "Haar DC" and the idea is basically to apply a Haar transform to all the DCs in a superblock. This has the advantage of getting us much better quantization resolution at large scales. Unfortunately, there's absolutely no documentation about it, so you'd have to look at the source code, mostly in od_quantize_haar_dc() and a bit of od_compute_dcts()

http://git.xiph.org/?p=daala.git;a=blob;f=src/encode.c;h=879dda;hb=HEAD

Cheers,

Jean-Marc
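
(Aside: a rough sketch of what "Haar DC" seems to mean: one level of a 2-D Haar butterfly over the per-block DCs of a superblock, so the coarse LL values can get finer quantization than plain per-block DC. od_quantize_haar_dc() does more than this - recursion, per-level quantizers, etc. - so treat this as illustration only.)

  // dc is a w x w grid of block DCs (w even, e.g. 4 or 8); out gets the
  // LL, HL, LH, HH quadrants of one unitary 2x2 Haar level.
  static void haar_dc_level(const float *dc, int w, float *out)
  {
      int h = w / 2;
      for (int y = 0; y < h; y++)
      for (int x = 0; x < h; x++)
      {
          float a = dc[(2*y  )*w + 2*x], b = dc[(2*y  )*w + 2*x+1];
          float c = dc[(2*y+1)*w + 2*x], d = dc[(2*y+1)*w + 2*x+1];
          out[y*w + x]         = (a + b + c + d) * 0.5f;  // LL : coarse average
          out[y*w + x + h]     = (a - b + c - d) * 0.5f;  // HL : horizontal detail
          out[(y+h)*w + x]     = (a + b - c - d) * 0.5f;  // LH : vertical detail
          out[(y+h)*w + x + h] = (a - b - c + d) * 0.5f;  // HH : diagonal detail
      }
  }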

Yeah I definitely can't follow that code without digging into it. But this : "much better quantization resolution at large scales." is interesting.

When I did the DLI test :

http://cbloomrants.blogspot.com/2014/08/08-31-14-dli-image-compression.html

something I noticed in both JPEG and DLI (and in everything else, I'm sure) is :

Because everyone just does naive scalar quantization on DC's, large regions of solid color will shift in a way that is very visible. That is, it's a very bad perceptual RD allocation. Some bits should be taken away from AC detail and put into making that large region DC color more precise.

The problem is that DC scalar quantization assumes the blocks are independent and random and so on. It models the distortion of each block as being independent, etc. But it's not. If you have the right scalar quantizer for the DC when the blocks are in a region of high variation (lots of different DC's) then that is much too large a quantizer for regions where blocks all have roughly the same DC.

This is true even when there is a decent amount of AC energy, eg. the image I noticed it in was the "Porsche640" test image posted on that page - the greens of the bushes all color shift in a very bad way. The leaf detail does not mask this kind of perceptual error.

Two more questions -

1. Do you use a quantization matrix (ala JPEG CSF or whatever) ? If so, how does that work with gain preservation and the Pyramid VQ unit vector?

2. Do you mind if I post all these mails publicly?

On 11/12/14 02:03 PM, Charles Bloom wrote:
> 1. Do you use a quantization matrix (ala JPEG CSF or whatever) ? If so,
> how does that work with gain preservation and the Pyramid VQ unit vector?

Right now, we just set a different quantizer value for each "band", so we can't change resolution on a coefficient-by-coefficient basis, but it still looks like a good enough approximation. If needed we might try doing something fancier at some point.

> 2. Do you mind if I post all these mails publicly?

I have no problem with that and in fact I encourage you to do so.

Cheers,

Jean-Marc


ryg:

Don't wanna post this to your blog because it's a long comment and will probably fail Blogger's size limit.

Re "3 1/2. The normal zig-zag coding schemes we use are really bad."

Don't agree here about zig-zag being the problem. Doesn't it just boil down to what model you use for the run lengths? Classic JPEG/MPEG style coding rules (H.264 and later are somewhat different) 1. assume short runs are more probable than long ones and 2. give a really cheap way to end blocks early. The result is that the coder likes blocks with a fairly dense cluster in the first few coded components (and only this is where zig-zag comes in) and truncated past that point.

Now take Fischer-style PVQ (original paper is behind a paywall, but this: http://www.nul.com/pbody17.pdf covers what seems to be the proposed coding scheme). You have two parameters, N and K. N is the dimensionality of the data you're coding (this is a constant at the block syntax level and not coded) and K is the number of unit pulses (=your "energy"). You code K and then send an integer (with a uniform model!) that says which of all possible arrangements of K unit pulses across N dimensions you mean. For 16-bit ACs in an 8x8 block, so N=63, there's on the order of 2^(63*16) = 2^1008 different values you could theoretically code, so clearly for large K this integer denoting the configuration can get quite huge.

Anyway, suppose that K=1 (easiest case). Then the "configuration number" will tell us where the pulse goes and what sign it has, uniformly coded. That's essentially a run length with *uniform* distribution plus sign.

K=2: we have two pulses. There's N*2 ways to code +-2 in one AC and the rest zeros (code AC index, code sign), and (N choose 2) * 2^2 ways to code two slots at +-1 each. And so forth for higher K.

From there, we can extrapolate what the general case looks like. I think the overall structure ends up being isomorphic to this:

1. You code the number M (<=N) of nonzero coefficients using a model derived from the combinatorics given N and K (purely counting-based). (K=1 implies M=1, so nothing to code in that case.)
2. Code the M sign bits.
3. Code the positions of the M nonzero coeffs - (N choose M) options here.
4. Code another number denoting how we split the K pulses among the M coeffs - that's an integer partition of K into exactly M parts, not sure if there's a nice name/formula for that.

This is close enough to the structure of existing AC entropy coders that we can meaningfully talk about the differences. 1) and 2) are bog-standard (we use a different model knowing K than a regular codec that doesn't know K would, but that's it). You can view 3) in terms of significance masks, and the probabilities have a reasonably simple form (I think you can adapt the Reservoir sampling algorithm to generate them) - or, by looking at the zero runs, in terms of run lengths. And 4) is a magnitude coder constrained by knowing the final sum of everything.

So the big difference is that we know K at the start, which influences our choice of models forthwith. But it's not actually changing the internal structure that much!

That said, I don't think they're actually doing Fischer-style PVQ of "just send a uniform code". The advantage of breaking it down like above is that you have separate syntax elements that you can apply additional modeling on separately. Just having a giant integer flat code is not only massively unwieldy, it's also a bit of a dead end as far as further modeling is concerned.

-Fabian
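
(Aside: ryg's four-element decomposition can be sanity-checked by counting. Grouping codewords by the number of nonzero slots M gives V(N,K) = sum_M C(N,M) * 2^M * C(K-1,M-1) - the last factor is the "split K pulses among M slots" step, a composition of K into M positive parts - and it agrees with the recursive count from the Fischer paper. A throwaway check, nobody's production code:)

  #include <stdio.h>

  // Number of integer vectors of dimension l with L1 norm exactly k,
  // via the recursion quoted from the Fischer paper below.
  static unsigned long long V_rec(int l, int k)
  {
      if (k == 0) return 1;
      if (l == 0) return 0;
      unsigned long long n = V_rec(l - 1, k);
      for (int j = 1; j <= k; j++) n += 2ULL * V_rec(l - 1, k - j);
      return n;
  }

  static unsigned long long binom(int n, int r)
  {
      if (r < 0 || r > n) return 0;
      unsigned long long b = 1;
      for (int i = 1; i <= r; i++) b = b * (unsigned long long)(n - r + i) / i;
      return b;
  }

  // Same count via the decomposition: choose the M nonzero slots, their
  // signs, and a composition of K into M positive magnitudes.
  static unsigned long long V_dec(int n, int k)
  {
      unsigned long long total = (k == 0);
      for (int m = 1; m <= n && m <= k; m++)
          total += binom(n, m) * (1ULL << m) * binom(k - 1, m - 1);
      return total;
  }

  int main(void)
  {
      printf("%llu %llu\n", V_rec(16, 4), V_dec(16, 4));  // both print 44032
      return 0;
  }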

cbloom:

At 01:36 PM 12/2/2014, you wrote:
> Don't wanna post this to your blog because it's a long comment and will
> probably fail Blogger's size limit.
>
> Re "3 1/2. The normal zig-zag coding schemes we use are really bad."
>
> Don't agree here about zig-zag being the problem. Doesn't it just boil
> down to what model you use for the run lengths?

My belief is that for R/D optimization, it's bad when there's a big R step that doesn't correspond to a big D step. You want the prices of things to be "fair". So the problem is cases like :

  XX00
  X01
  0

vs

  XX00
  X001
  000
  00
  0

which is not a very big D change at all, but is a very big R step.

I think it's easy to see that even keeping something equivalent to the zigzag, you could change it so that the position of the next coded value is sent in a way such that the rates better match entropy and distortion.

But of course, really what you want is to send the positions of those later values in a lossy way. Even keeping something zigzagish you can imagine easy ways to do it, like you send a zigzag RLE that's something like {1,2,3-4,5-7,8-13} whatever.
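
(Aside: a sketch of the bucketed-position idea from that last paragraph, using the {1, 2, 3-4, 5-7, 8-13} buckets from the mail plus an invented catch-all; how the decoder picks a position within a bucket would have to be by convention or refined with extra bits.)

  // Map a zigzag step to the next coded coefficient (step >= 1) onto a
  // coarse bucket, so nearby positions cost the same to signal.
  static int run_to_bucket(int step)
  {
      static const int bucket_start[] = { 1, 2, 3, 5, 8, 14 };
      const int nb = (int)(sizeof(bucket_start) / sizeof(bucket_start[0]));
      int b = 0;
      while (b + 1 < nb && step >= bucket_start[b + 1]) b++;
      return b;   // code this (few symbols) instead of the exact step
  }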

ryg:

Actually the Fischer (magnitude enumeration) construction corresponds pretty much directly to a direct coder: from the IEEE paper, l = dim, k = number of pulses, then number of code words N(l,k) is

  N(l,k) = sum_{i=-k}^k N(l-1, k-|i|)

This is really direct: N(l,k) just loops over all possible values i for the first AC coeff. The remaining uncoded ACs then are l-1 dimensional and <= k-|i|. Divide through by N(l,k) and you have a probability distribution for coding a single AC coeff. Splitting out the i=0 case and sign, we get:

  N(l,k) = N(l-1,k) + 2 * sum_{j=1}^k N(l-1,k-j) =: N(l-1,k) + 2 * S(l-1,k)

which corresponds 1:1 to this encoder:

  // While energy (k) left
  for (i = 0; k > 0; i++)
  {
    assert(i < Ndims); // shouldn't get to N with leftover energy
    int l = N - i; // remaining dims
    int coeff = coeffs[i];

    // encode significance
    code_binary(coeff == 0, N(l-1,k) / N(l,k));
    if (coeff != 0)
    {
      // encode sign
      code_binary(coeff < 0, 0.5);

      int mag = abs(coeff);
      // encode magnitude (multi-symbol)
      // prob(mag=j) = N(l-1,k-j) / S(l-1, k)
      // then:
      k -= mag;
    }
  }

and this is probably how you'd want to implement it given an arithmetic back end anyway. Factoring it into multiple decisions is much more convenient (and as said before, easier to do secondary modeling on) than the whole "one giant bigint" mess you get if you're not low-dimensional. Having the high-dimensional crap in there blows because the probabilities can get crazy. Certainly Ndims=63 would suck to work with directly.

Separately, I'd expect that for k "large" (k >= Ndims? Ndims*4? More? Less?) you can use a simpler coder and/or fairly inaccurate probabilities because that's gonna be infrequent.

Maybe given k = AC_sum(1,63) = sum_{i=1}^63 |coeff_i|, there's a reasonably nice way to figure out say AC_sum(1,32) and AC_sum(33,63). And if you can do that once, you can do it more than once. Kind of a top-down approach: you start with "I have k energy for this block" and first figure out which subband groups that energy goes into. Then you do the "detail" encode like above within each subband of maybe 8-16 coeffs; with l<=Ndim<=8 and k<=Ndim*small, you would have reasonable (practical) model sizes.

-Fabian

cbloom:

No, I don't think that's right. The N recursion is just for counting the number of codewords, it doesn't imply a coding scheme. It explicitly says that the pyramid vector index is coded with a fixed length word, using ceil( log2(N) ) bits. Your coding scheme is variable length.

I need to find the original Fischer paper because this isn't making sense to me. The AC's aren't equally probable and don't have the same Laplacian distribution so PVQ just seems wrong.

I did find this paper ("Robust image and video coding with pyramid vector quantisation") which uses PVQ and is making the vectors not from within the same block, but within the same *subband* in different spatial locations. eg. gathering all the AC20's from lots of neighboring blocks. That does make sense to me but I'm not sure if that's what everyone means when they talk about PVQ ?

(paper attached to next email)

ryg:

On 12/2/2014 5:46 PM, Charles Bloom {RAD} wrote:
> No, I don't think that's right. The N recursion is just for counting the
> number of codewords, it doesn't imply a coding scheme. It explicitly says
> that the pyramid vector index is coded with a fixed length word, using
> ceil( log2(N) ) bits. Your coding scheme is variable length.

I wasn't stating that Fischer's scheme is variable-length; I was stating that the decomposition as given implies a corresponding way to encode it that is equivalent (in the sense of exact same cost). It's not variable length. It's variable number of symbols but the output length is always the same (provided you use an exact multi-precision arithmetic coder that is, otherwise it can end up larger due to round-off error).

log2(N(l,k)) is the number of bits we need to spend to encode which one out of N(l,k) equiprobable codewords we use. The ceil(log2(N)) is what you get when you say "fuck it" and just round it to an integral number of bits, but clearly that's not required. So suppose we're coding to the exact target rate using bignum rationals and an exact arithmetic coder.

Say I have a permutation of 3 values and want to encode which one it is. I can come up with a canonical enumeration (doesn't matter which) and send an index stating which one of the 6 candidates it is, in log2(6) bits. I can send one bit stating whether it's an even or odd permutation, which partitions my 6 cases into 2 disjoint subsets of 3 cases each, and then send log2(3) bits to encode which of the even/odd permutations I am, for a total of log2(2) + log2(3) = log2(6) bits. Or I can get fancier.

In the general case, I can (arbitrarily!) partition my N values into disjoint subsets with k_1, k_2, ..., k_m elements, respectively, sum_i k_i = N. To code a number, I then first code the number of the subset it's in (using probability p_i = k_i/N) and then send a uniform integer denoting which element it is, in log2(k_i) bits. Say I want to encode some number x, and it falls into subset j. Then I will spend

  -log2(p_j) + log2(k_j) = -log2(k_j / N) + log2(k_j) = log2(N / k_j) + log2(k_j) = log2(N)

bits (surprise... not). I'm just partitioning my uniform distribution into several distributions over smaller sets, always setting probabilities exactly according to the number of "leaves" (=final coded values) below that part of the subtree, so that the product along each path is still a uniform distribution. I can nest that process of course, and it's easy to do so in some trees but not others meaning I get non-uniform path lengths, but at no point am I changing the size of the output bitstream.

That's exactly what I did in the "coder" given below. What's the value of the first AC coefficient? It must obey -k <= ac_0 <= k per definition of k, and I'm using that to partition our codebook C into 2k+1 disjoint subsets, namely C_x = { c in C | ac0(c) = x } and nicely enough, by the unit-pulse definition that leads to the enumeration formula, each of the C_x corresponds to another PVQ codebook, namely with dimension l-1 and energy k-|x|. Which implies the whole thing decomposes into "send x and then do a PVQ encode of the rest", i.e. the loop I gave.

That said, one important point that I didn't cover in my original mail: from the purposes of coding this is really quite similar to a regular AC coder, but of course the values being coded don't mean the same thing. In a JPEG/MPEG style entropy coder, the values I'm emitting are raw ACs. PVQ works (for convenience) with code points on an integer lattice Z^N, but the actual AC coeffs coded aren't those lattice points, they're (gain(K) / len(lattice_point)) * lattice_point (len here being Euclidean and not 1-norm!).

> I need to find the original Fischer paper because this isn't making
> sense to me. The AC's aren't equally probable and don't have the same
> Laplacian distribution so PVQ just seems wrong.
>
> I did find this paper ("Robust image and video coding with pyramid
> vector quantisation") which uses PVQ and is making the vectors not from
> within the same block, but within the same *subband* in different
> spatial locations. eg. gathering all the AC20's from lots of neighboring
> blocks. That does make sense to me but I'm not sure if that's what
> everyone means when they talk about PVQ ?
>
> (paper attached to next email)

The link to the extended abstract for the Daala scheme (which covers this) is on the Xiph demo page:

http://jmvalin.ca/video/spie_pvq_abstract.pdf

Page 2 has the assignment of coeffs to subbands. They're only using a handful, and notably they treat 4x4 blocks as a single subband.

-Fabian
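
(Aside: that last point in concrete form - per band, the decoder just normalizes the pulse vector in L2 and scales by the decoded gain. A minimal sketch with invented names:)

  #include <math.h>

  // Hypothetical decode-side reconstruction of one band's ACs from the PVQ
  // lattice point y (integer pulses, L1 norm K) and the decoded gain.
  static void pvq_reconstruct(const int *y, int n, double gain, double *ac_out)
  {
      double len2 = 0.0;
      for (int i = 0; i < n; i++) len2 += (double)y[i] * y[i];
      len2 = sqrt(len2);                   // Euclidean length, not the 1-norm
      for (int i = 0; i < n; i++)
          ac_out[i] = (len2 > 0) ? gain * y[i] / len2 : 0.0;
  }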

cbloom:

Ah yeah, you are correct of course. I didn't see how you had the probabilities in the coding.

There are a lot of old papers I can't get about how to do the PVQ enumeration in an efficient way. I'm a bit curious about what they do. But as I'm starting to understand it all a bit now, that just seems like the least difficult part of the problem.

Basically the idea is something like -

divide the block into subbands. Let's say the standard wavelet tree for concreteness -

  01447777
  23447777
  55667777
  55667777
  8..

Send the sum in each subband ; this is the "gain" ; let's say g_s

g_s is sent with some scalar quantizer (how do you choose q_s ?) (in Daala a non-linear quantizer is used)

For each subband, scale the vector to an L1 length K_s (how do you choose K_s?)

Quantize the vector to a PVQ lattice point; send the lattice index

So PVQ (P = Pyramid) solves this problem of how to enumerate the distribution given the sum. But that's sort of the trivial part. The how do you send the subband gains, what is K, etc. is the hard part. Do the subband gains mask each other?

Then there's the whole issue of PVQ where P = Predictive. This Householder reflection business. Am I correct in understanding that Daala doesn't subtract off the motion prediction and make a residual? The PVQ (P = predictive) scheme is used instead? That's quite amazing. And it seems that Daala sends the original gain, not the gain of the residual (and uses the gain of the prediction as context).

The slides (reference #4) clear things up a bit.

ryg:

On 12/2/2014 8:21 PM, Charles Bloom {RAD} wrote:
> Ah yeah, you are correct of course. I didn't see how you had the
> probabilities in the coding.
>
> There are a lot of old papers I can't get about how to do the PVQ
> enumeration in an efficient way. I'm a bit curious about what they do.

Well, the one I linked to has a couple variants already. But it's pretty much besides the point. You can of course turn this into a giant combinatorical circle-jerk, but I don't see the use. For example (that's one of the things in the paper I linked to) if you're actually assigning indexes to values then yeah, the difference between assigning codewords in order { 0, -1, 1, -2, 2, ... } and { -k, -k+1, ..., -1, 0, 1, 2, ..., k } matters, but once you decompose it into several syntax elements most of that incidental complexity just disappears completely.

> But as I'm starting to understand it all a bit now, that just seems like
> the least difficult part of the problem.

Yeah, agreed.

> Basically the idea is something like -
>
> divide the block into subbands. Let's say the standard wavelet tree for
> concreteness -
>
>   01447777
>   23447777
>   55667777
>   55667777
>   8..

Yup.

> Send the sum in each subband ; this is the "gain" ; let's say g_s

No, the gain isn't sum (1-norm), it's the Euclidean (2-norm) length. If you used 1-norm you wouldn't deform the integer lattice, meaning you're still just a scalar quantizer, just one with a funky backend. E.g. in 2D, just sending k = q(|x| + |y|) (with q being a uniform scalar quantizer without dead zone for simplicity) and then coding where the pulses go is just using the same rectangular lattice as you would have if you were sending q(x), q(y) directly. (Once you add a dead zone that's not true any more; scalar favors a "+" shape around the origin whereas the 1-norm PVQ doesn't. But let's ignore that for now.)

With an ideal vector quantizer you make the "buckets" (=Voronoi regions) approximately equally likely. For general arbitrary 2D points that means the usual hex lattice. The PVQ equivalent of that is the NPVQ pattern:

https://people.xiph.org/~jm/daala/pvq_demo/quantizer4.png

That's clearly suboptimal (not a real hex lattice at all), but it has the nice gain/shape-separation: the circles are all equal-gain. You unwrap each circle by normalizing the point in the 1-norm, and then sending the corresponding AC pulses.

> g_s is sent with some scalar quantizer (how do you choose q_s ?) (in
> Daala a non-linear quantizer is used)

q_s would come from the rate control, as usual.

g codes overall intensity. You would want that to be roughly perceptually uniform. And you're not sending g at all, you're sending K. CIELab gamma (which is ~perceptually uniform) is 3, i.e. linear->CIELab is pow(x, 1/3). The Daala gain compander uses, surprise, 1/3. This would make sense except for the part where the CIE gamma deals in *linear* values and Daala presumably works on a gamma-infested color space, because that's what you get. My theory is this: the thing they're companding is not g_s, but g_s^2, i.e. sum of squares of AC coeffs. That makes for a total companding curve of (g_s)^(2/3). Display gamma is ~2, perceptually uniform gamma is ~3, so this would be in the right range to actually work out. They're not doing a good job of describing this though!

> For each subband, scale the vector to an L1 length K_s (how do you
> choose K_s?)

You don't. You have your companded g's. The companding warps the space so now we're in NPVQ land (the thing I sent the image URL for). The companded g is the radius of the circle you're actually on. But of course this is a quantizer so your choices of radius are discrete and limited. You look at circles with a radius in the right neighborhood (most obviously, just floor(g) and ceil(g), though you might want to widen the search if you're doing RDO). You find the closest lattice points on both circles (this is convex, so no risk of getting stuck in a local min). Choose whichever of the two circles is better. (All points *on* the same circle have the same cost, at least with vanilla PVQ. So the only RD trade-off you do is picking which circle.)

K_s is the index of the circle you're on. The origin is K_s=0, the first real circle is K_s=1 (and has 2N points where N is your dimensionality), and so forth.

> Quantize the vector to a PVQ lattice point; send the lattice index

Finding that is the convex search.

> So PVQ (P = Pyramid) solves this problem of how to enumerate the
> distribution given the sum. But that's sort of the trivial part.

Well, that's the combinatorical part. The actual vector quantizer is the idea of warping the 1-norm diamonds into sensibly-spaced 2-norm circles. The regular structure enables the simplified search.

> The how do you send the subband gains, what is K, etc. is the hard part.
> Do the subband gains mask each other?

Not sure if they're doing any additional masking beyond that. If they do, they're not talking about it.

> Then there's the whole issue of PVQ where P = Predictive. This
> Householder reflection business. Am I correct in understanding that
> Daala doesn't subtract off the motion prediction and make a residual?
> The PVQ (P = predictive) scheme is used instead? That's quite amazing.
> And it seems that Daala sends the original gain, not the gain of the
> residual (and uses the gain of the prediction as context).

As far as I can tell, yeah. And yes, definitely gain of the overall block, not of the residual!

Again, you have the separation into gain and shape here. The gains are coded separately, and hence out of the equation. What remains is unit vectors for both your target block and your prediction. That means your points are on a sphere. You do a reflection that aligns your prediction vector with the 1st AC coefficient. This rotates (well, reflects...) everything around but your block is still a unit vector on a sphere.

1st AC will now contain block_gain * dot(block_unit_vec, prediction_unit_vec). You already know block_gain. They send the dot product (cosine of the angle, but speaking about this in terms of angles is just confusing IMO; it's a correlation coefficient, period). This tells you how good the prediction is. If it's 0.9, you've just removed 90% of the energy to code. You need to quantize this appropriately - you want to make sure the quantizer resolution here is reasonably matched to quantizer resolution of points on your sphere, or you're wasting bits.

Now you turn whatever is left of g into K (as above).

You can iterate this as necessary. If you do bi-prediction, you can do another Householder reflection to align the energy of pred2 that was orthogonal to pred1 (the rest is gone already!) with the 2nd AC. You code another correlation coefficient and then deal with the residuals.

Fade-in / fade-out just kind of fall out when you do prediction like this. It's not a special thing. The "ACs" don't change, just the gains. Handling cross-fades with one predictor is still shitty, but if you're doing bipred they kinda fall out as well.

It all sounds pretty cool. But I have no clue at all how well it works in practice or where it stands cost/benefit-wise.

-Fabian
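
(Aside: the "find the closest lattice point on a circle" step is easy to sketch. This is the generic largest-remainder PVQ pulse search, not Daala's actual search and without any of the RDO ryg mentions:)

  #include <math.h>

  // Quantize a band's shape onto the K-pulse pyramid: scale so the 1-norm
  // is K, take floors (never overshoots), then greedily add pulses where
  // the remaining fractional part is largest until exactly K pulses are used.
  static void pvq_quantize_shape(const double *x, int n, int K, int *y)
  {
      double l1 = 0.0;
      for (int i = 0; i < n; i++) l1 += fabs(x[i]);
      if (l1 == 0) {                       // degenerate all-zero input:
          for (int i = 0; i < n; i++) y[i] = 0;
          if (K > 0) y[0] = K;             // arbitrary placement
          return;
      }

      int ky = 0;
      for (int i = 0; i < n; i++) {
          y[i] = (int)floor(fabs(x[i]) * K / l1);
          ky += y[i];
      }
      while (ky < K) {                     // distribute the leftover pulses
          int best = 0; double bestgain = -1.0;
          for (int i = 0; i < n; i++) {
              double gain = fabs(x[i]) * K / l1 - y[i];   // remaining fraction
              if (gain > bestgain) { bestgain = gain; best = i; }
          }
          y[best]++; ky++;
      }
      for (int i = 0; i < n; i++)          // restore signs
          if (x[i] < 0) y[i] = -y[i];
  }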

ryg:

> That means your points are on a sphere. You do a reflection that aligns
> your prediction vector with the 1st AC coefficient. This rotates (well,
> reflects...) everything around but your block is still a unit vector on
> a sphere.

Important note for this and all that follows: For this to work as I described it, your block and the prediction need to be in the same space, which in this context has to be frequency (DCT) space (since that's what you eventually want to code with PVQ), so you need to DCT your reference block first. This combined with the reflections etc. make this pretty pricey, all things considered.

If you weren't splitting by subbands, I believe you could finesse your way around this: (normalized) DCT and Householder reflections are both unitary, so they preserve both the L2 norm and dot products. Which means you could calculate both the overall gain and the correlation coeffs for your prediction *before* you do the DCT (and hence in the decoder, add that stuff back in post-IDCT, without having to DCT your reference). But with the subband splitting, that no longer works, at least not directly. You could still do it with a custom filter bank that just passes through precisely the DCT coeffs we're interested in for each subband, but eh, somehow I have my doubts that this is gonna be much more efficient than just eating the DCT. It would certainly add yet another complicated mess to the pile.

-Fabian

cbloom:

At 10:23 PM 12/2/2014, Fabian Giesen wrote:
> For this to work as I described it, your block and the prediction need
> to be in the same space, which in this context has to be frequency (DCT)
> space (since that's what you eventually want to code with PVQ), so you
> need to DCT your reference block first. This combined with the
> reflections etc. make this pretty pricey, all things considered.

Yeah, I asked Valin about this. They form an entire predicted *image* rather than block-by-block because of lapping. They transform the predicted image the same way as the current frame. Each subband gain is sent as a delta from the predicted image subband gain. Crazy!

His words :

> > You form the prediction in transformed space. (perhaps by having a
> > motion vector, taking the pixels it points to and transforming them,
> > dealing with lapping, yuck!)
>
> We have the input image and we have a predicted image. We just transform
> both. Lapping doesn't actually cause any issues there (unlike many other
> places). As far as I can tell, this part is similar to what a wavelet
> coder would do.

cbloom:

At 09:31 PM 12/2/2014, you wrote:
> q_s would come from the rate control, as usual.

Yeah, I just mean the details of that is actually one of the most important issues. eg. how does Q vary for the different subbands. Is there inter-subband masking, etc. In Daala the Q is non-linear (variance adaptive quantizer)

> g codes overall intensity. You would want that to be roughly perceptually
> uniform. And you're not sending g at all, you're sending K.

In Daala they send g and derive K.

> CIELab gamma (which is ~perceptually uniform) is 3, i.e. linear->CIELab
> is pow(x, 1/3). The Daala gain compander uses, surprise, 1/3. This would
> make sense except for the part where the CIE gamma deals in *linear*
> values and Daala presumably works on a gamma-infested color space,
> because that's what you get. My theory is this: the thing they're
> companding is not g_s, but g_s^2, i.e. sum of squares of AC coeffs. That
> makes for a total companding curve of (g_s)^(2/3). Display gamma is ~2,
> perceptually uniform gamma is ~3, so this would be in the right range to
> actually work out. They're not doing a good job of describing this
> though!

Err, yeah maybe. What they actually did was take the x264 VAQ and try to reproduce it.

> > For each subband, scale the vector to an L1 length K_s (how do you
> > choose K_s?)
>
> You don't. You have your companded g's. The companding warps the space so
> now we're in NPVQ land (the thing I sent the image URL for). The
> companded g is the radius of the circle you're actually on. But of course
> this is a quantizer so your choices of radius are discrete and limited.

No, that's not right. K is effectively your "distribution" quantizer. It should be proportional to g in some way (or some power of g) but it's not just g. As the quantizer for g goes up, K goes down. In Daala they choose K such that the distortion due to PVQ is the same as the distortion due to gain scalar quantization.

> 1st AC will now contain block_gain * dot(block_unit_vec,
> prediction_unit_vec). You already know block_gain. They send the dot
> product (cosine of the angle, but speaking about this in terms of angles
> is just confusing IMO; it's a correlation coefficient, period).

I think that in Daala they actually send the angle, not the cosine, which is important because of the non-linear quantization buckets.

It's difficult for me to intuit what the Householder reflection is doing to the residuals. But I guess it doesn't matter much.

It also all seems to fall apart a bit if the prediction is not very good. Then the gains might mismatch quite a bit, and even though you had some pixels that matched well, they will be scaled differently when normalized. It's a bit blah.

ryg:

> Yeah, I asked Valin about this. They form an entire predicted *image*
> rather than block-by-block because of lapping.

That doesn't have anything to do with the lapping, I think - that's because they don't use regular block-based mocomp. At least their proposal was to mix overlapping-block MC and Control Grid Interpolation (CGI, essentially you specify a small mesh with texture coordinates and do per-pixel tex coord interpolation). There's no nice way to do this block-per-block in the first place, not with OBMC in the mix anyway; if you chop it up into tiles you end up doing a lot of work twice.

ryg:

On 12/03/2014 10:26 AM, Charles Bloom {RAD} wrote:
> > g codes overall intensity. You would want that to be roughly
> > perceptually uniform. And you're not sending g at all, you're sending K.
>
> In Daala they send g and derive K.

Ah, my bad.

> > > For each subband, scale the vector to an L1 length K_s (how do you
> > > choose K_s?)
> >
> > You don't. You have your companded g's. The companding warps the space
> > so now we're in NPVQ land (the thing I sent the image URL for). The
> > companded g is the radius of the circle you're actually on. But of
> > course this is a quantizer so your choices of radius are discrete and
> > limited.
>
> No, that's not right. K is effectively your "distribution" quantizer. It
> should be proportional to g in some way (or some power of g) but it's not
> just g. As the quantizer for g goes up, K goes down. In Daala they choose
> K such that the distortion due to PVQ is the same as the distortion due
> to gain scalar quantization.

Ah OK, that makes sense.

> > 1st AC will now contain block_gain * dot(block_unit_vec,
> > prediction_unit_vec). You already know block_gain. They send the dot
> > product (cosine of the angle, but speaking about this in terms of
> > angles is just confusing IMO; it's a correlation coefficient, period).
>
> I think that in Daala they actually send the angle, not the cosine, which
> is important because of the non-linear quantization buckets.

Didn't check how they send it. I do find thinking of this in terms of cross-correlation between block and pred a lot simpler than phrasing it in terms of angles.

> It's difficult for me to intuit what the Householder reflection is doing
> to the residuals. But I guess it doesn't matter much.

The reflection itself doesn't do anything meaningful. Your normalized points were on a unit sphere before, and still are after. You're just spinning it around.

It does mean that your coefficient numbering really loses all meaning. After one such reflection, you're already scrambled. Overall energy is still the same (because it's unitary) but the direction is completely different. Since PVQ already assumes that the directions are equiprobable (well, more or less, since the PVQ doesn't actually uniformly cover the sphere), they don't care.

> It also all seems to fall apart a bit if the prediction is not very good.
> Then the gains might mismatch quite a bit, and even though you had some
> pixels that matched well, they will be scaled differently when
> normalized. It's a bit blah.

Well, it's just a different goal for the predictors. Regular motion search tries to minimize SAD or similar, as do the H.264 spatial predictors. For this kind of scheme you don't care about differences at all, instead you want to maximize the normalized correlation coeff between image and reference. (You want texture matches, not pixel matches.)

-Fabian
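
(Aside: the reflection itself, for concreteness - a minimal sketch, with the reflection axis and sign chosen to avoid cancellation as in standard Householder usage. Daala's actual code organizes this differently; r is the prediction's band and x the input's band, both already in the same transform space per the discussion above.)

  #include <math.h>

  // Build the reflection that sends r/|r| onto +/- e_axis and apply the
  // same reflection to x in place. Afterwards x[axis] carries (up to sign)
  // gain_x * <x_hat, r_hat>, i.e. the correlation with the prediction; the
  // other components are the part of x the prediction didn't explain.
  static void householder_apply(double *x, const double *r, int n)
  {
      if (n > 64) return;                   // fixed scratch size for the sketch

      int axis = 0;
      for (int i = 1; i < n; i++)
          if (fabs(r[i]) > fabs(r[axis])) axis = i;
      double s = (r[axis] >= 0) ? 1.0 : -1.0;

      double rlen = 0.0;
      for (int i = 0; i < n; i++) rlen += r[i] * r[i];
      rlen = sqrt(rlen);
      if (rlen == 0) return;                // no usable prediction

      // v = r_hat + s*e_axis ; H(y) = y - 2 v (v.y) / (v.v)
      double v[64], vv = 0.0, vx = 0.0;
      for (int i = 0; i < n; i++) {
          v[i] = r[i] / rlen + ((i == axis) ? s : 0.0);
          vv += v[i] * v[i];
          vx += v[i] * x[i];
      }
      for (int i = 0; i < n; i++)
          x[i] -= 2.0 * v[i] * vx / vv;
  }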

cbloom:

The other thing I note is that it doesn't seem very awesome at low bit rate. Their subband chunks are very large. Even at K=1 the N slots that could have that one value is very large, so sending the index of that one slot is a lot of bits.

At that point, the way you model the zeros and the location of the 1 is the most important thing. What I'm getting at is a lossy way of sending that.

ryg:

On 12/3/2014 1:04 PM, Charles Bloom {RAD} wrote:
> The other thing I note is that it doesn't seem very awesome at low bit
> rate. Their subband chunks are very large. Even at K=1 the N slots that
> could have that one value is very large, so sending the index of that one
> slot is a lot of bits.

Yeah, the decision to send a subband *at all* means you have to code gain, theta and your AC index. For N=16 that's gonna be hard to get below 8 bits even for trivial signals. At which point you get a big jump in the RD curve, which is bad.

Terriberry has a few slides that explain how they're doing inter-band activity masking currently:

https://people.xiph.org/~tterribe/daala/pvq201404.pdf

The example image is kind of terrible though. The "rose" dress (you'll see what I mean) is definitely better in the AM variant, but the rest is hard to tell for me unless I zoom in, which is cheating.

> At that point, the way you model the zeros and the location of the 1 is
> the most important thing. What I'm getting at is a lossy way of sending
> that.

This is only really interesting at low K, where the PVQ codebook is relatively small.

So, er, let's just throw this one in: suppose you're actually sending codebook indices. You just have a rate allocation function that tells you how many bits to send, independent of how big the codebook actually is. If you truly believe that preserving narrowband energy is more important than getting the direction right, then getting a random vector with the right energy envelope is better than nothing.

Say K=1, Ndim=16. You have N=32 codewords, so a codebook index stored directly is 5 bits. Rate function says "you get 0 bits". So you don't send an index at all, and the decoder just takes codeword 0. Or rate function says "you get 2 bits" so you send two bits of the codebook index, and take the rest as zero.

This is obviously biased. So the values you send aren't raw codebook indices. You have some random permutation function family p_x(i) : { 0, ..., N-1 } -> { 0, ..., N-1 } where x is a per-block value that both the encoder and decoder know (position or something), and what you send is not the codebook id but p_x(id).

For any given block (subband, whatever), this doesn't help you at all. You either guess right or you guess wrong. But statistically, suppose you shaved 2 bits off the codebook IDs for 1000 blocks. Then you'd expect about 250 of these blocks to reconstruct the right ACs. For the rest, you reconstructed garbage ACs, but it's garbage with the right energy levels at least! :)

No clue if this is actually a good idea at all. It definitely allows you to remove a lot of potholes from the RD curve.

-Fabian
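
(Aside: a sketch of that keyed-permutation trick. Everything here is invented for illustration - the permutation family (xor with a per-block key, multiply by an odd constant mod N) is just the simplest bijection that comes to mind, and N is assumed to be a power of two.)

  #include <stdint.h>

  static uint32_t inv_odd(uint32_t a)        // inverse of odd a mod 2^32
  {
      uint32_t x = a;                        // correct to 3 bits: a*a == 1 mod 8
      for (int i = 0; i < 4; i++) x *= 2 - a * x;  // Newton: doubles correct bits
      return x;
  }

  #define PERM_MUL 2654435761u               // any odd constant works

  static uint32_t perm(uint32_t id, uint32_t key, uint32_t mask)
  {
      key &= mask;
      return (((id ^ key) & mask) * PERM_MUL) & mask;
  }

  static uint32_t perm_inv(uint32_t pid, uint32_t key, uint32_t mask)
  {
      key &= mask;
      return ((((pid & mask) * inv_odd(PERM_MUL)) & mask) ^ key) & mask;
  }

  // Encoder: keep only the top 'bits' bits of the permuted codebook index.
  static uint32_t encode_index(uint32_t id, uint32_t key, int n_log2, int bits)
  {
      uint32_t mask = (1u << n_log2) - 1;
      return perm(id, key, mask) >> (n_log2 - bits);
  }

  // Decoder: fill the dropped low bits with zeros and invert the permutation.
  // Right roughly 1 time in 2^(n_log2-bits); otherwise a "random" codeword
  // with the right energy, which is the whole point of the scheme.
  static uint32_t decode_index(uint32_t sent, uint32_t key, int n_log2, int bits)
  {
      uint32_t mask = (1u << n_log2) - 1;
      return perm_inv(sent << (n_log2 - bits), key, mask);
  }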

ryg:

On 12/3/2014 1:48 PM, Fabian Giesen wrote:
> [..] This is obviously biased. So the values you send aren't raw codebook
> indices. You have some random permutation function family
> p_x(i) : { 0, ..., N-1 } -> { 0, ..., N-1 } where x is a per-block value
> that both the encoder and decoder know (position or something), and what
> you send is not the codebook id but p_x(id).

Now this is all assuming you either get the right code or you get garbage, and living with whichever one it is.

You can also go in the other direction and try to get the direction at least mostly right. You can try to determine an ordering of the code book so that distortion more or less smoothly goes down as you add extra bits. (First bit tells you which hemisphere, that kind of thing.) That way, if you get 4 bits out of 5, it's not a 50:50 chance between right vector and some random other vector, it's either the right vector or another vector that's "close". (Really with K=1 and high dim it's always gonna be garbage, though, because you just don't have any other vector in the code book that's even close; this is more interesting at K=2 or up).

This makes the per-block randomization (you want that to avoid systematic bias) harder, though. One approach that would work is to do a Householder reflection with a random vector (again hashed from position or similar).

All that said, I don't believe in this at all. It's "solving" a problem by "reducing" it to a more difficult unsolved problem (in this case, "I want a VQ codebook that's close to optimal for embedded coding"). Of course, even if you do a bad job here, it's still not gonna be worse than the direct "random permutation" stuff. But I doubt it's gonna be appreciably better either, and it's definitely more complex.

-Fabian

cbloom:

At 02:17 PM 12/3/2014, Fabian Giesen wrote:
> That way, if you get 4 bits out of 5, it's not a 50:50 chance between
> right vector and some random other vector, it's either the right vector
> or another vector that's "close".

Yes, this is the type of scheme I imagine. Sort of like a wavelet significant bit thing. As you send fewer bits the location gets coarser.

The codebook for K=1 is pretty obvious. You're just sending a location; you want the top bits to grossly classify the AC and the bottom bits to distinguish neighbors (H neighbors for H-type AC's, and V neighbors for V-type AC's)

For K=2 and up it's more complex. You could just train them and store them (major over-train risk) up to maybe K=3 but then you have to switch to an algorithmic method.

Really the only missing piece for me is how you get the # of bits used to specify the locations. It takes too many bits to actually send it, so it has to be implicit from some other factors like the block Q and K and I'm not sure how to get that.
