1/31/2013

1-31-13 - Understanding ANS - 2

So last time I wrote about how a string of output symbols like "012" describes an ANS state machine.

That particular string has all the values occuring the same number of times in as close to the same slot as possible. So they are encoded in nearly the same code length.

But what if they weren't all the same? eg. what if the decode string was "0102" ?

Then to decode, we could take (state % 4) and look it up in that array. For two values we would output a 0.

Alternatively we could say -


if the bottom bit is 0, we output a 0

if the bottom bit is 1, we need another bit to tell if we should output a 1 or 2

So the interesting thing is now to encode a 0, we don't need to do state *= 4. Our encode can be :

void encode(int & state, int val)
{
    ASSERT( val >= 0 && val < 3 );
    if ( val == 0 )
    {
        state = state*2;
    }
    else
    {
        state = state*4 + (val-1)*2 + 1;
    }
}

When you encode a 0, the state grows less. In the end, state must be transmitted using log2(state) bits, so when state grows less you send a value in fewer bits.

Note that when you decode you are doing (state %4), but to encode you only did state *= 2. That means when you decode you will see some bits from previously encoded symbols in your state. That's okay because those different values for state all correspond to the output. This is why when a symbol occurs more often in the output descriptor string it can be sent in fewer bits.

Now, astute readers may have noticed that this is a Huffman code. In fact Huffman codes are a subset of ANS, so let's explore that subset.

Say we have some Huffman codes, specified by code[sym] and codelen[sym]. The codes are prefix codes in the normal top-bit first sense. Then we can encode them thusly :


void encode(int & state, int val)
{
    state <<= codelen[sym];
    state |= reversebits( code[sym] ,  codelen[sym] );
}

where reversebits reverses the bits so that it is a prefix code from the bottom bit. Then you can decode either by reading bits one by one to get the prefix code, or with a table lookup :

int decode(int & state)
{
    int bottom = state & ((1<<maxcodelen)-1);
    int val = decodetable[bottom];
    state >>= codelen[val];
    return val;
}

where decodetable[] is the normal huffman fast decode table lookup, but it looks up codes that have been reversed.

So, what does this decodetable[] look like? Well, consider the example we did above. That corresponds to a Huffman code like this :


normal top-bit prefix :

0: 0
1: 10
2: 11

reversed :

0:  0
1: 01
2: 11

so the maxcodelen is 2. We enumerate all the 2-bit numbers and how they decode :

00 : 0
01 : 1
10 : 0
11 : 2

decodetable[] = { 0,1,0,2 }

So decodetable[] is the output state string that we talked about before.

Huffman codes create one restricted set of ANS codes with integer bit length encodings of every simple. But this same kind of system can be used with more general code sets, as we'll see later.

1/28/2013

01-28-13 - Importing Eudora MBX's to Gmail

I'd like to import all my old Eudora mail to gmail, to get it all together in one place, and for searchability.

(my current mail solution is to use Eudora POP on my local machine, but forward all my mail through gmail for spam filtering and archiving and searchability; it's working pretty well finally).

Gmail does not offer any "import from local disk" options. Sigh. There appear to be a few ways to do this :

1. Change my gmail temporarily to IMAP. Get all my Eudora MBX's into an IMAP client (something like Outlook; perhaps requiring an MBX to PST conversion step or something). Open the IMAP client and connect to gmail; drag the mail from the Eudora boxes to the gmail boxes.

Should work in theory, but a bit scary, and extremely slow (moving mail on IMAP is ungodly slow).

(Also, when I switch back to POP, is it going to redeliver me all that mail that I just uploaded? That would double-suck.)

2. Make a POP server somewhere. Convert the mbx's to mbox's to maildirs and dump them on the POP server for it to serve up. Tell gmail to grab mail from that POP server.

One issue is where I could get a POP server with a public IP and admin access. The other is that any time I try to do networking stuff it's a massive fail of mysterious problems and no error messages.

3. Get a Google Apps gmail account (different from regular gmail account for unknowable reasons). Import MBX's to Outlook. Use "Google Apps Migration for Microsoft Outlook" to import mail to Apps mail account. Use gmail fetcher to bring mail from apps-gmail to my normal gmail.

(similar alternative : get apps gmail. Convert mbx to mbox. Find a Mac. Use "Google Email Uploader for Mac" to upload the mbox. Transfer mail from apps-mail to normal mail).

(I could also use gmail API to write my own importer, but that also requires an Apps gmail, so may as well just use the existing importers in method 3)

It's all such a hassle that I'm once again tempted to just write my own damn email client. Sigh I wish I'd done that long ago, but it's always the local optimization to not do it. I'm so fucking sick of getting penis emails. Hello spam filterers, *penis* -> spam. You're welcome. And of course if I used my own email client, my private property (words) wouldn't be data-mined to serve me ads (you bastards).

(oddly gmail does remarkably well at spam detection on the cases that would be hard for me to do with simple filters; things like bank phishing mails that are designed to look exactly like legitimate mails from my bank; I don't think I could give that up, so I'd still be stuck with routing mail through gmail even if I had my own client).

old rants