6/08/2011

06-08-11 - Tech Todos

Publicly getting my thoughts together :

1. Oodle. Just finish it! God damn it.

2. JPEG decoder. I got really close to having this done, need to finish it. The main thing left that I want to do is work on the edge-adaptive-bilteral filter a bit more; currently it's a bit too strong on the non-artifact areas, I think I can make it more selective about only working on the ringing and blockiness. The other things I want are chroma-from-luma support and a special mode for text/graphics.

3. Byte-wise LZ with optimal parse. This has been on my list for a long time. I'm not really super motivated though. But it goes along with -

4. LZ string matcher test. Hash tables, hash->list, hash->bintree, hash->MMC, suffix trees, suffix arrays, patricia tries, etc. ? Would be nice to make a thorough test bed for this. (would also help the Oodle LZ encoder which is currently a bit slow due to me not spending any time on the string matcher).

5. Cuckoo hash / cache aware hash ; I did a bunch of hash testing a while ago and want to add this to my tests. I'm very curious about it, but this is kind of pointless.

6. Image doubler / denoiser / etc ; mmm meh I've lost my motivation for this. It's a big project and I have too many other things to do.

7. Make an indy game. Sometimes I get the craving to do something interactive/artistic. I miss being able to play with my work. (I also get the craving to make a "demo" which would be fun and is rather less pain in the butt than making a full game). Anyhoo, probably not gonna happen, since there's just not enough time for this.

ADDENDUM : some I forgot :

8. Finish my video codec ; I still want to redo my back end coder which was really never intented for video; maybe support multiple sizes of blocks; try some more perceptual metrics for encoder decision making; various other ideas.

9. New lossy image codec ; I have an unfinished one that I did for RAD, but I think I should just scrap it and do the new one. I'm interested in directional DCT. Also simple highly asymetric schemes, such as static classes that are encoder-optimized (instead of adaptive models; adaptive models are very bad for LHS). More generally, I have some ideas about trying to make a codec that is more explicitly perceptual, it might be terrible in rmse, but look better to the human eye; one part of that is using the imdiff metrics I trained earlier, another part is block classification (smooth,edge,detail) and special coders per class.

6/07/2011

06-07-11 - How to read an LZ compressed file

An example of the kind of shite I'm doing in Oodle these days.

You have an LZ-compressed file on disk that you want to get decompressed into memory as fast as possible. How do you do this ?

Well, first of all, you make your compressor write in independent chunks so that the decompressor can run on multiple chunks at the same time with threads. But to start you need to know where the chunks are in the file, so the first step is :


1. Fire an async read of the first 64k of the file to get the header.

the header will tell you where all the independent chunks are. (aside : in final Oodle there may also be an option to aglomerate all the headers of all the files, so you may already have this first 64k in memory).

So after that async read is finished, you want to fire a bunch of decomps on the chunks, so the way to do this is :


2. Make a "Worklet" (async function callback) which parses the header ; set the Worklet to run when the IO op #1
finishes.

I used to do this by having the WorkMgr get a signal from IO thread (which still happens) but I now also have a mechanism to just run Worklets directly on the IO thread, which is preferrable for Worklets that are super trivial like this one.

Now, if the file is small you could just have your Worklet #2 read the rest of the file and then fire async works on each one, but if the file is large that means you are waiting a long time for the IO before you start any decomp work, so that's not ideal, instead what we do is :


3. In Worklet #2, after parsing header, fire an async IO for each independent compressed chunk.  For each chunk, create
a decompression Worklet which is dependent on the IO of that chunk (and also neighbors, since due to IO
sector alignment the compression boundaries and IO boundaries are not quite the same).

So what this will do is start a bunch of IO's that then retire one by one, as each one retires it starts up the decomp task for that chunk. This means you start decompressing almost immediately and for large files you keep the CPU and IO busy the whole time.

Finally the main thread needs a way to wait for this all to be done. But the handles to the actual decompression async tasks don't exist until async task #2 runs, so the main thread can't wait on them directly. Instead :


4. At the time of initial firing (#1), create an abstract waitable handle and set it to "pending" state; then
pass this handle through your async chain.  Task #2 should set it to needing "N to go", since it's the first
point that knows the count, and then the actual async decompresses in #3 should decrement that counter.  So
the main thread can wait on it being "0 to go".

You can think of this as a sempahore, though in practice I don't use a semaphore because there are some OS's where that's not possible (sadly).

What the client sees is just :


AsyncHandle h = OodleLZ_StartDecompress( fileName );

Async_IsPending(h); ?

Async_Block(h);

void * OodleLZ_GetFinishedDecompress( h );

if they just want to wait on the whole thing being done. But if you're going to parse the decompressed file, it's more efficient to only wait on the first chunk being decompressed, then parse that chunk, then wait on the next chunk, etc. So you need an alternate API that hands back a bunch of handles, and then a streaming File API that does the waiting for you.

6/04/2011

06-04-11 - Keep Case

I've been meaning to do this for a long time and finally got off my ass.

TR (text replace) and zren (rename) in ChukSH now support "keep case".

Keep case is pretty much what you always want when you do text replacement (especially in source code), and everybody should copy me. For example when I do a find-replace from "lzp1f" -> "lzp1g" what I want is :


lzp1f -> lzp1g  (lower->lower)
LZP1F -> LZP1G  (upper->upper)
Lzp1f -> Lzp1g  (first cap->first cap)
Lzp1G -> Lzp1G  (mixed -> mixed)

The kernel that does this is matchpat in cblib which will handle rename masks like : "poop*face" -> "shit*butt" with keep case option or not.

In a mixed-wild-literal renaming spec like that, the "keep case" applies only to the literal parts. That is, "poop -> shit" and "face -> butt" will be applied with keep-case independently , the "*" part will just get copied.

eg :


Poop3your3FACE -> Shit3your3BUTT

Also, because keep-case is applied to an entire chunk of literals, it can behave somewhat unexpectedly on file renames. For example if you rename

src\lzp* -> src\zzh*

the keep-case will apply to the whole chunk "src\lzp" , so if you have a file like "src\LZP" that will be considered "mixed case" not "all upper". Sometimes my intuition expects the rename to work on the file part, not the full path. (todo : add an option to separate the case-keeping units by path delims)

The way I handle "mixed case" is I leave it up to the user to provide the mixed case version they want. It's pretty impossible to get it right automatically. So the replacement text should be provided in the ideal mixed case capitalization. eg. to change "HelpSystem" to "QueryManager" you need to give me "QueryManager" as the target string, capitalized that way. All mixed case source occurances of "HelpSystem" will be changed to the same output, eg.


helpsystem -> querymanager
HELPSYSTEM -> QUERYMANAGER
Helpsystem -> Querymanager
HelpSystem -> QueryManager
HelpsYstem -> QueryManager
heLpsYsTem -> QueryManager
HeLPSYsteM -> QueryManager

you get it.

The code is trivial of course, but here it is for your copy-pasting pleasure. I want this in my dev-studio find/replace-in-files please !


// strcpy "putString" to "into"
//  but change its case to match the case in src
// putString should be mixed case , the way you want it to be if src is mixed case
void strcpyKeepCase(
        char * into,
        const char * putString,
        const char * src,
        int srcLen);

void strcpyKeepCase(
        char * into,
        const char * putString,
        const char * src,
        int srcLen)
{   
    // okay, I have a match
    // what type of case is "src"
    //  all lower
    //  all upper
    //  first upper
    //  mixed
    
    int numLower = 0;
    int numUpper = 0;
    
    for(int i=0;i<srcLen;i++)
    {
        ASSERT( src[i] != 0 );
        if ( isalpha(src[i]) )
        {
            if ( isupper(src[i]) ) numUpper++;
            else numLower++;
        }
    }
    
    // non-alpha :
    if ( numLower+numUpper == 0 )
    {
        strcpy(into,putString);
    }
    else if ( numLower == 0 )
    {
        // all upper :
        while( *putString )
        {
            *into++ = toupper( *putString ); putString++;
        }
        *into = 0;
    }
    else if ( numUpper == 0 )
    {
        // all lower :
        while( *putString )
        {
            *into++ = tolower( *putString ); putString++;
        }
        *into = 0;
    }
    else if ( numUpper == 1 && isalpha(src[0]) && isupper(src[0]) )
    {
        // first upper then low
        
        if( *putString ) //&& isalpha(*putString) )
        {
            *into++ = toupper( *putString ); putString++;
        }
        while( *putString )
        {
            *into++ = tolower( *putString ); putString++;
        }
        *into = 0;
    }
    else
    {
    
        // just copy putString - it should be mixed 
        strcpy(into,putString);
    }
}


ADDENDUM : on a roll with knocking off stuff I've been meaning to do for a while ...

ChukSH now also contains "fixhtmlpre.exe" which fixes any less-than signs that are found within a PRE chunk.

Hmm .. something lingering annoying going on here. Does blogger convert and-l-t into less thans?

ADDENDUM : yes it does. Oh my god the web is so fucked. I've been doing a bit of reading and it appears this is a common and atrocious hack. Basically the problem is that people use XML for the markup of the data transfer packets. Then they want to sent XML within those packets. So you have to form some shit like :


<data> I want to send <B> this </B> god help me </data>

but putting the less-thans inside the data packet is illegal XML (it's supposed to be plain text), so instead they send

<data> I want to send &-l-tB> this &-l-t/B> god help me </data>

but they want the receiver to see a less-than, not the characters &-l-t , so the receiver parses those codes back into less-than and then treats the data received as its own hunk of XML with internal markups.

Basically people use it as a way to send codes that the current parser will ignore, but the next parser will see. There are lots of pages about how this is against compliance standards but nobody cares and it seems to be widespread.

So anyway, the conclusion is : just changing less thans to &-l-t works fine if you are just posting html (eg. for rants.html it works fine) but for sending to Blogger (or probably any other modern XML-based app) it doesn't.

The method I use now which seems to work on Blogger is I convert less thans to


<code><</code>

How is there not a fucking "literal" tag ? (There is one called XMP but it's deprecated and causes line breaks, and it's really not just a literal tag around a bunch of characters, it's a browser format mode change)

6/03/2011

06-03-11 - Amalgamate

So, here you go : amalgamate code & exe (105k)

The help is this :


amalgamate built May 19 2011, 18:01:28
args: amalgamate
HELP :
usage : amalgamate [-options] <to> <from1> [from2...]
options:
-q  : quiet
-v  : verbose
-c  : add source file's local dir to search paths
-p  : use pragma once [else treat all as once]
-r  : recurse from dirs [list but don't recurse]
-xS : extension filter for includes S=c;cpp;h;inl or whatever
-eS : extension filter for enum of from dir
-iS : add S to include path to amalgamate

from names can be files or dirs
use -i only for include dirs you want amalgamated (not system dirs)

What it does : files that are specified in the list of froms (and match the extension filter for enum of from dir), or are found via #include (and match the extension filter for includes), are concatted in order to the output file. #includes are only taken if they are in one of the -I listed search dirs.

-p (use pragma once) is important for me - some of my #includes I need to occur multiple times, and some not. Amalgamate tells the difference by looking for "pragma once" in the file. eg. stuff like :

#define XX stuff
#include "use_XX.inc"
#define XX stuff2
#include "use_XX.inc"
needs to include the .inc both times. But most headers should only be included once (and those have #pragma once in them).

So for example I made a cblib.h thusly :


amalgamate cblib.h c:\src\cblib c:\src\cblib\LF c:\src\cblib\external -Ic:\src -p -xh;inc;inl -eh

which seems to work. As another test I made an amalgamated version of the test app for rmse_hvs_t that I gave to Ratcliff. This was made with :

amalgamate amalgamate_rmse_hvs_t.cpp main_rmse_hvs_t.cpp rmse_hvs_t.cpp -I. -v -Ic:\src -p

and the output is here : amalgamate_rmse_hvs_t.zip (83k)


But for anything large (like cblib.cpp) this way of sticking files together just doesn't work. It should be obvious why now that we're thinking about it - C definitions last until end of file (or "translation unit" if you like), and many files have definitions or symbols of the same name that are not the same thing - sometimes just by accidental collision, but often quite intentionally!

The accidental ones are things like using "#define XX" in lots of files ; you can fix those by always using your file name in front of definitions that you want to only be in your file scope (or by being careful to #undef) - also local namespacing variables and etc. etc. So you can deal with that.

But non-coincidental collisions are quite common as well. For example I have things like :

replace_malloc.h :
  #define malloc my_malloc

replace_malloc.c :
  void * my_malloc ( return malloc(); }

It's very important that replace_malloc.c doesn't include replace_malloc.h , but when you amalgamate it might (depending on order).

Another nasty one is the common case where you are supposed to do some #define before including something. eg. something like :

#define CB_HUFFMAN_UNROLL_COUNT 16
#include "Huffman.h"
that kind of thing is destroyed by amalgamate (only the first include will have effect, and later people who wanted different numbers don't get what they expected). Even windows.h with the WINNT_VER and LEAN_AND_MEAN gets hosed by this.

You can also get very nasty bugs just by tacking C files together. For example in plain C you could have :

file1 : 
static int x;
file 2 :
int x = 7;
and in C that is not an error, but now two separate variables have become one when amalgamated. I'm sure there are tons of other evil hidden ways this can fuck you.

So I think it's basically a no-go for anything but tiny code bases, or if you very carefully write your code for amalgamation from the beginning (and always test the amalgamated build, since it can pick up hidden bugs).

5/31/2011

05-31-11 - STB style code

I wrote a couple of LZP1 implementations (see previous) in "STB style" , that is, plain C, ANSI, single headers you can just include and use. It's sort of wonderfully simple and easy to use. Certainly I understand the benefit - if I'm grabbing somebody else's code to put in my project, I want it to be STB style, I don't want some huge damn library.

(for example I actually use the James Howse "lsqr.c" which is one file, I also use "divsufsort.c" which is a delightful single file, those are beautiful little pieces of code that do something difficult very well, but I would never use some beast like the GNU Triangulated Surface lib, or OpenCV or any of those big bloated libs)

But I just struggle to write code that way. Like even with something as simple as the LZP's , okay fine you write an ANSI version and it works. But it's not fast and it's not very friendly.

I want to add prefetching. Well, I have a module "mem.h" that does platform-independent prefetching, so I want to include that. I also want fast memsets and memcpys that I already wrote, so do I just copy all that code in? Yuck.

Then I want to support streaming in and out. Well I already have "CircularBuffer.h" that does that for me. Sure I could just rewrite that code again from scratch, but this is going backwards in programming style and efficiency, I'm duplicating and rewriting code and that makes unsafe buggy code.

And of course I want my assert. And if I'm going to actually make an EXE that's fast I want my async IO.

I just don't see how you can write good code this way. I can't do it; it totally goes against my style, and I find it very difficult and painful. I wish I could, it would make the code that I give away much more useful to the world.

At RAD we're trying to write code in a sort of heirarchy of levels. Something like :


very low level : includes absolutely nothing (not even stdlib)
low level : includes only low level (or lower) (can use stdlib)
              low level stuff should run on all platforms
medium level : includes only medium level (or lower)
               may run only on newer platforms
high level : do whatever you want (may be PC only)

This makes a lot of sense and serves us well, but I just have so much trouble with it.

Like, where do I put my assert? I like my assert to do some nice things for me, like log to file, check if a debugger is present and int 3 only if it is (otherwise do an interactive dialog). So that's got to be at least "medium level" - so now I'm writing some low level code and I can't use my assert!

Today I'm trying to make a low level logging faccility that I can call from threads and it will stick the string into a lock-free queue to be flushed later. Well, I've already got a bunch of nice lockfree queues and stuff ready to go, that are safe and assert and have unit tests - but those live in my medium level lib, so I can't use them in the low level code that I want to log.

What happens to me is I wind up promoting all my code to the lowest level so that it can be accessible to the place that I want it.

I've always sort of struggled with separated libs in general. I know it's a nice idea in theory to build your game out of a few independent (or heirarchical) libs, but in practice I've always found that it creates more friction than it helps. I find it much easier to just throw all my code in a big bag and let each bit of code call any other bit of code.

5/20/2011

05-20-11 - LZP1 Variants

LZP = String match compression using some predictive context to reduce the set of strings to match

LZP1 = variant of LZP without any entropy coding

I've just done a bunch of LZP1 variants and I want to quickly describe them for my reference. In general LZP works thusly :


Make some context from previous bytes
Use context to look in a table to see a set of previously seen pointers in that context
  (often only one, but maybe more)

Encode a flag for whether any match, which one, and the length
If no match, send a literal

Typically the context is made by hashing some previous bytes, usually with some kind of shift-xor hash. As always, larger hashes generally mean more compression at the cost of more memory. I usually use a 15 bit hash, which means 64k memory use if the table stores 16 bit offsets rather than pointers.

Because there's no entropy coding in LZP1, literals are always sent in 8 bits.

Generally in LZP the hash table of strings is only updated at literal/match decision points - not for all bytes inside the match. This helps speed and doesn't hurt compression much at all.

Most LZP variants benefit slightly from "lazy parsing" (that is, when you find a match in the encoder, see if it's better to instead send a literal and take the match at the next byte) , but this hurts encoder speed.

LZP1a : Match/Literal flag is 1 bit (eight of them are sent in a byte). Single match option only. 4 bit match length, if match length is >= 16 then send full bytes for additional match length. This is the variant of LZP1 that I did for Clariion/Data General for the Pentium Pro.

LZP1b : Match/Literal is encoded as 0 = LL, 10 = LM, 11 = M (this is the ideal encoding if literals are twice as likely as matches) ; match length is encoded as 2 bits, then if it's >= 4 , 3 more bits, then 5 more bits, then 8 bits (and after that 8 more bits as needed). This variant of LZP1 was the one published back in 1995.

LZP1c : Hash table index is made from 10 bits of backwards hash and 5 bits of forward hash (on the byte to be compressed). Match/Literal is a single bit. If a match is made, a full byte is sent, containing the 5 bits of forward hash and 3 bits of length (4 bits of forward hash and 4 bits of length is another option, but is generally slightly worse). As usual if match length exceeds 3 bits, another 8 bits is sent. (this is a bit like LZRW3, except that we use some backward context to reduce the size of the forward hash that needs to be sent).

LZP1d : string table contains 2 pointers per hash (basically a hash with two "ways"). Encoder selects the best match of the two and send a 4 bit match nibble consisting of 1 selection bit and 3 bits of length. Match flag is one bit. Hash way is the bottom bit of the position, except that when a match is made the matched-from pointer is not replaced. More hash "ways" provide more compression at the cost of more memory use and more encoder time (most LZP's are symmetric, encoder and decoder time is the same, but this one has a slower encoder) (nowadays this is called ROLZ).

LZP1e : literal/match is sent as run len; 4 bit nibble is divided as 0-4 = literal run length, 5-15 = match length. (literal run length can be zero, but match length is always >= 1, so if match length >= 11 additional bytes are sent). This variant benefits a lot from "Literal after match" - after a match a literal is always written without flagging it.

LZP1f is the same as LZP1c.

LZP1g : like LZP1a except maximum match length is 1, so you only flag literal/match, you don't send a length. This is "Predictor" or "Finnish" from the ancient days. Hash table stores chars instead of pointers or offsets.

Obviously there are a lot of ways that these could all be modifed to get more compression (*), but it's rather pointless to go down that path because then you should just use entropy coding.

(* a few ways : combine the forward hash of lzp1c with the "ways" of lzp1d ; if the first hash fails to match escape down to a lower order hash (such as maybe just order-1 plus 2 bits of position) before outputting a literal ; output literals in 7 bits instead of 8 by using something like an MTF code ; write match lengths and flags with a tuned variable-bit code like lzp1b's ; etc. )


Side note : while writing this I stumbled on LZ4 . LZ4 is almost exactly "LZRW1". It uses a hash table (hashing the bytes to match, not the previous bytes like LZP does) to find matches, then sends the offset (it's a normal LZ77, not an LZP). It encodes as 4 bit literal run lens and 4 bit match lengths.

There is some weird/complex stuff in the LZ4 literal run len code which is designed to prevent it from getting super slow on random data - basically if it is sending tons of literals (more than 128) it starts stepping by multiple bytes in the encoder rather than stepping one byte at a time. If you never/rarely compress random data then it's probably better to remove all that because it does add a lot of complexity.

REVISED : Yann has clarified LZ4 is BSD so you can use it. Also, the code is PC only because he makes heavy use of unaligned dword access. It's a nice little simple coder, and the speed/compression tradeoff is good. It only works well on reasonably large data chunks though (at least 64k). If you don't care so much about encode time then something that spends more time on finding good matches would be a better choice. (like LZ4-HC, but it seems the LZ4-HC code is not in the free distribution).

He has a clever way of handling the decoder string copy issue where you can have overlap when the offset is less than the length :


    U32     dec[4]={0, 3, 2, 3};

    // copy repeated sequence
    cpy = op + length;
    if (op-ref < 4)
    {
        *op++ = *ref++;
        *op++ = *ref++;
        *op++ = *ref++;
        *op++ = *ref++;
        ref -= dec[op-ref];
    }
    while(op < cpy) { *(U32*)op=*(U32*)ref; op+=4; ref+=4; }
    op=cpy;     // correction

This is something I realized as well when doing my LZH decoder optimization for SPU : basically a string copy with length > offset is really a repeating pattern, repeating with period "offset". So offset=1 is AAAA, offset=2 is ABAB, offset=3 is ABCABC. What that means is once you have copied the pattern a few times the slow way (one byte at a time), then you can step back your source pointer by any multiple of the offset that you want. Your goal is to step it back enough so that the separation between dest and source is bigger than your copy quantum size. Though I should note that there are several faster ways to handle this issue (the key points are these : 1. you're already eating a branch to identify the overlap case, you may as well have custom code for it, and 2. the single repeating char situation (AAAA) is by far more likely than any other).

ADDENDUM : I just found the LZ4 guy's blog (Yann Collet, who also did the fast "LZP2"), there's some good stuff on there. One I like is his compressor ranking . He does the right thing ( I wrote about here ) which is to measure the total time to encode,transmit,decode, over a limitted channel. Then you look at various channel speeds and you can see in what domain a compressor might be best. But he does it with nice graphs which is totally the win.

5/13/2011

05-13-11 - Avoiding Thread Switches

A very common threading model is to have a thread for each type of task. eg. maybe you have a Physics Thread, Ray Cast thread, AI decision thread, Render Thread, an IO thread, Prefetcher thread, etc. Each one services requests to do a specific type of task. This is good for instruction cache (if the threads get big batches of things to work on).

While this is conceptually simple (and can be easier to code if you use TLS, but that is an illusion, it's not actually simpler than fully reentrant code in the long term), if the tasks have dependencies on each other, it can create very complex flow with lots of thread switches. eg. thread A does something, thread B waits on that task, when it finishes thread B wakes up and does something, then thread A and C can go, etc. Lots of switching.

"Worklets" or mini work items which have dependencies and a work function pointer can make this a lot better. Basically rather than thread-switching away to do the work that depended on you, you do it immediately on your thread.

I started thinking about this situation :

A very simple IO task goes something like this :


Prefetcher thread :

  issue open file A

IO thread :

  execute open file A

Prefetcher thread :

  get size of file A
  malloc buffer of size
  issue read file A into buffer
  issue close file A

IO thread :

  do read on file A
  do close file A

Prefetcher thread :

  register file A to prefetched list

lots of thread switching back and forth as they finish tasks that the next one is waiting on.

The obvious/hacky solution is to create larger IO thread work items, eg. instead of just having "open" and "read" you could make a single operation that does "open, malloc, read, close" to avoid so much thread switching.

But that's really just a band-aid for a general problem. And if you keep doing that you wind up turning all your different systems into "IO thread work items". (eg. you wind up creating a single work item that's "open, read, decompress, parse animation tree, instatiate character"). Yay you've reduced the thread switching by ruining task granularity.

The real solution is to be able to run any type of item on the thread and to immediately execute them. Instead of putting your thread to sleep and waking up another one that can now do work, you just grab his work and do it. So you might have something like :


Prefetcher thread :

  queue work items to prefetch file A
  work items depend on IO so I can't do anything and go to sleep

IO thread :

  execute open file A

  [check for pending prefetcher work items]
  do work item :

  get size of file A
  malloc buffer of size
  issue read file A into buffer
  issue close file A

  do IO thread work :

  do read on file A
  do close file A

  [check for pending prefetcher work items]
  do work item :

  register file A to prefetched list

so we stay on the IO thread and just pop off prefetcher work items that depended on us and were waiting for us to be able to run.

More generally if you want to be super optimal there are complicated issues to consider :

i-cache thrashing vs. d-cache thrashing :

If we imagine the simple conceptual model that we have a data packet (or packets) and we want to do various types of work on it, you could prefer to follow one data packet through its chain of work, doing different types of work (thrashing i-cache) but working on the same data item, or you could try to do lots of the same type of work (good for i-cache) on lots of different data items.

Certainly in some cases (SPU and GPU) it is much better to keep i-cache coherent, do lots of the same type of work. But this brings up another issue :

Throughput vs. latency :

You can generally optimize for throughput (getting lots of items through with a minimum average time), or latency (minimizing the time for any one item to get from "issued" to "done"). To minimize latency you would prefer the "data coherent" model - that is, for a given data item, do all the tasks on it. For maxmimum throughput you generally preffer "task coherent" - that is, do all the data items for each type of task, then move on to the next task. This can however create huge latency before a single item gets out.

ADDENDUM :

Let me say this in another way.

Say thread A is doing some task and when it finishes it will fire some Event (in Windows parlance). You want to do something when that Event fires.

One way to do this is to put your thread to sleep waiting on that Event. Then when the event fires, the kernel will check a list of threads waiting on that event and run them.

But sometimes what you would rather do is to enqueue a function pointer onto that Event. Then you'd like the Kernel to check for any functions to run when the Event is fired and run them immediately on the context of the firing thread.

I don't know of a way to do this in general on normal OS's.

Almost every OS, however, recognizes the value of this type of model, and provides it for the special case of IO, with some kind of IO completion callback mechanism. (for example, Windows has APC's, but you cannot control when an APC will be run, except for the special case of running on IO completion; QueueUserAPC will cause them to just be run as soon as possible).

However, I've always found that writing IO code using IO completion callbacks is a huge pain in the ass, and is very unpopular for that reason.

5/08/2011

05-08-11 - Torque vs Horsepower

This sometimes confuses me, and certainly confuses a lot of other people, so let's go through it a bit.

I'm also motivated by this page : Torque and Horsepower - A Primer in which Bruce says some things that are slightly imprecise in a scientific sense but are in fact correct. Then this A-hole Thomas Barber responds with a dickish pedantic correction which adds nothing to our understanding.

We're going to talk about car engines, the goal is to develop sort of an intuition of what the numbers mean. If you look on Wikipedia or whatever there will be some frequently copy-pasted story about James Watt and horses pulling things and it's all totally irrelevant. We're not using our car engine to power a generator or grind corn or whatever. We want acceleration.

The horizontal acceleration of the car is proportional to the angular acceleration of the wheels (by the circumference of the wheels). The angular acceleration of the wheels is proportional to the angular acceleration of the flywheel, modulo the gear ratio in the transmission. The angular acceleration of the flywheel is proportional to the torque of the engine, modulo moment of inertia.

For a fixed gear ratio :

torque (at the engine) ~= vehicle acceleration

(where ~= means proportional)

So if we all had no transmission, then all we would care about is torque and horsepower could go stuff itself.

But we do have transmissions, so how does that come into play?

To maximize vehicle acceleration you want to maximize torque at the wheels, which means you want to maximize

vehicle acceleration ~= torque (at the engine) * gear ratio

where gear ratio is higher in lower gears, that is, gear ratio is the number of times the engine turns for one turn of the wheels :

gear ratio = (engine rpm) / (wheel rpm)

which means we can write :

vehicle acceleration ~= torque (at the engine) * (engine rpm) / (wheel rpm)

thus at any given vehicle speed (eg. wheel rpm held constant), you maximize acceleration by maximizing [ torque (at the engine) * (engine rpm) ] . But this is just "horsepower" (or more generally we should just say "power"). That is :

horsepower ~= torque (at the engine) * (engine rpm)

vehicle acceleration ~= horsepower / (wheel rpm)

Note that we don't have to say that the power is measured at the engine, because due to conservation of energy the power production must be the same no matter how you measure it (unlike torque which is different at the crank and at the wheels). Power is of course the energy production per unit time, or if you like it's the rate that work can be done. Work is force over distance, so Power is just ~= Force * velocity. So if you like :

horsepower ~= torque (at the engine) * (engine rpm)

horsepower ~= torque (at the wheels) * (wheel rpm)

horsepower ~= vehicle acceleration * vehicle speed

(note this is only true assuming no dissipative forces; in the real world the power at the engine is greater than the power at the wheels, and that is greater than the power measured from motion)

Now, let's go back to this statement : "any given vehicle speed (eg. wheel rpm held constant), you maximize acceleration by maximizing horsepower". The only degree of freedom you have at constant speed is changing gear. So this just says you want to change gear to maximize horsepower. On most real world engines this means you should be in as low a gear as possible at all times. That is, when drag racing, shift at the red line.

The key thing that some people miss is you are trying to maximize *wheel torque* and in almost every real world engine, the effect of the gear ratio is much more important that the effect of the engine's torque curve. That is, staying in as low a gear as possible (high ratio) is much more important than being at the engine's peak torque.


Let's consider some examples to build our intuition.

The modern lineup of 911's essentially all have the same torque. The Carrera, the GT3, and even the RSR all have around 300 lb-ft of torque. But they have different red lines, 7200, 8400 and 9400.

If we pretend for the moment that the masses are the same, then if you were all cruising along side by side in 2nd gear together and floored it - they would accelerate exactly the same.

The GT3 and RSR would only have an advantage when the Carrera is going to hit red line and has to shift to 3rd, and they can stay in 2nd - then their acceleration will be better by the factor of gear ratios (something like 1.34 X on most 2nd-3rd gears).

Note the *huge* difference in acceleration due to gearing. Even if the upshift got you slightly more torque by putting you in the power band of the engine, the 1.34 X from gearing is way too big to beat.

(I should note that in the real world, not only are the RSR/R/Cup (racing) versions of the GT3 lighter, but they also have a higher final drive ratio and some different gearing, so they are actually faster in all gears. A good mod to the GT3 is to get the Cup gears)


Another example :

Engine A has 200 torques (constant over the rpm range) and revs to 4000 rpm. Engine B has 100 torques and revs to 8000 rpm. They have the exact same peak horsepower (800 torques*krpm) at the top of their rev range. How do they compare ?

Well first of all, we could just gear down Engine B by 2X so that for every two turns it made the output shaft only made one turn. Then the two engines would be exactly identical. So in that sense we should see that horsepower is really the rating of the potential of the engine, whereas torque tells you how well the engine is optimized for the gearing. The higher torque car is essentially steeper geared at the engine.

How do they compare on the same transmission? In 1st gear Car A would pull away with twice the acceleration of Car B. It would continue up to 4000 rpm then have to change gears. Car B would keep running in 1st gear up to 8000 rpm, during which time it would have more acceleration than car A (by the ratio of 1st to 2nd gear).

So which is actually faster to 100 mph ?

You can't answer that without knowing about the transmission. If gear changes took zero time (and there was no problem with traction loss under high acceleration), the faster car would be the higher torque car. In fact if gear changes took zero time you would want an infinite number of gears so that you could keep the car at max rpm at the time, not because you are trying to stay in the "power band" but simply because max rpm means you can use higher gearing to the wheels.

I wrote a little simulator. Using the real transmission ratios from a Porsche 993 :


Transmission Gear Ratios: 3.154, 2.150, 1.560, 1.242, 1.024, 0.820 
Rear Differential Gear Ratio: 3.444 
Rear Tire Size: 255/40/17  (78.64 inch cirumference)
Weight : 3000 pounds

and 1/3 of a second to shift, I get :

200 torque, 4000 rpm redline :

time_to_100 = 15.937804

100 torque, 8000 rpm redline :

time_to_100 = 17.853252

higher torque is faster. But what if we can tweak our transmission for our engine? In particular I will make only the final drive ratio free and optimize that with the gear ratios left the same :

200 torque, 4000 rpm redline :

c_differential_ratio = 3.631966
time_to_100 = 15.734542

100 torque, 8000 rpm redline :

c_differential_ratio = 7.263932
time_to_100 = 15.734542

exact same times, as they should be, since the power output is the same, with double the gear ratio.

In the real world, almost every OEM transmission is geared too low for an enthusiast driver. OEMs offer transmission that minimize the number of shifts, offer over-drive gears for quiet and economy, etc. If you have a choice you almost always want to gear up. This is one reason why in the real world torque is king ; low-torque high-power engines could be good if you had sufficiently high gearing, but that high gearing just doesn't exist (*), so the alternative is to boost your torque.

(* = drag racers build custom gear boxes to optimize their gearing ; there are also various practical reasons why the gear ratios in cars are limitted to the typical range they are in ; you can't have too many teeth, because you want the gears to be reasonably small in size but also have a minimum thickness of teeth for strength, high gear ratios tend to produce a lot of whine that people don't like, etc. etc.)

One practical issue with this these days is that more and more sports cars use "transaxles". Older cars usually had the transmission up front and then a rear differential. It was easy to change the final drive ratio in the rear differential so all the old American muscle cars talk about running a 4.33 or whatever different ratios. Nowadays lots of cars have the transmission and rear differential together in the back to balance weight (from the Porsche 944 design). While that is mostly a cool thing, it makes changing the final drive much more expensive and much harder to find gears for. But it is still one of the best mods you can do for any serious driver.

(another reason that car gear ratios suck so bad is the emphasis on 0-60 times means that you absolutely have to be able to reach 60 in 2nd gear. That means 1st and 2nd can't be too high ratio. Without that constraint you might actually want 2nd to max out at 50 mph or something. There are other stupid goals that muck up gearings, like trying to acheive a high top speed).


Let's look at a final interesting case. Drag racers often use a formula like :


speed at end of 1/4 mile :

MPH = 234 * (Horsepower / Pounds) ^ .3333

and it is amazingly accurate. And yet it doesn't contain anything about torque or gear ratios. (they of course also use much more complex calculators that take everything into account). How does this work ?

A properly set up drag car is essentially running at power peak the whole time. They start off the line at high revs, and then the transmission is custom geared to keep the engine in power band, so it's a reasonable approximation to assume constant power the entire time.

So if you have constant power, then :


  d/dt E = P

  d/dt ( 1/2 mv^2 ) = P

  integrate :

  1/2 mv^2 = P * t

  v^2 = 2 * (P/m) * t 

  distance covered is : 
  
  x = 1/2 a t^2

  and P = m a v

  a = (P/m) / v

  so

  t = sqrt( 2*x*v / (P/m) )

  sqrt( 2*x*v / (P/m) ) = v^2 / ( 2 * (P/m) )

  simplify :

  v = 2 * ( x * (P/m) ) ^(1/3)

which is the drag racer's formula. Speed is proportional to distance covered times power-to-weight to the one third power.

If you're looking at "what is the time to reach X" (X being some distance or some mph), the only thing that matters is power-to-weight *assuming* the transmission has been optimized for the engine.


I think there's more to say about this, but I'm bored of this topic.

ADDENDUM :

Currently the two figures that we get to describe a car's engine are Horsepower (at peak rpm) and Torque (at peak rpm) (we also get 0-60 and top speed which are super useless).

I propose that the two figures that we'd really like are : Horsepower/weight (at peak rpm) and Horsepower/weight (at min during 10-100 run).

Let me explain why :

(Power/weight) is the only way that power ever actually shows up in the equations of dynamics (in a frictionless world). 220 HP in a 2000 pound car is better than 300 HP in a 3000 pound car. So just show me power to weight. Now, in the real world, the equations of dynamics are somewhat more complicated, so let's address that. One issue is air drag. For fighting air, power (ignoring mass) is needed, so for top speed you would prefer a car with more power than just power to weight. However, for braking and turning, weight is more important. So I propose that it roughly evens out and in the end just showing power to weight is fine.

Now, what about this "Horsepower/weight (at min during 10-100 run)" ? Well let's back up a second. The two numbers that we currently get (Power and Torque both at their peak) give us some rough idea of how broad the power band of an engine is, because Power is almost always at peak near the max rpm, and Torque is usually at peak somewhere around the middle, so a higher torque number (power being equal) indicates a broader power band. But good gearing (or bad gearing) can either hide or exagerate that problem. For example a tuned Honda VTEC might have a narrow power band that's all in the 7k - 10k RPM range, but with a "crossed" transmission you might be perfectly happy never dropping out of that rev range. Another car might have a wide power band, but really huge gear steps so that you do get a big power drop on shifts. So what I propose is you run the cars from 10mph-100 , shifting at red line, and measure the *min* horsepower the engine puts out. This will tell you what you really want to know, which is when doing normal upshifts do you drop out of the power band, and how bad is it? eg. what is the lowest power you will experience.

Of all the numbers that we actually get, quarter mile time is probably the best.

4/08/2011

04-08-11 - Friday Filters

So in the last post we made the analysis filters for given synthesis filters that minimize L2 error.

Another common case is if you have a set of discrete data and wish to preserve the original values exactly. That is, you want to make your synthesis interpolate the original points.

(obviously some synthesis filters like sinc and box are inherently interpolating, but others like gauss and cubic are not).

So, the construction proceeds as before, but is actually simpler. Rather than using the hat overlaps we only need the values of the synthesis at the lattice points. That is :


Given synthesis/reconstruction filter S(t)
and discrete points P_i
Find points Q_i such that :

f(t) = Sum_i Q_i * S(t - i)

f(j) = P_j for all integers j

We can write this as a matrix equation :

Sij = S(j - i) = S(i - j)

S * Q = P

note that S is band-diagonal. This is the exact same kind of matrix problem that you get when solving for B-splines.

In general, if you care about the domain boundary effects, you have to construct this matrix and solve it somehow. (matrix inversion is a pretty bad way, there are fast ways to solve sparse band-diagonal matrices).

However, if the domain is large and you don't care too much about the boundary, there's a much simpler way. You just find S^-1 and look at a middle row. As the domain size goes to infinity, all the rows of S^-1 become the same. For finite domain, the first and last rows are weird and different, but the middle rows are about the same.

This middle row of S^-1 is the analysis filter.


A = middle row of S^-1

Q = A <conv> P

Q_i = Sum_j A(i-j) P_j

note A is only defined discretely, not continuously. Now our final output is :

f(t) = Sum_i Q_i * S(t - i)

f(t) = Sum_i Sum_j A(i-j) P_j * S(t - i)

f(t) = Sum_j C(t-j) * P_j

C(t) = Sum_i A(i) * S(t-i)

where C is the combined analysis + synthesis filter. The final output is just a simple filter applied to the original points.

For example, for the "canonical cubic" synthesis , (box convolved thrice), we get :


cubic analysis :
const float c_filter[11] = { -0.00175, 0.00876, -0.03327, 0.12434, -0.46410, 1.73205, -0.46410, 0.12434, -0.03327, 0.00876, -0.00175 };
chart : 

(A is 11 wide because I used an 11x11 matrix; in general it's infinitely wide, but gets smaller as you go out)

The synthesis cubic is piece-wise cubic and defined over four unit intervals from [-2,2]
The combined filter is piece-wise cubic; each unit interval is a linear combo of the 4 parts of synthesis

{ ADDENDUM : BTW this *is* the B-spline cubic; see for example : Neil Dodgson Image resampling page 128-129 ; the coefficients are exactly the same }

So, you could do all this work, or you could just use a filter that looks like "combined" from the start.


gaussian analysis :
const float c_filter[11] = { -0.00001, 0.00008, -0.00084, 0.00950, -0.10679, 1.19614, -0.10679, 0.00950, -0.00084, 0.00008, -0.00001 };
chart : 

mitchell1 analysis :
const float c_filter[11] = { -0.00000, 0.00002, -0.00028, 0.00446, -0.07115, 1.13389, -0.07115, 0.00446, -0.00028, 0.00002, -0.00000 };
chart : 

Now, not surprisingly all of the "combined" filters look very similar, and they all look rather a lot like windowed sincs, because there simply aren't that many ways to make interpolating filters. They have to be = 1.0 at 0, and = 0.0 at all other integer locations.

ADDENDUM : well, I finally read the Nehab/Hoppe paper "Generalized Sampling" , and guess what, it's this. It comes from a 1999 paper by Blu et.al. called "Generalized Interpolation: Higher Quality at no Additional Cost".

The reason they claim it's faster than traditional filtering is that what we have done is to construct a sinc-like interpolating filter with wide support, which I call "combined", which can be separated into a simple compact "synthesis" filter, and a discrete matrix (S^-1). So for the case where you have only a few sets of data and you sample it many times (eg. texture in games), you can obviously implement this quickly by pre-applying the discrete matrix to the data, so you no longer have samples of the signal as your pixels, instead you have amplitudes of synthesis basis functions (that is, you store Q in your texture, not P). Now you can effectively sample the "combined" filter by applying only the "synthesis" filter. So for example Nehab/Hoppe suggests that you can do this fast on the GPU by using a GPU implementation of the cubic basis reconstruction.

Both of the papers are somewhat misleading in their claims of "higher quality than traditional filtering". You can of course compute a traditional linear filter that gives the same result, since their techniques are still just linear. They are higher quality *for the same complexity of filter* , and only if you have pre-done the S^-1 multiplication. The speed advantage is in being able to precompute the S^-1 multiplication.

4/07/2011

04-07-11 - Help me I can't stop

Some common synthesis filters and their corresponding analysis filter :
BEGIN review

if you want to approximate f(t) by

Sum_i P_i * synthesis(t-i)

you can find the P's by :

P_i = Convolve{ f(t) analysis(t-i) }

END review
a note on the method :

BEGIN note

the H overlap matrix was computed on a 9x9 domain
because my matrix inverse is ungodly slow

for sanity checking I compared to 11x11 a few times and found the difference to be small
for example :

linear filter invH 9x9 :

0.0038,-0.0189,0.0717,-0.2679,1.0000,-0.2679,0.0717,-0.0189,0.0038

linear filter invH 11x11 :

-0.0010,0.0051,-0.0192,0.0718,-0.2679,1.0000,-0.2679,0.0718,-0.0192,0.0051,-0.0010

(ideally I would use a very large matrix and then look at the middle row, because that is
where the boundary has the least effect)

for real use in a high precision environment you would have to take the domain boundary more seriously

also, I did something stupid and printed out the invH rows with the maximum value scaled to 1.0 ; the
unscaled values for linear are :

-0.0018,0.0088,-0.0333,0.1243,-0.4641,1.7320,-0.4641,0.1243,-0.0333,0.0088,-0.0018

but, I'm not gonna redo the output to fix that, so the numbers below have 1.0 in the middle.

END note

For box synthesis, analysis is box.

linear : invH middle row : = 
0.0038,-0.0189,0.0717,-0.2679,1.0000,-0.2679,0.0717,-0.0189,0.0038,
-0.0010,0.0051,-0.0192,0.0718,-0.2679,1.0000,-0.2679,0.0718,-0.0192,0.0051,-0.0010,

(note: we've see this linear analysis filter before when we talked about how to find the optimum image such that when it's bilinear interpolated you match some original as well as possible)

quadratic invH middle row : = 
0.0213,-0.0800,0.1989,-0.4640,1.0000,-0.4640,0.1989,-0.0800,0.0213,

gauss-unity : invH middle row : = 
0.0134,-0.0545,0.1547,-0.4162,1.0000,-0.4162,0.1547,-0.0545,0.0134,

note : "unity" means no window, but actually it's a rectangular window with width 5 ; the gaussian has sdev of 0.5

sinc half-width of 20 :

sinc-unity : invH middle row : = 
0.0129,-0.0123,0.0118,-0.0115,1.0000,-0.0115,0.0118,-0.0123,0.0129,

note : obviously sinc is its own analysis ; however, this falls apart very quickly when you window the sinc at all, or even just cut it off when the values get tiny :

sinc half-width of 8 :

sinc-unity : invH middle row : = 
0.0935,-0.0467,0.0380,-0.0354,1.0000,-0.0354,0.0380,-0.0467,0.0935,

lanczos6 : invH middle row : = 
0.0122,-0.0481,0.1016,-0.1408,1.0000,-0.1408,0.1016,-0.0481,0.0122,

lanczos4 : invH middle row : = 
0.0050,-0.0215,0.0738,-0.1735,1.0000,-0.1735,0.0738,-0.0215,0.0050,

Oh, also, note to self :

If you print URLs to the VC debug window they are clickable with ctrl-shift, and it actually uses a nice simple internal web viewer, it doesn't launch IE or any such shite. Nice way to view my charts during testing.

ADDENDUM : Deja vu. Rather than doing big matrix inversions, you can get these same results using Fourier transforms and Fourier convolution theorem.

cbloom rants 06-16-09 - Inverse Box Sampling
cbloom rants 06-17-09 - Inverse Box Sampling - Part 1.5
cbloom rants 06-17-09 - Inverse Box Sampling - Part 2

old rants