5/29/2010

05-29-10 - x64 so far

x64 linkage that's been useful so far :

__asm__ cmpxchg8b / cmpxchg16b - comp.programming.threads Google Groups
_InterlockedCompareExchange Intrinsic Functions
x86-64 Tour of Intel Manuals
x64 Starting Out in 64-Bit Windows Systems with Visual C++
Writing 64-bit programs
Windows Data Alignment on IPF, x86, and x86-64
Use of __m128i as two 64 bits integers
Tricks for Porting Applications to 64-Bit Windows on AMD64
The history of calling conventions, part 5 amd64 - The Old New Thing - Site Home - MSDN Blogs
Snippets lifo.h
Predefined Macros (C/C++)
Physical Address Extension - PAE Memory and Windows
nolowmem (Windows Driver Kit)
New Intrinsic Support in Visual Studio 2008 - Visual C++ Team Blog - Site Home - MSDN Blogs
Moving to Windows Vista x64
Moving to Windows Vista x64 - CodeProject
Mark Williams Blog jmp'ing around Win64 with ml64.exe and Assembly Language
Kernel Exports Added for Version 6.0
Is there a portable equivalent to DebugBreak() / __debugbreak - Stack Overflow
How to Log Stack Frames with Windows x64 - Stack Overflow
BCDEdit Command-Line Options
Available Switch Options for Windows NT Boot.ini File
AMD64 Subpage
AMD64 (EM64T) architecture - CodeProject
20 issues of porting C++ code on the 64-bit platform

One unexpected annoyance has been that a lot of the Win32 function signatures have changed. For example LRESULT is now pointer-sized (LONG_PTR), not a LONG. This is a particular problem because Win32 has always made heavy use of cramming the wrong type into various places, eg. for GetWindowLong and stuffing pointers in LPARAM's and all that kind of shit. So you wind up having tons of C-style casts when you write Windows code. I have made good use of these guys :



// same_size_value_cast does a normal value cast,
//  checked to be between same-size types
template < typename t_to, typename t_fm >
t_to same_size_value_cast( const t_fm & from )
{
    COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );
    // just a value cast (by value - a cast makes a temporary,
    //  so returning a reference here would dangle) :
    return (t_to) from;
}

// same_size_bit_cast_p casts the bits in memory
//  eg. it's not a value cast
template < typename t_to, typename t_fm >
t_to & same_size_bit_cast_p( t_fm & from )
{
    COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );
    // cast through char * since char * is allowed to alias anything :
    char * ptr = (char *) &from;
    return *( (t_to *) ptr );
}

// same_size_bit_cast_u casts the bits in memory
//  eg. it's not a value cast
// cast with union is better for gcc / Xenon :
template < typename t_to, typename t_fm >
t_to same_size_bit_cast_u( const t_fm & from )
{
    COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );
    union _bit_cast_union
    {
        t_fm fm;
        t_to to;        
    };
    _bit_cast_union converter = { from };
    // return by value - the union is a local, so a reference would dangle :
    return converter.to;
}

// check_value_cast just does a static_cast and makes sure you didn't wreck the value
template < typename t_to, typename t_fm >
t_to check_value_cast( const t_fm & from )
{
    t_to to = static_cast<t_to>(from);
    ASSERT( static_cast<t_fm>(to) == from );
    return to;
}

inline int ptr_diff_32( ptrdiff_t diff )
{
    return check_value_cast<int>(diff);
}

BTW this all has made me realize that the recent x86-32 monotony on PCs has been a delightfully stable period for development. I had almost forgotten that it used to always be like this. Now to do simple shit in my code, I have to detect whether it's x86 or x64; if it's x64, do I have an MSC version that has the intrinsics I need? If not I have to write a got damn MASM file. Oh, and I often have to check for Vista vs. XP to tell if I have various kernel calls. For example :


#if _MSC_VER > 1400

// have intrinsic
_InterlockedExchange64()

#elif _X86_NOT_X64_

// I can use inline asm
__asm { cmpxchg8b ... }

#elif OS_IS_VISTA_NO_XP

// kernel library call available
InterlockedExchange64()

#else

// X64 , not Vista (or want to be XP compatible) , older compiler without intrinsic,
//  FUCK !

#error just use a newer MSVC version for X64 because I don't want to fucking write MASM routines

#endif

Even ignoring the pain of the last FUCK branch which requires making a .ASM file, the fact that I had to do a bunch of version/target checks to get the right code for the other paths is a new and evil pain.

Oh, while I'm ranting, fucking MSDN is now showing all the VS 2010 documentation by default, and they don't fucking tell you what version things became available in.

This actually reminds me of the bad old days when I got started, when processors and instruction sets were changing rapidly. You actually had to make different executables for 386/486 and then Pentium, and then PPro/P3/etc (not to mention the AMD chips that had their own special shiznit). Once we got to the PPro it really settled down and we had a wonderful monotony of well developed x86 on out-of-order machines that continued up to the new Core/Nehalem chips (only broken by the anomalous blip of Itanium that we all ignored as it went down in flames like the Hindenburg). Obviously we've had consoles and Mac and other platforms to deal with, but those were for real products that need portability; I could write my own Wintel code for home and not think about any of that. Well, Wintel is monoflavor no more.

The period of CISC and chips with fancy register renaming and so-on was pretty fucking awesome for software developers, because you see the same interface for all those chips, and then behind the scenes they do magic mumbo jumbo to turn your instructions into fucking gene sequences that multiply and create bacterium that actually execute the instructions, but it doesn't matter because the architecture interface still just looks the same to the software developer.

5/27/2010

05-27-10 - Weird Compiler Error

Blurg just fought one of the weirder problems I've ever seen.

Here's the simple test case I cooked up :


void fuck()
{
#ifdef RR_ASSERT
#pragma RR_PRAGMA_MESSAGE("yes")
#else
#pragma RR_PRAGMA_MESSAGE("no")
#endif

    RR_ASSERT(true);
}

And here is the compiler error :

1>.\rrSurfaceSTBI.cpp(43) : message: yes
1>.\rrSurfaceSTBI.cpp(48) : error C3861: 'RR_ASSERT': identifier not found

Eh !? Serious WTF !? I know RR_ASSERT is defined, and then it says it's not found !? WTF !?

Well a few lines above that is the key. There was this :


#undef  assert
#define assert  RR_ASSERT

which seems like it couldn't possibly cause this, right? It's just aliasing the standard C assert() to mine. Not possibly related, right? But when I commented out that bit the problem went away. So of course my first thought is clean-rebuild all, did I have precompiled headers on by mistake? etc. I assume the compiler has gone mad.

Well, it turns out that somewhere way back in RR_ASSERT I was in a branch that caused me to have this definition for RR_ASSERT :


#define RR_ASSERT(exp)  assert(exp)

This creates a strange state for the preprocessor. RR_ASSERT is now a recursive macro. When you actually try to use it in code, the preprocessor expands RR_ASSERT to assert, assert back to RR_ASSERT, and then stops (a macro is never re-expanded inside its own expansion), so the literal token RR_ASSERT is left in the code for the compiler, which has no such identifier. But the name of the preprocessor symbol is still defined, so my ifdef check still saw RR_ASSERT existing. Evil.

BTW the thing that kicked this off is that fucking VC x64 doesn't support inline assembly. ARGH YOU COCK ASS. Because of that we had long ago written something like


#ifdef _X86
#define RR_ASSERT_BREAK()  __asm int 3
#else
#define RR_ASSERT_BREAK()  assert(0)
#endif

which is what caused the difficulty.

05-27-10 - Loop Branch Inversion

A major optimization paradigm I'm really missing from C++ is something I will call "loop branch inversion". The problem is for code sharing and cleanliness you often wind up with cases where you have a lot of logic in some outer loops that find all the things you should work on, and then in the inner loop you have to do a conditional to pick what operation to do. eg :

LoopAndDoWork(query,worktype)
{
    Make bounding area
    Do Kd-tree descent .. 
    loop ( tree nodes )
    {
        bounding intersection, etc.
        found an object
        DoPerObjectWork(object);
    }
}

The problem is that DoPerObjectWork then is some conditional, maybe something like :

DoPerObjectWork(object)
{
    switch(workType)
    {
    ...
    }
}

or even worse - it's a function pointer that you call back.

Instead you would like the switch on workType to be on the outside. workType is a constant all the way through the code, so I could just propagate that branch up through the loops, but there's no way to express that neatly in C.

The only real option is with templates. One way is to make DoPerObjectWork a functor and make LoopAndDoWork a template on it. The other way is to make an outer dispatcher to constants. That is, make workType a template parameter instead of an integer :


template < int workType >
void t_LoopAndDoWork(query)
{
    ...
}

and then provide a dispatcher which does the branch outside :

LoopAndDoWork(query,worktype)
{
    switch(workType)
    {
    case 0 : t_LoopAndDoWork<0>(query); break;
    case 1 : t_LoopAndDoWork<1>(query); break;
    ...
    }
}

this is an okay solution, but it means you have to reproduce the branch on workType in the outer dispatcher and the inner loop. This is not a speed penalty because the inner branch is on a constant which goes away; it's just ugly for code maintenance purposes because the two have to be kept in sync and can be far apart in the code.

This is a general pattern - use templates to turn a variable parameter into a constant and then use an outer dispatcher to turn a variable into the right template call. But it's ugly.

BTW when doing this kind of thing you often wind up with loops on constants. The compiler often can't figure out that a loop on a constant can be unrolled. It's better to rearrange the loop on a constant into branches. For example I'm often doing all this on pixels where the pixel can have between 1 and 4 channels. Instead of this :


for(int c=0;c<channels;c++)
{
    DoStuff(c);
}

where channels is a constant (template parameter), it's better to do :

DoStuff(0);
if ( channels > 1 ) DoStuff(1);
if ( channels > 2 ) DoStuff(2);
if ( channels > 3 ) DoStuff(3);

because those ifs reliably go away.

5/26/2010

05-26-10 - Windows Page Cache

The correct way to cache things is through Windows' page cache. The advantage from doing this over using your own custom cache code is :

1. Automatically resizes based on amount of memory needed by other apps. eg. other apps can steal memory from your cache to run.

2. Automatically gives pages away to other apps or to file IO or whatever if they are touching their cache pages more often.

3. Automatically keeps the cache in memory between runs of your app (if nothing else clears it out). This is pretty immense.

Because of #3, your custom caching solution might slightly beat using the Windows cache on the first run, but on the second run it will stomp all over you.

To do this nicely, generally the cool thing to do is make a unique file name that is the key to the data you want to cache. Write the data to a file, then memory map it as read only to fetch it from the cache. It will now be managed by the Windows page cache and the memory map will just hand you a page that's already in memory if it's still in cache.

The only thing that's not completely awesome about this is the reliance on the file system. It would be nice if you could do this without ever going to the file system. eg. if the page is not in cache, I'd like Windows to call my function to fill that page rather than getting it from disk, but so far as I know this is not possible in any easy way.

For example : say you have a bunch of compressed images as JPEG or whatever. You want to keep uncompressed caches of them in memory. The right way is through the Windows page cache.

05-26-10 - Windows 7 Snap

My beloved "AllSnap" doesn't work on Windows 7 / x64. I can't find a replacement because fucking Windows has a feature called "Snap" now, so you can't fucking google for it. (also searching for "Windows 7" stuff in general is a real pain because solutions and apps for the different variants of windows don't always use the full name of the OS they are for in their page, so it's hard to search for; fucking operating systems really need unique code names that people can use to make it possible to search for them; "Windows" is awful awful in this regard).

I contacted the developer of AllSnap to see if he would give me the code so I could fix it, but he is ignoring me. I can tell from debugging apps when AllSnap is installed that it seems to work by injecting a DLL. This is similar to how I hacked the poker sites for GoldBullion, so I think I could probably reproduce that. But I dunno if Win7/x64 has changed anything about function injection and the whole DLL function pointer remap method.

BTW/FYI the standard Windows function injection method goes like this : Make a DLL that has some event handler. Run a little app that causes that event to trip inside the app you want to hijack. Your DLL is now invoked in that app's process to handle that event. Now you are running in that process so you can do anything you want - in particular you can find the function table to any of their DLL's, such as user32.dll, and stuff your own function pointer into that memory. Now when the app makes normal function calls, they go through your DLL.

5/25/2010

05-25-10 - Thread Insurance

I just multi-threaded my video test app recently, and it was reasonably easy, but I had a few nagging bugs because of hidden ways they were touching shared memory without protection deep inside functions. Okay, so I found them and fixed them, but I'm left with a problem - any time I touch one of those deep functions, I could screw up the threading without realizing it. And I might not get any indication of what I did for weeks if it's a rare race.

What I would like is a way to make this more robust. I have very strong threading primitives, I want a way to make sure that I use them! In particular, I want to be able to mark certain structs as only touchable when a critsec is locked or whatever.

I think that a lot of this could be done with Win32 memory page protections. So far as I know there's no way to associate protections per-thread, (eg. to make a page read/write for thread A but no-access for thread B). If I could do that it would be super sweet.

One idea is to make the page no access and then install my own exception handler that checks what thread it is, but that might be too much overhead (and not sure if that would fail for other reasons).

The main usage is not for protected crit-sec'ed structs, that is really the easiest case to maintain because it's very obvious right there in the code that you need to take the critsec to touch the variables. The hard case to maintain is the ad hoc "I know this is safe to touch without protection". In particular I have a lot of code that runs like this :


Phase 1 : I know no threads are touching shared data item A
main thread does lots of writing in A

Phase 2 : fire up threads.  They only read from A and do so without protection.  They each write to unique areas B,C,D.

Phase 3 : spin down threads.  Now main thread can write A and read B,C,D.

So what I would really like to do is :

Phase 1 : I know no threads are touching shared data item A
main thread does lots of writing in A

-- set A memory to be read-only !
-- set B,C,D memory to be read/write only for their own thread

Phase 2 : fire up threads.  They only read from A and do so without protection.  They each write to unique areas B,C,D.

-- make A,B,C,D read/write only for main thread !

Phase 3 : spin down threads.  Now main thread can write A and read B,C,D.

The thing that this saves me from is when I'm tinkering in DoComplicatedStuff() which is some function called deep inside Phase 2 somewhere and I change it to no longer follow the memory access rule that it is supposed to be following. This is just my hate for having rules for code correctness that are not enforced by the compiler or at least by run-time asserts.

5/21/2010

05-21-10 - Video coding beyond H265

In the end movec-residual coding is inherently limited and inefficient. Let's review the big advantage of it and the big problem.

The advantage is that the encoder can reasonably easily consider {movec,residual} coding choices jointly. This is a huge advantage over just picking what's my best movec, okay now code the residual. Because the movec affects the residual, you cannot make a good R/D decision if you do them separately. By using block movecs, it reduces the number of options that need to be considered to a small enough set that encoders can practically consider a few important choices and make a smart R/D decision. This is what is behind all current good video encoders.

The disadvantage of movec-residual coding is that they are redundant and connected in a complex and difficult to handle way. We send them independently, but really they have cross-information about each other, and that is impossible to use in the standard framework.

There are obviously edges and shapes in the image which occur in both the movecs and the residuals. eg. a moving object will have a boundary, and really this edge should be used for both the movec and residual. In the current schemes we send a movec for the block, and then the residuals per pixel, so we now have finer grain information in the residual that should have been used to give us finer movecs per pixel, but it's too late now.

Let's back up to fundamentals. Assume for the moment that we are still working on an 8x8 block. We want to send that block in the current frame. We have previous frames and previous blocks within the current frame to help us. There are (256^3)^64 possible values for this block. If we are doing lossy coding, then not all possible values for the block can be sent. I won't get into details of lossiness, so just say there are a large number of possible values for the pixels of the block; we want to code an index to one of those values.

Each index should be sent with a different bit length based on its probability. Already we see a flaw with {movec-residual} coding - there are tons of {movec,residual} pairs that specify the same index. Of course in a flat area lots of movecs might point to the same pixels, but even if that is eliminated, you could go movec +1, residual +3, or movec +3, residual +1, and both ways get to +4. Redundant encoding = bit waste.

Now, this bit waste might not be critically bad with current simple {movec,residual} schemes - but it is a major encumbrance if we start looking at more sophisticated mocomp options. Say you want to be able to send movecs for shapes, eg. send edges and then send a movec on each side. There are lots of possibilities here - you might just send a movec per pixel (this seems absurdly expensive, but the motion fields are very smooth so should code well from neighbors), or you might send a polygon mesh to specify shapes. This should give you much better motion fields, and then the information in the motion fields can be used to predict the residuals as well. But the problem is there's too much redundancy. You have greatly expanded the number of ways to code the same output pixels.

We could consider more modest steps as well, such as sticking with block mocomp + residual, but expanding what we can do for "mocomp". For example, you could use two motion vectors + an arbitrary linear combination of the source blocks. Or you could do trapezoidal texture-mapping style mocomp. Or mocomp with a vector plus scale + rotation. None of these is very valuable; there are numerous problems : 1. too many ways to encode for the encoder to do thorough R/D analysis of all of them, 2. too much redundancy, 3. still not using the joint information across residual & motion.

In the end the problem is that you are using a 6-d value {velocity,pixel} to specify a 3-d color. What you really want is a 3-d coordinate which is not in pixel space, but rather is a sort of "screw" in motion/pixel space. That is, you want the adjacent coordinates in motion/pixel space to be the ones that are closest together in the 6-d space. So for example RGB {100,0,0} and {0,200,50} might be neighbors in motion/pixel space if they can be reached by small motion adjustments.

Okay this is turning into rambling, but another way of seeing it is like this : for each block, construct a custom basis transform. Don't send a separate movec or anything - the axes of the basis transform select pixels by stepping in movec and also residual.

ADDENDUM : let me try to be more clear by doing a simple example. Say you are trying to code a block of pixels which only has 10 possible values. You want to code with a standard motion then residual method. Say there are only 2 choices for motion. It is foolish to code all 10 possible values for both motion vectors! That is, currently all video coders do something like :


Code motion = [0 or 1]
Code residual = [0,1,2,3,4,5,6,7,8,9]

Or in tree form :

   0 - [0,1,2,3,4,5,6,7,8,9]
*<
   1 - [0,1,2,3,4,5,6,7,8,9]

Clearly this is foolish. Each output pixel block only needs to be reachable under the movec that codes it smallest, so each output value should occur in only one spot on the tree, eg.

   0 - [0,1,2,3,4]
*<
   1 - [5,6,7,8,9]

or something. That is, it's foolish to have two ways to encode the residual to reach a certain target when there was already a cheaper way to reach that target through the movec coding portion. To minimize this deficiency, most current coders like H264 will code blocks by either putting almost all the bits in the movec and very few in the residual, or the other way around (almost none in the movec and most in the residual). The loss occurs mostly when you have many bits in the motion and many in the residual, something like :


   0 - [0,1,2]
   1 - [3,4,5,6]
   2 - [7,8]
   3 - [9]

The other huge fundamental deficiency is that the probability modeling of movecs and residuals is done in a very primitive way based only on "they are usually small" assumptions. In particular, probability modeling of movecs needs to be done not just based on the vector, but on the content of what is pointed at. I mentioned long ago there is a lot of redundancy there when you have lots of movecs pointing at the same thing. Also, the residual coding should be aware of what was pointed to by the movec. For example if the movec pointed at a hard edge, then the residual will likely also have a similar hard edge because it's likely we missed by a little bit, so you could use a custom transform that handles that better. etc.

ADDENDUM 2 : there's something else very subtle going on that I haven't seen discussed much. The normal way of sending {movec,residual} is actually over-complete. Mostly that's bad, too much over-completeness means you are just wasting bits, but actually some amount of over-completeness here is a good thing. In particular for each frame we are sending a little bit of extra side information that is useful for *later* frames. That is, we are sending enough information to decode the current frame to some quality level, plus some extra that is not really worth it for just the current frame, but is worth it because it helps later frames.

The problem is that the amount of extra information we are sending is not well understood. That is, in the current {movec,residual} schemes we are just sending extra information without being in control and making a specific decision. We should be choosing how much extra information to send by evaluating whether it is actually helpful on future frames. Obviously the last frames of the video (or a sequence before a cut) you shouldn't send any extra information.

In the examples above I'm showing how to reduce the overcomplete information down to a minimal set, but sometimes you might not want to do that. As a very coarse example, say the true motion at a given pixel is +3, so {movec=3, residual} gets to final pixel=7, but you can code the same result smaller by using movec=1. Deciding whether to send the true motion or not should be done based on whether it actually helps in the future; more importantly, the code stream could collapse {3,7} and {1,7} so there is no redundant way to code when the difference is not helpful.

This becomes more important of course if you have a more complex motion scheme, like per-pixel motion or trapezoidal motion or whatever.

5/20/2010

05-20-10 - Some quick notes on H265

Since we're talking about VP8 I'd like to take this chance to briefly talk about some of the stuff coming in the future. H265 is being developed now, though it's still a long ways away. Basically at this point people are throwing lots of shit at the wall to see what sticks (and hope they get a patent in). It is interesting to see what kind of stuff we may have in the future. Almost none of it is really a big improvement like "duh we need to have that in our current stuff", it's mostly "do the same thing but use more CPU".

The best source I know of at the moment is H265.net , but you can also find lots of stuff just by searching for video on citeseer. (addendum : FTP to the Dresden April meeting downloads ).

H265 is just another movec + residual coder, with block modes and quadtree-like partitions. I'll write another post about some ideas that are outside of this kind of scheme. Some quick notes on the kind of things we may see :

Super-resolution mocomp. There are some semi-realtime super-resolution filters being developed these days. Super-resolution lets you take a series of frames and create an output that's higher fidelity than any one source frame. In particular, given a few assumptions about the underlying source material, it can reconstruct a good guess of the higher resolution original signal before sampling to the pixel grid. This lets you do finer subpel mocomp. Imagine for example that you have some black and white text that is slowly translating. On any one given frame there will be lots of gray edges due to the antialiased pixel sampling. Even if you perfectly know the subpixel location of that text on the target frame, you have no single reference frame to mocomp from. Instead you create a super-resolution reference frame of the original signal and subpel mocomp from that.

Partitioned block transforms. One of the minor improvements in image coding lately, which is natural to move to video coding, is PBT with more flexible sizes. This means 8x16, 4x8, 4x32, whatever, lots of partition sizes, and having block transforms for that size of partition. This lets the block transform match the data better. Which also leads us to -

Directional transforms and trained transforms. Another big step is not always using an X & Y oriented orthogonal DCT. You can get a big win by doing directional transforms. In particular, you find the directions of edges and construct a transform that has its bases aligned along those edges. This greatly reduces ringing and improves energy compaction. The problem is how do you signal the direction or the transform data? One option is to code the direction as extra side information, but that is probably prohibitive overhead. A better option is to look at the local pixels (you already have decoded neighbors) and run edge detection on them and find the local edge directions and use that to make your transform bases. Even more extreme would be to do a fully custom transform construction from local pixels (and the same neighborhood in the last frame), either using competition (select from a set of transforms based on which one would have done best on those areas) or training (build the KLT for those areas). Custom trained bases are especially useful for "weird" images like Barb. These techniques can also be used for ...

Intra prediction. Like residual transforms, you want directional intra prediction that runs along the edges of your block, and ideally you don't want to send bits to flag that direction, rather figure it out from neighbors & previous frame (at least to condition your probabilities). Aside from finding direction, neighbors could be used to vote for or train fully custom intra predictors. One of the H265 proposals is basically GLICBAWLS applied to intra prediction - that is, train a local linear predictor by doing weighted LSQR on the neighborhood. There are some other equally insane intra prediction proposals - basically any texture synthesis or prediction paper over the last 10 years is fair game for insane H265 intra prediction proposals, so for example you have suggestions like Markov 2x2 block matching intra prediction which builds a context from the local pixel neighborhood and then predicts pixels that have been seen in similar contexts in the image so far.

Unblocking filters ("loop filtering" huh?) are an obvious area for improvement. The biggest area for improvement is deciding when a block edge has been created by the codec and when it is in the source data. This can actually usually be figured out if the unblocking filter has access to not just the pixels, but how they were coded and what they were mocomped from. In particular, it can see whether the code stream was *trying* to send a smooth curve and just couldn't because of quantization, or whether the code stream intentionally didn't send a smooth curve (eg. it could have but chose not to).

Subpel filters. There are a lot of proposals for improved sub-pixel filters. Obviously you can use more taps to get better (sharper) frequency response, and you can add 1/8 pel or finer. The more dramatic proposals are to go to non-separable filters, non-axis aligned filters (eg. oriented filters), and trained/adaptive filters, either with the filter coefficients transmitted per frame or again deduced from the previous frame. The issue is that what you have is just a pixel sampled aliased previous frame; in order to do sub-pel filtering you need to make some assumptions about the underlying image signal; eg. what is the energy in frequencies higher than the sampling limit? Different sub-pel filters correspond to different assumptions about the beyond-nyquist frequency content. As usual orienting filters along edges helps.

Improved entropy coding. So far as I can tell there's nothing too interesting here. Current video coders (H264) use entropy coders from the 1980's (very similar to the Q-coder stuff in JPEG-ari), and the proposals are to bring the entropy coding into the 1990's, on the level of ECECOW or EZDCT.

5/19/2010

05-19-10 - Some quick notes on VP8

The VP8 release is exciting for what it might be in two years.

If it in fact becomes a clean open-source video standard with no major patent encumbrances, it might be well integrated in Firefox, Windows Media, etc. etc. - eg. we might actually have a video format that actually just WORKS! I don't even care if the quality/size is really competitive. How sweet would it be if there was a format that I knew I could download and it would just play back correctly and not give me any headaches. Right now that does not exist at all. (it's a sad fact that animated GIF is probably the most portable video format of the moment).

Now, you might well ask - why VP8 ? To that I have no good answer. VP8 seems like a messy cock-assed standard which has nothing in particular going for it. The entropy encoder in particular (much like H264) seems badly designed and inefficient. The basics are completely vanilla, in that it is block based, block modes, movecs, transforms, residual coding. In that sense it is just like MPEG1 or H265. That is a perfectly fine thing to do, and in fact it's what I've wound up doing, but you could pull a video standard like that out of your ass in about five minutes, there's no need to license code for that. If in fact VP8 does dodge all the existing patents then that would be a reason that it has value.

The VP8 code stream is probably pretty weak (I really don't know enough of the details to say for sure). However, what I have learned of late is that there is massive room for the encoder to make good output video even through a weak code stream. In fact I think a very good encoder could make good output from an MPEG2 level of code stream. Monty at Xiph has a nice page about work on Theora. There's nothing really cutting edge in there but it's nicely written and it's a good demonstration of the improvement you can get on a fixed standard code stream just with encoder improvements (and really their encoder is only up to "good but still basic" and not really into the realm of wicked-aggressive).

The only question we need to ask about the VP8 code stream is : is it flexible enough that it's possible to write a good encoder for it over the next few years? And it seems the answer is yes. (contrast this to VP3/Theora which has a fundamentally broken code stream which has made it very hard to write a good encoder).

ADDENDUM : this post by Greg Maxwell is pretty right on.

ADDENDUM 2 : Something major that's been missing from the web discussions and from the literature about video for a long time is the separation of code stream from encoder. The code stream basically gives the encoder a language and framework to work in. The things that Jason / Dark Shikari thinks are so great about x264 are almost entirely encoder-side things that could apply to almost any code stream (eg. "psy rdo" , "AQ", "mbtree", etc.). The literature doesn't discuss this much because they are trapped in the pit of PSNR comparisons, in which encoder side work is not that interesting. Encoder work for PSNR is not interesting because we generally know directly how to optimize for MSE/SSD/L2 error - via very simple methods like flat quantizers and DCT-space trellis quant, etc. What's more interesting is perceptual quality optimization in the encoder. In order to achieve good perceptual optimization, what you need is a good way to measure perceptual error (which we don't have), and the ability to try things in the code stream and see if they improve perceptual error (hard due to non-local effects), and a code stream that is flexible enough for the encoder to make choices that create different kinds of errors in the output. For example adding more block modes to your video coder with different types of coding is usually/often bad in a PSNR sense because all they do is create redundancy and take away code space from the normal modes, but it can be very good in a perceptual sense because it gives the encoder more choice.
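(For reference, the MSE/PSNR that the literature is stuck optimizing for is just this trivial computation - which is part of why pure PSNR encoder work is uninteresting :)

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// mean squared error between two 8-bit buffers
double mse(const uint8_t * a, const uint8_t * b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
    {
        double d = (double)a[i] - (double)b[i];
        sum += d * d;
    }
    return sum / (double)n;
}

// PSNR in dB for 8-bit data ; assumes m > 0
double psnr(double m)
{
    return 10.0 * log10(255.0 * 255.0 / m);
}
```

Contrast that with perceptual error, for which no formula this simple exists.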

ADDENDUM 3 : Case in point , I finally have noticed some x264 encoded videos showing up on the torrent sites. Well, about 90% of them don't play back on my media PC right. There's some glitching problem, or the audio & video get out of sync, or the framerate is off a tiny bit, or some shit and it's fucking annoying.

ADDENDUM 4 : I should be more clear - the most exciting thing about VP8 is that it (hopefully) provides an open patent-free standard that can then be played with and discussed openly by the development community. Hopefully encoders and decoders will also be open source and we will be able to talk about the techniques that go into them, and a whole new

5/13/2010

05-13-10 - P4 with NiftyPerforce and no P4SCC

I'm trying out using P4 in MSDev with NiftyPerforce and no P4SCC.

What this means is VC thinks you have no SCC connection at all, your files are just on your disk. You need to change the default NiftyPerforce settings so that it checks out files for you when you edit/save etc.

Advantages of NiftyPerforce without P4SCC :

1. Much faster startup / project load, because it doesn't go and check the status of everything in the project with P4.

2. No clusterfuck when you start unconnected. This is one of the worst problems with P4SCC, for example if you want to work on some work projects but can't VPN for some reason, P4SCC will have a total shit fit about working disconnected. With the NiftyPerforce setup you just attrib your files and go on with your business.

3. No difficulties with changing binding/etc. This is another major disaster with P4SCC. It's rare, but if you change the P4 location of a project or change your mappings or if you already have some files added to P4 but not the project, all these things give MSDev a complete shit-fit. That all goes away.

Disadvantages of NiftyPerforce without P4SCC :

1. The first few keystrokes are lost. When you try to edit a checked-in file, you can just start typing and Nifty will go check it out, but until the checkout is done your keystrokes go to never-never land. Mild suckitude. Alternatively you could let MSDev pop up the dialog for "do you want to edit this read only file" which would make you more aware of what's going on but doesn't actually fix the issue.

2. No check marks and locks in project browser to let you know what's checked in / checked out. This is not a huge big deal, but it is a nice sanity check to make sure things are working the way they should be. Instead you have to keep an eye on your P4Win window which is a mild productivity hit.

One note about making the changeover : for existing projects that have P4SCC bindings, if you load them up in VC and tell VC to remove the binding, it also will be "helpful" and go attrib all your files to make them writeable (it also will be unhelpful and not check out your projects to make the change to not have them bound). Then NiftyPerforce won't work because your files are already writeable. The easiest way to do this right is to just open your vcproj's and sln's in a text editor and rip out all the binding bits manually.
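For reference, the binding bits look something like this (the exact names and values here are just examples - they'll vary with your depot layout) - in the .sln it's the whole GlobalSection(SourceCodeControl) block, and in the .vcproj it's the Scc* attributes on the VisualStudioProject element :

```
in the .sln , delete this whole section :

GlobalSection(SourceCodeControl) = preSolution
	SccNumberOfProjects = 1
	SccProjectName0 = Perforce\u0020Project
	SccLocalPath0 = .
	SccProvider0 = MSSCCI:Perforce\u0020SCM
EndGlobalSection

in the .vcproj , delete these attributes :

<VisualStudioProject
	...
	SccProjectName="Perforce Project"
	SccLocalPath=".."
	SccProvider="MSSCCI:Perforce SCM"
	>
```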

I'm not sure yet whether the pros/cons are worth it. P4SCC actually is pretty nice once it's set up, though the ass-pain it gives when trying to make it do something it doesn't want to do (like source control something that's out of the binding root) is pretty severe.

ADDENDUM :

I found the real pro & con of each way.

Pro P4SCC : You can just start editing files in VC and not worry about it. It auto-checks out files from P4 and you don't lose key presses. The most important case here is that it correctly handles files that you have not got the latest revision of - it will pop up "edit current or sync first" in that case. The best way to use Nifty seems to be Jim's suggestion - put checkout on Save, do not checkout on Edit, and make files read-only editable in memory. That works great if you are a single dev but is not super awesome in an actual shared environment with heavy contention.

Pro NiftyP4 : When you're working from home over an unreliable VPN, P4SCC is just unworkable. If you lose connection it basically hangs MSDev. This is so bad that it pretty much completely dooms P4SCC. ARG actually I take that back a bit, NiftyP4 also hangs MSDev when you lose connection, though it's not nearly as bad.

old rants