cbloom rants: 05/2010

5/29/2010

05-29-10 - Some more x64

Okay, MASM/MSDev support for x64 is a bit fucked. MSDev has built-in support for .ASM files starting in VC 2005, which does everything for you : it sets up the custom build rule, etc. The problem is that it is hard-coded to ML.EXE, not ML64. Apparently they have fixed this for VC 2010, but it is basically impossible to back-fix. (In VC 2008 the custom build rule for .ASM is in an XML file, so you can fix it yourself.)

The workaround goes like this :

Go to "c:\program files (x86)\microsoft visual studio 8\vc\bin". Find each occurrence of ML64.exe and copy it to ML.exe . Now you can add .ASM files to your project. Go to the Win32 platform config and exclude them from build in Win32.

You now have .ASM files for ML64. For x86/32 - just use inline assembly. For x64, you extern from your ASM file.

Calling x64 ASM is actually very easy, even easier than x86 : there are more volatile registers, and the convention is that the caller has to do all the saving of them. All of this means that you as a little writer of ASM helper routines can get away with doing very little. Usually your args are right there in {rcx,rdx,r8,r9} , and then you can use {rax,r10,r11} as temp space, so you don't even have to bother with saving space on the stack or any of that. See list of volatile registers

BTW the best docs are just the full AMD64 manuals .

For example here's a full working .ASM file :


public my_cmpxchg64

.CODE

align 8
my_cmpxchg64 PROC 

  mov rax, [rdx]
  lock cmpxchg [rcx], r8
  jne my_cmpxchg64_fail
  mov rax, 1
  ret

align 8
my_cmpxchg64_fail:
  mov [rdx], rax
  mov rax, 0
  ret
align 8
my_cmpxchg64 ENDP

END

And how to get to it from C :


// a declaration inside extern "C" may not also carry a storage class, so only one extern :
extern "C" int my_cmpxchg64( uint64 * val, uint64 * oldV, const uint64 newV );

BTW one of the great things about posting things on the net is just that it makes me check myself. That cmpxchg64 has a stupid branch, I think this version is better :


align 8
my_cmpxchg64 PROC
  mov rax, [rdx]
  lock cmpxchg [rcx], r8
  sete cl
  mov [rdx], rax
  movzx rax,cl
  ret
my_cmpxchg64 ENDP

and you can probably do better. (for example it's probably better to just define your function as returning unsigned char and then you can avoid the movzx and let the caller worry about that)
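For checking the contract of my_cmpxchg64 without an assembler in the loop, here is a minimal sketch of the same semantics in portable C++. It assumes a C++11 compiler with std::atomic, and the name my_cmpxchg64_portable is made up for illustration; it is a stand-in for comparison, not what the post actually builds.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical portable stand-in for my_cmpxchg64.
// If *val == *oldV : stores newV into *val and returns 1.
// Otherwise : leaves *val alone, writes the value actually seen into *oldV, returns 0.
inline int my_cmpxchg64_portable( std::atomic<std::uint64_t> * val,
                                  std::uint64_t * oldV,
                                  const std::uint64_t newV )
{
    // compare_exchange_strong updates *oldV on failure, like the mov [rdx], rax above
    return val->compare_exchange_strong( *oldV, newV ) ? 1 : 0;
}
```

A failed call hands back the current value in oldV, so the usual pattern is to retry in a loop with the refreshed oldV.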

ADDENDUM : I just found a new evil secret way I'm fucked. A union with size mismatches appears not to even produce a warning of any kind. So for example you can silently have this in your code :


union Fucked
{
    struct
    {
        void * p1;
        int t;
    } s;
    uint64  i;
};

build in 64 bit and it's just hose city. BTW I think using unions as a datatype in general is probably bad practice. If you need to be doing that for some fucked reason, you should just store the member as raw bits, and then same_size_bit_cast() to convert it to the various types. In other words, the dual identity of that memory should be a part of the imperative code, not a part of the data declaration.
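A minimal sketch of the "raw bits plus an explicit cast" idea, assuming a C++11 compiler. The name same_size_bit_cast_memcpy is hypothetical, and it swaps in memcpy rather than any of the cast styles used elsewhere in this post. The point is that the size check becomes a compile error instead of silent hose city.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper : reinterpret the bits of 'from' as a t_to.
// Assumes both types are trivially copyable and t_to is default constructible.
template < typename t_to, typename t_fm >
t_to same_size_bit_cast_memcpy( const t_fm & from )
{
    // a size mismatch (like the union above on x64) fails to compile here
    static_assert( sizeof(t_to) == sizeof(t_fm), "bit cast between different sizes" );
    t_to to;
    std::memcpy( &to, &from, sizeof(t_to) );
    return to;
}
```

With this, casting the { void *, int } struct above to a 64-bit integer compiles on 32 bit and refuses to compile on 64 bit, which is the warning the union never gives you.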

05-29-10 - Lock Free in x64

I mentioned long ago in the low level threading articles that some of the algorithms are a bit problematic with 64 bit pointers because we don't have large enough atomic operations.

The basic problem is that for many of the lock-free algorithms we need to be able to do a DCAS , that is a CAS of two pointer-sized values, or a pointer and a counter. When our pointer was 32 bits, we could use a 64 bit CAS to implement DCAS. If our pointer is 64 bits then we need a 128 bit CAS to implement DCAS the same way. There are various solutions to this :

1. Use 128 bit CAS. x64 has cmpxchg16b now which is exactly what you need. This is obviously simple and nice. There are a few minor problems :

1.A. There are no other 128 bit atomics, eg. Exchange and Add and such are missing. These can be implemented in terms of loops of CAS, but that is a very minor suckitude.

1.B. Early AMD64 chips do not have cmpxchg16b. You have to check for its presence with a CPUID call. If it doesn't exist you are seriously fucked. Fortunately these chips are pretty rare, so you can just use a really evil fallback to keep working on them : either disable threading completely on them, or simply run the 32 bit version of your app. The easiest way to do that is to have your installer check the CPUID flag and install the 32 bit x86 version of your app instead of the 64 bit version.

1.C. All your lock-free nodes become 16 bytes instead of 8 bytes. This does things like make your minimum alloc size 16 bytes instead of 8 bytes. This is part of the general bloating of 64 bit structs and mildly sucks. (BTW you can see this in winnt.h as MEMORY_ALLOCATION_ALIGNMENT is 16 on Win64 and 8 on Win32).

1.D. _InterlockedCompareExchange128 only exists on newer versions of MSVC so you have to write it yourself in ASM for older versions. Urg.
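The "loops of CAS" from 1.A look roughly like this. This is a sketch on std::atomic (C++11 assumed) with 64 bit values standing in for the 128 bit type; the function names are made up. The loop shape would be the same with a cmpxchg16b-based CAS underneath.

```cpp
#include <atomic>
#include <cstdint>

// Exchange built from CAS : keep trying to swap against the value we last saw.
template < typename T >
T exchange_via_cas( std::atomic<T> & target, T desired )
{
    T seen = target.load();
    while ( ! target.compare_exchange_weak( seen, desired ) )
    {
        // on failure 'seen' was refreshed with the current value ; just retry
    }
    return seen;
}

// Add built from CAS : recompute seen + delta each time the CAS fails.
template < typename T >
T fetch_add_via_cas( std::atomic<T> & target, T delta )
{
    T seen = target.load();
    while ( ! target.compare_exchange_weak( seen, static_cast<T>( seen + delta ) ) )
    {
    }
    return seen;
}
```

Both return the previous value, matching the usual Exchange / ExchangeAdd convention.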

So #1 is an okay solution, but what are the alternatives ?

2. Pack {Pointer,Count} into 64 bits. This is of course what Windows does for SLIST, so doing this is actually very safe. Currently pointers on Windows are only 44 bits because of this. They will move to 48 and then 52. You can easily store a 52 bit pointer + a 16 bit count in 64 bits (the 52 bit pointer has the bottom four bits zero so you actually have 16 bits to work with). Then you can just keep using 64 bit CAS. This has no disadvantage that I know of other than the fact that twenty years from now you'll have to touch your code again.
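A sketch of the packing arithmetic from option 2, using the numbers above : a pointer of at most 52 bits with its bottom four bits zero, plus a 16 bit count. The bit layout and all the names here are assumptions for illustration; this is not the layout Windows actually uses for SLIST.

```cpp
#include <cassert>
#include <cstdint>

const int           kAlignShift = 4;    // 16 byte aligned nodes : low 4 bits are zero
const int           kCountBits  = 16;
const std::uint64_t kCountMask  = ( std::uint64_t(1) << kCountBits ) - 1;

// addr is the pointer value as an integer
inline std::uint64_t pack_ptr_count( std::uint64_t addr, std::uint16_t count )
{
    assert( ( addr & ( ( std::uint64_t(1) << kAlignShift ) - 1 ) ) == 0 ); // aligned
    assert( ( addr >> 52 ) == 0 );                                         // fits in 52 bits
    // 48 significant pointer bits in the top, 16 count bits in the bottom
    return ( ( addr >> kAlignShift ) << kCountBits ) | count;
}

inline std::uint64_t unpack_ptr( std::uint64_t packed )
{
    return ( packed >> kCountBits ) << kAlignShift;
}

inline std::uint16_t unpack_count( std::uint64_t packed )
{
    return static_cast<std::uint16_t>( packed & kCountMask );
}
```

The packed word is what you hand to the 64 bit CAS; bump the count on every successful swap to keep the ABA protection.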

3. You can implement arbitrary-sized CAS in terms of pointer CAS. The powerful standard paradigm for this type of thing is to use pointers to data instead of data by value, so you are just swapping pointers instead of swapping values. It's very simple, when you want to change a value, you malloc a copy of it and change the copy, and then swap in the pointer to the new version. You CAS on the pointer swap. The "malloc" can just be taking data from a recycled buffer which uses hazard pointers to keep threads from using the same temp item at the same time. This is a somewhat more complex way to do things conceptually, but it is very powerful and general, and for anyone doing really serious lockfree work, a hazard pointer system is a good thing to have. See for example "Practical Lock-Free and Wait-Free LL/SC/VL Implementations Using 64-Bit CAS".
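The core of option 3, stripped way down. This sketch assumes C++11 std::atomic, uses plain new instead of a recycled buffer, and has no hazard pointers at all, so it only shows the copy / modify / CAS-the-pointer shape. BigValue and update_by_pointer_swap are hypothetical names.

```cpp
#include <atomic>

// hypothetical payload that is too big for a native CAS
struct BigValue { long a; long b; long c; };

// Copy the current value, let 'modify' change the copy, then CAS the pointer.
// Returns the old node. WARNING : without hazard pointers (or similar) another
// thread may still be reading the old node, and 'seen' itself could be freed
// under us ; a real version must protect both.
template < typename t_modify >
BigValue * update_by_pointer_swap( std::atomic<BigValue *> & slot, t_modify modify )
{
    BigValue * fresh = new BigValue;
    BigValue * seen = slot.load();
    for (;;)
    {
        *fresh = *seen;     // work on a private copy
        modify( *fresh );
        if ( slot.compare_exchange_weak( seen, fresh ) )
            return seen;    // caller retires the old version
        // CAS failed : 'seen' now holds the newer pointer, redo the copy
    }
}
```

The whole struct changes atomically from the point of view of readers, because they only ever see the old pointer or the new one.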

You could also of course use a hybrid of 2 & 3. You could use a packed 64 bit {pointer,count} until your pointer becomes more than 52 bits, and then switch to a pointer to extra data.

05-29-10 - x64 so far

x64 linkage that's been useful so far :

__asm__ cmpxchg8bcmpxchg16b - comp.programming.threads Google Groups
_InterlockedCompareExchange Intrinsic Functions
x86-64 Tour of Intel Manuals
x64 Starting Out in 64-Bit Windows Systems with Visual C++
Writing 64-bit programs
Windows Data Alignment on IPF, x86, and x86-64
Use of __m128i as two 64 bits integers
Tricks for Porting Applications to 64-Bit Windows on AMD64
The history of calling conventions, part 5 amd64 - The Old New Thing - Site Home - MSDN Blogs
Snippets lifo.h
Predefined Macros (CC++)
Physical Address Extension - PAE Memory and Windows
nolowmem (Windows Driver Kit)
New Intrinsic Support in Visual Studio 2008 - Visual C++ Team Blog - Site Home - MSDN Blogs
Moving to Windows Vista x64
Moving to Windows Vista x64 - CodeProject
Mark Williams Blog jmp'ing around Win64 with ml64.exe and Assembly Language
Kernel Exports Added for Version 6.0
Is there a portable equivalent to DebugBreak()__debugbreak - Stack Overflow
How to Log Stack Frames with Windows x64 - Stack Overflow
BCDEdit Command-Line Options
Available Switch Options for Windows NT Boot.ini File
AMD64 Subpage
AMD64 (EM64T) architecture - CodeProject
20 issues of porting C++ code on the 64-bit platform

One unexpected annoyance has been that a lot of the Win32 function signatures have changed. For example LRESULT is now pointer-sized (a LONG_PTR), not a LONG. This is a particular problem because Win32 has always made heavy use of cramming the wrong type into various places, eg. for GetWindowLong and stuffing pointers in LPARAM's and all that kind of shit. So you wind up having tons of C-style casts when you write Windows code. I have made good use of these guys :



// same_size_value_cast does a value cast between same-sized types
//  eg. it's not a cast of the bits in memory
// returns by value : the cast makes a temporary, so returning a reference would dangle
template < typename t_to, typename t_fm >
t_to same_size_value_cast( t_fm & from )
{
    COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );
    // just value cast :
    return (t_to) from;
}

// same_size_bit_cast casts the bits in memory
//  eg. it's not a value cast
template < typename t_to, typename t_fm >
t_to & same_size_bit_cast_p( t_fm & from )
{
    COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );
    // cast through char * to make aliasing work ?
    char * ptr = (char *) &from;
    return *( (t_to *) ptr );
}

// same_size_bit_cast casts the bits in memory
//  eg. it's not a value cast
// cast with union is better for gcc / Xenon :
// returns by value : a reference to the local union would dangle
template < typename t_to, typename t_fm >
t_to same_size_bit_cast_u( t_fm & from )
{
    COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );
    union _bit_cast_union
    {
        t_fm fm;
        t_to to;        
    };
    _bit_cast_union converter = { from };
    return converter.to;
}

// check_value_cast just does a static_cast and makes sure you didn't wreck the value
template < typename t_to, typename t_fm >
t_to check_value_cast( const t_fm & from )
{
    t_to to = static_cast<t_to>(from);
    ASSERT( static_cast<t_fm>(to) == from );
    return to;
}

inline int ptr_diff_32( ptrdiff_t diff )
{
    return check_value_cast<int>(diff);
}

BTW this all has made me realize that the recent x86-32 monotony on PC's has been a delightful stable period for development. I had almost forgotten that it used to be always like this. Now to do simple shit in my code, I have to detect if it's x86 or x64 , if it is x64, do I have an MSC version that has the intrinsics I need? if not I have to write a got damn MASM file. Oh and I often have to check for Vista vs. XP to tell if I have various kernel calls. For example :


#if _MSC_VER > 1400

// have intrinsic
_InterlockedExchange64()

#elif _X86_NOT_X64_

// I can use inline asm
__asm { cmpxchg8b ... }

#elif OS_IS_VISTA_NO_XP

// kernel library call available
InterlockedExchange64()

#else

// X64 , not Vista (or want to be XP compatible) , older compiler without intrinsic,
//  FUCK !

#error just use a newer MSVC version for X64 because I don't want to fucking write MASM routines

#endif

Even ignoring the pain of the last FUCK branch which requires making a .ASM file, the fact that I had to do a bunch of version/target checks to get the right code for the other paths is a new and evil pain.

Oh, while I'm ranting, fucking MSDN is now showing all the VS 2010 documentation by default, and they don't fucking tell you what version things became available in.

This actually reminds me of the bad old days when I got started, when processors and instruction sets were changing rapidly. You actually had to make different executables for 386/486 and then Pentium, and then PPro/P3/etc (not to mention the AMD chips that had their own special shiznit). Once we got to the PPro it really settled down and we had a wonderful monotony of well developed x86 on out-of-order machines that continued up to the new Core/Nehalem chips (only broken by the anomalous blip of Itanium that we all ignored as it went down in flames like the Hindenburg). Obviously we've had consoles and Mac and other platforms to deal with, but that was for real products that want portability to deal with, I could write my own Wintel code for home and not think about any of that. Well Wintel is monoflavor no more.

The period of CISC and chips with fancy register renaming and so-on was pretty fucking awesome for software developers, because you see the same interface for all those chips, and then behind the scenes they do magic mumbo jumbo to turn your instructions into fucking gene sequences that multiply and create bacteria that actually execute the instructions, but it doesn't matter because the architecture interface still just looks the same to the software developer.

5/27/2010

05-27-10 - Weird Compiler Error

Blurg just fought one of the weirder problems I've ever seen.

Here's the simple test case I cooked up :


void fuck()
{
#ifdef RR_ASSERT
#pragma RR_PRAGMA_MESSAGE("yes")
#else
#pragma RR_PRAGMA_MESSAGE("no")
#endif

    RR_ASSERT(true);
}

And here is the compiler error :


1>.\rrSurfaceSTBI.cpp(43) : message: yes
1>.\rrSurfaceSTBI.cpp(48) : error C3861: 'RR_ASSERT': identifier not found

Eh !? Serious WTF !? I know RR_ASSERT is defined, and then it says it's not found !? WTF !?

Well a few lines above that is the key. There was this :


#undef  assert
#define assert  RR_ASSERT

which seems like it couldn't possibly cause this, right? It's just aliasing the standard C assert() to mine. Not possibly related, right? But when I commented out that bit the problem went away. So of course my first thought is clean-rebuild all, did I have precompiled headers on by mistake? etc. I assume the compiler has gone mad.

Well, it turns out that somewhere way back in RR_ASSERT I was in a branch that caused me to have this definition for RR_ASSERT :


#define RR_ASSERT(exp)  assert(exp)

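This creates a strange state for the preprocessor. RR_ASSERT is now a recursive macro : RR_ASSERT expands to assert, which expands right back to RR_ASSERT. The preprocessor will not re-expand a macro name inside its own expansion, so the substitution stops there and the compiler is handed a literal RR_ASSERT(true), which it treats as a call to an undeclared function. But the name of the preprocessor symbol is still defined, so my ifdef check still saw RR_ASSERT existing. Evil.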
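You can watch this happen by stringizing the expansion. A small sketch with made-up names (DEMO_ASSERT / demo_assert stand in for RR_ASSERT / assert so nothing collides with the real ones) :

```cpp
#include <cstring>

// two-level stringize so the argument is macro-expanded before it is quoted
#define DEMO_STR2(x) #x
#define DEMO_STR(x)  DEMO_STR2(x)

// the same cycle as above, with hypothetical names
#define demo_assert       DEMO_ASSERT
#define DEMO_ASSERT(exp)  demo_assert(exp)

// the ifdef still succeeds, because the name is defined
#ifdef DEMO_ASSERT
const bool demo_assert_is_defined = true;
#else
const bool demo_assert_is_defined = false;
#endif

// what the compiler proper would be handed for DEMO_ASSERT(true) :
// expansion goes DEMO_ASSERT -> demo_assert -> DEMO_ASSERT and then stops,
// so the text still starts with the macro's own name
const char * const demo_expansion = DEMO_STR( DEMO_ASSERT(true) );
```

Printing demo_expansion shows the macro name coming out the other side unexpanded, which is exactly the identifier the compiler then complains about.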

BTW the thing that kicked this off is that fucking VC x64 doesn't support inline assembly. ARGH YOU COCK ASS. Because of that we had long ago written something like


#ifdef _X86
#define RR_ASSERT_BREAK()  __asm int 3
#else
#define RR_ASSERT_BREAK()  assert(0)
#endif

which is what caused the difficulty.

05-27-10 - Loop Branch Inversion

A major optimization paradigm I'm really missing from C++ is something I will call "loop branch inversion". The problem is for code sharing and cleanliness you often wind up with cases where you have a lot of logic in some outer loops that find all the things you should work on, and then in the inner loop you have to do a conditional to pick what operation to do. eg :


LoopAndDoWork(query,worktype)
{
    Make bounding area
    Do Kd-tree descent .. 
    loop ( tree nodes )
    {
        bounding intersection, etc.
        found an object
        DoPerObjectWork(object);
    }
}

The problem is that DoPerObjectWork then is some conditional, maybe something like :


DoPerObjectWork(object)
{
    switch(workType)
    {
    ...
    }
}

or even worse - it's a function pointer that you call back.

Instead you would like the switch on workType to be on the outside. WorkType is a constant all the way through the code, so I can just propagate that branch up through the loops, but there's no way to express it neatly in C.

The only real option is with templates. You make DoPerObjectWork a functor and you make LoopAndDoWork a template. The other option is to make an outer loop dispatcher to constants. That is, make workType a template parameter instead of an integer :


template < int workType >
void t_LoopAndDoWork(query)
{
    ...
}

and then provide a dispatcher which does the branch outside :


LoopAndDoWork(query,worktype)
{
    switch(workType)
    {
    case 0 : t_LoopAndDoWork<0>(query); break;
    case 1 : t_LoopAndDoWork<1>(query); break;
    ...
    }
}

this is an okay solution, but it means you have to reproduce the branch on workType in the outer loop and inner loop. This is not a speed penalty because the inner loop is a branch on constant which goes away, it's just ugly for code maintenance purposes because they have to be kept in sync and can be far apart in the code.

This is a general pattern - use templates to turn a variable parameter into a constant and then use an outer dispatcher to turn a variable into the right template call. But it's ugly.
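Filled in as something that actually compiles, the pattern looks like this. This is a toy sketch : the "objects" are just ints, the two work types (sum and max) are invented for illustration, and the Kd-tree walk is replaced by a plain loop.

```cpp
#include <vector>

enum WorkType { kWorkSum = 0, kWorkMax = 1 };   // hypothetical work types

// inner work : the branch is on a template constant, so it compiles away
template < int workType >
inline void DoPerObjectWork( int object, int & result )
{
    if ( workType == kWorkSum ) result += object;
    else if ( object > result ) result = object;
}

// the shared outer loop, instantiated once per work type
template < int workType >
int t_LoopAndDoWork( const std::vector<int> & objects )
{
    int result = 0;
    for ( int object : objects )
        DoPerObjectWork<workType>( object, result );
    return result;
}

// dispatcher : the one real branch, hoisted outside the loop
int LoopAndDoWork( const std::vector<int> & objects, int workType )
{
    switch ( workType )
    {
    case kWorkSum : return t_LoopAndDoWork<kWorkSum>( objects );
    case kWorkMax : return t_LoopAndDoWork<kWorkMax>( objects );
    default : return 0;
    }
}
```

Note the maintenance problem is visible even at this size : the list of work types appears in both DoPerObjectWork and the dispatcher.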

BTW when doing this kind of thing you often wind up with loops on constants. The compiler often can't figure out that a loop on a constant can be unrolled. It's better to rearrange the loop on constant into branches. For example I'm often doing all this on pixels where the pixel can have between 1 and 4 channels. Instead of this :


for(int c=0;c<channels;c++)
{
    DoStuff(c);
}

where channels is a constant (template parameter), it's better to do :


DoStuff(0);
if ( channels > 1 ) DoStuff(1);
if ( channels > 2 ) DoStuff(2);
if ( channels > 3 ) DoStuff(3);

because those ifs reliably go away.
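As a complete function, assuming channels is a template parameter between 1 and 4 (SumPixel is a made-up example of "DoStuff") :

```cpp
// Sums the channels of one pixel ; each 'if' tests a compile-time constant,
// so the untaken ones are expected to be removed entirely.
template < int channels >
inline int SumPixel( const unsigned char * px )
{
    static_assert( channels >= 1 && channels <= 4, "pixel has 1 to 4 channels" );
    int sum = px[0];
    if ( channels > 1 ) sum += px[1];
    if ( channels > 2 ) sum += px[2];
    if ( channels > 3 ) sum += px[3];
    return sum;
}
```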

5/26/2010

05-26-10 - Windows Page Cache

The correct way to cache things is through Windows' page cache. The advantages of doing this over using your own custom cache code are :

1. Automatically resizes based on amount of memory needed by other apps. eg. other apps can steal memory from your cache to run.

2. Automatically gives pages away to other apps or to file IO or whatever if they are touching their cache pages more often.

3. Automatically keeps the cache in memory between runs of your app (if nothing else clears it out). This is pretty immense.

Because of #3, your custom caching solution might slightly beat using the Windows cache on the first run, but on the second run it will stomp all over you.

To do this nicely, generally the cool thing to do is make a unique file name that is the key to the data you want to cache. Write the data to a file, then memory map it as read only to fetch it from the cache. It will now be managed by the Windows page cache and the memory map will just hand you a page that's already in memory if it's still in cache.

The only thing that's not completely awesome about this is the reliance on the file system. It would be nice if you could do this without ever going to the file system. eg. if the page is not in cache, I'd like Windows to call my function to fill that page rather than getting it from disk, but so far as I know this is not possible in any easy way.

For example : say you have a bunch of compressed images as JPEG or whatever. You want to keep uncompressed caches of them in memory. The right way is through the Windows page cache.

05-26-10 - Windows 7 Snap

My beloved "AllSnap" doesn't work on Windows 7 / x64. I can't find a replacement because fucking Windows has a feature called "Snap" now, so you can't fucking google for it. (also searching for "Windows 7" stuff in general is a real pain because solutions and apps for the different variants of windows don't always use the full name of the OS they are for in their page, so it's hard to search for; fucking operating systems really need unique code names that people can use to make it possible to search for them; "Windows" is awful awful in this regard).

I contacted the developer of AllSnap to see if he would give me the code so I could fix it, but he is ignoring me. I can tell from debugging apps when AllSnap is installed that it seems to work by injecting a DLL. This is similar to how I hacked the poker sites for GoldBullion, so I think I could probably reproduce that. But I dunno if Win7/x64 has changed anything about function injection and the whole DLL function pointer remap method.

BTW/FYI the standard Windows function injection method goes like this : Make a DLL that has some event handler. Run a little app that causes that event to trip inside the app you want to hijack. Your DLL is now invoked in that app's process to handle that event. Now you are running in that process so you can do anything you want - in particular you can find the function table to any of their DLL's, such as user32.dll, and stuff your own function pointer into that memory. Now when the app makes normal function calls, they go through your DLL.

5/25/2010

05-25-10 - Thread Insurance

I just multi-threaded my video test app recently, and it was reasonably easy, but I had a few nagging bugs because of hidden ways they were touching shared memory without protection deep inside functions. Okay, so I found them and fixed them, but I'm left with a problem - any time I touch one of those deep functions, I could screw up the threading without realizing it. And I might not get any indication of what I did for weeks if it's a rare race.

What I would like is a way to make this more robust. I have very strong threading primitives, I want a way to make sure that I use them! In particular, I want to be able to mark certain structs as only touchable when a critsec is locked or whatever.

I think that a lot of this could be done with Win32 memory page protections. So far as I know there's no way to associate protections per-thread, (eg. to make a page read/write for thread A but no-access for thread B). If I could do that it would be super sweet.

One idea is to make the page no access and then install my own exception handler that checks what thread it is, but that might be too much overhead (and not sure if that would fail for other reasons).

The main usage is not for protected crit-sec'ed structs, that is really the easiest case to maintain because it's very obvious right there in the code that you need to take the critsec to touch the variables. The hard case to maintain is the ad hoc "I know this is safe to touch without protection". In particular I have a lot of code that runs like this :


Phase 1 : I know no threads are touching shared data item A
main thread does lots of writing in A

Phase 2 : fire up threads.  They only read from A and do so without protection.  They each write to unique areas B,C,D.

Phase 3 : spin down threads.  Now main thread can write A and read B,C,D.

So what I would really like to do is :


Phase 1 : I know no threads are touching shared data item A
main thread does lots of writing in A

-- set A memory to be read-only !
-- set B,C,D memory to be read/write only for their own thread

Phase 2 : fire up threads.  They only read from A and do so without protection.  They each write to unique areas B,C,D.

-- make A,B,C,D read/write only for main thread !

Phase 3 : spin down threads.  Now main thread can write A and read B,C,D.

The thing that this saves me from is when I'm tinkering in DoComplicatedStuff() which is some function called deep inside Phase 2 somewhere and I change it to no longer follow the memory access rule that it is supposed to be following. This is just my hate for having rules for code correctness that are not enforced by the compiler or at least by run-time asserts.
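Short of real per-thread page protection, the run-time-assert flavor of this can be faked in software. Here is a sketch of that swapped-in approach (it is not the page protection scheme, and PhaseGuarded is a made-up name) : wrap the shared data, record who is allowed to touch it in the current phase, and assert on every access. It assumes C++11 and that the phase switches happen only on the main thread while no workers are running.

```cpp
#include <cassert>
#include <thread>

enum class Access { ReadOnly, OwnedByThread };

template < typename T >
class PhaseGuarded
{
public:
    // starts out owned (read/write) by the creating thread
    PhaseGuarded() : value_(), access_( Access::OwnedByThread ),
                     owner_( std::this_thread::get_id() ) { }

    // phase switches : call only between phases, when no worker is running
    void make_read_only() { access_ = Access::ReadOnly; }
    void give_to( std::thread::id owner ) { access_ = Access::OwnedByThread; owner_ = owner; }

    bool can_write() const
    {
        return access_ == Access::OwnedByThread && owner_ == std::this_thread::get_id();
    }
    bool can_read() const
    {
        return access_ == Access::ReadOnly || owner_ == std::this_thread::get_id();
    }

    // every access checks the rule for the current phase
    const T & read() const { assert( can_read() ); return value_; }
    T & write() { assert( can_write() ); return value_; }

private:
    T value_;
    Access access_;
    std::thread::id owner_;
};
```

So A would be make_read_only() before Phase 2, and B,C,D would each be give_to() their worker's id. It only catches accesses that go through read() / write(), which is much weaker than page protection, but it does fire the moment DoComplicatedStuff() breaks the rule rather than weeks later.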

5/21/2010

05-21-10 - Video coding beyond H265

In the end movec-residual (motion vector + residual) coding is inherently limited and inefficient. Let's review the big advantage of it and the big problem.

The advantage is that the encoder can reasonably easily consider {movec,residual} coding choices jointly. This is a huge advantage over just picking what's my best movec, okay now code the residual. Because movec affects the residual, you cannot make a good R/D decision if you do it separately. By using block movecs, it reduces the number of options that need to be considered to a small enough set that encoders can practically consider a few important choices and make a smart R/D decision. This is what is behind all current good video encoders.

The disadvantage of movec-residual coding is that they are redundant and connected in a complex and difficult to handle way. We send them independently, but really they have cross-information about each other, and that is impossible to use in the standard framework.

There are obviously edges and shapes in the image which occur in both the movecs and the residuals. eg. a moving object will have a boundary, and really this edge should be used for both the movec and residual. In the current schemes we send a movec for the block, and then the residuals per pixel, so we now have finer grain information in the residual that should have been used to give us finer movecs per pixel, but it's too late now.

Let's back up to fundamentals. Assume for the moment that we are still working on an 8x8 block. We want to send that block in the current frame. We have previous frames and previous blocks within the current frame to help us. There are (256^3)^64 = 2^1536 possible values for this block (64 pixels, each with 256^3 possible colors). If we are doing lossy coding, then not all possible values for the block can be sent. I won't get into details of lossiness, so just say there are a large number of possible values for the pixels of the block; we want to code an index to one of those values.

Each index should be sent with a different bit length based on its probability. Already we see a flaw with {movec-residual} coding - there are tons of {movec,residual} pairs that specify the same index. Of course in a flat area lots of movecs might point to the same pixels, but even if that is eliminated, you could go movec +1, residual +3, or movec +3, residual +1, and both ways get to +4. Redundant encoding = bit waste.

Now, this bit waste might not be critically bad with current simple {movec,residual} schemes - but it is a major encumbrance if we start looking at more sophisticated mocomp options. Say you want to be able to send movecs for shapes, eg. send edges and then send a movec on each side. There are lots of possibilities here - you might just send a movec per pixel (this seems absurdly expensive, but the motion fields are very smooth so should code well from neighbors), or you might send a polygon mesh to specify shapes. This should give you much better motion fields, and then the information in the motion fields can be used to predict the residuals as well. But the problem is there's too much redundancy. You have greatly expanded the number of ways to code the same output pixels.

We could consider more modest steps as well, such as sticking with block mocomp + residual, but expanding what we can do for "mocomp". For example, you could use two motion vectors + arbitrary linear combination of the source blocks. Or you could do trapezoidal texture-mapping style mocomp. Or mocomp with a vector and scale + rotation. None of these is very valuable, there are numerous problems : 1. too many ways to encode for the encoder to do thorough R/D analysis of all of them, 2. too much redundancy, 3. still not using the joint information across residual & motion.

In the end the problem is that you are using a 6-d value {velocity,pixel} to specify a 3-d color. What you really want is a 3-d coordinate which is not in pixel space, but rather is a sort of "screw" in motion/pixel space. That is, you want the adjacent coordinates in motion/pixel space to be the ones that are closest together in the 6-d space. So for example RGB {100,0,0} and {0,200,50} might be neighbors in motion/pixel space if they can be reached by small motion adjustments.

Okay this is turning into rambling, but another way of seeing it is like this : for each block, construct a custom basis transform. Don't send a separate movec or anything - the axes of the basis transform select pixels by stepping in movec and also residual.

ADDENDUM : let me try to be more clear by doing a simple example. Say you are trying to code a block of pixels which only has 10 possible values. You want to code with a standard motion then residual method. Say there are only 2 choices for motion. It is foolish to code all 10 possible values for both motion vectors! That is, currently all video coders do something like :


Code motion = [0 or 1]
Code residual = [0,1,2,3,4,5,6,7,8,9]

Or in tree form :

   0 - [0,1,2,3,4,5,6,7,8,9]
*<
   1 - [0,1,2,3,4,5,6,7,8,9]

Clearly this is foolish. For each movec, you only need to code the residual which encodes that resulting pixel block the smallest under that movec. So you only need each output value to occur in one spot on the tree, eg.


   0 - [0,1,2,3,4]
*<
   1 - [5,6,7,8,9]

or something. That is, it's foolish to have two ways to encode the residual to reach a certain target when there were already cheaper ways to reach that target in the movec coding portion. To minimize this deficiency, most current coders like H264 will code blocks by either putting almost all the bits in the movec and very few in the residual, or the other way (almost none in the movec and most in the residual). The loss occurs most when you have many bits in the motion and many in the residual, something like :


   0 - [0,1,2]
   1 - [3,4,5,6]
   2 - [7,8]
   3 - [9]
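To put a number on the waste in the two-motion example above, here is a toy count. The model is an assumption made up for illustration, not a real codec : movec m predicts the value m, the residual is added modulo the number of values, and every codeword is treated as costing the same.

```cpp
#include <cmath>
#include <set>

struct Redundancy
{
    int    codewords;         // number of {movec,residual} pairs
    int    distinct_outputs;  // number of different blocks they can produce
    double wasted_bits;       // log2( codewords / distinct_outputs )
};

// Toy model : output = ( movec + residual ) mod num_values
Redundancy MeasureRedundancy( int num_movecs, int num_values )
{
    std::set<int> outputs;
    int codewords = 0;
    for ( int m = 0; m < num_movecs; m++ )
    {
        for ( int r = 0; r < num_values; r++ )
        {
            outputs.insert( ( m + r ) % num_values );
            codewords++;
        }
    }
    Redundancy result;
    result.codewords = codewords;
    result.distinct_outputs = static_cast<int>( outputs.size() );
    result.wasted_bits = std::log2( static_cast<double>( codewords ) / result.distinct_outputs );
    return result;
}
```

For 2 movecs and 10 values that is 20 codewords for 10 outputs, ie. one whole bit per block spent saying nothing. Real coders are not uniform so the actual loss is smaller, but it is the same kind of loss.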

The other huge fundamental deficiency is that the probability modeling of movecs and residuals is done in a very primitive way based only on "they are usually small" assumptions. In particular, probability modeling of movecs needs to be done not just based on the vector, but on the content of what is pointed at. I mentioned long ago there is a lot of redundancy there when you have lots of movecs pointing at the same thing. Also, the residual coding should be aware of what was pointed to by the movec. For example if the movec pointed at a hard edge, then the residual will likely also have a similar hard edge because it's likely we missed by a little bit, so you could use a custom transform that handles that better. etc.

ADDENDUM 2 : there's something else very subtle going on that I haven't seen discussed much. The normal way of sending {movec,residual} is actually over-complete. Mostly that's bad, too much over-completeness means you are just wasting bits, but actually some amount of over-completeness here is a good thing. In particular for each frame we are sending a little bit of extra side information that is useful for *later* frames. That is, we are sending enough information to decode the current frame to some quality level, plus some extra that is not really worth it for just the current frame, but is worth it because it helps later frames.

The problem is that the amount of extra information we are sending is not well understood. That is, in the current {movec,residual} schemes we are just sending extra information without being in control and making a specific decision. We should be choosing how much extra information to send by evaluating whether it is actually helpful on future frames. Obviously the last frames of the video (or a sequence before a cut) you shouldn't send any extra information.

In the examples above I'm showing how to reduce the overcomplete information down to a minimal set, but sometimes you might not want to do that. As a very coarse example say the true motion at a given pixel is +3, movec=3 to get to final pixel=7 , but you can code the same result smaller by using movec=1 - deciding whether to send the true motion or not should be done based on whether it actually helps in the future, but more importantly the code stream could collapse {3,7} and {1,7} so there is no redundant way to code if the difference is not helpful.

This becomes more important of course if you have a more complex motion scheme, like per-pixel motion or trapezoidal motion or whatever.

5/20/2010

05-20-10 - Some quick notes on H265

Since we're talking about VP8 I'd like to take this chance to briefly talk about some of the stuff coming in the future. H265 is being developed now, though it's still a long ways away. Basically at this point people are throwing lots of shit at the wall to see what sticks (and hope they get a patent in). It is interesting to see what kind of stuff we may have in the future. Almost none of it is really a big improvement like "duh we need to have that in our current stuff", it's mostly "do the same thing but use more CPU".

The best source I know of at the moment is H265.net , but you can also find lots of stuff just by searching for video on citeseer. (addendum : FTP to Dresden April meeting downloads ).

H265 is just another movec + residual coder, with block modes and quadtree-like partitions. I'll write another post about some ideas that are outside of this kind of scheme. Some quick notes on the kind of things we may see :

Super-resolution mocomp. There are some semi-realtime super-resolution filters being developed these days. Super-resolution lets you take a series of frames and create an output that's higher fidelity than any one source. In particular given a few assumptions about the underlying source material, it can reconstruct a good guess of the higher resolution original signal before sampling to the pixel grid. This lets you do finer subpel mocomp. Imagine for example that you have some black and white text that is slowly translating. On any one given frame there will be lots of gray edges due to the antialiased pixel sampling. Even if you perfectly know the subpixel location of that text on the target frame, you have no single reference frame to mocomp from. Instead you create a super-resolution reference frame of the original signal and subpel mocomp from that.

Partitioned block transforms. One of the minor improvements in image coding lately, which is natural to move to video coding, is PBT with more flexible sizes. This means 8x16, 4x8, 4x32, whatever, lots of partition sizes, and having block transforms for that size of partition. This lets the block transform match the data better. Which also leads us to -

Directional transforms and trained transforms. Another big step is not always using an X & Y oriented orthogonal DCT. You can get a big win by doing directional transforms. In particular, you find the directions of edges and construct a transform that has its bases aligned along those edges. This greatly reduces ringing and improves energy compaction. The problem is how do you signal the direction or the transform data? One option is to code the direction as extra side information, but that is probably prohibitive overhead. A better option is to look at the local pixels (you already have decoded neighbors) and run edge detection on them and find the local edge directions and use that to make your transform bases. Even more extreme would be to do a fully custom transform construction from local pixels (and the same neighborhood in the last frame), either using competition (select from a set of transforms based on which one would have done best on those areas) or training (build the KLT for those areas). Custom trained bases are especially useful for "weird" images like Barb. These techniques can also be used for ...

Intra prediction. Like residual transforms, you want directional intra prediction that runs along the edges of your block, and ideally you don't want to send bits to flag that direction, rather figure it out from neighbors & previous frame (at least to condition your probabilities). Aside from finding direction, neighbors could be used to vote for or train fully custom intra predictors. One of the H265 proposals is basically GLICBAWLS applied to intra prediction - that is, train a local linear predictor by doing weighted LSQR on the neighborhood. There are some other equally insane intra prediction proposals - basically any texture synthesis or prediction paper over the last 10 years is fair game for insane H265 intra prediction proposals, so for example you have suggestions like Markov 2x2 block matching intra prediction which builds a context from the local pixel neighborhood and then predicts pixels that have been seen in similar contexts in the image so far.

Unblocking filters ("loop filtering" huh?) are an obvious area for improvement. The biggest area for improvement is deciding when a block edge has been created by the codec and when it is in the source data. This can actually usually be figured out if the unblocking filter has access to not just the pixels, but how they were coded and what they were mocomped from. In particular, it can see whether the code stream was *trying* to send a smooth curve and just couldn't because of quantization, or whether the code stream intentionally didn't send a smooth curve (eg. it could have but chose not to).

Subpel filters. There are a lot of proposals on improved sub-pixel filters. Obviously you can use more taps to get better (sharper) frequency response, and you can add 1/8 pel or finer. The more dramatic proposals are to go to non-separable filters, non-axis aligned filters (eg. oriented filters), and trained/adaptive filters, either with the filter coefficients transmitted per frame or again deduced from the previous frame. The issue is that what you have is just a pixel sampled aliased previous frame; in order to do sub-pel filtering you need to make some assumptions about the underlying image signal; eg. what is the energy in frequencies higher than the sampling limit? Different sub-pel filters correspond to different assumptions about the beyond-nyquist frequency content. As usual orienting filters along edges helps.
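For a baseline of what "more taps" means, here's a rough 1D sketch comparing a 2-tap bilinear half-pel sample to the 6-tap (1,-5,20,20,-5,1)/32 filter that H264 uses for luma half-pel positions. This is the existing H264 filter, not any of the H265 proposals, and the edge clamping and function names are just my assumptions for the sketch :

```python
# H.264 luma half-sample filter taps; they sum to 32.
TAPS = (1, -5, 20, 20, -5, 1)

def half_pel_6tap(row, i):
    """Sample halfway between row[i] and row[i+1], clamping indices at the edges."""
    acc = 0
    for k, tap in enumerate(TAPS):
        j = min(max(i - 2 + k, 0), len(row) - 1)
        acc += tap * row[j]
    # Round, divide by 32, clip to the 8-bit range.
    return min(max((acc + 16) >> 5, 0), 255)

def half_pel_bilinear(row, i):
    """The 2-tap version: just the rounded average of the two neighbors."""
    return (row[i] + row[i + 1] + 1) >> 1

# A soft edge (made-up data).
row = [10, 10, 10, 50, 200, 240, 240, 240]

print(half_pel_bilinear(row, 2))  # average of 10 and 50
print(half_pel_6tap(row, 2))      # stays closer to the flat side of the edge
```

The two filters agree on flat regions and differ near the edge. The longer filter is making a different guess about what the signal does between the samples, which is the "different assumptions" point above.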

Improved entropy coding. So far as I can tell there's nothing too interesting here. Current video coders (H264) use entropy coders from the 1980's (very similar to the Q-coder stuff in JPEG-ari), and the proposals are to bring the entropy coding into the 1990's, on the level of ECECOW or EZDCT.

5/19/2010

05-19-10 - Some quick notes on VP8

The VP8 release is exciting for what it might be in two years.

If it in fact becomes a clean open-source video standard with no major patent encumbrances, it might be well integrated in Firefox, Windows Media, etc. etc. - eg. we might actually have a video format that actually just WORKS! I don't even care if the quality/size is really competitive. How sweet would it be if there was a format that I knew I could download and it would just play back correctly and not give me any headaches. Right now that does not exist at all. (it's a sad fact that animated GIF is probably the most portable video format of the moment).

Now, you might well ask - why VP8 ? To that I have no good answer. VP8 seems like a messy cock-assed standard which has nothing in particular going for it. The entropy encoder in particular (much like H264) seems badly designed and inefficient. The basics are completely vanilla, in that it is block based, block modes, movecs, transforms, residual coding. In that sense it is just like MPEG1 or H265. That is a perfectly fine thing to do, and in fact it's what I've wound up doing, but you could pull a video standard like that out of your ass in about five minutes, there's no need to license code for that. If in fact VP8 does dodge all the existing patents then that would be a reason that it has value.

The VP8 code stream is probably pretty weak (I really don't know enough of the details to say for sure). However, what I have learned of late is that there is massive room for the encoder to make good output video even through a weak code stream. In fact I think a very good encoder could make good output from an MPEG2 level of code stream. Monty at Xiph has a nice page about work on Theora. There's nothing really cutting edge in there but it's nicely written and it's a good demonstration of the improvement you can get on a fixed standard code stream just with encoder improvements (and really their encoder is only up to "good but still basic" and not really into the realm of wicked-aggressive).

The only question we need to ask about the VP8 code stream is : is it flexible enough that it's possible to write a good encoder for it over the next few years? And it seems the answer is yes. (contrast this to VP3/Theora which has a fundamentally broken code stream which has made it very hard to write a good encoder).

ADDENDUM : this post by Greg Maxwell is pretty right on.

ADDENDUM 2 : Something major that's been missing from the web discussions and from the literature about video for a long time is the separation of code stream from encoder. The code stream basically gives the encoder a language and framework to work in. The things that Jason / Dark Shikari thinks are so great about x264 are almost entirely encoder-side things that could apply to almost any code stream (eg. "psy rdo" , "AQ", "mbtree", etc.). The literature doesn't discuss this much because they are trapped in the pit of PSNR comparisons, in which encoder side work is not that interesting. Encoder work for PSNR is not interesting because we generally know directly how to optimize for MSE/SSD/L2 error - very simple ways like flat quantizers and DCT-space trellis quant, etc. What's more interesting is perceptual quality optimization in the encoder. In order to achieve good perceptual optimization, what you need is a good way to measure perceptual error (which we don't have), and the ability to try things in the code stream and see if they improve perceptual error (hard due to non-local effects), and a code stream that is flexible enough for the encoder to make choices that create different kinds of errors in the output. For example adding more block modes to your video coder with different types of coding is usually/often bad in a PSNR sense because all they do is create redundancy and take away code space from the normal modes, but it can be very good in a perceptual sense because it gives the encoder more choice.
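For reference, the SSD/MSE/PSNR numbers being talked about are trivial to compute. A minimal sketch (function names are mine, 8-bit peak assumed), with a made-up example of two very different error patterns that PSNR can't tell apart :

```python
import math

def ssd(a, b):
    """Sum of squared differences (the L2 error)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mse(a, b):
    return ssd(a, b) / len(a)

def psnr(a, b, peak=255):
    """PSNR in dB; infinite when the two signals are identical."""
    m = mse(a, b)
    return math.inf if m == 0 else 10 * math.log10(peak * peak / m)

original   = [100, 100, 100, 100]
one_spike  = [104, 100, 100, 100]  # all the error in one pixel
spread_out = [102, 102, 102, 102]  # the same SSD smeared over every pixel

print(psnr(original, one_spike))
print(psnr(original, spread_out))  # same number, different looking error
```

Both distortions have SSD = 16 so they get the same PSNR, even though a single spike and a uniform brightness shift look nothing alike. Choosing between them is exactly the kind of thing an encoder can't learn from PSNR.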

ADDENDUM 3 : Case in point , I finally have noticed some x264 encoded videos showing up on the torrent sites. Well, about 90% of them don't play back on my media PC right. There's some glitching problem, or the audio & video get out of sync, or the framerate is off a tiny bit, or some shit and it's fucking annoying.

ADDENDUM 4 : I should be more clear - the most exciting thing about VP8 is that it (hopefully) provides an open patent-free standard that can then be played with and discussed openly by the development community. Hopefully encoders and decoders will also be open source and we will be able to talk about the techniques that go into them, and a whole new

5/13/2010

05-13-10 - P4 with NiftyPerforce and no P4SCC

I'm trying using P4 in MSDev with NiftyPerforce and no P4SCC.

What this means is VC thinks you have no SCC connection at all, your files are just on your disk. You need to change the default NiftyPerforce settings so that it checks out files for you when you edit/save etc.

Advantages of NiftyPerforce without P4SCC :

1. Much faster startup / project load, because it doesn't go and check the status of everything in the project with P4.

2. No clusterfuck when you start unconnected. This is one of the worst problems with P4SCC, for example if you want to work on some work projects but can't VPN for some reason, P4SCC will have a total shit fit about working disconnected. With the NiftyPerforce setup you just attrib your files and go on with your business.

3. No difficulties with changing binding/etc. This is another major disaster with P4SCC. It's rare, but if you change the P4 location of a project or change your mappings or if you already have some files added to P4 but not the project, all these things give MSdev a complete shit-fit. That all goes away.

Disadvantages of NiftyPerforce without P4SCC :

1. The first few keystrokes are lost. When you try to edit a checked-in file, you can just start typing and Nifty will go check it out, but until the checkout is done your keystrokes go to never-never land. Mild suckitude. Alternatively you could let MSDev pop up the dialog for "do you want to edit this read only file" which would make you more aware of what's going on but doesn't actually fix the issue.

2. No check marks and locks in project browser to let you know what's checked in / checked out. This is not a huge big deal, but it is a nice sanity check to make sure things are working the way they should be. Instead you have to keep an eye on your P4Win window which is a mild productivity hit.

One note about making the changeover : for existing projects that have P4SCC bindings, if you load them up in VC and tell VC to remove the binding, it also will be "helpful" and go attrib all your files to make them writeable (it also will be unhelpful and not check out your projects to make the change to not have them bound). Then NiftyPerforce won't work because your files are already writeable. The easiest way to do this right is to just open your vcproj's and sln's in a text editor and rip out all the binding bits manually.
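If you have a pile of projects, ripping the bindings out by hand gets old. Here's a rough sketch of doing the vcproj side with a regex. It assumes the bindings show up as `Scc...="..."` attributes each on their own line (SccProjectName, SccLocalPath, SccProvider and friends are what I'd expect VC 2005/2008 to write, but check your own files first). It does not touch the source-control section in the .sln, and the function name is made up :

```python
import re

# One Scc* attribute per line, e.g.   SccProjectName="Perforce Project"
# (assumed layout; verify against your own .vcproj before trusting this)
SCC_ATTR = re.compile(r'^[ \t]*Scc\w+="[^"]*"[ \t]*\r?\n', re.MULTILINE)

def strip_scc_attributes(vcproj_text):
    """Return the project text with every Scc* attribute line removed."""
    return SCC_ATTR.sub("", vcproj_text)

sample = (
    '<VisualStudioProject\n'
    '\tName="foo"\n'
    '\tSccProjectName="Perforce Project"\n'
    '\tSccLocalPath="."\n'
    '\tSccProvider="MSSCCI:Perforce SCM"\n'
    '\tRootNamespace="foo"\n'
    '\t>\n'
)

print(strip_scc_attributes(sample))
```

Run it on a copy first, and remember the files still have to be checked out (or made writeable) before you can save the result.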

I'm not sure yet whether the pros/cons are worth it. P4SCC actually is pretty nice once it's set up, though the ass-pain it gives when trying to make it do something it doesn't want to do (like source control something that's out of the binding root) is pretty severe.

ADDENDUM :

I found the real pro & con of each way.

Pro P4SCC : You can just start editing files in VC and not worry about it. It auto-checks out files from P4 and you don't lose key presses. The most important case here is that it correctly handles files that you have not got the latest revision of - it will pop up "edit current or sync first" in that case. The best way to use Nifty seems to be Jim's suggestion - put checkout on Save, do not checkout on Edit, and make files read-only editable in memory. That works great if you are a single dev but is not super awesome in an actual shared environment with heavy contention.

Pro NiftyP4 : When you're working from home over an unreliable VPN, P4SCC is just unworkable. If you lose connection it basically hangs MSDev. This is so bad that it pretty much completely dooms P4SCC. ARG actually I take that back a bit, NiftyP4 also hangs MSDev when you lose connection, though it's not nearly as bad.

5/12/2010

05-12-10 - P4 By Dir

(ADDENDUM : see comments, I am dumb).

I mentioned this before :

(Currently that's not a great option for me because I talk to both my home P4 server and my work P4 server, and P4 stupidly does not have a way to set the server by local directory. That is, if I'm working on stuff in c:\home I want to use one env spec and if I'm in c:\work, use another env spec. This fucks up things like NiftyPerforce and p4.exe because they just use a global environment setting for server, so if I have some work code and some home code open at the same time they shit their pants. I think that I'll make my own replacement p4.exe that does this the right way at some point; I guess the right way is probably to do something like CVS/SVN does and have a config file in dirs, and walk up the dir tree and take the first config you find).

But I'm having second thoughts, because putting little config shitlets in my source dirs is one of the things I hate about CVS. Granted it would be much better in this case - I would only need a handful of them in my top level dirs, but another disadvantage is my p4bydir app would need to scan up the dir tree all the time to find config files.

And there's a better way. The thing is, the P4 Client specs already have the information of what dirs on my local machine go with what depot mappings. The problem is the client spec is not actually associated with a server. What you need is a "port client user" setting. These are stored as favorites in P4Win, but there is no authoritative list of the valid/good "port client user" setups on a machine.

So, my new idea is that I store my own config file somewhere that lists the valid "port client user" sets that I want to consider in p4bydir. I load that and then grab all the client specs. I use the client specs to identify what dirs to map to where, and the "port client user" settings to tell what p4 environment to set for that dir.

I then replace the global p4.exe with my own p4bydir so that all apps (like NiftyPerforce) will automatically talk to the right connection whenever they do a p4 on a file.
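The core of that idea is just a longest-prefix lookup from a local dir to a "port client user" set. A minimal sketch, assuming the client roots have already been pulled out of the client specs and stuffed into one table (the server names, client names and function names here are all hypothetical; P4PORT / P4CLIENT / P4USER are the standard Perforce environment variables) :

```python
import ntpath  # Windows path rules, regardless of where this runs

# Hypothetical "port client user" sets, with the local root each client maps.
CONNECTIONS = [
    {"root": r"c:\home", "port": "homeserver:1666", "client": "home_client", "user": "me"},
    {"root": r"c:\work", "port": "workserver:1666", "client": "work_client", "user": "me_at_work"},
]

def _norm(path):
    """Lower-case, backslash-only, no trailing junk, so prefixes compare cleanly."""
    return ntpath.normcase(ntpath.normpath(path))

def connection_for(path, connections=CONNECTIONS):
    """Pick the connection whose root is the longest prefix of `path`, or None."""
    p = _norm(path)
    best = None
    for conn in connections:
        root = _norm(conn["root"])
        if p == root or p.startswith(root + "\\"):
            if best is None or len(root) > len(_norm(best["root"])):
                best = conn
    return best

def p4_environment(path):
    """Environment settings to use when running p4 on `path` (empty if unknown)."""
    conn = connection_for(path)
    if conn is None:
        return {}
    return {"P4PORT": conn["port"], "P4CLIENT": conn["client"], "P4USER": conn["user"]}

print(p4_environment(r"C:\Work\proj\main.cpp"))
```

A p4bydir wrapper would then set those variables and forward its arguments to the real p4.exe. That part is left out here.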

05-12-10 - Cleartype

Since I ranted about Cleartype I thought I'd go into a bit more detail. this article on Cleartype in Win7 is interesting, though also willfully dense.

Another research question we've asked ourselves is why do some people prefer bi-level rendering over ClearType? Is it due to hardware issues or is there some other attribute that we don't understand about visual systems that is playing a role. This is an issue that has piqued our curiosity for some time. Our first attempt at looking further into this involved doing an informal and small-scale preference study in a community center near Microsoft.

Wait, this is a research question ? Gee, why would I prefer perfect black and white raster fonts to smudged and color-fringed cleartype. I just can't imagine it! Better do some community user testing...

1. 35 participants.

2. Comments for bi-level rendering: Washed out; jiggly; sketchy; if this were a printer, I'd say it needed a new cartridge; fading out - esp. the numbers, I have to squint to read this, is it my glasses or it is me?; I can't focus on this; broken up; have to strain to read; jointed.

3. Comments for ClearType: More defined, Looks bold (several times), looks darker, clearer (4 times), looks like it's a better computer screen (user suggested he'd pay $500 more for the better screen on a $2000 laptop), sort of more blue, solid, much easier to read (3 times), clean, crisp, I like it, shows up better, and my favorite: from an elderly woman who was rather put out that the question wasn't harder: this seems so obvious (said with a sneer.)

Oh my god, LOL, holy crap. They are obviously comparing Cleartyped anti-aliased fonts to black-and-white rendered TrueType fonts, NOT to raster fonts. They're probably doing big fonts on a high DPI screen too. Try it again on a 24" LCD with an 8 point font please, and compare something that has an unhinted TrueType and an actual hand-crafted raster font. Jesus. Oh, but I must be wrong because the community survey says 94% prefer cleartype!

Anyway, as usual the annoying thing is that in pushing their fuck-tard agenda, they refuse to acknowledge the actual pros and cons of each method and give you the controls you really want. What I would like is a setting to make Windows always prefer bitmap fonts when they exist, but use ClearType if it is actually drawing anti-aliased fonts. Even then I still might not use it because I fucking hate those color fringes, but it would be way more reasonable. Beyond that obviously you could want even more control like switching preference for cleartype vs. bitmap per font, or turning on and off hinting per font or per app, etc. but just some more reasonable global default would get you 90% of the way there. I would want something like "always prefer raster font for sizes <= 14 point" or something like that.

Text editors are a simple case because you just let the user set the font and get what they want, and it doesn't matter what size the text is because it's not laid out. PDF's and such I guess you go ahead and use TT all the time. The web is a weird hybrid which is semi-formatted. The problem with the web is that it doesn't tell you when formatting is important or not important. I'd like to override the basic firefox font to be my own choice nice bitmap font *when formatting is not important* (eg. in blocks of text like I make). But if you do that globally it hoses the layout of some pages. And then other pages will manually request fonts which are blurry bollocks.

CodeProject has a nice font survey with Cleartype/no-Cleartype screen caps.

GDI++ is an interesting hack to GDI32.dll to replace the font rendering.

Entropy overload has some decent hinted TTF fonts for programmers you can use in VS 2010.

Electronic Dissonance has the real awesome solution : sneak raster fonts into asian fonts so that VS 2010 / WPF will use them. This is money if you use VS 2010.

5/11/2010

05-11-10 - Some New Cblib Apps

Coded up some new goodies for myself today and released them in a new cblib and chuksh .

RunOrActivate : useful with a hot key program, or from the CLI. Use RunOrActivate [program name]. If a running process of that program exists, it will be activated and made foreground. If not, a new instance is started. Similar to the Windows built-in "shortcut key" functionality but not horribly broken like that is.

(BTW for those that don't know, Windows "shortcut keys" have had huge bugs ever since Win 95 ; they sometimes work great, basically doing RunOrActivate, but they use some weird mechanism which causes them to not work right with some apps (maybe they use DDE?), they also have bizarre latency semi-randomly, usually they launch the app instantly but occasionally they just decide to wait for 10 seconds or so).

RunOrActivate also has a bonus feature : if multiple instances of that process are running it will cycle you between them. So for example my Win-E now starts an explorer, goes to existing one if there was one, and if there were a few it cycles between explorers. Very nice. Also works with TCC windows and Firefox Windows. This actually solves a long-time usability problem I've had with shortcut keys that I never thought about fixing before, so huzzah.
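The decision logic is tiny once you have the list of windows. This is just a sketch of the run / activate / cycle choice, not the actual cblib code. Finding the windows and bringing one to the foreground (on Windows that would be something like EnumWindows and SetForegroundWindow) is left out, and the function name and handle values are made up :

```python
def pick_window(windows, foreground):
    """Decide what RunOrActivate-style logic should do.

    windows    : handles of the target program's windows, in a stable order
    foreground : handle of the window that currently has focus
    Returns the handle to activate, or None meaning "launch a new instance".
    """
    if not windows:
        return None  # nothing running: start the program
    if foreground in windows:
        # Already looking at one of them: cycle to the next, wrapping around.
        return windows[(windows.index(foreground) + 1) % len(windows)]
    return windows[0]  # running but not focused: just activate it

explorers = [0x10, 0x20, 0x30]  # made-up window handles
print(pick_window([], 0x99))          # None -> run a new explorer
print(pick_window(explorers, 0x99))   # some other app focused -> first explorer
print(pick_window(explorers, 0x30))   # last explorer focused -> wraps to first
```

The list order needs to be stable between presses or the cycling will jump around.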

WinMove : I've been using this forever, lets you move and resize the active window in various ways, either by manual coordinate or with some shorthands for "left half" etc. Anyway the new bit is I just added an option for "all windows" so that I can reproduce the Win-M minimize all behavior and Win-Shift-M restore all.

I think that gives me all Win-Key functions I actually want.

ADDENDUM : One slightly fiddly bit is the question of *which* window of a process to activate in RunOrActivate. Windows refuses to give you any concept of the "primary" window of a process, simply sticking to the assertion that processes can have many windows. However we all know this is bullshit because Alt-Tab picks out an isolated set of "primary" windows to switch between. So how do you get the list of alt-tab windows? You don't. It's "undefined", so you have to make it up somehow. Raymond Chen describes the algorithm used in one version of Windows.

5/09/2010

05-09-10 - Some Win7 Shite

Perforce Server was being a pain in my ass to start up because the fucking P4S service doesn't get my P4ROOT environment variable. Rather than try to figure out the fucking Win 7 per-user environment variable shite, the easy solution is just to move your P4S.exe into your P4ROOT directory, that way when it sees no P4ROOT setting it will just use current directory.

New P4 Installs don't include P4Win , but you can just copy it from your old install and keep using it.

This is not a Win7 problem so much as a "newer MS systems" problem, but non-antialiased / non-cleartype text rendering is getting nerfed. Old stuff that uses GDI will still render good old bitmap fonts fine, but newer stuff that uses WPF has NO BITMAP FONT SUPPORT. That is, they are always using antialiasing, which is totally inappropriate for small text (especially without cleartype). (For example MSVC 2010 has no bitmap font support (* yes I know there are some workarounds for this)).

This is a huge fucking LOSE for serious developers. MS used to actually have better small text than Apple, Apple always did way better at smooth large blurry WYSIWYG text shit. Now MS is just worse all around because they have intentionally nerfed the thing they were winning at. I'm very disappointed because I always run no-cleartype, no-antialias because small bitmap fonts are so much better. A human font craftsman carefully choosing which pixels should be on or off is so much better than some fucking algorithm trying to approximate a smooth curve in 3 pixels and instead giving me fucking blue and red fringes.

Obviously anti-aliased text is the *future* of text rendering, but that future is still pretty far away. My 24" 1920x1200 that I like to work on is 94 dpi (a 30" 2560x1600 is 100 dpi, almost the same). My 17" lappy at 1920x1200 has some of the highest pixel density that you can get for a reasonable price, it's pretty awesome for photos, but it's still only 133 dpi which is shit for text (*). To actually do good looking antialiased text you need at least 200 dpi, and 300 would be better. This is 5-10 years away for consumer price points. (In fact the lappy screen is the unfortunate uncanny valley; the 24" at 1920x1200 is the perfect res where non-antialiased stuff is the right size on screen and has the right amount of detail. If you just go to slightly higher dpi, like 133, then everything is too small. If you then scale it up in software to make it the right size for the eye, you don't actually have enough pixels to do that scale up. The problem is that until you get above 200 dpi where you can do arbitrary scaling of GUI elements, the physical size of the pixel is important, and the 100 dpi pixel is just about perfect). (* = shit for anti-aliased text, obviously great for raster fonts at 14 pels or so).
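The dpi numbers are just the pixel diagonal divided by the physical diagonal. A quick sketch to check them, assuming the 30" panel is the usual 2560x1600 and square pixels (the function name is mine) :

```python
import math

def dpi(width_px, height_px, diagonal_inches):
    """Pixels per inch along the diagonal (assumes square pixels)."""
    return math.hypot(width_px, height_px) / diagonal_inches

for name, w, h, diag in [
    ('24" desktop', 1920, 1200, 24),
    ('30" desktop', 2560, 1600, 30),
    ('17" laptop',  1920, 1200, 17),
]:
    print(f"{name}: {dpi(w, h, diag):.1f} dpi")
```

That comes out to roughly 94, 100 and 133 dpi, so all three are well short of the 200 dpi you'd want for anti-aliased text.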

( ADDENDUM : Urg I keep trying to turn on Cleartype and be okay with it. No no no it's not okay. They should call it "Clear Chromatic Aberration" or "Clearly the Developers who think this is okay are colorblind". Do they think our eyes only see luma !? WTF !? Introducing colors into my black and white text is just such a huge visual artifact that no amount of improvement to the curve shapes can make up for that. )

It's actually pretty sweet right now living in a world where our CPU's are nice and multi-core, but most apps are still single core. It means I can control the load on my machine myself, which is damn nice. For example I can run 4 apps and know that they will all be pretty nice and snappy. These days I am frequently keeping 3 copies of my video test app running various tests all the time, and since it's single core I know I have one free core to still fuck around on the computer and it's full speed. The sad thing is that once apps actually all go multi-core this is going to go away, because when you actually have to share cores, Windows goes to shit.

Christ why is the registry still so fucking broken? 1. If you are a developer, please please make your apps not use the registry. Put config files in the same dir as your .exe. 2. The Registry is just a bunch of text strings, why is it not fucking version controlled? I want a log of the changes and I want to know what app made the change when. WTF.

The only decent way to get environment variables set is with TCC "set /S" or "set /U".

"C:\Program Files (x86)" is a huge fucking annoyance. Not only does it break by muscle memory and break a ton of batch files I had that looked for program files, but now I have a fucking quandary every time I'm trying to hunt down a program.. err is it in x86 or not? I really don't like that decision. I understand it's needed for if you actually have an x86 and x64 version of the same app installed, but that is very rare, and you should have only bifurcated paths on apps that actually do have a dual install. (also because lots of apps hard code to c:\program files , they have a horrible hack where they let 32 bit apps think they are actually in c:\program files when they are in "C:\Program Files (x86)"). Blurg.


[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\PrefetchParameters]
"EnableSuperfetch"=dword:00000000
"EnablePrefetcher"=dword:00000000

[HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Policies\Explorer]
"NoWinKeys"=dword:00000001

[HKEY_CURRENT_USER\Control Panel\Desktop]
"MenuShowDelay"="10"

Some links :

Types - Vista/Win7 has borked the "File Associations" setup. You need a 3rd party app like Types now to configure your file types (eg. to change default icons).

Shark007.net - Windows 7 Codecs - WMP12 Codecs - seem to work.

Pismo Technic Inc. - Pismo File Mount - nicest ISO mounter I've found (Daemon tools feels like it's made out of spit and straw).

Hot Key Plus by Brian Apps - ancient app that still works and I like because it's super simple.

Change Windows 7 Default Folder Icon - Windows 7 Forums ; presumably you have the Preview stuff for Folders turned off, so now make the icon not so ugly.

- how to move your perforce depot Annoyingly I used a different machine name for new lappy and thus a different clientview, so MSVC P4SCC fails to make the connection and wants to rebind every project. The easiest way to fix this is just to not use P4SCC and kill all your bindings and just use NiftyPerforce without P4SCC.

allSnap make all windows snap - AllSnap for x64/Win7 seems to be broken, but the old 32 bit one seems to work just fine still. (ADDENDUM : nope, old allsnap randomly crashes in Win 7, do not use)

KeyTweak Homepage - I used KeyTweak to remap my Caps Lock to Alt.

Firefox addons :


DownThemAll
Flashblock
PDF Download
Stop Autoplay
Adblock Plus
DownloadHelper

ADDENDUM : I found the last few guys who are ticking my disk :

One that you obviously want to disable is Windows Media Player Media sharing service : Fix wmpnetwk.exe In Windows 7 . It just constantly scans your media dirs for shit to serve. Fuck you.

The next big culprit is the Windows Reliability stuff. Go ahead and disable the RAC scheduled task, but that's not the real problem. The nasty one is the "last alive stamp" which windows writes once a minute by default. This is to help diagnose crashes. You could change TimeStampInterval to 30 or so to make it once every 30 minutes, but I set it to zero to disable it. See :
Shutdown Event Tracker Tools and Settings System Reliability
How to disable hard disk thrashing with Vista - Page 7

And this is a decent summary/repeat of what I've said : How to greatly reduce harddisk grinding noises in Vista .