3/02/2009

03-02-09 - Sleep Sucks and VSync Woes

Ok, first let me describe the issue :

You want to hit 60 fps (or whatever) pretty reliably. Each frame there is some amount of work you *must* do (eg. rendering, responding to input), and there is some amount of work that you would like to do (eg. streaming in data, decompressing, etc.). You want to do as much of the optional work as possible each frame, without using too much time.

You would also like to behave well with other apps. eg. you don't want to just spin and eat 100% of the CPU when you're idle. (this is especially important for casual games where the user is likely browsing the web while playing). To do this you want to Sleep() or something with your idle time.

So, ideally your frame loop looks something like :


-- page flip --

start = timer();

[ do necessary work ]

remaining = start + frame_duration - timer()

[ do optional work as remaining time allows ]

remaining = start + frame_duration - timer()

[ if remaining is large , sleep a while ]

-- page flip --

You want the duration of these three things to add up :

(necessary work) + (optional work) + (sleep time) <= frame_duration

but as close to frame_duration as possible without going over
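As a rough sketch of the budgeting step in that loop (not code from my engine), here is one way to spend the remaining time on optional work and then report what is left over for sleeping. The names spend_frame_budget, FramePlan, task_budget and safety_margin are all made up for illustration, the clock is passed in so you can drive it with a real timer or a fake one, and it assumes optional work can be chopped into slices that each take about task_budget millis.

```cpp
#include <cassert>
#include <deque>
#include <functional>

// All times are in milliseconds. The clock is injected so the same code can
// run against a real monotonic timer or a fake one.
using Clock = std::function<double()>;

struct FramePlan {
    int    optional_tasks_run = 0;   // optional work items that fit this frame
    double sleep_millis       = 0.0; // idle time left before the flip
};

// Runs optional tasks while at least task_budget millis remain before the
// deadline, then reports the leftover time. safety_margin is time we refuse
// to spend so that the flip itself is not late.
FramePlan spend_frame_budget(const Clock& now, double frame_start,
                             double frame_duration, double task_budget,
                             double safety_margin,
                             std::deque<std::function<void()>>& optional_tasks)
{
    assert(frame_duration > 0.0 && task_budget > 0.0 && safety_margin >= 0.0);
    const double deadline = frame_start + frame_duration - safety_margin;

    FramePlan plan;
    while (!optional_tasks.empty() && deadline - now() >= task_budget) {
        optional_tasks.front()();    // one slice of optional work
        optional_tasks.pop_front();
        ++plan.optional_tasks_run;
    }

    const double remaining = deadline - now();
    plan.sleep_millis = remaining > 0.0 ? remaining : 0.0;
    return plan;
}
```

Whether you can actually trust a sleep of sleep_millis is a separate problem, and it is the one the rest of this post is about.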

Okay. There is one specific case where this is pretty easy : if you are in full screen, using V-sync page flips, *AND* your graphics driver is actually polite and sleeps for V-sync instead of just spinning. I don't know what the state of modern drivers is, but in the past when I looked into this I found that many major companies don't actually sleep-to-vsync on Windows, they busy spin. (this is not entirely their fault, as there is not a very good vsync interrupt system on windows).

Sleep-to-vsync is important not just because you want to give time to other apps, but because the main thread wants to give time to the background threads. eg. if you want to put streaming on a lower priority background thread, then you write your loop just like :


-- page flip --
[ do necessary work ]
-- request page flip --
    puts main thread to sleep
    [ optional work runs until vsync interrupt switches me back ]
-- page flip --

and you assume the page flip will sleep you to vsync which will give the background loader time to run.

Even if your driver is nice and gives you a sleep-to-vsync, you can't use it in windowed mode.

A lot of games have horrible tearing in windowed mode. It seems to me that you could fix this by using the Dx9+ GetRasterStatus calls. You simply don't do a flip while the raster scan line is inside your window (or right near the top of your window). You wait for it to be past the bottom or in the vblank area or far enough above you. That should eliminate tearing and also make you hit the frame time more exactly because you sync to the raster.

(if you just use a timer without using the raster you can get the worst possible tearing, where a flip line scans up and down on your image because you're just barely missing sync each frame).
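The raster test itself is simple enough to sketch. This is an untested-on-hardware illustration, not a recipe: RasterStatus is a plain stand-in for the InVBlank and ScanLine fields that GetRasterStatus fills in, and safe_to_flip and lead_lines are hypothetical names. It assumes the scan line and the window edges are in the same screen coordinates.

```cpp
#include <cassert>

// Stand-in for the two fields of the raster status that matter here.
struct RasterStatus {
    bool     in_vblank;  // true while the raster is in vertical blank
    unsigned scan_line;  // current scan line, in screen coordinates
};

// Returns true if a flip started now should not tear inside the window:
// the raster is in vblank, already past the bottom of the window, or at
// least lead_lines above the top of it.
bool safe_to_flip(const RasterStatus& raster, unsigned window_top,
                  unsigned window_bottom, unsigned lead_lines)
{
    assert(window_top <= window_bottom);
    if (raster.in_vblank) return true;
    if (raster.scan_line > window_bottom) return true;
    // Raster is above or inside the window: require some headroom above it.
    return raster.scan_line + lead_lines < window_top;
}
```

How big lead_lines needs to be depends on how long the blit takes, which you would have to measure.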

But we still have the problem that we can't sleep to vsync, we basically have to spin and poll the raster.

Now you might think you could look at the remaining time and try to sleep that amount. Ha! Sleep(millis) is a huge disaster. Here's a sampling of how long you actually sleep when you call Sleep(0) or Sleep(1) :


sleepMillis : 0 , actualMillis : 28.335571
sleepMillis : 0 , actualMillis : 22.029877
sleepMillis : 0 , actualMillis : 3.383636
sleepMillis : 0 , actualMillis : 18.398285
sleepMillis : 0 , actualMillis : 22.335052
sleepMillis : 0 , actualMillis : 6.214142
sleepMillis : 0 , actualMillis : 3.513336
sleepMillis : 0 , actualMillis : 15.213013
sleepMillis : 0 , actualMillis : 16.242981
sleepMillis : 0 , actualMillis : 8.148193
sleepMillis : 1 , actualMillis : 5.401611
sleepMillis : 0 , actualMillis : 15.296936
sleepMillis : 0 , actualMillis : 16.204834
sleepMillis : 1 , actualMillis : 7.736206
sleepMillis : 0 , actualMillis : 3.112793
sleepMillis : 0 , actualMillis : 16.194344
sleepMillis : 0 , actualMillis : 6.464005
sleepMillis : 0 , actualMillis : 3.378868
sleepMillis : 0 , actualMillis : 22.548676
sleepMillis : 0 , actualMillis : 4.644394
sleepMillis : 0 , actualMillis : 28.266907
sleepMillis : 0 , actualMillis : 51.839828
sleepMillis : 0 , actualMillis : 7.650375
sleepMillis : 0 , actualMillis : 19.336700
sleepMillis : 0 , actualMillis : 5.380630
sleepMillis : 0 , actualMillis : 6.172180
sleepMillis : 0 , actualMillis : 30.097961
sleepMillis : 0 , actualMillis : 17.053604
sleepMillis : 0 , actualMillis : 3.190994
sleepMillis : 0 , actualMillis : 27.353287
sleepMillis : 0 , actualMillis : 23.557663
sleepMillis : 0 , actualMillis : 27.498245
sleepMillis : 0 , actualMillis : 26.178360
sleepMillis : 0 , actualMillis : 29.123306
sleepMillis : 0 , actualMillis : 23.521423
sleepMillis : 0 , actualMillis : 6.811142

You basically cannot call Sleep() at all EVER in a Windows game if you want to reliably hit your frame rate.
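If you want to take the same kind of measurement yourself, a portable sketch looks something like this. It is not the code that produced the numbers above: it uses std::this_thread::sleep_for and std::chrono::steady_clock as stand-ins for Sleep() and a high resolution timer, so the results will differ from Win32 Sleep() (in particular sleep_for with a zero duration is not guaranteed to behave like Sleep(0)). measure_sleep_millis is a made-up name.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Asks the OS to sleep for requested_millis and returns how long the call
// actually took, in milliseconds.
double measure_sleep_millis(int requested_millis)
{
    assert(requested_millis >= 0);
    using clock = std::chrono::steady_clock;
    const auto before = clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(requested_millis));
    const auto after = clock::now();
    return std::chrono::duration<double, std::milli>(after - before).count();
}
```

Call it in a loop while other threads are doing IO; the quiet single-threaded case tends to look much better than the busy one.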

And yes yes this is with timeBeginPeriod(1) and this is even with setting my priority to THREAD_PRIORITY_TIME_CRITICAL. And no nothing at all is running on my machine (I'm sure the average user machine that has fucking Windows Crippleware grinding away in the background all the time is far worse).

So I have no solution and I am unhappy.

There definitely seems to be some weird stuff going on in the scheduler triggering this. Like if I just sit in my main thread and do render-flip , with no thread switches and no file IO, then my sleep times seem to be pretty reliable. If I kick some streaming work that activates the other threads and does file IO, then my sleep times suddenly go nuts - and they stay nuts for about a second after all the streaming work is done. It's like the scheduler increases my time slice quantum and then lets it come back down later.

My links on this topic :

Using Waitable Timers with an Asynchronous Procedure Call (Windows)
Timing in Win32 - Geiss
Priority Boosts (Windows)
Molly Rocket View topic - Why Win32 is a bad platform for games - part 1
IDirectDrawGetScanLine
How to fight tearing - virtualdub.org
Examining and Adjusting Thread Priority
Detecting Vertical Retrace with crazy vxd
CodeProject Tearing-free drawing with mmtimer
CodeGuru Creating a High-Precision, High-Resolution, and Highly Reliable Timer, Utilising Minimal CPU Resources

I am the linkenator. I am the linkosaurus rex. Linkotron 5000. I have strong Link-Fu. I see your Links are as big as mine. My name is Inlinko Montoya, you linked my father, prepare to be linked!


ADDENDUM : The situation seems to be rather better in Dx9+. You can use VSync in Windowed Mode, and the VSync does in fact seem to sleep your app. (older versions of Dx could only Vsync in full screen, I didn't expect this to actually work, but it does).

However, I am still seeing some really weird stuff with it. For one thing, it seems to frequently miss VSync for no good reason. If you take one of the trivial Dx9 sample apps from the SDK, in their framework you will find that they don't display the frame rate when Vsync is On. Sneaky little bastards, it hides their flaw.

In any sample app using the DXUT framework you can find a line like :


txtHelper.DrawTextLine( DXUTGetFrameStats( DXUTIsVsyncEnabled() ) );

This line shows the frame rate only if Vsync is Off. Go ahead and try turning off VSync - you should see a framerate like 400 - 800 fps. Now change it to :

txtHelper.DrawTextLine( DXUTGetFrameStats( true ) );

so that you show framerate even if Vsync is on. You will see a framerate of 40 !!!

40 fps is what you get when you alternate between a 30 fps and a 60 fps frame. ((1/30) + (1/60))/2 = 1/40 . That means they are missing vsync on alternating frames.
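To spell the arithmetic out: the mean frame rate is the frame count divided by the total elapsed time, not the mean of the per-frame fps values (which would give (30 + 60)/2 = 45). A tiny sketch, with average_fps being a name I made up :

```cpp
#include <cassert>

// Mean frame rate over a run of frames = frame count / total elapsed time.
double average_fps(const double* frame_seconds, int count)
{
    assert(count > 0);
    double total = 0.0;
    for (int i = 0; i < count; ++i) total += frame_seconds[i];
    return count / total;
}
```

Feeding it one 1/30 second frame and one 1/60 second frame gives 2 / (1/20) = 40.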

If you go and stick a timer around just the Present() call to measure the duration spent in Present, you should see something like :


no vsync :
    duration : lo : 1.20 millis , hi : 1.40 millis

vsync :

    duration : lo : 15.3 millis , hi : 31.03 millis

Yuck. Fortunately, the samples are really easy to fix. Just put this line anywhere in the sample app :

timeBeginPeriod(1);

to change the scheduler granularity. This says to me that Present() is actually just calling the OS Sleep() , and is not using a nice interrupt or anything like that, so if your scheduler granularity is too coarse you will sleep too long and miss vsync a lot. In fact I would not be surprised if Present just had a loop like :
    for(;;)
    {
        if ( IsRasterReady() ) break;

        Sleep(1);
    }
which sucks donkey balls for various reasons. (* addendum : yes they are doing this, see below). If in fact they are doing what I think they are doing, then you will hit vsync more reliably if you pump up your thread priority before calling Present :

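    // remember the current priority so it can be restored after the flip
    int oldp = GetThreadPriority( GetCurrentThread() );
    // something really high ; THREAD_PRIORITY_TIME_CRITICAL is the highest thread priority level
    SetThreadPriority( GetCurrentThread() , THREAD_PRIORITY_TIME_CRITICAL );

    Present( ... );

    SetThreadPriority( GetCurrentThread() , oldp );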
    
this tries to ensure that the scheduler will actually switch back to you for your vsync the way you want. (this has no effect in the samples cuz they have no other threads, but may be necessary if you're actually stressing the system).

Note the samples also freak out if they don't have focus. That's because they actually call Sleep(50) when they don't have focus. That's a cool thing and not a bug ;)

To wrap up, here's the approach that I believe works :


Use dx9+

Use Vsync and 1 back buffer

Bump up thread priority & scheduler interval to make sure Present() doesn't miss

Make main thread higher priority than worker threads so that workers run when main is sleeping
(must also check that workers are not starving)

On machines with speed-step , *and* if the app has focus :
Run a very low priority thread that just spins to keep the cpu clocked up

and cross your fingers and pray.

BTW whenever you show something like framerate you should show four numbers :


Instant current value this frame.
Average over last N frames.
Low over last N frames.
High over last N frames.

with N like 10 or something.
obviously you can do this trivially with a little circular buffer.

You can see a lot of things that way. For example when you're doing something like missing vsync on alternating frames, you will see the average is perfectly stable, a rock solid 40.0 , but the instantaneous value is jumping up and down like mad.

BTW cb::circular_array does this pretty well.
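For anyone without cb::circular_array handy, here is a minimal sketch of the four-number display using a plain std::array as the ring. FrameStats and FpsReport are hypothetical names, not anything from my library. Note that it stores frame durations and computes the average as frames / total time, which is what makes the alternating-miss case read as a steady 40.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

struct FpsReport { double instant, average, low, high; };

// Keeps the last N frame durations (in seconds) in a ring buffer.
template <std::size_t N>
class FrameStats {
    static_assert(N > 0, "need at least one slot");
public:
    void add_frame(double seconds)
    {
        assert(seconds > 0.0);
        durations_[next_] = seconds;       // overwrite the oldest slot
        next_ = (next_ + 1) % N;
        if (count_ < N) ++count_;
    }

    // Instant, average, low and high fps over the frames currently stored.
    FpsReport report() const
    {
        assert(count_ > 0);
        const std::size_t newest = (next_ + N - 1) % N;
        double total = 0.0;
        double shortest = durations_[newest];
        double longest  = durations_[newest];
        for (std::size_t i = 0; i < count_; ++i) {
            const double d = durations_[i];
            total += d;
            if (d < shortest) shortest = d;
            if (d > longest)  longest  = d;
        }
        // The longest frame gives the low fps, the shortest gives the high.
        return { 1.0 / durations_[newest], count_ / total,
                 1.0 / longest, 1.0 / shortest };
    }

private:
    std::array<double, N> durations_{};
    std::size_t next_  = 0;  // slot the next frame will be written to
    std::size_t count_ = 0;  // number of valid slots, at most N
};
```

Call add_frame once per frame with the measured duration and draw report() wherever you draw your frame rate.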

ADDENDUM : I found a note in the Dx9 docs that is quite illuminating. They mention that if you use D3DPRESENT_INTERVAL_DEFAULT, the action is to sync to vblank. If you use D3DPRESENT_INTERVAL_ONE, the action is the same - but there is a difference, it automatically does a timeBeginPeriod(1) for you to help you hit vsync more precisely. This tells me that they are in fact doing the Sleep()/GetRasterStatus loop that I suspected.

14 comments:

won3d said...

Reminds me a bit of this:

https://mollyrocket.com/forums/viewtopic.php?t=520

Also, I'm curious if SwitchToThread helps you at all.

http://msdn.microsoft.com/en-us/library/ms686352(VS.85).aspx

cbloom said...

Urg I need to fix the damn blogger putting things out of order.

cbloom said...

Yeah I linked to Casey's post, there's a lot of stuff he gets right in there.

One is that if you let your machine speedstep, all bets are off. That sucks because I would like to let my lappy chill out, but it doesn't seem possible. So you need a super low priority thread to just spin.

The other thing is about the priorities and the thread switching. Maybe one reason Windows was seeing those particular weird priority issues is because of the boosting :

http://msdn.microsoft.com/en-us/library/ms684828(VS.85).aspx

Presumably the background thread is getting stuff like IO Completions which will give it a boost and mess up what you're trying to accomplish.

I guess one issue is that Windows schedules threads *globally* - so if you want your threads to really get the time slices you ask for, you may need to bump up all your priorities so that you're higher than everything else in the system.

cbloom said...

Yeah so the Casey method is pretty good I think. I still get some bad over-time sleeps though. It helps if I bump the main thread up to HIGHEST all the time. And I think it also helps if I bump the whole Process priority class up.

Basically the goal is that the main rendering loop is super high priority so it only gives up time when it really wants to.

This does have two disadvantages though :

1. It increases the latency of IO because now IO really only runs in the time slot where you yield to the worker thread. Ideally you could let it get some IO's started earlier in the frame and then have it sleep waiting for them to finish, and it could process them in the time slot you give it later.

2. You have to worry about starvation and getting behind. Because you're only giving up the little time gap that you can afford to give up without missing framerate, if you aren't rendering fast enough so you keep giving up too little time, your background work can keep building up and up. Now you need some heuristic to go ahead and eat a big hitch and flush the processing queues.

Sigh.

Anonymous said...

A lot of games have horrible tearing in windowed mode. It seems to me that you could fix this by using the Dx9+ GetRasterStatus calls.

The popcap framework uses one of those sorts of APIs to do exactly that, to try to avoid tearing. I replicated their algorithm for Lost in the Static, but I didn't test on lots of machines so I'm not positive I got it right. I used the DirectDraw function GetScanLine, and blit the window in two pieces; the top piece draws after the raster position reaches the bottom half (but not too close to the bottom edge), and then as soon as the raster pos passes the bottom of the window I blit the rest. This scales properly between windowed and fullscreen and can handle slow blits without tearing, up to the point where the blit takes 1.5 frames to complete. I eventually added code to allow it to start blitting during vertical blank, but that requires the blit to be faster than the refresh, and actually it looks like it's just broken in that case.

code snippet

And I've never tried to make it work for OpenGL windowed apps or such.

castano said...

What about CreateWaitableTimer? Have you tried that?

http://meshula.net/wordpress/?p=189

cbloom said...

WaitableTimers don't seem to be any more reliable than Sleep()

this is what I just got from WaitableTimer :

sleepMillis : 14 , actualMillis : 26.492477
sleepMillis : 15 , actualMillis : 28.600931
sleepMillis : 15 , actualMillis : 32.035351
sleepMillis : 15 , actualMillis : 20.810127
sleepMillis : 15 , actualMillis : 32.970428

cbloom said...

Mmm.. I take that back, maybe WaitableTimer is better. But the code sample on meshula doesn't work. You have to use the priority boost trick :



int oldpri = GetThreadPriority( GetCurrentThread() );
SetThreadPriority( GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL );

WaitForSingleObject(Timer,INFINITE);

SetThreadPriority( GetCurrentThread(), oldpri );


and that actually seems pretty reliable so far.

ElonNarai said...

I was wondering, but wouldn't it be easier to just use dynamic timesteps to update everything?
If the user is browsing the web then the application is out of focus which means you can use another update scheme or just reduce the priority all the way and use a different rendering scheme or put a sleep of 100 ms in it to reduce fps to 10. The game time will continue to have the same speed (thanks to deltatime) but since you calculate once every 100 ms you free up the rest of the time.

Further, setting the priority in windows is a suggestion. If you tell windows that you are time critical (and assuming all the other applications are asleep) it might still not allocate all its processing power to you.

Besides that some cards report that vsync works while they are running far above the vertical refresh speed

cbloom said...

"I was wondering, but wouldn't it be easier to just use dynamic timesteps to update everything?"

I do use dynamic timesteps, but I still want to hit vsync if possible for various reasons.

1. you avoid tearing

2. there's no point in rendering faster than vsync

3. the game feels better if it's at constant framerate even with very good dynamic timestep support

"Further setting the priority in windows is a suggestion."

Yeah I mentioned a little about that. It definitely does help, though it makes stability a real risk and it's a pain during development. (partly this is because the damn stupid way that TaskMgr is just a normal process and it's not very high priority)

Nick said...

I am curious about what you meant by "doesn't work"? I'm guessing that by adding the priority boost, you got a stable result?

I'm going to dig through my app and see if I had a lurking priority boost hanging around in a dark corner.

cbloom said...

I assume this is Nick from meshula?

Yeah, I just mean "doesn't work" as in "doesn't reliably actually wait for the amount you asked for".

If you do the priority-bump trick and timeBeginPeriod() it helps a lot, but it doesn't actually 100% make it reliable.

You won't see the bad behavior unless you actually have other threads running (and in particular other threads doing IO).

Also, the Waitable Timer fractional precision seems to be a bit of a false promise. I don't think it's actually used at all unless your machine is completely idle. If you actually have to switch from another running thread, it's still done on the scheduler granularity, which is never less than 1 milli.

Hell if I could reliably even get 1 milli precision I'd be very happy.

Unknown said...

yo having the same problem, i wish vsync could be a non-busy wait :(

Unknown said...

Read your thread. With multi-CPU machines becoming common, would it help if you used SetThreadAffinityMask to isolate your output thread to one CPU? By the way this is maddening and has been an issue for a decade. Just shows how stupid MS and Windows really is.
