Historically the way you address this is to run lots of trials (say 100) and take the MIN time over all the trials.
(* important note : if you aren't trying to time "hot cache" performance, you need to wipe the cache between each run. I dunno if there's an easy instruction or system call that would invalidate all cache pages; what I usually do is have a routine that goes and munges over some big array).
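A rough sketch of that kind of cache-munging routine, for what it's worth. The name WipeCache, the 64 MB default and the 64-byte stride are all my assumptions, not anything measured; the buffer just needs to be comfortably bigger than your last-level cache.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: touch one byte per (assumed) 64-byte cache line of a
// buffer that is much larger than the cache, so older data gets evicted.
// Not thread-safe because of the shared static buffer.
uint64_t WipeCache(std::size_t bytes = 64u * 1024 * 1024)
{
    static std::vector<unsigned char> junk;
    if (junk.size() < bytes)
        junk.assign(bytes, 1);

    uint64_t sum = 0;
    for (std::size_t i = 0; i < junk.size(); i += 64)
    {
        // write as well as read, so the lines are actually pulled in and dirtied
        junk[i] = static_cast<unsigned char>(junk[i] + 1);
        sum += junk[i];
    }
    return sum; // returned so the compiler can't throw the loop away
}
```

Call it right before each timed run when you want cold-cache numbers. It is a heuristic, not a guaranteed flush.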
It's a bit better these days because of many cores. Now you can quite often find a core which is unmolested by annoying services popping up and stealing CPU time and messing up your profile. But sometimes you get unlucky, and your process runs on an IdealProc that has some other shite.
So a simple loop helps :
template <typename t_func>
uint64 GetFuncTime( t_func * pfunc )
{
    HANDLE proc = GetCurrentProcess();
    HANDLE thread = GetCurrentThread();

    // which cores is this process allowed to run on ?
    DWORD_PTR affProc,affSys;
    GetProcessAffinityMask(proc,&affProc,&affSys);

    uint64 tick_range = 1ULL << 62;

    // pin the thread to each of the first 24 cores in turn, keep the fastest run
    for(int rep=0;rep<24;rep++)
    {
        DWORD_PTR mask = ((DWORD_PTR)1)<<rep;
        if ( mask & affProc )
            SetThreadAffinityMask(thread,mask);
        else
            continue;

        uint64 t1 = __rdtsc();
        (*pfunc)();
        uint64 t2 = __rdtsc();

        uint64 cur_tick_range = t2 - t1;
        tick_range = MIN(tick_range,cur_tick_range);
    }

    // restore to the process mask ; a mask with bits outside of it gets rejected
    SetThreadAffinityMask(thread,affProc);

    return tick_range;
}

which makes it reasonably probable that you get a clean run on some core. For published results you will still want to repeat the whole thing N times.
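Repeating the whole thing N times and keeping the MIN could look something like this. MinOverTrials is a made-up name, and the `measure` callable is assumed to do one complete measurement and return its tick count.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical wrapper: call `measure` `trials` times and keep the smallest result.
// Returns UINT64_MAX if trials <= 0.
template <typename t_measure>
uint64_t MinOverTrials(t_measure && measure, int trials = 100)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; i++)
    {
        // for cold-cache numbers, wipe the cache here before each measurement
        best = std::min<uint64_t>(best, measure());
    }
    return best;
}
```

With the function above that would be something like MinOverTrials( [](){ return GetFuncTime(&MyFunc); }, 100 ), where MyFunc stands in for whatever you are timing.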
I'm just curious, is there benefit of using "template <typename t_func>" vs uint64 GetFuncTime( void (*pfunc)() ) or just being cplusplusey?
Works with other func types. In particular you don't have to get the __cdecl or whatever nonsense right.
In practice I use a version that also takes templated args and passes them through.
You're welcome to change it to void *pfunc if that works for you.
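For the curious, a version that passes args through might look roughly like this. It is only a sketch: TimeCallNanos is a hypothetical name, it uses std::chrono::steady_clock in place of __rdtsc so that it compiles anywhere, and it leaves out the affinity loop.

```cpp
#include <chrono>
#include <cstdint>
#include <utility>

// Hypothetical variant: time a single call of func(args...) in nanoseconds.
// Works with function pointers, lambdas and functors of any calling convention.
template <typename t_func, typename... t_args>
uint64_t TimeCallNanos(t_func && func, t_args &&... args)
{
    using clock = std::chrono::steady_clock;

    clock::time_point t1 = clock::now();
    std::forward<t_func>(func)(std::forward<t_args>(args)...);
    clock::time_point t2 = clock::now();

    return (uint64_t) std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
}
```

The args are forwarded once, so this times one call per invocation. Taking the MIN over many calls is left to the caller.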