In the same family as this post :
cbloom rants 06-14-11 - A simple allocator
On newer chips with POPCNT this should be reasonably fast (assuming query is common and insert is rare, since insert requires a realloc/memmove or something similar). (and of course everything is a disaster on platforms where variable shift is slow).
struct PackedArray
{
uint32 mask; // numItems = num_bits_set(mask)
int32 items[1]; // variable allocation size
};
static inline uint32 num_bits_set( uint32 v )
{
//return _mm_popcnt_u32(v);
// from "Bit Twiddling Hacks" :
v = v - ((v >> 1) & 0x55555555); // reuse input as temporary
v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // temp
uint32 c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count
return c;
}
bool PackedArray_Get(const PackedArray * a,int index,int32 * pgot)
{
ASSERT( index >= 0 && index < 32 );
uint32 mask = 1UL << index;
if ( a->mask & mask )
{
// it exists, find it
uint32 numPreceding = num_bits_set( a->mask & (mask-1) );
*pgot = a->items[numPreceding];
return true;
}
else
{
return false;
}
}
bool PackedArray_Put(PackedArray * a,int index,int32 item)
{
ASSERT( index >= 0 && index < 32 );
uint32 mask = 1UL << index;
uint32 numPreceding = num_bits_set( a->mask & (mask-1) );
if ( a->mask & mask )
{
// it exists, replace
a->items[numPreceding] = item;
return true;
}
else
{
// have to add it
// realloc items here or whatever your scheme is
// make room :
uint32 numFollowing = num_bits_set(a->mask) - numPreceding;
// slide up followers :
int32 * pitem = a->items + numPreceding;
memmove(pitem+1,pitem,numFollowing*sizeof(int32));
// put me in
*pitem = item;
a->mask |= mask;
return false;
}
}
(BTW the GCC builtins are annoying. What they should have done is provide a way to test if the builtin actually corresponds to a machine instruction, because if it doesn't I generally don't want their fallback implementation, I'll do my own thank you.)
2 comments:
"(and of course everything is a disaster on platforms where variable shift is slow)"
This one can be resolved for the low, low price of a single cache line on the most annoying platforms with that particular misfeature. :)
Similarly, a small LUT (32-64 entries) for small integer->float works well.
True, I guess 1 << n is easy; x << n is hard.
On most platforms this code is faster with a cached occupation count, but I hate redundant state so I left it out of the post.
Post a Comment