In the same family as this post :
cbloom rants 06-14-11 - A simple allocator
On newer chips with POPCNT this should be reasonably fast (assuming query is common and insert is rare, since insert requires a realloc/memmove or something similar). (and of course everything is a disaster on platforms where variable shift is slow).
struct PackedArray { uint32 mask; // numItems = num_bits_set(mask) int32 items[1]; // variable allocation size }; static inline uint32 num_bits_set( uint32 v ) { //return _mm_popcnt_u32(v); // from "Bit Twiddling Hacks" : v = v - ((v >> 1) & 0x55555555); // reuse input as temporary v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // temp uint32 c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count return c; } bool PackedArray_Get(const PackedArray * a,int index,int32 * pgot) { ASSERT( index >= 0 && index < 32 ); uint32 mask = 1UL << index; if ( a->mask & mask ) { // it exists, find it uint32 numPreceding = num_bits_set( a->mask & (mask-1) ); *pgot = a->items[numPreceding]; return true; } else { return false; } } bool PackedArray_Put(PackedArray * a,int index,int32 item) { ASSERT( index >= 0 && index < 32 ); uint32 mask = 1UL << index; uint32 numPreceding = num_bits_set( a->mask & (mask-1) ); if ( a->mask & mask ) { // it exists, replace a->items[numPreceding] = item; return true; } else { // have to add it // realloc items here or whatever your scheme is // make room : uint32 numFollowing = num_bits_set(a->mask) - numPreceding; // slide up followers : int32 * pitem = a->items + numPreceding; memmove(pitem+1,pitem,numFollowing*sizeof(int32)); // put me in *pitem = item; a->mask |= mask; return false; } }
(BTW the GCC builtins are annoying. What they should have done is provide a way to test if the builtin actually corresponds to a machine instruction, because if it doesn't I generally don't want their fallback implementation, I'll do my own thank you.)
2 comments:
"(and of course everything is a disaster on platforms where variable shift is slow)"
This one can be resolved for the low, low price of a single cache line on the most annoying platforms with that particular misfeature. :)
Similarly, a small LUT (32-64 entries) for small integer->float works well.
True, I guess 1 << n is easy; x << n is hard.
On most platforms this code is faster with a cached occupation count, but I hate redundant state so I left it out of the post.
Post a Comment