bpo-42673 prevent branch misprediction in round_size (used in rehash) #23833
Conversation
Replace the loop in round_size using bit_length to prevent branch misprediction delays.
Forgot the #include...
The fact that test_asyncio failed on Windows x86 shouldn't have anything to do with this: the change is deep in the core, and if there actually were an error there, many more tests should fail. After forcing a new test run, all was OK. There is something fishy with this test ...
Updated comment. I hope the new build test doesn't fail. (That wasn't my fault!)
Python/hashtable.c
Outdated
    return i;
// _Py_bit_length(s) is the smallest k so that 1 << k > s
// subtract one in case s is an exact power of two, saving space
return 1 << _Py_bit_length(s - 1);
Can you please try _Py_SIZE_ROUND_DOWN() which could be even faster? It uses simple operations:
#define _Py_SIZE_ROUND_DOWN(n, a) ((size_t)(n) & ~(size_t)((a) - 1))
About _Py_bit_length(): it has a special case for 0, but s >= HASHTABLE_MIN_SIZE.
That's brilliant! I didn't know about that macro.
EDITED:
... but that macro doesn't do what we want: we want to round up to the nearest power of two, not to a multiple of a given power of two.
I've been racking my brain for a quick binary trick to do that, but this is the fastest one I could find.
For me, this macro is pure black magic... but oops, I picked the wrong macro: I was thinking of _Py_SIZE_ROUND_UP().
/* Below "a" is a power of 2. */
/* Round down size "n" to be a multiple of "a". */
#define _Py_SIZE_ROUND_DOWN(n, a) ((size_t)(n) & ~(size_t)((a) - 1))
/* Round up size "n" to be a multiple of "a". */
#define _Py_SIZE_ROUND_UP(n, a) (((size_t)(n) + \
(size_t)((a) - 1)) & ~(size_t)((a) - 1))
Maybe _Py_SIZE_ROUND_UP() should be rewritten as a static inline function to ensure that the (a - 1) expression is only computed once.
@@ -109,10 +110,9 @@ round_size(size_t s)
    size_t i;
    if (s < HASHTABLE_MIN_SIZE)
Micro-optimisation :-)
Suggested change:
-    if (s < HASHTABLE_MIN_SIZE)
+    if (s <= HASHTABLE_MIN_SIZE)
This may actually make the code slower. The value passed to round_size will almost never be < HASHTABLE_MIN_SIZE, but it is sometimes equal.
Branch prediction will probably be faster if it can assume the condition is always false.
@serhiy-storchaka: using _Py_SIZE_ROUND_DOWN() avoids the loop and might be faster, no? What do you think?
In the last patch, I forgot to remove an unused variable.
That's a good point. So this would fail if the hashtable gets a number of entries that couldn't fit in an unsigned long, which is indeed possible in theory (but only for humongous computers: 8 << 32 = 32 GB for a single hash table; that would be a bit hard to test :-).
I think the best solution would be to fix the macro to handle this case.
Looking at the definition of _Py_bit_length(), I am afraid to copy it for size_t. What if size_t is equivalent to unsigned long? Or, more accurately, what is the safe #if to write to make sure there is no double definition? Something like — or is there a better way?
This is needed for the new version of the hashtable.c, in case the hashtable is extremely large (>32GB).
Apparently, you can't use sizeof in an #if
There should be a good way to check sizeof(size_t) > sizeof(unsigned long); I hope this is the one
Swapped a } and an #endif...
Made a separate routine _Py_bit_length_size_t since overloading doesn't seem to work
Use the separate _Py_bit_length_size_t so that round_size works in extreme cases of hashtables >32GB
Cast the 1 to size_t to get a proper shift
I changed one size_t too many
I am determined to get this to work!
// Same routine as above, for size_t, if that is bigger than unsigned long
// This means bitscanreverse (as used above) is not going to work
static inline int
_Py_bit_length_size_t(size_t x)
I would prefer to add a _Py_next_pow2() function which takes a size_t parameter. So we can use more efficient implementation when __builtin_clzl() and _BitScanReverse() are not available. Like:
static inline uint32_t next_pow2(uint32_t x)
{
x -= 1;
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
return x + 1;
}
I suggest requiring the argument to be >= 2.
This code is referenced at http://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2, for example.
So it appears to work now, but I am not completely happy with the way the different types are specified. At least we can be sure that the optimization is possible and works.
I would be happy if somebody with more experience had a good look at this.
I would want to see evidence that the proposed code is faster than the current code.
LGTM
I found exactly the same code in pyobject.c under the name calculate_keysize, so it can be replaced by a call to the new routine.
calculate_keysize() is defined in Objects/dictobject.c. It's called when creating a new dict, when inserting an item in a dict, or when merging two dicts. It might be interesting, but so far you didn't provide any micro-benchmark.
Similar code is used for dict, set, and other implementations of hashtable (one or two).
The unsigned long version in pycore_bitutils is even more optimized, with a 5-bit table lookup; I would suggest copying that code for the size_t version (as I did), and using it in all the places where rounding up to a power of two is done.
This PR is stale because it has been open for 30 days with no activity. |
Closing as https://bugs.python.org/issue42673 has been rejected. |
https://bugs.python.org/issue42673