bpo-42673 prevent branch misprediction in round_size (used in rehash) #23833

Closed
wants to merge 14 commits into from

Conversation

jneb
Contributor

@jneb jneb commented Dec 18, 2020

Replace the loop in round_size using bit_length to prevent branch misprediction delays.

https://bugs.python.org/issue42673

blurb-it bot and others added 2 commits December 18, 2020 09:39
@jneb jneb changed the title bpo-42673 prevent branch misprediction in round_size (using in rehash) bpo-42673 prevent branch misprediction in round_size (used in rehash) Dec 18, 2020
@jneb
Contributor Author

jneb commented Dec 18, 2020

The fact that test_asyncio failed on Windows x86 shouldn't have anything to do with this: the change is deep in the core, and if there actually were an error there, many more tests should fail.
Does anybody understand what is going on here?
The actual error is:

FAIL: test_sendfile_close_peer_in_the_middle_of_receiving (test.test_asyncio.test_sendfile.ProactorEventLoopTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\cpython\cpython\lib\test\test_asyncio\test_sendfile.py", line 458, in test_sendfile_close_peer_in_the_middle_of_receiving
    self.run_loop(
AssertionError: ConnectionError not raised

After forcing a new test, all was OK. There is something fishy with this test ...

Updated comment.
I hope the new build test doesn't fail. (That wasn't my fault!)
return i;
// _Py_bit_length(s) is the smallest k such that 1 << k > s;
// passing s - 1 keeps exact powers of two unchanged, saving space
return 1 << _Py_bit_length(s - 1);
Member

Can you please try _Py_SIZE_ROUND_DOWN(), which could be even faster? It uses simple operations:

#define _Py_SIZE_ROUND_DOWN(n, a) ((size_t)(n) & ~(size_t)((a) - 1))

About _Py_bit_length(): it has a special case for 0, but s >= HASHTABLE_MIN_SIZE.

Contributor Author

@jneb jneb Dec 18, 2020

That's brilliant! I didn't know about that macro.
EDITED:
.... but, that macro doesn't do what we want: we want to round up to the nearest power of two, not to a multiple of a given power of two.
I've been racking my brain for a quick binary trick to do that, but this is the fastest one I could find.

Member

For me, this macro is pure black magic... but oops, I picked the wrong macro; I was thinking of _Py_SIZE_ROUND_UP().

/* Below "a" is a power of 2. */
/* Round down size "n" to be a multiple of "a". */
#define _Py_SIZE_ROUND_DOWN(n, a) ((size_t)(n) & ~(size_t)((a) - 1))

/* Round up size "n" to be a multiple of "a". */
#define _Py_SIZE_ROUND_UP(n, a) (((size_t)(n) + \
        (size_t)((a) - 1)) & ~(size_t)((a) - 1))

Member

Maybe _Py_SIZE_ROUND_UP() should be rewritten as a static inline function to ensure that (a-1) expression is only computed once.

@@ -109,10 +110,9 @@ round_size(size_t s)
size_t i;
if (s < HASHTABLE_MIN_SIZE)
Member

Micro-optimisation :-)

Suggested change
if (s < HASHTABLE_MIN_SIZE)
if (s <= HASHTABLE_MIN_SIZE)

Contributor Author

This actually may make the code slower. The value passed to round_size will almost never be < HASHTABLE_MIN_SIZE, but it sometimes is equal.
The branch prediction will probably be faster if it can assume the condition is always false.

@vstinner
Member

@serhiy-storchaka: using _Py_SIZE_ROUND_DOWN() avoids the loop and might be faster, no? What do you think?

In the last patch, I forgot to remove an unused variable.
@serhiy-storchaka
Member

_Py_SIZE_ROUND_UP() does not have anything in common with this code. It returns the smallest number not less than n which is divisible by a. _Py_SIZE_ROUND_UP(42, 2) -> 42, _Py_SIZE_ROUND_UP(42, 4) -> 44, _Py_SIZE_ROUND_UP(42, 8) -> 48.

@serhiy-storchaka
Member

_Py_bit_length() works with unsigned long, but the argument of round_size() has type size_t.

Contributor Author

@jneb jneb left a comment

That's a good point. So this would fail if the hashtable gets a number of entries that couldn't fit in an unsigned long, which is indeed possible in theory (but only for humongous computers: 8<<32 = 32 GB for a single hash table; that would be a bit hard to test :-).
I think the best solution would be to fix the macro to handle this case.

@jneb
Contributor Author

jneb commented Dec 18, 2020

_Py_bit_length() works with unsigned long, but the argument of round_size() has type size_t.

When looking at the definition of _Py_bit_length(), I am afraid to copy it for size_t. What if size_t is equivalent to unsigned long? Or, more accurately, what is the safe #if to write to make sure there is no double definition? Something like

#if SIZEOF_SIZE_T > SIZEOF_LONG
(copied code of Py_bit_length, for size_t)
#endif

or is there a better way?
EDITED:
That didn't work. I made a separate _Py_bit_length_size_t routine; I am not happy with the result, as it looks ugly to repeat all this source code.

jneb added 9 commits December 18, 2020 16:22
This is needed for the new version of the hashtable.c, in case the hashtable is extremely large (>32GB).
Apparently, you can't use sizeof in an #if
There should be a good way to check sizeof(size_t) > sizeof(unsigned long); I hope this is the one
Swapped a } and an #endif...
Made a separate routine _Py_bit_length_size_t since overloading doesn't seem to work
Use the separate _Py_bit_length_size_t so that round_size works in extreme cases of hashtables >32GB
Cast the 1 to size_t to get a proper shift
I changed one size_t too many
I am determined to get this to work!
// Same routine as above, for size_t, if that is bigger than unsigned long
// This means bitscanreverse (as used above) is not going to work
static inline int
_Py_bit_length_size_t(size_t x)
Member

I would prefer to add a _Py_next_pow2() function which takes a size_t parameter. So we can use a more efficient implementation when __builtin_clzl() and _BitScanReverse() are not available. Like:

static inline uint32_t next_pow2(uint32_t x)
{
	x -= 1;
	x |= (x >> 1);
	x |= (x >> 2);
	x |= (x >> 4);
	x |= (x >> 8);
	x |= (x >> 16);
	return x + 1;
}

I suggest requiring the argument to be >= 2.


@jneb
Contributor Author

jneb commented Dec 18, 2020

So it appears to work now, but I am not completely happy with the way the different types are specified. But at least we can be sure that the optimization is possible and works.
It won't make a very big difference, but the fact that we probably save a mispredicted branch for every hashtable expansion makes me happy.

@jneb
Contributor Author

jneb commented Dec 18, 2020

I would be happy if somebody with more experience will have a good look at this.

@serhiy-storchaka
Member

I would want to see evidence that the proposed code is faster than the current code.

@ghost ghost left a comment

LGTM

@jneb
Contributor Author

jneb commented Dec 22, 2020

I found exactly the same code in pyobject.c under the name calculate_keysize so this can be replaced by a call to the new routine.
Furthermore, it almost proves that it is the fastest way :-)

@vstinner
Member

I found exactly the same code in pyobject.c under the name calculate_keysize so this can be replaced by a call to the new routine. Furthermore, it almost proves that it is the fastest way :-)

calculate_keysize() is defined in Objects/dictobject.c. It's called when creating a new dict, when inserting an item in a dict, or when merging two dicts. It might be interesting, but so far, you didn't provide any micro-benchmark.

@serhiy-storchaka
Member

Similar code is used for dict, set, and other implementations of hashtable (one or two).

@jneb
Contributor Author

jneb commented Dec 23, 2020

The unsigned long version in pycore_bitutils is even more optimized, with a 5-bit lookup; I would suggest copying that code for the size_t version (as I did), and using it for all locations where rounding up to a power of two is needed.
I'm a bit nervous to do this myself, as I am a core development newbie and there are so many different types used in this code.

@github-actions

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale Stale PR or inactive for long period of time. label Jan 23, 2021
@iritkatriel
Member

Closing as https://bugs.python.org/issue42673 has been rejected.

@iritkatriel iritkatriel closed this Apr 9, 2022
Labels
awaiting core review stale Stale PR or inactive for long period of time.
6 participants