
bpo-46192: Optimize builtin functions min() and max() #30286

Open · wants to merge 1 commit into main
Conversation

colorfulappl
Contributor

@colorfulappl colorfulappl commented Dec 29, 2021

The builtin functions min() and max() are labeled METH_VARARGS | METH_KEYWORDS, which uses the tp_call calling convention.
After changing their flags to METH_FASTCALL | METH_KEYWORDS, they can be invoked via vectorcall.

This optimization simplifies parameter passing and avoids the creation of a temporary tuple while parsing arguments, bringing up to a 200%+ speedup on microbenchmarks.

faster-cpython/ideas#199

https://bugs.python.org/issue46192


@colorfulappl colorfulappl changed the title bpo-46192: Builtin functions min() and max() now use METH_FASTCALL bpo-46192: Optimize builtin functions min() and max() Dec 29, 2021
@AlexWaygood AlexWaygood added the performance Performance or resource usage label Dec 29, 2021
@erlend-aasland
Contributor

erlend-aasland commented Dec 29, 2021

If we're touching these, it would make sense to convert them to Argument Clinic, now that *args support is implemented (9af34c9).

cpython/Python/bltinmodule.c

Lines 1818 to 1823 in 77195cd

/* AC: cannot convert yet, waiting for *args support */
static PyObject *
builtin_min(PyObject *self, PyObject *args, PyObject *kwds)
{
    return min_max(args, kwds, Py_LT);
}

cpython/Python/bltinmodule.c

Lines 1835 to 1840 in 77195cd

/* AC: cannot convert yet, waiting for *args support */
static PyObject *
builtin_max(PyObject *self, PyObject *args, PyObject *kwds)
{
    return min_max(args, kwds, Py_GT);
}

@colorfulappl
Contributor Author

colorfulappl commented Dec 30, 2021

I wrote an "Argument Clinic" version here:
colorfulappl@29b9559

But it seems it is not as fast as the current "without Argument Clinic" version:
colorfulappl@a9413ab

Results of the microbenchmark:

code snippet                             | with AC      | w/o AC
max(1, 2)                                | 1.11x faster | 3.25x faster
max([1, 2])                              | 1.14x faster | 1.41x faster
max((1, ), (2, ), key=lambda x: x[0])    | 1.89x faster | 1.85x faster
max([(1, ), (2, )], key=lambda x: x[0])  | 1.73x faster | 1.52x faster
max([], default=-1)                      | 2.39x faster | 2.61x faster
max(1, 2, 3, 4, 5)                       | 1.02x faster | 2.46x faster

For the most commonly used case, "max(a, b [, ...])", there is no noticeable speedup when we use AC, especially when multiple positional arguments are passed (the last case).

I noticed that in 9af34c9, the varargs on the stack are packed into a tuple before being passed to the callee, and the callee obtains each argument by accessing the tuple's elements.
This slows down the function invocation.

@colorfulappl
Contributor Author

colorfulappl commented Dec 30, 2021

One goal of fastcall is to avoid the creation of a temporary tuple to pass positional arguments (https://bugs.python.org/issue29259).

IMHO, packing/unpacking the arguments into a tuple is unnecessary in 9af34c9.
Perhaps it would be better, when processing varargs, to pass a PyObject * const * pointer to the first argument plus an nargs integer indicating how many positional arguments were passed.

I will give it a try, then open another issue if necessary.

@erlend-aasland
Contributor

erlend-aasland commented Dec 30, 2021

IMHO, packing/unpacking the arguments into a tuple is unnecessary in 9af34c9. Perhaps it would be better, when processing varargs, to pass a PyObject * const * pointer to the first argument plus an nargs integer indicating how many positional arguments were passed.

I will give it a try, then open another issue if necessary.

Yes, such a change demands a separate issue / PR. It would be nice if you could add proof-of-concept benchmark results when you present the idea in the new bpo issue. I can add the relevant core devs to the nosy list, if you want.

Keep in mind that readability and maintainability weigh very heavily when considering a PR; that also goes for optimisation changes.

@erlend-aasland
Contributor

erlend-aasland commented Dec 30, 2021

I wrote an "Argument Clinic" version here: colorfulappl@29b9559

While that change certainly looks better, it is buggy; it does not heed the key function. With that in mind, the benchmarks you posted in #30286 (comment) that use the key keyword are not correct.

@sweeneyde
Member

sweeneyde commented Dec 30, 2021

I wonder if Argument Clinic itself could be updated to convert *args to vectorcall/fastcall if some kind of flag appears in the AC declaration.

@sweeneyde
Member

sweeneyde commented Dec 31, 2021

*args support was added to AC in #18609

@isidentical any thoughts on removing the tuple creation in some cases, or with some flag?

@erlend-aasland
Contributor

erlend-aasland commented Dec 31, 2021

*args support was added to AC in #18609

@isidentical any thoughts on removing the tuple creation in some cases, or with some flag?

I suggest moving that discussion to bpo-20291.

@colorfulappl
Contributor Author

colorfulappl commented Dec 31, 2021

I will give it a try, then open another issue if necessary.

I opened a new issue (https://bugs.python.org/issue46212) and made a PR (#30312).

@colorfulappl
Contributor Author

colorfulappl commented Dec 31, 2021

the benchmarks you posted in #30286 (comment) are obviously wrong; the benchmarks with the key keyword are not correct.

Thanks for correcting my mistake. I've seen the new results after your amendment.
Actually, when no key function is passed, the "AC version" runs faster.

Contributor

@MaxwellDupre MaxwellDupre left a comment

Tested with 3.10.2 and 3.11.0a5 respectively; the results:

3.10.2 | 3.11.0a5
108    | 98.6
164    | 164
150    | 367
192    | 478
432    | 638
702    | 1.11
482    | 891
745    | 1.4
288    | 297

So it doesn't look worth it to me.
Yes, 3.11.0a5 is not optimised and may be better at the full release. You may want to wait for a later release.
