[Python-Dev] Micro-benchmarks for PEP 580
On Wed, Jul 11, 2018 at 7:47 AM Victor Stinner <vstinner at redhat.com> wrote:
> 2018-07-10 14:59 GMT+02:00 INADA Naoki <songofacandy at gmail.com>:
> > PyObject_CallFunction(func, "n", 42);
> > Currently, we create a temporary long object to pass the argument.
> > If there were a protocol for exposing the format used by PyArg_Parse*, we could
> > bypass the temporary Python object and call myfunc_impl directly.
> I'm not sure that it's worth it. It seems complex to implement.
Both my idea and PEP 580 are complicated.
For the Python stdlib, I expect no significant benefit.
There we already bypass the Python function call with the typecheck +
concrete function call idiom.
But for Cython users, especially people using Cython on Jupyter, I expect
there are many extension-to-extension calls.
Both this idea and PEP 580 are complicated. And we don't have a
realistic example that demonstrates their real-world benefit.
Without an application benchmark, I think neither this idea nor PEP 580
should go ahead.
That's why I have requested an application benchmark again and again.
PEP 576 seems to be a minimalistic, straightforward way to allow FASTCALL
for Cython and other 3rd party libraries.
But if we accept PEP 576, it becomes harder to allow more optimizations
in the future.
I expect the best balance lies between PEP 576 and 580. Maybe adding
a new slot as a struct pointer with some flags, but without per-instance data.
But I'm not sure, because I'm not a data scientist.
I don't know what the typical usage is or where the main bottleneck is of
Jeroen seems to want us to discuss PEP 576 and 580.
So I explained to him why we need an example application first.
> I proposed something simpler, but nobody tried to implement it.
> Instead of calling the long and complex PyArg_Parse...() functions,
> why not generating C code to parse arguments instead? The idea looks
> like "inlining" PyArg_Parse...() in its caller, but technically it
> means that Argument Clinic generates C code to parse arguments.
I believe Cython already does this.
But I worry about it. If we do it for every function, it makes the Python
binary fatter and consumes more CPU cache. Once the CPU cache starts
thrashing, application performance degrades quickly.
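
To make the trade-off concrete, here is a toy model of the two approaches,
written in Python rather than C. It is only a sketch: the converter table,
parse_generic() and generate_parser() are hypothetical names, and the output
is not what Argument Clinic or Cython actually emits. The generic parser
interprets a format string on every call, while the generator emits one
specialized function per format string.

```python
import operator


def _as_str(value):
    # Hypothetical stand-in for the "s" format code: accept only str.
    if not isinstance(value, str):
        raise TypeError("expected str")
    return value


# Tiny, assumed subset of format codes: (name used in generated source, converter).
CONVERTERS = {
    "n": ("operator.index", operator.index),  # integer index
    "d": ("float", float),                    # floating point number
    "s": ("_as_str", _as_str),                # string
}


def parse_generic(fmt, args):
    """Interpret the format string on every call (shared code, more work per call)."""
    if len(args) != len(fmt):
        raise TypeError("wrong number of arguments")
    return tuple(CONVERTERS[code][1](arg) for code, arg in zip(fmt, args))


def generate_parser(fmt):
    """Emit and compile a parser specialized for one format string."""
    params = [f"a{i}" for i in range(len(fmt))]
    calls = "".join(
        f"{CONVERTERS[code][0]}({param}), " for code, param in zip(fmt, params)
    )
    source = f"def parse({', '.join(params)}):\n    return ({calls})\n"
    namespace = {"operator": operator, "_as_str": _as_str}
    exec(source, namespace)
    return namespace["parse"], source


parse_nds, generated_source = generate_parser("nds")
print(generated_source)
print(parse_nds(42, 1, "x"))
print(parse_generic("nds", (42, 1, "x")))
```

Every distinct format string gets its own generated function here, which is
the same reason the C version would grow the binary: the per-call work goes
down, but the amount of code goes up.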
And benchmarking CPU cache efficiency is very difficult. Our current
Python benchmarks are too small. We benchmark an HTTP server,
SQLAlchemy, JSON, and a template engine individually. But a real application
does all of them in a loop. And many processes share the L3 cache.
Even the L1 cache is shared by several processes through HyperThreading
and context switches.
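
A rough sketch of the difference in benchmark shape, using only the standard
library. The three workloads (json, string.Template, and a sort) are
stand-ins I picked for illustration, not the real HTTP server or SQLAlchemy
benchmarks, and on workloads this small the two numbers may well differ only
by noise. The point is the structure: isolated tight loops versus one loop
that does everything.

```python
import json
import time
from string import Template

# Assumed sample data, only for illustration.
PAYLOAD = {"id": 42, "tags": ["a", "b", "c"], "name": "example"}
PAGE = Template("<h1>$name</h1><p>id=$id</p>")


def json_step():
    # Round-trip a small document through the JSON encoder and decoder.
    return json.loads(json.dumps(PAYLOAD))


def template_step():
    # Render a tiny template.
    return PAGE.substitute(name=PAYLOAD["name"], id=PAYLOAD["id"])


def sort_step():
    # Stand-in for "some other work" between the two steps above.
    return sorted(range(200), key=lambda i: -i)


STEPS = {"json": json_step, "template": template_step, "sort": sort_step}


def time_isolated(n):
    """Time each workload in its own tight loop (micro-benchmark style)."""
    results = {}
    for name, step in STEPS.items():
        start = time.perf_counter()
        for _ in range(n):
            step()
        results[name] = time.perf_counter() - start
    return results


def time_interleaved(n):
    """Time all workloads together, alternating inside one loop."""
    start = time.perf_counter()
    for _ in range(n):
        for step in STEPS.values():
            step()
    return time.perf_counter() - start


isolated = time_isolated(2000)
interleaved = time_interleaved(2000)
print("sum of isolated runs:", sum(isolated.values()))
print("interleaved run:     ", interleaved)
```

Even the interleaved loop runs in a single process, so it says nothing about
caches shared with other processes; a real application benchmark is still
needed for that.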
> PyArg_Parse...() is cheap and has been optimized, but on very fast
> functions (less than 100 ns), it might be significant. Well, to be
> sure, someone should run a benchmark :-)
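
A minimal way to get a first number with the timeit module. This is a sketch,
not a proper benchmark: per_call_ns() is a hypothetical helper, and
operator.index is just one example of a very fast C function taking a single
argument. It measures the whole call from Python, so it cannot isolate the
argument parsing cost by itself.

```python
import timeit


def per_call_ns(stmt, setup="pass", number=200_000, repeat=5):
    """Return the best observed time per execution of stmt, in nanoseconds."""
    best = min(timeit.repeat(stmt, setup=setup, number=number, repeat=repeat))
    return best / number * 1e9


# Baseline: the timing loop plus loading one local name.
baseline_ns = per_call_ns("x", setup="x = 42")

# A very fast C function called with one argument.
call_ns = per_call_ns(
    "f(x)", setup="import operator; f = operator.index; x = 42"
)

print(f"baseline: {baseline_ns:.1f} ns")
print(f"f(x):     {call_ns:.1f} ns")
```

Taking the minimum over several repeats reduces noise from other processes,
but the result still depends on the machine and the build.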
INADA Naoki <songofacandy at gmail.com>