
Re: Python profiling

Cool! Can we document it somewhere so that other Runners can pick it up later?

Manu Zhang
On Oct 29, 2018, 5:54 PM +0800, Maximilian Michels <mxm@xxxxxxxxxx>, wrote:
This looks very helpful for debugging performance of portable pipelines.
Great work!

Enabling local directories for Flink or other portable Runners would be
useful for debugging, e.g. per

On 26.10.18 18:08, Robert Bradshaw wrote:
Now that we've (mostly) moved from features to performance for
BeamPython-on-Flink, I've been doing some profiling of Python code,
and thought it may be useful for others as well (both those working on
the SDK, and users who want to understand their own code), so I've
tried to wrap this up into something useful.

The Python SDK already had some profiling options that we used with
Dataflow, specifically --profile_cpu and --profile_location. I've
hooked these up to both the DirectRunner and the SDK Harness Worker.
One can now run commands like

python -m apache_beam.examples.wordcount \
    --output=counts.txt --profile_cpu --profile_location=path/to/directory

and get nice graphs like the one attached. (Here the bulk of the time
is spent reading from the default input in GCS. Another hint for
reading the graph is that due to fusion the call graph is cyclic,
passing through operations:86:receive for every output.)

The raw Python profile stats [1] are produced in that directory, along
with a dot graph and an SVG if both dot and gprof2dot [2] are installed.
There is also an important option --direct_runner_bundle_repeat which
can be set to gain more accurate profiles on smaller data sets by
re-playing the bundle without the (non-trivial) one-time setup costs.
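
For anyone who wants to look at the raw stats without the graph, here is a
minimal sketch using the standard-library pstats module. It assumes the files
written to --profile_location are in the standard format that pstats reads;
the toy workload, the file name sample.prof and the helper names are made up
for illustration and are not part of Beam.

```python
import cProfile
import io
import os
import pstats
import tempfile


def busy_work(n):
    """Toy workload standing in for pipeline code (illustrative only)."""
    return sum(i * i for i in range(n))


def write_sample_profile(path):
    """Profile busy_work and dump the stats in the format pstats reads."""
    profiler = cProfile.Profile()
    profiler.runcall(busy_work, 10000)
    profiler.dump_stats(path)


def summarize(path, limit=10):
    """Return a text report of the top `limit` entries by cumulative time."""
    stream = io.StringIO()
    stats = pstats.Stats(path, stream=stream)
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return stream.getvalue()


# Hypothetical location; in practice point this at a file that was
# written under --profile_location.
profile_dir = tempfile.mkdtemp()
profile_path = os.path.join(profile_dir, "sample.prof")
write_sample_profile(profile_path)
report = summarize(profile_path)
print(report)
```

Sorting by "cumulative" tends to surface the expensive call chains first;
sorting by "tottime" instead highlights the functions that are slow themselves.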

These flags also work on portability runners such as Flink, where the
directory must be set to a distributed filesystem. Each bundle
produces its own profile in that directory, and they can be
concatenated and manually fed into tools like gprof2dot [2]. In that case
there is a --profile_sample_rate which can be set to avoid profiling
every single bundle (e.g. for a production job).
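
One way to combine the per-bundle profiles is to let pstats merge them rather
than concatenating bytes. The sketch below is only an illustration under some
assumptions: the files have been copied to a local directory, they match a
*.prof pattern (a hypothetical naming, the real file names may differ), and
the helper names and toy workloads are invented for the example.

```python
import cProfile
import glob
import os
import pstats
import tempfile


def work_a():
    """Toy workload for the first fake bundle (illustrative only)."""
    return sorted(range(5000), reverse=True)


def work_b():
    """Toy workload for the second fake bundle (illustrative only)."""
    return [str(i) for i in range(5000)]


def write_profile(func, path):
    """Profile a single call to func and dump the stats to path."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    profiler.dump_stats(path)


def merge_profiles(directory, output_path, pattern="*.prof"):
    """Merge every stats file matching pattern into one stats file."""
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    if not paths:
        raise ValueError("no profile files found in %s" % directory)
    merged = pstats.Stats(paths[0])
    if len(paths) > 1:
        merged.add(*paths[1:])  # accumulates call counts and timings
    merged.dump_stats(output_path)
    return merged


# Simulate two bundles, each writing its own profile.
bundle_dir = tempfile.mkdtemp()
write_profile(work_a, os.path.join(bundle_dir, "bundle-1.prof"))
write_profile(work_b, os.path.join(bundle_dir, "bundle-2.prof"))

# Use a different extension so the merged file is not picked up on a rerun.
merged_path = os.path.join(bundle_dir, "merged.pstats")
merged = merge_profiles(bundle_dir, merged_path)
merged_names = {func[2] for func in merged.stats}
print(sorted(merged_names))
```

The merged file can then be passed to gprof2dot [2] (for example with its
pstats input format) in the same way as a single profile.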

The PR is up at https://github.com/apache/beam/pull/6847 . Hope it's useful.

- Robert

[1] https://docs.python.org/2/library/profile.html
[2] https://github.com/jrfonseca/gprof2dot