Multiprocessing performance question
Actually not a "toy example" at all. It is simply the first step in gridding some data I am working with - a problem that is solved by tools like SatPy, but unfortunately I can't use SatPy because it doesn't recognize my file format, and you can't load data directly. Writing a custom file importer for SatPy is probably my next step.
That said, the entire process took around 60 seconds to run. As this step was taking 10, I figured it would be low-hanging fruit for speeding up the process. Obviously I was wrong. For what it's worth, I did manage to refactor the code, so instead of generating the entire grid up-front, I generate the boxes as needed to calculate the overlap with the data grid. This brought the processing time down to around 40 seconds, so a definite improvement there.
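
For anyone following along, here is a rough sketch of the "generate boxes as needed" idea. It is not my actual code: to keep it runnable without shapely, each box is a plain (minx, miny, maxx, maxy) tuple, and the names iter_box_bounds, overlaps and data_cell are made up for illustration. With shapely one would presumably pass each tuple to geometry.box(*bounds) only for the cells that matter.

```python
import itertools


def iter_box_bounds(x_range, y_range):
    """Yield (minx, miny, maxx, maxy) for each unit cell, one at a time."""
    for x, y in itertools.product(x_range, y_range):
        yield (x - 1, y - 1, x, y)


def overlaps(a, b):
    """True if two (minx, miny, maxx, maxy) rectangles share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


# Hypothetical footprint of one data cell (made-up numbers).
data_cell = (2.5, 1.5, 4.0, 3.0)

# Boxes are produced lazily, so the full 800x1000 list is never held in memory.
hits = [
    bounds
    for bounds in iter_box_bounds(range(1, 1001), range(1, 801))
    if overlaps(bounds, data_cell)
]
print(hits)
```

The generator still visits every coordinate here; restricting x_range and y_range to the neighbourhood of the data cell would cut the work further.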
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
> On Feb 20, 2019, at 4:30 PM, DL Neil <PythonList at DancesWithMice.info> wrote:
> On 21/02/19 1:15 PM, george trojan wrote:
>> import itertools
>> from shapely import geometry
>> def create_box(x_y):
>>     return geometry.box(x_y[0] - 1, x_y[1], x_y[0], x_y[1] - 1)
>> x_range = range(1, 1001)
>> y_range = range(1, 801)
>> x_y_range = list(itertools.product(x_range, y_range))
>> grid = list(map(create_box, x_y_range))
>> Which creates and populates an 800x1000 "grid" (represented as a flat list
>> at this point) of "boxes", where a box is a shapely.geometry.box(). This
>> takes about 10 seconds to run.
>> Looking at this, I am thinking it would lend itself well to
>> parallelization. Since the box at each "coordinate" is independent of all
>> others, it seems I should be able to simply split the list up into chunks
>> and process each chunk in parallel on a separate core. To that end, I
>> created a multiprocessing pool:
> I recall a similar discussion when folk were being encouraged to move away from monolithic and straight-line processing to modular functions - it is more (CPU-time) efficient to run in a straight line than it is to repeatedly call, set up, execute, and return from a function or sub-routine! i.e. there is an overhead to many/all constructs!
> Isn't the 'problem' that it is a 'toy example'? That the amount of computing within each parallel process is small in relation to the inherent 'overhead'.
> Thus, if the code performed a reasonable analytical task within each box after it had been defined (increased CPU load), would you then notice the expected difference between the single- and multi-process implementations?
> From AKL to AK
> Regards =dn