[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Python-Dev] Proposal: dict.with_values(iterable)

On 22Apr2019 2143, Inada Naoki wrote:
> On Tue, Apr 23, 2019 at 11:30 AM Steve Dower <steve.dower at python.org> wrote:
>> Or possibly just "dict(existing_dict).update(new_items)".
> Do you mean .update accepts values tuple?
> I can't think it's

Not sure what you were going to go on to say here, but why not?

If it's a key-sharing dict, then all the keys are strings. We know that 
when we go to do the update, so we can intern all the strings (going to 
do that anyway) and then it's a quick check if it already exists. If 
it's a regular dict, then we calculate hashes as normal. Updating the 
value is just a decref, incref and assignment.

If not all these conditions are met, we convert to a regular dict. The 
proposed function was going to raise an error in this case, so all we've 
done is make it transparent. The biggest downside is now you don't get a 
warning that your preferred optimization isn't actually working when you 
pass in new_items with different keys from what were in existing_dict.

Note that it .update() would probably require a dict or key/value tuples 
here - but if you have the keys in a tuple already then zip() is going 
to be good enough for setting it (in fact, zip(existing_dict, 
new_values) should be fine, and we can internally special-case that 
scenario, too). I'd assumed the benefit was in memory usage after 
construction, rather than speed-to-construct, since everyone keeps 
talking about "key-sharing dictionaries" and not "arrays" ;)

(Randomizing side note: is this scenario enough to make a case for a 
built-in data frame type?)

>> My primary concern is still to avoid making CPython performance
>> characteristics part of the Python language definition. That only makes
>> it harder for alternate implementations.
> Note that this proposal is not only for key sharing dict:
> * We can avoid rebuilding hash table again and again.
> * We can avoid checking duplicated keys again and again.
> These characteristics are not only for Python, but for all mapping
> implementations using hash table.

I believe all of these are met by making d2=dict(d1) construct a dict d2 
that shares keys with d1 by default. Can you show how they are not?

* when you only d2.update existing keys, no need to rebuild the table
* a duplicated key overwrites multiple times - what else are you going 
to do? This is already easiest, fastest, uses the least memory and is 
most consistent with every other form of setting dict items. Why 
complicate things by checking them? Let the caller do it