
how to quickly process one million strings to remove quotes

On 08/04/2017 01:52 AM, Peter Otten wrote:

> It looks like Python is fairly competetive:
> $ wc -l hugequote.txt 
> 1000000 hugequote.txt
> $ cat unquote.py 
> import csv
> with open("hugequote.txt") as instream:
>     for field, in csv.reader(instream):
>         print(field)
> $ time python3 unquote.py > /dev/null
> real    0m3.773s
> user    0m3.665s
> sys     0m0.082s
> $ time cat hugequote.txt | sed 's/"""/"/g;s/""/"/g' > /dev/null
> real    0m4.862s
> user    0m4.721s
> sys     0m0.330s
> Run on ancient AMD hardware ;)

It's actually better than sed.  What you're seeing is - I believe -
interpreter load time dominating the overall time.  I reran this with a
20-million-line file:

time cat superhuge.txt | sed 's/"""/"/g;s/""/"/g' >/dev/null

real    0m53.091s
user    0m52.861s
sys     0m0.820s

time python unquote.py >/dev/null

real    0m22.377s
user    0m22.021s
sys     0m0.352s

Note that this is with python2, not python3.  Also, I confirmed that the
cat and pipe into sed were not a factor in the performance.

My guess is that the delimiter recognition logic in the csv module is far
more efficient than the general-purpose regular expression/DFA
implementation in sed.

Extra Credit Assignment:

Reimplement in python using:

- string substitution
- regular expressions
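A minimal sketch of both variants, assuming the goal is to mirror the sed
script's two global substitutions ('"""' -> '"', then '""' -> '"') rather
than full CSV unquoting.  Note the two passes must stay sequential: a single
alternation like '"""|""' does not give the same result on runs of quotes.

```python
import re

def unquote_replace(line):
    # String-substitution variant: same two passes as sed 's/"""/"/g;s/""/"/g'
    return line.replace('"""', '"').replace('""', '"')

def unquote_regex(line):
    # Regular-expression variant: identical passes via re.sub
    return re.sub('""', '"', re.sub('"""', '"', line))

if __name__ == "__main__":
    sample = '"he said ""hi"""'
    print(unquote_replace(sample))  # "he said "hi"
    print(unquote_regex(sample))    # "he said "hi"
```

For timing against the csv version, each function would be applied per line
inside the same read loop (e.g. `for line in instream:`), so only the
transformation differs between runs.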