osdir.com



how to fast processing one million strings to remove quotes


On 8/2/2017 1:05 PM, MRAB wrote:
> On 2017-08-02 16:05, Daiyue Weng wrote:
>> Hi, I am trying to remove extra quotes from a large set of strings (a
>> list of strings), where each original string looks like,
>>
>> """str_value1"",""str_value2"",""str_value3"",1,""str_value4"""
>>
>>
>> I would like to remove the start and end quotes and the extra pairs of
>> quotes on each string value, so the result will look like,
>>
>> "str_value1","str_value2","str_value3",1,"str_value4"
>>
>>
>> and then join each string by a new line.
>>
>> I have tried the following code,
>>
>> for line in str_lines[1:]:
>>              strip_start_end_quotes = line[1:-1]
>>              splited_line_rem_quotes = strip_start_end_quotes.replace('\"\"', '"')
>>              str_lines[str_lines.index(line)] = splited_line_rem_quotes
>>
>> for_pandas_new_headers_str = '\n'.join(str_lines)

Do you actually need the list of strings joined up like that into one 
string, or will the one string just be split again into multiple strings?

>> but it is really slow (running for ages) if the list contains over 1
>> million string lines. I am thinking about a fast way to do that.
>>
> [snip]
> 
> The problem is the line:
> 
>      str_lines[str_lines.index(line)]
> 
> It does a linear search through str_lines until it finds a match for 
> the line.
> 
> To find the 10th line it must search through the first 10 lines.
> 
> To find the 100th line it must search through the first 100 lines.
> 
> To find the 1000th line it must search through the first 1000 lines.
> 
> And so on.
> 
> In Big-O notation, the performance is O(n**2).
> 
> The Pythonic way of doing it is to put the results into a new list:
> 
> 
> new_str_lines = str_lines[:1]
> 
> for line in str_lines[1:]:
>      strip_start_end_quotes = line[1:-1]
>      splited_line_rem_quotes = strip_start_end_quotes.replace('\"\"', '"')
>      new_str_lines.append(splited_line_rem_quotes)
> 
> 
> In Big-O notation, the performance is O(n).

Making a slice copy of all but the first member of the list is also 
unnecessary.  Use an iterator instead.

lineit = iter(str_lines)
new_str_lines = [next(lineit)]
for line in lineit:
     ...
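
Filled in, the loop above might look like the following. This is only a sketch: the function name strip_extra_quotes and the sample data are made up for illustration, and it assumes the list has at least one line (the header), which is passed through unchanged as in the original code.

```python
def strip_extra_quotes(str_lines):
    """Return a new list: first line unchanged, other lines unquoted.

    Hypothetical helper; assumes str_lines is non-empty and that every
    line after the first is wrapped in one extra pair of quotes.
    """
    lineit = iter(str_lines)
    new_str_lines = [next(lineit)]  # keep the header line as-is
    for line in lineit:
        # Drop the outer quotes, then collapse each doubled quote.
        new_str_lines.append(line[1:-1].replace('""', '"'))
    return new_str_lines


# Made-up sample input in the format described in the original post.
sample = [
    'header',
    '"""str_value1"",""str_value2"",""str_value3"",1,""str_value4"""',
]
result = strip_extra_quotes(sample)
joined = '\n'.join(result)  # only needed if one big string is really wanted
```

Each line is touched once and nothing is searched for, so the work grows linearly with the number of lines.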


-- 
Terry Jan Reedy