how to fast processing one million strings to remove quotes
On 2017-08-02 19:05, MRAB wrote:
> On 2017-08-02 16:05, Daiyue Weng wrote:
>> Hi, I am trying to remove extra quotes from a large set of strings (a
>> list of strings), so for each original string, it looks like,
>> I would like to remove the start and end quotes and the extra pairs of
>> quotes on each string value, so the result will look like,
>> and then join the strings with newlines.
>> I have tried the following code,
>>     for line in str_lines[1:]:
>>         strip_start_end_quotes = line[1:-1]
>>         splited_line_rem_quotes = strip_start_end_quotes.replace('\"\"', '"')
>>         str_lines[str_lines.index(line)] = splited_line_rem_quotes
>>
>>     for_pandas_new_headers_str = '\n'.join(splited_lines)
>> but it is really slow (running for ages) if the list contains over 1
>> million string lines. I am thinking about a fast way to do that.
> The problem is the line:
>
>     str_lines[str_lines.index(line)] = splited_line_rem_quotes
>
> It does a linear search through str_lines until it finds a match for
> the line.
> To find the 10th line it must search through the first 10 lines.
> To find the 100th line it must search through the first 100 lines.
> To find the 1000th line it must search through the first 1000 lines.
> And so on.
> In Big-O notation, the performance is O(n**2).
> The Pythonic way of doing it is to put the results into a new list:
>
>     new_str_lines = str_lines[:1]
>     for line in str_lines[1:]:
>         strip_start_end_quotes = line[1:-1]
>         splited_line_rem_quotes = strip_start_end_quotes.replace('\"\"', '"')
>         new_str_lines.append(splited_line_rem_quotes)
>
> In Big-O notation, the performance is O(n).
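
For reference, here is a minimal self-contained sketch of the new-list
approach wrapped in a function. The function name and the sample data are
made up for illustration (they are not from the original post); the first
line is copied unchanged because the original loop skips it.

```python
def strip_quotes_new_list(str_lines):
    """Return a new list; the first line (header) is copied unchanged."""
    new_str_lines = str_lines[:1]
    for line in str_lines[1:]:
        # Drop the outer quotes, then collapse each doubled quote to one.
        new_str_lines.append(line[1:-1].replace('""', '"'))
    return new_str_lines


# Hypothetical sample data, not the original poster's.
sample = ['header', '"a,""b"",c"', '"""x"""']
result = strip_quotes_new_list(sample)
joined = '\n'.join(result)
```

Each line is visited exactly once and append() is amortised constant
time, so the whole pass stays linear in the number of lines.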
Sometimes it's desirable to modify the list in-place (such as in this
case, where you don't really want to double the memory use):

    for idx, line in enumerate(str_lines):
        str_lines[idx] = fixed(line)
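
A runnable sketch of that in-place loop, assuming fixed() does the same
quote stripping as the code quoted above. Both fixed() as written here and
the skip_header flag are illustrative assumptions, added so the first line
can be left alone as in the original loop.

```python
def fixed(line):
    # Hypothetical helper: strip outer quotes, collapse doubled quotes.
    return line[1:-1].replace('""', '"')


def fix_in_place(str_lines, skip_header=True):
    # Assigning to str_lines[idx] while enumerating is safe because the
    # list's length never changes.
    for idx, line in enumerate(str_lines):
        if skip_header and idx == 0:
            continue
        str_lines[idx] = fixed(line)


# Hypothetical sample data.
lines = ['header', '"a ""b"""']
same_list = lines
fix_in_place(lines)
```

enumerate() hands over the index directly, so no search through the list
is needed and no second list is built.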
The most Pythonic way to process a large "list" of data is often to not
use a list at all, but to use iterators. Whether it's feasible to access
the strings one-by-one will depend on where they come from and where
they're going. Something like this may or may not be useful:
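
    def remove_quotes_from_all(lines):
        for line in lines:
            # Lines read from a file end with '\n'; drop it before slicing.
            yield line.rstrip('\n')[1:-1].replace('""', '"')

    with open('weird_file.txt', 'r') as input:
        with open('not_so_weird_file.txt', 'w') as output:
            for fixed_line in remove_quotes_from_all(input):
                output.write(fixed_line + '\n')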
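
To try the iterator approach without touching the disk, here is a small
sketch that uses io.StringIO objects in place of real files. The sample
text is made up, and stripping the trailing newline before slicing is an
assumption about how the lines arrive (lines read from a file keep their
'\n').

```python
import io


def remove_quotes_from_all(lines):
    for line in lines:
        # Remove the line ending first so [1:-1] hits the quotes.
        line = line.rstrip('\n')
        yield line[1:-1].replace('""', '"')


# In-memory stand-ins for the input and output files.
source = io.StringIO('"a,""b"""\n"""x"""\n')
sink = io.StringIO()
for fixed_line in remove_quotes_from_all(source):
    sink.write(fixed_line + '\n')
text = sink.getvalue()
```

Only one line is held in memory at a time, so this should scale to inputs
much larger than a million lines.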