How to quickly process one million strings to remove quotes

Hi, I am trying to remove extra quotes from a large set of strings (a
list of strings). Each original string looks like,


I would like to remove the start and end quotes and the extra pairs of
quotes in each string value, so the result will look like,


and then join each string by a new line.

I have tried the following code,

for line in str_lines[1:]:
    strip_start_end_quotes = line[1:-1]
    splited_line_rem_quotes = strip_start_end_quotes.replace('""', '"')
    str_lines[str_lines.index(line)] = splited_line_rem_quotes

for_pandas_new_headers_str = '\n'.join(str_lines)

but it is really slow (it runs for ages) when the list contains over 1
million lines. I am looking for a faster way to do this.
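
One likely cause of the slowdown is that `str_lines.index(line)` rescans
the list from the start on every iteration, so the loop does quadratic
work overall (it also finds the wrong position when two lines are
identical). A minimal sketch of a single-pass version follows, assuming
each line is wrapped in exactly one pair of outer quotes. The name
`strip_quotes` and the sample data are made up for illustration.

```python
def strip_quotes(line):
    """Drop the first and last character, then collapse doubled quotes."""
    return line[1:-1].replace('""', '"')


# Hypothetical sample data; the real list would come from the input.
str_lines = ['header', '"a,""b"",c"', '"""x"",y"']

# Leave the first line untouched, as the loop over str_lines[1:] does,
# and build a new list instead of searching the old one for each line.
cleaned = [str_lines[0]] + [strip_quotes(line) for line in str_lines[1:]]

for_pandas_new_headers_str = '\n'.join(cleaned)
```

Each line is visited once here, so the cost should grow linearly with the
number of lines rather than with its square.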

I also tried to parallelize this task with multiprocessing:

def preprocess_data_str_line(data_str_lines):
    """
    :param data_str_lines:
    """
    for line in data_str_lines:
        strip_start_end_quotes = line[1:-1]
        splited_line_rem_quotes = strip_start_end_quotes.replace('""', '"')
        data_str_lines[data_str_lines.index(line)] = splited_line_rem_quotes

    return data_str_lines

def multi_process_prepcocess_data_str(data_str_lines):
    """
    :param data_str_lines:
    """
    # how_many_core() and slice_list() are helper functions not shown here
    # if cpu load < 25% and 4GB of ram free use 3 cores
    # if cpu load < 70% and 4GB of ram free use 2 cores
    cores_to_use = how_many_core()

    data_str_blocks = slice_list(data_str_lines, cores_to_use)

    for block in data_str_blocks:
        # spawn processes for each data string block assigned to every cpu
        p = multiprocessing.Process(target=preprocess_data_str_line,
                                    args=(block,))
        p.start()

but I don't know how to concatenate the results back into the list so that
I can join the strings in the list by new lines.

So, ideally, I am thinking about using multiprocessing plus a fast function
to preprocess each line, to speed up the whole process.
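
On collecting the results: `multiprocessing.Process` does not hand back
the target's return value, and a list modified inside a child process is
not modified in the parent. A minimal sketch using
`multiprocessing.Pool.map`, which returns the results as one list in input
order, follows. The names `strip_quotes` and `clean_lines_parallel`, the
worker count, and the `chunksize` value are assumptions for illustration,
not tuned settings.

```python
import multiprocessing


def strip_quotes(line):
    """Drop the first and last character, then collapse doubled quotes."""
    return line[1:-1].replace('""', '"')


def clean_lines_parallel(lines, workers=2, chunksize=10000):
    """Apply strip_quotes to every line using a pool of worker processes.

    Pool.map splits the input into chunks, sends them to the workers,
    and returns a single list in the same order as the input.
    """
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(strip_quotes, lines, chunksize=chunksize)


if __name__ == '__main__':
    # Hypothetical sample data standing in for the real list of lines.
    sample = ['"a,""b"",c"', '"""x"",y"'] * 5
    cleaned = clean_lines_parallel(sample)
    print('\n'.join(cleaned))
```

For a per-line operation this cheap, the cost of copying the strings
between processes may outweigh the gain, so it is worth timing this
against a plain single-process loop before committing to it.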