[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Filter columns of a csv file with Flink

Hi all,

I'm a new user to Flink community. This tool sounds great to achieve some data loading of millions-rows files into a pgsql db for a new project.

As I read docs and examples, a proper use case of csv loading into pgsql can't be found.
The file I want to load isn't following the same structure than the table, I have to delete some columns and make a json string from several others too prior to load to pgsql

I plan to use Flink 1.5 Java API and a batch process.
Does the DataSet class is able to strip some columns out of the records I load or should I iterate over each record to delete the columns?

Same question to make a json string from several columns of the same record?
E.g json_column =3D {"field1":col1, "field2":col2...}

I work with 20 millions length files and it sounds pretty ineffective to iterate over each records.
Can someone tell me if it's possible or if I have to change my mind about this?

Thanks in advance, all the best

François Lacombe