In the current PR, two parameters control the final row group size: `row_group_buffer_size` and `record_batch_size`. Incoming records are first accumulated as a list of columns; once the number of buffered records reaches `record_batch_size`, they are converted into a record batch (a data structure defined in pyarrow). The record batches accumulate in a second list, which is written out as a single row group once its total byte size exceeds `row_group_buffer_size`.

Because `row_group_buffer_size` measures unencoded in-memory data, it is normally much larger than the encoded row group that ends up in the Parquet file, so it is only a rough estimate of the on-disk row group size, but I guess this is the best we can do given the limitations of Python Parquet libraries. An exact estimate would require the library to buffer a row group during writing and expose a method returning the size of the encoded data in that buffer; no currently available Python Parquet library implements these features.
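
For illustration, here is a minimal sketch of how the two parameters could interact, assuming a pyarrow-based writer. The `RowGroupWriter` class and its method names are hypothetical and not the PR's actual implementation; only the pyarrow/pyarrow.parquet calls are real APIs.

```python
import pyarrow as pa
import pyarrow.parquet as pq


class RowGroupWriter:
    """Hypothetical sketch of the two-level buffering described above."""

    def __init__(self, path, schema, record_batch_size=1000,
                 row_group_buffer_size=128 * 1024 * 1024):
        self.schema = schema
        self.record_batch_size = record_batch_size
        self.row_group_buffer_size = row_group_buffer_size
        self.writer = pq.ParquetWriter(path, schema)
        self.columns = {name: [] for name in schema.names}  # pending records, column-wise
        self.num_pending = 0
        self.batches = []       # completed record batches awaiting a row-group flush
        self.buffer_bytes = 0   # unencoded in-memory size of self.batches

    def write(self, record):
        for name in self.schema.names:
            self.columns[name].append(record[name])
        self.num_pending += 1
        if self.num_pending >= self.record_batch_size:
            self._roll_batch()

    def _roll_batch(self):
        # Turn the buffered columns into a pyarrow record batch.
        batch = pa.record_batch(
            [pa.array(self.columns[name]) for name in self.schema.names],
            schema=self.schema)
        self.columns = {name: [] for name in self.schema.names}
        self.num_pending = 0
        self.batches.append(batch)
        self.buffer_bytes += batch.nbytes  # in-memory size, not encoded size
        if self.buffer_bytes >= self.row_group_buffer_size:
            self._flush_row_group()

    def _flush_row_group(self):
        # Each write_table() call starts a new row group
        # (pyarrow may split very large tables into several).
        self.writer.write_table(pa.Table.from_batches(self.batches))
        self.batches, self.buffer_bytes = [], 0

    def close(self):
        if self.num_pending:
            self._roll_batch()
        if self.batches:
            self._flush_row_group()
        self.writer.close()
```

Note that the flush condition uses `RecordBatch.nbytes`, the in-memory Arrow size, which is exactly why the resulting row group on disk is smaller than `row_group_buffer_size` after Parquet encoding and compression.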