TableRepo

class parquetranger.TableRepo(root_path: Path | str, max_records: int = 0, group_cols: str | list | HashPartitioner | None = None, env_parents: dict[str, Path | str] | None = None, mkdirs=True, extra_metadata: dict | None = None, drop_group_cols: bool = False, fixed_metadata: dict | None = None, allow_metadata_extension: bool = False)

Bases: object

Helps with storing, extending and reading tabular data in parquet format.

Data is divided based on group_cols first. If group_cols is None, it is divided based on max_records. If max_records is 0 as well, a single file is written to root_path.parquet.

If both group_cols and max_records are given, directories are created for the groups (nested directories if multiple columns are given).
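
The dividing rules above can be sketched in plain Python. This is only an illustration of the description, not the library's implementation: plan_layout is a hypothetical helper, file names such as file-0.parquet are made up, and naming a file after the last group value when only group_cols is set is an assumption.

```python
from collections import defaultdict
from pathlib import Path


def plan_layout(root, rows, group_cols=None, max_records=0):
    """Map each target parquet path to the rows that would land in it.

    Hypothetical helper: it mirrors the rules in the description only.
    """
    root = Path(root)
    if isinstance(group_cols, str):
        group_cols = [group_cols]

    def chunk(directory, items):
        # Consecutive files of at most max_records rows (made-up file names).
        return {
            directory / f"file-{i // max_records}.parquet": items[i : i + max_records]
            for i in range(0, len(items), max_records)
        }

    if group_cols:
        groups = defaultdict(list)
        for row in rows:
            groups[tuple(str(row[c]) for c in group_cols)].append(row)
        layout = {}
        for key, items in groups.items():
            if max_records:
                # Both options set: one (possibly nested) directory per group.
                layout.update(chunk(root.joinpath(*key), items))
            else:
                # Assumption: the last group value names the file.
                layout[root.joinpath(*key[:-1]) / f"{key[-1]}.parquet"] = items
        return layout
    if max_records:
        return chunk(root, rows)
    # Neither option set: a single file next to root_path.
    return {root.parent / f"{root.name}.parquet": rows}


rows = [
    {"country": "HU", "year": 2020, "v": 1},
    {"country": "HU", "year": 2021, "v": 2},
    {"country": "DE", "year": 2020, "v": 3},
]
for path in plan_layout("data/events", rows, ["country", "year"], max_records=10):
    print(path)
```

With both options set, every (country, year) pair gets its own nested directory; dropping max_records collapses each group to a single file, and dropping both leaves one file for the whole table.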

Attributes Summary

dfs

full_metadata

group_cols

main_path

n_files

paths

tables

vc_path

Methods Summary

batch_extend(df_iterator, **para_kwargs)

env_ctx(env_name)

extend(df)

get_extending_df_batch_writer([max_records])

get_extending_dict_batch_writer([max_records])

get_extending_fixed_dict_batch_writer(cols)

get_full_df()

get_full_table()

get_partition_df(partition[, partition_col])

get_partition_paths(partition_col)

get_partition_table(partition[, partition_col])

get_replacing_df_batch_writer([max_records])

get_replacing_dict_batch_writer([max_records])

map_partitions(fun[, level])

mkdirs([force])

purge()

purges everything

read_df_from_path(path[, lock, release])

read_table_from_path(path[, lock, release])

replace_all(df)

purges everything and writes df instead

replace_groups(df)

replace files based on file name; only viable if group_cols is set

replace_records(df[, by_groups])

replace records in files based on index

set_env(env)

set_env_to_default()

Attributes Documentation

dfs
full_metadata
group_cols
main_path
n_files
paths
tables
vc_path

Methods Documentation

batch_extend(df_iterator, **para_kwargs)
env_ctx(env_name)
extend(df: DataFrame)
get_extending_df_batch_writer(max_records=1000000)
get_extending_dict_batch_writer(max_records=1000000)
get_extending_fixed_dict_batch_writer(cols, max_records=1000000)
get_full_df() -> DataFrame
get_full_table() -> Table
get_partition_df(partition: str, partition_col: str | None = None) -> DataFrame
get_partition_paths(partition_col: str) -> Iterable[tuple[str, Iterable[Path]]]
get_partition_table(partition: str, partition_col: str | None = None) -> Table
get_replacing_df_batch_writer(max_records=1000000)
get_replacing_dict_batch_writer(max_records=1000000)
map_partitions(fun, level=None, **para_kwargs)
mkdirs(force=False)
purge()

purges everything

read_df_from_path(path: Path, lock: allocate_lock | None = None, release=True) -> DataFrame
read_table_from_path(path, lock: allocate_lock | None = None, release=True) -> Table
replace_all(df: DataFrame)

purges everything and writes df instead

replace_groups(df: DataFrame)

replace files based on file name; only viable if group_cols is set

replace_records(df: DataFrame, by_groups=False)

replace records in files based on index
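
As a rough illustration of index-based replacement, the sketch below models each file as a dict keyed by index and swaps in incoming records that share an index with a stored one. It does not use the library: replace_by_index is a hypothetical helper, indexes are assumed to be unique across files, and since the description does not say what happens to incoming records whose index matches nothing, the sketch simply returns them.

```python
def replace_by_index(files, new_records):
    """Replace stored records whose index also appears in new_records.

    files:       {file name: {index: record}}
    new_records: {index: record}
    Returns the updated files and the incoming records that matched nothing.
    """
    remaining = dict(new_records)
    updated = {}
    for name, records in files.items():
        # pop() hands back the new record when the index matches,
        # otherwise it falls back to the record already stored.
        updated[name] = {idx: remaining.pop(idx, rec) for idx, rec in records.items()}
    return updated, remaining


files = {
    "a.parquet": {"i1": {"v": 1}, "i2": {"v": 2}},
    "b.parquet": {"i3": {"v": 3}},
}
updated, leftover = replace_by_index(files, {"i2": {"v": 20}, "i9": {"v": 90}})
print(updated["a.parquet"])
print(leftover)
```

Only the file holding a matching index changes; files without a match keep their contents.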

set_env(env: str)
set_env_to_default()