parquetranger


Installation

Using pip:

pip install parquetranger

Quickstart

import pandas as pd

from parquetranger import TableRepo
df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5, 6],
        "B": ["x", "y", "z", "x1", "x2", "x3"],
        "C": [1, 2, 1, 1, 1, 2],
        "C2": ["a", "a", "b", "a", "c", "c"],
    },
    index=["a1", "a2", "a3", "a4", "a5", "a6"],
)
df
    A   B  C C2
a1  1   x  1  a
a2  2   y  2  a
a3  3   z  1  b
a4  4  x1  1  a
a5  5  x2  1  c
a6  6  x3  2  c
trepo = TableRepo("some_tmp_path", group_cols="C2")  # this creates the directory
trepo.extend(df)
trepo.get_full_df()
    A   B  C C2
a1  1   x  1  a
a2  2   y  2  a
a4  4  x1  1  a
a3  3   z  1  b
a5  5  x2  1  c
a6  6  x3  2  c
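Note that get_full_df returns the rows ordered by group: each value of the group_cols column ends up in its own parquet file. A pandas-only sketch of that routing (illustrative, not the library's internals):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5, 6],
        "B": ["x", "y", "z", "x1", "x2", "x3"],
        "C": [1, 2, 1, 1, 1, 2],
        "C2": ["a", "a", "b", "a", "c", "c"],
    },
    index=["a1", "a2", "a3", "a4", "a5", "a6"],
)

# Route each row to a per-group frame, as if each group had its own parquet file
groups = {key: sub for key, sub in df.groupby("C2")}
print(sorted(groups))           # ['a', 'b', 'c']
print(list(groups["a"].index))  # ['a1', 'a2', 'a4']
```

Reading everything back is then just a concatenation of the per-group frames, which is why rows of the same group come out adjacent.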
df2 = pd.DataFrame(
    {
        "A": [21, 22, 23],
        "B": ["X", "Y", "Z"],
        "C": [10, 20, 1],
        "C2": ["a", "b", "a"],
    },
    index=["a1", "a4", "a7"],
)
trepo.replace_records(df2)  # replaces based on index
trepo.get_full_df()
     A   B   C C2
a2   2   y   2  a
a1  21   X  10  a
a7  23   Z   1  a
a3   3   z   1  b
a4  22   Y  20  b
a5   5  x2   1  c
a6   6  x3   2  c
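Conceptually, replace_records drops every stored row whose index appears in the incoming frame and appends the incoming rows. A pandas-only sketch of that idea, ignoring the per-group file layout (an illustration, not the actual implementation):

```python
import pandas as pd

old = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]}, index=["a1", "a2"])
new = pd.DataFrame({"A": [21, 23], "B": ["X", "Z"]}, index=["a1", "a7"])

# Keep only old rows whose index is absent from the replacement frame,
# then append the replacement rows.
merged = pd.concat([old.drop(old.index.intersection(new.index)), new])
print(merged.loc["a1", "A"])  # 21  (replaced)
print(merged.loc["a2", "A"])  # 2   (untouched)
```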
trepo.replace_groups(df2)
trepo.get_full_df()  # replaced the whole groups where C2==a and C2==b with the records that were present in df2
     A   B   C C2
a1  21   X  10  a
a7  23   Z   1  a
a4  22   Y  20  b
a5   5  x2   1  c
a6   6  x3   2  c
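replace_groups, by contrast, discards every stored row in any group that the incoming frame touches, whether or not the individual indices match. A pandas-only sketch of that logic (an assumption about the semantics shown above, not the library's code):

```python
import pandas as pd

old = pd.DataFrame(
    {"A": [1, 3, 5], "C2": ["a", "b", "c"]}, index=["a1", "a3", "a5"]
)
new = pd.DataFrame({"A": [21, 22], "C2": ["a", "b"]}, index=["a1", "a4"])

# Drop every old row belonging to a group present in `new`, then append `new`.
touched = set(new["C2"])
result = pd.concat([old[~old["C2"].isin(touched)], new])
print(sorted(result.index))  # ['a1', 'a4', 'a5']
```

Only group "c" survives from the old data; groups "a" and "b" are wholly replaced by the incoming rows.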
trepo.replace_all(df2)  # erases everything and puts df2 in. all traces of df are lost
trepo.get_full_df()
     A  B   C C2
a1  21  X  10  a
a7  23  Z   1  a
a4  22  Y  20  b
trepo.replace_records(df, by_groups=True)  # replaces records by index, but only looks for matching
# indices within each group, so a duplicate a4 index is possible: the two a4 rows
# have different values in C2 and therefore live in different groups
trepo.get_full_df()
     A   B   C C2
a7  23   Z   1  a
a1   1   x   1  a
a2   2   y   2  a
a4   4  x1   1  a
a4  22   Y  20  b
a3   3   z   1  b
a5   5  x2   1  c
a6   6  x3   2  c
trepo.purge()  # deletes everything

API

parquetranger Package

Read and write parquet files

Classes

DfBatchWriter(trepo, record_limit, ...)

HashPartitioner([col, num_groups])

ObjIngestor(root[, root_id_key, force_key, ...])

RecordWriter(trepo, record_limit, ...)

TableRepo(root_path[, max_records, ...])

Helps with storing, extending, and reading tabular data in parquet format.
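The signatures above suggest that RecordWriter and DfBatchWriter buffer incoming data and flush it through writer_function (TableRepo.extend by default) once record_limit is reached. A minimal stand-alone sketch of that buffering pattern, with a list-appending writer standing in for a real TableRepo (an illustration of the pattern, not the package's implementation):

```python
class BufferedWriter:
    """Collects records and flushes them in batches of `record_limit`."""

    def __init__(self, writer_function, record_limit=1_000_000):
        self.writer_function = writer_function
        self.record_limit = record_limit
        self.batch = []

    def add(self, record):
        self.batch.append(record)
        if len(self.batch) >= self.record_limit:
            self.flush()

    def flush(self):
        if self.batch:
            self.writer_function(self.batch)
            self.batch = []


flushed = []
writer = BufferedWriter(flushed.append, record_limit=2)
for rec in [{"A": 1}, {"A": 2}, {"A": 3}]:
    writer.add(rec)
writer.flush()  # flush the remaining partial batch
print(flushed)  # [[{'A': 1}, {'A': 2}], [{'A': 3}]]
```

Batching like this keeps the number of parquet write operations proportional to the number of batches rather than the number of records.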

Class Inheritance Diagram

DfBatchWriter subclasses RecordWriter; HashPartitioner, ObjIngestor, RecordWriter, and TableRepo have no parent classes within the package.

Release Notes

v0.0.0

  • first release of parquetranger, yay!!

v0.0.1

  • fix specific dask

v0.0.2

  • add accessors

v0.0.3

  • add col stretching

v0.0.4

  • fix axis error

v0.0.5

  • fix names indices

v0.0.6

  • add env handling

v0.0.7

  • add name prop

v0.0.8

  • remove duplicate indices

v0.0.9

  • params and renames

v0.0.10

v0.0.11

v0.0.12

  • add extra metadata feature

v0.1.0

  • add extra metadata feature

v0.1.1

v0.1.2

v0.1.3

  • lazy client loading

v0.1.4

v0.1.5

v0.1.6

v0.1.7

v0.2.0

  • first release of parquetranger, yay!!

v0.2.1

  • return partition map

v0.2.2

  • try_dask default change

v0.2.3

  • partition gb path

v0.3.0

  • smarter lock business

v0.4.0

  • dump dask

v0.4.1

  • expose vc path

v0.4.2

  • add group col deleting

v0.4.3

  • fit new atqo

v0.4.4

  • no col issue

v0.4.5

  • map partition with level

v0.4.6

  • add batch writers

v0.5.0

  • massive speedup

v0.5.1

  • refix mysteries

v0.5.2

  • fixed thing
