parquetranger
Installation:
using pip
pip install parquetranger
Quickstart
import pandas as pd
from parquetranger import TableRepo
df = pd.DataFrame(
{
"A": [1, 2, 3, 4, 5, 6],
"B": ["x", "y", "z", "x1", "x2", "x3"],
"C": [1, 2, 1, 1, 1, 2],
"C2": ["a", "a", "b", "a", "c", "c"],
},
index=["a1", "a2", "a3", "a4", "a5", "a6"],
)
df
index | A | B | C | C2 |
---|---|---|---|---|
a1 | 1 | x | 1 | a |
a2 | 2 | y | 2 | a |
a3 | 3 | z | 1 | b |
a4 | 4 | x1 | 1 | a |
a5 | 5 | x2 | 1 | c |
a6 | 6 | x3 | 2 | c |
trepo = TableRepo("some_tmp_path", group_cols="C2") # this creates the directory
trepo.extend(df)
trepo.get_full_df()
index | A | B | C | C2 |
---|---|---|---|---|
a1 | 1 | x | 1 | a |
a2 | 2 | y | 2 | a |
a4 | 4 | x1 | 1 | a |
a3 | 3 | z | 1 | b |
a5 | 5 | x2 | 1 | c |
a6 | 6 | x3 | 2 | c |
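The rows above come back ordered by the group column rather than in their original order. A minimal pandas-only sketch of why that can happen, assuming (this is an assumption, not a statement about parquetranger internals) that each distinct value of the group column is kept as its own partition and that a full read concatenates the partitions in key order:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5, 6],
        "B": ["x", "y", "z", "x1", "x2", "x3"],
        "C": [1, 2, 1, 1, 1, 2],
        "C2": ["a", "a", "b", "a", "c", "c"],
    },
    index=["a1", "a2", "a3", "a4", "a5", "a6"],
)

# Hypothetical in-memory stand-in for the partitions: one frame per C2 value.
# groupby sorts the keys ("a", "b", "c") and keeps row order inside each group.
partitions = {key: part for key, part in df.groupby("C2")}

# Reading the full table back amounts to concatenating the partitions.
full = pd.concat(list(partitions.values()))
print(list(full.index))  # ['a1', 'a2', 'a4', 'a3', 'a5', 'a6']
```

The `partitions` dictionary is purely illustrative; the real library writes parquet files under the repo path.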
df2 = pd.DataFrame(
{
"A": [21, 22, 23],
"B": ["X", "Y", "Z"],
"C": [10, 20, 1],
"C2": ["a", "b", "a"],
},
index=["a1", "a4", "a7"]
)
trepo.replace_records(df2) # replaces based on index
trepo.get_full_df()
index | A | B | C | C2 |
---|---|---|---|---|
a2 | 2 | y | 2 | a |
a1 | 21 | X | 10 | a |
a7 | 23 | Z | 1 | a |
a3 | 3 | z | 1 | b |
a4 | 22 | Y | 20 | b |
a5 | 5 | x2 | 1 | c |
a6 | 6 | x3 | 2 | c |
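The effect of replacing records by index can be imitated with plain pandas. This is only a sketch of the semantics visible in the table above (the same set of rows, not necessarily the same row order), not the library's implementation:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5, 6],
        "B": ["x", "y", "z", "x1", "x2", "x3"],
        "C": [1, 2, 1, 1, 1, 2],
        "C2": ["a", "a", "b", "a", "c", "c"],
    },
    index=["a1", "a2", "a3", "a4", "a5", "a6"],
)
df2 = pd.DataFrame(
    {"A": [21, 22, 23], "B": ["X", "Y", "Z"], "C": [10, 20, 1], "C2": ["a", "b", "a"]},
    index=["a1", "a4", "a7"],
)

# Drop every stored row whose index label also appears in the new data ...
kept = df.loc[~df.index.isin(df2.index)]
# ... then add the new rows, whichever group they belong to.
replaced = pd.concat([kept, df2])
print(replaced.sort_index())
```

Note that `a4` moves from group `a` to group `b`, because the match is made on the index alone and the new row carries its own `C2` value.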
trepo.replace_groups(df2)
trepo.get_full_df() # every group present in df2 (C2 == "a" and C2 == "b") now holds only the records from df2
index | A | B | C | C2 |
---|---|---|---|---|
a1 | 21 | X | 10 | a |
a7 | 23 | Z | 1 | a |
a4 | 22 | Y | 20 | b |
a5 | 5 | x2 | 1 | c |
a6 | 6 | x3 | 2 | c |
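Group replacement can be sketched in the same pandas-only way. The `stored` frame below is typed in by hand from the table that precedes the `replace_groups` call; the code illustrates the observed behaviour and is not taken from the library:

```python
import pandas as pd

# State of the repo before replace_groups, copied from the earlier table.
stored = pd.DataFrame(
    {
        "A": [2, 21, 23, 3, 22, 5, 6],
        "B": ["y", "X", "Z", "z", "Y", "x2", "x3"],
        "C": [2, 10, 1, 1, 20, 1, 2],
        "C2": ["a", "a", "a", "b", "b", "c", "c"],
    },
    index=["a2", "a1", "a7", "a3", "a4", "a5", "a6"],
)
df2 = pd.DataFrame(
    {"A": [21, 22, 23], "B": ["X", "Y", "Z"], "C": [10, 20, 1], "C2": ["a", "b", "a"]},
    index=["a1", "a4", "a7"],
)

# Keep only the groups that df2 does not mention (here just C2 == "c") ...
kept = stored.loc[~stored["C2"].isin(df2["C2"])]
# ... and let df2 supply the complete content of the groups it does mention.
regrouped = pd.concat([kept, df2])
print(regrouped.sort_values("C2"))
```

Rows `a2` and `a3` disappear because their groups were replaced wholesale and df2 does not contain them.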
trepo.replace_all(df2) # erases everything and stores df2 instead; all traces of df are lost
trepo.get_full_df()
index | A | B | C | C2 |
---|---|---|---|---|
a1 | 21 | X | 10 | a |
a7 | 23 | Z | 1 | a |
a4 | 22 | Y | 20 | b |
trepo.replace_records(df, by_groups=True) # replaces records by index, but only matches indices within the same group,
# so the a4 index can appear twice: once in group C2 == "a" and once in group C2 == "b"
trepo.get_full_df()
index | A | B | C | C2 |
---|---|---|---|---|
a7 | 23 | Z | 1 | a |
a1 | 1 | x | 1 | a |
a2 | 2 | y | 2 | a |
a4 | 4 | x1 | 1 | a |
a4 | 22 | Y | 20 | b |
a3 | 3 | z | 1 | b |
a5 | 5 | x2 | 1 | c |
a6 | 6 | x3 | 2 | c |
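With `by_groups=True` the match is effectively made on the pair (group value, index label) instead of the index label alone. A rough pandas sketch of that behaviour, assuming the repo holds df2 beforehand (as it does after `replace_all`) and using a hand-rolled key comparison rather than anything from the library:

```python
import pandas as pd

# Repo content before the call: df2, as stored by replace_all.
stored = pd.DataFrame(
    {"A": [21, 22, 23], "B": ["X", "Y", "Z"], "C": [10, 20, 1], "C2": ["a", "b", "a"]},
    index=["a1", "a4", "a7"],
)
new = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5, 6],
        "B": ["x", "y", "z", "x1", "x2", "x3"],
        "C": [1, 2, 1, 1, 1, 2],
        "C2": ["a", "a", "b", "a", "c", "c"],
    },
    index=["a1", "a2", "a3", "a4", "a5", "a6"],
)

# A stored row is replaced only if the new data has the same (C2, index) pair.
new_keys = set(zip(new["C2"], new.index))
keep_mask = [key not in new_keys for key in zip(stored["C2"], stored.index)]
kept = stored.loc[keep_mask]

by_groups = pd.concat([kept, new])
print(by_groups.sort_values("C2"))
```

Only `("a", "a1")` matches, so the stored `a4` row in group `b` survives next to the new `a4` row in group `a`, which gives eight rows in total.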
trepo.purge() # deletes everything
API
parquetranger Package
Read and write parquet files
Classes
TableRepo | helps with storing, extending and reading tabular data in parquet format |
Class Inheritance Diagram
[Figure: class inheritance diagram. DfBatchWriter derives from RecordWriter; HashPartitioner, ObjIngestor and TableRepo are shown without parent classes.]
Constructor signatures recorded in the diagram:
RecordWriter(trepo: parquetranger.core.TableRepo, record_limit: int = 1000000, writer_function: Callable = TableRepo.extend, batch: list = <factory>)
DfBatchWriter(trepo: parquetranger.core.TableRepo, record_limit: int = 1000000, writer_function: Callable = TableRepo.extend, batch: list = <factory>)
HashPartitioner(col: str | None = None, num_groups: int = 128)
ObjIngestor(root: pathlib.Path, root_id_key: Optional[str] = None, force_key: bool = False, forward_uuids: bool = False, total_atoms: int = 0, largest_size: int = 0)
Release Notes
v0.0.0
first release of parquetranger, yay!!
v0.0.1
fix specific dask
v0.0.2
add accessors
v0.0.3
add col stretching
v0.0.4
fix axis error
v0.0.5
fix names indices
v0.0.6
add env handling
v0.0.7
add name prop
v0.0.8
remove duplicate indices
v0.0.9
params and renames
v0.0.10
v0.0.11
v0.0.12
add extra metadata feature
v0.1.0
add extra metadata feature
v0.1.1
v0.1.2
v0.1.3
lazy client loading
v0.1.4
v0.1.5
v0.1.6
v0.1.7
v0.2.0
first release of parquetranger, yay!!
v0.2.1
return partition map
v0.2.2
try_dask default change
v0.2.3
partition gb path
v0.3.0
smarter lock business
v0.4.0
dump dask
v0.4.1
expose vc path
v0.4.2
add group col deleting
v0.4.3
fit new atqo
v0.4.4
no col issue
v0.4.5
map partition with level
v0.4.6
add batch writers
v0.5.0
massive speedup
v0.5.1
refix mysteries
v0.5.2
fixed thing