# First approach to parquet
In 2024, the National Institute of Statistics (Instituto Nacional de Estadística, abbreviated INE) orchestrated the latest census in Chile. I encourage you to visit the official webpage.
I discovered the page just today, and found out that the data is available in the *resultados* ("results") tab. There are links to download the data as csv or parquet.
I saw the opportunity to use parquet for the first time. Hence, I downloaded the files to start playing with them.
## What is parquet?
parquet is a columnar storage format, designed to optimize the distributed processing of large datasets. It is widely used in processing systems like Hadoop and Spark. In Python, working with parquet files is made easy by the libraries pyarrow and pandas.
Let's go back to the concept of columnar storage. Imagine a dataset containing the students' grades in a course. The set of evaluations can be very diverse: homework, written essays, research works, exams, seminars, debates, etc.
We can present the dataset as a table (h = homework, w = written essay, r = research work, e = exam):
| Name | h1 | h2 | \(\cdots\) | w1 | w2 | \(\cdots\) | r1 | r2 | \(\cdots\) | e1 | e2 | \(\cdots\) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alice | A | C | \(\cdots\) | B | A | \(\cdots\) | A | A | \(\cdots\) | C | B | \(\cdots\) |
| Bob | B | A | \(\cdots\) | B | C | \(\cdots\) | B | A | \(\cdots\) | A | D | \(\cdots\) |
| Charley | C | B | \(\cdots\) | A | C | \(\cdots\) | A | B | \(\cdots\) | F | A | \(\cdots\) |
If we want to analyze the students' performance on their homework, in principle we have to load the whole dataset into the system. This is very inefficient for large datasets.
What if we could load just the columns containing the homework grades? We would save time transmitting information, use our storage efficiently, and save processing time. All of this translates directly into money, particularly in cloud computing.
This is exactly one of the problems solved by parquet.
Moreover, columnar data often has more repetitive values within a column, making it more amenable to compression algorithms and reducing storage space requirements.
## Using parquet in Python
### With pandas
pandas has methods to read and write parquet files: `read_parquet()` and `to_parquet()`.
- Reading
```python
import pandas as pd

df = pd.read_parquet('example.parquet')
```
- Writing
```python
import pandas as pd

data = {'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']}
df = pd.DataFrame(data)
df.to_parquet('output.parquet')
```
### With pyarrow
pyarrow: a Python package that provides an interface to the Arrow C++ library for working with columnar data.
pyarrow has a parquet module that allows manipulating this format, in particular reading and writing parquet files with `read_table()` and `write_table()`.
- Reading
```python
import pyarrow.parquet as pq

table = pq.read_table('example.parquet')
```
- Writing
```python
import pyarrow as pa
import pyarrow.parquet as pq

col1 = pa.array([1, 2, 3], type=pa.int16())
col2 = pa.array(['a', 'b', 'c'])
table = pa.Table.from_arrays(
    [col1, col2],
    names=["index", "name"]
)
pq.write_table(table, 'output.parquet')
```
For the sake of completeness, this is what the pyarrow table looks like:
```
pyarrow.Table
index: int16
name: string
----
index: [[1,2,3]]
name: [["a","b","c"]]
```
### Converting pyarrow tables to pandas
pyarrow tables have a method to convert them into a pandas DataFrame, namely `.to_pandas()`:
```python
import pyarrow as pa
import pyarrow.parquet as pq

col1 = pa.array([1, 2, 3], type=pa.int16())
col2 = pa.array(['a', 'b', 'c'])
table = pa.Table.from_arrays(
    [col1, col2],
    names=["index", "name"]
)
df = table.to_pandas()
print(df.head())
```
```
   index name
0      1    a
1      2    b
2      3    c
```
## Loading a few columns of the parquet dataset
The `read_table` function of the `pyarrow.parquet` module has a `columns` argument, which is used to provide the list of columns to be imported. Below, I use it to load only four of the columns in the `viv_hog_per_censo2024/viviendas_censo2024.parquet` file.
```python
import pyarrow.parquet as pq

table = pq.read_table(
    "viv_hog_per_censo2024/viviendas_censo2024.parquet",
    columns=["id_vivienda", "region", "provincia", "comuna"]
)
print(table)
```
```
pyarrow.Table
id_vivienda: int32
region: int32
provincia: int32
comuna: int32
----
id_vivienda: [[1,2,3,4,5,...,131068,131069,131070,131071,131072],[131073,131074,131075,131076,131077,...,262140,262141,262142,262143,262144],...,[7471105,7471106,7471107,7471108,7471109,...,7602172,7602173,7602174,7602175,7602176],[7602177,7602178,7602179,7602180,7602181,...,7664462,7664463,7664464,7664465,7664466]]
region: [[5,4,11,1,8,...,3,13,6,6,9],[1,13,3,9,13,...,10,13,13,15,7],...,[13,7,8,5,13,...,8,13,2,5,14],[8,8,13,4,10,...,3,14,9,5,13]]
provincia: [[58,43,112,11,83,...,33,131,61,61,92],[11,131,31,91,131,...,101,131,131,151,74],...,[131,74,81,56,131,...,81,132,21,56,141],[81,81,131,41,103,...,31,142,91,51,131]]
comuna: [[5802,4303,11202,1101,8301,...,3302,13130,6109,6101,9203],[1101,13114,3101,9101,13114,...,10102,13131,13115,15101,7406],...,[13118,7406,8101,5604,13115,...,8111,13202,2101,5605,14101],[8106,8102,13104,4101,10306,...,3101,14202,9105,5101,13122]]
```
## A bit more using pandas
As I mentioned, we can load the parquet files directly from pandas. In this example, we shall load a few columns of the parquet dataset using the `columns` argument of the `read_parquet` function.
```python
import pandas as pd

df = pd.read_parquet(
    "./viv_hog_per_censo2024/viviendas_censo2024.parquet",
    columns=["id_vivienda", "region", "provincia", "comuna"]
)
print(df.head())
```
```
   id_vivienda  region  provincia  comuna
0            1       5         58    5802
1            2       4         43    4303
2            3      11        112   11202
3            4       1         11    1101
4            5       8         83    8301
```
It is also possible to use filters to load the dataset partially:
```python
import pandas as pd

df = pd.read_parquet(
    "./viv_hog_per_censo2024/viviendas_censo2024.parquet",
    filters=[("region", "=", 5)],
    columns=["id_vivienda", "region", "provincia", "comuna"]
)
print(df.head(n=10))
```
```
   id_vivienda  region  provincia  comuna
0            1       5         58    5802
1           21       5         55    5504
2           23       5         55    5504
3           39       5         56    5605
4           40       5         51    5109
5           42       5         51    5107
6           47       5         55    5504
7           60       5         51    5101
8           65       5         51    5105
9           74       5         55    5502
```
```python
import pandas as pd

df = pd.read_parquet(
    "./viv_hog_per_censo2024/viviendas_censo2024.parquet",
    # filters=[("region", "=", 5)],
    columns=["id_vivienda", "region", "provincia", "comuna"]
)
print(df.groupby(["region", "provincia"])["comuna"].count())
```
region  provincia
1       11            123677
        14             21470
2       21            152576
        22             66800
        23             13930
3       31             87560
        32             14070
        33             38479
4       41            246491
        42             46389
        43             80798
5       51            379042
        52              3115
        53             45292
        54             52596
        55             82741
        56            133388
        57             63522
        58            156894
6       61            271686
        62             41512
        63             95620
7       71            191687
        72             32452
        73            127697
        74            135716
8       81            405166
        82             70442
        83            186347
9       91            389269
        92             92276
10      101           200465
        102            86152
        103           107343
        104            10883
11      111            28797
        112            15358
        113             2924
        114             5350
12      121            60256
        122              784
        123             4179
        124            11555
13      131          2067326
        132           209289
        133           113462
        134           184487
        135            81172
        136           117471
14      141           136010
        142            47730
15      151            87208
        152             2477
16      161           154726
        162            28841
        163            51521
Name: comuna, dtype: int64
```
I'll continue exploring parquet databases and the data from the Census 2024 in a following post.