# First approach to parquet
In 2024, the National Institute of Statistics (Instituto Nacional de Estadística, abbreviated INE) orchestrated the latest census in Chile. I encourage you to visit the official webpage.
I discovered the page just today, and found out that the data is available in the *resultados* ("results") tab. There are links to download the data as csv or parquet.
I saw the opportunity to use parquet for the first time. Hence, I downloaded the files to start playing with them.
## What is parquet?
parquet is a columnar storage format, designed to optimize the distributed processing of large datasets. It is widely used in processing systems like Hadoop and Spark. In Python, working with parquet files is made easy by the libraries pyarrow and pandas.
Let's go back to the concept of columnar storage. Imagine a dataset containing the students' grades in a course. The set of evaluations can be very diverse: homework, written essays, research works, exams, seminars, debates, etc.
We can present the dataset as a table (h = homework, w = written essay, r = research work, e = exam):
| Name | h1 | h2 | \(\cdots\) | w1 | w2 | \(\cdots\) | r1 | r2 | \(\cdots\) | e1 | e2 | \(\cdots\) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alice | A | C | \(\cdots\) | B | A | \(\cdots\) | A | A | \(\cdots\) | C | B | \(\cdots\) |
| Bob | B | A | \(\cdots\) | B | C | \(\cdots\) | B | A | \(\cdots\) | A | D | \(\cdots\) |
| Charley | C | B | \(\cdots\) | A | C | \(\cdots\) | A | B | \(\cdots\) | F | A | \(\cdots\) |
If we want to analyze the students' performance on their homework, in principle we have to load the whole dataset into the system. This is very inefficient for large datasets.
What if we could load just the columns containing the homework grades? We would save time transmitting information, use our storage efficiently, and save processing time. All of this translates directly into money, particularly in cloud computing.
This is exactly one of the problems solved by parquet.
Moreover, columnar data often has more repetitive values within a column, making it more amenable to compression algorithms and reducing storage space requirements.
## Using parquet in Python
### With pandas
pandas has methods to read and write parquet files: `read_parquet()` and `to_parquet()`.
- Reading
```python
import pandas as pd

df = pd.read_parquet('example.parquet')
```
- Writing
```python
import pandas as pd

data = {'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']}
df = pd.DataFrame(data)
df.to_parquet('output.parquet')
```
### With pyarrow
pyarrow: a Python package that provides an interface to the Arrow C++ library for working with columnar data.
pyarrow has a parquet module that allows manipulating this format, in particular reading and writing parquet files with `read_table()` and `write_table()`.
- Reading
```python
import pyarrow.parquet as pq

table = pq.read_table('example.parquet')
```
- Writing
```python
import pyarrow as pa
import pyarrow.parquet as pq

col1 = pa.array([1, 2, 3], type=pa.int16())
col2 = pa.array(['a', 'b', 'c'])
table = pa.Table.from_arrays(
    [col1, col2],
    names=["index", "name"]
)
pq.write_table(table, 'output.parquet')
```
For the sake of completeness, this is what the pyarrow table looks like:
```
pyarrow.Table
index: int16
name: string
----
index: [[1,2,3]]
name: [["a","b","c"]]
```
### Converting pyarrow tables to pandas
pyarrow tables have a method to convert them into a pandas DataFrame, namely `.to_pandas()`:
```python
import pyarrow as pa
import pyarrow.parquet as pq

col1 = pa.array([1, 2, 3], type=pa.int16())
col2 = pa.array(['a', 'b', 'c'])
table = pa.Table.from_arrays(
    [col1, col2],
    names=["index", "name"]
)
df = table.to_pandas()
print(df.head())
```
```
   index name
0      1    a
1      2    b
2      3    c
```
## Loading a few columns of the parquet dataset
The `read_table` function of the `pyarrow.parquet` module has a `columns` argument, which is used to provide the list of columns to be imported. Below, I use it to load only four of the columns in the `viv_hog_per_censo2024/viviendas_censo2024.parquet` file.
```python
import pyarrow.parquet as pq

table = pq.read_table(
    "viv_hog_per_censo2024/viviendas_censo2024.parquet",
    columns=["id_vivienda", "region", "provincia", "comuna"]
)
print(table)
```
```
pyarrow.Table
id_vivienda: int32
region: int32
provincia: int32
comuna: int32
----
id_vivienda: [[1,2,3,4,5,...,131068,131069,131070,131071,131072],[131073,131074,131075,131076,131077,...,262140,262141,262142,262143,262144],...,[7471105,7471106,7471107,7471108,7471109,...,7602172,7602173,7602174,7602175,7602176],[7602177,7602178,7602179,7602180,7602181,...,7664462,7664463,7664464,7664465,7664466]]
region: [[5,4,11,1,8,...,3,13,6,6,9],[1,13,3,9,13,...,10,13,13,15,7],...,[13,7,8,5,13,...,8,13,2,5,14],[8,8,13,4,10,...,3,14,9,5,13]]
provincia: [[58,43,112,11,83,...,33,131,61,61,92],[11,131,31,91,131,...,101,131,131,151,74],...,[131,74,81,56,131,...,81,132,21,56,141],[81,81,131,41,103,...,31,142,91,51,131]]
comuna: [[5802,4303,11202,1101,8301,...,3302,13130,6109,6101,9203],[1101,13114,3101,9101,13114,...,10102,13131,13115,15101,7406],...,[13118,7406,8101,5604,13115,...,8111,13202,2101,5605,14101],[8106,8102,13104,4101,10306,...,3101,14202,9105,5101,13122]]
```
## A bit more using pandas
As I mentioned, we can load the parquet files directly from pandas. In this example, we shall load a few columns of the parquet dataset using the `columns` argument of the `read_parquet` function.
```python
import pandas as pd

df = pd.read_parquet(
    "./viv_hog_per_censo2024/viviendas_censo2024.parquet",
    columns=["id_vivienda", "region", "provincia", "comuna"]
)
print(df.head())
```
```
   id_vivienda  region  provincia  comuna
0            1       5         58    5802
1            2       4         43    4303
2            3      11        112   11202
3            4       1         11    1101
4            5       8         83    8301
```
It is also possible to use filters to load the dataset partially:
```python
import pandas as pd

df = pd.read_parquet(
    "./viv_hog_per_censo2024/viviendas_censo2024.parquet",
    filters=[("region", "=", 5)],
    columns=["id_vivienda", "region", "provincia", "comuna"]
)
print(df.head(n=10))
```
```
   id_vivienda  region  provincia  comuna
0            1       5         58    5802
1           21       5         55    5504
2           23       5         55    5504
3           39       5         56    5605
4           40       5         51    5109
5           42       5         51    5107
6           47       5         55    5504
7           60       5         51    5101
8           65       5         51    5105
9           74       5         55    5502
```
```python
import pandas as pd

df = pd.read_parquet(
    "./viv_hog_per_censo2024/viviendas_censo2024.parquet",
    # filters=[("region", "=", 5)],
    columns=["id_vivienda", "region", "provincia", "comuna"]
)
print(df.groupby(["region", "provincia"])["comuna"].count())
```
region  provincia
1       11            123677
        14             21470
2       21            152576
        22             66800
        23             13930
3       31             87560
        32             14070
        33             38479
4       41            246491
        42             46389
        43             80798
5       51            379042
        52              3115
        53             45292
        54             52596
        55             82741
        56            133388
        57             63522
        58            156894
6       61            271686
        62             41512
        63             95620
7       71            191687
        72             32452
        73            127697
        74            135716
8       81            405166
        82             70442
        83            186347
9       91            389269
        92             92276
10      101           200465
        102            86152
        103           107343
        104            10883
11      111            28797
        112            15358
        113             2924
        114             5350
12      121            60256
        122              784
        123             4179
        124            11555
13      131          2067326
        132           209289
        133           113462
        134           184487
        135            81172
        136           117471
14      141           136010
        142            47730
15      151            87208
        152             2477
16      161           154726
        162            28841
        163            51521
Name: comuna, dtype: int64
```
I'll continue exploring parquet databases and the data from the Census 2024 in a following post.