Are you worried that PostgreSQL cannot scale writes within a single node, or do you think that scaling PostgreSQL can only be done horizontally, across different Kubernetes nodes? Discover the surprising truth behind PostgreSQL’s vertical scalability in this first article of a two-part series. Explore the potential of optimizing CPU, RAM, and storage resources through meticulous measurement and benchmarking, challenging conventional scaling wisdom. Delve into the solid strategies within the CloudNativePG stack, such as separate volumes for data and transaction logs, temporary tablespaces, and I/O segregation for tables and indexes. Stay tuned for insights into aligning storage solutions with PostgreSQL’s resilience needs in the upcoming sequel.

During KubeCon EU 2024 in Paris, I delivered a talk with Gari Singh from Google Cloud titled “Scaling Heights: Mastering Postgres Database Vertical Scalability with Kubernetes Storage Magic” (the video of that talk is available on the CNCF channel on YouTube).

A PostgreSQL database cluster comprises a single primary with an arbitrary number of read-only replicas for High Availability. One of the biggest mistakes we could make is quickly jumping to conclusions and labelling Postgres as a database that cannot scale. The goal of that presentation was to remind us that scaling doesn’t necessarily imply doing it horizontally - that is, across different nodes. Scaling can also happen on a single node by taking full advantage of all the available CPU, RAM, and storage resources.

Like that presentation, this two-part blog series will focus on the most critical component for a database’s vertical scalability: storage.

Start simple #

Although there are several use cases where horizontal scaling is required, I advise starting simple, adopting a scientific approach, and measuring your database systems before taking that path. The reason is purely pragmatic: a PostgreSQL cluster is much simpler to manage regarding Disaster Recovery and High Availability than an active/active counterpart that spans across different nodes, for example.

Start by defining your business goals: Recovery Point Objective (RPO), Recovery Time Objective (RTO), and the number of transactions per second (TPS) that you require. Then benchmark PostgreSQL, and only if the results are not good enough proceed with an active/active distributed solution. Benchmark the storage with fio and the database using pgbench (either with the built-in OLTP-like workload or by writing your custom queries).

In any case, fine-tuning PostgreSQL to exploit each node entirely is an activity that will bring benefits even in a distributed scenario.

Scaling with volumes in CloudNativePG #

CloudNativePG is designed to manage PersistentVolumeClaim (PVC) resources directly instead of relying on a Statefulset resource like most operators that work with data (another one that followed our approach is Strimzi for Kafka). If you are interested, we explain the reasons behind this choice on the “Pod Controller” page from the CloudNativePG documentation.

Every CloudNativePG instance necessitates a mandatory volume for the PostgreSQL data, aka PGDATA, meticulously configured within the storage stanza.

Flexibility is vital with CloudNativePG, as it includes an optional volume for Write Ahead Log (WAL) files, configured via the walStorage stanza, bolstering data protection and recovery capabilities. CloudNativePG empowers you to create an arbitrary number of PostgreSQL tablespaces through the tablespaces stanza, effectively decoupling physical and logical data modelling.

Additionally, CloudNativePG offers the freedom to use different storage classes for each volume, catering to diverse storage requirements and preferences and optimising cost-efficiency and I/O bandwidth for specific volume purposes. For example, you can use a different storage class for WAL file storage or a particular tablespace.

You can add volumes to live clusters and resize them (subject to storage class support), ensuring adaptability to evolving storage needs. The primary advantages of scaling with volumes include performance isolation and predictability, effective I/O distribution across volumes, and streamlined database maintenance operations such as indexing, reindexing, or VACUUM.

Regardless of the number of PVCs, integration with High Availability and Disaster Recovery is seamless, facilitated by features like volume snapshot backup and recovery for large databases.

Knowing the strengths and weaknesses of your CSI solutions is essential for making informed decisions and effectively managing PostgreSQL databases in a Kubernetes environment.

Adding a separate volume for WAL files #

The first option you have is to move the transaction logs (WAL files) onto another volume. All you must do is add the walStorage stanza as in the example below:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: separate-wal
spec:
  instances: 3
  storage:
    size: 40Gi
  walStorage:
    size: 10Gi

Each instance within the separate-wal cluster will be equipped with two volumes: one designated for PGDATA and another dedicated to WAL files. Internally, CloudNativePG manage the symbolic link to ensure that the pg_wal directory seamlessly points to the appropriate directory within the WAL volume. In the event of adding a volume for WAL files later on, CloudNativePG orchestrates a smooth rolling update process. This involves halting replicas one at a time, transferring WAL files to the new volume, and updating symbolic links. The process is further refined by the primaryUpdateMethod option, which dictates whether the primary undergoes a restart or a switchover.

Employing a dedicated volume for WAL files yields tangible enhancements in database performance. Specifically, some benchmarks I have run show an improvement between 15% and 45%, depending on how much your access workload is on disk vs in memory.

The critical aspect here is to properly size the volume for WALs and make it coherent with PostgreSQL settings such as min_wal_size and max_wal_size. Running out of disk space in the WAL volume will force PostgreSQL to halt ( there is an ongoing discussion on improving this in CloudNativePG).

The critical point is the ability to use a WAL-optimised storage class different from the one used for PGDATA. For example:

  walStorage:
    storageClass: my-favourite-storage-class-for-wals
    size: 10Gi

Adding temporary tablespaces #

In PostgreSQL, you can set up one or more temporary tablespaces. They are handy for creating temporary objects, namely temporary tables and indexes. They’re also used to make temporary files to sort big data sets. You can find many resources online to learn more about temporary tablespaces. For the interactive approach he used, I like the article Daniel Westermann wrote a few years ago.

Internally, PostgreSQL controls temporary tablespaces via the temp_tablespaces parameter, which supports multiple occurrences. CloudNativePG abstracts the management of this parameter through the .spec.tablespaces[*].temporary option.

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: tmp-tablespace
spec:
  instances: 3
  storage:
    size: 40Gi
  walStorage:
    size: 10Gi
  tablespaces:
  - name: tmptbs1
    temporary: true
    storage:
      size: 40Gi

You can verify that the tmptbs1 tablespace is in the temp_tablespaces parameter as follows:

kubectl exec -ti tmp-tablespace-1 -c postgres \
  -- psql -c 'SHOW temp_tablespaces'

Returning:

 temp_tablespaces
------------------
 tmptbs1
(1 row)

As mentioned, you could even add a second temporary tablespace:

# <snip>
  tablespaces:
  # <snip>
  - name: tmptbs2
    temporary: true
    storage:
      size: 40Gi

The same kubectl command will now return:

 temp_tablespaces
------------------
 tmptbs1,tmptbs2
(1 row)

In this case, when PostgreSQL creates a temporary object inside a transaction, it randomly picks one from the temp_tablespaces list and then sequentially iterates through it.

Separating I/O for indexes and tables #

A widely used technique, particularly effective with simpler databases and scenarios, involves separating I/O operations for tables and indexes. In this example, we create two tablespaces: data for tables and idx for indexes.

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: idx-tablespace
spec:
  instances: 3
  storage:
    size: 40Gi
  walStorage:
    size: 10Gi
  tablespaces:
  - name: data
    storage:
      size: 40Gi
  - name: idx
    storage:
      size: 40Gi

Once the tablespaces are there, you can position a table in the data tablespace by specifying the TABLESPACE option in the Data Definition Language (DDL) statement that creates the table. Given that this will also place any index for that table in the data tablespace, you must create:

constraints by specifying USING INDEX TABLESPACE idx
indexes by specifying TABLESPACE idx

What’s next #

By now, you have understood the flexibility the CNPG stack (Kubernetes, PostgreSQL, and CloudNativePG) provides regarding storage. As with any paradox of choice, the risk is for you to get lost in the ocean of possibilities. Start simple, then evaluate the storage possibilities you have based on the underlying Kubernetes environment.

Work with storage vendors, especially on-premises, and make them do the work for you (trust me, they’ll be happy to do it!). As shown in the KubeCon talk, I published a straightforward set of benchmarking guidelines that they can use to start providing some valuable insights.

Pay attention to performance, but make sure you don’t lose any data at the storage level (for example, what happens if you suddenly shut down the underlying hardware and storage?). PostgreSQL is designed from the ground up to be resilient to these issues. Remember that planning and practice environments are the perfect places to run these critical experiments: you don’t want to discover that your storage layer doesn’t honour data durability in the performance environment (i.e., operations, production, or execution).

In the next article, I continue covering the vertical scalability of PostgreSQL databases with storage, focusing on tablespaces and horizontal table partitioning.

Please stay tuned for upcoming updates! To keep yourself informed, kindly follow my LinkedIn and X channels. If you found this article helpful, why not share it with your social media network using the links below? Your support would mean a lot to me!