The Six Pillars of Database-Driven Application Performance
Master these six pillars to keep your application simple and avoid unnecessary architectural complexity
Hi, speedsters!
As I mentioned in earlier posts, the first few posts in this newsletter will cover the fundamentals of application performance before I move on to more niche topics.
In the last newsletter, I covered six ways ORMs make it easy to write poorly performing code. But that is only part of the story.
In this newsletter, I will cover the bigger picture of working with a database: six significant topics that have the biggest impact on your application's performance, which you need to be conscious of and which can decrease the need to add complexity to your system (caching, multiple different DBs, etc.).
Side note - I may refer to RDBMS and SQL in this post, but these principles apply to NoSQL systems and beyond.
Let's start.
Database version
New major versions come with new features and often include performance improvements. This is doubly true for PostgreSQL.
A benchmark made by EDB shows a 47% performance improvement between PostgreSQL v9.6 and v15 and a 10.4% improvement from v12 to v15. v16 adds some minor improvements as well. You can read more in this post by "benchant.com".
Hardware
The database is the heart of your application. If it doesn't have enough computing resources, your whole application will be waiting for these resources to be available.
CPU
For a small-to-medium workload, you need a minimum of 4 cores for parallelism of your workloads, connection concurrency, and background jobs done by the DB (e.g., journaling). Scale up with these considerations in mind:
Workload Characteristics: CPU-intensive workloads (e.g., complex calculations and heavy transactions) require more cores for better performance.
Concurrency: Higher numbers of concurrent connections and transactions can increase CPU requirements.
Database Size and Complexity: Larger databases or those with complex schemas and queries may benefit from additional CPU resources to handle the increased processing demands.
Memory / RAM
RAM impacts the database's ability to cache data, work with indexes, and process queries efficiently, significantly affecting overall performance. Small-to-medium sizes require at least 4 GB of RAM; scale up based on these considerations:
Database Working Set Size: Ideally, your server should have enough RAM to hold the frequently accessed ("hot") portion of your database (tables and indexes) in memory to reduce disk I/O.
Connection Overhead: PostgreSQL uses a multi-process architecture, where each connection has its own dedicated process. More connections imply more RAM for connection overhead. As a rough estimate, plan for about 10 MB (or more) per connection, so 200 connections already account for roughly 2 GB.
Workload Type: Read-heavy workloads benefit from more RAM for caching, reducing read operations from disk. Write-heavy workloads may benefit less from caching but require sufficient RAM for sorting and transaction logs.
Storage
The storage size varies from one application to another, but what is important here is the type of storage and its IO performance:
Storage Type: SSDs should be your choice
Provisioned IOPS: Consider using provisioned IOPS storage for high-performance workloads to ensure consistent and fast disk I/O.
Storage Auto-scaling: Enable auto-scaling for storage to adjust as your database grows automatically.
Scaling up your system usually costs more money. Monitor the use of your current resources closely and consider ways to decrease resource usage before adding more.
Network proximity / Geography
It's best to keep your database and application server as close to each other as possible, network-wise so that they can communicate more efficiently. One way to do this is by provisioning them in the same region on your cloud service and putting them within the same private virtual network. Also, it's a good idea to use private IP addresses for communication between them rather than public IPs, which will help reduce the number of network hops and increase the speed of data transfer.
Connection Management
Opening a connection to the database is an expensive operation. It includes DNS resolution, the TCP handshake, the TLS handshake, authentication, and setting up internal resources such as session objects and watchdogs. So, be conscious about opening connections to the DB and reuse them as much as possible.
The way to achieve this is connection pooling: the pool maintains a set of open connections and opens new ones only until it reaches a configured maximum, so your application code almost always has an open connection ready to use.
Number of queries in a single request/transaction
Whether you use connection pooling or not, your database/pool has a limit on the maximum number of open connections. You should also verify the number of queries and their processing time so you won't get to the point where your code is waiting for a connection to be available instead of running your business logic.
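If you run PostgreSQL, a quick way to see how close you are to that limit is to compare the configured maximum with the number of sessions currently open; a minimal sketch:

```sql
-- PostgreSQL: configured connection limit vs. connections currently in use
SHOW max_connections;
SELECT count(*) AS open_connections FROM pg_stat_activity;
```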
What should you look for?
The N+1 problem (see the sketch after this list)
Querying too much unnecessary data:
Filtering fetched data in code instead of in your queries.
Aggregating fetched data in code instead of in a query (count/sum/average/min/max/etc.)
Fetching fields without using them (watch out for big text fields or tables with many columns)
Use joins instead of multiple separate queries.
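To make these points concrete, here is a minimal sketch with a hypothetical schema of users(id, name, active) and orders(id, user_id, total). The N+1 pattern issues one query per user, while a single join with in-database aggregation returns the same answer in one round trip:

```sql
-- The N+1 pattern: one query for the users, then one query per user.
SELECT id, name FROM users WHERE active;   -- returns N users
SELECT * FROM orders WHERE user_id = 1;    -- ...repeated N times, once per user

-- A single join, with the aggregation done by the database:
SELECT u.id, u.name,
       count(o.id)               AS order_count,
       coalesce(sum(o.total), 0) AS total_spent
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.active
GROUP BY u.id, u.name;
```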
Optimize query processing time.
Before we start to optimize, we need to understand the logical order of query evaluation: FROM/JOIN first, then WHERE, GROUP BY, HAVING, SELECT, ORDER BY, and finally LIMIT (see the annotated query after the list below).
In other words:
Be careful joining too many tables.
Conditions and grouping should be cheap to evaluate (JOIN, WHERE, GROUP BY).
Sorting is faster when there is less data left to sort, and even better when the data is already stored in order.
LIMIT is applied last, so it mainly reduces the amount of data transferred rather than the work done to produce the result.
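Putting it together, here is an annotated query (reusing the hypothetical users/orders schema from above, with made-up country and created_at columns) showing the logical evaluation order; the planner may physically execute things differently, but the result is defined in this order:

```sql
SELECT   u.country, count(*) AS users_count  -- 5. output columns are computed
FROM     users u                             -- 1. source tables are combined first
JOIN     orders o ON o.user_id = u.id        --    (JOIN conditions)
WHERE    o.created_at >= '2024-01-01'        -- 2. rows are filtered
GROUP BY u.country                           -- 3. rows are grouped
HAVING   count(*) > 10                       -- 4. groups are filtered
ORDER BY users_count DESC                    -- 6. results are sorted
LIMIT    10;                                 -- 7. only then is the output truncated
```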
Notes about JOINs
Suppose you select from multiple tables using:
FROM table1, table2 WHERE table1.id=table2.other_id
You may get a Cartesian product of table1 and table2 that is only then filtered. Using an explicit JOIN statement ensures that won't happen:
FROM table1 INNER JOIN table2 ON table1.id=table2.other_id
Most ORMs will use JOINS properly if instructed to do eager loading.
Fine-tune condition evaluation with indexes
What are indexes
Imagine you have a table of users containing ID, name, and birth date. Your app has a search function for all the users who have a birthday today. Usually, the database will scan the table data row by row and filter for the records with this birthday. Now imagine this table has 1M records.
Indexes are like a sidecar to your tables: they hold a "table of contents" in various data structures, so when you query data, the records can be reached without scanning the entire table. An index contains the data of one or more of your columns, but instead of a plain list it uses other data structures, such as trees or hash tables, that make searching much faster. A full table scan is O(n), while an index scan is usually O(log n) or O(1), depending on the index type.
So now, instead of scanning the entire table to find users with a birthday, it checks the index, gets pointers for the records in the table (or, more accurately, to the exact pages in the filesystem), and then retrieves the data.
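A minimal sketch of such an index, assuming the users(id, name, birthday) table from the example:

```sql
CREATE INDEX users_birthday_idx ON users (birthday);

-- This lookup can now use an index scan instead of reading all 1M rows:
SELECT id, name FROM users WHERE birthday = '1990-05-14';
```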
What should you index?
The database first evaluates the conditions in the FROM clause (if you have joins), then the WHERE clause, and then the GROUP BY. To speed up your queries, index every column used in your JOIN clauses, then the common ground of the conditions in your WHERE clauses, and finally the columns you group by.
One exception to remember: the database won't use an index on a low-cardinality column, meaning the data isn't unique enough, e.g., a Boolean value or small enums (gender, permissions, etc.).
Types of indexes
Single column index
The most classic index, which indexes the data based on a single column (like the birthday index above).
Multi-column / Composite index
It indexes the data based on the value of multiple columns, in the order declared when the index is created. Creating an index on (A, B, C) lets you optimize queries with conditions like WHERE A=1, WHERE A=1 AND B=2, or WHERE A=1 AND B=2 AND C=2. If a query doesn't include A, it won't use the index at all; if it includes only A and C, but not B, it will use the index only for the value of A. The index is also not efficient for a query like A=1 OR B=2; for that situation, it's better to have two separate indexes, so the query engine fetches the rows matching A and the rows matching B using the indexes and evaluates the OR on the filtered data.
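For illustration, a sketch with a hypothetical table t holding the columns a, b, and c:

```sql
CREATE INDEX t_a_b_c_idx ON t (a, b, c);

SELECT * FROM t WHERE a = 1 AND b = 2;  -- can use the index (leading columns match)
SELECT * FROM t WHERE b = 2 AND c = 2;  -- can't use it: the leading column a is missing
SELECT * FROM t WHERE a = 1 OR b = 2;   -- better served by two single-column indexes
```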
Unique index
Use this index to make sure the table doesn't have duplicate data. The primary key is a unique index (behind the scenes). Querying using a Unique index is fast because finding one value within the index is fast, and then you get an exact pointer to the record containing the value.
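A minimal sketch, assuming the users table also has an email column:

```sql
CREATE UNIQUE INDEX users_email_unique ON users (email);

-- Inserting a duplicate email now fails, and lookups by email stay fast:
SELECT id FROM users WHERE email = 'dana@example.com';
```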
Partial index
As the name suggests, it creates the index on only part of the data in the table. It suits the situation where you frequently filter on the same value, and it also helps with low-cardinality columns. The index definition includes a condition on the data to be indexed, for example, "index the birthdays of active users only."
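A sketch of such an index, assuming the users table has a boolean active column:

```sql
CREATE INDEX users_active_birthday_idx ON users (birthday) WHERE active;

-- Queries that repeat the same condition can use the smaller partial index:
SELECT id, name FROM users WHERE active AND birthday = '1990-05-14';
```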
Expression index
Sometimes, there is a need to filter/group based on the result of a function. For example, to run a case-insensitive search, the condition will look like this:
WHERE lower(name)=lower(:input)
Using the lower function will typically prevent the use of a plain index; hence, the query falls back to a full table scan. The same applies if, for example, we want to find the number of users born in the same year:
SELECT year(birthday), count(*) FROM users GROUP BY year(birthday);
Databases make it possible to index the result of an expression, and then the query engine knows to use this index to execute the query.
CREATE INDEX users_lower_name ON users(lower(name));
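The same idea applies to the birth-year example; a sketch in PostgreSQL syntax (the exact date function differs between databases):

```sql
CREATE INDEX users_birth_year_idx ON users ((EXTRACT(YEAR FROM birthday)));

-- The planner may now use the index for this aggregation:
SELECT EXTRACT(YEAR FROM birthday) AS birth_year, count(*)
FROM users
GROUP BY birth_year;
```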
Index Data structures
B-Tree - the default data structure for indexes in most databases. The tree structure guarantees a worst case of O(log n), and it is best for searching ranges of values.
Hash - hashes the column values and enables O(1) lookups when filtering for equality. However, the query engine won't use this index for range lookups (BETWEEN, greater/smaller than) or LIKE statements.
GIN (PostgreSQL) - an "inverted index" appropriate for data values that contain multiple component values, such as arrays, JSON, and strings.
GiST (PostgreSQL) - Generalized Search Tree indexes are not a single kind of index but rather an infrastructure within which many different indexing strategies can be implemented. It is typically used for geometry types, text search, nearest-neighbor searches, and ranges of timestamps/IPs (using the correct data type).
SP-GiST (Space-Partitioned GiST) - most valuable when your data has a natural clustering element and does not form an evenly balanced tree. A great example is phone numbers (at least US ones): they follow a format of 3 digits for the area code, 3 for the prefix, and 4 for the line number.
BRIN - short for "Block Range INdexes", these store summaries about the values stored in consecutive physical block ranges of a table. They are most effective for columns whose values are well-correlated with the physical order of the table rows, for example, time-series data such as audit logs or events.
Fulltext - gathers one or more columns into a document, then preprocesses the document in various ways into an index, making it possible to run different kinds of text matches (including handling spelling mistakes).
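A few PostgreSQL sketches showing how a non-default data structure is chosen when the index is created (the documents and audit_logs tables and their columns are hypothetical):

```sql
CREATE INDEX users_email_hash_idx     ON users      USING hash (email);      -- equality lookups only
CREATE INDEX documents_tags_gin_idx   ON documents  USING gin  (tags);       -- arrays / JSONB containment
CREATE INDEX audit_logs_time_brin_idx ON audit_logs USING brin (created_at); -- append-only, time-ordered data
```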
What kind of index and data structure should you use?
In most cases, the default of a B-tree with a single-column or multi-column index will be enough and give you most of what you need. When it's not enough, I would examine the data itself, the data types involved, the exact query, and the query plan using `EXPLAIN`. Then, we can create another index (and replace the old one).
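A minimal sketch of checking a query plan in PostgreSQL: EXPLAIN prints the plan, and EXPLAIN ANALYZE also runs the query and reports actual timings:

```sql
EXPLAIN ANALYZE
SELECT id, name FROM users WHERE birthday = '1990-05-14';
-- Look for an "Index Scan using users_birthday_idx" node versus a "Seq Scan" node in the output.
```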
💡 Pay attention: you should not index everything, as tempting as it is, because every index uses more storage and slows down write operations. Remember that the query planner usually uses the most selective index (the one fetching the smallest amount of data), so some indexes may never be used.
Fine-tune data fetching with Partitioning
If you deal with big data and need more than indexing, partitioning the data lets you query only the relevant portion. For example, if you partition the data by year, a query that filters on 2024 reads only the 2024 partition, and the 2023 data won't be retrieved from storage unless you explicitly ask for it (a partition acts like a separate storage unit). If most of your queries filter by 2024, this will dramatically improve query performance.
Partitioning has maintenance overhead, and it can lead to a spike in lock manager waits that negatively impacts query performance. A growing number of smaller tables, each with multiple indexes and partitions, can generate many locks per query and cause contention problems.
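A minimal sketch of range partitioning by date in PostgreSQL syntax (the events table and its columns are hypothetical):

```sql
CREATE TABLE events (
    id         bigint GENERATED ALWAYS AS IDENTITY,
    created_at date   NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2023 PARTITION OF events
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE events_2024 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- Partition pruning: this query reads only the events_2024 partition.
SELECT count(*) FROM events WHERE created_at >= '2024-01-01';
```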
Based on my experience, focusing on these six pillars can improve application performance by 80-90%, even before implementing additional solutions such as caching, rewriting specific parts of the application in more performant languages, splitting into microservices, or adopting NoSQL.