In this article, you will learn how SQL aggregate functions can represent an easy way to significantly improve the performance of your application. Particularly, you will see how they were a game-changer in a real-world scenario based on a data-driven application developed for a startup operating in the sports industry.
Let’s now dig deeper into this scenario and learn why SQL aggregate functions are so crucial when it comes to data science.
Introducing the Scenario
The application I recently worked on aims to offer advanced data exploration features in the sports world through the web. In detail, it needs to allow exploration of both raw and aggregated data. Since the database involves terabytes of heterogeneous and unstructured data, the challenges were mostly on the backend and database side. Now, let’s dive into this scenario.
Technologies, server specs, and architecture
We developed the backend in Kotlin with the Spring Boot 2.5.3 framework and the Hibernate 5.4.32.Final ORM (Object-Relational Mapping). We deployed it on an 8 GB, 4-CPU VPS through a Docker container managed by Dokku. The initial JVM heap size was set to 2 GB and capped at 7 GB, while the remaining gigabyte of RAM was allocated to a Redis-based caching system. We developed the web application with performance in mind. Specifically, it is based on the multi-layered Spring Boot architecture described here and it involves multi-threaded processing.
Database structure
The database is a MySQL server running on an 8 GB, 2-CPU VPS. The same server farm hosts both the backend application and the database, but they do not share the same VPS. Since the data coming from sports is simple but highly heterogeneous, the database was structured to avoid duplication and encourage standardization. This is why a relational database was chosen. Right now, the database involves hundreds of tables, and it cannot be presented entirely here due to an NDA.
Luckily, the most problematic tables share more or less the same structure, so analyzing just one table should be enough. In particular, this is what the `PositionalData` table looks like:
```sql
CREATE TABLE `PositionalData` (
  `id` int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
  `area1` double DEFAULT NULL,
  `area2` double DEFAULT NULL,
  `area3` double DEFAULT NULL,
  `area4` double DEFAULT NULL,
  `area5` double DEFAULT NULL,
  -- ...
  `area140` double DEFAULT NULL,
  `area141` double DEFAULT NULL,
  `area142` double DEFAULT NULL,
  `area143` double DEFAULT NULL,
  `area144` double DEFAULT NULL,
  `value` double DEFAULT NULL,
  `parameterId` int(11) NOT NULL,
  `gameId` int(11) NOT NULL,
  `createdAt` datetime DEFAULT CURRENT_TIMESTAMP,
  `createdBy` int(11) DEFAULT NULL,
  `updatedAt` datetime DEFAULT CURRENT_TIMESTAMP,
  `updatedBy` int(11) DEFAULT NULL,
  `deletedAt` datetime DEFAULT NULL,
  `deletedBy` int(11) DEFAULT NULL,
  `active` tinyint(1) DEFAULT '1'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
```
As you can see, it involves more than 100 columns and more than four external IDs (foreign keys). On average, each of these tables contains at least 15 million rows.
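For illustration, a drastically simplified, hypothetical version of this table can be built in SQLite via Python. This is only a sketch (3 of the 144 area columns, no audit columns, SQLite instead of MySQL), but it captures the shape of the data:

```python
import sqlite3

# Hypothetical, scaled-down stand-in for the real table:
# 3 of the 144 area columns, SQLite instead of MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE PositionalData (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        area1 REAL, area2 REAL, area3 REAL,
        value REAL,
        parameterId INTEGER NOT NULL,
        gameId INTEGER NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO PositionalData (area1, area2, area3, value, parameterId, gameId)"
    " VALUES (?, ?, ?, ?, ?, ?)",
    [(1.0, 2.0, 3.0, 10.0, 1, 1), (4.0, 5.0, 6.0, 20.0, 1, 2)],
)
print(conn.execute("SELECT COUNT(*) FROM PositionalData").fetchone()[0])  # → 2
```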
Performance issues
One of the key features of the frontend application is to let users analyze the aggregated values of hundreds of different sports parameters (e.g., passes, throws, blocks) coming from all the selected games of one or more seasons. To retrieve this data, we developed a backend API performing a query on the aforementioned table. Such a query was nothing more than a trivial `SELECT` returning from 10k to 20k rows. This data is then aggregated by a multi-threaded process, stored in the Redis cache, and finally serialized to JSON and returned to the frontend application. The first time users call that API, before the result is available in the Redis cache, it takes from 2 to 4 seconds. This is unacceptable.
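The flow above can be sketched in miniature. The snippet below (Python with SQLite rather than Kotlin with MySQL, single-threaded, hypothetical data) fetches every matching row and then aggregates it in application code, which is exactly the pattern that proved slow:

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PositionalData (area1 REAL, value REAL, parameterId INT, gameId INT)"
)
rows = [(float(i % 10), float(i), i % 3, i % 5) for i in range(10_000)]
conn.executemany("INSERT INTO PositionalData VALUES (?, ?, ?, ?)", rows)

# Application-level aggregation: pull thousands of raw rows, then loop.
raw = conn.execute(
    "SELECT area1, value, parameterId FROM PositionalData WHERE gameId IN (0, 1, 2)"
).fetchall()
sums, totals, counts = defaultdict(float), defaultdict(float), defaultdict(int)
for area1, value, parameter_id in raw:
    sums[parameter_id] += area1
    totals[parameter_id] += value
    counts[parameter_id] += 1
averages = {p: totals[p] / counts[p] for p in counts}

# Thousands of rows crossed the database boundary before aggregation began.
print(len(raw))  # → 6000
```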
Delving Into the Performance Problem
Let’s now see what the downsides of the above approach are.
ORM data transformation process bottleneck
Most advanced ORMs abstract how data is represented at the database level. In other terms, the ORM performs the query, retrieves the desired data from the database, and takes care of transforming it into its application-level representation. This data transformation process is performed behind the scenes, but it undoubtedly represents an overhead. Although it is usually negligible in terms of performance, when it comes to thousands of rows it can easily become a bottleneck. This is especially true when using OO (Object-Oriented) languages, since creating a new class instance takes time and resources. To limit the object size, and thus the heap usage, a solution might be to select only the strictly necessary columns. This would make each object lighter, but since the object creation itself is the main overhead here, the time spent on the transformation process would not change significantly.
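The per-row object cost can be illustrated outside of any ORM. The sketch below (Python, with a hypothetical `PositionalRow` class standing in for a Hibernate entity) compares passing plain tuples through with constructing one object per row, which is roughly the extra work an ORM does for each result row:

```python
import timeit
from dataclasses import dataclass

@dataclass
class PositionalRow:
    # Tiny, hypothetical stand-in for an entity with 144+ columns.
    area1: float
    value: float
    parameter_id: int

tuples = [(float(i), float(i) * 2, i % 3) for i in range(10_000)]

# Baseline: just pass the raw tuples through.
as_tuples = timeit.timeit(lambda: [t for t in tuples], number=50)
# ORM-like path: materialize one object per row.
as_objects = timeit.timeit(
    lambda: [PositionalRow(a, v, p) for a, v, p in tuples], number=50
)
# Constructing an object per row is consistently slower than passing
# tuples through; with 100+ fields per entity the gap widens further.
print(as_objects > as_tuples)  # → True
```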
Looping takes time
Performing simple operations like sums or averages on arrays containing thousands of objects is not performance-free. Although this does not compare to the time spent by the ORM transforming the data, it certainly represents an additional overhead. Fortunately, Java supports many thread-safe collections for performing operations concurrently. On the other hand, opening and managing threads is itself a complex and time-consuming operation.
Adopting SQL Aggregate Functions
Let’s see how a couple of SQL aggregate functions helped me solve the performance issue.
What are SQL aggregate functions?
SQL aggregate functions allow you to perform a calculation involving several rows and obtain one value as a result. Even though each SQL dialect has its own set of aggregate functions, the most common ones are:
- `COUNT()`: returns the number of rows selected
- `MIN()`: extracts the minimum value
- `MAX()`: extracts the maximum value
- `SUM()`: computes the sum of the values
- `AVG()`: computes the average of the values
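As a quick, self-contained illustration, here are all five functions in action (SQLite via Python purely for the sake of a runnable example; the syntax is the same in MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (value REAL)")
conn.executemany("INSERT INTO scores VALUES (?)", [(2.0,), (4.0,), (6.0,)])

# One row comes back, no matter how many rows were aggregated.
count, minimum, maximum, total, average = conn.execute(
    "SELECT COUNT(*), MIN(value), MAX(value), SUM(value), AVG(value) FROM scores"
).fetchone()
print(count, minimum, maximum, total, average)  # → 3 2.0 6.0 12.0 4.0
```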
They represent a particularly powerful and useful tool when used in association with the GROUP BY clause: you first group the desired data, then aggregate each group. If you want to delve into MySQL aggregate functions, you can find all the supported ones in the official MySQL documentation.
Replacing application-level operations with queries
While SQL aggregate functions seemed promising, I did not know if they could make a difference before seeing them in action. Specifically, the application-level operation generated a data structure containing the average of the `value` column and the sum of each `areaX` column (with X from 1 to 144) for each parameter chosen over the selected games. This can easily be translated into the following query:
```sql
SELECT
  SUM(`area1`) AS `area1`,
  SUM(`area2`) AS `area2`,
  SUM(`area3`) AS `area3`,
  -- ...
  SUM(`area142`) AS `area142`,
  SUM(`area143`) AS `area143`,
  SUM(`area144`) AS `area144`,
  AVG(`value`) AS `value`,
  `parameterId`
FROM `PositionalData`
WHERE `parameterId` IN (:parameterIds) AND `gameId` IN (:gameIds)
GROUP BY `parameterId`
```
As you can see, this query takes advantage of SQL aggregate functions to return aggregated data at the database level, while filtering the desired rows with the `IN` operator on `gameId` and `parameterId` and grouping them by `parameterId`. In other words, data is first filtered based on the selected games of the season and the desired parameters to analyze. Then, the resulting data is grouped by parameter, and finally aggregated by the SQL aggregate functions.
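The same filter-group-aggregate pattern can be reproduced on a simplified, hypothetical version of the table (SQLite via Python, two area columns instead of 144, made-up data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PositionalData (area1 REAL, area2 REAL, value REAL,"
    " parameterId INT, gameId INT)"
)
conn.executemany(
    "INSERT INTO PositionalData VALUES (?, ?, ?, ?, ?)",
    [
        (1.0, 2.0, 10.0, 1, 100),
        (3.0, 4.0, 20.0, 1, 101),
        (5.0, 6.0, 30.0, 2, 100),
        (9.0, 9.0, 99.0, 2, 999),  # filtered out: gameId not selected
    ],
)
# Filter with IN, group by parameter, aggregate with SUM/AVG.
rows = conn.execute(
    """
    SELECT SUM(area1), SUM(area2), AVG(value), parameterId
    FROM PositionalData
    WHERE parameterId IN (1, 2) AND gameId IN (100, 101)
    GROUP BY parameterId
    ORDER BY parameterId
    """
).fetchall()
print(rows)  # → [(4.0, 6.0, 15.0, 1), (5.0, 6.0, 30.0, 2)]
```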
Defining the right indexes
Since that query involves `GROUP BY`, `IN`, and aggregate functions, it might be slow. This is why defining the right indexes is so important. In detail, the most important and performance-effective index applied was the following one:
```sql
ALTER TABLE `PositionalData`
ADD INDEX `PositionalData_parameterId_gameId` (`parameterId`, `gameId`) USING BTREE;
```
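Whether a query actually uses such a composite index can be checked with the database's query-plan tooling (`EXPLAIN` in MySQL). As a self-contained sketch, SQLite's `EXPLAIN QUERY PLAN` shows the same idea on a simplified, hypothetical table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PositionalData (value REAL, parameterId INT, gameId INT)")
# Composite index mirroring the one added in production.
conn.execute(
    "CREATE INDEX PositionalData_parameterId_gameId"
    " ON PositionalData (parameterId, gameId)"
)
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT AVG(value) FROM PositionalData "
    "WHERE parameterId IN (1, 2) AND gameId IN (100, 101) "
    "GROUP BY parameterId"
).fetchall()
# The plan should mention the composite index rather than a full table scan.
print(plan)
```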
Pros
- Database-level aggregation is much faster than executing the same aggregation logic at the application level while looping on large arrays.
- Using queries involving SQL aggregate functions with the GROUP BY statement allows you to drastically reduce the number of rows returned. In detail, this allowed me to go from about 10k rows to a number of rows equal to the number of parameters analyzed. Consequently, this makes the data transformation process performed by the ORM irrelevant in terms of time, preventing it from being a bottleneck.
- Aggregating at the database level allows you to take advantage of the performance benefits coming from the database cache when executing the same requests. This can make it less crucial to have an application-level cache, consequently leading to a lighter architecture.
Cons
- SQL aggregate functions are generally used at `SELECT` time. When dealing with strongly typed programming languages, the ORM needs to know the type the result will have, and not all ORMs allow you to define it easily, sometimes even restricting SQL aggregate functions to native queries only. This means losing the abstraction introduced by the ORM, which discourages their use.
- A query involving SQL aggregate functions is always slower than a simple `SELECT` with a `WHERE` clause. At the same time, the execution time should remain in the order of tenths of a second, which is in any case much less than performing the same operation at the application level.
- The SQL aggregate operations available are usually about a dozen, of which only five or six are actually mathematical operations.
Performance Comparison
Let’s compare the response times when calling the same API, involving data aggregation, with no cache and the same parameters.
- Response time when performing aggregation at the application level: ~2 to 4 s
- Response time when performing aggregation at the database level: ~800 ms
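The gap can be reproduced in miniature. The sketch below (SQLite via Python, hypothetical data, so the absolute timings on the real system do not transfer) contrasts the number of rows crossing the database boundary in the two approaches, which is the root cause of the difference:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PositionalData (value REAL, parameterId INT, gameId INT)")
conn.executemany(
    "INSERT INTO PositionalData VALUES (?, ?, ?)",
    [(float(i), i % 3, i % 20) for i in range(100_000)],
)

# Application-level: every matching row is transferred, then aggregated.
raw = conn.execute(
    "SELECT value, parameterId FROM PositionalData WHERE gameId IN (0, 1, 2)"
).fetchall()

# Database-level: only one row per parameter comes back.
aggregated = conn.execute(
    "SELECT AVG(value), parameterId FROM PositionalData "
    "WHERE gameId IN (0, 1, 2) GROUP BY parameterId"
).fetchall()

print(len(raw), len(aggregated))  # → 15000 3
```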
Final Thoughts on SQL Aggregate Functions
SQL aggregate functions are without doubt a great tool for taking performance to the next level when dealing with data science. Using them is easy and effective, although not all ORMs can fully or natively support them. Either way, knowing how to take advantage of them may become essential to improving performance, and explaining that through a real-world case study was what this article was about.