Challenges in using CI during database development
Full Deployment vs. Delta Deployment
While code can be overwritten entirely during release of a software application (full deployment), it is necessary to bear in mind, when working with database systems, that renewed deployment of objects usually causes a loss of the persistent data therein. Database systems are also used in traditional software development, but mostly just as a backend for a software or web application. The database system is optimized in such cases for smaller transactions, changes in the data model being rare and usually purely additive. Furthermore, a smaller number of developers usually work on the database system. Most of the changes can be created using simple scripts or DB Diffs, special attention to data being required only in individual cases. In DWS, however, a large portion of the work must be dedicated to the database, with structural changes, data migrations, archivings and updates forming part of daily activities. Changes here can therefore not be created and deployed simply via DB Diffs. To be made available for a release is a specific transition / upgrade script which takes into account data changes (DML) in addition to structural changes (DDL). Here it is essential to ensure that data are not lost or falsified. An additional challenge is posed by the large amounts of data (compared to transactional systems). To mitigate the consequential rise in complexity, various tools such as Liquibase and Flyway have been established. An overview of possible tools is provided here.
Exclusive creation of upgrade scripts can lead to problems, however. If the totality of structural changes and the configuration files of all upgrade scripts are always implemented sequentially in order to deploy the current development stage in an empty environment, this can quickly lead to bottlenecks, thus breaching key practice 7. Furthermore, the code alone does not make it easy to determine which objects exist in which form in the current development stage, because they are distributed in different upgrade scripts. These are just a few of the reasons for generating and maintaining, wherever possible, a snapshot of the current development stage in addition to the upgrade scripts. Caution, however: If two code artefacts are maintained, it is necessary to ensure (e.g. through a use of automatic DB Diffs) that they do not drift apart.
Dependencies Within a Build
Existent within a database are dependencies between objects which cannot always be resolved easily. For example, a view can access a function, but also vice versa. Challenges can arise especially if snapshots are used for the reasons previously mentioned, and the latest development stage is to be rolled out into an empty environment. In traditional software development, the standard procedure at this point would be to analyze the system and its loaders, this ultimately being similar to deployment of object categories. Due to the circumstances described above, however, it is not sufficient to simply deploy object categories building on each other in DWS (first all tables, then all views, and so on). Probably the most simple, but not very elegant solution would be to accept the dependencies and manage them to the greatest possible extent with coding standards and conventions. A more elegant but perhaps costly solution would be to analyze the dependencies before deployment and create an optimal flow path.
Heterogenous Production Environment
Using CI, many problems can be detected and fixed early by automating and testing the individual steps right up to the productive environment. In software development, the CI environment usually reflects the production environment quite accurately. However, the practicality of this for database development is limited for the following reasons:
- Gigantic storage space might be required for a full copy of the production environment.
- Cloning (or importing a backup) of the production environment can take a great deal of time.
- Due to the GDPR, data can usually not be used on a 1:1 basis for testing purposes.
- For financial reasons, multiple instances of hardware and licenses cannot always be made available.
The balancing act at this point is to ensure the most realistic possible test environment (see key practice 8) and simultaneously a fast build (see key practice 7) despite the limitations described. The mentioned challenges can only be met with compromises, especially for large production environments in DWS. Accordingly, an environment designed specifically for tests geared toward production can be loaded with an older, anonymized data version at regular intervals and used jointly by all developers (pre-production). However, developers must perform very precise coordination here in order to optimize use of resources and avoid conflicts. Due to the longer period until the environment is updated, developers should keep the environment "tidy" and handle the data prudently.
Automated Tests
Various restrictions are imposed on DWS, in contrast to application development where individual units can be tested in considerable isolation, and interfaces can be simulated simply. Because data are actually copied with the help of management processes in the database for meaningful tests, individual tests cannot be parallelized readily. Otherwise any test could be overwritten / manipulated through parallel invocation of another test's source data before its conclusion. In addition to parallelism of tests within a build, the same reasons also prevent concurrent starting of multiple builds with tests if they are deployed for the same environment. In addition, sequential execution of individual tests also means a prolonged running period. It is true that the running period can be shortened by generating procedure scripts from the dependencies for optimal sequencing during test execution (first Test1, then Test2 and Test3 parallel, then Test4, etc.). However, the build can only be accelerated to a certain degree here, and full coverage is only possible conditionally, because not all dependencies can readily be read out automatically. For these reasons, all tests cannot be readily executed with each build. A more elaborate solution would be to perform only those tests for which objects have actually changed. Due to dynamic SQL alone, however, full coverage would be difficult to realize here, so that compromises must be reached again in this case. For example, nightly builds running at fixed times can be configured for automated tests. Furthermore, tests can also be separated so that lightweight unit tests are executed with each build, for example, while more extensive and time-consuming tests are assigned to nightly builds.