
Why Do We Need Schema in Database

Principles of Database Schema Design

Gregory Pabian

As a software developer, I have spent countless hours designing database schemas, together with other professionals. Each time I did such a task, I got better at anticipating possible problems and averting them during the design. My current knowledge is a mixture of my own experience and the ideas of others, most of which I learned from various books and articles.

Today I give back to the community by sharing a curated list of database schema design principles. I believe every database schema designer should be aware of them. I hope they will amplify your experience as a software professional.

Please note that I approach the problem from the perspective of a person who builds distributed systems. Also, take everything in this article with a grain of salt — I do not know the complexity or requirements of the system you are designing!

By tables, I mean structures organized by rows and columns in SQL databases, and by collections, I mean their counterparts in No-SQL databases.

For better understanding, I have divided all the principles into four groups:

  • process-related principles,
  • database-related principles,
  • query-related principles,
  • entity-related principles.


Process-related Principles

Understand the Problem

Before trying to design anything, you should have a clear and deep understanding of the real-life problem the application you are building is trying to solve. Do realize that databases are expensive to maintain, time- and money-wise.

Given that cost, maybe there is a solution that does not warrant booting up a new SQL server. Perhaps your problem calls for a No-SQL database or an object storage service instead?

In general, it is not wise to limit oneself to just one idea!

Define Access Control

You should ask yourself which parts of your system will access the new database, directly or indirectly. Will you stick to the old-fashioned one-backend-per-database rule, or will you use a different strategy? Should you allow people (developers or product managers) to access the database directly?

Depending on the requirements, you might need to approach security topics differently. For this article, I coined two terms — the proactive and the reactive approach.

The proactive approach is to enforce access control using the tools provided by the database engine. In the case of a SQL database, you might want to use database users with different privileges — for instance, distinct CRUD privileges for particular tables. Using a specific user for some operations provides a separation layer between queries and data. For example, to perform database calls on behalf of a system user, I could leverage a database user with exactly the privileges allotted to that user.

The reactive approach would be to use your backends for access control by authorizing the callers of queries and mutations. It means that the backends use a single database user with vast privileges to all possible operations in the database. Your schema could contain some journaling information to tell who introduced which change; it can provide relevant information for user audits.


Write the Schema on Paper

I think writing the schema on paper first allows its designer to spot evident flaws earlier. It is easier to correct errors with a pencil than in an existing SQL database — or, even worse, in an ORM that generated the SQL schema for us.

Additionally, you can move the paper schema to a digital environment. It might invite more collaboration from other team members, especially with remote working arrangements. For people who prefer formal structures, some environments support UML diagrams as well.

Review the Design

You should review your design with other people, possibly many times over. You might easily overlook some nitty-gritty detail, not to mention completely forget about a commissioned feature introduced a couple of days back. Most people regard databases as a complex domain in its own right, and they do not expect one person to have expert knowledge of every aspect of data storage.

Database-related Principles

Use No Foreign Keys

In a SQL database of a distributed system, foreign keys might become a hindrance. We need to apply changes to records linked by foreign keys in a particular order. Please note that not all SQL engines allow for deferring consistency checks. Allowing for inconsistent changes within a transaction is not compliant with the SQL standard.

Some developers do away with this hindrance by using nullable foreign keys. As you read this article, you will understand why I try not to use nullable fields at all.

Additionally, using foreign keys might cause deadlocks if we have not thought our queries through well! By removing foreign keys, we reduce the chance of deadlocking.

Without foreign keys, we can rely on correctly used transactions, isolation levels, and creating proper queries that move the database from one correct state to the other.

My solution is to start working with the foreign keys to benefit from the engine consistency checks and, after we have ensured the quality of our queries, ditch the foreign keys altogether.
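To illustrate the ordering constraint that foreign keys impose, here is a minimal sketch using SQLite (the author/book tables are hypothetical): with enforcement on, child rows must be written strictly after their parents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.execute("CREATE TABLE author (id TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE book (id TEXT PRIMARY KEY, author_id TEXT REFERENCES author(id))"
)

# With enforcement on, writing the child row before its parent fails,
# which forces a strict ordering of statements across the system.
rejected = False
try:
    conn.execute("INSERT INTO book VALUES ('b1', 'a1')")
except sqlite3.IntegrityError:
    rejected = True

# Writing in parent-then-child order succeeds.
conn.execute("INSERT INTO author VALUES ('a1')")
conn.execute("INSERT INTO book VALUES ('b1', 'a1')")
print(rejected)  # True
```

Dropping the foreign key later removes the rejection path entirely, at the price of the engine no longer verifying that every `book.author_id` points at a real author.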

Use No Cascades

SQL databases allow users to define cascade behavior for foreign keys to control what happens when someone updates or deletes a row in a parent table. Now, I have explained why I would not use foreign keys in a distributed system, but steering clear of cascades is a separate topic.

Even if it were possible to use cascades without foreign keys, I would not do it, as cascades hide the business logic used for updates and removals.


Understand Time Zones

Arguably, the biggest problem humanity faces with time is time zones. Time zones always feel implicit and depend on the geographical context, not to mention the infamous daylight saving time.

You can save timestamps using UTC, as it is a global standard independent of the location and daylight saving time. It is up to you whether you use dedicated types or just the epoch time defined in integers. In the end, all time-related operations must result in integer comparisons.
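A minimal Python sketch of reducing time handling to integer comparisons via UTC epoch seconds:

```python
from datetime import datetime, timedelta, timezone

def to_epoch_seconds(dt: datetime) -> int:
    """Convert a timezone-aware datetime to integer epoch seconds (UTC)."""
    return int(dt.timestamp())

# The same instant, written in UTC and in UTC+2, maps to one integer.
noon_utc = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
same_in_cest = datetime(2024, 6, 1, 14, 0, tzinfo=timezone(timedelta(hours=2)))
print(to_epoch_seconds(noon_utc) == to_epoch_seconds(same_in_cest))  # True

# Ordering instants then reduces to ordering integers.
print(to_epoch_seconds(noon_utc + timedelta(minutes=1)) > to_epoch_seconds(noon_utc))  # True
```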

Use Row Timestamps

I would advise that you know when every row (document) in your database was created, updated, or deleted. You can use it for auditing the database and debugging. Additionally, you can use these timestamps for sorting, especially for entities that have no natural ordering.

Know the Actor

Who is an actor? By actor, I mean a person or a service that alters data in the system. A person can change data using the provided UI. A service can modify data when periodically invoked by a scheduler.

You might want to persist information about the actor upon every data change, at least for journaling and audit purposes. Depending on the product requirements and available resources, you need to decide about the accuracy of your reporting system.

Use Numbered Enumerations

When working on the state or the status of particular entities, you might feel inclined to save them as strings in their respective database fields. It could allow database administrators to inspect the values with ease. Another argument is that you can always purchase more disk space.

The truth probably lies somewhere in between. I believe using integer types requires precise planning, saves space (including index space), and allows for faster comparisons. Also, if an attacker maliciously acquires the database data, integers hide portions of the business logic. That said, I think it is reasonable to use character-based fields during the Proof-of-Concept phase of a product.
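A sketch of keeping readable names in code while persisting plain integers (the OrderStatus values are made up for illustration):

```python
import sqlite3
from enum import IntEnum

class OrderStatus(IntEnum):
    """Application-side names for the integers persisted in the database."""
    PENDING = 1
    PAID = 2
    SHIPPED = 3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status INTEGER NOT NULL)")
conn.execute("INSERT INTO orders VALUES (?, ?)", ("o1", int(OrderStatus.PAID)))

# The column stores a plain integer; the enum restores meaning on read.
(raw,) = conn.execute("SELECT status FROM orders WHERE id = 'o1'").fetchone()
print(OrderStatus(raw).name)  # PAID
```

An administrator inspecting the raw table sees only `2`, which is exactly the point: the mapping from number to meaning lives in the application, not in the data.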

Limit the Column Name Length

I believe that you should keep column names as short as possible, as I find it easier to write and read concise SQL queries. If you keep sending a lot of queries all the time (you might want to use stored procedures for that as well), you might incur additional traffic by keeping your names long.

In No-SQL databases that store the column name inside each written object, you might waste a lot of disk space if you stick to long names!

Limit the Column Value Length

As a rule of thumb, you should always know how much space you need to persist a particular value.

Database engines usually restrict the length of text fields by the number of bytes, not characters, which might prove surprising when working with Unicode strings. Also, you should check the encoding and the collation of text columns — you might apply these properties on many database levels (the whole database, a table, or a column).

For real numbers, there are usually three types available:

  • floating-point single-precision type using the base of 2,
  • floating-point double-precision type (using the same base as previously),
  • fixed-precision using the base of 10.

Storing real numbers might be a challenge since the backends might not support the same precision standards as the database.

When using integers, we need to allot enough bytes on the field level. We should decide if we support signed or unsigned data. Again, your backend might not support the same integer standards as the database.


Learn about Money Types

Databases should store information about money in integer, decimal, string, or specially designed monetary columns. Using floating-point arithmetic with money usually leads to disaster — and a respectable company cannot allow that to happen.

The main problem is that our monetary units are divisible by powers of 10, not powers of 2. As such, you cannot faithfully represent them using most implementations of floating-point arithmetic, e.g., the standard known as IEEE 754.

Storing financial data in strings has one fundamental problem: locales. Different languages use different delimiters for separating the integer part from the fractional part of a number. For instance, English uses a dot, while German uses a comma. To circumvent any problems, you should store all money information in one format.
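A sketch of storing money as integer minor units (cents) in one canonical format, which sidesteps both floating-point rounding and locale-dependent strings (the helper names are illustrative; non-negative amounts only, for brevity):

```python
from decimal import Decimal

def to_cents(amount: str) -> int:
    """Parse a canonical decimal string (dot separator) into integer cents."""
    return int(Decimal(amount) * 100)

def format_cents(cents: int) -> str:
    """Render non-negative integer cents back into a canonical decimal string."""
    return f"{cents // 100}.{cents % 100:02d}"

# The classic float pitfall: 0.1 + 0.2 != 0.3 under IEEE 754.
print(0.1 + 0.2 == 0.3)  # False

# Integer cents add up exactly.
total = to_cents("0.10") + to_cents("0.20")
print(total, format_cents(total))  # 30 0.30
```

Localized rendering (dot versus comma) then becomes a pure presentation concern, applied only at the edge of the system.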

Query-related Principles

Consider Reads and Writes

In distributed systems, backends read from and write to databases many times per second. It is why you should consider the database schema design in relation (pun intended!) to the way you plan your system to use the database.

How many writes per second does it have to sustain? How many reads? Can you easily partition it? Does the database support sharding?

You should check if your database engine supports replication — this might come in handy if you need to redirect your read operations to a read replica and keep your writes to the original database.

Ensure Writes without Reads

If you are building a distributed system, you might want to ensure your writes do not depend on your reads. That means you can update table rows in your database without knowing their previous values. It forms one of the ways of ditching traditional transactions in favor of atomic updates.

For instance, to increment a field value, you could obtain the previous one, add one, and save it, which requires two operations. Alternatively, you can tell your database engine to increment the field value in one go. Please note the database of your choice might not expose the latter functionality.
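A sketch of the two approaches in SQLite (the counters table is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, value INTEGER NOT NULL)")
conn.execute("INSERT INTO counters VALUES ('visits', 0)")

# Read-then-write: two round trips, and racy without extra locking.
(current,) = conn.execute("SELECT value FROM counters WHERE name = 'visits'").fetchone()
conn.execute("UPDATE counters SET value = ? WHERE name = 'visits'", (current + 1,))

# Write-without-read: a single atomic statement the engine applies in place.
conn.execute("UPDATE counters SET value = value + 1 WHERE name = 'visits'")

(value,) = conn.execute("SELECT value FROM counters WHERE name = 'visits'").fetchone()
print(value)  # 2
```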

If your entire database works under that principle, you should be able to rebuild the database by starting from scratch and replaying all the commands and messages that entered the system in their order.

Understand Transactions and Isolation Levels

As a creator of a database schema, you should fully understand transactions and isolation levels available in the database of your choosing. Some database systems might not even offer the functionality of a transaction. Please note that you do not necessarily need transactions, as you can replace them with two-phase commits or the saga pattern.

If you choose to use transactions, you need to understand the locking mechanism behind them. If you use foreign or unique keys, the locking might feel amplified. Applying locks in transactions in a different order often results in a deadlock.

Plan Your Queries

Once you are ready with your schema, you should plan queries used by your application layer. It serves as yet another test for the feasibility of your schema. In general, I would consider writing and reading queries separately.

Write queries usually consist of a filtering part and a data part. In the former, we define which subset of rows we update; in the latter, which data we put into those rows. Most filtering operations should utilize indices for efficient and fast querying. Ideally, filtering should select exactly the minimal subset of rows we want to update. Each update ought to transform a complete entity into another complete entity.

You may verify whether creation and update mean the same thing to your application and, if so, consider them one operation, usually called an "upsert" in the industry. For at least some tables in your database, you can reap benefits by thinking about writes purely as upserts.

Similar principles apply to read queries, which consist of a filtering part and a projection part. As mentioned before, filtering should use indices for efficiency. Please consider pagination, as it is unlikely you would want to present a non-deterministic number of rows to the end-users of your product. A good projection ensures the database returns only the columns we need.
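A sketch of a read query with an explicit projection and deterministic pagination (SQLite; the articles schema is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id TEXT PRIMARY KEY, title TEXT NOT NULL, "
    "body TEXT NOT NULL, created_at INTEGER NOT NULL)"
)
rows = [(f"a{i}", f"Title {i}", "...", 1_700_000_000 + i) for i in range(10)]
conn.executemany("INSERT INTO articles VALUES (?, ?, ?, ?)", rows)

# Projection: only the columns the view needs (the heavy body column stays out).
# Pagination: a deterministic ORDER BY plus LIMIT/OFFSET for page 2 of size 3.
page = conn.execute(
    "SELECT id, title FROM articles ORDER BY created_at DESC LIMIT 3 OFFSET 3"
).fetchall()
print([row_id for row_id, _ in page])  # ['a6', 'a5', 'a4']
```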

In some databases, you can persist your queries permanently in the database as procedures. You can then move parts of the business logic away from the backends, focusing on calling procedures and formatting the results.

Plan Your Indices

When you have finished planning your queries, you should plan your indices.

Some SQL databases enforce the concept of a clustered index, which defines how rows are sorted and stored on disk. Usually, the primary key is (meant to be) the clustered index, but always ensure this is the case. The obvious conclusion is that the primary key also serves as an index — which means that read queries that use it should already be optimized.

For each table (collection), you might want to collect all the CRUD queries that run against it and prepare all candidate indices. Different queries might affect database performance differently due to their complexity and the number of executions. Also, indices cost disk space, so be reasonable about using them.

Depending on your situation, you might want to apply particular indices at a later point or when the performance of the database engine drops significantly.

In SQL databases, you might use unique indices to enforce the uniqueness of table rows outside of the associated primary key.

As a bonus point, once you have a proper database schema coded into a database, you might use the EXPLAIN keyword (SQL only) to see how the database engine will execute particular queries, especially which indices it will potentially use during the execution.
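In SQLite, for example, the same idea is spelled EXPLAIN QUERY PLAN (a sketch; the users table is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, email TEXT NOT NULL)")
conn.execute("CREATE UNIQUE INDEX idx_users_email ON users (email)")

# Ask the engine how it would execute the query before running it for real.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE email = ?", ("a@example.com",)
).fetchall()
print(plan)  # the plan detail mentions a search using idx_users_email
```

If the plan reports a full table scan instead of an index search, you know the query or the index needs rework before it ever hits production traffic.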


Entity-related Principles

Use Single Source of Truth

Given that a database consists of tables (or collections), we ought to ensure each table's independence from the others. I believe a table should contain data about one particular model class, no more, no less — a table should serve as the only source of truth for those entities. The only reasonable exception to the rule is denormalization, used to optimize reads.

Make Data Independent

Products — over time — get more features that require changes to the database schema. These changes usually result in adding tables (collections) and columns. I will not discuss the removal of tables and columns because it entails a lot of different scenarios.

When adding a table (collection), I think you should not change the existing queries and their results. If you do not use foreign keys and cascades, this should be a trivial task.

When adding columns in SQL databases, you ought to always ensure that:

  • the first migration adds a column with a default value (e.g., the null value or 0),
  • the second migration populates the column with proper values,
  • the third migration makes the column non-nullable if required.
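The first two steps can be sketched in SQLite (the users/score schema is made up; note the third step is engine-specific: PostgreSQL can alter the column to NOT NULL in place, while SQLite would require rebuilding the table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY)")
conn.execute("INSERT INTO users VALUES ('u1')")

# Migration 1: add the column with a default so existing readers keep working.
conn.execute("ALTER TABLE users ADD COLUMN score INTEGER DEFAULT 0")

# Migration 2: backfill proper values for the existing rows.
conn.execute("UPDATE users SET score = 10 WHERE id = 'u1'")

# Migration 3 would make the column non-nullable where the engine supports it.
(score,) = conn.execute("SELECT score FROM users WHERE id = 'u1'").fetchone()
print(score)  # 10
```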

If you plan on using a No-SQL database, check its documentation on adding columns — sometimes you need no migration because the engine takes care of it in your stead.

A good rule of thumb is to enforce that all changes to the database schema result in minimal changes to queries, backends, and views in your applications.

Use a Single Primary Key

Each table (or collection) should have a single primary key. You ought to refrain from using compound primary keys — tuples of keys spanning multiple columns. Compound indices require more disk space than simple indices, and data comparison might take more time as well. Alternatively, we can hash the values of many columns into a single primary key.

I am against using integer sequences for primary keys.

Firstly, if you expose these integers to the end-users, your competitors might know how many rows you have in your tables!

Secondly, a sequence generator probably requires a table-level lock to guarantee that no two new rows end up with the same primary key.

Finally, we cannot set the value of the primary key in the application code.

We can use a UUID field (or just a fixed-length text field) instead. It allows us to control the value of a primary key from the start. Also, we do not need to run queries in a particular order. Why? Because we do not need to obtain the primary key and pass it somewhere else.

By having a single primary key, we can easily reference a particular row throughout the entire database (try doing that with a compound key). Also, we can use the same value to reference it within the whole system.
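A sketch of generating the primary key in application code with a UUID, stored as fixed-length text (the invoices table is hypothetical):

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL)")

# The application decides the key before any database round trip,
# so dependent writes do not have to wait for a generated id.
invoice_id = str(uuid.uuid4())
conn.execute("INSERT INTO invoices VALUES (?, ?)", (invoice_id, 1999))

(stored_id,) = conn.execute("SELECT id FROM invoices").fetchone()
print(stored_id == invoice_id)  # True
```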

Leverage Denormalization

As a database schema designer, you should understand when to use and when not to use database normalization. In particular situations, for instance, when we need to improve the efficiency of reads, we might introduce controlled denormalization.

It means that we could enrich particular tables with additional columns or create new tables from data stored in other tables. We might also generate materialized views if we do not need to present fully up-to-date information.

Insert, Update, and Upsert

For people who have just heard the word "upsert" (I defined it before, but you might have forgotten!) — it is a portmanteau of "update" and "insert". Upserting is an operation that updates a particular entry in the database if one already exists, or creates a new one otherwise.

You might choose to use the upsert operation family to write a particular record into a database, regardless if a previous version of that record existed there before. If so, you need to ensure this is a safe operation from the point of view of your system.

Having upsert operations instead of separate insertions and updates greatly simplifies your write queries. On the other hand, you can no longer distinguish between insertions and updates, and you cannot write logic that depends on that distinction.
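In SQLite and PostgreSQL, for instance, an upsert is spelled INSERT ... ON CONFLICT (a sketch; the settings table is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settings (key TEXT PRIMARY KEY, value TEXT NOT NULL)")

upsert = (
    "INSERT INTO settings (key, value) VALUES (?, ?) "
    "ON CONFLICT (key) DO UPDATE SET value = excluded.value"
)

# The first call inserts, the second updates: one statement covers both paths.
conn.execute(upsert, ("theme", "light"))
conn.execute(upsert, ("theme", "dark"))

(value,) = conn.execute("SELECT value FROM settings WHERE key = 'theme'").fetchone()
print(value)  # dark
```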

Put Only Necessary Null Values

You should know, already at the design level, what kind of values a field might contain. When you allow a field to hold a NULL value, there must exist a reason for it. The rationale might come from the product requirements or from the intrinsics of a data structure you chose.

Create and update whole records (documents) unless you want to persist partial information by design. It might be better to fail on insertion of a partial record than to save incomplete data. Your backends might not deserialize partially completed entities, even though they can exist in a properly designed database.

Refrain from Default Values

I seldom rely on default values, be it in databases or in code. By using default values in a column definition, you scatter your business logic and create implicit behavior. I wrote "implicit" because if you provide no value when writing to a particular field, the database will provide one for you. In my experience, tracking implicit behavior is not straightforward, and I avoid it while I can.

The same applies to default NULL values for nullable fields — I think it is reasonable to use them only when you add a new column and migrate the already existing records in a particular table.

Keep Entities Independent

I believe you should always strive to keep your entities independent. It means you can move the entire table (or a collection) to a completely different database without changing other tables. If implemented correctly, such a structure gives you more flexibility when your database needs partitioning in the future.

Keep Restorable Entities

A user, or sometimes a script, can erroneously remove information from the database. To avoid issues with restoring such information, we can mark rows for removal with a specific tag on the row level. After that, we need to adjust the system to look up only unmarked rows. If we want to restore particular rows, we can flip the tag on them.
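A sketch of such soft deletion using an integer tag (the notes table and the deleted column are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id TEXT PRIMARY KEY, deleted INTEGER NOT NULL)")
conn.execute("INSERT INTO notes VALUES ('n1', 0)")

# "Remove": flip the tag instead of issuing a destructive DELETE.
conn.execute("UPDATE notes SET deleted = 1 WHERE id = 'n1'")
visible = conn.execute("SELECT COUNT(*) FROM notes WHERE deleted = 0").fetchone()[0]
print(visible)  # 0

# Restore: flip the tag back; the row was never physically removed.
conn.execute("UPDATE notes SET deleted = 0 WHERE id = 'n1'")
restored = conn.execute("SELECT COUNT(*) FROM notes WHERE deleted = 0").fetchone()[0]
print(restored)  # 1
```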

Summary

The database schema design is not an easy topic. The more systems you build, the more conclusions you will find based on your successes and failures. I find it crucial to exchange opinions on the Internet and learn from other people's experiences.

As always, you are welcome to leave a comment in the comment section!


Source: https://levelup.gitconnected.com/principles-of-database-schema-design-8e322e4fb283