Data Storage For You — Aspiration and Advisory

When a developer starts working on a new application, he must pick a data storage. The storage can serve different purposes: primary data container, cache, search index, and so on. Depending on the task, different storage types can be more or less appropriate. This article describes my view on picking the best data storage for a software application.

Data Evaluation

First, the developer has to understand what data he is going to work with: its amount, type, and structure. He needs this information to plan data storage parameters such as disk size, memory limit, and sometimes even the number of CPU cores. These parameters become the hardware foundation of the data storage.
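The sizing step can be sketched as a back-of-the-envelope calculation. The following is only an illustration: the field names, the growth factor, and all numbers are assumptions made for this example, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class DatasetEstimate:
    """Rough description of one dataset; every number here is an assumption."""
    records: int            # expected number of records
    avg_record_bytes: int   # average size of one record on disk
    index_overhead: float   # extra space for indexes, as a fraction of raw data
    hot_fraction: float     # share of the data read often enough to keep in memory


def estimate_disk_bytes(d: DatasetEstimate, growth_factor: float = 2.0) -> int:
    """Raw data plus index overhead, multiplied by headroom for future growth."""
    raw = d.records * d.avg_record_bytes
    return int(raw * (1 + d.index_overhead) * growth_factor)


def estimate_memory_bytes(d: DatasetEstimate) -> int:
    """Memory needed to keep the frequently read part of the data cached."""
    return int(d.records * d.avg_record_bytes * d.hot_fraction)


# Hypothetical "orders" dataset used only to show the calculation.
orders = DatasetEstimate(records=10_000_000, avg_record_bytes=512,
                         index_overhead=0.25, hot_fraction=0.125)
print(estimate_disk_bytes(orders) / 1e9, "GB of disk")
print(estimate_memory_bytes(orders) / 1e9, "GB of memory")
```

Even a crude estimate like this shows early whether the data fits on a single machine or has to be split.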

Then it is important to know the expected load, i.e. how often customers will request the stored information. There are four typical scenarios:

small data, low load;
small data, high load;
big data, low load;
big data, high load.

The first scenario usually does not require any data storage investigation. Typical solutions such as an RDBMS work fine here. All other cases, though, require planning: the developer has to design a data architecture that can handle the expected data volume and load.
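The four scenarios can be expressed as a tiny classifier. This is a minimal sketch: the thresholds for "big data" and "high load" below are arbitrary assumptions and depend heavily on the hardware and the storage in question.

```python
def classify_scenario(data_gb: float, requests_per_sec: float,
                      big_data_gb: float = 100.0,
                      high_load_rps: float = 1000.0) -> str:
    """Map data size and load to one of the four scenarios.

    The default thresholds are illustrative assumptions, not industry standards.
    """
    size = "big data" if data_gb >= big_data_gb else "small data"
    load = "high load" if requests_per_sec >= high_load_rps else "low load"
    return f"{size}, {load}"


def needs_planning(scenario: str) -> bool:
    """Only the 'small data, low load' case can skip storage investigation."""
    return scenario != "small data, low load"


scenario = classify_scenario(data_gb=5, requests_per_sec=20_000)
print(scenario, "-> planning needed:", needs_planning(scenario))
```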

Another concern that should be addressed from the very beginning is read/write frequency. The developer has to evaluate how often the data will be read and how often customers will change it. If the data changes often, the developer has to find a way to cache it that does not serve stale results and break the user experience.
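One possible way to cache frequently changing data is a write-through cache, where every write updates the storage and the cache together. The sketch below uses a plain dictionary as a stand-in for the real storage; the class and its counter are hypothetical and exist only for illustration.

```python
class WriteThroughCache:
    """Keeps an in-memory copy of values and updates it on every write."""

    def __init__(self, backing_store: dict):
        self._store = backing_store   # stand-in for a real database
        self._cache = {}
        self.store_reads = 0          # counts how often the storage was read

    def get(self, key):
        if key in self._cache:
            return self._cache[key]   # cache hit: storage is not touched
        self.store_reads += 1
        value = self._store[key]      # cache miss: read from storage once
        self._cache[key] = value
        return value

    def put(self, key, value):
        # Write to the storage first, then refresh the cached copy,
        # so readers never see a value older than the last write.
        self._store[key] = value
        self._cache[key] = value


store = {"profile:1": "Alice"}
cache = WriteThroughCache(store)
cache.get("profile:1")            # first read goes to the storage
cache.put("profile:1", "Alicia")  # the change reaches storage and cache
print(cache.get("profile:1"), cache.store_reads)
```

Write-through is only one option; it trades slower writes for reads that stay consistent with the storage.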

Now it is time to think about ordering chaos or, in other words, about creating a data architecture.

Data Architecture

It usually starts with taking all the data you have and trying to group it. There are many ways to do it:

by feature, i.e. each feature has its own data domain;
by component, i.e. all features that have a similar foundation will be in the same group;
by access type, i.e. separate features based on how they work with data (change data a lot, cache data, etc.);
by structure, i.e. split structured data into groups based on structure types, and split unstructured data by the data source.

This separation is the first step in organizing your data storage and its structure.
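The grouping options listed above can be tried out quickly on an inventory of datasets. This is a small sketch; the `Dataset` fields and the sample entries are made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """One entry in a hypothetical data inventory."""
    name: str
    feature: str       # which product feature owns the data
    access_type: str   # e.g. "write-heavy" or "read-heavy"
    structured: bool   # True if the data has a fixed schema


def group_by(datasets, attribute: str) -> dict:
    """Group dataset names by the value of the given attribute."""
    groups = defaultdict(list)
    for d in datasets:
        groups[getattr(d, attribute)].append(d.name)
    return dict(groups)


inventory = [
    Dataset("orders", feature="checkout", access_type="write-heavy", structured=True),
    Dataset("cart", feature="checkout", access_type="write-heavy", structured=True),
    Dataset("product_pages", feature="catalog", access_type="read-heavy", structured=False),
]
print(group_by(inventory, "feature"))
print(group_by(inventory, "access_type"))
```

Comparing the groupings side by side helps to see which split produces the most uniform groups.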

The next step is building data flows. The developer has to understand where data is coming from, how it is processed in the application, where it is stored, and when it can be archived or removed. These are the phases of the data flow. It is a very useful concept for understanding and organizing data sources and data storage from the very beginning, and for planning extension points for the future.
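A data flow can be written down as a small record that names each phase. The following is a sketch under assumed names: the `DataFlow` class, its fields, and the retention periods are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class DataFlow:
    """Describes one data flow from its source to archival and removal."""
    name: str
    source: str                   # where the data is coming from
    processing_steps: List[str]   # how the application processes it
    storage: str                  # where it is stored
    archive_after: timedelta      # age at which data can be archived
    remove_after: timedelta       # age at which data can be removed

    def phase_for_age(self, age: timedelta) -> str:
        """Return the phase a record of the given age is in."""
        if age >= self.remove_after:
            return "removed"
        if age >= self.archive_after:
            return "archived"
        return "stored"


# Example values are assumptions chosen for illustration.
clickstream = DataFlow(
    name="clickstream",
    source="web frontend events",
    processing_steps=["validate", "aggregate per session"],
    storage="analytics storage",
    archive_after=timedelta(days=90),
    remove_after=timedelta(days=365),
)
print(clickstream.phase_for_age(timedelta(days=120)))
```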

Usually, at this point, the developer already understands how the data will be grouped and stored. Depending on the data structure and application size, it may require a single storage or multiple storages. If the data is consistent enough and data flows do not involve lots of different processing, there is a good chance that the application needs only one data storage. However, if the data has multiple different structures that have to be processed differently, or data flows include lots of data conversion, then the application could use multiple storages, one for each data type or structure.

And now the developer can finally do what he wanted from the very beginning: choose a type for each storage.

The Best Storage Type

Storage type has to be defined by the data’s purpose. For example, there are storage types for relational data, plain data, table data, key-value data, cache data, search data, and so on. Each of these storage types has one or several primary use cases. Let us check the most common ones.

Relational databases are usually used as primary storage and storage for aggregated data. They are reliable, consistent, and can handle multiple simultaneous changes without conflicts.

Non-relational (NoSQL) databases are usually used as a cache, or to store raw data before aggregation or search data. They are often faster than relational databases for simple access patterns, but many of them relax consistency guarantees or offer weaker concurrency control.

Now, it is time to pick the storage type. The developer should consider multiple data parameters, like data structure, reliability, read/write speed, consistency, concurrency, availability, capacity, required hardware, and so on. There is no single storage that satisfies all application requirements across the parameters above, so the developer has to pick one or two that match these requirements as closely as possible, and then find a solution for the remaining edge cases.
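One simple way to compare candidates against these parameters is a weighted score. The sketch below is only an illustration: the candidate names, the 1 to 5 ratings, and the weights are placeholder assumptions, not measurements of any real product.

```python
def score(ratings: dict, weights: dict) -> float:
    """Sum of rating * weight over all weighted parameters.

    Parameters missing from the ratings count as 0.
    """
    return sum(weight * ratings.get(param, 0) for param, weight in weights.items())


def rank(candidates: dict, weights: dict) -> list:
    """Return candidate names ordered from best to worst match."""
    return sorted(candidates, key=lambda name: score(candidates[name], weights),
                  reverse=True)


# How much each parameter matters for this hypothetical application.
weights = {"consistency": 3, "read_speed": 1, "write_speed": 1}

# Placeholder ratings on a 1-5 scale; real values need benchmarks and research.
candidates = {
    "relational_db": {"consistency": 5, "read_speed": 3, "write_speed": 3},
    "key_value_store": {"consistency": 2, "read_speed": 5, "write_speed": 5},
}

for name in rank(candidates, weights):
    print(name, score(candidates[name], weights))
```

The score does not make the decision by itself, but it forces the requirements and their priorities to be written down explicitly.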

Taking into account that the application could evolve and require changing one storage for another, it is usually a good idea to isolate interaction with the storage using one of the data source patterns. This way the developer can change only the data access layer and keep all the application logic intact.
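As an illustration of this isolation, here is a minimal sketch of a repository-style data access layer. The interface, the class names, and the `users` table are hypothetical; SQLite from the Python standard library stands in for a real storage.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Optional


class UserRepository(ABC):
    """The only interface the application logic knows about."""

    @abstractmethod
    def save(self, user_id: int, name: str) -> None: ...

    @abstractmethod
    def find_name(self, user_id: int) -> Optional[str]: ...


class InMemoryUserRepository(UserRepository):
    """Dictionary-backed implementation, handy for tests and prototypes."""

    def __init__(self) -> None:
        self._users = {}

    def save(self, user_id: int, name: str) -> None:
        self._users[user_id] = name

    def find_name(self, user_id: int) -> Optional[str]:
        return self._users.get(user_id)


class SqliteUserRepository(UserRepository):
    """SQLite-backed implementation with the same interface."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
        )

    def save(self, user_id: int, name: str) -> None:
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT OR REPLACE INTO users (id, name) VALUES (?, ?)",
                (user_id, name),
            )

    def find_name(self, user_id: int) -> Optional[str]:
        row = self._conn.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return row[0] if row else None


def rename_user(repo: UserRepository, user_id: int, new_name: str) -> bool:
    """Application logic: depends on the interface, not on a storage."""
    if repo.find_name(user_id) is None:
        return False
    repo.save(user_id, new_name)
    return True
```

Because `rename_user` receives the repository as an argument, swapping `InMemoryUserRepository` for `SqliteUserRepository` requires no change to the application logic.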

So, these are the best practices for picking the right storage type for your application. Follow them, think ahead, and you can manage your application data under any circumstances!