This article presents the evolution of end-user storage systems, focusing on data storage complexity: from traditional hard drives to the complex fusion of filesystems and multi-disk arrays. We follow the development of data storage as it strives to satisfy ever-increasing requirements of several kinds: capacity, performance, fault tolerance and flexibility.
The simplest data storage is a regular hard drive (a.k.a. hard disk). To use it, you first need to partition the drive: in the simplest case you have a single partition, which is then formatted to create the filesystem. Translating physical sectors to logical ones is also simple and clear: there is a fixed offset between physical and logical sectors, as illustrated in the following picture.
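The fixed-offset translation can be sketched in a few lines of Python. This is only an illustration: the function name and the partition start of 2048 sectors are assumptions chosen for the example, not values taken from any particular drive.

```python
def logical_to_physical(logical_sector: int, partition_start: int) -> int:
    """Translate a sector number inside a partition into a sector number on the drive.

    partition_start is the (hypothetical) physical sector where the partition begins.
    """
    if logical_sector < 0 or partition_start < 0:
        raise ValueError("sector numbers must be non-negative")
    return partition_start + logical_sector


def physical_to_logical(physical_sector: int, partition_start: int) -> int:
    """Inverse translation; fails for sectors located before the partition."""
    if physical_sector < partition_start:
        raise ValueError("sector lies before the start of the partition")
    return physical_sector - partition_start


# Example: a partition assumed to start at physical sector 2048.
print(logical_to_physical(100, 2048))   # 2148
print(physical_to_logical(2148, 2048))  # 100
```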
A filesystem working on the virtual volume operates with clusters, which are small groups of sectors of the same size. Clusters are laid out on the virtual volume sequentially from the beginning to the end of the volume, and the filesystem uses them as needed.
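As a rough sketch of how a cluster number maps to sectors of the volume, assuming clusters are numbered from zero and start at a given sector of the volume (the names and the value of 8 sectors per cluster are illustrative assumptions):

```python
def cluster_sectors(cluster: int, sectors_per_cluster: int, data_start: int = 0) -> range:
    """Return the volume sectors occupied by one cluster.

    data_start is the (assumed) volume sector where cluster 0 begins.
    """
    if cluster < 0 or sectors_per_cluster <= 0:
        raise ValueError("invalid cluster number or cluster size")
    first = data_start + cluster * sectors_per_cluster
    return range(first, first + sectors_per_cluster)


# Example: 8 sectors per cluster, so cluster 3 covers sectors 24 to 31.
print(list(cluster_sectors(3, 8)))  # [24, 25, 26, 27, 28, 29, 30, 31]
```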
The next level of complexity is associated with the transition from a single disk to a set of disks combined into one large storage using the RAID technology (originally redundant array of inexpensive disks, now commonly called redundant array of independent disks). RAID arrays can be block or non-block.
In a non-block array, disk space is allocated sequentially and contiguously, in the same way as on a regular hard drive. The simplest non-block array is JBOD (just a bunch of disks): disk space belonging to different hard drives is concatenated into a linear virtual volume. RAID1 is also a non-block array: each sector on the virtual volume corresponds to two sectors on the physical disks, one on each disk, in a configuration called a mirror.
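The two non-block layouts can be sketched as address translations. The functions below are a simplified illustration under the assumption that each member disk is used from its first sector; real implementations also reserve room for their own metadata.

```python
from typing import List, Tuple


def jbod_locate(virtual_sector: int, disk_sizes: List[int]) -> Tuple[int, int]:
    """Return (disk index, sector on that disk) in a concatenated (JBOD) volume."""
    if virtual_sector < 0:
        raise ValueError("sector number must be non-negative")
    remaining = virtual_sector
    for disk, size in enumerate(disk_sizes):
        if remaining < size:
            return disk, remaining
        remaining -= size
    raise ValueError("sector lies beyond the end of the volume")


def raid1_locate(virtual_sector: int, copies: int = 2) -> List[Tuple[int, int]]:
    """In a mirror, the same sector number is used on every member disk."""
    return [(disk, virtual_sector) for disk in range(copies)]


# Two disks of 100 and 200 sectors: virtual sector 150 is sector 50 of the second disk.
print(jbod_locate(150, [100, 200]))  # (1, 50)
print(raid1_locate(150))             # [(0, 150), (1, 150)]
```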
In a block array, the disk space of the physical disks is first split into blocks of the same size, which are then included into a virtual volume according to a certain pattern. This is called striping. Block RAIDs can be redundant or non-redundant. A non-redundant block array is called RAID0: blocks from different disks are included into the virtual volume in turn.
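A minimal sketch of the RAID0 pattern, assuming the simplest round-robin order (block 0 on disk 0, block 1 on disk 1, and so on) and no metadata area at the start of the disks:

```python
from typing import Tuple


def raid0_locate(virtual_sector: int, num_disks: int, block_sectors: int) -> Tuple[int, int]:
    """Return (disk index, sector on that disk) for a striped volume."""
    if virtual_sector < 0 or num_disks <= 0 or block_sectors <= 0:
        raise ValueError("invalid arguments")
    block, offset = divmod(virtual_sector, block_sectors)  # which block, and where inside it
    stripe, disk = divmod(block, num_disks)                 # which row of blocks, which disk
    return disk, stripe * block_sectors + offset


# Three disks, 128 sectors per block.
print(raid0_locate(300, 3, 128))  # (2, 44)
print(raid0_locate(400, 3, 128))  # (0, 144)
```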
In redundant arrays, a virtual volume, as before, includes blocks from different disks. However, some blocks are not included into the virtual volume: they are reserved to store redundant data (parity), from which the content of a lost data block can be reconstructed. RAID5 is the most common redundant block array.
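The reconstruction relies on XOR parity: the parity block is the XOR of the data blocks in the same row, so any single missing block equals the XOR of all the remaining ones. The sketch below shows only this arithmetic; it does not model where RAID5 places the parity blocks on the disks.

```python
from functools import reduce
from typing import Iterable


def xor_blocks(blocks: Iterable[bytes]) -> bytes:
    """XOR several equally sized blocks byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]  # three data blocks of one row
parity = xor_blocks(data)                        # stored in the reserved block

# Suppose the disk holding data[1] fails: rebuild it from the survivors and the parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])  # True
```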
Full description of all RAID levels and combinations is beyond the scope of this article. If you are interested in more details, the article provides a good start.
The RAID technology, despite all its advantages, has a drawback: it expects all the disks to have the same size. You can use disks of different sizes, but every disk is then truncated to the size of the smallest one, with a clear waste of storage space. This makes expanding a RAID array difficult and tricky.
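To put a number on the waste, here is a small calculation for a RAID5 built from disks of different sizes. It assumes the usual RAID5 rule that one disk's worth of capacity goes to parity; the disk sizes are made up for the example.

```python
from typing import List


def raid5_usable(disk_sizes_gb: List[int]) -> int:
    """Usable capacity when every disk is truncated to the smallest one."""
    if len(disk_sizes_gb) < 3:
        raise ValueError("RAID5 needs at least three disks")
    return min(disk_sizes_gb) * (len(disk_sizes_gb) - 1)


def wasted_space(disk_sizes_gb: List[int]) -> int:
    """Capacity left unused on the disks that are larger than the smallest one."""
    smallest = min(disk_sizes_gb)
    return sum(size - smallest for size in disk_sizes_gb)


disks = [1000, 2000, 4000]   # sizes in GB
print(raid5_usable(disks))   # 2000
print(wasted_space(disks))   # 4000
```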
The situation changed with the adoption of the Linux LVM (Logical Volume Manager) technology, which makes it possible to expand multi-disk storage systems. Nowadays LVM is increasingly used in NAS solutions such as Synology Hybrid RAID and Netgear X-RAID2. With LVM, you can combine several RAIDs, including arrays of different sizes, into a single large storage.
Hybrid storage systems
Each of the methods presented so far adds to the overall complexity. A RAID groups several physical disks together, and LVM groups several RAIDs together. All of these have one thing in common: they are disk space management drivers responsible for creating virtual volumes. The filesystem on such a storage device works on the virtual volume but is not aware of how the volume is created. In other words, the levels at which the virtual volume driver and the filesystem driver operate do not intersect, so the two are not aware of each other.
Obviously, this lack of knowledge about each other has its advantages and drawbacks. One advantage is that in case of failure you are dealing with independent parts, which can be repaired or replaced separately. The drawback is lower performance: a much more efficient system is possible when the disk space driver works in conjunction with the filesystem driver. This is what hybrid storage systems address.
MS Storage Spaces
Microsoft introduced the Storage Spaces technology in 2012, with Windows 8 and Windows Server 2012 (see this article for more details). The most recent version is essentially the same, and the goal of the technology is to create a truly flexible and efficient storage system. With Storage Spaces, you can combine completely different disks into pools; each member disk in the pool is virtually cut into 256 MB pieces (slabs), which are the base unit for the disk space driver.
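The cutting of pool disks into slabs can be sketched as follows. This is an illustration only: it assumes that "256 MB" means 256 MiB, that a remainder smaller than a slab is simply left unused, and that no space is reserved for pool metadata. The disk names and sizes are hypothetical.

```python
from typing import Dict, List, Tuple

MIB = 1024 ** 2
SLAB_BYTES = 256 * MIB  # assumption: the 256 MB slab is 256 MiB


def pool_slabs(disks: Dict[str, int]) -> List[Tuple[str, int]]:
    """List every slab in the pool as (disk name, slab index on that disk)."""
    return [
        (name, index)
        for name, size in disks.items()
        for index in range(size // SLAB_BYTES)
    ]


# Two disks of different sizes; the second one has 100 MiB that do not fill a slab.
pool = pool_slabs({"disk0": 1024 * MIB, "disk1": 612 * MIB})
print(len(pool))  # 6
print(pool[-1])   # ('disk1', 1)
```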
On the disks of the pool, you can create spaces of two types: fixed and thinly provisioned. For a fixed space, all the capacity is allocated from the pool at the time of creation. For a thinly provisioned space, you can specify any size regardless of how much disk space is currently available: slabs are taken from the pool only as data is written, and when the pool runs out of disk space, you can just add more disks to it.
The hybrid nature of Storage Spaces lies in the disk space driver working in conjunction with the filesystem driver. For example, if you delete a file on a thinly provisioned space so that an entire slab becomes free, the slab is released from the space back to the pool of free slabs, and no trace remains that the slab once belonged to the space. Also, if you use ReFS, which has built-in integrity checking, the ReFS driver asks for a different copy of the data when it detects a checksum error. Thus, the two layers are aware of each other.
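The behaviour of a thinly provisioned space, taking slabs on first write and returning them when they become free, can be modelled with a toy example. The classes and method names below are hypothetical and exist only for this sketch; they are not the Storage Spaces API and do not reflect its on-disk format.

```python
class Pool:
    """A pool that only tracks which slab ids are free."""

    def __init__(self, total_slabs: int) -> None:
        self.free = list(range(total_slabs))

    def add_disk(self, slabs: int) -> None:
        """Extend the pool with the slabs of a newly added disk."""
        start = max(self.free, default=-1) + 1
        self.free.extend(range(start, start + slabs))


class ThinSpace:
    """A space whose declared size may exceed what the pool can currently back."""

    def __init__(self, pool: Pool, virtual_slabs: int) -> None:
        self.pool = pool
        self.virtual_slabs = virtual_slabs
        self.mapping = {}  # virtual slab index -> pool slab id

    def write(self, virtual_slab: int) -> int:
        """Back a virtual slab with a pool slab on first write."""
        if not 0 <= virtual_slab < self.virtual_slabs:
            raise IndexError("outside the declared size of the space")
        if virtual_slab not in self.mapping:
            if not self.pool.free:
                raise RuntimeError("pool exhausted: add more disks")
            self.mapping[virtual_slab] = self.pool.free.pop()
        return self.mapping[virtual_slab]

    def release(self, virtual_slab: int) -> None:
        """Return a slab to the pool; the space keeps no record of it."""
        slab = self.mapping.pop(virtual_slab, None)
        if slab is not None:
            self.pool.free.append(slab)


pool = Pool(total_slabs=2)
space = ThinSpace(pool, virtual_slabs=100)  # declared far larger than the pool
space.write(0)
space.write(50)
print(len(pool.free))  # 0
space.release(0)       # e.g. the file occupying the slab was deleted
print(len(pool.free))  # 1
```

Note that the slab is simply forgotten once it is released: the mapping holds only the slabs currently in use.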
BTRFS (B-TRee File System)
Although BTRFS is considered a filesystem, from a practical point of view it is a hybrid of a filesystem and a disk space driver. At the filesystem level, BTRFS operates with clusters; however, disk space for these clusters is allocated in chunks, which can be of different sizes and have different fault tolerance. Some chunks may be stored in a single copy, some in two copies on one or two disks, and some as RAID5. Different chunks are used for metadata and user content, which allows you to have different levels of fault tolerance for metadata and for file content. Similar to a thinly provisioned space, BTRFS does not divide drives into chunks in advance; it allocates new ones as needed.
A file cluster can be stored on several devices at the same time. On a single disk, data can be stored in multiple locations to protect against bad sectors. If there are two or more disks, the two copies are stored on different disks, protecting not only against bad sectors but also against disk failure. In this case, unlike RAID1, the two disks are not copies of each other, because the same data is most likely stored at different locations on each disk. The filesystem itself keeps track of what data it puts where, without a separate RAID layer, and the location of files on one or several disks is described by a single, unified set of metadata.
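The following toy allocator illustrates why two disks holding two-copy chunks are not mirror images of each other. It is a sketch under stated assumptions, not the BTRFS algorithm: the profile names "single", "dup" and "raid1" stand for the single-copy, two-copies-on-one-disk and two-copies-on-two-disks cases described above, the rule of preferring the device with the most unallocated space is an assumption of the sketch, and all class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

COPIES = {"single": 1, "dup": 2, "raid1": 2}


@dataclass
class Device:
    name: str
    free: int             # unallocated space, in arbitrary units
    next_offset: int = 0  # where the next chunk will be placed


@dataclass
class Chunk:
    kind: str                        # "data" or "metadata"
    profile: str                     # key of COPIES
    stripes: List[Tuple[str, int]]   # (device name, physical offset) per copy


def allocate_chunk(devices: List[Device], kind: str, profile: str, size: int) -> Chunk:
    """Allocate one chunk on demand, preferring devices with the most free space."""
    if not devices:
        raise RuntimeError("no devices")
    by_free = sorted(devices, key=lambda d: d.free, reverse=True)
    if profile == "dup":
        targets = [by_free[0], by_free[0]]  # both copies on the same device
    else:
        targets = by_free[:COPIES[profile]]
    if len(targets) < COPIES[profile]:
        raise RuntimeError("not enough devices for this profile")
    # Check the space for every copy before changing any device.
    for dev in targets:
        needed = size * sum(1 for t in targets if t is dev)
        if dev.free < needed:
            raise RuntimeError("not enough unallocated space")
    stripes = []
    for dev in targets:
        stripes.append((dev.name, dev.next_offset))
        dev.next_offset += size
        dev.free -= size
    return Chunk(kind, profile, stripes)


disks = [Device("A", 100), Device("B", 100)]
data = allocate_chunk(disks, "data", "single", 10)
meta = allocate_chunk(disks, "metadata", "raid1", 10)
print(data.stripes)  # [('A', 0)]
print(meta.stripes)  # [('B', 0), ('A', 10)]
```

The two copies of the metadata chunk end up at offset 0 on one disk and offset 10 on the other, so the disks differ even though every metadata cluster exists on both.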
If you are keen to dive into the technical details, the BTRFS Wiki provides good and extensive information.
When recovering data from BTRFS, it is necessary to solve two tasks at the same time: find where the files are located and where the chunks are located. Additionally, the weak point of BTRFS storage is the chunk table, which stores the location of chunks on the physical disks. Although there are two copies of this table, this does not help to recover chunks that were discarded in the course of normal operation, for example because a user deleted a file whose content was stored in the chunk. In this case BTRFS updates both copies of the table simultaneously, so neither copy retains the old location.
Conclusion: data storage complexity
Modern storage devices are considerably more complex than regular hard drives. The difficulty is not only that each of the layers has become more complicated, but also that there are interconnections between the layers and more factors come into play. For example, to describe a typical hard drive with a regular filesystem, it is enough to specify the start offset, the cluster size and the filesystem type. For RAID, you deal with a set of parameters such as the number of disks, the disk order, the array type, the block size and several others. For hybrid storage systems, the number of parameters depends on the size of the storage. For Storage Spaces, there are about 4000 parameters for each TB of data, one location for every 256 MB slab. Thus, the complexity of a hybrid storage system increases proportionally to its size, while the complexity of traditional storage does not.
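The figure of roughly 4000 parameters per TB follows from the slab size: each 256 MB slab needs its own location to be known. A quick check, assuming binary units (256 MiB slabs and 1 TiB = 1024^4 bytes) and counting exactly one parameter per slab:

```python
SLAB_BYTES = 256 * 1024 ** 2  # assumption: 256 MB means 256 MiB
TIB = 1024 ** 4


def slab_locations(volume_bytes: int) -> int:
    """Number of slab locations needed to describe a volume (rounded up)."""
    if volume_bytes < 0:
        raise ValueError("size must be non-negative")
    return -(-volume_bytes // SLAB_BYTES)  # ceiling division


for size_tib in (1, 4, 16):
    print(size_tib, slab_locations(size_tib * TIB))
# 1 4096
# 4 16384
# 16 65536
```

A plain disk with a regular filesystem, by contrast, needs the same three parameters whether it holds 1 TB or 16 TB.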