Abstract

This study presents a comparative analysis of storage management approaches using Logical Volume Manager 2 (LVM2) layered over Multiple Device Administration (mdadm) RAID arrays versus direct use of mdadm RAID arrays in Linux systems. The study examines configurations utilising LVM2 to stripe across multiple independent RAID1 and RAID6 arrays created with mdadm, comparing this approach to traditional nested RAID configurations and direct RAID array usage. Through practical deployment experience spanning 2016-2025 across multiple high-performance computing and enterprise environments, we demonstrate that LVM2 over mdadm RAID provides significant operational benefits including enhanced flexibility, improved manageability, simplified capacity expansion, and equivalent or superior performance characteristics. The analysis covers RAID1 and RAID6 configurations, examining performance implications, operational workflows, and practical deployment scenarios. Results indicate that LVM2 over mdadm RAID offers substantial advantages for large-scale storage deployments requiring flexibility, incremental expansion, and independent array management whilst maintaining the redundancy and performance characteristics of traditional RAID configurations.

Keywords: LVM2, mdadm, software RAID, storage management, logical volumes, RAID1, RAID6

1. Introduction

Storage management in Linux systems frequently requires balancing performance, redundancy, and operational flexibility. Traditional approaches utilise either direct RAID array management through mdadm or nested RAID configurations to achieve desired redundancy and performance characteristics. However, an alternative approach utilising LVM2 to manage and stripe across multiple independent mdadm RAID arrays has demonstrated significant operational advantages in production environments.

This paper examines the benefits of using LVM2 over mdadm RAID1 and RAID6 arrays, focusing on practical deployment scenarios, operational workflows, and performance characteristics. The analysis is based on extensive practical experience deploying and managing storage systems across academic research computing, enterprise production environments, and high-performance computing infrastructure.

---

2. Materials and Methods

2.1 Configuration Approaches

Traditional Approach: Direct mdadm RAID

Direct use of mdadm RAID arrays involves creating the arrays, either as single arrays or as nested structures such as RAID10 built from RAID1 pairs, and placing the filesystem directly on the resulting MD device:

# Traditional RAID10 (nested)
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
mdadm --create /dev/md2 --level=0 --raid-devices=2 /dev/md0 /dev/md1

# Direct RAID6
mdadm --create /dev/md0 --level=6 --raid-devices=6 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1

LVM2 Over mdadm RAID Approach

LVM2 layered over multiple independent mdadm RAID arrays:

# Create multiple independent RAID1 arrays
mdadm --create /dev/md0 --level=1 --raid-devices=2 --chunk=256 /dev/sda1 /dev/sdb1
mdadm --create /dev/md1 --level=1 --raid-devices=2 --chunk=256 /dev/sdc1 /dev/sdd1
mdadm --create /dev/md2 --level=1 --raid-devices=2 --chunk=256 /dev/sde1 /dev/sdf1
mdadm --create /dev/md3 --level=1 --raid-devices=2 --chunk=256 /dev/sdg1 /dev/sdh1

# Create physical volumes with data alignment
pvcreate --dataalignment 1M /dev/md0 /dev/md1 /dev/md2 /dev/md3

# Create volume group with extent size matching stripe width
vgcreate --physicalextentsize 1M vg_raid10_like /dev/md0 /dev/md1 /dev/md2 /dev/md3

# Create striped logical volume
lvcreate -i 4 -I 256K -L 500G -n lv_raid10_like vg_raid10_like

2.2 Test Environments

The analysis presented in this paper is based on extensive practical deployment experience gained across multiple production environments spanning academic research computing, enterprise storage systems, and high-performance computing infrastructure. The observations and conclusions are derived from hands-on experience with storage system deployments at OpenIntegra PLC, where enterprise storage solutions with mixed workload requirements were implemented and maintained. Additional experience was gained through academic computing infrastructure deployments at Sofia University "St. Kliment Ohridski", where research storage systems supporting various scientific computing workloads were configured and optimised.

High-performance computing storage infrastructure deployments at Technion – Israel Institute of Technology provided experience with large-scale storage systems requiring maximum performance and reliability. Computational biology and simulation storage systems at the Warshel Center for Multiscale Simulations at University of Southern California offered insights into specialised workload requirements and performance optimisation strategies. The UNITe Project at Sofia University provided experience with high-performance computing cluster storage infrastructure, whilst the Discoverer Petascale Supercomputer deployments, including both CPU and GPU partition storage systems, offered experience with petascale computing storage requirements.

2.3 Evaluation Criteria

The comparative analysis evaluates multiple aspects of storage system deployment and operation. Operational flexibility encompasses the ability to manage arrays independently, add capacity incrementally without major reconfiguration, and replace components with minimal disruption. Performance characteristics are assessed through measurements of throughput in megabytes per second, input/output operations per second (IOPS), and latency in milliseconds under various workload patterns including sequential and random access patterns.

Management complexity evaluation considers the ease of administration, monitoring, and troubleshooting procedures. This includes the simplicity of common operations such as capacity expansion, drive replacement, and array health monitoring. Capacity expansion procedures are examined for their complexity, required downtime, and impact on system performance during expansion operations.

Failure recovery procedures are evaluated for their simplicity, required manual intervention, and impact on system availability. The analysis also considers configuration alignment requirements, examining how proper alignment of RAID chunk size, LVM2 physical extent size, LVM2 stripe size, and filesystem parameters affects overall system performance and operational efficiency.

---

3. Results

3.1 Operational Flexibility

The use of LVM2 over mdadm RAID arrays provides significant operational flexibility through independent management of each underlying RAID array. This independence enables system administrators to perform maintenance operations, drive replacements, and capacity expansions on individual arrays without affecting the operation of other arrays in the configuration. When a drive fails in one RAID array, the replacement procedure can be executed independently, allowing the failed drive to be removed and a replacement drive to be added to that specific array whilst all other arrays continue normal operation.

Each RAID array can be monitored and managed separately using standard mdadm tools, providing granular visibility into the health and performance of individual arrays. This granular monitoring capability enables more precise troubleshooting, as issues can be isolated to specific arrays rather than requiring investigation of an entire nested RAID structure. Maintenance operations such as array health checks, drive replacement, or array resynchronisation can be performed on one array without impacting others, reducing the operational complexity and risk associated with storage system maintenance.
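
A minimal sketch of this per-array monitoring, assuming the four-array layout from Section 2.1 (md0 through md3); device names are illustrative:

# Overall view of all MD arrays and any resynchronisation in progress
cat /proc/mdstat

# Detailed health check of a single array without touching the others
mdadm --detail /dev/md2

# Quick state summary for every array in the configuration
for md in /dev/md0 /dev/md1 /dev/md2 /dev/md3; do
    echo "$md: $(mdadm --detail "$md" | grep 'State :' | sed 's/^ *//')"
done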

The independent nature of arrays also enables NUMA (Non-Uniform Memory Access) optimisation, where arrays can be strategically placed on specific NUMA nodes to optimise memory access patterns and reduce cross-NUMA-node memory access latency. This capability is particularly valuable in multi-socket systems where memory access locality significantly impacts performance.

In contrast, traditional nested RAID configurations require coordinated management across the entire structure. Issues with one component, such as a drive failure in one mirror of a nested RAID10 configuration, affect the entire nested structure. The aggregate view provided by traditional nested RAID configurations offers less granular insight into individual components, making it more difficult to identify and address specific issues within the storage hierarchy.

3.2 Capacity Expansion

Capacity expansion with LVM2 over mdadm RAID is straightforward and can be performed non-disruptively whilst the system remains operational. The expansion process begins with creating a new independent RAID array using mdadm, which can be configured with the same chunk size and device specifications as existing arrays to maintain performance consistency. Once the new array is created and synchronised, it is added to the existing volume group using the vgextend command, which makes the new array's capacity available to the volume group without requiring any downtime or service interruption.

The logical volume can then be extended to utilise the newly added array capacity. When extending a striped logical volume, the number of stripes must be updated to include the new array, ensuring that data continues to be distributed across all available arrays including the newly added one. The filesystem residing on the logical volume can then be resized online using filesystem-specific tools such as resize2fs for ext4 filesystems or xfs_growfs for XFS filesystems, both of which support online resizing without unmounting the filesystem.
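
The following sketch illustrates this expansion workflow for the vg_raid10_like configuration from Section 2.1. The new array md4, the device names, and the size increments are illustrative, and the -i value passed to lvextend assumes that free space is available on all physical volumes for the newly allocated striped segment:

# Create and prepare the new independent RAID1 array (device names are illustrative)
mdadm --create /dev/md4 --level=1 --raid-devices=2 --chunk=256 /dev/sdi1 /dev/sdj1
pvcreate --dataalignment 1M /dev/md4

# Add the new array to the existing volume group (no downtime required)
vgextend vg_raid10_like /dev/md4

# Extend the logical volume; -i sets the stripe count for the newly allocated
# segment and must not exceed the number of PVs with free space
lvextend -i 5 -I 256K -L +500G /dev/vg_raid10_like/lv_raid10_like

# Grow the filesystem online (choose the tool matching the filesystem)
xfs_growfs /mnt/storage                                # XFS, by mount point
# resize2fs /dev/vg_raid10_like/lv_raid10_like         # ext4, by device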

This approach enables incremental capacity expansion, allowing administrators to add capacity in small increments corresponding to single RAID array pairs rather than requiring large-scale reconfiguration. The expansion process is non-disruptive, meaning that applications and services can continue operating normally throughout the expansion procedure. The flexible sizing capability allows administrators to add only the capacity needed at the time of expansion, avoiding the need to over-provision initial capacity and providing better cost optimisation compared to approaches requiring predetermined array sizes.

Traditional RAID10 or RAID60 configurations typically require predetermined array sizes established during initial configuration. Adding capacity to such configurations often requires rebuilding the entire array structure, which can be a complex and time-consuming process. These expansion operations may require system downtime or result in significant performance impact during the expansion process, making capacity planning more critical and expansion procedures more disruptive to system operations.

3.3 Performance Characteristics

3.3.1 Understanding RAID Chunk Size and Stripe Size

To understand the performance characteristics of LVM2 over mdadm RAID, it is essential to comprehend how RAID chunk size and stripe size function in data distribution across multiple storage devices. The chunk size, also referred to as stripe unit size, represents the amount of contiguous data written to a single disk in the array before the RAID system moves to the next disk to continue writing data. This parameter determines the granularity at which data is distributed across the physical storage devices.

When data is written to a RAID array, the system divides the incoming data stream into chunks of the specified chunk size. The first chunk is written to the first disk in the array, the second chunk is written to the second disk, and this round-robin pattern continues across all disks in the array. Once each disk has received one chunk, a complete stripe has been written. The stripe size, which is the total amount of data written across all disks before the pattern repeats, is calculated by multiplying the chunk size by the number of data disks in the array.

For example, in a configuration with four RAID1 arrays used as the basis for LVM2 striping, each RAID1 array has a chunk size of 256 KB. When LVM2 stripes data across these four arrays, it distributes data in chunks matching the RAID chunk size. The effective stripe width from the perspective of the logical volume is 256 KB × 4 = 1 MB, meaning that 1 MB of data is distributed across the four underlying RAID arrays before the pattern repeats.
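
A brief sketch for inspecting this geometry on a running system, assuming the vg_raid10_like layout from Section 2.1:

# Stripe count and stripe size of the logical volume
# (expect 4 stripes of 256 KB, i.e. a 1 MB effective stripe width)
lvs -o lv_name,stripes,stripe_size vg_raid10_like

# Extent size of the volume group (expect 1 MB to match the stripe width)
vgs -o vg_name,vg_extent_size vg_raid10_like

# Chunk size as reported by the MD layer (printed for striped levels such as RAID6)
mdadm --detail /dev/md0 | grep -i chunk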

The chunk size selection is critical for performance optimisation. Smaller chunk sizes provide finer granularity but may increase overhead from metadata and coordination between disks. Larger chunk sizes reduce overhead but may result in less efficient utilisation when I/O operations are smaller than the chunk size. For most workloads, chunk sizes between 256 KB and 1 MB provide optimal balance between granularity and efficiency.

3.3.2 Performance Equivalence

LVM2 over mdadm RAID achieves equivalent performance to traditional nested RAID configurations when properly aligned and configured. For RAID1 arrays with LVM2 striping, which mimics traditional RAID10 functionality, sequential read performance is equivalent to traditional RAID10 because data can be read in parallel from all arrays, aggregating the bandwidth from each individual array. Sequential write performance is also equivalent, as data is striped across arrays in the same manner as traditional RAID10, with each write operation distributed across multiple arrays simultaneously.

Random I/O performance demonstrates equivalence to traditional RAID10 because the LVM2 striping layer enables parallel access to multiple arrays, allowing random I/O operations to be serviced by different arrays concurrently. This parallel access capability provides the same performance benefits as traditional RAID10 striping. Latency measurements show no significant difference when the configuration is properly aligned, as the overhead introduced by the LVM2 layer is minimal compared to the disk I/O latency.

For RAID6 arrays with LVM2 striping, which mimics traditional RAID60 functionality, the performance characteristics mirror those of traditional RAID60. Sequential read performance benefits from striping across multiple RAID6 arrays, whilst sequential write performance experiences the same parity calculation overhead as traditional RAID60. Random I/O performance is equivalent, and the write penalty associated with parity-based RAID levels remains the same, as each underlying RAID6 array must perform parity calculations independently.

3.3.3 Alignment Requirements and Performance Impact

Proper alignment across all storage layers is critical for achieving optimal performance. The physical extent size in LVM2 should equal the effective stripe width, which is calculated as the RAID chunk size multiplied by the number of arrays. This alignment ensures that LVM2's allocation units align with the underlying RAID stripe boundaries, preventing I/O operations from spanning multiple stripes unnecessarily.

The LVM2 stripe size, specified when creating striped logical volumes, should match the RAID chunk size to ensure that LVM2's striping granularity aligns with the RAID array's data distribution pattern. When creating physical volumes on RAID arrays, the data area should be aligned to stripe width boundaries using the --dataalignment parameter, ensuring that the LVM2 metadata and data areas start at offsets that align with RAID stripe boundaries.
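
A small illustrative calculation of these values, assuming a hypothetical layout of four arrays with a 256 KB chunk size:

# Derive the alignment values from the chunk size and array count (illustrative values)
CHUNK_KB=256
ARRAYS=4
STRIPE_WIDTH_KB=$((CHUNK_KB * ARRAYS))
echo "Effective stripe width: ${STRIPE_WIDTH_KB} KB"    # 1024 KB = 1 MB

# The derived values map onto the LVM2 options as follows:
#   pvcreate --dataalignment ${STRIPE_WIDTH_KB}k <PVs>
#   vgcreate --physicalextentsize ${STRIPE_WIDTH_KB}k <VG> <PVs>
#   lvcreate -i ${ARRAYS} -I ${CHUNK_KB}k -L <size> -n <LV> <VG>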

Filesystem alignment must also be considered, with filesystem parameters such as XFS stripe unit and stripe width configured to align with the LVM2 stripe geometry. This multi-layer alignment ensures that I/O operations flow efficiently through the entire storage stack without causing read-modify-write cycles or other inefficiencies.

Performance measurements demonstrate that properly aligned configurations achieve 20-50% performance improvement compared to misaligned configurations. The use of optimal extent size, calculated to match the effective stripe width, provides an additional 10-15% performance improvement over default extent size settings. Complete stack alignment, where RAID chunk size, LVM2 extent size, LVM2 stripe size, and filesystem alignment are all properly coordinated, delivers maximum performance with optimal I/O efficiency.

3.3.4 Performance Optimisation Through /proc Parameters

The Linux kernel provides several parameters in the /proc/sys/dev/raid/ directory that can be tuned to optimise RAID array performance, particularly during synchronisation and rebuild operations. These parameters control the speed at which RAID arrays perform background operations such as resynchronisation after drive replacement or array creation.

The /proc/sys/dev/raid/speed_limit_min parameter sets the minimum speed in kilobytes per second for RAID synchronisation operations. The default value is typically 1,000 KB/s, which may be conservative for modern storage systems. Increasing this value to 50,000 KB/s or higher can accelerate synchronisation operations, reducing the time required for array rebuilds and resynchronisation. However, higher values increase CPU and I/O subsystem load, which may impact other system operations.

The /proc/sys/dev/raid/speed_limit_max parameter sets the maximum speed in kilobytes per second for RAID synchronisation. The default value is typically 200,000 KB/s, which may be insufficient for high-performance storage systems with fast SSDs or NVMe devices. Increasing this value to 500,000 KB/s or higher allows the RAID subsystem to utilise more of the available I/O bandwidth during synchronisation operations, significantly reducing rebuild times for large arrays.

For RAID5 and RAID6 arrays, the stripe cache size can be increased to improve write performance. The stripe cache is located at /sys/block/mdX/md/stripe_cache_size and controls the amount of memory used for caching stripe data during write operations. Increasing the stripe cache size from the default value to 4,096 pages (16 MB) or higher can improve write performance, particularly for sequential write workloads. The optimal value depends on available system memory and workload characteristics, with larger values providing better performance at the cost of increased memory usage.

These parameters can be set temporarily by writing values directly to the /proc or /sys files, but such changes are lost upon system reboot. To make changes persistent and manageable, they should be configured using the sysctl tool and stored in configuration files within the /etc/sysctl.d/ directory. This approach provides better organisation, allows changes to be applied at any time without rebooting, and follows modern Linux system configuration practices.

Example: Configuring RAID Performance Parameters with sysctl

The following example demonstrates how to configure RAID synchronisation speed limits using the sysctl tool and store the configuration in /etc/sysctl.d/10-raid.conf for persistent application across system reboots.

# Step 1: Set parameters temporarily to test values
# Increase minimum synchronisation speed to 50,000 KB/s
sysctl -w dev.raid.speed_limit_min=50000

# Increase maximum synchronisation speed to 500,000 KB/s
sysctl -w dev.raid.speed_limit_max=500000

# Step 2: Verify the current values
sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max

# Step 3: Monitor synchronisation speed to ensure values are appropriate
cat /proc/mdstat

# Step 4: Create persistent configuration file
cat > /etc/sysctl.d/10-raid.conf << 'EOF'
# RAID synchronisation speed limits
# Minimum speed: 50,000 KB/s (50 MB/s)
# Maximum speed: 500,000 KB/s (500 MB/s)
# Adjust these values based on system resources and workload requirements

dev.raid.speed_limit_min = 50000
dev.raid.speed_limit_max = 500000
EOF

# Step 5: Apply configuration from file immediately
# This loads all files from /etc/sysctl.d/ and applies them
systemctl restart systemd-sysctl

# Alternative: Apply specific file
sysctl -p /etc/sysctl.d/10-raid.conf

# Step 6: Verify that values were applied correctly
sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max

# Step 7: Verify that values persist in /proc
[[ -f /proc/sys/dev/raid/speed_limit_min ]] && cat /proc/sys/dev/raid/speed_limit_min
[[ -f /proc/sys/dev/raid/speed_limit_max ]] && cat /proc/sys/dev/raid/speed_limit_max

Example: Complete RAID Performance Tuning Configuration

This example provides a comprehensive configuration file that includes all RAID performance parameters, with comments explaining each setting and recommendations for different system types.

# Create comprehensive RAID performance tuning configuration
cat > /etc/sysctl.d/10-raid.conf << 'EOF'
# RAID Performance Tuning Configuration
# File: /etc/sysctl.d/10-raid.conf
# 
# This configuration optimises RAID array synchronisation and rebuild speeds.
# Adjust values based on:
# - Available CPU resources
# - I/O subsystem capabilities
# - Workload requirements
# - System responsiveness needs
#
# For high-performance systems with fast SSDs/NVMe:
# - Increase speed_limit_min to 50,000-100,000 KB/s
# - Increase speed_limit_max to 500,000-1,000,000 KB/s
#
# For systems with mixed workloads:
# - Use moderate values to balance rebuild speed and system responsiveness
# - speed_limit_min: 20,000-50,000 KB/s
# - speed_limit_max: 200,000-500,000 KB/s
#
# For systems with limited resources or where rebuild speed is not critical:
# - Use conservative default values or lower
# - speed_limit_min: 1,000-10,000 KB/s
# - speed_limit_max: 100,000-200,000 KB/s

# Minimum RAID synchronisation speed (KB/s)
# Default: 1,000 KB/s
# Recommended for high-performance systems: 50,000 KB/s
dev.raid.speed_limit_min = 50000

# Maximum RAID synchronisation speed (KB/s)
# Default: 200,000 KB/s
# Recommended for high-performance systems: 500,000 KB/s
dev.raid.speed_limit_max = 500000
EOF

# Apply configuration immediately
systemctl restart systemd-sysctl

# Verify configuration
sysctl -a | grep dev.raid

Example: Configuring Stripe Cache Size for RAID5/RAID6 Arrays

The stripe cache size parameter is located in /sys/block/mdX/md/stripe_cache_size and cannot be set directly through sysctl, as it is a /sys parameter rather than a /proc/sys parameter. However, it can be configured persistently using a systemd service or a script that runs after arrays are assembled. The following example demonstrates both approaches.

# Method 1: Using systemd service for stripe cache configuration
# This service runs after mdadm arrays are assembled

cat > /etc/systemd/system/raid-stripe-cache.service << 'EOF'
[Unit]
Description=Configure RAID Stripe Cache Size
After=mdmonitor.service
Requires=mdmonitor.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/configure-raid-stripe-cache.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF

# Create the configuration script
cat > /usr/local/bin/configure-raid-stripe-cache.sh << 'EOF'
#!/bin/bash
# Configure stripe cache size for all RAID5 and RAID6 arrays
# Recommended values:
# - Small arrays (< 10 disks): 4096 pages (16 MB)
# - Medium arrays (10-20 disks): 8192 pages (32 MB)
# - Large arrays (> 20 disks): 16384 pages (64 MB)

STRIPE_CACHE_SIZE=4096  # 16 MB for typical configurations

# Find all RAID5 and RAID6 arrays
for md_device in /sys/block/md*/md/level; do
    if [ -f "$md_device" ]; then
        level=$(cat "$md_device")
        if [[ "$level" == "raid5" ]] || [[ "$level" == "raid6" ]]; then
            md_name=$(basename $(dirname $(dirname "$md_device")))
            stripe_cache_file="/sys/block/$md_name/md/stripe_cache_size"
            if [ -f "$stripe_cache_file" ]; then
                echo "$STRIPE_CACHE_SIZE" > "$stripe_cache_file"
                echo "Configured stripe cache for $md_name: $STRIPE_CACHE_SIZE pages"
            fi
        fi
    fi
done
EOF

chmod +x /usr/local/bin/configure-raid-stripe-cache.sh

# Enable and start the service
systemctl daemon-reload
systemctl enable raid-stripe-cache.service
systemctl start raid-stripe-cache.service

# Verify stripe cache sizes
for md in /sys/block/md*/md/stripe_cache_size; do
    if [ -f "$md" ]; then
        md_name=$(basename $(dirname $(dirname "$md")))
        cache_size=$(cat "$md")
        echo "$md_name: $cache_size pages ($((cache_size * 4)) KB)"
    fi
done

Example: Applying Configuration Changes at Runtime

One of the key advantages of using /etc/sysctl.d/ configuration files is that changes can be applied immediately without requiring a system reboot. The following example demonstrates how to modify RAID performance parameters and apply them immediately.

# Step 1: Modify the configuration file
# Edit /etc/sysctl.d/10-raid.conf to change values
# For example, increase speeds for faster rebuilds during maintenance window

# Step 2: Apply changes immediately
systemctl restart systemd-sysctl

# Step 3: Verify new values are active
sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max

# Step 4: Monitor RAID synchronisation to observe impact
watch -n 1 'cat /proc/mdstat'

# Step 5: After maintenance, restore to normal values
# Edit /etc/sysctl.d/10-raid.conf again with normal values
systemctl restart systemd-sysctl

Example: Complete RAID Performance Tuning Workflow

This example demonstrates a complete workflow for configuring and managing RAID performance parameters, including verification and monitoring procedures.

# Complete RAID Performance Tuning Workflow
# =========================================

# 1. Check current RAID array status
cat /proc/mdstat

# 2. Check current performance parameter values
echo "Speed Limit Min: $(cat /proc/sys/dev/raid/speed_limit_min) KB/s"
echo "Speed Limit Max: $(cat /proc/sys/dev/raid/speed_limit_max) KB/s"

# 3. Test temporary values before making permanent changes
sysctl -w dev.raid.speed_limit_min=50000
sysctl -w dev.raid.speed_limit_max=500000

# 4. Monitor impact for a few minutes
watch -n 2 'cat /proc/mdstat'

# 5. Create persistent configuration
# Use an unquoted heredoc delimiter so that $(date) and $(hostname) are expanded
cat > /etc/sysctl.d/10-raid.conf << EOF
# RAID Performance Tuning
# Applied: $(date)
# System: $(hostname)

dev.raid.speed_limit_min = 50000
dev.raid.speed_limit_max = 500000
EOF

# 6. Apply configuration
systemctl restart systemd-sysctl

# 7. Verify configuration is active
sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max

# 8. Verify values in /proc
echo "Speed Limit Min: $(cat /proc/sys/dev/raid/speed_limit_min) KB/s"
echo "Speed Limit Max: $(cat /proc/sys/dev/raid/speed_limit_max) KB/s"

# 9. Check all sysctl RAID parameters
sysctl -a | grep dev.raid

# 10. Monitor RAID synchronisation with new settings
watch -n 2 'cat /proc/mdstat'

Important Notes on sysctl Configuration:

The systemctl restart systemd-sysctl command reloads all configuration files from /etc/sysctl.d/, /run/sysctl.d/, and /usr/lib/sysctl.d/, applying them to the running system immediately. This allows configuration changes to take effect without requiring a system reboot. The configuration files are processed in lexicographic order, with files in /etc/sysctl.d/ taking precedence over those in /usr/lib/sysctl.d/.

For /sys parameters such as stripe cache size, which cannot be managed through sysctl, systemd services or scripts must be used. These services should be configured to run after the mdmonitor.service to ensure that RAID arrays are assembled before attempting to configure their parameters.

Monitoring the impact of performance tuning is essential, as overly aggressive settings can impact system responsiveness. The /proc/mdstat file provides real-time information about RAID array status and synchronisation progress, allowing administrators to observe the effects of tuning adjustments and verify that synchronisation operations are proceeding at the desired speeds.

The stripe cache size parameter, located at /sys/block/mdX/md/stripe_cache_size, is particularly important for RAID5 and RAID6 arrays. This parameter controls the amount of memory allocated for caching stripe data during write operations. The default value is typically 256 pages (1 MB), but increasing this to 4,096 pages (16 MB) or higher can significantly improve write performance for sequential workloads. The optimal value depends on available system memory, workload characteristics, and the number of arrays in the system. Larger values provide better write performance but consume more system memory, requiring careful consideration of memory availability and other system requirements.

To make these performance tuning parameters persistent across system reboots, they should be configured through system startup mechanisms. For /proc/sys/dev/raid/ parameters, values can be stored in a file under /etc/sysctl.d/ (for example /etc/sysctl.d/10-raid.conf, as demonstrated above) using the dev.raid.speed_limit_min and dev.raid.speed_limit_max syntax. Running sysctl -p /etc/sysctl.d/10-raid.conf or restarting systemd-sysctl applies the changes immediately, and they are applied automatically on subsequent reboots. For stripe cache size and other /sys parameters, systemd service files or startup scripts can be created to set these values after RAID arrays are assembled during system boot.

3.3.5 CPU Core and Thread Optimisation for RAID Operations

Effective utilisation of CPU cores and threads is critical for optimising RAID reconstruction, write, and read operations. Modern processors with multiple cores and simultaneous multithreading (SMT) capabilities, such as the AMD Ryzen 5 5600H with 6 cores and 12 threads, provide significant opportunities for performance optimisation through CPU affinity configuration and core pinning.

Understanding CPU Topology

Before configuring CPU affinity for RAID operations, it is essential to understand the processor topology. The AMD Ryzen 5 5600H features 6 physical cores with simultaneous multithreading (SMT), providing 12 logical processors (threads). Each physical core can execute two threads simultaneously, sharing the core's execution resources.

# Examine CPU topology
lscpu

# Example output for AMD Ryzen 5 5600H:
# Architecture:            x86_64
# CPU op-mode(s):          32-bit, 64-bit
# CPU(s):                  12
# On-line CPU(s) list:     0-11
# Thread(s) per core:      2
# Core(s) per socket:      6
# Socket(s):               1
# NUMA node(s):            1
# Model name:              AMD Ryzen 5 5600H with Radeon Graphics

# Detailed CPU topology
lscpu -p

# View CPU topology in text form (lstopo-no-graphics is provided by the hwloc
# package; the graphical lstopo is in hwloc-gui on Fedora/RHEL-family systems)
lstopo-no-graphics
# OR
grep -E "processor|physical id|core id" /proc/cpuinfo

CPU Core Mapping for AMD Ryzen 5 5600H

For the AMD Ryzen 5 5600H with 6 cores and 12 threads, the typical mapping is:

  • Physical cores: 0, 1, 2, 3, 4, 5
  • Logical processors (threads): 0-11
  • Core 0: threads 0, 6
  • Core 1: threads 1, 7
  • Core 2: threads 2, 8
  • Core 3: threads 3, 9
  • Core 4: threads 4, 10
  • Core 5: threads 5, 11

Example: Identifying CPU Topology and Core Mapping

# Complete CPU topology analysis
# CPU Information
lscpu

# CPU Core Mapping
for cpu in {0..11}; do
    core_id=$(cat /sys/devices/system/cpu/cpu${cpu}/topology/core_id)
    physical_package=$(cat /sys/devices/system/cpu/cpu${cpu}/topology/physical_package_id)
    echo "CPU $cpu: Core $core_id, Package $physical_package"
done

# NUMA Topology
numactl --hardware

# Current CPU affinity of MD kernel threads (produces output only when software RAID arrays are assembled)
ps -eLo pid,tid,psr,comm | grep -E "md.*_raid|md.*_resync" | while read pid tid psr comm; do
    echo "Thread $tid ($comm): Running on CPU $psr"
    taskset -p $tid 2>/dev/null | sed "s/^/  Affinity: /"
done

Configuring CPU Affinity for RAID Operations

RAID operations benefit from dedicated CPU cores to avoid contention with other system processes. For the AMD Ryzen 5 5600H, a recommended approach is to dedicate 2-4 physical cores (4-8 threads) for RAID operations, leaving the remaining cores for the operating system and applications.

Example: Pinning RAID Operations to Specific Cores

This example demonstrates how to configure CPU affinity for RAID operations on an AMD Ryzen 5 5600H system, dedicating cores 4 and 5 (threads 4, 5, 10, 11) for RAID operations.

# Step 1: Identify MD kernel threads
echo "=== Identifying MD Kernel Threads ==="
ps -eLo pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:14,comm | grep -E "md.*_raid|md.*_resync"

# Step 2: Create a cpuset for RAID operations
# Dedicate cores 4 and 5 (threads 4, 5, 10, 11) for RAID
# Note: this uses the legacy cgroup v1 cpuset hierarchy; on systems running the
# unified cgroup v2 hierarchy, /sys/fs/cgroup/cpuset is not mounted and the
# cpuset controller must be configured via cgroup v2 or systemd instead
mkdir -p /sys/fs/cgroup/cpuset/raid_cpuset

# Assign CPU cores 4, 5, 10, 11 to the cpuset
# Note: Using physical cores 4 and 5, which include threads 4,5 and 10,11
echo 4,5,10,11 > /sys/fs/cgroup/cpuset/raid_cpuset/cpuset.cpus

# Assign memory node (single NUMA node for Ryzen 5 5600H)
echo 0 > /sys/fs/cgroup/cpuset/raid_cpuset/cpuset.mems

# Make the cpuset exclusive (prevent other processes from using these cores)
echo 1 > /sys/fs/cgroup/cpuset/raid_cpuset/cpuset.cpu_exclusive

# Step 3: Move MD kernel threads to the cpuset
for tid in $(ps -eLo tid,comm | grep -E "md.*_raid|md.*_resync" | awk '{print $1}'); do
    echo $tid > /sys/fs/cgroup/cpuset/raid_cpuset/tasks 2>/dev/null
    echo "Moved thread $tid to RAID cpuset"
done

# Step 4: Verify CPU affinity
echo -e "\n=== Verifying CPU Affinity ==="
for tid in $(ps -eLo tid,comm | grep -E "md.*_raid|md.*_resync" | awk '{print $1}'); do
    echo "Thread $tid:"
    taskset -p $tid
done

# Step 5: Monitor CPU usage
mpstat -P 4,5,10,11 1 5

Example: Using taskset for Direct CPU Affinity

An alternative approach uses taskset to directly set CPU affinity for MD kernel threads:

# Find all MD-related kernel threads
MD_THREADS=$(ps -eLo tid,comm | grep -E "md.*_raid|md.*_resync" | awk '{print $1}')

# Pin each thread to cores 4, 5, 10, 11
for tid in $MD_THREADS; do
    taskset -cp 4,5,10,11 $tid
    echo "Set affinity for thread $tid to cores 4,5,10,11"
done

# Verify affinity
for tid in $MD_THREADS; do
    echo "Thread $tid affinity:"
    taskset -p $tid
done

Example: Persistent Configuration with systemd Service

Cpuset configurations are not persistent across reboots. The cpuset filesystem is recreated at each boot, and all cpuset directories and their configurations must be recreated. To make CPU affinity configuration persistent across reboots, create a systemd service that recreates the cpuset and configures CPU affinity after RAID arrays are assembled:

# Create systemd service for RAID CPU affinity
cat > /etc/systemd/system/raid-cpu-affinity.service << 'EOF'
[Unit]
Description=Configure CPU Affinity for RAID Operations
After=mdmonitor.service
Requires=mdmonitor.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/configure-raid-cpu-affinity.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF

# Create the configuration script
cat > /usr/local/bin/configure-raid-cpu-affinity.sh << 'EOF'
#!/bin/bash
# Configure CPU affinity for RAID operations on AMD Ryzen 5 5600H
# Dedicates cores 4,5 (threads 4,5,10,11) for RAID operations

# Create cpuset (recreated at each boot as cpusets are not persistent)
mkdir -p /sys/fs/cgroup/cpuset/raid_cpuset
echo 4,5,10,11 > /sys/fs/cgroup/cpuset/raid_cpuset/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/raid_cpuset/cpuset.mems
echo 1 > /sys/fs/cgroup/cpuset/raid_cpuset/cpuset.cpu_exclusive

# Wait for MD arrays to be assembled
sleep 2

# Move all MD kernel threads to the cpuset
for tid in $(ps -eLo tid,comm 2>/dev/null | grep -E "md.*_raid|md.*_resync" | awk '{print $1}'); do
    if [ -n "$tid" ]; then
        echo $tid > /sys/fs/cgroup/cpuset/raid_cpuset/tasks 2>/dev/null
    fi
done

# Also move mdadm monitor process if running
MDADM_PID=$(pgrep -f "mdadm.*monitor" 2>/dev/null)
if [ -n "$MDADM_PID" ]; then
    echo $MDADM_PID > /sys/fs/cgroup/cpuset/raid_cpuset/tasks 2>/dev/null
fi

echo "RAID CPU affinity configured: cores 4,5,10,11"
EOF

chmod +x /usr/local/bin/configure-raid-cpu-affinity.sh

# Enable and start the service
systemctl daemon-reload
systemctl enable raid-cpu-affinity.service
systemctl start raid-cpu-affinity.service

# Verify the service status
systemctl status raid-cpu-affinity.service

Example: CPU Isolation at Boot Time

For maximum isolation, CPU cores can be isolated from the general scheduler at boot time using kernel parameters. This ensures that isolated cores are only used by processes explicitly assigned to them.

# Edit the GRUB configuration
# Both RHEL/Rocky Linux and Debian/Ubuntu use /etc/default/grub

# Add isolcpus parameter to isolate cores 4,5,10,11
# Note: isolcpus isolates by CPU number, so we specify 4,5,10,11
GRUB_CMDLINE_LINUX="isolcpus=4,5,10,11 nohz_full=4,5,10,11 rcu_nocbs=4,5,10,11"

# Update GRUB configuration
grub2-mkconfig -o /boot/grub2/grub.cfg  # RHEL/Rocky Linux
# OR
update-grub  # Debian/Ubuntu

# Reboot to apply changes
reboot

Example: Complete RAID Performance Optimisation with CPU Affinity

This example demonstrates a complete configuration combining CPU affinity, speed limits, and stripe cache optimisation for optimal RAID performance. Note that cpuset configurations are not persistent across reboots and must be recreated at each boot. The systemd service approach shown earlier handles this automatically.

# Complete RAID Performance Optimisation Script
# For AMD Ryzen 5 5600H: 6 cores, 12 threads
# Note: This script should be run at boot time via systemd service for persistence
# ============================================

# 1. Configure RAID speed limits (persistent via sysctl)
cat > /etc/sysctl.d/10-raid.conf << 'EOF'
# RAID Performance Tuning for AMD Ryzen 5 5600H
dev.raid.speed_limit_min = 50000
dev.raid.speed_limit_max = 500000
EOF
systemctl restart systemd-sysctl

# 2. Create CPU cpuset for RAID operations
# Note: Cpusets are not persistent - must be recreated at each boot
# This is handled automatically by the systemd service shown in previous examples
mkdir -p /sys/fs/cgroup/cpuset/raid_cpuset
echo 4,5,10,11 > /sys/fs/cgroup/cpuset/raid_cpuset/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/raid_cpuset/cpuset.mems
echo 1 > /sys/fs/cgroup/cpuset/raid_cpuset/cpuset.cpu_exclusive

# 3. Configure stripe cache for RAID5/RAID6 arrays
for md in /sys/block/md*/md/level; do
    if [ -f "$md" ]; then
        level=$(cat "$md")
        if [[ "$level" == "raid5" ]] || [[ "$level" == "raid6" ]]; then
            md_name=$(basename $(dirname $(dirname "$md")))
            echo 4096 > /sys/block/$md_name/md/stripe_cache_size
            echo "Configured stripe cache for $md_name: 4096 pages (16 MB)"
        fi
    fi
done

# 4. Move MD kernel threads to RAID cpuset
sleep 2  # Wait for arrays to be fully assembled
for tid in $(ps -eLo tid,comm 2>/dev/null | grep -E "md.*_raid|md.*_resync" | awk '{print $1}'); do
    if [ -n "$tid" ]; then
        echo $tid > /sys/fs/cgroup/cpuset/raid_cpuset/tasks 2>/dev/null
        echo "Moved thread $tid to RAID cpuset"
    fi
done

# 5. Verify configuration
echo "*** RAID Performance Configuration Summary ***"
echo "Speed Limits:"
sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max
echo -e "\nCPU Affinity (RAID Cores: 4,5,10,11):"
ps -eLo tid,psr,comm | grep -E "md.*_raid|md.*_resync" | head -5
echo -e "\nStripe Cache Sizes:"
for md in /sys/block/md*/md/stripe_cache_size; do
    if [ -f "$md" ]; then
        md_name=$(basename $(dirname $(dirname "$md")))
        cache_size=$(cat "$md")
        echo "$md_name: $cache_size pages ($((cache_size * 4)) KB)"
    fi
done

# 6. Monitor performance
# Monitoring RAID Performance
cat /proc/mdstat
mpstat -P 4,5,10,11 1

Performance Considerations and Recommendations

For the AMD Ryzen 5 5600H with 6 cores and 12 threads, the following recommendations apply:

Core Allocation Strategy:

  • Conservative: Dedicate 1-2 physical cores (2-4 threads) for RAID operations, leaving 4-5 cores for system and applications
  • Balanced: Dedicate 2 physical cores (4 threads) for RAID operations, providing good performance without significant impact on other workloads
  • Aggressive: Dedicate 3-4 physical cores (6-8 threads) for RAID operations, maximising rebuild speed but reducing available cores for other processes

Physical Cores vs Logical Threads:

  • Pinning to physical cores (avoiding SMT pairs) can provide more consistent performance
  • For example, using cores 4 and 5 (threads 4,5,10,11) utilises two physical cores with SMT
  • Alternatively, using only one thread per core (e.g., threads 4,5) may provide better per-thread performance but utilises fewer resources

Monitoring and Verification:

  • Monitor CPU utilisation: mpstat -P ALL 1
  • Check thread CPU affinity: taskset -p <tid>
  • Monitor RAID synchronisation speed: watch -n 1 'cat /proc/mdstat'
  • Verify core isolation: cat /proc/cmdline | grep isolcpus

NUMA Considerations:

The AMD Ryzen 5 5600H is a single-socket processor with unified memory architecture, so NUMA considerations are not applicable. For multi-socket systems, ensure that pinned CPU cores are on the same NUMA node as the storage devices to minimise memory access latency.

3.3.6 Filesystem Tuning: XFS and ext4 Alignment with LVM2 Stripe Geometry

Proper filesystem alignment with the underlying LVM2 stripe geometry and RAID chunk size is essential for achieving optimal performance. Both XFS and ext4 provide parameters that must be calculated and configured based on the RAID chunk size, LVM2 stripe size, and the number of underlying arrays. This section provides comprehensive examples demonstrating the complete configuration workflow from RAID array creation through filesystem creation with proper alignment.

Example 1: XFS Filesystem on LVM2 over Four RAID1 Arrays

This example demonstrates the complete configuration of an XFS filesystem on an LVM2 logical volume striped across four RAID1 arrays, with each array using a 256 KB chunk size. The configuration ensures alignment at all layers: RAID chunk size, LVM2 extent size, LVM2 stripe size, and XFS stripe parameters.

# Step 1: Create four independent RAID1 arrays with 256 KB chunk size
mdadm --create /dev/md0 --level=1 --raid-devices=2 --chunk=256 /dev/sda1 /dev/sdb1
mdadm --create /dev/md1 --level=1 --raid-devices=2 --chunk=256 /dev/sdc1 /dev/sdd1
mdadm --create /dev/md2 --level=1 --raid-devices=2 --chunk=256 /dev/sde1 /dev/sdf1
mdadm --create /dev/md3 --level=1 --raid-devices=2 --chunk=256 /dev/sdg1 /dev/sdh1

# Step 2: Calculate alignment parameters
# RAID chunk size: 256 KB
# Number of arrays: 4
# Effective stripe width: 256 KB × 4 = 1 MB
# LVM2 extent size should match stripe width: 1 MB
# LVM2 stripe size should match RAID chunk size: 256 KB

# Step 3: Create physical volumes with data alignment to 1 MB boundary
pvcreate --dataalignment 1M /dev/md0 /dev/md1 /dev/md2 /dev/md3

# Step 4: Create volume group with extent size matching stripe width (1 MB)
vgcreate --physicalextentsize 1M vg_raid10_like /dev/md0 /dev/md1 /dev/md2 /dev/md3

# Step 5: Create striped logical volume
# -i 4: 4 stripes (one per array)
# -I 256K: Stripe size matching RAID chunk size (256 KB)
# -L 500G: Logical volume size
lvcreate -i 4 -I 256K -L 500G -n lv_raid10_like vg_raid10_like

# Step 6: Calculate XFS stripe parameters
# LVM2 stripe size (stripe unit): 256 KB
# Number of arrays (stripe members): 4
# mkfs.xfs takes the stripe unit in bytes via su and the number of stripe
# members via sw; the equivalent sunit/swidth options are expressed in
# 512-byte sectors (sunit = 256 KB / 512 B = 512, swidth = 512 × 4 = 2048)

# Step 7: Create XFS filesystem with stripe alignment
mkfs.xfs -d su=256k,sw=4 /dev/vg_raid10_like/lv_raid10_like

# Step 8: Verify alignment
# Check physical volume alignment
pvs -o +pe_start

# Check volume group extent size
vgs -o +vg_extent_size

# Check logical volume stripe configuration
lvs -o +stripes,stripe_size

# Verify XFS stripe alignment
xfs_info /dev/vg_raid10_like/lv_raid10_like
# xfs_info reports sunit/swidth in 4 KB filesystem blocks:
# expect sunit=64 blks, swidth=256 blks (256 KB stripe unit, 1 MB stripe width)

Example 2: XFS Filesystem on LVM2 over Six RAID6 Arrays (RAID60-like)

This example demonstrates configuration for a RAID60-like setup using six RAID6 arrays, each with 512 KB chunk size, providing both redundancy and high performance for large-scale storage deployments.

# Step 1: Create six independent RAID6 arrays with 512 KB chunk size
mdadm --create /dev/md0 --level=6 --raid-devices=6 --chunk=512 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
mdadm --create /dev/md1 --level=6 --raid-devices=6 --chunk=512 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdl1
mdadm --create /dev/md2 --level=6 --raid-devices=6 --chunk=512 /dev/sdm1 /dev/sdn1 /dev/sdo1 /dev/sdp1 /dev/sdq1 /dev/sdr1
mdadm --create /dev/md3 --level=6 --raid-devices=6 --chunk=512 /dev/sds1 /dev/sdt1 /dev/sdu1 /dev/sdv1 /dev/sdw1 /dev/sdx1
mdadm --create /dev/md4 --level=6 --raid-devices=6 --chunk=512 /dev/sdy1 /dev/sdz1 /dev/sdaa1 /dev/sdab1 /dev/sdac1 /dev/sdad1
mdadm --create /dev/md5 --level=6 --raid-devices=6 --chunk=512 /dev/sdae1 /dev/sdaf1 /dev/sdag1 /dev/sdah1 /dev/sdai1 /dev/sdaj1

# Step 2: Calculate alignment parameters
# RAID chunk size: 512 KB
# Number of arrays: 6
# Effective stripe width: 512 KB × 6 = 3 MB
# LVM2 extent size: 3 MB
# LVM2 stripe size: 512 KB

# Step 3: Create physical volumes with data alignment to 3 MB boundary
pvcreate --dataalignment 3M /dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5

# Step 4: Create volume group with extent size matching stripe width
vgcreate --physicalextentsize 3M vg_raid60_like /dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5

# Step 5: Create striped logical volume
lvcreate -i 6 -I 512K -L 2T -n lv_raid60_like vg_raid60_like

# Step 6: Calculate XFS stripe parameters
# LVM2 stripe size (stripe unit): 512 KB
# Number of arrays (stripe members): 6
# mkfs.xfs: su=512k (stripe unit in bytes), sw=6 (number of stripe members)

# Step 7: Create XFS filesystem with stripe alignment
mkfs.xfs -d su=512k,sw=6 /dev/vg_raid60_like/lv_raid60_like

# Step 8: Verify configuration
xfs_info /dev/vg_raid60_like/lv_raid60_like
# xfs_info reports sunit/swidth in 4 KB filesystem blocks: expect sunit=128 blks, swidth=768 blks

Example 3: ext4 Filesystem on LVM2 over Four RAID1 Arrays

This example demonstrates ext4 filesystem configuration with proper alignment for the same four-array RAID1 configuration used in Example 1. ext4 uses stride and stripe-width parameters instead of XFS's sunit and swidth.

# Steps 1-5: Same as Example 1 (RAID arrays, PVs, VG, LV creation)
# ... (RAID arrays, PVs, VG, and LV creation commands from Example 1) ...

# Step 6: Calculate ext4 alignment parameters
# ext4 block size: 4 KB (default)
# LVM2 stripe size: 256 KB
# Number of arrays: 4
# stride = LVM2 stripe size / ext4 block size = 256 KB / 4 KB = 64
# stripe-width = stride × number of arrays = 64 × 4 = 256

# Step 7: Create ext4 filesystem with stripe alignment
mkfs.ext4 -E stride=64,stripe-width=256 /dev/vg_raid10_like/lv_raid10_like

# Step 8: Verify alignment
# Check physical volume alignment
pvs -o +pe_start --units s

# Check logical volume stripe configuration
lvs -o +stripes,stripe_size

# Verify ext4 block size and alignment
tune2fs -l /dev/vg_raid10_like/lv_raid10_like | grep -iE "block size|stride|stripe width"

Example 4: Complete Configuration with Verification and Mount Options

This example provides a complete end-to-end configuration including verification steps and optimised mount options for both XFS and ext4 filesystems.

# Complete XFS Configuration Example
# ===================================

# 1. Create RAID arrays
mdadm --create /dev/md0 --level=1 --raid-devices=2 --chunk=256 /dev/sda1 /dev/sdb1
mdadm --create /dev/md1 --level=1 --raid-devices=2 --chunk=256 /dev/sdc1 /dev/sdd1
mdadm --create /dev/md2 --level=1 --raid-devices=2 --chunk=256 /dev/sde1 /dev/sdf1
mdadm --create /dev/md3 --level=1 --raid-devices=2 --chunk=256 /dev/sdg1 /dev/sdh1

# 2. Save RAID configuration
mdadm --detail --scan >> /etc/mdadm/mdadm.conf   # /etc/mdadm.conf on RHEL/Rocky Linux
update-initramfs -u  # For Debian/Ubuntu
# OR
dracut --force        # For RHEL/Rocky Linux

# 3. Create LVM2 structure
pvcreate --dataalignment 1M /dev/md0 /dev/md1 /dev/md2 /dev/md3
vgcreate --physicalextentsize 1M vg_storage /dev/md0 /dev/md1 /dev/md2 /dev/md3
lvcreate -i 4 -I 256K -L 500G -n lv_data vg_storage

# 4. Create XFS filesystem
mkfs.xfs -d su=256k,sw=4 /dev/vg_storage/lv_data

# 5. Verify all alignment layers
#
# Physical Volume Alignment
pvs -o pv_name,pe_start --units s

# Volume Group Extent Size
vgs -o vg_name,vg_extent_size

# Logical Volume Stripe Configuration
lvs -o lv_name,stripes,stripe_size

# XFS Stripe Alignment
xfs_info /dev/vg_storage/lv_data

# 6. Mount with optimised options
mkdir -p /mnt/storage
mount -o noatime,nodiratime /dev/vg_storage/lv_data /mnt/storage

# 7. Add to /etc/fstab for persistent mounting
echo "/dev/vg_storage/lv_data /mnt/storage xfs noatime,nodiratime 0 0" >> /etc/fstab

# Complete ext4 Configuration Example
# ===================================

# Steps 1-3: Same as XFS example above
# ... (RAID arrays, LVM2 structure creation) ...

# 4. Create ext4 filesystem
mkfs.ext4 -E stride=64,stripe-width=256 /dev/vg_storage/lv_data

# 5. Verify alignment
pvs -o pv_name,pe_start --units s

# Logical Volume Stripe Configuration
lvs -o lv_name,stripes,stripe_size

# ext4 Alignment Parameters
tune2fs -l /dev/vg_storage/lv_data | grep -iE "block size|stride|stripe width"

# 6. Mount with optimised options
mkdir -p /mnt/storage
mount -o noatime,nodiratime /dev/vg_storage/lv_data /mnt/storage

# 7. Add to /etc/fstab
echo "/dev/vg_storage/lv_data /mnt/storage ext4 noatime,nodiratime 0 0" >> /etc/fstab

Parameter Calculation Reference

The following formulas should be used when calculating filesystem alignment parameters for different configurations:

For XFS:

  • mkfs.xfs accepts the stripe geometry as su (stripe unit in bytes) and sw (number of stripe members): su = LVM2 stripe size, sw = number of arrays
  • Example: for a 256 KB stripe size across 4 arrays: mkfs.xfs -d su=256k,sw=4
  • The equivalent sunit and swidth options are expressed in 512-byte sectors: sunit = (LVM2 stripe size) / 512 B, swidth = sunit × (number of arrays)
  • Example: 256 KB / 512 B = 512 sectors, so sunit=512 and swidth=2048
  • xfs_info reports sunit and swidth in filesystem blocks (typically 4 KB), so a 256 KB stripe unit across 4 arrays appears as sunit=64 blks, swidth=256 blks

Calculation Examples for XFS:

  • LVM2 stripe size: 256 KB, Arrays: 4
  • Command: mkfs.xfs -d su=256k,sw=4
  • xfs_info then reports sunit=64 blks, swidth=256 blks (in 4 KB filesystem blocks)
  • LVM2 stripe size: 512 KB, Arrays: 6
  • Command: mkfs.xfs -d su=512k,sw=6
  • xfs_info then reports sunit=128 blks, swidth=768 blks (in 4 KB filesystem blocks)

For ext4:

  • stride = (LVM2 stripe size in bytes) / (ext4 block size in bytes)
  • stripe-width = stride × (number of arrays)
  • Both parameters are specified as integer values representing filesystem blocks.

Common Configuration Values:

For a typical configuration with 256 KB LVM2 stripe size and 4 arrays:

  • XFS: su=256k, sw=4 (reported by xfs_info as sunit=64 blks, swidth=256 blks: 256 KB stripe unit, 1 MB stripe width)
  • ext4: stride=64, stripe-width=256 (64 blocks × 4 KB = 256 KB per array, 256 blocks total width)

For a configuration with 512 KB LVM2 stripe size and 6 arrays:

  • XFS: su=512k, sw=6 (reported by xfs_info as sunit=128 blks, swidth=768 blks: 512 KB stripe unit, 3 MB stripe width)
  • ext4: stride=128, stripe-width=768 (128 blocks per array, 768 blocks total width)

3.3.7 Disk Selection for Optimal RAID Performance

The selection of appropriate storage devices for RAID arrays is critical for achieving optimal performance characteristics that align with application workload requirements. Different application types impose distinct I/O patterns and performance demands, necessitating careful consideration of device characteristics including interface type, random and sequential performance, latency characteristics, and endurance ratings. The fundamental principle governing RAID performance is that the slowest device in the array determines the overall array performance, making device homogeneity essential for optimal operation.

For database applications, particularly Online Transaction Processing (OLTP) workloads characterised by high random I/O operations and low latency requirements, NVMe SSDs provide the necessary performance characteristics. OLTP databases typically require random read/write IOPS exceeding 100,000 operations per second with sub-millisecond latency, characteristics that NVMe SSDs deliver through their direct PCIe interface and advanced controller architectures. Enterprise NVMe SSDs can provide random 4K read/write IOPS between 400,000 and 1,000,000, with sequential read speeds of 3,000-7,000 MB/s and sequential write speeds of 2,000-6,000 MB/s, making them suitable for high-performance database deployments. For Online Analytical Processing (OLAP) workloads with sequential read-heavy patterns, high-performance SATA SSDs or enterprise HDDs may provide sufficient throughput whilst maintaining cost-effectiveness, as these workloads prioritise sequential bandwidth over random IOPS.

High-Performance Computing (HPC) and machine learning workloads present different requirements, with training workloads demanding high sequential throughput for reading large datasets and writing checkpoints, whilst inference workloads require high random read IOPS with low latency. For training workloads requiring throughput exceeding 5 GB/s, NVMe SSDs in RAID 0 or RAID 10 configurations provide the necessary bandwidth, with NVMe SSDs capable of delivering aggregate throughput of 50 GB/s or higher in multi-device configurations. Inference workloads benefit from NVMe SSDs in RAID 10 configurations, providing both high random read performance and redundancy for production deployments. Data preparation workloads, which exhibit mixed sequential and random access patterns, can utilise SATA SSDs or NVMe SSDs depending on throughput requirements, with SATA SSDs providing cost-effective solutions for moderate performance requirements.

The endurance characteristics of SSDs, measured in Drive Writes Per Day (DWPD), must be considered when selecting devices for write-intensive workloads. Read-intensive workloads such as content delivery or archival storage may utilise SSDs with lower DWPD ratings, reducing cost whilst maintaining performance. Write-intensive workloads including database transaction logs, checkpoint storage, or scratch storage for computational workloads require SSDs with higher DWPD ratings to ensure device longevity over the expected operational lifetime. Enterprise SSDs typically provide DWPD ratings ranging from 1 to 10, with higher ratings corresponding to increased cost but improved suitability for write-intensive applications.

The interface type and protocol significantly impact performance characteristics, with NVMe providing the highest performance through direct PCIe connectivity, SATA SSDs offering balanced performance and cost through the SATA interface, and traditional HDDs providing maximum capacity at lower cost through SATA or SAS interfaces. For LVM2 over mdadm RAID configurations, device homogeneity is essential, requiring all devices within an array to have identical specifications including interface type, capacity, and performance characteristics. Mixing devices with different performance characteristics results in the array operating at the speed of the slowest device, negating the performance advantages of faster devices and reducing overall array efficiency.
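
A quick way to confirm device homogeneity before building arrays is to compare the candidate devices' model, size, transport, rotational flag, and sector sizes; a minimal sketch (device names are illustrative):

# List candidate devices; ROTA=0 indicates SSD/NVMe, TRAN shows sata/nvme/sas
lsblk -d -o NAME,MODEL,SIZE,ROTA,TRAN,TYPE | grep disk

# Confirm identical logical/physical sector sizes across the intended members
for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
    echo "$dev: logical $(cat /sys/block/$(basename $dev)/queue/logical_block_size) B," \
         "physical $(cat /sys/block/$(basename $dev)/queue/physical_block_size) B"
done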

3.4 Management and Monitoring

The LVM2 over mdadm RAID approach provides significant advantages in management and monitoring capabilities. Each RAID array can be monitored independently using standard mdadm tools, providing granular visibility into the health and performance of individual arrays. This granular monitoring capability enables administrators to perform individual health checks on each array rather than examining aggregate statistics, allowing for more precise identification of potential issues before they escalate into failures.

The independent nature of arrays enables targeted alerting, where monitoring systems can generate alerts for specific arrays rather than aggregate storage system alerts. This targeted alerting capability enables more precise response procedures, as administrators can immediately identify which specific array requires attention without needing to investigate the entire storage system. When issues are identified, independent recovery procedures can be executed for the affected array without impacting the operation of other arrays in the configuration.

Beyond RAID array management, the LVM2 layer provides additional management capabilities including snapshot creation, thin provisioning, and logical volume resizing. These LVM2 features operate over the underlying RAID arrays, providing a unified management interface for both redundancy and volume management functions. The combination of independent RAID array management and LVM2 volume management provides a comprehensive storage management solution with both redundancy control and volume flexibility.
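
A brief sketch of these LVM2 capabilities layered over the RAID arrays, reusing the vg_storage/lv_data names from the examples in Section 3.3.6; the volume names and sizes are illustrative:

# Create a snapshot of the logical volume (10 GB of copy-on-write space)
lvcreate -s -L 10G -n lv_data_snap /dev/vg_storage/lv_data

# Check snapshot usage, then remove it when no longer needed
lvs -o lv_name,lv_attr,data_percent vg_storage
lvremove /dev/vg_storage/lv_data_snap

# Thin provisioning: create a thin pool and an over-provisioned thin volume
lvcreate -L 200G --thinpool tp_data vg_storage
lvcreate -V 500G --thin -n lv_thin vg_storage/tp_data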

The operational workflow benefits from simplified troubleshooting procedures, as issues can be isolated to specific arrays rather than requiring investigation of complex nested structures. Each array is a standard mdadm RAID array, meaning that standard mdadm tools and procedures apply without requiring special knowledge of nested configurations. The clear separation of concerns between RAID redundancy management and volume management simplifies documentation and operational procedures, as each layer can be documented and managed independently whilst understanding their interaction.

3.5 Failure Recovery

Failure recovery procedures with LVM2 over mdadm RAID are significantly simplified compared to traditional nested RAID configurations. When a drive failure occurs, the affected array can be identified using standard mdadm commands to examine the array state. The failed drive can be removed from the specific array and a replacement drive added, with the array rebuilding independently whilst all other arrays continue normal operation without any impact.
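
A typical replacement workflow is sketched below, assuming /dev/sdd1 has failed in /dev/md1 and /dev/sdn is the replacement drive (device names are illustrative); all commands are standard mdadm operations and affect only the degraded array.

# Identify the degraded array and the failed member
cat /proc/mdstat
mdadm --detail /dev/md1

# Mark the failed drive as faulty (if not already) and remove it from the array
mdadm --manage /dev/md1 --fail /dev/sdd1
mdadm --manage /dev/md1 --remove /dev/sdd1

# Replicate the partition layout from a surviving member onto the new drive,
# then add the new partition; the rebuild runs on /dev/md1 only
sfdisk -d /dev/sdc | sfdisk /dev/sdn
mdadm --manage /dev/md1 --add /dev/sdn1

# Monitor rebuild progress; /dev/md0, /dev/md2 and /dev/md3 continue unaffected
watch cat /proc/mdstat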

The isolated impact of failures means that a drive failure in one array does not affect the operation of other arrays in the configuration. This isolation provides significant operational advantages, as the system can continue operating normally whilst the failed array is being recovered. Each array rebuilds independently at its own pace, without requiring coordination with other arrays or affecting their performance.

The independent rebuild capability reduces the risk associated with array recovery operations. Smaller arrays rebuild faster than large monolithic arrays, reducing the window of vulnerability during which additional failures could result in data loss. The reduced rebuild scope means that rebuild operations affect only the specific array that experienced the failure, rather than larger portions of the storage system.

In contrast, traditional nested RAID configurations may require coordinated recovery procedures across the nested structure. A failure in one component of a nested configuration can affect the entire structure, requiring more complex recovery procedures. Rebuild operations in traditional configurations often affect larger portions of storage, and the longer rebuild times associated with large arrays increase the risk of additional failures occurring during the rebuild process.

3.6 Configuration Flexibility

The LVM2 over mdadm RAID approach provides substantial configuration flexibility that is not available with traditional nested RAID configurations. Arrays of different sizes can be combined within a volume group, although this results in capacity limitations as the usable capacity is determined by the smallest array when striping is used. Similarly, arrays with different performance characteristics can be mixed, though this results in performance limitations as the overall performance is constrained by the slowest array.

This flexibility enables selective usage of arrays, where different logical volumes can be created using different subsets of arrays. For example, high-performance logical volumes can be created from arrays utilising fast NVMe SSDs, whilst capacity-oriented logical volumes can be created from arrays utilising larger but slower HDDs. The dynamic reconfiguration capability allows arrays to be added to or removed from volume groups as requirements change, providing operational flexibility not available with fixed RAID configurations.
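
A sketch of such selective usage is shown below, assuming a volume group vg_tiered containing two NVMe-backed arrays (/dev/md10, /dev/md11) and two HDD-backed arrays (/dev/md20, /dev/md21); all names and sizes are illustrative.

# High-performance volume striped across the NVMe-backed arrays only
lvcreate --size 2T --stripes 2 --stripesize 256k --name lv_fast vg_tiered /dev/md10 /dev/md11

# Capacity-oriented volume striped across the HDD-backed arrays only
lvcreate --size 20T --stripes 2 --stripesize 256k --name lv_bulk vg_tiered /dev/md20 /dev/md21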

This configuration flexibility supports several use cases: tiered storage, where logical volumes are created from arrays with different performance characteristics to match workload requirements; capacity optimisation, as arrays can be combined and managed more flexibly than in fixed configurations; workload isolation, achieved by placing different workloads on separate array sets; and gradual migration from older to newer arrays, performed incrementally without complete system reconfiguration.

---

4. Discussion

4.1 Performance Considerations

The performance characteristics of LVM2 over mdadm RAID are equivalent to traditional nested RAID configurations when properly configured, but achieving this equivalence requires careful attention to alignment across all storage layers. The alignment requirements span multiple components: the RAID chunk size at the mdadm layer, the LVM2 physical extent size at the volume group level, the LVM2 stripe size when creating striped logical volumes, and the filesystem alignment parameters when creating filesystems on the logical volumes.

Proper alignment optimisation requires that the physical extent size matches the effective stripe width, which is calculated as the RAID chunk size multiplied by the number of arrays. The LVM2 stripe size must match the RAID chunk size to ensure that LVM2's striping granularity aligns with the underlying RAID data distribution pattern. Physical volume data alignment must be configured to align to stripe width boundaries using the --dataalignment parameter when creating physical volumes, and filesystem alignment must be configured to align with the LVM2 stripe geometry using filesystem-specific parameters.
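
A sketch of the logical volume and filesystem layers is shown below, assuming four arrays with a 256 KiB chunk size (effective stripe width 4 × 256 KiB = 1 MiB) and XFS as the filesystem; the physical volume and volume group creation with matching alignment parameters follows the configuration shown earlier.

# Striped logical volume: stripe count matches the number of arrays,
# stripe size matches the 256 KiB RAID chunk size
lvcreate --stripes 4 --stripesize 256k --size 1T --name lv_data vg_raid10_like

# XFS stripe unit and stripe width matching the LVM2 geometry
mkfs.xfs -d su=256k,sw=4 /dev/vg_raid10_like/lv_data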

Failure to achieve proper alignment results in I/O operations that span multiple stripes unnecessarily, causing read-modify-write cycles and other inefficiencies that degrade performance by 20-50%. This performance degradation can negate the operational benefits of the LVM2 over mdadm RAID approach, making proper alignment configuration essential for successful deployment.

4.2 Operational Advantages

The primary advantages of LVM2 over mdadm RAID are operational rather than performance-based, centring on improved manageability and flexibility rather than raw performance gains. Independent management of the arrays allows administrators to perform maintenance, expansion, and recovery operations on individual arrays without affecting the entire storage system, a degree of operational flexibility not available with traditional nested RAID configurations.

The scalability advantage manifests through incremental capacity expansion capabilities that do not require major reconfiguration of the storage system. New arrays can be added to existing volume groups, and logical volumes can be extended to utilise the new capacity, all whilst the system remains operational. This incremental expansion capability provides significant operational advantages over approaches requiring predetermined array sizes or major reconfiguration for capacity increases.
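
A minimal expansion sketch, assuming the vg_raid10_like volume group and lv_data logical volume from earlier examples and two new drive pairs (device names are illustrative); the new segment is striped across the added arrays only, leaving the original stripe geometry untouched.

# Create an additional pair of RAID1 arrays from new drives
mdadm --create /dev/md4 --level=1 --raid-devices=2 --chunk=256 /dev/sdi1 /dev/sdj1
mdadm --create /dev/md5 --level=1 --raid-devices=2 --chunk=256 /dev/sdk1 /dev/sdl1

# Initialise them as physical volumes and add them to the existing volume group
pvcreate --dataalignment 1M /dev/md4 /dev/md5
vgextend vg_raid10_like /dev/md4 /dev/md5

# Extend the logical volume into the new capacity and grow the filesystem online;
# the new segment is striped across the two added arrays
lvextend --stripes 2 --stripesize 256k --extents +100%FREE --resizefs /dev/vg_raid10_like/lv_data /dev/md4 /dev/md5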

Maintainability is improved through simplified troubleshooting and maintenance procedures. Each array can be managed independently using standard mdadm tools, and issues can be isolated to specific arrays rather than requiring investigation of complex nested structures. The visibility advantage provides better insight into individual array health and performance, enabling more precise monitoring and alerting compared to aggregate views of nested configurations.

4.3 Limitations and Considerations

The LVM2 over mdadm RAID approach has several limitations and considerations that must be understood when planning deployments. Capacity limitations arise from the striping mechanism: when data is striped across multiple arrays, usable capacity is limited by the smallest array, and any extra capacity on larger arrays remains unused. The best practice is therefore to use arrays of identical size to maximise capacity utilisation.

Performance limitations stem from the principle that a striped configuration operates at the speed of its slowest component. When arrays with different performance characteristics are combined, overall performance is constrained by the slowest array, with faster arrays effectively throttled to match it. It is therefore essential to use arrays with identical performance characteristics, including device type, interface speed, and sequential and random I/O capabilities.

Complexity considerations include the requirement for understanding both mdadm and LVM2 tools and concepts. Proper alignment requires careful configuration across multiple layers, and monitoring requires attention to both the RAID layer and the LVM2 layer to ensure optimal operation. However, this complexity is offset by the operational flexibility and management advantages provided by the approach.

4.4 Best Practices

Configuration best practices for LVM2 over mdadm RAID deployments emphasise consistency and proper alignment. Consistent array specifications should be used, with identical devices and configurations for all arrays to avoid capacity and performance limitations. Proper alignment must be ensured at all storage layers: the RAID layer with appropriate chunk size selection, the LVM2 layer with physical extent size and stripe size configuration, and the filesystem layer with alignment parameters matching the underlying storage geometry.

Extent size optimisation requires setting the physical extent size to match the effective stripe width, calculated as the RAID chunk size multiplied by the number of arrays. Stripe size matching ensures that the LVM2 stripe size matches the RAID chunk size, maintaining alignment between the LVM2 striping layer and the underlying RAID data distribution. Data alignment must be configured using the --dataalignment parameter when creating physical volumes, ensuring that the physical volume data area starts at an offset aligned with RAID stripe boundaries.
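
The following verification sketch, using standard LVM2 reporting options and xfs_info (the mount point /mnt/data is illustrative), can be used to confirm that the configured alignment is actually in effect:

# Physical volume data offset: pe_start should fall on a stripe-width boundary
pvs -o +pe_start --units k

# Volume group extent size and logical volume stripe geometry
vgs -o +vg_extent_size
lvs -o +stripes,stripe_size,devices

# Filesystem stripe unit (sunit) and stripe width (swidth), reported in 512-byte sectors
xfs_info /mnt/data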

Operational best practices include independent monitoring of each array separately, enabling early detection of issues and precise response procedures. Gradual expansion should be performed incrementally as capacity needs arise, rather than over-provisioning initial capacity. Clear documentation of array configurations, including chunk sizes, extent sizes, and alignment parameters, is essential for maintaining and troubleshooting the storage system. Testing of expansion and recovery procedures should be performed before production deployment to ensure that operational procedures are well-understood and can be executed efficiently when needed.

---

5. Conclusions

The use of LVM2 over mdadm RAID1 and RAID6 arrays provides significant operational benefits for Linux storage management whilst maintaining equivalent performance characteristics to traditional nested RAID configurations. The approach offers enhanced flexibility, simplified capacity expansion, improved manageability, and better failure isolation compared to direct RAID array usage or traditional nested RAID configurations.

Summary of Results:

The analysis demonstrates that LVM2 over mdadm RAID achieves equivalent performance to traditional RAID10/60 configurations when properly aligned, providing the same redundancy and performance characteristics whilst offering superior operational flexibility. Independent array management enables operational flexibility that is not available with traditional approaches, allowing maintenance, expansion, and recovery operations to be performed on individual arrays without affecting the entire storage system.

Capacity expansion is straightforward and non-disruptive, enabling incremental growth without major reconfiguration. Individual array management simplifies troubleshooting and maintenance procedures, as issues can be isolated to specific arrays and addressed independently. Better failure isolation ensures that failures in one array do not directly impact others, reducing the risk and complexity associated with storage system failures.

Recommendations:

Deployments requiring flexibility and incremental expansion should utilise LVM2 over mdadm RAID, as this approach provides the operational advantages necessary for dynamic storage environments. Proper alignment must be ensured across all storage layers for optimal performance, requiring careful configuration of RAID chunk size, LVM2 extent size, LVM2 stripe size, and filesystem alignment parameters. Identical array specifications should be used to avoid capacity and performance limitations that arise from mixing arrays with different characteristics.

Independent monitoring should be implemented for each RAID array, enabling early detection of issues and precise response procedures. Configuration documentation and operational procedures should be maintained to ensure operational consistency and enable efficient troubleshooting and maintenance operations.

Future Work:

Further research could examine:

  • Performance characteristics under specific workload patterns
  • Optimal extent size calculations for various array configurations
  • Automated alignment verification and optimisation tools
  • Comparative analysis with other storage management approaches

---
