Monday, September 15, 2025

RHEL Booting into an Old Kernel? Here’s the Fix

 On my RHEL 8.10 system, I noticed that even though the latest kernel was installed, the server kept booting with an older one.

The Cause

GRUB was configured to use the last saved kernel entry (saved_entry) instead of automatically picking the most recent kernel.

The Solution

The fix was simple: edit the GRUB configuration and tell it to always boot the first entry (which is the newest kernel).

  1. Open /etc/default/grub and set:

    GRUB_DEFAULT=0
  2. Regenerate the GRUB configuration:

    • BIOS:

      grub2-mkconfig -o /boot/grub2/grub.cfg
    • UEFI:

      grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
  3. Reboot the server:

    reboot

After that, the system automatically boots into the latest installed kernel every time. 🚀
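
To double-check which kernel GRUB will pick before and after the change, these standard RHEL 8 commands are useful:

grubby --default-kernel   # prints the kernel GRUB will boot by default
grub2-editenv list        # shows saved_entry, if one is set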

Securely Mounting Azure Blob Storage on Linux with a Service Principal

 Historically, the most common way to mount Azure Blob Storage as a file system on Linux has been using the storage account key. However, this method presents two significant security risks:

  1. The account key grants full access to all containers and resources, breaking the principle of least privilege.

  2. The key must be stored in a plain text configuration file, which is a poor security practice.

The modern and recommended solution is to use an Azure AD Service Principal (SPN) in conjunction with BlobFuse, Azure's FUSE driver.

The Secure Alternative: Service Principal and RBAC

By using a Service Principal, you can:

  • Apply Role-Based Access Control (RBAC) to assign granular permissions, such as the Storage Blob Data Contributor role, which only allows access to blob data, not account management (see the example after this list).

  • Avoid exposing the master account key.

  • Enable key rotation or use certificates for even greater security.
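
For example, granting the Storage Blob Data Contributor role to the Service Principal can be done with the Azure CLI. This is a minimal sketch; the subscription, resource group, and account names are placeholders:

az role assignment create \
  --assignee "<client-id-of-the-service-principal>" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account-name>"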

The BlobFuse Twist on Linux

The main difference when using a Service Principal with BlobFuse is how the client secret is handled. Unlike other parameters, the secret is not included in the .cfg configuration file.

The configuration file (.cfg) only requires the following data:

accountName <storage-account-name>
authType SPN
containerName <container-name>
servicePrincipalClientId <client-id-of-the-service-principal>
servicePrincipalTenantId <tenant-id>

The client secret must be injected as an environment variable named AZURE_STORAGE_SPN_CLIENT_SECRET.

AZURE_STORAGE_SPN_CLIENT_SECRET="xxxxxxxxxxxx"

This prevents the secret from being exposed in a plain text file that could be accessible to other users on the system.

Practical Example on Ubuntu

To ensure the secret is available to BlobFuse when the system boots (for instance, if you're using an entry in /etc/fstab), the most straightforward way is to add the environment variable to /etc/environment.

1. Define the Secret

Edit the /etc/environment file (with nano, vi, or your preferred editor) and add the variable at the end:

sudo nano /etc/environment

Add this line:

AZURE_STORAGE_SPN_CLIENT_SECRET="aaffEERR~3_LyAI..."

Note: By default, /etc/environment is world-readable. For stricter security, consider using a private systemd EnvironmentFile with chmod 600 permissions.
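
A minimal sketch of that stricter approach, assuming the mount is handled by a dedicated systemd service rather than /etc/fstab; the unit name, file paths, and cache directory below are illustrative:

# /etc/blobfuse/spn.env  (owned by root, chmod 600)
AZURE_STORAGE_SPN_CLIENT_SECRET="aaffEERR~3_LyAI..."

# /etc/systemd/system/blobfuse-mycontainer.service
[Unit]
Description=Mount Azure Blob container with BlobFuse
After=network-online.target
Wants=network-online.target

[Service]
Type=forking
EnvironmentFile=/etc/blobfuse/spn.env
ExecStart=/usr/bin/blobfuse /mnt/mycontainer --tmp-path=/mnt/blobfusetmp --config-file=/etc/blobfuse/storage.cfg -o allow_other
ExecStop=/bin/fusermount -u /mnt/mycontainer

[Install]
WantedBy=multi-user.target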

2. Create the BlobFuse Configuration File

Create a configuration file (e.g., storage.cfg) and ensure it only contains the Service Principal IDs and account information, without the secret.

accountName myblobaccount
authType SPN
containerName mycontainer
servicePrincipalClientId 12345678-abcd-efgh-1234-567890abcdef
servicePrincipalTenantId abcdef12-3456-7890-abcd-efgh12345678

3. Mount the Container

Now you can mount the container using the blobfuse command or by configuring an entry in /etc/fstab. BlobFuse will automatically pick up the secret from the environment variable.

With the AZURE_STORAGE_SPN_CLIENT_SECRET variable defined, simply run the following (note that BlobFuse also needs a temporary cache directory, passed here with --tmp-path):

sudo blobfuse /path/to/mountpoint --tmp-path=/path/to/blobfusetmp --config-file=/path/to/storage.cfg

If you're using /etc/fstab, the entry would look something like this (BlobFuse mounts as a FUSE file system, so its options go in the options field):

blobfuse /mnt/mycontainer fuse defaults,_netdev,--tmp-path=/path/to/blobfusetmp,--config-file=/path/to/storage.cfg,allow_other 0 0

Conclusion

Using a Service Principal to mount Azure Blob Storage on Linux with BlobFuse is a much more secure practice than using the account key. By keeping the client secret as an environment variable, you strengthen security and align with Azure AD and RBAC best practices, avoiding the exposure of sensitive credentials in configuration files. This "quirk" in the configuration is, in fact, a key security feature.

Tuesday, July 8, 2025

Troubleshooting Network Printers in RHEL with CUPS

 When users report printing issues, having a clear troubleshooting workflow can save a lot of time. Below is a concise guide with essential commands to investigate and resolve common printer problems.


🔍 Step-by-Step Diagnostic Guide

1. Check Printer Status

lpstat -p officeprinter1
lpq -P officeprinter1

This shows whether the printer is enabled, actively printing, or unreachable.

2. Check Recently Completed Jobs

lpstat -W completed -o officeprinter1

Helps confirm whether jobs were properly processed by CUPS.

3. Verify Device URI

lpstat -v officeprinter1

Displays the configured connection, e.g., socket://192.168.1.150:9100.

4. Test Network Connectivity

ping -c 4 192.168.1.150

No response may indicate the printer is powered off, disconnected, or blocked by a firewall.

5. Check if Print Port is Reachable

nc -zv 192.168.1.150 9100

Confirms whether the raw print port is open (9100 is standard for many network printers).

6. Check DNS Resolution

getent hosts officeprinter1

Validates whether the printer hostname resolves to an IP address.

7. Send a Test Page

lp -d officeprinter1 /usr/share/cups/data/testprint

Used to verify if the queue is functional and printing.


🛠️ Reset the Print Queue (if stuck)

cancel -a officeprinter1
cupsdisable officeprinter1
cupsenable officeprinter1

Cancels all jobs and restarts the print queue.
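
If the queue remains stuck even after being re-enabled, restarting the CUPS service itself is a reasonable next step (standard systemd service on RHEL):

systemctl restart cups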


⚙️ Reconfigure Printer Using Direct IP


lpadmin -p officeprinter1 -v socket://192.168.1.150:9100

Useful if the hostname no longer resolves or the IP/port has changed.

✅ Conclusion

These commands allow you to quickly diagnose whether an issue is caused by network failure, misconfiguration, or a printer-side error. This workflow is applicable for both standard network printers and custom printing setups in RHEL environments.


Tuesday, April 29, 2025

Troubleshooting systemd Error 203/EXEC While Starting Chronos


Recently, while trying to start the Chronos service using systemctl, I encountered the following error:


systemctl start chronos

Upon checking the status:

chronos.service: Control process exited, code=exited status=203/EXEC

Root Cause

The error status=203/EXEC means that systemd failed to execute the service script. Common reasons include:

  1. Missing shebang (#!/bin/bash) at the top of the script.

  2. Incorrect file permissions — in this case, the script had chmod 777, which is far too permissive for an init script and was tightened as part of the fix.


Resolution Steps

Here’s what I did to resolve it:

  1. Added the shebang to the script:

    #!/bin/bash

    Placed at the very top of /etc/rc.d/init.d/chronos.

  2. Corrected the script permissions:

    chmod 755 /etc/rc.d/init.d/chronos
  3. Verified that the script works manually:

    /etc/rc.d/init.d/chronos start
  4. Restarted the service:

    systemctl start chronos

This time, the service started successfully and showed active (exited), which is expected behavior for SYSV-style scripts.


Recommendation

While the issue is resolved, consider migrating to a native systemd unit file for better process management, logging, and compatibility. Legacy init scripts work, but native units provide more control and clarity.
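
As a rough illustration, a native unit that simply wraps the existing script could look like the sketch below; the paths are taken from this post, but the unit itself is hypothetical and would need to be adapted:

# /etc/systemd/system/chronos.service (illustrative only)
[Unit]
Description=Chronos service (wrapped legacy init script)
After=network.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/etc/rc.d/init.d/chronos start
ExecStop=/etc/rc.d/init.d/chronos stop

[Install]
WantedBy=multi-user.target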

Wednesday, January 22, 2025

Conflict in Kernel Path Assignment and Device Mapping for Multipath LUN /dev/mapper/mpXXX

 

Problem Overview

The issue stemmed from a conflict with the multipath device /dev/mapper/mpathch. The kernel reassigned the same SCSI device paths (e.g., sdmg, sdfl, sdmd, sdft) to a new LUN while they were still mapped to mpathch, resulting in duplicate device mappings and access conflicts. This caused errors during device scans and prevented the cleanup of the mpathch device.

Errors Observed

  1. During pvscan, errors were encountered when reading the device:


    Error reading device /dev/mapper/mpathch at 0 length 512.
    Error reading device /dev/mapper/mpathch at 0 length 4.
    Error reading device /dev/mapper/mpathch at 4096 length 4.

    These errors indicated that the device could not be properly accessed or read by the system, likely due to the path conflicts.

  2. dmsetup info -c revealed that:

    • The device /dev/mapper/mpathch was still active with paths assigned.
    • The logical volume vgexport-lvexport was in use, blocking further actions on mpathch.

Resolution Steps

  1. Checked Active Devices:

    • Used dmsetup info -c to identify active devices and locate mpathch and associated logical volumes.

    dmsetup info -c | grep mpathch
    dmsetup info -c | grep lv
  2. Removed the Blocking Logical Volume:

    • Identified and removed the logical volume vgexport-lvexport, which was preventing the unmapping of mpathch.
    dmsetup remove vgexport-lvexport
  3. Forcefully Removed the Multipath Device:

    • Used dmsetup remove -f to forcibly delete the mpathch device from the device-mapper layer.

    dmsetup remove -f mpathch

Validation

  • Verified that mpathch and its associated paths were no longer present using:
    dmsetup info -c
    multipath -ll
  • Confirmed the system was no longer referencing the conflicting LUN and informed the storage team for reassignment or cleanup.
  • After the multipath maps were reloaded, the new LUN became visible in the system.
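
For reference, flushing and rebuilding the maps after this kind of cleanup generally looks like the commands below; this is a generic sketch rather than the exact sequence used in this incident:

multipath -F    # flush all unused multipath device maps
multipath -r    # force a reload of the multipath maps
multipath -ll   # verify the resulting topology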

Conclusion

The problem was caused by duplicate kernel path assignments for a new LUN, which conflicted with existing device mappings and caused read errors during device scans. By removing the blocking logical volume and forcefully unmapping the multipath device, the issue was resolved, and the system was returned to a clean state.

Tuesday, October 8, 2024

Kernel Panic During Oracle Cluster Testing on RHEL 8

 

Hey everyone,

I wanted to share some "exciting" issues I ran into while testing an Oracle cluster on RHEL 8—you know, just your everyday kernel panic to spice things up!

The Issue

While conducting tests, I decided to bring down a couple of network interfaces (enp43s1f8 and enp43s1f9) using nmcli. Little did I know that right after deactivating them, I would be treated to a lovely kernel panic, logged as follows:


Oct 8 16:52:24 serverA kernel: sysrq: SysRq : Trigger a crash
Oct 8 16:52:24 serverA kernel: Kernel panic - not syncing: sysrq triggered crash

This surprise happened even with the iSCSI service (iscsi.service) disabled. Apparently, the system thought it was a great time to throw a party!

What I Found

  1. Dispatcher Scripts:

    • I found out that a script (04-iscsi) in /usr/lib/NetworkManager/dispatcher.d/ was doing its own thing and triggering actions whenever the network state changed. It was like that overly enthusiastic colleague who jumps in during a meeting and derails the conversation!
  2. Fixing the Issue:

    • To bring back some sanity, I temporarily removed or renamed the 04-iscsi script (see the sketch after this list). After that, I was able to bring down the interfaces without causing the system to have a meltdown. Who knew a little housekeeping could go such a long way?
  3. For Future Tests:

    • Always use nmcli to gracefully deactivate connections; it’s less dramatic than a kernel panic!
    • Review any service dependencies to make sure nothing throws a tantrum when you change network states.
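
Here is a sketch of that temporary workaround, using the script path and interface names from this post; the backup location is arbitrary, and the script should be restored once testing is done:

# park the dispatcher script so NetworkManager stops invoking it
mv /usr/lib/NetworkManager/dispatcher.d/04-iscsi /root/04-iscsi.bak

# run the interface tests
nmcli device disconnect enp43s1f8
nmcli device disconnect enp43s1f9

# restore the script after testing
mv /root/04-iscsi.bak /usr/lib/NetworkManager/dispatcher.d/04-iscsi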

This little adventure reminded me of the importance of understanding how network management scripts and services like iSCSI interact—especially when they seem to have a mind of their own. By being proactive and keeping a sense of humor, we can avoid these surprises in the future.

Thursday, October 3, 2024

Resolving Oracle ASM Disk Configuration Issues After Node Reboots

Problem Summary:

Each time Node B is rebooted, two disks fail to automatically configure in Oracle ASM (Automatic Storage Management). These disks—ORA01_DSK2 and ORA01_DSK3—require manual intervention to be configured correctly after the system starts. This issue is evident from the output of the oracleasm scandisks command, where these disks are listed as valid but need to be instantiated manually.

Even though the disks are recognized as valid ASM disks, they are not automatically configured during the boot process, requiring the execution of oracleasm scandisks manually to bring them online.
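
For context, the manual workaround after each reboot was simply to rescan the disks with oracleasm; listdisks is shown here only as a quick way to verify the result:

oracleasm scandisks
oracleasm listdisks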

Diagnosis:

Several potential causes could explain this issue:

  1. Service Dependency Misconfiguration: The Oracle ASM service (oracleasm.service) might be starting before the udev service, which manages device initialization, completes its work. This would result in some disks not being available when ASM performs its scan.
  2. Disk Initialization Timing: Some disks may take longer to be detected by the system, causing them to be unavailable when Oracle ASM performs its automatic scan.

Solution Applied:

To address the problem, the following changes were made to the oracleasm service configuration:

  1. Modifying the oracleasm Service: The configuration file for the Oracle ASM service (/etc/systemd/system/multi-user.target.wants/oracleasm.service) was updated to ensure that Oracle ASM starts after the udev service has fully initialized all devices. The following line was added to the configuration:

    After=systemd-udevd.service

    This ensures that the system’s disks are available by the time Oracle ASM performs its scan.

  2. Adding a Startup Timeout: A 120-second startup timeout was added using the TimeoutStartSec=120s directive. This allows Oracle ASM enough time to detect and configure all disks properly, even if disk initialization takes longer than usual. This prevents ASM from failing due to delayed disk recognition.
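
Putting both changes together, the added directives in oracleasm.service look like this (only the new lines are shown; the rest of the unit was left untouched):

[Unit]
After=systemd-udevd.service

[Service]
TimeoutStartSec=120s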




Conclusion:

By modifying the Oracle ASM service to depend on udev and increasing the timeout for service startup, the issue of disks not being automatically configured after a reboot was resolved. This solution ensures that all ASM disks are properly recognized and configured during the boot process, eliminating the need for manual configuration.