Mike's Notes

Troubleshooting Common Issues

Last updated 5 months ago

Cannot Reach Server

  1. Ping the server by Hostname and IP Address:

    • Hostname/IP Address is pingable:

      • The issue might be on the client side since the server is reachable.

    • Hostname is not pingable but IP Address is pingable:

      • Likely a DNS issue. Check:

        • /etc/hosts

        • /etc/resolv.conf

        • /etc/nsswitch.conf

        • Test DNS Resolution:

          • Using nslookup, dig or host

    • Neither Hostname nor IP Address is pingable:

      • Check whether another server on the same network is also unreachable:

        • No: The issue is with this specific host/server.

        • Yes: Likely a broader network issue.

      • Log in via Virtual Console (if the server is powered on):

        • Check uptime using the uptime command.

        • Verify if the server has an IP and if the network interface is UP.

          • Run the command ip addr

          • Ensure the network interface (e.g., eth0, ens33) is listed and in the "UP" state.

        • Ping the gateway and check routes.

        • Check SELinux and firewall rules.

        • Inspect physical cable connections.
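
The decision tree above can be sketched as a small shell helper. This is only an illustration: the function names (can_ping, diagnose, check_server) are made up for this example, and it assumes the Linux iputils ping and getent are installed.

```shell
#!/bin/sh
# Sketch of the reachability decision tree. Function names are illustrative.

# Return 0 if the target answers one ping within 2 seconds.
can_ping() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# Map the two ping results (hostname, IP; each "yes" or "no") to a next step.
diagnose() {
    case "$1/$2" in
        yes/yes) echo "reachable: look at the client side" ;;
        no/yes)  echo "DNS issue: check /etc/hosts, /etc/resolv.conf, /etc/nsswitch.conf" ;;
        yes/no)  echo "name resolves to a different address than the IP you tested" ;;
        *)       echo "host or network issue: test a neighbouring server, then use the console" ;;
    esac
}

# Usage: check_server <hostname> <ip>
check_server() {
    h=no
    i=no
    can_ping "$1" && h=yes
    can_ping "$2" && i=yes
    diagnose "$h" "$i"
    # Show what the system resolver (hosts file, then DNS) returns for the name.
    getent hosts "$1" || echo "no resolver answer for $1"
}
```

On the server console, ip addr and ip route cover the interface and gateway checks from the list.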

Cannot Reach Website or Application

  1. Ping the server by Hostname and IP Address:

    • Not pingable: Follow the troubleshooting steps in “Cannot Reach Server.”

    • Pingable: Check service availability using the telnet command with the appropriate port (e.g., telnet <host> <port>):

      • Connection succeeds: The service is running and reachable.

      • Connection fails: The service is not reachable or not running. Check:

        • Service status (using systemctl or equivalent commands).

        • Firewall/SELinux settings.

        • Service logs.

        • Service configuration.
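
A minimal sketch of the port test and the follow-up service checks. It uses nc instead of telnet because nc is easier to script. It assumes a netcat that supports -z and -w, and a systemd host. The unit name you pass in (for example nginx) is a placeholder.

```shell
#!/bin/sh
# Sketch: test a TCP port, then inspect the service if the port is closed.
# Assumes netcat (nc) and systemd; function names are illustrative.

# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
port_open() {
    nc -z -w 3 "$1" "$2" >/dev/null 2>&1
}

# Usage: check_service <host> <port> <systemd-unit>
check_service() {
    if port_open "$1" "$2"; then
        echo "port $2 on $1 accepts connections"
        return 0
    fi
    echo "port $2 on $1 is closed or filtered"
    # The remaining checks are meant to run on the server itself.
    systemctl status "$3" --no-pager
    journalctl -u "$3" -n 50 --no-pager
    return 1
}
```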

Unable to SSH as Root or User

  1. Ping the server by Hostname and IP Address:

    • Not pingable: Follow the troubleshooting steps in “Cannot Reach Server.”

    • Pingable: Check service availability using the telnet command with the SSH port (22 by default):

      • Connection succeeds: The service is running:

        • Check if the issue is on the client side.

        • Verify:

          • User account is not disabled.

          • User has a valid shell (not nologin).

          • Root login is not disabled in the SSH configuration.

      • Connection fails: The service is not reachable or not running. Check:

        • Service status (using systemctl or equivalent commands).

        • Firewall/SELinux settings.

        • Service logs.

        • Service configuration.
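
The account checks above could look roughly like this on the server. The function names are hypothetical, and the sketch assumes getent, passwd -S, and sshd -T are available and that it is run as root.

```shell
#!/bin/sh
# Sketch: server-side checks for a user who cannot log in over SSH.

# Return 0 if the given login shell permits interactive logins.
shell_allows_login() {
    case "$1" in
        */nologin|*/false) return 1 ;;
        *) return 0 ;;
    esac
}

# Usage: check_ssh_user <username>   (run as root on the server)
check_ssh_user() {
    entry=$(getent passwd "$1") || { echo "no such user: $1"; return 1; }
    shell=${entry##*:}          # last passwd field: the login shell
    if shell_allows_login "$shell"; then
        echo "shell ok: $shell"
    else
        echo "shell blocks login: $shell"
    fi
    passwd -S "$1"              # second field shows the lock status (L or LK = locked)
    sshd -T | grep -i '^permitrootlogin'
}
```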

Disk Space is Full or Adding/Extending Disk Space

  1. Detect Performance Degradation:

    • Applications are slow or unresponsive.

    • Commands fail to execute (e.g., because the / filesystem is full).

    • Logging and other system operations fail.

  2. Analyze the Issue:

    • Use the df command to identify the problematic filesystem.

  3. Take Action:

    • Use du to find large files/directories in the affected filesystem.

    • Compress or remove large files.

    • Move files to another partition or server.

    • Check disk health with badblocks (e.g., badblocks -v /dev/sda).

    • Identify I/O-bound processes using iostat.

    • Move large files/directories to another filesystem and create a symbolic link at the original path.

  4. Add a New Disk:

    • Simple Partition:

      • Add the disk to the VM.

      • Verify the new disk using lsblk or fdisk -l (df only lists mounted filesystems).

      • Use fdisk to create a partition (preferably LVM).

      • Create a filesystem, mount it, and add it to fstab for persistence.

    • LVM Partition:

      • Add the disk to the VM.

      • Verify with lsblk or fdisk -l.

      • Use fdisk to create an LVM partition.

      • Set up PV, VG, and LV.

      • Create a filesystem, mount it, and add it to fstab.

    • Extend LVM Partition:

      • Add a disk and create an LVM partition on it.

      • Add the new LVM partition (PV) to the existing VG.

      • Extend the LV and resize the filesystem.
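
As a rough sketch, the analysis and LVM extension steps might be scripted as below. The function names, the 90% default threshold, and the device, volume group, and logical volume names in the usage comment are all placeholders.

```shell
#!/bin/sh
# Sketch: find what fills a filesystem, then (optionally) grow an LVM volume.

# List mount points at or above a usage threshold (percent, default 90).
full_filesystems() {
    df -P | awk -v limit="${1:-90}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit) print $6, $5
    }'
}

# Show the ten largest entries under a directory (sizes in KiB),
# without crossing into other filesystems.
largest_entries() {
    du -xk -d 1 "$1" 2>/dev/null | sort -rn | head -n 10
}

# Add a new partition to an existing VG and grow the LV into the free space.
# Usage: extend_lv /dev/sdb1 vg_data lv_data   (run as root)
extend_lv() {
    pvcreate "$1" &&
    vgextend "$2" "$1" &&
    lvextend -r -l +100%FREE "/dev/$2/$3"   # -r also resizes the filesystem
}
```

For example, full_filesystems 80 prints every mount point that is at least 80% full, and largest_entries /var shows where the space went.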

Filesystem Corruption

  1. Symptoms:

    • The system fails to boot.

  2. Check Logs:

    • Investigate /var/log/messages, dmesg, and other log files.

    • Look for bad sector logs.

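  3. Run fsck if Bad Sectors are Found:

    • Reboot the system into rescue mode (e.g., boot from CD-ROM or ISO).

    • Run fsck against the affected filesystem while it is unmounted (for XFS, use xfs_repair instead).

    • If fstab is also damaged, select Option 1 to mount the original root filesystem under /mnt/sysimage, then edit the fstab entries or recreate the file using blkid.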

    • Reboot the system.
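
A minimal sketch of the repair step from a rescue shell. The function names are illustrative, and the choice of e2fsck for ext filesystems and xfs_repair for XFS is an assumption about which filesystems are in use.

```shell
#!/bin/sh
# Sketch: pick a repair command for an unmounted filesystem in rescue mode.
# Never run a repair against a mounted filesystem.

# Print the repair command that matches a filesystem type.
repair_command() {
    case "$1" in
        ext2|ext3|ext4) echo "e2fsck -f -y" ;;
        xfs)            echo "xfs_repair" ;;
        *)              echo "fsck -y" ;;
    esac
}

# Usage: repair_fs <device>   (run as root from the rescue shell)
repair_fs() {
    if grep -q "^$1 " /proc/mounts; then
        echo "$1 is mounted; unmount it first" >&2
        return 1
    fi
    fstype=$(blkid -o value -s TYPE "$1") || return 1
    echo "running: $(repair_command "$fstype") $1"
    $(repair_command "$fstype") "$1"
}

# Kernel messages that often accompany corruption or a failing disk.
scan_kernel_log() {
    dmesg | grep -iE 'i/o error|ext4-fs error|xfs.*corrupt'
}
```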

Missing or Incorrect fstab File

  1. Symptoms:

    • The system fails to boot.

  2. Check Logs:

    • Investigate /var/log/messages, dmesg, and other log files.

    • Look for mount failures or references to devices and UUIDs that do not exist.

  3. Repair fstab in Rescue Mode:

    • Reboot the system into rescue mode (e.g., boot from CD-ROM or ISO).

    • Select Option 1 to mount the original root filesystem under /mnt/sysimage.

    • Edit fstab entries or recreate the file using blkid.

    • Reboot the system.
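
Recreating fstab entries from blkid output can be sketched like this. The helper names are hypothetical, and the defaults mount option is just a starting point, not a recommendation for every filesystem.

```shell
#!/bin/sh
# Sketch: rebuild fstab lines from blkid output.

# Print one fstab line.
# Usage: fstab_line <uuid> <mountpoint> <fstype> [fsck-pass]
# Use pass 1 for the root filesystem; other filesystems default to 2.
fstab_line() {
    printf 'UUID=%s\t%s\t%s\tdefaults\t0\t%s\n' "$1" "$2" "$3" "${4:-2}"
}

# Look up UUID and type for a device and print its fstab line.
# Usage: fstab_entry_for <device> <mountpoint>   (run as root)
fstab_entry_for() {
    uuid=$(blkid -o value -s UUID "$1") || return 1
    fstype=$(blkid -o value -s TYPE "$1") || return 1
    fstab_line "$uuid" "$2" "$fstype"
}
```

After editing the file (under /mnt/sysimage/etc/fstab in rescue mode), running mount -a on a booted system reports lines that cannot be mounted. Test this before you reboot.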

Cannot cd to Directory (Even with Sudo Privileges)

  1. Reasons and Resolutions:

    • Directory does not exist.

    • Pathname conflict (relative vs absolute path).

    • Parent directory permission or ownership issues.

    • Missing executable permissions on the target directory.

    • Hidden directory not visible.
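
The checks above can be sketched as two small helpers (names are illustrative). Note that root passes the execute test on any directory, so run these as the affected user.

```shell
#!/bin/sh
# Sketch: explain why cd into a directory fails.

# Usage: why_no_cd <path>
why_no_cd() {
    if [ ! -e "$1" ]; then
        echo "missing or unreachable: $1 (check spelling, the path, and parent permissions)"
    elif [ ! -d "$1" ]; then
        echo "not a directory: $1"
    elif [ ! -x "$1" ]; then
        echo "no execute (search) permission on $1"
    else
        echo "ok: $1 can be entered"
    fi
}

# Print permissions and ownership of each component, from the path up to /.
# Usage: walk_path <absolute-path>
walk_path() {
    p=$1
    while [ "$p" != "/" ] && [ "$p" != "." ] && [ -n "$p" ]; do
        ls -ld "$p"
        p=$(dirname "$p")
    done
    ls -ld "$p"
}
```

Where util-linux is installed, namei -l <path> prints the same per-component listing. Hidden directories only show up with ls -a.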

Cannot Create Links

  1. Reasons and Resolutions:

    • Target directory or file does not exist.

    • Pathname conflict (relative vs absolute path): ensure the path is complete.

    • Parent directory permission or ownership issues.

    • Target file permission or ownership issues: with fs.protected_hardlinks enabled, creating a hard link requires owning the target or having read and write access to it.

    • Hidden directory or file not visible.
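
A small sketch of creating a symbolic link in a way that avoids the relative-path pitfall. safe_symlink is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: create a symbolic link with an absolute target and verify it.

# Usage: safe_symlink <target> <link-path>
safe_symlink() {
    if [ ! -e "$1" ]; then
        echo "target does not exist: $1" >&2
        return 1
    fi
    # Resolve the target's directory so the link works from any location.
    target_dir=$(cd "$(dirname "$1")" && pwd) || return 1
    ln -s "$target_dir/$(basename "$1")" "$2" || return 1
    ls -l "$2"
}
```

A relative target is interpreted relative to the directory that contains the link, not the directory where ln was run. That is why the sketch resolves the target first.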

Running Out of Memory

  1. Types of Memory:

    • Cache: L1, L2, L3.

    • RAM:

      • Usage details from free -h:

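        • Total: Total memory usable by the system.

        • Used: Total memory actually in use.

        • Free: Memory that is completely unused.

        • Shared: Memory used mostly by tmpfs and shared memory segments.

        • Buff/Cache: Memory used by kernel buffers and the page cache.

        • Available: Estimate of memory available for starting new applications without swapping.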

      • Check /proc/meminfo for detailed metrics:

        • File active/inactive, Anon active/inactive.

    • Swap (Virtual Memory): Monitor and manage for system stability.

  2. Resolutions:

    • Identify high-memory processes using top, htop, or ps.

    • Check logs for OOM events and review memory overcommit settings in sysctl.conf.

    • Kill or restart memory-hogging processes/services.

    • Use nice or renice to adjust CPU priority, and oom_score_adj to make critical processes less likely to be chosen by the OOM killer.

    • Add or extend swap space.

    • Install more physical RAM.
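
A minimal sketch for reading /proc/meminfo and finding the largest consumers. The function names are illustrative, and the ps options assume the procps version of ps found on most Linux distributions.

```shell
#!/bin/sh
# Sketch: report memory figures and the largest memory consumers.

# Print a field from a meminfo-style file, converted from kB to MiB.
# Usage: meminfo_mib <field> [file]   (file defaults to /proc/meminfo)
meminfo_mib() {
    awk -v key="$1:" '$1 == key { printf "%d\n", $2 / 1024 }' "${2:-/proc/meminfo}"
}

# Header plus the five processes with the largest resident set size.
top_memory() {
    ps -eo pid,rss,comm --sort=-rss | head -n 6
}

# Recent OOM-killer activity in the kernel log (may need root).
oom_events() {
    dmesg | grep -iE 'out of memory|oom-killer'
}
```

For example, meminfo_mib MemAvailable prints the same estimate that free shows in its "available" column, in MiB.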

Add or Extend Swap Space

  1. Steps to Add Swap Space:

    • Create a file using dd to reserve disk blocks for swap.

    • Set file permissions to 600 and assign root ownership.

    • Format the file for swap with mkswap.

    • Enable swap using swapon.

    • Add the swap file to fstab for persistence.
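
The steps above, sketched as a function. The path /swapfile and the size in the usage comment are example values, the function names are illustrative, and status=progress assumes GNU dd.

```shell
#!/bin/sh
# Sketch of the swap-file steps. Run add_swapfile as root.

# Print the fstab line for a swap file.
swap_fstab_line() {
    printf '%s none swap sw 0 0\n' "$1"
}

# Usage: add_swapfile /swapfile 2048   (size in MiB)
add_swapfile() {
    dd if=/dev/zero of="$1" bs=1M count="$2" status=progress &&
    chown root:root "$1" &&
    chmod 600 "$1" &&
    mkswap "$1" &&
    swapon "$1" &&
    swap_fstab_line "$1" >> /etc/fstab
    swapon --show               # list active swap areas
}
```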

Unable to Run Certain Commands

  1. Troubleshooting and Resolutions:

    • Command issues:

      • System-related commands may require root access.

      • User-defined scripts/commands might have restrictions.

    • Steps to troubleshoot:

      • Check permission or ownership of the command/script.

      • Ensure sudo privileges are configured.

      • Verify the absolute or relative path to the command/script.

      • Ensure the command is in the user's $PATH variable.

      • Confirm that the command is installed.

      • Check for missing or deleted command libraries.
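
The troubleshooting steps can be combined into one helper. diagnose_cmd is a hypothetical name, and the library check only runs where ldd exists.

```shell
#!/bin/sh
# Sketch: work out why a command will not run.

# Usage: diagnose_cmd <command-name-or-path>
diagnose_cmd() {
    cmd_path=$(command -v "$1") || {
        echo "not found: $1 is not in PATH, not installed, or not executable"
        return 1
    }
    echo "found: $cmd_path"
    if [ -f "$cmd_path" ]; then
        ls -l "$cmd_path"       # permissions and ownership
        # Missing shared libraries appear as "not found" in ldd output.
        if command -v ldd >/dev/null 2>&1; then
            ldd "$cmd_path" 2>/dev/null | grep 'not found'
        fi
    fi
    return 0
}
```

sudo -l lists what the current user may run with sudo, and echo "$PATH" shows the search path used above.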

System Unexpectedly Rebooting and Processes Restarting

  1. Troubleshooting and Resolution:

    • System Reboot/Crash Reasons:

      • CPU stress.

      • RAM stress.

      • Kernel fault.

      • Hardware fault.

    • Process Restart Causes:

      • System reboot triggers process restarts.

      • Processes might restart themselves.

      • Watchdog applications:

        • Prevent high stress on system resources.

        • Restart or terminate processes causing excessive stress.

    • Troubleshooting Steps:

      • After logging in, check system status using commands like:

        • uptime, top, dmesg, journalctl, iostat -xz 1.

      • Examine log files: /var/log/syslog, /var/log/boot.log, /var/log/dmesg, /var/log/messages.

      • Check custom application log paths.

      • If inaccessible, use an out-of-band management console (e.g., HPE iLO, Dell iDRAC).

      • Open a support case with the vendor if needed.
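
A rough sketch of gathering evidence after an unexpected reboot. The function names and the search patterns are illustrative and not exhaustive. The journalctl calls assume systemd with a persistent journal.

```shell
#!/bin/sh
# Sketch: collect evidence after an unexpected reboot.

# Print lines that suggest a crash, OOM kill, watchdog action, or hardware fault.
# Usage: scan_log <file>
scan_log() {
    grep -iE 'kernel panic|out of memory|oom-killer|watchdog|machine check|hardware error' "$1"
}

# Reboot history and errors logged during the previous boot.
reboot_evidence() {
    uptime
    last -x reboot shutdown | head -n 10
    journalctl --list-boots
    journalctl -b -1 -p err --no-pager | tail -n 50
}
```

For example, scan_log /var/log/messages narrows a large log down to the lines worth reading first.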

Unable to Get an IP Address

  1. IP Assignment Methods:

    • DHCP:

      • Fixed Allocation.

      • Dynamic Allocation.

    • Static IP.

  2. Troubleshooting Steps:

    • Check network settings in the virtualization environment (e.g., VMware, VirtualBox).

    • Verify whether an IP address has been assigned.

    • Check the NIC status on the host using tools like lspci, nmcli.

    • Restart the network service.
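
A minimal sketch of the address check. The function names are illustrative and the interface name in the usage comment is a placeholder. It assumes iproute2 and dhclient. On hosts managed by NetworkManager, nmcli would be the usual way to renew a lease instead.

```shell
#!/bin/sh
# Sketch: check whether an interface has an IPv4 address and request a lease.

# Read `ip -o -4 addr show` output on stdin; print "<iface> <address>" pairs.
parse_ipv4() {
    awk '$3 == "inet" { print $2, $4 }'
}

# Usage: check_ip <interface>   (requesting a lease needs root)
check_ip() {
    ip -br link show "$1"       # the state column should read UP
    if ip -o -4 addr show dev "$1" | parse_ipv4 | grep -q .; then
        echo "$1 has an IPv4 address"
    else
        echo "$1 has no IPv4 address; requesting a DHCP lease"
        dhclient -v "$1"
    fi
}
```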

Backup and Restore File Permissions in Linux

  1. Backup and Restore Steps:

    • The best option is to create an ACL file for directories/files before making bulk permission changes:

      • Backup file permissions: getfacl -R <dir> > permissions.acl.

      • Restore file permissions: setfacl --restore=permissions.acl.

    • Restore using a VM snapshot (not ideal for production environments).

    • Rebuild the VM (a safer option for long-term stability).
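
The ACL backup and restore can be wrapped as below. The function names are hypothetical, and the sketch assumes the acl package (getfacl and setfacl) is installed. Pass the backup file as an absolute path.

```shell
#!/bin/sh
# Sketch: snapshot permissions before a bulk change and restore them afterwards.

# Usage: backup_acls <dir> <absolute-backup-file>
backup_acls() {
    # Run from the parent directory so the recorded paths are relative to it.
    ( cd "$(dirname "$1")" && getfacl -R "$(basename "$1")" ) > "$2"
}

# Usage: restore_acls <dir> <absolute-backup-file>
restore_acls() {
    # Restore from the same parent directory the backup was taken from.
    ( cd "$(dirname "$1")" && setfacl --restore="$2" )
}
```

setfacl --restore resolves the paths in the backup file relative to the current directory, so restoring from a different directory silently targets the wrong files or fails.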

Useful Tips Related to Disk Partitioning

  1. Tips for Managing Disk Partitions:

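    • After attaching a new disk to a VM, use lsblk to check whether it is visible. If it is not, rescan the SCSI hosts:

      • echo "- - -" > /sys/class/scsi_host/host0/scan (repeat for each hostN).

    • After enlarging an existing disk, rescan that device so the kernel sees the new size:

      • echo 1 > /sys/block/sda/device/rescan.

    • Increasing the size of an existing disk appends additional space to the disk without affecting the existing file system or partition; the partition and filesystem must then be extended separately.

    • Creating a new filesystem on a block device overwrites the existing filesystem and its data.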

    • For a disk with an existing partition/filesystem, attach the .vmdk file to another VM. After mounting, the data is identical.
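
A sketch of the two sysfs rescans that make the kernel notice disk changes without a reboot. The function names are illustrative, and the device name in the usage comment is a placeholder. Both functions need root.

```shell
#!/bin/sh
# Sketch: detect new or resized virtual disks without rebooting.

# Ask every SCSI host to scan for newly attached disks, then list block devices.
scan_new_disks() {
    for host in /sys/class/scsi_host/host*; do
        [ -w "$host/scan" ] && echo "- - -" > "$host/scan"
    done
    lsblk
}

# Re-read the size of an existing disk after it was enlarged in the hypervisor.
# Usage: rescan_disk sda
rescan_disk() {
    echo 1 > "/sys/block/$1/device/rescan" && lsblk "/dev/$1"
}
```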