Ajitabh Pandey's Soul & Syntax

Exploring systems, souls, and stories – one post at a time

Category: FLOSS

About Free/Libre/Open Source Software

  • Lessons from Running a Live Streaming Setup for More than 7 Years

    After seven years of managing high-traffic live streams, you learn that the biggest challenges aren’t usually the video codecs—they are the “invisible” layers: filesystem synchronization, HTTP header inheritance, and metadata consistency.

    When you scale from a single server to a cluster of distribution nodes behind a Load Balancer (LB), the margin for error disappears. Here are the core lessons learned from troubleshooting a production-scale HLS environment.

    1. The “Last-Modified” Lie and LB Skew

    In a multi-server setup (we use 5 distribution nodes), your player is constantly rotating between different IPs. If you use lsyncd or rsync to push files from a source to these nodes, you will encounter Sync Skew.

    Even with a 0-second delay, one server might receive the latest .m3u8 playlist 500ms before another. If a player hits Server A and then Server B, and Server B is slightly behind, the player sees a Last-Modified timestamp that is “older” than the previous one. This triggers Stall Detection in the player (often seen as manifestAgeMs jumping between 20s and 70s), even if the stream is technically healthy.

    The Lesson: Don’t let the player rely on the file’s “birth certificate.” Force the player to judge the stream by its actual content (the Media Sequence) by suppressing metadata headers and using aggressive cache control.

    location /livestream/ {
        alias /var/www/liveout/;

        # HLS playlists must never be cached by the LB or the player
        add_header Cache-Control "no-cache, no-store, must-revalidate, max-age=0" always;
        expires -1;

        # Kill the headers that cause false "Stall" detections
        add_header Last-Modified "";
        add_header ETag "";
        if_modified_since off;

        open_file_cache off;
        include cors_support;
    }
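
    To make "judge the stream by its Media Sequence" concrete, here is a minimal Python sketch of the comparison a player could make instead of trusting Last-Modified. The function names and the two sample playlists are hypothetical and only illustrate the idea; real players implement this internally.

```python
import re


def media_sequence(playlist_text: str) -> int:
    """Return the EXT-X-MEDIA-SEQUENCE value of a media playlist (0 if the tag is absent)."""
    match = re.search(r"^#EXT-X-MEDIA-SEQUENCE:(\d+)\s*$", playlist_text, re.MULTILINE)
    return int(match.group(1)) if match else 0


def is_stale(previous_text: str, current_text: str) -> bool:
    """True when the newly fetched playlist is behind the one already seen."""
    return media_sequence(current_text) < media_sequence(previous_text)


# Hypothetical playlists, as two nodes might serve them during a sync skew.
node_a = "#EXTM3U\n#EXT-X-TARGETDURATION:12\n#EXT-X-MEDIA-SEQUENCE:1042\n#EXTINF:12.0,\nseg1042.ts\n"
node_b = "#EXTM3U\n#EXT-X-TARGETDURATION:12\n#EXT-X-MEDIA-SEQUENCE:1041\n#EXTINF:12.0,\nseg1041.ts\n"

print(is_stale(node_a, node_b))  # node B is one segment behind node A
print(is_stale(node_b, node_a))  # node A is ahead, so not stale
```

    A playlist that is one sequence number behind is a brief skew, not a stalled stream, which is why content-based checks tolerate it better than timestamp-based ones.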

    2. The Nginx Inheritance Trap (CORS)

    This is a silent killer. In Nginx, if you define an add_header directive in a parent location and then define any add_header in a nested child location, the child does not inherit the parent’s headers.

    If you optimize your .ts segments for caching but forget to re-include your CORS headers inside that specific block, your player will fetch the playlist successfully but then fail to download the actual media segments due to a CORS error.

    The Lesson: Always re-include your cors_support and use the always flag. The always flag ensures that even if a segment is briefly missing (404), the CORS headers are sent, allowing the player to see the 404 instead of throwing a confusing “CORS blocked” error.

    location ~* \.ts$ {
        # Re-include CORS because we are adding Cache-Control headers here
        include cors_support;

        # Segments are immutable; cache them forever
        add_header Cache-Control "max-age=31536000, public, immutable" always;
        expires 1y;

        # File handle caching is safe for segments
        open_file_cache max=1000 inactive=20s;
    }
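
    The inheritance rule is easier to remember as a tiny model. The Python sketch below is only a simplified illustration of the behaviour described above, not nginx code; it assumes cors_support boils down to an Access-Control-Allow-Origin header, and the function name is made up for this example.

```python
def effective_headers(parent: dict, child: dict) -> dict:
    """Simplified model of add_header inheritance: a level inherits the
    enclosing level's headers only if it defines none of its own."""
    return dict(child) if child else dict(parent)


# Assumed content of the parent location (what cors_support is taken to add).
parent = {"Access-Control-Allow-Origin": "*"}

# Child block that only sets caching: the CORS header silently disappears.
child_broken = {"Cache-Control": "max-age=31536000, public, immutable"}

# Child block that re-includes the CORS header next to its own.
child_fixed = {**parent, **child_broken}

print(effective_headers(parent, child_broken))
print(effective_headers(parent, child_fixed))
```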

    3. The “Two Masters” Conflict in rtmp.conf

    A common mistake is trying to “help” Nginx-RTMP by giving it an application block for every stream type. In our setup, we found an application app_audio block with hls on; while a separate FFmpeg script was writing audio HLS directly to the same directory. This caused random failures in generating the audio segments.

    Nginx-RTMP has a built-in “Garbage Collector” (hls_cleanup). If it sees files in its hls_path that it didn’t specifically create (because FFmpeg wrote them directly), it will delete them. To the admin, it looks like files are vanishing into thin air.

    The Lesson: If your FFmpeg script is handling the HLS generation (which is often necessary to satisfy strict Apple AVPlayer requirements for audio-only streams), remove the application block from Nginx-RTMP entirely.

    Correct Lean rtmp.conf Logic:

    • Application Ingest: Receives the stream and triggers the script.
    • Application Video: Receives the transcoded RTMP push for video HLS.
    • Audio: No application block. Let FFmpeg own the directory and the filesystem.

    4. The rsync Trap: --size-only

    When syncing HLS manifests to distribution nodes, it is tempting to use --size-only to speed up transfers. Do not do this. An HLS manifest often retains the same file size even when the content changes (e.g., by swapping one 12-second segment URL for another). rsync with --size-only will detect identical byte counts and skip the sync, leaving your distribution nodes with stale playlists.

    The Lesson: Stick to rsync's default quick check, which compares both file size and modification time (mtime). On a high-performance instance like a DigitalOcean C4 Droplet, the overhead is negligible, but reliability is everything.
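
    The failure mode is easy to reproduce on paper. The Python sketch below is not rsync's code; it simply models a size-only comparison against a content comparison for two hypothetical playlists that differ by one segment.

```python
import hashlib

# Two hypothetical playlists: one segment apart, identical byte count.
old = "#EXTM3U\n#EXT-X-MEDIA-SEQUENCE:1041\n#EXTINF:12.0,\nseg1041.ts\n".encode()
new = "#EXTM3U\n#EXT-X-MEDIA-SEQUENCE:1042\n#EXTINF:12.0,\nseg1042.ts\n".encode()


def size_only_says_unchanged(a: bytes, b: bytes) -> bool:
    """What a size-only check concludes: same length means 'nothing to sync'."""
    return len(a) == len(b)


def content_changed(a: bytes, b: bytes) -> bool:
    """Ground truth: compare the actual bytes via a digest."""
    return hashlib.sha256(a).digest() != hashlib.sha256(b).digest()


print(size_only_says_unchanged(old, new))  # the size check sees no difference
print(content_changed(old, new))           # yet the content did change
```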

    Summary: The Good, the Bad, and the Buffering

    1. Split your caching: Playlists get max-age=0; Segments get immutable.
    2. Explicit CORS: Nginx inheritance is not your friend. Re-include headers in nested blocks.
    3. One Master per Folder: If FFmpeg writes the HLS, Nginx-RTMP should stay out of the way.
    4. Atomic Sync: Use lsyncd with delay = 0 and compress = false for the lowest possible latency across your Load Balancer.

    By following these principles, you ensure that strict players – especially Apple’s AVPlayer – receive a stream that is consistent, fresh, and compliant with the HLS spec.

  • When a 2-Core Server Hits Load 45+: A Real-World LAMP Debugging Story

    There’s a particular kind of panic that sets in when you SSH into a production server and see this:

    load average: 45.63, 38.37, 28.93

    On a 2-core machine, that’s not just high — it’s catastrophic.

    I usually help one of my friends with LAMP servers hosted on DigitalOcean that run WooCommerce. The site brings in good sales for his business. Recently, he reached out to me to say that some of his customers reported slow order placement. When I logged into the server, I found an interesting pattern.

    This post walks through a real debugging session using a symptoms → diagnostics → solution approach. Along the way, we’ll uncover multiple overlapping issues (not just one), fix them step by step, and explain why architectural changes like PHP-FPM and Nginx matter.

    Symptoms: What went wrong

    The server started showing:

    • Extremely high load averages (45+ on a 2-core system)
    • Slow or unresponsive web requests
    • CPU constantly maxed out
    • Intermittent recovery followed by spikes

    Initial snapshot:

    # uptime
    load average: 5.95, 25.07, 25.33
    
    # nproc
    2

    Even after partial recovery, the load remained unstable.

    Diagnostics: What the system revealed

    1. Top CPU consumers

    # ps aux --sort=-%cpu | head -20

    Output (trimmed):

    root          92 35.8  0.0      0     0 ?        S    12:40  82:49 [kswapd0]
    mysql     198808 18.6 10.9 1821488 439632 ?      Ssl  16:29   0:31 /usr/sbin/mysqld
    www-data  197164  5.6  5.1 504092 205036 ?       S    16:16   0:51 /usr/sbin/apache2

    The key observation is that kswapd0, the kernel's page-reclaim thread, is consuming about 36% of the CPU. That is not normal; it means the kernel is struggling with memory pressure.

    2. Apache process explosion

    # ps aux | grep apache | wc -l
    14

    RSS is the actual physical RAM a process is using right now, measured in KB. It does NOT include swapped-out memory, so it represents memory currently resident in RAM. It is the single most important metric for sizing concurrency.

    In the output, I saw that the RSS is approximately 200MB – 260MB for each Apache process.

    So for 14 processes it is:

    14 processes × ~220MB ≈ ~3GB RAM

    On a 4GB system, that’s quite high.
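
    The same arithmetic can be turned around to size the worker limit. This is a rough Python sketch; the 1536 MB reserved for MySQL, the OS and page cache is my assumption for illustration, not a measured figure, and the function name is hypothetical.

```python
def max_workers(total_ram_mb: int, reserved_mb: int, per_worker_mb: int) -> int:
    """Workers that fit in RAM after setting aside memory for everything else."""
    return max(1, (total_ram_mb - reserved_mb) // per_worker_mb)


# 4 GB box, ~220 MB RSS per Apache worker, 1536 MB reserved (assumed).
print(max_workers(total_ram_mb=4096, reserved_mb=1536, per_worker_mb=220))  # 11
```

    RSS counts shared pages in every process, so this estimate errs on the conservative side.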

    3. MySQL check (surprisingly clean)

    When I checked the full process list in MySQL:

    mysql> SHOW FULL PROCESSLIST;

    I found it clean, with a few sleeping connections and no long-running queries. I verified this with

    # mysqladmin processlist

    and found a similar output. So MySQL wasn’t the bottleneck.

    4. Network state – hidden problem

    The netstat output revealed a hidden problem that may have been contributing to the sluggishness.

    # netstat -ant | awk '{print $6}' | sort | uniq -c
    .....
    121 SYN_RECV
    .....

    This indicates:

    • Many half-open TCP connections
    • Likely bot traffic or SYN flood behavior
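
    If you want the same state count from a script, for example to alert on it, a minimal Python sketch of the awk/sort/uniq pipeline could look like this. The sample text and addresses are made up; only the column layout of netstat -ant is assumed.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count connections per TCP state (6th column of `netstat -ant`)."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip the two header lines; keep tcp and tcp6 rows only.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


# Hypothetical, shortened netstat output.
sample = """Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN
tcp 0 0 10.0.0.5:80 203.0.113.7:51234 SYN_RECV
tcp 0 0 10.0.0.5:80 203.0.113.8:51235 SYN_RECV
tcp6 0 0 :::443 :::* LISTEN
"""

print(count_tcp_states(sample))
```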

    5. System pressure via vmstat

    In this case, vmstat was the most useful tool I ran. In its output:

    • r is the number of runnable processes (running or waiting for CPU). Ideally, it stays at or below the number of CPU cores; a value above that indicates CPU contention.
    • id is the percentage of CPU time spent idle. A value in the 70-100% range indicates a relaxed system, a low value (say 0-20%) indicates a busy CPU, and 0% means it is fully saturated.
    • si and so are the amounts of memory swapped in from and out to disk per second. A value of 0 means no swapping, which is good. An occasional value above 0 indicates mild pressure, but a value that stays above 0 continuously points to a memory problem.

    So when I ran:

    # vmstat 1 5

    Output (trimmed):

    r  b   swpd   free   si   so us sy id
    14 0      0 399400   0    0 34 29 35
    15 0      0 362864   0    0 87 12  0

    An r of 14-15 on a 2-core machine means far too many runnable processes, and an id of 0 means the CPU is fully saturated.

    After initial fixes, when I ran vmstat again, I saw the new numbers:

    r  b   swpd   free   si   so us sy id
    1  0  12120 2554084   0    0 34 29 35
    0  0  12120 2554084   0    0  0  1 99

    The first row of vmstat output reports averages since boot, so the second row is the live sample. Read that way, the system is now healthy:

    • r = 0–2 → healthy
    • id = 86–99% → CPU idle
    • si/so = 0 → no swapping
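
    These rules of thumb are simple enough to encode. The Python sketch below parses the trimmed vmstat output shown in this post and applies them; the labels and function names are my own, and the thresholds are the ones described above rather than anything official.

```python
def parse_vmstat(text: str) -> list:
    """Turn vmstat text into a list of {column: value} dicts, one per sample row."""
    rows, header = [], None
    for line in text.strip().splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r":          # the column-name line
            header = fields
        elif header and fields[0].isdigit():
            rows.append(dict(zip(header, map(int, fields))))
    return rows


def assess(sample: dict, cores: int) -> list:
    """Apply the r / id / si / so rules of thumb to one sample."""
    findings = []
    if sample["r"] > cores:
        findings.append("cpu contention")
    if sample["id"] == 0:
        findings.append("cpu saturated")
    if sample["si"] > 0 or sample["so"] > 0:
        findings.append("swapping")
    return findings or ["healthy"]


BEFORE = """r  b   swpd   free   si   so us sy id
14 0      0 399400   0    0 34 29 35
15 0      0 362864   0    0 87 12  0"""

AFTER = """r  b   swpd   free   si   so us sy id
1  0  12120 2554084   0    0 34 29 35
0  0  12120 2554084   0    0  0  1 99"""

# Index 1 is the second row, i.e. the live sample rather than the since-boot average.
print(assess(parse_vmstat(BEFORE)[1], cores=2))
print(assess(parse_vmstat(AFTER)[1], cores=2))
```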

    Three Root Causes

    This wasn’t a single issue. It was a stacked failure:

    1. Apache (mod_php) memory bloat

    • Each request = full Apache process
    • Each process ≈ 200MB+
    • Too many workers → RAM exhaustion

    2. Swap thrashing (kswapd0)

    • Memory filled up
    • Kernel started reclaiming memory
    • CPU burned by swap management

    3. Connection pressure (SYN_RECV flood)

    • 121 half-open connections
    • Apache workers are tied up waiting

    Solutions Applied

    1. SYN flood mitigation (UFW + kernel)

    I enabled:

    net.ipv4.tcp_syncookies=1

    And:

    ufw limit 80/tcp
    ufw limit 443/tcp

    2. Apache concurrency control

    Reduced workers:

    MaxRequestWorkers 6

    This helped stabilize the CPU, with no process pile-up.

    3. KeepAlive tuning

    KeepAlive On
    MaxKeepAliveRequests 50
    KeepAliveTimeout 2

    4. OPcache verification and tuning

    When PHP runs a script, it parses PHP code, compiles it into bytecode, and executes it. Without OPcache, this happens on every request.

    With OPcache enabled, compiled bytecode is stored in memory so that future requests can reuse it. Without OPcache, high CPU usage and slower response times are expected. With OPcache, 30-35% less CPU is used, and execution is faster.

    When I checked, I found that OPcache (opcache.enable) was already enabled in the php.ini.

    I improved it with more cache:

    opcache.memory_consumption=192
    opcache.interned_strings_buffer=16
    opcache.max_accelerated_files=20000

    Additional Changes I would like to make

    1. Replace mod_php with PHP-FPM

    I would like to replace mod_php with PHP-FPM. In mod_php, each Apache process embeds PHP, leading to high memory usage (~200 MB per worker). This results in poor scalability and a lack of separation of concerns.

    PHP-FPM, on the other hand, runs as a separate service and has lightweight workers (~20-40 MB), providing better process control and supporting pooling and scaling. This will result in lower memory usage, better CPU efficiency, and more predictable performance.

      2. Prefer Nginx Over Apache

      Now, this is not about nginx hype; it’s about an architectural choice. I have been using Apache for quite some time and love it. But the pre-fork model of Apache dedicates a whole process to each connection, which is memory-heavy and struggles under concurrency.

      Nginx, with its event-driven model, can handle thousands of connections with a few processes and non-blocking I/O, making it an ideal choice for modern web workloads.

      Finally

      What looked like a “CPU problem” turned out to be:

      • Memory exhaustion
      • Connection pressure
      • Poor process model

      Fixing it required layered thinking, not just tweaking one parameter.

      And the biggest lesson?

      One can tune one’s way out of trouble temporarily, but the real win comes from choosing the right architecture.

      So, now, if you’ve ever seen load averages that made no sense, this pattern might look familiar. And now you know exactly how to break it down.

  • When Pi-hole + Unbound Stop Resolving: A DNSSEC Trust Anchor Fix

      I have my own private DNS setup in my home network, powered by Pi-hole running on my very first Raspberry Pi, a humble Model B Rev 2. It’s been quietly handling ad-blocking and DNS resolution for years. But today, something broke.

      I noticed that none of my devices could resolve domain names. Pi-hole’s dashboard looked fine. The DNS service was running, blocking was active, but every query failed. Even direct dig queries returned SERVFAIL. Here’s how I diagnosed and resolved the issue.

      The Setup

      My Pi-hole forwards DNS queries to Unbound, a recursive DNS resolver running locally on port 5335. This is configured in /etc/pihole/setupVars.conf.

      PIHOLE_DNS_1=127.0.0.1#5335
      PIHOLE_DNS_2=127.0.0.1#5335

      And my system’s /etc/resolv.conf points to Pi-hole itself:

      nameserver 127.0.0.1

      Unbound is installed with the dns-root-data package, which provides root hints and DNSSEC trust anchors:

      $ dpkg -l dns-root-data|grep ^ii
      ii dns-root-data 2024041801~deb11u1 all DNS root hints and DNSSEC trust anchor

      The Symptoms

      Despite everything appearing normal, DNS resolution failed:

      $ dig google.com @127.0.0.1 -p 5335

      ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL

      Even root-level queries failed:

      $ dig . @127.0.0.1 -p 5335

      ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL

      Unbound was running and listening:

      $ netstat -tulpn | grep 5335

      tcp 0 0 127.0.0.1:5335 0.0.0.0:* LISTEN 29155/unbound

      And outbound connectivity was fine. I pinged one of the root DNS servers directly to ensure this:

      $ ping -c1 198.41.0.4 
      PING 198.41.0.4 (198.41.0.4) 56(84) bytes of data.
      64 bytes from 198.41.0.4: icmp_seq=1 ttl=51 time=206 ms

      --- 198.41.0.4 ping statistics ---
      1 packets transmitted, 1 received, 0% packet loss, time 0ms
      rtt min/avg/max/mdev = 205.615/205.615/205.615/0.000 ms

      The Diagnosis

      At this point, I suspected a DNSSEC validation failure. Unbound uses a trust anchor, which is simply a cryptographic key stored in root.key. This cryptographic key is used to verify the authenticity of DNS responses. Think of it like a passport authority: when you travel internationally, border agents trust your passport because it was issued by a recognized authority. Similarly, DNSSEC relies on a trusted key at the root of the DNS hierarchy to validate every response down the chain. If that key is missing, expired, or corrupted, Unbound can’t verify the authenticity of DNS data — and like a border agent rejecting an unverified passport, it simply refuses to answer, returning SERVFAIL.

      Even though dns-root-data was installed, the trust anchor wasn’t working.

      The Fix

      I regenerated the trust anchor manually:

      $ sudo rm /usr/share/dns/root.key
      $ sudo unbound-anchor -a /usr/share/dns/root.key
      $ sudo systemctl restart unbound

      After this, Unbound started resolving again:

      $ dig google.com @127.0.0.1 -p 5335

      ;; ->>HEADER<<- opcode: QUERY, status: NOERROR
      ;; ANSWER SECTION:
      google.com. 300 IN A 142.250.195.78

      Why This Happens

      Even with dns-root-data installed, the trust anchor can become stale, especially if the system missed a rollover event or the file was never initialized. Unbound doesn’t log this clearly, so it’s easy to miss.

      Preventing Future Failures

      To avoid this in the future, I added a weekly cron job to refresh the trust anchor:

      0 3 * * 0 /usr/sbin/unbound-anchor -a /usr/share/dns/root.key

      And a watchdog script to monitor Unbound health:

      $ dig . @127.0.0.1 -p 5335 | grep -q 'status: NOERROR' || systemctl restart unbound
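
      If the watchdog grows beyond a one-liner, the status check is the part worth isolating. Here is a minimal Python sketch of just that decision; the function names are hypothetical, and actually running dig and restarting Unbound (for example via subprocess) is deliberately left out.

```python
import re


def dig_status(dig_output: str):
    """Extract the response status (NOERROR, SERVFAIL, ...) from dig's header line."""
    match = re.search(r"status: ([A-Z]+)", dig_output)
    return match.group(1) if match else None


def needs_restart(dig_output: str) -> bool:
    """Anything other than NOERROR, including no output at all, counts as unhealthy."""
    return dig_status(dig_output) != "NOERROR"


print(needs_restart(";; ->>HEADER<<- opcode: QUERY, status: SERVFAIL"))
print(needs_restart(";; ->>HEADER<<- opcode: QUERY, status: NOERROR"))
```

      Treating empty output as unhealthy matters because a timed-out query prints no status line at all.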

      This was a good reminder that even quiet systems need occasional maintenance. Pi-hole and Unbound are powerful together, but DNSSEC adds complexity. If you’re running a similar setup, keep an eye on your trust anchors, and don’t trust the dashboard alone.

  • Why Systemd Timers Outshine Cron Jobs

      For decades, cron has been the trusty workhorse for scheduling tasks on Linux systems. Need to run a backup script daily? cron was your go-to. But as modern systems evolve and demand more robust, flexible, and integrated solutions, systemd timers have emerged as a superior alternative. Let’s roll up our sleeves and dive into the strategic advantages of systemd timers, then walk through their design and implementation.

      Why Ditch Cron? The Strategic Imperative

      While cron is simple and widely understood, it comes with several inherent limitations that can become problematic in complex or production environments:

      • Limited Visibility and Logging: cron offers basic logging (often just mail notifications) and lacks a centralized way to check job status or output. Debugging failures can be a nightmare.
      • No Dependency Management: cron jobs are isolated. There’s no built-in way to ensure one task runs only after another has successfully completed, leading to potential race conditions or incomplete operations.
      • Missed Executions on Downtime: If a system is off during a scheduled cron run, that execution is simply missed. This is critical for tasks like backups or data synchronization.
      • Environment Inconsistencies: cron jobs run in a minimal environment, often leading to issues with PATH variables or other environmental dependencies that work fine when run manually.
      • No Event-Based Triggering: cron is purely time-based. It cannot react to system events like network availability, disk mounts, or the completion of other services.
      • Concurrency Issues: cron doesn’t inherently prevent multiple instances of the same job from running concurrently, which can lead to resource contention or data corruption.

      systemd timers, on the other hand, address these limitations by leveraging the full power of the systemd init system. (We’ll dive deeper into the intricacies of the systemd init system itself in a future post!)

      • Integrated Logging with Journalctl: All output and status information from systemd timer-triggered services are meticulously logged in the systemd journal, making debugging and monitoring significantly easier (journalctl -u your-service.service).
      • Robust Dependency Management: systemd allows you to define intricate dependencies between services. A timer can trigger a service that requires another service to be active, ensuring proper execution order.
      • Persistent Timers (Missed Job Handling): With the Persistent=true option, systemd timers will execute a missed job immediately upon system boot, ensuring critical tasks are never truly skipped.
      • Consistent Execution Environment: systemd services run in a well-defined environment, reducing surprises due to differing PATH or other variables. You can explicitly set environment variables within the service unit.
      • Flexible Triggering Mechanisms: Beyond simple calendar-based schedules (like cron), systemd timers support monotonic timers (e.g., “5 minutes after boot”) and can be combined with other systemd unit types for event-driven automation.
      • Concurrency Control: systemd inherently manages service states, preventing multiple instances of the same service from running simultaneously unless explicitly configured to do so.
      • Granular Control: Timers can be scheduled down to the second (lower AccuracySec from its 1-minute default, for example AccuracySec=1s), allowing for much more precise control than cron’s minute-level resolution.
      • Randomized Delays: RandomizedDelaySec can be used to prevent “thundering herd” issues where many timers configured for the same time might all fire simultaneously, potentially overwhelming the system.

      Designing Your Systemd Timers: A Two-Part Harmony

      systemd timers operate in a symbiotic relationship with systemd service units. You typically create two files for each scheduled task:

      1. A Service Unit (.service file): This defines what you want to run (e.g., a script, a command).
      2. A Timer Unit (.timer file): This defines when you want the service to run.

      Both files are usually placed in /etc/systemd/system/ for system-wide timers or ~/.config/systemd/user/ for user-specific timers.

      The Service Unit (your-task.service)

      This file is a standard systemd service unit. A basic example:

      [Unit]
      Description=My Daily Backup Service
      # Optional: wait for the network. Wants= pulls the target in; After= provides the ordering.
      Wants=network-online.target
      After=network-online.target

      [Service]
      # oneshot suits scripts that run and exit
      Type=oneshot
      # The script to execute (absolute path)
      ExecStart=/usr/local/bin/backup-script.sh
      # Run as a specific user and group (optional, but good practice)
      User=youruser
      Group=yourgroup
      # Example: set a custom PATH
      # Environment="PATH=/usr/local/bin:/usr/bin:/bin"

      [Install]
      # Not needed when only the timer starts this service; enabling the service itself would also run it at boot
      WantedBy=multi-user.target
      

      Strategic Design Considerations for Service Units:

      • Type=oneshot: Ideal for scripts that perform a task and then exit.
      • ExecStart: Always use absolute paths for your scripts and commands to avoid environment-related issues.
      • User and Group: Run services with the least necessary privileges. This enhances security.
      • Dependencies (Wants, Requires, After, Before): Leverage systemd’s powerful dependency management. For example, Wants=network-online.target combined with After=network-online.target ensures the network is up before the service starts; Wants= alone only pulls the target in without ordering.
      • Error Handling within Script: While systemd provides good logging, your scripts should still include robust error handling and exit with non-zero status codes on failure.
      • Output: Direct script output to stdout or stderr. journald will capture it automatically. Avoid sending emails directly from the script unless absolutely necessary; systemd’s logging is usually sufficient.

      The Timer Unit (your-task.timer)

      This file defines the schedule for your service.

      [Unit]
      Description=Timer for My Daily Backup Service

      [Timer]
      # Service to activate (defaults to the unit with the same name as the timer)
      Unit=your-task.service
      # Run every day at midnight (the meaning of 'daily')
      OnCalendar=daily
      # Run every day at 3 AM:
      # OnCalendar=*-*-* 03:00:00
      # Run weekdays at 6 PM:
      # OnCalendar=Mon..Fri 18:00:00
      # Run 5 minutes after boot:
      # OnBootSec=5min
      # If the system was off at the scheduled time, run at the next boot
      Persistent=true
      # Add up to 5 minutes of random delay to prevent stampedes
      RandomizedDelaySec=300

      [Install]
      # Essential for the timer to be enabled at boot
      WantedBy=timers.target
      

      Strategic Design Considerations for Timer Units:

      • OnCalendar: This is your primary scheduling mechanism. systemd offers a highly flexible calendar syntax (refer to man systemd.time for full details). Use systemd-analyze calendar "your-schedule" to test your expressions.
      • OnBootSec: Useful for tasks that need to run a certain duration after the system starts, regardless of the calendar date.
      • Persistent=true: Crucial for reliability! This ensures your task runs even if the system was powered off during its scheduled execution time. The task will execute once systemd comes back online.
      • RandomizedDelaySec: A best practice for production systems, especially if you have many timers. This spreads out the execution of jobs that might otherwise all start at the exact same moment.
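      • AccuracySec: Defaults to 1 minute, meaning the timer may fire up to a minute late. Lower it when you need tighter precision (1s is usually sufficient; 1us is the finest setting).
      • Unit: This explicitly links the timer to its corresponding service unit. If omitted, the timer activates the service with the same name (your-task.timer starts your-task.service).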
      • WantedBy=timers.target: This ensures your timer is enabled and started automatically when the system boots.
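
      To build intuition for Persistent=true, here is a toy Python model of the catch-up decision for an OnCalendar=daily timer. It is a deliberate simplification and not how systemd is implemented; the function names are made up, and real timers handle arbitrary calendar expressions and keep the last trigger time on disk.

```python
from datetime import datetime, timedelta


def next_daily_elapse(after: datetime) -> datetime:
    """Next midnight strictly after the given moment (the 'daily' schedule)."""
    midnight = after.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(days=1)


def should_run_at_activation(last_trigger: datetime, now: datetime) -> bool:
    """Toy catch-up rule: run immediately if a scheduled elapse was missed
    between the last trigger and the moment the timer is activated."""
    return next_daily_elapse(last_trigger) <= now


# Machine was off over the night of Jan 2: the run is caught up at boot.
print(should_run_at_activation(datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 3, 8, 0)))
# Timer already fired at midnight today: nothing was missed.
print(should_run_at_activation(datetime(2024, 1, 3, 0, 0), datetime(2024, 1, 3, 8, 0)))
```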

      Implementation and Management

      1. Create the files: Place your .service and .timer files in /etc/systemd/system/.
      2. Reload systemd daemon: After creating or modifying unit files: sudo systemctl daemon-reload
      3. Enable the timer: This creates a symlink so the timer starts at boot: sudo systemctl enable your-task.timer
      4. Start the timer: This activates the timer for the current session: sudo systemctl start your-task.timer
      5. Check status: sudo systemctl status your-task.timer; sudo systemctl status your-task.service
      6. View logs: journalctl -u your-task.service
      7. Manually trigger the service (for testing): sudo systemctl start your-task.service

      Conclusion

      While cron served its purpose admirably for many years, systemd timers offer a modern, robust, and integrated solution for scheduling tasks on Linux systems. By embracing systemd timers, you gain superior logging, dependency management, missed-job handling, and greater flexibility, leading to more reliable and maintainable automation. It’s a strategic upgrade that pays dividends in system stability and ease of troubleshooting. Make the switch and experience the power of a truly systemd-native approach to scheduled tasks.

  • Upgrading Raspbian 8 (Jessie) to Raspbian 9 (Stretch)

      I decided to upgrade my oldest Raspberry Pi to the latest Raspbian. Since I was two releases behind, I decided to do it step by step. Today I upgraded from 8 to 9. I plan to follow similar steps to upgrade from 9 to 10.

      The following is the quick sequence of steps I followed to perform the upgrade. This is a Model B Rev 2 Pi, so it was considerably slow to update and took more than 4 hours to complete.

      Step 1 – Prepare The System For Upgrade

      Apply the latest updates to the system.

      $ sudo apt update && sudo apt upgrade -y && sudo apt-get dist-upgrade -y

      The next step is to search for packages that have been only partially installed on the system, using the dpkg -C command.

      $ sudo dpkg -C

      dpkg may indicate what needs to be done with these. I did not find anything in this category, which was good. Lastly, I ran the apt-mark showhold command to get a list of all packages that have been marked as held.

      $ sudo apt-mark showhold

      I did not get any packages in this list, but if there are any, you are expected to resolve them before proceeding to step 2.

      Step 2 – Prepare the APT System for Upgrade

      $ sudo sed -i 's/jessie/stretch/g' /etc/apt/sources.list
      $ sudo sed -i 's/jessie/stretch/g' /etc/apt/sources.list.d/raspi.list
      $ echo 'deb http://archive.raspberrypi.org/debian/ stretch main' | sudo tee -a /etc/apt/sources.list

      I updated only these two files, but if your system has any other source files, you need to update them as well. A list of such files can be found using grep -lnr jessie /etc/apt
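
      Before running sed -i on a live system, it can help to preview what the substitution would change. The following is a small Python sketch of such a dry run; the function name and the sample file content are hypothetical, and it mirrors the plain text replacement that sed performs, comments included.

```python
def preview_release_switch(sources_text: str, old: str = "jessie", new: str = "stretch") -> list:
    """Return (line number, before, after) for every line the substitution would touch."""
    changes = []
    for number, line in enumerate(sources_text.splitlines(), start=1):
        if old in line:
            changes.append((number, line, line.replace(old, new)))
    return changes


# Hypothetical content of a sources file.
sample = "deb http://archive.raspberrypi.org/debian/ jessie main\n# local notes\n"

for number, before, after in preview_release_switch(sample):
    print(f"{number}: {before}  ->  {after}")
```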

      In addition, I removed the package apt-listchanges, which displays what changed in the new version of a Debian package compared to the version currently installed on the system. Removing it is expected to speed up the entire process. This is not mandatory, so you can skip it.

      # optional step
      $ sudo apt-get remove apt-listchanges

      Step 3 – Perform The Upgrade and Cleanup

      As the last step, initiate the upgrade process. This is when you can leave the system alone for a few hours.

      $ sudo apt update && sudo apt upgrade -y && sudo apt-get dist-upgrade -y

      I faced issues with chromium-browser: at the last command (dist-upgrade), dpkg bailed out with a message indicating archive corruption of the chromium-browser package. Since I am at runlevel 3 and do not need Chromium on the headless Pi, I decided to remove the following three packages. In any case, in the absence of Chromium, the Debian system will automatically use update-alternatives and choose epiphany-browser to satisfy the gnome-www-browser requirement.

      $ sudo apt-get remove chromium-browser chromium-browser-l10n rpi-chromium-mods

      After removing the Chromium browser, I did another round of update, upgrade and dist-upgrade, just to make sure, before initiating the cleanup as below:

      $ sudo apt-get autoremove -y && sudo apt-get autoclean

      The new OS version can be verified with:

      $ cat /etc/debian_version;cat /etc/os-release
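
      If you want to check the result from a script rather than by eye, /etc/os-release is a simple KEY=VALUE file. Here is a minimal Python sketch of a parser; the sample content is illustrative rather than copied from my Pi, and the function name is made up.

```python
def parse_os_release(text: str) -> dict:
    """Parse os-release style KEY=VALUE lines into a dict, dropping surrounding quotes."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"').strip("'")
    return info


# Illustrative sample of what the file may contain after the upgrade.
sample = 'PRETTY_NAME="Raspbian GNU/Linux 9 (stretch)"\nVERSION_ID="9"\nVERSION_CODENAME=stretch\nID=raspbian\n'

info = parse_os_release(sample)
print(info["VERSION_ID"], info["VERSION_CODENAME"])
```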

      I also took this opportunity to update the firmware of the Raspberry Pi by running the following command. Please note that this step is entirely optional, and it is recommended that you do not perform it unless you know what you are doing or have been asked to by a support person.

      $ sudo rpi-update