Linux TV and STB Performance Optimisation: A Practical Guide for RAM-Constrained Platforms

The investigative and optimisation approaches that reliably recover performance from RAM-constrained Linux platforms.

Media

Will Tallentire, Tech Lead

Written by

Will Tallentire

Tech Lead

Every operator managing a large deployed Linux TV or STB estate eventually faces the same conversation. Performance has degraded. Customers are noticing. The typical suggestion is a hardware refresh. But the devices are fundamentally still capable, the current supply chain is unpredictable and expensive, and the numbers are difficult to justify. The assumption that sluggish performance means end-of-life hardware is one of the most common and most expensive conclusions operators reach too early.

In our experience, that assumption is rarely correct, and the cost of getting it wrong is high. Entertainment platforms operate under uniquely demanding conditions. Consumers may not understand memory pressure, process scheduling or resource contention, but they immediately notice when a device feels slow. A memory leak that goes unnoticed on a desktop system becomes highly visible when a set-top box remains powered on continuously for months. Customers expect applications to launch instantly, channel changes to feel immediate, and video playback to remain smooth regardless of how long a device has been running. That tolerance for poor performance is essentially zero.

Consult Red has spent over two decades working with leading global video service providers on Linux-based TV and set-top box platforms. That work spans new device bring-up, SoC integration, platform modernisation, middleware implementation and long-term operational support. The deployments vary enormously, but the underlying challenges are remarkably consistent.

If your fleet runs on Android TV rather than Linux, our Android TV performance optimisation guide covers the equivalent territory for that platform. And across both Android and Linux platforms, the golden window, that period when hardware is fully amortised yet still capable of delivering a good customer experience, is almost always wider than it first appears.

This article covers the investigative and optimisation approaches that reliably recover performance from constrained Linux platforms, and the order in which to apply them.

How Platforms Become Constrained

Understanding why a Linux TV platform degrades over time is the starting point for recovering it.

Memory becomes constrained incrementally. Middleware expands. OTT applications consume more resources.

Browser-based user interfaces, increasingly delivered via Chromium and WebKit-based frameworks, are growing larger with each release.

Security updates add overhead. Third-party integrations accumulate. Telemetry systems expand.

Over time, the software footprint grows while the available hardware remains unchanged.

What makes this particularly difficult to diagnose is that performance degradation on Linux TV platforms rarely has a single cause.

It is almost always the cumulative effect of years of incremental change across middleware, applications, telemetry systems and platform services. Optimisation succeeds when the platform is treated as a whole system rather than a collection of individual applications.

Start With the Right Questions, Not the Obvious Symptoms

One of the most common mistakes engineering teams make is responding to visible symptoms before addressing the fundamentals.

A slow application launch, an occasional pause in responsiveness, or creeping memory consumption can all attract immediate attention. But these symptoms rarely exist in isolation, and chasing them individually, adjusting a kernel parameter here, disabling a service there, frequently moves the problem rather than solving it.

Before reaching for complex optimisation strategies, three questions are worth asking about every process on the platform:

  • What is running? On Linux TV platforms, this typically means examining systemd services, middleware daemons, telemetry agents, browser processes, DRM components and content delivery services that have accumulated across multiple software generations.
  • Does it need to be running? Removing redundant legacy services and determining whether processes should launch on demand rather than at boot can recover meaningful resources with relatively low risk.
  • What value is it providing? The quickest performance gains consistently come from reducing unnecessary workload rather than optimising necessary workload.

This last point is worth dwelling on. It is common to find software components introduced years earlier to support a specific content partner, broadcast feature, or legacy user experience requirement. Long after their original purpose has faded, they continue consuming memory and CPU on every deployed device. Services that once seemed insignificant become permanent consumers of resources. Data pipelines perform work that is no longer required.

Improving a legacy software stack is not glamorous work, but it consistently delivers results, and reclaiming memory from inefficient software is significantly cheaper than purchasing more of it.

Measure Before Changing Anything

Optimisation without measurement is guesswork. Every Linux platform behaves differently. Different workloads, memory footprints, kernel configurations and software architectures create unique performance characteristics, and a change that helps one deployment can create new problems in another.

The only reliable approach is to establish a baseline first.

Investigation typically begins with standard Linux performance tools – vmstat, pidstat, smem, slabtop, perf and PSI (Pressure Stall Information) metrics. These help determine whether performance limitations originate from CPU contention, memory pressure, kernel allocations or I/O bottlenecks before any optimisation work begins.

Understanding the source of the problem is invariably more valuable than applying tuning parameters based on instinct. Without a clear baseline, it is impossible to know whether a change genuinely improved the system or simply altered where the problem manifests.

Sysmonitor - for Linux platform optimisation
Sysmonitor – process monitoring used for Linux platform optimisation

“Random Slowness” Is Usually a Memory Problem

One recurring complaint on RAM-constrained devices is that performance appears inconsistent. Users report the device feeling sluggish for a few seconds before suddenly becoming responsive again. This pattern is easy to misattribute to CPU limitations. In practice, it is most often caused by memory pressure.

On some of the most constrained Linux platforms we have worked with, executable code and frequently used libraries were being pushed out of memory during long periods of inactivity. When users picked up the remote and began interacting with the device again, those pages had to be reloaded from flash storage before execution could resume – a noticeable pause at precisely the moment responsiveness mattered most.

One particularly instructive example involved a platform that shipped with no swap configured. Over time, memory consumed by logging, telemetry buffers and small application leaks steadily reduced available free memory. Under pressure, the kernel was forced to reclaim executable code pages and shared libraries that would shortly be needed again.

Adding a relatively modest USB storage device and enabling swap transformed the platform’s behaviour. Rather than repeatedly discarding and reloading executable code, the kernel could move less performance-sensitive memory to secondary storage while keeping frequently executed code resident in RAM. The additional storage did not increase CPU performance or physical memory capacity, yet the improvement in perceived responsiveness was substantial. Application launches became more consistent, and the periodic pauses customers had come to accept simply stopped occurring.

Similarly, zram, which compresses memory pages in RAM to increase the effective amount of available memory and reduce reclaim activity, can deliver meaningful improvements on memory-constrained systems and is typically introduced with relatively minor platform changes.

We have also seen success with proactively loading critical code paths into memory on a scheduled basis, reducing page faults during periods immediately following inactivity.

The lesson is straightforward: apparent CPU performance problems are often memory management problems in disguise.

Kernel Tuning: Powerful, but Not a Starting Point

Linux offers sophisticated mechanisms for controlling memory management, scheduling, caching behaviour and resource allocation. Used correctly, these capabilities can deliver real improvements. Used incorrectly, or applied to a poorly understood system, they create new problems without resolving underlying ones.

This is why tuning should follow investigation, not replace it.

A well-understood system can benefit from carefully targeted kernel changes. A poorly understood system simply becomes a differently configured poorly understood system.

Depending on the platform, relevant opportunities may include tuning virtual memory behaviour, adjusting reclaim policies, analysing memory fragmentation, reviewing page-cache utilisation, introducing zram, evaluating swap strategies, or using cgroups to isolate resource-intensive workloads.

Every one of these interventions introduces trade-offs. Experience matters because identifying which trade-offs are acceptable for a specific platform and workload requires having navigated them before.

Finding the Leaks Nobody Else Sees

Some of the most expensive performance problems develop slowly and invisibly.

A memory leak consuming only a few kilobytes per hour is unlikely to appear during standard testing. It may take weeks or months of continuous operation before its impact becomes visible, by which point the device is already struggling.

Detecting these issues requires long-term analysis rather than occasional snapshots. One of the most effective techniques is the periodic collection of process memory statistics. This quickly identifies which process is growing unexpectedly and enables engineers to correlate memory growth with operational behaviour.

Identifying the responsible process is usually straightforward. Understanding why it is leaking is considerably harder.

Native memory leaks are notoriously difficult to diagnose in production environments. Core files may be unavailable, debugging access may be restricted, and diagnostic tools themselves often introduce enough overhead to render already-constrained devices unusable. Development tools such as Valgrind, AddressSanitizer (ASAN), LeakSanitizer (LSAN) and heaptrack are effective during integration and development, but on production STBs, they frequently alter performance characteristics enough to complicate root cause analysis.

In practice, many leak investigations combine telemetry analysis, log review, and code inspection. Once the offending process is identified, engineers work backwards through allocation paths and execution flows, looking for resources that are not being released correctly. This is particularly familiar territory in long-running C and C++ applications, where leaks can originate from ownership ambiguities, reference-counting errors, container growth, caching strategies, or exception-handling paths that rarely surface during testing.

Managed environments, such as Java, offer richer diagnostic capabilities. Tools including Eclipse Memory Analyzer (MAT), Java Flight Recorder (JFR), JVisualVM and heap dump analysis can significantly accelerate leak identification. Experience consistently shows that architectural decisions around object ownership, lifecycle management and reference handling have a greater long-term impact on stability than any individual leak fix.

Whether the software is written in C, C++ or Java, many issues initially reported as general system instability ultimately prove to be memory leaks quietly accumulating in the background. Addressing them delivers disproportionately large improvements in responsiveness, reliability and device longevity.

Image of multiple DDR4 and DDR5 RAM boards
Find the memory leaks nobody else sees

Monitoring Is Not Optional

Performance work is not a one-time exercise. Systems change. Applications evolve. New functionality is introduced. Memory consumption grows gradually over months and years. Without visibility into how the platform behaves over time, those changes remain hidden until customers start reporting problems.

In many deployments, we instrument key points throughout the software stack and collect operational telemetry for offline analysis. Examining trends in swap activity, CPU utilisation, memory pressure, page reclaim behaviour, process memory growth and disk I/O routinely reveals issues that are invisible during short-term testing.

The objective is rarely identifying a single failure. More often, it is understanding how the platform behaves across weeks or months of continuous operation. Gradually increasing swap utilisation, rising memory pressure or steadily growing process footprints consistently signal developing issues well before they become visible to customers.

Even relatively simple telemetry can be valuable. Periodic snapshots of process memory usage can quickly identify which process is responsible for gradual growth. Analysing historical logs reveals patterns in resource utilisation that laboratory testing will never surface.

Modern Linux platforms offer increasingly sophisticated observability through PSI metrics, eBPF-based tooling, Prometheus telemetry collection and Grafana visualisation — enabling engineering teams to identify resource contention before customers experience degraded performance. We regularly build bespoke observability solutions that combine process-level metrics, kernel telemetry, application instrumentation and operational analytics to track memory utilisation trends, resource contention, application behaviour, system stability and capacity risks over time.

The goal is not simply to observe problems. It is to detect them before they become operational issues.

Where to Start: A Structured Approach

Two decades of Linux TV and STB platform work point consistently to the same investigative priorities. The sequence matters. Teams that skip straight to kernel tuning or hardware decisions consistently get worse outcomes than those who work through the fundamentals first.

Throughout this six-step process, the underlying question is always the same: have we fully understood the platform we already have?

 

Begin with evidence, not assumptions

Baseline the system using vmstat, pidstat, smem and PSI metrics before touching anything. Understand where memory is going and what is consuming CPU before drawing any conclusions.

Audit everything that is running

Catalogue systemd services, middleware daemons, telemetry agents and legacy processes. For each one, apply the three questions from earlier in this article. The audit alone frequently surfaces candidates for removal or deferral that deliver immediate gains.

Investigate apparent performance problems for memory root causes

The pattern of periodic slowness and inconsistent responsiveness almost always leads here. Evaluate swap and zram early – they are often the highest-leverage, lowest-disruption intervention available on constrained hardware.

Instrument for long-term telemetry before declaring the platform stable

Process memory growth, swap trends, and reclaim activity reveal what short-term testing consistently misses. An optimised platform without monitoring will degrade again undetected.

Hunt for leaks methodically once the broader picture is understood

Periodic memory snapshots identify the culprit process; code inspection and allocation path analysis find the cause.

Apply kernel tuning last, with trade-offs understood in advance

Targeted changes to virtual memory behaviour, reclaim policies, and cgroup configuration can deliver meaningful gains on a well-understood system. On a poorly understood one, they simply defer the problem.

The Hardware Is Rarely the Problem

Devices that appear to be running out of road often have more life in them than the symptoms suggest – and the economics of finding out are compelling.

With many operators managing hundreds of thousands or millions of deployed devices, even modest improvements in memory utilisation can delay hardware refresh programmes and deliver substantial cost savings. The economics of optimisation are consistently more attractive than the economics of replacement. Structured optimisation and performance investigation, software modernisation, kernel expertise, memory analysis and long-term monitoring can unlock significantly more value from existing hardware than most teams expect. When done well, it removes engineering firefighting, stabilises the platform and gives product teams the headroom to focus on what customers actually value.

There are situations where additional RAM is the correct answer. There are situations where hardware replacement is unavoidable. But organisations frequently reach those conclusions before exhausting what structured investigation and optimisation can achieve. The most successful projects we have worked on started with evidence, not assumptions. Before investing in larger memory configurations or committing to new hardware, it is worth asking the question that too rarely gets asked first: “Have we fully understood the device we already have?”

The hardware may not be the problem. And if it isn’t, buying more of it will not solve the underlying issue.

 

If you are facing RAM pressure on your Linux TV/STB device fleet and want an honest assessment of what is achievable before committing to a replacement programme, speak to our team.

Consult Red has supported Linux TV and STB platform optimisation for some of the world’s largest video service providers, across estates running into millions of deployed devices. If performance degradation is pushing you towards a hardware refresh conversation, talk to us before you commit.

Some images on this page are AI-generated and used for illustrative purposes.