How can we reliably update embedded Linux?
The Mirai botnet attack that enslaved poorly secured connected embedded devices is yet another tangible example of the importance of security before bringing your embedded devices online. A new strain of Mirai has caused network outages to about a million Deutsche Telekom customers due to poorly secured routers. Many of these embedded devices run a variant of embedded Linux; typically, the distribution size is around 16MB today.
Unfortunately, the Linux kernel, although very widely used, is far from immune to critical security vulnerabilities as well. In fact, in a presentation at Linux Security Summit 2016, Kees Cook highlighted two examples of critical security vulnerabilities in the Linux kernel: one being present in kernel versions from 2.6.1 all the way to 3.15, the other from 3.4 to 3.14. He also showed that a myriad of high severity vulnerabilities are continuously being found and addressed—more than 30 in his data set.
Although the processes and practices of development teams clearly have a critical impact on the (in)security of software in embedded products, there is a clear correlation between the size of the software project's code base and the number of bugs and vulnerabilities as well. Steve McConnell in Code Complete states there are 1–25 bugs and vulnerabilities per 1,000 lines of code, where the variable is determined by the practices of the team. Military-certified products typically end up in the lower end due to more rigid security practices and quality testing, while consumer electronics are unfortunately in the higher end of the scale due to the greater focus on features and time to market.
Seasoned software developers always seek to reduce the size of the code base through refactoring and the reuse of functionality in libraries, but with the never-ending demand for more features and intelligence in every product, it is clear that the amount of software in embedded devices will only grow. This also necessarily implies that there will be more bugs and vulnerabilities as well.
From this, it should be clear that having a way to deploy bug and security fixes, as well as new features, remotely is an obvious requirement for embedded devices with some level of intelligence, especially if they are network-connected.
This article goes further into specific requirements and solution designs on deploying software updates to embedded devices, in particular, embedded Linux.
As part of the work on Mender, we are conducting interviews with many software engineers and teams in order to better understand how software updates are being done in embedded products today. In one series of these interviews, we spoke with 30 embedded software engineers and the main takeaways are described below.
In the first question, we simply asked if software updates are being deployed to their embedded products today and, if so, which tools were used (Figure 2).
45.5% of the respondents said that updates were never being deployed to their products. Their only way to get new software into customers' hands was to manufacture hardware with the new software and ship the hardware to the customers.
Roughly the other half, 54.5%, said that they did have a way to update their embedded products, but the method was built in-house. This also includes local updates, where a technician would need to go to a device physically and update the software from external media, such as a USB stick. Yet another category was devices enabled for remote updates, but where you could update only one at the time, essentially precluding any mass updates. Finally, some had the capability to deploy mass updates to all devices in a highly automated fashion.
One of the key findings here was that virtually nobody reused a third-party software updater—they all re-invented the wheel!
Second, we asked what the preferred method of deploying software updates is (Figure 3). You can broadly classify embedded updaters into image- or package-based. Image-based updaters will work on the block level and can replace an entire partition or storage device. Package-based updaters work on the file level, and some of them, like RPM/YUM, are well known in Linux desktop and server environments as well.
Image-based updaters have, in general, a clear preference in the embedded space. The main reason for this is that they typically provide atomicity during the update process. Atomicity means that 1) an update is always either fully applied or not at all, and 2) no other component except the software updater can ever see a partial update. This property is very important for embedded updaters, because embedded devices could lose power or be rebooted at any time, and losing power in the middle of the update process should not lead to the device becoming bricked or unusable. The other stated key advantage of image-based updaters was consistency across devices, meaning that you can more confidently rely on behavior in test environments being the same as in production because of the 1:1 software copy.
Package-based approaches generally suffer from not being able to implement atomic updates, but they have some advantages as well. The installation time of an update is shorter, and the amount of bandwidth used also can be smaller than for image-based updates. Finally, since many develop their homegrown updater and already have packages from their build system, package-based update systems are generally faster to develop from scratch.
People familiar with Linux desktop and server systems might ask why we are not just using the same tools and processes that we know from these systems, including package managers (such as rpm, dpkg), VMs and containers to carry out software updates. To understand this, it is important to see in which aspects an embedded device is different with regards to applying software updates.
I already touched on this property of an embedded system, and this is a widely known issue: an embedded device can, in general, lose power at any time. For example, a smart portable audio system can be unplugged as it is moved around in the house. The battery of a portable GPS could run out or become unreliable in certain weather conditions. In-vehicle-infotainment systems in cars can lose power intermittently as the car is started or stopped. This issue is amplified in battery-powered devices, as the update process itself can consume significant battery and cause the power to run out.
Embedded systems must be designed in such a way that they will tolerate power failure at any given time. The lack of power reliability is in strong contrast to typical data-center environments, where multiple redundant power systems will ensure that power is never lost.
Embedded devices typically are connected using some kind of wireless technology. Although Wi-Fi is used in some devices, it is more common to use wireless standards that have longer range but lower data rates, for example 3G, LoRa, Sigfox and protocols based on IEEE 802.15.4 (low-rate wireless personal area networks).
It is tempting to assume that high-speed wireless networks will be generally adopted by embedded devices as technology evolves, just like what happened with smartphones where you can now stream YouTube videos in high resolution. However, keep in mind that the use cases for smartphones and typical embedded devices always will be very different. For example, an agricultural device that measures and optimizes crop yield needs a high amount of connectivity and should work even in places where there is no 3G coverage. In addition, the amount of data that needs to be sent is very low—perhaps just a few data points per day on the temperature and moisture measurements of the earth. So one should rather assume that embedded devices, especially industrial ones, always will have a limited network data rate.
In addition, wireless networks have frequent and intermittent connectivity loss—for example, when the device is moved to an area with low coverage, like underground.
Although low data rate and intermittent connectivity can be difficult design issues, they usually are easy to identify once something is not right.
Security issues over public wireless networks are much more implicit and difficult to expose. In the context of software updates, there are countless examples of homegrown updaters that do not properly authenticate the update, allowing an attacker to inject malicious code while the update is taking place.
Expensive Physical Access
Once a large-scale issue that cannot be fixed remotely occurs, the cost of remediating it is typically very high. The reason is that embedded devices are typically widely distributed geographically.
For example, a manufacturer of smart energy grid devices can install these devices in thousands of homes in several countries. If there is a critical issue with an update to the Linux kernel that cannot be fixed remotely, the cost of either sending a service technician to all those homes or asking customers to send devices back to the vendor can be prohibitive.
The 2015 Fiat-Chrysler Jeep Cherokee security breach offers a recent real-world example of wide-scale recalls. In this case, 1.4 million cars were recalled. The cost of repairing this issue was certainly in the hundred-million dollar range, perhaps even billions.
Five-to-Ten-Year Device Lifetime
Technology moves very fast, and it's typical to replace common consumer electronics devices like smartphones and laptops every two to three years.
However, more expensive consumer devices like high-end audio systems and TVs are replaced less frequently. Industrial devices that do not directly interact with humans typically have even longer lifetimes. For example, robots used on factory floors or energy grid devices easily can reach a ten-year lifetime.
In conclusion, in the embedded environment, people need to be very wary of the risk of “bricking” devices. Not only can this easily happen due to the power and network properties, but it is also a very expensive situation from which to recover.
Now that you are more familiar with the embedded environment, let's consider the implications this has for an embedded software updater.
As you've seen, both the power and the network can be very unreliable and insecure in an embedded environment. An embedded updater must have a couple properties and features in order to tackle these challenges sufficiently.
The database industry is very familiar with the concept of atomic transactions—the “A” in ACID, where a set of operations either all complete or none of them complete. The classic example for the need for this requirement in database theory is with online transactions. When one user transfers money to another, deducting money from one account should occur only if you also successfully add the money to the other account.
This same property is very important for embedded updaters, in order to handle intermittent update errors like sudden power loss. For an embedded updater, the atomic property can be defined in two parts:
An update is always either completed fully or not at all.
No software component (besides the updater) ever sees a partially installed update.
You can see that common ways of deploying software updates in the desktop environment do not meet this atomicity requirement. For example, while you are installing an rpm package, many files are written and modified across the filesystem, and they would be in an inconsistent and potentially non-recoverable state if you suddenly unplugged your desktop during the installation—the application being updated probably wouldn't start at all.
An important approach to mitigate the risks of bricking devices is to test new software updates extensively before releasing them into production. However, in order to rely on test results, you need a test environment that is as identical as possible to the production environment. It is a classic problem in general operations, be it for embedded devices or data centers, that the test environment diverges from production, so that changes work well in the test environment but cause significant downtime when released to production. This is one of the reasons why full-image updates are so prevalent in the embedded space. If your entire root filesystem is the same, block by block, in the test and production environments, then there are guaranteed similarities. Contrast this to a deployment using rpm packages, which may depend on libraries that have different versions, or patches, on the test and production environments, and maybe even across the production environments as well. Over time, such a design typically will lead to production deployments that fail for reasons that are inconsistent and hard to diagnose.
Authenticity Checks before Updates
From a security perspective, it is very important to know whether software comes from an authorized source or whether an attacker could have injected malicious software into the update. There have been countless cases where embedded devices are simply broadcasting their desire to install an update, and anyone who responds would be able to inject the software of their choosing into the device.
A basic approach to ensuring a level of authenticity is to leverage in-transit security protocols like TLS. If done correctly, this will ensure that the update cannot be modified while in transit from an update server to the device.
However, a more robust end-to-end approach is to embed cryptographic authenticity metadata as part of the update itself. Typically a form of code signing is employed, where digital signatures are created by an authority and verified at the device.
One of the key advantages of code signing over solely relying on in-transit security is that the authority that signs the update can be decoupled from the server that hosts it. For example, someone in the QA department could sign an update offline. This reduces the attack surface in cases where the update server gets compromised, because an attacker can still deploy only updates that have been signed by the QA department.
For performance-sensitive devices, cryptographic mechanisms like Message Authentication Code (MAC) or Elliptic Curve signatures should be considered, as they provide much more efficient verification than RSA or DSA at the same level of security.
Sanity Checks after Updates
Embedded devices are typically single-purpose and run only one main application, although in some cases, they could run several. In either instance, it's important to check the health of such applications after deploying an update. Are they running? Do they have network access? Can the user interact with them successfully on the device?
A software update should not be considered successful just because the device boots; there should be a way to integrate custom application sanity checks as well. Finally, a critical check that should be covered by the updater generically is this: Is it possible to deploy another update?
If any of these checks fail, the updater should have the capability to roll back to the previous known-working software, so that downtime is avoided while the issue is being diagnosed and resolved.
The general workflow for deploying software updates is shown in Figure 4.
Integration with Existing Development Workflow
If you are one person starting from scratch with an embedded/IoT project, you likely can choose all the tools and processes you like the best. However, once several people are collaborating on the same project, and in particular, if there is a product already being developed before software updates were taken into account, it is very important that the software update process integrates well with the development workflow.
At first glance, this may look like a strange criteria for an updater, but many approaches to software updates require a full replacement of existing development workflows. Commercial updater tools more often than not are offered as part of a “platform”, where the updater is bundled together with a full device OS, a cloud back end and other device management features. For existing products, this can pose a significant challenge, because the device OS needs to be replaced, potentially also together with the build system, version control and associated QA processes.
For homegrown updaters, this criteria is typically implicitly taken into account, because teams tend to start with what they have and see what is the shortest path to develop and integrate an updater into it. Since existing build systems tend to output packages like rpm or opkg easily, this is an approach that integrates well and is chosen by many homegrown updaters. However, package-based updates have significant drawbacks with respect to lack of robustness, as I discussed earlier.
As I mentioned previously, embedded devices typically are connected with some kind of low data rate wireless connection. An update process that requires less bandwidth will be favorable over one that takes more, simply because it would cost less and take less time to deploy an update.
Compression is the first feature to look at in order to reduce bandwidth, as this could cut the size of the update in half or more, depending on type and compressibility of the update. There is also a variety of delta-based update mechanisms that could be employed to reduce bandwidth usage further.
Downtime during Update
While an update is being deployed, it is desirable to have as little downtime on the device as possible. How much downtime is acceptable is clearly dependent on the use case of the embedded device. Is it part of the power grid that must function 24x7, or is it a consumer audio system that isn't used at night?
The method for deploying updates impacts the required downtime the most. For example, for full image updates, it's possible to deploy the update from a maintenance mode or use a dual-A/B rootfs approach. The maintenance-mode approach works by rebooting into a maintenance partition, installing the update to the root filesystem partition and then rebooting into the root filesystem partition again; the device is unusable for all of this period. In a dual-A/B rootfs approach, the update is installed to the inactive root filesystem while the device can continue to be used. The downtime in this case is only during the reboot into the updated (previously inactive) partition. The dual-A/B rootfs partition update design is shown in Figure 5.
As you can see, many design choices and trade-offs need to be made on the device side for an embedded updater client.
However, once an updater client is installed and working on embedded devices, the problem of managing all those clients becomes apparent. How can a new update be installed on 1,000 of these embedded devices? Which version of the software are they running? How do you know if an update is installed successfully everywhere, and is there a log for failed updates?
These use cases typically are handled with an update management server, so that updates can be managed across a device fleet.
Many design trade-offs need to be considered in order to deploy software updates to IoT devices. Although historically most teams have decided to implement their homegrown updaters, the recent appearance of several open-source software updaters for embedded Linux means that we should be able to stop re-inventing the wheel.