Tech Talk





Over-the-Air (OTA) Software Updates


By Manuel Caballero
Senior Software Architect
HARMAN Luxury Audio Group


Introduction
If you go back far enough in time, perhaps 25 years or so, consumer electronics (CE) products with software in them had a permanent software configuration. It’s almost unimaginable now, but that is how it was: the software was held in read-only memories (ROMs) that were programmed at the time of manufacture and never changed throughout the lifetime of the product. Of course, back then software and products generally were much simpler, but even so, if there was a bug in the software (and let’s face it, all software has bugs) there was no real way of fixing it. The bug became known as a “quirk,” and users had to find ways of working round or avoiding the problem.

For higher end products, there was the possibility of physically replacing ROM chips, either by a dealer, by mail-back of the unit or occasionally by the end user but this was very rare indeed.

Skip forward a few years into the ’90s, and a new technology allowed ROM chips to be re-programmed within their host. This technology, referred to as “Flash,” is now ubiquitous. It opened up the possibility, for the first time, of in-field software updates. The problem then was how to get the firmware image to the device: this was before the Internet was widespread and before USB sticks became commonplace. As a result, in-field firmware updates remained largely impractical even though reprogramming the unit was technically possible. Note that firmware is software intended to run on an embedded platform.

As with many things in life, the arrival of the Internet changed the way things were done, and the stars lined up to make in-field software updates possible. All the pieces of the puzzle were in place: the ability to rebuild the software, deliver it to the device and reprogram the firmware image in the device.

Fast forward to today, and an in-field software update is a given for all but the smallest, lowest-cost consumer electronic devices. The rest of this article considers the pros and cons of in-field software updates (of which OTA is an essential subset, although most people would consider the two to be the same thing) and reviews the current state of the art in how they work in real products.


Software Complexity
As mentioned in the Introduction, products used to be much simpler, and software complexity in CE products continues to increase rapidly. In the ’80s and ’90s, firmware in CE devices simply controlled LEDs, switches and relays (for example); modern CE products can and do run full operating systems with advanced graphics and online connectivity.

As the complexity of software increases, the complexity of developing and testing it increases too; it is fair to say no non-trivial piece of software is bug-free and thus the ability to address problems in field-based units is a compelling one for the manufacturer and the user.

For manufacturers, there is a distinct benefit in being able to correct faults in the firmware of shipped units. A further benefit is that new features can be added to extend the lifetime of the product and respond to the ever-moving competitive landscape.

For end users, the benefits are parallel to the manufacturer’s benefits in that annoying problems can be fixed easily without the need to replace or return products for updates.

Market Pressures
Another key driver for in-field updates is the accelerating pace of change in the audio (and wider CE) industry. New formats, codecs and service providers are being introduced at an ever-increasing pace, and it is not feasible, for the manufacturer or the consumer, to replace units in order to support these new technologies. Of course, there comes a point when the technology has moved on so far that a firmware update is no longer sufficient because the hardware platform can no longer support it. At that point, the unit becomes End of Life (EOL). The ability to perform software updates postpones that EOL decision, but there is a commercial trade-off between extending product lifetimes for consumers, the engineering effort of supporting continued development and the opportunity for profit from selling new hardware. That balance needs to be considered carefully for every product to serve the interests of the manufacturer, the consumer and the environment (as EOL units need to be disposed of at some point). The calculation is a moving target, as all the parameters that feed into it are themselves fast-moving.

Putting the Customer First
Any successful business must put the interest of the customer first and HARMAN is no exception to that. There are some key requirements that must be met to justify a firmware update of a customer’s unit (and it’s worth noting, the unit is the property of the customer):

• The update should provide a tangible benefit to the customer.
• The update must not remove or reduce functionality, at least not without the express and informed permission of the customer.
• The update must be easy (or even transparent) for the customer to install even if the customer has little or no technical knowledge.
• The update must be reliable, secure and not result in excessive “down time” for the customer.

In devising a firmware update strategy, here at HARMAN we have accounted for all of these. Specifically:
• Unit updates always fix bugs or add new features to customers’ devices.
• Most units can be updated Over-the-Air (OTA) or by means of a USB stick for devices that are not connected online (either by design or by dint of customer choice).
• The update mechanism is designed to be reliable and secure. More information on that later.


Technical Description
Overview
At the highest level, any software implementation comprises two key components:
• The processor (microcontroller unit or MCU) that has a list of operations it must perform (in the form of software). As these are embedded devices, we refer to the software as firmware.
• A storage device to contain the firmware so the microcontroller can read it and act upon it.

By changing the firmware stored in the storage device, the operations of the microcontroller can be changed. These changes can, as described above, include fixing bugs (erroneous behavior) or adding new features.

The challenge for any software update mechanism is to replace the copy of the firmware held in the unit’s storage device so that when the microcontroller next starts (e.g. when the unit is powered on) the new firmware runs (or ‘executes’ in software engineer speak).

Hazards
In these days of heightened awareness of security, user choice and privacy, there are a few key points that are highly relevant to the design of a software update mechanism.
• As many of our devices are ‘connected’ (i.e. they have direct access to the Internet), could a unit be compromised so that it performs undesirable actions?
• If the update fails, either by means of a technical fault or a user intervention (such as powering down the unit during an update) what state is the unit left in? If not handled correctly a unit can become ‘bricked’ – a term that is expanded on below.
• Is the user able to choose when (or if) to install an update and should they be allowed to roll back to a previous version if they so wish? This can become a very complex topic.

Another Brick in the Wall
One common feature of any form of software storage device (in most cases the “Flash” devices mentioned in the introduction) is that it can take some time to update its contents, potentially several minutes. Regardless of how long or short the time is, there is a risk that the update could be interrupted, and the risk increases with the duration as there are more opportunities for things to go wrong.

If the update of the Flash device is interrupted (for whatever reason, e.g. a fault or the user cutting the power), the contents of the Flash device become undefined. Anyone even remotely involved in software engineering will know that computers don’t like undefined. From a non-technical perspective, it means the instructions the microcontroller needs to follow become jumbled or incomplete; as a result, the unit will not operate as intended and in most cases won’t operate at all. The term used to describe a unit in this state is “bricked,” the reasoning being that the only attribute of a unit in such a state that is even remotely useful is its weight.

Bricking a unit is almost the worst possible outcome of a failed software update (the worst is physical damage to a unit – which is rare, but possible). Once a unit is bricked, unless a recovery mechanism was designed in, the unit is useless and can’t be recovered by the user: in the best case scenario the unit would need to be returned to a dealer or manufacturer who would have special equipment to recover the unit. In the worst case, if the dealer or manufacturer is unable or unwilling to recover the unit, the unit is effectively beyond repair. It is a bad place to be as the unit is the property of the customer and it has become damaged, possibly through no fault of the customer.

Therefore, here at HARMAN we design our hardware and software update mechanism so that:
• The software update is reliable and if it fails for any reason, the unit “rolls back” to the previous software image.
• No matter how well a software update mechanism is designed, there will always be a small opportunity for the update to fail – careful design can reduce the window of failure to a minimum but it can never be eliminated. Therefore, hardware recovery is essential and is part of all our current hardware designs.
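
The roll-back behavior in the first bullet can be pictured with a minimal sketch. It assumes a two-slot layout in which each slot stores a CRC-32 alongside its image; the names `Slot` and `choose_boot_slot` and the use of CRC-32 are illustrative assumptions, not a description of the actual HARMAN bootloader.

```python
import zlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slot:
    """One firmware slot in Flash: the image bytes plus the CRC stored with it."""
    image: bytes
    stored_crc: int

    def is_valid(self) -> bool:
        # A half-written image will not match the CRC recorded for it.
        return zlib.crc32(self.image) == self.stored_crc


def choose_boot_slot(slots: List[Slot], preferred: int) -> Optional[int]:
    """Return the index of the slot to boot, or None if no slot is usable."""
    fallback = 1 - preferred
    for index in (preferred, fallback):
        if slots[index].is_valid():
            return index
    return None  # both images damaged: only hardware recovery can help


previous = Slot(b"firmware v1", zlib.crc32(b"firmware v1"))
interrupted = Slot(b"firmware v", zlib.crc32(b"firmware v2"))  # cut short
slot_to_boot = choose_boot_slot([previous, interrupted], preferred=1)  # falls back to 0
```

The `None` case corresponds to the second bullet: when software alone cannot find a good image, a hardware recovery path is the last line of defense.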

Key Design Considerations
In designing our software update mechanism, we considered all these requirements and constraints and arrived at a solution with the following elements:
• Our engineers build firmware and apply a digital signature to it. The digital signature allows the receiving unit to confirm that the update it has received is complete, intact and from its intended source (e.g. HARMAN). This ensures the microcontroller in the device does not attempt to execute invalid, incomplete or unauthorized software.
• We host the software on an FTP site for full Over-the-Air (OTA) capability. Units know the location of this FTP site and can check it periodically for updates.
• For customers whose devices are not connected to the Internet, we also post the firmware on a website so customers can download the file, copy it onto a USB stick and use that to update the unit. This does require a small amount of technical skill but the step-by-step instructions are clearly explained on the product website.
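
The sign-then-verify idea in the first bullet can be sketched as follows. Python’s standard library has no public-key signatures, so this sketch substitutes an HMAC-SHA256 tag with a shared key as a stand-in; a production design would more likely use an asymmetric signature so that units hold only a public key. The function names, the key and the "tag appended to the image" layout are all assumptions for illustration.

```python
import hashlib
import hmac
from typing import Optional

TAG_LEN = 32  # SHA-256 digest size in bytes


def sign_image(firmware: bytes, key: bytes) -> bytes:
    """Build side: append an authentication tag to the firmware image."""
    tag = hmac.new(key, firmware, hashlib.sha256).digest()
    return firmware + tag


def verify_image(signed: bytes, key: bytes) -> Optional[bytes]:
    """Unit side: return the firmware if the tag matches, otherwise None."""
    if len(signed) < TAG_LEN:
        return None  # too short to contain a tag at all
    firmware, tag = signed[:-TAG_LEN], signed[-TAG_LEN:]
    expected = hmac.new(key, firmware, hashlib.sha256).digest()
    # compare_digest avoids leaking the mismatch position through timing.
    return firmware if hmac.compare_digest(tag, expected) else None


KEY = b"hypothetical-build-key"
package = sign_image(b"\x01\x02\x03 firmware bytes", KEY)
accepted = verify_image(package, KEY)         # the original firmware bytes
rejected = verify_image(package[:-1], KEY)    # truncated download: None
```

A truncated, altered or wrongly keyed package fails the check, which is what keeps the microcontroller from executing it.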


The OTA Update Design
Overview
In this section, we consider all the key components required to implement a real-world OTA Update mechanism for our products.

Flash… Saviour of the Universe!
As mentioned in the introduction, the key component of any update mechanism for an embedded device is Flash memory. Just to recap, Flash is a “read only” memory that can be altered by using a special command sequence. When not in this special mode, it behaves exactly like traditional, fixed, read only memory.

When thinking about Flash, there’s a chicken-and-egg problem that must be solved. If the device MCU is writing new firmware to the Flash device, where is it getting its instructions from? While Flash is being written to, it can’t be read, so the hardware must provide a solution.

The solution depends on the hardware design, and our designs have evolved over the years:
• The MCU can have an internal fixed boot ROM that can update the attached Flash chip by allowing the MCU to execute instructions from the internal ROM.
• The design can have two Flash devices so one can be in “read mode” for the MCU to execute from and the other one can be in “write mode” to accept the updated firmware.
• More modern Flash chips can have separate sections for concurrent read and write.

The Update Process
Here are the essential steps performed by the software update process in our current designs:
• Connected units periodically check the FTP site for a firmware update. If a unit detects that an update is available, it downloads the firmware image in blocks and checks each block for validity. Each valid block is stored in a temporary area in the Flash device. The key point here is that the current firmware is not overwritten (and in fact can’t be, as it’s being executed by the MCU).
• Once all the blocks have been received, they are checked against a digital signature present in the final block. If the signature confirms (at least beyond reasonable doubt) that all the blocks were received correctly, a flag is set in the Flash device to indicate a new firmware version is available.
• If for any reason, the process was interrupted or there was an error in the transfer, the validity check will fail and the new firmware flag is not set.
• In all cases, once the transfer is complete, the unit resets itself. On startup, the MCU will start running its bootloader which is located either in ROM within the MCU or in a reserved area of Flash that is protected from overwrites. The bootloader will examine the new firmware flag, and if it is present, it will infer there is a new and valid firmware image available. Based on that it will “flip” the firmware images by flagging the new firmware as current. The area occupied by the previous firmware is then earmarked as the temporary holding area for a future firmware update and the bootloader will run the new firmware image.

For non-connected units, the transport mechanism is a USB stick rather than an online connection to an FTP site – other than that, the basic mechanism is exactly the same.
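
The steps above can be modelled in a few lines. This is a toy simulation under stated assumptions, not the shipping implementation: blocks carry a CRC-32, the whole-image check is a plain SHA-256 digest standing in for the digital signature (and is passed separately rather than carried in the final block), and the names `Unit`, `receive_update`, `reset` and `split_into_blocks` are hypothetical.

```python
import hashlib
import zlib
from typing import Iterable, List, Tuple

Block = Tuple[bytes, int]  # (payload, CRC-32 of payload)


def split_into_blocks(image: bytes, size: int) -> List[Block]:
    """Server side: cut an image into CRC-protected blocks."""
    return [(image[i:i + size], zlib.crc32(image[i:i + size]))
            for i in range(0, len(image), size)]


class Unit:
    """Toy model of a unit with two firmware areas and a new-firmware flag."""

    def __init__(self, current_firmware: bytes) -> None:
        self.areas: List[bytes] = [current_firmware, b""]
        self.active = 0                 # area the MCU executes from
        self.new_firmware_flag = False  # set only after a fully verified download

    def receive_update(self, blocks: Iterable[Block], expected_sha256: str) -> None:
        """Stage an update in the temporary area; the running image is untouched."""
        staging = 1 - self.active
        self.new_firmware_flag = False
        self.areas[staging] = b""
        for payload, crc in blocks:
            if zlib.crc32(payload) != crc:
                return                  # corrupt block: give up, flag stays clear
            self.areas[staging] += payload
        # Whole-image check (a hash standing in for the digital signature).
        if hashlib.sha256(self.areas[staging]).hexdigest() == expected_sha256:
            self.new_firmware_flag = True

    def reset(self) -> bytes:
        """Bootloader: flip the areas if the flag is set, then return the image to run."""
        if self.new_firmware_flag:
            self.active = 1 - self.active  # old area becomes the next staging area
            self.new_firmware_flag = False
        return self.areas[self.active]
```

In this model a download that is cut short leaves the flag clear, so the reset that follows simply restarts the existing firmware.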


The Future
Flash? What Flash?
Miracle technology as it is, there are a few disadvantages to using Flash.

In designs with multiple MCUs and Digital Signal Processors (DSPs), each device typically requires its own Flash memory (and perhaps two for failsafe). If there are several MCUs and DSPs, all these Flash chips can significantly add to the complexity and cost of the unit.

With multiple MCUs, all the Flash devices must be updated so that they all contain the correct firmware version. What happens if one fails and the others succeed? It gets very complicated very quickly. In the case of the Arcam ST60, there are three MCUs in the design, so additional measures are used to ensure all devices have received a valid image before the new firmware is activated. It is a non-trivial task.
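
One general way to get this all-or-nothing behavior is a two-phase approach: every MCU first stages and validates its image, and activation happens only if all of them report success. The sketch below illustrates that pattern only; it is not a description of the ST60’s actual measures, and `Mcu`, `update_all` and the CRC-32 check are assumptions.

```python
import zlib
from typing import Dict, List, Optional, Tuple


class Mcu:
    """Toy MCU holding an active image and an optional staged image."""

    def __init__(self, name: str, firmware: bytes) -> None:
        self.name = name
        self.active = firmware
        self.staged: Optional[bytes] = None

    def stage(self, image: bytes, crc: int) -> bool:
        """Phase 1: store the image without activating it and report validity."""
        ok = zlib.crc32(image) == crc
        self.staged = image if ok else None
        return ok

    def activate(self) -> None:
        """Phase 2: switch to the staged image."""
        if self.staged is not None:
            self.active, self.staged = self.staged, None

    def discard(self) -> None:
        """Abort: drop whatever was staged and keep running the old image."""
        self.staged = None


def update_all(mcus: List[Mcu], images: Dict[str, Tuple[bytes, int]]) -> bool:
    """Activate the new images only if every MCU staged its image successfully."""
    results = [mcu.stage(*images[mcu.name]) for mcu in mcus]
    if all(results):
        for mcu in mcus:
            mcu.activate()
        return True
    for mcu in mcus:
        mcu.discard()  # one failure cancels the update everywhere
    return False
```

A real design would also have to cope with an interruption during the activation phase itself, which is part of why the task is non-trivial.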

Finally, Flash is slow. As MCUs and DSPs have increased in speed, Flash has not kept up with those speed advances and this can significantly reduce the performance of a system as the MCU waits for the Flash device to return the next instruction. There are mitigations such as using faster Flash devices and using caches, but all of these push up costs and power consumption.

Central Firmware
Central Firmware is a term I came up with for a newer way of updating firmware across multiple MCUs/DSPs in a system whilst avoiding all the disadvantages of Flash-based designs and ensuring consistent update application to all devices in the system.

It is a design pattern I have used in previous, unrelated designs, and I propose that it would solve many of the challenges of firmware updates in increasingly complex systems. I call it central firmware by analogy with central heating and central locking.

In a traditional Flash-based system, the primary controller (typically the MCU with the USB or online connectivity) would receive the update image that is a composite image containing the image for all the MCUs in the system. The primary MCU would check the image and then extract the individual MCU images from the composite and send the images to the appropriate MCUs. The other MCUs would receive this and then program their local Flash devices. The entire system would then be reset and then, if everything went to plan, all the microcontrollers would execute their new firmware images.

In a design with central firmware, the acquisition of the composite image is the same, but the secondary controllers do not have a local Flash device: only the primary controller has a Flash device (which is needed to store the composite image as well as its own local executable image).

The secondary controllers would run their software from their own internal static memory. The application firmware is injected into the secondary MCU by the primary MCU at boot time and then executed by the secondary MCU. This solves several problems we discussed above regarding Flash:
• There is no chance of the MCUs running incompatible versions due to an interrupted update.
• Only the primary device in the unit needs to have a Flash device thus reducing cost, size, PCB complexity and power consumption.
• The secondary devices (which could be performance-critical DSPs) execute from internal static RAM which is typically considerably faster than Flash.
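
The boot-time injection described above can be sketched as a small simulation. The composite image layout used here (a 1-byte MCU id and a 4-byte length in front of each payload) is invented for illustration, as are the names `pack_composite`, `unpack_composite`, `SecondaryMcu` and `boot_system`; a real link would be a UART, SPI or similar rather than a method call.

```python
import struct
from typing import Dict

_HEADER = struct.Struct(">BI")  # 1-byte MCU id, 4-byte payload length, big-endian


def pack_composite(images: Dict[int, bytes]) -> bytes:
    """Build side: concatenate (id, length, payload) records."""
    return b"".join(_HEADER.pack(mcu_id, len(img)) + img
                    for mcu_id, img in images.items())


def unpack_composite(blob: bytes) -> Dict[int, bytes]:
    """Primary MCU: split the composite back into per-MCU images."""
    images: Dict[int, bytes] = {}
    offset = 0
    while offset < len(blob):
        if len(blob) - offset < _HEADER.size:
            raise ValueError("truncated record header")
        mcu_id, length = _HEADER.unpack_from(blob, offset)
        offset += _HEADER.size
        payload = blob[offset:offset + length]
        if len(payload) != length:
            raise ValueError("truncated composite image")
        images[mcu_id] = payload
        offset += length
    return images


class SecondaryMcu:
    """Secondary controller with RAM only; its ROM bootloader accepts packets."""

    def __init__(self) -> None:
        self.ram = bytearray()

    def inject(self, packet: bytes) -> None:
        self.ram += packet


def boot_system(composite: bytes, secondaries: Dict[int, SecondaryMcu],
                packet_size: int = 64) -> None:
    """Primary MCU at boot: push each secondary's image into its RAM in packets."""
    images = unpack_composite(composite)
    for mcu_id, mcu in secondaries.items():
        mcu.ram.clear()  # RAM contents do not survive a power cycle
        image = images[mcu_id]
        for i in range(0, len(image), packet_size):
            mcu.inject(image[i:i + packet_size])
```

Because every secondary image comes from the single composite held by the primary controller on each boot, the secondaries cannot drift onto mismatched versions.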

As with any design, there are compromises, and those for central firmware include:
• The secondary devices must have an intelligent ROM Bootloader that is able to start up the device and listen for firmware injection packets from an external source. Most modern MCUs are capable of this.
• The secondary devices must have sufficient static memory to be able to accommodate the firmware image and any operating memory requirements the system has. Again, most modern MCUs for embedded applications would have this but it needs to be carefully considered.
• It can take a few seconds to inject the image as the link is typically a serial link like a UART or SPI but this is rarely a problem. There is always the option of using faster interfaces such as QSPI or USB. Note that the hardware would be designed so all the MCUs can have their firmware injected concurrently rather than one at a time.
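
To put rough numbers on the injection time in the last bullet, here is a back-of-envelope calculation. The image size and link speeds are assumed values chosen for illustration, and the result is a lower bound because protocol overhead and retries are ignored.

```python
def injection_seconds(image_bytes: int, bits_per_second: int,
                      bits_on_wire_per_byte: int = 8) -> float:
    """Lower-bound transfer time, ignoring protocol overhead and retries."""
    return image_bytes * bits_on_wire_per_byte / bits_per_second


IMAGE_BYTES = 512 * 1024  # hypothetical 512 KiB firmware image

# UART at 921600 baud with 8N1 framing: 10 bits on the wire per data byte.
uart_seconds = injection_seconds(IMAGE_BYTES, 921_600, bits_on_wire_per_byte=10)

# SPI clocked at 20 MHz: 8 bits per data byte.
spi_seconds = injection_seconds(IMAGE_BYTES, 20_000_000)

print(f"UART: {uart_seconds:.2f} s, SPI: {spi_seconds:.2f} s")
```

With these assumed figures the UART transfer takes about 5.7 s and the SPI transfer about 0.21 s. When all secondaries are loaded concurrently, the total is governed by the slowest single transfer rather than the sum.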

Note that it is normal these days for DSPs to run from static memory (usually internal). The concept is becoming more commonplace for MCUs as well, especially as many of them are taking on DSP duties.

Conclusions
In this article, I discussed the reasoning, design and implementation of an OTA (Over-the-Air) firmware update mechanism as used in contemporary HARMAN Arcam products. I reviewed the design considerations, security and fault tolerance of a typical design. Finally, I looked at a novel idea, central firmware, to reduce costs and increase performance in designs with multiple MCUs and DSPs.