I spoke with Rivian software head Wassym Bensaid today about his harrowing last 36 hours. Rivian’s software team scrambled after the wrong OS update build, carrying an incorrect certificate, was sent out to the company’s fleet. The update hung before it could complete, rendering most of the consumer-facing infotainment features inoperable on around 3% of the company’s consumer vehicles, according to Bensaid.
Rivian made Bensaid available to discuss the incident and the OTA fix, which will be going out to customers as early as 9:30 a.m. PT (12:30 p.m. ET).
As a Rivian owner, I’m glad this can be fixed via an OTA update, but I’m more concerned that it could happen at all. And it CANNOT happen again.
I asked Bensaid what went wrong. My understanding is that, before it went out, the software was tested on at least two “developer-build” Rivians, which were not affected by the bad certificate. Of course, the correct version had been tested for over a month on a fleet of at least 1,000 test vehicles. But that developer-build subgroup seems far too small and too limited a subset of vehicles to justify pushing a live OTA OS update.
Since the past month, what happened in the final push is the wrong link was selected, unfortunately, with the wrong certificate. So this is what caused the issue. Initially, we started getting reports around 5:30 p.m. Pacific. The reports were a bit confusing, in the sense that some people reported bricked cars, others that the cluster and the camera were still working. So as we were scrambling to get the reports, we wanted to be super conservative, and there were multiple solution paths for us. If cars were truly broken, that would have been a service visit. If parts of the car were still alive, that would have meant probably a way to get them fixed through our mobile service vehicles. And then basically, the team used this opportunity to really zoom out, and they came up with a super creative solution, which basically allows us now to fully fix the issue through an over-the-air update. So we will be sending out a new OTA today, which addresses the issue entirely. So it repairs basically the corrupted image.
Wassym Bensaid
Bensaid noted that Rivian is reevaluating its whole process so that human error can’t ever do something like this again. That means having normal consumer vehicles receive the OTA update and be tested with it before the update is sent out to more vehicles.
We did not want to go into that line of communication initially, because whether it’s 3%, 10%, 1%, or 0.5%, it’s still super important for us. Every user, every customer matters. And job number one for the last 36 hours was: how can we, as a team, find the best possible fix for our customers? And then the ranking: the best possible is a remote solution. The worst possible is basically they have to go to service, or they need to tow the vehicle. And then the team basically spent a lot of effort, and we managed to come up with really a great solution that helps us to address it remotely. It’s also because we have in place an architecture that has a lot of redundancies, and that really allows us to do this kind of operation. And it actually showed up once we started understanding what was happening in the field. The vehicle was still operational, the app was still operational, and the critical parts of the system were still operational. So the safety-based and redundancy-based design that we have in place has actually protected us. And then we have used that as a way to basically inject, in this case, the recovery solution through a remote fix by leveraging these safety systems, which is what we will be deploying today.
Wassym Bensaid
The build that was supposed to go out was tested for months on regular vehicles, but a single human copy-paste error sent the wrong build out. That process is also being overhauled so that the build goes through multiple checks before it is released to the wider customer group.
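To make that concrete, here is a minimal sketch of what a pre-release gate along those lines could look like. It is purely illustrative: the names (`Build`, `release_checks`, `may_widen_rollout`), the fingerprint comparison, and the canary thresholds are my own assumptions, not anything Rivian described.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Build:
    """A candidate OTA build. All fields are hypothetical."""
    artifact: bytes          # the update image that would be pushed
    cert_fingerprint: str    # fingerprint of the certificate shipped with it


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def release_checks(candidate: Build, tested_digest: str, expected_cert: str) -> List[str]:
    """Return the list of failed checks; an empty list means none failed."""
    failures = []
    # Check 1: the bytes being pushed are the bytes that were actually tested.
    if sha256_hex(candidate.artifact) != tested_digest:
        failures.append("artifact digest does not match the tested build")
    # Check 2: the certificate is the one production vehicles expect.
    if candidate.cert_fingerprint != expected_cert:
        failures.append("certificate fingerprint does not match the expected one")
    return failures


def may_widen_rollout(failures: List[str], canary_installed: int, canary_healthy: int,
                      min_canary: int = 50, min_healthy_ratio: float = 0.99) -> bool:
    """Allow a wider push only if all checks passed and enough ordinary
    (non-developer) canary vehicles installed the update and stayed healthy.
    The default thresholds are arbitrary placeholders."""
    if failures or canary_installed < min_canary:
        return False
    return canary_healthy / canary_installed >= min_healthy_ratio


# Example: a wrong build selected by mistake is stopped before any wide push.
tested = Build(artifact=b"tested-image", cert_fingerprint="prod-cert")
wrong = Build(artifact=b"other-image", cert_fingerprint="dev-cert")
problems = release_checks(wrong, sha256_hex(tested.artifact), "prod-cert")
allowed = may_widen_rollout(problems, canary_installed=100, canary_healthy=100)
```

The point of the sketch is that neither check relies on a person picking the right link: the digest ties the push to the build that was tested, and the canary step puts the update on ordinary vehicles before everyone else gets it.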
Owners who are affected (again, around 3% of the fleet, according to Rivian) should see an update on their phone app and should initiate the process from there. For those few who don’t use an app with their Rivian, they must call the Rivian service line to initiate the update from there.
Electrek’s Take
All of the above is what I want to hear as a Rivian owner, but as a reporter, I would have also liked the communication from Rivian to be more official. The original Reddit post was timely and better than nothing, but it also took some work to verify that the user was really Bensaid. It was over 10 hours before the PR team was even able to acknowledge there was a problem, and only after we had shown them the Reddit post.
I think the whole Rivian team can do better here, and from the vibe I’m getting, they think so too.