Throwing Wrenches At Autonomous Vehicles

If you need to quickly figure out whether or not an AV is safe, you might want to throw a wrench at it.  If it doesn’t crash, it’s worthy of further review.  If it does, case closed.

Of course, you don’t necessarily throw a literal wrench at the AV; you just throw a proverbial wrench into its plans - add some forced entropy to an otherwise uneventful test drive.  Demand a last-minute turn (if it’s reasonable that users might do so), push a shopping cart into its lane (if it might ever encounter one), pour black paint over some lane lines.

DiDi autonomous driving robotaxi in Shanghai. VCG VIA GETTY IMAGES

Few industries receive as much speculative scrutiny as autonomous vehicles (AVs).  The infamous trolley problem is a case in point - a google of “trolley problem” + “autonomous” yields 4000 search results (probably 40x the total number of vehicles that have ever driven unmanned on public roads).  When it comes to regulating autonomous vehicles, there are intense conversations about which sensors should be required, which technologies banned, whether or not photorealistic simulation is required, et cetera et cetera.

Out of all of these conversations, the consistent outcome tends to be something akin to a complex driving test: a closed driving course with pre-set obstacles designed to pop out at predetermined times.  Since these tests would be much more thorough than what we’d give a student driver, the thinking goes, passing them would help prove an AV is safe.
The problem is that AVs are really, really good at cramming for tests.

Almost every AV uses a fair amount of machine learning (ML), which works by looking at a set of relevant data and then determining whether or not current conditions look like that data.  These lane lines look like those it was trained on, so it recognizes them, etc.  From a testing perspective, a problem emerges - what if you train a model just on different attempts at doing that test?

You could imagine an essay writing AI, trained on 100,000 essays on Romeo & Juliet (as well as the original text and grade scores).  Such an AI would be incredibly good at writing an essay on Romeo & Juliet, but lackluster at a question on Hamlet and terrible on a Ulysses question.  For the AV - you could have a system that got top marks on the track but was utterly unable to drive anywhere else in the world.
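
To make that concrete, here’s a toy sketch (synthetic data and made-up names, not any real AV stack): a classifier trained and scored on one narrow slice of the world looks nearly perfect there, then falls apart the moment the distribution shifts.

```python
# Toy illustration of "cramming for the test": train and evaluate on the same
# narrow slice of the world (the closed course), then see what happens when
# the inputs drift. Synthetic data only; no relation to any real AV pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_scenes(center, n=500):
    """Synthetic 'scenes' around a given operating point, with labels that
    depend on where the scenes actually sit."""
    x = rng.normal(loc=center, scale=1.0, size=(n, 2))
    y = (x[:, 0] + x[:, 1] > 2 * center).astype(int)
    return x, y

# Train and "certify" on the closed test course.
x_course, y_course = make_scenes(center=0.0)
model = LogisticRegression().fit(x_course, y_course)
print("score on the course:", model.score(x_course, y_course))  # near 1.0

# The open road is a different distribution; the crammed model drops to ~50%.
x_road, y_road = make_scenes(center=3.0)
print("score on the road: ", model.score(x_road, y_road))
```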

The solution is actually something that humans do naturally and without advanced technical degrees: adding forced entropy, or throwing the odd wrench in the pre-scheduled test.  What regulators really care about is that these systems can be deployed fairly safely, and a relatively small amount of entropy can break all but the most robust AI systems.  

Asking for a last-minute left turn during a ridealong, for example, is the easiest way to get a sense of whether the AV can move in the free world or is actually following an invisible track.  Pushing a shopping cart into its path (when you haven’t previously told the AV developer) is a great way to tell whether the AV system can deal with weird obstacles.

What’s useful about this approach is that it tests the AV system being deployed.  Depending on use case and proposed stage of deployment, that system might include a teleoperator and/or a safety driver; and a successful test is one where the AV doesn’t cause an avoidable accident.

High-res simulation, 3rd-party review of design docs, reading VSSAs - all of those steps are still useful in evaluating the safety of an AV, but they’re only worth the long evaluation process after an AV passes the wrench test.

And when it comes to unmanned deployment, my guess is that most AV companies worth over $50m wouldn’t be able to dodge a wrench.

This was originally posted in Forbes.

Making Teleop Safe

A Starsky autonomous truck driving in the hills of the San Francisco Peninsula STARSKY ROBOTICS

(This is the fifth in a series of articles I’m calling ‘Opening the Starsky Kimono,’ where I reveal the Starsky insights we previously considered top secret. You can read about Starsky’s approach to safety here and here, business model here, and thoughts on AI here.)

A truck glides down the interstate. Drivers speeding by look up and are shocked – there’s no driver in that autonomous truck! Up ahead, a deer jumps into the truck’s lane, and hundreds of miles away a teleoperator is asked to take control of the vehicle. But they aren’t able to in time – either the deer jumped too quickly, or the teleoperator wasn’t able to get situationally aware, or, worse yet, the cellular connectivity wasn’t good enough!

Such was the situation painted to me time after time after time as CEO of Starsky Robotics, whose remote-assisted autonomous trucks were supposed to face exactly such a scenario. And yet, it was an entirely false scenario.

As I’ve written about before, safety doesn’t mean that everything always works perfectly; in fact, it’s quite the opposite.  To make a system safe is to intimately understand where, when, and how it will break, and to make sure that those failures are acceptable.

Just as you can constantly test whether or not your radar is working, you can do the same with a cellular connection.  You can whitelist the routes that have sufficient connectivity, so that if you’re ever in a dead zone you know something must be going wrong and that pulling over is appropriate.  Finally, you can design the entire teleop product around the limitations of human situational awareness - at Starsky we figured that it would take 10 seconds to get situationally aware of a truck, and that any emergency within that window needed to be solvable by the truck itself.
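
A minimal sketch of that logic, with hypothetical thresholds and function names (this is not Starsky’s actual code): continuously check the link against a written spec, and treat a dead zone on a whitelisted route as a fault the truck resolves on its own by pulling over.

```python
# Minimal sketch of a connectivity watchdog: check the cellular link against a
# written spec and, on a whitelisted route, treat a dead zone as a fault the
# truck must handle by itself. Thresholds and names are assumptions.
import time

LATENCY_LIMIT_MS = 250   # hypothetical spec for usable teleop latency
MAX_PACKET_LOSS = 0.02   # hypothetical spec for acceptable packet loss


def link_ok(latency_ms: float, packet_loss: float) -> bool:
    """Does the current link meet the connectivity spec?"""
    return latency_ms <= LATENCY_LIMIT_MS and packet_loss <= MAX_PACKET_LOSS


def supervise(measure_link, pull_over, poll_hz: float = 10.0) -> None:
    """Poll the link; if it fails the spec, trigger the onboard pull-over."""
    while True:
        latency_ms, packet_loss = measure_link()
        if not link_ok(latency_ms, packet_loss):
            # On a whitelisted route, losing the link means something is
            # wrong, so the acceptable failure mode is to stop without
            # waiting for a remote driver.
            pull_over()
            return
        time.sleep(1.0 / poll_hz)
```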

Much of the condemnation of teleop safety stems from the availability heuristic - the notion that if something can be recalled it must be important or at least more important than things that can’t be as easily recalled.  Almost everyone has experienced poor cellular connectivity first hand, and it’s always particularly frustrating.  On the other hand, few have been nearly as annoyed at the failures of radars, LIDARs, GPUs, or high grade cameras.  

Everyone has experienced bad cell reception, which is part of why it's so easy to dismiss technologies that rely upon it, like teleop, as unreliable. GETTY

Of course the connectivity of a teleoperated vehicle will have problems, just as LIDARs will fail and machine learning models will spit out bad results.  Safety critical systems are never perfect, and to that end teleop is unexceptional - you also need to build a safety case around it.

That safety case is far less difficult than most expect.  The two most frequently failing components of a teleop system are the connectivity and the remote driver.  For the former, you can easily write specs for the required level of connectivity, tests to verify them, and logic that pulls the vehicle over if the system starts failing those tests.  You can make all of that a lot easier if you’re able to control where your system operates, so as to give yourself a fighting chance of good connectivity.

In certain settings, like the access-restricted yard above, even direct-drive teleop can be safe. STARSKY ROBOTICS

When planning around teleoperators, the primary issue comes from product design.  As my Starsky co-founder pointed out here, latency becomes a larger issue for direct teleoperation at higher speeds (because with constant latency, more distance passes between a command being ordered and carried out).  At Starsky we solved this by designing what we called “Supervised Autonomy,” where a remote driver would choose from a finite list of discrete behaviors (like left lane change, slow down 5 mph, etc.).  Not only was this approach far less latency intensive, it also meant that every time our remote driver ordered a command we would create low-dimensional decision-making data that we could eventually use to train a higher-level AI (if there was ROI).
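
Here’s a rough sketch of what such a discrete command interface could look like (hypothetical names and behaviors, not Starsky’s actual interface): the remote driver sends a small, latency-tolerant command, and the onboard system translates it into continuous control.

```python
# Sketch of a "Supervised Autonomy" command set: the remote driver picks from
# a finite menu of discrete behaviors instead of steering directly, so only a
# tiny message crosses the cellular link. All names here are hypothetical.
from dataclasses import dataclass
from enum import Enum, auto


class Behavior(Enum):
    LANE_CHANGE_LEFT = auto()
    LANE_CHANGE_RIGHT = auto()
    SLOW_DOWN_5_MPH = auto()
    MAINTAIN = auto()
    PULL_OVER = auto()


@dataclass(frozen=True)
class Command:
    behavior: Behavior
    issued_at: float  # timestamp, useful for logging and as training data


def execute(cmd: Command, truck) -> None:
    """The onboard system turns a discrete command into continuous control.
    `truck` is an assumed onboard interface, not a real API."""
    if cmd.behavior is Behavior.SLOW_DOWN_5_MPH:
        truck.set_target_speed(truck.target_speed() - 5)
    elif cmd.behavior is Behavior.LANE_CHANGE_LEFT:
        truck.begin_lane_change(direction="left")
    elif cmd.behavior is Behavior.LANE_CHANGE_RIGHT:
        truck.begin_lane_change(direction="right")
    elif cmd.behavior is Behavior.PULL_OVER:
        truck.begin_pull_over()
    # MAINTAIN: keep lane and speed; nothing to do.
```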

I was once having a candid discussion with a peer who told me that there were two types of autonomy companies: those who weren’t serious about unmanned and those who used teleop.

As the trough of AI disillusionment continues to deepen, I reckon there will be more and more who realize that not only is teleop useful for taking the person out of the vehicle, it’s also safe.

This was originally published in Forbes.

The End of Starsky Robotics

This was originally published on Starsky’s Medium blog.  You can read the original article here.


In 2015, I got obsessed with the idea of driverless trucks and started Starsky Robotics. In 2016, we became the first street-legal vehicle to be paid to do real work without a person behind the wheel. In 2018, we became the first street-legal truck to do a fully unmanned run, albeit on a closed road. In 2019, our truck became the first fully-unmanned truck to drive on a live highway.

And in 2020, we’re shutting down.

I remain incredibly proud of the product, team, and organization we were able to build; one where PhDs and truck drivers worked side by side, where generational challenges were solved by people with more smarts than pedigree, and where we discovered how the future of logistics will work.

Like Shackleton on his expedition to Antarctica, we did things no one else ever has. Similarly, though, it didn’t turn out as planned.

Much of Starsky office team circa Feb 2019. Nothing in my life has made me as proud as getting to work with this incredible team.

So what happened?

Timing, more than anything else, is what I think is to blame for our unfortunate fate. Our approach, I still believe, was the right one but the space was too overwhelmed with the unmet promise of AI to focus on a practical solution. As those breakthroughs failed to appear, the downpour of investor interest became a drizzle. It also didn’t help that last year’s tech IPOs took a lot of energy out of the tech industry, and that trucking has been in a recession for 18 or so months.

The AV Space

There are too many problems with the AV industry to detail here: the professorial pace at which most teams work, the lack of tangible deployment milestones, the open secret that there isn’t a robotaxi business model, etc. The biggest, however, is that supervised machine learning doesn’t live up to the hype. It isn’t actual artificial intelligence akin to C-3PO, it’s a sophisticated pattern-matching tool.

Back in 2015, everyone thought their kids wouldn’t need to learn how to drive. Supervised machine learning (under the auspices of being “AI”) was advancing so quickly — in just a few years it had gone from mostly recognizing cats to more-or-less driving. It seemed that AI was following a Moore’s Law Curve:

Source: TechTarget

Projecting that progress forward, all of humanity would certainly be economically uncompetitive in the near future. We would need basic income to cope, to connect with machines to stand a chance, etc.

Five years later and AV professionals are no longer promising Artificial General Intelligence after the next code commit. Instead, the consensus has become that we’re at least 10 years away from self-driving cars.

It’s widely understood that the hardest part of building AI is how it deals with situations that happen uncommonly, i.e. edge cases. In fact, the better your model, the harder it is to find robust data sets of novel edge cases. Additionally, the better your model, the more accurate the data you need to improve it. Rather than seeing exponential improvements in the quality of AI performance (a la Moore’s Law), we’re instead seeing exponential increases in the cost to improve AI systems — supervised ML seems to follow an S-Curve.
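
As a purely illustrative comparison (made-up numbers, not a fit to any real data): under an exponential, each year of work buys a roughly constant multiple of improvement, while under a logistic S-curve the same effort buys less and less as performance approaches a ceiling.

```python
# Illustrative only: exponential growth vs. a logistic S-curve. The exponential
# keeps multiplying; the S-curve's year-over-year gains shrink toward zero as
# it approaches its ceiling.
import math

def exponential(t, rate=0.7):
    return math.exp(rate * t)

def s_curve(t, ceiling=1.0, rate=0.7, midpoint=5.0):
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))

for year in range(11):
    print(f"year {year:2d}  exp {exponential(year):8.1f}  s-curve {s_curve(year):.3f}")
```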


The S-Curve here is why Comma.ai, with 5–15 engineers, sees performance not wholly different from Tesla’s 100+ person autonomy team. Or why at Starsky we were able to become one of three companies to do unmanned tests on public roads (with only 30 engineers).

It isn’t unprecedented — S-curves are frequent in technological adoption (Moore’s Law is actually made up of a number of S-curves, as chip technologies continuously replace each other as the best candidate to continue the phenomenon’s overall curvature). The problem comes when you try to compare the current technology to how good humans are at driving. I’d propose that there are three possible options: we’ve already surpassed human equivalence (shown below as L1), we’re nearly there (L2), or we’re a ways off (L3).


If L1 is the line of human equivalence, then leading AV companies merely have to prove safety to be able to deploy. I don’t think I know anyone serious who believes this, but it is a possibility. If L2 is the case, the bigger teams are somewhere from $1–25b away from solving this problem. When big AV investors say that autonomy is an industry just for big companies, this is the bet that they’re making. If, however, L3 is the line of human equivalence, it’s unlikely any of the current technology will make that jump. Whenever someone says autonomy is 10 years away, that’s almost certainly what they’re thinking. There aren’t many startups that can survive 10 years without shipping, which means that almost no current autonomous team will ever ship AI decision makers if this is the case.

There aren’t many startups that can survive 10 years without shipping

Why We Didn’t Survive

To someone unfamiliar with the dynamics of venture fundraising, all of the above might seem like a great case to invest in Starsky. We didn’t need “true AI” to be a good business (we thought it might only be worth ~$600/truck/yr) so we should have been able to raise despite the above becoming increasingly obvious. Unfortunately, when investors cool on a space, they generally cool on the entire space. We also saw that investors really didn’t like the business model of being the operator, and that our heavy investment into safety didn’t translate for investors.

Trucking Blues

If teleop solves half the challenge of autonomy, the other half is solved by being the operator. As the trucking company, you can choose where you operate — allowing you to pick your battles. Your system only has to be safe on the routes and in the conditions you choose to drive in (driving the easiest routes, and pulling over and waiting out bad conditions).

The nature of the participants in the trucking industry also reinforces the decision to be an operator. Trucking companies aren’t great technology customers (you should see what they use), and no one knows how to buy safety-critical on-road robots. Even if Starsky perfected general autonomy and perfectly validated safety, it would take years to deploy sufficient systems to make the necessary profits.


“You can always tell how serious a company is about unmanned by how seriously they talk about teleop” a vendor once told me. Nevertheless, we found an incredible amount of industry and investor resistance to our teleop-dependent approach.

While trucking companies don’t know how to buy safety-critical robots, they do know how to buy trucking capacity. Every large trucking company does so — their brokerages buy capacity from smaller fleets and owner-operators, many of whom they keep at arm’s length because they don’t know how much to trust their self-reported safety metrics. At Starsky we found 25+ brokers and trucking companies more than willing to dispatch freight to trucks they already suspected were unmanned. While this is a lower-margin business than software’s traditional 90%, we expected to be able to get to a 50% margin in time.

It took me way too long to realize that VCs would rather back a $1b business with a 90% margin than a $5b business with a 50% margin, even if capital requirements and growth were the same.

And growth would be the same. The biggest limiter of autonomous deployments isn’t sales, it’s safety.

No One Really Likes Safety, They Like Features

In January of 2019, our Head of Safety, our Head of PR, and I gathered in a conference room for a strategy session. The issue: how could we make safety seem exciting enough for the press to cover? A month earlier we had publicly released our VSSA, a highly technical document that detailed how we decided to approach safety. We had pitched it to a particularly smart reporter, but instead of covering it in detail they mostly wrote about teleoperation. We left the meeting in a fluster — we couldn’t figure out how to make safety engineering sexy enough to garner its own reporting.

And we never really figured out how.


Ironically, we were planning on launching a fleet of 10 v2 trucks by January 2020. These systems were designed to be consistent enough to enable us to prove safety across the broader fleet, allowing unmanned regular service by June 2020.

The problem is that people are excited by things that happen rarely, like Starsky’s unmanned public road test. Even when it’s negative, a plane crash gets far more reporting than the 100 people who die each day in automotive accidents. By definition building safety is building the unexceptional; you’re specifically trying to make a system which works without exception.

Safety engineering is the process of thoroughly documenting your product so that you know exactly the conditions under which it will fail and the severity of those failures, and then measuring the frequency of those conditions so that you know how likely it is that your product will hurt people versus how many people you’ve decided it is acceptable to hurt.
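
The arithmetic behind that comparison is simple even if the documentation and measurement work isn’t. Here’s a toy sketch with made-up failure modes and numbers (nothing here reflects Starsky’s actual analysis):

```python
# Illustrative sketch of the arithmetic behind a safety case: for each
# documented failure mode, expected harm = frequency of the triggering
# condition x probability the failure causes harm, and the total is compared
# against an acceptable rate. All numbers are invented for illustration.

failure_modes = [
    # (name, occurrences per million miles, probability of causing injury)
    ("lost cellular link on whitelisted route", 5.0, 0.001),
    ("radar dropout at highway speed",          2.0, 0.002),
    ("teleoperator issues wrong command",       1.0, 0.005),
]

ACCEPTABLE_INJURIES_PER_MILLION_MILES = 0.02  # chosen risk target (assumed)

expected = sum(freq * p_injury for _, freq, p_injury in failure_modes)

print(f"expected injuries per million miles: {expected:.4f}")
print("within budget" if expected <= ACCEPTABLE_INJURIES_PER_MILLION_MILES
      else "NOT within budget: redesign or restrict where the system operates")
```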

Doing that is really, really hard. So hard, in fact, that it’s more or less the only thing we did from September of 2017 until our unmanned run in June of 2019. We documented our system, built a safety backup system, and then repeatedly tested our system to failure, fixed those failures, and repeated.


Credit: Tanya Sumang

The problem is that all of that work is invisible. Investors expect founders to lie to them — so how are they to believe that the unmanned run we did actually only had a 1-in-a-million chance of a fatal accident? If they don’t know how hard it is to do unmanned, how do they know someone else can’t do it next week?

Our competitors, on the other hand, invested their engineering efforts in building additional AI features. Decision makers which could sometimes decide to change lanes, or could drive on surface streets (assuming they had sufficient map data). Really neat, cutting-edge stuff.

Investors were impressed. It didn’t matter that the jump from “sometimes working” to statistically reliable was 10–1000x more work.


So, what’s next?

Around November 12 of 2019, our $20m Series B fell apart. We furloughed most of the team on the 15th (probably the worst day of my life), and then started work on selling the company and making sure the team didn’t go without shelter (or visa status, or healthcare for the new and expectant parents).

We were able to land jobs for many of the most vulnerable team members by the end of January, and I’m in the process of selling the assets of the company (which include a number of patents essential to operating unmanned vehicles). Like the captain of a sinking ship, I’ve gotten most of the crew on lifeboats and am now noticing the icy waters at my ankles while I start to think about what I do next.


From my vantage point, I think the most likely line of human equivalence is L3, which means that no one should be betting a business on safe AI decision makers. The current companies who are will continue to drain momentum over the next two years, followed by a few years with nearly no investment in the space, and (hopefully) another unmanned highway test in 5 years.

I’d love to be wrong. The aging workforce will almost certainly start to limit economic growth in the next 5–10 years, and the 4000 people who die every year in truck accidents seem a needless sacrifice. If we showed anything at Starsky, it’s that this is buildable if you religiously focus on getting the person out of the vehicle in limited use cases. But it will need to be someone else who realizes that vision.

Signing off,

Stefan.