At the mercy of suppliers.

An interesting article came out in the New York Times about Dell's problems with capacitors from one of its suppliers. It really is amazing the extent to which a company is at the mercy of its suppliers. Despite all the effort that folks put into qualifying their parts, there remains an unavoidable element of faith and risk.

Sun had a series of such mishaps. The first one I was involved in concerned CRT monitors that burst into flame. The deflection coils of heavily used machines would overheat, and the insulation would melt, smoke, and, for the truly unlucky, burst into flame. Paying to fix all of the affected machines was hideously expensive. And even though the monitor manufacturer shouldered a lot of the expense, it was a near-death experience. But we got past it.

When Sun folks get together and bullshit about their theories of why Sun died, the one that comes up most often is another one of these supplier disasters. Towards the end of the DotCom bubble, we introduced the UltraSPARC-II. Total killer product for large datacenters. We sold lots. But then reports started coming in of odd failures. Systems would crash strangely. We'd get crashes in applications. All applications. Crashes in the kernel. Not very often, but often enough to be a problem for customers. Sun customers were used to uptimes of years. The US-II was giving uptimes of weeks. We couldn't even figure out if it was a hardware problem or a software problem - Solaris had to be updated for the new machine, so it could have been a kernel problem. But nothing was reproducible. We'd get core dumps and spend hours poring over them. Some were just crazy, showing values in registers that were simply impossible given the preceding instructions. We tried everything. Replacing processor boards. Replacing backplanes. It was deeply random. Its very randomness suggested that maybe it was a physics problem: maybe it was alpha particles or cosmic rays. Maybe it was machines close to nuclear power plants. One site experiencing problems was near Fermilab. We actually mapped out failures geographically to see if they correlated with such particle sources. Nope. In desperation, a bright hardware engineer decided to measure the radioactivity of the systems themselves. Bingo! Particles! But from where? After much detailed scanning, it turned out that the packaging of the cache RAM chips we were using was noticeably radioactive. We switched suppliers and the problem totally went away. After two years of tearing our hair out, we had a solution.

But it was too late. We had spent billions of dollars keeping our customers running. Swapping out all of that hardware was cripplingly expensive. But even worse, it severely damaged our customers' trust in our products. Our biggest customers had been burned and were reluctant to buy again. It took quite a few years to rebuild that trust. At about the time that it felt like we had rebuilt trust and put the debacle behind us, the Financial Crisis hit...


I sure hope that Apple's iPhone 4 antenna problem isn't another such defining moment :-(

June 29, 2010