On a New Road
Sun had a series of such mishaps. The first one I was involved in concerned CRT monitors that burst into flame. The deflection coils of heavily used machines would overheat, and the insulation would melt, smoke, and, for the truly unlucky, burst into flame. Paying to fix all of the affected machines was hideously expensive. And even though the monitor manufacturer shouldered a lot of the expense, it was a near-death experience. But we got past it.
When Sun folks get together and bullshit about their theories of why Sun died, the one that comes up most often is another one of these supplier disasters. Towards the end of the DotCom bubble, we introduced the UltraSPARC-II. Total killer product for large datacenters. We sold lots. But then reports started coming in of odd failures. Systems would crash strangely. We'd get crashes in applications. All applications. Crashes in the kernel. Not very often, but often enough to be problems for customers. Sun customers were used to uptimes of years. The US-II was giving uptimes of weeks.
We couldn't even figure out if it was a hardware problem or a software problem - Solaris had to be updated for the new machine, so it could have been a kernel problem. But nothing was reproducible. We'd get core dumps and spend hours poring over them. Some were just crazy, showing values in registers that were simply impossible given the preceding instructions. We tried everything. Replacing processor boards. Replacing backplanes. It was deeply random.
Its very randomness suggested that maybe it was a physics problem: maybe it was alpha particles or cosmic rays. Maybe it was machines close to nuclear power plants. One site experiencing problems was near Fermilab. We actually mapped out failures geographically to see if they correlated with such particle sources. Nope. In desperation, a bright hardware engineer decided to measure the radioactivity of the systems themselves. Bingo! Particles! But from where? After much detailed scanning, it turned out that the packaging of the cache RAM chips we were using was noticeably radioactive.
We switched suppliers and the problem totally went away. But it was too late. We had spent billions of dollars keeping our customers running. Swapping out all of that hardware was cripplingly expensive. But even worse, it severely damaged our customers' trust in our products. Our biggest customers had been burned and were reluctant to buy again. It took quite a few years to rebuild that trust. At about the time that it felt like we had rebuilt trust and put the debacle behind us, the Financial Crisis hit... Boom. I sure hope that Apple's iPhone 4 antenna problem isn't another such defining moment :-(
Posted by Saheed on June 29, 2010 at 11:15 AM PDT #
Posted by Andrew on June 29, 2010 at 07:04 PM PDT #
Posted by Les Stroud on June 29, 2010 at 09:50 PM PDT #
Posted by George on June 29, 2010 at 09:57 PM PDT #
Posted by George on June 29, 2010 at 10:17 PM PDT #
Posted by petzi-baer on June 30, 2010 at 02:31 PM PDT #
Posted by john Bielaszewski on June 30, 2010 at 04:33 PM PDT #
Posted by Julius Daigdigan on June 30, 2010 at 07:39 PM PDT #
Posted by Mayuresh Kathe on July 01, 2010 at 05:09 AM PDT #
Posted by Mahboob on July 02, 2010 at 11:25 AM PDT #
Posted by Fabio on July 02, 2010 at 12:19 PM PDT #