On a New Road

At the mercy of suppliers.

Tuesday June 29, 2010

An interesting article came out in the New York Times about Dell's problems with capacitors from one of its suppliers. It really is amazing the extent to which a company is at the mercy of its suppliers. Despite all the effort that folks put into qualifying their parts, there remains an unavoidable element of faith and risk.

Sun had a series of such mishaps. The first one I was involved in concerned CRT monitors that burst into flame. The deflection coils of heavily used machines would overheat, and the insulation would melt, smoke, and, for the truly unlucky, burst into flame. Paying to fix all of the affected machines was hideously expensive. And even though the monitor manufacturer shouldered a lot of the expense, it was a near-death experience. But we got past it.

When Sun folks get together and bullshit about their theories of why Sun died, the one that comes up most often is another one of these supplier disasters. Towards the end of the DotCom bubble, we introduced the UltraSPARC-II. Total killer product for large datacenters. We sold lots. But then reports started coming in of odd failures. Systems would crash strangely. We'd get crashes in applications. All applications. Crashes in the kernel. Not very often, but often enough to be a problem for customers. Sun customers were used to uptimes of years. The US-II was giving uptimes of weeks. We couldn't even figure out if it was a hardware problem or a software problem - Solaris had to be updated for the new machine, so it could have been a kernel problem. But nothing was reproducible. We'd get core dumps and spend hours poring over them. Some were just crazy, showing values in registers that were simply impossible given the preceding instructions. We tried everything. Replacing processor boards. Replacing backplanes. It was deeply random. Its very randomness suggested that maybe it was a physics problem: maybe it was alpha particles or cosmic rays. Maybe it was machines close to nuclear power plants. One site experiencing problems was near Fermilab. We actually mapped out failures geographically to see if they correlated to such particle sources. Nope. In desperation, a bright hardware engineer decided to measure the radioactivity of the systems themselves. Bingo! Particles! But from where? Much detailed scanning and it turned out that the packaging of the cache RAM chips we were using was noticeably radioactive. We switched suppliers and the problem totally went away. After two years of tearing our hair out, we had a solution.

But it was too late. We had spent billions of dollars keeping our customers running. Swapping out all of that hardware was cripplingly expensive. But even worse, it severely damaged our customers' trust in our products. Our biggest customers had been burned and were reluctant to buy again. It took quite a few years to rebuild that trust. At about the time that it felt like we had rebuilt trust and put the debacle behind us, the Financial Crisis hit...

Boom.

I sure hope that Apple's iPhone4 antenna problem isn't another such defining moment :-(

Comments:

Wow! You really should write a book about your time at Sun!

Posted by Saheed on June 29, 2010 at 11:15 AM PDT #

On a smaller scale, Seagate arguably killed Amstrad: http://articles.sfgate.com/1997-05-10/business/17748230_1_amstrad-plc-faulty-disk-seagate-s-drives

Posted by Andrew on June 29, 2010 at 07:04 PM PDT #

I think I was one of those people. Ended up replacing the RAM after finding a paper blaming it on gamma rays. It's hard to imagine those were two separate incidents. :) Took forever to isolate that problem.

Posted by Les Stroud on June 29, 2010 at 09:50 PM PDT #

Wow so Sun was buying radioactive SRAM from IBM (their competitor) and got screwed. I read the article and there was no mention of a lawsuit or anything like that, how come? Lesson being learned here is never buy parts from your competitor :P

Posted by George on June 29, 2010 at 09:57 PM PDT #

"I sure hope that Apple's iPhone4 antenna problem isn't another such defining moment :-( " Actually it shouldn't be, Steve Jobs told the public that iPhone 4 users are just holding the phone wrong lol ref: http://www.theregister.co.uk/2010/06/25/iphone4_antenna/

Posted by George on June 29, 2010 at 10:17 PM PDT #

That problem helped Fujitsu sell their servers and helped people stay on Solaris. Sometimes a second source just works better.

Posted by petzi-baer on June 30, 2010 at 02:31 PM PDT #

A few years ago (decades actually) I worked as a programmer for a company that made OEM assemblies for Ford. When shipments of machined parts that were made for us by our suppliers came in, they were subjected to so-called statistical quality control. This means that a small random sample of each shipment was checked with appropriate measuring tools to make sure that they were "in spec". I am surprised that this would not be done with electronic parts as well. And while no amount of cutting corners by Dell would surprise me, who would think to look for radioactivity in DRAM assemblies?

Posted by john Bielaszewski on June 30, 2010 at 04:33 PM PDT #

Interesting article about Ultra Sparc, incidentally my most favorite machine. We had our first Ultra Sparcs in the 90s that ran for years 24x7 without any problem. It's a good thing we did not have the ones with this difficult issue.

Posted by Julius Daigdigan on June 30, 2010 at 07:39 PM PDT #

Nice anecdote, more, more, please a lot more... :-)

Posted by Mayuresh Kathe on July 01, 2010 at 05:09 AM PDT #

Which tool do you use for making those pictures?

Posted by Mahboob on July 02, 2010 at 11:25 AM PDT #

It is a great sadness to see Java in the hands of a cannibal business, which refers only to the profit and does not look for people. I think that you can not stand still, believe it is time to forget what was behind and on to the next, create something totally unrelated to any business, something to revolutionize and give cheer to orphans Sun, Combining all experience this wonderful team that is fed up with Oracle to make the strength of the community, can you get away ...

Posted by Fabio on July 02, 2010 at 12:19 PM PDT #
