Tuesday, August 28th, 2007
Jedi/Sector One: Ruby 1.9 twice as fast as Ruby 1.8
In a recent posting I said “if you are claiming Erlang has proven to reach “nine nines” of reliability you are dishonest or don’t know what you are talking about – choose whichever you prefer.”
Alexis considered this “a bit harsh”. Probably we have a misunderstanding about the term “dishonest”. I didn’t use it as an ethical category but in the sense of “dishonest to oneself”. Perhaps “not completely candid” would have been better. What I really mean is: “doesn’t follow good scientific reasoning”. When we are talking about Computer Science most things are provably correct or incorrect. Often we have a mathematical proof; in other instances we have a “softer” proof, like the one presented in a court. But it is not about believing and feeling. Reliability engineering is a nice hard science and either you reach 99.9999999 % reliability or you don’t. And showing that you reached it is always done by proof, be it statistical or formal or some softer muddle. If you are talking about high reliability, just observing doesn’t cut it.
The thing about nine nines is that it is an extremely ambitious goal. I’ve spent some years as a lecturer in “dependable distributed systems” and at our institute we were happy to build systems with 99.999% reliability in a non-stop environment. Actually there are many critical systems (e.g. trading floors) which get by with much lower reliability by being idle 12 hours a day. With that it is somewhat easy to get decent reliability during daytime. If you act in global marketplaces you can shift your processing around the world as the day progresses.
With 99.999% you have 5 minutes of downtime a year. Or a medium outage (half an hour) every 6 years or a major outage (6 hours) about every 68 years. Quite good.
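The downtime budget behind such figures follows directly from the unavailability fraction. A minimal sketch, assuming a year of 365.25 days; the function names are made up for illustration:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: Julian year


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability with `nines` nines."""
    unavailability = 10.0 ** (-nines)  # e.g. 5 nines -> 0.00001
    return SECONDS_PER_YEAR * unavailability


def years_between_outages(nines: int, outage_seconds: float) -> float:
    """How rarely an outage of the given length may happen within the budget."""
    return outage_seconds / downtime_seconds_per_year(nines)


if __name__ == "__main__":
    for n in (5, 7, 9):
        print(f"{n} nines: {downtime_seconds_per_year(n):.4g} s downtime per year")
    print(f"30 min outage at 5 nines: every {years_between_outages(5, 30 * 60):.1f} years")
    print(f"6 h outage at 5 nines: every {years_between_outages(5, 6 * 3600):.1f} years")
```

Going from five to nine nines shrinks the yearly budget from minutes to milliseconds.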
So there are claims that Erlang is built to be 10,000-fold that good. If you claim 99.9999999% this is an outrageous claim and you had better back it up with facts.
Suppose an Erlang VM had only a single worry: bit errors introduced by cosmic rays or whatever. Estimating the chances of bit flips is a somewhat involved process but let’s assume that per MBit a single bit error occurs every 50 years. That means that you see a single bit error about every second day on a machine with 1 GB RAM. Let’s assume you somehow use “better RAM” (ECC, concrete shielding, whatever) which is 100 times more reliable. So you get roughly two bit errors per year.
Let’s assume there are “good” and “bad” bit errors. A “good” bit error is one which can be fixed automatically, e.g. by dropping a connection and restarting the process or something like that. A “bad” bit error crashes the system, e.g. because some internal data structures of the VM or the operating system got corrupted. We also assume that a “bad” bit error results in 5 minutes of downtime for rebooting. To get 99.9999999 % reliability such a reboot can happen only once every 10,000 years. That means only one in 20,000 bit errors is allowed to be a “bad” one. This means that of your 1 GByte RAM only 52 kByte may contain critical data structures where a bit flip would be “bad”, i.e. result in a reboot.
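The estimate above can be replayed in a few lines. This is only a sketch of the arithmetic under the assumptions already made in the text (one error per MBit every 50 years, RAM that is 100 times better, 5 minutes per reboot) plus one more that is implicit: bit errors hit every byte of RAM with equal probability. The function name is made up.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: Julian year
GBYTE = 1024 ** 3                      # bytes in 1 GByte
MBIT_PER_GBYTE = 8 * 1024              # 1 GByte = 8192 MBit


def critical_bytes_allowed(ram_bytes, errors_per_year, years_between_reboots):
    """Bytes of RAM that may hold data where a bit flip forces a reboot.

    Assumes bit errors are spread uniformly over all of RAM.
    """
    tolerated_errors = errors_per_year * years_between_reboots
    return ram_bytes / tolerated_errors


# Rounded figures as used in the text: 2 errors/year, one reboot per 10,000 years.
rounded = critical_bytes_allowed(GBYTE, 2, 10_000)

# The same estimate without rounding the intermediate results.
errors_per_year = MBIT_PER_GBYTE / 50.0 / 100.0         # raw rate / "better RAM"
reboot_interval = 5 * 60 / (SECONDS_PER_YEAR * 1e-9)    # years between 5 min reboots
unrounded = critical_bytes_allowed(GBYTE, errors_per_year, reboot_interval)

if __name__ == "__main__":
    print(f"rounded inputs:   {rounded / 1024:.1f} kByte")
    print(f"unrounded inputs: {unrounded / 1024:.1f} kByte")
```

With the rounded inputs this reproduces the 52 kByte; without rounding the result is somewhat larger but of the same order of magnitude, so the conclusion does not change.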
One actually might be able to construct a system where only 52 kByte contain critical data (stack return addresses, process table, MMU data, etc.) but it will be a tall order. And this will get you 99.9999999 % only with regard to a single error source: RAM failures. What about other hardware failures, infrastructure failures, operator errors?
I’m convinced Erlang is a very good base for building reliable systems. Probably you can build more reliable systems in less time with the Erlang/OTP stack than with any other mainstream approach. But see, I consider Erlang mainstream here. For a much more heavy-handed approach to building highly reliable systems see “the space shuttle main engine controllers”, “the redundancy management in the Space Shuttle Avionics System” and An Assessment of Space Shuttle Flight Software Development Processes (1993).
But if a bunch of rocket scientists (OK, the CS crowd wrote the software) went through an extremely well thought-out process and were not able to get to 99.9999999 % reliability, everybody else who claims to have gotten there should have better evidence than marketing literature.
While I’m perfectly willing to believe that an AXD301 cluster so far has only dropped 1 in 1,000,000,000 calls, that doesn’t mean it has nine nines of reliability. I think the Erlang community is basically doing itself a disservice by claiming 99.9999999 % reliability. It is like the teenage boy who claims he has already kissed 10,000 girls. It theoretically could be true but it is much more likely that so far the boy hasn’t kissed a girl at all.
Why Dr. Armstrong writes in 2003 “For the Ericsson AXD301 the only information on the long-term stability of the system came from a power-point presentation showing some figures claiming that a major customer had run an 11 node system with a 99.9999999% reliability, though how these figure had been obtained was not documented.” (Armstrong, “Making reliable distributed systems in the presence of software errors”, p 191) and in 2007 “The AXD301 has achieved a NINE nines reliability (yes, you read that right, 99.9999999%). Let’s put this in context: 5 nines is reckoned to be good (5.2 minutes of downtime/year). 7 nines almost unachievable … but we did 9.” (What’s all this fuss about Erlang?) is a mystery to me. But if an academic source tells me there is no documented evidence of nine nines and some language advocacy page claims otherwise, the PhD thesis wins with me.
But to be frank, I’m also somewhat disturbed by these two claims on reliability. Compare the wording: “figures claiming that” (2003) to “has achieved” (2007). Perhaps there is data I’m not aware of. But as long as it is unpublished it is more or less irrelevant and just marketing hype.
If I search Google for 99.9999999 reliability I get 513 hits. If I search for 99.9999999 reliability Erlang, more than 10 % of them remain. So it seems that more than 10 % of all discussion of 99.9999999 % reliability on the internet is about Erlang (another 70% or so seems to be about the power grid, and we know how reliable that is).
Let’s see what we find on the Web on Erlang and reliability:
* [Erlang/OTP] has been used by Ericsson to achieve nine nines (99.9999999%) of availability.
* The most reliable Erlang-based systems [... have ...] 99.9999999% uptime.
* AXD301 [...] has a fault tolerance of 99.9999999% (9 nines!) That’s 31 ms a year.
* Erlang powers the telephone system in the UK with 31ms downtime per year - that’s 99.9999999% availability
* Erlang was designed for 99.9999999% uptime
The general sentiment seems to be that it is a fact that the AXD 301 reaches nine nines of availability/uptime/whatever. And as stated above I have big problems believing that. Just because a system didn’t go down for some time doesn’t allow you to reason about its reliability. Before a power failure drained the UPS, the server this blog runs on had an uptime of about 420 days. So it had NO downtime in a year. Does this mean 100% reliability? No.
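How little a failure-free observation period proves can be sketched with the standard zero-failure confidence bound. This is my own illustration, not something taken from the sources quoted above. It assumes that outages arrive as a Poisson process (constant rate, independent of each other) and that each outage costs 5 minutes, the reboot time assumed earlier; all function names are made up.

```python
import math


def max_failure_rate(observed_days, confidence=0.95):
    """Upper bound on failures per day after observing zero failures.

    With zero events in a Poisson process over time T, every rate up to
    -ln(1 - confidence) / T is still compatible with the observation.
    """
    return -math.log(1.0 - confidence) / observed_days


def supported_availability(observed_days, outage_minutes, confidence=0.95):
    """Availability that the failure-free observation actually supports."""
    downtime_fraction = max_failure_rate(observed_days, confidence) * outage_minutes / (24 * 60)
    return 1.0 - downtime_fraction


def failure_free_days_needed(nines, outage_minutes, confidence=0.95):
    """Failure-free days needed before `nines` nines can be claimed."""
    target_fraction = 10.0 ** (-nines)
    return -math.log(1.0 - confidence) * outage_minutes / (24 * 60) / target_fraction


if __name__ == "__main__":
    print(f"420 days without failure support {supported_availability(420, 5):.6%}")
    years = failure_free_days_needed(9, 5) / 365.25
    print(f"nine nines need about {years:,.0f} failure-free years")
```

Under these assumptions 420 days without an outage support a claim somewhere between four and five nines, and observation alone would have to run for tens of thousands of years before it could support nine.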
Erlang is an interesting language. OTP is great engineering. Erlang has considerable momentum compared to other languages with unusual concepts. There is no need to use 99.9999999 % claims which ring so hollow.
BTW: Besides various PowerPoint slides and such stuff I found a nice set of numbers on the AXD301 in “Four-fold Increase in Productivity and Quality - Industrial-Strength Functional Programming in Telecom-Class Products”, Ulf Wiger, 2001. There it is assumed that an AXD301 runs 1,460,000 LoC C/C++, 1,240,000 LoC Erlang and 27,000 LoC Java if you include Erlang itself. But then you would also have to include the C runtime and the Java VM & compiler and possibly the OS (Solaris), which would result in C being the absolutely dominating language. If we check only the ATM-switch-specific software, Wiger reports 1,000,000 LoC Erlang, 1,000,000 LoC C/C++ and 13,000 LoC Java. The popular Erlang advocacy documents report that the AXD 301 system includes 1.7 million lines of Erlang (e.g. Byte).
So obviously Erlang is not the only thing which makes an AXD 301 tick. I assume there is also a lot of clever reliability engineering in the C code and in the hardware.

Frames!
The wonderful Bicycle Repair Man refactoring tool is now available for TextMate.
Paul Bissex hacked it into a nice TextMate bundle. To install it do:
$ sudo easy_install http://bicyclerepair.sourceforge.net/bzr/bicyclerepair-nightly.tar.gz
$ wget http://e-scribe.com/software/python/biketextmate/biketextmate.tgz
$ tar xzvf biketextmate.tgz
$ mv BicycleRepairMan.tmbundle/ ~/Library/Application\ Support/TextMate/Bundles/
$ rm biketextmate.tgz
Restart TextMate.
Now you can open a Python file, press Ctrl+Shift+B and refactor away.
Enjoy!
P.S.: What is Refactoring? Refactoring is a disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior. (Martin Fowler)
I was aware that Python can reload modules on the fly, but I had never used it. Today I tried it and immediately got addicted.
Let’s say I got an exception in ERP.nachschub.MusterNachschubEngine().compute(). I just go to the editor and fix it in another window. Without restarting the Python interpreter I now can import the fixed version:
>>> reload(ERP.nachschub)
>>> m = ERP.nachschub.MusterNachschubEngine()
>>> m.compute()
Great!
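In Python 3 the built-in reload() has moved to importlib.reload(). The following is a self-contained sketch of the same fix-and-reload cycle; the module name nachschub_demo and its compute() function are made up for the demo and are written to a temporary directory, so nothing from the real ERP package is needed.

```python
import importlib
import sys
import tempfile
from pathlib import Path

BROKEN = "def compute():\n    return 1 / 0  # the bug\n"
FIXED = "def compute():\n    return 42\n"


def demo_reload():
    """Import a buggy module, fix its source on disk, reload it in place."""
    old_flag = sys.dont_write_bytecode
    sys.dont_write_bytecode = True  # keep stale .pyc files out of this quick demo
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "nachschub_demo.py"  # hypothetical module
        source.write_text(BROKEN)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("nachschub_demo")
            try:
                module.compute()
                first = "ok"
            except ZeroDivisionError:
                first = "failed"
            source.write_text(FIXED)   # "fix it in the editor"
            importlib.reload(module)   # re-executes the module in the same namespace
            second = module.compute()
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("nachschub_demo", None)
            sys.dont_write_bytecode = old_flag
    return first, second


if __name__ == "__main__":
    print(demo_reload())
```

One caveat: objects created before the reload keep referring to the old class definitions, which is why the session above creates a fresh MusterNachschubEngine() after reloading.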
Everybody and his dog babbles about Erlang being able to reach “nine nines reliability” and claims Joe Armstrong showed that in his PhD thesis. Did he? No, not at all. To cite him:
“Evidence for the long-term operational stability of the system had also not been collected in any systematic way. For the Ericsson AXD301 the only information on the long-term stability of the system came from a power-point presentation showing some figures claiming that a major customer had run an 11 node system with a 99.9999999% reliability, though how these figure had been obtained was not documented.” (Armstrong, “Making reliable distributed systems in the presence of software errors”, p 191, emphasis added by me)
So Erlang advocates: if you are claiming Erlang has proven to reach “nine nines” of reliability you are dishonest or don’t know what you are talking about – choose whichever you prefer.
Yesterday I published pyMessaging, a Python toolkit for accessing message brokers along the lines of JMS.
At the moment we use it with ActiveMQ. We have serious issues with the ActiveMQ/Stomp combo. ActiveMQ does not retain message ordering and under somewhat higher load it starts losing messages. These issues come and go as they like. Yesterday the unit tests were running smoothly, today 5% of them fail.
But for low-load, low-reliability applications it can be used. Still, it is very distressing that a software stack acts in such an unpredictable way.
I’m aiming to test pyMessaging with other brokers and implement additional protocols, like AMQP.
I just uploaded pyMessaging and checked the Python Package Index, just to see that the second most recent package was IPy, a piece of software originally written by me but nowadays maintained by INL. Strange coincidence.
“Software as we know it is the bottleneck on the digital horn of plenty. It takes up tremendous resources in talent and time. It’s disappointing and hard to change. It blocks innovation in many organizations.” (Charles Simonyi)
Perhaps RabbitMQ is an alternative to ActiveMQ. At first glance RabbitMQ is much smaller and thus better:
$ find activemq-core/src/main/java/org/apache/activemq/ -name '*.java' | xargs wc -l | tail -n 1
139184 total
$ find ./erlang/rabbit/src/ -name '*.erl' | xargs wc -l | tail -n 1
7541 total
ActiveMQ has roughly 18 times as many lines of code (LoC) as RabbitMQ. So by a rough estimate it also has about that many times as many bugs.
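For those without find and wc at hand, the same count can be sketched in Python. The function name count_lines is made up; like wc -l it counts newline characters, so a last line without a trailing newline is not counted.

```python
from pathlib import Path


def count_lines(root, suffix):
    """Sum the newline counts of all files below `root` ending in `suffix`."""
    total = 0
    for path in Path(root).rglob(f"*{suffix}"):
        if path.is_file():
            total += path.read_bytes().count(b"\n")
    return total


if __name__ == "__main__":
    # Paths as in the shell session above; adjust to your checkout.
    java = count_lines("activemq-core/src/main/java/org/apache/activemq/", ".java")
    erl = count_lines("./erlang/rabbit/src/", ".erl")
    print(java, erl)
    # Ratio of the two totals reported above.
    print(f"{139184 / 7541:.1f}")
```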
RabbitMQ uses AMQP as its native protocol. AMQP is more complex than the STOMP protocol promoted by ActiveMQ but of about equal complexity to OpenWire, the “real” protocol used by ActiveMQ internally. And AMQP seems to be backed by a reasonably large group of companies. There are a bunch of AMQP implementations available in the Apache Qpid project, which was funded by Red Hat. There also seem to be a few other implementations of AMQP brokers.
On the other hand, ActiveMQ doesn’t pass the Qpid AMQP compliance test (whoever may be to blame):
$ ./run-tests
Using specification from: ../specs/amqp.0-8.xml
Warning: duplicate id: Constant(name=xa_rbrollback, id=1)
Warning: duplicate id: Constant(name=xa_rbtimeout, id=2)
Warning: duplicate id: Constant(name=xa_heurhaz, id=3)
Warning: duplicate id: Constant(name=xa_rdonly, id=7)
................../Users/md/[...]/codec.py:98: DeprecationWarning: integer argument expected, got float
  self.write(pack(fmt, *args))
.........................Warning: duplicate id: Constant(name=xa_rbrollback, id=1)
Warning: duplicate id: Constant(name=xa_rbtimeout, id=2)
Warning: duplicate id: Constant(name=xa_heurhaz, id=3)
Warning: duplicate id: Constant(name=xa_rdonly, id=7)
....
^C