I recently went to a large meeting where a big company announced that it would migrate not only existing mainframe applications to the new z/OS platform from IBM but also existing distributed applications running on Unix-type midrange systems. Replacing those Unix boxes with, e.g., Windows-based boxes was considered neither cost-saving nor effective. Linux was not a topic either.
But is this really a surprise or does it just make sense for large corporations to move back to the mainframes? What's wrong with distributed computing?
I might have gotten parts of the answer a day before the meeting through a not-so-happy coincidence: a small link-checker program that I wrote with a colleague last year suddenly stopped working. Its job was to run Sunday night at 2:00 am and check all external links in a content management system against various databases, especially a big document database. It had worked for almost half a year and we hadn't changed anything.
Before we discuss the problem, I'd like to mention that it took two weeks to discover that it had stopped working. I remember another case in a company where an essential backup procedure didn't run for a couple of days because a critical Sun server had been taken down and nobody noticed that this box also started and controlled the backups. So this is already one problem of distributed systems: monitoring and dependency checking. I usually do one session on distributed systems management in my lecture, and one statement is: only if you can point to a remote object and the system management tells you exactly which business processes depend on this object do you have a maintainable and manageable distributed system. I guess we have a long way to go there.
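To make the "point to an object and get the dependent business processes" idea concrete, here is a minimal sketch in Python. It assumes that somebody maintains an inventory of "X depends on Y" facts, which is exactly the part that is missing in practice. All component names and the function affected_processes are hypothetical and only loosely modelled on the two incidents above.

```python
from collections import defaultdict, deque

# Hypothetical inventory: each entry reads "X depends on Y".
# All names are made up for illustration.
DEPENDS_ON = {
    "nightly-backup": ["backup-controller"],
    "backup-controller": ["sun-server-07"],
    "link-checker": ["document-db", "cms"],
    "document-db": ["sun-server-12"],
    "cms": ["sun-server-07"],
}

# The entries that count as business processes (also an assumption).
BUSINESS_PROCESSES = {"nightly-backup", "link-checker"}


def affected_processes(target, depends_on=DEPENDS_ON, processes=BUSINESS_PROCESSES):
    """Return the business processes that directly or indirectly depend on target."""
    # Invert the edges so we can walk from a component to its dependents.
    dependents = defaultdict(set)
    for item, requirements in depends_on.items():
        for requirement in requirements:
            dependents[requirement].add(item)

    # Breadth-first walk over everything that depends on the target.
    seen = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return sorted(seen & processes)


if __name__ == "__main__":
    # Which business processes break if this box is taken down?
    print(affected_processes("sun-server-07"))
```

The walk itself is trivial. The hard part is keeping the inventory in sync with the live systems, so treat this as an illustration of the question, not of a solution.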
But back to the original problem: the cause of the link-checker failure turned out to be organizational. Two weeks earlier, the document database maintainers had decided to change their maintenance schedule and do updates etc. on Sunday at 2:00 am. OK, that was why the checker stopped working. But weren't both dates documented somewhere? Surely, just like the constraints of the position module responsible for the position of the Ariane 5 during take-off were documented somewhere. The constraints clearly said what trajectory the rocket was supposed to take: the module came from the older Ariane 4 and was not fit for the much steeper trajectories of an Ariane 5. The module basically dumped core, clogged the system bus, and finally the rocket had to be destroyed - a half-billion-dollar failure beautifully described by Bertrand Meyer on a few excellent pages. Read this and then read about the Therac-25 radiotherapy disaster and lose all hope (;-)
Glad that I don't program rockets...
So what is the answer? Sadly, we have to say that we do not really control those pesky distributed systems. What would we need? How about a topic-map-based system that takes its data from live distributed systems instead of stale data from source code maintenance systems? Ever wondered where those compilers are that CORBA's interface repository was supposed to enable? Live meta-data from live objects?
BUT there is Google. "It seems to work nicely" was one of the responses I got. I don't know much about Google's architecture. All I have heard is that it runs on thousands of Linux boxes. But is this highly redundant architecture what we usually use for distributed systems? Nope. We live with lots of single points of failure, and I believe we have just gotten used to the fragile nature of those distributed systems.
AND there is peer-to-peer (Kazaa, Gnutella etc.). Yes, certainly very successful in private homes. I have yet to see successful business models based on P2P. But at least this stuff does work fairly reliably. Could it be because it does NOT NEED systems management? Scalability is achieved in P2P networks because every new node is not only a load problem but at the same time contributes to the resource pool by adding CPU and network resources. Clearly a scalable solution with a rather extreme architecture, and also far from our regular business application architecture.
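The scalability argument can be put into a back-of-the-envelope model. The sketch below is a deliberate simplification in Python: the two functions, the abstract "capacity units" and all numbers are assumptions made up for illustration. It ignores real-world overhead such as search traffic or nodes that only consume.

```python
def spare_capacity_central(nodes, server_capacity, demand_per_node):
    """Central server: capacity is fixed, demand grows with every client."""
    return server_capacity - nodes * demand_per_node


def spare_capacity_p2p(nodes, supply_per_node, demand_per_node):
    """P2P: every node adds demand but also contributes its own resources."""
    return nodes * (supply_per_node - demand_per_node)


if __name__ == "__main__":
    # Assumed numbers: a server with 1000 units, nodes that each need 2 units
    # and (in the P2P case) each contribute 3 units.
    for n in (100, 500, 1000):
        print(n, spare_capacity_central(n, 1000, 2), spare_capacity_p2p(n, 3, 2))
```

In this toy model the central server runs out of headroom once the node count passes a fixed limit, while the P2P pool keeps growing, but only as long as the average node contributes at least as much as it consumes.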
The self-healing autonomous systems from IBM have not appeared yet, at least not outside their sysplex. Aaaah, here we go: this is a distributed system. Yes it is, but it is tightly coupled and redundant. Not like Google, but not like our typical web application or portal infrastructure either.
Joachim Thomas, a friend of mine, showed me a nice IBM redbook about application migration to virtual Linux boxes on their mainframes. They offer several "patterns" for how to do this and when it makes sense. One pattern is to simply run previously separate apps in different Linux virtual machines. But this does not solve the dependency checking problem - except that perhaps the system administration on mainframes is better organized. What it does do is make an installation much cheaper, because a Linux VM is certainly cheaper than buying and installing a new Sun server.
Many companies standardize their hardware/OS setup for distributed applications, which in many cases results in fairly high hardware costs that are then dwarfed by the yearly maintenance costs. And sometimes the standard setup doesn't work at all: when I programmed an internet portal for a company, we wanted a physical architecture using many small boxes, each running one application server and perhaps not more than two clones. What we got was four huge Enterprise 4500 machines with many processors. It turned out that with JDK 1.1 the machine could not use more than two CPUs... But the E4500 was the standard equipment.
But the new mainframe patterns raise questions. Discussing these things, Joachim quickly noticed that in some cases the necessary bandwidth to the host might become a problem, and we started to distinguish necessary distribution from voluntary distribution: necessary when the bandwidth or latency to the host is a limiting factor, voluntary when the application could run just as well on a sysplex. And we got the feeling that voluntarily distributed apps could indeed run better on the host. Again the portal example: dependencies on backend services running on other midrange systems turned out to be the real killer. An information integration portal that also does some transactions would run much better on the host if most of the information sources and transactions ran on the host as well.
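As a rough sketch, the necessary/voluntary distinction can be written down as a simple rule in Python. The LinkProfile fields and the function distribution_kind are hypothetical names introduced here for illustration. In practice the thresholds would come from measurements and would rarely be this clear-cut.

```python
from dataclasses import dataclass


@dataclass
class LinkProfile:
    required_mbit: float    # bandwidth the application needs towards the host
    available_mbit: float   # bandwidth the link to the host can deliver
    max_latency_ms: float   # worst round trip the application tolerates
    host_latency_ms: float  # measured round trip to the host


def distribution_kind(link):
    """Return 'necessary' if the link to the host is the limiting factor."""
    bandwidth_limited = link.required_mbit > link.available_mbit
    latency_limited = link.host_latency_ms > link.max_latency_ms
    return "necessary" if bandwidth_limited or latency_limited else "voluntary"


if __name__ == "__main__":
    # Assumed numbers: plenty of bandwidth and low latency to the host.
    print(distribution_kind(LinkProfile(10, 100, 50, 5)))
```

An application that comes out as "voluntary" under such a rule is a candidate for consolidation on the host. The rule says nothing about cost or organizational constraints.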
Some final ramblings on distributed systems management. Joachim is working on a model to let small applications coexist on application servers running in different virtual machines. This is harder than it sounds because there are open questions: What is the initial size of those machines? How much should a "slot" cost? What happens when an application outgrows a machine? How do you balance internal loads and prevent one VM from allocating all resources (and how much kernel support is there for this)? But the cost reductions could be enormous. What is still required in this model is central system management. Applications need to be delivered and maintained exclusively by operations personnel.
Another friend of mine wanted to offer EJBs using an ASP model with WebSphere. This turned out to be impossible: system management has ONE root user - clearly not a distributable solution.
How many companies run distributed transactions across several midrange machines? When I started with CORBA many years ago, it quickly became evident that reliability decreased with distribution. Silvano Maffeis has shown the reliability deficits of distributed objects in his work on Electra. Later I worked with Component Broker from IBM, and when I learned about heuristic outcomes of transactions and how long distributed transactions take, I ended up saying: distributed computing clearly works best on the mainframe. Debugging multi-machine, multi-layer applications is far from easy, on J2EE as well as on DCOM. And when I look at the latest web-services standards (reliability, transactions, coordination, security), I get the feeling that they are now simply re-implementing well-known technologies with equally well-known problems.
So are Google, peer-to-peer and sysplex the new role models for distributed computing? I always thought about the "swarms" of distributed objects as one of many alternatives for distributed computing. But maybe it is only one of three alternatives. Maybe we really need much more redundancy to achieve the promised distributed reliability. This would require radically different programming models - actually: would those be programming models in the traditional sense of programmers doing the programming? (See, e.g., the promises feature in "E".)
Last note: we see a convergence of several technologies - web services, Grid computing, application servers. Rumor has it that IBM wants to integrate the Open Grid Service Architecture (OGSA) into WebSphere and JBoss. A brand-new redbook on Globus is now available. Let's see how this could again shift the balance between hosts and distributed systems.