Re: Let some other process fix the error (Long): msg#00358 lang.erlang.general
>>>>> "ja" == Joe Armstrong <joe@xxxxxxx> writes:

ja> You *can't do this with unix process like concurrency* - you can
ja> observe failure but not accurately diagnose the reason for
ja> failure.

Ja, it's sometimes nice to know if there was a particular type of failure in some other process ... but an OTP supervisor process cares very little about the type of failure one of its children experienced. Knowing the cause of the failure is important from a logging point of view. However, AFAIK, the only case where the supervisor itself cares is when the dead child was configured to be permitted to die without a restart.

Last week, I stumbled across the research of George Candea and Armando Fox at Stanford. They've been doing research into "crash-only computing" and "recursively rebootable" software systems. It takes an Erlang OTP person just a few minutes of reading their work before saying, "Wow, they're implementing OTP-like supervisor behaviors for Java systems."

Well, there are a few differences:

* Each recursively-restartable Java component must be running in its own JVM -- operating system process separation provides the smallest unit of failure. The same principle could be applied to a distributed system of independent OS processes communicating via CORBA or some other IPC mechanism.

* Restart behavior isn't configured at what an OTP person would call a particular supervisor process. Instead, there's a single recovery manager component which maintains a tree of component dependencies. All restartable components are leaves of the tree. It uses an OTP-like "all-for-one" (one_for_all, in OTP terms) component restart strategy: if a component monitoring agent notifies the recovery manager that a component has failed, the recovery manager will attempt to restart all components that have the same dependency tree parent. If that doesn't fix the problem, the manager restarts all components with the same dependency tree grandparent. And so on, until the system is running without faults.
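The escalation described in the second bullet can be sketched in a few lines of Python. This is only an illustration of the idea as summarized here, not Candea and Fox's code: the names `Node`, `recover`, `restart`, and `is_healthy` are hypothetical, and the sketch assumes the monitoring agent can be reduced to a single "is the system healthy now?" check.

```python
from typing import Callable, List, Optional


class Node:
    """A node in the dependency tree; only leaves are restartable components."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.parent: Optional["Node"] = None
        self.children: List["Node"] = []

    def add(self, child: "Node") -> "Node":
        """Attach a child node and return it, so trees can be built inline."""
        child.parent = self
        self.children.append(child)
        return child

    def leaves(self) -> List["Node"]:
        """All restartable components (leaves) at or below this node."""
        if not self.children:
            return [self]
        found: List["Node"] = []
        for child in self.children:
            found.extend(child.leaves())
        return found


def recover(failed: Node,
            restart: Callable[[Node], None],
            is_healthy: Callable[[], bool]) -> Optional[Node]:
    """Restart ever-larger groups of components until the fault clears.

    Starts with every leaf under the failed component's parent, then the
    grandparent, and so on up to the root. Returns the tree node whose
    leaves were restarted last, or None if even a restart of everything
    under the root did not help.
    """
    # Assumption: a component with no parent is restarted on its own.
    scope = failed.parent if failed.parent is not None else failed
    while scope is not None:
        for component in scope.leaves():
            restart(component)
        if is_healthy():
            return scope
        scope = scope.parent  # escalate: parent -> grandparent -> ...
    return None
```

In a real system `restart` would kill and relaunch the component's OS process (its JVM, in the Java case) and `is_healthy` would consult the monitoring agents; both are left as callbacks here so the escalation logic stays visible.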
For more info, see:

    http://www.stanford.edu/~candea/research.html
    http://swig.stanford.edu/public/projects/RR

In a past life, I tried preaching this kind of restart logic for a multi-OS-process, multi-OS-instance system. My preaching fell upon deaf ears, much to my chagrin. It's nice to see that I wasn't an (utterly, completely) crazy man raving in the wilderness. :-)

-Scott