Let some other process fix the error (Long): msg#00350, lang.erlang.general
On Wed, 23 Apr 2003, Jay Nelson wrote:

> These sorts of questions are what make erlang so interesting
> to me. They appear simple and broad brush at first, but if
> you look closely they are actually very subtle and involve
> tradeoffs that you would never recognize in other languages.
> In C, you must code defensively because once you core dump
> there are no options and with runaway code there is no telling
> what will happen next. In erlang you have so many choices it
> is difficult to decide what to do.
>
> ... a very nice description of Erlang philosophy ...

Amazingly, people like Jay (who I have never met) and many others seem to have intuitively understood the Erlang principles - this must be more by "osmosis" than by reading the documentation.

> Don't worry about errors, just let it fail. That is the Erlang way.
> Like all things Erlang it sounds simple and easy, but is actually a
> very subtle thing.

The real principle is "let some other process fix the error". The "let it fail" philosophy is a consequence of this.

Let me explain:

Erlang was *designed* for making fault-tolerant systems, so:

1) To make something fault tolerant you need at least *two* computers (obviously)

2) If one computer fails you must fix the error on the *other* computer.

This means that:

3) To fix an error you do not make any attempt to do it locally - you can't fix an error on a computer if the computer has just crashed - you must do it somewhere else.

In the Erlang model *everything* is a process - even computers - so we want the same semantics.

Thus processes do not do their own error recovery (they can't; they have crashed) - other processes must clean up after them. This is the "let some other process fix the error" principle.

Question 1: If some process (A) crashes, which of the many other processes in the system is responsible for the recovery operations?

Answer: Those processes which are linked to A.
Question 2: How do the linked processes know what to do - surely they need to know why A died?

Answer: The reason for the exit is sent as an argument in a signal which is sent from the dying process to all the processes in its link set.

To implement this places a number of requirements on the programming language and run-time system - namely:

1) We must be able to remotely detect errors

2) We must be able to automatically diagnose errors

"Let it fail" is often the *only* sensible thing to do. Let me explain...

- There are exceptions
- There are errors
- They are not the same thing

Start with exceptions:

The run-time system generates exceptions - these occur when the run-time system does not know what to do. For example, if a divide by zero condition occurs the run-time system does not know what to do - so what does it do? - it aborts the process with a {'EXIT', divide_by_zero} exception. This is fine and in line with our fault-handling philosophy "some other process will fix the error."

What about errors? Well, what is an error?

An error is "a deviation between what the program is supposed to do and what it is observed to do". What it is supposed to do is "what was in the specification".

Example (my favorite) -- let's suppose the spec says we are to write a function asm that turns a load opcode into the instruction 1 and a store opcode into the instruction 2.

This is easy:

    asm(load) -> 1;
    asm(store) -> 2.

Now suppose that the system tries to call asm(jump) - what should happen?

Suppose you are the programmer and you are used to writing defensive code (just like they taught you) - you'd write:

    asm(load) -> 1;
    asm(store) -> 2;
    asm(X) ->

and then what?? What code do you write? The programmer is now in the situation that the run-time system was faced with when it encountered a divide-by-zero situation: you cannot write any sensible code here - all you can do is terminate the program. Remember: "Some other process will fix the error".
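Going back to Question 2 and the divide-by-zero case, here is a minimal sketch of a linked process learning why another process died. The module name `exit_demo` and the helpers `watch/1`, `divide/2` and `demo/0` are made up for illustration. Note also that a current Erlang/OTP run-time reports an integer division by zero with the reason `badarith` plus a stack trace, rather than the `divide_by_zero` atom used above.

```erlang
-module(exit_demo).
-export([watch/1, divide/2, demo/0]).

%% Hypothetical helper: run Fun in a linked process and report how it
%% ended. Trapping exits turns the exit signal into an ordinary message.
watch(Fun) ->
    OldFlag = process_flag(trap_exit, true),
    Pid = spawn_link(Fun),
    Outcome =
        receive
            {'EXIT', Pid, normal} -> finished;
            {'EXIT', Pid, Reason} -> {crashed, Reason}
        end,
    %% Restore the caller's previous trap_exit setting.
    process_flag(trap_exit, OldFlag),
    Outcome.

%% No defensive check for B =:= 0; the run-time system aborts the
%% process that makes the call.
divide(A, B) -> A div B.

%% The worker dies; the linked caller is told why.
demo() ->
    watch(fun() -> divide(1, 0) end).
```

Calling `exit_demo:demo()` should return `{crashed, {badarith, Stacktrace}}`. The crashed process made no attempt at recovery; the linked process got both the fact of the failure and its reason.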
So maybe you write:

    asm(load) -> 1;
    asm(store) -> 2;
    asm(X) -> exit({bad_arg_to, asm, X}).

But why bother? The Erlang compiler compiles

    asm(load) -> 1;
    asm(store) -> 2.

almost as if it had been defined:

    asm(load) -> 1;
    asm(store) -> 2;
    asm(X) -> exit({bad_arg, asm, X}).

The defensive code *detracts* from the pure case and confuses the reader - and the diagnostic is often no better than that which the compiler supplies automatically.

Now the "some other process will fix the error" philosophy only makes sense if you have a process-based language with total independence between processes.

You *can't do this in a sequential language* - you get ONE try (your process) and if it crashes you lose control.

You *can't do this with thread based concurrency* - threads *share* resources (usually memory) - if one thread corrupts shared memory - disaster.

You *can't do this with unix process like concurrency* - you can observe failure but not accurately diagnose the reason for failure.

This design was not accidental - Erlang was *designed* to program fault-tolerant systems.

The key requirement, the *one* requirement that I always considered far more important than anything else, was to be able to make a system which could recover from software errors.

We knew that our systems would end up with millions of lines of code and be written by large teams of programmers - in such systems there are bound to be many mistakes.

I can think of no other way of programming such a system *without* independent processes.

The *reason* for independent processes is NOT efficiency (I don't give a hoot about efficiency) - it is to allow large teams of programmers to work together. Give each programmer their own processes to work with and let them hack away - if their process dies, who cares, "some other process will fix the error."

From this the worker-supervisor model is a short step away. The basic idea is "try to do something - if you can't do it give up and try to do something simpler."
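The asm example can be sketched end to end as follows. This is an illustration under assumptions, not code from the original post: the module name `asm_demo`, the helper `run/1` and the `unknown_opcode` tag are invented here. The worker contains only the two clauses from the spec, and the linked caller reads the diagnostic that the run-time system supplies (a `function_clause` exit reason whose stack trace carries the offending argument).

```erlang
-module(asm_demo).
-export([asm/1, run/1]).

%% Only the cases in the spec; no defensive catch-all clause.
asm(load)  -> 1;
asm(store) -> 2.

%% Hypothetical helper: evaluate asm/1 in a separate linked process
%% and let the caller, not the worker, decide what a crash means.
run(Opcode) ->
    OldFlag = process_flag(trap_exit, true),
    Parent = self(),
    Pid = spawn_link(fun() -> Parent ! {self(), asm(Opcode)} end),
    Result =
        receive
            {Pid, Code} ->
                %% The worker then exits normally; consume that signal.
                receive {'EXIT', Pid, normal} -> ok end,
                {ok, Code};
            {'EXIT', Pid, {function_clause, [{?MODULE, asm, [Bad], _} | _]}} ->
                %% Automatic diagnosis: no clause of asm/1 matched Bad.
                {error, {unknown_opcode, Bad}};
            {'EXIT', Pid, Reason} ->
                {error, Reason}
        end,
    %% Restore the caller's previous trap_exit setting.
    process_flag(trap_exit, OldFlag),
    Result.
```

With this sketch, `asm_demo:run(load)` should give `{ok, 1}` and `asm_demo:run(jump)` should give `{error, {unknown_opcode, jump}}`. At that point the caller is free to "try to do something simpler", which is the role a supervisor plays in the worker-supervisor model.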
There are two other points to note:

1) All programmers are not equal

Some are better than others - so let your better programmers design the error recovery strategies, and let them identify and write the code that does the error recovery.

2) All code is not equal

As Martin Björklund once said:

- There is code that can recover from errors
- There is code that will not recover from errors
- You have to make your mind up

In particular the error recovery code *must* be correct (so don't mess with error_handler.erl).

Taking 1) and 2) together you arrive at the following:

Try to structure your problem so that you can write it as "lots of regular 'pure' code with a well defined structure" *and* "a small module of stuff that sucks".

Get your inexperienced programmers to write "referentially transparent" *pure* code.

Get your lead programmers to write the messy stuff.

Now if you use OTP you get the mess for free - every time I look at the OTP stuff I think "Oh my goodness - couldn't it have been written in a *much* more simple manner" - I start hacking and only then do I remember *why* it was written as it was written.

There is an underlying logic to Erlang - which I have always known but find very difficult to explain - it is particularly difficult to explain it to programmers who think "threads" and "sequential" code.

My current "best argument" is one I've partially ventilated here:

" ... look, to make a fault-tolerant system you need TWO computers not ONE, right ... and if you've got TWO computers you need to start thinking about distributed programming *whether you like it or not*, and if you're going to do distributed computing then you'll have to think about the following ... and so on..."
The reaction is varied - everything from (rarely) "you're absolutely right" to (more commonly) "what about efficiency".

Me, I can make an incorrect program run arbitrarily quickly - that is no challenge. The following program, for example, computes factorial(10000000000) in less than a picotwinkle.

    factorial(N) -> 42.

/Joe