Re: catching errors from linked processes (Long): msg#00329 lang.erlang.general
These sorts of questions are what make erlang so interesting to me. They appear simple and broad brush at first, but if you look closely they are actually very subtle and involve tradeoffs that you would never recognize in other languages. In C, you must code defensively because once you core dump there are no options and with runaway code there is no telling what will happen next. In erlang you have so many choices it is difficult to decide what to do.

There are actually three issues at play in this question: how should code be structured; how should errors be reported; and how should processes be layered. Each of these issues depends on the situation at hand. Taken together they raise another issue: how do you plan to reuse or share the code and processes?

Code should be as simple as possible and no simpler. This is the standard erlang admonishment. Sounds well enough -- challenges your skill but makes sense. When it comes to real code though, it is actually deeper than that. The structure of code depends on the stage of development. There are at least 4 stages: get it working; make it clear; worry about errors; and worry about performance. They may not always be performed in this order, but in most cases it is better if they are.

Get it working is the great globs of wet clay stuff, where you hack for a few hours until you see reasonable results occasionally. You only feed it valid data with expected results and shape it until it looks like it is working.

Make it clear is when you understand the problem after having solved it once. You recognize the repetition and generalization and restructure the code to capture the problem space rather than to issue a handful of correct answers. Much code ends after this step.

Worry about errors is when the code structure is fairly stable, but you now consider what wacky input might arrive and what the consequences are. Even if "let it fail" is your motto, you need to reason about what that means at this stage.
The clarity model might get dramatically shifted to handle errors because operationally it turns out error handling is more important than code maintenance.

Worry about performance should only happen when enough of the other stages are correct so that slow performance is the most glaring problem. Again the code may be restructured against clarity to improve performance.

Simple as possible but no simpler means different things for each of the stages of "code development" (as in growth and maturity just like "child development").

Errors

Don't worry about errors, just let it fail. That is the erlang way. Like all things erlang it sounds simple and easy, but is actually a very subtle thing. I can think of three good reasons to let it fail: highlight errors in the code so they can be corrected; prevent something really bad from happening and reset to a known state; or notify external systems (or users) about bad data they are supplying so it can be corrected.

If you are still actively developing the code, failure is the quickest way to find the exact point where things went wrong. It facilitates finding coding errors even if other users report the error to you. Here the point of failure is most important, followed by information about the failure.

Runaway code can be dangerous in some situations. Banks don't want money given away (the US Govt once sent out thousands of checks with refunds for people who shouldn't have gotten them, but it was cheaper to let them keep it than to try to get it back), flying machines can't fail catastrophically, etc. Once the system detects that it is in trouble, failing can stop things from getting out of control as well as reset the state of the process to the initial known restart state. Stopping and restarting is more important here than why the failure occurred. (Could this be called "defensive coding"?)
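
As a rough illustration of failing in order to reset to a known state, here is a minimal sketch. The module name reset_demo, the account/1 process and its messages are made up for this example, and the hand-rolled restart in run/0 only stands in for whatever supervision a real system would use.

```erlang
-module(reset_demo).
-export([run/0, account/1]).

%% Hypothetical worker process. The guard is the only "defence": a
%% negative balance matches no clause, so the process dies with
%% function_clause instead of carrying on in a bad state.
account(Balance) when Balance >= 0 ->
    receive
        {withdraw, N} -> account(Balance - N);
        {get, From}   -> From ! {balance, Balance}, account(Balance)
    end.

%% Start a worker, push it into a bad state, then restart it from
%% the initial known state once its exit signal arrives.
run() ->
    Initial = 100,
    Old = process_flag(trap_exit, true),
    Pid1 = spawn_link(?MODULE, account, [Initial]),
    Pid1 ! {withdraw, 500},
    receive
        {'EXIT', Pid1, _Reason} -> ok   % why it failed matters less here
    end,
    Pid2 = spawn_link(?MODULE, account, [Initial]),
    Pid2 ! {get, self()},
    Result = receive {balance, B} -> {restarted, B} end,
    %% Clean up the demo worker and restore the caller's trap_exit flag.
    unlink(Pid2),
    exit(Pid2, kill),
    process_flag(trap_exit, Old),
    Result.
```

Calling reset_demo:run() should return {restarted,100}: the first account process dies as soon as its balance goes negative, and the replacement starts again from the initial balance. The state of the crashed process is discarded rather than repaired.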

When user input or input from other systems causes unexpected processing, failure can notify them that something should be altered, although it is essential that you determine what effect failure has on the external system (whether human or otherwise). In this case, the information conveyed is probably more important than the failure or where it happened. The external system has to change its behavior based on the error result.

In each of these three cases the type and amount of information communicated as part of the failure serves different goals and would vary accordingly.

Process layering

Supervisor processes can be used to monitor other processes. Typically they would do two things: relay error states and restart the failed process.

If collecting and logging the failure is useful, an external process can catch the failures, sort the errors and relay them to other processes, produce statistics or log them to files or databases. [During debugging I could imagine a pop-up window like the toolbar process window that showed a tally of process failures counted by type as detected in the error stack trace state.]

If the process is a service that needs to remain available, the supervisor can restart to a known function state (the initial state). It can also try to reason out why the failure occurred (or just use some simple rule of thumb) and try alternatives like restarting on another node or substituting a different process. After having no luck, it can fail itself and let a higher-level supervisor restart this supervisor, causing a downward cascade of restarts with new parameters supplied by higher-level logic.

Again, it depends on the goals and purpose as to what behavior the supervisor should take (if one is even necessary).

Code and process reuse

The partitioning of code and processes may get restructured as soon as you have a second system which needs to reuse some of the functionality.
Splitting the code into processes may facilitate reuse, recovery, failure and supervision. I'll have to check my list of reasons for processes. I'm sure I had one for abstraction and code reuse, but I don't think I had one for fault isolation.

Fault Isolation = use a process where failure will keep a bad state from propagating through the system causing more damage, and when a restart of the faulted process can produce a fully functioning system automatically (possibly after moving the process or changing its config parameters before restart).

The above shows that what erlang tries to do is provide a layered approach to code development. Make each layer understandable and uncluttered, layer process logic so that algorithm, error handling and recovery are not intertwined, and use processes judiciously to manage the abstraction, isolate faults and restart (or upgrade code) in pieces without requiring the entire system to fail or stop.

So the simple "let it fail" to me means:

1) Don't worry about errors when trying to get the code to work.

2) When organizing a system of processes (which may occur before #1), really, really worry about what *should* happen when a process dies. Architect the system to "do the right thing" and to dictate what the code should *intend* in the failure cases and how other processes should react.

3) Expect *every* line of code you write to fail the entire process and write the code so that the (un)intended failure happens in an intended way.

In Chris' code example he used a case statement and received an error that was vague. Using function clause failure provides more information. Here are three different ways of writing the same code.

    foo(Pred, X) ->
        case Pred of
            true -> bar(X)
        end.

This is the method that allows you to add error handling clauses in the case. If you really, really don't want a failure here, put an open ended clause that does something useful (and maybe causes the failure elsewhere).
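
To make the closed versus open-ended case concrete, here is a small sketch. The module name case_demo is invented, bar/1 is only a stand-in (the post does not show it), and the {error, {bad_pred, Other}} return value is an assumed convention, not something taken from Chris' code.

```erlang
-module(case_demo).
-export([closed/2, open/2]).

%% Stand-in for bar/1; hypothetical, just returns a tagged value.
bar(X) -> {ok, X}.

%% Closed case: anything other than true raises {case_clause, Pred}.
closed(Pred, X) ->
    case Pred of
        true -> bar(X)
    end.

%% Open-ended case: the catch-all clause turns the failure into a
%% return value, so any crash moves to whoever matches on the result.
open(Pred, X) ->
    case Pred of
        true  -> bar(X);
        Other -> {error, {bad_pred, Other}}
    end.
```

With this sketch, case_demo:closed(false, 1) fails on the spot with {case_clause,false}, while case_demo:open(false, 1) returns {error,{bad_pred,false}}. A caller that writes {ok, V} = case_demo:open(false, 1) then fails with a badmatch at the call site, which is the failure turning up elsewhere.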
If you don't need too much information to react, the closed case above will fail but not give a lot of info. In that scenario, the clarity of code would override the need for detailed failure info.

    foo(Pred, X) ->
        Pred = true,
        bar(X).

Here even less failure info is provided, although the code is probably clearer. In the first example, the reader wonders why the case is left dangling with something left out. Here the Pred is intentionally a roadblock.

    foo(true, X) -> bar(X).

You may end up with this form for clarity or for performance reasons. It is more concise, but relies more on the language and compiler. As it happens, it also provides more info in the failure case.

Any of the three choices is valid, but it depends on the stage of code development and the existence and sophistication of external processes and supervisor processes. If you later decide to reuse the code, the restrictions on failure cases, error reporting requirements or other constraints may change and cause a refactoring or rewriting of the code.

Notice that in C / Java you don't get the choice. You have to handle the error and assume the caller will understand the error flags you pass back. If the process fails you have no recourse and no second chance. You can use try and catch in Java but the algorithm becomes cluttered and the logic is confusing to get right.

jay
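
For comparison, the following sketch runs the three variants of foo/2 with a false predicate and collects what each failure reports. The module name foo_demo, the suffixes on foo, the stand-in bar/1 and the failure/1 helper are all invented for illustration, and the Class:Reason:Stack catch syntax assumes OTP 21 or later.

```erlang
-module(foo_demo).
-export([failures/0, foo_case/2, foo_match/2, foo_clause/2]).

%% Stand-in for bar/1; hypothetical.
bar(X) -> {ok, X}.

%% The three ways of writing the same code.
foo_case(Pred, X) ->
    case Pred of
        true -> bar(X)
    end.

foo_match(Pred, X) ->
    Pred = true,
    bar(X).

foo_clause(true, X) -> bar(X).

%% Run F and return the error reason together with the module,
%% function and arity-or-arguments from the top stack entry.
failure(F) ->
    try F() of
        _ -> no_failure
    catch
        error:Reason:Stack ->
            [{M, Fun, ArityOrArgs, _Loc} | _] = Stack,
            {Reason, {M, Fun, ArityOrArgs}}
    end.

%% Collect the failure report of each variant for Pred = false.
failures() ->
    [failure(fun() -> foo_case(false, x) end),
     failure(fun() -> foo_match(false, x) end),
     failure(fun() -> foo_clause(false, x) end)].
```

On a current OTP release foo_demo:failures() should report {case_clause,false} for the case version, {badmatch,false} for the match version, and function_clause for the clause version. Only in the last one does the top stack entry carry the actual arguments [false,x] rather than just the arity.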