On Fri, Sep 29, 2006 at 10:38:39AM -0700, Alexander Saydakov alleged:
> Hi!
>
> I would like to report another incident in which rebooting a few nodes
> crashed the server. The nodes had become unresponsive because of some
> other problem unrelated to Torque, so they were put offline and
> rebooted. This is not the first time that losing nodes has made the server
> unresponsive or led to a core dump.
>
> Core was generated by `pbs_server'.
> Program terminated with signal 11, Segmentation fault.
> Reading symbols from /usr/lib/libkvm.so.2...done.
> Reading symbols from /usr/lib/libc.so.4...done.
> Reading symbols from /usr/libexec/ld-elf.so.1...done.
> #0 0x1038636 in get_next ()
> (gdb) bt
> #0 0x1038636 in get_next ()
> #1 0x1012a7f in remove_job_delete_nanny ()
> #2 0x1013e5c in on_job_exit ()
> #3 0x1028c24 in dispatch_task ()
> #4 0x10042e7 in process_Dreply ()
> #5 0x1039f3d in wait_request ()
> #6 0x100f9c3 in main ()
> #7 0x1001fa6 in _start ()
>
> We run some pre-release version of Torque 2.1.2 on FreeBSD 4.10.
>
> This really worries me. Fault tolerance this fragile calls into question
> whether Torque is acceptable for a mission-critical production
> environment.
>
> Has anyone experienced anything like this? Is it FreeBSD-related? Is it hard
> to fix?
I'm pretty sure I fixed this in the 2.1.2 release. I have had nodes
rebooted during jobs dozens of times with 2.1.2 without an issue.
Can you diff your source against the 2.1.2 tarball?
Are you using keep_completed? Was the job forcibly purged?