
Segfault is pbs_sched in next_job(): msg#00091

Subject: Segfault is pbs_sched in next_job()
Hello All,

While working at a customer's site today, I ran into an issue where pbs_sched would segfault and die after a relatively short period of time. A large number of jobs are queued, and several jobs run fine until pbs_sched segfaults. I was running TORQUE 2.1.8; after the segfault I upgraded pbs_sched to version 2.1.9, and I still see the same behavior.

According to the backtrace below, the segfault occurs at last_job++, which simply increments an integer. I'm curious how incrementing an int could cause a segfault. last_job is an integer offset into an array, but even if this were an array out-of-bounds error, I don't think the segfault would occur until the array was actually accessed.

Below is a backtrace from GDB, and I've attached the core file. It's about 920k or so. I'd appreciate any help with this:

---SNIP---
[root@scyld sched_priv]# gdb pbs_sched core.13973
GNU gdb Red Hat Linux (6.3.0.0-1.143.el4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".

Core was generated by `pbs_sched'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/lib64/libtorque.so.0...done.
Loaded symbols for /usr/lib64/libtorque.so.0
Reading symbols from /lib64/tls/libc.so.6...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /lib64/libnss_beo.so.2...done.
Loaded symbols for /lib64/libnss_beo.so.2
Reading symbols from /usr/lib64/libbeoconfig.so.0...done.
Loaded symbols for /usr/lib64/libbeoconfig.so.0
Reading symbols from /lib64/libnss_bproc.so.2...done.
Loaded symbols for /lib64/libnss_bproc.so.2
Reading symbols from /usr/lib64/libbproc.so.2...done.
Loaded symbols for /usr/lib64/libbproc.so.2
#0 0x0000000000404cfd in next_job (sinfo=0x646c61636f6c2e64, init=0) at fifo.c:689
689           last_job++;
(gdb) bt
#0 0x0000000000404cfd in next_job (sinfo=0x646c61636f6c2e64, init=0) at fifo.c:689
#1  0x00000000004052de in scheduling_cycle (sd=1767992687) at fifo.c:432
#2  0x000000000040455d in main (argc=Variable "argc" is not available.
) at pbs_sched.c:1036
---END SNIP---
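
One detail in the backtrace above may be worth a closer look: the sinfo value (0x646c61636f6c2e64) and the sd value (1767992687) both look more like printable text than like a pointer or a descriptor. Here is a minimal sketch in Python that decodes them as raw bytes. It assumes little-endian byte order, an 8-byte pointer, and a 4-byte int, which matches the x86_64 configuration GDB reports; the helper name as_ascii is made up for this illustration and is not part of TORQUE or GDB.

```python
def as_ascii(value, width):
    """Show an integer as `width` little-endian bytes, rendered as ASCII.

    Non-printable bytes are shown as '?'. The width and byte order are
    assumptions based on the x86_64 target in the GDB banner.
    """
    raw = value.to_bytes(width, "little")
    return "".join(chr(b) if 32 <= b < 127 else "?" for b in raw)


# Argument values copied from frames #0 and #1 of the backtrace.
sinfo = 0x646c61636f6c2e64   # next_job(sinfo=...)
sd = 1767992687              # scheduling_cycle(sd=...)

print(as_ascii(sinfo, 8))    # bytes of the "pointer" as text
print(as_ascii(sd, 4))       # bytes of the "descriptor" as text
```

Both values decode to fragments that would fit inside a hostname ending in "localdomain". If that reading is right, the stack may have been overwritten by a string before next_job() ran, and line 689 could simply be the first place the damage became visible rather than the cause. That is only a guess from the argument values; I have not confirmed it against the source.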

-Joshua Bernstein
Software Engineer
Penguin Computing


