
Re: debugging around machine-checks...: msg#00143

os.freebsd.devel.alpha

Subject: Re: debugging around machine-checks...


Fred Clift writes:
> by hand I run dumpon -v /dev/da0b (which is my swap partition, twice what
> I have of ram in size)
>
> and then I do my fiddling with XFree86 that gives me the machine-check and
> I end up at the SRM prompt. At this point, I know that just booting will
> fail. I have to power-cycle the box and when it comes back up, savecore
> either doesn't find anything, or isn't being run by the rc scripts. Once
> I get a chance to log in /var/crash has only minfree in it...
>

That *should* work..

> Should I be doing something else?
>
> I just looked in /var/log/messages and saw no evidence of crashdumps being
> written (i.e. dumping to.... or dump 254 253 252 251... etc).

If you power-cycle, the message buffer is lost.

When I would crash X on an old miata, about half the time I'd get a
'machine check in pal mode' -- this doesn't even get caught by the
OS.

However, if you're seeing the message below, I do not understand
why you're not getting a crashdump.

In any case, since the problem is probably with the X server (based on
the message below), a crashdump would not help you.


>
> >
> > Can't you use the program counter from the panic output as a start?
> > If it's in the X server, there should be a PC from userspace.
> > (see disclaimer below)
> >
>
> So can you interpret this for me then - honestly I just don't know what all
> the fields represent -- I should probably just go read the source code and
> see :)
>
> Oct 8 06:42:24 liron /kernel: unexpected machine check:
> Oct 8 06:42:24 liron /kernel:
> Oct 8 06:42:24 liron /kernel: mces = 0x1
> Oct 8 06:42:24 liron /kernel: vector = 0x660
> Oct 8 06:42:24 liron /kernel: param = 0xfffffc0000006068
> Oct 8 06:42:24 liron /kernel: pc = 0x1604006ac
> Oct 8 06:42:24 liron /kernel: ra = 0x12006cb10
> Oct 8 06:42:24 liron /kernel: curproc = 0xfffffe0009910200
> Oct 8 06:42:24 liron /kernel: pid = 90765, comm = XFree86
> Oct 8 06:42:24 liron /kernel:
> Oct 8 06:42:24 liron /kernel: panic: machine check
>
>
> The program counter is pc? So I should be able to, with gdb and a
> debug-version of XFree86, figure out what code this is?

Yes, except it's in a shared lib, or other dynamically loaded text.
I don't know how you could debug that without a core dump.
The ra (return address) is at least somewhere in the main text
of the program (not a shared lib).
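
To make the pc/ra distinction concrete, here is a minimal sketch of how
one might test an address against a table of mapped regions and turn it
into an offset. The names `struct region`, `find_region`, and
`region_offset` are made up for illustration, and the caller has to
supply the region bounds (for example, from the process's memory map);
nothing here knows the real layout of XFree86.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one mapped text region. */
struct region {
    const char *name;
    uint64_t    start;  /* first address in the region (inclusive) */
    uint64_t    end;    /* one past the last address (exclusive) */
};

/*
 * Return the region containing addr, or NULL if addr falls outside
 * every entry in the table.
 */
const struct region *
find_region(const struct region *tab, size_t n, uint64_t addr)
{
    for (size_t i = 0; i < n; i++) {
        if (addr >= tab[i].start && addr < tab[i].end)
            return &tab[i];
    }
    return NULL;
}

/* Offset of addr from the start of the region that contains it. */
uint64_t
region_offset(const struct region *r, uint64_t addr)
{
    return addr - r->start;
}
```

If one assumes the main program text starts at 0x120000000 (an
assumption, not something stated in the panic output), the ra value
0x12006cb10 would sit at offset 0x6cb10 into it, while the pc value
0x1604006ac would fall outside that region entirely.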

<...>

> Your explanation is helpful, and perhaps I'll try your suggestion of
> turning userland machine checks into sigbus or something - I'm sure I'm
> just begging for trouble here, but at least this isn't a production
> machine that other people depend on :).
>
> To send a signal to a process from within the kernel, it seems I just call
>
> psignal(pid, signo)
>
> - is this right?
>

More or less. I think trapsignal may be more correct.
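
A rough userland sketch of the decision being discussed, not actual
kernel code. `mc_action`, `fake_proc`, and `mc_decide` are hypothetical
names; in the real machine_check() the inputs would come from the trap
frame and curproc, and the signal branch would call trapsignal() or
psignal(). As far as I recall, both of those take a `struct proc *`
rather than a pid, so check the prototypes in your source tree. The
guard on the process name is the safety check proposed in this thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real kernel types. */
enum mc_action {
    MC_PANIC,   /* keep the existing behaviour: panic("machine check") */
    MC_SIGNAL   /* deliver a signal (e.g. SIGBUS) to the process instead */
};

struct fake_proc {
    int         pid;
    const char *comm;   /* process name, e.g. "XFree86" */
};

/*
 * Decide how to handle a machine check.  Only a check taken in user
 * mode, in a process whose name starts with 'X', is turned into a
 * signal; everything else still panics.
 */
enum mc_action
mc_decide(bool from_usermode, const struct fake_proc *p)
{
    if (!from_usermode || p == NULL || p->comm == NULL)
        return MC_PANIC;
    if (p->comm[0] != 'X')
        return MC_PANIC;
    return MC_SIGNAL;
}
```

Keeping the decision in a small pure function like this makes it easy
to test outside the kernel before wiring it into the real handler.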

> Thanks very much for your information - looks like a little check in
> machine_check() in interrupt.c will do pretty much what I want - perhaps
> I'll make sure that my hack only works on processes whose name starts
> with 'X' or something just to be safe....

Good luck to you!!

Drew

To Unsubscribe: send mail to majordomo@xxxxxxxxxxx
with "unsubscribe freebsd-alpha" in the body of the message


