Seeing memory corruption, GC moves my objects around: msg#00326

lisp.cmucl.devel

Subject: Seeing memory corruption, GC moves my objects around

Hi,

I see memory corruption in a system I am maintaining, and I have
tracked it down far enough to say that the GC is overwriting heap
memory that holds valid objects. I am on late CMUCL 18 on x86-linux.
All memory spaces are moved and resized, but otherwise the build is
not noticeably different from stock CMUCL 18.

I am looking for strategies to debug this. In particular, I'd like to
hear theories about how we could confuse the GC through errors in our
code, e.g. wrong type declarations.

I have my more concrete questions at the end after explanations, sorry
for the long mail.

%%

Why the GC is the prime suspect:
--------------------------------

First of all, here are the steps that lead me to believe it is the GC
overwriting stuff:

- I have one setup of CMUCL binary, application code and application
data that makes it reproducible in > 50% of all runs, with the
memory corrupted at precisely the same time and place.

- the data being corrupted are structs in a hashtable which is pointed
to by a special variable.

- I wrote a checker function which walks all entries in the
hashtable and looks into the struct members.

- the code where I can reproduce it looks like this:

(dotimes (i *max-something* t)
  (when (and (= ...)
             ;; Some comparison stuff.
             ;; There is only readonly code here!
             (= ...))
    (return nil)))

*max-something* is always 6. I modify the function like this:

(dotimes (i *max-something* t)
  (hashchecker) ; that is my mentioned function inspecting all
                ; entries in the hashtable.
  (when (and (= ...)
             (= ...))
    (return nil)))

The hashchecker finds entries in the hashtable to be damaged on the
3rd of 6 runs through this loop. And there is only readonly code
here. It is always the same entry in the hashtable that is damaged.
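
A checker of the kind described above could be sketched roughly as
follows. This is only an illustration, not the actual code: the struct
name entry, its text slot, the variable *table* and the helper
entry-ok-p are hypothetical stand-ins, since the real definitions are
not shown in this mail.

```lisp
;; Hypothetical stand-ins for the real struct and special variable.
(defstruct entry
  (text nil))                        ; undeclared slot: NIL or a string

(defvar *table* (make-hash-table :test 'equal))

(defun entry-ok-p (value)
  "True if VALUE is an ENTRY whose TEXT slot holds NIL or a string."
  (and (entry-p value)
       (typep (entry-text value) '(or null string))))

(defun hashchecker (&optional (table *table*))
  "Walk every entry of TABLE and return the keys of damaged entries."
  (let ((damaged '()))
    (maphash (lambda (key value)
               (unless (entry-ok-p value)
                 (push key damaged)))
             table)
    damaged))
```

An empty result means every entry looked sane. On a heap that is
really corrupted, even looking at the slot may fail, so a real checker
would probably want to be more defensive than this.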

I further modify the code like this:

(ext:gc-off)
(dotimes (i *max-something* t)
  (hashchecker) ; [1*]
  (when (and (= ...)
             (= ...))
    (return nil)))
(ext:gc-on)
(hashchecker) ; [2*]

Then the loop runs through, [1*] never sees corruption, but [2*]
finds the hashtable entries corrupted. Again, there is no code
writing any data structure either in the loop or after it.

So that is why I am convinced it is the GC moving stuff behind my
back.

(It is not clear to me why the GC runs at all during the dotimes; it
should be consing-free. I need to look into this, but it is rather a
blessing if it makes it reproducible.)
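
One way to check whether the loop really is consing-free would be to
compare the allocation counter before and after it. The sketch below
assumes ext:get-bytes-consed on CMUCL (sb-ext:get-bytes-consed on
SBCL); the helpers bytes-consed and consing-of are hypothetical names
made up for this illustration.

```lisp
(defun bytes-consed ()
  "Allocation counter of the running Lisp, or NIL if none is known."
  #+cmu (ext:get-bytes-consed)
  #+sbcl (sb-ext:get-bytes-consed)
  #-(or cmu sbcl) nil)

(defun consing-of (thunk)
  "Call THUNK and return the number of bytes allocated while it ran,
or NIL if this implementation offers no counter."
  (let ((before (bytes-consed)))
    (funcall thunk)
    (let ((after (bytes-consed)))
      (and before after (- after before)))))
```

Wrapping the dotimes form in a lambda and passing it to consing-of
would show whether anything is allocated; a nonzero result would mean
the loop conses after all. Note that a checker called inside the loop
may itself allocate and would be counted too.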

%%

Error symptoms:
---------------

While I can get the memory corruption at precisely this place most
times I run this test, the data I find where I expected my structs
varies. It is always the same field in the struct that is damaged. It
is a string field, not declared to be of any specific type, filled
with either nil or a string. After corruption the contents of that
undeclared text field are most often a float, often a bignum, less
often an unprintable object or an "#<Unknown Pointer Object, type=#xD0
{44119F8F}> cannot be coerced to a string." (our dynamic heap starts
at 0x44000000).

Sometimes, but not often, the Lisp later dies with GC lossage after
finding an object with impossible tag bits on the heap.

I can reproduce this precise memory corruption in this CMUCL version
with this application version, some of many similar datasets and only
on one Linux kernel version(!). But I do not see large-scale memory
corruption from other places, just this one, and it is either there
right at the beginning or else the Lisp never(tm) dies.

There must be something very specific about this situation that causes
a local, identifiable memory corruption right here.

%%

Whining:
--------

So my questions for debugging strategies are:

1) assume we have wrong type declarations. What would be a scenario
where we promise a value to be of one type, put in another and
confuse the GC?

If we promised something to be a fixnum and put a pointer-containing
object in it, I assume that could do it.

2) is there maybe a problem with hashtables and GC, maybe with
hashtables in special variables? Anybody ever found such a problem?
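
Regarding question 1, one experiment would be to recompile the
suspect code with high safety, so that a broken type promise is
signalled instead of silently trusted. The sketch below assumes that
the compiler treats declarations as assertions in safe code, as CMUCL
documents; other implementations may not check them. The names
counter, record-hits and broken-promise-p are hypothetical and only
serve the illustration.

```lisp
;; Hypothetical struct with a slot that is promised to be a fixnum.
(defstruct counter
  (hits 0 :type fixnum))

(defun record-hits (counter value)
  ;; With high safety the FIXNUM promise is checked at run time
  ;; (assumption: CMUCL/SBCL-style declarations-as-assertions).
  (declare (optimize (safety 3) (speed 0))
           (type fixnum value))
  (setf (counter-hits counter) value))

(defun broken-promise-p (value)
  "Return T if storing VALUE trips a type check, NIL if it is accepted."
  (handler-case (progn (record-hits (make-counter) value) nil)
    (type-error () t)))
```

If the same code is compiled with (safety 0), no such check is made,
which is the kind of situation question 1 is asking about.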

Generally, any suggestions for what I can do to figure out what is
going on here?

I can rebuild CMUCL and still reproduce the problem, so hacks to the GC
code would be no problem.

Thanks
Martin
--
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Martin Cracauer <cracauer@xxxxxxxx> http://www.cons.org/cracauer/
No warranty. This email is probably produced by one of my cats
stepping on the keys. No, I don't have an infinite number of cats.
