Seeing memory corruption, GC moves my objects around: msg#00326 lisp.cmucl.devel

Hi,

I see memory corruption in a system I am maintaining, and I have tracked it down far enough to say that the GC is overwriting heap memory in which we have valid objects. I am on late CMUCL 18 on x86-linux. I have all memory spaces moved and resized, but the build is otherwise not noticeably different from stock CMUCL 18.

I am looking for strategies to debug this. In particular I'd like to hear theories on how we could confuse the GC by errors in our code, e.g. by wrong type declarations. My more concrete questions are at the end, after the explanations. Sorry for the long mail.

%% Why the GC is the prime suspect:
--------------------------------

First of all, here are the steps that lead me to believe it is the GC overwriting stuff:

- I have one setup of CMUCL binary, application code and application data that makes it reproducible in > 50% of all runs, with the memory corrupted at precisely the same time and place.

- The data being corrupted are structs in a hashtable which is pointed to by a special variable.

- I wrote a checker function which walks all entries in the hashtable and looks into the struct members.

- The code where I can reproduce it looks like this:

    (dotimes (i *max-something* t)
      (when (and (= ...) ;; Some comparison stuff.
                 ;; There is only readonly code here!
                 (= ...))
        (return nil)))

*max-something* is always 6.

I modify the function like this:

    (dotimes (i *max-something* t)
      (hashchecker) ; that is my mentioned function inspecting all
                    ; entries in the hashtable.
      (when (and (= ...)
                 (= ...))
        (return nil)))

The hashchecker finds entries in the hashtable to be damaged on the 3rd of 6 runs through this loop. And there is only readonly code here. It is always the same entry in the hashtable that is damaged.

I further modify the code like this:

    (ext:gc-off)
    (dotimes (i *max-something* t)
      (hashchecker) ; [1*]
      (when (and (= ...)
                 (= ...))
        (return nil)))
    (ext:gc-on)
    (hashchecker) ; [2*]

Then the loop runs through, [1*] never sees corruption, but [2*] finds the hashtable entries corrupted. Again, there is no code writing to any data structure, either in the loop or after the loop.

So that is why I am convinced it is the GC moving stuff behind my back. (It is not clear to me why the GC runs at all during the dotimes, since it should be consing-free. I need to look into this, but it is rather a blessing if it makes the problem reproducible.)

%% Error symptoms:
---------------

While I can get the memory corruption precisely at this place most times I run this test, the data I find where I expected to find my structs varies. It is always the same field in the struct that is damaged. It is a string field, not declared to be of any specific type, filled with either nil or a string.

On memory corruption the contents of that undeclared text field are most often a float, often a bignum, less often an unprintable object or an "#<Unknown Pointer Object, type=#xD0 {44119F8F}> cannot be coerced to a string." (our dynamic heap starts at 0x44000000).

Sometimes, but not often, the Lisp later dies with GC lossage after finding an object with impossible tag bits on the heap.

I can reproduce this precise memory corruption in this CMUCL version with this application version, with some of many similar datasets, and only on one Linux kernel version(!). But I do not see large-scale memory corruption from other places, just this one, and it is either there right at the beginning or else the Lisp never(tm) dies. There must be something very specific about this situation that causes a local, identifiable memory corruption right here.

%% Whining:
--------

So my questions for debugging strategies are:

1) Assume we have wrong type declarations. What would be a scenario where we promise a value to be of one type, put in another, and confuse the GC? If we promised something to be a fixnum and put a pointer-containing object in it, I assume we could do that.

2) Is there maybe a problem with hashtables and GC, maybe with hashtables in special variables? Has anybody ever found such a problem?

Generally, any suggestions on what I can do to figure out what is going on here? I can rebuild CMUCL and still reproduce the problem, so hacks to the GC code would be no problem.

Thanks
Martin
--
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Martin Cracauer <cracauer@xxxxxxxx> http://www.cons.org/cracauer/
No warranty. This email is probably produced by one of my cats stepping on the keys. No, I don't have an infinite number of cats.
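
For readers who want to try the same approach, here is a minimal sketch of the kind of checker the mail describes: a function that walks every entry of a hashtable and verifies that one struct slot still holds either nil or a string. The mail does not show the real hashchecker, so every name here (ENTRY, its TEXT slot, *ENTRIES*, CHECK-ENTRIES) is a hypothetical stand-in, and the code uses only portable Common Lisp.

```lisp
;; Hypothetical stand-ins for the application's struct and special
;; variable; the original definitions are not shown in the mail.
(defstruct entry
  (id 0)
  (text nil))   ; undeclared slot, expected to hold NIL or a string

(defvar *entries* (make-hash-table :test #'equal))

(defun check-entries (&optional (table *entries*))
  "Return the keys of all entries in TABLE that look damaged: either the
value is no longer an ENTRY, or its TEXT slot is neither NIL nor a string."
  (let ((bad-keys '()))
    (maphash (lambda (key value)
               (unless (and (entry-p value)
                            (let ((text (entry-text value)))
                              (or (null text) (stringp text))))
                 ;; Collect only the key. Printing or returning a value
                 ;; that may be corrupted could itself crash the Lisp.
                 (push key bad-keys)))
             table)
    bad-keys))
```

Calling (check-entries) at the points marked [1*] and [2*] in the mail would then return an empty list while the table is intact and a list of offending keys once an entry has been damaged. The checker deliberately reports keys rather than the damaged values, on the assumption that touching a corrupted object as little as possible is safer.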