Subject: Re: Re: bit error rates - msg#00038
List: linux.file-systems.yaffs
On 2/15/06, Peter Barada <peterb@xxxxxxxxxxx> wrote:
> Well my perspective would be BadThing if it's either 1% or 50% of the
> units suffering block losages like that.
If it's 1% and this is configurable, then there's no problem. I'd
leave the current semantics and live with a small return rate if I
felt data integrity was improved.
> In either case, you end up with 30-50% of your available space being lost.
True. But if this is configurable then people can decide. The fact
that nobody has patched the current behavior publicly suggests that
not enough people have problems with it.
> Imagine a 1GB iPod type device that after a year turns into a 0.5GB iPod. I
> can imagine customers would get pretty bent out of shape over that...
Actually, I bet they get 1% return rates or something anyway, so
that's not a big problem :-) I'm already on my second iPod (a
horribly designed thing, since the filesystem can get trashed far too
easily, but I got it because I wanted to run ipodlinux). In the case
of an iPod, most people care less, and Apple certainly expects you to
have copies of all of your *uhum* paid music in iTunes anyway.
Jon.
Thread at a glance:
Previous Message by Date:
Re: Delayed list mail
On Wed, 15 Feb 2006, Wookey wrote:
OK, no problems. Sorry for a last email...
> There has been a small flood of pent-up mail arriving on the YAFFS list over
> the last few hours. That is my fault. I turned mailman off to do some server
> admin last week and forgot to turn it back on again until yesterday. The
> mail sent during that time has now come through. (That's why your patch was
> delayed Sergey - it's not a conspiracy).
>
> I don't think any mail has been lost. Anything you sent that hasn't turned
> up should be resent. Apologies for the delay.
>
> Wookey
> --
> Aleph One Ltd, Bottisham, CAMBRIDGE, CB5 9BA, UK Tel +44 (0) 1223 811679
> work: http://www.aleph1.co.uk/ play: http://www.chaos.org.uk/~wookey/
>
> _______________________________________________
> yaffs mailing list
> yaffs@xxxxxxxxxxxxxxxxxxxxxx
> http://stoneboat.aleph1.co.uk/cgi-bin/mailman/listinfo/yaffs
>
---
******************************************************************
* KSI@home KOI8 Net < > The impossible we do immediately. *
* Las Vegas NV, USA < > Miracles require 24-hour notice. *
******************************************************************
Next Message by Date:
Fwd: bit error rates]
---------- Forwarded message ----------
From: Jon Masters <jonmasters@xxxxxxxxx>
Date: Feb 15, 2006 11:53 PM
Subject: Re: [Yaffs] bit error rates]
To: William Watson <wjw1961@xxxxxxxxx>
On 2/15/06, William Watson <wjw1961@xxxxxxxxx> wrote:
> I will also note that a NAND vendor who paid us a visit at about that same
> time said that we should expect WORSE soft error behaviour with succeeding
> generations of NAND flash chips. The geometries would get smaller and
> smaller, the chip dies would get larger and larger, and the amount of time
> for production testing of each chip would not increase, or at least, not
> increase as fast as the total storage of a chip. Thus, the testing per page
> would only go down in subsequent generations of chips. These two statements
> seemed to say that we would see both (1) increased rates of ECC errors, and
> (2) an increase in the number of marginal blocks not marked bad by the chip
> vendor.
But this sounds like it might be better to be additionally cautious -
I agree that marking OOB data is a good idea, maybe I'll get to look
at that.
> Another obvious alternative strategy for preventing data loss due to
> accumulation of multiple bit errors would be to periodically read the entire
> data array, checking for ECC errors. You'd want to calculate the impact
> that such reading would have on the rate of appearance of errors, as well as
> the impact on system and NAND performance. For a standard file system, it
> might suffice to perform one additional data chunk read for every N read
> requests, incrementing the "scrub" page each time. This would ensure a
> complete read scrub at a fixed percentage overhead. One could also perform
> a read scrub every M write operations, if desired.
A low priority kernel thread which sat and got woken up about as much
as kswapd probably wouldn't have much impact but could do this - and
only run when nothing else is using the flash part. This would be
better in the MTD layer though and might necessitate some changes to
the locking currently used.
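
The "one extra chunk read for every N read requests" idea quoted above can be sketched roughly as follows. This is only an illustration in plain user-space C, not YAFFS or MTD code: the names read_scrubber, scrub_after_read, scrub_read_fn and count_scrub_read are all hypothetical, and the callback stands in for whatever routine really reads a page and checks its ECC.

```c
#include <stdint.h>

/* Hypothetical bookkeeping for a read scrub that piggybacks on normal reads. */
struct read_scrubber {
    uint32_t total_pages;        /* number of pages in the scrubbed region */
    uint32_t interval;           /* N: one scrub read per N normal reads */
    uint32_t reads_since_scrub;  /* normal reads seen since the last scrub */
    uint32_t next_page;          /* the "scrub" page, advanced on each scrub */
};

/* Stand-in for a routine that reads one page and lets ECC check it. */
typedef void (*scrub_read_fn)(uint32_t page, void *ctx);

/*
 * Call once after every normal read request. On every Nth call it issues one
 * extra read of the current scrub page and advances the scrub pointer,
 * wrapping at the end of the region. Returns the page scrubbed, or -1 if no
 * scrub read was due (or the configuration is unusable).
 */
static int64_t scrub_after_read(struct read_scrubber *s,
                                scrub_read_fn read_page, void *ctx)
{
    if (s->total_pages == 0 || s->interval == 0)
        return -1;
    if (++s->reads_since_scrub < s->interval)
        return -1;

    s->reads_since_scrub = 0;
    uint32_t page = s->next_page;
    s->next_page = (page + 1) % s->total_pages;
    read_page(page, ctx);
    return page;
}

/* Trivial example callback: just counts how many scrub reads were issued. */
struct scrub_count {
    uint32_t reads;
};

static void count_scrub_read(uint32_t page, void *ctx)
{
    (void)page;
    ((struct scrub_count *)ctx)->reads++;
}
```

With this shape the overhead is one extra read per N requests, and a complete pass over the region takes N * total_pages normal reads. The same counter could be driven from write operations instead, for the "every M writes" variant.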
Jon.
Previous Message by Thread:
Re: Re: bit error rates
On Thu, 2006-02-16 at 12:43 +1300, Charles Manning wrote:
> On Thursday 16 February 2006 02:25, Jon Masters wrote:
> > On 2/10/06, Charles Manning <manningc2@xxxxxxxxxxxxx> wrote:
> > > I think an interrupted erase is probably more likely to cause
> > > problems, but again this is just a hunch.
> >
> > I wonder how we could implement logic to detect this.
> >
> > > Dealing with an interrupted write is relatively straightforward. It
> > > will always be the last page written before the system went
> > > down. Most of the time (except for the last page written to a
> > > block), we can detect the last page because it is the last page
> > > in the currently allocated block.
> >
> > I don't think this is currently tested on mount though.
>
> That is correct, it is not being done at present. I was thinking as to how it
> might be done.
> >
> > > It would be nice to improve this, but as Jon says, I think data
> > > integrity should always come first!
> >
> > Other people seem to disagree with my previous suggestions and I'm not
> > saying I can't be wrong in the matter :-) But I've not seen excessive
> > numbers of blocks being marked bad (except when fixing the OOB
> > code...) with read ECC failures. I accept though that this might just
> > be good old fashioned paranoia so if one of the vendor folks on this
> > list can comment, it would really help.
>
> Some people have reported seeing a large number of blocks (~30-50%) being
> retired on some devices. That's obviously not a GoodThing, but I'd like to
> see what % of units failed. Then, how does one measure and evaluate this?
>
> To my mind, if you ship 1000 units and half of them lose 30-50% of their
> blocks in a year of normal use, that's probably a BadThing. If this only
> happens on 1% of shipped units it might be an OKThing (depending on your
> perspective).
Well my perspective would be BadThing if it's either 1% or 50% of the
units suffering block losages like that. In either case, you end up
with 30-50% of your available space being lost. Imagine a 1GB iPod type
device that after a year turns into a 0.5GB iPod. I can imagine
customers would get pretty bent out of shape over that...
> However, losing data is also a BadThing.
>
> It's one of those rock-and-hard-place sandwich choices. Any mods will be
> configurable to allow current semantics.
>
> -- Charles
>
Next Message by Thread:
Re: Re: bit error rates
Peter Barada wrote:
On Thu, 2006-02-09 at 23:13 +0000, Sergei Sharonov wrote:
Yes, I have. I use a YAFFS1 NOR-based system, and on each write we lay
down the data chunk, and then the tag. In the unlikely event that a
power-cycle occurs while writing the data, the tag is still empty, but
some of the data chunk is not erased, and the next time a write occurs
into that chunk, YAFFS sees that the write fails since the previous data
was written (and retires the whole block), even though the tag indicated
the chunk is empty.
To fix this, I used two bits in the pageStatus byte in the tag, and
write the tag first, then the data, and then update the tag. Assuming
that the pageStatus starts out as 0xff, the first tag write puts in
the value of the tag, but writes a pageStatus byte of 0xfe to indicate
that a write is in progress, then writes the chunk data, and then comes
back and re-writes the tag with the same data, and a pageStatus of 0xfc.
In the rest of the code, the chunk is assumed to be valid if the
pageStatus is 0xff (and objectId is non-0xfffff) or if 0xfc, empty if the
objectId is 0xfffff, and deleted if the pageStatus is either 0xfe, or
0x00 (the value written to delete a tag).
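
One reading of the rules above, as a minimal C sketch. It is not the actual patch: the enum, the macro names and both function names are made up for illustration, the 0xfffff objectId value is taken as written in the message, and treating any other pageStatus value as deleted is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* pageStatus values described in the message (macro names are hypothetical) */
#define PAGE_STATUS_ERASED      0xffu  /* tag never written */
#define PAGE_STATUS_IN_PROGRESS 0xfeu  /* first tag write done, data pending */
#define PAGE_STATUS_COMPLETE    0xfcu  /* data written, tag re-written */
#define PAGE_STATUS_DELETED     0x00u  /* value written to delete a tag */
#define OBJECT_ID_UNUSED        0xfffffu  /* as given in the message */

enum chunk_state { CHUNK_EMPTY, CHUNK_VALID, CHUNK_DELETED };

/* Classify a chunk from its tag, following the rules in the message. */
static enum chunk_state classify_chunk(uint8_t page_status, uint32_t object_id)
{
    if (page_status == PAGE_STATUS_COMPLETE)
        return CHUNK_VALID;
    if (page_status == PAGE_STATUS_ERASED)
        return object_id == OBJECT_ID_UNUSED ? CHUNK_EMPTY : CHUNK_VALID;
    /* 0xfe (interrupted write) and 0x00 (deleted); assumption: any other
     * value is also treated as deleted. */
    return CHUNK_DELETED;
}

/* Flash programming can only change bits from 1 to 0 without an erase,
 * so every status update must be reachable by clearing bits only. */
static bool only_clears_bits(uint8_t old_status, uint8_t new_status)
{
    return (new_status & ~old_status) == 0;
}
```

The sequence 0xff, 0xfe, 0xfc, 0x00 clears bits at every step, which is why the tag can be updated in place without an intervening erase.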
This solved the problem for me. I assume an approach like this would
work for NAND...
It works for NAND only if the additional partial page programming
fits the NAND specifications.
I have looked at the Samsung datasheet and it says:
"The number of consecutive partial page programming
operation within the same page without an intervening erase
operation should not exceed 2 for main array and 3 for spare array."
So it should be OK for Samsung, since in this case it is 1 + 2.
However, Toshiba says: "Multiple partial page programming attempts
in a block can aggravate this error symptom", referring to the program
disturb soft error.
I think a better solution is to check the power fail flag before any
erase/programming cycle, as suggested by Charles.
Unfortunately this means modifying the MTD driver, something like this:

    static int my_nand_erase(struct mtd_info *mtd, struct erase_info *instr)
    {
            /* Refuse to start an erase if power is already failing */
            if (check_powerfail())
                    return -EIO;

            return nand_erase_nand(mtd, instr, 0);
    }

and in nand_write_ecc():

            /* Check if it is write protected or power is failing */
            if (nand_check_wp(mtd) || check_powerfail())
                    goto out;

Cheers,
Claudio Lanconelli