On Time Machine Errors and Data Integrity...


Today I got my first serious error with Time Machine. An automated hourly backup was running when all of a sudden an alert panel popped up saying there was some error and Time Machine failed to backup. (Not very helpful.)

I immediately tried to run a manual backup, and the same error occurred. So I checked the system logs and found some nonsense like:

/System/Library/CoreServices/backupd[457]: Error: (-50) Creating directory 
/System/Library/CoreServices/backupd[457]: Failed to make snapshot.
/System/Library/CoreServices/backupd[457]: Backup failed with error: 2

So I opened up Disk Utility, and I thought I would try running a Verify Disk/Repair Disk and maybe a Repair Permissions just in case.

Well, Repair Disk was unable to run because it was unable to unmount the external disk. I decided to run Repair Permissions on the main (internal) disk. The progress meter said less than a minute to go. Ten minutes later, it said the same thing. So I decided to abort the process, which left some process still running which I then manually killed. So I decided to reboot.

After boot up, I was still unable to mount the disk. But I did run Verify Disk on the external disk, and this time I got information. (It didn’t need to unmount for Verify I presume.) It said “keys out of order”. Not knowing what this meant, I did a Google search and it did not sound good. I decided I would reboot one more time, this time with the external disk off, and then try to power it up, hoping I would get unmount control back. But before that, I ran Repair Permissions on the main disk one more time and it took some time again and I got impatient so I aborted it. I didn’t have to kill any run-away processes this time.

After the next boot, I powered on my external disk, and I was successfully able to run “Repair Disk”. However after it finished, Repair Disk didn’t mention anything about “keys out of order” or it needed to fix anything. It seemed as if nothing was wrong.

I then Repaired Permissions on the main disk and waited it out this time. Twenty minutes later (still saying less than a minute), it finished. So I rebooted one more time with the external disk on to see what would happen. 

After the boot, everything seemed to be back to normal, except that Spotlight began completely reindexing the external disk. But I could freely mount/unmount the external disk. I started a backup, and no critical alert appeared. However, the backup seemed to be taking much longer than normal.

Checking the system logs, I found:
/System/Library/CoreServices/backupd[173]: Event store UUIDs don't match for volume Macintosh HD
Node requires deep traversal:/ reason:kFSEDBEventFlagMustScanSubDirs|kFSEDBEventFlagReasonEventDBUntrustable|

So this leads me to wonder, does Time Machine do integrity checks?

In this case where things have been marked “Untrustable”, what does Time Machine really do about it? Does it actually do some kind of data integrity check? Can I actually feel secure that my data is correct, or do I need to assume everything is suspect and format and start over losing all my snapshots? And assuming Apple does check things, when does it happen? How is it determined things are “untrustable”?

It would be good if Apple put out a Technical Q&A documenting specifically how Time Machine works so we can judge how safe our data really is, especially after a problem like this.

It seems to me that Time Machine should automatically do periodic integrity checks on all backed up data (maybe once a week or once a month) and there should be a way to manually launch an integrity check. Having backups of corrupt data is as problematic as having no backups at all in my opinion.

Any how, allowing Time Machine to finish running, and then keeping the machine on to allow for more automated backups to run, have allowed more errors to appear in the system logs. Things such as:

Error: (-36) copying /mach_kernel.ctfsys

kernel[0]: disk1s5: I/O error.
mds[1189]: (Error) IndexCI in ContentIndexFlushCache:Caught mach exception. Fun Fun Fun.

mds[1189]:(Error) IndexGeneral in bool si_writeBackAndIndex(__SI*, __ContentIndex*, const __CFDictionary*, const __CFDictionary*, db_obj**, const __CFString*, int, bool, bool, bool, __SIUserCtx*, const __CFString*, int32_t, bool):ContentIndexUpdateContent failed

mds[1189]: (Error) IndexGeneral in void setAttributes(si_set_attr_ctx*, Boolean, long unsigned int):Couldn't update index.

mds[1189]: (Error) IndexGeneral in notify_lowspace:low space for device 234881030
mds[1189]: (Error) Volume: LOW DISK SPACE device:234881030
mds[1189]: (Warning) Task: MDSStoreTask failed to resolve storeID (6b000020)

For those that are curious, the disk was only 30% full at the start of this day and had plenty of disk space left. Even now, the disk doesn’t look full. (Maybe half.)

It looks like I will be formatting shortly.

(Apple Bug Filed: 5606623)

Update: 2008/01/11
I spoke with a DTS engineer after Cocoaheads, and he confirmed that Time Machine does not do any integrity checks.

Copyright © PlayControl Software, LLC / Eric Wing