Duplicati backup restore (broken) versus test (ok) - Madness

An even longer version is in the Duplicati forums (forums.duplicati.com), with previous cases, reports and follow-ups to this problem...

Summary:

After all this testing I can only draw one conclusion: the backup is corrupted (according to restore) and good (according to the integrity test) at the same time.

There's clearly flawed and very inconsistent logic somewhere. Again, this should drive anyone responsible for data integrity absolutely mad. I personally find this extremely worrying.

This is why I recommend that everyone run full restore tests on an alternate system at regular and fairly frequent intervals, because the test feature seems to be dangerously and misleadingly broken.
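
For what it's worth, the kind of restore test I mean is roughly the following, run on a machine that has no local database for this backup (a sketch only; the storage URL, passphrase and target path are placeholders, and if I remember right, --no-local-db forces the temporary database rebuild even if a local database exists):

Duplicati.CommandLine.exe restore "<storage-url>" "*" --restore-path="D:\restore-test" --no-local-db --passphrase="<passphrase>"

Forcing the temporary database rebuild on a clean system is exactly what exposes the problem; testing against the existing primary database never does.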

Either the database rebuild is somehow broken, or, if it isn't, the test feature is fatally flawed. Unfortunately I can't tell exactly where the disparity occurs. (Update just before posting: it seems that the key factor with the latest version is database compaction, where old data is removed; if that task is aborted, things often get covertly messed up due to broken or missing recovery logic.)
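
For context, compaction is the step that normally kicks in automatically after old backup versions are deleted, but it can also be invoked explicitly, roughly like this (a sketch; the storage URL, database path and passphrase are placeholders):

Duplicati.CommandLine.exe compact "<storage-url>" --dbpath="<path-to>\backup.sqlite" --passphrase="<passphrase>"

If that operation gets interrupted partway through, that seems to be exactly the point where things go covertly wrong.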

If I test the data set with the original database (before any of the steps described here), the test claims the backup is good.

I've also confirmed at the data source, where the backups are created, that the file (corrupted-file.dat) does currently exist. If there were any corruption, it could easily have been refreshed into the backup on any given day, but that isn't happening. The file / data remains broken in the backup set for an extended time, which probably means the corruption is in blocks that aren't changed often.

... And now for the nitty-gritty details ...

I started by creating a test environment, where I tried a few "expected" scenarios that should cause this problem. But of course none of my tests triggered the issue. It would have been so nice to work with data sets measured in kilobytes instead of tens of gigabytes. Oh well...

... To be continued ... Several hours later ...

Log from external test without local db:

2021-XX-XX 11.40.54 +03 - [Verbose-Duplicati.Library.Main.Operation.RecreateDatabaseHandler-ProcessingBlocklistVolumes]: Pass 3 of 3, processing blocklist volume 149 of 149
2021-XX-XX 11.42.30 +03 - [Information-Duplicati.Library.Main.Operation.RecreateDatabaseHandler-RecreateCompletedCheckingDatabase]: Recreate completed, verifying the database consistency

Output:

2021-XX-XX 11:42:37,899: Duplicati error output:

ErrorID: DatabaseIsBrokenConsiderPurge

Recreated database has missing blocks and 12 broken filelists. Consider using "list-broken-files" and "purge-broken-files" to purge broken data from the remote store and the database.

But if I run the same checks on the primary server with its existing database (which is what restore testing uses for most users (!!)), it says there's nothing to fix (test and repair both pass quickly without taking any action).
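
To be clear about what was compared: the external check forces a full database recreate into a fresh file, while the primary-server check runs against the existing database. Roughly like this (a sketch; storage URL, passphrase and paths are placeholders):

On the separate test system (no existing database, forces a full recreate):
Duplicati.CommandLine.exe repair "<storage-url>" --dbpath="D:\test\fresh.sqlite" --passphrase="<passphrase>"

On the primary server (uses the existing database):
Duplicati.CommandLine.exe test "<storage-url>" all --dbpath="<path-to>\backup.sqlite" --passphrase="<passphrase>"

The first one ends with DatabaseIsBrokenConsiderPurge, the second one finds nothing to complain about.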

Next: OK, I really don't want to modify the real production backup, because it's only slightly broken. I'll create a copy of the backup set and run repair against it in a separate test environment, roughly as sketched below.
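
Something along these lines (a sketch; the paths are placeholders, and the copy lives on local disk so I can hammer it freely):

robocopy "X:\backup-store" "D:\duplicati-copy" /E
Duplicati.CommandLine.exe repair "file://D:\duplicati-copy" --dbpath="D:\test\recreated.sqlite" --passphrase="<passphrase>"

Pointing repair at a fresh --dbpath makes it recreate the database from the copied remote files, without touching the production set.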

... To be continued ... Several hours later ...

It progresses quite quickly, as expected, up to this point:

Output:

Backend event: Get - Completed: duplicati-ife2631cc922e4e909388950c6c7acf7f.dindex.zip.aes (173,28 KT)

Processing indexlist volume 151 of 151

And now it's taking surprisingly long without any apparent progress. Yet file-system I/O statistics reveal that it's clearly working on the SQLite database and writing lots of data into its temp file.

The actions the repair takes should indicate what the problem with the data set is. The question remains: why did the database rebuild fail earlier? Going by the errors alone, is there actually anything wrong with the data?

I'll need to check what that rebuilt database contains, and whether I can send it to you guys. Or maybe it's not even necessary after this analysis.

... To be continued ... Several hours later ...

Now it says:

Pass 3 of 3, processing blocklist volume 1 of 149

... To be continued ... At least an hour later ... Still 1 of 149? ...

It seems that this might take, well, more than several hours. Maybe tomorrow ...

I reviewed the verbose log, and there's nothing wrong with the data or the recreate process. OK. Yet here is the final verdict.

Based on the earlier error summary, it can be assumed that this is the same case:

ErrorID: DatabaseIsBrokenConsiderPurge

Recreated database has missing blocks and 12 broken filelists. Consider using "list-broken-files" and "purge-broken-files" to purge broken data from the remote store and the database.

I took a full copy of all state data at this point (DB, backup blocks, etc.). I'll now run those steps; if I remember correctly, this is quite a quick operation.

Output from this operation:

ListBrokenFiles

11 : X.X.2021 21.30.26 (1 match(es))

corrupted-file.dat (783,22 MT)

10 : X.X.2021 21.30.27 (1 match(es))

corrupted-file.dat (783,22 MT)

9 : X.X.2021 21.30.29 (1 match(es))

corrupted-file.dat (783,22 MT)

8 : X.X.2021 21.30.29 (1 match(es))

corrupted-file.dat (783,22 MT)

7 : X.X.2021 21.30.29 (1 match(es))

corrupted-file.dat (783,22 MT)

6 : X.X.2021 21.30.28 (1 match(es))

corrupted-file.dat (783,22 MT)

5 : X.X.2021 21.30.26 (1 match(es))

corrupted-file.dat (783,22 MT)

4 : X.X.2021 21.30.26 (1 match(es))

corrupted-file.dat (783,22 MT)

3 : X.X.2021 21.30.26 (1 match(es))

corrupted-file.dat (783,22 MT)

2 : X.X.2021 21.30.26 (1 match(es))

corrupted-file.dat (783,22 MT)

1 : X.X.2021 21.30.26 (1 match(es))

corrupted-file.dat (783,22 MT)

0 : X.X.2021 21.30.29 (1 match(es))

corrupted-file.dat (783,22 MT)

So there are broken files in many filesets. Why does the test then claim that the backup is valid?
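
For reference, the two commands involved are roughly these (a sketch; storage URL, database path and passphrase are placeholders):

Duplicati.CommandLine.exe list-broken-files "<storage-url>" --dbpath="D:\test\recreated.sqlite" --passphrase="<passphrase>"
Duplicati.CommandLine.exe purge-broken-files "<storage-url>" --dbpath="D:\test\recreated.sqlite" --passphrase="<passphrase>"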

Let's run purge... And the output:

... several similar segments omitted, as unnecessary ...

Purging 1 file(s) from fileset XX.XX.2021 21.30.29

Starting purge operation

Processing filelist volume 1 of 1

Replacing fileset duplicati-2021XXXXT183029Z.dlist.zip.aes with duplicati-2021XXXXT183030Z.dlist.zip.aes which has with 1 fewer file(s) (783,22 MT reduction)

Purging file corrupted-file.dat (783,22 MT)

Writing files to remote storage

Backend event: Put - Started: duplicati-2021XXXXT183030Z.dlist.zip.aes (43,36 KT)

Uploading file (43,36 KT) ...

Backend event: Put - Completed: duplicati-2021XXXXT183030Z.dlist.zip.aes (43,36 KT)

Backend event: Delete - Started: duplicati-2021XXXXT183029Z.dlist.zip.aes ()

Deleting file duplicati-2021XXXXT183029Z.dlist.zip.aes ...

Backend event: Delete - Completed: duplicati-2021XXXXT183029Z.dlist.zip.aes ()

Database was previously marked as in-progress, checking if it is valid after purging files

Purge completed, and consistency checks completed, marking database as complete

Now that it's supposedly good, I'll rerun the test and a full restore. Let's see what happens.

Repair passed, all good.

Test ... Now it passed.

Finally, a full restore of the latest version to confirm that restore now works.
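
Roughly like this (a sketch; version 0 should be the latest version, and the storage URL, paths and passphrase are placeholders):

Duplicati.CommandLine.exe restore "<storage-url>" "*" --version=0 --restore-path="D:\restore-verify" --no-local-blocks --passphrase="<passphrase>"

The --no-local-blocks option is there to make sure the restored data actually comes from the backend, not from the local source files.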

Well, now the restore worked and claims that everything is OK. But unsurprisingly, the one file that got purged out earlier is missing from the restored data set. Technically the restore worked and claimed to be OK, but as we now well know, it isn't.

A lot of this was discussed privately with people involved with the project (including sending debug data and even a test database snapshot). But let's leave this here as background and testing information, in case someone is looking into the matter.

A few words on how I ran the tests:

test "source" all --full-remote-verification --full-block-verification

As far as I understand, that's as thorough a test as possible? The test did report extra blocks in files (as usual; it's still annoying that those are labelled as ERRORS), but no other problems were found.
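
Spelled out in full, the invocation was roughly this (a sketch; the storage URL, database path and passphrase are placeholders):

Duplicati.CommandLine.exe test "<storage-url>" all --full-remote-verification --full-block-verification --dbpath="<path-to>\backup.sqlite" --passphrase="<passphrase>"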

I just thought it's important to mention the options used for testing. Any suggestions on whether the tests could be made more thorough?

I have about 16 kilobytes more of the same stuff; it's been broken for over two years and is still broken. But I find it utterly pointless to post it all. See the forums for some of the public discussion. If people don't understand transactions or care about data integrity, it's pretty much a lost battle.

Final update

I'm currently running extensive tests to see if the issue has already been remedied in the latest canary version. When the issue is fixed, so that I no longer see it repeatedly, I'll update the very top of this post.
2022-09-04 - Unfortunately it's confirmed: the program still corrupts data even with the latest canary version. Well, someday it might get fixed!

Don't get me wrong

I'm only complaining because I like this project and its features are good. I'm just not a big fan of covert data corruption. Some people say that I've got a negative attitude towards some programs. Well, my attitude isn't negative; I just want potential users to know about serious issues before they waste a lot of time on (somewhat dangerously broken) stuff, like I have.

And sure, Duplicati is open source, and one of the questions was: if there's a problem, why don't you provide a pull request? As an afterthought, seeing how much time the testing took with still no fix afterwards, doing static code analysis myself might have been the faster way to get to the point, because I know pretty much exactly what kind of logic failure I'm looking for, and it's not that hard to spot.
It's somewhere in the code that writes a "starting X" transaction marker, and in the recovery path for the situation where, on startup, the transaction log still says "starting X". I write transactionally solid programs all the time, so issues like this aren't strange to me at all. But I usually take care of these problems in the design phase and pay special attention during implementation, so that I don't have to spend a lot of time afterwards explaining to customers how everything is fked and why data set X is missing while data set Y is delivered ten times, with slightly different content each time.
All this reminds me of someone else's blog post, where they said: "If you don't provide a pull request with the issue ticket, then F-off."
2022-09-04