Partial Error Recovery

Recovering from a partial error in a file operation

Repair and Rollback are two ways to handle inconsistencies in an allocation's data across the blobbers hosting it. Such inconsistencies arise when an operation is applied on some of those blobbers but fails on others.

Why can such errors occur?

These errors are client-dependent: whether they can occur depends on how you design the client that uses the Züs network and blobbers. The clients that Züs offers (CLI tools, web apps) depend on GoSDK, which implements file operations in a way that trades consistency for availability and fault tolerance (an AP system, as defined by the CAP theorem).

In GoSDK, a file operation is considered successful when the client receives data_shards + 1 success messages from the blobbers hosting the allocation in which the operation takes place. This doesn't guarantee that every blobber of the allocation applies the operation successfully, so the data stored for an allocation can sometimes become inconsistent across its blobbers.
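The success rule above can be sketched in a few lines of Go. This is only an illustration, not GoSDK code: the function names operationSucceeded and missedBlobbers are hypothetical, the shard counts are made-up example values, and the sketch assumes the data_shards + 1 figure is a minimum (more confirmations also count as success).

```go
package main

import "fmt"

// operationSucceeded is a hypothetical helper (not part of GoSDK).
// It reports whether the number of success messages meets the
// data_shards + 1 threshold, treated here as a minimum.
func operationSucceeded(successes, dataShards int) bool {
	return successes >= dataShards+1
}

// missedBlobbers is a hypothetical helper that returns how many
// blobbers did not confirm the operation.
func missedBlobbers(totalBlobbers, successes int) int {
	return totalBlobbers - successes
}

func main() {
	// Example values only: 4 data shards and 2 parity shards,
	// so 6 blobbers host the allocation.
	const dataShards, parityShards = 4, 2
	total := dataShards + parityShards
	successes := 5

	fmt.Println(operationSucceeded(successes, dataShards)) // true
	fmt.Println(missedBlobbers(total, successes))          // 1
}
```

In this example the operation is reported as successful even though one blobber never applied it, which is exactly the kind of inconsistency that has to be handled later.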

If you're implementing a client that prefers consistency over availability, you won't have to deal with this issue. However, if any one of the allocation's blobbers goes down, the allocation will be almost unusable.

How to deal with such errors?

Since such errors are client-dependent, dealing with them is primarily the responsibility of the client. We don't dictate a specific way to handle them. However, GoSDK gives a good example of how such errors can be handled, introducing the concepts of Repair and Rollback.

Repair

Repair applies when a successful file operation was missed by some of the blobbers hosting the allocation. Since the operation itself succeeded, the user needs to initiate a separate, special operation to perform the repair. The repair itself is performed as follows:

  • Starting from the root directory, when an empty directory or a file is encountered:

    • If it's found on the majority of the blobbers, it will be created on the rest of them (and copied from the majority if it's a file).

    • If it's not found on the majority, it will be deleted from the blobbers that have it.

  • If the directory has children, the repair process continues through its children.
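The traversal above can be sketched as a recursive walk that produces a repair plan. This is a simplified model, not the GoSDK implementation: the Node, Step, decide, and plan names are hypothetical, each node is assumed to already carry the number of blobbers that report it, and "majority" is taken to mean strictly more than half of the blobbers.

```go
package main

import "fmt"

// Action is the repair decision for a single path (hypothetical type).
type Action string

const (
	NoOp              Action = "none"
	CreateOnMissing   Action = "create-on-missing"
	DeleteFromHolders Action = "delete-from-holders"
)

// Node is a simplified view of a path merged across all blobbers.
type Node struct {
	Path     string
	IsDir    bool
	Holders  int // number of blobbers that report this path (assumed known)
	Children []*Node
}

// Step pairs a path with the action chosen for it.
type Step struct {
	Path   string
	Action Action
}

// decide applies the majority rule to a file or an empty directory.
func decide(holders, total int) Action {
	switch {
	case holders == total || holders == 0:
		return NoOp // already consistent everywhere
	case 2*holders > total:
		return CreateOnMissing // strict majority has it
	default:
		return DeleteFromHolders // majority does not have it
	}
}

// plan walks the tree: files and empty directories are decided on the
// spot, directories with children are descended into.
func plan(n *Node, total int, out []Step) []Step {
	if !n.IsDir || len(n.Children) == 0 {
		if a := decide(n.Holders, total); a != NoOp {
			out = append(out, Step{n.Path, a})
		}
		return out
	}
	for _, c := range n.Children {
		out = plan(c, total, out)
	}
	return out
}

func main() {
	// Example tree for an allocation hosted on 6 blobbers.
	root := &Node{Path: "/", IsDir: true, Holders: 6, Children: []*Node{
		{Path: "/a.txt", Holders: 5},
		{Path: "/tmp", IsDir: true, Holders: 2},
		{Path: "/docs", IsDir: true, Holders: 6, Children: []*Node{
			{Path: "/docs/b.txt", Holders: 6},
		}},
	}}
	for _, s := range plan(root, 6, nil) {
		fmt.Println(s.Path, s.Action)
	}
}
```

Here /a.txt would be copied to the one blobber that lacks it, the empty /tmp directory would be removed from the two blobbers that have it, and /docs/b.txt needs no action.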

To run the repair process, you can use either the zboxcli start-repair command or zboxcli updateallocation (without setting add_blobber_id). Check the ZboxCLI README for detailed documentation of both commands.

Rollback

  • Rollback means undoing allocation changes that were applied on a minority of blobbers but were not synced by the majority of blobbers hosting the allocation.

  • Rollback runs on an allocation if and only if the number of blobbers on the latest version (having the latest changes) is less than the number of data shards of the allocation.
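The rollback condition above reduces to a single comparison. The following Go sketch uses a hypothetical needsRollback function and example shard counts. It only encodes the condition as stated and does not model how GoSDK determines which blobbers are on the latest version.

```go
package main

import "fmt"

// needsRollback is a hypothetical helper (not part of GoSDK).
// It returns true when fewer blobbers are on the latest version
// than the allocation has data shards.
func needsRollback(onLatest, dataShards int) bool {
	return onLatest < dataShards
}

func main() {
	const dataShards = 4 // example value only

	fmt.Println(needsRollback(3, dataShards)) // true: 3 < 4
	fmt.Println(needsRollback(4, dataShards)) // false: 4 is not less than 4
}
```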
