Fallback Job: The Final Link in a Fault-Tolerant Microservice

In this article we present a topic that we find interesting both for its simplicity and for the number of headaches it can prevent: the Fallback Job. It is one of the most commonly used tools for handling faults in distributed transactions.

The Issue

Suppose, for example, that we have a service (a microservice, of course) that needs to manage files in a repository, such as a user service in which we must upload and store an avatar (the user’s photo). Let’s assume that our service is hosted by a cloud service provider and uses the provider’s file service (S3 on AWS, for instance).

Our service must then ensure, among other things, that S3 saves the image correctly and, later, that it deletes the image when requested. We could blindly trust that these operations will succeed and ignore any issues. Uploading poses no major problem: in the event of a fault we can return an error and that’s it. We could even do the same for deletion: trust that it will work and voilà, job done.

But what if we want to provide a more robust microservice that transacts correctly? Let’s agree that if the files are not stored or deleted, it is the provider that is failing, not our application, right? Even so, once we begin to doubt the “reliability” of our cloud provider, we have to ask ourselves: “If an error occurs, what should we do about it?” This is where the subject in question, the “Fallback Job”, comes in, ready to solve our 5% of “unhappy” cases.

Fallback Job, the solution

A Fallback Job is a Cron application that runs every X amount of time and is responsible for completing the operations that were cut short and not resolved correctly. In this way we can maintain consistency between services.

How? Very simply:

In our service we store a flag that records whether the provider’s file has been deleted correctly.
When we execute the deletion in the normal workflow, we wait for the provider’s response. If the provider confirms that everything went OK, we save the flag as “DELETED”.
We provide a PRIVATE endpoint that retries the deletion for the records that are still “NOT DELETED”.
Our Fallback Job runs every X amount of time and calls this private endpoint to delete the marked files.

Code

As always, the complete and working code can be found at:

Project structure

The structure is quite similar to that of our previous articles. The only addition is the deletingFallback folder, an Adapter that lets us abstract the file provider implementation and substitute a mock or stub in unit tests.

Endpoints / Controllers

In “pkg/handlers/rest.go” we find the REST service that exposes our endpoints: the endpoint that reports whether the service is live (“api/health”), the endpoints related to the resource, and the endpoint “api/users/fallback” (HTTP DELETE). Due to a limitation of the framework, the fallback endpoint is mapped within “api/users/:id”, and inside the method that acts as the controller we turn to our dear old “if” to tell the two apart.

Function that controls the requests of the DELETE endpoint.
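
The “if” mentioned above can be sketched as follows. This is only an illustration under our own assumptions: it uses the standard net/http package rather than Gin so that it stays self-contained, and the names UserDeleter, deleteHandler, recorder and deleteStatus are hypothetical, not taken from the project.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"strings"
)

// UserDeleter is a hypothetical port covering both delete paths.
type UserDeleter interface {
	RemoveUser(id string) error
	RemoveUsersFallback() error
}

// deleteHandler serves DELETE /api/users/:id. Because "fallback" arrives in
// the :id position, a plain "if" decides which service method to call.
func deleteHandler(svc UserDeleter) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodDelete {
			w.WriteHeader(http.StatusMethodNotAllowed)
			return
		}
		id := strings.TrimPrefix(r.URL.Path, "/api/users/")
		if id == "" || strings.Contains(id, "/") {
			w.WriteHeader(http.StatusNotFound)
			return
		}
		var err error
		if id == "fallback" {
			err = svc.RemoveUsersFallback()
		} else {
			err = svc.RemoveUser(id)
		}
		if err != nil {
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusNoContent)
	}
}

// recorder is an in-memory UserDeleter that only records the calls it gets.
type recorder struct {
	removed       []string
	fallbackCalls int
}

func (r *recorder) RemoveUser(id string) error {
	r.removed = append(r.removed, id)
	return nil
}

func (r *recorder) RemoveUsersFallback() error {
	r.fallbackCalls++
	return nil
}

// deleteStatus sends an in-process DELETE to h and returns the status code.
func deleteStatus(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodDelete, path, nil))
	return rec.Code
}
```

One consequence of this design is that “fallback” can never be used as a real user id, which is acceptable when ids are numeric or generated.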

Main and Dependency Injection

In the file “cmd/server/main.go” we find the initialization of all the components: the log tool, the repository, the FileServer Adapter, the services, and finally the Gin framework.

Delete Service

In the file “pkg/deleting/service.go” we find the logic needed to delete users. First we try to delete the image associated with the user. If that fails, we log the error and mark the user as “to delete”. If no error occurs, we delete the user directly from the repository/database.
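
A minimal sketch of that logic, under stated assumptions, could look like this. The interfaces FileServer and Repository, their method names, and the in-memory stand-ins memRepo and stubFiles are hypothetical. We also assume that, once the user has been marked, the call reports success to its caller; the real service may choose to return the provider’s error instead.

```go
package main

import (
	"errors"
	"log"
)

// FileServer is a hypothetical port for the provider's file service.
type FileServer interface {
	DeleteImage(userID string) error
}

// Repository is a hypothetical port for the user store.
type Repository interface {
	MarkToDelete(userID string) error
	Delete(userID string) error
}

// Service deletes users together with their stored image.
type Service struct {
	files FileServer
	repo  Repository
}

// RemoveUser deletes the image first. If the provider fails, the error is
// logged and the user is only marked, so that the Fallback Job can retry.
func (s *Service) RemoveUser(userID string) error {
	if err := s.files.DeleteImage(userID); err != nil {
		log.Printf("deleting image of user %s failed: %v", userID, err)
		return s.repo.MarkToDelete(userID)
	}
	return s.repo.Delete(userID)
}

// memRepo is an in-memory Repository for illustration and unit tests.
type memRepo struct {
	marked  map[string]bool
	deleted map[string]bool
}

func newMemRepo() *memRepo {
	return &memRepo{marked: map[string]bool{}, deleted: map[string]bool{}}
}

func (r *memRepo) MarkToDelete(userID string) error {
	r.marked[userID] = true
	return nil
}

func (r *memRepo) Delete(userID string) error {
	r.deleted[userID] = true
	return nil
}

// stubFiles is a FileServer that fails on demand.
type stubFiles struct{ fail bool }

func (f stubFiles) DeleteImage(string) error {
	if f.fail {
		return errors.New("provider unavailable")
	}
	return nil
}
```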

Deletion Service of users who were marked

In the file “pkg/deletingFallback/service.go” we find the RemoveUsersFallback method, which obtains the users that have been marked and, for each one, tries to delete the image from the file server. If this fails, we skip the iteration and move on to the next user.
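
The loop can be sketched like this. It is an illustration rather than the project’s code: the interfaces, the returned count and the in-memory stand-ins memRepo and failingFor are hypothetical, and we assume that a user whose image was finally deleted is then removed from the repository.

```go
package main

import "errors"

// FileServer is a hypothetical port for the provider's file service.
type FileServer interface {
	DeleteImage(userID string) error
}

// MarkedRepository is a hypothetical port exposing the marked users.
type MarkedRepository interface {
	ListMarked() ([]string, error)
	Delete(userID string) error
}

// RemoveUsersFallback retries the image deletion for every marked user and
// returns how many users were cleaned up in this run.
func RemoveUsersFallback(files FileServer, repo MarkedRepository) (int, error) {
	ids, err := repo.ListMarked()
	if err != nil {
		return 0, err
	}
	removed := 0
	for _, id := range ids {
		if err := files.DeleteImage(id); err != nil {
			continue // still failing: keep the mark for the next run
		}
		if err := repo.Delete(id); err != nil {
			return removed, err
		}
		removed++
	}
	return removed, nil
}

// memRepo is an in-memory MarkedRepository for illustration.
type memRepo struct{ marked []string }

func (r *memRepo) ListMarked() ([]string, error) {
	// Return a copy so that Delete can modify the slice during iteration.
	return append([]string(nil), r.marked...), nil
}

func (r *memRepo) Delete(userID string) error {
	for i, id := range r.marked {
		if id == userID {
			r.marked = append(r.marked[:i], r.marked[i+1:]...)
			return nil
		}
	}
	return errors.New("user not marked: " + userID)
}

// failingFor is a FileServer that fails for the listed user ids.
type failingFor map[string]bool

func (f failingFor) DeleteImage(userID string) error {
	if f[userID] {
		return errors.New("provider error")
	}
	return nil
}
```

Skipping a failed user instead of aborting means that one stubborn file does not block the clean-up of the others; it simply stays marked until a later run succeeds.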

Fallback Job

The code below is simply a system service that runs once an hour (the interval can be configured in hours, days, months, etc.). At startup it first validates that the main service is working (by checking the endpoint “api/health”); once that is done, it configures the Cron and starts it. On each execution, all it does is validate that the call was successful, and the run is finished.
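
The overall flow (check health once, then run on a schedule) can be sketched as follows. Note the assumptions: we use a time.Ticker from the standard library in place of a Cron library, and the function runJob, its parameters and the stop channel are hypothetical names introduced for this illustration.

```go
package main

import (
	"fmt"
	"time"
)

// runJob checks the main service once and then calls fallback on every tick
// until stop is closed or a run fails. interval must be greater than zero.
func runJob(interval time.Duration, stop <-chan struct{}, healthy, fallback func() error) error {
	if err := healthy(); err != nil {
		return fmt.Errorf("service not reachable, job not scheduled: %w", err)
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		// Check stop first so that a pending tick never wins over a shutdown.
		select {
		case <-stop:
			return nil
		default:
		}
		select {
		case <-stop:
			return nil
		case <-ticker.C:
			if err := fallback(); err != nil {
				return fmt.Errorf("fallback run failed: %w", err)
			}
		}
	}
}
```

For the hourly schedule described above, the call would pass time.Hour as the interval, and the caller decides what to do with a returned error, for example stopping the process so that it gets restarted.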

Main

Server Validation Running

In this method, all we do is ping the server and check whether it is live (HTTP status code 200).
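
A possible shape for this check is sketched below. The getter interface, checkHealth and the fixedStatus stub are hypothetical names; the interface is only there so that the real *http.Client can be swapped for a stub in tests.

```go
package main

import (
	"fmt"
	"net/http"
)

// getter is the small part of *http.Client that the check needs.
type getter interface {
	Get(url string) (*http.Response, error)
}

// checkHealth pings url and returns nil only when the answer is 200 OK.
func checkHealth(c getter, url string) error {
	resp, err := c.Get(url)
	if err != nil {
		return fmt.Errorf("pinging %s: %w", url, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("pinging %s: unexpected status %d", url, resp.StatusCode)
	}
	return nil
}

// fixedStatus is a stub getter that always answers with the same status.
type fixedStatus int

func (s fixedStatus) Get(string) (*http.Response, error) {
	return &http.Response{StatusCode: int(s), Body: http.NoBody}, nil
}
```

In the job itself the first argument would be a real client, ideally one with a timeout set, so that an unresponsive server does not block the check forever.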

Fallback Service Call

This method simply issues an HTTP DELETE call to the specified endpoint and verifies that the response is correct; otherwise, it raises a “panic”, which stops the execution of the service.
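
A sketch of such a call is shown below. The names doer, callFallback, mustCallFallback and stubDoer are hypothetical, and we assume that any 2xx status counts as a correct response; the original code may check for one specific status.

```go
package main

import (
	"fmt"
	"net/http"
)

// doer is the small part of *http.Client that the call needs.
type doer interface {
	Do(req *http.Request) (*http.Response, error)
}

// callFallback sends an HTTP DELETE to url and reports any non-2xx answer.
func callFallback(c doer, url string) error {
	req, err := http.NewRequest(http.MethodDelete, url, nil)
	if err != nil {
		return fmt.Errorf("building request: %w", err)
	}
	resp, err := c.Do(req)
	if err != nil {
		return fmt.Errorf("calling %s: %w", url, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("calling %s: unexpected status %d", url, resp.StatusCode)
	}
	return nil
}

// mustCallFallback panics on failure, which stops the job.
func mustCallFallback(c doer, url string) {
	if err := callFallback(c, url); err != nil {
		panic(err)
	}
}

// stubDoer answers with a fixed status and remembers the HTTP method used.
type stubDoer struct {
	status int
	method string
}

func (s *stubDoer) Do(req *http.Request) (*http.Response, error) {
	s.method = req.Method
	return &http.Response{StatusCode: s.status, Body: http.NoBody}, nil
}
```

Keeping the error-returning function separate from the panicking wrapper makes the call easy to unit test while preserving the stop-on-failure behaviour described above.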

Tips

If the Fallback job fails, it means that there is a problem with the main service. The job service will therefore keep restarting, and eventually the INFRA team will detect it and send us a notification.
We must expose a resource such as “/health” or “/ping” that we can query to see whether our service is working correctly.
We should always require special authorization for the Fallback job service, and depending on our architecture or design we may need to have it in a separate API.
A considerable improvement to this pattern would be to use message queues, and perhaps in a future article we will look at how.

About the Author

Mauricio Bergallo is a Systems Engineer with extensive knowledge of a variety of programming languages. He has broad experience with, and understanding of, all aspects of the software development life cycle.

Mauricio Bergallo

Updated August 03, 2021