Distributed Transaction Management: Microservices, Idempotency and Failure Handling
Recently, I had a discussion with my team about transaction management in micro-services. It is well known that transactions are hard to implement in a micro-services-oriented architecture. In this article, I try to write down my current understanding of transaction management in such an architecture. This is not a precise technical discussion. Rather, I’m exploring how the several concepts around distributed transaction management connect to each other.
Recap: Transaction
Let’s first recap what a transaction is. Personally, I used to find the term “transaction” mysterious and difficult. But I’ve come to realize that it’s actually a pretty simple concept.
Think about a simple example: a virtual currency system. When a customer earns or spends virtual currency, we may need to
- write a record in the ledger table, and
- write a record in the customer balance table.
The important thing here is that these 2 records in 2 tables have to be in sync. I mean, if you find a record for a customer’s earn/spend in the ledger table, you should also find the corresponding record in the customer balance table. This is a “transaction”. A transaction is a collection of data operations with exactly two possible outcomes: 1. all operations succeed, or 2. none of them takes effect. Strictly speaking, this all-or-nothing property is called atomicity, and what it protects is data consistency. In my experience, the terms consistency management and transaction management are often used interchangeably.
In short, for the example virtual currency system, we need to avoid the following 2 situations.
- The record exists in the ledger table but not in the customer balance table
- The record exists in the customer balance table but not in the ledger table
So far, nothing is difficult, is it?
RDB Transaction
Things start to get complicated when you need to consider failure handling. Yes, computer systems, especially I/O, can fail. If one of the writes to the ledger or customer balance table fails, the two tables become inconsistent. This is a problem.
In a traditional relational database (RDB), this kind of failure is handled by a so-called “transaction”. Yes, I understand the terminology is confusing here. “Transaction” here is the name of a feature provided by the RDB, not the concept of a transaction that I explained in the previous section.
As you know, an “RDB transaction” works like this. For the success case,
- start “transaction”
- write a record to the ledger table
- write a record to the customer balance table
- end “transaction”.
For the failure case,
- start “transaction”
- write a record to the ledger table
- write a record to the customer balance table but it fails!
- rollback. This means the write to the ledger table in step 2 is canceled.
That’s it. It’s still simple and good. Now we can avoid inconsistency between the ledger and customer balance tables even in a world where writes can fail.
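To make the two cases above concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The schema, the CHECK constraint (added only so that the second write can fail) and the function name `record_earn_or_spend` are assumptions made for illustration, not part of any real system.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with the two example tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ledger (tx_id TEXT PRIMARY KEY, customer_id TEXT, amount INTEGER)"
    )
    # The CHECK constraint exists only so that the second write can fail.
    conn.execute(
        "CREATE TABLE customer_balance ("
        "customer_id TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
    )
    conn.execute("INSERT INTO customer_balance VALUES ('alice', 100)")
    conn.commit()
    return conn


def record_earn_or_spend(conn, tx_id, customer_id, amount) -> bool:
    """Write to both tables, or to neither of them."""
    try:
        # `with conn` commits if the block succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "INSERT INTO ledger VALUES (?, ?, ?)", (tx_id, customer_id, amount)
            )
            conn.execute(
                "UPDATE customer_balance SET balance = balance + ? WHERE customer_id = ?",
                (amount, customer_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = make_db()
record_earn_or_spend(conn, "tx1", "alice", 50)    # both writes are committed
record_earn_or_spend(conn, "tx2", "alice", -500)  # balance write fails, ledger write is rolled back
```

After the second call, the ledger still contains only `tx1`: the insert for `tx2` was undone together with the failed balance update.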
Transaction Management in micro-services: Saga
The limitation of the “RDB transaction” is that this nice feature works only within a single database, not across multiple databases. And in a micro-services architecture, each database is typically hidden behind an independent service. So there is no hope of relying on the good old “RDB transaction” to avoid data inconsistency.
Let’s say we have 2 different services (or APIs), a ledger service and a customer balance service, instead of 2 tables in a single database. Now we need to call these 2 services over the network in order to create a ledger record and a customer balance record for the virtual currency management.
How should we handle failures to avoid inconsistency without the nice “RDB transaction”?
Wait, I already explained what an “RDB transaction” does when a write fails. We can do exactly the same thing with remote service calls, can’t we? I mean, if one of your service calls fails, cancel all the related service calls that already succeeded. Like this.
- start “transaction”. In micro-services, this does nothing.
- call the create API of the ledger service
- call the create API of the customer balance service, but it fails!
- rollback starts. Call the cancel API of the ledger service to delete the record created in step 2
This means that both the ledger service and the customer balance service need to support a cancel (or delete) API in addition to the create API. Implementing a cancel API might take some effort, but it should not be very complicated.
Although you need to implement a certain amount of extra logic compared to the RDB world, the logic is still not complicated. If you detect an API failure, just cancel the related calls. That’s it.
This cancel-on-failure pattern is known as “Saga” in the context of micro-services. But I think it is a straightforward alternative to the rollback operation in an “RDB transaction”.
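The cancel-on-failure flow can be sketched roughly as follows. The two services are replaced by a hypothetical in-memory `FakeService` class, and `ServiceError`, `run_saga` and the `fail_create` flag are names invented for this example. A real implementation would make network calls instead.

```python
class ServiceError(Exception):
    """Raised by the (hypothetical) service when a call fails."""


class FakeService:
    """In-memory stand-in for a remote service with create and cancel APIs."""

    def __init__(self, fail_create=False):
        self.records = {}
        self.fail_create = fail_create

    def create(self, tx_id, amount):
        if self.fail_create:
            raise ServiceError("create failed")
        self.records[tx_id] = amount

    def cancel(self, tx_id):
        # Delete the record if it exists.
        self.records.pop(tx_id, None)


def run_saga(services, tx_id, amount) -> bool:
    """Call each create API in order; on failure, cancel the calls that succeeded."""
    done = []
    for service in services:
        try:
            service.create(tx_id, amount)
            done.append(service)
        except ServiceError:
            # Rollback: cancel completed steps in reverse order.
            for completed in reversed(done):
                completed.cancel(tx_id)
            return False
    return True


ledger = FakeService()
balance = FakeService(fail_create=True)
run_saga([ledger, balance], "tx1", 100)  # balance fails, so the ledger record is canceled
```

Note that this sketch still assumes the cancel call itself always succeeds.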
Further Thinking: What happens if the cancel fails in Saga?
O.K. Now we have a way to manage transactions in micro-services. That’s good. However, in practice, you have to answer the following question.
“What happens if the cancel API somehow fails during rollback?”
It is annoying because it’s a recursive question. O.K., we could retry the cancel API when it fails, but that retry could also fail. And the retry of the retry could fail again, and again, and so on without end.
This is where the concept of idempotence comes in. I think you’ve heard the word idempotence at least once. If an API is idempotent, making multiple identical requests has the same effect as making a single request. What does this mean in the context of our virtual currency system?
Let’s get back to the earlier example. For the failure case,
- start “transaction”. In micro-services, this does nothing.
- call the create API of the ledger service
- call the create API of the customer balance service, but it fails!
- rollback starts. Call the cancel API of the ledger service to delete the record created in step 2. But the cancel somehow fails, so the record in the ledger service remains
This is the case in question. After this happens, we have data inconsistency between the 2 services: the ledger has the record but the customer balance doesn’t.
What happens if we retry the whole operation?
- start “transaction”. In micro-services, this does nothing.
- call the create API of the ledger service. Remember, the API is idempotent now, so the record in the ledger service doesn’t change at all.
- call the create API of the customer balance service. Wow, it succeeds this time.
- end “transaction”.
This is the lucky case. The previous failure in the customer balance service was just a temporary failure, so the retry works. The important thing here is that we can safely call the ledger create API multiple times without any extra care, because it’s idempotent.
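One common way to get an idempotent create API is to let the client send a unique transaction ID with each request and have the service ignore an ID it has already stored. The sketch below assumes that approach. The ID-as-key design and all class and function names are assumptions made for this example, not something the services above are known to provide.

```python
class TemporaryError(Exception):
    """A failure that may go away on retry (hypothetical)."""


class IdempotentLedger:
    def __init__(self):
        self.records = {}
        self.create_calls = 0

    def create(self, tx_id, amount):
        self.create_calls += 1
        # If tx_id is already stored, keep the existing record unchanged.
        self.records.setdefault(tx_id, amount)


class FlakyBalance:
    """Idempotent as well, but fails the first `failures` create calls."""

    def __init__(self, failures):
        self.records = {}
        self.failures_left = failures

    def create(self, tx_id, amount):
        if self.failures_left > 0:
            self.failures_left -= 1
            raise TemporaryError("temporarily unavailable")
        self.records.setdefault(tx_id, amount)


def attempt(ledger, balance, tx_id, amount) -> bool:
    """One pass over the whole operation. No rollback: we assume the cancel failed."""
    ledger.create(tx_id, amount)
    try:
        balance.create(tx_id, amount)
    except TemporaryError:
        return False
    return True


ledger, balance = IdempotentLedger(), FlakyBalance(failures=1)
attempt(ledger, balance, "tx1", 100)  # balance fails, the ledger record remains
attempt(ledger, balance, "tx1", 100)  # retry: the ledger is unchanged, balance succeeds
```

The ledger’s create API is called twice, yet it ends up holding a single record for `tx1`.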
Oh, wait, but what happens if the write to the customer balance service never succeeds? This can definitely happen in production systems, for example when the input data is invalid and the validation in the customer balance service always fails.
Let’s simulate this case. It is a slightly confusing scenario. Remember that in the first attempt, the customer balance creation failed and the cancel on the ledger also failed.
- start “transaction”. In micro-services, this does nothing.
- call the create API of the ledger service. Remember, the API is idempotent now, so the record in the ledger service doesn’t change at all.
- call the create API of the customer balance service. It fails again due to the invalid data.
- rollback starts. Call the cancel API of the ledger service to delete the record created in step 2. It could fail again or it could succeed.
Now we reach the most interesting part of this article. Yes, you are right. The rollback in the second attempt might fail too. But it should succeed within a finite number of retries.
What do I mean by a finite number of retries? To explain it, I have to categorize failures into 2 different classes.
- Retryable failure: This is typically caused by a server outage or a network failure. After the system recovers, the request should succeed on retry.
- Unretryable failure: This is typically caused by invalid data. Because the data is invalid, the service intentionally rejects the request. A simple retry of the identical request never succeeds; to get past the failure, you need to change the contents of the request.
I think you already understand. Yes, there is a secret tip for building our logic with the Saga pattern. That is
- The cancel API must never fail with an unretryable failure.
Creation APIs can fail with an unretryable failure, but the cancel API can’t. If every cancel failure is retryable, the cancel will eventually succeed once the system recovers, so the rollback completes within a finite number of retries.
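In client code, the two failure classes could be modeled as two exception types, with a retry helper that retries only the retryable one. This is a rough sketch: `RetryableError`, `UnretryableError`, `call_with_retry` and `FlakyCancel` are hypothetical names, and the retry limit of 5 is arbitrary.

```python
class RetryableError(Exception):
    """Server down, network failure, and so on (hypothetical)."""


class UnretryableError(Exception):
    """Invalid request: the same request can never succeed (hypothetical)."""


def call_with_retry(operation, max_attempts=5):
    """Retry retryable failures; let unretryable ones propagate immediately."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RetryableError:
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            # A real client would wait here (for example with backoff) before retrying.


class FlakyCancel:
    """A cancel call that fails `failures` times with a retryable failure."""

    def __init__(self, failures):
        self.failures_left = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.failures_left > 0:
            self.failures_left -= 1
            raise RetryableError("network timeout")
        return "canceled"


cancel = FlakyCancel(failures=2)
result = call_with_retry(cancel)  # two failed calls, the third one succeeds
```

Because the cancel only ever raises `RetryableError`, wrapping it in `call_with_retry` is enough to finish the rollback once the system recovers.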
Further Further Thinking: What happens if all the failures are retryable?
The last thought I would like to share is the question above. In other words, if the creation APIs, just like the cancel API, never fail with an unretryable failure, do we still need the rollback steps in Saga?
Given that the creation APIs are idempotent, we don’t need the rollback step at all. We can just retry all the steps until all the creation APIs succeed. The nice thing about this is that we no longer need the cancel APIs. Isn’t that nice?
So, transaction management in micro-services would be much simpler if you can make sure that
- all the creation APIs are idempotent
- all the failures from the creation APIs are retryable
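Under these two conditions, the client logic shrinks to a plain retry loop with no cancel calls at all. Here is a minimal sketch, where `FlakyIdempotentService`, `create_all` and the `max_rounds` limit are assumptions made for illustration.

```python
class RetryableError(Exception):
    """A failure that goes away once the system recovers (hypothetical)."""


class FlakyIdempotentService:
    """Idempotent create API that fails the first `failures` calls."""

    def __init__(self, failures=0):
        self.records = {}
        self.failures_left = failures

    def create(self, tx_id, amount):
        if self.failures_left > 0:
            self.failures_left -= 1
            raise RetryableError("temporarily unavailable")
        # Repeated calls with the same tx_id leave the record unchanged.
        self.records.setdefault(tx_id, amount)


def create_all(services, tx_id, amount, max_rounds=10) -> bool:
    """Retry the whole sequence of creates until every call succeeds."""
    for _ in range(max_rounds):
        try:
            for service in services:
                service.create(tx_id, amount)
            return True
        except RetryableError:
            continue  # a real client would wait before the next round
    return False


ledger = FlakyIdempotentService(failures=1)
balance = FlakyIdempotentService(failures=2)
create_all([ledger, balance], "tx1", 100)  # succeeds in the fourth round
```

Each round simply starts again from the first service, which is safe only because every create call is idempotent.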
How can we satisfy the second condition? The answer is to make sure that no invalid data ever reaches the creation APIs. In other words, if you can validate your request data perfectly in the client of the creation API, you can avoid calling the creation API with invalid data.
But, you know, this is kind of circular. How can we validate the data perfectly without calling the creation API? Isn’t that almost the same as duplicating the creation API’s logic in the client? Is that even possible in practice, where the logic is always changing?
This issue can be addressed if the service provides a trial API, which tests whether the contents of the request can be processed successfully or not. The name might be dry-run API or validation API or whatever.
Very interesting. We have now bumped into the basic idea of 2-phase commit, which is a famous protocol for implementing distributed transaction management. In the first phase of 2-phase commit, the participants make sure that no failure will be caused by invalid data. Then, once we are sure that all the participants can create the record successfully, the coordinator instructs all the participants to create the record.
The whole process for our virtual currency management would look like this.
For the success case,
- start “transaction”. In micro-services, this does nothing.
- call the trial API of the ledger service. O.K., it succeeded.
- call the trial API of the customer balance service. O.K., it succeeded.
- call the create API of the ledger service. It should succeed, possibly after several retries.
- call the create API of the customer balance service. It should succeed, possibly after several retries.
- No rollback step needed.
For the failure case, the failure should be caught in the trial step.
- start “transaction”. In micro-services, this does nothing.
- call the trial API of the ledger service. O.K., it succeeded.
- call the trial API of the customer balance service. Sad, it failed.
- No rollback step needed because we didn’t create anything yet.
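The two flows above can be sketched as follows. `TrialService`, its `limit` validation rule, `ValidationError` and `trial_then_create` are all made up for this example. The retry of retryable failures in the create phase is left out to keep it short.

```python
class ValidationError(Exception):
    """Unretryable failure caused by invalid data (hypothetical)."""


class TrialService:
    """In-memory service with a trial (dry-run) API next to its create API."""

    def __init__(self, limit):
        self.limit = limit  # made-up validation rule: amount must not exceed limit
        self.records = {}

    def _validate(self, amount):
        if amount > self.limit:
            raise ValidationError("amount exceeds limit")

    def trial(self, tx_id, amount):
        # Same validation as create, but nothing is written.
        self._validate(amount)

    def create(self, tx_id, amount):
        self._validate(amount)
        self.records.setdefault(tx_id, amount)


def trial_then_create(services, tx_id, amount) -> bool:
    # Phase 1: ask every service whether the request would be accepted.
    try:
        for service in services:
            service.trial(tx_id, amount)
    except ValidationError:
        return False  # nothing was created, so no rollback is needed
    # Phase 2: create the records; only retryable failures are expected here.
    for service in services:
        service.create(tx_id, amount)
    return True


ledger, balance = TrialService(limit=1000), TrialService(limit=500)
trial_then_create([ledger, balance], "tx1", 800)  # rejected by the balance trial
trial_then_create([ledger, balance], "tx2", 300)  # both trials pass, both records are created
```

One caveat: a plain validation call like this is weaker than the first phase of a real 2-phase commit, because nothing is reserved or locked. If the state of a service changes between the trial and the create, the create can still be rejected.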