Outbound SMS Greylisting
Introduction
Helium may occasionally be unable to connect to an SMS aggregator because of technical issues on the aggregator's side. In these cases a connection timeout may occur.
When attempting to connect to an external SMS aggregator, certain resources are allocated on Helium's side. These resources cannot be reclaimed until Helium has either received a response from the aggregator or a timeout occurs.
In high-traffic environments, many resources can end up occupied while waiting for responses from an aggregator, in effect starving other critical processes in Helium of resources.
To combat this, Helium makes use of greylisting on an aggregator level.
Greylisting Behaviour
If multiple attempts to connect to an aggregator result in timeouts within a certain time interval, greylisting is activated for that aggregator and stays activated for a certain time period.
During this period all attempts to send SMS messages using the greylisted aggregator will fail in order to avoid a scenario where server resources are occupied while waiting for responses from the aggregator. Note that this includes messages that are sent as part of Helium's retry mechanism.
Greylisting is therefore based on the assumption that if multiple connection issues (timeouts) occur within a short time period, there is a high likelihood that further connection issues will be experienced for a short time period thereafter.
The number of timeouts that result in greylisting, the time interval in which they must occur, and the time period for which greylisting stays activated can all be configured on a Helium server. These values are part of a system-wide configuration and are not set per aggregator.
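The behaviour described above can be sketched as a small per-aggregator tracker. This is a minimal illustration, not Helium's implementation: the names `GreylistTracker` and `AggregatorState` are hypothetical, and the sketch assumes that the failure counter restarts once more than the reset time has passed since the previous timeout, and that the counter is cleared when greylisting is activated.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class AggregatorState:
    """Hypothetical per-aggregator bookkeeping."""
    failure_count: int = 0
    last_timeout: Optional[datetime] = None
    greylisted_until: Optional[datetime] = None


@dataclass
class GreylistTracker:
    """Hypothetical tracker; one shared configuration for all aggregators."""
    failure_threshold: int
    failure_counter_reset_time: timedelta
    greylisting_time: timedelta
    enabled: bool = True
    _states: Dict[str, AggregatorState] = field(default_factory=dict)

    def is_greylisted(self, aggregator: str, now: datetime) -> bool:
        """Return True if sends to this aggregator should fail immediately."""
        state = self._states.get(aggregator)
        return bool(
            self.enabled
            and state is not None
            and state.greylisted_until is not None
            and now < state.greylisted_until
        )

    def record_timeout(self, aggregator: str, now: datetime) -> None:
        """Count a connection timeout and activate greylisting at the threshold."""
        if not self.enabled:
            return
        state = self._states.setdefault(aggregator, AggregatorState())
        # Assumption: the counter restarts when the previous timeout is
        # older than the configured reset time.
        if (
            state.last_timeout is not None
            and now - state.last_timeout > self.failure_counter_reset_time
        ):
            state.failure_count = 0
        state.failure_count += 1
        state.last_timeout = now
        if state.failure_count >= self.failure_threshold:
            state.greylisted_until = now + self.greylisting_time
            # Assumption: the counter is cleared once greylisting starts.
            state.failure_count = 0
```

A caller would check `is_greylisted` before opening a connection, so that a greylisted aggregator fails fast without allocating resources, and call `record_timeout` whenever a connection attempt times out.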
Greylisting Configuration
The behaviour of greylisting can be configured as part of a system-wide configuration in Helium. The table below describes the fields relevant to this configuration:
| Field | Description |
|---|---|
| greylistingEnabled | Specifies whether greylisting is enabled (can be activated for an aggregator) or not. Valid values are true and false. |
| failureThreshold | The number of timeouts that have to occur (within an interval of failureCounterResetTime) before greylisting is activated for the aggregator. The value must be greater than 0. |
| failureCounterResetTime | The time period, in minutes, during which timeouts that occur can result in greylisting being activated for the aggregator. The value must be greater than 0. |
| greylistingTime | The time period, in minutes, during which greylisting stays active after being activated for an aggregator. The value must be greater than 0. |
These fields can be found in the sms_service_config table in Helium and can be updated and inspected using the sms-update-service-config.py and sms-get-service-config.py scripts in the Helium project.
Please consult DevOps if any changes to this config are required.
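The constraints in the table can be checked before a change is requested. The sketch below is only an illustration and is not the code of the Helium scripts: `validate_greylisting_config` is a hypothetical helper, and it assumes the three numeric fields are integers, which the field descriptions do not state explicitly.

```python
from typing import Any, Dict, List

# Fields that the table describes as "must be greater than 0".
POSITIVE_FIELDS = ("failureThreshold", "failureCounterResetTime", "greylistingTime")


def validate_greylisting_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means the config looks valid."""
    errors = []
    if not isinstance(config.get("greylistingEnabled"), bool):
        errors.append("greylistingEnabled must be true or false")
    for name in POSITIVE_FIELDS:
        value = config.get(name)
        # bool is a subclass of int in Python, so it is excluded explicitly.
        # Assumption: values are whole numbers.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            errors.append(f"{name} must be an integer greater than 0")
    return errors
```

For example, passing a config with `"failureThreshold": 0` returns a single error message naming that field.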
Given the field descriptions above, consider the following configuration:
{
"id":"00000000-0000-0000-0011-000000000000",
"greylistingTime":10,
"failureThreshold":3,
"greylistingEnabled":true,
"failureCounterResetTime":10
}
Given this configuration, the greylisting behaviour will be as follows:
If three timeouts occur (for the same aggregator) within ten minutes of each other, greylisting will be activated for that aggregator and stay activated for a ten-minute period.
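As a small worked example of the arithmetic, the configuration above can be parsed and used to work out when a greylisting period ends. The helper `greylisting_end` and the calendar date are hypothetical and serve only as illustration; the sketch assumes greylisting ends exactly `greylistingTime` minutes after activation.

```python
import json
from datetime import datetime, timedelta

CONFIG_JSON = """
{
  "id": "00000000-0000-0000-0011-000000000000",
  "greylistingTime": 10,
  "failureThreshold": 3,
  "greylistingEnabled": true,
  "failureCounterResetTime": 10
}
"""

config = json.loads(CONFIG_JSON)


def greylisting_end(activated_at: datetime) -> datetime:
    """Time at which greylisting lapses (greylistingTime is in minutes)."""
    return activated_at + timedelta(minutes=config["greylistingTime"])


# If the third timeout occurs at 12:02, greylisting lasts until 12:12.
activated = datetime(2024, 1, 1, 12, 2)  # arbitrary date for illustration
print(greylisting_end(activated).strftime("%H:%M"))  # prints 12:12
```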
Examples
Example showing greylisting being activated
Given the above configuration, and assuming the same aggregator is used for all messages, consider the following case in which greylisting will be activated:
| Time | Message Content | Attempt Number | Result | Notes |
|---|---|---|---|---|
| 12:00 | Test 1 | 1 | java.net.SocketTimeoutException: Read timed out | First timeout. Failure counter is set to 1. |
| 12:01 | Test 1 | 2 | java.net.SocketTimeoutException: Read timed out | Second timeout within ten minutes. Failure counter is incremented to 2. |
| 12:02 | Test 1 | 3 | java.net.SocketTimeoutException: Read timed out | Third timeout within ten minutes. Failure counter is incremented to 3. The failure threshold of 3 is reached and greylisting is activated. |
| 12:04 | Test 2 | 1 | Greylisted | Message fails due to greylisting. |
| 12:05 | Test 2 | 2 | Greylisted | Message fails due to greylisting. |
| 12:06 | Test 2 | 3 | Greylisted | Message fails due to greylisting. |
| 12:15 | Test 3 | 1 | Success | More than 10 minutes have elapsed since greylisting was activated. Normal routing is resumed. Here we also assume that the aggregator has recovered and the message is sent successfully, but this is not a given. |
Example showing greylisting not being activated
Given the above configuration, and assuming the same aggregator is used for all messages, consider the following case in which greylisting won't be activated:
| Time | Message Content | Attempt Number | Result | Notes |
|---|---|---|---|---|
| 12:00 | Test 1 | 1 | java.net.SocketTimeoutException: Read timed out | First timeout. Failure counter is set to 1. |
| 12:01 | Test 1 | 2 | java.net.SocketTimeoutException: Read timed out | Second timeout within ten minutes. Failure counter is incremented to 2. |
| 12:02 | Test 1 | 3 | Success | Aggregator recovers and message is sent successfully. |
| 12:12 | Test 2 | 1 | java.net.SocketTimeoutException: Read timed out | More than ten minutes have elapsed since the last timeout, so the failure counter has been reset to 0. This timeout sets the failure counter to 1. |
| 12:13 | Test 2 | 2 | java.net.SocketTimeoutException: Read timed out | Second timeout within ten minutes. Failure counter is incremented to 2. |
| 12:13 | Test 2 | 3 | Success | Aggregator recovers and message is sent successfully. |
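Both timelines above can be replayed with a short simulation. This is a sketch under stated assumptions rather than Helium's actual logic: `replay` and `at` are hypothetical helpers, the counter is assumed to restart when more than ten minutes have passed since the previous timeout, and a successful send is assumed to leave the counter unchanged.

```python
from datetime import datetime, timedelta

# Values from the example configuration.
THRESHOLD = 3
RESET_TIME = timedelta(minutes=10)
GREYLISTING_TIME = timedelta(minutes=10)


def at(hhmm: str) -> datetime:
    """Parse a wall-clock time such as '12:04' (the date part is irrelevant here)."""
    return datetime.strptime(hhmm, "%H:%M")


def replay(events):
    """Replay (time, aggregator_times_out) pairs for one aggregator.

    Returns one outcome per event: 'Timeout', 'Greylisted' or 'Success'.
    """
    count, last_timeout, greylisted_until = 0, None, None
    outcomes = []
    for hhmm, times_out in events:
        now = at(hhmm)
        if greylisted_until is not None and now < greylisted_until:
            outcomes.append("Greylisted")  # fail fast, no connection attempt
            continue
        if not times_out:
            outcomes.append("Success")  # assumption: counter is left unchanged
            continue
        # Assumption: the counter restarts after a quiet period longer than RESET_TIME.
        if last_timeout is not None and now - last_timeout > RESET_TIME:
            count = 0
        count += 1
        last_timeout = now
        if count >= THRESHOLD:
            greylisted_until = now + GREYLISTING_TIME
            count = 0
        outcomes.append("Timeout")
    return outcomes


activated_case = [("12:00", True), ("12:01", True), ("12:02", True),
                  ("12:04", True), ("12:05", True), ("12:06", True),
                  ("12:15", False)]
not_activated_case = [("12:00", True), ("12:01", True), ("12:02", False),
                      ("12:12", True), ("12:13", True), ("12:13", False)]

print(replay(activated_case))
print(replay(not_activated_case))
```

Under these assumptions the first run reports three timeouts, three greylisted attempts and a final success, while the second run never reports a greylisted attempt.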