Outbound SMS Greylisting

Introduction

In certain cases Helium may be unable to connect to an SMS aggregator because of technical issues on the aggregator's side. When this happens, the connection attempt may time out.

When attempting to connect to an external SMS aggregator, certain resources are allocated on Helium's side. These resources cannot be reclaimed until Helium has either received a response from the aggregator or a timeout occurs.

In high-traffic environments this can leave many resources occupied while waiting for responses from an aggregator, in effect starving other critical processes in Helium of resources.

To combat this, Helium applies greylisting at the aggregator level.



Greylisting Behaviour

If multiple attempts to connect to an aggregator result in timeouts within a certain time interval, greylisting is activated for that aggregator and stays active for a certain time period.

During this period all attempts to send SMS messages using the greylisted aggregator will fail in order to avoid a scenario where server resources are occupied while waiting for responses from the aggregator. Note that this includes messages that are sent as part of Helium's retry mechanism.

Greylisting is therefore based on the assumption that if multiple connection issues (timeouts) occur within a short time period, there is a high likelihood that further connection issues will be experienced for a short time period thereafter.

The number of timeouts that result in greylisting, the time interval in which they must occur, and the time period for which greylisting stays active are all configurable on a Helium server. These values are part of a system-wide config and cannot be set per aggregator.
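The behaviour described above can be sketched as a small per-aggregator state tracker. This is only an illustration, not Helium's implementation: the class name GreylistTracker, its methods, and the use of plain minute values for time are all hypothetical. It also assumes that the failure counter restarts once more than the reset interval has passed since the previous timeout, that the counter is cleared when greylisting is activated, and that a successful send leaves the counter untouched. It does not model the switch that enables or disables greylisting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GreylistTracker:
    """Hypothetical greylisting state for one aggregator; times are in minutes."""
    failure_threshold: int
    failure_counter_reset_time: float
    greylisting_time: float
    failures: int = 0
    last_timeout_at: Optional[float] = None
    greylisted_at: Optional[float] = None

    def is_greylisted(self, now: float) -> bool:
        # Assumption: greylisting lapses once greylisting_time minutes have passed.
        if self.greylisted_at is None:
            return False
        if now - self.greylisted_at >= self.greylisting_time:
            self.greylisted_at = None
            return False
        return True

    def record_timeout(self, now: float) -> None:
        # Assumption: the counter restarts when the previous timeout is too old.
        if (self.last_timeout_at is not None
                and now - self.last_timeout_at > self.failure_counter_reset_time):
            self.failures = 0
        self.failures += 1
        self.last_timeout_at = now
        if self.failures >= self.failure_threshold:
            self.greylisted_at = now
            self.failures = 0  # Assumption: counter is cleared on activation.


# Two timeouts one minute apart reach a threshold of 2 and activate greylisting.
tracker = GreylistTracker(failure_threshold=2,
                          failure_counter_reset_time=5,
                          greylisting_time=15)
tracker.record_timeout(0)
tracker.record_timeout(1)
print(tracker.is_greylisted(2))   # True
print(tracker.is_greylisted(20))  # False
```

A caller would check is_greylisted before opening a connection and fail the message immediately when it returns True, which is what keeps resources from being tied up while the aggregator is unresponsive.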



Greylisting Configuration

Greylisting behaviour is controlled by a system-wide configuration in Helium. The table below describes the relevant fields:

| Field | Description |
|---|---|
| greylistingEnabled | Specifies whether greylisting is enabled (that is, whether it can be activated for an aggregator). Valid values are true and false. |
| failureThreshold | The number of timeouts that must occur (within an interval of failureCounterResetTime) before greylisting is activated for the aggregator. The value must be greater than 0. |
| failureCounterResetTime | The time period, in minutes, during which timeouts that occur can result in greylisting being activated for the aggregator. The value must be greater than 0. |
| greylistingTime | The time period, in minutes, during which greylisting stays active after being activated for an aggregator. The value must be greater than 0. |

These fields can be found in the sms_service_config table in Helium and can be updated and inspected using the sms-update-service-config.py and sms-get-service-config.py scripts in the Helium project.

Please consult DevOps if any changes to this config are required.

Given the field descriptions above, consider the following configuration:

{  
   "id":"00000000-0000-0000-0011-000000000000",
   "greylistingTime":10,
   "failureThreshold":3,
   "greylistingEnabled":true,
   "failureCounterResetTime":10
}

Given this, the greylisting behaviour will be as follows:

If three timeouts occur (for the same aggregator) within ten minutes of each other, greylisting will be activated for that aggregator and stay active for a ten-minute period.
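Before a configuration like the one above is applied, the constraints from the field descriptions can be checked with a few lines of code. The sketch below is a hypothetical helper, not part of sms-update-service-config.py; the function name validate_greylisting_config is invented for illustration, and requiring whole numbers is an assumption (the field descriptions only state that the values must be greater than 0).

```python
import json


def validate_greylisting_config(config: dict) -> list:
    """Return a list of problems; an empty list means no problems were found."""
    problems = []
    if not isinstance(config.get("greylistingEnabled"), bool):
        problems.append("greylistingEnabled must be true or false")
    for field in ("failureThreshold", "failureCounterResetTime", "greylistingTime"):
        value = config.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        # Assumption: values are whole numbers, as in the example config.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            problems.append(f"{field} must be an integer greater than 0")
    return problems


# The example configuration from this page.
raw = """{
   "id":"00000000-0000-0000-0011-000000000000",
   "greylistingTime":10,
   "failureThreshold":3,
   "greylistingEnabled":true,
   "failureCounterResetTime":10
}"""
config = json.loads(raw)
print(validate_greylisting_config(config))  # []
```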



Examples

Example showing greylisting being activated

Given the above configuration, and assuming the same aggregator is used for all messages, consider the following case in which greylisting will be activated:

| Time | Message Content | Attempt Number | Result | Notes |
|---|---|---|---|---|
| 12:00 | Test 1 | 1 | java.net.SocketTimeoutException: Read timed out | First timeout. Failure counter is set to 1. |
| 12:01 | Test 1 | 2 | java.net.SocketTimeoutException: Read timed out | Second timeout within ten minutes. Failure counter is incremented to 2. |
| 12:02 | Test 1 | 3 | java.net.SocketTimeoutException: Read timed out | Third timeout within ten minutes. Failure counter is incremented. Failure threshold of 3 is reached and greylisting is activated. |
| 12:04 | Test 2 | 1 | Greylisted | Message fails due to greylisting. |
| 12:05 | Test 2 | 2 | Greylisted | Message fails due to greylisting. |
| 12:06 | Test 2 | 3 | Greylisted | Message fails due to greylisting. |
| 12:15 | Test 3 | 1 | Success | More than 10 minutes have elapsed since greylisting was activated. Normal routing is resumed. Here we also assume that the aggregator has recovered and the message is sent successfully, but this is not a given. |

Example showing greylisting not being activated

Given the above configuration, and assuming the same aggregator is used for all messages, consider the following case in which greylisting won't be activated:

| Time | Message Content | Attempt Number | Result | Notes |
|---|---|---|---|---|
| 12:00 | Test 1 | 1 | java.net.SocketTimeoutException: Read timed out | First timeout. Failure counter is set to 1. |
| 12:01 | Test 1 | 2 | java.net.SocketTimeoutException: Read timed out | Second timeout within ten minutes. Failure counter is incremented to 2. |
| 12:02 | Test 1 | 3 | Success | Aggregator recovers and message is sent successfully. |
| 12:12 | Test 2 | 1 | java.net.SocketTimeoutException: Read timed out | More than ten minutes has lapsed since the last timeout implying the failure counter has since been reset to 0. Failure counter is incremented to 1. |
| 12:13 | Test 2 | 2 | java.net.SocketTimeoutException: Read timed out | Second timeout within ten minutes. Failure counter is incremented to 2. |
| 12:13 | Test 2 | 3 | Success | Aggregator recovers and message is sent successfully. |
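The two timelines above can be replayed with a short simulation to check that the configured values produce the results shown. This is a sketch under stated assumptions rather than Helium's actual logic: the function name replay is hypothetical, times are expressed as minutes past 12:00, the failure counter restarts once more than failureCounterResetTime minutes have passed since the previous timeout, a successful send leaves the counter unchanged, and the counter is cleared when greylisting is activated.

```python
def replay(events, threshold=3, reset_time=10, greylist_time=10):
    """Replay (minute, aggregator_times_out) events; return one result per event."""
    failures = 0
    last_timeout = None
    greylisted_at = None
    results = []
    for minute, times_out in events:
        # While greylisting is active the message fails without a connection attempt.
        if greylisted_at is not None and minute - greylisted_at < greylist_time:
            results.append("Greylisted")
            continue
        greylisted_at = None
        if not times_out:
            results.append("Success")  # Assumption: success leaves the counter as is.
            continue
        # Assumption: restart the counter when the previous timeout is too old.
        if last_timeout is not None and minute - last_timeout > reset_time:
            failures = 0
        failures += 1
        last_timeout = minute
        if failures >= threshold:
            greylisted_at = minute
            failures = 0  # Assumption: counter is cleared on activation.
        results.append("Timeout")
    return results


# First example: three timeouts by 12:02 activate greylisting until after 12:12.
# The True flags at minutes 4 to 6 are never used because no attempt is made.
activated = replay([(0, True), (1, True), (2, True),
                    (4, True), (5, True), (6, True), (15, False)])
print(activated)
# ['Timeout', 'Timeout', 'Timeout', 'Greylisted', 'Greylisted', 'Greylisted', 'Success']

# Second example: the counter never reaches 3, so greylisting is not activated.
not_activated = replay([(0, True), (1, True), (2, False),
                        (12, True), (13, True), (13, False)])
print(not_activated)
# ['Timeout', 'Timeout', 'Success', 'Timeout', 'Timeout', 'Success']
```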