Deploying a Transit Gateway that integrates with a DX connection on AWS with the CDK

AWS released the Transit Gateway (TGW) back in 2018. It was a breakthrough, enabling customers to connect Amazon Virtual Private Clouds (VPCs) and their on-premises networks through a single gateway. On its own, the TGW is a really powerful service, but when paired with other resources like Direct Connect (DX), some limitations start to appear.

Those limitations were discussed extensively in the article titled AWS Transit Gateway for connecting to on-premise: A thorough study. In this extension to the original article, we will focus on the limitations currently in place when trying to deploy a solution fully backed by Infrastructure as Code (IaC) and leveraging CloudFormation, developed in this particular case with the AWS Cloud Development Kit (CDK).

The main driver of this article is the fact that CloudFormation has limited support for resources specific to the DX and TGW services and, most importantly, for the integration between the two. Furthermore, we will not cover every potential way to integrate the solution; rather, we will focus on the resources required to deploy the solution as it was discussed in the article mentioned earlier.

With all this in mind, we will focus on the following four topics:

  • Required resources

  • Custom CloudFormation resources with the AWS CDK

  • Code Reference for the Custom Resources

  • Extra pointers

Required resources

In principle, some parts of the solution are not deployed with IaC. A good reason for this is resources that require external verification from third parties, or even resources that appear in the destination account as a result of an external process. One such example is establishing a Hosted DX Connection with one of AWS's networking partners.

With the assumption that a DX connection has already been established and is available in the AWS account, the following resources need to be created for a complete solution.

Resource Name                        CloudFormation Resource Type
-----------------------------------  ----------------------------
directconnect-gateway                Custom
directconnect-gateway-association    Custom
customer-gateway                     Native
transit-gateway                      Native
resource-share                       Native
vpn-connection                       Native
transit-gateway-route                Native
transit-virtual-interface            Custom
cloudwatch-alarm                     Native

Custom CloudFormation resources with the AWS CDK

There are several ways to deploy Custom Resources with the AWS CDK. The CDK itself has a mini-framework for developing custom resources. For more information, you can look in the documentation of the CDK here.

The alternative is to define lambda functions for your custom resources and interact with CloudFormation either completely manually, or by leveraging a library that abstracts the calls to CloudFormation. One example is the crhelper library for Python. You can find more information on this here.
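
As an illustration of the "completely manually" option, the sketch below shows how a handler can report back to CloudFormation by PUT-ing a JSON document to the pre-signed `ResponseURL` that is included in every custom resource event. The helper names are my own; crhelper does essentially this for you, plus error handling and timeout management.

```python
import json
import urllib.request


def build_response(event, context, status, data=None, reason=''):
    """Assemble the response document CloudFormation expects."""
    return {
        'Status': status,  # 'SUCCESS' or 'FAILED'
        'Reason': reason or f'See CloudWatch logs: {context.log_stream_name}',
        'PhysicalResourceId': event.get(
            'PhysicalResourceId', context.log_stream_name
        ),
        'StackId': event['StackId'],
        'RequestId': event['RequestId'],
        'LogicalResourceId': event['LogicalResourceId'],
        'Data': data or {}
    }


def send_response(event, context, status, data=None, reason=''):
    """PUT the response document to the pre-signed S3 URL that
    CloudFormation provided in the event."""
    body = json.dumps(build_response(event, context, status, data, reason))
    req = urllib.request.Request(
        event['ResponseURL'],
        data=body.encode('utf-8'),
        method='PUT',
        headers={'Content-Type': ''}
    )
    urllib.request.urlopen(req)
```

A real handler would branch on `event['RequestType']` (`Create`, `Update`, `Delete`), perform the corresponding DX/TGW API calls, and always call `send_response`, even on failure, otherwise the stack hangs until it times out.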

In order to define your custom resources using Lambda, you need to perform two actions:

  1. Define and deploy the lambda function that holds the logic for the custom resource.

  2. Define the custom resource that leverages the lambda function defined above.

A boilerplate of how you can achieve this with the CDK can be found below.

import os

from aws_cdk import (
    aws_iam as iam,
    aws_lambda as lambda_,
    aws_logs as logs,
    core
)

class DirectConnectTGW():
    def __init__(self, scope):
        custom_resource_lambda = lambda_.Function(
            scope,
            '<custom_resource>',
            code=(
                lambda_.AssetCode(
                    os.path.join(
                        os.path.dirname(__file__),
                        '<path_to_source>'
                    )
                )
            ),
            handler='main.lambda_handler',
            runtime=lambda_.Runtime.PYTHON_3_7,
            timeout=core.Duration.minutes(5),
            initial_policy=[
                iam.PolicyStatement(
                    actions=[
                        'lambda:AddPermission',
                        'lambda:RemovePermission',
                        'events:PutRule',
                        'events:DeleteRule',
                        'events:PutTargets',
                        'events:RemoveTargets'
                    ],
                    resources=[
                        '*'
                    ]
                ),
                iam.PolicyStatement(
                    actions=[
                        '<policies_specific_to_the_lambda_code>'
                    ],
                    resources=[
                        '*'
                    ]
                )
            ],
            log_retention=logs.RetentionDays.ONE_YEAR
        )

        dc_gateway_custom_resource = core.CustomResource(
            scope, '<custom_resource_id>',
            service_token=custom_resource_lambda.function_arn,
            properties={
                '<property_name1>': '<property_value1>',
                '<property_name2>': '<property_value2>'
            }
        )

Code Reference for the Custom Resources

For the context of this article, the custom resources have been developed using Lambda and the code is utilizing the crhelper library from AWS. As part of this blog, reference lambda functions written in Python are available here.

Extra pointers

Custom Metric for the BGP session of the TVI

I would like to make a special note in regard to the transit virtual interface and monitoring its BGP session status. Currently, there is no CloudWatch metric available for this. It can, however, easily be achieved by creating a lambda function that generates a custom metric for you. The following Python snippet showcases a potential way to achieve this.

import boto3
import datetime
import dateutil.tz
import os

def lambda_handler(event, context):
    region_name = os.environ['AWS_REGION']

    dc = boto3.client('directconnect', region_name=region_name)
    cw = boto3.client('cloudwatch', region_name=region_name)

    virtualInterfaces = dc.describe_virtual_interfaces(
        connectionId=event['connectionId']
    )['virtualInterfaces']

    for vi in virtualInterfaces:
        state = 1 if vi['virtualInterfaceState'] == 'available' else 0

        cw.put_metric_data(
            Namespace='AWS/DX',
            MetricData=[
                {
                    'MetricName': 'BGPStatus',
                    'Dimensions': [{
                        'Name': 'VirtualInterfaceId',
                        'Value': vi['virtualInterfaceId']
                    }],
                    'Timestamp': datetime.datetime.now(dateutil.tz.tzlocal()),
                    'Value': state
                }
            ]
        )

    return True
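
As a side note, the available/not-available mapping above can be pulled out into a pure helper, which makes the handler easy to unit-test without mocking any AWS calls. The function name `vi_states_to_metrics` is my own, not part of any AWS API.

```python
def vi_states_to_metrics(virtual_interfaces):
    """Map each virtual interface to its metric value:
    1 if the interface is 'available', 0 otherwise."""
    return {
        vi['virtualInterfaceId']:
            1 if vi['virtualInterfaceState'] == 'available' else 0
        for vi in virtual_interfaces
    }
```

The handler would then simply iterate over the returned mapping and emit one `put_metric_data` call per interface.
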

Integrating the above in your CDK solution could be achieved with something like this:

        # Custom Metric for Transit Virtual Interface connection status
        transit_vi_custom_metric_function = lambda_.Function(
            scope,
            'TransitVirtualInterfaceCustomMetric',
            code=(
                lambda_.AssetCode(
                    os.path.join(
                        os.path.dirname(__file__),
                        'transit-virtual-interface-custom-metric'
                    )
                )
            ),
            handler='main.lambda_handler',
            runtime=lambda_.Runtime.PYTHON_3_7,
            timeout=core.Duration.minutes(1),
            initial_policy=[
                iam.PolicyStatement(
                    actions=[
                        'cloudwatch:PutMetricData',
                        'directconnect:DescribeVirtualInterfaces'
                    ],
                    resources=[
                        '*'
                    ]
                )
            ],
            log_retention=logs.RetentionDays.ONE_YEAR
        )

        # the trigger (event) for the Lambda function
        transit_vi_custom_metric_rule = events.Rule(
            scope, 'TransitVirtualInterfaceCustomMetricRule',
            schedule=events.Schedule.rate(core.Duration.minutes(1))
        )
        transit_vi_custom_metric_rule.add_target(
            targets.LambdaFunction(
                handler=transit_vi_custom_metric_function,
                event=events.RuleTargetInput.from_object(
                    {
                        'connectionId': <dx_connection_id>
                    }
                )
            )
        )

        bgp_status_state = cloudwatch.Alarm(
            scope,
            'BGPStatusAlarm',
            metric=cloudwatch.Metric(
                metric_name='BGPStatus',
                namespace='AWS/DX',
                dimensions={
                    'VirtualInterfaceId': <tvi>.get_att_string(
                        'TransitVirtualInterfaceId'
                    )
                },
                period=core.Duration.minutes(1),
                statistic='Maximum'
            ),
            evaluation_periods=5,
            threshold=1,
            alarm_description='Alarm for TVI connection state',
            comparison_operator=(
                cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD
            )
        )

Static Routes for the TGW

Deploying static routes within your TGW route table in code can prove beneficial. If you remember from the original article, there is currently a limitation on the number of routes that can be advertised on premises from a DX connection. As an amendment to the original article, I should mention here that it is possible to advertise a superset of the actual CIDR blocks behind the TGW over the DX connection. This can be used to work around the above limitation by advertising supersets instead of the CIDR block of each VPC.

The problem that arises when you start using supersets of the CIDR blocks is that, if you are using a VPN connection as a failover in conjunction with the DX, the VPN will advertise via automatic propagation the CIDR blocks of all the VPCs behind the TGW. That results in a deviation between the CIDR blocks advertised on premises via the DX and those advertised via the VPN. To tackle this issue, one can inject the TGW route table with superset static routes, exactly as they are advertised via the DX connection.
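
Before deploying, you may want to sanity-check your supersets locally. The standard `ipaddress` module can reproduce the containment check that the EC2 API performs with the `route-search.subnet-of-match` filter; the helper below is a sketch of my own, not part of the solution's code.

```python
import ipaddress


def not_covered_by_supersets(vpc_cidrs, supersets):
    """Return the VPC CIDR blocks that are NOT contained in any of
    the superset CIDR blocks advertised over the DX connection."""
    nets = [ipaddress.ip_network(s) for s in supersets]
    return [
        cidr for cidr in vpc_cidrs
        if not any(ipaddress.ip_network(cidr).subnet_of(net) for net in nets)
    ]


# Example: two VPCs covered by a single /8 superset, one VPC left out
missing = not_covered_by_supersets(
    ['10.1.0.0/16', '10.2.0.0/16', '192.168.0.0/24'],
    ['10.0.0.0/8']
)
# '192.168.0.0/24' is not covered by any superset and would
# therefore not be reachable over the DX connection
```

Any CIDR block reported by this check either needs its own advertised route or a wider superset.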

The following lambda function can help you pick up the manually advertised supersets from your configuration and inject them into your TGW route table.

import boto3
import logging
import os

def lambda_handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    region = os.environ['AWS_REGION']

    ec2 = boto3.client('ec2', region_name=region)

    cidr_blocks = event['cidr_blocks']
    tgw_route_table_id = event['tgw_route_table_id']

    for cidr_block in cidr_blocks:
        # find all routes that belong in a specific CIDR block or its subnets
        routes = ec2.search_transit_gateway_routes(
            TransitGatewayRouteTableId=tgw_route_table_id,
            Filters=[
                {
                    'Name': 'route-search.subnet-of-match',
                    'Values': [
                        cidr_block,
                    ]
                },
                {
                    'Name': 'attachment.resource-type',
                    'Values': [
                        'vpc',
                    ]
                }
            ]
        )

        # array with the CIDR blocks that were identified
        cidr_blocks_for_routes = (
            [d['DestinationCidrBlock'] for d in routes['Routes']]
        )

        # if a static route does not already exist
        if len(cidr_blocks_for_routes) > 0:
            if cidr_block not in cidr_blocks_for_routes:
                tgw_attachments = (
                    [d['TransitGatewayAttachments'] for d in routes['Routes']]
                )
                tgw_vpc_attachments = (
                    [d for d in tgw_attachments if d[0]['ResourceType'] == 'vpc']  # noqa: E501
                )
                # identify attachment id for VPC
                tgw_attach_id = tgw_vpc_attachments[0][0]['TransitGatewayAttachmentId']  # noqa: E501

                logger.info('Creating route for: %s' % str(cidr_block))

                ec2.create_transit_gateway_route(
                    DestinationCidrBlock=cidr_block,
                    TransitGatewayRouteTableId=tgw_route_table_id,
                    TransitGatewayAttachmentId=tgw_attach_id
                )

    ####################
    # start - cleanup
    # find all static routes to vpcs
    all_static_routes = ec2.search_transit_gateway_routes(
        TransitGatewayRouteTableId=tgw_route_table_id,
        Filters=[
            {
                'Name': 'type',
                'Values': [
                    'static',
                ]
            },
            {
                'Name': 'attachment.resource-type',
                'Values': [
                    'vpc',
                ]
            }
        ]
    )

    # array with the CIDR blocks that were identified
    all_cidr_blocks_for_static_routes = (
        [d['DestinationCidrBlock'] for d in all_static_routes['Routes']]
    )

    for cidr_block in all_cidr_blocks_for_static_routes:
        if cidr_block not in cidr_blocks:
            logger.info('Deleting route for: %s' % str(cidr_block))
            ec2.delete_transit_gateway_route(
                DestinationCidrBlock=cidr_block,
                TransitGatewayRouteTableId=tgw_route_table_id
            )
    ####################
    # end - cleanup
    ####################

All the above can be found here together with the CloudFormation custom resources.

Conclusion

Using the TGW to connect your on-premises environment to VPCs on AWS can simplify things a lot. Limitations in CloudFormation can complicate the solution, as explained above. I do hope the information provided here and the custom resources that have been shared will help you get going. Knowing AWS, these CloudFormation limitations should be tackled one by one in future announcements, but for now, this is the state of things.