Monitoring EC2 RI Utilization on AWS with CFN + Lambda + CloudWatch Events + SNS + Role


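0. Introduction

In production we frequently need to check the RI utilization, expiry dates, instance types, and similar data for EC2, RDS, and ElastiCache. This time the task was to monitor RI utilization for EC2, RDS, and ElastiCache, so I wrote this test based on the Amazon SDK and an existing company script that checks RI expiry dates.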

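0.1 v1 workflow
① First create an SNS topic with an Email subscription
② Then write a Lambda script. It gets the current time, fetches EC2 RI utilization for the week leading up to it, and sends an email alert through SNS when utilization is below 97%
③ Once the Lambda function has been tested, write a CFN template that automatically creates the Lambda function, the role, and a CloudWatch Events rule that invokes the Lambda function on a schedule
0.2 v2 updates
① The v1 CloudFormation template was missing the SNS part. v2 adds the SNS Topic and SNS Subscription to the template and moves the Lambda function after them, since the function uses the topic ARN. Note that CloudFormation does not create resources in the order they appear in the file; it derives the order from references such as !Ref and !GetAtt (or an explicit DependsOn), so the reordering only helps readability
② The v2 template adds environment variables to the Lambda function: topic_arn (obtained from the SNS Topic resource with !Ref), before_days (how many days back from now the query window should start), and appenv (the project name). The Lambda function reads each value with os.environ['key']
③ Compared with v1, the v2 Lambda function also covers RDS and ElastiCache RIs and reports details for each RI (instance type, platform, RI count, region). An RI is included in the SNS notification when its utilization is below 97%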
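The environment variables described above can be read and validated in one place. The following is a minimal sketch, not the code used in this post: the helper name `load_config`, the fallback values for `before_days` and `appenv`, and the sample ARN (including its account id) are assumptions made for illustration. It also derives the topic region from the ARN, relying on the standard ARN layout `arn:partition:service:region:account-id:resource`, instead of hard-coding it.

```python
import os


def load_config(env=None):
    """Read the settings that the template passes in as environment variables.

    `env` defaults to os.environ; passing a plain dict makes local testing easy.
    """
    env = os.environ if env is None else env
    topic_arn = env["topic_arn"]
    # ARN layout: arn:partition:service:region:account-id:resource
    parts = topic_arn.split(":")
    if len(parts) < 6 or parts[2] != "sns":
        raise ValueError("topic_arn does not look like an SNS topic ARN: " + topic_arn)
    return {
        "topic_arn": topic_arn,
        "topic_region": parts[3],
        # The fallback values below are assumptions, not part of the original template
        "before_days": int(env.get("before_days", "7")),
        "appenv": env.get("appenv", "unknown"),
    }


# Example with a made-up ARN (the account id is a placeholder)
cfg = load_config({
    "topic_arn": "arn:aws-cn:sns:cn-north-1:123456789012:RI_Utilization_Monitor",
    "before_days": "7",
    "appenv": "demo",
})
print(cfg["topic_region"], cfg["before_days"], cfg["appenv"])
```

Inside the Lambda function, `load_config()` with no argument would read the real environment.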
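0.3 Architecture diagram

CFN_Lambda_SNS_RI.jpg

The automated flow is:

Upload the Lambda code to an S3 bucket, then deploy the CFN template to create the Role, Lambda Function, Lambda Permission, and CloudWatch Events resources automatically. CloudWatch Events invokes the Lambda function on a schedule; the function runs under the Role, checks RI utilization, and for any RI below the threshold it outputs the RI details and notifies the user through SNS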
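The decision step in this flow can be sketched as a small pure function. This is an illustration under stated assumptions rather than the code used in this post: the names `low_utilization_lines` and `notify_if_needed` are hypothetical, the sample data is made up, and the sketch publishes only when at least one RI is below the threshold, which is one possible reading of the flow.

```python
def low_utilization_lines(groups, threshold=97.0):
    """Return one report line per RI group whose utilization is below threshold."""
    lines = []
    for group in groups:
        pct = float(group["Utilization"]["UtilizationPercentage"])
        if pct < threshold:
            attrs = group["Attributes"]
            lines.append("{} x {} in {}: {:.2f}%".format(
                attrs["numberOfInstances"], attrs["instanceType"], attrs["region"], pct))
    return lines


def notify_if_needed(groups, publish, threshold=97.0):
    """Call publish(message) only when at least one group is below threshold."""
    lines = low_utilization_lines(groups, threshold)
    if lines:
        publish("\n".join(lines))
    return bool(lines)


# Made-up sample data shaped like the 'Groups' list the Lambda code reads
sample_groups = [
    {"Attributes": {"numberOfInstances": "2", "instanceType": "m5.large", "region": "cn-north-1"},
     "Utilization": {"UtilizationPercentage": "100"}},
    {"Attributes": {"numberOfInstances": "1", "instanceType": "t2.nano", "region": "cn-north-1"},
     "Utilization": {"UtilizationPercentage": "42.5"}},
]
sent = []  # stands in for SNS; in Lambda, publish would wrap sns.publish
notified = notify_if_needed(sample_groups, sent.append)
print(notified, sent)
```

Passing the publisher in as a callable keeps the threshold logic testable without AWS credentials.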

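1. Sample output

1.1 Lambda output from an account with many RIs (to show the data, this output is unfiltered; only the Lambda function was tested here, without CFN automation)

RIlog.jpg

1.2 SNS output from another account, deployed with CFN

image.png

2. How to use this solution

2.1 Upload the Lambda code to an S3 bucket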

image.png

import json
import boto3
import datetime
import os


# Get the start and end dates of the query window
def get_date():
    # datetime.datetime.today() returns the current time, e.g. 2020-11-03 14:12:28.339466
    # strftime("%Y-%m-%d") drops the time of day and keeps only year-month-day
    # datetime.datetime.today() - datetime.timedelta(days=1) is today's date minus one day
    # before_days comes from the Lambda environment variables: how many days back from now the RI utilization window starts
    before_days = os.environ['before_days']
    start_time = (datetime.datetime.today() - datetime.timedelta(days=int(before_days))).strftime("%Y-%m-%d")
    stop_time = datetime.datetime.today().strftime("%Y-%m-%d")
    return start_time, stop_time


# Convert the utilization value in the response to an integer (unused)
# def to_int(str):
#     try:
#         int(str)
#         return int(str)
#     except ValueError:
#         try:
#             float(str)
#             return int(float(str))
#         except ValueError:
#             return False

# ec2 ri utilization
def get_ec2_reservation_utilization():
    # regions = ['cn-north-1', 'cn-northwest-1']
    # Project name, read from the environment variables
    appenv = os.environ['appenv']
    # Header line of the SNS message
    message_list = ["  " + appenv + " RI Utilization Monitor: "]

    # for region in regions:
    ec2_client = boto3.client('ce')
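    '''
        When a time period is given, get_reservation_utilization() returns a dict.
        With GroupBy set, each entry of 'UtilizationsByTime' holds one item per group
        in 'Groups', each with its own utilization, and the utilization for the whole
        time period is returned in the 'Total' field
    '''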
    start_time, stop_time = get_date()

    response = ec2_client.get_reservation_utilization(
        TimePeriod={
            'Start': start_time,
            'End': stop_time
        },
        Filter={
            'Dimensions': {
                'Key': 'SERVICE',
                'Values': [
                    'Amazon Elastic Compute Cloud - Compute'
                ]
            }

        },
        GroupBy=[
            {
                'Type': "DIMENSION",
                'Key': "SUBSCRIPTION_ID"
            }
        ]
    )

    # The response is a dict, so fields can be read directly
    # TotalUtilizationPercentage = response['Total']['UtilizationPercentageInUnits']

    # Loop over the RIs in each group

    # ec2_ri_groups_list = response['UtilizationsByTime'][0]['Groups'][0]['Attributes']['instanceType']
    ec2_ri_groups_list = response['UtilizationsByTime'][0]['Groups']
    # for ri in ec2_ri_groups_list:
    #     if  ri['Attributes']['instanceType'] == 'db.m4.2xlarge':
    #         instanceType = 'db.m4.2xlarge'
    #         return instanceType

    # print(ec2_ri_groups_list)
    # print("0------------------")
    if ec2_ri_groups_list:
        message_list.append("  -------EC2 RI--------")
        for ec2_ri_group in ec2_ri_groups_list:
            ec2_ri_region = ec2_ri_group['Attributes']['region']
            ec2_ri_numberOfInstances = ec2_ri_group['Attributes']['numberOfInstances']
            ec2_ri_instanceType = ec2_ri_group['Attributes']['instanceType']
            ec2_ri_platform = ec2_ri_group['Attributes']['platform']

            # Format the utilization for output
            float_ec2_ri_UtilizationPercentage = float(ec2_ri_group['Utilization']['UtilizationPercentage'])
            ec2_ri_UtilizationPercentage = "{:.2f}%".format(float_ec2_ri_UtilizationPercentage)

            if float_ec2_ri_UtilizationPercentage < 97:
                message = "            On the AZ " + ec2_ri_region + " , " + ec2_ri_numberOfInstances \
                          + " EC2 " + ec2_ri_instanceType + " RI Utilization is " + ec2_ri_UtilizationPercentage + " , its platform is " + ec2_ri_platform
                message_list.append(message)
        # If utilization is high, do not output the separator line
        # if float_ec2_ri_UtilizationPercentage >= 97:
        #     message_list.remove("  -------EC2 RI--------")
    return message_list


# rds ri utilization
def get_rds_reservation_utilization():
    # regions = ['cn-north-1', 'cn-northwest-1']
    message_list = get_ec2_reservation_utilization()
    # print(message_list)
    # print("0------------------------")
    # for region in regions:
    rds_client = boto3.client('ce')

    start_time, stop_time = get_date()
    rds_response = rds_client.get_reservation_utilization(
        TimePeriod={
            'Start': start_time,
            'End': stop_time
        },
        Filter={
            'Dimensions': {
                'Key': 'SERVICE',
                'Values': [
                    'Amazon Relational Database Service'
                ]
            }
        },
        GroupBy=[
            {
                'Type': "DIMENSION",
                'Key': "SUBSCRIPTION_ID"
            }
        ]
    )

    rds_ri_groups_list = rds_response['UtilizationsByTime'][0]['Groups']
    if rds_ri_groups_list:
        message_list.append("  -------RDS RI--------")
        for rds_ri_group in rds_ri_groups_list:
            rds_ri_region = rds_ri_group['Attributes']['region']
            rds_ri_numberOfInstance = rds_ri_group['Attributes']['numberOfInstances']
            rds_ri_instanceType = rds_ri_group['Attributes']['instanceType']
            rds_ri_platform = rds_ri_group['Attributes']['platform']

            float_rds_ri_UtilizationPercentage = float(rds_ri_group['Utilization']['UtilizationPercentage'])
            rds_ri_UtilizationPercentage = "{:.2f}%".format(float_rds_ri_UtilizationPercentage)
            if float_rds_ri_UtilizationPercentage < 97:
                message = "            On the AZ " + rds_ri_region + " , " + rds_ri_numberOfInstance \
                          + " RDS " + rds_ri_instanceType + " RI Utilization is " + rds_ri_UtilizationPercentage + " , its platform is " + rds_ri_platform
                message_list.append(message)
        # If utilization is high, do not output the separator line
        # if float_rds_ri_UtilizationPercentage > 97:
        #     message_list.remove("  -------RDS RI--------")
    return message_list


# elasticache ri utilization
def get_elasticache_utilization():
    # regions = ['cn-north-1', 'cn-northwest-1']
    message_list = get_rds_reservation_utilization()
    message_str = "\n".join(message_list)

    # for region in regions:
    elasticache_client = boto3.client('ce')

    start_time, stop_time = get_date()
    elasticache_response = elasticache_client.get_reservation_utilization(
        TimePeriod={
            'Start': start_time,
            'End': stop_time
        },
        Filter={
            'Dimensions': {
                'Key': 'SERVICE',
                'Values': [
                    'Amazon ElastiCache'
                ]
            }
        },
        GroupBy=[
            {
                'Type': "DIMENSION",
                'Key': "SUBSCRIPTION_ID"
            }
        ]
    )

    elasticache_ri_groups_list = elasticache_response['UtilizationsByTime'][0]['Groups']
    if elasticache_ri_groups_list:
        message_list.append("  -------ElastiCache RI--------")
        for elasticache_ri_group in elasticache_ri_groups_list:
            elasticache_ri_region = elasticache_ri_group['Attributes']['region']
            elasticache_ri_numberOfInstance = elasticache_ri_group['Attributes']['numberOfInstances']
            elasticache_ri_instanceType = elasticache_ri_group['Attributes']['instanceType']
            elasticache_ri_platform = elasticache_ri_group['Attributes']['platform']

            float_elasticache_ri_UtilizationPercentage = float(
                elasticache_ri_group['Utilization']['UtilizationPercentage'])
            elasticache_ri_UtilizationPercentage = "{:.2f}%".format(float_elasticache_ri_UtilizationPercentage)

            if float_elasticache_ri_UtilizationPercentage < 97:
                message = "            On the AZ " + elasticache_ri_region + " , " + elasticache_ri_numberOfInstance \
                          + " ElastiCache " + elasticache_ri_instanceType + " RI Utilization is " + elasticache_ri_UtilizationPercentage + " , its platform is " + elasticache_ri_platform
                message_list.append(message)
    # Join once at the end so every appended line, including the ElastiCache header, is in the message
    message_str = "\n".join(message_list)
    return message_str


def sns_publish():
    # Read the Lambda environment variable with os.environ; CloudFormation sets its value in the function's Environment when it creates the Lambda function
    topic_arn = os.environ['topic_arn']
    topic_region = 'cn-north-1'
    # Get the assembled message text
    message_str = get_elasticache_utilization()
    sns = boto3.client('sns', region_name=topic_region)
    response = sns.publish(
        TopicArn=topic_arn,
        Subject='RI Utilization Monitor',
        Message=message_str
    )


def lambda_handler(event, context):
    # Optional gate (disabled): only notify when the total utilization is below 97%
    # TotalUtilizationPercentage = get_reservation_utilization()[0]
    # if float(TotalUtilizationPercentage) < 97:
    sns_publish()
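The three query functions above differ only in the SERVICE filter value, so the request could be built by one helper. The sketch below is an optional refactoring idea, not part of the original script: `SERVICES` and `build_request` are hypothetical names, and the dates are sample values. The service names and the request shape are copied from the calls above.

```python
# Service filter values copied from the three functions above
SERVICES = {
    "EC2": "Amazon Elastic Compute Cloud - Compute",
    "RDS": "Amazon Relational Database Service",
    "ElastiCache": "Amazon ElastiCache",
}


def build_request(service_name, start, end):
    """Build the keyword arguments for get_reservation_utilization for one service."""
    return {
        "TimePeriod": {"Start": start, "End": end},
        "Filter": {"Dimensions": {"Key": "SERVICE", "Values": [service_name]}},
        "GroupBy": [{"Type": "DIMENSION", "Key": "SUBSCRIPTION_ID"}],
    }


# One request per service for a sample one-week window
requests = {
    label: build_request(name, "2020-10-27", "2020-11-03")
    for label, name in SERVICES.items()
}
for label, request in requests.items():
    print(label, request["Filter"]["Dimensions"]["Values"])
```

In the Lambda function, each request would be sent with `boto3.client('ce').get_reservation_utilization(**requests[label])`, and a single loop could then replace the chained EC2, RDS, and ElastiCache functions.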

2.2 Edit the configurable fields in the CloudFormation template
AWSTemplateFormatVersion: 2010-09-09
Description: RI-Utilization-Monitor

Resources:
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      Path: /
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - sts:AssumeRole
      Policies:
        - PolicyName: RI_Utilization_Monitor
          PolicyDocument:
            Version: 2012-10-17
            Statement:
              - Effect: Allow
                Action:
                  - ec2:DescribeReservedInstances
                  - ec2:DescribeReservedInstancesModifications
                  - ec2:DescribeReservedInstancesOfferings
                  - ec2:DescribeReservedInstancesListings
                  - ce:GetReservationUtilization
                  - sns:Publish
                  - s3:Get*
                  - s3:List*
                Resource: "*"

  CloudwacthEventsScheduledRule:
    Type: AWS::Events::Rule
    Properties:
      Name: RI_Utilization_Monitor
      Description: AWS Cloudwatch Events Schedule Rule
      ScheduleExpression: "cron(00 08 ? * FRI *)"                  # Change the Lambda invocation schedule here (UTC)
      State: "ENABLED"
      Targets:
        -
          Arn:
            Fn::GetAtt:
              - LambdaFunctionCreator
              - Arn
          Id: GetReservationUtilization

  PermissionForEventsToInvokeLambda:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !GetAtt
          - LambdaFunctionCreator
          - Arn
      Action: lambda:InvokeFunction
      Principal: events.amazonaws.com
      SourceArn: !GetAtt
          - CloudwacthEventsScheduledRule
          - Arn
  SNSTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: RI_Utilization_Monitor
  SNSSubscription:
    Type: AWS::SNS::Subscription
    Properties:
      Endpoint: 'XXXX@light2cloud.com'             # Change the SNS subscription email address here
      Protocol: email
      TopicArn: !Ref SNSTopic
  LambdaFunctionCreator:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: GetReservationUtilization
      Description: Lambda For RI Utilization Monitor
      Environment:
        Variables:
          topic_arn: !Ref SNSTopic
          before_days: 7                               # Change how many days back from now the RI utilization window starts
          appenv: Friso                                # Change the project name here
      Runtime: python3.7
      Handler: GetReservationUtilization.lambda_handler
      MemorySize: 128
      Role: !GetAtt LambdaExecutionRole.Arn
      Timeout: 60
      Code:
        S3Bucket: XXXXXXXXXXXXXXXX    # S3 bucket name
        S3Key: RI/GetReservationUtilization.zip   # S3 path/file name
2.3 Create a CloudFormation stack from an existing template and upload the YAML file

image.png

2.4 CloudFormation needs an execution role with sufficient permissions

image.png

{
   "Version": "2012-10-17",
   "Statement": [
       {
           "Effect": "Allow",
           "Action": "*",
           "Resource": "*"
       }
   ]
}
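2.5 Run CloudFormation to automatically create the Lambda function, Lambda role, SNS topic, SNS subscription, and other resources

While testing, set the schedule to fire shortly after the current time; once everything works, switch to the schedule needed for routine use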
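For such a test run, the schedule expression can be computed instead of typed by hand. This is a minimal sketch with a hypothetical helper name, `one_shot_cron`; it assumes the six-field CloudWatch Events cron layout (minutes, hours, day-of-month, month, day-of-week, year) and that the time passed in is already in UTC.

```python
import datetime


def one_shot_cron(now_utc, minutes_ahead=5):
    """Return a cron expression that fires once, minutes_ahead after now_utc."""
    t = now_utc + datetime.timedelta(minutes=minutes_ahead)
    # Fields: minutes hours day-of-month month day-of-week year
    # Day-of-week is "?" because day-of-month carries a value
    return "cron({} {} {} {} ? {})".format(t.minute, t.hour, t.day, t.month, t.year)


# Fixed sample time so the result is reproducible
print(one_shot_cron(datetime.datetime(2020, 11, 3, 14, 58)))
```

The resulting string would go into ScheduleExpression for the test deployment and be replaced by the weekly expression afterwards.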

3. Test code
l = {
  "UtilizationsByTime": [{
    "Groups": [
    {
      "Attributes": {
        "AccountId": "0123456789",
        "AccountName": "0123456789",
        "AvailabilityZone": "",
        "CancellationDateTime": "2019-09-28T15:22:31.000Z",
        "EndDateTime": "2019-09-28T15:22:31.000Z",
        "InstanceType": "t2.nano",
        "LeaseId": "0123456789",
        "NumberOfInstances": "1",
        "OfferingType": "convertible",
        "Platform": "Linux/UNIX",
        "Region": "us-east-1",
        "Scope": "Region",
        "StartDateTime": "2016-09-28T15:22:32.000Z",
        "SubscriptionId": "359809062",
        "SubscriptionStatus": "Active",
        "SubscriptionType": "All Upfront",
        "Tenancy": "Shared"
      },
      "Key": "SUBSCRIPTION_ID",
      "Utilization": {
        "PurchasedHours": 2208,
        "TotalActualHours": 2208,
        "UnusedHours": 0,
        "UtilizationPercentage": 100
      },
      "Value": "359809062"
    },
    {
      "Attributes": {
        "AccountId": "0123456789",
        "AccountName": "asdasdad",
        "AvailabilityZone": "us-east-1d",
        "CancellationDateTime": "2017-09-28T15:22:31.000Z",
        "EndDateTime": "2017-09-28T15:22:31.000Z",
        "InstanceType": "t2.nano",
        "LeaseId": "asdasda",
        "NumberOfInstances": "1",
        "OfferingType": "Standard",
        "Platform": "Linux/UNIX",
        "Region": "us-east-1",
        "Scope": "Availability Zone",
        "StartDateTime": "2016-09-28T15:22:32.000Z",
        "SubscriptionId": "359809070",
        "SubscriptionStatus": "Active",
        "SubscriptionType": "All Upfront",
        "Tenancy": "Shared"
      },
      "Key": "SUBSCRIPTION_ID",
      "Utilization": {
        "PurchasedHours": 2151,
        "TotalActualHours": 2151,
        "UnusedHours": 0,
        "UtilizationPercentage": 100
      },
      "Value": "359809070"
    },
    {
      "Attributes": {
        "AccountId": "0123456789",
        "AccountName": "sdasad",
        "AvailabilityZone": "us-west-2a",
        "CancellationDateTime": "2017-09-20T04:06:02.000Z",
        "EndDateTime": "2017-09-20T04:06:02.000Z",
        "InstanceType": "t2.nano",
        "LeaseId": "asdasda",
        "NumberOfInstances": "1",
        "OfferingType": "Standard",
        "Platform": "Linux/UNIX",
        "Region": "us-west-2",
        "Scope": "Availability Zone",
        "StartDateTime": "2016-09-20T04:06:03.000Z",
        "SubscriptionId": "353571154",
        "SubscriptionStatus": "Active",
        "SubscriptionType": "Partial Upfront"
      },
      "Key": "SUBSCRIPTION_ID",
      "Utilization": {
        "PurchasedHours": 1948,
        "TotalActualHours": 0,
        "UnusedHours": 1948,
        "UtilizationPercentage": 0
      },
      "Value": "353571154"
    }
  ],
  "TimePeriod": {
    "End": "2017-10-01",
    "Start": "2017-07-01"
  },
  "Total": {
    "PurchasedHours": 6307,
    "TotalActualHours": 4359,
    "UnusedHours": 1948,
    "UtilizationPercentage": 69.11368320913270968764864436340574
  }
  }]
}

ec2_ri_groups_list = l['UtilizationsByTime'][0]['Groups']

message_list = [" RI Utilization Monitor: "]

for ec2_ri_group in ec2_ri_groups_list:
    ec2_ri_region = ec2_ri_group['Attributes']['Region']
    ec2_ri_numberOfInstances = ec2_ri_group['Attributes']['NumberOfInstances']
    ec2_ri_instanceType = ec2_ri_group['Attributes']['InstanceType']
    ec2_ri_platform = ec2_ri_group['Attributes']['Platform']
    # Format the utilization for output
    ec2_ri_UtilizationPercentage = "{:.2f}%".format(float(ec2_ri_group['Utilization']['UtilizationPercentage']))
    message = "            On the AZ " + ec2_ri_region + " , " + ec2_ri_numberOfInstances \
              + " EC2 " + ec2_ri_instanceType + " RI Utilization is " + ec2_ri_UtilizationPercentage + " , its platform is " + ec2_ri_platform
    message_list.append(message)
    message_str = "\n".join(message_list)

print(message_str)
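Note that the sample data in this test uses capitalized attribute keys ('Region', 'InstanceType'), while the Lambda code reads lower-camel-case keys ('region', 'instanceType'). A case-insensitive lookup lets the same parsing code run against both. The sketch below uses a hypothetical helper name, `get_attr`, and made-up sample dictionaries.

```python
def get_attr(attributes, name):
    """Look up an attribute by name, ignoring upper/lower case of the key."""
    wanted = name.lower()
    for key, value in attributes.items():
        if key.lower() == wanted:
            return value
    raise KeyError(name)


# Keys as in the sample data of this test
doc_style = {"Region": "us-east-1", "InstanceType": "t2.nano"}
# Keys as the Lambda code reads them
live_style = {"region": "cn-north-1", "instanceType": "m5.large"}

print(get_attr(doc_style, "region"), get_attr(live_style, "Region"))
print(get_attr(doc_style, "instanceType"), get_attr(live_style, "instancetype"))
```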
N. Troubleshooting CFN errors:

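1. The cron expression was wrong (resolved)

The rule was set to run weekly on a day of the week (Sunday), so the day-of-month field cannot be *; it must be ?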
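The rule can be checked before deploying. The sketch below uses a hypothetical helper name, `day_fields_conflict`, and only covers this one rule: if neither the day-of-month field nor the day-of-week field is ?, the expression is rejected. It does not validate the rest of the cron syntax.

```python
def day_fields_conflict(expression):
    """Return True when neither day-of-month nor day-of-week is '?'."""
    text = expression.strip()
    if not (text.startswith("cron(") and text.endswith(")")):
        raise ValueError("expected an expression of the form cron(...)")
    fields = text[len("cron("):-1].split()
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got {}".format(len(fields)))
    # Field order: minutes hours day-of-month month day-of-week year
    day_of_month, day_of_week = fields[2], fields[4]
    return day_of_month != "?" and day_of_week != "?"


print(day_fields_conflict("cron(00 08 * * SUN *)"))  # the failing form described above
print(day_fields_conflict("cron(00 08 ? * SUN *)"))  # the corrected form
```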

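2. The role created by CFN lacked permissions

Troubleshooting steps:

① First check whether the CFN stack ran successfully

② Check that every resource defined in the CFN template was created: Role, CloudWatch Events, Lambda

③ Check the Lambda execution result

This turned up the cause of the Lambda error: a permissions problem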

"errorMessage": 
         "An error occurred (AccessDeniedException) when calling the GetReservationUtilization operation: User: arn:aws-cn:sts::936669166135:assumed-role/RI-Utilization-Monitor-LambdaExecutionRole-KD1YDAO01XWR/GetReservationUtilization is not authorized to perform: ce:GetReservationUtilization on resource: arn:aws:ce:cn-northwest-1:936669166135:/GetReservationUtilization",

Solution:

Change the permissions of the role created by CFN to:

     Policies:
       - PolicyName: RI_Utilization_Monitor
         PolicyDocument:
           Version: 2012-10-17
           Statement:
             - Effect: Allow
               Action:
                 - ec2:DescribeReservedInstances
                 - ec2:DescribeReservedInstancesModifications
                 - ec2:DescribeReservedInstancesOfferings
                 - ec2:DescribeReservedInstancesListings
                 - ce:GetReservationUtilization
                 - sns:Publish
                 - s3:Get*
                 - s3:List*
               Resource: "*"

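3. Mixed up !GetAtt and !Ref in the CFN template, so the topic ARN could not be obtained

Solution:

① Read the error message

RICFNTopicERROR.png

② Check the official documentation for the return values of AWS::SNS::Topic

TopicCFNDoc.png

③ Locate the error

RICFN.png

④ Check the official !Ref example

!RefDoc.png

⑤ Fix the CFN code (the image shows only part of it; see the v2 CFN code for the full version). For AWS::SNS::Topic, !Ref returns the topic ARN, so the template uses topic_arn: !Ref SNSTopic

TopicCFNRight!RefARN.png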

References:

1. CloudWatch Events cron expressions

2. SNS in boto3

3. GetReservationUtilization syntax

4. AWS::Events::Rule

5. Official documentation for the return values of AWS::SNS::Topic

6. Official !Ref example