Keep Your Cloud Under Control: Infrastructure Best Practices

By Ronny Shani | Contributing editor: Alencar Souza

A few days ago, we asked ChatGPT what’s important when conducting an AWS infrastructure assessment. Per the Oracle of San Francisco’s recommendation, we requested further instructions on setting up Network Access Control Lists (NACL) rules and applying them via Terraform.

Unlike previous attempts (asking ChatGPT to help us transition from AWS Fargate to a Terraform setup), which resulted in superficial, borderline-reckless suggestions, ChatGPT generated valuable answers this time (see the complete log in the appendix).

However, the takeaway from both experiments is that ChatGPT needs explicit, clear prompts to generate usable, customized results, and writing such prompts takes experience. In other words, you need to know what to ask, or you’re guaranteed to receive vague answers that lead to misconfigured instances.

“It doesn’t care which service you’re running (ECS or EKS) or offer fundamental best practices based on hands-on experience. For example, it didn’t suggest running a Relational Database Service (RDS) in a private subnet or enforcing Multi-Factor Authentication (MFA) for users,” explains Project A’s Senior DevOps Engineer Alencar Souza.

The only way to solve this Socratic paradox is through old-school methods: Hire an experienced, knowledgeable pro to review your existing infrastructure setup, recognize particular vulnerabilities, and advise on custom optimization.

High demand for infrastructure assessments

One of the services we offer our portfolio companies is thorough infrastructure audits that analyze various aspects of cloud services, including security and costs.

Cloud optimization assessments focused on costs, for example, include reviewing billing information so that companies can make informed decisions, avoid paying for redundant resources, and cut overall IT spending. In these assessments, we identify unused or idle resources, analyze data transfer fees, monitor cost anomalies, and more.
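To make cost monitoring concrete, here is a minimal Terraform sketch of a budget alert. It is our illustration, not a deliverable from the assessments: it assumes the AWS provider's `aws_budgets_budget` resource, and the budget name, the 500 USD limit, the 80% threshold, and the email address are all placeholders. A budget alert is a simpler mechanism than anomaly detection, so treat it as a starting point.

```hcl
# Hypothetical monthly cost budget; every value below is a placeholder.
resource "aws_budgets_budget" "monthly_cost" {
  name         = "monthly-cost-budget"
  budget_type  = "COST"
  limit_amount = "500"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Email the team once actual spend passes 80% of the limit.
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["devops@example.com"]
  }
}
```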

According to Alencar, who performs these audits, security assessments are in high demand, as they provide crucial insights into daily operations and come with deliverables such as security best practices.

When securing your assets in the cloud, there are a few topics you should focus on:

Access control and management
Network security
Incident management
Storage security
Platform security (services connected to your cloud provider)
Workload security (services deployed to your cloud provider).

Frequent vulnerabilities

Alencar has recently completed infrastructure assessments for a couple of startups. By now, he can identify patterns–no matter which cloud provider the company relies on: “There’s a 99% probability that you’re going to find the same set of common security vulnerabilities everywhere:

Users with too many privileges
Missing multifactor authentication
Exposed secrets
Too broad network permissions
Services exposed to the public
Workloads with no limited scope.

All these,” he mentions, “can be avoided when adopting security best practices.”
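As an illustration of one item on that list, missing multifactor authentication, the following Terraform sketch denies most actions to members of an IAM group unless they signed in with MFA. It is a minimal example under our own assumptions: the group name `developers` and the policy name are hypothetical, and the list of exempted actions follows the commonly used AWS pattern for letting users enroll an MFA device.

```hcl
# Hypothetical group; replace with the group your users belong to.
resource "aws_iam_group" "developers" {
  name = "developers"
}

# Deny everything except MFA self-service actions when the request
# was not authenticated with MFA.
resource "aws_iam_policy" "require_mfa" {
  name = "require-mfa"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid    = "DenyAllWithoutMFA"
      Effect = "Deny"
      NotAction = [
        "iam:CreateVirtualMFADevice",
        "iam:EnableMFADevice",
        "iam:GetUser",
        "iam:ListMFADevices",
        "iam:ListVirtualMFADevices",
        "iam:ResyncMFADevice",
        "sts:GetSessionToken",
      ]
      Resource = "*"
      Condition = {
        BoolIfExists = {
          "aws:MultiFactorAuthPresent" = "false"
        }
      }
    }]
  })
}

# Attach the policy to the group so it applies to every member.
resource "aws_iam_group_policy_attachment" "developers_require_mfa" {
  group      = aws_iam_group.developers.name
  policy_arn = aws_iam_policy.require_mfa.arn
}
```

This policy only denies; users still need separate allow statements, both for their regular work and for managing their own MFA device.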

How come it’s so common?
“Startups care less about infrastructure security–they focus on deploying and releasing as fast as possible. They also hold off hiring a dedicated DevOps person, who could have mitigated most of the problems mentioned earlier.


Without a DevOps person, individual developers set up new instances whenever they need them, ignoring the bigger picture: Existing resources, how they’re used, and what the organization plans. Cloud providers make pushing a few buttons and deploying your app effortless. And that’s great, but engineers have less incentive to learn what’s happening inside the instance. You no longer have to worry about networking or firewalls because the cloud provider will do everything for you.

But when you have an experienced DevOps person, you’ll be able to balance the needs of developers with the available cloud resources and have a clear view of current and future use, ensuring your expenses don’t run out of control.”

Transform into Terraform

Alencar doesn’t try to hack the system or poke around the cloud provider’s dashboard when he performs the audit. His primary tool is Terraformer, a CLI tool that generates Terraform (tf/json and tfstate) files from existing infrastructure and supports all major cloud providers and related services (monitoring, identity, version control, etc.).

Alencar: “This tool generates a detailed overview of the existing resources. It imports everything from the cloud provider. We started using it when one of our portfolio companies asked for support but had no documentation or point of contact who could provide information about the current infrastructure. Instead of just going to the browser and checking everything manually, clicking on stuff, Terraformer lets me download everything. Then I can just browse the folders in my local environment.”

An essential step in the process is determining the criticality level and deciding which issues should get the most attention, especially when resources are tight, timelines are short, and the team is understaffed.

Alencar: “There are issues you must fix immediately and others you can solve later. For example, if your databases are publicly accessible, it’s critical. Ideally, you should run them on private subnets. Hard-coded credentials are another security risk you can prevent by using CI/CD platforms (such as GitHub or GitLab), which offer features like secret management. You can store your secrets there or organize them by environment and ensure you have access to the secrets required in your pipelines for connecting to the cloud provider or building your application.”
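Both fixes can be sketched in Terraform. The example below is a minimal illustration under our own assumptions, not Alencar's actual configuration: the resource names, the engine, and the instance size are placeholders, and `private_subnet_ids` is a hypothetical variable holding the IDs of existing subnets that have no route to an internet gateway. The password is never written into the code; it arrives as a sensitive variable, for example through a `TF_VAR_db_password` environment variable set from a CI/CD secret.

```hcl
# Supplied at run time, e.g. via TF_VAR_db_password set from a CI/CD secret.
variable "db_password" {
  type      = string
  sensitive = true
}

# Hypothetical IDs of existing private subnets (at least two availability zones).
variable "private_subnet_ids" {
  type = list(string)
}

# Groups the private subnets the database is allowed to live in.
resource "aws_db_subnet_group" "private" {
  name       = "private-db-subnets"
  subnet_ids = var.private_subnet_ids
}

resource "aws_db_instance" "app" {
  identifier           = "app-db"
  engine               = "postgres"
  instance_class       = "db.t3.micro"
  allocated_storage    = 20
  username             = "app"
  password             = var.db_password
  db_subnet_group_name = aws_db_subnet_group.private.name
  publicly_accessible  = false # no public IP for the database
  skip_final_snapshot  = true  # acceptable for a demo only
}
```

Marking the variable as sensitive keeps the value out of plan output, but it is still stored in the Terraform state, so the state file needs protecting as well.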

Adopt best practices

That’s what Alencar describes as bad practice—common issues you can prevent with common sense: “I come from a development background, and one of the most important things is a mindset or culture of adopting best practices that prevent future trouble.”

Best practices cover four areas:

Infrastructure as Code (IaC)
CI/CD
Observability
Documentation

The latter is often an afterthought outsourced to other stakeholders. But Alencar believes it’s an essential part of daily work – both before you start and after you’ve finished.

“It doesn’t matter if it’s development or infrastructure; check the documentation before implementing something. Secret management, for example, isn’t new or difficult to implement. It’s right there in the documentation. Read, stay up to date on the latest best practices, and follow them.


Secondly, don’t expect to work alone; if you get hit by a bus tomorrow, someone must take over. When you create things manually, have no documentation, and don’t adopt IaC, it’ll be complicated to do that. We had this problem with one of our portfolio companies: Everything was set up manually, they didn’t have documentation, and we didn’t know what was happening. We spent a lot of time analyzing the whole infrastructure and importing the existing resources into tools like Terraform.”
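For a sense of what bringing a manually created resource under Terraform can look like, here is a small sketch using Terraform's built-in `import` block (available since Terraform 1.5). This is a different route from Terraformer, which generates the files for you, and we show it only as an illustration; the bucket name `legacy-assets-bucket` is hypothetical.

```hcl
# Tell Terraform that an existing bucket should be managed by the
# resource block below. The bucket name is a placeholder.
import {
  to = aws_s3_bucket.legacy_assets
  id = "legacy-assets-bucket"
}

# The configuration that will own the imported bucket from now on.
resource "aws_s3_bucket" "legacy_assets" {
  bucket = "legacy-assets-bucket"
}
```

Running `terraform plan` then shows the pending import next to any differences between the real resource and the configuration, and `terraform apply` records the resource in the state.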

Are these recommendations adapted to each company you review, or is it a conventional list of best practices? For example, some things may require a dedicated DevOps person – do you consider whether they have one?
“If you’re a startup with a tight budget and can’t hire a dedicated DevOps person, at least hire developers with DevOps skills who know how to set up a local environment. Building Docker images and starting services locally are similar to what’s currently happening in the cloud provider’s servers. Like your local environment, your service is probably running in a VM instance. You should know the basics.”

The (Project) A factor

As an operational VC, Project A doesn’t only consult; it also provides practical support. Assuming you hired only developers who can code fast, this offering could save time, money, and heartache. “A portfolio company reached out to us, asking for help setting up the technical infrastructure and engineering team from scratch. We hired developers and implemented these best practices, and thanks to that, everything is running smoothly. They don’t even have a DevOps person because it just works.”

This can theoretically be a one-time thing.
“Almost. It’s been more than three years since we completed the project, and two months ago, they reached out to me because something stopped working. It turns out they needed to update some of the packages deployed to their Kubernetes cluster. So you have to maintain dependencies, but it doesn’t require DevOps skills. Any developer could do that, especially when the cloud provider notifies you.”

Think small

Speaking of complexity, what’s your take on managed cloud services versus AWS or GCP?
“I’d recommend starting with smaller cloud providers. I’m not a big fan of those big cloud providers, and I think companies use them because it’s trendy: You use AWS because everyone else uses it, although you’re not aware of 90% of the stuff that goes on behind the first page. It may sound reductive, but it’s often like that.

When you ask them why they chose AWS, and whether there was a particular reason, they say it’s easier to find experienced people, overlooking that this limits their options. What will they do if they later migrate to another cloud provider? Fire their DevOps engineers because they have no experience with GCP?”

Alencar’s extensive experience shows that committing to one cloud provider for trivial reasons is bad business. “If you understand infrastructure, where you host doesn’t matter – it’s the same principle. You can use DigitalOcean, or you can deploy your services to Hetzner. You’ll find plenty of documentation, and if you adopt IaC (which you should), you don’t even have to know much about the cloud provider itself—you just read Terraform/Pulumi/Ansible’s documentation and provision your resources. It’s much easier than navigating vendors’ UI.”
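To show how little provider-specific knowledge the IaC route needs, here is a minimal sketch that provisions a single server on Hetzner. The server name, type, image, and location are placeholders we picked for illustration (check Hetzner's current catalog before using them), and `hcloud_token` is a hypothetical variable holding an API token.

```hcl
terraform {
  required_providers {
    hcloud = {
      source = "hetznercloud/hcloud"
    }
  }
}

# API token, supplied at run time rather than hard-coded.
variable "hcloud_token" {
  type      = string
  sensitive = true
}

provider "hcloud" {
  token = var.hcloud_token
}

# One small virtual server; all values are placeholders.
resource "hcloud_server" "web" {
  name        = "web-1"
  server_type = "cx22"
  image       = "ubuntu-24.04"
  location    = "nbg1"
}
```

The workflow (`terraform init`, `plan`, `apply`) is the same as with AWS; only the provider block and the resource types change.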

Whichever cloud provider you end up using, make sure to consult your trusted DevOps engineer before you deploy.

Appendix: Our Q&A with ChatGPT

Q: What do I need to consider when conducting an infrastructure assessment of AWS?

A: When conducting an infrastructure assessment on AWS, there are several important factors to consider. Here are some key considerations:

Security: Security should be a top priority. Make sure that all data and systems are properly secured. Check that the AWS Security Group and Network Access Control List (NACL) rules are set up correctly, and that IAM (Identity and Access Management) policies are properly configured.
Availability: It is important to ensure that the infrastructure is designed for high availability. This means that the infrastructure should be able to handle failures in individual components without impacting the overall system. Review the architecture, deployment model, and scaling options to ensure that the infrastructure is resilient and can handle the expected workload.
Scalability: It is important to ensure that the infrastructure can scale to meet changing demands. Assess the current infrastructure for its ability to handle peak loads and future growth. Review the Auto Scaling Groups and Elastic Load Balancers to ensure that they are set up correctly and can respond to changes in demand.
Cost Optimization: AWS provides a variety of cost optimization tools and services that can help reduce infrastructure costs. Review the current infrastructure for any areas where costs can be reduced without impacting performance or availability.
Compliance: Depending on the industry and type of data being stored, it may be necessary to comply with certain regulations and standards. Ensure that the infrastructure meets any relevant compliance requirements.
Monitoring and Logging: Monitoring and logging are critical components of any infrastructure. Review the current monitoring and logging solutions in place and ensure that they are set up correctly to provide the necessary insights into the system’s performance and health.
Disaster Recovery: It is important to ensure that the infrastructure has a disaster recovery plan in place. Review the current plan and ensure that it is comprehensive and can be executed in case of a disaster.

Overall, an infrastructure assessment on AWS should consider the above-mentioned factors to ensure that the infrastructure is secure, available, scalable, cost-optimized, compliant, and has a disaster recovery plan in place.

Q: I’m not using NACL rules yet, how would I set these up in AWS?

A: To set up Network Access Control Lists (NACL) in AWS, you can follow these steps:

Go to the AWS Management Console and navigate to the VPC (Virtual Private Cloud) Dashboard.
In the left navigation pane, select “Network ACLs” and click the “Create Network ACL” button.
Give the NACL a name and select the VPC that you want to associate it with.
After creating the NACL, select it from the list of NACLs and click the “Inbound Rules” or “Outbound Rules” tabs to configure the rules.
Click the “Edit” button to create a new rule.
Select the rule number, which determines the order in which the rules are evaluated, and choose the traffic type (e.g. SSH, HTTP, HTTPS).
Specify the source and destination IP addresses or CIDR blocks that the rule applies to.
Choose the action for the rule (e.g. allow, deny).
Repeat these steps for each rule you want to create.

Note: The rules in an NACL are evaluated in order, starting with the lowest numbered rule. If a rule matches the traffic, that rule’s action is taken and no further rules are evaluated. If no rule matches the traffic, the default action (allow or deny) is taken.

After configuring the NACL rules, you can associate the NACL with the appropriate subnets in your VPC. This can be done by selecting the subnet and choosing “Edit Subnet Associations” from the “Actions” menu. From there, you can select the NACL that you want to associate with the subnet.

It’s important to note that NACLs are stateless, meaning that they don’t keep track of the state of connections. Therefore, for protocols like TCP, you’ll need to create separate inbound and outbound rules to allow traffic in both directions.

Q: How would I apply this with Terraform?

A: To apply Network Access Control Lists (NACL) rules in AWS using Terraform, you can follow these steps:

Define a new resource of type “aws_network_acl” in your Terraform configuration file:

resource "aws_network_acl" "my_acl" {
  vpc_id = aws_vpc.my_vpc.id

  tags = {
    Name = "my_acl"
  }
}

This creates a new NACL with the name “my_acl” and associates it with the VPC with ID “aws_vpc.my_vpc.id”.

Define inbound and outbound rules for the NACL using the “aws_network_acl_rule” resource type. Here’s an example of an inbound rule that allows HTTP traffic from any source:

resource "aws_network_acl_rule" "http_inbound" {
  network_acl_id = aws_network_acl.my_acl.id
  rule_number    = 100
  protocol       = "tcp"
  rule_action    = "allow"
  egress         = false
  cidr_block     = "0.0.0.0/0"
  from_port      = 80
  to_port        = 80
}

This creates a rule with rule number 100 that allows incoming TCP traffic on port 80 from any source.

Define additional inbound and outbound rules as needed for your application.
Associate the NACL with the appropriate subnets using the “aws_network_acl_association” resource type:

resource "aws_network_acl_association" "my_acl_association" {
  subnet_id      = aws_subnet.my_subnet.id
  network_acl_id = aws_network_acl.my_acl.id
}

This associates the NACL “my_acl” with the subnet with ID “aws_subnet.my_subnet.id”.

Run “terraform init”, “terraform plan”, and “terraform apply” to apply the changes to your AWS infrastructure.

Note that the above examples are just a starting point and you’ll need to adjust them to fit your specific requirements. For example, you’ll need to create different rules for different protocols, update the rule numbers to ensure they’re evaluated in the correct order, and specify the appropriate source and destination CIDR blocks.
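A note from us rather than from the log: because NACLs are stateless, the inbound HTTP rule in ChatGPT's example does not let responses back out on its own. A minimal sketch of a matching outbound rule follows, assuming the `aws_network_acl.my_acl` resource from the log and that clients connect from ephemeral ports in the 1024-65535 range (narrow the range if you know your clients).

```hcl
# Outbound counterpart to the inbound HTTP rule: allows response
# traffic back to clients on their ephemeral ports.
resource "aws_network_acl_rule" "ephemeral_outbound" {
  network_acl_id = aws_network_acl.my_acl.id
  rule_number    = 100
  protocol       = "tcp"
  rule_action    = "allow"
  egress         = true
  cidr_block     = "0.0.0.0/0"
  from_port      = 1024
  to_port        = 65535
}
```

Inbound and outbound rules are numbered separately, so reusing rule number 100 here does not clash with the inbound rule.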
