Introduction:
Observability is no longer an afterthought; it is an architectural requirement. This article walks through a complete journey, from containerizing a microservice demo locally to running it in production on Amazon Elastic Kubernetes Service (EKS) with full observability, alerting, and DevSecOps automation.
Objective:
- Establish a Test and Production Deployment Pipeline
  - Validate the OpenTelemetry demo microservices in a local Docker environment (EC2) before transitioning to a production-grade setup.
  - Minimize costs during development and testing using resource-efficient AWS configurations.
- Build a Secure, Scalable Kubernetes Infrastructure
  - Use Amazon EKS for deploying microservices with scalability (via auto-scaling groups) and security (via IAM, KMS, Pod Identity, VPC).
- Integrate Observability and Monitoring
  - Enable tracing, metrics, and logging using the OpenTelemetry Collector, Prometheus, Grafana, and CloudWatch.
  - Monitor application health and pod behavior in real time.
- Streamline Kubernetes Operations Using Helm
  - Simplify deployment and configuration management through Helm charts.
  - Enable versioned deployments, seamless upgrades, and rollbacks with minimal manual intervention.
- Automate End-to-End CI/CD with DevSecOps Practices
  - Implement GitHub Actions to automate build, scan, push, and deploy steps.
  - Integrate security scanning tools (Trivy, FOSSA, OSSF Scorecard) directly into the CI pipeline.
  - Ensure robust rollback mechanisms in case of deployment failure.
- Enable Alerting and Incident Response
  - Set up automated alerting using Prometheus and Alertmanager.
  - Send real-time email notifications for critical issues like pod restarts.
Phase 1: Docker Deployment and Foundational EKS Setup
Phase 1 laid the groundwork for deploying the OpenTelemetry microservices demo application, moving from a local Docker test environment to a production-grade Kubernetes cluster on Amazon EKS. This phase focused on validating service functionality, optimizing infrastructure, and building a secure and scalable cloud environment.
1.1 Objectives
This phase had two major goals:
- Local Validation: Test the microservices architecture in a local (EC2-hosted) Docker Compose setup to validate functionality, verify configurations, and determine resource requirements. This helped reduce unnecessary cloud costs during development.
- Production Infrastructure: Deploy the validated application to a fully managed Kubernetes cluster (Amazon EKS) that supports scalability, observability, and security features like IAM-based access, secrets encryption, and network segmentation.
1.2 Implementation
1.2.1 EC2 Test Environment
- To run the OpenTelemetry demo locally using Docker Compose, the team provisioned an EC2 instance with the following specifications:
  - Instance Type: t2.xlarge
  - vCPUs: 4
  - RAM: 16 GB
  - Storage: 30 GB General Purpose SSD (gp2)
  - Class: On-demand
- The test environment helped simulate microservices behavior and understand performance thresholds. Key takeaways:
  - The application performed well on a t2.xlarge instance.
  - Smaller instance types led to performance degradation and service failures.
  - On-demand pricing was chosen over spot instances for stability during evaluation.
- The microservices were deployed with Docker Compose using:
cd opentelemetry-demo/
docker compose up
- Verification steps included:
  - Confirming all services were up with docker ps.
  - Checking individual service logs with docker logs to detect misconfigurations or startup errors.
  - Exposing the EC2 instance to the internet and accessing the application via its public IP and configured port (e.g., 8080).
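A quick verification pass on the EC2 host might look like the following sketch; the container name and public IP are illustrative placeholders, and the actual names come from the demo's Compose file:
docker ps --format "table {{.Names}}\t{{.Status}}"   # confirm every container reports an Up status
docker logs frontend-proxy --tail 50                 # inspect one service for startup errors
curl -I http://<ec2-public-ip>:8080                  # check the UI responds from outside the instance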
1.2.2 EKS Production Environment
After successful validation, the team transitioned to the production deployment on Amazon EKS with the following components:
- Networking Design:
  - Public Subnets: Hosted ingress components such as the ALB Ingress Controller.
  - Private Subnets: Hosted the actual worker nodes running Kubernetes workloads.
- EKS Node Group Configuration:
  - Instance Type: t2.xlarge
  - Auto Scaling Setup:
    - Minimum Nodes: 1
    - Desired Nodes: 2
    - Maximum Nodes: 3
  - Scaling Behavior: The cluster auto-scales up to three nodes under heavy load and always maintains at least one active node.
- Use of Spot Instances: Stateless or loosely coupled services were deployed on spot instances to optimize cost without affecting stability.
- Add-ons and Integrations:
  - VPC-CNI Addon: Provides pod networking by assigning VPC IPs directly to pods and allowing integration with security groups and AWS PrivateLink.
  - EBS-CSI Addon: Allows dynamic provisioning of EBS volumes for stateful workloads, supporting snapshot-based backup and recovery.
  - CloudWatch Agent Addon: Enables Container Insights, collecting metrics and logs from containers and sending them to CloudWatch for centralized observability.
  - Pod Identity (IRSA): Pods request temporary IAM credentials from a DaemonSet-based Pod Identity Agent, eliminating the need for long-term secrets or manual AWS credential management.
  - ALB Ingress Controller Support: A dedicated service account was created and annotated so that the ALB controller could deploy ingress resources with the proper IAM permissions (see the sketch after this list).
- KMS Integration: Secrets in Kubernetes and data in EBS volumes are encrypted using AWS Key Management Service (KMS) for enhanced data security.
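As a minimal sketch of that association (assuming IRSA; the role name, account ID, and service account name below are placeholders rather than the project's actual values), the controller's service account carries an annotation pointing at the IAM role it should assume:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-load-balancer-controller
  namespace: kube-system
  annotations:
    # IAM role with the load balancer controller policy attached
    eks.amazonaws.com/role-arn: arn:aws:iam::<account-id>:role/alb-controller-role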
1.3 Verification
1.3.1 EC2 Instance
- Confirmed that the Docker Compose environment functioned as expected:
  - Services started successfully
  - Logs showed proper service-to-service communication
  - Application accessible via public IP
1.3.2 EKS Cluster Setup
- Used aws eks update-kubeconfig to configure kubectl.
- Cloned and deployed the infrastructure from GitHub: https://github.com/arbaaz29/eks_terraform
- Deployed all Kubernetes manifests using:
cd eks_terraform/k8s
kubectl apply -f .
- Verified the following in both the default and otel-demo namespaces:
- Running Pods
- Services
- Deployments
- Logs for key microservices (e.g., frontend-proxy, Grafana, Jaeger, OpenSearch)
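A hedged sketch of that verification pass; the region, cluster name, and deployment name are placeholders for whatever the Terraform configuration actually created:
aws eks update-kubeconfig --region <region> --name <cluster-name>
kubectl get pods,svc,deployments -n default
kubectl get pods,svc,deployments -n otel-demo
kubectl logs deployment/frontend-proxy -n otel-demo --tail=50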
1.4 Key Observations and Outcomes
1.4.1 EC2-Based Testing
- Docker Compose was a lightweight, fast method to validate the microservices.
- t2.xlarge proved sufficient for the memory and CPU requirements of the services.
- Network accessibility issues (e.g., frontend-proxy not publicly reachable) were resolved by updating security groups to allow traffic on port 8080.
1.4.2 EKS Production Deployment
- The EKS environment supported auto-scaling, KMS-based encryption, CloudWatch observability, and secure access using pod identities.
- Services were accessible through the ALB Ingress, and the system scaled reliably under simulated load.
1.5 Challenges and Resolutions
Challenge | Resolution |
---|---|
EC2 access to frontend-proxy blocked | Updated security group to allow traffic on port 8080 |
Pods could not use AWS services (e.g., EBS, ALB) | Created and annotated IAM policies with Pod Identity for appropriate service accounts |
Deployment manifest misconfiguration (e.g., product-catalog, Grafana) | Fixed configMaps and applied corrections in the deployment manifests |
Subnets not recognized for Kubernetes resources | Added proper Kubernetes resource tags to the subnets |
GitHub user lacked EKS access | Added user to access entries with eksclusteradmin permissions |
Notes
- Ensure IAM roles and policies follow the principle of least privilege.
- Check if service accounts have proper annotations so that respective IAM roles can be associated with them.
- Confirm subnet tagging aligns with the EKS requirements for cluster and load balancer integration (a tagging sketch follows these notes).
- Maintain an access control list for GitHub users with justifications for elevated permissions.
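For reference, the subnet tags EKS and the AWS Load Balancer Controller look for are shown below; the cluster name is a placeholder:
# public subnets (internet-facing load balancers)
kubernetes.io/role/elb = 1
kubernetes.io/cluster/<cluster-name> = shared
# private subnets (internal load balancers and worker nodes)
kubernetes.io/role/internal-elb = 1
kubernetes.io/cluster/<cluster-name> = shared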
Phase 2: Integrating Helm for Kubernetes Deployment
After successfully deploying and verifying the OpenTelemetry microservices on Amazon EKS using raw Kubernetes manifests, the next logical step was to streamline and simplify the deployment process. Phase 2 focused on using Helm, the package manager for Kubernetes, to manage and automate deployments, upgrades, and rollbacks.
2.1 Objective
- Reduce complexity by eliminating the need to apply multiple manifest files manually.
- Enable configuration reusability through templated Helm values.
- Simplify updates and rollbacks using Helm’s built-in features.
By adopting Helm, the team could package all resources into a single installable unit and maintain greater control over environment configurations and version history.
2.2 Implementation
2.2.1 Adding Helm Repository
- To begin, the team added the official OpenTelemetry Helm chart repository:
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
- This pulled in the latest Helm charts for OpenTelemetry components, including:
  - Frontend, backend, and telemetry services
  - OpenTelemetry Collector
  - Jaeger, Kafka, Prometheus exporters, etc.
- The use of official charts ensured that best practices were followed and configurations remained compatible with Kubernetes standards.
2.2.2 Deploying the Application Using Helm
To isolate the Helm-based deployment from the manually deployed environment, a new namespace was created:
kubectl create namespace otel-helm-demo
- Then the OpenTelemetry demo was deployed using the Helm chart:
helm install otel-demo open-telemetry/opentelemetry-demo -n otel-helm-demo
- Verification steps included:
kubectl get pods -n otel-helm-demo
kubectl get service -n otel-helm-demo
- This confirmed that:
  - All Kubernetes resources (pods, services, deployments) were created correctly.
  - The Helm chart encapsulated all necessary microservices in a single, consistent deployment process.
2.2.3 Upgrade and Rollback
To simulate real-world usage and verify Helm’s lifecycle management features, the team tested an upgrade and rollback scenario.
- Upgrade Scenario:
- The replica count of the frontend-proxy component was increased from the default to 3:
helm upgrade otel-demo open-telemetry/opentelemetry-demo \
  -n otel-helm-demo \
  --set components.frontend-proxy.replicas=3
- Verification:
kubectl get pods -n otel-helm-demo
helm history otel-demo -n otel-helm-demo
- This confirmed the increased number of frontend-proxy pods and a new revision entry in Helm’s release history.
- Rollback Scenario:
- To test rollback capability, the team reverted to the previous revision:
helm rollback otel-demo 1 -n otel-helm-demo
kubectl get pods -n otel-helm-demo
helm history otel-demo -n otel-helm-demo
kubectl describe deployment frontend -n otel-helm-demo
- This successfully returned the deployment to its initial configuration without any manual cleanup or reconfiguration.
2.3 Challenges and Solutions
Challenge | Resolution |
---|---|
EC2 access to frontend-proxy blocked | Updated security group to allow traffic on port 8080 |
Pods could not use AWS services (e.g., EBS, ALB) | Created and annotated IAM policies with Pod Identity for appropriate service accounts |
Helm values misconfiguration (e.g., product-catalog, Grafana) | Fixed configMaps and applied corrections in the Helm values |
Subnets not recognized for Kubernetes resources | Added proper Kubernetes resource tags to the subnets |
GitHub user lacked EKS access | Added user to access entries with eksclusteradmin permissions |
Used incorrect override path: frontend-proxy.replicaCount | Corrected to components.frontend-proxy.replicas by consulting the chart’s documentation |
Pods crashed post-upgrade due to missing configuration values | Reviewed and updated values.yaml structure, and used --set flags to apply overrides inline during upgrade |
Notes
- Always verify Helm override paths against the chart’s structure and documentation.
- Use --dry-run and helm template to preview changes before applying them (see the examples below).
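For example, either of the following previews the rendered manifests without touching the cluster (the override shown mirrors the upgrade from section 2.2.3):
helm upgrade otel-demo open-telemetry/opentelemetry-demo -n otel-helm-demo \
  --set components.frontend-proxy.replicas=3 --dry-run
helm template otel-demo open-telemetry/opentelemetry-demo \
  --set components.frontend-proxy.replicas=3 | less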
2.4 Conclusion
- Helm proved to be a powerful tool for managing Kubernetes applications. Its advantages included:
  - Declarative management of complex deployments using reusable values files.
  - Single-command upgrades without touching individual manifests.
  - Built-in rollback support that provided operational safety in case of failed changes.
  - Namespace isolation, allowing multiple environments or versions to coexist without conflict.
- The transition from kubectl apply to helm install significantly reduced manual overhead and improved reliability, making the system more production-ready.
Phase 3: Alerting Service and Notifications
After establishing deployment and observability foundations, Phase 3 focused on real-time alerting to detect application health issues, particularly around pod restarts. This phase introduced monitoring and alerting mechanisms using the Prometheus Stack, Alertmanager, and Kubernetes ConfigMaps, with email notifications configured via SMTP.
3.1 Objective
- Enable automated alerts for abnormal pod behavior, especially frequent restarts.
- Notify the team via email when such issues occur, enabling rapid detection and response.
- Integrate Prometheus and Alertmanager into the existing Kubernetes monitoring setup for centralized management.
This phase strengthened the operational observability of the EKS cluster and ensured that problems could be acted upon in near real time.
3.2 Implementation
3.2.1 Deploying the Prometheus Stack with Helm
- The monitoring stack included:
  - Prometheus for metrics collection
  - Alertmanager for sending alerts
  - kube-state-metrics for Kubernetes state data
- To deploy these components, the official Helm chart from the prometheus-community repository was used:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack -n monitoring --create-namespace
- This deployed all necessary resources into a dedicated monitoring namespace, helping with logical separation and resource governance.
3.2.2 Creating the Alerting Rule for Pod Restarts
- To detect frequent container restarts, a custom Prometheus alert rule was defined in a file named alerts.yaml:
groups:
  - name: pod-restarts
    rules:
      - alert: PodRestartTooHigh
        expr: increase(kube_pod_container_status_restarts_total[5m]) > 3
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High restart count detected"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} restarted more than 3 times in the last 5 minutes."
- This rule triggers an alert when any container in a pod restarts more than 3 times within a 5-minute window, with the condition persisting for at least one minute.
- The alert rule was applied using a Kubernetes ConfigMap:
kubectl create configmap prometheus-alerts --from-file=alerts.yaml -n monitoring
- This ConfigMap could then be mounted into the Prometheus deployment via Helm values (if dynamic configuration reload was enabled).
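One common alternative with the kube-prometheus-stack chart is to supply the same rule through the chart's additionalPrometheusRulesMap value, letting the Prometheus Operator create the rule object instead of mounting a ConfigMap by hand. A minimal sketch (the values file name is an assumption):
# rules-values.yaml
additionalPrometheusRulesMap:
  pod-restart-rules:
    groups:
      - name: pod-restarts
        rules:
          - alert: PodRestartTooHigh
            expr: increase(kube_pod_container_status_restarts_total[5m]) > 3
            for: 1m
            labels:
              severity: critical
This would then be applied with helm upgrade prometheus prometheus-community/kube-prometheus-stack -n monitoring -f rules-values.yaml.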
3.2.3 Configuring Alertmanager for Email Notifications
- To route alerts via email, Alertmanager was configured to use Gmail’s SMTP service. The configuration was defined in alertmanager.yaml:
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password_file: '/etc/secrets/smtp_password'
route:
  receiver: 'Mail Alert'
  repeat_interval: 30s
  group_wait: 15s
  group_interval: 15s
receivers:
  - name: 'Mail Alert'
    email_configs:
      - to: '[email protected]'
        headers:
          subject: 'Pod stuck in restart state'
This file was converted to a Kubernetes Secret:
kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml -n monitoring
- The password for Gmail SMTP was provided via a mounted file (smtp_password) for secure authentication.
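A sketch of how that file can be supplied; the secret name is an assumption, and Gmail requires an App Password rather than the account password when two-factor authentication is enabled:
kubectl create secret generic smtp-password \
  --from-literal=smtp_password='<gmail-app-password>' -n monitoring
# mount this secret into the Alertmanager pod at /etc/secrets so that
# smtp_auth_password_file resolves to /etc/secrets/smtp_password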
3.2.4 Testing the Alert
- To validate the alerting system, a crash-looping pod was manually created using:
kubectl run crashloop-demo --image=busybox --restart=Always -- /bin/sh -c "exit 1"
- This caused the pod to continuously restart, increasing the restart counter.
- Prometheus, using the alert rule defined earlier, detected this behavior. Once the increase() function’s threshold was crossed and persisted for one minute, an alert was triggered.
- The alert appeared in the Alertmanager UI under the configured route.
- An email notification was sent to the specified address with full metadata, including:
  - Alert name
  - Namespace
  - Affected pod
  - Restart count
  - Timestamps and severity
3.3 Deliverables
- The following outcomes and artifacts were successfully produced:
  - Visible alert in Alertmanager: shown under the “Mail Alert” route with severity critical.
  - Prometheus query graph: showed increasing values of kube_pod_container_status_restarts_total.
  - Fired alert instance: Prometheus executed the rule and triggered the alert.
  - Email notification received: delivered by Gmail SMTP with a descriptive subject and message body.
  - Supporting configuration files:
    - alerts.yaml for Prometheus rules
    - alertmanager.yaml for email routing
    - Kubernetes Secret for secure password injection
3.4 Summary and Impact
- The implementation of real-time alerting brought several operational benefits:
  - Early Detection: Crash-looping pods and other anomalies are flagged almost instantly.
  - Rapid Response: Email alerts reach stakeholders without requiring constant dashboard monitoring.
  - Production Readiness: The system now includes observability not only through dashboards, but through active notifications.
- This phase added an essential layer of resilience, helping the team respond to failures before they escalate into service outages.
Phase 4: CI/CD Integration with DevSecOps Enhancements
With the infrastructure, observability, and alerting systems in place, Phase 4 of the project focused on automating the software delivery pipeline using GitHub Actions. The goal was to implement a robust Continuous Integration and Continuous Deployment (CI/CD) system, bolstered by DevSecOps best practices such as automated vulnerability scanning, license checks, rollback mechanisms, and secure secret management.
4.1 CI/CD Pipeline Overview
The CI/CD workflow was built using GitHub Actions and was triggered on code pushes to the main branch. It performed the following steps in sequence:
Step | Description |
---|---|
Checkout Code | Pull the latest source code from GitHub |
Configure AWS Credentials | Authenticate GitHub runner to access AWS using GitHub Secrets |
Login to Amazon ECR | Use Docker CLI to log in to Elastic Container Registry |
Set Environment Variables | Dynamically generate .env file with image tags and ECR URIs |
Build Docker Images | Build all microservices using docker-compose |
Push Images to ECR | Upload container images to AWS ECR |
Install Trivy | Install Trivy CLI for vulnerability scanning |
Scan Images | Run scans on each image and fail on HIGH/CRITICAL CVEs |
Update kubeconfig | Authenticate kubectl to the target EKS cluster |
Patch YAML Manifests | Automatically update Kubernetes manifests with new image tags |
Commit Updated YAMLs | Push the updated manifests back to the GitHub repository |
Deploy to EKS | Apply all manifests using kubectl apply |
Deploy Monitoring Configs | Apply configurations for kube-state-metrics and alerting rules |
Rollback on Failure | If any step fails, trigger kubectl rollout undo for all deployments |
This end-to-end process ensures that each change in the source repository automatically goes through build, scan, deploy, and monitor steps with rollback support.
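A trimmed sketch of what such a workflow might look like; the region, image name, manifest path, and tag scheme are placeholders, not the project’s exact values:
name: ci-cd
on:
  push:
    branches: [main]
jobs:
  build-scan-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Login to Amazon ECR
        id: ecr
        uses: aws-actions/amazon-ecr-login@v2
      - name: Build and push images
        run: |
          docker compose build
          docker compose push
      - name: Scan an image with Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ steps.ecr.outputs.registry }}/frontend:${{ github.sha }}
          severity: HIGH,CRITICAL
          exit-code: '1'
      - name: Deploy to EKS
        run: |
          aws eks update-kubeconfig --name ${{ secrets.EKS_CLUSTER_NAME }}
          kubectl apply -f k8s/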
4.2 Rollback Mechanism
One of the key production-readiness features in this phase was automated rollback. The workflow used GitHub Actions’ if: failure() condition to trigger:
kubectl rollout undo deployment/<service-name> -n <namespace>
- This command restored each service to its previously stable replica set.
- A failure was simulated during testing by applying an invalid image tag, which correctly triggered the rollback behavior, ensuring that no broken deployments reached users.
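A minimal sketch of that rollback step, assuming the workloads live in a single namespace (the namespace name is a placeholder):
      - name: Rollback on failure
        if: failure()
        run: |
          for d in $(kubectl get deployments -n otel-demo -o name); do
            kubectl rollout undo "$d" -n otel-demo
          done
Because if: failure() only fires when an earlier step has failed, the rollback never runs on a healthy deployment.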
4.3 Secret Management
- Sensitive data such as AWS credentials, API keys, and cluster names were stored securely in GitHub Secrets, and accessed in the workflow using:
${{ secrets.<KEY_NAME> }}
- Examples of secrets used:
  - AWS_ACCESS_KEY_ID
  - AWS_SECRET_ACCESS_KEY
  - EKS_CLUSTER_NAME
  - FOSSA_API_KEY
- This practice eliminated the need for storing plaintext credentials in code or configuration files, aligning with industry security best practices.
4.4 DevSecOps Integrations
- To ensure code quality, security, and license compliance, several DevSecOps tools were integrated directly into the CI pipeline:
4.4.1 FOSSA
- Purpose: Scan for license violations and known open-source vulnerabilities.
- Integration: Triggered via the FOSSA GitHub Action.
- Outcome: Completed successfully with no issues detected.
4.4.2 Gradle Wrapper Validation
- Purpose: Check that gradle-wrapper.jar and gradle-wrapper.properties are valid and not tampered with.
- Trigger: PR or push events.
- Outcome: Successfully validated using a test commit under the correct path.
4.4.3 OSSF Scorecard
- Purpose: Assess the security posture of the GitHub repository.
- Features Checked: Branch protection, dependency update automation, token permissions, and more.
- Integration: Results uploaded to GitHub’s Code Scanning dashboard.
- Schedule: Triggered on push and weekly.
4.5 Challenges and Solutions
Challenge | Resolution |
---|---|
Inconsistent Docker image tagging | Used .env file with dynamic GitHub Actions variables to standardize tags |
FOSSA action failed due to team misconfiguration | Removed team parameter and used auto-detection |
Trivy scan failed due to bad image reference | Corrected the image tagging format |
Gradle wrapper validation didn’t trigger | Created a dummy commit in the monitored path to validate integration |
Kubernetes YAMLs not updated for each image | Used sed to auto-update image tags in all deployment files |
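As an illustration of that last fix, a loop of roughly this shape rewrites the image tag in every deployment manifest; the manifest path and the ECR_URI/IMAGE_TAG variables are assumptions about how the workflow names things:
for f in k8s/*.yaml; do
  # replace only the tag, keeping the repository part of the image reference
  sed -i "s|\(image: ${ECR_URI}/[a-z-]*\):.*|\1:${IMAGE_TAG}|" "$f"
done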
4.6 Execution Results
- Artifacts and verifications from successful pipeline executions included:
  - GitHub Actions workflow logs showing successful build, scan, and deployment
  - Trivy scan logs showing no high/critical vulnerabilities
  - Confirmation of image push to ECR
  - Visual confirmation of rollback behavior (if triggered)
  - Updated deployment manifests committed to GitHub
  - Running pods confirmed via kubectl get pods
  - Live application access via ALB Ingress