Introduction:
Observability is no longer an afterthought; it is an architectural requirement. This article walks through a complete journey, from containerizing a microservice demo locally to running it in production on Amazon Elastic Kubernetes Service (EKS) with full observability, alerting, and DevSecOps automation.
Objective:
- Establish a Test and Production Deployment Pipeline
  - Validate the OpenTelemetry demo microservices in a local Docker environment (EC2) before transitioning to a production-grade setup.
  - Minimize costs during development and testing using resource-efficient AWS configurations.
- Build a Secure, Scalable Kubernetes Infrastructure
  - Use Amazon EKS for deploying microservices with scalability (via auto-scaling groups) and security (via IAM, KMS, Pod Identity, VPC).
- Integrate Observability and Monitoring
  - Enable tracing, metrics, and logging using the OpenTelemetry Collector, Prometheus, Grafana, and CloudWatch.
  - Monitor application health and pod behavior in real time.
- Streamline Kubernetes Operations Using Helm
  - Simplify deployment and configuration management through Helm charts.
  - Enable versioned deployments, seamless upgrades, and rollbacks with minimal manual intervention.
- Automate End-to-End CI/CD with DevSecOps Practices
  - Implement GitHub Actions to automate build, scan, push, and deploy steps.
  - Integrate security scanning tools (Trivy, FOSSA, OSSF Scorecard) directly into the CI pipeline.
  - Ensure robust rollback mechanisms in case of deployment failure.
- Enable Alerting and Incident Response
  - Set up automated alerting using Prometheus and Alertmanager.
  - Send real-time email notifications for critical issues like pod restarts.
Phase 1: Docker Deployment and Foundational EKS Setup
Phase 1 laid the groundwork for deploying the OpenTelemetry microservices demo application, moving from a local Docker test environment to a production-grade Kubernetes cluster on Amazon EKS. This phase focused on validating service functionality, optimizing infrastructure, and building a secure and scalable cloud environment.
1.1 Objectives
This phase had two major goals:
- Local Validation: Test the microservices architecture in a local (EC2-hosted) Docker Compose setup to validate functionality, verify configurations, and determine resource requirements. This helped reduce unnecessary cloud costs during development.
- Production Infrastructure: Deploy the validated application to a fully managed Kubernetes cluster (Amazon EKS) that supports scalability, observability, and security features like IAM-based access, secrets encryption, and network segmentation.
1.2 Implementation
1.2.1 EC2 Test Environment
- To run the OpenTelemetry demo locally using Docker Compose, the team provisioned an EC2 instance with the following specifications:
  - Instance Type: t2.xlarge
  - vCPUs: 4
  - RAM: 16 GB
  - Storage: 30 GB General Purpose SSD (gp2)
  - Class: On-demand
- The test environment helped simulate microservices behavior and understand performance thresholds. Key takeaways:
  - The application performed well on a t2.xlarge instance.
  - Smaller instance types led to performance degradation and service failures.
  - On-demand pricing was chosen over spot instances for stability during evaluation.
- The microservices were deployed with Docker Compose using:
cd opentelemetry-demo/
docker compose up
- Verification steps included:
  - Confirming all services were up with docker ps.
  - Checking individual service logs with docker logs to detect misconfigurations or startup errors.
  - Exposing the EC2 instance to the internet and accessing the application via its public IP and configured port (e.g., 8080).
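A quick verification pass on the EC2 host might look like the following sketch; the container name and public IP are illustrative placeholders, and the actual names come from the demo's Compose file:
docker ps --format "table {{.Names}}\t{{.Status}}"   # confirm every container reports an Up status
docker logs frontend-proxy --tail 50                 # inspect one service for startup errors
curl -I http://<ec2-public-ip>:8080                  # check the UI responds from outside the instance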
1.2.2 EKS Production Environment
After successful validation, the team transitioned to the production deployment on Amazon EKS with the following components:
- Networking Design:
  - Public Subnets: Hosted ingress components such as the ALB Ingress Controller.
  - Private Subnets: Hosted the actual worker nodes running Kubernetes workloads.
- EKS Node Group Configuration:
  - Instance Type: t2.xlarge
  - Auto Scaling Setup:
    - Minimum Nodes: 1
    - Desired Nodes: 2
    - Maximum Nodes: 3
  - Scaling Behavior: The cluster auto-scales up to three nodes under heavy load and always maintains at least one active node.
- Use of Spot Instances: Stateless or loosely coupled services were deployed on spot instances to optimize cost without affecting stability.
- Add-ons and Integrations:
  - VPC-CNI Addon: Provides pod networking by assigning VPC IPs directly to pods and allowing integration with security groups and AWS PrivateLink.
  - EBS-CSI Addon: Allows dynamic provisioning of EBS volumes for stateful workloads, supporting snapshot-based backup and recovery.
  - CloudWatch Agent Addon: Enables Container Insights, collecting metrics and logs from containers and sending them to CloudWatch for centralized observability.
  - Pod Identity (IRSA): Pods request temporary IAM credentials from a DaemonSet-based Pod Identity Agent, eliminating the need for long-term secrets or manual AWS credential management.
  - ALB Ingress Controller Support: A dedicated service account was created and annotated so that the ALB controller could deploy ingress resources with the proper IAM permissions (see the sketch after this list).
- KMS Integration: Secrets in Kubernetes and data in EBS volumes are encrypted using AWS Key Management Service (KMS) for enhanced data security.
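As a minimal sketch of that association (assuming IRSA; the role name, account ID, and service account name below are placeholders rather than the project's actual values), the controller's service account carries an annotation pointing at the IAM role it should assume:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-load-balancer-controller
  namespace: kube-system
  annotations:
    # IAM role with the load balancer controller policy attached
    eks.amazonaws.com/role-arn: arn:aws:iam::<account-id>:role/alb-controller-role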
1.3 Verification
1.3.1 EC2 Instance
- Confirmed that the Docker Compose environment functioned as expected:
  - Services started successfully
  - Logs showed proper service-to-service communication
  - Application accessible via public IP
1.3.2 EKS Cluster Setup
- Used aws eks update-kubeconfig to configure kubectl.
- Cloned and deployed the infrastructure from GitHub: https://github.com/arbaaz29/eks_terraform
- Deployed all Kubernetes manifests using:
cd eks_terraform/k8s
kubectl apply -f .
- Verified the following in both the default and otel-demo namespaces:
- Running Pods
- Services
- Deployments
- Logs for key microservices (e.g., frontend-proxy, Grafana, Jaeger, OpenSearch)
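A hedged sketch of that verification pass; the region, cluster name, and deployment name are placeholders for whatever the Terraform configuration actually created:
aws eks update-kubeconfig --region <region> --name <cluster-name>
kubectl get pods,svc,deployments -n default
kubectl get pods,svc,deployments -n otel-demo
kubectl logs deployment/frontend-proxy -n otel-demo --tail=50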
1.4 Key Observations and Outcomes
1.4.1 EC2-Based Testing
- Docker Compose was a lightweight, fast method to validate the microservices.
- t2.xlarge proved sufficient for the memory and CPU requirements of the services.
- Network accessibility issues (e.g., frontend-proxy not publicly reachable) were resolved by updating security groups to allow traffic on port 8080.
1.4.2 EKS Production Deployment
- The EKS environment supported auto-scaling, KMS-based encryption, CloudWatch observability, and secure access using pod identities.
- Services were accessible through the ALB Ingress, and the system scaled reliably under simulated load.
1.5 Challenges and Resolutions
Challenge | Resolution |
---|---|
EC2 access to frontend-proxy blocked | Updated security group to allow traffic on port 8080 |
Pods could not use AWS services (e.g., EBS, ALB) | Created and annotated IAM policies with Pod Identity for appropriate service accounts |
Deployment manifest misconfiguration (e.g., product-catalog, Grafana) | Fixed configMaps and applied corrections in the deployment manifests |
Subnets not recognized for Kubernetes resources | Added proper Kubernetes resource tags to the subnets |
GitHub user lacked EKS access | Added user to access entries with eksclusteradmin permissions |
Notes
- Ensure IAM roles and policies follow the principle of least privilege.
- Check if service accounts have proper annotations so that respective IAM roles can be associated with them.
- Confirm subnet tagging aligns with the EKS requirements for cluster and load balancer integration (a tagging sketch follows these notes).
- Maintain an access control list for GitHub users with justifications for elevated permissions.
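For reference, the subnet tags EKS and the AWS Load Balancer Controller look for are shown below; the cluster name is a placeholder:
# public subnets (internet-facing load balancers)
kubernetes.io/role/elb = 1
kubernetes.io/cluster/<cluster-name> = shared
# private subnets (internal load balancers and worker nodes)
kubernetes.io/role/internal-elb = 1
kubernetes.io/cluster/<cluster-name> = shared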
Phase 2: Integrating Helm for Kubernetes Deployment
After successfully deploying and verifying the OpenTelemetry microservices on Amazon EKS using raw Kubernetes manifests, the next logical step was to streamline and simplify the deployment process. Phase 2 focused on using Helm, the package manager for Kubernetes, to manage and automate deployments, upgrades, and rollbacks.
2.1 Objective
- Reduce complexity by eliminating the need to apply multiple manifest files manually.
- Enable configuration reusability through templated Helm values.
- Simplify updates and rollbacks using Helm’s built-in features.
By adopting Helm, the team could package all resources into a single installable unit and maintain greater control over environment configurations and version history.
2.2 Implementation
2.2.1 Adding Helm Repository
- To begin, the team added the official OpenTelemetry Helm chart repository:
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
- This pulled in the latest Helm charts for OpenTelemetry components, including:
  - Frontend, backend, and telemetry services
  - OpenTelemetry Collector
  - Jaeger, Kafka, Prometheus exporters, etc.
- The use of official charts ensured that best practices were followed and configurations remained compatible with Kubernetes standards.
2.2.2 Deploying the Application Using Helm
To isolate the Helm-based deployment from the manually deployed environment, a new namespace was created:
kubectl create namespace otel-helm-demo
- Then the OpenTelemetry demo was deployed using the Helm chart:
helm install otel-demo open-telemetry/opentelemetry-demo -n otel-helm-demo
- Verification steps included:
kubectl get pods -n otel-helm-demo
kubectl get service -n otel-helm-demo
- This confirmed that:
  - All Kubernetes resources (pods, services, deployments) were created correctly.
  - The Helm chart encapsulated all necessary microservices in a single, consistent deployment process.
2.2.3 Upgrade and Rollback
To simulate real-world usage and verify Helm’s lifecycle management features, the team tested an upgrade and rollback scenario.
- Upgrade Scenario:
- The replica count of the frontend-proxy component was increased from the default to 3:
helm upgrade otel-demo open-telemetry/opentelemetry-demo \
  -n otel-helm-demo \
  --set components.frontend-proxy.replicas=3
- Verification:
kubectl get pods -n otel-helm-demo
helm history otel-demo -n otel-helm-demo
- This confirmed the increased number of frontend-proxy pods and a new revision entry in Helm’s release history.
- Rollback Scenario:
- To test rollback capability, the team reverted to the previous revision:
helm rollback otel-demo 1 -n otel-helm-demo
kubectl get pods -n otel-helm-demo
helm history otel-demo -n otel-helm-demo
kubectl describe deployment frontend -n otel-helm-demo
- This successfully returned the deployment to its initial configuration without any manual cleanup or reconfiguration.
2.3 Challenges and Solutions
Challenge | Resolution |
---|---|
EC2 access to frontend-proxy blocked | Updated security group to allow traffic on port 8080 |
Pods could not use AWS services (e.g., EBS, ALB) | Created and annotated IAM policies with Pod Identity for appropriate service accounts |
Helm values misconfiguration (e.g., product-catalog, Grafana) | Fixed configMaps and applied corrections in the Helm values |
Subnets not recognized for Kubernetes resources | Added proper Kubernetes resource tags to the subnets |
GitHub user lacked EKS access | Added user to access entries with eksclusteradmin permissions |
Used incorrect override path: frontend-proxy.replicaCount | Corrected to components.frontend-proxy.replicas by consulting the chart’s documentation |
Pods crashed post-upgrade due to missing configuration values | Reviewed and updated values.yaml structure, and used --set flags to apply overrides inline during upgrade |
Notes
- Always verify Helm override paths against the chart’s structure and documentation.
- Use --dry-run and helm template to preview changes before applying them (see the examples below).
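For example, either of the following previews the rendered manifests without touching the cluster (the override shown mirrors the upgrade from section 2.2.3):
helm upgrade otel-demo open-telemetry/opentelemetry-demo -n otel-helm-demo \
  --set components.frontend-proxy.replicas=3 --dry-run
helm template otel-demo open-telemetry/opentelemetry-demo \
  --set components.frontend-proxy.replicas=3 | less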
2.4 Conclusion
- Helm proved to be a powerful tool for managing Kubernetes applications. Its advantages included:
  - Declarative management of complex deployments using reusable values files.
  - Single-command upgrades without touching individual manifests.
  - Built-in rollback support that provided operational safety in case of failed changes.
  - Namespace isolation, allowing multiple environments or versions to coexist without conflict.
- The transition from kubectl apply to helm install significantly reduced manual overhead and improved reliability, making the system more production-ready.
Phase 3: Alerting Service and Notifications
After establishing deployment and observability foundations, Phase 3 focused on real-time alerting to detect application health issues, particularly around pod restarts. This phase introduced monitoring and alerting mechanisms using the Prometheus Stack, Alertmanager, and Kubernetes ConfigMaps, with email notifications configured via SMTP.
3.1 Objective
- Enable automated alerts for abnormal pod behavior, especially frequent restarts.
- Notify the team via email when such issues occur, enabling rapid detection and response.
- Integrate Prometheus and Alertmanager into the existing Kubernetes monitoring setup for centralized management.
This phase strengthened the operational observability of the EKS cluster and ensured that problems could be acted upon in near real time.
3.2 Implementation
3.2.1 Deploying the Prometheus Stack with Helm
- The monitoring stack included:
  - Prometheus for metrics collection
  - Alertmanager for sending alerts
  - kube-state-metrics for Kubernetes state data
- To deploy these components, the official Helm chart from the prometheus-community repository was used:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack -n monitoring --create-namespace
- This deployed all necessary resources into a dedicated monitoring namespace, helping with logical separation and resource governance.
3.2.2 Creating the Alerting Rule for Pod Restarts
- To detect frequent container restarts, a custom Prometheus alert rule was defined in a file named alerts.yaml:
groups:
  - name: pod-restarts
    rules:
      - alert: PodRestartTooHigh
        expr: increase(kube_pod_container_status_restarts_total[5m]) > 3
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High restart count detected"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} restarted more than 3 times in the last 5 minutes."
- This rule triggers an alert when any container in a pod restarts more than 3 times within a 5-minute window, with the condition persisting for at least one minute.
- The alert rule was applied using a Kubernetes ConfigMap:
kubectl create configmap prometheus-alerts --from-file=alerts.yaml -n monitoring
- This ConfigMap could then be mounted into the Prometheus deployment via Helm values (if dynamic configuration reload was enabled).
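One common alternative with the kube-prometheus-stack chart is to supply the same rule through the chart's additionalPrometheusRulesMap value, letting the Prometheus Operator create the rule object instead of mounting a ConfigMap by hand. A minimal sketch (the values file name is an assumption):
# rules-values.yaml
additionalPrometheusRulesMap:
  pod-restart-rules:
    groups:
      - name: pod-restarts
        rules:
          - alert: PodRestartTooHigh
            expr: increase(kube_pod_container_status_restarts_total[5m]) > 3
            for: 1m
            labels:
              severity: critical
This would then be applied with helm upgrade prometheus prometheus-community/kube-prometheus-stack -n monitoring -f rules-values.yaml.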
3.2.3 Configuring Alertmanager for Email Notifications
- To route alerts via email, Alertmanager was configured to use Gmail’s SMTP service. The configuration was defined in alertmanager.yaml:
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password_file: '/etc/secrets/smtp_password'
route:
  receiver: 'Mail Alert'
  repeat_interval: 30s
  group_wait: 15s
  group_interval: 15s
receivers:
  - name: 'Mail Alert'
    email_configs:
      - to: '[email protected]'
        headers:
          subject: 'Pod stuck in restart state'
This file was converted to a Kubernetes Secret:
kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml -n monitoring
- The password for Gmail SMTP was provided via a mounted file (smtp_password) for secure authentication.
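A sketch of how that file can be supplied; the secret name is an assumption, and Gmail requires an App Password rather than the account password when two-factor authentication is enabled:
kubectl create secret generic smtp-password \
  --from-literal=smtp_password='<gmail-app-password>' -n monitoring
# mount this secret into the Alertmanager pod at /etc/secrets so that
# smtp_auth_password_file resolves to /etc/secrets/smtp_password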
3.2.4 Testing the Alert
- To validate the alerting system, a crash-looping pod was manually created using:
kubectl run crashloop-demo --image=busybox --restart=Always -- /bin/sh -c "exit 1"
- This caused the pod to continuously restart, increasing the restart counter.
- Prometheus, using the alert rule defined earlier, detected this behavior. Once the increase() function’s threshold was crossed and persisted for one minute, an alert was triggered.
- The alert appeared in the Alertmanager UI under the configured route.
- An email notification was sent to the specified address with full metadata, including:
  - Alert name
  - Namespace
  - Affected pod
  - Restart count
  - Timestamps and severity
3.3 Deliverables
- The following outcomes and artifacts were successfully produced:
  - Visible alert in Alertmanager: shown under the “Mail Alert” route with severity critical.
  - Prometheus query graph: showed increasing values of kube_pod_container_status_restarts_total.
  - Fired alert instance: Prometheus executed the rule and triggered the alert.
  - Email notification received: delivered by Gmail SMTP with a descriptive subject and message body.
  - Supporting configuration files:
    - alerts.yaml for Prometheus rules
    - alertmanager.yaml for email routing
    - Kubernetes Secret for secure password injection
3.4 Summary and Impact
- The implementation of real-time alerting brought several operational benefits:
  - Early Detection: Crash-looping pods and other anomalies are flagged almost instantly.
  - Rapid Response: Email alerts reach stakeholders without requiring constant dashboard monitoring.
  - Production Readiness: The system now includes observability not only through dashboards, but through active notifications.
- This phase added an essential layer of resilience, helping the team respond to failures before they escalate into service outages.
Phase 4: CI/CD Integration with DevSecOps Enhancements
With the infrastructure, observability, and alerting systems in place, Phase 4 of the project focused on automating the software delivery pipeline using GitHub Actions. The goal was to implement a robust Continuous Integration and Continuous Deployment (CI/CD) system, bolstered by DevSecOps best practices such as automated vulnerability scanning, license checks, rollback mechanisms, and secure secret management.
4.1 CI/CD Pipeline Overview
The CI/CD workflow was built using GitHub Actions and was triggered on code pushes to the main branch. It performed the following steps in sequence:
Step | Description |
---|---|
Checkout Code | Pull the latest source code from GitHub |
Configure AWS Credentials | Authenticate GitHub runner to access AWS using GitHub Secrets |
Login to Amazon ECR | Use Docker CLI to log in to Elastic Container Registry |
Set Environment Variables | Dynamically generate .env file with image tags and ECR URIs |
Build Docker Images | Build all microservices using docker-compose |
Push Images to ECR | Upload container images to AWS ECR |
Install Trivy | Install Trivy CLI for vulnerability scanning |
Scan Images | Run scans on each image and fail on HIGH/CRITICAL CVEs |
Update kubeconfig | Authenticate kubectl to the target EKS cluster |
Patch YAML Manifests | Automatically update Kubernetes manifests with new image tags |
Commit Updated YAMLs | Push the updated manifests back to the GitHub repository |
Deploy to EKS | Apply all manifests using kubectl apply |
Deploy Monitoring Configs | Apply configurations for kube-state-metrics and alerting rules |
Rollback on Failure | If any step fails, trigger kubectl rollout undo for all deployments |
This end-to-end process ensures that each change in the source repository automatically goes through build, scan, deploy, and monitor steps with rollback support.
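A trimmed sketch of what such a workflow might look like; the region, image name, manifest path, and tag scheme are placeholders, not the project’s exact values:
name: ci-cd
on:
  push:
    branches: [main]
jobs:
  build-scan-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Login to Amazon ECR
        id: ecr
        uses: aws-actions/amazon-ecr-login@v2
      - name: Build and push images
        run: |
          docker compose build
          docker compose push
      - name: Scan an image with Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ steps.ecr.outputs.registry }}/frontend:${{ github.sha }}
          severity: HIGH,CRITICAL
          exit-code: '1'
      - name: Deploy to EKS
        run: |
          aws eks update-kubeconfig --name ${{ secrets.EKS_CLUSTER_NAME }}
          kubectl apply -f k8s/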
4.2 Rollback Mechanism
One of the key production-readiness features in this phase was automated rollback. The workflow used GitHub Actions’ if: failure() condition to trigger:
kubectl rollout undo deployment/<service-name> -n <namespace>
- This command restored each service to its previously stable replica set.
- A failure was simulated during testing by applying an invalid image tag, which correctly triggered the rollback behavior, ensuring that no broken deployments reached users.
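A minimal sketch of that rollback step, assuming the workloads live in a single namespace (the namespace name is a placeholder):
      - name: Rollback on failure
        if: failure()
        run: |
          for d in $(kubectl get deployments -n otel-demo -o name); do
            kubectl rollout undo "$d" -n otel-demo
          done
Because if: failure() only fires when an earlier step has failed, the rollback never runs on a healthy deployment.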
4.3 Secret Management
- Sensitive data such as AWS credentials, API keys, and cluster names were stored securely in GitHub Secrets, and accessed in the workflow using:
${{ secrets.<KEY_NAME> }}
- Examples of secrets used:
  - AWS_ACCESS_KEY_ID
  - AWS_SECRET_ACCESS_KEY
  - EKS_CLUSTER_NAME
  - FOSSA_API_KEY
- This practice eliminated the need for storing plaintext credentials in code or configuration files, aligning with industry security best practices.
4.4 DevSecOps Integrations
- To ensure code quality, security, and license compliance, several DevSecOps tools were integrated directly into the CI pipeline:
4.4.1 FOSSA
- Purpose: Scan for license violations and known open-source vulnerabilities.
- Integration: Triggered via the FOSSA GitHub Action.
- Outcome: Completed successfully with no issues detected.
4.4.2 Gradle Wrapper Validation
- Purpose: Check that gradle-wrapper.jar and gradle-wrapper.properties are valid and not tampered with.
- Trigger: PR or push events.
- Outcome: Successfully validated using a test commit under the correct path.
4.4.3 OSSF Scorecard
- Purpose: Assess the security posture of the GitHub repository.
- Features Checked: Branch protection, dependency update automation, token permissions, and more.
- Integration: Results uploaded to GitHub’s Code Scanning dashboard.
- Schedule: Triggered on push and weekly.
4.5 Challenges and Solutions
Challenge | Resolution |
---|---|
Inconsistent Docker image tagging | Used .env file with dynamic GitHub Actions variables to standardize tags |
FOSSA action failed due to team misconfiguration | Removed team parameter and used auto-detection |
Trivy scan failed due to bad image reference | Corrected the image tagging format |
Gradle wrapper validation didn’t trigger | Created a dummy commit in the monitored path to validate integration |
Kubernetes YAMLs not updated for each image | Used sed to auto-update image tags in all deployment files |
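As an illustration of that last fix, a loop of roughly this shape rewrites the image tag in every deployment manifest; the manifest path and the ECR_URI/IMAGE_TAG variables are assumptions about how the workflow names things:
for f in k8s/*.yaml; do
  # replace only the tag, keeping the repository part of the image reference
  sed -i "s|\(image: ${ECR_URI}/[a-z-]*\):.*|\1:${IMAGE_TAG}|" "$f"
done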
4.6 Execution Results
- Artifacts and verifications from successful pipeline executions included:
  - GitHub Actions workflow logs showing successful build, scan, and deployment
  - Trivy scan logs showing no high/critical vulnerabilities
  - Confirmation of image push to ECR
  - Visual confirmation of rollback behavior (if triggered)
  - Updated deployment manifests committed to GitHub
  - Running pods confirmed via kubectl get pods
  - Live application access via ALB Ingress