Monitoring, Logging & Cost Management
1. Monitoring Overview
1.1 Why Monitoring is Needed
- Ensure system availability
- Early detection of performance issues
- Capacity planning
- Cost optimization
- Security anomaly detection
1.2 Service Mapping
| Function |
AWS |
GCP |
| Metric Monitoring |
CloudWatch |
Cloud Monitoring |
| Log Collection |
CloudWatch Logs |
Cloud Logging |
| Tracing |
X-Ray |
Cloud Trace |
| Dashboards |
CloudWatch Dashboards |
Cloud Monitoring Dashboards |
| Alerting |
CloudWatch Alarms + SNS |
Alerting Policies |
| Cost Management |
Cost Explorer, Budgets |
Billing, Budgets |
2. AWS CloudWatch
2.1 Metrics
# List EC2 metrics
aws cloudwatch list-metrics --namespace AWS/EC2
# Get metric data
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
--start-time 2024-01-01T00:00:00Z \
--end-time 2024-01-01T23:59:59Z \
--period 300 \
--statistics Average
# Publish custom metric
aws cloudwatch put-metric-data \
--namespace MyApp \
--metric-name RequestCount \
--value 100 \
--unit Count \
--dimensions Environment=Production
Key Metrics:
| Service |
Metric |
Description |
| EC2 |
CPUUtilization |
CPU usage |
| EC2 |
NetworkIn/Out |
Network traffic |
| RDS |
DatabaseConnections |
DB connections |
| RDS |
FreeStorageSpace |
Remaining storage |
| ALB |
RequestCount |
Request count |
| ALB |
TargetResponseTime |
Response time |
| Lambda |
Invocations |
Invocation count |
| Lambda |
Duration |
Execution time |
2.2 Alarms
# Create CPU alarm
aws cloudwatch put-metric-alarm \
--alarm-name high-cpu \
--alarm-description "CPU over 80%" \
--metric-name CPUUtilization \
--namespace AWS/EC2 \
--statistic Average \
--period 300 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:ap-northeast-2:123456789012:alerts
# List alarms
aws cloudwatch describe-alarms
# Check alarm history
aws cloudwatch describe-alarm-history \
--alarm-name high-cpu
2.3 Dashboards
# Create dashboard
aws cloudwatch put-dashboard \
--dashboard-name MyDashboard \
--dashboard-body '{
"widgets": [
{
"type": "metric",
"x": 0, "y": 0, "width": 12, "height": 6,
"properties": {
"metrics": [
["AWS/EC2", "CPUUtilization", "InstanceId", "i-xxx"]
],
"title": "EC2 CPU",
"period": 300
}
}
]
}'
3. AWS CloudWatch Logs
3.1 Log Group Management
# Create log group
aws logs create-log-group --log-group-name /myapp/production
# Set retention policy
aws logs put-retention-policy \
--log-group-name /myapp/production \
--retention-in-days 30
# List log streams
aws logs describe-log-streams --log-group-name /myapp/production
# Query logs
aws logs filter-log-events \
--log-group-name /myapp/production \
--filter-pattern "ERROR" \
--start-time 1704067200000 \
--end-time 1704153600000
3.2 Log Insights
# Start log query
aws logs start-query \
--log-group-name /myapp/production \
--start-time 1704067200 \
--end-time 1704153600 \
--query-string 'fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20'
# Get query results
aws logs get-query-results --query-id QUERY_ID
3.3 Send Logs from EC2
# Install CloudWatch Agent (Amazon Linux)
sudo yum install -y amazon-cloudwatch-agent
# Configuration file
cat > /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json << 'EOF'
{
"logs": {
"logs_collected": {
"files": {
"collect_list": [
{
"file_path": "/var/log/myapp/*.log",
"log_group_name": "/myapp/production",
"log_stream_name": "{instance_id}"
}
]
}
}
}
}
EOF
# Start agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a fetch-config \
-m ec2 \
-c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json \
-s
4. GCP Cloud Monitoring
4.1 Metrics
# List metrics
gcloud monitoring metrics list --filter="metric.type:compute.googleapis.com"
# Read metric data (limited in gcloud, API/console recommended)
gcloud monitoring metrics read \
"compute.googleapis.com/instance/cpu/utilization" \
--project=PROJECT_ID
Key Metrics:
| Service |
Metric |
Description |
| Compute |
cpu/utilization |
CPU usage |
| Compute |
network/received_bytes |
Received traffic |
| Cloud SQL |
database/disk/utilization |
Disk usage |
| Cloud Run |
request_count |
Request count |
| GKE |
node/cpu/utilization |
Node CPU |
4.2 Alerting Policies
# Create notification channel (email)
gcloud alpha monitoring channels create \
--display-name="Email Alerts" \
--type=email \
--channel-labels=email_address=admin@example.com
# Create alerting policy
gcloud alpha monitoring policies create \
--display-name="High CPU Alert" \
--condition-display-name="CPU > 80%" \
--condition-filter='metric.type="compute.googleapis.com/instance/cpu/utilization"' \
--condition-threshold-value=0.8 \
--condition-threshold-comparison=COMPARISON_GT \
--condition-threshold-duration=300s \
--notification-channels=projects/PROJECT/notificationChannels/CHANNEL_ID
5. GCP Cloud Logging
5.1 Log Queries
# Query logs
gcloud logging read 'resource.type="gce_instance"' \
--limit=10 \
--format=json
# Error logs only
gcloud logging read 'severity>=ERROR' \
--limit=20
# Specific time range
gcloud logging read 'timestamp>="2024-01-01T00:00:00Z"' \
--limit=100
# Create log sink (export to Cloud Storage)
gcloud logging sinks create my-sink \
storage.googleapis.com/my-log-bucket \
--log-filter='resource.type="gce_instance"'
5.2 Log-based Metrics
# Create error count metric
gcloud logging metrics create error-count \
--description="Count of errors" \
--log-filter='severity>=ERROR'
# List metrics
gcloud logging metrics list
6. Cost Management
6.1 AWS Cost Explorer
# Query monthly cost
aws ce get-cost-and-usage \
--time-period Start=2024-01-01,End=2024-01-31 \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE
# Cost by service
aws ce get-cost-and-usage \
--time-period Start=2024-01-01,End=2024-01-31 \
--granularity MONTHLY \
--metrics UnblendedCost \
--group-by Type=DIMENSION,Key=SERVICE \
--output table
6.2 AWS Budgets
# Create monthly budget
aws budgets create-budget \
--account-id 123456789012 \
--budget '{
"BudgetName": "Monthly-100USD",
"BudgetLimit": {"Amount": "100", "Unit": "USD"},
"TimeUnit": "MONTHLY",
"BudgetType": "COST"
}' \
--notifications-with-subscribers '[
{
"Notification": {
"NotificationType": "ACTUAL",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 80,
"ThresholdType": "PERCENTAGE"
},
"Subscribers": [
{"SubscriptionType": "EMAIL", "Address": "admin@example.com"}
]
}
]'
# List budgets
aws budgets describe-budgets --account-id 123456789012
6.3 GCP Billing
# List billing accounts
gcloud billing accounts list
# Link project to billing
gcloud billing projects link PROJECT_ID \
--billing-account=BILLING_ACCOUNT_ID
# Create budget
gcloud billing budgets create \
--billing-account=BILLING_ACCOUNT_ID \
--display-name="Monthly Budget" \
--budget-amount=100USD \
--threshold-rule=percent=0.8,basis=CURRENT_SPEND \
--all-updates-rule-pubsub-topic=projects/PROJECT/topics/budget-alerts
7. Cost Optimization Strategies
7.1 Compute Optimization
| Strategy |
AWS |
GCP |
| Reserved Instances |
Reserved Instances |
Committed Use |
| Spot/Preemptible |
Spot Instances |
Spot/Preemptible VMs |
| Auto Scaling |
Auto Scaling |
Managed Instance Groups |
| Right Sizing |
AWS Compute Optimizer |
Recommender |
# AWS recommendations
aws compute-optimizer get-ec2-instance-recommendations
# GCP recommendations
gcloud recommender recommendations list \
--project=PROJECT_ID \
--location=global \
--recommender=google.compute.instance.MachineTypeRecommender
7.2 Storage Optimization
# S3 storage class transitions
aws s3api put-bucket-lifecycle-configuration \
--bucket my-bucket \
--lifecycle-configuration '{
"Rules": [{
"ID": "Archive old data",
"Status": "Enabled",
"Transitions": [
{"Days": 30, "StorageClass": "STANDARD_IA"},
{"Days": 90, "StorageClass": "GLACIER"}
]
}]
}'
# GCP lifecycle policy
gsutil lifecycle set lifecycle.json gs://my-bucket
7.3 Cost Savings Checklist
□ Clean up unused resources
- Stopped instances (storage costs continue)
- Unattached EBS/PD volumes
- Old snapshots
- Unused Elastic IP / static IP
□ Right sizing
- Analyze instance utilization
- Check over-provisioning
- Apply rightsizing recommendations
□ Reserved capacity
- Reserved instances for stable workloads
- Review 1-year/3-year commitments
□ Use spot/preemptible
- Batch jobs, dev environments
- Interrupt-tolerant workloads
□ Storage optimization
- Apply lifecycle policies
- Use appropriate storage class
- Clean up unnecessary data
□ Network costs
- Communicate within same AZ/region
- Use CDN
- Optimize NAT Gateway traffic
8. Tag-based Cost Tracking
8.1 Tag Strategy
# Terraform example
locals {
common_tags = {
Environment = "production"
Project = "myapp"
CostCenter = "engineering"
Owner = "team-a"
ManagedBy = "terraform"
}
}
resource "aws_instance" "web" {
# ...
tags = local.common_tags
}
# Enable AWS cost allocation tags (in Billing Console)
# Query cost by tag
aws ce get-cost-and-usage \
--time-period Start=2024-01-01,End=2024-01-31 \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=TAG,Key=Project
# GCP cost by label (requires BigQuery export)
SELECT
labels.key,
labels.value,
SUM(cost) as total_cost
FROM `billing_export.gcp_billing_export_v1_*`
CROSS JOIN UNNEST(labels) as labels
GROUP BY 1, 2
ORDER BY total_cost DESC
9. Dashboard Example
9.1 Operations Dashboard Layout
┌──────────────────────────────────────────────────────────────┐
│ Operations Dashboard │
├──────────────────────────────────────────────────────────────┤
│ ┌─────────────────┐ ┌─────────────────┐ ┌──────────────┐ │
│ │ CPU Usage │ │ Memory Usage │ │ Requests │ │
│ │ [Graph] │ │ [Graph] │ │ [Graph] │ │
│ └─────────────────┘ └─────────────────┘ └──────────────┘ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌──────────────┐ │
│ │ Response Time │ │ Error Rate │ │ Active Conn │ │
│ │ [Graph] │ │ [Graph] │ │ [Graph] │ │
│ └─────────────────┘ └─────────────────┘ └──────────────┘ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Recent Alarms / Incidents │ │
│ └────────────────────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Cost Summary (This Month) │ │
│ └────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
10. Alert Configuration Recommendations
10.1 Essential Alerts
| Category |
Condition |
Urgency |
| CPU |
> 80% (5min) |
Medium |
| CPU |
> 95% (2min) |
High |
| Memory |
> 85% |
Medium |
| Disk |
> 80% |
Medium |
| Disk |
> 90% |
High |
| Health Check |
Failed |
High |
| Error Rate |
> 1% |
Medium |
| Error Rate |
> 5% |
High |
| Response Time |
> 2s |
Medium |
| Cost |
> 80% budget |
Medium |
10.2 Notification Channels
# Create AWS SNS topic
aws sns create-topic --name alerts
# Subscribe email
aws sns subscribe \
--topic-arn arn:aws:sns:...:alerts \
--protocol email \
--notification-endpoint admin@example.com
# Slack webhook (via Lambda)
# PagerDuty, Opsgenie, etc. integration
11. Next Steps
References