Monitoring and Alerting: Prometheus and Grafana Setup
The Problem: Why Your Application Needs Proper Monitoring
In 2026, deploying applications without comprehensive monitoring is like flying blind. You're shipping code to production, users are interacting with your services, and somewhere in that complex distributed system, things are breaking—but you won't know until customers complain or revenue drops.
Modern applications face unprecedented complexity. Microservices architectures, serverless functions, containerized deployments, and multi-cloud strategies create environments where traditional logging simply isn't enough. You need real-time visibility into system behavior, performance metrics, and the ability to detect anomalies before they cascade into outages.
The stakes are higher than ever. A single minute of downtime can cost thousands of dollars. Performance degradation affects user experience and conversion rates. Memory leaks that go undetected can bring down entire clusters. Without proper monitoring, you're constantly reactive—firefighting issues after they've already impacted users.
Developers need answers to critical questions: Is my API response time degrading? Are error rates spiking? Is memory consumption trending upward? Which endpoints are experiencing the most traffic? Are my database queries slowing down? Without instrumentation and visualization, these questions remain unanswered until it's too late.
Why Traditional Monitoring Approaches Fall Short
Many teams start with basic logging solutions or cloud provider dashboards, but these approaches quickly reveal their limitations in production environments.
Log-based monitoring requires parsing through massive volumes of text data. While logs are essential for debugging specific issues, they're inefficient for understanding system-wide patterns. Searching logs for performance trends is like trying to understand ocean currents by examining individual water molecules.
Cloud provider dashboards offer basic metrics but lack customization and cross-platform visibility. If you're running services across AWS, GCP, and on-premises infrastructure, you're juggling multiple monitoring tools with inconsistent interfaces and no unified view.
APM tools like New Relic or Datadog provide excellent insights but come with significant costs that scale with your infrastructure. For many teams, especially startups and mid-sized companies, these costs become prohibitive as they grow.
Custom monitoring scripts often emerge as teams try to fill gaps. These homegrown solutions become maintenance nightmares—brittle, poorly documented, and abandoned when the original developer leaves.
The fundamental issue is that these approaches weren't designed for modern cloud-native architectures. They lack the flexibility, scalability, and cost-effectiveness required for contemporary distributed systems.
The Modern Solution: Prometheus and Grafana with TypeScript
Prometheus and Grafana have become the de facto standard for cloud-native monitoring. Prometheus handles metrics collection and storage and offers a powerful query language (PromQL), while Grafana turns that data into dashboards. In this guide, alert rules are evaluated by Prometheus and routed through Alertmanager; Grafana also ships its own alerting if you prefer to manage alerts there. Together they form an open-source monitoring stack that can replace commercial solutions for many teams.
Setting Up Prometheus
First, install the Prometheus client library for Node.js:
npm install prom-client express
npm install --save-dev typescript @types/express @types/node
Create a metrics service that exposes application metrics:
// metrics.service.ts
import { Registry, Counter, Histogram, Gauge, collectDefaultMetrics } from 'prom-client';

export class MetricsService {
  private readonly registry: Registry;
  public readonly httpRequestDuration: Histogram;
  public readonly httpRequestTotal: Counter;
  public readonly activeConnections: Gauge;
  public readonly errorTotal: Counter;

  constructor() {
    this.registry = new Registry();

    // Collect default process metrics (CPU, memory, etc.)
    collectDefaultMetrics({ register: this.registry });

    // Passing `registers` attaches each metric to this registry only.
    // Without it, prom-client also registers the metric on its global
    // default registry, and a second MetricsService instance would throw.

    // HTTP request duration histogram
    this.httpRequestDuration = new Histogram({
      name: 'http_request_duration_seconds',
      help: 'Duration of HTTP requests in seconds',
      labelNames: ['method', 'route', 'status_code'],
      buckets: [0.1, 0.5, 1, 2, 5, 10],
      registers: [this.registry]
    });

    // HTTP request counter
    this.httpRequestTotal = new Counter({
      name: 'http_requests_total',
      help: 'Total number of HTTP requests',
      labelNames: ['method', 'route', 'status_code'],
      registers: [this.registry]
    });

    // Gauge of requests currently being handled
    this.activeConnections = new Gauge({
      name: 'active_connections',
      help: 'Number of active connections',
      registers: [this.registry]
    });

    // Error counter
    this.errorTotal = new Counter({
      name: 'errors_total',
      help: 'Total number of errors',
      labelNames: ['type', 'severity'],
      registers: [this.registry]
    });
  }

  // Serialize all registered metrics in the Prometheus text format
  getMetrics(): Promise<string> {
    return this.registry.metrics();
  }
}
Integrate metrics into your Express application:
// app.ts
import express from 'express';
import { MetricsService } from './metrics.service';

const app = express();
const metrics = new MetricsService();

// Metrics middleware: records duration, request count, and in-flight requests
app.use((req, res, next) => {
  const stopTimer = metrics.httpRequestDuration.startTimer();
  metrics.activeConnections.inc();

  // 'close' fires when the response completes and also when the client
  // disconnects early, so the gauge is always decremented
  res.once('close', () => {
    // Label with the matched route pattern, never the raw path: raw paths
    // such as /api/users/123 would create one time series per ID
    const labels = {
      method: req.method,
      route: req.route?.path ?? 'unmatched',
      status_code: res.statusCode
    };
    stopTimer(labels); // observes the elapsed time in seconds
    metrics.httpRequestTotal.inc(labels);
    metrics.activeConnections.dec();
  });

  next();
});

// Expose metrics endpoint in the Prometheus text exposition format
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', 'text/plain; version=0.0.4; charset=utf-8');
  res.send(await metrics.getMetrics());
});

// Your application routes
app.get('/api/users', async (_req, res) => {
  try {
    // Your business logic
    res.json({ users: [] });
  } catch (error) {
    metrics.errorTotal.inc({ type: 'database', severity: 'high' });
    res.status(500).json({ error: 'Internal server error' });
  }
});

app.listen(3000, () => console.log('Server running on port 3000'));
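To make the /metrics output less of a black box, here is a minimal sketch of the Prometheus text exposition format for a single counter. The names `CounterSample`, `escapeLabelValue`, and `renderCounter` are hypothetical and exist only for this illustration; in the real application, prom-client produces this text for you.

```typescript
// Hypothetical types and helpers, for illustration only.
interface CounterSample {
  labels: Record<string, string>;
  value: number;
}

// Label values escape backslash, double quote, and newline.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

// Render one counter family: a HELP line, a TYPE line, then one line per label set.
function renderCounter(name: string, help: string, samples: CounterSample[]): string {
  const lines: string[] = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const sample of samples) {
    const labelPairs = Object.keys(sample.labels)
      .map((key) => `${key}="${escapeLabelValue(sample.labels[key])}"`)
      .join(',');
    const labelBlock = labelPairs.length > 0 ? `{${labelPairs}}` : '';
    lines.push(`${name}${labelBlock} ${sample.value}`);
  }
  return lines.join('\n') + '\n';
}

const exposition = renderCounter('http_requests_total', 'Total number of HTTP requests', [
  { labels: { method: 'GET', route: '/api/users', status_code: '200' }, value: 3 },
]);
console.log(exposition);
```

Each distinct combination of label values becomes its own line, and therefore its own time series in Prometheus.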
Configuring Prometheus Server
Create a prometheus.yml configuration file:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - 'alerts.yml'

scrape_configs:
  - job_name: 'nodejs-app'
    static_configs:
      - targets: ['app:3000']
        labels:
          environment: 'production'
          service: 'api'
Define alerting rules in alerts.yml:
groups:
  - name: application_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(errors_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"

      - alert: HighResponseTime
        # Aggregate buckets across method, route, and status_code first;
        # otherwise the quantile is computed separately for every label combination
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile response time is high"
          description: "P95 latency is {{ $value }}s"

      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes / 1024 / 1024 > 512
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage exceeds 512 MiB"
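The HighResponseTime rule relies on histogram_quantile, which estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified TypeScript rendition of that idea, not Prometheus's actual implementation; `Bucket` and `histogramQuantile` are hypothetical names, and it assumes positive bucket bounds plus a +Inf bucket.

```typescript
// Hypothetical shape: one cumulative histogram bucket.
interface Bucket {
  le: number;    // upper bound; Infinity for the +Inf bucket
  count: number; // cumulative number of observations <= le
}

// Simplified estimate of the q-quantile (0 <= q <= 1) from cumulative buckets.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1) return NaN;
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  if (sorted.length < 2 || sorted[sorted.length - 1].le !== Infinity) return NaN;

  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  // Find the first bucket whose cumulative count reaches the target rank.
  const rank = q * total;
  const index = sorted.findIndex((bucket) => bucket.count >= rank);

  // A rank that lands in the +Inf bucket is capped at the highest finite bound.
  if (index === sorted.length - 1) return sorted[sorted.length - 2].le;

  const lowerBound = index === 0 ? 0 : sorted[index - 1].le;
  const countBelow = index === 0 ? 0 : sorted[index - 1].count;
  const inBucket = sorted[index].count - countBelow;
  if (inBucket === 0) return lowerBound;

  // Assume observations are spread evenly inside the bucket.
  return lowerBound + (sorted[index].le - lowerBound) * ((rank - countBelow) / inBucket);
}

const p95 = histogramQuantile(0.95, [
  { le: 0.1, count: 10 },
  { le: 0.5, count: 60 },
  { le: 1, count: 90 },
  { le: 2, count: 98 },
  { le: 5, count: 100 },
  { le: Infinity, count: 100 },
]);
console.log(`estimated p95: ${p95}s`);
```

Because the result is interpolated, its precision depends on bucket boundaries. With the buckets used in this guide, any p95 between 1s and 2s is an estimate within that one bucket, so place boundaries near the thresholds you alert on.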
Setting Up Grafana
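Create a docker-compose.yml for the complete stack. With this file, Grafana is reachable at http://localhost:3001, and the URL to use for the Prometheus data source inside the Compose network is http://prometheus:9090. The file also mounts ./alertmanager.yml and ./grafana/provisioning, which are not shown in this guide and must exist before you start the stack: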
version: '3.8'

services:
  app:
    build: .
    ports:
      - "3000:3000"
    networks:
      - monitoring

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    networks:
      - monitoring

networks:
  monitoring:

volumes:
  prometheus-data:
  grafana-data:
Common Pitfalls and How to Avoid Them
High cardinality metrics: Avoid using unbounded label values like user IDs or timestamps. This creates millions of unique time series, overwhelming Prometheus. Instead, use bounded categories or aggregate data before exposing metrics.
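One way to keep a route label bounded is to collapse variable path segments before using the path as a label value. The sketch below assumes that IDs appear as purely numeric segments or UUIDs; `normalizeRoute` is a hypothetical helper, and your own URL scheme may need different patterns.

```typescript
// Patterns for path segments that carry unbounded values (assumed ID formats).
const NUMERIC_SEGMENT = /^\d+$/;
const UUID_SEGMENT = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Replace ID-like segments with a fixed placeholder and drop the query string,
// so every user or order maps to the same label value.
function normalizeRoute(path: string): string {
  const pathname = path.split('?')[0];
  return pathname
    .split('/')
    .map((segment) => (NUMERIC_SEGMENT.test(segment) || UUID_SEGMENT.test(segment) ? ':id' : segment))
    .join('/');
}

console.log(normalizeRoute('/api/users/12345/orders/9?expand=items'));
console.log(normalizeRoute('/api/users/550e8400-e29b-41d4-a716-446655440000'));
```

Both calls collapse to patterns like /api/users/:id, so the number of time series grows with the number of routes rather than the number of users.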
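Wrong metric types: Pick the type that matches what you measure. Use counters for cumulative values that only increase, gauges for current values that go up and down, histograms for distributions you want to aggregate on the server, and summaries for percentiles calculated in the client. Misusing types leads to incorrect queries and visualizations; for example, applying rate() to a gauge produces meaningless numbers.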
Inadequate retention: Default Prometheus retention is 15 days. For production systems, configure longer retention with the --storage.tsdb.retention.time flag, or add a remote storage solution such as Thanos or Cortex for long-term metrics storage.
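Before raising retention, it helps to estimate the disk space involved. The Prometheus storage documentation gives a rough formula: retention time in seconds, times ingested samples per second, times bytes per sample, with an average of about 1 to 2 bytes per sample. The sketch below applies that formula; `estimateDiskBytes` is a hypothetical helper, and the default of 2 bytes per sample is a deliberately conservative assumption.

```typescript
const SECONDS_PER_DAY = 86_400;

// Rough capacity estimate: retention_seconds * samples_per_second * bytes_per_sample.
function estimateDiskBytes(
  activeSeries: number,
  scrapeIntervalSeconds: number,
  retentionDays: number,
  bytesPerSample: number = 2, // assumption: upper end of the documented 1-2 byte average
): number {
  const samplesPerSecond = activeSeries / scrapeIntervalSeconds;
  const retentionSeconds = retentionDays * SECONDS_PER_DAY;
  return retentionSeconds * samplesPerSecond * bytesPerSample;
}

// Example: 15,000 active series scraped every 15s, kept for 15 and for 90 days.
const fifteenDays = estimateDiskBytes(15_000, 15, 15);
const ninetyDays = estimateDiskBytes(15_000, 15, 90);
console.log(`15 days: ${(fifteenDays / 1e9).toFixed(2)} GB`);
console.log(`90 days: ${(ninetyDays / 1e9).toFixed(2)} GB`);
```

Treat the result as an order-of-magnitude guide; actual usage depends on how well your series compress and on churn from short-lived label values.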
Alert fatigue: Overly sensitive alerts train teams to ignore notifications. Set appropriate thresholds, use for clauses to avoid transient spikes, and implement alert routing to ensure the right people receive relevant notifications.
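To see why a for clause filters out transient spikes, consider this simplified model of how an alert moves between states at each rule evaluation. It is a sketch of the behaviour, not Prometheus's implementation; `ForClauseTracker` and `AlertState` are hypothetical names.

```typescript
type AlertState = 'inactive' | 'pending' | 'firing';

// Tracks one alert: it fires only after its condition has held
// continuously for at least `forSeconds`.
class ForClauseTracker {
  private readonly forSeconds: number;
  private activeSince: number | null = null;

  constructor(forSeconds: number) {
    this.forSeconds = forSeconds;
  }

  // Call once per rule evaluation with the current time and expression result.
  evaluate(nowSeconds: number, conditionTrue: boolean): AlertState {
    if (!conditionTrue) {
      this.activeSince = null; // any false evaluation resets the timer
      return 'inactive';
    }
    if (this.activeSince === null) {
      this.activeSince = nowSeconds;
    }
    return nowSeconds - this.activeSince >= this.forSeconds ? 'firing' : 'pending';
  }
}

// A 5-minute `for` clause evaluated every 150 seconds.
const highErrorRate = new ForClauseTracker(300);
const timeline = [true, true, false, true, true, true].map((conditionTrue, step) =>
  highErrorRate.evaluate(step * 150, conditionTrue),
);
console.log(timeline.join(' -> '));
```

The single false evaluation in the middle resets the timer, so a brief spike never pages anyone; only a condition that persists for the full window reaches the firing state.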
Lack of documentation: Metrics without context are useless. Use the help parameter to document what each metric measures and include units in metric names (e.g., _seconds, _bytes).
Best Practices for Production Monitoring
Implement the RED method: Track Rate (requests per second), Errors (failed requests), and Duration (latency) for all services. These three metrics provide immediate insight into service health.
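As a rough illustration of what the three RED numbers mean, the sketch below computes them directly from a window of raw request samples. In practice Prometheus derives them from counters and histograms with PromQL; `RequestSample`, `RedSummary`, and `summarizeRed` are hypothetical names, and counting only 5xx responses as errors is an assumption.

```typescript
interface RequestSample {
  durationSeconds: number;
  statusCode: number;
}

interface RedSummary {
  ratePerSecond: number; // Rate
  errorRatio: number;    // Errors, as a fraction of all requests
  p95Seconds: number;    // Duration, 95th percentile
}

function summarizeRed(samples: RequestSample[], windowSeconds: number): RedSummary {
  if (samples.length === 0) {
    return { ratePerSecond: 0, errorRatio: 0, p95Seconds: 0 };
  }
  const errorCount = samples.filter((sample) => sample.statusCode >= 500).length;
  const durations = samples.map((sample) => sample.durationSeconds).sort((a, b) => a - b);
  // Nearest-rank percentile: the smallest value with at least 95% of samples at or below it.
  const p95Index = Math.ceil(0.95 * durations.length) - 1;
  return {
    ratePerSecond: samples.length / windowSeconds,
    errorRatio: errorCount / samples.length,
    p95Seconds: durations[p95Index],
  };
}

const red = summarizeRed(
  [
    { durationSeconds: 0.1, statusCode: 200 },
    { durationSeconds: 0.2, statusCode: 200 },
    { durationSeconds: 0.3, statusCode: 500 },
    { durationSeconds: 2.0, statusCode: 200 },
  ],
  2,
);
console.log(red);
```

The same three quantities map onto the metrics defined earlier: http_requests_total for rate and errors, and http_request_duration_seconds for duration.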
Use consistent naming conventions: Follow Prometheus naming best practices. Use snake_case, include units as suffixes, and prefix metrics with your application namespace (e.g., myapp_http_requests_total).
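Naming conventions are easier to keep when they are checked automatically, for example in a unit test. The sketch below encodes a few of the conventions mentioned above; `checkMetricName` is a hypothetical helper, and the rule set is a small, opinionated subset rather than a complete linter.

```typescript
type MetricType = 'counter' | 'gauge' | 'histogram' | 'summary';

const SNAKE_CASE = /^[a-z][a-z0-9]*(_[a-z0-9]+)*$/;
const BASE_UNIT_SUFFIX = /_(seconds|bytes)$/;

// Returns a list of convention violations; an empty list means the name passes.
function checkMetricName(name: string, type: MetricType, namespace: string): string[] {
  const problems: string[] = [];
  if (!SNAKE_CASE.test(name)) {
    problems.push('use lowercase snake_case');
  }
  if (!name.startsWith(`${namespace}_`)) {
    problems.push(`prefix the name with "${namespace}_"`);
  }
  if (type === 'counter' && !name.endsWith('_total')) {
    problems.push('counters should end in _total');
  }
  if (type === 'histogram' && !BASE_UNIT_SUFFIX.test(name)) {
    problems.push('histograms should end in a base unit such as _seconds or _bytes');
  }
  return problems;
}

console.log(checkMetricName('myapp_http_requests_total', 'counter', 'myapp'));
console.log(checkMetricName('httpRequestDuration', 'histogram', 'myapp'));
```

The first name passes every check, while the second is flagged for casing, the missing namespace prefix, and the missing unit suffix.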
Monitor business metrics: Beyond technical metrics, track business KPIs like user signups, transactions completed, or revenue generated. This connects technical performance to business outcomes.
Set up SLOs and SLIs: Define Service Level Indicators (SLIs) and Service Level Objectives (SLOs) based on user experience. Alert on SLO violations rather than arbitrary thresholds.
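The arithmetic behind an SLO is simple enough to sketch. An availability target leaves an error budget, and the burn rate says how fast the current error ratio is consuming it. The function names below are hypothetical, and the 99.9% target and 30-day window are example values rather than recommendations.

```typescript
// Allowed downtime (in minutes) for an availability target over a window.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  return (1 - sloTarget) * windowDays * 24 * 60;
}

// How many times faster than "exactly on budget" errors are occurring.
// A burn rate of 1 spends the whole budget exactly at the end of the window.
function burnRate(observedErrorRatio: number, sloTarget: number): number {
  return observedErrorRatio / (1 - sloTarget);
}

const budget = errorBudgetMinutes(0.999, 30);
const currentBurn = burnRate(0.01, 0.999);
console.log(`error budget: ${budget.toFixed(1)} minutes per 30 days`);
console.log(`burn rate at 1% errors: ${currentBurn.toFixed(1)}x`);
```

Alerting on burn rate ties the page to user impact: a 1% error ratio against a 99.9% target burns the budget about ten times too fast, which is a clearer signal than a fixed errors-per-second threshold.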
Implement progressive alerting: Use multiple severity levels. Warning alerts for degraded performance, critical alerts for outages. Route them to appropriate channels—Slack for warnings, PagerDuty for critical issues.
Create runbooks: Every alert should link to a runbook explaining what the alert means, potential causes, and remediation steps. This reduces mean time to resolution (MTTR).
Test your monitoring: Regularly verify that alerts fire correctly. Implement chaos engineering practices to ensure your monitoring catches real issues.
Frequently Asked Questions
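Q: How much overhead does Prometheus monitoring add to my application?
A: Usually very little. Incrementing a counter or observing a histogram value is an in-memory operation, so the per-request cost of prom-client is small. Scrapes are not entirely free, though: the /metrics handler runs on the same Node.js event loop as your request handlers, and serializing a large registry takes CPU time. Keep the number of metrics and label combinations modest, and measure the overhead in your own service rather than relying on a fixed figure.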
Q: Should I use push or pull-based metrics collection?
A: Prometheus uses a pull model by default, which is ideal for most scenarios. It provides better control over scrape intervals and makes service discovery easier. Use push-based collection (via Pushgateway) only for short-lived jobs or batch processes that don't run long enough to be scraped.
Q: How do I monitor multiple environments (dev, staging, production)?
A: Use labels to differentiate environments. Add an environment label to all metrics and configure separate Prometheus instances or use federation to aggregate metrics from multiple sources. Create environment-specific dashboards in Grafana.
Q: What's the difference between Prometheus and traditional time-series databases?
A: Prometheus is optimized for operational monitoring with a powerful query language and built-in alerting rules. General-purpose TSDBs like InfluxDB offer more flexibility but may require additional tooling for alerting, depending on the version and setup. Prometheus's pull model and service discovery make it a good fit for dynamic cloud environments.
Q: How do I handle sensitive data in metrics?
A: Never include PII, passwords, or sensitive business data in metric labels or values. Use aggregated, anonymized data. If you need to track user-specific behavior, keep in mind that even hashed identifiers create one time series per user, so logs or traces are usually a better home for that data than labels. In either case, implement proper access controls on your Grafana instance.
Q: Can I use Prometheus with serverless functions?
A: Yes, but it requires adaptation. Since serverless functions are ephemeral, use a push-based approach with Pushgateway, or export metrics to CloudWatch or Google Cloud Monitoring (formerly Stackdriver) and use exporters to make them available to Prometheus.
Q: How do I scale Prometheus for large deployments?
A: Implement federation to create a hierarchical structure where local Prometheus instances scrape services and global instances aggregate data. Consider Thanos or Cortex for horizontal scaling and long-term storage. Use recording rules to pre-compute expensive queries.
Conclusion
Implementing Prometheus and Grafana monitoring transforms how you understand and operate your applications. This observability stack provides real-time visibility into system behavior, enables proactive issue detection, and empowers data-driven decision-making—all without the cost burden of commercial APM solutions.
The TypeScript implementation patterns shown here provide a solid foundation for production monitoring. By exposing meaningful metrics, configuring intelligent alerts, and creating informative dashboards, you shift from reactive firefighting to proactive system management.
Remember that monitoring is not a one-time setup but an evolving practice. As your application grows, continuously refine your metrics, adjust alert thresholds, and enhance dashboards based on operational experience. The investment in proper monitoring pays dividends through reduced downtime, faster incident resolution, and improved system reliability.
Start small—instrument your critical paths first, set up basic dashboards, and configure essential alerts. As you gain confidence with the stack, expand coverage to include business metrics, implement SLOs, and build sophisticated visualizations. Your future self, and your on-call team, will thank you.