Debugging & Logging - Probes

Monitor server health with probes, thresholds, and automated alerts

Overview

Probes are monitoring endpoints and health check mechanisms in ColdFusion that continuously track server performance and resource usage. They enable proactive monitoring by detecting issues before they impact users, triggering alerts when configurable thresholds are exceeded, and providing real-time visibility into server health metrics.

Properly configured probes are essential for production environments, enabling operations teams to respond quickly to performance degradation, resource exhaustion, or other issues. Probes can monitor CPU usage, memory consumption, request queue depth, active sessions, and various other metrics, sending notifications or taking automated actions when problems occur.

Probe Types and Metrics

Configure monitoring for critical server resources and performance indicators.

CPU Usage Monitoring

  • Metric: Percentage of CPU utilization by the ColdFusion process
  • Threshold: Configurable percentage (e.g., alert at 80%)
  • Duration: Sustained high CPU vs. temporary spikes
  • Use Case: Detect runaway requests, inefficient code, high load
  • Recommended: Warning at 70%, critical at 85%

Best Practice: Configure two-tier alerting, with a warning level below the critical level, so the team can respond before service is affected.
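
As a minimal sketch of two-tier alerting, the function below maps a metric reading to a severity. It is written in Python purely for illustration; the names `Severity` and `classify` are hypothetical and not part of any ColdFusion API.

```python
from enum import Enum


class Severity(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


def classify(value: float, warning: float, critical: float) -> Severity:
    """Map a metric reading to a severity using two thresholds."""
    if warning >= critical:
        raise ValueError("warning threshold must be below critical threshold")
    if value >= critical:
        return Severity.CRITICAL
    if value >= warning:
        return Severity.WARNING
    return Severity.OK


# Thresholds from the recommendation above: warning at 70%, critical at 85%.
cpu_severity = classify(78.0, warning=70.0, critical=85.0)
print(cpu_severity.value)  # prints "warning"
```

The same function can be reused for heap or thread utilization by passing different thresholds.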

Memory Usage Monitoring

  • Metric: JVM heap usage percentage
  • Threshold Types: Used memory, free memory, garbage collection frequency
  • Critical Levels: Alert before an OutOfMemoryError occurs
  • GC Monitoring: Track garbage collection time and frequency
  • Recommended: Warning at 75% heap, critical at 90%

Critical: Monitor both heap usage and GC frequency. Frequent garbage collection (even with heap available) indicates memory pressure.

Request Queue Monitoring

  • Metric: Number of queued requests waiting for processing
  • Indicator: Server overload or slow requests blocking threads
  • Impact: Queue buildup causes response time degradation
  • Recommended: Warning at 10 queued, critical at 25
  • Root Causes: Insufficient threads, slow database, external service delays

Tip: A persistent queue (even a small one) indicates that you need more capacity or faster request processing. Ideally, the queue is empty most of the time.

Other Metrics

Active Sessions

  • Metric: Number of active user sessions
  • Use Cases: Detect session explosion, potential memory issues
  • Security: Unusual spikes may indicate an attack

Track session growth over time for capacity planning and security monitoring.

Active Threads

  • Metric: Number of threads actively processing requests
  • Max Threads: Compare to configured maximum
  • Threshold: Warning at 80% utilization

Alert on thread exhaustion: when all threads are busy, the server has run out of request-processing capacity.

Database Connection Pool

  • Metric: Active vs. available database connections
  • Monitor: Each datasource independently
  • Detect: Connection leaks and pool exhaustion

Alert when no connections are available, or trigger a connection pool reset.

Request Execution Time

  • Metric: Average and maximum request duration
  • Threshold: Alert on requests that take longer than 5-10 seconds
  • Correlation: Link to other metrics (DB time, queue depth)

Identify slow pages and detect performance degradation over time.

Configuring Probe Endpoints

Set up HTTP endpoints and APIs for external monitoring and health checks.

Health Check URLs

  • Purpose: HTTP endpoints returning server health status
  • Response Format: HTTP status code (200 OK, 503 Service Unavailable)
  • Example URL: /CFIDE/probe.cfm or custom endpoint

Health check URLs are used by load balancers and external monitoring systems. Restrict access by IP address or require authentication.

Built-in Monitoring API

  • Access: Programmatic access to all metrics
  • Permissions: Requires administrator credentials
  • Format: Structured data (arrays, structs)

Use the ColdFusion Administrator API to feed custom monitoring scripts and dashboards.

Custom Probe CFCs

  • Implementation: Create a custom CFC for application-specific checks
  • Checks: Database connectivity, external service availability, cache health
  • Return Status: Boolean healthy/unhealthy or detailed metrics

Keep probe logic lightweight and fast (under 100ms execution time).
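
The shape of such a probe can be sketched in a language-neutral way. The Python example below is only an illustration (a real probe would be a CFC). It runs a set of named checks, treats any exception as a failure, records how long the checks took, and maps the result to the 200/503 status codes described earlier. The names `run_checks` and `http_status` and the stand-in checks are hypothetical.

```python
import time
from typing import Callable, Dict


def run_checks(checks: Dict[str, Callable[[], bool]],
               budget_ms: float = 100.0) -> dict:
    """Run each named check and summarize the results.

    A check that raises is reported as unhealthy rather than crashing
    the probe itself.
    """
    results = {}
    started = time.perf_counter()
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    elapsed_ms = (time.perf_counter() - started) * 1000.0
    return {
        "healthy": all(results.values()),
        "checks": results,
        "elapsed_ms": elapsed_ms,
        "within_budget": elapsed_ms <= budget_ms,
    }


def http_status(report: dict) -> int:
    """200 when every check passed, 503 otherwise."""
    return 200 if report["healthy"] else 503


def failing_service_check() -> bool:
    # Stand-in for an external service call that times out.
    raise TimeoutError("service did not answer")


report = run_checks({
    "database": lambda: True,          # stand-in for a cheap connectivity test
    "payment_service": failing_service_check,
})
print(http_status(report), report["checks"])
```

Returning both the boolean and the per-check detail lets a load balancer read only the status code while a dashboard reads the full report.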

Threshold Configuration

Setting Appropriate Thresholds

  • Baseline Measurement: Monitor normal operations first
  • Peak vs. Average: Account for traffic spikes
  • Warning vs. Critical: Two-tier alerting (warning at 75%, critical at 90%)
  • Sustained vs. Spike: Require threshold exceeded for duration (e.g., 2 minutes)
  • Seasonal Patterns: Adjust for known high-traffic periods
  • Growth Accommodation: Review and adjust thresholds quarterly
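
The "sustained vs. spike" rule can be sketched as follows. This Python example is an illustration only; it assumes samples arrive as (timestamp in seconds, value) pairs in chronological order, and the function name `sustained_breach` is hypothetical.

```python
from typing import Iterable, Tuple


def sustained_breach(samples: Iterable[Tuple[float, float]],
                     threshold: float,
                     min_duration_s: float) -> bool:
    """Return True if the metric stayed at or above the threshold for at
    least min_duration_s seconds without dropping below it."""
    breach_start = None
    for ts, value in samples:
        if value >= threshold:
            if breach_start is None:
                breach_start = ts          # a new breach window begins
            if ts - breach_start >= min_duration_s:
                return True
        else:
            breach_start = None            # any dip resets the window
    return False


spike = [(0, 50), (30, 95), (60, 50), (90, 60)]
steady = [(0, 80), (30, 82), (60, 79), (90, 85), (120, 81)]
print(sustained_breach(spike, threshold=75, min_duration_s=120))   # False
print(sustained_breach(steady, threshold=75, min_duration_s=120))  # True
```

Resetting the window on every dip is a deliberately strict choice; a more tolerant variant could allow a small fraction of samples below the threshold.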

CPU Thresholds

  • Warning: 70-75% sustained for 2+ minutes
  • Critical: 85-90% sustained for 5+ minutes
  • Spike Tolerance: Brief 95%+ acceptable during batch jobs
  • Multi-Core: Monitor per-core if possible

Memory Thresholds

  • Warning: 75% heap usage
  • Critical: 90% heap usage or frequent full GC
  • GC Threshold: Alert if GC taking >10% of CPU time
  • Prevention: Set thresholds to prevent OOM errors
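
A minimal sketch of these memory checks, in Python for illustration. It assumes the monitoring layer can read used and maximum heap sizes plus a cumulative GC time counter in milliseconds (JVM garbage collector beans expose counters of this kind). The GC figure here is the share of elapsed wall-clock time, which only approximates the CPU-time share mentioned above. All function names are hypothetical.

```python
def heap_used_percent(used_bytes: int, max_bytes: int) -> float:
    """Heap usage as a percentage of the maximum heap size."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    return 100.0 * used_bytes / max_bytes


def gc_overhead_percent(gc_ms_start: float, gc_ms_end: float,
                        interval_ms: float) -> float:
    """Share of the sampling interval spent in GC, from two readings of a
    cumulative GC time counter."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return 100.0 * (gc_ms_end - gc_ms_start) / interval_ms


def memory_alert(heap_pct: float, gc_pct: float) -> str:
    """Apply the thresholds above: 75% warning, 90% critical, GC over 10%."""
    if heap_pct >= 90.0:
        return "critical"
    if heap_pct >= 75.0 or gc_pct > 10.0:
        return "warning"
    return "ok"


heap = heap_used_percent(768, 1024)          # 75.0
gc = gc_overhead_percent(1000, 1900, 6000)   # 15.0
print(heap, gc, memory_alert(heap, gc))
```

Treating high GC overhead as a warning even when heap usage looks fine reflects the point that frequent collection signals memory pressure on its own.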

Request Queue Thresholds

  • Warning: 5-10 queued requests
  • Critical: 25+ queued requests
  • Zero Queue Goal: Ideally, queue should be empty most of the time
  • Capacity Planning: Persistent queue indicates need for scaling

Alerting Configuration

Email Notifications

  • Configuration: Recipient addresses for alerts
  • Content: Metric value, threshold, timestamp, server identifier
  • Frequency: Rate limiting to prevent email flooding
  • Escalation: Different recipients for warning vs. critical
  • Recovery: Send notification when issue resolved

SNMP Traps

  • Purpose: Integration with enterprise monitoring systems
  • Configuration: SNMP manager address and community string
  • Trap OIDs: Unique identifiers for each metric type
  • Severity Mapping: Map thresholds to SNMP severity levels
  • Standards: Follow SNMP v2c or v3 protocols

Custom Actions

  • Script Execution: Run custom scripts on threshold breach
  • HTTP Webhooks: POST to external API or service
  • Auto-Remediation: Restart services, clear caches, kill long-running requests
  • Logging: Write detailed entries to probe log file
  • Integration: Trigger PagerDuty, Slack, or other notification services
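
As an illustration of the webhook action, the Python sketch below builds (but does not send) an HTTP POST carrying the same fields listed for email alerts: metric value, threshold, timestamp, and server identifier. The URL, field names, and the function `build_webhook_request` are assumptions; real services such as PagerDuty or Slack define their own payload formats.

```python
import json
import urllib.request


def build_webhook_request(url: str, server: str, metric: str, value: float,
                          threshold: float, severity: str,
                          timestamp: str) -> urllib.request.Request:
    """Prepare a JSON POST describing a threshold breach."""
    payload = {
        "server": server,
        "metric": metric,
        "value": value,
        "threshold": threshold,
        "severity": severity,
        "timestamp": timestamp,
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_webhook_request(
    "https://alerts.example.invalid/hook",   # placeholder URL
    server="cf-prod-01", metric="cpu_percent", value=91.5,
    threshold=85.0, severity="critical",
    timestamp="2024-01-01T00:00:00+00:00",
)
print(req.get_method(), req.full_url)
# Sending would be: urllib.request.urlopen(req, timeout=5)
```

Keeping request construction separate from sending makes the payload easy to test without network access.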

Alert Tuning

  • Reduce Noise: Eliminate false positive alerts
  • Rate Limiting: One alert per issue per time period
  • Quiet Hours: Suppress non-critical alerts during off-hours
  • Dependency Awareness: Don't alert on secondary issues from primary problem
  • Escalation Path: Progressive notifications if issue persists
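
Rate limiting ("one alert per issue per time period") together with recovery notifications can be sketched like this. The Python class below is illustrative only; `AlertThrottle` and its methods are hypothetical names, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
class AlertThrottle:
    """Allow at most one alert per issue key within min_interval_s seconds."""

    def __init__(self, min_interval_s: float):
        self.min_interval_s = min_interval_s
        self._last_sent = {}   # issue key -> time the last alert was sent

    def should_send(self, issue_key: str, now: float) -> bool:
        last = self._last_sent.get(issue_key)
        if last is not None and now - last < self.min_interval_s:
            return False       # still inside the quiet period for this issue
        self._last_sent[issue_key] = now
        return True

    def resolve(self, issue_key: str) -> bool:
        """Forget the issue. Returns True if an alert had been sent, meaning
        a recovery notification is due."""
        return self._last_sent.pop(issue_key, None) is not None


throttle = AlertThrottle(min_interval_s=300)
print(throttle.should_send("cpu-critical", now=0))     # True
print(throttle.should_send("cpu-critical", now=100))   # False
print(throttle.should_send("cpu-critical", now=300))   # True
print(throttle.resolve("cpu-critical"))                # True
```

Keying the throttle by issue rather than globally means a memory alert is never hidden behind a noisy CPU alert.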

Integration with Monitoring Tools

Load Balancer Health Checks

  • Purpose: Automatically remove unhealthy servers from rotation
  • Health Check URL: Probe endpoint returning 200 when healthy
  • Check Frequency: Every 5-30 seconds
  • Failure Threshold: Remove after 2-3 consecutive failures
  • Recovery: Re-add to pool after successful checks
  • Drain Mode: Stop sending new requests but allow existing to complete
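
The failure-threshold and recovery behavior can be modeled as a small state machine. The Python sketch below is an illustration, not any load balancer's actual implementation; the class name `BackendHealth` and the `recover_threshold` of two consecutive successes are assumptions.

```python
class BackendHealth:
    """Track consecutive health check results for one backend server."""

    def __init__(self, fail_threshold: int = 3, recover_threshold: int = 2):
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.in_rotation = True
        self._fails = 0
        self._successes = 0

    def record(self, status_code: int) -> bool:
        """Record one check result and return whether the server is in rotation."""
        if status_code == 200:
            self._fails = 0
            self._successes += 1
            if not self.in_rotation and self._successes >= self.recover_threshold:
                self.in_rotation = True
        else:
            self._successes = 0
            self._fails += 1
            if self.in_rotation and self._fails >= self.fail_threshold:
                self.in_rotation = False
        return self.in_rotation


backend = BackendHealth(fail_threshold=3, recover_threshold=2)
history = [backend.record(code) for code in (200, 503, 503, 503, 200, 200)]
print(history)  # [True, True, True, False, False, True]
```

Requiring several consecutive results in both directions keeps a single slow response from removing a server and a single lucky response from restoring it.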

APM Integration

  • FusionReactor: Native CF APM with extensive probes
  • New Relic: Cloud APM with custom metric support
  • AppDynamics: Enterprise APM platform
  • Datadog: Infrastructure and application monitoring
  • Custom Metrics: Push CF probe data to APM systems

Log Aggregation

  • Splunk: Index and search CF probe logs
  • ELK Stack: Elasticsearch, Logstash, Kibana
  • CloudWatch: AWS native log monitoring
  • Structured Logging: JSON format for easy parsing
  • Correlation: Link probe events with application logs
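
A structured probe log entry might look like the following Python sketch, which serializes one event as a single line of JSON. The field names and the `request_id` correlation key are assumptions chosen for illustration; use whatever schema your log aggregator expects.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def format_probe_event(server: str, metric: str, value: float,
                       threshold: float, severity: str,
                       request_id: Optional[str] = None,
                       timestamp: Optional[datetime] = None) -> str:
    """Serialize one probe event as a single-line JSON string."""
    when = timestamp or datetime.now(timezone.utc)
    event = {
        "timestamp": when.isoformat(),
        "server": server,
        "metric": metric,
        "value": value,
        "threshold": threshold,
        "severity": severity,
        # Hypothetical shared ID that application log entries could also carry.
        "request_id": request_id,
    }
    return json.dumps(event, sort_keys=True)


line = format_probe_event(
    "cf-prod-01", "queue_depth", 27, 25, "critical",
    request_id="req-1234",
    timestamp=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
print(line)
```

One JSON object per line is straightforward for most log shippers to parse, and a shared identifier gives the aggregator a field to join probe events with application logs.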

Dashboards and Visualization

  • Grafana: Open-source dashboard platform
  • Custom Dashboards: Build application-specific views
  • Real-Time: Live updating metrics
  • Historical: Trend analysis and capacity planning
  • Alerts: Visual indication of threshold breaches

Best Practices

Production Monitoring

  • Enable probes on all production servers
  • Configure alerts to on-call team members
  • Set warning thresholds before critical levels
  • Test alert delivery regularly
  • Document alert response procedures
  • Review probe logs during post-mortems
  • Correlate probe alerts with application errors
  • Use probes for capacity planning decisions

Threshold Configuration

  • Start with conservative thresholds
  • Monitor for false positives and adjust
  • Account for normal traffic patterns
  • Set different thresholds for different times (peak vs. off-peak)
  • Review and update thresholds quarterly
  • Document threshold decisions and rationale

Alert Response

  • Create runbooks for common alert scenarios
  • Define escalation procedures
  • Track mean time to acknowledge (MTTA)
  • Track mean time to resolution (MTTR)
  • Conduct post-incident reviews
  • Continuously improve response procedures

Security Considerations

  • Restrict access to probe endpoints by IP
  • Require authentication for detailed metrics
  • Don't expose sensitive data in health checks
  • Use HTTPS for probe endpoints
  • Monitor for probe endpoint abuse or DDoS
  • Rate-limit probe endpoint access

Performance Impact

  • Keep probe logic lightweight (under 100 ms execution time)
  • Cache probe results if appropriate (30-60 seconds)
  • Avoid heavy operations in health checks
  • Don't query large datasets for probes
  • Test probe performance under load
  • Monitor probe execution time itself
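
Caching probe results can be sketched as a small time-to-live wrapper. This Python example is illustrative; `CachedProbe` is a hypothetical name, and the injectable `clock` parameter exists only to make the behavior easy to demonstrate.

```python
import time
from typing import Callable


class CachedProbe:
    """Re-run the wrapped probe at most once per ttl_s seconds."""

    def __init__(self, probe: Callable[[], dict], ttl_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._probe = probe
        self._ttl_s = ttl_s
        self._clock = clock
        self._value = None
        self._expires_at = None

    def get(self) -> dict:
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._probe()          # cache miss: run the real check
            self._expires_at = now + self._ttl_s
        return self._value


calls = []
fake_now = [0.0]


def expensive_probe() -> dict:
    calls.append(1)
    return {"healthy": True, "run": len(calls)}


cached = CachedProbe(expensive_probe, ttl_s=30.0, clock=lambda: fake_now[0])
cached.get()
fake_now[0] = 10.0
cached.get()          # served from cache
fake_now[0] = 31.0
cached.get()          # TTL expired, probe runs again
print(len(calls))     # 2
```

The trade-off is staleness: with a 30-second TTL, a failure can go unreported for up to 30 seconds, so the TTL should stay shorter than the alerting window.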

Common Issues and Solutions

False Positive Alerts

  • Symptom: Alerts triggered without actual problems
  • Causes: Thresholds too aggressive, temporary spikes normal
  • Solution: Increase thresholds or require sustained breach
  • Duration: Require metric above threshold for 2-5 minutes
  • Review: Analyze historical data to set better thresholds

Missed Issues

  • Symptom: Problems occur without probe alerts
  • Causes: Thresholds too permissive, wrong metrics monitored
  • Solution: Lower thresholds or add additional probes
  • Gap Analysis: Review incidents and ensure relevant probes exist
  • Comprehensive: Monitor all critical resource types

Alert Fatigue

  • Symptom: Team ignores alerts due to high volume
  • Causes: Too many low-priority alerts, duplicate notifications
  • Solution: Tune thresholds, implement rate limiting
  • Prioritization: Only alert on actionable issues
  • Escalation: Route non-critical alerts differently

Probe Endpoint Unavailable

  • Symptom: Health checks fail incorrectly
  • Causes: Web server issue, IP restriction, CF restart
  • Solution: Ensure probe endpoint is lightweight and reliable
  • Redundancy: Multiple health check mechanisms
  • Testing: Verify probe accessibility during deployments

Delayed Notifications

  • Symptom: Alerts arrive minutes or hours after issue
  • Causes: Probe check interval too long, email delays
  • Solution: Increase probe frequency, use faster notification methods
  • Real-Time: Use push notifications or webhooks instead of email
  • Monitoring: Monitor notification delivery time

Advanced Patterns

Composite Health Scores

  • Combine multiple metrics into overall health score
  • Weight metrics by criticality
  • Single number indicating overall system health (0-100)
  • Easy consumption by non-technical stakeholders
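
One possible way to compute such a score is a weighted average of utilization figures, inverted so that 100 means healthy. The Python sketch below is an assumption about the formula, not a standard; the metric names and weights are made up for illustration.

```python
from typing import Dict


def health_score(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine utilization percentages (0-100, higher is worse) into a
    single 0-100 score where 100 is healthiest."""
    total_weight = sum(weights[name] for name in metrics)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_load = sum(
        weights[name] * min(max(metrics[name], 0.0), 100.0)  # clamp to 0-100
        for name in metrics
    )
    return round(100.0 - weighted_load / total_weight, 1)


score = health_score(
    {"cpu": 60.0, "heap": 80.0, "threads": 40.0},
    {"cpu": 2.0, "heap": 3.0, "threads": 1.0},
)
print(score)  # 33.3
```

A weighted average can hide one exhausted resource behind several healthy ones, so a composite score works best alongside per-metric alerts rather than instead of them.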

Predictive Alerting

  • Analyze metric trends to predict issues
  • Alert on trajectory toward threshold (will hit in X minutes)
  • Machine learning for anomaly detection
  • Proactive intervention before user impact
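
The simplest form of trajectory alerting fits a straight line to recent samples and extrapolates to the threshold. The Python sketch below uses ordinary least squares. It is a deliberately basic stand-in for the trend analysis described above, not a machine learning method, and the function name is hypothetical.

```python
from typing import Optional, Sequence, Tuple


def minutes_until_threshold(samples: Sequence[Tuple[float, float]],
                            threshold: float) -> Optional[float]:
    """Estimate minutes from the last sample until the fitted trend line
    reaches the threshold.

    samples are (minute, value) pairs. Returns 0.0 if the last value is
    already at or above the threshold, and None if the trend is flat or
    falling or there is too little data.
    """
    if len(samples) < 2:
        return None
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    last_t, last_v = samples[-1]
    if last_v >= threshold:
        return 0.0
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    crossing_t = (threshold - intercept) / slope
    return max(crossing_t - last_t, 0.0)


heap_trend = [(0, 60.0), (1, 62.0), (2, 64.0), (3, 66.0)]
print(minutes_until_threshold(heap_trend, threshold=90.0))  # 12.0
```

A linear fit suits slow leaks such as steadily growing heap usage; it will mislead on metrics with daily cycles, where a seasonal baseline is more appropriate.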

Auto-Remediation

  • Automatically respond to certain alert conditions
  • Clear caches, restart services, kill long-running requests
  • Implement circuit breakers for external dependencies
  • Log all automatic actions for audit trail
  • Notify team of automatic remediation actions

Dependency Tracking

  • Map dependencies between services
  • Suppress secondary alerts when primary service fails
  • Visualize cascade effects
  • Focus response on root cause
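
Secondary-alert suppression can be sketched as a walk over a dependency map: a failing service is reported only if none of its direct or transitive dependencies is also failing. The Python example below is illustrative; the service names and the function `root_cause_alerts` are hypothetical.

```python
from typing import Dict, List, Set


def root_cause_alerts(failing: Set[str],
                      depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return the failing services that have no failing dependency."""

    def has_failing_dependency(service: str, seen: Set[str]) -> bool:
        for dep in depends_on.get(service, []):
            if dep in seen:
                continue               # already visited; also guards against cycles
            seen.add(dep)
            if dep in failing or has_failing_dependency(dep, seen):
                return True
        return False

    return {s for s in failing if not has_failing_dependency(s, {s})}


depends_on = {
    "web": ["app"],
    "app": ["db", "cache"],
    "reports": ["db"],
}
print(root_cause_alerts({"web", "app", "reports", "db"}, depends_on))  # {'db'}
```

Here the web, app, and reports alerts are suppressed because they all trace back to the database, which is the one alert the on-call engineer needs to see.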

Metrics to Monitor

Server-Level Metrics

  • CPU usage (process and system)
  • Memory usage (heap and non-heap)
  • Disk I/O and space
  • Network bandwidth
  • Thread count and state

Application-Level Metrics

  • Request queue depth
  • Active request count
  • Average/max request duration
  • Request throughput (requests/second)
  • Error rate and types

Database Metrics

  • Connection pool usage
  • Query execution time
  • Database server health
  • Connection errors
  • Query queue depth

Session Metrics

  • Active session count
  • Session creation rate
  • Session timeout rate
  • Average session size
  • Total session memory

Cache Metrics

  • Template cache hit ratio
  • Query cache hit ratio
  • Object cache size and hits
  • Cache eviction rate
  • Cache memory usage
