Debugging & Logging - Probes
Monitor server health with probes, thresholds, and automated alerts
Overview
Probes are monitoring endpoints and health check mechanisms in ColdFusion that continuously track server performance and resource usage. They enable proactive monitoring by detecting issues before they impact users, triggering alerts when configurable thresholds are exceeded, and providing real-time visibility into server health metrics.
Properly configured probes are essential for production environments, enabling operations teams to respond quickly to performance degradation, resource exhaustion, or other issues. Probes can monitor CPU usage, memory consumption, request queue depth, active sessions, and various other metrics, sending notifications or taking automated actions when problems occur.
Probe Types and Metrics
Configure monitoring for critical server resources and performance indicators.
CPU Usage Monitoring
Memory Usage Monitoring
Request Queue Monitoring
Other Metrics
Active Sessions
- Metric: Number of active user sessions
- Use Cases: Detect session explosion, potential memory issues
- Security: Unusual spikes may indicate an attack
Track session growth over time for capacity planning and security monitoring.
Active Threads
- Metric: Number of threads actively processing requests
- Max Threads: Compare to configured maximum
- Threshold: Warning at 80% utilization
Alert when thread exhaustion occurs; all threads busy indicates a capacity problem.
Database Connection Pool
- Metric: Active vs. available database connections
- Monitor: Each datasource independently
- Detect: Connection leaks and pool exhaustion
Alert when no connections are available, or trigger a connection pool reset.
Request Execution Time
- Metric: Average and maximum request duration
- Threshold: Alert on requests over 5-10 seconds
- Correlation: Link to other metrics (DB time, queue depth)
Identify slow pages and detect performance degradation over time.
Configuring Probe Endpoints
Set up HTTP endpoints and APIs for external monitoring and health checks.
Health Check URLs
- Purpose: HTTP endpoints returning server health status
- Response Format: HTTP status code (200 OK, 503 Service Unavailable)
- Example URL: /CFIDE/probe.cfm or custom endpoint
Restrict access by IP or require authentication. Used by load balancers and external monitoring.
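A minimal health check endpoint might look like the sketch below; the file name, datasource name, and JSON response body are illustrative, not a built-in ColdFusion endpoint.

```cfml
<cfscript>
// probe.cfm - hypothetical health check page. Returns HTTP 200 when the basic
// checks pass and 503 otherwise, so load balancers and external monitors can
// act on the status code alone.
healthy = true;
detail  = {};

try {
    // Lightweight database connectivity check against an assumed datasource "appDSN"
    // (adjust the SQL for your database engine)
    queryExecute( "SELECT 1 AS ok", {}, { datasource = "appDSN", timeout = 5 } );
    detail.database = "ok";
} catch ( any e ) {
    healthy = false;
    detail.database = "unavailable";
}

if ( healthy ) {
    cfheader( statuscode = 200, statustext = "OK" );
} else {
    cfheader( statuscode = 503, statustext = "Service Unavailable" );
}

cfcontent( type = "application/json" );
writeOutput( serializeJSON( { status = ( healthy ? "UP" : "DOWN" ), checks = detail } ) );
</cfscript>
```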
Built-in Monitoring API
- Access: Programmatic access to all metrics
- Permissions: Requires administrator credentials
- Format: Structured data (arrays, structs)
CF Administrator API for custom monitoring scripts and dashboards.
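As a related sketch, ColdFusion's built-in getMetricData() function exposes some of these counters programmatically. The perf_monitor mode requires metrics to be enabled in the Administrator, and the available keys vary by version and edition, so treat the field names below as assumptions to verify on your server.

```cfml
<cfscript>
// Pull a snapshot of server metrics with getMetricData(). Field names such as
// ReqRunning and AvgReqTime are typical but should be confirmed for your
// ColdFusion version; the log file name is a placeholder.
perf = getMetricData( "perf_monitor" );

snapshot = {
    instance   = perf.InstanceName,
    running    = perf.ReqRunning,   // requests currently executing
    queued     = perf.ReqQueued,    // requests waiting in the queue
    avgReqTime = perf.AvgReqTime,   // average request time (ms)
    avgDbTime  = perf.AvgDBTime,    // average database time (ms)
    timedOut   = perf.ReqTimedOut
};

// Feed the snapshot to a dashboard, log file, or alerting script
writeLog( file = "probe-metrics", text = serializeJSON( snapshot ) );
</cfscript>
```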
Custom Probe CFCs
- Implementation: Create custom CFC for application-specific checks
- Checks: Database connectivity, external service availability, cache health
- Return Status: Boolean healthy/unhealthy or detailed metrics
Keep probe logic lightweight and fast (under 100ms execution time).
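A hypothetical probe CFC along these lines (the component name, datasource, and cache key are placeholders):

```cfml
// HealthProbe.cfc - hypothetical application-specific probe component.
// Each check is kept deliberately cheap so the probe itself stays fast.
component {

    public struct function check() {
        var result = { healthy = true, checks = {} };

        // Database connectivity (assumed datasource "appDSN")
        try {
            queryExecute( "SELECT 1 AS ok", {}, { datasource = "appDSN", timeout = 3 } );
            result.checks.database = "ok";
        } catch ( any e ) {
            result.healthy = false;
            result.checks.database = "failed: " & e.message;
        }

        // Cache health: confirm a known key can be written and read back
        try {
            cachePut( "probe_ping", now(), createTimeSpan( 0, 0, 1, 0 ) );
            result.checks.cache = isDate( cacheGet( "probe_ping" ) ) ? "ok" : "degraded";
        } catch ( any e ) {
            result.healthy = false;
            result.checks.cache = "failed: " & e.message;
        }

        // An external service check could be added the same way,
        // using cfhttp with a short timeout to keep the probe fast.
        return result;
    }
}
```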
Threshold Configuration
Setting Appropriate Thresholds
- Baseline Measurement: Monitor normal operations first
- Peak vs. Average: Account for traffic spikes
- Warning vs. Critical: Two-tier alerting (warning at 75%, critical at 90%)
- Sustained vs. Spike: Require threshold exceeded for duration (e.g., 2 minutes)
- Seasonal Patterns: Adjust for known high-traffic periods
- Growth Accommodation: Review and adjust thresholds quarterly
CPU Thresholds
- Warning: 70-75% sustained for 2+ minutes
- Critical: 85-90% sustained for 5+ minutes
- Spike Tolerance: Brief 95%+ acceptable during batch jobs
- Multi-Core: Monitor per-core if possible
Memory Thresholds
- Warning: 75% heap usage
- Critical: 90% heap usage or frequent full GC
- GC Threshold: Alert if GC taking >10% of CPU time
- Prevention: Set thresholds to prevent OOM errors
Request Queue Thresholds
- Warning: 5-10 queued requests
- Critical: 25+ queued requests
- Zero Queue Goal: Ideally, queue should be empty most of the time
- Capacity Planning: Persistent queue indicates need for scaling
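A sketch of how the warning/critical tiers and the sustained-breach rule from this section can be combined in a scheduled check; the function name, application-scope state storage, and the default 120-second window are illustrative:

```cfml
<cfscript>
// Two-tier threshold check with a "sustained breach" requirement: a status
// only escalates after the metric has stayed above the warning level for a
// minimum duration. State is held in the application scope between probe runs.
function evaluateThreshold(
    required string  metricName,
    required numeric value,
    required numeric warningLevel,
    required numeric criticalLevel,
    numeric sustainedSeconds = 120
) {
    if ( !structKeyExists( application, "probeState" ) ) {
        application.probeState = {};
    }
    var state = structKeyExists( application.probeState, metricName )
        ? application.probeState[ metricName ]
        : { breachedSince = "" };

    if ( value < warningLevel ) {
        // Back under the threshold: clear the breach timer
        application.probeState[ metricName ] = { breachedSince = "" };
        return "OK";
    }

    if ( !isDate( state.breachedSince ) ) {
        state.breachedSince = now();
    }
    application.probeState[ metricName ] = state;

    if ( dateDiff( "s", state.breachedSince, now() ) < sustainedSeconds ) {
        return "PENDING";   // breached, but not yet sustained long enough to alert
    }
    return value >= criticalLevel ? "CRITICAL" : "WARNING";
}

// Example: CPU at 82% with a 75% warning and 90% critical threshold
// status = evaluateThreshold( "cpu", 82, 75, 90 );
</cfscript>
```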
Alerting Configuration
Email Notifications
- Configuration: Recipient addresses for alerts
- Content: Metric value, threshold, timestamp, server identifier
- Frequency: Rate limiting to prevent email flooding
- Escalation: Different recipients for warning vs. critical
- Recovery: Send notification when issue resolved
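A sketch of an alert mailer with simple rate limiting (at most one message per metric every 15 minutes); it assumes mail settings are already configured in the ColdFusion Administrator and uses placeholder addresses:

```cfml
<cfscript>
// Send an alert email, suppressing repeats for the same metric within 15 minutes.
function sendAlertEmail( required string metricName, required string severity, required string detail ) {
    if ( !structKeyExists( application, "lastAlertSent" ) ) {
        application.lastAlertSent = {};
    }
    var last = structKeyExists( application.lastAlertSent, metricName )
        ? application.lastAlertSent[ metricName ]
        : "";

    if ( isDate( last ) && dateDiff( "n", last, now() ) < 15 ) {
        return false;   // suppressed by the rate limit
    }

    cfmail(
        to      = "oncall@example.com",
        from    = "probes@example.com",
        subject = "[#ucase( severity )#] #metricName# threshold breached on #cgi.server_name#"
    ) {
        writeOutput( "Time: #dateTimeFormat( now(), 'yyyy-mm-dd HH:nn:ss' )#" & chr( 10 ) );
        writeOutput( "Metric: #metricName#" & chr( 10 ) );
        writeOutput( "Detail: #detail#" & chr( 10 ) );
    }

    application.lastAlertSent[ metricName ] = now();
    return true;
}
</cfscript>
```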
SNMP Traps
- Purpose: Integration with enterprise monitoring systems
- Configuration: SNMP manager address and community string
- Trap OIDs: Unique identifiers for each metric type
- Severity Mapping: Map thresholds to SNMP severity levels
- Standards: Follow SNMP v2c or v3 protocols
Custom Actions
- Script Execution: Run custom scripts on threshold breach
- HTTP Webhooks: POST to external API or service
- Auto-Remediation: Restart services, clear caches, kill long-running requests
- Logging: Write detailed entries to probe log file
- Integration: Trigger PagerDuty, Slack, or other notification services
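A sketch of the webhook action: POST the alert as JSON with cfhttp. The URL and payload shape are placeholders, since each receiving service (Slack, PagerDuty, and similar) defines its own format.

```cfml
<cfscript>
// Deliver an alert to an external webhook as JSON and log any delivery failure.
payload = {
    "metric"    = "requestQueue",
    "value"     = 27,
    "threshold" = 25,
    "severity"  = "critical",
    "server"    = cgi.server_name,
    "timestamp" = dateTimeFormat( now(), "yyyy-mm-dd HH:nn:ss" )
};

cfhttp( method = "POST", url = "https://alerts.example.com/webhook", result = "httpResult", timeout = 10 ) {
    cfhttpparam( type = "header", name = "Content-Type", value = "application/json" );
    cfhttpparam( type = "body", value = serializeJSON( payload ) );
}

// Log delivery failures so a broken webhook does not fail silently
if ( left( httpResult.statusCode, 3 ) neq "200" ) {
    writeLog( file = "probe-alerts", text = "Webhook delivery failed: #httpResult.statusCode#" );
}
</cfscript>
```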
Alert Tuning
- Reduce Noise: Eliminate false positive alerts
- Rate Limiting: One alert per issue per time period
- Quiet Hours: Suppress non-critical alerts during off-hours
- Dependency Awareness: Don't alert on secondary issues caused by a primary problem
- Escalation Path: Progressive notifications if issue persists
Integration with Monitoring Tools
Load Balancer Health Checks
- Purpose: Automatically remove unhealthy servers from rotation
- Health Check URL: Probe endpoint returning 200 when healthy
- Check Frequency: Every 5-30 seconds
- Failure Threshold: Remove after 2-3 consecutive failures
- Recovery: Re-add to pool after successful checks
- Drain Mode: Stop sending new requests but allow existing to complete
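A sketch of drain-mode support in a health check, using a marker file (the path is a placeholder) so the load balancer stops sending new traffic while in-flight requests complete:

```cfml
<cfscript>
// When the drain marker file exists, report 503 so the load balancer removes
// this node from rotation; existing requests continue to run normally.
drainFile = expandPath( "/config/drain.flag" );

if ( fileExists( drainFile ) ) {
    cfheader( statuscode = 503, statustext = "Service Unavailable" );
    writeOutput( "DRAINING" );
} else {
    cfheader( statuscode = 200, statustext = "OK" );
    writeOutput( "UP" );
}
</cfscript>
```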
APM Integration
- FusionReactor: Native CF APM with extensive probes
- New Relic: Cloud APM with custom metric support
- AppDynamics: Enterprise APM platform
- Datadog: Infrastructure and application monitoring
- Custom Metrics: Push CF probe data to APM systems
Log Aggregation
- Splunk: Index and search CF probe logs
- ELK Stack: Elasticsearch, Logstash, Kibana
- CloudWatch: AWS native log monitoring
- Structured Logging: JSON format for easy parsing
- Correlation: Link probe events with application logs
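A sketch of structured probe logging: one JSON object per line so the aggregator can parse fields without custom rules (the log file name and fields are illustrative):

```cfml
<cfscript>
// Write a probe event as a single JSON line for Splunk, ELK, or CloudWatch ingestion.
entry = {
    "timestamp" = dateTimeFormat( now(), "yyyy-mm-dd HH:nn:ss" ),
    "probe"     = "requestQueue",
    "value"     = 12,
    "threshold" = 10,
    "status"    = "warning",
    "server"    = cgi.server_name
};

writeLog( file = "probe-events", text = serializeJSON( entry ) );
</cfscript>
```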
Dashboards and Visualization
- Grafana: Open-source dashboard platform
- Custom Dashboards: Build application-specific views
- Real-Time: Live updating metrics
- Historical: Trend analysis and capacity planning
- Alerts: Visual indication of threshold breaches
Best Practices
Production Monitoring
- Enable probes on all production servers
- Configure alerts to on-call team members
- Set warning thresholds before critical levels
- Test alert delivery regularly
- Document alert response procedures
- Review probe logs during post-mortems
- Correlate probe alerts with application errors
- Use probes for capacity planning decisions
Threshold Configuration
- Start with conservative thresholds
- Monitor for false positives and adjust
- Account for normal traffic patterns
- Set different thresholds for different times (peak vs. off-peak)
- Review and update thresholds quarterly
- Document threshold decisions and rationale
Alert Response
- Create runbooks for common alert scenarios
- Define escalation procedures
- Track mean time to acknowledge (MTTA)
- Track mean time to resolution (MTTR)
- Conduct post-incident reviews
- Continuously improve response procedures
Security Considerations
- Restrict access to probe endpoints by IP
- Require authentication for detailed metrics
- Don't expose sensitive data in health checks
- Use HTTPS for probe endpoints
- Monitor for probe endpoint abuse or DDoS
- Rate-limit probe endpoint access
Performance Impact
- Keep probe logic lightweight (< 10ms execution time)
- Cache probe results if appropriate (30-60 seconds)
- Avoid heavy operations in health checks
- Don't query large datasets for probes
- Test probe performance under load
- Monitor probe execution time itself
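A sketch of caching a probe result for 45 seconds with ColdFusion's built-in cache functions, so frequent load balancer checks do not repeat the underlying work (the cache key is arbitrary):

```cfml
<cfscript>
// Serve a cached health snapshot if one exists; recompute only after it expires.
cached = cacheGet( "probe_health_snapshot" );

if ( isNull( cached ) ) {
    cached = { healthy = true, checkedAt = now() };   // replace with the real checks
    cachePut( "probe_health_snapshot", cached, createTimeSpan( 0, 0, 0, 45 ) );
}

writeOutput( serializeJSON( cached ) );
</cfscript>
```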
Common Issues and Solutions
False Positive Alerts
- Symptom: Alerts triggered without actual problems
- Causes: Thresholds too aggressive, or normal temporary spikes treated as problems
- Solution: Increase thresholds or require sustained breach
- Duration: Require metric above threshold for 2-5 minutes
- Review: Analyze historical data to set better thresholds
Missed Issues
- Symptom: Problems occur without probe alerts
- Causes: Thresholds too permissive, wrong metrics monitored
- Solution: Lower thresholds or add additional probes
- Gap Analysis: Review incidents and ensure relevant probes exist
- Comprehensive: Monitor all critical resource types
Alert Fatigue
- Symptom: Team ignores alerts due to high volume
- Causes: Too many low-priority alerts, duplicate notifications
- Solution: Tune thresholds, implement rate limiting
- Prioritization: Only alert on actionable issues
- Escalation: Route non-critical alerts differently
Probe Endpoint Unavailable
- Symptom: Health checks fail incorrectly
- Causes: Web server issue, IP restriction, CF restart
- Solution: Ensure probe endpoint is lightweight and reliable
- Redundancy: Multiple health check mechanisms
- Testing: Verify probe accessibility during deployments
Delayed Notifications
- Symptom: Alerts arrive minutes or hours after issue
- Causes: Probe check interval too long, email delays
- Solution: Increase probe frequency, use faster notification methods
- Real-Time: Use push notifications or webhooks instead of email
- Monitoring: Monitor notification delivery time
Advanced Patterns
Composite Health Scores
- Combine multiple metrics into overall health score
- Weight metrics by criticality
- Single number indicating overall system health (0-100)
- Easy consumption by non-technical stakeholders
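A sketch of a weighted composite score; the metric names, weights, and normalization are illustrative:

```cfml
<cfscript>
// Combine several metrics into a single 0-100 health score. Each metric is
// passed in as 0-1, where 1 means "at its critical threshold", and weighted
// by how much it should influence the overall number.
function compositeHealthScore( required struct metrics ) {
    var weights = { cpu = 0.35, heap = 0.35, queue = 0.20, errorRate = 0.10 };
    var score   = 0;

    for ( var key in weights ) {
        var usage = min( max( metrics[ key ], 0 ), 1 );   // clamp to 0-1
        score += weights[ key ] * ( 1 - usage ) * 100;
    }
    return round( score );
}

// Example: CPU at 60% of critical, heap at 40%, queue empty, 5% of error budget used
// writeOutput( compositeHealthScore( { cpu = 0.6, heap = 0.4, queue = 0, errorRate = 0.05 } ) );
</cfscript>
```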
Predictive Alerting
- Analyze metric trends to predict issues
- Alert on trajectory toward threshold (will hit in X minutes)
- Machine learning for anomaly detection
- Proactive intervention before user impact
Auto-Remediation
- Automatically respond to certain alert conditions
- Clear caches, restart services, kill long-running requests
- Implement circuit breakers for external dependencies
- Log all automatic actions for audit trail
- Notify team of automatic remediation actions
Dependency Tracking
- Map dependencies between services
- Suppress secondary alerts when primary service fails
- Visualize cascade effects
- Focus response on root cause
Metrics to Monitor
Server-Level Metrics
- CPU usage (process and system)
- Memory usage (heap and non-heap)
- Disk I/O and space
- Network bandwidth
- Thread count and state
Application-Level Metrics
- Request queue depth
- Active request count
- Average/max request duration
- Request throughput (requests/second)
- Error rate and types
Database Metrics
- Connection pool usage
- Query execution time
- Database server health
- Connection errors
- Query queue depth
Session Metrics
- Active session count
- Session creation rate
- Session timeout rate
- Average session size
- Total session memory
Cache Metrics
- Template cache hit ratio
- Query cache hit ratio
- Object cache size and hits
- Cache eviction rate
- Cache memory usage