From Trace to Insight: A Closed-Loop Observability Practice for Go Projects 
Don't be an engineer who "sees but doesn't understand"
In my last post, "Stop blindly integrating OTel", many readers asked: so what should we do instead? We've integrated traces, metrics, and logs, but when problems arise we still rely on grep and intuition (guesswork).
This article is about moving from "integration" to "insight": closing the observability loop.
It is written for Go teams running multiple services that have already integrated OTel and want to turn these three signals from "data reporting" into "decision driving".
Quick Navigation 
Core Concepts:
- Closed-Loop Overview - Understanding the overall architecture
- Trace Best Practices - Naming conventions, sampling strategies, Baggage
- Metrics System - RED/USE methodology, SLO alerting
- Logs Normalization - Unified fields, automatic correlation, sampling strategies
Implementation Guidance:
- Engineering Examples - Ready-to-use complete code
- Common Pitfalls Checklist - Avoiding 23 common issues
- Alert Configuration - Prometheus alert rules
- Docker Environment - One-click deployment of complete observability stack
Advanced Scenarios:
- Baggage Propagation - Multi-tenant business context
- Message Queue Context - Kafka/RabbitMQ tracing
- Log Sampling - Controlling noise in high QPS paths
Closed-Loop Overview: The Interconnected Path of Three Signals 
A complete observability closed-loop should flow like this:
[Collection] Standardized collection of Trace/Metrics/Logs
   ↓
[Correlation] Connecting the three signals through trace_id/span_id
   ↓
[Alerting] Metrics triggering SLO/SLI alerts
   ↓
[Retrospection] Automatically aggregating relevant Traces + correlated Logs context
   ↓
[Review] Generating improvement items → code/configuration changes
   ↓
[Verification] Rechecking metrics/alerts to confirm resolution → forming an evidence chain
Key Integration Points:
- Metrics alerts → automatic retrieval of traces for the corresponding time period
- Trace details → one-click navigation to correlated logs
- Logs aggregation → reverse location to spans
- Alert cards → automatically attach runbooks, responsible owners, and SLAs
Only when you form this kind of "data → reasoning → action → verification" cycle can you truly enter the observability closed-loop.
Step 1: Making Traces Actually Readable 
Most Go service traces look like a plate of spaghetti: hundreds of spans with no hierarchy or business meaning. This is because tracing is too "technical", focusing on functions rather than business processes.
Bad Example:
span := tracer.Start(ctx, "ProcessRequest")Good Example:
span := tracer.Start(ctx, "OrderService.PlaceOrder")The difference: the first tells you which function was called, the second tells you what the system is doing.
1.1 Resource and Semantic Standards 
All trace data must carry standard resource attributes for cross-service aggregation and filtering:
import (
    "context"
    "fmt"
    
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/sdk/resource"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)
res, err := resource.New(context.Background(),
    resource.WithAttributes(
        semconv.ServiceName("order-svc"),
        semconv.ServiceVersion("1.2.3"),
        semconv.DeploymentEnvironment("prod"),
        attribute.String("team", "payments"),
    ),
)
if err != nil {
    return fmt.Errorf("failed to create resource: %w", err)
}
1.2 Span Naming and Error Recording Standards 
Naming Convention: Use <Domain>.<Service>.<Action> format, keep consistent within the team.
// Recommended format examples
span := tracer.Start(ctx, "Payment.OrderService.CreateOrder")
span := tracer.Start(ctx, "Inventory.StockService.ReserveItem")Error Recording Standards: Use RecordError + SetStatus, distinguishing between business validation failures and system errors.
import "go.opentelemetry.io/otel/codes"
defer func() {
    if err != nil {
        span.RecordError(err)
        if errors.Is(err, ErrBusinessValidation) {
            span.SetStatus(codes.Error, "business_validation_failed")
        } else {
            span.SetStatus(codes.Error, "system_error")
        }
    } else {
        span.SetStatus(codes.Ok, "")
    }
}()
1.3 Propagation and Middleware 
HTTP and gRPC Automatic Injection:
import (
    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
)
// HTTP Server
mux := http.NewServeMux()
handler := otelhttp.NewHandler(mux, "http.server")
// gRPC Server
grpcServer := grpc.NewServer(
    grpc.UnaryInterceptor(otelgrpc.UnaryServerInterceptor()),
)
Cross-Goroutine Propagation: Always explicitly pass the context, otherwise traces will be broken.
// Wrong: context not passed
go func() {
    _, span := tracer.Start(context.Background(), "async.task") // ❌ Broken trace
}()
// Correct: explicitly pass context
go func(ctx context.Context) {
    span := tracer.Start(ctx, "async.task") // ✅ Maintains trace
    defer span.End()
}(ctx)
Common Edge Cases:
// 1. HTTP Client calls (common oversight)
req, err := http.NewRequestWithContext(ctx, "GET", url, nil) // ✅
// Instead of http.NewRequest(), which cannot propagate trace context
// 2. Scheduled task scenarios
ticker := time.NewTicker(5 * time.Second)
for range ticker.C {
    // ❌ Creating new context every time, breaking traces
    // ctx := context.Background()
    
    // ✅ Derive a context with timeout from the root context
    taskCtx, cancel := context.WithTimeout(rootCtx, 30*time.Second)
    go func(ctx context.Context) {
        defer cancel() // cancel only after the task finishes
        processTask(ctx)
    }(taskCtx)
}
// 3. Database queries
// ✅ Use context-aware methods
rows, err := db.QueryContext(ctx, query, args...)
// Instead of db.Query()
1.4 Sampling Strategies: From Static to Dynamic 
Static Sampling (simple scenarios):
import "go.opentelemetry.io/otel/sdk/trace"
tp := trace.NewTracerProvider(
    trace.WithSampler(trace.TraceIDRatioBased(0.1)), // 10% sampling rate
)
Dynamic Sampling (advanced):
- Tail-based sampling: Collect all first, then decide to keep based on latency/errors
- Route-based sampling: Low sampling for health checks, high sampling for core business paths
- Error full sampling: 100% retention for any traces containing Error status
// Pseudocode example: Custom sampler
type SmartSampler struct{}
func (s *SmartSampler) ShouldSample(p trace.SamplingParameters) trace.SamplingResult {
    // Full retention for error spans
    if hasError(p.Attributes) {
        return trace.SamplingResult{Decision: trace.RecordAndSample}
    }
    // Low sampling for health checks
    if isHealthCheck(p.Name) {
        return trace.SamplingResult{Decision: trace.Drop}
    }
    // Proportional sampling for other paths
    return trace.TraceIDRatioBased(0.1).ShouldSample(p)
}
1.5 Baggage: Cross-Service Transmission of Business Context 
In microservice architectures, besides trace_id, we often need to transmit business identifiers (such as tenant_id, user_id). Baggage is OTel's standard solution.
import "go.opentelemetry.io/otel/baggage"
// Upstream service: Inject business context
member, _ := baggage.NewMember("tenant.id", tenantID)
bag, _ := baggage.New(member)
ctx = baggage.ContextWithBaggage(ctx, bag)
// Downstream service: Extract business context
bag := baggage.FromContext(ctx)
tenantID := bag.Member("tenant.id").Value()
// Practical scenario: Multi-tenant SaaS
func (s *OrderService) CreateOrder(ctx context.Context, req *OrderRequest) error {
    // Extract tenant information from baggage
    tenantID := baggage.FromContext(ctx).Member("tenant.id").Value()
    
    // Record to span attributes (for easy querying)
    span := trace.SpanFromContext(ctx)
    span.SetAttributes(attribute.String("tenant.id", tenantID))
    
    // Record to logs (for correlation analysis)
    logger.Ctx(ctx).Info("creating order", zap.String("tenant_id", tenantID))
    
    // Business logic...
    return s.repo.Create(ctx, tenantID, req)
}
Notes:
- Baggage is propagated via HTTP headers (see the propagator registration sketch after these notes); be mindful of size limits (recommended < 1KB)
- Sensitive information (such as user phone numbers) is prohibited from being placed in Baggage; it should be placed in Span attributes
- Baggage keys should follow a team-wide standard (e.g., tenant.id instead of tenantId)
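As referenced in the first note, Baggage (and trace context) only crosses process boundaries if the corresponding propagators are registered. A minimal registration sketch using the standard OTel Go API, typically placed next to your provider initialization:
import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)
// Register W3C Trace Context + Baggage propagation once at startup,
// so otelhttp/otelgrpc inject and extract both headers automatically.
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
    propagation.TraceContext{}, // traceparent / tracestate headers
    propagation.Baggage{},      // baggage header
))
This mirrors the OTEL_PROPAGATORS=tracecontext,baggage setting shown later in the environment variable configuration.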
1.6 Message Queue Context Propagation 
In asynchronous message scenarios (Kafka, RabbitMQ, NATS, etc.), trace context needs to be propagated through message headers.
import (
    "context"

    "github.com/segmentio/kafka-go"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/propagation"
)
// Kafka Producer: Inject trace context
// writer is an initialized *kafka.Writer, injected according to your project's actual setup
func PublishEvent(ctx context.Context, topic string, event []byte) error {
    tracer := otel.Tracer("kafka-producer")
    ctx, span := tracer.Start(ctx, "kafka.publish")
    defer span.End()
    
    // Create Kafka message
    msg := kafka.Message{
        Topic: topic,
        Value: event,
    }
    
    // Inject trace context into message headers
    propagator := otel.GetTextMapPropagator()
    carrier := propagation.MapCarrier{}
    propagator.Inject(ctx, carrier)
    
    // Convert carrier to Kafka Headers
    for k, v := range carrier {
        msg.Headers = append(msg.Headers, kafka.Header{
            Key:   k,
            Value: []byte(v),
        })
    }
    
    // Send message
    return writer.WriteMessages(ctx, msg)
}
// Kafka Consumer: Extract trace context
func ConsumeEvent(msg kafka.Message) error {
    // Extract trace context from message headers
    carrier := propagation.MapCarrier{}
    for _, h := range msg.Headers {
        carrier.Set(h.Key, string(h.Value))
    }
    
    propagator := otel.GetTextMapPropagator()
    ctx := propagator.Extract(context.Background(), carrier)
    
    // Create new span (inherits upstream trace)
    tracer := otel.Tracer("kafka-consumer")
    ctx, span := tracer.Start(ctx, "kafka.consume")
    defer span.End()
    
    span.SetAttributes(
        attribute.String("messaging.system", "kafka"),
        attribute.String("messaging.destination", msg.Topic),
        attribute.Int("messaging.partition", msg.Partition),
    )
    
    // Process business logic (pass ctx)
    return handleEvent(ctx, msg.Value)
}
RabbitMQ Scenario:
import (
    "context"

    "github.com/streadway/amqp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)
// RabbitMQ Producer
// channel is an initialized *amqp.Channel, injected according to your project's actual setup
func PublishToQueue(ctx context.Context, exchange, routingKey string, body []byte) error {
    propagator := otel.GetTextMapPropagator()
    carrier := propagation.MapCarrier{}
    propagator.Inject(ctx, carrier)
    
    // Convert to AMQP Headers
    headers := amqp.Table{}
    for k, v := range carrier {
        headers[k] = v
    }
    
    return channel.Publish(exchange, routingKey, false, false, amqp.Publishing{
        Headers: headers,
        Body:    body,
    })
}
// RabbitMQ Consumer
func HandleDelivery(d amqp.Delivery) error {
    carrier := propagation.MapCarrier{}
    for k, v := range d.Headers {
        if str, ok := v.(string); ok {
            carrier.Set(k, str)
        }
    }
    
    propagator := otel.GetTextMapPropagator()
    ctx := propagator.Extract(context.Background(), carrier)
    
    // Process message
    return processMessage(ctx, d.Body)
}
Common pitfalls to avoid:
- Make span names close to business semantics, not just function names
- Control trace depth within 8 layers; anything more becomes noise
- Add business attributes to critical paths, such as span.SetAttributes(attribute.String("order.id", id))
- Use otelhttp/otelgrpc for automatic injection and reduce manual instrumentation
- Explicitly pass context across goroutines to avoid broken traces
- Full sampling for error paths, low sampling for health checks
Step 2: Metrics is Not Just a Collector, But a Signaling System 
Many teams have an overwhelming amount of metrics but lack decision value. A truly mature metrics system should have three layers of meaning:
- Infrastructure Layer: CPU, memory, goroutine, GC statistics. 
- Application Layer: Request rates, error rates, latency distribution. 
- Business Layer: Order creation rate, active device count, task latency. 
The most critical is the third layer. Business metrics are the bridge for teams to understand their systems. Relying solely on technical metrics, you'll never know whether "slow user checkout" is due to Redis issues or inefficient code logic.
2.1 Metric Type Selection: RED/USE Methodology 
RED Method (for request-oriented services; see the instrument sketch after these lists):
- Rate: Request rate → Counter
- Errors: Error rate → Counter
- Duration: Latency distribution → Histogram
USE Method (for resource-oriented services):
- Utilization: Resource utilization → Gauge
- Saturation: Resource saturation → Gauge
- Errors: Error count → Counter
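As referenced above, a minimal sketch of the RED trio as OTel instruments; the meter and instrument names are illustrative, and creation errors are dropped here for brevity:
import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/metric"
)
var (
    meter = otel.Meter("order-svc")

    // Rate and Errors: counters incremented per request / per failed request
    requestsTotal, _ = meter.Int64Counter("http.server.requests",
        metric.WithDescription("Total HTTP requests"))
    errorsTotal, _ = meter.Int64Counter("http.server.errors",
        metric.WithDescription("Total failed HTTP requests"))

    // Duration: latency distribution as a histogram, in seconds
    requestDuration, _ = meter.Float64Histogram("http.server.duration",
        metric.WithUnit("s"),
        metric.WithDescription("HTTP request duration"))
)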
2.2 Latency Histograms + Exemplars (Correlated with Traces) 
Instruments (Counter/Histogram/Gauge) should be created once during initialization, not recreated inside request handlers.
import (
    "fmt"

    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/metric"
)
lat, err := meter.Float64Histogram(
    "http.server.duration",
    metric.WithUnit("s"),
    metric.WithDescription("HTTP request duration"),
)
if err != nil {
    return fmt.Errorf("failed to create histogram: %w", err)
}
start := time.Now()
// ... process request ...
lat.Record(ctx, time.Since(start).Seconds(),
    metric.WithAttributes(
        attribute.String("route", "/orders"),
        attribute.String("method", "POST"),
        attribute.Int("status", 200),
    ),
)
Enabling Exemplars: In Prometheus/Grafana, clicking an exemplar on a histogram bucket jumps directly to the corresponding trace, giving "metrics → trace" navigation in one click. Note that Prometheus only stores exemplars when started with --enable-feature=exemplar-storage, and the Grafana Prometheus data source needs exemplar support pointed at your trace data source.
2.3 Label Governance: Controlling Cardinality 
Bad Example (high cardinality labels):
// ❌ user_id/request_id will cause metrics explosion
metric.WithAttributes(
    attribute.String("user_id", uid),        // Million-level cardinality
    attribute.String("request_id", reqID),   // Infinite cardinality
)
Correct Approach:
// ✅ Only retain low cardinality dimensions
metric.WithAttributes(
    attribute.String("route", "/orders"),    // Limited routes
    attribute.String("method", "POST"),      // Limited methods
    attribute.Int("status_class", 2),        // 2xx/4xx/5xx
)
// user_id/request_id should be recorded in Trace attributes or Logs
2.4 SLI/SLO and Burn Rate Alerts 
Define SLO: 99.9% of requests have latency < 500ms (30-day window)
Burn Rate Alerts (multi-window):
- Fast burn (1-hour window): burn rate > 14.4 (consuming the error budget 14.4x faster than sustainable) → page immediately
- Slow burn (6-hour window): burn rate > 6 → secondary alert
The magic numbers come from the error-budget math for a 30-day window: a burn rate of 14.4 means 2% of the monthly budget is consumed within 1 hour (0.02 × 30 × 24 = 14.4), and a burn rate of 6 means 5% is consumed within 6 hours (0.05 × 720 / 6 = 6). This lets you detect anomalies before the SLO budget is exhausted while avoiding alert storms.
Prometheus Alert Rules Example:
# prometheus/alerts/slo.yml
groups:
  - name: slo_burn_rate_alerts
    interval: 30s
    rules:
      # Quick burn rate alert (1-hour window)
      - alert: HighErrorBurnRate_1h
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > 0.0144
        for: 2m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "SLO Quick Burn (1-hour window)"
          description: "Error rate {{ $value | humanizePercentage }}, exceeding 14.4x budget consumption"
          runbook_url: "https://wiki.company.com/runbook/high-error-rate"
          
      # Slow burn rate alert (6-hour window)
      - alert: MediumErrorBurnRate_6h
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > 0.006
        for: 15m
        labels:
          severity: warning
          slo: availability
        annotations:
          summary: "SLO Continuous Burn (6-hour window)"
          description: "Error rate {{ $value | humanizePercentage }}, exceeding 6x budget consumption"
          
      # P99 latency SLO alert
      - alert: HighLatencyBurnRate
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_server_duration_bucket[5m])) by (le, route)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
          slo: latency
        annotations:
          summary: "P99 Latency Exceeds SLO (500ms)"
          description: "P99 latency for route {{ $labels.route }}: {{ $value }}s"Note the metric name mapping: http.server.duration in Prometheus becomes http_server_duration_seconds_bucket (dots become underscores, unit suffix added).
Key practical points:
- Define triggering conditions for each metric, such as error rate > 1% within 5 minutes
- Alerts shouldn't just page someone; push the related traces along with them so the root cause can be located
- Don't misuse Counter/Histogram/Gauge types
- Enable Exemplars for latency metrics for one-click trace navigation
- Limit label cardinality: user_id/request_id should go to logs, not metrics
- Use burn rate alerts (fast/slow double window) for SLOs, don't wait for complete failure
Step 3: Logs Should Be Normalized, Not Just Accumulated 
Go developers often make the mistake of "logging everything": mixing fmt.Println, log.Printf, and zap.Sugar(). Once OTel + Loki is integrated, the result is an explosion of log noise.
What you need is not more logs, but integrated context:
- Logs should carry trace_id. 
- Traces should allow backtracking to logs when expanded. 
- When alerts occur, automatically aggregate related log contexts. 
3.1 Unified Log Field Standards 
All logs must contain the following standard fields (a logger-construction sketch follows the table):
| Field | Type | Required | Description | 
|---|---|---|---|
| trace_id | string | ✅ | Correlated Trace | 
| span_id | string | ✅ | Correlated Span | 
| service.name | string | ✅ | Service name | 
| env | string | ✅ | Environment (prod/staging) | 
| version | string | ✅ | Service version | 
| level | string | ✅ | Log level | 
| message | string | ✅ | Log content | 
| tenant_id | string | ❌ | Tenant identifier (multi-tenant scenarios) | 
| user_id | string | ❌ | User identifier (needs masking/hashing) | 
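As mentioned above the table, a sketch of attaching the static standard fields once at logger construction; the function name and how you source the service/env/version values are assumptions, while trace_id/span_id are added per call by otelzap (see 3.2):
import (
    "github.com/uptrace/opentelemetry-go-extra/otelzap"
    "go.uber.org/zap"
)
// buildLogger attaches the static standard fields once at construction time;
// trace_id and span_id are injected per call by otelzap from the context.
func buildLogger(serviceName, env, version string) (*otelzap.Logger, error) {
    base, err := zap.NewProduction()
    if err != nil {
        return nil, err
    }
    base = base.With(
        zap.String("service.name", serviceName),
        zap.String("env", env),
        zap.String("version", version),
    )
    return otelzap.New(base), nil
}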
3.2 Context-Based Logger 
Not Recommended (manually concatenating trace fields):
logger := zap.L().With(
    zap.String("trace_id", trace.SpanContextFromContext(ctx).TraceID().String()),
    zap.String("span_id", trace.SpanContextFromContext(ctx).SpanID().String()),
)
logger.Info("user payment timeout", zap.String("order_id", oid))Recommended (using otelzap or similar libraries):
import "github.com/uptrace/opentelemetry-go-extra/otelzap"
// Initialization (once)
logger := otelzap.New(zap.L())
// Usage (automatically injects trace_id/span_id)
logger.Ctx(ctx).Info("user payment timeout",
    zap.String("order_id", oid),
    zap.String("amount", "99.99"),
)
3.3 Privacy and Compliance: PII Masking 
Sensitive fields must be masked or hashed:
// ❌ Plaintext recording of sensitive information
logger.Info("user login", zap.String("phone", "13800138000"))
// ✅ Hashing or partial masking
logger.Info("user login", zap.String("phone_hash", hashPhone("13800138000")))
logger.Info("user login", zap.String("phone_masked", "138****8000"))Whitelist Mechanism: Only record predefined business fields, prohibit directly printing complete request/response bodies.
3.4 Automatic Correlation Views Between Logs and Traces 
Configure in Grafana/Tempo:
- From Trace details page → automatically query logs with corresponding trace_id(Loki)
- From Loki log entries → one-click navigation to corresponding Trace (Tempo)
Grafana Configuration Example (datasource correlation):
# grafana datasources
- name: Tempo
  type: tempo
  uid: tempo
  jsonData:
    tracesToLogs:
      datasourceUid: 'loki'
      tags: ['trace_id']
3.5 Log Sampling: Controlling Hot-Path Noise 
Production environment log volume can be very large, especially for high QPS hot paths. Reasonable log sampling strategies are crucial.
import (
    "time"

    "github.com/uptrace/opentelemetry-go-extra/otelzap"
    "go.uber.org/zap"
    "go.uber.org/zap/zapcore"
)
// Zap dynamic sampling configuration
core := zapcore.NewSamplerWithOptions(
    zapcore.NewCore(encoder, writer, zapcore.InfoLevel), // encoder/writer come from your zap setup
    time.Second,    // sampling tick (window)
    100,            // log the first 100 entries per message within each tick
    10,             // after that, log every 10th matching entry within the tick
)
base := zap.New(core)
logger := otelzap.New(base)
// Practical scenario: Hot path noise reduction
func (h *HealthHandler) Check(ctx context.Context) error {
    // ✅ Health checks only log errors
    if err := h.checkDatabase(ctx); err != nil {
        logger.Ctx(ctx).Error("health check failed", zap.Error(err))
        return err
    }
    // Successful health checks don't log (to avoid noise)
    return nil
}
// Core business path: Full logging
func (s *OrderService) CreateOrder(ctx context.Context, req *OrderRequest) error {
    logger.Ctx(ctx).Info("order creation started", zap.String("order_id", req.ID))
    // ... business logic
    logger.Ctx(ctx).Info("order creation completed")
    return nil
}
Sampling Strategy Recommendations:
- Health checks: Only log failures, silence success
- High-frequency queries: Sample at 1:100 or 1:1000
- Write operations: Full recording (create, update, delete)
- Error paths: 100% full recording
Key implementation points:
- Use a unified logging library; zap + otelzap is recommended
- Logs should automatically carry trace_id/span_id, no manual concatenation
- Define field whitelists, don't log entire request bodies
- Sensitive fields (phone numbers/card numbers/passwords) must be masked
- Configure bidirectional navigation between Trace ↔ Logs in Grafana
- Sample hot paths, only log errors for health checks
Step 4: Observability ≠ Just Three Signals 
Many people think "trace + metrics + logs = observability". Wrong. These three are just "input signals"; true observability comes from "feedback".
That means:
- Alerts should trigger reverse data retrieval (e.g., trace retrospection). 
- Trace analysis should guide metrics optimization. 
- Metrics anomalies should drive log aggregation. 
Only when this kind of "data → reasoning → adjustment" cycle is formed can a team truly develop an "observability culture".
4.1 Automated Feedback Action List 
When an alert is triggered, the system should automatically perform the following actions (a webhook sketch follows this list):
- Automatically aggregate relevant data
  - Retrieve traces from 10 minutes before and after the alert time (including error spans)
  - Aggregate log context from related services (correlated via trace_id)
  - Pull metrics trends for upstream/downstream dependencies (Redis/DB/external APIs)
- Enrich the alert card
  - Runbook: link to the predefined troubleshooting manual
  - Responsible person: automatically @ the on-call engineer
  - SLA: response time limits (P0: 15 min / P1: 1 h)
  - One-click incident creation: automatically open an incident and attach the related Traces/Logs
- Close the review loop
  - Incident occurs → root cause analysis (Trace + Logs) → improvement items (action items) → code/configuration changes (PR/config) → regression verification (check that metrics/alerts have recovered) → evidence chain (post-mortem document)
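The webhook sketch referenced above: a rough outline of a glue service that receives Alertmanager webhooks and assembles the context bundle. The payload fields follow Alertmanager's webhook format; the service label name and the queryTraces/queryLogs/postIncidentCard helpers are hypothetical stubs, since the real calls depend on your Tempo/Loki versions and incident tooling:
package feedback

import (
    "encoding/json"
    "net/http"
    "time"
)

// Minimal subset of the Alertmanager webhook payload.
type webhookPayload struct {
    Status string `json:"status"`
    Alerts []struct {
        Labels      map[string]string `json:"labels"`
        Annotations map[string]string `json:"annotations"`
        StartsAt    time.Time         `json:"startsAt"`
    } `json:"alerts"`
}

// Hypothetical hooks into your trace backend, log backend, and incident tool.
func queryTraces(service string, from, to time.Time) []string { return nil } // e.g. Tempo search API
func queryLogs(service string, from, to time.Time) []string   { return nil } // e.g. Loki query_range API
func postIncidentCard(alert, runbook string, traces, logs []string) {}

// HandleAlert aggregates context for each firing alert and posts an enriched alert card.
func HandleAlert(w http.ResponseWriter, r *http.Request) {
    var payload webhookPayload
    if err := json.NewDecoder(r.Body).Decode(&payload); err != nil {
        http.Error(w, "bad payload", http.StatusBadRequest)
        return
    }
    for _, a := range payload.Alerts {
        service := a.Labels["service"]
        // Look at a +/- 10 minute window around the alert start time.
        from := a.StartsAt.Add(-10 * time.Minute)
        to := a.StartsAt.Add(10 * time.Minute)

        traces := queryTraces(service, from, to)
        logs := queryLogs(service, from, to)

        postIncidentCard(a.Labels["alertname"], a.Annotations["runbook_url"], traces, logs)
    }
    w.WriteHeader(http.StatusOK)
}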
4.2 Three-Signal Collaboration Scenario Example 
Scenario: Order service P99 latency spikes
- Metrics Alert: http.server.duration P99 increases from 200ms to 2s
- Automatic Trace Retrospection: the OrderService.QueryInventory span shows 1.8s latency
- Correlated Logs: Aggregating logs for this span reveals a large number of inventory service timeouts
- Root Cause Identification: Inventory service database slow query
- Improvement Item: Add index + set query timeout
- Verification: After release, P99 latency returns to 180ms, alert disappears
Here's how to make it work:
- Alert rules should link to runbooks and on-duty personnel
- Alert triggering should automatically pull Traces + Logs
- Write reviews for each failure (screenshots + improvement items)
- Check metrics recovery after changes
- Solidify "alert → analysis → improvement → verification" into SOP
Step 5: Transition from Engineering to Culture 
Ultimately, a project's observability doesn't depend on frameworks, but on culture. You need to ensure every engineer can answer three questions:
- Can my code be observed? 
- When anomalies are observed, can I know the reason? 
- Can I turn this insight into action? 
These three questions are more valuable than any exporter or dashboard.
5.1 Team Capability Self-Assessment Checklist 
Level 1: Integration Phase
- All services have integrated Trace/Metrics/Logs
- Unified Exporter configuration
- Basic Dashboard visualization
Level 2: Correlation Phase
- Trace/Logs automatically correlated through trace_id
- Metrics alerts can jump to corresponding Traces
- Team can quickly locate "which service has issues"
Level 3: Closed-Loop Phase
- Alerts automatically aggregate context (Trace+Logs+dependency Metrics)
- Each incident has review + improvement items + regression verification
- Team evolves from "passive response" to "proactive prediction"
5.2 Three Pillars of Observability Culture 
- Transparency: Health status and SLO achievement rates of all services are publicly visible
- Collaboration: Alerts trigger cross-team collaboration rather than "blame games"
- Iteration: Weekly reviews of observability improvement items (e.g., reducing false positives, optimizing sampling)
Step 6: Engineering Implementation: Minimal Running Examples 
The following is a minimal configuration skeleton for integrating OpenTelemetry into Go projects, ready for direct copying and use.
6.1 Initializing TracerProvider + MeterProvider 
package observability
import (
    "context"
    "fmt"
    "time"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    "go.opentelemetry.io/otel/sdk/metric"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)
// InitObservability initializes Tracer and Meter (production-ready complete example)
func InitObservability(ctx context.Context, serviceName, version, env string) (func(), error) {
    // 1. Create resource (note error handling)
    res, err := resource.New(ctx,
        resource.WithAttributes(
            semconv.ServiceName(serviceName),
            semconv.ServiceVersion(version),
            semconv.DeploymentEnvironment(env),
        ),
    )
    if err != nil {
        return nil, fmt.Errorf("failed to create resource: %w", err)
    }
    // 2. Initialize TracerProvider
    traceExporter, err := otlptracehttp.New(ctx,
        otlptracehttp.WithEndpoint("localhost:4318"),
        otlptracehttp.WithInsecure(),
    )
    if err != nil {
        return nil, fmt.Errorf("failed to create trace exporter: %w", err)
    }
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(traceExporter),
        sdktrace.WithResource(res),
        sdktrace.WithSampler(sdktrace.TraceIDRatioBased(0.1)), // 10% sampling
    )
    otel.SetTracerProvider(tp)
    // 3. Initialize MeterProvider
    metricExporter, err := otlpmetrichttp.New(ctx,
        otlpmetrichttp.WithEndpoint("localhost:4318"),
        otlpmetrichttp.WithInsecure(),
    )
    if err != nil {
        return nil, fmt.Errorf("failed to create metric exporter: %w", err)
    }
    mp := metric.NewMeterProvider(
        metric.WithReader(
            metric.NewPeriodicReader(metricExporter,
                metric.WithInterval(10*time.Second), // Export every 10 seconds
            ),
        ),
        metric.WithResource(res),
    )
    otel.SetMeterProvider(mp)
    // 4. Return cleanup function (must be called, otherwise data loss)
    return func() {
        shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()
        
        // Shut down TracerProvider first
        if err := tp.Shutdown(shutdownCtx); err != nil {
            fmt.Printf("Error shutting down tracer provider: %v\n", err)
        }
        
        // Then shut down MeterProvider
        if err := mp.Shutdown(shutdownCtx); err != nil {
            fmt.Printf("Error shutting down meter provider: %v\n", err)
        }
    }, nil
}
6.2 HTTP/gRPC Middleware Integration 
package main
import (
    "context"
    "log"
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "google.golang.org/grpc"
    // plus your own observability package from section 6.1, e.g. "your-module/observability"
)
func main() {
    ctx := context.Background()
    
    // Initialize observability (proper error handling)
    cleanup, err := observability.InitObservability(ctx, "order-service", "1.0.0", "prod")
    if err != nil {
        log.Fatalf("Failed to initialize observability: %v", err)
    }
    defer cleanup() // Ensure resources are cleaned up when program exits
    // HTTP Server
    mux := http.NewServeMux()
    mux.HandleFunc("/orders", handleOrders)
    mux.HandleFunc("/health", handleHealth)
    
    // Wrap with OTel middleware
    handler := otelhttp.NewHandler(mux, "http.server")
    
    log.Println("Starting HTTP server on :8080")
    if err := http.ListenAndServe(":8080", handler); err != nil {
        log.Fatalf("HTTP server failed: %v", err)
    }
    // gRPC Server (if needed). Note: http.ListenAndServe above blocks, so run the HTTP
    // server in a goroutine (or errgroup) if this process also serves gRPC.
    grpcServer := grpc.NewServer(
        grpc.UnaryInterceptor(otelgrpc.UnaryServerInterceptor()),
        grpc.StreamInterceptor(otelgrpc.StreamServerInterceptor()),
    )
    // ... register services ...
}
6.3 Recommended Local Observability Stack 
Option 1: Grafana Stack (production-ready docker-compose)
# docker-compose.yml
version: '3.8'
services:
  # Tempo - Distributed tracing
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./config/tempo.yaml:/etc/tempo.yaml
      - tempo-data:/tmp/tempo
    ports:
      - "4318:4318"   # OTLP HTTP
      - "4317:4317"   # OTLP gRPC
      - "3200:3200"   # Tempo UI
    networks:
      - observability
  # Prometheus - Metrics storage
  prometheus:
    image: prom/prometheus:latest
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./config/alerts:/etc/prometheus/alerts
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
    networks:
      - observability
  # Loki - Log aggregation
  loki:
    image: grafana/loki:latest
    command: ["-config.file=/etc/loki/config.yaml"]
    volumes:
      - ./config/loki.yaml:/etc/loki/config.yaml
      - loki-data:/loki
    ports:
      - "3100:3100"
    networks:
      - observability
  # Grafana - Visualization
  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - ./config/grafana/datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
      - grafana-data:/var/lib/grafana
    ports:
      - "3000:3000"
    networks:
      - observability
    depends_on:
      - tempo
      - prometheus
      - loki
volumes:
  tempo-data:
  prometheus-data:
  loki-data:
  grafana-data:
networks:
  observability:
    driver: bridge
Minimum Configuration Files:
# config/tempo.yaml
server:
  http_listen_port: 3200
distributor:
  receivers:
    otlp:
      protocols:
        http:
        grpc:
storage:
  trace:
    backend: local
    local:
      path: /tmp/tempo/blocks
# config/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
# Load alert rules
rule_files:
  - "/etc/prometheus/alerts/*.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
# config/loki.yaml
auth_enabled: false
server:
  http_listen_port: 3100
ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
  filesystem:
    directory: /loki/chunks
# config/grafana/datasources.yml
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    uid: tempo
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        tags: ['trace_id']
        
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    uid: prometheus
    isDefault: true
    
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    uid: loki
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          # If logs are JSON, match fields like "trace_id":"<hex>"
          matcherRegex: '"trace_id":"([a-f0-9]+)"'
          name: TraceID
          url: '$${__value.raw}'
          # You can also store trace_id as a label and use labels in the tracesToLogs configuration
Start Commands:
# Create configuration file directories
mkdir -p config/grafana config/alerts
# Start all services
docker-compose up -d
# View logs
docker-compose logs -f
# Access Grafana: http://localhost:3000
# Access Prometheus: http://localhost:9090
# Access Tempo: http://localhost:3200
Option 2: Using commercial SaaS (Datadog/New Relic/Honeycomb)
Suitable for teams with more than ~50 engineers, or teams that don't want to maintain infrastructure: pay-as-you-go and ready to use out of the box.
6.4 Environment Variable Configuration 
Configure environment variables according to OTel standard specifications:
# .env - OpenTelemetry standard environment variables
# Service identification (set uniformly via OTEL_RESOURCE_ATTRIBUTES)
OTEL_RESOURCE_ATTRIBUTES=service.name=order-service,service.version=1.2.3,deployment.environment=prod,team=payments
# OTLP Exporter configuration (must be a complete URL)
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
# Or set Trace and Metric endpoints separately
# OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://tempo:4318
# OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://prometheus:4318
# Sampling configuration
OTEL_TRACES_SAMPLER=traceidratio        # Sampler type
OTEL_TRACES_SAMPLER_ARG=0.1             # 10% sampling rate
# Propagator configuration (W3C Trace Context by default)
OTEL_PROPAGATORS=tracecontext,baggage   # Support trace and baggage propagation
# SDK disable (for debugging)
# OTEL_SDK_DISABLED=false
# Log level
OTEL_LOG_LEVEL=info
Using environment variables in code:
import (
    "os"
    "strings"
    
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
)
func InitFromEnv(ctx context.Context) (func(), error) {
    // 1. Read the endpoint from the environment. Per the OTel spec the value is a full URL
    //    (e.g. http://otel-collector:4318), but WithEndpoint expects host:port,
    //    so strip the scheme here (or use WithEndpointURL if your SDK version provides it).
    endpoint := os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT")
    if endpoint == "" {
        endpoint = "localhost:4318" // Default value
    }
    endpoint = strings.TrimPrefix(strings.TrimPrefix(endpoint, "http://"), "https://")
    
    // 2. Resource attributes automatically read from OTEL_RESOURCE_ATTRIBUTES
    res, err := resource.New(ctx,
        resource.WithFromEnv(),   // Automatically read environment variables
        resource.WithTelemetrySDK(), // Add SDK information
        resource.WithHost(),         // Add host information
    )
    if err != nil {
        return nil, fmt.Errorf("failed to create resource: %w", err)
    }
    
    // 3. Exporter configuration
    exporter, err := otlptracehttp.New(ctx,
        otlptracehttp.WithEndpoint(endpoint), // Like "otel-collector:4318"
        otlptracehttp.WithInsecure(),
    )
    
    // ... other initialization logic
}
Docker Compose Environment Variable Example:
services:
  order-service:
    image: order-service:latest
    environment:
      - OTEL_RESOURCE_ATTRIBUTES=service.name=order-service,service.version=1.2.3,deployment.environment=prod
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
      - OTEL_TRACES_SAMPLER=traceidratio
      - OTEL_TRACES_SAMPLER_ARG=0.1
      - OTEL_PROPAGATORS=tracecontext,baggage
Step 7: Common Pitfalls Checklist 
During actual implementation, here are the most common pitfalls, please check them one by one:
7.1 Trace Related 
- Forgetting to set service.name or service.version (making it impossible to distinguish services)
- Excessive span creation in library/utility functions (causing trace depth explosion)
- Hot path spans too deep (> 10 layers) or too many (> 100 spans)
- Not calling span.End() (ideally via defer span.End()), causing span leaks
- Not calling TracerProvider.Shutdown() on exit (causing data loss)
- Forgetting to pass context across goroutines (breaking traces)
7.2 Metrics Related 
- Using user_id/request_id as metric labels (causing cardinality explosion)
- Improper histogram bucket boundaries (e.g., only [0.1, 1, 10], which cannot distinguish 10ms from 100ms); see the bucket sketch after this list
- Using the wrong instrument type (e.g., modeling "total requests" as a Gauge instead of a Counter)
- Forgetting to declare units for metrics (e.g., http.duration should specify s or ms)
- Not enabling Exemplars (missing "metrics → trace" one-click capability)
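For the bucket-boundary item above, a sketch of overriding histogram buckets with a metrics View; the option and type names follow recent go.opentelemetry.io/otel/sdk/metric releases (older SDKs expose the aggregation under a different package), and reader stands in for whatever exporter/reader you already configure:
import (
    sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)
mp := sdkmetric.NewMeterProvider(
    sdkmetric.WithReader(reader), // reader from your exporter setup
    // Override the default buckets for the latency histogram so the
    // 10ms-500ms range (where the SLO lives) is actually distinguishable.
    sdkmetric.WithView(sdkmetric.NewView(
        sdkmetric.Instrument{Name: "http.server.duration"},
        sdkmetric.Stream{Aggregation: sdkmetric.AggregationExplicitBucketHistogram{
            Boundaries: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
        }},
    )),
)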
7.3 Logs Related 
- Mixing logging libraries (fmt.Println+log+zap)
- Logs not automatically correlated with traces (manually concatenating trace_id is error-prone)
- Frequently constructing new Logger in hot paths (should reuse logger.With())
- Sensitive information not masked (phone numbers/card numbers/passwords recorded in plaintext)
- Debug level logging enabled in production (causing log explosion)
7.4 Integration Related 
- Using synchronous exporters that block requests (use a BatchSpanProcessor instead; see the sketch after this list)
- OTLP Exporter not setting timeout (network jitter causing service freeze)
- Not setting maximum queue length (risk of memory leaks)
- Forgetting to close Exporter during local development (resource leaks)
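A sketch of the batching, queue, and timeout knobs behind the items above; the option names come from the OTel Go SDK and OTLP HTTP exporter, and the values shown are just reasonable starting points:
import (
    "time"

    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)
exporter, err := otlptracehttp.New(ctx,
    otlptracehttp.WithEndpoint("otel-collector:4318"),
    otlptracehttp.WithInsecure(),
    otlptracehttp.WithTimeout(5*time.Second), // don't let a slow collector hang exports
)
if err != nil {
    return err
}
tp := sdktrace.NewTracerProvider(
    // WithBatcher wraps the exporter in a BatchSpanProcessor (async, off the request path).
    sdktrace.WithBatcher(exporter,
        sdktrace.WithMaxQueueSize(2048),        // cap memory if the collector falls behind
        sdktrace.WithMaxExportBatchSize(512),
        sdktrace.WithBatchTimeout(5*time.Second),
    ),
)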
7.5 Performance Related 
- Relying on traces alone for performance analysis (supplement with pprof/eBPF profiling; see the pprof sketch after this list)
- 100% sampling in production (cost explosion + performance impact)
- Not making the sampling rate configurable via environment variables (so development and production share the same configuration)
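For the profiling item above, a minimal sketch of exposing Go's built-in pprof endpoints alongside the service; the port is arbitrary, and in production it should stay bound to localhost or sit behind auth:
import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)
// Serve pprof on a separate, non-public port.
go func() {
    log.Println(http.ListenAndServe("localhost:6060", nil))
}()
// Then, for example: go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30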
7.6 Concurrency Safety and Resource Management 
Core components of the OTel SDK are concurrency-safe:
// The following objects can be shared globally, create them during initialization
var (
    tracer = otel.Tracer("my-service")
    meter  = otel.Meter("my-service")
    requestsCounter metric.Int64Counter
)
func init() {
    var err error
    requestsCounter, err = meter.Int64Counter("http.server.requests")
    if err != nil { panic(err) }
}
// Direct use in high concurrency
type Handler struct{}
func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    ctx, span := tracer.Start(r.Context(), "handle_request")
    defer span.End()
    requestsCounter.Add(ctx, 1, metric.WithAttributes(attribute.String("route", "/orders")))
}
But Spans themselves are not concurrency-safe:
// Wrong: Multiple goroutines operating on the same span
ctx, span := tracer.Start(ctx, "parent")
go func() {
    span.SetAttributes(attribute.String("key", "value")) // Data race
}()
go func() {
    span.RecordError(err) // Data race
}()
// Correct: Each goroutine creates its own span
ctx, parentSpan := tracer.Start(ctx, "parent")
defer parentSpan.End()
go func(ctx context.Context) {
    _, childSpan := tracer.Start(ctx, "child1")
    defer childSpan.End()
    childSpan.SetAttributes(attribute.String("key", "value"))
}(ctx)
go func(ctx context.Context) {
    _, childSpan := tracer.Start(ctx, "child2")
    defer childSpan.End()
    childSpan.RecordError(err)
}(ctx)
Resource Cleanup Best Practices:
func main() {
    ctx := context.Background()
    
    // Initialization
    cleanup, err := observability.InitObservability(ctx, "svc", "1.0", "prod")
    if err != nil {
        log.Fatal(err)
    }
    
    // Method 1: Defer cleanup (recommended for short-lived processes)
    defer cleanup()
    
    // Method 2: Signal capture cleanup (recommended for long-lived services)
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)
    
    go func() {
        <-sigCh
        log.Println("Shutting down gracefully...")
        cleanup() // Ensure data export is complete
        os.Exit(0)
    }()
    
    // Start service...
}
In Conclusion 
The end goal of the observability closed loop is not fancier dashboards but faster team decision-making. When traces, metrics, and logs work together, you move from "emergency response" to "problem prediction".
Remember three keywords:
- Standardization: Unified resource attributes, field naming, error recording
- Correlation: Connecting the three signals through trace_id for one-click navigation
- Closed-loop: Alert → retrospection → improvement → verification, forming an evidence chain
Observability is not a "one-time project", but a continuously evolving engineering culture. When every engineer can confidently answer "Can my code be observed?", you have succeeded.

