# Production Readiness Report — Microservices (server, bridge, decoder, location) **Scope:** `cmd/server`, `cmd/bridge`, `cmd/decoder`, `cmd/location` and all packages they import. **Date:** 2025-03-05. --- ## Overall grade: **5.5 / 10** The codebase has a **solid structure** and **consistent patterns** across the four services, but **security**, **reliability**, and **operational hardening** are not yet at production level. With the changes suggested below, it can be brought to a 7–8/10 for a production deployment. --- ## 1. Summary by dimension | Dimension | Grade | Notes | | --------------------------- | ----- | ------------------------------------------------------------------------------------------------- | | **Structure & readability** | 7/10 | Clear app lifecycle (New/Run/Shutdown), good package layout, some naming/duplication issues. | | **Reliability** | 5/10 | Graceful shutdown and Kafka cleanup are good; missing retries, commit semantics, and DB/timeouts. | | **Security** | 3/10 | No API auth, TLS/DB and client TLS weakened, CORS permissive, no rate limiting. | | **Observability** | 6/10 | slog, request ID, logging middleware, health/ready; no metrics/tracing. | | **Correctness** | 6/10 | Path vs body ID bugs in update endpoints, `context.Background()` in hot paths. | --- ## 2. What’s in good shape - **Unified app pattern:** All four services use the same lifecycle: `New(cfg)` → optional `Init(ctx)` (server only) → `Run(ctx)` → `Shutdown()`, with `signal.NotifyContext` for graceful shutdown. - **Graceful shutdown:** HTTP server shutdown, Kafka readers/writers and MQTT disconnect are explicitly closed; `sync.WaitGroup` used for consumer goroutines. - **Structured logging:** `slog` with JSON handler and file + stderr; request logging with method, path, status, duration, bytes. - **HTTP middleware:** Recovery (panic → 500), request ID, CORS, logging applied in a clear chain. - **Health endpoints:** `/health` (liveness) and `/ready` (DB ping) for the server. - **Kafka usage:** Centralized `KafkaManager` with RWMutex, separate readers/writers, group IDs per service. - **Shared state:** `AppState` in `common/appcontext` is thread-safe (RWMutex) and used consistently. - **Config:** Env-based config with service-specific loaders (`LoadServer`, `LoadBridge`, etc.) and `getEnvPanic` for required vars. - **API responses:** Centralized `response.JSON`, `Error`, `BadRequest`, `InternalError`, `NotFound` with consistent JSON shape. - **OpenAPI:** Routes reference `api/openapi.yaml` (OpenAPI 3.0), which helps readability and contract clarity. --- ## 3. Critical issues ### 3.1 Security - **No authentication or authorization on the HTTP API** All server routes (`/reslevis/*`, `/configs/beacons`, etc.) are unauthenticated. Anyone who can reach the server can read/update/delete gateways, zones, trackers, parser configs, settings, alerts, and tracks. - **Database connection uses `sslmode=disable`** In `internal/pkg/database/database.go`, DSN is built with `sslmode=disable`. In production, DB connections should use TLS and `sslmode=verify-full` (or equivalent) with CA verification. - **TLS verification disabled for outbound HTTP** - `internal/pkg/apiclient/updatedb.go`: `TLSClientConfig: &tls.Config{InsecureSkipVerify: true}`. - `internal/pkg/location/inference.go`: same, and **`NewDefaultInferencer(skipTLSVerify bool)` ignores the parameter** and always uses `InsecureSkipVerify: true`. - **CORS defaults to `*`** In `internal/pkg/api/middleware/cors.go`, when `origins` is nil/empty, `origins = []string{"*"}`. Production should restrict origins to known front-end origins. - **Logger file mode `0666`** In `internal/pkg/logger/logger.go`, `os.OpenFile(..., 0666)` makes the log file world-readable and -writable. Prefer `0600` or `0640`. - **No rate limiting or request body size limits** No protection against abuse or large-body DoS; consider middleware for max body size and rate limiting (per IP or per key). **Recommendations:** - Add authentication/authorization middleware (e.g. JWT or API key validation) for all non-health API routes; keep `/health` and optionally `/ready` public. - Make DB TLS configurable via env (e.g. `DB_SSLMODE`, `DB_SSLROOTCERT`) and use `sslmode=verify-full` in production. - Use `cfg.TLSInsecureSkipVerify` (or equivalent) for all outbound HTTP clients; fix `NewDefaultInferencer` to respect the parameter. - Configure CORS with explicit allowed origins (and optionally credentials) from config. - Set log file mode to `0600` (or `0640` if a group needs read). - Add middleware to limit request body size (e.g. `http.MaxBytesReader`) and consider rate limiting for API routes. --- ### 3.2 Reliability - **Kafka consumer: decode errors and commit semantics** In `internal/pkg/kafkaclient/consumer.go`, when `json.Unmarshal` fails, the code logs and `continue`s without committing. Depending on reader config, this can cause repeated redelivery of bad messages or ambiguous semantics. Production should either skip and commit, or send to a dead-letter path and commit. Answer: because readers are using consumer groups messages are auto commited, meaning bad unmarshal still commits as the message was technically read - **No retries on Kafka produce** Event loops (server, bridge, decoder, location) call `WriteMessages` once; transient Kafka errors are not retried. Consider retry with backoff (and optional circuit breaker) for critical topics. Answer: the Writer object is already holding the default configuration for timeout, backoff and retries, but I still added some extra configurations - **Database: no explicit pool or timeouts** `database.Connect` uses GORM defaults. For production, set `MaxOpenConns`, `MaxIdleConns`, and connection/timeout settings (e.g. `SetConnMaxLifetime`) on the underlying `*sql.DB`. - **UpdateDB and Init: failures only logged** In `internal/app/server/app.go`, `apiclient.UpdateDB` errors are only logged; Init continues. Consider failing Init (or marking “degraded”) if sync is required for correct operation, or add retries/backoff. - **Use of `context.Background()` in async paths** e.g. `internal/pkg/bridge/handler.go` and `internal/pkg/decoder/process.go` use `context.Background()` for Kafka writes. Prefer passing the request/event context (or a derived timeout) so shutdown and timeouts propagate. **Recommendations:** - Define a clear policy for Kafka consumer errors (skip+commit vs DLQ); avoid silent continue without commit unless intended. - Add retry (with backoff) for Kafka produce in critical paths; consider a small wrapper or helper. - Configure DB pool and timeouts in `database.Connect` (and optionally make them configurable via config). - Decide whether UpdateDB is mandatory for startup; if yes, fail Init on error or retry; if no, document and consider a “degraded” readiness state. - Pass context from the caller (or a timeout context) into Kafka write calls instead of `context.Background()`. --- ### 3.3 Correctness and consistency - **Update endpoints: path `id` vs body `id`** - **GatewayUpdateController** (`internal/pkg/controller/gateways_controller.go`): Uses `mux.Vars(r)["id"]` only to check existence with `First(..., "id = ?", id)`, then decodes body into `gateway` and calls `Save(&gateway)`. The updated record is identified by `gateway.ID` from the body, not the path. A client can send a different ID in the body and update another resource. - **ZoneUpdateController**: Route is `updateZone` (no `{id}` in path); uses `zone.ID` from body only. If the API contract expects path-based ID, this is inconsistent. Recommendation: For update-by-id, use the path parameter as the single source of truth: load by path `id`, decode body into a DTO or partial struct, then update only allowed fields for that id (e.g. selective updates or merge then update by path id). - **TrackerUpdateController** Uses body `tracker.ID` for lookup and save; route has no `{id}` in path. If other update endpoints use path `{id}`, align behavior and documentation. **Recommendations:** - Standardize update semantics: either path `{id}` only (body has no id or it must match path) or document “body id is canonical” and ensure no IDOR. - Prefer path-based resource identification for updates/deletes and bind body to allowed fields only. --- ## 4. Minor issues and improvements - **Logging:** Replace `fmt.Println` in `internal/pkg/apiclient/auth.go` and any `internal/pkg/apiclient/updatedb.go` / inference paths with `slog` (or structured logger) so logs are consistent and configurable. - **Token lifecycle:** `DefaultInferencer` caches token in a struct field with no expiry or refresh; token may be used after expiry. Use token expiry from auth response and refresh when needed. - **BeaconLookup naming:** In `appcontext`, `BeaconExists(id string)` is used with MAC (e.g. in bridge handler). Rename parameter to `mac` (or the method to `LookupIDByMAC`) to avoid confusion. - **Bridge/location/decoder:** No `/health` or `/ready` endpoints. For orchestration (e.g. Kubernetes), consider a small HTTP server or at least a process-level health check so the platform can restart unhealthy instances. - **Dependencies:** `go.mod` is clear; consider auditing indirect deps and keeping them updated (e.g. `go list -m -u all` and Dependabot/Renovate). --- ## 5. Propositions (prioritized) ### P0 — Before production 1. **Add API authentication/authorization** Protect all non-health routes (e.g. JWT or API key middleware); document required claims/scopes if using JWT. 2. **Enable and verify DB TLS** Make `sslmode` (and optional root cert) configurable; use `verify-full` (or equivalent) in production. 3. **Respect TLS config for outbound HTTP** Use config (e.g. `TLSInsecureSkipVerify`) for apiclient and location inferencer; fix `NewDefaultInferencer(skipTLSVerify bool)` to use the parameter. 4. **Fix update controllers** Use path `id` as source of truth for update (and optionally delete); ensure body cannot override resource id in an unsafe way. 5. **Tighten CORS and log file permissions** Explicit allowed origins from config; set log file mode to `0600` (or `0640`). ### P1 — High 6. **Kafka consumer error policy** Define and implement skip+commit or DLQ for bad messages; avoid infinite redelivery of poison messages. 7. **DB connection pool and timeouts** Set `MaxOpenConns`, `MaxIdleConns`, `ConnMaxLifetime`, and timeouts in `database.Connect`. 8. **Request body size limit** Middleware (e.g. `http.MaxBytesReader`) for API routes to prevent large-body DoS. 9. **Replace fmt.Println with slog** In apiclient and any remaining places; ensure response body is closed after read (e.g. `defer res.Body.Close()` in auth/data clients if not already). ### P2 — Medium 10. **Retries for Kafka produce** Retry with backoff (and optionally circuit breaker) for critical `WriteMessages` calls. 11. **Context propagation** Pass request/event context (or bounded context) into Kafka writes instead of `context.Background()`. 12. **Token refresh in Inferencer** Use expiry from auth response and refresh token before inference calls. 13. **Health for bridge/decoder/location** Add minimal health/readiness (HTTP or signal file) for orchestration and load balancers. ### P3 — Nice to have 14. **Metrics and tracing** Add metrics (e.g. request duration, Kafka lag, error counts) and optional distributed tracing (e.g. OTel). 15. **Rate limiting** Per-IP or per-token rate limiting on API routes. 16. **Structured validation** Use a validator (e.g. go-playground/validator) for request bodies and path params (IDs, limits). 17. **Documentation** Short runbooks for deploy, config env vars, and dependency on Kafka/DB/MQTT and external auth/API. --- ## 6. Files reviewed (representative) - **Entrypoints:** `cmd/server/main.go`, `cmd/bridge/main.go`, `cmd/decoder/main.go`, `cmd/location/main.go` - **Apps:** `internal/app/server/*`, `internal/app/bridge/app.go`, `internal/app/decoder/app.go`, `internal/app/location/app.go` - **Config:** `internal/pkg/config/config.go` - **Infra:** `internal/pkg/database/database.go`, `internal/pkg/kafkaclient/manager.go`, `internal/pkg/kafkaclient/consumer.go`, `internal/pkg/logger/logger.go` - **API:** `internal/pkg/api/handler/health.go`, `internal/pkg/api/middleware/*`, `internal/pkg/api/response/response.go`, `internal/app/server/routes.go` - **Controllers:** `internal/pkg/controller/*.go` (gateways, trackers, zone, parser, settings, alerts, tracks, trackerzones) - **Services:** `internal/pkg/service/beacon_service.go`, `internal/pkg/apiclient/*.go`, `internal/pkg/bridge/mqtt.go`, `internal/pkg/bridge/handler.go`, `internal/pkg/decoder/process.go`, `internal/pkg/location/inference.go`, `internal/pkg/common/appcontext/context.go` --- ## 7. Conclusion The microservices are **well-structured and readable**, with **consistent lifecycle and shutdown**. The main gaps are **security (no API auth, weak TLS usage)** and **reliability (Kafka and DB tuning, retries, context usage)**. Addressing the **P0 and P1** items above would bring the system much closer to production grade (around **7–8/10**); adding **P2/P3** would further improve operability and resilience.