Production Readiness Report — Microservices (server, bridge, decoder, location)

Scope: cmd/server, cmd/bridge, cmd/decoder, cmd/location and all packages they import.
Date: 2025-03-05.


Overall grade: 5.5 / 10

The codebase has a solid structure and consistent patterns across the four services, but security, reliability, and operational hardening are not yet at production level. With the changes suggested below, it can be brought to a 7–8/10 for a production deployment.


1. Summary by dimension

Dimension                Grade  Notes
Structure & readability  7/10   Clear app lifecycle (New/Run/Shutdown), good package layout, some naming/duplication issues.
Reliability              5/10   Graceful shutdown and Kafka cleanup are good; missing retries, clear commit semantics, and DB pool/timeouts.
Security                 3/10   No API auth, DB TLS and outbound client TLS weakened, CORS permissive, no rate limiting.
Observability            6/10   slog, request ID, logging middleware, health/ready; no metrics/tracing.
Correctness              6/10   Path-vs-body ID bugs in update endpoints, context.Background() in hot paths.

2. What’s in good shape

  • Unified app pattern: All four services use the same lifecycle: New(cfg) → optional Init(ctx) (server only) → Run(ctx) → Shutdown(), with signal.NotifyContext for graceful shutdown.
  • Graceful shutdown: HTTP server shutdown, Kafka readers/writers and MQTT disconnect are explicitly closed; sync.WaitGroup used for consumer goroutines.
  • Structured logging: slog with JSON handler and file + stderr; request logging with method, path, status, duration, bytes.
  • HTTP middleware: Recovery (panic → 500), request ID, CORS, logging applied in a clear chain.
  • Health endpoints: /health (liveness) and /ready (DB ping) for the server.
  • Kafka usage: Centralized KafkaManager with RWMutex, separate readers/writers, group IDs per service.
  • Shared state: AppState in common/appcontext is thread-safe (RWMutex) and used consistently.
  • Config: Env-based config with service-specific loaders (LoadServer, LoadBridge, etc.) and getEnvPanic for required vars.
  • API responses: Centralized response.JSON, Error, BadRequest, InternalError, NotFound with consistent JSON shape.
  • OpenAPI: Routes reference api/openapi.yaml (OpenAPI 3.0), which helps readability and contract clarity.

3. Critical issues

3.1 Security

  • No authentication or authorization on the HTTP API
    All server routes (/reslevis/*, /configs/beacons, etc.) are unauthenticated. Anyone who can reach the server can read/update/delete gateways, zones, trackers, parser configs, settings, alerts, and tracks.

  • Database connection uses sslmode=disable
    In internal/pkg/database/database.go, DSN is built with sslmode=disable. In production, DB connections should use TLS and sslmode=verify-full (or equivalent) with CA verification.

  • TLS verification disabled for outbound HTTP

    • internal/pkg/apiclient/updatedb.go: TLSClientConfig: &tls.Config{InsecureSkipVerify: true}.
    • internal/pkg/location/inference.go: same, and NewDefaultInferencer(skipTLSVerify bool) ignores the parameter and always uses InsecureSkipVerify: true.
  • CORS defaults to *
    In internal/pkg/api/middleware/cors.go, when origins is nil/empty, origins = []string{"*"}. Production should restrict origins to known front-end origins.

  • Logger file mode 0666
    In internal/pkg/logger/logger.go, os.OpenFile(..., 0666) makes the log file world-readable and -writable. Prefer 0600 or 0640.

  • No rate limiting or request body size limits
    No protection against abuse or large-body DoS; consider middleware for max body size and rate limiting (per IP or per key).

Recommendations:

  • Add authentication/authorization middleware (e.g. JWT or API key validation) for all non-health API routes; keep /health and optionally /ready public.
  • Make DB TLS configurable via env (e.g. DB_SSLMODE, DB_SSLROOTCERT) and use sslmode=verify-full in production.
  • Use cfg.TLSInsecureSkipVerify (or equivalent) for all outbound HTTP clients; fix NewDefaultInferencer to respect the parameter.
  • Configure CORS with explicit allowed origins (and optionally credentials) from config.
  • Set log file mode to 0600 (or 0640 if a group needs read).
  • Add middleware to limit request body size (e.g. http.MaxBytesReader) and consider rate limiting for API routes.

3.2 Reliability

  • Kafka consumer: decode errors and commit semantics
    In internal/pkg/kafkaclient/consumer.go, when json.Unmarshal fails, the code logs and continues without committing. Depending on reader config, this can cause repeated redelivery of bad messages or ambiguous semantics. Production should either skip and commit, or send to a dead-letter path and commit.

    Answer: because the readers use consumer groups, messages are auto-committed; a failed unmarshal is therefore still committed, since the message was technically read.

  • No retries on Kafka produce
    Event loops (server, bridge, decoder, location) call WriteMessages once; transient Kafka errors are not retried. Consider retry with backoff (and optional circuit breaker) for critical topics.

    Answer: the Writer already carries default settings for timeout, backoff, and retries, but I still added some extra configuration on top.

  • Database: no explicit pool or timeouts
    database.Connect uses GORM defaults. For production, set MaxOpenConns, MaxIdleConns, and connection/timeout settings (e.g. SetConnMaxLifetime) on the underlying *sql.DB.

  • UpdateDB and Init: failures only logged
    In internal/app/server/app.go, apiclient.UpdateDB errors are only logged; Init continues. Consider failing Init (or marking “degraded”) if sync is required for correct operation, or add retries/backoff.

  • Use of context.Background() in async paths
    e.g. internal/pkg/bridge/handler.go and internal/pkg/decoder/process.go use context.Background() for Kafka writes. Prefer passing the request/event context (or a derived timeout) so shutdown and timeouts propagate.

Recommendations:

  • Define a clear policy for Kafka consumer errors (skip+commit vs DLQ); avoid silent continue without commit unless intended.
  • Add retry (with backoff) for Kafka produce in critical paths; consider a small wrapper or helper.
  • Configure DB pool and timeouts in database.Connect (and optionally make them configurable via config).
  • Decide whether UpdateDB is mandatory for startup; if yes, fail Init on error or retry; if no, document and consider a “degraded” readiness state.
  • Pass context from the caller (or a timeout context) into Kafka write calls instead of context.Background().
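The DB pool recommendation could look like the following, assuming GORM v2 (where gorm.DB exposes the underlying *sql.DB via DB()); the numbers are placeholders to tune against the workload and Postgres max_connections, not recommendations for this system.

```go
package database

import (
	"time"

	"gorm.io/gorm"
)

// tunePool applies explicit pool limits to the *sql.DB behind GORM,
// to be called right after database.Connect succeeds.
func tunePool(db *gorm.DB) error {
	sqlDB, err := db.DB()
	if err != nil {
		return err
	}
	sqlDB.SetMaxOpenConns(25)                  // cap concurrent connections
	sqlDB.SetMaxIdleConns(5)                   // keep a small warm pool
	sqlDB.SetConnMaxLifetime(30 * time.Minute) // recycle before server-side limits
	sqlDB.SetConnMaxIdleTime(5 * time.Minute)  // drop long-idle connections
	return nil
}
```

Making these values configurable via env keeps them tunable per environment, matching the config style already used by the services.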

3.3 Correctness and consistency

  • Update endpoints: path id vs body id

    • GatewayUpdateController (internal/pkg/controller/gateways_controller.go): Uses mux.Vars(r)["id"] only to check existence with First(..., "id = ?", id), then decodes body into gateway and calls Save(&gateway). The updated record is identified by gateway.ID from the body, not the path. A client can send a different ID in the body and update another resource.
    • ZoneUpdateController: Route is updateZone (no {id} in path); uses zone.ID from body only. If the API contract expects path-based ID, this is inconsistent.
      Recommendation: For update-by-id, use the path parameter as the single source of truth: load by path id, decode body into a DTO or partial struct, then update only allowed fields for that id (e.g. selective updates or merge then update by path id).
  • TrackerUpdateController
    Uses body tracker.ID for lookup and save; route has no {id} in path. If other update endpoints use path {id}, align behavior and documentation.

Recommendations:

  • Standardize update semantics: either path {id} only (body has no id, or it must match the path) or document that the body id is canonical and ensure there is no IDOR (insecure direct object reference).
  • Prefer path-based resource identification for updates/deletes and bind body to allowed fields only.

4. Minor issues and improvements

  • Logging: Replace fmt.Println in internal/pkg/apiclient/auth.go and any internal/pkg/apiclient/updatedb.go / inference paths with slog (or structured logger) so logs are consistent and configurable.
  • Token lifecycle: DefaultInferencer caches token in a struct field with no expiry or refresh; token may be used after expiry. Use token expiry from auth response and refresh when needed.
  • BeaconLookup naming: In appcontext, BeaconExists(id string) is used with MAC (e.g. in bridge handler). Rename parameter to mac (or the method to LookupIDByMAC) to avoid confusion.
  • Bridge/location/decoder: No /health or /ready endpoints. For orchestration (e.g. Kubernetes), consider a small HTTP server or at least a process-level health check so the platform can restart unhealthy instances.
  • Dependencies: go.mod is clear; consider auditing indirect deps and keeping them updated (e.g. go list -m -u all and Dependabot/Renovate).

5. Propositions (prioritized)

P0 — Before production

  1. Add API authentication/authorization
    Protect all non-health routes (e.g. JWT or API key middleware); document required claims/scopes if using JWT.

  2. Enable and verify DB TLS
    Make sslmode (and optional root cert) configurable; use verify-full (or equivalent) in production.

  3. Respect TLS config for outbound HTTP
    Use config (e.g. TLSInsecureSkipVerify) for apiclient and location inferencer; fix NewDefaultInferencer(skipTLSVerify bool) to use the parameter.

  4. Fix update controllers
    Use path id as source of truth for update (and optionally delete); ensure body cannot override resource id in an unsafe way.

  5. Tighten CORS and log file permissions
    Explicit allowed origins from config; set log file mode to 0600 (or 0640).

P1 — High

  1. Kafka consumer error policy
    Define and implement skip+commit or DLQ for bad messages; avoid infinite redelivery of poison messages.

  2. DB connection pool and timeouts
    Set MaxOpenConns, MaxIdleConns, ConnMaxLifetime, and timeouts in database.Connect.

  3. Request body size limit
    Middleware (e.g. http.MaxBytesReader) for API routes to prevent large-body DoS.

  4. Replace fmt.Println with slog
    In apiclient and any remaining places; ensure response body is closed after read (e.g. defer res.Body.Close() in auth/data clients if not already).

P2 — Medium

  1. Retries for Kafka produce
    Retry with backoff (and optionally circuit breaker) for critical WriteMessages calls.

  2. Context propagation
    Pass request/event context (or bounded context) into Kafka writes instead of context.Background().

  3. Token refresh in Inferencer
    Use expiry from auth response and refresh token before inference calls.

  4. Health for bridge/decoder/location
    Add minimal health/readiness (HTTP or signal file) for orchestration and load balancers.

P3 — Nice to have

  1. Metrics and tracing
    Add metrics (e.g. request duration, Kafka lag, error counts) and optional distributed tracing (e.g. OTel).

  2. Rate limiting
    Per-IP or per-token rate limiting on API routes.

  3. Structured validation
    Use a validator (e.g. go-playground/validator) for request bodies and path params (IDs, limits).

  4. Documentation
    Short runbooks for deploy, config env vars, and dependency on Kafka/DB/MQTT and external auth/API.


6. Files reviewed (representative)

  • Entrypoints: cmd/server/main.go, cmd/bridge/main.go, cmd/decoder/main.go, cmd/location/main.go
  • Apps: internal/app/server/*, internal/app/bridge/app.go, internal/app/decoder/app.go, internal/app/location/app.go
  • Config: internal/pkg/config/config.go
  • Infra: internal/pkg/database/database.go, internal/pkg/kafkaclient/manager.go, internal/pkg/kafkaclient/consumer.go, internal/pkg/logger/logger.go
  • API: internal/pkg/api/handler/health.go, internal/pkg/api/middleware/*, internal/pkg/api/response/response.go, internal/app/server/routes.go
  • Controllers: internal/pkg/controller/*.go (gateways, trackers, zone, parser, settings, alerts, tracks, trackerzones)
  • Services: internal/pkg/service/beacon_service.go, internal/pkg/apiclient/*.go, internal/pkg/bridge/mqtt.go, internal/pkg/bridge/handler.go, internal/pkg/decoder/process.go, internal/pkg/location/inference.go, internal/pkg/common/appcontext/context.go

7. Conclusion

The microservices are well-structured and readable, with consistent lifecycle and shutdown. The main gaps are security (no API auth, weak TLS usage) and reliability (Kafka and DB tuning, retries, context usage). Addressing the P0 and P1 items above would bring the system much closer to production grade (around 7–8/10); adding P2/P3 would further improve operability and resilience.