Platform Engineering builds the core platforms, tooling, and paved roads that Bloomberg engineers rely on to ship reliable, secure, and high-performing systems at scale.
The AI App Enablement & Observability team accelerates how AI products are built across Bloomberg Industry Group. Our mission is to make AI systems reliable, performant, cost-efficient, and continuously improving through platform tooling, deep observability, and automated feedback loops.
We build developer-facing platforms and workflows that enable teams to experiment, deploy, and operate AI and agent-based systems with confidence. This includes LLM gateways, agent platforms, benchmarking systems, telemetry pipelines, and self-improving infrastructure that closes the loop between observability and action. We emphasize strong developer experience, intuitive APIs/SDKs, and end-to-end ownership.
What’s in it for you?
You will help define how Bloomberg Industry Group builds and operates AI systems at scale by working on platforms that:
- Accelerate AI product development through reusable tooling and paved roads
- Provide end-to-end observability across AI systems (models, agents, pipelines, applications)
- Enable self-improving systems through telemetry-driven feedback loops
- Optimize cost, performance, and reliability of AI workloads
- Support both production AI systems and internal engineering agents
You’ll collaborate across AI product, infrastructure, and platform teams to deliver foundational systems.
We’ll trust you to:
Platform & Enablement
- Build and evolve AI platform tooling (e.g., developer workflows, benchmarking systems)
- Design developer-friendly APIs, SDKs, and interfaces
- Contribute to systems across the Model Development Lifecycle (experimentation, deployment, evaluation)
Observability & Telemetry
- Build and operate observability platforms and telemetry pipelines (logs, metrics, traces, events)
- Provide visibility into latency, token usage, cost, quality, drift, and reliability
- Define instrumentation standards, schemas, and conventions
- Implement distributed tracing using modern approaches (e.g., OpenTelemetry)
AI System Insights & Debugging
- Enable end-to-end debugging of AI and agent workflows (model calls, tool usage, retrieval, orchestration)
- Build benchmarking, regression detection, and performance analysis capabilities
- Support observability for both production systems and internal engineering agents
Closed-loop Optimization & Automation
- Develop systems that turn telemetry into action (automated experimentation, regression detection, alerting)
- Build feedback loops that continuously improve model quality and system behavior
- Enable self-healing and self-optimizing workflows
Cost, Performance & Reliability
- Build tooling for cost visibility, forecasting, and optimization
- Define SLOs, alerting, and performance tuning practices
- Improve reliability and scalability of AI infrastructure
Ownership & Collaboration
- Own projects end-to-end (RFCs, architecture, implementation, rollout, production support)
- Partner with AI teams to drive adoption of platform tooling and standards
- Produce high-quality documentation and improve developer experience
You’ll need to have:
- Demonstrated experience building production software or platform systems
- Strong engineering fundamentals with distributed systems or backend platforms
- Experience or strong interest in observability and debugging complex systems
- Experience or strong interest in AI/ML systems, LLMs, or agent-based architectures
- Strong ownership mindset and ability to drive ambiguous problems to production
- Hands-on experience with modern agentic coding tools (e.g., Claude Code, Codex CLI, Cursor) and multi-model workflows
- Working knowledge of agent architecture internals (context engineering, tool loops, sub-agent orchestration)
We’d love to see:
- Experience with OpenTelemetry and modern observability ecosystems, including instrumentation, collectors, exporters, and tools like Prometheus, Grafana, and tracing/log systems
- Experience designing and operating telemetry pipelines, including sampling, retention, cardinality, and cost tradeoffs, as well as integrating observability into CI/CD and developer workflows
- Familiarity with AI/agent frameworks, including instrumentation of LLM calls, tool usage, workflows, and evaluation signals (quality metrics, benchmarking, regression detection)
- Experience building cost monitoring, forecasting, and optimization systems for AI workloads
- Familiarity with cloud and infrastructure tooling (e.g., AWS, Azure, Kubernetes, Terraform)
- Experience with agentic infrastructure concepts such as MCP servers, hooks, skills, subagents, sandboxing, and persistent memory patterns
- Active engagement with the agentic engineering frontier, including emerging patterns (e.g., harness vs. model, review debt, feedback loops)
- Demonstrated agent-native development practices (iterating with agents using testing, verification, and feedback loops)
- Strong security awareness for autonomous systems, including sandboxing, prompt injection risks, credential exposure, and guardrails