I build production-grade agentic AI systems and data pipelines. Currently deepening my skills in self-healing agent workflows, observability infrastructure, and human-in-the-loop systems.
Expertise and interests:
Agent Systems & Orchestration
- Multi-agent orchestration with LangGraph/Claude Agents SDK for complex workflows: routing between specialized agents, implementing feedback loops, and managing context across agent interactions (routing sketch below)
- Self-healing agent pipelines with automated error recovery, retry logic, and fallback strategies (retry/fallback sketch below)
- Human-in-the-loop (HITL) workflows with approval gates for high-impact agent actions (approval-gate sketch below)
- Agent communication protocols, including Model Context Protocol (MCP) for standardized tool integration (MCP tool sketch below)
- Real-time monitoring and alerting for LLM operations, tracking latency, cost, and error rates (instrumentation sketch below)
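
A minimal sketch of what I mean by routing between specialized agents, using LangGraph's `StateGraph` with a conditional edge. The agent bodies and the keyword-style router are placeholders, not production logic.

```python
# Minimal LangGraph routing sketch: a router dispatches to one of two
# specialized agents based on the request. Agent bodies are placeholders.
from typing import Literal, TypedDict

from langgraph.graph import END, START, StateGraph


class AgentState(TypedDict):
    request: str
    answer: str


def route(state: AgentState) -> Literal["research", "calculator"]:
    # Placeholder routing rule; in practice this could be an LLM classifier.
    return "calculator" if any(ch.isdigit() for ch in state["request"]) else "research"


def research_agent(state: AgentState) -> dict:
    return {"answer": f"[research agent would handle: {state['request']}]"}


def calculator_agent(state: AgentState) -> dict:
    return {"answer": f"[calculator agent would handle: {state['request']}]"}


graph = StateGraph(AgentState)
graph.add_node("research", research_agent)
graph.add_node("calculator", calculator_agent)
graph.add_conditional_edges(START, route)  # router picks the next node
graph.add_edge("research", END)
graph.add_edge("calculator", END)
app = graph.compile()

print(app.invoke({"request": "What is 2 + 2?", "answer": ""}))
```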
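
The self-healing pattern in a nutshell: retry the primary strategy with exponential backoff, then degrade to a fallback instead of failing the whole pipeline. The primary and fallback functions here are stand-ins for real model or tool calls.

```python
# Self-healing call pattern: retry with exponential backoff, then fall back
# to a cheaper/simpler strategy instead of failing the whole pipeline.
import logging
import time

logger = logging.getLogger("agent.recovery")


def call_with_recovery(primary, fallback, *, attempts: int = 3, base_delay: float = 1.0):
    """Try `primary` with backoff; on repeated failure, degrade to `fallback`."""
    for attempt in range(1, attempts + 1):
        try:
            return primary()
        except Exception as exc:  # in production: catch specific, retryable errors
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt < attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))
    logger.error("primary strategy exhausted, using fallback")
    return fallback()


def flaky_primary() -> str:
    # Placeholder for a call to a primary model/tool that may fail transiently.
    raise TimeoutError("simulated upstream timeout")


def safe_fallback() -> str:
    # Placeholder for a degraded-but-reliable path (smaller model, cached answer, ...).
    return "fallback answer"


print(call_with_recovery(flaky_primary, safe_fallback))
```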
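
A stripped-down HITL approval gate, roughly how I think about it: the agent proposes high-impact actions, a human approves or rejects them, and only approved actions execute. In a real system the pending actions would live in durable storage behind a review UI; this version keeps everything in memory.

```python
# Human-in-the-loop gate: high-impact actions are queued for explicit approval
# instead of being executed autonomously. In-memory sketch only.
import uuid
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PendingAction:
    action_id: str
    description: str
    execute: Callable[[], str]


@dataclass
class ApprovalGate:
    pending: dict = field(default_factory=dict)

    def propose(self, description: str, execute: Callable[[], str]) -> str:
        action_id = str(uuid.uuid4())
        self.pending[action_id] = PendingAction(action_id, description, execute)
        return action_id  # surfaced to a human reviewer

    def approve(self, action_id: str) -> str:
        return self.pending.pop(action_id).execute()

    def reject(self, action_id: str) -> None:
        self.pending.pop(action_id)


gate = ApprovalGate()
aid = gate.propose("Send refund email to customer", lambda: "email sent")
print(gate.approve(aid))  # the decision belongs to a human, not the agent
```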
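
A tiny MCP tool server, assuming the official `mcp` Python SDK. The tool body is a stub, but the shape shows how a capability gets exposed over a standardized protocol that any MCP-capable agent can call.

```python
# Minimal MCP server sketch (assumes the official `mcp` Python SDK):
# exposes one stub tool over the Model Context Protocol.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("document-tools")


@mcp.tool()
def summarize_document(doc_id: str) -> str:
    """Return a short summary of the document with the given id."""
    # Placeholder: a real implementation would fetch the document and call a model.
    return f"Summary for document {doc_id} would be produced here."


if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default
```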
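
Instrumentation at its simplest: wrap each LLM call to record latency, token-based cost, and errors. The prices and the in-memory metrics store are illustrative; in production these numbers feed dashboards and alert rules.

```python
# Lightweight instrumentation for LLM calls: record latency, cost, and errors
# per model. Prices are illustrative; metrics are accumulated in memory.
import time
from collections import defaultdict

PRICE_PER_1K_TOKENS = {"small-model": 0.0005, "large-model": 0.01}  # assumed prices
metrics = defaultdict(lambda: {"calls": 0, "errors": 0, "latency_s": 0.0, "cost_usd": 0.0})


def track_llm_call(model: str, call):
    """`call` returns (result, tokens_used); token counts normally come from the API response."""
    start = time.perf_counter()
    record = metrics[model]
    record["calls"] += 1
    try:
        result, tokens_used = call()
        record["cost_usd"] += tokens_used / 1000 * PRICE_PER_1K_TOKENS.get(model, 0.0)
        return result
    except Exception:
        record["errors"] += 1
        raise
    finally:
        record["latency_s"] += time.perf_counter() - start


print(track_llm_call("small-model", lambda: ("ok", 350)))  # stubbed call
print(dict(metrics))
```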
Scalable AI Infrastructure
- Batch inference pipelines for multimodal document understanding at scale (batching sketch below)
- Hybrid search systems combining semantic and lexical retrieval with OpenSearch (query sketch below)
- Production FastAPI services with async patterns for high-concurrency LLM operations (endpoint sketch below)
- Cost-aware architecture with caching, prompt optimization, and intelligent model routing (router sketch below)
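
A batching sketch: bounded concurrency over a set of documents so throughput stays high without overrunning provider rate limits. The `analyze_document` coroutine is a placeholder for a real multimodal call.

```python
# Batch inference sketch: process many documents concurrently with a bounded
# level of parallelism.
import asyncio


async def analyze_document(doc_id: str) -> dict:
    # Placeholder for a multimodal model call (e.g. via Bedrock) on one document.
    await asyncio.sleep(0.1)
    return {"doc_id": doc_id, "status": "processed"}


async def run_batch(doc_ids: list[str], max_concurrency: int = 8) -> list[dict]:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(doc_id: str) -> dict:
        async with semaphore:
            return await analyze_document(doc_id)

    return await asyncio.gather(*(bounded(d) for d in doc_ids))


results = asyncio.run(run_batch([f"doc-{i}" for i in range(20)]))
print(len(results), "documents processed")
```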
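
A hybrid retrieval sketch against OpenSearch, combining BM25 (`match`) and k-NN scores in a single `bool`/`should` query. The index name, field names, and `embed()` stub are assumptions; OpenSearch's dedicated `hybrid` query with a normalization pipeline is an alternative.

```python
# Hybrid retrieval sketch: lexical and semantic clauses combined in one query.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # assumed local cluster


def embed(text: str) -> list[float]:
    # Placeholder: a real system would call an embedding model here.
    return [0.0] * 768


def hybrid_search(query_text: str, k: int = 10) -> dict:
    body = {
        "size": k,
        "query": {
            "bool": {
                "should": [
                    {"match": {"content": {"query": query_text}}},                   # lexical
                    {"knn": {"embedding": {"vector": embed(query_text), "k": k}}},   # semantic
                ]
            }
        },
    }
    return client.search(index="documents", body=body)
```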
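
A FastAPI sketch for high-concurrency LLM traffic: a fully async endpoint with a semaphore capping in-flight model calls so bursts degrade gracefully instead of exhausting upstream rate limits. The model call is stubbed.

```python
# Async FastAPI endpoint with a concurrency budget for LLM calls.
import asyncio

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
llm_slots = asyncio.Semaphore(32)  # assumed concurrency budget


class CompletionRequest(BaseModel):
    prompt: str


async def call_llm(prompt: str) -> str:
    # Placeholder for an async call to the model provider.
    await asyncio.sleep(0.05)
    return f"response to: {prompt[:40]}"


@app.post("/complete")
async def complete(req: CompletionRequest) -> dict:
    async with llm_slots:
        answer = await call_llm(req.prompt)
    return {"answer": answer}
```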
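
Cost-aware routing, reduced to its core: serve from cache when possible, send easy prompts to a cheap model, and escalate only when a heuristic says the task needs the expensive one. Model names and the routing heuristic are placeholders.

```python
# Cost-aware routing sketch: cache first, cheap model by default, escalate on demand.
import hashlib

cache: dict[str, str] = {}  # in production: Redis with a TTL


def route_model(prompt: str) -> str:
    # Placeholder heuristic; a small classifier or rules on task type also work.
    return "large-model" if len(prompt) > 500 or "analyze" in prompt.lower() else "small-model"


def call_model(model: str, prompt: str) -> str:
    # Placeholder for the actual provider call.
    return f"[{model}] answer"


def complete(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in cache:
        return cache[key]  # no model call, zero marginal cost
    answer = call_model(route_model(prompt), prompt)
    cache[key] = answer
    return answer


print(complete("Summarize this paragraph."))                              # -> small-model
print(complete("Analyze the attached contract for risk clauses." * 20))   # -> large-model
```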
Stack: Python, FastAPI, LangGraph, Claude/OpenAI Agents SDK, AWS (Bedrock, EKS, SageMaker, Step Functions, DynamoDB), OpenSearch, Redis, observability tools
📍 Berlin, Germany | 🌐 Open to collaborations
Reliability Engineering for AI: I design agent systems with self-healing capabilities, automated recovery workflows, and comprehensive observability; no black-box deployments
Evaluation-Driven Development: every AI feature ships with quantitative evals, regression tests, and continuous monitoring, so deployed systems improve over time through structured feedback (eval sketch below)
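
What an evaluation gate can look like in its smallest form: a fixed case set runs through the system and CI fails if accuracy drops below a threshold. The cases, scorer, and `answer()` function are placeholders; real suites use larger datasets and task-specific scoring (exact match, LLM-as-judge, ...).

```python
# Regression eval sketch: fail CI when accuracy drops below a threshold.
EVAL_CASES = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
ACCURACY_THRESHOLD = 0.9


def answer(question: str) -> str:
    # Placeholder for the system under test.
    return {"2 + 2": "4", "capital of France": "Paris"}.get(question, "")


def test_regression_eval():
    correct = sum(1 for case in EVAL_CASES if answer(case["input"]) == case["expected"])
    accuracy = correct / len(EVAL_CASES)
    assert accuracy >= ACCURACY_THRESHOLD, f"accuracy regressed to {accuracy:.2f}"


if __name__ == "__main__":
    test_regression_eval()
    print("eval passed")
```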
Building agentic systems that need production-grade reliability, observability, and human oversight? Let's discuss architecture for systems that don't just work on demo day—they scale, self-heal, and improve continuously.



