Responsibilities
As a Manufacturing Security System Infrastructure & Platform Engineer within the DevOps team, you will be responsible for building, managing, and maintaining the highly secure on-premise Kubernetes platform and underlying infrastructure that hosts our critical microservices. You will ensure the platform is available, performant, secure, and scalable, working closely with the CI/CD & Application Reliability Engineer to provide a stable base for deployments. Your main tasks will include:
- On-Premise Infrastructure Management: Design, deploy, supervise, and manage the on-premise Kubernetes-based microservices infrastructure. This includes hands-on management of core infrastructure components such as:
  - Deploying and maintaining clustered storage solutions (e.g., Rook for Ceph).
  - Managing secure container image storage and distribution (e.g., Harbor).
  - Configuring and managing ingress controllers (e.g., ingress-nginx) and bare-metal load balancers (e.g., MetalLB).
  - Implementing and managing a service mesh (e.g., Linkerd) for enhanced communication, security (mTLS), and observability at the platform level.
  - Ensuring the availability, scalability, and optimal performance of the platform itself.
- Database Management: Deploying, configuring, and maintaining database systems (MariaDB, Redis) and database proxies (MaxScale) that support the microservices, ensuring their security and high availability.
- Platform Monitoring & Logging: Setting up and managing the infrastructure-level monitoring and logging components. This includes collecting system logs (e.g., Fluent Bit from nodes and pods), ensuring logs are stored and accessible (e.g., Elasticsearch), and configuring platform-level metrics collection (e.g., Prometheus for node/cluster metrics) and visualization (e.g., Grafana dashboards for infrastructure health).
- Infrastructure Security: Implementing and enforcing security measures at the infrastructure level in compliance with ISO27001. This includes securing the Kubernetes control plane and nodes, configuring network policies, securing the container registry (Harbor), and managing security aspects of the service mesh (Linkerd).
- Performance & Capacity Planning: Analyzing infrastructure performance (CPU, memory, network, storage utilization) using monitoring data and conducting capacity planning for future growth. Proposing and implementing improvements to optimize resource usage of the platform.
- Incident Response (Infrastructure): Providing support and maintenance for the production infrastructure, focusing on diagnosing and resolving issues related to the Kubernetes cluster, storage, network, and database systems.
- Collaboration: Collaborating closely with the CI/CD & Application Reliability Engineer, development, and architecture teams to ensure the platform meets the needs of the applications and deployment processes.
- Infrastructure Automation: Contributing to and using Ansible scripts for the automation of infrastructure setup, configuration, and patching.
- Configuration Management: Managing the configuration of the infrastructure components using Infrastructure as Code (IaC) principles.
- Documentation: Creating and maintaining detailed documentation of the infrastructure architecture, configurations, and operational procedures.
Qualifications
Required Skills:
- Extensive experience managing and supporting on-premise Linux and Kubernetes environments.
- Proven hands-on experience deploying and managing critical infrastructure components within Kubernetes, including:
  - Storage solutions: Rook (Ceph).
  - Container registries: Harbor.
  - Networking: Ingress controllers (ingress-nginx), bare-metal load balancers (MetalLB).
  - Service mesh: Linkerd (specifically from an infrastructure perspective).
- Experience deploying, managing, and securing database systems (MariaDB, Redis) and database proxies (MaxScale).
- Experience setting up, configuring, and managing infrastructure-level monitoring, logging, and alerting using tools such as Prometheus, Grafana, Elasticsearch, and Fluent Bit.
- Strong understanding of Linux system administration and networking.
- Experience in distributed system architecture and high availability design principles for infrastructure.
- Strong knowledge of infrastructure and network security practices.
- Skills in infrastructure performance monitoring, analysis, and optimization.
- Familiarity with automation tools like Ansible.
- Ability to work effectively in a collaborative team environment.
- Excellent problem-solving skills, particularly for infrastructure-related incidents.
- Ability to write and speak technical English fluently.
Additional information
Assets:
- Experience with Infrastructure as Code (IaC) tools beyond Ansible (e.g., Terraform).
- Experience working in environments with strict security and compliance requirements (e.g., ISO27001).
- Knowledge of container and infrastructure security scanning and hardening.
Required Mindset:
- Collaboration and Communication: Work closely with all teams and communicate clearly to ensure smooth processes.
- Automation and Optimization: Constantly seek to automate tasks and improve system efficiency.
- Problem-Solving and Resilience: Remain calm under pressure, quickly resolve incidents, and be curious to learn and adopt new technologies.