Site Reliability Engineer



We are looking for a motivated and talented Site Reliability Engineer to join our remote European team and help us monitor, develop, and scale the Cordial platform. Our goal is to give our clients a delightful experience in their day-to-day interaction with the platform and to build trust that scheduled jobs and background processes will run without issue. You will work with our DevOps and Product teams to ensure that bugs are squashed, performance is optimized, and blind spots are revealed through comprehensive monitoring. This position is fully remote; Cordial has no physical office in Portugal.

Responsibilities


  • Use your knowledge of web, application, network, server, storage, and security technologies to administer, monitor, and troubleshoot application and network components in our cloud-based environment
  • Actively contribute to infrastructure design and implementation discussions
  • Provide production support for the Product Development teams
  • Participate in an on-call rotation
  • Work with the team to design and deploy our monitoring and alerting architecture, and implement monitoring and logging solutions
  • Troubleshoot complex issues promptly to maintain the performance and stability of our production application environment
  • Help define SLOs, and document and monitor SLAs

Requirements


  • 3+ years of Unix/Linux systems and network administration (DNS, IPsec, VPN, load balancing, process tracing)
  • Experience with AWS (we use EC2, EKS)
  • Experience with monitoring, logging, and alerting tools
  • Previous experience in an SRE and/or DevOps role
  • Software development experience
  • Experience with Docker/containers and Kubernetes
  • Comfortable working in a globally distributed team across time zones
  • Strong teamwork and communication skills
  • A genuine desire to learn new technologies and grow
  • Fluent in spoken and written English

Nice to have


  • Experience with MongoDB
  • Experience deploying and/or maintaining Kubernetes/EKS clusters
  • Experience with Prometheus/Grafana/Datadog
  • Experience implementing SLOs, reliability targets, and error budgets