About us:
Paysera is the first fintech company in Lithuania and an EU-licensed e-money institution. We provide fast, convenient, and affordable financial services globally. Our services include a payment gateway for e-shops, a finance management app, and worldwide money transfers.
With over 1 million app installs and growing, we aim to become an industry-leading super app that provides financial and lifestyle services across the globe. At Paysera, we are a start-up-minded team: we thrive in a fast-paced environment, value open communication, and place great focus on our core company values. Join our vibrant international team of 500 people across 15 cities worldwide.
As a Site Reliability Engineer, you'll ensure the IT infrastructure's availability, performance, and security. Collaborating with development teams and system administrators, you'll guide the design and deployment of applications to meet Paysera's reliability standards. The role demands passion for scalable systems, expertise in system architecture, and a commitment to enhancing uptime and service quality. Additionally, we are on the lookout for individuals who are committed to self-improvement and are not afraid to employ innovative AI tools in their daily work to drive progress.
Your key responsibilities:
Define and support Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for existing critical components within our mixed on-premises environment;
Make informed decisions on database cluster optimization and web services configuration, and introduce improvements in these areas;
Enhance the instrumentation and efficiency of daily operations tasks performed by the operations teams;
Drive improvements in change and release processes, transitioning from unregulated CI/CD practices to a more structured change management framework that fosters reliable CI/CD processes;
Operate in an incident-prone environment, working proactively to reduce the frequency and impact of critical incidents. This includes taking part in incident management, contributing to a common operations knowledge base, maintenance operation procedures (MOPs), and runbooks, and enhancing monitoring and observability;
Collaborate closely with operations and development teams to enhance the reliability of our infrastructure and software through education and shared best practices;
Document and categorize knowledge effectively, and train team members to ensure continuity and efficiency of operations;
Communicate effectively with team members and stakeholders, ensuring clear and concise information exchange;
Be ready to take part in on-call rotations;
Use ChatGPT or a similar tool in routine daily tasks to enhance efficiency and productivity.
What we're looking for:
Bachelor’s/Master’s degree in Computer Science, Engineering, or a related field;
A minimum of 3 years of experience in Site Reliability Engineering, System Administration, Incident Management, or a closely related field;
Demonstrated experience in designing and managing the reliability of large-scale systems;
Familiarity with modern infrastructure technologies and deployment processes;
Strong proficiency in monitoring tools and methodologies: ELK, Grafana, New Relic, Datadog, and Zabbix;
Strong experience with containerization technologies such as Docker and Kubernetes;
Strong problem-solving skills with a proactive approach to issue resolution;
Ability to work efficiently under pressure and manage multiple priorities;
Excellent communication skills, with the ability to explain complex technical issues to non-technical stakeholders;
A collaborative team player with a strong desire to mentor and share knowledge;
Fluency in English;
Proven experience with AI tools such as ChatGPT, and a demonstrated ability to integrate them into daily tasks.
Nice-to-Have:
Proficiency in PHP, Symfony, Doctrine ORM, and familiarity with drivers for Redis, RabbitMQ, ELK, and Sentry;
Proficiency with high-availability solutions such as Keepalived, Heartbeat, Corosync, and Pacemaker;
Understanding and experience with static and dynamic routing protocols and all layers of the OSI model;
Experience with Nginx and FPM, and with managing MariaDB clusters that use MaxScale as a proxy;
Solid understanding of Redis with Sentinel clusters, RabbitMQ clusters, Elasticsearch clusters, and analytics/data warehousing with ClickHouse;
Knowledge of infrastructure as code (IaC), orchestration, and configuration management tools such as Ansible, Helm, GitLab CI/CD, and Webistrano;
Expertise in metrics and visualization tools including Zabbix, Grafana, Prometheus, and InfluxDB;
Experience with New Relic, PagerDuty, Sentry, Graylog with Elastic, and a strong understanding of distributed tracing and APM;
Understanding of and experience with monolithic, service-oriented, and microservice architectures.