Join Synadia
At Synadia, we are pioneering a new way for digital systems to connect and communicate across cloud, on-premises, and edge environments, securely and in real time. We love open source software (OSS)! We maintain and lead the development of NATS, a next-generation distributed communications platform.
Distributed Systems | Performance and Reliability
Employment Type: Full-time
Level: Junior to Intermediate
Location: Remote
Job Summary
This job is not routine and requires creativity, critical thinking, expert troubleshooting, strong collaboration skills, and a desire to try something new. You will work primarily with a senior systems engineer on a long-term mission to improve the performance and reliability of the NATS ecosystem. NATS is natively flexible and composable. To deal with this complexity and large surface area, we apply a holistic approach to identifying performance and consistency issues before any users run into them.
Day to day, the job includes but is not limited to the following activities:
- Design experiments to evaluate the system's runtime behavior in a variety of scenarios
- Design benchmarks targeting specific sub-systems
- Develop tools for testing and analysis (e.g., load generators, telemetry aggregators, results visualization)
- Set up automation to catch errors and performance regressions proactively, and reproduce complex scenarios at will
- Perform one-off deep-dive investigations to isolate the root cause of performance and reliability issues
- Develop fault-injection techniques and tools to verify the system behaves correctly even when things go wrong (e.g., Jepsen, Chaos Monkey)
- Leverage formal methods to test system correctness (e.g., execution trace analysis using Elle)
- Design experiments that purposely abuse systems, simulating DDoS attacks and data exfiltration attempts
Job Requirements
- Bachelor’s degree in Computer Science or equivalent
- Passion for distributed systems and cloud infrastructure, specifically aspects of scalability, dependability, and fault tolerance
- Strong Unix/Linux systems-level programming and troubleshooting skills
- Understanding of protocols such as TCP/IP, UDP, TLS, and HTTP, and of the OSI model
- Critical thinker, effective troubleshooter
- Great communication and documentation skills
- Proficiency in at least one of the following: Go, Python, Java, Rust, C/C++, Ruby
Preferred Qualifications
- Familiarity with distributed systems literature (consensus protocols, consistency models, replication, etc.)
- Familiarity with cloud infrastructure security
- Experience with messaging technologies such as NATS, JMS, MQTT, AMQP, and Kafka
- Experience with distributed tracing and monitoring solutions
- Experience working with cloud providers such as AWS, Azure, or GCP