We are looking for a Senior Site Reliability Engineer to join the Production Operations team, which runs SmartDrive’s infrastructure and services. The ideal candidate will demonstrate operational excellence in defining, operating, and monitoring legacy and new services and infrastructure, as well as in responding to incidents and critical situations in a 24x7x365 environment.
This SRE will contribute to the design and evolution of SmartDrive’s platform, which is based on orchestrated and distributed services in the cloud.
Our products intersect with all the current and exciting trends in the automotive and transportation industries, and our business is nearly doubling every year. Our people and our platforms are the foundation and enabler of that growth.
We are expanding our team and looking for technologists who strive for continuous improvement, who have a passion for developing highly available and distributed applications in the cloud, and who have a drive to deliver software products that make a meaningful difference in the lives of others.
The position is located at SmartDrive’s HQ in San Diego, CA.
We expect the ideal candidate to be proficient in a large portion of the following technologies and tools (list not exhaustive), in particular the Infrastructure, Backend and Protocols stacks:
- Source Control and Processes: Jira, Confluence, Bitbucket
- CI/CD: Jenkins, Spinnaker, Helm, Harbor, NuGet, Octopus, SonarQube, Clair
- Infrastructure: Docker, Kubernetes (cloud and bare metal), Windows Server, Hyper-V
- Backend: Kafka, Flink, Cassandra, MongoDB, MS SQL Server, Spark, Redis, Elasticsearch, Kibana, Prometheus, Grafana
- Firmware: ARM SoC, SQLite, InfluxDB, FFmpeg
- Languages: Go, Python, C/C++, C#, Scala
- Protocols: Traditional internet and security protocols (HTTP, TLS, PKI)
The Platform Services team runs the entire infrastructure supporting SmartDrive services and customers, powered by on-prem systems in our datacenter and a fast-growing public cloud footprint.
The team, whose ownership and drive have no match, is composed of experts in DevOps, DataOps, Production Operations, IT, and Security.
Our team is a global operation with support around the clock to ensure we fulfill the SmartDrive Promise to our customers.
- Own the operations of SmartDrive’s on-prem infrastructure and growing public cloud footprint in a very agile and fast-paced environment
- Diagnose and resolve issues related to SmartDrive’s customer-facing systems through proactive monitoring, alerting, trending and anomaly detection
- Identify bugs and gaps in the systems, and collaborate with a cross-functional team to document them and determine the best path to address and correct them
- Drive improvements and evolution of our monitoring and alerting systems through new and updated techniques and tools
- Support maintenance and evolution of SmartDrive’s on-prem datacenter, including network and hardware management, and maintenance of firewalls/load balancers
- Set up and maintain cloud accounts, services and infrastructure
- Own the Production Readiness Review process, guiding cross-functional teams towards operational readiness
- Evolve capacity planning models and tools to optimize the use of our footprint, as well as manage operational costs
- Develop, together with cross-functional teams, strategies for higher organizational throughput via automation and flexible processes
- Document operational runbooks
- Leverage and train global staff, including customer-facing Tech Support teams, to ensure efficient, around-the-clock support
- Help the team balance and plan longer-term projects against the demands of a fast-changing environment (i.e., eliminate toil)
- Contribute to the reliable design of continuous deployment solutions that will yield higher availability and uptime for our internal and external customers
- Enforce security policies and best practices in infrastructure and designs
- Participate in on-call rotations necessary in a 24x7x365 organization
YOUR SKILLS AND QUALIFICATIONS
- B.S. or M.S. Degree in Computer Science or equivalent (M.S. highly desirable)
- 5+ years of experience supporting a 24x7x365 global SaaS/PaaS operation
- 3+ years of experience with public cloud infrastructure and services (AWS preferred)
- You demonstrate proficiency in our stack (see above) with 2+ years of experience in a large portion of the tools and technologies we leverage, in particular the Infrastructure, Backend and Protocols stacks
- Experience maintaining an on-prem datacenter, including network and hardware
- Proficiency with Windows Server and/or SQL Server highly desirable
- Proficiency with scripting languages (Bash, PowerShell, Perl or equivalent)
- Proficiency with Linux
- Network security experience highly desirable
- You demonstrate a high sense of urgency and ownership, and can operate with minimal direction
- You take great pride in helping teams and individuals (there is no “it’s not my job” syndrome at SmartDrive)
- You have competence in building and designing complex systems leveraging multiple technologies (legacy and new)
- You can demonstrate and ensure consistency in line with industry best practices for operational readiness (we trend towards an approach similar to the Google SRE model)
- You are comfortable in a fast-paced and fast-changing environment
- You are driven and motivated to learn and ramp up fast
- You are candid and are able to hold your own as the ProdOps representative in cross-functional teams
SmartDrive Systems, the recipient of Frost & Sullivan’s Customer Value Leadership Award for Video Safety Solutions, gives fleets and drivers unprecedented driving performance insight and analysis, helping save fuel, expenses and lives. Its video analysis, predictive analytics and personalized performance program help fleets improve driving skills, lower operating costs and deliver significant ROI. With an easy-to-use managed service, fleets and drivers can access and self-manage driving performance anytime, anywhere. The company, which is ranked as one of the fastest-growing companies by Deloitte’s Technology Fast 500™, has compiled the world’s largest storehouse of more than 220 million analyzed risky-driving events. SmartDrive Systems is based in San Diego and employs over 725 people worldwide.