Site Reliability Engineer

We are looking for a well-rounded, experienced Site Reliability Engineer (SRE) to join a team of SREs dedicated to improving the reliability of our end-to-end platform. We work on petabyte-scale distributed systems: our core infrastructure receives hundreds of millions of tweets per day and serves tens of billions of API requests. Our other systems serve more than 2 billion search queries per day, render hundreds of millions of ad impressions, and process hundreds of terabytes of log and interaction data daily. This person must dive deep into gnarly operational issues from the programming, systems, automation, and process perspectives, and will understand the challenges of integrating disparate infrastructures into a new facility and into new processes and procedures.

Responsibilities
• Perform deep dives into both systemic and latent reliability issues; partner with software and systems engineers across the organization to produce and roll out fixes
• Troubleshoot issues across the entire stack: hardware, software, application, and network
• Drive standardization efforts across multiple disciplines and services in conjunction with embedded SREs throughout the organization
• Mentor SREs across the organization on best practices for everything from monitoring to troubleshooting complex code issues
• Identify and drive opportunities to improve automation for the company; scope and create automation for deployment, management, and visibility of our services
• Participate in code reviews for projects primarily written in Java and Scala, built on open source libraries such as Finagle, and running on both physical and virtualized platforms
• Represent the SRE organization in design reviews and operational readiness exercises for new and existing services
• Take charge of and improve upon our existing production and staging environments
• Build out and maintain tools and processes for automated deployment, management, and monitoring of our application in production, and choose whether to build, adapt (i.e., from open source), and/or buy tools and services for this task
• Contribute to the code base and to systems and software architecture as a member of our engineering team, with an eye towards making our software (both application and infrastructure) more scalable, reliable, and performant

Requirements
• Sound fundamentals in operating systems, networking, and distributed systems
    o Expert familiarity with Linux systems administration and management
    o Familiarity with OS container technology: Docker, LXC, namespaces/cgroups
    o Deep understanding of Ethernet, VLAN, IPv4/IPv6, ARP, DHCP, DNS, and TCP
    o Familiarity with distributed system problems: leader election, consensus, etc.
• Solid understanding of systems and application design, including the operational trade-offs of various designs
• Expert-level understanding of at least one public or private cloud technology such as Amazon AWS or OpenStack
• Practical knowledge of various aspects of service design, including messaging protocols & behavior, caching strategies, and software design practices
• Practical, solid knowledge of shell scripting and at least one higher-level language (Python or Ruby preferred)
• Demonstrable knowledge of TCP/IP, HTTP, and web application security, and experience supporting multi-tier web application architectures
• Excellent knowledge of Linux/UNIX systems administration and performance tuning
• Comfortable configuring DNS, DHCP, and LAN/WAN technologies
• Minimum 7 years of managing services in an internet-scale *nix environment
• Must work well with and be able to influence myriad personalities at all levels
• Ability to prioritize tasks and work independently
• Must be adaptable and able to focus on the simplest, most efficient & reliable solutions
• Track record of successful practical problem solving, plus excellent written communication, interpersonal communication, and documentation skills
• Curiosity and an interest in networking, systems software, and/or distributed systems
• You aren't easily pigeonholed into traditional software engineering or systems administration roles
• Experience as a software engineer, systems administrator, operations engineer, release engineer, or similar role
• Experience with a 24/7 production environment, and you have deployed code to and/or managed 100+ node deployments providing software, platforms, or infrastructure as a service
• Ability to develop clean, tested, and maintainable automation and other tools using one or more of Python, Ruby, Perl, or Go
• Experience with statically typed compiled languages: ideally, you should know at least one object-oriented language (Java, Scala, C++, or C#) and one systems language (C or C++)
• An understanding of at least one build tool or toolchain such as Maven, Ant/Ivy, make, Autotools, CMake, etc.

Desirable Qualifications
• Practical experience in C/C++ and Python
• Ability to lead technical teams through design and implementation across an organization
• Experience with existing open source projects such as Scribe, ZooKeeper, and Apache Mesos
• B.S. in computer science or similar field
• Previous application operations (a.k.a. "site reliability engineering", "production engineering") experience
• Experience with configuration management tools such as CFEngine, Bcfg2, Puppet, Chef, or Ansible
• Experience with Amazon Web Services, Google Compute Engine, or similar
• Experience with distributed compute (e.g., Spark or Hadoop), storage (relational databases such as Postgres or MySQL, horizontally scalable non-relational databases such as HBase, Riak, or Cassandra), and search infrastructure (such as Elasticsearch or Solr/Lucene)
• Experience in horizontally scaling a production environment by an order of magnitude, ideally in a startup or other rapid-growth environment

Salary Range: NA
Minimum Qualification: 8 - 10 years
