Software Engineer – Distributed Service

Who We Are

We are the engineers on Singularity. We believe that building a planet-scale AI supercomputer from the ground up, one that addresses the fundamental pain points of data scientists and AI practitioners and takes AI to unprecedented scale, is the opportunity of a lifetime. If you share that dream, come join us!


What Is Singularity?

Singularity is a globally distributed, multi-tenant service that provides robust, cost-effective and competitive AI infrastructure (compute, networking and storage) for AI training and inferencing.

Ultimately, democratization of AI is all about enabling data scientists to productively build, scale, experiment, and iterate their models on top of a robust, performant, scalable and cost-effective distributed infrastructure built for AI.

In Singularity, we constantly seek to apply the best ideas from AI, machine learning, distributed systems, distributed databases, information retrieval, networking, and security.


What You Will Work On

In this role, you will build the scheduling sub-system that delivers on the SLAs for AI training and inferencing workloads. Specifically, you will build fault detection mechanisms, topology-aware scheduling algorithms, checkpoint/restore, and elasticity capabilities across hardware and software stacks.


Responsibilities

  1. Design and build the reliability sub-system that is responsible for delivering on the SLAs for AI training and inferencing workloads.
  2. Build fault detection mechanisms, topology-aware scheduling algorithms, checkpoint/restore, and elasticity capabilities across hardware and software stacks.
  3. Use performance and profiling tools to identify hot spots and bottlenecks across hardware and software boundaries, from CPU, GPU, microcode, OS, and networking to product code, and drive end-to-end job performance.

Qualifications

Required Qualifications

  • BS or higher in Computer Science or related discipline (or equivalent experience)
  • 2+ years of industry experience designing, developing, and shipping high-quality, scalable software and services

Preferred Qualifications

  • Strong design, implementation, and testing skills
  • Managed and native code development experience
  • Experience developing in the Kubernetes ecosystem is a plus
  • Experience using or extending PyTorch or TensorFlow is a plus
  • Experience in programming hardware accelerators such as GPUs is a plus
  • Experience with CUDA/NCCL is a plus
  • Experience with parallel programming (pthreads, MPI, OpenMP, etc.) is a plus
  • Experience diagnosing and debugging system performance issues using appropriate tools and techniques

Great if you have any of the following under your belt:

  • Large-scale stateful and stateless services
  • Native Windows or Linux development experience
  • Performance profiling
  • Strong written and oral communication skills

Great if you are passionate about the following:

  • Algorithm improvements
  • Resource management
  • Performance
  • Resource utilization
  • Metrics and analytics
  • Benchmarks

We are committed to an inclusive and diverse culture.


Join our mission and help us shape the future of planet-scale AI and solve the pain points of data scientists developing bleeding-edge AI!


The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.


Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the Accommodation request form.