Senior/Staff Software Engineer, Super Compute Memory
Pryon
Software Engineering
New York, NY, USA
USD 190k-230k / year
Posted on Oct 9, 2025
About Pryon:
We’re a team of AI, technology, and language experts whose DNA lives in Alexa, Siri, Watson, and virtually every human language technology product on the market. Now we’re building an industry-leading knowledge management and Retrieval-Augmented Generation (RAG) platform. Our proprietary, cutting-edge natural language processing capabilities transform unstructured data into meaningful experiences that increase productivity with unmatched accuracy and speed.
Pryon is building one of the industry's most ambitious AI infrastructure platforms: a petabyte-scale ingestion and inference system powering mission-critical government and enterprise deployments. Our Super Compute Memory (SCM) initiative aims to process and serve massive knowledge bases at unprecedented speed and scale: think 6.5M+ documents ingested in under 20 minutes with sub-second retrieval.
We need a Senior/Staff Software Engineer with deep high-performance computing (HPC) expertise: someone who has built parallel and distributed systems at scale, not just used them. As a founding technical contributor to the SCM team, you will design and implement the parallel computing infrastructure that powers our ingestion, retrieval, and inference layers across multi-cloud and on-prem environments.
In This Role You Will:
- Design GPU resource allocation and scheduling for embedding generation and inference workloads across multi-tenant deployments; optimize GPU utilization (targeting 80%+ GPU memory utilization) through techniques like batching, quantization, and vLLM deployment
- Build and operate GPU clusters (multi-node, multi-GPU) for production inference workloads; implement GPU monitoring, health checks, and auto-recovery mechanisms; optimize cost-per-inference through efficient GPU scheduling
- Profile and optimize GPU-based inference pipelines for latency (targeting <100ms p95) and throughput (queries/sec per GPU); implement inference optimizations including KV caching, continuous batching, and model parallelism strategies
- Establish a performance benchmarking framework, identify bottlenecks through profiling, and implement optimizations that achieve 10x+ improvements
- Design and implement a distributed ingestion pipeline capable of processing 6M+ documents/hour using technologies such as message brokers, MinIO, workflow engines, and parallel processing frameworks
- Architect high-performance vector similarity search infrastructure supporting billion-scale embeddings with sub-second query latency
- Architect compute-optimized deployments across AWS (EC2, EKS), GCP (GKE, TPUs), Azure (AKS), and on-premises Kubernetes
- Design, develop, and optimize high-performance, distributed systems and software for scalability and reliability
- Collaborate with research scientists, ML engineers, and platform teams to deliver high-quality, large-scale software
- Drive architectural decisions through RFCs, mentor ML and platform engineers on HPC best practices, establish coding standards for high-performance systems
- Lead the technical design and implementation of major features and components with lasting architectural impact
What You'll Need to Be Successful:
- Extensive experience in software development, with a proven track record of delivering complex, large-scale systems (8+ years for Senior, 12+ years for Staff)
- Proven experience building distributed systems at 100M+ scale (documents, vectors, or equivalent)
- Deep knowledge of parallel and distributed computing concepts including consensus algorithms, distributed coordination, and fault tolerance
- Hands-on experience with vector databases (pgvector, Pinecone, Weaviate, Milvus, or equivalent)
- Proficiency in systems programming languages such as C++, Go, or Rust
- Experience with parallel programming models (e.g., MPI, OpenMP, CUDA)
- Production experience optimizing GPU workloads for inference including batch optimization, quantization (INT8, FP16), and GPU memory management
- Experience managing large-scale GPU infrastructure (10+ GPUs in production) including cluster orchestration, resource scheduling, and cost optimization
- Deep understanding of GPU architectures (NVIDIA A100/H100, tensor cores) and inference frameworks (vLLM, TensorRT, Triton)
- Deep understanding of memory hierarchies, cache optimization, and NUMA architectures
- Experience with container orchestration (Kubernetes) and distributed computing frameworks (Ray, Dask, Spark, or equivalent)
- Familiarity with performance analysis and optimization tools and techniques (profilers, tracers, benchmarking)
- Strong systems programming background with evidence of performance-critical contributions (open source, papers, or production systems)
Preferred Qualifications:
- Experience with cloud-based HPC, including services on AWS (EC2 P4/P5 instances), GCP (A100/H100 VMs), or Azure (ND-series)
- Knowledge of networking and storage technologies in the context of high-performance computing (RDMA, NVMe, distributed filesystems, GPU-Direct Storage)
- Advanced GPU optimization experience including multi-GPU inference (model parallelism, pipeline parallelism), mixed-precision training/inference, and GPU profiling tools (NVIDIA Nsight, nvprof, PyTorch Profiler)
- Experience with ML infrastructure including model serving frameworks (vLLM, TensorRT-LLM, Triton Inference Server), GPU resource management (NVIDIA MIG, GPU time-slicing), and inference optimization (continuous batching, speculative decoding)
- Production experience with GPU monitoring and observability (DCGM, GPU metrics dashboards, cost-per-query optimization)
- Background in information retrieval or vector search (FAISS, HNSW, IVF indices, approximate nearest neighbor algorithms)
- Production experience with object storage (MinIO, S3, GCS) at petabyte scale
- Familiarity with specific technologies: Kafka, PostgreSQL, pgvector, Kubernetes, FluxCD, Yugabyte
- Contributions to open-source projects in the HPC, distributed systems, or vector search space
- Experience with on-premises enterprise deployments and air-gapped environments (government/defense sector)
Benefits for Full-Time Employees:
- Remote-first organization
- 100% company-paid Health/Dental/Vision benefits for you and your dependents
- Life Insurance, Short-term and Long-term Disability
- 401k
- Unlimited PTO
We are interested in every qualified candidate who is authorized to work in the United States. However, we are not able to sponsor or take over sponsorship of employment visas at this time.
Pryon will not consider race, religion, sex, sexual orientation, or national origin in ways that violate the nation's civil rights laws.