Salary range: 250-400k net for full-time
Format: full-time or part-time (20 hours/week), remote (including from outside the Russian Federation)
Description: We run cool research projects as a team (mil-team.ru), publish scientific articles, and develop the compressa.ai product for optimizing LLM inference. We make LLMs (and not only LLMs) run cheaper and faster. We are looking for a strong, proactive specialist to join the core team.
What problems will you solve:
- Develop our LLM inference engine (making it faster than vLLM);
- Propose improvements to LLM compression methods and bring them to market;
- Integrate SotA inference and model-compression techniques into the platform (a toy sketch of one such technique follows this list).
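To give a concrete sense of what "compression methods" means here, below is a minimal sketch of symmetric per-channel INT8 weight quantization, one of the techniques listed above. This is a toy numpy illustration, not code from compressa.ai; the function names and shapes are made up.

    import numpy as np

    def quantize_int8(w: np.ndarray):
        """Quantize a weight matrix to int8, one scale per output channel (row)."""
        # Map the largest |w| in each row to the int8 limit 127.
        scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
        scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero rows
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
        return q.astype(np.float32) * scale

    w = np.random.randn(4096, 11008).astype(np.float32)  # arbitrary example shape
    q, scale = quantize_int8(w)
    print("mean abs error:", float(np.abs(w - dequantize(q, scale)).mean()))

Per-channel (rather than per-tensor) scales are the standard first step, since a single outlier channel would otherwise dominate one global scale; production methods (GPTQ, AWQ, etc.) go well beyond this.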
Your experience:
- You have used and modified neural-network inference frameworks (ONNX, TensorRT-LLM, llama.cpp, vLLM) and written custom CUDA kernels;
- You have applied and modified LLM optimization methods and know the SotA: sparsification, quantization, distillation;
- You have trained LoRA adapters or fine-tuned LLMs (a minimal LoRA sketch follows below).
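For the last item, here is a minimal LoRA sketch in PyTorch: a trainable low-rank adapter on top of a frozen linear layer. The rank, alpha, and layer sizes are illustrative assumptions, not values from our stack.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen base linear layer plus a trainable low-rank update."""
        def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)  # pretrained weights stay frozen
            # Effective weight: W + (alpha / r) * B @ A. B starts at zero,
            # so training begins from the unmodified base model.
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))
            self.scaling = alpha / r

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

    layer = LoRALinear(nn.Linear(4096, 4096))  # hypothetical layer size
    print(layer(torch.randn(2, 4096)).shape)   # torch.Size([2, 4096])

In practice you would wrap the attention projections of a pretrained model (e.g. via the peft library) and train only A and B, which is why LoRA fine-tuning fits on modest hardware.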
Next steps:
- Attach your CV (it would be great if it includes links to your open-source code in a git repo);
- Add a brief description (up to 3 paragraphs) of your experience in this area;
- Add a short motivation note (1 paragraph) on why we are a good match;
- We will review your application and invite you to an interview with the team.