The best way to deploy UMT5 variants into production with low-latency inference?

by Respair - opened Jul 26, 2024

edited Jul 26, 2024

This is such a neat model, but I don't see it being supported by most frameworks since it uses a different sampling method.

Can you recommend any way to deploy this model (by this, I mean the model we fine-tune on the downstream task) into production? And possibly a trivial way to convert it to ONNX? Optimum doesn't support it just yet. Preferably something that relies on GPUs.

Man, I really wish there was a vLLM for such seq2seq models. Their potential is so underrated. If this tiny voice of mine can be heard by the big guys at Google, please create a framework that makes it easier to use seq2seq models!
