
Allow multiple model slots even when detected VRAM is 0 #453


Description

@krissetto

Right now, if the detected VRAM is 0 MB, we only allow one model to run at a time. In workflows that need to query multiple models this is quite a limitation: DMR constantly swaps the models in and out of memory even when they would both fit, thrashing storage and hugely slowing down the UX.

In my specific case, I'm on an AMD Strix Halo with 128 GB of RAM, using Vulkan with the llama.cpp backend.

I think it'd be nice if either:

  • DMR didn't limit the number of model slots; or
  • there was an option to configure this behavior system-wide (a rough sketch of what I mean is below).
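
To illustrate the second option only: the snippet below is purely hypothetical (the `DMR_MAX_MODEL_SLOTS` variable and both function names are made up, not existing DMR code), but it sketches how a system-wide override could take precedence when VRAM detection reports 0 instead of forcing a single slot.

```go
package scheduling

import (
	"os"
	"strconv"
)

// maxModelSlots decides how many models may be resident at once.
// detectedVRAMMB is whatever the backend probe reported (0 on Strix Halo
// with Vulkan today). DMR_MAX_MODEL_SLOTS and this function are
// hypothetical names used only to sketch the desired behavior.
func maxModelSlots(detectedVRAMMB uint64) int {
	// An explicit system-wide override always wins.
	if v, ok := os.LookupEnv("DMR_MAX_MODEL_SLOTS"); ok {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	// VRAM unknown (reported as 0): today this forces a single slot,
	// which causes constant model swapping on unified-memory machines.
	if detectedVRAMMB == 0 {
		return 1 // current behavior this issue questions
	}
	// Otherwise size the slot count from detected VRAM as before.
	return slotsFromVRAM(detectedVRAMMB)
}

// slotsFromVRAM is a stand-in for whatever heuristic DMR already uses.
func slotsFromVRAM(vramMB uint64) int {
	if vramMB < 8192 {
		return 1
	}
	return int(vramMB / 8192)
}
```

An environment variable is just one possibility; a field in the engine/runner configuration would work equally well, as long as it's a single system-wide knob.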

Any thoughts around this?


Labels: enhancement, question
