## Context
The README helpfully notes that vLLM requires 24GB+ VRAM and points users with lower-VRAM GPUs toward Ollama/LM Studio with GGUF quantized models. However, I ran into some difficulties getting this path to work end-to-end and wanted to share feedback that might help other users.
## Experience
I tried hosting the GGUF model via Ollama on an 8GB laptop GPU. While the server started, `fara-cli` failed on its first model call. Since Fara-7B is a vision-language model (Qwen2.5-VL) that sends base64-encoded screenshots via the OpenAI `image_url` content type on every step, it's possible that Ollama's OpenAI-compatible endpoint doesn't fully support this for Qwen2.5-VL GGUF models, though I'm not certain whether that or VRAM constraints was the actual root cause.
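To help isolate whether vision input is the problem, here is a minimal sketch of the kind of check I mean, run independently of `fara-cli` against Ollama's standard OpenAI-compatible endpoint. The model tag `fara-7b` and the file `screenshot.png` are placeholders for whatever `ollama list` reports and a local test image:

```bash
# Sketch: check whether the OpenAI-style image_url content type is accepted by
# Ollama's /v1/chat/completions endpoint for this model, outside of fara-cli.
# "fara-7b" and screenshot.png are placeholders.
IMG_B64=$(base64 < screenshot.png | tr -d '\n')

curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ollama" \
  -d @- <<EOF
{
  "model": "fara-7b",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this screenshot."},
        {"type": "image_url",
         "image_url": {"url": "data:image/png;base64,${IMG_B64}"}}
      ]
    }
  ]
}
EOF
```

If a request like this fails while a text-only request to the same endpoint succeeds, that would point at vision support in the endpoint rather than VRAM.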
It would help to know whether the team has validated this path end-to-end, and if so, what configuration was used.
## Suggestions for the documentation
### 1. Example commands
The section says to specify the correct `--base_url`, `--api_key`, and `--model`, but does not provide concrete values. Adding something like this would reduce trial and error:
```bash
ollama pull <exact_model_name>

fara-cli \
  --task "..." \
  --base_url http://localhost:11434/v1 \
  --api_key ollama \
  --model <model_name>
```
### 2. VRAM guidance
The advice to select the largest model that fits your GPU is reasonable, but a rough table would help users choose a quantization level more confidently:
| VRAM  | Suggested quantization | Notes               |
|-------|------------------------|---------------------|
| 8GB   | Q4_K_M (~4.5GB)        | Tight with KV cache |
| 12GB  | Q5_K_M / Q6_K          |                     |
| 16GB  | Q8_0 or FP16           |                     |
| 24GB+ | FP16 via vLLM          | Recommended path    |
### 3. Vision model compatibility note
Since GGUF quantization and llama.cpp-based servers may handle vision inputs differently than vLLM, it would help to clarify whether any quality or compatibility trade-offs should be expected compared to the vLLM path.
### 4. Modelfile reference
There is a `Modelfile` in the repository root that is not mentioned in the README. If it is intended for Ollama use, a short note linking to it would make the workflow clearer.
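For example, if the Modelfile is meant for this workflow, a short usage sketch along these lines would remove the ambiguity (assuming the Modelfile points at a locally downloaded GGUF; the `fara-7b` tag is a placeholder):

```bash
# Hypothetical usage of the repo's Modelfile with Ollama.
# Assumes the Modelfile references a locally downloaded GGUF;
# "fara-7b" is a placeholder tag.
ollama create fara-7b -f Modelfile
ollama run fara-7b "hello"   # quick sanity check before pointing fara-cli at it
```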
**Not a blocker.** Just sharing this in case it helps improve onboarding for users who start with the Ollama or LM Studio path.