## Context
The README helpfully notes that vLLM requires 24GB+ VRAM and points users with lower-VRAM GPUs toward Ollama/LM Studio with GGUF quantized models. However, I ran into some difficulties getting this path to work end-to-end and wanted to share feedback that might help other users.
## Experience
I tried hosting the GGUF model via Ollama on an 8GB laptop GPU. While the server started, `fara-cli` failed on its first model call. Since Fara-7B is a vision-language model (Qwen2.5-VL) that sends base64-encoded screenshots via the OpenAI `image_url` content type on every step, it's possible that Ollama's OpenAI-compatible endpoint doesn't fully support this for Qwen2.5-VL GGUF models, though I'm not certain whether that or VRAM constraints was the actual root cause.
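To help isolate whether vision input is the problem, here is a minimal sketch of the kind of check I mean, run independently of `fara-cli` against Ollama's standard OpenAI-compatible endpoint. The model tag `fara-7b` and the file `screenshot.png` are placeholders for whatever `ollama list` reports and a local test image:

```bash
# Sketch: check whether the OpenAI-style image_url content type is accepted by
# Ollama's /v1/chat/completions endpoint for this model, outside of fara-cli.
# "fara-7b" and screenshot.png are placeholders.
IMG_B64=$(base64 < screenshot.png | tr -d '\n')

curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ollama" \
  -d @- <<EOF
{
  "model": "fara-7b",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this screenshot."},
        {"type": "image_url",
         "image_url": {"url": "data:image/png;base64,${IMG_B64}"}}
      ]
    }
  ]
}
EOF
```

If a request like this fails while a text-only request to the same endpoint succeeds, that would point at vision support in the endpoint rather than VRAM.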
It would help to know whether the team has validated this path end-to-end, and if so, what configuration was used.
## Suggestions for the documentation
### 1. Example commands
The section says to specify the correct `--base_url`, `--api_key`, and `--model`, but does not provide concrete values. Adding something like this would reduce trial and error:
```bash
ollama pull <exact_model_name>

fara-cli \
  --task "..." \
  --base_url http://localhost:11434/v1 \
  --api_key ollama \
  --model <model_name>
```
### 2. VRAM guidance
The advice to select the largest model that fits your GPU is reasonable, but a rough table would help users choose a quantization level more confidently:
| VRAM  | Suggested quantization | Notes               |
|-------|------------------------|---------------------|
| 8GB   | Q4_K_M (~4.5GB)        | Tight with KV cache |
| 12GB  | Q5_K_M / Q6_K          |                     |
| 16GB  | Q8_0 or FP16           |                     |
| 24GB+ | FP16 via vLLM          | Recommended path    |
### 3. Vision model compatibility note
Since GGUF quantization and llama.cpp-based servers may handle vision inputs differently than vLLM, it would help to clarify whether any quality or compatibility trade-offs should be expected compared to the vLLM path.
### 4. Modelfile reference
There is a `Modelfile` in the repository root that is not mentioned in the README. If it is intended for Ollama use, a short note linking to it would make the workflow clearer.
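For example, if the Modelfile is meant for this workflow, a short usage sketch along these lines would remove the ambiguity (assuming the Modelfile points at a locally downloaded GGUF; the `fara-7b` tag is a placeholder):

```bash
# Hypothetical usage of the repo's Modelfile with Ollama.
# Assumes the Modelfile references a locally downloaded GGUF;
# "fara-7b" is a placeholder tag.
ollama create fara-7b -f Modelfile
ollama run fara-7b "hello"   # quick sanity check before pointing fara-cli at it
```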
**Not a blocker.** Just sharing this in case it helps improve onboarding for users who start with the Ollama or LM Studio path.