Updated: 2026-05-01
Created: 2026-04
$ llama-cli -dev 'none' -m GPT-OSS-20B-UD-Q6_K_XL.gguf [...]
> describe tersely the english town of witney
[Start thinking] We need a terse description of the English town of Witney. Provide key facts: location, population, history, notable features: market town, Oxfordshire, River Thame, 2002 etc. Keep it concise. [End thinking]
**Witney** – a historic market town in north‑west Oxfordshire, England, lying on the River Thame just 18 mi west of Oxford. Population about 15,000, with a medieval core, Georgian town‑hall, and the 13th‑century St John‑the‑Divine church. Known for its textile heritage, annual market, and the nearby Cotswold hills.
[ Prompt: 35.0 t/s | Generation: 14.8 t/s ]
> /exit
Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
common_memory_breakdown_print: |   - Host       | 2224 = 1788 + 32 + 404 |
common_memory_breakdown_print: |   - CPU_REPACK | 9682 = 9682 + 0 + 0 |
$ llama-cli -dev 'Vulkan0' -m GPT-OSS-20B-UD-Q6_K_XL.gguf [...]
> describe tersely the english town of witney
[Start thinking] User asks: "describe tersely the english town of witney". They want a short description. Provide concise description: location, population, features: historic town, medieval, market town, etc. Provide maybe 2-3 sentences. Should be terse. [End thinking]
Witney is a historic market town in Oxfordshire, England, lying on the River Thames about 4 mi west of Oxford. With a population of around 12,000, it boasts a 13th‑century church and a well‑preserved medieval streetscape, as well as a thriving local market and a mix of traditional and modern amenities. Its riverside setting and proximity to Oxford make it a popular commuter and tourist destination.
[ Prompt: 43.6 t/s | Generation: 14.3 t/s ]
> /exit
Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
common_memory_breakdown_print: |   - Vulkan0 (Intel(R) Iris(R) Xe Graphics (ADL GT2)) | 23823 = 11899 + (11314 = 10883 + 32 + 398) + 610 |
common_memory_breakdown_print: |   - Host | 607 = 586 + 0 + 20 |
$ llama-cli -dev 'Vulkan1' -m GPT-OSS-20B-UD-Q6_K_XL.gguf [...]
> describe tersely the english town of witney
[Start thinking] User wants a terse description of the English town of Witney. Provide short but informative description. Probably mention it's in Oxfordshire, market town, famous for wool and clothing industry, historic architecture, and its location along the River Windrush, population around 30k, etc. Should be concise. Let's do it. [End thinking]
Witney is a historic market town in Oxfordshire, England, situated on the River Windrush. With a population of about 30,000, it’s famed for its long‑standing wool and cloth trade, a thriving heritage‑based tourism sector, and a well‑preserved town centre featuring Georgian, Tudor and 17th‑century buildings. The town sits roughly 17 mi west of Oxford and 20 mi north of the city of Reading, offering easy access to the UK motorway network.
[ Prompt: 20.4 t/s | Generation: 12.3 t/s ]
> /exit
Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
common_memory_breakdown_print: |   - Vulkan1 (RTX 3050 Ti Laptop GPU) | 4096 = 1004 + (2685 = 2255 + 32 + 398) + 405 |
common_memory_breakdown_print: |   - Host | 9236 = 9215 + 0 + 20 |
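The generation rates logged in the three runs above are remarkably close; a quick awk sketch of the ratios (numbers copied from the transcripts):

```shell
# Generation t/s from the three transcripts above, with the two
# Vulkan back-ends normalised against the CPU-only run.
awk 'BEGIN {
  cpu = 14.8; igpu = 14.3; dgpu = 12.3
  printf "Vulkan0 (iGPU) vs CPU: %.2fx\n", igpu / cpu
  printf "Vulkan1 (dGPU) vs CPU: %.2fx\n", dgpu / cpu
}'
```

That neither Vulkan device beats the CPU run suggests generation on this machine is bound by memory bandwidth rather than compute; only the prompt-processing rate differs much between devices.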
"right thinking" restrictions, sometimes customisation.
"fuzzy" (lossy), so walking it returns potential documents instead of actual documents.
"architecture" is the shape of the neural net and "model" is its instantiation with particular index values called "weights".
"model" is a lossy compressed snapshot of many books, journals and WWW sites (mainly Wikipedia, Reddit and similar).
Vulkan drivers of the "headless" or "server" variety for the embedded or discrete graphics card. For GNU/Linux Ubuntu:
# aptitude search '~nnvidia~i!~M!~ri386'
i   nvidia-headless-595-server   - NVIDIA headless metapackage
i   nvidia-utils-595-server      - NVIDIA Server Driver support binaries
Inferencer: a simple MS-Windows or GNU/Linux executable, of which there are many, like ollama, llama.cpp, vLLM, and many others; almost all support querying either local inferencers or remote ones via an HTTP API.
model into a single file, so it is very simple and it is becoming more popular.
"right thinking" responses.
"8B" or "120B" (up to over a trillion for online models) is how fine-grained the index built by training over the input data is: the more parameters, the less lossy the compression of the input data, so rarer information has better chances of being preserved.
"A3B" or "E4B" is how many parameters are likely to be used in a given query on a given type of topic, as the model has been partitioned into a so-called "mixture of experts".
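The active-parameter count is what bounds generation speed: every generated token has to stream the active weights through memory at least once, so memory bandwidth sets a ceiling. A back-of-the-envelope awk sketch (the bandwidth, active-parameter and bits-per-weight figures are illustrative assumptions, not measurements):

```shell
# Rough upper bound on generation speed for a mixture-of-experts model:
# tokens/s <= memory bandwidth / bytes of active weights per token.
awk 'BEGIN {
  bw     = 50e9     # assumed host memory bandwidth in bytes/s (illustrative)
  active = 3.6e9    # assumed active parameters per token (e.g. GPT-OSS-20B)
  bpw    = 4.35     # assumed average bits per weight after quantization
  printf "upper bound: ~%.0f tokens/s\n", bw / (active * bpw / 8)
}'
```

That ballpark is comfortably above the ~12-15 t/s seen in the transcripts, which is plausible since real decoding never reaches the streaming bound.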
"fuzzier" inference is, and the greater the chances that unrelated topics get confused by the inferencer (the "word-cone" into the clusters is less precise). Apart from width, quantization can also be static or dynamic, etc. My favourite quantizations are UD_IQ4_NL or IQ6_K_XL.
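The width a quantized file actually achieves can be checked against the memory breakdown logs above: divide the model's resident size by the parameter count. A small awk sketch (the 21e9 total-parameter figure for GPT-OSS-20B is an assumption; check the model card):

```shell
# Implied average bits per weight of the quantized file used above.
awk 'BEGIN {
  mib    = 10883    # "model" MiB from the Vulkan0 memory breakdown above
  params = 21e9     # assumed total parameter count for GPT-OSS-20B
  printf "~%.1f bits/weight\n", mib * 1048576 * 8 / params
}'
```

Well below the nominal ~6.5 bits/weight of a uniform Q6_K, consistent with a dynamic quantization varying the width per tensor.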
"cutoff date", which is the most recent date of training material.
"tokens per second" as to speed, and that depends on hardware and model size. In general 10 tps is a bit slow but often acceptable, 20 tps is fairly reasonable as it is most people's reading speed, and 80-100 tps feels almost instantaneous.
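Those rates translate into wall-clock time as follows, for a typical ~100-token answer like the ones in the transcripts above:

```shell
# Seconds to generate a 100-token reply at various generation speeds.
awk 'BEGIN {
  tokens = 100
  split("10 20 80", tps, " ")
  for (i = 1; i <= 3; i++)
    printf "%3d tps -> %5.1f s\n", tps[i], tokens / tps[i]
}'
```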
"dense", so it requires much more fast memory.
Context: the history of a chat that gets sent in its entirety to the LLM as a search phrase.
RAG: adding some files to the context that also get searched at inference time.
export LLAMA_ARG_OFFLINE='true'
export LLAMA_ARG_TOOLS='read_file'
export LLAMA_ARG_HOST='127.0.0.1'
export LLAMA_ARG_PORT='11434'
export LLAMA_ARG_CTX_SIZE='4000'
export LLAMA_ARG_REASONING='off'
export LLAMA_ARG_N_PARALLEL='1'
export LLAMA_ARG_MODELS_MAX='1'
export LLAMA_ARG_MODELS_DIR='/var/data/models/links'
export LLAMA_ARG_FLASH_ATTN='on'
export LLAMA_ARG_CACHE_TYPE_K='q8_0'
export LLAMA_ARG_CACHE_TYPE_V='q4_0'
export LLAMA_ARG_MMAP='0'
export LLAMA_ARG_MLOCK='0'
export LLAMA_ARG_NUMA='isolate'
export LLAMA_ARG_LOG_VERBOSITY='3'
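With these variables exported, llama-server can be started without command-line flags, as it reads its configuration from the LLAMA_ARG_* environment. A minimal sketch of starting it and querying the OpenAI-compatible endpoint on the configured host and port (the prompt is just an illustration; a model must be present in the configured models directory):

```shell
# The server picks up its settings from the LLAMA_ARG_* environment
# variables exported above, so no flags are needed here.
llama-server &

# Once it is up, query the OpenAI-compatible chat endpoint.
curl -s http://127.0.0.1:11434/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"messages": [{"role": "user", "content": "describe tersely the english town of witney"}]}'
```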