Updated: 2026-05-01
Created: 2026-04
$ llama-cli -dev 'none' -m GPT-OSS-20B-UD-Q6_K_XL.gguf [...]
> describe tersely the english town of witney
[Start thinking] We need a terse description of the English town of Witney. Provide key facts: location, population, history, notable features: market town, Oxfordshire, River Thame, 2002 etc. Keep it concise. [End thinking]
**Witney** – a historic market town in north‑west Oxfordshire, England, lying on the River Thame just 18 mi west of Oxford. Population about 15,000, with a medieval core, Georgian town‑hall, and the 13th‑century St John‑the‑Divine church. Known for its textile heritage, annual market, and the nearby Cotswold hills.
[ Prompt: 35.0 t/s | Generation: 14.8 t/s ]
> /exit
Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
common_memory_breakdown_print: |   - Host       | 2224 = 1788 + 32 + 404 |
common_memory_breakdown_print: |   - CPU_REPACK | 9682 = 9682 + 0 + 0 |
$ llama-cli -dev 'Vulkan0' -m GPT-OSS-20B-UD-Q6_K_XL.gguf [...]
> describe tersely the english town of witney
[Start thinking] User asks: "describe tersely the english town of witney". They want a short description. Provide concise description: location, population, features: historic town, medieval, market town, etc. Provide maybe 2-3 sentences. Should be terse. [End thinking]
Witney is a historic market town in Oxfordshire, England, lying on the River Thames about 4 mi west of Oxford. With a population of around 12,000, it boasts a 13th‑century church and a well‑preserved medieval streetscape, as well as a thriving local market and a mix of traditional and modern amenities. Its riverside setting and proximity to Oxford make it a popular commuter and tourist destination.
[ Prompt: 43.6 t/s | Generation: 14.3 t/s ]
> /exit
Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
common_memory_breakdown_print: |   - Vulkan0 (Intel(R) Iris(R) Xe Graphics (ADL GT2)) | 23823 = 11899 + (11314 = 10883 + 32 + 398) + 610 |
common_memory_breakdown_print: |   - Host | 607 = 586 + 0 + 20 |
$ llama-cli -dev 'Vulkan1' -m GPT-OSS-20B-UD-Q6_K_XL.gguf [...]
> describe tersely the english town of witney
[Start thinking] User wants a terse description of the English town of Witney. Provide short but informative description. Probably mention it's in Oxfordshire, market town, famous for wool and clothing industry, historic architecture, and its location along the River Windrush, population around 30k, etc. Should be concise. Let's do it. [End thinking]
Witney is a historic market town in Oxfordshire, England, situated on the River Windrush. With a population of about 30,000, it’s famed for its long‑standing wool and cloth trade, a thriving heritage‑based tourism sector, and a well‑preserved town centre featuring Georgian, Tudor and 17th‑century buildings. The town sits roughly 17 mi west of Oxford and 20 mi north of the city of Reading, offering easy access to the UK motorway network.
[ Prompt: 20.4 t/s | Generation: 12.3 t/s ]
> /exit
Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
common_memory_breakdown_print: |   - Vulkan1 (RTX 3050 Ti Laptop GPU) | 4096 = 1004 + (2685 = 2255 + 32 + 398) + 405 |
common_memory_breakdown_print: |   - Host | 9236 = 9215 + 0 + 20 |
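The generation rates logged in the three runs above are remarkably close; a quick awk sketch of the ratios (numbers copied from the transcripts):

```shell
# Generation t/s from the three transcripts above, with the two
# Vulkan back-ends normalised against the CPU-only run.
awk 'BEGIN {
  cpu = 14.8; igpu = 14.3; dgpu = 12.3
  printf "Vulkan0 (iGPU) vs CPU: %.2fx\n", igpu / cpu
  printf "Vulkan1 (dGPU) vs CPU: %.2fx\n", dgpu / cpu
}'
```

That neither Vulkan device beats the CPU run suggests generation on this machine is bound by memory bandwidth rather than compute; only the prompt-processing rate differs much between devices.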
"right thinking" restrictions, sometimes customisation.
"fuzzy" (lossy), so walking it returns potential documents instead of actual documents.
"architecture" is the shape of the neural net and "model" is its instantiation with particular index values called "weights".
"model" is a lossy compressed snapshot of many books, journals and WWW sites (mainly Wikipedia, Reddit and similar).
Vulkan drivers of the "headless" or "server" variety for the embedded or discrete graphics card. For GNU/Linux Ubuntu:
# aptitude search '~nnvidia~i!~M!~ri386'
i   nvidia-headless-595-server   - NVIDIA headless metapackage
i   nvidia-utils-595-server      - NVIDIA Server Driver support binaries
Inferencer: a simple MS-Windows or GNU/Linux executable, of which there are many, like ollama, llama.cpp, vLLM, and many others; almost all support querying either local inferencers or remote ones via an HTTP API.
model into a single file, so it is very simple and it is becoming more popular.
"right thinking" responses.
"8B" or "120B" (up to over a trillion for online models) is how fine-grained the index built by training over the input data is: the more parameters, the less lossy the compression of the input data, so rarer information has better chances of being preserved.
"A3B" or "E4B" is how many parameters are likely to be used in a given query on a given type of topic, as the model has been partitioned into a so-called "mixture of experts".
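The active-parameter count is what bounds generation speed: every generated token has to stream the active weights through memory at least once, so memory bandwidth sets a ceiling. A back-of-the-envelope awk sketch (the bandwidth, active-parameter and bits-per-weight figures are illustrative assumptions, not measurements):

```shell
# Rough upper bound on generation speed for a mixture-of-experts model:
# tokens/s <= memory bandwidth / bytes of active weights per token.
awk 'BEGIN {
  bw     = 50e9     # assumed host memory bandwidth in bytes/s (illustrative)
  active = 3.6e9    # assumed active parameters per token (e.g. GPT-OSS-20B)
  bpw    = 4.35     # assumed average bits per weight after quantization
  printf "upper bound: ~%.0f tokens/s\n", bw / (active * bpw / 8)
}'
```

That ballpark is comfortably above the ~12-15 t/s seen in the transcripts, which is plausible since real decoding never reaches the streaming bound.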
"fuzzier" inference is, and the greater the chances that unrelated topics get confused by the inferencer (the "word-cone" into the clusters is less precise). Apart from width, quantization can also be static or dynamic, etc. My favourite quantizations are UD_IQ4_NL or IQ6_K_XL.
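The width a quantized file actually achieves can be checked against the memory breakdown logs above: divide the model's resident size by the parameter count. A small awk sketch (the 21e9 total-parameter figure for GPT-OSS-20B is an assumption; check the model card):

```shell
# Implied average bits per weight of the quantized file used above.
awk 'BEGIN {
  mib    = 10883    # "model" MiB from the Vulkan0 memory breakdown above
  params = 21e9     # assumed total parameter count for GPT-OSS-20B
  printf "~%.1f bits/weight\n", mib * 1048576 * 8 / params
}'
```

Well below the nominal ~6.5 bits/weight of a uniform Q6_K, consistent with a dynamic quantization varying the width per tensor.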
"cutoff date", which is the most recent date of training material.
"tokens per second" as to speed, and that depends on hardware and model size. In general 10 tps is a bit slow but often acceptable, 20 tps is fairly reasonable as it is most people's reading speed, and 80-100 tps feels almost instantaneous.
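Those rates translate into wall-clock time as follows, for a typical ~100-token answer like the ones in the transcripts above:

```shell
# Seconds to generate a 100-token reply at various generation speeds.
awk 'BEGIN {
  tokens = 100
  split("10 20 80", tps, " ")
  for (i = 1; i <= 3; i++)
    printf "%3d tps -> %5.1f s\n", tps[i], tokens / tps[i]
}'
```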
"dense", so it requires much more fast memory.
Context: the history of a chat that gets sent in its entirety to the LLM as a search phrase.
RAG: adding some files to the context that also get searched at inference time.
export LLAMA_ARG_OFFLINE='true'
export LLAMA_ARG_TOOLS='read_file'
export LLAMA_ARG_HOST='127.0.0.1'
export LLAMA_ARG_PORT='11434'
export LLAMA_ARG_CTX_SIZE='4000'
export LLAMA_ARG_REASONING='off'
export LLAMA_ARG_N_PARALLEL='1'
export LLAMA_ARG_MODELS_MAX='1'
export LLAMA_ARG_MODELS_DIR='/var/data/models/links'
export LLAMA_ARG_FLASH_ATTN='on'
export LLAMA_ARG_CACHE_TYPE_K='q8_0'
export LLAMA_ARG_CACHE_TYPE_V='q4_0'
export LLAMA_ARG_MMAP='0'
export LLAMA_ARG_MLOCK='0'
export LLAMA_ARG_NUMA='isolate'
export LLAMA_ARG_LOG_VERBOSITY='3'
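With these variables exported, llama-server can be started without command-line flags, as it reads its configuration from the LLAMA_ARG_* environment. A minimal sketch of starting it and querying the OpenAI-compatible endpoint on the configured host and port (the prompt is just an illustration; a model must be present in the configured models directory):

```shell
# The server picks up its settings from the LLAMA_ARG_* environment
# variables exported above, so no flags are needed here.
llama-server &

# Once it is up, query the OpenAI-compatible chat endpoint.
curl -s http://127.0.0.1:11434/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"messages": [{"role": "user", "content": "describe tersely the english town of witney"}]}'
```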