RK.MD LLM

Over the last month, I’ve been slowly refining a workflow that lets me run a private large language model (LLM) on my NVIDIA DGX Spark. I wanted a model that runs entirely on my own hardware, never leaves my network, and feels responsive enough for writing, coding, and day-to-day experimentation.

GPT-OSS:120B has become the backbone of this setup. TensorRT-LLM (TRT-LLM) pairs the TensorRT deep learning compiler with optimized kernels and pre- and post-processing steps, tuned for NVIDIA hardware like the DGX Spark that powers my homelab. Once the engine is built, inference runs with remarkable efficiency, keeping latency low even at long context lengths.

RK.MD LLM in use.

The model loads into VRAM at boot through a simple systemd service that launches TRT-LLM in the background. By the time I log in, the system is already live. I can point OpenWebUI or any local client at the endpoint without touching Docker or spinning up containers.
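As a rough sketch of what such a service can look like (the unit name, paths, user, and flags below are illustrative assumptions, not my exact config), assuming TensorRT-LLM's `trtllm-serve` entry point:

```ini
# /etc/systemd/system/trtllm.service -- illustrative sketch, not the exact unit
[Unit]
Description=TensorRT-LLM server for GPT-OSS:120B
After=network-online.target
Wants=network-online.target

[Service]
# Binary path, model path, and listen address are assumptions; adjust for your install
ExecStart=/usr/local/bin/trtllm-serve /models/gpt-oss-120b --host 0.0.0.0 --port 8000
Restart=on-failure
User=llm

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now trtllm.service`, and a client like OpenWebUI can then be pointed at the server's OpenAI-compatible endpoint (e.g. `http://<host>:8000/v1`).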

So who cares about this? Why not just use Gemini, ChatGPT, Claude, etc.?

Well, for one, running the model on my own hardware puts the processing in my line of sight. That’s huge for security and privacy. The entire stack – DGX Spark, TRT-LLM, GPT-OSS – becomes a single, controllable tool that fits into my clinical work, teaching, and app development.

Plus, I’m a nerd, so I think self-hosting is awesome. 😉
