llm.rb

deepdive

This guide is the practical companion to the main README. The README explains what llm.rb is. This document shows how to use it.

Mental Model

Everything in llm.rb builds on three concepts:

  • Provider: the model backend
  • Context: the execution state
  • Tools: external work the model can request

Most features extend these, rather than introducing new abstractions.

Providers

Start with a provider and a context. From there, you can add schemas, tools, MCP, persistence, streaming, and other features without changing the overall shape of the code.

In llm.rb, LLM::Context is the main execution boundary. It keeps message history, provider params, tool state, and usage together, so you can keep building on the same object instead of switching to a different abstraction for each feature.

Defaults set on the context are not fixed. You can override them for a single talk or respond call by passing request params directly. That keeps stable defaults at the context level while letting one turn change things like model, schema, tools, or stream.

Supported Providers

llm.rb supports multiple LLM providers behind one API surface:

  • OpenAI (LLM.openai)
  • Anthropic (LLM.anthropic)
  • Google (LLM.google)
  • DeepSeek (LLM.deepseek)
  • xAI (LLM.xai)
  • zAI (LLM.zai)
  • Ollama (LLM.ollama)
  • Llama.cpp (LLM.llamacpp)

Basic Context

At the simplest level, any object that implements #<< can receive visible output as it arrives. That includes $stdout, StringIO, files, sockets, and other Ruby IO-style objects.

This is the smallest complete llm.rb loop: a provider, a context, and a place for streamed output to go. Once that is in place, the rest of the library builds outward from the same pattern:

#!/usr/bin/env ruby
require "llm"

llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, stream: $stdout)

loop do
  print "> "
  ctx.talk(STDIN.gets || break)
  puts
end

Context defaults can still be overridden on a single turn. That is useful when most turns should share one setup, but a specific request needs a different model, schema, tool set, or stream target:

#!/usr/bin/env ruby
require "llm"

llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, model: "gpt-4.1-mini", stream: $stdout)

ctx.talk("Answer normally.")
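# Override schema, stream, and model for this turn only.
# Report is an LLM::Schema subclass (see Structured Outputs).
ctx.talk("Now return JSON.", schema: Report, stream: nil, model: "gpt-4.1")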

Responses API

llm.rb also supports OpenAI's Responses API through LLM::Context with mode: :responses. The important switch is store:. With store: false, the Responses API stays stateless while still using the Responses endpoint. With store: true, OpenAI can keep response state server-side and reduce how much conversation state needs to be sent on each turn.

Use this when you want the Responses API specifically, not just normal chat completions. llm.rb keeps it behind the same context interface so the rest of your application code does not need to change much:

#!/usr/bin/env ruby
require "llm"

llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, mode: :responses, store: false)

ctx.talk("Your task is to answer the user's questions", role: :developer)
res = ctx.talk("What is the capital of France?")
puts res.content

Responses

The response side follows the same idea as the rest of llm.rb. APIs that make model requests return LLM::Response objects as a common base shape, then layer on extra behavior when an endpoint or provider needs something more specific.

That base wrapper still keeps the raw Net::HTTPResponse on response.res, so normalization does not cut you off from low-level HTTP access when you need to inspect headers, status, or other transport details.

Some response adapters also add Enumerable, so list-style and search-style results can often be iterated directly without reaching into response.data.
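
As a rough illustration of what that low-level access looks like, the sketch below builds a Net::HTTPResponse by hand with the standard library and reads its status and headers. With llm.rb you would read the same kind of object from response.res instead of constructing it. The x-request-id header name and its value are made up for the example:

```ruby
require "net/http"

# Stand-in for `response.res`: a Net::HTTPResponse built by hand.
# In real code llm.rb would hand you this object after a request.
raw = Net::HTTPOK.new("1.1", "200", "OK")
raw["content-type"] = "application/json"
raw["x-request-id"] = "req-123" # hypothetical header and value

# Status and headers are read with the usual Net::HTTP accessors
status = raw.code
request_id = raw["x-request-id"]
headers = raw.to_hash # header name => array of values

puts "status=#{status} request_id=#{request_id}"
```

Note that Net::HTTP returns the status code as a String and stores header names in lowercase.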

Streaming

Streaming ranges from plain visible output to structured callbacks that can drive tool execution while the model is still responding.

The simple form is just an object that implements #<<. The advanced form is LLM::Stream, which gives you explicit callbacks for visible output, reasoning output, and tool-call lifecycle events.

Basic Streaming

At the lowest level, any object that responds to #<< can receive visible output chunks.

This is the easiest way to make the model feel responsive. It works well for CLI tools, logs, and any interface where plain visible output is enough:

#!/usr/bin/env ruby
require "llm"

llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, stream: $stdout)
ctx.talk("Explain how TCP keepalive works in one paragraph.")
puts
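
Because the only requirement is #<<, a stream target can also be a small object of your own. The sketch below is a hypothetical TeeSink (not part of llm.rb) that forwards each chunk to several targets, for example the terminal and an in-memory transcript. The chunks are simulated here so the example runs without a provider:

```ruby
require "stringio"

# Hypothetical helper: forwards every chunk to several targets.
# Anything that responds to #<< works as a target.
class TeeSink
  def initialize(*targets)
    @targets = targets
  end

  # Return self so calls can be chained like IO#<<
  def <<(chunk)
    @targets.each { |target| target << chunk }
    self
  end
end

transcript = StringIO.new
sink = TeeSink.new($stdout, transcript)

# Simulated chunks; with llm.rb you would instead pass
# `stream: sink` to LLM::Context.new
sink << "Hello, " << "world\n"
```

After the run, transcript.string holds the same text that was printed.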

Advanced Streaming

Use LLM::Stream when you want structured callbacks such as on_content, on_reasoning_content, on_tool_call, and on_tool_return.

This is the version to use when streaming is part of control flow, not just presentation. It lets your code react to output, reasoning, and tool events as they happen:

#!/usr/bin/env ruby
require "llm"

class Stream < LLM::Stream
  def on_content(content)
    $stdout << content
  end

  def on_reasoning_content(content)
    $stderr << content
  end

  def on_tool_call(tool, error)
    $stdout << "Running tool #{tool.name}\n"
    queue << (error || tool.spawn(:thread))
  end

  def on_tool_return(tool, ret)
    if ret.error?
      $stdout << "Tool #{tool.name} failed\n"
    else
      $stdout << "Finished tool #{tool.name}\n"
    end
  end
end

llm = LLM.openai(key: ENV["KEY"])
# System is the LLM::Tool subclass defined under Tool Calling
ctx = LLM::Context.new(llm, stream: Stream.new, tools: [System])

ctx.talk("Run `date` and `uname -a`.")
ctx.talk(ctx.wait(:thread)) while ctx.functions.any?

Reasoning

Some providers expose model reasoning separately from visible assistant output. llm.rb lets you handle that in two ways: stream it as it arrives, or read it from the final response when the provider includes it.

This is part of the normal response model. Completion-style responses expose reasoning_content, and providers that stream reasoning can emit it incrementally through LLM::Stream#on_reasoning_content.

Stream Reasoning Output

Use LLM::Stream#on_reasoning_content when you want reasoning output as a separate stream.

If the provider emits reasoning incrementally, this lets you surface or log it without mixing it into the assistant-visible response stream:

#!/usr/bin/env ruby
require "llm"

class Stream < LLM::Stream
  def on_content(content)
    $stdout << content
  end

  def on_reasoning_content(content)
    $stderr << content
  end
end

llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, stream: Stream.new)
ctx.talk("Solve 17 * 19 and show your work.")

Read Reasoning From The Response

When a provider includes reasoning content in the final completion, it is also available on the response object.

This is useful when you want the final response first and only inspect the reasoning afterward, for example in debugging or offline analysis:

#!/usr/bin/env ruby
require "llm"

llm = LLM.llamacpp(url: ENV["URL"])
ctx = LLM::Context.new(llm)
res = ctx.talk("Solve 17 * 19 and show your work.")

puts res.content
puts res.reasoning_content

Structured Outputs

The LLM::Schema system lets you define JSON schemas for structured outputs. Schemas can be defined as classes with property declarations or built programmatically using a fluent interface. When you pass a schema to a context, llm.rb adapts it into the provider's structured-output format when that provider supports one.

The useful part is that the schema stays in Ruby. You describe the shape once, attach it to the context, and let llm.rb adapt it to the provider API instead of hand-writing JSON Schema payloads for each request:

#!/usr/bin/env ruby
require "llm"
require "pp"

class Report < LLM::Schema
  property :category, Enum["performance", "security", "outage"], "Report category", required: true
  property :summary, String, "Short summary", required: true
  property :impact, OneOf[String, Integer], "Primary impact, as text or a count", required: true
  property :services, Array[String], "Impacted services", required: true
  property :timestamp, String, "When it happened", optional: true
end

llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, schema: Report)
res = ctx.talk("Structure this report: 'Database latency spiked at 10:42 UTC, causing 5% request timeouts for 12 minutes.'")
pp res.content!
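
For comparison, this is roughly the kind of JSON Schema you would otherwise write by hand for the Report class above. It is an approximation built from plain hashes, not the exact payload llm.rb generates, and providers differ in how they wrap a schema inside a request:

```ruby
require "json"

# Hand-written approximation of the Report schema as JSON Schema.
# Not llm.rb output; shown only to illustrate what the class replaces.
report_schema = {
  type: "object",
  properties: {
    category: {type: "string", enum: %w[performance security outage], description: "Report category"},
    summary: {type: "string", description: "Short summary"},
    impact: {oneOf: [{type: "string"}, {type: "integer"}], description: "Primary impact, as text or a count"},
    services: {type: "array", items: {type: "string"}, description: "Impacted services"},
    timestamp: {type: "string", description: "When it happened"}
  },
  required: %w[category summary impact services]
}

puts JSON.pretty_generate(report_schema)
```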

Fluent Schemas

If you do not want a class, you can build the schema inline.

This style is useful for one-off workflows or dynamic schemas that do not need their own constant:

#!/usr/bin/env ruby
require "llm"
require "pp"

schema = LLM::Schema.new.object(
  category: LLM::Schema.new.string.enum("performance", "security", "outage").required,
  summary: LLM::Schema.new.string.required,
  services: LLM::Schema.new.array(LLM::Schema.new.string).required
)

llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, schema:)
res = ctx.talk("Structure this report: 'API latency spiked for the billing service.'")
pp res.content!

Persistence

Contexts can be serialized and restored across process boundaries. That gives you a straightforward way to persist long-lived conversation state between requests, jobs, retries, or deployments.

This works because LLM::Context already holds the state that matters: messages, tool returns, usage, and provider-facing parameters. Persistence is therefore mostly about choosing where to store that snapshot.

Save To A File

File-based persistence is the simplest way to see how context serialization works. It is useful for scripts, local tools, and any workflow where a JSON snapshot is enough.

#!/usr/bin/env ruby
require "llm"

llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm)
ctx.talk("Hello")
ctx.talk("Remember that my favorite language is Ruby")

# Round-trip through a JSON string
payload = ctx.to_json

restored = LLM::Context.new(llm)
restored.restore(string: payload)
puts restored.talk("What is my favorite language?").content

# Round-trip through a file
ctx.save(path: "context.json")

restored = LLM::Context.new(llm)
restored.restore(path: "context.json")
puts restored.talk("What is my favorite language?").content
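
If you write the JSON string to disk yourself rather than calling ctx.save, a crash in the middle of a write can leave a truncated file. One common way around that is to write to a temporary file and rename it into place. The sketch below uses only the standard library. write_snapshot is a hypothetical helper, and the payload is a stand-in for the string returned by ctx.to_json, not the real snapshot format:

```ruby
require "json"
require "tmpdir"

# Hypothetical helper: write to a temp file, then rename into place.
# A rename within one filesystem replaces the target in a single step,
# so readers see either the old snapshot or the new one.
def write_snapshot(path, payload)
  tmp = "#{path}.#{Process.pid}.tmp"
  File.write(tmp, payload)
  File.rename(tmp, path)
  path
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "context.json")
  payload = JSON.generate({"messages" => []}) # stand-in for ctx.to_json
  write_snapshot(path, payload)
  puts File.read(path)
end
```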

Persist With ActiveRecord

In Rails or ActiveRecord, a small wrapper is enough to persist context state between requests or jobs.

The key idea is that the database record owns the serialized context, while the LLM::Context instance is rebuilt on demand and flushed back after each turn:

# Migration
create_table :contexts do |t|
  t.jsonb :snapshot
  t.string :provider, null: false
  t.timestamps
end

# Model
class Context < ApplicationRecord
  def talk(...)
    ctx.talk(...).tap { flush }
  end

  def wait(...)
    ctx.wait(...).tap { flush }
  end

  def messages
    ctx.messages
  end

  def model
    ctx.model
  end

  def flush
    update_column(:snapshot, ctx.to_json)
  end

  private

  def ctx
    @ctx ||= begin
      ctx = LLM::Context.new(llm)
      ctx.restore(string: snapshot) if snapshot
      ctx
    end
  end

  def llm
    LLM.method(provider).call(key: ENV.fetch(key))
  end

  def key
    "#{provider.upcase}_KEY"
  end
end

Tools

Tools in llm.rb can be defined as classes inheriting from LLM::Tool or as closures using LLM.function. The same execution model covers provider tool calls, local tools, and MCP-exposed tools.

At the context level, tool execution is explicit. The model can request work, the context records pending functions, and your code decides when to execute them and feed the results back in.

Tool Calling

When the LLM requests a tool call, the context stores Function objects in ctx.functions. ctx.call(:functions) executes the pending work and returns the results, which you then pass to talk so the model can see them.

This explicit flow is one of the main design choices in llm.rb. The model can request work, but your code stays in control of when that work runs and how its results get fed back in:

#!/usr/bin/env ruby
require "llm"

class System < LLM::Tool
  name "system"
  description "Run a shell command"
  param :command, String, "Command to execute", required: true

  def call(command:)
    {success: system(command)}
  end
end

llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, stream: $stdout, tools: [System])
ctx.talk("Run `date`.")
ctx.talk(ctx.call(:functions)) while ctx.functions.any?
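
The System tool above reports only whether the command succeeded. If the model should also see what the command printed, the tool body could capture the output instead. This is a minimal sketch using Open3 from the standard library. run_command is a hypothetical helper that you might call from a tool's call method, and the example assumes an echo command is available:

```ruby
require "open3"

# Hypothetical helper: run a command and capture its combined
# stdout and stderr along with the exit status.
def run_command(command)
  output, status = Open3.capture2e(command)
  {success: status.success?, output: output}
rescue SystemCallError => e
  # Raised when the command cannot be started at all
  {success: false, output: e.message}
end

result = run_command("echo hello")
puts result[:success]
puts result[:output]
```

Passing model-supplied strings to a shell is risky, so a helper like this is best paired with an approval step or policy check.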

Cancelling A Function

Because pending tool calls are explicit LLM::Function objects, your code can decide not to run them and return a cancellation result instead.

This is useful when tool execution depends on user confirmation, policy checks, or any other application-level gate. The model requests work, but your code can still stop it before the function actually runs:

#!/usr/bin/env ruby
require "llm"

class System < LLM::Tool
  name "system"
  description "Run a shell command"
  param :command, String, "Command to execute", required: true

  def call(command:)
    {success: system(command)}
  end
end

llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, tools: [System])

ctx.talk("Run `date` and `uname -a`.")

approved = ctx.functions.select do |fn|
  print "Run #{fn.name}? [y/N] "
  STDIN.gets.to_s.strip.downcase == "y"
end

returns = ctx.functions.map do |fn|
  if approved.include?(fn)
    fn.call
  else
    fn.cancel(reason: "user declined to run the function")
  end
end

ctx.talk(returns)
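
The approval step does not have to be interactive. As one possible policy check, the sketch below accepts a command only when its executable is on an allowlist and the string contains no shell metacharacters. The allowlist, the constant names, and allowed_command? are all hypothetical. How you read the command out of a pending function depends on the LLM::Function API, which this example deliberately does not touch:

```ruby
require "shellwords"

# Hypothetical policy: only these executables may run
ALLOWED_COMMANDS = %w[date uname uptime].freeze

# Characters that would let a command chain or redirect in a shell
SHELL_METACHARACTERS = /[;&|<>`$()\n\\]/

# Returns true when the command is a single allowlisted executable
# with plain arguments. Commands that cannot be parsed are rejected.
def allowed_command?(command)
  return false if command.match?(SHELL_METACHARACTERS)
  executable = Shellwords.split(command).first
  ALLOWED_COMMANDS.include?(executable)
rescue ArgumentError
  false # e.g. unmatched quotes
end

puts allowed_command?("date")           # true
puts allowed_command?("uname -a")       # true
puts allowed_command?("rm -rf /tmp")    # false
puts allowed_command?("date && reboot") # false
```

A check like this narrows what can run but is not a sandbox, so treat it as one layer among several.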

Closure-Based Tools

For smaller cases, LLM.function gives you a closure-based alternative to LLM::Tool.

This is useful when you want a quick function without defining a class. The main limitation is that LLM.function does not register a tool class in LLM::Tool.registry, so features that depend on tool-class registration, such as streamed tool resolution through LLM::Stream, only work with LLM::Tool subclasses:

#!/usr/bin/env ruby
require "llm"

weather = LLM.function(:weather) do |fn|
  fn.description "Return the weather for a city"
  fn.params do |schema|
    schema.object(city: schema.string.required)
  end
  fn.define do |city:|
    {city:, forecast: "sunny", high_c: 23}
  end
end

llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, tools: [weather])
ctx.talk("What is the weather in Lisbon?")
ctx.talk(ctx.call(:functions)) while ctx.functions.any?

Concurrent Tools

Use wait(:thread), wait(:fiber), or wait(:task) when you want multiple pending tool calls to run concurrently.

This matters when a turn fans out into several independent tool calls. Instead of blocking on each one in sequence, you can resolve them together and reduce end-to-end latency:

#!/usr/bin/env ruby
require "llm"

llm = LLM.openai(key: ENV["KEY"])
# FetchWeather, FetchNews, and FetchStock are LLM::Tool subclasses defined elsewhere
ctx = LLM::Context.new(
  llm,
  stream: $stdout,
  tools: [FetchWeather, FetchNews, FetchStock]
)

ctx.talk("Summarize the weather, headlines, and stock price.")
ctx.talk(ctx.wait(:thread)) while ctx.functions.any?
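
To see why this reduces latency, the sketch below runs three stand-in tasks in plain Ruby threads and times them. It illustrates the general idea only and is not how llm.rb implements wait(:thread). The task names and the 0.2 second delays are made up:

```ruby
# Three stand-in "tools" that each take about 0.2s, e.g. network calls
tasks = {
  weather: -> { sleep 0.2; "sunny" },
  news: -> { sleep 0.2; "3 headlines" },
  stock: -> { sleep 0.2; "up 1%" }
}

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

# Start every task in its own thread, then collect the results.
# Thread#value joins the thread and returns the block's result.
threads = tasks.transform_values { |task| Thread.new(&task) }
results = threads.transform_values(&:value)

elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
puts results
puts format("elapsed: %.2fs", elapsed)
```

Run in sequence, the three tasks would take about 0.6 seconds. Run together, the total is close to the slowest single task.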

Agents

LLM::Agent gives you a reusable, preconfigured assistant built on top of the same context, tool, and schema primitives. It is a good fit when you want to package instructions, model choice, tools, and output shape into one class.

The main difference from LLM::Context is control flow. An agent will apply its instructions automatically and keep executing tool calls until the turn settles or it hits the configured limit.

#!/usr/bin/env ruby
require "llm"

class SystemAdmin < LLM::Agent
  model "gpt-4.1"
  instructions "You are a Linux system admin"
  tools Shell   # an LLM::Tool subclass defined elsewhere
  schema Result # an LLM::Schema subclass defined elsewhere
end

llm = LLM.openai(key: ENV["KEY"])
agent = SystemAdmin.new(llm)
res = agent.talk("Run 'date'")

MCP

MCP lets llm.rb treat external services, internal APIs, and prompt libraries as part of the same execution path.

LLM::MCP is a stateful client that can connect over stdio or HTTP, list tools and prompts, and adapt them into the same runtime model used by contexts and agents.

MCP Tools Over Stdio

Use stdio when the MCP server runs as a local process. This is the most direct way to connect local utilities and developer tools into a context.

#!/usr/bin/env ruby
require "llm"

llm = LLM.openai(key: ENV["KEY"])
mcp = LLM::MCP.stdio(
  argv: ["npx", "-y", "@modelcontextprotocol/server-filesystem", Dir.pwd]
)

begin
  mcp.start
  ctx = LLM::Context.new(llm, stream: $stdout, tools: mcp.tools)
  ctx.talk("List the directories in this project.")
  ctx.talk(ctx.call(:functions)) while ctx.functions.any?
ensure
  mcp.stop
end

MCP Tools Over HTTP

Use HTTP when the MCP server is remote or shared across machines.

If you expect repeated tool calls, use persistent to reuse a process-wide HTTP connection pool. This requires the optional net-http-persistent gem:

#!/usr/bin/env ruby
require "llm"

llm = LLM.openai(key: ENV["KEY"])
mcp = LLM::MCP.http(
  url: "https://api.githubcopilot.com/mcp/",
  headers: {"Authorization" => "Bearer #{ENV.fetch("GITHUB_PAT")}"}
).persistent

begin
  mcp.start
  ctx = LLM::Context.new(llm, stream: $stdout, tools: mcp.tools)
  ctx.talk("List the available GitHub MCP toolsets.")
  ctx.talk(ctx.call(:functions)) while ctx.functions.any?
ensure
  mcp.stop
end

MCP Prompts

MCP servers can also expose prompt templates. llm.rb can list those prompts and fetch a specific prompt by name. Retrieved prompt messages are normalized into LLM::Message objects, and the raw MCP payload stays available in extra.original_content.

This is useful when prompts live outside the application and need to be fetched by name, optionally with arguments, before being passed into a context or agent:

#!/usr/bin/env ruby
require "llm"

mcp = LLM::MCP.stdio(argv: ["npx", "-y", "@mcpservers/prompt-library"])

begin
  mcp.start

  prompts = mcp.prompts
  prompt = mcp.find_prompt(
    name: "suggest_code_error_fix",
    arguments: {
      "code_error" => "undefined method `name' for nil:NilClass",
      "function_name" => "render_profile"
    }
  )

  puts prompts.map(&:name)
  puts prompt.messages.first.content
  puts prompt.messages.first.extra.original_content.type
ensure
  mcp.stop
end

Multimodal Prompts

Contexts provide helpers for composing prompts that include images, audio, documents, and provider-managed files.

These helpers normalize non-text inputs before they reach the provider adapter. That keeps the prompt-building code in Ruby while still letting each provider receive the shape it expects.

Image Input

Image helpers let you build multimodal prompts without manually assembling provider-specific payloads.

#!/usr/bin/env ruby
require "llm"

llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm)

res = ctx.talk ["Describe this image", ctx.image_url("https://example.com/cat.jpg")]
puts res.content

Audio Generation

Provider media APIs are exposed alongside chat APIs, so the same provider object can also handle speech output.

#!/usr/bin/env ruby
require "llm"

llm = LLM.openai(key: ENV["KEY"])
res = llm.audio.create_speech(input: "Hello world")
IO.copy_stream res.audio, File.join(Dir.home, "hello.mp3")

Image Generation

Image generation follows the same pattern: call the provider API, then handle the returned file or stream in normal Ruby code.

#!/usr/bin/env ruby
require "llm"

llm = LLM.openai(key: ENV["KEY"])
res = llm.images.create(prompt: "a dog on a rocket to the moon")
IO.copy_stream res.images[0], File.join(Dir.home, "dogonrocket.png")

Retrieval And Files

When you want to index content or use provider-side retrieval APIs, llm.rb exposes files, embeddings, and vector stores directly.

This is useful when the workflow needs more than chat completion. You can upload content, build embeddings, create vector stores, and query them from the same provider object you already use for prompts and contexts.

Embeddings

Embeddings are the basic building block for semantic search, clustering, and retrieval workflows.

#!/usr/bin/env ruby
require "llm"

llm = LLM.openai(key: ENV["KEY"])
res = llm.embed([
  "programming is fun",
  "ruby is a programming language",
  "sushi is art"
])

puts res.class
puts res.embeddings.size
puts res.embeddings[0].size
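
Once you have vectors, semantic search usually comes down to comparing them. The sketch below ranks documents by cosine similarity in plain Ruby. It assumes each embedding is an array of numbers, and the three-dimensional vectors are made up so the example runs without a provider:

```ruby
# Cosine similarity between two equal-length numeric vectors.
# Returns a value in [-1, 1]; higher means a more similar direction.
def cosine_similarity(a, b)
  raise ArgumentError, "vectors must have the same length" unless a.size == b.size
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if norm_a.zero? || norm_b.zero?
  dot / (norm_a * norm_b)
end

# Tiny made-up vectors; real embeddings have far more dimensions
query = [1.0, 0.0, 1.0]
docs = {
  "ruby" => [1.0, 0.1, 0.9],
  "sushi" => [0.0, 1.0, 0.0]
}

# Sort from most to least similar to the query
ranked = docs.sort_by { |_, vector| -cosine_similarity(query, vector) }
puts ranked.map(&:first).inspect
```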

Files And Vector Stores

When you want provider-side retrieval, file uploads and vector stores let the provider index your content and search over it directly.

#!/usr/bin/env ruby
require "llm"
require "pp"

llm = LLM.openai(key: ENV["KEY"])
file = llm.files.create(path: "README.md")
store = llm.vector_stores.create_and_poll(name: "Docs", file_ids: [file.id])
res = llm.vector_stores.search(vector: store, query: "What does llm.rb do?")

res.each { pp _1 }

Tracing

Assign a tracer to a provider and all context requests and tool calls made through that provider will be instrumented.

Tracing is attached at the provider level, so the same tracer follows normal requests, tool execution, and higher-level workflows built on contexts or agents. That keeps observability close to the runtime model instead of adding it as a separate wrapper later:

#!/usr/bin/env ruby
require "llm"

llm = LLM.openai(key: ENV["KEY"])
llm.tracer = LLM::Tracer::Logger.new(llm, io: $stdout)

ctx = LLM::Context.new(llm)
ctx.talk("Hello")

#!/usr/bin/env ruby
require "llm"

llm = LLM.openai(key: ENV["KEY"])
llm.tracer = LLM::Tracer::Telemetry.new(llm)

ctx = LLM::Context.new(llm)
ctx.talk("Hello")
pp llm.tracer.spans

#!/usr/bin/env ruby
require "llm"

llm = LLM.openai(key: ENV["KEY"])
llm.tracer = LLM::Tracer::Langsmith.new(
  llm,
  metadata: {env: "dev"},
  tags: ["chatbot"]
)

ctx = LLM::Context.new(llm)
ctx.talk("Hello")

Production And Operations

These are the pieces you reach for once the workflow itself is working.

Most of them are small switches rather than a second framework. Providers are meant to be shared, contexts are meant to stay isolated, and performance or cost controls layer onto the same core objects.

Production Basics

These are the default operational assumptions behind the library. They are simple, but getting them right early makes the rest of the workflow more predictable.

  • Thread-safe providers — share LLM::Provider instances across the app
  • Thread-local contexts — keep LLM::Context instances state-isolated
  • Cost tracking — estimate spend without extra API calls
  • Persistence — save and restore contexts across processes
  • Performance — swap JSON adapters and enable HTTP connection pooling
  • Error handling — structured errors instead of unpredictable exceptions

Thread Safety

Providers are designed to be shared. Contexts should generally stay local to one thread.

That split is intentional. Providers are mostly configuration and transport, while contexts hold mutable workflow state:

#!/usr/bin/env ruby
require "llm"

llm = LLM.openai(key: ENV["KEY"])

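t1 = Thread.new do
  ctx = LLM::Context.new(llm)
  ctx.talk("Hello from thread 1")
end

t2 = Thread.new do
  ctx = LLM::Context.new(llm)
  ctx.talk("Hello from thread 2")
end

# Wait for both threads so the script does not exit before they finish
[t1, t2].each(&:join)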

Performance Tuning

Swap JSON backends when you need more throughput, and enable persistent HTTP when request volume makes it worth it.

These are opt-in changes. You can stay on the standard library by default and only add extra dependencies when the workload justifies them:

#!/usr/bin/env ruby
require "llm"

LLM.json = :oj
llm = LLM.openai(key: ENV["KEY"]).persistent

Model Registry

The local model registry provides metadata about model capabilities, pricing, and limits without requiring API calls.

This is useful when the application needs to make local decisions about model selection, limits, or estimated cost:

#!/usr/bin/env ruby
require "llm"

registry = LLM.registry_for(:openai)
model_info = registry.limit(model: "gpt-4.1")
puts "Context window: #{model_info.context} tokens"
puts "Cost: $#{model_info.cost.input}/1M input tokens"

Cost Tracking

Contexts accumulate usage as they run, which makes cost tracking available without a separate accounting layer.

#!/usr/bin/env ruby
require "llm"

llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm)
ctx.talk "Hello"
puts "Estimated cost so far: $#{ctx.cost}"
ctx.talk "Tell me a joke"
puts "Estimated cost so far: $#{ctx.cost}"
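
The arithmetic behind an estimate like this is token counts multiplied by a per-token price. The sketch below shows that calculation with made-up prices. It is not how ctx.cost is implemented, and estimate_cost and PRICE_PER_MILLION are hypothetical names:

```ruby
# Hypothetical per-million-token prices in USD; real prices vary by model
PRICE_PER_MILLION = {input: 2.00, output: 8.00}.freeze

# Estimated cost in USD for a given number of input and output tokens
def estimate_cost(input_tokens:, output_tokens:, prices: PRICE_PER_MILLION)
  (input_tokens * prices[:input] + output_tokens * prices[:output]) / 1_000_000.0
end

cost = estimate_cost(input_tokens: 1_200, output_tokens: 300)
puts format("$%.4f", cost) # prints $0.0048
```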

Putting It Together

See how these pieces come together in a complete application architecture with Relay, a production-ready LLM application built on llm.rb that demonstrates:

  • Context management across requests
  • Tool composition and execution
  • Concurrent workflows
  • Cost tracking and observability
  • Production deployment patterns

Watch the llm.rb screencast