About
This post looks at how we can build a tool that estimates the age of a person in a photo using the llm.rb library. Along the way it demonstrates three key llm.rb features at once: multimodal prompts, image vision and generation, and structured outputs.
Vision
An LLM that can understand both text and images is said to be multimodal, and LLMs in general are moving further in that direction. All of the Large Language Models that the llm.rb library supports have vision capabilities. An image can be given as input in a number of ways: as a URL, as base64-encoded inline data (e.g. a local file), or as a reference to a file object stored with the LLM provider. This post focuses on providing the image as a URL, the option most likely to work out of the box.
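To make that concrete, here is a minimal sketch of the first two options. The URL form matches the final example in this post; the local-file form assumes llm.rb's LLM.File helper for wrapping a path as inline data, and both file names are placeholders:
require "llm"
llm = LLM.openai(key: ENV["OPENAI_SECRET"])
bot = LLM::Bot.new(llm)
##
# 1. As a URL: wrap the address in a URI object
bot.chat URI("https://example.com/photo.png")
##
# 2. As base64-encoded inline data: a local file (assumes the LLM.File helper)
bot.chat LLM.File("photo.png")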
Schema
We can provide a JSON schema that describes the response we expect, and the LLM will shape its output to match it. This is perfect for our example: we can describe a schema with an age, a confidence rating (between 0.0 and 1.0), and any notes the LLM wants to add:
require "llm"
llm = LLM.openai(key: ENV["OPENAI_SECRET"])
schema = llm.schema.object(
  age: llm.schema.integer.required.description("The age of the person in a photo"),
  confidence: llm.schema.number.required.description("Model confidence (0.0 to 1.0)"),
  notes: llm.schema.string.required.description("Model notes or caveats")
)
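On its own the schema is just a description. It takes effect once attached to a bot, which the final example below does through the schema: keyword:
bot = LLM::Bot.new(llm, schema:)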
Image
The most important piece of the example is the image itself. Rather than trying to source a public domain photo online (which turned out to be surprisingly difficult), I decided to generate an image on the fly. OpenAI, Gemini and xAI (Grok) all have first-class image generation, editing and variation support in llm.rb:
require "llm"
llm = LLM.openai(key: ENV["OPENAI_SECRET"])
img = llm.images.create(prompt: "A man in his 30s")
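As the final example shows, the generated image can then be reached through the urls array on the response; img.urls[0] is passed straight back to the model as the photo to analyse.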
Example
The following example gives the LLM both the image (as a URI object) and the schema from the previous section. The image is of a man, and the LLM estimates him to be in his 30s:
require "llm"
##
# schema
llm = LLM.openai(key: ENV["OPENAI_SECRET"])
schema = llm.schema.object(
  age: llm.schema.integer.required.description("The age of the person in a photo"),
  confidence: llm.schema.number.required.description("Model confidence"),
  notes: llm.schema.string.required.description("Model notes")
)
##
# request
bot = LLM::Bot.new(llm, schema:)
img = llm.images.create(prompt: "A man in his 30s")
bot.chat URI(img.urls[0])
##
# response
res = bot.messages.find(&:assistant?).content!
print "age: ", res["age"], "\n"
print "confidence: ", res["confidence"], "\n"
print "notes: ", res["notes"], "\n"
##
# age: 32
# confidence: 0.89
# notes: The man appears to be in his early thirties ...
Note: This demo shows llm.rb’s multimodal and schema features. Age estimation is approximate and should not be used in sensitive contexts.