About

This post looks at how we can use the llm.rb library to build a tool for estimating the age of a person in a photo. The demo combines three features at once – multimodal prompts, vision (image understanding and generation), and structured outputs.

Background

On a slow day it crossed my mind that, in theory, we could ask a large language model to estimate the age of a person in a photo. I decided to turn the theory into an experiment, since it could also make for a cool blog post. As usual, I used the zero-dependency llm.rb library to drive the experiment.

Experiment

Context

llm.rb supports models that can see images, not just text, and provides a way to describe exactly what kind of output we want back through structured outputs. We can hand the model an image (a URL is usually easiest) and a schema that describes the shape of the response we expect. For this experiment, we want the estimated age, the model's confidence (from 0.0 to 1.0), and any textual notes.

I also decided to generate the test image on the fly: it keeps the experiment self-contained, and it was easier than hunting for a "public domain" photo:

require "llm"

llm = LLM.openai(key: ENV["OPENAI_SECRET"])
schema = llm.schema.object(
  age: llm.schema.integer.required.description("The age of the person in a photo"),
  confidence: llm.schema.number.required.description("Model confidence (0.0 to 1.0)"),
  notes: llm.schema.string.required.description("Model notes or caveats")
)

img = llm.images.create(prompt: "A man in his 30s")
bot = LLM::Bot.new(llm, schema:)
res = bot.chat bot.image_url(img.urls[0])

body = res.choices.find(&:assistant?).content!
print "age: ", body["age"], "\n"
print "confidence: ", body["confidence"], "\n"
print "notes: ", body["notes"], "\n"

# age: 32
# confidence: 0.89
# notes: The man appears to be in his early thirties ...

Explanation

The script does four things: it defines a schema that constrains the model's reply to an integer age, a numeric confidence, and free-form notes; it generates a test image with llm.images.create; it sends the image URL through a schema-bound LLM::Bot; and it reads the assistant's reply with content!, which returns a plain Ruby Hash in the shape the schema described.
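Since content! hands back a plain Ruby Hash, ordinary Ruby is enough to sanity-check the response before using it. A minimal sketch (the helper and its rules are my own, not part of llm.rb):

```ruby
# Sanity-check the structured response before trusting it.
# The expectations mirror the schema: integer age, confidence in 0.0..1.0, string notes.
def valid_estimate?(body)
  body.is_a?(Hash) &&
    body["age"].is_a?(Integer) && body["age"].positive? &&
    body["confidence"].is_a?(Numeric) && (0.0..1.0).cover?(body["confidence"]) &&
    body["notes"].is_a?(String)
end

valid_estimate?({"age" => 32, "confidence" => 0.89, "notes" => "Early thirties"}) # => true
```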

Caveats

Model-based age estimation is always approximate. The result here shouldn’t be trusted for real-world decisions or sensitive applications, and it's important to be mindful of privacy or ethical implications when analyzing images of real people.
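One practical way to act on that caveat is to treat low-confidence estimates as inconclusive rather than reporting a number. The threshold and the helper below are illustrative, not part of llm.rb:

```ruby
# Report an age only when the model's confidence clears a threshold.
# 0.75 is an arbitrary cutoff chosen for illustration.
def describe_estimate(body, threshold: 0.75)
  if body["confidence"] >= threshold
    "estimated age: #{body["age"]} (confidence: #{body["confidence"]})"
  else
    "inconclusive (confidence #{body["confidence"]} is below #{threshold})"
  end
end

puts describe_estimate({"age" => 32, "confidence" => 0.89})
# estimated age: 32 (confidence: 0.89)
```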

Conclusion

It is more or less straightforward to build a script that performs image analysis in pure Ruby. We can easily adapt this pattern for related problems, like emotion detection, captioning, or any task where you want a model to “see” and explain.