In a world where AI is evolving into increasingly versatile multimodal systems, I decided to build a tool that brings together state-of-the-art object detection, OCR, and language generation into one cohesive workflow. Inspired by my mom’s extreme positivity when we couldn’t afford to dine out as kids, I combined YOLOv8 with pytesseract and a Llama-based large language model to generate natural, flowing descriptions of our surroundings.
Today, many of the major LLMs are multimodal: they typically use a Vision Transformer to tokenize an image into embeddings the model can process alongside text. However, not all LLMs have that luxury, and sending images over an API to be tokenized can be more expensive than processing plain text. Plus, we are about to see more small models that can perform impressive reasoning (my bet is on the Phi family, by the way). So why not create a workaround that connects images to LLMs that only handle text?
This method shows that even a small LLM can be enabled to work on images at scale. Instead of relying on massive multimodal models, I make do with what we have by describing images through detected objects and extracted text. Before the rise of LLMs, people used pre-trained neural networks like EfficientNet, ResNet, and even YOLO (the fastest kid on the block) to detect objects in an image, and that was considered pretty fancy. With the advent of LLMs, Vision Transformers have become the eyes of these models. However, when all you have is a text-only medium, why not have the system describe every object in an image, along with its position, using YOLO, and capture any text with pytesseract? Then you simply ask the LLM: “If there is a dog at the bottom left, a man in the center, and also a backpack in the center, what could this image be?”

LLM response:
In the scene, a person is prominently positioned, standing at the forefront, with a dog by their side. The duo appears to be the main focus of the image, taking center stage. Above them, a backpack is suspended, possibly hanging from a hook, a tree branch, or even being held by an unseen hand, but it is clearly visible in the middle of the frame. The person and dog seem to be the main subjects of the image, with the backpack adding an element of context, perhaps suggesting a hiking or outdoor adventure scene. The overall atmosphere of the image appears to be serene and peaceful, with the person, dog, and backpack creating a sense of harmony and balance.
How It Works
At its core, the method is about integrating multiple modalities:
- Vision (YOLOv8): I use Ultralytics’ YOLOv8 to perform object detection on an input image. Whether it is a family snapshot or a cluttered home office, YOLOv8 identifies objects and provides their bounding boxes.
- Text (OCR): For any text that might be present (think handwritten notes on the fridge or labels on pantry items), I use pytesseract. A simple thresholding operation cleans up the image so that the OCR engine can extract text reliably.
- Language (LLM): Finally, I combine these outputs into a single descriptive prompt and send it to a Llama-based model via the Together API. The prompt includes both the detected objects (with relative positions) and any OCR-extracted text. The language model then generates a natural, human-like description of the scene.
The entire pipeline is implemented in a Jupyter Notebook, and the code integrates seamlessly to take you from a raw image to a creative, contextual narrative. Note that I call Llama through the together.ai API purely for demonstration; it could just as easily be a local Llama or any other tiny text-only LLM (see the sketch after the full pipeline below).
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np
from together import Together
import pytesseract
# If you're on Windows, you need to point to your tesseract installation
# No need for you Linux gang dudes :)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
import cv2
from ultralytics import YOLO
# Initialize the Together client with your API key
client_together = Together(api_key="YOUR_API")
def get_relative_position(center_x, center_y, img_width, img_height):
    """Convert coordinates to relative positions in the image"""
    horizontal_pos = "left" if center_x < img_width/3 else "right" if center_x > 2*img_width/3 else "center"
    vertical_pos = "top" if center_y < img_height/3 else "bottom" if center_y > 2*img_height/3 else "middle"
    return f"{vertical_pos} {horizontal_pos}"
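# For example (hypothetical coordinates): get_relative_position(100, 50, 900, 600) -> "top left"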
def extract_text(image_path):
    """Extract text from image using pytesseract"""
    # Read image using OpenCV
    img = cv2.imread(image_path)
    # Convert to RGB
    img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # Convert to grayscale
    gray = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2GRAY)
    # Apply Otsu thresholding to get an image with only black and white
    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    # Perform OCR
    try:
        text = pytesseract.image_to_string(thresh).strip()
        # Collapse newlines and extra whitespace
        text = ' '.join(text.split())
        print("OCR complete")
        return text if text else None
    except Exception as e:
        print(f"Error in text extraction: {e}")
        return None
# Load the image and get dimensions
image_path = "frst.jpg" # Change this to your actual image path
img = Image.open(image_path)
img_width, img_height = img.size
# Load YOLOv8 model
print("Loading YOLOv8 model...")
model = YOLO('yolov8l.pt') # Load the large model (you can use 'n', 's', 'm', or 'x' for other sizes)
# Run object detection
print("Performing object detection...")
results = model(image_path)[0] # YOLOv8 returns a list of Results objects
# Display and save the results
print("\nDisplaying detected objects...")
# Plot the results on the image
plotted_image = results.plot()
# Convert from BGR to RGB for display
plotted_image_rgb = cv2.cvtColor(plotted_image, cv2.COLOR_BGR2RGB)
# Create a figure with a reasonable size
plt.figure(figsize=(12, 8))
plt.imshow(plotted_image_rgb)
plt.axis('off') # Hide axes
plt.title('Detected Objects')
plt.show()
# Save the annotated image
output_path = "frst_out.jpg"
cv2.imwrite(output_path, plotted_image)
print(f"Annotated image saved as {output_path}")
# Extract detections
detections = []
for result in results.boxes.data:
    x1, y1, x2, y2, conf, cls = result
    label = model.names[int(cls)]
    center_x = (x1 + x2) / 2
    center_y = (y1 + y2) / 2
    position = get_relative_position(center_x, center_y, img_width, img_height)
    detections.append((label, position))
# Extract text from image
print("Extracting text from image...")
extracted_text = extract_text(image_path)
if not detections and not extracted_text:
    print("No objects or text detected in the image.")
else:
    # Format detected objects
    detected_objects = [f"{label} in the {position} of the image" for label, position in detections]
    # Create a descriptive prompt including both objects and text
    prompt_parts = ["Please describe the image in detail."]
    if detected_objects:
        prompt_parts.append(
            "The following objects were detected with their relative positions: " +
            ", ".join(detected_objects)
        )
    if extracted_text:
        prompt_parts.append(
            f"The following text was found in the image: '{extracted_text}'"
        )
    prompt_parts.append(
        "Based on these detections, provide a natural, flowing description of the scene "
        "without mentioning exact coordinates."
    )
    prompt = " ".join(prompt_parts)
print("\nSending prompt to Llama...")
messages = [{"role": "user", "content": prompt}]
# Get response from Llama
response = client_together.chat.completions.create(
model="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free",
messages=messages,
max_tokens=300,
temperature=0.7,
top_p=0.7,
top_k=50,
repetition_penalty=1,
stop=["<|eot_id|>", "<|eom_id|>"],
stream=False
)
# Print the description
description = response.choices[0].message.content
print("\nDescription:")
print(description)
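If you would rather stay fully local, the Together call can be swapped for any small chat model. Below is a minimal sketch using Hugging Face transformers; the model name (TinyLlama/TinyLlama-1.1B-Chat-v1.0) and the generation settings are illustrative assumptions rather than what the original notebook uses, so substitute whatever tiny instruction-tuned model you prefer.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative choice of a small chat model; any local instruction-tuned LLM works here
local_model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(local_model_name)
local_llm = AutoModelForCausalLM.from_pretrained(local_model_name, device_map="auto")

def describe_locally(prompt, max_new_tokens=300):
    """Send the same objects-plus-OCR prompt to a local chat model instead of the API."""
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(local_llm.device)
    output = local_llm.generate(
        input_ids, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.7, top_p=0.7
    )
    # Decode only the newly generated tokens, skipping the echoed prompt
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

# Usage: print(describe_locally(prompt))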
Benefits
- Flexibility and Versatility: By fusing object detection and OCR, this approach can describe both visual elements and written information. Whether you have a complex scene with many objects or an image with embedded text, the method adapts.
- Objects of Interest: By restricting the detector to only the classes you care about (fruits, dogs, whatever), you can drop the extra objects in the image that would otherwise act as noise when the model sees everything; see the short filtering sketch after this list.
- Ease of Integration: Using readily available tools like YOLOv8 and pytesseract means you do not have to build each component from scratch. It is like how moms make the best use of what is already at home.
- Creative Applications: The output is not merely a list of objects or raw text — it is a coherent narrative generated by a language model. This makes the method perfect for creative storytelling, accessibility enhancements (for visually impaired users), and even digital archiving.
- Empowering Small LLMs: This technique shows that even a modest language model can be enabled to process images on a large scale. It provides a clever bridge for text-only LLMs to interpret visual data.
- Real-World Utility: Imagine a home automation system that describes your living space, an app that catalogs your personal items for inventory or security, or a digital assistant that organizes your photos by generating natural language summaries.
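As a minimal sketch of that class-filtering idea: Ultralytics lets you pass a classes argument at prediction time, so anything outside your whitelist never reaches the prompt. The class IDs below assume the default COCO-trained weights (0 for person, 16 for dog); check them against model.names for your own model.
from ultralytics import YOLO

model = YOLO('yolov8l.pt')

# Keep only the classes of interest (person=0, dog=16 in the default COCO mapping);
# all other detections are discarded by YOLO before we ever build the prompt.
results = model("frst.jpg", classes=[0, 16])[0]

for x1, y1, x2, y2, conf, cls in results.boxes.data:
    print(f"{model.names[int(cls)]} detected with confidence {float(conf):.2f}")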
Weaknesses
- Computational Overhead: Combining multiple AI models means the pipeline can be resource-intensive. Running YOLOv8, OCR, and a large language model in succession may require a robust system or cloud resources.
- Error Propagation: In multimodal systems, errors in one stage (for example, mis-detected objects or OCR misreads) can affect the final narrative. Fine-tuning and careful error handling are crucial; a simple confidence-filtering sketch follows this list.
- Dependence on Quality Input: Low-quality or highly cluttered images may result in less accurate detections or OCR outputs. While the language model helps smooth out inconsistencies, there is only so much it can infer if the underlying data is noisy.
- Lack of Description Depth: The description leans heavily on the raw YOLO output, so instead of “a man in pajamas” the model only hears “a person”, for example. This is the main limitation of the approach, and it restricts its usefulness for complex scenes that demand fine-grained detail.
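One hedged way to limit error propagation is to filter out low-confidence detections and OCR words before they ever reach the prompt. The thresholds below (0.5 for YOLO boxes, 60 for pytesseract word confidence) are illustrative values chosen for this sketch, not tuned numbers from the pipeline above.
import cv2
import pytesseract
from pytesseract import Output

MIN_DET_CONF = 0.5   # illustrative threshold for YOLO box confidence
MIN_OCR_CONF = 60    # illustrative threshold for pytesseract word confidence (0-100)

def filter_detections(results, model):
    """Keep only the detections the detector is reasonably sure about."""
    kept = []
    for x1, y1, x2, y2, conf, cls in results.boxes.data:
        if float(conf) >= MIN_DET_CONF:
            kept.append(model.names[int(cls)])
    return kept

def confident_ocr_text(image_path):
    """Keep only OCR words whose reported confidence clears the threshold."""
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    data = pytesseract.image_to_data(gray, output_type=Output.DICT)
    words = [word for word, conf in zip(data["text"], data["conf"])
             if word.strip() and float(conf) >= MIN_OCR_CONF]
    return " ".join(words) or None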
When It Is Useful
This method shines in scenarios where a holistic understanding of visual and textual content is valuable. For instance:
- Home Organization and Inventory: Automatically generating detailed descriptions of your living space or items stored at home.
- Accessibility Tools: Providing visually impaired users with rich, descriptive narratives about their surroundings.
- Digital Storytelling and Journaling: Creating engaging summaries for personal photo collections.
Conclusion
By integrating YOLOv8’s robust object detection, pytesseract’s reliable OCR, and a Llama-based language model’s creative narrative generation, this approach offers a comprehensive multimodal-like system that is both practical and innovative. It embodies a “mom said we have it at home” philosophy — making the most out of existing tools to create something truly useful in everyday life.