Visual-based Web Scraping: Using the Power of Multimodal LLMs for Dynamic Web Content Extraction

With LLMs and vision models becoming more accessible than ever, I started exploring the intersection of web scraping and AI. As a test case, I recently wrote some code that leverages Llama Vision to scrape the content of a webpage using the multimodality of the LLM; we can call this Visual-based Web Scraping. Instead of traditional HTML parsing with libraries like bs4 or selenium, which often breaks when pages are dynamically updated or structured differently, I capture a screenshot of the webpage and let an open LLM summarize the visual content.

This approach not only simplifies the process by sidestepping the brittle nature of DOM-based scraping but also leverages modern vision-language models to interpret and summarize content — opening up a whole new realm of possibilities.
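
For context, the kind of DOM-based extraction this sidesteps usually looks something like the snippet below. The URL and CSS selector here are made up, and the selector is exactly the part that silently breaks when a site changes its markup or loads its headlines via JavaScript.

    import requests
    from bs4 import BeautifulSoup

    # Traditional approach: fetch raw HTML and pick headlines out of the DOM.
    # The URL and CSS class below are hypothetical; each site needs its own
    # selector, and that selector breaks whenever the markup changes.
    html = requests.get("https://example.com/news", timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    headlines = [h2.get_text(strip=True) for h2 in soup.select("h2.headline")]
    print(headlines)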

The Idea Behind Visual-based Web Scraping

The concept is simple:

  1. Capture a Screenshot: Use a headless browser (powered by pyppeteer) to render the full webpage as an image.
  2. Summarize Using an LLM: Convert the screenshot to a base64-encoded string and send it to an open LLM (in my case, Meta’s Llama-Vision-Free) through the Together API. The model then analyzes the image and provides a concise summary of the page content — in my experiment, summarizing news headlines.

This method shines especially when dealing with dynamically generated content, where HTML structures might change frequently or require JavaScript execution to load properly.

Diving Into the Code

Below is the code I wrote for this project. It uses asynchronous programming to capture a screenshot of a news page and then processes the image with an LLM.
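
In sketch form, assuming pyppeteer for rendering and the together Python SDK with its OpenAI-style chat.completions interface, the pipeline looks roughly like this; the target URL, model identifier, and prompt wording are illustrative placeholders rather than fixed choices.

    import asyncio
    import base64
    import os

    from pyppeteer import launch
    from together import Together

    # Illustrative placeholders; swap in your own target page and model name.
    TARGET_URL = "https://news.ycombinator.com"
    MODEL_NAME = "meta-llama/Llama-Vision-Free"
    USER_AGENT = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    )

    async def capture_screenshot(url: str, path: str = "page.png") -> str:
        """Render the page in a headless browser and save a full-page screenshot."""
        browser = await launch(headless=True, args=["--no-sandbox"])
        page = await browser.newPage()
        # Mimic a typical browser to reduce the chance of being blocked.
        await page.setUserAgent(USER_AGENT)
        await page.setExtraHTTPHeaders({"Accept-Language": "en-US,en;q=0.9"})
        await page.goto(url, {"waitUntil": "networkidle2"})
        await page.screenshot({"path": path, "fullPage": True})
        await browser.close()
        return path

    def summarize_screenshot(path: str) -> str:
        """Send the screenshot to the vision model and stream back a summary."""
        with open(path, "rb") as f:
            encoded = base64.b64encode(f.read()).decode("utf-8")
        data_url = f"data:image/png;base64,{encoded}"

        client = Together(api_key=os.environ["TOGETHER_API_KEY"])
        stream = client.chat.completions.create(
            model=MODEL_NAME,
            messages=[
                {"role": "system", "content": "You summarize webpage screenshots."},
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "Summarize the news headlines shown in this screenshot."},
                        {"type": "image_url", "image_url": {"url": data_url}},
                    ],
                },
            ],
            stream=True,
        )

        # Concatenate the streamed tokens into the final summary.
        summary = ""
        for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                summary += delta
        return summary

    async def main() -> None:
        path = await capture_screenshot(TARGET_URL)
        print(summarize_screenshot(path))

    if __name__ == "__main__":
        asyncio.run(main())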

How It Works

  1. Capturing the Screenshot:
    Using pyppeteer, the script launches a headless browser instance, navigates to the target URL, and captures a full-page screenshot. Realistic headers and a common User-Agent string are set to mimic a typical browser, reducing the chances of being blocked or misidentified.
  2. Interacting with the LLM:
    The captured screenshot is encoded into a base64 string and embedded in a data URL. This URL, along with a system instruction, is sent to the Llama-Vision model through the Together API. The model then streams back tokens, which are concatenated to form the final summary output; a short end-to-end usage example follows this list.
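
Putting the two steps together, running the pipeline end to end looks something like this; the module name visual_scrape and the helper names are the hypothetical ones from the sketch above, not a published package.

    import asyncio

    # Hypothetical usage: capture_screenshot and summarize_screenshot are the
    # helpers sketched above, assumed to live in a module named visual_scrape.
    from visual_scrape import capture_screenshot, summarize_screenshot

    async def run(url: str) -> None:
        path = await capture_screenshot(url, path="front_page.png")
        print(summarize_screenshot(path))

    asyncio.run(run("https://www.reuters.com"))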

Benefits:

  • Resilience to Dynamic Changes:
    Since the approach is based on images rather than HTML structure, it is inherently more robust against changes in page layouts or dynamically loaded content.
  • Simplified Parsing:
    Avoids the complex logic often required for HTML parsing, especially when pages contain JavaScript-heavy elements.
  • Leveraging Vision-Language Models:
    This method taps into the power of modern vision-language models, which can understand visual contexts and extract meaningful information from them.

Challenges:

  • Performance Overhead:
    Capturing full-page screenshots and processing images can be more resource-intensive compared to traditional text scraping.
  • API Dependency:
    Relying on external LLM APIs means the process is subject to rate limits, latency, and potential costs associated with API usage.
  • Image Quality and Interpretation:
    The quality of the screenshot and the complexity of the visual layout can affect the model’s ability to accurately summarize the content.

Final Thoughts

Visual-based Web Scraping represents a promising fusion of traditional web scraping and modern AI capabilities. It offers a robust alternative to conventional scraping techniques, particularly for dynamic and visually complex web pages. While there are trade-offs in terms of performance and reliance on external APIs, the benefits of a resilient and flexible approach make it a compelling option for many real-world applications.

