VLM | Arc AI - Marco Menner

This post outlines the findings from my recent paper, “Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?”, co-authored with Maximilian Triebel and Dominik Helfenstein. The study investigates whether current Vision-Language Models (VLMs) can replicate the physical reasoning and precise interactions required in complex puzzle environments.

Overview

To properly assess these capabilities, we introduced VLATIM (Vision-Language Against The Incredible Machine). This novel benchmark is built around the classic physics puzzle game The Incredible Machine 2.

Evaluating VLM Problem-Solving Capabilities in Point-and-Click Puzzle Games