Over the past decades, computer scientists have introduced increasingly sophisticated machine learning-based models, which can perform remarkably well on various tasks. These include multimodal large language models (MLLMs), systems that can process and generate different types of data, predominantly texts, images and videos.