Evaluation of pre-trained vision language models in challenging contexts