Vision and language: representation learning, commonsense reasoning, and consistency