Representation Learning for Grounding Vision and Language in Hierarchical Robot Planning