Towards in-the-wild visual understanding