Towards Robust Audio-Visual Speech Recognition