Improving Audio-visual Speech Recognition Performance with Cross-modal Student-teacher Training