PoliTO at TREC 2021 Podcast Summarization Track
La Quatra M.
2021-01-01
Abstract
This paper presents the approach proposed by the PoliTO team to accomplish the TREC 2021 podcast summarization task. The purpose is to extract synchronized text/audio segments that convey the most relevant podcast information. The main challenge to consider is the multimodal nature of the data source, which comprises both textual and acoustic sequences. PoliTO presents a two-stage pipeline that (i) extracts relevant content from multimodal sources and (ii) leverages the extracted content to generate abstractive summaries using an attention-based Deep Learning architecture. The extractive stage combines the high-dimensional encodings of both textual and audio sources to build a neural network-based regression model. The key idea is to predict the textual similarity between the selected text snippets and the podcast description while also exploiting the underlying information provided by the acoustic features. While audio summaries are obtained by concatenating the selected audio samples, summaries in textual form are generated by feeding the selected content to a sequence-to-sequence generative model.
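To make the extractive stage more concrete, the following is a minimal sketch, in PyTorch, of how text and audio encodings could be fused into a regression model that predicts the similarity between a transcript segment and the podcast description. The class name MultimodalRegressor, the embedding dimensions, and the concatenation-plus-MLP fusion are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class MultimodalRegressor(nn.Module):
    """Scores a transcript segment by fusing its text and audio embeddings.

    The regression target is the textual similarity between the segment and
    the episode description, so higher scores mark segments worth keeping.
    (Hypothetical sketch: dimensions and fusion strategy are assumptions.)
    """

    def __init__(self, text_dim=768, audio_dim=128, hidden_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, text_emb, audio_emb):
        # Concatenate the two modality encodings and regress a single score.
        fused = torch.cat([text_emb, audio_emb], dim=-1)
        return self.mlp(fused).squeeze(-1)


# Example: score a batch of 4 segments with placeholder embeddings.
model = MultimodalRegressor()
text_emb = torch.randn(4, 768)   # e.g., sentence-level text encodings
audio_emb = torch.randn(4, 128)  # e.g., pooled acoustic features
scores = model(text_emb, audio_emb)
loss = nn.MSELoss()(scores, torch.rand(4))  # regression against similarity targets
```

Under this reading of the pipeline, the top-scored segments would then be concatenated to form the audio summary, and their transcripts would be passed as input to a sequence-to-sequence generative model to produce the abstractive textual summary.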