Abstract:
The object of this study is the accuracy of speaker identification based on short utterances.
To solve the task of speaker identification based on ultrashort speech utterances, the study proposes a phoneme-by-phoneme approach to constructing voice models. The approach is motivated by the fact that short utterances usually contain a limited number of phonemes. It was therefore hypothesized that the accuracy of speaker identification based on short utterances can be increased by analyzing how different speakers pronounce specific phonemes.
The experiments used speech recordings of monosyllabic words containing the corresponding phonemes, from which speaker voice models were constructed using the ECAPA-TDNN neural network architecture.
The experiments showed that voice models constructed from the sounds of only one phoneme provide higher speaker identification accuracy than generalized models constructed from all speech sounds.
It was also found that different phonemes provide different speaker identification accuracy. For example, with a speech signal duration of 2–3 seconds, the accuracy of speaker identification by the generalized model was 75 %, whereas a model built on the single phoneme "E" reached 85 % on the same input data, which is 10 percentage points higher than the generalized model.
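The phoneme-specific identification idea can be illustrated with a minimal sketch. It is not the implementation used in the study. It assumes that each phoneme segment has already been converted into a fixed-length embedding (for instance by an ECAPA-TDNN encoder), that a voice model is the mean of a speaker's embeddings for one phoneme, and that scoring uses cosine similarity. The function names `enroll` and `identify`, the speaker labels, and the toy 3-dimensional vectors are all hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def enroll(samples):
    """Build one voice model per (speaker, phoneme) pair.

    `samples` is an iterable of (speaker, phoneme, embedding) triples.
    Assumption: a model is the mean of that speaker's embeddings
    for that phoneme.
    """
    grouped = defaultdict(list)
    for speaker, phoneme, emb in samples:
        grouped[(speaker, phoneme)].append(emb)
    return {key: mean_vector(embs) for key, embs in grouped.items()}


def identify(models, phoneme, emb):
    """Return the speaker whose model for `phoneme` is closest to `emb`."""
    candidates = {spk: m for (spk, ph), m in models.items() if ph == phoneme}
    if not candidates:
        raise ValueError(f"no models enrolled for phoneme {phoneme!r}")
    return max(candidates, key=lambda spk: cosine(candidates[spk], emb))


# Toy embeddings invented for illustration; real ones would come
# from a speaker encoder applied to segments of the phoneme "E".
samples = [
    ("spk_a", "E", [0.9, 0.1, 0.0]),
    ("spk_a", "E", [0.8, 0.2, 0.1]),
    ("spk_b", "E", [0.1, 0.9, 0.2]),
    ("spk_b", "E", [0.0, 0.8, 0.3]),
]
models = enroll(samples)
predicted = identify(models, "E", [0.85, 0.15, 0.05])
```

A generalized model, in this sketch, would correspond to pooling the embeddings of all phonemes under a single key per speaker; the comparison reported above is between that kind of pooling and the per-phoneme grouping shown here.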