institution: Clova AI, Japan
This paper proposes an effective emotional text-to-speech (TTS) system with a pre-trained language model (LM)-based emotion prediction method.
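As a rough illustration of the idea, the sketch below uses a generic pretrained encoder to predict an utterance-level emotion from the input text; the model name, class count, and pooling choice are assumptions, not the paper's exact configuration.

```python
# Sketch: LM-based emotion prediction from input text. "bert-base-uncased"
# and the 7-class head are illustrative assumptions, not the paper's setup.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
emotion_head = torch.nn.Linear(768, 7)  # hypothetical: 7 emotion classes

def predict_emotion(text: str) -> torch.Tensor:
    """Return emotion logits for one input utterance."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, T, 768)
    pooled = hidden[:, 0]        # [CLS] token as an utterance summary
    return emotion_head(pooled)  # (1, 7) emotion logits for conditioning TTS
```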
Figure 1: Block diagram of the proposed system; ⓒ denotes concatenation between input tensors.
Figure 2: t-SNE plot of joint emotion embeddings generated from randomly sampled emotion strengths across various emotions. Brighter colors indicate stronger emotion strength.
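A minimal sketch of how a plot like Figure 2 could be produced, assuming a matrix of joint emotion embeddings and their sampled strengths (both placeholder arrays here):

```python
# Sketch: t-SNE projection of joint emotion embeddings, colored by emotion
# strength. Embeddings and strengths are random placeholders.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

emb = np.random.randn(500, 128)   # joint emotion embeddings (placeholder)
strength = np.random.rand(500)    # sampled emotion strengths (placeholder)

proj = TSNE(n_components=2, perplexity=30).fit_transform(emb)
plt.scatter(proj[:, 0], proj[:, 1], c=strength, cmap="viridis")
plt.colorbar(label="emotion strength")
plt.show()
```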
Figure 3: Classification accuracy of emotion predictors: (a) conventional CNN-based and (b) proposed LM-based methods
institution: Clova AI, Japan
Our aim in this study is to selectively choose synthetic data that are beneficial to the training process.
Figure 2: Distribution of the VAE’s posterior mean vector extracted from recorded (red dot) and synthetic (blue triangle) samples: (a) before and (b) after applying the VAE fine-tuning process.
Figure 3: Density histograms of originality scores, determined using RankSVM, for recorded (red boxes) and synthetic (blue boxes) waveforms.
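To make the selection step concrete, here is a hedged sketch of scoring-based selection. scikit-learn has no built-in RankSVM, so a linear SVM's decision function stands in as the originality score; the features and the keep ratio are placeholders, not the authors' exact setup.

```python
# Sketch: keep the synthetic samples that score most "recorded-like" under
# a discriminator trained to separate recorded from synthetic features.
import numpy as np
from sklearn.svm import LinearSVC

def select_synthetic(feat_recorded, feat_synthetic, keep_ratio=0.5):
    X = np.vstack([feat_recorded, feat_synthetic])
    y = np.r_[np.ones(len(feat_recorded)), np.zeros(len(feat_synthetic))]
    scorer = LinearSVC(C=1.0).fit(X, y)                # recorded vs synthetic
    scores = scorer.decision_function(feat_synthetic)  # higher = more recorded-like
    order = np.argsort(-scores)                        # best candidates first
    n_keep = int(len(feat_synthetic) * keep_ratio)
    return order[:n_keep]                              # indices of samples to keep
```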
Figure 5: MOS test results for various amounts of recorded data: the baseline model trained with recorded data alone (red square) and augmented models trained without (purple triangle) and with (blue dot) the proposed data selection method.
institution: Meta, France
Figure 1: t-SNE visualisation of English and French phone embeddings at the CPC level, for monolingual (EN and FR) and bilingual (EN+FR) models. Embeddings are coloured by phone class, gender, and language labels.
Phone classes: fricative; affricate; plosive; approximant; nasal; nasal vowel; semi-vowel; vowel.
Figure 2: Probing accuracy of logistic-regression language classifiers as a function of the number of active coefficients.
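A plausible reading of Figure 2 is an L1-regularised logistic-regression probe whose sparsity is swept so that the number of active (non-zero) coefficients varies; the sketch below mimics that setup with placeholder embeddings and labels.

```python
# Sketch: language probing with an L1-regularized logistic regression,
# sweeping C to trade accuracy against the number of active coefficients.
import numpy as np
from sklearn.linear_model import LogisticRegression

emb = np.random.randn(2000, 256)           # phone embeddings (placeholder)
lang = np.random.randint(0, 2, size=2000)  # 0 = EN, 1 = FR (placeholder)

for C in [0.01, 0.1, 1.0, 10.0]:
    probe = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    probe.fit(emb, lang)
    active = np.count_nonzero(probe.coef_)
    print(f"C={C}: accuracy={probe.score(emb, lang):.3f}, active={active}")
```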
institution: Berkeley
Electromagnetic Articulography (EMA)
Spectrum-Intermediate Baseline
HiFi-CAR Model
Figure 5: Phoneme confusability based on ASR transcriptions. Phoneme pairs that are confused more frequently have darker lines. Details in Section 8.
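One simple way to derive such a confusability matrix from transcriptions is to align reference and hypothesised phone sequences and count substitution pairs; the difflib alignment below is a stand-in, not necessarily the paper's procedure.

```python
# Sketch: counting phoneme confusions between reference and ASR-transcribed
# phone sequences via a generic sequence alignment.
from collections import Counter
from difflib import SequenceMatcher

def confusion_counts(ref: list, hyp: list) -> Counter:
    pairs = Counter()
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, ref, hyp).get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.update(zip(ref[i1:i2], hyp[j1:j2]))  # substituted phone pairs
    return pairs

print(confusion_counts(["p", "ae", "t"], ["b", "ae", "d"]))
# Counter({('p', 'b'): 1, ('t', 'd'): 1})
```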
To study the generalizability of our time-domain model, we perform interpolation experiments, prompting our model to synthesize unseen articulatory representations between pairs of sounds.
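A sketch of such an interpolation probe, assuming two same-length EMA feature sequences; `synthesize` is a hypothetical handle to the trained articulatory-to-speech model, not the paper's actual API.

```python
# Sketch: linearly blending two articulatory (EMA) feature sequences to
# produce unseen intermediate representations between a pair of sounds.
import numpy as np

def interpolate_ema(ema_a: np.ndarray, ema_b: np.ndarray, steps: int = 5):
    """Yield EMA sequences blended from sound A to sound B."""
    assert ema_a.shape == ema_b.shape  # (frames, articulator_dims)
    for alpha in np.linspace(0.0, 1.0, steps):
        yield (1.0 - alpha) * ema_a + alpha * ema_b

# for ema in interpolate_ema(ema_pa, ema_ba):   # hypothetical inputs
#     wav = synthesize(ema)                     # hypothetical model call
```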
institution: Lancaster University
The phonetician was asked to offer insights on each of the 300 samples with respect to the following:
Observation: a ‘patchwork’ appearance resulting from the concatenative speech synthesis approach applied in the S10 method.
Clearly spoofed speech may be characterised by a lack of definition in formant frequencies (see Figure 2), artificial-looking striations (see Figure 3), and/or breaks within vocalic portions (see Figure 4).
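For readers who want to reproduce this kind of visual inspection, the sketch below renders a spectrogram with SciPy and matplotlib; the file name and analysis parameters are illustrative, not those used by the phonetician.

```python
# Sketch: rendering a spectrogram for visual inspection of formant
# definition, striations, and breaks within vocalic portions.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

sr, audio = wavfile.read("sample.wav")  # hypothetical mono sample under analysis
f, t, Sxx = spectrogram(audio, fs=sr, nperseg=1024, noverlap=768)
plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-10), shading="auto")
plt.ylim(0, 5000)                       # focus on the formant region
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.show()
```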
Figure 5 is an example spectrogram of the spoofed male speech “exactly by the rule book”, with the highlighted section showing the ejective release of the velar plosive.
Figure 6 is an example spectrogram of the spoofed female speech “better for me” with the highlighted section displaying breathiness in the final vowel.
Figure 7 shows the spectrogram of the spoofed male speech “more regularly in the new year”; it is difficult to conclude whether this is because the sample is spoofed or because the speaker has a distinctive voice.
Figure 8 shows the spectrogram of the spoofed male speech “disappointed”; the degree of denasality is very unusual, but it is difficult to conclude whether this is because the sample is spoofed or whether it is a speaker-idiosyncratic marker.