Hatsune Miku (Snow Miku) collaboration petunia "MEIKO Petunia" to be released!
In brief: a vocal sound source for speech synthesis and desktop music (DTM) released by Crypton Future Media, and its characters.
Kyodo News PR Wire
Wikipedia related words
If a term has no explanation, there is no corresponding article on Wikipedia.
Speech synthesis (onsei gōsei; English: speech synthesis) is the artificial production of the human voice[1].
Overview
Humans generate voice through their vocal organs and communicate with it. The task of artificially generating this voice is speech synthesis, and the generated voice is called synthetic speech (gōsei onsei).
Speech synthesis can be realized by various methods. Some musical instruments produce sounds that resemble the human voice, and blowing air through machines that imitate the human throat can also produce voice-like sounds. Using a digital computer, speech can also be synthesized digitally as a form of speech information processing. As of 2021, computer-based methods have made it possible to synthesize speech that is indistinguishable from a real voice.
Speech contains various kinds of information such as linguistic content, speaker characteristics, and emotion, and speech synthesis requires generating synthetic speech with the desired attributes[2]. The desired attributes are given as external input, and generation is carried out accordingly. The task of taking text (sentences) as input and generating speech with the desired linguistic content is called text-to-speech synthesis (English: text-to-speech; TTS). Synthesis of a singing voice in particular is called singing voice synthesis. The task of converting speech into the voice of another individual or character is called voice conversion.
History
Long before the invention of modern signal-processing methods, attempts were made to imitate the voice, for example with the talking drums of West Africa.
In the 1930s, Homer Dudley of Bell Labs developed the vocoder (voice coder), an electronic speech analyzer and synthesizer for telecommunications. He later applied this to build the Voder, a keyboard-operated speech synthesizer, which was exhibited at the 1939 New York World's Fair. Its utterances are said to have been fully intelligible. In the 1940s, Franklin S. Cooper and his colleagues at Haskins Laboratories worked on the development of the Pattern Playback machine, which was completed in 1950. There were several versions of this machine, but only one actually worked. The machine converted diagrams of spectral speech patterns into sound, and was used by Alvin Liberman and others for research in phonetics.
The first computer-based speech synthesis systems were developed in the late 1950s, and the first text-to-speech system was developed in 1968. In 1961, physicist John Larry Kelly, Jr. and Louis Gerstman[6] used an IBM 704 at Bell Labs to perform speech synthesis and made the computer sing the song "Daisy Bell". Arthur C. Clarke, who was visiting his friend John Pierce at Bell Labs, was so impressed by this demonstration that he used it in the climactic scene of 2001: A Space Odyssey, in which HAL 9000 sings the song[7].
Text-to-speech synthesis is the task of converting text (sentences) into speech. This conversion can be regarded as the following problem[10][11]:
Given a set of texts and their corresponding speech waveforms, find the speech waveform corresponding to an arbitrarily given text.
One solution to this problem is statistical machine learning: an approach that learns a probabilistic model of waveform generation from a speech database (corpus) and uses the model as a synthesizer. In human speech production, it is extremely rare for exactly the same waveform to be produced when the same speaker reads the same sentence several times. The speech production process and the speech signal thus have non-deterministic properties, and a probabilistic framework is therefore effective.
In this framework, let the texts and speech waveforms in the speech database (corpus) be $X$ and $W$, respectively. Given an arbitrary text $x$, the predicted distribution $p(w \mid x, X, W)$ of the speech waveform $w$ to be synthesized is estimated, and the synthesized waveform is sampled from this predicted distribution[12]. The distribution model is often split into multiple stages by introducing auxiliary variables and approximations.
Pipeline model
For example, linguistic features and acoustic features are introduced as auxiliary variables and the problem is formulated as follows. Let the acoustic features representing the properties of the speech signals be $O$ (database) and $o$ (speech to be synthesized), let the linguistic features representing the properties of the texts be $L$ (database) and $l$ (arbitrarily given text), and let $\lambda$ be a parametric acoustic model expressing the probability of acoustic features given linguistic features. The predicted distribution can then be decomposed as

$$p(w \mid x, X, W) = \int p(w \mid o)\, p(o \mid l, \lambda)\, p(\lambda \mid L, O)\, p(O \mid W)\, p(L \mid X)\, p(l \mid x)\; \mathrm{d}o\, \mathrm{d}l\, \mathrm{d}\lambda\, \mathrm{d}O\, \mathrm{d}L .$$

The auxiliary variables should then be marginalized out, but if this is approximated by maximizing their joint probability, the predicted distribution can be approximated as

$$p(w \mid x, X, W) \approx p(w \mid \hat{o}), \qquad (\hat{o}, \hat{l}, \hat{\lambda}, \hat{O}, \hat{L}) = \operatorname*{argmax}_{o,\, l,\, \lambda,\, O,\, L} \; p(o \mid l, \lambda)\, p(\lambda \mid L, O)\, p(O \mid W)\, p(L \mid X)\, p(l \mid x) .$$
However, maximizing this joint probability is still difficult, so if it is further approximated by sequential optimization, the following six sub-problems are optimized in turn (a code sketch of the resulting pipeline follows the list).
$\hat{O} = \operatorname*{argmax}_{O}\, p(O \mid W)$ (extraction of acoustic features)
$\hat{L} = \operatorname*{argmax}_{L}\, p(L \mid X)$ (extraction of linguistic features)
$\hat{\lambda} = \operatorname*{argmax}_{\lambda}\, p(\lambda \mid \hat{L}, \hat{O})$ (training of the acoustic model)
$\hat{l} = \operatorname*{argmax}_{l}\, p(l \mid x)$ (prediction of linguistic features)
$\hat{o} = \operatorname*{argmax}_{o}\, p(o \mid \hat{l}, \hat{\lambda})$ (prediction of acoustic features)
$\hat{w} = \operatorname*{argmax}_{w}\, p(w \mid \hat{o})$ (speech waveform generation)
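These sub-problems correspond to the stages of a conventional pipeline TTS system: a text analyzer, an acoustic model, and a waveform generator (vocoder). The following Python sketch only illustrates that structure; the names (analyze_text, AcousticModel, vocoder_synthesize) and the dummy return values are hypothetical placeholders, not part of any particular library.

```python
# Minimal sketch of a pipeline text-to-speech system (hypothetical names).
# Each stage corresponds to one of the sub-problems above.

def analyze_text(text: str) -> list[dict]:
    """Prediction of linguistic features l from the input text x.
    A real system would do morphological analysis, accent estimation, etc.;
    here we only return per-character placeholders."""
    return [{"symbol": ch, "position": i} for i, ch in enumerate(text)]

class AcousticModel:
    """Parametric model lambda: linguistic features -> acoustic features."""
    def train(self, linguistic_feats, acoustic_feats):
        # Training of the acoustic model from the corpus (L_hat, O_hat).
        ...

    def predict(self, linguistic_feats):
        # Prediction of acoustic features o_hat (e.g. mel spectrogram, F0).
        return [[0.0] * 80 for _ in linguistic_feats]  # dummy frames

def vocoder_synthesize(acoustic_feats) -> list[float]:
    """Speech waveform generation w_hat from acoustic features o_hat."""
    return [0.0] * (len(acoustic_feats) * 200)  # dummy waveform samples

def tts(text: str, model: AcousticModel) -> list[float]:
    l_hat = analyze_text(text)          # linguistic features
    o_hat = model.predict(l_hat)        # acoustic features
    w_hat = vocoder_synthesize(o_hat)   # waveform
    return w_hat
```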
End-to-End model
A model that generates the speech waveform directly, without intermediate features, is called an end-to-end model. That is, $p(w \mid x, X, W)$ is represented by a single model and learned directly from the corpus $(X, W)$.
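For contrast with the pipeline sketch above, an end-to-end system exposes a single learned mapping from text to waveform. A minimal illustrative sketch, with hypothetical names and dummy outputs:

```python
class EndToEndTTS:
    """Single model that maps text x directly to a waveform w, trained on
    corpus pairs (X, W) without hand-designed intermediate features."""
    def train(self, texts: list[str], waveforms: list[list[float]]):
        ...  # joint optimization of all internal components

    def synthesize(self, text: str) -> list[float]:
        return [0.0] * 16000  # dummy one-second waveform
```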
Method
Speech synthesis methods are roughly divided into three types.
Rule-based synthesis: synthesizes speech according to rules derived from knowledge of speech production
Concatenative speech synthesis: synthesizes speech by concatenating recorded speech units
Statistical parametric speech synthesis: synthesizes speech based on the output of a statistically trained parametric generative model
Each method has different characteristics such as sound quality, computational complexity, and real-time performance, and the method is selected according to the application.
Rule-based synthesis
Rule-based synthesis[13] establishes rules based on knowledge about speech production obtained through research and generates speech according to those rules. Historically it is a relatively old approach. Examples include the following.
Formant speech synthesis
Formant speech synthesis synthesizes speech by adjusting parameters such as the spectrum and the fundamental frequency. Its features include: no dropped sounds and clear intelligibility even at high speaking rates; a small synthesizer footprint, since no speech database is needed as in statistical methods; and the ability to freely adjust intonation and timbre (within the rules). On the other hand, the synthesized speech sounds robotic and lacks the naturalness of a human voice.
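As a concrete illustration of the source-filter idea behind formant synthesis, the following sketch (assuming NumPy and SciPy) passes a pulse train at the fundamental frequency through a cascade of resonators placed at formant frequencies. The formant values are rough textbook-style figures for an /a/-like vowel and are illustrative only, not taken from any particular synthesizer.

```python
# Minimal formant-synthesis sketch: an impulse train at the fundamental
# frequency is filtered by two-pole resonators, one per formant.
import numpy as np
from scipy.signal import lfilter

fs = 16000                      # sampling rate [Hz]
f0 = 120                        # fundamental frequency [Hz]
duration = 0.5                  # seconds
formants = [(800, 80), (1200, 90), (2500, 120)]  # (frequency, bandwidth) [Hz]

# Glottal source: simple impulse train (one pulse per pitch period).
n = int(fs * duration)
source = np.zeros(n)
source[::int(fs / f0)] = 1.0

# Vocal-tract filter: cascade of two-pole resonators.
signal = source
for freq, bw in formants:
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    a = [1.0, -2 * r * np.cos(theta), r * r]   # denominator coefficients
    signal = lfilter([1.0], a, signal)

signal /= np.max(np.abs(signal))               # normalize amplitude
```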
Articulatory speech synthesis models the structure of the human vocal tract and synthesizes speech based on it. There is an example of commercial use: the system used by NeXT was developed by Trillium Sound Research Inc., a spin-off of a research group at the University of Calgary. Trillium later released it free of charge as gnuspeech, which is available on the GNU Savannah site[*1].
Concatenative speech synthesis
Speech is synthesized by concatenating units of recorded speech. Because recorded speech is used, the result is natural synthetic speech close to a real voice when the input text is close to something in the recordings, but naturalness suffers otherwise, for example at the joins. The speaking rate and pitch can be adjusted to some extent, but more flexible manipulation of the speech is difficult in principle. Because it is technically difficult to synthesize speech whose characteristics change very rapidly, most systems are limited to a neutral speaking style.
For example, there are the following.
Unit selection speech synthesis
Unit selection speech synthesis[14][15] is also called corpus-based speech synthesis, although generative-model speech synthesis also uses a corpus to train its model. When the database is created, speech is recorded and labeled with information such as sentences, phrases, accent phrases, morphemes, phonemes, and accents; the labels are aligned with the speech segments using speech recognition and then corrected manually. At synthesis time, the input text is first analyzed by a text analyzer to obtain information (linguistic features) such as sentences, phrases, accent phrases, morphemes, phonemes, and accents. The fundamental frequency, phoneme durations, and so on are then predicted from these linguistic features, and the speech units that best match them (target cost) are selected from the database while also taking the smoothness of the joins (concatenation cost) into account, and concatenated. This makes it possible to synthesize natural speech close to a real voice. However, to obtain natural-sounding speech for arbitrary input text, more speech must be recorded to cover the expected inputs, and the database grows accordingly. Because concatenative synthesis requires the synthesizer to hold the speech units, this can be a problem for systems that have only a small amount of auxiliary storage.
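The selection step described above amounts to finding the unit sequence that minimizes the total of target and concatenation costs, typically with dynamic programming (a Viterbi-style search). The following sketch illustrates this under simplified assumptions; the cost functions and data layout are placeholders, not those of any actual system.

```python
# Minimal unit-selection sketch: choose one candidate unit per target
# position so that the total of target cost + concatenation cost is
# minimized, via dynamic programming.

def select_units(targets, candidates, target_cost, concat_cost):
    """targets: list of predicted specifications (e.g. phoneme, F0, duration)
    candidates: candidates[i] = list of database units usable at position i
    target_cost(spec, unit): mismatch between the prediction and a unit
    concat_cost(prev_unit, unit): smoothness penalty at the join"""
    n = len(targets)
    # best[i][j] = (cumulative cost, back-pointer) for candidate j at position i
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, n):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            cost, prev = min(
                (best[i - 1][k][0] + concat_cost(candidates[i - 1][k], u) + tc, k)
                for k in range(len(candidates[i - 1]))
            )
            row.append((cost, prev))
        best.append(row)
    # Trace back the lowest-cost path.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(n - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))
```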
Diphone speech synthesis
The diphones (phoneme pairs) of the target language are stored in the speech database and used for synthesis. The number of diphones is determined by the phonotactics of the language (for example, Spanish has about 800 diphones and German about 2,500). In diphone synthesis the database only needs to hold one speech unit per diphone, so the database can be far smaller than for unit selection synthesis. At synthesis time the diphones are arranged and the prosody is imposed using digital signal processing such as linear predictive coding, PSOLA, or MBROLA. The sound quality of the synthesized speech is inferior to that of unit selection synthesis, and with the development of unit selection synthesis the method is now rarely used.
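A minimal sketch of the concatenation step, assuming a precomputed dictionary that maps each diphone to a recorded waveform segment longer than the crossfade window; the simple crossfade join stands in for the more elaborate prosody processing (PSOLA and the like) mentioned above, and all names are illustrative.

```python
# Minimal diphone concatenation sketch: look up one recorded unit per
# diphone and join neighbouring units with a short linear crossfade.
import numpy as np

def to_diphones(phonemes):
    """['sil', 'k', 'o', 'n'] -> ['sil-k', 'k-o', 'o-n']"""
    return [f"{a}-{b}" for a, b in zip(phonemes, phonemes[1:])]

def synthesize(phonemes, diphone_db, fs=16000, xfade_ms=10):
    """diphone_db: dict mapping 'a-b' -> 1-D NumPy waveform array
    (each unit is assumed to be longer than the crossfade window)."""
    xfade = int(fs * xfade_ms / 1000)
    out = np.zeros(0)
    for key in to_diphones(phonemes):
        unit = diphone_db[key]
        if len(out) == 0:
            out = unit.copy()
            continue
        # Linear crossfade between the tail of `out` and the head of `unit`.
        fade = np.linspace(1.0, 0.0, xfade)
        overlap = out[-xfade:] * fade + unit[:xfade] * (1.0 - fade)
        out = np.concatenate([out[:-xfade], overlap, unit[xfade:]])
    return out
```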
Domain-limited speech synthesis
Speech is synthesized by concatenating prerecorded words and phrases. It is used to read out text in a specific domain, for example in station announcement systems. Because the domain is limited, natural-sounding speech is easy to synthesize. However, not every input text can be synthesized, and it is extremely difficult to reuse a synthesizer built for one domain in another: the input text is limited to the words and phrases held in the database, and additional recordings are needed whenever new input text must be supported (for example, when a new station opens). It is also difficult to reproduce pronunciation changes that depend on the surrounding words, such as liaison in French; in such cases the recording and synthesis must take the context into account.
Statistical parametric speech synthesis
Statistical parametric speech synthesis (English: statistical parametric speech synthesis; SPSS) is a general term for speech synthesis based on statistical models, that is, probabilistic speech synthesis[16].
A parametric generative model that learns the characteristics of speech from recordings is built, and speech is synthesized based on the output of this generative model. Whereas concatenative synthesis can, depending on the conditions, have problems with the smoothness of the synthesized speech, statistical synthesis can in principle always synthesize smooth speech. The approach also enables flexible and diverse synthesis, such as voice qualities intermediate between multiple speakers or rapidly changing emotional speech.
Hidden Markov model speech synthesis
Speech synthesis that uses a hidden Markov model (HMM) as the acoustic model. The HMM probabilistically generates a sequence of acoustic features, and a vocoder converts this sequence into a speech waveform.
A pioneering form of statistical parametric speech synthesis, proposed in 1999 by a team at the Tokyo Institute of Technology[17]. The characteristics of speech can be expressed with few parameters, and the model size and the computational cost of training and synthesis are small, so it runs even on devices with severe hardware constraints such as mobile phones (feature phones) and electronic organizers. The amount of recording required is also smaller than for (commercial) unit selection synthesis.
Because of the simplicity of the model, the spectrum tends to be smoother than that of a human voice, so the synthesized speech lacks the texture of a real voice, and the trajectory of the fundamental frequency also tends to be overly simple.
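The generation side of HMM synthesis can be caricatured as emitting, for each state, the mean of its Gaussian over acoustic features for the predicted number of frames. The sketch below does only that; real systems additionally model dynamic (delta) features and use a parameter-generation algorithm to obtain smooth trajectories, and all values here are illustrative.

```python
# Simplified sketch of acoustic-feature generation from an HMM.
# Each state holds a Gaussian mean over the acoustic feature vector and a
# predicted duration in frames; we emit the mean of each state per frame.
import numpy as np

def generate_features(states):
    """states: list of dicts with keys 'mean' (np.ndarray) and 'duration' (int)."""
    frames = []
    for state in states:
        for _ in range(state["duration"]):
            frames.append(state["mean"])
    return np.stack(frames)          # shape: (total_frames, feature_dim)

# Example: three states with 25-dimensional features.
states = [
    {"mean": np.full(25, 0.1), "duration": 8},
    {"mean": np.full(25, 0.5), "duration": 12},
    {"mean": np.full(25, 0.2), "duration": 6},
]
acoustic_features = generate_features(states)   # would be passed to a vocoder
```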
Neural network speech synthesis
Neural network speech synthesis is speech synthesis that uses a neural network as the synthesis model. Either the acoustic model (the mapping from linguistic features to acoustic features) is modeled by a neural network, or the probability distribution of the speech waveform itself, conditioned on linguistic features (a generative model), is modeled by a neural network.
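As a minimal illustration of the first case (a neural acoustic model mapping linguistic features to acoustic features), the following sketch (assuming PyTorch) maps per-frame linguistic feature vectors to mel-spectrogram-like frames; the dimensions and architecture are arbitrary placeholders rather than those of any published system.

```python
# Minimal neural acoustic model sketch:
# per-frame linguistic features -> per-frame acoustic features (e.g. mel bins).
import torch
import torch.nn as nn

class AcousticMLP(nn.Module):
    def __init__(self, linguistic_dim=300, acoustic_dim=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(linguistic_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, acoustic_dim),
        )

    def forward(self, linguistic_feats):           # (frames, linguistic_dim)
        return self.net(linguistic_feats)          # (frames, acoustic_dim)

model = AcousticMLP()
dummy_input = torch.randn(100, 300)                # 100 frames of features
mel = model(dummy_input)                           # predicted acoustic frames
loss = nn.functional.mse_loss(mel, torch.zeros_like(mel))  # example training criterion
```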
The first paper was published by a Google team in 2013[18]. Neural network models are more expressive than hidden Markov models and enable more natural speech synthesis. On the other hand, the number of model parameters and the computational cost of training and synthesis are large, so in practice synthesis is often performed on servers, and various studies have aimed at making it run in environments without GPUs (such as some smartphones).
As in hidden Markov model synthesis, the neural network can be a model that outputs acoustic features. In addition, starting with WaveNet (Google, 2016)[19], methods that directly model and output the speech waveform have appeared. Under limited conditions, these waveform generation models can synthesize speech whose quality is very close to (or on a par with) human speech. Since the advent and commercialization of WaveNet, various studies have sought to achieve the same voice quality with faster, lighter, and simpler models (WaveNet Vocoder[20], ClariNet[21], WaveGlow[22], WaveRNN[23], RNN_MS[24], etc.).
Conventionally, linguistic features (the input text analyzed by a text analyzer) have been used as the model input. In 2017, so-called end-to-end speech synthesis methods that remove the need for linguistic features (and hence for a text analyzer), such as Char2Wav[25], Deep Voice[26], and Tacotron[27], were proposed, and active research and development continues.
Statistical parametric speech synthesis, which uses designed linguistic and acoustic features in this way, is thus expanding toward waveform generation that does not depend on such features, that is, statistical speech waveform synthesis (SSWS)[29].
Classification
Speech synthesis can be classified from several points of view.
Brain activity: a type of brain-machine interface[30]
(Acoustic features: vocoders; often incorporated into text-to-speech synthesis and voice conversion)
Voice conversion
Voice conversion (English: voice conversion) is the task of converting some characteristic of an input speech signal[31]. It can be divided into subtasks such as speaker conversion, which changes the speaker while preserving the linguistic content[32], and emotion conversion, which changes only the tone of voice. The task of preserving the speaker and timbre while converting only the linguistic content into a foreign language can be regarded both as a speech translation task and as a voice conversion task.
In speech synthesis, synthetic speech with specified characteristic attributes is often generated[2]. The attributes range from acoustic features to perceptual features of speech and include the following; individuality and accent arise from combinations of these attributes.
In text-to-speech synthesis, the reading of the input text (sentence) must be estimated correctly, generally by combining rules, dictionaries, and statistical methods. This involves various difficulties. In Japanese, for example, it is difficult to correctly distinguish between the on and kun readings of kanji (or to estimate which reading applies when a kanji has several), to distinguish homographs, to estimate accents, and to estimate the readings of personal and place names.
Objective evaluation of quality
Among the aspects of speech synthesis quality, the naturalness of synthetic speech is difficult to evaluate objectively; no objective metric is commonly accepted as valid among experts. The same applies to similarity to the target speaker and to reproduction of the target speaking style.
Fair comparison of performance
Because researchers often train models on their own datasets and evaluate them on their own tasks, it can be difficult to compare the performance of speech synthesis methods fairly. For this reason, the Speech Synthesis Special Interest Group (SynSIG) of the International Speech Communication Association (ISCA) has held the Blizzard Challenge[33] competition every year since 2005. In this competition, speech synthesis systems trained on a common dataset are evaluated on common tasks, enabling a fair comparison of performance.
In commercial speech synthesis systems in particular, performance can be improved for a specific purpose by using a dataset tailored to that purpose, and building such datasets has itself become part of the know-how.
Speech synthesis system
As of 2020, the operating systems of major personal computers and smartphones include a read-aloud function (screen reader) based on speech synthesis. Historically, a variety of speech synthesis systems have been put to practical use; the following are examples.
The TI-99/4A could be fitted with an optional speech synthesis module[34].
The PC-6001 could take a speech synthesis cartridge, and the PC-6001mkII had a built-in speech synthesis function. Its successors, the PC-6001mkIISR and PC-6601, could even sing.
An optional speech synthesis board (MB22437 / FM-77-431) was available for the FM-7/FM-77 series.
The MZ-1500/2500/2861 had an optional voice board (MZ-1M08). The Japanese syllabary and some phrases were sampled, burned into a ROM on an external chip, and played back under program control.
U.S.S
Festival Speech Synthesis System
gnuspeech
HMM-based Speech Synthesis System (HTS)
Open JTalk(HTS-based speech synthesis system for Japanese)
MaryTTS
Academic journals / academic societies
Academic journals and societies in which speech synthesis research is discussed include the following (those in bold peer-review some or all of their papers).
Academic magazine
European Association for Signal Processing (EURASIP)
Matsuo Laboratory, Department of Technology Management Strategy, Graduate School of Engineering
Graduate School of Engineering, Department of Electrical Engineering, Minematsu / Saito Laboratory
Graduate School of Information Science and Technology, Department of System Informatics, System Information Laboratory 1 (Saruwatari / Koyama Laboratory)
Tokyo Institute of Technology
Kobayashi Laboratory, Information and Communication Systems, Faculty of Engineering
Text analysis in speech synthesis cannot be performed with 100% accuracy, and sometimes a specific reading is wanted that cannot be inferred from the text alone. The information therefore has to be specified by some means; besides approaches based on a domain-specific language, there is the approach of using the Speech Synthesis Markup Language (SSML) defined by the W3C.
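For illustration, the snippet below builds an SSML string that makes a reading and the prosody explicit. The <speak>, <prosody>, and <sub> elements are defined in the W3C SSML specification; the surrounding code and the speak_ssml() call are hypothetical.

```python
# Building an SSML string to control the reading and prosody explicitly.
# How the string is passed to a synthesizer depends on the engine and is
# only hinted at with a hypothetical speak_ssml() call.
ssml = """\
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="ja-JP">
  <prosody rate="95%" pitch="+10%">
    <!-- Force the desired reading of an ambiguous date expression. -->
    <sub alias="いちがつついたち">1月1日</sub>に開始します。
  </prosody>
</speak>
"""

# speak_ssml(ssml)   # hypothetical engine call; actual APIs differ
print(ssml)
```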
^"Speech synthesis is the task of generating speech waveforms" Wang, et al. (2021). FAIRSEQ S2 : A scalable and Integrable Speech Synthesis Toolkit.
^ ab"with desired characteristics, including but not limited to textual content ..., speaker identity ..., and speaking styles" Wang, et al. (2021). FAIRSEQ S2 : A scalable and Integrable Speech Synthesis Toolkit.
^ Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine (Mechanism of human speech, with a description of its speaking machine)
^Mattingly, Ignatius G. Speech synthesis for phonetic and phonological models. In Thomas A. Sebeok (Ed.), Current Trends in Linguistics, Volume 12, Mouton, The Hague, pp. 2451-2487, 1974.
^ Tokuda, Keiichi (2015). "Present / Past / Future of Statistical Speech Synthesis Technology". Spoken Language Symposium. IEICE-115. ISSN 0913-5685.
^ Tokuda, Keiichi (2017). "The latest trends in rapidly developing speech synthesis research". Journal of the Information and Systems Society (Institute of Electronics, Information and Communication Engineers) 21 (4): 10–11. doi:10.1587/ieiceissjournal.21.4_10. ISSN 2189-9797. NAID 130005312792.
^ Zen, Heiga (2018). "Transition and cutting edge of text-to-speech synthesis technology". Journal of the Acoustical Society of Japan 74 (7): 387–393.
^Klatt, Dennis H. (1980). “Real-time speech synthesis by rule”. The Journal of the Acoustical Society of America68: S18.
^ Hunt, Andrew J.; Black, Alan W. (1996). "Unit selection in a concatenative speech synthesis system using a large speech database" (English). 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings (IEEE): 373–376. doi:10.1109/ICASSP.1996.541110. ISBN 0-7803-3192-3. ISSN 1520-6149.
^ Kawai, Hisashi; Toda, Tomoki; Yamagishi, Junichi; Hirai, Toshio; Ni, Jinfu; Nishizawa, Nobuyuki; Tsuzaki, Minoru; Tokuda, Keiichi (2006). "Speech synthesis system XIMERA using a large-scale corpus". Journal of the Institute of Electronics, Information and Communication Engineers J89-D (12): 2688–2698. ISSN 1880-4535. NAID 110007380404.
^ Masuko, Takashi; Tokuda, Keiichi; Kobayashi, Takao; Imai, Satoshi (1996). "Speech synthesis using HMMs with dynamic features" (English). 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings (IEEE): 389–392. doi:10.1109/ICASSP.1996.541114. ISBN 0-7803-3192-3. ISSN 1520-6149.
^ Zen, Heiga; Senior, Andrew; Schuster, Mike (2013). "Statistical parametric speech synthesis using deep neural networks" (English). 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE): 7962–7966. ISBN 978-1-4799-0356-6. ISSN 1520-6149.
^ van den Oord, Aaron; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew et al. (2016). "WaveNet: A Generative Model for Raw Audio" (English). arXiv:1609.03499.
^J. Shen, R. Pang, RJ Weiss, et al., “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,” arXiv preprint arXiv: 1712.05884, 2017.
^W. Ping, K. Peng, and J. Chen, “Clarinet: Parallel wave generation in end-to-end text-to-speech,” arXiv preprint arXiv: 1807.07281, 2018
^R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A flowbased generative network for speech synthesis,” arXiv preprint arXiv: 1811.00002, 2018
^N. Kalchbrenner, E. Elsen, K. Simonyan, et al., “Efficient neural audio synthesis,” arXiv preprint arXiv: 1802.08435, 2018.
^Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote, Alexis Moinet, Vatsal Aggarwal (2019) TOWARDS ACHIEVING ROBUST UNIVERSAL NEURAL VOCODING. Interspeech 2019
^ Gopala K. Anumanchipalli, et al. (2019). Speech synthesis from neural decoding of spoken sentences.
^"Voice conversion (VC) refers to a technique that converts a certain aspect of speech from a source to that of a target without changing the linguistic content" Huang, et al. (2021). S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised Speech Representations. p.1.
^"speaker conversion, which is the most widely investigated type of VC." Huang, et al. (2021). S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised Speech Representations. p.1.