Portal field news


📢 | Hatsune Miku *1 (Snow Miku) collaboration petunia XNUMXrd, "MEIKO Petunia," to be released!

Rough summary of the contents:
A vocal sound source for speech synthesis and desktop music (DTM) released by Crypton Future Media, and its character.

April 2022. To all members of the press: Yokohama Ueki Co., Ltd., Hokkaido Branch, 4 Kita 14-2 Heiwadori, Shiroishi-ku, Sapporo. Hatsune ... → Continue reading

 Kyodo News PR Wire

Kyodo News PR Wire distributes press releases and news releases, connecting information from "those who want it known" to "those who want to know."
It is a site worth consumers' attention, gathering news releases from governments and government agencies, including local governments and universities.

Related Wikipedia terms

Terms without an explanation have no corresponding article on Wikipedia.

Speech synthesis

Speech synthesis (onsei gōsei; English: speech synthesis) is the artificial production of the human voice[1].


Humans generate voice through their vocal organs and use it to communicate. The task of generating this voice artificially is speech synthesis, and the voice so generated is called synthetic speech (gōsei onsei).

Speech synthesis can be realized by various methods. Some musical instruments produce sounds resembling the human voice, and blowing air through machines that imitate the human throat can likewise produce voice-like sounds. Using a digital computer, speech can also be synthesized digitally as a form of voice information processing. As of 2021, computer-based methods can produce synthetic speech that is indistinguishable from a real voice.

Speech carries various kinds of information, such as linguistic content, speaker characteristics, and emotion, and speech synthesis must generate synthetic speech with the desired attributes[2]. The desired attributes are given as external input and generation is performed accordingly. The task of taking text (sentences) as input and generating speech with the desired linguistic content is text-to-speech synthesis (English: text-to-speech; TTS). Synthesis of singing voice in particular is called singing voice synthesis. Converting a voice into that of another individual or character is called voice conversion.


Long before the invention of modern signal-processing methods, attempts were made to imitate the voice, for example with the talking drums of West Africa.

In 1779, Christian Gottlieb Kratzenstein built a machine capable of producing vowel sounds[3]. This line of work was taken over by Wolfgang von Kempelen, who built a mechanical speech synthesizer using bellows; his 1791 paper[4] describes the machine. It modeled the tongue and lips, and could pronounce consonants as well as vowels. In 1837, Charles Wheatstone built a talking machine based on von Kempelen's design, and in 1857 M. Faber created the Euphonia. Wheatstone's machine was reproduced by Paget in 1923[5].

In the 1930s, Homer Dudley of Bell Labs developed the vocoder (voice coder), an electronic speech analyzer and synthesizer for communications. Applying it, he then built the Voder, a keyboard-operated speech synthesizer, and exhibited it at the 1939 New York World's Fair; its utterances are said to have been fully intelligible. In the 1940s, Franklin S. Cooper and colleagues at the Haskins Laboratories worked on the Pattern Playback machine, completing it in 1950. Several versions were built, but only one actually worked. The machine converted diagrams of spectral speech patterns into sound, and Alvin Liberman and others used it for research in phonetics.

The first computer-based speech synthesizer was developed in the late 1950s, and the first text-to-speech system in 1968. In 1961, physicist John Larry Kelly, Jr. and Louis Gerstman[6] used an IBM 704 at Bell Labs to synthesize speech, and made the computer sing the song "Daisy Bell". Arthur C. Clarke, who was visiting his friend John Pierce at Bell Labs, was so impressed by hearing this demo that it gave rise to the climactic scene of 2001: A Space Odyssey in which HAL 9000 sings the song[7].

In 1999, a team at Tokyo Institute of Technology proposed hidden Markov model speech synthesis, a pioneering example of speech synthesis using statistical generative models. In 2013, a team at Google proposed speech synthesis based on deep learning, and in 2017 end-to-end text-to-speech synthesis, which requires no separate text-processing module, was proposed.



Speech synthesis is used in a wide range of services: automatic response at call centers, electronic devices such as TMJ and multifunction printers, in-plant announcements at factories, disaster-prevention radio[† 1], in-vehicle and guidance announcements at stations, bus terminals, and airports[† 2], car navigation, electronic dictionaries[† 3], home appliances[† 4], smartphones, smart-speaker applications[† 5], voice assistants[† 6][† 7][† 8][† 9][† 10], entertainment robots[8][† 11], anime[† 12], broadcast fields such as television programs[† 13][† 14], community broadcasting[9], and highway radio[† 15], and reading e-books aloud[† 16]. Speech synthesis is also used in screen readers for people who are visually impaired or dyslexic, and it is sometimes used in place of their own voice by people who have difficulty speaking because of illness or its treatment[† 17][† 18].


Text-to-speech synthesis

Text-to-speech synthesis is the task of converting text (sentences) into speech. This conversion can be framed as the following problem[10][11]:

Given a set of texts and their corresponding speech waveforms, find the speech waveform corresponding to an arbitrarily given text.

One solution to this problem is statistical machine learning: an approach that learns a probabilistic model of waveform generation from a speech database (corpus) and uses it as the synthesizer. In human speech production, it is extremely rare for exactly the same waveform to result when the same speaker reads the same sentence several times. The speech production process and the speech signal thus have non-deterministic properties, and a probabilistic framework is therefore effective.

In this framework, let X and Y denote the texts and speech waveforms in the speech database (corpus), respectively. Given an arbitrary text x, the predicted distribution p(y | x, X, Y) of the speech y to be synthesized is estimated, and the synthetic speech ŷ is sampled from this predicted distribution[12]. The distribution model is often divided into multiple stages by introducing auxiliary variables and approximations.

Pipeline model

For example, linguistic features and acoustic features are introduced as auxiliary variables and the problem is formulated as follows. Let O and o denote the acoustic features representing the properties of the speech signals (for the database and the synthesized speech, respectively), L and l the linguistic features representing the properties of the text (for the database and an arbitrarily given text, respectively), and λ a parametric acoustic model expressing the probability of acoustic features given linguistic features. The predicted distribution then decomposes as

p(y | x, X, Y) = ∫∫∫∫∫ p(y | o) p(o | l, λ) p(λ | L, O) p(l | x) p(L | X) p(O | Y) do dl dλ dL dO.

The auxiliary variables should then be marginalized out, but approximating this by maximizing their joint probability instead yields an approximation of the predicted distribution in which each auxiliary variable is replaced by its most probable value.

Since maximizing the joint probability is still difficult, approximating it further by sequential optimization splits the problem into the following six sub-problems, each optimized separately.

  • Extraction of acoustic features: estimate O from the database speech Y
  • Extraction of linguistic features: estimate L from the database text X
  • Learning the acoustic model: estimate λ from L and O
  • Prediction of linguistic features: estimate l from the given text x
  • Prediction of acoustic features: estimate o from l using λ
  • Speech waveform generation: generate y from o
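
As a toy illustration of these six stages, here is a minimal sketch. It assumes placeholder features (frame energies for acoustic features, characters for linguistic features) instead of a real text analyzer, acoustic model, and vocoder; none of the function names are a real toolkit API.

```python
# Toy sketch of the six pipeline sub-problems. Every function body is a
# deliberately simplified placeholder (frame energies instead of spectra,
# characters instead of phonemes); none of this is a real toolkit API.

FRAME = 4  # samples per analysis frame

def extract_acoustic_features(waveform):
    # 1. Acoustic feature extraction: O from the database speech Y
    return [sum(abs(s) for s in waveform[i:i + FRAME]) / FRAME
            for i in range(0, len(waveform), FRAME)]

def extract_linguistic_features(text):
    # 2. Linguistic feature extraction: L from the database text X
    return list(text.lower())

def train_acoustic_model(ling, acou):
    # 3. Acoustic model training: mean acoustic value per linguistic symbol
    sums, counts = {}, {}
    for l, o in zip(ling, acou):
        sums[l] = sums.get(l, 0.0) + o
        counts[l] = counts.get(l, 0) + 1
    return {l: sums[l] / counts[l] for l in sums}

def predict_linguistic_features(text):
    # 4. Linguistic feature prediction: l from the input text x
    return list(text.lower())

def predict_acoustic_features(ling, model):
    # 5. Acoustic feature prediction: o from l via the model
    return [model.get(l, 0.0) for l in ling]

def generate_waveform(acou):
    # 6. Waveform generation: y from o (hold each feature for one frame)
    return [o for o in acou for _ in range(FRAME)]

# Tiny "corpus": one text, one 8-sample waveform; then synthesize new text.
X, Y = "ab", [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
model = train_acoustic_model(extract_linguistic_features(X),
                             extract_acoustic_features(Y))
synth = generate_waveform(predict_acoustic_features(
    predict_linguistic_features("ba"), model))
print(synth)  # loud frame for "b", then quiet frame for "a"
```

Each stage can be developed and replaced independently, which is the practical appeal of the pipeline decomposition described above.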

End-to-End model

A model that generates the speech waveform directly, without intermediate features, is called an end-to-end model. That is, the mapping from text to speech is represented by a single model, which is trained on the corpus.


Speech synthesis methods fall roughly into three types.

  • Rule-based synthesis: synthesize speech according to rules derived from knowledge of speech production
  • Concatenative speech synthesis: concatenate recorded speech units
  • Statistical parametric speech synthesis: synthesize speech from the output of a statistically trained parametric generative model

Each method differs in characteristics such as sound quality, computational cost, and real-time performance, and the method is chosen according to the application.

Rule-based synthesis

Rule-based synthesis[13] establishes rules based on knowledge of speech production gained through research and generates speech according to those rules. Historically it is a relatively old approach. Examples include the following.

Formant speech synthesis

Formant speech synthesis adjusts parameters such as the spectrum and fundamental frequency to synthesize speech. Its features include: no dropped sounds and clear intelligibility even at high speaking rates; a small synthesizer, since no speech database is needed as in statistical methods; and free adjustment of intonation and timbre (within the rules). On the other hand, the synthesized speech is robot-like and poor in human-voice naturalness.

It was formerly often used in embedded systems, for example the Speak & Spell toy released by Texas Instruments at the end of the 1970s, and some arcade games developed by Sega in the 1980s (Astro Blaster, Space Fury, Star Trek: Strategic Operations Simulator, etc.).
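
A minimal source-filter sketch of formant synthesis: an impulse train at the fundamental frequency is passed through a cascade of second-order resonators, one per formant. The formant frequencies and bandwidths below are illustrative values for an /a/-like vowel, not taken from any particular system.

```python
import math

def resonator(x, f, bw, fs):
    # Klatt-style second-order IIR resonator: a single formant peak at
    # frequency f (Hz) with bandwidth bw (Hz); unity gain at 0 Hz.
    c = -math.exp(-2 * math.pi * bw / fs)
    b = 2 * math.exp(-math.pi * bw / fs) * math.cos(2 * math.pi * f / fs)
    a = 1 - b - c
    y1 = y2 = 0.0
    out = []
    for s in x:
        y0 = a * s + b * y1 + c * y2
        out.append(y0)
        y1, y2 = y0, y1
    return out

def synthesize_vowel(f0=120, formants=((800, 80), (1200, 90), (2500, 120)),
                     fs=16000, dur=0.25):
    # Glottal source: an impulse train at the fundamental frequency f0,
    # filtered through one resonator per formant (the synthesizer's
    # "rules" fix the formant frequencies and bandwidths).
    n = int(fs * dur)
    period = int(fs / f0)
    sig = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for f, bw in formants:
        sig = resonator(sig, f, bw, fs)
    peak = max(abs(s) for s in sig)
    return [s / peak for s in sig]

wave = synthesize_vowel()  # 0.25 s of an /a/-like vowel at 16 kHz
print(len(wave))
```

Changing `f0` or the formant table changes pitch and vowel quality directly, which illustrates why rule-based synthesizers are small and freely adjustable but sound robotic.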

Articulatory speech synthesis

Articulatory speech synthesis models the structure of the human vocal tract and synthesizes speech on that basis. It has also been used commercially: the system used on NeXT computers was developed by Trillium Sound Research Inc., a spin-off of a University of Calgary research team. Trillium later released it free of charge as gnuspeech, which is available on the GNU Savannah site[*1].

Concatenative speech synthesis

Units of recorded speech are concatenated and synthesized. Because recorded speech is used, when the input text is close to something in the recordings the result is natural synthetic speech close to a real voice; when it is not, naturalness suffers at the joins and elsewhere. Speaking rate and pitch can be adjusted to some extent, but more flexible manipulation of the voice is difficult in principle. Since synthesizing very rapidly changing speech is technically difficult, most systems are limited to a neutral speaking style.

For example, there are the following.

Unit selection speech synthesis

Unit selection speech synthesis[14][15] is also called corpus-based speech synthesis, although generative-model speech synthesis also trains its models on a corpus. To build the database, speech is recorded and labeled with sentences, phrases, accent phrases, morphemes, phonemes, accents, and so on; the labels are aligned to the audio segments by speech recognition and manual correction. At synthesis time, the input text is first analyzed by a text analyzer to obtain linguistic features such as sentences, phrases, accent phrases, morphemes, phonemes, and accents. Next, the fundamental frequency, phoneme durations, and so on are predicted from the linguistic features, and the speech units that best match them (target cost) are selected from the database while also considering the smoothness of the joins (concatenation cost), then concatenated. This makes it possible to synthesize natural speech close to a real voice. However, to synthesize natural-sounding speech for arbitrary input text, more speech must be recorded to cover the expected input, and the database grows accordingly. Since concatenative synthesis requires the synthesizer to hold the speech units, this can be a problem for systems with only small auxiliary storage.
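
The selection step can be sketched as dynamic programming over candidate units, trading off a target cost against a concatenation cost. The database, the single "pitch" feature, and both cost functions below are invented for illustration; real systems use many features and weighted cost terms.

```python
# Toy unit-selection search: choose one recorded unit per target phoneme by
# dynamic programming, minimizing target cost (distance from the predicted
# pitch) plus concatenation cost (pitch jump between adjacent units). The
# one-number "pitch" feature, the costs, and the database are all invented.

def select_units(targets, database):
    # targets:  list of (phoneme, predicted_pitch)
    # database: {phoneme: [pitch of each recorded candidate unit]}
    prev = None  # per candidate of previous slot: (total cost, chosen path)
    for phoneme, pitch in targets:
        cur = []
        for cand in database[phoneme]:
            target_cost = abs(cand - pitch)
            if prev is None:
                cur.append((target_cost, [cand]))
            else:
                cost, path = min(
                    (c + abs(cand - p[-1]) + target_cost, p)
                    for c, p in prev)
                cur.append((cost, path + [cand]))
        prev = cur
    return min(prev)[1]

db = {"a": [100, 140], "i": [110, 180]}
print(select_units([("a", 120), ("i", 115)], db))  # [100, 110]
```

Here the pair (100, 110) wins even though both "a" candidates match the target equally well, because the smaller pitch jump at the join gives a lower concatenation cost.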

Diphone speech synthesis

The speech database stores the diphones (phoneme pairs) of the target language, which are used for synthesis. The number of diphones is determined by the phonotactics of the language (for example, Spanish has about 800 diphones and German about 2,500). Because only one speech unit per diphone needs to be stored, the database can be far smaller than for unit selection synthesis. At synthesis time, the diphones are concatenated and prosody is imposed by digital signal processing such as linear predictive coding, PSOLA, or MBROLA. The sound quality of the synthesized speech is inferior to unit selection synthesis, and with the development of unit selection it is now rarely used.

Field-limited speech synthesis

Speech is synthesized by concatenating recorded words and phrases. This is used for reading out text in a specific field, for example station announcements. Because the field is limited, natural-sounding speech is easy to synthesize. However, not every input text can be synthesized, and it is extremely difficult to reuse a given synthesizer in another field: the input is limited to the words and phrases held in the database, and supporting new input (for example, when a new station opens) requires additional recording. It is also difficult to reproduce pronunciation changes that depend on the surrounding words, such as liaison in French; in that case, the recordings must be made and concatenated with context taken into account.

Statistical parametric speech synthesis

Statistical parametric speech synthesis (English: statistical parametric speech synthesis; SPSS) is a general term for speech synthesis based on statistical models, that is, probabilistic speech synthesis[16].

A parametric generative model that has learned the characteristics of speech from recordings is built, and speech is synthesized from the output of that model. Whereas concatenative synthesis can have smoothness problems at the joins depending on the conditions, statistical speech synthesis basically produces smooth speech. The approach also enables flexible and diverse synthesis, such as voice qualities intermediate between several speakers and rapidly changing, emotional speech.

Hidden Markov model speech synthesis

Speech synthesis that uses hidden Markov models (HMMs) as the acoustic model. The HMM probabilistically generates a sequence of acoustic features, which a vocoder then converts into a speech waveform.

A pioneering method of statistical parametric speech synthesis, proposed in 1999 by a team at Tokyo Institute of Technology[17]. The characteristics of speech can be expressed with few parameters, and the model size and the computational cost of training and synthesis are small, so it runs even on heavily constrained hardware such as (feature) mobile phones and electronic organizers. The required amount of recording is also smaller than for (commercial) unit selection synthesis.

Because of the simplicity of the model, the spectrum tends to be smoother than that of human speech, so the synthesized voice lacks the feel of a real voice. The fundamental frequency trajectory also tends to be overly simple.
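
As a conceptual sketch of this kind of parametric generation, the following assumes one invented left-to-right state sequence, each state holding a Gaussian over a single acoustic feature and a fixed duration. A real HMM system models full spectral and excitation features, delta (dynamic) features, and learned durations.

```python
import random

# Invented left-to-right "states" for one phoneme sequence: each state has
# a Gaussian (mean, stddev) over a single log-F0-like acoustic feature and
# a fixed duration in frames. All numbers are illustrative only.
def generate_trajectory(states, seed=0):
    rng = random.Random(seed)
    traj = []
    for mean, std, dur in states:
        # Emit one noisy sample around the state mean per frame
        traj.extend(rng.gauss(mean, std) for _ in range(dur))
    return traj

states = [(4.8, 0.02, 10), (5.0, 0.02, 15), (4.7, 0.02, 10)]
traj = generate_trajectory(states)
print(len(traj))  # 35 frames of a piecewise-flat, slightly noisy trajectory
```

The piecewise-constant state means are exactly why HMM-generated trajectories come out smoother and simpler than natural ones, the weakness noted above; real systems mitigate this with delta features and parameter-generation smoothing.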

Neural network speech synthesis

Neural network speech synthesis is speech synthesis that uses neural networks as the synthesis model. Either the acoustic model (the mapping from linguistic features to acoustic features) is modeled by a neural network, or the probability distribution of the speech waveform itself, conditioned on the linguistic features (a generative model), is modeled by a neural network.

The first paper was published by a Google team in 2013[18]. Neural network models are more expressive than hidden Markov models and enable more natural speech synthesis. On the other hand, the number of model parameters and the computational cost of training and synthesis are large, so in practice synthesis is often performed on servers, and much research aims at enabling operation in environments without a GPU (such as some smartphones).

As with hidden Markov model synthesis, the neural network model can output acoustic features. In addition, starting with WaveNet (Google, 2016)[19], methods that directly model and output the speech waveform have appeared. Under limited conditions, these waveform generation models can synthesize speech of a quality very close (or equal) to human speech. Following the appearance and commercialization of WaveNet, various studies have pursued the same voice quality with faster, lighter, and simpler models (WaveNet Vocoder[20], ClariNet[21], WaveGlow[22], WaveRNN[23], RNN_MS[24], etc.).

Conventionally, linguistic features (the input text analyzed by a text analyzer) were used as the model input. Starting with Char2Wav[25], Deep Voice[26], and Tacotron[27] in 2017, so-called end-to-end speech synthesis, which requires no linguistic features (and hence no text analyzer), has been proposed and is being actively researched and developed.
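
To convey the idea of sample-level autoregressive waveform models such as WaveNet, here is a deliberately tiny sketch: quantize the waveform, learn p(next level | current level) by counting, and synthesize one sample at a time. WaveNet itself conditions on a long dilated-convolution context and on linguistic features; this one-step counter is only a conceptual illustration.

```python
import random
from collections import Counter, defaultdict

# Toy autoregressive waveform model: count transitions between quantized
# amplitude levels, then synthesize by repeatedly sampling the learned
# next-level distribution. (A real model like WaveNet predicts the next
# sample from a deep network over hundreds of past samples.)

def train(levels):
    model = defaultdict(Counter)
    for prev, nxt in zip(levels, levels[1:]):
        model[prev][nxt] += 1
    return model

def sample(model, start, n, seed=0):
    rng = random.Random(seed)
    out = [start]
    while len(out) < n:
        counts = model[out[-1]]
        cands = list(counts)
        out.append(rng.choices(cands,
                               weights=[counts[c] for c in cands])[0])
    return out

data = [0, 1] * 50           # a quantized "recording" alternating two levels
model = train(data)
print(sample(model, 0, 10))  # [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
```

Because generation consumes its own previous output one sample at a time, autoregressive models are slow at synthesis, which motivates the faster parallel vocoders listed above.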

Table. Neural TTS

  Model name    | Input   | Output          | Model           | Source
  Tacotron 2    | text    | mel spectrogram | autoregressive  | archive
  FastSpeech 2  | phoneme | mel spectrogram | Transformer[28] | archive
  FastSpeech 2s | phoneme | waveform        | Transformer[28] | archive

Parametric speech synthesis using hand-designed linguistic and acoustic features in this way (statistical parametric speech synthesis) is thus expanding into waveform generation that does not depend on such features, that is, statistical speech waveform synthesis (SSWS)[29].


Speech synthesis can be classified from several points of view.


Voice conversion

Voice conversion (English: voice conversion) is the task of converting some characteristic of input speech[31]. It can be divided into various subtasks, such as speaker conversion, which changes the speaker while keeping the linguistic content[32], and emotion conversion, which changes only the tone of voice. Converting only the linguistic content into a foreign language while preserving the speaker and timbre can be regarded as both a speech translation task and a voice conversion task.


In speech synthesis, synthetic speech with specified attributes is often generated[2]. Attributes range from acoustic characteristics to perceptual characteristics of speech and include the following; combinations of attributes give rise to individuality, character, and accent.


Correct estimation of how to read the text

In text-to-speech synthesis, the reading of the input text (sentences) must be estimated correctly. Rules, dictionaries, and statistical methods are generally combined, but various difficulties remain. In Japanese, for example, it is difficult to correctly distinguish the on and kun readings of kanji (or to estimate which reading applies when there are several), to distinguish homographs, to estimate accents, and to estimate the readings of personal and place names.
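
A toy illustration of reading estimation combining a dictionary with fallback rules, in the spirit of the dictionary-plus-rules approach described above. The lexicon and letter-to-sound rules are invented; a real Japanese text analyzer must also resolve on/kun readings, accents, homographs, and proper nouns.

```python
# Hypothetical lexicon and naive letter-to-sound rules, for illustration
# only; neither reflects any real text analyzer's data.
DICTIONARY = {"tokyo": "toːkjoː", "data": "deɪtə"}

LETTER_RULES = {"a": "æ", "b": "b", "c": "k", "d": "d", "e": "ɛ",
                "k": "k", "o": "oʊ", "t": "t", "y": "j"}

def to_reading(word):
    w = word.lower()
    if w in DICTIONARY:          # dictionary lookup first
        return DICTIONARY[w]
    # naive letter-to-sound fallback (unknown letters pass through)
    return "".join(LETTER_RULES.get(ch, ch) for ch in w)

print(to_reading("Tokyo"))  # dictionary hit: toːkjoː
print(to_reading("cat"))    # rule fallback: kæt
```

Words outside the dictionary fall through to the rules, which is exactly where real systems make reading errors, hence the statistical disambiguation mentioned above.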

Objective evaluation of quality

Among the aspects of speech synthesis quality, the naturalness of synthetic speech is difficult to evaluate objectively; there is no objective metric commonly accepted as valid among experts. The same applies to similarity to the target speaker and to reproduction of the target speaking style.

Fair comparison of performance

Because each researcher often trains models on their own dataset and evaluates on their own task, it can be difficult to compare the performance of speech synthesis methods fairly. The Speech Synthesis Special Interest Group (SynSIG) of the International Speech Communication Association (ISCA) has therefore held the Blizzard Challenge[33] competition annually since 2005. Systems trained on a common dataset are evaluated on common tasks, enabling a fair comparison of performance.

In commercial speech synthesis systems in particular, performance can be improved for a specific purpose by using a dataset tailored to that purpose, and such datasets have become valuable know-how.

Speech synthesis system

As of 2020, the operating systems of major personal computers and smartphones include a speech synthesis read-aloud function (screen reader). Historically, a variety of speech synthesis systems have been put to practical use, for example the following.

  • The TI-99/4A could be fitted with an optional speech synthesis module[34].
  • The PC-6001 could take a speech synthesis cartridge, and the PC-6001mkII had built-in speech synthesis. The later PC-6001mkIISR and PC-6601 could even sing.
  • A speech synthesis board (MB22437 / FM-77-431) was available as an option for the FM-7/FM-77 series.
  • The MZ-1500/2500/2861 had an optional voice board (MZ-1M08). The Japanese syllabary and some phrases were sampled, burned into ROM on an external chip, and played back under program control.
  • Open-source software
    • Festival Speech Synthesis System
    • gnuspeech
    • HMM-based Speech Synthesis System (HTS)
    • Open JTalk(HTS-based speech synthesis system for Japanese)
    • MaryTTS

Academic journals / academic societies

Academic journals and societies in which speech synthesis research is discussed include the following (bold indicates that some or all papers are peer-reviewed).

Academic magazine

  • European Association for Signal Processing (EURASIP)
    • Speech Communication(Joint with ISCA)
  • IEEE
    • IEEE Transactions on Information and Systems
    • IEEE Transactions on Signal Processing
  • International Speech Communication Association (ISCA)
    • Computer Speech and Language
    • Speech Communication(Joint with EURASIP)
  • Springer Science and Business Media
    • International Journal of Speech Technology
  • Acoustical Society of Japan
    • Journal of the Acoustical Society of Japan
    • Acoustical Science and Technology (AST)
  • Institute of Electronics, Information and Communication Engineers (IEICE)
    • Journal of the Institute of Electronics, Information and Communication Engineers
  • IPSJ
    • Journal of the Information Processing Society of Japan

International conference

  • Asia Pacific Signal and Information Processing Association (APSIPA)
    • APSIPA Annual Summit Conference (APSIPA ASC)
  • IEEE
    • International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
    • Spoken Language Technology Workshop (SLT)
  • International Speech Communication Association (ISCA)
    • Speech Prosody
    • Speech Synthesis Workshop (SSW)

Academic societies in Japan (discussion is possible in Japanese)

  • Acoustical Society of Japan
    • Speech Study Group (SP; jointly with the IEICE)
    • Acoustical Society of Japan research presentation meetings
  • Institute of Electronics, Information and Communication Engineers (IEICE)
    • Speech Study Group (SP; jointly with the Acoustical Society of Japan)
  • IPSJ
    • Spoken Language Information Processing Study Group (SLP)

Research group

The following organizations are conducting research on speech synthesis.

University (in Japan)

  • Utsunomiya University
    • Mori Laboratory, Department of Systems Innovation Engineering, Graduate School of Engineering
  • Kyoto University
    • Kawahara Laboratory, Department of Intelligent Informatics, Graduate School of Informatics
  • Kumamoto University
    • Ogata Laboratory, Department of Information and Electrical Engineering, Faculty of Natural Sciences, Graduate School
  • Kobe University
    • Takiguchi Laboratory, Department of Information Science, Graduate School of Systems and Informatics
  • The Graduate University for Advanced Studies (faculty based at the National Institute of Informatics)
    • Yamagishi Laboratory, Content Science Research Institute, National Institute of Informatics
  • The University of Tokyo
    • Matsuo Laboratory, Department of Technology Management Strategy, Graduate School of Engineering
    • Graduate School of Engineering, Department of Electrical Engineering, Minematsu / Saito Laboratory
    • Graduate School of Information Science and Technology, Department of System Informatics, System Information Laboratory 1 (Saruwatari / Koyama Laboratory)
  • Tokyo Institute of Technology
    • Kobayashi Laboratory, Information and Communication Systems, Faculty of Engineering
  • Tohoku University
    • Graduate School of Engineering, Department of Communication Engineering, Ito / Nose Laboratory
  • Nagoya Institute of Technology
    • Graduate School of Engineering, Department of Information Engineering Tokuda・ Nankaku Laboratory
  • Nagoya University
    • Takeda Laboratory, Department of Intelligent Systems, Graduate School of Informatics
    • Toda Laboratory, Department of Intelligent Systems, Graduate School of Informatics
  • Nara Institute of Science and Technology
    • Intelligent Communication Laboratory, Department of Information Science, Graduate School of Advanced Science and Technology
  • Yamanashi University
    • Masanori Morise, Graduate School of Comprehensive Research
  • Ritsumeikan University
    • Yamashita Laboratory, Department of Media and Information Studies, Faculty of Information Science and Engineering, Ritsumeikan University

Public research institute (in Japan)


More information

Speech Synthesis Markup Language (SSML)

It is difficult to perform text analysis with 100% accuracy in speech synthesis, and sometimes a specific reading is wanted that cannot be inferred from the text alone. Such information must therefore be specified by some means; besides approaches based on a domain-specific language, the Speech Synthesis Markup Language (SSML) defined by the W3C can be used.
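
For illustration, a short SSML document might look like the following (element and attribute names per the W3C SSML recommendation; the text, date, and pronunciation shown are arbitrary examples):

```xml
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  Today is <say-as interpret-as="date" format="mdy">4/1/2022</say-as>.
  <break time="500ms"/>
  <prosody rate="slow">This sentence is read slowly.</prosody>
  The word <phoneme alphabet="ipa" ph="təˈmɑːtoʊ">tomato</phoneme>
  is given an explicit pronunciation.
</speak>
```

Markup like `say-as` and `phoneme` lets the author pin down readings that the text analyzer cannot infer, which is exactly the problem described above.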

Notes


  1. ^ "Speech synthesis is the task of generating speech waveforms" Wang, et al. (2021). FAIRSEQ S2 : A scalable and Integrable Speech Synthesis Toolkit.
  2. ^ a b "with desired characteristics, including but not limited to textual content ..., speaker identity ..., and speaking styles" Wang, et al. (2021). FAIRSEQ S2 : A scalable and Integrable Speech Synthesis Toolkit.
  3. ^ History and Development of Speech Synthesis (Helsinki University of Technology)-English
  4. ^ Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine(Explanation of voice mechanism and talking machine)
  5. ^ Mattingly, Ignatius G. Speech synthesis for phonetic and phonological models. In Thomas A. Sebeok (Ed.), Current Trends in Linguistics, Volume 12, Mouton, The Hague, pp. 2451-2487, 1974.
  6. ^ http://query.nytimes.com/search/query?ppds=per&v1=GERSTMAN%2C%20LOUIS&sort=newest Obituary of Louis Gerstman (NY Times)
  7. ^ Bell Labs: Where "HAL" First Spoke (Bell Labs Speech Synthesis website)
  8. ^ "RoBoHoN" (Japanese). robohon.com. Retrieved 2018-11-28.
  9. ^ ""AI announcer" reads radio broadcasts with Amazon's speech synthesis technology". ITmedia NEWS. Retrieved 2018-11-28.
  10. ^ Tokuda, Keiichi (2015). "Present / Past / Future of Statistical Speech Synthesis Technology". Spoken Language Symposium, IEICE-115. ISSN 0913-5685.
  11. ^ Tokuda, Keiichi (2017). "The latest trends in fast-moving speech synthesis research". Information and Systems Society Magazine (Institute of Electronics, Information and Communication Engineers) 21 (4): 10–11. doi:10.1587/ieiceissjournal.21.4_10. ISSN 2189-9797. NAID 130005312792.
  12. ^ Akikawa (2018). "Transition and cutting edge of text-to-speech synthesis technology". Journal of the Acoustical Society of Japan 74 (7): 387–393.
  13. ^ Klatt, Dennis H. (1980). “Real-time speech synthesis by rule”. The Journal of the Acoustical Society of America 68: S18. 
  14. ^ Hunt, Andrew J.; Black, Alan W. (1996). "Unit selection in a concatenative speech synthesis system using a large speech database". 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 373–376. doi:10.1109/ICASSP.1996.541110. ISBN 0-7803-3192-3. ISSN 1520-6149.
  15. ^ Kawai, Hisashi; Toda, Tomoki; Yamagishi, Junichi; Hirai, Toshio; Ni, Jinfu; Nishizawa, Nobuyuki; Tsuzaki, Minoru; Tokuda, Keiichi (2006). "XIMERA: a speech synthesis system using a large-scale corpus". Journal of the Institute of Electronics, Information and Communication Engineers J89-D (12): 2688–2698. ISSN 1880-4535. NAID 110007380404.
  16. ^ "Statistical parametric speech synthesis ... as a framework to generate a synthetic speech signal based on a statistical model" Tachibana, et al. (2018). An Investigation of Noise Shaping with Perceptual Weighting for Wavenet-Based Speech Generation. doi: 10.1109 / ICASSP. 2018.8461332
  17. ^ Masuko, Takashi; Tokuda, Keiichi; Kobayashi, Takao; Imai, Satoshi (1996). "Speech synthesis using HMMs with dynamic features". 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 389–392. doi:10.1109/ICASSP.1996.541114. ISBN 0-7803-3192-3. ISSN 1520-6149.
  18. ^ Zen, Heiga; Senior, Andrew; Schuster, Mike (2013-05-26). “Statistical parametric speech synthesis using deep neural networks” (English). 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE): 7962–7966. ISBN 978-1-4799-0356-6. ISSN 1520-6149. 
  19. ^ van den Oord, Aaron; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew et al. (2016-09-12). "WaveNet: A Generative Model for Raw Audio" (English). arXiv:1609.03499.
  20. ^ J. Shen, R. Pang, RJ Weiss, et al., “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,” arXiv preprint arXiv: 1712.05884, 2017.
  21. ^ W. Ping, K. Peng, and J. Chen, “Clarinet: Parallel wave generation in end-to-end text-to-speech,” arXiv preprint arXiv: 1807.07281, 2018
  22. ^ R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A flowbased generative network for speech synthesis,” arXiv preprint arXiv: 1811.00002, 2018
  23. ^ N. Kalchbrenner, E. Elsen, K. Simonyan, et al., “Efficient neural audio synthesis,” arXiv preprint arXiv: 1802.08435, 2018.
  24. ^ Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote, Alexis Moinet, Vatsal Aggarwal (2019) TOWARDS ACHIEVING ROBUST UNIVERSAL NEURAL VOCODING. Interspeech 2019
  25. ^ Sotelo, Jose; Mehri, Soroush; Kumar, Kundan; Santos, Joao Felipe; Kastner, Kyle; Courville, Aaron; Bengio, Yoshua (2017-02-18). “Char2Wav: End-to-End Speech Synthesis” (English) .. ICLR 2017 workshop submission. 
  26. ^ Arik, Sercan O.; Chrzanowski, Mike; Coates, Adam; Diamos, Gregory; Gibiansky, Andrew; Kang, Yongguo; Li, Xian; Miller, John et al. (2017-02-25). "Deep Voice: Real-time Neural Text-to-Speech" (English). arXiv:1702.07825.
  27. ^ Wang, Yuxuan; Skerry-Ryan, RJ; Stanton, Daisy; Wu, Yonghui; Weiss, Ron J.; Jaitly, Navdeep; Yang, Zongheng; Xiao, Ying et al. (2017). "Tacotron: Towards End-to-End Speech Synthesis" (English). arXiv:1703.10135.
  28. ^ a b We use the feed-forward Transformer block,…, as the basic structure for the encoder and mel-spectrogram decoder. archive
  29. ^ Jaime (2018) TOWARDS ACHIEVING ROBUST UNIVERSAL NEURAL VOCODING https://arxiv.org/abs/1811.06292
  30. ^ Anumanchipalli, Gopala K., et al. (2019). Speech synthesis from neural decoding of spoken sentences. [paper]
  31. ^ "Voice conversion (VC) refers to a technique that converts a certain aspect of speech from a source to that of a target without changing the linguistic content" Huang, et al. (2021). S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised Speech Representations. p.1.
  32. ^ "speaker conversion, which is the most widely investigated type of VC." Huang, et al. (2021). S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised Speech Representations. p.1.
  33. ^ "Blizzard Challenge 2018 – SynSIG" (English). www.synsig.org. Retrieved 2018-11-30.
  34. ^ Smithsonian Speech Synthesis History Project (SSSHP) 1986-2002

Primary literature

  1. ^ "Disaster-prevention radio fully switches to synthesized voice from November | Atsugi | Town News" (Japanese). Town News, November 2016. Retrieved 2018-11-28.
  2. ^ "Hankyu Corporation introduces multilingual announcement service for foreign visitors to Japan – Printing Information" (Japanese). CNET Japan (2018). https://japan.cnet.com/article/35119705/ Retrieved 2018-11-28.
  3. ^ "Convenient functions in the Ex-word – Electronic dictionary – CASIO". arch.casio.jp. Retrieved 2018-11-28.
  4. ^ "Voice dialogue" (Japanese). AX-XW400 | Helsio Water Oven: Sharp. Retrieved 2018-11-28.
  5. ^ "Voice news distribution: Asahi Shimbun Arukiki" (Japanese). www.asahi.com. Retrieved 2018-11-28.
  6. ^ "Deep Learning for Siri's Voice: On-device Deep Mixture Density Networks for Hybrid Unit Selection Synthesis – Apple" (English). Apple Machine Learning Journal. Retrieved 2018-11-28.
  7. ^ "WaveNet launches in the Google Assistant | DeepMind". DeepMind. Retrieved 2018-11-28.
  8. ^ "Service launched May 30! NTT DoCoMo's new AI agent "my daiz" adopts AI Talk" (Japanese). AI Inc. Retrieved 2018-11-28.
  9. ^ "Emopar | Functions / Services | AQUOS ZETA SH-01G | Product Lineup | AQUOS: Sharp" (Japanese). Sharp AQUOS smartphone / mobile phone official website. Retrieved 2018-11-28.
  10. ^ "Customize Alexa's voice with Amazon Polly" (English). https://developer.amazon.com/blogs/alexa/post/0e88bf72-ac90-45f1-863b-32ca8e2ae197/amazon-polly-voices-in-alexa-jp Retrieved 2018-11-28.
  11. ^ TOYOTA MOTOR CORPORATION. "Toyota KIROBO mini | KIBO ROBOT PROJECT | KIROBO / MIRATA | Toyota Motor Website" (Japanese). Retrieved 2018-11-28.
  12. ^ "The first TV animation in which all characters speak with synthesized voices has begun | Robot Information WEB Magazine" (Japanese). Robosta. Retrieved 2018-11-28.
  13. ^ "VoiceText Home | HOYA speech synthesis software" (Japanese). HOYA speech synthesis software "VoiceText". Retrieved 2018-11-28.
  14. ^ "NHK develops "News Yomiko", an "artificial announcer"". ITmedia NEWS. Retrieved 2018-11-28.
  15. ^ "Secrets of highway radio: how fast is the information, how small is the coverage area, and how does it work? | Vehicle News" (Japanese). Vehicle News. Retrieved 2018-11-28.
  16. ^ "Amazon.co.jp Help: Using the read-aloud function". www.amazon.co.jp. Retrieved 2018-11-28.
  17. ^ "Remembering Stephen Hawking's iconic synthesized voice" (English). What's Next (2018). https://whatsnext.nuance.com/in-the-labs/stephen-hawking-famous-text-to-speech-voice/ Retrieved 2018-11-28.
  18. ^ "How are they prepared to accept her? Asking the Kita Ward Assembly, where the "writing-pad hostess" won a seat | Daily Gendai DIGITAL" (Japanese). Daily Gendai DIGITAL. Retrieved 2018-11-28.

Related items

External links

