Portal field news


🏛 | Orutsu releases "Parliament GIJIROKU" enhanced voice recognition engine for Diet


In brief
However, general-purpose speech recognition cannot capture much of the jargon used in parliament and local councils.

In July 2021, Orutsu Co., Ltd. renewed the "National Diet GIJIROKU" enhanced voice recognition engine for the Diet ...

 Kyodo News PR Wire

Kyodo PR Wire, which distributes press releases and news releases, connects information from "people who want to inform" to "people who want to know."
This is a site that consumers should pay attention to, where news releases from major governments and government agencies including local governments and universities are gathered.

Wikipedia related words

If there is no explanation, there is no corresponding item on Wikipedia.

voice recognition

Voice recognition (English: speech recognition) is the recognition of human speech and similar sounds by a computer. It refers to the function of converting spoken language into a character string, or of identifying the person who is speaking by capturing the characteristics of the voice[1].

The function of converting spoken words into a character string is an alternative to typing on a keyboard. When only the function of entering a character string (a sentence) is meant, it is called "voice input" or "dictation".

Applications can also be operated by voice recognition, just as they are operated by entering a character string or a shortcut from the keyboard. Operating an application by voice is called "voice operation".

"Speech recognition" may also include the function of identifying who the speaker is. This function performs personal authentication and similar tasks by comparing the voice against a voice pattern recorded in advance. It is also called speaker recognition.

Recognition technology

Statistical method

Statistical methods are often used in speech recognition. Speech features are accumulated from training data consisting of a large number of recorded utterances. The features extracted from the input speech are then compared with the accumulated features, and the closest word sequence is output as the recognition result.

In general, the acoustic features and the linguistic features of speech are treated separately. Acoustic features describe what frequency characteristics each phoneme to be recognized has, and are called the acoustic model. Hidden Markov models whose output probabilities are Gaussian mixture distributions are widely used to represent the acoustic model. Linguistic features describe the constraints on how phonemes can be arranged, and are called the language model. For example, the particles "ga" or "wa" are likely to follow immediately after the word "anata" ("you"). To represent the language model, n-gram models are often used when the language to be recognized is large (for example, writing documents on a computer), and context-free grammars are often used when the language to be recognized is small enough to be covered by hand (for example, voice operation of a car navigation system).
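The constraint described above (that "ga" or "wa" tends to follow "anata") can be sketched as a bigram model, one simple statistical language model. The toy corpus, the romanized tokens, and the function names below are assumptions made for illustration; a real system would train on far more data and apply smoothing.

```python
from collections import Counter

def train_bigram(sentences):
    """Count how often each token appears as a context and as part of a bigram."""
    context_counts, bigram_counts = Counter(), Counter()
    for tokens in sentences:
        for prev, nxt in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, nxt)] += 1
    return context_counts, bigram_counts

def bigram_prob(context_counts, bigram_counts, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev); 0.0 if prev was never seen."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, nxt)] / context_counts[prev]

# Hypothetical toy corpus of tokenized, romanized sentences.
corpus = [
    ["anata", "ga", "iku"],
    ["anata", "wa", "iku"],
    ["anata", "ga", "kuru"],
    ["watashi", "wa", "iku"],
]
context_counts, bigram_counts = train_bigram(corpus)
p_ga = bigram_prob(context_counts, bigram_counts, "anata", "ga")
p_iku = bigram_prob(context_counts, bigram_counts, "anata", "iku")
```

In this corpus "ga" follows "anata" in two of three cases, so `p_ga` is higher than `p_iku`, which is zero because that word order never occurs in the training data.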

Dynamic time warping method

Dynamic time warping (DTW) is an early speech recognition method, but it has fallen out of use now that methods based on hidden Markov models have become common. It is an algorithm that measures the similarity between two signal sequences that differ in time or speed. For example, a human gait follows a recognizable pattern whether the person walks quickly or slowly, or whether a video of the walk is fast-forwarded or played in slow motion. DTW is not limited to audio and can be applied to any time-series data. In speech recognition it was used to detect a fixed pattern regardless of speaking rate. It therefore requires reference patterns for comparison, and the recognizable vocabulary is limited.
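As a minimal sketch of the idea, the textbook dynamic-programming form of DTW can be written for one-dimensional sequences as follows. The absolute difference used as the local cost and the toy sequences are assumptions made for illustration; speech systems would compare feature vectors frame by frame.

```python
def dtw_distance(a, b):
    """Return the minimum cumulative alignment cost between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretching a, stretching b, or a one-to-one step.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

template = [0, 1, 2, 1, 0]                 # reference pattern
slow = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]      # same shape at half speed
other = [2, 2, 0, 0, 2]                    # a different shape

same_shape_cost = dtw_distance(template, slow)
different_cost = dtw_distance(template, other)
```

The slowed-down sequence aligns with the template at zero cost, while the differently shaped sequence does not, which is the sense in which DTW detects a fixed pattern regardless of speed.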

Hidden Markov model

A speech signal can be viewed as a piecewise or short-time stationary signal, so a hidden Markov model (HMM) can be applied. That is, over a short interval of about 10 milliseconds, the speech signal can be treated approximately as a stationary process. Speech can therefore be thought of as a Markov chain of many stochastic processes.

Moreover, HMM-based speech recognizers can be trained automatically, and they are simple and computationally inexpensive. In the simplest possible setup for speech recognition, the hidden Markov model outputs a real-valued vector of, say, about 13 dimensions every 10 milliseconds. This vector consists of cepstral coefficients. The cepstral coefficients are obtained by applying a Fourier transform to a short window of speech, decorrelating the spectrum with a cosine transform, and keeping the first (most significant) coefficients. In each state, the hidden Markov model tends to have a probability distribution that is a mixture of Gaussians with diagonal covariance, which gives the likelihood of each observed vector. Each word or each phoneme has its own output distribution. The hidden Markov model for a sequence of words or phonemes is a concatenation of the hidden Markov models for the individual words or phonemes.
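The output distribution mentioned above, a mixture of Gaussians with diagonal covariance, can be sketched in a few lines. The function names, the two-dimensional vectors, and all parameter values here are assumptions chosen for illustration, not trained values.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_gmm_likelihood(x, weights, means, variances):
    """Log-likelihood of x under a Gaussian mixture, computed with log-sum-exp."""
    logs = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(value - top) for value in logs))

# Hypothetical output distributions of two phoneme states: (weights, means, variances).
state_a = ([0.6, 0.4], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [0.5, 0.5]])
state_b = ([1.0], [[5.0, 5.0]], [[1.0, 1.0]])

frame = [0.2, 0.1]  # one observed feature vector
score_a = log_gmm_likelihood(frame, *state_a)
score_b = log_gmm_likelihood(frame, *state_b)
```

Because the frame lies close to the components of `state_a`, `score_a` is larger than `score_b`. A decoder would combine such per-frame scores with the transition probabilities of the concatenated models.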

These are the basic concepts of speech recognition using hidden Markov models. Speech recognition systems use various other techniques as well. Large-vocabulary systems take the context dependence of phonemes into account. The cepstrum is also normalized to compensate for differences between speakers and recording conditions. Other approaches to speaker normalization include vocal tract length normalization (VTLN), which normalizes between male and female speakers, and maximum likelihood linear regression (MLLR), which handles a larger, unspecified population of speakers.
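One common form of cepstrum normalization is cepstral mean normalization, which subtracts the per-utterance mean from every frame. The text does not say which variant is meant, so the sketch below is an assumption, and the function name and sample values are hypothetical.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-dimension mean over the utterance from every frame."""
    if not frames:
        return []
    count = len(frames)
    dim = len(frames[0])
    means = [sum(frame[d] for frame in frames) / count for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]

# Two hypothetical recordings of the same utterance; the second has a constant
# offset in the first coefficient, as a different recording condition might add.
clean = [[1.0, 2.0], [3.0, 6.0]]
shifted = [[11.0, 2.0], [13.0, 6.0]]

normalized_clean = cepstral_mean_normalize(clean)
normalized_shifted = cepstral_mean_normalize(shifted)
```

After normalization the two recordings yield identical features, because a constant offset in the cepstral domain is removed along with the mean.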

Current state and challenges

From the 1970s, when computers began to be used in the research and development of speech recognition systems, through the early 21st century, enormous funding and talented people have been invested over many years, yet few systems have succeeded and become widespread. This contrasts sharply with other technical fields, such as animated films built on 3D imagery and the recording and playback of video, still images, and music, which digital technology has since turned into major industries[2].

It is said that a Japanese speech recognition system that is restricted to one speaker and trained in advance (a "dictation" system) can achieve a recognition rate of 80% in an ideal environment. Without such training, 60% is the limit[3]. A system that limits the vocabulary and needs no training can recognize the voices of an unspecified number of speakers, but its range of use is limited because the vocabulary is small. For Western languages, which have few homonyms, a recognition rate of 90% is estimated[4].

Speech recognition software sold for individuals achieves a sufficiently practical recognition rate if the user wears a headset in a quiet room and knows a few tricks, such as separating words. However, recognition is difficult when there is loud conversation in the background, even indoors, or in noisy environments such as outdoors. Because the software is intended for personal use, its vocabulary is limited and business terms are not covered. It is also difficult to recognize speech from multiple speakers, or speech not intended for voice recognition, as in interviews and meetings.

For enterprises, more expensive software is also sold that handles a large vocabulary and multiple unspecified speakers and can be used to prepare minutes of meetings[citation needed]. It makes the work more efficient than transcribing by listening to a cassette tape or an IC recorder.


The performance of a speech recognition system is generally expressed in terms of accuracy and speed. Accuracy is expressed as the word error rate (WER), and speed as the real-time factor (RTF).
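Both measures can be computed directly. The sketch below uses the usual definitions: WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, and RTF is processing time divided by audio duration. The function names and the sample sentences are assumptions made for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            substitution = prev[j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[len(hyp)] / len(ref)

def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds

wer = word_error_rate(
    "the council approved the budget",
    "the council approve budget",
)
rtf = real_time_factor(processing_seconds=30.0, audio_seconds=60.0)
```

Here the hypothesis contains one substitution and one deletion against five reference words, so `wer` is 0.4, and an engine that needs 30 seconds for 60 seconds of audio has an `rtf` of 0.5.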

Technology under study


If the features of the speaker's voice are distorted by noise or by feature separation processing, they diverge from the acoustic model, which causes recognition errors. Missing feature theory (MFT) is used to compensate: the amount of distortion and noise in the extracted speech features is estimated and stored as a reliability map over the time and frequency axes, and low-reliability features are masked or the lost speech is restored[2].
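The masking step can be illustrated with a deliberately simplified sketch: features whose reliability falls below a threshold are ignored when an observation is compared with a reference. The function name, the 0.5 threshold, and the feature values are all assumptions; real systems apply the mask inside the acoustic likelihood computation instead of using a plain distance.

```python
def masked_distance(observed, template, reliability, threshold=0.5):
    """Mean squared difference over features whose reliability meets the threshold."""
    kept = [
        (o - t) ** 2
        for o, t, r in zip(observed, template, reliability)
        if r >= threshold
    ]
    if not kept:
        return None  # nothing reliable enough to compare
    return sum(kept) / len(kept)

template = [1.0, 2.0, 3.0, 4.0]          # hypothetical reference features
observed = [1.0, 9.0, 3.0, 4.0]          # second feature corrupted by noise
reliability = [0.9, 0.1, 0.8, 0.9]       # estimated reliability per feature

with_mask = masked_distance(observed, template, reliability)
without_mask = masked_distance(observed, template, [1.0] * 4)
```

With the mask, the corrupted feature is skipped and the observation matches the template exactly. Without it, the single noisy feature dominates the distance.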


GSS (geometric source separation) is a technology for separating multiple sound sources. If the sound sources are uncorrelated, source separation and the position of each source (sound source localization) can be obtained relatively easily from the input of multiple microphones. If this is reflected in the reliability map as noise information for missing feature theory (MFT), the recognition rate does not drop much even in the presence of noise or simultaneous speech[2].

Practical example

Use on Macintosh

On the Macintosh, a voice recognition function has been included as PlainTalk since the Quadra 840AV and Centris 660AV of 1993. Mac OS 9 also added login by voice recognition password. Starting with macOS Sierra, the voice recognition assistant Siri was included, making various operations possible[5].

Usage on Windows

Windows Vista and Windows 7 have a voice recognition function that makes operations such as chatting possible without keyboard input. Operating a PC by voice recognition was possible before, but in addition to improved recognition of Japanese, Windows operations normally done with a mouse and keyboard can now be performed by voice. Starting with Windows 10, the voice recognition assistant Cortana makes various operations possible. (On Windows Phone, it has been included since Windows Phone 8.1.)

Use in companies and organizations

Companies, hospitals, and local governments have gradually introduced the following practical systems since around 2005-6.

  • Electronic medical record input systems for doctors
  • Minute preparation support systems for local governments
  • Call center support and call content analysis systems for mobile phones
  • Pronunciation evaluation systems in language learning applications for schools

Other usage examples

By combining it with "Sensibility Technology" (ST) and similar techniques, it becomes possible, for example, to judge that an "I'm sorry" tossed off lightly is mocking and to respond angrily, or to recognize that an "I'm sorry" pronounced slowly and politely is a sincere apology and to forgive the speaker.

Speech recognition software example

Game software example applying voice recognition


  1. ^ a b Daijisen
  2. ^ a b c Tetsuo Nozawa, "Hearing Sensors That Hear Many Voices at Once", Nikkei Electronics September 2008, 9, pages 22-115
  3. ^ Narita Hajime, "The World of PC Translation", Kodansha
  4. ^ Wall Street Journal
  5. ^ "Use voice control on Mac" (Japanese). Apple Support. Retrieved 2021/4/8.
  6. ^ "Microsoft to acquire AI and voice recognition company Nuance for over 2 trillion yen" (Japanese). CNET Japan (September 2021, 4). Retrieved 2021/4/13.
  7. ^ ASCII.jp Digital Glossary. "What is PlainTalk" (Japanese). Kotobank. Retrieved 2021/4/9.


  • Lawrence Rabiner (1993), "Fundamentals of Speech Recognition", Prentice Hall, ISBN 0-13-015157-2
  • Frederick Jelinek (1998), "Statistical Methods for Speech Recognition", MIT Press, ISBN 0-262-10066-5
  • Manfred R. Schroeder (2004), "Computer Speech: Recognition, Compression, Synthesis", Springer-Verlag, ISBN 3-540-64397-4

Related item

External links

Local council

A local council (chihō gikai) is a legislature or assembly that belongs to a local government with jurisdiction over a particular region, as opposed to the assembly of the central government, which has jurisdiction over the entire country.

It functions as an ordinary legislature, but unlike the central government, which has jurisdiction over the entire country, it deals with ordinances and budgets for a specific region, and it often legislates only to the extent that it does not conflict with central government law.

Examples of each country

The United States of America

In the United States of America, there are the 50 state legislatures, the D.C. Council, and a council in each territory, as well as a city council in each city.

Republic of China

In the Republic of China, each province had a provincial council, but after most of its territory was taken by the People's Republic of China in the Chinese Civil War, most provincial councils were suspended. Even in Taiwan Province, which remained in the Republic of China, the council was reorganized into the Taiwan Provincial Consultative Council.


