Please use this identifier to cite or link to this item: http://hdl.handle.net/10125/49991

Customization of IBM Intu’s Voice by Connecting Text-to-Speech Services and a Voice Conversion Network

File: paper0104.pdf
Size: 1.55 MB
Format: Adobe PDF

Item Summary

Title: Customization of IBM Intu’s Voice by Connecting Text-to-Speech Services and a Voice Conversion Network
Authors: Song, Jongyoon
Kim, Hyunjae
Lee, Jaekoo
Choi, Euishin
Kim, Minseok
Yoon, Sungroh
Keywords: Business Intelligence, Analytics and Cognitive: Case Studies and Applications (COGS)
IBM Intu, text-to-speech, voice conversion
Issue Date: 03 Jan 2018
Abstract: IBM has recently launched Project Intu, which extends the existing web-based cognitive service Watson with the Internet of Things to provide an intelligent personal assistant service. We propose a voice customization service that allows a user to directly customize the voice of Intu. The method for voice customization is based on IBM Watson’s text-to-speech service and voice conversion model. A user can train the voice conversion model by providing a minimum of approximately 100 speech samples in the preferred voice (target voice). The output voice of Intu (source voice) is then converted into the target voice. Furthermore, the user does not need to offer parallel data for the target voice since the transcriptions of the source speech and target speech are the same. We also suggest methods to maximize the efficiency of voice conversion and determine the proper amount of target speech based on several experiments. When we measured the elapsed time for each process, we observed that feature extraction accounts for 59.7% of voice conversion time, which implies that fixing inefficiencies in feature extraction should be prioritized. We used the mel-cepstral distortion between the target speech and reconstructed speech as an index for conversion accuracy and found that, when the number of target speech samples for training is less than 100, the general performance of the model degrades.
Pages/Duration: 10 pages
URI/DOI: http://hdl.handle.net/10125/49991
ISBN: 978-0-9981331-1-9
DOI: 10.24251/HICSS.2018.104
Rights: Attribution-NonCommercial-NoDerivatives 4.0 International
Appears in Collections:Business Intelligence, Analytics and Cognitive: Case Studies and Applications (COGS)
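
The abstract uses mel-cepstral distortion (MCD) between target and reconstructed speech as its index of conversion accuracy. As a rough illustration only, the sketch below computes the commonly used form of MCD in decibels, averaged over frames. It assumes that the two utterances are already time-aligned frame by frame and that the 0th (energy) coefficient has been dropped; the function name and input layout are hypothetical, and the exact variant used in the paper is not stated in this record.

```python
import math


def mel_cepstral_distortion(target_frames, converted_frames):
    """Return the mean mel-cepstral distortion in dB.

    Each argument is a list of frames; each frame is a list of
    mel-cepstral coefficients (assumed: c0 removed, frames aligned).
    """
    if not target_frames or len(target_frames) != len(converted_frames):
        raise ValueError("inputs must be non-empty and have equal frame counts")

    scale = 10.0 / math.log(10.0)  # converts the distance to decibels
    total = 0.0
    for target, converted in zip(target_frames, converted_frames):
        if len(target) != len(converted):
            raise ValueError("frames must have the same number of coefficients")
        # Squared Euclidean distance between the two coefficient vectors.
        squared_diff = sum((t - c) ** 2 for t, c in zip(target, converted))
        total += scale * math.sqrt(2.0 * squared_diff)

    # Average the per-frame distortion over the utterance.
    return total / len(target_frames)
```

Lower values indicate that the reconstructed speech is spectrally closer to the target; identical inputs give 0.0.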