Development of TTS Engine for Indian Accent using Modified HMM Algorithm

A text-to-speech (TTS) system converts normal language text into speech. An intelligent text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer. Many computer operating systems and everyday software applications, such as Adobe Reader, include text-to-speech systems. This paper shows how the hidden Markov model (HMM) can be used as a tool to convert text to speech.

Keywords— K-means, Text-to-speech, Speech synthesis, HMM Algorithm.


I. INTRODUCTION
A text-to-speech system is composed of two parts: a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols such as numbers and abbreviations into the equivalent written-out words; this process is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word and divides and marks the text into prosodic units such as phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end, often referred to as the synthesizer, converts the symbolic linguistic representation into sound.

HMM-based synthesis, also called statistical parametric synthesis, is a synthesis method based on hidden Markov models, a model widely used in speech recognition and synthesis applications. In this system, the frequency spectrum (vocal tract), fundamental frequency (vocal source), and duration (prosody) of speech are modeled simultaneously by HMMs. The proposed project work aims to adapt the HMM technique to Indian accents. Statistical parametric speech synthesis based on hidden Markov models (HMMs) is now well established and can generate natural-sounding synthetic speech. In this framework, we have pioneered the development of the HMM speech synthesis system HTS (H Triple S). This research started by developing algorithms for generating a smooth parameter trajectory from HMMs. Next, to simultaneously model the excitation parameters of speech as well as the spectral parameters, the multi-space probability distribution (MSD) HMM was developed. Using the logarithm of the fundamental frequency and its dynamic and acceleration features as the excitation parameters, the MSD-HMM enabled us to treat the sequence, which is a mixture of one-dimensional real numbers for voiced regions and symbol strings for unvoiced regions, in a probabilistic framework.

To simultaneously model the duration parameters for the spectral and excitation components of the model, the MSD hidden semi-Markov model (MSD-HSMM) was developed. The HSMM is an HMM with explicit state duration distributions instead of self-transition probabilities; by modeling duration directly, it can generate more appropriate temporal structures for speech. These basic systems employed a mel-cepstral vocoder with simple pulse or noise excitation, resulting in synthetic speech with a "buzzy" quality. To reduce buzziness, mixed or multi-band excitation techniques have been integrated into the basic systems to replace the simple pulse or noise excitation and have been evaluated. These basic systems also had another significant problem: the trajectories generated from the HMMs were excessively smooth due to statistical averaging, resulting in synthetic speech with a "muffled" quality. To alleviate this problem, a parameter generation algorithm that considers the global variance (GV) of the trajectory to be generated was developed.
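The difference between the implicit durations of an HMM and the explicit duration distributions of an HSMM can be sketched as follows. In a standard HMM, state duration falls out of repeated self-transitions and is therefore geometric; an HSMM replaces this with an explicit distribution. The self-loop probability and Gaussian duration parameters below are illustrative values only, not those of any trained model.

```python
import random
import statistics

random.seed(0)

def hmm_state_duration(self_loop_prob):
    """In a standard HMM, staying in a state is a chain of self-transitions,
    so the state duration is implicitly geometric."""
    d = 1
    while random.random() < self_loop_prob:
        d += 1
    return d

def hsmm_state_duration(mean, stddev):
    """An HSMM models duration explicitly; a Gaussian rounded to a positive
    integer stands in here for a learned duration distribution."""
    return max(1, round(random.gauss(mean, stddev)))

# Both are tuned to an average duration of about 5 frames per state...
hmm_durs = [hmm_state_duration(0.8) for _ in range(10000)]
hsmm_durs = [hsmm_state_duration(5.0, 1.0) for _ in range(10000)]

# ...but the explicit distribution is far less spread out, which is why the
# HSMM yields more appropriate temporal structures for speech.
```

With a self-loop probability of 0.8 the geometric duration also averages 5 frames, yet its variance is roughly twenty times that of the Gaussian, illustrating why explicit duration modeling helps.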

II. MODULES
Module 1: Creation of the Speech Corpus (dictionary). The CMU Arctic databases were designed for the purpose of speech synthesis research. These single-speaker speech databases have been carefully recorded under studio conditions and consist of nearly 1150 phonetically balanced English utterances. They are distributed as free software, without restriction on commercial or non-commercial use. The Arctic corpus consists of four primary sets of recordings (three male, one female), plus several ancillary databases. Each database is distributed with automatically segmented phonetic labels. These extra files were derived using the standard voice-building scripts of the Festvox system. In addition to phonetic labels, the databases provide complete support for the Festival Speech Synthesis System, including pre-built voices that may be used as is. Festival and Festvox are available at http://www.festvox.org. The Arctic speech corpus is available at http://www.festvox.org/cmu_arctic.

CMU Arctic is a set of single-speaker databases that have been carefully recorded under studio conditions, packaged with associated information such as phonetic labels and pitch-mark files. An Arctic "database" is a reading of the Arctic prompt set (plus associated files) by a single speaker in a specified style of delivery. This release of Arctic contains recordings by four separate speakers. When referring to the Arctic "corpus" we mean the entire collection of databases, including test sets. The databases have version numbers. As with computer code, version numbers indicate the level of maturity and stability. Numbers with a zero after the decimal point (e.g. version 1.0) are major releases intended to serve as a reference point for system development and evaluation. Minor releases are subject to change, allowing for more frequent additions, deletions, and improvements. We denote the time instants associated with the state changes as t = 1, 2, 3, …, n and the actual state at time t as qt.

K-Means Clustering
HMM-type algorithms are well known to be sensitive to their initialization point. The problem of choosing this initialization point is addressed in this paper: first, a model with a very large number of states that describes the training sequences accurately is constructed. The number of states is then reduced using a k-means algorithm on the states. This approach is compared, through numerical simulations, with other methods that apply a k-means algorithm directly to the data.
The clustering algorithm partitions a dataset into a fixed number of clusters supplied by the user. In contrast, the hidden Markov model (HMM) based clustering method identifies a suitable number of clusters in a given dataset without using prior knowledge about the number of clusters. Initially, the dataset is partitioned into windows of fixed size based on the HMM log-likelihood values. This provides a framework for identifying the most appropriate number of clusters (windows of varying sizes). After the number of clusters has been determined, the data values are labelled and allocated to clusters.
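A minimal sketch of the k-means step, here in one dimension on scalar feature values (the data and the choice k = 2 are illustrative only); the resulting centroids could then serve as initial state means for the HMM:

```python
def kmeans_1d(data, k, iters=20):
    """Plain 1-D k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its cluster."""
    data = sorted(data)
    # Spread the initial centroids across the sorted data.
    centroids = [data[i * len(data) // k] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # Keep a centroid in place if its cluster emptied out.
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return sorted(centroids)

# Two well-separated groups of invented feature values.
centroids = kmeans_1d([0.9, 1.0, 1.1, 9.8, 10.0, 10.2], k=2)
```

On this toy data the two centroids converge to roughly 1.0 and 10.0, the means of the two groups.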

D. Features of Audio
The following features are extracted from the audio file to train the HMM to produce speech. This paper has shown how an HMM can be used to generate speech from a text file. The HMM uses dictionaries, labelling, and three different algorithms, namely K-means, the Viterbi algorithm, and the Baum-Welch algorithm, for training; after training, speech is generated from the text files. HMMs can also be used for image processing and forecasting.
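As an illustration of the Viterbi algorithm mentioned above, the following sketch decodes the most likely hidden state sequence of a toy discrete-output HMM. The two states ('V' for voiced, 'U' for unvoiced) and all probabilities are invented for the example, not taken from a trained voice:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Dynamic-programming search for the most likely state sequence."""
    # V[t][s] = (best probability of reaching s at time t, predecessor)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states)
            V[t][s] = (prob, prev)
    # Backtrack from the best final state.
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

states = ['V', 'U']  # hypothetical voiced/unvoiced states
start_p = {'V': 0.5, 'U': 0.5}
trans_p = {'V': {'V': 0.7, 'U': 0.3}, 'U': {'V': 0.3, 'U': 0.7}}
emit_p = {'V': {'high': 0.9, 'low': 0.1},
          'U': {'high': 0.2, 'low': 0.8}}

best_path = viterbi(['high', 'high', 'low'], states, start_p, trans_p, emit_p)
```

For the observation sequence high, high, low, the decoder prefers staying voiced for the two high-energy frames and switching to unvoiced for the final one.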

Fig. 2 Contents of the label of a speech file

Markov Model Process
Classify the weather into three states:
- State 1: rain or snow
- State 2: cloudy
- State 3: sunny
By carefully examining the weather of a city over a long period, the following weather change pattern was found.
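The three-state weather model above can be sketched as a first-order Markov chain. The transition probabilities below are illustrative placeholders, since the actual values come from the observed pattern shown in the figure:

```python
STATES = ["rain/snow", "cloudy", "sunny"]

# Illustrative transition matrix: TRANS[i][j] = P(tomorrow j | today i).
# Each row sums to 1; the real values would come from the observed data.
TRANS = [
    [0.4, 0.3, 0.3],   # from rain/snow
    [0.2, 0.6, 0.2],   # from cloudy
    [0.1, 0.1, 0.8],   # from sunny
]

def sequence_prob(seq):
    """P(seq[1:] | seq[0]) under the first-order Markov assumption:
    tomorrow's weather depends only on today's state."""
    p = 1.0
    for today, tomorrow in zip(seq, seq[1:]):
        p *= TRANS[STATES.index(today)][STATES.index(tomorrow)]
    return p

p = sequence_prob(["sunny", "sunny", "rain/snow", "rain/snow"])
```

With these placeholder values, the sequence sunny, sunny, rain/snow, rain/snow has probability 0.8 × 0.1 × 0.4 = 0.032 given the initial sunny day.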

Fig. 6 Example of the Baum-Welch algorithm
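The heart of the Baum-Welch algorithm is the forward-backward computation of state occupancy probabilities (the E-step). The following is a minimal sketch for a discrete-output HMM; the two-state model and all parameter values are invented for the example:

```python
def forward(obs, start, trans, emit):
    """alpha[t][s] = P(obs[0..t], state s at time t)."""
    n = len(start)
    alpha = [[start[s] * emit[s][obs[0]] for s in range(n)]]
    for o in obs[1:]:
        alpha.append([sum(alpha[-1][p] * trans[p][s] for p in range(n))
                      * emit[s][o] for s in range(n)])
    return alpha

def backward(obs, trans, emit):
    """beta[t][s] = P(obs[t+1..] | state s at time t)."""
    n = len(trans)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        beta.insert(0, [sum(trans[s][q] * emit[q][o] * beta[0][q]
                            for q in range(n)) for s in range(n)])
    return beta

def state_posteriors(obs, start, trans, emit):
    """Baum-Welch E-step: gamma[t][s] = P(state s at time t | obs)."""
    alpha = forward(obs, start, trans, emit)
    beta = backward(obs, trans, emit)
    likelihood = sum(alpha[-1])
    return [[a * b / likelihood for a, b in zip(al, bl)]
            for al, bl in zip(alpha, beta)]

# Toy 2-state model observing symbols 0 and 1 (all values invented).
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
gamma = state_posteriors([0, 1, 0], start, trans, emit)
```

A full Baum-Welch iteration would then re-estimate start, trans, and emit from these posteriors (the M-step) and repeat until the likelihood converges. A useful sanity check is that the posteriors over states sum to one at every time step.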

TABLE 1
Visual representation: at every state, tomorrow's weather depends only on today's state.