Sunday, May 01, 2005

"Rich" Speak

One of the goals I've had for the KDE Text-to-Speech System (KTTS) for quite some time is what I call "rich" speak -- the ability to speak a web page using a variety of genders, volumes, talking speeds, etc. The idea is to distinguish URLs, for example, by speaking them faster and more softly, or to speak emphasized text slightly louder.

Well, I finally got it working -- mostly. In order to use this capability right now, you must:

1. Build KTTS from CVS (or SVN when the changeover occurs).
2. Install Konqueror 3.4 or later.
3. Install Festival 1.95beta.
4. Install the xsltproc utility.
5. Install the rab_diphone voice. (Even if you don't want to speak English, this is required. It's a bug in the Festival SABLE implementation.)
6. If you already have a Festival Talker configured, you must click Edit, change something, change it back, and click OK to force the Festival Talker configuration to detect that you have the rab_diphone voice installed.
7. Configure an XML Transformer filter and choose one of the installed XSL files -- xhtml2ssml.xsl or xhtml2ssml_simple.xsl. Set the DOCTYPE field to "html" (without the quotes). Don't forget to click Apply.

Now go to some web page in Konqi, select all or part of the page, and copy it to the clipboard. On the Jobs tab in KTTSMgr, click the Speak Clipboard button. (When I have some confidence that this is all working well, I plan to enable the Speak button in Konqi to speak richly, but for now, you must use the clipboard.)

If this works for you, you should consider it a minor miracle. Here's what happens. In Konqi, John Tapsell's enhancement places almost-valid XHTML into the clipboard as MIME type "text/html". The Speak Clipboard code in KTTSMgr detects the text/html MIME type in the clipboard and verifies that you have an XML Transformer configured for DOCTYPE html. If so, it queues the XHTML clipboard contents for speaking. (If not, it queues the plain text from the clipboard.) The XML Transformer filter uses the XSL file together with xsltproc to transform the XHTML to SSML (Speech Synthesis Markup Language). (Before doing this, it has to change any ampersands to &amp; entities. That's a defect in John's code that I hope will be fixed soon.) The Sentence Boundary Detector parses the SSML and breaks it up into separate sentences, each sentence having a complete set of SSML tags. For each sentence, the Festival plugin runs another XSL file that converts the SSML to SABLE. A special Scheme function permits KTTSD to send the SABLE to Festival and get back a .wav file. The .wav file is then played on the audio device.
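To make two of those steps a little more concrete, here is a rough Python sketch of the ampersand fix-up and of the "every sentence gets a complete set of tags" idea. This is not the KTTS code (which is C++ plus XSL); the function names are made up for illustration, the sentence splitter is a naive punctuation-based stand-in for the real Sentence Boundary Detector, and the tag names in the usage note are just examples.

```python
import re

# Matches an '&' that does NOT already begin an entity such as
# &amp; &#38; or &#x26; (illustrative pattern, not taken from KTTS).
_BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(xhtml: str) -> str:
    """Turn stray ampersands into &amp; so an XML parser will accept them."""
    return _BARE_AMP.sub("&amp;", xhtml)


def xsltproc_command(stylesheet: str, input_path: str) -> list[str]:
    """Build the argument list for 'xsltproc STYLESHEET FILE'.

    The result is written to standard output by xsltproc; the caller
    would pass this list to subprocess.run() and capture stdout.
    """
    return ["xsltproc", stylesheet, input_path]


def wrap_sentences(text: str, tags: list[str]) -> list[str]:
    """Split text into sentences and wrap each one in the same tags.

    'tags' lists element names from outermost to innermost. Splitting on
    whitespace after . ! or ? is a simplification made for this sketch.
    """
    opening = "".join(f"<{name}>" for name in tags)
    closing = "".join(f"</{name}>" for name in reversed(tags))
    parts = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [f"{opening}{part}{closing}" for part in parts]
```

For example, wrap_sentences("Hello there. How are you?", ["speak", "emphasis"]) gives two strings, each one a self-contained fragment with its own speak and emphasis elements, which is what lets each sentence be converted and sent to the synth independently.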

That's a lot of pieces, and any one of them can fail. xsltproc, for example, is pretty fussy: if the XHTML is not valid XML, it aborts. Web pages tend to have a lot of special characters and incomplete sentences that confuse Festival, causing it to toss the entire sentence.
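If you want to see ahead of time whether a chunk of copied markup is likely to choke an XML tool, a quick well-formedness check is easy to write. The sketch below uses Python's standard xml.etree.ElementTree parser; the function names are hypothetical, and the plain-text fallback is only an illustration of one way to cope, not a description of what KTTS does when xsltproc fails.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


def choose_text(xhtml: str, plain_text: str) -> str:
    """Prefer the XHTML when it parses; otherwise fall back to plain text.

    Hypothetical helper for illustration only.
    """
    return xhtml if is_well_formed(xhtml) else plain_text
```

Note that this parser also rejects HTML-only entities such as &nbsp; unless they are declared, so real-world pages will fail this check more often than you might expect.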

If you care to, you can create your own XSL file for doing the XHTML to SSML conversion. Contributions welcome.

But there are lots of limitations. You can't change genders or voices (Festival will abort if it doesn't find a suitable voice). All this conversion is pretty slow, so I'd avoid large web pages. The audio speed setting in KTTSMgr is ignored in Festival when in SABLE mode. I doubt this works with anything but English.

One day, we'll have a really robust synth engine, and we can streamline this process so that it "just works" without all the hassle. One can hope.