Issue Description:
We are seeing inconsistent text-to-speech (TTS) pronunciation when synthesizing individual words through the Azure Cognitive Services TTS API, specifically for Malay (ms) with the ms-MY-YasminNeural voice. The issue appears to depend on which letters within a word are capitalized.
Observed Behavior:
- For the word "pharmacy":
  - ✅ Lowercase "p": Pronunciation is correct.
  - ❌ **Uppercase "P"** (e.g., "Pharmacy"): Pronunciation is abnormal.
- For the word **"farmasi"** (Malay for "pharmacy"):
  - ❌ Abnormal pronunciation occurs regardless of capitalization.
- Additional testing reveals a pattern:
  - Pronunciation is normal if only the middle letters are capitalized (e.g., "pHArMaCy").
  - Pronunciation becomes abnormal only when the first or last letter is capitalized (e.g., "Pharmacy", "pharmacY", "Farmasi", "farmasI").
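To make the pattern above easier to reproduce, here is a minimal sketch that prints the capitalization variants we tested for a single word. The function name `case_variants` is our own illustration, not part of any Azure tooling, and it assumes ASCII input.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print four capitalization variants of one word,
# one per line: all lowercase, first letter upper, last letter upper,
# and middle letters upper. Assumes ASCII letters only.
case_variants() {
  local lower first middle last
  lower=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  # Words shorter than 3 characters have no middle; print lowercase only.
  if [ "${#lower}" -lt 3 ]; then
    printf '%s\n' "$lower"
    return 0
  fi
  first=$(printf '%s' "$lower" | cut -c1)
  middle=$(printf '%s' "$lower" | sed 's/^.\(.*\).$/\1/')
  last=$(printf '%s' "$lower" | sed 's/^.*\(.\)$/\1/')
  printf '%s\n' "$lower"
  printf '%s%s%s\n' "$(printf '%s' "$first" | tr '[:lower:]' '[:upper:]')" "$middle" "$last"
  printf '%s%s%s\n' "$first" "$middle" "$(printf '%s' "$last" | tr '[:lower:]' '[:upper:]')"
  printf '%s%s%s\n' "$first" "$(printf '%s' "$middle" | tr '[:lower:]' '[:upper:]')" "$last"
}
```

For example, `case_variants pharmacy` prints `pharmacy`, `Pharmacy`, `pharmacY`, and `pHARMACy`, each of which can be substituted into the SSML body to compare the resulting audio.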
SSML Request Example:
curl --location --request POST "https://southeastasia.tts.speech.microsoft.com/cognitiveservices/v1" \
--header "Ocp-Apim-Subscription-Key: ${subscriptionKey}" \
--header 'Content-Type: application/ssml+xml' \
--header 'X-Microsoft-OutputFormat: audio-48khz-96kbitrate-mono-mp3' \
--header 'User-Agent: curl' \
--data-raw '<speak version="1.0" xml:lang="ms" xmlns:mstts="https://www.w3.org/2001/mstts">
<voice xml:lang="ms" name="ms-MY-YasminNeural">
<prosody rate="+0%">
pharmacy
</prosody>
</voice>
</speak>'
Impact:
Our use case involves TTS synthesis of individual words (e.g., labels, buttons, or medical terms), where we cannot fully control how the input text is capitalized. The inconsistency disrupts the user experience, especially in critical scenarios such as accessibility tools or multilingual applications.
Questions for Azure Engineering Team:
- Is this a known issue with the neural TTS engine, particularly for Malay or other languages?
- Could it be related to text normalization or grapheme-to-phoneme conversion when processing uppercase letters at word boundaries?
- Are there recommended SSML tags or attributes (e.g., `<say-as>`, `<phoneme>`) to enforce consistent pronunciation regardless of capitalization?
- If this is a bug, are there plans to address it in future updates?
Suggested Workarounds (Attempted/Considered):
- Pre-processing text to lowercase (not always feasible for proper nouns or acronyms).
- Using SSML `<say-as interpret-as="verbatim">`, but this may not suit all use cases.
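The lowercase pre-processing workaround could look roughly like the sketch below. It lowercases a word unless the word is entirely uppercase (which we treat as an acronym, an assumption of ours), XML-escapes it, and wraps it in the same SSML as the request above. The function names `normalize_case`, `xml_escape`, and `build_ssml` are hypothetical helpers, not Azure APIs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: lowercase a word unless it is all uppercase
# (assumed to be an acronym). Assumes ASCII letters only.
normalize_case() {
  local word upper
  word=$1
  upper=$(printf '%s' "$word" | tr '[:lower:]' '[:upper:]')
  if [ "${#word}" -gt 1 ] && [ "$word" = "$upper" ]; then
    printf '%s' "$word"
  else
    printf '%s' "$word" | tr '[:upper:]' '[:lower:]'
  fi
}

# Escape the XML special characters &, <, and > so the text is safe in SSML.
xml_escape() {
  printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Build the SSML payload from the request above around the normalized word.
build_ssml() {
  local text
  text=$(xml_escape "$(normalize_case "$1")")
  printf '<speak version="1.0" xml:lang="ms" xmlns:mstts="https://www.w3.org/2001/mstts"><voice xml:lang="ms" name="ms-MY-YasminNeural"><prosody rate="+0%%">%s</prosody></voice></speak>' "$text"
}
```

The output of `build_ssml "Pharmacy"` can be passed to the curl command via `--data-raw`. Note the limits of this approach: acronyms are passed through unchanged, so they still have capitalized first and last letters, and proper nouns lose their capitalization.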
Request:
We seek guidance on how to programmatically avoid this issue (e.g., API parameters, SSML configurations) or an estimated timeline for a backend fix. Detailed documentation on TTS capitalization handling would also be helpful.
Environment:
- Region: `southeastasia`
- Voice: `ms-MY-YasminNeural`
- Output Format: `audio-48khz-96kbitrate-mono-mp3`
- Language: Malay (`xml:lang="ms"`)
Thank you for your support!