Actions
- Speech to Text Actions
- Text to Speech Actions
Overview
This node integrates with Google Cloud's Speech APIs to provide two main functionalities:
- Speech to Text (STT): Converts audio input into a text transcription. Audio can be supplied either as base64 content or as a URI pointing to a remote file. The node lets you specify the language, audio format, and model type, plus advanced options such as automatic punctuation and per-channel recognition.
- Text to Speech (TTS): Synthesizes spoken audio from input text. You can select the language, voice type (Standard, WaveNet, Neural2), specific voice variant, speaking rate, pitch, and output audio format. Optionally, the generated audio can be saved to a temporary file on disk.
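
The Text to Speech options described here correspond to fields of a Google Cloud Text-to-Speech synthesize request. The following is a minimal sketch of that mapping, not the node's actual implementation: `build_tts_request` is a hypothetical helper, the voice name pattern (language code, voice type, variant letter) is an assumption about how the node assembles it, and the numeric ranges are the ones documented for Speaking Rate and Pitch.

```python
# Assumed mapping from the node's voice type labels to the token used
# inside Google voice names (for example "en-US-Wavenet-A").
VOICE_TYPE_TOKENS = {"Standard": "Standard", "WaveNet": "Wavenet", "Neural2": "Neural2"}
OUTPUT_ENCODINGS = {"MP3", "LINEAR16", "OGG_OPUS"}


def build_tts_request(text, language_code="en-US", voice_type="Standard",
                      variant="A", speaking_rate=1.0, pitch=0.0,
                      output_format="MP3"):
    """Build a synthesize request body (hypothetical helper, for illustration)."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if voice_type not in VOICE_TYPE_TOKENS:
        raise ValueError(f"unsupported voice type: {voice_type}")
    if output_format not in OUTPUT_ENCODINGS:
        raise ValueError(f"unsupported output format: {output_format}")
    if not 0.25 <= speaking_rate <= 4.0:
        raise ValueError("speaking_rate must be between 0.25 and 4.0")
    if not -20.0 <= pitch <= 20.0:
        raise ValueError("pitch must be between -20.0 and 20.0")

    voice_name = f"{language_code}-{VOICE_TYPE_TOKENS[voice_type]}-{variant}"
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {
            "audioEncoding": output_format,
            "speakingRate": speaking_rate,
            "pitch": pitch,
        },
    }
```

Checking the ranges before the request is sent turns an invalid voice configuration into an immediate, readable error rather than an API failure.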
Common Use Cases
- Transcribing recorded meetings, interviews, or calls into text for documentation or analysis.
- Creating voice responses or announcements dynamically from text in applications like IVR systems, chatbots, or accessibility tools.
- Enhancing multimedia content by adding subtitles or voiceovers.
- Automating audio content generation for podcasts, e-learning, or notifications.
Properties
| Name | Meaning |
|---|---|
| Additional Options | A collection of optional settings: Enable Automatic Punctuation (boolean) adds punctuation automatically to transcriptions; Number of Channels (number) is 1 for mono or 2 for stereo; Separate Recognition Per Channel (boolean) recognizes each stereo channel separately; Save To Tmp Directory (boolean, TTS only) saves synthesized audio to the /tmp directory; Filename (string, TTS only) is the name of the saved audio file without extension; Boost per Specific Words (speechContexts) helps recognition by boosting the probability of specified words or phrases (boost value 0-20). |
| Language Code | Language of the audio/text. Options include Italian (it-IT), English US/UK, French, German, Spanish. |
| Upload Method | How to provide audio for STT: Base64 content or URI to remote audio file. |
| Audio Content (Base64) | Base64-encoded audio data (required if upload method is base64). |
| Audio URI | URI of audio file in cloud storage or publicly accessible URL (required if upload method is URI). |
| Audio Format | Format of the input audio for STT: OGG Opus, FLAC, LINEAR16 (WAV), MP3, or Auto-detect. |
| Sample Rate (Hz) | Sampling frequency of the audio (8000 to 48000 Hz). Recommended values depend on audio format and quality. |
| Model | Machine learning model for speech recognition: Default, Latest Short (for short commands), Phone Call (low quality), Video. |
| Text | Text to convert to speech (TTS). |
| Output Format | Audio format for synthesized speech: MP3, LINEAR16 (WAV), OGG Opus. |
| Voice Type | Voice quality/type for TTS: Standard (economical), WaveNet (high quality), Neural2 (best quality). |
| Specific Voice | Specific voice variant (A-F) with gender indication. |
| Speaking Rate | Speed of speech (0.25 to 4.0, where 1.0 is normal). |
| Pitch | Voice pitch adjustment (-20.0 to 20.0, 0.0 is normal). |
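
As a rough illustration of how the Speech to Text properties in this table could translate into a Speech-to-Text `recognize` request body, consider the sketch below. `build_stt_request` and its parameter names are hypothetical, and treating "Auto-detect" as simply omitting the `encoding` field is an assumption rather than confirmed node behavior. The JSON field names follow the Google Cloud Speech-to-Text v1 REST API.

```python
def build_stt_request(language_code, encoding=None, sample_rate_hz=None,
                      content_b64=None, uri=None, model="default",
                      punctuation=False, channels=1, separate_channels=False,
                      boost_phrases=None, boost=10.0):
    """Build a recognize request body (hypothetical helper, for illustration)."""
    # Upload Method: exactly one of base64 content or URI must be given.
    if (content_b64 is None) == (uri is None):
        raise ValueError("provide exactly one of content_b64 or uri")

    config = {
        "languageCode": language_code,
        "model": model,
        "enableAutomaticPunctuation": punctuation,
        "audioChannelCount": channels,
        "enableSeparateRecognitionPerChannel": separate_channels,
    }
    # Assumption: "Auto-detect" leaves the encoding out of the request.
    if encoding is not None:
        config["encoding"] = encoding
    if sample_rate_hz is not None:
        if not 8000 <= sample_rate_hz <= 48000:
            raise ValueError("sample_rate_hz must be between 8000 and 48000")
        config["sampleRateHertz"] = sample_rate_hz
    # Boost per Specific Words: one speech context carrying all phrases.
    if boost_phrases:
        if not 0 <= boost <= 20:
            raise ValueError("boost must be between 0 and 20")
        config["speechContexts"] = [{"phrases": list(boost_phrases), "boost": boost}]

    audio = {"content": content_b64} if content_b64 is not None else {"uri": uri}
    return {"config": config, "audio": audio}
```

For example, `build_stt_request("it-IT", encoding="FLAC", sample_rate_hz=16000, uri="gs://my-bucket/call.flac")` yields a body whose `audio` part holds only the URI (the bucket path is a placeholder).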
Output
For Speech to Text:
- `transcription`: The recognized text string combining all detected speech segments.
- `detailedResults`: Array of objects containing individual recognition alternatives with their confidence scores.
- `fullResponse`: Raw response object from the Google Speech API.
- `requestConfig`: Configuration used for the recognition request.
- If no speech is detected, an empty transcription with a message is returned.
For Text to Speech:
- `success`: Boolean indicating synthesis success.
- `audioFormat`: Format of the generated audio.
- `mimeType`: MIME type corresponding to the audio format.
- `tempFilePath` (optional): Path to the saved audio file in `/tmp` if saving is enabled.
- Binary data under `binary.audio` contains:
  - `data`: Base64-encoded audio content.
  - `mimeType`: MIME type of the audio.
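
A downstream step that needs the raw audio has to decode the base64 payload under `binary.audio`. The sketch below assumes the output item is available as a dictionary with `json` and `binary` keys; `save_audio` is a hypothetical helper, and the MIME type strings in the extension table are assumptions about what the node emits for each output format.

```python
import base64
from pathlib import Path

# Assumed MIME types for MP3, LINEAR16 (WAV), and OGG Opus output.
EXTENSIONS = {"audio/mpeg": ".mp3", "audio/wav": ".wav", "audio/ogg": ".ogg"}


def save_audio(item, directory, stem="speech"):
    """Decode binary.audio.data and write it to a file (hypothetical helper)."""
    audio = item["binary"]["audio"]
    raw = base64.b64decode(audio["data"])
    # Fall back to a generic extension for MIME types not listed above.
    extension = EXTENSIONS.get(audio["mimeType"], ".bin")
    path = Path(directory) / f"{stem}{extension}"
    path.write_bytes(raw)
    return path
```

The function returns the path it wrote to, so the caller can pass the file on to a player or an upload step.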
Dependencies
- Requires a valid Google Cloud service account key with permissions for Speech-to-Text and Text-to-Speech APIs.
- The node expects credentials to be provided as a JSON service account key that includes the client email, private key, and project ID.
- Uses official Google Cloud client libraries for speech and text-to-speech.
- Optional: Writing synthesized audio files to `/tmp` requires filesystem write access.
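
Because a malformed key is a common source of failures, it can help to check the JSON before configuring the node. This is a minimal sketch of such a check, not the node's own validation logic: `parse_service_account_key` is a hypothetical helper, and the newline normalization of `private_key` is an assumption aimed at keys whose line breaks were double-escaped while being copied.

```python
import json

REQUIRED_FIELDS = ("client_email", "private_key", "project_id")


def parse_service_account_key(raw):
    """Parse and sanity-check a service account key (hypothetical helper)."""
    try:
        key = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"key is not valid JSON: {exc}") from exc
    if not isinstance(key, dict):
        raise ValueError("key must be a JSON object")

    # Treat absent and empty values the same way.
    missing = [field for field in REQUIRED_FIELDS if not key.get(field)]
    if missing:
        raise ValueError("key is missing required fields: " + ", ".join(missing))

    # Assumption: literal backslash-n sequences come from double escaping
    # during copy and paste, so turn them back into real newlines.
    key["private_key"] = key["private_key"].replace("\\n", "\n")
    return key
```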
Troubleshooting
- Invalid Service Account Key: Raised if the provided JSON key is malformed or missing required fields (`client_email`, `private_key`, `project_id`). Ensure the key is correctly copied and formatted.
- No Speech Detected: May occur if the audio is silent, corrupted, or in an incompatible format or sample rate. For OGG_OPUS files, try different sample rates (8000, 16000, 24000, 48000 Hz) or convert the audio to FLAC or LINEAR16 WAV.
- Synthesis Failures: Can happen if the text is too long or the voice configuration is invalid. Check that the text length is reasonable and the voice parameters are supported.
- File Save Errors: When saving audio to `/tmp`, permission issues or disk space problems may cause errors. Verify that the node has write access and sufficient space.
- General API Errors: Network issues, quota limits, or invalid parameters will throw errors. Review the error messages and the Google Cloud console for quota or billing status.
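
The advice to try several sample rates for OGG_OPUS audio can be automated with a small loop. The sketch below is one possible approach under stated assumptions: `recognize` is a hypothetical callable supplied by the caller that sends a config and audio payload to the API and returns the transcription text (an empty string when no speech is detected). It is not part of the node or of any Google library.

```python
# Sample rates suggested for OGG_OPUS troubleshooting.
CANDIDATE_RATES = (8000, 16000, 24000, 48000)


def transcribe_with_rate_fallback(recognize, base_config, audio,
                                  rates=CANDIDATE_RATES):
    """Try each sample rate in order until a transcription comes back.

    Returns (rate, transcription), or (None, "") if every attempt is empty.
    """
    for rate in rates:
        # Copy the base config so the caller's dictionary is not modified.
        config = {**base_config, "encoding": "OGG_OPUS", "sampleRateHertz": rate}
        transcription = recognize(config, audio)
        if transcription:
            return rate, transcription
    return None, ""
```

Each attempt is a separate API request, so the loop stops at the first non-empty result to avoid unnecessary calls against the quota.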