Google Speech

Use Google Speech API

Actions

Speech to Text Actions
- Recognize
Text to Speech Actions
- Synthesize

Overview

This node integrates with Google Cloud's Speech APIs to provide two main functionalities:

Speech to Text (STT): Converts audio input into text transcription. It supports uploading audio either as base64 content or via a URI pointing to a remote file. The node allows specifying language, audio format, model type, and advanced options like automatic punctuation and channel recognition.
Text to Speech (TTS): Synthesizes spoken audio from input text. Users can select voice characteristics such as language, voice type (Standard, WaveNet, Neural2), specific voice variant, speaking rate, pitch, and output audio format. Optionally, the generated audio can be saved temporarily on disk.
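
As a rough illustration of the STT options described above, the sketch below assembles a JSON body in the shape accepted by the Speech-to-Text v1 `speech:recognize` REST method. The function name `build_recognize_request` and its parameter names are hypothetical and are not taken from the node's source. Leaving `encoding` and `sample_rate_hz` unset is assumed here to correspond to the node's Auto-detect format.

```python
import base64


def build_recognize_request(language_code, content=None, uri=None,
                            encoding=None, sample_rate_hz=None,
                            model="default", punctuation=False,
                            channels=1, separate_channels=False,
                            phrases=None, boost=None):
    """Build a request body for Speech-to-Text v1 speech:recognize.

    Hypothetical helper: exactly one of `content` (raw audio bytes)
    or `uri` (for example a gs:// path) must be given.
    """
    if (content is None) == (uri is None):
        raise ValueError("provide exactly one of content or uri")

    config = {
        "languageCode": language_code,
        "model": model,
        "enableAutomaticPunctuation": punctuation,
        "audioChannelCount": channels,
        "enableSeparateRecognitionPerChannel": separate_channels,
    }
    # Assumption: omitting these two fields stands in for "Auto-detect".
    if encoding is not None:
        config["encoding"] = encoding
    if sample_rate_hz is not None:
        config["sampleRateHertz"] = sample_rate_hz

    # Word boosting maps onto a single speechContexts entry.
    if phrases:
        context = {"phrases": list(phrases)}
        if boost is not None:
            context["boost"] = boost
        config["speechContexts"] = [context]

    if uri is not None:
        audio = {"uri": uri}
    else:
        # Inline audio is sent as base64 text.
        audio = {"content": base64.b64encode(content).decode("ascii")}
    return {"config": config, "audio": audio}
```

For example, `build_recognize_request("en-US", uri="gs://example-bucket/call.flac", encoding="FLAC", sample_rate_hz=16000)` yields a body whose `audio` field carries only the URI, while passing `content=` instead produces a base64 `content` field. The bucket path is a placeholder.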

Common Use Cases

Transcribing recorded meetings, interviews, or calls into text for documentation or analysis.
Creating voice responses or announcements dynamically from text in applications like IVR systems, chatbots, or accessibility tools.
Enhancing multimedia content by adding subtitles or voiceovers.
Automating audio content generation for podcasts, e-learning, or notifications.

Properties

Name	Meaning
Additional Options	A collection of optional settings. For STT: Enable Automatic Punctuation (boolean) adds punctuation to transcriptions; Number of Channels (number) is 1 for mono or 2 for stereo; Separate Recognition Per Channel (boolean) recognizes each stereo channel separately; Boost per Specific Words (speechContexts) raises the recognition probability of specified words or phrases (boost value 0-20). For TTS only: Save To Tmp Directory (boolean) saves synthesized audio to the `/tmp` directory; Filename (string) is the name of the saved audio file, without extension.
Language Code	Language of the audio/text. Options include Italian (it-IT), English US/UK, French, German, Spanish.
Upload Method	How to provide audio for STT: Base64 content or URI to remote audio file.
Audio Content (Base64)	Base64-encoded audio data (required if upload method is base64).
Audio URI	URI of audio file in cloud storage or publicly accessible URL (required if upload method is URI).
Audio Format	Format of the input audio for STT: OGG Opus, FLAC, LINEAR16 (WAV), MP3, or Auto-detect.
Sample Rate (Hz)	Sampling frequency of the audio (8000 to 48000 Hz). Recommended values depend on audio format and quality.
Model	Machine learning model for speech recognition: Default, Latest Short (for short commands), Phone Call (for low-quality telephone audio), Video.
Text	Text to convert to speech (TTS).
Output Format	Audio format for synthesized speech: MP3, LINEAR16 (WAV), OGG Opus.
Voice Type	Voice quality/type for TTS: Standard (economical), WaveNet (high quality), Neural2 (best quality).
Specific Voice	Specific voice variant (A-F) with gender indication.
Speaking Rate	Speed of speech (0.25 to 4.0, where 1.0 is normal).
Pitch	Voice pitch adjustment (-20.0 to 20.0, 0.0 is normal).
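
To make the TTS properties above concrete, here is a minimal sketch that checks the documented Speaking Rate and Pitch ranges and builds a body in the shape of the Text-to-Speech v1 `text:synthesize` REST method. The helper `build_synthesize_request` is hypothetical. The voice name pattern `<language>-<type>-<variant>` (for example `en-US-Wavenet-A`) is an assumption about how the node combines Voice Type and Specific Voice, and not every combination exists for every language.

```python
# Assumed mapping from the node's Voice Type labels to the spelling
# used inside Google voice names.
VOICE_TYPE_NAMES = {"Standard": "Standard", "WaveNet": "Wavenet", "Neural2": "Neural2"}


def build_synthesize_request(text, language_code="en-US", voice_type="Standard",
                             variant="A", audio_encoding="MP3",
                             speaking_rate=1.0, pitch=0.0):
    """Build a request body for Text-to-Speech v1 text:synthesize (hypothetical helper)."""
    if not 0.25 <= speaking_rate <= 4.0:
        raise ValueError("speaking_rate must be between 0.25 and 4.0")
    if not -20.0 <= pitch <= 20.0:
        raise ValueError("pitch must be between -20.0 and 20.0")
    if voice_type not in VOICE_TYPE_NAMES:
        raise ValueError(f"unknown voice type: {voice_type}")

    # Assumed naming pattern; check the voice list for the target language.
    voice_name = f"{language_code}-{VOICE_TYPE_NAMES[voice_type]}-{variant}"
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {
            "audioEncoding": audio_encoding,
            "speakingRate": speaking_rate,
            "pitch": pitch,
        },
    }
```

Validating the ranges before sending the request surfaces a bad Speaking Rate or Pitch as a local error instead of an API failure.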

Output

For Speech to Text:
- transcription: The recognized text string combining all detected speech segments.
- detailedResults: Array of objects containing individual recognition alternatives with their confidence scores.
- fullResponse: Raw response object from Google Speech API.
- requestConfig: Configuration used for the recognition request.
- If no speech is detected, an empty transcription with a message is returned.
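
The STT output fields listed above could be derived from a raw recognition response roughly as follows. This is a sketch, not the node's actual code: it assumes the REST response shape (`results[].alternatives[]` with `transcript` and `confidence`), joins the top alternative of each result with spaces, and uses a hypothetical `message` field with made-up wording for the no-speech case.

```python
def summarize_recognition(response):
    """Reduce a speech:recognize response dict to the fields described above."""
    results = response.get("results", [])

    # Every alternative of every result, with its confidence if present.
    detailed = [
        {"transcript": alt.get("transcript", ""), "confidence": alt.get("confidence")}
        for result in results
        for alt in result.get("alternatives", [])
    ]

    # The first alternative of each result is the most likely one.
    top = [
        result["alternatives"][0].get("transcript", "").strip()
        for result in results
        if result.get("alternatives")
    ]
    transcription = " ".join(part for part in top if part)

    output = {
        "transcription": transcription,
        "detailedResults": detailed,
        "fullResponse": response,
    }
    if not transcription:
        # Hypothetical wording; the node's real message may differ.
        output["message"] = "No speech detected"
    return output
```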
For Text to Speech:
- success: Boolean indicating synthesis success.
- audioFormat: Format of the generated audio.
- mimeType: MIME type corresponding to the audio format.
- tempFilePath (optional): Path to the saved audio file in /tmp if saving is enabled.
- Binary data under binary.audio contains:
  - data: Base64-encoded audio content.
  - mimeType: MIME type of the audio.
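
A downstream step that consumes the TTS output might decode the base64 payload and write it to disk, along these lines. The `save_audio` helper and the MIME-type-to-extension table are assumptions for illustration; the item layout (`binary.audio.data` and `binary.audio.mimeType`) follows the description above.

```python
import base64
from pathlib import Path

# Assumed MIME types for the three output formats; adjust to what the node emits.
EXTENSIONS = {"audio/mpeg": ".mp3", "audio/wav": ".wav", "audio/ogg": ".ogg"}


def save_audio(item, directory, filename="speech"):
    """Decode item['binary']['audio'] and write it under `directory`.

    Returns the path of the written file.
    """
    audio = item["binary"]["audio"]
    suffix = EXTENSIONS.get(audio["mimeType"], ".bin")  # fallback for unknown types
    path = Path(directory) / f"{filename}{suffix}"
    path.write_bytes(base64.b64decode(audio["data"]))
    return path
```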

Dependencies

Requires a valid Google Cloud service account key with permissions for Speech-to-Text and Text-to-Speech APIs.
Node expects credentials to be provided as a JSON service account key including client email, private key, and project ID.
Uses official Google Cloud client libraries for speech and text-to-speech.
Optional: Writing synthesized audio files to /tmp requires filesystem write access.

Troubleshooting

Invalid Service Account Key: Error if the provided JSON key is malformed or missing required fields (client_email, private_key, project_id). Ensure the key is correctly copied and formatted.
No Speech Detected: May occur if the audio is silent, corrupted, or in an incompatible format or sample rate. For OGG_OPUS files, try different sample rates (8000, 16000, 24000, 48000 Hz) or convert the audio to FLAC or LINEAR16 WAV.
Synthesis Failures: Can happen if text is too long or voice configuration is invalid. Check that text length is reasonable and voice parameters are supported.
File Save Errors: When saving audio to /tmp, permission issues or disk space problems may cause errors. Verify node has write access and sufficient space.
General API Errors: Network issues, quota limits, or invalid parameters will throw errors. Review error messages and Google Cloud console for quota or billing status.
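
For the Invalid Service Account Key case above, a quick structural check can be run on the JSON before pasting it into the credentials. The sketch below is a hypothetical helper, not part of the node; it only confirms that the text parses as a JSON object and that the three required fields named above are present and non-empty. It does not verify that the key is accepted by Google Cloud.

```python
import json

REQUIRED_FIELDS = ("client_email", "private_key", "project_id")


def check_service_account_key(raw_json):
    """Return a list of problems found in the key; an empty list means none."""
    try:
        key = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(key, dict):
        return ["top-level JSON value must be an object"]
    # A field counts as missing when it is absent or empty.
    return [
        f"missing or empty field: {name}"
        for name in REQUIRED_FIELDS
        if not key.get(name)
    ]
```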
