Google Speech

Use Google Speech API

Actions

Speech to Text Actions
- Recognize
Text to Speech Actions
- Synthesize

Overview

The node integrates with the Google Speech API to convert speech in audio into text (Speech to Text). Specifically, the "Recognize" operation under the "Speech to Text" resource transcribes spoken words from audio files or base64-encoded audio content into written text. This is useful for transcribing interviews, voice commands, meeting recordings, or any other audio from which text needs to be extracted.

Practical examples include:

- Transcribing customer support calls for analysis.
- Converting voice notes into text for documentation.
- Processing audio streams from videos to generate subtitles or captions.

Properties

Name	Meaning
Language Code	The language of the supplied audio. Options: Italian (it-IT), English US (en-US), English UK (en-GB), French (fr-FR), German (de-DE), Spanish (es-ES).
Upload Method	How to upload the audio content. Options: Base64 Content (upload raw audio encoded in base64), URI (Remote Audio File) (provide a URI to an audio file in Cloud Storage or a publicly accessible URL).
Audio Content (Base64)	Base64-encoded audio content. Required when the upload method is Base64 Content.
Audio URI	URI of the audio file in Cloud Storage (gs://...) or a publicly accessible URL. Required when the upload method is URI.
Audio Format	The format of the audio when uploading via Base64. Prefer lossless formats like FLAC or LINEAR16 (WAV). Options: OGG Opus, FLAC, LINEAR16 - WAV, MP3, Auto-detect (not recommended).
Sample Rate (Hz)	Sampling frequency of the audio in Hertz. Relevant for Base64 uploads and certain audio formats. Options include 8000 Hz (telephone quality), 16000 Hz (recommended for voice), up to 48000 Hz (high fidelity).
Recognition Model	Machine learning model to use for recognition (labeled "Modello di Riconoscimento" in the node UI). Options: Default (general use), Latest Short (for short commands), Phone Call (labeled "Telefonico"; for low-quality audio), Video.
Additional Options	Collection of optional settings: • Enable Automatic Punctuation (boolean) – adds punctuation automatically. • Number of Channels (number) – mono=1, stereo=2. • Separate Recognition Per Channel (boolean). • Boost for Specific Words (speechContexts) – improves recognition accuracy for specified words or phrases by boosting their probability (boost value 0-20).
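
As a rough illustration of how the properties above could map onto a request, the sketch below assembles a JSON body using the field names of the Speech-to-Text v1 REST API (config.languageCode, config.encoding, audio.content, and so on). The function name build_recognize_request and its parameters are hypothetical, and the mapping is an assumption about the node, not taken from its source code.

```python
import base64
import json


def build_recognize_request(
    language_code,
    audio_bytes=None,
    audio_uri=None,
    encoding="FLAC",
    sample_rate_hz=16000,
    model="default",
    punctuation=False,
    channels=None,
    separate_channels=False,
    boost_phrases=None,
    boost=10.0,
):
    """Assemble a request body for the v1 speech:recognize method.

    Hypothetical helper: the parameter names are illustrative only.
    """
    if (audio_bytes is None) == (audio_uri is None):
        raise ValueError("Provide exactly one of audio_bytes or audio_uri")

    config = {"languageCode": language_code, "model": model}

    if audio_bytes is not None:
        # Base64 upload: format and sample rate describe the inline audio.
        config["encoding"] = encoding
        config["sampleRateHertz"] = sample_rate_hz
        audio = {"content": base64.b64encode(audio_bytes).decode("ascii")}
    else:
        # URI upload: the audio is referenced, not embedded.
        audio = {"uri": audio_uri}

    if punctuation:
        config["enableAutomaticPunctuation"] = True
    if channels is not None:
        config["audioChannelCount"] = channels
        config["enableSeparateRecognitionPerChannel"] = separate_channels
    if boost_phrases:
        # One speech context holding all boosted phrases (boost 0-20).
        config["speechContexts"] = [
            {"phrases": list(boost_phrases), "boost": boost}
        ]

    return {"config": config, "audio": audio}


request = build_recognize_request(
    "en-US",
    audio_bytes=b"fake-audio",
    punctuation=True,
    boost_phrases=["n8n"],
)
print(json.dumps(request, indent=2))
```

Passing audio_uri="gs://..." instead of audio_bytes would produce a body with audio.uri and no encoding or sample rate, which mirrors the note that those two properties are relevant for Base64 uploads.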

Output

The node outputs JSON data containing:

- transcription: A string with the full recognized text from the audio.
- detailedResults: An array of objects, each representing a segment of the audio with its best transcription alternative and confidence score.
- fullResponse: The complete raw response object returned by the Google Speech API.
- requestConfig: The configuration parameters used for the recognition request.

If no speech is detected, the output includes an empty transcription and a message suggesting that you try a different audio file, format, or sample rate.

No binary data output is produced for this operation.
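
The output shape described above can be sketched as follows. This is a minimal illustration, assuming the raw response follows the Speech-to-Text v1 layout (results[].alternatives[] with transcript and confidence, most likely alternative first). The function name summarize_response, the keys inside each detailedResults entry, and the message key are hypothetical; the node's actual implementation may differ.

```python
def summarize_response(full_response, request_config):
    """Derive the documented output fields from a raw recognize response."""
    detailed = []
    for result in full_response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        # The API lists the most probable alternative first.
        best = alternatives[0]
        detailed.append(
            {
                "transcript": best.get("transcript", ""),
                "confidence": best.get("confidence"),
            }
        )

    # Join the per-segment transcripts into one string.
    transcription = " ".join(
        item["transcript"].strip() for item in detailed
    ).strip()

    output = {
        "transcription": transcription,
        "detailedResults": detailed,
        "fullResponse": full_response,
        "requestConfig": request_config,
    }
    if not transcription:
        # Hypothetical key; the wording follows the Troubleshooting section.
        output["message"] = "No speech detected or recognized."
    return output


sample_response = {
    "results": [
        {"alternatives": [{"transcript": "hello world", "confidence": 0.92}]},
        {"alternatives": [{"transcript": " how are you", "confidence": 0.88}]},
    ]
}
output = summarize_response(sample_response, {"languageCode": "en-US"})
print(output["transcription"])
```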

Dependencies

Requires a valid Google Cloud service account key with access to the Google Speech-to-Text API.
The node expects the user to provide this credential securely within n8n.
No additional environment variables are required beyond the configured credentials.
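
Before pasting a key into the credential, it can help to check that the JSON parses and contains the fields the node's error message names (client_email, private_key, project_id). The sketch below is a standalone check, not the node's own validation; the function name missing_key_fields is hypothetical.

```python
import json

# Fields named by the node's "Invalid service account key JSON" error.
REQUIRED_FIELDS = ("client_email", "private_key", "project_id")


def missing_key_fields(key_json):
    """Return the required fields that are absent or empty in key_json."""
    try:
        key = json.loads(key_json)
    except json.JSONDecodeError:
        # Unparseable input: treat every field as missing.
        return list(REQUIRED_FIELDS)
    if not isinstance(key, dict):
        return list(REQUIRED_FIELDS)
    return [field for field in REQUIRED_FIELDS if not key.get(field)]


# Placeholder values only; a real key comes from the Google Cloud console.
example_key = json.dumps(
    {"client_email": "svc@example.iam.gserviceaccount.com", "project_id": "demo"}
)
print(missing_key_fields(example_key))
```

An empty list means all three fields are present; it does not prove that the key is valid or that it has access to the Speech-to-Text API.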

Troubleshooting

Common Issues:
- Invalid or incomplete service account key JSON will cause errors.
- Audio format or sample rate mismatches can lead to poor or no transcription results.
- Using OGG_OPUS format may require experimenting with different sample rates (8000, 16000, 24000, 48000 Hz).
- Large or unsupported audio files might fail recognition.
Error Messages:
- "Invalid service account key JSON. Please provide a valid service account key." — Ensure the JSON contains client_email, private_key, and project_id.
- "No speech detected or recognized." — Try a different audio file, format, or sample rate.
- If recognition fails with OGG_OPUS files, try converting the audio to FLAC or LINEAR16 (WAV) at 16000 Hz.
Recommendations:
- Prefer lossless audio formats (FLAC, LINEAR16) for better accuracy.
- Use the "phone_call" model for low-quality audio.
- Enable automatic punctuation for more readable transcripts.
- Use speech contexts to boost recognition of specific words or phrases relevant to your audio.
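
Since format and sample rate mismatches are a common cause of empty transcriptions, one way to avoid guessing is to read the values from the file itself. The sketch below uses Python's standard wave module and therefore applies only to uncompressed WAV (LINEAR16) files; the function name describe_wav and the returned keys are illustrative, chosen to mirror the request field names.

```python
import io
import wave


def describe_wav(source):
    """Read header values from a WAV file path or binary file object."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sampleRateHertz": rate,
            "audioChannelCount": wav.getnchannels(),
            "bitsPerSample": wav.getsampwidth() * 8,
            "durationSeconds": wav.getnframes() / rate,
        }


# Demo input: one second of 16 kHz, 16-bit mono silence built in memory.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)
buffer.seek(0)

info = describe_wav(buffer)
print(info)
```

The reported sample rate and channel count can then be entered in the Sample Rate (Hz) and Number of Channels properties.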