WebSocket API Reference
Complete reference documentation for NextEVI’s real-time WebSocket API for Speech-to-Speech voice communication.

Base URL
Connection
Endpoint
Unique connection identifier. Generate a UUID v4 for each new connection.
Authentication
- API Key (Query Parameter)
- JWT Token (Header)
- JWT Token (Query Parameter)
Pass your organization API key as a query parameter:
Query Parameters
Organization API key (starts with oak_). Required if not using JWT authentication.
Voice configuration identifier from your NextEVI dashboard.
Project identifier (optional, auto-detected from config if not provided)
JWT token passed as a ‘Bearer token’ query value; an alternative to the Authorization header
Response
Connection establishment follows the standard WebSocket handshake. Upon successful connection, the server sends:
- Connection Metadata - Connection details and configuration
- Ready for Messages - Client can now send session settings and audio
Connection Flow
- WebSocket Handshake: Client initiates WebSocket connection
- Authentication: Server validates API key or JWT token
- Connection Metadata: Server sends connection details
- Session Settings: Client configures audio and feature settings
- Ready: Connection ready for voice communication
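The connection flow above starts with the client building the connection URL. As a minimal sketch, assuming a hypothetical endpoint path and the query-parameter names `api_key` and `config_id` (check your NextEVI dashboard for the exact values), the URL for API-key authentication could be built like this:

```python
import uuid
from urllib.parse import urlencode

def build_connection_url(base_url: str, api_key: str, config_id: str) -> str:
    """Build a wss:// URL for API-key query-parameter authentication.

    The endpoint path and parameter names here are illustrative
    assumptions, not confirmed values.
    """
    connection_id = str(uuid.uuid4())  # fresh UUID v4 per connection
    query = urlencode({
        "api_key": api_key,      # organization key, starts with oak_
        "config_id": config_id,  # voice configuration identifier
    })
    return f"{base_url}/{connection_id}?{query}"

url = build_connection_url("wss://api.example.com/v1/ws", "oak_abc123", "cfg_1")
```

Generating a fresh UUID per connection keeps sessions distinguishable on the server side and simplifies debugging across reconnects.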
Message Format
All WebSocket messages use a consistent JSON structure:
Message type identifier (see message types below)
Unix timestamp in seconds with millisecond precision
Unique identifier for this message (UUID recommended)
Message-specific data payload (varies by message type)
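Putting the four envelope fields together, a client-side helper might look like the sketch below. The exact JSON key names (`type`, `timestamp`, `message_id`, `data`) are assumptions inferred from the field descriptions above:

```python
import json
import time
import uuid

def make_message(msg_type: str, data: dict) -> str:
    """Wrap a payload in the common envelope described above.

    Key names are assumed from the field descriptions; verify them
    against actual server traffic.
    """
    envelope = {
        "type": msg_type,                    # message type identifier
        "timestamp": round(time.time(), 3),  # Unix seconds, ms precision
        "message_id": str(uuid.uuid4()),     # unique per message
        "data": data,                        # type-specific payload
    }
    return json.dumps(envelope)

msg = json.loads(make_message("keep_alive", {}))
```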
Client Messages
Messages sent from client to server.

Session Settings
Configure audio settings and enable features for the connection.
enabled (boolean): Enable real-time emotion detection
enabled (boolean): Enable intelligent turn detection
silence_threshold (number): Silence duration to detect turn end (seconds)
sample_rate (number): Audio sample rate (24000 recommended)
channels (number): Audio channels (1 for mono)
encoding (string): Audio encoding format (“linear16”)
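Combining those fields, a session-settings payload could look like the following sketch. The message type name `session_settings` and the nesting of the feature objects are assumptions; only the field names and recommended values come from the reference:

```python
# Hypothetical session_settings payload; structure is an assumption,
# field names and recommended values are from the reference above.
session_settings = {
    "type": "session_settings",
    "data": {
        "audio": {
            "sample_rate": 24000,   # recommended rate
            "channels": 1,          # mono
            "encoding": "linear16",
        },
        "emotion_detection": {"enabled": True},
        "turn_detection": {
            "enabled": True,
            "silence_threshold": 0.8,  # seconds of silence ending a turn
        },
    },
}
```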
Audio Input
Send audio data for speech processing.
Base64-encoded PCM audio data (16-bit, mono, 24kHz)
Optional identifier for audio chunk ordering
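Raw PCM must be base64-encoded before it goes into the JSON payload. A minimal sketch, assuming hypothetical key names (`audio`, `sequence`) and message type (`audio_input`):

```python
import base64

def encode_audio_chunk(pcm_bytes: bytes, sequence: int) -> dict:
    """Base64-encode one raw PCM chunk (16-bit, mono, 24 kHz).

    Key and type names are illustrative assumptions.
    """
    return {
        "type": "audio_input",
        "data": {
            "audio": base64.b64encode(pcm_bytes).decode("ascii"),
            "sequence": sequence,  # optional chunk-ordering identifier
        },
    }

# 2400 samples * 2 bytes = 100 ms of 24 kHz 16-bit mono audio
chunk = encode_audio_chunk(b"\x00\x01" * 2400, 0)
```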
Keep Alive
Maintain the connection during idle periods.

Server Messages
Messages sent from server to client.

Connection Metadata
Sent immediately after successful connection establishment.
Confirmed connection identifier
Connection status (“connected”)
Audio configuration details
Associated project identifier
Voice configuration identifier
Transcription
Real-time speech-to-text results from user audio input.
Transcribed text from speech input
Transcription confidence score (0-1)
Whether this transcription is final (true) or partial (false)
Whether the user has finished speaking this utterance
Session identifier for this connection
Word-level timing and confidence information
word (string): The word
start (number): Start time in seconds
end (number): End time in seconds
confidence (number): Word confidence score (0-1)
Complete accumulated text for this conversation turn
Whether the user’s conversation turn is still continuing
Original transcript fragment before accumulation
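A client typically buffers final fragments until the end of the user's turn. A minimal sketch, assuming the field keys `text`, `is_final`, and `end_of_turn` match the descriptions above:

```python
def handle_transcription(msg: dict, turn_buffer: list):
    """Accumulate final transcription fragments; return the full turn
    text once the user has finished speaking, else None.

    Field keys are assumptions based on the reference descriptions.
    """
    data = msg["data"]
    if data["is_final"]:
        turn_buffer.append(data["text"])  # keep only final fragments
    if data["end_of_turn"]:
        full_text = " ".join(turn_buffer)
        turn_buffer.clear()               # reset for the next turn
        return full_text
    return None
```

Partial (non-final) results can be shown in the UI for responsiveness, but only final fragments should be accumulated, since partials are revised as more audio arrives.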
LLM Response Chunk
Streaming text responses from the language model.
Text content chunk from the language model
Whether this is the final chunk in the response
Unique identifier for this response generation
Sequential index of this chunk in the response
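Since chunks carry a sequential index and a final-chunk flag, the client can reassemble the full response even if chunks arrive out of order. A sketch, assuming the keys `content`, `index`, and `is_final`:

```python
def assemble_response(chunks: list) -> str:
    """Reorder streaming chunks by index and join them once the
    final chunk has arrived; return None if the stream is incomplete.

    Key names are assumptions based on the field descriptions.
    """
    ordered = sorted(chunks, key=lambda c: c["index"])
    if not ordered or not ordered[-1]["is_final"]:
        return None  # still streaming
    return "".join(c["content"] for c in ordered)
```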
TTS Audio Chunk
Audio response chunks for playback to the user.
Base64-encoded audio data (WAV format) for playback
Emotion Update
Real-time emotion detection results from user speech.
Top detected emotions with confidence scores
name (string): Emotion name
score (number): Confidence score (0-1)
Complete emotion analysis results with scores for all emotions
Time taken to process emotion detection (seconds)
Duration of analyzed speech segment (seconds)
Connection identifier
Session identifier
Turn Detection Events
Conversation turn management events.

Turn Start
Unique identifier for this conversation turn
Duration of the turn in seconds (turn_end only)
Whether the turn was completed naturally (turn_end only)
TTS Interruption
Indicates that AI speech was interrupted by the user.

Status Messages
System status updates and confirmations.
Current system status
ready: System ready for voice communication
processing: Processing audio or generating a response
error: Error state
Additional status details and configuration
Error Messages
Error notifications and debugging information.
Standardized error code (see Error Reference)
Human-readable error message
Additional error context and debugging information
Response Codes
WebSocket connections use standard HTTP status codes during the handshake, then WebSocket close codes:

HTTP Status Codes (Handshake)
| Code | Description |
|---|---|
| 101 | Switching Protocols - Connection successful |
| 400 | Bad Request - Invalid connection parameters |
| 401 | Unauthorized - Authentication failed |
| 403 | Forbidden - Access denied |
| 404 | Not Found - Invalid endpoint |
| 429 | Too Many Requests - Rate limited |
| 500 | Internal Server Error - Server error |
WebSocket Close Codes
| Code | Description | Retry |
|---|---|---|
| 1000 | Normal Closure - Clean disconnect | No |
| 1001 | Going Away - Server restart | Yes |
| 1002 | Protocol Error - Invalid message format | No |
| 1003 | Unsupported Data - Invalid data type | No |
| 1006 | Abnormal Closure - Network error | Yes |
| 1011 | Internal Error - Server error | Yes |
| 4001 | Unauthorized - Authentication failed | No |
| 4002 | Invalid Config - Config not found | No |
| 4003 | Access Denied - Insufficient permissions | No |
| 4004 | Rate Limited - Too many connections | Yes |
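The Retry column above translates directly into a reconnect decision. A minimal sketch:

```python
# Close codes marked "Yes" in the Retry column of the table above.
RETRYABLE_CLOSE_CODES = {1001, 1006, 1011, 4004}

def should_retry(close_code: int) -> bool:
    """Retry transient failures (server restart, network error,
    rate limit); never retry auth/config errors or clean closes."""
    return close_code in RETRYABLE_CLOSE_CODES
```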
Rate Limits
Connection Limits
| Limit Type | Limit | Window |
|---|---|---|
| Connections per API Key | 100 | 1 minute |
| Connections per IP | 50 | 1 minute |
| Audio Messages | 1000 | 1 minute |
| Text Messages | 100 | 1 minute |
Audio Limits
| Metric | Limit |
|---|---|
| Max Audio Chunk Size | 1 MB |
| Max Message Rate | 100/second |
| Max Session Duration | 60 minutes |
| Max Concurrent Sessions | 10 per API key |
Best Practices
Connection Management
- Generate unique connection IDs (UUID v4 recommended)
- Implement exponential backoff for reconnections
- Handle connection lifecycle properly (open/message/error/close)
- Use keep-alive messages for long idle periods
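The exponential-backoff recommendation above can be sketched with full jitter, which spreads reconnect attempts out and avoids thundering-herd reconnect storms (the base and cap values here are illustrative choices, not prescribed by the API):

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: a random delay in
    [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Callers sleep for `backoff_delay(attempt)` before the next reconnect, incrementing `attempt` on each failure and resetting it after a successful connection.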
Audio Streaming
- Send audio in 100-200ms chunks for optimal latency
- Use 24kHz, 16-bit, mono PCM format
- Implement audio buffering on client side
- Use binary WebSocket frames for audio when possible
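At 24 kHz, 16-bit, mono, the recommended 100-200 ms chunk size works out to 4,800-9,600 bytes per chunk. A sketch of the arithmetic and a simple splitter:

```python
SAMPLE_RATE = 24000    # Hz, recommended
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHANNELS = 1           # mono

def chunk_size_bytes(duration_ms: int) -> int:
    """Bytes in one PCM chunk of the given duration at 24 kHz, 16-bit, mono."""
    return SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS * duration_ms // 1000

def split_pcm(pcm: bytes, duration_ms: int = 100) -> list:
    """Split a raw PCM buffer into fixed-duration chunks for streaming."""
    size = chunk_size_bytes(duration_ms)
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]
```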
Error Handling
- Always handle WebSocket error and close events
- Implement retry logic with backoff for network errors
- Don’t retry authentication or access failures (close codes 4001-4003); only 4004 (rate limited) is retryable
- Log errors with sufficient context for debugging
Performance
- Minimize message payloads where possible
- Use efficient audio encoding (binary vs base64)
- Implement client-side audio processing (noise reduction)
- Monitor connection health and latency
Security
- Use secure WebSocket connections (wss://) only
- Validate all message payloads
- Implement proper authentication token refresh
- Don’t log sensitive data in error messages
