WebSocket API Reference

Complete reference documentation for NextEVI’s real-time Speech-to-Speech WebSocket voice API.

Base URL

wss://api.nextevi.com/ws/voice/{connection_id}

Connection

Endpoint

wss://api.nextevi.com/ws/voice/{connection_id}
connection_id
string
required
Unique connection identifier. Generate a UUID v4 for each new connection.
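
For example, a browser or recent Node.js client can generate the identifier with the built-in crypto API (a minimal sketch; any string that is unique per connection works):

const connectionId = crypto.randomUUID(); // UUID v4
const url = `wss://api.nextevi.com/ws/voice/${connectionId}`;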

Authentication

Pass your organization API key as a query parameter:
wss://api.nextevi.com/ws/voice/{connection_id}?api_key=oak_your_api_key&config_id=your_config_id

Query Parameters

api_key
string
Organization API key (starts with oak_). Required if not using JWT authentication.
config_id
string
required
Voice configuration identifier from your NextEVI dashboard.
project_id
string
Project identifier (optional, auto-detected from config if not provided)
authorization
string
JWT passed as ‘Bearer <token>’; a query-parameter alternative to the Authorization header

Response

Connection establishment follows the standard WebSocket handshake. Upon successful connection, the server sends:
  1. Connection Metadata - Connection details and configuration
  2. Ready for Messages - Client can now send session settings and audio

Connection Flow

  1. WebSocket Handshake: Client initiates WebSocket connection
  2. Authentication: Server validates API key or JWT token
  3. Connection Metadata: Server sends connection details
  4. Session Settings: Client configures audio and feature settings
  5. Ready: Connection ready for voice communication
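
A minimal client-side sketch of this flow, waiting for connection metadata before configuring the session (assumes the url built in the earlier sketch):

const ws = new WebSocket(url);

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === 'connection_metadata') {
    // Steps 3-4: server confirmed the connection, now configure the session
    ws.send(JSON.stringify({
      type: 'session_settings',
      timestamp: Date.now() / 1000,
      message_id: crypto.randomUUID(),
      data: { audio: { sample_rate: 24000, channels: 1, encoding: 'linear16' } }
    }));
  }
};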

Message Format

All WebSocket messages use a consistent JSON structure:
{
  "type": "message_type",
  "timestamp": 1645123456.789,
  "message_id": "uuid-string",
  "data": {
    // Message-specific payload
  }
}
type
string
required
Message type identifier (see message types below)
timestamp
number
required
Unix timestamp in seconds with millisecond precision
message_id
string
required
Unique identifier for this message (UUID recommended)
data
object
Message-specific data payload (varies by message type)
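
Since every message shares this envelope, a small helper keeps client code consistent (a sketch; crypto.randomUUID assumes a modern browser or Node.js runtime):

function makeMessage(type, data) {
  return JSON.stringify({
    type,
    timestamp: Date.now() / 1000, // Unix seconds with millisecond precision
    message_id: crypto.randomUUID(),
    data
  });
}

// Usage: ws.send(makeMessage('session_settings', { audio: { sample_rate: 24000, channels: 1, encoding: 'linear16' } }));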

Client Messages

Messages sent from client to server.

Session Settings

Configure audio settings and enable features for the connection.
{
  "type": "session_settings",
  "timestamp": 1645123456.789,
  "message_id": "settings-1",
  "data": {
    "emotion_detection": { "enabled": true },
    "turn_detection": { "enabled": true, "silence_threshold": 0.5 },
    "audio": { 
      "sample_rate": 24000, 
      "channels": 1, 
      "encoding": "linear16" 
    }
  }
}
data.emotion_detection
object
  • enabled (boolean): Enable real-time emotion detection
data.turn_detection
object
  • enabled (boolean): Enable intelligent turn detection
  • silence_threshold (number): Silence duration to detect turn end (seconds)
data.audio
object
required
  • sample_rate (number): Audio sample rate (24000 recommended)
  • channels (number): Audio channels (1 for mono)
  • encoding (string): Audio encoding format (“linear16”)

Audio Input

Send audio data for speech processing.
{
  "type": "audio_input", 
  "timestamp": 1645123456.789,
  "message_id": "audio-1",
  "data": {
    "audio": "base64-encoded-audio-data",
    "chunk_id": "chunk-001"
  }
}
data.audio
string
required
Base64-encoded PCM audio data (16-bit, mono, 24kHz)
data.chunk_id
string
Optional identifier for audio chunk ordering
Alternative: Binary Audio

For efficiency, send raw PCM audio data (16-bit, mono, 24kHz) as binary WebSocket frames:
const audioBuffer = new Int16Array(audioSamples);
websocket.send(audioBuffer.buffer);
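
If you use the JSON audio_input message instead, the captured samples must be converted to 16-bit PCM and base64-encoded first. A browser-side sketch, assuming Float32Array samples from the Web Audio API already at 24kHz mono:

// Convert float samples in [-1, 1] to 16-bit PCM, then base64 for audio_input
function encodeAudioChunk(float32Samples) {
  const pcm = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
  }
  // btoa expects a binary string, so build one from the PCM bytes
  let binary = '';
  const bytes = new Uint8Array(pcm.buffer);
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}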

Keep Alive

Maintain the connection during idle periods.
{
  "type": "keep_alive",
  "timestamp": 1645123456.789, 
  "message_id": "ping-1"
}
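
For example, a client might send one on a timer while the microphone is muted (a sketch; the 20-second interval is an assumption, not a documented requirement):

const keepAlive = setInterval(() => {
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({
      type: 'keep_alive',
      timestamp: Date.now() / 1000,
      message_id: crypto.randomUUID()
    }));
  }
}, 20000);

// Remember to clearInterval(keepAlive) when the connection closes.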

Server Messages

Messages sent from server to client.

Connection Metadata

Sent immediately after successful connection establishment.
{
  "type": "connection_metadata",
  "timestamp": 1645123456.789,
  "message_id": "meta-1", 
  "data": {
    "connection_id": "conn-xyz789",
    "status": "connected",
    "config": {
      "audio_format": "pcm_24khz_16bit_mono",
      "encoding": "linear16", 
      "sample_rate": 24000,
      "channels": 1
    },
    "project_id": "project-123",
    "config_id": "config-abc"
  }
}
data.connection_id
string
Confirmed connection identifier
data.status
string
Connection status (“connected”)
data.config
object
Audio configuration details
data.project_id
string
Associated project identifier
data.config_id
string
Voice configuration identifier

Transcription

Real-time speech-to-text results from user audio input.
{
  "type": "transcription",
  "timestamp": 1645123456.789,
  "message_id": "transcript-1",
  "data": {
    "transcript": "Hello, how can I help you today?",
    "confidence": 0.95,
    "is_final": true,
    "is_speech_final": true,
    "session_id": "conn-xyz789",
    "words": [
      {
        "word": "Hello",
        "start": 1.2,
        "end": 1.6, 
        "confidence": 0.98
      }
    ],
    "accumulated_transcript": "Hello, how can I help you today?",
    "is_turn_incomplete": false,
    "original_fragment": "Hello, how can I help you today?"
  }
}
data.transcript
string
Transcribed text from speech input
data.confidence
number
Transcription confidence score (0-1)
data.is_final
boolean
Whether this transcription is final (true) or partial (false)
data.is_speech_final
boolean
Whether the user has finished speaking this utterance
data.session_id
string
Session identifier for this connection
data.words
array
Word-level timing and confidence information
  • word (string): The word
  • start (number): Start time in seconds
  • end (number): End time in seconds
  • confidence (number): Word confidence score (0-1)
data.accumulated_transcript
string
Complete accumulated text for this conversation turn
data.is_turn_incomplete
boolean
Whether the user’s conversation turn is still continuing
data.original_fragment
string
Original transcript fragment before accumulation
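
A common captioning pattern is to overwrite the display on partial results and commit the text once is_final is true (a sketch; updateLiveCaption and commitCaption are hypothetical UI helpers):

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type !== 'transcription') return;
  if (msg.data.is_final) {
    commitCaption(msg.data.transcript);     // finalized text
  } else {
    updateLiveCaption(msg.data.transcript); // replaced by the next partial
  }
};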

LLM Response Chunk

Streaming text responses from the language model.
{
  "type": "llm_response_chunk",
  "timestamp": 1645123456.789,
  "message_id": "llm-chunk-1",
  "data": {
    "content": "I'd be happy to help you with",
    "is_final": false,
    "generation_id": "gen-abc123", 
    "chunk_index": 1
  }
}
data.content
string
Text content chunk from language model
data.is_final
boolean
Whether this is the final chunk in the response
data.generation_id
string
Unique identifier for this response generation
data.chunk_index
number
Sequential index of this chunk in the response
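
Chunks belonging to one response share a generation_id and arrive in chunk_index order, so the client can accumulate them per generation (a sketch):

const generations = new Map(); // generation_id -> accumulated text

function onLlmChunk({ content, is_final, generation_id }) {
  const text = (generations.get(generation_id) || '') + content;
  generations.set(generation_id, text);
  if (is_final) {
    console.log('Full response:', text);
    generations.delete(generation_id);
  }
}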

TTS Audio Chunk

Audio response chunks for playback to the user.
{
  "type": "tts_chunk",
  "timestamp": 1645123456.789,
  "message_id": "tts-1", 
  "content": "base64-encoded-audio-data"
}
content
string
Base64-encoded audio data (WAV format) for playback
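
Because each chunk is a complete base64-encoded WAV payload, a browser client can decode and play it with the Web Audio API (a sketch; a production client would queue chunks so they play back to back):

const audioCtx = new AudioContext();

async function playTtsChunk(base64Wav) {
  // base64 -> bytes -> decoded audio buffer
  const bytes = Uint8Array.from(atob(base64Wav), (c) => c.charCodeAt(0));
  const buffer = await audioCtx.decodeAudioData(bytes.buffer);
  const source = audioCtx.createBufferSource();
  source.buffer = buffer;
  source.connect(audioCtx.destination);
  source.start();
}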

Emotion Update

Real-time emotion detection results from user speech.
{
  "type": "emotion_update",
  "timestamp": 1645123456.789,
  "message_id": "emotion-1",
  "data": {
    "top_emotions": [
      { "name": "Joy", "score": 0.85 },
      { "name": "Excitement", "score": 0.72 }
    ],
    "all_emotions": {
      "Joy": 0.85,
      "Sadness": 0.12,
      "Anger": 0.03,
      "Fear": 0.05, 
      "Surprise": 0.15,
      "Disgust": 0.02,
      "Contempt": 0.01,
      "Excitement": 0.72,
      "Calmness": 0.45
    },
    "processing_time": 0.045,
    "utterance_duration": 2.3,
    "connection_id": "conn-xyz789",
    "session_id": "conn-xyz789"
  }
}
data.top_emotions
array
Top detected emotions with confidence scores
  • name (string): Emotion name
  • score (number): Confidence score (0-1)
data.all_emotions
object
Complete emotion analysis results with scores for all emotions
data.processing_time
number
Time taken to process emotion detection (seconds)
data.utterance_duration
number
Duration of analyzed speech segment (seconds)
data.connection_id
string
Connection identifier
data.session_id
string
Session identifier
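
For example, to surface the single strongest emotion from a full update (a sketch):

function dominantEmotion(allEmotions) {
  // Object.entries gives [name, score] pairs; sort descending by score
  return Object.entries(allEmotions).sort(([, a], [, b]) => b - a)[0];
}

// dominantEmotion(msg.data.all_emotions) -> ['Joy', 0.85]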

Turn Detection Events

Conversation turn management events.

Turn Start
{
  "type": "turn_start", 
  "timestamp": 1645123456.789,
  "message_id": "turn-1",
  "data": {
    "turn_id": "turn-abc123"
  }
}
Turn End
{
  "type": "turn_end",
  "timestamp": 1645123456.789,
  "message_id": "turn-2", 
  "data": {
    "turn_id": "turn-abc123",
    "duration": 3.2,
    "is_complete": true
  }
}
data.turn_id
string
Unique identifier for this conversation turn
data.duration
number
Duration of the turn in seconds (turn_end only)
data.is_complete
boolean
Whether the turn was completed naturally (turn_end only)

TTS Interruption

Indicates that AI speech was interrupted by the user.
{
  "type": "tts_interruption",
  "timestamp": 1645123456.789,
  "message_id": "interrupt-1",
  "content": ""
}
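
On receiving this event, the client should stop playback at once and discard any queued audio (a sketch building on the playback example above; activeSources is a hypothetical list of started buffer sources):

if (msg.type === 'tts_interruption') {
  activeSources.forEach((source) => source.stop()); // halt current playback
  activeSources.length = 0;                         // drop queued audio
}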

Status Messages

System status updates and confirmations.
{
  "type": "status",
  "timestamp": 1645123456.789,
  "message_id": "status-1",
  "data": {
    "status": "ready", 
    "details": {
      "session_settings": {
        "sample_rate": 24000,
        "channels": 1,
        "encoding": "linear16"
      }
    }
  }
}
data.status
string
Current system status
  • ready: System ready for voice communication
  • processing: Processing audio or generating response
  • error: Error state
data.details
object
Additional status details and configuration

Error Messages

Error notifications and debugging information.
{
  "type": "error",
  "timestamp": 1645123456.789, 
  "message_id": "error-1",
  "data": {
    "error_code": "AUDIO_PROCESSING_FAILED",
    "error_message": "Failed to process audio chunk",
    "details": {
      "chunk_id": "chunk-001"
    }
  }
}
data.error_code
string
Standardized error code (see Error Reference)
data.error_message
string
Human-readable error message
data.details
object
Additional error context and debugging information
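
Clients should log the structured fields rather than the raw frame (a sketch):

if (msg.type === 'error') {
  console.error(`[${msg.data.error_code}] ${msg.data.error_message}`, msg.data.details);
}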

Response Codes

WebSocket connections use standard HTTP status codes during the handshake, then WebSocket close codes:

HTTP Status Codes (Handshake)

Code | Description
-----|--------------------------------------------
101  | Switching Protocols - Connection successful
400  | Bad Request - Invalid connection parameters
401  | Unauthorized - Authentication failed
403  | Forbidden - Access denied
404  | Not Found - Invalid endpoint
429  | Too Many Requests - Rate limited
500  | Internal Server Error - Server error

WebSocket Close Codes

Code | Description                               | Retry
-----|-------------------------------------------|------
1000 | Normal Closure - Clean disconnect         | No
1001 | Going Away - Server restart               | Yes
1002 | Protocol Error - Invalid message format   | No
1003 | Unsupported Data - Invalid data type      | No
1006 | Abnormal Closure - Network error          | Yes
1011 | Internal Error - Server error             | Yes
4001 | Unauthorized - Authentication failed      | No
4002 | Invalid Config - Config not found         | No
4003 | Access Denied - Insufficient permissions  | No
4004 | Rate Limited - Too many connections       | Yes

Rate Limits

Connection Limits

Limit Type              | Limit | Window
------------------------|-------|---------
Connections per API Key | 100   | 1 minute
Connections per IP      | 50    | 1 minute
Audio Messages          | 1000  | 1 minute
Text Messages           | 100   | 1 minute

Audio Limits

Metric                  | Limit
------------------------|----------------
Max Audio Chunk Size    | 1 MB
Max Message Rate        | 100/second
Max Session Duration    | 60 minutes
Max Concurrent Sessions | 10 per API key
Rate limits are enforced per API key and IP address. Exceeded limits result in HTTP 429 or WebSocket close code 4004.

Best Practices

Connection Management

  • Generate unique connection IDs (UUID v4 recommended)
  • Implement exponential backoff for reconnections (see the sketch after this list)
  • Handle connection lifecycle properly (open/message/error/close)
  • Use keep-alive messages for long idle periods
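
A minimal reconnection sketch with exponential backoff, retrying only on the close codes the table above marks as retryable (assumes the connection url from the earlier examples):

const RETRYABLE = new Set([1001, 1006, 1011, 4004]);
let attempt = 0;

function connect() {
  const ws = new WebSocket(url);
  ws.onopen = () => { attempt = 0; }; // reset backoff on success
  ws.onclose = (event) => {
    if (!RETRYABLE.has(event.code)) return; // never retry auth/config failures
    const delay = Math.min(30000, 1000 * 2 ** attempt++); // 1s, 2s, 4s... capped at 30s
    setTimeout(connect, delay);
  };
}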

Audio Streaming

  • Send audio in 100-200ms chunks for optimal latency (see the sketch after this list)
  • Use 24kHz, 16-bit, mono PCM format
  • Implement audio buffering on client side
  • Use binary WebSocket frames for audio when possible
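
At 24kHz, 16-bit mono, a 100ms chunk is 2,400 samples (4,800 bytes). A sketch of slicing a capture buffer into such frames for binary sending:

const SAMPLE_RATE = 24000;
const CHUNK_SAMPLES = SAMPLE_RATE / 10; // 100ms of audio

function sendInChunks(ws, pcm /* Int16Array at 24kHz mono */) {
  for (let i = 0; i < pcm.length; i += CHUNK_SAMPLES) {
    // slice() copies, so each frame has its own ArrayBuffer to send
    ws.send(pcm.slice(i, i + CHUNK_SAMPLES).buffer);
  }
}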

Error Handling

  • Always handle WebSocket error and close events
  • Implement retry logic with backoff for network errors
  • Don’t retry authentication or configuration failures (close codes 4001-4003)
  • Log errors with sufficient context for debugging

Performance

  • Minimize message payloads where possible
  • Use efficient audio encoding (binary vs base64)
  • Implement client-side audio processing (noise reduction)
  • Monitor connection health and latency

Security

  • Use secure WebSocket connections (wss://) only
  • Validate all message payloads
  • Implement proper authentication token refresh
  • Don’t log sensitive data in error messages

Code Examples

JavaScript Connection

const ws = new WebSocket(
  'wss://api.nextevi.com/ws/voice/conn-123?' + 
  new URLSearchParams({
    api_key: 'oak_your_api_key',
    config_id: 'your_config_id'
  })
);

ws.onopen = () => {
  // Send session settings
  ws.send(JSON.stringify({
    type: 'session_settings',
    timestamp: Date.now() / 1000,
    message_id: 'settings-1',
    data: {
      emotion_detection: { enabled: true },
      audio: { sample_rate: 24000, channels: 1, encoding: 'linear16' }
    }
  }));
};

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  console.log('Received:', message);
};

Python Connection

import asyncio
import websockets
import json
import time

async def connect():
    uri = "wss://api.nextevi.com/ws/voice/conn-123?api_key=oak_your_api_key&config_id=your_config_id"
    
    async with websockets.connect(uri) as websocket:
        # Send session settings
        await websocket.send(json.dumps({
            "type": "session_settings",
            "timestamp": time.time(),
            "message_id": "settings-1", 
            "data": {
                "emotion_detection": {"enabled": True},
                "audio": {"sample_rate": 24000, "channels": 1, "encoding": "linear16"}
            }
        }))
        
        # Listen for messages
        async for message in websocket:
            data = json.loads(message)
            print("Received:", data)

asyncio.run(connect())

cURL Connection Test

# Test the WebSocket handshake with cURL (the upgrade request is sent over https://)
curl -i -N -H "Connection: Upgrade" \
     -H "Upgrade: websocket" \
     -H "Sec-WebSocket-Version: 13" \
     -H "Sec-WebSocket-Key: $(openssl rand -base64 16)" \
     -H "Authorization: Bearer your_jwt_token" \
     "https://api.nextevi.com/ws/voice/conn-123?config_id=your_config_id"