Voice-Triggered Conversational Assistant with Full-Screen Audio Visualization

Description

This Flask-based application listens continuously in the browser for trigger words, sends captured speech to a local language model for processing, and plays the AI-generated response with real-time audio visualization. By leveraging browser-based speech recognition, system-level text-to-speech (eSpeak NG), and a local LLM integration (e.g., Ollama), it creates an immersive, hands-free conversational experience complete with dynamic full-screen frequency bars.

Installation & Setup

  1. Clone or Download the Code
    • Place the files in a directory of your choice (e.g., voice_triggered_app/).
  2. Install Dependencies
    • Make sure Flask is installed:
      pip install flask
    • Verify eSpeak NG is installed on your system and is available from the command line (espeak-ng --version).
  3. Local Language Model Configuration
    • If using Ollama, ensure it’s installed and working.
    • Edit generate_ai_response(prompt) in your code if you need to change the command for a different local LLM or a different model name; a sketch of this function follows this list.
  4. Run the Application
    • Navigate to the code directory and run:
      python app_v9.py
    • By default, Flask runs on http://127.0.0.1:5000.
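
A minimal sketch of how generate_ai_response might be implemented, assuming the Ollama CLI is on your PATH and that a model named llama3 has already been pulled (the model name is an assumption; substitute your own):

    import subprocess

    def generate_ai_response(prompt: str) -> str:
        """Call a local LLM through its CLI and return the generated text."""
        result = subprocess.run(
            ["ollama", "run", "llama3", prompt],  # swap in your model name
            capture_output=True,
            text=True,
            timeout=120,  # avoid hanging indefinitely on a stalled model
        )
        return result.stdout.strip()

Switching to another local LLM is usually just a matter of changing the command list to that tool's CLI invocation.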

Usage Instructions

  1. Open the Web Interface
    • In your browser, navigate to the Flask server’s address (e.g., http://127.0.0.1:5000).
  2. Continuous Speech Recognition
    • The browser automatically starts listening for speech via the Web Speech API.
    • The recognized text is logged to the browser’s developer console for debugging (both interim and final transcripts).
  3. Trigger Words
    • A set of predefined words (e.g., “coeus,” “koeus,” “coy-us,” etc.) is monitored.
    • Once a trigger word is detected, the user’s final transcript is sent to the Flask endpoint (/get_response); see the endpoint sketch after this list.
  4. Receiving AI Response
    • The Flask server invokes the local LLM via subprocess.run() to generate a response.
    • eSpeak NG then converts the text response into a .wav file (a TTS sketch follows this list).
  5. Audio Playback & Visualization
    • The app streams the audio file and plays it automatically in the browser.
    • A dynamic spectrum visualizer (bars in a canvas) animates in sync with the audio, creating a full-screen effect.
  6. Deleting Audio
    • When TTS playback finishes, the .wav file is automatically deleted from the server to conserve disk space (see the streaming-and-cleanup sketch after this list).
  7. Concurrent Access
    • If another request is still being processed, new requests receive a 503 response with a friendly message to “try again.”
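
The following sketch shows how the /get_response endpoint and the single-request lock described above might fit together. The lock variable name and the JSON payload shape are assumptions, not necessarily what app_v9.py uses:

    import threading

    from flask import Flask, jsonify, request

    app = Flask(__name__)
    processing_lock = threading.Lock()  # assumed name: one pipeline at a time

    def generate_ai_response(prompt: str) -> str:
        # Stand-in for the real LLM call (see the sketch under Installation).
        return f"You said: {prompt}"

    @app.route("/get_response", methods=["POST"])
    def get_response():
        # If a previous request still holds the lock, answer 503 right away
        # instead of queueing a second LLM/TTS pipeline behind it.
        if not processing_lock.acquire(blocking=False):
            return jsonify({"message": "Still processing, please try again."}), 503
        try:
            data = request.get_json(silent=True) or {}
            transcript = data.get("text", "")
            return jsonify({"response": generate_ai_response(transcript)})
        finally:
            processing_lock.release()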
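
eSpeak NG can write a WAV file directly via its documented -w flag; a minimal sketch of the TTS step (the output filename is illustrative):

    import subprocess

    def synthesize_speech(text: str, wav_path: str = "response.wav") -> str:
        """Render text to a WAV file with eSpeak NG and return the path."""
        subprocess.run(
            ["espeak-ng", "-w", wav_path, text],  # -w writes audio to a file
            check=True,  # raise if espeak-ng exits non-zero
        )
        return wav_path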
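
Serving the WAV to the browser and removing it after playback could look like the sketch below. The /audio and /delete_audio route names are assumptions about app_v9.py's layout; the deletion route would be called from the page when the audio element's ended event fires:

    import os

    from flask import Flask, jsonify, send_file
    from werkzeug.utils import secure_filename

    app = Flask(__name__)

    @app.route("/audio/<filename>")
    def audio(filename):
        # Stream the synthesized WAV to the page's <audio> element.
        # secure_filename guards against path traversal in the URL segment.
        return send_file(secure_filename(filename), mimetype="audio/wav")

    @app.route("/delete_audio/<filename>", methods=["POST"])
    def delete_audio(filename):
        # Remove the clip once the browser reports playback has finished,
        # so stale WAV files don't accumulate on disk.
        path = secure_filename(filename)
        if os.path.exists(path):
            os.remove(path)
        return jsonify({"deleted": path})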

Exiting the Application

  • Stop Flask: Press Ctrl + C in the terminal window where the Flask app is running.
  • Disable Speech Recognition: Simply close your browser tab or stop the server.

Additional Tips

  • Browser Compatibility: webkitSpeechRecognition is supported in Chrome-based browsers. For others, or if you need cross-browser solutions, consider third-party libraries or alternative approaches.
  • Model Changes: If you switch from Ollama to another LLM, ensure the generate_ai_response function’s subprocess.run arguments match the required CLI usage.
  • Performance Tuning: Increase or decrease analyser.fftSize (in index.html) to change the visualizer resolution; the Web Audio API requires fftSize to be a power of two between 32 and 32768, and higher values yield more frequency bars at a higher per-frame cost.
  • Security: Consider using HTTPS in production and restricting access if you plan to deploy publicly; a minimal localhost/TLS sketch follows below.
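
For a quick local deployment, binding Flask to the loopback interface keeps the app off the open network, and Werkzeug's ad-hoc TLS support (which requires the pyopenssl package) provides a throwaway self-signed certificate for testing. A minimal sketch:

    from flask import Flask

    app = Flask(__name__)

    if __name__ == "__main__":
        # Loopback-only: nothing outside this machine can connect.
        # ssl_context="adhoc" generates a temporary self-signed certificate;
        # it is for local testing, not a substitute for real certificates.
        app.run(host="127.0.0.1", port=5000, ssl_context="adhoc")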

Enjoy experimenting with voice-triggered AI interactions and audio visualizations in your project! This application provides a foundation for building immersive, hands-free experiences powered by local language models and real-time speech technology.