When you embed a Copilot Studio agent on a webpage using DirectLine and the Bot Framework Web Chat control, it works great — but it’s text only. What if your agent could speak its responses out loud?
In this post, I’ll show you how to intercept bot responses in the Web Chat store middleware and pass them to a text-to-speech engine so your agent speaks every reply in the browser. The pattern works with any TTS provider — the browser’s built-in Web Speech API, Azure Speech Services, ElevenLabs, or anything else.
If you’ve seen my post on Once Upon a Prompt — AI Bedtime Stories, you’ll know I’ve been experimenting with voice generation. This takes that same idea and applies it to a live Copilot Studio agent.
How It Works
The approach is straightforward:
- Intercept bot messages — The Web Chat createStore middleware already handles DirectLine events. We add a listener for DIRECT_LINE/INCOMING_ACTIVITY to catch every bot response.
- Extract the text — Bot responses can contain plain text, markdown, or Adaptive Card attachments. We strip formatting so the voice reads naturally.
- Speak it — Pass the cleaned text to your TTS engine of choice.
- Play it — The audio plays directly in the browser.
The user gets a toggle button in the chat header to enable/disable voice at any time.
The Key Code: Store Middleware
The magic happens in the Web Chat store middleware. Here’s the createCustomStore function with the TTS intercept added:
function createCustomStore() {
  return window.WebChat.createStore(
    {},
    ({ dispatch }) => (next) => (action) => {
      // Trigger greeting on connect
      if (action.type === "DIRECT_LINE/CONNECT_FULFILLED") {
        dispatch({
          type: "DIRECT_LINE/POST_ACTIVITY",
          meta: { method: "keyboard" },
          payload: {
            activity: {
              channelData: { postBack: true },
              name: "startConversation",
              type: "event",
            },
          },
        });
      }

      // 🔊 Intercept bot messages for TTS
      if (
        action.type === "DIRECT_LINE/INCOMING_ACTIVITY" &&
        action.payload?.activity?.from?.role === "bot" &&
        action.payload?.activity?.type === "message"
      ) {
        const text = extractBotText(action.payload.activity);
        if (text) {
          speakText(text);
        }
      }

      return next(action);
    }
  );
}
The two key additions are the DIRECT_LINE/INCOMING_ACTIVITY check — which filters for bot messages only — and the calls to extractBotText and speakText.
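To make the filter concrete, here is that check extracted into a standalone predicate (a hypothetical refactor for illustration), run against two sample actions. Note that DirectLine echoes the user's own messages back with role "user", which is exactly what the role check skips:

```javascript
// The predicate from the middleware, pulled out so it can be checked in isolation.
function isBotMessage(action) {
  return (
    action.type === "DIRECT_LINE/INCOMING_ACTIVITY" &&
    action.payload?.activity?.from?.role === "bot" &&
    action.payload?.activity?.type === "message"
  );
}

// A bot reply passes the filter...
const sample = {
  type: "DIRECT_LINE/INCOMING_ACTIVITY",
  payload: { activity: { type: "message", from: { role: "bot" }, text: "Hi there!" } },
};
console.log(isBotMessage(sample)); // true

// ...but the echo of the user's own message is skipped.
const echo = {
  type: "DIRECT_LINE/INCOMING_ACTIVITY",
  payload: { activity: { type: "message", from: { role: "user" }, text: "Hello" } },
};
console.log(isBotMessage(echo)); // false
```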
Extracting Text from Bot Activities
Bot messages aren’t always plain text. They might include markdown formatting, links, or even Adaptive Cards. This helper strips all of that down to clean readable text:
function stripMarkdown(text) {
  return text
    .replace(/\*\*(.*?)\*\*/g, "$1") // bold
    .replace(/\*(.*?)\*/g, "$1") // italic
    .replace(/#{1,6}\s?/g, "") // headings
    .replace(/\[([^\]]+)\]\([^)]+\)/g, "$1") // links
    .replace(/`([^`]+)`/g, "$1") // inline code
    .replace(/!\[.*?\]\(.*?\)/g, "") // images
    .replace(/<[^>]+>/g, "") // HTML tags
    .replace(/\n{2,}/g, ". ") // paragraph breaks become pauses
    .replace(/\n/g, " ")
    .trim();
}

function extractBotText(activity) {
  if (activity.text) return stripMarkdown(activity.text);

  // Handle Adaptive Card attachments
  if (activity.attachments) {
    for (const att of activity.attachments) {
      if (att.content && att.content.body) {
        const texts = att.content.body
          .filter((b) => b.type === "TextBlock" && b.text)
          .map((b) => b.text);
        if (texts.length) return stripMarkdown(texts.join(" "));
      }
    }
  }
  return null;
}
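As a quick sanity check, here is a standalone run against two hypothetical activities, one plain markdown reply and one Adaptive Card (the two helpers are repeated so the snippet runs on its own):

```javascript
// stripMarkdown and extractBotText repeated from above so this check is self-contained.
function stripMarkdown(text) {
  return text
    .replace(/\*\*(.*?)\*\*/g, "$1")
    .replace(/\*(.*?)\*/g, "$1")
    .replace(/#{1,6}\s?/g, "")
    .replace(/\[([^\]]+)\]\([^)]+\)/g, "$1")
    .replace(/`([^`]+)`/g, "$1")
    .replace(/!\[.*?\]\(.*?\)/g, "")
    .replace(/<[^>]+>/g, "")
    .replace(/\n{2,}/g, ". ")
    .replace(/\n/g, " ")
    .trim();
}

function extractBotText(activity) {
  if (activity.text) return stripMarkdown(activity.text);
  if (activity.attachments) {
    for (const att of activity.attachments) {
      if (att.content && att.content.body) {
        const texts = att.content.body
          .filter((b) => b.type === "TextBlock" && b.text)
          .map((b) => b.text);
        if (texts.length) return stripMarkdown(texts.join(" "));
      }
    }
  }
  return null;
}

// A plain markdown reply
const textActivity = { text: "**Hello!** See [our docs](https://example.com) for `details`." };
console.log(extractBotText(textActivity)); // "Hello! See our docs for details."

// A hypothetical Adaptive Card reply; only TextBlock elements are spoken
const cardActivity = {
  attachments: [{
    contentType: "application/vnd.microsoft.card.adaptive",
    content: {
      body: [
        { type: "TextBlock", text: "Your order" },
        { type: "Image", url: "https://example.com/img.png" },
        { type: "TextBlock", text: "ships **tomorrow**." },
      ],
    },
  }],
};
console.log(extractBotText(cardActivity)); // "Your order ships tomorrow."
```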
Speaking the Text
This is where you plug in your TTS engine. I’ll show two approaches — the zero-dependency browser option and a cloud API option — so you can pick what suits your needs.
Option 1: Browser Built-in Speech (No API Required)
Every modern browser ships with the Web Speech API. It’s completely free and requires zero setup — local voices even work offline, though some browsers (notably Chrome’s Google-hosted voices) need a connection:
function speakText(text) {
  if (!ttsEnabled || !text || text.length < 2) return;

  // Cancel any in-progress speech
  window.speechSynthesis.cancel();

  const utterance = new SpeechSynthesisUtterance(text);
  utterance.rate = 1;
  utterance.pitch = 1;
  utterance.lang = "en-GB";

  // Optionally pick a specific voice. Note: getVoices() can return an
  // empty list until the browser has fired its voiceschanged event.
  const voices = window.speechSynthesis.getVoices();
  const preferred =
    voices.find((v) => v.name.includes("Google UK English Female")) ||
    voices.find((v) => v.lang.startsWith("en"));
  if (preferred) utterance.voice = preferred;

  window.speechSynthesis.speak(utterance);
}
The quality varies by browser and OS — Chrome tends to have the best selection of voices — but it’s a great starting point that doesn’t cost a penny.
Mobile Gotcha: Autoplay Restrictions
If you test on a phone and get silence, you’ve hit the mobile autoplay policy. iOS Safari and Chrome on Android block Audio.play() and speechSynthesis.speak() unless they originate from a direct user tap. The Web Chat middleware fires asynchronously when a bot message arrives — that’s not a tap, so the browser silently blocks it.
The fix is to “unlock” the audio context during a real tap before the first TTS call. In the demo below, I do this when you tap the voice prompt pill and again when you open the chat. The trick is to play a silent WAV through a reusable Audio element and call speechSynthesis.speak(new SpeechSynthesisUtterance('')) — both during the tap handler. After that, the browser allows programmatic playback on those same objects. Re-use the Audio element for subsequent cloud TTS by swapping its src instead of creating a new one.
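A minimal sketch of that unlock step, with the browser objects passed in as parameters so the logic is testable outside a browser. The factory shape and SILENT_WAV placeholder are my own illustration, not the demo's exact code — substitute a real silent-clip data URI:

```javascript
// Placeholder: substitute a real tiny silent WAV data URI here.
const SILENT_WAV = "data:audio/wav;base64,...";

// Factory so the browser globals (a shared Audio element, speechSynthesis,
// and the SpeechSynthesisUtterance constructor) are injected, not referenced directly.
function createAudioUnlocker(audioEl, speech, Utterance) {
  let unlocked = false;
  return function unlockAudio() {
    if (unlocked) return;
    // Both calls must happen inside a real tap handler; afterwards the
    // browser permits programmatic playback on these same objects.
    audioEl.src = SILENT_WAV;
    audioEl.play().catch(() => {}); // ignore rejections, e.g. on desktop
    speech.speak(new Utterance(""));
    unlocked = true;
  };
}

// In the page you'd wire it up roughly like this:
// const sharedAudio = new Audio();
// const unlock = createAudioUnlocker(sharedAudio, window.speechSynthesis, SpeechSynthesisUtterance);
// document.getElementById("voice-pill").addEventListener("click", unlock);
```

Later cloud TTS playback then swaps the `src` on that same shared Audio element instead of constructing a fresh one.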
Option 2: Cloud TTS API
For higher quality, more natural voices, you can call a cloud TTS service instead. The pattern is the same regardless of provider — Azure Speech Services, ElevenLabs, or any other API that returns audio:
const TTS_API_URL = "https://your-tts-proxy.com/api/voice";
let ttsAudio = null; // the currently playing Audio element, if any

async function speakText(text) {
  if (!ttsEnabled || !text || text.length < 2) return;

  // Stop any currently playing audio
  if (ttsAudio) {
    ttsAudio.pause();
    ttsAudio = null;
  }

  try {
    const response = await fetch(TTS_API_URL, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text, voice: "your-chosen-voice" }),
    });
    if (!response.ok) return;

    const data = await response.json();
    if (data.audio) {
      const blob = base64ToBlob(data.audio, "audio/mpeg");
      const url = URL.createObjectURL(blob);
      ttsAudio = new Audio(url);
      ttsAudio.onended = () => URL.revokeObjectURL(url);
      await ttsAudio.play();
    }
  } catch (err) {
    console.warn("TTS failed:", err);
  }
}
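The snippet above calls a base64ToBlob helper that isn't shown. A minimal version looks like this (atob and Blob are standard browser globals, also available in modern Node):

```javascript
// Decode a base64 string into a Blob of the given MIME type,
// so it can be turned into an object URL for the Audio element.
function base64ToBlob(base64, mimeType) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new Blob([bytes], { type: mimeType });
}
```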
CORS note: Most TTS APIs won’t allow direct browser requests. You’ll typically need a lightweight server-side proxy — a Next.js API route, Azure Function, or Cloudflare Worker — to forward the request. This is the same pattern I used in my Once Upon a Prompt project.
Toggle Button
A speaker icon in the chat header lets users toggle voice on or off. When the agent is speaking, the button pulses — click it to mute instantly:
<button id="tts-button" onclick="toggleTts()" aria-label="Toggle Voice">
  <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
    <polygon points="11 5 6 9 2 9 2 15 6 15 11 19 11 5"></polygon>
    <path class="tts-wave1" d="M15.54 8.46a5 5 0 0 1 0 7.07"></path>
    <path class="tts-wave2" d="M19.07 4.93a10 10 0 0 1 0 14.14"></path>
  </svg>
</button>
```
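The toggleTts handler itself isn't shown above; here's a minimal sketch of what it might look like. The pulsing effect would be a CSS class you toggle on the button — the class name below is hypothetical:

```javascript
let ttsEnabled = true; // the flag speakText checks before speaking

function toggleTts() {
  ttsEnabled = !ttsEnabled;
  if (!ttsEnabled && typeof window !== "undefined") {
    // Mute instantly: stop browser speech and any playing cloud audio
    window.speechSynthesis?.cancel();
    if (typeof ttsAudio !== "undefined" && ttsAudio) ttsAudio.pause();
  }
  // In the page you'd also reflect the state on the button, e.g.:
  // document.getElementById("tts-button").classList.toggle("muted", !ttsEnabled);
  return ttsEnabled;
}
```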
Try It Yourself
The agent below is a live Copilot Studio agent with TTS enabled. Click the 🔊 Click to enable voice pill in the bottom-right corner — you’ll hear a welcome message, and then the chat icon will appear. Send a message and hear the response spoken back to you.
Use the 🔊 speaker icon in the chat header to mute/unmute, and the 🔄 toggle to switch between Microsoft AI Voice and the browser’s built-in speech engine.
Wrapping Up
With a handful of JavaScript functions, your Copilot Studio agent can speak every response out loud. Start with the browser’s built-in Web Speech API for zero setup, then upgrade to a cloud provider when you need higher-quality voices.
This is especially useful for accessibility, hands-free scenarios, or just making your embedded agent feel more alive. The pattern is the same regardless of which TTS engine you choose — just swap out the speakText function.
If you’re interested in what’s possible with more advanced voice generation, check out my post on Once Upon a Prompt — AI Bedtime Stories, where I built a multi-voice narrated story generator.
The full source for this demo is embedded in the page — view source or inspect the HTML to see how it all fits together. The key pieces are the store middleware intercept, the speakText router, and the TTS toggle in the chat header.