Create a React Teleprompter using the Web SpeechRecognition API
The above video is hosted on egghead.io.

In this post we are going to build a teleprompter web application using the Web Speech API. In particular, we'll use the SpeechRecognition interface to build this app. The idea is that we'll be able to recognize the user's voice, match the words to a predefined script, and then automatically scroll to the next unspoken position.

NOTE: The above 🆓 video goes through all the content of this blog post step by step. You can find the code for this project on GitHub and you can play with it on CodeSandbox.

Preview of Teleprompter

The following is a short animated GIF showing the end result of what we will be building using native web technologies and JavaScript libraries.

Basic Application

The Teleprompter component that we'll start with is just a shell of what we will be building. To begin with, we are looping over the words passed to the component and displaying them in <span> elements, but eventually we will want to wire up the SpeechRecognition API and auto-scroll the contents as the user speaks.

import React from 'react';
import styled from 'styled-components';

const StyledTeleprompter = styled.div`
  /* ... */
`;

export default function Teleprompter({
  words,
  progress,
  listening,
  onChange,
}) {
  return (
    <React.Fragment>
      <StyledTeleprompter>
        {words.map((word, i) => (
          <span key={`${word}:${i}`}>{word} </span>
        ))}
      </StyledTeleprompter>
    </React.Fragment>
  );
}
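The words prop is simply the script broken into individual tokens. A parent component might derive it from a raw script string like this (the script text and variable names below are placeholders, not part of the project's actual code):

```javascript
// Hypothetical parent-side setup: split a raw script string into the
// words array that would be passed to <Teleprompter words={words} ... />.
const scriptText = 'The quick brown fox jumps over the lazy dog';

// Trim first, then split on runs of whitespace so we never get empty tokens.
const words = scriptText.trim().split(/\s+/);
// words → ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
```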

Creating a SpeechRecognition Instance

The first thing we'll do is create an instance of SpeechRecognition. To do this, we'll create a recog reference using the React.useRef hook, passing null.

Next, we'll use a React.useEffect that executes after the initial render of the component. In it, we will reference either the standard window.SpeechRecognition constructor or the vendor-prefixed window.webkitSpeechRecognition version, depending on which one exists. We'll create a new instance of it and assign it to the current property of our recog reference.

Setting continuous mode to true allows us to continuously capture results once we have started, instead of just getting one result.

Also, we'll want to set interimResults to true as well. This gives us access to quicker results. However, these results aren't final and may not be as accurate as those received after waiting a bit longer.

const recog = React.useRef(null);

React.useEffect(() => {
  const SpeechRecognition =
    window.SpeechRecognition || window.webkitSpeechRecognition;
  recog.current = new SpeechRecognition();
  recog.current.continuous = true;
  recog.current.interimResults = true;
}, []);

SpeechRecognition Browser Compatibility

Before we get too far along, it's important to know that the SpeechRecognition feature that we are going to use only has minimal browser support at the moment.
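Given the limited support, it's worth feature-detecting before constructing an instance. Here's a small sketch (the helper name and fallback behavior are assumptions, not part of the tutorial code); taking the global object as a parameter keeps the check easy to exercise outside a browser:

```javascript
// Feature-detect SpeechRecognition, including the webkit-prefixed version.
// Accepts the global object as a parameter so the check is testable in Node.
function getSpeechRecognition(
  globalObj = typeof window !== 'undefined' ? window : {},
) {
  return globalObj.SpeechRecognition || globalObj.webkitSpeechRecognition || null;
}

const SpeechRecognition = getSpeechRecognition();
if (!SpeechRecognition) {
  // Not supported: e.g. render a "your browser doesn't support
  // speech recognition" message instead of the teleprompter.
}
```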

Toggling SpeechRecognition to Start and Stop

Now we'll add another React.useEffect hook so that we can toggle whether or not our Speech Recognition system should start or stop listening. So, if we are listening, we'll tell our recog ref to start; otherwise, we will stop the recognition instance.

React.useEffect(() => {
  if (listening) {
    recog.current.start();
  } else {
    recog.current.stop();
  }
}, [listening]);

Adding and Removing Event Listeners

At this point, starting or stopping does nothing yet, so let's wire that up. We'll have another React.useEffect hook, and this one will grab our recog reference and add an event listener, listen to the result event, and handle that with the handleResult callback, which we haven't defined yet, but we will very soon.

Also, we'll want to clean up after ourselves, so we'll return a function that will removeEventListener for the result event bound to the handleResult function.

React.useEffect(() => {
  const handleResult = () => {
    /* ... more code later ... */
  };
  recog.current.addEventListener('result', handleResult);
  return () => {
    recog.current.removeEventListener('result', handleResult);
  };
}, [onChange, progress, words]);

Handling the Recognition Results

Now, let's define the handleResult function that we've wired up to the recog reference. In it we will destructure the results portion of the argument passed to us. We'll create an interim variable and take the SpeechRecognitionResultList returned, convert it into an array using Array.from(), limit the results to only those that are not final using Array.prototype.filter, grab the first transcript from each of those using Array.prototype.map, and finally join all of those together into one big string using Array.prototype.join.

In order to leverage the results, let's create some new state using the React.useState hook and save off the results calling setResults.

+ const [results, setResults] = React.useState('');

React.useEffect(() => {
+  const handleResult = ({ results }) => {
+    const interim = Array.from(results)
+      .filter((r) => !r.isFinal)
+      .map((r) => r[0].transcript)
+      .join(' ');
+
+    setResults(interim);

    /* ... more code later ... */
  };

  recog.current.addEventListener('result', handleResult);
  return () => {
    recog.current.removeEventListener('result', handleResult);
  };
}, [onChange, progress, words]);
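The transform inside handleResult is just array operations on the result list, so it can be exercised in isolation against a hand-built mock (the mock shape below is an assumption that mirrors the real SpeechRecognitionResultList, where each result is array-like and index 0 holds the best alternative):

```javascript
// Sketch: the interim-extraction logic from handleResult as a pure function.
const extractInterim = (results) =>
  Array.from(results)
    .filter((r) => !r.isFinal)
    .map((r) => r[0].transcript)
    .join(' ');

// Mocked result list: each entry is an array of alternatives
// with an isFinal flag attached, like the real API provides.
const mockResults = [
  Object.assign([{ transcript: 'hello world' }], { isFinal: true }),
  Object.assign([{ transcript: 'this is' }], { isFinal: false }),
  Object.assign([{ transcript: 'a test' }], { isFinal: false }),
];

extractInterim(mockResults); // → 'this is a test'
```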

To see the interim results that we've captured in state, let's conditionally show them as a sibling to <StyledTeleprompter> if they exist.

return (
  <React.Fragment>
    <StyledTeleprompter>
      {words.map((word, i) => (
        <span key={`${word}:${i}`}>{word} </span>
      ))}
    </StyledTeleprompter>
+    {results && <Interim>{results}</Interim>}
  </React.Fragment>
);

Scrolling based on Progress Value

Now, let's focus on scrolling. As setup, let's first create a scrollRef variable using the React.useRef hook, setting it to null, and assign it to our <StyledTeleprompter> component. To our <span> we'll add an HTML5 data attribute of data-index and assign it the index of the word.

This technically isn't necessary for the scrolling, but let's add a color style to indicate if the word has already been spoken or not. If the word index is less than the progress (meaning it has already been said), then it'll look gray, otherwise it'll look black.

+ const scrollRef = React.useRef(null);

return (
  <React.Fragment>
-    <StyledTeleprompter>
+    <StyledTeleprompter ref={scrollRef}>
      {words.map((word, i) => (
        <span
          key={`${word}:${i}`}
+          data-index={i}
+          style={{
+            color: i < progress ? '#ccc' : '#000',
+          }}
        >
          {word}{' '}
        </span>
      ))}
    </StyledTeleprompter>
    {results && <Interim>{results}</Interim>}
  </React.Fragment>
);

In order to actually scroll the teleprompter, we'll add another React.useEffect hook. We'll want this one to be invoked whenever the progress prop changes. We'll grab the scrollRef's current value and querySelector for the data-index that is 3 words past the current progress. That's to hopefully scroll before we run out of words that are in view.

Here we'll use the optional chaining operator in case nothing was found; if an element was found, we'll use the Element.scrollIntoView() method, passing behavior smooth, block nearest, and inline start.

React.useEffect(() => {
  scrollRef.current
    .querySelector(`[data-index='${progress + 3}']`)
    ?.scrollIntoView({
      behavior: 'smooth',
      block: 'nearest',
      inline: 'start',
    });
}, [progress]);

Element.scrollIntoView Browser Compatibility

Support for Element.scrollIntoView() is surprisingly good, which is great because it's such a handy feature.
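One caveat is the options object: some older browsers implement scrollIntoView() itself but ignore behavior: 'smooth'. If that matters for your audience, a common detection sketch checks whether the CSS scroll-behavior property exists on the root element's style (the function name here is an assumption; the document is a parameter so the check runs outside a browser):

```javascript
// Detect smooth-scrolling support by checking for the CSS
// scroll-behavior property on the root element's style object.
function supportsSmoothScroll(
  doc = typeof document !== 'undefined' ? document : null,
) {
  return !!doc && 'scrollBehavior' in doc.documentElement.style;
}
```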

Updating to the Next Progress Value

The trickiest part of the app is trying to figure out where the current progress should be. To do this, we'll introduce a newIndex variable, break up the interim string back into an array, and compare each word with the next expected unspoken word in the teleprompter script.

To make the comparison easier we'll use two techniques. First, we'll create a cleanWord function to trim whitespace, lowercase the string, and replace any non-alpha characters with an empty string. Next, we'll leverage the string-similarity library from npm.

import stringSimilarity from 'string-similarity';

const cleanWord = (word) =>
  word
    .trim()
    .toLocaleLowerCase()
    .replace(/[^a-z]/gi, '');

If the similarity between our words is greater than 75%, then we'll increment our index by one; otherwise we'll keep it the same. Then, if our newIndex is greater than it was previously and is less than or equal to the total number of words, we'll let our consuming component know that something has changed.

React.useEffect(() => {
  const handleResult = ({ results }) => {
    const interim = Array.from(results)
      .filter((r) => !r.isFinal)
      .map((r) => r[0].transcript)
      .join(' ');
    setResults(interim);

+    const newIndex = interim.split(' ').reduce((memo, word) => {
+      if (memo >= words.length) {
+        return memo;
+      }
+      const similarity = stringSimilarity.compareTwoStrings(
+        cleanWord(word),
+        cleanWord(words[memo]),
+      );
+      memo += similarity > 0.75 ? 1 : 0;
+      return memo;
+    }, progress);
+    if (newIndex > progress && newIndex <= words.length) {
+      onChange(newIndex);
+    }
  };

  recog.current.addEventListener('result', handleResult);
  return () => {
    recog.current.removeEventListener('result', handleResult);
  };
}, [onChange, progress, words]);
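Because the index-advance logic is a pure reduce, it's easy to pull out and sanity-check on its own. Here's a sketch where the similarity function is injected as a parameter (in the component it would be stringSimilarity.compareTwoStrings); the exact-match comparator below is only for illustration, not what the app ships with:

```javascript
// cleanWord, repeated here so the snippet runs standalone.
const cleanWord = (word) =>
  word
    .trim()
    .toLocaleLowerCase()
    .replace(/[^a-z]/gi, '');

// Sketch: the progress-advancing reduce as a standalone function.
// `compare` stands in for stringSimilarity.compareTwoStrings.
function computeNewIndex(interim, words, progress, compare) {
  return interim.split(' ').reduce((memo, word) => {
    if (memo >= words.length) {
      return memo;
    }
    const similarity = compare(cleanWord(word), cleanWord(words[memo]));
    return memo + (similarity > 0.75 ? 1 : 0);
  }, progress);
}

// Toy comparator: 1 when the cleaned words match exactly, 0 otherwise.
const exactMatch = (a, b) => (a === b ? 1 : 0);

computeNewIndex('the quick brown', ['The', 'quick,', 'brown', 'fox'], 0, exactMatch); // → 3
```

Note how cleanWord makes 'the' match 'The' and 'quick' match 'quick,' even with the strict comparator; the fuzzy similarity check just loosens this further for imperfect transcriptions.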

Conclusion

It's pretty amazing some of the features that are available in many browsers. Although SpeechRecognition isn't everywhere yet, it's a pretty powerful feature and was definitely fun to play with. I hope you enjoy using it as well and find fun and unique ways to leverage the feature.

NOTE: This is the beginning of an egghead playlist that I plan to grow with additional refactors and new features.