Jul 18, 2019
image - VoiceToText

Voice to persisted actionable reference –  Eidetic Memory https://en.wikipedia.org/wiki/Eidetic_memory

This is a generic, long-standing human challenge of communication and knowledge transfer across the roles we engage in. When it comes to extracting actions and knowledge accurately and efficiently and then referencing them later, we quickly find the flaws in our tools for persisting language. Whether you are a parent creating a chore list, a leader establishing laws, a professional or scientist tracking efforts, observations, and results, a developer creating code and documents, or someone just keeping memories in a journal, we have all experienced this challenge. Humans started with cave painting techniques and moved on to wooden stylus and clay, stone axe and bone, chisel and stone, smoke signals, ink brush and paper, and now typing on devices, yet we still lose context and fidelity with each of these tools, even though they have become faster and cost less time and energy to use. No matter the tool employed to date, the process still distracts us from the nature and purpose of the human interaction, as most of us have neither eidetic memory and the recall associated with that talent nor the ability to stay in the moment while transcribing via writing or texting. I am still appreciative of all the efforts before us to share knowledge, and of the oral tradition of storytelling from a time when we didn’t even have tools to leverage, let alone the symbology of language and emojis we take for granted today.

Compact Tape Recorder built into your phone aka mobile computing device https://en.wikipedia.org/wiki/Cassette_tape

The tape recorder has long been the modern device of choice for interviewers such as journalists, for medical professionals, and for professionals tracking their time. Diplomats and their staff are also very familiar with devices and services for translation and transcription, as their meetings are generally critical for the many people they serve and represent. As accuracy and speed to results become key for diplomats and for live sporting events that use closed captioning and translation for their audiences, technology is racing to assist us, pushing accuracy toward 99% and cutting the time to results from seconds to a near-human real time of tenths of a second.

Natural Language Processing (NLP) https://en.wikipedia.org/wiki/Natural_language_processing

For those new to Natural Language Processing, it is an active and maturing discipline of Computer Science and a subfield of Artificial Intelligence (AI), developed and improved via machine learning on massive sets of data created from voice conversations. Many long-time Sci-Fi fans will identify with Star Trek and the ability of crew members to interact with the ship’s computer by voice for commands, queries, broadcasts and crew member logs, whether official or personal.

Voice User Interface (VUI) https://en.wikipedia.org/wiki/Voice_user_interface

The age of Voice User Interfaces (VUI) is upon us, whether that is Apple Siri, Amazon Alexa, Google Assistant, Microsoft Cortana, Salesforce Einstein, IBM Watson, Nuance Dragon NaturallySpeaking or your own Virtual Personal Assistant (VPA) via dictation. All of the major cloud providers and device and application manufacturers have implemented related features and APIs that can be used to assist. Many have seen the real-time dictation mic built into most mobile device keyboards. Some may be aware of the awesome dictation features under accessibility settings, along with the fun of autocorrect if enabled.

Speech Recognition – Silently persisted to personal device vs. Connected Realtime autocorrected text only

Regardless of which voice user interface tool you use, they are currently distracting and counterproductive to active listening in the context of human interaction in today’s meetings, lecture halls and 1:1 meetings. This holds even if you can personally get past the awkwardness most people feel when breaching the everyday social contracts around these events by talking to a third party, even if that party is a device or service. Many individuals have a general “creepy” feeling and distrust associated with “devices that are actively listening” to assist us, and that is completely normal. This is especially poignant when the interaction is performance-related or personal.

Natural ease capturing voice now and extremely low effort tools to get quality results

Most English speakers can convey around 120 words per minute, or a large article of 5500+ words in an hour. This is far faster than most people can type with ten fingers or with two thumbs on modern devices, and we won’t even compare analog tools like pen and paper. An important benefit is the increased Active Listening and the trust it promotes between the parties when the conversation is correctly translated into positive, referenceable results.

Per Wikipedia – Active listening is a technique that is used in counseling, training, and solving disputes or conflicts. It requires that the listener fully concentrate, understand, respond and then remember what is being said. https://en.wikipedia.org/wiki/Active_listening
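
As a quick check on the speaking rate quoted above, here is a minimal sketch in Python. The function name is hypothetical and used only for illustration; the only inputs are the 120 words per minute and 5500 word figures already mentioned in this article.

```python
def minutes_to_speak(word_count: int, words_per_minute: float = 120.0) -> float:
    """Return how many minutes it takes to say `word_count` words aloud."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return word_count / words_per_minute


# A 5500-word article at the article's rate of 120 words per minute.
print(f"{minutes_to_speak(5500):.1f} minutes")  # prints "45.8 minutes"
```

Passing your own typing or texting rate as words_per_minute gives a like-for-like comparison for the same word count.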

Speed to results

One of the best reasons for using dictation and natural language processing is the speed to results, which is generally rated at less than a second per word and closer to the blink of a human eye. The average human eye blink takes 300 to 400 milliseconds, or 3/10ths to 4/10ths of a second, which is far faster than you can write, type or text.

Cost of results

Another reason for the current trend toward automated speech to text is cost. Results average about $1.36 per hour of audio with the Google Cloud Speech to Text API, which works out to about $0.00025 per word based on an average of 5500 words per hour of voice memo transcribed. As of today we have paid nothing, as Google, Microsoft and Amazon all offer free trial accounts and periods as well as “Always Free” levels, so you may never have to pay a cent for your usage.

Google Cloud – https://cloud.google.com/free/

Microsoft Azure – https://azure.microsoft.com/en-us/free/

Amazon AWS – https://aws.amazon.com/free/start-your-free-trial/
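
The per-word cost quoted above can be checked with a few lines of Python. This is only a sketch: the function name is hypothetical, and it assumes the $1.36 figure is the cost of one hour of transcribed audio, which is how the 5500 words per hour average suggests it should be read.

```python
def cost_per_word(cost_per_hour: float, words_per_hour: int) -> float:
    """Return the average transcription cost of a single word."""
    if words_per_hour <= 0:
        raise ValueError("words_per_hour must be positive")
    return cost_per_hour / words_per_hour


# Figures from this article: about $1.36 per hour (assumed unit), 5500 words per hour.
per_word = cost_per_word(1.36, 5500)
print(f"${per_word:.6f} per word")  # prints "$0.000247 per word"
```

Swap in the current list price from your provider to see how the per-word cost changes once you are past any free usage.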

Accuracy generally greater than 90% on default AI model settings

As the services compete with one another, accuracy continues to climb toward the 99.9% mark from the 90% that even the most basic APIs provide, whether that is built-in dictation or near-real-time dictation via a mobile phone VPA like Siri.

Reviewable vs. Real-time dictation for subject matter jargon or speaker dialects

A major reason for persisting your voice via a voice memo is the ability to review it against the resulting text, to move forward and backward at faster speeds or jump to specific times, and to share both the audio and the resulting transcript.

Device omnidirectional mics are amazing at recording meetings, lectures and 1:1 meetings

The iPhone currently has three (3) omnidirectional microphones built in, and almost every modern phone has at least two microphones. The two-mic setup is used for noise cancellation during calls and for recording stereo sound while shooting video. And of course microphones, like headphones, are an audiophile’s vanity spend, so if you don’t like the one that came with your phone you can take it to whatever level suits your taste and usage needs.

Recording tools and applications

QuickTime and tools like Zoom Meetings are also fantastic for recording impromptu phone calls or meetings you don’t host. Zoom Meetings always allows the host to record the meeting and by default creates an audio_only file as part of the process. There are even automated plugin services like Ava that provide automated transcription when you invite the service to the meeting. If you have been using Google Voice, you might be familiar with its voicemail-to-text feature, which has been available for over a decade as of this writing.

Transcribe Voice Memo to Text https://github.com/dwgrigsby/TranscribeVoiceFile2Text

Automated results captured from human communication, resulting in actions and learning. Automated transcription of voice memos to text notes via natural language processing increases active listening and participation, while the original audio and/or video is retained for reference alongside time-indexed references in the text.

Grigsby Consulting LLC, Anna and I are ecstatic to announce the first open-source release of the internal solution we created and use daily. We alluded to this in our June 21st, 2019 article “Google Cloud Speech to Text API Announcement and Example aka Voice Transcription” at https://blog.grigsbyconsultingllc.com/google-cloud-speach-to-text-api-announcement-and-example/

Below is an excerpt from the above GitHub Project link. 

Working: at present Release v1.00

Via Siri or manually, open the voice memo app on Mac or iPhone and record a voice memo, save it with a descriptive name if desired, and share it to your Google Drive. In a terminal, run the Bash shell script TranscribeVoiceMemo2Text.sh to transcribe the voice memo file on Google Drive (visible in Finder via Google Drive sync or stream). When it completes, it moves the voice memo to a processed folder, along with a transcript .txt file that has the same name as the voice memo.
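
To make the file handling in that workflow concrete, here is a minimal sketch in Python. It is not the project's TranscribeVoiceMemo2Text.sh script. The function name, the folder arguments and the default "*.m4a" pattern are assumptions made for illustration, and the speech-to-text step is passed in as a placeholder callable rather than tied to any real API.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def process_voice_memos(
    inbox: Path,
    processed: Path,
    transcribe: Callable[[Path], str],
    pattern: str = "*.m4a",  # assumed voice memo extension
) -> List[Path]:
    """Transcribe each memo in `inbox`, then move it into `processed`
    next to a .txt transcript that shares the memo's base name."""
    processed.mkdir(parents=True, exist_ok=True)
    transcripts = []
    for memo in sorted(inbox.glob(pattern)):
        text = transcribe(memo)  # placeholder for a real speech-to-text call
        transcript = processed / (memo.stem + ".txt")
        transcript.write_text(text, encoding="utf-8")
        # Move the audio only after its transcript has been written.
        shutil.move(str(memo), str(processed / memo.name))
        transcripts.append(transcript)
    return transcripts
```

Calling it with the synced Google Drive folder as inbox and a subfolder as processed mirrors the flow described in the excerpt. Because the transcript is written before the memo is moved, a failed transcription leaves the memo in place for the next run.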

Give it a try if you meet the current requirements (Mac and iPhone for release 1.00). If you are a part of an organization that has Bring Your Own Device (BYOD) Corporate Security and/or Compliance standards, be sure to check the impact of using this software for the purposes you intend. We also want you to know that we are hard at work on the Windows, Linux and Chromium versions, along with support for the corresponding mobile phone pairings. We have used this software and process in many situations, from phone calls and conference calls to meetup lectures recorded from the back of the room, and we have this example at https://github.com/dwgrigsby/IndyNETConPostman201907.

Also, if you want to contribute to an open-source project like this through code, testing or documentation, or by buying us coffee and/or giving us your feedback, we are happy for any support and feedback offered.