Coach development and analysis tools - AI-enabled coach observation

There has never been a better time to launch an AI coaching analysis product.

A website takes an afternoon. A dashboard can be assembled from off-the-shelf components. Feed a coaching session transcript into a large language model (LLM) and it will produce a structured, confident-sounding report in seconds. Add a logo and a pricing page and you have something that looks, from the outside, entirely credible.

This is a structural feature of the moment we are in. The barrier to building something that appears sophisticated has disappeared. The barrier to building something reliable has not.

For anyone responsible for coach development, this creates a real risk. Products that look impressive on the surface could be adopted quickly as innovation-minded heads of coach education embrace AI. Without the ability to scrutinise the output of these tools, they may be relying on data that is inaccurate and does more harm than good.

The Chain of Uncertainty

Before you trust any AI coaching tool, it helps to understand what it is actually doing, and how many things can go wrong before a report reaches you.

The first problem is transcription. A coaching session is one of the hardest environments for automated speech recognition due to:

Multiple speakers — co-coaches, players, other sessions taking place
Varied speech — shouting, quiet, breathless, rushed
Background noise — whistles, wind, equipment

The best transcription models in the world were predominantly trained on clean, single-speaker audio: think calls and meetings. So the transcript the AI receives is already an imperfect representation of what happened. Sometimes significantly imperfect.

The second problem is what the LLM sees. Once the LLM receives that transcript, it attempts to classify coaching behaviour. But it does not read the whole session at once. It processes a window, a segment, without reliable access to what came before or after. The LLM will default to ordinary linguistic rules to split up the transcript, but coaching speech is different, so it may miss important context.
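
To make the windowing problem concrete, here is a minimal Python sketch. It assumes a tool splits the transcript into fixed-size windows of utterances. The function name split_into_windows, the window size and the sample transcript are all hypothetical, and real products may segment differently.

```python
def split_into_windows(utterances, window_size):
    """Split a list of utterances into consecutive fixed-size windows."""
    return [
        utterances[i:i + window_size]
        for i in range(0, len(utterances), window_size)
    ]


# Invented transcript, for illustration only.
transcript = [
    "Coach: Watch how Alex opens her body before receiving.",
    "Coach: See the space that creates?",
    "Player: Yeah.",
    "Coach: Do that every time.",
]

# Assumed window size of three utterances.
windows = split_into_windows(transcript, window_size=3)

for number, window in enumerate(windows, start=1):
    print(f"Window {number}: {window}")
```

The second window contains only "Do that every time." Classified in isolation, the model cannot tell what "that" refers to, because the demonstration it points back to sits in the first window.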

Finally, there is the classification itself. Researchers have spent decades debating how to classify coaching behaviour reliably with human coders, and it is still imperfect. The model is doing it at speed, at scale, with incomplete information.

Three compounding layers of uncertainty, before the report is even generated.
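
A rough way to see why the layers compound is to multiply them together. The Python sketch below assumes errors at each stage are independent, which is a simplification, and the per-stage figures are invented for illustration. They are not measurements of any product.

```python
def end_to_end_accuracy(stage_accuracies):
    """Multiply per-stage accuracies, assuming stage errors are independent."""
    result = 1.0
    for accuracy in stage_accuracies:
        result *= accuracy
    return result


# Hypothetical figures, for illustration only.
stages = {
    "transcription": 0.90,
    "segmentation": 0.90,
    "classification": 0.90,
}

overall = end_to_end_accuracy(stages.values())
print(f"End-to-end accuracy: {overall:.3f}")
```

Under these assumptions, three stages that are each right nine times in ten leave a report that is right roughly seven times in ten.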

What Large Language Models are Actually Doing

The field of artificial intelligence has a long history, and a significant part of that history is about imitation.

Alan Turing's famous 1950 paper proposed a test built around a simple idea: if a machine could produce responses indistinguishable from a human in conversation, that was a reasonable measure of intelligence. The Turing Test framed convincingness as the goal and that framing shaped decades of AI development that followed.

Large language models, the technology behind tools like ChatGPT, Claude, and Gemini, are trained on vast quantities of human-generated text. They are optimised, in part, to produce fluent, coherent, human-like responses. These systems are exceptionally good at sounding right.

Modern AI development does include truthfulness as an explicit objective. The hallucination problem is real, however, and every credible AI product carries a disclaimer acknowledging that it can make mistakes.

The report reads with confidence. The confidence is not evidence of accuracy. It is a feature of how the technology produces text. This is worth bearing in mind.

Access to the Source Data to Critically Assess

Can you see what the classification was based on? Is the source data, i.e. the video or audio file, available for you to critically assess the output?

If a tool tells you a coach asked 18 questions in a session, can you click on that number and go to the moments? Can you see the transcript line, hear the audio, watch the video? Can you check whether that was actually a question, whether the model made a plausible guess or just made it up, i.e. hallucinated?

If you cannot, you have no way of knowing whether the output is accurate. You are looking at a guess. A very confident-sounding guess, with no way to verify it.
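
One way to picture a checkable number is a count that is derived from a list of evidence rather than stored on its own. The Python sketch below is an assumption about how such a structure could look, not a description of any particular product. The Classification fields, the moments_for helper and the sample session are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Classification:
    label: str            # behaviour category, e.g. "question"
    transcript_text: str  # the transcript line the label was applied to
    start_seconds: float  # where the moment starts in the recording
    end_seconds: float    # where the moment ends in the recording


def moments_for(classifications, label):
    """Return every classified moment carrying the given label."""
    return [c for c in classifications if c.label == label]


# Invented session data, for illustration only.
session = [
    Classification("question", "What did you notice there?", 62.0, 64.5),
    Classification("instruction", "Press higher up the pitch.", 80.0, 82.0),
    Classification("question", "Where is the space?", 131.0, 133.0),
]

questions = moments_for(session, "question")
print(f"Questions: {len(questions)}")
for q in questions:
    print(f"  {q.start_seconds:>6.1f}s  {q.transcript_text}")
```

Because the headline figure is just the length of the evidence list, every number in the report can be opened up into the transcript lines and timestamps behind it.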

What happens when the tool is wrong? Every AI product makes mistakes. The ones worth trusting give you a way to report those mistakes and act on them. An error flagging mechanism. A feedback loop. A way for the company to know that a specific classification was incorrect and to use that information to improve the output.

If a tool produces a report with no way to flag an error, no edit function and no feedback mechanism, it is operating as a black box. Black boxes do not improve. Without the ability to scrutinise the output against the source data they have no stake in being right.
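
A feedback loop does not need to be elaborate. The Python sketch below shows one hypothetical shape for it: a log that records each correction a reviewer makes and reports how often the model was overruled. The ReviewLog class, its method names and the sample identifiers are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewLog:
    # Maps a moment identifier to (model_label, corrected_label).
    corrections: dict = field(default_factory=dict)

    def flag(self, moment_id, model_label, corrected_label):
        """Record that a reviewer disagreed with the model's label."""
        self.corrections[moment_id] = (model_label, corrected_label)

    def disagreement_rate(self, total_reviewed):
        """Share of reviewed moments where the reviewer overruled the model."""
        if total_reviewed == 0:
            return 0.0
        return len(self.corrections) / total_reviewed


log = ReviewLog()
# Hypothetical example: the reviewer decides a "question" was an instruction.
log.flag("moment-7", model_label="question", corrected_label="instruction")

rate = log.disagreement_rate(total_reviewed=10)
print(f"Reviewer overruled the model on {rate:.0%} of reviewed moments")
```

The corrections themselves are the valuable part: they tell the vendor exactly which classifications were wrong, which is the information needed to improve the output.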

Why this took us years

We would not have released SAM without the ability to scrutinise its output. That is not just a philosophical position; it is a practical one too.

We spent 12 years building a video analysis infrastructure before SAM existed. That infrastructure, i.e. the ability to link any data point back to a specific moment on video, became the backbone of how coaches and coach developers use the platform. It also became the mechanism by which we identify where SAM is wrong and make it better. This is a crucial internal tool for us too.

We are still improving SAM and will continue to do so. The accuracy of the output matters enormously to us. We are not yet where we want to be, and we accept that SAM is imperfect but very, very useful. That is precisely why transparency is non-negotiable. The video is the source of truth, and everything SAM produces needs to be checkable against it.

Before you commit to any AI coaching tool, ask it to show you its working: the source data and the reasoning behind the classification. If it cannot, that is a red flag, so proceed with caution.

AI in Coach Development | Separating Substance from Shine — Part 1 of 3
