I Used DeepSeek R1 + Open Data to Challenge AI ROI

Three conversations are happening in organisations right now.
'We deployed AI right before the rebound, and the numbers recovered immediately.'
'Our AI rollout worked. Numbers are up, team is faster.'
'We adopted AI because everyone else did. Eighteen months later, it hasn't moved the bottom line.'
The uncomfortable question that runs through all three: would those results have happened anyway?
The concept that explains why that question is so hard to answer is 150 years old, and it has never been more relevant. It is called regression to the mean.
I used DeepSeek R1 on AWS Bedrock to test it against real data. Two lenses. Two datasets. One conclusion the board deck is missing.
TL;DR: Regression to the mean explains why AI investments appear to work even when they haven't moved anything. SEC data shows 0% natural recovery in 27 trough companies. Stack Overflow and METR data show a 39-point gap between perceived and measured productivity. Three questions to ask before your next board deck are at the bottom.
Regression to the Mean: The Concept Nobody Applies to AI
Regression to the mean (RTM) describes a simple statistical reality: extreme outcomes drift back toward the average on their own. Not because anything changed, but because extreme performance contains more noise than signal, and noise does not persist.
Take every footballer rated in the top 10% one season. The following season, most score lower, not because they declined, but because extreme performance is partly skill and partly circumstance. Circumstance does not persist. The lowest-rated players improve for the same reason. Neither group changed fundamentally. The distribution pulled them back toward the centre.
Once seen in sport, it is impossible to unsee in a quarterly revenue report. The question isn't whether the metric recovered. It's whether you set a baseline before the trough. In AI investment, RTM is the most systematically ignored concept in the room.
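The footballer example can be reproduced with a few lines of simulation. This is a minimal sketch on synthetic data, assuming each rating is a stable skill plus independent seasonal luck; the function name, the 50/10 scale, and the player count are illustrative choices, not figures from any real league.

```python
import random
import statistics


def simulate_two_seasons(n_players=10_000, seed=42):
    """Rating = stable skill + independent seasonal luck (all values synthetic)."""
    rng = random.Random(seed)
    skill = [rng.gauss(50, 10) for _ in range(n_players)]
    # The same skill carries over; the luck is redrawn every season.
    season1 = [s + rng.gauss(0, 10) for s in skill]
    season2 = [s + rng.gauss(0, 10) for s in skill]

    # Rank players on season one only, then follow both tails into season two.
    order = sorted(range(n_players), key=lambda i: season1[i])
    decile = n_players // 10
    groups = {"bottom": order[:decile], "top": order[-decile:]}

    return {
        name: (
            statistics.fmean([season1[i] for i in idx]),
            statistics.fmean([season2[i] for i in idx]),
        )
        for name, idx in groups.items()
    }


averages = simulate_two_seasons()
for name, (first, second) in averages.items():
    print(f"{name} 10%: season 1 = {first:.1f}, season 2 = {second:.1f}")
```

Nobody's skill changes between the two seasons in this model, yet the top group's average falls and the bottom group's average rises. Any intervention applied to the bottom group after season one would appear to have worked.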
The AI gets the credit. The measurement goes unquestioned. The baseline is never set.
RTM takes two forms in AI measurement: one statistical, one cognitive. Both produce the same wrong conclusion.
Lens A (SEC Revenue Data): AI ROI Without a Baseline
Data: SEC EDGAR Financial Statement Data Sets 2024 Q4 · SIC 7372
Software companies that hit a revenue trough recovered, with and without an AI initiative. Among those that invested, some genuinely outperformed. Others recovered at the same rate as those that did nothing. The aggregate signal does not separate them cleanly.
The FOMO dynamic made this worse. Many organisations did not invest because of operational need. They invested because a competitor mentioned AI in an earnings call. Because the strategy deck required it. Because 'AI first' became the default answer before the question was asked.
That is a fundamentally different trigger, and it produces a different measurement problem.
When the decision to invest is caused by an extreme trough, statistical reversion is already loaded. The AI initiative and the natural recovery happen simultaneously. Attributing the recovery to the initiative is not dishonest. It is just unmeasured.
The companies that cannot answer 'what would have happened without it?' have not measured their AI. They have measured their quarter.
Figure: Top 20 most volatile companies, Q1 vs Q2 revenue (USD M, log scale). Green: recovered. Red: declined. SEC EDGAR, SIC 7372, Q3 2023 to Q4 2024.
DeepSeek R1 Findings: SEC Revenue RTM Analysis
0% recovery in 27 large-drop companies contradicts RTM expectations. Mean QoQ (+131.1%) is 34× the median (+3.8%): a handful of outliers is distorting the aggregate signal.
Technical finding (DeepSeek R1)
While 55% of companies showed sequential revenue growth, the complete absence of recovery among 27 largest decliners contradicts classical RTM patterns. Extreme positive skew (median 3.8% vs mean 131.1%) suggests concentrated outliers rather than systemic improvement. This distribution questions blanket AI impact claims: most recoveries cluster in already-stable performers.
Executive summary (DeepSeek R1)
Not one company with severe revenue drops recovered naturally. The staggering 131% average growth figure hides that 3/4 of companies saw less than 4% improvement; the true drivers were pre-existing strong performers, not AI interventions.
💡 So what? Before the next board deck: what was the 12-month rolling average before the trough that triggered this investment? That is your baseline. The quarter you invested from is not. Demand counterfactual modelling before renewing any AI contract.
Lens B: The 39-Point AI Productivity Perception Gap
Data: Stack Overflow Developer Survey 2024 (n=33,239) · METR 2025 controlled study
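To make the baseline question concrete, here is a small sketch that compares the same "recovery" against two reference points: the trough itself and the 12-month average before it. The monthly series is invented for illustration, and `baseline_comparison` is a hypothetical helper, not part of the pipeline described in this post.

```python
import statistics


def baseline_comparison(monthly_revenue, trough_index, window=12):
    """Measure the latest value against the trough and against the pre-trough average."""
    if trough_index < window:
        raise ValueError("need at least `window` months of history before the trough")
    # Baseline: the rolling average of the months immediately before the trough.
    baseline = statistics.fmean(monthly_revenue[trough_index - window:trough_index])
    trough = monthly_revenue[trough_index]
    latest = monthly_revenue[-1]
    return {
        "baseline": baseline,
        "lift_vs_trough_pct": (latest / trough - 1) * 100,
        "lift_vs_baseline_pct": (latest / baseline - 1) * 100,
    }


# Made-up series: twelve steady months, a dip, then a return to normal.
revenue = [100.0] * 12 + [70.0, 85.0, 100.0]
report = baseline_comparison(revenue, trough_index=12)
print(report)
```

Measured from the trough, this series shows a lift of roughly 43%. Measured from the pre-trough average, the lift is zero. Both numbers describe the same three months.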
Developers using AI tools took 19% longer on real tasks while believing they were 20% faster.
That is a 39-percentage-point gap between perception and measured reality. It holds across every experience level: junior developers, senior developers, architects. The number does not move.
Figure: AI sentiment (normalised) by experience, SO 2024, n=33,239. Orange dashed: +20% perceived (METR 2025). Red dotted: −19% measured (METR 2025).
Why This Is RTM's Cognitive Cousin, Not Classical RTM
Classical RTM requires the same subjects measured twice. The SO 2024 survey and METR 2025 are different populations, measured once each. Strict RTM does not apply.
This is instead RTM's cognitive cousin, and it produces the same wrong conclusion by a different route.
Regression to the mean describes what happens to numbers measured from an extreme. The 39-point perception gap describes what happens to judgement formed at an extreme moment: adoption. Developers anchor their baseline at maximum enthusiasm. Every measurement taken from that anchor inflates the perceived gain. Different mechanism, identical error.
DeepSeek R1 Findings: Perception vs Measured Productivity
R1 received adoption rates, trust scores, and sentiment distributions. The chart below is what it returned.
Figure: Perceived productivity gain +20% vs measured change −19%. The shaded gap is the 39-point perception-reality chasm.
Technical finding (DeepSeek R1)
The 39-point METR gap aligns with predictable variance patterns: senior developers show both highest volatility (σ=0.91) and largest sentiment/reality disconnect. While AI adoption correlates with a 0.49-point sentiment lift, the −19 actual METR reveals systemic attribution errors. Experience cohorts' inverse trust/performance relationship suggests senior teams over-attribute natural workflow variance to tools.
Executive summary (DeepSeek R1)
Engineering leaders report 20% productivity gains from AI tools, but objective metrics show 19% decline, a 39-point reality gap. This mirrors the 13.4% sentiment drop from 2023's AI hype cycle, showing how teams misattribute normal performance variance to technology interventions.
💡 So what? The 39-point gap is the number to take into your next vendor conversation. Ask for task-completion data from a controlled study, not survey sentiment. If they cannot produce it, you are buying confidence, not capability.
What This Means for Your Measurement
If your organisation is in boardroom conversation two ('our AI rollout worked, the numbers are up'), this lens applies directly. The question is not whether the numbers went up. They probably did. The question is whether the team's enthusiasm at the moment of adoption has been baked into the baseline.
Most AI productivity measurements are taken by people who chose to believe in the tool, at the moment they chose to believe in it. That is the least reliable baseline possible. The 39-point gap is what happens when you measure from there and call it evidence.
The business reality today: (1) AI is now a board mandate, not an operational choice. (2) Adoption happens regardless of individual conviction, which makes the attribution problem worse, not better. (3) FOMO-driven adoption inflates the enthusiast pool, inflates the baseline sentiment, and inflates the reported gains. The underlying performance data, as METR measured it, points the other way.
Solution Design: DeepSeek R1 on AWS Bedrock
Same pipeline for both lenses: Python computes stats → DeepSeek R1 reasons over them → structured JSON drives charts and written findings.
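The shape of that pipeline can be sketched in a few functions. This is an illustration under assumptions, not the code behind this post: `build_prompt`, `run_lens`, the instruction wording, the stats key names, and the context values are all hypothetical, and the model call is injected as a plain callable so the sketch runs without AWS credentials. In a real run, that callable would wrap a Bedrock runtime request.

```python
import json
from typing import Callable


def build_prompt(context: dict, stats: dict) -> str:
    """Context + data + one instruction; only `context` changes per lens."""
    return (
        f"Topic: {context['topic']}\n"
        f"Audience: {context['audience']}\n"
        f"Purpose: {context['purpose']}\n\n"
        f"Stats:\n{json.dumps(stats, indent=2)}\n\n"
        # Hypothetical instruction wording, not the original prompt.
        "Assess whether regression to the mean explains these figures. "
        "Reply with a single JSON object."
    )


def run_lens(context: dict, stats: dict, invoke_model: Callable[[str], str]) -> dict:
    """Send the prompt through any text-in, text-out model call and parse the JSON reply."""
    return json.loads(invoke_model(build_prompt(context, stats)))


def fake_model(prompt: str) -> str:
    """Stand-in for the real model call; returns a canned reply."""
    return json.dumps({"rtm_present": False, "gap_value": 127.3})


lens_a_context = {
    "topic": "SEC revenue, SIC 7372",
    "audience": "board",
    "purpose": "challenge AI ROI attribution",
}
lens_a_stats = {
    "qoq_recovery_rate": 55.1,
    "large_drop_recovery_rate": 0.0,
    "mean_qoq": 131.1,
    "median_qoq": 3.8,
}
findings = run_lens(lens_a_context, lens_a_stats, fake_model)
print(findings)
```

Keeping the model call behind a callable means the statistics and the prompt can be tested offline, and the model can be swapped without touching either.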
One Prompt, Two Lenses
One DeepSeek R1 prompt handles both datasets. Only the context block changes per lens: topic, audience, purpose. The model receives the full stats dictionary and returns structured JSON. Here is the actual Lens A output:
{
"rtm_present": false,
"mechanism": "Extreme Value Distortion",
"key_finding": "0% recovery in large-drop companies contradicts RTM despite 55.1% overall recovery.",
"surprise": "Mean QoQ of 131.1% vs median 3.8% reveals extreme outlier-driven skew.",
"chart_points": [
{"label": "Median QoQ growth", "value": 3.8, "annotation": "Typical company performance"},
{"label": "Large-drop recovery rate", "value": 0.0, "annotation": "Zero bounce-back despite RTM expectations"},
{"label": "Mean QoQ growth", "value": 131.1, "annotation": "Skewed by extreme outliers"}
],
"gap_label": "Median-Mean Disparity",
"gap_value": 127.3,
"highlight_note": "29 companies drove 98% of growth",
"narration": "While 55% of companies showed sequential revenue growth, the complete absence of recovery among 27 largest decliners contradicts classical RTM. Extreme positive skew (median 3.8% vs mean 131.1%) suggests concentrated outliers rather than systemic improvement.",
"executive_summary": "Not one company with severe revenue drops recovered naturally. The 131% average hides that 3/4 of companies saw less than 4% improvement; true drivers were pre-existing strong performers."
}
💡 Chart axes, annotations, and shaded gap regions are all derived from this output. Change the dataset, the chart changes. Change the audience context, the written outputs change.
R1's Chain-of-Thought (Excerpt, Lens A)
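Because charts are drawn straight from the model's JSON, it is worth checking that output before plotting. The sketch below is one possible guard, assuming the Lens A field names shown above; `validate_findings`, the required-key set, and the tolerance are hypothetical choices, and the input is a trimmed copy of the Lens A output.

```python
import json

REQUIRED_KEYS = {"rtm_present", "key_finding", "chart_points", "gap_label", "gap_value"}


def validate_findings(raw: str, tolerance: float = 0.05) -> dict:
    """Parse the model's JSON and check that gap_value matches its own chart points."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    points = {p["label"]: p["value"] for p in data["chart_points"]}
    # Lens A specific: the gap should equal mean minus median.
    expected_gap = points["Mean QoQ growth"] - points["Median QoQ growth"]
    if abs(expected_gap - data["gap_value"]) > tolerance:
        raise ValueError(f"gap_value {data['gap_value']} != {expected_gap:.1f}")
    return points


raw_output = """
{
  "rtm_present": false,
  "key_finding": "0% recovery in large-drop companies contradicts RTM despite 55.1% overall recovery.",
  "chart_points": [
    {"label": "Median QoQ growth", "value": 3.8},
    {"label": "Large-drop recovery rate", "value": 0.0},
    {"label": "Mean QoQ growth", "value": 131.1}
  ],
  "gap_label": "Median-Mean Disparity",
  "gap_value": 127.3
}
"""
points = validate_findings(raw_output)
print(points)
```

A check like this catches the case where the model reports a gap that its own chart points do not support, before that number reaches a slide.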
Before producing JSON, R1 reasons through the data:
The overall recovery rate is 55.1% β slightly above half. RTM would predict that
companies performing worse than average would improve. That's consistent with
the 55% figure. But the mean QoQ is 131.1%, way higher than the median of 3.8%.
A highly skewed distribution. A few companies had massive increases pulling the
mean up. That skewness might indicate outliers influencing the average.
The large drop recovery rate being 0% complicates things. If those with the
biggest drops didn't recover, maybe RTM isn't the main factor for them. Their
issues might be structural, not random fluctuations that naturally correct.
RTM might be present in the general recovery (55%), but the 0% recovery in
large drops is unexpected. The mechanism: cognitive bias where companies
attribute the high mean (from a few strong recoveries) to AI, ignoring the
complete lack of recovery in the worst performers.
Three numbers fed the model: QoQ recovery rate (55.1%), large-drop recovery rate (0.0%), mean-median gap (131.1 − 3.8 = 127.3 points).
What Failed First: Model Selection and Prompt Design
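For readers who want to compute the same three inputs on their own data, here is a rough sketch. The definitions are my assumptions for illustration, since the exact thresholds are not spelled out above: a "large drop" is taken as a prior-quarter change of −30% or worse, and "recovered" as positive growth in the following quarter. The sample rows are invented.

```python
import statistics


def rtm_summary(companies, large_drop_threshold=-30.0):
    """Recovery rate, large-drop recovery rate, and mean-median gap (assumed definitions)."""
    qoq = [c["qoq_pct"] for c in companies]
    # Assumption: a large drop is a prior-quarter change at or below the threshold.
    large_drops = [c for c in companies if c["prior_change_pct"] <= large_drop_threshold]
    recovered = [c for c in large_drops if c["qoq_pct"] > 0]
    mean_qoq = statistics.fmean(qoq)
    median_qoq = statistics.median(qoq)
    return {
        "qoq_recovery_rate": 100 * sum(g > 0 for g in qoq) / len(qoq),
        "large_drop_recovery_rate": (
            100 * len(recovered) / len(large_drops) if large_drops else None
        ),
        "mean_qoq": mean_qoq,
        "median_qoq": median_qoq,
        "mean_median_gap": mean_qoq - median_qoq,
    }


# Invented rows: one deep decliner, two steady companies, one extreme outlier.
sample = [
    {"prior_change_pct": -45.0, "qoq_pct": -5.0},
    {"prior_change_pct": -2.0, "qoq_pct": 3.0},
    {"prior_change_pct": 1.0, "qoq_pct": 4.0},
    {"prior_change_pct": 5.0, "qoq_pct": 500.0},
]
summary = rtm_summary(sample)
print(summary)
```

Even in four made-up rows the pattern from the SEC data appears: one outlier drags the mean far above the median while the deep decliner does not bounce back.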
| Decision | What Failed | What Worked |
|---|---|---|
| Model choice | Mistral 7B hallucinated expansions for acronyms it was told not to expand: 3 runs, 3 inventions for 'RTM' (Root Mean Square Error, Return on Technology Investment, Relative Technological Maturity). MATH-500: ~42%. | At that capability gap, token priors beat instructions. Replaced the model with DeepSeek R1 (MATH-500: 97.3%). |
| Prompt structure | Ten-point briefs produced technically correct, completely bland text | Context + data + one instruction. The model needs room to reason, not a script. |
| Scale conversion | 'Multiply raw scores by 50' burned R1's thinking budget on a scale conflict. Empty JSON. | Handle normalisation in plot_rtm(), not in natural language. Scale conversion belongs in code. |
| Output stability | Numeric outputs stable; interpretive labels varied across runs (even at temperature: 0) | Anchor on the numbers, treat prose framing as context. Reasoning models need this discipline. |
💡 The constraint was the capability gap. The decision was DeepSeek R1. The trade-off accepted was cost per token, justified by structured JSON output on the first attempt.
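The "anchor on the numbers" rule from the table can be enforced mechanically. The sketch below compares only the numeric fields across runs and ignores labels, since labels are the part that drifted. `numeric_fingerprint` and `numbers_stable` are hypothetical helpers, and the second run's labels are invented to show the idea.

```python
def numeric_fingerprint(findings: dict, ndigits: int = 1) -> tuple:
    """Reduce one run's output to its numbers, ignoring labels and prose."""
    # Sort by value rather than label, because labels are not stable across runs.
    values = sorted(round(p["value"], ndigits) for p in findings["chart_points"])
    return (round(findings["gap_value"], ndigits), tuple(values))


def numbers_stable(runs: list) -> bool:
    """True when every run reports the same numeric content."""
    return len({numeric_fingerprint(run) for run in runs}) == 1


run_1 = {
    "gap_label": "Median-Mean Disparity",
    "gap_value": 127.3,
    "chart_points": [
        {"label": "Median QoQ growth", "value": 3.8},
        {"label": "Mean QoQ growth", "value": 131.1},
    ],
}
# Invented second run: same numbers, different wording.
run_2 = {
    "gap_label": "Outlier-Driven Skew",
    "gap_value": 127.3,
    "chart_points": [
        {"label": "Average growth", "value": 131.1},
        {"label": "Typical growth", "value": 3.8},
    ],
}
print(numbers_stable([run_1, run_2]))
```

If this check fails, the problem is in the data path or the prompt, not the phrasing, and it is worth rerunning before any chart is published.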
The Three Questions Worth Asking
Before crediting AI for any metric improvement:
1. What was my baseline before the trough that triggered this investment? Not the crisis quarter. The 12-month rolling average before it.
2. What would have happened without the intervention? If you cannot model the counterfactual, you have not measured the AI. You have measured the quarter.
3. Have I measured across at least three consecutive cycles? One data point is noise. Two is coincidence. Three is a pattern worth acting on.
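Question 3 is easy to automate once a baseline exists. As a rough sketch, with a hypothetical helper name and made-up quarterly figures, the check is simply how many consecutive cycles the metric stayed above that baseline:

```python
def longest_run_above(values, baseline):
    """Length of the longest streak of consecutive periods strictly above the baseline."""
    best = run = 0
    for value in values:
        # Extend the streak on a beat, reset it on a miss.
        run = run + 1 if value > baseline else 0
        best = max(best, run)
    return best


# Made-up quarterly metric against a pre-trough baseline of 100.
quarterly = [104, 98, 103, 106, 109]
streak = longest_run_above(quarterly, baseline=100)
print(f"Longest streak above baseline: {streak} cycles")
```

Here the first quarter above baseline is followed by a miss, so it counts for nothing on its own. Only the final three quarters form a pattern worth acting on.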
What's your experience measuring AI ROI in your organisation? Has your team found a reliable counterfactual approach? Drop a comment. I'd genuinely like to know what's working.


