I am not a financial professional. I want to say that upfront, because what follows is going to describe something that sounds a little like trading, and I don't want anyone to mistake this for investment advice. This is a builder's log about an interesting signal-detection problem that happens to involve money.

With that out of the way: yes, I built an AI agent that watches weather prediction markets. And it's one of the more interesting systems I've built in the past year.

What Are Weather Prediction Markets?

Prediction markets are regulated platforms where you can bet on real-world outcomes with real money. Kalshi is one of the largest, regulated by the CFTC in the US. They offer contracts on a wide range of events — economic indicators, political outcomes, and yes, weather.

A typical weather contract looks like this: "Will the high temperature in Los Angeles exceed 78°F on March 15?" You can buy a YES contract for, say, 62 cents. If the temperature exceeds 78°F, the contract pays out $1. If it doesn't, you lose your 62 cents. The market price of the contract reflects the collective probability the crowd assigns to the event — 62 cents = 62% implied probability.

// how prediction markets work

The price of a prediction market contract reflects the crowd's implied probability of the event occurring. A contract trading at $0.65 means the market believes there's a 65% chance the event happens. If you think the true probability is higher — say 80% — you have a positive expected value edge and buying is theoretically profitable over many bets.
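To make that arithmetic concrete, here's a minimal sketch of the expected profit on a YES contract, using the numbers from the paragraph above (illustrative only, not the agent's code):

```python
def ev_per_contract(price: float, p_true: float) -> float:
    """Expected profit per $1-payout YES contract bought at `price`,
    assuming your probability estimate `p_true` is correct."""
    win = p_true * (1.0 - price)    # profit when the contract pays $1
    loss = (1.0 - p_true) * price   # stake lost when it pays $0
    return win - loss

# Contract trading at $0.65, believed true probability 80%:
print(round(ev_per_contract(0.65, 0.80), 3))  # → 0.15, i.e. +15 cents per contract
```

A positive number here is necessary but not sufficient; fees, the spread, and your own model uncertainty all eat into it.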

The interesting question is: is the crowd right? And more specifically, is there a systematic way to know when the crowd is wrong about the weather?

The Edge: Ensemble Forecasting vs. Market Prices

Modern weather forecasting uses what are called ensemble models — running the same simulation many times with slightly different starting conditions to produce a probability distribution over outcomes. The major public models are GFS (American), ECMWF (European), and ICON (German). Each runs independently, and each produces probabilistic forecasts that tell you not just "it'll be 76°F" but "there's a 72% chance the high temperature exceeds 76°F."

When the three major ensemble models all agree that the probability of an outcome is substantially different from what the prediction market is pricing in, that's a potential edge.

If the market says "40% chance it exceeds 80°F" but GFS, ECMWF, and ICON are all saying 65-70%, that's a 25-30 percentage point discrepancy. After accounting for Kalshi's spread and the inherent uncertainty in the models, there might be something exploitable there.

The agent we built — Njord, running on an Optiplex 7070 at IP 10.2.0.7 — does this comparison continuously. It pulls forecast data for 15 cities (Los Angeles, Dallas, Seattle, Houston, Phoenix, Boston, Minneapolis, Las Vegas, Philadelphia, Detroit, and more), fetches the corresponding Kalshi markets for each city, and runs the spread calculation every 5 minutes during market hours.

At a glance: 15 cities monitored, 3 forecast models, a 5-minute scan interval, and a 3-percentage-point minimum edge requirement.

The Signal Detection Logic

When Njord finds a contract where the ensemble consensus probability diverges from the market price by more than 3 percentage points (after accounting for the spread), it logs it as a signal and sends an alert to my Discord. The alert looks something like this:

STRONG SIGNAL: KXHIGHLAX-26MAR07-T78 | Market: 41% | Model: 68% | Edge: +27pp | Kelly: 18%
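The check that produces a line like that one can be sketched end to end. This is an illustrative stand-in, not Njord's actual code, and the second ticker below is made up:

```python
MIN_EDGE = 0.03  # minimum model-vs-market gap, matching the 3-point threshold

def find_signals(quotes, min_edge=MIN_EDGE):
    """quotes: iterable of (ticker, market_prob, model_prob, kelly_frac).
    Returns formatted alert lines for contracts whose gap exceeds min_edge."""
    alerts = []
    for ticker, market_p, model_p, kelly in quotes:
        edge = model_p - market_p
        if abs(edge) > min_edge:
            alerts.append(
                f"STRONG SIGNAL: {ticker} | Market: {market_p:.0%} | "
                f"Model: {model_p:.0%} | Edge: {round(edge * 100):+d}pp | "
                f"Kelly: {kelly:.0%}"
            )
    return alerts

quotes = [
    ("KXHIGHLAX-26MAR07-T78", 0.41, 0.68, 0.18),  # 27-point gap: signal
    ("KXHIGHDAL-26MAR07-T90", 0.55, 0.56, 0.01),  # 1-point gap: quiet
]
for line in find_signals(quotes):
    print(line)
```

Running this prints exactly one alert, for the Los Angeles contract; the Dallas contract sits inside the threshold and stays quiet.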

The Kelly criterion tells you how much of your bankroll to bet, given a known edge. Buying YES at 41 cents returns $1 on a win, so the gross payout is about 2.44x your stake, which works out to net odds of roughly 1.44:1. At a 68% win probability, the full Kelly formula actually says to stake close to half your bankroll on this single trade. That's why almost nobody bets full Kelly: in practice, most people use "fractional Kelly" (a quarter to a half of the recommended size) to be more conservative, and the 18% in the alert is a damped fraction, not the full-Kelly number.

// the kelly criterion, briefly

Kelly tells you the optimal fraction of your capital to bet when you have a mathematical edge. The formula is: f = (p * b - q) / b, where p is your probability of winning, q = 1 - p is your probability of losing, and b is the net odds received (profit per unit staked, excluding the returned stake). Betting more than Kelly is reckless; betting much less than Kelly leaves value on the table. Most practitioners use 25-50% of the full Kelly size.
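The sizing math translates directly into code. A sketch (function names are mine); note that for the alert's inputs, 68% win probability on a contract at 41 cents, full Kelly comes out near half the bankroll, which is why a damped fraction is what actually gets used:

```python
def kelly_fraction(p: float, b: float) -> float:
    """Full Kelly: f = (p*b - q)/b, with q = 1 - p and b = net odds."""
    q = 1.0 - p
    return (p * b - q) / b

def contract_net_odds(price: float) -> float:
    """Net odds for a $1-payout contract bought at `price`."""
    return (1.0 - price) / price

# The alert's inputs: 68% model probability, contract priced at 41 cents.
b = contract_net_odds(0.41)      # ≈ 1.44:1 net odds
full = kelly_fraction(0.68, b)   # ≈ 0.46 of bankroll
print(round(full, 3), round(full / 4, 3))  # full Kelly vs. quarter Kelly
```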

The Bias Problem We Had to Fix

When I first ran this system, the signals looked too good. The models were showing large edges almost constantly, across many cities and many contracts. That's usually a red flag.

The problem turned out to be systematic model bias. The weather models I was using were under-predicting the probability of extreme temperatures by 18-32 percentage points in some cities and seasons. I was treating the raw model output as ground truth when it wasn't; the models had known biases that I hadn't corrected for.

Calibration was the fix. Over several weeks of observation, I tracked the difference between model predictions and actual outcomes, built a bias correction layer for each city and season, and applied those corrections before comparing model output to market prices. After calibration, the signal frequency dropped dramatically — which was correct. Real edge is rare. If you're seeing edge everywhere, something is wrong with your model.
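One simple way to build such a correction layer is a per-(city, season) additive offset learned from historical forecast-vs-outcome pairs. This is a sketch of the idea, not Njord's actual calibration code, and an additive shift on probabilities is the crudest possible choice (logistic recalibration such as Platt scaling is the more principled route):

```python
from collections import defaultdict

def fit_bias(history):
    """history: iterable of (city, season, model_prob, outcome), outcome 1/0.
    Returns the mean (outcome - forecast) per (city, season) bucket,
    i.e. how much the raw model under-predicts in that bucket."""
    sums = defaultdict(lambda: [0.0, 0])
    for city, season, p, y in history:
        sums[(city, season)][0] += y - p
        sums[(city, season)][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}

def calibrate(p, city, season, bias, default=0.0):
    """Shift a raw model probability by the learned bias, clipped to [0, 1]."""
    return min(1.0, max(0.0, p + bias.get((city, season), default)))

# Toy history: a model that runs consistently low in Phoenix summers.
history = [
    ("PHX", "summer", 0.50, 1),
    ("PHX", "summer", 0.60, 1),
    ("PHX", "summer", 0.70, 1),
    ("PHX", "summer", 0.40, 0),
]
bias = fit_bias(history)  # ≈ {("PHX", "summer"): 0.2}
print(round(calibrate(0.55, "PHX", "summer", bias), 3))  # → 0.75
```

The `default=0.0` fallback matters: for a city-season bucket with no history, the safest correction is no correction at all.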

Paper Trading: What the Numbers Say

We've been paper trading since early March — placing hypothetical bets with fake money to test whether the signal has real predictive power before risking actual capital. The system is fully built for real trading (the Kalshi API integration, the Kelly sizing, the circuit breakers), but we're running in paper mode while we validate the calibration.

Here's the honest status: the system is scanning, finding signals, but not triggering many paper trades. That's because after calibration and the 3% edge threshold, genuine edge doesn't appear that often. Most days it comes back "markets priced fairly." That's the correct behavior — but it makes for slow paper trade accumulation.

The previous run of the system (before a hardware drive swap forced us to reset the paper trading database) showed 3 profitable trades with a simulated P&L of +$5,175 on a $200-per-trade basis. The new run is fresh but the methodology is the same.

The most important thing this project taught me is the difference between "finding signal" and "finding edge." They sound the same. They're not. Signal is just a pattern. Edge is a pattern that beats the market's pricing after accounting for transaction costs and your own model uncertainty.

What This Has to Do With AI

Njord is running on a CPU-only machine. The models powering it are large language models via API (Google's Gemini 2.5 Flash for most tasks, NVIDIA NIM for the 70B Llama when we need deeper reasoning). But the core signal detection logic is statistical, not AI in the neural-network sense. The language model's role is orchestration and communication — interpreting the scan results, formatting the alerts, handling the natural language interface when I ask "what's the best signal today?"

The actual prediction of weather outcomes is done by the GFS, ECMWF, and ICON models — models that run on supercomputing infrastructure operated by national weather services. I'm not trying to out-forecast them. I'm just detecting when the prediction markets haven't incorporated their output correctly.

That's the core of the thesis. Weather prediction models are very good at their job. Prediction markets are sometimes slow to incorporate that information. The gap between the two is where the potential edge lives.

Where It's Going

The next step is real trading with a small funded Kalshi account ($100-200 to start), trading fractional Kelly sizes and watching the Brier score (a probability calibration metric) evolve over real trades. If, over 90-day evaluation windows, the Brier score shows the model is better calibrated than the market, we'll scale up. If not, we'll adjust.
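The Brier score is just the mean squared error between predicted probabilities and binary outcomes, so comparing the model's score to the market's over the same resolved contracts takes a few lines (the numbers below are illustrative, not real trades):

```python
def brier(preds, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes; lower is better."""
    return sum((p - y) ** 2 for p, y in zip(preds, outcomes)) / len(preds)

# Toy example: four resolved contracts.
outcomes = [1, 0, 1, 1]
model_p  = [0.8, 0.3, 0.7, 0.6]   # post-calibration model probabilities
market_p = [0.6, 0.5, 0.5, 0.5]   # market prices at entry
print(round(brier(model_p, outcomes), 4))   # → 0.095
print(round(brier(market_p, outcomes), 4))  # → 0.2275
```

In this toy example the model scores lower (better) than the market; the real test is whether that holds over enough resolved trades to be statistically meaningful.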

The hockey layer is also live. We built a 3-model ensemble (Elo rating system + Pythagorean win expectation + recent form analysis) that scans Kalshi sports contracts. The methodology is the same: find where the market's implied probability differs from what the model says, quantify the edge, alert when it crosses the threshold. So far: 17 signals detected, still waiting for high-confidence entries.
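For reference, the Elo component of that ensemble converts a rating gap into a win probability with the standard logistic formula (the ratings below are made up):

```python
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation: P(A beats B) = 1 / (1 + 10^((Rb - Ra) / 400))."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# A 100-point rating gap is roughly a 64/36 proposition.
print(round(elo_win_prob(1550, 1450), 3))  # → 0.64
```

In the ensemble described above, this estimate would be combined with the Pythagorean and recent-form probabilities before being compared against the market's implied probability.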

This is what I find genuinely interesting about working in this space: the problems are hard in the right ways. You're not trying to invent a better weather model. You're trying to build a system that reliably detects and acts on a narrow class of informational inefficiency. The edge, if it exists, will be small. But small and consistent, compounded over hundreds of trades, is how this math works.