Creating SteamML: Predicting Game Success With Machine Learning

Mike, our head of games, built a machine learning model to predict potential game success on the Steam platform.

It all started with Dungeon Highway

Back in 2015, as game distribution platform Steam was midway through Greenlight, its crowd-driven submission system, we made a little indie pixel-art runner called Dungeon Highway. On mobile, it did well enough to spawn a sequel called Dungeon Highway Adventures, and we thought, hey, let’s take this to Steam too and get some experience over there. Things looked good. But it bombed.

In retrospect, it was an extraordinarily bad idea to release an indie pixel-art runner at the same time that the rest of the world was also releasing indie pixel-art runners.

We went on to build Exploding Kittens with the lovely Exploding Kittens crew, and that’s been a top-3 paid card game on iOS for years now, so all was better in the end. Still, it nagged at me that I had been unable to predict that a particular combination of genres was a really bad idea. Maybe we would have done quite well if we had spent all that effort on another combination of genres. Maybe “football,” “cyberpunk,” and “music” were the way to go. We’re creative people; we could have made something fun with that keyword trio if we had known it was a better option to pursue.

Machine learning is fantastic at predictions

How could we get those suggestions about better game keyword combinations? The data is out there about what’s sold, so what if we could feed all the historical Steam sales data into a machine learning model? That model could help us predict genre combinations that have either been successful in the past or haven’t been explored. So we built that model.

We built the thing

The first model was incredibly simple. Given the top three tags of a game, our model would tell us if it’s going to have a million owners or not. It completely ignored indicators of quality like initial price, number of developers involved, budget, screenshots, etc. I knocked it out of the park with my first swing. It was right 98% of the time! I was ecstatic. The rest of my machine learning crew here at Substantial was bemused. Turns out I knew just enough ML to be dangerous.
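To make that setup concrete, here is a minimal sketch of how a game's top three tags and its owner count might be turned into a training example. The tag vocabulary, the sample games, and the function names are all made up for illustration; this is an assumption about the general shape of the problem, not the actual encoding the model used.

```python
# Hypothetical tag vocabulary; a real one would be built from the Steam data.
VOCABULARY = ["Action RPG", "Cyberpunk", "FPS", "Pixel Graphics", "Runner", "Survival"]
OWNER_THRESHOLD = 1_000_000


def encode_tags(tags, vocabulary=VOCABULARY):
    """Multi-hot encode a game's tags: 1 where the tag is present, else 0."""
    present = set(tags)
    return [1 if tag in present else 0 for tag in vocabulary]


def label_for(owners, threshold=OWNER_THRESHOLD):
    """Binary target: 1 if the game reached the owner threshold, else 0."""
    return 1 if owners >= threshold else 0


# Made-up example games: (top three tags, owner count).
games = [
    (["Action RPG", "Cyberpunk", "Survival"], 2_500_000),
    (["Pixel Graphics", "Runner", "FPS"], 12_000),
]

features = [encode_tags(tags) for tags, _ in games]
labels = [label_for(owners) for _, owners in games]
```

Nothing about price, budget, or screenshots appears in the features, which matches the "tags only" simplicity described above.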

Yes, it was right 98% of the time. But guess how many games on Steam have fewer than a million owners? 98%. The model was just predicting that almost every game would fail, which is cute, and close to reality, actually, but it taught me an important lesson: machine learning algorithms will always cheat if you let them.
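The trap is easy to reproduce. The sketch below uses made-up labels (2 hits out of 100 games, mirroring the 98% figure) and a "model" that always predicts failure. It is an illustration of the problem, not the original code.

```python
# Made-up labels: 2 hits among 100 games, mirroring the 98% figure above.
labels = [1] * 2 + [0] * 98

# A "model" that always predicts failure (0), whatever the tags are.
predictions = [0] * len(labels)

correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

# Recall: of the games that really were hits, how many did we flag?
hits_found = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
recall = hits_found / sum(labels)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
# prints: accuracy=0.98 recall=0.00
```

High accuracy, zero hits found: the number looks great while the model tells you nothing about which games succeed.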

I had to dive a little deeper with my model and encourage it to stop taking the easy way out. My coworker and fellow machine learning enthusiast Jeff suggested I look into F1 scores and rebalancing my input categories, which led me to look up confusion matrices and, finally, to find a custom F1 loss function to properly “punish” the machine learning model when it missed combinations that were likely to lead to success. And then we started getting some interesting results.
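For readers who haven't met these terms, here is a small sketch of a confusion matrix and the F1 score computed from it, using made-up predictions. This shows F1 as an evaluation metric only; a custom F1 loss used during training would be a differentiable variant, which is not shown here, and the function names are my own.

```python
def confusion_counts(labels, predictions):
    """Return (tp, fp, tn, fn) for binary labels where 1 means 'hit'."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, predictions):
        if p == 1 and y == 1:
            tp += 1  # predicted hit, was a hit
        elif p == 1 and y == 0:
            fp += 1  # predicted hit, was a flop
        elif p == 0 and y == 0:
            tn += 1  # predicted flop, was a flop
        else:
            fn += 1  # predicted flop, was a hit (the costly miss)
    return tp, fp, tn, fn


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Made-up data: 4 hits among 20 games.
labels = [1, 1, 1, 1] + [0] * 16
always_fail = [0] * 20
some_model = [1, 1, 1, 0] + [1, 1] + [0] * 14

for name, predictions in [("always_fail", always_fail), ("some_model", some_model)]:
    tp, fp, tn, fn = confusion_counts(labels, predictions)
    print(name, (tp, fp, tn, fn), round(f1_score(tp, fp, fn), 2))
```

Because F1 ignores true negatives, the "everything fails" strategy scores 0.0 here, while a model that finds three of the four hits at the cost of two false alarms scores about 0.67.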

A few obvious combinations likely to hit 1M+ owners

  • Action RPG + Cyberpunk + Survival
  • Batman + Action RPG + Free to Play
  • MOBA + Medieval + Multiplayer
  • Lara Croft + Open World + Survival Horror
  • Naval Combat + Pirates + Multiplayer
  • Bullet Time + FPS + Stylized

Some less obvious viable combinations

  • Faith + Dinosaurs + MMORPG
  • Mars + Fishing + FPS
  • Martial Arts + Heist + Star Wars
  • Pixel Graphics + Life Sim + Multiplayer
  • America + Sniper + Battle Royale
  • Bowling + Survival + FPS
  • Gore + Werewolves + FPS
  • Great Soundtrack + World War 1 + Story Rich
  • Illuminati + Puzzle + Open World
  • Sandbox + Medical Sim + First Person
  • First Person + Tanks + e-Sports

How accurate is this?

There’s a metric called the Matthews correlation coefficient (MCC), which is useful for scoring machine learning prediction models with unbalanced classes like this one. It takes all four kinds of outcome into account (true positives, false positives, true negatives, and false negatives) and produces a score that’s easy to interpret. For a batch of predictions, an MCC score of 0 means our predictions are no better than random guessing, 1 means we’re right all the time, and -1 means we’re wrong all the time.
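As a rough sketch, MCC can be computed straight from the four confusion-matrix counts using its standard formula. The counts below are made up to show the three landmark values; returning 0.0 when the denominator is zero is a common convention that I am assuming here.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: the ratio is undefined when a row or column of the
    # confusion matrix is empty, so report 0.0 in that case.
    return numerator / denominator if denominator else 0.0


# Made-up counts for 100 games, 2 of which are hits.
print(mcc(tp=0, fp=0, tn=98, fn=2))   # always predicts "fail": 0.0
print(mcc(tp=2, fp=0, tn=98, fn=0))   # every prediction right: 1.0
print(mcc(tp=0, fp=98, tn=0, fn=2))   # every prediction wrong: -1.0
```

Note that the "always predict failure" model, which looked so good at 98% accuracy, scores 0 on this metric.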

Our model scores 0.40.

This seems to make intuitive sense to me. If I were to come up with a formula to quantify success in PC gaming, I’d say that success is demand plus marketing plus build quality. Our model doesn’t take build quality or marketing into account.

This is where it gets murky

The model isn’t perfect. There’s an element of luck and other x-factors in game success, and it’s all but impossible for a model to account for that. Additionally, since the model is trained on historical data, a combination of tags it has never seen before (the kind that produces breakout hits) will never score well unless some of its individual tags were associated with hits in the past. There are also some tags that are perennially successful.

Tags that you can add to almost anything to make it viable

  • Co-op
  • FPS
  • Medieval
  • Multiplayer
  • Open World

We’re well aware that correlation is not causation. The MMORPG tag, for example, has historically done very well. Was this because people like MMORPGs (meaning high demand), or was it because nobody makes an MMORPG unless they have a large budget (meaning a lot of advertising)?

If I, in my spare time with my limited resources, made a Faith + Dinosaurs + MMORPG, would it do well? The model says it would. The game I make would undoubtedly have lower build quality than if Sony or Microsoft made it though. I’m a pretty good network programmer. I can make voxel art of faith-loving dinosaurs, push popsicle sticks up their rears and bounce them around an open-world. Is that enough to get to one million owners?

Probably not, I think. So while this model can help, it can’t account for everything.

What’s next?

I’m going to keep plugging away at this. I’ve got another version of the model that’s at 0.45 MCC, which includes a “recency” score to help it better handle trends over time. I think I’d be really happy if I could get it up to 0.6. I have some ideas for improving the accuracy, like reusing some of the less common tags we’re currently tossing and maybe taking the launch price into account as a proxy for build quality.

In the meantime, you can play with an early version of this in your browser! If you come up with a game idea you wouldn’t have otherwise, let us know!
