Not all languages are created equal

126 days ago by pgcd

After introducing the Daily mode, I decided to start trying to play it in English every day and sharing my results, in a vague attempt to "drive" something or something else.

I knew my Wordrush skills to be unmatched (well, duh) and, possibly with some arrogance, I believed my English vocabulary to be reasonably extensive, so I expected to be able to get decent scores every time. Turns out that no, it isn't and I wasn't able to. I actually failed the Daily more than once, with the last few characters spent trying to make words out of XYFDSJFEYGFDS clusters. But I'm not truly arrogant, see? So I asked around, ready to be chastised, and I found that even native English speakers weren't faring much better - this game is *hard*, is the word around town.

Being a developer, I can code. Being a good developer, I know that code is just stuff that happens to data, and data is what I care about, so I proceeded to "analyze" the data, only to discover that trivial observations like the ones I made earlier (i.e. "in this dictionary there are X words longer than seven letters, so it's twice as hard as this other language where there are X*2") led to very misleading conclusions. Case in point: Swedish.
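To make that "trivial observation" concrete, here is a minimal sketch of the naive estimate, assuming each dictionary is just a list of words. The function names and the `min_length` parameter are made up for illustration and are not taken from Wordrush itself.

```python
def count_long_words(words, min_length=8):
    """Count words longer than seven letters (at least min_length letters)."""
    return sum(1 for word in words if len(word) >= min_length)


def naive_relative_difficulty(words_a, words_b, min_length=8):
    """How much harder dictionary A looks than B if only long-word counts mattered.

    Returns 2.0 when B has twice as many long words as A.
    """
    long_a = count_long_words(words_a, min_length)
    long_b = count_long_words(words_b, min_length)
    if long_a == 0:
        raise ValueError("dictionary A has no long words to compare against")
    return long_b / long_a
```

The estimate looks only at counts and ignores which letters those long words are made of, which is the part that turns out to matter.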

Swedish is a great language in all respects - it has a reasonably simple grammar like English, a mostly readable script, word compounds like German, and it sounds absolutely delectable to my ears. It also has very long (compound) words, so much so that I had to remove the ones longer than 25 letters because Wordrush doesn't (yet) implement a way of continuing a word after the letters in the grid are exhausted.
The consequence of this is that there are lots of "long" words (i.e. over seven letters) in Wordrush's Swedish dictionary, which I naively took to mean it was an "easy" language - my initial assessment, based on the long-word count, put it in the same difficulty bracket as German (so, halfway between English and Italian).

After some discussion about diacritics in German and whether or not to keep them there and in Swedish, I decided that I needed some way to test how much of a difference they actually make to difficulty. Intuitively, I assumed that removing diacritics would make things easier (since the same letter could be used to make a word both with and without them - e.g. U could work both for TUN and for TÜR) but I had no idea about the magnitude of the effect.
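For reference, one common way to remove diacritics is Unicode decomposition. This is a sketch under the assumption that the dictionary is a list of uppercase words; `strip_diacritics` and `fold_dictionary` are hypothetical names, and nothing here claims to be how Wordrush does it.

```python
import unicodedata


def strip_diacritics(word):
    """Decompose each letter (NFD) and drop the combining marks, so TÜR becomes TUR."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_dictionary(words):
    """Strip diacritics from every word and merge the ones that collapse together."""
    return sorted({strip_diacritics(word) for word in words})
```

Letters that have no decomposition (such as ß or Ø) pass through unchanged, so they would need their own rule if they should be folded too.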

So I proceeded to build an autosolver, as one does. I had already implemented the "hints" functionality to get the longest possible word in the grid, so it was only a matter of coding something that would take that longest possible word ("take the hint", haha), select the relevant letters, submit them and proceed as required. And let me assure you that "only" does some serious heavy lifting in the previous sentence.
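The loop described above (ask for the longest word, play it, repeat) can be sketched in a heavily simplified form. Big assumptions: the grid is treated as a plain bag of letters with no positions, no new letters arrive, and the "hint" is a brute-force scan of the dictionary. `longest_playable_word` and `autosolve` are hypothetical names; the real Wordrush solver surely has more rules and more edge cases.

```python
from collections import Counter


def longest_playable_word(pool, dictionary):
    """Stand-in for the hint: the longest dictionary word spellable from the pool."""
    best = None
    for word in dictionary:
        if not word:
            continue  # an empty word would never use up letters
        # Counter subtraction keeps only letters the pool is missing.
        if not (Counter(word) - pool):
            if best is None or len(word) > len(best):
                best = word
    return best


def autosolve(letters, dictionary):
    """Greedily play the longest available word until nothing fits.

    Returns the words played and the letters left over (sorted).
    """
    pool = Counter(letters)
    played = []
    while True:
        word = longest_playable_word(pool, dictionary)
        if word is None:
            break
        pool -= Counter(word)  # remove the letters that were just used
        played.append(word)
    return played, "".join(sorted(pool.elements()))
```

Always taking the longest word is greedy, not optimal: it can strand letters that a shorter first word would have used, and it knows nothing about how common a word is.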

Anyway, after solving a number of edge cases, figuring out how to extract useful information from the solve and, most importantly, managing to speed up the whole thing, I finally came up with what you can see in the video (which also shows that English long words tend towards the "uncommon", vs the ones in Italian) and ran a 100-games-per-language test. You can see the results here https://cryptpad.fr/sheet/#/3/sheet/view/69c6409afb3e887c5682529b8aff6d5c/ - I also added a chart to make it look professional!

The result is that Swedish is *very* hard - and that's discounting how common or uncommon the words are, because the autosolver has no knowledge of that, and simply picks what can be done.
As a matter of fact, with the initial dictionary I was using, Swedish was about 2.5 times harder than Italian. For comparison, German is at 1.8 and English is at 2.
Strangely enough, removing diacritics makes just a small difference in German (1.65) but a vast one in Swedish, which drops to 1.6.
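The "times harder than Italian" figures are ratios against a baseline language. The post does not say which per-game metric the sheet uses, so the sketch below simply assumes one number per game where higher means the solver did better; `relative_difficulty` and its parameters are hypothetical.

```python
from statistics import mean


def relative_difficulty(results_by_language, baseline):
    """Express each language's difficulty as a multiple of the baseline's.

    results_by_language maps a language code to a list of per-game results,
    where HIGHER is assumed to mean the solver did better. Means must be non-zero.
    """
    baseline_mean = mean(results_by_language[baseline])
    return {
        language: baseline_mean / mean(results)
        for language, results in results_by_language.items()
    }
```

With Italian as the baseline, Italian itself comes out at 1.0 and a language where the solver averages half as much comes out at 2.0.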

Armed with this abundance of information, I proceeded to try and understand why this would be the case, and found out that not only is Swedish difficult: linguistics is difficult as well, and you can't learn it in one hour. So, no, still no idea whatsoever.

I am inclined to think that there are multiple factors at work to explain the difference in difficulty amongst languages, though:

The massive change between Swedish with diacritics and without suggests that words with diacritics in Swedish tend to be very different words, while in German diacritics are often used for variations on the same word (e.g. plurals), which is why German doesn't benefit quite as much from their removal.
The incredible difficulty of the Swedish dictionary I used with diacritics could also be explained by the sheer number of letters - the dictionary inadvertently included 37 different glyphs. Removing the eight spurious ones had a marked impact: Swedish went from 2.5 to 1.8 times harder than Italian (yay), as you can see in the second tab of the sheet I linked.
But this still leaves English, which is currently the hardest language in Wordrush. Why does English still need a big multiplier to equalize the scores? My guess is that a lot of the longer words in English are loan words of some type (plenty of botany, mineralogy, chemistry and whatnot, plus pure and simple loans), and that these words simply don't share the letter frequencies of the most common ones (you could see the list in the video but Google, in its infinite wisdom, has decided to make it a Short for whatever stupid and greedy reason, so good luck finding the right place). That makes them harder to "come by", especially compared to Italian, where the longest words are often just "regular" ones with some suffix like -ENDO or -ENTE.
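Two of the guesses above are easy to check against a word list: how many distinct glyphs a dictionary really contains, and how far the letter frequencies of the long words drift from those of the common ones. The sketch below assumes plain lists of words; all three function names are made up for illustration, and the "gap" measure is just one simple choice among many.

```python
from collections import Counter


def distinct_glyphs(words):
    """Every different character that appears anywhere in the dictionary."""
    return sorted({ch for word in words for ch in word})


def letter_frequencies(words):
    """Share of each letter among all letters in the given words."""
    counts = Counter(ch for word in words for ch in word)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def frequency_gap(long_words, common_words):
    """Sum of absolute frequency differences: 0 means identical, 2 means no shared letters."""
    a = letter_frequencies(long_words)
    b = letter_frequencies(common_words)
    return sum(abs(a.get(ch, 0.0) - b.get(ch, 0.0)) for ch in set(a) | set(b))
```

Printing `distinct_glyphs` for a dictionary is a quick way to spot spurious characters, and a large `frequency_gap` between long and common words would be consistent with the loan-word guess without proving it.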

So, did I learn something after spending three days or so coding and running simulators and looking at numbers? ABSOLUTELY NOT.

I'm still exactly at square one, because the only halfway sensible measure to counterbalance English and Swedish difficulty is to just multiply the scores. Not for three-letter words because, man, that's just lazy. But 4 letters and more are adjusted, which should hopefully result in a fairer distribution of scores regardless of the language.
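As a sketch of that adjustment: words of four letters or more get their score multiplied by a per-language factor, and three-letter words are left alone. The multiplier values below are placeholders borrowed from the ratios quoted earlier in this post, not necessarily the numbers the game uses, and `adjusted_score` and `LANGUAGE_MULTIPLIERS` are hypothetical names.

```python
# Placeholder values taken from the ratios measured above; the actual
# in-game multipliers are not stated in the post.
LANGUAGE_MULTIPLIERS = {"it": 1.0, "de": 1.8, "sv": 1.8, "en": 2.0}


def adjusted_score(base_score, word, language):
    """Apply the language multiplier to words of four or more letters only."""
    if len(word) < 4:
        return base_score  # three-letter words are not adjusted
    return base_score * LANGUAGE_MULTIPLIERS.get(language, 1.0)
```

Unknown languages fall back to a multiplier of 1.0, so adding a new dictionary never changes scores until a factor is set for it.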

As a side note, it would be perfectly sensible to point out that I shouldn't waste my time like this for a game that has very few regular players, and even fewer Swedish ones. And to that argument I reply: no, I should absolutely waste my time like this. Because Wordrush players are a) indubitably smarter than the average; b) possessed of exquisite taste and c) the best people in the world. So I owe it to them to try and have the best possible game for them to play.

Wordrush

[Just/Nothing] like a classic wordsearch

Status	In development
Author	pgcd
Genre	Puzzle
Tags	Casual, Minimalist, No AI, Relaxing, Short, Singleplayer, Touch-Friendly, wordsearch
Languages	German, English, Spanish; Castilian, Spanish; Latin America, Italian
