Not only is this an interesting ranking of capabilities, but most of the models seem to be consistently wrong (with some just giving up). The only strong performer is the Kimi chatbot, an open-weight model from Beijing known for supporting over a hundred thousand tokens of context (we’ve no idea what that means). One can observe the coding challenge minute by minute and watch how the results accrue or devolve into a jumble.
Clock the results now and see what recursive improvements are on display.
