• @MoonlightFox@lemmy.world
    link
    fedilink
    English
    11 month ago

    So there is not any trustworthy benchmarks I can currently use to evaluate? That in combination with my personal anecdotes is how I have been evaluating them.

    I was pretty impressed with Deepseek R1. I used their app, but not for anything sensitive.

    I don’t like that OpenAI defaults to a model I can’t pick. I have to select it each time, even when I use a special URL it will change after the first request

    I am having a hard time deciding which models to use besides a random mix between o3-mini-high, o1, Sonnet 3.5 and Gemini 2 Flash

    • @brucethemoose@lemmy.world
      link
      fedilink
      English
      21 month ago

      Heh, only obscure ones that they can’t game, and only if they fit your use case. One example is the ones in EQ bench: https://eqbench.com/

      …And again, the best mix of models depends on your use case.

      I can suggest using something like Open Web UI with APIs instead of native apps. It gives you a lot more control, more powerful tooling to work with, and the ability to easily select and switch between models.