This is part of a presentation I gave to my company on January 2nd, 2025.
“What LLM should I use?” is more nuanced than it used to be. For coding or general technical tasks, I currently observe:
- OpenAI o1/o1 Pro is the best single-turn model, but as a reasoning model it usually takes 10+ seconds to generate an answer.
- Anthropic Claude 3.5 Sonnet is the best conversational model, returning high-quality answers in under ~2 seconds. Its coding responses are also strong.
- OpenAI gpt-4o is very similar to Sonnet, probably slightly better in some forms of code generation, but definitely worse conversationally. However, it can get stuck in loops when troubleshooting code, continuing to suggest prior (failed) solutions. Claude is better at output diversity and can “break out” to suggest alternate approaches over the course of a conversation.
- Google Gemini 2.0 Flash Thinking Experimental, released December 19th, is Google’s equivalent to o1. I haven’t fully evaluated it yet, but I appreciate that it shows its reasoning traces.
- Google Gemini 2.0 Flash and Gemini-Exp-1206 are better conversationally than gpt-4o, comparable in code generation.
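If you want to run your own head-to-head comparisons outside the chat UIs, a minimal sketch using the official anthropic and openai Python SDKs might look like the following; the model identifiers and the prompt are my assumptions and may need updating.

```python
# Minimal sketch: send the same coding prompt to Claude 3.5 Sonnet and gpt-4o
# and print both answers for a side-by-side comparison.
# Assumes ANTHROPIC_API_KEY and OPENAI_API_KEY are set in the environment;
# the model identifiers below are assumptions and may need updating.
import anthropic
import openai

PROMPT = "Write a Python function that merges two sorted lists."

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
claude_reply = claude.messages.create(
    model="claude-3-5-sonnet-latest",   # assumed model id
    max_tokens=1024,
    messages=[{"role": "user", "content": PROMPT}],
)
print("Claude 3.5 Sonnet:\n", claude_reply.content[0].text)

oai = openai.OpenAI()  # reads OPENAI_API_KEY
oai_reply = oai.chat.completions.create(
    model="gpt-4o",                     # assumed model id
    messages=[{"role": "user", "content": PROMPT}],
)
print("gpt-4o:\n", oai_reply.choices[0].message.content)
```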
Claude 3.5 Sonnet has replaced Google search in 90% of my scenarios. I use it for all varieties of general Q&A. Hallucinations are a major problem for smaller, portable models, but the options listed above perform quite well in this regard.
Workflow
Let’s imagine a scenario where I want to create a new Python class to accomplish an objective. My process looks like this (a rough API sketch of the first two steps follows the list):
- Claude 3.5 Sonnet - Voice mode
  - Speak through an explanation of the task and ask the model for a detailed problem definition / task specification.
  - Further refine the task specification, amending Claude’s output while probing any considerations I missed in my initial explanation.
- OpenAI o1
  - Paste the resulting documentation from Claude into o1, optionally including additional instructions or context.
  - Review the model output and ask for refinements if necessary.
- Test code in IDE
- Troubleshooting
  - Claude for adjustments at the line or function/method level
  - gpt-4o or Gemini for a “second opinion” when stuck
  - Back to o1 for major rewrites or edits across long context (e.g. 100 lines of code or more)
  - Then Claude again for line- or function/method-level adjustments to the rewritten code
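For anyone who would rather script part of this loop than copy and paste between chat windows, here is a rough sketch of the first two steps (spec refinement with Claude, then implementation with o1) using the Python SDKs. The model names, prompt wording, and example task are assumptions, not a prescribed setup; in practice I do these steps interactively.

```python
# Rough sketch of the first two workflow steps as API calls:
#   1) Claude 3.5 Sonnet turns a rough description into a task specification.
#   2) o1 implements the specification.
# Model identifiers, prompts, and the example task are assumptions; in practice
# this happens interactively (voice mode + chat), not as a script.
import anthropic
import openai

rough_description = (
    "I need a Python class that watches a directory for new CSV files, "
    "validates their schema, and loads valid rows into a SQLite table."
)

claude = anthropic.Anthropic()
spec = claude.messages.create(
    model="claude-3-5-sonnet-latest",  # assumed model id
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": "Turn this into a detailed problem definition / task "
                   "specification, listing edge cases I may have missed:\n\n"
                   + rough_description,
    }],
).content[0].text

oai = openai.OpenAI()
implementation = oai.chat.completions.create(
    model="o1",  # assumed model id; reasoning models can take 10+ seconds
    messages=[{
        "role": "user",
        "content": "Implement the following specification as a single Python "
                   "class with docstrings and type hints:\n\n" + spec,
    }],
).choices[0].message.content

print(implementation)  # review the output, then test it in the IDE
```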
I do not have as much experience in areas like creative writing, where diversity of model outputs or maintenance of themes is more important.
Voice
I strongly recommend using voice-to-text functionality to interact with the models. It’s far more efficient than typing for conveying information. The transcriptions themselves are quite accurate, and the models are flexible enough to recognize common transcription errors without special handling.
I frequently accomplish productive design work while walking the dogs, describing system behavior via AirPods and the voice functionality in the Claude mobile app. When I return to my computer, I have the outline I need to begin diagramming or testing code.
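The mobile apps handle the transcription natively, but if you want a comparable voice-to-model loop on a desktop, a minimal sketch, assuming OpenAI’s Whisper transcription endpoint and the Anthropic SDK, could look like this; the file name and prompt are hypothetical.

```python
# Minimal sketch of a desktop voice-to-model loop: transcribe a recorded
# voice memo with OpenAI's Whisper endpoint, then send the transcript to
# Claude as a design prompt. The file name and prompt are hypothetical;
# the mobile apps' built-in voice modes do all of this for you.
import anthropic
import openai

oai = openai.OpenAI()
with open("design_notes.m4a", "rb") as audio:  # hypothetical recording
    transcript = oai.audio.transcriptions.create(
        model="whisper-1",
        file=audio,
    ).text

claude = anthropic.Anthropic()
outline = claude.messages.create(
    model="claude-3-5-sonnet-latest",  # assumed model id
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": "Here is a spoken description of a system I want to build. "
                   "Transcription errors are likely; interpret charitably and "
                   "produce an outline I can diagram from:\n\n" + transcript,
    }],
).content[0].text

print(outline)
```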