Reading List
What we're reading
Weekly Gen AI headlines for builders, plus the papers that define the field. Curated by Koobo, refreshed weekly by an AI agent.
Weekly Headlines
Week of February 8
Opus 4.6 lets you assemble teams of agents that coordinate in parallel. API users also get compaction for longer-running agentic workflows.
The new Codex model handles end-to-end agentic workflows — tool use, computer operation, and multi-step tasks. Available in Cursor and VS Code.
Frontier treats agents like employees — build, deploy, and manage them at org scale. Targets the gap between model intelligence and production agent ops.
Gemini 3 Flash combines Gemini 3 Pro's reasoning with Flash efficiency. Available now via Gemini API, Vertex AI, and Gemini CLI.
Claude Code's source is public on GitHub but uses a custom license, not MIT. Developers debate the distinction as agent tooling forks emerge.
DeepSeek's new architecture could shape its next major model. Analysts are split on whether a standalone R2 is coming or whether it folds into a larger release.
Curated weekly by Koobo Content Agent
Groundbreaking
Recent breakthroughs that changed the landscape.
ReAct: Synergizing Reasoning and Acting in Language Models
Yao et al.
Showed that interleaving reasoning traces with actions lets language models solve complex tasks by thinking and acting in alternation. ReAct is the conceptual foundation for most modern AI agent architectures — reason about what to do, then do it, then reason again.
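The reason-act-observe loop that ReAct describes can be sketched in a few lines. Everything here is a hypothetical stand-in: `call_llm` fakes a model response and `search` fakes a tool; a real implementation would call an actual model API inside the loop.

```python
# Minimal ReAct-style loop (illustrative sketch, not any specific library).

def call_llm(prompt: str) -> str:
    """Stand-in for a language-model call; returns the next Thought/Action."""
    # A real implementation would send `prompt` to a model API here.
    return "Thought: I have enough information.\nAction: finish[42]"

def search(query: str) -> str:
    """Stand-in tool: pretend search-engine lookup."""
    return f"(results for {query!r})"

TOOLS = {"search": search}

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(transcript)          # model emits a Thought and an Action
        transcript += step + "\n"
        action_line = step.splitlines()[-1]  # e.g. "Action: search[weather in Paris]"
        name, _, arg = action_line.removeprefix("Action: ").partition("[")
        arg = arg.rstrip("]")
        if name == "finish":                 # model decides it is done
            return arg
        observation = TOOLS[name](arg)       # act: execute the chosen tool
        transcript += f"Observation: {observation}\n"  # feed result back in
    return "(no answer within step budget)"
```

The key design point is that the transcript accumulates Thought/Action/Observation triples, so each reasoning step conditions on the results of earlier actions.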
Toolformer: Language Models Can Teach Themselves to Use Tools
Schick et al.
Demonstrated that language models can learn to use external tools (calculators, search engines, APIs) through self-supervised learning. Established that tool use is a learnable skill, not just a prompting trick — a key insight for building capable AI agents.
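The paper's inline call format, where the model writes markers like `[Calculator(400/1400)]` directly into its output, can be illustrated with a toy executor. The regex parsing and the calculator below are simplified stand-ins, not the paper's implementation.

```python
import re

def calculator(expr: str) -> str:
    # Toy arithmetic evaluator; never eval untrusted input in real code.
    return str(eval(expr, {"__builtins__": {}}))

TOOLS = {"Calculator": calculator}

CALL = re.compile(r"\[(\w+)\(([^)]*)\)\]")

def execute_tool_calls(text: str) -> str:
    """Replace each [Tool(args)] marker with [Tool(args)->result]."""
    def run(m: re.Match) -> str:
        name, args = m.group(1), m.group(2)
        return f"[{name}({args})->{TOOLS[name](args)}]"
    return CALL.sub(run, text)

print(execute_tool_calls("Out of 1400 participants, 400 [Calculator(400/1400)] passed."))
```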
Llama 2: Open Foundation and Fine-Tuned Chat Models
Touvron et al.
Meta's release of high-quality open-weight models with permissive licensing catalyzed the open-source AI ecosystem. Llama 2 proved that open models could approach proprietary performance, launching a wave of community fine-tuning and derivative models.
Mixtral of Experts
Jiang et al.
Demonstrated that mixture-of-experts architectures can match dense models with roughly six times their active parameter count. By activating only a subset of parameters per token, MoE models achieve large-model quality at small-model inference cost — a key efficiency breakthrough.
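The routing idea can be sketched as a toy top-2 MoE layer. The dimensions, router, and expert matrices below are all invented for illustration; a real layer would sit inside a Transformer block and be trained end to end.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 2

W_gate = rng.normal(size=(d, n_experts))            # router ("gate")
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ W_gate
    top = np.argsort(logits)[-top_k:]               # select top-k experts per token
    weights = np.exp(logits[top])
    weights /= weights.sum()                        # softmax over the selected experts
    # Only top_k of n_experts weight matrices are touched, so the
    # active parameter count per token is a fraction of the total.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_forward(rng.normal(size=d))
assert y.shape == (d,)
```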
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek AI
Trained for an estimated $6 million, DeepSeek-R1 matched OpenAI o1's reasoning capabilities and was released under the MIT license. Validated that frontier-level reasoning can be achieved through RL without expensive supervised fine-tuning, fundamentally altering the economics of AI development.
Qwen2.5 Technical Report
Qwen Team
Alibaba's Qwen2.5 series demonstrated that open-source models trained on 18 trillion tokens across 29 languages could match or exceed proprietary models on coding, math, and reasoning benchmarks. The subsequent Qwen3 variants outperformed OpenAI o3 on advanced mathematics.
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Google
Google's natively multimodal model family demonstrated that training on interleaved text, image, audio, and video from the start produces stronger cross-modal reasoning than bolting modalities onto a text model. Set new benchmarks for multimodal understanding.
The Claude Model Family: Claude 3.5 System Card
Anthropic
Anthropic's detailed system card for Claude 3.5 set a new standard for AI transparency, documenting model capabilities, safety evaluations, and known limitations. Demonstrated how responsible AI development can coexist with frontier capabilities.
Foundational
The canonical papers that define the field.
Attention Is All You Need
Vaswani et al.
Introduced the Transformer architecture, replacing recurrence with self-attention for sequence modeling. This paper is the foundation of every modern large language model — GPT, BERT, Llama, Claude, and Gemini all descend from this architecture.
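The scaled dot-product attention at the heart of the paper fits in a few lines. This is a single head with toy shapes and no masking or multi-head projections, purely to show the core operation.

```python
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # query-key similarity, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V                                # weighted mix of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out = attention(Q, K, V)
assert out.shape == (3, 4)
```

Each output row is a convex combination of the value vectors, with the weights determined by how well the corresponding query matches each key.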
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Devlin et al.
Demonstrated that pre-training a bidirectional transformer on unlabeled text, then fine-tuning on specific tasks, dramatically outperforms training from scratch. Established the pre-train/fine-tune paradigm that defines modern NLP.
Language Models are Few-Shot Learners
Brown et al.
Showed that scaling language models to 175 billion parameters enables few-shot learning — performing tasks from just a few examples without fine-tuning. Proved that scale itself is a path to general capability.
Training language models to follow instructions with human feedback
Ouyang et al.
Introduced RLHF (Reinforcement Learning from Human Feedback) to align language models with human intent. This technique transformed raw language models into useful assistants — the key innovation behind ChatGPT and every instruction-tuned model since.
Scaling Laws for Neural Language Models
Kaplan et al.
Established precise mathematical relationships between model size, dataset size, compute budget, and performance. These scaling laws became the strategic blueprint for training larger and more capable models — directly informing investment decisions across the industry.
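One of the paper's reported fits — loss as a power law in non-embedding parameter count — can be sketched directly. The constants below are the paper's approximate fitted values; treat the numbers as illustrative, not exact.

```python
# Kaplan-style power law: L(N) = (N_c / N) ** alpha_N
ALPHA_N = 0.076    # fitted exponent (approximate)
N_C = 8.8e13       # fitted scale constant, in non-embedding parameters (approximate)

def loss(n_params: float) -> float:
    return (N_C / n_params) ** ALPHA_N

# Doubling model size shrinks predicted loss by the constant
# factor 2 ** -ALPHA_N, about 0.949, regardless of starting size.
for n in (1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {loss(n):.3f}")
```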
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Wei et al.
Demonstrated that prompting models to show their reasoning step-by-step dramatically improves performance on math, logic, and multi-step tasks. Chain-of-thought is now a standard technique in both prompting and model training.
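In practice, a chain-of-thought prompt just includes worked reasoning inside the few-shot examples. The arithmetic problems below are invented for illustration; any LLM client could consume this string.

```python
prompt = """\
Q: A bakery sells muffins for $3 each. Sam buys 4 muffins and pays with a $20 bill.
How much change does he get?
A: Let's think step by step.
4 muffins cost 4 * 3 = 12 dollars.
Change is 20 - 12 = 8 dollars.
The answer is 8.

Q: A train travels 60 miles per hour for 2.5 hours. How far does it go?
A: Let's think step by step.
"""
# Sending `prompt` to a model elicits intermediate steps before the final answer,
# because the example demonstrates the step-by-step format to imitate.
```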
Constitutional AI: Harmlessness from AI Feedback
Bai et al.
Introduced a method for training AI systems to be helpful and harmless using a set of principles (a 'constitution') rather than extensive human labeling. Pioneered AI-to-AI feedback for alignment, reducing dependence on human annotation.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Lewis et al.
Combined retrieval systems with generative models, allowing language models to access external knowledge at inference time. RAG is now the standard architecture for building AI systems that need to work with specific, up-to-date, or proprietary information.
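The retrieve-then-generate pattern can be sketched without any model at all. The document store and word-overlap scorer below are toy stand-ins for a real embedding index and vector database.

```python
DOCS = [
    "The Transformer was introduced in 2017 in 'Attention Is All You Need'.",
    "RLHF aligns language models with human preferences.",
    "Mixture-of-experts layers activate only a subset of parameters per token.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Score documents by word overlap with the query (toy retriever)."""
    q = set(query.lower().split())
    ranked = sorted(DOCS, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    """Stuff the retrieved passages into the prompt as grounding context."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

print(build_prompt("When was the Transformer introduced"))
```

The prompt, not the model's weights, carries the specific knowledge — which is why RAG suits proprietary or frequently changing information.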
Want to see AI analysis in action?
Try our AI Strategy Analyzer — describe a work or business scenario and get an instant agentic AI assessment.
Try the AI Strategy Analyzer