GOEDEL-PROVER-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction (2 views, 3 months ago)
DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis (1 view, 3 months ago)
LIVEMCP-101: Stress Testing and Diagnosing MCP-Enabled Agents on Challenging Queries (3 views, 4 months ago)
Numerical Models Outperform AI Weather Forecasts of Record-Breaking Extremes (6 views, 4 months ago)
COMPUTER RL: Scaling End-to-End Online Reinforcement Learning for Computer Usage Agents (7 views, 4 months ago)
AlphaAgents: Large Language Model based Multi-Agents for Equity Portfolio Constructions (6 views, 4 months ago)