iOS World Agents – Embodied AI Evaluation Framework

🔍 Overview

Built an embodied simulation framework that evaluates LLM-driven agents as they perform real iOS actions in simulator environments, enabling behavioral evaluation beyond static benchmarks.

🚀 Key Contributions

  • Designed JSON-based task schemas covering 50+ multi-step tasks across Safari, Maps, Calendar, Files, and Settings
  • Orchestrated GPT-4o, Gemini-1.5, and Grok-2 for controlled cross-model behavioral comparison
  • Implemented Reflexion and TextGrad feedback loops, improving the task completion rate by 8–10% without fine-tuning
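
To make the task-schema and feedback-loop contributions above more concrete, here is a minimal sketch in Python. It is illustrative only: the JSON field names (`task_id`, `app`, `goal`, `max_steps`, `success_criteria`), the `Task` class, and the `attempt`/`reflect` callables are hypothetical and not taken from the project. The loop shows the general Reflexion idea (retry a failed task with a verbal self-critique carried into the next trial); TextGrad is not covered here.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical task definition; the real schema may use different fields.
TASK_JSON = """
{
  "task_id": "calendar_001",
  "app": "Calendar",
  "goal": "Create an event titled 'Standup' tomorrow at 9am",
  "max_steps": 10,
  "success_criteria": ["event_exists:Standup"]
}
"""

REQUIRED_FIELDS = {"task_id", "app", "goal", "max_steps", "success_criteria"}


@dataclass
class Task:
    task_id: str
    app: str
    goal: str
    max_steps: int
    success_criteria: List[str]


def load_task(raw: str) -> Task:
    """Parse a JSON task and reject it if any required field is missing."""
    data = json.loads(raw)
    missing = REQUIRED_FIELDS - set(data)
    if missing:
        raise ValueError(f"task is missing fields: {sorted(missing)}")
    return Task(**data)


# attempt(task, reflections) -> (success, trajectory)
AttemptFn = Callable[[Task, List[str]], Tuple[bool, List[str]]]
# reflect(task, trajectory) -> natural-language critique of the failed trial
ReflectFn = Callable[[Task, List[str]], str]


def run_with_reflection(
    task: Task, attempt: AttemptFn, reflect: ReflectFn, max_trials: int = 3
) -> Tuple[bool, int, List[str]]:
    """Reflexion-style loop: after each failed trial, store a critique
    and pass all critiques so far into the next attempt."""
    reflections: List[str] = []
    for trial in range(1, max_trials + 1):
        success, trajectory = attempt(task, reflections)
        if success:
            return True, trial, reflections
        reflections.append(reflect(task, trajectory))
    return False, max_trials, reflections


def stub_attempt(task: Task, reflections: List[str]) -> Tuple[bool, List[str]]:
    # Stand-in for an LLM-driven simulator episode (assumption):
    # it only succeeds once at least one reflection is available.
    return len(reflections) >= 1, ["tap:Calendar"]


def stub_reflect(task: Task, trajectory: List[str]) -> str:
    # Stand-in for an LLM-written self-critique (assumption).
    return f"Failed after {len(trajectory)} step(s); open the new-event form first."


if __name__ == "__main__":
    task = load_task(TASK_JSON)
    print(run_with_reflection(task, stub_attempt, stub_reflect))
```

In a real harness, `attempt` would drive the simulator with a given model and `reflect` would ask that model to critique its own trajectory; keeping both as plain callables is one way to swap models in and out for cross-model comparison.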