iOS World Agents – Embodied AI Evaluation Framework
🔍 Overview
Built an embodied simulation framework to evaluate LLM-driven agents performing real iOS actions inside simulator environments, enabling behavioral evaluation beyond static benchmarks.
🚀 Key Contributions
- Designed JSON-based task schemas covering 50+ multi-step tasks across Safari, Maps, Calendar, Files, and Settings
- Orchestrated GPT-4o, Gemini-1.5, and Grok-2 for controlled cross-model behavioral comparison
- Implemented Reflexion + TextGrad feedback loops, improving task completion by 8–10% without fine-tuning
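To make the contributions above concrete, here is a minimal sketch of how a JSON task definition and a Reflexion-style retry loop might fit together. It is an illustration under stated assumptions, not the framework's actual code: the schema fields (`task_id`, `app`, `goal`, `max_steps`, `success_criteria`), the function names, and the stub agent are all hypothetical. A real run would replace the stub with a model-backed agent driving the simulator, and would typically have the model write its own verbal reflection rather than reuse the failure trace. TextGrad is not shown.

```python
import json
from typing import Callable

# Hypothetical task definition; field names are assumptions for illustration.
TASK_JSON = """
{
  "task_id": "calendar_create_event",
  "app": "Calendar",
  "goal": "Create an event titled 'Standup' tomorrow at 9 AM",
  "max_steps": 8,
  "success_criteria": ["event_exists:Standup"]
}
"""

REQUIRED_FIELDS = {"task_id", "app", "goal", "max_steps", "success_criteria"}

# An agent takes a task and prior reflections, and returns (success, trace).
Agent = Callable[[dict, list[str]], tuple[bool, str]]


def load_task(raw: str) -> dict:
    """Parse a task and check that the assumed required fields are present."""
    task = json.loads(raw)
    missing = REQUIRED_FIELDS - task.keys()
    if missing:
        raise ValueError(f"task is missing fields: {sorted(missing)}")
    return task


def run_with_reflection(task: dict, agent: Agent, max_attempts: int = 3) -> dict:
    """Retry a task, feeding notes about earlier failures into later attempts."""
    reflections: list[str] = []
    for attempt in range(1, max_attempts + 1):
        success, trace = agent(task, reflections)
        if success:
            return {"task_id": task["task_id"], "success": True,
                    "attempts": attempt, "reflections": reflections}
        # Simplification: the failure trace itself serves as the reflection.
        reflections.append(f"Attempt {attempt} failed: {trace}")
    return {"task_id": task["task_id"], "success": False,
            "attempts": max_attempts, "reflections": reflections}


def compare_models(task: dict, agents: dict[str, Agent],
                   max_attempts: int = 3) -> dict[str, dict]:
    """Run the same task with each named agent under identical settings."""
    return {name: run_with_reflection(task, agent, max_attempts)
            for name, agent in agents.items()}


def stub_agent(task: dict, reflections: list[str]) -> tuple[bool, str]:
    """Stand-in for an LLM-driven agent: succeeds once it has a reflection."""
    if reflections:
        return True, "opened Calendar, filled in title, saved event"
    return False, "saved event without a title"


def always_fail_agent(task: dict, reflections: list[str]) -> tuple[bool, str]:
    """Stand-in for an agent that never completes the task."""
    return False, "could not find the Add button"


if __name__ == "__main__":
    task = load_task(TASK_JSON)
    # Model names are used only as labels here; no model API is called.
    results = compare_models(task, {"stub": stub_agent,
                                    "always_fail": always_fail_agent})
    for name, result in results.items():
        print(name, result["success"], result["attempts"])
```

Keeping the task definition, retry budget, and success check identical across agents is what makes the comparison controlled: only the agent behind each label changes.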
🔗 Links
- 💻 Code: GitHub Repository
