Last year, I worked in Percy Liang and Dan Boneh's research lab on evaluating the cybersecurity capabilities of LLM agents on real-world tasks (from bug bounty/white-hat hacker sites).
See our paper and poster published at NeurIPS: "BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems".