I was debugging some replication delay monitoring metrics when I noticed that the test I was using to generate replication data (sysbench --test=oltp prepare) was far outpacing the slave. The SQL thread had caught up to the I/O thread, but the I/O thread was way behind the master.
Replicating from: a2.db.bcp.re1.yahoo.com
Master: a2_db_bcp_re1.000166/138395515
Slave I/O: Yes a2_db_bcp_re1.000165/802640907 ???
Slave Relay: Yes a2_db_bcp_re1.000165/802030586 596K
198 secs
In this case, the I/O thread was getting further and further behind as sysbench did bulk inserts into my master. My theory is that a lot of relatively small binary log records simply don't transfer efficiently. That leaves the SQL thread idle some of the time, waiting for the I/O thread, and leads to inefficient replication.
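To make the numbers in the status output above concrete, here is a small Python sketch that computes the byte gap between two binlog coordinates. The helper names (parse_coord, lag_bytes) are made up for illustration, and my reading of the "???" (that the gap can't be computed from positions alone once the master has rotated to a newer file) is an assumption, not something the monitoring tool documents.

```python
import re

def parse_coord(text):
    """Split 'basename.NNNNNN/position' into (basename, file_number, position)."""
    match = re.fullmatch(r"(.+)\.(\d+)/(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized binlog coordinate: {text!r}")
    base, seq, pos = match.groups()
    return base, int(seq), int(pos)

def lag_bytes(ahead, behind):
    """Byte gap between two coordinates, or None if they sit in different files."""
    a_base, a_seq, a_pos = parse_coord(ahead)
    b_base, b_seq, b_pos = parse_coord(behind)
    if (a_base, a_seq) != (b_base, b_seq):
        # The size of the older file is unknown here, so no exact gap is possible.
        return None
    return a_pos - b_pos

# Coordinates copied from the status output above.
master = "a2_db_bcp_re1.000166/138395515"
slave_io = "a2_db_bcp_re1.000165/802640907"
slave_sql = "a2_db_bcp_re1.000165/802030586"

print(lag_bytes(slave_io, slave_sql))  # 610321 bytes, which is about 596K
print(lag_bytes(master, slave_io))     # None: master is already on file 000166
```

The SQL thread is only about 596K behind the I/O thread, which matches the output. The I/O thread is a full file behind the master, so its real gap is at least the 138395515 bytes already written to file 000166, plus whatever remained of file 000165.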
I poked around the replication options manual page, looking for something to help and found this: slave_compressed_protocol
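
As a rough feel for why compression might help with this kind of workload, here is a Python sketch that compresses a stream of small, repetitive records with zlib, which is what MySQL's compressed protocol is based on. The records are made-up text stand-ins, not real binlog events (those are binary and will compress differently), so treat this as an illustration of the idea rather than a measurement.

```python
import zlib

# Hypothetical stand-ins for many small replicated events from a bulk load.
events = [
    f"INSERT INTO sbtest (id, k, c, pad) VALUES ({i}, 0, '', 'qqqqqqqqqqwwwwwwwwww')".encode()
    for i in range(10_000)
]

raw = b"".join(events)
packed = zlib.compress(raw)

# Compare the size before and after compression.
print(len(raw), len(packed))
```

How much a real replication stream shrinks depends on the data, and compression costs CPU on both master and slave, so it is worth testing on the actual workload.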
