Coarse localization: bounding boxes cannot reach pixel-level accuracy.
Inability to ground comprehensively: bounding boxes cannot ground backgrounds.
Tendency to provide trivial information: because bounding box annotation leaves annotators great freedom, current datasets often box parts such as a head, producing trivial relations like person-has-head.
Duplicate groundings: the same object can be grounded by multiple separate bounding boxes.

Positional Relations (6) | over, in front of, beside, on, in, attached to. |
Common Object-Object Relations (5) | hanging from, on the back of, falling off, going down, painted on. |
Common Actions (31) | walking on, running on, crossing, standing on, lying on, sitting on, leaning on, flying over, jumping over, jumping from, wearing, holding, carrying, looking at, guiding, kissing, eating, drinking, feeding, biting, catching, picking (grabbing), playing with, chasing, climbing, cleaning (washing, brushing), playing, touching, pushing, pulling, opening. |
Human Actions (4) | cooking, talking to, throwing (tossing), slicing. |
Actions in Traffic Scene (4) | driving, riding, parked on, driving on. |
Actions in Sports Scene (3) | about to hit, kicking, swinging. |
Interaction between Background (3) | entering, exiting, enclosing (surrounding, wrapping in). |
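
The taxonomy in the table can be transcribed into a small lookup structure, which also makes it easy to check that each category lists as many predicates as its stated count. The following is a minimal sketch, not an official data format: the names `PREDICATE_GROUPS`, `DECLARED_COUNTS`, and `group_of` are illustrative, and it assumes that the terms in parentheses (for example "grabbing" for "picking") are aliases rather than separate predicates, so they are omitted.

```python
# Predicate taxonomy transcribed from the table above.
# Assumption: parenthesized terms in the table are aliases and are left out.
PREDICATE_GROUPS = {
    "Positional Relations": [
        "over", "in front of", "beside", "on", "in", "attached to",
    ],
    "Common Object-Object Relations": [
        "hanging from", "on the back of", "falling off", "going down",
        "painted on",
    ],
    "Common Actions": [
        "walking on", "running on", "crossing", "standing on", "lying on",
        "sitting on", "leaning on", "flying over", "jumping over",
        "jumping from", "wearing", "holding", "carrying", "looking at",
        "guiding", "kissing", "eating", "drinking", "feeding", "biting",
        "catching", "picking", "playing with", "chasing", "climbing",
        "cleaning", "playing", "touching", "pushing", "pulling", "opening",
    ],
    "Human Actions": ["cooking", "talking to", "throwing", "slicing"],
    "Actions in Traffic Scene": ["driving", "riding", "parked on", "driving on"],
    "Actions in Sports Scene": ["about to hit", "kicking", "swinging"],
    "Interaction between Background": ["entering", "exiting", "enclosing"],
}

# Per-category counts as stated in the table, used as a transcription check.
DECLARED_COUNTS = {
    "Positional Relations": 6,
    "Common Object-Object Relations": 5,
    "Common Actions": 31,
    "Human Actions": 4,
    "Actions in Traffic Scene": 4,
    "Actions in Sports Scene": 3,
    "Interaction between Background": 3,
}

# Flat list of all predicates, in table order.
ALL_PREDICATES = [p for preds in PREDICATE_GROUPS.values() for p in preds]


def group_of(predicate):
    """Return the category name of a predicate, or None if it is not listed."""
    for group, preds in PREDICATE_GROUPS.items():
        if predicate in preds:
            return group
    return None


# Every category must list exactly as many predicates as the table declares.
for name, preds in PREDICATE_GROUPS.items():
    assert len(preds) == DECLARED_COUNTS[name], name
```

For example, `group_of("parked on")` returns `"Actions in Traffic Scene"`, and `len(ALL_PREDICATES)` equals the sum of the seven declared counts. Keeping the declared counts separate from the lists means a transcription slip shows up as a failed assertion rather than a silently wrong taxonomy.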