Multi-Person 3D Motion Prediction with Multi-Range Transformers

Jiashun Wang¹

Huazhe Xu²

Medhini Narasimhan²

Xiaolong Wang¹

UC San Diego¹

UC Berkeley²

[arXiv]

[code]

Abstract:

We propose a novel framework for multi-person 3D motion trajectory prediction. Our key observation is that a person's actions and behaviors may depend heavily on the other people nearby. Thus, instead of predicting each human pose trajectory in isolation, we introduce a Multi-Range Transformers model, which consists of a local-range encoder for individual motion and a global-range encoder for social interactions. The Transformer decoder then performs prediction for each person by taking the corresponding pose as a query that attends to both local-range and global-range encoder features. Our model not only outperforms state-of-the-art methods on long-term 3D motion prediction, but also generates diverse social interactions. More interestingly, our model can even predict the motion of 15 persons simultaneously by automatically dividing them into different interaction groups.

Methods:

Each person's input motion is sent to the Local-range Transformer Encoder, and the motions of all persons are sent to the Global-range Transformer Encoder. The encoded motion features serve as the key and value for the Transformer Decoder, and the skeleton of the query person serves as the query. The output is the predicted future motion. On the right, we show the architecture of the Transformer decoder. The encoder architecture is similar to that of the decoder, except that the query, key and value all come from the same input.
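The data flow described above can be illustrated with a minimal NumPy sketch. It is not the released implementation: it uses a single attention head, no learned projections, no positional encodings and no feed-forward layers, and the function names (`attention`, `encode`, `predict_person`) are hypothetical. Concatenating the local and global features into one key/value memory is also an assumption made for brevity; the text only states that the query attends to both.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    # Scaled dot-product attention: query (Tq, d), key and value (Tk, d).
    scores = query @ key.T / np.sqrt(query.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ value, weights

def encode(x):
    # Encoder-style self-attention: query, key and value share one input.
    out, _ = attention(x, x, x)
    return x + out  # residual connection

def predict_person(motions, person, query):
    # motions: (N, T, d) embedded past motion of N persons over T frames.
    # query:   (1, d) embedded pose of the person being predicted.
    n, t, d = motions.shape
    local_feat = encode(motions[person])              # (T, d), one person
    global_feat = encode(motions.reshape(n * t, d))   # (N*T, d), everyone
    # Assumption: one shared memory holding both feature sets.
    memory = np.concatenate([local_feat, global_feat], axis=0)
    return attention(query, memory, memory)

rng = np.random.default_rng(0)
N, T, D = 3, 5, 8                    # persons, frames, feature size (arbitrary)
motions = rng.normal(size=(N, T, D))
query = motions[0, -1:]              # last observed pose of person 0
out, weights = predict_person(motions, 0, query)
```

Here `out` has one row per query pose, and `weights` holds one attention weight for each of the `T` local and `N * T` global feature vectors, which shows how a single query can draw on both individual history and the motion of everyone in the scene.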

Qualitative Results:

We show the results of our method. Green represents the input and blue represents the output. The results show that our method can predict smooth and natural multi-person 3D motion.

We compare our method with other methods. Green represents the input and blue represents the output. Our results are not only the closest to the real recordings but also smooth and natural. The RNN-based method (SocialPool) quickly produces freezing motion. When predicting absolute skeleton joint positions, decoding from an input seed sequence (HRI) or adding a residual from the input sequence to the output (LTD) causes the predicted motion to exhibit hysteresis and repeat the history.