Wearable motion understanding

AnyMo

Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild

Baiyu Chen¹,², Zechen Li¹, Wilson Wongso¹,², Lihuan Li¹,², Xiachong Lin¹, Hao Xue¹,²,³,⁴, Benjamin Tag¹, Flora Salim¹,²

¹The University of New South Wales

²ARC Centre of Excellence for Automated Decision Making + Society

³The Hong Kong University of Science and Technology (Guangzhou)

⁴The Hong Kong University of Science and Technology

AnyMo learns wearable IMU motion representations that transfer across sensing setups and datasets, while connecting sparse wearable signals to open-vocabulary recognition, retrieval, and motion captioning.

+11.7%

Accuracy

average relative zero-shot HAR gain over the strongest baseline across 14 unseen datasets

+28.6%

Text-to-IMU MRR

stronger bidirectional motion-language retrieval

+18.8%

BERT-F1

zero-shot wearable IMU captioning improvement

Method families for wearable human motion understanding, and a radar plot comparing AnyMo with baselines across tasks and capabilities.

Why AnyMo?

Real wearable IMU data are scarce and fragmented across different sensing setups.

AnyMo uses body-surface geometry to synthesize diverse plausible IMU observations from motion and mesh data.

Wearable models can overfit to sensing setup instead of the underlying motion.

Masked cross-view prediction encourages the model to recover shared motion from sparse, setup-specific observations.

Raw IMU streams are long and inefficient for direct LLM input.

AnyMo compresses multi-position wearable signals into compact full-body motion tokens.

Closed-set HAR labels are hard to transfer across datasets and do not support diverse downstream tasks.

AnyMo aligns IMU and language to support recognition, retrieval, and motion caption generation.

Overview

Wearable setup variation is structured, not arbitrary.

The signal measured by a wearable IMU is produced by the interaction of body motion, body-surface geometry, local sensor orientation, and device response. This structure explains why a wristwatch, a pair of glasses, or a phone in a pocket can observe the same activity through very different inertial patterns.

AnyMo uses this structure as an inductive bias. It simulates dense plausible IMU candidates over body-surface placements, pre-trains a spatio-temporal graph encoder from paired placement views and masked sparse observations, then converts setup-stable motion latents into compact full-body IMU tokens for motion-language modeling.

Physics-grounded surface simulation

Local surface frames from body normals and tangent planes define plausible wearable placements, orientations, and noisy IMU signals on the body mesh.
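
As a minimal sketch of this idea, the Python snippet below builds a right-handed local frame (two tangent axes plus the outward normal) from a single surface normal and expresses a world-frame vector in that frame. The function names (`surface_frame`, `to_local`) and the helper-axis construction are illustrative assumptions, not AnyMo's actual simulation code, which also models placements, device response, and noise.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero-length vector")
    return tuple(c / norm for c in v)

def surface_frame(normal):
    """Return (t1, t2, n): two unit tangent axes and the unit normal.

    Hypothetical construction: the world axis least aligned with the
    normal seeds the first tangent, so the cross product is never zero.
    """
    n = normalize(normal)
    axes = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
    helper = min(axes, key=lambda e: abs(dot(e, n)))
    t1 = normalize(cross(helper, n))  # lies in the tangent plane
    t2 = cross(n, t1)                 # completes a right-handed frame
    return t1, t2, n

def to_local(frame, vec_world):
    """Project a world-frame vector onto the axes of a local frame."""
    return tuple(dot(axis, vec_world) for axis in frame)

# Example: a placement whose normal points straight up.
frame = surface_frame((0.0, 0.0, 1.0))
# A vector along the world z-axis has only a normal component here.
print(to_local(frame, (0.0, 0.0, -9.81)))
```

The same world-frame motion yields different local readings as the normal changes, which is the setup variation the simulation is meant to cover.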

Setup-agnostic representation learning

Paired placement views and masked sparse observations train a graph encoder to recover full-body motion structure from setup-specific IMU windows.
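
To make the data side of this objective concrete, the sketch below samples two sparse "views" of the same motion window, each with a few visible placements and the rest treated as masked targets. It covers only the sampling step; the encoder, the loss, and the actual sampling strategy are not specified on this page, so `sample_sparse_view` and its parameters are hypothetical.

```python
import random

def sample_sparse_view(num_placements, num_visible, rng):
    """Pick a few visible placements; everything else is a masked target."""
    visible = sorted(rng.sample(range(num_placements), num_visible))
    visible_set = set(visible)
    masked = [i for i in range(num_placements) if i not in visible_set]
    return visible, masked

rng = random.Random(0)  # fixed seed so the example is repeatable

# 2,374 is the number of body-surface IMU positions listed under Resources;
# using all of them per window, and 3 visible sensors, are assumptions.
view_a, targets_a = sample_sparse_view(2374, 3, rng)
view_b, targets_b = sample_sparse_view(2374, 3, rng)

print("view A sees placements", view_a)
print("view B sees placements", view_b)
```

Because both views come from the same underlying motion, a model trained to predict the masked placements from either view is pushed toward a representation that does not depend on which sensors happened to be worn.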

Full-body tokenization and alignment

The learned motion representation is quantized into compact full-body IMU tokens, then aligned with language for recognition, retrieval, and captioning.
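
One common way to turn continuous latents into discrete tokens is nearest-neighbor vector quantization against a codebook. The sketch below illustrates that generic technique under the assumption of a squared Euclidean distance; the page does not state which quantizer AnyMo uses, and the codebook and latents here are toy values.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    tokens = []
    for z in latents:
        distances = [squared_distance(z, c) for c in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens

# Toy 2-D codebook with three entries (illustrative only).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
latents = [(0.9, 0.1), (0.1, 0.8), (0.2, 0.1)]

print(quantize(latents, codebook))  # [1, 2, 0]
```

Each window is then represented by a short sequence of integer token ids instead of a long raw IMU stream, which is what makes it practical as language-model input.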

Interactive geometry

Explore body-surface tangent planes.

The visualization shows template mesh points, segment placements, and the local surface geometry used to construct plausible wearable sensor setups. The spike vectors are surface normals, while the translucent squares are local tangent planes.

Normals show the outward direction of each candidate surface placement.

Tangent planes show the local surface frame used to orient simulated wearable IMUs.

Interactive motion examples

Switch between real motion windows to inspect how body-surface IMU positions, tangent planes, normals, and IMU trajectories evolve over time.

Playing badminton

A dynamic in-the-wild sports motion showing how local IMU trajectories vary across body-surface placements.

Method

From surface-aware simulation to language.

Physics-Grounded Geometry-Aware Simulation

Geometry-Aware Setup-Agnostic Pre-Training and Tokenization

Tokenization and Pre-Training Detail

Contrastive Instruction Tuning and Inference

Quantitative Results

AnyMo improves recognition, retrieval, and captioning.

The tables summarize the main benchmark results from the paper. Purple bold marks the best result, while purple underline marks the second-best result.

Zero-Shot HAR Comparison

Recognition performance across easy, medium, and hard datasets.

Method	Metric	Opportunity	UCI-HAR	w-HAR	RealWorld	TNDA-HAR	EgoExo4D	OpenPack	PAMAP2	USC-HAD	WISDM	DSADS	UTD-MHAD	Ego4D	MMEA	Average
Number of Classes		4	6	7	8	8	8	10	12	12	18	19	27	31	32
Level		Easy	Easy	Easy	Easy	Easy	Easy	Medium	Medium	Medium	Medium	Medium	Hard	Hard	Hard	Average
ImageBind	Acc	59.3	14.0	8.2	17.4	22.6	8.2	11.6	12.8	11.2	6.9	6.2	2.3	2.8	4.1	13.4
	F1	39.6	6.7	6.1	10.4	18.6	8.5	8.3	7.7	6.9	4.4	3.6	0.8	0.8	1.3	8.8
	R@2	83.5	20.2	52.5	27.4	35.3	12.5	20.7	15.5	23.5	13.5	8.2	4.7	6.4	7.6	23.7
IMU2CLIP	Acc	47.3	30.1	36.1	27.3	26.8	10.9	9.5	14.0	14.8	12.4	11.5	3.7	1.8	5.3	18.0
	F1	37.6	27.3	20.9	20.5	22.3	5.9	4.5	11.3	11.6	8.3	8.3	0.4	0.4	3.1	13.0
	R@2	69.0	55.1	60.7	43.9	46.2	18.5	19.5	26.6	29.1	20.4	17.6	7.9	3.1	10.8	30.6
IMUGPT	Acc	10.1	1.1	67.2	16.9	14.3	12.5	11.4	8.9	6.0	8.3	7.5	3.7	4.6	3.4	12.6
	F1	10.4	0.3	38.8	4.0	6.1	7.6	9.1	1.5	6.9	6.6	2.0	0.3	1.9	1.8	7.0
	R@2	33.7	18.2	67.2	33.7	28.5	27.4	22.9	19.3	31.8	13.9	14.6	8.8	9.3	6.7	24.0
HARGPT	Acc	28.8	15.0	4.9	12.7	13.7	16.1	10.2	11.1	9.5	5.5	5.8	3.3	3.6	2.4	10.2
	F1	17.3	12.7	3.1	5.3	5.4	12.0	5.5	2.1	3.6	3.5	3.4	1.5	1.1	0.9	5.5
	R@2	47.0	31.4	11.5	31.7	25.2	32.0	22.7	23.0	17.6	11.8	12.1	9.3	7.8	6.3	20.7
UniMTS	Acc	45.9	35.2	59.0	43.6	59.1	23.1	11.5	47.2	30.5	27.8	31.5	22.8	3.7	6.1	31.9
	F1	42.2	22.0	42.9	36.7	53.7	18.4	7.5	43.6	27.8	25.5	23.7	18.5	4.3	2.8	26.4
	R@2	80.0	53.1	60.7	64.0	77.5	47.0	21.9	63.2	45.4	47.1	46.0	32.6	6.9	10.7	46.9
NormWear	Acc	26.0	3.7	3.3	16.8	12.2	10.5	9.8	7.9	8.7	4.4	0.7	3.7	2.0	2.7	8.0
	F1	10.3	1.6	1.3	3.8	2.8	3.1	3.1	1.6	1.4	0.9	0.1	0.3	0.2	0.3	2.2
	R@2	66.1	29.6	3.3	20.4	19.1	20.2	16.5	10.5	12.2	12.4	2.6	7.4	5.8	6.0	16.6
Gemma 4 26B Text	Acc	35.9	29.4	29.5	30.8	28.2	18.0	9.6	13.3	29.7	11.6	10.7	4.7	2.4	5.6	18.5
	F1	23.5	19.9	12.7	18.8	19.3	11.8	7.6	7.4	12.4	8.0	7.6	2.0	1.2	1.9	11.0
	R@2	68.4	60.5	31.1	58.9	53.1	40.7	19.4	31.1	42.7	20.8	22.0	10.2	6.7	9.4	33.9
Gemma 4 26B Plot	Acc	39.8	29.8	27.9	31.2	27.8	24.8	7.7	19.4	32.9	9.4	11.5	5.6	3.5	4.4	19.7
	F1	33.0	21.6	10.6	23.6	15.1	11.9	4.5	10.8	14.0	6.1	7.5	1.2	1.3	2.0	11.7
	R@2	74.4	59.6	32.8	58.1	50.1	24.8	18.2	35.4	46.8	16.9	21.5	10.7	7.9	7.9	33.2
AnyMo	Acc	59.4	56.5	57.4	48.4	59.4	30.2	13.1	52.6	27.7	25.4	36.3	16.3	8.6	8.1	35.7 (+11.7%)
	F1	58.8	51.6	42.2	37.2	53.1	24.1	11.6	41.5	22.6	18.6	29.5	11.3	6.3	4.0	29.5 (+11.6%)
	R@2	83.5	89.5	98.4	77.6	87.9	51.6	28.0	78.2	64.0	41.1	53.0	24.2	13.7	13.9	57.5 (+22.6%)
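
The relative gain shown in the Average column can be reproduced from the per-dataset accuracies above. The short Python check below averages the AnyMo and UniMTS Acc rows and computes the relative improvement. Treating UniMTS (the strongest baseline by average accuracy in this table) as the reference is an inference from the numbers, and the variable names are illustrative.

```python
# Acc rows copied from the table above, in column order
# (Opportunity ... MMEA).
anymo_acc = [59.4, 56.5, 57.4, 48.4, 59.4, 30.2, 13.1,
             52.6, 27.7, 25.4, 36.3, 16.3, 8.6, 8.1]
unimts_acc = [45.9, 35.2, 59.0, 43.6, 59.1, 23.1, 11.5,
              47.2, 30.5, 27.8, 31.5, 22.8, 3.7, 6.1]

def mean(values):
    return sum(values) / len(values)

def relative_gain_percent(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

anymo_avg = mean(anymo_acc)
unimts_avg = mean(unimts_acc)
gain = relative_gain_percent(anymo_avg, unimts_avg)

print(f"{anymo_avg:.1f} vs {unimts_avg:.1f}: +{gain:.1f}%")
# prints: 35.7 vs 31.9: +11.7%
```

The gain is computed from the unrounded averages; plugging in the rounded values 35.7 and 31.9 gives roughly +11.9% instead.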

Cross-Modal Retrieval

Unseen and zero-shot retrieval on Nymeria held-out and EgoExo4D.

Dataset	Nymeria Held-out								EgoExo4D Zero-shot
Method	100 Samples				All Samples				100 Samples				All Samples
	R@1	R@5	R@10	MRR	R@1	R@5	R@10	MRR	R@1	R@5	R@10	MRR	R@1	R@5	R@10	MRR
IMU -> Text
ImageBind	0.0	6.0	14.0	5.0	0.1	0.2	0.3	0.3	1.0	5.0	8.0	4.6	0.1	0.2	0.3	0.3
IMU2CLIP	1.0	6.0	12.0	5.5	0.0	0.1	0.3	0.3	2.0	10.0	23.0	8.2	0.0	0.3	0.5	0.4
UniMTS	4.0	12.0	23.0	10.0	0.2	0.9	1.6	0.9	1.0	9.0	16.0	6.3	0.1	0.6	1.3	0.7
GPT-5.4 Mini	1.0	7.0	11.0	4.4	--	--	--	--	1.0	6.0	10.0	3.7	--	--	--	--
Gemma 4 26B	2.0	9.0	16.0	6.1	--	--	--	--	2.0	4.0	12.0	4.6	--	--	--	--
AnyMo	28.0	63.0	77.0	44.6	2.3	9.5	15.4	7.0	2.0	9.0	27.0	9.5	0.2	0.7	1.4	0.8
Text -> IMU
ImageBind	1.0	8.0	14.0	6.7	0.1	0.2	0.3	0.3	2.0	3.0	7.0	5.1	0.0	0.0	0.2	0.2
IMU2CLIP	0.0	6.0	14.0	5.0	0.1	0.2	0.3	0.3	1.0	9.0	17.0	7.7	0.1	0.3	0.5	0.4
UniMTS	1.0	6.0	12.0	5.5	0.1	0.2	0.4	0.3	1.0	5.0	10.0	5.3	0.0	0.1	0.3	0.2
GPT-5.4 Mini	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--
Gemma 4 26B	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--
AnyMo	33.0	60.0	79.0	46.7	3.0	9.9	16.1	7.5	3.0	10.0	23.0	9.9	0.0	0.3	0.6	0.4
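
For readers less familiar with the retrieval metrics in this table, here is a minimal Python sketch of R@k and MRR computed from the 1-indexed rank of the correct match for each query. These are the standard definitions; reporting them as percentages is an assumption that matches the scale of the table, and the example ranks are made up for illustration rather than taken from AnyMo.

```python
def rank_of_match(similarities, correct_index):
    """1-indexed rank of the correct candidate (higher similarity is better)."""
    target = similarities[correct_index]
    return 1 + sum(1 for s in similarities if s > target)

def recall_at_k(ranks, k):
    """Percentage of queries whose correct match is ranked in the top k."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over all queries, expressed as a percentage."""
    return 100.0 * sum(1.0 / r for r in ranks) / len(ranks)

# One query: candidate 2 is correct and one other candidate scores higher.
print(rank_of_match([0.1, 0.9, 0.7, 0.3], correct_index=2))  # 2

# Four hypothetical queries with these ranks for the correct match.
example_ranks = [1, 3, 12, 2]
print(recall_at_k(example_ranks, 1))                   # 25.0
print(recall_at_k(example_ranks, 5))                   # 75.0
print(round(mean_reciprocal_rank(example_ranks), 1))   # 47.9
```

With 100 queries, each correctly retrieved item moves R@k by exactly 1.0, which is why the 100 Samples columns take whole-number values.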

Wearable IMU Motion Captioning

Unseen and zero-shot caption generation on Nymeria and EgoExo4D.

Method	Nymeria Held-out					EgoExo4D Zero-shot
	BLEU-1	BLEU-4	ROUGE-L	METEOR	BERT-F1	BLEU-1	BLEU-4	ROUGE-L	METEOR	BERT-F1
GPT-5.4 Mini	19.2	0.3	15.7	25.0	57.3	12.6	0.0	15.5	23.8	56.5
Gemma 4 26B	16.2	0.0	13.6	21.5	56.5	3.5	0.0	4.6	6.4	55.1
AnyMo	25.0	6.5	31.1	33.5	69.7	20.7	0.4	19.7	30.3	67.1

Qualitative Results

Visual evidence of geometry-grounded transfer.

These qualitative views show how AnyMo aligns synthetic and real wearable motion in the learned space, and how the model turns sparse IMU signals into full-body motion-language examples.

Real-synthetic alignment

UMAP visualization of paired real and synthetic IMU embeddings for ten activity categories.

Motion captioning examples

Qualitative results of wearable IMU motion caption generation. Green highlights correct parts and red highlights mistakes.

Resources

AnyMo-180, AnyMo Bench and synthetic data.

We curate AnyMo-180, a fine-grained activity-label vocabulary for Nymeria motion windows, and build dense body-surface IMU placements for geometry-aware simulation. Together with AnyMo Bench, these form one of the largest fine-grained IMU-based HAR training corpora and benchmarks for unseen-subject and cross-device evaluation.

180

AnyMo-180 activity classes

158,138

Labeled motion windows

2,374

Body-surface IMU positions

AnyMo Bench

A challenging fine-grained in-the-wild HAR benchmark.