Wearable motion understanding
AnyMo
Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild
Baiyu Chen1,2, Zechen Li1, Wilson Wongso1,2, Lihuan Li1,2, Xiachong Lin1, Hao Xue1,2,3,4, Benjamin Tag1, Flora Salim1,2
1The University of New South Wales
2ARC Centre of Excellence for Automated Decision Making + Society
3The Hong Kong University of Science and Technology (Guangzhou)
4The Hong Kong University of Science and Technology
AnyMo learns wearable IMU motion representations that transfer across sensing setups and datasets, while connecting sparse wearable signals to open-vocabulary recognition, retrieval, and motion captioning.
average zero-shot HAR gain across 14 unseen datasets
stronger bidirectional motion-language retrieval
zero-shot wearable IMU captioning improvement

Overview
Wearable setup variation is structured, not arbitrary.
The signal measured by a wearable IMU is produced by the interaction of body motion, body-surface geometry, local sensor orientation, and device response. This structure explains why a wristwatch, a pair of glasses, or a phone in a pocket can observe the same activity through very different inertial patterns.
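As a rough illustration of how these four factors combine, the textbook rigid-body IMU measurement model can be written as below. This is an assumed standard formulation, not necessarily the one used in AnyMo; the symbols $\mathbf{p}$, $\mathbf{R}_{ws}$, $\mathbf{g}$, $\mathbf{b}$, and $\mathbf{n}$ are introduced here only for illustration.

```latex
\begin{aligned}
\mathbf{a}_{\text{meas}}(t) &= \mathbf{R}_{ws}(t)^{\top}\big(\ddot{\mathbf{p}}(t) - \mathbf{g}\big) + \mathbf{b}_a + \mathbf{n}_a(t),\\
\boldsymbol{\omega}_{\text{meas}}(t) &= \boldsymbol{\omega}_{s}(t) + \mathbf{b}_g + \mathbf{n}_g(t).
\end{aligned}
```

Here $\mathbf{p}(t)$ is the world-frame position of the mounting point on the body surface (body motion and surface geometry), $\mathbf{R}_{ws}(t)$ is the sensor-to-world rotation (local sensor orientation), $\mathbf{g}$ is the world-frame gravity vector, $\boldsymbol{\omega}_{s}(t)$ is the sensor's angular velocity expressed in its own frame, and the bias and noise terms stand in for device response. Moving the sensor to another placement changes $\mathbf{p}$ and $\mathbf{R}_{ws}$, so the same activity yields different readings.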
AnyMo uses this structure as an inductive bias. It simulates dense plausible IMU candidates over body-surface placements, pre-trains a spatio-temporal graph encoder from paired placement views and masked sparse observations, then converts setup-stable motion latents into compact full-body IMU tokens for motion-language modeling.
Physics-grounded surface simulation
Local surface frames from body normals and tangent planes define plausible wearable placements, orientations, and noisy IMU signals on the body mesh.
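A minimal sketch of how a local surface frame could be built from a mesh normal, and how a plausible sensor orientation could be sampled from it, is shown below. The function names `tangent_frame` and `sample_sensor_orientation` are hypothetical, and the uniform in-plane rotation is an assumption rather than the sampling rule used by AnyMo.

```python
import numpy as np

def tangent_frame(normal):
    """Return a 3x3 rotation whose columns are (tangent, bitangent, normal)."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    # Use the world axis least aligned with n so the cross product is never degenerate.
    helper = np.eye(3)[np.argmin(np.abs(n))]
    t = np.cross(helper, n)
    t = t / np.linalg.norm(t)
    b = np.cross(n, t)  # completes a right-handed frame: t x b = n
    return np.stack([t, b, n], axis=1)

def sample_sensor_orientation(normal, rng):
    """Spin the tangent frame by a random angle about the surface normal (assumed uniform)."""
    frame = tangent_frame(normal)
    yaw = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(yaw), np.sin(yaw)
    spin = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return frame @ spin  # third column is still the surface normal
```

The first two columns of the returned matrix span the tangent plane and the third column is the normal, which matches the squares and spikes described for the visualization.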
Setup-agnostic representation learning
Paired placement views and masked sparse observations train a graph encoder to recover full-body motion structure from setup-specific IMU windows.
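The data side of this step can be sketched as follows: draw two placement views of the same motion window from the dense simulated candidates, then hide most body segments so only a sparse setup remains. The array layout, the `keep_ratio` parameter, and zero-filling of hidden segments are assumptions made for illustration; the source does not specify these details.

```python
import numpy as np

def sample_paired_views(dense_imu, keep_ratio, rng):
    """dense_imu: (segments, placements, time, channels) simulated IMU candidates."""
    num_seg, num_place = dense_imu.shape[:2]
    seg_idx = np.arange(num_seg)
    views = []
    for _ in range(2):
        # Each view picks one candidate placement per body segment.
        choice = rng.integers(0, num_place, size=num_seg)
        views.append(dense_imu[seg_idx, choice])  # (segments, time, channels)
    # Keep only a sparse subset of segments visible, shared by both views.
    num_keep = max(1, int(round(keep_ratio * num_seg)))
    visible = np.zeros(num_seg, dtype=bool)
    visible[rng.choice(num_seg, size=num_keep, replace=False)] = True
    masked = [np.where(visible[:, None, None], v, 0.0) for v in views]
    return masked, visible
```

An encoder trained on such pairs would see the same motion through two different setups and would be asked to account for the hidden segments, which is the intuition behind the description above.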
Full-body tokenization and alignment
The learned motion representation is quantized into compact full-body IMU tokens, then aligned with language for recognition, retrieval, and captioning.
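To make "quantized into tokens" concrete, the sketch below shows plain nearest-neighbor vector quantization against a codebook. This is a generic technique chosen for illustration; the source does not state which quantizer AnyMo uses, and the `quantize` helper and its shapes are hypothetical.

```python
import numpy as np

def quantize(latents, codebook):
    """latents: (n, d) motion latents; codebook: (k, d) code vectors.

    Returns the index of the nearest code for each latent and the matching code vectors.
    """
    # Pairwise squared Euclidean distances via broadcasting: (n, k).
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = d2.argmin(axis=1)
    return tokens, codebook[tokens]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
latents = np.array([[0.1, -0.1], [3.5, 4.2], [0.9, 1.2]])
tokens, quantized = quantize(latents, codebook)
print(tokens)  # [0 2 1]
```

The integer ids are what a language model can consume as discrete tokens alongside text.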
Interactive geometry
Explore body-surface tangent planes.
The visualization shows template mesh points, segment placements, and the local surface geometry used to construct plausible wearable sensor setups. The spike vectors are surface normals, while the translucent squares are local tangent planes.
Method
From surface-aware simulation to language.
Physics-Grounded Geometry-Aware Simulation

Geometry-Aware Setup-Agnostic Pre-Training and Tokenization

Tokenization and Pre-Training Detail

Contrastive Instruction Tuning and Inference

Quantitative Results
AnyMo improves recognition, retrieval, and captioning.
The tables summarize the main benchmark results from the paper. Purple bold marks the best result, while purple underline marks the second-best result.
Zero-Shot HAR Comparison
Recognition performance across easy, medium, and hard datasets.
| Method | Metric | Opportunity | UCI-HAR | w-HAR | RealWorld | TNDA-HAR | EgoExo4D | OpenPack | PAMAP2 | USC-HAD | WISDM | DSADS | UTD-MHAD | Ego4D | MMEA | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Number of Classes | | 4 | 6 | 7 | 8 | 8 | 8 | 10 | 12 | 12 | 18 | 19 | 27 | 31 | 32 | |
| Level | | Easy | Easy | Easy | Easy | Easy | Easy | Medium | Medium | Medium | Medium | Medium | Hard | Hard | Hard | |
| ImageBind | Acc | 59.3 | 14.0 | 8.2 | 17.4 | 22.6 | 8.2 | 11.6 | 12.8 | 11.2 | 6.9 | 6.2 | 2.3 | 2.8 | 4.1 | 13.4 |
| ImageBind | F1 | 39.6 | 6.7 | 6.1 | 10.4 | 18.6 | 8.5 | 8.3 | 7.7 | 6.9 | 4.4 | 3.6 | 0.8 | 0.8 | 1.3 | 8.8 |
| ImageBind | R@2 | 83.5 | 20.2 | 52.5 | 27.4 | 35.3 | 12.5 | 20.7 | 15.5 | 23.5 | 13.5 | 8.2 | 4.7 | 6.4 | 7.6 | 23.7 |
| IMU2CLIP | Acc | 47.3 | 30.1 | 36.1 | 27.3 | 26.8 | 10.9 | 9.5 | 14.0 | 14.8 | 12.4 | 11.5 | 3.7 | 1.8 | 5.3 | 18.0 |
| IMU2CLIP | F1 | 37.6 | 27.3 | 20.9 | 20.5 | 22.3 | 5.9 | 4.5 | 11.3 | 11.6 | 8.3 | 8.3 | 0.4 | 0.4 | 3.1 | 13.0 |
| IMU2CLIP | R@2 | 69.0 | 55.1 | 60.7 | 43.9 | 46.2 | 18.5 | 19.5 | 26.6 | 29.1 | 20.4 | 17.6 | 7.9 | 3.1 | 10.8 | 30.6 |
| IMUGPT | Acc | 10.1 | 1.1 | 67.2 | 16.9 | 14.3 | 12.5 | 11.4 | 8.9 | 6.0 | 8.3 | 7.5 | 3.7 | 4.6 | 3.4 | 12.6 |
| IMUGPT | F1 | 10.4 | 0.3 | 38.8 | 4.0 | 6.1 | 7.6 | 9.1 | 1.5 | 6.9 | 6.6 | 2.0 | 0.3 | 1.9 | 1.8 | 7.0 |
| IMUGPT | R@2 | 33.7 | 18.2 | 67.2 | 33.7 | 28.5 | 27.4 | 22.9 | 19.3 | 31.8 | 13.9 | 14.6 | 8.8 | 9.3 | 6.7 | 24.0 |
| HARGPT | Acc | 28.8 | 15.0 | 4.9 | 12.7 | 13.7 | 16.1 | 10.2 | 11.1 | 9.5 | 5.5 | 5.8 | 3.3 | 3.6 | 2.4 | 10.2 |
| HARGPT | F1 | 17.3 | 12.7 | 3.1 | 5.3 | 5.4 | 12.0 | 5.5 | 2.1 | 3.6 | 3.5 | 3.4 | 1.5 | 1.1 | 0.9 | 5.5 |
| HARGPT | R@2 | 47.0 | 31.4 | 11.5 | 31.7 | 25.2 | 32.0 | 22.7 | 23.0 | 17.6 | 11.8 | 12.1 | 9.3 | 7.8 | 6.3 | 20.7 |
| UniMTS | Acc | 45.9 | 35.2 | 59.0 | 43.6 | 59.1 | 23.1 | 11.5 | 47.2 | 30.5 | 27.8 | 31.5 | 22.8 | 3.7 | 6.1 | 31.9 |
| UniMTS | F1 | 42.2 | 22.0 | 42.9 | 36.7 | 53.7 | 18.4 | 7.5 | 43.6 | 27.8 | 25.5 | 23.7 | 18.5 | 4.3 | 2.8 | 26.4 |
| UniMTS | R@2 | 80.0 | 53.1 | 60.7 | 64.0 | 77.5 | 47.0 | 21.9 | 63.2 | 45.4 | 47.1 | 46.0 | 32.6 | 6.9 | 10.7 | 46.9 |
| NormWear | Acc | 26.0 | 3.7 | 3.3 | 16.8 | 12.2 | 10.5 | 9.8 | 7.9 | 8.7 | 4.4 | 0.7 | 3.7 | 2.0 | 2.7 | 8.0 |
| NormWear | F1 | 10.3 | 1.6 | 1.3 | 3.8 | 2.8 | 3.1 | 3.1 | 1.6 | 1.4 | 0.9 | 0.1 | 0.3 | 0.2 | 0.3 | 2.2 |
| NormWear | R@2 | 66.1 | 29.6 | 3.3 | 20.4 | 19.1 | 20.2 | 16.5 | 10.5 | 12.2 | 12.4 | 2.6 | 7.4 | 5.8 | 6.0 | 16.6 |
| Gemma 4 26B Text | Acc | 35.9 | 29.4 | 29.5 | 30.8 | 28.2 | 18.0 | 9.6 | 13.3 | 29.7 | 11.6 | 10.7 | 4.7 | 2.4 | 5.6 | 18.5 |
| Gemma 4 26B Text | F1 | 23.5 | 19.9 | 12.7 | 18.8 | 19.3 | 11.8 | 7.6 | 7.4 | 12.4 | 8.0 | 7.6 | 2.0 | 1.2 | 1.9 | 11.0 |
| Gemma 4 26B Text | R@2 | 68.4 | 60.5 | 31.1 | 58.9 | 53.1 | 40.7 | 19.4 | 31.1 | 42.7 | 20.8 | 22.0 | 10.2 | 6.7 | 9.4 | 33.9 |
| Gemma 4 26B Plot | Acc | 39.8 | 29.8 | 27.9 | 31.2 | 27.8 | 24.8 | 7.7 | 19.4 | 32.9 | 9.4 | 11.5 | 5.6 | 3.5 | 4.4 | 19.7 |
| Gemma 4 26B Plot | F1 | 33.0 | 21.6 | 10.6 | 23.6 | 15.1 | 11.9 | 4.5 | 10.8 | 14.0 | 6.1 | 7.5 | 1.2 | 1.3 | 2.0 | 11.7 |
| Gemma 4 26B Plot | R@2 | 74.4 | 59.6 | 32.8 | 58.1 | 50.1 | 24.8 | 18.2 | 35.4 | 46.8 | 16.9 | 21.5 | 10.7 | 7.9 | 7.9 | 33.2 |
| AnyMo | Acc | 59.4 | 56.5 | 57.4 | 48.4 | 59.4 | 30.2 | 13.1 | 52.6 | 27.7 | 25.4 | 36.3 | 16.3 | 8.6 | 8.1 | 35.7 (+11.7%) |
| AnyMo | F1 | 58.8 | 51.6 | 42.2 | 37.2 | 53.1 | 24.1 | 11.6 | 41.5 | 22.6 | 18.6 | 29.5 | 11.3 | 6.3 | 4.0 | 29.5 (+11.6%) |
| AnyMo | R@2 | 83.5 | 89.5 | 98.4 | 77.6 | 87.9 | 51.6 | 28.0 | 78.2 | 64.0 | 41.1 | 53.0 | 24.2 | 13.7 | 13.9 | 57.5 (+22.6%) |
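For readers who want to reproduce the three metrics in this table, the sketch below computes them for embedding-based zero-shot classification. It assumes that predictions come from cosine similarity between IMU embeddings and label-text embeddings, that R@2 means the correct label is among the top two ranked labels, and that F1 is macro-averaged over all classes; the paper's exact protocol may differ, and `zero_shot_metrics` is a hypothetical helper.

```python
import numpy as np

def zero_shot_metrics(imu_emb, label_emb, targets, k=2):
    """imu_emb: (windows, d); label_emb: (classes, d); targets: (windows,) class ids."""
    imu = imu_emb / np.linalg.norm(imu_emb, axis=1, keepdims=True)
    lab = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    sims = imu @ lab.T                    # cosine similarity, (windows, classes)
    ranked = np.argsort(-sims, axis=1)    # best-matching class first
    pred = ranked[:, 0]
    acc = float(np.mean(pred == targets))
    recall_at_k = float(np.mean((ranked[:, :k] == targets[:, None]).any(axis=1)))
    f1s = []
    for c in range(label_emb.shape[0]):
        tp = np.sum((pred == c) & (targets == c))
        fp = np.sum((pred == c) & (targets != c))
        fn = np.sum((pred != c) & (targets == c))
        denom = 2 * tp + fp + fn
        # Classes that never occur and are never predicted count as F1 = 0 here.
        f1s.append(2 * tp / denom if denom else 0.0)
    return acc, float(np.mean(f1s)), recall_at_k
```

Because macro-F1 weights every class equally, it tends to sit below accuracy on datasets with many rare classes, which is consistent with the gap between the Acc and F1 rows on the Hard datasets.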
Cross-Modal Retrieval
Unseen and zero-shot retrieval on Nymeria held-out and EgoExo4D.
| Dataset | Nymeria Held-out | | | | | | | | EgoExo4D Zero-shot | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Method | 100 Samples | | | | All Samples | | | | 100 Samples | | | | All Samples | | | |
| | R@1 | R@5 | R@10 | MRR | R@1 | R@5 | R@10 | MRR | R@1 | R@5 | R@10 | MRR | R@1 | R@5 | R@10 | MRR |
| IMU -> Text | ||||||||||||||||
| ImageBind | 0.0 | 6.0 | 14.0 | 5.0 | 0.1 | 0.2 | 0.3 | 0.3 | 1.0 | 5.0 | 8.0 | 4.6 | 0.1 | 0.2 | 0.3 | 0.3 |
| IMU2CLIP | 1.0 | 6.0 | 12.0 | 5.5 | 0.0 | 0.1 | 0.3 | 0.3 | 2.0 | 10.0 | 23.0 | 8.2 | 0.0 | 0.3 | 0.5 | 0.4 |
| UniMTS | 4.0 | 12.0 | 23.0 | 10.0 | 0.2 | 0.9 | 1.6 | 0.9 | 1.0 | 9.0 | 16.0 | 6.3 | 0.1 | 0.6 | 1.3 | 0.7 |
| GPT-5.4 Mini | 1.0 | 7.0 | 11.0 | 4.4 | -- | -- | -- | -- | 1.0 | 6.0 | 10.0 | 3.7 | -- | -- | -- | -- |
| Gemma 4 26B | 2.0 | 9.0 | 16.0 | 6.1 | -- | -- | -- | -- | 2.0 | 4.0 | 12.0 | 4.6 | -- | -- | -- | -- |
| AnyMo | 28.0 | 63.0 | 77.0 | 44.6 | 2.3 | 9.5 | 15.4 | 7.0 | 2.0 | 9.0 | 27.0 | 9.5 | 0.2 | 0.7 | 1.4 | 0.8 |
| Text -> IMU | ||||||||||||||||
| ImageBind | 1.0 | 8.0 | 14.0 | 6.7 | 0.1 | 0.2 | 0.3 | 0.3 | 2.0 | 3.0 | 7.0 | 5.1 | 0.0 | 0.0 | 0.2 | 0.2 |
| IMU2CLIP | 0.0 | 6.0 | 14.0 | 5.0 | 0.1 | 0.2 | 0.3 | 0.3 | 1.0 | 9.0 | 17.0 | 7.7 | 0.1 | 0.3 | 0.5 | 0.4 |
| UniMTS | 1.0 | 6.0 | 12.0 | 5.5 | 0.1 | 0.2 | 0.4 | 0.3 | 1.0 | 5.0 | 10.0 | 5.3 | 0.0 | 0.1 | 0.3 | 0.2 |
| GPT-5.4 Mini | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| Gemma 4 26B | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| AnyMo | 33.0 | 60.0 | 79.0 | 46.7 | 3.0 | 9.9 | 16.1 | 7.5 | 3.0 | 10.0 | 23.0 | 9.9 | 0.0 | 0.3 | 0.6 | 0.4 |
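The retrieval columns can be computed from a query-by-candidate similarity matrix as sketched below. The sketch assumes one ground-truth candidate per query, located on the diagonal; `retrieval_metrics` is a hypothetical helper, not code from the AnyMo release.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """sim[i, j]: similarity of query i to candidate j; the match for query i is candidate i."""
    order = np.argsort(-sim, axis=1)  # candidates sorted best-first for each query
    # 1-based rank at which each query's own match appears.
    ranks = np.argmax(order == np.arange(sim.shape[0])[:, None], axis=1) + 1
    out = {f"R@{k}": float(np.mean(ranks <= k)) for k in ks}
    out["MRR"] = float(np.mean(1.0 / ranks))
    return out
```

Swapping the roles of rows and columns (passing `sim.T`) gives the opposite retrieval direction. For scale, a uniformly random ranking over a 100-sample pool would yield an expected R@1 of 1%, R@5 of 5%, and R@10 of 10%.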
Wearable IMU Motion Captioning
Unseen and zero-shot caption generation on Nymeria and EgoExo4D.
| Method | Nymeria Held-out | | | | | EgoExo4D Zero-shot | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | BLEU-1 | BLEU-4 | ROUGE-L | METEOR | BERT-F1 | BLEU-1 | BLEU-4 | ROUGE-L | METEOR | BERT-F1 |
| GPT-5.4 Mini | 19.2 | 0.3 | 15.7 | 25.0 | 57.3 | 12.6 | 0.0 | 15.5 | 23.8 | 56.5 |
| Gemma 4 26B | 16.2 | 0.0 | 13.6 | 21.5 | 56.5 | 3.5 | 0.0 | 4.6 | 6.4 | 55.1 |
| AnyMo | 25.0 | 6.5 | 31.1 | 33.5 | 69.7 | 20.7 | 0.4 | 19.7 | 30.3 | 67.1 |
Qualitative Results
Visual evidence of geometry-grounded transfer.
These qualitative views show how AnyMo aligns synthetic and real wearable motion in the learned space, and how the model turns sparse IMU signals into full-body motion-language examples.
Real-synthetic alignment

Motion captioning examples

Resources
AnyMo-180, AnyMo Bench and synthetic data.
We curate AnyMo-180, a fine-grained activity-label vocabulary for Nymeria motion windows, and build dense body-surface IMU placements for geometry-aware simulation. Together with AnyMo Bench, these form one of the largest fine-grained IMU-based HAR training corpora and benchmarks for unseen-subject and cross-device evaluation.
AnyMo Bench
A challenging fine-grained in-the-wild HAR benchmark.
AnyMo Bench evaluates recognition under two forms of generalization: fine-grained daily activities on unseen subjects, and cross-device transfer between co-located IMU units mounted at the head, left wrist, and right wrist.
Baseline Results on AnyMo Bench
Purple bold marks the best result in each setting, while purple underline marks the second-best result.
| Model | Acc@1 | Acc@5 | Macro-F1 |
|---|---|---|---|
| Fine150 / Unseen Subject | |||
| DeepConvLSTM | 35.3 | 63.0 | 17.2 |
| MantisV2 | 38.5 | 65.2 | 22.8 |
| COMODO | 37.8 | 65.2 | 16.0 |
| Fine150 / Unseen Subject + Cross Device | |||
| DeepConvLSTM | 1.9 | 9.5 | 0.4 |
| MantisV2 | 14.4 | 39.7 | 8.6 |
| COMODO | 24.0 | 50.6 | 8.0 |
| Core50 / Unseen Subject | |||
| DeepConvLSTM | 43.2 | 75.4 | 34.5 |
| MantisV2 | 45.8 | 76.8 | 41.3 |
| COMODO | 46.2 | 78.8 | 37.3 |
| Core50 / Unseen Subject + Cross Device | |||
| DeepConvLSTM | 1.8 | 12.1 | 0.6 |
| MantisV2 | 16.6 | 48.7 | 18.4 |
| COMODO | 32.6 | 67.8 | 23.3 |
@article{chen2026anymo,
  title={AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild},
  author={Chen, Baiyu and Li, Zechen and Wongso, Wilson and Li, Lihuan and Lin, Xiachong and Xue, Hao and Tag, Benjamin and Salim, Flora},
  journal={arXiv preprint arXiv:2605.22715},
  year={2026}
}


