InfoAtlas: A Foundation-style Model for Zero-Shot Statistical Dependency Measurement

Hu, Zhengyang; Chen, Yanzhi; Ren, Hanxiang; Zeng, Qunsong; Zheng, Youyi; Weller, Adrian; Huang, Kaibin; Yang, Yanchao

ICML 2026 Accepted


Zhengyang Hu¹*, Yanzhi Chen²,³*, Hanxiang Ren⁴, Qunsong Zeng¹, Youyi Zheng⁴,
Adrian Weller², Kaibin Huang¹, Yanchao Yang¹†

¹The University of Hong Kong ²University of Cambridge ³Microsoft ⁴Zhejiang University

*Equal contribution †Corresponding author

Need fast, accurate dependence estimation?

InfoAtlas is a pretrained model that, after one-time training on a vast atlas of synthetic distributions, estimates the dependency between any (x, y) pair in a single forward pass. Samples in, dependence out — and it stays differentiable.


300× faster than neural MI estimators

all-order: linear, non-linear, non-Gaussian, and discrete dependence, all caught

1 model: handles any d_x, d_y, n with zero retraining

0 gradient steps at test time: one forward pass

Figure: existing neural MI estimators (left) train a network per dataset by gradient descent; InfoAtlas (right) is pretrained once and emits MI from samples in a single forward pass.

Abstract

Neural mutual information (MI) estimators are accurate but slow: each new dataset triggers its own optimization run. InfoAtlas removes that step. Pretrained once on a large synthetic atlas of dependence structures, it infers MI in a single forward pass — matching state-of-the-art accuracy at ~300× the speed, on inputs of varying dimension and sample size, with strong zero-shot transfer to real data.

Method

A dual-path attentive hypernetwork, pretrained on a synthetic atlas of dependence structures.

Given samples from an unknown joint distribution, InfoAtlas emits the parameters of a near-optimal Donsker–Varadhan critic in a single forward pass. A joint path attends over paired samples (x, y); a marginal path shuffles the pairing to model independence. Cross-attention fuses the two, and a small MLP decodes the critic weights.
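To make the Donsker-Varadhan (DV) objective concrete, here is a minimal sketch of how a DV estimate is computed once a critic T is available: MI ≥ E_joint[T(x, y)] − log E_product[exp(T(x, y'))], where the product term is approximated by shuffling the pairing. This is not the InfoAtlas network. The critic below is the closed-form log density ratio of a standard bivariate Gaussian, chosen only because its true MI is known, and the names `dv_lower_bound` and `gaussian_log_ratio` are hypothetical.

```python
import math
import random


def dv_lower_bound(critic, xs, ys, rng):
    """Donsker-Varadhan estimate: E_joint[T] - log E_product[exp(T)]."""
    n = len(xs)
    # Joint term: evaluate the critic on the paired samples (x_i, y_i).
    joint_term = sum(critic(x, y) for x, y in zip(xs, ys)) / n
    # Product term: shuffle y to break the pairing, approximating p(x)p(y).
    ys_shuffled = ys[:]
    rng.shuffle(ys_shuffled)
    t = [critic(x, y) for x, y in zip(xs, ys_shuffled)]
    # Numerically stable log-mean-exp.
    t_max = max(t)
    log_mean_exp = t_max + math.log(sum(math.exp(v - t_max) for v in t) / n)
    return joint_term - log_mean_exp


def gaussian_log_ratio(rho):
    """Optimal critic (log density ratio) for a standard bivariate Gaussian."""
    def critic(x, y):
        quad = (rho * rho * (x * x + y * y) - 2.0 * rho * x * y) / (
            2.0 * (1.0 - rho * rho)
        )
        return -0.5 * math.log(1.0 - rho * rho) - quad
    return critic


rng = random.Random(0)
rho, n = 0.8, 20000
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0) for x in xs]

true_mi = -0.5 * math.log(1.0 - rho * rho)  # closed form for this Gaussian
estimate = dv_lower_bound(gaussian_log_ratio(rho), xs, ys, rng)
zero_critic_estimate = dv_lower_bound(lambda x, y: 0.0, xs, ys, rng)
print(f"true MI = {true_mi:.3f}, DV estimate = {estimate:.3f}")
```

A constant critic gives a bound of exactly zero, and the bound tightens as the critic approaches the log density ratio. Per-dataset estimators search for that critic by gradient descent, whereas the hypernetwork described above predicts its parameters directly.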

Variable-shape inputs

Inputs with fewer dimensions than the trained range are padded with independent Gaussian noise, a transformation that provably preserves MI, and attention handles varying sample sizes natively. For dimensions beyond the trained range, we plug InfoAtlas into k-sliced MI [1], which is itself a well-studied dependence measure.
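The invariance under noise padding can be checked in a few lines. The sketch below assumes that padding means concatenating noise coordinates, that the noise vectors ε and η are independent of each other and of (X, Y), and that all densities exist; the symbols ε and η are introduced here for illustration.

```latex
\begin{aligned}
p(x,\varepsilon,y,\eta) &= p(x,y)\,p(\varepsilon)\,p(\eta), \qquad
p(x,\varepsilon) = p(x)\,p(\varepsilon), \qquad
p(y,\eta) = p(y)\,p(\eta), \\[4pt]
I\big((X,\varepsilon);(Y,\eta)\big)
 &= \mathbb{E}\!\left[\log
    \frac{p(x,y)\,p(\varepsilon)\,p(\eta)}
         {p(x)\,p(\varepsilon)\,p(y)\,p(\eta)}\right]
  = \mathbb{E}\!\left[\log \frac{p(x,y)}{p(x)\,p(y)}\right]
  = I(X;Y).
\end{aligned}
```

The noise densities cancel in the ratio, so the padded pair carries exactly the same mutual information as the original pair.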

[1] Goldfeld, Greenewald, Nuradha, Reeves. k-Sliced Mutual Information: A Quantitative Study of Scalability with Dimension. NeurIPS 2022.
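As a rough illustration of the slicing idea, k-sliced MI averages the MI between random k-dimensional projections of x and y. The sketch below is a Monte Carlo version under stated assumptions: `estimate_mi` is a placeholder for any low-dimensional estimator (InfoAtlas would fill that role), and `gaussian_mi_1d` is a hypothetical Gaussian plug-in used only so the example runs without a trained model. None of these function names come from the InfoAtlas code.

```python
import math
import random


def random_orthonormal_frame(d, k, rng):
    """Return k orthonormal vectors in R^d (Gram-Schmidt on Gaussian draws)."""
    frame = []
    while len(frame) < k:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in frame:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip the (unlikely) degenerate draw
            frame.append([a / norm for a in v])
    return frame


def project(samples, frame):
    """Project each d-dimensional sample onto the k frame vectors."""
    return [[sum(a * b for a, b in zip(s, u)) for u in frame] for s in samples]


def k_sliced_mi(xs, ys, k, num_slices, estimate_mi, rng):
    """Average estimate_mi over random k-dimensional projections of x and y."""
    dx, dy = len(xs[0]), len(ys[0])
    total = 0.0
    for _ in range(num_slices):
        fx = random_orthonormal_frame(dx, k, rng)
        fy = random_orthonormal_frame(dy, k, rng)
        total += estimate_mi(project(xs, fx), project(ys, fy))
    return total / num_slices


def gaussian_mi_1d(xs, ys):
    """Stand-in estimator for k = 1: -0.5 * log(1 - r^2) from sample correlation."""
    a = [x[0] for x in xs]
    b = [y[0] for y in ys]
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    r2 = min(cov * cov / (va * vb), 1.0 - 1e-12)
    return -0.5 * math.log(1.0 - r2)


rng = random.Random(0)
n, d = 2000, 3
xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
ys_dep = [[v + 0.5 * rng.gauss(0.0, 1.0) for v in x] for x in xs]  # dependent on x
ys_ind = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]  # independent

smi_dep = k_sliced_mi(xs, ys_dep, 1, 50, gaussian_mi_1d, rng)
smi_ind = k_sliced_mi(xs, ys_ind, 1, 50, gaussian_mi_1d, rng)
print(f"sliced MI (dependent) = {smi_dep:.3f}, (independent) = {smi_ind:.4f}")
```

The number of slices trades accuracy for compute, which is why a fast per-slice estimator matters: each slice costs one call to `estimate_mi`.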

The shift: prior neural estimators optimize a critic per dataset. InfoAtlas infers it — minutes become milliseconds.

Results

From synthetic benchmarks to real-world data: one checkpoint, zero fine-tuning.

Method	Mn-dense 5-5-0.5	Spiral 3-3-2-2.0	Asinh@St 5-5-2	St 3-3-3	Uniform 3-3-2-2.0	Hc@Mn 5-5-2	Additive 1-1-0.1	Bimodal 1-1-0.75	Time (s)
GT	0.59	1.02	0.45	0.18	1.02	1.02	1.71	0.41	—
KSG	0.54	0.75	0.25	0.07	0.79	0.58	1.61	0.41	0.13
KDE	1.59	2.87	2.43	2.36	1.17	2.23	2.94	1.23	2.04
MINE	0.60	1.00	0.53	0.21	1.03	1.06	1.63	0.39	25.9
MINE-5s	0.60	0.90	0.33	0.15	0.93	1.06	1.61	0.38	4.92
MINDE	0.58	0.92	0.43	0.36	0.89	1.01	1.42	0.50	34.2
InfoNCE	0.56	0.98	0.49	0.18	0.97	1.03	1.62	0.40	67.6
KNIFE	0.93	0.10	0.66	0.50	0.07	0.92	0.05	0.65	48.4
InfoAtlas	0.60	0.89	0.41	0.21	0.93	0.96	1.46	0.39	0.09

Bold: closest to ground truth. Underlined: second-best. Legend: neural and non-parametric methods.
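One way to read the table is to reduce each row to a single error figure and a time ratio. The sketch below transcribes the table values and computes the mean absolute error against ground truth and each method's runtime relative to InfoAtlas. Mean absolute error is a summary chosen here for illustration; it is not a metric reported in the table.

```python
# Values transcribed from the benchmark table above: 8 task estimates, then time (s).
ground_truth = [0.59, 1.02, 0.45, 0.18, 1.02, 1.02, 1.71, 0.41]
results = {
    "KSG":       ([0.54, 0.75, 0.25, 0.07, 0.79, 0.58, 1.61, 0.41], 0.13),
    "KDE":       ([1.59, 2.87, 2.43, 2.36, 1.17, 2.23, 2.94, 1.23], 2.04),
    "MINE":      ([0.60, 1.00, 0.53, 0.21, 1.03, 1.06, 1.63, 0.39], 25.9),
    "MINE-5s":   ([0.60, 0.90, 0.33, 0.15, 0.93, 1.06, 1.61, 0.38], 4.92),
    "MINDE":     ([0.58, 0.92, 0.43, 0.36, 0.89, 1.01, 1.42, 0.50], 34.2),
    "InfoNCE":   ([0.56, 0.98, 0.49, 0.18, 0.97, 1.03, 1.62, 0.40], 67.6),
    "KNIFE":     ([0.93, 0.10, 0.66, 0.50, 0.07, 0.92, 0.05, 0.65], 48.4),
    "InfoAtlas": ([0.60, 0.89, 0.41, 0.21, 0.93, 0.96, 1.46, 0.39], 0.09),
}


def mean_abs_error(estimates):
    """Mean absolute deviation from the ground-truth MI across the 8 tasks."""
    return sum(abs(e - g) for e, g in zip(estimates, ground_truth)) / len(ground_truth)


mae = {name: mean_abs_error(est) for name, (est, _) in results.items()}
reference_time = results["InfoAtlas"][1]
time_ratio = {name: t / reference_time for name, (_, t) in results.items()}

for name in sorted(mae, key=mae.get):
    print(f"{name:10s} MAE = {mae[name]:.3f}   time / InfoAtlas = {time_ratio[name]:.1f}x")
```

The time ratio for the per-dataset neural estimators is where the headline speedup figure comes from.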

Figure: CLIP vs. InfoNet (1-sliced) vs. InfoAtlas (5-sliced).

Figure: MI heatmaps with respect to reference points 1 and 2 (★ marks the reference point).

Method	Pick Cube (Seen)	Pick Cube (Unseen)	Stack Cube (Seen)	Stack Cube (Unseen)	Peg Insertion (Seen)	Peg Insertion (Unseen)	Time (s)
No-MI-Loss	66.0	60.0	67.4	41.0	38.6	9.3	—
MINE-100	86.4	81.0	68.0	37.0	55.0	13.5	0.62
MINE-1000	81.2	81.0	61.2	37.0	65.4	17.8	6.01
InfoNet	91.0	76.0	63.0	27.0	46.4	9.8	1.23
InfoAtlas	94.2	82.0	68.2	37.0	72.4	18.3	2.17

Policy success rate (%) on 100-D state observations. InfoAtlas uses 25 slices; InfoNet uses 250 slices.

Sample size n	1k	5k	10k	20k	50k
Time (s)	0.039	0.049	0.055	0.070	0.106
GPU memory (GB)	0.21	0.55	0.98	1.81	4.33

Speed averaged over batch size 16 on a single H200 GPU.
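The scaling numbers are easier to interpret as growth ratios. This short sketch, using only the values from the table above, compares how time and memory grow relative to the sample size; the variable names are illustrative.

```python
# Values transcribed from the scaling table above.
sample_sizes = [1_000, 5_000, 10_000, 20_000, 50_000]
time_s = [0.039, 0.049, 0.055, 0.070, 0.106]
memory_gb = [0.21, 0.55, 0.98, 1.81, 4.33]

# Growth from the smallest to the largest setting.
size_growth = sample_sizes[-1] / sample_sizes[0]
time_growth = time_s[-1] / time_s[0]
memory_growth = memory_gb[-1] / memory_gb[0]

# Marginal cost of each additional 1k samples between consecutive rows.
rows = list(zip(sample_sizes, time_s))
marginal_ms_per_1k = [
    (t1 - t0) * 1000.0 / ((n1 - n0) / 1000.0)
    for (n0, t0), (n1, t1) in zip(rows, rows[1:])
]

print(f"samples x{size_growth:.0f} -> time x{time_growth:.2f}, memory x{memory_growth:.1f}")
print("marginal ms per extra 1k samples:", [round(m, 2) for m in marginal_ms_per_1k])
```

Over this range, runtime grows far more slowly than the sample count, while memory grows closer to proportionally.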

Sanity check on the BMI benchmark (Czyz et al., 2023): 8 tasks with known ground-truth MI. InfoAtlas matches the best neural baselines at ~0.09 s per task — about 300× faster.

BibTeX

Please cite the ICML 2026 version once the proceedings are out; for now, the arXiv preprint suffices.

@article{hu2026infoatlas,
  title         = {{InfoAtlas}: A Foundation-style Model for Zero-Shot
                   Statistical Dependency Measurement},
  author        = {Hu, Zhengyang and Chen, Yanzhi and Ren, Hanxiang and Zeng, Qunsong
                   and Zheng, Youyi and Weller, Adrian and Huang, Kaibin and Yang, Yanchao},
  journal       = {arXiv preprint arXiv:2606.00241},
  year          = {2026},
  eprint        = {2606.00241},
  archivePrefix = {arXiv},
  note          = {To appear at ICML 2026. Equal contribution: Zhengyang Hu and Yanzhi Chen.}
}