
A new AI model, H-CAST, groups fine details into object-level concepts as attention moves from lower to higher layers, outputting a classification tree (such as bird, eagle, bald eagle) rather than focusing only on fine-grained recognition.
The research was presented at the International Conference on Learning Representations in Singapore and builds upon the team's prior model, CAST, its counterpart for visually grounded single-level classification. The paper is also published on the arXiv preprint server.
While some argue that deep learning can reliably provide fine-grained classification and infer broader categories from it, this tactic only works with clear images.
“Real-world applications involve plenty of imperfect images. If a model only focuses on fine-grained classification, it gives up before it even starts on images that don’t have enough information to support that level of detail,” said Stella Yu, a professor of computer science and engineering at U-M and contributing author of the study.
Hierarchical classification overcomes this issue, providing classification at multiple levels of detail for the same image. However, up to this point, hierarchical models have struggled with inconsistencies that come with treating each level as its own classification task.
For example, when identifying a bird, fine-grained classification often depends on local details like beak shape or feather color, while coarse labels require global features like overall shape. When these two levels are disconnected, it can result in a fine classifier predicting “green parakeet” while the coarse classifier predicts “plant.”
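Concretely, this kind of inconsistency can be caught with a simple taxonomy check: a fine prediction is only consistent if its ancestor in the label hierarchy matches the coarse prediction. The parent map and labels below are hypothetical, illustrative values, not the datasets' actual taxonomy.

```python
# Hypothetical taxonomy: each fine label maps to its parent coarse label.
PARENT = {
    "green parakeet": "bird",
    "bald eagle": "bird",
    "fern": "plant",
}

def is_consistent(fine_pred: str, coarse_pred: str) -> bool:
    """A (fine, coarse) prediction pair is consistent only if the
    fine label's parent in the taxonomy equals the coarse label."""
    return PARENT.get(fine_pred) == coarse_pred

# The failure mode described above: disconnected classifiers disagree.
print(is_consistent("green parakeet", "plant"))  # False -> inconsistent
print(is_consistent("green parakeet", "bird"))   # True  -> consistent
```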
The new model instead focuses all levels on the same object at different levels of detail by aligning fine-to-coarse predictions through intra-image segmentation.
Previous hierarchical models trained from coarse to specific, following the logic of semantic labeling, which flows from general to specific (e.g., bird, hummingbird, green hermit). H-CAST instead trains in the visual space, where recognition begins with fine details like beaks and wings that are then grouped into coarser structures, leading to better alignment and accuracy.
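As a rough sketch of this fine-to-coarse idea, the code below grounds both a fine and a coarse classification head in the same segment features, merging fine segments into coarser groups for the coarse head. This is not H-CAST's actual architecture; the dimensions, class counts, and segment assignments are placeholders.

```python
import torch
import torch.nn as nn

class FineToCoarseHeads(nn.Module):
    """Toy module: the fine head reads features pooled over fine segments,
    while the coarse head reads features pooled over merged segment groups,
    so both levels are grounded in the same image regions."""
    def __init__(self, dim=256, n_fine=200, n_coarse=10):
        super().__init__()
        self.fine_head = nn.Linear(dim, n_fine)
        self.coarse_head = nn.Linear(dim, n_coarse)

    def forward(self, seg_feats, group_ids):
        # seg_feats: (num_segments, dim) features, one per fine segment
        # group_ids: (num_segments,) coarse group index for each fine segment
        fine_logits = self.fine_head(seg_feats.mean(dim=0))
        n_groups = int(group_ids.max()) + 1
        # Merge fine segments into coarser regions by averaging their features.
        group_feats = torch.stack(
            [seg_feats[group_ids == g].mean(dim=0) for g in range(n_groups)]
        )
        coarse_logits = self.coarse_head(group_feats.mean(dim=0))
        return fine_logits, coarse_logits

feats = torch.randn(6, 256)                 # six fine segments (made up)
groups = torch.tensor([0, 0, 1, 1, 1, 2])   # merged into three coarse regions
fine_logits, coarse_logits = FineToCoarseHeads()(feats, groups)
```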
“Most prior work in hierarchical classification focused on semantics alone, but we found that consistent visual grounding across levels can make a huge difference. By encouraging models to ‘see’ the hierarchy in a visually coherent way, we hope this work inspires a shift toward more integrated and interpretable recognition systems,” said Seulki Park, a postdoctoral research fellow of computer science and engineering at the University of Michigan and lead author of the study.
Unlike prior methods, the research team leveraged unsupervised segmentation, typically used to identify structures within a larger image, to support hierarchical classification. They demonstrated that its visual grouping mechanism can be effectively applied to classification without requiring pixel-level labels, and that it helps improve segmentation quality.
To demonstrate the new model's effectiveness, H-CAST was tested on four benchmark datasets and compared against hierarchical models (FGN, HRN, TransHP, Hier-ViT) and baseline models (ViT, CAST, HiE).
“Our model outperformed zero-shot CLIP and state-of-the-art baselines on hierarchical classification benchmarks, achieving both higher accuracy and more consistent predictions,” said Yu.
For instance, on the BREEDS dataset, H-CAST's full-path accuracy was 6% higher than the previous state of the art and 11% higher than the baselines.
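Full-path accuracy counts a prediction as correct only when every level of the hierarchy is right, which is stricter than measuring each level separately. A minimal sketch of the metric (the example paths are illustrative):

```python
def full_path_accuracy(preds, targets):
    """Fraction of samples whose predicted path matches the true path
    at every level of the hierarchy."""
    correct = sum(p == t for p, t in zip(preds, targets))
    return correct / len(targets)

preds   = [("bird", "eagle", "bald eagle"), ("bird", "parrot", "macaw")]
targets = [("bird", "eagle", "bald eagle"), ("bird", "parrot", "green parakeet")]
print(full_path_accuracy(preds, targets))  # 0.5: second path wrong at the fine level
```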
Feature-level nearest neighbor analysis also shows H-CAST retrieves semantically and visually consistent samples across hierarchy levels—unlike prior models that often retrieve visually similar but semantically incorrect samples.
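In spirit, such an analysis amounts to retrieving the closest samples in feature space and then checking their labels at each level of the hierarchy. A minimal cosine-similarity sketch, using random placeholder features rather than real model outputs:

```python
import numpy as np

def nearest_neighbors(query, bank, k=5):
    """Return indices of the k most similar feature vectors by cosine similarity."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ q
    return np.argsort(-sims)[:k]

bank = np.random.randn(1000, 256)   # placeholder feature bank
query = np.random.randn(256)        # placeholder query feature
idx = nearest_neighbors(query, bank)
# Checking whether the retrieved samples share labels at every hierarchy level
# is what separates consistent retrieval from mere visual similarity.
```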
This work could potentially be applied to any situation that requires multi-level understanding of images. It could particularly benefit wildlife monitoring, identifying species where possible but falling back on coarser predictions. H-CAST can also help autonomous vehicles interpret imperfect visual input, like occluded pedestrians or distant vehicles, helping the system make safe, approximate decisions at coarser levels of detail.
“Humans naturally fall back on coarser concepts. If I can’t tell if an image is of a Pembroke Corgi, I can still confidently say it’s a dog. But models often fail at that kind of flexible reasoning. We hope to eventually build a system that can adapt its prediction level just like we do,” said Park.
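The adaptive fallback Park describes is future work rather than part of the published model, but it could look like the following sketch: report the finest prediction whose confidence clears a threshold, otherwise back off to a coarser level. The threshold, labels, and probabilities here are hypothetical.

```python
def adaptive_prediction(level_probs, labels_per_level, threshold=0.8):
    """Walk from the finest level to the coarsest and return the first
    prediction whose confidence clears the threshold."""
    for probs, labels in zip(reversed(level_probs), reversed(labels_per_level)):
        conf = max(probs)
        if conf >= threshold:
            return labels[probs.index(conf)], conf
    # Nothing is confident: fall back to the coarsest level's best guess.
    probs, labels = level_probs[0], labels_per_level[0]
    conf = max(probs)
    return labels[probs.index(conf)], conf

# Blurry image: unsure among dog breeds, but confident it's a dog.
coarse_probs, coarse_labels = [0.95, 0.05], ["dog", "cat"]
fine_probs, fine_labels = [0.40, 0.35, 0.25], ["Pembroke Corgi", "Cardigan Corgi", "Shiba Inu"]
label, conf = adaptive_prediction([coarse_probs, fine_probs], [coarse_labels, fine_labels])
print(label, conf)  # "dog", 0.95 -> falls back to the coarse level
```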
H-CAST was trained and tested using ARC High Performance Computing at U-M.
UC Berkeley, MIT and Scaled Foundations also contributed to this research.
More information:
Seulki Park et al., Visually Consistent Hierarchical Image Classification, International Conference on Learning Representations (2025).
Seulki Park et al., Visually Consistent Hierarchical Image Classification, arXiv (2024). DOI: 10.48550/arxiv.2406.11608