This repo includes supplementary data for the paper ["A Grounded Typology of Word Classes"][1].
Here you'll find word-level groundedness measures based on PaliGemma for the COCO-35L, Multi30k, and Crossmodal-3600 datasets. POS tagging is based on [Stanza][2].
If you wish to compute groundedness scores on your own data, you'll need the [`paligemma-3b-ft-coco35l-224`][3] checkpoint from Hugging Face for the captioning model. For the language model, please use our trained, comparable model: [`chaley22/pali-captioning-lm-nolora`][4].
The model covers the 35 languages in COCO-35L:
- Arabic (ar)
- Bengali (bn)
- Czech (cs)
- Danish (da)
- German (de)
- English (en)
- Spanish (es)
- Persian (fa)
- Finnish (fi)
- French (fr)
- Hebrew (he)
- Hindi (hi)
- Croatian (hr)
- Hungarian (hu)
- Indonesian (id)
- Italian (it)
- Japanese (ja)
- Korean (ko)
- Norwegian (no)
- Dutch (nl)
- Polish (pl)
- Portuguese (pt)
- Romanian (ro)
- Russian (ru)
- Swedish (sv)
- Swahili (sw)
- Maori (mi)
- Telugu (te)
- Thai (th)
- Turkish (tr)
- Ukrainian (uk)
- Vietnamese (vi)
- Chinese (zh)
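If you are scoring your own data, the following is a minimal sketch of how the outputs of the two models might be combined. It assumes that a token's groundedness is its log-probability under the captioning model (which sees the image) minus its log-probability under the text-only language model, that both models score the same token sequence, and that a word's score is the sum over its tokens. These are assumptions for illustration and may differ from the exact procedure used in the paper. The function names and the toy log-probabilities are hypothetical; extracting real per-token log-probabilities from the checkpoints is not shown.

```python
from typing import List, Sequence, Tuple


def token_groundedness(
    cap_logprobs: Sequence[float], lm_logprobs: Sequence[float]
) -> List[float]:
    """Per-token score: log p(token | image, context) - log p(token | context).

    Assumed definition, not taken from this repo.
    """
    if len(cap_logprobs) != len(lm_logprobs):
        raise ValueError("both models must score the same token sequence")
    return [cap - lm for cap, lm in zip(cap_logprobs, lm_logprobs)]


def word_groundedness(
    words: Sequence[str], word_ids: Sequence[int], token_scores: Sequence[float]
) -> List[Tuple[str, float]]:
    """Sum token scores into word scores.

    word_ids[i] is the index (into words) of the word that token i belongs to.
    """
    if len(word_ids) != len(token_scores):
        raise ValueError("need exactly one word id per token score")
    totals = [0.0] * len(words)
    for word_index, score in zip(word_ids, token_scores):
        totals[word_index] += score
    return list(zip(words, totals))


# Toy example (hypothetical numbers): "running" is split into two tokens.
words = ["a", "dog", "running"]
word_ids = [0, 1, 2, 2]
cap_logprobs = [-0.5, -1.0, -0.75, -0.25]  # captioning model, conditioned on the image
lm_logprobs = [-0.5, -4.0, -2.25, -0.5]    # text-only language model

scores = word_groundedness(
    words, word_ids, token_groundedness(cap_logprobs, lm_logprobs)
)
for word, score in scores:
    print(f"{word}\t{score:.2f}")
```

Under this definition, a word that the image makes much more predictable (here "dog") gets a high score, while a word that is equally predictable with or without the image (here "a") scores near zero.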
[1]: https://arxiv.org/abs/2412.10369
[2]: https://stanfordnlp.github.io/stanza/
[3]: https://huggingface.co/google/paligemma-3b-ft-coco35l-224
[4]: https://huggingface.co/chaley22/pali-captioning-lm-nolora