Training a Bilingual Language Model by Mapping Tokens onto a Shared Character Space | HackerNoon
Briefly

We train a bilingual Arabic-Hebrew language model on a version of the Arabic texts transliterated into Hebrew script, so that both languages are represented in the same character space (a minimal sketch of such a mapping appears below).
The results are promising: our model outperforms a baseline model that keeps the Arabic texts in the Arabic script, demonstrating the efficacy of the transliteration step.
Despite being trained on a dataset approximately 60% smaller than those of other existing language models, our model appears to deliver comparable machine-translation performance in both translation directions.
It has been shown that language models generalize better on multilingual tasks when the target languages share structural similarity, possibly due to script similarity.
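To make the shared-script idea concrete, here is a minimal sketch of how Arabic text could be mapped onto Hebrew characters. The letter correspondences, the `ARABIC_TO_HEBREW` table, and the `transliterate` function are illustrative assumptions for demonstration only; the paper's actual transliteration scheme may differ.

```python
# Illustrative character-level mapping from Arabic script to Hebrew script.
# This simplified table is an assumption, not the authors' exact scheme.
ARABIC_TO_HEBREW = {
    "ا": "א", "ب": "ב", "ت": "ת", "ث": "ת'", "ج": "ג'",
    "ح": "ח", "خ": "ח'", "د": "ד", "ذ": "ד'", "ر": "ר",
    "ز": "ז", "س": "ס", "ش": "ש", "ص": "צ", "ض": "צ'",
    "ط": "ט", "ظ": "ט'", "ع": "ע", "غ": "ע'", "ف": "פ",
    "ق": "ק", "ك": "כ", "ل": "ל", "م": "מ", "ن": "נ",
    "ه": "ה", "و": "ו", "ي": "י", "ة": "ה", "ء": "א",
}

# Characters not in the table (spaces, punctuation, Hebrew text) pass
# through unchanged, so the same pipeline can ingest both languages.
_TABLE = str.maketrans(ARABIC_TO_HEBREW)

def transliterate(text: str) -> str:
    """Map Arabic-script text onto Hebrew characters, leaving other text as-is."""
    return text.translate(_TABLE)

if __name__ == "__main__":
    print(transliterate("كتاب"))  # -> כתאב ("book"), now rendered in Hebrew script
```

With a mapping like this, Arabic and Hebrew training data end up over one alphabet, letting a single tokenizer and embedding table serve both languages.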