Open data collection announced to build "Kaz LLM", a large language model for the Kazakh language
To build and launch a large language model for the Kazakh language, National Information Technologies JSC (NITEC JSC) is starting to use Hugging Face, a leading machine learning platform. An open data collection will be organized on the platform, and Kazakhstan's professional IT community and open data holders are invited to join.
What is Hugging Face?
Hugging Face is a leading platform for sharing machine learning research, where users can develop tools and build AI models. Its users work with open source code, which makes artificial intelligence more accessible and fosters a culture of knowledge sharing and progress. Hugging Face hosts AI models that other companies use in their work, including Google, Microsoft Corp, Amazon, Meta Platforms Inc and others. As of 2023, more than 1.2 million users were registered on the platform, and nearly 30 million people visited the site in January 2024 alone. Residents of the USA, China, Japan and India are among the most active users of the service.
What is it for?
Creating a modern language model for the Kazakh language is an important step towards strengthening Kazakhstan's digital independence and promoting national culture in the global digital space. The first step in creating a language model is data collection.
The collected data will be used to create a high-quality Kazakh natural language processing (NLP) model. In the future, this will help to improve not only automatic translation, but also the quality and accuracy of Kazakh-language text processing as a whole.
Representatives of the professional IT community and open data holders can join the data collection process. Data will be collected through an account created by NITEC JSC specifically for this purpose. Users can log in to the platform and upload files to the account at huggingface.co/nitec. Kazakh-language text files of different styles and genres can be uploaded in .txt, .csv and .json formats.
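For contributors who prefer to script the upload, the step above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not NITEC's official procedure: the helper names are hypothetical, the UTF-8 check is an added precaution the announcement does not require, and the target repository name and its type ("dataset") are not given in the announcement, so they are left as parameters for the caller to supply. The upload itself relies on the third-party huggingface_hub package and a prior login.

```python
from pathlib import Path

# File formats named in the announcement.
ACCEPTED_SUFFIXES = {".txt", ".csv", ".json"}


def find_uploadable_files(folder):
    """Return sorted paths under `folder` with an accepted suffix that decode as UTF-8."""
    selected = []
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in ACCEPTED_SUFFIXES:
            continue
        try:
            # Assumption: contributions should be valid UTF-8 text.
            path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # skip files that are not valid UTF-8
        selected.append(path)
    return selected


def upload_files(paths, repo_id, repo_type="dataset"):
    """Upload each file with huggingface_hub.

    `repo_id` is not specified in the announcement and must be supplied
    by the caller; `repo_type="dataset"` is likewise an assumption.
    """
    from huggingface_hub import HfApi  # third-party: pip install huggingface_hub

    api = HfApi()  # uses the access token saved by `huggingface-cli login`
    for path in paths:
        api.upload_file(
            path_or_fileobj=str(path),
            path_in_repo=path.name,
            repo_id=repo_id,
            repo_type=repo_type,
        )
```

Calling find_uploadable_files on a local folder first makes it easy to review which files would be sent before passing the list to upload_files with the repository name provided by the organizers.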