A magyar nyelv digitális fenntarthatóságának támogatása

Gábor Prószéky; Tamás Váradi

doi:10.18349/MagyarNyelv.2023.4.482

For the digital sustainability of the Hungarian language

Authors

Gábor Prószéky HUN-REN Nyelvtudományi Kutatóközpont
Tamás Váradi HUN-REN Nyelvtudományi Kutatóközpont

DOI:

https://doi.org/10.18349/MagyarNyelv.2023.4.482

Keywords:

Hungarian National Corpus, spelling advisory portal, digitization of dictionary cards, Hanti and Mansi text corpora

Abstract

The project follows the founding mission of the Hungarian Academy of Sciences to ensure that Hungarian is given a worthy role in the digital space. International research focuses mainly on English, with less attention paid to smaller languages like Hungarian. (1) The Hungarian National Corpus (MNSz) contains more than one billion words and is widely used in linguistic research. It comprises six stylistic layers and five regional language varieties. The corpus primarily supports corpus-based and corpus-driven research on Hungarian, in linguistics as well as in many fields of the humanities and social sciences. However, new possibilities, such as newer, higher-quality language parsers and large databases for producing large language models, have made it necessary to expand and improve the corpus. (2) Spelling control is a key element of the linguistic norm and is becoming increasingly important in the digital space. The Spelling Advisory Portal, supported by the Hungarian Academy of Sciences, meets this demand with state-of-the-art technology, but needs upgrading in terms of software platform, methodology and customer focus. (3) The collection of more than four million dictionary cards belonging to the Great Dictionary of the Hungarian Language was created at the end of the 19th century. The cataloguing and digitization of the dictionary cards are ongoing and, although the construction of the collection started almost twenty years ago, further development is needed to ensure its full digital use. (4) To support the digital presence of Hanti and Mansi, the languages related to Hungarian, there is a need to create an analyzed digital corpus of modern written texts that makes it possible to document and preserve the current state of the Ob-Ugric languages.
In summary, these studies will contribute to the development of Hungarian in the digital space and cover a wide range of linguistic research from normative language to orthography and related languages.

Downloads

PDF (Hungarian)

Published

2023-12-20

Issue

Vol. 119 No. 4 (2023): Magyar Nyelv

Section

Miscellaneous (Különfélék)

License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Magyar Nyelv is a Diamond Open Access periodical. Documents can be freely downloaded and duplicated in an electronic format, and can be used unchanged and with due reference to the original source. Such use must not serve commercial purposes. In the case of any form of dissemination and use, Hungarian Copyright Act LXXVI/1999 and related laws are to be observed. The electronic version of the journal is subject to the regulations of CC BY-NC-ND (Creative Commons – Attribution-NonCommercial-NoDerivatives).

The journal permits its authors, at no cost and without any time limit, to make pre-print copies of their manuscripts publicly available via email, on their own homepage or that of their institution, in either closed or open repositories of their institutions/universities, or on other non-profit websites, in the form accepted by the journal editor for publication, including any amendments made on the basis of reviewers’ comments. When authors make their papers public in this manner, they have to warn their readers that the manuscript at hand is not the final published version of the work. Once the paper has been published in printed or online form, the authors are allowed (and advised) to use that (post-print) version for the above purposes. In that case, they have to indicate the exact location and other data of the journal publication. The authors retain the copyright of their papers; however, in the case of an occasional secondary publication, the bibliographical data of the first publication have to be included.