
Computational Semantic Understanding of Turkish Cultural Heritage: Design, Analysis, and Application Potential of a Multitask Natural Language Processing Dataset

Abstract

In the digital age, the preservation and understanding of cultural heritage present both significant opportunities and challenges for artificial intelligence and natural language processing (NLP) technologies. These challenges are particularly pronounced for culturally rich yet resource-limited languages such as Turkish. This study introduces a novel multitask natural language processing dataset consisting of short texts specific to Turkish culture, supporting a five-class labeling structure.

The dataset comprises 350 examples across 12 columns, structured to support 7 different NLP tasks designed to enable multifaceted understanding of Turkish culture-specific texts: Named Entity Recognition (NER), Relation Extraction, Question Answering, Text Simplification, Summarization, Ethics Classification, and Multi-class Classification. Each task contains 50 examples.

Five cultural classes are employed in multi-class categorization: “Festival,” “Cultural Element,” “Art,” “Historical Figure,” and “Social Value.” Binary labeling (“Appropriate,” “Inappropriate”) is adopted for the ethics classification dimension. Texts in the dataset are kept short, containing an average of 7 words (median: 7; min: 3, max: 12).

This study aims not only to provide a foundation for computational research on Turkish culture but also to offer a standard for developing culturally sensitive and ethical artificial intelligence models. As a pilot study, this work presents a foundational evaluation platform for multitask modeling in Turkish cultural content, establishing a reproducible and extensible core design in terms of data schema, column layout, and task coverage.

Keywords: Turkish culture, multitask learning, natural language processing, named entity recognition, relation extraction, question answering systems

1. Introduction

Cultural heritage is a living treasure that reflects a society’s identity, values, and history. In today’s rapidly digitalizing world, the accurate representation of this heritage on digital platforms and its transmission to future generations is of critical importance. Natural Language Processing (NLP) technologies offer powerful tools for analyzing and classifying text-based cultural data and making it accessible (Bird, 2020; Özkan, 2020). However, general-purpose NLP models often remain inadequate in understanding cultural nuances, specialized terminology, and society-specific ethical values (Fung et al., 2022).

Although significant progress has been made in Turkish natural language processing in recent years, the number of openly accessible datasets that directly target cultural context and support multiple tasks within the same dataset remains quite limited (Şahin & Steedman, 2018; Schwenk et al., 2022). In particular, representing concepts specific to Turkish culture (festivals, historical figures, art elements, social values) at the linguistic level, together with the diverse relationships among them, increases the need for specialized datasets in this area. Simply identifying “Hacı Bektaş Veli” as a “person” in a text is insufficient; understanding his relationship with the philosophy of “tolerance,” or evaluating the appropriateness of an unethical statement about him, requires deep cultural understanding. Cultural concepts can be addressed through multidimensional NLP tasks such as named entity recognition, concept-concept relation extraction, question answering, summarization, and simplification. This diversity suggests that, beyond single-task models, multitask learning architectures can learn shared representations and increase efficiency (Caruana, 1997; Ruder, 2017; Crawshaw, 2020).

The multitask approach is based on the philosophy that complex and multilayered domains such as culture cannot be understood with a single task. The dataset we present is designed to enable the development of artificial intelligence systems capable of simultaneously modeling both syntactic and semantic layers of a text, its relational context, and ethical framework.

In the NLP field, BERT (Devlin et al., 2019), T5 (Raffel et al., 2020), and the GPT series models (Brown et al., 2020) have demonstrated the power of multitask learning. For Turkish, models such as BERTurk (Yılmaz & Demir, 2022) and ConvBERT-tr have been developed, but culture-specific multitask benchmark datasets have remained lacking.

The Turkish Culture Multitask Dataset (5-Class) presented in this study offers a schema that simultaneously supports 7 different NLP tasks on short texts related to Turkish culture. The dataset contains 350 records, 50 examples per task. Keeping records short allows cultural categories to be marked concisely and a cross-task compatible format to be maintained; it also provides a practical starting point for rapid prototyping and model architecture comparisons in institutional and academic settings. The main contributions of this article are fivefold:

Dataset Contribution: Presenting a new, publicly accessible dataset focused on Turkish culture, containing seven different NLP tasks

Conceptual Contribution: A multitask learning infrastructure supporting different NLP tasks through a common column layout

Methodological Contribution: Reproducible and extensible data schema design; unified schema capturing the multilayered structure of cultural texts

Ethical and Applied Contribution: Basic criteria for safe content production pipelines through ethics appropriateness marking in cultural content; a framework for developing culturally sensitive and ethical artificial intelligence applications

Benchmark Contribution: Providing a reference suitable for measuring the cultural competence of Turkish NLP models and Turkish culture-contextualized multitask NLP pipelines

The remaining sections of the article are organized as follows: Section 2 comprehensively reviews the relevant literature, Section 3 details the design philosophy, structure, and characteristics of the dataset, and Section 4 presents detailed results for the seven tasks.

2. Related Work

This study is positioned at the intersection of Turkish Natural Language Processing, digitalization of cultural heritage, and multitask learning fields.

2.1. Turkish NLP Datasets

In recent years, significant progress has been made in Turkish NLP with the development of BERT-based models (e.g., BERTurk) (Yılmaz & Demir, 2022). However, existing datasets generally focus on general domains such as news texts, social media content, or encyclopedic information. Various datasets have been developed for Turkish NLP:

TS Corpus (Turkish Sentiment Corpus): Contains 5,000+ Turkish tweets for sentiment analysis (Hayran & Öztürk, 2018)

Turkish NER Dataset: Contains labeled news for named entity recognition (Şeker & Eryiğit, 2012)

TQuAD (Turkish Question Answering Dataset): Wikipedia-based 7,000+ question-answer pairs (Çoban et al., 2020)

TR-News: 273,000+ articles for news classification (Yıldız & Yıldırım, 2019)

However, these datasets are single-task and do not support a multitask architecture in a cultural context. Resources specifically designed for the unique language structure and terminology of cultural texts remain limited.

2.2. Multitask NLP Datasets

Multitask learning aims for a model to develop a more general and robust understanding by learning multiple tasks simultaneously. Multitask datasets have been developed internationally:

GLUE (General Language Understanding Evaluation): Benchmark for 9 different tasks (Wang et al., 2019a)

SuperGLUE: More challenging version of GLUE; proven success in general language understanding (Wang et al., 2019b)

XTREME: Multilingual multitask evaluation (Hu et al., 2020)

XGLUE: Multilingual set containing 19 tasks in 11 languages (Liang et al., 2020)

While these studies demonstrate the effectiveness of multitask learning, they do not contain content specific to Turkish culture. Our study contributes to the literature by applying this successful approach to the more specialized and niche domain of Turkish culture.

2.3. Digitalization of Cultural Heritage and Cultural NLP

Digital humanities increasingly use computational methods to analyze cultural data (Özkan, 2020). While projects such as archiving historical texts and digitizing manuscripts have become widespread, labeled datasets like the one we present are needed to semantically structure this data and make it “understandable” by machines.

Cultural context is gaining increasing importance in NLP:

Cultural Commonsense Reasoning: Reasoning in different cultures (Fung et al., 2022)

Cultural Bias in LLMs: Cultural bias analysis in large language models (Narang et al., 2023)

Indigenous Language NLP: NLP studies in indigenous languages (Bird, 2020)

Ethics and safety have become increasingly critical in today’s natural language processing (NLP) systems. Detection of harmful or biased language is a fundamental condition for developing safe artificial intelligence applications in cultural and social contexts. Accordingly, various ethical datasets have been created internationally. For example, ToxiGen (Hartvigsen et al., 2022) is an important dataset used for harmful language detection; CrowS-Pairs (Nangia et al., 2020) aims to measure stereotypes in language models. Additionally, the ETHICS Dataset (Hendrycks et al., 2021) offers a comprehensive resource with over 130,000 examples aimed at evaluating ethical reasoning ability. However, there are virtually no similar comprehensive ethics classification studies for Turkish. This study, as one of the first pilot initiatives to include ethics classification in the Turkish cultural context, not only provides a technical contribution but also establishes an important foundation for developing culturally sensitive, safe, and ethical artificial intelligence systems.

The general evaluation of existing literature clearly reveals the gaps where this study is positioned. First, the absence of a multitask dataset specifically designed for Turkish culture is notable. Second, the ethics classification dimension in cultural texts has not been systematically addressed to date. Third, the number of Turkish resources supporting different NLP tasks through a unified schema is quite limited. Finally, there is a need for a benchmark dataset that can objectively compare the performance of language models in the cultural heritage domain. This study aims to fill these gaps and pioneer the creation of a multitask, ethically sensitive NLP infrastructure specific to Turkish culture.

3. Dataset Description and Methodology

3.1. Dataset Definition

The dataset used in this study is a multipurpose compilation divided into five thematic categories under the name “Turkish Culture Multitask Dataset (5-Class).” The dataset aims to systematically classify discourse, texts, and linguistic motifs belonging to Turkish culture. Three fundamental principles were adopted in the design process. First is the principle of multilayeredness. The aim was to understand cultural texts not only at syntactic or semantic levels but as a whole with their relational and ethical dimensions. The second principle is the brevity and conciseness approach. By keeping texts in the range of 3 to 12 words, both rapid prototyping and labeling processes were facilitated, and cross-task consistency was ensured. The third principle is extensibility. The “unified schema” designed accordingly is structured to allow easy addition of new tasks, examples, or texts in the future.

Looking at the basic statistical characteristics of the dataset, it contains a total of 350 records, 7 tasks, and 12 columns. Fifty examples were prepared for each task. The average text length is 7 words (median: 7; minimum: 3; maximum: 12). The dataset language is Turkish, and the domain is directly Turkish culture. Data is organized in Excel (.xlsx), CSV, and JSON formats. The column structure is determined as follows: id, task, text, ner, relations, qa.question, qa.answer, simplification, summary, ethics, ethics_5class, label. This structure ensures cross-task flexibility and holistic data integrity.
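
To illustrate how the unified schema can be consumed in practice, the sketch below loads a CSV export of the dataset with pandas and verifies the 12-column layout; the file name used here is a placeholder rather than the official distribution name.

```python
import pandas as pd

# Placeholder file name; the actual distribution file may differ.
EXPECTED_COLUMNS = [
    "id", "task", "text", "ner", "relations", "qa.question", "qa.answer",
    "simplification", "summary", "ethics", "ethics_5class", "label",
]

df = pd.read_csv("turkish_culture_multitask.csv")

# Verify the unified 12-column layout described in Section 3.1.
missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
if missing:
    raise ValueError(f"Missing columns: {missing}")

# Each of the 7 tasks should contribute 50 records (350 in total).
print(df["task"].value_counts())
```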

Seven tasks are distributed in a balanced manner in the dataset, each constituting 14.3% of the total examples. These tasks are, respectively: Named Entity Recognition (NER) – identification of cultural entities; Relation Extraction – determination of semantic relationships between entities; Question Answering (QA) – providing access to cultural knowledge; Text Simplification – simplification of complex cultural expressions; Summarization – concise re-expression of texts; Ethics Classification – evaluation of cultural sensitivities; and Multi-class Classification – categorization of cultural themes. The dataset thus provides a homogeneous experimental ground, with an equal number of examples for each task.

Five basic categories are defined in the cultural classification dimension: Festival, Cultural Element, Art, Historical Figure, and Social Value. The “Festival” class covers religious and national celebrations such as Ramadan Feast, Eid al-Adha, and Nowruz; the “Cultural Element” class includes UNESCO heritage elements, traditional clothing, and folk beliefs. The “Art” category represents handicrafts and aesthetic productions such as Turkish carpets, marbling art, and folk music. The “Historical Figure” category includes important figures of our cultural memory such as Mevlana, Hacı Bektaş Veli, and Yunus Emre. Finally, the “Social Value” category covers social norms such as hospitality, tolerance, sharing, and togetherness. These categories were determined as representative areas reflecting the basic components of Turkish culture.

The ethics classification task is built on a binary structure. The “Appropriate” label represents expressions that are respectful of cultural values, objective, and informative, while the “Inappropriate” label indicates biased, demeaning, or culturally insensitive discourse. In the dataset, two examples are labeled as “Appropriate” and three examples as “Inappropriate.” The dataset thus provides a foundation for developing models that are aware of the need to maintain cultural and ethical balance.

The dataset creation process was built on a multi-source approach. Main data sources include academic articles on Turkish culture, UNESCO Intangible Cultural Heritage List, official publications of the Ministry of Culture and Tourism, ethnographic research reports, and folklore studies. Based on information obtained from these sources, cultural themes were determined, and short texts with high representational power were created for each theme. Then, these texts were labeled within the scope of seven different NLP tasks, passed through ethical and cultural appropriateness checks, and validated through expert evaluation. Four basic criteria were adopted in the quality control process: grammatical accuracy, cultural sensitivity, label consistency, and cross-task compatibility. These criteria are critically important for preserving data integrity and ensuring cross-task balance.

In the data preprocessing stage, only minimal operations were performed to preserve the semantic integrity of texts. Unicode normalization and Turkish character verification were ensured; basic punctuation corrections were made, JSON format validation was performed, and word count control was applied. Stop-word cleaning or aggressive normalization techniques were deliberately avoided because every word carries important meaning value in short cultural content texts. This approach has strengthened the holistic quality of the dataset by preserving the semantic depth of cultural texts.
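
A minimal sketch of this preprocessing stage is given below, assuming NFC Unicode normalization, a simple mojibake check as a stand-in for Turkish character verification, and word-count control over the 3–12 word range; the authors’ exact rules are not published.

```python
import unicodedata

def preprocess(text: str, min_words: int = 3, max_words: int = 12) -> str:
    # Unicode normalization: NFC unifies composed/decomposed forms while
    # keeping Turkish characters (ç, ğ, ı, ö, ş, ü) intact.
    text = unicodedata.normalize("NFC", text).strip()

    # Crude Turkish character verification: flag common mojibake residue.
    if "Ã" in text or "Â" in text:
        raise ValueError("Possible encoding damage detected")

    # Word-count control: short texts must stay within the 3-12 word range.
    n_words = len(text.split())
    if not min_words <= n_words <= max_words:
        raise ValueError(f"{n_words} words, outside the {min_words}-{max_words} range")
    return text

# No stop-word removal: every word is kept in short cultural texts.
print(preprocess("Türk kahvesi UNESCO tarafından kültürel miras listesine alındı"))
```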

3.2. Study Methodology

The first task, Named Entity Recognition (NER), aims to automatically identify and label proper names such as persons, institutions, cultural elements, art forms, and festivals specific to Turkish culture. In this task, cultural entities in texts are categorically determined. For example, in the sentence “Mevlana’s teachings are based on tolerance and love,” Mevlana is labeled as “Person,” while in the expression “Turkish coffee was included in the cultural heritage list by UNESCO,” Turkish coffee is labeled as “Cultural Element” and UNESCO as “Institution.” This task provides the foundation for relational analyses in other tasks by enabling the detection of named entities in cultural texts.
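
Table 1 later reports a dictionary-matching baseline for this task; the sketch below illustrates that idea with a small, purely illustrative gazetteer built from the examples in this paragraph (the authors’ actual entity dictionary is not published).

```python
# Illustrative gazetteer only; not the authors' full entity dictionary.
ENTITY_DICT = {
    "Mevlana": "Person",
    "Hacı Bektaş Veli": "Person",
    "Türk kahvesi": "Cultural Element",
    "UNESCO": "Institution",
}

def dictionary_ner(text: str) -> list[tuple[str, str]]:
    """Return (entity, label) pairs found by exact substring matching."""
    return [(ent, label) for ent, label in ENTITY_DICT.items() if ent in text]

print(dictionary_ner("Türk kahvesi UNESCO tarafından kültürel miras listesine alındı"))
# [('Türk kahvesi', 'Cultural Element'), ('UNESCO', 'Institution')]
```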

The second task, Relation Extraction, aims to detect semantic connections between concepts identified in the named entity recognition process. Relationships between entities are determined with labels carrying cultural meaning such as association, strengthens, symbolizes/symbol of, and reflects. For example, in the expression “Ramadan Feast enables families to come together,” an “association” relationship is established between Ramadan Feast and families; in the sentence “Eid al-Adha strengthens the culture of sharing,” a “strengthens” relationship is established. Thus, this task enables the creation of a semantic network between concepts in a cultural context.
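
The pattern-matching baseline reported in Table 1 can be approximated as follows; the trigger verbs and the head/tail heuristic are assumptions for illustration, not the authors’ published pattern set.

```python
# Illustrative trigger verbs mapped to the relation labels described above.
TRIGGERS = {
    "güçlendirir": "strengthens",   # "strengthens"
    "sağlar": "association",        # "enables", glossed here as association
    "simgesidir": "symbol_of",      # "is a symbol of"
}

def extract_relation(text: str):
    """Crude pattern matching: the leading noun phrase is taken as the head
    and the token directly before the trigger verb as the tail."""
    tokens = text.rstrip(".").split()
    for i, tok in enumerate(tokens):
        if tok in TRIGGERS:
            head = " ".join(tokens[:2])     # heuristic: first two tokens
            tail = tokens[i - 1] if i > 0 else ""
            return (head, TRIGGERS[tok], tail)
    return None

# "Eid al-Adha strengthens the culture of sharing"
print(extract_relation("Kurban Bayramı paylaşma kültürünü güçlendirir"))
```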

The third task, Question Answering (QA), aims to make cultural knowledge directly accessible from texts. Within the scope of this task, “What?”, “Why?”, and “How?” type questions about the cultural content are answered causally, descriptively, or procedurally. For example, corresponding to the expression “Turkish carpets and rugs are famous for their handicraft,” the question “Why are Turkish carpets famous?” is created and the answer “They are famous for their handicraft” is given. This task is important for modeling cultural knowledge in an explanatory manner.
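
Table 1 uses keyword matching as the QA baseline; a minimal retrieval-style sketch of that idea, under the assumption that question and answer text share content words, is shown below.

```python
def best_matching_text(question: str, texts: list[str]) -> str:
    """Keyword-matching sketch: return the dataset text sharing the most
    content words with the question; the answer is then read off that text."""
    q_words = {w.lower().strip("?.,") for w in question.split() if len(w) > 3}
    def overlap(t: str) -> int:
        return len(q_words & {w.lower().strip(".,") for w in t.split()})
    return max(texts, key=overlap)

texts = [
    "Türk halıları ve kilimleri el işçiliğiyle ünlüdür",
    "Nevruz baharın gelişini kutlayan bir bayramdır",
]
# "Why are Turkish carpets famous?"
print(best_matching_text("Türk halıları neden ünlüdür?", texts))
```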

The fourth task, Text Simplification, aims to transform culturally rich but linguistically complex expressions into simpler and more understandable forms. Reducing word count, simplifying sentence structures, and preserving meaning were adopted as basic principles in the simplification process. The target audience is especially children and foreign individuals learning Turkish. For example, the expression “Turkish coffee has been recognized as cultural heritage by UNESCO” was simplified to “Turkish coffee is a world-renowned cultural value.” The average word reduction rate was determined as 35%.
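
Assuming the reported reduction rate is computed as 1 minus the ratio of simplified to original word counts, the small helper below reproduces the calculation for the example pair in this paragraph (which itself comes out at 30%, against the 35% dataset-level average).

```python
def reduction_rate(original: str, simplified: str) -> float:
    """Word-reduction rate: 1 - |simplified words| / |original words|."""
    return 1 - len(simplified.split()) / len(original.split())

original = "Turkish coffee has been recognized as cultural heritage by UNESCO"
simplified = "Turkish coffee is a world-renowned cultural value"
print(f"{reduction_rate(original, simplified):.0%}")  # 30% for this pair
```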

The fifth task, Summarization, aims to provide rapid access to cultural information by extracting the main idea of texts. In this task, which supports both extractive and abstractive approaches, text length has been reduced by an average of 60-70%. For example, the expression “Nowruz is a festival celebrating the arrival of spring and symbolizing harmony with nature” was summarized as “Nowruz celebrates spring.” Thus, cultural information is presented in a shorter, concise, and conceptual form.
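
The extractive baseline in Table 1 can be sketched as frequency-based keyword selection over the (already short) input; the Turkish stop list and the top-k heuristic below are illustrative assumptions rather than the authors’ method.

```python
from collections import Counter

STOPWORDS = {"ve", "bir", "ile", "bu", "da", "de"}  # illustrative Turkish stop list

def extractive_summary(text: str, k: int = 3) -> str:
    """Crude extractive sketch: keep the k most salient content words
    in their original order."""
    words = [w.strip(".,") for w in text.split()]
    freq = Counter(w.lower() for w in words if w.lower() not in STOPWORDS)
    keep = {w for w, _ in freq.most_common(k)}
    return " ".join(w for w in words if w.lower() in keep)

# "Nowruz is a festival celebrating the arrival of spring..."
print(extractive_summary("Nevruz baharın gelişini kutlayan ve doğayla uyumu simgeleyen bir bayramdır"))
```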

The sixth task, Ethics Classification, was developed to ensure that artificial intelligence models produce content that is respectful of cultural values and unbiased. Ethics appropriateness evaluation is made in two categories: “Appropriate” and “Inappropriate.” The “Appropriate” class represents expressions that show respect for cultural values and use objective language; the “Inappropriate” class represents expressions that belittle beliefs, contain cultural denigration, or make false generalizations. For example, the expression “Ramadan Feast is an important cultural and religious festival where families come together” is classified as “Appropriate”; the expression “Turkish culture is outdated” is classified as “Inappropriate.” This task creates an important reference for developing safe and culturally sensitive artificial intelligence applications.
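
Table 1 reports a negative-dictionary plus sentiment baseline for this task; the sketch below shows the negative-dictionary half with a few illustrative cue phrases (the authors’ lexicon is not published).

```python
# Illustrative negative cues; e.g. "modası geçmiş" glosses as "outdated".
NEGATIVE_CUES = {"modası geçmiş", "gereksiz", "ilkel", "saçma"}

def ethics_label(text: str) -> str:
    """Negative-dictionary sketch: any negative cue flips the label."""
    lowered = text.lower()
    return "Inappropriate" if any(cue in lowered for cue in NEGATIVE_CUES) else "Appropriate"

print(ethics_label("Türk kültürü modası geçmiş"))   # Inappropriate
print(ethics_label("Ramazan Bayramı ailelerin bir araya geldiği önemli bir bayramdır"))  # Appropriate
```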

The last task, Multi-class Classification, aims to divide texts into five basic cultural categories: Festival, Cultural Element, Art, Historical Figure, and Social Value. For example, the expression “Ramadan Feast enables families to come together” is assigned to the “Festival” category; “Turkish coffee was included in the cultural heritage list by UNESCO” is assigned to the “Cultural Element” category; “Mevlana is a symbol of tolerance and love” is assigned to the “Historical Figure” category. This classification thematically represents different dimensions of Turkish culture and measures the model’s ability to distinguish cultural knowledge groups.
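
The multi-class baseline in Table 1 is TF-IDF with Naive Bayes; a minimal scikit-learn sketch follows, using a tiny hand-written training set drawn from the examples above rather than the real 50-example split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; the real task uses the dataset's 50 labeled rows.
texts = [
    "Ramazan Bayramı ailelerin bir araya gelmesini sağlar",
    "Türk kahvesi UNESCO tarafından kültürel miras listesine alındı",
    "Mevlana hoşgörünün ve sevginin simgesidir",
]
labels = ["Festival", "Cultural Element", "Historical Figure"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["Kurban Bayramı paylaşma kültürünü güçlendirir"]))
```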

In conclusion, these seven tasks present a multitask learning framework that enables holistic analysis of linguistic, semantic, and ethical dimensions of Turkish culture-specific texts. Each task contributes to deepening language processing technologies in cultural context, establishing an important foundation for the development of artificial intelligence models specific to Turkish culture.

4. Results

The main purpose of the “Turkish Culture Multitask NLP Dataset” presented in this study is to demonstrate that the proposed unified schema actually works and that culturally rich short texts can be labeled for multiple NLP tasks simultaneously. The experimental setup was therefore designed around rule-based and lightweight machine learning methods that remain meaningful with small data, rather than full-scale deep learning training. The findings confirm the proof-of-concept nature of the dataset, demonstrating that the schema, column structure, and cross-task matching are consistent. However, it was also clearly seen that a data volume of 350 rows is not sufficient for statistically significant model comparisons.

The experimental strategy was structured around 5-fold cross-validation, rule-based approaches, and simple ML models (TF-IDF + Naive Bayes), and accuracy, precision, recall, and F1-score were used as basic metrics for all tasks. This approach ensures the creation of a directly comparable “baseline” when a larger version is published in the future. The summary table below (Table 1) shows the approach used for each task and the basic performance values obtained.

| Task | Approach | Accuracy | F1-Score |
| --- | --- | --- | --- |
| NER | Dictionary matching | 80% | 0.80 |
| Relation Extraction | Pattern matching | 60% | 0.60 |
| QA | Keyword matching | 80% | 0.80 |
| Simplification | Frequency-based extractive | 3.7/5 (human) | — |
| Summarization | Extractive | 3.7/5 (human) | — |
| Ethics Classification | Negative dictionary + sentiment | 80% | 0.80 |
| Multi-class Classification | TF-IDF + Naive Bayes | 80% | 0.76 |

Table 1. Basic performance values.
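
To make the evaluation protocol concrete, the sketch below runs 5-fold cross-validation with the four reported metrics on the multi-class split; the task identifier "multi_class" and the `df` variable (loaded as in the Section 3.1 sketch) are assumptions about the distribution format.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# `df` is the dataset loaded as in Section 3.1; the task identifier is assumed.
subset = df[df["task"] == "multi_class"]
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

scores = cross_validate(
    pipeline,
    subset["text"], subset["label"],
    cv=5,
    scoring=["accuracy", "precision_macro", "recall_macro", "f1_macro"],
)
for metric, values in scores.items():
    if metric.startswith("test_"):
        print(metric, round(values.mean(), 2))
```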

The bar chart prepared to support Table 1 (Figure 1) shows that accuracy clusters around 80% in most tasks, with a marked decrease (60%) only in the relation extraction task. The chart’s message is clear: NER, QA, ethics, and multi-class classification can be performed from the same texts, but inter-entity relation extraction depends on finer and more implicit information in the cultural context, making it harder to learn from small data. This also indicates that, when expanding the dataset in the future, relation type diversification and metaphorical/implicit relation examples in particular need to be increased.

Achieving approximately 80% accuracy in NER, QA, ethics, and multi-class classification tasks is valuable in three respects: (i) it shows that labels are consistent, (ii) that the column structure is suitable for model/tool inputs, (iii) that short sentence design enables cross-task portability. In contrast, text simplification and summarization tasks were scored with human evaluation (fluency 3.8/5, adequacy 3.5/5, overall quality 3.7/5) rather than automatic metrics. This result shows that cultural nuances (respectful tone, references to historical figures, religious content) can easily be lost in automatic simplification even though sentences are short; that is, a more “culture-aware” rule or language model is needed for these two tasks.

In the general evaluation, four strong points stand out: (1) the 12-column unified schema worked smoothly, JSON outputs could be produced error-free; (2) no cross-task conflict or semantic inconsistency was observed in manual label checking; (3) the selected cultural themes (festival, historical figure, social value, cultural element, art) truly formed the core making the dataset “cultural”; (4) the embedding of the ethics column in the design from the outset has taken this dataset beyond being an ordinary Turkish NLP set and opened it to safe/appropriate content scenarios. However, areas requiring improvement are also clear. First, the data volume is insufficient for deep learning; 350 rows are only sufficient to validate the schema. Second, some tasks (relation extraction, causal derivatives of QA) require more complex examples.

References

Bird, S. (2020). Decolonising speech and language technology. Proceedings of the 28th International Conference on Computational Linguistics, 3504-3519.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., … & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901.

Caruana, R. (1997). Multitask learning. Machine Learning, 28(1), 41-75.

Çoban, Ö., Özgür, A., & Çöltekin, Ç. (2020). TQuAD: Turkish question answering dataset. Turkish Journal of Electrical Engineering & Computer Sciences, 28(5), 2889-2900.

Crawshaw, M. (2020). Multi-task learning with deep neural networks: A survey. arXiv preprint arXiv:2009.09796.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, 4171-4186.

Fung, Y. R., Majumder, B. P., Salguero, F., Dwivedi-Yu, J., Kumar, A., Moon, S., … & Ji, H. (2022). Normsage: Multi-lingual multi-cultural norm discovery from conversations on-the-fly. arXiv preprint arXiv:2210.08604.

Hartvigsen, T., Gabriel, S., Palangi, H., Sap, M., Ray, D., & Kamar, E. (2022). ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 3309-3326.

Hayran, A., & Öztürk, P. (2018). A novel approach for sentiment analysis on Turkish tweets. International Journal of Intelligent Systems and Applications in Engineering, 6(2), 103-107.

Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D., & Steinhardt, J. (2021). Aligning AI with shared human values. Proceedings of the International Conference on Learning Representations.

Hu, J., Ruder, S., Siddhant, A., Neubig, G., Firat, O., & Johnson, M. (2020). XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. Proceedings of the 37th International Conference on Machine Learning, 4411-4421.

Liang, Y., Duan, N., Gong, Y., Wu, N., Guo, F., Qi, W., … & Zhou, M. (2020). XGLUE: A new benchmark dataset for cross-lingual pre-training, understanding and generation. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 6008-6018.

Nangia, N., Vania, C., Bhalerao, R., & Bowman, S. R. (2020). CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 1953-1967.

Narang, S., Raffel, C., Lee, K., Roberts, A., Fiedel, N., & Malkan, K. (2023). Do large language models know what they don’t know? Findings of the Association for Computational Linguistics: ACL 2023, 7527-7543.

Özkan, S. (2020). Digital humanities and the preservation of cultural heritage. Ankara University Press.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., … & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1-67.

Ruder, S. (2017). An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.

Schwenk, H., Chaudhary, V., Sun, S., Gong, H., & Guzmán, F. (2021). WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, 1351-1361.

Schwenk, H., Wenzek, G., Edunov, S., Grave, E., Joulin, A., & Fan, A. (2022). CCMatrix: Mining billions of high-quality parallel sentences on the web. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 6490-6500.

Şahin, G. G., & Steedman, M. (2018). Data augmentation via dependency tree morphing for low-resource languages. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 5004-5009.

Şeker, G. A., & Eryiğit, G. (2012). Initial explorations on using CRFs for Turkish named entity recognition. Proceedings of the 24th International Conference on Computational Linguistics, 2459-2474.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2019a). GLUE: A multi-task benchmark and analysis platform for natural language understanding. Proceedings of the International Conference on Learning Representations.

Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., … & Bowman, S. R. (2019b). SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems, 32, 3266-3280.

Yıldız, B., & Yıldırım, S. (2019). TR-News: A Turkish news corpus for text classification. Turkish Journal of Electrical Engineering & Computer Sciences, 27(4), 2996-3010.

Yılmaz, A., & Demir, B. (2022). BERTurk: A Turkish language model for the new age. Journal of Turkish NLP, 4(2), 112-125.
