From enthusiasm to disappointment and back again: the use of generative AI in historical research

Contemporary historical research, like other fields of the humanities, is likely at a turning point in its development. Likely, because this will only be assessed by historians of science a few decades from now. This is not a Copernican-scale breakthrough, but a shift similar to the widespread adoption of personal computers or the internet. The emergence of large language models and the first attempts to use them in the humanities have provoked a range of reactions among historians: from cautious indifference to, perhaps most commonly, strong criticism of them as a source of unreliable data or a tool harmful to the educational process. However, there have also been attitudes bordering on techno-utopian ecstasy, promises of the automation of the research craft, or even claims that historians will soon be replaced by their artificial intelligence counterparts (1, 27). Fears of this latter possibility are, in any case, a fairly popular topic in journalism and the history-related blogosphere (2, 3), though they seem largely exaggerated (23). Today, when scientific discoveries made with the aid of AI are already emerging and mathematical problems are being solved through collaboration between models and mathematicians (24), the development of artificial intelligence calls for a correspondingly mature approach to its use in historical research, even if successes in the natural sciences do not translate directly to the humanities.

The history of artificial intelligence itself is not one of linear progress, but rather a cyclical process characterised by sudden surges in expectations followed by ‘AI winters’, during which research in the field declined significantly. Periods of accelerated AI development were usually marked by great optimism about what the technology would make possible. As early as 1960, one of the pioneers of AI, Herbert Simon, predicted that within 20 years machines would be able to perform any work a human can do (4). Even today this prediction cannot be considered fully accurate, though it is closer to the truth than it was 66 years ago. Currently, in the era of large language models, similarly optimistic currents prevail, concerning not only the potential to replace human specialists, but even the vision of a rapid acceleration of scientific research and discovery thanks to AI.

This vision is not without merit, even in the humanities. Large language models, unlike earlier systems considered to be AI, possess remarkable communication skills (which matters to humanities scholars) and are widely accessible. Their operation, based on billions of neural network parameters, allows them to generate narratives that read like human prose, even simulating a line of reasoning reminiscent of human thought. From the perspective of, for example, a digital historian, this signifies a shift from the era of databases and statistical analysis towards the possibility of interpretative dialogue with a machine.

The initial enthusiasm within the academic community was sparked not only by the generation of texts, but also by the ability to quickly analyse and summarise numerous publications; what used to take days could now be done in a matter of minutes or hours. Translation of texts from less widely spoken languages, cross-sectional analyses of large-scale data available in multiple languages (5), correcting errors and improving style – particularly in English texts by researchers who are not native speakers – and verifying and auto-correcting the output of OCR processes: all of this sounded like a brave new world, opening up vast possibilities in an easy and accessible way. There were even ideas about using generative AI for historical behavioural research, by training HLLMs – historical large language models – on historical text corpora (6).

But the ease and accessibility of these tools have not blinded everyone to the darker side of the ‘AI revolution’. It was soon pointed out that naive use of chatbots and language models could easily lead one astray. In the simplest case, uncritically pasting AI-generated text into academic publications resulted in embarrassing passages such as “As a language model...”. A more serious problem is model hallucination; although today (in 2026) it is less of an issue than it was three years ago, it appears to be an inherent feature of LLMs that they will never shed entirely, at least not with the current architecture. Models are trained on text created by humans; the training technology requires truly vast amounts of text, so all available sources are utilised, including sources containing false information or views that are generally unacceptable. Consequently, the models themselves have become a subject of scientific research: how bias affects the data, summaries and reports they generate (7). It was also quickly realised that a model is not an all-knowing encyclopaedia; even when supported by online searches, it does not always present true facts, and a naive request to compile literature on a particular topic can end in scandal, as in the case of a certain popular science book (8) or even publications of a more academic nature (9). A significant problem from the historians’ perspective has been the non-deterministic nature of language models: the same question posed to the same model can yield different results (22).

However, these issues do not mean that the initial period of enthusiasm must be followed by a complete rejection of these new tools. A third stage seems more likely: their reintegration into the historian’s toolkit, once unrealistic expectations have been adjusted. The realistic capabilities of LLMs as research tools for historians have been analysed (12). One of the key conclusions was the recommendation that researchers using artificial intelligence should possess a basic understanding of the technology (13). Knowledge of the tool undoubtedly facilitates its better use, but in order to assess what can actually be expected from large language models, it was necessary to understand what they actually know about history. Knowledge assessment tests, such as Massive Multitask Language Understanding (MMLU), also include questions from the field of history (the entire benchmark comprises around 16,000 questions across 57 fields, of which approximately 600 relate to history) (14). Initially (in 2021), the GPT-3 model was quite far from achieving good results (scoring just over 50%). However, the rapid development of generative AI meant that by the second half of 2024, the best LLMs (e.g. GPT-4o) were already achieving expert-level performance (>90% accuracy) on historical questions (12). It should be noted, however, that ‘history’ was understood mainly to mean the history of the United States, the history of Europe and general world history. It can be assumed that the history of, for example, Eastern Europe or the east coast of Africa was underrepresented there. By contrast, other studies based on a historical database (the Seshat Global History Databank) have shown (15) that the models’ knowledge is significantly inferior to that of experts in the field (33%–46% accuracy), whilst also revealing significant regional differences – the models performed worse on questions concerning Oceania and sub-Saharan Africa.
This, in turn, points to inequalities: certain areas of historical knowledge have been better mapped, others less so; the differences relate to cultures, geography and languages. Generative AI draws its knowledge from sources available to its creators; those that are not digitised or originate from countries with low levels of digitisation become ‘invisible’.

The issues surrounding the use of AI have, of course, prompted a response from the academic community, and standards for its use have begun to emerge: historical journals are defining the possibilities and rules for using artificial intelligence in the creation and peer review of publications (17), and the American Historical Association, for example, has published guidelines on the use of AI in history teaching (18). While acknowledging that AI will undoubtedly influence the teaching process, the guidelines emphasise that it is not free from errors and hallucinations, cannot replace historical methodology, and introduces a false sense of certainty where uncertainty exists; nevertheless, it can still be a valuable partner in teaching if the rules for its use are defined.

A common source of criticism regarding the use of language models in historical research is the naive assumption that the model (chatbot) acts as an expert in the field, capable of carrying out the researcher’s day-to-day tasks, such as drawing a map of China from the autumn of 1378 or preparing a study on the administrative history of the village of Pruty Niżne in the 18th and 19th centuries. Commercial language models, despite their ability to draw on knowledge available on the internet, are usually not well suited to this type of task. This leads to disappointment and criticism, and sometimes even a sense of having ‘beaten’ the AI. Yet AI should not be treated as an oracle; the model is an excellent tool for processing, analysing and transforming text. When combined with retrieval-augmented generation (RAG) (35) or PageIndex (36), which allow massive data sources to be processed, it can quickly provide information that enables the researcher to interpret sources, produce studies and formulate hypotheses. A mature approach to the use of AI in the work of a historian should not involve replacement but rather support. AI can perform partial tasks (e.g. transcription, preliminary classification, entity detection and identification), but the assessment of reliability and the synthesis of data should remain the responsibility of humans. If repetitive and time-consuming tasks are automated thanks to AI, historians will gain the opportunity to explore areas that were previously labour-intensive and required lengthy data preparation (19). It is precisely this process that best captures the ‘back’ in the title, which does not signify a return to enthusiasm, but a transition to the mature use of new tools, with greater awareness of their actual applications and methodological costs.
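The retrieval pattern behind RAG can be sketched in a few lines of Python. This is a deliberately minimal illustration: simple term overlap stands in for the embedding-based semantic search a production system would use, the diary-like snippets and the query are invented, and the assembled prompt would in practice be sent to a language model.

```python
import re

# Minimal sketch of the retrieval step in retrieval-augmented
# generation (RAG). Term overlap stands in for the semantic
# (embedding-based) search a real system would use, and the
# "source" chunks are invented snippets for illustration.

def tokenize(text):
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, chunks, k=2):
    """Return the k chunks sharing the most terms with the query."""
    q = tokenize(query)
    return sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)[:k]

def build_prompt(query, chunks):
    """Assemble the retrieved context and the question into one prompt."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks))
    return f"Answer using only these sources:\n{context}\n\nQuestion: {query}"

chunks = [
    "Tax register of 1782 listing households in the village.",
    "Parish record of baptisms between 1750 and 1800.",
    "Letter describing the grain harvest and food shortages of 1771.",
]
prompt = build_prompt("What do the sources say about food shortages?", chunks)
print(prompt)
```

In a full pipeline the prompt would then be passed to a model; the point is that the model answers from retrieved source passages rather than from its parametric memory, which is what makes the approach attractive for work with large historical corpora.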

Generative artificial intelligence is often associated with chatbots that enable dialogue with AI; in reality, a large language model (LLM) is operating in the background. There are perhaps a dozen well-known and popular models, but the number of trained variants of various models can now be estimated in the hundreds of thousands. These models differ significantly from one another, for example in size, i.e. the number of parameters, but size is not all that matters. A larger model may be capable of more, but the largest model is not always necessary for a given task; it is also worth checking a model’s biases and prejudices, as well as the limitations of its training data: a typical commercial model may perform worse at analysing texts in a rare language than a specialised open-source model. Tasks that are too general, e.g. “write an essay on the economic situation in Europe in 1946”, may yield disappointing results even in Deep Research mode (when models search the web and analyse the problem at greater length), though this may change quickly, given the pace of AI development. Conversely, the task of summarising Civil War-era diaries provided to the model as context, focusing on the themes of hunger and food procurement, can yield good and useful results. But regardless of how convincing the AI’s output may seem, it is the researcher’s responsibility to verify the results of its work.

An interesting example of a mature approach to the use of artificial intelligence in the humanities is the DeepPast project (20) led by the Big Data Studies Lab at the University of Hong Kong, which examines a corpus of medieval Korean texts with the aim of reconstructing Confucian networks (patron–client and master–disciple relationships). A text corpus of 3.6 GB would be difficult to process without artificial intelligence, but it is not AI that ‘writes history’; rather, it is a tool that prepares data enabling researchers to more easily connect people, institutions and intellectual movements, and to interpret the past. The project emphasises the active role of the historian in guiding the dialogue between human and machine. Additionally, it takes into account the ethical aspect of AI use regarding environmental resource consumption: smaller local models running on local infrastructure are utilised for the work (26).
Another use case for large models involves research into the urban history of Venice between 1740 and 1808, where LLMs were used to assist in searching for and analysing historical cadastral data, including through agents generating SQL queries from natural language, but also coding agents preparing Python programmes that perform analyses to help reconstruct information about the former population, property characteristics and spatiotemporal comparisons in Venice (28).
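The text-to-SQL pattern used in the Venice study can be illustrated with a minimal, self-contained sketch. The table, columns and sample rows below are invented for illustration (the actual cadastral schema differs), and the hand-written SQL string merely stands in for what an LLM agent would generate from the schema and the natural-language question.

```python
import sqlite3

# Sketch of the text-to-SQL pattern: an agent turns a natural-language
# question into SQL executed against a historical database. The table,
# its columns and the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE parcels (
    parish TEXT, owner TEXT, year INTEGER, rent_ducats REAL)""")
conn.executemany(
    "INSERT INTO parcels VALUES (?, ?, ?, ?)",
    [("San Polo", "Contarini", 1740, 120.0),
     ("San Polo", "Morosini", 1808, 95.5),
     ("Castello", "Grimani", 1740, 60.0)])

question = "Average rent per parish in 1740"
# In a real agent this string would come from an LLM shown the schema
# and the question above; here it is written by hand.
generated_sql = """
    SELECT parish, AVG(rent_ducats) FROM parcels
    WHERE year = 1740 GROUP BY parish ORDER BY parish"""

rows = conn.execute(generated_sql).fetchall()
print(rows)  # [('Castello', 60.0), ('San Polo', 120.0)]
```

Executing the generated query, rather than trusting a free-text answer, is what keeps the numbers verifiable: the historian can inspect both the SQL and the rows it returns.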

One of the traditional tasks of AI in historical research involving natural language processing is named entity recognition (NER). Here, too, large language models have proved highly versatile (multilingualism, understanding of context); furthermore, they do not require costly annotation and model training, and can operate zero-shot (without any further training or examples) or few-shot (where the model is given only a few examples of correct behaviour). Results from tests conducted on the HIPE-2022 dataset (Identifying Historical People, Places and other Entities, https://hipe-eval.github.io/HIPE-2022/) showed that LLMs can be an interesting alternative to traditional supervised machine learning (30).
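A few-shot NER setup of the kind tested on HIPE-2022 can be sketched as follows. The annotated example, the input sentence and the hard-coded “model reply” are all invented; in a real pipeline the reply would come from an LLM call, and the validation step guards against malformed output before it enters the dataset.

```python
import json

# Sketch of few-shot NER with an LLM: a prompt containing annotated
# examples, and validation of the model's JSON reply. The sentences
# are invented, and the reply is hard-coded in place of a model call.
FEW_SHOT = [
    ("Jan Kowalski arrived in Lwów in 1884.",
     [{"text": "Jan Kowalski", "type": "PER"},
      {"text": "Lwów", "type": "LOC"}]),
]

def build_ner_prompt(sentence):
    lines = ["Extract named entities as JSON (types: PER, LOC, ORG)."]
    for ex_sentence, ex_entities in FEW_SHOT:
        lines.append(f"Sentence: {ex_sentence}")
        lines.append(f"Entities: {json.dumps(ex_entities, ensure_ascii=False)}")
    lines.append(f"Sentence: {sentence}")
    lines.append("Entities:")
    return "\n".join(lines)

def parse_entities(model_reply):
    """Validate the reply before accepting it as an annotation."""
    entities = json.loads(model_reply)
    return [e for e in entities if e.get("type") in {"PER", "LOC", "ORG"}]

prompt = build_ner_prompt("Maria Skłodowska left Warsaw for Paris.")
# A plausible reply, hard-coded here in place of calling a model:
reply = ('[{"text": "Maria Skłodowska", "type": "PER"}, '
         '{"text": "Warsaw", "type": "LOC"}, '
         '{"text": "Paris", "type": "LOC"}]')
print(parse_entities(reply))
```

The appeal for historical material is that the few-shot examples can be swapped for period- and language-specific ones without retraining anything.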

Multimodal large models have also found applications in historical research. For example, in research on historical patent documents, the vision capabilities of the GPT-4o model were used to extract data from scanned form sheets in order to populate the Swedish historical patents database: https://svenskahistoriskapatent.se (29). The vision capabilities of the latest models, such as Gemini 3 Pro, have also been noted and tested in the field of historical handwritten text recognition (HTR); optimistic assessments even suggested that they might solve the problem of reading English handwriting, for example from the eighteenth century (31).
The usefulness of large models has also been recognised in other cases where historical mass materials are the subject of research, such as the plant inventory books of the United States Department of Agriculture, where LLMs extract structured data from scanned texts (32), or the example of the “English Catalogue of Books,” published since the mid-nineteenth century, where large models proved more effective at extracting information from catalogue entries than the regular expressions traditionally used for this purpose (33).
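The contrast with regular expressions is easy to demonstrate. The sketch below shows a regex baseline for catalogue-style entries; the entries themselves are invented, and real nineteenth-century catalogue data is far messier, which is precisely where prompted LLM extraction tends to cope better than a fixed pattern.

```python
import re

# Sketch of a regex baseline for bibliographic extraction: one
# pattern for entries shaped like "Author. Title. Place: Publisher,
# Year." Both test entries are invented for illustration.
PATTERN = re.compile(
    r"^(?P<author>[^.]+)\.\s+(?P<title>[^.]+)\.\s+"
    r"(?P<place>[^:]+):\s+(?P<publisher>[^,]+),\s+(?P<year>\d{4})\.$")

def parse_entry(entry):
    """Return the entry's fields, or None if the pattern fails."""
    m = PATTERN.match(entry)
    return m.groupdict() if m else None

# An entry matching the expected shape parses cleanly:
ok = parse_entry("Smith, J. A History of Trade. London: Macmillan, 1861.")
# A slightly irregular entry defeats the fixed pattern entirely:
bad = parse_entry("A History of Trade, by J. Smith (Macmillan, 1861)")
print(ok, bad)
```

The brittle failure mode on the second entry is the crux: every formatting variant needs another hand-written pattern, whereas an LLM given the same extraction task in a prompt generalises across variants (at the cost of needing its output verified).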
It is also worth mentioning here the work carried out at the Tadeusz Manteuffel Institute of History of the Polish Academy of Sciences on the automatic processing of entries from the Geographical Dictionary of the Kingdom of Poland (SGKP), using large models to prepare structured data that can be used to supplement databases (34). All of these examples confirm the usefulness of properly applied AI in increasing the accessibility and usability of historical and archival materials for scholars.

Just as in the past the aim was not to replace the historian’s craft with databases and statistical tools, so today technology is merely a means to an end (19), and large models are, after all, a technology. One so advanced that it can significantly accelerate certain tasks that have so far been carried out by teams of historians and computer scientists, for example in the creation of digital editions of source documents (21), yet it is only the finished edition that begins to serve researchers in advanced analyses of historical sources. And indeed, acceleration is probably the best term for describing the impact of artificial intelligence on historical research.

Another issue is the growing role of AI in programming. Digital historians have often assumed the role of programmers, creating project-specific tools and scripts for their work; today, this possibility is becoming available to programming laypeople as well. Systems such as Claude Code or Codex (from OpenAI), as well as specialised tools such as Replit or Lovable, make it possible to build fairly complex applications from requirements expressed verbally by the user. Admittedly, this still demands certain abilities of its own, in formulating precise specifications and in the iterative development of applications, and these are now beginning to be taught in digital humanities programmes (25). These recent changes also prompt another kind of reflection: who, in essence, is the digital historian today, or more generally the digital humanist, when technology so significantly lowers the threshold of difficulty? Might this soon become a term of merely historical significance, in the literal sense?

Conclusion

The contemporary historian increasingly moves beyond the framework of traditional research methods: multidisciplinary research and the use of achievements from fields sometimes quite distant from the humanities bring new data, new answers, and new questions. Examples include pollen profile studies and social network analysis, which have shed interesting light on the question of the collapse of the state of the first Piasts (10).
This also works in the opposite direction: historical data are becoming, for example, a valuable complement to ecological research and studies on changes in biodiversity (11). Reaching for artificial intelligence, already used in many other fields, therefore seems a natural step. At a time when the use of advanced machine learning methods is finally making it possible to read the Herculaneum scrolls, this “ordinary” AI, available at one’s fingertips, can also contribute to significant achievements in historical research, especially if the technology ceases to be treated as a threat and a substitute for the historian, and instead becomes the historian’s assistant for difficult tasks. There is, of course, a risk that the assistant will become more intelligent than the researcher 😉 (16).

Piotr Jaskulski

Bibliography: