TEXT2NER

Description and user manual for a tool that converts documents to TEI-XML format with named entity recognition

The TEXT2NER application is designed for the preliminary conversion of historical documents into the TEI-XML format. A plain-text document is transformed into XML structures (header head, and body with div and p elements). The text is then searched for the proper names it contains: persons and places (localities, countries, etc.).
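The conversion step can be illustrated with a minimal sketch. It is not the application's actual code: the function name text_to_tei, the teiHeader/fileDesc/titleStmt layout, and the rule that paragraphs are separated by blank lines are all assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def text_to_tei(text, title="Untitled document"):
    """Wrap plain text in a minimal TEI skeleton (hypothetical helper)."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    tei = ET.Element("TEI", {"xmlns": TEI_NS})

    # Header with the bare minimum: a title.
    header = ET.SubElement(tei, "teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title

    # Body: a single div with a head and one p per paragraph.
    body = ET.SubElement(ET.SubElement(tei, "text"), "body")
    div = ET.SubElement(body, "div")
    ET.SubElement(div, "head").text = title
    for para in paragraphs:
        # Collapse internal line breaks so each p holds one flowing paragraph.
        ET.SubElement(div, "p").text = " ".join(para.split())

    return ET.tostring(tei, encoding="unicode")
```

Calling text_to_tei(document_text, title="Sample") returns the XML as a string; named entity tagging would be applied to the p elements in a later step.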

Proper names in the exact form recorded in the document transcription can be difficult to identify. Therefore, before linking names with external reference databases, the application performs "normalization/enrichment" of each name, using the context in which it occurs and a large language model (Gemini 3.1 Flash Lite Preview). As a result, the name "Fridericus" in a Latin document from the turn of the 15th and 16th centuries is recognized as "Fryderyk Jagiellończyk". In this form, the name is easier to find in reference databases. If the "normalization/enrichment" of a name is ineffective, the original name from the document is used for the search instead.
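The fallback rule described above can be sketched as follows. The function search_form and the normalize callable are hypothetical names introduced for illustration; the real application calls a large language model at this point, and how it decides that normalization was ineffective is not documented here.

```python
from typing import Callable, Optional


def search_form(
    original: str,
    context: str,
    normalize: Callable[[str, str], Optional[str]],
) -> str:
    """Return the name form to use when querying reference databases."""
    try:
        normalized = normalize(original, context)
    except Exception:
        # Assumption: any failure of the model call counts as
        # "normalization ineffective".
        normalized = None

    if normalized and normalized.strip():
        return normalized.strip()
    # Fall back to the name exactly as it appears in the document.
    return original
```

Passing the normalizer in as a parameter keeps the fallback logic independent of the model being used.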

The application searches the following databases: Wikidata, WikiHum, and, in the case of places, GeoNames. From each, it retrieves a list of the most probable candidates for identification (provided the database returns such a list) along with additional information. For Wikidata, for example, the description of the item and its name aliases are returned in addition to the name and Q identifier. A large language model then analyzes the candidate lists from the reference databases, the name as it appears in the document, and the context of its occurrence in the text, and selects the most appropriate candidate.
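As an illustration of candidate retrieval, the sketch below builds a query for Wikidata's public wbsearchentities action and reduces a response to the fields mentioned above. Whether TEXT2NER uses this endpoint is an assumption, and the helper names build_search_url and parse_candidates are hypothetical. The code performs no network request; it only prepares the URL and parses an already decoded JSON response.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(name, language="en", limit=5):
    """Build a wbsearchentities query URL for a (normalized) name."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def parse_candidates(response):
    """Keep the Q identifier, label, description and aliases of each hit."""
    return [
        {
            "id": item["id"],
            "label": item.get("label", ""),
            "description": item.get("description", ""),
            # Aliases are not present on every hit, so default to an empty list.
            "aliases": item.get("aliases", []),
        }
        for item in response.get("search", [])
    ]
```

The resulting list of dictionaries is the kind of structure that could be passed, together with the name and its context, to the model that picks the best candidate.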

The search result is recorded as a tag around the proper name, e.g., persName, with key and ref attributes, for example:

<persName key="Fryderyk Jagiellończyk"
ref="https://wikihum.lab.dariah.pl/entity/Q152903">Fridericus</persName>
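A tag of this shape can be produced with Python's standard library, as in the sketch below. The helper name tag_name is hypothetical, and the use of placeName for places is an assumption based on TEI conventions rather than something stated in this manual.

```python
import xml.etree.ElementTree as ET


def tag_name(surface, key, ref, tag="persName"):
    """Wrap a name as found in the text in a TEI name element.

    surface: the name exactly as written in the document
    key:     the normalized name chosen for the entity
    ref:     the URL of the entity in the reference database
    """
    elem = ET.Element(tag, {"key": key, "ref": ref})
    elem.text = surface
    # ElementTree escapes special characters in text and attributes.
    return ET.tostring(elem, encoding="unicode")


tagged = tag_name(
    "Fridericus",
    key="Fryderyk Jagiellończyk",
    ref="https://wikihum.lab.dariah.pl/entity/Q152903",
)
```

Building the element through an XML library rather than string concatenation avoids malformed output when a name or URL contains characters such as & or quotation marks.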

User Manual

Login

Access to the application is secured; you must log in.

Entering text

Paste the text of the historical document into the main field (5000-character limit). You can also use the list of examples visible below the field.

Analysis

Click "Analyze text". The system will automatically convert the text to TEI-XML format and search for entities in the Wikidata, WikiHum, and GeoNames databases (the analysis may take from several to several dozen seconds).

Results

The system will display two views:

  • XML Code: a preview of the TEI-XML structure, ready to be saved.
  • Text Preview: an interactive reading view with tooltips. Hover over a person or place to see data from the external databases.

Data Export

You can copy the result to the clipboard or download it as an .xml file.
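A downloaded file can be processed further with any XML library. The sketch below lists the tagged names in an exported document. It assumes that entities are encoded as persName and placeName elements with key and ref attributes; the function name list_entities is hypothetical.

```python
import xml.etree.ElementTree as ET


def list_entities(xml_text, tags=("persName", "placeName")):
    """Collect tagged names with their key and ref attributes."""
    root = ET.fromstring(xml_text)
    found = []
    for elem in root.iter():
        # Drop a "{namespace}" prefix if the document declares one.
        local = elem.tag.rsplit("}", 1)[-1]
        if local in tags:
            found.append(
                {
                    "tag": local,
                    "text": "".join(elem.itertext()),
                    "key": elem.get("key"),
                    "ref": elem.get("ref"),
                }
            )
    return found
```

To use it on an exported file, read the file as UTF-8 text and pass the contents to list_entities.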


Application screenshot