SGKP-Search
Description
A search engine for the SGKP, the Geographical Dictionary of the Kingdom of Poland and Other Slavic Countries (information about the dictionary on Wikipedia). It combines two approaches to information retrieval, a combination known as hybrid search:
- Full-Text Search: Works classically by searching for word occurrences in the text. This is the ideal solution when the user knows the name of the searched object or historical term. The engine handles simple typos (searching with a certain degree of 'fuzziness' or tolerance), which is crucial when working with text generated via Optical Character Recognition (OCR).
- Semantic (Vector) Search: Searches by meaning rather than by character strings. A query such as "mining sites" returns entries that describe mines even when those exact words do not appear in the text. This also helps bypass OCR errors, because minor scanning errors usually leave the semantic context intact.
The search engine is Meilisearch, a lightweight open-source engine that supports both fast full-text search and vector search. The vector representations of the text (so-called embeddings) were created with the OpenAI model `text-embedding-3-small`, which converts text fragments from the dictionary into sequences of numbers (vectors) that "encode" the meaning of the text. The application itself is written in Python using the Flask framework.
The SGKP text comes from OCR processing (Tesseract) of 15 volumes of the Dictionary. The import scripts divided the text into entries and sub-entries (some Dictionary entries are collective entries that describe many locations with the same name), so that a result can point to a precise location within such a voluminous work.
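As an illustration of how Meilisearch can be told to produce embeddings with `text-embedding-3-small`, the sketch below builds an embedder settings payload. It is not taken from the project's code: the embedder name `default`, the document fields `title` and `text`, and the helper name `build_embedder_settings` are assumptions made for the example.

```python
import os


def build_embedder_settings(api_key: str, embedder_name: str = "default") -> dict:
    """Return a Meilisearch settings payload that registers an OpenAI embedder.

    The embedder name and the template fields are hypothetical; adapt them
    to the actual index schema.
    """
    return {
        "embedders": {
            embedder_name: {
                "source": "openAi",
                "apiKey": api_key,
                "model": "text-embedding-3-small",
                # Assumed template: each document is expected to expose
                # "title" and "text" fields (Liquid syntax, filled in by Meilisearch).
                "documentTemplate": "{{doc.title}}: {{doc.text}}",
            }
        }
    }


# The key is read from the environment; the fallback is only a placeholder.
settings = build_embedder_settings(os.environ.get("OPENAI_API_KEY", "sk-placeholder"))
print(settings["embedders"]["default"]["model"])
```

A payload of this shape would be sent to the index settings endpoint, for example through the `update_settings` method of the official Python client.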
User Interface
The interface available at ai.ihpan.edu.pl/sgkpsearch is minimalist and focused on usability, containing:
- Search Bar: The user enters any phrase; the Search button starts the process.
- "Keyword vs. Semantic Balance" Slider: Determines the proportion between keyword search (traditional full-text) and vector search (semantic).
- Results List: Presented as a list of found entries ranked by relevance. Each result contains:
  - Entry title
  - Information about the volume and page; the page number is also a link to the Dictionary scan on the ICM (Interdisciplinary Centre for Mathematical and Computational Modelling at UW) servers.
  - Other basic information about the entry, if available. Data for SGKP entries, e.g., names of counties, communes, parishes, owners, and objects occurring in a given location, were acquired through automatic text processing and information extraction by Large Language Models (LLMs) as part of the project "Cultural and intellectual geography of the former Polish lands under the partitions 1865–1918 – digital vademecum".
  - A text fragment with context (where content matching the keywords was found); in the case of full-text search, keywords are highlighted in yellow.
How to Search
- Enter the phrase you are interested in into the search box, without quotation marks, e.g., "grain milling" (pl. mielenie ziarna), "Karaim", "beer brewery" (pl. piwo browar), "steam mill" (pl. młyn parowy), "Poniatowski", "Namysłów". Enclosing several words in quotation marks searches for that exact phrase. For example, compare the search results for Stanisław August Poniatowski (about 3,000 results) and "Stanisław August Poniatowski" (6 results).
- Press "Search" or the Enter key.
- The search engine uses hybrid search, combining keywords with vector (semantic) search to provide the best results. The "Keyword vs. Semantic Balance" slider determines how much each search type contributes to the result. The default setting of 0% searches by keywords only; 100% is an experimental mode that uses only vector (semantic) search. Any other value is a hybrid search that weights the two methods proportionally.
- Keyword search looks for entries containing the words entered in the search box; the search is performed with some tolerance, i.e., entering the word "mills" will return results containing both that word and "mill", "to a mill", etc.
- Vector (semantic) search finds entries based on the meaning of the entered words, so the phrase "grain milling" will return entries containing words such as "mill" or "mills". With the slider set to keyword-only search, the same phrase would return entries in which the specific word "milling" appears.
- Semantic search does not answer questions (like a chat with a language model); it only returns results with a meaning similar to the query, and it does not return a complete, precise list of information.
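The slider behaviour described above maps naturally onto the `hybrid` search parameter in Meilisearch, where `semanticRatio` runs from 0 (keywords only) to 1 (vectors only). The following is a minimal sketch of that mapping, not the application's actual code; the function name `build_search_params` and the embedder name `default` are assumptions.

```python
def build_search_params(balance_percent: int, embedder: str = "default") -> dict:
    """Translate a 0-100 slider value into Meilisearch search parameters.

    0 means keyword-only search, 100 means vector-only search, and any
    value in between blends the two proportionally.
    """
    if not 0 <= balance_percent <= 100:
        raise ValueError("balance_percent must be between 0 and 100")

    params: dict = {}
    if balance_percent > 0:
        # Only request hybrid search when the semantic part is used at all,
        # so that a keyword-only query does not need an embedding.
        params["hybrid"] = {
            "semanticRatio": balance_percent / 100,
            "embedder": embedder,  # hypothetical embedder name
        }
    return params


keyword_only = build_search_params(0)
blended = build_search_params(25)
vector_only = build_search_params(100)
print(keyword_only, blended, vector_only)
```

The resulting dictionary could be passed as the second argument of the Python client's `index.search(query, params)` call. Leaving out `hybrid` at 0% is a design choice of this sketch; sending `semanticRatio: 0` would also yield keyword-only ranking.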
Results
- If there are more than 20 results, the application displays results divided into pages; navigation buttons are located at the bottom of the results list: "First", "Previous", "Next", "Last".
- Click on the result title (label) to see more details in a separate window (e.g., the full text of the entry).
- For each entry on the results list, the volume number and the page number where the entry begins in the printed version are displayed. The page number is a link leading to the page scan on the ICM servers.
- Use the "Clear" button to reset the view.
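The paging rules above (20 results per page, with "First", "Previous", "Next", and "Last" buttons) come down to a small amount of arithmetic. The sketch below shows one way to compute it; the names `PageNav` and `page_navigation` are hypothetical and the application may organise this differently.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PageNav:
    first: int
    previous: Optional[int]  # None on the first page
    next: Optional[int]      # None on the last page
    last: int
    offset: int              # index of the first result shown on this page


def page_navigation(total_hits: int, page: int, per_page: int = 20) -> PageNav:
    """Compute navigation targets for a 1-based page number."""
    last = max(1, math.ceil(total_hits / per_page))
    page = min(max(page, 1), last)  # clamp out-of-range page requests
    return PageNav(
        first=1,
        previous=page - 1 if page > 1 else None,
        next=page + 1 if page < last else None,
        last=last,
        offset=(page - 1) * per_page,
    )


nav = page_navigation(total_hits=45, page=2)
print(nav)
```

With 45 results, page 2 covers results 21 to 40, so `offset` is 20 and the last page is 3.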
Notes:
- The application contains pre-processed SGKP content; typos, errors in entry segmentation, and other processing artifacts may occur.
- The search engine is at an early testing stage; results may be incomplete or inaccurate.
- Search queries are limited to 10 words. Any words beyond the first 10 will be ignored.
- The source code of the application is available in the GitHub repository (project sgkp_search).
- The source code of the scripts used for automatic information extraction from SGKP entries is available in the GitHub repository (project sgkp_information_extraction).
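The ten-word query limit mentioned in the notes can be made visible to the user by trimming the query before it is sent. This is only a sketch: it assumes that words are separated by whitespace, which may differ from how the engine itself counts terms, and `truncate_query` is a hypothetical helper name.

```python
MAX_QUERY_WORDS = 10


def truncate_query(query: str, max_words: int = MAX_QUERY_WORDS) -> str:
    """Keep only the first `max_words` whitespace-separated words."""
    # split() without arguments also collapses repeated spaces.
    return " ".join(query.split()[:max_words])


short = truncate_query("  steam   mill ")
trimmed = truncate_query("a b c d e f g h i j k l")
print(short)
print(trimmed)
```

Showing the trimmed query back to the user would make it clear which words were actually searched.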