In Part 3, I built the "X-Ray machine" for my library, using Solr and Tika to crack open hundreds of PDF and EPUB files and make their text searchable. But I still have the "ghost town" problem: a search result for 9780596515805.pdf is technically accurate but contextually useless.
Today, I'm giving my librarian a brain. I've automated the transformation of raw filenames into rich, professional catalog entries using a two-pronged strategy: official records from the Open Library API and local AI intelligence via Ollama.
The Architecture of Intelligence
Even though this site is for my personal use, I didn't want to overload my network's resources by processing every newly added eBook on the spot. Instead, I use a decoupled, queue-based architecture: when Search API finishes indexing a document, an event subscriber triggers a background process to enrich the data.
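As a rough sketch of that hand-off (the module name and queue machine names are illustrative placeholders, and I'm assuming Search API's ItemsIndexedEvent; my real subscriber does more routing than this):
namespace Drupal\my_ebook_library\EventSubscriber;

use Drupal\Core\Queue\QueueFactory;
use Drupal\search_api\Event\ItemsIndexedEvent;
use Drupal\search_api\Event\SearchApiEvents;
use Symfony\Component\EventDispatcher\EventSubscriberInterface;

// Registered in the module's services.yml with the event_subscriber tag.
class SearchApiSubscriber implements EventSubscriberInterface {

  public function __construct(protected QueueFactory $queueFactory) {}

  public static function getSubscribedEvents(): array {
    // React after Search API has finished indexing a batch of items.
    return [SearchApiEvents::ITEMS_INDEXED => 'onItemsIndexed'];
  }

  public function onItemsIndexed(ItemsIndexedEvent $event): void {
    foreach ($event->getProcessedIds() as $item_id) {
      // Hand off to background queues instead of enriching inline.
      $this->queueFactory->get('ebook_metadata_enrichment')->createItem($item_id);
      $this->queueFactory->get('ebook_description_generator')->createItem($item_id);
    }
  }

}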
1. The Central Intelligence: EbookUtilitiesService
Every great library needs a lead librarian. I created the custom EbookUtilitiesService specifically for this purpose. The service is pretty much a Swiss Army Knife® and handles the heavy lifting of communicating with external APIs and my local AI servers.
One of its most critical roles is finding the "needle in the haystack": the ISBN. Since I already indexed the full text in Part 3, my service performs a surgical regex strike on the first 15,000 characters to identify the book:
// Snippet from my EbookUtilitiesService::getISBN()
// $front_matter holds the first 15,000 characters of extracted text.
$isbn_pattern = '/ISBN(?:-1[03])?[:\s]+((?:97[89][- ]?)?\d{1,5}[- ]?\d+[- ]?\d+[- ]?[\dX]+)/i';
if (preg_match($isbn_pattern, $front_matter, $matches)) {
  return trim($matches[1]);
}
2. The Fact-Checker: Open Library API
Once an ISBN is identified, I don't need to guess: I trigger the eBook Metadata Enrichment queue. I opted for the Open Library API because it provides a goldmine of canonical data … all for free!
By offloading this to a dedicated queue, I can handle external API rate limits or network hiccups without affecting other site processes. My service hits their endpoint to pull the following (see the lookup sketch after this list):
- Official Titles and Subtitles
- Publishers and Publication Years
- High-Resolution Covers – these are downloaded into the local Drupal file system, which allows me to use built-in Drupal image styles.
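Here's a rough sketch of that lookup using Open Library's public ISBN endpoint (the field handling and error handling are simplified for illustration and aren't my exact code):
// Hypothetical sketch: look up an edition record by ISBN.
try {
  $response = \Drupal::httpClient()->get("https://openlibrary.org/isbn/{$isbn}.json", [
    'timeout' => 10,
  ]);
  $record = json_decode((string) $response->getBody(), TRUE);
  $title = $record['title'] ?? NULL;
  $publisher = $record['publishers'][0] ?? NULL;
  $publish_date = $record['publish_date'] ?? NULL;
  // Covers are served separately, keyed by ISBN.
  $cover_url = "https://covers.openlibrary.org/b/isbn/{$isbn}-L.jpg";
}
catch (\GuzzleHttp\Exception\GuzzleException $e) {
  // Network hiccup or rate limit: log it and let the queue retry later.
}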
Best Practice: When downloading covers, I use Drupal 11's file.repository service. It replaces the deprecated system_retrieve_file() and is the modern way to handle managed files programmatically:
// FileExists::Replace comes from the \Drupal\Core\File\FileExists enum.
$file = \Drupal::service('file.repository')->writeData($image_data, $destination, FileExists::Replace);

3. The Deep Reader: Local AI with Ollama
What about technical whitepapers or obscure manuals that don't have an ISBN? This is where I use Ollama—a local AI server running on my Intel NUC i7.
This process is handled by a second, distinct queue: the AI Description Generator. I kept this separate because running an LLM locally is resource-intensive. While the Open Library queue is a quick "ask and receive" over the internet, the Ollama queue is a "think and generate" process that puts my NUC's CPU to work.
By passing the first 2,000 characters of the extracted text to a model like llama3.2, I generate a "Librarian's Summary." The secret is in the System Prompt I wrote:
"You are an expert librarian. Summarize the following book content in exactly two concise sentences. Focus on the core theme and target audience. Do not include introductory phrases like 'This book is' or 'In this excerpt.' Start immediately with the summary."
It's not the best way of getting descriptions from AI, but it doesn't cause my NUC to overheat and go into thermal shutdown either.
This post is part of a series. Check out the full roadmap at The Automated Librarian: A Drupal 11 Data Discovery.
4. Keeping it Robust: The Queue Workers
To ensure these calls don't block the UI, I use Drupal's Queue API.
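A queue worker is just a small plugin that cron (or Drush) drains a few items at a time. A stripped-down skeleton of the enrichment worker might look like this (the plugin ID, cron time, and the body of processItem() are placeholders, not my exact implementation):
namespace Drupal\my_ebook_library\Plugin\QueueWorker;

use Drupal\Core\Queue\QueueWorkerBase;

/**
 * Enriches an eBook node with Open Library metadata.
 *
 * @QueueWorker(
 *   id = "ebook_metadata_enrichment",
 *   title = @Translation("eBook Metadata Enrichment"),
 *   cron = {"time" = 30}
 * )
 */
class EbookMetadataEnrichment extends QueueWorkerBase {

  public function processItem($data) {
    // Load the entity, call the Open Library lookup, save the results.
    // Throwing an exception here leaves the item in the queue for a later retry.
  }

}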
- The Subscriber: My SearchApiSubscriber listens for items being indexed. It acts as the traffic cop, deciding which queue an item belongs in. If an ISBN is present, it heads to the Enrichment Queue; if a summary is missing, it heads to the AI Generator Queue. In many cases, an item is added to both to ensure the record is as complete as possible.
- The Cleanup: AI can be chatty. Even with a strict system prompt, LLMs love to add conversational filler like "Certainly! Here is a summary..." or "Based on the text provided...". To keep my library fields professional and clean, I use preg_replace in my queue worker to strip this "chatter" out before saving the entity:
// Stripping the "AI chatter" in the EbookDescriptionGenerator worker
$patterns = [
  '/^of the content:?\\s*/is',
  '/^Based on the provided (text|content|excerpt):?\\s*/is',
  '/^This (book|text|excerpt) is (designed to|about).*?:\\s*/is',
  '/^(Certainly!|Sure!|Here is a summary).*?\\n/is',
];
$entity->field_description = trim(preg_replace($patterns, '', $summary));

The Results
The transformation is night and day. What was once a list of cryptic filenames is now a visual library with high-res covers, accurate metadata, and summaries that tell me exactly what's inside—all without manual data entry.
What's Next?
My librarian can now read and categorize on command. In Part 5, I'm going to make the search experience more pleasant. To do this, I'll be adding facets and filters to the base search View to help me narrow down what I'm looking for faster.
Swiss Army Knife® is a registered trademark of Victorinox AG.