Development

The Automated Librarian: Part 6 - Building a System That Feeds Itself

March 04, 2026

Six posts. One goal: A system that feeds itself.

The payoff for all this architectural groundwork isn't in the code; it's in the silence. It's the moment you drop a messy 300-page PDF into a folder, walk away for lunch, and come back to find it already indexed, tagged by AI, and sitting in your search results.

No 'Add Node' screens. No manual tagging. Just a chain reaction from folder to faceted search.

If you've been following since Part 1, this is the "Was it worth it?" moment. Here is how the loop closes.

TL;DR

The technical investment across this series produces a self-sustaining document ingestion pipeline. The only manual step remaining is placing a PDF file into a directory. Everything else — content creation, metadata generation, full-text indexing, and search availability — happens automatically on a schedule. This pattern isn't limited to eBooks; it applies to any document management workflow.

The Chain Reaction: From Cron to Search Results

Let's look under the hood at what actually happens when a new PDF lands in the source directory. The entire chain fires from a single crontab entry, and each link in that chain maps directly to work we did in a previous post.

Here's the sequence:

Step | What Happens | Origin
-----|--------------|-------
1 | Crontab fires on schedule | System cron (platform-agnostic)
2 | drush migrate:import ebook_import --update runs | Part 2: The Migration Engine
3 | migrate_plus detects the new PDF via change tracking and creates a new eBook node | Part 2: The Migration Engine
4 | Solr/Tika indexes the PDF content, extracting full text from the binary file | Part 3: Indexing PDF Content
5 | AI metadata enrichment processes fire, populating taxonomy terms, summaries, and descriptive fields | Part 4: Local AI & Metadata Makeover
6 | Content surfaces in the faceted search UI, immediately discoverable | Part 5: Fine-Tuning Facets & Search API
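Conceptually, the chain is just a series of stages applied to each new file. Here's a toy Python sketch of that flow (the stage names and data shapes are my own illustration, not the series' actual code; the real work is done by Drush, Tika, the AI enricher, and Search API):

```python
# Toy model of the ingestion chain: each stage is a stub standing in
# for what the real system does at that step.

def import_file(path):
    # Steps 2-3: migrate:import creates an eBook node for the new PDF
    return {"path": path, "title": path.rsplit("/", 1)[-1],
            "indexed": False, "tags": []}

def index_fulltext(node):
    # Step 4: Solr/Tika extracts and indexes the PDF's text layer
    node["indexed"] = True
    return node

def enrich_metadata(node):
    # Step 5: local AI populates taxonomy terms and summaries
    node["tags"].append("auto-tagged")
    return node

def run_pipeline(new_files):
    # Steps 1 and 6: cron fires the chain; results land in faceted search
    return [enrich_metadata(index_fulltext(import_file(p))) for p in new_files]

nodes = run_pipeline(["/var/data/ebooks/incoming/drupal-book.pdf"])
print(nodes[0]["title"], nodes[0]["indexed"], nodes[0]["tags"])
```

The point of the shape: every stage takes the previous stage's output, so adding or swapping a stage never touches the others.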

Six steps. Zero human intervention after the file drop. The linchpin holding all of this together is a single migration configuration file: web/modules/ebook_migration/config/install/migrate_plus.migration.ebook_import.yml

That YAML file defines the source directory, the process pipeline, and the destination entity. It's the contract between "a PDF exists on disk" and "a fully enriched, indexed node exists in Drupal." Every architectural decision from the series converges here.

The Migration as a Persistent Engine

Here's the non-obvious insight that makes this whole thing work: Drupal's migration system isn't a one-shot import tool.

Most developers encounter migrate and migrate_plus during a big replatforming project — move content from legacy CMS to Drupal, run the migration, celebrate, never think about it again. That's the common mental model. And it's underselling the system by a mile.

When you pair migrate_plus with track_changes and the --update flag on your Drush command, the migration becomes a persistent content ingestion engine. It maintains a map of what it's already processed. On every subsequent run, it checks the source against that map, detects new or changed files, and only processes the delta.

The relevant keys in the migration YAML that make this possible:

  
source:
  plugin: ebook_pdf_directory
  # The stable, accessible path: this IS the "drop files here" contract
  path: /var/data/ebooks/incoming
  track_changes: true

That track_changes: true directive is doing the heavy lifting. Combined with the crontab entry calling drush migrate:import ebook_import --update, the migration will re-scan the source path on every run, compare file hashes against its internal map, and only create nodes for genuinely new content.

Think of it like this: the source directory path in that YAML is a contract. It says, "Anything that appears here will be processed." The migrate_plus tracking says, "But I won't process the same thing twice." Together, they create the "just drop files here" workflow — stable, repeatable, and predictable.

Your crontab entry stays platform-agnostic and dead simple. In my case, this entire stack (Drupal, Solr, and the local AI models) is humming away on an Intel NUC i7 in my homelab. It's proof that you don't need a massive cloud budget to build a high-performance document pipeline; you just need the right architecture.

0 2 * * * cd /path/to/drupal && vendor/bin/drush migrate:import ebook_import --update 2>&1

This says: every day at 2 AM, the system checks for new material. Adjust the schedule to match your ingestion volume; every four hours works just as well if content arrives faster. That's it. That's the only infrastructure-level configuration outside of Drupal itself.

The Scanned PDF Gotcha

Before you celebrate the "drop it and forget it" promise, there is one big hurdle that will bite you if you're not ready: Image-based PDFs will break your indexing pipeline silently.

Solr/Tika works magic, but only when the PDF contains text data. A PDF created from scanned pages is just a stack of images. The file imports, the node is created, but the full-text index? Empty.

The Cursor Test: if you can't highlight text with your mouse in your PDF viewer, it's a flat image. And if you can't select the text, the system can't read it.

A lot of older or academic eBooks fall into this category: scanned pages, rasterized and wrapped in a PDF container, with no text layer for Tika to extract. (File size is another tell: a rasterized PDF is usually much larger than a text-based PDF with the same page count.) The file imports fine. The node gets created. The AI metadata enrichment might even do its best with whatever limited data is available. But the full-text index stays empty, and your faceted search will never surface the content of that book.
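If you want to automate the Cursor Test, you can pre-screen files before they reach the ingestion directory. The sketch below is a deliberately crude byte-level heuristic of my own (it just looks for font resources in the raw PDF, and will be fooled by files that compress their object dictionaries); for a real check, use `pdffonts` from poppler-utils or a library like pypdf:

```python
def probably_has_text_layer(pdf_bytes: bytes) -> bool:
    """Crude screen: text-bearing PDFs declare /Font resources,
    while pure scans typically only carry /Image XObjects."""
    return b"/Font" in pdf_bytes

# Hypothetical fragments standing in for real files:
text_pdf = b"%PDF-1.4 ... /Type /Page /Resources << /Font << /F1 5 0 R >> >> ..."
scan_pdf = b"%PDF-1.4 ... /Type /XObject /Subtype /Image /Filter /DCTDecode ..."
print(probably_has_text_layer(text_pdf))   # True
print(probably_has_text_layer(scan_pdf))   # False
```

Anything that fails the screen gets routed to OCR instead of straight into the incoming folder.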

The fix is an OCR pass, such as Tesseract, before the file ever enters the pipeline. This is the kind of real-world friction that doesn't show up in architecture diagrams: the pipeline assumes text-based PDFs, so if your source material includes scanned documents, you need an OCR pre-processing step, manual or automated, upstream of the ingestion directory. In my experience with large-ish document collections, this is the single most common reason a file "imports successfully" but remains invisible in search.
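If you automate that upstream step, a small script can OCR staged files before moving them into the incoming directory. The sketch below builds an `ocrmypdf` invocation (ocrmypdf is a real Tesseract wrapper, and its `--skip-text` flag leaves pages that already have a text layer alone); the paths and the injectable `run` parameter are my own assumptions for illustration:

```python
import subprocess

def ocr_command(src, dst):
    # ocrmypdf wraps Tesseract; --skip-text skips pages that
    # already contain a text layer.
    return ["ocrmypdf", "--skip-text", src, dst]

def preprocess(src, dst, run=subprocess.run):
    """OCR a staged PDF into the ingestion directory.

    `run` is injectable so the command construction can be exercised
    without ocrmypdf installed.
    """
    result = run(ocr_command(src, dst))
    return result.returncode == 0
```

Point `src` at a staging directory and `dst` at the incoming path from the migration YAML, and the migration never sees a PDF without a text layer.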

Beyond the Bookshelf

The eBook library is the use case that I built across these six posts. But step back and look at what I actually assembled: a pattern for automated document ingestion with AI-enriched metadata and full-text search.

By automating the 'Librarian,' we've shifted the cost of document management from recurring manual labor to a one-time architectural investment.

The building blocks are generic, and the efficiency is real, because this is a pattern, and patterns, by design, are repeatable.

  • migrate_plus as a persistent ingestion engine — any structured or semi-structured source, not just PDFs.  
  • Local AI for metadata enrichment — classification, summarization, tagging. The business rules change; the pipeline doesn't.  
  • Solr/Tika for full-text indexing — works for contracts, research papers, internal documentation, compliance records. Anywhere humans need to search inside documents.  
  • Faceted Search API UI — the discovery layer that makes the indexed content actually useful.

Swap "eBooks" for "vendor contracts" or "regulatory filings" or "internal knowledge base documents." The architecture holds. What changes is the migration source plugin, the AI prompt, and the facet configuration. The Drupal-native patterns — the service container, the migration framework, the Search API abstraction — those are the foundation that makes the whole thing repeatable.

That's the real payoff of this series: not just a working eBook library, but a reusable architectural pattern for self-sustaining content ingestion, limited only by your business requirements and, honestly, your imagination for what to throw into the hopper next.

Stop wrestling with messy sources and start building something scalable.

Start the Brainstorm

I'd love to hear about your specific project. Reach out today and let's make the Odyssey a shared one.

Author

Ron Ferguson

 
