Devblog 3.2


Building a Retroactive Entity Linking Agent with FastAPI and spaCy

One of the core challenges in any intelligence or research platform is connecting the dots between raw text and structured data. Names appear in entries, but unless those names are formally tied to known entities, they are just strings. A feature I recently shipped closes that gap automatically.

I built a RetroactiveLinkingAgent that scans every entry in a tenant’s corpus, extracts named entity spans using a spaCy-backed NER microservice, and attempts to link each span to an existing entity in the database. The agent runs as a FastAPI background task, so triggering it returns immediately with a job record that the frontend can poll for progress.
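A minimal sketch of the trigger-and-poll shape described above, using only the standard library so it runs anywhere. The names `LinkingJob`, `JOBS`, `trigger`, and `poll` are hypothetical, and a plain thread stands in for the FastAPI background task. In a FastAPI handler, the same `run_job` function could be scheduled with `BackgroundTasks.add_task` instead.

```python
import threading
import uuid
from dataclasses import dataclass, field


@dataclass
class LinkingJob:
    """Hypothetical job record; the field names are illustrative only."""
    tenant_id: str
    total: int
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    processed: int = 0
    status: str = "queued"  # queued -> running -> done


# In-memory stand-in for a jobs table (assumption, not the real storage).
JOBS: dict[str, LinkingJob] = {}


def run_job(job: LinkingJob, entries: list[str]) -> None:
    """Worker body: walk the entries and update the record as it goes."""
    job.status = "running"
    for _entry in entries:
        # NER extraction and entity linking for one entry would happen here.
        job.processed += 1
    job.status = "done"


def trigger(tenant_id: str, entries: list[str]) -> tuple[LinkingJob, threading.Thread]:
    """Create the record, start the work in the background, return at once."""
    job = LinkingJob(tenant_id=tenant_id, total=len(entries))
    JOBS[job.id] = job
    worker = threading.Thread(target=run_job, args=(job, entries), daemon=True)
    worker.start()
    return job, worker


def poll(job_id: str) -> dict:
    """Shape of what a status endpoint might return to the frontend."""
    job = JOBS[job_id]
    return {"status": job.status, "processed": job.processed, "total": job.total}
```

The caller gets the job record back before any entry has been processed, and the frontend only needs the job id to ask for progress.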

The resolution logic works in three tiers. An exact name or alias match is auto-confirmed and written directly to the entry_entity_spans table. A fuzzy match, found with Postgres pg_trgm similarity, is routed to a human review queue instead. Spans with ambiguous types, such as organizations or dates, go to the same queue. For clear-cut cases where the NER service identifies a person or location that does not yet exist in the system, the agent creates a new entity on the fly and links it.
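The three tiers can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the agent's actual code: the trigram function is a rough in-memory imitation of pg_trgm rather than a database query, the 0.3 threshold is assumed, the ambiguous-type check is assumed to run before any matching, and the `skip` outcome for other labels is my addition. `Entity`, `resolve`, and the constant names are hypothetical.

```python
from dataclasses import dataclass

AUTO_CREATE_LABELS = {"PERSON", "LOC", "GPE"}  # the post's current configuration
REVIEW_LABELS = {"ORG", "DATE"}                # the "ambiguous types"
FUZZY_THRESHOLD = 0.3                          # assumed cut-off for a fuzzy match


def trigrams(text: str) -> set[str]:
    """Rough imitation of pg_trgm: lowercase, pad each word, take 3-grams."""
    grams: set[str] = set()
    for word in text.lower().split():
        padded = f"  {word} "  # two leading spaces, one trailing
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


@dataclass
class Entity:
    name: str
    aliases: tuple[str, ...] = ()


def resolve(span_text: str, label: str, entities: list[Entity]) -> str:
    """Return 'confirm', 'review', 'create', or 'skip' for one NER span."""
    if label in REVIEW_LABELS:
        return "review"  # ambiguous type: always a human decision (assumed order)
    wanted = span_text.casefold()
    for entity in entities:
        known = {entity.name.casefold(), *(a.casefold() for a in entity.aliases)}
        if wanted in known:
            return "confirm"  # tier 1: exact name or alias match
    if any(similarity(span_text, e.name) >= FUZZY_THRESHOLD for e in entities):
        return "review"  # tier 2: fuzzy match goes to the review queue
    if label in AUTO_CREATE_LABELS:
        return "create"  # tier 3: unknown person or place, create and link
    return "skip"  # any other label is left alone (assumption)
```

For example, with `Entity("Angela Merkel", aliases=("Merkel",))` in the list, the span `"Merkel"` resolves to `confirm`, while the misspelling `"Angela Merkle"` shares enough trigrams to land in `review`.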

That auto-creation step is where I ran into an accuracy problem. My current configuration auto-creates entities for PERSON, LOC, and GPE (geopolitical entities: cities, countries, states). GPE turns out to be noisy. spaCy frequently tags incidental location references as GPE, which inflates the auto-linked count with low-value or incorrect entities. I am evaluating whether to drop GPE from the auto-create list entirely and push those spans to the review queue instead.
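To make the trade-off concrete, here is a small sketch of how the auto-create list could be expressed as configuration and what dropping GPE would change. The set names and the `route_unmatched` and `summarize` helpers are hypothetical, and the sketch assumes that any unmatched span outside the auto-create list goes to the review queue.

```python
from collections import Counter

# Current list from the post, and the variant under evaluation.
CURRENT_AUTO_CREATE = frozenset({"PERSON", "LOC", "GPE"})
PROPOSED_AUTO_CREATE = CURRENT_AUTO_CREATE - {"GPE"}


def route_unmatched(label: str, auto_create: frozenset[str]) -> str:
    """Route a span that matched no existing entity."""
    return "create" if label in auto_create else "review"


def summarize(labels: list[str], auto_create: frozenset[str]) -> Counter:
    """Count how many unmatched spans end up on each route."""
    return Counter(route_unmatched(label, auto_create) for label in labels)
```

Running `summarize` over the labels of a past run with both sets gives a quick estimate of how many spans would move from auto-creation to review.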

The pipeline also supports cooperative cancellation, letting a manager stop a long run mid-stream without killing the server process.
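Cooperative cancellation usually means the worker checks a flag between units of work and exits cleanly when it is set. The sketch below shows that pattern with a `threading.Event`. The function name and the returned dictionary are hypothetical, and in a multi-process deployment the flag would more likely live on the job record in the database than in memory.

```python
import threading
from collections.abc import Iterable


def link_entries(entries: Iterable[str], cancel: threading.Event) -> dict:
    """Process entries one by one, checking the cancel flag before each."""
    processed = 0
    for _entry in entries:
        if cancel.is_set():
            # Stop between entries so no entry is left half-processed.
            return {"status": "cancelled", "processed": processed}
        # NER extraction and linking for one entry would go here.
        processed += 1
    return {"status": "done", "processed": processed}
```

Because the check happens only at entry boundaries, a stop request takes effect after the entry in flight finishes, and the server process itself is never interrupted.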

