
Project: Warkan – a Cost-Efficient AI Assistant for the Blog
Tools: Python, FastAPI, scikit-learn (TF-IDF), REST API, WordPress API, Cloudflare Workers AI
Description: A custom blog chatbot built on a Retrieval-Augmented Generation (RAG) architecture that combines semantic search with responses generated by a language model.
Objective: The goal was to build an assistant that answers using only the blog’s real content, while retaining control over the context, the system logic, and the cost of model queries.
Data: The data is sourced from blog posts dynamically fetched through the WordPress REST API.
Repository:
See on GitHub
Introduction
Warkan is a backend application designed to search blog content and generate summaries based on the most relevant posts.
The solution does not rely solely on a generative language model. A key component is the retrieval layer, which selects the relevant context before it is passed to the model. This ensures that responses are grounded in specific content rather than generated freely.
The project runs as a public web application connected to the website.
Data
The source data consists of blog posts fetched dynamically through the WordPress REST API.
The scope of processed information includes:
- post title,
- excerpt,
- full content,
- URL address.
The content is cleaned of HTML and converted into plain text before further analysis.
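The HTML-to-plain-text step can be sketched with the standard library alone. This is a minimal illustration, not the project’s actual cleaning code; the function name `strip_html` is assumed for the example.

```python
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text nodes while discarding tags.

    convert_charrefs=True (the default) already turns entities
    like &amp; into plain characters.
    """
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(raw: str) -> str:
    """Convert a WordPress 'rendered' HTML field into plain text."""
    parser = _TextExtractor()
    parser.feed(raw)
    text = "".join(parser.parts)
    # Collapse whitespace runs left behind by removed tags.
    return re.sub(r"\s+", " ", text).strip()
```

In practice this runs on the `title.rendered`, `excerpt.rendered`, and `content.rendered` fields returned by the WordPress REST API.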
Solution Architecture
The project was designed as a lightweight Retrieval-Augmented Generation (RAG) architecture consisting of three layers.
1) Data Layer
The application retrieves posts from a selected blog category and stores them in the server’s memory.
The index is built from each article’s title, excerpt, and full content.
A TTL cache mechanism is used to define the maximum lifetime of the index. Once it expires, the data is refreshed.
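A TTL cache of this kind can be sketched in a few lines. This is an illustrative implementation under assumed names (`TTLCache`, a `loader` callable that rebuilds the index), not the project’s exact code:

```python
import time

class TTLCache:
    """Holds the in-memory index and rebuilds it once the TTL expires."""

    def __init__(self, ttl_seconds: float, loader):
        self.ttl = ttl_seconds
        self.loader = loader          # callable that fetches posts and rebuilds the index
        self._value = None
        self._expires_at = 0.0

    def get(self):
        now = time.monotonic()
        if self._value is None or now >= self._expires_at:
            self._value = self.loader()          # refresh: re-fetch and re-index
            self._expires_at = now + self.ttl
        return self._value
```

Each request calls `cache.get()`; the WordPress API is hit only when the stored index has expired, which bounds both latency and the number of outbound requests.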
2) Retrieval Layer
A TF-IDF index is built over the post content using the scikit-learn library.
The user’s query is transformed into the same vector representation, and cosine similarity is then calculated against all posts.
Only the most relevant results (Top-K) that meet the minimum similarity threshold are returned.
Additionally, a contextual snippet is generated around the matched keyword.
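The retrieval step can be sketched as follows. This is a simplified, self-contained version: in the real service the vectorizer is fitted once when the index is built, not on every query, and the snippet logic here is a naive window around the first query term (the function name `search` and its parameters are assumptions of this example):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def search(posts, query, top_k=3, min_score=0.1, snippet_chars=120):
    """Rank posts by TF-IDF cosine similarity; keep only matches above the threshold."""
    texts = [p["title"] + " " + p["content"] for p in posts]
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(texts)        # one row per post
    query_vec = vectorizer.transform([query])       # query in the same vector space
    scores = cosine_similarity(query_vec, matrix)[0]

    ranked = sorted(range(len(posts)), key=lambda i: scores[i], reverse=True)
    results = []
    for i in ranked[:top_k]:
        if scores[i] < min_score:                   # minimum similarity threshold
            break
        content = posts[i]["content"]
        # Naive contextual snippet: a window around the first query term found.
        pos = max(content.lower().find(query.split()[0].lower()), 0)
        half = snippet_chars // 2
        snippet = content[max(0, pos - half): pos + half]
        results.append({
            "title": posts[i]["title"],
            "score": float(scores[i]),
            "snippet": snippet,
        })
    return results
```

Because both the posts and the query share one vector space, cosine similarity gives a comparable relevance score per post, and the Top-K cutoff plus the threshold keep weak matches out of the model’s context.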
3) Generation Layer
Only the selected search results along with their contextual snippets are passed to the language model.
The prompt was designed to:
- prevent the addition of new sources,
- enforce the exact number of described results,
- control the length of the response,
- maintain a consistent response language.
This approach reduces the risk of hallucinations and increases the system’s predictability.
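The prompt constraints listed above can be expressed as a small builder function. This is a hedged sketch; the exact wording of the project’s prompt differs, and `build_prompt` with its parameters is an assumption of this example:

```python
def build_prompt(query, results, max_words=150, language="English"):
    """Assemble a constrained prompt from the retrieved snippets only."""
    context = "\n\n".join(
        f"[{i + 1}] {r['title']}\n{r['snippet']}"
        for i, r in enumerate(results)
    )
    return (
        f"Answer in {language}, in at most {max_words} words.\n"          # length + language control
        f"Use ONLY the {len(results)} sources below; do not add others.\n"  # no new sources
        f"Describe exactly {len(results)} results.\n\n"                    # exact result count
        f"Sources:\n{context}\n\n"
        f"Question: {query}"
    )
```

Encoding the constraints in the prompt itself, rather than trusting the model’s defaults, is what makes the response format predictable across queries.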
Design Decisions
The project implements solutions that improve stability and control over system behavior:
- separation of the retrieval layer from the generation layer,
- limiting the number of results passed to the model,
- a minimum similarity threshold,
- control over the length of the provided context,
- handling situations where the API returns no response,
- storing the index in memory to reduce operational costs.
The priority was ensuring system predictability and maintaining deliberate control over every query sent to the language model.
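The empty-result and no-response cases from the list above can be handled with one explicit glue function. This is an illustrative sketch, assuming `retrieve` and `generate` callables for the two layers; the real service wires these into a FastAPI endpoint:

```python
def answer(query, retrieve, generate,
           fallback="Sorry, I could not find anything relevant on the blog."):
    """Tie the retrieval and generation layers together with explicit failure handling."""
    results = retrieve(query)
    if not results:                 # nothing passed the similarity threshold
        return fallback
    try:
        reply = generate(query, results)
    except Exception:
        reply = None                # model API unreachable or returned an error
    return reply or fallback        # empty model output also falls back
```

Returning a fixed fallback message instead of an unguarded error keeps the public endpoint stable even when the model API fails.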
Results
Application:
- dynamically retrieves and indexes blog content,
- searches posts based on semantic similarity,
- generates concise summaries based exclusively on the retrieved content,
- operates as a public component of the website,
- includes a mechanism that controls costs and the frequency of data refresh.
Further Development
Planned extensions:
- multilingual support,
- more precise context filtering.