Querying arXiv preprints using Apache Airflow
I experimented with Apache Airflow to schedule hourly workflows that fetch recent preprints from several arXiv categories via the public arXiv.org REST API. The fetched articles are then stored in a PostgreSQL database through a custom FastAPI-based REST API.
The setup looks like this:
The code is fully dockerized and available on GitHub along with more detailed documentation.
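As a rough illustration of the fetch step described above, here is a minimal sketch of querying the arXiv API for the newest preprints in one category and parsing the Atom feed it returns. It is not taken from the repository: the function names (build_query_url, parse_feed, fetch_recent) and the choice of fields are assumptions made for this example, and the actual project may structure this differently.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Public arXiv API endpoint; responses are Atom XML feeds.
ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"


def build_query_url(category, max_results=10):
    """Build a query URL for the newest submissions in one category (hypothetical helper)."""
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_feed(xml_data):
    """Turn an Atom feed (str or bytes) into a list of plain dicts."""
    root = ET.fromstring(xml_data)
    articles = []
    for entry in root.findall(f"{ATOM}entry"):
        title = entry.findtext(f"{ATOM}title", default="")
        articles.append({
            "id": entry.findtext(f"{ATOM}id", default="").strip(),
            # Titles in the feed may be wrapped across lines; collapse whitespace.
            "title": " ".join(title.split()),
            "published": entry.findtext(f"{ATOM}published", default="").strip(),
            "categories": [c.get("term") for c in entry.findall(f"{ATOM}category")],
        })
    return articles


def fetch_recent(category, max_results=10):
    """Fetch and parse the newest preprints for a category (performs a network call)."""
    url = build_query_url(category, max_results)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_feed(resp.read())
```

In a scheduled workflow, a function like fetch_recent("cs.LG") could serve as the body of an hourly task, with the resulting dicts passed on to the storage API. Keeping URL construction and parsing separate from the network call makes both easy to test without contacting arXiv.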