workday-scraper-api

Flask API service for automated job scraping with database backend

📖 API Documentation

The OpenAPI Swagger UI for this API is available at:

👉 View the API Docs

Features

Headless scraping of Workday-powered job postings
Persist scraped data in a SQLite database
Expose a RESTful Flask API to query and trigger scrapes
Configurable via environment variables and .env
Structured logging to console and rotating log files
Automated changelog and daily ingestion via GitHub Actions

Architecture

Core: Python 3.12, Flask
Scraper: Vendored Workday-scraper logic under app/scraper_pkg
Storage: SQLite (jobs.db)
CLI: run.py, powered by Click, supports the scrape and serve commands
API: Blueprint jobs_bp exposes /jobs/... routes
Config: environment-driven settings via python-dotenv and app/config.py

Getting Started

Prerequisites

Python 3.12
Git
(Optional) Conda or virtualenv

Clone & Install

git clone https://github.com/jharemza/workday-scraper-api.git
cd workday-scraper-api

# Using virtualenv
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Configuration

Create a .env file in the project root to override defaults (see app/config.py):

# Path to SQLite DB
JOBS_DB_PATH=./jobs.db

# Scrape settings
SCRAPE_LIMIT=20

# API server settings
API_HOST=127.0.0.1
API_PORT=5000

# Logging
LOG_LEVEL=INFO
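
To show how these variables might be consumed, here is a minimal sketch of an environment-driven settings loader. It is not the contents of app/config.py: the `Settings` class and `load_settings` function are hypothetical names, and the fallback values simply mirror the sample .env above (only the host and port defaults are stated elsewhere in this README). The project loads .env with python-dotenv first; that step is left out so the sketch needs only the standard library.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables documented above."""

    jobs_db_path: str
    scrape_limit: int
    api_host: str
    api_port: int
    log_level: str


def load_settings(env=None):
    """Read settings from a mapping (os.environ by default).

    Fallbacks mirror the sample .env; they are assumptions, not
    necessarily the defaults defined in app/config.py.
    """
    env = os.environ if env is None else env
    return Settings(
        jobs_db_path=env.get("JOBS_DB_PATH", "./jobs.db"),
        scrape_limit=int(env.get("SCRAPE_LIMIT", "20")),
        api_host=env.get("API_HOST", "127.0.0.1"),
        api_port=int(env.get("API_PORT", "5000")),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests, for example `load_settings({"API_PORT": "8080"})`.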

Initialize the Database

On first run, the table is created automatically. To reset or customize it:

sqlite3 jobs.db << 'EOF'
DROP TABLE IF EXISTS job_postings;
-- paste the CREATE TABLE DDL from app/db.py here
EOF
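
The same reset can be scripted with Python's standard sqlite3 module. The sketch below is illustrative only: `reset_job_postings` is a hypothetical helper, and the column list is a placeholder, since the real DDL lives in app/db.py.

```python
import sqlite3

# Placeholder schema for illustration only; the real DDL is in app/db.py.
PLACEHOLDER_DDL = """
CREATE TABLE job_postings (
    id INTEGER PRIMARY KEY,
    company TEXT NOT NULL,
    title TEXT NOT NULL
)
"""


def reset_job_postings(conn, ddl=PLACEHOLDER_DDL):
    """Drop and recreate the job_postings table on an open connection."""
    with conn:  # commits on success, rolls back if a statement fails
        conn.execute("DROP TABLE IF EXISTS job_postings")
        conn.execute(ddl)
```

To apply it to the on-disk database, open a connection with `sqlite3.connect("jobs.db")`, call `reset_job_postings(conn)`, and close the connection afterwards.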

Usage

CLI Commands

Scrape all companies

python run.py scrape

Scrape specific companies

python run.py scrape -c "M&T Bank" -c "Acme Corp"

Start the API server

python run.py serve

API Endpoints

All responses are JSON.

| Method | Path | Description |
| ------ | ---- | ----------- |
| GET | `/jobs/all` | List all current job postings |
| GET | `/jobs/today` | Jobs scraped on the current date |
| GET | `/jobs/company/{company}` | All current jobs for a given company |
| GET | `/jobs/company/{company}/new` | Jobs added today for a given company |
| POST | `/jobs/scrape` | Trigger a fresh scrape (body: `{"companies": [...]}`) |
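
Because the company name is part of the URL path, names containing spaces or `&` should be percent-encoded by the client. A minimal sketch of building the company URLs follows; `BASE_URL` and `company_jobs_url` are hypothetical names, and it assumes the server decodes percent-encoded path segments, as Flask normally does.

```python
from urllib.parse import quote

BASE_URL = "http://127.0.0.1:5000"  # assumed default host and port


def company_jobs_url(company, new_only=False, base_url=BASE_URL):
    """Build the /jobs/company/{company} URL, optionally with /new."""
    # safe="" so that no reserved character (space, '&') is left unescaped
    encoded = quote(company, safe="")
    suffix = "/new" if new_only else ""
    return f"{base_url}/jobs/company/{encoded}{suffix}"
```

For example, `company_jobs_url("M&T Bank")` yields a path ending in `/jobs/company/M%26T%20Bank`.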

Directory Structure

.
├── .github/
│   └── workflows/     # CI/CD (release & daily ingest)
├── app/
│   ├── main.py        # Flask app & logging setup
│   ├── routes.py      # API endpoints
│   ├── db.py          # SQLite schema & CRUD
│   ├── config.py      # env-driven settings
│   ├── scraper.py     # orchestrates vendored scraper + DB upserts
│   └── scraper_pkg/   # vendored workday_scraper modules
├── docs/
│   └── openapi.yaml   # (optional) OpenAPI spec
├── logs/
│   └── app.log        # auto-rotated logs
├── tests/             # pytest suite
├── .env               # environment overrides (gitignored)
├── jobs.db            # SQLite DB (auto-generated)
├── README.md
├── requirements.txt
└── run.py             # CLI commands (scrape & serve)

Logging

Console: Verbose, timestamped output
File: logs/app.log (rotates at 10 MB, keeps 5 backups)
Level: Controlled by LOG_LEVEL (DEBUG, INFO, etc.)
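
A file handler with the rotation policy described above can be sketched with the standard library as follows. This is not the setup code from app/main.py: `build_file_handler` is a hypothetical helper, the format string is an assumption, and 10 MB is taken to mean 10 * 1024 * 1024 bytes.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_file_handler(log_path, level="INFO"):
    """Return a rotating file handler: about 10 MB per file, 5 backups."""
    handler = RotatingFileHandler(
        log_path,
        maxBytes=10 * 1024 * 1024,  # assumed interpretation of "10 MB"
        backupCount=5,
        delay=True,  # do not open the file until the first record is written
    )
    # Unknown level names fall back to INFO.
    handler.setLevel(getattr(logging, level.upper(), logging.INFO))
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    return handler
```

Attach the handler with `logging.getLogger().addHandler(build_file_handler("logs/app.log", "DEBUG"))`; the logs directory must already exist by the time the first record is written.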

Testing

pytest --cov=app --cov-report=xml tests/

CI/CD

.github/workflows/release.yml: Auto-update CHANGELOG.md on tags/schedule
.github/workflows/ingest.yml: Daily or manual scrape & optional DB commit

License

This project is licensed under the MIT License.

Example Requests

You can interact with the API directly via curl or import these into Postman.

Using `curl`

Replace {{API_HOST}} and {{API_PORT}} with your configured values (defaults: 127.0.0.1:5000).

List all jobs

curl -X GET http://{{API_HOST}}:{{API_PORT}}/jobs/all

Jobs scraped today

curl -X GET http://{{API_HOST}}:{{API_PORT}}/jobs/today

All current jobs for a company

curl -X GET http://{{API_HOST}}:{{API_PORT}}/jobs/company/"M&T%20Bank"

New jobs for a company (added today)

curl -X GET http://{{API_HOST}}:{{API_PORT}}/jobs/company/"M&T%20Bank"/new

Trigger a scrape for one or more companies

curl -X POST http://{{API_HOST}}:{{API_PORT}}/jobs/scrape \
  -H "Content-Type: application/json" \
  -d '{"companies": ["M&T Bank", "Acme Corp"]}'
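
The same scrape request can be sent from Python without extra dependencies. The following is a minimal sketch using urllib; `build_scrape_request` and `trigger_scrape` are hypothetical helper names, and the base URL assumes the default host and port. The shape of the JSON response is not specified here, so the sketch simply returns whatever the server sends.

```python
import json
from urllib.request import Request, urlopen


def build_scrape_request(companies, base_url="http://127.0.0.1:5000"):
    """Build a POST /jobs/scrape request with a JSON body."""
    body = json.dumps({"companies": list(companies)}).encode("utf-8")
    return Request(
        f"{base_url}/jobs/scrape",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_scrape(companies, base_url="http://127.0.0.1:5000", timeout=30):
    """Send the request and return the decoded JSON response."""
    request = build_scrape_request(companies, base_url)
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server running, `trigger_scrape(["M&T Bank", "Acme Corp"])` mirrors the curl command above.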

Using Postman

1. Create a new Collection called “Workday Scraper API”.
2. Add a Request for each endpoint:
   - Method: GET or POST
   - URL: http://127.0.0.1:5000/jobs/all (or other endpoints)
   - Headers: for POST, set Content-Type: application/json
   - Body (raw JSON) for /jobs/scrape:
   ```
   {
     "companies": ["M&T Bank", "Acme Corp"]
   }
   ```
3. Save and Send. The JSON response appears in Postman’s response pane.
