Skip to content

Add optional sharded on-disk layout for images and cache#52

Open
hauxir wants to merge 2 commits into
masterfrom
feature/sharded-storage
Open

Add optional sharded on-disk layout for images and cache#52
hauxir wants to merge 2 commits into
masterfrom
feature/sharded-storage

Conversation

@hauxir
Copy link
Copy Markdown
Owner

@hauxir hauxir commented Apr 14, 2026

Summary

  • New SHARD_STORAGE setting (default off). When enabled, files live under 2-level × 2-char subdirectories derived from the filename prefix, e.g. /images/ab/cd/abcd1234.jpg.
  • Keeps each directory small at TB scale — `readdir`, backups, admin list, cache eviction all stay fast.
  • URLs unchanged. Sharding is purely an internal layout detail; `X-Sendfile`/nginx need no config changes.
  • New helper `resolve_existing_path` falls back to the flat location, so mixed-era installs keep working while you migrate.
  • New `app/migrate_shard.py` script (idempotent, dry-run supported) relocates existing flat files into their shard paths.

Tradeoffs / caveats

  • Segment length (2) and depth (2) are constants, not configurable — chosen to match git/S3 conventions.
  • Works best with `NAME_STRATEGY=uuidv4`. For `randomstr` (5-char names), sharding technically works but bucket density is low; noted in the setting's comment.
  • Run the migration script with the app stopped (or uploads paused) on both `IMAGES_DIR` and `CACHE_DIR` before flipping the setting.

Test plan

  • With `SHARD_STORAGE=False`, uploads/GET/DELETE behave exactly as today (flat layout)
  • With `SHARD_STORAGE=True`, a new upload lands at `/images/ab/cd/abcd...ext` and is readable via the same URL
  • Resize variants land in the same shard path under `/cache/`
  • DELETE removes both the image and its cached resized versions
  • After toggling `SHARD_STORAGE=True` without migrating, legacy flat files are still served (mixed mode)
  • `python app/migrate_shard.py /images --dry-run` shows planned moves; without `--dry-run` it actually relocates them; re-running is a no-op

🤖 Generated with Claude Code

Flat IMAGES_DIR/CACHE_DIR directories become slow for readdir, backups,
and cache sweeps once they hold millions of entries. When SHARD_STORAGE
is enabled, files are stored under two 2-char subdirectories derived
from the filename prefix (e.g. /images/ab/cd/abcd1234.jpg), keeping
every directory small. URLs are unchanged — sharding is purely an
on-disk detail.

resolve_existing_path falls back to the flat location so mixed-era
installs keep working; run app/migrate_shard.py once against each
directory to relocate legacy flat files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@qodo-free-for-open-source-projects
Copy link
Copy Markdown

Review Summary by Qodo

Add optional sharded on-disk layout for images and cache

✨ Enhancement

Grey Divider

Walkthroughs

Description
• Implement optional sharded on-disk layout for images and cache directories
• Add SHARD_STORAGE setting to organize files under 2-level subdirectories
• Introduce resolve_path() and resolve_existing_path() helpers for path resolution
• Create migration script to relocate existing flat files to sharded layout
Diagram
flowchart LR
  A["File Upload/Request"] --> B{"SHARD_STORAGE<br/>Enabled?"}
  B -->|No| C["Flat Layout<br/>/images/file.jpg"]
  B -->|Yes| D["Sharded Layout<br/>/images/ab/cd/file.jpg"]
  E["resolve_existing_path()"] --> F{"Check Flat<br/>First"}
  F -->|Found| G["Return Flat Path"]
  F -->|Not Found| H{"Check Sharded<br/>if Enabled"}
  H -->|Found| I["Return Sharded Path"]
  H -->|Not Found| J["Return None"]
  K["migrate_shard.py"] --> L["Move Files to<br/>Sharded Layout"]
Loading

Grey Divider

File Changes

1. app/settings.py ⚙️ Configuration changes +6/-0

Add SHARD_STORAGE configuration setting

• Add new SHARD_STORAGE boolean setting (default False)
• Include documentation explaining sharding benefits and best practices
• Note that sharding works best with NAME_STRATEGY=uuidv4

app/settings.py


2. app/imgpush.py ✨ Enhancement +40/-3

Add path resolution functions for sharded storage

• Add _shard_segments() helper to compute 2-level shard path components
• Implement resolve_path() to map filenames to flat or sharded paths with optional parent
 directory creation
• Implement resolve_existing_path() to support mixed-era installs by checking flat location first,
 then sharded
• Update get_random_filename() to search correct directory based on shard layout
• Update delete_image() to use resolve_existing_path() and compute cache pattern in correct
 shard directory

app/imgpush.py


3. app/app.py ✨ Enhancement +5/-5

Integrate sharded path resolution into endpoints

• Replace direct os.path.join() calls with imgpush.resolve_path() in upload_image() for output
 path
• Replace direct os.path.join() calls with imgpush.resolve_path() in get_image() for resized
 cache path
• Update get_image() to use imgpush.resolve_existing_path() for reading images
• Simplify file existence check from os.path.isfile(path) to path is None after
 resolve_existing_path()

app/app.py


View more (1)
4. app/migrate_shard.py ✨ Enhancement +66/-0

Add migration script for flat to sharded layout

• Create new standalone migration script to relocate files from flat to sharded layout
• Implement migrate() function that scans base directory and moves files to shard paths
• Support --dry-run flag to preview moves without modifying disk
• Skip files with names too short for sharding and non-file entries
• Make script idempotent and safe to re-run

app/migrate_shard.py


Grey Divider

Qodo Logo

@qodo-free-for-open-source-projects
Copy link
Copy Markdown

qodo-free-for-open-source-projects Bot commented Apr 14, 2026

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0)

Grey Divider


Action required

1. Path traversal file serving🐞
Description
get_image() accepts {filename:path} and passes it into
resolve_existing_path()/resolve_path() which build filesystem paths via os.path.join(...)
without any realpath+prefix containment check, so requests containing ../ or absolute path
components can escape IMAGES_DIR/CACHE_DIR and be served/written outside the intended
directories (worst case when SHARD_STORAGE=True).
Code

app/app.py[R217-230]

   w: str = Query(default=""),
   h: str = Query(default=""),
) -> FileResponse:
-    path = os.path.join(settings.IMAGES_DIR, filename)
+    path = imgpush.resolve_existing_path(settings.IMAGES_DIR, filename)

-    if not os.path.isfile(path):
+    if path is None:
       raise HTTPException(status_code=404, detail="File not found")

   filename_without_extension, extension = os.path.splitext(filename)

-    if (w or h) and os.path.isfile(path) and extension not in (".mp4", ".svg"):
+    if (w or h) and extension not in (".mp4", ".svg"):
       try:
           width = imgpush.get_size_from_string(w)
           height = imgpush.get_size_from_string(h)
Evidence
The route uses {filename:path}, meaning the client-controlled filename can include slashes and
traversal sequences. get_image() then serves whatever resolve_existing_path() returns via
FileResponse/X-Sendfile, but neither get_image() nor resolve_existing_path() enforces that
the resolved path remains under IMAGES_DIR. Additionally, when sharding is enabled, shard segments
are derived from the raw filename and joined directly; if a segment starts with /,
os.path.join can discard base_dir, enabling base-dir escape in the sharded branch as well.

app/app.py[193-247]
app/imgpush.py[52-79]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`get_image()` uses a client-controlled `{filename:path}` and turns it into a filesystem path via `resolve_existing_path()`/`resolve_path()` without verifying the final resolved path is contained within the intended base directory. With `SHARD_STORAGE=True`, `_shard_segments()` can also produce segments beginning with `/`, which can cause `os.path.join()` to drop `base_dir`.
## Issue Context
- This is a security boundary: the service must never serve or create files outside `IMAGES_DIR`/`CACHE_DIR`.
- `delete_image()` already uses `realpath` + prefix checks, but `get_image()` does not.
## Fix Focus Areas
- app/app.py[193-247]
- app/imgpush.py[52-79]
## Implementation notes
- Prefer changing the route to `/{filename}` (no `:path`) if nested paths are not required.
- In `resolve_path` and `resolve_existing_path`, reject any `filename` that is absolute or contains path separators (or normalize and enforce containment):
- `candidate = os.path.realpath(os.path.join(base_dir, ...))`
- verify `candidate.startswith(os.path.realpath(base_dir) + os.sep)`
- raise/return `None` if invalid.
- Ensure sharded segment generation uses a safe key (e.g., `os.path.basename(filename)` and/or strips separators) so segments can never start with `/` or contain separators.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. Flat cache orphaned🐞
Description
When SHARD_STORAGE is enabled after previously running with flat cache files, delete_image()
only deletes cached variants within the sharded cache directory derived from the original filename,
leaving legacy flat cached resize files undeleted.
Code

app/imgpush.py[R215-220]

   # Also sanitize cache path - escape glob special chars in filename
   safe_filename = os.path.basename(image_path)
   filename_without_ext, extension = os.path.splitext(safe_filename)
-    cache_pattern = os.path.join(settings.CACHE_DIR, f"{glob.escape(filename_without_ext)}_*x*{glob.escape(extension)}")
+    cache_dir = os.path.dirname(resolve_path(settings.CACHE_DIR, safe_filename))
+    cache_pattern = os.path.join(cache_dir, f"{glob.escape(filename_without_ext)}_*x*{glob.escape(extension)}")
   cached_files = glob.glob(cache_pattern)
Evidence
Resized images are written to resolve_path(CACHE_DIR, resized_filename, ...), which will change
the cache directory when sharding is enabled. delete_image() now computes a single cache_dir
from resolve_path(CACHE_DIR, safe_filename) and only globs within that directory, so any existing
flat cache entries in the base cache directory will not match and will remain on disk.

app/app.py[225-246]
app/imgpush.py[196-225]
app/imgpush.py[57-64]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`delete_image()` only searches for cached variants under the sharded directory computed by `resolve_path(CACHE_DIR, safe_filename)`. If the instance previously generated flat cache files (pre-sharding), those files will not be deleted, leaving orphaned cache data.
## Issue Context
Mixed-era installs are explicitly supported for images via `resolve_existing_path()`, but cache cleanup does not provide a similar mixed-mode behavior.
## Fix Focus Areas
- app/imgpush.py[196-225]
- app/app.py[225-246]
## Implementation notes
- When deleting cache variants, search both:
- the flat cache directory (`CACHE_DIR`) pattern, and
- the sharded cache directory computed from the filename.
- Alternatively, implement `resolve_existing_path()` for cache variants (or a helper returning both candidate directories) and delete matches from both locations.
- Keep glob escaping as-is to avoid glob injection.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. Migration target collision🐞
Description
migrate_shard.py claims it is “idempotent and safe to re-run” but it renames each top-level file
into its shard target without checking whether the target path already exists, so partial migrations
or concurrent writes can cause exceptions and stop the run.
Code

app/migrate_shard.py[R34-46]

+            segments = shard_segments(entry.name)
+            if any(not s for s in segments):
+                print(f"skip (name too short for shard): {entry.name}", file=sys.stderr)
+                skipped += 1
+                continue
+            target_dir = os.path.join(base_dir, *segments)
+            target = os.path.join(target_dir, entry.name)
+            if dry_run:
+                print(f"would move {entry.path} -> {target}")
+            else:
+                os.makedirs(target_dir, exist_ok=True)
+                os.rename(entry.path, target)
+            moved += 1
Evidence
The script docstring claims rerun-safety, but migrate() constructs a target path and calls
os.rename(entry.path, target) without any exists guard or exception handling. If a target
already exists (e.g., from a prior partial run), the rename can fail and abort the migration before
completing.

app/migrate_shard.py[2-47]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`migrate_shard.py` is documented as idempotent/rerun-safe, but it unconditionally renames each file to its shard target. If the target already exists, the script can fail mid-run.
## Issue Context
The script warns to pause uploads, but rerun-safety is still valuable for interrupted runs, operator mistakes, and partial migrations.
## Fix Focus Areas
- app/migrate_shard.py[26-47]
## Implementation notes
- Before renaming, check `os.path.exists(target)`:
- if it exists, print a warning and skip (or verify same size/hash and then delete source).
- Wrap per-file move in `try/except` so one failure doesn’t abort the whole migration; keep counts for moved/skipped/failed.
- Consider using `os.replace` only if overwriting is explicitly intended (it contradicts “safe to re-run” otherwise).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

Comment thread app/app.py
…ion collisions

- Reject filenames that are not a plain basename in resolve_path and
  resolve_existing_path, closing a pre-existing path-traversal vector
  on GET and the newly-added cache write path.
- delete_image now globs cached variants in both flat and sharded cache
  directories so mixed-era installs don't leave orphaned cache files.
- migrate_shard.py skips existing targets, catches per-file OSError,
  and reports a failure count / non-zero exit so an interrupted or
  partial run can be re-run safely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant