
Implement float array compression using ALP #133

Merged
XiangpengHao merged 15 commits into XiangpengHao:main from proteetpaul:float_array on Mar 30, 2025

@proteetpaul
Contributor

@proteetpaul proteetpaul commented Mar 26, 2025

This commit implements the following:

@proteetpaul proteetpaul marked this pull request as ready for review March 26, 2025 04:43
@XiangpengHao
Owner

It looks like cargo is unhappy with some of the formatting: https://github.com/XiangpengHao/liquid-cache/actions/runs/14100016722/job/39537945145#step:5:1

Can you run `cargo fmt` to fix the formatting? @proteetpaul

@XiangpengHao
Owner

Now we get complaints from clippy: https://github.com/XiangpengHao/liquid-cache/actions/runs/14115617222/job/39547324519?pr=133#step:7:1

Can you fix them as well? You can run `cargo clippy` to reproduce the warnings. @proteetpaul

@codecov

codecov Bot commented Mar 28, 2025

Codecov Report

Attention: Patch coverage is 88.61048% with 50 lines in your changes missing coverage. Please review.

Project coverage is 82.66%. Comparing base (982d7bf) to head (4d356f2).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/liquid_parquet/src/liquid_array/float_array.rs | 87.50% | 29 Missing and 2 partials ⚠️ |
| src/liquid_parquet/src/liquid_array/mod.rs | 0.00% | 7 Missing ⚠️ |
| src/liquid_parquet/src/cache/mod.rs | 0.00% | 6 Missing ⚠️ |
| src/liquid_parquet/src/liquid_array/ipc.rs | 96.62% | 6 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #133      +/-   ##
==========================================
+ Coverage   82.33%   82.66%   +0.32%     
==========================================
  Files          36       37       +1     
  Lines        7864     8302     +438     
  Branches     7864     8302     +438     
==========================================
+ Hits         6475     6863     +388     
- Misses       1195     1243      +48     
- Partials      194      196       +2     

@XiangpengHao
Owner

I noticed the implementation is quite similar to https://crates.io/crates/alp

Should we use that implementation? @proteetpaul They're known good people and friends.

@proteetpaul
Contributor Author

> I noticed the implementation is quite similar to https://crates.io/crates/alp
>
> Should we use that implementation? @proteetpaul They're known good people and friends.

I chose to reimplement it since their implementation used vectors and native Rust types, whereas Liquid uses Arrow arrays and Arrow types
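
For readers unfamiliar with ALP, the core idea can be sketched on plain slices. This is a simplified illustration, not the code from this PR or from the `alp` crate: every name below (`Encoded`, `encode`, `decode`, and so on) is hypothetical, it uses a single decimal exponent where ALP proper searches for a pair of exponents, and it skips the bit-packing of the resulting integers. Each value is scaled by a power of ten and rounded to an integer; any value that does not decode back bit-for-bit is kept verbatim as an exception.

```rust
// Illustrative sketch only: names and layout are hypothetical, not the PR's code.

/// Exact powers of ten used as scale factors.
const POW10: [f64; 10] = [1.0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9];

/// Largest integer magnitude that an f64 can represent exactly (2^53).
const MAX_EXACT: f64 = 9_007_199_254_740_992.0;

/// Result of encoding a slice with one shared decimal exponent.
pub struct Encoded {
    pub exponent: usize,
    /// One integer per input value (0 is a placeholder at exception slots).
    pub ints: Vec<i64>,
    /// (index, original value) for inputs that did not round-trip.
    pub exceptions: Vec<(usize, f64)>,
}

/// Decode a single integer: n / 10^exponent.
fn decode_one(n: i64, exponent: usize) -> f64 {
    n as f64 / POW10[exponent]
}

/// Encode a single value as round(v * 10^exponent), or return None when
/// decoding would not reproduce `v` bit-for-bit.
fn encode_one(v: f64, exponent: usize) -> Option<i64> {
    let scaled = (v * POW10[exponent]).round();
    if !scaled.is_finite() || scaled.abs() > MAX_EXACT {
        return None;
    }
    let n = scaled as i64;
    (decode_one(n, exponent).to_bits() == v.to_bits()).then_some(n)
}

/// Encode a slice. `exponent` must be smaller than `POW10.len()`.
pub fn encode(values: &[f64], exponent: usize) -> Encoded {
    let mut ints = Vec::with_capacity(values.len());
    let mut exceptions = Vec::new();
    for (i, &v) in values.iter().enumerate() {
        match encode_one(v, exponent) {
            Some(n) => ints.push(n),
            None => {
                ints.push(0);
                exceptions.push((i, v));
            }
        }
    }
    Encoded { exponent, ints, exceptions }
}

/// Decode all integers, then patch the exceptions back in.
pub fn decode(enc: &Encoded) -> Vec<f64> {
    let mut out: Vec<f64> = enc
        .ints
        .iter()
        .map(|&n| decode_one(n, enc.exponent))
        .collect();
    for &(i, v) in &enc.exceptions {
        out[i] = v;
    }
    out
}
```

With exponent 2, `encode(&[1.25, 0.1, std::f64::consts::PI], 2)` stores 125 and 10 as integers and keeps π as an exception. An Arrow-based version would presumably do the same work over primitive array buffers instead of `Vec`s, which is the difference described above.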

@XiangpengHao
Owner

> > I noticed the implementation is quite similar to https://crates.io/crates/alp
> > Should we use that implementation? @proteetpaul They're known good people and friends.
>
> I chose to reimplement it since their implementation used vectors and native Rust types, whereas Liquid uses Arrow arrays and Arrow types

I see, that makes sense. Did you use any of their code? I checked that their crate is open source under Apache 2.0 (same as ours), so it should be fine. I just want to find a way to acknowledge them if we use or build on their implementation.

@proteetpaul
Contributor Author

> I see, that makes sense. Did you use any of their code? I checked that their crate is open source under Apache 2.0 (same as ours), so it should be fine. I just want to find a way to acknowledge them if we use or build on their implementation.

I used their code in a few places. Should we add an acknowledgement at the beginning of float_array.rs?

@XiangpengHao
Owner

> > I see, that makes sense. Did you use any of their code? I checked that their crate is open source under Apache 2.0 (same as ours), so it should be fine. I just want to find a way to acknowledge them if we use or build on their implementation.
>
> I used their code in a few places. Should we add an acknowledgement at the beginning of float_array.rs?

That would be great!

@XiangpengHao
Owner

The implementation looks really good to me. I tweaked the benchmark a little, and here is the performance I got: we can encode at roughly 1 GB/s and decode at more than 5 GB/s, which is quite impressive!

float32_liquid_encode/size_8192
                        time:   [35.908 µs 36.022 µs 36.135 µs]
                        thrpt:  [864.82 MiB/s 867.53 MiB/s 870.28 MiB/s]
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) low severe
  3 (3.00%) low mild
  2 (2.00%) high severe

Benchmarking float32_liquid_encode/size_16384: Collecting 100 samples in estimated 5.20
float32_liquid_encode/size_16384
                        time:   [60.343 µs 61.820 µs 63.962 µs]
                        thrpt:  [977.14 MiB/s 1011.0 MiB/s 1.0115 GiB/s]
Found 14 outliers among 100 measurements (14.00%)
  8 (8.00%) low severe
  3 (3.00%) high mild
  3 (3.00%) high severe

Benchmarking float32_liquid_encode/size_24576: Collecting 100 samples in estimated 5.07
float32_liquid_encode/size_24576
                        time:   [83.386 µs 83.530 µs 83.657 µs]
                        thrpt:  [1.0944 GiB/s 1.0960 GiB/s 1.0979 GiB/s]
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

Benchmarking float64_liquid_encode/size_8192: Collecting 100 samples in estimated 5.131
float64_liquid_encode/size_8192
                        time:   [84.146 µs 85.498 µs 87.087 µs]
                        thrpt:  [717.68 MiB/s 731.01 MiB/s 742.76 MiB/s]
Found 18 outliers among 100 measurements (18.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  16 (16.00%) high severe

Benchmarking float64_liquid_encode/size_16384: Collecting 100 samples in estimated 5.35
float64_liquid_encode/size_16384
                        time:   [109.91 µs 114.39 µs 119.02 µs]
                        thrpt:  [1.0256 GiB/s 1.0671 GiB/s 1.1106 GiB/s]

Benchmarking float64_liquid_encode/size_24576: Collecting 100 samples in estimated 5.00
float64_liquid_encode/size_24576
                        time:   [137.26 µs 141.76 µs 146.90 µs]
                        thrpt:  [1.2464 GiB/s 1.2916 GiB/s 1.3340 GiB/s]
Found 24 outliers among 100 measurements (24.00%)
  1 (1.00%) high mild
  23 (23.00%) high severe

Benchmarking float32_liquid_decode/size_8192: Collecting 100 samples in estimated 5.010
float32_liquid_decode/size_8192
                        time:   [5.4902 µs 5.4932 µs 5.4965 µs]
                        thrpt:  [5.5522 GiB/s 5.5555 GiB/s 5.5586 GiB/s]
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

Benchmarking float32_liquid_decode/size_16384: Collecting 100 samples in estimated 5.03
float32_liquid_decode/size_16384
                        time:   [11.544 µs 11.562 µs 11.580 µs]
                        thrpt:  [5.2709 GiB/s 5.2788 GiB/s 5.2873 GiB/s]

Benchmarking float32_liquid_decode/size_24576: Collecting 100 samples in estimated 5.00
float32_liquid_decode/size_24576
                        time:   [16.483 µs 16.499 µs 16.514 µs]
                        thrpt:  [5.5439 GiB/s 5.5491 GiB/s 5.5542 GiB/s]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

Benchmarking float64_liquid_decode/size_8192: Collecting 100 samples in estimated 5.035
float64_liquid_decode/size_8192
                        time:   [9.0405 µs 9.0831 µs 9.1277 µs]
                        thrpt:  [6.6868 GiB/s 6.7196 GiB/s 6.7513 GiB/s]
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

Benchmarking float64_liquid_decode/size_16384: Collecting 100 samples in estimated 5.06
float64_liquid_decode/size_16384
                        time:   [17.462 µs 17.510 µs 17.567 µs]
                        thrpt:  [6.9488 GiB/s 6.9713 GiB/s 6.9907 GiB/s]
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild

Benchmarking float64_liquid_decode/size_24576: Collecting 100 samples in estimated 5.09
float64_liquid_decode/size_24576
                        time:   [26.315 µs 26.395 µs 26.482 µs]
                        thrpt:  [6.9144 GiB/s 6.9372 GiB/s 6.9582 GiB/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
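
As a sanity check on the numbers above, the reported throughput is consistent with counting `elements × size_of(element)` bytes per iteration; that is an inference from the figures, not something the benchmark source in this thread confirms. A minimal sketch of the arithmetic, with a hypothetical helper name:

```rust
/// Throughput in MiB/s for `elements` values of `bytes_per_element` bytes
/// processed in `micros` microseconds.
/// Assumption: the benchmark counts elements * element size as its byte total.
pub fn throughput_mib_per_s(elements: usize, bytes_per_element: usize, micros: f64) -> f64 {
    let bytes = (elements * bytes_per_element) as f64;
    let seconds = micros * 1e-6;
    bytes / seconds / (1024.0 * 1024.0)
}
```

For float32_liquid_encode/size_8192, `throughput_mib_per_s(8192, 4, 36.022)` comes to about 867.5 MiB/s, matching the reported median. For float64_liquid_decode/size_8192, `throughput_mib_per_s(8192, 8, 9.0831) / 1024.0` comes to about 6.72 GiB/s. Note that Criterion reports binary units, and 1 GiB/s is roughly 1.074 GB/s.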

@XiangpengHao
Owner

I haven't fully understood every line of the code, but the implementation is so good and so well tested that I'm inclined to merge this anyway. This is very impressive work, thank you again @proteetpaul

@XiangpengHao XiangpengHao merged commit 6586222 into XiangpengHao:main Mar 30, 2025