Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication…
C++ Java Rust Python TypeScript Ruby Other
yutannihilation and nealrichardson ARROW-7045: [R] Preserve factor in Parquet roundtrip
The ability to preserve categorical values was introduced in #5077 through the convention of storing a special `ARROW:schema` key in the metadata. To enable this, we need to call `ArrowWriterProperties::store_schema()`.

The R binding already supports this, but it calls `store_schema()` only conditionally and uses `parquet___default_arrow_writer_properties()` by default. I don't see the motivation for implementing it that way in #5451; since [the Python binding always calls `store_schema()`](https://github.com/apache/arrow/blob/dbe708c7527a4aa6b63df7722cd57db4e0bd2dc7/python/pyarrow/_parquet.pyx#L1269), I guess the R code can do the same.

Closes #6135 from yutannihilation/ARROW-7045_preserve_factor_in_parquet and squashes the following commits:

9227e7e <Hiroaki Yutani> Fix test
4d8bb46 <Hiroaki Yutani> Remove default_arrow_writer_properties()
dfd08cb <Hiroaki Yutani> Add failing tests

Authored-by: Hiroaki Yutani <yutani.ini@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Latest commit 4634c89 Jan 8, 2020
Type Name Latest commit message Commit time
.github ARROW-7328: [CI] GitHub Actions should trigger on changes to GitHub A… Jan 6, 2020
c_glib ARROW-7488: [GLib] Fix typos and broken links Jan 3, 2020
ci ARROW-7502: [Integration] Remove Spark patch not needed Jan 7, 2020
cpp ARROW-7500: [C++][Dataset] Remove std::regex usage Jan 7, 2020
csharp ARROW-7318: [C#] TimestampArray serialization failure Jan 3, 2020
dev ARROW-7502: [Integration] Remove Spark patch not needed Jan 7, 2020
docs ARROW-7463 : [Doc] fix a broken link and typo Dec 23, 2019
format ARROW-7463 : [Doc] fix a broken link and typo Dec 23, 2019
go ARROW-7357: [Go] migrate to x/xerrors Dec 9, 2019
integration ARROW-7101: [CI] Refactor docker-compose setup and use it with GitHub… Nov 12, 2019
java ARROW-7437: [Java] ReadChannel#readFully does not set writer index co… Dec 28, 2019
js ARROW-7470: [JS] fix typos Dec 26, 2019
matlab [Release] Update versions for 1.0.0-SNAPSHOT Sep 30, 2019
python ARROW-7087: [Python] Metadata disappear from pandas dataset Jan 7, 2020
r ARROW-7045: [R] Preserve factor in Parquet roundtrip Jan 8, 2020
ruby ARROW-7479: [Rust][Ruby][R] Fix typos Jan 5, 2020
rust ARROW-7479: [Rust][Ruby][R] Fix typos Jan 5, 2020
testing @ 90ae758 ARROW-4219: [Rust] [Parquet] Initial support for arrow reader. Oct 14, 2019
.clang-format ARROW-3313: [R] Move .clang-format to top level. Add r/lint.sh script… Sep 26, 2018
.clang-tidy ARROW-2981: [C++] improve clang-tidy usability Jun 14, 2019
.clang-tidy-ignore ARROW-3313: [R] Move .clang-format to top level. Add r/lint.sh script… Sep 26, 2018
.dir-locals.el ARROW-4930: [C++] Improve find_package() support Nov 5, 2019
.dockerignore ARROW-7489: [CI] Fix typos Jan 3, 2020
.env ARROW-7374: [Dev] [C++] Fix cuda-cpp docker build Dec 16, 2019
.gitattributes ARROW-5488: [R] Workaround when C++ lib not available Jun 12, 2019
.gitignore ARROW-6494: [C++][Dataset] Implement PartitionSchemes Oct 5, 2019
.gitmodules ARROW-4459: [Testing] Add arrow-testing repo as submodule Feb 8, 2019
.hadolint.yaml ARROW-6214: [R] Add R sanitizer docker image Sep 19, 2019
.pre-commit-config.yaml ARROW-4909: [CI] Use hadolint to lint Dockerfiles Mar 18, 2019
.readthedocs.yml ARROW-1142: [C++] Port over compression toolchain and interfaces from… Jun 23, 2017
CHANGELOG.md ARROW-7163: [Doc] Fix double-and typos Nov 13, 2019
CODE_OF_CONDUCT.md ARROW-4006: Add CODE_OF_CONDUCT.md Dec 15, 2018
CONTRIBUTING.md ARROW-7489: [CI] Fix typos Jan 3, 2020
LICENSE.txt ARROW-6341: [Python] Implement low-level bindings for Dataset Dec 13, 2019
Makefile.docker ARROW-6214: [R] Add R sanitizer docker image Sep 19, 2019
NOTICE.txt ARROW-5934: [Python] Bundle arrow's LICENSE with the wheels Jul 15, 2019
README.md ARROW-7101: [CI] Refactor docker-compose setup and use it with GitHub… Nov 12, 2019
appveyor.yml ARROW-7333: [CI][Rust] Remove duplicated nightly job Dec 6, 2019
cmake-format.py ARROW-4363: [CI] [C++] Add CMake format checks Feb 11, 2019
docker-compose.yml ARROW-7489: [CI] Fix typos Jan 3, 2020
header ARROW-259: Use Flatbuffer Field type instead of MaterializedField Aug 18, 2016
run-cmake-format.py ARROW-7169: [C++] Vendor uriparser library Nov 20, 2019

README.md

Apache Arrow


Powering In-Memory Analytics

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast.

Major components of the project include:

Arrow is an Apache Software Foundation project. Learn more at arrow.apache.org.

What's in the Arrow libraries?

The reference Arrow libraries contain a number of distinct software components:

  • Columnar vector and table-like containers (similar to data frames) supporting flat or nested types
  • Fast, language-agnostic metadata messaging layer (using Google's FlatBuffers library)
  • Reference-counted off-heap buffer memory management, for zero-copy memory sharing and handling memory-mapped files
  • IO interfaces to local and remote filesystems
  • Self-describing binary wire formats (streaming and batch/file-like) for remote procedure calls (RPC) and interprocess communication (IPC)
  • Integration tests for verifying binary compatibility between the implementations (e.g. sending data from Java to C++)
  • Conversions to and from other in-memory data structures
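
To make the first and third items above more concrete, here is a minimal sketch of a columnar vector with a validity bitmap and a zero-copy slice. It uses only the Python standard library and is not the Arrow API: the helper names (`build_int_column`, `is_valid`, `column_sum`) are hypothetical, and the layout is a simplified illustration of the idea rather than the actual Arrow memory format.

```python
from array import array


def build_int_column(values):
    """Pack a list of ints (None marks a null) into two buffers:
    a contiguous values buffer and a validity bitmap (1 bit per slot)."""
    # Contiguous buffer of C ints (typically 32 bits); nulls get a placeholder 0.
    data = array("i", (0 if v is None else v for v in values))
    # One bit per value, rounded up to whole bytes.
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # least-significant-bit numbering
    return data, bitmap


def is_valid(bitmap, i):
    """Return True if slot i holds a non-null value."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))


def column_sum(data, bitmap):
    """Sum the non-null values by scanning the contiguous buffer."""
    return sum(x for i, x in enumerate(data) if is_valid(bitmap, i))


data, bitmap = build_int_column([1, None, 3, 4])
total = column_sum(data, bitmap)

# A memoryview slice shares memory with the original buffer (no copy is made).
view = memoryview(data)[2:]
data[2] = 30  # the change is visible through the view
shared_value = view[0]
```

Because the values sit in one contiguous buffer, a consumer can scan or slice a column without touching the other columns, and a slice can be handed to another consumer without copying the underlying bytes.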

How to Contribute

Please read our latest project contribution guide.

Getting involved

Even if you do not plan to contribute to Apache Arrow itself or Arrow integrations in other projects, we'd be happy to have you involved:
