Adding data files to Python package with setup.py#
setup.py vs pyproject.toml#
pyproject.toml is the new standard for declaring Python project metadata since PEP 621. As noted in PEP 517, and in one of the comments of this StackOverflow thread, setup.py can in rare cases run into a chicken-and-egg problem when it needs to import something from the package it is building. At the time of writing, the only thing pyproject.toml could not do was installation in editable mode, which still required setup.py (newer setuptools releases support editable installs from pyproject.toml through PEP 660). Another advantage of setup.py is that, being a Python file, it can compute some values dynamically at build time.
Nevertheless, setup.py is still a widely used, solid tool for building Python packages. This post discusses how to add data files (non-Python files) to a Python wheel package built by setup.py; source distributions (sdist .tar.gz files, .zip on Windows) are not covered.
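To illustrate the point about computing values at build time, here is a minimal sketch of reading a version string from a package's __init__.py with a regular expression instead of importing the package, which sidesteps the chicken-and-egg problem mentioned above. The helper name `read_version` and the `__version__ = "..."` convention are assumptions for the example, not something the original setup relies on.

```python
import re
import tempfile
from pathlib import Path

# Matches a line such as: __version__ = "1.2.3"
_VERSION_RE = re.compile(r"""^__version__\s*=\s*["']([^"']+)["']""", re.MULTILINE)


def read_version(init_file):
    """Return the version string found in init_file without importing it."""
    text = Path(init_file).read_text(encoding="utf-8")
    match = _VERSION_RE.search(text)
    if match is None:
        raise RuntimeError(f"no __version__ found in {init_file}")
    return match.group(1)


# Demo on a throwaway file standing in for src/mypkg/__init__.py
with tempfile.TemporaryDirectory() as tmp:
    init_file = Path(tmp) / "__init__.py"
    init_file.write_text('"""demo package"""\n__version__ = "1.2.3"\n', encoding="utf-8")
    demo_version = read_version(init_file)

print(demo_version)
```

In a real setup.py the result would typically be passed as `version=read_version("src/mypkg/__init__.py")`, where the path is whatever your project layout uses.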
Adding data files#
With parameter package_data for files inside a package#
Official doc: https://docs.python.org/3/distutils/setupscript.html#installing-package-data
package_data accepts wildcards, but as the example below shows, the data files must live inside a Python package folder (one that contains an __init__.py file). You cannot use package_data to include files from non-package folders, e.g. a conf folder that has no __init__.py file.
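from setuptools import setup

setup(
    # ... other arguments
    packages=['mypkg'],
    # the package sources live under src/mypkg
    package_dir={'mypkg': 'src/mypkg'},
    # include every .dat file under src/mypkg/data/
    package_data={'mypkg': ['data/*.dat']},
)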
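One way to check that package_data did what you expect is to open the built wheel, which is an ordinary zip archive, and list its non-Python entries. The sketch below does this on a stand-in archive created on the fly; the helper name `list_non_python_files` and the file names are hypothetical, and a real check would point at the .whl file produced by your build.

```python
import tempfile
import zipfile
from pathlib import Path


def list_non_python_files(wheel_path):
    """Return the non-.py entries of a wheel, skipping the *.dist-info metadata."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return sorted(
            name
            for name in wheel.namelist()
            if not name.endswith(".py")
            and ".dist-info/" not in name
            and not name.endswith("/")
        )


# Demo: build a stand-in archive laid out like a wheel (not a real build).
with tempfile.TemporaryDirectory() as tmp:
    fake_wheel = Path(tmp) / "mypkg-0.1-py3-none-any.whl"
    with zipfile.ZipFile(fake_wheel, "w") as zf:
        zf.writestr("mypkg/__init__.py", "")
        zf.writestr("mypkg/data/table.dat", "1 2 3\n")
        zf.writestr("mypkg-0.1.dist-info/METADATA", "Name: mypkg\n")
    found = list_non_python_files(fake_wheel)

print(found)
```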
With parameter data_files for any files#
official doc: https://docs.python.org/3/distutils/setupscript.html#installing-additional-files
Warning
distutils
is deprecated, and will be remove in Python 3.12 as per PEP 632, the migration path is to simply use setuptools.
from setuptools import setup

setup(
    # ... other arguments
    data_files=[
        ('bitmaps', ['bm/b1.gif', 'bm/b2.gif']),
        ('config', ['cfg/data.cfg']),
        # general form:
        # ({dest_folder_path_in_wheel}, [{source_file_path_relative_to_setup.py_script}]),
    ],
)
From the above example, we can see that:
- data_files accepts any files from any folder, in contrast to package_data, which only accepts files inside a package folder.
- data_files takes files one by one; we cannot use a wildcard like * to specify a set of source files.
- after the build, a .whl wheel file is generated. The file at {source_file_path_relative_to_setup.py_script} is added at the path {package_name}-{package_version}.data/data/{dest_folder_path_in_wheel}/{source_file_name}, and the Python files are added to {module_name}/{python_package_original_path}. If you want to put the data files at their original path, replace {dest_folder_path_in_wheel} with ../../{data_files_original_path}; the two leading .. segments simply escape the two folder levels of {package_name}-{package_version}.data/data/.
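Two of the points above can be sketched in code. Since setup.py is plain Python, the wildcard limitation can be worked around by expanding the pattern ourselves with glob before handing the list to data_files, and the ../../ trick can be seen by normalizing the path described in the last bullet. Both helper names (`expand_data_files`, `path_in_wheel`) and the sample names are hypothetical, and the wheel layout is taken from the description above rather than from a real build.

```python
import glob
import os
import posixpath
import tempfile


def expand_data_files(dest, pattern):
    """Build one data_files entry by expanding a glob pattern ourselves."""
    return (dest, sorted(glob.glob(pattern)))


def path_in_wheel(name, version, dest, source):
    """Normalized location of a data_files source, per the layout described above."""
    raw = posixpath.join(
        f"{name}-{version}.data", "data", dest, posixpath.basename(source)
    )
    return posixpath.normpath(raw)


# Demo 1: expand cfg/*.cfg in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    cfg_dir = os.path.join(tmp, "cfg")
    os.makedirs(cfg_dir)
    for file_name in ("a.cfg", "b.cfg"):
        with open(os.path.join(cfg_dir, file_name), "w") as f:
            f.write("")
    dest, sources = expand_data_files("config", os.path.join(cfg_dir, "*.cfg"))
    source_names = [os.path.basename(p) for p in sources]

# Demo 2: a plain destination versus one that escapes with ../../
default_path = path_in_wheel("mypkg", "0.1", "config", "cfg/data.cfg")
escaped_path = path_in_wheel("mypkg", "0.1", "../../conf", "cfg/data.cfg")

print(dest, source_names)
print(default_path)
print(escaped_path)
```

In a setup.py, the tuple returned by `expand_data_files` would go directly into the data_files list, with a pattern relative to the setup.py script.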
With file MANIFEST.in#
From my understanding and tests, the MANIFEST.in file only applies to sdist, so it is out of scope for this post, which covers bdist wheel packages only.
Parameter zip_safe#
If you're using the old-fashioned egg format and need to reference data files inside the package, you should set zip_safe=False during the build. For modern Python packaging, this parameter is obsolete.
Loading data files#
A very good sum-up can be found in this StackOverflow thread.
Loading data files packaged by package_data#
- With importlib.resources, importlib.metadata, or their backports importlib_resources and importlib_metadata.
# read the file module_a/folder_b/file.json
import importlib.resources
import json
# open_text is deprecated since Python 3.11 because it only supports files inside Python packages;
# see the example further below that uses `importlib.resources.files`
json.load(importlib.resources.open_text("module_a.folder_b", "file.json"))
Check this doc for migration from pkg_resources.
- With deprecated pkg_resources from setuptools of pypa.io, and some examples from here or here.
!!! warning
[pkg_resources](https://setuptools.pypa.io/en/latest/pkg_resources.html) is deprecated because of performance issues, and because it requires the third-party setuptools package to be installed at runtime, whereas setuptools should only be needed at build time.
# to read file from module_a/folder_b/file.json
import json
import pkg_resources
json.load(pkg_resources.resource_stream("module_a", "folder_b/file.json"))
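As a self-contained illustration of the importlib.resources route, the sketch below creates a throwaway package on disk, makes it importable, and reads a JSON file from it with `files()`. The package name `demo_module_a` and its content are made up for the demo, and it assumes Python 3.9+ (on older versions the `importlib_resources` backport would be imported instead).

```python
import importlib
import json
import sys
import tempfile
from importlib.resources import files  # Python 3.9+
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Lay out demo_module_a/folder_b/file.json on disk
    pkg_dir = Path(tmp) / "demo_module_a"
    (pkg_dir / "folder_b").mkdir(parents=True)
    (pkg_dir / "__init__.py").write_text("", encoding="utf-8")
    (pkg_dir / "folder_b" / "file.json").write_text('{"answer": 42}', encoding="utf-8")

    # Make the throwaway package importable for the duration of the demo
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        resource = files("demo_module_a") / "folder_b" / "file.json"
        loaded = json.loads(resource.read_text(encoding="utf-8"))
    finally:
        sys.path.remove(tmp)
        sys.modules.pop("demo_module_a", None)

print(loaded)
```

Note that folder_b has no __init__.py here: `files()` only needs the top-level package to be importable.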
Loading data files packaged by data_files#
Data files packaged with the data_files parameter can live in any folder, not necessarily inside a Python package with an __init__.py file. In that case importlib.resources.open_text can no longer be used, and it is indeed marked as deprecated in Python 3.11.
- Use stdlib importlib.resources.files to read the file from module_a/folder_b/file.json
!!! note
This method can also be used to [load data files packaged by package_data](#loading-data-files-packaged-by-package_data)
try:
    # stdlib since Python 3.9
    from importlib.resources import files
except ImportError:
    # third-party backport for Python < 3.9;
    # add importlib_resources to your requirements
    from importlib_resources import files
import json

# with `data_files` in `setup.py`,
# we can specify where to put the files in the wheel package,
# so inside the module_a for example
resource = files("module_a") / "folder_b" / "file.json"
with resource.open() as f:
    print(json.load(f))
- Use deprecated third-party pkg_resources to read the file from module_a/folder_b/file.json
import json
import pkg_resources
# use `data_files` in `setup.py`, we can specify where to put the files,
# so inside the module_a for example
json.load(pkg_resources.resource_stream("module_a", "folder_b/file.json"))
- Use stdlib pkgutil.get_data
You can find an example in this StackOverflow thread. All the answers and comments are worth reading. Be aware that pkgutil.get_data() could be deprecated too one day.
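For completeness, here is a minimal sketch of the pkgutil.get_data route on a throwaway package created on disk. The package name `demo_module_b` and its content are made up for the demo. `pkgutil.get_data` returns the raw bytes of the resource (or None when the loader cannot provide it), so the result is decoded by `json.loads`.

```python
import importlib
import json
import pkgutil
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Lay out demo_module_b/folder_b/file.json on disk
    pkg_dir = Path(tmp) / "demo_module_b"
    (pkg_dir / "folder_b").mkdir(parents=True)
    (pkg_dir / "__init__.py").write_text("", encoding="utf-8")
    (pkg_dir / "folder_b" / "file.json").write_text('{"answer": 42}', encoding="utf-8")

    # Make the throwaway package importable for the duration of the demo
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        # The resource path is relative to the package and uses "/" separators
        raw = pkgutil.get_data("demo_module_b", "folder_b/file.json")
    finally:
        sys.path.remove(tmp)
        sys.modules.pop("demo_module_b", None)

loaded = json.loads(raw)
print(loaded)
```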