~/.whale file structure
Overview
This section provides an in-depth overview of the ~/.whale
file structure. Unlike most command-line tools, whale requires data storage: warehouse credentials, scraped metadata, metadata caches for speedier access, user-generated content, logs, etc. We therefore use the ~/.whale
directory to store all of this information.
While whale will happily manage these files automatically without any user intervention, understanding how it does so can be helpful in building more advanced functionality on top of whale.
Whale installs all files within the ~/.whale
path. This path contains the following directory structure:
Subdirectories
bin
This folder contains the whale binary, whale
, if whale is built from source (homebrew manages all binaries within its own file structure). The whale
binary within can be executed directly, though we recommend setting an alias in your shell's rc
file so wh
can be executed without having to remember this path.
The whale
binary within is not portable without the rest of the directory, as it relies on the other directories for metadata scraping, caching, and storage.
config
This folder contains all whale configuration files. Most notably, connections.yaml
is stored here, which contains a list of all registered warehouse connections.
connections.yaml
This file contains all warehouse connections. This file can be directly accessed by running wh connections edit
, and new connections are configured through the wh init
workflow.
Each connection is stored as a separated yaml document within the connections.yaml
file, meaning they are separated by ---
. Unlike most warehouse interfaces, we chose to use configuration keys that differ for each warehouse for easier end-user comprehension.
See the Connection configuration section for warehouse-specific instructions. For most use cases, however, we recommend simply running through the wh init
workflow to add new warehouses (you can repeat this command multiple times without worrying about clearing your existing connections).
libexec
libexec
houses the virtual environment containing the whale-pipelines
installation, as well as some scripts that are accessed by the rust-side of the program (build_script.py
and run_script.py
).
The name libexec
comes from the unix /usr/libexec
directory, which houses binaries that are not intended to be accessed by the end-user. However, because these are written entirely in python and stored as a virtual environment, it is completely acceptable to modify these directly.
logs
Contains all system logs. In particular, cron.logs
within contains logs from any registered cron job, which can be useful for debugging broken ETL processes.
manifests
In order to speed up search execution and avoid indexing stale (deleted) tables, we store a cache of all table names in manifest.txt
within this directory that only contains tables from the most recent metadata scrape. During a scraping job, these tables are appended to a temporary manifest named tmp_manifest_<NUMBER>.txt
file (where NUMBER
is appended to prevent simultaneous scraping jobs from appending to the same temporary manifest). Upon completion of the ETL job, this manifest is copied to manifest.txt
.
This logic and the manifests within generally need not be understood by the end-user (or most developers). However, this design pattern should be leveraged when adding the ability to search over other non-table assets.
metadata
The metadata directory stores all warehouse metadata. When typing enter
on a selected wh
-searched file, documents within this directory are opened.
metrics
The metrics directory stores all calculated metrics (along with a timestamp of when the metrics were calculated). The folder structure follows the same structure as the metadata folder, except the table name is used as a folder to house (and prevent collisions over) metric names: warehouse_name/catalog.schema.table/metric-name.md
.
templates
The templates directory is where users can add their own Jinja2 templates. When named in the form warehouse-connection-name.sql
, these templates are pre-pended to any queries run against the warehouse with connection name warehouse-connection-name
. See the Jinja2 templating section for more details. Connection names can be found by running wh connections
, in the name
field of each yaml block.
Last updated