Selective indexing
It's often useful to only pull down metadata about certain tables or schemas. This will improve search speed, reduce scraping time, and reduce load on your warehouse.
To this end, we support appending a regex and/or a where clause suffix to our metadata extractors (see the warehouse-specific details below). The listed keys can simply be added to the appropriate warehouse-specific section of your connections.yaml file.
Note: For the `where_clause_suffix`-enabled warehouses, we follow the amundsen-databuilder pattern, for those familiar with it. For simplicity, we provide a list of metadata tables (e.g. `information_schema`) and their aliases, as well as a few notable column names that are commonly used for filtering. If you write a statement that includes a `%`, you'll need to escape it (replace it with `%%`), as we use the Python library SQLAlchemy to execute your query, and Python reserves `%` for string formatting.
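For example, a connection entry whose filter uses a SQL LIKE needs its wildcards doubled. The sketch below is illustrative only: the connection fields other than `where_clause_suffix` are hypothetical placeholders, not a complete or authoritative schema.

```yaml
# connections.yaml (sketch; all values are placeholders)
---
name: my-postgres
metadata_source: postgres
uri: db.example.com
port: 5432
# Doubled %% because SQLAlchemy/Python reserves a single % for
# string formatting.
where_clause_suffix: |
  where c.table_name not like 'tmp%%'
```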
Once you run a scraping job (`wh pull`, manually or through cron), we never delete table stubs, to avoid removing any personal documentation. So if you want to remove filtered tables that were scraped before you added a filter, you'll have to delete them manually.
Database-specific indexing
For all sources except for BigQuery (for which `project_id` is already an accepted configuration key), we supply a configuration key `database`, which enables users to pull data only from specific databases/clusters/catalogs under a connection, as sketched below.

We use the name `database` ubiquitously for simplicity, though this layer is often described differently across warehouse types: for Postgres and Snowflake it's typically called the "database", while the ANSI SQL standard labels it the "catalog", and Amundsen calls it the "cluster", following the typical pattern in Hive- and Presto-based setups with production-development replication.
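For instance, the entry below pulls metadata only from a single database under one connection. Only the `database` key is the point of this sketch; the remaining values are hypothetical placeholders.

```yaml
# connections.yaml entry (sketch; values are placeholders)
---
name: my-postgres
metadata_source: postgres
uri: db.example.com
port: 5432
database: analytics  # only scrape the "analytics" database under this connection
```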
Source-specific instructions
BigQuery: `included_tables_regex`

Because we access BigQuery through the Google client API, we don't supply a `where_clause_suffix` as with the other warehouses. You can instead add the `included_tables_regex` key with an associated regular expression; if the regex matches, the table will be indexed. The string matched against this regex follows the format `project_id.dataset.table_name`, so dataset-level restrictions can also be expressed here.
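For example, here is a sketch that indexes only tables within one dataset of a project. All values are hypothetical; note the escaped dots, since `.` is a regex metacharacter.

```yaml
# connections.yaml entry (sketch; values are placeholders)
---
name: my-bigquery
metadata_source: bigquery
key_path: /path/to/service-account.json
project_id: my-project
# Matched against "project_id.dataset.table_name".
included_tables_regex: 'my-project\.core\..*'
```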
Hive Metastore: `where_clause_suffix`

Available metadata tables:

- `TBLS` (alias: `t`)
- `DBS` (alias: `d`)
- `PARTITION_KEYS` (alias: `p`)
- `TABLE_PARAM` (alias: `tp`)

Notable fields:

- `d.NAME` (table schema)
- `t.TABLE_NAME`
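As an illustration, here is a suffix that keeps only two schemas and drops temp tables. This is a sketch: whether the suffix should begin with AND or WHERE depends on the query the extractor builds, and AND is assumed here. Note the escaped `%%`.

```yaml
# sketch; assumes the suffix extends an existing WHERE clause
where_clause_suffix: |
  AND d.NAME IN ('core', 'finance')
  AND t.TABLE_NAME NOT LIKE 'tmp%%'
```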
Postgres: `where_clause_suffix`

Available metadata tables:

- `INFORMATION_SCHEMA.COLUMNS` (alias: `c`)
- `PG_CATALOG.PG_STATIO_ALL_TABLES` (alias: `st`)
- `PG_CATALOG.PG_DESCRIPTION` (alias: `pgtd`)

Notable fields:

- `c.table_schema`
- `c.table_name`
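For example, here is a sketch that skips system schemas. In the amundsen-databuilder pattern, the Postgres suffix typically supplies the `where` keyword itself, which we assume here.

```yaml
# sketch; assumes the suffix provides its own WHERE keyword
where_clause_suffix: |
  where c.table_schema not in ('pg_catalog', 'information_schema')
  and c.table_name not like 'airflow%%'
```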
Presto: `where_clause_suffix`

Available metadata tables:

- `information_schema.columns` (alias: `a`)
- `information_schema.views` (alias: `b`)

Notable fields:

- `a.table_catalog`
- `a.table_schema`
- `a.table_name`
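Here is a sketch restricting indexing to one catalog and schema. Whether the suffix carries its own WHERE keyword depends on the extractor's base query; we assume it does here, and the catalog/schema names are hypothetical.

```yaml
# sketch; assumes the suffix provides its own WHERE keyword
where_clause_suffix: |
  WHERE a.table_catalog = 'hive'
  AND a.table_schema = 'core'
```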
Redshift: `where_clause_suffix`

Available metadata tables: None (the constituent tables are combined into a single larger table).

Notable fields:

- `cluster`
- `schema`
- `name` (table name)
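A sketch using the fields above to skip internal schemas and staging tables, again assuming the suffix provides its own where keyword:

```yaml
# sketch; field names as listed above
where_clause_suffix: |
  where schema not in ('pg_catalog', 'information_schema')
  and name not like 'stg%%'
```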
Snowflake: `where_clause_suffix`

Available metadata tables:

- `TABLES` (alias: `t`)
- `COLUMNS` (alias: `c`)

Notable fields:

- `c.TABLE_NAME`
- `c.TABLE_SCHEMA`
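A sketch excluding one schema and temp tables (we assume the suffix extends an existing WHERE clause, so it begins with AND; the schema name is hypothetical):

```yaml
# sketch; assumes the suffix extends an existing WHERE clause
where_clause_suffix: |
  AND c.TABLE_SCHEMA != 'STAGING'
  AND c.TABLE_NAME NOT LIKE 'TMP%%'
```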