Create an Apache Drill Data Source
Disclaimer - Beta Feature
Support for Apache Drill Data Source Managers (DSMs) is in beta. Beta features are available for users to test and provide feedback. Their implementation is not finalized, and their behavior or interface may change in the future.
Do not use beta features in your production environment.
Known limitations
- Date arithmetic:
  - WEEK granularities are not supported (such as WEEK or DOW)
  - DOY granularity is not supported
  - Not all period-over-period functionality works due to partially missing INTERVAL shifting in Drill
- Functions:
  - The MEDIAN analytics function (or any alternative like PERCENTILE_CONT) is not supported by Drill.
    - You can implement your own MEDIAN and plug it into Drill, see How to develop and install custom functions into Apache Drill.
    - Or you may use third-party solutions: MEDIAN and other statistics functions have already been implemented by the Drill community (by @cgivre), see drill-stats-function on GitHub.
  - GREATEST and LEAST functions treat NULL values incorrectly
  - Some WINDOW frames are not supported
  - SUM in CASE may not work
  - REGR_R2 is not supported
- When using aggregations with an empty dimensionality and when all values are NULL, the report result may be incorrect
Deployment
Note
GoodData uses driver version 1.20.2.
You can run Apache Drill in a Docker container. The image for Apache Drill is available on Docker Hub.
The following example demonstrates how to start GoodData.CN with Apache Drill, using MinIO to serve as S3 storage:
version: '3.7'
services:
  gooddata-cn-ce:
    image: gooddata/gooddata-cn-ce:2.3.0
    ports:
      - "3000:3000"
      - "5432:5432"
    volumes:
      - gooddata-cn-ce-data:/data
    environment:
      LICENSE_AND_PRIVACY_POLICY_ACCEPTED: "YES"
  drill:
    image: apache/drill:1.20.2
    ports:
      - '8047:8047'
      - '31010:31010'
    volumes:
      - drill-data:/data
      # Inject JDBC drivers for data sources which you want to manage with Apache Drill, e.g.:
      - ./db-drivers/POSTGRESQL/postgresql-42.2.16.jar:/opt/drill/jars/3rdparty/postgresql-42.2.16.jar
      - ./db-drivers/VERTICA/vertica-jdbc-10.0.1-2.jar:/opt/drill/jars/3rdparty/vertica-jdbc-10.0.1-2.jar
      - ./db-drivers/REDSHIFT/RedshiftJDBC42-no-awssdk-1.2.50.1077.jar:/opt/drill/jars/3rdparty/RedshiftJDBC42-no-awssdk-1.2.50.1077.jar
      - ./db-drivers/MSSQL/mssql-jdbc-8.4.1.jre11.jar:/opt/drill/jars/3rdparty/mssql-jdbc-8.4.1.jre11.jar
      - ./db-drivers/SNOWFLAKE/snowflake-jdbc-3.12.9.jar:/opt/drill/jars/3rdparty/snowflake-jdbc-3.12.9.jar
      - ./db-drivers/ADS/datawarehouse-jdbc-driver-bundle-3.5.1.jar:/opt/drill/jars/3rdparty/datawarehouse-jdbc-driver-bundle-3.5.1.jar
      # If needed, override default settings
      - ./ds_managers/drill/drill-override.conf:/opt/drill/conf/drill-override.conf
      # Register default storage plugins
      - ./ds_managers/drill/storage-plugins-override.conf:/opt/drill/conf/storage-plugins-override.conf
    stdin_open: true
    tty: true
  minio:
    image: minio/minio:RELEASE.2021-08-25T00-41-18Z
    volumes:
      - minio-data:/data
    ports:
      - '19000:9000'
      - '19001:19001'
    environment:
      MINIO_ACCESS_KEY: tiger_abcde_k1234567
      MINIO_SECRET_KEY: tiger_abcde_k1234567_secret1234567890123
    command: server --console-address ":19001" /data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3
volumes:
  gooddata-cn-ce-data:
  drill-data:
  minio-data:
Note
By passing the environment variable LICENSE_AND_PRIVACY_POLICY_ACCEPTED=YES, you agree to the terms and conditions in the GOODDATA.CN COMMUNITY EDITION LICENSE AGREEMENT, including GoodData’s Privacy Policy. Please read it carefully. In order to use GoodData.CN, you must agree to the terms and conditions therein.
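Once the containers are started, it can be useful to confirm that Drill is reachable before registering it in GoodData. The following is a minimal sketch, assuming the Drill web port 8047 is published on localhost as in the compose file above and that Drill's REST endpoint /status.json returns a JSON object with a "status" field. The function names are hypothetical, and the exact response shape is an assumption to verify against your Drill version.

```python
import json
import urllib.request


def drill_status_url(host: str = "localhost", port: int = 8047) -> str:
    """Build the URL of Drill's REST status endpoint (port as published above)."""
    return f"http://{host}:{port}/status.json"


def is_drill_running(body: str) -> bool:
    """Return True when the response body is a JSON object with a non-empty 'status'.

    Assumption: the endpoint answers with something like {"status": "..."}.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and bool(payload.get("status"))


def check_drill(host: str = "localhost", port: int = 8047, timeout: float = 5.0) -> bool:
    """Fetch the status endpoint; any network error is treated as 'not ready'."""
    try:
        with urllib.request.urlopen(drill_status_url(host, port), timeout=timeout) as resp:
            return is_drill_running(resp.read().decode("utf-8"))
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return False
```

Calling check_drill() in a retry loop is one way to wait for the container to finish starting. The parsing is kept separate from the network call so that it can be exercised without a running Drill instance.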
Prepare Apache Drill for GoodData
To learn how to register Data Sources to Apache Drill, refer to the official Apache Drill documentation for connecting a Data Source.
For additional considerations, refer to Preparing Data Source Managers for GoodData.
Data Source Details
- The following considerations apply when you are configuring the JDBC URL:
  - If you start Apache Drill as a Docker container, you can connect using this URL: jdbc:drill:drillbit=drill:31010
  - If you run Apache Drill outside of a Docker container, consult the official Apache Drill documentation for configuring the JDBC URL.
  - There are no limits for the driver setup. For all possibilities, see the official documentation.
- Basic authentication is most likely supported but is untested. You can test authentication by specifying the user and password.
- You can set enableCaching to true and cachePath to ["dfs", "data"].
  - Learn more about the pre-aggregation caching in Cache Management.
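The settings above can be collected in one place before you enter them in the UI or an API call. The sketch below is illustrative only: the helper names and the layout of the returned dictionary (including the "url" key) are hypothetical and do not represent the GoodData API payload. Only the JDBC URL form, enableCaching, cachePath, user, and password come from this page.

```python
from typing import Optional, Sequence


def drill_jdbc_url(host: str = "drill", port: int = 31010) -> str:
    """Build a direct-drillbit JDBC URL, e.g. jdbc:drill:drillbit=drill:31010."""
    return f"jdbc:drill:drillbit={host}:{port}"


def data_source_settings(
    host: str = "drill",
    port: int = 31010,
    user: Optional[str] = None,
    password: Optional[str] = None,
    enable_caching: bool = True,
    cache_path: Sequence[str] = ("dfs", "data"),
) -> dict:
    """Collect the data source details in a plain dict (hypothetical layout)."""
    settings = {
        "url": drill_jdbc_url(host, port),
        "enableCaching": enable_caching,
    }
    if enable_caching:
        # cachePath names the storage plugin and the writable workspace.
        settings["cachePath"] = list(cache_path)
    if user is not None:
        # Basic authentication is untested according to this page.
        settings["user"] = user
        settings["password"] = password
    return settings
```

With the defaults, the host name drill matches the service name in the compose file, so the URL resolves from inside the same Docker network.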
You must configure the writable storage plugin so that the path for dfs.data points to the local filesystem.
You can find more information in the official Apache Drill documentation for Configuring Storage Plugins.
You can configure the DSM through the web UI, or you can store the configuration in the file storage-plugins-override.conf and mount it as a volume into the container.
The following example is a snippet that demonstrates the configuration settings for the Apache Drill DSM:
"storage": {
  dfs: {
    type: "file",
    connection: "file:///",
    enabled: true,
    workspaces: {
      "tmp": {
        "location": "/tmp",
        "writable": true,
        "defaultInputFormat": null,
        "allowAccessOutsideWorkspace": false
      },
      "root": {
        "location": "/",
        "writable": false,
        "defaultInputFormat": null,
        "allowAccessOutsideWorkspace": false
      },
      "data": {
        "location": "/data",
        "writable": true,
        "defaultInputFormat": null,
        "allowAccessOutsideWorkspace": false
      }
    },
    formats: {
      "parquet": {
        "type": "parquet"
      },
      # ... add other formats based on your needs ...
    }
  }
}
Performance Tips
If you want to query large datasets, or join large datasets from different data sources, we recommend that you first snapshot the datasets into Apache Drill (CREATE TABLE AS) and then query the table snapshots.
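As a rough illustration of the snapshot approach, the sketch below builds a CREATE TABLE AS statement targeting the writable dfs.data workspace and wraps it in the JSON body that Drill's REST query endpoint (POST /query.json) accepts. The helper names, the table name, and the source plugin name pg are hypothetical. Adapt the SELECT to the storage plugins you have registered.

```python
import json
import re

# Accept only simple identifiers for the snapshot table name.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_ctas(table: str, select_sql: str, workspace: str = "dfs.data") -> str:
    """Build a CREATE TABLE AS statement that snapshots a query into a writable workspace."""
    if not _IDENT.match(table):
        raise ValueError(f"unsupported table name: {table!r}")
    # Drop surrounding whitespace and a trailing semicolon from the SELECT.
    query = select_sql.strip().rstrip(";").strip()
    return f"CREATE TABLE {workspace}.`{table}` AS {query}"


def rest_query_payload(sql: str) -> str:
    """Wrap a SQL statement in the JSON body for Drill's REST query endpoint."""
    return json.dumps({"queryType": "SQL", "query": sql})


# Example: snapshot a table from a hypothetical 'pg' storage plugin.
statement = build_ctas("orders_snapshot", "SELECT * FROM pg.public.orders;")
payload = rest_query_payload(statement)
```

The resulting statement can also be run from any Drill SQL client. Later queries then read from dfs.data.`orders_snapshot` instead of the original source.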
Query Timeout
Query timeout is not supported for Apache Drill yet.