This page provides you with instructions on how to extract data from Drip and load it into Delta Lake on Databricks. (If this manual process sounds onerous, check out Stitch, which can do all the heavy lifting for you in just a few clicks.)
What is Drip?
Drip is an eCommerce Customer Relationship Management platform that gathers and organizes customer data, then enables online retailers to craft personalized omnichannel buying experiences.
What is Delta Lake?
Delta Lake is an open source storage layer that sits on top of existing data lake file storage, such as AWS S3, Azure Data Lake Storage, or HDFS. It uses versioned Apache Parquet files to store data, and a transaction log to keep track of commits, to provide capabilities like ACID transactions, data versioning, and audit history.
Getting data out of Drip
You can collect data from Drip’s servers using webhooks, which are user-defined HTTP callbacks. Set up the webhook in your Drip account and point it at a URL that your script listens on, so the script can collect the data as it arrives.
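As a rough illustration, here is a minimal listener written in Python with Flask; the endpoint path, port, and output file are placeholder assumptions rather than anything Drip prescribes.

import json

from flask import Flask, request, jsonify

app = Flask(__name__)

# Hypothetical route; register whatever URL this resolves to in Drip's webhook settings.
@app.route("/drip/webhook", methods=["POST"])
def drip_webhook():
    event = request.get_json(force=True)
    # Append the raw payload to a local file so it can be transformed and loaded
    # later; in production you might land it in object storage or a queue instead.
    with open("drip_events.jsonl", "a") as f:
        f.write(json.dumps(event) + "\n")
    return jsonify({"status": "received"}), 200

if __name__ == "__main__":
    app.run(port=8080)

Persisting the raw payload before transforming it makes it easy to replay events if a later load step fails.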
Sample Drip data
Once you've set up webhooks and HTTP endpoints, Drip will begin sending data to them via POST requests, with the payload enclosed in the request body as JSON. Here's a sample of what that data might look like.
{ "id": "z1togz2hcjrkpp5treip", "status": "active", "email": "john@acme.com", "custom_fields": { "name": "John Doe" }, "tags": ["Customer", "SEO"], "time_zone": "America/Los_Angeles", "utc_offset": -440, "created_at": "2017-06-21T10:31:58Z" "ip_address": "123.123.123.123", "user_agent": "Mozilla/5.0", "lifetime_value": 2000, "original_referrer": "https://google.com/search", "landing_url": "https://www.drip.co/landing", "prospect": true, "base_lead_score": 30, "lead_score": 65, "user_id": "123" }
Preparing Drip data
You need to map the fields in the JSON data from your webhook onto a schema that can be inserted into your destination database. For each value in the payload, identify a predefined datatype (e.g., INTEGER, DATETIME) and build a table that can receive it.
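To make that concrete, here is one way such a mapping might look as a Spark schema in Python, based on the sample payload above; the field selection and type choices are illustrative assumptions, not a canonical Drip schema.

from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType,
    BooleanType, TimestampType, ArrayType,
)

# Schema for subscriber events, following the sample payload shown earlier.
drip_subscriber_schema = StructType([
    StructField("id", StringType()),
    StructField("status", StringType()),
    StructField("email", StringType()),
    StructField("tags", ArrayType(StringType())),
    StructField("time_zone", StringType()),
    StructField("utc_offset", IntegerType()),
    StructField("created_at", TimestampType()),
    StructField("lifetime_value", IntegerType()),
    StructField("prospect", BooleanType()),
    StructField("base_lead_score", IntegerType()),
    StructField("lead_score", IntegerType()),
    StructField("user_id", StringType()),
])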
Loading data into Delta Lake on Databricks
To create a Delta table, you can use existing Apache Spark SQL code and change the format from parquet, csv, or json to delta. Once you have a Delta table, you can write data into it using Apache Spark's Structured Streaming API. The Delta Lake transaction log guarantees exactly-once processing, even when there are other streams or batch queries running concurrently against the table. By default, streams run in append mode, which adds new records to the table. Databricks provides quickstart documentation that explains the whole process.
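Here is a minimal PySpark sketch of both a one-off batch load and a streaming load, assuming the raw webhook payloads have already been landed as JSON files under a storage path; the paths and the trimmed-down schema are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

spark = SparkSession.builder.appName("drip_to_delta").getOrCreate()

# A trimmed-down schema; in practice, reuse the full mapping built in the previous step.
schema = StructType([
    StructField("id", StringType()),
    StructField("email", StringType()),
    StructField("lead_score", IntegerType()),
    StructField("created_at", TimestampType()),
])

# One-off batch load: read the raw JSON events and write them out in Delta format.
(spark.read.schema(schema).json("/mnt/raw/drip/")
    .write.format("delta").mode("append").save("/mnt/delta/drip_subscribers"))

# Continuous load: Structured Streaming appends new events as they arrive; the
# checkpoint plus the Delta transaction log provide exactly-once processing.
(spark.readStream.schema(schema).json("/mnt/raw/drip/")
    .writeStream.format("delta")
    .option("checkpointLocation", "/mnt/delta/_checkpoints/drip_subscribers")
    .outputMode("append")
    .start("/mnt/delta/drip_subscribers"))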
Keeping Drip data up to date
Once you’ve coded up a script or written a program to get the data you want and move it into your data warehouse, you’re going to have to maintain it. If Drip modifies its webhook implementation, or sends a field with a datatype your code doesn't recognize, you may have to modify the script. If your users want slightly different information, you definitely will have to.
Other data warehouse options
Delta Lake on Databricks is great, but sometimes you need to optimize for different things when you're choosing a data warehouse. Some folks choose to go with Amazon Redshift, Google BigQuery, PostgreSQL, or Snowflake, which are RDBMSes that use similar SQL syntax, or Panoply, which works with Redshift instances. Others choose a data lake, like Amazon S3. If you're interested in seeing the relevant steps for loading data into one of these platforms, check out the corresponding guides: To Redshift, To BigQuery, To Postgres, To Snowflake, To Panoply, and To S3.
Easier and faster alternatives
If all this sounds a bit overwhelming, don’t be alarmed. If you have all the skills necessary to go through this process, chances are building and maintaining a script like this isn’t a very high-leverage use of your time.
Thankfully, products like Stitch were built to move data from Drip to Delta Lake on Databricks automatically. With just a few clicks, Stitch starts extracting your Drip data, structuring it in a way that's optimized for analysis, and inserting that data into your Delta Lake on Databricks data warehouse.