Cloud Dataprep on GCP: No-Code ETL Magic

Interactive Transformations and Intuitive GUI

5 min readFeb 1, 2024

Cloud Dataprep is a data preparation tool by Alteryx on GCP that seamlessly integrates with BigQuery and Dataflow. It is an intelligent cloud data service that revolutionizes data exploration, cleansing, and preparation for analytics and machine learning.

Following Trifacta’s acquisition by Alteryx in the first quarter of 2022, Alteryx can now offer its analytics solution across on-premises, hybrid, and cloud environments.

Dataprep excels in ETL with its smooth GUI and predictive interaction features. Integrated with BigQuery and Dataflow, it ensures high-speed performance.

Bigquery Architecture — Simplified

Bigquery is Google’s fully managed, serverless, highly scalable data warehouse that comes with a built-in petabyte…

psrajpoot.medium.com

Dataprep Features

1. Predictive Transformation:

Leverages a proprietary inference algorithm to interpret data transformation intent, automatically generating ranked suggestions and patterns.
ML-driven suggestions and predictive interactions streamline transformations.

2. Rich Transformations:

Offers hundreds of transformation functions (aggregation, pivot, joins, etc.) for quick data manipulation.
Enables one-click application of various transformations for data refinement.

3. Optimized Processing:

Automatically selects the best processing engine for fast transformation based on data volume and locality.
Utilizes BigQuery for in-place ELT transforms, Dataflow, or Photo (in-memory engine) for smaller volumes.

4. Active Profiling:

Interactive visual representations enable data exploration, discovery, cleansing, and transformation.
Dynamic visualizations of key statistical information aid in interpreting large data volumes.

5. Data Quality Enhancement:

Implements data quality rules to monitor and enhance data accuracy, completeness, consistency, validity, and uniqueness.
Provides comprehensive insights into data cleanliness.

6. Collaboration and Sharing:

Facilitates real-time collaboration on workflows among multiple users.
Allows users to create copies of quality work as templates for independent projects.

7. Comprehensive Connectivity:

Extensive connectivity with various data sources such as Salesforce, Oracle, MySQL, etc., in addition to standard connectors like BigQuery and Cloud Storage.
Enriches self-service analytics with diverse data sources for comprehensive insights.

1. Connectivity and Exploration

In addition to standard connections like BigQuery, Cloud Storage, Microsoft Excel, and Google Sheets, Dataprep offers integration with numerous data sources such as Salesforce, Oracle, Microsoft SQL Server, MySQL, PostgreSQL, and many more.

Flow

Datasets, recipes, and outputs can be grouped into flows, providing organizational units in the platform.

A flow enables you to visualize the connections and relationships between datasets as they are intertwined throughout the data transformation process.

Imported Datasets

An imported dataset within Dataprep serves as a reference to the source data. It is generated when specifying the file(s) or table(s) to be accessed through a connection.

2. Data Transformation

With Dataprep you can do almost all the common data transformations like joins, unions, imputations, aggregations, etc.

Sample

A sample represents a subset of the complete dataset. When working with your data in the interface, you observe a portion of the entire dataset, enabling real-time preview of any changes made.

Dataprep generates samples automatically for display and manipulation, optimizing performance.
You can easily customize sample size, scope, and generation method (Random, Filter-based, Anomaly-based, Stratified, Cluster-based).

Recipe

A recipe is a sequence of transformation steps for modifying your source dataset.

Using suggestions, toolbar options, or data histogram values, new steps are added. Modifications can be made through the Transform Builder — a context panel where transformations are adjusted and changes previewed.

Transforming Diverse Data

Easily manipulate CSV, JSON, relational tables, or massive SaaS application datasets, regardless of size — from megabytes to petabytes.

Pattern Identification and Application

Efficiently identify and leverage data patterns using columnar pattern matching.
Apply regular expressions or Dataprep patterns within your recipes to locate and transform specific data patterns.

Standardization and Clustering

Group similar values based on spelling or pronunciation, creating standardized and consistent clusters.

Dataprep Patterns

Define patterns within curly braces ({}) enclosed by backticks (`) in recipe steps for precise data manipulation.

3. Get Results

Output

Output refers to a user-defined collection of files or tables and their designated locations where results are stored upon completion of the job running on the recipe.

Each recipe has the capability of having multiple outputs, with options such as GCS, BigQuery, and more.

Job

A job involves executing the transformation steps outlined in recipes, sequentially across the entire dataset, to produce the desired results.

During execution, the recipe steps are applied in the specified order to the imported dataset(s), generating the corresponding output.

Running Environments

Bigquery: If your data resides on Bigquery Dataprep will do a BQ pushdown to run directly on the BQ engine.
Dataflow: If your data is stored somewhere else like GCS, your jobs will be executed using Dataflow.
Trifacta Photon
Best suited for small to medium-sized jobs, handling data up to approximately 1GB.

4. Automate and Orchestrate

Schedule

A schedule lets you automate your flow, running jobs at specified intervals.

Plan

A plan enables the connection of multiple flows, incorporating more intricate logic between them.

Data Pipeline Orchestration:

Sequentially and conditionally chain data preparation tasks.
Notify users of success or failure.
Trigger external tasks like Cloud Functions.
Utilize robust APIs for seamless integration into enterprise-wide solutions.

Enterprise-Scale Automation and Operationalization

Continuous Deployment:
Import/export recipes across editions and versions.
Configure Dataflow or BigQuery for optimal performance.
Advanced APIs for automated software development life cycles and monitoring.

Performance Optimization in Dataprep

The performance in Dataprep is influenced by:

Number of Steps: The more steps involved in the process, the greater the impact on performance.
Width of the Data: The wider the dataset — its number of columns or fields — the more it can affect performance due to increased processing requirements.
Complexity of Transformation: Intricate transformations can significantly impact performance, especially those involving extensive data manipulations or computations.

Advanced security

Utilize Google IAM roles alongside specific access rights in BigQuery, Cloud Storage, and Google Sheets to tailor individual data access control. IAM roles manage permissions at different levels in BigQuery and Cloud Storage, allowing precise control over who accesses what.

An ETL Pipeline Architecture with Dataprep

Dataprep seamlessly integrates into your analytics pipeline as an ETL component. Moreover, if your data is in BigQuery, there’s no need for extraction or loading. Dataprep converts the recipe to SQL and performs BigQuery pushdown to execute the code directly on BQ.

Your data can originate from various sources such as GCS, BQ, Google Sheets, etc.

With Dataprep, you link these datasets and output destinations (similar to sources like BQ and GCS), create the recipe (steps for transformation) using the GUI (without any coding), and then submit the job to run on Dataflow, BigQuery, or Photon.

After execution, your data is stored according to your linked destination, such as BQ or GCS.

Now, your clean and transformed data is ready for analysis (utilizing tools such as Looker, Power BI, or Tableau) or for machine learning applications (leveraging Vertex AI or BQML).