Caroline Vidal

Senior MarTech Engineer

← Back to Blog

Faster Data Prep & Pipelines with Gemini in BigQuery

Go from raw data to ML-powered customer segments without writing a single line of code. BigQuery's AI tooling is impressive, but what does it actually change about our day-to-day work?


Most posts about AI in BigQuery will tell you it saves time. That's true, but it's not the interesting part. The more useful question is: what does it actually change about how you work?

For years, we have known that data teams spend the majority of their time just getting data ready for analysis. Surveys show that up to 80% of a data practitioner's time is still consumed by data discovery, cleaning and preparation. Nearly 20% of that goes on searching and gathering data alone, followed by hours writing SQL to handle missing values, duplicates and complex joins.

The reason this bottleneck persists is simple: data preparation is still largely manual. To move faster, we need to shift from manual data wrangling to AI-assisted data engineering. This is something that Gemini in BigQuery can help with.


What is Gemini in BigQuery?

Gemini in BigQuery is an AI assistant built into the BigQuery interface. It can help with data exploration, cleaning and preparation, writing and troubleshooting queries, visualising results, and simplifying pipeline setup. It is not a single feature but a set of capabilities embedded across the platform. This post will explore how to use these capabilities in practice by covering the Data Preparation feature and showing how Gemini can enhance BigQuery Pipelines.

The impact is not just individual productivity. It changes how work is distributed across teams: engineers get time back from repetitive prep tasks, analysts can skip the boilerplate SQL to reach advanced insights faster, and technical marketers with basic SQL knowledge can answer questions themselves.


Data Preparation

Data Preparation is a BigQuery feature that uses Gemini to give you intelligent suggestions for cleaning and transforming your data. You can work using natural language prompts, AI-generated suggestions, or standard SQL, whichever fits the task or your preference. It has a low-code visual interface, which makes it straightforward to design even fairly complex transformation pipelines.

How it works

Opening a BigQuery table directly in Data Preparation gives you a live preview of the data. An important thing to know is that none of the transformations you make here affect the original table. The first step is to set a destination table, which is where the final clean data will be written. This means you can experiment and iterate freely without risking production data.

The Data Preparation interface in BigQuery

As soon as the table opens, Gemini starts suggesting potential improvements based on the structure and content of the data. Selecting a column makes the suggestions specific to that column, and you can preview any change before deciding to apply it.
In practice, the suggestions are most useful for the obvious stuff like inconsistent casing, mixed date formats, nulls. Where they fall short is business logic. The AI sees patterns in the data, it doesn't know what the data is supposed to mean. That's still on you.

In the top right corner, you'll find options to quickly transform the data. Aggregate, Join, Transform, and Drop handle some of the most common tasks in a few clicks.

From suggestions to custom prompts

Custom natural language prompt in the Data Preparation panel

When an AI suggestion is close but not quite right, you can click "Edit" and adjust the prompt directly.

When none of the suggestions fits, you can write your own transformation from scratch using the Transform option. In this example, I wrote a prompt to convert all revenue to EUR. The first result was ok, but a quick prompt refinement to round to two decimals got it exactly right.
Use the first result as a starting point, not always a final answer. The edit loop is fast enough that prompt refinement feels less like debugging and more like conversation.

Every transformation is recorded in the Applied Steps panel, so the pipeline is fully auditable and easy to adjust later.

Joining tables

Data Preparation also makes joining tables straightforward. Gemini suggests a join key automatically and provides insights into how the join will affect your data. In this case it told me the join would keep all rows from both tables, which is exactly what I needed and a useful sanity check before running anything.
This is the kind of check that catches fanout bugs before they hit production. If you have ever joined on a non-unique key and wondered why your row count tripled, having the expected join behaviour surfaced upfront is genuinely useful, not just for beginners.

JOIN suggestion panel with Gemini's join key and explanation

Once the transformations are ready, you can execute them immediately or schedule the pipeline to run on a set frequency. You also have full control over how the data is written, whether that is a full refresh, append, or incremental update.

The end result is a clean, repeatable, scheduled transformation pipeline, in this case, built entirely without writing any code.


BigQuery Pipelines

Data Preparation is a powerful tool on its own, but the real step change happens when you need to connect it to something bigger, like advanced ML models, complex SQL transformations or a fully automated end-to-end workflow. That is where BigQuery Pipelines comes in.

Pipelines are the bridge between ad-hoc analysis and production-grade data engineering. While complex, enterprise-wide dependencies might still require a dedicated tool like Cloud Composer or Airflow, BigQuery Pipelines allow you to orchestrate straightforward workflows directly where the data lives. No context switching, no managing separate infrastructure.

Pipelines let you take the assets you have already built, such as Data Preparation recipes, SQL queries and Python notebooks, and chain them together into a single automated sequence (a Directed Acyclic Graph, or DAG).
You can import existing assets or create new ones directly inside the pipeline. It is powered by Dataform under the hood, which means you can also open the pipeline in Dataform for further development if needed.

Building the pipeline

You create a pipeline from the BigQuery Explorer in a few straightforward steps. Once created, you start adding tasks.

I started by adding the Data Preparation flow I had already built. There's no need to recreate it, you simply import it right onto the pipeline canvas to be connected later.

Next, I added a query task. Since I want to demonstrate how we can build this without writing code, I used a Gemini prompt to generate SQL that calculates RFM metrics (Recency, Frequency, and Monetary value) for each customer. The prompt is saved at the top of the task, which acts as inline documentation. Anyone looking at this pipeline later can see exactly what the SQL was meant to do and why.

SQL task showing Gemini prompt and generated RFM query

Then, I added a Python Notebook task and prompted Gemini to run a K-Means clustering algorithm on those RFM metrics. This automatically groups the customers into distinct segments and writes the results directly back to BigQuery.

Using the "Run after" dropdown, I can set the execution order: Data Preparation first, then the SQL transformation, then the notebook.

Pipeline canvas showing all three tasks connected in sequence

This means that in a single scheduled workflow you have a no-code cleaning step feeding into a SQL transformation, which then feeds directly into a machine learning model. No context switching and no manual handoffs between tools.
This is the kind of workflow that would previously have meant coordinating between a data prep script, a scheduled query, and a separate notebook environment. Here everything lives in one place: one execution log, one schedule, one pipeline to maintain.

When notebooks run as pipeline tasks, the executed copy of the notebook, including all output cells, is stored in a Cloud Storage bucket. So for scheduled runs, you have a full record of what actually ran and what it produced.

Press "Run All Tasks" and the pipeline executes. In production, you would simply schedule this to run automatically every night, or as often as your business needs.

The output in this case is a BigQuery table of customer segments, RFM scores and cluster assignments, ready to feed directly into your CRM or marketing automation tool.

In this example, the entire pipeline was built without writing a single line of code by hand.


Considerations

Going from raw data to ML-powered segmentation this quickly is genuinely impressive. But with tools this capable, it matters how you use them.

A few things worth keeping in mind:

  • You still need the skills to validate the output. Gemini generates the code, but you need to review it. Having a working knowledge of SQL and Python is what allows you to catch errors and understand whether the result actually makes sense.
  • Be critical. Apply your domain knowledge. Does the output data look correct? Cross-check a few rows manually before trusting the result at scale.
  • Cost awareness. When using pipelines you are paying for BigQuery compute, Colab Enterprise runtimes if you are running notebooks, and Cloud Logging. On large tables, always check the bytes processed estimate before running. Gemini-generated SQL will not always be the most efficient query.
  • Quality in, quality out. No matter how good the tooling is, it cannot fix bad underlying data. The preparation step matters precisely because clean inputs produce trustworthy outputs.

The Future: Human + Machine

These features and capabilities change what is possible technically. But the more interesting question is how they change the shape of the work itself.
These tools make it faster to get to an answer. They do not make it easier to know if the answer is right. That still requires judgment, domain knowledge, and the willingness to push back on a result that looks reasonable but isn't.

McKinsey's State of AI in 2025 report looked at what separates high-performing AI teams from everyone else. Two findings are particularly relevant to the workflow we just built:

Intentional oversight. AI high performers are significantly more likely to have defined processes for determining how and when model outputs require human validation.

Workflow transformation. They do not just use AI to do the old process slightly faster. They are nearly three times more likely to have fundamentally redesigned their workflows around what AI is actually good at.

This is the shift from using AI as a standalone tool to building what McKinsey calls hybrid intelligence which means combining the speed and scale of AI and automated pipelines with the judgment and domain expertise of a person in the loop.

In the pipeline we built in this post, the AI handled the heavy lifting, the data cleaning, the SQL, and the clustering logic. But it was the human who provided the intent, validated the output, and turned those segments into something a business can actually act on. That is not just a faster way to work. It is a more reliable one.

The real return on these tools is not the time you save. It is the time you get back to reinvest in deeper analysis, in better questions, in work that actually requires human judgment.

The analyst's mindset is still key. Critical thinking, domain expertise, and the ability to validate what the AI produces are not going away. They are becoming more important, not less, because the gap between a good output and a plausible-but-wrong one is now much harder to spot at a glance.

The new skill is Human + Machine. Not choosing one over the other, but knowing when to trust the output, when to push back, and how to build workflows where both are doing what they are best at.


Sources: Integrate.io: Data Preparation Efficiency Stats · DataMites: Data Cleaning in Data Science · McKinsey & Company: The State of AI in 2025

← Back to Blog