Skip to main content

Using Python Jupyter

Using Python Jupyter with BigQuery

Jupyter notebooks are a great way to explore, analyze, and visualize your Amazon data stored in BigQuery.

Prerequisites

  • You have completed the setup from How to connect (BigQuery integration + Google Cloud credentials).
  • You have Python 3.10 or higher installed.

Step 1: Install dependencies

Create a project directory and install the required packages:

bash
1mkdir my-datadoe-analysis && cd my-datadoe-analysis
2pip install jupyter google-cloud-bigquery pandas db-dtypes matplotlib

Step 2: Start Jupyter

Launch the notebook server:

bash
1jupyter notebook

This opens Jupyter in your browser. Create a new notebook with a Python 3 kernel.

Step 3: Set up the BigQuery client

In the first cell, import the packages and create a BigQuery client. You can find your project ID and dataset ID on the DataDoe BigQuery Integrations page.

python
1from google.cloud import bigquery
2import pandas as pd
3
4project = "YOUR_PROJECT_ID"
5dataset = "YOUR_DATASET_ID"
6
7client = bigquery.Client(project=project)

Step 4: Run a test query

Create a simple query to verify the connection - daily sales for the last 30 days:

python
1query = f"""
2SELECT
3  order_date,
4  SUM(item_price_value) AS sales_value
5FROM `{project}.{dataset}.amazon_order_items_with_cogs`
6WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
7GROUP BY order_date
8ORDER BY order_date
9"""
10
11df = client.query(query).to_dataframe()
12df["order_date"] = pd.to_datetime(df["order_date"])
13print(df.head())
14df.to_csv("sales_value_last_30d.csv", index=False)

Step 5: Query product stats with ads data

This more advanced query joins orders and ads tables to compute product-level stats including ACoS, ROAS, CPC, and CTR:

python
1orders_table = f"`{project}.{dataset}.amazon_order_items_with_cogs`"
2ads_table = f"`{project}.{dataset}.amazon_ads_performance_by_child_asin_and_date`"
3
4query = f"""
5WITH
6  orders AS (
7    SELECT
8      child_asin,
9      product_name,
10      SUM(quantity) AS units_sold,
11      SUM(item_price_value) AS sales_value
12    FROM {orders_table}
13    WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
14    GROUP BY child_asin, product_name
15  ),
16  ads AS (
17    SELECT
18      child_asin,
19      product_name,
20      SUM(ad_spend) AS ad_spend,
21      SUM(ad_clicks) AS clicks,
22      SUM(ad_impressions) AS impressions,
23      SUM(ad_units_sold_14d) AS ad_conversions,
24      SUM(ad_sales_14d) AS ad_attributed_sales
25    FROM {ads_table}
26    WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
27    GROUP BY child_asin, product_name
28  )
29SELECT
30  COALESCE(o.child_asin, a.child_asin) AS child_asin,
31  COALESCE(o.product_name, a.product_name) AS product_name,
32
33  IFNULL(o.units_sold, 0) AS units_sold,
34  IFNULL(o.sales_value, 0) AS sales_value,
35
36  IFNULL(a.ad_spend, 0) AS ad_spend,
37  IFNULL(a.clicks, 0) AS clicks,
38  IFNULL(a.impressions, 0) AS impressions,
39  IFNULL(a.ad_conversions, 0) AS ad_conversions,
40  IFNULL(a.ad_attributed_sales, 0) AS ad_attributed_sales,
41
42  SAFE_DIVIDE(a.ad_spend, NULLIF(o.sales_value, 0)) AS acos_on_total_sales,
43  SAFE_DIVIDE(o.sales_value, NULLIF(a.ad_spend, 0)) AS roas_on_total_sales,
44  SAFE_DIVIDE(a.ad_spend, NULLIF(a.clicks, 0)) AS cpc,
45  SAFE_DIVIDE(a.clicks, NULLIF(a.impressions, 0)) AS ctr
46FROM orders o
47FULL OUTER JOIN ads a
48  USING (child_asin, product_name)
49ORDER BY sales_value DESC
50"""
51
52df_product_stats = client.query(query).to_dataframe(create_bqstorage_client=False)
53df_product_stats.to_csv("product_stats.csv", index=False)

Step 6: Find your top products

Aggregate and rank products by sales value:

python
1top = (
2    df_product_stats
3    .groupby(["product_name", "child_asin"], as_index=False)
4    .agg(
5        sales_value=("sales_value", "sum"),
6        units_sold=("units_sold", "sum"),
7        ad_spend=("ad_spend", "sum"),
8        ad_conversions=("ad_conversions", "sum"),
9        clicks=("clicks", "sum"),
10        impressions=("impressions", "sum"),
11    )
12)
13top["roas"] = top["sales_value"] / top["ad_spend"].replace({0: pd.NA})
14top = top.sort_values("sales_value", ascending=False).head(20)
15
16print(top)
17top.to_csv("top_products.csv", index=False)

Step 7: Visualize the data

Bar chart — top 5 products by sales

python
1import matplotlib.pyplot as plt
2
3top_five = top.head(5)
4plt.figure(figsize=(10, 4))
5plt.bar(top_five["child_asin"], top_five["sales_value"], color="skyblue")
6plt.xlabel("Child ASIN")
7plt.ylabel("Sales Value")
8plt.title("Top 5 Products by Sales Value")
9plt.show()

Full example notebook

A complete, ready-to-run Jupyter notebook with more examples is available on GitHub: DataDoe BigQuery Jupyter Example

Powered by

DataDoe