Unleashing the Power of BigQuery in Data Science: A Practical Overview

Jillani Soft Tech
7 min read · Feb 19, 2024


By 🌟Muhammad Ghulam Jillani (Jillani SoftTech), Senior Data Scientist and Machine Learning Engineer🧑‍💻

Image by Author Jillani SoftTech

Unlocking the potential of Google Cloud Platform (GCP) introduces us to BigQuery, Google’s formidable cloud-based data warehouse. Built on Dremel, Google’s internal query engine, BigQuery stores data in a columnar format and uses a tree architecture to parallelize queries across many machines. The result? SQL queries over tables holding billions of rows come back in seconds. What’s more, despite the extensive computational machinery involved, the cost remains remarkably low, at around $5.00 per terabyte of data processed under on-demand pricing. With Google handling the intricate infrastructure management, users are left free to focus on generating insights.

The real magic unfolds when we bring BigQuery into the realm of data science. With the ability to swiftly query massive datasets, analysts can now tackle complex business questions with unprecedented agility. Dashboards built on aggregated data can be crafted swiftly, revolutionizing decision-making processes.

Image by Author Jillani SoftTech

However, moving to BigQuery doesn’t automatically optimize a data science workflow. BigQuery excels at querying vast amounts of data, but its in-database transformations are limited to what can be expressed in SQL. More complex manipulations have to happen outside BigQuery, which introduces a new challenge: extracting data out of BigQuery is slow. The pain becomes palpable once you are dealing with hundreds of thousands of rows or more.

In Python, libraries like google-cloud-bigquery and pandas-gbq move data from BigQuery into pandas DataFrames. Yet, as highlighted in community discussions, this process isn’t without its hurdles. For instance, pulling 2 million rows from the New York Yellow Taxi 2018 public dataset might take a mere 18 seconds of query execution, but transferring the result into a pandas DataFrame via the to_dataframe() function can drag on for almost 6 minutes.

from google.cloud import bigquery as bq
import time

import gcsfs
import pandas as pd

# Authenticates with Application Default Credentials.
client = bq.Client()

taxi_query = """
#standardSQL
SELECT *
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`
LIMIT 2000000
"""

# Run the query and wait for it to finish.
now = time.time()
query_job = client.query(taxi_query, location='US')
res = query_job.result()
print('query took:', round(time.time() - now, 2), 's')

However, loading the result into a pandas DataFrame via the to_dataframe() function takes almost 6 minutes:

now = time.time()
# Materialize the result as a pandas DataFrame (this is the slow step).
yellow_taxi = res.to_dataframe()
print('load to pandas from BQ took:', round(time.time() - now, 2), 's')
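For reference, the pandas-gbq route looks like this. It is a minimal sketch, reusing the taxi_query and client defined above, and it typically hits the same download bottleneck:

import pandas_gbq

# Same query, pulled straight into pandas via pandas-gbq.
now = time.time()
yellow_taxi_gbq = pandas_gbq.read_gbq(taxi_query, project_id=client.project, dialect="standard")
print('pandas-gbq read took:', round(time.time() - now, 2), 's')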

The frustration of waiting minutes for data that the query itself produced in seconds underscores a critical challenge. Surprisingly, in scenarios where both BigQuery and a traditional SQL database are available, the latter can outpace BigQuery at handing rows to an analysis environment, simply because it holds a much smaller dataset.

In the realm of R, the bigrquery package presents a similar conundrum, albeit with somewhat improved data transfer performance.

Navigating the landscape of BigQuery in data science requires a nuanced understanding of its capabilities and limitations. While its prowess in handling mammoth datasets is unparalleled, efficient integration with data science workflows demands strategic optimization and potentially leveraging complementary tools and technologies. As data scientists continue to explore the boundaries of BigQuery’s potential, innovations in data transfer efficiency and seamless integration are poised to enhance its utility even further.

Optimizing Data Exploration with BigQuery: Practical Workarounds for Data Scientists

Exploratory data analysis lays the foundation for robust data-driven insights, but with colossal datasets, traditional workflows can hit roadblocks. In the realm of Google Cloud Platform (GCP), BigQuery emerges as a potent solution, yet efficient integration with data science workflows demands strategic optimization. Let’s delve into practical workarounds to expedite data exploration leveraging BigQuery.

1. Limiting Query Size:

Streamlining queries by limiting result sizes or applying filters and aggregations reduces the amount of data that has to leave BigQuery. The trade-off is sampling bias: a LIMIT clause is not a random sample, and in the era of big data representative samples are pivotal. Where an aggregate answers the question, pushing the aggregation into BigQuery keeps the transferred result tiny, as sketched below.
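For example, the following query (a minimal sketch; the aggregation columns are chosen purely for illustration, and client comes from the earlier snippet) does the heavy lifting inside BigQuery and returns only a small per-location summary:

# Aggregate in BigQuery and pull back only the small summary table.
# Column names follow the public tlc_yellow_trips_2018 schema used above.
summary_query = """
#standardSQL
SELECT
  pickup_location_id,
  COUNT(*) AS trips,
  AVG(trip_distance) AS avg_distance,
  AVG(total_amount) AS avg_total
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`
GROUP BY pickup_location_id
ORDER BY trips DESC
"""

summary_df = client.query(summary_query, location='US').result().to_dataframe()
print(summary_df.head())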

2. Dumping to CSV:

A clever workaround is to export the query result to CSV files in Cloud Storage and read them back from there. Despite the initial setup overhead, this method proves remarkably efficient, often speeding up data extraction by up to tenfold. By querying the data once and then reading the stored CSV files, repeatability is ensured, albeit with some management overhead.
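Before anything can be exported, the query result has to live in a regular BigQuery table. A minimal sketch of that step, assuming a taxi_us dataset already exists in your project and reusing client and taxi_query from the earlier snippet:

# Write the query result into the taxi_us.2018 table so it can be exported.
project = client.project
dest_table = client.dataset("taxi_us", project=project).table("2018")

job_config = bq.QueryJobConfig()
job_config.destination = dest_table
job_config.write_disposition = "WRITE_TRUNCATE"  # Overwrite on re-runs.

client.query(taxi_query, location="US", job_config=job_config).result()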

Exporting to GCS:

# Export the taxi_us.2018 table to GCS as sharded CSV files.
# Replace {name} with your bucket name; the * wildcard lets BigQuery
# split the export across multiple files.
now = time.time()
destination_uri = "gs://{name}-us/*taxi2018.csv"
dataset_ref = client.dataset("taxi_us", project=project)
table_ref = dataset_ref.table("2018")

extract_job = client.extract_table(
    table_ref,
    destination_uri,
)
extract_job.result()  # Waits for the export job to complete.
print('create table and write to GCS took:', round(time.time() - now, 2), 's')

Reading from GCS:

# Read the exported shards back from GCS and concatenate into one DataFrame.
now = time.time()
gcs = gcsfs.GCSFileSystem(project=project)
destination_uri = "gs://{name}-us/*taxi2018.csv"
files = gcs.glob(destination_uri)  # Paths are returned without the gs:// prefix.
df = pd.concat([pd.read_csv('gs://' + f) for f in files], ignore_index=True)

print('read csv took:', round(time.time() - now, 2), 's')

3. BigQuery Storage:

The BigQuery Storage API, in beta at the time of writing, changes how data is accessed: rows are streamed directly from BigQuery’s managed storage rather than through a query result. It does not accept arbitrary SQL, but data sampling remains feasible by selecting columns and applying simple row filters. Initial tests already show faster transfer into pandas, and reading from multiple streams in parallel should improve it further.


import time

import google.auth
from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1
import fastavro  # Used under the hood to decode the Avro row blocks.

credentials, your_project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)

# Make clients.
bqclient = bigquery.Client(
    credentials=credentials,
    project=your_project_id,
)
bqstorageclient = bigquery_storage_v1beta1.BigQueryStorageClient(
    credentials=credentials
)

# Reference the public taxi table directly (no SQL query involved).
table = bigquery_storage_v1beta1.types.TableReference()
table.project_id = "bigquery-public-data"
table.dataset_id = "new_york_taxi_trips"
table.table_id = "tlc_yellow_trips_2018"

# Select the columns to read and restrict rows to pickup_location_id == "48".
read_options = bigquery_storage_v1beta1.types.TableReadOptions()
read_options.row_restriction = 'pickup_location_id = "48"'
for field in [
    "vendor_id",
    "passenger_count",
    "trip_distance",
    "rate_code",
    "store_and_fwd_flag",
    "payment_type",
    "fare_amount",
    "extra",
    "mta_tax",
    "tip_amount",
    "tolls_amount",
    "imp_surcharge",
    "total_amount",
    "pickup_location_id",
    "dropoff_location_id",
]:
    read_options.selected_fields.append(field)

parent = "projects/{}".format(your_project_id)
session = bqstorageclient.create_read_session(
    table, parent, read_options=read_options
)

now = time.time()

# This example reads from only a single stream. Read from multiple streams
# to fetch data faster. Note that the session may not contain any streams
# if there are no rows to read.
stream = session.streams[0]
position = bigquery_storage_v1beta1.types.StreamPosition(stream=stream)
reader = bqstorageclient.read_rows(position)

# Parse all Avro blocks and create a DataFrame. This call requires a
# session, because the session contains the schema for the row blocks.
bq_storage = reader.to_dataframe(session)
print("Time Taken With BigQuery Storage:", round(time.time() - now, 2), 's')

4. BigQuery ML:

Google’s BigQuery ML presents an intriguing paradigm shift: machine learning models are trained directly inside BigQuery using SQL. At launch its scope was confined to linear and logistic regression (the list of supported model types has since been expanding), so anything beyond those model families, or heavier feature engineering, still has to happen elsewhere. The sketch below shows the basic pattern.
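A minimal sketch of that pattern: a CREATE MODEL statement submitted through the same Python client used earlier. The taxi_us dataset and the model name are placeholders, and the feature columns are chosen purely for illustration:

# Train a linear regression model inside BigQuery to predict tip_amount.
bqml_sql = """
CREATE OR REPLACE MODEL `taxi_us.tip_model`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['tip_amount']) AS
SELECT
  trip_distance,
  passenger_count,
  fare_amount,
  tip_amount
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`
WHERE tip_amount IS NOT NULL AND RAND() < 0.05
"""

client.query(bqml_sql, location="US").result()  # Training runs inside BigQuery.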

In conclusion, harnessing the full potential of BigQuery for data exploration demands a nuanced understanding of its capabilities and innovative workarounds. By optimizing query strategies and embracing emerging technologies like BigQuery Storage, data scientists can navigate the complexities of big data analytics with agility and efficiency.

Elevating Data Production Pipelines with BigQuery: Advanced Strategies for Scalable Solutions

In the realm of production-grade data processing, efficiency and robustness are paramount. While BigQuery emerges as a powerhouse for querying structured data with SQL, transitioning to production pipelines demands innovative solutions to overcome performance hurdles and ensure scalability. Let’s delve into advanced strategies tailored for the seamless integration of BigQuery into production environments.

1. Embracing TensorFlow:

Contrary to conventional wisdom, TensorFlow isn’t reserved solely for unstructured data. It has solid support for structured data, including linear regression, logistic regression, and boosted-tree models. A BigQueryReader class (originally shipped in TensorFlow’s contrib module) lets TensorFlow read from BigQuery directly, so machine learning pipelines can live inside production workflows; a small sketch follows below.
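A minimal sketch of the idea, assuming TensorFlow 2.x and reusing the bq_storage DataFrame produced by the Storage API snippet earlier (it sidesteps the BigQueryReader class itself, whose API has moved between TensorFlow releases):

import tensorflow as tf  # Assumes TensorFlow 2.x.

# Reuse the DataFrame pulled via the Storage API and fit a tiny linear model.
cols = bq_storage[["trip_distance", "passenger_count", "fare_amount"]].dropna()
features = cols[["trip_distance", "passenger_count"]].astype("float32").values
labels = cols["fare_amount"].astype("float32").values

dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(10_000)
    .batch(1024)
)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # Plain linear regression.
model.compile(optimizer="adam", loss="mse")
model.fit(dataset, epochs=3)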

2. Harnessing Spark:

With connectors to Spark, BigQuery opens the door to a plethora of distributed computing libraries, including H2O.ai’s Sparkling Water. While the synergy between BigQuery and Spark holds real potential, note that for users working in PySpark the data must first be exported to Cloud Storage before Spark can ingest it, as sketched below.
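A minimal sketch under that constraint: a PySpark job (for example on Dataproc, where the GCS connector is available) picks up the CSV shards exported earlier; the bucket placeholder matches the export snippet above:

from pyspark.sql import SparkSession

# Read the CSV shards that were exported to GCS in the earlier step.
spark = SparkSession.builder.appName("taxi-from-gcs").getOrCreate()

taxi_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("gs://{name}-us/*taxi2018.csv")
)
taxi_df.printSchema()
print(taxi_df.count())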

In Conclusion:

While BigQuery’s prowess in rapid SQL-based queries is undeniable, bridging the gap between data extraction and production pipelines poses challenges. For seamless integration into production-grade workflows, strategic approaches such as limiting query sizes or leaning on the TensorFlow and Spark connectors are essential. Personally, I find that dumping data to CSV and orchestrating the workflow from there is efficient and keeps analyses repeatable.

🤝 Stay Connected and Collaborate for Growth

  • 🔗 LinkedIn: Join me, Muhammad Ghulam Jillani of Jillani SoftTech, on LinkedIn. Let’s engage in meaningful discussions and stay abreast of the latest developments in our field. Your insights are invaluable to this professional network. Connect on LinkedIn
  • 👨‍💻 GitHub: Explore and contribute to our coding projects at Jillani SoftTech on GitHub. This platform is a testament to our commitment to open-source and innovative solutions in AI and data science. Discover My GitHub Projects
  • 📊 Kaggle: Immerse yourself in the fascinating world of data with me on Kaggle. Here, we share datasets and tackle intriguing data challenges under the banner of Jillani SoftTech. Let’s collaborate to unravel complex data puzzles. See My Kaggle Contributions
  • ✍️ Medium & Towards Data Science: For in-depth articles and analyses, follow my contributions at Jillani SoftTech on Medium and Towards Data Science. Join the conversation and be a part of shaping the future of data and technology. Read My Articles on Medium

Looking Ahead:

As the landscape of data processing continues to evolve, there’s a pressing need for further enhancements to the google-cloud-bigquery library. Key improvements, including interfaces to Spark/Dask for lazy data reading and support for dplyr-like transformations, promise to streamline data access and processing, further solidifying BigQuery’s position as a cornerstone of modern data science workflows.

By adopting these advanced strategies and embracing the evolving capabilities of BigQuery, data scientists can unlock new dimensions of efficiency and scalability, paving the way for transformative insights in production environments.
