A decade in Data Science: the rise and fall of big data

Thom Hopmans · June 13, 2024


As I approach the ten-year milestone in my career, I find myself reflecting on the incredible journey and shifts in the industry. Over the years, I’ve witnessed the ebb and flow of countless trends, tools, and technologies. One trend that particularly fascinates me lately is the evolving concept of “big data”. What was once a buzzword synonymous with massive scale and complexity has subtly transformed, bringing new tools and methodologies that make big data feel not so big anymore.

In my early days, the rise of MapReduce revolutionized the way we processed large datasets. It enabled distributed computing, allowing us to handle datasets that were previously hard to deal with locally. Before MapReduce, a simple 8 GB CSV file could already provide quite a challenge to analyze efficiently, even on those heavy Lenovo Thinkpad laptops. MapReduce suddenly made it possible to analyze 100 GB files on a remote cluster with a reasonable chance of success.

This paradigm shift was quickly followed by the advent of Apache Spark, which offered even faster processing capabilities and more sophisticated queries. Tools like Google BigQuery and Snowflake then made it possible to query vast amounts of data efficiently without needing to manage infrastructure. These innovations collectively pushed the boundaries of what we could achieve with data analysis on large datasets and made it accessible to a broader audience.

However, as we stand today, there is a noticeable shift back to the local machine. Despite the enormous data volumes we still handle, advancements in hardware and software have made it feasible to process significant datasets on a single machine, i.e. by going back to vertical scaling. Enter DuckDB, which has redefined my approach to querying large datasets. It offers the simplicity of SQL with the power to handle large volumes of data entirely in-memory, providing remarkable speed and efficiency. This shift allows us to harness the benefits of big data without the overhead of complex distributed systems. MotherDuck’s Big Data Is Dead post provides remarkable insights, such as “the vast majority of enterprises have data warehouses smaller than a terabyte” and “most data is rarely queried”, that underscore why the shift back to a single machine is so compelling.


This transition back to localized, in-memory processing marks an exciting era in data science. It’s a testament to how far we’ve come and how technology continually evolves to meet our needs more efficiently. For me, using tools like DuckDB and Polars to query large datasets on my machine feels like coming full circle, blending the lessons learned from the past decade with cutting-edge innovations.
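As a brief aside, here is a minimal sketch of what that kind of single-machine querying might look like with Polars, assuming one of the TLC Parquet files used later in this post has been downloaded to a local file (the path below is hypothetical):

import polars as pl

# Lazily scan a local copy of one month of TLC trip data (hypothetical path)
lf = pl.scan_parquet("yellow_tripdata_2024-01.parquet")

# Compute the average trip distance; nothing is read until collect() is called
avg_distance = lf.select(pl.col("trip_distance").mean().alias("avg_trip_distance")).collect()
print(avg_distance)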

As I look ahead, I am eager to see what new trends will emerge and how they will further reshape the landscape of big data.

Simple Example Using DuckDB

Here’s a quick demonstration of how simple it is to query a large dataset with DuckDB.

First, you’ll need to install DuckDB. You can do this easily via pip.

pip install duckdb

We load the New York City Taxi and Limousine Commission (TLC) Trip Record Data for this example.

import duckdb
import pandas as pd  # used implicitly by fetchdf() below

# Open an in-memory DuckDB database
con = duckdb.connect()

# URL of a single month of TLC trip data in Parquet format
url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet"

# Load data directly from the Parquet file URL into a DuckDB table
con.execute(f"CREATE TABLE IF NOT EXISTS trips AS SELECT * FROM '{url}';")

Let’s add some more months to simulate a large dataset. Each Parquet file is roughly 50 MB (compressed), so together they give us a modest “big data” workload to analyze.

# Additional Parquet file URLs to be added to the existing table
additional_urls = [
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-04.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-05.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-06.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-07.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-08.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-09.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-10.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-11.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-12.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet",
]

# Insert data from additional Parquet files into the existing `trips` table
for url in additional_urls:
    con.execute(f"INSERT INTO trips SELECT * FROM '{url}';")

Let’s start querying!

# Perform a simple query to get the average trip distance
result = con.execute("SELECT AVG(trip_distance) FROM trips").fetchall()
print(f"Average trip distance: {result[0][0]} miles")

# Perform another query to find the number of trips per passenger count
passenger_counts = con.execute(
    """
    SELECT passenger_count, COUNT(*)
    FROM trips GROUP BY passenger_count 
    ORDER BY passenger_count
"""
).fetchdf()
print(passenger_counts)

>>> Average trip distance: 4.057574092698775 miles
>>>     passenger_count  count_star()
>>> 0               0.0        614470
>>> 1               1.0      30012198
>>> 2               2.0       6014208
>>> 3               3.0       1485955
>>> 4               4.0        841971
>>> 5               5.0        516739
>>> 6               6.0        339322
>>> 7               7.0           101
>>> 8               8.0           312
>>> 9               9.0            56
>>> 10              NaN       1449518

That’s all you need for a simple, yet efficient and fast analysis!

This example shows the simplicity and power of using DuckDB for large datasets. The ability to process and analyze substantial datasets directly on your machine opens up new possibilities and efficiencies in the world of data science. Naturally, there are trade-offs to consider between processing data in-memory with DuckDB and distributing the workload across multiple workers using tools like Apache Spark. But considering that most companies do not have unreasonably large datasets, and that the cost of running high-memory instances has decreased significantly, an in-memory database such as DuckDB has become a very useful tool in the Data Science toolkit.
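For comparison, here is a minimal sketch of what the same aggregations might look like with PySpark. It assumes the monthly Parquet files have already been downloaded into a local data/ directory (a hypothetical path), since Spark does not read plain HTTPS URLs out of the box:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; on a cluster this would point at a cluster manager
spark = SparkSession.builder.appName("tlc-trips").getOrCreate()

# Read all monthly Parquet files from a local directory (hypothetical path)
trips = spark.read.parquet("data/yellow_tripdata_*.parquet")

# Same aggregations as in the DuckDB example
trips.agg(F.avg("trip_distance").alias("avg_trip_distance")).show()
trips.groupBy("passenger_count").count().orderBy("passenger_count").show()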