When Not to Use Apache Spark for Data Processing: A Guide for Data Engineers

Ioudom Foubi Jephte
4 min readJan 29, 2023

Recently, a professional on LinkedIn who is building his startup asked me whether he should use Apache Spark to scrape and process data for job search purposes. Instead of focusing on when or how to use Apache Spark, I felt it would be more beneficial for him to know when not to use Apache Spark, so that he doesn’t fall into a trap.

Apache Spark is an open-source analytics engine for big data processing. The project was started in 2009 at UC Berkeley’s AMPLab and open-sourced in 2010. In 2013, with a growing community building Spark and its impact expanding, the project was donated to the Apache Software Foundation.

It is widely used for data processing due to its ability to handle large amounts of data, its efficient processing capabilities, and its scalability. However, as with any tool, it is important to know when it is not the best choice for a particular use case. In this article, we will explore some of the scenarios when data engineers should not use Apache Spark for data processing.

Small Data Volumes

Apache Spark is designed to handle large amounts of data, and its overhead for small data volumes can outweigh its benefits. For data volumes under 100 MB, it is often more…
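To make this concrete: for a dataset this small, plain Python (or a single-machine library like pandas) finishes the job before a Spark session would even start up. Below is a minimal sketch, using only the standard library and a hypothetical scraped-jobs CSV, of the kind of aggregation that needs no cluster at all. The data and function names are illustrative, not from the original article.

```python
import csv
import io
from collections import Counter

# Hypothetical small dataset: scraped job postings, well under 100 MB.
raw = """title,city
Data Engineer,Berlin
Data Engineer,Paris
ML Engineer,Berlin
"""

def jobs_per_city(csv_text: str) -> Counter:
    """Count postings per city entirely in memory -- no cluster needed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["city"] for row in reader)

counts = jobs_per_city(raw)
print(counts.most_common())
```

At this scale, the few seconds Spark spends launching a driver, scheduling tasks, and serializing data would dwarf the processing time itself.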
