Considering Modern Data Integration Methods
As more organizations implement robust data solutions and technology advances, there is ongoing debate about the best architecture, methods, and applications to bring those solutions to life. Structured, relational databases, often referred to as the modern data warehouse, have been a favorite for their flexibility, ease of use, standardization, and scalability. Consolidating all data into a single source of truth, organizing it in logical fact and dimension tables, and making it available to users in curated zones breaks down silos and creates autonomy of use, among other benefits. With the sheer amount of data available and the speed at which data-creating events occur, however, modern data warehouses are becoming larger and more expensive.
Modern data warehousing is a term used most often to describe structured, relational data warehouses populated by batch processing from a variety of sources. This type of data platform makes data more accessible to end users, standardized across applications, and easy to scale.
Over the last few years, a new style of data warehousing has been gaining market share. Data streaming was initially tied to IoT devices and IoT devices alone, but its growing popularity has widened that scope to any data that needs to be analyzed in near real time.
Data streaming is a shift in mindset from the modern data warehouse or the data lakehouse. Currently dominated by Apache Kafka, streaming requires on-the-fly processing and transformation at the level of individual events. Instead of designing for batch loads of data at rest, the end-to-end architecture needs to focus on a constant, consistent flow of data in motion.
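As a minimal sketch of that event-by-event mindset, the loop below uses the confluent-kafka Python client to read from a hypothetical orders topic and transform each event the moment it arrives. The broker address, topic name, and helper functions are illustrative assumptions, not a prescription for any particular platform.

```python
import json
from confluent_kafka import Consumer  # assumes the confluent-kafka package is installed

def transform(event: dict) -> dict:
    """Placeholder for the business logic a batch job would apply once per run."""
    event["processed"] = True
    return event

def load_to_warehouse(row: dict) -> None:
    """Placeholder sink write; a real pipeline would insert into the warehouse."""
    print(row)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed local broker
    "group.id": "event-transformer",        # hypothetical consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])              # hypothetical topic

try:
    while True:
        msg = consumer.poll(1.0)            # wait up to one second for the next event
        if msg is None or msg.error():
            continue
        # transform and load each event as it arrives instead of waiting
        # for a scheduled batch window
        load_to_warehouse(transform(json.loads(msg.value())))
finally:
    consumer.close()
```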
As streaming becomes more prevalent, is there a reason to maintain traditional batch processing of data? Are there significant benefits to a 100% streaming platform? Ultimately, if data can be batch processed it can be streamed. But should it be?
Bringing all data into a system at a specific, triggered point in time can be a cumbersome process, even when working with incrementally loaded data sets. This style of data migration is more predictable and controllable, but it can result in long run times, CPU spikes, and bottlenecks while the batch job runs. The longer a job takes to populate end-user analytics and dashboards, the less timely that data is, resulting in lower impact or less accurate conclusions. Streaming individual events at the time they are created, by contrast, decreases the likelihood of high-impact processing spikes, turning the influx of data from one long run at a regular interval into a trickle of individual events throughout the day.
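For contrast, a scheduled incremental batch extract might look like the sketch below: everything changed since the last watermark is pulled and written in one shot. The source table, watermark column, and loader are assumed for illustration, and the sqlite3 module simply stands in for any relational source.

```python
import sqlite3                                  # stand-in for any relational source
from datetime import datetime, timezone

def load_to_warehouse(rows) -> None:
    """Placeholder for the single bulk write into the warehouse."""
    print(f"loaded {len(rows)} rows")

def run_nightly_batch(last_watermark: str) -> str:
    """Pull everything changed since the last run in one large extract."""
    src = sqlite3.connect("source.db")          # hypothetical source with an orders table
    rows = src.execute(
        "SELECT id, payload, updated_at FROM orders WHERE updated_at > ?",
        (last_watermark,),
    ).fetchall()                                # the whole increment arrives at once
    src.close()
    load_to_warehouse(rows)                     # one large write, one CPU and I/O spike
    return datetime.now(timezone.utc).isoformat()

# a scheduler such as cron or an orchestration tool triggers this once per period
watermark = run_nightly_batch("1970-01-01T00:00:00")
```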
Another problem often encountered with batch processing is all-or-nothing error handling. A job may be running successfully, reach 80% completion, encounter an error, and fail because of a single record or an added element in the schema. Resolving that error requires someone to understand the error, remedy it, and restart the entire job. If the error occurs when no one is monitoring the system, it may not even be detected until the next business day. In a streaming solution, when an individual event encounters an error, that event is diverted to an error log and the pipeline remains open for the next event to process smoothly. These error logs can be reviewed at any time, and the diverted events can either be corrected and reprocessed or determined invalid.
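The error log described above is often implemented as a dead-letter topic. The sketch below, again assuming the confluent-kafka client and hypothetical topic names, diverts a failing event to an orders.errors topic along with the reason, then moves straight on to the next event.

```python
import json
from confluent_kafka import Consumer, Producer

def validate_and_load(event: dict) -> None:
    """Placeholder: raise on a bad record instead of failing an entire batch."""
    if "order_id" not in event:
        raise ValueError("missing order_id")

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-loader",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["orders"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        validate_and_load(json.loads(msg.value()))
    except Exception as exc:
        # divert only the failing event; the pipeline stays open and the
        # next event processes normally
        producer.produce(
            "orders.errors",                         # hypothetical dead-letter topic
            key=msg.key(),
            value=msg.value(),
            headers=[("error", str(exc).encode())],  # keep the reason for later review
        )
        producer.flush()
```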
Let’s focus on a specific error: either the source or the sink of your pipeline is down.
With traditional batch processing, if the source is unavailable the pipeline fails at extraction. There may be a failsafe that retries the run a certain number of times, but most often the result is that data is not migrated to the sink platform. At that point someone has to check the status of the connection and manually restart the run. A similar scenario plays out if the failure is at the connection to the sink platform.
With streaming, both source and sink connection failures have little impact on the pipelines themselves, and no manual intervention is needed when the connection becomes available again. Streaming platforms such as Apache Kafka persist events in durable storage. When a sink connection fails, the pipeline keeps any events sent during the outage in that storage while continually checking whether the connection is back online. Once it is, those events process normally along with the backlog of data. This causes a modest spike in processing, depending on how large the backlog is, but it does not rely on business hours or manual restarts. Source applications likewise maintain a backlog until the connection is restored; the streaming platform continually pings the source until it is available, then processes every event since the last fully processed one.
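One common way to get that behavior on the sink side is to commit consumer offsets only after the sink write succeeds, so an outage simply leaves events in the broker's log to be replayed. The sketch below assumes the confluent-kafka client and a hypothetical write_to_sink helper that raises ConnectionError while the sink is unreachable.

```python
import time
from confluent_kafka import Consumer

def write_to_sink(payload: bytes) -> None:
    """Placeholder sink write; imagine it raises ConnectionError during an outage."""
    print(payload)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "warehouse-sink",
    "enable.auto.commit": False,          # only advance offsets after a successful write
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    while True:
        try:
            write_to_sink(msg.value())
            consumer.commit(message=msg)  # the broker retains anything not yet committed
            break
        except ConnectionError:
            # sink outage: events stay in the broker's log; keep retrying without
            # manual intervention and drain the backlog once the sink returns
            time.sleep(30)
```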
Data streaming remains a relatively new technology that has most often been applied to niche use cases. Throughout 2022 and the start of 2023, however, demand for streaming resources has grown as more organizations look to implement near-real-time solutions. Currently, 8 out of 10 Fortune 100 companies use Apache Kafka, which is considered the industry standard for data streaming. Unfortunately, the limited talent pool for these skills has slowed organizations' ability to implement and support streaming data platforms. Data streaming engineers need all the skills of traditional data engineers, including the ability to design, architect, and build data infrastructure while communicating outcomes effectively to business stakeholders. They also need to understand the conceptual shift away from batch processing, implement the specific technologies used in the industry, and deconstruct business logic so it applies to data in motion. The gap between available talent and open positions continues to grow, making implementation and support of streaming platforms a challenge.
Traditional data solutions have a similar but less drastic gap. While traditional data engineering continues to be in high demand, there is a higher ratio of individuals with those skill sets in the job market. This, coupled with the longer history of standard data solutions, makes traditional data warehousing a safer implementation path.
In today's cloud-centric world, both streaming and batch processing rely on per-unit pricing. Given that, the assumption would be that there is no pricing difference when the same data is being loaded. However, warehousing platforms scale dynamically to fit the size of the data being loaded, and pricing is not linear as those warehouses get bigger. Snowflake, for example, bills each warehouse size at a per-credit hourly rate.
Instead of price increasing at a one-to-one ratio with compute, the hourly rate roughly doubles with each step up in warehouse size. When using batch processing, all data is processed at once; if the warehouse size is increased to shorten the overall runtime of the batch job, the price per unit of data increases as well. Because streaming data is processed continuously, each increment of data remains small and requires little if any change to the warehouse size, keeping the price per unit of data low. Additionally, the error-handling differences outlined above result in fewer and smaller reruns for streaming data, removing the cost of large batch reruns entirely.
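A small back-of-the-envelope calculation makes the point. The credit rates below mirror the doubling pattern of Snowflake's published warehouse sizes, while the runtimes are purely hypothetical.

```python
# Illustrative only: hourly credit rates that double with each warehouse size,
# mirroring Snowflake's published pattern; runtimes are hypothetical.
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}

def run_cost(size: str, runtime_hours: float) -> float:
    """Credits consumed by a single batch run on the given warehouse size."""
    return CREDITS_PER_HOUR[size] * runtime_hours

# Upsizing doubles the hourly rate, so unless runtime falls by more than half,
# the cost per run (and per GB loaded) goes up.
print(run_cost("M", 2.0))   # 8.0 credits on a Medium warehouse
print(run_cost("L", 1.25))  # 10.0 credits on a Large that did not quite halve the runtime
```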
Resource procurement also plays a role in the pricing differences between the two styles of implementation. In the United States, the median base salary for a traditional data engineer is $128,942, while the base salary for a Kafka engineer is $140,000 (Talent.com). So not only will the search for talent be longer and more expensive due to the existing skills gap, but those resources will also remain more expensive after joining an organization.
Using 100% streaming across a data platform may be an interesting experiment and applicable in some instances, but for most current use cases would be impractical.
Particularly at the outset of a migration to data streaming, a hybrid model may be the best solution for your organization. Within any data schema there is always that one table, or that handful of tables, that takes up 90% of the scheduled batch run and receives gigabytes of new data every day. Offloading just those specific tables into a streaming solution to start may be the best way to optimize the overall solution.
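One way to picture that offload is change data capture limited to the heavy tables. The sketch below registers a Debezium-style source connector through the Kafka Connect REST API and restricts it to two hypothetical tables; the hostnames, credentials, and exact property names are assumptions and would need to match your connector and its version.

```python
import json
import requests

# Hypothetical: stream only the heaviest tables via CDC and leave the rest of
# the schema on the existing scheduled batch load.
connector = {
    "name": "orders-cdc",                        # illustrative connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "source-db.internal",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.dbname": "sales",
        "topic.prefix": "sales",
        # only the tables that dominate the batch run are streamed
        "table.include.list": "public.orders,public.order_events",
    },
}

resp = requests.post(
    "http://kafka-connect:8083/connectors",      # assumed Kafka Connect REST endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```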
Each of the factors above has to be taken into consideration when deciding how to structure your organization's data platform. There may already be in-house resources with experience in, or a willingness to learn, Kafka or another streaming service. Because of the difficulty of recruiting talent, you may prefer to continue with higher-cost batch processing for the time being. Each organization is unique, and the makeup of your data solution needs to be the right mix specifically for you.
Creating a data roadmap that identifies all of your data needs, wants, and realistic capabilities is key to ensuring your platform supports the organization efficiently and effectively across all techniques. Streaming may not be part of your initial implementation but rather a goal for phase two or three. CTI Data has worked with companies to help identify which processing type fits their vision for long-term data use and to build out the steps to get there.
The diagram below shows just how complex data solutions can become. Ensuring all of these pieces of the puzzle come together cohesively results in a functional, user-friendly, and results-driven data solution that includes both streaming and batch processing depending on the use case. Without a comprehensive plan of action, though, the elements below are bound to remain siloed and ineffectual even after significant investment in implementation.
By 2032, streaming analytics is expected to reach a market value of approximately $86.5B USD (Future Market Insights). Between 2017 and 2021 the market grew by 17.6% annually. As data streaming's prevalence has grown, users are finding ever more creative ways to architect and implement around an event-based solution instead of a batch-based one. For now the most functional solution may be hybrid, with streaming supporting event-based data transformation while slowly changing dimensions continue to be supported by scheduled data loads. Even so, the desire for near-real-time analysis to drive agile and informed business decisions will only continue moving this trend forward. Streaming is here to stay, but at least for now the power users of the technology know how to maintain the careful balance between this emerging technology and its predecessor.
Amanda Darcangelo is a Senior Consultant, Data & Analytics Practice at CTI. She is a featured speaker at the 2023 Women in Tech Global Conference, May 9th – 12th.