John Kirby

Python for the data lake

Updated: Nov 23, 2023

Techy people in the data world will be well aware that the tech changes quickly. Platforms, products, languages, frameworks, job titles, everything. Not only do they change, but there are far more of them in modern data projects.


Cloud data engineering is no exception. Vendors, including Microsoft, have embraced third-party products such as Databricks and Delta Lake. Will these be replaced at some point? There's no such thing as a perfect bit of software, so there's always room for improvement and therefore replacement.


Coding languages within these products depend not only on the products that support them, but also on the workforce that uses them. What do engineers like the most? What works best across the different platforms and products?


Python is a top candidate for data engineering. SQL isn't going anywhere for a long time as it's so ingrained. Together, they can meet most engineering needs for most projects. Many cloud data engineers are adapting from a classic on-prem SQL environment, so having both languages helps that migration.
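
As a rough illustration of the two languages working together, here is a minimal sketch using Python's built-in sqlite3 module. The in-memory database, the sales table and its values are all made up for the example and stand in for whatever platform a real project uses.

```python
import sqlite3

# In-memory database standing in for a real warehouse or lakehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],
)

# SQL does the set-based work: grouping and aggregating.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Python does the procedural work: shaping the result for the next step.
totals = {region: total for region, total in rows}
conn.close()

print(totals)  # {'north': 150.0, 'south': 80.0}
```

The split is the point: the query stays in familiar SQL, while Python handles everything around it.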


Why Python?


It's popular. That should make resourcing easier than it would be with a more niche language.


It's relatively simple to learn, which makes it accessible for people who want to get into it or cross-train from other languages.


It's supported by the major data platforms.


It supports the parallel processing that cloud data platforms offer, which is one of the main benefits of cloud computing, particularly with Spark and Databricks.
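
The idea behind that parallel processing is to split the data into partitions, work on each one separately, then combine the results. Spark does this across the nodes of a cluster. The sketch below is only a stand-in for that: it uses a thread pool from Python's standard library on a single machine so it runs anywhere, and the function names and data are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Split records into n_parts roughly equal slices."""
    return [records[i::n_parts] for i in range(n_parts)]


def process_partition(part):
    """Per-partition work: total the amounts in one slice."""
    return sum(part)


amounts = list(range(1, 101))  # made-up data: the numbers 1 to 100
parts = partition(amounts, 4)

# Each partition is handed to a worker; the results come back in order.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_totals = list(pool.map(process_partition, parts))

# Combine the partial results into the final answer.
grand_total = sum(partial_totals)
print(grand_total)  # 5050
```

On a real platform the engine decides how to partition and schedule the work, so the engineer mostly writes the per-partition logic.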


Why Not Python?


It's pretty 'loose' as opposed to a 'strict' language like C# or Java. This can be off-putting, as it needs a good eye kept on it to ensure the dynamically typed variables aren't misused, and it's sometimes tricky to know what a variable is holding.
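
A small sketch of the kind of slip this allows, and one way to guard against it. The function and variable names are made up for the example. Type hints document the intent, but Python doesn't enforce them at runtime, so the check here is explicit.

```python
def total_price(quantity: int, unit_price: float) -> float:
    """Multiply quantity by unit price, rejecting a non-integer quantity."""
    # Type hints are not enforced at runtime, so check explicitly.
    if not isinstance(quantity, int):
        raise TypeError(f"quantity must be an int, got {type(quantity).__name__}")
    return quantity * unit_price


# A value read from a text file often arrives as a string, not a number.
raw_quantity = "3"

# Without a check, Python repeats the string instead of failing.
silent_bug = raw_quantity * 2  # "33", not 6

# With the check, the mistake is caught straight away.
try:
    total_price(raw_quantity, 2.5)
    caught = False
except TypeError:
    caught = True

# Converting at the boundary gives the intended result.
correct = total_price(int(raw_quantity), 2.5)  # 7.5
```

A static type checker can flag the same mistake before the code runs, which gets some of the way towards the 'strict' feel.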


Its object orientation is looser than C# or Java. It has classes and inheritance, but no enforced private members or interfaces, so it doesn't have the same clean OO feel.


The core language isn't built for data work, so it relies on extension libraries like pandas, and because of the dynamic typing, IDE features such as autocomplete and refactoring are less dependable than they are for C# or Java.


So... do I want a python for my data lake?


Python does a great job of handling data lake workloads. It's a great tool for the job, simple, and well supported by tech and the workforce. One great thing with Databricks is the ability to use multiple languages in notebooks. Use the right tool for the right job. Python is one tool, and a good one.

