- Jul 23, 2025
Limitations of Data Lakes
- DevTechie Inc
- Data Engineering
Path to Data Engineering — Day 5
On Day 4, we started diving into data lakes: how cloud computing helped with their evolution and the problems they solved. We briefly touched upon NoSQL DBs and where they fit into the whole picture. We also established that the data lake became a prominent pattern in data architecture when it came to Big Data. In this article, we will dive much deeper into data lakes and their underlying technologies.
So, let’s get going.
With the evolution of cloud computing, data lake implementations are typically built on object storage technologies like Amazon S3, Google Cloud Storage, and Azure Blob Storage.
In a data lake, data is ingested from many systems like NoSQL DBs, relational DBs, APIs, file systems and others in its raw, native format (structured, semi-structured, unstructured) and stored as objects. The schema is then applied at the time of analysis by various tools, providing immense flexibility for different use cases. But going a bit more granular: what do we mean by "data is stored as objects"?
Data and file formats in Data lake
So, where a relational database (as we saw in Day 2) stores data in rows and columns, a data lake (built on object storage systems like S3, GCS, or Azure Blob Storage) stores data as files. These files can be in different formats like Parquet, CSV, ORC, JSON, Avro, images, video, or sometimes a custom binary format. Parquet and ORC are optimized columnar formats that help immensely with the performance of analytical queries (Avro is a row-based format better suited to write-heavy ingestion), with Parquet being the most popular.
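To make this concrete, here is a minimal sketch (assuming pandas and pyarrow are installed; the file path and column names are illustrative) of writing tabular data as a Parquet object and then reading back only the columns a query needs:

```python
# A minimal sketch of tabular data becoming a columnar Parquet file.
import pandas as pd

# Example order data as it might arrive from an upstream system.
orders = pd.DataFrame({
    "order_id": [1001, 1002, 1003],
    "region": ["EU", "US", "APAC"],
    "amount": [120.50, 89.99, 45.00],
})

# Write the data as a Parquet file; with s3fs installed the same call can
# target object storage directly, e.g. "s3://my-bucket/orders/orders.parquet".
orders.to_parquet("orders.parquet", engine="pyarrow", index=False)

# Analytical engines can read only the columns they need (the columnar advantage).
subset = pd.read_parquet("orders.parquet", columns=["region", "amount"])
print(subset)
```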
Metadata via Manifest file
Once these files are written, they cannot be modified in place, meaning you can delete and re-write a file but cannot update it. Each file/object has a set of key-value pairs that describe it (e.g., creation date, content type, owner, custom tags). This metadata is crucial for organization, discovery, and access control. A manifest file acts as a metadata container, providing information about the contents, structure, and dependencies within a dataset.
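To illustrate, here is a hypothetical manifest for a small dataset; the field names are illustrative assumptions rather than any specific table-format standard:

```python
# A hypothetical manifest describing one dataset snapshot; field names are
# illustrative, not tied to any specific table format.
import json
from datetime import datetime, timezone

manifest = {
    "dataset": "orders",
    "created_at": datetime.now(timezone.utc).isoformat(),
    "format": "parquet",
    "schema": {"order_id": "bigint", "region": "string", "amount": "double"},
    "files": [
        {"path": "orders/region=EU/part-0001.parquet", "rows": 52000, "size_bytes": 8400000},
        {"path": "orders/region=US/part-0002.parquet", "rows": 61500, "size_bytes": 9900000},
    ],
}

# Query engines and catalogs can read this manifest instead of listing
# every object in the bucket to discover the dataset's contents.
with open("orders_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```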
Partitioning
This is one of the most important concepts in a data lake. Partitioning, as in relational databases, is done to improve performance. Analytical query engines like Spark, Presto, Athena, and Hive can perform partition pruning and reduce query execution time. As an example, if I have an e-commerce website with customers from all over the world, I can partition them by region first, followed by a date stamp like year=2024/month=08/. If our queries analyze new customers this month by region, this partitioning will be very helpful.
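As a sketch of this idea (the bucket paths and column names are assumptions), a PySpark job could write the customer data partitioned by region and date, and a later query that filters on those columns would only scan the matching folders:

```python
# A minimal PySpark sketch of partitioning and partition pruning.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-example").getOrCreate()

# Assumes the raw data already carries region, year, and month columns.
customers = spark.read.parquet("s3a://my-bucket/raw/customers/")

# Writing partitioned by region/year/month produces a directory layout like
# region=EU/year=2024/month=8/part-*.parquet.
(customers
    .write
    .partitionBy("region", "year", "month")
    .mode("overwrite")
    .parquet("s3a://my-bucket/curated/customers/"))

# A query filtering on the partition columns only scans the matching folders
# (partition pruning) instead of the whole dataset.
new_customers = (spark.read.parquet("s3a://my-bucket/curated/customers/")
                 .filter("region = 'EU' AND year = 2024 AND month = 8"))
print(new_customers.count())
```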
Compaction/Optimization (Small File Problem)
Data lakes often ingest data in small, frequent batches (e.g., minute-by-minute logs). This can lead to a "small file problem" in object storage, where millions of tiny files degrade query performance and increase metadata overhead. The solution is to run periodic jobs (e.g., using Spark) that compact these small files into larger, more optimal files (e.g., 128 MB to 1 GB per file), typically in a columnar format like Parquet.
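A minimal compaction sketch (the paths and target file count are assumptions) might look like this: read the many small files for one ingestion partition, coalesce them into a handful of larger files, and rewrite them as Parquet:

```python
# A minimal compaction sketch: many small files in, fewer large files out.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compaction-job").getOrCreate()

# Read the many small files produced by frequent micro-batch ingestion.
events = spark.read.parquet("s3a://my-bucket/raw/events/date=2024-08-01/")

# Choose how many larger output files we want (sized toward the 128 MB-1 GB
# range), then coalesce before rewriting; a periodic job would do this per partition.
target_files = 8
(events
    .coalesce(target_files)
    .write
    .mode("overwrite")
    .parquet("s3a://my-bucket/compacted/events/date=2024-08-01/"))
```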
Limitations of Data Lakes
We have discussed before that data lakes promised ultimate flexibility by providing "schema-on-read" and allowing organizations to store data in its raw, untransformed state. However, it was soon realized that a complete lack of schema management could lead to significant problems, turning a data lake into a "data swamp."
The problems with having no schema were:
a. Finding data without any structure within a vast data lake was not optimal
b. The “dump it all in” mentality meant that data was often ingested without any validation rules, causing data inconsistencies
c. Identifying and protecting sensitive data (like PII or financial information) became extremely challenging
d. While “schema-on-read” is flexible, inferring schema on the fly for every query, especially on massive datasets, was computationally expensive
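To illustrate point (d), here is a minimal PySpark sketch (the paths are assumptions) contrasting on-the-fly schema inference with supplying an explicit schema when reading raw JSON:

```python
# Schema inference vs. an explicit schema when reading raw JSON from a data lake.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Schema inference: Spark must scan the data first to guess column types,
# which is computationally expensive on large datasets and can guess wrong.
inferred = spark.read.json("s3a://my-bucket/raw/orders/")

# Explicit schema: no inference pass, and type mismatches surface early.
orders_schema = StructType([
    StructField("order_id", LongType()),
    StructField("region", StringType()),
    StructField("amount", DoubleType()),
])
validated = spark.read.schema(orders_schema).json("s3a://my-bucket/raw/orders/")
```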
Lack of ACID Transactions
a. If a data ingestion job failed midway, you could end up with corrupted or incomplete files in your data lake. This meant manual clean-up and reprocessing, leading to unreliable data.
b. Multiple users or applications writing to the same location simultaneously could lead to race conditions, overwriting data, or inconsistent reads.
c. Updating or deleting specific records was extremely inefficient. It typically required reading the entire dataset, modifying the relevant records, and then rewriting the entire dataset. This was impractical for large datasets, yet such record-level changes are vital for use cases like Change Data Capture (CDC) or GDPR/CCPA compliance.
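To illustrate point (c), here is a sketch (the paths, column names, and predicate are assumptions) of the read-filter-rewrite pattern that record-level deletes required on a plain data lake:

```python
# Deleting a few rows (e.g., a GDPR erasure request) means rewriting the dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gdpr-delete").getOrCreate()

customers = spark.read.parquet("s3a://my-bucket/curated/customers/")

# Keep every record except those belonging to the customer being erased.
remaining = customers.filter("customer_id != 'c-42'")

# Writing to a new location and swapping it in avoids corrupting the data if
# the job fails midway, but the entire dataset is still read and rewritten.
remaining.write.mode("overwrite").parquet("s3a://my-bucket/curated/customers_v2/")
```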
Dealing with all these issues meant data engineers spent an enormous amount of time building complex, brittle ETL/ELT pipelines to ensure data quality, consistency, and performance.
All of this led to the evolution of data lake architecture and the embrace of a hybrid approach: the Medallion Architecture with its Bronze, Silver, and Gold layers, automated schema inference and validation with tools like the AWS Glue Data Catalog, and a central metadata store.
Conclusion
So in this article we saw that while data lakes are very powerful in providing scalability, decoupling, cost-effective storage, and support for diverse analytics, they still have some limitations. In the upcoming episode we will look deeper into the modern data lake architecture with a layered approach and ACID properties for better data consistency. Stay tuned for the next one. See you then!
Listen to the full podcast on Day 5 of Data Engineering