- Jul 18, 2025
From Traditional Data warehouse to modern day
- DevTechie Inc
- Data Engineering
Data lake Path to Data Engineering - Day 4
On Day 3, we learned about relational database design and data modeling, the significance of ACID properties, and how to approach implementation with conceptual, logical and physical design. On Day 4, we will start to focus on “Big Data”: how the modern data lake concept came into existence and how its data models and architecture may differ from what we learned on Day 3.
Let’s get going.
How did it all begin?
As we entered the 21st century, the internet, social media, mobile devices, sensors, and IoT devices began generating enormous volumes of data. The 3 V’s of data, i.e. Volume, Variety and Velocity, became a common way to describe data that arrives in very high volume, in many different formats (Variety), and at very high speed (Velocity); hence the term “Big Data” was coined. Traditional data warehouse systems were limited in their ability to handle Big Data. Before we dive further into data lakes, let’s first understand what a traditional data warehouse system is and what its limitations are.
Traditional Data warehouse systems
Traditional data warehouses were built on a “schema-on-write” principle, so data had to follow an Extract, Transform and Load (ETL) process. Data from different sources (an API call, another file system, another database, etc.) would be extracted by a scheduled job and would go through a series of validations, refinements and curation steps before it was loaded into the data warehouse. Data warehouses were often built with schemas like the Star schema or Snowflake schema, having dimension tables for descriptive attributes and fact tables for storing measurable, quantifiable data.
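To make the schema-on-write idea concrete, here is a minimal sketch in Python using an in-memory SQLite database: one hypothetical dimension table and one fact table (the table and column names are illustrative, not from any real warehouse), with validation happening before the load step.

```python
import sqlite3

# A tiny star schema: one dimension table, one fact table.
# All names here are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        category TEXT NOT NULL
    );
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        quantity INTEGER NOT NULL,
        amount REAL NOT NULL
    );
""")

# Extract: raw records pulled from some source system.
raw = [{"product": "Widget", "category": "tools", "qty": "3", "amount": "29.97"}]

# Transform: validate and coerce types BEFORE loading (schema-on-write).
for r in raw:
    qty, amount = int(r["qty"]), float(r["amount"])
    assert qty > 0 and amount >= 0, "validation failed, record rejected"
    # Load: insert the dimension row, then the fact row referencing it.
    cur = conn.execute(
        "INSERT INTO dim_product (name, category) VALUES (?, ?)",
        (r["product"], r["category"]),
    )
    conn.execute(
        "INSERT INTO fact_sales (product_id, quantity, amount) VALUES (?, ?, ?)",
        (cur.lastrowid, qty, amount),
    )

total = conn.execute("SELECT SUM(amount) FROM fact_sales").fetchone()[0]
```

The point is that any record failing validation never reaches the warehouse; the schema is enforced at write time.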
Such projects took a long time to implement; the whole process was expensive and inflexible, requiring specialized tools like Informatica, Cognos and MicroStrategy, along with reliable hardware. Further, data warehouses were designed for structured data and struggled to handle the rapidly growing influx of semi-structured and unstructured data. In addition, while these systems were great for business intelligence and reporting on predefined questions, they were not ideal for exploratory data analysis, machine learning, and data science needs, which require raw, untransformed data.
Essentially, the modern data lake concept came into existence because organizations needed a more flexible, scalable, and cost-effective way to store and analyze the ever-growing volume and variety of “big data” that traditional data warehouses simply weren’t designed to handle.
Hadoop — The underlying technology
Companies, especially those dealing with this explosion of data, started looking for different solutions. In 2003, Google published a research paper on the Google File System, which laid the groundwork for distributed computing. Apache Hadoop emerged as a practical implementation of these concepts: it provided a way to store and process massive datasets across commodity hardware, making storage and compute much more cost-effective and scalable.
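The processing model Hadoop popularized was MapReduce. As an illustrative sketch (this is plain Python imitating the idea, not Hadoop itself), here is a word count split into the three phases the framework would run across many machines: map over input splits, shuffle by key, then reduce.

```python
from collections import defaultdict

def map_phase(chunk):
    # Emit (word, 1) pairs for every word in one input split.
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # Group intermediate values by key, as the framework
    # would between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts collected for each word.
    return {word: sum(counts) for word, counts in groups.items()}

# Input partitioned across "nodes" (here, just two strings).
splits = ["big data big", "data lake"]
intermediate = [pair for chunk in splits for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(intermediate))
# counts == {"big": 2, "data": 2, "lake": 1}
```

Because each map call touches only its own split, the work scales out by adding machines rather than buying bigger ones, which is what made commodity hardware viable.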
Amazon S3 — object storage in the cloud
Many companies tried to build a practical implementation of the data lake, but Amazon S3 received the widest adoption for its massive scalability, cost-effectiveness, durability and availability, decoupling of storage and compute, security, and many more features.
Modern Day Data lake
When the concept of a “data lake” emerged, it was meant to be a place (persistent storage) where raw data from various sources could flow in and be stored in its native format, without requiring upfront structuring. So the key principle, unlike traditional data warehouse systems, was “schema-on-read”: data is stored in its raw format, and a schema is applied only when the data is read or queried for a specific analytical purpose. As discussed above, with cloud object storage services like Amazon S3, Azure Blob Storage and Google Cloud Storage, storage became much more cost-effective. Further, data lake support for advanced analytics, data science and machine learning was key to solving complex workflows. Organizations can now scale their storage needs independently of compute needs, bringing in more efficiency. In addition, a data lake acts as a central repository for all types of data — structured, semi-structured and unstructured — in their original, untransformed state.
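Schema-on-read can be sketched in a few lines of Python. In this hypothetical example, raw JSON events land in storage exactly as they arrived, even with missing fields, and a schema is imposed only at query time; the field names are invented for illustration.

```python
import json

# Raw events stored untouched in the lake. Note the third record
# is missing the "action" field; schema-on-read tolerates this at
# ingestion time.
raw_events = [
    '{"user": "a1", "action": "click", "ts": 1721260800}',
    '{"user": "b2", "action": "view", "ts": 1721260860, "page": "/home"}',
    '{"user": "a1", "ts": 1721260920}',
]

def read_with_schema(lines, required_fields):
    """Parse raw JSON lines, keeping only records that satisfy the
    schema the reader needs for this particular query."""
    for line in lines:
        record = json.loads(line)
        if all(field in record for field in required_fields):
            yield record

# The same raw data can serve different questions with different schemas.
clicks = [r for r in read_with_schema(raw_events, ["user", "action"])
          if r["action"] == "click"]
```

The trade-off is that malformed or incomplete records surface at read time rather than being rejected at write time, which is exactly the inverse of the warehouse’s schema-on-write guarantee.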
No-SQL (Not only SQL)
Since we talked about semi-structured and unstructured data, NoSQL databases are often a valuable source of such data. Many modern applications are built on NoSQL databases. NoSQL databases are often operational databases: the transactional backbone of the applications that produce data. Some examples of NoSQL databases are column-family stores like Cassandra and ScyllaDB, and key-value stores like DynamoDB. We will take a deep dive into NoSQL in a future session.
In short, if we draw a comparison between NoSQL and the data lake or data warehouse: NoSQL is optimized for high-throughput, low-latency operational workloads (e.g. serving individual application requests), whereas data lakes and data warehouses are optimized for online analytical processing (OLAP), batch processing, exploratory data analysis, and machine learning model training on large historical datasets.
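The contrast above can be illustrated with a toy example in Python (the data and key names are made up): an operational, key-value access pattern fetches one record by key, while an analytical pattern scans every record to aggregate.

```python
# Operational store, keyed for single-record access (NoSQL-style).
orders_by_id = {
    "o1": {"customer": "a1", "amount": 20.0},
    "o2": {"customer": "b2", "amount": 35.0},
    "o3": {"customer": "a1", "amount": 10.0},
}

# OLTP-style access: an O(1) point lookup to serve one user request.
one_order = orders_by_id["o2"]

# OLAP-style access: scan the whole dataset to answer an analytical
# question, here total revenue per customer.
revenue_per_customer = {}
for order in orders_by_id.values():
    customer = order["customer"]
    revenue_per_customer[customer] = (
        revenue_per_customer.get(customer, 0.0) + order["amount"]
    )
# revenue_per_customer == {"a1": 30.0, "b2": 35.0}
```

Each system is shaped around its dominant access pattern: the operational store pays for fast point lookups, the analytical store pays for efficient full scans over history.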
Conclusion:
So today, we started getting our feet wet in “Big Data” engineering. There is a lot more to discuss about data lakes, NoSQL databases, how these technologies complement each other, their trade-offs, and so on. On Day 5, we will dive deeper into data lakes and do a comparative study of different technologies. Stay tuned for more! See you in the next session.
Listen to the full podcast on Day 4 in Data Engineering
