It’s estimated that by 2025 the total amount of data worldwide will grow tenfold to 163 zettabytes. For those not familiar with a zettabyte, it’s a trillion gigabytes. The total volume of data doubles approximately every two years. Personal texts, tweets, pictures, and videos, combined with the data generated by Internet of Things (IoT) sensors, add up to an ever-increasing flood of data waiting to be used. To handle this flood, many companies are investing in “data lakes.” Data lakes are data platforms designed to store large amounts of structured data (i.e., data with a defined length and format that is easily searchable by traditional methods) and unstructured data (e.g., images and videos, which require more complex searching methods) more easily and less expensively than traditional databases, although traditional data warehouses and databases may be part of the total data lake solution.
Organizations are embracing data lakes because it’s relatively easy to populate them with raw data regardless of format and to quickly begin analyzing that data. But this ease of use has led to what some now call “data swamps”: huge pools of data that have become stagnant and from which it’s difficult to extract actionable insights. In fact, many corporations already have data swamps even though they’ve never implemented data lakes. Merely implementing a data lake won’t fix this problem. To drain these data swamps properly, companies will need to build out their data governance models, business processes, and technology as they move to a new data lake paradigm.
TREATING DATA AS AN ASSET REQUIRES DATA GOVERNANCE
In a survey by the Business Application Research Center (see http://bit.ly/2IGmJ3e), fewer than 50% of corporations surveyed agreed that data is highly valued or treated as an asset. Many companies are still struggling to implement viable data governance models in their enterprises. The advent of large data lakes has only complicated this endeavor. While being able to quickly import large amounts of data into data lakes is an important advance, most raw data will require at least a minimal amount of curation to make it useful in the long run. A data governance process that mandates the cataloging of metadata (such as labeling and categorizing the data) from day one can be a good start to avoiding the beginnings of a data swamp. Additionally, data owners need to be designated from day one to control access to the data and provide timely decisions as needed. Nevertheless, corporations with strong existing enterprise data governance processes will need to be careful about applying too much rigidity in the early stages of data lake creation so that easy access to the raw data is available to those who can benefit from utilizing the data in this format.
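As a rough illustration of what “cataloging metadata and designating owners from day one” could look like in practice, the Python sketch below rejects any dataset that arrives without an owner or at least one label. The `CatalogEntry` fields, the `Catalog` class, and the sample names are all hypothetical; a real enterprise catalog product would have its own schema and API.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CatalogEntry:
    """Minimal metadata recorded when a dataset lands in the lake (hypothetical schema)."""
    name: str
    owner: str            # person or team that controls access and makes decisions
    source_system: str
    data_format: str      # e.g. "csv", "json", "jpeg"
    tags: list[str] = field(default_factory=list)
    ingested_on: date = field(default_factory=date.today)


class Catalog:
    """In-memory stand-in for an enterprise data catalog (illustrative only)."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Enforce the "from day one" rules: every dataset needs an owner and a label.
        if not entry.owner.strip():
            raise ValueError(f"{entry.name}: a data owner must be designated")
        if not entry.tags:
            raise ValueError(f"{entry.name}: at least one tag is required")
        if entry.name in self._entries:
            raise ValueError(f"{entry.name}: already registered")
        self._entries[entry.name] = entry

    def find_by_tag(self, tag: str) -> list[str]:
        # Return the names of all datasets carrying the given tag, sorted for stable output.
        return sorted(e.name for e in self._entries.values() if tag in e.tags)
```

Note that the check is deliberately light: it asks only for an owner and a tag, not a full data model, so raw data can still land quickly.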
AGILE BUSINESS/DATA MANAGEMENT PROCESSES REQUIRED
While pumping raw data into a data lake will address some basic corporate data needs, it isn’t a panacea. If implemented haphazardly, or as the only approach, it will lead to more “swamps” than “lakes.” Depending on the application, the data in the lake will need further curation, cleansing, and quality assurance before it provides value to the corporation. The key is to add only as much process as needed to make the data usable for its intended purpose. “Less is more.” To accomplish this, global management consulting firm McKinsey recommends applying agile methodologies to the data management process. Begin with specific key business priorities and work backward to identify the technology solutions needed instead of focusing too much on the technology details up front. Data lakes need to be designed in tiers, with the cleanest tiers at the top and “sediment” left drifting in the dirtier data tiers at the bottom of the stack. The upper tiers will require more of the traditional enterprise data management processes (such as structuring, modeling, indexing, and cleansing), and the bottommost tier will be the raw data with minimal curation. By building out your data lake one business use case at a time and applying only the data management processes necessary for each use case, you can form different tiers over time while incorporating important business feedback and allowing the data governance and processes to evolve as the data lake is populated.
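The tiered approach can be sketched in a few lines of Python. In this minimal example, a raw tier is left untouched and a cleaner tier is derived from it by applying only the steps one use case needs. The record layout, the two cleansing steps, and the `promote` helper are assumptions made for illustration, not part of any specific data lake product.

```python
from typing import Callable, Optional

Record = dict
# A step returns a (possibly new) record, or None to drop the record.
Step = Callable[[Record], Optional[Record]]


def drop_missing_amount(rec: Record) -> Optional[Record]:
    # Cleansing: discard records with no usable amount.
    return rec if rec.get("amount") not in (None, "") else None


def normalize_amount(rec: Record) -> Optional[Record]:
    # Structuring: return a copy with the amount converted from text to a number.
    return {**rec, "amount": float(rec["amount"])}


def promote(raw_tier: list[Record], steps: list[Step]) -> list[Record]:
    """Build a cleaner tier from the raw tier without modifying the raw records."""
    curated = []
    for rec in raw_tier:
        current: Optional[Record] = rec
        for step in steps:
            current = step(current)
            if current is None:
                break  # record was dropped by this step
        if current is not None:
            curated.append(current)
    return curated


# Bottom tier: raw data with minimal curation (hypothetical sample records).
raw_tier = [
    {"order_id": 1, "amount": "19.99"},
    {"order_id": 2, "amount": ""},
    {"order_id": 3, "amount": "5"},
]

# Upper tier for one use case (revenue reporting): only the steps that use case needs.
revenue_tier = promote(raw_tier, [drop_missing_amount, normalize_amount])
```

A second use case would call `promote` with its own list of steps, which is how different tiers can accumulate over time while the raw data stays available to those who want it in that form.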
TECHNOLOGY DEVELOPMENT AND INTEGRATION
The systems and technology supporting the data lake will also need to be adapted over time. While data lakes usually begin with some underlying database management structure(s) and an analytics engine (such as Hadoop or Spark), these technologies are insufficient by themselves to meet the demanding needs of today’s real-time analytics environments. As organizations’ data lakes mature, they need a fully populated enterprise data catalog to allow easy discovery of data, artificial intelligence applications to continually learn from the data, and real-time data analytics engines that can deliver actionable insights to influence business users’ actions. Data lakes also must evolve to become the corporation’s central data network, encompassing the traditional data silos and breaking down the barriers between different data sources within the organization.
Draining your corporation’s existing data swamps requires a careful balance of data governance, business process improvement, and technology integration. Data lakes provide a foundation, but without proper curation, attention, and ongoing maintenance, they will only become one more swamp in your enterprise data ecosystem. By using agile methodologies and staying focused on business priorities, you can create small wins that build up into a vibrant data lake that will meet the ongoing needs of your business.