Create unconstrained control in distributed data management
Most data stacks begin governance at the warehouse level, but by then they don’t know where the ELT data came from or what its context and source are. This has to be fixed.
Enterprise data teams face new demands as businesses need faster access to timely information. Data analytics is growing from a single team into several larger, more focused teams that support more parts of the business. This puts pressure on centralized data engineering teams to handle an increasing number of requests from distributed analytics teams, such as marketing, finance, or product analytics. At the same time, privacy and security requirements force data engineers to take a hard look at data access and usage within their organizations. Increasingly, there is a need for more robust data management.
One way to reduce this friction is to use a modern ELT approach and a unified data stack. This opens the possibility of democratizing access to data in a company. Large organizations should strive to let data analysts “self-serve” their data needs while remaining compliant with data governance requirements. By using a delegated control approach, data teams can access the data they need, ensure the data is valuable to their work, and establish unconstrained control.
As more companies move to ELT, this modern approach delivers raw data that is generally more current, but the change also means analysts have less confidence in the reliability of data at the point of ingestion. Ensuring reliability and trust in data requires a stronger level of data governance and management, one that can monitor who has access to different data streams and that adds context about where the data comes from. That context helps ensure, for example, that a team is not fetching data from a QA server instead of the correct production CRM database.
This issue is particularly relevant for companies whose core data teams try to support a range of data teams in specific business units and departments. These core data teams end up spending too much time assessing and granting access to data when they could be looking for better ways to streamline data flow to the right teams or focusing on data projects with greater impact. Central data teams are pulled in multiple directions and need a better way to manage, prioritize, and track access to data.
At the same time, business function-based teams can be tempted to spin up their own S3 buckets and create their own data lake if they can’t get the access they need, which makes governance much more difficult. Then, when there is an audit, that access is shut off, and suddenly these teams can no longer do their job.
This issue hits hardest in industries that have high data complexity but traditionally lower levels of governance. Every business needs to know what kind of data goes where. Otherwise, data engineers may discover that PII is stored insecurely or that various data sources are combined without proper controls. Data engineering teams or automated tools then have to verify permissions and access rights to PII or other sensitive data for each request from an analyst, which slows progress.
Today almost every ELT tool is actually a black box. But when it comes to a new data tool or the creation of a BI report, many stakeholders need to approve that data access to ensure governance. A legal team will want to know if PII is present, and if so, limit access to a sales team, for example. Next, security will want to make sure they can perform data audits before making a tool the company standard. And the master data team just needs to know what kind of data is coming into the warehouse so they can figure out which teams have access on the other side.
Data governance today is centered on the warehouse and BI tools, but it doesn’t look at where the data comes from or verify the completeness or accuracy of that data. Suppose, for example, that a schema changes upstream. What impact does this have on downstream data? And what is the source of the data? What geography? Which column? Was it from a contact table in Salesforce or from a specific page? Without a modern data stack, this context is not always available. But companies need to know the lineage of their data so they can uncover errors or roll back steps if there’s an issue that needs fixing.
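The lineage questions above can be made concrete with a small sketch. It assumes lineage is recorded as a simple mapping from each asset to the assets it feeds; the asset names (`salesforce.contact`, `warehouse.customers`, and so on) and both helper functions are hypothetical and only illustrate the idea, not any particular tool's API.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed in its value.
LINEAGE = {
    "salesforce.contact": ["warehouse.raw_contacts"],
    "warehouse.raw_contacts": ["warehouse.customers"],
    "warehouse.customers": ["bi.sales_dashboard", "bi.churn_report"],
    "web.page_events": ["warehouse.raw_events"],
    "warehouse.raw_events": ["bi.churn_report"],
}


def downstream_of(asset, lineage=LINEAGE):
    """Return every asset reachable from `asset`, in breadth-first order."""
    seen = []
    queue = deque(lineage.get(asset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.append(node)
        queue.extend(lineage.get(node, []))
    return seen


def upstream_sources(asset, lineage=LINEAGE):
    """Return the root sources (assets with no parents) that feed `asset`."""
    # Invert the edge map so each asset points at its direct parents.
    parents = {}
    for src, targets in lineage.items():
        for target in targets:
            parents.setdefault(target, []).append(src)

    roots, stack, visited = set(), [asset], set()
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        if node not in parents:
            roots.add(node)
        else:
            stack.extend(parents[node])
    return sorted(roots)


if __name__ == "__main__":
    # If the Salesforce contact schema changes, what is affected downstream?
    print(downstream_of("salesforce.contact"))
    # Where does the data behind a given report originally come from?
    print(upstream_sources("bi.churn_report"))
```

With a record like this, a schema change in one source can be traced to the reports it touches, and a suspicious report can be traced back to the sources behind it.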
If companies want to serve all of their internal customers and specific departments without placing too much of a burden on central data teams, they should take the following steps:
- Organize teams to ensure unconstrained control. As data teams become increasingly integrated into business groups, the central data team must provide a standardized technology stack across the enterprise to provide governance. If distributed teams adopt common tools, central data teams can ensure governance is automatically enforced in a standardized way, while individual teams have more access to what they need.
- Establish organization-wide governance policies. As data teams become integrated into a business, different teams may end up using a variety of sources, pipelines, and destinations. Governance policies should apply to data assets rather than to individual tools. For example, the sales team needs access to customer information. That policy should then be applied to all sources, pipelines, and destinations that carry customer information. Configuring policies tool by tool makes it very difficult to ensure that a policy is applied correctly and consistently. Simplify things by starting governance early. This way you can make sure the data sources are registered and available, so you know the context and type of each source and can make sure the right policy is applied.
- Provide visibility into the movement of data. Focus less on cleaning and transforming data as it comes into a warehouse, and focus more on capturing all the context. Make sure your organization has a complete understanding of the “who/what/where” of its data so that the relevant distributed data teams have access to the appropriate data sources. Defer schema transformation and organization until you access the data, rather than doing it while ingesting it. This will save you time and add flexibility. Require teams to collect enough upstream metadata to support downstream access permissions. If a schema changes, teams should have the data lineage to determine the downstream impact.
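The steps above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not a reference implementation: the `Source` record, the `SOURCES` registry, the `POLICIES` table, the team names, and the `can_read` helper are all hypothetical. The point is that a policy is written once per data asset, and the metadata captured when a source is registered (environment, asset, PII flag) is what lets an access request be decided automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    environment: str    # e.g. "production" or "qa"
    asset: str          # logical data asset this source carries
    contains_pii: bool


# Hypothetical registry, filled in when a source is first connected.
SOURCES = {
    "crm_prod": Source("crm_prod", "production", "customer", True),
    "crm_qa": Source("crm_qa", "qa", "customer", True),
    "billing_prod": Source("billing_prod", "production", "invoice", False),
}

# One policy per data asset, not per tool: which teams may read the asset,
# and which of those teams are cleared to see PII.
POLICIES = {
    "customer": {"teams": {"sales", "finance"}, "pii_cleared": {"sales"}},
    "invoice": {"teams": {"finance"}, "pii_cleared": set()},
}


def can_read(team, source_name):
    """Return (allowed, reason) for a team requesting a registered source."""
    source = SOURCES.get(source_name)
    if source is None:
        return False, "source is not registered"
    if source.environment != "production":
        return False, "source is not a production system"
    policy = POLICIES.get(source.asset)
    if policy is None or team not in policy["teams"]:
        return False, "no policy grants this team access to the asset"
    if source.contains_pii and team not in policy["pii_cleared"]:
        return False, "source contains PII and the team is not cleared"
    return True, "allowed"


if __name__ == "__main__":
    for team, src in [("sales", "crm_prod"), ("finance", "crm_prod"),
                      ("sales", "crm_qa"), ("finance", "billing_prod")]:
        print(team, src, can_read(team, src))
```

Because the check reads only the registry and the asset-level policy, adding a new pipeline or destination for customer data does not require writing a new rule, only registering the new source against the existing asset.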
By centralizing on a data stack, providing an access structure, and analyzing data flow, enterprises are able to add unconstrained control for their central and dispersed data teams. This helps these companies audit systems and identify who has access to what data, while giving them the ability to set the right access policies and eventually integrate seamlessly with the organization’s governance toolset.
By taking steps to clearly define the different roles between central data teams and line-of-business analyst teams, large enterprises can better understand and manage how their data is used across the business. By clearly delineating different types of data requests and mapping them to different team needs, organizations can ensure that data is handled correctly while supporting a “self-service” approach that helps analysts perform their work efficiently.