Most organizations migrate their workloads to the cloud primarily to reduce IT expenses. However, while cloud adoption often delivers on its promise of scalability and flexibility, many organizations fall short of their cost reduction goals. In fact, Gartner predicts that 80% of organizations will overshoot their expected cloud costs.
In this article, I’ll explain the concept of data classification, describe data classification services offered by the big three cloud providers, and show how data classification can help you leverage low-cost storage tiers to achieve dramatic savings in the cloud.
What is Data Classification?
Data classification is the process of organizing structured and unstructured data into categories according to data type. You can use data classification to determine which data types are stored in your repositories and where that data is located.
Data classification insights can help you achieve the following:
- Prioritize security measures.
- Inform processes related to risk management, regulatory compliance, and legal discovery.
- Identify duplicate and stale data to reduce data storage and maintenance costs.
- Provide IT teams with the information needed to justify requests for data security investments.
- Streamline search and e-discovery to improve decision-making and user productivity.
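To make the idea concrete, here is a minimal sketch of rule-based classification in Python. The regex patterns and labels are purely illustrative; real services such as Amazon Macie use machine learning and far more sophisticated detectors.

```python
import re

# Hypothetical detectors for two common PII types, for illustration only.
PII_PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def classify_text(text):
    """Return the set of PII categories detected in a block of text."""
    return {label for label, pattern in PII_PATTERNS.items() if pattern.search(text)}

# Example: a record containing both an email address and an SSN.
findings = classify_text("Contact jane.doe@example.com, SSN 123-45-6789")
```

Even a toy classifier like this shows the core workflow: scan content, attach category labels, and then drive security or storage decisions from those labels.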
Data Classification in AWS
AWS offers data classification as part of a fully managed data privacy and security service called Amazon Macie. The service employs machine learning to automatically identify, classify, and protect sensitive data stored in the AWS cloud.
Amazon Macie can detect various sensitive data types, including intellectual property and personally identifiable information (PII). Macie can help you extend your visibility into how sensitive data is stored and accessed.
Macie displays insights in dashboards and also pushes alerts. The service continuously monitors data access activity for anomalous behavior and generates alerts when detecting indicators of accidental data leaks or unauthorized access.
Here are the key benefits of Amazon Macie:
- Extended data visibility—Macie helps security administrators manage visibility into data storage environments, such as Amazon S3.
- Automated data security—Macie employs machine learning to automate several processes, including data discovery, classification, and protection. Additionally, machine learning helps monitor data and detect data activity anomalies.
- Custom alert monitoring—Macie can send its findings to Amazon CloudWatch Events. You can then build custom remediation and alert management for any existing security ticketing systems.
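As a sketch of the custom alerting flow, the snippet below builds an EventBridge-style event pattern that matches Macie findings. The `aws.macie` source is what Macie uses when publishing findings; verify the exact detail-type string against the current Macie documentation before relying on it.

```python
import json

# Event pattern (as a plain dict) you could attach to an EventBridge rule
# that forwards Macie findings to an existing security ticketing system.
# "Macie Finding" is the documented detail-type; confirm against current docs.
macie_finding_pattern = {
    "source": ["aws.macie"],
    "detail-type": ["Macie Finding"],
}

print(json.dumps(macie_finding_pattern, indent=2))
```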
Data Discovery and Classification in Azure
Azure provides a data classification engine that is built into its main database offerings, including Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse. These services provide the following automated discovery and classification capabilities:
- Discovery—scanning a database and identifying columns with sensitive data.
- Labeling—applying sensitivity labels to columns using metadata attributes, which can later be used for auditing.
- Query result-set sensitivity—ability to calculate the sensitivity of query result sets in real time, for example for auditing purposes.
- Visibility—providing a dashboard showing data sensitivity within the Azure portal.
- Default and custom taxonomy—ability to automatically classify data using the engine’s pre-built labels or a custom set of labels.
- Policy management—ability to rank labels, associate them with information types, and configure them with string patterns for more complex discovery logic.
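Labels can also be applied manually with the `ADD SENSITIVITY CLASSIFICATION` T-SQL statement. The helper below generates such a statement; the table, column, label, and information type used in the example are hypothetical.

```python
def sensitivity_classification_sql(table, column, label, info_type):
    """Build the T-SQL statement Azure SQL uses to label a column's sensitivity."""
    return (
        f"ADD SENSITIVITY CLASSIFICATION TO {table}.{column} "
        f"WITH (LABEL = '{label}', INFORMATION_TYPE = '{info_type}');"
    )

# Hypothetical schema used for illustration only.
stmt = sensitivity_classification_sql("dbo.Customers", "Email", "Confidential", "Contact Info")
```

You would run the generated statement against the database (for example via `pyodbc`), after which the label appears in auditing output and the Azure portal dashboard.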
Data Classification in Google Cloud
Google Cloud provides Data Catalog, a fully managed data management and classification service. It provides the following key capabilities:
- Search and discovery—performing structured search and predicate-based search covering all metadata of data assets.
- Indexing metadata—ability to index metadata to enable rapid searches and classification.
- Adding tags—providing tag templates that allow different teams to create common metadata about data assets.
- Identifying sensitive data—Data Catalog integrates with Cloud Data Loss Prevention (DLP) scans to identify sensitive data in tag templates.
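As an illustration of the DLP integration, here is a Cloud DLP inspection configuration sketched as a plain dict. `EMAIL_ADDRESS` and `US_SOCIAL_SECURITY_NUMBER` are standard built-in DLP info type names; in practice you would pass a config like this to the `google-cloud-dlp` client when scanning content.

```python
# Inspection config (as a plain dict) listing the built-in info types to
# scan for; the minimum likelihood threshold filters out weak matches.
inspect_config = {
    "info_types": [
        {"name": "EMAIL_ADDRESS"},
        {"name": "US_SOCIAL_SECURITY_NUMBER"},
    ],
    "min_likelihood": "LIKELY",
}
```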
What are Storage Tiers?
Tiering is about finding the best storage option for your data over its entire lifecycle. Not all data is actively used and storing data in a high-performance tier can be a significant cost factor. Therefore, the three major cloud providers all provide several storage tiering options.
Amazon S3 Storage
Amazon S3 offers a variety of storage tiers designed for different use cases. These include:
- S3 Standard for universal storage of frequently accessed data.
- S3 Intelligent-Tiering for data with unknown or constantly changing access patterns.
- S3 Standard-Infrequent Access and One Zone-Infrequent Access for data that needs to be retained for a long time but is accessed less frequently.
- S3 Glacier and Glacier Deep Archive for long-term archiving and digital preservation.
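Transitions between these tiers are typically automated with an S3 lifecycle configuration. Below is a sketch of such a configuration as the dict you would pass to boto3's `put_bucket_lifecycle_configuration`; the `logs/` prefix and the day thresholds are illustrative assumptions, not recommendations.

```python
# Lifecycle rule: objects under the hypothetical "logs/" prefix move to
# Standard-IA after 30 days and to Glacier after 180 days.
lifecycle_config = {
    "Rules": [
        {
            "ID": "tier-down-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 180, "StorageClass": "GLACIER"},
            ],
        }
    ]
}
```

Applying it would look like `s3.put_bucket_lifecycle_configuration(Bucket="my-bucket", LifecycleConfiguration=lifecycle_config)`, where the bucket name is again hypothetical.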
Azure Blob Storage
Blob Storage is the Azure equivalent of S3. It provides three storage tiers:
- Hot Access Tier—useful for frequently accessed data or data that needs fast access. It has higher storage costs but lower data retrieval and access costs.
- Cool Access Tier—useful for data that has been stored for at least 30 days but is accessed infrequently. Provides lower storage costs, with the tradeoff of higher access and retrieval costs.
- Archive Access Tier—suitable for data stored for more than 180 days, which is rarely or never accessed. Minimizes storage costs but provides very slow access and higher data retrieval costs.
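The thresholds above suggest a simple tier-selection rule. The function below is a deliberately simplistic sketch; a real decision would also weigh retrieval costs and rehydration time, and the tier would actually be set with `set_standard_blob_tier` from the `azure-storage-blob` SDK.

```python
def choose_blob_tier(days_since_last_access):
    """Map access recency to a Blob Storage tier, using the 30/180-day
    thresholds described above. Illustrative heuristic only."""
    if days_since_last_access >= 180:
        return "Archive"
    if days_since_last_access >= 30:
        return "Cool"
    return "Hot"
```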
Google Cloud Storage
Google Cloud Storage is Google’s object storage service. It provides three storage tiers:
- Standard—for frequently accessed data.
- Nearline—for data accessed less than once a month.
- Coldline—for data accessed less than once a year.
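Like S3, Cloud Storage supports lifecycle rules that move objects between tiers automatically. The sketch below uses the JSON structure Cloud Storage expects for a bucket lifecycle configuration; the age thresholds are illustrative assumptions chosen to match the access patterns described above.

```python
# Bucket lifecycle config (as a plain dict): objects older than 30 days
# move to Nearline, and objects older than 365 days move to Coldline.
gcs_lifecycle = {
    "rule": [
        {
            "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
            "condition": {"age": 30},
        },
        {
            "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
            "condition": {"age": 365},
        },
    ]
}
```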
Using Data Classification to Better Leverage Storage Tiers and Save Costs
All three major cloud providers offer powerful data classification services, which can help you automatically assign data to labels or categories. At the same time, these cloud providers let you distribute data between multiple tiers, each of which is suitable for a different type of data and has a different cost structure.
By analyzing data, and identifying which data needs to be accessed frequently and which is not commonly used and can be archived, you can move large quantities of data to lower-cost archive tiers. This can result in dramatic cost savings.
Here is a general process you can follow to optimize your use of storage tiers:
- Select a large, heterogeneous data set.
- Run automated data classification and categorize the data by a criterion like business criticality, frequency of use, or relevance to the data user. The goal of classification should be to identify whether the data is really needed for day-to-day operations or is just “nice to have.” Broadly speaking, you should end up with categories along these lines:
  - Very important
  - Less important
  - Unimportant
- Perform a manual check on data that was discovered to be less important. You can draw a representative sample from the dataset and ask data owners to perform a manual review and confirm classification results.
- If data classification was imperfect, you can fine-tune it and repeat the process.
- When you are confident that classification is correct, proceed as follows:
  - Move data classified as “less important” to an infrequent-access tier, which still allows data users to access it but may incur retrieval costs or additional latency.
  - Move data classified as “unimportant” to an archive tier, or consider deleting it.
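The steps above can be sketched as a simple mapping from classification category to target tier. The categories and tier actions come from this article; the object keys, dataset contents, and tier names are invented for illustration.

```python
# Map each classification category to the tier action described above.
TIER_FOR_CATEGORY = {
    "very important": "standard",           # keep in the frequent-access tier
    "less important": "infrequent-access",  # cheaper storage, paid/slower retrieval
    "unimportant": "archive",               # archive tier, or candidate for deletion
}

def plan_moves(classified_objects):
    """Return {object_key: target_tier} for every object that should move."""
    return {
        key: TIER_FOR_CATEGORY[category]
        for key, category in classified_objects.items()
        if TIER_FOR_CATEGORY[category] != "standard"
    }

# Hypothetical classification results for three objects.
moves = plan_moves({
    "reports/q1.pdf": "very important",
    "logs/2019.tar.gz": "less important",
    "tmp/scratch.bin": "unimportant",
})
```

In a real pipeline, the output of `plan_moves` would feed the provider-specific tiering mechanism, such as an S3 lifecycle rule or a Blob Storage tier change.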
If you repeat this process across multiple datasets, you can generate substantial cost savings. Automating this process and creating a continuous content archiving lifecycle will generate even bigger benefits in the long run.