How to prevent data leakage?

In today’s era of big data, data is generated, circulated, and applied more widely and intensively than ever. Digitalization lets data deliver greater value, but at the same time the security risks behind these applications have grown increasingly prominent. Frequent data breaches in recent years have drawn serious attention to data security.

What is big data security?

Big data security refers to the measures used to protect data from malicious activity while data sets are stored, processed, and analyzed. These data sets are so large and complex that they cannot be handled by traditional database applications.

There are two forms of big data:

Structured formats: numbers, dates, and similar values organized into rows and columns.

Unstructured formats: social media data, PDF files, emails, images, etc.

It is estimated that up to 90% of big data is in unstructured format.

The value of big data lies in analyzing it at scale to produce useful information and conclusions that guide and improve business processes, drive innovation, or predict market trends.

To ensure big data security, three stages need to be considered:

  • Data in transit, as it moves from the source location to storage or real-time ingestion (usually in the cloud);
  • Data in the storage layer of the big data pipeline (such as the Hadoop Distributed File System);
  • Output data such as reports and dashboards, which are produced by analytics engines like Apache Spark, and whose confidentiality must be ensured.

The security threats at these key stages include improper access control, distributed denial-of-service (DDoS) attacks, endpoints that generate false or malicious data, and vulnerabilities in the libraries, frameworks, and applications used by big data workloads.

What challenges does big data security face?

Because of the complexity of the architectures and environments involved, big data security faces many challenges, such as:

  • Open source frameworks (such as Hadoop) that were not designed with security in mind;
  • Reliance on distributed computing, which means more systems where things can go wrong;
  • Ensuring the validity and authenticity of log and event data collected from endpoints;
  • Controlling insiders’ access to data mining tools and monitoring for suspicious behavior;
  • The difficulty of running standard security audits;
  • Protecting non-relational (NoSQL) databases.

10 best practices for protecting big data


Scalable data encryption

Scalable encryption of data at rest and in transit is critical across big data pipelines, and scalability is the key point: beyond storage formats such as NoSQL, encryption must also cover data across the analytics tool set and its output. The power of encryption is that even if an attacker intercepts packets or accesses sensitive files, a well-implemented encryption process renders the data unreadable.
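To make the “unreadable without the key” property concrete, here is a toy sketch: a keystream derived from a secret key with SHA-256 in counter mode, XORed with the data. This is illustrative only — a hand-rolled cipher like this must never protect real data; production pipelines should use a vetted algorithm such as AES-GCM from a maintained cryptography library.

```python
import hashlib
import secrets

def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from key+nonce via SHA-256 in counter mode."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """XOR the plaintext with a keystream; prepend the random nonce."""
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(a ^ b for a, b in zip(plaintext, stream))

def toy_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    """Recover the plaintext by regenerating the same keystream."""
    nonce, body = ciphertext[:16], ciphertext[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))

key = secrets.token_bytes(32)
ct = toy_encrypt(key, b"customer-records")
assert ct[16:] != b"customer-records"          # stored bytes are unreadable
assert toy_decrypt(key, ct) == b"customer-records"
```

The random nonce ensures the same plaintext encrypts to different bytes each time, so stored records cannot be matched against one another.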

User access control

Access control provides strong protection against a range of big data security issues, such as insider threats and excessive privileges. Role-based access control can govern multi-level access to big data pipelines. For example, data analysts can access analytics tools such as R, but not the tools used by big data developers, such as ETL software. The principle of least privilege is a good reference point for access control: restrict access to only the tools and data necessary to perform the user’s tasks.
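The principle of least privilege can be sketched as a simple role-to-tool mapping; the role and tool names below are hypothetical:

```python
# Each role is granted only the tools it needs; everything else is denied by default.
ROLE_TOOLS = {
    "data_analyst": {"r_studio", "dashboard"},
    "data_engineer": {"etl_suite", "hdfs_admin"},
}

def can_access(role: str, tool: str) -> bool:
    """Least privilege: allow only tools explicitly granted to the role."""
    return tool in ROLE_TOOLS.get(role, set())

assert can_access("data_analyst", "r_studio")
assert not can_access("data_analyst", "etl_suite")  # analysts cannot reach ETL tools
```

Denying by default (an unknown role gets an empty tool set) is what makes the check fail safe rather than fail open.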

Cloud security monitoring

The large storage and processing capacity that big data workloads require leads most enterprises to run them on cloud infrastructure and services. But however powerful cloud computing is, exposed API keys, tokens, and misconfigurations are notable risks. What if someone opens a data lake in AWS S3 to anyone on the Internet? Automatic scanning tools that quickly sweep public cloud assets for security blind spots make such risks much easier to reduce.
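A minimal sketch of such a scan, run against hand-built ACL records (the dict shape loosely mirrors an S3 ACL listing, and the grantee URI shown is the group S3 uses to mean “anyone on the Internet”; the bucket names are illustrative):

```python
# A grant to the AllUsers group makes a bucket readable by anyone on the Internet.
PUBLIC_GROUP = "http://acs.amazonaws.com/groups/global/AllUsers"

def find_public_buckets(acls: dict) -> list:
    """Return bucket names whose ACL grants any permission to all users."""
    flagged = []
    for bucket, grants in acls.items():
        for grant in grants:
            if grant.get("grantee") == PUBLIC_GROUP:
                flagged.append(bucket)
                break
    return flagged

acls = {
    "finance-data-lake": [{"grantee": PUBLIC_GROUP, "permission": "READ"}],
    "internal-logs": [{"grantee": "owner-canonical-id", "permission": "FULL_CONTROL"}],
}
assert find_public_buckets(acls) == ["finance-data-lake"]
```

A real scanner would fetch the ACLs through the cloud provider’s API and run on a schedule, but the flagging logic is this simple at its core.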

Cloud data backup

In the era of big data and cloud computing, the IT environment changes rapidly, and many enterprises are in the middle of a data transformation. In such a complex and volatile environment, data security matters even more. As traditional industries move online and more businesses run their own website servers, the importance of enterprise data is self-evident, so a backup plan should be formulated around the enterprise’s specific needs. A comprehensive backup strategy should cover: backup software, the backup system, data deduplication, a data archiving plan, and a disaster recovery plan.

Centralized key management

In a complex big data ecosystem, encryption security requires centralized key management to ensure effective, policy-driven handling of encryption keys. Centralized key management also maintains control over key governance, from creation through rotation. For enterprises running big data workloads in the cloud, BYOK (bring your own key) may be the best way to achieve centralized key management without handing control over key creation and management to a third-party cloud provider.
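A minimal sketch of versioned, centralized key handling: rotation issues a fresh key while old versions stay available to decrypt existing data (the class and method names are illustrative):

```python
import secrets

class KeyManager:
    """Central store of versioned keys; old versions remain available for decryption."""
    def __init__(self):
        self._keys = {}
        self.current_version = 0
        self.rotate()

    def rotate(self) -> int:
        """Create a fresh 256-bit key and make it the current version."""
        self.current_version += 1
        self._keys[self.current_version] = secrets.token_bytes(32)
        return self.current_version

    def key(self, version: int) -> bytes:
        """Look up a key by version, e.g. to decrypt data written under an old key."""
        return self._keys[version]

km = KeyManager()
v1 = km.current_version
km.rotate()
assert km.current_version == v1 + 1
assert km.key(v1) != km.key(km.current_version)  # old key retained, new key differs
```

Retaining old versions is what lets rotation happen without immediately re-encrypting every existing record; re-encryption can then proceed gradually.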

Network traffic analysis

A big data pipeline has many ingestion sources and constant traffic, from social media platforms to user endpoints. Network traffic analysis provides visibility into this traffic and any potential anomalies, such as malicious data from IoT devices or the use of unencrypted communication protocols.
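One simple form of traffic anomaly detection is to score new measurements against a baseline, flagging values several standard deviations above the mean. A sketch with made-up byte counts:

```python
import statistics

def anomalous(baseline, value, threshold=3.0):
    """Return True if `value` sits more than `threshold` std devs above the baseline mean."""
    mean = statistics.mean(baseline)
    stdev = statistics.pstdev(baseline) or 1.0   # avoid division by zero on flat baselines
    return (value - mean) / stdev > threshold

# Hypothetical bytes-per-minute from one ingestion endpoint.
normal_minutes = [1000, 1100, 950, 1020, 980]
assert not anomalous(normal_minutes, 1050)   # within normal variation
assert anomalous(normal_minutes, 50_000)     # sudden burst worth investigating
```

Real traffic analysis tools use far richer features (protocols, destinations, timing), but the baseline-and-deviation idea is the same.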

Internal threat detection

In the context of big data, insider threats challenge the confidentiality of company information. A malicious insider with access to analytics reports and dashboards may disclose information to competitors or even sell login credentials. Insider threat detection means checking the logs of common business applications, such as RDP, VPN, Active Directory, and endpoints. These logs can reveal anomalies worth investigating, such as unexpected data downloads or logins at unusual times.
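Such log checks can be sketched as a simple scan for off-hours logins and oversized downloads; the records, field names, and thresholds below are all hypothetical:

```python
from datetime import datetime

# Hypothetical log records; real ones would come from VPN / Active Directory exports.
LOGINS = [
    {"user": "alice", "time": "2024-03-01T09:12:00", "bytes_downloaded": 20_000},
    {"user": "bob",   "time": "2024-03-01T03:47:00", "bytes_downloaded": 900_000_000},
]

def suspicious(records, work_start=7, work_end=20, download_limit=100_000_000):
    """Flag users who log in outside working hours or download unusually large volumes."""
    flagged = []
    for rec in records:
        hour = datetime.fromisoformat(rec["time"]).hour
        off_hours = hour < work_start or hour >= work_end
        big_download = rec["bytes_downloaded"] > download_limit
        if off_hours or big_download:
            flagged.append(rec["user"])
    return flagged

assert suspicious(LOGINS) == ["bob"]   # 03:47 login plus a 900 MB download
```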

Threat hunting

Threat hunting is the proactive search for undiscovered threats lurking in the network. The process requires experienced security analysts to form hypotheses about potential threats using information from real-world attacks, threat campaigns, or findings from different security tools. Big data can actually improve threat hunting by surfacing hidden patterns in large volumes of security data. As a big data security practice, threat hunting also monitors data sets and infrastructure for artifacts of compromise.
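One basic hunting step — matching observed traffic against known indicators of compromise — can be sketched as follows (the IPs come from documentation-reserved ranges and are purely illustrative):

```python
# Hypothetical indicators of compromise (IOCs) gathered from threat intelligence.
KNOWN_BAD_IPS = {"203.0.113.9", "198.51.100.23"}

# Hypothetical network flow records drawn from collected security data.
flows = [
    {"src": "10.0.0.4", "dst": "203.0.113.9"},
    {"src": "10.0.0.7", "dst": "192.0.2.10"},
]

# Any flow to a known-bad destination is a lead for the analyst to investigate.
hits = [f for f in flows if f["dst"] in KNOWN_BAD_IPS]
assert [h["src"] for h in hits] == ["10.0.0.4"]
```

Real hunts iterate: a hit like this becomes the seed for a new hypothesis (what else did 10.0.0.4 talk to?) rather than an automatic verdict.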

Incident investigation

Monitoring big data logs and tools generates a lot of information, which usually flows into a security information and event management (SIEM) solution. Because so much information is generated, SIEM solutions easily produce false positives and flood analysts with alerts. Ideally, an incident response tool should provide context around security threats, enabling faster and more effective incident investigation.
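A small way to cut alert noise is to collapse duplicates into per-rule, per-host counts before they reach an analyst; the alert records here are hypothetical:

```python
from collections import defaultdict

# Hypothetical raw alerts; a real SIEM would emit far more.
alerts = [
    {"rule": "failed_login", "host": "db01"},
    {"rule": "failed_login", "host": "db01"},
    {"rule": "failed_login", "host": "db01"},
    {"rule": "port_scan", "host": "web02"},
]

def summarize(alerts):
    """Collapse duplicate alerts into one (rule, host) entry with a count."""
    counts = defaultdict(int)
    for a in alerts:
        counts[(a["rule"], a["host"])] += 1
    return dict(counts)

summary = summarize(alerts)
assert summary[("failed_login", "db01")] == 3   # one line instead of three alerts
```

The count itself is context: three failed logins on one host reads very differently from three hundred.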

User behavior analysis

User behavior analysis goes a step further than insider threat detection: it provides a dedicated tool set for monitoring how users behave on the systems they interact with. Typically, behavior analysis uses a scoring system to build a baseline of normal user, application, and device behavior, then alerts on deviations from that baseline. With user behavior analysis, insider threats and compromised user accounts can be detected more reliably.
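The baseline-and-deviation scoring idea can be sketched per user; the access counts are made up, and three standard deviations is an arbitrary illustrative cutoff:

```python
import statistics

class UserBaseline:
    """Score new activity against one user's own historical baseline."""
    def __init__(self, history):
        self.mean = statistics.mean(history)
        self.stdev = statistics.pstdev(history) or 1.0  # guard against flat histories

    def score(self, value):
        """Higher score = further from this user's normal behavior."""
        return abs(value - self.mean) / self.stdev

# Hypothetical daily record-access counts for one user.
baseline = UserBaseline([120, 135, 110, 128, 140])
assert baseline.score(130) < 3      # an ordinary day
assert baseline.score(5000) > 3     # alert-worthy deviation
```

Scoring each user against their own history is what distinguishes this from a global threshold: 5,000 record accesses may be normal for a batch job but not for this account.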

Data exfiltration detection

The risk of unauthorized data transfer deserves special attention, because a big data pipeline replicates large numbers of potentially sensitive assets, giving criminals an opening once a leak occurs. Detecting exfiltration requires deep monitoring of outbound traffic, IP addresses, and traffic data. To prevent it, pay close attention to tools that find dangerous security errors and misconfigurations in code, to data loss prevention, and to next-generation firewalls. At the same time, invest in employee education and data security awareness.
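Monitoring outbound traffic for exfiltration can be sketched as summing bytes per destination and flagging large transfers to hosts outside an allowlist; the addresses and threshold below are illustrative:

```python
from collections import defaultdict

APPROVED_DESTINATIONS = {"192.0.2.50"}   # hypothetical allowlisted backup server
EXFIL_THRESHOLD = 500_000_000            # bytes to any single unapproved host

def exfil_candidates(flows):
    """Sum outbound bytes per destination; flag big transfers to unapproved hosts."""
    totals = defaultdict(int)
    for flow in flows:
        totals[flow["dst"]] += flow["bytes"]
    return [dst for dst, total in totals.items()
            if dst not in APPROVED_DESTINATIONS and total > EXFIL_THRESHOLD]

flows = [
    {"dst": "192.0.2.50", "bytes": 800_000_000},    # approved backup, ignored
    {"dst": "203.0.113.77", "bytes": 300_000_000},
    {"dst": "203.0.113.77", "bytes": 300_000_000},  # cumulative 600 MB to unknown host
]
assert exfil_candidates(flows) == ["203.0.113.77"]
```

Summing per destination matters because exfiltration is often spread across many small transfers, none of which would trip a per-transfer limit on its own.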