The integration of Artificial Intelligence (AI) into cybersecurity has ushered in a new era of sophisticated threat detection, proactive vulnerability assessments, and automated incident response. As organizations increasingly rely on AI to bolster their defenses, one fundamental principle remains: the effectiveness of these advanced systems is directly tied to the quality of the data on which they are trained. The old saying “garbage in, garbage out” (GIGO) holds true here; to unlock artificial intelligence's potential, the power of pristine data is of the utmost importance.
Part 2 – The Perils of Poor Data Hygiene: Undermining AI Training and Performance
Neglecting data hygiene can have severe consequences for those who depend on AI for information. Data hygiene directly affects the training and performance of AI models, particularly in the critical domain of cybersecurity. Several common data quality issues can significantly undermine the effectiveness of even the most sophisticated AI algorithms.
Missing Data
One prevalent issue is incomplete data sets, in particular the presence of missing values (https://www.geeksforgeeks.org/ml-handling-missing-values/). This is a common occurrence in real-world data collections due to factors such as technical debt, software bugs, human error, or privacy concerns. The absence of data points for certain variables can significantly harm the accuracy and reliability of AI models. Incomplete information also reduces the effective sample size available for training and tuning, potentially degrading a model's ability to generalize. Furthermore, and slightly more complicated, if the reasons behind missing data points are not random, the introduction of bias becomes a real-world concern: a model might learn skewed relationships from the incomplete data. Ultimately, mishandling missing values can lead to biased and unreliable results, significantly hindering the overall performance of AI models.
Incomplete data can prevent ML models from identifying crucial patterns or relationships that exist within the full dataset. Addressing missing values typically involves either:
Removing data: deleting the rows or columns containing the missing elements. This comes with the risk of shrinking the dataset and potentially introducing biased results if the data is not missing at random.
Imputation techniques: filling in the missing values with estimated data. While this preserves the dataset size, it can introduce its own form of bias if the estimates are inaccurate.
The fact that missing data can systematically skew a model’s learning process, leading to inaccurate and potentially biased outcomes, highlights the importance of understanding the nature of the missingness. The types of missingness are:
Missing Completely At Random (MCAR)
Missing At Random (MAR)
Missing Not At Random (MNAR)
Understanding which mechanism is at play directly shapes the strategy for addressing the issue. Arbitrarily filling in missing values without understanding the underlying reasons can be more detrimental than beneficial.
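To make the removal-versus-imputation trade-off concrete, here is a minimal sketch in plain Python (the records and field names are hypothetical; real pipelines would typically use pandas or scikit-learn):

```python
import statistics

# Toy telemetry records: "bytes_out" is sometimes missing (None), e.g. a sensor gap.
records = [
    {"host": "web01", "bytes_out": 1200},
    {"host": "web02", "bytes_out": None},
    {"host": "web03", "bytes_out": 900},
    {"host": "web04", "bytes_out": 1500},
]

# Strategy 1: listwise deletion -- drop rows that contain missing values.
complete = [r for r in records if r["bytes_out"] is not None]

# Strategy 2: mean imputation -- fill the gaps with the observed mean.
observed = [r["bytes_out"] for r in complete]
mean_val = statistics.mean(observed)
imputed = [
    {**r, "bytes_out": r["bytes_out"] if r["bytes_out"] is not None else mean_val}
    for r in records
]

print(len(complete))            # 3 rows survive deletion
print(imputed[1]["bytes_out"])  # 1200.0 (imputed value)
```

Note that mean imputation quietly assumes the values are missing completely at random (MCAR); under MAR or MNAR it can bake the very bias described above into the model.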
Duplicate Data
Moving beyond missing data elements, another significant challenge is duplicate data within training datasets. While collecting massive datasets has become easier, the presence of duplicate records can considerably degrade quality and, ultimately, the performance and accuracy of AI models trained on that data. Duplicate entries can also bias model evaluation. This occurs primarily when exact or near-duplicate data exists in both training and validation sets, leading to an overestimation of a model’s performance on unknown data. Conversely, if a model performs poorly on a duplicated data point, it can artificially deflate the overall performance metrics. Furthermore, duplicate data can lead to overfitting, where a model becomes overly specialized and fails to capture underlying patterns in new, unseen data. This is particularly true with exact or near duplicates, which can reinforce patterns that may not hold across a broader data set.
Duplicate data is also computationally expensive, increasing training costs through the preprocessing and training overhead it adds. Additionally, duplicate data can lead to biased feature importance, artificially skewing the importance assigned to certain features if they are consistently associated with duplicated instances. In essence, duplicate entries can distort the underlying distribution of a larger data set, lowering the accuracy of probabilistic models. It is worth noting that the impact of duplicate data isn’t always negative and can be context-dependent. In some specific scenarios, especially with unstructured data, duplicates might indicate underlying issues with data processing pipelines (https://indicodata.ai/blog/should-we-remove-duplicates-ask-slater/). For Large Language Models (LLMs), the repetition of high-quality examples might appear as near-duplicates, and this can sometimes aid in the reinforcement of important patterns (https://dagshub.com/blog/mastering-duplicate-data-management-in-machine-learning-for-optimal-model-performance/). This nuanced view suggests that intimate knowledge of a given data set, and the goals of an AI model, are necessary when strategizing on how to handle duplicate data.
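A minimal sketch of both problems, using hypothetical log entries: de-duplicating a training set and checking for the train/validation leakage that inflates evaluation metrics:

```python
# Toy labeled dataset with an exact duplicate that also leaks into validation.
train = [
    ("GET /index.html", "benign"),
    ("POST /login.php?cmd=whoami", "malicious"),
    ("POST /login.php?cmd=whoami", "malicious"),  # exact duplicate
]
validation = [
    ("POST /login.php?cmd=whoami", "malicious"),  # also present in train
    ("GET /about.html", "benign"),
]

# De-duplicate the training set while preserving order.
# (set.add returns None, so the expression is falsy on first sight of a row.)
seen = set()
deduped = [row for row in train if not (row in seen or seen.add(row))]

# Check for train/validation overlap that would overstate model performance.
overlap = set(deduped) & set(validation)

print(len(deduped))  # 2 unique training rows
print(len(overlap))  # 1 leaked record
```

Near-duplicate detection (fuzzy matching, MinHash, embeddings) is considerably harder than this exact-match sketch, but the leakage check itself is cheap insurance before any evaluation run.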
Inconsistent Data
Inconsistent data, or a data set characterized by errors, inaccuracies, or irrelevant information, poses a significant threat to the reliability of AI models. Even the most advanced and sophisticated models will yield unsatisfactory results if trained on data of poor quality. Inconsistent data can lead to inaccurate predictions, resulting in flawed decision-making with contextually significant repercussions. For example, an AI model used to decide whether an email is dangerous might incorrectly assess risk, leading to business-impacting results. Similarly, in security operations, a log-analyzing AI system trained on erroneous data could incorrectly classify nefarious activity as benign.
Incomplete or skewed data can introduce bias if the training data does not adequately represent the diversity of the real-world population. This can perpetuate existing biases, affecting fairness and inclusivity. Dealing with inconsistent data often necessitates significant time and resources for data cleansing. This leads to operational inefficiencies and delays in project timelines. Inconsistent data can arise from various sources, including encoding issues, human error during processing, unhandled software exceptions, variations in how data is recorded across different systems, and a general lack of standardization. Addressing this issue requires establishing uniform data standards and robust data governance policies throughout an organization to ensure that data is collected, formatted, and stored consistently. The notion of GIGO accurately describes the direct relationship between the quality of input data and the reliability of the output produced by AI engines.
Here is a table summarizing some of what was covered in Part 2 of this series:

Missing Data – reduces effective sample size, introduces bias when values are not missing at random, and weakens a model's ability to generalize.
Duplicate Data – biases evaluation, promotes overfitting, inflates computational cost, and skews feature importance.
Inconsistent Data – drives inaccurate predictions and flawed decisions, introduces bias, and consumes significant cleansing time and resources.
Part 3 will cover cybersecurity applications and how bad data impacts the ability to unlock artificial intelligence's potential and harness the power of pristine data.
Part 1 – Defining Data Hygiene and Fidelity in the Context of AI and Machine Learning
Outside of the realm of areas like unsupervised learning, the foundation of any successful AI application lies in the data that fuels its models and learning processes. In cybersecurity, the stakes are exceptionally high. Consider a small security operations team with a disproportionate scope of responsibility. Rightfully so, this team may rely on a Generative Pre-trained Transformer (GPT) experience to balance the team's size against its scope of responsibility. If that GPT's back-end data source is not solid, the team could suffer from inaccuracies and time sinks that lead to suboptimal results. The need for clean data is paramount. This goal encompasses two key concepts:
Data Hygiene
Data Fidelity
Data Hygiene
Data Hygiene refers to the processes required to ensure that data is “clean”, meaning it is free from errors, inaccuracies, and inconsistencies (https://www.telusdigital.com/glossary/data-hygiene). Several essential aspects contribute to good data hygiene:
Accuracy: This is fundamental, ensuring that the information is correct and devoid of mistakes such as misspellings or incorrect entries. Just as importantly, accuracy has a direct impact on avoiding the introduction of bias into any learning models.
Completeness: This is as vital as accuracy in terms of what feeds a given model. Datasets must contain all the necessary information and avoid missing values that could skew results.
Consistency: Consistency ensures uniform data formatting and standardizes entries across different datasets, preventing contradictions. This has a direct impact on the effectiveness of back-end queries. For example, date formats vary internationally; to create an effective time range query, those stored values must be formatted consistently.
Timeliness: This dictates that the data must be current and relevant for the specific purpose of training an AI model. This doesn’t exclusively mean current based on the data timestamp; legacy data also needs to be available in a timely fashion.
De-duplication: Removing duplicate records is crucial to maintain accuracy, avoid redundancy, and minimize potential bias in the model training process.
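As one illustration of the consistency point above, the following sketch normalizes timestamps recorded in different regional formats into a single ISO 8601 representation (the format list is hypothetical and would need to match the formats actually present in your data stores):

```python
from datetime import datetime, timezone

# Hypothetical formats observed across different source systems.
KNOWN_FORMATS = ["%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M:%S", "%m-%d-%Y %H:%M:%S"]

def normalize_timestamp(raw: str) -> str:
    """Parse a timestamp in any known format and emit ISO 8601 UTC."""
    for fmt in KNOWN_FORMATS:
        try:
            dt = datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
            return dt.isoformat()
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"unrecognized timestamp format: {raw!r}")

print(normalize_timestamp("2024-03-05 14:30:00"))   # ISO style
print(normalize_timestamp("05/03/2024 14:30:00"))   # day-first style, same instant
```

Truly ambiguous values (is 05/03 May 3rd or March 5th?) cannot be resolved by parsing alone; they require knowing which system produced the record, which is exactly why uniform standards at ingest time beat cleanup after the fact.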
Implementing a robust data hygiene strategy for an AI project yields numerous benefits, including improved accuracy of AI models, reduced bias, and ultimately saved time and financial resources that organizations would otherwise spend correcting unsatisfactory results (https://versium.com/blog/ais-achilles-heel-the-consequence-of-bad-data). Very much like cybersecurity endeavors themselves, data hygiene cannot be an afterthought. The consistent emphasis on these core hygiene attributes highlights their fundamental importance for any data-driven application, especially in the critical field of AI. Moreover, maintaining data hygiene is not a one-time effort. It is a continuous set of processes and a commitment that involves regular audits and possible rebuilds of data systems, standardization of data input fields, automation of cleansing processes to detect anomalies and duplicates, and continuous processes to prevent deterioration of quality. This continuous maintenance is essential in dynamic environments such as cybersecurity, where data can quickly become outdated or irrelevant.
Data Fidelity
Data Fidelity focuses on integrity, accurately representing data from its original source while retaining its original meaning and necessary detail (https://www.qualityze.com/blogs/data-fidelity-and-quality-management). It is driven by several key attributes:
Accuracy: In the context of data fidelity, accuracy means reflecting the true characteristics of the data source without distortion. The data has a high level of integrity and has not been tampered with.
Granularity: This refers to maintaining the required level of detail in the data. This is particularly important in cybersecurity, where subtle nuances in event logs or network traffic can be critical. A perfect example in the HTTP realm is knowing that a particular POST request carried a malicious payload but not being able to see the payload itself – a loss of granularity that hampers investigation.
Traceability: This is another important aspect, allowing for the tracking of data back to its origin. This can prove vital for understanding the context and reliability of the information as well as providing reliable signals for a forensics exercise.
Synthetic data is a reality at this point and is increasingly used to populate parts of model training datasets. As a result, statistical similarity to the original, real-world data is a key measure of fidelity. High data fidelity is crucial for AI and Machine Learning (ML): it ensures models learn from data that accurately mirrors the real-world situations they aim to analyze and predict.
This is particularly critical in sensitive fields like cybersecurity, where even minor deviations from the true characteristics of data could lead to flawed security assessments or missed threats (https://www.qualityze.com/blogs/data-fidelity-and-quality-management). The concept of fidelity extends beyond basic accuracy to include the level of detail and the preservation of statistical properties. This becomes especially relevant when dealing with synthetically generated data or when aiming for explainable AI models.
The specific interpretation of “fidelity” can vary depending on the particular AI application. For instance, in intrusion detection, it might refer to the granularity of some data captured from a specific event. Yet in synthetic data generation, “fidelity” emphasizes the statistical resemblance to some original data set. In explainable AI (XAI), “fidelity” pertains to the correctness of the explanations provided by a model (https://arxiv.org/html/2401.10640v1). While accuracy remains a core component, the precise definition and emphasis of fidelity are context-dependent, reflecting diverse ways in which AI can be applied to the same field.
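One common way to quantify the statistical-similarity sense of fidelity for synthetic data is the two-sample Kolmogorov-Smirnov (KS) statistic: the maximum gap between the empirical distribution functions of the real and synthetic samples. Here is a from-scratch sketch with toy numbers (in practice a library routine such as scipy.stats.ks_2samp would be used):

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max gap between empirical CDFs (0 = identical)."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = sum(x <= v for x in a) / len(a)
        cdf_b = sum(x <= v for x in b) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

# Toy "real" measurements and two synthetic candidates.
real = [10, 12, 11, 13, 12, 14, 11, 12]
synthetic_good = [11, 12, 10, 13, 12, 13, 11, 14]
synthetic_bad = [30, 35, 32, 31, 36, 33, 34, 30]

print(ks_statistic(real, synthetic_good))  # small gap -> high fidelity
print(ks_statistic(real, synthetic_bad))   # 1.0 -> the distributions never overlap
```

A low KS statistic is necessary but not sufficient: it compares one variable at a time, so correlations between fields (often the part that matters for security data) need additional checks.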
Here is a table summarizing some of what was covered in Part 1 of this series:

Data Hygiene – Key aspects: accuracy, completeness, consistency, timeliness, de-duplication. Why it matters: improves model accuracy, reduces bias, and saves time and resources.
Data Fidelity – Key aspects: accuracy, granularity, traceability. Why it matters: ensures models learn from accurate and detailed data, especially nuanced data.
Part 2 will cover the perils of poor data hygiene and how they negatively impact the ability to unlock artificial intelligence's potential and harness the power of pristine data.
From Indicators to Identity: A CISO's guide to identity risk intelligence and its role in disinformation security
The power of signals, or indicators, is evident to those who understand them. They are the basis for identity risk intelligence and its role in disinformation security. For years, cybersecurity teams have anchored their defenses on Indicators of Compromise (IOCs), such as IP addresses, domain names, and file hashes, to identify and neutralize threats.
Technical artifacts offer security value, but alone they’re weak against advanced threats. Attackers can seamlessly spoof their traffic sources and rapidly cycle through operational infrastructure. Malicious IP addresses change quickly, making reactive blocking continuously futile. Flagged IPs might be transient The Onion Router (Tor) nodes, not the actual attackers themselves. Similarly, the static nature of malware file hashes makes them susceptible to trivial alterations. Attackers can modify a file’s hash in mere seconds, effectively evading signature-based detection systems. The proliferation of polymorphic malware, which automatically changes its code after each execution, further exacerbates this issue, rendering traditional hash-based detection methods largely ineffective.
Cybersecurity teams that subscribe to voluminous threat intelligence feeds face an overwhelming influx of data, a substantial portion of which rapidly loses its relevance. These massive “blacklists” of IOCs quickly become outdated or irrelevant due to the ephemeral nature of attacker infrastructure and the ease of modifying malware signatures. This data overload presents a significant challenge for security analysts and operations teams, making it increasingly difficult to discern genuine threats from the surrounding noise and to construct effective proactive protective mechanisms. The overload obscures critical signals and renders traditional intelligence ineffective: it details attacks but often misses the responsible actor. Critically, it provides little to no insight into how to prevent similar attacks from occurring in the future.
The era of readily identifying malware before user execution is largely behind us. Contemporary security breaches frequently involve elements that traditional IOC feeds cannot reveal – most notably, compromised identities. Verizon’s 2024 Data Breach Investigations Report (DBIR) indicated that the use of stolen credentials has been a factor in nearly one-third (31%) of all breaches over the preceding decade (https://www.verizon.com/about/news/2024-data-breach-investigations-report-emea). This statistic is further underscored by Varonis’ 2024 research, which revealed that 57% of cyberattacks initiate with a compromised identity (https://www.varonis.com/blog/the-identity-crisis-research-report).
Essentially, attackers are increasingly opting to log in rather than hack in. These crafty adversaries exploit exposed valid username and password combinations, whether obtained through phishing campaigns, purchased on dark web marketplaces, or harvested from previous data breaches. With these compromised credentials, attackers can impersonate legitimate users and quietly bypass numerous security controls. This approach extends to authenticated session objects, effectively nullifying the security benefits of Multi-Factor Authentication (MFA) in certain scenarios. While many CISOs advocate for MFA as a panacea for various security challenges, the reality is that it does not address the fundamental risks associated with compromised identities. IOCs and traditional defenses miss attacks from seemingly legitimate, compromised users. This paradigm shift necessitates a proactive and forward-thinking approach to cybersecurity, leading strategists to pivot towards identity-centric cyber intelligence.
Identity intelligence shifts the focus from technical IOCs to monitoring digital entities. Instead of just blocking IPs, security teams now ask: “Which identities are compromised?” This evolved approach involves establishing connections between various signals, including usernames, email addresses, and even passwords, across a multitude of data breaches and leaks to construct a more comprehensive understanding of both risky identities and the threat actors employing them, along with their associated tactics. The volume of data analyzed directly determines this approach’s efficacy; more data leads to richer and more accurate intelligence. An unusual login can trigger a check for compromised credentials via identity intelligence, and the analysis can be enriched by examining historical data for patterns of misuse. Recurring patterns elevate anomalies to significant events, indicating broader attacks. This data correlation provides the contextual awareness that traditional intelligence lacks.
Fundamentally, identity signals play a crucial role in distinguishing legitimate users from imposters or synthetic identities operating within an environment. In an era characterized by remote and hybrid work models, widespread adoption of cloud services, and the ease of leveraging Virtual Private Network (VPN) services, attackers are increasingly attempting to create synthetic identities – fictitious users, IT personnel, or contractors – to infiltrate organizations. They may also target and compromise the identities of valid users within a given environment.
While traditional indicators like the source IP address of a login offer little value in determining whether a user truly exists within an organization’s Active Directory (AD) or whether that user is a genuine employee versus a fabricated identity, an identity-centric approach excels in this area. This excellence is achieved by meticulously analyzing multiple attributes associated with an identity, such as the employee’s email address, phone number, or other Personally Identifiable Information (PII), against extensive data stores of known breached data and fraudulent identities. Identity risk intelligence can unearth data on identities that simply appear risky. For example, if an email address with no prior legitimate online presence suddenly appears across numerous unrelated breach datasets, it could strongly suggest a synthetic profile.
Some advanced threat intelligence platforms now employ entity graphing to visually map and correlate these intricate and seemingly unrelated signals. Entity graphing involves constructing a network of relationships between various signals – connecting email addresses to passwords, passwords to specific data breaches, usernames to associated online personas, IP addresses to user accounts, and so forth. These interconnected graphs can become highly complex, yet they possess a remarkable ability to reveal hidden links that would remain invisible to a human analyst examining raw data.
An entity graph might reveal that a single Gmail address links multiple accounts across different companies and surfaces within criminal forums, strongly implicating a single threat actor who orchestrates activities across various environments. Often, these email addresses utilize convoluted strings for the username component to deliberately obfuscate the individual’s real name. By pivoting on identity-focused nodes within the graph, analysts can uncover associations between threat actors who employ obscure data points. The resulting intelligence is of high fidelity, sometimes pointing not merely to isolated threat artifacts but directly to the human adversary orchestrating a malicious campaign. This represents a new standard for threat intelligence, one where understanding the identity of the individual behind the keyboard is as critical as comprehending the specific Tactics, Techniques, and Procedures (TTPs) they employ.
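A toy sketch of the entity-graphing idea: union-find clustering over hypothetical signal pairs, collapsing linked identifiers into candidate threat-actor entities (real platforms use far richer graph models and fuzzy linkage, but the connectivity principle is the same):

```python
from collections import defaultdict

# Hypothetical signal pairs: each edge links two observed identifiers.
edges = [
    ("xk9q@gmail.com", "breach:acme-2023"),
    ("xk9q@gmail.com", "forum_user:sh4dowdev"),
    ("forum_user:sh4dowdev", "ip:203.0.113.7"),
    ("alice@example.com", "breach:retailco-2024"),  # unrelated identity
]

parent = {}

def find(x):
    """Union-find root lookup with path halving."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for a, b in edges:
    union(a, b)

# Group every signal under its cluster root: each cluster is a candidate entity.
clusters = defaultdict(set)
for node in parent:
    clusters[find(node)].add(node)

for members in clusters.values():
    print(sorted(members))
```

Running this links the Gmail address, the breach appearance, the forum persona, and the IP into one four-node entity, while the unrelated identity stays in its own cluster; pivoting on any node in the large cluster surfaces all the others.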
The power of analyzing signals for threat intelligence is not a new concept. For example, the NSA’s ThinThread project in the 1990s aimed to analyze massive amounts of phone and email metadata to identify potential threats (https://en.wikipedia.org/wiki/ThinThread). ThinThread was designed to sort through this data, encrypt US-related communications for privacy, and use automated systems to audit how analysts handled the information. By analyzing relationships between callers and their contacts, the system could identify potential threats, and only then would the data be decrypted for further analysis.
Despite rigorous testing and demonstrating superior data-sorting capabilities compared to existing systems, ThinThread was discontinued shortly before the 9/11 attacks. The core component of ThinThread, known as MAINWAY, which focused on analyzing communication patterns, was later deployed and became a key part of the NSA’s domestic surveillance program. This historical example illustrates the potential of analyzing seemingly disparate signals to gain critical insights into potential threats, a principle that underpins modern identity risk intelligence.
Real-World Example: North Korean IT Workers Using Disinformation/Synthetic Identities for Cyber Espionage
No recent event more clearly underscores the urgent need for identity-centric intelligence than the numerous documented cases of North Korean intelligence operatives nefariously infiltrating companies by masquerading as remote IT workers. While this scenario might initially sound like a plot from a Hollywood thriller, it is unfortunately a reality that many organizations have fallen victim to. Highly skilled agents from North Korea meticulously craft elaborate fake personas, complete with fabricated online presences, counterfeit resumes, stolen personal data, and even AI-generated profile pictures, all to secure employment at companies in the West. Once these operatives successfully gain employment, data exfiltration, or at the very least the attempt thereof, becomes virtually inevitable. In some particularly insidious cases, these malicious actors diligently perform the IT work they were hired to do, effectively keeping suspicions at bay for extended periods.
In 2024, U.S. investigators corroborated the widespread nature of this tactic, revealing compelling evidence that groups of North Korean nationals had fraudulently obtained employment with American companies by falsely presenting themselves as citizens of other countries (https://www.justice.gov/archives/opa/pr/fourteen-north-korean-nationals-indicted-carrying-out-multi-year-fraudulent-information). These operatives engaged in the creation of entirely synthetic identities to successfully navigate background checks and interviews. They acquired personal information, either by “borrowing” or purchasing it from real citizens, and presented themselves as proficient software developers or IT specialists available for remote work. In one particularly concerning confirmed case, a North Korean hacker secured a position as a software developer for a cybersecurity company by utilizing a stolen American identity further bolstered by an AI-generated profile photo – effectively deceiving both HR personnel and recruiters. This deceptive “employee” even successfully navigated multiple video interviews and passed typical scrutiny.
In certain instances, the malicious actors exhibited a lack of subtlety and wasted no time in engaging in harmful activities. Reports suggest that North Korean actors exfiltrated sensitive proprietary data within mere days of commencing employment. They often stole valuable source code and other confidential corporate information, which they then used for extortion. In one instance, KnowBe4, a security training firm, discovered that a newly hired engineer on their AI team was covertly downloading hacking tools onto the company network (https://www.knowbe4.com/press/knowbe4-issues-warning-to-organizations-after-hiring-fake-north-korean-employee). Investigators later identified this individual as a North Korean operative utilizing a fabricated identity, and proactive monitoring systems allowed them to apprehend him in time by detecting the suspicious activity.
For HR teams, CISOs, and CTOs alike, traditional security measures fail against these sophisticated insider threats. Early detection of synthetic insiders is crucial for preventing late-stage damage. This is precisely where the intrinsic value of identity risk intelligence becomes evident. By proactively incorporating identity risk signals early in the screening process, organizations can identify red flags indicating a potentially malicious imposter before they gain access to the internal network. For example, an identity-centric approach might have flagged the KnowBe4 hire as high-risk even before onboarding by uncovering inconsistencies or prior exposure of their personal data. Conversely, the complete absence of any historical data breaches associated with an identity could also be a suspicious indicator. Consider the types of disinformation security that identity intelligence enables:
Digital footprint verification – by leveraging extensive breach and darknet databases, security analysts and operators can thoroughly investigate whether a job applicant’s claimed identity has any prior history. If an email address or name appears exclusively in breach data associated with entirely different individuals, or if a supposed U.S.-based engineer’s records trace back to IP addresses in other countries, these discrepancies should immediately raise concerns. In the context of disinformation security, digital footprint verification helps to identify inconsistencies that suggest a fabricated identity used to spread false information or gain unauthorized access. Digital footprint analysis involves examining a user’s online presence across various platforms to verify the legitimacy of their identity. Inconsistencies or a complete lack of a genuine online presence can be indicative of a synthetic identity.
Proof of life or Synthetic identity detection – advanced platforms possess the capability to analyze combinations of PII to determine the chain of life, or the likelihood of an identity being genuine versus fabricated. For instance, if an individual’s social media presence is non-existent or their provided photo is identified as AI-generated (as was the case with the deceptive profile picture used by the hacker at KnowBe4), these are strong indicators of a synthetic persona. This is a critical aspect of disinformation security, as threat actors often use AI-generated profiles to create believable but fake identities for malicious purposes. AI algorithms and machine learning techniques play a crucial role in detecting these subtle anomalies within vast datasets. Behavioral biometrics, which analyzes unique user interaction patterns with devices, can further aid in distinguishing between genuine and synthetic identities.
Continuous identity monitoring – even after an individual is hired, the continuous monitoring of their activity and credentials can expose anomalies. For example, if a contractor’s account suddenly appears in a credential dump online, identity-focused alerts should immediately notify security teams. For disinformation security, this allows for the detection of compromised accounts that might be used to spread malicious content or propaganda.
These types of sophisticated disinformation campaigns underscore the critical importance of linking cyber threats to identity risk intelligence. Static IOCs would fail to reveal the inherent danger of a seemingly “normal” user account that happens to belong to a hostile actor. However, identity-centric analysis – meticulously vetting the true identity of an individual and determining whether their digital persona has any connections to known threat activity – can provide defenders with crucial early warnings before an attacker gains significant momentum.
This is threat attribution in action. By prioritizing identity signals, the attribution of suspicious activity to the actual threat actor becomes possible. The Lazarus Group, for instance, utilizes social engineering tactics on platforms like LinkedIn. Via LinkedIn, they distribute malware and steal credentials, highlighting the need for identity-focused monitoring even on professional networks. Similarly, APT29 (Cozy Bear) employs advanced spear-phishing campaigns, underscoring the importance of verifying the legitimacy of individuals and their digital footprints.
The Role of Identity Risk Intelligence in Strengthening Security Posture
To proactively defend against the evolving landscape of modern threats, organizations must embrace disinformation security strategies and seamlessly integrate identity-centric intelligence directly into their security operations. The core principle is to enrich every security decision with valuable context about identity risk. This means that whenever a security alert is triggered, or an access request is initiated, the security ecosystem should pose the additional critical question: “is this identity potentially compromised or fraudulent?”. By adopting this proactive approach, companies can transition from a reactive posture to a proactive one in mitigating threats:
Early compromised credential detection – imagine an employee’s credentials leaking in a third-party breach. Traditional security would miss this until active login attempts occur; identity risk intelligence alerts immediately upon detection in breaches or dark web dumps. This early warning allows the security team to take immediate and decisive action, such as forcing a password reset or invalidating active sessions. Integrating these timely identity risk signals into Security Information and Event Management (SIEM) and Security Orchestration, Automation and Response (SOAR) systems enables such alerts to trigger automated responses without requiring manual intervention. Taking this further, one can proactively enrich Single Sign-On (SSO) systems and web application authentication frameworks with real-time identity risk intelligence. The following table illustrates recent high-profile data breaches where compromised credentials played a significant role:
Table 1: Recent High-Profile Data Breaches Involving Compromised Credentials (2024-2025)
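The alert-to-response flow described above can be sketched as a minimal SOAR-style playbook. This is an illustrative sketch only: the alert fields, severity levels, and action names (`force_password_reset`, `revoke_sessions`, `notify_soc`) are hypothetical assumptions, not a specific SIEM or SOAR vendor’s API.

```python
# Hypothetical sketch of an automated SOAR response to a compromised-credential
# alert. The alert schema and action names are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class IdentityAlert:
    username: str
    source: str    # e.g. "darkweb_dump", "third_party_breach"
    severity: str  # "low" | "medium" | "high"


def handle_alert(alert: IdentityAlert) -> list[str]:
    """Return the automated actions a SOAR playbook might take for this alert."""
    actions: list[str] = []
    if alert.severity in ("medium", "high"):
        # Contain first: invalidate the credential and any live sessions.
        actions.append(f"force_password_reset:{alert.username}")
        actions.append(f"revoke_sessions:{alert.username}")
    if alert.severity == "high":
        # Escalate to the SOC for investigation.
        actions.append(f"notify_soc:{alert.username} ({alert.source})")
    return actions


print(handle_alert(IdentityAlert("jdoe", "darkweb_dump", "high")))
```

The key design point is that containment actions run before any human triage, shrinking the attacker’s usable window for the stolen credential.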
Identity risk posture for users – leading providers offer an “Identity Risk Posture” Application Programming Interface (API). This yields a categorized value representing the level of exposure or risk associated with a given identity. The score is derived from meticulous analysis of a vast amount of data about that identity across the digital landscape – for instance, the types of exposed attributes, the categories of breaches involved, and the recency of the data. A CISO’s team can strategically use such a posture value to prioritize decisions and security actions. For example, a Data Security Posture Management (DSPM) solution might identify a set of users with access to specific data resources. If the security team identifies any of those users as having a high-risk posture, they could take action, such as opening an investigation, mandating hardware MFA devices, or requiring more frequent, specialized security awareness training.
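The posture-to-action prioritization described above can be sketched in a few lines. The posture categories and the mapped actions here are assumptions for illustration; a real provider’s API would define its own scale and your team would define its own playbook.

```python
# Hypothetical sketch: mapping an identity risk posture value (assumed scale
# of "low" / "medium" / "high") to prioritized security actions.
POSTURE_ACTIONS = {
    "low": [],
    "medium": ["schedule_security_awareness_training"],
    "high": ["open_investigation", "require_hardware_mfa"],
}


def actions_for_users(user_postures: dict[str, str]) -> dict[str, list[str]]:
    """Return the follow-up actions for each user with access to sensitive data.

    Unknown posture values are treated conservatively: investigate.
    """
    return {
        user: POSTURE_ACTIONS.get(posture, ["open_investigation"])
        for user, posture in user_postures.items()
    }


# e.g. users a DSPM solution flagged as having access to a sensitive resource
print(actions_for_users({"alice": "low", "bob": "high"}))
```

Treating an unrecognized posture as high-risk by default is a deliberate fail-safe choice: a gap in intelligence should tighten controls, not loosen them.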
Threat attribution and hunting – identity-centric intelligence significantly empowers threat hunters to connect seemingly disparate signals, security events, and incidents. In the event of a phishing attack, a traditional response might conclude by simply blocking the sender’s email address and domain. However, incorporating identity data into the analysis might reveal that the phishing email address previously registered an account on a popular developer forum, and the username on that forum corresponds to a known alias of a specific cybercrime group. This enriched attribution helps establish a definitive link between attacks and specific threat actors or groups. Knowing precisely who is targeting your organization enables you to tailor your defenses and incident response processes more effectively. Moreover, a security team can then proactively hunt for specific traces within a given environment. This type of intelligence introduces a new dimension to threat attribution, transforming anonymous attacks into attributable actions by identifiable adversaries.
Integrating identity risk signals into security tools via API is a best practice. Effective solutions offer API access to vast identity intelligence datasets. These APIs provide real-time alerts and comprehensive risk posture data drawn from a large data lake of compromised identities and related data points (e.g., infostealer logs). Tailored intelligence feeds continuously deliver actionable data to security operations, enabling security teams to answer critical questions such as:
Which employee credentials have shown up in breaches, data leaks, and/or underground markets?
Is an executive’s personal email account being impersonated or misused?
Is an executive’s personal information being used to create synthetic, realistic-looking public email addresses?
Are there any fake social media profiles impersonating our brand or our employees?
These identity risk questions fall outside the scope of traditional network security. They bring crucial external insight – information about internet activity that could potentially threaten the organization – into internal defense processes.
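The first of the questions above, which employee credentials have surfaced in breaches or underground markets, amounts to matching an employee roster against an intelligence feed. The sketch below assumes a simple feed of `email`/`source` records; real vendor feeds have their own schemas and delivery mechanisms.

```python
# Hypothetical sketch: matching an employee roster against a compromised-
# credential feed. The feed format is an assumption, not a vendor schema.
def exposed_employees(
    roster: set[str], feed: list[dict[str, str]]
) -> dict[str, list[str]]:
    """Return employees whose credentials appear in the feed, with sources."""
    hits: dict[str, list[str]] = {}
    for record in feed:
        email = record["email"].lower()
        if email in roster:
            hits.setdefault(email, []).append(record["source"])
    return hits


feed = [
    {"email": "jdoe@example.com", "source": "third-party breach"},
    {"email": "outsider@example.net", "source": "infostealer log"},
]
print(exposed_employees({"jdoe@example.com"}, feed))
```

In practice this matching is done by the provider against hashed or tokenized identifiers rather than a raw roster, but the question being answered is the same.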
Furthermore, identity-centric digital risk intelligence significantly strengthens an organization’s ability to progress towards a Zero Trust (ZT) security posture. ZT security models operate on the fundamental principle of “never trust, always verify” – particularly as it relates to user identities. Real-time information about a user’s identity compromise allows the system to dynamically adjust trust levels. For example, if an administrator account’s risk posture rapidly changes from low to high, the system can require re-authentication until the change is investigated and resolved. This dynamic, adaptive response dramatically reduces the attacker’s window of opportunity: stolen credentials and fake identities are intercepted proactively rather than discovered after a breach.
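The dynamic trust adjustment described above can be sketched as a small policy hook. The posture levels and the returned requirements (`reauthenticate_with_mfa`, `block_pending_investigation`, `standard_session`) are illustrative names, not part of any real Zero Trust product.

```python
# Hypothetical sketch: a Zero Trust policy hook that tightens authentication
# requirements when an identity's risk posture changes. Names are illustrative.
def required_auth(previous_posture: str, current_posture: str) -> str:
    """Decide the authentication requirement after a posture change."""
    order = {"low": 0, "medium": 1, "high": 2}
    if order[current_posture] > order[previous_posture]:
        # Risk increased: force step-up re-authentication until resolved.
        return "reauthenticate_with_mfa"
    if current_posture == "high":
        # Persistently high risk: hold access pending investigation.
        return "block_pending_investigation"
    return "standard_session"


print(required_auth("low", "high"))  # an admin account's risk just spiked
```

The point is that trust is recomputed per decision from a live external signal, rather than granted once at login, which is the essence of “never trust, always verify.”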
Embracing Identity-Centric Intelligence: A Call to Action
The landscape of cyber threats is in a constant state of evolution, and our defenses must adapt accordingly. IOCs alone fail against modern attackers; identity-focused threats demand stronger protection. For CIOs, CISOs, and CTOs, identity-centric intelligence is now a critical strategic necessity, as is understanding identity risk intelligence and its role in disinformation security. This shift does not necessitate abandoning your existing suite of security tools; rather, it involves empowering them, where appropriate, with richer context and identity risk intelligence signals.
By seamlessly integrating identity risk data into every aspect of security operations, from authentication workflows to incident response protocols, security teams gain holistic visibility into an attack, moving beyond fragmented views. Threat attribution capabilities become significantly enhanced, as cybersecurity teams can more accurately pinpoint who is targeting their organization. Identifying compromised credentials or accounts speeds incident response, enabling faster breach containment. Ultimately, an organization can adopt both proactive defense and disinformation security strategies.
Several key questions warrant honest and critical consideration:
How well do we truly know our users and their associated identities?
How quickly can we detect an adversary if they were operating covertly amongst our legitimate users?
If either of these questions elicits uncertainty, it is time to rigorously evaluate how identity risk intelligence can effectively bridge that critical gap. I recommend you begin by exploring solutions that aggregate breach data and provide actionable insights, such as a comprehensive risk score or posture, which your current security ecosystem can seamlessly leverage.
Identity-centric intelligence is vital against sophisticated attacks, surpassing traditional methods for breach detection. By viewing identity risk holistically and moving beyond basic IOCs, CISOs enhance breach prevention. North Korean attacks and recent data breaches highlight the urgent need for identity-focused security. Implementing identity risk intelligence, entity graphing, and Zero Trust yields a proactive, resilient security posture. Organizations that understand and secure identities are equipped to navigate complex future threats effectively. Fundamentally, this requires understanding identity risk intelligence and its role in disinformation security.