Understanding XML File Structure and its Risks

5/28/2025 - Brian O'Neill

XML: The Backbone of Structured Data

XML Basics

Extensible Markup Language (XML) is a foundational data format in modern software. It underpins everything from Microsoft Office documents to application configuration files, Application Programming Interfaces (APIs), web services, and even firmware packages.

Much like JavaScript Object Notation (JSON), XML is used to store and transport data in a platform-agnostic way, and it has remained a standard tool for that purpose across enterprise and developer ecosystems for decades.

XML Security Concerns

XML’s flexibility as a document – along with its ability to encode highly structured data – also makes it a very attractive target for threat actors. Here at Cloudmersive, our core mission is to protect enterprise systems from complex, highly nuanced threat vectors, and communicating the inherent insecurity of XML structure is an essential part of that.

Article Overview

In this article, we’ll provide a detailed overview of what XML files are, where they came from, and how they’re structured. We’ll then review some different ways threat actors weaponize XML documents to breach enterprise systems. At the end, we’ll discuss how Cloudmersive’s Advanced Virus Scan API uses deep content verification capabilities to identify and block malicious XML documents before they can compromise enterprise systems.

What is XML? Understanding the Extensible Markup Concept

XML was designed in the mid-1990s as a simplified and more flexible subset of Standard Generalized Markup Language (SGML), and it was published as a W3C Recommendation in 1998. Up to that point, SGML was chiefly used for defining complex document structures in publishing, technical manuals, and enterprise data exchange.

XML was (and continues to be) human-readable, machine-parseable, and platform-independent. It quickly became the de facto standard for data exchange between different systems, and eventually the basis for several complex document formats.

XML offered self-describing data: tags explicitly identify each data field, unlike CSV or fixed-width formats. It also delivered on the promise of its name by offering excellent extensibility, which allowed developers to define their own tag schemes via Document Type Definitions (DTDs) or XML Schema Definitions (XSDs). It supported complex hierarchies, which were ideal for representing nested data structures such as configuration trees, invoices, and User Interface (UI) elements in an application. XML parsers were readily available for numerous programming languages near the time of its release, and updated versions of these parsers are widely available today in popular enterprise programming languages like Java, C#, and Python.

The anatomy of a basic XML file uses the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<library>
  <book id="101">
    <title>Securing XML</title>
    <author>Jane Doe</author>
    <year>2023</year>
  </book>
</library>
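To show how an application typically consumes this structure, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. The function name read_books and the LIBRARY_XML constant are illustrative choices, not part of any particular product, and the sketch assumes the input is trusted.

```python
import xml.etree.ElementTree as ET

# The sample document from above, stored as a string for illustration.
LIBRARY_XML = """<?xml version="1.0" encoding="UTF-8"?>
<library>
  <book id="101">
    <title>Securing XML</title>
    <author>Jane Doe</author>
    <year>2023</year>
  </book>
</library>"""


def read_books(xml_text: str) -> list[dict]:
    """Return one dict per <book> element (hypothetical helper)."""
    # Encode to bytes so the parser honors the encoding declaration.
    root = ET.fromstring(xml_text.encode("utf-8"))
    books = []
    for book in root.findall("book"):
        books.append({
            "id": book.get("id"),              # attribute value
            "title": book.findtext("title"),   # child element text
            "author": book.findtext("author"),
            "year": int(book.findtext("year")),
        })
    return books


if __name__ == "__main__":
    print(read_books(LIBRARY_XML))
```

Attributes (like id) and child elements (like title) are read through different calls, which mirrors the distinction the format itself draws between the two.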

This structure appears quite basic at first glance, but its simplicity is deceptive. XML documents can carry directives that cause the parser, or the application consuming the parsed data, to take powerful dynamic actions.

That includes referencing external entities:

<?xml version="1.0"?>
<!DOCTYPE data [
  <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<data>&xxe;</data>

Linking to specific schemas:

<note xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="http://malicious.example.com/schema.xsd">
  <to>User</to>
  <from>Attacker</from>
  <message>Check this out!</message>
</note>

And even embedding scripts or code to interact with the systems parsing the XML file:

<script>
  <![CDATA[
    alert('This is a script!');
  ]]>
</script>

This is where the security issues with XML become apparent. XML <!ENTITY> declarations can load external files or URLs, any of which can be compromised and difficult to trace. xsi:schemaLocation and xsi:noNamespaceSchemaLocation attributes can point a validating parser at external schemas, opening the door to malicious payloads or data exfiltration during schema resolution. Deeply nested elements or entities in XML files can overwhelm a server’s memory, resulting in widespread service outages.
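As a rough illustration of how schema hints can be surfaced before a document reaches a validating parser, the sketch below lists every schema location a document declares. It uses Python's xml.etree.ElementTree, which does not fetch or apply schemas itself; the function names find_schema_hints and remote_hints are hypothetical, and the sketch assumes DTDs have already been rejected upstream.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"


def find_schema_hints(xml_text: str) -> list[str]:
    """Collect every schema location hinted at by the document."""
    hints = []
    for elem in ET.fromstring(xml_text).iter():
        no_ns = elem.get(f"{{{XSI}}}noNamespaceSchemaLocation")
        if no_ns:
            hints.append(no_ns)
        pairs = elem.get(f"{{{XSI}}}schemaLocation")
        if pairs:
            # schemaLocation holds whitespace-separated
            # "namespace location" pairs; keep the locations.
            hints.extend(pairs.split()[1::2])
    return hints


def remote_hints(xml_text: str) -> list[str]:
    """Keep only hints that point at a network location."""
    schemes = ("http://", "https://", "ftp://")
    return [h for h in find_schema_hints(xml_text)
            if h.lower().startswith(schemes)]


NOTE_XML = """<note xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="http://malicious.example.com/schema.xsd">
  <to>User</to>
</note>"""

if __name__ == "__main__":
    print(remote_hints(NOTE_XML))
```

An application could log or reject documents for which remote_hints returns anything, rather than letting a downstream validator decide whether to fetch the schema.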

As we alluded to earlier, many common and complex file types embed XML internally as part of a hybrid file structure, including all Office Open XML (OOXML) files (.docx, .xlsx, .pptx), Scalable Vector Graphics (SVG), AndroidManifest.xml files (core files in Android apps), and several others. That leaves many of these files vulnerable to XML-based exploits, too.
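Because OOXML files are ZIP archives, the embedded XML can be enumerated with ordinary archive tooling. The following is a minimal sketch using Python's zipfile module; list_xml_parts and build_sample_container are hypothetical helpers, and the sample archive is a stand-in that only mimics the layout of a .docx file.

```python
import io
import zipfile


def list_xml_parts(container: bytes) -> list[str]:
    """Return the names of XML parts inside a ZIP-based document."""
    with zipfile.ZipFile(io.BytesIO(container)) as archive:
        return sorted(
            name for name in archive.namelist()
            # .rels relationship parts are XML too.
            if name.lower().endswith((".xml", ".rels"))
        )


def build_sample_container() -> bytes:
    """Build a tiny in-memory archive shaped like a .docx (illustrative only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("_rels/.rels", "<Relationships/>")
        archive.writestr("word/document.xml", "<document/>")
        archive.writestr("word/media/image1.png", b"\x89PNG")
    return buffer.getvalue()


if __name__ == "__main__":
    for part in list_xml_parts(build_sample_container()):
        print(part)
```

Each part returned this way is a separate XML document, so any XML-level check has to be applied to every part, not just the main document body.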

Common Attack Vectors in XML

Below, we’ll cover some of the most common attack vectors driven by XML document structure.

XML External Entity (XXE) Injection

XXE attacks are among the most common and well-known XML-based attacks today. They occur when threat actors define malicious external entities (local file paths or remote URLs, for example) in an XML document’s entity declarations. If an enterprise application’s XML parser resolves those entities during processing, an attacker can exfiltrate sensitive documents or gain enterprise network access.
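One common mitigation is to refuse any document that carries a DOCTYPE declaration, since entity declarations can only appear inside one. The sketch below shows that idea with Python's xml.parsers.expat module; the names DtdForbidden and assert_no_dtd are assumptions made for illustration, and this is one possible approach rather than a complete defense.

```python
import xml.parsers.expat


class DtdForbidden(ValueError):
    """Raised when a document carries a DOCTYPE declaration."""


def assert_no_dtd(xml_bytes: bytes) -> None:
    """Parse the document and fail fast if a DOCTYPE is present.

    Malformed XML still raises xml.parsers.expat.ExpatError.
    """
    parser = xml.parsers.expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Abort before any entity declaration is processed.
        raise DtdForbidden(f"DOCTYPE declaration found: {name!r}")

    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_bytes, True)


XXE_SAMPLE = b"""<?xml version="1.0"?>
<!DOCTYPE data [
  <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<data>&xxe;</data>"""

if __name__ == "__main__":
    try:
        assert_no_dtd(XXE_SAMPLE)
    except DtdForbidden as exc:
        print("rejected:", exc)
```

Rejecting the DOCTYPE outright is stricter than disabling external entity resolution, but it is simple to reason about for applications that never need DTDs.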

CVE-2019-0228 is one example of a textbook XXE vulnerability. In this case, Apache PDFBox version 2.0.14 did not properly initialize its XML parser, which allowed context-dependent attackers to conduct XXE attacks with specially crafted XFDF documents. Attackers could trick the application’s parser into exposing sensitive information from the underlying system.

Another more recent example is CVE-2023-20052. In this case, an XXE vulnerability was found in the DMG file parser of ClamAV (an open-source antivirus engine predominantly used on mail servers and gateways). By submitting a crafted DMG file for scanning, a remote attacker could leak bytes from any file readable by the ClamAV scanning process. Depending on what those files contain, that kind of exposure can give an attacker the foothold needed for a deeper compromise of an enterprise network.

The Open Worldwide Application Security Project (OWASP) Foundation listed XXE as its own category in the 2017 edition of its Top 10 Web Application Security Risks ranking. In the 2021 edition, XXE was folded into the broader “Security Misconfiguration” category.

Billion Laughs Attacks (XML Bombs)

Attackers can craft XML entities in such a way that they expand exponentially once parsed, consuming memory on the target system and eventually crashing the XML parser. This is typically accomplished by nesting entity references so that each entity expands into several copies of the one defined before it.

This form of attack is referred to as a “billion laughs attack” because the original iteration used XML entities named &lol;, &lol1;, etc. which expanded into the string “lol” a billion times. Attackers can use any entity names they want to achieve the same results.
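The arithmetic behind the name can be worked through without ever expanding a payload. The sketch below builds the text of a nested-entity document and computes its fully expanded size; build_bomb and expanded_length are hypothetical helper names, and the code only generates and measures strings, it never parses them.

```python
def build_bomb(levels: int, fanout: int) -> str:
    """Return the text of a nested-entity document (never parsed here)."""
    lines = [
        '<?xml version="1.0"?>',
        "<!DOCTYPE lolz [",
        '  <!ENTITY lol "lol">',
    ]
    previous = "lol"
    for level in range(1, levels + 1):
        name = f"lol{level}"
        # Each entity is `fanout` references to the entity before it.
        body = f"&{previous};" * fanout
        lines.append(f'  <!ENTITY {name} "{body}">')
        previous = name
    lines.append("]>")
    lines.append(f"<lolz>&{previous};</lolz>")
    return "\n".join(lines)


def expanded_length(levels: int, fanout: int, base: str = "lol") -> int:
    """Characters produced once every entity is fully expanded."""
    return len(base) * fanout ** levels


if __name__ == "__main__":
    document = build_bomb(levels=9, fanout=10)
    print("document size:", len(document), "characters")
    print("expanded size:", expanded_length(9, 10), "characters")
```

With nine levels of ten references each, a document of under 1,000 characters expands to 10^9 copies of “lol”, or 3,000,000,000 characters. This is why parser-side limits on entity expansion matter more than limits on upload size.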

Embedded Scripts

Threat actors can abuse regular XML documents and hybrid formats (like .docx, for example) to hide malicious scripts or code. Payloads can be hidden within XML tags, relying on downstream engines (web apps, office software, or custom XML parsers) to interpret and execute them. This danger is particularly prevalent in environments where XSLT scripting, macros, or embedded objects are supported.

CVE-2021-43818 is a good real-world example of this danger. This vulnerability affected the lxml Python library. Before version 4.6.5, the HTML Cleaner in lxml.html allowed some types of crafted script content to pass through, including script content in SVG files (XML-based vector graphics files) embedded using data URIs. Attackers could exploit this vulnerability by injecting malicious scripts into applications which relied on lxml for sanitizing HTML content. This could result in cross-site scripting (XSS) attacks.
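To make the idea of script content inside SVG concrete, here is a coarse heuristic scan written with Python's xml.etree.ElementTree. It is not the lxml fix and not a sanitizer; find_active_content is a hypothetical name, the checks are deliberately simple, and the sketch assumes DTDs have already been rejected before parsing.

```python
import xml.etree.ElementTree as ET


def find_active_content(svg_text: str) -> list[str]:
    """Report script elements, event handlers, and risky href schemes."""
    findings = []
    for elem in ET.fromstring(svg_text).iter():
        tag = elem.tag.rsplit("}", 1)[-1].lower()  # drop the namespace
        if tag == "script":
            findings.append("script element")
        for name, value in elem.attrib.items():
            local = name.rsplit("}", 1)[-1].lower()
            if local.startswith("on"):
                # Coarse match for onload, onclick, and similar.
                findings.append(f"event handler attribute: {local}")
            elif local == "href" and value.strip().lower().startswith(
                ("javascript:", "data:")
            ):
                findings.append(f"suspicious href scheme on <{tag}>")
    return findings


SVG_SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink" onload="alert(1)">
  <script>alert('x')</script>
  <a xlink:href="javascript:alert(2)"><text>click</text></a>
</svg>"""

if __name__ == "__main__":
    for finding in find_active_content(SVG_SAMPLE):
        print(finding)
```

A scan like this only flags the obvious cases. Obfuscated payloads are the reason allow-list sanitizers and content verification tools exist.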

Misuse in API Requests

Most of the threats we’ve described to this point are related to XML in document form. It’s important to bear in mind that APIs consume XML data directly from users, too. Without proper validation, insecure API data consumption can open the door for all kinds of injection attacks.
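As a sketch of what validation might look like for an XML request body, the example below applies a size cap, a coarse DTD pre-filter, and an allow-list of expected elements. The endpoint shape (a single book element), the 64 KB limit, and the function name validate_book_request are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

MAX_BYTES = 64 * 1024  # hypothetical request-size cap
REQUIRED_FIELDS = {"title", "author", "year"}


def validate_book_request(body: bytes) -> dict:
    """Return the validated fields or raise ValueError."""
    if len(body) > MAX_BYTES:
        raise ValueError("payload too large")
    # Coarse pre-filter: this API has no use for DTDs or entities.
    if b"<!DOCTYPE" in body or b"<!ENTITY" in body:
        raise ValueError("DTD declarations are not accepted")
    # Force UTF-8 so the byte-level check above matches what is parsed.
    parser = ET.XMLParser(encoding="utf-8")
    root = ET.fromstring(body, parser=parser)
    if root.tag != "book":
        raise ValueError(f"unexpected root element: {root.tag!r}")
    fields = {}
    for child in root:
        # Reject unknown elements and unexpected nesting.
        if child.tag not in REQUIRED_FIELDS or len(child) > 0:
            raise ValueError(f"unexpected element: {child.tag!r}")
        fields[child.tag] = (child.text or "").strip()
    if set(fields) != REQUIRED_FIELDS:
        raise ValueError("missing required fields")
    if not fields["year"].isdigit():
        raise ValueError("year must be numeric")
    return fields


if __name__ == "__main__":
    sample = (b"<book><title>Securing XML</title>"
              b"<author>Jane Doe</author><year>2023</year></book>")
    print(validate_book_request(sample))
```

Validating against an explicit allow-list means unexpected elements are rejected by default, instead of being passed along to whatever code consumes the parsed data.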

How Cloudmersive Detects XML-Based Threats

Cloudmersive’s Advanced Virus Scan API detects XML-based threats via deep content verification. This entails unpacking XML documents and unzipping hybrid XML-based formats to uncover obfuscated threats buried deep within the file structure. Scripts, external entity references, and entities with other dangerous behaviors are identified with a combination of behavioral analysis and signature-based scanning techniques.

The Advanced Virus Scan API can be integrated directly into web applications with minimal code changes, and it can be deployed as a no-code solution in defense of enterprise network entry points.

Conclusion

XML is a powerful, complex data structure which invites risk in myriad forms. Understanding how structured XML-based files can be weaponized is the first step toward defending any application stack from harm.

To learn more about defending against XML-based threats with Cloudmersive’s Advanced Virus Scan API, please reach out to a member of our team.
