Understanding XML File Structure and its Risks
5/28/2025 - Brian O'Neill
XML: The Backbone of Structured Data

XML Basics
Extensible Markup Language (XML) is a fundamental data format in modern software. It underpins everything from MS Office documents to application configuration files, Application Programming Interfaces (APIs), web services, and even firmware packages. Much like JavaScript Object Notation (JSON), XML is used to store and transport data in a platform-agnostic way, and it has remained a standard tool for that purpose across enterprise and developer ecosystems for decades.

XML Security Concerns
XML’s flexibility as a document format, along with its ability to encode highly structured data, also makes it a very attractive target for threat actors. Here at Cloudmersive, our core mission is to protect enterprise systems from complex, highly nuanced threat vectors, and communicating the security risks inherent in XML structure is an essential part of that.

Article Overview
In this article, we’ll provide a detailed overview of what XML files are, where they came from, and how they’re structured. We’ll then review some of the ways threat actors weaponize XML documents to breach enterprise systems. At the end, we’ll discuss how Cloudmersive’s Advanced Virus Scan API uses deep content verification capabilities to identify and block malicious XML documents before they can compromise enterprise systems.

What is XML? Understanding the Extensible Markup Concept
XML was originally designed in the mid-1990s as a simplified and more flexible alternative to Standard Generalized Markup Language (SGML). Up to the point of XML’s formal introduction in 1998, SGML was chiefly used for defining complex document structures in publishing, technical manuals, and enterprise data exchange. XML was (and continues to be) human-readable, machine-parseable, and platform independent.
This led to it quickly becoming the de facto standard for data exchange between different systems, and eventually the basis for several complex document formats. XML offered self-describing data: its tags explicitly identify data fields, unlike CSV or fixed-width formats. It also delivered on the promise of its name by offering excellent extensibility, allowing developers to define their own tag schemes via Document Type Definitions (DTDs) or XML Schema Definitions (XSDs). It supported complex hierarchies, which were ideal for representing nested data structures, from configuration trees to invoices and User Interface (UI) elements in an application. XML parsers became readily available for numerous programming languages near the time of its release, and updated versions of these parsers are widely available today for popular enterprise programming languages like Java, C#, Python, and more.

The anatomy of a basic XML file uses the following structure:
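As a minimal sketch, the Python snippet below embeds a small XML document and walks it with the standard library's xml.etree.ElementTree module. The element and attribute names (order, customer, item, sku, quantity) and the helper summarize_order are hypothetical, chosen only to show an XML declaration, a single root element, nested children, attributes, and text content.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: an XML declaration, a single root element,
# nested child elements, attributes, and text content.
BASIC_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<order id="1001">
  <customer>Example Corp</customer>
  <items>
    <item sku="A-1" quantity="2">Widget</item>
    <item sku="B-7" quantity="1">Gadget</item>
  </items>
</order>
"""


def summarize_order(xml_bytes):
    """Parse the document and return its main parts as plain Python values."""
    root = ET.fromstring(xml_bytes)
    items = [
        (item.get("sku"), int(item.get("quantity")), item.text)
        for item in root.findall("./items/item")
    ]
    return {
        "root_tag": root.tag,              # name of the single root element
        "order_id": root.get("id"),        # attribute on the root element
        "customer": root.findtext("customer"),  # text of a child element
        "items": items,                    # nested, repeated elements
    }


if __name__ == "__main__":
    print(summarize_order(BASIC_XML))
```

Every well-formed XML document has exactly one root element; everything else nests inside it, which is what makes the tree walk above possible.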
This structure appears quite basic at first glance, but its simplicity is deceptive. Advanced XML documents can perform powerful dynamic actions. That includes referencing external entities:
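To illustrate, the hedged sketch below shows a hypothetical document whose DOCTYPE declares one internal entity and one external entity, and uses Python's xml.parsers.expat module to list those declarations without resolving them. The entity names, the file path, and the helper list_entity_declarations are assumptions made for this example.

```python
import xml.parsers.expat

# Hypothetical document. The DOCTYPE's internal subset declares one internal
# entity (replacement text stored in the document itself) and one external
# entity (replacement text to be fetched from a SYSTEM identifier).
# A hostile document would reference &secret; in its content instead.
ENTITY_SAMPLE = b"""<?xml version="1.0"?>
<!DOCTYPE note [
  <!ENTITY author "Example Author">
  <!ENTITY secret SYSTEM "file:///etc/passwd">
]>
<note>&author;</note>
"""


def list_entity_declarations(xml_bytes):
    """Return (name, kind, target) for each general entity the DTD declares."""
    declarations = []

    def on_entity_decl(name, is_parameter, value, base,
                       system_id, public_id, notation_name):
        if is_parameter:
            return  # parameter entities are out of scope for this sketch
        if system_id is not None:
            declarations.append((name, "external", system_id))
        else:
            declarations.append((name, "internal", value))

    parser = xml.parsers.expat.ParserCreate()
    parser.EntityDeclHandler = on_entity_decl
    parser.Parse(xml_bytes, True)  # True marks this as the final chunk
    return declarations


if __name__ == "__main__":
    for declaration in list_entity_declarations(ENTITY_SAMPLE):
        print(declaration)
```

The external declaration is the interesting one: a parser configured to resolve it would read whatever the SYSTEM identifier points to and splice it into the document wherever the entity is referenced.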
Linking to specific schemas:
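One common way to link a document to a schema is the xsi:noNamespaceSchemaLocation (or xsi:schemaLocation) attribute from the XML Schema instance namespace. The sketch below reads that hint with xml.etree.ElementTree; the invoice document, the example.com URL, and the helper schema_hints are placeholders invented for illustration.

```python
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical document; the schema URL is a placeholder, not a real resource.
SCHEMA_LINKED_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<invoice xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="https://example.com/schemas/invoice.xsd">
  <total currency="USD">42.00</total>
</invoice>
"""


def schema_hints(xml_bytes):
    """Return the schema location hints found on the root element, if any."""
    root = ET.fromstring(xml_bytes)
    # ElementTree spells namespaced attribute names as "{namespace}local".
    return {
        "noNamespaceSchemaLocation": root.get(
            f"{{{XSI_NS}}}noNamespaceSchemaLocation"
        ),
        "schemaLocation": root.get(f"{{{XSI_NS}}}schemaLocation"),
    }


if __name__ == "__main__":
    print(schema_hints(SCHEMA_LINKED_XML))
```

ElementTree only reports the attribute; it does not fetch or apply the schema. A validating parser that follows such hints automatically would make an outbound request to a location chosen by whoever authored the document.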
And even embedding scripts or code to interact with the systems parsing the XML file:
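SVG is a convenient illustration, since it is an XML format that permits script elements and event-handler attributes. The sketch below uses a hypothetical SVG document with a harmless alert() payload and a simple walker that flags both constructs; the helper names local_name and find_active_content are assumptions for this example, and the check is far from a complete detector.

```python
import xml.etree.ElementTree as ET

# Hypothetical SVG carrying two kinds of active content: an event-handler
# attribute and a <script> element. The payloads are harmless placeholders.
SVG_WITH_SCRIPT = b"""<?xml version="1.0" encoding="UTF-8"?>
<svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">
  <rect width="10" height="10" onclick="alert('clicked')"/>
  <script type="text/javascript"><![CDATA[
    alert('hypothetical payload');
  ]]></script>
</svg>
"""


def local_name(tag):
    """Strip the "{namespace}" prefix ElementTree adds to qualified names."""
    return tag.rsplit("}", 1)[-1]


def find_active_content(xml_bytes):
    """Return findings for script elements and on* event attributes."""
    findings = []
    root = ET.fromstring(xml_bytes)
    for element in root.iter():  # document order, root included
        name = local_name(element.tag)
        if name.lower() == "script":
            findings.append(("script-element", name))
        for attr in element.attrib:
            if local_name(attr).lower().startswith("on"):
                findings.append(
                    ("event-attribute", f"{name}@{local_name(attr)}")
                )
    return findings


if __name__ == "__main__":
    print(find_active_content(SVG_WITH_SCRIPT))
```

Whether such content ever runs depends entirely on the consumer: an XML parser treats it as inert text, while a browser rendering the SVG inline would execute it.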
This is where the security issues with XML become prevalent. As we alluded to earlier, many common and complex file types embed XML internally as part of a hybrid file structure, including all Office Open XML (OOXML) files (.docx, .xlsx, .pptx), Scalable Vector Graphics (SVG), Android manifests (AndroidManifest.xml, a core file in Android apps), and several others. That leaves many of these files vulnerable to XML-based exploits, too.

Common Attack Vectors in XML
Below, we’ll cover some of the most common attack vectors driven by XML document structure.

XML External Entity (XXE) Injection
XXE attacks are among the most common and well-known XML-based attacks today. They occur when threat actors define malicious external entities, such as local file paths or remote URLs, in an XML document’s entity declarations. If an enterprise application’s XML parser resolves those entities during processing, an attacker can exfiltrate sensitive documents or gain enterprise network access.

CVE-2019-0228 is one example of a textbook XXE vulnerability. In this case, Apache PDFBox 2.0.14 did not properly initialize its XML parser, which allowed context-dependent attackers to conduct XXE attacks with specially crafted XFDF documents. Attackers could trick the application parser into exposing sensitive information from the underlying system.

Another more recent example is CVE-2023-20052. In this case, an XXE vulnerability was found in the DMG file parser of ClamAV (an open-source antivirus engine predominantly used on mail servers and gateways), which could allow a remote attacker to read sensitive information from an affected device. Where an XXE flaw can be chained into malicious code execution, the consequences for enterprise networks can be severe.
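One widely used mitigation is to refuse any untrusted document that carries a DOCTYPE declaration at all, since entity declarations can only appear inside one. The sketch below is a minimal illustration of that idea using only the Python standard library; the names DoctypeForbidden and parse_untrusted and both sample documents are hypothetical, and the approach assumes the application has no legitimate need for DTDs.

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat


class DoctypeForbidden(ValueError):
    """Raised when an untrusted document carries a DOCTYPE declaration."""


def parse_untrusted(xml_bytes):
    """Parse XML only if it contains no DOCTYPE (and therefore no entities)."""

    def forbid_doctype(name, system_id, public_id, has_internal_subset):
        raise DoctypeForbidden(f"DOCTYPE declaration not allowed: {name!r}")

    # First pass: a bare expat parser whose only job is to abort on a DOCTYPE.
    # Malformed input surfaces here as xml.parsers.expat.ExpatError.
    checker = xml.parsers.expat.ParserCreate()
    checker.StartDoctypeDeclHandler = forbid_doctype
    checker.Parse(xml_bytes, True)

    # Second pass: build the element tree from the vetted bytes.
    return ET.fromstring(xml_bytes)


# Hypothetical inputs for demonstration.
SAFE_SAMPLE = b"<note>hello</note>"
XXE_SAMPLE = (
    b'<?xml version="1.0"?>'
    b'<!DOCTYPE note [<!ENTITY xxe SYSTEM "file:///etc/passwd">]>'
    b"<note>&xxe;</note>"
)

if __name__ == "__main__":
    print(parse_untrusted(SAFE_SAMPLE).text)
    try:
        parse_untrusted(XXE_SAMPLE)
    except DoctypeForbidden as exc:
        print("rejected:", exc)
```

Parsing twice costs some throughput, but it keeps the policy check separate from tree construction. Because the check rejects every DTD, it also turns away documents that rely on nested internal entities rather than external ones.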
The Open Worldwide Application Security Project (OWASP) Foundation has repeatedly featured XXE in its Top 10 Web Application Security Risks ranking (XXE was its own category in the 2017 list and was folded into “Security Misconfiguration” in the 2021 list).

Billion Laughs Attacks (XML Bombs)
Attackers can craft XML entities in such a way that they expand exponentially once parsed, consuming memory on the target system and eventually crashing the XML parser. This is typically accomplished by nesting entity references so that each entity expands into several copies of the one before it. This form of attack is referred to as a “billion laughs attack” because the original iteration used XML entities named “lol”, “lol1”, “lol2”, and so on.

Embedded Scripts
Threat actors can abuse regular XML documents and hybrid formats (like .docx, for example) to hide malicious scripts or code. Payloads can be hidden within XML tags, relying on downstream engines (web apps, office software, or custom XML parsers) to interpret and execute them. This danger is particularly prevalent in environments where XSLT scripting, macros, or embedded objects are supported. CVE-2021-43818 is a good real-world example of this danger. This vulnerability affected lxml, a Python library for processing XML and HTML: prior to version 4.6.5, its HTML Cleaner let certain crafted script content pass through, including script content in SVG files embedded via data URIs.

Misuse in API Requests
Most of the threats we’ve described to this point are related to XML in document form. It’s important to bear in mind that APIs consume XML data directly from users, too. Without proper validation, insecure API data consumption can open the door to all kinds of injection attacks.

How Cloudmersive Detects XML-Based Threats
Cloudmersive’s Advanced Virus Scan API detects XML-based threats via deep content verification. This entails unpacking XML documents and unzipping hybrid XML-based formats to uncover obfuscated threats buried deep within the file structure. Scripts, external entity references, and entities with other dangerous behaviors are identified with a combination of behavioral analysis and signature-based scanning techniques.
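The general idea of unpacking a hybrid format before inspecting it can be shown with a deliberately simple sketch. The code below is not Cloudmersive's implementation and does not call its API; it is a hypothetical illustration that opens a ZIP-based package in memory and flags XML parts containing a few byte-level markers. The pattern set, the helper names, and the sample package are all assumptions.

```python
import io
import re
import zipfile

# Simple byte patterns for a few markers. This is illustrative only: it
# ignores UTF-16 encoded parts and places no limit on decompressed size.
SUSPICIOUS_PATTERNS = {
    "doctype": re.compile(rb"<!DOCTYPE", re.IGNORECASE),
    "entity-declaration": re.compile(rb"<!ENTITY", re.IGNORECASE),
    "script-element": re.compile(rb"<(?:\w+:)?script\b", re.IGNORECASE),
}


def scan_package(package_bytes):
    """Return sorted (member_name, marker) pairs for XML parts with markers."""
    findings = []
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        for info in archive.infolist():
            if not info.filename.lower().endswith((".xml", ".rels")):
                continue  # only inspect the XML parts of the package
            data = archive.read(info)
            for marker, pattern in SUSPICIOUS_PATTERNS.items():
                if pattern.search(data):
                    findings.append((info.filename, marker))
    return sorted(findings)


def build_sample_package():
    """Build a tiny in-memory ZIP that mimics an OOXML layout (hypothetical)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr(
            "word/document.xml",
            '<!DOCTYPE d [<!ENTITY x SYSTEM "file:///etc/passwd">]><d>&x;</d>',
        )
    return buffer.getvalue()


if __name__ == "__main__":
    for finding in scan_package(build_sample_package()):
        print(finding)
```

Pattern matching of this kind is easy to evade with alternate encodings or nested containers, which is why the surrounding discussion pairs structural unpacking with behavioral and signature-based analysis.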
The Advanced Virus Scan API can be implemented directly into web applications with minimal code changes, and it can be deployed as a no-code solution in defense of enterprise network entry points.

Conclusion
XML is a powerful, complex data format which invites risk in myriad forms. Understanding how structured XML-based files can be weaponized is the first step toward defending any application stack from harm. To learn more about defending against XML-based threats with Cloudmersive’s Advanced Virus Scan API, please reach out to a member of our team.