Security - overlong UTF-8 encoding attack
The overlong UTF-8 encoding attack is a security vulnerability that arises when software fails to properly validate UTF-8-encoded data. Here's a breakdown of the issue and its implications:
What is Overlong UTF-8 Encoding?
UTF-8 encodes Unicode characters in variable-length sequences of 1 to 4 bytes. An overlong encoding occurs when a character is encoded using more bytes than necessary. For example:
- The ASCII character `A` (U+0041) is normally encoded as a single byte: `0x41`.
- An overlong encoding of the same character might use:
  - two bytes: `0xC1 0x81`
  - three bytes: `0xE0 0x81 0x81`
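The shortest-form rule means a strict decoder can reject these padded sequences outright. As a small sketch, Python's built-in `utf-8` codec (which is strict by default) does exactly that:

```python
# Python's strict UTF-8 decoder rejects overlong sequences.
valid = b"\x41"               # 'A' in its minimal one-byte form
overlong_2 = b"\xc1\x81"      # 'A' padded into two bytes (overlong)
overlong_3 = b"\xe0\x81\x81"  # 'A' padded into three bytes (overlong)

print(valid.decode("utf-8"))  # prints "A"

for seq in (overlong_2, overlong_3):
    try:
        seq.decode("utf-8")
        print("accepted:", seq)
    except UnicodeDecodeError as exc:
        # Both sequences fail strict decoding.
        print("rejected:", seq, "-", exc.reason)
```

Note that the decoder never even reaches a "this code point was overlong" check for the two-byte case: `0xC1` can only begin an overlong sequence, so it is banned as a start byte altogether.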
How Does the Attack Work?
Overlong encoding attacks exploit systems that do not properly validate UTF-8 inputs. These systems may:
- Treat overlong sequences as valid input.
- Interpret the same data differently in different contexts (e.g., the application vs. the database).
This can lead to several security vulnerabilities:
- Bypassing Input Validation:
  - A filter might check for dangerous characters (like `\0` for null terminators) but fail to recognize them when they are encoded as an overlong sequence.
- Injection Attacks:
  - Overlong sequences can slip past sanitization layers, leading to SQL injection, cross-site scripting (XSS), or other attacks.
- Directory Traversal:
  - Overlong sequences for slashes (`/` or `\`) can evade path normalization checks.
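The traversal case has a well-known historical instance: the IIS Unicode traversal of 2000 used `%c0%af` in URLs, where `0xC0 0xAF` is an overlong encoding of `/` (U+002F). A strict decoder refuses the sequence, as this Python sketch shows:

```python
# '/' (U+002F) in its minimal UTF-8 form is the single byte 0x2F.
assert "/".encode("utf-8") == b"\x2f"

# 0xC0 0xAF pads the same code point into two bytes (overlong).
overlong_slash = b"\xc0\xaf"
try:
    overlong_slash.decode("utf-8")
    print("accepted")  # a lenient, vulnerable decoder would reach here
except UnicodeDecodeError:
    print("rejected")  # strict decoders treat 0xC0 as an invalid start byte
```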
Example Attack Scenario
A web application rejects null bytes (`\0`) in filenames to prevent arbitrary file writes:
- A malicious user submits a payload containing an overlong encoding of `\0` (e.g., `0xC0 0x80`).
- The input passes validation but is later interpreted as `\0` by the filesystem, potentially enabling an exploit.
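The scenario above can be sketched with two toy functions: a byte-level filter that only looks for a literal NUL byte, and a deliberately buggy decoder that accepts the overlong NUL. Both are illustrative stand-ins, not real library APIs:

```python
def naive_filter(data: bytes) -> bool:
    """Accepts input unless it contains a literal NUL byte."""
    return b"\x00" not in data

def lenient_decode(data: bytes) -> str:
    """Toy decoder that (incorrectly) maps the overlong NUL back to a
    real NUL, mimicking buggy legacy decoders. Do not do this."""
    data = data.replace(b"\xc0\x80", b"\x00")
    return data.decode("utf-8")

payload = b"report" + b"\xc0\x80" + b".txt"

print(naive_filter(payload))               # True: the filter sees no 0x00 byte
print("\x00" in lenient_decode(payload))   # True: the NUL reappears after decoding
```

The mismatch is the whole attack: validation and interpretation disagree about what the bytes mean.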
Defenses Against Overlong UTF-8 Attacks
- Strict UTF-8 Validation:
  - Ensure the decoder rejects overlong sequences as invalid UTF-8.
  - Many modern libraries and frameworks enforce this by default (e.g., Python's `utf-8` codec, modern web browsers).
- Canonicalization Before Validation:
  - Normalize all inputs to their simplest (canonical) form before applying validation or processing.
- Input Sanitization:
  - Validate and sanitize input at every layer of the application stack.
- Use Modern Libraries:
  - Avoid writing custom parsers; rely on trusted, up-to-date libraries that handle UTF-8 correctly.
- Testing and Auditing:
  - Test edge cases and fuzz inputs with invalid or overlong encodings to uncover vulnerabilities.
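The first defense can be reduced to a tiny helper in languages whose standard decoder is already strict. A minimal sketch in Python, leaning on the fact that the strict `utf-8` codec rejects overlong forms (the helper name is ours):

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True only if `data` is well-formed UTF-8 with every code
    point in its shortest form. Python's strict codec enforces this,
    so a decode attempt is sufficient."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

assert is_strict_utf8("café".encode("utf-8"))
assert not is_strict_utf8(b"\xc0\x80")  # overlong NUL
assert not is_strict_utf8(b"\xc0\xaf")  # overlong '/'
```

Crucially, never decode untrusted input with a lenient error handler (e.g., `errors="replace"`) before validation, since that silently papers over exactly the malformed sequences you need to reject.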
Modern Relevance
While overlong UTF-8 encoding attacks were more prominent in the early 2000s, they are less common today due to improved UTF-8 validation in modern software. However, legacy systems or poorly implemented parsers may still be vulnerable. It is crucial to maintain awareness of these issues in security-sensitive applications.