Security - overlong UTF-8 encoding attack
The overlong UTF-8 encoding attack is a security vulnerability that arises when software fails to properly validate UTF-8-encoded data. Here's a breakdown of the issue and its implications:
What is Overlong UTF-8 Encoding?
UTF-8 encodes Unicode characters in variable-length sequences of 1 to 4 bytes. An overlong encoding occurs when a character is encoded using more bytes than necessary. For example:
- The ASCII character `A` (U+0041) is normally encoded as a single byte: `0x41`.
- An overlong encoding of the same character might use:
  - two bytes: `0xC1 0x81`
  - three bytes: `0xE0 0x81 0x81`
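The shortest-form rule means a strict decoder can reject these padded sequences outright. As a small sketch, Python's built-in `utf-8` codec (which is strict by default) does exactly that:

```python
# Python's strict UTF-8 decoder rejects overlong sequences.
valid = b"\x41"               # 'A' in its minimal one-byte form
overlong_2 = b"\xc1\x81"      # 'A' padded into two bytes (overlong)
overlong_3 = b"\xe0\x81\x81"  # 'A' padded into three bytes (overlong)

print(valid.decode("utf-8"))  # prints "A"

for seq in (overlong_2, overlong_3):
    try:
        seq.decode("utf-8")
        print("accepted:", seq)
    except UnicodeDecodeError as exc:
        # Both sequences fail strict decoding.
        print("rejected:", seq, "-", exc.reason)
```

Note that the decoder never even reaches a "this code point was overlong" check for the two-byte case: `0xC1` can only begin an overlong sequence, so it is banned as a start byte altogether.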
How Does the Attack Work?
Overlong encoding attacks exploit systems that do not properly validate UTF-8 inputs. These systems may:
- Treat overlong sequences as valid input.
- Interpret the same data differently in different contexts (e.g., the application vs. the database).
This can lead to several security vulnerabilities:
- Bypassing Input Validation:
  - A filter might check for dangerous characters (like `\0` for null terminators) but fail to recognize them when they are encoded as an overlong sequence.
- Injection Attacks:
  - Overlong sequences can slip past sanitization layers, leading to SQL injection, cross-site scripting (XSS), or other attacks.
- Directory Traversal:
  - Overlong sequences for slashes (`/` or `\`) can evade path normalization checks.
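The traversal case has a well-known historical instance: the IIS Unicode traversal of 2000 used `%c0%af` in URLs, where `0xC0 0xAF` is an overlong encoding of `/` (U+002F). A strict decoder refuses the sequence, as this Python sketch shows:

```python
# '/' (U+002F) in its minimal UTF-8 form is the single byte 0x2F.
assert "/".encode("utf-8") == b"\x2f"

# 0xC0 0xAF pads the same code point into two bytes (overlong).
overlong_slash = b"\xc0\xaf"
try:
    overlong_slash.decode("utf-8")
    print("accepted")  # a lenient, vulnerable decoder would reach here
except UnicodeDecodeError:
    print("rejected")  # strict decoders treat 0xC0 as an invalid start byte
```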
Example Attack Scenario
A web application rejects null bytes (`\0`) in filenames to prevent arbitrary file writes:
- A malicious user submits a payload containing an overlong encoding of `\0` (e.g., `0xC0 0x80`).
- The input passes validation but is later interpreted as `\0` by the filesystem, potentially enabling an exploit.
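The scenario above can be sketched with two toy functions: a byte-level filter that only looks for a literal NUL byte, and a deliberately buggy decoder that accepts the overlong NUL. Both are illustrative stand-ins, not real library APIs:

```python
def naive_filter(data: bytes) -> bool:
    """Accepts input unless it contains a literal NUL byte."""
    return b"\x00" not in data

def lenient_decode(data: bytes) -> str:
    """Toy decoder that (incorrectly) maps the overlong NUL back to a
    real NUL, mimicking buggy legacy decoders. Do not do this."""
    data = data.replace(b"\xc0\x80", b"\x00")
    return data.decode("utf-8")

payload = b"report" + b"\xc0\x80" + b".txt"

print(naive_filter(payload))               # True: the filter sees no 0x00 byte
print("\x00" in lenient_decode(payload))   # True: the NUL reappears after decoding
```

The mismatch is the whole attack: validation and interpretation disagree about what the bytes mean.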
Defenses Against Overlong UTF-8 Attacks
- Strict UTF-8 Validation:
  - Ensure the decoder rejects overlong sequences as invalid UTF-8.
  - Many modern libraries and frameworks enforce this by default (e.g., Python's `utf-8` codec, modern web browsers).
- Canonicalization Before Validation:
  - Normalize all inputs to their simplest (canonical) form before applying validation or processing.
- Input Sanitization:
  - Validate and sanitize input at every layer of the application stack.
- Use Modern Libraries:
  - Avoid writing custom parsers; rely on trusted, up-to-date libraries that handle UTF-8 correctly.
- Testing and Auditing:
  - Test edge cases and fuzz inputs with invalid or overlong encodings to uncover vulnerabilities.
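The first defense can be reduced to a tiny helper in languages whose standard decoder is already strict. A minimal sketch in Python, leaning on the fact that the strict `utf-8` codec rejects overlong forms (the helper name is ours):

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True only if `data` is well-formed UTF-8 with every code
    point in its shortest form. Python's strict codec enforces this,
    so a decode attempt is sufficient."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

assert is_strict_utf8("café".encode("utf-8"))
assert not is_strict_utf8(b"\xc0\x80")  # overlong NUL
assert not is_strict_utf8(b"\xc0\xaf")  # overlong '/'
```

Crucially, never decode untrusted input with a lenient error handler (e.g., `errors="replace"`) before validation, since that silently papers over exactly the malformed sequences you need to reject.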
Modern Relevance
While overlong UTF-8 encoding attacks were more prominent in the early 2000s, they are less common today due to improved UTF-8 validation in modern software. However, legacy systems or poorly implemented parsers may still be vulnerable. It is crucial to maintain awareness of these issues in security-sensitive applications.