
Security - overlong UTF-8 encoding attack

Created by: Olivier-Roger

The overlong UTF-8 encoding attack is a security vulnerability that arises when UTF-8 encoded data is handled or validated improperly. Here's a breakdown of the issue and its implications:


What is Overlong UTF-8 Encoding?

UTF-8 encodes Unicode characters in variable-length sequences of 1 to 4 bytes. An overlong encoding occurs when a character is encoded using more bytes than necessary. For example:

  • The ASCII character A (U+0041) is encoded as a single byte: 0x41 (the shortest form, which is the only one valid UTF-8 allows).
  • An overlong encoding of the same character might use:
    • two bytes: 0xC1 0x81,
    • three bytes: 0xE0 0x81 0x81.
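
A quick way to see the difference is to compare the canonical encoding with what a strict decoder does to the overlong form. A minimal sketch, assuming Python 3 (whose built-in utf-8 codec rejects overlong sequences):

canonical = "A".encode("utf-8")
print(canonical)                  # b'A' -> the single byte 0x41

overlong = bytes([0xC1, 0x81])    # overlong two-byte encoding of U+0041
try:
    overlong.decode("utf-8")      # strict decoding
except UnicodeDecodeError as error:
    print("rejected:", error)     # the codec refuses the overlong form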


How Does the Attack Work?

Overlong encoding attacks exploit systems that do not properly validate UTF-8 inputs. These systems may:

  1. Treat overlong sequences as valid input.
  2. Interpret the same data differently in different contexts (e.g., an application vs. a database).

This can lead to several security vulnerabilities:

  • Bypassing Input Validation:
    • A filter might check for dangerous characters (like \0 for null terminators) but fail to recognize them when encoded in an overlong sequence.
  • Injection Attacks:
    • Overlong sequences can bypass sanitization layers, leading to SQL injection, cross-site scripting (XSS), or other attacks.
  • Directory Traversal:
    • Overlong sequences for slashes (/ or \) can evade path normalization checks (see the sketch after this list).
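
To make the directory traversal case concrete, here is a deliberately naive, hypothetical filter (Python 3; the name naive_filter and the byte set are made up for this illustration) that inspects raw bytes instead of decoded text:

FORBIDDEN_BYTES = {0x00, 0x2F, 0x5C}   # NUL, '/', '\'

def naive_filter(raw: bytes) -> bool:
    # Accepts the input if none of the forbidden byte values appears in it.
    return not any(b in FORBIDDEN_BYTES for b in raw)

print(naive_filter(b"etc/passwd"))          # False -> the canonical '/' (0x2F) is caught
print(naive_filter(b"etc\xc0\xafpasswd"))   # True  -> the overlong '/' (0xC0 0xAF) slips through

A downstream component that tolerates overlong sequences would still map 0xC0 0xAF back to /, so the filter and the consumer disagree about the same bytes.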


Example Attack Scenario

A web application rejects null bytes (\0) in filenames to prevent attackers from writing to arbitrary files:

  • A malicious user submits a payload with an overlong encoding of \0 (e.g., 0xC0 0x80).
  • The input passes validation but is later interpreted as \0 by the filesystem, potentially enabling an exploit.
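
The scenario can be reproduced with a toy decoder that, unlike a correct one, accepts two-byte overlong forms. Everything below is a hypothetical illustration written for this article, not code from any real system (Python 3):

def lenient_decode(raw: bytes) -> str:
    # Toy decoder: reassembles 1- and 2-byte sequences without rejecting
    # overlong forms, which is exactly the flaw a correct decoder must not have.
    out, i = [], 0
    while i < len(raw):
        b = raw[i]
        if b < 0x80:                # plain ASCII byte
            out.append(chr(b))
            i += 1
        elif b & 0xE0 == 0xC0:      # 2-byte sequence, no overlong check
            out.append(chr(((b & 0x1F) << 6) | (raw[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("sequence not handled by this toy decoder")
    return "".join(out)

payload = b"report.txt\xc0\x80.png"     # overlong NUL hidden in a filename
print(0x00 in payload)                  # False -> a byte-level NUL check passes
print("\0" in lenient_decode(payload))  # True  -> the lenient consumer sees a NUL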


Defenses Against Overlong UTF-8 Attacks

  1. Strict UTF-8 Validation:
    • Ensure the decoding process rejects overlong sequences as invalid UTF-8 (see the sketch after this list).
    • Many modern libraries and frameworks enforce this by default (e.g., Python's utf-8 codec, modern web browsers).
  2. Canonicalization Before Validation:
    • Normalize all inputs to their simplest (canonical) form before applying validation or processing.
  3. Input Sanitization:
    • Validate and sanitize input at every layer of the application stack.
  4. Use Modern Libraries:
    • Avoid writing custom parsers; rely on trusted, up-to-date libraries that handle UTF-8 correctly.
  5. Testing and Auditing:
    • Test systems for edge cases and fuzz inputs with invalid or overlong encodings to identify vulnerabilities.
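
A minimal sketch of points 1 and 5, assuming Python 3: decode untrusted bytes strictly at the boundary and exercise the decoder with known-bad overlong samples (decode_untrusted is an illustrative name, not a standard API):

def decode_untrusted(raw: bytes) -> str:
    # Python's utf-8 codec is strict by default: overlong sequences,
    # surrogates and truncated sequences all raise UnicodeDecodeError.
    return raw.decode("utf-8")

samples = [b"A", b"\xc1\x81", b"\xc0\x80", b"\xc0\xaf"]   # one valid input, three overlong forms
for sample in samples:
    try:
        print(sample, "->", repr(decode_untrusted(sample)))
    except UnicodeDecodeError as error:
        print(sample, "-> rejected:", error.reason)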


Modern Relevance

While overlong UTF-8 encoding attacks were more prominent in the early 2000s, they are less common today due to improved UTF-8 validation in modern software. However, legacy systems or poorly implemented parsers may still be vulnerable. It is crucial to maintain awareness of these issues in security-sensitive applications.
