
The EBML Format Used by MKV

Tiger Soldier posted @ June 30, 2008 10:22 in Uncategorized with tags mkv ebml, 8840 views

I wrote this a long time ago on my old blog and am moving it over here now...

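Yesterday, while reading the MKV format documentation, I learned that it is built on the EBML format. I looked into it a bit and found it quite interesting.
The original document describes the format like this: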

Just like XML, the specific "tags" (IDs in EBML parlance) used in an EBML implementation are arbitrary. However, the semantic of EBML outlines general data types and ID's.

The known basic types are:

  • Signed Integer - Big-endian, any size from 1 to 8 octets
  • Unsigned Integer - Big-endian, any size from 1 to 8 octets
  • Float - Big-endian, defined for 4 and 8 octets (32, 64 bits)
  • String - Printable ASCII (0x20 to 0x7E), zero-padded when needed
  • UTF-8 - Unicode string, zero padded when needed (RFC 2279)
  • Date - signed 8 octets integer in nanoseconds with 0 indicating the precise beginning of the millennium (at 2001-01-01T00:00:00,000000000 UTC)
  • master-element - contains other EBML sub-elements of the next lower level
  • Binary - not interpreted by the parser

As well as defining standard data types, EBML uses a system of Elements to make up an EBML "document." Elements incorporate an Element ID, a descriptor for the size of the element, and the binary data itself. Further, Elements can be nested, or contain, Elements of a lower "level."

Element IDs are outlined as follows, beginning with the ID itself, followed by the Data Size, and then the non-interpreted Binary itself:

  • Element ID coded with an UTF-8 like system : bits, big-endian
    1xxx xxxx Class A IDs (2^7 -1 possible values) (base 0x8X)
    01xx xxxx xxxx xxxx Class B IDs (2^14-1 possible values) (base 0x4X 0xXX)
    001x xxxx xxxx xxxx xxxx xxxx Class C IDs (2^21-1 possible values) (base 0x2X 0xXX 0xXX)
    0001 xxxx xxxx xxxx xxxx xxxx xxxx xxxx Class D IDs (2^28-1 possible values) (base 0x1X 0xXX 0xXX 0xXX)

    Some Notes:
    • The leading bits of the Class IDs are used to identify the length of the ID. The number of leading 0's + 1 is the length of the ID in octets. We will refer to the leading bits as the Length Descriptor.
    • Any ID where all x's are composed entirely of 1's is a Reserved ID, thus the -1 in the definitions above.
    • The Reserved IDs (all x set to 1) are the only IDs that may change the Length Descriptor.
  • Data size, in octets, is also coded with an UTF-8 like system : bits, big-endian
    1xxx xxxx - value 0 to 2^7-2
    01xx xxxx xxxx xxxx - value 0 to 2^14-2
    001x xxxx xxxx xxxx xxxx xxxx - value 0 to 2^21-2
    0001 xxxx xxxx xxxx xxxx xxxx xxxx xxxx - value 0 to 2^28-2
    0000 1xxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx - value 0 to 2^35-2
    0000 01xx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx - value 0 to 2^42-2
    0000 001x xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx - value 0 to 2^49-2
    0000 0001 xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx - value 0 to 2^56-2

    Since modern computers do not easily deal with data coded in sizes greater than 64 bits, any larger Element Sizes are left undefined at the moment. Currently, the Element Size coding allows for an Element to grow to 72000 To, i.e. 7x10^16 octets or 72000 terabytes, which will be sufficient for the time being.

    There is only one reserved word for Element Size encoding, which is an Element Size encoded to all 1's. Such a coding indicates that the size of the Element is unknown, which is a special case that we believe will be useful for live streaming purposes. However, avoid using this reserved word unnecessarily, because it makes parsing slower and more difficult to implement.

  • Data
    • Integers are stored in their standard big-endian form (no UTF-like encoding), only the size may differ from their usual form (24 or 40 bits for example).
    • The Signed Integer is just the big-endian representation trimmed from some 0x00 and 0xFF where they are not meaningful (sign). For example -2 can be coded as 0xFFFFFFFFFFFFFE or 0xFFFE or 0xFE and 5 can be coded 0x000000000005 or 0x0005 or 0x05.

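That is all English... so here is a short summary. EBML is a hierarchical structure similar to XML, in which every element has an ID (that is, the element name) and a value. Because it is stored as binary, each element also carries a length field that records how long its value is. Each element is laid out as ID, length, value.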

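The way IDs and lengths are represented is the interesting part: they use a variable-length prefix scheme like UTF-8. Written in binary, a field that starts with 1 is 1 byte long, one that starts with 01 is 2 bytes, one that starts with 001 is 3 bytes, and so on. For a length field, the bits after the prefix are its value. (For an ID there is no value to extract, of course: the prefix only tells you how long the ID is, and the ID is kept whole, prefix included.)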
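The prefix rule can be sketched in a few lines of Python. This is only a minimal illustration, not code from any EBML library: the names `vint_length` and `read_size` are made up here, and the reserved all-ones "unknown size" value is not handled.

```python
from typing import Tuple


def vint_length(first_byte: int) -> int:
    """Length of the field in octets: number of leading zero bits + 1."""
    if first_byte == 0:
        # No marker bit in the first octet; lengths above 8 are undefined.
        raise ValueError("no length marker in the first octet")
    return 9 - first_byte.bit_length()


def read_size(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, field_length) for a data-size field starting at pos."""
    length = vint_length(data[pos])
    raw = int.from_bytes(data[pos:pos + length], "big")
    # The marker bit sits at bit 7 * length of the whole field; mask it off.
    value = raw & ((1 << (7 * length)) - 1)
    return value, length


print(vint_length(0x42))             # first octet 0100 0010 -> 2-octet field
print(read_size(bytes([0x88])))      # 1000 1000 -> value 8, 1 octet
```

An element ID would use only `vint_length`, since the ID is the raw bytes with the prefix left in place.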

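Element values are stored in the ordinary way. One thing to note is that integers and similar values are not byte-reversed the way they usually are in binary files written on little-endian machines such as x86; they are stored from the most significant byte to the least. (For example, the number 0xFE65C4 would usually end up in a file as C4 65 FE, whereas in EBML it is stored as FE 65 C4.)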
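To make the byte order concrete, here is a small Python sketch using only the built-in `int.to_bytes` and `int.from_bytes`. The helper names `decode_uint` and `decode_sint` are invented for illustration. The signed case follows the rule in the quoted text that a signed integer is the big-endian form with redundant 0x00 or 0xFF bytes trimmed.

```python
value = 0xFE65C4

big = value.to_bytes(3, "big")        # the order EBML stores: FE 65 C4
little = value.to_bytes(3, "little")  # the order a little-endian host would dump: C4 65 FE

print(big.hex(), little.hex())


def decode_uint(payload: bytes) -> int:
    """Unsigned Integer: plain big-endian, any payload length."""
    return int.from_bytes(payload, "big")


def decode_sint(payload: bytes) -> int:
    """Signed Integer: big-endian two's complement, any payload length."""
    return int.from_bytes(payload, "big", signed=True)


# The same number decodes identically whatever the payload width is.
print(decode_sint(bytes.fromhex("FE")), decode_sint(bytes.fromhex("FFFE")))
print(decode_uint(bytes.fromhex("0005")), decode_uint(bytes.fromhex("05")))
```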
As an example, here is a fragment from an MKV file:

……42 82 88 6D 61 74 72 6F 73 6B 61 42 87 81 01……

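Start with 42. In binary it is 01000010 (note that 42 here is hexadecimal; don't mistake it for decimal and convert it to 00101010 XD. Also, pad with leading zeros up to 8 bits). Since it starts with 01, the ID is 2 bytes long, so 42 82 is the ID. Next comes the length: 88 in binary is 10001000, which starts with 1, so the length field is only 1 byte. Dropping the prefix 1 leaves 0001000, that is 8, which means the next 8 bytes are the value. Looking it up in the table, ID 42 82 is DocType, whose value is a string, so decoding the following 8 bytes 6D 61 74 72 6F 73 6B 61 as UTF-8 gives "matroska" (these are English letters, so UTF-8 and ASCII are identical XD).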

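After that a new element begins. By the same reasoning, 42 87 is its ID and the length field is 81, i.e. a length of 1 byte, so the following 01 is its value. The table says 42 87 is DocTypeVersion, with an integer value, so this stretch of EBML parses to: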

DocType=matroska
DocTypeVersion=1
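The walkthrough above can be reproduced with a short Python sketch. It is a rough illustration under several assumptions: the lookup table `KNOWN_IDS` is hypothetical and holds only the two IDs from this fragment, all function names are made up, and nested (master) elements and the unknown-size reserved value are not handled.

```python
# Hypothetical lookup table: only the two IDs that appear in the fragment.
KNOWN_IDS = {
    bytes.fromhex("4282"): ("DocType", "string"),
    bytes.fromhex("4287"): ("DocTypeVersion", "uint"),
}


def vint_length(first_byte):
    """Field length in octets: number of leading zero bits + 1."""
    if first_byte == 0:
        raise ValueError("no length marker in the first octet")
    return 9 - first_byte.bit_length()


def parse_elements(data):
    """Yield (id_bytes, payload) for a flat run of elements."""
    pos = 0
    while pos < len(data):
        id_len = vint_length(data[pos])
        elem_id = data[pos:pos + id_len]      # the ID keeps its prefix bits
        pos += id_len
        size_len = vint_length(data[pos])
        raw = int.from_bytes(data[pos:pos + size_len], "big")
        size = raw & ((1 << (7 * size_len)) - 1)  # strip the marker bit
        pos += size_len
        yield elem_id, data[pos:pos + size]
        pos += size


def decode(elem_id, payload):
    """Turn a payload into a Python value using the hypothetical table."""
    name, kind = KNOWN_IDS[elem_id]
    if kind == "string":
        return name, payload.rstrip(b"\x00").decode("utf-8")
    return name, int.from_bytes(payload, "big")


fragment = bytes.fromhex("42 82 88 6D 61 74 72 6F 73 6B 61 42 87 81 01")
results = [decode(i, p) for i, p in parse_elements(fragment)]
for name, value in results:
    print(f"{name}={value}")
```

Strings are stripped of trailing zero bytes because the quoted type list says they may be zero-padded.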

After all this talk, what is EBML? It stands for Extensible Binary Meta Language, and it was designed mainly for describing media files.

As for how the hierarchy is actually implemented, I'll write about that when I'm in the mood XD

 
