| <?xml version="1.0" encoding="utf-8"?> |
| <!DOCTYPE rfc SYSTEM 'rfc2629.dtd'> |
| <?rfc toc="yes" symrefs="yes" ?> |
| |
| <rfc ipr="trust200902" category="std" docName="draft-ietf-codec-opus-14"> |
| |
| <front> |
| <title abbrev="Interactive Audio Codec">Definition of the Opus Audio Codec</title> |
| |
| |
| <author initials="JM" surname="Valin" fullname="Jean-Marc Valin"> |
| <organization>Mozilla Corporation</organization> |
| <address> |
| <postal> |
| <street>650 Castro Street</street> |
| <city>Mountain View</city> |
| <region>CA</region> |
| <code>94041</code> |
| <country>USA</country> |
| </postal> |
| <phone>+1 650 903-0800</phone> |
| <email>jmvalin@jmvalin.ca</email> |
| </address> |
| </author> |
| |
| <author initials="K." surname="Vos" fullname="Koen Vos"> |
| <organization>Skype Technologies S.A.</organization> |
| <address> |
| <postal> |
| <street>Soder Malarstrand 43</street> |
| <city>Stockholm</city> |
| <region></region> |
| <code>11825</code> |
| <country>SE</country> |
| </postal> |
| <phone>+46 73 085 7619</phone> |
| <email>koen.vos@skype.net</email> |
| </address> |
| </author> |
| |
| <author initials="T." surname="Terriberry" fullname="Timothy B. Terriberry"> |
| <organization>Mozilla Corporation</organization> |
| <address> |
| <postal> |
| <street>650 Castro Street</street> |
| <city>Mountain View</city> |
| <region>CA</region> |
| <code>94041</code> |
| <country>USA</country> |
| </postal> |
| <phone>+1 650 903-0800</phone> |
| <email>tterriberry@mozilla.com</email> |
| </address> |
| </author> |
| |
| <date day="17" month="May" year="2012" /> |
| |
| <area>General</area> |
| |
| <workgroup></workgroup> |
| |
| <abstract> |
| <t> |
| This document defines the Opus interactive speech and audio codec. |
| Opus is designed to handle a wide range of interactive audio applications, |
| including Voice over IP, videoconferencing, in-game chat, and even live, |
| distributed music performances. |
| It scales from low bitrate narrowband speech at 6 kb/s to very high quality |
| stereo music at 510 kb/s. |
| Opus uses both linear prediction (LP) and the Modified Discrete Cosine |
| Transform (MDCT) to achieve good compression of both speech and music. |
| </t> |
| </abstract> |
| </front> |
| |
| <middle> |
| |
| <section anchor="introduction" title="Introduction"> |
| <t> |
| The Opus codec is a real-time interactive audio codec designed to meet the requirements |
| described in <xref target="requirements"></xref>. |
| It is composed of a linear |
| prediction (LP)-based <xref target="LPC"/> layer and a Modified Discrete Cosine Transform |
| (MDCT)-based <xref target="MDCT"/> layer. |
| The main idea behind using two layers is that in speech, linear prediction |
| techniques (such as Code-Excited Linear Prediction, or CELP) code low frequencies more efficiently than transform |
| (e.g., MDCT) domain techniques, while the situation is reversed for music and |
| higher speech frequencies. |
| Thus a codec with both layers available can operate over a wider range than |
| either one alone and, by combining them, achieve better quality than either |
| one individually. |
| </t> |
| |
| <t> |
| The primary normative part of this specification is provided by the source code |
| in <xref target="ref-implementation"></xref>. |
| Only the decoder portion of this software is normative, though a |
| significant amount of code is shared by both the encoder and decoder. |
| <xref target="conformance"/> provides a decoder conformance test. |
| The decoder contains a great deal of integer and fixed-point arithmetic which |
| needs to be performed exactly, including all rounding considerations, so any |
| useful specification requires domain-specific symbolic language to adequately |
| define these operations. |
| Additionally, any |
| conflict between the symbolic representation and the included reference |
| implementation must be resolved. For practical reasons of compatibility and |
| testability, it is advantageous to give the reference implementation |
| priority in any disagreement. The C language is also one of the most |
| widely understood human-readable symbolic representations for machine |
| behavior. |
| For these reasons this RFC uses the reference implementation as the sole |
| symbolic representation of the codec. |
| </t> |
| |
| <t>While the symbolic representation is unambiguous and complete, it is not |
| always the easiest way to understand the codec's operation. For this reason |
| this document also describes significant parts of the codec in English and |
| takes the opportunity to explain the rationale behind many of the more |
| surprising elements of the design. These descriptions are intended to be |
| accurate and informative, but the limitations of common English sometimes |
| result in ambiguity, so it is expected that the reader will always read |
| them alongside the symbolic representation. Numerous references to the |
| implementation are provided for this purpose. The descriptions sometimes |
| differ from the reference in ordering or through mathematical simplification |
| wherever such deviation makes an explanation easier to understand. |
| For example, the right shift and left shift operations in the reference |
| implementation are often described using division and multiplication in the text. |
| In general, the text is focused on the "what" and "why" while the symbolic |
| representation most clearly provides the "how". |
| </t> |
| |
| <section anchor="notation" title="Notation and Conventions"> |
| <t> |
| The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", |
| "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be |
| interpreted as described in RFC 2119 <xref target="rfc2119"></xref>. |
| </t> |
| <t> |
| Various operations in the codec require bit-exact fixed-point behavior, even |
| when writing a floating-point implementation. |
| The notation "Q&lt;n&gt;", where n is an integer, denotes the number of binary |
| digits to the right of the binary point in a fixed-point number. |
| For example, a signed Q14 value in a 16-bit word can represent values from |
| -2.0 to 1.99993896484375, inclusive. |
| This notation is for informational purposes only. |
| Arithmetic, when described, always operates on the underlying integer. |
| E.g., the text will explicitly indicate any shifts required after a |
| multiplication. |
| </t> |
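| <t> |
| As a non-normative illustration of the Q notation, the following C sketch |
| converts a Q14 value to a real number and multiplies two Q14 values with the |
| explicit shift that the text would call out. |
| The names q14_to_double and q14_mul are hypothetical and are not taken from |
| the reference implementation. |
| </t> |
| <figure> |
| <artwork><![CDATA[ |
```c
#include <stdint.h>

/* Hypothetical helpers illustrating the Q14 notation; they are not taken
   from the reference implementation. */
#define Q14_SHIFT 14
#define Q14_ONE   (1 << Q14_SHIFT)   /* the real value 1.0 expressed in Q14 */

/* Interpret a Q14 integer as a real number: value = raw / 2**14. */
static double q14_to_double(int16_t raw)
{
    return (double)raw / Q14_ONE;
}

/* Multiply two Q14 values. The 32-bit product carries 28 fractional bits,
   so it is shifted right by 14 to return to Q14. The result is kept in
   32 bits because a product such as (-2.0)*(-2.0) does not fit in a
   16-bit Q14 word. Assumption: >> on a negative value is an arithmetic
   shift, which C leaves implementation-defined. */
static int32_t q14_mul(int16_t a, int16_t b)
{
    int32_t product_q28 = (int32_t)a * (int32_t)b;
    return product_q28 >> Q14_SHIFT;
}
```
| ]]></artwork> |
| </figure> |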
| <t> |
| Expressions, where included in the text, follow C operator rules and |
| precedence, with the exception that the syntax "x**y" indicates x raised to |
| the power y. |
| The text also makes use of the following functions: |
| </t> |
| |
| <section anchor="min" toc="exclude" title="min(x,y)"> |
| <t> |
| The smaller of two values x and y. |
| </t> |
| </section> |
| |
| <section anchor="max" toc="exclude" title="max(x,y)"> |
| <t> |
| The larger of two values x and y. |
| </t> |
| </section> |
| |
| <section anchor="clamp" toc="exclude" title="clamp(lo,x,hi)"> |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| clamp(lo,x,hi) = max(lo,min(x,hi)) |
| ]]></artwork> |
| </figure> |
| <t> |
| With this definition, if lo > hi, the lower bound is the one that |
| is enforced. |
| </t> |
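| <t> |
| The behavior for an inverted range can be checked with a small, non-normative |
| C sketch of the definition above. |
| The integer helper names are hypothetical; only the formula comes from this |
| document. |
| </t> |
| <figure> |
| <artwork><![CDATA[ |
```c
/* Hypothetical integer versions of min(), max() and clamp() as defined
   in this section. */
static int min_int(int x, int y) { return x < y ? x : y; }
static int max_int(int x, int y) { return x > y ? x : y; }

/* clamp(lo,x,hi) = max(lo,min(x,hi)). Because max() is applied last,
   the lower bound wins whenever lo > hi. */
static int clamp_int(int lo, int x, int hi)
{
    return max_int(lo, min_int(x, hi));
}
```
| ]]></artwork> |
| </figure> |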
| </section> |
| |
| <section anchor="sign" toc="exclude" title="sign(x)"> |
| <t> |
| The sign of x, i.e., |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| ( -1, x < 0 , |
| sign(x) = < 0, x == 0 , |
| ( 1, x > 0 . |
| ]]></artwork> |
| </figure> |
| </t> |
| </section> |
| |
| <section anchor="abs" toc="exclude" title="abs(x)"> |
| <t> |
| The absolute value of x, i.e., |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| abs(x) = sign(x)*x . |
| ]]></artwork> |
| </figure> |
| </t> |
| </section> |
| |
| <section anchor="floor" toc="exclude" title="floor(f)"> |
| <t> |
| The largest integer z such that z &lt;= f. |
| </t> |
| </section> |
| |
| <section anchor="ceil" toc="exclude" title="ceil(f)"> |
| <t> |
| The smallest integer z such that z >= f. |
| </t> |
| </section> |
| |
| <section anchor="round" toc="exclude" title="round(f)"> |
| <t> |
| The integer z nearest to f, with ties rounded towards negative infinity, |
| i.e., |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| round(f) = ceil(f - 0.5) . |
| ]]></artwork> |
| </figure> |
| </t> |
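| <t> |
| Note that this tie-breaking rule differs from that of the C library round() |
| function, which rounds ties away from zero. |
| The following non-normative sketch implements the definition above without |
| the math library; the function name is hypothetical, and the input is assumed |
| to be well within the range of a long. |
| </t> |
| <figure> |
| <artwork><![CDATA[ |
```c
/* round(f) = ceil(f - 0.5), i.e., nearest integer with ties rounded
   towards negative infinity. Hypothetical helper; assumes f - 0.5 is
   representable as a long. */
static long round_ties_down(double f)
{
    double g = f - 0.5;
    long z = (long)g;        /* the cast truncates towards zero */
    if ((double)z < g) {
        z++;                 /* truncation went below g: step up to ceil(g) */
    }
    return z;
}
```
| ]]></artwork> |
| </figure> |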
| </section> |
| |
| <section anchor="log2" toc="exclude" title="log2(f)"> |
| <t> |
| The base-two logarithm of f. |
| </t> |
| </section> |
| |
| <section anchor="ilog" toc="exclude" title="ilog(n)"> |
| <t> |
| The minimum number of bits required to store a positive integer n in binary, |
| or 0 for a non-positive integer n. |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| ( 0, n <= 0, |
| ilog(n) = < |
| ( floor(log2(n))+1, n > 0 |
| ]]></artwork> |
| </figure> |
| Examples: |
| <list style="symbols"> |
| <t>ilog(-1) = 0</t> |
| <t>ilog(0) = 0</t> |
| <t>ilog(1) = 1</t> |
| <t>ilog(2) = 2</t> |
| <t>ilog(3) = 2</t> |
| <t>ilog(4) = 3</t> |
| <t>ilog(7) = 3</t> |
| </list> |
| </t> |
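| <t> |
| A straightforward, non-normative way to compute this function is to count how |
| many times a positive value can be shifted right before it reaches zero. |
| The sketch below is written for clarity and is not taken from the reference |
| implementation, which is free to use a faster method. |
| </t> |
| <figure> |
| <artwork><![CDATA[ |
```c
#include <stdint.h>

/* ilog(n): 0 for n <= 0, floor(log2(n))+1 otherwise.
   Illustrative loop-based version. */
static int ilog(int32_t n)
{
    int bits = 0;
    if (n <= 0) {
        return 0;
    }
    while (n > 0) {
        bits++;      /* one more bit is needed ... */
        n >>= 1;     /* ... for each halving that leaves a non-zero value */
    }
    return bits;
}
```
| ]]></artwork> |
| </figure> |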
| </section> |
| |
| </section> |
| |
| </section> |
| |
| <section anchor="overview" title="Opus Codec Overview"> |
| |
| <t> |
| The Opus codec scales from 6 kb/s narrowband mono speech to 510 kb/s |
| fullband stereo music, with algorithmic delays ranging from 5 ms to |
| 65.2 ms. |
| At any given time, either the LP layer, the MDCT layer, or both, may be active. |
| It can seamlessly switch between all of its various operating modes, giving it |
| a great deal of flexibility to adapt to varying content and network |
| conditions without renegotiating the current session. |
| The codec allows input and output of various audio bandwidths, defined as |
| follows: |
| </t> |
| <texttable anchor="audio-bandwidth"> |
| <ttcol>Abbreviation</ttcol> |
| <ttcol align="right">Audio Bandwidth</ttcol> |
| <ttcol align="right">Sample Rate (Effective)</ttcol> |
| <c>NB (narrowband)</c> <c>4 kHz</c> <c>8 kHz</c> |
| <c>MB (medium-band)</c> <c>6 kHz</c> <c>12 kHz</c> |
| <c>WB (wideband)</c> <c>8 kHz</c> <c>16 kHz</c> |
| <c>SWB (super-wideband)</c> <c>12 kHz</c> <c>24 kHz</c> |
| <c>FB (fullband)</c> <c>20 kHz (*)</c> <c>48 kHz</c> |
| </texttable> |
| <t> |
| (*) Although the sampling theorem allows a bandwidth as large as half the |
| sampling rate, Opus never codes audio above 20 kHz, as that is the |
| generally accepted upper limit of human hearing. |
| </t> |
| |
| <t> |
| Opus defines super-wideband (SWB) with an effective sample rate of 24 kHz, |
| unlike some other audio coding standards that use 32 kHz. |
| This was chosen for a number of reasons. |
| The band layout in the MDCT layer naturally allows skipping coefficients for |
| frequencies over 12 kHz, but does not allow cleanly dropping just those |
| frequencies over 16 kHz. |
| A sample rate of 24 kHz also makes resampling in the MDCT layer easier, |
| as 24 evenly divides 48, and when 24 kHz is sufficient, it can save |
| computation in other processing, such as Acoustic Echo Cancellation (AEC). |
| Experimental changes to the band layout to allow a 16 kHz cutoff |
| (32 kHz effective sample rate) showed potential quality degradations at |
| other sample rates, and at typical bitrates the number of bits saved by using |
| such a cutoff instead of coding in fullband (FB) mode is very small. |
| Therefore, if an application wishes to process a signal sampled at 32 kHz, |
| it should just use FB. |
| </t> |
| |
| <t> |
| The LP layer is based on the SILK codec |
| <xref target="SILK"></xref>. |
| It supports NB, MB, or WB audio and frame sizes from 10 ms to 60 ms, |
| and requires an additional 5 ms look-ahead for noise shaping estimation. |
| A small additional delay (up to 1.5 ms) may be required for sampling rate |
| conversion. |
| Like Vorbis <xref target='Vorbis-website'/> and many other modern codecs, SILK is inherently designed for |
| variable-bitrate (VBR) coding, though the encoder can also produce |
| constant-bitrate (CBR) streams. |
| The version of SILK used in Opus is substantially modified from, and not |
| compatible with, the stand-alone SILK codec previously deployed by Skype. |
| This document does not serve to define that format, but those interested in the |
| original SILK codec should see <xref target="SILK"/> instead. |
| </t> |
| |
| <t> |
| The MDCT layer is based on the CELT codec <xref target="CELT"></xref>. |
| It supports NB, WB, SWB, or FB audio and frame sizes from 2.5 ms to |
| 20 ms, and requires an additional 2.5 ms look-ahead due to the |
| overlapping MDCT windows. |
| The CELT codec is inherently designed for CBR coding, but unlike many CBR |
| codecs it is not limited to a set of predetermined rates. |
| It internally allocates bits to exactly fill any given target budget, and an |
| encoder can produce a VBR stream by varying the target on a per-frame basis. |
| The MDCT layer is not used for speech when the audio bandwidth is WB or less, |
| as it is not useful there. |
| On the other hand, non-speech signals are not always adequately coded using |
| linear prediction, so for music only the MDCT layer should be used. |
| </t> |
| |
| <t> |
| A "Hybrid" mode allows the use of both layers simultaneously with a frame size |
| of 10 or 20 ms and a SWB or FB audio bandwidth. |
| The LP layer codes the low frequencies by resampling the signal down to WB. |
| The MDCT layer follows, coding the high frequency portion of the signal. |
| The cutoff between the two lies at 8 kHz, the maximum WB audio bandwidth. |
| In the MDCT layer, all bands below 8 kHz are discarded, so there is no |
| coding redundancy between the two layers. |
| </t> |
| |
| <t> |
| The sample rate (in contrast to the actual audio bandwidth) can be chosen |
| independently on the encoder and decoder side, e.g., a fullband signal can be |
| decoded as wideband, or vice versa. |
| This approach ensures a sender and receiver can always interoperate, regardless |
| of the capabilities of their actual audio hardware. |
| Internally, the LP layer always operates at a sample rate of twice the audio |
| bandwidth, up to a maximum of 16 kHz, which it continues to use for SWB |
| and FB. |
| The decoder simply resamples its output to support different sample rates. |
| The MDCT layer always operates internally at a sample rate of 48 kHz. |
| Since all the supported sample rates evenly divide this rate, and since the |
| decoder may easily zero out the high frequency portion of the spectrum in |
| the frequency domain, it can simply decimate the MDCT layer output to achieve |
| the other supported sample rates very cheaply. |
| </t> |
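| <t> |
| The two rate rules in the preceding paragraph can be written down as a short, |
| non-normative C sketch. |
| The function names are hypothetical, and the second function only computes |
| the integer ratio to 48 kHz; it does not check that the requested rate is one |
| of the rates listed in the audio bandwidth table. |
| </t> |
| <figure> |
| <artwork><![CDATA[ |
```c
/* Internal LP-layer sample rate in Hz for a given audio bandwidth in Hz:
   twice the bandwidth, capped at 16000 (which is therefore also used
   for SWB and FB). */
static int lp_internal_rate_hz(int audio_bandwidth_hz)
{
    int rate = 2 * audio_bandwidth_hz;
    return rate < 16000 ? rate : 16000;
}

/* Integer decimation factor from the MDCT layer's internal 48 kHz rate
   to the requested output rate. Returns 0 if the rate is not positive
   or does not evenly divide 48000. */
static int mdct_decimation_factor(int output_rate_hz)
{
    if (output_rate_hz <= 0 || 48000 % output_rate_hz != 0) {
        return 0;
    }
    return 48000 / output_rate_hz;
}
```
| ]]></artwork> |
| </figure> |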
| |
| <t> |
| After conversion to the common, desired output sample rate, the decoder simply |
| adds the output from the two layers together. |
| To compensate for the different look-ahead required by each layer, the CELT |
| encoder input is delayed by an additional 2.7 ms. |
| This ensures that low frequencies and high frequencies arrive at the same time. |
| This extra delay may be reduced by an encoder by using less look-ahead for noise |
| shaping or using a simpler resampler in the LP layer, but this will reduce |
| quality. |
| However, the base 2.5 ms look-ahead in the CELT layer cannot be reduced in |
| the encoder because it is needed for the MDCT overlap, whose size is fixed by |
| the decoder. |
| </t> |
| |
| <t> |
| Both layers use the same entropy coder, avoiding any waste from "padding bits" |
| between them. |
| The hybrid approach makes it easy to support both CBR and VBR coding. |
| Although the LP layer is VBR, the bit allocation of the MDCT layer can produce |
| a final stream that is CBR by using all the bits left unused by the LP layer. |
| </t> |
| |
| <section title="Control Parameters"> |
| <t> |
| The Opus codec includes a number of control parameters which can be changed dynamically during |
| regular operation of the codec, without interrupting the audio stream from the encoder to the decoder. |
| These parameters only affect the encoder since any impact they have on the bit-stream is signaled |
| in-band such that a decoder can decode any Opus stream without any out-of-band signaling. Any Opus |
| implementation can add or modify these control parameters without affecting interoperability. The most |
| important encoder control parameters in the reference encoder are listed below. |
| </t> |
| |
| <section title="Bitrate" toc="exclude"> |
| <t> |
| Opus supports all bitrates from 6 kb/s to 510 kb/s. All other parameters being |
| equal, higher bitrate results in higher quality. For a frame size of 20 ms, these |
| are the bitrate "sweet spots" for Opus in various configurations: |
| <list style="symbols"> |
| <t>8-12 kb/s for NB speech,</t> |
| <t>16-20 kb/s for WB speech,</t> |
| <t>28-40 kb/s for FB speech,</t> |
| <t>48-64 kb/s for FB mono music, and</t> |
| <t>64-128 kb/s for FB stereo music.</t> |
| </list> |
| </t> |
| </section> |
| |
| <section title="Number of Channels (Mono/Stereo)" toc="exclude"> |
| <t> |
| Opus can transmit either mono or stereo frames within a single stream. |
| When decoding a mono frame in a stereo decoder, the left and right channels are |
| identical, and when decoding a stereo frame in a mono decoder, the mono output |
| is the average of the left and right channels. |
| In some cases, it is desirable to encode a stereo input stream in mono (e.g., |
| because the bitrate is too low to encode stereo with sufficient quality). |
| The number of channels encoded can be selected in real-time, but by default the |
| reference encoder attempts to make the best decision possible given the |
| current bitrate. |
| </t> |
| </section> |
| |
| <section title="Audio Bandwidth" toc="exclude"> |
| <t> |
| The audio bandwidths supported by Opus are listed in |
| <xref target="audio-bandwidth"/>. |
| As with the number of channels, any decoder can decode audio encoded at |
| any bandwidth. |
| For example, any Opus decoder operating at 8 kHz can decode an FB Opus |
| frame, and any Opus decoder operating at 48 kHz can decode an NB frame. |
| Similarly, the reference encoder can take a 48 kHz input signal and |
| encode it as NB. |
| The higher the audio bandwidth, the higher the required bitrate to achieve |
| acceptable quality. |
| The audio bandwidth can be explicitly specified in real-time, but by default |
| the reference encoder attempts to make the best bandwidth decision possible |
| given the current bitrate. |
| </t> |
| </section> |
| |
| |
| <section title="Frame Duration" toc="exclude"> |
| <t> |
| Opus can encode frames of 2.5, 5, 10, 20, 40 or 60 ms. |
| It can also combine multiple frames into packets of up to 120 ms. |
| For real-time applications, sending fewer packets per second reduces the |
| bitrate, since it reduces the overhead from IP, UDP, and RTP headers. |
| However, it increases latency and sensitivity to packet losses, as losing one |
| packet constitutes a loss of a bigger chunk of audio. |
| Increasing the frame duration also slightly improves coding efficiency, but the |
| gain becomes small for frame sizes above 20 ms. |
| For this reason, 20 ms frames are a good choice for most applications. |
| </t> |
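| <t> |
| To give a sense of scale, the following non-normative sketch computes the |
| header overhead as a function of packet duration. |
| It assumes the minimum header sizes of 20 bytes for IPv4, 8 bytes for UDP, and |
| 12 bytes for RTP, with no options, extensions, or lower-layer framing; real |
| deployments may differ. |
| The function name is hypothetical. |
| </t> |
| <figure> |
| <artwork><![CDATA[ |
```c
/* Header overhead in bits per second when one packet is sent every
   packet_duration_us microseconds (e.g., 20000 for 20 ms).
   Assumption: minimal IPv4 (20) + UDP (8) + RTP (12) headers. */
static long header_overhead_bps(long packet_duration_us)
{
    const long header_bytes = 20 + 8 + 12;
    /* bits per packet times packets per second, kept in integers */
    return header_bytes * 8L * 1000000L / packet_duration_us;
}
```
| ]]></artwork> |
| </figure> |
| <t> |
| Under these assumptions, 20 ms packets carry 16 kb/s of header overhead, while |
| 2.5 ms packets carry 128 kb/s. |
| </t> |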
| </section> |
| |
| <section title="Complexity" toc="exclude"> |
| <t> |
| There are various aspects of the Opus encoding process where trade-offs |
| can be made between CPU complexity and quality/bitrate. In the reference |
| encoder, the complexity is selected using an integer from 0 to 10, where |
| 0 is the lowest complexity and 10 is the highest. Examples of |
| computations for which such trade-offs may occur are: |
| <list style="symbols"> |
| <t>The order of the pitch analysis whitening filter <xref target="Whitening"/>,</t> |
| <t>The order of the short-term noise shaping filter,</t> |
| <t>The number of states in delayed decision quantization of the |
| residual signal, and</t> |
| <t>The use of certain bit-stream features such as variable time-frequency |
| resolution and the pitch post-filter.</t> |
| </list> |
| </t> |
| </section> |
| |
| <section title="Packet Loss Resilience" toc="exclude"> |
| <t> |
| Audio codecs often exploit inter-frame correlations to reduce the |
| bitrate at a cost in error propagation: after losing one packet, |
| several packets need to be received before the decoder is able to |
| accurately reconstruct the speech signal. The extent to which Opus |
| exploits inter-frame dependencies can be adjusted on the fly to |
| choose a trade-off between bitrate and amount of error propagation. |
| </t> |
| </section> |
| |
| <section title="Forward Error Correction (FEC)" toc="exclude"> |
| <t> |
| Another mechanism providing robustness against packet loss is the in-band |
| Forward Error Correction (FEC). Packets that are determined to |
| contain perceptually important speech information, such as onsets or |
| transients, are encoded again at a lower bitrate and this re-encoded |
| information is added to a subsequent packet. |
| </t> |
| </section> |
| |
| <section title="Constant/Variable Bitrate" toc="exclude"> |
| <t> |
| Opus is more efficient when operating with variable bitrate (VBR), which is |
| the default. However, in some (rare) applications, constant bitrate (CBR) |
| is required. There are two main reasons to operate in CBR mode: |
| <list style="symbols"> |
| <t>When the transport only supports a fixed size for each compressed frame</t> |
| <t>When encryption is used for an audio stream that is either highly constrained |
| (e.g., yes/no, recorded prompts) or highly sensitive <xref target="SRTP-VBR"></xref></t> |
| </list> |
| |
| When low-latency transmission is required over a relatively slow connection, |
| constrained VBR can also be used. This uses VBR in a way that simulates a |
| "bit reservoir" and is equivalent to what MP3 (MPEG 1, Layer 3) and |
| AAC (Advanced Audio Coding) call CBR (i.e., not true |
| CBR due to the bit reservoir). |
| </t> |
| </section> |
| |
| <section title="Discontinuous Transmission (DTX)" toc="exclude"> |
| <t> |
| Discontinuous Transmission (DTX) reduces the bitrate during silence |
| or background noise. When DTX is enabled, only one frame is encoded |
| every 400 milliseconds. |
| </t> |
| </section> |
| |
| </section> |
| |
| </section> |
| |
| <section anchor="modes" title="Internal Framing"> |
| |
| <t> |
| The Opus encoder produces "packets", which are each a contiguous set of bytes |
| meant to be transmitted as a single unit. |
| The packets described here do not include such things as IP, UDP, or RTP |
| headers which are normally found in a transport-layer packet. |
| A single packet may contain multiple audio frames, so long as they share a |
| common set of parameters, including the operating mode, audio bandwidth, frame |
| size, and channel count (mono vs. stereo). |
| This section describes the possible combinations of these parameters and the |
| internal framing used to pack multiple frames into a single packet. |
| This framing is not self-delimiting. |
| Instead, it assumes that a higher layer (such as UDP or RTP <xref target='RFC3550'/> |
| or Ogg <xref target='RFC3533'/> or Matroska <xref target='Matroska-website'/>) |
| will communicate the length, in bytes, of the packet, and it uses this |
| information to reduce the framing overhead in the packet itself. |
| A decoder implementation MUST support the framing described in this section. |
| An alternative, self-delimiting variant of the framing is described in |
| <xref target="self-delimiting-framing"/>. |
| Support for that variant is OPTIONAL. |
| </t> |
| |
| <t> |
| All bit diagrams in this document number the bits so that bit 0 is the most |
| significant bit of the first byte, and bit 7 is the least significant. |
| Bit 8 is thus the most significant bit of the second byte, etc. |
| Well-formed Opus packets obey certain requirements, marked [R1] through [R7] |
| below. |
| These are summarized in <xref target="malformed-packets"/> along with |
| appropriate means of handling malformed packets. |
| </t> |
| |
| <section anchor="toc_byte" title="The TOC Byte"> |
| <t anchor="R1"> |
| A well-formed Opus packet MUST contain at least one byte [R1]. |
| This byte forms a table-of-contents (TOC) header that signals which of the |
| various modes and configurations a given packet uses. |
| It is composed of a configuration number, "config", a stereo flag, "s", and a |
| frame count code, "c", arranged as illustrated in |
| <xref target="toc_byte_fig"/>. |
| A description of each of these fields follows. |
| </t> |
| |
| <figure anchor="toc_byte_fig" title="The TOC Byte"> |
| <artwork align="center"><![CDATA[ |
| 0 |
| 0 1 2 3 4 5 6 7 |
| +-+-+-+-+-+-+-+-+ |
| | config |s| c | |
| +-+-+-+-+-+-+-+-+ |
| ]]></artwork> |
| </figure> |
| |
| <t> |
| The top five bits of the TOC byte, labeled "config", encode one of 32 possible |
| configurations of operating mode, audio bandwidth, and frame size. |
| As described, the LP (SILK) layer and MDCT (CELT) layer can be combined in three possible |
| operating modes: |
| <list style="numbers"> |
| <t>A SILK-only mode for use in low bitrate connections with an audio bandwidth |
| of WB or less,</t> |
| <t>A Hybrid (SILK+CELT) mode for SWB or FB speech at medium bitrates, and</t> |
| <t>A CELT-only mode for very low delay speech transmission as well as music |
| transmission (NB to FB).</t> |
| </list> |
| The 32 possible configurations each identify which one of these operating modes |
| the packet uses, as well as the audio bandwidth and the frame size. |
| <xref target="config_bits"/> lists the parameters for each configuration. |
| </t> |
| <texttable anchor="config_bits" title="TOC Byte Configuration Parameters"> |
| <ttcol>Configuration Number(s)</ttcol> |
| <ttcol>Mode</ttcol> |
| <ttcol>Bandwidth</ttcol> |
| <ttcol>Frame Sizes</ttcol> |
| <c>0...3</c> <c>SILK-only</c> <c>NB</c> <c>10, 20, 40, 60 ms</c> |
| <c>4...7</c> <c>SILK-only</c> <c>MB</c> <c>10, 20, 40, 60 ms</c> |
| <c>8...11</c> <c>SILK-only</c> <c>WB</c> <c>10, 20, 40, 60 ms</c> |
| <c>12...13</c> <c>Hybrid</c> <c>SWB</c> <c>10, 20 ms</c> |
| <c>14...15</c> <c>Hybrid</c> <c>FB</c> <c>10, 20 ms</c> |
| <c>16...19</c> <c>CELT-only</c> <c>NB</c> <c>2.5, 5, 10, 20 ms</c> |
| <c>20...23</c> <c>CELT-only</c> <c>WB</c> <c>2.5, 5, 10, 20 ms</c> |
| <c>24...27</c> <c>CELT-only</c> <c>SWB</c> <c>2.5, 5, 10, 20 ms</c> |
| <c>28...31</c> <c>CELT-only</c> <c>FB</c> <c>2.5, 5, 10, 20 ms</c> |
| </texttable> |
| <t> |
| The configuration numbers in each range (e.g., 0...3 for NB SILK-only) |
| correspond to the various choices of frame size, in the same order. |
| For example, configuration 0 has a 10 ms frame size and configuration 3 |
| has a 60 ms frame size. |
| </t> |
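| <t> |
| The mapping in the configuration table can be expressed as a small lookup. |
| The following C sketch is non-normative; its type and function names are |
| hypothetical, and frame sizes are given in microseconds so that 2.5 ms can be |
| represented as an integer. |
| </t> |
| <figure> |
| <artwork><![CDATA[ |
```c
/* Hypothetical types; not taken from the reference implementation. */
typedef enum { MODE_SILK_ONLY, MODE_HYBRID, MODE_CELT_ONLY } mode_sketch;
typedef enum { BW_NB, BW_MB, BW_WB, BW_SWB, BW_FB } bandwidth_sketch;

typedef struct {
    mode_sketch mode;
    bandwidth_sketch bandwidth;
    int frame_size_us;            /* e.g., 2500 for 2.5 ms */
} config_sketch;

/* Fills *out from a configuration number.
   Returns 0 on success, -1 if config is outside 0...31. */
static int describe_config(int config, config_sketch *out)
{
    static const int silk_sizes[4]   = {10000, 20000, 40000, 60000};
    static const int hybrid_sizes[2] = {10000, 20000};
    static const int celt_sizes[4]   = {2500, 5000, 10000, 20000};
    static const bandwidth_sketch celt_bw[4] = {BW_NB, BW_WB, BW_SWB, BW_FB};

    if (config < 0 || config > 31) {
        return -1;
    }
    if (config < 12) {                       /* 0...11: SILK-only */
        out->mode = MODE_SILK_ONLY;
        out->bandwidth = (bandwidth_sketch)(BW_NB + config / 4);
        out->frame_size_us = silk_sizes[config % 4];
    } else if (config < 16) {                /* 12...15: Hybrid */
        out->mode = MODE_HYBRID;
        out->bandwidth = config < 14 ? BW_SWB : BW_FB;
        out->frame_size_us = hybrid_sizes[config % 2];
    } else {                                 /* 16...31: CELT-only */
        out->mode = MODE_CELT_ONLY;
        out->bandwidth = celt_bw[(config - 16) / 4];
        out->frame_size_us = celt_sizes[config % 4];
    }
    return 0;
}
```
| ]]></artwork> |
| </figure> |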
| |
| <t> |
| One additional bit, labeled "s", signals mono vs. stereo, with 0 indicating |
| mono and 1 indicating stereo. |
| </t> |
| |
| <t> |
| The remaining two bits of the TOC byte, labeled "c", code the number of frames |
| per packet (codes 0 to 3) as follows: |
| <list style="symbols"> |
| <t>0: 1 frame in the packet</t> |
| <t>1: 2 frames in the packet, each with equal compressed size</t> |
| <t>2: 2 frames in the packet, with different compressed sizes</t> |
| <t>3: an arbitrary number of frames in the packet</t> |
| </list> |
| This draft refers to a packet as a code 0 packet, code 1 packet, etc., based on |
| the value of "c". |
| </t> |
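| <t> |
| Since bit 0 in the diagrams is the most significant bit, the three TOC fields |
| can be extracted with shifts and masks. |
| The following C sketch is non-normative, and the structure and function names |
| are hypothetical. |
| </t> |
| <figure> |
| <artwork><![CDATA[ |
```c
#include <stdint.h>

/* Hypothetical container for the three TOC fields. */
typedef struct {
    int config;   /* 0...31: mode, audio bandwidth and frame size */
    int stereo;   /* "s": 0 = mono, 1 = stereo */
    int code;     /* "c": frame count code, 0...3 */
} toc_fields;

static toc_fields parse_toc(uint8_t toc)
{
    toc_fields f;
    f.config = toc >> 3;          /* top five bits */
    f.stereo = (toc >> 2) & 1;    /* next bit */
    f.code   = toc & 3;           /* two least significant bits */
    return f;
}
```
| ]]></artwork> |
| </figure> |
| <t> |
| For example, a TOC byte of 0x7A yields config 15, s equal to 0, and c equal |
| to 2. |
| </t> |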
| |
| </section> |
| |
| <section title="Frame Packing"> |
| |
| <t> |
| This section describes how frames are packed according to each possible value |
| of "c" in the TOC byte. |
| </t> |
| |
| <section anchor="frame-length-coding" title="Frame Length Coding"> |
| <t> |
| When a packet contains multiple VBR frames (i.e., code 2 or 3), the compressed |
| length of one or more of these frames is indicated with a one- or two-byte |
| sequence, with the meaning of the first byte as follows: |
| <list style="symbols"> |
| <t>0: No frame (discontinuous transmission (DTX) or lost packet)</t> |
| <t>1...251: Length of the frame in bytes</t> |
| <t>252...255: A second byte is needed. The total length is (second_byte*4)+first_byte</t> |
| </list> |
| </t> |
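| <t> |
| The rules in the list above can be sketched in C as follows. |
| This is a non-normative illustration with a hypothetical function name; the |
| decoder source code in the reference implementation remains the normative |
| description. |
| </t> |
| <figure> |
| <artwork><![CDATA[ |
```c
#include <stddef.h>
#include <stdint.h>

/* Decodes a one- or two-byte frame length from data[0..len-1].
   On success, stores the length in *frame_len and returns the number of
   bytes consumed (1 or 2). Returns -1 if too few bytes are available. */
static int parse_frame_length(const uint8_t *data, size_t len,
                              int *frame_len)
{
    if (len < 1) {
        return -1;
    }
    if (data[0] < 252) {
        *frame_len = data[0];     /* 0 (no frame) or 1...251 bytes */
        return 1;
    }
    if (len < 2) {
        return -1;                /* first byte asks for a second one */
    }
    *frame_len = 4 * data[1] + data[0];
    return 2;
}
```
| ]]></artwork> |
| </figure> |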
| |
| <t> |
| The special length 0 indicates that no frame is available, either because it |
| was dropped during transmission by some intermediary or because the encoder |
| chose not to transmit it. |
| Any Opus frame in any mode MAY have a length of 0. |
| </t> |
| |
| <t> |
| The maximum representable length is 255*4+255=1275 bytes. |
| For 20 ms frames, this represents a bitrate of 510 kb/s, which is |
| approximately the highest useful rate for lossily compressed fullband stereo |
| music. |
| Beyond this point, lossless codecs are more appropriate. |
| It is also roughly the maximum useful rate of the MDCT layer, as shortly |
| thereafter quality no longer improves with additional bits due to limitations |
| on the codebook sizes. |
| </t> |
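| <t> |
| The arithmetic above can be verified with a short, non-normative C sketch. |
| Because 252 is a multiple of 4, the length formula implies that the first |
| byte of a two-byte length is 252 plus the length modulo 4, which leaves at |
| most 255 for the second byte. |
| The function and macro names are hypothetical. |
| </t> |
| <figure> |
| <artwork><![CDATA[ |
```c
#include <stdint.h>

#define MAX_FRAME_BYTES (255 * 4 + 255)   /* 1275 */

/* Writes the one- or two-byte coding of frame_len into out[].
   Returns the number of bytes written, or -1 if frame_len is outside
   0...1275. Inverse of: total = (second_byte*4) + first_byte. */
static int encode_frame_length(int frame_len, uint8_t out[2])
{
    if (frame_len < 0 || frame_len > MAX_FRAME_BYTES) {
        return -1;
    }
    if (frame_len < 252) {
        out[0] = (uint8_t)frame_len;
        return 1;
    }
    out[0] = (uint8_t)(252 + (frame_len & 3));
    out[1] = (uint8_t)((frame_len - out[0]) / 4);
    return 2;
}

/* Bitrate in bits per second for frames of frame_bytes bytes, each
   lasting frame_duration_us microseconds. */
static int64_t frame_bitrate_bps(int frame_bytes, int frame_duration_us)
{
    return (int64_t)frame_bytes * 8 * 1000000 / frame_duration_us;
}
```
| ]]></artwork> |
| </figure> |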
| |
| <t anchor="R2"> |
| No length is transmitted for the last frame in a VBR packet, or for any of the |
| frames in a CBR packet, as it can be inferred from the total size of the |
| packet and the size of all other data in the packet. |
| However, the length of any individual frame MUST NOT exceed |
| 1275 bytes [R2], to allow for repacketization by gateways, |
| conference bridges, or other software. |
| </t> |
| </section> |
| |
| <section title="Code 0: One Frame in the Packet"> |
| |
| <t> |
| For code 0 packets, the TOC byte is immediately followed by N-1 bytes |
| of compressed data for a single frame (where N is the size of the packet), |
| as illustrated in <xref target="code0_packet"/>. |
| </t> |
| <figure anchor="code0_packet" title="A Code 0 Packet" align="center"> |
| <artwork align="center"><![CDATA[ |
| 0 1 2 3 |
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | config |s|0|0| | |
| +-+-+-+-+-+-+-+-+ | |
| | Compressed frame 1 (N-1 bytes)... : |
| : | |
| | | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| ]]></artwork> |
| </figure> |
| </section> |
| |
| <section title="Code 1: Two Frames in the Packet, Each with Equal Compressed Size"> |
| <t anchor="R3"> |
| For code 1 packets, the TOC byte is immediately followed by the |
| (N-1)/2 bytes of compressed data for the first frame, followed by |
| (N-1)/2 bytes of compressed data for the second frame, as illustrated in |
| <xref target="code1_packet"/>. |
| The number of payload bytes available for compressed data, N-1, MUST be even |
| for all code 1 packets [R3]. |
| </t> |
| <figure anchor="code1_packet" title="A Code 1 Packet" align="center"> |
| <artwork align="center"><![CDATA[ |
| 0 1 2 3 |
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | config |s|0|1| | |
| +-+-+-+-+-+-+-+-+ : |
| | Compressed frame 1 ((N-1)/2 bytes)... | |
| : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | | | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ : |
| | Compressed frame 2 ((N-1)/2 bytes)... | |
| : +-+-+-+-+-+-+-+-+ |
| | | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| ]]></artwork> |
| </figure> |
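| <t> |
| A non-normative C sketch of splitting a code 1 packet follows. |
| The function name and its return convention are hypothetical; the checks |
| correspond to [R1], [R2], and [R3]. |
| </t> |
| <figure> |
| <artwork><![CDATA[ |
```c
#include <stddef.h>
#include <stdint.h>

/* Splits an n-byte code 1 packet into two equally sized frames.
   Returns 0 on success, -1 if the packet is empty, is not a code 1
   packet, has an odd payload size, or has frames above 1275 bytes. */
static int split_code1(const uint8_t *packet, size_t n,
                       const uint8_t **frame1, const uint8_t **frame2,
                       size_t *frame_size)
{
    if (n < 1 || (packet[0] & 3) != 1) {
        return -1;                    /* [R1], or wrong frame count code */
    }
    if ((n - 1) % 2 != 0) {
        return -1;                    /* [R3]: N-1 must be even */
    }
    if ((n - 1) / 2 > 1275) {
        return -1;                    /* [R2] */
    }
    *frame_size = (n - 1) / 2;
    *frame1 = packet + 1;
    *frame2 = packet + 1 + *frame_size;
    return 0;
}
```
| ]]></artwork> |
| </figure> |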
| </section> |
| |
| <section title="Code 2: Two Frames in the Packet, with Different Compressed Sizes"> |
| <t anchor="R4"> |
| For code 2 packets, the TOC byte is followed by a one- or two-byte sequence |
| indicating the length of the first frame (marked N1 in <xref target='code2_packet'/>), |
| followed by N1 bytes of compressed data for the first frame. |
| The remaining N-N1-2 or N-N1-3 bytes are the compressed data for the |
| second frame. |
| This is illustrated in <xref target="code2_packet"/>. |
| A code 2 packet MUST contain enough bytes to represent a valid length. |
| For example, a 1-byte code 2 packet is always invalid, and a 2-byte code 2 |
| packet whose second byte is in the range 252...255 is also invalid. |
| The length of the first frame, N1, MUST also be no larger than the size of the |
| payload remaining after decoding that length for all code 2 packets [R4]. |
| This makes, for example, a 2-byte code 2 packet with a second byte in the range |
| 1...251 invalid as well (the only valid 2-byte code 2 packet is one where the |
| length of both frames is zero). |
| </t> |
| <figure anchor="code2_packet" title="A Code 2 Packet" align="center"> |
| <artwork align="center"><![CDATA[ |
| 0 1 2 3 |
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | config |s|1|0| N1 (1-2 bytes): | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ : |
| | Compressed frame 1 (N1 bytes)... | |
| : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | | | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| | Compressed frame 2... : |
| : | |
| | | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| ]]></artwork> |
| </figure> |
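| <t> |
| As an illustration only, the checks above can be sketched in C as follows. |
| The helper names parse_frame_length() and parse_code2() are hypothetical and |
| are not part of the reference implementation. |
| The sketch assumes the one- or two-byte frame length coding defined earlier |
| in this document (a first byte of 0...251 is the length itself, while |
| 252...255 signals a second byte, giving a length of 4*second + first), and |
| it also applies the 1275-byte limit of [R2] to the second frame. |
| </t> |
| <figure> |
| <artwork><![CDATA[ |
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes a one- or two-byte frame length from
   data[0..avail-1]. Returns the number of length bytes consumed
   (1 or 2), or -1 if too few bytes are available. */
static int parse_frame_length(const uint8_t *data, size_t avail,
                              size_t *len)
{
    if (avail < 1) return -1;
    if (data[0] < 252) { *len = data[0]; return 1; }
    if (avail < 2) return -1;
    *len = 4 * (size_t)data[1] + data[0];
    return 2;
}

/* Hypothetical helper: splits a code 2 packet of n bytes (TOC byte
   included) into two frame sizes. Returns 0 on success, or -1 if
   the packet is not code 2 or violates [R2] or [R4]. */
static int parse_code2(const uint8_t *pkt, size_t n,
                       size_t *n1, size_t *n2)
{
    int used;
    size_t remaining;
    if (n < 1 || (pkt[0] & 0x3) != 2) return -1;
    used = parse_frame_length(pkt + 1, n - 1, n1);
    if (used < 0) return -1;            /* no valid length present */
    remaining = n - 1 - (size_t)used;
    if (*n1 > remaining) return -1;     /* frame 1 overruns packet */
    *n2 = remaining - *n1;
    if (*n2 > 1275) return -1;          /* implicit length too big */
    return 0;
}
```
| ]]></artwork> |
| </figure> |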
| </section> |
| |
| <section title="Code 3: A Signaled Number of Frames in the Packet"> |
| <t anchor="R5"> |
| Code 3 packets signal the number of frames, as well as additional |
| padding, called "Opus padding" to indicate that this padding is added at the |
| Opus layer, rather than at the transport layer. |
| Code 3 packets MUST have at least 2 bytes [R6,R7]. |
| The TOC byte is followed by a byte encoding the number of frames in the packet |
| in bits 2 to 7 (marked "M" in <xref target='frame_count_byte'/>), with bit 1 indicating whether |
| or not Opus padding is inserted (marked "p" in <xref target='frame_count_byte'/>), and bit 0 |
| indicating VBR (marked "v" in <xref target='frame_count_byte'/>). |
| M MUST NOT be zero, and the audio duration contained within a packet MUST NOT |
| exceed 120 ms [R5]. |
| This limits the maximum frame count for any frame size to 48 (for 2.5 ms |
| frames), with lower limits for longer frame sizes. |
| <xref target="frame_count_byte"/> illustrates the layout of the frame count |
| byte. |
| </t> |
| <figure anchor="frame_count_byte" title="The frame count byte"> |
| <artwork align="center"><![CDATA[ |
| 0 |
| 0 1 2 3 4 5 6 7 |
| +-+-+-+-+-+-+-+-+ |
| |v|p| M | |
| +-+-+-+-+-+-+-+-+ |
| ]]></artwork> |
| </figure> |
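| <t> |
| A minimal C sketch of unpacking the frame count byte and applying [R5] is |
| given below. |
| The structure and function names are hypothetical, and the frame duration |
| in microseconds is assumed to have been derived already from the config |
| field of the TOC byte. |
| Bit 0 in the figure above is the most significant bit of the byte. |
| </t> |
| <figure> |
| <artwork><![CDATA[ |
```c
#include <stdint.h>

/* Fields of the frame count byte (hypothetical struct). */
struct frame_count_info {
    int vbr;      /* "v": bit 0 of the figure, the MSB */
    int padding;  /* "p": bit 1 of the figure */
    int count;    /* "M": bits 2 to 7, the six LSBs */
};

/* Returns 0 and fills *out on success. Returns -1 if M is zero or
   if M frames of frame_us microseconds each exceed 120 ms [R5]. */
static int parse_frame_count(uint8_t b, unsigned long frame_us,
                             struct frame_count_info *out)
{
    out->vbr = (b >> 7) & 1;
    out->padding = (b >> 6) & 1;
    out->count = b & 0x3F;
    if (out->count == 0) return -1;
    if ((unsigned long)out->count * frame_us > 120000UL) return -1;
    return 0;
}
```
| ]]></artwork> |
| </figure> |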
| <t> |
| When Opus padding is used, the number of bytes of padding is encoded in the |
| bytes following the frame count byte. |
| Values from 0...254 indicate that 0...254 bytes of padding are included, |
| in addition to the byte(s) used to indicate the size of the padding. |
| If the value is 255, then the size of the additional padding is 254 bytes, |
| plus the padding value encoded in the next byte. |
| There MUST be at least one more byte in the packet in this case [R6,R7]. |
| The additional padding bytes appear at the end of the packet, and MUST be set |
| to zero by the encoder to avoid creating a covert channel. |
| The decoder MUST accept any value for the padding bytes, however. |
| </t> |
| <t> |
| Although this encoding provides multiple ways to indicate a given number of |
| padding bytes, each uses a different number of bytes to indicate the padding |
| size, and thus will increase the total packet size by a different amount. |
| For example, to add 255 bytes to a packet, set the padding bit, p, to 1, insert |
| a single byte after the frame count byte with a value of 254, and append 254 |
| padding bytes with the value zero to the end of the packet. |
| To add 256 bytes to a packet, set the padding bit to 1, insert two bytes after |
| the frame count byte with the values 255 and 0, respectively, and append 254 |
| padding bytes with the value zero to the end of the packet. |
| By using the value 255 multiple times, it is possible to create a packet of any |
| specific, desired size. |
| Let P be the number of header bytes used to indicate the padding size plus the |
| number of padding bytes themselves (i.e., P is the total number of bytes added |
| to the packet). |
| Then P MUST be no more than N-2 [R6,R7]. |
| </t> |
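| <t> |
| The padding length coding can be sketched in C as shown below. |
| This is an illustration under stated assumptions, not the reference |
| implementation: parse_padding() is a hypothetical name, and the caller is |
| assumed to pass a pointer to the byte that follows the frame count byte, |
| together with the N-2 bytes that remain at that point. |
| </t> |
| <figure> |
| <artwork><![CDATA[ |
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: parses the padding length bytes that follow
   the frame count byte. data points at the first padding length
   byte and avail is the number of packet bytes from there on (N-2).
   On success, returns 0 and sets *hdr to the number of length bytes
   read and *pad to the number of trailing padding bytes, so that
   P = *hdr + *pad. Returns -1 if a length byte is missing or if P
   would exceed avail. */
static int parse_padding(const uint8_t *data, size_t avail,
                         size_t *hdr, size_t *pad)
{
    size_t i = 0;
    uint8_t b;
    *pad = 0;
    do {
        if (i >= avail) return -1;      /* one more byte is needed */
        b = data[i++];
        *pad += (b == 255) ? 254 : b;   /* 255 means 254 and go on */
    } while (b == 255);
    *hdr = i;
    if (*hdr + *pad > avail) return -1; /* P must not exceed N-2 */
    return 0;
}
```
| ]]></artwork> |
| </figure> |
| <t> |
| With this sketch, a single length byte of 254 yields P = 255, and the |
| two length bytes 255 and 0 yield P = 256, matching the examples above. |
| </t> |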
| <t anchor="R6"> |
| In the CBR case, let R=N-2-P be the number of bytes remaining in the packet |
| after subtracting the (optional) padding. |
| Then the compressed length of each frame in bytes is equal to R/M. |
| The value R MUST be a non-negative integer multiple of M [R6]. |
| The compressed data for all M frames follows, each of size |
| R/M bytes, as illustrated in <xref target="code3cbr_packet"/>. |
| </t> |
| |
| <figure anchor="code3cbr_packet" title="A CBR Code 3 Packet" align="center"> |
| <artwork align="center"><![CDATA[ |
| 0 1 2 3 |
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | config |s|1|1|0|p| M | Padding length (Optional) : |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | | |
| : Compressed frame 1 (R/M bytes)... : |
| | | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | | |
| : Compressed frame 2 (R/M bytes)... : |
| | | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | | |
| : ... : |
| | | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | | |
| : Compressed frame M (R/M bytes)... : |
| | | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| : Opus Padding (Optional)... | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| ]]></artwork> |
| </figure> |
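| <t> |
| The CBR size computation reduces to a few integer checks. |
| The following C sketch uses a hypothetical function name and assumes that |
| the total padding overhead P has already been determined; it also applies |
| the 1275-byte limit of [R2], since CBR frame lengths are implicit. |
| </t> |
| <figure> |
| <artwork><![CDATA[ |
```c
#include <stddef.h>

/* Hypothetical helper: frame size for a CBR code 3 packet.
   n is the packet size N in bytes, p the total padding overhead P,
   and m the frame count M. Returns the size of each frame, or -1
   if the packet violates [R6] or [R2]. */
static long cbr_frame_size(size_t n, size_t p, unsigned m)
{
    size_t r;
    if (m == 0 || n < 2 || p > n - 2) return -1;
    r = n - 2 - p;                      /* R = N-2-P */
    if (r % m != 0) return -1;          /* R must be a multiple of M */
    if (r / m > 1275) return -1;        /* [R2] */
    return (long)(r / m);
}
```
| ]]></artwork> |
| </figure> |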
| |
| <t anchor="R7"> |
| In the VBR case, the (optional) padding length is followed by M-1 frame |
| lengths (indicated by "N1" to "N[M-1]" in <xref target='code3vbr_packet'/>), each encoded in a |
| one- or two-byte sequence as described above. |
| The packet MUST contain enough data for the M-1 lengths after removing the |
| (optional) padding, and the sum of these lengths MUST be no larger than the |
| number of bytes remaining in the packet after decoding them [R7]. |
| The compressed data for all M frames follows, each frame consisting of the |
| indicated number of bytes, with the final frame consuming any remaining bytes |
| before the final padding, as illustrated in <xref target="code3vbr_packet"/>. |
| The number of header bytes (TOC byte, frame count byte, padding length bytes, |
| and frame length bytes), plus the signaled length of the first M-1 frames themselves, |
| plus the signaled length of the padding MUST be no larger than N, the total size of the |
| packet. |
| </t> |
| |
| <figure anchor="code3vbr_packet" title="A VBR Code 3 Packet" align="center"> |
| <artwork align="center"><![CDATA[ |
| 0 1 2 3 |
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | config |s|1|1|1|p| M | Padding length (Optional) : |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| : N1 (1-2 bytes): N2 (1-2 bytes): ... : N[M-1] | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | | |
| : Compressed frame 1 (N1 bytes)... : |
| | | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | | |
| : Compressed frame 2 (N2 bytes)... : |
| | | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | | |
| : ... : |
| | | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | | |
| : Compressed frame M... : |
| | | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| : Opus Padding (Optional)... | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| ]]></artwork> |
| </figure> |
| </section> |
| </section> |
| |
| <section anchor="examples" title="Examples"> |
| <t> |
| Simplest case, one NB mono 20 ms SILK frame: |
| </t> |
| |
| <figure anchor='framing_example_1'> |
| <artwork><![CDATA[ |
| 0 1 2 3 |
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | 1 |0|0|0| compressed data... : |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| ]]></artwork> |
| </figure> |
| |
| <t> |
| Two FB mono 5 ms CELT frames of the same compressed size: |
| </t> |
| |
| <figure anchor='framing_example_2'> |
| <artwork><![CDATA[ |
| 0 1 2 3 |
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | 29 |0|0|1| compressed data... : |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| ]]></artwork> |
| </figure> |
| |
| <t> |
| Two FB mono 20 ms Hybrid frames of different compressed size: |
| </t> |
| |
| <figure anchor='framing_example_3'> |
| <artwork><![CDATA[ |
| 0 1 2 3 |
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | 15 |0|1|1|1|0| 2 | N1 | | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| | compressed data... : |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| ]]></artwork> |
| </figure> |
| |
| <t> |
| Four FB stereo 20 ms CELT frames of the same compressed size: |
| </t> |
| |
| <figure anchor='framing_example_4'> |
| <artwork><![CDATA[ |
| 0 1 2 3 |
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | 31 |1|1|1|0|0| 4 | compressed data... : |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| ]]></artwork> |
| </figure> |
| </section> |
| |
| <section anchor="malformed-packets" title="Receiving Malformed Packets"> |
| <t> |
| A receiver MUST NOT process packets which violate any of the rules above as |
| normal Opus packets. |
| They are reserved for future applications, such as in-band headers (containing |
| metadata, etc.). |
| Packets which violate these constraints may cause implementations of |
| <spanx style="emph">this</spanx> specification to treat them as malformed, and |
| discard them. |
| </t> |
| <t> |
| These constraints are summarized here for reference: |
| <list style="format [R%d]"> |
| <t>Packets are at least one byte.</t> |
| <t>No implicit frame length is larger than 1275 bytes.</t> |
| <t>Code 1 packets have an odd total length, N, so that (N-1)/2 is an |
| integer.</t> |
| <t>Code 2 packets have enough bytes after the TOC for a valid frame |
| length, and that length is no larger than the number of bytes remaining in the |
| packet.</t> |
| <t>Code 3 packets contain at least one frame, but no more than 120 ms |
| of audio total.</t> |
| <t>The length of a CBR code 3 packet, N, is at least two bytes, the number of |
| bytes added to indicate the padding size plus the trailing padding bytes |
| themselves, P, is no more than N-2, and the frame count, M, satisfies |
| the constraint that (N-2-P) is a non-negative integer multiple of M.</t> |
| <t>VBR code 3 packets are large enough to contain all the header bytes (TOC |
| byte, frame count byte, any padding length bytes, and any frame length bytes), |
| plus the length of the first M-1 frames, plus any trailing padding bytes.</t> |
| </list> |
| </t> |
| </section> |
| |
| </section> |
| |
| <section title="Opus Decoder"> |
| <t> |
| The Opus decoder consists of two main blocks: the SILK decoder and the CELT |
| decoder. |
| At any given time, one or both of the SILK and CELT decoders may be active. |
| The output of the Opus decoder is the sum of the outputs from the SILK and CELT |
| decoders with proper sample rate conversion and delay compensation on the SILK |
| side, and optional decimation (when decoding to sample rates less than |
| 48 kHz) on the CELT side, as illustrated in the block diagram below. |
| </t> |
| <figure> |
| <artwork> |
| <![CDATA[ |
| +---------+ +------------+ |
| | SILK | | Sample | |
| +->| Decoder |--->| Rate |----+ |
| Bit- +---------+ | | | | Conversion | v |
| stream | Range |---+ +---------+ +------------+ /---\ Audio |
| ------->| Decoder | | + |------> |
| | |---+ +---------+ +------------+ \---/ |
| +---------+ | | CELT | | Decimation | ^ |
| +->| Decoder |--->| (Optional) |----+ |
| | | | | |
| +---------+ +------------+ |
| ]]> |
| </artwork> |
| </figure> |
| |
| <section anchor="range-decoder" title="Range Decoder"> |
| <t> |
| Opus uses an entropy coder based on range coding <xref target="range-coding"></xref> |
| <xref target="Martin79"></xref>, |
| which is itself a rediscovery of the FIFO arithmetic code introduced by <xref target="coding-thesis"></xref>. |
| It is very similar to arithmetic encoding, except that encoding is done with |
| digits in any base instead of with bits, |
| so it is faster when using larger bases (e.g., a byte). All of the |
| calculations in the range coder must use bit-exact integer arithmetic. |
| </t> |
| <t> |
| Symbols may also be coded as "raw bits" packed directly into the bitstream, |
| bypassing the range coder. |
| These are packed backwards starting at the end of the frame, as illustrated in |
| <xref target="rawbits-example"/>. |
| This reduces complexity and makes the stream more resilient to bit errors, as |
| corruption in the raw bits will not desynchronize the decoding process, unlike |
| corruption in the input to the range decoder. |
| Raw bits are only used in the CELT layer. |
| </t> |
| |
| <figure anchor="rawbits-example" title="Illustrative example of packing range |
| coder and raw bits data"> |
| <artwork align="center"><![CDATA[ |
| 0 1 2 3 |
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | Range coder data (packed MSB to LSB) -> : |
| + + |
| : : |
| + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| : | <- Boundary occurs at an arbitrary bit position : |
| +-+-+-+ + |
| : <- Raw bits data (packed LSB to MSB) | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| ]]></artwork> |
| </figure> |
| |
| <t> |
| Each symbol coded by the range coder is drawn from a finite alphabet and coded |
| in a separate "context", which describes the size of the alphabet and the |
| relative frequency of each symbol in that alphabet. |
| </t> |
| <t> |
| Suppose there is a context with n symbols, identified with an index that ranges |
| from 0 to n-1. |
| The parameters needed to encode or decode symbol k in this context are |
| represented by a three-tuple (fl[k], fh[k], ft), with |
| 0 <= fl[k] < fh[k] <= ft <= 65535. |
| The values of this tuple are derived from the probability model for the |
| symbol, represented by traditional "frequency counts". |
| Because Opus uses static contexts these are not updated as symbols are decoded. |
| Let f[i] be the frequency of symbol i. |
| Then the three-tuple corresponding to symbol k is given by |
| </t> |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| k-1 n-1 |
| __ __ |
| fl[k] = \ f[i], fh[k] = fl[k] + f[k], ft = \ f[i] |
| /_ /_ |
| i=0 i=0 |
| ]]></artwork> |
| </figure> |
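| <t> |
| As a small illustration, the sums above can be computed in C as follows. |
| The function name symbol_tuple() is hypothetical; the frequency counts are |
| assumed to be stored in an array f of n entries. |
| </t> |
| <figure> |
| <artwork><![CDATA[ |
```c
#include <stdint.h>

/* Hypothetical helper: computes (fl[k], fh[k], ft) for symbol k
   from the n frequency counts f[0..n-1], with 0 <= k < n. */
static void symbol_tuple(const uint16_t *f, int n, int k,
                         uint32_t *fl, uint32_t *fh, uint32_t *ft)
{
    uint32_t sum = 0;
    int i;
    *fl = 0;
    for (i = 0; i < n; i++) {
        if (i == k) *fl = sum;   /* sum of f[0..k-1] */
        sum += f[i];
    }
    *fh = *fl + f[k];
    *ft = sum;                   /* sum of f[0..n-1] */
}
```
| ]]></artwork> |
| </figure> |
| <t> |
| For the frequency counts {1, 2, 5}, this gives (0, 1, 8) for symbol 0 and |
| (3, 8, 8) for symbol 2. |
| </t> |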
| <t> |
| The range decoder extracts the symbols and integers encoded using the range |
| encoder in <xref target="range-encoder"/>. |
| The range decoder maintains an internal state vector composed of the two-tuple |
| (val, rng), representing the difference between the high end of the |
| current range and the actual coded value, minus one, and the size of the |
| current range, respectively. |
| Both val and rng are 32-bit unsigned integer values. |
| </t> |
| |
| <section anchor="range-decoder-init" title="Range Decoder Initialization"> |
| <t> |
| Let b0 be the first input byte (or zero if there are no bytes in this Opus |
| frame). |
| The decoder initializes rng to 128 and initializes val to |
| (127 - (b0>>1)), where (b0>>1) is the top 7 bits of the |
| first input byte. |
| It saves the remaining bit, (b0&1), for use in the renormalization |
| procedure described in <xref target="range-decoder-renorm"/>, which the |
| decoder invokes immediately after initialization to read additional bits and |
| establish the invariant that rng > 2**23. |
| </t> |
| </section> |
| |
| <section anchor="decoding-symbols" title="Decoding Symbols"> |
| <t> |
| Decoding a symbol is a two-step process. |
| The first step determines a 16-bit unsigned value fs, which lies within the |
| range of some symbol in the current context. |
| The second step updates the range decoder state with the three-tuple |
| (fl[k], fh[k], ft) corresponding to that symbol. |
| </t> |
| <t> |
| The first step is implemented by ec_decode() (entdec.c), which computes |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| val |
| fs = ft - min(------ + 1, ft) . |
| rng/ft |
| ]]></artwork> |
| </figure> |
| Both divisions here are integer divisions. |
| </t> |
| <t> |
| The decoder then identifies the symbol in the current context corresponding to |
| fs; i.e., the value of k whose three-tuple (fl[k], fh[k], ft) |
| satisfies fl[k] <= fs < fh[k]. |
| It uses this tuple to update val according to |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| rng |
| val = val - --- * (ft - fh[k]) . |
| ft |
| ]]></artwork> |
| </figure> |
| If fl[k] is greater than zero, then the decoder updates rng using |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| rng |
| rng = --- * (fh[k] - fl[k]) . |
| ft |
| ]]></artwork> |
| </figure> |
| Otherwise, it updates rng using |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| rng |
| rng = rng - --- * (ft - fh[k]) . |
| ft |
| ]]></artwork> |
| </figure> |
| </t> |
| <t> |
| Using a special case for the first symbol (rather than the last symbol, as is |
| commonly done in other arithmetic coders) ensures that all the truncation |
| error from the finite precision arithmetic accumulates in symbol 0. |
| This makes the cost of coding a 0 slightly smaller, on average, than its |
| estimated probability indicates and makes the cost of coding any other symbol |
| slightly larger. |
| When contexts are designed so that 0 is the most probable symbol, which is |
| often the case, this strategy minimizes the inefficiency introduced by the |
| finite precision. |
| It also makes some of the special-case decoding routines in |
| <xref target="decoding-alternate"/> particularly simple. |
| </t> |
| <t> |
| After the updates, implemented by ec_dec_update() (entdec.c), the decoder |
| normalizes the range using the procedure in the next section, and returns the |
| index k. |
| </t> |
| |
| <section anchor="range-decoder-renorm" title="Renormalization"> |
| <t> |
| To normalize the range, the decoder repeats the following process, implemented |
| by ec_dec_normalize() (entdec.c), until rng > 2**23. |
| If rng is already greater than 2**23, the entire process is skipped. |
| First, it sets rng to (rng<<8). |
| Then it reads the next byte of the Opus frame and forms an 8-bit value sym, |
| using the left-over bit buffered from the previous byte as the high bit |
| and the top 7 bits of the byte just read as the other 7 bits of sym. |
| The remaining bit in the byte just read is buffered for use in the next |
| iteration. |
| If no more input bytes remain, it uses zero bits instead. |
| See <xref target="range-decoder-init"/> for the initialization used to process |
| the first byte. |
| Then, it sets |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| val = ((val<<8) + (255-sym)) & 0x7FFFFFFF . |
| ]]></artwork> |
| </figure> |
| </t> |
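| <t> |
| The initialization, symbol decoding, update, and renormalization steps |
| described so far can be combined into the following minimal C sketch. |
| It is an illustration only: the structure and function names are |
| hypothetical, the code keeps just the single left-over bit between bytes, |
| and it is not a substitute for ec_decode(), ec_dec_update(), and |
| ec_dec_normalize() in the reference implementation. |
| </t> |
| <figure> |
| <artwork><![CDATA[ |
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state for a minimal range decoder sketch. */
struct rdec {
    const uint8_t *buf;  /* bytes of the current frame */
    size_t len;          /* frame size in bytes */
    size_t pos;          /* index of the next byte to read */
    uint32_t rng;        /* size of the current range */
    uint32_t val;        /* high end of range - coded value - 1 */
    unsigned rem;        /* left-over bit of the last byte read */
};

/* Returns the next frame byte, or zero once the frame is used up. */
static unsigned rdec_next_byte(struct rdec *d)
{
    return d->pos < d->len ? d->buf[d->pos++] : 0;
}

/* Renormalization: repeats until rng > 2**23. */
static void rdec_normalize(struct rdec *d)
{
    while (d->rng <= ((uint32_t)1 << 23)) {
        unsigned b = rdec_next_byte(d);
        unsigned sym = ((d->rem << 7) | (b >> 1)) & 0xFF;
        d->rem = b & 1;
        d->rng <<= 8;
        d->val = ((d->val << 8) + (255 - sym)) & 0x7FFFFFFF;
    }
}

/* Initialization from the first byte, then renormalization. */
static void rdec_init(struct rdec *d, const uint8_t *buf, size_t len)
{
    unsigned b0;
    d->buf = buf;
    d->len = len;
    d->pos = 0;
    b0 = rdec_next_byte(d);
    d->rng = 128;
    d->val = 127 - (b0 >> 1);
    d->rem = b0 & 1;
    rdec_normalize(d);
}

/* First step: returns fs for a context with total frequency ft,
   where 1 <= ft <= 65535. */
static uint32_t rdec_decode(const struct rdec *d, uint32_t ft)
{
    uint32_t q = d->val / (d->rng / ft) + 1;
    return ft - (q < ft ? q : ft);
}

/* Second step: applies the tuple (fl, fh, ft) of the decoded
   symbol, then renormalizes. */
static void rdec_update(struct rdec *d, uint32_t fl, uint32_t fh,
                        uint32_t ft)
{
    uint32_t s = (d->rng / ft) * (ft - fh);
    d->val -= s;
    d->rng = fl > 0 ? (d->rng / ft) * (fh - fl) : d->rng - s;
    rdec_normalize(d);
}
```
| ]]></artwork> |
| </figure> |
| <t> |
| A caller of this sketch would obtain fs from rdec_decode(), search the |
| context for the symbol k whose range contains fs, and then pass the |
| corresponding tuple to rdec_update(). |
| </t> |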
| <t> |
| It is normal and expected that the range decoder will read several bytes |
| into the raw bits data (if any) at the end of the packet by the time the frame |
| is completely decoded, as illustrated in <xref target="finalize-example"/>. |
| This same data MUST also be returned as raw bits when requested. |
| The encoder is expected to terminate the stream in such a way that the decoder |
| will decode the intended values regardless of the data contained in the raw |
| bits. |
| <xref target="encoder-finalizing"/> describes a procedure for doing this. |
| If the range decoder consumes all of the bytes belonging to the current frame, |
| it MUST continue to use zero when any further input bytes are required, even |
| if there is additional data in the current packet from padding or other |
| frames. |
| </t> |
| |
| <figure anchor="finalize-example" title="Illustrative example of raw bits |
| overlapping range coder data"> |
| <artwork align="center"><![CDATA[ |
| n n+1 n+2 n+3 |
| 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| : | <----------- Overlap region ------------> | : |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| ^ ^ |
| | End of data buffered by the range coder | |
| ...-----------------------------------------------+ |
| | |
| | End of data consumed by raw bits |
| +-------------------------------------------------------... |
| ]]></artwork> |
| </figure> |
| </section> |
| </section> |
| |
| <section anchor="decoding-alternate" title="Alternate Decoding Methods"> |
| <t> |
| The reference implementation uses three additional decoding methods that are |
| exactly equivalent to the above, but make assumptions and simplifications that |
| allow for a more efficient implementation. |
| </t> |
| <section anchor="ec_decode_bin" title="ec_decode_bin()"> |
| <t> |
| The first is ec_decode_bin() (entdec.c), defined using the parameter ftb |
| instead of ft. |
| It is mathematically equivalent to calling ec_decode() with |
| ft = (1<<ftb), but avoids one of the divisions. |
| </t> |
| </section> |
| <section anchor="ec_dec_bit_logp" title="ec_dec_bit_logp()"> |
| <t> |
| The next is ec_dec_bit_logp() (entdec.c), which decodes a single binary symbol, |
| replacing both the ec_decode() and ec_dec_update() steps. |
| The context is described by a single parameter, logp, which is the absolute |
| value of the base-2 logarithm of the probability of a "1". |
| It is mathematically equivalent to calling ec_decode() with |
| ft = (1<<logp), followed by ec_dec_update() with |
| the 3-tuple (fl[k] = 0, |
| fh[k] = (1<<logp) - 1, |
| ft = (1<<logp)) if the returned value |
| of fs is less than (1<<logp) - 1 (a "0" was decoded), and with |
| (fl[k] = (1<<logp) - 1, |
| fh[k] = ft = (1<<logp)) otherwise (a "1" was |
| decoded). |
| The implementation requires no multiplications or divisions. |
| </t> |
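| <t> |
| The equivalence described above can be illustrated with the following C |
| sketch, which operates on a bare (rng, val) pair. |
| The function name is hypothetical, and the renormalization that must follow |
| the update is omitted for brevity. |
| </t> |
| <figure> |
| <artwork><![CDATA[ |
```c
#include <stdint.h>

/* Sketch of the binary-symbol shortcut on a bare (rng, val) state.
   Returns the decoded bit. The renormalization that must follow is
   not performed here. */
static int dec_bit_logp(uint32_t *rng, uint32_t *val, unsigned logp)
{
    uint32_t s = *rng >> logp;   /* equals rng/ft for ft = 1<<logp */
    int bit = *val < s;          /* a "1" has probability 2**-logp */
    if (bit) {
        *rng = s;                /* tuple (ft - 1, ft, ft) */
    } else {
        *val -= s;               /* tuple (0, ft - 1, ft) */
        *rng -= s;
    }
    return bit;
}
```
| ]]></artwork> |
| </figure> |
| <t> |
| Only a shift, a comparison, and subtractions are needed, which is why no |
| multiplications or divisions appear. |
| </t> |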
| </section> |
| <section anchor="ec_dec_icdf" title="ec_dec_icdf()"> |
| <t> |
| The last is ec_dec_icdf() (entdec.c), which decodes a single symbol with a |
| table-based context of up to 8 bits, also replacing both the ec_decode() and |
| ec_dec_update() steps, as well as the search for the decoded symbol in between. |
| The context is described by two parameters, an icdf |
| ("inverse" cumulative distribution function) table and ftb. |
| As with ec_decode_bin(), (1<<ftb) is equivalent to ft. |
| icdf[k], on the other hand, stores (1<<ftb)-fh[k], which is equal to |
| (1<<ftb) - fl[k+1]. |
| fl[0] is assumed to be 0, and the table is terminated by a value of 0 (where |
| fh[k] == ft). |
| </t> |
| <t> |
| The function is mathematically equivalent to calling ec_decode() with |
| ft = (1<<ftb), using the returned value fs to search the table |
| for the first entry where fs < (1<<ftb)-icdf[k], and |
| calling ec_dec_update() with |
| fl[k] = (1<<ftb) - icdf[k-1] (or 0 |
| if k == 0), fh[k] = (1<<ftb) - icdf[k], |
| and ft = (1<<ftb). |
| Combining the search with the update allows the division to be replaced by a |
| series of multiplications (which are usually much cheaper), and using an |
| inverse CDF allows the use of an ftb as large as 8 in an 8-bit table without |
| any special cases. |
| This is the primary interface with the range decoder in the SILK layer, though |
| it is used in a few places in the CELT layer as well. |
| </t> |
| <t> |
| Although icdf[k] is more convenient for the code, the frequency counts, f[k], |
| are a more natural representation of the probability distribution function |
| (PDF) for a given symbol. |
| Therefore this draft lists the latter, not the former, when describing the |
| context in which a symbol is coded as a list, e.g., {4, 4, 4, 4}/16 for a |
| uniform context with four possible values and ft = 16. |
| The value of ft after the slash is always the sum of the entries in the PDF, |
| but is included for convenience. |
| Contexts with identical probabilities, f[k]/ft, but different values of ft |
| (or equivalently, ftb) are not the same, and cannot, in general, be used in |
| place of one another. |
| An icdf table is also not capable of representing a PDF where the first symbol |
| has 0 probability. |
| In such contexts, ec_dec_icdf() can decode the symbol by using a table that |
| drops the entries for any initial zero-probability values and adding the |
| constant offset of the first value with a non-zero probability to its return |
| value. |
| </t> |
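| <t> |
| The relationship between a PDF and its icdf table, and the combined search |
| and update, can be sketched in C as follows. |
| Both function names are hypothetical, the decoder state is reduced to a bare |
| (rng, val) pair, and the renormalization that must follow is omitted. |
| The sketch assumes that the first symbol has a non-zero probability and that |
| the frequency counts sum to (1<<ftb). |
| </t> |
| <figure> |
| <artwork><![CDATA[ |
```c
#include <stdint.h>

/* Hypothetical helper: builds an icdf table from n frequency counts
   whose sum is 1<<ftb, storing icdf[k] = (1<<ftb) - fh[k]. */
static void pdf_to_icdf(const uint16_t *f, int n, unsigned ftb,
                        uint8_t *icdf)
{
    unsigned fh = 0;
    int k;
    for (k = 0; k < n; k++) {
        fh += f[k];
        icdf[k] = (uint8_t)((1u << ftb) - fh);
    }
}

/* Sketch of the combined search and update on a bare (rng, val)
   state. Returns the decoded symbol index k. The renormalization
   that must follow is not performed here. */
static int dec_icdf(uint32_t *rng, uint32_t *val,
                    const uint8_t *icdf, unsigned ftb)
{
    uint32_t r = *rng >> ftb;    /* equals rng/ft */
    uint32_t s = *rng;
    uint32_t t;
    int k = -1;
    do {
        t = s;                   /* r*icdf[k-1], or rng if k == 0 */
        s = r * icdf[++k];       /* r*(ft - fh[k]) */
    } while (*val < s);
    *val -= s;
    *rng = t - s;
    return k;
}
```
| ]]></artwork> |
| </figure> |
| <t> |
| For the PDF {4, 4, 4, 4}/16, pdf_to_icdf() produces the table |
| {12, 8, 4, 0}. |
| </t> |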
| </section> |
| </section> |
| |
| <section anchor="decoding-bits" title="Decoding Raw Bits"> |
| <t> |
| The raw bits used by the CELT layer are packed at the end of the packet, with |
| the least significant bit of the first value packed in the least significant |
| bit of the last byte, filling up to the most significant bit in the last byte, |
| continuing on to the least significant bit of the penultimate byte, and so on. |
| The reference implementation reads them using ec_dec_bits() (entdec.c). |
| Because the range decoder must read several bytes ahead in the stream, as |
| described in <xref target="range-decoder-renorm"/>, the input consumed by the |
| raw bits may overlap with the input consumed by the range coder, and a decoder |
| MUST allow this. |
| The format should render it impossible to attempt to read more raw bits than |
| there are actual bits in the frame, though a decoder may wish to check for |
| this and report an error. |
| </t> |
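| <t> |
| The packing order can be illustrated with the following C sketch. |
| It is deliberately simple and reads one bit at a time; the structure and |
| function names are hypothetical, and it does not reproduce the buffering |
| used by ec_dec_bits(). |
| Bits requested beyond the start of the frame are returned as zero. |
| </t> |
| <figure> |
| <artwork><![CDATA[ |
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical raw-bit reader state. */
struct rawbits {
    const uint8_t *buf;  /* bytes of the current frame */
    size_t len;          /* frame size in bytes */
    size_t used;         /* raw bits consumed so far */
};

/* Reads n raw bits (n <= 32), starting from the least significant
   bit of the last byte and moving toward the start of the frame.
   The first bit read becomes the least significant bit of the
   returned value. */
static uint32_t rawbits_read(struct rawbits *rb, unsigned n)
{
    uint32_t v = 0;
    unsigned i;
    for (i = 0; i < n; i++, rb->used++) {
        size_t idx = rb->used / 8;   /* bytes back from the end */
        if (idx < rb->len) {
            uint8_t b = rb->buf[rb->len - 1 - idx];
            v |= (uint32_t)((b >> (rb->used % 8)) & 1) << i;
        }
    }
    return v;
}
```
| ]]></artwork> |
| </figure> |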
| </section> |
| |
| <section anchor="ec_dec_uint" title="Decoding Uniformly Distributed Integers"> |
| <t> |
| The function ec_dec_uint() (entdec.c) decodes one of ft equiprobable values in |
| the range 0 to (ft - 1), inclusive, each with a frequency of 1, |
| where ft may be as large as (2**32 - 1). |
| Because ec_decode() is limited to a total frequency of (2**16 - 1), |
| it splits up the value into a range coded symbol representing up to 8 of the |
| high bits, and, if necessary, raw bits representing the remainder of the |
| value. |
| The limit of 8 bits in the range coded symbol is a trade-off between |
| implementation complexity, modeling error (since the symbols no longer truly |
| have equal coding cost), and rounding error introduced by the range coder |
| itself (which gets larger as more bits are included). |
| Using raw bits reduces the maximum number of divisions required in the worst |
| case, but means that it may be possible to decode a value outside the range |
| 0 to (ft - 1), inclusive. |
| </t> |
| |
| <t> |
| ec_dec_uint() takes a single, positive parameter, ft, which is not necessarily |
| a power of two, and returns an integer, t, whose value lies between 0 and |
| (ft - 1), inclusive. |
| Let ftb = ilog(ft - 1), i.e., the number of bits required |
| to store (ft - 1) as an unsigned binary integer. |
| If ftb is 8 or less, then t is decoded with t = ec_decode(ft), and |
| the range coder state is updated using the three-tuple (t, t + 1, |
| ft). |
| </t> |
| <t> |
| If ftb is greater than 8, then the top 8 bits of t are decoded using |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| t = ec_decode(((ft - 1) >> (ftb - 8)) + 1) , |
| ]]></artwork> |
| </figure> |
| the decoder state is updated using the three-tuple |
| (t, t + 1, |
| ((ft - 1) >> (ftb - 8)) + 1), |
| and the remaining bits are decoded as raw bits, setting |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| t = (t << (ftb - 8)) | ec_dec_bits(ftb - 8) . |
| ]]></artwork> |
| </figure> |
| If, at this point, t >= ft, then the current frame is corrupt. |
| In that case, the decoder should assume there has been an error in the coding, |
| decoding, or transmission and SHOULD take measures to conceal the |
| error and/or report to the application that the error has occurred. |
| </t> |
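| <t> |
| The way ec_dec_uint() divides a value between the range coder and raw bits |
| can be sketched in C as follows. |
| The helpers uint_split() and uint_join() are hypothetical and only perform |
| the arithmetic; the range-coded part and the raw bits are assumed to have |
| been decoded elsewhere. |
| ilog(x) is taken to be the number of bits needed to store x, with |
| ilog(0) equal to 0. |
| </t> |
| <figure> |
| <artwork><![CDATA[ |
```c
#include <stdint.h>

/* ilog(x): number of bits needed to store x; ilog(0) is 0. */
static int ilog(uint32_t x)
{
    int n = 0;
    while (x) { n++; x >>= 1; }
    return n;
}

/* Hypothetical helper: for a given ft >= 1, computes the total
   frequency of the range-coded part and the number of raw bits. */
static void uint_split(uint32_t ft, uint32_t *ft_hi, int *raw_bits)
{
    int ftb = ilog(ft - 1);
    if (ftb <= 8) {
        *ft_hi = ft;             /* the whole value is range coded */
        *raw_bits = 0;
    } else {
        *raw_bits = ftb - 8;
        *ft_hi = ((ft - 1) >> *raw_bits) + 1;
    }
}

/* Hypothetical helper: reassembles t from the range-coded high
   part and the raw bits. Returns -1 if t >= ft, which indicates a
   corrupt frame. */
static int64_t uint_join(uint32_t ft, uint32_t hi, uint32_t raw,
                         int raw_bits)
{
    uint32_t t = (hi << raw_bits) | raw;
    return t < ft ? (int64_t)t : -1;
}
```
| ]]></artwork> |
| </figure> |
| <t> |
| For ft = 257, this gives a range-coded symbol with a total frequency of |
| 129 plus one raw bit, so the out-of-range value t = 257 can be decoded from |
| a corrupt frame. |
| </t> |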
| |
| </section> |
| |
| <section anchor="decoder-tell" title="Current Bit Usage"> |
| <t> |
| The bit allocation routines in the CELT decoder need a conservative upper bound |
| on the number of bits that have been used from the current frame thus far, |
| including both range coder bits and raw bits. |
| This drives allocation decisions that must match those made in the encoder. |
| The upper bound is computed in the reference implementation to whole-bit |
| precision by the function ec_tell() (entcode.h) and to fractional 1/8th bit |
| precision by the function ec_tell_frac() (entcode.c). |
| Like all operations in the range coder, it must be implemented in a bit-exact |
| manner, and must produce exactly the same value returned by the same functions |
| in the encoder after encoding the same symbols. |
| </t> |
| <t> |
| ec_tell() is guaranteed to return ceil(ec_tell_frac()/8.0). |
| In various places the codec will check to ensure there is enough room to |
| contain a symbol before attempting to decode it. |
| In practice, although the number of bits used so far is an upper bound, |
| decoding a symbol whose probability model suggests it has a worst-case cost of |
| p 1/8th bits may actually advance the return value of ec_tell_frac() by |
| p-1, p, or p+1 1/8th bits, due to approximation error in that upper bound, |
| truncation error in the range coder, and for large values of ft, modeling |
| error in ec_dec_uint(). |
| </t> |
| <t> |
| However, this error is bounded, and periodic calls to ec_tell() or |
| ec_tell_frac() at precisely defined points in the decoding process prevent it |
| from accumulating. |
| For a range coder symbol that requires a whole number of bits (i.e., |
| for which ft/(fh[k] - fl[k]) is a power of two), where there are at |
| least p 1/8th bits available, decoding the symbol will never cause ec_tell() or |
| ec_tell_frac() to exceed the size of the frame ("bust the budget"). |
| In this case the return value of ec_tell_frac() will only advance by more than |
| p 1/8th bits if there was an additional, fractional number of bits remaining, |
| and it will never advance beyond the next whole-bit boundary, which is safe, |
| since frames always contain a whole number of bits. |
| However, when p is not a whole number of bits, an extra 1/8th bit is required |
| to ensure that decoding the symbol will not bust the budget. |
| </t> |
| <t> |
| The reference implementation keeps track of the total number of whole bits that |
| have been processed by the decoder so far in the variable nbits_total, |
| including the (possibly fractional) number of bits that are currently |
| buffered, but not consumed, inside the range coder. |
| nbits_total is initialized to 9 just before the initial range renormalization |
| process completes (or equivalently, it can be initialized to 33 after the |
| first renormalization). |
| The extra two bits over the actual amount buffered by the range coder |
| guarantee that it is an upper bound and that there is enough room for the |
| encoder to terminate the stream. |
| Each iteration through the range coder's renormalization loop increases |
| nbits_total by 8. |
| Reading raw bits increases nbits_total by the number of raw bits read. |
| </t> |
| |
| <section anchor="ec_tell" title="ec_tell()"> |
| <t> |
| The whole number of bits buffered in rng may be estimated via lg = ilog(rng). |
| ec_tell() then becomes a simple matter of removing these bits from the total. |
| It returns (nbits_total - lg). |
| </t> |
| <t> |
| In a newly initialized decoder, before any symbols have been read, this reports |
| that 1 bit has been used. |
| This is the bit reserved for termination of the encoder. |
| </t> |
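| <t> |
| As a minimal sketch of the computation above (Python is used purely for |
| illustration; passing nbits_total and rng as plain arguments is an assumption |
| of this sketch and is not the reference implementation's interface, and the |
| value rng = 2**31 for a freshly initialized decoder is likewise assumed): |
| </t> |
| <figure align="left"> |
| <artwork align="left"><![CDATA[ |
```python
def ilog(x):
    """Minimum number of bits needed to store x (0 when x is 0)."""
    return x.bit_length()


def ec_tell(nbits_total, rng):
    """Whole bits used so far: the running total minus the bits
    that are still buffered inside rng."""
    return nbits_total - ilog(rng)


# Hypothetical freshly initialized decoder: nbits_total is 33 after the
# first renormalization and rng is assumed to be 2**31 at that point.
initial_bits_used = ec_tell(33, 1 << 31)
```
| ]]></artwork> |
| </figure> |
| <t> |
| Under these assumptions, initial_bits_used evaluates to 1, matching the single |
| bit reserved for termination of the encoder. |
| </t> |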
| </section> |
| |
| <section anchor="ec_tell_frac" title="ec_tell_frac()"> |
| <t> |
| ec_tell_frac() estimates the number of bits buffered in rng to fractional |
| precision. |
| Since rng must be greater than 2**23 after renormalization, lg must be at least |
| 24. |
| Let |
| <figure align="center"> |
| <artwork align="center"> |
| <![CDATA[ |
| r_Q15 = rng >> (lg-16) , |
| ]]></artwork> |
| </figure> |
| so that 32768 &lt;= r_Q15 &lt; 65536, an unsigned Q15 value representing the |
| fractional part of rng. |
| Then the following procedure can be used to add one bit of precision to lg. |
| First, update |
| <figure align="center"> |
| <artwork align="center"> |
| <![CDATA[ |
| r_Q15 = (r_Q15*r_Q15) >> 15 . |
| ]]></artwork> |
| </figure> |
| Then add the 16th bit of r_Q15 to lg via |
| <figure align="center"> |
| <artwork align="center"> |
| <![CDATA[ |
| lg = 2*lg + (r_Q15 >> 16) . |
| ]]></artwork> |
| </figure> |
| Finally, if this bit was a 1, reduce r_Q15 by a factor of two via |
| <figure align="center"> |
| <artwork align="center"> |
| <![CDATA[ |
| r_Q15 = r_Q15 >> 1 , |
| ]]></artwork> |
| </figure> |
| so that it once again lies in the range 32768 &lt;= r_Q15 &lt; 65536. |
| </t> |
| <t> |
| This procedure is repeated three times to extend lg to 1/8th bit precision. |
| ec_tell_frac() then returns (nbits_total*8 - lg). |
| </t> |
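| <t> |
| The procedure above can be sketched as follows. |
| This is an illustrative Python transcription, not the reference code; taking |
| nbits_total and rng as arguments is an assumption of the sketch, and rng is |
| assumed to have been renormalized already. |
| </t> |
| <figure align="left"> |
| <artwork align="left"><![CDATA[ |
```python
def ec_tell_frac(nbits_total, rng):
    """Bits used so far, in units of 1/8th bit.

    Assumes rng has already been renormalized (rng > 2**23).
    """
    lg = rng.bit_length()            # ilog(rng), at least 24 here
    r_q15 = rng >> (lg - 16)         # 32768 <= r_q15 < 65536
    for _ in range(3):               # each pass adds one bit of precision
        r_q15 = (r_q15 * r_q15) >> 15
        bit = r_q15 >> 16            # 1 when the square reached 65536
        lg = 2 * lg + bit
        r_q15 >>= bit                # halve only when bit is 1
    return nbits_total * 8 - lg
```
| ]]></artwork> |
| </figure> |
| <t> |
| For the initial state assumed to be nbits_total = 33 and rng = 2**31, this |
| returns 8, i.e., exactly one whole bit, in agreement with ec_tell(). |
| </t> |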
| </section> |
| |
| </section> |
| |
| </section> |
| |
| <section anchor="silk_decoder_outline" title="SILK Decoder"> |
| <t> |
| The decoder's LP layer uses a modified version of the SILK codec (herein simply |
| called "SILK"), which runs a decoded excitation signal through adaptive |
| long-term and short-term prediction synthesis filters. |
| It runs at NB, MB, and WB sample rates internally. |
| When used in a SWB or FB Hybrid frame, the LP layer itself still only runs in |
| WB. |
| </t> |
| |
| <section title="SILK Decoder Modules"> |
| <t> |
| An overview of the decoder is given in <xref target="silk_decoder_figure"/>. |
| </t> |
| <figure align="center" anchor="silk_decoder_figure" title="SILK Decoder"> |
| <artwork align="center"> |
| <![CDATA[ |
| +---------+ +------------+ |
| -->| Range |--->| Decode |---------------------------+ |
| 1 | Decoder | 2 | Parameters |----------+ 5 | |
| +---------+ +------------+ 4 | | |
| 3 | | | |
| \/ \/ \/ |
| +------------+ +------------+ +------------+ |
| | Generate |-->| LTP |-->| LPC | |
| | Excitation | | Synthesis | | Synthesis | |
| +------------+ +------------+ +------------+ |
| ^ | |
| | | |
| +-------------------+----------------+ |
| | 6 |
| | +------------+ +-------------+ |
| +-->| Stereo |-->| Sample Rate |--> |
| | Unmixing | 7 | Conversion | 8 |
| +------------+ +-------------+ |
| |
| 1: Range encoded bitstream |
| 2: Coded parameters |
| 3: Pulses, LSBs, and signs |
| 4: Pitch lags, Long-Term Prediction (LTP) coefficients |
| 5: Linear Predictive Coding (LPC) coefficients and gains |
| 6: Decoded signal (mono or mid-side stereo) |
| 7: Unmixed signal (mono or left-right stereo) |
| 8: Resampled signal |
| ]]> |
| </artwork> |
| </figure> |
| |
| <t> |
| The decoder feeds the bitstream (1) to the range decoder from |
| <xref target="range-decoder"/>, and then decodes the parameters in it (2) |
| using the procedures detailed in |
| Sections <xref format="counter" target="silk_header_bits"/> |
| through <xref format="counter" target="silk_signs"/>. |
| These parameters (3, 4, 5) are used to generate an excitation signal (see |
| <xref target="silk_excitation_reconstruction"/>), which is fed to an optional |
| long-term prediction (LTP) filter (voiced frames only, see |
| <xref target="silk_ltp_synthesis"/>) and then a short-term prediction filter |
| (see <xref target="silk_lpc_synthesis"/>), producing the decoded signal (6). |
| For stereo streams, the mid-side representation is converted to separate left |
| and right channels (7). |
| The result is finally resampled to the desired output sample rate (e.g., |
| 48 kHz) so that the resampled signal (8) can be mixed with the CELT |
| layer. |
| </t> |
| |
| </section> |
| |
| <section anchor="silk_layer_organization" title="LP Layer Organization"> |
| |
| <t> |
| Internally, the LP layer of a single Opus frame is composed of either a single |
| 10 ms regular SILK frame or between one and three 20 ms regular SILK |
| frames. |
| A stereo Opus frame may double the number of regular SILK frames (up to a total |
| of six), since it includes separate frames for a mid channel and, optionally, |
| a side channel. |
| Optional Low Bit-Rate Redundancy (LBRR) frames, which are reduced-bitrate |
| encodings of previous SILK frames, may be included to aid in recovery from |
| packet loss. |
| If present, these appear before the regular SILK frames. |
| They are in most respects identical to regular, active SILK frames, except that |
| they are usually encoded with a lower bitrate. |
| This draft uses "SILK frame" to refer to either one and "regular SILK frame" if |
| it needs to draw a distinction between the two. |
| </t> |
| <t> |
| Logically, each SILK frame is in turn composed of either two or four 5 ms |
| subframes. |
| Various parameters, such as the quantization gain of the excitation, the |
| pitch lag, and the filter coefficients, can vary on a subframe-by-subframe basis. |
| Physically, the parameters for each subframe are interleaved in the bitstream, |
| as described in the relevant sections for each parameter. |
| </t> |
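| <t> |
| As an illustrative sketch of the framing rules above (Python is used only for |
| exposition; silk_layout() and its return layout are hypothetical and are not |
| part of the reference implementation), the number of regular SILK frames and |
| subframes can be derived from the Opus frame duration: |
| </t> |
| <figure align="left"> |
| <artwork align="left"><![CDATA[ |
```python
def silk_layout(opus_frame_ms, stereo):
    """Return (SILK frames per channel, 5 ms subframes per SILK frame,
    maximum number of regular SILK frames in the Opus frame)."""
    if opus_frame_ms == 10:
        frames, subframes = 1, 2      # a single 10 ms SILK frame
    elif opus_frame_ms in (20, 40, 60):
        frames, subframes = opus_frame_ms // 20, 4   # 20 ms SILK frames
    else:
        raise ValueError("unsupported LP layer frame duration")
    # Stereo adds one (optional) side frame per mid frame.
    max_regular = frames * (2 if stereo else 1)
    return frames, subframes, max_regular
```
| ]]></artwork> |
| </figure> |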
| <t> |
| All of these frames and subframes are decoded from the same range coder, with |
| no padding between them. |
| Thus packing multiple SILK frames in a single Opus frame saves, on average, |
| half a byte per SILK frame. |
| It also allows some parameters to be predicted from prior SILK frames in the |
| same Opus frame, since this does not degrade packet loss robustness (beyond |
| any penalty for merely using fewer, larger packets to store multiple frames). |
| </t> |
| |
| <t> |
| Stereo support in SILK uses a variant of mid-side coding, allowing a mono |
| decoder to simply decode the mid channel. |
| However, the data for the two channels is interleaved, so a mono decoder must |
| still unpack the data for the side channel. |
| It would be required to do so anyway for Hybrid Opus frames, or to support |
| decoding individual 20 ms frames. |
| </t> |
| |
| <t> |
| <xref target="silk_symbols"/> summarizes the overall grouping of the contents of |
| the LP layer. |
| Figures <xref format="counter" target="silk_mono_60ms_frame"/> |
| and <xref format="counter" target="silk_stereo_60ms_frame"/> illustrate |
| the ordering of the various SILK frames in a 60 ms Opus frame for |
| mono and stereo, respectively. |
| </t> |
| |
| <texttable anchor="silk_symbols" |
| title="Organization of the SILK layer of an Opus frame"> |
| <ttcol align="center">Symbol(s)</ttcol> |
| <ttcol align="center">PDF(s)</ttcol> |
| <ttcol align="center">Condition</ttcol> |
| |
| <c>Voice Activity Detection (VAD) flags</c> |
| <c>{1, 1}/2</c> |
| <c/> |
| |
| <c>LBRR flag</c> |
| <c>{1, 1}/2</c> |
| <c/> |
| |
| <c>Per-frame LBRR flags</c> |
| <c><xref target="silk_lbrr_flag_pdfs"/></c> |
| <c><xref target="silk_lbrr_flags"/></c> |
| |
| <c>LBRR Frame(s)</c> |
| <c><xref target="silk_frame"/></c> |
| <c><xref target="silk_lbrr_flags"/></c> |
| |
| <c>Regular SILK Frame(s)</c> |
| <c><xref target="silk_frame"/></c> |
| <c/> |
| |
| </texttable> |
| |
| <figure align="center" anchor="silk_mono_60ms_frame" |
| title="A 60 ms Mono Frame"> |
| <artwork align="center"><![CDATA[ |
| +---------------------------------+ |
| | VAD Flags | |
| +---------------------------------+ |
| | LBRR Flag | |
| +---------------------------------+ |
| | Per-Frame LBRR Flags (Optional) | |
| +---------------------------------+ |
| | LBRR Frame 1 (Optional) | |
| +---------------------------------+ |
| | LBRR Frame 2 (Optional) | |
| +---------------------------------+ |
| | LBRR Frame 3 (Optional) | |
| +---------------------------------+ |
| | Regular SILK Frame 1 | |
| +---------------------------------+ |
| | Regular SILK Frame 2 | |
| +---------------------------------+ |
| | Regular SILK Frame 3 | |
| +---------------------------------+ |
| ]]></artwork> |
| </figure> |
| |
| <figure align="center" anchor="silk_stereo_60ms_frame" |
| title="A 60 ms Stereo Frame"> |
| <artwork align="center"><![CDATA[ |
| +---------------------------------------+ |
| | Mid VAD Flags | |
| +---------------------------------------+ |
| | Mid LBRR Flag | |
| +---------------------------------------+ |
| | Side VAD Flags | |
| +---------------------------------------+ |
| | Side LBRR Flag | |
| +---------------------------------------+ |
| | Mid Per-Frame LBRR Flags (Optional) | |
| +---------------------------------------+ |
| | Side Per-Frame LBRR Flags (Optional) | |
| +---------------------------------------+ |
| | Mid LBRR Frame 1 (Optional) | |
| +---------------------------------------+ |
| | Side LBRR Frame 1 (Optional) | |
| +---------------------------------------+ |
| | Mid LBRR Frame 2 (Optional) | |
| +---------------------------------------+ |
| | Side LBRR Frame 2 (Optional) | |
| +---------------------------------------+ |
| | Mid LBRR Frame 3 (Optional) | |
| +---------------------------------------+ |
| | Side LBRR Frame 3 (Optional) | |
| +---------------------------------------+ |
| | Mid Regular SILK Frame 1 | |
| +---------------------------------------+ |
| | Side Regular SILK Frame 1 (Optional) | |
| +---------------------------------------+ |
| | Mid Regular SILK Frame 2 | |
| +---------------------------------------+ |
| | Side Regular SILK Frame 2 (Optional) | |
| +---------------------------------------+ |
| | Mid Regular SILK Frame 3 | |
| +---------------------------------------+ |
| | Side Regular SILK Frame 3 (Optional) | |
| +---------------------------------------+ |
| ]]></artwork> |
| </figure> |
| |
| </section> |
| |
| <section anchor="silk_header_bits" title="Header Bits"> |
| <t> |
| The LP layer begins with two to eight header bits, decoded in silk_Decode() |
| (dec_API.c). |
| These consist of one Voice Activity Detection (VAD) bit per frame (up to 3), |
| followed by a single flag indicating the presence of LBRR frames. |
| For a stereo packet, these first flags correspond to the mid channel, and a |
| second set of flags is included for the side channel. |
| </t> |
| <t> |
| Because these are the first symbols decoded by the range coder and because they |
| are coded as binary values with uniform probability, they can be extracted |
| directly from the most significant bits of the first byte of compressed data. |
| Thus, a receiver can determine if an Opus frame contains any active SILK frames |
| without the overhead of using the range decoder. |
| </t> |
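| <t> |
| A minimal sketch of such a shortcut follows, in Python for illustration only. |
| The function name and return layout are hypothetical, and the sketch assumes |
| that a set bit in the first byte corresponds to a decoded flag value of 1 and |
| that the bits appear in the order described above (mid VAD bits, mid LBRR |
| flag, then the same for the side channel). |
| </t> |
| <figure align="left"> |
| <artwork align="left"><![CDATA[ |
```python
def silk_header_flags(first_byte, num_silk_frames, stereo):
    """Read the VAD and LBRR header flags from the top bits of the
    first byte of the LP layer, without running the range decoder.

    Assumption: a set bit corresponds to a flag value of 1.
    """
    channels = []
    shift = 7                                  # start at the MSB
    for _ in range(2 if stereo else 1):        # mid first, then side
        vad = []
        for _ in range(num_silk_frames):       # one VAD bit per SILK frame
            vad.append((first_byte >> shift) & 1)
            shift -= 1
        lbrr = (first_byte >> shift) & 1       # global LBRR flag
        shift -= 1
        channels.append({"vad": vad, "lbrr": lbrr})
    return channels
```
| ]]></artwork> |
| </figure> |
| <t> |
| A receiver could then treat the Opus frame as containing active speech if any |
| of the returned VAD bits is set. |
| </t> |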
| </section> |
| |
| <section anchor="silk_lbrr_flags" title="Per-Frame LBRR Flags"> |
| <t> |
| For Opus frames longer than 20 ms, a set of LBRR flags is |
| decoded for each channel that has its LBRR flag set. |
| Each set contains one flag per 20 ms SILK frame. |
| 40 ms Opus frames use the 2-frame LBRR flag PDF from |
| <xref target="silk_lbrr_flag_pdfs"/>, and 60 ms Opus frames use the |
| 3-frame LBRR flag PDF. |
| For each channel, the resulting 2- or 3-bit integer contains the corresponding |
| LBRR flag for each frame, packed in order from the LSB to the MSB. |
| </t> |
| |
| <texttable anchor="silk_lbrr_flag_pdfs" title="LBRR Flag PDFs"> |
| <ttcol>Frame Size</ttcol> |
| <ttcol>PDF</ttcol> |
| <c>40 ms</c> <c>{0, 53, 53, 150}/256</c> |
| <c>60 ms</c> <c>{0, 41, 20, 29, 41, 15, 28, 82}/256</c> |
| </texttable> |
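| <t> |
| The LSB-first packing can be illustrated with a short sketch (Python, for |
| exposition only; unpack_lbrr_flags() is a hypothetical helper that takes the |
| already decoded symbol as input): |
| </t> |
| <figure align="left"> |
| <artwork align="left"><![CDATA[ |
```python
def unpack_lbrr_flags(lbrr_symbol, num_silk_frames):
    """Split the decoded 2- or 3-bit integer into per-frame LBRR
    flags, with the first SILK frame in the least significant bit."""
    return [(lbrr_symbol >> i) & 1 for i in range(num_silk_frames)]
```
| ]]></artwork> |
| </figure> |
| <t> |
| For example, a decoded value of 5 in a 60 ms Opus frame marks the first and |
| third 20 ms intervals as having LBRR frames. |
| Because the first entry of each PDF is zero, the value 0 is never decoded, so |
| at least one flag in the set is always 1. |
| </t> |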
| |
| <t> |
| A 10 or 20 ms Opus frame does not contain any per-frame LBRR flags, |
| as there may be at most one LBRR frame per channel. |
| The global LBRR flag in the header bits (see <xref target="silk_header_bits"/>) |
| is already sufficient to indicate the presence of that single LBRR frame. |
| </t> |
| |
| </section> |
| |
| <section anchor="silk_lbrr_frames" title="LBRR Frames"> |
| <t> |
| The LBRR frames, if present, contain an encoded representation of the signal |
| immediately prior to the current Opus frame as if it were encoded with the |
| current mode, frame size, audio bandwidth, and channel count, even if those |
| differ from the prior Opus frame. |
| When one of these parameters changes from one Opus frame to the next, this |
| implies that the LBRR frames of the current Opus frame may not be simple |
| drop-in replacements for the contents of the previous Opus frame. |
| </t> |
| |
| <t> |
| For example, when switching from 20 ms to 60 ms, the 60 ms Opus |
| frame may contain LBRR frames covering up to three prior 20 ms Opus |
| frames, even if those frames already contained LBRR frames covering some of |
| the same time periods. |
| When switching from 20 ms to 10 ms, the 10 ms Opus frame can |
| contain an LBRR frame covering at most half the prior 20 ms Opus frame, |
| potentially leaving a hole that needs to be concealed from even a single |
| packet loss (see <xref target="Packet Loss Concealment"/>). |
| When switching from mono to stereo, the LBRR frames in the first stereo Opus |
| frame MAY contain a non-trivial side channel. |
| </t> |
| |
| <t> |
| In order to properly produce LBRR frames under all conditions, an encoder might |
| need to buffer up to 60 ms of audio and re-encode it during these |
| transitions. |
| However, the reference implementation opts to disable LBRR frames at the |
| transition point for simplicity. |
| Since transitions are relatively infrequent in normal usage, this does not have |
| a significant impact on packet loss robustness. |
| </t> |
| |
| <t> |
| The LBRR frames immediately follow the LBRR flags, prior to any regular SILK |
| frames. |
| <xref target="silk_frame"/> describes their exact contents. |
| LBRR frames do not include their own separate VAD flags. |
| LBRR frames are only meant to be transmitted for active speech, thus all LBRR |
| frames are treated as active. |
| </t> |
| |
| <t> |
| In a stereo Opus frame longer than 20 ms, although the per-frame LBRR |
| flags for the mid channel are coded as a unit before the per-frame LBRR flags |
| for the side channel, the LBRR frames themselves are interleaved. |
| The decoder parses an LBRR frame for the mid channel of a given 20 ms |
| interval (if present) and then immediately parses the corresponding LBRR |
| frame for the side channel (if present), before proceeding to the next |
| 20 ms interval. |
| </t> |
| </section> |
| |
| <section anchor="silk_regular_frames" title="Regular SILK Frames"> |
| <t> |
| The regular SILK frame(s) follow the LBRR frames (if any). |
| <xref target="silk_frame"/> describes their contents, as well. |
| Unlike the LBRR frames, a regular SILK frame is coded for each time interval in |
| an Opus frame, even if the corresponding VAD flags are unset. |
| For stereo Opus frames longer than 20 ms, the regular mid and side SILK |
| frames for each 20 ms interval are interleaved, just as with the LBRR |
| frames. |
| The side frame may be skipped by coding an appropriate flag, as detailed in |
| <xref target="silk_mid_only_flag"/>. |
| </t> |
| </section> |
| |
| <section anchor="silk_frame" title="SILK Frame Contents"> |
| <t> |
| Each SILK frame includes a set of side information that encodes |
| <list style="symbols"> |
| <t>The frame type and quantization type (<xref target="silk_frame_type"/>),</t> |
| <t>Quantization gains (<xref target="silk_gains"/>),</t> |
| <t>Short-term prediction filter coefficients (<xref target="silk_nlsfs"/>),</t> |
| <t>A Line Spectral Frequencies (LSF) interpolation weight (<xref target="silk_nlsf_interpolation"/>),</t> |
| <t> |
| Long-term prediction filter lags and gains (<xref target="silk_ltp_params"/>), |
| and |
| </t> |
| <t>A linear congruential generator (LCG) seed (<xref target="silk_seed"/>).</t> |
| </list> |
| The quantized excitation signal (see <xref target="silk_excitation"/>) follows |
| these at the end of the frame. |
| <xref target="silk_frame_symbols"/> details the overall organization of a |
| SILK frame. |
| </t> |
| |
| <texttable anchor="silk_frame_symbols" |
| title="Order of the symbols in an individual SILK frame"> |
| <ttcol align="center">Symbol(s)</ttcol> |
| <ttcol align="center">PDF(s)</ttcol> |
| <ttcol align="center">Condition</ttcol> |
| |
| <c>Stereo Prediction Weights</c> |
| <c><xref target="silk_stereo_pred_pdfs"/></c> |
| <c><xref target="silk_stereo_pred"/></c> |
| |
| <c>Mid-only Flag</c> |
| <c><xref target="silk_mid_only_pdf"/></c> |
| <c><xref target="silk_mid_only_flag"/></c> |
| |
| <c>Frame Type</c> |
| <c><xref target="silk_frame_type"/></c> |
| <c/> |
| |
| <c>Subframe Gains</c> |
| <c><xref target="silk_gains"/></c> |
| <c/> |
| |
| <c>Normalized LSF Stage-1 Index</c> |
| <c><xref target="silk_nlsf_stage1_pdfs"/></c> |
| <c/> |
| |
| <c>Normalized LSF Stage-2 Residual</c> |
| <c><xref target="silk_nlsf_stage2"/></c> |
| <c/> |
| |
| <c>Normalized LSF Interpolation Weight</c> |
| <c><xref target="silk_nlsf_interp_pdf"/></c> |
| <c>20 ms frame</c> |
| |
| <c>Primary Pitch Lag</c> |
| <c><xref target="silk_ltp_lags"/></c> |
| <c>Voiced frame</c> |
| |
| <c>Subframe Pitch Contour</c> |
| <c><xref target="silk_pitch_contour_pdfs"/></c> |
| <c>Voiced frame</c> |
| |
| <c>Periodicity Index</c> |
| <c><xref target="silk_perindex_pdf"/></c> |
| <c>Voiced frame</c> |
| |
| <c>LTP Filter</c> |
| <c><xref target="silk_ltp_filter_pdfs"/></c> |
| <c>Voiced frame</c> |
| |
| <c>LTP Scaling</c> |
| <c><xref target="silk_ltp_scaling_pdf"/></c> |
| <c><xref target="silk_ltp_scaling"/></c> |
| |
| <c>LCG Seed</c> |
| <c><xref target="silk_seed_pdf"/></c> |
| <c/> |
| |
| <c>Excitation Rate Level</c> |
| <c><xref target="silk_rate_level_pdfs"/></c> |
| <c/> |
| |
| <c>Excitation Pulse Counts</c> |
| <c><xref target="silk_pulse_count_pdfs"/></c> |
| <c/> |
| |
| <c>Excitation Pulse Locations</c> |
| <c><xref target="silk_pulse_locations"/></c> |
| <c>Non-zero pulse count</c> |
| |
| <c>Excitation LSBs</c> |
| <c><xref target="silk_shell_lsb_pdf"/></c> |
| <c><xref target="silk_pulse_counts"/></c> |
| |
| <c>Excitation Signs</c> |
| <c><xref target="silk_sign_pdfs"/></c> |
| <c/> |
| |
| </texttable> |
| |
| <section anchor="silk_stereo_pred" toc="include" |
| title="Stereo Prediction Weights"> |
| <t> |
| A SILK frame corresponding to the mid channel of a stereo Opus frame begins |
| with a pair of side channel prediction weights, designed such that zeros |
| indicate normal mid-side coupling. |
| Since these weights can change on every frame, the first portion of each frame |
| linearly interpolates between the previous weights and the current ones, using |
| zeros for the previous weights if none are available. |
| These prediction weights are never included in a mono Opus frame, and the |
| previous weights are reset to zeros on any transition from mono to stereo. |
| They are also not included in an LBRR frame for the side channel, even if the |
| LBRR flags indicate the corresponding mid channel was not coded. |
| In that case, the previous weights are used, again substituting in zeros if no |
| previous weights are available since the last decoder reset |
| (see <xref target="decoder-reset"/>). |
| </t> |
| |
| <t> |
| To summarize, these weights are coded if and only if |
| <list style="symbols"> |
| <t>This is a stereo Opus frame (<xref target="toc_byte"/>), and</t> |
| <t>The current SILK frame corresponds to the mid channel.</t> |
| </list> |
| </t> |
| |
| <t> |
| The prediction weights are coded in three separate pieces, which are decoded |
| by silk_stereo_decode_pred() (decode_stereo_pred.c). |
| The first piece jointly codes the high-order part of a table index for both |
| weights. |
| The second piece codes the low-order part of each table index. |
| The third piece codes an offset used to linearly interpolate between table |
| indices. |
| The details are as follows. |
| </t> |
| |
| <t> |
| Let n be an index decoded with the 25-element stage-1 PDF in |
| <xref target="silk_stereo_pred_pdfs"/>. |
| Then let i0 and i1 be indices decoded with the stage-2 and stage-3 PDFs in |
| <xref target="silk_stereo_pred_pdfs"/>, respectively, and let i2 and i3 |
| be two more indices decoded with the stage-2 and stage-3 PDFs, all in that |
| order. |
| </t> |
| |
| <texttable anchor="silk_stereo_pred_pdfs" title="Stereo Weight PDFs"> |
| <ttcol align="left">Stage</ttcol> |
| <ttcol align="left">PDF</ttcol> |
| <c>Stage 1</c> |
| <c>{7, 2, 1, 1, 1, |
| 10, 24, 8, 1, 1, |
| 3, 23, 92, 23, 3, |
| 1, 1, 8, 24, 10, |
| 1, 1, 1, 2, 7}/256</c> |
| |
| <c>Stage 2</c> |
| <c>{85, 86, 85}/256</c> |
| |
| <c>Stage 3</c> |
| <c>{51, 51, 52, 51, 51}/256</c> |
| </texttable> |
| |
| <t> |
| Then use n, i0, and i2 to form two table indices, wi0 and wi1, according to |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| wi0 = i0 + 3*(n/5) |
| wi1 = i2 + 3*(n%5) |
| ]]></artwork> |
| </figure> |
| where the division is integer division. |
| The range of these indices is 0 to 14, inclusive. |
| Let w_Q13[i] be the i'th weight from <xref target="silk_stereo_weights_table"/>. |
| Then the two prediction weights, w0_Q13 and w1_Q13, are |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| w1_Q13 = w_Q13[wi1] |
| + (((w_Q13[wi1+1] - w_Q13[wi1])*6554) >> 16)*(2*i3 + 1) |
| |
| w0_Q13 = w_Q13[wi0] |
| + (((w_Q13[wi0+1] - w_Q13[wi0])*6554) >> 16)*(2*i1 + 1) |
| - w1_Q13 |
| ]]></artwork> |
| </figure> |
| N.b., w1_Q13 is computed first here, because w0_Q13 depends on it. |
| The constant 6554 is approximately 0.1 in Q16. |
| Although wi0 and wi1 only have 15 possible values, |
| <xref target="silk_stereo_weights_table"/> contains 16 entries to allow |
| interpolation between entry wi0 and (wi0 + 1) (and likewise for wi1). |
| </t> |
| |
| <texttable anchor="silk_stereo_weights_table" |
| title="Stereo Weight Table"> |
| <ttcol align="left">Index</ttcol> |
| <ttcol align="right">Weight (Q13)</ttcol> |
| <c>0</c> <c>-13732</c> |
| <c>1</c> <c>-10050</c> |
| <c>2</c> <c>-8266</c> |
| <c>3</c> <c>-7526</c> |
| <c>4</c> <c>-6500</c> |
| <c>5</c> <c>-5000</c> |
| <c>6</c> <c>-2950</c> |
| <c>7</c> <c>-820</c> |
| <c>8</c> <c>820</c> |
| <c>9</c> <c>2950</c> |
| <c>10</c> <c>5000</c> |
| <c>11</c> <c>6500</c> |
| <c>12</c> <c>7526</c> |
| <c>13</c> <c>8266</c> |
| <c>14</c> <c>10050</c> |
| <c>15</c> <c>13732</c> |
| </texttable> |
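| <t> |
| The weight reconstruction can be sketched as follows. |
| This is an illustrative Python transcription of the formulas above, taking |
| the five already decoded indices as input; the function and helper names are |
| hypothetical and do not correspond to silk_stereo_decode_pred(). |
| </t> |
| <figure align="left"> |
| <artwork align="left"><![CDATA[ |
```python
# Stereo weight table (Q13), 16 entries so that entry wi + 1 always exists.
W_Q13 = [-13732, -10050, -8266, -7526, -6500, -5000, -2950, -820,
         820, 2950, 5000, 6500, 7526, 8266, 10050, 13732]


def stereo_pred_weights(n, i0, i1, i2, i3):
    """Return (w0_Q13, w1_Q13) from the five decoded indices."""
    wi0 = i0 + 3 * (n // 5)          # integer division
    wi1 = i2 + 3 * (n % 5)

    def interp(wi, step):
        # 6554 is about 0.1 in Q16: one tenth of the gap to the next entry.
        delta = ((W_Q13[wi + 1] - W_Q13[wi]) * 6554) >> 16
        return W_Q13[wi] + delta * (2 * step + 1)

    w1_q13 = interp(wi1, i3)         # computed first; w0 depends on it
    w0_q13 = interp(wi0, i1) - w1_q13
    return w0_q13, w1_q13
```
| ]]></artwork> |
| </figure> |
| <t> |
| With n = 12, i0 = i2 = 1, and i1 = i3 = 2, both weights come out as zero, |
| which is the normal mid-side coupling case. |
| </t> |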
| |
| </section> |
| |
| <section anchor="silk_mid_only_flag" toc="include" title="Mid-only Flag"> |
| <t> |
| A flag appears after the stereo prediction weights that indicates if only the |
| mid channel is coded for this time interval. |
| It appears only when |
| <list style="symbols"> |
| <t>This is a stereo Opus frame (see <xref target="toc_byte"/>),</t> |
| <t>The current SILK frame corresponds to the mid channel, and</t> |
| <t>Either |
| <list style="symbols"> |
| <t>This is a regular SILK frame where the VAD flags |
| (see <xref target="silk_header_bits"/>) indicate that the corresponding side |
| channel is not active.</t> |
| <t> |
| This is an LBRR frame where the LBRR flags |
| (see <xref target="silk_header_bits"/> and <xref target="silk_lbrr_flags"/>) |
| indicate that the corresponding side channel is not coded. |
| </t> |
| </list> |
| </t> |
| </list> |
| It is omitted when there are no stereo weights, for all of the same reasons. |
| It is also omitted for a regular SILK frame when the VAD flag of the |
| corresponding side channel frame is set (indicating it is active). |
| The side channel must be coded in this case, making the mid-only flag |
| redundant. |
| It is also omitted for an LBRR frame when the corresponding LBRR flags |
| indicate the side channel is coded. |
| </t> |
| |
| <t> |
| When the flag is present, the decoder reads a single value using the PDF in |
| <xref target="silk_mid_only_pdf"/>, as implemented in |
| silk_stereo_decode_mid_only() (decode_stereo_pred.c). |
| If the flag is set, then there is no corresponding SILK frame for the side |
| channel, the entire decoding process for the side channel is skipped, and |
| zeros are fed to the stereo unmixing process (see |
| <xref target="silk_stereo_unmixing"/>) instead. |
| As stated above, LBRR frames still include this flag when the LBRR flag |
| indicates that the side channel is not coded. |
| In that case, if this flag is zero (indicating that there should be a side |
| channel), then Packet Loss Concealment (PLC, see |
| <xref target="Packet Loss Concealment"/>) SHOULD be invoked to recover a |
| side channel signal. |
| Otherwise, the stereo image will collapse. |
| </t> |
| |
| <texttable anchor="silk_mid_only_pdf" title="Mid-only Flag PDF"> |
| <ttcol align="left">PDF</ttcol> |
| <c>{192, 64}/256</c> |
| </texttable> |
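| <t> |
| The presence conditions for the stereo prediction weights and the mid-only |
| flag can be summarized in a short sketch (Python, for illustration; the |
| function and parameter names are hypothetical, and the side-channel flags are |
| assumed to have been read already from the header bits or per-frame LBRR |
| flags): |
| </t> |
| <figure align="left"> |
| <artwork align="left"><![CDATA[ |
```python
def has_stereo_weights(stereo, is_mid_channel):
    """Weights are coded only for mid-channel frames of stereo Opus frames."""
    return stereo and is_mid_channel


def has_mid_only_flag(stereo, is_mid_channel, is_lbrr,
                      side_vad, side_lbrr_coded):
    """side_vad: VAD flag of the matching side-channel frame.
    side_lbrr_coded: LBRR flag of the matching side-channel frame."""
    if not has_stereo_weights(stereo, is_mid_channel):
        return False
    if is_lbrr:
        return not side_lbrr_coded   # LBRR frame: side channel not coded
    return not side_vad              # regular frame: side channel inactive
```
| ]]></artwork> |
| </figure> |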
| |
| </section> |
| |
| <section anchor="silk_frame_type" toc="include" title="Frame Type"> |
| <t> |
| Each SILK frame contains a single "frame type" symbol that jointly codes the |
| signal type and quantization offset type of the corresponding frame. |
| If the current frame is a regular SILK frame whose VAD bit was not set (an |
| "inactive" frame), then the frame type symbol takes on a value of either 0 or |
| 1 and is decoded using the first PDF in <xref target="silk_frame_type_pdfs"/>. |
| If the frame is an LBRR frame or a regular SILK frame whose VAD flag was set |
| (an "active" frame), then the value of the symbol may range from 2 to 5, |
| inclusive, and is decoded using the second PDF in |
| <xref target="silk_frame_type_pdfs"/>. |
| <xref target="silk_frame_type_table"/> translates between the value of the |
| frame type symbol and the corresponding signal type and quantization offset |
| type. |
| </t> |
| |
| <texttable anchor="silk_frame_type_pdfs" title="Frame Type PDFs"> |
| <ttcol>VAD Flag</ttcol> |
| <ttcol>PDF</ttcol> |
| <c>Inactive</c> <c>{26, 230, 0, 0, 0, 0}/256</c> |
| <c>Active</c> <c>{0, 0, 24, 74, 148, 10}/256</c> |
| </texttable> |
| |
| <texttable anchor="silk_frame_type_table" |
| title="Signal Type and Quantization Offset Type from Frame Type"> |
| <ttcol>Frame Type</ttcol> |
| <ttcol>Signal Type</ttcol> |
| <ttcol align="right">Quantization Offset Type</ttcol> |
| <c>0</c> <c>Inactive</c> <c>Low</c> |
| <c>1</c> <c>Inactive</c> <c>High</c> |
| <c>2</c> <c>Unvoiced</c> <c>Low</c> |
| <c>3</c> <c>Unvoiced</c> <c>High</c> |
| <c>4</c> <c>Voiced</c> <c>Low</c> |
| <c>5</c> <c>Voiced</c> <c>High</c> |
| </texttable> |
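| <t> |
| The two tables above can be expressed as simple lookups, as in the following |
| sketch (Python, for illustration only; the names are hypothetical and the PDFs |
| are given as probabilities out of 256): |
| </t> |
| <figure align="left"> |
| <artwork align="left"><![CDATA[ |
```python
# Frame type symbol -> (signal type, quantization offset type)
FRAME_TYPES = {
    0: ("Inactive", "Low"),
    1: ("Inactive", "High"),
    2: ("Unvoiced", "Low"),
    3: ("Unvoiced", "High"),
    4: ("Voiced", "Low"),
    5: ("Voiced", "High"),
}


def frame_type_pdf(is_lbrr, vad_flag):
    """Select the PDF used to decode the frame type symbol.
    LBRR frames are always treated as active."""
    active = is_lbrr or vad_flag
    if active:
        return [0, 0, 24, 74, 148, 10]
    return [26, 230, 0, 0, 0, 0]
```
| ]]></artwork> |
| </figure> |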
| |
| </section> |
| |
| <section anchor="silk_gains" toc="include" title="Subframe Gains"> |
| <t> |
| A separate quantization gain is coded for each 5 ms subframe. |
| These gains control the step size between quantization levels of the excitation |
| signal and, therefore, the quality of the reconstruction. |
| They are independent of and unrelated to the pitch contours coded for voiced |
| frames. |
| The quantization gains are themselves uniformly quantized to 6 bits on a |
| log scale, giving them a resolution of approximately 1.369 dB and a range |
| of approximately 1.94 dB to 88.21 dB. |
| </t> |
| <t> |
| The subframe gains are either coded independently, or relative to the gain from |
| the most recent coded subframe in the same channel. |
| Independent coding is used if and only if |
| <list style="symbols"> |
| <t> |
| This is the first subframe in the current SILK frame, and |
| </t> |
| <t>Either |
| <list style="symbols"> |
| <t> |
| This is the first SILK frame of its type (LBRR or regular) for this channel in |
| the current Opus frame, or |
| </t> |
| <t> |
| The previous SILK frame of the same type (LBRR or regular) for this channel in |
| the same Opus frame was not coded. |
| </t> |
| </list> |
| </t> |
| </list> |
| </t> |
| |
| <t> |
| In an independently coded subframe gain, the 3 most significant bits of the |
| quantization gain are decoded using a PDF selected from |
| <xref target="silk_independent_gain_msb_pdfs"/> based on the decoded signal |
| type (see <xref target="silk_frame_type"/>). |
| </t> |
| |
| <texttable anchor="silk_independent_gain_msb_pdfs" |
| title="PDFs for Independent Quantization Gain MSB Coding"> |
| <ttcol align="left">Signal Type</ttcol> |
| <ttcol align="left">PDF</ttcol> |
| <c>Inactive</c> <c>{32, 112, 68, 29, 12, 1, 1, 1}/256</c> |
| <c>Unvoiced</c> <c>{2, 17, 45, 60, 62, 47, 19, 4}/256</c> |
| <c>Voiced</c> <c>{1, 3, 26, 71, 94, 50, 9, 2}/256</c> |
| </texttable> |
| |
| <t> |
| The 3 least significant bits are decoded using a uniform PDF: |
| </t> |
| <texttable anchor="silk_independent_gain_lsb_pdf" |
| title="PDF for Independent Quantization Gain LSB Coding"> |
| <ttcol align="left">PDF</ttcol> |
| <c>{32, 32, 32, 32, 32, 32, 32, 32}/256</c> |
| </texttable> |
| |
| <t> |
| These 6 bits are combined to form a value, gain_index, between 0 and 63. |
| When the gain for the previous subframe is available, then the current gain is |
| limited as follows: |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| log_gain = max(gain_index, previous_log_gain - 16) . |
| ]]></artwork> |
| </figure> |
| This may help some implementations limit the change in precision of their |
| internal LTP history. |
| The indices to which this clamp applies cannot simply be removed from the |
| codebook, because previous_log_gain will not be available after packet loss. |
| The clamping is skipped after a decoder reset, and in the side channel if the |
| previous frame in the side channel was not coded, since there is no value for |
| previous_log_gain available. |
| It MAY also be skipped after packet loss. |
| </t> |
| |
| <t> |
| For subframes which do not have an independent gain (including the first |
| subframe of frames not listed as using independent coding above), the |
| quantization gain is coded relative to the gain from the previous subframe (in |
| the same channel). |
| The PDF in <xref target="silk_delta_gain_pdf"/> yields a delta_gain_index value |
| between 0 and 40, inclusive. |
| </t> |
| <texttable anchor="silk_delta_gain_pdf" |
| title="PDF for Delta Quantization Gain Coding"> |
| <ttcol align="left">PDF</ttcol> |
| <c>{6, 5, 11, 31, 132, 21, 8, 4, |
| 3, 2, 2, 2, 1, 1, 1, 1, |
| 1, 1, 1, 1, 1, 1, 1, 1, |
| 1, 1, 1, 1, 1, 1, 1, 1, |
| 1, 1, 1, 1, 1, 1, 1, 1, 1}/256</c> |
| </texttable> |
| <t> |
| The following formula translates this index into a quantization gain for the |
| current subframe using the gain from the previous subframe: |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| log_gain = clamp(0, max(2*delta_gain_index - 16, |
| previous_log_gain + delta_gain_index - 4), 63) . |
| ]]></artwork> |
| </figure> |
| </t> |
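| <t> |
| Both ways of obtaining log_gain can be sketched as follows (Python, for |
| illustration; the function names are hypothetical, and using None to signal |
| that no previous gain is available is an assumption of this sketch): |
| </t> |
| <figure align="left"> |
| <artwork align="left"><![CDATA[ |
```python
def independent_log_gain(msb, lsb, previous_log_gain=None):
    """Combine the 3 MSBs and 3 LSBs, then apply the lower limit."""
    gain_index = (msb << 3) | lsb            # 6-bit value, 0..63
    if previous_log_gain is None:            # e.g., after a decoder reset
        return gain_index
    return max(gain_index, previous_log_gain - 16)


def delta_log_gain(delta_gain_index, previous_log_gain):
    """Apply a delta index (0..40) to the previous subframe's gain."""
    candidate = max(2 * delta_gain_index - 16,
                    previous_log_gain + delta_gain_index - 4)
    return min(max(candidate, 0), 63)        # clamp to 0..63
```
| ]]></artwork> |
| </figure> |
| <t> |
| A delta_gain_index of 4 leaves the gain unchanged, which is consistent with |
| that entry having by far the largest probability in the PDF. |
| </t> |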
| <t> |
| silk_gains_dequant() (gain_quant.c) dequantizes log_gain for the k'th subframe |
| and converts it into a linear Q16 scale factor via |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| gain_Q16[k] = silk_log2lin((0x1D1C71*log_gain>>16) + 2090) |
| ]]></artwork> |
| </figure> |
| </t> |
| <t> |
| The function silk_log2lin() (log2lin.c) computes an approximation of |
| 2**(inLog_Q7/128.0), where inLog_Q7 is its Q7 input. |
| Let i = inLog_Q7>>7 be the integer part of inLog_Q7 and |
| f = inLog_Q7&amp;127 be the fractional part. |
| Then |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| (1<<i) + ((-174*f*(128-f)>>16)+f)*((1<<i)>>7) |
| ]]></artwork> |
| </figure> |
| yields the approximate exponential. |
| The final Q16 gain values lie between 81920 and 1686110208, inclusive |
| (representing scale factors of 1.25 to 25728, respectively). |
| </t> |
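| <t> |
| The dequantization can be checked with a short sketch (Python, for |
| illustration; it transcribes only the formulas given above and is not the |
| code in gain_quant.c or log2lin.c, and gain_q16() is a hypothetical helper |
| name): |
| </t> |
| <figure align="left"> |
| <artwork align="left"><![CDATA[ |
```python
import math


def silk_log2lin(in_log_q7):
    """Approximate 2**(in_log_q7/128.0) using the formula above."""
    i = in_log_q7 >> 7               # integer part
    f = in_log_q7 & 127              # fractional part
    return (1 << i) + (((-174 * f * (128 - f)) >> 16) + f) * ((1 << i) >> 7)


def gain_q16(log_gain):
    """Convert a log_gain index (0..63) to a linear Q16 scale factor."""
    return silk_log2lin(((0x1D1C71 * log_gain) >> 16) + 2090)


# Range and average step size in dB, derived from the two end points.
lo_db = 20 * math.log10(gain_q16(0) / 65536)
hi_db = 20 * math.log10(gain_q16(63) / 65536)
step_db = (hi_db - lo_db) / 63
```
| ]]></artwork> |
| </figure> |
| <t> |
| Here gain_q16(0) and gain_q16(63) evaluate to 81920 and 1686110208, and the |
| derived values are approximately 1.94 dB, 88.21 dB, and 1.369 dB per step. |
| </t> |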
| </section> |
| |
| <section anchor="silk_nlsfs" toc="include" title="Normalized Line Spectral |
| Frequency (LSF) and Linear Predictive Coding (LPC) Coefficients"> |
| <t> |
| A set of normalized Line Spectral Frequency (LSF) coefficients follows the |
| quantization gains in the bitstream and represents the Linear Predictive |
| Coding (LPC) coefficients for the current SILK frame. |
| Once decoded, the normalized LSFs form an increasing list of Q15 values between |
| 0 and 1. |
| These represent the interleaved zeros on the upper half of the unit circle |
| (between 0 and pi, hence "normalized") in the standard decomposition |
| <xref target="line-spectral-pairs"/> of the LPC filter into a symmetric part |
| and an anti-symmetric part (P and Q in <xref target="silk_nlsf2lpc"/>). |
| Because of non-linear effects in the decoding process, an implementation SHOULD |
| match the fixed-point arithmetic described in this section exactly. |
| An encoder SHOULD also use the same process. |
| </t> |
| <t> |
| The normalized LSFs are coded using a two-stage vector quantizer (VQ) |
| (<xref target="silk_nlsf_stage1"/> and <xref target="silk_nlsf_stage2"/>). |
| NB and MB frames use an order-10 predictor, while WB frames use an order-16 |
| predictor, and thus have different sets of tables. |
| After reconstructing the normalized LSFs |
| (<xref target="silk_nlsf_reconstruction"/>), the decoder runs them through a |
| stabilization process (<xref target="silk_nlsf_stabilization"/>), interpolates |
| them between frames (<xref target="silk_nlsf_interpolation"/>), converts them |
| back into LPC coefficients (<xref target="silk_nlsf2lpc"/>), and then runs |
| them through further processes to limit the range of the coefficients |
| (<xref target="silk_lpc_range_limit"/>) and the gain of the filter |
| (<xref target="silk_lpc_gain_limit"/>). |
| All of this is necessary to ensure the reconstruction process is stable. |
| </t> |
| |
| <section anchor="silk_nlsf_stage1" title="Normalized LSF Stage 1 Decoding"> |
| <t> |
| The first VQ stage uses a 32-element codebook, coded with one of the PDFs in |
| <xref target="silk_nlsf_stage1_pdfs"/>, depending on the audio bandwidth and |
| the signal type of the current SILK frame. |
| This yields a single index, I1, for the entire frame, which |
| <list style="numbers"> |
| <t>Indexes an element in a coarse codebook,</t> |
| <t>Selects the PDFs for the second stage of the VQ, and</t> |
| <t>Selects the prediction weights used to remove intra-frame redundancy from |
| the second stage.</t> |
| </list> |
| The actual codebook elements are listed in |
| <xref target="silk_nlsf_nbmb_codebook"/> and |
| <xref target="silk_nlsf_wb_codebook"/>, but they are not needed until the last |
| stages of reconstructing the LSF coefficients. |
| </t> |
| |
| <texttable anchor="silk_nlsf_stage1_pdfs" |
| title="PDFs for Normalized LSF Stage-1 Index Decoding"> |
| <ttcol align="left">Audio Bandwidth</ttcol> |
| <ttcol align="left">Signal Type</ttcol> |
| <ttcol align="left">PDF</ttcol> |
| <c>NB or MB</c> <c>Inactive or unvoiced</c> |
| <c> |
| {44, 34, 30, 19, 21, 12, 11, 3, |
| 3, 2, 16, 2, 2, 1, 5, 2, |
| 1, 3, 3, 1, 1, 2, 2, 2, |
| 3, 1, 9, 9, 2, 7, 2, 1}/256 |
| </c> |
| <c>NB or MB</c> <c>Voiced</c> |
| <c> |
| {1, 10, 1, 8, 3, 8, 8, 14, |
| 13, 14, 1, 14, 12, 13, 11, 11, |
| 12, 11, 10, 10, 11, 8, 9, 8, |
| 7, 8, 1, 1, 6, 1, 6, 5}/256 |
| </c> |
| <c>WB</c> <c>Inactive or unvoiced</c> |
| <c> |
| {31, 21, 3, 17, 1, 8, 17, 4, |
| 1, 18, 16, 4, 2, 3, 1, 10, |
| 1, 3, 16, 11, 16, 2, 2, 3, |
| 2, 11, 1, 4, 9, 8, 7, 3}/256 |
| </c> |
| <c>WB</c> <c>Voiced</c> |
| <c> |
| {1, 4, 16, 5, 18, 11, 5, 14, |
| 15, 1, 3, 12, 13, 14, 14, 6, |
| 14, 12, 2, 6, 1, 12, 12, 11, |
| 10, 3, 10, 5, 1, 1, 1, 3}/256 |
| </c> |
| </texttable> |
| |
| </section> |
| |
| <section anchor="silk_nlsf_stage2" title="Normalized LSF Stage 2 Decoding"> |
| <t> |
| A total of 16 PDFs are available for the LSF residual in the second stage: the |
| 8 (a...h) for NB and MB frames given in |
| <xref target="silk_nlsf_stage2_nbmb_pdfs"/>, and the 8 (i...p) for WB frames |
| given in <xref target="silk_nlsf_stage2_wb_pdfs"/>. |
| Which PDF is used for which coefficient is driven by the index, I1, |
| decoded in the first stage. |
| <xref target="silk_nlsf_nbmb_stage2_cb_sel"/> lists the letter of the |
| corresponding PDF for each normalized LSF coefficient for NB and MB, and |
| <xref target="silk_nlsf_wb_stage2_cb_sel"/> lists the same information for WB. |
| </t> |
| |
| <texttable anchor="silk_nlsf_stage2_nbmb_pdfs" |
| title="PDFs for NB/MB Normalized LSF Stage-2 Index Decoding"> |
| <ttcol align="left">Codebook</ttcol> |
| <ttcol align="left">PDF</ttcol> |
| <c>a</c> <c>{1, 1, 1, 15, 224, 11, 1, 1, 1}/256</c> |
| <c>b</c> <c>{1, 1, 2, 34, 183, 32, 1, 1, 1}/256</c> |
| <c>c</c> <c>{1, 1, 4, 42, 149, 55, 2, 1, 1}/256</c> |
| <c>d</c> <c>{1, 1, 8, 52, 123, 61, 8, 1, 1}/256</c> |
| <c>e</c> <c>{1, 3, 16, 53, 101, 74, 6, 1, 1}/256</c> |
| <c>f</c> <c>{1, 3, 17, 55, 90, 73, 15, 1, 1}/256</c> |
| <c>g</c> <c>{1, 7, 24, 53, 74, 67, 26, 3, 1}/256</c> |
| <c>h</c> <c>{1, 1, 18, 63, 78, 58, 30, 6, 1}/256</c> |
| </texttable> |
| |
| <texttable anchor="silk_nlsf_stage2_wb_pdfs" |
| title="PDFs for WB Normalized LSF Stage-2 Index Decoding"> |
| <ttcol align="left">Codebook</ttcol> |
| <ttcol align="left">PDF</ttcol> |
| <c>i</c> <c>{1, 1, 1, 9, 232, 9, 1, 1, 1}/256</c> |
| <c>j</c> <c>{1, 1, 2, 28, 186, 35, 1, 1, 1}/256</c> |
| <c>k</c> <c>{1, 1, 3, 42, 152, 53, 2, 1, 1}/256</c> |
| <c>l</c> <c>{1, 1, 10, 49, 126, 65, 2, 1, 1}/256</c> |
| <c>m</c> <c>{1, 4, 19, 48, 100, 77, 5, 1, 1}/256</c> |
| <c>n</c> <c>{1, 1, 14, 54, 100, 72, 12, 1, 1}/256</c> |
| <c>o</c> <c>{1, 1, 15, 61, 87, 61, 25, 4, 1}/256</c> |
| <c>p</c> <c>{1, 7, 21, 50, 77, 81, 17, 1, 1}/256</c> |
| </texttable> |
| |
| <texttable anchor="silk_nlsf_nbmb_stage2_cb_sel" |
| title="Codebook Selection for NB/MB Normalized LSF Stage-2 Index Decoding"> |
| <ttcol>I1</ttcol> |
| <ttcol>Coefficient</ttcol> |
| <c/> |
| <c><spanx style="vbare">0 1 2 3 4 5 6 7 8 9</spanx></c> |
| <c> 0</c> |
| <c><spanx style="vbare">a a a a a a a a a a</spanx></c> |
| <c> 1</c> |
| <c><spanx style="vbare">b d b c c b c b b b</spanx></c> |
| <c> 2</c> |
| <c><spanx style="vbare">c b b b b b b b b b</spanx></c> |
| <c> 3</c> |
| <c><spanx style="vbare">b c c c c b c b b b</spanx></c> |
| <c> 4</c> |
| <c><spanx style="vbare">c d d d d c c c c c</spanx></c> |
| <c> 5</c> |
| <c><spanx style="vbare">a f d d c c c c b b</spanx></c> |
| <c> 6</c> |
| <c><spanx style="vbare">a c c c c c c c c b</spanx></c> |
| <c> 7</c> |
| <c><spanx style="vbare">c d g e e e f e f f</spanx></c> |
| <c> 8</c> |
| <c><spanx style="vbare">c e f f e f e g e e</spanx></c> |
| <c> 9</c> |
| <c><spanx style="vbare">c e e h e f e f f e</spanx></c> |
| <c>10</c> |
| <c><spanx style="vbare">e d d d c d c c c c</spanx></c> |
| <c>11</c> |
| <c><spanx style="vbare">b f f g e f e f f f</spanx></c> |
| <c>12</c> |
| <c><spanx style="vbare">c h e g f f f f f f</spanx></c> |
| <c>13</c> |
| <c><spanx style="vbare">c h f f f f f g f e</spanx></c> |
| <c>14</c> |
| <c><spanx style="vbare">d d f e e f e f e e</spanx></c> |
| <c>15</c> |
| <c><spanx style="vbare">c d d f f e e e e e</spanx></c> |
| <c>16</c> |
| <c><spanx style="vbare">c e e g e f e f f f</spanx></c> |
| <c>17</c> |
| <c><spanx style="vbare">c f e g f f f e f e</spanx></c> |
| <c>18</c> |
| <c><spanx style="vbare">c h e f e f e f f f</spanx></c> |
| <c>19</c> |
| <c><spanx style="vbare">c f e g h g f g f e</spanx></c> |
| <c>20</c> |
| <c><spanx style="vbare">d g h e g f f g e f</spanx></c> |
| <c>21</c> |
| <c><spanx style="vbare">c h g e e e f e f f</spanx></c> |
| <c>22</c> |
| <c><spanx style="vbare">e f f e g g f g f e</spanx></c> |
| <c>23</c> |
| <c><spanx style="vbare">c f f g f g e g e e</spanx></c> |
| <c>24</c> |
| <c><spanx style="vbare">e f f f d h e f f e</spanx></c> |
| <c>25</c> |
| <c><spanx style="vbare">c d e f f g e f f e</spanx></c> |
| <c>26</c> |
| <c><spanx style="vbare">c d c d d e c d d d</spanx></c> |
| <c>27</c> |
| <c><spanx style="vbare">b b c c c c c d c c</spanx></c> |
| <c>28</c> |
| <c><spanx style="vbare">e f f g g g f g e f</spanx></c> |
| <c>29</c> |
| <c><spanx style="vbare">d f f e e e e d d c</spanx></c> |
| <c>30</c> |
| <c><spanx style="vbare">c f d h f f e e f e</spanx></c> |
| <c>31</c> |
| <c><spanx style="vbare">e e f e f g f g f e</spanx></c> |
| </texttable> |
| |
| <texttable anchor="silk_nlsf_wb_stage2_cb_sel" |
| title="Codebook Selection for WB Normalized LSF Stage-2 Index Decoding"> |
| <ttcol>I1</ttcol> |
| <ttcol>Coefficient</ttcol> |
| <c/> |
| <c><spanx style="vbare">0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15</spanx></c> |
| <c> 0</c> |
| <c><spanx style="vbare">i i i i i i i i i i i i i i i i</spanx></c> |
| <c> 1</c> |
| <c><spanx style="vbare">k l l l l l k k k k k j j j i l</spanx></c> |
| <c> 2</c> |
| <c><spanx style="vbare">k n n l p m m n k n m n n m l l</spanx></c> |
| <c> 3</c> |
| <c><spanx style="vbare">i k j k k j j j j j i i i i i j</spanx></c> |
| <c> 4</c> |
| <c><spanx style="vbare">i o n m o m p n m m m n n m m l</spanx></c> |
| <c> 5</c> |
| <c><spanx style="vbare">i l n n m l l n l l l l l l k m</spanx></c> |
| <c> 6</c> |
| <c><spanx style="vbare">i i i i i i i i i i i i i i i i</spanx></c> |
| <c> 7</c> |
| <c><spanx style="vbare">i k o l p k n l m n n m l l k l</spanx></c> |
| <c> 8</c> |
| <c><spanx style="vbare">i o k o o m n m o n m m n l l l</spanx></c> |
| <c> 9</c> |
| <c><spanx style="vbare">k j i i i i i i i i i i i i i i</spanx></c> |
| <c>10</c> |
| <c><spanx style="vbare">i j i i i i i i i i i i i i i j</spanx></c> |
| <c>11</c> |
| <c><spanx style="vbare">k k l m n l l l l l l l k k j l</spanx></c> |
| <c>12</c> |
| <c><spanx style="vbare">k k l l m l l l l l l l l k j l</spanx></c> |
| <c>13</c> |
| <c><spanx style="vbare">l m m m o m m n l n m m n m l m</spanx></c> |
| <c>14</c> |
| <c><spanx style="vbare">i o m n m p n k o n p m m l n l</spanx></c> |
| <c>15</c> |
| <c><spanx style="vbare">i j i j j j j j j j i i i i j i</spanx></c> |
| <c>16</c> |
| <c><spanx style="vbare">j o n p n m n l m n m m m l l m</spanx></c> |
| <c>17</c> |
| <c><spanx style="vbare">j l l m m l l n k l l n n n l m</spanx></c> |
| <c>18</c> |
| <c><spanx style="vbare">k l l k k k l k j k j k j j j m</spanx></c> |
| <c>19</c> |
| <c><spanx style="vbare">i k l n l l k k k j j i i i i i</spanx></c> |
| <c>20</c> |
| <c><spanx style="vbare">l m l n l l k k j j j j j k k m</spanx></c> |
| <c>21</c> |
| <c><spanx style="vbare">k o l p p m n m n l n l l k l l</spanx></c> |
| <c>22</c> |
| <c><spanx style="vbare">k l n o o l n l m m l l l l k m</spanx></c> |
| <c>23</c> |
| <c><spanx style="vbare">j l l m m m m l n n n l j j j j</spanx></c> |
| <c>24</c> |
| <c><spanx style="vbare">k n l o o m p m m n l m m l l l</spanx></c> |
| <c>25</c> |
| <c><spanx style="vbare">i o j j i i i i i i i i i i i i</spanx></c> |
| <c>26</c> |
| <c><spanx style="vbare">i o o l n k n n l m m p p m m m</spanx></c> |
| <c>27</c> |
| <c><spanx style="vbare">l l p l n m l l l k k l l l k l</spanx></c> |
| <c>28</c> |
| <c><spanx style="vbare">i i j i i i k j k j j k k k j j</spanx></c> |
| <c>29</c> |
| <c><spanx style="vbare">i l k n l l k l k j i i j i i j</spanx></c> |
| <c>30</c> |
| <c><spanx style="vbare">l n n m p n l l k l k k j i j i</spanx></c> |
| <c>31</c> |
| <c><spanx style="vbare">k l n l m l l l k j k o m i i i</spanx></c> |
| </texttable> |
| |
| <t> |
| Decoding the second stage residual proceeds as follows. |
| For each coefficient, the decoder reads a symbol using the PDF corresponding to |
| I1 from either <xref target="silk_nlsf_nbmb_stage2_cb_sel"/> or |
| <xref target="silk_nlsf_wb_stage2_cb_sel"/>, and subtracts 4 from the result |
| to give an index in the range -4 to 4, inclusive. |
| If the index is either -4 or 4, it reads a second symbol using the PDF in |
| <xref target="silk_nlsf_ext_pdf"/>, and adds the value of this second symbol |
| to the index, using the same sign. |
| This gives the index, I2[k], a total range of -10 to 10, inclusive. |
| </t> |
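| <t> |
| The combination of the two symbols can be sketched as below. |
| This is a non-normative illustration that assumes both symbols have already |
| been read from the range decoder; the function name nlsf_stage2_index() is |
| hypothetical, and ext_symbol (decoded with the PDF in |
| <xref target="silk_nlsf_ext_pdf"/>) is only consulted when the first symbol |
| lies at either end of its range. |
| </t> |
| <figure align="left"> |
| <artwork align="left"><![CDATA[ |
```c
#include <assert.h>

/* Sketch: map a stage-2 symbol (0..8) and an optional extension
   symbol (0..6) to the index I2[k] in the range -10..10. */
static int nlsf_stage2_index(int symbol, int ext_symbol)
{
    int index;

    assert(symbol >= 0 && symbol <= 8);
    assert(ext_symbol >= 0 && ext_symbol <= 6);

    index = symbol - 4;
    if (index == -4)
        index -= ext_symbol;    /* extend with the same sign */
    else if (index == 4)
        index += ext_symbol;
    return index;               /* ext_symbol unused otherwise */
}
```
| ]]></artwork> |
| </figure> |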
| |
| <texttable anchor="silk_nlsf_ext_pdf" |
| title="PDF for Normalized LSF Index Extension Decoding"> |
| <ttcol align="left">PDF</ttcol> |
| <c>{156, 60, 24, 9, 4, 2, 1}/256</c> |
| </texttable> |
| |
| <t> |
| The decoded indices from both stages are translated back into normalized LSF |
| coefficients in silk_NLSF_decode() (NLSF_decode.c). |
| The stage-2 indices represent residuals after both the first stage of the VQ |
| and a separate backwards-prediction step. |
| The backwards prediction process in the encoder subtracts a prediction from |
| each residual formed by a multiple of the coefficient that follows it. |
| The decoder must undo this process. |
| <xref target="silk_nlsf_pred_weights"/> contains lists of prediction weights |
| for each coefficient. |
| There are two lists for NB and MB, and another two lists for WB, giving two |
| possible prediction weights for each coefficient. |
| </t> |
| |
| <texttable anchor="silk_nlsf_pred_weights" |
| title="Prediction Weights for Normalized LSF Decoding"> |
| <ttcol align="left">Coefficient</ttcol> |
| <ttcol align="right">A</ttcol> |
| <ttcol align="right">B</ttcol> |
| <ttcol align="right">C</ttcol> |
| <ttcol align="right">D</ttcol> |
| <c>0</c> <c>179</c> <c>116</c> <c>175</c> <c>68</c> |
| <c>1</c> <c>138</c> <c>67</c> <c>148</c> <c>62</c> |
| <c>2</c> <c>140</c> <c>82</c> <c>160</c> <c>66</c> |
| <c>3</c> <c>148</c> <c>59</c> <c>176</c> <c>60</c> |
| <c>4</c> <c>151</c> <c>92</c> <c>178</c> <c>72</c> |
| <c>5</c> <c>149</c> <c>72</c> <c>173</c> <c>117</c> |
| <c>6</c> <c>153</c> <c>100</c> <c>174</c> <c>85</c> |
| <c>7</c> <c>151</c> <c>89</c> <c>164</c> <c>90</c> |
| <c>8</c> <c>163</c> <c>92</c> <c>177</c> <c>118</c> |
| <c>9</c> <c/> <c/> <c>174</c> <c>136</c> |
| <c>10</c> <c/> <c/> <c>196</c> <c>151</c> |
| <c>11</c> <c/> <c/> <c>182</c> <c>142</c> |
| <c>12</c> <c/> <c/> <c>198</c> <c>160</c> |
| <c>13</c> <c/> <c/> <c>192</c> <c>142</c> |
| <c>14</c> <c/> <c/> <c>182</c> <c>155</c> |
| </texttable> |
| |
| <t> |
| The prediction is undone using the procedure implemented in |
| silk_NLSF_residual_dequant() (NLSF_decode.c), which is as follows. |
| Each coefficient selects its prediction weight from one of the two lists based |
| on the stage-1 index, I1. |
| <xref target="silk_nlsf_nbmb_weight_sel"/> gives the selections for each |
| coefficient for NB and MB, and <xref target="silk_nlsf_wb_weight_sel"/> gives |
| the selections for WB. |
| Let d_LPC be the order of the codebook, i.e., 10 for NB and MB, and 16 for WB, |
| and let pred_Q8[k] be the weight for the k'th coefficient selected by this |
| process for 0 &lt;= k &lt; d_LPC-1. |
| Then, the stage-2 residual for each coefficient is computed via |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| res_Q10[k] = (k+1 < d_LPC ? (res_Q10[k+1]*pred_Q8[k])>>8 : 0) |
| + ((((I2[k]<<10) - sign(I2[k])*102)*qstep)>>16) , |
| ]]></artwork> |
| </figure> |
| where qstep is the Q16 quantization step size, which is 11796 for NB and MB |
| and 9830 for WB (representing step sizes of approximately 0.18 and 0.15, |
| respectively). |
| </t> |
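| <t> |
| As a non-normative sketch, the recursion above can be written in C as follows. |
| The coefficients must be processed from the last to the first, since each |
| residual depends on the one that follows it. |
| The names floor_shr() and nlsf_residual_dequant_sketch() are hypothetical; |
| floor_shr() is an assumption that models the right shifts as rounding toward |
| negative infinity without shifting a negative value in C, and the left shift |
| by 10 is written as a multiplication for the same reason. |
| </t> |
| <figure align="left"> |
| <artwork align="left"><![CDATA[ |
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: x >> n rounding toward minus infinity. */
static int32_t floor_shr(int32_t x, int n)
{
    return x >= 0 ? x >> n : -((-x + (1 << n) - 1) >> n);
}

/* Sketch: undo the backwards prediction for one frame.
   pred_Q8 holds d_LPC-1 weights; qstep is the Q16 step size. */
static void nlsf_residual_dequant_sketch(int32_t *res_Q10,
                                         const int *I2,
                                         const int *pred_Q8,
                                         int32_t qstep, int d_LPC)
{
    int k;

    assert(d_LPC > 0);
    for (k = d_LPC - 1; k >= 0; k--) {
        int sign = (I2[k] > 0) - (I2[k] < 0);
        int32_t pred = 0;

        if (k + 1 < d_LPC)
            pred = floor_shr(res_Q10[k + 1]*pred_Q8[k], 8);
        res_Q10[k] = pred
            + floor_shr((I2[k]*1024 - sign*102)*qstep, 16);
    }
}
```
| ]]></artwork> |
| </figure> |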
| |
| <texttable anchor="silk_nlsf_nbmb_weight_sel" |
| title="Prediction Weight Selection for NB/MB Normalized LSF Decoding"> |
| <ttcol>I1</ttcol> |
| <ttcol>Coefficient</ttcol> |
| <c/> |
| <c><spanx style="vbare">0 1 2 3 4 5 6 7 8</spanx></c> |
| <c> 0</c> |
| <c><spanx style="vbare">A B A A A A A A A</spanx></c> |
| <c> 1</c> |
| <c><spanx style="vbare">B A A A A A A A A</spanx></c> |
| <c> 2</c> |
| <c><spanx style="vbare">A A A A A A A A A</spanx></c> |
| <c> 3</c> |
| <c><spanx style="vbare">B B B A A A A B A</spanx></c> |
| <c> 4</c> |
| <c><spanx style="vbare">A B A A A A A A A</spanx></c> |
| <c> 5</c> |
| <c><spanx style="vbare">A B A A A A A A A</spanx></c> |
| <c> 6</c> |
| <c><spanx style="vbare">B A B B A A A B A</spanx></c> |
| <c> 7</c> |
| <c><spanx style="vbare">A B B A A B B A A</spanx></c> |
| <c> 8</c> |
| <c><spanx style="vbare">A A B B A B A B B</spanx></c> |
| <c> 9</c> |
| <c><spanx style="vbare">A A B B A A B B B</spanx></c> |
| <c>10</c> |
| <c><spanx style="vbare">A A A A A A A A A</spanx></c> |
| <c>11</c> |
| <c><spanx style="vbare">A B A B B B B B A</spanx></c> |
| <c>12</c> |
| <c><spanx style="vbare">A B A B B B B B A</spanx></c> |
| <c>13</c> |
| <c><spanx style="vbare">A B B B B B B B A</spanx></c> |
| <c>14</c> |
| <c><spanx style="vbare">B A B B A B B B B</spanx></c> |
| <c>15</c> |
| <c><spanx style="vbare">A B B B B B A B A</spanx></c> |
| <c>16</c> |
| <c><spanx style="vbare">A A B B A B A B A</spanx></c> |
| <c>17</c> |
| <c><spanx style="vbare">A A B B B A B B B</spanx></c> |
| <c>18</c> |
| <c><spanx style="vbare">A B B A A B B B A</spanx></c> |
| <c>19</c> |
| <c><spanx style="vbare">A A A B B B A B A</spanx></c> |
| <c>20</c> |
| <c><spanx style="vbare">A B B A A B A B A</spanx></c> |
| <c>21</c> |
| <c><spanx style="vbare">A B B A A A B B A</spanx></c> |
| <c>22</c> |
| <c><spanx style="vbare">A A A A A B B B B</spanx></c> |
| <c>23</c> |
| <c><spanx style="vbare">A A B B A A A B B</spanx></c> |
| <c>24</c> |
| <c><spanx style="vbare">A A A B A B B B B</spanx></c> |
| <c>25</c> |
| <c><spanx style="vbare">A B B B B B B B A</spanx></c> |
| <c>26</c> |
| <c><spanx style="vbare">A A A A A A A A A</spanx></c> |
| <c>27</c> |
| <c><spanx style="vbare">A A A A A A A A A</spanx></c> |
| <c>28</c> |
| <c><spanx style="vbare">A A B A B B A B A</spanx></c> |
| <c>29</c> |
| <c><spanx style="vbare">B A A B A A A A A</spanx></c> |
| <c>30</c> |
| <c><spanx style="vbare">A A A B B A B A B</spanx></c> |
| <c>31</c> |
| <c><spanx style="vbare">B A B B A B B B B</spanx></c> |
| </texttable> |
| |
| <texttable anchor="silk_nlsf_wb_weight_sel" |
| title="Prediction Weight Selection for WB Normalized LSF Decoding"> |
| <ttcol>I1</ttcol> |
| <ttcol>Coefficient</ttcol> |
| <c/> |
| <c><spanx style="vbare">0 1 2 3 4 5 6 7 8 9 10 11 12 13 14</spanx></c> |
| <c> 0</c> |
| <c><spanx style="vbare">C C C C C C C C C C C C C C D</spanx></c> |
| <c> 1</c> |
| <c><spanx style="vbare">C C C C C C C C C C C C C C C</spanx></c> |
| <c> 2</c> |
| <c><spanx style="vbare">C C D C C D D D C D D D D C C</spanx></c> |
| <c> 3</c> |
| <c><spanx style="vbare">C C C C C C C C C C C C D C C</spanx></c> |
| <c> 4</c> |
| <c><spanx style="vbare">C D D C D C D D C D D D D D C</spanx></c> |
| <c> 5</c> |
| <c><spanx style="vbare">C C D C C C C C C C C C C C C</spanx></c> |
| <c> 6</c> |
| <c><spanx style="vbare">D C C C C C C C C C C D C D C</spanx></c> |
| <c> 7</c> |
| <c><spanx style="vbare">C D D C C C D C D D D C D C D</spanx></c> |
| <c> 8</c> |
| <c><spanx style="vbare">C D C D D C D C D C D D D D D</spanx></c> |
| <c> 9</c> |
| <c><spanx style="vbare">C C C C C C C C C C C C C C D</spanx></c> |
| <c>10</c> |
| <c><spanx style="vbare">C D C C C C C C C C C C C C C</spanx></c> |
| <c>11</c> |
| <c><spanx style="vbare">C C D C D D D D D D D C D C C</spanx></c> |
| <c>12</c> |
| <c><spanx style="vbare">C C D C C D C D C D C C D C C</spanx></c> |
| <c>13</c> |
| <c><spanx style="vbare">C C C C D D C D C D D D D C C</spanx></c> |
| <c>14</c> |
| <c><spanx style="vbare">C D C C C D D C D D D C D D D</spanx></c> |
| <c>15</c> |
| <c><spanx style="vbare">C C D D C C C C C C C C D D C</spanx></c> |
| <c>16</c> |
| <c><spanx style="vbare">C D D C D C D D D D D C D C C</spanx></c> |
| <c>17</c> |
| <c><spanx style="vbare">C C D C C C C D C C D D D C C</spanx></c> |
| <c>18</c> |
| <c><spanx style="vbare">C C C C C C C C C C C C C C D</spanx></c> |
| <c>19</c> |
| <c><spanx style="vbare">C C C C C C C C C C C C D C C</spanx></c> |
| <c>20</c> |
| <c><spanx style="vbare">C C C C C C C C C C C C C C C</spanx></c> |
| <c>21</c> |
| <c><spanx style="vbare">C D C D C D D C D C D C D D C</spanx></c> |
| <c>22</c> |
| <c><spanx style="vbare">C C D D D D C D D C C D D C C</spanx></c> |
| <c>23</c> |
| <c><spanx style="vbare">C D D C D C D C D C C C C D C</spanx></c> |
| <c>24</c> |
| <c><spanx style="vbare">C C C D D C D C D D D D D D D</spanx></c> |
| <c>25</c> |
| <c><spanx style="vbare">C C C C C C C C C C C C C C D</spanx></c> |
| <c>26</c> |
| <c><spanx style="vbare">C D D C C C D D C C D D D D D</spanx></c> |
| <c>27</c> |
| <c><spanx style="vbare">C C C C C D C D D D D C D D D</spanx></c> |
| <c>28</c> |
| <c><spanx style="vbare">C C C C C C C C C C C C C C D</spanx></c> |
| <c>29</c> |
| <c><spanx style="vbare">C C C C C C C C C C C C C C D</spanx></c> |
| <c>30</c> |
| <c><spanx style="vbare">D C C C C C C C C C C D C C C</spanx></c> |
| <c>31</c> |
| <c><spanx style="vbare">C C D C C D D D C C D C C D C</spanx></c> |
| </texttable> |
| |
| </section> |
| |
| <section anchor="silk_nlsf_reconstruction" |
| title="Reconstructing the Normalized LSF Coefficients"> |
| <t> |
| Once the stage-1 index I1 and the stage-2 residual res_Q10[] have been decoded, |
| the final normalized LSF coefficients can be reconstructed. |
| </t> |
| <t> |
| The spectral distortion introduced by the quantization of each LSF coefficient |
| varies, so the stage-2 residual is weighted accordingly, using the |
| low-complexity Inverse Harmonic Mean Weighting (IHMW) function proposed in |
| <xref target="laroia-icassp"/>. |
| The weights are derived directly from the stage-1 codebook vector. |
| Let cb1_Q8[k] be the k'th entry of the stage-1 codebook vector from |
| <xref target="silk_nlsf_nbmb_codebook"/> or |
| <xref target="silk_nlsf_wb_codebook"/>. |
| Then for 0 &lt;= k &lt; d_LPC the following expression |
| computes the square of the weight as a Q18 value: |
| <figure align="center"> |
| <artwork align="center"> |
| <![CDATA[ |
| w2_Q18[k] = (1024/(cb1_Q8[k] - cb1_Q8[k-1]) |
| + 1024/(cb1_Q8[k+1] - cb1_Q8[k])) << 16 , |
| ]]> |
| </artwork> |
| </figure> |
| where cb1_Q8[-1] = 0 and cb1_Q8[d_LPC] = 256, and the |
| division is integer division. |
| This is reduced to an unsquared, Q9 value using the following square-root |
| approximation: |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| i = ilog(w2_Q18[k]) |
| f = (w2_Q18[k]>>(i-8)) & 127 |
| y = ((i&1) ? 32768 : 46214) >> ((32-i)>>1) |
| w_Q9[k] = y + ((213*f*y)>>16) |
| ]]></artwork> |
| </figure> |
| The constant 46214 here is approximately the square root of 2 in Q15. |
| The cb1_Q8[] vector completely determines these weights, and they may be |
| tabulated and stored as 13-bit unsigned values (with a range of 1819 to 5227, |
| inclusive) to avoid computing them when decoding. |
| The reference implementation already requires code to compute these weights on |
| unquantized coefficients in the encoder, in silk_NLSF_VQ_weights_laroia() |
| (NLSF_VQ_weights_laroia.c) and its callers, so it reuses that code in the |
| decoder instead of using a pre-computed table to reduce the amount of ROM |
| required. |
| </t> |
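| <t> |
| The weight computation above can be sketched in C as shown below. |
| This is a non-normative illustration and is not the code of |
| silk_NLSF_VQ_weights_laroia(); the names ilog_u32() and nlsf_weights_sketch() |
| are hypothetical. |
| It assumes that ilog(x) is the minimum number of bits needed to represent x, |
| and that the codebook vector is strictly increasing with entries between 1 |
| and 255, so that neither division is by zero. |
| </t> |
| <figure align="left"> |
| <artwork align="left"><![CDATA[ |
```c
#include <assert.h>
#include <stdint.h>

/* Assumed definition: bits needed to represent x (0 for x == 0). */
static int ilog_u32(uint32_t x)
{
    int n = 0;

    while (x) {
        n++;
        x >>= 1;
    }
    return n;
}

/* Sketch: derive the Q9 weights from a stage-1 codebook vector. */
static void nlsf_weights_sketch(int32_t *w_Q9, const int *cb1_Q8,
                                int d_LPC)
{
    int k;

    for (k = 0; k < d_LPC; k++) {
        int prev = k > 0 ? cb1_Q8[k - 1] : 0;
        int next = k + 1 < d_LPC ? cb1_Q8[k + 1] : 256;
        uint32_t w2_Q18;
        int32_t f, y;
        int i;

        assert(prev < cb1_Q8[k] && cb1_Q8[k] < next);
        w2_Q18 = (uint32_t)(1024/(cb1_Q8[k] - prev)
                            + 1024/(next - cb1_Q8[k])) << 16;

        /* Square-root approximation from the text. */
        i = ilog_u32(w2_Q18);
        f = (int32_t)((w2_Q18 >> (i - 8)) & 127);
        y = ((i & 1) ? 32768 : 46214) >> ((32 - i) >> 1);
        w_Q9[k] = y + ((213*f*y) >> 16);
    }
}
```
| ]]></artwork> |
| </figure> |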
| |
| <texttable anchor="silk_nlsf_nbmb_codebook" |
| title="NB/MB Normalized LSF Stage-1 Codebook Vectors"> |
| <ttcol>I1</ttcol> |
| <ttcol>Codebook (Q8)</ttcol> |
| <c/> |
| <c><spanx style="vbare"> 0 1 2 3 4 5 6 7 8 9</spanx></c> |
| <c>0</c> |
| <c><spanx style="vbare">12 35 60 83 108 132 157 180 206 228</spanx></c> |
| <c>1</c> |
| <c><spanx style="vbare">15 32 55 77 101 125 151 175 201 225</spanx></c> |
| <c>2</c> |
| <c><spanx style="vbare">19 42 66 89 114 137 162 184 209 230</spanx></c> |
| <c>3</c> |
| <c><spanx style="vbare">12 25 50 72 97 120 147 172 200 223</spanx></c> |
| <c>4</c> |
| <c><spanx style="vbare">26 44 69 90 114 135 159 180 205 225</spanx></c> |
| <c>5</c> |
| <c><spanx style="vbare">13 22 53 80 106 130 156 180 205 228</spanx></c> |
| <c>6</c> |
| <c><spanx style="vbare">15 25 44 64 90 115 142 168 196 222</spanx></c> |
| <c>7</c> |
| <c><spanx style="vbare">19 24 62 82 100 120 145 168 190 214</spanx></c> |
| <c>8</c> |
| <c><spanx style="vbare">22 31 50 79 103 120 151 170 203 227</spanx></c> |
| <c>9</c> |
| <c><spanx style="vbare">21 29 45 65 106 124 150 171 196 224</spanx></c> |
| <c>10</c> |
| <c><spanx style="vbare">30 49 75 97 121 142 165 186 209 229</spanx></c> |
| <c>11</c> |
| <c><spanx style="vbare">19 25 52 70 93 116 143 166 192 219</spanx></c> |
| <c>12</c> |
| <c><spanx style="vbare">26 34 62 75 97 118 145 167 194 217</spanx></c> |
| <c>13</c> |
| <c><spanx style="vbare">25 33 56 70 91 113 143 165 196 223</spanx></c> |
| <c>14</c> |
| <c><spanx style="vbare">21 34 51 72 97 117 145 171 196 222</spanx></c> |
| <c>15</c> |
| <c><spanx style="vbare">20 29 50 67 90 117 144 168 197 221</spanx></c> |
| <c>16</c> |
| <c><spanx style="vbare">22 31 48 66 95 117 146 168 196 222</spanx></c> |
| <c>17</c> |
| <c><spanx style="vbare">24 33 51 77 116 134 158 180 200 224</spanx></c> |
| <c>18</c> |
| <c><spanx style="vbare">21 28 70 87 106 124 149 170 194 217</spanx></c> |
| <c>19</c> |
| <c><spanx style="vbare">26 33 53 64 83 117 152 173 204 225</spanx></c> |
| <c>20</c> |
| <c><spanx style="vbare">27 34 65 95 108 129 155 174 210 225</spanx></c> |
| <c>21</c> |
| <c><spanx style="vbare">20 26 72 99 113 131 154 176 200 219</spanx></c> |
| <c>22</c> |
| <c><spanx style="vbare">34 43 61 78 93 114 155 177 205 229</spanx></c> |
| <c>23</c> |
| <c><spanx style="vbare">23 29 54 97 124 138 163 179 209 229</spanx></c> |
| <c>24</c> |
| <c><spanx style="vbare">30 38 56 89 118 129 158 178 200 231</spanx></c> |
| <c>25</c> |
| <c><spanx style="vbare">21 29 49 63 85 111 142 163 193 222</spanx></c> |
| <c>26</c> |
| <c><spanx style="vbare">27 48 77 103 133 158 179 196 215 232</spanx></c> |
| <c>27</c> |
| <c><spanx style="vbare">29 47 74 99 124 151 176 198 220 237</spanx></c> |
| <c>28</c> |
| <c><spanx style="vbare">33 42 61 76 93 121 155 174 207 225</spanx></c> |
| <c>29</c> |
| <c><spanx style="vbare">29 53 87 112 136 154 170 188 208 227</spanx></c> |
| <c>30</c> |
| <c><spanx style="vbare">24 30 52 84 131 150 166 186 203 229</spanx></c> |
| <c>31</c> |
| <c><spanx style="vbare">37 48 64 84 104 118 156 177 201 230</spanx></c> |
| </texttable> |
| |
| <texttable anchor="silk_nlsf_wb_codebook" |
| title="WB Normalized LSF Stage-1 Codebook Vectors"> |
| <ttcol>I1</ttcol> |
| <ttcol>Codebook (Q8)</ttcol> |
| <c/> |
| <c><spanx style="vbare"> 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15</spanx></c> |
| <c>0</c> |
| <c><spanx style="vbare"> 7 23 38 54 69 85 100 116 131 147 162 178 193 208 223 239</spanx></c> |
| <c>1</c> |
| <c><spanx style="vbare">13 25 41 55 69 83 98 112 127 142 157 171 187 203 220 236</spanx></c> |
| <c>2</c> |
| <c><spanx style="vbare">15 21 34 51 61 78 92 106 126 136 152 167 185 205 225 240</spanx></c> |
| <c>3</c> |
| <c><spanx style="vbare">10 21 36 50 63 79 95 110 126 141 157 173 189 205 221 237</spanx></c> |
| <c>4</c> |
| <c><spanx style="vbare">17 20 37 51 59 78 89 107 123 134 150 164 184 205 224 240</spanx></c> |
| <c>5</c> |
| <c><spanx style="vbare">10 15 32 51 67 81 96 112 129 142 158 173 189 204 220 236</spanx></c> |
| <c>6</c> |
| <c><spanx style="vbare"> 8 21 37 51 65 79 98 113 126 138 155 168 179 192 209 218</spanx></c> |
| <c>7</c> |
| <c><spanx style="vbare">12 15 34 55 63 78 87 108 118 131 148 167 185 203 219 236</spanx></c> |
| <c>8</c> |
| <c><spanx style="vbare">16 19 32 36 56 79 91 108 118 136 154 171 186 204 220 237</spanx></c> |
| <c>9</c> |
| <c><spanx style="vbare">11 28 43 58 74 89 105 120 135 150 165 180 196 211 226 241</spanx></c> |
| <c>10</c> |
| <c><spanx style="vbare"> 6 16 33 46 60 75 92 107 123 137 156 169 185 199 214 225</spanx></c> |
| <c>11</c> |
| <c><spanx style="vbare">11 19 30 44 57 74 89 105 121 135 152 169 186 202 218 234</spanx></c> |
| <c>12</c> |
| <c><spanx style="vbare">12 19 29 46 57 71 88 100 120 132 148 165 182 199 216 233</spanx></c> |
| <c>13</c> |
| <c><spanx style="vbare">17 23 35 46 56 77 92 106 123 134 152 167 185 204 222 237</spanx></c> |
| <c>14</c> |
| <c><spanx style="vbare">14 17 45 53 63 75 89 107 115 132 151 171 188 206 221 240</spanx></c> |
| <c>15</c> |
| <c><spanx style="vbare"> 9 16 29 40 56 71 88 103 119 137 154 171 189 205 222 237</spanx></c> |
| <c>16</c> |
| <c><spanx style="vbare">16 19 36 48 57 76 87 105 118 132 150 167 185 202 218 236</spanx></c> |
| <c>17</c> |
| <c><spanx style="vbare">12 17 29 54 71 81 94 104 126 136 149 164 182 201 221 237</spanx></c> |
| <c>18</c> |
| <c><spanx style="vbare">15 28 47 62 79 97 115 129 142 155 168 180 194 208 223 238</spanx></c> |
| <c>19</c> |
| <c><spanx style="vbare"> 8 14 30 45 62 78 94 111 127 143 159 175 192 207 223 239</spanx></c> |
| <c>20</c> |
| <c><spanx style="vbare">17 30 49 62 79 92 107 119 132 145 160 174 190 204 220 235</spanx></c> |
| <c>21</c> |
| <c><spanx style="vbare">14 19 36 45 61 76 91 108 121 138 154 172 189 205 222 238</spanx></c> |
| <c>22</c> |
| <c><spanx style="vbare">12 18 31 45 60 76 91 107 123 138 154 171 187 204 221 236</spanx></c> |
| <c>23</c> |
| <c><spanx style="vbare">13 17 31 43 53 70 83 103 114 131 149 167 185 203 220 237</spanx></c> |
| <c>24</c> |
| <c><spanx style="vbare">17 22 35 42 58 78 93 110 125 139 155 170 188 206 224 240</spanx></c> |
| <c>25</c> |
| <c><spanx style="vbare"> 8 15 34 50 67 83 99 115 131 146 162 178 193 209 224 239</spanx></c> |
| <c>26</c> |
| <c><spanx style="vbare">13 16 41 66 73 86 95 111 128 137 150 163 183 206 225 241</spanx></c> |
| <c>27</c> |
| <c><spanx style="vbare">17 25 37 52 63 75 92 102 119 132 144 160 175 191 212 231</spanx></c> |
| <c>28</c> |
| <c><spanx style="vbare">19 31 49 65 83 100 117 133 147 161 174 187 200 213 227 242</spanx></c> |
| <c>29</c> |
| <c><spanx style="vbare">18 31 52 68 88 103 117 126 138 149 163 177 192 207 223 239</spanx></c> |
| <c>30</c> |
| <c><spanx style="vbare">16 29 47 61 76 90 106 119 133 147 161 176 193 209 224 240</spanx></c> |
| <c>31</c> |
| <c><spanx style="vbare">15 21 35 50 61 73 86 97 110 119 129 141 175 198 218 237</spanx></c> |
| </texttable> |
| |
| <t> |
| Given the stage-1 codebook entry cb1_Q8[], the stage-2 residual res_Q10[], and |
| their corresponding weights, w_Q9[], the reconstructed normalized LSF |
| coefficients are |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| NLSF_Q15[k] = clamp(0, |
| (cb1_Q8[k]<<7) + (res_Q10[k]<<14)/w_Q9[k], 32767) , |
| ]]></artwork> |
| </figure> |
| where the division is integer division. |
| However, nothing in either the reconstruction process or the |
| quantization process in the encoder thus far guarantees that the coefficients |
| are monotonically increasing and separated well enough to ensure a stable |
| filter <xref target="Kabal86"/>. |
| When using the reference encoder, roughly 2% of frames violate this constraint. |
| The next section describes a stabilization procedure used to make these |
| guarantees. |
| </t> |
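| <t> |
| A non-normative C sketch of the reconstruction of a single coefficient |
| follows. |
| The function name nlsf_reconstruct_sketch() is hypothetical. |
| The sketch assumes that the integer division truncates toward zero, as the C |
| division operator does, and it writes the left shifts as multiplications so |
| that a negative residual is never shifted. |
| </t> |
| <figure align="left"> |
| <artwork align="left"><![CDATA[ |
```c
#include <assert.h>
#include <stdint.h>

/* Sketch: reconstruct one normalized LSF coefficient in Q15. */
static int32_t nlsf_reconstruct_sketch(int cb1_Q8, int32_t res_Q10,
                                       int32_t w_Q9)
{
    int32_t v;

    assert(w_Q9 > 0);

    /* (cb1_Q8<<7) + (res_Q10<<14)/w_Q9, with C truncating
       division assumed for negative residuals. */
    v = cb1_Q8*128 + (res_Q10*16384)/w_Q9;

    if (v < 0)
        v = 0;
    if (v > 32767)
        v = 32767;
    return v;
}
```
| ]]></artwork> |
| </figure> |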
| |
| </section> |
| |
| <section anchor="silk_nlsf_stabilization" title="Normalized LSF Stabilization"> |
| <t> |
| The normalized LSF stabilization procedure is implemented in |
| silk_NLSF_stabilize() (NLSF_stabilize.c). |
| This process ensures that consecutive values of the normalized LSF |
| coefficients, NLSF_Q15[], are spaced some minimum distance apart |
| (predetermined to be the 0.01 percentile of a large training set). |
| <xref target="silk_nlsf_min_spacing"/> gives the minimum spacings for NB and MB |
| and those for WB, where row k is the minimum allowed value of |
| NLSF_Q15[k]-NLSF_Q15[k-1]. |
| For the purposes of computing this spacing for the first and last coefficient, |
| NLSF_Q15[-1] is taken to be 0, and NLSF_Q15[d_LPC] is taken to be 32768. |
| </t> |
| |
| <texttable anchor="silk_nlsf_min_spacing" |
| title="Minimum Spacing for Normalized LSF Coefficients"> |
| <ttcol>Coefficient</ttcol> |
| <ttcol align="right">NB and MB</ttcol> |
| <ttcol align="right">WB</ttcol> |
| <c>0</c> <c>250</c> <c>100</c> |
| <c>1</c> <c>3</c> <c>3</c> |
| <c>2</c> <c>6</c> <c>40</c> |
| <c>3</c> <c>3</c> <c>3</c> |
| <c>4</c> <c>3</c> <c>3</c> |
| <c>5</c> <c>3</c> <c>3</c> |
| <c>6</c> <c>4</c> <c>5</c> |
| <c>7</c> <c>3</c> <c>14</c> |
| <c>8</c> <c>3</c> <c>14</c> |
| <c>9</c> <c>3</c> <c>10</c> |
| <c>10</c> <c>461</c> <c>11</c> |
| <c>11</c> <c/> <c>3</c> |
| <c>12</c> <c/> <c>8</c> |
| <c>13</c> <c/> <c>9</c> |
| <c>14</c> <c/> <c>7</c> |
| <c>15</c> <c/> <c>3</c> |
| <c>16</c> <c/> <c>347</c> |
| </texttable> |
| |
| <t> |
| The procedure first tries small adjustments that attempt to minimize the |
| amount of distortion introduced. |
| After 20 such adjustments, it falls back to a more direct method that |
| guarantees the constraints are enforced, but may require large adjustments. |
| </t> |
| <t> |
| Let NDeltaMin_Q15[k] be the minimum required spacing for the current audio |
| bandwidth from <xref target="silk_nlsf_min_spacing"/>. |
| First, the procedure finds the index i where |
| NLSF_Q15[i] - NLSF_Q15[i-1] - NDeltaMin_Q15[i] is the |
| smallest, breaking ties by using the lower value of i. |
| If this value is non-negative, then the stabilization stops; the coefficients |
| satisfy all the constraints. |
| Otherwise, if i == 0, it sets NLSF_Q15[0] to NDeltaMin_Q15[0], and if |
| i == d_LPC, it sets NLSF_Q15[d_LPC-1] to |
| (32768 - NDeltaMin_Q15[d_LPC]). |
| For all other values of i, both NLSF_Q15[i-1] and NLSF_Q15[i] are updated as |
| follows: |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
|                                           i-1 |
|                                           __ |
|  min_center_Q15 = (NDeltaMin_Q15[i]>>1) + \  NDeltaMin_Q15[k] |
|                                           /_ |
|                                           k=0 |
|                                                  d_LPC |
|                                                   __ |
|  max_center_Q15 = 32768 - (NDeltaMin_Q15[i]>>1) - \  NDeltaMin_Q15[k] |
|                                                   /_ |
|                                                  k=i+1 |
| center_freq_Q15 = clamp(min_center_Q15, |
|                         (NLSF_Q15[i-1] + NLSF_Q15[i] + 1)>>1, |
|                         max_center_Q15) |
|
|   NLSF_Q15[i-1] = center_freq_Q15 - (NDeltaMin_Q15[i]>>1) |
|
|     NLSF_Q15[i] = NLSF_Q15[i-1] + NDeltaMin_Q15[i] . |
| ]]></artwork> |
| </figure> |
| The procedure then repeats until it has either executed 20 times or stopped |
| because the coefficients satisfy all the constraints. |
| </t> |
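<t>
The iterative phase described above can be sketched in C as follows.
This is a non-normative illustration under stated assumptions, not the
reference implementation: the names nlsf_stabilize_iterative(), clamp32(),
ndelta_min_nb_mb, and MAX_LPC_ORDER are hypothetical, and only the NB/MB
spacing column is included.
The function returns 1 if the constraints were met before 20 adjustments had
been made, and 0 if all 20 adjustments were made and the fallback procedure
still has to run.
</t>
<figure>
<artwork><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

#define MAX_LPC_ORDER 16  /* assumed upper bound on d_LPC (WB uses 16) */

/* Minimum spacings for NB and MB (d_LPC = 10), copied from the table. */
static const int16_t ndelta_min_nb_mb[11] = {
    250, 3, 6, 3, 3, 3, 4, 3, 3, 3, 461
};

static int32_t clamp32(int32_t lo, int32_t x, int32_t hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}

/* Hypothetical helper: runs at most 20 adjustments on nlsf_q15[0..d_lpc-1].
   ndelta_min_q15 must hold d_lpc+1 entries. */
static int nlsf_stabilize_iterative(int16_t *nlsf_q15,
                                    const int16_t *ndelta_min_q15,
                                    int d_lpc)
{
    assert(d_lpc >= 2 && d_lpc <= MAX_LPC_ORDER);
    for (int pass = 0; pass < 20; pass++) {
        /* Find the index with the smallest slack.  The strict '<' keeps
           the lowest index on ties.  NLSF_Q15[-1] is taken to be 0 and
           NLSF_Q15[d_LPC] is taken to be 32768. */
        int32_t min_slack = nlsf_q15[0] - ndelta_min_q15[0];
        int i = 0;
        for (int k = 1; k <= d_lpc; k++) {
            int32_t upper = (k == d_lpc) ? 32768 : nlsf_q15[k];
            int32_t slack = upper - nlsf_q15[k - 1] - ndelta_min_q15[k];
            if (slack < min_slack) {
                min_slack = slack;
                i = k;
            }
        }
        if (min_slack >= 0) {
            return 1;  /* all constraints satisfied */
        }
        if (i == 0) {
            nlsf_q15[0] = ndelta_min_q15[0];
        } else if (i == d_lpc) {
            nlsf_q15[d_lpc - 1] = (int16_t)(32768 - ndelta_min_q15[d_lpc]);
        } else {
            int32_t half = ndelta_min_q15[i] >> 1;
            int32_t min_center = half;
            int32_t max_center = 32768 - half;
            for (int k = 0; k < i; k++) {
                min_center += ndelta_min_q15[k];
            }
            for (int k = i + 1; k <= d_lpc; k++) {
                max_center -= ndelta_min_q15[k];
            }
            int32_t center = clamp32(min_center,
                                     (nlsf_q15[i - 1] + nlsf_q15[i] + 1) >> 1,
                                     max_center);
            nlsf_q15[i - 1] = (int16_t)(center - half);
            nlsf_q15[i] = (int16_t)(nlsf_q15[i - 1] + ndelta_min_q15[i]);
        }
    }
    return 0;  /* 20 adjustments made; fallback required */
}
```
]]></artwork>
</figure>
<t>
For example, with the NB/MB spacings, an input whose first three coefficients
are 100, 5000, and 5001 is first adjusted at index 0 and then at index 2,
after which every spacing constraint holds and the loop stops.
</t>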
| <t> |
| After the 20th repetition of the above procedure, the following fallback |
| procedure executes once. |
| First, the values of NLSF_Q15[k] for 0 &lt;= k &lt; d_LPC |
| are sorted in ascending order. |
| Then for each value of k from 0 to d_LPC-1, NLSF_Q15[k] is set to |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| max(NLSF_Q15[k], NLSF_Q15[k-1] + NDeltaMin_Q15[k]) . |
| ]]></artwork> |
| </figure> |
| Next, for each value of k from d_LPC-1 down to 0, NLSF_Q15[k] is set to |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| min(NLSF_Q15[k], NLSF_Q15[k+1] - NDeltaMin_Q15[k+1]) . |
| ]]></artwork> |
| </figure> |
| </t> |
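<t>
The fallback procedure above can be sketched in C as follows.
This is a non-normative illustration: the name nlsf_stabilize_fallback() and
the constant MAX_LPC_ORDER are hypothetical, the insertion sort is just one
possible way to sort, and the 32-bit working copy is an assumption made here
so that the forward pass cannot overflow a 16-bit value before the backward
pass pulls it back into range.
</t>
<figure>
<artwork><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

#define MAX_LPC_ORDER 16  /* assumed upper bound on d_LPC (WB uses 16) */

/* Hypothetical helper: enforces the minimum spacings on
   nlsf_q15[0..d_lpc-1].  ndelta_min_q15 must hold d_lpc+1 entries. */
static void nlsf_stabilize_fallback(int16_t *nlsf_q15,
                                    const int16_t *ndelta_min_q15,
                                    int d_lpc)
{
    int32_t w[MAX_LPC_ORDER];
    assert(d_lpc >= 1 && d_lpc <= MAX_LPC_ORDER);

    /* Sort into ascending order (insertion sort into a 32-bit copy). */
    for (int k = 0; k < d_lpc; k++) {
        int32_t v = nlsf_q15[k];
        int j = k - 1;
        while (j >= 0 && w[j] > v) {
            w[j + 1] = w[j];
            j--;
        }
        w[j + 1] = v;
    }

    /* Forward pass: push each value up to at least the previous value
       plus the minimum spacing.  NLSF_Q15[-1] is taken to be 0. */
    for (int k = 0; k < d_lpc; k++) {
        int32_t lo = (k == 0 ? 0 : w[k - 1]) + ndelta_min_q15[k];
        if (w[k] < lo) {
            w[k] = lo;
        }
    }

    /* Backward pass: pull each value down to at most the next value
       minus the minimum spacing.  NLSF_Q15[d_LPC] is taken to be 32768. */
    for (int k = d_lpc - 1; k >= 0; k--) {
        int32_t hi = (k == d_lpc - 1 ? 32768 : w[k + 1])
                     - ndelta_min_q15[k + 1];
        if (w[k] > hi) {
            w[k] = hi;
        }
    }

    for (int k = 0; k < d_lpc; k++) {
        nlsf_q15[k] = (int16_t)w[k];
    }
}
```
]]></artwork>
</figure>
<t>
The forward pass alone can leave the last coefficients too close to 32768;
the backward pass is what enforces the final spacing NDeltaMin_Q15[d_LPC].
</t>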
| |
| </section> |
| |
| <section anchor="silk_nlsf_interpolation" title="Normalized LSF Interpolation"> |
| <t> |
| For 20 ms SILK frames, the first half of the frame (i.e., the first two |
| subframes) may use normalized LSF coefficients that are interpolated between |
| the decoded LSFs for the most recent coded frame (in the same channel) and the |
| current frame. |
| A Q2 interpolation factor, decoded using the PDF in |
| <xref target="silk_nlsf_interp_pdf"/>, follows the LSF coefficient indices in |
| the bitstream. |
| This happens in silk_decode_indices() (decode_indices.c). |
| After either |
| <list style="symbols"> |
| <t>An uncoded regular SILK frame in the side channel, or</t> |
| <t>A decoder reset (see <xref target="decoder-reset"/>),</t> |
| </list> |
| the decoder still decodes this factor, but ignores its value and always uses |
| 4 instead. |
| For 10 ms SILK frames, this factor is not stored at all. |
| </t> |
| |
| <texttable anchor="silk_nlsf_interp_pdf" |
| title="PDF for Normalized LSF Interpolation Index"> |
| <ttcol>PDF</ttcol> |
| <c>{13, 22, 29, 11, 181}/256</c> |
| </texttable> |
| |
| <t> |
| Let n2_Q15[k] be the normalized LSF coefficients decoded by the procedure in |
| <xref target="silk_nlsfs"/>, n0_Q15[k] be the LSF coefficients |
| decoded for the prior frame, and w_Q2 be the interpolation factor. |
| Then the normalized LSF coefficients used for the first half of a 20 ms |
| frame, n1_Q15[k], are |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| n1_Q15[k] = n0_Q15[k] + (w_Q2*(n2_Q15[k] - n0_Q15[k]) >> 2) . |
| ]]></artwork> |
| </figure> |
| This interpolation is performed in silk_decode_parameters() |
| (decode_parameters.c). |
| </t> |
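<t>
The interpolation formula above can be sketched in C as follows.
This is a non-normative illustration: the function name nlsf_interpolate()
and its parameter names are hypothetical, and the code assumes that right
shifts of negative values are arithmetic, which C leaves
implementation-defined.
</t>
<figure>
<artwork><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: computes n1_Q15[k] from the prior frame's
   coefficients (n0_q15), the current frame's coefficients (n2_q15), and
   the Q2 interpolation factor w_q2 in the range 0..4. */
static void nlsf_interpolate(int16_t *n1_q15,
                             const int16_t *n0_q15,
                             const int16_t *n2_q15,
                             int w_q2,
                             int d_lpc)
{
    assert(w_q2 >= 0 && w_q2 <= 4);
    for (int k = 0; k < d_lpc; k++) {
        int32_t delta = (int32_t)n2_q15[k] - n0_q15[k];
        /* Assumes an arithmetic right shift when delta is negative. */
        n1_q15[k] = (int16_t)(n0_q15[k] + ((w_q2 * delta) >> 2));
    }
}
```
]]></artwork>
</figure>
<t>
With w_Q2 equal to 4, the shifted product is exactly the difference, so
n1_Q15[k] equals n2_Q15[k] and the first half of the frame uses the current
frame's coefficients unchanged.
</t>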
| </section> |
| |
| <section anchor="silk_nlsf2lpc" |
| title="Converting Normalized LSFs to LPC Coefficients"> |
| <t> |
| Any LPC filter A(z) can be split into a symmetric part P(z) and an |
| anti-symmetric part Q(z) such that |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
|           d_LPC |
|            __         -k   1 |
| A(z) = 1 - \  a[k] * z   = - * (P(z) + Q(z)) |
|            /_              2 |
|            k=1 |
| ]]></artwork> |
| </figure> |
| with |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
|                -d_LPC-1      -1 |
| P(z) = A(z) + z         * A(z  ) |
|
|                -d_LPC-1      -1 |
| Q(z) = A(z) - z         * A(z  ) . |
| ]]></artwork> |
| </figure> |
| Each even normalized LSF coefficient corresponds to a pair of conjugate roots |
| of P(z), while each odd coefficient corresponds to a pair of conjugate roots |
| of Q(z), all of which lie on the unit circle. |
| In addition, P(z) has a root at pi (z = -1) and Q(z) has a root at 0 (z = 1). |
| Thus, they may be reconstructed mathematically from a set of normalized LSF |
| coefficients, n[k], as |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
|                  d_LPC/2-1 |
|              -1     ___                         -1    -2 |
| P(z) = (1 + z  ) *  | |  (1 - 2*cos(pi*n[2*k])*z   + z  ) |
|                     k=0 |
|
|                  d_LPC/2-1 |
|              -1     ___                           -1    -2 |
| Q(z) = (1 - z  ) *  | |  (1 - 2*cos(pi*n[2*k+1])*z   + z  ) |
|                     k=0 |
| ]]></artwork> |
| </figure> |
| </t> |
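<t>
As a purely mathematical illustration of the factorization above, the
following C sketch rebuilds the LPC coefficients a[k] in floating point from
the cosines cos(pi*n[k]).
It is not the bit-exact procedure that SILK uses: the names lsf_cos_to_lpc(),
mul_quadratic(), and MAX_LPC_ORDER are hypothetical, the cosines are passed
in directly as doubles, and the output array stores a[k] at index k-1.
</t>
<figure>
<artwork><![CDATA[
```c
#include <assert.h>

#define MAX_LPC_ORDER 16  /* assumed upper bound on d_LPC (WB uses 16) */

/* Multiply a polynomial in x = z^-1, stored lowest power first with *len
   coefficients, in place by (1 - 2*c*x + x^2). */
static void mul_quadratic(double *poly, int *len, double c)
{
    int n = *len;
    poly[n] = 0.0;
    poly[n + 1] = 0.0;
    /* Walk from the highest power down so that poly[j-1] and poly[j-2]
       still hold their old values when poly[j] is updated. */
    for (int j = n + 1; j >= 0; j--) {
        double v = poly[j];
        if (j >= 1) v -= 2.0 * c * poly[j - 1];
        if (j >= 2) v += poly[j - 2];
        poly[j] = v;
    }
    *len = n + 2;
}

/* Hypothetical helper: cos_n[k] = cos(pi*n[k]) for k = 0..d_lpc-1.
   On return, a[k-1] holds a[k] for k = 1..d_lpc, where
   A(z) = 1 - sum a[k]*z^-k = (P(z) + Q(z)) / 2. */
static void lsf_cos_to_lpc(const double *cos_n, int d_lpc, double *a)
{
    double p[MAX_LPC_ORDER + 2];
    double q[MAX_LPC_ORDER + 2];
    int plen = 2;
    int qlen = 2;
    assert(d_lpc >= 2 && d_lpc <= MAX_LPC_ORDER && d_lpc % 2 == 0);

    p[0] = 1.0; p[1] = 1.0;   /* (1 + z^-1): root of P(z) at pi */
    q[0] = 1.0; q[1] = -1.0;  /* (1 - z^-1): root of Q(z) at 0  */
    for (int k = 0; k < d_lpc / 2; k++) {
        mul_quadratic(p, &plen, cos_n[2 * k]);      /* even LSFs */
        mul_quadratic(q, &qlen, cos_n[2 * k + 1]);  /* odd LSFs  */
    }
    /* The z^-(d_LPC+1) terms of P and Q cancel in the sum. */
    for (int k = 1; k <= d_lpc; k++) {
        a[k - 1] = -0.5 * (p[k] + q[k]);
    }
}
```
]]></artwork>
</figure>
<t>
For d_LPC = 2 the result reduces to a[1] = cos(pi*n[0]) + cos(pi*n[1]) and
a[2] = cos(pi*n[0]) - cos(pi*n[1]) - 1, which can be checked by expanding the
two products by hand.
</t>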
| <t> |
| However, SILK performs this reconstruction using a fixed-point approximation so |
| that all decoders can reproduce it in a bit-exact manner to avoid prediction |
| drift. |
| The function silk_NLSF2A() (NLSF2A.c) implements this procedure. |
| </t> |
| <t> |
| To start, it approximates cos(pi*n[k]) using a table lookup with linear |
| interpolation. |
| The encoder SHOULD use the inverse of this piecewise linear approximation, |
| rather than the true inverse of the cosine function, when deriving the |
| normalized LSF coefficients. |
| These values are also re-ordered to improve numerical accuracy when |
| constructing the LPC polynomials. |
| </t> |
| |
| <texttable anchor="silk_nlsf_orderings" |
| title="LSF Ordering for Polynomial Evaluation"> |
| <ttcol>Coefficient</ttcol> |
| <ttcol align="right">NB and MB</ttcol> |
| <ttcol align="right">WB</ttcol> |
| <c>0</c> <c>0</c> <c>0</c> |
| <c>1</c> <c>9</c> <c>15</c> |
| <c>2</c> <c>6</c> <c>8</c> |
| <c>3</c> <c>3</c> <c>7</c> |
| <c>4</c> <c>4</c> <c>4</c> |
| <c>5</c> <c>5</c> <c>11</c> |
| <c>6</c> <c>8</c> <c>12</c> |
| <c>7</c> <c>1</c> <c>3</c> |
| <c>8</c> <c>2</c> <c>2</c> |
| <c>9</c> <c>7</c> <c>13</c> |
| <c>10</c> <c/> <c>10</c> |
| <c>11</c> <c/> <c>5</c> |
| <c>12</c> <c/> <c>6</c> |
| <c>13</c> <c/> <c>9</c> |
| <c>14</c> <c/> <c>14</c> |
| <c>15</c> <c/> <c>1</c> |
| </texttable> |
| |
| <t> |
| The top 7 bits of each normalized LSF coefficient index a value in the table, |
| and the next 8 bits interpolate between it and the next value. |
| Let i = (n[k] &gt;&gt; 8) be the integer index and |
| f = (n[k] &amp; 255) be the fractional part of a given |
| coefficient. |
| Then the re-ordered, approximated cosine, c_Q17[ordering[k]], is |
| <figure align="center"> |
| <artwork align="center"><![CDATA[ |
| c_Q17[ordering[k]] = (cos_Q12[i]*256 |
|                       + (cos_Q12[i+1]-cos_Q12[i])*f + 4) >> 3 , |
| ]]></artwork> |
| </figure> |
| where ordering[k] is the k'th entry of the column of |
| <xref target="silk_nlsf_orderings"/> corresponding to the current audio |
| bandwidth and cos_Q12[i] is the i'th entry of <xref target="silk_cos_table"/>. |
| </t> |
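<t>
The table lookup, interpolation, and re-ordering described above can be
sketched in C as follows.
This is a non-normative illustration: the names nlsf_to_cos_q17() and
ordering_nb_mb are hypothetical, the Q12 cosine table is passed in by the
caller and is assumed to hold 129 entries so that index i+1 is always valid,
and right shifts of negative values are assumed to be arithmetic.
</t>
<figure>
<artwork><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* NB and MB ordering (d_LPC = 10), copied from the ordering table. */
static const uint8_t ordering_nb_mb[10] = { 0, 9, 6, 3, 4, 5, 8, 1, 2, 7 };

/* Hypothetical helper: fills c_q17[ordering[k]] for k = 0..d_lpc-1.
   cos_q12 is the caller-supplied Q12 cosine table (assumed 129 entries). */
static void nlsf_to_cos_q17(int32_t *c_q17,
                            const int16_t *nlsf_q15,
                            const uint8_t *ordering,
                            const int16_t *cos_q12,
                            int d_lpc)
{
    for (int k = 0; k < d_lpc; k++) {
        int i = nlsf_q15[k] >> 8;        /* top 7 bits: table index   */
        int32_t f = nlsf_q15[k] & 255;   /* low 8 bits: fraction      */
        assert(i >= 0 && i < 128);
        int32_t base = cos_q12[i];
        int32_t step = cos_q12[i + 1] - base;
        /* Linear interpolation, rounded, then Q20 -> Q17.  Assumes an
           arithmetic right shift when the sum is negative. */
        c_q17[ordering[k]] = (base * 256 + step * f + 4) >> 3;
    }
}
```
]]></artwork>
</figure>
<t>
A coefficient of 0 maps to 131072, which is 1.0 in Q17, as expected for
cos(0).
</t>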
| |
| <texttable anchor="silk_cos_table" |
| title="Q12 Cosine Table for LSF Conversion"> |
| <ttcol align="right">i</ttcol> |
| <ttcol align="right">+0</ttcol> |
| <ttcol align="right">+1</ttcol> |
| <ttcol align="right">+2</ttcol> |
| <ttcol align="right">+3</ttcol> |
| <c>0</c> |
| <c>4096</c> <c>4095</c> <c>4091</c> <c>4085</c> |
| <c>4</c> |
| <c>4076</c> <c>4065</c> <c>4052</c> <c>4036</c> |
| <c>8</c> |
| <c>4017</c> <c>3997</c> <c>3973</c> <c>3948</c> |
| <c>12</c> |
| <c>3920</c> <c>3889</c> <c>3857</c> <c>
|