Daqi's Blog
Lin Daqi's Personal Website
http://dqlin.xyz/
Sun, 27 Jun 2021 12:22:58 -0600
Jekyll v4.2.0
Real-Time Stochastic Lightcuts Source Code Release!
<p><a href="https://github.com/DQLin/RealTimeStochasticLightcuts" style="font-family:'Lora','Times New Roman'; font-size:1em; color: #0080FF"> Source Code (GitHub) </a></p>
<p><a href="/pubs/2020-i3d-SLC/" style="font-family:'Lora','Times New Roman'; font-size:1em; color: #0080FF"> Project Page </a></p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/u8iXZpL6bh4" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
Sat, 08 Aug 2020 00:00:00 -0600
http://dqlin.xyz/tech/2020/08/08/slc-source/
Using RTX to Accelerate Instant Radiosity
<p>2018 was an exciting year for computer graphics. Nvidia announced RTX graphics cards, which bring real-time ray tracing to consumers. Following the announcement, new releases in mainstream game series, including Battlefield V and Shadow of the Tomb Raider, shipped with RTX-powered graphics. The screenshot below, captured from a Battlefield V promotional video
<a href="https://www.youtube.com/watch?v=rpUm0N4Hsd8">(https://www.youtube.com/watch?v=rpUm0N4Hsd8)</a>
shows super clear ray traced reflections in water.</p>
<p><img src="/img/BattleFieldV.png" alt="BattleFieldV" /></p>
<p>However, the recent game releases branded with RTX graphics mostly use RTX for tracing reflections and shadows (including soft shadows); many other possibilities of real-time ray tracing are still to be explored. Of course, ray-traced reflections and shadows can greatly enhance overall graphics quality, considering the importance of these two components and how poor their quality was even with very complex rasterization tricks. But there are still many other applications of real-time ray tracing that can lift graphics quality to a new level. For example, we can achieve more faithful subsurface scattering in translucent materials like marble and human skin. Current techniques use shadow maps to estimate the distance traveled by light inside the object, which can fall short on concave objects and produce artefacts around object edges. Ray tracing avoids the artefacts introduced by these rasterization tricks, so everything appears as it should.</p>
<p>The thing I want to advocate is using RTX to generate virtual point lights (VPLs) and trace shadow rays to them. This technique is known as instant radiosity, introduced by Alexander Keller in 1997. It resembles bidirectional path tracing and photon mapping in the sense that it also traces light paths and records hit positions along the paths. These surface records are then used as point lights that represent discretized indirect lighting. Compared to photon mapping, instant radiosity is a cheap and effective way to render smooth diffuse reflection, thanks to the low-frequency spatial radiance distribution of point lights. A few hundred VPLs are enough to provide indirect lighting from a small light source with reasonable quality, whereas photon mapping often requires more than 1M photons to eliminate wavy artefacts. (Of course, the point light approximation causes a singularity problem, but that can be bypassed by enforcing a minimum distance between point lights and surface points.) As a result, virtual point lights have been used in games for flashlights. An example is the rendering of indirect illumination in rooms from gun-mounted flashlights in Gears of War 4 [Malmros, 2017]. (Talk:
<a href="https://www.youtube.com/watch?v=c0VxzGRIUCs">https://www.youtube.com/watch?v=c0VxzGRIUCs</a>)
The developers used reflective shadow maps [Dachsbacher & Stamminger, 2005] <a href="http://www.klayge.org/material/3_12/GI/rsm.pdf">(http://www.klayge.org/material/3_12/GI/rsm.pdf)</a>
to sample single-bounce VPLs, and merged VPLs according to geometric and material heuristics to lower the computational cost. This technique gives real-time single-bounce GI (likely unshadowed). The following screenshots from the talk video compare the flashlight aiming at the red tapestry with the flashlight aiming at the wall. Clearly, the technique generates reasonable color bleeding, as shown by the change of color on the ceiling.</p>
<p><img src="/img/gearsOfWar.png" alt="GearOfWar" /></p>
<p>Looking carefully at the talk video, there is some temporal incoherence as the flashlight moves. However, it is not obvious in a dark environment like this, especially with a moving FPS game character. Still, the render quality could be improved if the virtual point lights could be generated and evaluated at a lower cost. First, using more VPLs improves temporal stability and reduces bright blotches. Also, given its strict performance requirements, Gears of War 4 probably did not trace shadow rays to VPLs for indirect illumination shadows, which could be very important in a more complex scene. Now with RTX available, we can make VPLs much cheaper, which addresses the aforementioned problems.</p>
<h1 id="directx-ray-tracing">DirectX Ray Tracing</h1>
<p>There are several options for accessing the ray tracing functionality of RTX cards. The most convenient one for PC game development is the DirectX Ray Tracing (DXR) API, which integrates seamlessly with the rasterization pipeline we use every day. There is a very nice introduction to DXR from Nvidia <a href="https://devblogs.nvidia.com/introduction-nvidia-rtx-directx-ray-tracing/">(https://devblogs.nvidia.com/introduction-nvidia-rtx-directx-ray-tracing/)</a>. In the most concise terms, DXR breaks the ray tracing process into three new kinds of shaders: “raygen”, “hit” and “miss”. Rendering starts from the “raygen” (ray generation) shader, which is invoked on a grid like a compute shader. Calling TraceRay() from a shader executes the fixed-function hardware scene traversal using an acceleration structure (BVH). Because a ray can hit any object, a shader table is used to store shading resources for each geometry object. Upon intersection, the entry for the current object is retrieved from the shader table to determine which shaders and textures to use. With these functions, we can implement virtually all possible ray tracing algorithms with DXR, except that we cannot choose our own acceleration structure (for example, a k-d tree).</p>
<h1 id="example-implementation">Example Implementation</h1>
<p>Here I provide a brief code walkthrough of using DXR to implement the original (brute-force) instant radiosity algorithm. Some parts of the code are modified from the Microsoft MiniEngine DXR example <a href="https://github.com/Microsoft/DirectX-Graphics-Samples">(https://github.com/Microsoft/DirectX-Graphics-Samples)</a>. Notice that it is not meant to be interactive, as the original instant radiosity is an offline rendering algorithm that goes through all virtual point lights for each pixel, shooting shadow rays to resolve the visibility. In our experiment we will generate 1 million VPLs for at most two-bounce indirect diffuse reflection and render a 1280x720 instant radiosity image. This means we need to trace 921.6 giga shadow rays against the scene. My DXR program running on an RTX 2080 takes 25 minutes to render an instant radiosity image of Crytek Sponza (262k triangles), which is about 600 mega rays per second and quite impressive. Please notice that I used a very unoptimized way of tracing shadow rays (in each frame, shadow rays are traced for one light for all pixels, which incurs a fixed G-buffer overhead). With proper optimization, the speed should reach at least 1 giga rays per second. The VPL generation process is relatively fast, taking less than 2 ms. Here is the raygen shader I used to generate the VPLs.</p>
<p>To sample light rays from a directional distant light (sunlight), I used a function like this to randomly sample a point on the top disk of the bounding cylinder of the scene bounding sphere, aligned to the sunlight direction (a technique introduced in PBRT). This point is then used as the origin of the light ray, which is traced through the scene and makes diffuse bounces when it intersects surfaces.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void GenerateRayFromDirectionalLight(uint2 seed, out float3 origin, out float pdf)
{
    float3 v1, v2;
    CoordinateSystem(SunDirection, v1, v2); // get a local coordinate system
    float2 cd = GetUnitDiskSample(seed + DispatchOffset);
    // SceneSphere.xyz is the bounding sphere center, SceneSphere.w its radius
    float3 pDisk = SceneSphere.xyz - SceneSphere.w * SunDirection +
                   SceneSphere.w * (cd.x * v1 + cd.y * v2);
    origin = pDisk;
    // uniform density over a disk of radius SceneSphere.w
    pdf = 1 / (PI * SceneSphere.w * SceneSphere.w);
}
</code></pre></div></div>
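<p>For readers who want to check the sampling on the CPU, here is a minimal Python sketch of the same idea. The helpers <code>coordinate_system</code> and <code>sample_unit_disk</code> are hypothetical stand-ins for the HLSL functions CoordinateSystem and GetUnitDiskSample, whose bodies are not shown in this post; the polar disk mapping in particular is an assumption, not necessarily what the shader uses.</p>

```python
import math
import random


def coordinate_system(n):
    """Return two unit vectors that form an orthonormal basis with unit vector n."""
    nx, ny, nz = n
    # Build the first tangent from the two components that cannot both vanish.
    if abs(nx) > abs(ny):
        inv_len = 1.0 / math.sqrt(nx * nx + nz * nz)
        v1 = (-nz * inv_len, 0.0, nx * inv_len)
    else:
        inv_len = 1.0 / math.sqrt(ny * ny + nz * nz)
        v1 = (0.0, nz * inv_len, -ny * inv_len)
    # The second tangent is the cross product n x v1.
    v2 = (ny * v1[2] - nz * v1[1],
          nz * v1[0] - nx * v1[2],
          nx * v1[1] - ny * v1[0])
    return v1, v2


def sample_unit_disk(rng):
    """Uniform sample on the unit disk (polar mapping, assumed for illustration)."""
    r = math.sqrt(rng.random())
    theta = 2.0 * math.pi * rng.random()
    return r * math.cos(theta), r * math.sin(theta)


def sample_directional_light_origin(center, radius, sun_dir, rng):
    """Sample a ray origin on the disk that caps the scene's bounding cylinder."""
    v1, v2 = coordinate_system(sun_dir)
    dx, dy = sample_unit_disk(rng)
    origin = tuple(
        center[i] - radius * sun_dir[i] + radius * (dx * v1[i] + dy * v2[i])
        for i in range(3)
    )
    pdf = 1.0 / (math.pi * radius * radius)  # uniform over the disk area
    return origin, pdf
```

<p>Every sampled origin lies exactly one radius "upstream" of the sphere center along the sun direction and within one radius of the cylinder axis, so rays shot along the sun direction cover the whole bounding sphere.</p>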
<p>DXR requires us to specify the ray using a ray description (RayDesc) structure that contains the origin, tMin (min parametric intersection distance along the ray), ray direction and tMax. Additionally, for this algorithm we also need to define a payload structure that records the carried radiance on the light path which I call “alpha”. It is multiplied with the surface albedo and the cosine factor during each surface hit. The payload also stores the current recursion depth. This payload feeds the last argument of TraceRay, which can pass the information from ray generation to hit shader and between bounces.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
/// some resource and function definitions
...
struct RayPayload
{
    float3 alpha;
    uint recursionDepth;
};
[shader("raygeneration")]
void RayGen()
{
    float3 origin;
    float pdf;
    // using the 2D ray dispatch index as random seed
    GenerateRayFromDirectionalLight(DispatchRaysIndex().xy, origin, pdf);
    float3 alpha = SunIntensity / pdf;
    RayDesc rayDesc = { origin,
                        0.0f,
                        SunDirection,
                        FLT_MAX };
    RayPayload rayPayload;
    rayPayload.alpha = alpha;
    rayPayload.recursionDepth = 0;
    // definition of parameters can be found at
    // https://developer.nvidia.com/rtx/raytracing/dxr/DX12-Raytracing-tutorial-Part-2
    TraceRay(accelerationStructure, RAY_FLAG_NONE, ~0, 0, 1, 0, rayDesc, rayPayload);
}
</code></pre></div></div>
<p>Here is the “closesthit” shader that creates VPLs at surface hits. I use three DX12 linear buffers (StructuredBuffer) to
store VPL positions, normals and colors (flux). An atomic counter is incremented and returned each time a new VPL is
created to prevent writing to the same position. Following that, a diffuse reflection ray is sampled from
the hemisphere centered at the surface normal to generate the next bounce. Notice how DXR provides a wide range of fixed functions
and variables that store the ray tracing information we need for lighting computation. For example,
barycentric coordinates are fetched from “BuiltInTriangleIntersectionAttributes” and the intersection distance is provided by
RayTCurrent().</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/// some resource and function definitions
...
[shader("closesthit")]
void Hit(inout RayPayload rayPayload, in BuiltInTriangleIntersectionAttributes attr)
{
    uint materialID = MaterialID;
    uint triangleID = PrimitiveIndex();
    RayTraceMeshInfo info = meshInfo[materialID];
    /// fetch texture coordinates (uv0, uv1, uv2), vNormal, vBinormal, vTangent for triangle vertices
    ...
    // interpolate the texture coordinates with the barycentric coordinates of the hit
    float3 bary = float3(1.0 - attr.barycentrics.x - attr.barycentrics.y, attr.barycentrics.x, attr.barycentrics.y);
    float2 uv = bary.x * uv0 + bary.y * uv1 + bary.z * uv2;
    float3 worldPosition = WorldRayOrigin() + WorldRayDirection() * RayTCurrent();
    uint2 threadID = DispatchRaysIndex().xy + DispatchOffset;
    const float3 rayDir = normalize(-WorldRayDirection());
    uint materialInstanceId = info.m_materialInstanceId;
    // ray tracing shaders have no implicit derivatives, so sample an explicit mip level
    const float3 diffuseColor = g_localTexture.SampleLevel(defaultSampler, uv, 0).rgb;
    float3 normal = g_localNormal.SampleLevel(defaultSampler, uv, 0).rgb * 2.0 - 1.0;
    float3x3 tbn = float3x3(vTangent, vBinormal, vNormal);
    normal = normalize(mul(normal, tbn));
    // sample a diffuse bounce
    float3 R = GetHemisphereSampleCosineWeighted(threadID * (rayPayload.recursionDepth+1), normal);
    float3 alpha = rayPayload.alpha;
    // attenuate carried radiance
    alpha *= diffuseColor;
    // increment atomic counter
    uint VPLid = g_vplPositions.IncrementCounter();
    // store a new vpl in the vpl storage buffers
    g_vplPositions[VPLid] = worldPosition;
    g_vplNormals[VPLid] = normal;
    g_vplColors[VPLid] = alpha;
    // trace the next bounce if the depth limit is not reached
    if (rayPayload.recursionDepth < MAX_RAY_RECURSION_DEPTH)
    {
        TraceNextRay(worldPosition + epsilon * R, R, alpha, rayPayload.recursionDepth);
    }
}
</code></pre></div></div>
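<p>The body of GetHemisphereSampleCosineWeighted is not shown in this post. As a rough illustration of what such a helper does, the Python sketch below draws cosine-weighted directions by adding a uniform point on the unit sphere to the normal and normalizing the sum. This is one well-known construction, chosen here as an assumption; the function names are hypothetical and the actual shader may use a different mapping. Assuming a Lambertian surface, the BRDF (albedo/π) times the cosine divided by the sampling density (cosine/π) reduces to the albedo, which would explain why the hit shader multiplies alpha only by diffuseColor.</p>

```python
import math
import random


def sample_unit_sphere(rng):
    """Uniform point on the unit sphere (uniform z, uniform azimuth)."""
    z = 1.0 - 2.0 * rng.random()
    phi = 2.0 * math.pi * rng.random()
    r = math.sqrt(max(0.0, 1.0 - z * z))
    return (r * math.cos(phi), r * math.sin(phi), z)


def sample_cosine_hemisphere(normal, rng):
    """Direction around unit vector `normal` with density cos(theta) / pi."""
    while True:
        s = sample_unit_sphere(rng)
        v = tuple(normal[i] + s[i] for i in range(3))
        length = math.sqrt(sum(c * c for c in v))
        # Reject the (measure-zero) case where s is the exact opposite of the normal.
        if length > 1e-6:
            return tuple(c / length for c in v)


def mean_cosine(normal, n_samples, seed=0):
    """Average of dot(normal, direction); tends to 2/3 for a cosine-weighted density."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        d = sample_cosine_hemisphere(normal, rng)
        total += sum(normal[i] * d[i] for i in range(3))
    return total / n_samples
```

<p>A quick sanity check is that the average cosine between the samples and the normal approaches 2/3, the expected value under a cosine-weighted density.</p>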
<p>Finally, we can gather the VPL contribution for all visible pixels to generate an irradiance buffer.
Notice that texPosition and texNormal are surface world position and normal from the G buffer.</p>
<p>The irradiance $E$ can be calculated as
$ E = \Phi \frac{\max(\langle n_s, r \rangle, 0) \max(\langle n_l, -r \rangle, 0)}{\|p_l - p_s\|^2} $,
where $\Phi$ is the VPL flux, $n_s$ and $n_l$ are the surface normal and the VPL normal,
$p_s$ and $p_l$ are the surface and VPL positions, and $r = (p_l - p_s) / \|p_l - p_s\|$ is the normalized shadow ray direction.</p>
<p>The following shader code computes the contribution from one VPL, indexed by VPLId.
A shadow ray is traced to resolve the visibility between the current pixel and the VPL, and the
max ray T is set to the light distance, which means any hit is an occlusion. This result can be
stored in the ray payload as a boolean. Because we are tracing shadow rays, it is better to set
the RAY_FLAG_ACCEPT_FIRST_HIT_AND_END_SEARCH flag in TraceRay() to avoid the unnecessary search
for the closest hit.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
/// some resource and function definitions
...
/// an anyhit shader setting payload.IsOccluded to true
...
[shader("raygeneration")]
void RayGen()
{
    float3 output = 0;
    int2 screenPos = DispatchRaysIndex().xy;
    float3 SurfacePosition = texPosition[screenPos].xyz;
    float3 lightPosition = vplPositions[VPLId].xyz;
    float3 lightDir = lightPosition - SurfacePosition;
    float lightDist = length(lightDir);
    lightDir = lightDir / lightDist; //normalize
    // cast shadow ray
    RayDesc rayDesc = { SurfacePosition,
                        0.1f, //bias
                        lightDir,
                        lightDist };
    ShadowRayPayload payload;
    payload.IsOccluded = false;
    TraceRay(g_accel, RAY_FLAG_ACCEPT_FIRST_HIT_AND_END_SEARCH,
             ~0, 0, 1, 0, rayDesc, payload);
    if (!payload.IsOccluded) // no occlusion
    {
        float3 SurfaceNormal = texNormal[screenPos].xyz;
        float3 lightNormal = vplNormals[VPLId].xyz;
        float3 lightColor = vplColors[VPLId].xyz;
        output = max(dot(SurfaceNormal, lightDir), 0.0) * lightColor;
        output *= max(dot(lightNormal, -lightDir), 0.0) / (lightDist*lightDist + CLAMPING_BIAS);
    }
    // accumulate the averaged contribution of this VPL
    irradianceBuffer[screenPos] += output / numVPLs;
}
</code></pre></div></div>
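<p>To make the geometry term concrete, here is a small Python sketch that evaluates the unoccluded contribution of a single VPL, mirroring the two clamped cosines and the squared distance used in the shader. The function name and the <code>clamping_bias</code> parameter are illustrative assumptions; visibility (the shadow ray) and the division by the number of VPLs are deliberately left out.</p>

```python
import math


def vpl_irradiance(p_s, n_s, p_l, n_l, flux, clamping_bias=0.0):
    """Unoccluded irradiance at surface point p_s (normal n_s) from one VPL.

    p_l, n_l and flux are the VPL position, normal and RGB flux.
    """
    to_light = tuple(p_l[i] - p_s[i] for i in range(3))
    dist2 = sum(c * c for c in to_light)
    dist = math.sqrt(dist2)
    r = tuple(c / dist for c in to_light)  # normalized shadow ray direction
    cos_s = max(sum(n_s[i] * r[i] for i in range(3)), 0.0)   # cosine at the surface
    cos_l = max(-sum(n_l[i] * r[i] for i in range(3)), 0.0)  # cosine at the VPL
    g = cos_s * cos_l / (dist2 + clamping_bias)
    return tuple(f * g for f in flux)
```

<p>For example, a surface point at the origin facing up and a VPL two units above it facing down give both cosines equal to 1 and a squared distance of 4, so a flux of (1, 0.5, 0.25) yields an irradiance of (0.25, 0.125, 0.0625). A VPL facing away from the surface contributes nothing.</p>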
<p>After we iterate through all VPLs, we can modulate the irradiance buffer with the surface albedo
to produce the final indirect illumination and add that to the direct illumination. And here is
the final image.</p>
<p><img src="/img/ir.png" alt="Instant radiosity render of Crytek Sponza" /></p>
<p>If you look into the VPL-based GI literature, you’ll find a ton of methods that approximate instant
radiosity with a dramatic reduction in evaluation cost. However, none of them can reach real-time performance
yet. We’ll continue with this topic to see what RTX can bring us for real-time global illumination.</p>
<h2 id="references"><em>References</em></h2>
<p>Dachsbacher, C., & Stamminger, M. (2005, April). Reflective shadow maps. In Proceedings of the 2005 symposium on Interactive 3D graphics and games (pp. 203-231). ACM.</p>
<p>Malmros, J. (2017, July). Gears of War 4: custom high-end graphics features and performance techniques. In ACM SIGGRAPH 2017 Talks (p. 13). ACM.</p>
Thu, 24 Jan 2019 00:00:00 -0700
http://dqlin.xyz/tech/2019/01/24/dxr-ir/
The Hardest Problem on the Hardest Test
<p>Yesterday, I saw this intriguing YouTube [<a href="https://youtu.be/OkmNXy7er84">video</a>] about using a clever trick to
solve what was meant to be the hardest problem in the 1992 Putnam math competition.</p>
<p>The problem goes as follows: if we choose four random points on the surface of a sphere, what’s the probability that the center of the sphere lies inside the tetrahedron spanned by the four points? At first sight, this looked to me like a very difficult multivariate integral problem. However, it turns out that you just need to “flip coins” to get the correct answer.</p>
<p>The solution approaches the problem by considering a simplified version first: three random points $P_1$, $P_2$, $P_3$ on a circle. It is much easier to fix two points and slide the third point along the circle. The triangle spanned by the three points covers the center of the circle only when the third point falls on the “opposite” arc (draw lines from $P_1$ and $P_2$ through the center; the opposite arc is the one cut out by these lines on the side opposite to the arc formed by $P_1$ and $P_2$). So given any $P_1$ and $P_2$, the probability that adding a third point $P_3$ forms a triangle that covers the center is the probability that $P_3$ lands on that arc, i.e., the length of the arc over the circumference of the circle, since $P_3$ is chosen uniformly on the circle. To derive the probability for arbitrary $P_1$, $P_2$ and $P_3$, we let $P_1$ and $P_2$ vary as well, which means integrating the previous probability over all possible angles between $P_1$ and $P_2$. However, since the arc length is linear in the central angle and the angle is uniformly distributed, we can directly take the “average length”, where the arc formed by $P_1$ and $P_2$ is a quarter circle (we only consider the shorter arc), which gives us a $25\%$ probability.</p>
<p style="text-align: center;"><a href="/img/putnam/1.png"><img src="/img/putnam/1.png" alt="alt text" /></a></p>
<center><small>
The simplified 2D case.
</small></center>
<p>I believe you are already excited to see this 2D result derived with hardly any computation. However, things get more nebulous when we go to 3D. We naturally want to extend the 2D reasoning to 3D. Indeed, the tetrahedron formed by $P_1$, $P_2$, $P_3$ and $P_4$ covers the center of the sphere if and only if $P_4$ falls on the spherical triangle opposite to the one spanned by $P_1$, $P_2$ and $P_3$. However, getting the “average area” of the spherical triangle is non-trivial, as it involves evaluating a surface integral with four variables.</p>
<p>So here comes the plot twist: let us go back to the 2D case. Instead of choosing three random points, let’s choose two random lines through the center of the circle and one random point. Each random line stands for one of $P_1$ and $P_2$, and each of these points can be on either end of its line with equal probability. This is equivalent to choosing three random points. There are only 4 different combinations of $P_1$ and $P_2$ lying on a specific end of their corresponding lines, and only one of these combinations puts $P_1, P_2$ on the opposite side of the circle from the randomly chosen $P_3$ (in other words, the three points are not all on the same semicircle), so we have a one-fourth chance of forming a center-covering triangle. It is very similar to flipping two coins: one specific coin must be heads and the other must be tails, which has a $1/4$ probability.</p>
<p style="text-align: center;"><a href="/img/putnam/3.png"><img src="/img/putnam/3.png" alt="alt text" /></a></p>
<center><small>
The "coin flip" illustration.
</small></center>
<p>And what’s awesome about this reasoning is that it extends seamlessly to the 3D case. Four random points just become three random lines through the sphere center and one random point. This time, we are flipping 3 coins to generate all possible combinations, and only one out of the 8 cases puts the fourth point on the opposite side of the sphere from the other three points. So the probability that the tetrahedron formed by four randomly chosen points on a sphere covers the center of the sphere is $1/8$.</p>
<p>This is such an elegant proof, and the reasoning process is very inspiring. Very often, you can go to simpler cases first (here, going down one dimension and fixing some points) before solving a hard problem. More complex cases are usually generalizations or combinations of simpler cases, and the essence of the problem should be the same. Therefore, considering the simpler cases makes you more likely to grasp the essence of the problem (though we need to beware of false positives when generalizing the reasoning) rather than being overwhelmed by the harder problem. There is a joke, probably from the Institute for Advanced Study (Einstein, Gödel, von Neumann, Yang Chen-Ning, etc. have worked there…), that there were signs at the balconies of the buildings saying “before you jump, have you ever considered dimension one?”. Another, more important insight from this video is that changing the formalization can be the plot twist in solving a hard problem. This happened when the solver found that generalizing the 2D result to 3D was very hard and went back to 2D. In more philosophical words, it is actually the language, whether it is text or graph, that defines the way we think. The essence of the problem is the “thing” itself, but some languages might do a better job than others at describing it. In this case, four random points and three random lines plus one random point are just two different languages describing the same thing. However, one language is a better tool for approaching the essence of the problem than the other, and we are less likely to discover that language if we don’t go to the low-dimensional case. Although it is very subtle, we can always practice the idea by thinking from different angles and formalizing the problem in multiple different ways.</p>
<p style="text-align: center;"><a href="/img/putnam/2.png"><img src="/img/putnam/2.png" alt="alt text" /></a></p>
<center><small>
The 3D case.
</small></center>
<p>Up to now, the solution has only been described in a very intuitive way, so we can say we have the idea of the proof but not the proof itself. To formalize the proof, some mathematical tools need to be used. An [<a href="http://lsusmath.rickmabry.org/psisson/putnam/putnam-web.htm">article</a>] here gives a formal proof by expressing the points as position vectors with the sphere center as the origin, and by requiring the origin to be expressed as a convex combination of the four position vectors, i.e. with all weights positive, which matches only one of the 8 possible cases. This article also generalizes the problem and derives some interesting facts. The most natural generalization is $n+1$ points on the sphere in $\mathbb{R}^n$, for which the probability of covering the center is $\frac{1}{2^n}$.</p>
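<p>The $1/8$ result is also easy to check numerically. The Python sketch below is a simple Monte Carlo experiment, not part of the video or the article: it draws four uniform points on the sphere and tests whether the origin is a strictly positive combination of them. The test relies on the identity $\det(b,c,d)\,a - \det(a,c,d)\,b + \det(a,b,d)\,c - \det(a,b,c)\,d = 0$ for any four vectors in $\mathbb{R}^3$, so the origin is inside the tetrahedron exactly when these four signed coefficients share the same sign. All function names are chosen for illustration.</p>

```python
import random


def det3(a, b, c):
    """Determinant of the 3x3 matrix with rows a, b, c."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))


def random_point_on_sphere(rng):
    """Uniform point on the unit sphere: normalize an isotropic Gaussian vector."""
    while True:
        v = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        n2 = v[0] * v[0] + v[1] * v[1] + v[2] * v[2]
        if n2 > 1e-12:
            n = n2 ** 0.5
            return (v[0] / n, v[1] / n, v[2] / n)


def contains_origin(p1, p2, p3, p4):
    """True if the origin lies strictly inside the tetrahedron p1 p2 p3 p4."""
    # Weights of the linear relation w1*p1 + w2*p2 + w3*p3 + w4*p4 = 0.
    w = (det3(p2, p3, p4), -det3(p1, p3, p4), det3(p1, p2, p4), -det3(p1, p2, p3))
    return all(x > 0 for x in w) or all(x < 0 for x in w)


def estimate_probability(n_trials, seed=0):
    """Fraction of random tetrahedra (vertices on the sphere) containing the center."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        points = [random_point_on_sphere(rng) for _ in range(4)]
        if contains_origin(*points):
            hits += 1
    return hits / n_trials
```

<p>Calling <code>estimate_probability</code> with a few tens of thousands of trials should give a value close to 0.125, in line with the coin-flip argument.</p>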
<h2 id="references"><em>References</em></h2>
<p>3Blue1Brown(Grant Sanderson). “The Hardest Problem on the Hardest Test.” YouTube, 8 Dec. 2017, www.youtube.com/watch?v=OkmNXy7er84. (All images in this article are captured from the video. The article also heavily used the ideas in the video.)</p>
<p>Howard, R., & Sisson, P. (1996). Capturing the origin with random points: Generalizations of a Putnam problem. The College Mathematics Journal, 27(3), 186-192.</p>
<p>Kedlaya, K. S., Poonen, B., & Vakil, R. (2002). The William Lowell Putnam mathematical competition 1985-2000: problems, solutions and commentary. MAA. Retrieved from https://www.amazon.com/William-Mathematical-Competition-1985-2000-Problem/dp/0883858274 (The cover of this book is used as the header image of this article.)</p>
Sat, 01 Sep 2018 00:00:00 -0600
http://dqlin.xyz/post/2018/09/01/putnam/
Latest Techniques in Interactive Global Illumination
<p>This is my presentation at the University of Utah graphics seminar, which is about a very exciting topic - interactive GI!</p>
<iframe src="https://drive.google.com/file/d/13chIGW2bGA3l6iusIV7iSoNXUs0u969W/preview" width="100%" height="1000"></iframe>
Wed, 03 Jan 2018 00:00:00 -0700
http://dqlin.xyz/tech/2018/01/03/ltgi/
Stencil Buffer Trick for Volume Deferred Shading
<p>Deferred shading is known to greatly improve rendering efficiency when dealing with a large number of lights. The mechanism is very simple: separate the geometric complexity from lighting. To do this, G-buffers, which form an array of textures often including the position, material color and normal of the points to shade, are rendered in a first pass from the camera’s point of view. Then, an orthographic camera is used to render a quad that covers the whole screen. The normalized (0-1) screen coordinates are used to retrieve the geometry/material data at each point of the screen, which is fed into the lighting function. In this way, we avoid shading the many fragments produced by the projected scene geometry and shade only those that are visible.</p>
<p>However, imagine we have a large group of lights. We’ll still have to go through the whole list of lights for each screen pixel. With a more physically based lighting model, in which each light has an influence radius (resulting from the physical fact that an ideal point light source has an inverse-square falloff), fragments that are outside a certain light’s influence radius waste time waiting for other fragments in the same batch that take a different branch. We know that branching is bad for GPUs, and this leads to a severe time penalty. Many techniques have been proposed to alleviate this problem. Tiled deferred shading is a very popular method that most of you have probably heard of. It partitions the screen into tiles and creates a “light list” for each tile using only the lights that intersect the tile. This is of course an elegant method. However, we always need to redo the preprocessing whenever a new group of lights is generated (if there is a need to).</p>
<p>A simpler solution is volume deferred shading. We just need to render a “light volume” for each light, which, as you might guess, can be a sphere with a radius equal to the light’s influence radius. For example, in OpenGL, we just need to create a list of vertices/indices of a sphere and prepare a model matrix for each light (which is simply a scaling and a translation). While rendering, we perform the draw call n times, where n is the number of lights. Each light volume produces fragments that cover the region of the screen where fragments can possibly be shaded by the light. Of course, by doing this we lose the depth dimension. We have to explicitly test the fragments and make sure they are in about the same depth range as the light (which is only a necessary condition). Tiled rendering also requires such testing, but if we have lights scattered everywhere in a very “deep” scene, the number of lights to be tested is significantly smaller. However, because no preprocessing is required, volume deferred shading has quite competitive performance in most cases.</p>
<p>Wait! We should not render n passes! Instead, a better way is to use instancing, which is supported on every modern graphics card, to avoid the latency caused by lots of communication between the CPU and GPU. Another important point: depth write should be disabled and additive blending should be enabled. Depth write should be disabled because light volumes are not real geometry. When two light volumes are close to each other and illuminate the same region of the scene, we don’t want them to occlude each other such that some parts are only shaded by one of the light volumes.</p>
<p>If you implement volume deferred shading directly as described above, you will immediately discover that something goes wrong. When you approach a light’s location (with a moving camera), at some point the screen suddenly darkens. This is because, no matter whether you turn backface culling on or off, you face a dilemma: you either render the pixels twice as bright when you are outside the light volume, or render nothing when you are inside the light volume.</p>
<p>It turns out that this situation can be easily solved by switching the culling mode to front-face culling. However, this is not good enough. We can actually keep the Z-buffer created by the G-buffer rendering and use this information to reject fragments that are not intersected by light volumes. Here is a nice <a href="http://kayru.org/articles/deferred-stencil/">stencil buffer trick</a> introduced by Yuriy O’Donnell (2009). What it does is basically use the stencil buffer as a counter to record whether the front face of a light volume is behind the visible scene geometry (so that it has no chance to shade the pixel). This is achieved by rendering only front faces (with color buffer write disabled) in the first pass and adding 1 to the stencil of the Z-failed pixels. The other situation is that the back face of a light volume is in front of the visible scene geometry, which is handled by the second pass: rendering only back faces and using a Greater/Equal z-test to further filter the pixels with a zero stencil value (those that already passed the first test). This way we keep only the light volume pixels that “fail” the z-test (the original “LESS” one), which intuitively correspond to the scene geometry pixels that intersect the light volume. Notice that this trick also works when you are inside a light volume, in which case front faces won’t get rendered (we cannot be inside a light volume and have its front face behind the visible geometry at the same time!), leaving a zero stencil value that allows us to use only the Greater/Equal depth test to filter the pixels to be shaded. Of course, in both passes we need to disable Z write. We certainly don’t want the light volumes to occlude each other.</p>
<p style="text-align: center;"><a href="/img/stencil_trick.png"><img src="/img/stencil_trick.png" alt="alt text" /></a></p>
<center><small>
The original diagram used by Yuriy O'Donnell.
</small></center>
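<p>The per-pixel logic of the two passes can be modeled on the CPU in a few lines. The Python sketch below is only a conceptual model of the stencil trick as described in this post, not GPU code and not taken from O’Donnell’s article; it assumes a depth convention where smaller values are closer to the camera, and the function name and the use of <code>None</code> for a clipped front face are illustrative choices.</p>

```python
def light_volume_shades_pixel(scene_depth, front_depth, back_depth):
    """Decide whether a light volume should shade a pixel.

    scene_depth: depth of the visible geometry stored in the Z-buffer.
    front_depth: depth of the volume's front face at this pixel, or None
                 when the camera is inside the volume (front face not rendered).
    back_depth:  depth of the volume's back face at this pixel.
    """
    stencil = 0
    # Pass 1: front faces only, depth test LESS, color writes disabled.
    # A z-fail means the front face is behind the scene geometry.
    if front_depth is not None and not (front_depth < scene_depth):
        stencil += 1
    # Pass 2: back faces only, stencil must still be zero,
    # and the back face must pass a GREATER_EQUAL depth test.
    return stencil == 0 and back_depth >= scene_depth
```

<p>With a scene depth of 0.5, a volume spanning depths 0.4 to 0.6 shades the pixel, while a volume entirely in front (0.1 to 0.2) or entirely behind (0.7 to 0.9) is rejected. When the camera is inside the volume, only the back-face test decides.</p>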
<p>While this trick definitely increases efficiency, especially when the lighting situation is very complex, we can do even better. Modeling a detailed sphere as a polygon mesh often creates a large number of vertices that clog the pipeline. Instead, we can use a very coarse polygonal sphere (e.g., with 4 vertical/horizontal slices) with a slightly larger radius to ensure that the light volume is bounded correctly. We can even use just a bounding cube of the light volume! Of course, the simplest thing we can do is just use a quad. However, that gives up some of the aforementioned depth test benefits and also involves some complex projective geometry. Just for fun, I also prepared a derivation of the axis-aligned screen-space bounding quad of the projected sphere.</p>
<p style="text-align: center;"><a href="/img/quad_1.png"><img src="/img/quad_1.png" alt="alt text" /></a>
<a href="/img/quad_2.png"><img src="/img/quad_2.png" alt="alt text" /></a></p>
Tue, 02 Jan 2018 00:00:00 -0700
http://dqlin.xyz/tech/2018/01/02/stencil/
A Way of Rendering Crescent-shaped Shadows under Solar Eclipse
<p>Happy new year! I haven’t posted for a long time. However, I have done many exciting projects in the last half year and I’ll upload some of them soon. The first project I want to share with you is the render I created for the Utah Teapot Rendering Competition. We were all awed by the great solar eclipse on August 21st. Do you remember the crescent-shaped shadows cast by tree leaves? They were so beautiful that the first time I saw them, I wanted to render them with ray tracing. Before working on this competition, I thought there must be some complex math to figure out to simulate this rare phenomenon. However, the problem turned out to be embarrassingly simple. We can just model the actual geometry of the sun and moon and trace rays. You might think that it is a crazy idea. In fact, instead of creating a sun with a diameter of 1.4 million kilometers and putting it 150 million kilometers away, we can simply put a 1-unit-wide sun 100 units away from our scene, where 1 unit is approximately how big our scene is. This gives almost the same result. Then we use the same kind of trick to put a moon with a slightly smaller diameter slightly in front of the sun, so that the sun is eclipsed by it into a crescent shape. By treating the sun as an isotropic spherical emitter and the moon as a diffuse occluding sphere, and using a tree model with a detailed alpha-masked leaf texture, I got results that are surprisingly good and also fast to compute. In such a simple way, I created a nice image of a glass Utah teapot sitting under a tree on a lawn behind the Warnock Engineering Building, with crescent-shaped leaf shadows in the background.</p>
<p><a href="/img/crescent/crescent_shadow.png"><img src="/img/crescent/crescent_shadow.png" alt="alt text" /></a></p>
<center><small>
The render. Click to enlarge.
</small></center>
<p><a href="/img/crescent/workflow.png"><img src="/img/crescent/workflow.png" alt="alt text" /></a></p>
<center><small>
See the "sun" and "moon"? This is really how it works!
</small></center>
<p><a href="/img/crescent/sun_moon_closeup.png"><img src="/img/crescent/sun_moon_closeup.png" alt="alt text" /></a></p>
<center><small>
A close-up showing the crescent shape produced by the eclipse
</small></center>
<p>Hope you like this project. Sometimes complex effects are really that simple!</p>
<p>Remark:</p>
<p><small>
The tree model and the grass model (including their textures) were downloaded from TurboSquid.com
<br />Grass: <i>https://www.turbosquid.com/FullPreview/Index.cfm/ID/868103</i>
<br />Tree: <i>https://www.turbosquid.com/FullPreview/Index.cfm/ID/884484</i>
</small>
<br /> <small>
TurboSquid Royalty Free License
<br /><i>https://blog.turbosquid.com/royalty-free-license/</i>
</small>
<br /></p>
<p><small>
The pavement texture is downloaded from TextureLib.com <br />
<i>http://texturelib.com/texture/?path=/Textures/brick/pavement/brick_pavement_0099</i>
</small>
<br /><small>
License:
<i>http://texturelib.com/license/</i>
</small>
<br /></p>
<p><small>
The environment map is a panorama taken in the vicinity of the Warnock Engineering Building. It was taken by Cameron Porcaro and uploaded to Google Street View. Under Google’s Terms of Use it is considered fair use since it is not for commercial use.
</small></p>
Mon, 01 Jan 2018 00:00:00 -0700
http://dqlin.xyz/tech/2018/01/01/crescent/
http://dqlin.xyz/tech/2018/01/01/crescent/tech[GAPT Series - 13] Conclusions<!--more-->
<h2 id="131-summary">13.1 Summary</h2>
<p>This series focuses on how to improve the solution to the real-time path
tracing problem by introducing and discussing possible optimizations in
3 categories (SAS, sampling and SIMD), which are implemented in a
program with real-time rendering and interaction capability. While the
SIMD optimization is based on the parallel computing model of GPGPU and
is aimed specifically at the real-time requirement, the first two
categories, SAS and sampling, are not hardware dependent and are also
used in off-line renderers, as they are defined in the domain of a single
computing thread. However, it is also possible to improve the models
involved in these two categories to achieve better collaboration with
the GPGPU model. For SAS, a common bottleneck of ray tracing, the
SAH-based kd-tree and BVH were introduced because they minimize the
expected global cost of ray-primitive intersection tests better than
their peers and because each is indispensable in certain applications.
Optimization techniques for these data structures, including triangle
clipping and short-stack traversal for the kd-tree and node refitting
for the dynamic BVH, were also discussed with implementation details. In
the chapter on sampling, we introduced several context-based
optimizations of the Monte Carlo algorithm, all aimed at decreasing
variance in rendering: importance sampling of the BSDF, next event
estimation for direct lighting, multiple importance sampling combining
the previous two, and bidirectional path tracing for difficult lighting
conditions. Moreover, Metropolis
Light Transport as a modification of the basic Monte Carlo process based
on Markov Chain was introduced and some implementation details on GPU
were shared. For SIMD optimization, three types of technique (data
structure rearrangement, code-level thread divergence reduction, and
thread compaction) were illustrated with code and test cases. A more
efficient ray-triangle intersection method, which transforms the
problem space, was cited for its contribution to the performance of our
program. More importantly, we proposed in full detail a new GPU
construction algorithm for the SAH kd-tree, which turns out to greatly
reduce the initialization overhead for complex models. In addition, the
underlying mechanisms of the rendering effects supported in our program
(surface-to-surface reflection/refraction, volume rendering, and
subsurface scattering) were analyzed to clarify possible complications
in usage. For most methods we introduced and
discussed, test cases on our path tracer were provided to verify the
ideas. Finally, we benchmarked our program against the path tracing demo
in NVIDIA’s OptiX engine and against a free mainstream path tracer. The
results show that our program has a large advantage in rendering simple
scenes like the Cornell Box, improving the performance by up to 38% (see
Table 3 in Chapter 12), and that it slightly outperforms the free
mainstream path tracer on a complex rendering of a car. This suggests it
is at least competitive with most of today’s mainstream path tracers in
real-time rendering of models with industrial
complexity. By analyzing, gathering, testing, and integrating different
optimization techniques into a whole process, and choosing the correct
rendering methods, we can efficiently produce aesthetically-pleasing,
photorealistic results.</p>
<h2 id="132-limitations--recommendations-for-further-work">13.2 Limitations & Recommendations for Further Work</h2>
<p>Given the immense potential of GPGPU, it is possible that path tracing,
with its photorealistic, film-standard experience, will replace
rasterization-based graphics as the gaming standard in the future as
hardware performance continues to multiply. However, improvements in
algorithms and software structure are also necessary to reduce the
workload as much as possible and bring that day closer. This thesis
addresses many distinctive issues of real-time path tracing, such as
large thread divergence and dynamic geometry. However, many problems
that may appear in future real-world applications of path tracing have
not been considered due to time constraints. One such problem is
efficiently rendering a large set of animation data, which may contain
particle systems or complex deformation. Another is the insufficient
optimization of the spatial acceleration structure, which is a
bottleneck in ray-traced graphics. New algorithms or hardware need to be
developed to keep improving the traversal speed and to update or rebuild
the SAS with minimal effort. In addition, better parallelization methods
are still required for algorithms such as Metropolis Light Transport
that perform very well serially but are hard to parallelize, even though
many such methods have already been developed. Moreover, parsing can be
moved to the GPU to greatly reduce the initialization time of
geometrically complex scenes.</p>
<h2 id="bibliography"><em>Bibliography</em></h2>
<p>Ashikhmin, M., & Shirley, P. (2000). An anisotropic Phong BRDF model.
Journal of graphics tools, 5(2), 25-32.</p>
<p>Beason, K. (2007). Smallpt: Global Illumination in 99 lines of C++.
Retrieved from http://www.kevinbeason.com/smallpt/</p>
<p>Chandrasekhar, S. (1960). Radiative Transfer. New York: Dover
Publications. Originally published by Oxford University Press, 1950.</p>
<p>Chandrasekhar, S. (1960). The stability of non-dissipative Couette flow
in hydromagnetics. Proceedings of the National Academy of Sciences,
46(2), 253-257.</p>
<p>Cook, R. L., & Torrance, K. E. (1982). A reflectance model for computer
graphics. ACM Transactions on Graphics (TOG), 1(1), 7-24.</p>
<p>Foley, T., & Sugerman, J. (2005, July). KD-tree acceleration structures
for a GPU raytracer. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS
conference on Graphics hardware (pp. 15-22). ACM.</p>
<p>Henyey, L. G., & Greenstein, J. L. (1941). Diffuse radiation in the
galaxy. The Astrophysical Journal, 93, 70-83.</p>
<p>Jensen, H. W., Marschner, S. R., Levoy, M., & Hanrahan, P. (2001,
August). A practical model for subsurface light transport. In
Proceedings of the 28th annual conference on Computer graphics and
interactive techniques (pp. 511-518). ACM.</p>
<p>Kajiya, J. T. (1986, August). The rendering equation. In ACM Siggraph
Computer Graphics (Vol. 20, No. 4, pp. 143-150). ACM.</p>
<p>Kopta, D., Ize, T., Spjut, J., Brunvand, E., Davis, A., & Kensler, A.
(2012, March). Fast, effective BVH updates for animated scenes. In
Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and
Games (pp. 197-204). ACM.</p>
<p>Lafortune, E. P., & Willems, Y. D. (1993). Bi-directional path tracing.</p>
<p>NVIDIA. (2015). Memory Transactions. NVIDIA® Nsight™ Development
Platform, Visual Studio Edition 4.7 User Guide. Retrieved from
http://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/sourcelevel/memorytransactions.htm</p>
<p>NVIDIA. (2017). CUDA Toolkit Documentation. Retrieved from
http://docs.nvidia.com/cuda/thrust/#axzz4dK4GrjBF</p>
<p>Pauly, M. (1999). Robust Monte Carlo Methods for Photorealistic
Rendering of Volumetric Effects (Master’s thesis, Universität
Kaiserslautern).</p>
<p>Pharr, M., Jakob, W., & Humphreys, G. (2011). Physically based
rendering: From theory to implementation. Second Edition. Morgan
Kaufmann.</p>
<p>Santos, A., Teixeira, J. M., Farias, T., Teichrieb, V., & Kelner, J.
(2012). Understanding the efficiency of KD-tree ray-traversal techniques
over a GPGPU architecture. International Journal of Parallel
Programming, 40(3), 331-352.</p>
<p>Schlick, C. (1994, August). An Inexpensive BRDF Model for
Physically‐based Rendering. In Computer graphics forum (Vol. 13, No. 3,
pp. 233-246). Blackwell Science Ltd.</p>
<p>Schmittler, J., Woop, S., Wagner, D., Paul, W. J., & Slusallek, P.
(2004, August). Realtime ray tracing of dynamic scenes on an FPGA chip.
In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics
hardware (pp. 95-106). ACM.</p>
<p>Veach, E. (1997). Robust Monte Carlo methods for light transport
simulation (Doctoral dissertation, Stanford University).</p>
<p>Vinkler, M., Havran, V., & Bittner, J. (2014, May). Bounding volume
hierarchies versus kd-trees on contemporary many-core architectures. In
Proceedings of the 30th Spring Conference on Computer Graphics (pp.
29-36). ACM.</p>
<p>Wald, I., & Havran, V. (2006, September). On building fast kd-trees for
ray tracing, and on doing that in O(N log N). In Interactive Ray
Tracing 2006, IEEE Symposium on (pp. 61-69). IEEE.</p>
<p>Walter, B., Marschner, S. R., Li, H., & Torrance, K. E. (2007, June).
Microfacet models for refraction through rough surfaces. In Proceedings
of the 18th Eurographics conference on Rendering Techniques (pp.
195-206). Eurographics Association.</p>
<p>Ward, G. J. (1992). Measuring and modeling anisotropic reflection. ACM
SIGGRAPH Computer Graphics, 26(2), 265-272.</p>
<p>Zhou, K., Hou, Q., Wang, R., & Guo, B. (2008). Real-time kd-tree
construction on graphics hardware. ACM Transactions on Graphics (TOG),
27(5), 126.</p>
<h2 id="appendix"><em>Appendix</em></h2>
<p>The following pictures show the result of rendering a BMW M6 car for one
minute in Cycles Render, one minute in our path tracer, and one hour in
our path tracer, respectively. The BMW M6 car model was created by Fred
C. M’ule Jr. in 2006, released under the
<a href="https://creativecommons.org/publicdomain/zero/1.0/">CC-Zero</a> (public
domain) license, and downloaded from
http://www.blendswap.com/blends/view/3557.</p>
<p>
<img src="/img/gpt_part2/image127.jpg" alt="" />
<img src="/img/gpt_part2/image128.jpg" alt="" />
<img src="/img/gpt_part2/image129.jpg" alt="" /></p>
Fri, 30 Jun 2017 00:00:00 -0600
http://dqlin.xyz/tech/2017/06/30/13_final/
http://dqlin.xyz/tech/2017/06/30/13_final/tech[GAPT Series - 12] Benchmarking<!--more-->
<p>Benchmarking different path tracing engines is not a trivial task.
Different engines have different strengths in different types of
rendering tasks. In addition, different engines may use different
rendering methods, so it is difficult to choose a measure of
performance. If one engine uses Metropolis Light Transport and another
uses brute-force path tracing, one cannot claim that the first engine
performs better than the second just because it has a higher frame rate
(or samples per second for offline path tracing). Normally, we compare
performance by convergence rate, measured by the variance level: the
engine that converges further in the same amount of time performs
better. Note that the variance measure is only valid when the engines
use the same basic sampling method (the only two we have introduced are
normal Monte Carlo sampling and Markov Chain Monte Carlo sampling, used
only in MLT), as MLT will always try to find the smallest variance even
if the color is incorrect (also known as start-up bias). Alternatively,
it may seem that one can compare the absolute difference between the
rendered image and the ground truth. However, the BSDFs used in
different engines are usually slightly different, in which case the
absolute difference is an invalid measure. A practical solution is to
force the engines to use the same basic sampling method (in most cases
it can be changed in the options) and compare the convergence rate. It
is important to note that the images generated by the engines must
either all be tone mapped or all be left without tone mapping before a
variance comparison can be made. If it is known that the engines being
compared use the same specific sampling method (like next event
estimation or bidirectional path tracing), it is simpler to compare the
frame rate directly. However, one should still look at the differences
between the rendered image and the ground truth to catch low-quality
images or artefacts produced by an incorrect implementation or a
deviation from the industrial standard.</p>
<p>Two scenes are used to benchmark our path tracer against some free
mainstream path tracers. The first scene is the default Cornell Box with
all Lambertian diffuse surfaces and a diffuse area light, rendered with
next event estimation. We compare our path tracer with the real-time
path tracing sample program of NVIDIA’s OptiX ray tracing engine
(Figure 18). Since the sample program is open-source and uses the same
next event estimation method, and both our program and NVIDIA’s program
are real-time path tracers, we can compare the performance by directly
comparing the rendering frame rate. Table 3 shows the frame rates of our
path tracer and NVIDIA’s path tracer at 512x512 resolution with 4
samples taken for each pixel in each frame, on the mid-range NVIDIA
GeForce GTX 960M graphics card and the high-end NVIDIA GeForce GTX 1070
graphics card.</p>
<p><img src="/img/gpt_part2/image125.jpg" alt="" /></p>
<center><small>
Figure 18. Left: our render. Right: NVIDIA's render.
</small></center>
<table>
<thead>
<tr>
<th>Type</th>
<th>GTX 960M</th>
<th>GTX 1070</th>
</tr>
</thead>
<tbody>
<tr>
<td>NVIDIA’s Path Tracer</td>
<td>13.52 fps</td>
<td>30.0 fps</td>
</tr>
<tr>
<td>Our Path Tracer</td>
<td>14.02 fps</td>
<td>41.5 fps</td>
</tr>
<tr>
<td>Speedup</td>
<td>3.7%</td>
<td>38.3%</td>
</tr>
</tbody>
</table>
<center><small>
Table 3 Speedup of our path tracer over NVIDIA's on different graphics
cards
</small></center>
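<p>The Speedup row of Table 3 can be reproduced from the two frame-rate rows. The short sketch below uses a helper name chosen here for illustration; the frame rates are the ones listed in the table.</p>

```python
def speedup_percent(ours_fps, baseline_fps):
    """Relative frame-rate gain of `ours_fps` over `baseline_fps`, in percent."""
    return (ours_fps / baseline_fps - 1.0) * 100.0

# Frame rates taken from Table 3.
gtx_960m = speedup_percent(14.02, 13.52)
gtx_1070 = speedup_percent(41.5, 30.0)

print(f"GTX 960M: {gtx_960m:.1f}%  GTX 1070: {gtx_1070:.1f}%")
```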
<p>One reason our path tracer gains a larger speedup on the high-end
graphics card is that high-end cards have higher memory bandwidth. This
speeds up the memory operations in stream compaction, which our path
tracer uses and NVIDIA’s path tracer does not.</p>
<p>The second scene is the BMW M6 car modeled by Fred C. M’ule Jr. (2006),
which tests the capability of our path tracer to render models from real
applications. For comparison, we chose the Cycles Render embedded in
Blender. Although it is an off-line renderer, it also has a “preview”
function that progressively renders the result in real time. Notice that
Cycles Render uses a different workflow to blend the material color and
may use different BSDF formulae for the same material attributes,
causing the appearance to be different (the glass and metal rendered by
Cycles are less reflective for the same attributes, and the overall tone
is different). It is extremely hard to tune the two rendering results to
look the same, but we can still guarantee that the workload on each path
tracer is almost the same, as the choice of material component depends
on the Fresnel equation.</p>
<p>Since the implementations may also be vastly different, we use the
convergence reached in one minute as the measure of performance. It is
important to know that it is invalid to compare convergence using the
variance of all pixels in the picture, because that variance also
includes the genuine color variation of the scene. As convergence
corresponds to the noise level in Monte Carlo sampling, a small region
that should be rendered to a uniform color is used for the convergence
test. For this scene, it is convenient to choose the upper-left 64x64
pixels to compare variance, as the wall has a uniform diffuse material
which produces nearly the same color under the current lighting
condition. The one-hour render from our path tracer is used as the
ground truth for the variance comparison (using either renderer’s result
would be equivalent). For illumination, a 3200x1600 environment map of a
forest under the sun is used.</p>
<p>The images in Figure 19 show the greyscale values of the upper-left
64x64 square region from Cycles Render, our path tracer, and the ground
truth. By eye alone, it is difficult to judge whether our result or
Cycles’ result has the lower noise level. However, we can analyze the
variance numerically by evaluating the standard deviation of the pixel
values. Using OpenCV, the average value and the standard deviation of
all greyscale pixels can be easily obtained; they are listed in
Table 4.</p>
<p><img src="/img/gpt_part2/image126.png" alt="" /></p>
<center><small>
Figure 19. Left, middle and right: the upper-left 64x64 pixels, in
greyscale, of Cycles’ render, our render, and the ground truth
</small></center>
<table>
<thead>
<tr>
<th>Type </th>
<th>Average</th>
<th>Std. Deviation</th>
<th>Convergence</th>
</tr>
</thead>
<tbody>
<tr>
<td>Theirs</td>
<td>98.42</td>
<td>10.31</td>
<td>16.7%</td>
</tr>
<tr>
<td>Ours</td>
<td>98.95</td>
<td>9.41</td>
<td>18.3%</td>
</tr>
<tr>
<td>Ground Truth</td>
<td>104.6</td>
<td>1.72</td>
<td>100%</td>
</tr>
</tbody>
</table>
<center><small>
Table 4 Comparing convergence rate of our path tracer and Cycles Render
in 1 minute's render time
</small></center>
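<p>The Convergence column of Table 4 is consistent with dividing the ground truth's standard deviation by each render's standard deviation. That reading is inferred from the numbers, not stated in the text, so the sketch below should be taken as an assumption. It uses Python's <code>statistics</code> module as a stand-in for the OpenCV call, and the function names and the tiny synthetic patch are illustrative only.</p>

```python
import statistics

def patch_stats(grey_patch):
    """Mean and population standard deviation of a 2D list of grey values."""
    pixels = [value for row in grey_patch for value in row]
    return statistics.mean(pixels), statistics.pstdev(pixels)

def convergence_percent(std_dev, ground_truth_std):
    """Assumed definition: ground-truth noise level relative to the render's."""
    return ground_truth_std / std_dev * 100.0

# Standard deviations reported in Table 4.
cycles = convergence_percent(10.31, 1.72)
ours = convergence_percent(9.41, 1.72)

# Tiny synthetic patch, only to show how the statistics are obtained.
mean, std = patch_stats([[98, 102], [100, 100]])

print(f"Cycles: {cycles:.1f}%  ours: {ours:.1f}%")
```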
<p>From the data, we can see that our path tracer does perform slightly
better than Cycles Render. Although time restrictions prevented us from
carrying out more tests with different scenes and other mainstream
renderers, the complexity of this test scene (700K+ faces; glossy,
diffuse and refraction BSDFs; environment lighting) is solid evidence
that our path tracer performs at least on the same level as current
mainstream rendering software. For the interested reader, we also
provide our and Cycles’ rendering results after 1 minute, and our
rendering result after 1 hour, in the appendix.</p>
Fri, 30 Jun 2017 00:00:00 -0600
http://dqlin.xyz/tech/2017/06/30/12_bench/
http://dqlin.xyz/tech/2017/06/30/12_bench/tech[GAPT Series - 11] SIMD Optimization (cont.)<p>As mentioned half a year ago, apart from data structure rearrangement
and thread divergence reduction, we can also optimize the SIMD
performance by doing thread compaction.
The first section below will introduce how I worked it out using the CUDA
Thrust API, followed by a proposal of a <strong>new method for parallel
construction of kd-tree on GPU</strong>.</p>
<!--more-->
<p>The following sections will introduce three types of optimizations based
on the CUDA architecture (data structure rearrangement, thread
divergence reduction and thread compaction) used in our path tracer to
increase the SIMD efficiency and reduce the overall rendering time. Most
of these optimizations are made necessary by the real-time rendering
requirement, under which it is not possible to fix the number of samples
for each rendering branch in advance. After that, two sections will be
dedicated to optimizations of specific components: a ray-triangle
intersection algorithm better suited to SIMD will be introduced and a
GPU construction algorithm for the SAH kd-tree will be proposed.</p>
<h2 id="111-thread-compaction">11.1 Thread Compaction</h2>
<p>Russian roulette is necessary for transforming the theoretically
infinite ray bounces into a sampling process with a finite number of
stages, each of which terminates the path with some probability. While
it decreases the expected number of iterations for each thread in every
frame and gives an overall speedup through thread blocks that terminate
early, it also scatters terminated threads everywhere. This results in a
low percentage of useful operations across warps (the 32 threads in a
warp are always executed as a whole) and a low overall occupancy (number
of active warps / maximum number of active warps) in CUDA, and the
problem worsens as the number of iterations increases.</p>
<p>Among the basic parallel primitives, stream compaction on the array of
threads is a natural fit for this problem. Figure 16 illustrates it with
a simplified case in which each warp contains only 4 threads and only
one block with 4 warps is running on the GPU; green and red represent
active and inactive threads. Before stream compaction, the rate of
useful operations is 25% (3 active threads out of 12 running threads).
After grouping the 3 active threads into the same warp, the rate of
useful operations becomes 75%. Equivalently, the same amount of work can
be done by 1/3 of the warps, leaving room for other blocks to execute
their warps. Also, if the first row is the average case for multiple
blocks, the occupancy would be 75%, since each block with 4 warps has an
inactive warp, implying that less work can be done with the same
hardware resources. With stream compaction, occupancy is close to 100%
in the first few iterations, until the total number of active threads is
no longer enough to fill up the streaming multiprocessors.</p>
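<p>The 25% and 75% figures can be checked with a small serial model of the simplified case. In this sketch the thread layout is hypothetical (the figure itself is not reproduced here), the helper names are made up for illustration, and sorting stands in for a real parallel stream compaction.</p>

```python
WARP_SIZE = 4  # simplified warp width used in the Figure 16 example

def useful_rate(active_flags, warp_size=WARP_SIZE):
    """Return (fraction of executed lanes that are active, number of running warps).

    A warp runs as a whole if any of its threads is active, so every lane
    of such a warp counts as executed.
    """
    warps = [active_flags[i:i + warp_size]
             for i in range(0, len(active_flags), warp_size)]
    running = [warp for warp in warps if any(warp)]
    executed_lanes = sum(len(warp) for warp in running)
    return sum(active_flags) / executed_lanes, len(running)

def compact(active_flags):
    """Move active threads to the front, which is what compaction achieves."""
    return sorted(active_flags, reverse=True)

# Hypothetical layout: 3 active threads spread over 3 of the 4 warps.
before = [True, False, False, False,
          False, True, False, False,
          False, False, False, False,
          False, False, True, False]
after = compact(before)

print(useful_rate(before), useful_rate(after))
```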
<p><img src="/img/gpt_part2/image122.jpg" alt="" /></p>
<center><small>
Figure 16 Upper: Before thread compaction Lower: After thread
compaction
</small></center>
<p>In order to measure the performance impact of thread compaction, we
designed a test comparing the frame rate of path tracing with and
without thread compaction on our NVIDIA GeForce GTX 960M, for maximum
trace depths of 1 to 10, 20, 50 and 100. The test scene is the standard
Cornell box rendered with next event estimation, with 1,048,576 paths
traced in each frame.</p>
<p><img src="/img/gpt_part2/image123.jpg" alt="" /></p>
<center><small>
Figure 17 Frame rate as a function of max trace depth, for the program
with and without thread compaction
</small></center>
<p>As shown in Figure 17, without thread compaction the frame rate
declines rapidly over the first 5 increments of max trace depth, after
which the decline is approximately linear until the depth at which all
threads become inactive, probably between 20 and 30.
With thread compaction, the frame rate starts to surpass the original
one at depth 3, falls off only slightly with every further depth
increment, and becomes almost stable after depth 5.</p>
<p>Thread compaction makes the first two max depths slower because it has
some initialization overhead, which cannot be offset by the speedup from
stream compaction when too few threads have terminated. A struct storing
the next ray, mask color, pixel position and active state needs to be
initialized at the beginning for each thread and retrieved in every
bounce. For stream compaction, we again use the Thrust library
introduced in Chapter 2, which offers a remove_if() function that
removes the array elements satisfying a custom predicate. For this task,
the predicate takes the struct as its argument and checks whether the
active state is false to determine which elements to discard.</p>
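<p>The shape of this compaction step can be sketched serially in Python. This is not the CUDA code of the path tracer: the field names are illustrative, and the list-based <code>remove_if</code> below only mimics the role of Thrust's function of the same name.</p>

```python
from dataclasses import dataclass

@dataclass
class PathState:
    """Per-thread state carried between bounces (field names are illustrative)."""
    next_ray: tuple      # (origin, direction)
    mask_color: tuple    # accumulated color mask
    pixel: int           # index of the pixel this path contributes to
    active: bool         # False once the path has terminated

def is_terminated(state):
    """Predicate: True for elements that should be discarded."""
    return not state.active

def remove_if(states, predicate):
    """Order-preserving removal, standing in for a parallel remove_if."""
    return [state for state in states if not predicate(state)]

ray = ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
states = [PathState(ray, (1.0, 1.0, 1.0), pixel, active)
          for pixel, active in enumerate([True, False, True, False, False])]
survivors = remove_if(states, is_terminated)

print([state.pixel for state in survivors])
```

<p>Because each surviving struct still carries its pixel position, the compacted threads can keep writing to the right pixels even though their positions in the array have changed.</p>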
<p>We can also use stream compaction to rearrange threads so that the
threads that will run the same Fresnel branch in the next iteration are
grouped together. The number of stream compaction operations equals the
number of Fresnel branches (3 in our case). By using double buffering,
the result of each stream compaction can be copied or appended to
another array. After the regrouped array has been generated, the indices
of the two buffers are swapped. In our experiment with a simple scene
adapted from the Cornell box with glossy reflection, diffuse reflection
and caustics, regrouping the threads achieves up to a 30% speedup.</p>
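<p>A serial sketch of this regrouping, under the assumption that each path already knows which branch it will take next. The branch ids, the tuple layout and the function name are placeholders chosen for illustration.</p>

```python
NUM_BRANCHES = 3  # the post uses 3 Fresnel branches

def regroup_by_branch(buffers, front, branch_of):
    """Run one compaction pass per branch, appending into the back buffer.

    Returns the index of the buffer that now holds the regrouped paths.
    """
    back = 1 - front
    buffers[back].clear()
    for branch in range(NUM_BRANCHES):
        # Keep only the paths that take this branch in the next iteration.
        buffers[back].extend(p for p in buffers[front] if branch_of(p) == branch)
    return back  # swap: the back buffer becomes the front buffer

# Each path is (pixel_index, next_branch); the branch ids are placeholders.
paths = [(0, 2), (1, 0), (2, 1), (3, 0), (4, 2), (5, 1)]
buffers = [list(paths), []]
front = regroup_by_branch(buffers, 0, branch_of=lambda p: p[1])

print(buffers[front])
```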
<h2 id="112-gpu-sah-kd-tree-construction">11.2 GPU SAH Kd-tree Construction</h2>
<p>We will propose a GPU SAH kd-tree construction method in this section.
So far, the CPU construction of the SAH kd-tree has a lower bound of
O(N log N), which is still too slow for complex scenes with more than 1
million triangles. It takes more than 10 seconds to construct the SAH
kd-tree for the 1,087,716-face Happy Buddha model on our Intel i7
6700HQ, which is a serious overhead. Given the immense power of current
GPGPU, adapting kd-tree construction to a parallel algorithm is a
promising task. A GPU kd-tree construction algorithm was proposed by
Zhou et al. (2008). It splits the construction levels into large node
stages, where the median of the node’s refitted bounding box is chosen
as the spatial split, and small node stages, where SAH is used to select
the best split. Although its construction speed is high, the method
sacrifices some traversal performance due to the approximate choice of
splits in the large node stages. In contrast, we will now propose a full
SAH tree construction algorithm on the GPU.</p>
<p>First, similar to Wald’s CPU kd-tree construction model (2006), we
create an event struct for every end of each triangle’s bounding box in
each of the 3 dimensions. The struct contains the 1D position, the type
(0 for a starting event, 1 for an ending event), the triangle index
(which is actually the triangle address, since at the beginning the
node-triangle association list is the same as the triangle list), and an
“isFlat” Boolean which marks whether the opposite end has the same
coordinate. The events are stored in 3 arrays, one per dimension. For
each dimension, the event array is sorted by ascending position, keeping
ending events before starting events when the positions are the same. We
use the same routine as Wald’s algorithm, except that it runs in
parallel: the triangle of an ending event is subtracted from the right
side before the SAH calculation and the triangle of a starting event is
added to the left side after the SAH calculation, which guarantees that
triangles with an end lying on the splitting plane find the correct
parent. The sort should be a highly efficient parallel sort such as
parallel radix sort.</p>
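<p>For one dimension, the event creation and the tie-breaking rule of the sort can be sketched as follows. The class and function names are illustrative, and Python's built-in sort stands in for the parallel radix sort.</p>

```python
from dataclasses import dataclass

START, END = 0, 1  # event types as encoded in the text

@dataclass
class Event:
    position: float
    type: int        # START or END
    tri: int         # triangle address
    is_flat: bool    # True if both ends share the same coordinate

def build_events(intervals):
    """intervals[i] = (lo, hi) of triangle i's bounding box along one axis."""
    events = []
    for tri, (lo, hi) in enumerate(intervals):
        flat = lo == hi
        events.append(Event(lo, START, tri, flat))
        events.append(Event(hi, END, tri, flat))
    # Ascending position; at equal positions END (1) sorts before START (0).
    events.sort(key=lambda e: (e.position, -e.type))
    return events

events = build_events([(0.0, 2.0), (2.0, 3.0), (1.0, 1.0)])

for event in events:
    print(event)
```

<p>At position 2.0 the ending event of triangle 0 comes before the starting event of triangle 1, which is the ordering the text asks for.</p>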
<p>After that, we separate the struct attributes into an SoA (structure of
arrays) for a better memory access pattern. Since we will be processing
the nodes in parallel, we also need an “owner” array that stores the
index of the owner node; it is initialized to zeros, as the root has an
index of 0. So far, we have three position arrays, three type arrays,
three triangle address arrays, three isFlat arrays, and one owner array,
each with the same length as the number of events from all nodes in the
current construction level. In addition, we need an array for the
node-triangle association, which lists the indices of the triangles
associated with the nodes in the current level in node-by-node order.
This node-triangle association list (called the triangle list for short)
also needs an owner list, which we call “triOwner”, also initialized to
zeros.</p>
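<p>A minimal sketch of this layout for a single dimension, with plain Python lists standing in for device arrays. The dictionary keys and variable names are chosen for illustration only.</p>

```python
def to_soa(events):
    """events: list of (position, type, tri, is_flat) tuples for one axis."""
    positions, types, tris, flats = (list(column) for column in zip(*events))
    return {
        "position": positions,
        "type": types,
        "tri": tris,
        "is_flat": flats,
        "owner": [0] * len(events),  # every event starts in the root, node 0
    }

axis_events = [(0.0, 0, 0, False), (1.0, 0, 1, False),
               (2.0, 1, 0, False), (3.0, 1, 1, False)]
soa = to_soa(axis_events)

tri_list = [0, 1]                # node-triangle association list
tri_owner = [0] * len(tri_list)  # owner node of each triangle-list entry

print(soa["type"], soa["owner"], tri_owner)
```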
<p>What is still left to initialize are the two dynamic arrays: nodeList,
which stores all the processed nodes (pushed in groups from the working
node array of the current construction level, in linear order), and
leafTriList, which stores all the triangles in leaves in leaf-by-leaf
linear order.</p>
<p>After all initializations are done, we choose the dimension with the
largest span in the root’s bounding box. In the following iterations,
this selection is done in parallel at the moment the node structs are
created for all newly spawned children of the current level. The
following explanation treats the current construction level as a general
level with many nodes rather than level 0. The first parallel operation
we perform, other than sorting, is an inclusive segmented scan on the
type array. Its purpose is to count the number of ending events before
the current event (including the current event if it is an ending
event). This count is used to calculate the number of triangles on the
left and right sides of the splitting plane, which, together with the
surface areas of the bounding boxes of the potential left and right
children, is required to evaluate the SAH function. In this segmented
scan, the owner array is used as the key to separate events from
different nodes. It is worth mentioning that, for the SAH calculation,
the offset of the node’s events in the event list is stored in the node
struct, so that an event knows its relative position within its node’s
part of the array. This relative position is used together with the scan
result to derive the number of triangles in the left and right subspaces
of the splitting plane. When a flat triangle lies on the splitting
plane, we simplify the SAH calculation by grouping all such flat
triangles to the left side, which in most cases has no influence on
traversal performance, so that the flat case needs no special handling
in triangle counting. The information of a potential split is stored in
a struct containing the SAH cost, the two child bounding boxes, the
splitting position, and the numbers of left-side and right-side
triangles. The array of these structs then undergoes a segmented
reduction to find the best split (the one with minimal SAH cost) for
each node.</p>
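<p>The scan, the triangle counting and the per-node reduction can be sketched serially as below. The way the left and right counts are derived from the scan result is one reading of the paragraph above, not the author's code. All names (<code>node_offset</code>, <code>node_tri_count</code> and the functions) are hypothetical, flat triangles are ignored, and the costs passed to the reduction are placeholder numbers, not real SAH values.</p>

```python
def segmented_inclusive_scan(values, owners):
    """Running sum of `values` that restarts whenever the owner changes."""
    out, running, previous = [], 0, None
    for value, owner in zip(values, owners):
        running = value if owner != previous else running + value
        out.append(running)
        previous = owner
    return out

def split_counts(types, owners, node_offset, node_tri_count):
    """(left, right) triangle counts for a split at each event's position."""
    ends_incl = segmented_inclusive_scan(types, owners)
    counts = []
    for i, (kind, owner) in enumerate(zip(types, owners)):
        rel = i - node_offset[owner]          # index inside the node's segment
        ends_before = ends_incl[i] - kind     # ending events strictly before
        n_left = rel - ends_before            # starting events strictly before
        n_right = node_tri_count[owner] - ends_incl[i]
        counts.append((n_left, n_right))
    return counts

def segmented_argmin(costs, owners):
    """Index of the cheapest candidate split for each owner node."""
    best = {}
    for i, (cost, owner) in enumerate(zip(costs, owners)):
        if owner not in best or cost < costs[best[owner]]:
            best[owner] = i
    return best

# Node 0 holds triangles spanning [0, 2] and [1, 3]; node 1 holds one over [5, 6].
types = [0, 0, 1, 1, 0, 1]       # 0 = starting event, 1 = ending event
owners = [0, 0, 0, 0, 1, 1]
counts = split_counts(types, owners, node_offset=[0, 4], node_tri_count=[2, 1])
best = segmented_argmin([5.0, 3.0, 4.0, 6.0, 2.0, 2.5], owners)

print(counts, best)
```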
<p>The next step is assigning triangles to each side, which is also the
step where we determine whether to turn the node of interest into a
leaf. In the assigning function, which is launched in parallel for every
event in the current splitting dimension, we check whether the best cost
is greater than the cost of not splitting (which in most cases is
proportional to the number of triangles in the node) or the number of
triangles in the node is below a threshold we set. If so, we create a
leaf by setting the “axis” attribute in the node struct to 3.
To assign triangles to the two children, our key method is to use a bit
array twice the size of the current triangle list and let the thread of
each event write a 1 at the triangle’s address on the side it belongs to
(or on both sides if the triangle belongs to both the left and the
right). The bit array is then scanned to obtain the addresses in the
next level’s triangle list. Since the events are in sorted order, an
event can decide which side it belongs to by comparing its index with
the index of the event chosen as the best split. If the event is a
starting event and its index is smaller than the best index, it assigns
its triangle to the left side; if the event is an ending event and its
index is greater than the best index, it assigns its triangle to the
right side. Notice that because we launch a thread for each event, a
triangle spanning the splitting plane is correctly assigned to both
sides by different threads, without special care. In addition, flat
triangles lying on the splitting plane are assigned to both sides (the
isFlat variable is checked for this) to avoid artefacts caused by
numerical inaccuracy in traversal.</p>
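<p>For a single node, the assignment rule and the address scan can be sketched as below. The layout of the bit array (left child in the first half, right child in the second half) is an assumption, the function names are illustrative, and the special handling of flat triangles and leaves is left out.</p>

```python
from itertools import accumulate

START, END = 0, 1

def assign_sides(types, tris, best_index, n_tris):
    """Mark, per triangle address, membership of the left and right child.

    Assumed layout: bits[0:n_tris] is the left child, bits[n_tris:] the right.
    """
    bits = [0] * (2 * n_tris)
    for index, (kind, tri) in enumerate(zip(types, tris)):  # one GPU thread each
        if kind == START and index < best_index:
            bits[tri] = 1
        elif kind == END and index > best_index:
            bits[n_tris + tri] = 1
    return bits

def exclusive_scan(bits):
    """Output address of every set bit in the next level's triangle list."""
    return list(accumulate(bits, initial=0))[:-1]

# Sorted events of two triangles spanning [0, 2] and [1, 3]; the best split
# is the END event of triangle 0 (index 2), i.e. the plane at position 2.
types = [START, START, END, END]
tris = [0, 1, 0, 1]
tri_list = [0, 1]

bits = assign_sides(types, tris, best_index=2, n_tris=len(tri_list))
addresses = exclusive_scan(bits)
next_tri_list = [tri for tri, bit in zip(tri_list + tri_list, bits) if bit]

print(bits, addresses, next_tri_list)
```

<p>Triangle 1 spans the plane, so it is marked in both halves by two different events, while triangle 0 ends exactly on the plane and goes to the left child only.</p>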
<p>In the triangle assignment function, the threads also fill a leaf
indicator array, which has a 1 at the position of every triangle in the
triangle list that belongs to a newly created leaf. This array is
scanned to determine the address of each such triangle in leafTriList,
similar to how the addresses of triangles in the next level’s triangle
list are determined. It is also reduced to obtain the number of
triangles added to leafTriList in the current level, which is used to
calculate the new size of the whole leafTriList and serves as the next
level’s offset. Since we also need to know the local offset of each
leaf’s triangles within the current level’s part of leafTriList, we do a
segmented reduction followed by an exclusive scan on the leaf indicator
array before assigning the offset to the leaf’s struct.</p>
<p>Before spawning new events for the child nodes, we need to finish the
remaining operations on the triangle list. The triOwner list for the
new level is generated by “spawning” a list of doubled size from the
original triOwner list: the list is appended to itself, with the owner
indices in the second half offset by the original number of owner nodes,
and then stream compacted using the aforementioned bit array as the key,
which removes the entries of triangles that do not belong to the
corresponding side. One issue is that after the stream compaction
the owner indices are no longer consecutive, so they cannot be used for
indexing. This is solved by first running a segmented reduction (more
properly, a count) over a constant array of 1s keyed by the new triOwner
array; the returned values are stored as the triangle counts of the next
level’s nodes. A parallel binary search of the new triOwner array in the
returned key array then yields consecutive indices, which replace the
array. In a similar way, the
triangle list for the next level is “spawned” from the original triangle
list and compacted by the bit array.</p>
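<p>A sequential sketch of this “stream compaction, segmented count, binary search” chain is given below. The function name and argument layout are hypothetical; on the GPU each step would be a parallel primitive (for example a library compaction, reduce-by-key and vectorized binary search), which this Python version only imitates.</p>

```python
from bisect import bisect_left


def spawn_owners(tri_owner, bits, n_nodes):
    """Build the next level's triOwner list and per-node triangle counts.

    tri_owner: owner node index of each triangle in the current level (sorted).
    bits:      bit array of twice the triangle count (left half, right half).
    n_nodes:   number of nodes in the current level.
    """
    # Append the list to itself; right-child owners are offset by n_nodes.
    doubled = tri_owner + [owner + n_nodes for owner in tri_owner]
    # Stream compaction keyed by the bit array.
    compacted = [owner for owner, bit in zip(doubled, bits) if bit]
    # Segmented count: reduce a constant array of 1s by key.
    keys, counts = [], []
    for owner in compacted:
        if keys and keys[-1] == owner:
            counts[-1] += 1
        else:
            keys.append(owner)
            counts.append(1)
    # Binary search of each owner in the unique keys gives consecutive indices.
    new_owner = [bisect_left(keys, owner) for owner in compacted]
    return new_owner, counts
```

<p>In the example used for the check, node 0 is split and node 1 becomes a leaf, so the surviving owner values 0 and 2 are remapped to the consecutive indices 0 and 1.</p>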
<p>Finally, we explain how the next level’s events (type, split position,
isFlat and triangle address) are generated. The method is surprisingly
simple: after duplicating the event list, we only need to produce a bit
array for events by looking up the corresponding values in the bit array
for triangles, using each current event’s triangle address as the index
into the triangle bit array. The three attributes type, split position and
isFlat are spawned by duplicating the original arrays and performing a
stream compaction with the event bit array as the key. The triangle address
array spawns the array for the next level in the same way: it is
duplicated, the new addresses are read from the previously scanned
triangle bit array, and a stream compaction is applied.</p>
<p>There is only one last array to spawn: the event owner list
of the next level, which is generated with the same method as the
triOwner array (stream compaction, segmented reduction,
binary search). Before the next iteration
begins, node structs for the next level are created from data such as
counts and offsets in the corresponding previously generated arrays and
pushed to the final node list as a whole level. The splitting axes for the
next level are also chosen in this process by comparing the lengths of the
three dimensions of the bounding box. If an axis different from the current
axis is chosen, the 4 event arrays for the 3 dimensions are “rotated” to the
desired place. Index 0 always stands for the splitting axis: if the current
splitting axis is x, then y and z are stored under indices 1 and 2; if the
next splitting axis is z, the memory undergoes a “recursive downward”
rotation so that z is rotated to 0, x to 1 and y to 2.
Finally, the pointers of all working arrays are swapped with the
buffered arrays. The loop terminates when the next level has no
nodes.</p>
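<p>The axis choice and the rotation of the per-axis storage can be illustrated with a small sketch. It assumes the rotation is a plain cyclic shift of three slots, which matches the x, y, z to z, x, y example above; the function names are hypothetical and the real implementation rotates the event arrays rather than labels.</p>

```python
def longest_axis(bbox_min, bbox_max):
    """Index (0 = x, 1 = y, 2 = z) of the longest bounding-box extent."""
    extents = [hi - lo for lo, hi in zip(bbox_min, bbox_max)]
    return extents.index(max(extents))


def rotate_axes(slots, next_axis):
    """Cyclically rotate the slots so that next_axis lands in slot 0."""
    k = slots.index(next_axis)
    return slots[k:] + slots[:k]
```

<p>For instance, <code>rotate_axes(["x", "y", "z"], "z")</code> gives <code>["z", "x", "y"]</code>, so slot 0 always holds the arrays of the current splitting axis.</p>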
<p>We also performed a test comparing the speed of Wald’s CPU
construction and our GPU construction of the same SAH kd-tree (full SAH
without triangle clipping) on a computer with an Intel i7-4770 processor
and an NVIDIA GTX 1070 graphics card. The result (Table 2) shows that a
sufficiently large model is required for our GPU construction to
outperform the CPU counterpart, due to the overhead of memory allocation
and transfer. A nearly 5x speedup is obtained when the model size goes
beyond 1M faces, which indicates that our method can be used for ray tracing
large models to greatly reduce the initialization overhead while
maintaining the same tree quality.</p>
<table>
<thead>
<tr>
<th>Model</th>
<th>Face Count</th>
<th>CPU (s)</th>
<th>GPU (s)</th>
<th>Speedup</th>
</tr>
</thead>
<tbody>
<tr>
<td>Cornell</td>
<td>32</td>
<td>0.001</td>
<td>0.046</td>
<td>0.02x</td>
</tr>
<tr>
<td>Suzanne</td>
<td>968</td>
<td>0.016</td>
<td>0.095</td>
<td>0.17x</td>
</tr>
<tr>
<td>Bunny</td>
<td>69,451</td>
<td>1.442</td>
<td>0.655</td>
<td>2.20x</td>
</tr>
<tr>
<td>Dragon</td>
<td>201,031</td>
<td>3.705</td>
<td>1.100</td>
<td>3.37x</td>
</tr>
<tr>
<td>Buddha</td>
<td>1,087,716</td>
<td>13.903</td>
<td>2.801</td>
<td>4.96x</td>
</tr>
</tbody>
</table>
<center><small>
Table 2 Speedup of our GPU SAH kd-tree construction compared with Wald's CPU
algorithm
</small></center>
Fri, 30 Jun 2017 00:00:00 -0600
http://dqlin.xyz/tech/2017/06/30/11_stream/
http://dqlin.xyz/tech/2017/06/30/11_stream/tech[GAPT Series - 10] Rendering Effects<p>Before going to the discussion of SIMD optimization, we present this
chapter to briefly introduce the rendering effects supported by the path
tracer, which matter because they are direct applications of the
sampling methods discussed before.</p>
<!--more-->
<h2 id="101-surface-to-surface-reflectionrefraction">10.1 Surface-to-surface Reflection/Refraction</h2>
<p>To be practically useful, our path tracer simulates
the optical effects of all kinds of surface-to-surface reflection and
refraction, excluding cases such as polarized light or fluorescence,
which are rare in practice. For diffuse reflection, the Lambertian model
adjusted by the Ashikhmin and Shirley formula is used (Section 4.5), while
the Cook-Torrance microfacet model is responsible for specular or glossy
reflection. Our path tracer also supports anisotropic material
described by the Ward model (Ward, 1992), which only modifies the
Beckmann distribution factor in the Cook-Torrance model (Cook & Torrance,
1982),</p>
<p style="text-align:center;"><img src="/img/gpt_part2/image74.png" alt="" /></p>
<p>where
<img src="/img/gpt_part2/image75.png" alt="" />
and
<img src="/img/gpt_part2/image76.png" alt="" />
correspond to “roughness” of the material
in x and y direction w.r.t. tangent space. Taking azimuth angle
<img src="/img/gpt_part2/image77.png" alt="" />
as argument, it is easy to see that when
<img src="/img/gpt_part2/image78.png" alt="" />
the distribution is completely
determined by
<img src="/img/gpt_part2/image75.png" alt="" />
and when
<img src="/img/gpt_part2/image79.png" alt="" />
the distribution is totally decided by
<img src="/img/gpt_part2/image76.png" alt="" />
.</p>
<p>For importance sampling the Ward BRDF, two uniform unit random variables
<img src="/img/gpt_part2/image80.png" alt="" />
and
<img src="/img/gpt_part2/image81.png" alt="" />
are generated and it is easy to solve
the equations for azimuth angle
<img src="/img/gpt_part2/image77.png" alt="" />
and altitude angle
<img src="/img/gpt_part2/image82.png" alt="" />
:</p>
<p style="text-align:center;"><img src="/img/gpt_part2/image83.png" alt="" /></p>
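<p>As an illustration, the commonly used closed form for sampling the anisotropic Ward/Beckmann lobe can be written as below. This is a sketch under the assumption that the solved equations take the standard form (azimuth from an elliptically stretched angle, altitude from the inverted exponential); the function name and the use of <code>atan2</code> to keep the azimuth in the correct quadrant are choices made for this example, and the first random number is assumed to lie in (0, 1].</p>

```python
import math


def sample_ward_half_angles(u1, u2, alpha_x, alpha_y):
    """Sample the half-vector angles (phi, theta) in tangent space.

    u1 in (0, 1] and u2 in [0, 1) are uniform random numbers;
    alpha_x and alpha_y are the roughness values along the tangent axes.
    """
    # tan(phi) = (alpha_y / alpha_x) * tan(2*pi*u2); atan2 keeps the quadrant.
    phi = math.atan2(alpha_y * math.sin(2.0 * math.pi * u2),
                     alpha_x * math.cos(2.0 * math.pi * u2))
    c, s = math.cos(phi), math.sin(phi)
    # tan^2(theta) = -ln(u1) / (cos^2(phi)/alpha_x^2 + sin^2(phi)/alpha_y^2)
    tan2_theta = -math.log(u1) / (c * c / alpha_x ** 2 + s * s / alpha_y ** 2)
    theta = math.atan(math.sqrt(tan2_theta))
    return phi, theta
```

<p>When both roughness values are equal, the azimuth reduces to a uniform angle and the altitude depends on the single roughness only, which is the isotropic case.</p>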
<p><img src="/img/gpt_part2/image84.png" alt="" /></p>
<center><small>
Figure 10 Isotropic & anisotropic specular
</small></center>
<p>For surface-to-surface refraction, the Cook-Torrance microfacet model
can be modified by recalculating the Jacobian for the transform
between half-vector and outgoing vector (Walter, Marschner,
Li, & Torrance, 2007), yielding</p>
<p><img src="/img/gpt_part2/image85.png" alt="" />
, where D is the microfacet distribution
function (Beckmann in our implementation), G is the shadowing term (a
numerical approximation to the Smith shadowing function in our
implementation, in which the roughness
coefficient<img src="/img/gpt_part2/image86.png" alt="" />
is substituted by 1/
<img src="/img/gpt_part2/image87.png" alt="" />
for anisotropic material) and J =
<img src="/img/gpt_part2/image88.png" alt="" />
, where
<img src="/img/gpt_part2/image89.png" alt="" />
are the indices of refraction of the two
media, is the absolute value of the Jacobian determinant. The half-vector
in refraction indicates the normal of the sampled microfacet,
which can be obtained by calculating
<img src="/img/gpt_part2/image90.png" alt="" />
if the BSDF needs to be evaluated for
arbitrary incoming and outgoing directions, as in the case of multiple
importance sampling.</p>
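<p>The sketch below shows one way to recover the refraction half-vector from a pair of directions, assuming the convention of Walter et al. (2007) in which it is the normalized negative sum of the two directions weighted by their refractive indices. The helper names and the convention that both directions point away from the surface are assumptions of this example.</p>

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)


def refract(i, n, eta_i, eta_o):
    """Refract direction i (pointing away from the surface) about normal n.

    Returns the transmitted direction, or None on total internal reflection.
    """
    cos_i = dot(i, n)
    eta = eta_i / eta_o
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)
    if sin2_t > 1.0:
        return None
    cos_t = math.sqrt(1.0 - sin2_t)
    return tuple((eta * cos_i - cos_t) * nk - eta * ik for ik, nk in zip(i, n))


def refraction_half_vector(i, o, eta_i, eta_o):
    """Microfacet normal that refracts i into o: -(eta_i*i + eta_o*o), normalized."""
    return normalize(tuple(-(eta_i * ik + eta_o * ok) for ik, ok in zip(i, o)))
```

<p>For a direction refracted about a known normal from air into glass, the recovered half-vector coincides with that normal, which is a convenient sanity check when evaluating the BSDF for multiple importance sampling.</p>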
<h2 id="102-volume-rendering">10.2 Volume Rendering</h2>
<p>The rendering techniques so far are based on the assumption that the
space between surfaces is vacuum, which is an effective simplification only
in ordinary cases with clear air. In phenomena like fog, smog and smoke,
noticeable scattering, absorption, and emission can happen between surfaces,
which affect the radiance reaching the viewer. In the presence of such
participating media, the integro-differential equation of transfer
(Chandrasekhar, 1960) gives the directional derivative of radiance at a
point in the medium, modeling the change of radiance in space.</p>
<p style="text-align:center;"><img src="/img/gpt_part2/image91.png" alt="" /></p>
<p>where
<img src="/img/gpt_part2/image62.png" alt="" />
is the point in space and
<img src="/img/gpt_part2/image92.png" alt="" />
is the direction of interest with
<img src="/img/gpt_part2/image93.png" alt="" />
being the measure of displacement along
the direction.
<img src="/img/gpt_part2/image94.png" alt="" />
is the attenuation coefficient
accounting for both absorption and out-scattering, while
<img src="/img/gpt_part2/image95.png" alt="" />
is the scattering coefficient
controlling the magnitude of in-scattering, which has the phase function
<img src="/img/gpt_part2/image66.png" alt="" />
to define the probability density of
in-scattering from each direction.
<img src="/img/gpt_part2/image96.png" alt="" />
and
<img src="/img/gpt_part2/image97.png" alt="" />
stand for media emission and incoming
radiance, respectively. For isotropic media,
<img src="/img/gpt_part2/image66.png" alt="" />
has a constant value of
<img src="/img/gpt_part2/image98.png" alt="" />
, which is intuitive as the integral of
differential angles in the sphere gives
<img src="/img/gpt_part2/image99.png" alt="" />
. For general media, there is a phase
function developed by Henyey and Greenstein
(1941) which provides a simple asymmetry parameter
<img src="/img/gpt_part2/image100.png" alt="" />
ranging from -1 to 1 to control the
“polarity” of the participating media.</p>
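<p>For reference, the Henyey-Greenstein phase function can be evaluated as in the sketch below. The sign convention is an assumption of this example: the angle is measured between the propagation direction before and after scattering, so a positive asymmetry parameter means forward scattering.</p>

```python
import math


def henyey_greenstein(cos_theta, g):
    """Henyey-Greenstein phase function, normalized over the unit sphere."""
    denom = 1.0 + g * g - 2.0 * g * cos_theta
    return (1.0 - g * g) / (4.0 * math.pi * denom ** 1.5)


def integrate_over_sphere(g, n=20000):
    """Midpoint-rule check that the phase function integrates to 1."""
    h = 2.0 / n
    total = sum(henyey_greenstein(-1.0 + (k + 0.5) * h, g) for k in range(n))
    return 2.0 * math.pi * total * h
```

<p>Setting the asymmetry parameter to 0 recovers the constant isotropic value, and any value in (-1, 1) keeps the integral over the sphere equal to 1.</p>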
<p>In practice, the integro-differential equation is solved by decomposing
it into parts and calculating their values separately, then using ray
marching to accumulate the values at sample points along the incoming
ray. The sample points are treated as differential segments with
constant coefficients. Such numerical integration can also be
estimated by Monte Carlo sampling. The ray parameter t can naturally
be used as the measure of displacement along the ray direction.
Depending on the targeted convergence rate or frame rate, the step size
can be set larger or smaller. Plain Monte Carlo sampling would randomly
pick an independent point in each segment for radiance estimation. However,
the light transfer integral is only one-dimensional and
often has a smooth integrand, so standard numerical integration may have an
edge over the Monte Carlo method. A stratified pattern (Pauly,
1999) that uses the same offset for every segment and randomizes the
offset for each new ray can be shown to have lower variance than
plain Monte Carlo. The p.d.f. for such samples is also very simple, which is a
constant equal to the step size
<img src="/img/gpt_part2/image101.png" alt="" />
.</p>
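<p>A minimal sketch of this sampling pattern is shown below, here used to estimate the optical depth along a ray. The function name and the callable density argument are hypothetical; the key point is that one random offset is drawn per ray and reused in every segment.</p>

```python
import math
import random


def march_optical_depth(sigma_t, t_max, n_steps, rng=random):
    """Estimate the integral of sigma_t(t) over [0, t_max] by ray marching.

    sigma_t: callable returning the attenuation coefficient at ray parameter t.
    One random offset is drawn per ray and shared by all segments.
    """
    dt = t_max / n_steps
    offset = rng.random()  # in [0, 1), same relative position in every segment
    total = 0.0
    for k in range(n_steps):
        total += sigma_t((k + offset) * dt)
    return total * dt
```

<p>The transmittance of the ray then follows as the exponential of the negated optical depth. For a constant coefficient the estimate is exact regardless of the offset, and for a smooth density the error shrinks with the step size.</p>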
<p>In the simplest case of volume rendering, where the participating medium
has homogeneous attributes, both emission and attenuation factors can be
directly evaluated by
<img src="/img/gpt_part2/image102.png" alt="" />
(where s is the length of the ray
segment of interest), which is the direct result of solving the
differential equation
<img src="/img/gpt_part2/image103.png" alt="" />
, also known as Beer’s law. If the
distribution of media density or other properties admits an analytical
solution of this differential equation (and the solution is an
elementary function), it can be evaluated directly without
sampling. Sampling techniques are used where an analytical solution is
impossible, unknown, or too complex, or where the
distribution is a customized discrete data set.</p>
<p>As mentioned before, the attenuation term and the augmentation term can
be treated separately in computation, which maps well to the
accumulation of mask and intermediate color in our implementation,
expressed by</p>
<p style="text-align:center;"><img src="/img/gpt_part2/image104.png" alt="" />,</p>
<p>where
<img src="/img/gpt_part2/image105.png" alt="" />
is multiplied to the mask and
<img src="/img/gpt_part2/image106.png" alt="" />
is estimated with samples and added to
the intermediate color. The transmittance coefficient T can either be
analytically evaluated or estimated with samples as mentioned in
the previous paragraph.</p>
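<p>The separation into a mask update and a color update can be sketched for a homogeneous segment as follows. This example assumes a constant source term per unit length (standing in for emission plus in-scattering) and a nonzero attenuation coefficient; the function and parameter names are hypothetical.</p>

```python
import math


def march_segment(mask, color, sigma_t, source, s):
    """Advance the path state through a homogeneous segment of length s.

    mask:   accumulated throughput of the path so far.
    color:  accumulated intermediate color.
    source: constant radiance added per unit length inside the medium.
    """
    transmittance = math.exp(-sigma_t * s)  # Beer's law
    # Closed-form integral of transmittance * source over the segment.
    augmentation = source * (1.0 - transmittance) / sigma_t
    color += mask * augmentation  # augmentation term goes to the color
    mask *= transmittance         # attenuation term goes to the mask
    return mask, color
```

<p>Because the color is updated before the mask, marching two consecutive segments gives the same state as marching their union in one step.</p>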
<p>The sampling estimation of the augmentation term
<img src="/img/gpt_part2/image106.png" alt="" />
can use the same stratified pattern
mentioned before to reduce the variance. However, inside
<img src="/img/gpt_part2/image107.png" alt="" />
, there is another integral which accounts
for in-scattering from all directions, the estimation of which is
another non-trivial task. For simplicity, we only consider single
scattering from direct lighting. A light sample can be taken as in next
event estimation and a shadow ray is shot from the current ray segment’s
sample point to the light to detect visibility. Note that this method
neglects the contribution from indirect lighting to the in-scattering,
which is often too weak to affect the rendering quality.</p>
<p>It is worth mentioning that it is also possible to use Metropolis light
transport for sampling the in-scattering. A random number can be stored
here for mutation in every frame so that directions with large
contribution can be easily discovered and focused on, which is
especially suitable for very anisotropic participating media.</p>
<p>Two sample images are shown in Figure 11 to exhibit the visual effect of
volume rendering. The first image shows strong scattering and absorption
of light in a room with dense homogeneous smog. The smog has a
Henyey-Greenstein asymmetry parameter of 0.7, indicating that incident
light is primarily scattered forward, as can be seen in the narrow shape
of the illuminated cone under the area light. The second image
illustrates fog with white emission and exponentially attenuated density
in the vertical direction, which is a miniature atmosphere in a
box.</p>
<p><img src="/img/gpt_part2/image108.jpg" alt="" /></p>
<center><small>
Figure 11 Left: Homogeneous smog with strong absorption and forward
scattering Right: Fog with exponential density
</small></center>
<h2 id="103-subsurface-scattering">10.3 Subsurface Scattering</h2>
<p>Scattering and absorption can happen inside objects as well. The reason
a BSDF is usually sufficient to estimate the radiance leaving a material is
that many materials are either metallic (reflecting most of the
energy at the surface) or dielectric and either too opaque or too
transparent to exhibit any obvious scattering effect. For materials such as
jade, milk and skin, whose albedo is too high for them to be considered
transparent but not high enough for them to be considered opaque, the
effect of scattering inside cannot be ignored, and a BSDF cannot give a
sufficient approximation of the surface radiance. Instead, a BSSRDF
(bidirectional scattering-surface reflectance distribution function)
is used to include the contribution to the outgoing radiance of the
point of interest from incoming radiance on other surface points. A
6-dimensional function</p>
<p><img src="/img/gpt_part2/image109.png" alt="" />
(<img src="/img/gpt_part2/image110.png" alt="" />
is known, the other 3 variables are all
2D) can be used to describe the sum of all radiance scattered from
<img src="/img/gpt_part2/image111.png" alt="" />
to
<img src="/img/gpt_part2/image110.png" alt="" />
in all possible paths. Again, to
calculate the outgoing radiance of the point of interest, one must
integrate the contribution from points all over the object surface,
which can be estimated by randomly sampling surface points in a Monte
Carlo process. However, the BSSRDF itself is largely unknown due to the
complexity of the multiple scattering problem. Also, nearby points
may contribute most of the radiance, which implies that
choosing surface points indiscriminately leads to an intolerably low
convergence rate.</p>
<p>To provide a practical approximation of general subsurface scattering,
Jensen et al. introduced the dipole diffusion model (2001), which
decomposes the BSSRDF into a diffusion term and a single scattering term
as a simplification. Observing that the radiance distribution becomes
nearly isotropic after thousands of scatterings in material with very
high albedo like milk, they proposed a diffusion model that transforms
the incoming ray into a dipole source and uses the radial diffusion
profile of the material to compute the outgoing radiance. The diffusion
term has an exponential falloff with respect to the distance from the
incidence point, which provides an effective p.d.f. for importance
sampling. Note that the key idea of this model is to interpolate between
2 extreme cases - pure single scattering and pure diffusion – for
general material, which turns out to be an insufficient approximation
when highly physically authentic pictures are required.</p>
<p>In our implementation, we only consider the contribution of the single
scattering term as a demonstration of the idea of subsurface scattering.
The program can be easily extended with an additional module for the
diffusion term estimation. Since single scattering only happens when the
refraction rays of
<img src="/img/gpt_part2/image112.png" alt="" />
and
<img src="/img/gpt_part2/image113.png" alt="" />
meet inside the material, the BSSRDF is not
directly evaluated by taking a surface sample. Instead, after the ray
intersects a surface with a BSSRDF component, a random distance is
generated by
<img src="/img/gpt_part2/image114.png" alt="" />
, where
<img src="/img/gpt_part2/image115.png" alt="" />
is the unit uniform random variable and
<img src="/img/gpt_part2/image116.png" alt="" />
is the reduced scattering coefficient
corresponding to the tendency of forward scattering. The intersection
point is then moved by this distance along the ray direction to become
the point of single scattering, and the phase function p is used to
importance sample the scattering direction, which becomes the direction of
the next ray. Note that this method is only suitable for rendering
translucent objects with low to moderate albedo like jade. Objects with
high albedo still require at least the dipole model to render a
reasonable appearance.</p>
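<p>The distance sampling step can be sketched as below, assuming the generated distance follows the usual exponential form (negative logarithm of the random number divided by the coefficient). The function names are hypothetical, and the random number is taken from (0, 1] so that the logarithm is always defined.</p>

```python
import math
import random


def sample_scatter_distance(xi, sigma):
    """Exponentially distributed free-flight distance; xi must be in (0, 1]."""
    return -math.log(xi) / sigma


def scatter_point(origin, direction, sigma, rng=random):
    """Move the surface hit point along the ray to the single-scattering point."""
    d = sample_scatter_distance(1.0 - rng.random(), sigma)
    return tuple(o + d * w for o, w in zip(origin, direction))
```

<p>The mean of the sampled distances is the reciprocal of the coefficient, so a larger coefficient keeps the scattering point closer to the surface.</p>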
<p>A pair of sample images is shown in Figure 12 to show the Stanford
bunny with a deep jade color rendered as translucent and opaque
material. The result of the translucent material shows the effect of
subsurface single scattering. Note that thin parts of the object like
the ears have a more acute response to the change of albedo.</p>
<p><img src="/img/gpt_part2/image117.jpg" alt="" /></p>
<center><small>
Figure 12 Left: Bunny with subsurface scattering Right: Bunny without
subsurface scattering
</small></center>
<h2 id="104-environment-map-and-material-texture">10.4 Environment Map and Material Texture</h2>
<p>As mentioned in the introduction, heavy use of textures is common
in the rasterization-based graphics of most mainstream
video games. The PBR workflow, as the de facto industry standard, uses a
variety of textures to describe the varying material attributes across
the texture space. Apart from the basic diffuse color texture or albedo
map, a roughness map is used to define the shape of the distribution of
normals in the microfacet model: a roughness value of 0
indicates a perfectly smooth surface point which only reflects in the
mirror direction, and a roughness value of 1 indicates a surface
point with an almost equal amount of reflection in every direction, which
no longer looks glossy. Similarly, a metalness map is used to
define how metal-like each texel is. A metalness value
of 1 defines a surface point that only has specular reflection with some
absorption of specific wavelengths, which simulates the behavior of
metal, while a metalness value of 0 defines a completely dielectric
surface point that follows the Fresnel equation with a real index of
refraction. More commonly, a uniform white specular color is defined for
every opaque object to substitute the evaluation of
<img src="/img/gpt_part2/image32.png" alt="" />
in the Fresnel equation when the index of
refraction is not defined. Furthermore, a normal map is used to simulate
microscopic geometry for material with a complex surface texture, and
ambient occlusion maps are used to compensate for the lack of global
illumination in rasterization-based graphics. For
static objects, light maps with baked color bleeding or shadows
(which require static lights) can be used on surrounding surfaces like
walls or floors to make the rendering look more realistic. Since
it is generally infeasible to generate real-time samples in
rasterization-based graphics, an irradiance map and a
pre-filtered mipmap are usually precomputed from the environment map of the
scene to simulate diffuse and glossy reflection of indirect lighting,
respectively.</p>
<p>However, for path tracing, we only need the maps related to material
attributes in texture space, i.e. the albedo map, metalness map, and
roughness map, because we are using a global illumination algorithm. We
may also want to use environment maps as image-based lights, since lighting
from the outer environment like sky and sunlight is usually hard to
model with solid objects. Also, if the surrounding environment is
sufficiently far away, using an environment map avoids the ponderous task of
modeling all objects from the border of the bounding environment to the
point being shaded and accumulating all possible indirect lighting
between any two objects. One can simulate the intricate indirect
lighting effect by sampling the texel in the reflected ray direction if
nothing in the local setting (the models of interest) is hit, treating
the outer environment as an infinitely far sphere so that all rays can
be seen as being shot from the center of the sphere. The environment map
can either be stored as a cubemap or a spherical projection map. In our
implementation, we use spherical projection maps due to easier
computation and less chance of artefacts.</p>
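<p>Looking up a spherical projection map from a ray direction can be sketched as follows. The axis convention (y up, azimuth measured in the xz-plane) and the nearest-texel lookup are assumptions of this example; the environment map is represented as a plain list of rows.</p>

```python
import math


def direction_to_uv(d):
    """Map a unit direction to (u, v) in [0, 1] on a latitude-longitude map."""
    x, y, z = d
    u = (0.5 + math.atan2(z, x) / (2.0 * math.pi)) % 1.0
    v = math.acos(max(-1.0, min(1.0, y))) / math.pi  # 0 at the top, 1 at the bottom
    return u, v


def lookup_environment(env, d):
    """Nearest-texel lookup; env is a list of rows of texel values."""
    height, width = len(env), len(env[0])
    u, v = direction_to_uv(d)
    col = min(int(u * width), width - 1)
    row = min(int(v * height), height - 1)
    return env[row][col]
```

<p>Only the direction enters the lookup, which reflects the assumption that the environment is infinitely far away.</p>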
<p>Two sample images are shown in Figure 13 to demonstrate the effect of
the environment map as image-based lighting and of material textures as
surface attribute control. The second image simply provides albedo,
roughness and metalness maps to transform the diffuse back wall of the
original Cornell box into a realistic scratched metal.</p>
<p><img src="/img/gpt_part2/image118.jpg" alt="" /></p>
<p><img src="/img/gpt_part2/image119.jpg" alt="" /></p>
<center><small>
Figure 13 Upper: Armadillo under environment lighting Lower: Cornell
Box with scratched metal wall
</small></center>
Fri, 30 Jun 2017 00:00:00 -0600
http://dqlin.xyz/tech/2017/06/30/10_fx/
http://dqlin.xyz/tech/2017/06/30/10_fx/tech[GAPT Series - 9] Sampling Algorithms (Cont.)<p><em>This post corrects some misconceptions in the former section [GAPT Series - 3]
Path Tracing Algorithm and introduces some new rendering methods.</em></p>
<!--more-->
<p>A key feature of path tracing that differentiates it from normal ray
tracing is that it is a stochastic process (provided that the random
numbers are truly random) instead of a deterministic process. Normal path
tracing depends on the Monte Carlo algorithm, whose result gradually
converges to the ground truth as the number of samples increases.
Theoretically, one can uniformly sample all paths to converge to the
correct result. However, given limited hardware resources and time
requirements, we need to adapt the brute-force Monte Carlo algorithm with
various strategies for different cases. The following sections introduce
the rendering equation we need to solve in path tracing and some of the
most popular sampling methods.</p>
<h2 id="91-bsdf-and-importance-sampling">9.1 BSDF and Importance Sampling</h2>
<p>Importance sampling is an effective method for solving the rendering
equation with a high convergence rate. Basically, importance sampling
chooses samples according to a designed probability density function and
divides the sample value by the p.d.f. to return the result. If the
designed p.d.f. is proportional to the sample values, the variance is very
low. Since the sample value is determined by
<img src="/img/gpt_part2/image20.png" alt="" />
, both irradiance and BSDF decide the
p.d.f.
<img src="/img/gpt_part2/image21.png" alt="" />
of the surface sample. Importance
sampling the BSDF is an easier task than importance sampling the
spatial distribution of incoming radiance. As long as a BSDF formula
(Lambertian, Cook-Torrance, Oren-Nayar, etc.) and the corresponding surface
characteristics are provided, one can calculate
<img src="/img/gpt_part2/image22.png" alt="" />
as the result. However, the distribution of
<img src="/img/gpt_part2/image23.png" alt="" />
is harder to calculate in the general case.
For indirect lighting on the surface point, it is impossible to know the
distribution of the incoming radiance in advance, which is a
chicken-and-egg problem. For direct lighting where
<img src="/img/gpt_part2/image24.png" alt="" />
, different methods can be applied to
find the p.d.f. With a simplified lighting model like a point light or
an area light, the distribution of incoming radiance is rather explicit.
However, when image-based lighting is used (e.g. an environment map or
sky box), advanced techniques are required to perform importance
sampling efficiently. The following sections all focus on the case
of explicit lighting models, since it is easier to exemplify how
to utilize the p.d.f. of incoming radiance. Discussion of image-based
lighting will be continued in the next chapter. Combining the effect
of the two p.d.f.s requires sampling not only at the surface point but
also on the lighting model or lighting image; a technique for this, called
multiple importance sampling, will be introduced in Section 9.3.</p>
<h2 id="92-next-event-estimation">9.2 Next Event Estimation</h2>
<p>For direct lighting with an explicit model, the lighting computation can
be done directly without waiting for a ray to hit the lights, which is
called next event estimation. Literally, it accounts for the contribution
of what may happen in the next iteration. For an ideal diffuse surface,
whose BSDF is uniform over directions (usually expressed by the Lambertian
model), this task is very simple. One only needs to sample a point on one
of the lights. Take a diffuse area light (where the emission of a point is
uniform in all directions) as an example: if the emission across the
emitting surface is the same, one only needs to uniformly sample the shape
of the light. Otherwise, there is usually an existing intensity
distribution to use. After that, to convert the area measure of the p.d.f.
to the solid angle measure, from the formula
<img src="/img/gpt_part2/image25.png" alt="" />
(Veach, 1997) we can derive
<img src="/img/gpt_part2/image26.png" alt="" />
, which can be used as the p.d.f for
incoming radiance. Given that Lambertian surface has a BSDF of
<img src="/img/gpt_part2/image27.png" alt="" />
, the final color can be expressed in the
simple formula
<img src="/img/gpt_part2/image28.png" alt="" />
.</p>
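<p>The conversion and the resulting estimate for a Lambertian surface can be sketched as below. This assumes the standard change of measure (area p.d.f. multiplied by squared distance and divided by the cosine at the light) and a light sampled uniformly over its area; the function names, the scalar albedo and emission, and the externally supplied visibility flag are simplifications for the example.</p>

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pdf_area_to_solid_angle(pdf_area, x, y, n_light):
    """Convert an area-measure p.d.f. at light point y to solid angle at x."""
    d = tuple(yi - xi for xi, yi in zip(x, y))
    dist2 = dot(d, d)
    w = tuple(di / math.sqrt(dist2) for di in d)  # unit direction from x to y
    cos_light = -dot(w, n_light)
    if cos_light <= 0.0:
        return 0.0, w  # light point faces away from x
    return pdf_area * dist2 / cos_light, w


def nee_lambert(albedo, emission, x, n, y, n_light, light_area, visible=True):
    """Direct lighting estimate at x for one uniformly sampled light point y."""
    pdf_w, w = pdf_area_to_solid_angle(1.0 / light_area, x, y, n_light)
    cos_surface = dot(n, w)
    if not visible or pdf_w == 0.0 or cos_surface <= 0.0:
        return 0.0
    return (albedo / math.pi) * emission * cos_surface / pdf_w
```

<p>The <code>visible</code> flag stands in for the shadow ray test between the shaded point and the sampled light point.</p>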
<p>For diffuse area lights of simple geometric shape like a triangle, uniform
sampling can be trivially done by using barycentric coordinates. Apart
from light sampling, we also need to shoot a shadow ray from the surface
point being shaded to the sampled light point. Since the ray-triangle
intersection test is a bottleneck in ray tracing for reasonably complex
scenes, this can almost halve the frame rate. However, the
benefit of next event estimation is well worth its cost.
Figure 5 is an example picture comparing the convergence rate of diffuse
reflection of two balls under highlight with and without next event
estimation, where 16 samples are taken for each pixel for both sampling
methods. Although the frame rate for rendering with next event
estimation is only 60% of that without next event estimation, the noise
level of the former is dramatically lower than that of the latter.</p>
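<p>Uniform sampling of a triangle with barycentric coordinates can be sketched as follows. The square-root warp used here is one common way to obtain a uniform distribution and is an assumption of this example rather than necessarily the mapping used in the implementation; the corresponding area-measure p.d.f. is the reciprocal of the triangle area.</p>

```python
import math
import random


def sample_triangle(a, b, c, u, v):
    """Map two uniform random numbers in [0, 1) to a uniform point on triangle abc."""
    su = math.sqrt(u)
    b0 = 1.0 - su
    b1 = su * (1.0 - v)
    b2 = su * v
    return tuple(b0 * ai + b1 * bi + b2 * ci for ai, bi, ci in zip(a, b, c))


def mean_sample(a, b, c, n, rng=random):
    """Average of n samples; approaches the centroid for a uniform distribution."""
    sums = [0.0] * len(a)
    for _ in range(n):
        p = sample_triangle(a, b, c, rng.random(), rng.random())
        sums = [s + pk for s, pk in zip(sums, p)]
    return tuple(s / n for s in sums)
```

<p>Averaging many samples should approach the centroid of the triangle, which gives a quick check that the distribution is not skewed toward one vertex.</p>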
<p><img src="/img/gpt_part2/image29.jpg" alt="" /></p>
<center><small>
Figure 5. Comparison between same scene of high dynamic range with and
without next event estimation
</small></center>
<p>However, this example only shows the case where the shaded surface has a
uniform BSDF. For surfaces with a general non-uniform BSDF, like a glossy
surface in the Cook-Torrance microfacet model, sampling the light is
inefficient for variance reduction. An extreme case is perfect
mirror reflection, whose BSDF is a delta function. Since only the
mirrored direction of the view vector w.r.t. the surface normal
contributes to the result, a random point sampled on the light
has zero probability of contributing. We would naturally want to
sample according to the BSDF. The general case is a glossy surface whose
BSDF values concentrate in a moderately small range while the lighting
model occupies a moderately large portion of the hemisphere around the
surface being shaded. As a rescue for general cases, multiple importance
sampling will be introduced later. However, if we want a balance
between quality and speed, next event estimation can be mixed with
direct path tracing for different BSDF components. Especially for
surface material with only diffuse and perfectly specular
reflection, doing next event estimation with multiple importance
sampling would be redundant. Instead, if the BSDF component in the current
iteration is diffuse, we set a flag to avoid counting the
contribution again if we hit a light in the next iteration. Otherwise, the
flag is false, allowing the radiance of the light hit by the
main path to be accounted for. To determine which BSDF component to
sample, we use a “Fresnel switch”, which will be introduced in the
next section.</p>
<h2 id="93-fresnel-switch-and-the-russian-roulette">9.3 Fresnel Switch and <a href="#three_five">The Russian Roulette</a></h2>
<p>Physically, there are only reflection and refraction when light as an
electromagnetic wave interacts with a surface, the ratio of which is
determined by the media’s refractive indices on the two sides of the
surface and the incidence angle of the light. The original formula is
actually different for the s and p polarization components of the light
ray. However, in computer graphics, we normally treat the light as
non-polarized. Under this assumption, Schlick’s approximation (Schlick,
1994) can be used to calculate the Fresnel factor:
<img src="/img/gpt_part2/image30.png" alt="" />
. However, in the standard PBR workflow,
the refractive index is usually only provided for translucent objects.
Although we can look up the refractive index of many kinds of
material, for metals and semiconductors the refractive index n is a
complex number, which complicates the calculation. Since the refractive
index of a metal indicates an absorption of the color without
transmission, a specular color is used to approximate the reflection
intensity in the normal direction. In practice, there is usually an albedo
map and a metalness map for metallic objects. The metalness of a point
determines the ratio in the interpolation between white and the albedo
color. In the case of metalness equal to 1, the albedo color becomes
the specular color in reflection, as what we refer to as the color of a
metal actually indicates how different wavelengths of light are absorbed
rather than transmitted or scattered as in the case of dielectric
material.</p>
<p>Briefly, if there is an explicit definition of index of refraction, we
will use that for
<img src="/img/gpt_part2/image31.png" alt="" />
calculation of the material. Otherwise, following the normal PBR workflow,
we interpolate between the white specular color, which means complete
reflection of light, and the albedo color provided in the material map or scene
description file, using the metalness of the material as the interpolation
weight. However, since
dielectric materials also absorb light to some degree, the specular
color can be attenuated by a factor less than 1 to give a more
realistic appearance.</p>
<p>After
<img src="/img/gpt_part2/image32.png" alt="" />
is calculated, we can calculate the
Fresnel factor by substituting the dot product of the view vector and the
surface normal into Schlick’s formula. Notice that the formula contains a fifth
power, which is faster to compute by repeated multiplication than by
calling the pow function of the C++ or CUDA library.
The Fresnel factor R is the ratio of reflection. In
path tracing, it is the probability of choosing the specular
BRDF to sample the next direction. Complementarily, T = 1 – R
is the probability of transmission.</p>
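<p>A minimal sketch of the Fresnel evaluation described above, assuming a scalar reflectance (for RGB it would be applied per channel). The function names <code>f0FromIor</code> and <code>schlickFresnel</code> are hypothetical and not taken from the project source; the normal-incidence reflectance uses the usual form that accompanies Schlick's approximation, and the fifth power is expanded into multiplications as suggested.</p>

```cpp
#include <cassert>

// Reflectance at normal incidence from the refractive indices on the two
// sides of the surface (the usual companion of Schlick's approximation).
inline float f0FromIor(float n1, float n2) {
    assert(n1 + n2 > 0.0f);
    const float r = (n1 - n2) / (n1 + n2);
    return r * r;
}

// Schlick's approximation: R = F0 + (1 - F0) * (1 - cosTheta)^5.
// cosTheta is the dot product of the view vector and the surface normal.
inline float schlickFresnel(float f0, float cosTheta) {
    assert(cosTheta >= 0.0f && cosTheta <= 1.0f);
    const float m  = 1.0f - cosTheta;
    const float m2 = m * m;
    const float m5 = m2 * m2 * m;   // fifth power without calling pow()
    return f0 + (1.0f - f0) * m5;
}
```

<p>With these helpers, R = <code>schlickFresnel(f0, cosTheta)</code> is the probability of sampling the specular component and T = 1 – R the probability of transmission.</p>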
<p>In our workflow, diffuse reflection is also modeled as a transmission
that is immediately scattered back, uniformly in all directions, by
particles beneath the surface. This is usually modeled by a
Lambertian BSDF. However, to be more physically realistic, we can refer
to Ashikhmin and Shirley’s model (2000), which models the surface as one
glossy layer above one diffuse layer with an infinitesimal
distance between them. The back scattering in realistic diffuse reflection
happens at the diffuse layer as a Lambertian process, after which the
reflected ray is attenuated again as it is transmitted across the glossy
surface, implying a smaller contribution from the next ray in near-tangent
directions. For energy conservation, Ashikhmin and Shirley also include
a scaling constant
<img src="/img/gpt_part2/image33.png" alt="" />
in the formula. Therefore, the complete
BSDF is:</p>
<p><img src="/img/gpt_part2/image34.png" alt="" />
.</p>
<p>Russian roulette is used to choose the BSDF component given the Fresnel
factor and other related material attributes. If the generated
uniform random number in [0, 1) is above the Fresnel threshold R, the next
ray will be a transmission or a diffuse reflection. In that case, the
“translucency” attribute of the material is used as a second threshold
to decide between transmission and diffuse reflection. This is an
approximation of general subsurface scattering, which will be discussed
in the next chapter. The Fresnel switch guarantees that we can preview a
statistically correct result in real time. However, the resulting thread
divergence implies a severe time penalty on the GPU. Suggestions are
given in the SIMD optimization analysis in Chapter 6.</p>
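<p>The selection logic can be sketched as follows. This is only an illustration under assumptions: the enum and function names are hypothetical, two independent random numbers are used, and a value below the translucency threshold is assumed to mean transmission (the post does not state which side of the threshold maps to which component).</p>

```cpp
#include <cassert>

enum class BsdfComponent { Specular, Transmission, Diffuse };

// u1, u2: independent uniform random numbers in [0, 1).
// fresnelR: reflection ratio R from Schlick's formula.
// translucency: material attribute in [0, 1], used as the second threshold.
inline BsdfComponent chooseComponent(float u1, float u2,
                                     float fresnelR, float translucency) {
    assert(fresnelR >= 0.0f && fresnelR <= 1.0f);
    assert(translucency >= 0.0f && translucency <= 1.0f);
    if (u1 < fresnelR) {
        return BsdfComponent::Specular;            // chosen with probability R
    }
    // The remaining probability T = 1 - R is split by the translucency
    // attribute (assumption: below the threshold means transmission).
    return (u2 < translucency) ? BsdfComponent::Transmission
                               : BsdfComponent::Diffuse;
}
```

<p>On a GPU, neighbouring threads that take different branches here are what causes the divergence mentioned above.</p>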
<p>Russian roulette is also responsible for thread termination. Since
no surface can amplify the intensity of incoming light, each RGB
component of the mask
(<img src="/img/gpt_part2/image35.png" alt="" />
) is always less than or equal to 1. The
intensity of the mask (the value of the largest RGB component,
i.e. the V value in the HSV decomposition of the color) is then used as the
threshold to terminate paths. To be statistically correct, the mask of a
surviving path is divided by the threshold value after the Russian roulette test.
This is an intuitive process: if the reflectance of the surface is
weak, early termination with value-compensated masks is equivalent in
expectation to multiple iterations with normal masks. Such termination greatly
speeds up rendering without introducing bias.</p>
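<p>A minimal sketch of this termination test, assuming the mask is stored as three floats. The names <code>Rgb</code>, <code>RouletteResult</code> and <code>russianRoulette</code> are hypothetical, not the project's own.</p>

```cpp
#include <algorithm>
#include <cassert>

struct Rgb { float r, g, b; };
struct RouletteResult { bool alive; Rgb mask; };

// mask: accumulated throughput of the path, every component <= 1.
// u: uniform random number in [0, 1).
inline RouletteResult russianRoulette(Rgb mask, float u) {
    // Survival probability = largest RGB component (the V of HSV).
    const float threshold = std::max(mask.r, std::max(mask.g, mask.b));
    assert(threshold <= 1.0f);
    if (threshold <= 0.0f || u >= threshold) {
        return { false, Rgb{ 0.0f, 0.0f, 0.0f } };   // path terminated
    }
    // Compensate the surviving path so the expected value is unchanged.
    return { true, Rgb{ mask.r / threshold, mask.g / threshold, mask.b / threshold } };
}
```

<p>A path with mask (0.5, 0.25, 0.125) survives half of the time, and a survivor continues with mask (1, 0.5, 0.25).</p>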
<p>For generating photorealistic results, it is possible to use Russian
roulette alone to determine termination, without setting a maximum
depth. However, considering the extreme case where the camera is inside
an enclosed room and all surfaces are perfectly specular and reflect all
light, there is still a need to set a maximum depth.</p>
<h2 id="94-multiple-importance-sampling">9.4 Multiple Importance Sampling</h2>
<p>Returning to our question of next event estimation or direct lighting
computation for a general surface BSDF, the technique called multiple
importance sampling was introduced by Eric Veach in his 1997 PhD
dissertation. Basically, it uses a simple strategy to provide a highly
efficient, low-variance Monte Carlo estimate of a multi-factor integral
when two or more sampling distributions are used,
usually in different sampling domains. Given an integral
<img src="/img/gpt_part2/image36.png" alt="" />
with two available sampling distributions
<img src="/img/gpt_part2/image37.png" alt="" />
and
<img src="/img/gpt_part2/image38.png" alt="" />
, the Monte
Carlo estimator is given by
<img src="/img/gpt_part2/image39.png" alt="" />
, where
<img src="/img/gpt_part2/image40.png" alt="" />
and
<img src="/img/gpt_part2/image41.png" alt="" />
are the numbers of samples taken from each
distribution and
<img src="/img/gpt_part2/image42.png" alt="" />
and
<img src="/img/gpt_part2/image43.png" alt="" />
are specially chosen weighting functions,
which must guarantee that the expected value of the estimator equals
the integral
<img src="/img/gpt_part2/image36.png" alt="" />
. As expected, the weighting functions can
be chosen by heuristics that reduce the variance.</p>
<p>Veach also offers two heuristic weighting functions: the balance heuristic
and the power heuristic. In the general form
<img src="/img/gpt_part2/image44.png" alt="" />
, where k is any particular sampling
method, the balance heuristic always takes
<img src="/img/gpt_part2/image45.png" alt="" />
= 1, while the power heuristic gives the
freedom to choose any exponent
<img src="/img/gpt_part2/image45.png" alt="" />
. Veach then uses numerical tests to
determine that
<img src="/img/gpt_part2/image46.png" alt="" />
is a good value for most cases.</p>
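<p>For two sampling strategies (for example light sampling and BSDF sampling), the weights can be sketched as below. The function names are hypothetical, and the exponent 2 in <code>powerHeuristic</code> is the value recommended by Veach; the sample counts are included so the sketch matches the general form with unequal numbers of samples.</p>

```cpp
#include <cassert>
#include <cmath>

// Weight of a sample drawn from strategy f when strategy g could also have
// generated it. nf, ng: sample counts; pdfF, pdfG: densities of that same
// sample under each strategy. beta = 1 gives the balance heuristic.
inline float misWeight(int nf, float pdfF, int ng, float pdfG, float beta) {
    const float f = std::pow(nf * pdfF, beta);
    const float g = std::pow(ng * pdfG, beta);
    assert(f + g > 0.0f);
    return f / (f + g);
}

inline float balanceHeuristic(int nf, float pdfF, int ng, float pdfG) {
    return misWeight(nf, pdfF, ng, pdfG, 1.0f);
}

// Power heuristic with beta = 2, written with multiplications only.
inline float powerHeuristic(int nf, float pdfF, int ng, float pdfG) {
    const float f = nf * pdfF;
    const float g = ng * pdfG;
    assert(f + g > 0.0f);
    return (f * f) / (f * f + g * g);
}
```

<p>For any sample, the weights of the two strategies sum to 1, which is what keeps the combined estimator unbiased.</p>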
<p>In order to verify the practical result of multiple importance sampling
and the effect of the choice of weighting function, some test cases were
performed and sample images of the results are listed below.</p>
<p><img src="/img/gpt_part2/image47.png" alt="" /></p>
<center><small>
Figure 6 Left: multiple importance sampling Right: single importance
(light) sampling
</small></center>
<p>The first test case compares the result of multiple importance sampling
(both from BSDF and light) and single importance sampling (only from
BSDF). In each frame, one path is traced for each pixel. Although rendered
for only 10 frames, the MIS result clearly displays the shape of the
reflection of the strong area light on the rough mirror behind the boxes
and its further reflection on the side walls. In contrast, the
non-MIS result after 100 frames still shows a very noisy
reflected shape of the highlight. Note that for the
reflections on the floor and ceiling, which have a low roughness, non-MIS
at frame 100 still has a lower noise level than MIS at frame 10.
There, most of the contribution to the color comes from samples generated from
the BSDF, so the advantage of multiple importance sampling diminishes, which is
also the case when the BSDF is nearly uniform.</p>
<p><img src="/img/gpt_part2/image48.jpg" alt="" /></p>
<center><small>
Figure 7 Left: balance heuristic Middle: power heuristic Right:
ground truth
</small></center>
<p>Another test case compares the effect of the balance heuristic and the
power heuristic. The images in Figure 7 above show the result of
rendering the same Cornell box scene with a moderately rough back mirror
for 10 frames with both methods. Without inspecting the
images very carefully, it is nearly impossible to observe any difference
between the two
images. However, the noise level is indeed lower when using the power heuristic.
If one looks carefully at some dark regions of the picture,
such as the front face of the shorter box, one can observe that many noise
points are brighter in the image generated by the balance heuristic (since
both cases use the same seed for random number generation, it is
possible to compare noise points at the same pixel position). We
also offer a numerical analysis that compares each of
the two images with the ground truth (rendered by Metropolis
light transport in 80000 frames). Calculating the
histogram correlation coefficient with the ground truth for each image
shows that the image generated by the power heuristic has a value of 0.7517,
larger than the image generated by the balance heuristic, which has a value
of 0.7488. Although this is only a minor difference, it is consistent with the
power heuristic having lower variance.</p>
<h2 id="95-bi-directional-path-tracing">9.5 Bi-directional Path Tracing</h2>
<p>An important fact about the sampling methods we have discussed so far is
that the variance level depends on both the geometry of the emissive
surfaces or lights and the local geometry of the surrounding surfaces. For
brute-force path tracing, the rate of hitting the background (leaving the
scene) will be much larger than that of hitting a light if the total area of
the lights is too small or the lights are almost locally occluded. As a result,
only a few of the terminated rays carry the color of emission,
causing large variance. For importance sampling or multiple importance
sampling, shadow rays must be shot to the light. The expected variance level
can only be guaranteed when the chance of a miss is of the same order of
magnitude as the chance of a hit; otherwise, the variance level will
degenerate to that of brute-force path tracing. Thankfully, this problem can
be solved by also shooting a ray from the light and “connecting” the end
vertices of the eye path and the light path to calculate the contribution,
which is called bi-directional path tracing (Lafortune & Willems, 1993).</p>
<p>The core problem in bi-directional path tracing is the “connect”
process. Since the contribution of incoming radiance is sampled as a
point (end vertex of light path) in the area domain, we must convert
that to the solid angle domain to obtain the result. The
<img src="/img/gpt_part2/image49.png" alt="" />
term in the rendering equation can be
converted to an equivalent
<img src="/img/gpt_part2/image50.png" alt="" />
in the domain of all surface areas,
where
<img src="/img/gpt_part2/image51.png" alt="" />
is the visibility factor, which equals
1 if the connection path is not occluded and 0 otherwise, and
<img src="/img/gpt_part2/image52.png" alt="" />
is the geometry term.</p>
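<p>As an illustration, the standard geometry term between two surface points, |cos θ<sub>x</sub>| |cos θ<sub>y</sub>| / ‖x − y‖², can be sketched as follows. It is assumed here that this matches the geometry term in the equation above; <code>Vec3</code> and <code>geometryTerm</code> are hypothetical names, and the normals are assumed to be unit length.</p>

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

inline Vec3 operator-(const Vec3& a, const Vec3& b) {
    return { a.x - b.x, a.y - b.y, a.z - b.z };
}
inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// x, y: the two end vertices being connected; nx, ny: their unit normals.
inline float geometryTerm(const Vec3& x, const Vec3& nx,
                          const Vec3& y, const Vec3& ny) {
    const Vec3 d = y - x;                      // unnormalised connection vector
    const float dist2 = dot(d, d);
    assert(dist2 > 0.0f);
    // |n . d| equals |d| * |cos|, so dividing by dist2 twice accounts for the
    // two cosines (1/|d| each) and the inverse-square falloff (1/|d|^2).
    const float a = std::fabs(dot(nx, d));
    const float b = std::fabs(dot(ny, d));
    return (a * b) / (dist2 * dist2);
}
```

<p>The contribution of a connection is then multiplied by the visibility factor, so an occluded connection contributes nothing regardless of this value.</p>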
<p>Importantly, we connect not only the terminal end points of the light path
and the eye path, but the intermediate path vertices as well. However, since
each connection involves a ray-triangle intersection test, performance
is greatly affected: for path lengths of O(n), there will be
O(n<sup>2</sup>) intersection tests. In some situations, such as perfect mirror
reflection, we can skip the contribution of the light path, since it has
zero probability, to save computation time. It is worth
mentioning that for all combinations with a specific path length, the
contribution of each should be divided by the total number of
combinations to maintain energy conservation. For any combined path
length of n, there are n + 1 ways to combine an eye path and a light
path if we want to include the contribution of every kind of combination.
It is also possible to apply importance sampling here, as a specific path
combination can be weighted by its p.d.f. among all path combinations of
the same length. However, that requires additional space to store the
intermediate results and may not be good for GPU performance. Next event
estimation can also be applied here, so that a direct lighting component
exists for every combined path length less than or equal to (maximum eye
path length + 1), for which the denominator of the contribution should be
incremented by 1 to include this factor.</p>
<p>To test the effect of bi-directional path tracing, we simply
inverted the light in the original Cornell Box scene so that the light
faces the ceiling and can only reach other surfaces via
the small crevices around the rim of the light. The sample image in Figure 8
shows that, with both rendered for 100 frames, bi-directional path tracing
remarkably reduces the noise level compared with using only next event
estimation, which is well worth the reduction of the frame rate to 50%.</p>
<p><img src="/img/gpt_part2/image53.png" alt="" /></p>
<center><small>
Figure 8 Left: Bi-directional path tracing Right: Uni-directional
(eye) path tracing
</small></center>
<h2 id="96-metropolis-light-transport">9.6 Metropolis Light Transport</h2>
<p>Even with multiple importance sampling and bi-directional path tracing, it
is still difficult to maintain low variance in the integral estimate
for problems like bright indirect light, caustics caused by
light reflected from caustics, and light coming through long, narrow, and
tortuous corridors. The key problem of the previous sampling methods is that
they only consider local importance (light or surface BSDF) rather than
the importance of the whole path. In terms of global importance
sampling, the original Monte Carlo method is still a brute-force
solution. To our rescue, there is a rendering algorithm called
Metropolis Light Transport (MLT), adapted from the Metropolis–Hastings
sampling algorithm, which is a Markov Chain Monte Carlo (MCMC) method
(Veach, 1997). It has the nice feature that the probability of a path
being sampled is proportional to its contribution to the global integral of
the radiance toward the camera, and such paths can be explored locally by
designing a suitable mutation strategy. Basically, it proposes a new
perturbation or mutation of the current path in every iteration and accepts
the proposal with probability
<img src="/img/gpt_part2/image54.png" alt="" />
, where the
<img src="/img/gpt_part2/image55.png" alt="" />
and
<img src="/img/gpt_part2/image56.png" alt="" />
are the radiance values and
<img src="/img/gpt_part2/image57.png" alt="" />
and
<img src="/img/gpt_part2/image58.png" alt="" />
are tentative transition functions which
indicate the probability density of transforming from a state to another
state in the designed mutation strategy. In general cases,
<img src="/img/gpt_part2/image57.png" alt="" />
and
<img src="/img/gpt_part2/image58.png" alt="" />
are not equal. For example, given the
specific task of sampling caustics, we can define a mutation strategy
that only moves the path vertices on the specular surface. In return,
such a mutation can have a transition probability that follows the
local p.d.f.: if a point
<img src="/img/gpt_part2/image59.png" alt="" />
in a specular surface contributes more
highlight than another
<img src="/img/gpt_part2/image60.png" alt="" />
in the same surface as its BSDF
suggests, the probability of moving from
<img src="/img/gpt_part2/image61.png" alt="" />
to
<img src="/img/gpt_part2/image62.png" alt="" />
is greater than moving from
<img src="/img/gpt_part2/image62.png" alt="" />
to
<img src="/img/gpt_part2/image61.png" alt="" />
, giving
<img src="/img/gpt_part2/image63.png" alt="" />
<img src="/img/gpt_part2/image57.png" alt="" />
. However, different mutation strategies
need to be designed for different kinds of tasks in order to achieve the
highest rendering quality. For general tasks, we can ignore the
local p.d.f. in the mutation while still maintaining good performance. One
way of doing so is to store and mutate only the random numbers of every
sample generated along the path (selecting the camera ray, choosing the next ray
direction, picking points on an area light, determining the Russian roulette
value, etc.), which function as “global-local” perturbations of the
current path, as suggested by Pharr & Humphreys (2011).</p>
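<p>The acceptance step can be sketched with the standard Metropolis–Hastings rule, a = min(1, f(proposed) · T(proposed → current) / (f(current) · T(current → proposed))). The parameter names below are hypothetical and are not the symbols used in the equations above; for symmetric mutations, such as perturbing the stored random numbers, both transition densities can simply be passed as 1.</p>

```cpp
#include <algorithm>
#include <cassert>

// fCurrent, fProposed: scalar contributions of the current and proposed paths.
// tCurrentToProposed, tProposedToCurrent: tentative transition densities.
inline float acceptanceProbability(float fCurrent, float fProposed,
                                   float tCurrentToProposed,
                                   float tProposedToCurrent) {
    assert(fCurrent >= 0.0f && fProposed >= 0.0f);
    const float denom = fCurrent * tCurrentToProposed;
    if (denom <= 0.0f) {
        return 1.0f;   // current state contributes nothing: always move away
    }
    return std::min(1.0f, (fProposed * tProposedToCurrent) / denom);
}

// u: uniform random number in [0, 1).
inline bool acceptProposal(float acceptance, float u) {
    return u < acceptance;
}
```

<p>A proposal that halves the contribution is accepted half of the time when the mutation is symmetric, while any proposal that increases it is always accepted.</p>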
<p>Another issue for MLT is ergodicity (Veach, 1997). The MCMC process must
be able to traverse the whole path space without getting stuck in some
subspace. A solution is to set a probability for large
(global) mutations. Each iteration tests a random number against this
threshold, and a large mutation is made if the random number is lower.
For example, with a
threshold of 0.25, on average there is 1 large mutation and 3
small (local) mutations out of every 4 mutations. The local mutations are
sampled from an exponential distribution (Veach, 1997), implying a much
larger chance of a small movement from the original place while still allowing
moderately large local mutations. The global mutations are sampled
uniformly across the [0,1] interval.</p>
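<p>A sketch of mutating one stored random number under these rules. The step range 1/1024 to 1/64 and the wrap-around at the interval ends are assumptions borrowed from common primary-sample-space implementations, not values stated in this post, and <code>mutate</code> is a hypothetical name.</p>

```cpp
#include <cassert>
#include <cmath>

// u: current stored random number in [0, 1).
// rSelect, rStep, rSign: fresh uniform random numbers in [0, 1).
// largeStepProb: probability of a large (global) mutation, e.g. 0.25.
inline float mutate(float u, float rSelect, float rStep, float rSign,
                    float largeStepProb) {
    assert(u >= 0.0f && u < 1.0f);
    if (rSelect < largeStepProb) {
        return rStep;                          // large mutation: uniform resample
    }
    // Small mutation: exponentially distributed step size in (s1, s2], so
    // small moves are much more likely than moderately large ones.
    const float s1 = 1.0f / 1024.0f;           // assumed smallest step
    const float s2 = 1.0f / 64.0f;             // assumed largest step
    const float step = s2 * std::exp(-std::log(s2 / s1) * rStep);
    float v = (rSign < 0.5f) ? u + step : u - step;
    v -= std::floor(v);                        // wrap around into [0, 1)
    if (v >= 1.0f) v = 0.0f;                   // guard against rounding up to 1
    return v;
}
```

<p>Calling this once per stored random number of a path gives either a completely new path (large mutation) or a nearby one (small mutation).</p>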
<p>Still another problem of MLT is that the probability
density function p, which must correspond to the radiance contribution of the
path, is a scalar, so we must find a way to map the 3D radiance value
to 1D to determine the acceptance probability. A reasonable way of
doing this is to use the Y value in the XYZ color space, which reflects the
intensity perceived by the human eye. A simple conversion formula is
Y = 0.212671R + 0.715160G + 0.072169B. Note that no matter what
mapping formula is chosen, the result is still unbiased. Choosing a
mapping closer to the perception curve of the eye allows faster convergence and
better visual appearance for the same number of iterations.</p>
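<p>The mapping is a one-liner; <code>luminanceY</code> is a hypothetical name, and the coefficients are the ones given above.</p>

```cpp
// Maps an RGB radiance value to the scalar Y of the XYZ color space,
// which serves as the scalar importance of a path.
constexpr float luminanceY(float r, float g, float b) {
    return 0.212671f * r + 0.715160f * g + 0.072169f * b;
}
```

<p>The three coefficients sum to 1, so a white radiance of (1, 1, 1) maps to a scalar of 1.</p>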
<p>Another important issue of MLT is start-up bias
(Veach, 1997). Consider the estimator
<img src="/img/gpt_part2/image64.png" alt="" />
, where f is the radiance, p is the
mapped intensity and w is the lens filter function. The
Metropolis–Hastings algorithm guarantees that the sampling probability
<img src="/img/gpt_part2/image65.png" alt="" />
in equilibrium, giving the minimal
variance. However, we cannot sample from the distribution
<img src="/img/gpt_part2/image66.png" alt="" />
before equilibrium is reached; the chain only
gradually converges to the correct p.d.f., even though we use
<img src="/img/gpt_part2/image67.png" alt="" />
as the denominator, which is actually not
the real p.d.f. of the current sample. This causes incorrect color in the first
few samples, known as start-up bias. Depending on the requirements, if
the rendering task is not time-critical or not aimed at dynamic
scenes, we can simply discard the first few samples until the result becomes
reasonable. Using a higher large-mutation rate is also a remedy for
start-up bias. However, the tradeoff is that complex local paths become
harder to find: while the global appearance converges quickly,
local features like caustics and glossy reflections emerge very slowly.
In practice, if the scene contains mostly diffuse surfaces, the large-mutation
rate can be set higher. On the other hand, if intricate
optical effects are the emphasis of the rendering, the large-mutation rate
should be set much lower. In fact, designing specific mutation
strategies with customized transition functions may be better than just
using “global-local” mutations.</p>
<p>It turns out that MLT can be trivially mapped to the GPU by running an
independent task in each thread, as implemented in this project.
However, this method still has its defects. For a decent convergence rate,
the number of threads is set equal to the number of screen pixels
(so that, on average, every pixel can be shaded in every
frame), which implies high space complexity due to the random
numbers (stored as float) kept in graphics memory. If the screen width and
height are W and H, the maximum path length (or combined path length if
bi-directional path tracing is used) is k, and about 10 samples need to
be generated for each path segment (as in our implementation), there
will be 40 × k × W × H bytes in total for storing the MLT data. With a
1920 × 1080 resolution and a reasonable k = 30, this is about 2.49 GB
(2.32 GiB) of
data, exceeding the size of the graphics memory of many low- to mid-end
graphics cards nowadays. To solve this issue, some data compression can
be done, and it is also possible to let the threads collaborate in a more
efficient way, which means running fewer tasks while keeping the
same or lower variance. However, we will not study this topic in this
project, since the existing performance is adequate and there are
other important issues.</p>
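<p>The storage estimate can be checked with a few lines of arithmetic. The function name is hypothetical; the defaults of 10 samples per path segment and 4 bytes per float follow the description above. For 1920 × 1080 and k = 30 the total is 2,488,320,000 bytes, i.e. about 2.49 GB or 2.32 GiB.</p>

```cpp
#include <cstdint>

// Bytes needed to store the random numbers of one MLT task per pixel.
constexpr std::uint64_t mltSampleBytes(std::uint64_t width,
                                       std::uint64_t height,
                                       std::uint64_t maxPathLength,
                                       std::uint64_t samplesPerSegment = 10,
                                       std::uint64_t bytesPerSample = 4) {
    return width * height * maxPathLength * samplesPerSegment * bytesPerSample;
}

// 1920 x 1080 pixels, k = 30, 10 floats per segment.
static_assert(mltSampleBytes(1920, 1080, 30) == 2488320000ULL,
              "about 2.49 GB (2.32 GiB)");
```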
<p>Last but not least, the estimation of global radiance
<img src="/img/gpt_part2/image68.png" alt="" />
(||x|| indicates a measure of the
magnitude of intensity of the RGB color) at the beginning will affect
the overall luminance of the final result. From the formula
<img src="/img/gpt_part2/image64.png" alt="" />
,
<img src="/img/gpt_part2/image69.png" alt="" />
is chosen to be
<img src="/img/gpt_part2/image70.png" alt="" />
, yielding
<img src="/img/gpt_part2/image71.png" alt="" />
as the radiance contribution from one
sample,
where <img src="/img/gpt_part2/image72.png" alt="" />
functions as a scaling factor for all
samples. As a result, when the rendered result is required
to be physically accurate, more samples should be taken to estimate the
global radiance, although this causes considerable overhead.</p>
<p>To illustrate the advantage of MLT, some sample images from tests are
shown in Figure 9. In both cases, bi-directional path tracing with
multiple importance sampled direct lighting is used; the first row shows
the result at frames 1, 30, and 100 for normal Monte Carlo (MC) sampling,
whereas the second row shows the results at the same frames for MLT. The reader
will notice the difference in the way the two estimators accumulate
color. While maintaining a low noise level from the initial frame, the MLT
estimator exhibits start-up bias, as shown by the dim color of the part
of the ceiling directly illuminated by the light. At frame 100, the MLT
estimator has almost reached equilibrium, as can be seen by comparing
the color of the directly illuminated part of
the ceiling with that of the MC estimator. It is worth mentioning that such a
level of variance can
only be achieved after more than 1000 frames by the MC estimator, while the MLT
estimator is only slightly slower than the MC estimator. The slowdown is
mainly due to atomic memory writes, as each thread runs an
independent MLT task that can write color to any pixel on the screen.</p>
<p><img src="/img/gpt_part2/image73.png" alt="" /></p>
<center><small>
Figure 9 Upper: Monte Carlo Lower: Metropolis (Both use
bi-directional path tracing)
</small></center>
Fri, 30 Jun 2017 00:00:00 -0600
http://dqlin.xyz/tech/2017/06/30/09_samp/
http://dqlin.xyz/tech/2017/06/30/09_samp/tech[GAPT Series - 8] Spatial Acceleration Structure (Cont.)<p>Now we’ve added BVH as an alternative choice of SAS! It is necessary for
real-time ray tracing against dynamic scene geometry with complex moving
meshes, as it keeps the interior hierarchy of each mesh and only
updates the exterior hierarchy, which is usually much simpler to do.</p>
<!--more-->
<h2 id="81-triangle-clipping-in-kd-tree-construction">8.1 Triangle Clipping in Kd-Tree Construction</h2>
<p>An important issue in kd-tree construction is the need to clip
triangles that span the splitting plane at each level in order to
accelerate intersection. In their 2006 paper, Wald & Havran
suggested that on average, for a kd-node with N triangles, there are
O(<img src="/img/gpt_part2/image7.png" alt="" />) triangles spanning a splitting
plane. Normally, we check whether the two ends of the triangle’s bounding box
along the chosen splitting axis lie on different sides of the
splitting plane to decide whether to add the triangle to both child
nodes. However, the triangle may not actually overlap
one of the child nodes, even though it does in that one dimension. Such errors
accumulate down the tree, up to the point where the whole triangle lies outside
its node’s bounding box. As the kd-tree level increases,
<img src="/img/gpt_part2/image8.png" alt="" /> gets increasingly close to 1, which
means that at the leaf level one would expect many false positives in the
intersection test, unnecessarily increasing the intersection time.
In addition, adding spanning triangles to both sides unnecessarily
increases the construction workload, as one has to test the vertices
against the boundary of the bounding box at the current kd-tree level to
avoid choosing a splitting plane outside the bounding box.</p>
<p>Since the analysis above suggests that clipping triangles
gives an obvious boost to intersection performance, which is often a
bottleneck for path tracing, we offer some test cases to quantify the
performance improvement. After that, we explain how to clip
the triangles, which is a relatively simple task that does not require
third-party libraries.</p>
<p><img src="/img/gpt_part2/image9.jpg" alt="" /></p>
<center><small>
Table 1 Speedup of triangle clipping for different models
</small></center>
<p>From the table, we can observe that clipping triangles results in a
speedup of 3.5% to 9.1% depending on mesh complexity, which is not
drastic but clear enough to confirm the benefit of triangle
clipping.</p>
<p>We now illustrate how to clip an arbitrary triangle against a box with
an example in which a triangle is clipped to a pentagon. In Figure 2 below,
for each of the 6 faces of the cuboid, we find where the plane of the face
intersects the triangle edges whose two vertices lie on
different sides of that plane. Then, in the plane
of each face, we intersect the segment
between the two intersection points (if there are two) with the border of the
face, which may give
zero, one or two intersections. Every such intersection becomes a vertex of the
clipped polygon. If an end point of the segment itself lies inside or on the border
of the box’s face, it also becomes a vertex of the clipped polygon.
Notice that the process is trivially parallelizable over the 6 faces,
except for the memory write, i.e. expanding the current bounding box to
contain the new vertex. If the construction runs on the GPU,
parallelizing over the 6 faces can save a considerable amount of time in
the clipping stage of each level, which does not process very many
triangles and has the GPU to itself.</p>
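<p>For comparison, here is a serial sketch that produces the same kind of clipped polygon using Sutherland–Hodgman clipping against the six box planes in turn. This is a swapped-in standard technique, not the per-face procedure described above, and all type and function names are hypothetical.</p>

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };
struct Aabb { Vec3 lo, hi; };

inline float coord(const Vec3& v, int axis) {
    return axis == 0 ? v.x : (axis == 1 ? v.y : v.z);
}
inline Vec3 lerp(const Vec3& a, const Vec3& b, float t) {
    return { a.x + t * (b.x - a.x), a.y + t * (b.y - a.y), a.z + t * (b.z - a.z) };
}

// Clips a convex polygon against one axis-aligned plane.
// keepGreater == true keeps the half-space where coordinate >= pos.
inline std::vector<Vec3> clipPlane(const std::vector<Vec3>& poly, int axis,
                                   float pos, bool keepGreater) {
    std::vector<Vec3> out;
    const std::size_t n = poly.size();
    for (std::size_t i = 0; i < n; ++i) {
        const Vec3& a = poly[i];
        const Vec3& b = poly[(i + 1) % n];
        float da = coord(a, axis) - pos;
        float db = coord(b, axis) - pos;
        if (!keepGreater) { da = -da; db = -db; }
        const bool inA = da >= 0.0f;
        const bool inB = db >= 0.0f;
        if (inA) out.push_back(a);                         // keep inside vertex
        if (inA != inB) out.push_back(lerp(a, b, da / (da - db)));  // edge crosses plane
    }
    return out;
}

// Returns the part of the triangle inside the box as a convex polygon
// (empty if the triangle does not overlap the box).
inline std::vector<Vec3> clipTriangleToBox(const std::array<Vec3, 3>& tri,
                                           const Aabb& box) {
    assert(box.lo.x <= box.hi.x && box.lo.y <= box.hi.y && box.lo.z <= box.hi.z);
    std::vector<Vec3> poly(tri.begin(), tri.end());
    for (int axis = 0; axis < 3 && !poly.empty(); ++axis) {
        poly = clipPlane(poly, axis, coord(box.lo, axis), true);
        poly = clipPlane(poly, axis, coord(box.hi, axis), false);
    }
    return poly;
}
```

<p>The bounding box of the returned vertices is the tight box to store for the triangle in that kd-node. Clipping the triangle (0,0,0), (2,0,0), (0,2,0) against a box spanning 0 to 1.5 in x and y cuts off two corners and yields a pentagon, as in the example of Figure 2.</p>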
<p><img src="/img/gpt_part2/image10.png" alt="" /></p>
<center><small>
Figure 2 A triangle clipping example</small></center>
<h2 id="82-sah-based-bvh">8.2 SAH-based BVH</h2>
<p>As mentioned before, BVH is a crucial component for real-time ray
tracing against dynamic scene geometry. As with the kd-tree, the
probabilistic analysis applies to the choice of splitting plane at
each tree level, so a greedy choice of the
local optimum again gives the best available algorithm for optimizing
traversal cost. The only difference between kd-tree and BVH in terms of the
surface area heuristic is that a BVH is a hierarchy of objects while a kd-tree
is a hierarchy of subspaces. Therefore, the bounding boxes must be refitted
when dividing a node into child nodes, which turns out to be a
time-consuming bottleneck in construction.</p>
<p>Unlike kd-tree construction, which uses a dynamic array or vector to
store all triangle events, BVH construction needs to maintain a binary
search tree (BST) for each dimension. The shrinking side of the refitting
process requires us to search for and delete the events that have just
switched to the expanding side and add them to the BST of that side.
Initialization of the BSTs costs O(N log N). Each check of a splitting
position costs O(log N) and the whole process of best plane
determination costs O(N log N), leading to an O(N log N) total time
complexity. Even worse, maintaining a binary search tree for each of the three
dimensions implies a large constant factor of 3. For this reason, the
construction of an SAH-based BVH often causes serious latency in
complex cases. Fortunately, a simplification that divides
the space equally into a small number of buckets was described by Pharr &
Humphreys (2011) in their famous <em>Physically Based Rendering</em> book.
Partitions are then only considered at bucket boundaries, which gives
much more efficient construction while remaining as effective in
traversal as the original method. It is easy to see that the
whole construction only requires O(N log N) time since the number of buckets is
a constant. Meanwhile, a binned partition allows easier and more
efficient parallelization on the GPU. Due to time limits, GPU construction
of the BVH is not implemented in this project, as well-known parallel binned
BVH construction methods already exist.</p>
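<p>A compact sketch of evaluating the bucket boundaries along one axis. It is an illustration under assumptions rather than the project's code: the names are hypothetical, 12 buckets is an assumed default, and the cost is the unnormalised sum of surface area times primitive count on each side (the parent area and traversal constant are omitted because they do not change which boundary is cheapest).</p>

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

struct Aabb { float lo[3]; float hi[3]; };

inline Aabb emptyBox() {
    const float inf = std::numeric_limits<float>::max();
    return Aabb{ { inf, inf, inf }, { -inf, -inf, -inf } };
}
inline Aabb merge(const Aabb& a, const Aabb& b) {
    Aabb r;
    for (int k = 0; k < 3; ++k) {
        r.lo[k] = std::min(a.lo[k], b.lo[k]);
        r.hi[k] = std::max(a.hi[k], b.hi[k]);
    }
    return r;
}
inline float surfaceArea(const Aabb& b) {
    const float dx = b.hi[0] - b.lo[0], dy = b.hi[1] - b.lo[1], dz = b.hi[2] - b.lo[2];
    return 2.0f * (dx * dy + dy * dz + dz * dx);
}

struct Split { int bucket; float cost; };   // split after `bucket`; -1 means none

inline Split bestBinnedSplit(const std::vector<Aabb>& prims, int axis,
                             int numBuckets = 12) {
    assert(!prims.empty() && axis >= 0 && axis < 3 && numBuckets >= 2);
    // Range of primitive centroids along the axis.
    float cmin = std::numeric_limits<float>::max(), cmax = -cmin;
    for (const Aabb& p : prims) {
        const float c = 0.5f * (p.lo[axis] + p.hi[axis]);
        cmin = std::min(cmin, c);
        cmax = std::max(cmax, c);
    }
    const float scale = (cmax > cmin) ? numBuckets / (cmax - cmin) : 0.0f;
    // Bin every primitive by its centroid and grow the bucket bounds.
    std::vector<Aabb> boxes(numBuckets, emptyBox());
    std::vector<int> counts(numBuckets, 0);
    for (const Aabb& p : prims) {
        const float c = 0.5f * (p.lo[axis] + p.hi[axis]);
        const int b = std::min(numBuckets - 1, static_cast<int>((c - cmin) * scale));
        boxes[b] = merge(boxes[b], p);
        ++counts[b];
    }
    // Suffix sweep: area * count of everything from bucket i to the right.
    std::vector<float> rightCost(numBuckets, 0.0f);
    Aabb acc = emptyBox();
    int n = 0;
    for (int i = numBuckets - 1; i >= 1; --i) {
        acc = merge(acc, boxes[i]);
        n += counts[i];
        rightCost[i] = (n > 0) ? surfaceArea(acc) * n : 0.0f;
    }
    // Prefix sweep: each bucket boundary is evaluated in constant time.
    Split best{ -1, std::numeric_limits<float>::max() };
    const int total = static_cast<int>(prims.size());
    acc = emptyBox();
    n = 0;
    for (int i = 0; i + 1 < numBuckets; ++i) {
        acc = merge(acc, boxes[i]);
        n += counts[i];
        if (n == 0 || n == total) continue;   // one side empty: not a real split
        const float cost = surfaceArea(acc) * n + rightCost[i + 1];
        if (cost < best.cost) best = Split{ i, cost };
    }
    return best;
}
```

<p>Because the number of buckets is a constant, one call costs O(N) for binning plus constant work for the sweeps, which is where the O(N log N) total for the whole tree comes from.</p>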
<p>The traversal of a BVH is easier to implement than that of a kd-tree but less
efficient. Since we cannot guarantee any deterministic
spatial order of BVH nodes, as they may overlap each other in any way,
we cannot terminate traversal as soon as we find a hit in a leaf. The
child nodes of a BVH node are also stored in a “front-back” order, but this
only indicates the spatial order of the centroids of the two bounding boxes,
because during construction a triangle is assigned to a side according to
its centroid. It is entirely possible that the “back” BVH
node contains a nearer hit than the “front” node. Nevertheless, the
probability is not high. In most cases, the “back” BVH node will not
have any intersection in the range trimmed by the hit found in
the “front” BVH node, so it is simply popped from the stack. If
the stack is then found to be empty, the function returns the nearest
intersection, if there is any.</p>
<h2 id="83-automatic-bvh-refitting">8.3 Automatic BVH Refitting</h2>
<p>A simple solution for tracing dynamic scene geometry is to recursively
refit the local BVH whenever the moved object leaves its parent’s
bounding box. If there are far fewer tree levels outside the moved object
than inside it, or the movement of the object is spatially limited,
this method has a very low time cost. Note
that a shrinking refit is also necessary when the object moves back towards its
original position. For this we can store the moving
direction of the current frame as a bitmask over the 3 axes. If the bounding
box of the moving object in the last frame touches the border of its parent or
an ancestor along
any of the axes of movement read from the bitmask, we
perform a shrinking refit. Also, for every movable object, a translation
vector is stored and used as an offset in triangle intersection.
However, when the assumption of few exterior tree levels does not hold or
objects move violently, we need to consider
methods other than pure refitting. A combination of splitting,
merging and rotation operations can be performed on the tree structure
(Kopta et al., 2012), which massively increases the rendering speed for
complex animated scenes as it avoids the structural degeneration of naïve
refitting.</p>
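<p>A simplified sketch of the refitting idea: after an object's box changes, walk up through its ancestors and recompute each box as the union of its two children, stopping as soon as a box is unchanged. Recomputing the union covers both the expanding and the shrinking case, so this sketch does not use the direction bitmask described above; the node layout and all names are hypothetical.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Aabb { float lo[3]; float hi[3]; };

inline Aabb merge(const Aabb& a, const Aabb& b) {
    Aabb r;
    for (int k = 0; k < 3; ++k) {
        r.lo[k] = std::min(a.lo[k], b.lo[k]);
        r.hi[k] = std::max(a.hi[k], b.hi[k]);
    }
    return r;
}
inline bool sameBox(const Aabb& a, const Aabb& b) {
    for (int k = 0; k < 3; ++k) {
        if (a.lo[k] != b.lo[k] || a.hi[k] != b.hi[k]) return false;
    }
    return true;
}

// parent == -1 marks the root; left == right == -1 marks a leaf.
struct BvhNode { Aabb box; int parent; int left; int right; };

// Call after nodes[moved].box has been updated.
// Returns the number of ancestor boxes that had to change.
inline int refitAncestors(std::vector<BvhNode>& nodes, int moved) {
    assert(moved >= 0 && moved < static_cast<int>(nodes.size()));
    int updated = 0;
    for (int i = nodes[moved].parent; i >= 0; i = nodes[i].parent) {
        const Aabb merged = merge(nodes[nodes[i].left].box,
                                  nodes[nodes[i].right].box);
        if (sameBox(merged, nodes[i].box)) break;   // higher levels unaffected
        nodes[i].box = merged;
        ++updated;
    }
    return updated;
}
```

<p>When the exterior hierarchy is shallow, the loop touches only a handful of nodes per moved object, which is why refitting is cheap in that case.</p>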
<p>However, this method also has its limitations. When most of the objects
in the scene are animated (e.g. a particle system), the BVH update is
serialized by the atomic operations needed when many threads change
the boundaries of the same bounding box. In this situation, it is better
to rebuild the BVH than to modify it.</p>
Fri, 30 Jun 2017 00:00:00 -0600
http://dqlin.xyz/tech/2017/06/30/08_bvh/
http://dqlin.xyz/tech/2017/06/30/08_bvh/tech[GAPT Series - 7] Overview of Software Workflow<!--more-->
<p><img src="/img/gpt_part2/image1.png" alt="" /></p>
<center><small>Figure 1 UML flowchart of our path tracer
</small></center>
<p>The diagram above shows the simplified workflow of the proposed path
tracer. Note that the modules for Metropolis light transport and
bi-directional path tracing are not included in this diagram, which only
describes normal unidirectional path tracing. However, one can
easily modify the diagram to get the versions for Metropolis light
transport and bi-directional path tracing, which share most of the
processes. Also, it only shows the case where a kd-tree is used as the
spatial acceleration structure. In fact, a BVH is also implemented,
especially for user manipulation of scene geometry.</p>
<p>An obvious characteristic of the displayed architecture is that modules
seem to be evenly distributed between CPU and GPU. In fact, the GPU consumes most
of the running time of the program, as the CPU is only responsible for
preprocessing work such as handling I/O, parsing, invoking memory
allocation, calling the CUDA and OpenGL APIs and coordinating the
thread pool with the GPU kernels. Despite its frequent involvement in the path
tracing task (it appears between every two recursion levels in a frame),
the CPU occupies only a fraction of the time. Although concise,
the diagram shows all 3 optimization factors in path
tracing. The kd-tree, which is constructed on the GPU in this project, is highly
optimized by the “short-stack” traversal method, which will be introduced
later. Multiple importance sampling, as a generalization of single
importance sampling, can be used to reduce the variance when rendering glossy surfaces under
strong highlights. Thread compaction, a crucial
method for increasing the proportion of effective work and the effective memory bandwidth, and
representative of SIMD optimization, is shown at the bottom of the
diagram. A thread pool is maintained to coordinate the work of thread
compaction.</p>
<p>After initialization, the program spends the rest of its time in
two loops. The outer loop uses a successive refinement algorithm to
display the image in real time, and the inner loop executes one path
tracing recursion level for all threads in every iteration. Between two
inner iterations, thread compaction is used to maintain occupancy as
mentioned before. Between two outer iterations, any user interaction is
processed by the CPU. After the corresponding values in device memory are updated,
the OpenGL API is called to swap frame buffers.</p>
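<p>To make the role of thread compaction between two inner iterations concrete, here is a serial CPU stand-in written with the C++ standard library. It is only a sketch under my own assumptions: PathState and compactPaths are hypothetical names, and a GPU version would use a parallel stream compaction primitive rather than std::stable_partition.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct PathState {
    int pixel;  // pixel this path contributes to
    bool alive; // false once the path has terminated
};

// Move the live paths among the first `active` entries to the front,
// keeping their relative order, and return how many are still alive.
inline std::size_t compactPaths(std::vector<PathState>& paths, std::size_t active) {
    assert(active <= paths.size());
    const auto firstDead = std::stable_partition(
        paths.begin(), paths.begin() + static_cast<std::ptrdiff_t>(active),
        [](const PathState& p) { return p.alive; });
    return static_cast<std::size_t>(firstDead - paths.begin());
}
```

<p>The returned count would be the number of threads launched for the next recursion level, so that terminated paths no longer occupy lanes in a warp.</p>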
<p>The Thrust library is also an important component
of our workflow. Developed by NVIDIA, Thrust is a C++ library for CUDA
that provides parallel computing primitives such as map, scatter, reduce,
scan and stream compaction. As a high-level interface, Thrust enables
high-performance parallel computing while dramatically
reducing the programming effort (NVIDIA, 2017). Rather than “reinventing
the wheel”, we use Thrust for all parallel computing primitives required
for GPU kd-tree construction and thread compaction in our program, because of
its proven efficiency and the flexibility of its API.</p>
<p>Overall, the diagram shows a high-level outline of the software
structure, the details of which will be introduced in the following chapters. In
addition, some limitations and recommended improvements will be
addressed in the final chapter.</p>
Fri, 30 Jun 2017 00:00:00 -0600
http://dqlin.xyz/tech/2017/06/30/07_sw/
http://dqlin.xyz/tech/2017/06/30/07_sw/tech[GAPT Series - 6] Real-time Path Tracing on GPU - Introduction<p>Hello, it’s been half a year since last update and now I’m back! I’ve finally
finished the project and renamed it to Real-time Path Tracing on GPU. As I told you
before, the second part will try to push the boundary of GPU accelerated
path tracing to approach real-time performance. I want to
clarify here that by “real-time” I don’t mean that a ray-tracing
technique can match the image quality of mainstream game graphics at a
speed comparable to rasterized graphics.
However, with proper constraints (limits on BRDF types, resolution and model
complexity)
you can actually get close to real-time performance for effects
that would need tons of texture tricks in
a rasterization setting (e.g. the 512x512 Cornell box, as
I will show below). Besides introducing the optimizations I’ve
used for such a great leap in speed, I will analyze
the gap between the current performance and the ideal performance we want to have in
the future, and try to provide some
suggestions on how this technique can be improved.</p>
<!--more-->
<h2 id="61-objectives-of-the-second-part-of-the-project">6.1 Objectives of the Second Part of the Project</h2>
<p>The main objective of this project is to explore the capability and
performance of current GPGPU hardware combined with existing
and new path tracing algorithms in the task of real-time path tracing.
To achieve this, a variety of factors that determine the
efficiency of path tracing are studied, amongst which spatial
acceleration structures, sampling algorithms, and
single-instruction-multiple-data (SIMD) optimization are the most important.
Since the variables (lighting configuration, scene geometry and material)
and the measures (frame rate, convergence rate) of path tracing performance
are both multi-dimensional, optimization concepts are provided in a
case-by-case analysis. A standalone program is written to demonstrate
these concepts. To ensure that the concepts are applicable to general
and complex cases, a considerably large subset of the functionality
found in state-of-the-art path tracers is integrated into the program,
including PBR (physically-based rendering) materials and participating
media. In addition, bi-directional path tracing and Metropolis light
transport are studied to deal with difficult
lighting configurations and to improve the rendering quality under the same
time constraints.</p>
<h2 id="62-layout">6.2 Layout</h2>
<p>The main content of the second part is divided into 6 posts, post 7
to post 12. In post 7, we will have an overview of the workflow of
our path tracer, followed in post 8 by the spatial acceleration structures,
including the SAH-based kd-tree and BVH, as the first studied factor of
optimization. In post 9, the rendering equation will be
reviewed with the basic Monte Carlo sampling algorithm, after which
advanced sampling techniques (multiple importance sampling,
bi-directional path tracing and Metropolis light transport based on
the Markov chain Monte Carlo method) will be studied in response to difficult
rendering cases and for noise reduction. Before the discussion of SIMD
optimization in post 11, several important shading models, including
surface-to-surface BSDF, ray-marching volume rendering and a
simplification of subsurface scattering, will be introduced due to their
close relationship with sampling methods. In particular, I will propose
a parallel SAH-based kd-tree construction algorithm that is suitable for
current GPGPU in post 11. In post 12, benchmark methods will be
introduced to compare my path tracer with
NVIDIA’s Optix path tracing demo and Cycles Renderer, a free
mainstream path tracer.</p>
Fri, 30 Jun 2017 00:00:00 -0600
http://dqlin.xyz/tech/2017/06/30/06_pt2/
http://dqlin.xyz/tech/2017/06/30/06_pt2/tech[GAPT Series - 5] Current Progress & Research Plan<p>Currently, I have implemented a Monte Carlo path tracer <a href="/data/ptdemo.tar.gz">(demo)</a> with a full range of surface-to-surface lighting effects, including specular reflection on anisotropic materials simulated by a GGX-based Ward model. A scene definition text file is read from the user; its format is modified from the popular Wavefront OBJ format by adding material descriptions and camera parameters. I use triangles as the only primitive for simplicity and generality. Integrated with OpenGL and using a successive refinement method, the path tracer can display the rendering result in real time. The optimizations fall into two groups. Algorithm-based methods: SAH-based kd-tree, short-stack kd-tree traversal, ray-triangle intersection in “unit triangle space”, and next event estimation (explicit light sampling). Hardware-based methods: adoption of GPU-friendly data structures with more coalesced memory access and better cache use, reduction of thread divergence to boost warp efficiency, etc.</p>
<!--more-->
<p>The standard Cornell box is chosen to benchmark the performance. With successive refinement that takes one sample per pixel in each frame, at 512x512 resolution, my path tracer runs at an average of 33.5 fps on my NVidia GeForce GTX 960M without explicit sampling enabled. In comparison, the state-of-the-art implementation based on the NVidia Optix ray tracing engine runs the same scene at an average of 60.0 fps without explicit sampling enabled. Although many parts of the code in NVidia’s demo are hardcoded for speed, one can still expect a significant performance gap between my path tracer and NVidia’s Optix engine. Therefore, the most important task in the rest of my research is to further optimize the program, both in algorithm and in hardware use.</p>
<p>One planned optimization is thread compaction, which solves the problem of under-utilized warps when some threads terminate earlier than others. Apart from that, the existing optimizations of thread divergence and data structures will be pushed further. Since no single type of SAS performs better than the others at all levels of scene complexity, BVH will also be implemented as an alternative.</p>
<p>The other task is to enrich functionality and improve rendering quality. To enable dynamic scenes, I plan to adapt the kd-tree construction to the GPU. To enrich the range of optical effects, support for subsurface scattering and participating media will be considered. To support explicit sampling for translucent objects, bi-directional path tracing will be studied and implemented so that fast generation of caustics is possible. To solve convergence problems in pathological scenes, such as lighting through a narrow corridor, new sampling techniques such as Metropolis Light Transport will be researched and tested. If possible, more innovative approaches will be proposed.</p>
<p>The following table summarizes current progress made and the future research plan.</p>
<p style="text-align: center;"><img src="/img/0108.png" alt="0108" /></p>
Sat, 03 Dec 2016 00:00:00 -0700
http://dqlin.xyz/tech/2016/12/03/05_plan/
http://dqlin.xyz/tech/2016/12/03/05_plan/tech[GAPT Series - 4] SIMD Optimization<p>With each thread rendering a screen pixel, path tracing can be solved in an embarrassingly parallel way, without the need for inter-thread communication. However, it is hard to exploit the full capability of single-instruction-multiple-data (SIMD) execution. There is very little locality in the memory access pattern because scene geometry is generally incoherent, which means almost all scene data needs to be stored in global memory or texture memory. Even when the first ray hits follow a coherent pattern, the subsequent bounces can be arbitrarily divergent. Moreover, sampling by the Russian roulette method cannot avoid branching, which implies thread divergence. Nevertheless, two types of optimization based on the CUDA architecture, data structure rearrangement and thread divergence reduction, can reduce the overall rendering time.</p>
<!--more-->
<h3 id="41-data-structure-rearrangement">4.1 Data Structure Rearrangement</h3>
<p>First, “flattening” the data structure into contiguous memory is a key method to improve memory coalescing and reduce memory accesses. Using a kd-tree as the SAS, a traditional CPU path tracer stores a tree structure with a deep hierarchy of pointers (Figure 6).</p>
<p style="text-align: center;"><img src="/img/0106.jpg" alt="0106" /></p>
<center><small> Figure 6. The commonly-used tree structure of kd-tree</small></center>
<p><br />
Undoubtedly, this structure is unsuitable for the GPU. Dynamic memory allocation scatters nodes in memory and gives very bad memory coalescing, seriously limiting the effective memory bandwidth in path tracing. One can easily flatten the kd-nodes into an array with the child pointers replaced by child indices, giving an array-of-structures (AoS). However, this is still far from optimal. Instead of keeping a separate triangle index list for every node, we can store the pointers to triangles contiguously in one array and keep only the array offset in the node structure. This is likely to give either coalesced memory access or better cache use because, unlike triangles, kd-nodes have good locality: if we use serial recursion in kd-tree construction, the indices of nodes near the bottom of the tree that share a nearby common ancestor will be very close to each other. Similarly, the triangle data can also be stored in an array, with the pointers in the triangle list array replaced by indices.</p>
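<p>As a rough illustration of the concatenated triangle list, the sketch below merges per-leaf index lists into one array plus an (offset, count) pair per leaf. The names FlatLeaf, FlatTriangleLists and flatten are hypothetical, and the layout is my assumption of one reasonable way to realize the idea, not the exact structure used in the path tracer.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct FlatLeaf {
    std::uint32_t offset; // first entry of this leaf in the shared index array
    std::uint32_t count;  // number of triangles referenced by this leaf
};

struct FlatTriangleLists {
    std::vector<std::uint32_t> indices; // every leaf's triangle indices, back to back
    std::vector<FlatLeaf> leaves;
};

// Concatenate per-leaf triangle index lists in leaf order.
inline FlatTriangleLists flatten(const std::vector<std::vector<std::uint32_t>>& perLeaf) {
    FlatTriangleLists out;
    for (const auto& list : perLeaf) {
        assert(out.indices.size() + list.size() <= UINT32_MAX);
        out.leaves.push_back({static_cast<std::uint32_t>(out.indices.size()),
                              static_cast<std::uint32_t>(list.size())});
        out.indices.insert(out.indices.end(), list.begin(), list.end());
    }
    return out;
}
```

<p>Both vectors can then be copied to the device as plain arrays, with no pointers to patch.</p>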
<p>Second, compression of the data structure is another way to improve memory efficiency. Notice that in the above kd-node structure, some variables can be represented using only a few bits: axis (0, 1 or 2 for x, y or z), isLeaf (0 or 1) and the number of triangles (a leaf only contains a few triangles), given that we only keep an offset into the global triangle list. Rather than using separate variables to store them, one can pack them into one variable. In my path tracer, axis, triangle number, isLeaf and left child index are packed into one 32-bit integer with a 2-5-1-24 partition using bit operations, which compresses a kd-node from 25 bytes to 16 bytes; the other 3 words are reserved for split position, right child index and triangle array offset. Compressing the left child index from 32 bits to 24 bits limits the number of kd-nodes to 16,777,216, which is enough for most scenes. The 16-byte layout not only reduces memory use but also provides memory alignment, improving the efficiency of memory access.</p>
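<p>The 2-5-1-24 partition can be sketched with plain shifts and masks. The order of the fields inside the word (axis in the top bits, left child index in the low 24 bits) and all names below are my assumptions for illustration; only the field widths and the 16-byte node size come from the text.</p>

```cpp
#include <cassert>
#include <cstdint>

struct PackedKdNode {
    std::uint32_t header;     // axis:2 | triCount:5 | isLeaf:1 | leftChild:24
    float split;              // split position along the axis
    std::uint32_t rightChild; // index of the right child
    std::uint32_t triOffset;  // offset into the shared triangle index array
};
static_assert(sizeof(PackedKdNode) == 16, "node should occupy exactly 16 bytes");

inline std::uint32_t packHeader(std::uint32_t axis, std::uint32_t triCount,
                                bool isLeaf, std::uint32_t leftChild) {
    // Each field must fit in its bit budget.
    assert(axis < 3u && triCount < 32u && leftChild < (1u << 24));
    return (axis << 30) | (triCount << 25) |
           (static_cast<std::uint32_t>(isLeaf) << 24) | leftChild;
}

inline std::uint32_t axisOf(std::uint32_t h) { return h >> 30; }
inline std::uint32_t triCountOf(std::uint32_t h) { return (h >> 25) & 0x1Fu; }
inline bool isLeafOf(std::uint32_t h) { return ((h >> 24) & 1u) != 0u; }
inline std::uint32_t leftChildOf(std::uint32_t h) { return h & 0x00FFFFFFu; }
```

<p>With 5 bits for the triangle count, a leaf in this sketch can reference at most 31 triangles.</p>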
<p>Third, SoA can be used in place of AoS when spatial locality is high for neighboring threads or statements. As mentioned before, path tracing does not have consistent locality in every procedure. Thus, a mixture of SoA and AoS can be used to balance fewer memory accesses against more coalesced memory accesses and optimize the overall speed. The concatenated triangle index array is an example. In addition, some triangle data (precomputed transformation coefficients, to be introduced later) can be extracted into a separate SoA to achieve better cache use when iterating through all triangles in a leaf, as the triangle indices in a leaf are usually close to each other. In the CUDA architecture, a 128-byte cache line is fetched for each global memory access (NVIDIA, 2015). In a loop that visits some contiguous elements, if the number of attributes is smaller than the number of elements, it is likely that fewer global memory accesses will take place, as subsequent accesses of each attribute may already be in the cache.</p>
<h3 id="42-thread-divergence-reduction">4.2 Thread Divergence Reduction</h3>
<p>Another important factor of SIMD optimization is minimizing thread divergence. The code snippet in Figure 7 illustrates some strategies I used to reduce thread divergence. First, statements common to several branches are extracted to minimize the number of operations in each branch. Second, an in-place swap replaces hardcoded assignments. Third, where possible, bit operations replace if-else. Branches that assign different values to the same variable can be replaced by masking and adding the two values.</p>
<p style="text-align: center;"><img src="/img/0107.jpg" alt="0107" /></p>
<center><small> Figure 7. Illustration of an optimization by reducing thread divergence in kd-tree traversal</small></center>
<p><br />
The above snippet also shows another optimization: reduction of global memory access. Rather than storing the pointer to a kd-node in the stack, it is better to store the index. Otherwise, there will be an extra memory read for every other possibility which will not be executed. The optimizations of memory access and thread divergence often reinforce each other. By reducing memory access, one decreases the time taken to execute a divergent branch. By decreasing branch divergence, one reduces the need for redundant memory access.</p>
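<p>The “masking and adding” strategy can be shown in isolation with a small sketch. It is not the code from Figure 7: selectMasked and orderChildren are hypothetical helpers, and the rule that a ray origin left of the split plane visits the left child first is a simplification that ignores the case where the origin lies exactly on the plane.</p>

```cpp
#include <cassert>
#include <cstdint>

// Return a when cond is true and b otherwise, without an if-else.
inline std::uint32_t selectMasked(bool cond, std::uint32_t a, std::uint32_t b) {
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(cond); // all ones or all zeros
    return (a & mask) + (b & ~mask);
}

struct ChildOrder {
    std::uint32_t nearChild; // child index to visit first
    std::uint32_t farChild;  // child index to push on the stack
};

// Pick the traversal order of two kd-node children from the ray origin.
inline ChildOrder orderChildren(float rayOrigin, float split,
                                std::uint32_t left, std::uint32_t right) {
    assert(left != right);
    const bool originOnLeft = rayOrigin < split;
    return {selectMasked(originOnLeft, left, right),
            selectMasked(originOnLeft, right, left)};
}
```

<p>Because both results are child indices rather than pointers, nothing is read from global memory until the traversal actually visits a node.</p>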
<h3 id="43-a-more-efficient-ray-triangle-intersection-algorithm">4.3 A more efficient Ray-triangle Intersection Algorithm</h3>
<p>A specific speed optimization is the adoption of a more efficient intersection algorithm. Ray-triangle intersection can be a performance bottleneck if the math operations are too complex. Schmittler et al. (2004) introduced an affine triangle transformation to accelerate a hardware pipeline. Instead of testing a fixed ray against varied triangles, it tests a “unit triangle” against different expressions of the ray in the “unit triangle space” of each triangle, which requires an affine transformation.</p>
<p style="text-align: center;"><img src="/img/01f36.png" alt="0108" /></p>
<p>The inverse matrix and the translation term can be computed offline and stored as one 4D float vector per dimension. By extracting common geometric information, this method reduces the 26 multiplications and 23 additions/subtractions required by the standard ray-triangle intersection algorithm to 20 multiplications and 13 additions/subtractions. By separating the precomputed terms from the remaining necessary triangle data (vertex normals and material index) to form a structure of two arrays, higher performance can be obtained.</p>
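<p>A minimal sketch of the unit triangle test follows, assuming the transform maps the triangle to the vertices (0,0,0), (1,0,0), (0,1,0) in the plane z = 0 and that each stored 4D vector holds one row of the inverse matrix plus its translation term. The names (Row4, UnitTriangle, precompute, intersect) and the row layout are my own choices for illustration, and the arithmetic is not tuned to the operation counts quoted above.</p>

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };

inline Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// (x, y, z) is one row of the inverse matrix, w the matching translation term.
struct Row4 { float x, y, z, w; };
struct UnitTriangle { Row4 u, v, n; };

inline Row4 makeRow(Vec3 r, float invDet, Vec3 a) {
    const Vec3 s{r.x * invDet, r.y * invDet, r.z * invDet};
    return {s.x, s.y, s.z, -dot(s, a)};
}

// Precompute the world-to-unit-triangle transform for triangle (a, b, c).
inline UnitTriangle precompute(Vec3 a, Vec3 b, Vec3 c) {
    const Vec3 e1 = sub(b, a), e2 = sub(c, a), n = cross(e1, e2);
    const float det = dot(n, n); // determinant of the matrix with columns e1, e2, n
    assert(det > 0.0f);          // degenerate triangles have no inverse
    const float invDet = 1.0f / det;
    return {makeRow(cross(e2, n), invDet, a),
            makeRow(cross(n, e1), invDet, a),
            makeRow(n, invDet, a)};
}

inline float applyPoint(Row4 r, Vec3 p) { return r.x * p.x + r.y * p.y + r.z * p.z + r.w; }
inline float applyDir(Row4 r, Vec3 d) { return r.x * d.x + r.y * d.y + r.z * d.z; }

// Return true and write t when origin + t * dir hits the triangle with t > 0.
inline bool intersect(const UnitTriangle& tri, Vec3 origin, Vec3 dir, float& t) {
    const float dz = applyDir(tri.n, dir);
    if (dz == 0.0f) return false; // ray parallel to the triangle plane
    const float hitT = -applyPoint(tri.n, origin) / dz;
    if (!(hitT > 0.0f)) return false;
    const float u = applyPoint(tri.u, origin) + hitT * applyDir(tri.u, dir);
    const float v = applyPoint(tri.v, origin) + hitT * applyDir(tri.v, dir);
    if (u < 0.0f || v < 0.0f || u + v > 1.0f) return false;
    t = hitT;
    return true;
}
```

<p>Since the transform is affine, the parameter t found in unit triangle space is also the ray parameter in world space, and u and v are the barycentric coordinates of the hit.</p>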
<h3 id="references">References</h3>
<p>NVIDIA. (2015). Memory Transactions. NVIDIA® Nsight™ Development Platform, Visual Studio Edition 4.7 User Guide.</p>
<p>Schmittler, J., Woop, S., Wagner, D., Paul, W. J., & Slusallek, P. (2004, August). Realtime ray tracing of dynamic scenes on an FPGA chip. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware (pp. 95-106). ACM.</p>
Sat, 03 Dec 2016 00:00:00 -0700
http://dqlin.xyz/tech/2016/12/03/04_opt/
http://dqlin.xyz/tech/2016/12/03/04_opt/tech[GAPT Series - 3] Path Tracing Algorithm<!--more-->
<h3 id="31-the-rendering-equation">3.1 The Rendering Equation</h3>
<p>The rendering equation introduced by Kajiya (1986) defines the radiance seen from a point <img src="/img/01f6.png" alt="" /> in the reflection direction <img src="/img/01f7.png" alt="" />, i.e. the view vector in ray tracing terms, a consequence of the reversibility of light paths:<br />
<img src="/img/01f8.png" alt="" />
which is the emittance of the point itself plus the reflected radiance caused by all incident radiance gathered from the surrounding hemisphere. This is a physically correct model of global illumination considering only surface-to-surface reflection. In computer graphics, this is usually mapped into a recursion where the integral is estimated from randomly picked samples. In path tracing, the light paths through a pixel are sampled individually from random positions within the pixel, and the results are then averaged. Upon hitting a surface, only one secondary ray is shot for each sample. It follows intuitively that the secondary ray must be generated by some probabilistic mechanism, which is defined by the BRDF <img src="/img/01f9.png" alt="" /> with respect to the surface properties. However, some questions remain. Given a limited number of samples, how do we choose a sampling strategy that maximizes image quality? Given a specific BRDF, how do we translate it into an algorithm that fits the sampling strategy we use?</p>
<h3 id="32-stratified-sampling-vs-anti-aliasing-filters">3.2 Stratified Sampling vs. Anti-aliasing Filters</h3>
<p>To generate samples within pixels, a naïve solution is uniform sampling. In CUDA device code, the function curand_uniform, which takes a pointer to the generator state, returns 1D uniform pseudorandom samples in (0, 1]. However, uniformly distributed samples tend to form clusters, producing a high noise level. A common way to overcome this is stratified sampling, which divides a pixel into <img src="/img/01f10.png" alt="" /> grid cells and takes the same number of uniform samples within each cell. Theoretically, it reduces the estimation error from <img src="/img/01f11.png" alt="" /> to <img src="/img/01f12.png" alt="" />, where n is the number of samples. A problem with stratified sampling is that it is not suitable for the successive refinement required in real-time rendering. When using successive refinement, usually one sample is taken for every pixel in each frame. If we want minimal aliasing in any given frame, the earlier samples must at least not follow any fixed pattern, which is not possible for stratified sampling, as it must use at least <img src="/img/01f10.png" alt="" /> samples as a unit. Therefore, we want a solution with both a low noise level and the ability to refine successively.</p>
<p>In SmallPT, the famous 99-line C++ implementation of path tracing, Beason (2007) applies a tent filter to the uniform random samples, which shifts more samples towards the center of the pixel. In my test, this method produces the same image quality as stratified sampling for the same number of samples, while keeping the ability of successive refinement. In fact, the tent function approximates the sinc function, the ideal anti-aliasing filter (Figure 3). There are other approximations of higher quality, such as the bicubic filter and the Gaussian filter. However, these filters have much higher computational overhead, so the tent filter is a more practical choice in real-time rendering.</p>
<p style="text-align: center;"><img src="/img/0103.jpg" alt="0103" /></p>
<center><small>Figure 3. Comparison between sinc function and tent function</small></center>
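<p>The tent filter can be applied by warping a uniform sample through the inverse CDF of a triangular density, which is the approach taken in smallpt. The sketch below shows that warp on its own; the function name tentOffset is hypothetical, and how the offset is scaled to pixel units is left to the caller.</p>

```cpp
#include <cassert>
#include <cmath>

// Map a uniform sample u in [0, 1) to an offset in [-1, 1) whose density is
// a tent (triangle) peaked at 0, so more samples land near the pixel center.
inline float tentOffset(float u) {
    assert(u >= 0.0f && u < 1.0f);
    const float r = 2.0f * u;
    return r < 1.0f ? std::sqrt(r) - 1.0f : 1.0f - std::sqrt(2.0f - r);
}
```

<p>Each frame can draw two fresh uniform samples per pixel and warp them independently for x and y, so no fixed sample pattern is needed and successive refinement keeps working.</p>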
<p><br />
Sample rays reflected by the surface can be chosen from all possible directions within the hemisphere around the hit point. Without importance sampling, it is hard to achieve a reasonable convergence rate when the variance of radiance is high. Consider two factors: surface characteristics and the spatial distribution of incoming radiance. The first factor can be easily described by BRDF models, which offer a distribution function for importance sampling. In contrast, the second factor is more difficult to analyze. For cases where light directly hits the surface, such as diffuse reflection, references to emitting surfaces (lights) can be stored separately to enable explicit light sampling. However, when transmission is included, this method no longer works. Effects like caustics must be achieved using bi-directional path tracing if importance sampling with respect to the incoming radiance distribution is required. This is a topic I plan to study in the next half of the project.</p>
<h3 id="33-brdf--anisotropic-material">3.3 BRDF & Anisotropic Material</h3>
<p>For importance sampling based on the BRDF, one important result I have achieved is the simulation of anisotropic materials. Ward (1992) introduced a practical BRDF by modifying the Beckmann distribution factor in the Cook-Torrance model (Cook & Torrance, 1982),</p>
<p style="text-align: center;"><img src="/img/01f13.png" alt="" /></p>
<p>where <img src="/img/01f14.png" alt="" /> and <img src="/img/01f15.png" alt="" /> correspond to the “roughness” of the material in the x and y directions w.r.t. tangent space. Taking the azimuth angle <img src="/img/01f16.png" alt="" /> as argument, it is easy to see that when <img src="/img/01f17.png" alt="" /> the distribution is completely determined by <img src="/img/01f14.png" alt="" /> and when <img src="/img/01f18.png" alt="" /> the distribution is completely determined by <img src="/img/01f15.png" alt="" />. However, in my implementation of anisotropic materials I chose the GGX function instead of the Beckmann function due to its simplicity and faster computation. An isotropic version of GGX was adopted in Unreal Engine 4 (Karis, 2013):</p>
<p style="text-align: center;"><img src="/img/01f19.png" alt="" /></p>
<p>By replacing <img src="/img/01f20.png" alt="" /> with <img src="/img/01f21.png" alt="" />, the equation works for anisotropic surfaces. It turns out that the sampling azimuth angle <img src="/img/01f16.png" alt="" /> and altitude angle <img src="/img/01f22.png" alt="" /> can be computed with <img src="/img/01f23.png" alt="" /> and <img src="/img/01f24.png" alt="" />, where <img src="/img/01f25.png" alt="" /> and <img src="/img/01f26.png" alt="" /> are two uniform random variables on the unit interval. Notice that this results in <img src="/img/01f27.png" alt="" />, which can be computed faster than <img src="/img/01f28.png" alt="" /> in Beckmann’s case, where two extra math functions, cos() and log(), are involved. Below is a sample picture (Figure 4) showing the anisotropic effect produced by the modified Ward model with GGX distribution.</p>
<p style="text-align: center;"><img src="/img/0104.png" alt="0104" /></p>
<center><small>Figure 4. Comparison between isotropic and anisotropic surfaces</small></center>
<h3 id="34-next-event-estimation">3.4 Next Event Estimation</h3>
<p>In order to use the spatial distribution of incoming radiance in importance sampling, I applied next event estimation, implemented by explicit light sampling. As mentioned before, references to triangles with emittance greater than 0 (“light triangles”) are stored in an array. When a ray hits a surface, shadow rays are shot from the hit point to every light triangle. Notice that more shadow rays imply faster convergence but a lower frame rate due to the extra kd-tree traversal and ray-triangle intersection cost. When doing real-time path tracing, one can determine the balance between convergence rate and frame rate heuristically.</p>
<p>To sample shadow rays, an end point is picked randomly within every light triangle, and the fixed start point is subtracted from it to derive the ray direction. However, for importance sampling, all other terms need to be divided by the probability density function (pdf). For a triangle, it is determined by the solid angle it spans w.r.t. the hit point divided by the hemispherical solid angle (<img src="/img/01f29.png" alt="" />). To compute this solid angle, an elegant formula was found by Van Oosterom and Strackee (1983):</p>
<p style="text-align: center;"><img src="/img/01f30.png" alt="" /></p>
<p>where <img src="/img/01f31.png" alt="" /> are the position vectors of the three triangle vertices w.r.t. the origin.</p>
<p>If we normalize the three position vectors in advance, the formula simplifies to</p>
<p style="text-align: center;"><img src="/img/01f32.png" alt="" /></p>
<p>which is more convenient for computation.</p>
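<p>For reference, the normalized form of the Van Oosterom and Strackee formula can be sketched as below: with unit vectors a, b, c from the hit point to the vertices, tan(Ω/2) = |a · (b × c)| / (1 + a·b + b·c + c·a). The function and type names are hypothetical, and std::atan2 is used so that a zero or negative denominator is handled without a special case.</p>

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

inline Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

inline Vec3 normalized(Vec3 v) {
    const float len = std::sqrt(dot(v, v));
    assert(len > 0.0f); // a vertex must not coincide with the hit point
    return {v.x / len, v.y / len, v.z / len};
}

// Solid angle, in steradians, subtended at `hit` by triangle (v0, v1, v2).
inline float triangleSolidAngle(Vec3 hit, Vec3 v0, Vec3 v1, Vec3 v2) {
    const Vec3 a = normalized(sub(v0, hit));
    const Vec3 b = normalized(sub(v1, hit));
    const Vec3 c = normalized(sub(v2, hit));
    const float numerator = std::fabs(dot(a, cross(b, c)));
    const float denominator = 1.0f + dot(a, b) + dot(b, c) + dot(c, a);
    return 2.0f * std::atan2(numerator, denominator);
}
```

<p>A quick sanity check is a triangle whose vertices lie on the three coordinate axes: seen from the origin it covers one octant of the sphere, so the result should be close to π/2.</p>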
<p>An important caveat is that the triangles used for explicit light sampling must be excluded from the scene intersection of the next ray to avoid counting their radiance twice. A bitmap can be used to mark which light triangles have been explicitly sampled, provided the number of light triangles is within a proper limit. If it turns out that the intersected triangle was explicitly sampled by the previous ray, the ray is abandoned.</p>
<p>Next event estimation is crucial for increasing the convergence rate in scenes of high dynamic range. For example, the light’s emittance factors can be 100 times greater than the surface reflectance factors while the area of the light is small. Below (Figure 5) is an example showing the significant reduction of noise level by NEE given the same number of successively refined frames.</p>
<p style="text-align: center;"><img src="/img/0105.jpg" alt="0105" /></p>
<center><small> Figure 5. Comparison between same scene of high dynamic range with and without next event estimation</small></center>
<h3 id="35-the-russian-roulette-algorithm">3.5 The Russian Roulette Algorithm</h3>
<p>So far, all possible surface-to-surface reflection and refraction types are supported in my path tracing program. Russian roulette is used here to determine which type of light path to take. Apart from the diffuse color and specular color, I also defined 4 extra material parameters, transparency, metalness and two roughness values, ranging from 0.0 to 1.0; transparency and metalness play the role of thresholds in Russian roulette. For every ray hit, the transparency value is tested first to determine the chance of transmission versus reflection. If the transparency value is 1.0, the uniform random variable <img src="/img/01f33.png" alt="" /> will always be smaller than or equal to it, branching to the transmission case. If <img src="/img/01f34.png" alt="" />, it is tested against the metalness value to determine the ratio between specular and diffuse reflection. If <img src="/img/01f35.png" alt="" />, the next ray will be generated from the diffuse BRDF (Lambert in my implementation). Otherwise, the next ray will be sampled as the incoming radiance direction of a specular BRDF. Roughness is a 2D float vector that determines the <img src="/img/01f14.png" alt="" /> and <img src="/img/01f15.png" alt="" /> parameters in the Ward model, which reduces to the Cook-Torrance model when the two roughness values are the same.</p>
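<p>The branching described above can be sketched as a small selection function. This is an illustration under assumptions: the exact inequalities are given by the formula images, so the comparisons used here (transmission when the first sample is at most the transparency, diffuse when the second sample exceeds the metalness) are my reading of the text, and the names Bounce and chooseBounce are hypothetical.</p>

```cpp
#include <cassert>

enum class Bounce { Transmission, Diffuse, Specular };

// Pick the type of the next path segment from two uniform samples in [0, 1].
// Assumed thresholds: xi1 <= transparency selects transmission, otherwise
// xi2 > metalness selects diffuse reflection and the rest is specular.
inline Bounce chooseBounce(float transparency, float metalness, float xi1, float xi2) {
    assert(xi1 >= 0.0f && xi1 <= 1.0f && xi2 >= 0.0f && xi2 <= 1.0f);
    if (xi1 <= transparency) return Bounce::Transmission;
    if (xi2 > metalness) return Bounce::Diffuse;
    return Bounce::Specular;
}
```

<p>Under these assumptions, a material with transparency 1.0 always transmits, and an opaque material with metalness 1.0 is always treated as specular.</p>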
<h3 id="references">References</h3>
<p>Beason, K. (2007). smallpt: Global Illumination in 99 lines of C++. Retrieved from http://www.kevinbeason.com/smallpt/</p>
<p>Cook, R. L., & Torrance, K. E. (1982). A reflectance model for computer graphics. ACM Transactions on Graphics (TOG), 1(1), 7-24.</p>
<p>Kajiya, J. T. (1986, August). The rendering equation. In ACM Siggraph Computer Graphics (Vol. 20, No. 4, pp. 143-150). ACM.</p>
<p>Karis, B. (2013). Real shading in Unreal Engine 4. In ACM SIGGRAPH 2013 Courses.</p>
<p>Van Oosterom, A., & Strackee, J. (1983). The solid angle of a plane triangle. IEEE Transactions on Biomedical Engineering, BME-30(2), 125-126.</p>
<p>Ward, G. J. (1992). Measuring and modeling anisotropic reflection. ACM SIGGRAPH Computer Graphics, 26(2), 265-272.</p>
Sat, 03 Dec 2016 00:00:00 -0700
http://dqlin.xyz/tech/2016/12/03/03_render/
http://dqlin.xyz/tech/2016/12/03/03_render/tech[GAPT Series - 2] Spatial Acceleration Structure<!--more-->
<h3 id="21-choice-of-sas">2.1 Choice of SAS</h3>
<p>The naïve implementation of ray tracing algorithms iterates through all primitives in the scene and checks ray-primitive intersection for each, which is very time-consuming (linear in the number of primitives) and becomes a severe performance bottleneck when the number of primitives gets high. In practice, different spatial acceleration structures (SAS) are applied to solve the problem. They generally reduce the cost to logarithmic time and can therefore make interactive ray tracing possible for complex or even dynamic scenes. Octree, BSP (binary space partitioning), BVH (bounding volume hierarchy) and kd-tree are some representatives of the SAS. A SAS generally divides the scene or mesh recursively into sub-spaces, which often gives a tree-like structure. Among them, octree and BSP choose the split position in a fixed routine. For example, a typical octree always splits at the center of the space, dividing it into 8 sub-spaces. Since they ignore the specific geometry of the scene, they are generally less efficient than BVH or kd-tree. BVH and kd-tree, on the other hand, use heuristics to determine the partition position based on the specific scene geometry. Comparing the two, Vinkler et al. (2014) have shown that kd-tree performs better than BVH for complex scenes, while BVH beats kd-tree for simple to moderately complex scenes. Considering this, I planned to implement both structures to allow a free choice for different kinds of scenes. However, due to limited time, I have only studied and implemented kd-tree so far. Therefore, this part will focus on my findings on kd-tree.</p>
<!--more-->
<h3 id="22-surface-area-heuristics">2.2 Surface Area Heuristics</h3>
<p>The construction of a kd-tree depends on choosing one dimension and a split position in that dimension in every iteration. A naïve solution is to cycle through the 3 axes and choose the spatial median every time, which performs no better than an octree. A popular mechanism is the Surface Area Heuristic (SAH) (Wald & Havran, 2006), a greedy algorithm that finds a local optimum based on the surface areas of the two child nodes in every step. Instead of finding the global minimum cost, which is practically infeasible as the number of possible trees grows exponentially with scene complexity, SAH assumes that all primitives in the child nodes of a particular step are in leaves, giving the formula of the expected cost of a particular split:</p>
<p style="text-align: center;"><img src="/img/01f1.png" alt="" /></p>
<p>where <img src="/img/01f2.png" alt="" /> is the surface area of volume x, <img src="/img/01f3.png" alt="" /> are the costs of a traversal step and an intersection step, and <img src="/img/01f4.png" alt="" /> are the numbers of primitives in the left and right child nodes. According to probability theory, <img src="/img/01f5.png" alt="" /> gives the chance of a uniformly distributed ray hitting the left node. Uniformity is a reasonable assumption, as the distribution of rays tends not to follow any particular pattern as the number of bounces increases and the scene geometry varies. Although treating both children as leaves overestimates the real cost, the strategy works well in practice (Wald & Havran, 2006). Another advantage of SAH is that deciding when to stop splitting is easy: the cost of splitting and the cost of not splitting can be compared directly using the formula above.</p>
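<p>The split-versus-leaf comparison described above can be sketched as follows. This is a minimal illustration that assumes the usual form of the SAH cost from Wald & Havran (2006), a traversal cost plus the intersection cost weighted by the surface-area ratios. The function names and the cost constants (1.0 and 1.5) are hypothetical placeholders, not values from my implementation.</p>

```python
def sah_split_cost(sa_parent, sa_left, sa_right, n_left, n_right,
                   c_trav=1.0, c_isect=1.5):
    """Expected cost of a split, treating both children as leaves.

    c_trav and c_isect are hypothetical costs of one traversal step
    and one ray-primitive intersection test.
    """
    p_left = sa_left / sa_parent    # chance that a ray hitting the parent hits the left child
    p_right = sa_right / sa_parent  # same for the right child
    return c_trav + c_isect * (p_left * n_left + p_right * n_right)


def leaf_cost(n_prims, c_isect=1.5):
    """Cost of not splitting: every primitive in the node is tested."""
    return c_isect * n_prims


def box_area(dx, dy, dz):
    """Surface area of an axis-aligned box with the given edge lengths."""
    return 2.0 * (dx * dy + dy * dz + dz * dx)


# A 2 x 1 x 1 node holding 10 primitives, split in the middle of the
# x axis with 5 primitives on each side.
parent_area = box_area(2, 1, 1)
child_area = box_area(1, 1, 1)
split_cost = sah_split_cost(parent_area, child_area, child_area, 5, 5)
should_split = split_cost < leaf_cost(10)
```

<p>The termination criterion falls out of the same two functions: a node becomes a leaf as soon as no candidate split is cheaper than <code>leaf_cost</code>.</p>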
<p>In my implementation, the kd-tree is constructed on the CPU and then transferred to the GPU, because construction writes large amounts of data to memory and branches heavily. Without special optimization and modification of the algorithm, a GPU version can be much slower than the CPU version. However, Zhou et al. (2008) proposed a clever method for constructing kd-trees on the GPU that significantly outperforms single-core CPU algorithms and is competitive with multi-core CPU algorithms. This potential makes GPU construction of the kd-tree a topic worth pursuing in the rest of my research.</p>
<h3 id="23-dealing-with-pathological-cases">2.3 Dealing with Pathological Cases</h3>
<p>Some pathological cases are often encountered during kd-tree construction. They are illustrated here using triangles as primitives. Consider an O(N) algorithm that scans the spatially sorted start and end vertices of the triangles’ bounding boxes, adding a triangle to the left child when its starting vertex is encountered and to the right child when its ending vertex is encountered. This guarantees that a triangle with any measurable extent on the left or right side of the split plane is included in that child. However, a triangle that lies entirely in the split plane is added to only one of the child nodes. (The split position is chosen from the vertices, but a vertex exactly at the split position does not by itself cause its triangle to be added to a child, since that would incur unnecessary intersection cost within that node.) Because all three vertices of such a triangle have the same coordinate in the current dimension (x, y or z), whether it ends up in the left or the right node is arbitrary (Figure 1). This causes structural disparity among the child nodes containing such triangles, which in turn leads to divergent thread paths: some rays hit the triangle directly from the front, while others undergo complicated traversal steps to reach it from the back side. A simple solution is to store a marker with every starting and ending vertex that indicates whether it has the same position in the current dimension as its counterpart. When the split position falls exactly on such a triangle, the marker is read to decide whether the triangle should also be added to the other child node, which solves the problem.</p>
<p style="text-align: center;"><img src="/img/0101.jpg" alt="0101" /></p>
<center><small>Figure 1. The vertices lying on the splitting plane can belong to either left or right child node</small></center>
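<p>A minimal sketch of the marker idea follows. It works on the extent of each triangle along the current split axis instead of a full event list, and it assumes one particular policy: a triangle lying exactly in the split plane is added to both children. The function name <code>classify_triangles</code> and this policy are illustrative assumptions, not a transcript of my construction code.</p>

```python
def classify_triangles(extents, split):
    """Assign triangle indices to the left and right child.

    extents: list of (lo, hi) bounds of each triangle along the split axis.
    A triangle is "planar" when lo == hi, which plays the role of the
    marker stored with its start and end vertices.
    """
    left, right = [], []
    for index, (lo, hi) in enumerate(extents):
        planar = (lo == hi)
        if planar and lo == split:
            # Assumed policy: a triangle inside the split plane goes to
            # both children, so rays from either side find it directly.
            left.append(index)
            right.append(index)
            continue
        if lo < split:   # some measurable part lies on the left side
            left.append(index)
        if hi > split:   # some measurable part lies on the right side
            right.append(index)
    return left, right


# Triangle 1 lies exactly in the plane x = 1; triangle 3 straddles it.
left_child, right_child = classify_triangles(
    [(0.0, 1.0), (1.0, 1.0), (1.0, 2.0), (0.0, 2.0)], split=1.0)
```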
<h3 id="24-kd-tree-traversal">2.4 Kd-tree Traversal</h3>
<p>For kd-tree traversal, the standard CPU algorithm, which uses a stack to store the far-side nodes, cannot be applied directly to the GPU. First, the stack needs to be implemented as a fixed-length array rather than a linked list or dynamic array, which guarantees coalesced memory access and reduces memory reads and writes. Second, each stack item should be as small as possible, because a complex scene with dozens of kd-tree levels forces the stack to be allocated in local memory instead of the very limited thread registers. Foley & Sugerman (2005) introduced two stackless traversal algorithms called kd-restart and kd-backtrack. However, without a stack to record which nodes remain to be visited, these algorithms must modify the traversal path, which adds time and space overhead: kd-restart descends directly to the nearest leaf and then restarts from the root with the ray range advanced forward, while kd-backtrack stores extra node data so that it can resume from a node’s parent instead of the root. Moreover, in the worst case the cost of kd-restart degenerates to linear.</p>
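<p>To make the restart behaviour concrete, here is a small sketch of kd-restart as I read it from the description above. The tree encoding (a leaf is a string, an interior node is a tuple of axis, split position, left child and right child) and the function name are assumptions made for the example, and the leaf intersection test is replaced by recording the visited leaf.</p>

```python
import math


def kd_restart(root, origin, direction, t_scene_min, t_scene_max):
    """Visit leaves front to back without a stack, restarting from the root."""
    visited = []
    t_min = t_max = t_scene_min
    while t_max < t_scene_max:
        # Restart: advance the ray range past the leaf just processed.
        node, t_min, t_max = root, t_max, t_scene_max
        while not isinstance(node, str):
            axis, pos, left, right = node
            o, d = origin[axis], direction[axis]
            below = o < pos or (o == pos and d <= 0)
            near, far = (left, right) if below else (right, left)
            t_split = (pos - o) / d if d != 0 else math.inf
            if t_split >= t_max or t_split <= 0:
                node = near                  # range lies entirely on the near side
            elif t_split <= t_min:
                node = far                   # range lies entirely on the far side
            else:
                node, t_max = near, t_split  # clip the range to the near child
        visited.append(node)  # ray-primitive tests for this leaf would go here
    return visited


# Two splits along x: leaf A covers x < 1, B covers 1..2, C covers x > 2.
tree = (0, 1.0, "A", (0, 2.0, "B", "C"))
forward = kd_restart(tree, (0.5, 0.0, 0.0), (1.0, 0.0, 0.0), 0.0, 10.0)
backward = kd_restart(tree, (2.5, 0.0, 0.0), (-1.0, 0.0, 0.0), 0.0, 10.0)
```

<p>Every visited leaf costs a full descent from the root, which is where the extra traversal time comes from. A real tracer would also stop as soon as a hit is found inside the current range.</p>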
<p>A neat solution proposed by Santos et al. (2012) adopts a “short stack” method. A standard stack entry takes 12 bytes (4 bytes for the node address, 4 bytes for the near ray distance tnear and 4 bytes for the far ray distance tfar). They observed that tnear can be derived during traversal, since it only changes when the traversal finishes checking a leaf, which gives an “8-byte” stack algorithm. The advantage of the 8-byte stack is not only that it needs less local memory, but also that memory access is faster, since an 8-byte load is faster than a 12-byte load from local memory on the CUDA architecture.</p>
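<p>The following sketch shows the idea of storing only the node and tfar on the stack and deriving tnear when a leaf is finished. It is written in Python for readability and reflects my reading of the technique, not the code of Santos et al.; the tree encoding (a leaf is a string, an interior node is a tuple of axis, split position, left child and right child), the name <code>STACK_SIZE</code> and its value are assumptions.</p>

```python
import math
import struct

STACK_SIZE = 32  # hypothetical fixed depth; a real tracer sizes this to the tree

# Entry layouts: node index + tfar (8 bytes) versus node index + tnear + tfar.
ENTRY_8_BYTES = struct.calcsize("<If")
ENTRY_12_BYTES = struct.calcsize("<Iff")


def stack_traverse(root, origin, direction, t_min, t_max):
    """Front-to-back traversal with a fixed-length stack of (node, tfar)."""
    stack = [None] * STACK_SIZE   # fixed-length array, never grows
    top = 0
    node, visited = root, []
    while True:
        while not isinstance(node, str):
            axis, pos, left, right = node
            o, d = origin[axis], direction[axis]
            below = o < pos or (o == pos and d <= 0)
            near, far = (left, right) if below else (right, left)
            t_split = (pos - o) / d if d != 0 else math.inf
            if t_split > t_max or t_split <= 0:
                node = near
            elif t_split < t_min:
                node = far
            else:
                stack[top] = (far, t_max)    # tnear is not stored
                top += 1
                node, t_max = near, t_split
        visited.append(node)  # ray-primitive tests for this leaf would go here
        if top == 0:
            return visited
        t_min = t_max          # derived: the next range starts where this leaf ended
        top -= 1
        node, t_max = stack[top]


tree = (0, 1.0, "A", (0, 2.0, "B", "C"))
forward = stack_traverse(tree, (0.5, 0.0, 0.0), (1.0, 0.0, 0.0), 0.0, 10.0)
backward = stack_traverse(tree, (2.5, 0.0, 0.0), (-1.0, 0.0, 0.0), 0.0, 10.0)
```

<p>The derivation works because the most recent push always sets the current tfar to the split distance, so when a leaf is finished its tfar is exactly the tnear of the node about to be popped.</p>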
<p>By combining SAH construction of the kd-tree with “short stack” traversal, my SAS achieves the best performance among the combinations tested. The experimental data for different combinations of construction and traversal methods on different scenes is shown below (Figure 2). Note that these results were obtained with some SIMD optimizations applied to the data structure; I will discuss these and compare against the non-optimized versions in part 4.</p>
<p style="text-align: center;"><img src="/img/0102.png" alt="0102" /></p>
<center><small>Figure 2</small></center>
<p><br /></p>
<h3 id="references">References</h3>
<p>Foley, T., & Sugerman, J. (2005, July). KD-tree acceleration structures for a GPU raytracer. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware (pp. 15-22). ACM.</p>
<p>Santos, A., Teixeira, J. M., Farias, T., Teichrieb, V., & Kelner, J. (2012). Understanding the efficiency of KD-tree ray-traversal techniques over a GPGPU architecture. International Journal of Parallel Programming, 40(3), 331-352.</p>
<p>Wald, I., & Havran, V. (2006, September). On building fast kd-trees for ray tracing, and on doing that in O(N log N). In 2006 IEEE Symposium on Interactive Ray Tracing (pp. 61-69). IEEE.</p>
Sat, 03 Dec 2016 00:00:00 -0700
http://dqlin.xyz/tech/2016/12/03/02_sas/
http://dqlin.xyz/tech/2016/12/03/02_sas/tech[GAPT Series - 1] Introduction to GPU Accelerated Path Tracing<p>This series records the progress of my final year paper - GPU Accelerated Path Tracing. It will be divided into 6 chapters.</p>
<ul>
<li>1 Introduction</li>
<li>2 Spatial Acceleration Structure</li>
<li>3 Path Tracing Algorithm</li>
<li>4 SIMD Optimization</li>
<li>5 Current Progress & Research Plan</li>
</ul>
<h3 id="11-background-information">1.1 Background Information</h3>
<p>Polygon rasterization has been the de facto standard for real-time graphics in video games over the past few decades, while ray tracing methods are still mainly used for offline rendering of animated films and industrial designs. However, recent years have witnessed rapid growth in the capability of real-time ray tracing with the advent of General Purpose GPU (GPGPU) computing and associated programming interfaces like CUDA and OpenCL. Nowadays, it is possible to ray-trace complex scenes without global illumination in real time on high-end GPUs. Ray tracing handles complex optical effects in a theoretically straightforward way, and it has huge potential for performance growth because GPU performance scales directly with Moore’s law without the power wall faced by CPUs. For these reasons, ray tracing methods have been considered the standard rendering technique of the future.</p>
<p>However, to achieve photorealistic effects, which rasterization-based graphics implements through heavy use of textures, a ray tracing model must include global illumination beyond the simplified light paths defined in Whitted ray tracing. Ancillary methods like radiosity and distribution ray tracing have been added to the original Whitted framework to enhance the lighting, yet each of them targets a specific subset of global illumination, and their combination does not cover all possible light paths. Path tracing, which samples all possible light paths using Monte Carlo methods and the rendering equation, has come to the rescue. Fundamentally expanding the scope of traditional ray tracing, path tracing naturally produces authentic global illumination with a theoretically simple formulation. Yet it requires thousands of samples per pixel to reduce the noise below the threshold of human perception and is therefore used mostly in offline rendering tasks like film production. Some real-time path tracing demos adopt an interactive refinement method that averages the results of consecutive frames: as long as the camera is still, each frame takes one extra sample per pixel and accumulates it into the result of the previous frames. However, this method is still unsuited to games, because the noise returns whenever anything in the scene changes.</p>
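<p>The interactive refinement scheme amounts to a running average per pixel that is reset whenever the view changes. A minimal sketch, with a hypothetical class name and a flat list of grayscale pixels standing in for the frame buffer:</p>

```python
class ProgressiveAccumulator:
    """Running per-pixel average over consecutive frames."""

    def __init__(self, num_pixels):
        self.mean = [0.0] * num_pixels
        self.frames = 0

    def reset(self):
        """Discard the history, e.g. when the camera or scene changes."""
        self.mean = [0.0] * len(self.mean)
        self.frames = 0

    def add_frame(self, samples):
        """Fold one new sample per pixel into the running average."""
        self.frames += 1
        for i, sample in enumerate(samples):
            self.mean[i] += (sample - self.mean[i]) / self.frames


accumulator = ProgressiveAccumulator(num_pixels=2)
accumulator.add_frame([1.0, 0.0])
accumulator.add_frame([3.0, 1.0])
averaged = list(accumulator.mean)
accumulator.reset()
```

<p>The call to <code>reset</code> is what makes the method awkward for games: every change in the scene throws the accumulated samples away and the image starts again from a single noisy sample per pixel.</p>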
<h3 id="12-problem-statement">1.2 Problem Statement</h3>
<p>Admittedly, given the limited capability of current hardware, mature real-time path tracing is still a long way off. However, on any given GPGPU, current parallel path tracing algorithms still have a lot of room for improvement in exploiting the full SIMD (Single Instruction Multiple Data) capability. It is therefore an interesting and promising task to optimize current parallel algorithms and hardware utilization as much as possible, as this helps bring real-time path tracing closer.</p>
<h3 id="13-project-objectives">1.3 Project Objectives</h3>
<p>In this project, I choose CUDA as the platform for my path tracing implementation. The main objective is to accelerate path tracing as much as possible by optimizing different factors, while other objectives include enhancing rendering quality and enriching functionality. Integrated with NVIDIA graphics cards and offering a C++ programming interface, CUDA has been popular among developers for its efficiency and scalability. In this interim report, I discuss the three topics I have studied so far: spatial acceleration structures (SAS), the path tracing algorithm and SIMD optimization. More specifically, the analysis focuses on how to optimize the construction and traversal of the SAS, enhance the efficiency and functionality of the path tracing algorithm, and exploit as much parallelism as possible on the graphics card.</p>
Sat, 03 Dec 2016 00:00:00 -0700
http://dqlin.xyz/tech/2016/12/03/01_intro/
http://dqlin.xyz/tech/2016/12/03/01_intro/tech[PBR Series - 6] Improvement of PBR<!--more-->
<h3 id="61-re-arrangement-of-textures">6.1 Re-arrangement of Textures</h3>
<p>A problem with the current implementation of PBR is that black seams appear when the roughness of an object is high. The “mip-map” arrangement of the pre-filtered environment map in the interim report uses pyramid-like scaling. Given a 256×256 resolution for a single face at LOD 0, a single face at LOD 7 will be only 4×4, in which UV coordinates may fall outside the region due to the limited float precision of the OpenGL texture2D function. In addition, the compressed texture sizes cause inaccurate colors, which is more apparent when the object gets larger. Therefore, a re-arrangement of the “mip-maps” is necessary.</p>
<p>Noticing that there are unused blank spaces in the upper left and lower right corners, why don’t we put the maps for higher roughness there? For LOD levels beyond 2, the size is kept the same as that of level 2. This solves the aforementioned problems while using the image canvas more evenly and efficiently. The shader modification is also simple: for LODs greater than 2, an array is declared to store their relative coordinates, and in the texture fetch a condition redirects the UV coordinates to the contents of the array. Below is a sample picture of our new pre-filtered environment map.</p>
<p><br /> <img src="/img/0016.png" alt="0016" /></p>
<center><small>Figure 16. New arrangement of Pre-filtered Environment map</small></center>
<p><br />
Another adjustment to the game textures is combining several monochrome textures into one. The maps for metalness, roughness, ambient occlusion and transmittance are assigned to the R, G, B and A channels respectively. This helps conserve texture slots, since our system has a mobile version using an OpenGL ES rasterizer that allows at most 8 game textures. The total number of textures for a single material is limited to 6 (diffuse map, mixed map (metalness, roughness, ambient occlusion and transmittance), normal map, irradiance map, pre-integrated BRDF map, pre-filtered environment map), not counting the additional environment maps required for environment switching.</p>
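<p>Packing the four monochrome maps into one RGBA texture is a per-pixel operation. A minimal sketch, assuming each map is given as a flat list of values in [0, 1] and the result is a list of 8-bit RGBA tuples (the function name is hypothetical):</p>

```python
def pack_mixed_map(metalness, roughness, occlusion, transmittance):
    """Pack four monochrome maps into the R, G, B and A channels."""
    def to_byte(value):
        # Quantize a [0, 1] value to an 8-bit channel, clamping out-of-range input.
        return max(0, min(255, round(value * 255)))

    return [
        (to_byte(m), to_byte(r), to_byte(o), to_byte(t))
        for m, r, o, t in zip(metalness, roughness, occlusion, transmittance)
    ]


# Two pixels: a rough dielectric and a smooth metal.
mixed = pack_mixed_map(
    metalness=[0.0, 1.0],
    roughness=[1.0, 0.2],
    occlusion=[1.0, 1.0],
    transmittance=[0.4, 0.0],
)
```

<p>In the shader, a single texture fetch then yields all four material parameters from the <code>.r</code>, <code>.g</code>, <code>.b</code> and <code>.a</code> components.</p>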
<h3 id="62-sss-based-on-normal-map">6.2 SSS Based on Normal Map</h3>
<p>A problem with the current local subsurface scattering is that it depends on a direct light source, which is a limitation when there is only image-based lighting. However, indirect lighting often has more evenly distributed irradiance, which means the effect of local-scale scattering is not affected much by the object’s orientation. Therefore, I approximate this case by assuming that the environment light arrives equally from all directions. Under this assumption, the result is equivalent to scattering the normal map, as explained for a similar case in the interim report. The scattering is determined only by the skin’s diffusion profile. In addition, scattering the normal map turns a real-time calculation into a pre-processing step, which helps boost performance.</p>
<p>The calculation uses a convolution with the 6 Gaussian functions introduced in the interim report. However, since texture space is distorted relative to the geometry, we need to “stretch” it back to its magnitude on the original 3D mesh. As in the texture space diffusion method (also introduced in the interim report), baking a “stretch map” solves the problem efficiently. In my implementation, this is done by calling the screen-space derivative functions (dFdx/dFdy in GLSL, the counterparts of ddx/ddy in HLSL), which return the local rate of change of a value between neighboring fragments, using the fragment color to store the stretch magnitude in the X and Y directions, and capturing the rendering result, which has been projected to texture space. The captured map is then used as a LUT input to the normal map scattering program.</p>
<p>This method gives a good rendering result. The grainy, dry look of skin without scattering is replaced by a pinkish, smooth appearance, while the details are preserved. The lack of local subsurface scattering under strongly contrasting environment light can be partly compensated for by the translucency effect.</p>
<h3 id="63-environment-switch">6.3 Environment Switch</h3>
<p>Up to now, our PBR only supports a single environment, or a single room if box-projected cube environment mapping is used. But our game system has complex scenes; game characters need to walk between rooms in a building, for example. Therefore, we must use different environment maps for different places, and even for different parts of a single large place. It is obviously impossible to store all of these textures in the texture slots of an object. In fact, we only need two sets of environment maps, one for each of two adjacent scenes. When moving from one place into another, the colors from the two sets of environment maps can be interpolated, which only requires the position of the border between them and the scale and position of the shaded object. Assuming we are walking from environment map 1 to environment map 2, we can derive the formula below:
<img src="/img/00f4.png" alt="00f4" />
where the positions and lengths are all projected to 1D, onto the line perpendicular to the environment border.</p>
<p>We can always keep 4 slots for environment maps, 2 for the older environment and 2 for the newer one (each environment has an irradiance map and a pre-filtered environment map), keeping the total number of textures at 8. When the object touches a new environment, we put the corresponding environment maps into the slots for the newer one. After it has fully entered that environment, we move the maps to the slots for the older one. The assignment of texture slots can be done by calling the appropriate APIs in the game engine. Thus, we can guarantee a smooth environment switch for PBR.</p>
<h3 id="64-conclusion">6.4 Conclusion</h3>
<p>Physically based rendering is the main methodology for achieving photorealistic rendering quality. In common polygon-rasterization-based game engines with limited computational resources, it is implemented by combining precomputation (irradiance map, pre-filtered environment map) with real-time shading using HDR image-based lighting. Meanwhile, many specialized methods have been researched and implemented to tackle particular problems, including subsurface scattering in skin and translucency. If all of these are packed into a PBR tool with a user-friendly API and detailed documentation, other people on a mobile or PC game project, such as artists, can achieve realistic PBR effects without extra effort.</p>
<p>All the research and implementation indicate that realistic rendering is a complex task. There is no “put things right once and for all” formula for it. Nature, governed by the laws of physics, has infinitely many structures that give rise to different phenomena, whereas computational power is always finite. It is impossible to simulate all phenomena with one formula. Instead, each particular problem should be tackled individually, and approximation is always used. The goal should be to give customers a satisfying result, with a level of realism that meets their needs. Nevertheless, experimental methods should always be encouraged, as computer graphics should keep developing toward perfection.</p>
<p><br /></p>
<h3 id="list-of-figures">List of Figures</h3>
<p>Figure 16: New arrangement of Pre-filtered Environment map. Source: My original picture generated by OpenCV.</p>
Fri, 02 Dec 2016 00:00:00 -0700
http://dqlin.xyz/tech/2016/12/02/06_new/
http://dqlin.xyz/tech/2016/12/02/06_new/tech[PBR Series - 5] Real-time Skin Translucency<!--more-->
<h3 id="51-diffusion-profile-and-skin-translucency">5.1 Diffusion Profile and Skin Translucency</h3>
<p>In Episode 3, I explained the use of the diffusion profile in subsurface scattering. The multipole model, which reduces the problem to a sum of Gaussian functions, is a good approximation. Since the BSSRDF is a generic function for both reflected and transmitted light, we can also use it to calculate the translucency color of skin that results from subsurface scattering inside the skin tissue. The only difference is that when we calculate the reflectance caused by local subsurface scattering, we only consider light on the same side as the normal, whereas for translucency the light source is at the back. Therefore, to calculate the outgoing radiance, we only need to know the irradiance at the back and the distance the light travelled through the skin. The biggest problem with translucency, however, is that it is hard in theory to determine the distance travelled and the incident point on the back side, since the geometry of human skin is complex.</p>
<p><br /> <img src="/img/0014.png" alt="0014" /></p>
<center><small>Figure 14. (a) without / (b) with local subsurface scattering (c) without / (d) with translucency
</small></center>
<h3 id="52-approximation-method">5.2 Approximation Method</h3>
<p>A 2010 paper by Jimenez et al. on real-time skin translucency proposes an approximation method. Noticing that translucency is most visible in thin parts such as the ears, we can solve the problem by using the inverse normal at the shading point to approximate the normal direction at the incident point on the back.</p>
<p>For direct lighting, the irradiance is simply the dot product of the inverse normal and the light direction. For image-based lighting, we can use the inverse normal to fetch from the irradiance map.</p>
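<p>For the direct lighting case, the back-side irradiance can be sketched as below. The clamp to zero and the <code>light_intensity</code> factor are my additions for the example (a light in front of the surface should contribute no transmitted light); the function names are hypothetical, and vectors are assumed to be unit length, with <code>light_dir</code> pointing from the surface toward the light.</p>

```python
def dot(a, b):
    """Dot product of two 3D vectors given as tuples."""
    return sum(x * y for x, y in zip(a, b))


def back_irradiance(normal, light_dir, light_intensity=1.0):
    """Irradiance at the assumed back-side incident point.

    The back-side normal is approximated by the inverse of the
    shading normal.
    """
    inverse_normal = tuple(-c for c in normal)
    return light_intensity * max(0.0, dot(inverse_normal, light_dir))


# Surface facing +z: a light directly behind it contributes fully,
# a light directly in front of it contributes nothing.
behind = back_irradiance((0.0, 0.0, 1.0), (0.0, 0.0, -1.0))
in_front = back_irradiance((0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
```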
<p>For the distance travelled by light inside the skin, Jimenez et al. also use an approximation: they simply ignore refraction and use the length of the segment inside the skin along a straight line from the light source to the shading point, as if there were nothing in between, which requires a depth map rendered from the light’s perspective.</p>
<h3 id="53-transmittance-map">5.3 Transmittance Map</h3>
<p>Actually, we can approximate further. We can avoid rendering depth maps and use the average thickness instead. One idea is to sample the distances through the skin by ray tracing from all incident (or departing) angles (-90° to 90°) and store the resulting value in a look-up texture. Implementing the ray tracing manually may not be easy. Luckily, there are many software solutions that can bake such a map from an input mesh. One of them is Knald, in which this map is called a transmittance map.</p>
<p><br /> <img src="/img/0015.png" alt="0015" /></p>
<center><small>Figure 15. Transmittance map</small></center>
<p><br />
After fetching from the texture, however, we may need to invert and scale the value or add an offset to translate the texture value into a real length.</p>
<p>Combining the aforementioned factors, the final transmittance color is
<img src="/img/00f3.png" alt="00f3" />.</p>
<p>This value must be added to the reflectance value, since the two come from different light sources. It is worth noting that because we don’t know the transmissivity (the ratio of transmitted light) of skin, we can determine a scaling coefficient empirically from the actual rendering results.</p>
<p><br /></p>
<h3 id="references">References</h3>
<p>Jimenez, J., Whelan, D., Sundstedt, V., & Gutierrez, D. (2010). Real-time realistic skin translucency. IEEE computer graphics and applications, (4), 32-41.</p>
<h3 id="list-of-figures">List of Figures</h3>
<p>Figure 14: (a) without / (b) with local subsurface scattering (c) without / (d) with translucency. Source: screen capture of online pdf of Real-time realistic skin translucency by Jimenez, J. on http://iryoku.com/translucency/downloads/Real-Time-Realistic-Skin-Translucency.pdf.</p>
<p>Figure 15: Transmittance map. Source: screen capture of a transmittance map generated by Knald.</p>
Fri, 02 Dec 2016 00:00:00 -0700
http://dqlin.xyz/tech/2016/12/02/05_stl/
http://dqlin.xyz/tech/2016/12/02/05_stl/tech[PBR Series - 4] High Dynamic Range Imaging and PBR<!--more-->
<h3 id="41-introduction-to-hdr">4.1 Introduction to HDR</h3>
<p>An important element of realistic physically based rendering is the use of high-dynamic-range (HDR) images. So far in this series, they have been approximated by ordinary low-dynamic-range images to produce sample results and to verify the correctness of the algorithm. However, since high-dynamic-range images require a floating-point picture format, the reading and processing of the source images need to be redesigned. In addition, since the built-in GLSL shader cannot directly read data from floating-point textures, the textures need to be compressed and encoded into integer formats like PNG, which is also within the scope of this research.</p>
<p>Light intensity in real-life scenes spans a huge range. The sun as a light source has a luminance of ~10<sup>9</sup> cd/m<sup>2</sup>, while the luminance of average starlight is below 0.001 cd/m<sup>2</sup>. However, common digital images (e.g. BMP, JPEG) have only 24-bit color depth, i.e. each color channel has an integer range of 0-255. Therefore, the range of light intensity that a normal computer image can represent is rather limited.</p>
<p>Dynamic range measures the scale of the intensity variation across a picture. It is defined as the base-10 logarithm of the ratio between the highest and lowest pixel luminance (in RGB color space, one formula for luminance is 0.2126R + 0.7152G + 0.0722B). Following this definition, a JPEG image has a dynamic range of only about 2.4 (log<sub>10</sub> 255), while real-life scenes often have a value above 9. The former is often referred to as a low dynamic range (LDR) image and the latter as a high dynamic range (HDR) image.</p>
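<p>As a quick check of these numbers, the sketch below computes the dynamic range of a list of RGB pixels as the base-10 logarithm of the ratio between the highest and lowest non-zero luminance, which is the reading under which an 8-bit image gives about 2.4. The function names are hypothetical, and excluding black pixels is my own choice to avoid dividing by zero.</p>

```python
import math


def luminance(r, g, b):
    """Relative luminance of a linear RGB pixel."""
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def dynamic_range(pixels):
    """log10 of the ratio between the highest and lowest non-zero luminance."""
    values = [luminance(*p) for p in pixels]
    positive = [v for v in values if v > 0.0]  # black pixels are excluded
    return math.log10(max(positive) / min(positive))


# Darkest non-black and brightest gray levels of an 8-bit image.
ldr_range = dynamic_range([(1, 1, 1), (255, 255, 255)])
```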
<p><br /> <img src="/img/0012.png" alt="0012" /></p>
<center><small>Figure 12. comparison of PBR using LDR and HDR environment maps
</small></center>
<p><br />
Because PBR relies heavily on image-based lighting, we need HDR images to achieve real-life lighting effects. With LDR images, one can barely perceive the presence of a light source in the photo. It is even worse when the object has low metalness: the whole lighting becomes dim and unrealistic. With HDR images and proper conversion techniques, these problems can all be solved.</p>
<p>HDR images use floating-point numbers to represent a pixel, which gives them almost full coverage of real-life light intensities. However, there are several floating-point formats, which are explained in the next section.</p>
<h3 id="42-floating-point-image-formats">4.2 Floating-point Image Formats</h3>
<p>Common floating-point image formats include FloatTIFF, Radiance RGBE and OpenEXR. Each of the three has its advantages and disadvantages, and we are going to pick the one most suitable for our needs.</p>
<p>FloatTIFF is a special extension of TIFF (Tagged Image File Format) that supports floating-point images, indicated by a tag in the file. In FloatTIFF, each color channel is represented by 32 bits, i.e. 96 bits in total for an RGB pixel. Despite its high fidelity, TIFF is criticized for its huge file size, since compression is rarely used for compatibility reasons. It may not be a good choice, since our system needs to be ported to mobile platforms and big files lead to slow loading.</p>
<p>Radiance RGBE is a popular HDR image format originally developed by Gregory Ward Larson for his Radiance ray-tracing software system. The RGBE format is unique in storing the intensity in a separate channel (E for exponent), while the remaining RGB channels keep the same color ratio as in LDR images. Its advantages are a small size (8 bits per channel, 32 bits in total) and a wide dynamic range; the trade-off is lower color accuracy.</p>
<p>OpenEXR is an HDR standard developed by Industrial Light & Magic. There are two sub-types: half float (16 bits) and full float (32 bits). The former is more commonly used for its smaller size and sufficient accuracy for game-level rendering. The 16 bits are divided into sign (1 bit), exponent (5 bits) and mantissa (10 bits). Half-float OpenEXR supports a dynamic range of about 12, which exceeds the capability of the human eye. It also supports ZIP compression, which is lossless. The format is widely supported in different software: Blender can bake images in half and full float OpenEXR format, and OpenCV supports automatic conversion of half float to full float for its internal processing. Given these advantages, we decided to use OpenEXR as the standard format for HDR images in our system.</p>
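<p>To show where the accuracy loss of RGBE comes from, here is a sketch of the shared-exponent encoding, modelled on the conversion routines commonly distributed with the Radiance format. It is an illustration of the idea with hypothetical function names, not a file reader or writer.</p>

```python
import math


def float_to_rgbe(r, g, b):
    """Encode a linear RGB pixel as three 8-bit mantissas and a shared exponent."""
    largest = max(r, g, b)
    if largest < 1e-32:
        return (0, 0, 0, 0)
    mantissa, exponent = math.frexp(largest)  # largest = mantissa * 2**exponent
    scale = mantissa * 256.0 / largest
    return (int(r * scale), int(g * scale), int(b * scale), exponent + 128)


def rgbe_to_float(r, g, b, e):
    """Decode an RGBE pixel back to linear RGB."""
    if e == 0:
        return (0.0, 0.0, 0.0)
    factor = math.ldexp(1.0, e - (128 + 8))
    return (r * factor, g * factor, b * factor)


encoded = float_to_rgbe(1000.0, 500.0, 250.0)
decoded = rgbe_to_float(*encoded)
```

<p>Because the exponent is chosen for the brightest channel, the dimmer channels are left with fewer significant bits. In this example the blue channel comes back as 248 instead of 250.</p>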
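<p>The half-float bit layout and the dynamic range of about 12 can be checked with Python’s <code>struct</code> module, whose <code>"e"</code> format is the IEEE 754 half-precision type. The sketch below assumes that the dynamic range is measured from the smallest subnormal half value to the largest finite one; the function name is hypothetical.</p>

```python
import math
import struct


def half_fields(value):
    """Split a half float into its sign, exponent and mantissa bit fields."""
    (bits,) = struct.unpack("<H", struct.pack("<e", value))
    sign = bits >> 15               # 1 bit
    exponent = (bits >> 10) & 0x1F  # 5 bits, biased by 15
    mantissa = bits & 0x3FF         # 10 bits
    return sign, exponent, mantissa


largest_half = (2.0 - 2.0 ** -10) * 2.0 ** 15  # largest finite half value
smallest_half = 2.0 ** -24                     # smallest positive (subnormal) half value
half_dynamic_range = math.log10(largest_half / smallest_half)
```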
<h3 id="43-processing-hdr-image-in-opencv">4.3 Processing HDR image in OpenCV</h3>
<p>Given the source code for generating the irradiance map and pre-filtered environment maps from LDR images, only a few changes are needed to support HDR processing. The trickiest one is that the second argument of cv::imread must be changed to a negative number (cv::IMREAD_UNCHANGED, which equals -1) to tell the function to load the image as is. Otherwise, OpenCV converts the floating-point data to 8-bit integer values, so the image is not represented correctly. When creating the cv::Mat, the data type should be CV_32FC3, which is compatible with both half and full float OpenEXR images. Since our environment map does not contain transparency information, the alpha channel can be safely ignored. In addition, when fetching pixel intensities from the cv::Mat using the “at” method, the template type needs to be cv::Vec3f instead of cv::Vec3b.</p>
<p>The calculation functions do not need to be modified, since they are physically based. As long as the pixel values are proportional to luminance (the physical quantity), the result is correct relative to the source. However, special attention must be paid to the sampling process in the numerical integration. The accuracy of the result depends on the number of samples taken in the Monte Carlo integration. For LDR images, 1024 samples are enough for a good approximation. For HDR images, the number of samples required depends on the actual dynamic range. Expressing the number of samples as a function of dynamic range is mathematically complicated, but a threshold that works for most images can be estimated from experience. In some cases (dynamic range > 10), the number of samples exceeds 1 million, which is extremely expensive in terms of time. Unfortunately, such cases cannot be ignored, as they often occur in a dark room lit by an intense light source. If the same number of samples is used as for LDR images, many bright noise points resembling fireworks appear around the intense light source. To deal with this, we need to consider dynamic range compression.</p>
<h3 id="44-rgbm-compression">4.4 RGBM Compression</h3>
<p>From the previous section we know that compression is important for HDR imaging in game-level rendering. In addition, since our Godot game engine only supports PNG and WebP textures, we need to encode our HDR images into these LDR formats. Fortunately, a solution provided by Brian Karis (2009), called RGBM encoding, solves both problems.</p>
<p><br /> <img src="/img/0013.png" alt="0013" /></p>
<center><small>Figure 13. RGBM Encoding function
</small></center>
<p><br />
The encoding algorithm is simple, as shown in the image above. The basic idea is to store a multiplier in the alpha channel, determined by the largest of the R, G and B values. The compression happens when the saturate function is applied to this maximum value. It is easy to see that the maximum value that can be represented after compression is the constant 6.0, since all colors are divided by 6.0 at the start. Therefore, we can treat this constant as the compression rate. To preserve a higher dynamic range, we can increase the value, but this comes at the expense of color accuracy. In my implementation, I found that 36.0 is a good choice for balancing these two factors.</p>
<p>With this technique, we can export our HDR images as PNG textures with an alpha channel. In the game engine, we simply need to enable the alpha channel and decode the texture as the figure shows. The constant in front should be our chosen compression rate.</p>
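<p>For readers who cannot see the figure, the encoding and decoding steps can be sketched in Python as follows. This follows the RGBM scheme as it is commonly published, with the range constant exposed as a <code>max_range</code> parameter (6.0 by default). The function names and the clamp on the stored color channels are my additions, and quantization of the color channels to 8 bits is not modelled.</p>

```python
import math


def rgbm_encode(color, max_range=6.0):
    """Encode a linear HDR color as (r, g, b, m) with every value in [0, 1]."""
    r, g, b = (c / max_range for c in color)
    # saturate(): values brighter than max_range are clamped here.
    m = min(max(r, g, b, 1e-6), 1.0)
    # Round the multiplier up to the next 8-bit step so that r, g, b stay <= 1.
    m = math.ceil(m * 255.0) / 255.0
    return tuple(min(c / m, 1.0) for c in (r, g, b)) + (m,)


def rgbm_decode(rgbm, max_range=6.0):
    """Recover the linear HDR color from an RGBM texel."""
    r, g, b, m = rgbm
    return tuple(max_range * c * m for c in (r, g, b))


restored = rgbm_decode(rgbm_encode((3.0, 1.5, 0.75)))
clamped = rgbm_decode(rgbm_encode((12.0, 0.0, 0.0)))
```

<p>The second call shows the compression: a value of 12 is above the representable range and comes back as 6. Passing <code>max_range=36.0</code> to both functions corresponds to the setting I use.</p>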
<h3 id="45-tone-mapping">4.5 Tone Mapping</h3>
<p>There is still one issue to deal with. Since an HDR image records the real intensity of each pixel, we need to map it into sRGB space to show it on screen, a process called tone mapping. The resulting image looks more vivid, since it combines the details of all parts as human eyes do, creating the impression of a wider dynamic range. To understand tone mapping, we must first introduce some concepts: exposure, gamma correction and white balance.</p>
<p>In photography, an HDR picture is often composed from several photos of the same scene taken at different exposure values (EV). Defined by <img src="/img/00f1.png" alt="00f1" /> , where N is the relative aperture (f-number) and t is the exposure time in seconds, the exposure value is inversely related to the amount of exposure. The larger the aperture and the longer the exposure time, the smaller the EV, i.e. the greater the amount of light coming in. A decrease of 1 EV is also called an increase of 1 “stop”. In HDR imaging, usually 3 to 5 pictures in the range of -2 to 2 stops are taken to capture a sufficient range of lighting conditions. At larger EV, overexposure can be avoided for very bright parts, such as objects lit directly by the sun. Conversely, at lower EV, dark areas preserve more detail. Software like Adobe Photoshop can be used to compose the HDR image from the LDR images, assigning each pixel a float value that is directly proportional to the real light intensity, without color correction for display.</p>
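<p>As a small worked example, the sketch below assumes the standard photographic definition EV = log<sub>2</sub>(N<sup>2</sup>/t), which matches the description of N and t above. The function name is hypothetical.</p>

```python
import math


def exposure_value(f_number, exposure_time):
    """Exposure value for an f-number N and an exposure time t in seconds."""
    return math.log2(f_number ** 2 / exposure_time)


# Doubling the exposure time lets in twice the light: EV drops by 1,
# which is an increase of one stop.
ev_short = exposure_value(8.0, 1.0 / 60.0)
ev_long = exposure_value(8.0, 1.0 / 30.0)
one_stop = ev_short - ev_long
```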
<p>However, to present an HDR image on the screen, a process called gamma correction must be applied. This concept comes from the interesting fact that the light intensity humans perceive is not the physical light intensity. Stevens’ power law states that the magnitude of sensation (S) and the physical intensity (I) are related by S ∝ I<sup>p</sup> for some power p. An image captured by a digital device records the physical intensity. Therefore, when presenting the image on screen, the intensity value must be raised to the power of 1/p to recover the sensation of that intensity. Current sRGB pictures are already gamma encoded, i.e. the RGB values have already been raised to the power of 1/p. In the sRGB standard, p is approximately 2.2. For HDR images, however, gamma correction has not been applied; the floating-point values record raw intensity. Therefore, we must apply gamma correction in the tone mapping process, which is important both for a more realistic result and for color blending with LDR textures in game.</p>
<p>White balance is another necessary process, which is also related to human visual perception. Different global illumination conditions slightly change the reflected color of objects, but human eyes tend to recover the material color (or diffuse color in PBR). The shift is often measured by how much compensation a white object needs. For example, white objects seem to have a colder color in cloudy conditions and a warmer color in sunlit conditions. Therefore, in tone mapping, we may want to add some red in cloudy conditions so that objects keep their original color and the whole scene looks more natural. This compensation is called white balance. Even when an image was taken by a white-balanced camera, the color temperature of the display can be another factor that requires white balancing.</p>
<p>Having this HDR data and knowing the aforementioned factors, how are we going to present the picture on screen? It depends on our purpose. Some algorithms based on the human visual system produce more realistic images, while others give a more artistically pleasing result (with stronger color contrast). Since we pursue realistic rendering, we prefer the former. A solution provided by John Hable (2010) at Filmic Games gives nicely realistic results. The algorithm takes two parameters, white balance level and exposure compensation, which can be determined empirically. In my implementation, I take 1.0 as white balance level and 4.0 as exposure compensation.</p>
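<p>For illustration, here is a sketch of the filmic curve from Hable's post, followed by gamma encoding. The curve constants are the ones Hable published, quoted from memory, so check them against the post before relying on them. The exposure default of 4.0 comes from the text above. Whether the “white balance level” above corresponds to the curve's linear white point is an assumption, so the default here is Hable's 11.2 rather than 1.0.</p>

```python
def hable_curve(x):
    """Filmic curve from Hable (2010); constants as published in his post."""
    a, b, c, d, e, f = 0.15, 0.50, 0.10, 0.20, 0.02, 0.30
    return ((x * (a * x + c * b) + d * e) / (x * (a * x + b) + d * f)) - e / f


def tone_map(hdr_rgb, exposure=4.0, white_point=11.2, gamma=2.2):
    """Map linear HDR values to display values in [0, 1].

    exposure:    multiplier applied before the curve (4.0 as in the text).
    white_point: linear value that maps to display white (assumed parameter).
    gamma:       display gamma used for the final encoding.
    """
    white_scale = 1.0 / hable_curve(white_point)
    result = []
    for channel in hdr_rgb:
        mapped = hable_curve(exposure * channel) * white_scale
        mapped = min(max(mapped, 0.0), 1.0)      # clamp to the displayable range
        result.append(mapped ** (1.0 / gamma))   # gamma-encode for the screen
    return tuple(result)


if __name__ == "__main__":
    for value in (0.0, 0.05, 0.5, 2.0, 10.0):
        print(value, tone_map((value, value, value))[0])
```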
<p>In PBR, the metalness value is directly multiplied with the HDR float value, which is equivalent to using a different exposure value. Therefore, with non-metals, bright parts like light sources often remain bright (the halo is reduced) while the darker environment is almost invisible, just like what we see in real life. However, there is one tricky part. Since we are working with LDR textures like diffuse maps, which are encoded for the sRGB color space, reverse gamma correction must be applied to those textures (by raising them to the power of 2.2) before blending.</p>
<p><br /></p>
<h3 id="references">References</h3>
<p>Hable, J. (2010, May 5). “Filmic Tonemapping Operators” [Online blog post]. Retrieved from http://filmicgames.com/archives/75.</p>
<p>Karis, B. (2009, April 28). “RGBM color encoding” [Online blog post]. Retrieved from http://graphicrants.blogspot.sg/2009/04/rgbm-color-encoding.html.</p>
<h3 id="list-of-figures">List of Figures</h3>
<p>Figure 12: Comparison of PBR using LDR and HDR environment maps. Source: http://gamedev.stackexchange.com/questions/62836/does-hdr-rendering-have-any-benefits-if-bloom-wont-be-applied.</p>
<p>Figure 13: RGBM Encoding Function. Source: screen capture of http://graphicrants.blogspot.sg/2009/04/rgbm-color-encoding.html.</p>
Fri, 02 Dec 2016 00:00:00 -0700
http://dqlin.xyz/tech/2016/12/02/04_hdr/
http://dqlin.xyz/tech/2016/12/02/04_hdr/tech[PBR Series - 3] Subsurface Scattering – Human Skin as Example<!--more-->
<h3 id="31-subsurface-scattering-and-fidelity">3.1 Subsurface Scattering and Fidelity</h3>
<p>So far, our PBR model only considers the interaction of light at the surface of an object. This is not a problem, as many models in 3D games are nearly opaque. However, when it comes to objects with a certain level of depth and translucency, like skin, marble, leaves and wax, the effect of subsurface scattering cannot be neglected if we want realistic results. Subsurface scattering makes the overall lighting softer, as light arriving at one position appears to bleed into neighboring areas.</p>
<p>Among all kinds of materials that require subsurface scattering, human skin is relatively complex. Human skin has numerous layers, and realistic rendering requires a model of at least 3 layers (the thin oily layer, the epidermis and the dermis) (d’Eon & Luebke, 2007). The first layer contributes the specular reflection, which can be simulated by the Cook-Torrance model. The other two layers require subsurface scattering. This kind of multi-layer subsurface scattering is an important research topic. The pictures below show the comparison between rendering without (a) and with (b) subsurface scattering. The result without subsurface scattering appears very dry-looking and unrealistic, while the latter one seems natural (Figure 8).</p>
<p>Many games include highly detailed human models. Complex textures, including bump maps, have been applied to simulate human skin. To match that level of fidelity, subsurface scattering is a must.</p>
<p style="text-align: center;"><img src="/img/0008.jpg" alt="0008" /></p>
<center><small>Figure 8 - Comparison Between Skin Rendering Without and With Subsurface Scattering</small></center>
<!--more-->
<h3 id="32-diffuse-profile-and-gaussian-approximation">3.2 Diffuse Profile and Gaussian Approximation</h3>
<p>As mentioned before, in subsurface scattering, the relationship between irradiance and radiance is governed by the BSSRDF. As skin is highly scattering, a dipole diffusion function was introduced (Jensen et al., 2001) to simulate this situation efficiently. The BSSRDF is reduced to depend only on the material’s scattering properties, the Fresnel terms at the point of incidence and at the point where radiance exits, and the distance between the two points.</p>
<p style="text-align: center;"><br /> <img src="/img/0009.png" alt="0009" /></p>
<center><small>Figure 9 - Dipole Diffusion Function</small></center>
<p><br /></p>
<p>In the formula above, S represents BSSRDF, Ft stands for Fresnel transmittance and R is the material’s diffusion profile.</p>
<p>The diffusion theory was later extended to simulate scattering in multi-layer materials by the use of a multipole (Donner & Jensen, 2005), to support more realistic rendering. The advantage of a diffusion profile is that it can be determined empirically. One only needs to run an experiment on a specific material to get a numerical form of the function and plot its shape. The dipole or multipole is then used to give an analytical approximation. However, the computation of such functions is very complex. Luckily, d’Eon et al. introduced an approximation method (2007). They fit the multipole curve with a weighted sum of multiple Gaussian functions with some heuristic coefficients.</p>
<p style="text-align: center;"><br /> <img src="/img/0010.png" alt="0010" /></p>
<center><small>Figure 10 – Error of Gaussian Approximation</small></center>
<p><br />
The coefficients are chosen such that the error represented by the formula above is minimized. For skin, six Gaussians are required to match the three-layer diffusion profile accurately. The coefficients of each Gaussian are determined separately for the red, green and blue channels.</p>
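<p>To make the sum-of-Gaussians idea concrete, the sketch below evaluates a six-Gaussian skin profile. The variances (in mm²) and per-channel weights are the values given in GPU Gems 3, chapter 14, quoted from memory, so treat them as illustrative and check them against the source. The function names are hypothetical.</p>

```python
import math

# (variance in mm^2, (red weight, green weight, blue weight))
SKIN_GAUSSIANS = [
    (0.0064, (0.233, 0.455, 0.649)),
    (0.0484, (0.100, 0.336, 0.344)),
    (0.187,  (0.118, 0.198, 0.000)),
    (0.567,  (0.113, 0.007, 0.007)),
    (1.99,   (0.358, 0.004, 0.000)),
    (7.41,   (0.078, 0.000, 0.000)),
]


def gaussian(variance, r):
    """Normalized 2D Gaussian evaluated at radial distance r."""
    return math.exp(-r * r / (2.0 * variance)) / (2.0 * math.pi * variance)


def skin_profile(distance_mm):
    """Diffusion profile R(r) per color channel as a weighted sum of Gaussians."""
    rgb = [0.0, 0.0, 0.0]
    for variance, weights in SKIN_GAUSSIANS:
        g = gaussian(variance, distance_mm)
        for channel in range(3):
            rgb[channel] += weights[channel] * g
    return rgb


if __name__ == "__main__":
    for r in (0.0, 0.5, 1.0, 2.0):
        print(r, skin_profile(r))
```

<p>With these weights, red light is still noticeable a couple of millimeters from the point of incidence while green and blue have almost vanished, which is what gives skin its reddish bleeding around shadow edges.</p>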
<h3 id="33--texture-space-diffusion-vs-pre-integration">3.3 Texture-Space Diffusion vs. Pre-Integration</h3>
<p>With skin diffusion profiles available, several methods can be used to compute the subsurface scattering. A popular method introduced by Borshukov and Lewis is called texture-space diffusion (2005). The idea is to store the incident light in texture space and use convolution steps to simulate diffusion, which is similar to the pre-filtered environment map in IBL. However, human characters in games are likely to move frequently, so the diffusion texture inevitably changes every frame, which implies real-time rendering to texture (RTT). Blender, as my working platform, does not have a customized texture buffer. In order to avoid editing Blender’s source code, I chose to find another method that does not use RTT.</p>
<p>Penner and Borshukov introduced an approximation method that pre-integrates the subsurface scattering effect in skin (2011). In this method, the shading process becomes local; the calculation is reduced to a simple pixel shader, and no extra rendering passes are required. The idea is that the visible scattering effect can be considered as the combined result of mesh curvature change, normal map bumps and shadows.</p>
<p>The first term, change of mesh curvature, affects incident light (and thus scattering) together with the angle between light direction and normal direction (described by N∙L). A 2D LUT indexed by the curvature and N∙L can be computed. It represents the accumulated light at each outgoing direction as a result of lighting a spherical surface of a given curvature, which approximates the spherical integration in the BSSRDF with a ring integration. Of course, the skin diffusion profiles are used in the computation.</p>
<p style="text-align: center;"><br /> <img src="/img/0011.png" alt="0011" /></p>
<center><small>Figure 11 - Illustration of Diffuse Scattering Integration on Ring</small></center>
<p><br />
When using a normal map as the bump texture, light scattered from a bump looks very similar to light reflected from a non-scattering surface with blurred-out bumps, so the scattering effect can be approximated by blurring the normal map. The level of blurring is different for each color channel, because different wavelengths have different diffusion profiles. Using a normal map for scattering without blurring each color channel to a different degree would result in unnatural, grainy shading.</p>
<p>In my implementation of this pre-integration method, the influence of shadows is temporarily omitted, since I have not yet found a way to add real-time shadows when customized shaders are used in Blender Game Engine. This issue will be tackled later. The combination of ring integration and blurred normal map already gives a nice result if we ignore the shadows.</p>
<p>The ring integration texture was generated in C++ with the aid of OpenCV. For each combination of the two independent variables (curvature and N∙L), 1024 samples were taken over the range from –PI/2 to PI/2 to ensure accuracy.</p>
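<p>The ring integration can be sketched as follows. This is not the C++/OpenCV generator described above but a small Python illustration of the integral from Penner and Borshukov's method: the clamped cosine on the ring is weighted by the diffusion profile evaluated at the chord distance 2r·sin(x/2), then normalized. The Gaussian table is the GPU Gems 3 skin fit quoted from memory, and the function names and parameters are hypothetical.</p>

```python
import math

# Six-Gaussian skin fit: (variance in mm^2, (R, G, B) weights), from memory.
SKIN_GAUSSIANS = [
    (0.0064, (0.233, 0.455, 0.649)), (0.0484, (0.100, 0.336, 0.344)),
    (0.187, (0.118, 0.198, 0.000)), (0.567, (0.113, 0.007, 0.007)),
    (1.99, (0.358, 0.004, 0.000)), (7.41, (0.078, 0.000, 0.000)),
]


def skin_profile(distance_mm):
    """Diffusion profile per channel as a weighted sum of Gaussians."""
    rgb = [0.0, 0.0, 0.0]
    for variance, weights in SKIN_GAUSSIANS:
        g = math.exp(-distance_mm ** 2 / (2.0 * variance)) / (2.0 * math.pi * variance)
        for c in range(3):
            rgb[c] += weights[c] * g
    return rgb


def ring_integrate(theta, radius_mm, samples=1024, half_range=math.pi / 2.0):
    """Pre-integrated diffuse lighting for one LUT entry.

    theta:     angle between the normal and the light direction (N.L = cos(theta)).
    radius_mm: radius of the sphere, i.e. 1 / curvature.
    Samples are taken at the midpoints of equal steps over [-half_range, half_range].
    """
    numerator = [0.0, 0.0, 0.0]
    denominator = [0.0, 0.0, 0.0]
    step = 2.0 * half_range / samples
    for i in range(samples):
        x = -half_range + (i + 0.5) * step
        chord = abs(2.0 * radius_mm * math.sin(x / 2.0))
        profile = skin_profile(chord)
        light = max(0.0, math.cos(theta + x))  # clamped Lambert term at the sample
        for c in range(3):
            numerator[c] += light * profile[c]
            denominator[c] += profile[c]
    return [numerator[c] / denominator[c] for c in range(3)]


if __name__ == "__main__":
    print(ring_integrate(0.0, 100.0))          # nearly flat surface facing the light
    print(ring_integrate(math.pi / 2.0, 2.0))  # strongly curved surface at the terminator
```

<p>On a nearly flat surface the result stays close to plain N∙L, while on a strongly curved surface some light, mostly red, appears past the terminator. A full LUT would call <code>ring_integrate</code> for every texel, mapping one axis to N∙L and the other to curvature.</p>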
<p>Another point worth mentioning is that GLSL has a built-in function in its fragment shader that helps calculate the curvature. Although this is convenient for programmers, the result is computed on the basis of triangle orientation, so the curvature is uniform inside each mesh triangle. For low-poly meshes this becomes a problem, because triangle edges are very obvious at positions with a strong scattering effect. To alleviate this problem, I stored the curvature in a texture and read it in the vertex shader. The per-vertex values are then interpolated across each triangle, as in Gouraud shading.</p>
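<p>The built-in function is not named above; a common approach, assumed here, uses the screen-space derivative functions (<code>dFdx</code>, <code>dFdy</code>, <code>fwidth</code>) and estimates curvature as the change in normal divided by the change in position. The sketch below shows the same finite-difference estimate offline, which is one way to bake curvature values for a texture. The function names are hypothetical.</p>

```python
import math


def subtract(a, b):
    return tuple(x - y for x, y in zip(a, b))


def length(v):
    return math.sqrt(sum(x * x for x in v))


def curvature_estimate(p0, n0, p1, n1):
    """Finite-difference curvature between two nearby surface samples.

    p0, p1: positions; n0, n1: unit normals at those positions.
    Returns |n1 - n0| / |p1 - p0|, which equals 1 / radius on a sphere.
    """
    return length(subtract(n1, n0)) / length(subtract(p1, p0))


if __name__ == "__main__":
    radius = 2.0
    d0 = (0.0, 0.0, 1.0)
    d1 = (math.sin(0.01), 0.0, math.cos(0.01))
    p0 = tuple(radius * c for c in d0)
    p1 = tuple(radius * c for c in d1)
    print(curvature_estimate(p0, d0, p1, d1))  # on a sphere the normal is the direction
```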
<p>Indirect light sources can also be used for subsurface scattering. The combination of image based lighting with pre-integration scattering gives more vivid rendering results.</p>
<p><br /></p>
<h3 id="references">References</h3>
<p>Borshukov, G., & Lewis, J. P. (2005, July). Realistic human face rendering for The Matrix Reloaded. In ACM SIGGRAPH 2005 Courses (p. 13). ACM.</p>
<p>d’Eon, E., & Luebke, D. (2007). Advanced techniques for realistic real-time skin rendering. GPU Gems, 3, 293-347.</p>
<p>Donner, C., & Jensen, H. W. (2005). Light diffusion in multi-layered translucent materials. ACM Transactions on Graphics (TOG), 24(3), 1032-1039.</p>
<p>Jensen, H. W., Marschner, S. R., Levoy, M., & Hanrahan, P. (2001, August). A practical model for subsurface light transport. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques (pp. 511-518). ACM.</p>
<p>Penner, E., & Borshukov, G. (2011). Pre-integrated skin shading. GPU Pro, 2, 41-54.</p>
<h3 id="list-of-figures">List of Figures</h3>
<p>Figure 8: Comparison Between Skin Rendering Without and With Subsurface Scattering. Source: GPU Gems 3 online book on http://http.developer.nvidia.com/GPUGems3/gpugems3_ch14.html</p>
<p>Figure 9: Dipole Diffusion Function. Source: online document of d’Eon et al on http://www.eugenedeon.com/wp-content/uploads/2014/04/efficientskin.pdf</p>
<p>Figure 10: Error of Gaussian Approximation. Source: Same as Figure 9.</p>
<p>Figure 11: Illustration of Diffuse Scattering Integration on Ring. Source: Penner’s slides at SIGGRAPH 2011 Courses on http://advances.realtimerendering.com/s2011/Penner%20-%20Pre-Integrated%20Skin%20Rendering%20(Siggraph%202011%20Advances%20in%20Real-Time%20Rendering%20Course).pptx</p>
Fri, 02 Dec 2016 00:00:00 -0700
http://dqlin.xyz/tech/2016/12/02/03_sss/
http://dqlin.xyz/tech/2016/12/02/03_sss/tech[PBR Series - 2] Image Based Lighting<!--more-->
<h3 id="21-the-concept">2.1 The Concept</h3>
<p>Image Based Lighting basically treats all pixels in an image as light sources. Usually, an environment map (typically a cubemap) created from a panoramic, high dynamic range (HDR) image is used as the source of texture fetches. Assuming the shaded object to be opaque, we only need to consider specular reflection and diffuse reflection. However, since the light source consists of numerous contiguous pixels, we need to integrate the BRDF over them to get the shading result of a surface point. In computer graphics, integration is approximated by sampling. For accuracy, the number of samples must grow in proportion to the number of pixels, which is far too many for real-time rendering. Therefore, the necessary steps are baked into textures beforehand, and only texture fetches are done at render time. Before that, we need to solve a problem – how to fetch a pixel from the environment map?</p>
<h3 id="22---fetching-pixels-from-environment-map">2.2 Fetching Pixels from Environment Map</h3>
<p>On any kind of surface, the radiance value of a pixel can be seen as reflected from the other side of the surface normal, at the same angle to the normal as the view direction. (This is exactly the case for specular reflection on a perfectly smooth surface; for other situations like diffuse reflection, the environment map can store an imaginary source point as the composite result of radiance.) In other words, the pixel we need to fetch can be seen as the target that the camera ray hits after reflection.</p>
<p style="text-align: center;"><img src="/img/0003.jpg" alt="0003" /></p>
<center><small>Figure 3 - Cubemap Pixel Fetching Illustration</small></center>
<!--more-->
<p><br /></p>
<p>Cube mapping is a popular method of environment mapping as it has a simple mathematical form. This method treats the environment as a surrounding box, with the environment panorama wrapped and mapped onto 6 faces. In GLSL, there is a function textureCube() that does the fetch for a given reflection direction. However, the reflected ray is assumed to start at the exact center of the cube. This is not a severe issue for skyboxes that represent a faraway environment. However, when we need to represent reflections inside a small room, the result is heavily distorted, for example when fetching the reflected color for a ball close to a wall.</p>
<p>To solve the problem, I found a method called box projected cubemap environment mapping (BPCEM) (behc, 2010).</p>
<p style="text-align: center;"><br /> <img src="/img/0004.png" alt="0004" /></p>
<center><small>Figure 4 - Box Projected Cubemap Environment Mapping</small></center>
<p><br /></p>
<p>This method has a simple mathematical form. As Figure 4 shows, it requires the size of the room and the relative position of the shaded object. The intersection between the room borders and the reflected camera ray can then be calculated easily. The corrected fetching direction is the vector from the assumed sampling center (the center of the room by default) to the intersection. This method is very intuitive and gives a very good approximation. Therefore, I adapted the method in GLSL for rendering scenes inside closed rooms.</p>
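<p>The steps above can be sketched outside the shader as well. The following Python sketch assumes an axis-aligned room and mirrors what a GLSL version would do: reflect the view ray, intersect it with the room box, and build the corrected lookup direction from the cubemap center. The function and parameter names are hypothetical.</p>

```python
import math


def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def reflect(incident, normal):
    """Same convention as GLSL reflect(): I - 2 * dot(N, I) * N."""
    d = sum(i * n for i, n in zip(incident, normal))
    return tuple(i - 2.0 * d * n for i, n in zip(incident, normal))


def box_projected_direction(position, direction, box_min, box_max, probe_center):
    """Corrected cubemap lookup direction for a point inside an axis-aligned room.

    position:     shaded point (must lie inside the box).
    direction:    reflected ray direction (any non-zero length).
    box_min/max:  opposite corners of the room.
    probe_center: position the cubemap was captured from.
    """
    direction = normalize(direction)
    t_hit = math.inf
    for p, d, lo, hi in zip(position, direction, box_min, box_max):
        if d > 0.0:
            t_hit = min(t_hit, (hi - p) / d)   # distance to the far wall on this axis
        elif d < 0.0:
            t_hit = min(t_hit, (lo - p) / d)   # distance to the near wall on this axis
    hit = tuple(p + t_hit * d for p, d in zip(position, direction))
    return tuple(h - c for h, c in zip(hit, probe_center))


if __name__ == "__main__":
    view = (0.0, -1.0, -1.0)
    r = reflect(normalize(view), (0.0, 0.0, 1.0))
    print(box_projected_direction((0.5, 0.0, -0.9), r,
                                  (-1.0, -1.0, -1.0), (1.0, 1.0, 1.0),
                                  (0.0, 0.0, 0.0)))
```

<p>The returned vector does not need to be normalized before the cubemap fetch, since only its direction matters.</p>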
<h3 id="23--irradiance-map-and-spherical-harmonics">2.3 Irradiance Map and Spherical Harmonics</h3>
<p>After solving the texture fetching problem, we come back to calculating illumination in IBL. The diffuse part of IBL is particularly non-trivial. The texture we want to pre-calculate according to the BRDF is known as the irradiance map. Specular reflection only needs a small sampling range, which grows with the surface roughness according to the Cook-Torrance model. In contrast, IBL diffuse reflection needs to consider contributions from pixels in all visible directions, which is a vast amount compared with specular reflection. Real-time sampling is nearly impossible and even preprocessing becomes hard. Thanks to SIGGRAPH, there is an efficient approximation for computing the irradiance map (Ramamoorthi & Hanrahan, 2001). It turns out that by computing and using 9 spherical harmonic coefficients of the lighting, the average error of the rendering result is only 1%.</p>
<p>I wrote a C++ program that computes the 9 spherical harmonic coefficients for any cubemap up to 2048x2048 almost instantly. With the 9 spherical harmonic coefficients, the irradiance value for a given direction can actually be computed in real time. However, to avoid long expressions in the shader, I pre-compute the result as an irradiance map by traversing all reflecting directions.</p>
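<p>For reference, the evaluation step can be sketched as below. It uses the quadratic form and constants from Ramamoorthi and Hanrahan's paper, quoted from memory, for a single color channel. The dictionary layout keyed by (l, m) is a hypothetical choice for readability, and projecting the cubemap onto the 9 coefficients is not shown.</p>

```python
import math

# Constants from Ramamoorthi & Hanrahan (2001), as recalled.
C1, C2, C3, C4, C5 = 0.429043, 0.511664, 0.743125, 0.886227, 0.247708


def sh_irradiance(coeffs, normal):
    """Irradiance for a unit normal from 9 lighting coefficients.

    coeffs: dict mapping (l, m) to the lighting coefficient L_lm of one channel.
    normal: unit vector (x, y, z).
    """
    x, y, z = normal
    L = coeffs
    return (C1 * L[(2, 2)] * (x * x - y * y)
            + C3 * L[(2, 0)] * z * z
            + C4 * L[(0, 0)]
            - C5 * L[(2, 0)]
            + 2.0 * C1 * (L[(2, -2)] * x * y + L[(2, 1)] * x * z + L[(2, -1)] * y * z)
            + 2.0 * C2 * (L[(1, 1)] * x + L[(1, -1)] * y + L[(1, 0)] * z))


def uniform_lighting(radiance):
    """Coefficients of a constant environment: only the l = 0 term is non-zero."""
    coeffs = {(l, m): 0.0 for l in range(3) for m in range(-l, l + 1)}
    coeffs[(0, 0)] = radiance * 2.0 * math.sqrt(math.pi)  # integral of Y00 over the sphere
    return coeffs


if __name__ == "__main__":
    print(sh_irradiance(uniform_lighting(1.0), (0.0, 0.0, 1.0)))
```

<p>For an RGB environment map the same function is evaluated three times, once per channel, with that channel's 9 coefficients.</p>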
<h3 id="24---efficient-approximation-of-specular-reflection">2.4 Efficient Approximation of Specular Reflection</h3>
<p>The Cook-Torrance microfacet specular shading model (Cook & Torrance, 1981) is used to calculate the IBL specular reflection.</p>
<p style="text-align: center;"><br /> <img src="/img/0005.png" alt="0005" /></p>
<center><small>Figure 5 - Cook-Torrance Specular Shading Model</small></center>
<p><br /></p>
<p>D, F and G stand for the Beckmann distribution factor, the Fresnel term and the geometric attenuation term respectively. However, the 3 formulas are also complex, and efficient approximations need to be found. Thanks to SIGGRAPH again, the shading model of Unreal Engine 4 was presented in a SIGGRAPH 2013 course (Karis, 2013), in which computationally efficient algorithms were chosen to approximate the formulas and the integration. The integration is done by importance sampling, a general technique for estimating an integral by drawing samples from a distribution that concentrates on the important regions. To further reduce the amount of computation, a method called Split Sum Approximation is proposed in the course notes. The integration is split into the product of two sums, both of which can be pre-calculated. The first sum is a convolution of the environment map for a given roughness under the Cook-Torrance microfacet model. Because we want to choose different roughness levels for different objects in the same environment, the results for different roughness values need to be stored in the mip-map levels of a cubemap. There is a DirectX image format called DirectDraw Surface (.dds) that supports storing self-created mip-map levels. Unfortunately, Blender does not support reading mip-maps in this format. Therefore, I came up with a method that arranges all cubemap mip-map levels in a normal bitmap texture (as shown in Figure 6) and fetches the desired pixel from the corresponding region.</p>
<p style="text-align: center;"><br /> <img src="/img/0006.jpg" alt="0006" /></p>
<center><small>Figure 6 - Storing Cubemap Mip-map Levels in a Single Texture (my original picture)</small></center>
<p><br /></p>
<p>The second sum, equivalent to integrating the specular BRDF over a pure white environment, is easier to compute. It can be further split into the sum of two integrals, which leaves roughness and incident angle as the two inputs and gives a scale and a bias as the two outputs. Furthermore, all parameters fall into the range between 0 and 1; therefore, the result of the function can be pre-calculated and stored in a texture. Note that the second sum contains the Fresnel term. The Fresnel term, a factor that describes how reflectivity changes with the incident angle, exhibits stronger contrast between center and edge when the shaded object has lower metalness (the base reflectivity of non-metals is lower, and the reflectivity of all materials approaches 100% when the reflecting angle gets closer to 90 degrees). Since this effect is easy to notice in practice, it is indispensable for realistic rendering.</p>
<p style="text-align: center;"><br /> <img src="/img/0007.png" alt="0007" /></p>
<center><small>Figure 7 - Fresnel Effect on a Dielectric Ball</small></center>
<p><br /></p>
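<p>The second sum can be sketched as follows. This follows the structure of the reference code in Karis's course notes, quoted from memory, which uses the GGX distribution with a Smith-Schlick geometry term rather than the Beckmann distribution named above. It is an illustration of the technique under those assumptions, not the IBL Tool's implementation, and the function names are hypothetical.</p>

```python
import math


def radical_inverse_base2(i):
    """Van der Corput sequence: mirror the bits of i around the binary point."""
    result, fraction = 0.0, 0.5
    while i > 0:
        if i & 1:
            result += fraction
        i >>= 1
        fraction *= 0.5
    return result


def importance_sample_ggx(u1, u2, roughness):
    """Half vector in tangent space (normal = +z), distributed according to GGX."""
    a = roughness * roughness
    phi = 2.0 * math.pi * u1
    cos_theta = math.sqrt((1.0 - u2) / (1.0 + (a * a - 1.0) * u2))
    sin_theta = math.sqrt(max(0.0, 1.0 - cos_theta * cos_theta))
    return (sin_theta * math.cos(phi), sin_theta * math.sin(phi), cos_theta)


def g_smith(roughness, n_dot_v, n_dot_l):
    """Smith geometry term with the Schlick approximation, k = roughness^2 / 2."""
    k = roughness * roughness / 2.0
    g1_v = n_dot_v / (n_dot_v * (1.0 - k) + k)
    g1_l = n_dot_l / (n_dot_l * (1.0 - k) + k)
    return g1_v * g1_l


def integrate_brdf(roughness, n_dot_v, num_samples=1024):
    """Return (scale, bias) to apply to the base reflectivity F0; needs n_dot_v > 0."""
    v = (math.sqrt(max(0.0, 1.0 - n_dot_v * n_dot_v)), 0.0, n_dot_v)
    scale = bias = 0.0
    for i in range(num_samples):
        h = importance_sample_ggx(i / num_samples, radical_inverse_base2(i), roughness)
        v_dot_h_raw = v[0] * h[0] + v[2] * h[2]          # v has no y component
        n_dot_l = min(max(2.0 * v_dot_h_raw * h[2] - v[2], 0.0), 1.0)
        n_dot_h = min(max(h[2], 0.0), 1.0)
        v_dot_h = min(max(v_dot_h_raw, 0.0), 1.0)
        if n_dot_l > 0.0:
            g_vis = g_smith(roughness, n_dot_v, n_dot_l) * v_dot_h / (n_dot_h * n_dot_v)
            fresnel = (1.0 - v_dot_h) ** 5
            scale += (1.0 - fresnel) * g_vis
            bias += fresnel * g_vis
    return scale / num_samples, bias / num_samples


if __name__ == "__main__":
    for n_dot_v in (0.1, 0.5, 1.0):
        print(n_dot_v, integrate_brdf(0.5, n_dot_v))
```

<p>At render time the specular color would be multiplied by the scale and added to the bias, with both values fetched from the LUT by roughness and N∙V. The bias grows toward grazing angles, which is the Fresnel effect described above.</p>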
<h3 id="25-the-ibl-tool">2.5 The IBL Tool</h3>
<p>Physically based rendering is not an easy process. Physical entities have continuous geometry, while computation is done on a discrete basis. Calculus used in physics, such as integrals, must be converted into discrete sampling. Even then, current computational power still requires many approximation techniques and procedural tactics like look-up textures (LUT). IBL is a very good example of the complexity of PBR. Because of that, a tool that does all the pre-computations with minimal input commands would be convenient for artists and programmers. Therefore, I developed IBL Tool, a Windows console program to make things easy (Sorry, this IBL Tool is now a proprietary software of my intern company so I cannot release it :). The user only needs to supply the environment map texture, and all LUT textures for IBL are produced. The program is also customizable. The user can choose from 3 different output patterns – the separated faces, the standard Blender format and the spread box format, depending on what the game engine requires.</p>
<p>In addition, a pair of sample GLSL shaders (vertex and fragment) is provided with the program, with custom fields indicated. A tone mapping step that adjusts exposure and gamma value is also included, so that the user can get a higher dynamic range for shading if required.</p>
<p><br /></p>
<h3 id="references">References</h3>
<p>Behc (2010, April 20) Box Projected Cubemap Environment Mapping [Online forum post]. Retrieved from http://www.gamedev.net/topic/568829-box-projected-cubemap-environment-mapping/.</p>
<p>Cook, R. L., & Torrance, K. E. (1981, August). A reflectance model for computer graphics. In ACM Siggraph Computer Graphics (Vol. 15, No. 3, pp. 307-316). ACM.</p>
<p>Karis, B., & Games, E. (2013). Real Shading in Unreal Engine 4. part of “Physically Based Shading in Theory and Practice,” SIGGRAPH.</p>
<p>Ramamoorthi, R., & Hanrahan, P. (2001, August). An efficient representation for irradiance environment maps. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques (pp. 497-500). ACM.</p>
<h3 id="list-of-figures">List of Figures</h3>
<p>Figure 3: Cubemap Pixel Fetching Illustration. Source: https://en.wikipedia.org/wiki/Reflection_mapping</p>
<p>Figure 4: Box Projected Cubemap Environment Mapping. Source: behc’s post on http://www.gamedev.net/topic/568829-box-projected-cubemap-environment-mapping/</p>
<p>Figure 5: Cook-Torrance Specular Shading Model. Source: Karis’ notes at SIGGRAPH 2013 on http://blog.selfshadow.com/publications/s2013-shading-course/karis/s2013_pbs_epic_notes_v2.pdf</p>
<p>Figure 6: Storing Cubemap Mip-map Levels in a Single Texture. Source: My original picture generated by OpenCV.</p>
<p>Figure 7: Fresnel Effect on a Dielectric Ball. Source : an article “Interaction of Light and Materials” on http://www.pernroth.nu/lightandmaterials/</p>
Fri, 02 Dec 2016 00:00:00 -0700
http://dqlin.xyz/tech/2016/12/02/02_ibl/
http://dqlin.xyz/tech/2016/12/02/02_ibl/tech[PBR Series - 1] Introduction to PBR<p>This series introduces the implementation of physically based rendering in light-weight game engines using GLSL shaders and precomputed textures. With these techniques, most physically based surface-to-surface reflection can be simulated across a variety of materials, ranging from metals to dielectrics, under HDR lighting. Subsurface scattering can be simulated as well, with human skin as an example, so that you can achieve high-fidelity game objects with few resources. Most of the content comes from my project report as an intern R&D engineer at a game company from May to Nov 2015. I want to share what I learned, what I thought about and what I did on this topic, to make it easier for more people to dig into it. There are 6 episodes.
<!--more--></p>
<ul>
<li>1 Introduction to PBR</li>
<li>2 Image Based Lighting</li>
<li>3 Subsurface Scattering – Human Skin as Example</li>
<li>4 High Dynamic Range Imaging and PBR</li>
<li>5 Real-time Skin Translucency</li>
<li>6 Improvement of PBR</li>
</ul>
<p><br /></p>
<h3 id="11-photorealistic-rendering-and-shading-language">1.1 Photorealistic Rendering and Shading Language</h3>
<p>Photorealistic rendering is a crucial part of computer graphics. In the early days of computer graphics, most hardware only supported a fixed-function rendering pipeline due to performance limitations. Things like lighting and texture mapping were processed in a hard-coded manner, resulting in coarse and unrealistic images. Nowadays, with the development of hardware power and software techniques, rendering pipelines have become programmable. High-level shading languages, which allow a nearly unlimited range of rendering effects, are applied directly in specific stages of the rendering pipeline. Many ideal physical models have been successfully implemented with GPU programming to render highly photorealistic 3D scenes and characters.</p>
<p>One of the most common high-level shading languages is the OpenGL Shading Language (GLSL), which is also the main tool used in this project. Unlike stand-alone applications, GLSL shaders require a host application that calls the OpenGL API, and they are passed as a set of strings to the graphics driver for compilation. The two most commonly used kinds of shader are the vertex shader and the fragment shader, which are applied to each mesh vertex and each fragment (i.e. covered pixel) respectively in the graphics pipeline. The vertex shader determines the final position of the mesh vertices that are presented to the user. It also passes variables such as normals, colors or coordinates to the fragment shader. The fragment shader is then executed to give each covered pixel its intended color. In addition, external uniforms can be passed into the shaders to provide data like texture samplers and the camera position.</p>
<p>Here is a simplified diagram showing where shaders are applied in the graphics pipeline:</p>
<p style="text-align: center;"><br /> <img src="/img/0001.png" alt="0001" /></p>
<center><small>Figure 1 – Rendering Pipeline (my original picture)</small></center>
<!--more-->
<h3 id="12--implementation-based-on-blender-game-engine">1.2 Implementation Based on Blender Game Engine</h3>
<p>In this series, we mainly use Blender as the example of light-weight game engine and the platform for graphic rendering and 3D games implementation. As a professional open-source 3D computer graphics software, Blender supports many functions including 3D modelling, UV unwrapping, texturing and rendering. Using OpenGL as its graphic library, Blender supports customized GLSL shaders in rendering in its Blender Game Engine. Figure 2 is a screen capture from Blender 2.75 showing the effect of a simplistic shader that shades a cube with purely red color with no lighting. It is noticeable that the shaders are passed as arguments of Blender’s Python API, which are then passed to the internal OpenGL API. Like other game engines, Blender also supports passing game primitives like material’s lighting configurations and camera’s view transformation matrix to the shaders, which allows intuitive coordination between shaders and the scene.</p>
<p style="text-align: center;"><br /> <img src="/img/0002.png" alt="0002" /></p>
<center><small>Figure 2 - A Sample Shader in Blender (my original picture)</small></center>
<h3 id="13--physically-based-rendering">1.3 Physically Based Rendering</h3>
<p>Physically based rendering (PBR) is a growing trend in the CG game and film industries. Basically, PBR renders images according to mathematical models of real-life physics. Thanks to the growth of computation power, many complex models can now be approximated with high precision, which gives highly realistic images.</p>
<p>A very important issue in PBR is simulating realistic lighting, for which the bidirectional reflectance distribution function (BRDF) is used. A BRDF gives the relationship between irradiance and radiance, i.e. how light intensity is distributed over each reflection direction after a light ray hits an opaque surface at a given angle. It covers all kinds of reflection (e.g. specular reflection and diffuse reflection), depending on the material of the surface. Usually, different BRDFs that assume specific properties of the surface are combined to give a realistic lighting result. Common BRDFs include the Lambertian model (assuming the surface to be perfectly diffuse), the Blinn-Phong model (a traditional lighting model that approximates both diffuse and specular reflection) and the Cook-Torrance model (a more complex model that treats the surface as microfacets to give more accurate specular reflectance with Fresnel effects).</p>
<p>The use of complex textures for different physical attributes is a crucial factor that makes PBR realistic. When it comes to complex models that have different physical attributes in different parts of the surface, a very important constraint must be considered – conservation of energy. In other words, the more reflectivity (short for specular reflection) an object exhibits, the less diffusion (short for diffuse reflection) it gives. Metalness is a parameter used to determine the ratio between reflectivity and diffusion (which sum up to 1). The name comes from the fact that metals usually have high reflectivity and low diffusion, while dielectric (non-metal) materials have the reverse. A metalness map can therefore be used as a texture to set the metalness in different parts of the object (most real-life objects have different metalness across their surface), which gives artists more control. Another attribute that has to obey energy conservation is roughness. Treating any surface as infinitely many microfacets, being rougher only means having a larger variance of microfacet orientation, which obviously obeys conservation of energy – the rougher a surface is, the more widely the reflected image spreads. Similarly, another texture, the roughness map, can be used here.</p>
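<p>A deliberately simplified sketch of this energy split: a Lambertian diffuse term and a Blinn-Phong specular term are weighted by (1 - metalness) and metalness, so the two weights always sum to 1. The Blinn-Phong lobe, the <code>shininess</code> parameter and the function names are illustrative assumptions. A full PBR shader would use the Cook-Torrance model and a Fresnel term instead.</p>

```python
import math


def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def shade(normal, light_dir, view_dir, albedo, metalness, shininess=32.0):
    """Single-channel shading with an energy split driven by metalness.

    light_dir and view_dir point away from the surface; they must not be
    exactly opposite, otherwise the half vector is undefined.
    """
    n, l, v = normalize(normal), normalize(light_dir), normalize(view_dir)
    n_dot_l = max(0.0, dot(n, l))
    half = normalize(tuple(a + b for a, b in zip(l, v)))
    specular = max(0.0, dot(n, half)) ** shininess if n_dot_l > 0.0 else 0.0
    diffuse_weight = 1.0 - metalness   # energy left for diffuse reflection
    specular_weight = metalness        # energy given to specular reflection
    return diffuse_weight * albedo * n_dot_l + specular_weight * specular


if __name__ == "__main__":
    up = (0.0, 0.0, 1.0)
    for metalness in (0.0, 0.5, 1.0):
        print(metalness, shade(up, (0.3, 0.0, 1.0), up, albedo=0.5, metalness=metalness))
```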
<p>Direct illumination from simplified light models is easy to implement in shaders. Choosing a suitable BRDF and knowing related material attributes is enough to give a result. However, when it comes to global illumination, calculating the correct lighting becomes not so trivial. In most scenes, the background environment reflects light and illuminates its surroundings indirectly. Also, realistic lights may have different shapes and color distribution. These phenomena exceed the capability and constraints of light models by far. Although BRDF is also applicable to these situations, it requires integral and other complex mathematical computations that cannot be carried out directly. Therefore, a technique called image based lighting (IBL) comes to rescue, which is also the first issue that I come to research and implement. In next episode, I will talk about how I implement the image based lighting.</p>
<p>Another issue is subsurface scattering. The BRDF assumes an opaque surface, where all reflection happens at the point of incidence (diffuse reflection is actually light scattered by the material’s internal microscopic irregularities, but it can be treated as happening directly at the surface). However, for materials that are not fully opaque (for example, human skin), the effects of transmittance and subsurface scattering cannot be ignored. Calculating transmittance is easy, as it is effectively an inverted BRDF on the opposite side of the surface. Subsurface scattering is more complex, because light scattered beneath the surface can exit at a location other than the point of incidence. A general function, the Bidirectional Surface Scattering Reflectance Distribution Function (BSSRDF), describes the relation between irradiance and radiance in this phenomenon. However, this function is too general, so special cases are usually adopted for specific tasks in order to be more computationally efficient. Since my task is currently restricted to rendering human skin, I will focus on the special case of subsurface scattering inside human skin. Episode 3 will introduce the method I use and the details of the implementation.</p>
<p><br /></p>
<h3 id="list-of-figures">List of Figures</h3>
<p>Figure 1: Rendering Pipeline. Source: My original diagram, created in Word</p>
<p>Figure 2: A Sample Shader in Blender. Source: My original screenshot of Blender</p>
Fri, 02 Dec 2016 00:00:00 -0700
http://dqlin.xyz/tech/2016/12/02/01_pbr/
http://dqlin.xyz/tech/2016/12/02/01_pbr/tech山西游记之一 (A Travel in Shanxi: Part 1)<!--more-->
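<h3 id="抵达太原">Arriving in Taiyuan</h3>
<p>1. Shanxi lies west of the Taihang Mountains. In ancient times it was also called Hedong, meaning "east of the Yellow River". I have always found Shanxi mysterious. Why? Most of northern China is flat plain; only Shanxi rises up on the Loess Plateau. People today say, "For five thousand years of heritage, look underground in Shaanxi and above ground in Shanxi", meaning that Shanxi's preservation of antiquities is unmatched in China. Through the conflicts of successive dynasties the Central Plains were thrown into turmoil, and nearly all of their relics were destroyed in the fires of war. The south, though peaceful, was developed very late, and its relics are hardly worth mentioning. Most of China's surviving antiquities were recovered by excavation. Only Shanxi, protected by its rugged terrain, managed to keep its ancient sites above ground. Nanchan Temple in Wutai has stood since the Tang dynasty and is the earliest surviving timber-frame building in the country. Shanxi also preserves the most buildings from before the Ming dynasty. To see these sights, I traveled from Fujian to Shanxi and did not mind the hardship. (The original of this paragraph was written in clumsy classical Chinese, a textbook example of a failed attempt at showing off...)</p>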
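<p>2. At our first dinner, the appetizer turned out to be a bottle of aged-vinegar "oral liquid"! That is a rather aggressive way to promote a local specialty. But Shanxi vinegar has a sweet note, and drinking it was surprisingly refreshing...
<br /> <img src="/img/老醋.jpg" alt="Life begins with a bottle of aged vinegar" /></p>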
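<p>3. Dinner was at the Shanxi Huiguan (山西会馆), apparently the place where Shanxi University holds its alumni gatherings. The decor blends the university's old campus with a Shanxi farmhouse, and the rustic local dishes were full of character.
<br /> <img src="/img/桌布.jpg" alt="Tablecloth" /></p>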
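<p>3. After dinner a thunderstorm broke, the summer heat vanished at once, and the raindrops were quite cool. It was the same when I visited Shaanxi in 2009. The chill of the land here is very different from the south.</p>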
<p>4. Taiyuan looks good at night. The city is clean and gracious, far outclassing Zhengzhou.
<br /> <img src="/img/太原夜景.jpg" alt="Taiyuan at night" /></p>
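<p>5. We spent the night at the Yuyuan Hotel (愉园酒店), apparently one of Taiyuan's long-established hotels. Inside, the facilities were indeed dated: no hair dryer, a fridge that would not open, and even a broken lock on the bathroom door... In the morning I drew back the curtains and looked out over Taiyuan's old town, quiet and peaceful under a light haze.
<br /> <img src="/img/太原晨景.jpg" alt="Taiyuan in the morning" /></p>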
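<p>6. The next morning we had breakfast at a morning market near the Kaihua Temple antiques market: wontons and egg-filled pancakes. Every breakfast item you would expect in the north can be found here. The big fried flatbreads look like several youtiao (fried dough sticks) joined together, and are very tempting. (Later, at Mount Wutai, I had youtiao for breakfast, was startled by how good they were, and regretted not buying the fried flatbread in Taiyuan.)
<br /> <img src="/img/早市.jpg" alt="Morning market" /></p>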
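<p>7. The woman running the breakfast stall was very kind. I went over several times to carry bowls of wontons, and each time she carefully reminded me to watch out for the crowd behind me. And of course every table here has aged vinegar on it :) The morning market was bustling and packed shoulder to shoulder. We bought a few peaches for almost one fifth of the price in Xiamen, and they tasted better too. It seems fruit really is best eaten where it is grown.</p>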
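<p>8. We got ready to take the expressway. The toll station had several lanes, but the signage was unclear, and what looked like a real-estate billboard turned out to be the road sign! That sent us the wrong way, circling back cost us half an hour, and we hit a traffic jam as soon as we got onto the expressway. Only then did we get a taste of how chaotic city management in Shanxi can be...</p>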
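<p>9. Once we were on the Erguang Expressway, I realized that Shanxi's greenery is not limited to Taiyuan. The Loess Plateau is very different from what I had imagined (I had the same impression on the expressways of northern Shaanxi; I hope it is not just a showcase project limited to what can be seen from the expressway...). Plains, ridges, grassland, and poplars interweave, and this northern scenery is a delight to the eye. Poplar leaves reflect light so strongly that from a distance they look like white flowers.
<br /> <img src="/img/二广高速.jpg" alt="Erguang Expressway" /></p>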
<p>(To be continued…)</p>
Thu, 11 Aug 2016 00:00:00 -0600
http://dqlin.xyz/travel/2016/08/11/Shanxi-part1/
http://dqlin.xyz/travel/2016/08/11/Shanxi-part1/travelHello, World<!--more-->
<p><br /><br />
<img src="/img/screenshot.jpg" alt="screenshot" />
<br /><br /></p>
Sat, 23 Jul 2016 00:00:00 -0600
http://dqlin.xyz/post/2016/07/23/hello-world/
http://dqlin.xyz/post/2016/07/23/hello-world/post