Daqi's Blog

Real-Time Stochastic Lightcuts Source Code Release!

Sat, 08 Aug 2020 00:00:00 -0700

Using RTX to Accelerate Instant Radiosity

Thu, 24 Jan 2019 00:00:00 -0800

2018 was an exciting year for computer graphics. Nvidia announced RTX graphics cards, which bring real-time ray tracing to consumers. Following the announcement, we saw new releases of mainstream game series, including Battlefield V and Shadow of the Tomb Raider, putting RTX-powered graphics in their games. The screenshot below, captured from a Battlefield V promotion video (https://www.youtube.com/watch?v=rpUm0N4Hsd8), shows super clear ray traced reflections in water.

However, the recent game releases branded with RTX graphics mostly use RTX for tracing reflections and shadows (including soft shadows); many other possibilities with real-time ray tracing are still to be explored. Of course, ray traced reflections and shadows can largely enhance the overall graphics quality, considering the importance of these two components and how bad their quality was even with very complex rasterization tricks. But there are still tons of possible applications of real-time ray tracing that can lift the overall graphics quality to a new level. For example, we can achieve more faithful subsurface scattering in translucent objects like marble and human skin. Current techniques use shadow maps to estimate the distance traveled by light inside the object, which can fall short with concave objects and produce artefacts around object edges. Ray tracing simply avoids the artefacts brought by these rasterization tricks, so everything appears as it should.

The thing I want to advocate is using RTX to generate virtual point lights (VPLs) and trace shadow rays to them. This technique is known as instant radiosity, introduced by Alexander Keller in 1997. It resembles bidirectional path tracing and photon mapping in the sense that it also traces light paths and records hit positions along the paths. These surface records are then used as point lights that represent discretized indirect lighting. Compared to photon mapping, instant radiosity is a cheap and effective way to render smooth diffuse reflection, thanks to the low-frequency spatial radiance distribution of point lights. A few hundred VPLs are enough to provide indirect lighting from a small light source with reasonable quality, whereas photon mapping often requires more than 1M photons to eliminate the wavy artefacts. (Of course, there is a singularity problem caused by the point light approximation, but that can be bypassed by setting a minimum distance between point lights and surface points.) As a result, virtual point lights have been used in games for flashlights. An example is the rendering of indirect illumination in rooms from gun-mounted flashlights in Gears of War 4 [Malmros, 2017] (talk: https://www.youtube.com/watch?v=c0VxzGRIUCs). The developers used reflective shadow maps [Dachsbacher & Stamminger, 2005] (http://www.klayge.org/material/3_12/GI/rsm.pdf) to sample single-bounce VPLs and merged VPLs according to some geometric and material heuristics to lower the computational cost. This technique gives real-time single-bounce GI (likely unshadowed). The following screenshots from the talk video show a comparison between the flashlight aiming at the red tapestry and at the wall. Clearly, the technique generates reasonable color bleeding, as shown by the change of color on the ceiling.

Looking carefully at the talk video, there is some temporal incoherence as the flashlight moves. However, it is not obvious in a dark environment like this, especially with a moving FPS game character. Still, the render quality can be improved if the virtual point lights can be generated and evaluated at a lower cost. First, using more VPLs improves the temporal stability and reduces the bright blotches. Also, given its strict performance requirements, Gears of War 4 probably didn’t trace shadow rays to VPLs for indirect illumination shadows, which could be very important in a more complex scene setting. Now with RTX available, we can make VPLs much cheaper, solving the aforementioned problems.

DirectX Ray Tracing

There are several options to get access to the ray tracing function in RTX cards. But the most convenient one for PC game development is the DirectX Ray Tracing (DXR) API, which integrates seamlessly with the rasterization pipeline we use every day. There is a very nice introduction to DXR from Nvidia (https://devblogs.nvidia.com/introduction-nvidia-rtx-directx-ray-tracing/). In the most concise words, DXR breaks the ray tracing process into three new shaders: “raygen”, “hit” and “miss”. Rendering starts from “raygen” (ray generation), which is invoked in a grid manner like a compute shader. In all shaders, calling TraceRay() executes the fixed-function hardware scene traversal using an acceleration structure (BVH). Because a ray can hit any object, a shader table is used to store shading resources for each geometry object. Upon intersection, the entry for the current object can be retrieved from the shader table to determine which shaders and textures to use. With these functions, we can implement virtually all possible ray tracing functions with DXR, except choosing our own acceleration structure (for example, a k-d tree).

Example Implementation

Here I provide a brief code walkthrough of using DXR to implement the original (brute force) instant radiosity algorithm. Some parts of the code are modified from the Microsoft MiniEngine DXR example (https://github.com/Microsoft/DirectX-Graphics-Samples). Notice that it is not meant to be interactive, as the original instant radiosity is an offline rendering algorithm that goes through all virtual point lights for each pixel and shoots shadow rays to resolve the visibility. In our experiment we will generate 1 million VPLs for at most two-bounce indirect diffuse reflection and render a 1280x720 instant radiosity image. This means we need to trace 921.6 giga shadow rays against the scene. My DXR program running on an RTX 2080 takes 25 minutes to render an instant radiosity image for Crytek Sponza (262k triangles), about 600 mega rays per second, which is quite impressive. Please notice that I use a very unoptimized way of tracing shadow rays (each frame traces shadow rays from all pixels to a single light, and each frame carries a fixed G-buffer overhead). With proper optimization, the speed should reach at least 1 giga rays per second. The VPL generation process is relatively fast, taking less than 2 ms. Here is the raygen shader I used to generate the VPLs.
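The ray budget quoted above can be checked with a few lines of arithmetic. The snippet below is only a back-of-the-envelope sketch in Python, not part of the DXR program, and the variable names are my own.

```python
# Back-of-the-envelope check of the shadow ray budget quoted in the text.
num_vpls = 1_000_000          # VPLs generated
width, height = 1280, 720     # render resolution
render_seconds = 25 * 60      # measured render time on the RTX 2080

# Brute-force instant radiosity: one shadow ray per (pixel, VPL) pair.
total_rays = num_vpls * width * height
rays_per_second = total_rays / render_seconds

print(f"{total_rays / 1e9:.1f} giga rays")          # 921.6 giga rays
print(f"{rays_per_second / 1e6:.0f} mega rays/s")   # 614 mega rays/s
```

So the measured throughput is a little above the rounded 600 mega rays per second.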

To sample light rays from a directional distant light (sun light), I used a function like this to randomly sample a point from the top disk of the bounding cylinder of the scene bounding sphere, aligned to the sun light direction (a technique introduced in PBRT). This point is then used as the origin of the light ray, which is then traced through the scene and makes diffuse bounces when intersecting with surfaces.

void GenerateRayFromDirectionalLight(uint2 seed, out float3 origin, out float pdf)
{
	float3 v1, v2;
	CoordinateSystem(SunDirection, v1, v2); // get a local coordinate system
	float2 cd = GetUnitDiskSample(seed + DispatchOffset);
	float3 pDisk = SceneSphere.xyz - SceneSphere.w * SunDirection +
            SceneSphere.w * (cd.x * v1 + cd.y * v2);
	origin = pDisk;
	pdf = 1 / (PI * SceneSphere.w * SceneSphere.w);
}
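To make the geometry of this sampling step easier to check, here is a CPU-side Python sketch of the same idea. It is not the shader itself: `coordinate_system` and `unit_disk_sample` are my own stand-ins for the HLSL helpers `CoordinateSystem` and `GetUnitDiskSample`, whose actual implementations are not shown in this post, and the random numbers are passed in explicitly instead of being derived from a seed.

```python
import math
import random

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def coordinate_system(n):
    """Return two unit vectors that form an orthonormal basis with unit vector n."""
    if abs(n[0]) > abs(n[1]):
        inv_len = 1.0 / math.sqrt(n[0] * n[0] + n[2] * n[2])
        v1 = (-n[2] * inv_len, 0.0, n[0] * inv_len)
    else:
        inv_len = 1.0 / math.sqrt(n[1] * n[1] + n[2] * n[2])
        v1 = (0.0, n[2] * inv_len, -n[1] * inv_len)
    return v1, cross(n, v1)

def unit_disk_sample(u1, u2):
    """Map two uniform numbers in [0, 1) to a uniform point on the unit disk."""
    r = math.sqrt(u1)
    phi = 2.0 * math.pi * u2
    return r * math.cos(phi), r * math.sin(phi)

def sample_sun_ray_origin(sun_dir, sphere_center, sphere_radius, u1, u2):
    """Pick a ray origin on the disk that caps the scene's bounding cylinder."""
    v1, v2 = coordinate_system(sun_dir)
    dx, dy = unit_disk_sample(u1, u2)
    origin = tuple(
        sphere_center[i] - sphere_radius * sun_dir[i]
        + sphere_radius * (dx * v1[i] + dy * v2[i])
        for i in range(3))
    # Uniform density over the disk area, as in the shader.
    pdf = 1.0 / (math.pi * sphere_radius * sphere_radius)
    return origin, pdf

rng = random.Random(1)
print(sample_sun_ray_origin((0.0, -1.0, 0.0), (0.0, 0.0, 0.0), 10.0,
                            rng.random(), rng.random()))
```

Every sampled origin lies on a disk of the scene radius, centered one radius "upstream" of the scene center along the sun direction, so a ray shot along the sun direction can reach any part of the bounding sphere.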

DXR requires us to specify the ray using a ray description (RayDesc) structure that contains the origin, tMin (min parametric intersection distance along the ray), ray direction and tMax. Additionally, for this algorithm we also need to define a payload structure that records the carried radiance on the light path which I call “alpha”. It is multiplied with the surface albedo and the cosine factor during each surface hit. The payload also stores the current recursion depth. This payload feeds the last argument of TraceRay, which can pass the information from ray generation to hit shader and between bounces.

/// some resource and function definitions
    ...

struct RayPayload
{
	float3 alpha;
	uint recursionDepth;
};

[shader("raygeneration")]
void RayGen()
{
	float3 origin;
	float pdf;
        // using the 2D ray dispatch index as random seed
	GenerateRayFromDirectionalLight(DispatchRaysIndex().xy, origin, pdf);
	float3 alpha = SunIntensity / pdf;

	RayDesc rayDesc = { origin,
		0.0f,
		SunDirection,
		FLT_MAX };
	RayPayload rayPayload;
	rayPayload.alpha = alpha;
	rayPayload.recursionDepth = 0;

        // definition of parameters can be found on
        //   https://developer.nvidia.com/rtx/raytracing/dxr/DX12-Raytracing-tutorial-Part-2

	TraceRay(accelerationStructure, RAY_FLAG_NONE, ~0, 0, 1, 0, rayDesc, rayPayload);    
}

Here is the “closesthit” shader that creates VPLs at surface hits. I use three DX12 linear buffers (StructuredBuffer) to store VPL positions, normals and colors (flux). An atomic counter is incremented and returned each time a new VPL is created, to prevent two VPLs from being written to the same position. Following that, a diffuse reflection ray is sampled from the hemisphere centered at the surface normal to generate the next bounce. Notice how DXR provides a wide range of fixed functions and variables that store the ray tracing information we need for lighting computation. For example, barycentric coordinates are fetched from “BuiltInTriangleIntersectionAttributes” and the intersection distance is provided by RayTCurrent().

/// some resource and function definitions
    ...

[shader("closesthit")]
void Hit(inout RayPayload rayPayload, in BuiltInTriangleIntersectionAttributes attr)
{
	uint materialID = MaterialID;
	uint triangleID = PrimitiveIndex();
	RayTraceMeshInfo info = meshInfo[materialID];

        /// fetch texture coordinates (uv0, uv1, uv2), vNormal, vBinormal, vTangent for triangle vertices
        ...

	float3 bary = float3(1.0 - attr.barycentrics.x - attr.barycentrics.y, attr.barycentrics.x, attr.barycentrics.y);
	float2 uv = bary.x * uv0 + bary.y * uv1 + bary.z * uv2;

	float3 worldPosition = WorldRayOrigin() + WorldRayDirection() * RayTCurrent();
	uint2 threadID = DispatchRaysIndex().xy + DispatchOffset;
	const float3 rayDir = normalize(-WorldRayDirection());
	uint materialInstanceId = info.m_materialInstanceId;
	const float3 diffuseColor = g_localTexture.Sample(defaultSampler, uv).rgb;
	float3 normal = g_localNormal.SampleLevel(defaultSampler, uv, 0).rgb * 2.0 - 1.0;
	float3x3 tbn = float3x3(vTangent, vBinormal, vNormal);
	normal = normalize(mul(normal, tbn));

        // sample a diffuse bounce
	float3 R = GetHemisphereSampleCosineWeighted(threadID * (rayPayload.recursionDepth+1), normal);
	float3 alpha = rayPayload.alpha;

        // attenuate carried radiance
	alpha *= diffuseColor;

        // increment atomic counter
	uint VPLid = g_vplPositions.IncrementCounter();

        // store a new vpl
	g_vplPositions[VPLid] = worldPosition;
	g_vplNormals[VPLid] = normal;
	g_vplColors[VPLid] = alpha;

        // trace the next diffuse bounce if the depth limit has not been reached
	if (rayPayload.recursionDepth < MAX_RAY_RECURSION_DEPTH)
	{
		TraceNextRay(worldPosition + epsilon * R, R, alpha, rayPayload.recursionDepth);
	}
}
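The shader above multiplies alpha only by the albedo, with no explicit cosine term. My reading is that this follows from the cosine-weighted sampling: for a Lambertian surface the cosine factor and the sampling density cancel. The Python sketch below illustrates that under stated assumptions. `sample_cosine_hemisphere` is a hypothetical stand-in for `GetHemisphereSampleCosineWeighted` (the real helper is not shown here), and the surface is assumed to be perfectly diffuse.

```python
import math
import random

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def normalize(v):
    length = math.sqrt(dot(v, v))
    return (v[0] / length, v[1] / length, v[2] / length)

def sample_cosine_hemisphere(normal, u1, u2):
    """Cosine-weighted direction around a unit normal (sample a disk, project up)."""
    helper = (1.0, 0.0, 0.0) if abs(normal[0]) < 0.9 else (0.0, 1.0, 0.0)
    t = normalize(cross(helper, normal))
    b = cross(normal, t)
    r = math.sqrt(u1)
    phi = 2.0 * math.pi * u2
    x, y = r * math.cos(phi), r * math.sin(phi)
    z = math.sqrt(max(0.0, 1.0 - u1))
    return tuple(x * t[i] + y * b[i] + z * normal[i] for i in range(3))

def lambert_throughput(albedo, cos_theta):
    """BRDF * cos / pdf for a diffuse bounce sampled with pdf = cos / pi (cos_theta > 0)."""
    brdf = albedo / math.pi
    pdf = cos_theta / math.pi
    return brdf * cos_theta / pdf  # the cosine and pi cancel, leaving the albedo

rng = random.Random(3)
n = normalize((0.2, 0.9, -0.4))
d = sample_cosine_hemisphere(n, rng.random(), rng.random())
print(d, lambert_throughput(0.5, dot(n, d)))
```

With uniform hemisphere sampling instead, the cosine term would have to be applied explicitly.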

Finally, we can gather the VPL contribution for all visible pixels to generate an irradiance buffer. Notice that texPosition and texNormal are surface world position and normal from the G buffer.

The irradiance $E$ can be calculated as $ E = \Phi \frac{\max(\langle n_s, r \rangle, 0) \max(\langle n_l, -r \rangle, 0)}{\|p_l - p_s\|^2} $, where $\Phi$ is the VPL flux, $n_s$ and $n_l$ are the surface normal and the VPL normal, $p_s$ and $p_l$ are the surface and VPL positions, and $r = (p_l - p_s) / \|p_l - p_s\|$ is the normalized shadow ray direction.

The following shader code computes the contribution from one VPL, indexed by VPLId. A shadow ray is traced to resolve the visibility between the current pixel and the VPL, and the max ray T is set to the light distance, which means any hit is an occlusion. This result can be stored in the ray payload as a boolean. Because we are tracing shadow rays, it is better to pass the RAY_FLAG_ACCEPT_FIRST_HIT_AND_END_SEARCH flag to TraceRay() to avoid the unnecessary search for the closest hit.

/// some resource and function definitions
    ...

/// an anyhit shader setting payload.IsOccluded to true
    ...

[shader("raygeneration")]
void RayGen()
{
	float3 output = 0;

	int2 pixelPos = DispatchRaysIndex().xy;

	float3 SurfacePosition = texPosition[pixelPos].xyz;
	float3 lightPosition = vplPositions[VPLId].xyz;
	float3 lightDir = lightPosition - SurfacePosition;
	float lightDist = length(lightDir);
	lightDir = lightDir / lightDist; //normalize

	// cast shadow ray
	RayDesc rayDesc = { SurfacePosition,
			  0.1f, //bias
			  lightDir,
			  lightDist };

	ShadowRayPayload payload;
	payload.IsOccluded = false;

	TraceRay(g_accel, RAY_FLAG_ACCEPT_FIRST_HIT_AND_END_SEARCH,
		~0, 0, 1, 0, rayDesc, payload);
	if (!payload.IsOccluded) // no occlusion
	{
		float3 SurfaceNormal = texNormal[pixelPos].xyz;

		float3 lightNormal = vplNormals[VPLId].xyz;
		float3 lightColor = vplColors[VPLId].xyz;

		output = max(dot(SurfaceNormal, lightDir), 0.0) * lightColor;
		output *= max(dot(lightNormal, -lightDir), 0.0) / (lightDist*lightDist + CLAMPING_BIAS);
	}

	irradianceBuffer[pixelPos] += output / numVPLs;
}
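For readers who want to check the geometry term without a GPU, here is a small CPU-side Python sketch of the same per-VPL computation. It is not the shader: the `visible` flag stands in for the shadow ray result, and the function and parameter names are my own.

```python
import math

def vpl_irradiance(surface_pos, surface_normal, vpl_pos, vpl_normal,
                   vpl_flux, visible=True, clamping_bias=0.0):
    """Irradiance at a surface point from one VPL (unnormalized by the VPL count)."""
    zero = (0.0, 0.0, 0.0)
    if not visible:
        return zero
    d = tuple(vpl_pos[i] - surface_pos[i] for i in range(3))
    dist2 = sum(c * c for c in d)
    if dist2 == 0.0:
        return zero
    dist = math.sqrt(dist2)
    r = tuple(c / dist for c in d)  # normalized direction from surface to VPL
    cos_surface = max(sum(surface_normal[i] * r[i] for i in range(3)), 0.0)
    cos_light = max(sum(-vpl_normal[i] * r[i] for i in range(3)), 0.0)
    # clamping_bias keeps the 1 / distance^2 term bounded near the VPL.
    geometry = cos_surface * cos_light / (dist2 + clamping_bias)
    return tuple(geometry * c for c in vpl_flux)

# A VPL two units above the surface, facing straight down.
print(vpl_irradiance((0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 0, -1), (4.0, 2.0, 1.0)))
# (1.0, 0.5, 0.25)
```

Summing this over all VPLs and dividing by the number of VPLs gives the value accumulated in the irradiance buffer.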

After we iterate through all VPLs, we can modulate the irradiance buffer with the surface albedo to produce the final indirect illumination and add that to the direct illumination. And here is the final image.

If you look into the VPL-based GI literature, you’ll find a ton of methods that approximate instant radiosity with a dramatic reduction in evaluation cost. None of them can reach real-time performance yet, but we’ll continue with this topic to see what RTX can bring us for real-time global illumination.

References

Dachsbacher, C., & Stamminger, M. (2005, April). Reflective shadow maps. In Proceedings of the 2005 symposium on Interactive 3D graphics and games (pp. 203-231). ACM.

Malmros, J. (2017, July). Gears of War 4: custom high-end graphics features and performance techniques. In ACM SIGGRAPH 2017 Talks (p. 13). ACM.

The Hardest Problem on the Hardest Test

Sat, 01 Sep 2018 00:00:00 -0700

Yesterday, I saw this intriguing YouTube video (see References) about using a clever trick to solve what was meant to be the hardest problem in the 1992 Putnam math competition.

The problem goes as follows: choosing four random points on the surface of a sphere, what’s the probability that the center of the sphere lies in the tetrahedron spanned by the four points? At first sight, it occurred to me that this was a very difficult multivariate integral problem. However, it turned out that you just need to “flip coins” to get the correct answer.

The solution approaches the problem by considering a simplified version first. If there are only three points $P_1$, $P_2$, $P_3$ on a circle, it’s much easier to consider the case where we fix two points and slide the third point along the circle. Apparently, the triangle spanned by the three points covers the center of the circle only when the third point falls on the “opposite” arc (drawing lines from $P_1$ and $P_2$ through the center, this is the arc cut out by the lines on the side opposite to the arc formed by $P_1$ and $P_2$). So given any $P_1$ and $P_2$, the probability that adding a third point $P_3$ forms a triangle that covers the center is the probability that $P_3$ lands on that arc, i.e., the length of the arc over the circumference of the circle, since $P_3$ is randomly chosen on the circle. Deriving the probability for any $P_1$, $P_2$ and $P_3$ requires $P_1$ and $P_2$ to be arbitrarily chosen, which means we can integrate the previous probability over all possible angles between $P_1$ and $P_2$. However, since the arc length is linear in the central angle, we can directly take the “average length”, which is attained when the arc formed by $P_1$ and $P_2$ is a quarter circle (we only consider the shorter arc, whose central angle is uniformly distributed between $0$ and $\pi$). This gives us a $25\%$ probability.

The simplified 2D case.

I believe you are already excited after seeing this 2D result, which was derived without any computation. However, things get more nebulous when we go into 3D. We naturally want to extend the 2D reasoning to 3D. Of course, the tetrahedron formed by $P_1$, $P_2$, $P_3$ and $P_4$ covers the center of the sphere if and only if $P_4$ falls on the spherical triangle opposite to the one spanned by $P_1$, $P_2$ and $P_3$. However, getting the “average area” of the spherical triangle is non-trivial, as it involves evaluating a surface integral with four variables.

So here comes the plot twist: let us go back to the 2D case. Instead of choosing three random points, let’s choose two random lines crossing the center of the circle and one random point. Each of the random lines represents one of the points $P_1$ and $P_2$, and each of $P_1$ and $P_2$ can be on either end of its underlying line with equal probability. This should be equivalent to choosing three random points. There are only 4 different combinations of $P_1$ and $P_2$ lying on a specific end of their corresponding lines, and only one of these combinations puts $P_1, P_2$ on the opposite side of the circle from the randomly chosen $P_3$ (in other words, the three points are not all on the same semicircle), so we have a one-fourth chance of forming a center-covering triangle. It is very similar to flipping two coins: one specific coin must be heads and the other one must be tails, which has a $1/4$ probability.

The "coin flip" illustration.

And what’s awesome about this reasoning is that it extends seamlessly to the 3D case. Four random points just become three random lines crossing the sphere center and one random point. This time, we are flipping 3 coins to generate all possible combinations, and only one out of all 8 cases gives us the situation where the fourth point is on the opposite side of the sphere from the other three points. So the probability that the tetrahedron formed by four randomly chosen points on a sphere covers the center of the sphere is $1/8$.
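The two results can also be checked numerically. The following Python sketch is my own Monte Carlo sanity check and is not taken from the video. It assumes the points are independent and uniformly distributed on the circle or sphere, and it tests whether the origin is a convex combination of the points by looking at the signs of the weights that combine the points into the zero vector.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Uniform point on the unit sphere in R^dim (normalized Gaussian vector)."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-12:
            return [x / norm for x in v]

def det(m):
    """Determinant by Laplace expansion; fine for the tiny matrices used here."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j in range(len(m)):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total

def contains_origin(points):
    """True if the simplex spanned by dim + 1 points in R^dim contains the origin."""
    dim = len(points[0])
    # Weights w_i with sum(w_i * p_i) = 0, given by signed minors.
    weights = []
    for i in range(dim + 1):
        others = [p for k, p in enumerate(points) if k != i]
        weights.append((-1) ** i * det(others))
    # The origin is inside exactly when all weights share the same sign.
    return all(w > 0 for w in weights) or all(w < 0 for w in weights)

def estimate_probability(dim, trials, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        points = [random_unit_vector(dim, rng) for _ in range(dim + 1)]
        hits += contains_origin(points)
    return hits / trials

print("circle:", estimate_probability(2, 20000))  # should be close to 1/4
print("sphere:", estimate_probability(3, 20000))  # should be close to 1/8
```

With a few tens of thousands of trials, the estimates typically agree with $1/4$ and $1/8$ to about two decimal places.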

This was such an elegant proof, and the reasoning process was very inspiring. In most cases, you can go to simpler cases first (here, going down one dimension and fixing some points) before solving a hard problem. More complex cases are usually generalizations or combinations of simpler cases, so the essence of the problem should be the same. Therefore, considering the simpler cases makes you more likely to grasp the essence of the problem (though we need to beware of false positives when generalizing the reasoning) rather than being overwhelmed by the harder problem. There is a joke, probably from the Institute for Advanced Study (Einstein, Godel, von Neumann, Yang Chen-Ning, etc. have worked there…), that there were signs at the balconies of the buildings saying “before you jump, have you ever considered dimension one?”. Another, more important insight from this video is that changing the formalization can be the plot twist in solving a hard problem. This happened when the solver of the problem found that generalizing the 2D result to 3D was very hard and went back to 2D. In more philosophical words, it is actually the language, whether it is text or graphics, that defines the way we think. The essence of the problem is the “thing” itself, but some languages might do a better job than others at describing it. In this case, four random points and three random lines plus one random point are just two different languages describing the same thing. However, one language is a better tool for approaching the essence of the problem than the other. And we are less likely to discover that language if we don’t go to the low-dimensional case. Although it is very subtle, we can always practice the idea by thinking from different angles and formalizing the problem in multiple different ways.

The 3D case.

Up to now, the solution has only been described in a very intuitive way, so we can say we have the idea of the proof but not the proof itself. To formalize the proof, some mathematical tools need to be used. An article (Howard & Sisson, 1996) gives a formal proof by expressing the points as position vectors with the sphere center as the origin, and by requiring the origin to be expressed as a convex combination of the four position vectors, i.e. all weights having a positive sign, which matches only one out of the 8 possible cases. This article also generalizes the problem and derives some interesting facts. The easiest generalization is $n+1$ points on the unit sphere in $\mathbb{R}^n$, for which the probability of covering the center is $\frac{1}{2^n}$.

References

3Blue1Brown(Grant Sanderson). “The Hardest Problem on the Hardest Test.” YouTube, 8 Dec. 2017, www.youtube.com/watch?v=OkmNXy7er84. (All images in this article are captured from the video. The article also heavily used the ideas in the video.)

Howard, R., & Sisson, P. (1996). Capturing the origin with random points: Generalizations of a Putnam problem. The College Mathematics Journal, 27(3), 186-192.

Kedlaya, K. S., Poonen, B., & Vakil, R. (2002). The William Lowell Putnam mathematical competition 1985-2000: problems, solutions and commentary. MAA. Retrieved from https://www.amazon.com/William-Mathematical-Competition-1985-2000-Problem/dp/0883858274 (The cover of this book is used as the header image of this article.)

Latest Techniques in Interactive Global Illumination

Wed, 03 Jan 2018 00:00:00 -0800

This is my presentation at the University of Utah graphics seminar, which is about a very exciting topic - interactive GI!

Stencil Buffer Trick for Volume Deferred Shading

Tue, 02 Jan 2018 00:00:00 -0800

Deferred shading is known to largely boost rendering efficiency when dealing with a large number of lights. The mechanism is very simple: separate the geometry complexity from the lighting. To do this, G-buffers, which form an array of textures that often includes the position, material color and normal of the points to shade, are rendered in the first pass from the camera's point of view. Then, an orthographic camera is used to render a quad that covers the whole screen. The normalized (0-1) screen coordinates are then used to retrieve the geometry/material data at each point of the screen, which is fed into the lighting function. In this way, we avoid shading tons of fragments from the projected scene geometry and instead shade only those which are visible.

However, imagine we have a large group of lights. We’ll still have to go through the whole list of lights for each screen pixel. With a more physically based lighting model, in which each light has an influential radius (a consequence of the physical fact that an ideal point light source has an inverse-square falloff), fragments that are outside a certain light’s influential radius waste time waiting for other fragments in the same batch that take a different branch. We know that branching is bad for the GPU, and this leads to a severe time penalty. Many techniques have been proposed to alleviate this problem. Tiled deferred shading is a very popular method that most of you have probably heard of. It partitions the screen into tiles and creates a “light list” for each tile using only the lights that intersect with the tile. This is of course an elegant method. However, we always need to redo this preprocessing whenever a new group of lights is generated (if there is a need to).

A simpler solution is volume deferred shading. We just need to render a “light volume” for each light, which, as you might guess, can be a sphere with a radius equal to the light’s influential radius. For example, in OpenGL, we just need to create a list of vertices/indices of a sphere and prepare a model matrix for each light (which is simply a scaling and a translation). While rendering, we perform the draw call n times, where n is the number of lights. Each light volume produces fragments that cover the specific region on the screen where fragments can possibly be shaded by the light. Of course, by doing this we are losing the depth dimension. We have to explicitly test the fragments and make sure they are in about the same depth region as the light (which is only a necessary condition). Of course, tiled rendering also requires such testing, and if we have lights scattered everywhere in a very “deep” scene, the lights to be tested per tile are significantly fewer. However, because no preprocessing is required, volume deferred shading has quite competitive performance in most cases.
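As a concrete illustration of the per-light model matrix, here is a minimal Python sketch that builds the scale-then-translate matrix for a unit sphere mesh. The function names are my own, and the matrix is written row-major for readability; for OpenGL it would have to be uploaded in column-major order (or transposed on upload).

```python
def light_volume_matrix(position, radius):
    """Row-major 4x4 matrix: uniform scale by radius, then translate to position."""
    px, py, pz = position
    return [[radius, 0.0, 0.0, px],
            [0.0, radius, 0.0, py],
            [0.0, 0.0, radius, pz],
            [0.0, 0.0, 0.0, 1.0]]

def transform_point(m, p):
    """Apply the matrix to a 3D point (homogeneous coordinate w = 1)."""
    v = (p[0], p[1], p[2], 1.0)
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(3))

# A vertex of the unit sphere ends up on the light's influence sphere.
m = light_volume_matrix((5.0, 6.0, 7.0), 2.0)
print(transform_point(m, (1.0, 0.0, 0.0)))  # (7.0, 6.0, 7.0)
```

One such matrix per light is all the per-light data the volume pass needs besides the light's color and radius.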

Wait! We should not render n passes! Instead, a better way is to use instancing, which is supported on every modern graphics card, to avoid the latency caused by lots of communication between CPU and GPU. Another important thing: depth write should be disabled and additive blending should be enabled. The reason depth write should be disabled is that light volumes are not real geometry. When two light volumes are close to each other and illuminate the same region of the scene, we don’t want them to occlude each other such that some parts are only shaded by one of the light volumes.

If you do the volume deferred shading described above directly, you will immediately discover that something goes wrong. When you approach a light’s location (with a moving camera), at some point the screen will suddenly darken. This is because, no matter whether you turn backface culling on or off, you face a dilemma: you either render the pixels twice as bright when you are outside the light volume, or render nothing when you are inside the light volume.

It turns out that this situation can be easily solved by switching the culling mode to front-face culling. However, this is not good enough. We can actually keep the Z buffer created by the G-buffer rendering and use this information to reject fragments that are not intersected by light volumes. Here is a nice stencil buffer trick introduced by Yuriy O’Donnell (2009). What it does is basically use the stencil buffer as a counter to record whether the front face of a light volume is behind the visible scene geometry (so that it has no chance to shade the pixel). This is achieved by rendering only front faces (with color buffer write disabled) in the first pass and adding 1 to the stencil value of the Z-failed pixels. Another situation is that the back face of a light volume is in front of the visible scene geometry, which is handled by the second pass: rendering only back faces and using a Greater/Equal z-test to further filter the final pixels from the pixels with a zero stencil value (those that already passed the first test). In this way we keep only the light volume pixels that “fail” the z-test (the original “LESS” one), which intuitively correspond to the scene geometry pixels that intersect with the light volume. Notice that this trick also works when you are inside a light volume, in which case front faces won’t get rendered (it is impossible for us to be inside a light volume and for its front face to be behind the visible geometry at the same time!), leaving a zero stencil value that allows us to use only the Greater/Equal depth test to filter the pixels to be shaded. Of course, in both passes we need to disable Z write. We certainly don’t want the light volumes bumping into each other.
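The two-pass logic can be sketched per pixel in a few lines of Python. This is only my CPU-side model of the trick as described above, not actual graphics API code. It assumes a depth convention where smaller values are closer to the camera, and `front_depth=None` models the camera being inside the light volume, so that no front face is rasterized.

```python
def pixel_is_lit(scene_depth, back_depth, front_depth=None):
    """Decide whether a light volume may shade a pixel (smaller depth = closer)."""
    stencil = 0

    # Pass 1: front faces only, color writes off, depth test LESS.
    # On z-fail (front face behind the scene geometry) increment the stencil.
    if front_depth is not None and not (front_depth < scene_depth):
        stencil += 1

    # Pass 2: back faces only, depth test GREATER_EQUAL, stencil must be zero.
    # A back face in front of the scene geometry is rejected here.
    return stencil == 0 and back_depth >= scene_depth

print(pixel_is_lit(0.5, 0.7, 0.3))   # True:  geometry lies inside the volume
print(pixel_is_lit(0.5, 0.8, 0.6))   # False: whole volume is behind the geometry
print(pixel_is_lit(0.5, 0.3, 0.1))   # False: whole volume is in front of it
print(pixel_is_lit(0.5, 0.7))        # True:  camera inside the volume
```

In other words, a pixel survives only when the visible geometry lies between the front and back faces of the light volume.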

The original diagram used by Yuriy O'Donnell.

While this trick definitely increases the efficiency, especially when the lighting situation is very complex, we can do something better. Modeling a detailed sphere as a polygon mesh often creates a large number of vertices that clog the pipeline. Instead, we can use a very coarse polygon sphere (e.g., with 4 vertical/horizontal slices) with a slightly larger radius to ensure that the light volume is bounded correctly. We can even use just a bounding cube of the light volume! Of course, the simplest thing we can do is just use a quad. However, that gives up some of the aforementioned depth test benefits and it also involves some complex projective geometry. Just for fun, I also prepared a derivation of the axis-aligned screen space bounding quad of the projected sphere.

A Way of Rendering Crescent-shaped Shadows under Solar Eclipse

Mon, 01 Jan 2018 00:00:00 -0800

Happy new year! I haven’t posted for a long time. However, I have done many exciting projects in the last half year and I’ll upload some of them soon. The first project I want to share with you is the render I created for the Utah Teapot Rendering Competition. We were all awed by the great solar eclipse on August 21st. Do you remember the crescent-shaped shadows cast by tree leaves? They were so beautiful that the first time I saw them, I wanted to render them with ray tracing. Before working on this competition, I thought that there must be some complex math to figure out to simulate this rare phenomenon. However, the problem turned out to be embarrassingly simple. We can just model the actual geometry of the sun and the moon and trace rays. You might think that it is a crazy idea. In fact, instead of creating a sun with a diameter of 1.4 million kilometers and putting it 150 million kilometers away, we can simply put a 1-unit wide sun 100 units away from our scene, where 1 unit is approximately how big our scene is. It is supposed to give almost the same result. Then we use the same kind of trick to put a moon with a slightly smaller diameter slightly ahead of the sun to make sure that the sun is eclipsed by it into a crescent shape. By treating the sun as an isotropic spherical emitter and the moon as a diffuse occluding sphere, and using a tree model with a detailed alpha-masked leaf texture, I got results that are surprisingly good and also fast to compute. In such a simple way, I created a nice image of a glass Utah teapot sitting under a tree on a lawn behind the Warnock Engineering Building, with crescent-shaped leaf shadows in the background.

The render. Click to enlarge.

See the "sun" and "moon"? This is really how it works!

a close-up showing the crescent shape by eclipse

Hope you like this project. Sometimes complex effects are really that simple!

Remark:

The tree model and the grass model (including their textures) are downloaded from TurboSquid.com
Grass: https://www.turbosquid.com/FullPreview/Index.cfm/ID/868103
Tree: https://www.turbosquid.com/FullPreview/Index.cfm/ID/884484
TurboSquid Royalty Free License
https://blog.turbosquid.com/royalty-free-license/

The pavement texture is downloaded from TextureLib.com
http://texturelib.com/texture/?path=/Textures/brick/pavement/brick_pavement_0099
License: http://texturelib.com/license/

The environment map is a panorama taken in the vicinity of Warnock Engineering Building. Taken by Cameron Porcaro, uploaded to Google Street View. By Google’s Terms of Use it is considered fair use since it is not for commercial use.

[GAPT Series - 13] Conclusions

Fri, 30 Jun 2017 00:00:00 -0700

13.1 Summary

This series focuses on how to improve the solution to the real-time path tracing problem by introducing and discussing possible optimizations in 3 categories (SAS, sampling and SIMD), which are implemented in a program with real-time rendering and interaction capability. While the SIMD optimization is based on the parallel computing model of GPGPU and aims specifically at the real-time requirement, the first two categories, SAS and sampling, are not hardware dependent and are also used in off-line renderers, as they are defined in the domain of a single computing thread. However, it is also possible to improve the models involved in these two categories to achieve better collaboration with the GPGPU model. For SAS, a common bottleneck of ray tracing processes, the SAH-based kd-tree and BVH were introduced because they are the best among their peers at minimizing the expected global cost of ray-primitive intersection tests and because they serve indispensable functions in different applications. Optimization techniques for these data structures, including triangle clipping and short stack traversal for the kd-tree and node refitting for the dynamic BVH, were also discussed with implementation details. In the chapter on sampling, different context-based optimization methods for the Monte Carlo algorithm, all of which aim to decrease variance in rendering, were introduced: importance sampling on the BSDF, next event estimation for direct lighting, multiple importance sampling combining the previous two, and bidirectional path tracing for difficult lighting conditions. Moreover, Metropolis Light Transport, a modification of the basic Monte Carlo process based on Markov chains, was introduced and some implementation details on the GPU were shared. For SIMD optimization, three different types of techniques (data structure rearrangement, code-level thread divergence reduction, and thread compaction) were illustrated with code and test cases.
A more efficient ray-triangle intersection solution which transforms the problem space was cited for its contribution on the performance increase of our program. More importantly, we proposed a new GPU construction algorithm for SAH kd-tree in full details, which turns out to help greatly reducing the initialization overhead for complex model. In addition, the underlying mechanism of rendering effects chosen and supported in our program – surface-to-surface reflection/refraction, volume rendering, and subsurface scattering were analyzed to clarify possible complications in usage. For most methods we introduced and discussed, test cases on our path tracer were provided to verify the ideas. Finally, we benchmarked our program with the path tracing demo in NVIDIA’s Optix engine and a free mainstream path tracer to prove that our program has a large advantage in rendering simple scenes like the Cornell Box by improving the performance by up to 30% and slightly outperforms a free mainstream path tracer for a complex rendering of a car, which means it is at least competitive with most of the mainstream path tracers nowadays in real-time rendering of models with industrial complexity. By analyzing, gathering, testing, and integrating different optimization techniques into a whole process, and choosing the correct rendering methods, we can efficiently produce aesthetically-pleasing, photorealistic results.

13.2 Limitations & Recommendations for Further Work

Given the immense potential of GPGPU, it is possible that path tracing will offer a photorealistic, film-standard experience and replace rasterization-based graphics as the gaming standard in the future, as hardware performance continues to multiply. However, improvements in algorithms and software structure are also necessary to reduce as much workload as possible and hasten the arrival of that day. This thesis addresses many distinctive issues of real-time path tracing, such as large thread divergence and dynamic geometry. However, many problems that may appear in future real-world applications of path tracing have not been considered due to time constraints. One such problem is efficiently rendering a large set of animation data, which may contain particle systems or complex deformation. Another problem is the insufficient optimization of the spatial acceleration structure (SAS), which is a bottleneck in ray-traced graphics. New algorithms or hardware need to be developed to keep improving the traversal speed and to update or rebuild the SAS with minimal effort. In addition, better parallelization methods are still required for algorithms that are hard to parallelize but perform extremely well serially, such as Metropolis Light Transport, even though many have been developed. Moreover, parsing can be moved to the GPU to greatly reduce the initialization time of geometrically complex scenes.

Bibliography

Ashikhmin, M., & Shirley, P. (2000). An anisotropic Phong BRDF model. Journal of graphics tools, 5(2), 25-32.

Beason, K. (2007). Smallpt: Global Illumination in 99 lines of C++. Retrieved from http://www.kevinbeason.com/smallpt/

Chandrasekhar, S. (1960). Radiative Transfer. New York: Dover Publications. Originally published by Oxford University Press, 1950.

Chandrasekhar, S. (1960). The stability of non-dissipative Couette flow in hydromagnetics. Proceedings of the National Academy of Sciences, 46(2), 253-257.

Cook, R. L., & Torrance, K. E. (1982). A reflectance model for computer graphics. ACM Transactions on Graphics (TOG), 1(1), 7-24.

Foley, T., & Sugerman, J. (2005, July). KD-tree acceleration structures for a GPU raytracer. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware (pp. 15-22). ACM.

Henyey, L. G., & Greenstein, J. L. (1941). Diffuse radiation in the galaxy. The Astrophysical Journal, 93, 70-83.

Jensen, H. W., Marschner, S. R., Levoy, M., & Hanrahan, P. (2001, August). A practical model for subsurface light transport. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques (pp. 511-518). ACM.

Kajiya, J. T. (1986, August). The rendering equation. In ACM Siggraph Computer Graphics (Vol. 20, No. 4, pp. 143-150). ACM.

Kopta, D., Ize, T., Spjut, J., Brunvand, E., Davis, A., & Kensler, A. (2012, March). Fast, effective BVH updates for animated scenes. In Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (pp. 197-204). ACM.

Lafortune, E. P., & Willems, Y. D. (1993). Bi-directional path tracing.

NVIDIA. (2015). Memory Transactions. NVIDIA® Nsight™ Development Platform, Visual Studio Edition 4.7 User Guide. Retrieved from http://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/sourcelevel/memorytransactions.htm

NVIDIA. (2017). CUDA Toolkit Documentation. Retrieved from http://docs.nvidia.com/cuda/thrust/#axzz4dK4GrjBF

Pauly, M. (1999). Robust Monte Carlo Methods for Photorealistic Rendering of Volumetric Effects (Master’s thesis, Universität Kaiserslautern).

Pharr, M., Jakob, W., & Humphreys, G. (2011). Physically based rendering: From theory to implementation. Second Edition. Morgan Kaufmann.

Santos, A., Teixeira, J. M., Farias, T., Teichrieb, V., & Kelner, J. (2012). Understanding the efficiency of KD-tree ray-traversal techniques over a GPGPU architecture. International Journal of Parallel Programming, 40(3), 331-352.

Schlick, C. (1994, August). An Inexpensive BRDF Model for Physically‐based Rendering. In Computer graphics forum (Vol. 13, No. 3, pp. 233-246). Blackwell Science Ltd.

Schmittler, J., Woop, S., Wagner, D., Paul, W. J., & Slusallek, P. (2004, August). Realtime ray tracing of dynamic scenes on an FPGA chip. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware (pp. 95-106). ACM.

Veach, E. (1997). Robust monte carlo methods for light transport simulation (Doctoral dissertation, Stanford University).

Vinkler, M., Havran, V., & Bittner, J. (2014, May). Bounding volume hierarchies versus kd-trees on contemporary many-core architectures. In Proceedings of the 30th Spring Conference on Computer Graphics (pp. 29-36). ACM.

Wald, I., & Havran, V. (2006, September). On building fast kd-trees for ray tracing, and on doing that in O (N log N). In Interactive Ray Tracing 2006, IEEE Symposium on (pp. 61-69). IEEE.

Walter, B., Marschner, S. R., Li, H., & Torrance, K. E. (2007, June). Microfacet models for refraction through rough surfaces. In Proceedings of the 18th Eurographics conference on Rendering Techniques (pp. 195-206). Eurographics Association.

Ward, G. J. (1992). Measuring and modeling anisotropic reflection. ACM SIGGRAPH Computer Graphics, 26(2), 265-272.

Zhou, K., Hou, Q., Wang, R., & Guo, B. (2008). Real-time kd-tree construction on graphics hardware. ACM Transactions on Graphics (TOG), 27(5), 126.

Appendix

The following pictures show the result of rendering a BMW M6 car for one minute in Cycles Render, one minute in our path tracer, and one hour in our path tracer, successively. The BMW M6 car model was modeled by Fred C. M’ule Jr in 2006, under CC-Zero (public domain) license, downloaded from http://www.blendswap.com/blends/view/3557.

[GAPT Series - 12] Benchmarking

Fri, 30 Jun 2017 00:00:00 -0700

Benchmarking different path tracing engines is not a trivial task. Different engines have different strengths at different types of rendering tasks. In addition, rendering methods may differ between engines, so it is difficult to choose a measure of performance. If one engine uses Metropolis Light Transport and another uses brute-force path tracing, one cannot claim that the first engine performs better than the second just because it has a higher frame rate (or more samples per second for offline path tracing). Normally, we compare performance by convergence rate, measured by the variance level: in the same amount of time, the engine that converges more has the better performance. Note that the variance measure is only valid when the engines use the same basic sampling method. The only two we have introduced are ordinary Monte Carlo sampling and Markov chain Monte Carlo sampling (used only in MLT), and MLT will always try to find the smallest variance even if the color is incorrect (also known as start-up bias). Alternatively, it may seem that one can compare the absolute difference between the rendered image and the ground truth. However, the BSDFs used in different engines are usually slightly different, in which case the absolute difference is an invalid measure. A practical solution is to force the engines to use the same basic sampling method (in most cases it can be changed in the options) and compare the convergence rate. It is important that the images generated by the different engines are either all tone mapped or all not tone mapped before a variance comparison is done. If it is known that the engines being compared use the same specific sampling method (like next event estimation or bidirectional path tracing), it is simpler to compare the frame rate directly. However, one should also look at the differences between the rendered image and the ground truth to catch low-quality images or artefacts produced by an incorrect implementation or by deviation from the industry standard.

Two scenes are used to benchmark our path tracer against some free mainstream path tracers. The first scene is the default Cornell Box with all Lambertian diffuse surfaces and a diffuse area light, rendered with next event estimation. The real-time path tracing sample program of NVIDIA’s OptiX ray tracing engine is used for comparison with our path tracer (Figure 18). Since the sample program is open-source and uses the same next event estimation method, and both our program and NVIDIA’s program are real-time path tracers, we can compare performance directly by frame rate. Table 3 shows the frame rates of our path tracer and NVIDIA’s path tracer at 512x512 resolution with 4 samples per pixel per frame, on the mid-range NVIDIA GeForce GTX 960M graphics card and the high-end NVIDIA GeForce GTX 1070 graphics card.

Figure 18 Left: Our Render Right: NVIDIA's Render

Type	GTX 960M	GTX 1070
NVIDIA’s Path Tracer	13.52 fps	30.0 fps
Our Path Tracer	14.02 fps	41.5 fps
Speedup	3.7%	38.3%

Table 3 Speedup of our path tracer on different graphics cards, compared with NVIDIA's

One reason our path tracer gains a larger speedup on the high-end graphics card is that high-end cards have larger memory bandwidth, which allows faster memory operations in the stream compaction used in our path tracer but not in NVIDIA’s.

The second scene is the BMW M6 car modeled by Fred C. M’ule Jr. (2006), which tests the capability of our path tracer to render models from real applications. For comparison, we chose the Cycles Render embedded in Blender. Although it is an off-line renderer, it also has a “preview” function that progressively renders the result in real time. Note that Cycles Render uses a different workflow to blend material colors and may use different BSDF formulae for the same material attributes, causing the appearance to differ (the glass and metal rendered by Cycles are less reflective with the same attributes, and the overall tone is different). It is extremely hard to tune the two renderers to produce the same result, but we can still guarantee that the workload on each path tracer is almost the same, as the choice of material component depends on the Fresnel equation.

Since the implementations may also be vastly different, we use the convergence reached in one minute as the measure of performance. Note that it is invalid to use the variance of all pixels in the picture to compare convergence. As convergence corresponds to the noise level in Monte Carlo sampling, a small region that should be rendered to a uniform color is used for the convergence test. For this scene, it is convenient to use the upper-left 64x64 pixels for the variance comparison, as the wall has a uniform diffuse material which produces nearly the same color under the current lighting condition. The one-hour result from our path tracer is used as the ground truth for the variance comparison (using either renderer’s result is equivalent). For illumination, a 3200x1600 environment map of a forest under the sun is used.

The images in Figure 19 show the greyscale values of the upper-left 64x64 region from Cycles Render, our path tracer, and the ground truth. By eye alone, it is difficult to judge whether our result or Cycles’ result has the lower noise level. However, we can analyze the variance numerically by evaluating the standard deviation of the pixel values. Using OpenCV, the mean and the standard deviation of all greyscale pixels can be easily obtained; they are listed in Table 4.

Figure 19 Left, Middle & Right: Cycles’, our, ground truth's upper left 64x64 pixels in greyscale

Type	Average	Std. Deviation	Convergence
Theirs	98.42	10.31	16.7%
Ours	98.95	9.41	18.3%
Ground Truth	104.6	1.72	100%

Table 4 Comparing convergence rate of our path tracer and Cycles Render in 1 minute's render time
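
The Convergence column is not defined explicitly in the text, but the numbers are consistent with dividing the ground truth's standard deviation by each renderer's standard deviation. The following is a minimal sketch under that assumption; the function names are hypothetical, and plain Python statistics stand in for the OpenCV call used for the measurement.

```python
from statistics import mean, pstdev

def patch_stats(pixels):
    """Mean and population standard deviation of a flat list of greyscale values."""
    return mean(pixels), pstdev(pixels)

def convergence(std_render, std_truth):
    """Assumed definition: ground-truth noise level divided by the render's noise level."""
    return std_truth / std_render

# Standard deviations copied from Table 4.
table4_std = {"Theirs": 10.31, "Ours": 9.41, "Ground Truth": 1.72}
ratios = {name: round(100 * convergence(s, table4_std["Ground Truth"]), 1)
          for name, s in table4_std.items()}
```

Under this reading, `ratios` reproduces the percentages listed in the table, which suggests the column is a ratio of noise levels rather than a separately measured quantity.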

From the data, we can see that our path tracer does perform slightly better than Cycles Render. Due to time restrictions, we were not able to carry out more tests using different scenes and other mainstream renderers. Nevertheless, the complexity of this test scene (700K+ faces; glossy, diffuse, and refraction BSDFs; environment light) is solid evidence that our path tracer has at least the same level of performance as current mainstream rendering software. For the interested reader, we also provide our and Cycles’ rendering results for 1 minute, and our rendering result for 1 hour, in the appendix.

[GAPT Series - 11] SIMD Optimization (cont.)

Fri, 30 Jun 2017 00:00:00 -0700

As mentioned half a year ago, apart from data structure rearrangement and thread divergence reduction, we can also optimize SIMD performance with thread compaction. The first section below introduces how I implemented it using the CUDA Thrust API, followed by a proposal of a new method for parallel construction of the kd-tree on the GPU.

The following sections will introduce three types of optimizations based on CUDA architecture – data structure rearrangement, thread divergence reduction and thread compaction used in our path tracer to increase the SIMD efficiency and reduce the overall rendering time. The necessity of most of these optimizations comes from the real-time rendering requirement, without the possibility to design fixed number of samples for each rendering branch. After that, two sections will be dedicated to discussion of optimizations on specific components - a ray-triangle algorithm better for SIMD performance will be introduced and

11.1 Thread Compaction

Russian roulette is necessary for transforming the theoretically infinite ray bounces into a sampling process with finitely many stages, which terminates probabilistically. While it decreases the expected number of iterations for each thread in every frame and gives an overall speedup due to early-terminated thread blocks, it scatters terminated threads everywhere. This results in a low percentage of useful operations across warps (the 32 threads in a warp are always executed as a whole) and a low overall occupancy (number of active warps / maximum number of active warps) in CUDA, and the problem worsens as the number of iterations increases.

Among the basic parallel primitives, stream compaction on the array of threads is a natural fit for this problem. Figure 16 illustrates this with a simplified case in which each warp contains only 4 threads and a single block with 4 warps runs on the GPU, using green and red to represent active and inactive threads. Before stream compaction, the rate of useful operations is 25% (3 active threads out of 12 running threads). After grouping the 3 active threads into the same warp, the rate becomes 75%; equivalently, the same amount of work can be done by 1/3 as many warps, leaving room for other blocks to execute their warps. Also, if the first row is the average case for multiple blocks, the occupancy would be 75%, since each block with 4 warps has an inactive warp, implying that less work can be done with the same hardware resources. With stream compaction, occupancy is close to 100% in the first few iterations, until the total number of active threads is no longer enough to fill the streaming multiprocessors.

Figure 16 Upper: Before thread compaction Lower: After thread compaction
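
The arithmetic of this toy case can be checked with a small sketch. It assumes the layout described above (4-thread warps, three warps with one active thread each and one fully idle warp); the exact positions of the active threads in Figure 16 are an assumption, and the helper names are hypothetical.

```python
WARP_SIZE = 4  # toy warp width from the text; real CUDA warps have 32 threads

def useful_rate(active_flags, warp_size=WARP_SIZE):
    """Return (fraction of executed lanes doing useful work, number of running warps).

    A warp with no active thread is not executed, so it contributes no lanes.
    """
    warps = [active_flags[i:i + warp_size] for i in range(0, len(active_flags), warp_size)]
    running = [w for w in warps if any(w)]
    lanes = len(running) * warp_size
    rate = sum(map(sum, running)) / lanes if lanes else 0.0
    return rate, len(running)

def compact(active_flags):
    """Stream compaction on the flags: all active threads move to the front."""
    n_active = sum(active_flags)
    return [1] * n_active + [0] * (len(active_flags) - n_active)

before = [1, 0, 0, 0,  0, 1, 0, 0,  0, 0, 1, 0,  0, 0, 0, 0]
after = compact(before)
```

Here `useful_rate(before)` evaluates to 25% over 3 running warps, and `useful_rate(after)` to 75% over a single warp, matching the figures quoted in the paragraph.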

In order to measure the performance impact of thread compaction, we designed a test comparing the frame rate of path tracing with and without thread compaction on our NVIDIA GeForce GTX 960M, for maximum trace depths from 1 to 10 and for 20, 50, and 100. The test scene is the standard Cornell box rendered with next event estimation, with 1,048,576 paths traced in each frame.

Figure 17 Frame rate as a function of max trace depth, for the program with and without thread compaction

As shown in Figure 17, without thread compaction the frame rate declines rapidly over the first 5 increments of max trace depth, after which the decline is approximately linear until the depth at which all threads become inactive, probably between 20 and 30. With thread compaction, the frame rate starts to surpass the original one at depth 3, with only a small falloff for every depth increment, and becomes almost stable after depth 5.

The reason thread compaction makes the first two max depths slower is that it has some initialization overhead, which cannot be offset by the speedup from stream compaction when too few threads have terminated. A struct storing the next ray, mask color, pixel position and active state needs to be initialized at the beginning for each thread and retrieved in every bounce. For stream compaction, we again use the Thrust library introduced in Chapter 2, which offers a remove_if() function to remove the array elements satisfying a custom predicate. For this task, the predicate takes the struct as its argument and checks whether the active state is false to determine which elements to discard.
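
As an illustration of the predicate-based removal, here is a CPU-side sketch in Python rather than CUDA. The struct layout and all field names are hypothetical, and the list-based `remove_if` below only mimics the role that Thrust's remove_if() plays in the text.

```python
from dataclasses import dataclass

@dataclass
class PathState:            # field names are hypothetical
    next_ray: tuple         # (origin, direction) of the ray for the next bounce
    mask: tuple             # accumulated RGB throughput ("mask color")
    pixel: int              # index of the pixel this path contributes to
    active: bool = True     # False once the path has terminated

def is_terminated(p):
    """Predicate: True for elements that should be discarded."""
    return not p.active

def remove_if(items, pred):
    """Order-preserving removal of every element for which pred is True."""
    return [x for x in items if not pred(x)]

paths = [PathState(((0, 0, 0), (0, 0, 1)), (1, 1, 1), pixel=i) for i in range(6)]
for i in (1, 2, 4):         # pretend Russian roulette terminated these paths
    paths[i].active = False
paths = remove_if(paths, is_terminated)
survivors = [p.pixel for p in paths]
```

Because each struct carries its own pixel position, the surviving paths can still write to the right pixel after they have been moved to the front of the array.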

Beyond that, we can also use stream compaction to rearrange threads so that threads that will run the same Fresnel branch in the next iteration are grouped together. The number of stream compaction operations equals the number of Fresnel branches (3 in our case). With double buffering, the result of each stream compaction is copied or appended to another array. After the re-sorted array is generated, the indices of the two buffers are swapped. In our experiment with a simple scene adapted from the Cornell box with glossy reflection, diffuse reflection and caustics, regrouping the threads achieved up to a 30% speedup.
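
The regrouping can be sketched as one compaction pass per branch into a second buffer, followed by a swap. This is a serial Python illustration; the branch names and the `branch_of` lookup are assumptions made for the example.

```python
BRANCHES = ("diffuse", "glossy", "refract")   # hypothetical names for the 3 branches

def regroup(front, back, branch_of):
    """One compaction pass per branch, appending each result to the back buffer."""
    back.clear()
    for b in BRANCHES:
        # Keep only the threads that will take branch b (a copy-if style pass).
        back.extend(t for t in front if branch_of[t] == b)
    return back, front      # swap buffer roles: the back buffer becomes current

branch_of = {0: "glossy", 1: "diffuse", 2: "refract", 3: "diffuse", 4: "glossy", 5: "diffuse"}
front, back = [0, 1, 2, 3, 4, 5], []
front, back = regroup(front, back, branch_of)
```

After the swap, `front` lists the thread indices grouped by branch, so neighbouring threads in a warp are more likely to execute the same code path.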

11.2 GPU SAH Kd-tree Construction

We propose a GPU SAH kd-tree construction method in this section. So far, CPU construction of the SAH kd-tree has a lower bound of O(N log N), which is still too slow for complex scenes with more than 1 million triangles. It takes more than 10 seconds to construct the SAH kd-tree for the 1,087,716-face Happy Buddha model on our Intel i7 6700HQ, which is a serious overhead. Given the immense power of current GPGPU, adapting kd-tree construction to a parallel algorithm is a promising task. A GPU kd-tree construction algorithm was proposed by Zhou et al. (2008), which divides the construction levels into large-node stages, where the median of the node’s refitted bounding box is chosen as the spatial split, and small-node stages, where SAH is used to select the best split. Although its construction speed is high, the method sacrifices some traversal performance due to the approximate choice of splits in the large-node stages. In contrast, we now propose a full SAH tree construction algorithm on the GPU.

First, similar to Wald’s CPU kd-tree construction model (Wald & Havran, 2006), we create an event struct for every end of the bounding box of every triangle in each of the 3 dimensions, stored in 3 arrays. The struct contains the 1D position, the type (0 for a starting event, 1 for an ending event), the triangle index (which is actually the triangle address, since at the beginning the node-triangle association list is the same as the triangle list), and an “isFlat” Boolean that marks whether the opposite end has the same coordinate. For each dimension, the event array is sorted by ascending position, keeping ending events before starting events when the positions are the same. We use the same routine as Wald’s algorithm, except that it runs in parallel: the triangle of an ending event is subtracted from the right side before the SAH calculation, and the triangle of a starting event is added to the left side after the SAH calculation, which guarantees that triangles with an end lying on the splitting plane find the correct parent. The sort should be a highly efficient parallel sort such as parallel radix sort.
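
The event layout and sort order can be sketched as follows. This is a serial Python illustration, not the GPU code: the `Event` field names are hypothetical, and Python's built-in sort stands in for the parallel radix sort.

```python
from typing import NamedTuple

START, END = 0, 1           # event types as defined in the text

class Event(NamedTuple):
    pos: float              # 1D position of this end of the bounding box
    type: int               # START or END
    tri: int                # triangle address
    is_flat: bool           # True if the opposite end has the same coordinate

def build_events(tri_bounds, axis):
    """Emit two events per triangle from its (lo, hi) bounding box on one axis."""
    events = []
    for tri, (lo, hi) in enumerate(tri_bounds):
        flat = lo[axis] == hi[axis]
        events.append(Event(lo[axis], START, tri, flat))
        events.append(Event(hi[axis], END, tri, flat))
    # Ascending position; at equal positions END (1) sorts before START (0).
    return sorted(events, key=lambda e: (e.pos, -e.type))

bounds = [((0, 0, 0), (1, 1, 1)), ((0.5, 0, 0), (3, 1, 1)), ((3, 0, 0), (4, 1, 1))]
events_x = build_events(bounds, axis=0)
```

In `events_x`, the ending event of triangle 1 and the starting event of triangle 2 share position 3, and the ending event comes first, as the tie-breaking rule requires.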

After that, we separate the struct attributes into an SoA (structure of arrays) for a better memory access pattern. Since we will be processing the nodes in parallel, we also need an “owner” array to store the index of the owner node; it is initialized to zeros because the root has an index of 0. So far, we have three position arrays, three type arrays, three triangle address arrays, three isFlat arrays, and one owner array, each with a length equal to the number of events from all nodes in the current construction level. In addition, we need an array for the node-triangle association, which lists the indices of the triangles associated with the nodes in the current level in node-by-node order. This node-triangle association list (called the triangle list for short) also needs an owner list, which we call “triOwner”, also initialized to zeros.
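
A minimal sketch of the level-0 working arrays for one dimension follows. The dictionary keys mirror the array names in the text where they exist ("owner", "triOwner"); the remaining names and the helper itself are hypothetical.

```python
def to_soa(events, num_tris):
    """Split (pos, type, tri, is_flat) records into one array per attribute."""
    pos, typ, tri_addr, is_flat = (list(col) for col in zip(*events))
    return {
        "pos": pos, "type": typ, "triAddr": tri_addr, "isFlat": is_flat,
        "owner": [0] * len(events),      # every event starts in the root (node 0)
        "triList": list(range(num_tris)),
        "triOwner": [0] * num_tris,      # every triangle starts in the root
    }

# Sorted events of two triangles on one axis: (position, type, triangle, isFlat).
aos = [(0.0, 0, 0, False), (1.0, 1, 0, False), (1.0, 0, 1, False), (2.0, 1, 1, False)]
level0 = to_soa(aos, num_tris=2)
```

Keeping each attribute in its own array means that a pass which only needs, say, the type array touches contiguous memory, which is the access pattern the text is after.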

What is left to initialize are two dynamic arrays: nodeList, which stores all processed nodes, pushed level by level from the working node array of the current construction level in linear order, and leafTriList, which stores all the triangles in leaves in leaf-by-leaf linear order.

After all initializations are done, we choose the dimension with the largest span in the root’s bounding box. In the following iterations, this selection is performed in parallel when the node structs for all newly spawned children of the current level are created. The explanation below treats the current construction level as a general level with many nodes, rather than level 0. The first parallel operation we perform, other than sorting, is an inclusive segmented scan on the type array. Its purpose is to count the number of ending events before the current event (including the current event if it is an ending event). This count is used to calculate the number of triangles on the left and right sides of the splitting plane, which, together with the surface areas of the bounding boxes of the potential left and right children, is required to evaluate the SAH function. In this segmented scan, the owner array is used as the key to separate events from different nodes. For the SAH calculation, the offset of the node’s events in the event list is stored in the node struct, so that an event knows its relative position within its node’s part of the array; this is used together with the scanned result to derive the number of triangles in the left or right subspace of the splitting plane. For a splitting plane with flat triangles lying on it, we simplify the SAH calculation by grouping all such flat triangles on the left side, which in most cases has no influence on traversal performance, so that the flat case needs no special handling in triangle counting. The information of a potential split is stored in a struct containing the SAH cost, the two child bounding boxes, the splitting position, and the numbers of left-side and right-side triangles. The array of these structs then undergoes a segmented reduction to find the best split (with minimal SAH cost) for each node.
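
The counting and the per-node minimum can be sketched serially as below. Several things are assumptions made for the example: the cost constants `c_trav` and `c_isect`, the layout of `nodes`, and the omission of the flat-triangle rule. On the GPU the scan and the reduction would be keyed parallel primitives; here plain loops stand in for them.

```python
def segmented_inclusive_scan(values, keys):
    """Inclusive prefix sum that restarts whenever the key (owner node) changes."""
    out, running, prev = [], 0, object()
    for v, k in zip(values, keys):
        running = v if k != prev else running + v
        out.append(running)
        prev = k
    return out

def surface_area(lo, hi):
    dx, dy, dz = (h - l for l, h in zip(lo, hi))
    return 2.0 * (dx * dy + dy * dz + dz * dx)

def best_splits(pos, typ, owner, nodes, axis, c_trav=1.0, c_isect=1.0):
    """nodes[o] = (event offset, triangle count, (lo, hi) box). Best split per node."""
    ends_incl = segmented_inclusive_scan(typ, owner)
    best = {}
    for i, (p, t, o) in enumerate(zip(pos, typ, owner)):   # one GPU thread per event
        offset, n_tris, (lo, hi) = nodes[o]
        k = i - offset                        # position of the event inside its node
        n_left = k - (ends_incl[i] - t)       # starting events strictly before i
        n_right = n_tris - ends_incl[i]       # triangles whose END comes after i
        left_hi = hi[:axis] + (p,) + hi[axis + 1:]
        right_lo = lo[:axis] + (p,) + lo[axis + 1:]
        cost = c_trav + c_isect * (surface_area(lo, left_hi) * n_left +
                                   surface_area(right_lo, hi) * n_right) / surface_area(lo, hi)
        if o not in best or cost < best[o][0]:   # serial stand-in for segmented reduction
            best[o] = (cost, i, n_left, n_right)
    return best

# One node holding two triangles with x-extents [0, 1] and [3, 4] in a 4x1x1 box.
pos, typ, owner = [0.0, 1.0, 3.0, 4.0], [0, 1, 0, 1], [0, 0, 0, 0]
nodes = {0: (0, 2, ((0.0, 0.0, 0.0), (4.0, 1.0, 1.0)))}
best = best_splits(pos, typ, owner, nodes, axis=0)
```

For this input the cheapest candidate is the ending event at position 1, which leaves one triangle on each side; the candidates at the two ends of the box keep both triangles on one side and cost more.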

The next step is assigning triangles to each side, which is also the step where we determine whether to turn the node of interest into a leaf. In the assigning function, which is launched in parallel for every event in the current splitting dimension, we check whether the best cost is greater than the cost of not splitting (which in most cases is proportional to the number of triangles in the node) or the number of triangles in the node is below a threshold we set. If so, we create a leaf by marking the “axis” attribute in the node struct with 3. For assigning triangles to the children, our key method is to use a bit array of twice the size of the current triangle list and let the thread of each event write 1 at the address on the side it belongs to (or on both sides if the triangle belongs to both the left and the right side), after which the bit array is scanned to obtain the addresses in the next level’s triangle list. Since the events are in sorted order, an event can decide which side it belongs to by comparing its index with the index of the event chosen for the best split. If the event is a starting event and its index is smaller than the best index, it assigns its triangle to the left side; if the event is an ending event and its index is greater than the best index, it assigns its triangle to the right side. Notice that because we launch a thread for each event, a triangle spanning the splitting plane is correctly assigned to both sides by different threads, without special care. In addition, flat triangles lying on the splitting plane are assigned to both sides (this is where the isFlat variable is checked) to avoid the effect of numerical inaccuracy in traversal, which can cause artefacts.
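
The bit-array assignment can be sketched for a single node as follows. This is a serial Python illustration with hypothetical names; the loop body corresponds to what one thread per event would do, and the leaf test is left out.

```python
from itertools import accumulate

START, END = 0, 1

def assign_sides(ev_type, ev_tri, ev_pos, ev_flat, best, num_tris):
    """Bit array of size 2*num_tris: first half = left child, second half = right child."""
    bits = [0] * (2 * num_tris)
    for i, (t, tri) in enumerate(zip(ev_type, ev_tri)):   # one GPU thread per event
        if t == START and i < best:
            bits[tri] = 1
        elif t == END and i > best:
            bits[num_tris + tri] = 1
        if ev_flat[i] and ev_pos[i] == ev_pos[best]:      # flat triangle on the plane
            bits[tri] = bits[num_tris + tri] = 1
    return bits

def exclusive_scan(xs):
    return list(accumulate(xs, initial=0))[:-1]

# Sorted x events of three triangles with extents [0, 1], [0.5, 3] and [3, 4].
ev_pos = [0, 0.5, 1, 3, 3, 4]
ev_type = [START, START, END, END, START, END]
ev_tri = [0, 1, 0, 1, 2, 2]
ev_flat = [False] * 6
bits = assign_sides(ev_type, ev_tri, ev_pos, ev_flat, best=2, num_tris=3)
addr = exclusive_scan(bits)
tri_list = [0, 1, 2]
next_tri_list = [t for t, b in zip(tri_list + tri_list, bits) if b]
```

With the split at position 1 (event index 2), triangle 1 spans the plane and therefore appears in both halves: its starting event sets the left bit and its ending event sets the right bit, with no special handling.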

Also, the threads in the triangle assignment function fill a leaf indicator array, which holds a 1 at the position of every triangle in the triangle list that belongs to a newly created leaf. This array is scanned to determine the address of each triangle in the leafTriList, similar to how the addresses in the next level’s triangle list are determined. It is also reduced to obtain the number of triangles added to the leafTriList in the current level, which is used to calculate the new size of the whole leafTriList, serving as the next level’s offset. Since we also need the local offset of each leaf’s triangles within the current level’s part of the leafTriList, we perform a segmented reduction followed by an exclusive scan on the leaf indicator array before assigning the offset to the leaf’s struct.
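
A serial sketch of the leaf bookkeeping is given below. The function name, its arguments and the choice to fold the level's base offset into the returned values are assumptions made for illustration.

```python
from itertools import accumulate, groupby

def leaf_layout(tri_owner, is_leaf_node, base_offset):
    """Addresses of leaf triangles and per-leaf offsets inside leafTriList."""
    indicator = [1 if is_leaf_node[o] else 0 for o in tri_owner]
    # Exclusive scan: address of each leaf triangle (entries of non-leaf
    # triangles are never read).
    addr = [base_offset + a for a in accumulate(indicator, initial=0)][:-1]
    # Segmented reduction: number of leaf triangles per node, in node order.
    counts = [(o, sum(ind for _, ind in grp))
              for o, grp in groupby(zip(tri_owner, indicator), key=lambda p: p[0])]
    # Exclusive scan over the counts gives each leaf's offset.
    offsets, running = {}, base_offset
    for o, c in counts:
        if is_leaf_node[o]:
            offsets[o] = running
        running += c
    return indicator, addr, offsets, running   # running = new size of leafTriList

tri_owner = [0, 0, 1, 1, 1, 2]          # triangles of three nodes, node by node
is_leaf_node = [True, False, True]      # nodes 0 and 2 became leaves
indicator, addr, offsets, new_size = leaf_layout(tri_owner, is_leaf_node, base_offset=10)
```

In this example the leafTriList already held 10 triangles, so the two new leaves start at 10 and 12, and the list grows to 13 entries, which becomes the offset for the next level.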

Before spawning new events for the child nodes, we need to finish the remaining operations on the triangle list. The triOwner list for the new level can be generated by “spawning” a list of doubled size from the original triOwner list: the list is appended to itself, with the owner indices in the second half offset by the original number of nodes, and a stream compaction using the aforementioned bit array as the key removes the entries for triangles that do not belong to the respective side. One may object that after the stream compaction the owner indices are no longer consecutive and so cannot be used for indexing. This is easily solved. We first perform a segmented reduction (more properly, a count) on a constant array of 1s with the newly generated triOwner array as the key; the returned value array is stored as the triangle counts of the next level’s nodes. We then run a parallel binary search of the triOwner array in the returned key array and replace the triOwner array with the result. In a similar way, the triangle list for the next level is “spawned” from the original triangle list and compacted by the bit array.
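
The three steps (compaction, keyed count, binary search) can be sketched serially as follows. The function name is hypothetical, and `groupby` and `bisect_left` stand in for the parallel reduction and the parallel binary search.

```python
from bisect import bisect_left
from itertools import groupby

def next_tri_owner(tri_owner, bits, num_nodes):
    # Spawn: append the list to itself, offsetting the second half (right children).
    doubled = tri_owner + [o + num_nodes for o in tri_owner]
    compacted = [o for o, b in zip(doubled, bits) if b]        # stream compaction
    # Keyed count on a constant array of 1s: unique keys and triangles per child.
    keys, counts = [], []
    for k, grp in groupby(compacted):
        keys.append(k)
        counts.append(sum(1 for _ in grp))
    # Binary search of each owner in the key array yields consecutive indices.
    reindexed = [bisect_left(keys, o) for o in compacted]
    return reindexed, counts, keys

# Two nodes: node 0 became a leaf (no bits set), node 1 was split.
tri_owner = [0, 0, 1, 1, 1]
bits = [0, 0, 1, 1, 0,   0, 0, 0, 1, 1]
owners, counts, keys = next_tri_owner(tri_owner, bits, num_nodes=2)
```

After compaction the surviving owner values are 1 and 3, which are not usable as indices; the binary search maps them to 0 and 1, the positions of the two new child nodes in the next level.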

Finally, we explain how the next level’s events (type, split position, isFlat and triangle address) are generated. The method is surprisingly simple. After duplicating the event list, we only need to produce a bit array for the events by looking up the corresponding values in the bit array for triangles, using the values in the current events’ triangle address list as pointers into the triangle bit array. The three attributes type, split position and isFlat are spawned by duplicating the original arrays and performing a stream compaction with the event bit array as the key. The triangle address array spawns its next-level counterpart by duplicating, reading the new addresses from the previously scanned triangle bit array, and likewise performing a stream compaction.
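
A single-node sketch of the event spawning is shown below, in serial Python with hypothetical names. It assumes the triangle bit array is laid out as left half followed by right half, as in the triangle assignment step, and omits the isFlat array since it is handled exactly like the type array.

```python
from itertools import accumulate

def spawn_events(ev_pos, ev_type, ev_tri, tri_bits, num_tris):
    scanned = list(accumulate(tri_bits, initial=0))[:-1]       # exclusive scan
    # Duplicate the events: the first copy points into the left half of the
    # triangle bit array, the second copy into the right half.
    slots = list(ev_tri) + [num_tris + a for a in ev_tri]
    ev_bits = [tri_bits[s] for s in slots]                     # event bit array
    keep = [i for i, b in enumerate(ev_bits) if b]             # stream compaction
    n = len(ev_pos)
    return ([ev_pos[i % n] for i in keep],
            [ev_type[i % n] for i in keep],
            [scanned[slots[i]] for i in keep])                 # new triangle addresses

ev_pos = [0, 0.5, 1, 3, 3, 4]
ev_type = [0, 0, 1, 1, 0, 1]
ev_tri = [0, 1, 0, 1, 2, 2]
tri_bits = [1, 1, 0, 0, 1, 1]       # triangles 0, 1 go left; triangles 1, 2 go right
new_pos, new_type, new_tri = spawn_events(ev_pos, ev_type, ev_tri, tri_bits, num_tris=3)
```

Because compaction preserves order, the spawned events stay sorted within each child, so no re-sorting is needed for the next level.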

So far, there is only one last array to spawn: the events’ owner list for the next level, which can be generated by the same method the triOwner array uses, namely stream compaction, segmented reduction, then binary search. Before the next iteration begins, node structs for the next level are created using data such as the counts and offsets in the previously generated arrays, and pushed to the final node list as a whole level. The splitting axes for the next level are also chosen in this process by comparing the lengths of the 3 dimensions of the bounding box. If an axis different from the current one is chosen, the 4 event arrays for the 3 dimensions are “rotated” into the desired place. For example, if index 0 stands for the splitting axis and the current splitting axis is x, then y and z are stored under indices 1 and 2; if the next splitting axis is z, the memory undergoes a “recursive downward” rotation so that z is rotated to 0, x to 1, and y to 2. Finally, the pointers of all working arrays are swapped with the buffered arrays. The construction terminates when the next level has no nodes.
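
The axis choice and the rotation can be sketched in a few lines. Both helpers are hypothetical; the lists here hold axis labels, whereas the real data would be the per-axis event arrays.

```python
def longest_axis(lo, hi):
    """Index of the dimension with the largest extent of the bounding box."""
    return max(range(3), key=lambda a: hi[a] - lo[a])

def rotate_axes(slots, next_axis):
    """Cyclically rotate the per-axis slots so the chosen axis lands in slot 0."""
    k = slots.index(next_axis)
    return slots[k:] + slots[:k]

current = ["x", "y", "z"]               # slot 0 is the current splitting axis
rotated = rotate_axes(current, "z")
```

Rotating `["x", "y", "z"]` for a next axis of z gives z in slot 0, x in slot 1 and y in slot 2, the same arrangement as the example in the text.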

We also performed a test comparing the speed of Wald’s CPU construction and our GPU construction of the same SAH kd-tree (full SAH without triangle clipping) on a computer with an Intel i7-4770 processor and an NVIDIA GTX 1070 graphics card. The result (Table 2) shows that a sufficiently large model is required for our GPU construction to outperform its CPU counterpart, due to the overhead of memory allocation and transfer. A speedup of about 5x is obtained once the model size goes beyond 1M faces, which indicates that our method can be used for ray tracing large models to greatly reduce the initialization overhead while maintaining the same tree quality.

Model	Face Count	CPU(s)	GPU(s)	Speedup
Cornell	32	0.001	0.046	0.02x
Suzanne	968	0.016	0.095	0.17x
Bunny	69,451	1.442	0.655	2.20x
Dragon	201,031	3.705	1.100	3.37x
Buddha	1,087,716	13.903	2.801	4.96x

Table 2 Speedup of our GPU SAH kd-tree construction compared with Wald's CPU algorithm

[GAPT Series - 10] Rendering Effects

Fri, 30 Jun 2017 00:00:00 -0700

Before going to the discussion of SIMD optimization, we present this chapter to briefly introduce the rendering effects supported by the path tracer, which matter because they are the direct application of the sampling methods discussed before.

10.1 Surface-to-surface Reflection/Refraction

As a guarantee of its practical capability, our path tracer simulates the optical effects of all kinds of surface-to-surface reflection and refraction, excluding cases such as polarized light or fluorescence, which are rare in practice. For diffuse reflection, the Lambertian model adjusted by the Ashikhmin and Shirley formula is used (Section 4.5), while the Cook-Torrance microfacet model is responsible for specular or glossy reflection. Our path tracer also supports anisotropic materials following the Ward model (Ward, 1992), which only modifies the Beckmann distribution factor in the Cook-Torrance model (Cook & Torrance, 1982).

For importance sampling the Ward BRDF, two uniform unit random variables are generated, and it is easy to solve the resulting equations for the azimuth angle and the altitude angle.

Figure 10 Isotropic & anisotropic specular
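As a minimal sketch of the Ward half-vector sampling step, the following function maps two uniform random numbers to the half-vector angles using the commonly published form of the Ward sampling equations. The name `sample_ward_half_angle` and the parameter names `alpha_x`, `alpha_y` are assumptions for this example; `atan2` is used instead of a plain arctangent so that the azimuth keeps its quadrant.

```python
import math


def sample_ward_half_angle(xi1, xi2, alpha_x, alpha_y):
    """Map uniform xi1 in (0, 1] and xi2 in [0, 1) to (theta_h, phi_h).

    tan(phi_h) = (alpha_y / alpha_x) * tan(2 * pi * xi2); atan2 preserves
    the quadrant that a plain arctan would lose.
    """
    u = 2.0 * math.pi * xi2
    phi = math.atan2(alpha_y * math.sin(u), alpha_x * math.cos(u))
    c, s = math.cos(phi), math.sin(phi)
    inv_alpha_sq = (c / alpha_x) ** 2 + (s / alpha_y) ** 2
    theta = math.atan(math.sqrt(-math.log(xi1) / inv_alpha_sq))
    return theta, phi


theta, phi = sample_ward_half_angle(0.5, 0.125, 0.3, 0.3)
print(theta, phi)
```

With equal roughness values the azimuth reduces to a uniform angle and the altitude follows the isotropic Beckmann distribution, which is a convenient sanity check.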

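For surface-to-surface refraction, the Cook-Torrance microfacet model can be modified by recalculating the Jacobian of the transform between the half-vector and the outgoing vector (Walter, Marschner, Li, & Torrance, 2007), yielding

$$f_t(i, o, n) = \frac{|i \cdot h|}{|i \cdot n|\,|o \cdot n|}\,(1 - F)\,G\,D\,J$$

where $i$, $o$ and $n$ are the incoming direction, the outgoing direction and the surface normal, $F$ is the Fresnel factor, $D$ is the microfacet distribution function (Beckmann in our implementation), $G$ is the shadowing term (a numerical approximation to the Smith shadowing function in our implementation, in which the roughness coefficient is replaced by $1/\sqrt{\cos^2\phi_h/\alpha_x^2 + \sin^2\phi_h/\alpha_y^2}$ for anisotropic materials), and $J = \frac{\eta_o^2\,|o \cdot h|}{(\eta_i (i \cdot h) + \eta_o (o \cdot h))^2}$, where $\eta_i$ and $\eta_o$ are the indices of refraction of the two media, is the absolute value of the Jacobian determinant. The half-vector in refraction indicates the normal of the sampled microfacet, which can be obtained by calculating $h = -\frac{\eta_i i + \eta_o o}{\lVert \eta_i i + \eta_o o \rVert}$ when the BSDF needs to be evaluated for arbitrary incoming and outgoing directions, as is the case in multiple importance sampling.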
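To make the refraction half-vector concrete, here is a small sketch following the convention of Walter et al. (2007), in which both `i` and `o` point away from the surface. The helper names are assumptions for this example. For a pair of directions that satisfies Snell's law at a flat interface, the recovered half-vector should coincide with the surface normal.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(c / length for c in v)


def refraction_half_vector(i, o, eta_i, eta_o):
    """h = -(eta_i * i + eta_o * o), normalized (i and o point away from the surface)."""
    return normalize(tuple(-(eta_i * a + eta_o * b) for a, b in zip(i, o)))


def refraction_jacobian(i, o, h, eta_i, eta_o):
    """Absolute Jacobian of the half-vector to outgoing-direction transform."""
    denom = eta_i * dot(i, h) + eta_o * dot(o, h)
    return eta_o ** 2 * abs(dot(o, h)) / denom ** 2


# Air-to-glass refraction at 45 degrees about the normal (0, 0, 1).
eta_i, eta_o = 1.0, 1.5
sin_i = math.sin(math.pi / 4)
sin_o = eta_i * sin_i / eta_o                  # Snell's law
i = (sin_i, 0.0, math.sqrt(1.0 - sin_i ** 2))
o = (-sin_o, 0.0, -math.sqrt(1.0 - sin_o ** 2))
h = refraction_half_vector(i, o, eta_i, eta_o)
print(h, refraction_jacobian(i, o, h, eta_i, eta_o))
```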

10.2 Volume Rendering

The rendering techniques so far are based on the assumption that the space between surfaces is a vacuum, which is an effective simplification only in ordinary cases with clear air. In phenomena like fog, smog, and smoke, noticeable scattering, absorption, and emission happen between surfaces and affect the radiance reaching the viewer. In the presence of such participating media, the integro-differential equation of transfer (Chandrasekhar, 1960) gives the directional derivative of radiance at a point in the medium, modeling the change of radiance in space:

$$\frac{dL(x, \omega)}{dt} = -\sigma_t(x)\,L(x, \omega) + L_e(x, \omega) + \sigma_s(x) \int_{S^2} p(x, \omega, \omega')\,L_i(x, \omega')\,d\omega'$$

where $x$ is the point in space and $\omega$ is the direction of interest, with $t$ being the measure of displacement along that direction. $\sigma_t$ is the attenuation coefficient accounting for both absorption and out-scattering, while $\sigma_s$ is the scattering coefficient controlling the magnitude of in-scattering, which has the phase function $p$ to define the probability density of in-scattering from each direction. $L_e$ and $L_i$ stand for media emission and incoming radiance, respectively. For isotropic media, $p$ has a constant value of $1/(4\pi)$, which is intuitive as the integral of differential solid angles over the sphere gives $4\pi$. For general media, there is a phase function developed by Henyey and Greenstein (1941) which provides a simple asymmetry parameter $g$ ranging from -1 to 1 to control the “polarity” of the participating media.
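For reference, the Henyey-Greenstein phase function can be written down directly. The sketch below assumes the convention in which `cos_theta` is the cosine of the angle between the propagation directions before and after scattering, so that a positive asymmetry parameter `g` means forward scattering; the function name is an assumption for this example.

```python
import math


def henyey_greenstein(cos_theta, g):
    """Henyey-Greenstein phase function, normalized over the unit sphere."""
    denom = 1.0 + g * g - 2.0 * g * cos_theta
    return (1.0 - g * g) / (4.0 * math.pi * denom ** 1.5)


print(henyey_greenstein(0.3, 0.0))    # isotropic case: 1 / (4 * pi)
print(henyey_greenstein(1.0, 0.7), henyey_greenstein(-1.0, 0.7))
```

With `g = 0` the function reduces to the isotropic constant, and for `g = 0.7` the forward value is far larger than the backward value.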

In practice, the integro-differential equation is solved by decomposing it into parts, calculating their values separately, and using ray marching to accumulate the values at each sample point on the incoming ray. The sample points represent differential segments with constant coefficients. Such numerical integration can also be estimated by Monte Carlo sampling. The parameter t of the ray can naturally be used as the measure of displacement along the ray direction. Depending on the targeted convergence rate or frame rate, the step size can be set larger or smaller. Plain Monte Carlo sampling would independently pick a random point in each segment for radiance estimation. However, for the light transfer integral, which is only one-dimensional and often has a smooth integrand, standard numerical integration may have an edge over the Monte Carlo method. A stratified pattern (Pauly, 1999), which assigns the same offset to every segment and randomizes the offset for each new ray, can be shown to have lower variance than independent Monte Carlo sampling. The p.d.f. for such samples is also very simple: it is a constant equal to the reciprocal of the step size, $1/\Delta t$, so each sample is simply weighted by the step size $\Delta t$.
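The stratified marching pattern can be sketched as follows. This is a simplified illustration, not our CUDA kernel: it only accumulates the optical depth along one ray, and the names `march_optical_depth` and `sigma_t` (a callable returning the attenuation coefficient at ray parameter t) are assumptions for the example.

```python
import math
import random


def march_optical_depth(sigma_t, t_max, n_steps, rng):
    """Estimate the integral of sigma_t(t) over [0, t_max] by ray marching.

    One random offset u is drawn per ray and shared by every segment, so the
    samples are equidistant; each sample is weighted by the step size dt.
    """
    dt = t_max / n_steps
    u = rng.random()
    return sum(sigma_t((k + u) * dt) for k in range(n_steps)) * dt


def march_transmittance(sigma_t, t_max, n_steps, rng):
    """Transmittance along the ray from the marched optical depth."""
    return math.exp(-march_optical_depth(sigma_t, t_max, n_steps, rng))


rng = random.Random(1)
# Density falling off exponentially along the ray; the exact optical depth
# over [0, 3] is 1 - exp(-3).
estimate = march_optical_depth(lambda t: math.exp(-t), 3.0, 64, rng)
print(estimate, 1.0 - math.exp(-3.0))
```

For a homogeneous medium the marched result matches the closed form regardless of the offset, so marching is only needed when the coefficients vary along the ray.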

In the simplest case of volume rendering, where the participating medium has homogeneous attributes, both emission and attenuation factors can be directly evaluated by $e^{-\sigma_t s}$ (where $s$ is the length of the ray segment of interest), which is the direct result of solving the differential equation $dL/dt = -\sigma_t L$, also known as Beer’s law. If the distribution of media density or other properties admits an analytical solution of this differential equation (as an elementary function), that solution can be evaluated directly without sampling. Sampling is used where an analytical solution is impossible, unknown, or too complex, or where the distribution is a customized discrete data set.

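As mentioned before, the attenuation term and the augmentation term can be treated separately in computation, which maps well to the accumulation of mask and intermediate color in our implementation, expressed by

$$L(x, \omega) = T(x_0 \to x)\,L_o(x_0, \omega) + \int_0^s T(x_t \to x)\,S(x_t, \omega)\,dt$$

where $x_0$ is the surface point at the far end of the ray segment, $x_t$ is the point at distance $t$ along the segment, and $S$ is the augmentation term combining emission and in-scattering. $T(x_0 \to x)$ is multiplied to the mask, and the integral is estimated with samples and added to the intermediate color. The transmittance $T$ can either be analytically evaluated or estimated with samples, as mentioned in the previous paragraph.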

The sampling estimation of the augmentation term can use the same stratified pattern mentioned before to reduce the variance. However, inside the augmentation term there is another integral which accounts for in-scattering from all directions, the estimation of which is another non-trivial task. For simplicity, we only consider single scattering from direct lighting. A light sample can be taken as in next event estimation, and a shadow ray is shot from the current ray segment’s sample point to the light to test visibility. Note that this method neglects the contribution of indirect lighting to the in-scattering, which is often too weak to affect the rendering quality.

It is worth mentioning that it is also possible to use Metropolis light transport for sampling the in-scattering. A random number can be stored and mutated in every frame so that directions with large contributions can be easily discovered and focused on, which is especially suitable for very anisotropic participating media.

Two sample images are shown in Figure 11 to exhibit the visual effect of volume rendering. The first image shows strong scattering and absorption of light in a room with dense homogeneous smog. The smog has a Henyey-Greenstein asymmetry parameter of 0.7, indicating that incident light is primarily scattered forward, as can be seen in the narrow shape of the illuminated cone under the area light. The second image illustrates fog with white emission and a density that attenuates exponentially in the vertical direction, which is a miniature of the atmosphere in a box.

Figure 11 Left: Homogeneous smog with strong absorption and forward scattering Right: Fog with exponential density

10.3 Subsurface Scattering

Scattering and absorption can happen inside objects as well. A BSDF is sufficient for estimating the radiance from most materials because many materials are either metallic (reflecting most of the energy at the surface) or dielectric and either too opaque or too transparent to exhibit any obvious scattering effect. For materials like jade, milk, and skin, whose albedo is high enough that they cannot be considered transparent but not high enough for them to be considered opaque, the effect of scattering inside the object cannot be ignored, and a BSDF cannot sufficiently approximate the surface radiance. Instead, a BSSRDF (bidirectional subsurface scattering reflectance distribution function) is used to include the contribution to the outgoing radiance at the point of interest from incoming radiance at other surface points. A 6-dimensional function

$$S(x_o, \omega_o, x_i, \omega_i)$$

($x_o$ is known, the other 3 variables are all 2D) can be used to describe the sum of all radiance scattered from $x_i$ to $x_o$ along all possible paths. Again, to calculate the outgoing radiance at the point of interest, one must integrate the contribution from points all over the object surface, which can be estimated by randomly sampling a surface point as a Monte Carlo process. However, the BSSRDF itself is largely unknown due to the complexity of the multiple scattering problem. Also, points in the immediate neighborhood may contribute most of the radiance, which implies that indiscriminately choosing a surface point leads to an intolerably low convergence rate.

To provide a practical approximation of general subsurface scattering, Jensen et al. introduced the dipole diffusion model (2001), which decomposes the BSSRDF into a diffusion term and a single scattering term as a simplification. Observing that the radiance distribution becomes nearly isotropic after thousands of scattering events in materials with very high albedo like milk, they proposed a diffusion model that transforms the incoming ray into a dipole source and uses the radial diffusion profile of the material to compute the outgoing radiance. The diffusion term has an exponential falloff with respect to the distance from the incidence point, which provides an effective p.d.f. for importance sampling. Note that the key idea of this model is to interpolate between two extreme cases (pure single scattering and pure diffusion) for general materials, which turns out to be an insufficient approximation when highly physically accurate pictures are required.

In our implementation, we only consider the contribution of the single scattering term as a demonstration of the idea of subsurface scattering. The program can be easily extended with an additional module for estimating the diffusion term. Since single scattering only happens where the refracted rays of $\omega_i$ and $\omega_o$ meet inside the material, the BSSRDF is not directly evaluated by taking a surface sample. Instead, after the ray intersects a surface with a BSSRDF component, a random distance is generated by $d = -\ln\xi / \sigma_s'$, where $\xi$ is a unit uniform random variable and $\sigma_s'$ is the reduced scattering coefficient, which accounts for the tendency of forward scattering. The intersection point is then moved by this distance along the ray direction to become the point of single scattering, and the phase function $p$ is used to importance sample the scattering direction, which becomes the direction of the next ray. Note that this method is only suitable for rendering translucent objects with low to moderate albedo like jade. Objects with high albedo still require at least the dipole model to render a reasonable appearance.
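The distance sampling used for the single scattering point is a standard inversion of the exponential distribution. A minimal sketch, assuming the coefficient passed in as `sigma` is the reduced scattering coefficient and that the uniform number lies in (0, 1] so the logarithm is always defined:

```python
import math
import random


def sample_scatter_distance(xi, sigma):
    """Invert the exponential c.d.f.: d = -ln(xi) / sigma, with xi in (0, 1]."""
    return -math.log(xi) / sigma


rng = random.Random(42)
sigma = 2.0
# 1 - random() maps [0, 1) to (0, 1], which avoids log(0).
distances = [sample_scatter_distance(1.0 - rng.random(), sigma) for _ in range(100000)]
print(sum(distances) / len(distances))  # close to the mean free path 1 / sigma
```

The sample mean approaches the mean free path `1 / sigma`, so a larger coefficient places the scattering point closer to the surface on average.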

A pair of sample images is shown in Figure 12, with the Stanford bunny in a deep jade color rendered as a translucent and as an opaque material. The result for the translucent material shows the effect of subsurface single scattering. Note that thin parts of the object, like the ears, respond more strongly to the change of albedo.

Figure 12 Left: Bunny with subsurface scattering Right: Bunny without subsurface scattering

10.4 Environment Map and Material Texture

As mentioned in the introduction, textures are heavily used in the rasterization-based graphics of most mainstream video games. The PBR workflow, as the de facto industry standard, uses a variety of textures to describe how material attributes vary across texture space. Apart from the basic diffuse color texture or albedo map, a roughness map defines the shape of the distribution of normals in the microfacet model: a roughness value of 0 indicates a perfectly smooth surface point that reflects only in the mirror direction, and a roughness value of 1 indicates a surface point with an almost equal amount of reflection in every direction, so that it no longer looks glossy. Similarly, a metalness map defines how metal-like each texel is. A metalness value of 1 defines a surface point that has only specular reflection with some absorption of specific wavelengths, which simulates the behavior of metal, while a metalness value of 0 defines a completely dielectric surface point that follows the Fresnel equation with a real index of refraction. More commonly, a uniform white specular color is defined for every opaque object to replace the evaluation of the normal-incidence reflectance $R_0$ in the Fresnel equation when the index of refraction is not defined. Furthermore, a normal map is used to simulate microscopic geometry for materials with a complex surface texture, and ambient occlusion maps are used to compensate for the lack of global illumination in rasterization-based graphics. For static objects, light maps with baked color bleeding or shadows (which require static lights) can be applied to surrounding surfaces like walls or floors to make the rendering look more realistic.
Since it is generally infeasible to generate real-time samples in rasterization-based graphics, an irradiance map and a pre-filtered mipmap are usually precomputed from the environment map of the scene to simulate diffuse and glossy reflection of indirect lighting, respectively.

However, for path tracing, we only need the maps related to material attributes in texture space, i.e. the albedo map, metalness map, and roughness map, because we are using a global illumination algorithm. We may also want to use environment maps as image-based lights, since lighting from the outer environment, such as the sky and sunlight, is usually hard to model as solid objects. Also, if the surrounding environment is sufficiently far away, using an environment map avoids the ponderous task of modeling all objects from the border of the bounding environment to the point being shaded and accumulating all possible indirect lighting between any two objects. One can simulate the intricate indirect lighting effect by sampling the texel in the reflected ray direction if nothing in the local setting (the models of interest) is hit, treating the outer environment as an infinitely far sphere so that all rays can be seen as being shot from the center of the sphere. The environment map can be stored either as a cubemap or as a spherical projection map. In our implementation, we use spherical projection maps due to the easier computation and the lower chance of artefacts.
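Looking up a spherical projection map from a ray direction only needs two inverse trigonometric functions. The sketch below is one possible convention, not necessarily the one in our code: it assumes a y-up coordinate system, a unit-length direction, u wrapping around the horizon, and v running from the zenith (0) to the nadir (1).

```python
import math


def direction_to_spherical_uv(d):
    """Map a unit direction (x, y, z) to (u, v) texture coordinates.

    Assumed convention: y is up, u = 0.5 for the +x direction,
    v = 0 at the zenith and v = 1 at the nadir.
    """
    x, y, z = d
    u = 0.5 + math.atan2(z, x) / (2.0 * math.pi)
    v = math.acos(max(-1.0, min(1.0, y))) / math.pi
    return u, v


print(direction_to_spherical_uv((1.0, 0.0, 0.0)))   # (0.5, 0.5)
print(direction_to_spherical_uv((0.0, 1.0, 0.0))[1])  # 0.0 at the zenith
```

Because the environment is treated as an infinitely far sphere, only the direction of the missed ray matters and its origin can be ignored.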

Two sample images are shown in Figure 13 to demonstrate the effect of an environment map as image-based lighting and of material textures as surface attribute control. The second image simply provides albedo, roughness and metalness maps to transform the diffuse back wall of the original Cornell box into a realistic scratched metal.

Figure 13 Upper: Armadillo under environment lighting Lower: Cornell Box with scratched metal wall

[GAPT Series - 9] Sampling Algorithms (Cont.)

Fri, 30 Jun 2017 00:00:00 -0700

This post corrects some misconceptions in the earlier section [GAPT Series - 3] Path Tracing Algorithm and introduces some new rendering methods.

A key feature of path tracing that differentiates it from ordinary ray tracing is that it is a stochastic process (provided that the random numbers are truly random) instead of a deterministic process. Standard path tracing relies on the Monte Carlo algorithm, whose result gradually converges to the ground truth as the number of samples increases. Theoretically, one can uniformly sample all paths to converge to the correct result. However, given limited hardware resources and time budgets, we need to adapt the brute-force Monte Carlo algorithm with various strategies for different cases. The following sections introduce the rendering equation we need to solve in path tracing and some of the most popular sampling methods.

9.1 BSDF and Importance Sampling

Importance sampling is an effective method for solving the rendering equation with a high convergence rate. Basically, importance sampling chooses samples according to a designed probability density function and divides the sample value by the p.d.f. to return the result. If the designed p.d.f. turns out to be proportional to the sample values, the variance will be very low. Since the sample value is determined by $f(\omega_i, \omega_o)\,L_i(\omega_i)\cos\theta_i$, both the incoming radiance and the BSDF determine the ideal p.d.f. of the surface sample. Importance sampling the BSDF is a simpler task than importance sampling the spatial distribution of incoming radiance. As long as a BSDF formula (Lambertian, Cook-Torrance, Oren-Nayar, etc.) and the corresponding surface characteristics are provided, one can calculate a p.d.f. proportional to $f(\omega_i, \omega_o)\cos\theta_i$ as the result. However, the distribution of $L_i$ is harder to calculate in the general case. For indirect lighting on the surface point, it is impossible to know the distribution of the incoming radiance in advance, which is a chicken-and-egg problem. For direct lighting, where $L_i = L_e$, different methods can be applied to find the p.d.f. With a simplified lighting model like a point light or an area light, the distribution of incoming radiance is rather explicit. However, when image-based lighting is used (e.g. an environment map or sky box), advanced techniques are required to perform importance sampling efficiently. The following sections all focus on the case of explicit lighting models, since it is easier to exemplify how to utilize the p.d.f. of incoming radiance. Discussion of image-based lighting will be continued in the next chapter. Since combining the effect of the two p.d.f.s requires sampling not only on the surface point but also on the lighting model or lighting image, a technique called multiple importance sampling will be introduced in Section 9.4.

9.2 Next Event Estimation

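For direct lighting with an explicit model, the lighting computation can be done directly without relying on the traced path to hit a light, which is called next event estimation. Literally, it accounts for the contribution of what may happen in the next iteration. For an ideal diffuse surface, whose BSDF is uniform over directions (usually expressed by the Lambertian model), this task is very simple. One only needs to sample a point on one of the lights. Take a diffuse area light (the emission of a point is uniform in all directions) as an example: if the emission across the emitting surface is the same, one only needs to uniformly sample the shape of the light. Otherwise, there is usually an existing intensity distribution to use. After that, to convert the p.d.f. from the area measure to the solid angle measure, from the formula $d\omega = \frac{\cos\theta_l}{r^2}\,dA$ (Veach, 1997) we can derive $p_\omega = p_A \frac{r^2}{\cos\theta_l}$, which can be used as the p.d.f. for incoming radiance, where $r$ is the distance to the sampled light point and $\theta_l$ is the angle between the light's normal and the direction back to the shaded point. Given that a Lambertian surface with albedo $\rho$ has a BSDF of $\rho/\pi$, the final color of an unoccluded sample can be expressed in the simple formula $L_o = \frac{\rho}{\pi}\,L_e \cos\theta\,\frac{\cos\theta_l}{r^2\,p_A}$, where $\theta$ is the angle between the surface normal and the direction to the light sample.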

For diffuse area lights of simple geometric shape like a triangle, uniform sampling can be trivially done by using barycentric coordinates. Apart from light sampling, we also need to shoot a shadow ray from the surface point being shaded to the sampled light point. Since the ray-triangle intersection test is a bottleneck in ray tracing for reasonably complex scenes, this means we will almost halve the frame rate. However, the benefit of next event estimation is well worth its cost. Figure 5 compares the convergence of diffuse reflection on two balls under a highlight with and without next event estimation, where 16 samples are taken per pixel for both sampling methods. Although the frame rate with next event estimation is only 60% of that without it, the noise level of the former is dramatically lower than that of the latter.

Figure 5. Comparison between same scene of high dynamic range with and without next event estimation
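The light sampling step of next event estimation on a triangular area light can be sketched as below. This is a host-side illustration under stated assumptions: the helper names are made up for the example, the triangle is sampled uniformly with the common square-root barycentric mapping, and the shadow ray test is left out.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def sample_triangle(a, b, c, xi1, xi2):
    """Uniformly sample a point on triangle (a, b, c) via barycentric coordinates."""
    s = math.sqrt(xi1)
    u, v = 1.0 - s, xi2 * s
    w = 1.0 - u - v
    return tuple(u * a[k] + v * b[k] + w * c[k] for k in range(3))


def triangle_area(a, b, c):
    n = cross(sub(b, a), sub(c, a))
    return 0.5 * math.sqrt(dot(n, n))


def solid_angle_pdf(pdf_area, shaded_point, light_point, light_normal):
    """Convert an area-measure p.d.f. to solid angle: p_w = p_A * r^2 / cos(theta_l)."""
    d = sub(light_point, shaded_point)
    r2 = dot(d, d)
    cos_l = abs(dot(light_normal, d)) / math.sqrt(r2)
    if cos_l == 0.0:
        return math.inf  # light seen edge-on: the sample contributes nothing
    return pdf_area * r2 / cos_l


a, b, c = (0.0, 0.0, 2.0), (2.0, 0.0, 2.0), (0.0, 2.0, 2.0)
point = sample_triangle(a, b, c, 0.25, 0.5)
pdf = solid_angle_pdf(1.0 / triangle_area(a, b, c), (0.0, 0.0, 0.0), point, (0.0, 0.0, -1.0))
print(point, pdf)
```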

However, this example only shows the case where the shaded surface has a uniform BSDF. For surfaces with a general non-uniform BSDF, like a glossy surface in the Cook-Torrance microfacet model, sampling the light is inefficient for variance reduction. An extreme case is perfect mirror reflection, whose BSDF is a delta function. Since only the mirrored direction of the view vector w.r.t. the surface normal contributes to the result, a random point sampled on the light has zero probability of contributing. We would naturally want to sample according to the BSDF instead. The general case lies in between: a glossy surface whose BSDF values concentrate in a moderately small range, with a light that occupies a moderately large portion of the hemisphere around the surface being shaded. As a remedy for such general cases, multiple importance sampling will be introduced later. However, if we want a balance between quality and speed, next event estimation can be mixed with direct path tracing for different BSDF components. Especially for surface materials with only diffuse and perfectly specular reflection, doing next event estimation by multiple importance sampling would be redundant. Instead, if the BSDF component sampled in the current iteration is diffuse, we set a flag so that the emission is not counted again if the path hits a light in the next iteration. Otherwise, the flag is false, allowing the radiance of the light hit by the main path to be accumulated. For determining which BSDF component to sample, we use a “Fresnel switch”, which is introduced in the next section.

9.3 Fresnel Switch and The Russian Roulette

Physically, there are only reflection and refraction when light, as an electromagnetic wave, interacts with a surface, and their ratio is determined by the refractive indices of the media on the two sides of the surface and the incidence angle of the light. The original formula is actually different for the s and p polarization components of the light ray. However, in computer graphics, we normally treat light as non-polarized. Under this assumption, Schlick’s approximation (Schlick, 1994) can be used to calculate the Fresnel factor: $R(\theta) = R_0 + (1 - R_0)(1 - \cos\theta)^5$, where $R_0 = \left(\frac{n_1 - n_2}{n_1 + n_2}\right)^2$ is the reflectance at normal incidence. However, in the standard PBR workflow, the refractive index is usually only provided for translucent objects. Although we can look up the refractive index of many kinds of material, for metals and semiconductors the refractive index n is a complex number, which complicates the calculation. Since the refractive index of a metal indicates absorption of the color without transmission, a specular color is used to approximate the reflection intensity in the normal direction. In practice, there is usually an albedo map and a metalness map for metallic objects. The metalness of a point determines the ratio of interpolation between the white color and the albedo color. When the metalness equals 1, the albedo color becomes the specular color in reflection, as what we refer to as the color of a metal actually describes how different wavelengths of light are absorbed, rather than transmitted or scattered as in dielectric materials.

Briefly, if there is an explicit definition of index of refraction, we will use that for calculation of the material. Otherwise, we interpolate the albedo color provided in material map or scene description file with the white specular color meaning complete reflection of light according to the metalness of the material, as defined in normal PBR workflow. However, considering the fact that dielectric material also absorbs light to some degree, the specular color can be attenuated by some factor less than 1 to provide a more realistic appearance.

After the normal-incidence reflectance $R_0$ is calculated, we can calculate the Fresnel factor by substituting the dot product of the view vector and the surface normal into Schlick’s formula. Notice that there is a power of 5, which is better computed by explicit repeated multiplication than by the pow function in the C++ or CUDA library, for better performance. The Fresnel factor R indicates the ratio of reflection. In path tracing, this is the probability of choosing the specular BRDF to sample the next direction. Complementarily, T = 1 – R expresses the probability of transmission.
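Schlick's formula with the fifth power expanded into explicit multiplications looks like this. The sketch is written in Python for brevity, so it only illustrates the arithmetic pattern that would be used in the C++ or CUDA code; the function names are assumptions for this example.

```python
import math


def r0_from_ior(n1, n2):
    """Reflectance at normal incidence from two real refractive indices."""
    return ((n1 - n2) / (n1 + n2)) ** 2


def schlick_fresnel(cos_theta, r0):
    """Schlick's approximation; (1 - cos)^5 is expanded into three multiplications."""
    m = 1.0 - cos_theta
    m2 = m * m
    return r0 + (1.0 - r0) * (m2 * m2 * m)


r0 = r0_from_ior(1.0, 1.5)          # air to glass, about 0.04
print(schlick_fresnel(1.0, r0))     # normal incidence returns r0
print(schlick_fresnel(0.0, r0))     # grazing incidence approaches 1
```

Squaring twice and multiplying once more gives the fifth power with three multiplications, which avoids a general-purpose power routine.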

In our workflow, diffuse reflection is also modeled as a transmission that is immediately scattered back, uniformly in all directions, by particles beneath the surface, which is usually modeled by a Lambertian BSDF. However, to be more physically realistic, we can refer to Ashikhmin and Shirley’s model (2000), which models the surface as a glossy layer on top of a diffuse layer, with an infinitesimal distance between them. The back scattering in realistic diffuse reflection happens at the diffuse layer as a Lambertian process, after which the reflected ray is attenuated again as it is transmitted back across the glossy surface, implying a smaller contribution from the next ray in near-tangent directions. For energy conservation, Ashikhmin and Shirley also include a scaling constant in the formula. Therefore, the complete BSDF is:

Russian roulette is used to choose the BSDF component given the Fresnel factor and other related material attributes. If the generated uniform random number in [0, 1) is below the Fresnel threshold R, the next ray is a specular reflection; if it is above R, the next ray is a transmission or a diffuse reflection. In the latter case, the “translucency” attribute of the material is used as a second threshold for choosing between transmission and diffuse reflection, which is actually an approximation of general subsurface scattering and will be discussed in the next chapter. The Fresnel switch guarantees that we can preview a statistically correct result in real time. However, the resulting thread divergence implies a severe time penalty on the GPU. Suggestions will be given in the SIMD optimization analysis in Chapter 6.
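The component selection can be sketched as a two-level switch. The function name, the string labels, and the use of a second random number for the translucency test are assumptions made for this example; in particular, treating a number below the translucency value as "transmission" is one possible reading of the threshold.

```python
def choose_bsdf_component(xi_fresnel, xi_translucency, fresnel_r, translucency):
    """Pick the BSDF component to sample for the next ray.

    xi_fresnel and xi_translucency are uniform random numbers in [0, 1).
    """
    if xi_fresnel < fresnel_r:
        return "specular"          # chosen with probability R
    if xi_translucency < translucency:
        return "transmission"      # chosen with probability (1 - R) * translucency
    return "diffuse"               # chosen with probability (1 - R) * (1 - translucency)


print(choose_bsdf_component(0.02, 0.9, fresnel_r=0.04, translucency=0.3))  # specular
print(choose_bsdf_component(0.50, 0.1, fresnel_r=0.04, translucency=0.3))  # transmission
print(choose_bsdf_component(0.50, 0.9, fresnel_r=0.04, translucency=0.3))  # diffuse
```

Because each branch is taken with exactly the probability of its component, the throughput of the chosen component does not need to be multiplied by that weight again.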

The Russian Roulette is also responsible for thread (path) termination. Since no surface can amplify the intensity of incoming light, each RGB component of the mask is always less than or equal to 1. The intensity of the mask (the value of its largest RGB component, i.e. the value channel in the HSV decomposition of the color) is then used as the survival threshold for terminating paths. To be statistically correct, the mask of a surviving path is divided by the threshold value after the Russian Roulette test. This is intuitive: if the reflectance of the surface is weak, early termination with value-compensated masks is equivalent in expectation to multiple iterations with normal masks. Such termination greatly speeds up rendering without introducing bias.
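A minimal sketch of the termination test, assuming the mask is an RGB tuple with components in [0, 1] and that `None` signals a terminated path (both are choices made for this example):

```python
def russian_roulette(mask, xi):
    """Terminate the path with probability 1 - max(mask); compensate survivors.

    Returns the rescaled mask of a surviving path, or None if the path ends.
    """
    survival = max(mask)
    if survival <= 0.0 or xi >= survival:
        return None
    return tuple(c / survival for c in mask)


mask = (0.5, 0.25, 0.1)
print(russian_roulette(mask, 0.3))  # survives: (1.0, 0.5, 0.2)
print(russian_roulette(mask, 0.7))  # terminated: None

# Averaged over evenly spaced xi, the compensated masks reproduce the original mask.
n = 1000
total = [0.0, 0.0, 0.0]
for k in range(n):
    result = russian_roulette(mask, (k + 0.5) / n)
    if result is not None:
        total = [t + c for t, c in zip(total, result)]
print([t / n for t in total])
```

Dividing by the survival probability is what keeps the estimator unbiased: the surviving half of the paths carries twice the weight in this example.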

For generating photorealistic results, it is possible to use the Russian Roulette alone to determine termination without setting a maximum depth. However, considering the extreme case where the camera is inside an enclosed room and all surfaces are perfectly specular and reflect all light, there is still a need to set a maximum depth.

9.4 Multiple Importance Sampling

Returning to our question of next event estimation, or direct lighting computation, for a general surface BSDF: the technique called multiple importance sampling was introduced by Eric Veach in his 1997 PhD dissertation. Basically, it uses a simple strategy to provide a highly efficient and low-variance estimate of a multi-factor integral mapped to a Monte Carlo process where two or more sampling distributions are used, usually in different sampling domains. Given an integral $\int f(x)\,dx$ with two available sampling distributions $p_1$ and $p_2$, a simple formula is given as the Monte Carlo estimator $F = \frac{1}{n_1}\sum_{i=1}^{n_1} w_1(X_{1,i})\frac{f(X_{1,i})}{p_1(X_{1,i})} + \frac{1}{n_2}\sum_{j=1}^{n_2} w_2(X_{2,j})\frac{f(X_{2,j})}{p_2(X_{2,j})}$, where $n_1$ and $n_2$ are the numbers of samples taken from each distribution and $w_1$ and $w_2$ are specially chosen weighting functions which must guarantee that the expected value of the estimator equals the integral $\int f(x)\,dx$. As expected, the weighting functions can be chosen from some heuristics to minimize the variance.

Veach also offers two heuristic weighting functions: the balance heuristic and the power heuristic. In the general form $w_k(x) = \frac{(n_k\,p_k(x))^\beta}{\sum_i (n_i\,p_i(x))^\beta}$, where $k$ is any particular sampling method, the balance heuristic always takes $\beta = 1$, while the power heuristic gives the freedom to choose any exponent $\beta$. Veach then uses numerical tests to determine that $\beta = 2$ is a good value for most cases.
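Both heuristics fit into one small function. In this sketch, `pdfs[k]` is the density with which strategy k would have generated the sample in question and `counts[k]` is the number of samples taken with that strategy; the function name and argument layout are assumptions for the example.

```python
def mis_weight(k, pdfs, counts, beta=2.0):
    """Weight of strategy k for one sample: (n_k p_k)^beta / sum_i (n_i p_i)^beta.

    beta = 1 gives the balance heuristic, beta = 2 the usual power heuristic.
    """
    terms = [(n * p) ** beta for n, p in zip(counts, pdfs)]
    total = sum(terms)
    return terms[k] / total if total > 0.0 else 0.0


pdfs, counts = [1.0, 3.0], [1, 1]
print(mis_weight(0, pdfs, counts, beta=1.0))  # balance heuristic: 0.25
print(mis_weight(0, pdfs, counts, beta=2.0))  # power heuristic: 0.1
```

For any fixed sample the weights of all strategies sum to 1, which is the condition that keeps the combined estimator unbiased. The power heuristic pushes more of the weight toward the strategy with the larger density.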

In order to verify the practical result of multiple importance sampling and the effect of the choice of weighting function, some test cases were performed and sample images of the results are listed below.

Figure 6 Left: multiple importance sampling Right: single importance (light) sampling

The first test case compares the result of multiple importance sampling (from both the BSDF and the light) with that of single importance sampling (only from the BSDF). In each frame, one path is traced for each pixel. Although rendered with only 10 frames, the MIS result clearly displays the shape of the reflection of the strong area light on the rough mirror behind the boxes and its further reflection on the side walls. In contrast, the non-MIS result generated with 100 frames still has a very noisy presentation of the reflected shape of the highlight. Note that for the reflection on the floor and ceiling, which have a low roughness, non-MIS at frame 100 still has a lower noise level than MIS at frame 10. Since most of the contribution to the color there comes from samples generated from the BSDF, the strength of multiple importance sampling diminishes, which is also the case when the BSDF is nearly uniform.

Figure 7 Left: balance heuristic Middle: power heuristic Right: ground truth

Another test case compares the effect of the balance heuristic and the power heuristic. The images in Figure 7 above show the result of rendering the same Cornell box scene with a moderately rough back mirror in 10 frames for both methods. Without very careful inspection, it is nearly impossible to observe any difference between the two images. However, the noise level is indeed lower when using the power heuristic. If one looks carefully at some dark regions in the picture, like the front face of the shorter box, one can observe that many noise points are brighter in the image generated by the balance heuristic (since both cases use the same seed for random number generation, it is possible to compare noise points at the same pixel position). We also offer a numerical analysis that compares the difference between each of the two images and the ground truth (rendered by Metropolis light transport in 80000 frames). The histogram correlation coefficient with the ground truth is 0.7517 for the image generated by the power heuristic, larger than the value of 0.7488 for the image generated by the balance heuristic. Although this is only a minor difference, it is consistent with the power heuristic having lower variance.

9.5 Bi-directional Path Tracing

An important fact about the sampling methods we have discussed so far is that the variance level depends on both the geometry of the emissive surfaces or lights and the local geometry of the surrounding surfaces. For brute-force path tracing, the rate of hitting the background (leaving the scene) will be much larger than that of hitting a light if the summed area of the lights is too small or the lights are almost completely occluded locally. As a result, only a few of the terminated rays carry the color of emission, causing large variance. For importance sampling or multiple importance sampling, shadow rays must be shot to the light. The expected variance level can only be guaranteed when the chance of misses is of the same order of magnitude as the chance of hits; otherwise, the variance level degenerates to that of brute-force path tracing. Thankfully, this problem can be solved by also shooting a ray from the light and “connecting” the end vertices of the eye path and the light path to calculate the contribution, which is called bi-directional path tracing (Lafortune & Willems, 1993).

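The core problem in bi-directional path tracing is the “connect” process. Since the contribution of incoming radiance is sampled as a point (the end vertex of the light path) in the area domain, we must convert it to the solid angle domain to obtain the result. The term $L_i(x, \omega_i)\cos\theta\,d\omega_i$ in the rendering equation can be converted to an equivalent $L(x' \to x)\,V(x \leftrightarrow x')\,G(x \leftrightarrow x')\,dA(x')$ in the domain of all surface areas, where $V(x \leftrightarrow x')$ is the visibility factor, which equals 1 if the connection path is not occluded and 0 otherwise, and $G(x \leftrightarrow x') = \frac{\cos\theta\,\cos\theta'}{\lVert x - x' \rVert^2}$ is the geometry term, with $\theta$ and $\theta'$ being the angles between the connection segment and the surface normals at $x$ and $x'$.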
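The geometry term of a connection can be computed from the two end vertices and their normals. A small sketch, with hypothetical helper names and with back-facing configurations clamped to zero (a choice made for this example); the visibility factor would still be determined separately by a shadow ray.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def geometry_term(x, n_x, y, n_y):
    """G(x <-> y) = cos(theta_x) * cos(theta_y) / |x - y|^2."""
    d = sub(y, x)
    r2 = dot(d, d)
    r = math.sqrt(r2)
    cos_x = max(0.0, dot(n_x, d) / r)     # angle at x, toward y
    cos_y = max(0.0, -dot(n_y, d) / r)    # angle at y, back toward x
    return cos_x * cos_y / r2


# Two parallel surfaces facing each other at distance 2.
print(geometry_term((0.0, 0.0, 0.0), (0.0, 0.0, 1.0),
                    (0.0, 0.0, 2.0), (0.0, 0.0, -1.0)))  # 0.25
```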

Importantly, we connect not only the terminated end points of the light path and the eye path, but the intermediate path ends as well. However, since each connection involves a ray-triangle intersection test, performance will be greatly affected: for path lengths of O(n), there will be O(n^2) intersection tests. In some situations, like perfect mirror reflection, we can exclude the contribution of the light path since it has zero probability, as a way to save computation time. It is worth mentioning that for all combinations with a specific path length, the contribution of each should be divided by the total number of combinations to maintain energy conservation. For any combined path length of n, there are n + 1 ways of combining an eye path and a light path if we want to include the contribution of all kinds of combination. It is also possible to apply importance sampling here, as a specific path combination can be weighted by its p.d.f. among all path combinations of the same length. However, that requires additional space to store the intermediate results and may not be good for GPU performance. Next event estimation can also be applied here, so that a direct lighting component exists for every combined path length less than or equal to (maximum eye path length + 1), for which the denominator of the contribution should be incremented by 1 to include this factor.

In order to test the effect of bi-directional path tracing, we simply inverted the light in the original Cornell Box scene so that it faces the ceiling and its light can only reach other surfaces through the small crevices around its rim. The sample images in Figure 8 show that, with both rendered for 100 frames, bi-directional path tracing remarkably reduces the noise level compared to using next event estimation alone, which is well worth the reduction of the frame rate to 50%.

Figure 8 Left: Bi-directional path tracing Right: Uni-directional (eye) path tracing

9.6 Metropolis Light Transport

Even with multiple importance sampling and bi-directional path tracing, it is still difficult to keep the variance of the integral estimate low for problems like bright indirect light, caustics caused by reflected light from caustics, and light coming through long, narrow, and tortuous corridors. The key problem of the previous sampling methods is that they only consider local importance (light or surface BSDF) rather than the importance of the whole path. In terms of global importance sampling, the original Monte Carlo method is still a brute-force solution. To our rescue, there is a rendering algorithm called Metropolis Light Transport (MLT), adapted from the Metropolis–Hastings sampling algorithm, which is based on the Markov Chain Monte Carlo (MCMC) method (Veach, 1997). It has the nice feature that the probability of a path being sampled is proportional to its contribution to the global integral of radiance toward the camera, and such paths can be locally explored by designing a mutation strategy. Basically, in every iteration it proposes a new perturbation or mutation $x'$ of the current path $x$ and accepts the proposal with probability $a = \min\left(1, \frac{f(x')\, T(x' \to x)}{f(x)\, T(x \to x')}\right)$, where $f(x)$ and $f(x')$ are the radiance values and $T(x \to x')$ and $T(x' \to x)$ are the tentative transition functions, which give the probability density of moving from one state to another under the designed mutation strategy. In general, $T(x \to x')$ and $T(x' \to x)$ are not equal. For example, given the specific task of sampling caustics, we can define a mutation strategy that only moves the path vertices on the specular surface. In return, such a mutation can have a transition probability that follows the local p.d.f.: if a point $x'$ on a specular surface contributes more highlight than another point $x$ on the same surface, as its BSDF suggests, the probability of moving from $x$ to $x'$ is greater than that of moving from $x'$ to $x$, giving $T(x \to x') > T(x' \to x)$. However, different mutation strategies need to be designed for different kinds of tasks in order to achieve the highest rendering quality.
For general tasks, we can ignore the local p.d.f. in mutation while still maintaining good performance. One way of doing so is to store and mutate only the random numbers for every sample generated along the path (selecting the camera ray, choosing the next ray direction, picking points on an area light, determining the Russian roulette value, etc.), which function as “global-local” perturbations of the current path, as suggested by Pharr & Humphreys (2011).

Another issue for MLT is ergodicity (Veach, 1997). The MCMC process must traverse the whole path space without getting stuck in some subspace, which can be solved by setting a probability for large (global) mutations. Each iteration tests a random number against a threshold and performs a large mutation if the number falls below it. With a threshold of 0.25, for example, on average 1 of every 4 mutations is large and 3 are small (local). The local mutations are sampled from an exponential distribution (Veach, 1997), implying a much larger chance of moving only slightly from the original place while still allowing moderately large local mutations. The global mutations are sampled uniformly across the [0,1] interval.
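The mix of large and small mutations on the stored random numbers can be sketched as follows. This is only an illustration under assumptions: the step-size bounds `S1` and `S2`, the wrap-around at the interval borders and the function names are choices made for the example, while the 0.25 threshold is the value used in the text.

```python
import math
import random

LARGE_STEP_PROB = 0.25               # threshold from the text
S1, S2 = 1.0 / 1024.0, 1.0 / 64.0    # assumed smallest / largest small-step size

def mutate_small(u, rng):
    """Perturb one stored random number with an exponentially distributed step."""
    r = rng.random()
    step = S2 * math.exp(-math.log(S2 / S1) * r)   # lies in (S1, S2], small steps most likely
    u = u + step if rng.random() < 0.5 else u - step
    return u % 1.0                                  # wrap around to stay in the unit interval

def mutate(sample, rng):
    """Return (new_sample, was_large) for a vector of stored random numbers."""
    if rng.random() < LARGE_STEP_PROB:
        # Large (global) mutation: redraw every number uniformly.
        return [rng.random() for _ in sample], True
    # Small (local) mutation: nudge every number.
    return [mutate_small(u, rng) for u in sample], False
```

Because the large step redraws the whole vector uniformly, every path remains reachable from every other path, which is what the ergodicity requirement asks for.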

Still another problem of MLT is that in order to choose a probability density function p corresponding to the radiance contribution of the path, which must be a scalar, we need a way to map the 3D radiance value to 1D space to determine the acceptance probability. A reasonable way of doing this is to use the Y value in XYZ color space, which reflects the intensity perceived by the human eye. A simple conversion formula is Y = 0.212671R + 0.715160G + 0.072169B. Note that no matter what mapping formula is chosen, the result is still unbiased. Choosing a mapping closer to the eye's perception curve allows faster convergence and better visual appearance in the same number of iterations.
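A minimal sketch of this mapping, combined with the Metropolis–Hastings acceptance ratio min(1, f(x')T(x'→x) / (f(x)T(x→x'))), might look like the following. The function names and the default transition densities of 1 (a symmetric mutation) are assumptions for illustration.

```python
def luminance(rgb):
    """Map an RGB radiance value to the scalar Y of XYZ color space."""
    r, g, b = rgb
    return 0.212671 * r + 0.715160 * g + 0.072169 * b

def acceptance(f_cur, f_new, t_fwd=1.0, t_bwd=1.0):
    """Acceptance probability for moving from the current path to the proposal.

    t_fwd is the density of proposing new from current, t_bwd the reverse;
    both default to 1, which corresponds to a symmetric mutation.
    """
    cur = luminance(f_cur) * t_fwd
    new = luminance(f_new) * t_bwd
    if cur <= 0.0:
        return 1.0          # always leave a path that carries no radiance
    return min(1.0, new / cur)
```

A proposal that is half as bright as the current path is accepted with probability 0.5, and a brighter proposal is always accepted.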

Nevertheless, another important issue of MLT is the start-up bias (Veach, 1997). Consider the estimator $\frac{1}{N}\sum_{i=1}^{N} \frac{w(X_i)\, f(X_i)}{p(X_i)}$, where f is the radiance, p is the mapped intensity and w is the lens filter function. The Metropolis–Hastings algorithm guarantees that the sampling probability is proportional to $p$ in equilibrium, giving the minimal variance. However, we have no way to sample from such a distribution before equilibrium is reached; the chain only gradually converges to the correct p.d.f., even though we use $p$ as the denominator, which is not actually the real p.d.f. of the current sample. This causes incorrect color in the first few samples, known as start-up bias. Depending on the requirements, if the rendering task is not time-restricted or not aimed at dynamic scenes, we can just discard the first few samples until the result becomes reasonable. Using a higher large mutation rate is also a remedy for start-up bias. However, the tradeoffs are that complex local paths become harder to detect and, while the global appearance converges quickly, local features like caustics and glossy reflection emerge very slowly. In practice, if the scene contains mostly diffuse surfaces, the large mutation rate can be set larger. On the other hand, if intricate optical effects are the emphasis of the rendering, the large mutation rate should be set much smaller. In fact, designing specific mutation strategies with customized transition functions may be better than just using “global-local” mutations.

It turns out that MLT can be trivially mapped to the GPU by running an independent task in each thread, as implemented in this project. However, this method still has its defects. For a decent convergence rate, the number of threads is set equal to the number of screen pixels (so that, on average, every pixel can be shaded in every frame), which implies high space complexity due to the random numbers (stored as floats) kept in graphics memory. If the screen width and height are W and H, the maximum path length (or combined path length if bi-directional path tracing is used) is k, and about 10 samples need to be generated for each path segment (as in our implementation), storing the MLT data takes 40k*W*H bytes in total. With 1920*1080 resolution and a reasonable k = 30, that is about 2.49 GB (2.32 GiB) of data, exceeding the graphics memory of many low to mid-end graphics cards nowadays. To solve this issue, some data compression can be done, and it is also possible to let the threads collaborate more efficiently, which means running fewer tasks while keeping the same or lower variance. However, we will not study this topic in this project, since the existing performance is adequate and other important issues remain.
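The memory estimate is easy to check with a few lines of arithmetic. The helper below is only a worked example; its name and parameters are made up for illustration, with 4 bytes assumed per stored float.

```python
def mlt_state_bytes(width, height, max_path_len,
                    samples_per_segment=10, bytes_per_sample=4):
    """Bytes needed to store the random numbers of one MLT task per pixel."""
    return width * height * max_path_len * samples_per_segment * bytes_per_sample

total = mlt_state_bytes(1920, 1080, 30)
print(total)            # 2488320000 bytes
print(total / 1e9)      # about 2.49 (decimal GB)
print(total / 2**30)    # about 2.32 (GiB)
```

Halving the resolution in each dimension or halving k reduces the footprint proportionally, which is why compression or fewer cooperating tasks are the natural remedies.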

Last but not least, the estimate of the global radiance $b = \int \lVert f(x) \rVert \, dx$ ($\lVert x \rVert$ indicates a measure of the magnitude of intensity of the RGB color) at the beginning affects the overall luminance of the final result. In the estimator $\frac{1}{N}\sum_{i} \frac{w(X_i)\, f(X_i)}{p(X_i)}$, $p$ is chosen to be $\lVert f \rVert / b$, yielding $b\, w f / \lVert f \rVert$ as the radiance contribution from one sample, where $b$ functions as a scaling factor for all samples. As a result, when the rendered result is required to be physically authentic, more samples should be taken to estimate the global radiance, although this causes considerable overhead.

To illustrate the advantage of MLT, some sample images from tests are shown in Figure 9. In both cases, bi-directional path tracing with multiple importance sampled direct lighting is used; the first row shows the result at frames 1, 30 and 100 for normal Monte Carlo (MC) sampling, whereas the second row shows the results at the same frames for MLT. The reader will notice the difference in how the two estimators accumulate color. While maintaining a low noise level from the initial frame, the MLT estimator exhibits start-up bias, as shown by the dim color of the part of the ceiling directly illuminated by the light. At frame 100, the MLT estimator has almost reached equilibrium, as seen by comparing the color of the directly illuminated part of the ceiling with the MC estimator. It is worth mentioning that the MC estimator only reaches this level of variance after more than 1000 frames, while the MLT estimator is only slightly slower per frame than the MC estimator. The slowdown is mainly attributable to atomic memory writes, as each thread runs an independent MLT task that can write color to any pixel on the screen.

Figure 9 Upper: Monte Carlo Lower: Metropolis (Both use bi-directional path tracing)

[GAPT Series - 8] Spatial Acceleration Structure (Cont.)

Fri, 30 Jun 2017 00:00:00 -0700

Now we’ve added BVH as an alternative choice of SAS! It is necessary for real-time ray tracing against dynamic scene geometry with complex moving meshes as it maintains the interior hierarchy of the mesh and only updates exterior hierarchy which is usually much simpler to do.

8.1 Triangle Clipping in Kd-Tree Construction

An important issue in kd-tree construction is the need to clip triangles that span the splitting plane at each level in order to accelerate intersection. In their 2006 paper, Wald & Havran suggested that on average, for a kd-node with N triangles, there are O($\sqrt{N}$) triangles spanning a splitting plane. Normally, we check whether the two ends of the triangle’s bounding box along the chosen splitting axis lie on different sides of the splitting plane to decide whether to add the triangle to both child nodes. However, the triangle may not overlap one of the child nodes even though it does in one dimension. Such errors can accumulate until the whole triangle lies outside its node’s bounding box. As the kd-tree gets deeper, the ratio $\sqrt{N}/N$ gets increasingly close to 1, which means that at leaf level one would expect many false positives in the intersection test, unnecessarily increasing intersection time. In addition, adding spanning triangles to both sides unnecessarily increases the construction workload, as one has to test the vertices against the boundary of the current node’s bounding box to avoid choosing a splitting plane outside the bounding box.

Since the analysis above makes it convincing that clipping triangles gives an obvious boost to intersection performance, which is often a bottleneck for path tracing, we offer some test cases to quantify the performance improvement. After that, we will explain how to clip the triangles, which is a relatively simple task that does not require importing third-party libraries.

Table 1 Speedup of triangle clipping for different models

From the table, we can observe that clipping triangles results in a speedup from 3.5% to 9.1% for different mesh complexity, which is not very drastic but obvious enough to confirm the effect of triangle clipping.

We will now illustrate how to clip an arbitrary triangle against a box with an example where a triangle is clipped to a pentagon. In Figure 2 below, for each of the 6 faces of the cuboid, we determine the intersection with every triangle edge whose two vertices lie on different sides of that face. Then, in the plane of each face, we calculate where the line through the two intersection points (if there are two) crosses the boundary of the face, which may give zero, one or two points. Any such intersection becomes a vertex of the clipped polygon. If an end vertex itself lies inside or on the border of the box’s face, it also becomes a vertex of the clipped polygon. Notice that the process is trivially parallelizable over the 6 faces, except for the memory write, i.e. expanding the current bounding box to contain the new vertex. If the construction runs on the GPU, parallelizing over the 6 faces can save a considerable amount of time in the clipping stage at each level, which does not process many triangles and has the GPU to itself.

Figure 2 A triangle clipping example
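For readers who want something runnable, here is a sketch of triangle-box clipping. Note that it uses the classic Sutherland–Hodgman approach (clipping the polygon against the six face planes one after another) rather than the per-face procedure described above; both give the clipped polygon from which the tight bounding box is taken. All function names are made up for this example.

```python
def clip_against_plane(poly, axis, bound, keep_greater):
    """Clip a convex polygon against the plane coord[axis] = bound."""
    def inside(p):
        return p[axis] >= bound if keep_greater else p[axis] <= bound

    out = []
    for i, cur in enumerate(poly):
        prev = poly[i - 1]                     # wraps to the last vertex for i == 0
        if inside(cur) != inside(prev):
            # The edge prev -> cur crosses the plane: emit the crossing point.
            t = (bound - prev[axis]) / (cur[axis] - prev[axis])
            out.append(tuple(a + t * (b - a) for a, b in zip(prev, cur)))
        if inside(cur):
            out.append(cur)
    return out

def clip_triangle_to_box(tri, box_min, box_max):
    """Return the vertices of the triangle clipped to an axis-aligned box."""
    poly = list(tri)
    for axis in range(3):
        poly = clip_against_plane(poly, axis, box_min[axis], True)
        poly = clip_against_plane(poly, axis, box_max[axis], False)
        if not poly:
            break                              # triangle lies entirely outside
    return poly

def bounds(points):
    """Tight bounding box (min corner, max corner) of a non-empty point list."""
    return (tuple(min(p[i] for p in points) for i in range(3)),
            tuple(max(p[i] for p in points) for i in range(3)))
```

Clipping the triangle (0,0,0), (4,0,0), (0,4,0) against the box from (-1,-1,-1) to (3,3,1) cuts off two corners and leaves a pentagon, matching the situation in the figure.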

8.2 SAH-based BVH

As mentioned before, BVH is a crucial component for real-time ray tracing against dynamic scene geometry. As with the kd-tree, the probabilistic analysis applies to the choice of splitting plane at each tree level, which naturally leads to a greedy choice of the local optimum as the best available algorithm for optimizing traversal cost. The only difference between kd-tree and BVH in terms of the surface area heuristic is that a BVH is a hierarchy of objects while a kd-tree is a hierarchy of subspaces. Therefore, refitting the bounding boxes is necessary when dividing a node into child nodes, which turns out to be a time-consuming bottleneck in construction.

Unlike kd-tree construction, which uses a dynamic array or vector to store all triangle events, BVH construction needs to maintain a binary search tree (BST) for each dimension. The shrinking side of the refitting process requires us to search for and delete the events that have just switched to the expanding side and add them to the BST of that side. Initialization of the BSTs costs O(N log N). Each check of a splitting position costs O(log N), so determining the best plane for one tree level costs O(N log N), leading to O(N log^2 N) total time complexity over all levels. Even worse, maintaining a binary search tree for three dimensions implies a big constant factor of 3. For this reason, the construction of an SAH-based BVH often causes serious latency in complex cases. Fortunately, a simplification that divides the space equally into a small number of buckets was proposed by Pharr & Humphreys (2011) in their famous Physically Based Rendering book. Partitions are then only considered at bucket boundaries, which gives much more efficient construction while remaining about as effective in traversal as the original method. It is easy to see that the whole construction only requires O(N log N) time, since the number of buckets is a constant. Meanwhile, a binned partition is easier and more efficient to parallelize on the GPU. Due to time limits, GPU construction of the BVH is not implemented here, since well known parallel binned BVH construction methods already exist.
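The bucket idea can be sketched in a few lines. The code below bins primitives by centroid along one axis and evaluates a simplified SAH cost (primitive count times surface area on each side, leaving out the constant traversal cost and the division by the parent area, neither of which changes which boundary wins). The bin count of 12 and all names are assumptions for illustration.

```python
def surface_area(bmin, bmax):
    dx, dy, dz = (bmax[i] - bmin[i] for i in range(3))
    return 2.0 * (dx * dy + dy * dz + dz * dx)

def union(a, b):
    """Union of two boxes given as (min corner, max corner); None means empty."""
    if a is None:
        return b
    if b is None:
        return a
    return (tuple(min(a[0][i], b[0][i]) for i in range(3)),
            tuple(max(a[1][i], b[1][i]) for i in range(3)))

def best_binned_split(boxes, axis, num_bins=12):
    """Return (cost, k): bins [0, k) go left, bins [k, num_bins) go right.

    Returns None when all centroids coincide on this axis.
    """
    centroids = [0.5 * (b[0][axis] + b[1][axis]) for b in boxes]
    lo, hi = min(centroids), max(centroids)
    if hi == lo:
        return None
    bin_box, bin_count = [None] * num_bins, [0] * num_bins
    for box, c in zip(boxes, centroids):
        k = min(num_bins - 1, int(num_bins * (c - lo) / (hi - lo)))
        bin_box[k] = union(bin_box[k], box)
        bin_count[k] += 1
    # Left-to-right sweep: bounds and counts of bins 0..k.
    left, acc, n = [], None, 0
    for k in range(num_bins - 1):
        acc, n = union(acc, bin_box[k]), n + bin_count[k]
        left.append((acc, n))
    # Right-to-left sweep, evaluating the cost at every bin boundary.
    best, acc, n = None, None, 0
    for k in range(num_bins - 1, 0, -1):
        acc, n = union(acc, bin_box[k]), n + bin_count[k]
        lbox, ln = left[k - 1]
        if ln == 0 or n == 0:
            continue                           # one side empty: not a real split
        cost = ln * surface_area(*lbox) + n * surface_area(*acc)
        if best is None or cost < best[0]:
            best = (cost, k)
    return best
```

Each primitive is touched once for binning and the two sweeps touch only the bins, so one level costs O(N) plus a constant, which is where the O(N log N) total comes from.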

The traversal of a BVH is easier to implement than that of a kd-tree but less efficient. Since we cannot guarantee any deterministic spatial order of BVH nodes, as they may overlap each other in any form, we cannot terminate traversal as soon as we find a hit in a leaf. Although the child nodes of a BVH node are also stored in “front-back” order, this only indicates the spatial order of the centroids of the two bounding boxes, because during construction a triangle is assigned to a side based on its centroid. It is entirely possible that the “back” BVH node contains a nearer hit than the “front” node. Nevertheless, the probability is not high. In most cases, the “back” BVH node has no intersection within the range trimmed by the hit found in the “front” BVH node, so it is simply popped from the stack. When the stack is then found to be empty, the function returns the nearest intersection, if there is one.

8.3 Automatic BVH Refitting

A simple solution for tracing dynamic scene geometry is to recursively refit the local BVH whenever the intended object moves beyond its parent’s bounding box. If there are far fewer tree levels outside the intended object than inside it, or if the object's movements are spatially limited, this method has a very low time cost. Note that a shrinking refit is also necessary when the object moves back toward its original position, for which we can store a record of the moving direction in the current frame as a bitmask over the 3 axes. If the bounding box of the moving object in the last frame borders its parent or an ancestor along any of the current moving dimensions read from the bitmask, we perform a shrinking refit. Also, for every movable object, a translation vector is stored and used as an offset in triangle intersection. However, when the assumption of few exterior tree levels does not hold, or when objects move violently, we need to consider alternatives to pure refitting. A combination of splitting, merging and rotation operations can be performed on the tree structure (Kopta et al., 2012), which massively increases the rendering speed for complex animated scenes as it avoids the structural degeneration of naïve refitting.

However, such method also has its limitation. When most of the objects in the scene are animated (e.g. particle system), update of BVH is serialized due to necessary atomic operations for many threads changing the boundaries of the same bounding box. In this situation, it is better to rebuild the BVH rather than modify it.

[GAPT Series - 7] Overview of Software Workflow

Fri, 30 Jun 2017 00:00:00 -0700

Figure 1 UML flowchart of our path tracer

The diagram above shows the simplified workflow of the proposed path tracer. Note that the modules for Metropolis light transport and bi-directional path tracing are not included in this diagram, which only describes normal unidirectional path tracing. However, one can easily modify the diagram to get the versions for Metropolis light transport and bi-directional path tracing, which share most of the processes. Also, it only shows the case where a kd-tree is used as the spatial acceleration structure. In fact, BVH is also implemented, especially for user manipulation of scene geometry.

An obvious characteristic of the displayed architecture is that modules seem to be evenly distributed between CPU and GPU. In fact, the GPU accounts for most of the program's running time, as the CPU is only responsible for preprocessing work like handling I/O, parsing, invoking memory allocation, calling the CUDA and OpenGL APIs, and coordinating the thread pool and the GPU kernel. Despite its frequent involvement in the path tracing task (it appears between every two recursion levels in a frame), the CPU takes up only a fraction of the time. Notwithstanding its conciseness, the diagram shows all 3 optimization factors in path tracing. The kd-tree, which is constructed on the GPU in this project, is highly optimized by the “short-stack” traversal method, which will be introduced later. Multiple importance sampling, as a generalization of single importance sampling, can be used for rendering glossy surfaces under strong highlights to reduce the variance. Thread compaction, a crucial method for increasing the proportion of effective work and memory bandwidth and a representative of SIMD optimization, is shown at the bottom of the diagram. A thread pool is maintained to coordinate the work of thread compaction.

After initialization, the program spends all of the rest of the time in two loops, the outer of which uses successive refinement algorithm to display image in real-time, and the inner of which executes a path tracing recursion level for all threads in every iteration. Between two inner iterations, thread compaction is used to maintain occupancy as mentioned before. Between two outer iterations, any user interaction is processed by CPU. After updating corresponding values in device memory, OpenGL API is called to swap frame buffers.

It is worth noting that the Thrust library is also an important component of our workflow. Developed by NVIDIA, Thrust is a C++ library for CUDA that provides parallel computing primitives like map, scatter, reduce, scan and thread compaction. As a high-level interface, Thrust enables high performance parallel computing while dramatically reducing the programming effort (NVIDIA, 2017). Rather than “reinventing the wheel”, we use Thrust for all parallel computing primitives required for GPU kd-tree construction and thread compaction in our program, due to its proven efficiency and the flexibility of its API.

Overall, the diagram shows a very macroscopic outline of the software structure, whose detail will be introduced in the following chapters. In addition, some limitations and recommended improvements will be addressed in the final chapter.

[GAPT Series - 6] Real-time Path Tracing on GPU - Introduction

Fri, 30 Jun 2017 00:00:00 -0700

Hello, it’s been half a year since the last update and now I’m back! I’ve finally finished the project and renamed it to Real-time Path Tracing on GPU. As I told you before, the second part will try to push the boundary of GPU accelerated path tracing toward real-time performance. I want to clarify here that by “real-time” I don’t mean that a ray-tracing technique can match the image quality of mainstream game graphics at a speed comparable to rasterized graphics. However, with proper constraints (limits on BRDF types, resolution and model complexity) you can actually get pretty close to real-time performance for effects that would require tons of texture tricks in a rasterization setting (e.g. the 512x512 Cornell box, as I will show below). Besides introducing the optimizations I’ve used for such a great leap in speed, I will analyze the gap between the current performance and the ideal performance we want in the future, and try to provide some suggestions on what we can do to improve this technique.

6.1 Objectives of the Second Part of the Project

The main objective of this project is to explore the capability and performance of combining the power of current GPGPU with existing and new path tracing algorithms for the task of real-time path tracing. To achieve this, a variety of factors that determine the efficiency of path tracing are studied, among which spatial acceleration structures, sampling algorithms, and single-instruction-multiple-data (SIMD) optimization are the most important. Since both the variables (lighting configuration, scene geometry and material) and the measures (frame rate, convergence rate) of path tracing performance are multi-dimensional, optimization concepts are provided in a case-by-case analysis. A standalone program is written to demonstrate these concepts. To guarantee that the concepts are applicable to general and complex cases, a considerably large subset of the functionality found in state-of-the-art path tracers is integrated into the program, including PBR (physically-based rendering) materials and participating media. Besides, bi-directional path tracing and Metropolis light transport are also studied to deal with difficult lighting configurations and to improve the rendering quality under the same time constraints.

6.2 Layout

The main content of the second part will be divided into 6 posts, post 7 to post 12. In post 7, we will have an overview of the workflow of our path tracer, followed in post 8 by the spatial acceleration structures, including SAH-based kd-tree and BVH, as the first studied factor of optimization. In post 9, the rendering equation will be reviewed with the normal Monte Carlo sampling algorithm, after which advanced sampling techniques (multiple importance sampling, bi-directional path tracing and Metropolis light transport based on the Markov Chain Monte Carlo method) will be studied in response to difficult rendering cases and for noise reduction. In post 10, before the discussion of SIMD optimization in post 11, several important shading models including surface-to-surface BSDF, ray-marching volume rendering and a simplification of subsurface scattering will be introduced due to their close relationship with sampling methods. In particular, I will propose a parallel SAH-based kd-tree construction algorithm that is suitable for current GPGPU in post 11. In post 12, benchmark methods will be introduced to compare my path tracer, NVIDIA’s Optix path tracing demo, and Cycles Renderer, a free mainstream path tracer.

[GAPT Series - 5] Current Progress & Research Plan

Sat, 03 Dec 2016 00:00:00 -0800

Currently, I have implemented a Monte Carlo path tracer (demo) with full range of surface-to-surface lighting effects including specular reflection on anisotropic material simulated by GGX-based Ward model. A scene definition text file is read from the user, whose format is modified from the popular Wavefront OBJ format by adding material description and camera parameters. I use triangle as the only primitive due to simplicity and generality. Integrated with OpenGL and using a successive refinement method, the path tracer can display the rendering result in real time. Optimization methods include algorithm-based methods: SAH based kd-tree, short stack kd-tree traversal, ray-triangle intersection in “unit triangle space”, next event estimation (explicit light sampling); and hardware-based methods: adoption of GPU-friendly data structure which has a more coalesced memory access and better cache use, reduction of thread divergence which boosts warp efficiency, etc.

The standard Cornell box is chosen to benchmark the performance. With successive refinement that takes a sample for every pixel in each frame, at 512x512 resolution, my path tracer runs at an average of 33.5 fps on my NVidia GeForce GTX 960M without explicit sampling enabled. In comparison, the state-of-the-art implementation on NVidia's Optix ray tracing engine runs the same scene at an average of 60.0 fps without explicit sampling enabled. Although many parts of the code of NVidia’s demo are hardcoded for speed, there is still a significant performance gap between my path tracer and NVidia’s Optix engine. Therefore, the most important task in the rest of my research is to further optimize the program, both in algorithm and in hardware use.

One planned optimization is thread compaction, which can solve the problem of under-utilized warps when some threads are terminated earlier than others. Apart from that, existing optimization on thread divergence and data structure will be pushed further. Since there is not a single type of SAS that performs better than other types in all different levels of scene complexity, BVH will also be implemented as an alternative of choice.

The other task is to enrich functionality and improve rendering quality. To enable dynamic scenes, I plan to adapt the kd-tree construction to the GPU. To enrich the range of optical effects, support for subsurface scattering and participating media will be considered. To support explicit sampling for translucent objects, bi-directional path tracing will be studied and implemented so that fast generation of caustics is possible. To solve convergence problems in pathological scenes like lighting from a narrow corridor, new sampling techniques such as Metropolis Light Transport will be researched and experimented with. If possible, more innovative approaches will be proposed.

The following table summarizes current progress made and the future research plan.

[GAPT Series - 4] SIMD Optimization

Sat, 03 Dec 2016 00:00:00 -0800

With each thread rendering a screen pixel, the problem of path tracing can be solved in an embarrassingly parallel way, without the need for inter-thread communication. However, it is hard to exploit the full capability of single-instruction-multiple-data (SIMD). There is very little locality in the memory access pattern due to generally inconsistent scene geometry, which means almost all scene data needs to be stored in global memory or texture memory. Even when the first ray hits follow a coherent pattern, the subsequent bounces can be as divergent as possible. Moreover, sampling with the Russian roulette method cannot avoid branching, which implies thread divergence. However, two types of optimization based on the CUDA architecture, data structure rearrangement and thread divergence reduction, can be applied to reduce the overall rendering time.

4.1 Data Structure Rearrangement

First, “flattening” the data structure to continuous memory spaces is a key method to improve memory coalescing and reduce memory access. Using kd-tree as SAS, a traditional CPU path tracer stores a tree structure with a deep hierarchy of pointers (Figure 6).

Figure 6. The commonly-used tree structure of kd-tree

Undoubtedly, this structure is unsuitable for the GPU. Dynamic memory allocation can give very bad memory coalescing, seriously limiting effective memory bandwidth in path tracing. One can easily flatten the kd-nodes into an array with the child pointers replaced by child indices, giving an array-of-structures (AoS). However, this is far from optimal. Instead of keeping a separate triangle index list for every node, we can store the pointers to triangles contiguously in one array and keep only the array offset in the node structure. This is very likely to give either coalesced memory access or better cache use because, unlike triangles, kd-nodes have better locality: if we use serial recursion in kd-tree construction, the indices of nodes near the bottom of the tree that share a near common ancestor will be very close to each other. Similarly, the triangle data can also be stored in an array, with the pointers in the triangle list array substituted by indices.

Second, compression of the data structure is another aspect we need to consider to improve memory efficiency. Notice that in the above kd-node structure, some variables can be represented using only a few bits: axis (0, 1 or 2 for x, y or z), isLeaf (0 or 1) and the number of triangles (a leaf only contains a few triangles) if we only keep an offset into the global triangle list. Rather than using separate variables to store them, one can compress them into one variable. In my path tracer, axis, triangle number, isLeaf and left child index are compressed into one 32-bit integer with a 2-5-1-24 partition using bit operations, which helps to compress a kd-node from 25 bytes to 16 bytes, where 3 words are reserved for the split position, right child index and triangle array offset. Compressing the left child index from 32 bits to 24 bits limits the number of kd-nodes to 16,777,216, which is enough for most scenes. The 16-byte compression not only reduces space complexity, but also provides memory alignment, improving the efficiency of memory access.
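The 2-5-1-24 partition can be illustrated with a small pack/unpack pair. The order of the fields inside the word (axis in the top bits, left child index in the low bits) and the field order of the 16-byte record are assumptions for this sketch; the text only fixes the bit widths.

```python
import struct

AXIS_BITS, COUNT_BITS, LEAF_BITS, CHILD_BITS = 2, 5, 1, 24

def pack_node(axis, tri_count, is_leaf, left_child):
    """Pack the four small fields into one 32-bit word."""
    assert 0 <= axis < 3
    assert 0 <= tri_count < (1 << COUNT_BITS)
    assert 0 <= left_child < (1 << CHILD_BITS)
    word = axis
    word = (word << COUNT_BITS) | tri_count
    word = (word << LEAF_BITS) | int(is_leaf)
    word = (word << CHILD_BITS) | left_child
    return word

def unpack_node(word):
    """Inverse of pack_node: returns (axis, tri_count, is_leaf, left_child)."""
    left_child = word & ((1 << CHILD_BITS) - 1)
    word >>= CHILD_BITS
    is_leaf = bool(word & 1)
    word >>= LEAF_BITS
    tri_count = word & ((1 << COUNT_BITS) - 1)
    word >>= COUNT_BITS
    axis = word & ((1 << AXIS_BITS) - 1)
    return axis, tri_count, is_leaf, left_child

# One node record: packed word, split position, right child index,
# triangle array offset. Four 4-byte words give 16 bytes.
NODE = struct.Struct("<IfII")
record = NODE.pack(pack_node(1, 4, True, 12345), 0.5, 12346, 789)
```

Here `NODE.size` is 16, and `unpack_node(pack_node(1, 4, True, 12345))` returns `(1, 4, True, 12345)`.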

Third, a structure-of-arrays (SoA) can be used in place of AoS when spatial locality is high for neighboring threads or statements. As mentioned before, path tracing does not have consistent locality for each procedure. Thus, a mixture of SoA and AoS can be used to find a balance between fewer memory accesses and more coalesced memory accesses that optimizes the overall speed. The concatenated triangle index array is an example. In addition, some triangle data (precomputed transformation coefficients, to be introduced later) can be extracted into a separate SoA to achieve better cache use when iterating through all triangles in a leaf, as the triangle indices in a leaf are usually close to each other. In the CUDA architecture, a 128-byte cache line is fetched for each global memory access (NVIDIA, 2015). In a loop that visits some contiguous elements, if the number of attributes is smaller than the number of elements, it is very likely that fewer global memory accesses will take place, as the following accesses to each attribute may already be in the cache.

4.2 Thread Divergence Reduction

Another important factor of SIMD optimization is minimizing thread divergence. My following code snippet (Figure 7) illustrates some strategies I used to reduce thread divergence. First, common statements in branches are extracted to minimize number of operations in each branch. Second, in-place swap is used to replace hardcoded-like assignment. Third, if possible, bit operations are used to replace if-else. Branches with different values assigned to the same variable can be substituted by masking and adding the two values.

Figure 7. illustration of an optimization by reducing thread divergence in kd-tree traversal

The above snippet also shows another optimization – reduction of global memory access. Rather than storing the pointer to kd-node in stack, it is better to store the index. Otherwise, there will be an extra memory read for every other possibility which will not be executed. The optimization of memory access and thread divergence is often mutual. By reducing memory access, one decreases time taken to execute a divergent branch. By decreasing branch divergence, one reduces possible needs of redundant memory access.
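The masking trick mentioned among the strategies above, replacing an if-else that assigns one of two values to the same variable, can be shown in isolation. This is a language-neutral sketch written in Python with 32-bit masks; the function names and the near/far child example are illustrative assumptions, not the project's traversal code, and Python itself gains nothing from avoiding the branch.

```python
MASK32 = 0xFFFFFFFF

def select(cond, a, b):
    """Return a if cond else b using masking and adding (32-bit unsigned values)."""
    mask = -int(bool(cond)) & MASK32          # all ones if cond is true, else zero
    return (a & mask) + (b & ~mask & MASK32)  # exactly one term is non-zero

def near_far(left, right, ray_dir_negative):
    """Order two child indices front-to-back without an if-else.

    When the ray travels in the negative direction along the split axis,
    the right child is visited first.
    """
    near = select(ray_dir_negative, right, left)
    far = select(ray_dir_negative, left, right)
    return near, far
```

On a GPU, the same expression lets every thread in a warp execute identical instructions regardless of the ray direction, so no divergent branch is created.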

4.3 A More Efficient Ray-Triangle Intersection Algorithm

A specific speed optimization is the adoption of a more efficient intersection algorithm. Ray-triangle intersection can be a performance bottleneck if the math operations are too complex. Woop et al. (2004) introduced an affine triangle transformation to accelerate a hardware pipeline. Instead of testing a fixed ray against varied triangles, it tests a “unit triangle” against different expressions of the ray, one per triangle, each obtained by an affine transformation of the ray into that triangle’s “unit triangle space”.

The inverse matrix and the translation term can be computed offline and stored as one 4D float vector per dimension. By extracting common geometry information, this method reduces the 26 multiplications and 23 additions/subtractions required by the standard ray-triangle intersection algorithm to 20 multiplications and 13 additions/subtractions. Higher performance can be obtained by separating the precomputed terms from the remaining necessary triangle data (vertex normals and material index), forming a structure of two arrays.
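A minimal sketch of the unit-triangle test follows, in Python for readability. The helper names and the row layout (three rows of `(x, y, z, w)`, with the translation in `w`) are assumptions made for illustration, and no attempt is made to reproduce the operation counts quoted above. The precomputation maps the triangle to the unit triangle in the z = 0 plane, so the per-ray test reduces to one division and a few comparisons.

```python
def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])

def dot(a, b):
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def precompute_unit_triangle(a, b, c):
    """Return three 4D rows mapping world space to unit-triangle space."""
    e1, e2 = sub(b, a), sub(c, a)
    n = cross(e1, e2)
    det = dot(n, n)                  # determinant of the matrix [e1 e2 n]
    if det == 0.0:
        raise ValueError("degenerate triangle")
    rows = []
    for r in (cross(e2, n), cross(n, e1), n):   # rows of the inverse matrix
        r = (r[0] / det, r[1] / det, r[2] / det)
        rows.append(r + (-dot(r, a),))          # w holds the translation term
    return rows

def intersect_unit_triangle(rows, origin, direction):
    """Return (t, u, v) for a hit, or None for a miss."""
    o = [dot(r[:3], origin) + r[3] for r in rows]   # transformed origin
    d = [dot(r[:3], direction) for r in rows]       # transformed direction
    if d[2] == 0.0:
        return None                  # ray parallel to the triangle plane
    t = -o[2] / d[2]                 # distance to the plane z = 0
    u = o[0] + t * d[0]
    v = o[1] + t * d[1]
    if t <= 0.0 or u < 0.0 or v < 0.0 or u + v > 1.0:
        return None
    return t, u, v
```

Here `u` and `v` are the barycentric weights of the second and third vertex, so they can be reused for normal interpolation.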

References

NVIDIA. (2015). Memory Transactions. NVIDIA® Nsight™ Development Platform, Visual Studio Edition 4.7 User Guide.

[GAPT Series - 3] Path Tracing Algorithm

Sat, 03 Dec 2016 00:00:00 -0800

3.1 The Rendering Equation

The rendering equation introduced by Kajiya (1986) defines the radiance $L_o(x, \omega_o)$ seen from a point $x$ in the reflection direction $\omega_o$ (the view vector in ray tracing terms, as a result of the reversibility of light paths):

$$L_o(x, \omega_o) = L_e(x, \omega_o) + \int_{\Omega} f_r(x, \omega_i, \omega_o)\, L_i(x, \omega_i)\, (\omega_i \cdot n)\, d\omega_i$$

which is the emittance of the point itself plus the reflectance caused by all incident radiance summed over the surrounding hemisphere. This is a physically correct model of global illumination considering only surface-to-surface reflection. In computer graphics, it is usually mapped into a recursion in which the integral is decomposed into randomly picked samples. In path tracing, the light paths arriving within a pixel are sampled individually from random positions within the pixel, and the results are averaged. Upon hitting a surface, only one secondary ray is shot for each sample. It follows intuitively that the secondary ray must be generated by some probabilistic mechanism, which is defined by the BRDF of the surface. However, some problems remain. Given a limited number of samples, how do we choose a sampling strategy that maximizes image quality? Given a specific BRDF, how do we translate it into an algorithm that fits the sampling strategy we use?
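With one secondary ray per hit, the hemispherical integral is replaced by a single-sample Monte Carlo estimate. As a sketch, write $L_e$ for the emitted radiance, $f_r$ for the BRDF, $L_i$ for the incoming radiance, $n$ for the surface normal and $p(\omega_i)$ for the probability density used to pick the secondary direction (the symbol $p$ is introduced here for illustration):

```latex
L_o(x, \omega_o) \approx L_e(x, \omega_o)
  + \frac{f_r(x, \omega_i, \omega_o)\, L_i(x, \omega_i)\, (\omega_i \cdot n)}{p(\omega_i)},
  \qquad \omega_i \sim p
```

Averaging many such estimates per pixel converges to the integral, and the closer $p$ is to the shape of the numerator, the lower the variance, which is what the two questions above are about.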

3.2 Stratified Sampling vs. Anti-aliasing Filters

To generate samples within pixels, a naïve solution is uniform sampling. In CUDA device code, the function curand_uniform(), called with a per-thread generator state, produces 1D uniform pseudorandom samples between 0 and 1. However, uniformly distributed samples tend to form clusters, producing a high noise level. A common way to overcome this is stratified sampling, which divides a pixel into a grid and takes the same number of uniform samples within each cell. Theoretically, for a smooth integrand it reduces the error of estimation from $O(1/\sqrt{n})$ to $O(1/n)$, where n is the number of samples. A problem of stratified sampling is that it is not suitable for the successive refinement required in real-time rendering. With successive refinement, usually one sample is taken for every pixel in each frame. If we want minimal aliasing in any given frame, the samples accumulated so far must at least not follow any fixed pattern, which is not possible for stratified sampling, because it must take a complete grid of samples as a unit. Therefore, we want a solution that has both a low noise level and the capability of successive refinement.
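For comparison with the filter-based approach, here is a minimal sketch of stratified (jittered) sampling inside one pixel, in Python for readability. The function name and the `rng` parameter are hypothetical. It also makes the limitation visible: the samples only cover the pixel evenly once all k*k of them have been taken.

```python
import random

def stratified_samples(k, rng=random.random):
    """Return k*k jittered samples in the unit square, one per grid cell."""
    samples = []
    for j in range(k):
        for i in range(k):
            # One uniform sample inside cell (i, j) of the k-by-k grid.
            samples.append(((i + rng()) / k, (j + rng()) / k))
    return samples
```

Passing a constant `rng` such as `lambda: 0.5` degenerates to a regular grid of cell centers, which is a convenient way to check the cell layout.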

In smallpt, the famous 99-line C++ implementation of path tracing, Beason (2007) applies a tent filter to the uniform random samples, which shifts more samples towards the center of the pixel. In my test, this method produces the same image quality as stratified sampling for the same number of samples, while retaining the ability of successive refinement. In fact, the tent function approximates the sinc function, the ideal anti-aliasing filter (Figure 3). There are other approximations of higher quality, such as the bicubic filter and the Gaussian filter. However, these filters have much higher computational overhead, so the tent filter is a more practical choice in real-time rendering.
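The tent-filter warp can be sketched as follows, in Python for readability. `tent_offset` follows the expression used in smallpt. The `sample_in_pixel` wrapper, its `radius` parameter and its default value are assumptions for illustration and do not reproduce smallpt's 2x2 subpixel layout.

```python
import math
import random

def tent_offset(u):
    """Map a uniform u in [0, 1) to an offset in [-1, 1) with a tent-shaped density."""
    r = 2.0 * u
    return math.sqrt(r) - 1.0 if r < 1.0 else 1.0 - math.sqrt(2.0 - r)

def sample_in_pixel(px, py, radius=1.0, rng=random.random):
    """Return a sample position around the center of pixel (px, py).

    radius is a hypothetical filter half-width measured in pixels.
    """
    return (px + 0.5 + radius * tent_offset(rng()),
            py + 0.5 + radius * tent_offset(rng()))
```

Because every call is independent of the previous ones, one sample per pixel per frame can be accumulated indefinitely, which is what successive refinement needs.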

Figure 3. comparison between sinc function and tent function

Sample rays reflected by the surface can be chosen from all possible directions within the hemisphere around the incident point. Without importance sampling, it is hard to achieve a reasonable convergence rate when the variance of radiance is high. Consider two factors: the surface characteristics and the spatial distribution of incoming radiance. The first factor can be easily described by BRDF models, which offer a distribution function for importance sampling. In contrast, the second factor is more difficult to analyze. For cases where light hits the surface directly, such as diffuse reflection, references to emitting surfaces (lights) can be stored separately to enable explicit light sampling. However, when transmission is included, this method no longer works. Effects like caustics must be achieved using bidirectional path tracing if importance sampling with respect to the incoming radiance distribution is required. This is a topic I plan to study in the next half of the project.

3.3 BRDF & Anisotropic Material

For importance sampling based on BRDF, an important progress I have achieved is the simulation of anisotropic materials. Ward (1992) introduced a practical BRDF by modifying the Beckmann distribution factor in the Cook-Torrance model (1982), whose exponential term becomes

$$D(\theta, \phi) \propto \exp\left(-\tan^2\theta \left(\frac{\cos^2\phi}{\alpha_x^2} + \frac{\sin^2\phi}{\alpha_y^2}\right)\right)$$

where $\alpha_x$ and $\alpha_y$ correspond to the “roughness” of the material in the x and y directions w.r.t. tangent space. Taking the azimuth angle $\phi$ as argument, it is easy to see that when $\phi = 0$ the distribution is completely determined by $\alpha_x$, and when $\phi = \pi/2$ it is totally decided by $\alpha_y$. However, in my implementation of anisotropic materials I chose the GGX function instead of the Beckmann function due to its simplicity and faster computation. An isotropic version of GGX was adopted in Unreal Engine 4 (Karis, 2013):

$$D(h) = \frac{\alpha^2}{\pi \left((n \cdot h)^2 (\alpha^2 - 1) + 1\right)^2}$$

By replacing the single roughness $\alpha$ with an azimuth-dependent combination of $\alpha_x$ and $\alpha_y$, the equation works for anisotropic surfaces. It turns out that the sampling azimuth angle and altitude angle can then be computed in closed form from two unit uniform random variables. Notice that the resulting expressions can be computed faster than in Beckmann’s case, where two extra math functions, cos() and log(), are involved. Below is a sample picture (Figure 4) showing the anisotropic effect produced by the modified Ward model with GGX distribution.
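As an illustration of sampling an anisotropic GGX lobe, here is a minimal sketch in Python. It uses the commonly published closed-form inversion for anisotropic GGX, which is not necessarily identical to the expressions used in the original implementation. The function names and the symbols `xi1` and `xi2` (two uniform random numbers in [0, 1)) are assumptions made for this example.

```python
import math

def sample_ggx_aniso(alpha_x, alpha_y, xi1, xi2):
    """Return (theta, phi) of a half vector drawn from an anisotropic GGX lobe."""
    # Azimuth: stretch the uniform angle by the roughness ratio; atan2 keeps the quadrant.
    phi = math.atan2(alpha_y * math.sin(2.0 * math.pi * xi1),
                     alpha_x * math.cos(2.0 * math.pi * xi1))
    # Azimuth-dependent term that plays the role of 1/alpha^2 in the isotropic case.
    a = (math.cos(phi) / alpha_x) ** 2 + (math.sin(phi) / alpha_y) ** 2
    tan2_theta = xi2 / ((1.0 - xi2) * a)
    theta = math.atan(math.sqrt(tan2_theta))
    return theta, phi

def to_tangent_space(theta, phi):
    """Convert spherical angles to a unit vector with z along the surface normal."""
    s = math.sin(theta)
    return (s * math.cos(phi), s * math.sin(phi), math.cos(theta))
```

With `alpha_x == alpha_y` the azimuth reduces to a uniform angle and the altitude to the isotropic GGX inversion, which gives a simple sanity check.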

Figure 4. Comparison between isotropic and anisotropic surfaces

3.4 Next Event Estimation

In order to utilize the spatial distribution of incoming radiance in importance sampling, I applied next event estimation, implemented by explicit light sampling. As mentioned before, references to triangles with emittance greater than 0 (“light triangles”) are stored in an array. When a ray hits a surface, shadow rays are shot from the hit point to every light triangle. Notice that more shadow rays imply faster convergence but a lower frame rate, due to the extra kd-tree traversal and ray-triangle intersection cost. When doing real-time path tracing, one can determine the balance between convergence rate and frame rate heuristically.

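To sample shadow rays, an end point is picked randomly within the boundary of every light triangle, and the fixed start point is subtracted from it to derive the ray direction. However, for importance sampling, all other terms need to be divided by the probability density function (pdf). For a triangle, it is determined by the solid angle it spans w.r.t. the hit point divided by the hemispherical solid angle ($2\pi$). To compute this solid angle, an elegant formula was found by Van Oosterom and Strackee (1983):

$$\tan\left(\frac{\Omega}{2}\right) = \frac{\mathbf{R}_1 \cdot (\mathbf{R}_2 \times \mathbf{R}_3)}{R_1 R_2 R_3 + (\mathbf{R}_1 \cdot \mathbf{R}_2) R_3 + (\mathbf{R}_1 \cdot \mathbf{R}_3) R_2 + (\mathbf{R}_2 \cdot \mathbf{R}_3) R_1}$$

where $\mathbf{R}_1, \mathbf{R}_2, \mathbf{R}_3$ are the position vectors of the three triangle vertices w.r.t. the origin (here, the hit point) and $R_1, R_2, R_3$ are their lengths.

If we normalize the three position vectors in advance, the formula simplifies to

$$\tan\left(\frac{\Omega}{2}\right) = \frac{\mathbf{r}_1 \cdot (\mathbf{r}_2 \times \mathbf{r}_3)}{1 + \mathbf{r}_1 \cdot \mathbf{r}_2 + \mathbf{r}_1 \cdot \mathbf{r}_3 + \mathbf{r}_2 \cdot \mathbf{r}_3}$$

which is more convenient for computation.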
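The normalized form of the Van Oosterom and Strackee formula can be sketched as follows, in Python for readability. The function names are hypothetical. `atan2` is used instead of a plain arctangent so that a non-positive denominator (very large solid angles) is handled, and the absolute value removes the dependence on vertex winding.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])

def normalize(v):
    length = math.sqrt(dot(v, v))
    return (v[0] / length, v[1] / length, v[2] / length)

def triangle_solid_angle(a, b, c, p):
    """Solid angle in steradians subtended by triangle abc as seen from point p."""
    r1, r2, r3 = (normalize(sub(v, p)) for v in (a, b, c))
    numerator = dot(r1, cross(r2, r3))
    denominator = 1.0 + dot(r1, r2) + dot(r1, r3) + dot(r2, r3)
    return abs(2.0 * math.atan2(numerator, denominator))

def hemisphere_fraction(a, b, c, p):
    """Solid angle divided by the hemispherical solid angle 2*pi, as in the text."""
    return triangle_solid_angle(a, b, c, p) / (2.0 * math.pi)
```

A triangle whose vertices sit on the three coordinate axes covers one octant as seen from the origin, so the expected solid angle is pi/2.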

Importantly, the triangles used for explicit light sampling must be excluded from the scene intersection of the next ray to avoid summing their radiance twice. A bitmap can be used to mark which light triangles have been explicitly sampled, provided the number of light triangles is within a proper limit. If the intersected triangle turns out to have been explicitly sampled by the previous ray, the ray is abandoned.

Next event estimation is crucial for increasing the convergence rate for scenes with high dynamic range. For example, the light’s emittance factors can be 100 times greater than the surface reflectance factors while the area of the light is small. Below (Figure 5) is an example showing the significant reduction of noise level by NEE given the same number of successively refined frames.

Figure 5. Comparison between same scene of high dynamic range with and without next event estimation

3.5 The Russian Roulette Algorithm

So far, all possible surface-to-surface reflection and refraction types are supported in my path tracing program. Russian roulette is used here to determine which type of light path to take. Apart from the diffuse color and specular color, I also defined 4 extra material parameters (transparency, metalness and a two-component roughness), each ranging from 0.0 to 1.0; the first two play the role of thresholds in Russian roulette. For every ray hit, a uniform random variable is first tested against the transparency value to determine the chance of transmission versus reflection. If the transparency value is 1.0, the random variable will always be smaller than or equal to it, branching to the transmission case. If the random variable is greater than the transparency value, a random variable is tested against the metalness value to determine the ratio between specular and diffuse reflection. If it is greater than the metalness value, the next ray is generated from the diffuse BRDF (Lambert in my implementation). Otherwise, the next ray is treated as the incoming radiance of a specular BRDF. Roughness is a 2D float vector that determines the $\alpha_x$ and $\alpha_y$ parameters of the Ward model, which reduces to the Cook-Torrance model when the two roughness values are the same.
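The threshold tests described above can be sketched as a small decision function, in Python for readability. The function name, the string labels and the use of two separate random numbers are assumptions for illustration. The random numbers are assumed to lie in (0, 1], the range produced by curand_uniform, so that a transparency of 1.0 always transmits and a transparency of 0.0 never does.

```python
def choose_path(transparency, metalness, u1, u2):
    """Pick the next light path type from two uniform random numbers in (0, 1]."""
    if u1 <= transparency:
        return "transmission"   # refract through the surface
    if u2 > metalness:
        return "diffuse"        # sample the Lambert BRDF
    return "specular"           # sample the specular lobe
```

For a half-transparent, non-metallic material, roughly half of the paths transmit and the rest are diffuse; a fully metallic opaque surface always takes the specular branch.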

References

Beason, K. (2007). smallpt: Global Illumination in 99 lines of C++. Retrieved from http://www.kevinbeason.com/smallpt/

Cook, R. L., & Torrance, K. E. (1982). A reflectance model for computer graphics. ACM Transactions on Graphics (TOG), 1(1), 7-24.

Kajiya, J. T. (1986, August). The rendering equation. In ACM Siggraph Computer Graphics (Vol. 20, No. 4, pp. 143-150). ACM.

Karis, B. (2013). Real shading in unreal engine 4. part of ACM SIGGRAPH 2013 Courses.

Van Oosterom, A., & Strackee, J. (1983). The solid angle of a plane triangle. IEEE transactions on Biomedical Engineering, 2(BME-30), 125-126.

Ward, G. J. (1992). Measuring and modeling anisotropic reflection. ACM SIGGRAPH Computer Graphics, 26(2), 265-272.

[GAPT Series - 2] Spatial Acceleration Structure

Sat, 03 Dec 2016 00:00:00 -0800

2.1 Choice of SAS

The naïve implementation of ray tracing related algorithms iterates through the set of all primitives in the scene and checks ray-primitive intersection for each, which is very time-consuming (linear in the number of primitives) and becomes a severe performance bottleneck when the number of primitives gets high. In practice, different spatial acceleration structures (SAS) are applied to solve the problem. They generally improve the speed to logarithmic time and can therefore make interactive ray tracing possible for complex or even dynamic scenes. Octree, BSP (binary space partitioning), BVH (bounding volume hierarchy) and kd-tree are some representatives of SAS. An SAS generally divides the scene or mesh into recursive sub-spaces, which often form a tree-like structure. Among them, octree and BSP are the type of solution that chooses the split position by a fixed routine. For example, a typical octree always chooses the center of the space to divide it into 8 sub-spaces. Since they are indiscriminate to the specific geometry of the scene, they generally exhibit lower efficiency than BVH or kd-tree. BVH and kd-tree, on the other hand, use heuristics to determine the partition position based on the specific scene geometry. In terms of efficiency, Vinkler et al. (2014) have shown that kd-tree has higher performance than BVH for complex scenes, while BVH beats kd-tree for simple to moderately complex scenes. Considering this, I planned to implement both structures for the freedom of choice with respect to different kinds of scenes. However, due to the time limit, I have only studied and implemented the kd-tree up to now. Therefore, this part will focus on my findings on the kd-tree.

2.2 Surface Area Heuristics

The construction of a kd-tree depends on choosing one dimension and the split position in that dimension in every iteration. A naïve solution is to cycle through the 3 axes and choose the space median every time, giving no better performance than an octree. A popular mechanism is the Surface Area Heuristic (SAH) (Wald & Havran, 2006), a greedy algorithm that finds a local optimum based on the surface areas of the two child nodes in every step. Instead of finding the global minimum cost, which is practically infeasible as the number of possible trees grows exponentially with scene complexity, SAH assumes that all the primitives in the child nodes of a particular step are in leaves, giving the formula of the expected cost of a particular split:

$$C = C_t + C_i \left(\frac{SA(L)}{SA(V)} N_L + \frac{SA(R)}{SA(V)} N_R\right)$$

where $SA(x)$ is the surface area of volume $x$ ($V$ being the node to split, $L$ and $R$ its left and right children), $C_t$ and $C_i$ correspond to the cost of a traversal step and an intersection step, and $N_L$ and $N_R$ are the numbers of primitives in the left and right child nodes. According to geometric probability, $SA(L)/SA(V)$ gives the chance that a uniformly distributed ray passing through $V$ also hits the left node, and likewise $SA(R)/SA(V)$ for the right node. Uniformity is actually a reasonable assumption, as the distribution of rays tends not to follow any certain pattern as the number of ray bounces increases and the geometry of the scene varies. On the other hand, although treating both nodes as leaves overestimates the real cost, the strategy works well in practice (Wald & Havran, 2006). Another advantage of SAH is that it is easy to determine when to stop splitting, as one can compare the cost of splitting and not splitting directly from the above formula.
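The split cost and the stopping criterion can be sketched as follows, in Python for readability. The function names and the default cost constants (`c_trav=1.0`, `c_isect=1.5`) are placeholders chosen for this example, not values from the original implementation. Boxes are given by their minimum and maximum corners.

```python
def surface_area(lo, hi):
    """Surface area of an axis-aligned box with corners lo and hi."""
    dx, dy, dz = (hi[i] - lo[i] for i in range(3))
    return 2.0 * (dx * dy + dy * dz + dz * dx)

def sah_split_cost(lo, hi, axis, pos, n_left, n_right, c_trav=1.0, c_isect=1.5):
    """Expected cost of splitting box (lo, hi) at position pos along axis."""
    left_hi = list(hi)
    left_hi[axis] = pos
    right_lo = list(lo)
    right_lo[axis] = pos
    parent_area = surface_area(lo, hi)
    p_left = surface_area(lo, left_hi) / parent_area    # chance of hitting the left child
    p_right = surface_area(right_lo, hi) / parent_area  # chance of hitting the right child
    return c_trav + c_isect * (p_left * n_left + p_right * n_right)

def should_split(split_cost, n_primitives, c_isect=1.5):
    """Split only if it is cheaper than testing every primitive in a leaf."""
    return split_cost < c_isect * n_primitives
```

During construction, `sah_split_cost` would be evaluated for every candidate plane and the cheapest one kept; if even that one fails `should_split`, the node becomes a leaf.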

In my implementation, I complete the construction of the kd-tree on the CPU and then transfer the structure to the GPU, because the construction process writes a large data set to memory and uses branching heavily. Without special optimization and modification of the algorithm, the efficiency on the GPU can be much lower than that of the CPU version. However, Zhou et al. (2008) proposed a clever method of GPU kd-tree construction which significantly outperforms single-core CPU algorithms and is competitive with multi-core CPU algorithms. This potential makes GPU construction of the kd-tree a worthwhile topic for the rest of my research.

2.3 Dealing with Pathological Cases

Some pathological cases are often encountered in the construction of a kd-tree. The following is illustrated using triangles as primitives. Consider an O(N) algorithm that scans through the spatially ordered set of triangle vertices, adding a triangle to the left child when encountering the starting vertex of its bounding box and to the right child when encountering the ending vertex. It guarantees that if a triangle has any measurable area on the left or right side of the split plane, it will exist within that node. However, when a triangle lies completely on the split plane, it will only be added to one of the child nodes (the split position is chosen from the vertices, but the vertex at the split position cannot be treated as a candidate whose triangle needs to be added to a child node, because that would add unnecessary intersection cost within that node). Since all three vertices have the same coordinate in the current dimension (x, y or z), whether the triangle is added to the left or right node is arbitrary (Figure 1). This causes structural disparity among the child nodes containing such triangles, which further leads to incongruent thread paths, as some rays hit the triangle directly from the front while others undergo complicated traversal steps to route back from the back side. A simple solution is to store a marker for every starting and ending triangle vertex, indicating whether it is at the same position in the current dimension as its counterpart. When the split position happens to fall on the position of such a triangle, the marker is read to determine whether to add it to the other child node as well, which completely solves the problem.

Figure 1. The vertices lying on the splitting plane can belong to either left or right child node

2.4 Kd-tree Traversal

For the traversal of the kd-tree, the standard CPU algorithm, which uses a stack to store the back-side nodes, cannot be directly applied to the GPU. First, the stack needs to be implemented as a fixed-length array, which guarantees coalesced memory access and reduces memory reads and writes, instead of a linked list or dynamic array. Second, the size of a stack item should be compressed as much as possible, because a complex scene with dozens of kd-tree levels requires the stack to be allocated in local memory instead of the thread registers, whose capacity is very limited. Foley & Sugerman (2005) introduced two stackless traversal algorithms called kd-restart and kd-backtrack. However, without proper priority information stored for traversal, these algorithms require modification of the traversal path, which brings extra time and space complexity: kd-restart goes directly into the nearest leaf and then restarts from the root with the ray range advanced forward, while kd-backtrack stores extra node data to improve restart efficiency, as it can restart from a node’s parent. Meanwhile, the worst case of kd-restart degenerates to linear time.

A neat solution proposed by Santos et al. (2012) adopts a “short stack” method. Instead of storing 12 bytes per entry as in standard algorithms (4 bytes for the node address, 4 bytes for the near ray distance (tnear) and 4 bytes for the far ray distance (tfar)), they discovered that tnear can be derived in the traversal process and only updates when the traversal finishes checking a leaf, giving an “8-byte” stack algorithm. The advantage of the “8-byte” stack is not only that less local memory is required, but also faster memory access, thanks to the fact that an 8-byte load is faster than a 12-byte load from local memory in the CUDA architecture.
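The idea of keeping only the node index and tfar on a fixed-length stack can be sketched as follows. This is an illustrative Python sketch, not the algorithm of Santos et al. as published: the node layout, `MAX_DEPTH` and the function name are assumptions, and instead of intersecting primitives it simply records the leaves in the order the ray visits them. After a leaf is finished, tnear is re-derived as the tfar of the segment just processed, so it never needs to be pushed.

```python
import math

MAX_DEPTH = 32  # hypothetical fixed stack capacity

def traverse(nodes, origin, direction, t_min, t_max):
    """Return leaf labels in front-to-back order along the ray.

    nodes is a list in which an interior node is (axis, split, left, right)
    with child indices, and a leaf is ("leaf", label). The root is nodes[0].
    """
    stack_node = [0] * MAX_DEPTH     # 4 bytes per entry on the GPU
    stack_far = [0.0] * MAX_DEPTH    # 4 bytes per entry on the GPU
    sp = 0
    visited = []
    node, t_near, t_far = 0, t_min, t_max
    while True:
        while nodes[node][0] != "leaf":
            axis, split, left, right = nodes[node]
            o, d = origin[axis], direction[axis]
            t = (split - o) / d if d != 0.0 else math.inf
            below_first = o < split or (o == split and d <= 0.0)
            near, far = (left, right) if below_first else (right, left)
            if t > t_far or t <= 0.0:
                node = near                      # segment lies entirely on the near side
            elif t < t_near:
                node = far                       # segment lies entirely on the far side
            else:
                stack_node[sp] = far             # visit the far side later
                stack_far[sp] = t_far
                sp += 1
                node = near
                t_far = t
        visited.append(nodes[node][1])           # a real tracer intersects primitives here
        if sp == 0:
            return visited
        sp -= 1
        t_near = t_far                           # derived, never stored on the stack
        node, t_far = stack_node[sp], stack_far[sp]
```

A real tracer would return as soon as a leaf yields a hit no farther than the current tfar, which is what makes front-to-back order valuable.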

By combining SAH construction of the kd-tree with “short stack” traversal, my SAS achieves the best performance among the combinations I tested. Below is the experiment data for different combinations of construction and traversal methods on different scenes (Figure 2). Notice that these data were obtained with some SIMD optimizations applied to the data structure; I will discuss these and compare the performance with the non-optimized versions in part 4.

Figure 2

References

Foley, T., & Sugerman, J. (2005, July). KD-tree acceleration structures for a GPU raytracer. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware (pp. 15-22). ACM.

Wald, I., & Havran, V. (2006, September). On building fast kd-trees for ray tracing, and on doing that in O (N log N). In 2006 IEEE Symposium on Interactive Ray Tracing (pp. 61-69). IEEE.

[GAPT Series - 1] Introduction to GPU Accelerated Path Tracing

Sat, 03 Dec 2016 00:00:00 -0800

This series records the progress of my final year paper - GPU Accelerated Path Tracing. It will be divided into 6 chapters.

1 Introduction
2 Spatial Acceleration Structure
3 Path Tracing Algorithm
4 SIMD Optimization
5 Current Progress & Research Plan

1.1 Background Information

Polygon rasterization has been the de facto standard technique for real-time graphics in video games over the past few decades, while ray tracing related methods are still mainly used in off-line rendering of animated films and industrial designs. However, recent years have witnessed rapid growth in the capability of real-time ray tracing with the advent of General Purpose GPU (GPGPU) computing and associated programming interfaces like CUDA and OpenCL. Nowadays, it is possible to ray-trace complex scenes without global illumination in real time on high-end GPUs. Because of their theoretical straightforwardness in dealing with complex optical effects, and the huge potential for performance growth thanks to the scalability of GPUs, which follows Moore’s law without the power wall faced by CPUs, ray tracing related methods have been considered the standard graphics rendering technique of the future.

However, to achieve photorealistic effects which, in rasterization-based graphics, are implemented by heavy use of textures, a ray tracing model must include global illumination beyond the simplified light paths defined in Whitted ray tracing. Some ancillary methods like radiosity and distribution ray tracing have been added to the original Whitted framework to enhance the lighting, yet each of them targets a specific subset of general global illumination, and their combination does not exhaust all possible light paths. Path tracing, which samples all possible light paths using Monte Carlo methods and the rendering equation, has come to the rescue. Fundamentally expanding the scope of traditional ray tracing, path tracing can naturally generate authentic global illumination while remaining theoretically simple. Yet it requires thousands of samples per pixel to reduce the noise in the picture below human perception and is therefore used mostly in offline rendering tasks like film production. An interactive refinement method, which averages the result over consecutive frames (each frame takes an extra sample for every pixel and accumulates it onto the result of the last frame if the camera is still), has been adopted in some real-time path tracing demos. However, such a method is still unfriendly to games because of the persistent noise whenever anything in the scene changes.
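The frame-by-frame refinement mentioned above amounts to a running average per pixel. A minimal sketch in Python, with a hypothetical function name, assuming one new radiance sample per pixel per frame and a frame counter that is reset to 1 whenever the camera or the scene changes:

```python
def accumulate(mean, sample, frame_count):
    """Fold the frame_count-th sample (1-based) into the running per-pixel mean."""
    return mean + (sample - mean) / frame_count
```

Resetting `frame_count` to 1 discards the history in a single step, which is exactly why the noise comes back every time something moves.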

1.2 Problem Statement

Admittedly, with the limited capability of current hardware, mature real-time path tracing is still a long way off. However, on any given GPGPU, current parallel path tracing algorithms still have a lot of room for improvement in exploiting the full SIMD (Single Instruction Multiple Data) capability. It is therefore an interesting and promising task to optimize the current parallel algorithms and hardware use as much as possible, as it helps accelerate the arrival of real-time path tracing.

1.3 Project Objectives

In this project, I chose CUDA as the platform for my path tracing implementation. The main objective is to accelerate path tracing as much as possible by optimizing different factors, while other objectives include the enhancement of rendering quality and the enrichment of functionality. Integrated with NVIDIA graphics cards and offering a C++ programming interface, CUDA has been popular among developers due to its efficiency and scalability. In this interim report, the three topics that I have studied so far (spatial acceleration structures (SAS), the path tracing algorithm and SIMD optimization) will be discussed. More specifically, the analysis will focus on how to optimize the construction and traversal of the SAS, enhance the efficiency and functionality of the path tracing algorithm, and exploit as much parallelism as possible on the graphics card.

[PBR Series - 6] Improvement of PBR

Fri, 02 Dec 2016 00:00:00 -0800

6.1 Re-arrangement of Textures

A problem of the current implementation of PBR is that a black seam occurs when the roughness of the object is high. The “mip-map” arrangement of the pre-filtered environment map in the interim report uses pyramid-like scaling. Given a 256*256 resolution for a single face in LOD 0, the size of a single face in LOD 7 will be 4*4, in which UV components may fall outside of the region due to the float precision limit in the OpenGL texture2D function. In addition, compressed texture sizes also cause inaccurate color, which is more apparent when the object gets larger. Therefore, a re-arrangement of the “mip-maps” is necessary.

Noticing that there are blank spaces unused in left upper corner and right lower corner, why don’t we put maps for higher roughness there? For LOD levels beyond 2, the sizes are maintained as same as that of level 2. This method solves the aforementioned problems, while it utilises image canvas more evenly and efficiently. The modification of shader is also simple – for LOD greater than 2, an array is declared to store their relative coordinates; in texture fetch, a condition is used to branch the UV coordinates to the contents of the array. Below is a sample picture of our new pre-filtered environment map.

Figure 16. New arrangement of Pre-filtered Environment map

Another adjustment of game textures is combining several monochrome textures into one texture. The maps for metalness, roughness, ambient occlusion and transmittance are assigned to the R, G, B and A channels respectively. This helps to conserve texture slots, since our system has a mobile version using an OpenGL ES rasterizer that limits a material to at most 8 game textures. The total number of textures for a single material will be limited to 6 (diffuse map, mixed map (metalness, roughness, ambient occlusion and transmittance), normal map, irradiance map, pre-integrated BRDF map, pre-filtered environment map), not counting the additional environment maps required for environment switch.

6.2 SSS Based on Normal Map

A limitation of the current local subsurface scattering is that it depends on a direct light source, which is a problem if there is only image-based lighting. However, indirect lighting often has a more evenly distributed irradiance intensity, which means the effect of local-scale scattering will not be affected too much by the object orientation. Therefore, I approximate this case by assuming that the environment light comes equally from all directions. Under this assumption, it is equivalent to scattering the normal map, as explained for a similar case in the interim report. The scattering is determined only by the skin’s diffusion profile. Also, scattering the normal map turns a real-time calculation into a pre-processing calculation, which helps to boost performance.

The calculation uses a convolution with 6 Gaussian functions, as introduced in the interim report. However, since the texture space is geometrically distorted, we need to “stretch” it back to the magnitude in the original 3D mesh. As in the texture space diffusion method (also introduced in the interim report), baking a “stretch map” solves the problem efficiently. In my implementation, this is done by calling the screen-space derivative functions (dFdx/dFdy in GLSL, the counterparts of ddx/ddy in HLSL) to measure the local rate of change, using the fragment color to store the stretch magnitude in the X and Y directions, and capturing the rendering result, which has been projected to texture space. This was then used as an input of the normal map scattering program as a LUT.

This method has a good rendering result. The grainy and dry look of skin without scattering is replaced by a pinkish and smooth appearance, while the details are also preserved. The lack of local subsurface scattering in case of strong contrast in environment light can be partly compensated by the translucency effect.

6.3 Environment Switch

Up to now, our PBR only supports a single environment, or a single room if using box-projected cube environment mapping. But our game system has complex scenes: game characters need to walk around the rooms of a building, for example. Therefore, we must use different environment maps for different places and even for different parts of a single large place. It is obviously impossible to store all textures in the texture slots of an object. Actually, we only need two sets of environment maps, one for each of the two adjacent scenes. When entering a place from another place, color interpolation can be used between the two sets of environment maps, which only needs the position of the adjacent border and the scale and position of the shaded object. Assume we are walking from environment map 1 to environment map 2. The interpolation weight can then be derived from these quantities, where the positions and lengths are all projected to 1D, onto the line perpendicular to the environment border.

We can always keep 4 slots for environment maps – 2 for older one, 2 for newer one (each environment has an irradiance map and a pre-filtered environment map), keeping the total number of textures 8. Upon the event of touching a new environment, we can put the corresponding environment maps into slots for newer one. After fully entering that environment, we can move the maps to the slots for older one. The assignment of texture slots can be done by calling appropriate APIs in the game engine. Thus, we can guarantee a smooth environment switch for PBR.

6.4 Conclusion

Physically-based rendering is the main methodology for achieving photorealistic rendering quality. In common polygon-rasterization-based game engines with limited computational resources, this methodology is implemented by a combination of precomputation (irradiance map, pre-filtered environment map) and real-time shading using HDR image-based lighting. Meanwhile, many specialized methods have been researched and implemented to tackle particular problems, including subsurface skin scattering and translucency. If you pack all these into a PBR tool with a user-friendly API and detailed documentation, other people in your mobile or PC game projects, such as artists, can achieve realistic PBR effects without extra effort.

All the research and implementation indicate that realistic rendering is a complex task. There is no “put things right once and for all” formula for it. Nature, governed by the laws of physics, has infinitely many structures that give rise to different phenomena. However, computational power is always finite. It is impossible to simulate all phenomena with one formula. Instead, each particular problem should be tackled in its own way, and approximation is always used. The goal should be to give customers a satisfying result, providing the level of realism that meets their needs. However, experimental methods should always be encouraged, as computer graphics should keep developing towards perfection.

List of Figures

Figure 16: New arrangement of Pre-filtered Environment map. Source: My original picture generated by OpenCV.

[PBR Series - 5] Real-time Skin Translucency

Fri, 02 Dec 2016 00:00:00 -0800

5.1 Diffusion Profile and Skin Translucency

In Episode 3, I explained the usage of the diffusion profile in subsurface scattering. The multipole model, which reduces the problem to a sum of Gaussian functions, is a good approximation. Since the BSSRDF is a generic function for both reflected and transmitted light, we can also use it to calculate the translucency color of skin resulting from subsurface scattering inside skin tissues. The only difference is that when we calculate the reflectance caused by local subsurface scattering, we only consider light on the same side as the normal, whereas in this case the light source is at the back. Therefore, to calculate the radiance color, we only need to know the irradiance at the back and the distance light travelled in the skin. However, the biggest problem for translucency is that it is theoretically hard to determine the distance travelled and the incident point at the back side, since the geometry of human skin is actually complex.

Figure 14. (a) without / (b) with local subsurface scattering (c) without / (d) with translucency

5.2 Approximation Method

In a 2010 paper on real-time skin translucency, Jimenez et al. proposed an approximation method. Noticing that translucency is more obvious in thinner parts like the ear, we can solve the problem by using the inverse normal of the shading point to approximate the normal direction of the incident point at the back.

For direct lighting, the irradiance is simply the dot product of the inverted normal and the light direction (clamped at zero). For image-based lighting, we can use the inverted normal to fetch from the irradiance map.

For the distance travelled by light in the skin, Jimenez et al. also use an approximation: they simply ignore refraction and use the length of the straight segment inside the skin along the line from the light source to the shading point, which requires a depth map rendered from the light’s perspective.

5.3 Transmittance Map

Actually, we can approximate further. We can avoid rendering depth maps and use the average thickness instead. One idea is to sample the distances inside the skin by tracing rays over all incident (or departing) angles from -90° to 90° and store the value in a look-up texture. Implementing the ray tracing manually may not be easy. Luckily, there are many software packages that can bake such a map from an input mesh. One of them is Knald, in which this map is called a transmittance map.

Figure 15. Transmittance map

However, after fetching from the texture, we may need to invert and scale the value or add an offset to translate it into a real length.

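Combining the aforementioned factors, the final transmittance color is the irradiance at the back multiplied by the diffusion profile evaluated at the distance travelled.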

This value must be added to the reflectance value, since the light sources are different. It is worth noting that because we don’t know the transmissivity (ratio of transmitted light) of skin, we can determine a scaling coefficient empirically from the actual rendering results.
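The steps in 5.2 and 5.3 can be sketched in a few lines. This is only an illustration in Python rather than shader code: the function names, the `scale`, `offset`, `invert` and `strength` parameters and the one-Gaussian-per-channel `toy_profile` are all assumptions made for the example, not values from Jimenez et al. or from the renderer described here. A real implementation would plug in the skin diffusion profile from Episode 3.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def back_irradiance(normal, to_light):
    """Irradiance at the back side, using the inverted shading normal."""
    inverted = tuple(-c for c in normal)
    return max(0.0, dot(inverted, to_light))

def thickness_from_texel(texel, scale=1.0, offset=0.0, invert=False):
    """Translate a [0, 1] transmittance-map value into a length (hypothetical mapping)."""
    value = 1.0 - texel if invert else texel
    return value * scale + offset

def toy_profile(distance, variances=(2.0, 0.5, 0.2)):
    """Placeholder RGB diffusion profile: one Gaussian falloff per channel."""
    return tuple(math.exp(-distance * distance / (2.0 * v)) for v in variances)

def transmitted_color(normal, to_light, light_color, texel, strength=1.0):
    """Back irradiance times profile(distance), scaled by an empirical strength."""
    e = back_irradiance(normal, to_light)
    d = thickness_from_texel(texel, scale=4.0)
    return tuple(strength * e * lc * p for lc, p in zip(light_color, toy_profile(d)))
```

With the light in front of the surface the result is black, so only back lighting contributes, and thinner regions (smaller texel values under this mapping) transmit more light.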

References

Jimenez, J., Whelan, D., Sundstedt, V., & Gutierrez, D. (2010). Real-time realistic skin translucency. IEEE computer graphics and applications, (4), 32-41.

List of Figures

Figure 14: (a) without / (b) with local subsurface scattering (c) without / (d) with translucency. Source: screen capture of online pdf of Real-time realistic skin translucency by Jimenez, J. on http://iryoku.com/translucency/downloads/Real-Time-Realistic-Skin-Translucency.pdf.

Figure 15: Transmittance map. Source: screen capture of a transmittance map generated by Knald.

[PBR Series - 4] High Dynamic Range Imaging and PBR

Fri, 02 Dec 2016 00:00:00 -0800

4.1 Introduction to HDR

An important element of realistic physically based rendering is the use of high-dynamic-range (HDR) images. So far in this series, they have been approximated by ordinary low-dynamic-range images to produce sample results and to verify the correctness of the algorithm. However, since high-dynamic-range images require a floating-point picture format, the reading and processing of the source images need to be redesigned. In addition, since the built-in GLSL shader cannot directly read data from floating-point textures, the textures need to be compressed and encoded into integer formats like PNG, which is also within the scope of this research.

Light intensity in real-life scenes has a huge range. The sun as a light source has a luminance of ~10⁹ cd/m², while the luminance of average starlight is below 0.001 cd/m². However, common digital images (e.g. BMP, JPEG) have only 24-bit color depth, i.e. each color channel has an integer range of 0-255. Therefore, the range of light intensity a normal computer image can represent is rather limited.

Dynamic range is a measure of the scale of intensity differences across a picture. It is defined as the base-10 logarithm of the ratio of the highest to the lowest pixel luminance (in RGB color space, a formula for luminance is 0.2126R + 0.7152G + 0.0722B). Following this definition, a JPEG image has a dynamic range of only about 2.4 (the logarithm of 255/1), while real-life scenes often have a value above 9. The former is often referred to as a low dynamic range (LDR) image and the latter as a high dynamic range (HDR) image.
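As a small worked example of this definition, the sketch below computes the dynamic range of a list of RGB pixels. The function names are hypothetical, and skipping zero-luminance pixels is an assumption made here to avoid dividing by zero.

```python
import math

def luminance(r, g, b):
    """Luminance of an RGB pixel with the weights quoted in the text."""
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def dynamic_range(pixels):
    """log10 of the ratio between the brightest and darkest non-black pixel."""
    lums = [luminance(*p) for p in pixels]
    positive = [l for l in lums if l > 0.0]
    if not positive:
        raise ValueError("need at least one pixel with non-zero luminance")
    return math.log10(max(positive) / min(positive))

# An 8-bit image spans at most 1..255 per channel.
ldr = dynamic_range([(0, 0, 0), (1, 1, 1), (255, 255, 255)])
# A floating-point image can span many more orders of magnitude.
hdr = dynamic_range([(0.001, 0.001, 0.001), (1e6, 1e6, 1e6)])
```

For the 8-bit case this gives roughly 2.4, matching the figure for JPEG above.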

Figure 12. comparison of PBR using LDR and HDR environment maps

Because PBR relies heavily on image-based lighting, we need HDR images to achieve real-life lighting effects. With LDR images, one can barely feel the presence of a light source in the photo. It is even worse when the object has low metalness: the whole lighting becomes dim and unrealistic. With HDR images and proper conversion techniques, these problems can all be solved.

HDR images use floating-point numbers to represent a pixel, which gives them almost full coverage of real-life light intensities. However, there are several floating-point formats, which we will explain in the next section.

4.2 Floating-point Image Formats

Common floating-point image formats include FloatTIFF, Radiance RGBE and OpenEXR. Each of the three has its advantages and disadvantages, and we are going to pick the one that is most suitable for our needs.

FloatTIFF is an extension of TIFF (Tagged Image File Format) that supports floating-point images, indicated by a tag in the file. In FloatTIFF, each color channel is represented by 32 bits, i.e. 96 bits in total for an RGB pixel. Despite its high fidelity, TIFF is criticized for its huge size, since compression is rarely used for compatibility reasons. It may not be a good choice for us, since our system needs to be ported to mobile platforms and big files lead to slow reading.

Radiance RGBE is a popular HDR image format originally developed by Gregory Ward Larson for his Radiance ray-tracing software system. The RGBE format is unique in storing the intensity in a separate channel (E for exponent), while the remaining RGB channels keep the color ratio as in LDR images. RGBE has the advantage of a small size (8 bits per channel, 32 bits in total) and a wide dynamic range; the trade-off is lower color accuracy.
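To make the shared-exponent idea concrete, here is a minimal sketch of RGBE encoding and decoding. It follows the commonly used conversion (largest channel decides the exponent, mantissas are truncated to 8 bits); the function names are my own, and details such as rounding may differ between implementations.

```python
import math

def float_to_rgbe(r, g, b):
    """Pack three floats into 8-bit mantissas plus one shared 8-bit exponent."""
    v = max(r, g, b)
    if v < 1e-32:
        return (0, 0, 0, 0)
    mantissa, exponent = math.frexp(v)  # v = mantissa * 2**exponent, 0.5 <= mantissa < 1
    scale = mantissa * 256.0 / v
    return (int(r * scale), int(g * scale), int(b * scale), exponent + 128)

def rgbe_to_float(r, g, b, e):
    """Unpack an RGBE pixel back to floats."""
    if e == 0:
        return (0.0, 0.0, 0.0)
    f = math.ldexp(1.0, e - (128 + 8))
    return (r * f, g * f, b * f)

exact = rgbe_to_float(*float_to_rgbe(1.0, 0.5, 0.25))
lossy = rgbe_to_float(*float_to_rgbe(1000.0, 10.0, 0.1))
```

The second pixel shows the accuracy trade-off: because all channels share the exponent of the brightest one, the dim green and blue values lose most of their precision.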

OpenEXR is an HDR standard developed by Industrial Light & Magic. There are two sub-types: half float (16 bits) and full float (32 bits). The former is more commonly used for its smaller size and sufficient accuracy for game-level rendering. The 16 bits are divided into sign (1 bit), exponent (5 bits) and mantissa (10 bits). OpenEXR supports a dynamic range of 12, which exceeds the capability of the human eye. Meanwhile, it supports ZIP compression, which is lossless. This format is also widely supported in different software. Blender can bake images in half and full OpenEXR format. OpenCV also automatically converts half float to full float for its internal processing. For these reasons, we decided to use OpenEXR as the standard format for HDR images in our system.
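The bit layout above can be checked with a short sketch that decodes a 16-bit half float by hand. The function name is hypothetical. The dynamic range of about 12 is obtained here by counting subnormal values (exponent field 0) as the smallest representable numbers, which is an assumption about how that figure was derived.

```python
import math
import struct

def decode_half(bits):
    """Decode a 16-bit pattern: 1 sign bit, 5 exponent bits, 10 mantissa bits."""
    sign = -1.0 if (bits >> 15) & 0x1 else 1.0
    exponent = (bits >> 10) & 0x1F
    mantissa = bits & 0x3FF
    if exponent == 0:  # zero or subnormal
        return sign * math.ldexp(mantissa, -24)
    if exponent == 0x1F:  # infinity or NaN
        return sign * math.inf if mantissa == 0 else math.nan
    return sign * math.ldexp(1.0 + mantissa / 1024.0, exponent - 15)

largest = decode_half(0x7BFF)   # biggest finite half float
smallest = decode_half(0x0001)  # smallest positive (subnormal) half float
half_dynamic_range = math.log10(largest / smallest)

# Cross-check one value against Python's built-in half-float support.
reference_one = struct.unpack("<e", struct.pack("<H", 0x3C00))[0]
```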

4.3 Processing HDR image in OpenCV

Given the source code that generates the irradiance map and pre-filtered environment maps for LDR images, only a few changes are needed to adapt it to HDR processing. The trickiest one is that the second argument of cv::imread needs to be a negative number to tell the function to read the image as raw data. Otherwise, OpenCV will treat the floating-point values as integers and the image cannot be represented correctly. When creating the cv::Mat, the data type should be CV_32FC3, which is compatible with both half and full float OpenEXR images. Since our environment map does not contain transparency information, the alpha channel can be safely ignored. In addition, when fetching pixel intensity from the cv::Mat using the “at” method, the template type needs to change from cv::Vec3b to cv::Vec3f.

The calculation functions do not need to be modified since they are physics-based. As long as the value is proportional to the luminance (the physical unit), the result is correct relative to the source. However, special attention needs to be paid to the sampling process in numerical integration. The accuracy of the result is affected by the number of samples taken in Monte Carlo integration. For LDR images, 1024 samples are enough for a good approximation. For HDR images, the number of samples needed depends on the actual dynamic range. Determining the number of samples as a function of dynamic range is mathematically complicated, but we can use experience to estimate a threshold for most images. In some cases (dynamic range > 10), the number of samples exceeds 1 million, which is extremely expensive in terms of time. Unfortunately, such cases cannot be ignored, as they often appear in a dark room lit by an intense light source. If the number of samples is the same as that for LDR images, many bright noise points like fireworks will appear around the intense light source. To deal with this, we need to consider dynamic range compression.

4.4 RGBM Compression

From the previous section we know that compression is important for HDR imaging in game-level rendering. Moreover, since our Godot game engine only supports PNG and WebP textures, we need to encode our HDR images into these LDR formats. Fortunately, there is a solution provided by Brian Karis (2009) that solves both problems, called RGBM encoding.

Figure 13. RGBM Encoding function

The encoding algorithm is simple, as shown in the image above. The basic idea is to store a multiplier in the alpha channel, which is determined by the largest of the R, G and B values. The compression happens when the saturate function is applied to that maximum. It is easy to see that the maximum value that can be represented after compression is the constant 6.0, since all colors are divided by 6.0 first. Therefore, we can also treat this constant as the compression rate. If we want to preserve a higher dynamic range, we should increase the value. However, this comes at the expense of color accuracy. In my implementation, I found that 36.0 is a good choice to balance these two factors.

With this technique, we can export our HDR images as PNG textures with alpha. In the game engine, we simply need to enable the alpha channel and decode the texture as the figure shows. The constant in front should be our chosen compression rate.
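For readers without the figure, the encode and decode steps can be sketched as below. This mirrors the structure of the pair published by Karis (2009), translated to Python; the `max_range` parameter name is my own, and the explicit clamp of the RGB channels stands in for the clamping that happens when values are written to an 8-bit texture.

```python
import math

def rgbm_encode(color, max_range=6.0):
    """Encode an HDR colour as (r, g, b, m) with every channel in [0, 1]."""
    r, g, b = (c / max_range for c in color)
    m = min(max(r, g, b, 1e-6), 1.0)   # saturate the largest channel
    m = math.ceil(m * 255.0) / 255.0   # snap the multiplier to an 8-bit step
    return (min(r / m, 1.0), min(g / m, 1.0), min(b / m, 1.0), m)

def rgbm_decode(rgbm, max_range=6.0):
    """Recover the HDR colour: constant * rgb * multiplier."""
    r, g, b, m = rgbm
    return (max_range * r * m, max_range * g * m, max_range * b * m)

in_range = rgbm_decode(rgbm_encode((3.0, 1.5, 0.75)))
clipped = rgbm_decode(rgbm_encode((12.0, 0.0, 0.0)))
```

Values within the range survive the round trip, while anything brighter than `max_range` is clipped to it. Passing `max_range=36.0` to both functions corresponds to the setting mentioned above.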

4.5 Tone Mapping

There is still an issue we need to deal with. Since an HDR image records the real intensity of pixels, we need to map it into sRGB space to show it on the screen, which is called tone mapping. The resulting image looks more vivid since it combines details of all parts like human eyes do, creating the feeling of a wider dynamic range. To understand tone mapping, we must first introduce some concepts: exposure, gamma correction and white balance.

In photography, an HDR picture is often composed from several photos of the same scene taken at different exposure values (EV). Defined by EV = log₂(N²/t), where N is the relative aperture (f-number) and t is the exposure time in seconds, the exposure value decreases as the amount of exposure increases. The larger the aperture is and the longer the exposure time is, the smaller the EV is, i.e. there is a greater amount of light coming in. A decrease of 1 EV is also called an increase of 1 “stop”. In HDR imaging, usually 3 to 5 pictures in the range of -2~2 stops are taken to capture a sufficient range of lighting conditions. At larger EV, overexposure can be avoided for very bright parts like objects lit directly by the sun. Conversely, at lower EV, dark areas preserve more details. Software like Adobe Photoshop can be used to compose the HDR image from the several LDR images, during which each pixel is assigned a float value that is directly proportional to the real light intensity, without color correction for display.
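As a quick arithmetic check of the standard definition EV = log₂(N²/t), the following sketch (with a hypothetical function name) shows that halving the exposure time raises the EV by exactly one.

```python
import math

def exposure_value(f_number, exposure_time):
    """EV = log2(N^2 / t) for f-number N and exposure time t in seconds."""
    return math.log2(f_number ** 2 / exposure_time)

ev_reference = exposure_value(1.0, 1.0)      # f/1.0 for one second
ev_daylight = exposure_value(8.0, 1 / 125)   # f/8 at 1/125 s
ev_faster = exposure_value(8.0, 1 / 250)     # same aperture, half the time
```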

However, to present an HDR image on the screen, a process called gamma correction must be done. This concept comes from the interesting fact that the light intensity perceived by humans is not the real light intensity. Stevens’ power law indicates that the magnitude of sensation (S) and the physical intensity (I) of a stimulus are related by a power function, S ∝ I^(1/p) for some power p. An image captured by a digital device records the physical intensity. Therefore, when presenting the image on screen, the intensity value must be taken to the power of 1/p to recover the sensation effect of that intensity. Current sRGB pictures are already encoded with the gamma, i.e. the RGB values are already raised to the power of 1/p. In the sRGB standard, p is approximately 2.2. However, for HDR images, the gamma correction is not applied; the floating-point values record raw intensity. Therefore, we must apply gamma correction in the tone mapping process, which is important both for a more realistic result and for color blending with LDR textures in the game.
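In code, gamma encoding and its inverse are one-liners. The sketch below uses a pure power curve with exponent 2.2, which is a common approximation of the sRGB transfer function rather than the exact piecewise definition; the names are hypothetical.

```python
GAMMA = 2.2

def gamma_encode(linear):
    """Linear intensity -> display-encoded value (power of 1/2.2)."""
    return linear ** (1.0 / GAMMA)

def gamma_decode(encoded):
    """Display-encoded value -> linear intensity (power of 2.2)."""
    return encoded ** GAMMA

mid_gray_encoded = gamma_encode(0.5)
round_trip = gamma_decode(gamma_encode(0.18))
```

A linear intensity of 0.5 is stored as roughly 0.73, which is why an image that skips this step looks too dark on screen.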

White balance is another necessary process, which is also related to human visual perception. Different global illumination conditions have a slight impact on the reflected color of objects, but human eyes tend to recover the material color (or diffuse color in PBR). The correction is often measured by the magnitude of compensation needed for a white object. For example, in cloudy conditions white objects appear to have a colder color, while in sunlit conditions they have a warmer color. Therefore, in tone mapping, we may want to add some red in cloudy conditions so that the object preserves its original color and the whole scene looks more natural, which is called white balancing. While some images may be taken by a white-balanced camera, the display color temperature can be another factor that requires us to do white balancing.

Having this HDR data and knowing the aforementioned factors, how are we going to present the picture on screen? Actually, it depends on our purpose. Some algorithms based on the human visual perception system produce more realistic images, while others may present a more artistically pleasing result (in which the contrast of color is stronger). Since we pursue realistic rendering, we prefer the former. A solution provided by John Hable (2010) at Filmic Games gives nice realistic render results. The algorithm takes two parameters, white balance level and exposure compensation, which can be determined empirically. In my implementation, I take 1.0 as white balance level and 4.0 as exposure compensation.
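For reference, the filmic curve from Hable's post can be sketched as follows. The curve constants A to F and the default exposure 2.0 and white point 11.2 are the values given in that post; the Python function names are my own, and treating the two parameters mentioned above as `white_point` and `exposure` is an assumption about how they map onto the operator. Gamma correction is applied separately after this step.

```python
# Curve constants as published in Hable's "Filmic Tonemapping Operators" post.
A, B, C, D, E, F = 0.15, 0.50, 0.10, 0.20, 0.02, 0.30

def hable_curve(x):
    """The filmic shoulder/toe curve for a single linear value."""
    return ((x * (A * x + C * B) + D * E) / (x * (A * x + B) + D * F)) - E / F

def filmic_tonemap(color, exposure=2.0, white_point=11.2):
    """Map linear HDR RGB to [0, 1] so that exposure * value == white_point gives 1."""
    white_scale = 1.0 / hable_curve(white_point)
    return tuple(hable_curve(exposure * c) * white_scale for c in color)

black = filmic_tonemap((0.0, 0.0, 0.0))
white = filmic_tonemap((5.6, 5.6, 5.6))
dim, bright = filmic_tonemap((0.1,) * 3)[0], filmic_tonemap((1.0,) * 3)[0]
```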

In PBR, the metalness value is directly multiplied with the HDR float value, which is equivalent to using a different exposure value. Therefore, with non-metals, bright parts like light sources often remain bright (the halo is reduced) while the darker environment is almost invisible, just like what we see in real life. However, there is one tricky part. Since we are working with LDR textures like diffuse maps, reverse gamma correction must be applied to those textures (by taking a power of 2.2), because they are encoded for the sRGB color space.

References

Hable, J. (2010, May 5). “Filmic Tonemapping Operators” [Online blog post]. Retrieved from http://filmicgames.com/archives/75.

Karis, B. (2009, April 28). “RGBM color encoding” [Online blog post]. Retrieved from http://graphicrants.blogspot.sg/2009/04/rgbm-color-encoding.html.

List of Figures

Figure 12: Comparison of PBR using LDR and HDR environment maps. Source: http://gamedev.stackexchange.com/questions/62836/does-hdr-rendering-have-any-benefits-if-bloom-wont-be-applied.

Figure 13: RGBM Encoding Function. Source: screen capture of http://graphicrants.blogspot.sg/2009/04/rgbm-color-encoding.html.

[PBR Series - 3] Subsurface Scattering – Human Skin as Example

Fri, 02 Dec 2016 00:00:00 -0800

3.1 Subsurface Scattering and Fidelity

So far, our PBR model only considers the interaction of light at the surface of an object. This is not a problem, as many models in 3D games are nearly opaque. However, when it comes to objects with a certain level of depth and translucency, like skin, marble, leaves and wax, the effect of subsurface scattering cannot be neglected if we want to get realistic results. Subsurface scattering often makes the overall lighting softer, as light at one position seems to overflow into its neighboring areas.

Among all kinds of materials that require subsurface scattering, human skin is relatively complex. Human skin has numerous layers, and realistic rendering requires a model of at least 3 layers (the thin oily layer, the epidermis and the dermis) (d’Eon & Luebke, 2007). The first layer contributes the specular reflection, which can be simulated by the Cook-Torrance model. The other two layers require subsurface scattering. This kind of multi-layer subsurface scattering is an important research topic. The pictures below show the comparison between rendering without (a) and with (b) subsurface scattering. The result without subsurface scattering appears very dry-looking and unrealistic, while the latter seems natural (Figure 8).

Many games include highly detailed human models. Complex textures, including bump maps, have been applied to simulate human skin. To match that level of fidelity, subsurface scattering is a must.

Figure 8 - Comparison Between Skin Rendering Without and With Subsurface Scattering

3.2 Diffusion Profile and Gaussian Approximation

As mentioned before, in subsurface scattering the relationship between irradiance and radiance is governed by the BSSRDF. Since skin is highly scattering, the dipole diffusion function was introduced (Jensen et al., 2001) to simulate this situation efficiently. The BSSRDF is reduced to depend only on the material’s scattering properties, the Fresnel terms at the point of incidence and the point where radiance exits, and the distance between the two points.

Figure 9 - Dipole Diffusion Function

In the formula above, S represents BSSRDF, Ft stands for Fresnel transmittance and R is the material’s diffusion profile.

The diffusion theory was later extended to simulate scattering in multi-layer materials by the use of multipoles (Donner & Jensen, 2005), to support more realistic rendering. The advantage of a diffusion profile is that it is an empirically determined function. One only needs to do experiments on a specific material to get a numerical form of the function and plot its shape. The dipole or multipole is then used to give an analytical approximation. However, the computation of such functions is very complex. Luckily, d’Eon et al. invented an approximation method (2007). They fit the multipole curve with a weighted sum of multiple Gaussian functions with some heuristic coefficients.

Figure 10 – Error of Gaussian Approximation

The coefficients are chosen such that the error represented by the formula above is minimized. For skin, six Gaussians are required to match the three-layer diffusion profile accurately. The coefficients for each Gaussian are determined separately for the red, green and blue channels.
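The sum-of-Gaussians form is easy to evaluate. In the sketch below, the variances (in mm²) and the per-channel weights are the six-Gaussian skin fit tabulated by d’Eon and Luebke (2007) in GPU Gems 3; they should be double-checked against that chapter before being used in production. The Python names are hypothetical.

```python
import math

# (variance in mm^2, (red, green, blue) weights)
SKIN_GAUSSIANS = [
    (0.0064, (0.233, 0.455, 0.649)),
    (0.0484, (0.100, 0.336, 0.344)),
    (0.187,  (0.118, 0.198, 0.000)),
    (0.567,  (0.113, 0.007, 0.007)),
    (1.99,   (0.358, 0.004, 0.000)),
    (7.41,   (0.078, 0.000, 0.000)),
]

def gaussian(variance, r):
    """Normalized 2D Gaussian evaluated at radial distance r."""
    return math.exp(-r * r / (2.0 * variance)) / (2.0 * math.pi * variance)

def skin_profile(r):
    """RGB diffusion profile R(r) as a weighted sum of six Gaussians."""
    return tuple(
        sum(weights[ch] * gaussian(v, r) for v, weights in SKIN_GAUSSIANS)
        for ch in range(3)
    )

near = skin_profile(0.1)
far = skin_profile(2.0)
```

At a distance of 2 mm the red channel is still clearly non-zero while green and blue have almost vanished, which is why scattered light in skin looks reddish.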

3.3 Texture-Space Diffusion vs. Pre-Integration

Given the skin diffusion profiles, many methods can be chosen to compute the subsurface scattering. A popular method invented by Borshukov and Lewis is called texture-space diffusion (2005). The idea is to store incident light in texture space and use convolution steps to simulate diffusion, which is similar to the pre-filtered environment map in IBL. However, human characters in games are likely to move frequently, so the diffusion texture inevitably changes, which implies real-time rendering to texture (RTT). Blender, as my working platform, does not have a customizable texture buffer. In order to avoid editing Blender’s source code, I chose to find another method that does not use RTT.

Penner and Borshukov introduced an approximation method by pre-integrating the subsurface scattering effect in skin (2011). In this method, shading process becomes local; the calculation is reduced to a simple pixel shader - no extra rendering passes are required. The idea is that the visible scattering effect can be considered as a composite result of mesh curvature change, normal map bumps and shadows.

The first term, the change of mesh curvature, affects incident light (and thus scattering) together with the angle between the light direction and the normal direction (described by N∙L). A 2D LUT indexed by the curvature and N∙L can be calculated, representing the accumulated light at each outgoing direction as a result of lighting a spherical surface of a given curvature. This approximates a spherical integration in the BSSRDF with a ring integration. Of course, skin diffusion profiles are used in the computation.

Figure 11 - Illustration of Diffuse Scattering Integration on Ring

When using a normal map as a bump texture, the light scattered from a bump appears very similar to light reflected from a surface with no scattering and blurred-out bumps, so the scattering effect can be approximated by blurring the normal map. The level of blurring is different for each color channel, because different wavelengths have different diffusion profiles. Using a normal map for scattering without blurring it to a different degree for each color channel would result in unnatural, grainy shading.

In my implementation of this pre-integration method, the influence of shadows is temporarily omitted, since I have not yet found a way to add real-time shadows when customized shaders are used in Blender Game Engine. This issue will be tackled later. The combination of ring integration and a blurred normal map already gives a nice result if we ignore the shadow.

The ring integration texture was generated in C++ with the aid of OpenCV. For each combination of the two independent variables, 1024 samples were taken from –PI/2 to PI/2 to ensure accuracy.
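The computation of one LUT texel can be sketched as below. It follows the ring integration described by Penner and Borshukov, with 1024 samples between -PI/2 and PI/2 as stated above. The function names are hypothetical, the midpoint sampling is my own choice, and `toy_profile` is a placeholder with made-up variances; the real texture would use the skin diffusion profile instead.

```python
import math

def toy_profile(distance, variances=(2.0, 0.5, 0.2)):
    """Placeholder RGB diffusion profile: one Gaussian falloff per channel."""
    return tuple(math.exp(-distance * distance / (2.0 * v)) for v in variances)

def integrate_ring(cos_theta, radius, samples=1024, profile=toy_profile):
    """Scattered diffuse light for one (N.L, radius of curvature) pair."""
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    step = math.pi / samples
    total_light = [0.0, 0.0, 0.0]
    total_weight = [0.0, 0.0, 0.0]
    for i in range(samples):
        a = -math.pi / 2.0 + (i + 0.5) * step            # angle along the ring
        diffuse = max(0.0, math.cos(theta + a))          # clamped Lambert term
        distance = abs(2.0 * radius * math.sin(a * 0.5)) # chord length on the ring
        weights = profile(distance)
        for ch in range(3):
            total_weight[ch] += weights[ch]
            total_light[ch] += diffuse * weights[ch]
    return tuple(l / w for l, w in zip(total_light, total_weight))

flat = integrate_ring(0.5, 100.0)       # nearly flat surface
curved = integrate_ring(0.0, 1.0)       # strongly curved, light at grazing angle
flat_grazing = integrate_ring(0.0, 100.0)
```

On a nearly flat surface the result falls back to plain N∙L, while on a strongly curved surface light wraps around the terminator, most visibly in the red channel.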

Another point worth mentioning is that GLSL has a built-in function in its fragment shader that helps calculate the curvature. Although this is convenient for programmers, the curvature is computed on the basis of triangle orientation and is thus uniform inside a mesh triangle. For low-poly meshes this becomes a problem, as triangle edges at positions with a strong scattering effect would be very obvious. To alleviate this, I stored the curvature in a texture and read from it in the vertex shader. The per-vertex values are then interpolated across each triangle, like what happens in Gouraud shading.

Indirect light sources can also be used for subsurface scattering. The combination of image based lighting with pre-integration scattering gives more vivid rendering results.

References

Borshukov, G., & Lewis, J. P. (2005, July). Realistic human face rendering for the matrix reloaded. In ACM Siggraph 2005 Courses (p. 13). ACM.

d’Eon, E., & Luebke, D. (2007). Advanced techniques for realistic real-time skin rendering. GPU Gems, 3, 293-347.

Donner, C., & Jensen, H. W. (2005). Light diffusion in multi-layered translucent materials. ACM Transactions on Graphics (TOG), 24(3), 1032-1039.

Penner, E., & Borshukov, G. (2011). Pre-integrated skin shading. GPU Pro, 2, 41-54.

List of Figures

Figure 8: Comparison Between Skin Rendering Without and With Subsurface Scattering. Source: GPU Gems 3 online book on http://http.developer.nvidia.com/GPUGems3/gpugems3_ch14.html

Figure 9: Dipole Diffusion Function. Source: online document of d’Eon et al on http://www.eugenedeon.com/wp-content/uploads/2014/04/efficientskin.pdf

Figure 10: Error of Gaussian Approximation. Source: Same as Figure 9.

Figure 11: Illustration of Diffuse Scattering Integration on Ring. Source: Penner’s slides at SIGGRAPH 2011 Courses on http://advances.realtimerendering.com/s2011/Penner%20-%20Pre-Integrated%20Skin%20Rendering%20(Siggraph%202011%20Advances%20in%20Real-Time%20Rendering%20Course).pptx

[PBR Series - 2] Image Based Lighting

Fri, 02 Dec 2016 00:00:00 -0800

2.1 The Concept

Image Based Lighting basically treats all pixels in an image as light sources. Usually, an environment map (typically a cubemap) created from a panoramic high dynamic range (HDR) image is used as the source of the texture fetch. Assuming the shaded object is opaque, we only need to consider specular reflection and diffuse reflection. However, since the light source consists of numerous contiguous pixels, we need to integrate the BRDF to get the shading result of a surface point. In computer graphics, integration is approximated by sampling. For good accuracy, the number of samples has to be proportional to the number of pixels, which is far too many for real-time rendering. Therefore, one approach is to bake the necessary steps into textures and fetch from them during real-time rendering. Before that, we need to solve a problem: how do we fetch a pixel from the environment map?

2.2 Fetching Pixels from Environment Map

On any kind of surface, the radiance value of a pixel can be seen as reflected from the other side of the surface normal (this is actually the case for specular reflection on perfectly smooth surface, but for other situation like diffuse reflection, the environment map can store an imaginary source point as the composite result of radiance), having the same angle with normal as the view direction. In other words, the pixel we need to fetch can be seen as the target that the camera ray hit after reflection.

Figure 3 - Cubemap Pixel Fetching Illustration

Cube mapping is a popular method of environment mapping as it has a simple mathematical form. This method treats the environment as a surrounding box with the environment panorama wrapped and mapped onto 6 faces. In GLSL, there is a function textureCube() that performs the fetch for a given reflection direction. However, the reflected ray is assumed to start at the exact center of the cube. This is not a severe issue for skyboxes that represent a faraway environment. However, when we need to represent reflections inside a small room, the result is heavily distorted, for example when we fetch the reflected color for a ball close to a wall.

To solve the problem, I found a method called box projected cubemap environment mapping (BPCEM) (behc, 2010).

Figure 4 - Box Projected Cubemap Environment Mapping

This method has a simple mathematical form. As Figure 4 shows, it requires the size of the room and the relative position of the shaded object. The intersection between the room boundary and the reflected camera ray can then be calculated easily. The corrected fetching direction is the vector from the assumed sampling center (the center of the room by default) to the intersection. This method is very intuitive and gives a very good approximation. Therefore, I adapted it in GLSL for rendering scenes inside closed rooms.
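The correction amounts to a ray versus axis-aligned box intersection. The following sketch shows the idea in Python rather than GLSL; the function and parameter names are hypothetical, and it assumes the shaded point lies inside the box and the reflection direction is non-zero.

```python
import math

def box_projected_direction(pos, refl_dir, box_min, box_max, probe_pos):
    """Return the corrected cubemap lookup direction for a reflection ray."""
    length = math.sqrt(sum(c * c for c in refl_dir))
    d = [c / length for c in refl_dir]
    # Distance along the ray to the first box face it leaves through.
    t_exit = math.inf
    for p, dc, lo, hi in zip(pos, d, box_min, box_max):
        if dc > 0.0:
            t_exit = min(t_exit, (hi - p) / dc)
        elif dc < 0.0:
            t_exit = min(t_exit, (lo - p) / dc)
    hit = [p + dc * t_exit for p, dc in zip(pos, d)]
    # Look up the cubemap along the vector from the probe center to the hit point.
    return tuple(h - c for h, c in zip(hit, probe_pos))

room_min, room_max, center = (-1, -1, -1), (1, 1, 1), (0, 0, 0)
off_center = box_projected_direction((0.5, 0, 0), (0, 1, 0), room_min, room_max, center)
at_center = box_projected_direction((0, 0, 0), (0, 0, 2), room_min, room_max, center)
diagonal = box_projected_direction((0.5, 0, 0), (1, 1, 0), room_min, room_max, center)
```

For a point at the probe center the direction is unchanged, while for an off-center point the lookup direction is bent towards where the ray actually hits the wall.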

2.3 Irradiance Map and Spherical Harmonics

After solving the texture fetching problem, we come back to calculating illumination in IBL. The diffuse part of IBL is particularly non-trivial. The texture we want to pre-calculate according to the BRDF is known as an irradiance map. Unlike specular reflection, which samples only a small range of directions that widens with surface roughness according to the Cook-Torrance model, IBL diffuse reflection needs to consider contributions from pixels in all visible directions, which is a vast amount compared with specular reflection. Real-time sampling is nearly impossible and even preprocessing becomes hard. Thanks to SIGGRAPH, there is an efficient approximation for calculating the irradiance map (Ramamoorthi & Hanrahan, 2001). It turns out that by computing and using 9 spherical harmonic coefficients of the lighting, the average error of the rendering result is only about 1%.

I wrote a C++ program that computes the 9 spherical harmonic coefficients for any cubemap up to a size of 2048x2048 almost instantly. With the 9 coefficients, the irradiance value of a given pixel can actually be computed in real time. However, to avoid long expressions in the shader, I pre-compute the result as an irradiance map by traversing all reflecting directions.
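The “long expression” in question is the quadratic form from Ramamoorthi and Hanrahan (2001). A sketch for a single color channel is given below; the constants c1 to c5 are the ones published in the paper, while the function name and the dictionary layout keyed by (l, m) are assumptions made for this example.

```python
import math

C1, C2, C3, C4, C5 = 0.429043, 0.511664, 0.743125, 0.886227, 0.247708

def sh_irradiance(L, normal):
    """Irradiance for a unit normal from 9 SH lighting coefficients L[(l, m)]."""
    x, y, z = normal
    return (
        C1 * L[(2, 2)] * (x * x - y * y)
        + C3 * L[(2, 0)] * z * z
        + C4 * L[(0, 0)]
        - C5 * L[(2, 0)]
        + 2.0 * C1 * (L[(2, -2)] * x * y + L[(2, 1)] * x * z + L[(2, -1)] * y * z)
        + 2.0 * C2 * (L[(1, 1)] * x + L[(1, -1)] * y + L[(1, 0)] * z)
    )

keys = [(0, 0), (1, -1), (1, 0), (1, 1), (2, -2), (2, -1), (2, 0), (2, 1), (2, 2)]

# A uniform environment of radiance 1 has only the constant coefficient, 2*sqrt(pi).
uniform = {k: 0.0 for k in keys}
uniform[(0, 0)] = 2.0 * math.sqrt(math.pi)
e_uniform = sh_irradiance(uniform, (0.0, 0.0, 1.0))

# Adding a positive (1, 0) term makes the +z side brighter than the -z side.
top_lit = dict(uniform)
top_lit[(1, 0)] = 1.0
e_up = sh_irradiance(top_lit, (0.0, 0.0, 1.0))
e_down = sh_irradiance(top_lit, (0.0, 0.0, -1.0))
```

For the uniform environment the irradiance comes out as π, the expected value for constant radiance of 1 over the hemisphere.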

2.4 Efficient Approximation of Specular Reflection

The Cook-Torrance microfacet specular shading model (Cook & Torrance, 1981) is used to calculate the IBL specular reflection.

Figure 5 - Cook-Torrance Specular Shading Model

D, F and G stand for the Beckmann distribution factor, the Fresnel term and the geometric attenuation term respectively. However, the three formulas are also complex, and efficient approximations need to be found. Thanks to SIGGRAPH again, the real shading model of Unreal Engine 4 was presented in a SIGGRAPH 2013 course (Karis, 2013), in which computationally efficient algorithms were chosen to approximate the formulas and the integration. The integration is done by importance sampling, a general technique for estimating properties of a specific distribution. To further reduce the amount of computation, a method called Split Sum Approximation was proposed in those course notes. The integration is split into the product of two sums, both of which can be pre-calculated. The first sum is a convolution of the environment map for a given roughness under the Cook-Torrance microfacet model. Because we want to choose different roughness levels for different objects in the same environment, the results for different roughness values need to be stored in the mip-map levels of a cubemap. There is a DirectX image format called DirectDraw Surface (.dds) that supports storing custom mip-map levels. Unfortunately, Blender does not support reading mip-maps in this format. Therefore, I came up with a method that arranges all cubemap mip-map levels in a normal bitmap texture (as shown in Figure 6) and fetches the desired pixel from the corresponding region.

Figure 6 - Storing Cubemap Mip-map Levels in a Single Texture (my original picture)

The second sum, equivalent to integrating the specular BRDF over a pure white environment, is easier to compute. It can be further split into the sum of two integrals, leaving roughness and incident angle as the two inputs and giving a scale and a bias as the two outputs. Furthermore, all parameters fall into the range between 0 and 1; therefore, the result of the function can be pre-calculated and stored in a texture. Note that the second sum contains the Fresnel term. The Fresnel term, a factor that describes how reflectivity changes with the incident angle, exhibits stronger contrast between center and edge when the shaded object has lower metalness (the base reflectivity of non-metals is lower, and the reflectivity of all materials approaches 100% as the reflecting angle gets closer to 90 degrees). Since this effect is easy to notice, it is indispensable for realistic rendering.
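To illustrate the behavior of the Fresnel term and how the scale and bias are used, here is a small sketch. Schlick's approximation is a widely used form of the Fresnel term; whether the shaders described here use exactly this form is an assumption, as are the function names and the sample scale and bias values. The value 0.04 is a typical base reflectivity for dielectrics.

```python
def fresnel_schlick(f0, cos_theta):
    """Schlick's approximation: reflectivity rises from f0 to 1 at grazing angles."""
    return f0 + (1.0 - f0) * (1.0 - cos_theta) ** 5

def ibl_specular(prefiltered_color, f0, scale, bias):
    """Combine the two sums: prefiltered environment * (f0 * scale + bias)."""
    return tuple(c * (f0 * scale + bias) for c in prefiltered_color)

center = fresnel_schlick(0.04, 1.0)   # looking straight at the surface
edge = fresnel_schlick(0.04, 0.0)     # grazing angle
metal_edge_gain = fresnel_schlick(0.9, 0.0) / fresnel_schlick(0.9, 1.0)
dielectric_edge_gain = edge / center

# scale and bias would be read from the pre-computed LUT; these are made-up values.
spec = ibl_specular((1.0, 1.0, 1.0), 0.04, 0.9, 0.05)
```

The edge-to-center ratio is far larger for the dielectric than for the metal, which is the stronger contrast described above.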

Figure 7 - Fresnel Effect on a Dielectric Ball

2.5 The IBL Tool

Doing physically based rendering is not an easy process. Physical entities have continuous geometry, while calculation in computer science is done on a discrete basis. Calculus used in physics, such as integrals, must be converted to the form of discrete sampling. Even so, current computational power still requires many approximation techniques and procedural tactics like look-up textures (LUT). IBL is a very good example of the complexity of PBR. Because of that, a tool that does all the pre-computations with minimal input commands would be convenient for artists and programmers. Therefore, I developed IBL Tool, a Windows console program, to make things easy (sorry, this IBL Tool is now proprietary software of my intern company so I cannot release it :). The user only needs to put in the environment map texture and all LUT textures for IBL are produced. The program is also customizable. The user can choose from 3 different kinds of output pattern, namely separated faces, the standard Blender format and the spread box format, depending on what the game engine requires.

In addition, a pair of sample GLSL shaders (vertex and fragment) is provided with the program, with the custom fields indicated. A tone mapping step that adjusts exposure and gamma value is also included, so that the user can get a higher dynamic range for shading if required.

References

Behc (2010, April 20). Box Projected Cubemap Environment Mapping [Online forum post]. Retrieved from http://www.gamedev.net/topic/568829-box-projected-cubemap-environment-mapping/.

Cook, R. L., & Torrance, K. E. (1981, August). A reflectance model for computer graphics. In ACM SIGGRAPH Computer Graphics (Vol. 15, No. 3, pp. 307-316). ACM.

Karis, B. (2013). Real Shading in Unreal Engine 4. Part of “Physically Based Shading in Theory and Practice,” SIGGRAPH.

Ramamoorthi, R., & Hanrahan, P. (2001, August). An efficient representation for irradiance environment maps. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques (pp. 497-500). ACM.

List of Figures

Figure 3: Cubemap Pixel Fetching Illustration. Source: https://en.wikipedia.org/wiki/Reflection_mapping

Figure 4: Box Projected Cubemap Environment Mapping. Source: behc’s post on http://www.gamedev.net/topic/568829-box-projected-cubemap-environment-mapping/

Figure 5: Cook-Torrance Specular Shading Model. Source: Karis’ notes at SIGGRAPH 2013 on http://blog.selfshadow.com/publications/s2013-shading-course/karis/s2013_pbs_epic_notes_v2.pdf

Figure 6: Storing Cubemap Mip-map Levels in a Single Texture. Source: My original picture generated by OpenCV.

Figure 7: Fresnel Effect on a Dielectric Ball. Source: an article “Interaction of Light and Materials” on http://www.pernroth.nu/lightandmaterials/

[PBR Series - 1] Introduction to PBR

Fri, 02 Dec 2016 00:00:00 -0800

This series introduces the implementation of physically based rendering in light-weight game engines using GLSL shaders and precomputed textures. With these techniques, most physically based surface reflection can be simulated under HDR lighting across a variety of materials ranging from metals to dielectrics, and subsurface scattering can be simulated as well, with human skin as the example, so that you can render game objects with high fidelity using few resources. Most of the content comes from my project report as an intern R&D engineer at a game company from May to November 2015. I want to share what I learned, what I thought about, and what I did on this topic, to make it easier for others who want to dig into it. There are 6 episodes.

1 Introduction to PBR
2 Image Based Lighting
3 Subsurface Scattering – Human Skin as Example
4 High Dynamic Range Imaging and PBR
5 Real-time Skin Translucency
6 Improvement of PBR

1.1 Photorealistic Rendering and Shading Language

Photorealistic rendering is a crucial part of computer graphics. In the early days of computer graphics, most hardware supported only a fixed-function rendering pipeline because of performance limitations. Lighting and texture mapping were processed in a hard-coded manner, resulting in coarse and unrealistic images. Nowadays, with the development of hardware power and software techniques, rendering pipelines have become programmable. High-level shading languages, which allow a practically unlimited range of rendering effects, are applied directly at specific stages of the pipeline. Many idealized physical models have been successfully implemented with GPU programming to render highly photorealistic 3D scenes and characters.

One of the most common high-level shading languages is the OpenGL Shading Language (GLSL), which is also the main tool used in this project. Unlike stand-alone applications, GLSL shaders need a host application that calls the OpenGL API and passes them as a set of strings to the graphics driver for compilation. The two most commonly used kinds of shader are the vertex shader and the fragment shader, which are applied to each mesh vertex and each fragment (i.e. covered pixel) respectively in the graphics pipeline. The vertex shader determines the final position of the mesh vertices presented to the user. It also passes variables such as normals, colors or texture coordinates to the fragment shader. The fragment shader is then executed to give each covered pixel its intended color. In addition, external uniforms can be passed into the shaders to provide data such as texture samplers and the camera position.

Here is a simplified diagram showing where shaders are applied in the graphics pipeline:

Figure 1 – Rendering Pipeline (my original picture)

1.2 Implementation Based on Blender Game Engine

In this series, we mainly use Blender as the example light-weight game engine and as the platform for graphics rendering and 3D game implementation. Blender is professional open-source 3D computer graphics software that supports many functions, including 3D modelling, UV unwrapping, texturing and rendering. Using OpenGL as its graphics library, Blender supports custom GLSL shaders in its Blender Game Engine. Figure 2 is a screen capture from Blender 2.75 showing the effect of a simple shader that shades a cube pure red with no lighting. Note that the shaders are passed as arguments to Blender’s Python API, which passes them on to the internal OpenGL API. Like other game engines, Blender also supports passing game data such as a material’s lighting configuration and the camera’s view transformation matrix to the shaders, which allows intuitive coordination between the shaders and the scene.

Figure 2 - A Sample Shader in Blender (my original picture)
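For readers without the screenshot, a shader pair like the one in Figure 2 can be sketched as two GLSL source strings held in a Python script. This is a reconstruction under assumptions, not the exact code from the figure: the GLSL uses the legacy built-ins (`gl_ModelViewProjectionMatrix`, `gl_FragColor`), and `apply_shader` is a hypothetical helper that expects an object with `isValid()` and `setSource()` methods, which is how I recall the Blender 2.7x game engine shader object behaving.

```python
VERTEX_SHADER = """
void main()
{
    // Transform the vertex from object space to clip space.
    gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;
}
"""

FRAGMENT_SHADER = """
void main()
{
    // Every covered pixel gets the same unlit red.
    gl_FragColor = vec4(1.0, 0.0, 0.0, 1.0);
}
"""


def apply_shader(shader):
    """Hand both sources to a shader object, as strings, for compilation."""
    if shader is not None and not shader.isValid():
        shader.setSource(VERTEX_SHADER, FRAGMENT_SHADER, True)
    return shader


# Inside the Blender Game Engine the shader object would be obtained roughly
# like this (assumed usage, only available when run by the engine):
#   import bge
#   obj = bge.logic.getCurrentController().owner
#   apply_shader(obj.meshes[0].materials[0].getShader())
```

The point of the sketch is the data flow: the shaders are plain strings in the script, and the engine forwards them to the graphics driver.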

1.3 Physically Based Rendering

Physically based rendering (PBR) is a growing trend in the game and film CG industries. Basically, PBR means rendering images according to mathematical models of real-life physics. Thanks to the growth of computational power, many complex models can now be approximated with high precision, which gives highly realistic images.

A very important issue in PBR is simulating realistic lighting, for which the bidirectional reflectance distribution function (BRDF) is used. A BRDF gives the relationship between irradiance and radiance, i.e. how light intensity is distributed over each reflection direction after a light ray hits an opaque surface at a given angle. This includes all kinds of reflection (e.g. specular reflection and diffuse reflection), depending on the material of the surface. Usually, different BRDFs that each assume specific properties of the surface are combined to give a realistic lighting result. Common BRDFs include the Lambertian model (which assumes the surface is perfectly diffuse), the Blinn-Phong model (a traditional lighting model that approximates both diffuse and specular reflection) and the Cook-Torrance model (a more complex model that treats the surface as microfacets to give more accurate specular reflectance with Fresnel effects).
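The two simplest models named above can be sketched for a single light as follows. These are the commonly used shading terms (cosine-weighted diffuse and a half-vector specular lobe), not normalized BRDFs, and the shininess value is an arbitrary assumption.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    length = math.sqrt(dot(v, v))
    if length == 0.0:
        return v  # degenerate input, returned unchanged
    return tuple(x / length for x in v)


def lambert_diffuse(normal, to_light):
    """Lambertian term: depends only on the angle between normal and light."""
    return max(dot(normal, to_light), 0.0)


def blinn_phong_specular(normal, to_light, to_view, shininess=32.0):
    """Blinn-Phong term: a highlight around the half vector of light and view."""
    if dot(normal, to_light) <= 0.0:
        return 0.0  # light is behind the surface
    half = normalize(tuple(l + v for l, v in zip(to_light, to_view)))
    return max(dot(normal, half), 0.0) ** shininess


normal = (0.0, 0.0, 1.0)
to_view = (0.0, 0.0, 1.0)
to_light = normalize((0.0, math.sqrt(3.0), 1.0))  # 60 degrees off the normal

print(f"diffuse={lambert_diffuse(normal, to_light):.3f}")
print(f"specular={blinn_phong_specular(normal, to_light, to_view):.3f}")
```

All direction vectors are assumed to be unit length and to point away from the surface.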

The use of complex textures for different physical attributes is a crucial factor in making PBR realistic. For complex models whose physical attributes vary across the surface, a very important constraint must be considered: conservation of energy. In other words, the more reflectivity (short for specular reflection) an object exhibits, the less diffusion (short for diffuse reflection) it gives. Metalness is a parameter that determines the ratio between reflectivity and diffusion (which sum to 1). The name comes from the fact that metals usually have high reflectivity and low diffusion, while dielectric (non-metal) materials have the reverse. A metalness map can therefore be used as a texture to set the metalness in different parts of the object (most real-life objects have varying metalness across their surface), which gives artists direct control. Another attribute that has to obey energy conservation is roughness. Treating any surface as infinitely many microfacets, being rougher only means having a larger variance in microfacet orientation, which obviously obeys conservation of energy: the rougher a surface is, the more widely the reflected image spreads. Similarly, another texture, the roughness map, can be used here.
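The energy-conservation constraint can be sketched per texel as below. Mapping metalness directly to the reflectivity weight is a simplifying assumption made only to show the "sum to 1" rule, and the tiny 2x2 metalness map is made-up data.

```python
def split_energy(metalness):
    """Return (reflectivity, diffusion) weights that always sum to 1.

    Using metalness itself as the reflectivity weight is an illustrative
    simplification, not the exact mapping of any particular engine.
    """
    reflectivity = min(max(metalness, 0.0), 1.0)
    diffusion = 1.0 - reflectivity
    return reflectivity, diffusion


# Hypothetical 2x2 metalness map: painted plastic on top, bare metal below.
METALNESS_MAP = [
    [0.0, 0.2],
    [0.9, 1.0],
]

weights = [[split_energy(m) for m in row] for row in METALNESS_MAP]
for row in weights:
    print(row)
```

Because the diffuse weight is derived from the specular weight rather than authored separately, an artist cannot paint a texel that reflects more light than it receives.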

Direct illumination from simplified light models is easy to implement in shaders. Choosing a suitable BRDF and knowing the related material attributes is enough to give a result. However, when it comes to global illumination, calculating the correct lighting is no longer trivial. In most scenes, the background environment reflects light and illuminates its surroundings indirectly. Also, realistic lights may have different shapes and color distributions. These phenomena go far beyond what simplified light models can represent. Although the BRDF still applies in these situations, it requires integrals and other complex mathematical computations that cannot be carried out directly. Therefore, a technique called image based lighting (IBL) comes to the rescue, and it is also the first topic I researched and implemented. In the next episode, I will talk about how I implemented image based lighting.

Another issue is subsurface scattering. The BRDF assumes an opaque object surface, on which all reflection happens at the point of incidence (diffuse reflection is actually light scattered by the material’s internal microscopic irregularities, but it can be treated as happening directly at the surface). However, for materials that are not so opaque (for example, human skin), the effects of transmittance and subsurface scattering cannot be ignored. Calculating transmittance is easy, as it is essentially an inverted BRDF on the opposite side of the surface. Subsurface scattering, however, is relatively complex, because light reflected from below the surface can exit at a location other than the point of incidence. There is a general function called the bidirectional scattering-surface reflectance distribution function (BSSRDF) that describes the relation between irradiance and radiance in this phenomenon. However, the function is too general, and special cases can be used to handle specific tasks more efficiently. Since my task is currently restricted to rendering human skin, I will focus on the special case of subsurface scattering inside human skin. Episode 3 will introduce the method I use and the details of the implementation.

List of Figures

Figure 1: Rendering Pipeline. Source: My original diagram created by Word

Figure 2: A Sample Shader in Blender. Source: My original screenshot of Blender

A Trip to Shanxi, Part 1 (山西游记之一)

Thu, 11 Aug 2016 00:00:00 -0700

Arriving in Taiyuan

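1. Shanxi means "west of the Taihang Mountains". In ancient times it was also called Hedong, meaning "east of the Yellow River". I have always found Shanxi mysterious. Why? Most of the north is one flat plain; only Shanxi rises above it on the Loess Plateau. There is a modern saying: "For five thousand years of heritage, look underground in Shaanxi and above ground in Shanxi", meaning that Shanxi's preservation of ancient relics is unmatched in China. Through the wars of successive dynasties, the Central Plains were in turmoil, and nearly all of their relics were destroyed in the fighting. The south was peaceful, but it was developed so late that its relics are hardly worth mentioning. Most of the ancient relics that survive in China today were recovered by excavation. Only Shanxi, protected by its rugged terrain, kept its ancient sites above ground. Nanchan Temple at Wutai has stood since the Tang dynasty and is the oldest surviving timber-frame building in the country. Shanxi also preserves more pre-Ming buildings than anywhere else. To see these sights, I traveled from Fujian to Shanxi and did not mind the hardship. (The original of this paragraph was written in clumsy classical Chinese, a textbook example of showing off and failing...)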

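2. At our first dinner, the appetizer turned out to be a bottle of aged vinegar served as an oral tonic! That is a rather aggressive way to push a local specialty. But Shanxi vinegar has a sweet note, and drinking it was surprisingly refreshing...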

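3. Dinner was at the Shanxi Huiguan (山西会馆), apparently where Shanxi University holds its alumni gatherings. The decor combines the university's old campus with a Shanxi farmhouse. The rustic local dishes were distinctive.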

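3. After dinner a thunderstorm broke out; the summer heat vanished at once, and the raindrops were quite cold. It was the same when I visited Shaanxi in 2009. The chill of the land here is very different from the south.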

4. Taiyuan's night view is nice. The city is clean and stately, far outclassing Zhengzhou.

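5. We spent the night at the Yuyuan Hotel (愉园酒店), apparently one of Taiyuan's long-established hotels. Inside, the facilities were indeed dated: no hair dryer, a fridge that would not open, and even a broken bathroom lock... In the morning I pulled back the curtains and looked out over Taiyuan's old town, veiled in a light haze, quiet and peaceful.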

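6. The next morning we had breakfast at the morning market near the Kaihua Temple antique market: wontons and egg-stuffed pancakes (鸡蛋灌饼). Every breakfast item you would expect in the north can be found here. The big fried flatbread looks like several youtiao joined together and is very tempting. (Later, at Mount Wutai, I had youtiao for breakfast, was amazed at how good it was, and regretted not buying the flatbread in Taiyuan.)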

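7. The woman running the breakfast stall was very kind. I went over to carry several bowls of wontons, and each time she carefully reminded me to watch out for the crowd behind me. Every table here also has aged vinegar on it, of course :) The morning market was bustling and packed shoulder to shoulder. We bought a few honey peaches at almost one fifth of the price in Xiamen, and they tasted better too. It seems fruit really is best eaten where it is grown.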

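8. We got ready to get on the expressway. The toll station had several lanes, but the signage was unclear, and what looked like a real-estate billboard turned out to be the road sign! That sent us the wrong way, circling back cost us half an hour, and we hit a traffic jam right after getting on the expressway. Only then did I get a taste of how chaotic city management in Shanxi can be...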

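9. Once on the Erguang Expressway (二广高速), I learned that Shanxi's greenery is not limited to Taiyuan. The Loess Plateau is very different from what I had imagined (I had the same feeling on the expressways in northern Shaanxi, and I hope it is not just a showcase project along the stretches visible from the road...). Plains, ridges, grassland and white poplars interweave into northern scenery that is a delight to the eye. Poplar leaves reflect light so strongly that from a distance I took them for white flowers.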

(To be continued…)

Hello, World

Sat, 23 Jul 2016 00:00:00 -0700