·ChinaUnix首页 ·论坛 ·博客 
Linux首页 | Linux新闻 | Linux文档 | Linux论坛 | Linux下载 | Linux博客 | Linux搜索
新手入门 | 安装启动 | 管理员指南 | 开发者手册 | 桌面应用 | 程序开发 | 数据库 | 网络技术| CentOS | Fedora | RHEL | Ubuntu | Apache | MySQL | PHP
  Linux时代 >> 技术文档 >> 程序开发
 
使用SIMD指令高度优化Matrix类
来源: ChinaUnix博客  日期: 2008.04.14 16:25 (共有条评论) 我要评论
 

图形学中最常用的底层类为矢量类(Vector)和矩阵类(Matrix).已经存在很多实现的版本,甚至包括用汇编语言写的内联函数库版本。但这些是否是最优化的类呢?下面介绍的矩阵类版本利用了SIMD指令集优化技术实现了数据对齐与并行处理,极大地提高了矩阵操作速度,甚至比微软d3dmatrix.h中的类速度快上两倍,比用内联汇编技术编写的类快上一倍多。
Introduction
This article is about one of the most common elements in 3D graphics:3D transformation matrices and vectors. As we all know, 3D graphics uses [x,y,z] or [x,y,z,w] vectors to represent points in the three-dimensional space or in the homogeneous space, respectively. You can manipulate these vectors by multiplying them with 4x4 transformation matrices. The standard manipulation matrices are translation, rotation and scaling.
For example, to rotate a vector by 90° around the X axis and then move it 20 units toward the positive infinity of the Y axis, create a rotation matrix and a translation matrix, and multiply the vector with both of them. It is simpler and faster to multiply both the matrices in advance. Later, you can multiply the vector with the product when needed.


However, standard multiplication of one matrix by another takes 64 products and 48 sums. A standard multiplication of a vector by a matrix takes 16 products and 12 sums.
This article presents a better way to manipulate vectors and matrices using Intel’s Streaming SIMD Extensions, to enable the calculation of 4 products or 4 additions at once. The SIMD register can hold 4 floating-point numbers with single precision. Hence one register can hold a full vector, and four registers can hold a full 4x4 matrix.
The Library
Using the Streaming SIMD Extensions, you can complete a matrix multiplication with only 16 products and 12 additions. The library provided in this article was written with the goal to get the most out of the Streaming SIMD Extensions, and to reduce the amount of time needed for matrix/vector operations. Most of the library is written in C++, except for one small section in assembly. The library’s functions are about twice as fast as the equivalent scalar version of those functions.The library includes three classes:
GPMatrix class
The GPMatrix class is a 4x4 matrix of floats.
  
class GPMatrix {
    union {
       struct {
           __m128 _L1, _L2, _L3, _L4;
       };
       struct {
           float _11, _12, _13, _14;
           float _21, _22, _23, _24;
           float _31, _32, _33, _34;
           float _41, _42, _43, _44;
       };
    };
    // ...
};
Figure. 1 The GPMatrix Class
  The class contains 16 float elements (_11 to _44). The elements are placed in four lines, where each line is represented as one SIMD variable (_L1 to _L1).
Data elements can be referenced by their row and column: Mat[0][1] or Mat(0,1).
There is one limitation for the use of this class:
The class must be 16-byte aligned, so it won’t split between too many cache lines, and to enable faster reading/writing of the class’ elements. However, the class is aligned automatically by the
Intel C++ compiler
.
The GPVector class

The GPVector class is a vector of 4 floats.
  
class GPVector {
    union {
        __m128 vec;
        struct {
            float x,y,z,w;
        };
    };
    // ...
};
Figure 2. The GPVector Class
  The GPVector class has the x,y,z and w elements of the vector as floats, and also represented as one SIMD variable. As with the GPMatrix class, the GPVector class must be 16-bytes aligned.
A variant of the GPVector class is the GPVector3 class, which does not have the w element. This class holds "pure" 3D vectors. However, for alignment and for other reasons, the w element, which is not used, is replaced with a spacer.
Constructors & Operators
Operators on GPMatrix:
A * B
matrices multiplication
A ± B
matrices addition/subtraction
±A
matrix unary minus/plus
A * s
matrix multiplication with scalar
A *= B
matrix multiplied by matrix
A *= s
matrix multiplied by scalar
A ±= B
matrix added/subtracted by matrix

matrix transpose

matrix inverse

matrix determinant

matrix min/max element
GPMatrix Constructors:

Identity matrix

Zero matrix

Rotation matrices (around the X axis, Y axis and Z axis)

Translation matrices

Scaling matrices
Operators on GPVector:
v * M
vector multiplication with matrix
v * s
vector multiplication with scalar
v * w
vectors dot (inner) product
v % w
vectors cross product (in 3D)
v ± w
vectors addition/subtraction
± v
vector unary minus/plus
~ v
vector normalization
v *= M
vector multiplied by matrix
v *= s
vector multiplied by scalar
v ±= w
vector added/subtracted by vector

Examples
This section provides some examples of Streaming SIMD Extensions usage, compared with equivalent scalar code.
Using SIMD code
First, lets take a look on one of the most obvious and fundamental operator in the library – multiplying a vector by a matrix.
Scalar Code
The scalar and standard way is to calculate each element of the destination vector by multiplying the source vector with the appropriate column of the matrix. The computation takes 16 multiplications and 12 sums.
  
void scalarVectorMult (const GPVector &Vec, const GPMatrix &Mat, GPVector &Res) {
    Res.x = Vec.x*Mat._11 + Vec.y*Mat._21 + Vec.z*Mat._31 + Vec.w*Mat._41;
    Res.y = Vec.x*Mat._12 + Vec.y*Mat._22 + Vec.z*Mat._32 + Vec.w*Mat._42;
    Res.z = Vec.x*Mat._13 + Vec.y*Mat._23 + Vec.z*Mat._33 + Vec.w*Mat._43;
    Res.w = Vec.x*Mat._14 + Vec.y*Mat._24 + Vec.z*Mat._34 + Vec.w*Mat._44;
}
Figure 3. Vector Multiplication – Scalar code
SIMD Code

void VectorMult(const GPVector &Vec, const GPMatrix &Mat, GPVector &Res) {
    F32vec4 Result;
    Result = F32vec4(Vec.x) * Mat._L1;
    Result += F32vec4(Vec.y) * Mat._L2;
    Result += F32vec4(Vec.z) * Mat._L3;
    Result += F32vec4(Vec.w) * Mat._L4;
    Res = Result;
}
Figure 4. Vector Multiplication – SIMDified code
If we take a closer look in the scalar multiplication process, we can see that we can calculate the whole vector at once: In the scalar code, Vec.x is multiplied with the first four elements of the matrix. Those four elements are represented as the first line of the matrix, and are already placed in one SIMD variable. So we only need to expand the X element of the vector and multiply it with the first line of the matrix. This is done in the first assignment (third line) of the SIMD code. Next we multiply the expanded Y, Z and W elements of the vector with the second, third and forth line of the matrix respectively, as for the first element. Note that the X element of the result vector is the first element in the sum of the four vectors calculated before, the Y element is the second and so on. Therefore, the final result is just the sum of all the four vectors, and you don’t even need to rearrange the results.
This equivalent SIMD code takes only 4 multiplications and 3 sums. Since one SIMD calculation takes about half the time of four scalar calculations, the SIMD code runs more than twice as fast as the scalar code.
Some Numbers
This table shows the performance gain of using the Matrix Library, compared to Microsoft*‘s D3DXMATRIX class from D3DXMath.H of DirectX*7, which implements the same functions the scalar way.
Please note that this is a synthetic test, so those ratios may change when measured in application.
  
Function
Scalar Code
Matrix Library
ratio
Vector Multiplication
  60I
26
2.30
Matrix Multiplication
282I
87
3.26
Inverse Matrix
328
170
1.92
Make Rotation Matrix
169
143II
1.18
I The scalar version of the function is not inlined, so accurate numbers should be ~10 cycles less.
II There is another version of this function that uses
fast approximations
, and takes only 112 cycles.
And Finally – A Real Example
The last example presents code for calculating an exponent of a matrix.An exponent of a real number can be calculated using Taylor Series. In the same way, an exponent of matrix is defined:


There are three versions for calculating the series:
  • Using scalar code.
  • Using the GPMatrix class.
  • All the functions as inlined assembly (with no cross optimizations between the functions).
    The first and second versions are written using the classes’ operands. Actually I wrote them in about ten minutes.
    The third version, which was written using inlined assembly, took much more time to write – about an hour or two. In addition, this version is not fully optimized – the optimized functions are just placed end to end, without cross procedural optimizations.
    The first two versions are quite readable and can be modified easily. The third version is written in assembly, and is therefore much harder to modify. See the source file Exponent.cpp which is part of the
    MatLib.zip
    file.
    The next table shows the average time an iteration takes for each version:
    Version
    Average Time
    Scalar code
    371
    GPMatrix code
    130
    Inlined assembly
    144
    Note that using the GPMatrix instead of scalar code gives an improvement of x2.85! Even if the scalar functions were fully inlined, the GPMatrix code would still be more than twice as fast.
    The results of the GPMatrix version are even better than the assembly version since the compiler did a better job of optimizing the inlined functions… However, even a fully optimized assembly code (i.e. hand optimizations between functions) won’t give much better results.
    Conclusion
    The source files of the library and of the last example can be found in
    MatLib.zip
    .
    This library demonstrates how a performance library can be written without hardly using any assembly. Usually, writing good assembly is more efficient than C/C++. In this case writing the library functions without inlined assembly allows the compiler to perform inter-procedural optimizations on your code.
    Links
    Calculating rotation matrix using fast approximation, is done with the sine function from the Approximate Math (AM) library, download it from
    http://developer.intel.com/design/pentiumiii/devtools/
    .
    Other Resources
    Two related application notes:
    AP-928
    - Inverse of 4x4 Matrix
    AP-930
    - Matrix Multiplication
    Some Useful Links:
    Download an evaluation copy of Intel C/C++ Compiler at
    http://developer.intel.com/vtune/compilers/cpp/demo.htm
    .
    Vtune Analyzer can be used to measure the time consumed by function(s), and more. Download an evaluation copy at
    http://developer.intel.com/vtune/
    .
    Haim Barad has a Ph.D. in Electrical Engineering (1987) from the University of Southern California. His areas of concentration are in 3D graphics, video and image processing. Haim was on the Electrical Engineering faculty at Tulane University before joining Intel in 1995. Haim is a staff engineer and currently leads the Media Team at Intel's Israel Design Center (IDC) in Haifa, Israel.


    本文来自ChinaUnix博客,如果查看原文请点:http://blog.chinaunix.net/u/30686/showart_531018.html
  •   发表评论 查看评论(共有条评论) 我要提问
     
     


    最新资讯更多>> 
    · Symbian开源:救赎 反击与未来
    · Google发布内部安全侦测工具Ra..
    · RMS攻击盖茨、微软及其慈善基金会
    · Richard Stallman对盖茨退休的..
    · 发行版发布:Ubuntu 8.04.1
    · Ubuntu Tweak 0.3.4正式发布
    · 哪一个 Linux 发行版最流行?
    · 开源运营引发中小网游厂商运营..
    · Core 2 Duo E8400 Ubuntu性能全..
    · Linux下Wine中文对话框乱码解决法
    论坛热点更多>> 
    · 70后和80后 的婚姻
    · 系统时钟比硬件时钟快很多,..
    · crontab内容无法启动
    · [转贴]热词“俯卧撑”风靡网络
    · 闲啊闲~闲啊闲
    · 墙纸 1280x1024
    · 好儿呢。。。
    · 原来我也是一个小心眼得女人
    · 信誉积分怎么得?
    · @@
    文档更新更多>> 
    · [转]几种linux内核文件的区别
    · Debian 在线播放
    · epoll入门
    · yum出现Existing lock /var/run/..
    · RPM 的介绍和应用
    · SSH客户端命令的使用
    · Unix/Linux下文件基本操作
    · vmstat命令列出的属性详解
    · linux学习笔记分享 (Linux入门绝佳)
    · squid 命令行选项分类
     
    关于我们 | 联系方式 | 广告合作 | 诚聘英才 | 网站地图 | 友情链接 | 免费注册

    Copyright © 2001-2008 ChinaUnix.net All Rights Reserved

    感谢所有关心和支持过ChinaUnix的朋友们