Subject: Speeding up the CSR code (speeding up the linear algebra routines)
Article No. 857
Date: 2010/10/29(Fri) 21:28:50
Contributor: Akio Morita
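CSR@MakeZL[] has two hot spots:
 * tqr() and thess(), called from tfeigensystem()
 * tsolvm(), called from tflinearsolve()
Depending on the conditions, eigenvalue/eigenvector decomposition accounts for
about 60% of the CPU time and matrix inversion for about 35%.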

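Research on mathematical algorithms for this class of problems, and improvements
to their implementations, advance rapidly in numerical computing. Packages tuned
for individual CPUs and implementations optimized for SMP and parallel machines
are available.
In numerical linear algebra, BLAS/LAPACK is the de facto standard interface, so
the results of such algorithmic improvements and implementation tuning are
relatively easy to use through the BLAS/LAPACK API.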

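As an experiment on the amorita branch, I replaced the internal implementation of
Eigensystem[] with BLAS/LAPACK. With GotoBLAS2, known as a fast BLAS implementation,
as the backend, eigenvalue/eigenvector decomposition of a real general matrix
(N of about 512) became about 5 times faster, and on the sample I have at hand
the run time of CSR@MakeZL[] dropped by about 40%.
# The standard BLAS/LAPACK gives no speedup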
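
As a rough cross-check of the figures above, here is a minimal sketch of an Amdahl's law estimate. It assumes that only the eigendecomposition part (taken as 60% of the run time) is accelerated and that the 5x figure applies to it; the function name is illustrative and not part of SAD.

```python
def overall_time_fraction(hot_fraction: float, speedup: float) -> float:
    """Amdahl's law: remaining run time as a fraction of the original run time."""
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be in [0, 1]")
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    # The untouched part keeps its cost; the hot part shrinks by the speedup factor.
    return (1.0 - hot_fraction) + hot_fraction / speedup


if __name__ == "__main__":
    # Assumed inputs: 60% of the time in eigendecomposition, made about 5x faster.
    remaining = overall_time_fraction(0.60, 5.0)
    print(f"remaining time: {remaining:.2f}, reduction: {1.0 - remaining:.0%}")
```

Under these assumptions the estimate is a reduction of about 48%, which is the same order as the roughly 40% seen on the sample. The real fraction depends on the conditions, so this is only a plausibility check.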

Anyone who wants to reproduce or develop this can experiment with the following environment:
 * amorita branch revision 3366 or later
 * LAPACK extension module
 * CSR extension module
 * An optimized or parallelized BLAS/LAPACK library

Subject: Speeding up the linear algebra routines
Article No. 858
Date: 2010/11/04(Thu) 15:43:38
Contributor: Akio Morita
The extension module Math/LAPACK, which aims to speed up the linear algebra
routines by using BLAS & LAPACK, is now more or less complete.

The following functions are accelerated:
 * LinearSolve[]
 * Inverse[]
 * SingularValues[]
 * Eigensystem[]
That is four functions (effectively three, because Inverse[] is implemented with LinearSolve[]).

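Usage
 1. Install BLAS & LAPACK on the system
    An optimized BLAS such as ATLAS or GotoBLAS is recommended

 2. Download the LAPACK extension module, then make & make install
    Depending on the BLAS & LAPACK installed on the system,
    some configuration is required, such as:
      * setting the USE_BLAS variable
      * specifying the BLAS & LAPACK libraries to link in the LDOPT_ADD variable

 3. In SADScript, call Library@Require["Math/LAPACK"]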

After these steps, the fast versions of the linear algebra routines become
available with the `Lapack' prefix (e.g. LapackLinearSolve[]).

To replace the functions provided by the SAD core completely, comment out
``COPT_ADD=-DUSE_LAPACK_PREFIX'' in the Makefile.

It requires SAD amorita branch revision 3407 or later.

The speed depends on the matrix size and on how well the BLAS is optimized,
but with a well-optimized BLAS and reasonably large matrices it is
about 5 to 10 times faster than the SAD core routines.

Subject: Re: Speeding up the linear algebra routines
Article No. 859
Date: 2010/11/05(Fri) 15:59:08
Contributor: Akio Morita
I ran benchmarks with the BLAS/LAPACK implementations that are relatively easy to obtain in my environment.
SAD:amorita branch r3415
Module:Math/LAPACK extension r3441
OS:FreeBSD/amd64 8.1-STABLE
CPU:Quad-Core AMD Opteron(tm) Processor 2376 (2300.11-MHz K8-class CPU)
Date:2010/11/05

---------------------------------------------------------------------------------
Real Eigensystem[teigen]
 N =      2 L =  32768 T =      .008237 +/-      .140829 msec   # of failures = 0
 N =      4 L =  32768 T =      .014519 +/-      .062383 msec   # of failures = 0
  TEIGEN convergence failed. Range =           2           5
         Lower right corner =  0.52571493004973358       4.63855639037369441E-002  0.73160999814810090     
 N =      8 L =  32768 T =      .048708 +/-      .144691 msec   # of failures = 0
 N =     16 L =  10240 T =      .216757 +/-      .132459 msec   # of failures = 0
 N =     32 L =   2560 T =     1.159240 +/-      .257751 msec   # of failures = 0
 N =     64 L =    640 T =     6.664422 +/-      .284198 msec   # of failures = 0
 N =    128 L =    160 T =    54.954325 +/-     1.280301 msec   # of failures = 0
 N =    256 L =     40 T =   474.934650 +/-     9.550046 msec   # of failures = 0
 N =    512 L =     10 T =  4954.452600 +/-    51.146981 msec   # of failures = 0

Real Eigensystem[DGEEVX@LAPACK-3.2.2]
 N =      2 L =  32768 T =      .013218 +/-      .157030 msec   # of failures = 0
 N =      4 L =  32768 T =      .024014 +/-      .067510 msec   # of failures = 0
 N =      8 L =  32768 T =      .070447 +/-      .110767 msec   # of failures = 0
 N =     16 L =  10240 T =      .280239 +/-      .204191 msec   # of failures = 0
 N =     32 L =   2560 T =     1.517591 +/-      .306330 msec   # of failures = 0
 N =     64 L =    640 T =     8.358041 +/-      .380745 msec   # of failures = 0
 N =    128 L =    160 T =    94.711063 +/-     8.258527 msec   # of failures = 0
 N =    256 L =     40 T =   684.155500 +/-    10.204925 msec   # of failures = 0
 N =    512 L =     10 T =  3075.963100 +/-    45.029070 msec   # of failures = 0

Real Eigensystem[DGEEVX@ATLAS-3.8.3]
 N =      2 L =  32768 T =      .014668 +/-      .160511 msec   # of failures = 0
 N =      4 L =  32768 T =      .027324 +/-      .072267 msec   # of failures = 0
 N =      8 L =  32768 T =      .075819 +/-      .112824 msec   # of failures = 0
 N =     16 L =  10240 T =      .498010 +/-      .199880 msec   # of failures = 0
 N =     32 L =   2560 T =     2.269072 +/-      .379338 msec   # of failures = 0
 N =     64 L =    640 T =    10.689588 +/-     1.262235 msec   # of failures = 0
 N =    128 L =    160 T =    95.835031 +/-     8.315531 msec   # of failures = 0
 N =    256 L =     40 T =   681.952975 +/-    16.886181 msec   # of failures = 0
 N =    512 L =     10 T =  4109.489400 +/-   127.084073 msec   # of failures = 0

Real Eigensystem[DGEEVX@GotoBLAS-2.1.13]
 N =      2 L =  32768 T =      .014868 +/-      .167066 msec   # of failures = 0
 N =      4 L =  32768 T =      .026354 +/-      .093779 msec   # of failures = 0
 N =      8 L =  32768 T =      .070111 +/-      .104149 msec   # of failures = 0
 N =     16 L =  10240 T =      .256042 +/-      .191495 msec   # of failures = 0
 N =     32 L =   2560 T =     1.271190 +/-      .221631 msec   # of failures = 0
 N =     64 L =    640 T =     6.846203 +/-      .493545 msec   # of failures = 0
 N =    128 L =    160 T =    80.511619 +/-     8.542452 msec   # of failures = 0
 N =    256 L =     40 T =   401.314400 +/-     9.575067 msec   # of failures = 0
 N =    512 L =     10 T =  1388.119100 +/-    25.401272 msec   # of failures = 0

---------------------------------------------------------------------------------
Complex Eigensystem[tceigen]
 N =      2 L =  32768 T =      .008042 +/-      .126553 msec   # of failures = 0
 N =      4 L =  32768 T =      .021977 +/-      .058947 msec   # of failures = 0
 N =      8 L =  32768 T =      .084895 +/-      .108231 msec   # of failures = 0
 N =     16 L =  10240 T =      .414323 +/-      .124297 msec   # of failures = 0
 N =     32 L =   2560 T =     2.466422 +/-      .159884 msec   # of failures = 0
 TCEIGEN Convergence fail.
 N =     64 L =    640 T =    17.269752 +/-      .420907 msec   # of failures = 0
 N =    128 L =    160 T =   151.543269 +/-     2.482483 msec   # of failures = 0
 N =    256 L =     40 T =  1536.990525 +/-    18.398861 msec   # of failures = 0
 N =    512 L =     10 T = 13866.533400 +/-   159.996810 msec   # of failures = 0

Complex Eigensystem[ZGEEVX@LAPACK-3.2.2]
 N =      2 L =  32768 T =      .017757 +/-      .059012 msec   # of failures = 0
 N =      4 L =  32768 T =      .040167 +/-      .092559 msec   # of failures = 0
 N =      8 L =  32768 T =      .119518 +/-      .072383 msec   # of failures = 0
 N =     16 L =  10240 T =      .507698 +/-      .105572 msec   # of failures = 0
 N =     32 L =   2560 T =     2.669335 +/-      .141993 msec   # of failures = 0
 N =     64 L =    640 T =    16.661434 +/-      .290309 msec   # of failures = 0
 N =    128 L =    160 T =   123.347500 +/-     1.494198 msec   # of failures = 0
 N =    256 L =     40 T =  1033.310050 +/-    13.795092 msec   # of failures = 0
 N =    512 L =     10 T =  6427.120600 +/-    66.541751 msec   # of failures = 0

Complex Eigensystem[ZGEEVX@ATLAS-3.8.3]
 N =      2 L =  32768 T =      .020239 +/-      .164884 msec   # of failures = 0
 N =      4 L =  32768 T =      .049169 +/-      .102389 msec   # of failures = 0
 N =      8 L =  32768 T =      .137168 +/-      .097606 msec   # of failures = 0
 N =     16 L =  10240 T =      .675185 +/-      .113269 msec   # of failures = 0
 N =     32 L =   2560 T =     3.237613 +/-      .222750 msec   # of failures = 0
 N =     64 L =    640 T =    19.622386 +/-      .812437 msec   # of failures = 0
 N =    128 L =    160 T =   143.707500 +/-     2.858266 msec   # of failures = 0
 N =    256 L =     40 T =  1256.908150 +/-    20.315763 msec   # of failures = 0
 N =    512 L =     10 T = 11165.999600 +/-   172.824775 msec   # of failures = 0

Complex Eigensystem[ZGEEVX@GotoBLAS-2.1.13]
 N =      2 L =  32768 T =      .019002 +/-      .153284 msec   # of failures = 0
 N =      4 L =  32768 T =      .043108 +/-      .073305 msec   # of failures = 0
 N =      8 L =  32768 T =      .114783 +/-      .068736 msec   # of failures = 0
 N =     16 L =  10240 T =      .560256 +/-      .131265 msec   # of failures = 0
 N =     32 L =   2560 T =     2.694456 +/-      .128608 msec   # of failures = 0
 N =     64 L =    640 T =    15.112442 +/-      .442129 msec   # of failures = 0
 N =    128 L =    160 T =    91.809056 +/-     1.331968 msec   # of failures = 0
 N =    256 L =     40 T =   424.419800 +/-     4.313518 msec   # of failures = 0
 N =    512 L =     10 T =  2694.422600 +/-    48.715928 msec   # of failures = 0

---------------------------------------------------------------------------------
Real SingularValues[tsvdm]
 N =      2 L =  32768 T =      .011872 +/-      .135057 msec   # of failures = 0
 N =      4 L =  32768 T =      .017514 +/-      .056119 msec   # of failures = 0
 N =      8 L =  32768 T =      .033522 +/-      .049175 msec   # of failures = 0
 N =     16 L =  10240 T =      .112781 +/-      .078994 msec   # of failures = 0
 N =     32 L =   2560 T =      .513979 +/-      .205664 msec   # of failures = 0
 N =     64 L =    640 T =     3.567764 +/-      .212971 msec   # of failures = 0
 N =    128 L =    160 T =    49.825387 +/-      .410513 msec   # of failures = 0
 N =    256 L =     40 T =   561.219125 +/-     3.144857 msec   # of failures = 0
 N =    512 L =     10 T = 10544.715000 +/-    56.068778 msec   # of failures = 0

Real SingularValues[DGESDD@LAPACK-3.2.2]
 N =      2 L =  32768 T =      .022678 +/-      .187658 msec   # of failures = 0
 N =      4 L =  32768 T =      .032596 +/-      .060802 msec   # of failures = 0
 N =      8 L =  32768 T =      .074625 +/-      .068605 msec   # of failures = 0
 N =     16 L =  10240 T =      .250749 +/-      .085152 msec   # of failures = 0
 N =     32 L =   2560 T =      .920881 +/-      .073305 msec   # of failures = 0
 N =     64 L =    640 T =     4.754948 +/-      .094989 msec   # of failures = 0
 N =    128 L =    160 T =    26.753725 +/-      .139137 msec   # of failures = 0
 N =    256 L =     40 T =   187.731250 +/-      .537708 msec   # of failures = 0
 N =    512 L =     10 T =  1384.883600 +/-     3.334391 msec   # of failures = 0

Real SingularValues[DGESDD@ATLAS-3.8.3]
 N =      2 L =  32768 T =      .019903 +/-      .171499 msec   # of failures = 0
 N =      4 L =  32768 T =      .038882 +/-      .123298 msec   # of failures = 0
 N =      8 L =  32768 T =      .087390 +/-      .083651 msec   # of failures = 0
 N =     16 L =  10240 T =      .271005 +/-      .132160 msec   # of failures = 0
 N =     32 L =   2560 T =      .779123 +/-      .155682 msec   # of failures = 0
 N =     64 L =    640 T =     2.982009 +/-      .240249 msec   # of failures = 0
 N =    128 L =    160 T =    14.705319 +/-      .116347 msec   # of failures = 0
 N =    256 L =     40 T =    76.330825 +/-      .542039 msec   # of failures = 0
 N =    512 L =     10 T =   458.911900 +/-      .526803 msec   # of failures = 0

Real SingularValues[DGESDD@GotoBLAS-2.1.13]
 N =      2 L =  32768 T =      .021825 +/-      .177369 msec   # of failures = 0
 N =      4 L =  32768 T =      .032880 +/-      .081391 msec   # of failures = 0
 N =      8 L =  32768 T =      .069007 +/-      .088089 msec   # of failures = 0
 N =     16 L =  10240 T =      .195701 +/-      .052913 msec   # of failures = 0
 N =     32 L =   2560 T =      .583645 +/-      .071883 msec   # of failures = 0
 N =     64 L =    640 T =     2.153711 +/-      .180950 msec   # of failures = 0
 N =    128 L =    160 T =     9.765350 +/-      .120727 msec   # of failures = 0
 N =    256 L =     40 T =    56.321225 +/-      .302002 msec   # of failures = 0
 N =    512 L =     10 T =   378.101700 +/-      .426204 msec   # of failures = 0

---------------------------------------------------------------------------------
Complex SingularValues[tcsvdm]
 N =      2 L =  32768 T =      .013533 +/-      .147806 msec   # of failures = 0
 N =      4 L =  32768 T =      .024613 +/-      .077979 msec   # of failures = 0
 N =      8 L =  32768 T =      .403952 +/-      .307302 msec   # of failures = 0
 N =     16 L =  10240 T =      .246291 +/-      .131441 msec   # of failures = 0
 N =     32 L =   2560 T =     1.311957 +/-      .081337 msec   # of failures = 0
 N =     64 L =    640 T =    10.824608 +/-     1.309137 msec   # of failures = 0
 N =    128 L =    160 T =   103.553988 +/-     1.870567 msec   # of failures = 0
 N =    256 L =     40 T =  1062.854400 +/-     5.127493 msec   # of failures = 0
 N =    512 L =     10 T = 17645.269400 +/-   881.390705 msec   # of failures = 0

Complex SingularValues[ZGESDD@LAPACK-3.2.2]
 N =      2 L =  32768 T =      .027102 +/-      .206568 msec   # of failures = 0
 N =      4 L =  32768 T =      .044584 +/-      .175870 msec   # of failures = 0
 N =      8 L =  32768 T =     6.774005 +/-     4.371503 msec   # of failures = 0(*slow)
 N =     16 L =  10240 T =      .336427 +/-      .025955 msec   # of failures = 0
 N =     32 L =   2560 T =     1.376423 +/-      .152718 msec   # of failures = 0
 N =     64 L =    640 T =     8.209416 +/-      .262218 msec   # of failures = 0
 N =    128 L =    160 T =    49.707750 +/-      .226245 msec   # of failures = 0
 N =    256 L =     40 T =   377.112475 +/-     1.425033 msec   # of failures = 0
 N =    512 L =     10 T =  2763.281800 +/-    12.012295 msec   # of failures = 0

Complex SingularValues[ZGESDD@ATLAS-3.8.3]
 N =      2 L =  32768 T =      .031600 +/-      .221910 msec   # of failures = 0
 N =      4 L =  32768 T =      .055400 +/-      .171011 msec   # of failures = 0
 N =      8 L =  32768 T =     7.043678 +/-     4.634607 msec   # of failures = 0(*slow)
 N =     16 L =  10240 T =      .372316 +/-      .162209 msec   # of failures = 0
 N =     32 L =   2560 T =     1.320179 +/-      .055972 msec   # of failures = 0
 N =     64 L =    640 T =     6.073683 +/-      .232270 msec   # of failures = 0
 N =    128 L =    160 T =    35.523719 +/-      .844582 msec   # of failures = 0
 N =    256 L =     40 T =   249.999800 +/-    10.159306 msec   # of failures = 0
 N =    512 L =     10 T =  1616.151700 +/-    35.023459 msec   # of failures = 0

Complex SingularValues[ZGESDD@GotoBLAS-2.1.13]
 N =      2 L =  32768 T =      .027644 +/-      .188919 msec   # of failures = 0
 N =      4 L =  32768 T =      .042908 +/-      .069401 msec   # of failures = 0
 N =      8 L =  32768 T =     6.596583 +/-     4.197659 msec   # of failures = 0(*slow)
 N =     16 L =  10240 T =      .259719 +/-      .014824 msec   # of failures = 0
 N =     32 L =   2560 T =      .879463 +/-      .223959 msec   # of failures = 0
 N =     64 L =    640 T =     3.943514 +/-      .047710 msec   # of failures = 0
 N =    128 L =    160 T =    20.680231 +/-      .259194 msec   # of failures = 0
 N =    256 L =     40 T =   134.837150 +/-      .787147 msec   # of failures = 0
 N =    512 L =     10 T =   884.559200 +/-    11.894268 msec   # of failures = 0

---------------------------------------------------------------------------------
Real LinearSolve[tsolvm]
 N =      2 L =  32768 T =      .012932 +/-      .162209 msec   # of failures = 0
 N =      4 L =  32768 T =      .012732 +/-      .047418 msec   # of failures = 0
 N =      8 L =  32768 T =      .020592 +/-      .047073 msec   # of failures = 0
 N =     16 L =  10240 T =      .059602 +/-      .025690 msec   # of failures = 0
 N =     32 L =   2560 T =      .246922 +/-      .049520 msec   # of failures = 0
 N =     64 L =    640 T =     1.590381 +/-      .116929 msec   # of failures = 0
 N =    128 L =    160 T =    41.466450 +/-      .115509 msec   # of failures = 0
 N =    256 L =     40 T =   404.196600 +/-      .347976 msec   # of failures = 0
 N =    512 L =     10 T =  4553.732400 +/-     6.132115 msec   # of failures = 0

Real LinearSolve[DGELSD@LAPACK-3.2.2]
 N =      2 L =  32768 T =      .020902 +/-      .171563 msec   # of failures = 0
 N =      4 L =  32768 T =      .033962 +/-      .072665 msec   # of failures = 0
 N =      8 L =  32768 T =      .079500 +/-      .042726 msec   # of failures = 0
 N =     16 L =  10240 T =      .280457 +/-      .112379 msec   # of failures = 0
 N =     32 L =   2560 T =     1.144411 +/-      .081641 msec   # of failures = 0
 N =     64 L =    640 T =     6.406339 +/-      .079681 msec   # of failures = 0
 N =    128 L =    160 T =    40.585156 +/-      .134758 msec   # of failures = 0
 N =    256 L =     40 T =   310.538725 +/-      .785799 msec   # of failures = 0
 N =    512 L =     10 T =  2862.740200 +/-    12.274714 msec   # of failures = 0

Real LinearSolve[DGELSD@ATLAS-3.8.3]
 N =      2 L =  32768 T =      .024493 +/-      .190497 msec   # of failures = 0
 N =      4 L =  32768 T =      .040578 +/-      .101808 msec   # of failures = 0
 N =      8 L =  32768 T =      .093644 +/-      .114790 msec   # of failures = 0
 N =     16 L =  10240 T =      .283796 +/-      .136987 msec   # of failures = 0
 N =     32 L =   2560 T =      .897974 +/-      .101839 msec   # of failures = 0
 N =     64 L =    640 T =     3.688769 +/-      .129619 msec   # of failures = 0
 N =    128 L =    160 T =    22.626219 +/-      .179070 msec   # of failures = 0
 N =    256 L =     40 T =   134.290775 +/-      .545878 msec   # of failures = 0
 N =    512 L =     10 T =  1196.390700 +/-    41.841181 msec   # of failures = 0

Real LinearSolve[DGELSD@GotoBLAS-2.1.13]
 N =      2 L =  32768 T =      .023342 +/-      .195374 msec   # of failures = 0
 N =      4 L =  32768 T =      .034271 +/-      .082321 msec   # of failures = 0
 N =      8 L =  32768 T =      .072005 +/-      .105429 msec   # of failures = 0
 N =     16 L =  10240 T =      .205889 +/-      .056656 msec   # of failures = 0
 N =     32 L =   2560 T =      .681020 +/-      .155405 msec   # of failures = 0
 N =     64 L =    640 T =     2.651733 +/-      .128700 msec   # of failures = 0
 N =    128 L =    160 T =    16.469419 +/-      .132502 msec   # of failures = 0
 N =    256 L =     40 T =   115.589175 +/-      .481383 msec   # of failures = 0
 N =    512 L =     10 T =  1019.370300 +/-     1.474057 msec   # of failures = 0

---------------------------------------------------------------------------------
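
To compare the tables above without reading them row by row, the following is a minimal sketch that parses timing rows in this output format and prints the ratio of two tables. The regular expression and function names are my own, not part of the benchmark script; the sample rows are copied from the Opteron Real Eigensystem results in this article.

```python
import re

# Matches one timing row of the benchmark output, capturing N, L, T and its spread.
ROW = re.compile(
    r"N\s*=\s*(\d+)\s+L\s*=\s*(\d+)\s+T\s*=\s*([\d.]+)\s*\+/-\s*([\d.]+)\s*msec"
)


def parse_rows(text):
    """Return {N: mean time in msec}; non-matching lines (e.g. warnings) are skipped."""
    times = {}
    for line in text.splitlines():
        match = ROW.search(line)
        if match:
            times[int(match.group(1))] = float(match.group(3))
    return times


def speedup(baseline, candidate):
    """Return {N: baseline time / candidate time} for sizes present in both tables."""
    return {n: baseline[n] / candidate[n] for n in sorted(baseline) if n in candidate}


# Rows copied from Real Eigensystem[teigen] on the Opteron.
TEIGEN = """
 N =    128 L =    160 T =    54.954325 +/-     1.280301 msec   # of failures = 0
 N =    256 L =     40 T =   474.934650 +/-     9.550046 msec   # of failures = 0
 N =    512 L =     10 T =  4954.452600 +/-    51.146981 msec   # of failures = 0
"""

# Rows copied from Real Eigensystem[DGEEVX@GotoBLAS-2.1.13] on the Opteron.
GOTOBLAS = """
 N =    128 L =    160 T =    80.511619 +/-     8.542452 msec   # of failures = 0
 N =    256 L =     40 T =   401.314400 +/-     9.575067 msec   # of failures = 0
 N =    512 L =     10 T =  1388.119100 +/-    25.401272 msec   # of failures = 0
"""

if __name__ == "__main__":
    for n, ratio in speedup(parse_rows(TEIGEN), parse_rows(GOTOBLAS)).items():
        print(f"N = {n:4d}  teigen / DGEEVX@GotoBLAS = {ratio:.2f}")
```

For these rows the ratio is below 1 at N=128 and about 3.6 at N=512, so on this machine the gain appears only for the larger matrices.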

Subject: Re^2: Speeding up the linear algebra routines
Article No. 860
Date: 2010/11/05(Fri) 17:03:51
Contributor: Akio Morita
I ran the BLAS/LAPACK benchmarks in another environment.
SAD:amorita branch r3415
Module:Math/LAPACK extension r3441
OS:FreeBSD/amd64 8.1-STABLE
CPU:Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz (2666.78-MHz K8-class CPU)
Date:2010/11/05

---------------------------------------------------------------------------------
Real Eigensystem[teigen]
 N =      2 L =  32768 T =      .004914 +/-      .054961 msec   # of failures = 0
 N =      4 L =  32768 T =      .010696 +/-      .070075 msec   # of failures = 0
  TEIGEN convergence failed. Range =           2           5
         Lower right corner =  0.52571493004973358       4.63855639037369441E-002  0.73160999814810090     
 N =      8 L =  32768 T =      .033344 +/-      .071230 msec   # of failures = 0
 N =     16 L =  10240 T =      .141852 +/-      .126846 msec   # of failures = 0
 N =     32 L =   2560 T =      .836331 +/-      .128061 msec   # of failures = 0
 N =     64 L =    640 T =     4.385395 +/-      .289299 msec   # of failures = 0
 N =    128 L =    160 T =    32.750225 +/-     1.118398 msec   # of failures = 0
 N =    256 L =     40 T =   308.373100 +/-     9.375773 msec   # of failures = 0
 N =    512 L =     10 T =  2967.998600 +/-    92.109288 msec   # of failures = 0

Real Eigensystem[DGEEVX@LAPACK-3.2.2]
 N =      2 L =  32768 T =      .011405 +/-      .072050 msec   # of failures = 0
 N =      4 L =  32768 T =      .020033 +/-      .071798 msec   # of failures = 0
 N =      8 L =  32768 T =      .073028 +/-      .095900 msec   # of failures = 0
 N =     16 L =  10240 T =      .182777 +/-      .128211 msec   # of failures = 0
 N =     32 L =   2560 T =      .942186 +/-      .056568 msec   # of failures = 0
 N =     64 L =    640 T =    15.527852 +/-      .385307 msec   # of failures = 0
 N =    128 L =    160 T =   135.973019 +/-     2.716700 msec   # of failures = 0
 N =    256 L =     40 T =   543.980725 +/-     5.152799 msec   # of failures = 0
 N =    512 L =     10 T =  2460.381000 +/-    35.651359 msec   # of failures = 0

Real Eigensystem[DGEEVX@ATLAS-3.8.3]
 N =      2 L =  32768 T =      .012499 +/-      .040732 msec   # of failures = 0
 N =      4 L =  32768 T =      .022517 +/-      .071029 msec   # of failures = 0
 N =      8 L =  32768 T =      .055881 +/-      .071899 msec   # of failures = 0
 N =     16 L =  10240 T =      .320816 +/-      .129054 msec   # of failures = 0
 N =     32 L =   2560 T =     1.366923 +/-      .276729 msec   # of failures = 0
 N =     64 L =    640 T =     7.134169 +/-      .876691 msec   # of failures = 0
 N =    128 L =    160 T =    55.171881 +/-     4.410096 msec   # of failures = 0
 N =    256 L =     40 T =   380.538625 +/-     7.824095 msec   # of failures = 0
 N =    512 L =     10 T =  2340.383800 +/-    61.905827 msec   # of failures = 0

---------------------------------------------------------------------------------
Complex Eigensystem[tceigen]
 N =      2 L =  32768 T =      .005745 +/-      .045651 msec   # of failures = 0
 N =      4 L =  32768 T =      .014781 +/-      .057704 msec   # of failures = 0
 N =      8 L =  32768 T =      .055616 +/-      .044269 msec   # of failures = 0
 N =     16 L =  10240 T =      .271051 +/-      .077106 msec   # of failures = 0
 N =     32 L =   2560 T =     1.633234 +/-      .145197 msec   # of failures = 0
 TCEIGEN Convergence fail.
 N =     64 L =    640 T =    11.523869 +/-      .276761 msec   # of failures = 0
 N =    128 L =    160 T =    93.776237 +/-     7.066373 msec   # of failures = 0
 N =    256 L =     40 T =  1090.281600 +/-    13.078611 msec   # of failures = 0
 N =    512 L =     10 T =  9432.465000 +/-   327.177165 msec   # of failures = 0

Complex Eigensystem[ZGEEVX@LAPACK-3.2.2]
 N =      2 L =  32768 T =      .013927 +/-      .059110 msec   # of failures = 0
 N =      4 L =  32768 T =      .030352 +/-      .058527 msec   # of failures = 0
 N =      8 L =  32768 T =      .085863 +/-      .071898 msec   # of failures = 0
 N =     16 L =  10240 T =      .372235 +/-      .073488 msec   # of failures = 0
 N =     32 L =   2560 T =     2.088792 +/-      .064885 msec   # of failures = 0
 N =     64 L =    640 T =    13.832725 +/-      .237734 msec   # of failures = 0
 N =    128 L =    160 T =   100.479919 +/-     1.199283 msec   # of failures = 0
 N =    256 L =     40 T =   874.881150 +/-    13.824662 msec   # of failures = 0
 N =    512 L =     10 T =  5517.628600 +/-   177.549475 msec   # of failures = 0

Complex Eigensystem[ZGEEVX@ATLAS-3.8.3]
 N =      2 L =  32768 T =      .017170 +/-      .088134 msec   # of failures = 0
 N =      4 L =  32768 T =      .036427 +/-      .045012 msec   # of failures = 0
 N =      8 L =  32768 T =      .104784 +/-      .078750 msec   # of failures = 0
 N =     16 L =  10240 T =      .489181 +/-      .045275 msec   # of failures = 0
 N =     32 L =   2560 T =     2.414941 +/-      .214154 msec   # of failures = 0
 N =     64 L =    640 T =    15.113606 +/-      .927262 msec   # of failures = 0
 N =    128 L =    160 T =   107.582538 +/-     2.169691 msec   # of failures = 0
 N =    256 L =     40 T =   921.144450 +/-    78.077925 msec   # of failures = 0
 N =    512 L =     10 T =  6997.876600 +/-    65.148720 msec   # of failures = 0

---------------------------------------------------------------------------------
Real SingularValues[tsvdm]
 N =      2 L =  32768 T =      .009644 +/-      .046258 msec   # of failures = 0
 N =      4 L =  32768 T =      .012214 +/-      .067921 msec   # of failures = 0
 N =      8 L =  32768 T =      .024322 +/-      .041449 msec   # of failures = 0
 N =     16 L =  10240 T =      .075302 +/-      .073760 msec   # of failures = 0
 N =     32 L =   2560 T =      .343740 +/-      .036861 msec   # of failures = 0
 N =     64 L =    640 T =     3.107773 +/-      .221245 msec   # of failures = 0
 N =    128 L =    160 T =    28.426794 +/-     4.764495 msec   # of failures = 0
 N =    256 L =     40 T =   480.008375 +/-    56.490995 msec   # of failures = 0
 N =    512 L =     10 T =  3570.363100 +/-    61.897338 msec   # of failures = 0

Real SingularValues[DGESDD@LAPACK-3.2.2]
 N =      2 L =  32768 T =      .026266 +/-      .082811 msec   # of failures = 0
 N =      4 L =  32768 T =      .039730 +/-      .058722 msec   # of failures = 0
 N =      8 L =  32768 T =      .085655 +/-      .083125 msec   # of failures = 0
 N =     16 L =  10240 T =      .277530 +/-      .075076 msec   # of failures = 0
 N =     32 L =   2560 T =     1.038829 +/-      .022839 msec   # of failures = 0
 N =     64 L =    640 T =     5.943048 +/-      .239012 msec   # of failures = 0
 N =    128 L =    160 T =    38.459394 +/-      .153021 msec   # of failures = 0
 N =    256 L =     40 T =   286.069500 +/-      .713294 msec   # of failures = 0
 N =    512 L =     10 T =  2081.214000 +/-     2.634993 msec   # of failures = 0

Real SingularValues[DGESDD@ATLAS-3.8.3]
 N =      2 L =  32768 T =      .019553 +/-      .054526 msec   # of failures = 0
 N =      4 L =  32768 T =      .035733 +/-      .075945 msec   # of failures = 0
 N =      8 L =  32768 T =      .071613 +/-      .081462 msec   # of failures = 0
 N =     16 L =  10240 T =      .201926 +/-      .106510 msec   # of failures = 0
 N =     32 L =   2560 T =      .560016 +/-      .142055 msec   # of failures = 0
 N =     64 L =    640 T =     2.359422 +/-      .312576 msec   # of failures = 0
 N =    128 L =    160 T =    10.252638 +/-      .265303 msec   # of failures = 0
 N =    256 L =     40 T =    55.665500 +/-     3.808465 msec   # of failures = 0
 N =    512 L =     10 T =   304.373800 +/-    19.260853 msec   # of failures = 0

---------------------------------------------------------------------------------
Complex SingularValues[tcsvdm]
 N =      2 L =  32768 T =      .010985 +/-      .078168 msec   # of failures = 0
 N =      4 L =  32768 T =      .017390 +/-      .010494 msec   # of failures = 0
 N =      8 L =  32768 T =     5.313555 +/-     4.616588 msec   # of failures = 0(*slow)
 N =     16 L =  10240 T =      .173400 +/-      .103538 msec   # of failures = 0
 N =     32 L =   2560 T =      .939715 +/-      .022199 msec   # of failures = 0
 N =     64 L =    640 T =     6.465434 +/-      .100230 msec   # of failures = 0
 N =    128 L =    160 T =    57.908787 +/-     4.018248 msec   # of failures = 0
 N =    256 L =     40 T =   669.606875 +/-     4.180820 msec   # of failures = 0
 N =    512 L =     10 T =  5147.778300 +/-    45.043540 msec   # of failures = 0

Complex SingularValues[ZGESDD@LAPACK-3.2.2]
 N =      2 L =  32768 T =      .022014 +/-      .041939 msec   # of failures = 0
 N =      4 L =  32768 T =      .033519 +/-      .041508 msec   # of failures = 0
 N =      8 L =  32768 T =     4.537111 +/-     3.173818 msec   # of failures = 0(*slow)
 N =     16 L =  10240 T =      .234432 +/-      .007166 msec   # of failures = 0
 N =     32 L =   2560 T =      .996835 +/-      .011787 msec   # of failures = 0
 N =     64 L =    640 T =     6.394497 +/-      .065906 msec   # of failures = 0
 N =    128 L =    160 T =    40.739569 +/-      .357006 msec   # of failures = 0
 N =    256 L =     40 T =   316.512525 +/-     1.539484 msec   # of failures = 0
 N =    512 L =     10 T =  2255.242000 +/-     3.732018 msec   # of failures = 0

Complex SingularValues[ZGESDD@ATLAS-3.8.3]
 N =      2 L =  32768 T =      .025624 +/-      .084478 msec   # of failures = 0
 N =      4 L =  32768 T =      .044957 +/-      .060120 msec   # of failures = 0
 N =      8 L =  32768 T =      .094604 +/-      .060680 msec   # of failures = 0
 N =     16 L =  10240 T =      .259461 +/-      .076981 msec   # of failures = 0
 N =     32 L =   2560 T =     1.003860 +/-      .035998 msec   # of failures = 0
 N =     64 L =    640 T =     4.478936 +/-      .153480 msec   # of failures = 0
 N =    128 L =    160 T =    23.881556 +/-      .207277 msec   # of failures = 0
 N =    256 L =     40 T =   148.580275 +/-     1.568393 msec   # of failures = 0
 N =    512 L =     10 T =   805.882300 +/-     4.571945 msec   # of failures = 0

---------------------------------------------------------------------------------
Real LinearSolve[tsolvm]
 N =      2 L =  32768 T =      .009529 +/-      .061895 msec   # of failures = 0
 N =      4 L =  32768 T =      .010196 +/-      .057294 msec   # of failures = 0
 N =      8 L =  32768 T =      .014949 +/-      .057801 msec   # of failures = 0
 N =     16 L =  10240 T =      .037556 +/-      .073777 msec   # of failures = 0
 N =     32 L =   2560 T =      .160285 +/-      .145266 msec   # of failures = 0
 N =     64 L =    640 T =     1.574538 +/-      .050661 msec   # of failures = 0
 N =    128 L =    160 T =    17.699331 +/-      .052163 msec   # of failures = 0
 N =    256 L =     40 T =   225.097700 +/-      .333980 msec   # of failures = 0
 N =    512 L =     10 T =  2193.822700 +/-    10.421862 msec   # of failures = 0

Real LinearSolve[DGELSD@LAPACK-3.2.2]
 N =      2 L =  32768 T =      .019176 +/-      .071623 msec   # of failures = 0
 N =      4 L =  32768 T =      .028825 +/-      .042434 msec   # of failures = 0
 N =      8 L =  32768 T =      .056619 +/-      .041981 msec   # of failures = 0
 N =     16 L =  10240 T =      .182530 +/-      .009877 msec   # of failures = 0
 N =     32 L =   2560 T =      .755317 +/-      .017439 msec   # of failures = 0
 N =     64 L =    640 T =     4.405711 +/-      .174577 msec   # of failures = 0
 N =    128 L =    160 T =    28.793131 +/-     1.137125 msec   # of failures = 0
 N =    256 L =     40 T =   215.818675 +/-      .645456 msec   # of failures = 0
 N =    512 L =     10 T =  1561.887900 +/-     2.827443 msec   # of failures = 0

Real LinearSolve[DGELSD@ATLAS-3.8.3]
 N =      2 L =  32768 T =      .021839 +/-      .082602 msec   # of failures = 0
 N =      4 L =  32768 T =      .037695 +/-      .081893 msec   # of failures = 0
 N =      8 L =  32768 T =      .073539 +/-      .072191 msec   # of failures = 0
 N =     16 L =  10240 T =      .203711 +/-      .107180 msec   # of failures = 0
 N =     32 L =   2560 T =      .657891 +/-      .143076 msec   # of failures = 0
 N =     64 L =    640 T =     2.880484 +/-      .055372 msec   # of failures = 0
 N =    128 L =    160 T =    15.440688 +/-      .074582 msec   # of failures = 0
 N =    256 L =     40 T =    95.026875 +/-     3.481534 msec   # of failures = 0
 N =    512 L =     10 T =   579.009100 +/-      .788075 msec   # of failures = 0

---------------------------------------------------------------------------------
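
In the tables above the LAPACK-based routines are slower for small matrices and faster for large ones. A minimal sketch for finding the size where the switch happens is shown below; the function name is hypothetical, and the values are the mean times (msec) copied from the Xeon Real LinearSolve tables for tsolvm and DGELSD@ATLAS-3.8.3.

```python
def crossover_size(builtin, lapack):
    """Smallest measured N from which lapack is faster for all larger N, else None."""
    sizes = sorted(set(builtin) & set(lapack))
    result = None
    # Walk down from the largest size and stop at the first size where lapack loses.
    for n in reversed(sizes):
        if lapack[n] < builtin[n]:
            result = n
        else:
            break
    return result


# Mean times in msec, copied from Real LinearSolve[tsolvm] on the Xeon.
TSOLVM = {
    2: 0.009529, 4: 0.010196, 8: 0.014949, 16: 0.037556, 32: 0.160285,
    64: 1.574538, 128: 17.699331, 256: 225.097700, 512: 2193.822700,
}

# Mean times in msec, copied from Real LinearSolve[DGELSD@ATLAS-3.8.3] on the Xeon.
DGELSD_ATLAS = {
    2: 0.021839, 4: 0.037695, 8: 0.073539, 16: 0.203711, 32: 0.657891,
    64: 2.880484, 128: 15.440688, 256: 95.026875, 512: 579.009100,
}

if __name__ == "__main__":
    print("ATLAS is faster from N =", crossover_size(TSOLVM, DGELSD_ATLAS))
```

With these numbers the crossover is at N=128. Only powers of two were measured, so the true crossover lies somewhere between 64 and 128.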

Subject: Re^2: Speeding up the linear algebra routines
Article No. 861
Date: 2010/11/05(Fri) 21:12:39
Contributor: Akio Morita
The QR decomposition performance of ATLAS-3.8.3 on the Opteron was rather disappointing,
but there was a note that LAPACK QR tuning for AMD K10 (Opteron) was done in the
ATLAS-3.9.x series, so I also ran the benchmark with ATLAS-3.9.11.

 * It does not match GotoBLAS, but QR is considerably improved compared with ATLAS-3.8.3
 * The degradation of Complex SingularValues at N=8 probably requires verifying the test vectors
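
The N=8 degradation can be picked out mechanically, since the time per call should normally grow with N. The following is a minimal sketch that flags sizes whose mean time exceeds that of the next larger size. The function name is hypothetical, and the values are the mean times (msec) copied from the Opteron Complex SingularValues[ZGESDD@LAPACK-3.2.2] table in Article No. 859. It only locates the anomaly and says nothing about its cause.

```python
def non_monotonic_sizes(times):
    """Return the sizes N whose mean time is larger than that of the next measured N."""
    sizes = sorted(times)
    return [n for n, nxt in zip(sizes, sizes[1:]) if times[n] > times[nxt]]


# Mean times in msec, copied from Complex SingularValues[ZGESDD@LAPACK-3.2.2] (Opteron).
ZGESDD_LAPACK = {
    2: 0.027102, 4: 0.044584, 8: 6.774005, 16: 0.336427, 32: 1.376423,
    64: 8.209416, 128: 49.707750, 256: 377.112475, 512: 2763.281800,
}

if __name__ == "__main__":
    print("suspicious sizes:", non_monotonic_sizes(ZGESDD_LAPACK))
```

For this table only N=8 is flagged, which matches the rows marked (*slow) in the results.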

SAD:amorita branch r3415
Module:Math/LAPACK extension r3441
OS:FreeBSD/amd64 8.1-STABLE
CPU:Quad-Core AMD Opteron(tm) Processor 2376 (2300.11-MHz K8-class CPU)
Date:2010/11/05

---------------------------------------------------------------------------------
Real Eigensystem[teigen]
 N =      2 L =  32768 T =      .008237 +/-      .140829 msec   # of failures = 0
 N =      4 L =  32768 T =      .014519 +/-      .062383 msec   # of failures = 0
  TEIGEN convergence failed. Range =           2           5
         Lower right corner =  0.52571493004973358       4.63855639037369441E-002  0.73160999814810090     
 N =      8 L =  32768 T =      .048708 +/-      .144691 msec   # of failures = 0
 N =     16 L =  10240 T =      .216757 +/-      .132459 msec   # of failures = 0
 N =     32 L =   2560 T =     1.159240 +/-      .257751 msec   # of failures = 0
 N =     64 L =    640 T =     6.664422 +/-      .284198 msec   # of failures = 0
 N =    128 L =    160 T =    54.954325 +/-     1.280301 msec   # of failures = 0
 N =    256 L =     40 T =   474.934650 +/-     9.550046 msec   # of failures = 0
 N =    512 L =     10 T =  4954.452600 +/-    51.146981 msec   # of failures = 0

Real Eigensystem[DGEEVX@LAPACK-3.2.2]
 N =      2 L =  32768 T =      .013218 +/-      .157030 msec   # of failures = 0
 N =      4 L =  32768 T =      .024014 +/-      .067510 msec   # of failures = 0
 N =      8 L =  32768 T =      .070447 +/-      .110767 msec   # of failures = 0
 N =     16 L =  10240 T =      .280239 +/-      .204191 msec   # of failures = 0
 N =     32 L =   2560 T =     1.517591 +/-      .306330 msec   # of failures = 0
 N =     64 L =    640 T =     8.358041 +/-      .380745 msec   # of failures = 0
 N =    128 L =    160 T =    94.711063 +/-     8.258527 msec   # of failures = 0
 N =    256 L =     40 T =   684.155500 +/-    10.204925 msec   # of failures = 0
 N =    512 L =     10 T =  3075.963100 +/-    45.029070 msec   # of failures = 0

Real Eigensystem[DGEEVX@ATLAS-3.8.3]
 N =      2 L =  32768 T =      .014668 +/-      .160511 msec   # of failures = 0
 N =      4 L =  32768 T =      .027324 +/-      .072267 msec   # of failures = 0
 N =      8 L =  32768 T =      .075819 +/-      .112824 msec   # of failures = 0
 N =     16 L =  10240 T =      .498010 +/-      .199880 msec   # of failures = 0
 N =     32 L =   2560 T =     2.269072 +/-      .379338 msec   # of failures = 0
 N =     64 L =    640 T =    10.689588 +/-     1.262235 msec   # of failures = 0
 N =    128 L =    160 T =    95.835031 +/-     8.315531 msec   # of failures = 0
 N =    256 L =     40 T =   681.952975 +/-    16.886181 msec   # of failures = 0
 N =    512 L =     10 T =  4109.489400 +/-   127.084073 msec   # of failures = 0

Real Eigensystem[DGEEVX@ATLAS-3.9.11]
 N =      2 L =  32768 T =      .014025 +/-      .149389 msec   # of failures = 0
 N =      4 L =  32768 T =      .026715 +/-      .100128 msec   # of failures = 0
 N =      8 L =  32768 T =      .074533 +/-      .113714 msec   # of failures = 0
 N =     16 L =  10240 T =      .287458 +/-      .191506 msec   # of failures = 0
 N =     32 L =   2560 T =     1.484938 +/-      .415549 msec   # of failures = 0
 N =     64 L =    640 T =     7.842805 +/-      .391136 msec   # of failures = 0
 N =    128 L =    160 T =    93.693800 +/-     8.725549 msec   # of failures = 0
 N =    256 L =     40 T =   479.481400 +/-     8.149593 msec   # of failures = 0
 N =    512 L =     10 T =  1646.219700 +/-    31.246515 msec   # of failures = 0

Real Eigensystem[DGEEVX@GotoBLAS-2.1.13]
 N =      2 L =  32768 T =      .014868 +/-      .167066 msec   # of failures = 0
 N =      4 L =  32768 T =      .026354 +/-      .093779 msec   # of failures = 0
 N =      8 L =  32768 T =      .070111 +/-      .104149 msec   # of failures = 0
 N =     16 L =  10240 T =      .256042 +/-      .191495 msec   # of failures = 0
 N =     32 L =   2560 T =     1.271190 +/-      .221631 msec   # of failures = 0
 N =     64 L =    640 T =     6.846203 +/-      .493545 msec   # of failures = 0
 N =    128 L =    160 T =    80.511619 +/-     8.542452 msec   # of failures = 0
 N =    256 L =     40 T =   401.314400 +/-     9.575067 msec   # of failures = 0
 N =    512 L =     10 T =  1388.119100 +/-    25.401272 msec   # of failures = 0

---------------------------------------------------------------------------------
Complex Eigensystem[tceigen]
 N =      2 L =  32768 T =      .008042 +/-      .126553 msec   # of failures = 0
 N =      4 L =  32768 T =      .021977 +/-      .058947 msec   # of failures = 0
 N =      8 L =  32768 T =      .084895 +/-      .108231 msec   # of failures = 0
 N =     16 L =  10240 T =      .414323 +/-      .124297 msec   # of failures = 0
 N =     32 L =   2560 T =     2.466422 +/-      .159884 msec   # of failures = 0
 TCEIGEN Convergence fail.
 N =     64 L =    640 T =    17.269752 +/-      .420907 msec   # of failures = 0
 N =    128 L =    160 T =   151.543269 +/-     2.482483 msec   # of failures = 0
 N =    256 L =     40 T =  1536.990525 +/-    18.398861 msec   # of failures = 0
 N =    512 L =     10 T = 13866.533400 +/-   159.996810 msec   # of failures = 0

Complex Eigensystem[ZGEEVX@LAPACK-3.2.2]
 N =      2 L =  32768 T =      .017757 +/-      .059012 msec   # of failures = 0
 N =      4 L =  32768 T =      .040167 +/-      .092559 msec   # of failures = 0
 N =      8 L =  32768 T =      .119518 +/-      .072383 msec   # of failures = 0
 N =     16 L =  10240 T =      .507698 +/-      .105572 msec   # of failures = 0
 N =     32 L =   2560 T =     2.669335 +/-      .141993 msec   # of failures = 0
 N =     64 L =    640 T =    16.661434 +/-      .290309 msec   # of failures = 0
 N =    128 L =    160 T =   123.347500 +/-     1.494198 msec   # of failures = 0
 N =    256 L =     40 T =  1033.310050 +/-    13.795092 msec   # of failures = 0
 N =    512 L =     10 T =  6427.120600 +/-    66.541751 msec   # of failures = 0

Complex Eigensystem[ZGEEVX@ATLAS-3.8.3]
 N =      2 L =  32768 T =      .020239 +/-      .164884 msec   # of failures = 0
 N =      4 L =  32768 T =      .049169 +/-      .102389 msec   # of failures = 0
 N =      8 L =  32768 T =      .137168 +/-      .097606 msec   # of failures = 0
 N =     16 L =  10240 T =      .675185 +/-      .113269 msec   # of failures = 0
 N =     32 L =   2560 T =     3.237613 +/-      .222750 msec   # of failures = 0
 N =     64 L =    640 T =    19.622386 +/-      .812437 msec   # of failures = 0
 N =    128 L =    160 T =   143.707500 +/-     2.858266 msec   # of failures = 0
 N =    256 L =     40 T =  1256.908150 +/-    20.315763 msec   # of failures = 0
 N =    512 L =     10 T = 11165.999600 +/-   172.824775 msec   # of failures = 0

Complex Eigensystem[ZGEEVX@ATLAS-3.9.11]
 N =      2 L =  32768 T =      .022342 +/-      .169591 msec   # of failures = 0
 N =      4 L =  32768 T =      .046625 +/-      .106590 msec   # of failures = 0
 N =      8 L =  32768 T =      .129303 +/-      .108754 msec   # of failures = 0
 N =     16 L =  10240 T =      .515950 +/-      .122293 msec   # of failures = 0
 N =     32 L =   2560 T =     2.585764 +/-      .121874 msec   # of failures = 0
 N =     64 L =    640 T =    15.882036 +/-      .293558 msec   # of failures = 0
 N =    128 L =    160 T =   117.267269 +/-     1.404716 msec   # of failures = 0
 N =    256 L =     40 T =   700.721650 +/-     8.036642 msec   # of failures = 0
 N =    512 L =     10 T =  3687.228800 +/-    51.588553 msec   # of failures = 0

Complex Eigensystem[ZGEEVX@GotoBLAS-2.1.13]
 N =      2 L =  32768 T =      .019002 +/-      .153284 msec   # of failures = 0
 N =      4 L =  32768 T =      .043108 +/-      .073305 msec   # of failures = 0
 N =      8 L =  32768 T =      .114783 +/-      .068736 msec   # of failures = 0
 N =     16 L =  10240 T =      .560256 +/-      .131265 msec   # of failures = 0
 N =     32 L =   2560 T =     2.694456 +/-      .128608 msec   # of failures = 0
 N =     64 L =    640 T =    15.112442 +/-      .442129 msec   # of failures = 0
 N =    128 L =    160 T =    91.809056 +/-     1.331968 msec   # of failures = 0
 N =    256 L =     40 T =   424.419800 +/-     4.313518 msec   # of failures = 0
 N =    512 L =     10 T =  2694.422600 +/-    48.715928 msec   # of failures = 0

---------------------------------------------------------------------------------
Real SingularValues[tsvdm]
 N =      2 L =  32768 T =      .011872 +/-      .135057 msec   # of failures = 0
 N =      4 L =  32768 T =      .017514 +/-      .056119 msec   # of failures = 0
 N =      8 L =  32768 T =      .033522 +/-      .049175 msec   # of failures = 0
 N =     16 L =  10240 T =      .112781 +/-      .078994 msec   # of failures = 0
 N =     32 L =   2560 T =      .513979 +/-      .205664 msec   # of failures = 0
 N =     64 L =    640 T =     3.567764 +/-      .212971 msec   # of failures = 0
 N =    128 L =    160 T =    49.825387 +/-      .410513 msec   # of failures = 0
 N =    256 L =     40 T =   561.219125 +/-     3.144857 msec   # of failures = 0
 N =    512 L =     10 T = 10544.715000 +/-    56.068778 msec   # of failures = 0

Real SingularValues[DGESDD@LAPACK-3.2.2]
 N =      2 L =  32768 T =      .022678 +/-      .187658 msec   # of failures = 0
 N =      4 L =  32768 T =      .032596 +/-      .060802 msec   # of failures = 0
 N =      8 L =  32768 T =      .074625 +/-      .068605 msec   # of failures = 0
 N =     16 L =  10240 T =      .250749 +/-      .085152 msec   # of failures = 0
 N =     32 L =   2560 T =      .920881 +/-      .073305 msec   # of failures = 0
 N =     64 L =    640 T =     4.754948 +/-      .094989 msec   # of failures = 0
 N =    128 L =    160 T =    26.753725 +/-      .139137 msec   # of failures = 0
 N =    256 L =     40 T =   187.731250 +/-      .537708 msec   # of failures = 0
 N =    512 L =     10 T =  1384.883600 +/-     3.334391 msec   # of failures = 0

Real SingularValues[DGESDD@ATLAS-3.8.3]
 N =      2 L =  32768 T =      .019903 +/-      .171499 msec   # of failures = 0
 N =      4 L =  32768 T =      .038882 +/-      .123298 msec   # of failures = 0
 N =      8 L =  32768 T =      .087390 +/-      .083651 msec   # of failures = 0
 N =     16 L =  10240 T =      .271005 +/-      .132160 msec   # of failures = 0
 N =     32 L =   2560 T =      .779123 +/-      .155682 msec   # of failures = 0
 N =     64 L =    640 T =     2.982009 +/-      .240249 msec   # of failures = 0
 N =    128 L =    160 T =    14.705319 +/-      .116347 msec   # of failures = 0
 N =    256 L =     40 T =    76.330825 +/-      .542039 msec   # of failures = 0
 N =    512 L =     10 T =   458.911900 +/-      .526803 msec   # of failures = 0

Real SingularValues[DGESDD@ATLAS-3.9.11]
 N =      2 L =  32768 T =      .020868 +/-      .157107 msec   # of failures = 0
 N =      4 L =  32768 T =      .035345 +/-      .122849 msec   # of failures = 0
 N =      8 L =  32768 T =      .080285 +/-      .098572 msec   # of failures = 0
 N =     16 L =  10240 T =      .273210 +/-      .061332 msec   # of failures = 0
 N =     32 L =   2560 T =      .783836 +/-      .080288 msec   # of failures = 0
 N =     64 L =    640 T =     3.235420 +/-      .120992 msec   # of failures = 0
 N =    128 L =    160 T =    14.952244 +/-      .117878 msec   # of failures = 0
 N =    256 L =     40 T =    73.638775 +/-      .300941 msec   # of failures = 0
 N =    512 L =     10 T =   461.864400 +/-      .574151 msec   # of failures = 0

Real SingularValues[DGESDD@GotoBLAS-2.1.13]
 N =      2 L =  32768 T =      .021825 +/-      .177369 msec   # of failures = 0
 N =      4 L =  32768 T =      .032880 +/-      .081391 msec   # of failures = 0
 N =      8 L =  32768 T =      .069007 +/-      .088089 msec   # of failures = 0
 N =     16 L =  10240 T =      .195701 +/-      .052913 msec   # of failures = 0
 N =     32 L =   2560 T =      .583645 +/-      .071883 msec   # of failures = 0
 N =     64 L =    640 T =     2.153711 +/-      .180950 msec   # of failures = 0
 N =    128 L =    160 T =     9.765350 +/-      .120727 msec   # of failures = 0
 N =    256 L =     40 T =    56.321225 +/-      .302002 msec   # of failures = 0
 N =    512 L =     10 T =   378.101700 +/-      .426204 msec   # of failures = 0

---------------------------------------------------------------------------------
Complex SingularValues[tcsvdm]
 N =      2 L =  32768 T =      .013533 +/-      .147806 msec   # of failures = 0
 N =      4 L =  32768 T =      .024613 +/-      .077979 msec   # of failures = 0
 N =      8 L =  32768 T =      .403952 +/-      .307302 msec   # of failures = 0
 N =     16 L =  10240 T =      .246291 +/-      .131441 msec   # of failures = 0
 N =     32 L =   2560 T =     1.311957 +/-      .081337 msec   # of failures = 0
 N =     64 L =    640 T =    10.824608 +/-     1.309137 msec   # of failures = 0
 N =    128 L =    160 T =   103.553988 +/-     1.870567 msec   # of failures = 0
 N =    256 L =     40 T =  1062.854400 +/-     5.127493 msec   # of failures = 0
 N =    512 L =     10 T = 17645.269400 +/-   881.390705 msec   # of failures = 0

Complex SingularValues[ZGESDD@LAPACK-3.2.2]
 N =      2 L =  32768 T =      .027102 +/-      .206568 msec   # of failures = 0
 N =      4 L =  32768 T =      .044584 +/-      .175870 msec   # of failures = 0
 N =      8 L =  32768 T =     6.774005 +/-     4.371503 msec   # of failures = 0(*slow)
 N =     16 L =  10240 T =      .336427 +/-      .025955 msec   # of failures = 0
 N =     32 L =   2560 T =     1.376423 +/-      .152718 msec   # of failures = 0
 N =     64 L =    640 T =     8.209416 +/-      .262218 msec   # of failures = 0
 N =    128 L =    160 T =    49.707750 +/-      .226245 msec   # of failures = 0
 N =    256 L =     40 T =   377.112475 +/-     1.425033 msec   # of failures = 0
 N =    512 L =     10 T =  2763.281800 +/-    12.012295 msec   # of failures = 0

Complex SingularValues[ZGESDD@ATLAS-3.8.3]
 N =      2 L =  32768 T =      .031600 +/-      .221910 msec   # of failures = 0
 N =      4 L =  32768 T =      .055400 +/-      .171011 msec   # of failures = 0
 N =      8 L =  32768 T =     7.043678 +/-     4.634607 msec   # of failures = 0(*slow)
 N =     16 L =  10240 T =      .372316 +/-      .162209 msec   # of failures = 0
 N =     32 L =   2560 T =     1.320179 +/-      .055972 msec   # of failures = 0
 N =     64 L =    640 T =     6.073683 +/-      .232270 msec   # of failures = 0
 N =    128 L =    160 T =    35.523719 +/-      .844582 msec   # of failures = 0
 N =    256 L =     40 T =   249.999800 +/-    10.159306 msec   # of failures = 0
 N =    512 L =     10 T =  1616.151700 +/-    35.023459 msec   # of failures = 0

Complex SingularValues[ZGESDD@ATLAS-3.9.11]
 N =      2 L =  32768 T =      .030354 +/-      .199105 msec   # of failures = 0
 N =      4 L =  32768 T =      .047115 +/-      .050325 msec   # of failures = 0
 N =      8 L =  32768 T =     6.618956 +/-     4.202718 msec   # of failures = 0(*slow)
 N =     16 L =  10240 T =      .332314 +/-      .104221 msec   # of failures = 0
 N =     32 L =   2560 T =     1.189729 +/-      .021061 msec   # of failures = 0
 N =     64 L =    640 T =     6.267263 +/-      .078448 msec   # of failures = 0
 N =    128 L =    160 T =    33.797069 +/-      .236447 msec   # of failures = 0
 N =    256 L =     40 T =   219.956975 +/-     1.235945 msec   # of failures = 0
 N =    512 L =     10 T =  1454.313100 +/-     6.742284 msec   # of failures = 0

Complex SingularValues[ZGESDD@GotoBLAS-2.1.13]
 N =      2 L =  32768 T =      .027644 +/-      .188919 msec   # of failures = 0
 N =      4 L =  32768 T =      .042908 +/-      .069401 msec   # of failures = 0
 N =      8 L =  32768 T =     6.596583 +/-     4.197659 msec   # of failures = 0(*slow)
 N =     16 L =  10240 T =      .259719 +/-      .014824 msec   # of failures = 0
 N =     32 L =   2560 T =      .879463 +/-      .223959 msec   # of failures = 0
 N =     64 L =    640 T =     3.943514 +/-      .047710 msec   # of failures = 0
 N =    128 L =    160 T =    20.680231 +/-      .259194 msec   # of failures = 0
 N =    256 L =     40 T =   134.837150 +/-      .787147 msec   # of failures = 0
 N =    512 L =     10 T =   884.559200 +/-    11.894268 msec   # of failures = 0
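
The N=8 rows in the Complex SingularValues blocks stand out because they take longer than the N=16 rows. A simple way to spot such rows mechanically is to flag any size whose time exceeds that of the next larger size. This is a heuristic sketch, not part of the benchmark itself: the function name `find_timing_anomalies` is hypothetical, and it assumes the time should not decrease as N grows.

```python
def find_timing_anomalies(rows):
    """Return every N whose time is larger than the time of the next larger N.

    rows: list of (N, T_msec) pairs sorted by increasing N.
    """
    return [n for (n, t), (_, t_next) in zip(rows, rows[1:]) if t > t_next]


# (N, T in msec) copied from the Complex SingularValues[ZGESDD@GotoBLAS-2.1.13] block.
goto_zgesdd = [
    (2, 0.027644), (4, 0.042908), (8, 6.596583), (16, 0.259719),
    (32, 0.879463), (64, 3.943514), (128, 20.680231),
    (256, 134.837150), (512, 884.559200),
]

print(find_timing_anomalies(goto_zgesdd))  # prints [8]
```

The same check applied to the tcsvdm block also flags only N=8, which fits the remark that the test vector for that size deserves a closer look.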

---------------------------------------------------------------------------------
Real LinearSolve[tsolvm]
 N =      2 L =  32768 T =      .012932 +/-      .162209 msec   # of failures = 0
 N =      4 L =  32768 T =      .012732 +/-      .047418 msec   # of failures = 0
 N =      8 L =  32768 T =      .020592 +/-      .047073 msec   # of failures = 0
 N =     16 L =  10240 T =      .059602 +/-      .025690 msec   # of failures = 0
 N =     32 L =   2560 T =      .246922 +/-      .049520 msec   # of failures = 0
 N =     64 L =    640 T =     1.590381 +/-      .116929 msec   # of failures = 0
 N =    128 L =    160 T =    41.466450 +/-      .115509 msec   # of failures = 0
 N =    256 L =     40 T =   404.196600 +/-      .347976 msec   # of failures = 0
 N =    512 L =     10 T =  4553.732400 +/-     6.132115 msec   # of failures = 0

Real LinearSolve[DGELSD@LAPACK-3.2.2]
 N =      2 L =  32768 T =      .020902 +/-      .171563 msec   # of failures = 0
 N =      4 L =  32768 T =      .033962 +/-      .072665 msec   # of failures = 0
 N =      8 L =  32768 T =      .079500 +/-      .042726 msec   # of failures = 0
 N =     16 L =  10240 T =      .280457 +/-      .112379 msec   # of failures = 0
 N =     32 L =   2560 T =     1.144411 +/-      .081641 msec   # of failures = 0
 N =     64 L =    640 T =     6.406339 +/-      .079681 msec   # of failures = 0
 N =    128 L =    160 T =    40.585156 +/-      .134758 msec   # of failures = 0
 N =    256 L =     40 T =   310.538725 +/-      .785799 msec   # of failures = 0
 N =    512 L =     10 T =  2862.740200 +/-    12.274714 msec   # of failures = 0

Real LinearSolve[DGELSD@ATLAS-3.8.3]
 N =      2 L =  32768 T =      .024493 +/-      .190497 msec   # of failures = 0
 N =      4 L =  32768 T =      .040578 +/-      .101808 msec   # of failures = 0
 N =      8 L =  32768 T =      .093644 +/-      .114790 msec   # of failures = 0
 N =     16 L =  10240 T =      .283796 +/-      .136987 msec   # of failures = 0
 N =     32 L =   2560 T =      .897974 +/-      .101839 msec   # of failures = 0
 N =     64 L =    640 T =     3.688769 +/-      .129619 msec   # of failures = 0
 N =    128 L =    160 T =    22.626219 +/-      .179070 msec   # of failures = 0
 N =    256 L =     40 T =   134.290775 +/-      .545878 msec   # of failures = 0
 N =    512 L =     10 T =  1196.390700 +/-    41.841181 msec   # of failures = 0

Real LinearSolve[DGELSD@ATLAS-3.9.11]
 N =      2 L =  32768 T =      .024281 +/-      .195702 msec   # of failures = 0
 N =      4 L =  32768 T =      .037526 +/-      .113346 msec   # of failures = 0
 N =      8 L =  32768 T =      .085461 +/-      .083935 msec   # of failures = 0
 N =     16 L =  10240 T =      .272406 +/-      .061417 msec   # of failures = 0
 N =     32 L =   2560 T =      .887479 +/-      .084327 msec   # of failures = 0
 N =     64 L =    640 T =     3.960634 +/-      .130929 msec   # of failures = 0
 N =    128 L =    160 T =    22.656294 +/-      .118033 msec   # of failures = 0
 N =    256 L =     40 T =   132.068025 +/-      .431280 msec   # of failures = 0
 N =    512 L =     10 T =  1235.568900 +/-     3.348272 msec   # of failures = 0

Real LinearSolve[DGELSD@GotoBLAS-2.1.13]
 N =      2 L =  32768 T =      .023342 +/-      .195374 msec   # of failures = 0
 N =      4 L =  32768 T =      .034271 +/-      .082321 msec   # of failures = 0
 N =      8 L =  32768 T =      .072005 +/-      .105429 msec   # of failures = 0
 N =     16 L =  10240 T =      .205889 +/-      .056656 msec   # of failures = 0
 N =     32 L =   2560 T =      .681020 +/-      .155405 msec   # of failures = 0
 N =     64 L =    640 T =     2.651733 +/-      .128700 msec   # of failures = 0
 N =    128 L =    160 T =    16.469419 +/-      .132502 msec   # of failures = 0
 N =    256 L =     40 T =   115.589175 +/-      .481383 msec   # of failures = 0
 N =    512 L =     10 T =  1019.370300 +/-     1.474057 msec   # of failures = 0

---------------------------------------------------------------------------------

SubjectRe: Speeding up the linear algebra routines
Article No881
Date: 2011/01/28(Fri) 13:43:19
ContributorAkio Morita
Changes required to run on the MAIN trunk

* itfcm2l API update
- amorita branch r3383 & r3415

* itfri2l API introduction
- amorita branch r3366

* Replace the use TFCODE statement in tflapack.f with include 'inc/TFCODE.inc'

* Add #include <stdio.h> to tfLAPACK_.c

Changes required to run the bundled test code

* Library class API update
- amorita branch r3010

SubjectRe^2: Speeding up the linear algebra routines
Article No912
Date: 2011/05/16(Mon) 15:10:17
ContributorAkio Morita
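Apart from the modularization of TFCODE.inc, the changes required for the build have been backported as of 1.0.10.4.14a2

> * Replace the use TFCODE statement in tflapack.f with include 'inc/TFCODE.inc'
>
This fix alone is enough to make it work (confirmed with LAPACK-snapshot-3503.tar.gz)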
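
The single substitution quoted above could be applied with a small script. This is only a sketch under assumptions: the function name `use_to_include` is hypothetical, the statement is assumed to sit on a line of its own as `use TFCODE` (in any letter case), and the sample source is made up for illustration. Whether the include line can stay at exactly the same position depends on the contents of inc/TFCODE.inc and on the surrounding declarations, which this thread does not show.

```python
import re

# Assumption: the statement stands alone on its line; leading blanks are
# preserved so that fixed-form column alignment is kept.
USE_TFCODE = re.compile(r"^([ \t]*)use[ \t]+TFCODE[ \t]*$",
                        re.IGNORECASE | re.MULTILINE)


def use_to_include(source):
    """Replace each 'use TFCODE' line with an include of inc/TFCODE.inc."""
    return USE_TFCODE.sub(r"\g<1>include 'inc/TFCODE.inc'", source)


# Made-up fixed-form fragment, not taken from tflapack.f.
sample = (
    "      subroutine demo\n"
    "      use TFCODE\n"
    "      integer n\n"
    "      end\n"
)

print(use_to_include(sample))
```

Lines that do not match are left untouched, so running the function twice has no further effect.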