Subject: Parallel module fault on a 64 bit 4-core machine
Article No: 914
Date: 2011/05/26(Thu) 01:38:19
Contributor: Zhe Duan
Hello,
I'm attempting to use the parallel module of the SAD main trunk on a 64-bit 4-core SLES10 machine. However, executing "test-scale.sad" resulted in a segmentation fault. The error output is below:

(* Care to activate CPUs on power managed system *)
Map[func, l];

BenchScale[]; Exit[];
Library[Algorism/Parallel/Fork] from /home/duanz/SAD/share/Extension/Algorism/Parallel/Fork.n is loaded.
???General::abort: Aborted:
BenchScale[]
^
???-FFS-Error-?Undefined command or element: BENCHSCALE[]

! End of File
In[1]:= ^C


I tried it on another machine, a 32-bit 8-core SLC5.5 system, and it worked properly. So I wonder whether the fault is related to the 64-bit architecture?

Thanks a lot!


Zhe, Duan

Subject: Re: Parallel module fault on a 64 bit 4-core machine
Article No: 915
Date: 2011/05/26(Thu) 12:47:36
Contributor: Akio Morita
> Hello,
> I'm attempting to use the parallel module of the SAD main trunk on a 64-bit 4-core SLES10 machine. However, executing "test-scale.sad" resulted in a segmentation fault. The error output is below:
>
> (* Care to activate CPUs on power managed system *)
> Map[func, l];
>
> BenchScale[]; Exit[];
> Library[Algorism/Parallel/Fork] from /home/duanz/SAD/share/Extension/Algorism/Parallel/Fork.n is loaded.
> ???General::abort: Aborted:
> BenchScale[]
> ^
> ???-FFS-Error-?Undefined command or element: BENCHSCALE[]
>
> ! End of File
> In[1]:= ^C
>
>
> I tried it on another machine, a 32-bit 8-core SLC5.5 system, and it worked properly. So I wonder whether the fault is related to the 64-bit architecture?
>
The fork(2)-based parallel algorithm uses anonymous shared memory as its inter-process communication channel.
This shared memory is created via the mmap(2) system call, which returns a `pointer'.
In SAD code, a `pointer' is handled as an index number into a double array and is stored in a signed integer.
Thus, SAD code CAN only handle about 16GiB of virtual memory space, in the lower region around rlist(1).
(Negative index numbers are defined as invalid.)

If mmap(2) returns an address in higher VM space, the SAD shared memory code causes a fatal error
(e.g. a segmentation fault).
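
As a rough illustration of where a 16GiB figure can come from, here is a minimal Python sketch. It assumes 8-byte doubles and a 32-bit signed index, and it treats the address of rlist(1) as the base of the array. Those widths and the names DOUBLE_BYTES, INDEX_BITS, reachable_bytes and to_index are assumptions made for illustration only; they are not taken from the SAD source.

```python
# Sketch only: the widths below are assumptions, not values read from SAD.
DOUBLE_BYTES = 8    # assumed size of one double-array element
INDEX_BITS = 32     # assumed width of the signed integer holding the index

MAX_INDEX = 2 ** (INDEX_BITS - 1) - 1  # largest non-negative signed value


def reachable_bytes():
    """Bytes addressable with non-negative indices 0..MAX_INDEX."""
    return (MAX_INDEX + 1) * DOUBLE_BYTES


def to_index(addr, base):
    """Map a byte address to a double-array index relative to base.

    Returns None when the address is not aligned to an element or when
    the index would be negative or would not fit in the signed integer.
    """
    offset = addr - base
    if offset % DOUBLE_BYTES != 0:
        return None
    index = offset // DOUBLE_BYTES
    if index < 0 or index > MAX_INDEX:
        return None
    return index
```

Under these assumptions reachable_bytes() is 2^31 * 8 bytes, i.e. 16GiB, and any address that mmap(2) returns beyond that distance from the base has no valid index.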

Solution (fix the SAD code)
* Rewrite the memory handling code and internal data structures.

Workaround (tuning/modifying the operating system)
* Limit the VM space to 16GiB
* Modify mmap(2) to return addresses below the 16GiB boundary

Subject: Re^2: Parallel module fault on a 64 bit 4-core machine
Article No: 916
Date: 2011/05/26(Thu) 13:04:15
Contributor: Akio Morita
Example of a workaround
* SAD on FreeBSD/amd64
The Parallel Extension Module (Shared API family) works when the maximum DSIZ is limited.
Example:
Add the following line to /boot/loader.conf
kern.maxdsiz="8G"

Subject: Re^3: Parallel module fault on a 64 bit 4-core machine
Article No: 917
Date: 2011/05/27(Fri) 18:50:31
Contributor: Zhe Duan
For SUSE Linux, the kern.maxdsiz approach does not work. It seems that a modification of the system kernel is needed, since "ulimit -v" works well to limit VM space but "ulimit -d" has no effect on the mmap system call. I'm still searching for a solution.

I wonder whether fixing the SAD code, as you suggested, would be a lot of work, since I'm not familiar with the SAD kernel.

Thanks for your help!

Subject: Re^4: Parallel module fault on a 64 bit 4-core machine
Article No: 918
Date: 2011/05/27(Fri) 23:56:06
Contributor: Akio Morita
> For SUSE Linux, the kern.maxdsiz approach does not work. It seems that a modification of the system kernel is needed, since "ulimit -v" works well to limit VM space but "ulimit -d" has no effect on the mmap system call. I'm still searching for a solution.
>
ulimit controls `per user' resource limits.
It does not control the VM address map.
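
To see this on a given machine, a small Python sketch like the following can print the current resource limits next to the address where an anonymous mapping actually lands. It uses only the standard resource, mmap and ctypes modules; the helper names current_limits and anonymous_mapping_address are hypothetical, and the sketch only observes the address, it does not change where the kernel places mappings.

```python
import ctypes
import mmap
import resource


def current_limits():
    """Return the (soft, hard) limits that "ulimit -v" and "ulimit -d" set."""
    return {
        "RLIMIT_AS": resource.getrlimit(resource.RLIMIT_AS),
        "RLIMIT_DATA": resource.getrlimit(resource.RLIMIT_DATA),
    }


def anonymous_mapping_address(size=mmap.PAGESIZE):
    """Create an anonymous mapping and report its start address.

    The mapping is released when the local objects are garbage collected
    after the function returns.
    """
    m = mmap.mmap(-1, size)               # anonymous mapping via mmap(2)
    view = ctypes.c_char.from_buffer(m)   # borrow the buffer to read its address
    return ctypes.addressof(view)


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(name, "soft:", soft, "hard:", hard)
    print("anonymous mmap address:", hex(anonymous_mapping_address()))
```

Comparing the printed address with the 16GiB boundary shows whether the mapping would be reachable, independently of the limit values.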

FreeBSD's kern.maxdsiz knob changes a kernel parameter at boot time.
On Unix, the VM address map COULD be changed by modifying kernel parameters.
For example,
* Edit the kernel source macros and recompile the kernel

> I wonder whether fixing the SAD code, as you suggested, would be a lot of work, since I'm not familiar with the SAD kernel.
>
For full 64-bit support:
* Survey the undocumented data structures that keep pointers/indices
* Extend the data structures to keep native pointers (architecture dependent!)
* Test! Test! Test!
It MIGHT be possible, but it WOULD need a skillful programmer and many testers, I think.