I think we run jobs in the 900-node range daily with no problems, although
our average job size is smaller. I haven't looked in the logs, though, to
see whether we're having communication problems that are masked by retries
or something.
-- pete
On 9/12/07 12:14 AM, "Lennart Karlsson" <Lennart.Karlsson@xxxxxxxxxx> wrote:
> meo@xxxxxxxxxxxxxx said:
>> Peter Wyckoff said...
>>
>> |I'm wondering how big you've gotten maui and torque to scale, mostly
>> |interested in number of nodes?
>> |
>> |The docs say something like 1,000 but I think it scales well beyond that,
>> |no?
>>
>> That's what I've heard. Right now we're at about 300 nodes.
>
> Are you able to start a parallel job spanning all of these 300 nodes
> or is the mom-to-mom communication setup breaking down?
>
> We have problems starting jobs wider than about 100 nodes, because
> that many moms have difficulty synchronizing among themselves
> at startup.
>
> -- Lennart Karlsson <Lennart.Karlsson@xxxxxxxxxx>
> National Supercomputer Centre in Linkoping, Sweden
> http://www.nsc.liu.se
>
>