This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.
• NEC won tender 4.5PB over three deliveries
• 500 TB already installed (450 x 400GB 7k SATA) connected to 28 storage controllers RAID6

and from other notes on the same event:

- 500TB already installed -- 400GB 7k SATA drives and 50TB FC -- 140GB 15k
- RAID 6 SATA
- 2 servers at 2 gbit every 20TB
- Jan 2007 -- 300GB FC and 500GB SATA

Both sets of notes are about the same presentation, and I was quite amused: 450 drives (actually 400+140) over 28 RAID6s means over 16 per RAID6, which to me seems quite perverse.
• 2MW today, in near future .... 5.5 MW total into the building by end of year, by 2017 predict 64MW
:-)
orthogonal base for configuration space via environment variables, of which I use three:

HULL
    is for names of systems, which implies a certain hardware
    configuration, including for example which filesystems are
    available.
SITE
    is for names of locations, which usually imply different
    networking setups for servers.
NODE
    is for names of individual configurations, and is a recent
    addition, as I realized that several configurations may be needed
    on the same HULL at the same SITE, for example because of dual
    booting different GNU/Linux distributions, or virtual machines,
    or different roles.
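For concreteness (using the same example names that appear in the case statements below), on my laptop at home running its first GNU/Linux installation the three might be set like this:

  HULL=laptop1    # this particular machine, hence its hardware configuration
  SITE=home       # where it currently is, hence its network setup
  NODE=linux1     # which installed configuration/role is running
  export HULL SITE NODE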
These variables are then tested in case statements with pattern matching, something that the authors of most shell scripts I see seem to eschew in their quest for the most appalling style, as follows for example (many variants are possible, but a delimiter like the + below is necessary):
case "$SITE" in 'home+laptop1') : 'Whatever purely site specific';; ... esac case "$SITE+$HULL" in *'+laptop1') : 'Whatever purely hull specific';; ... esac case "$SITE+$HULL+$NODE" in 'home+laptop1+linux1') : 'Whatever for this specific situation';; *'+laptop1+linux1') : 'Whatever hull and node specific';; ... esacFor configuration files that are not shell scripts I use other variants on the idea:
/root/CONF/work+laptop1
might contain configuration files specific to that site and
hull. Then switching configuration can be as simple as
  cp -alf /root/CONF/work/.          /.
  cp -alf /root/CONF/work+laptop1/.  /.

(even if I actually use a slightly different scheme and script).
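A minimal sketch of such a switching script, assuming the /root/CONF layout above and that SITE and HULL are already set in the environment (my actual script differs, as mentioned):

  #!/bin/sh
  # Overlay the site-wide tree first, then the site+hull specific one,
  # hard linking so unchanged files stay shared with /root/CONF.
  test -d "/root/CONF/$SITE"       && cp -alf "/root/CONF/$SITE/."       /.
  test -d "/root/CONF/$SITE+$HULL" && cp -alf "/root/CONF/$SITE+$HULL/." /.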
Another variant is to use a make file to copy or generate the specialized configuration file, using the name of site, node or hull in the make variables and the filenames. For example:
  ${HOME}/.emacs: emacs-${SITE}.el; cp -p emacs-${SITE}.el '$@'

and sometimes I preprocess the files to be installed with cpp, which requires some slight trickery instead of a mere cp:
  ${HOME}/.Xresources: Xresources; cpp-dot < Xresources > .tmp && cp -p .tmp '$@'

where
.tmp
is used to prevent overwriting the
older target in case the preprocessing fails, and
cpp-dot
looks like:
  ENV=''

  CPP='gcc -E -x c'
  test -f /lib/cpp          && CPP=/lib/cpp
  test -f /usr/lib/cpp      && CPP=/usr/lib/cpp
  test -f /usr/ccs/lib/cpp  && CPP=/usr/ccs/lib/cpp
  test -f "$LOCAL/bin/cpp"  && CPP="$LOCAL/bin/cpp"

  for VAR in HOME SITE HULL NODE
  do
    eval VAL=\"\$"$VAR"\"
    ENV="$ENV -DEnv$VAR=$VAL"
  done

  exec $CPP $ENV ${1+"$@"} \
    | exec egrep -v '^[ ]*$|^[#!]' \
    | exec sed 's/^ *//;s/ *\%\% *//g;s/\^^/"/g'

In the above there is a special trick: the sequence
%%
is used where white space around it should
be deleted, as some variants of cpp
insert
white space around expanded macros.
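For illustration (a made-up fragment, not from my actual files): with SITE set to home, a template line like

  XTerm*iconName: xterm@ %% EnvSITE %%

should come out of cpp-dot as

  XTerm*iconName: xterm@home

with the %% markers and the white space around them stripped by the final sed.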
/etc/env-NODE, /etc/env-SITE and /etc/env-HULL, which typically read like this:
  #!/bin/sh
  export SITE
  : ${SITE:='home'}

and a generic one that evaluates those and can override them by injecting assignments from the kernel command line into the environment like this:
  #!/bin/sh

  for S in /etc/env-SITE /etc/env-HULL /etc/env-NODE
  do
    test -r "$S" && . "$S"
  done

  if test -r /proc/cmdline
  then
    # This should be: tr ' ' '\012' | while read P
    # but cannot be because then the 'while' is a subshell.
    for P in `cat /proc/cmdline`
    do
      if N="`expr \"$P\" : '\([A-Z_][A-Z_0-9]*\)='`"
      then
        export "$N"
        V="`expr \"$P\" : \"$N=\(.*\)\"`"
        eval "$N"="'$V'"
      fi
    done
  fi

Then the script above is
sourced at the beginning of the global profile script for users, or the rc scripts for init.
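For example (a minimal sketch; that the generic script above is saved as /etc/env is a name I am making up here), the global profile could start with:

  # Pick up SITE, HULL and NODE, plus any kernel command line overrides.
  test -r /etc/env && . /etc/env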
hdparm -T, and system CPU usage was around 11% for FireWire 400 and a still fairly low 28% for FireWire 800. Compare with the same figures for my USB2 test with much faster systems at lower data rates.
non-Fibre Channel for small-to-middle situations, which can add considerable flexibility to configurations.
  # dd bs=32k if=/dev/hda of=/dev/sdi
  777871+0 records in
  777871+0 records out
  25489276928 bytes (25 GB) copied, 903.105 seconds, 28.2 MB/s

and
vmstat 10
was reporting:
  procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
   r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
   1  3   4240  10208 676080  59140    0    0 25138 27612 2727 5682  2 19  0 78  0
   0  8   4240   9944 679664  58456    0    0 29947 27198 2960 6169  2 21  0 77  0
   2  2   4240   9608 670924  61236    0    0 25709 26650 2761 5985  8 19  0 73  0
   1  5   4240   9396 682068  60600    0    0 30333 28282 2928 6029  7 24  0 69  0
   0  5   4240   9800 686468  58512    0    0 27962 28345 2782 5709  2 20  0 78  0

which is pretty decent, even if I have seen people claiming to reach 35MB/s with other ATA-USB2 chipsets (I haven't checked yet which one is in this enclosure). There is, for both reading and writing, a 20% CPU use on a 3GHz Athlon 64. Just reading from the same external drive gives much the same rate with a 10% use of CPU:
  # dd bs=32k if=/dev/sdi of=/dev/null count=100000
  100000+0 records in
  100000+0 records out
  3276800000 bytes (3.3 GB) copied, 117.09 seconds, 28.0 MB/s

with
vmstat 10
reporting:
  procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
   r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
   0  1   4848   9952 574724  61264    0    0 27789     1 2318 4529  1 10  0 88  0
   1  1   4848   9652 575004  61240    0    0 27202     0 2273 4442  1 12  0 87  0
   1  1   4848  10320 574240  61260    0    0 27750     1 2363 4563  1  9  0 90  0
   0  1   4848   9532 575164  61208    0    0 27251     0 2275 4438  1 12  0 87  0
   2  1   4848   9892 576376  59660    0    0 27789     0 2316 4525  1  9  0 90  0
   0  1   4848  10132 576192  59588    0    0 27265     0 2274 4441  1 12  0 87  0

and neither transfer rate is disappointing. Overall a lot better than many others.
bus because of that.
Our survey of technical applications indicated that typical HPTC programs spend the majority of their time waiting for memory. Ratios in the range of 5-80 floating-point operations per cache miss to main memory were typical.

which is very well supported by my experiences (and Ben's, hi!
:->
) in optimizing game
code on various consoles, where more cache means better
performance, given that most programmers, even game programmers,
don't understand memory-friendly or microparallel algorithms.
By scaling to N communicating processes, we are able to spread the data movement task over N independent memory access streams. Scaling is, of course, limited by the cost of communication.

This is a bit too vague: the
cost of communication is ambiguous, as communication is rated in both bandwidth and latency, which can have very different cost profiles. For massively parallel algorithms latency probably matters more, but here it is indeed not clear whether the cost of bandwidth or the cost of latency has been targeted.
Our hardware design was guided by a simple idea: while traditional clusters are built upon processor designs that emphasize calculation speed, the SiCortex cluster architecture aims to balance the components of arithmetic, memory, and communications in a way that delivers maximum performance per dollar, watt, and square foot.

This balancing is surely wise, and reminds me of a very good point by Edsger Dijkstra about optimal page replacement algorithms (to be discussed some other time), where optimum utilization of one resource is not the goal, but cheapest utilization of all resources (that is, cost-weighting). Unfortunately shallow customers (which exist in the HPC market too) buy on raw performance benchmarks, and selling to the wise and discriminating restricts the potential market.
Our obsessive attention to low power resulted in a variety of performance and cost benefits. By holding down the heat generated by a node, we were able to put many nodes in a small volume. With nodes close together, we could build interconnect links that use electrical signals on copper PC board traces, driven by on-chip transistors instead of expensive external components. With short links, we could reduce electrical skew and use parallel links, giving higher bandwidth. And with a small, single-cabinet system we were able to use a single master clock, resulting in reduced synchronization delays. Our low-power design also has cascading benefits in reducing infrastructure costs such as building and air conditioning, and in reducing operational costs for electricity.

Well said, and especially for lower end applications power requirements can impact costs severely. At the higher end not many have the resources of Google, who can afford semi-custom PC designs and to build gigantic facilities where land and power are cheap.
The SiCortex node (Figure 3) is a six-way symmetric multiprocessor (SMP) with coherent caches, two interleaved memory interfaces, high speed I/O, and a programmable interface to the interconnect fabric.

The usual MIPS-style instruction set, and a fairly decent amount of cache considering the CPUs are packed six to a chip. It sounds not too unlike a Sony/IBM Cell design, MIPS rather than PowerPC based, and the double precision floating point is obviously targeted at scientific rather than gaming markets. It remains to be seen whether double precision performance is much slower than single precision floating point, as in similar designs. But the kicker is that:
The processors are based on a low power 64-bit MIPS® implementation. Each processor has its own 32 KB Level 1 instruction cache, a 32 KB Level 1 data cache, and a 256 KB segment of the Level 2 cache. The processor contains a 64-bit, floating-point pipeline and has a peak floating-point rate of 1 GFLOPs. The processor's six-stage pipeline provides in-order execution of up to two instructions per cycle.
This simple design dissipates less than one watt per processor core.

which suggests a power draw of under 6W per chip, which is fairly impressive. The 500MHz clock frequency however is far lower than the 3GHz clock of the Cell in the PS3, which has a similar number of CPUs per chip; but then the PS3 board carries only one such chip, not 27 as a SiCortex node board does. But then I also can only agree with the point that:
The processor's rather modest instruction-level parallelism is well suited to HPTC applications which typically spend most of their time waiting for memory accesses to complete.
interior nodes in the hierarchy cannot have data associated with them: in the DNS both a domain name and its subdomain names can have addresses associated with them.
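For example (an illustrative fragment using reserved documentation addresses), a zone can hold both of these at once:

  Example.com.      A  192.0.2.1
  www.Example.com.  A  192.0.2.2

that is, Example.com itself resolves to an address even though it also has subdomains under it.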
(srv1) being preferred for floor 1 and the other (srv2) for floor 2. Let's
consider then these possible naming schemes, which I will
present all at once so the differences can be seen at a glance,
with case-by-case comments afterwards:
  ; There should be '$ORIGIN Example.com' or equivalent here.

  ; #1
  srv1.Example.com.               A      192.168.1.1
  srv1.Example.com.               A      192.168.2.1
  srv2.Example.com.               A      192.168.1.2
  srv2.Example.com.               A      192.168.2.2

  ; #2
  eth0.srv1.Example.com.          A      192.168.1.1
  eth1.srv1.Example.com.          A      192.168.2.1
  eth0.srv2.Example.com.          A      192.168.1.2
  eth1.srv2.Example.com.          A      192.168.2.2

  ; #2
  NFS-1.Example.com.              A      192.168.1.1
  IPP-1.Example.com.              A      192.168.1.1
  NFS-2.Example.com.              A      192.168.2.2
  IPP-2.Example.com.              A      192.168.2.2

  ; #3
  floor1.NFS.Example.com.         A      192.168.1.1
  floor1.IPP.Example.com.         A      192.168.1.1
  floor1.NFS2.Example.com.        A      192.168.1.2
  floor1.IPP2.Example.com.        A      192.168.1.2
  floor2.NFS.Example.com.         A      192.168.2.2
  floor2.IPP.Example.com.         A      192.168.2.2
  floor2.NFS2.Example.com.        A      192.168.2.1
  floor2.IPP2.Example.com.        A      192.168.2.1

  ; #4
  ; #4.1
  floor1.srv1.Example.com.        A      192.168.1.1
  floor2.srv1.Example.com.        A      192.168.2.1
  floor1.srv2.Example.com.        A      192.168.1.2
  floor2.srv2.Example.com.        A      192.168.2.2
  ; #4.2
  srv1.Example.com.               CNAME  floor1.srv1.Example.com.
  srv1.Example.com.               CNAME  floor2.srv1.Example.com.
  srv2.Example.com.               CNAME  floor1.srv2.Example.com.
  srv2.Example.com.               CNAME  floor2.srv2.Example.com.
  ; #4.3
  NFS.main.floor1.Example.com.    CNAME  floor1.srv1.Example.com.
  IPP.main.floor1.Example.com.    CNAME  floor1.srv1.Example.com.
  NFS.bkup.floor1.Example.com.    CNAME  floor1.srv2.Example.com.
  IPP.bkup.floor1.Example.com.    CNAME  floor1.srv2.Example.com.
  NFS.main.floor2.Example.com.    CNAME  floor2.srv2.Example.com.
  IPP.main.floor2.Example.com.    CNAME  floor2.srv2.Example.com.
  NFS.bkup.floor2.Example.com.    CNAME  floor2.srv1.Example.com.
  IPP.bkup.floor2.Example.com.    CNAME  floor2.srv1.Example.com.

As to these:
flat for my tastes, and does not take enough advantage of the hierarchical nature of DNS, and does not make it easy to switch from the main to the backup server for each floor.
search lines in the resolv.conf (or equivalent DHCP) file, and/or by changing the pointed-to interfaces. The naming could be simplified to something like:

  NFS.floor1.Example.com.    CNAME  floor1.srv1.Example.com.
  IPP.floor1.Example.com.    CNAME  floor1.srv2.Example.com.
  NFS.floor2.Example.com.    CNAME  floor2.srv2.Example.com.
  IPP.floor2.Example.com.    CNAME  floor2.srv1.Example.com.

if one had clustering or load balancing for NFS and IPP across the two servers.
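For instance (an illustrative sketch; that srv1 also answers DNS queries is just my assumption here), clients on floor 1 could have in their resolv.conf:

  search floor1.Example.com Example.com
  nameserver 192.168.1.1

so that a plain NFS or IPP resolves via NFS.floor1.Example.com or IPP.floor1.Example.com, and repointing a floor means editing the DNS records or the search line rather than every client's configuration.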