RC4 Code Achieves 319 MB/s On AMD64 Opteron 244
Marc Bevand writes "This recent paper is about optimizing RC4 for AMD64 processors. A working implementation is provided. Its encryption/decryption throughput reaches 319 MB/s on a single AMD Opteron 244 processor running at 1.8 GHz. This makes it, as of today, the world's fastest RC4 symmetric cipher implementation for general-purpose CPUs. As the author of this work, I would like to point out that many CPU-hungry applications have not been optimized for AMD64 yet. In other words: such speedups can be expected in other areas."
An anonymous reader adds some figures for the old implementation: "Opteron 244 1.8 GHz (32-bit) 163 MB/s; Opteron 244 1.8 GHz (64-bit) 135 MB/s."
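For readers who have not seen the algorithm being tuned, here is a minimal Python sketch of textbook RC4 (key scheduling followed by keystream generation). It is not the paper's AMD64 code and says nothing about its speed; the function names rc4_keystream and rc4_crypt are made up for this illustration.

```python
def rc4_keystream(key: bytes):
    """Yield RC4 keystream bytes for the given key (1 to 256 bytes)."""
    if not 1 <= len(key) <= 256:
        raise ValueError("RC4 key must be 1 to 256 bytes long")
    # Key-scheduling algorithm (KSA): permute 0..255 under control of the key.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]
    # Pseudo-random generation algorithm (PRGA): one output byte per step.
    i = j = 0
    while True:
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        yield s[(s[i] + s[j]) & 0xFF]


def rc4_crypt(key: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; encryption and decryption are the same."""
    return bytes(b ^ k for b, k in zip(data, rc4_keystream(key)))
```

Because the cipher is a plain XOR with the keystream, calling rc4_crypt twice with the same key returns the original data. Each output byte depends on the swap before it, which is why the inner loop is hard to parallelise and why hand-tuning it is interesting.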
Optimisation is definitely the key (Score:5, Informative)
Every now & then I come across some code optimised for 64-bit processors, and it just flies - as more & more stuff gets the treatment, it will be like upgrading for free
Re:Optimisation is definitely the key (Score:5, Funny)
Just don't get too excited. One of my coworkers made this same discovery a while back. Now he runs around the office wearing an "I love Opteron" T-shirt and starts shouting "Intel is history - Power PC is dead!" every time somebody mentions the words Opteron or AMD in a sentence. Worst of all, he attacks anybody who disagrees and tries to bite them. We tried to knock him out with a dart gun after he savaged a visiting IBM sales rep, but even heavy-duty veterinary tranquillisers don't seem to have any effect.
Re:Optimisation is definitely the key (Score:2)
What the hell is he smoking?
I'll take PPC over any other architecture any day of the week.
Re:Optimisation is definitely the key (Score:2, Funny)
I know. The attacking and biting I can look past, but saying that Power PC is dead is just nuts.
Re:Optimisation is definitely the key (Score:2, Funny)
Re:Optimisation is definitely the key (Score:2)
Not all Athlon 64 processors are built for Socket 939; the value option is Socket 754.
If there are other people like me out there, though, they will be waiting for PCI Express compatible motherboards. There's no point in buying a new rig and getting stuck with AGP five years down the line.
until (Score:4, Insightful)
Re:until (Score:4, Insightful)
GCC is no slouch though, and obviously Intel is performing some tricks that could also be implemented by GCC.
I think it'd be a great move for AMD to work WITH GNU to optimize 64-bit AMD code from GCC.
Seems like Intel is more prone to keeping secrets when it comes to processors. Maybe this is (yet another) way for AMD to give them a run for their money.
Re:until (Score:3, Informative)
Re:until (Score:2)
Yes.
Re:until (Score:5, Informative)
AFAICR AMD paid SuSE to do the original work. I think the main developers were Jan Hubicka, the current x86-64 maintainer, and Andreas Jaeger. SuSE have a few more well-known GCC contributors: look at MAINTAINERS [gnu.org].
Re:until (Score:2)
GCC is no slouch though, (Score:2)
Slow to compile, slow when compiled. [coyotegulch.com]
Re:until (Score:2)
Re:until (Score:3)
For example, my code performs 5 times faster when compiled with GCC than when compiled with ICC.
OK, maybe I'm a special case (I use computed GOTO). But you can't compile the kernel with it either.
Re:until (Score:2)
Re:until (Score:3, Funny)
Real programmers can write FORTRAN in every language.
Re:until (Score:3, Interesting)
PS: Oh, of course, the Intel compiler won't ever support 3DNow!, but that's the issue with sponsorship. I mean, AMD don't have to design the compiler themselves. They would be equally OK sponsoring someone who knows how to do that.
Somewhat OT, but... (Score:5, Informative)
Re:Somewhat OT, but... (Score:5, Interesting)
So whilst this is all very handy, if you want encryption other than AES (which, if there were ever any significant flaws found in AES' maths, is a certainty) you'd want to dump those VIA boards and get yourself either a dedicated encryption device like an Encipher box (like an expensive version of the VIA) or just a beast of a machine to do encryption entirely in software (like an Opteron).
I personally shunt everything through DSA stunnels, so a VIA isn't much use to me.
Re:Somewhat OT, but... (Score:5, Informative)
Re:Somewhat OT, but... (Score:2)
Now if only they'd be as nice with the damned CLE266 graphics drivers...
Re:Somewhat OT, but... (Score:2)
Excuse my ignorance here, but are these chips on an expansion card or can you find motherboards with them?
Re:Somewhat OT, but... (Score:2)
Re:Somewhat OT, but... (Score:2)
I would just like to object to hearing this all the time. Sure, it's POSSIBLE that AES will be found vulnerable, but quite unlikely. The government agency that selected and approved of AES are the same ones who approved of DES, oh so many years ago. I think that alone means it deserves the benefit of the doubt.
Of course, it's still POSSIBLE, but hearing the same questions about it repeated so often, gives the wrong impression.
Re:Somewhat OT, but... (Score:3, Informative)
The government agency that selected and approved of AES are the same ones who approved of DES, oh so many years ago.
And the same ones who were apparently surprised when flaws were found in SHA-1, which they also selected and approved. And the same ones who developed the Law Enforcement Access Field (LEAF) for Clipper, which was quickly broken by Matt Blaze.
Thirty years ago when the NSA fixed IBM's Lucifer, which became DES, the NSA clearly had a huge amount of cryptologic knowledge that the public res
Re:Somewhat OT, but... (Score:2)
Yet you think the NSA forgot much of that?
DES has never been broken, therefore they know enough to thwart even the most advanced researchers today.
But you're convinced, this time around, they don't know enough to do that again?
SHA-1 and LEAF are completely different subjects, really. If you want to talk about Clipper, talk about Skipjack, which hasn't been found vulnerable yet.
Re:Somewhat OT, but... (Score:2)
Yet you think the NSA forgot much of that?
Nope. I think the public cryptologists caught up (or close to it).
DES has never been broken, therefore they know enough to thwart even the most advanced researchers today.
And what about tomorrow?
Even if you can break all but the last round, it's still every bit as secure. Even if you can break all but one round, that does not mean it's possible to extend the same or a similar method to break that last round. Skipjack is again a good example, as it has just enou
Re:Somewhat OT, but... (Score:2)
To break a strong symmetric cipher like AES just means to find a method which will deduce the key from an arbitrarily large amount of data (often with chosen plaintext or ciphertext) in a smaller number of basic operations than 2^(key size). This doesn't mean the attack is practical.
Sure, there's a big difference between theoretical and practical breaks. OTOH, attacks only improve so the smart thing to do is to start looking for other alternatives when theoretical breaks are found.
DES is broken, but
Re:Somewhat OT, but... (Score:2)
I personally shunt everything through DSA stunnels
You encrypt your data with the Digital Signature Algorithm? Good trick, that. Gotta be horribly slow, though.
Actually, you don't do this. You use DSA to validate DH public keys, use DH to establish a shared secret and use something like RC4 or some block cipher to actually do the bulk encryption. Or maybe you use RSA instead of DSA/DH, or maybe even El Gamal, but you definitely don't use DSA for bulk encryption.
It's actually quite likely that you
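The division of labour described above (public-key maths to agree on a secret, then a fast symmetric cipher for the bulk data) can be sketched in a few lines of Python. Everything here is a toy under stated assumptions: the 32-bit modulus P and base G are illustrative values far too small for real use, the DSA signature step that would authenticate the public keys is left out entirely, and bulk_xor is a hypothetical stand-in for RC4 or a block cipher, not a vetted design.

```python
import hashlib
import secrets

# Toy parameters, far too small for real use (assumption for illustration).
P = 0xFFFFFFFB  # a small prime modulus (2**32 - 5)
G = 5           # public base


def dh_keypair():
    """Return (private, public) for the toy group above."""
    private = secrets.randbelow(P - 3) + 2   # value in 2 .. P-2
    return private, pow(G, private, P)


def dh_shared_key(my_private: int, their_public: int) -> bytes:
    """Both sides compute the same secret, then hash it into a session key."""
    secret = pow(their_public, my_private, P)
    return hashlib.sha256(secret.to_bytes(8, "big")).digest()


def bulk_xor(key: bytes, data: bytes) -> bytes:
    """Stand-in stream cipher: XOR with a SHA-256 counter keystream."""
    out = bytearray()
    for block in range(0, len(data), 32):
        pad = hashlib.sha256(key + block.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[block:block + 32], pad))
    return bytes(out)
```

Each side calls dh_keypair once, swaps public values, and derives the same session key with dh_shared_key. Only the cheap bulk_xor step runs per byte of traffic, which is the part a tuned RC4 would replace.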
Re:Somewhat OT, but... (Score:2)
What I should have said was: everything gets thrown through SSH tunnels and I'd love to see an acceleration of whatever it is that SSH uses, as well as acceleration for creating those huge RSA/DSA keys we use all the time, which are slow to generate even on a dual Athlon 2000. And maybe better use of those RNGs that some of the VIA and AMD chipsets use.
I have
Re:Somewhat OT, but... (Score:2)
everything gets thrown through SSH tunnels and I'd love to see an acceleration of whatever it is that SSH uses, as well as acceleration for creating those huge RSA/DSA keys we use all the time
Well, get an Opteron, install the tuned RC4 implementation and configure stunnel to prefer RC4 and you'll have no problems with throughput. The tuned RC4 won't speed up session startup because that's all public-key stuff. Large integer math libs could really benefit from tuning on 64-bit registers, though.
As far
Not worth the outlay at present (Score:4, Informative)
RC4 isn't really that relevant in real life as WEP is crap & also easily done in hardware anyway.
The 64 bit advantage will suffer the same fate as the 32bit advantage did for the 486, Pentium & especially the Pentium Pro.
486 = 32bits, faster but people still bought 386's due to cost.
Pentium = 32bits, sometimes faster but again costs meant 486's stayed popular.
Pentium Pro = 32bit, 16 bit instructions stalled it. When running pure 32bit code it ran like the dogs, when running 16bit code (Win 98) it ran like a dog.
Problem is that you're generally better off saving your cash, buying a cheap CPU (32bit in this case) and waiting for the 2nd/3rd generation CPU. By that time prices will be more reasonable and you will see the full advantages as programs will use the extra bits properly.
I mean, come on, MS still hasn't released a final AMD64 version of Winblows yet.
Re:Not worth the outlay at present (Score:1)
So this code should run directly on a Pentium 4 with EM64T. Anybody tried it yet? How about trying it with the Intel C compiler? Most benchmarks use the Intel compiler, even on AMD CPUs, because it's so much better than GCC.
I don't buy the argument that it's the extra registers, because there have been over 56 registers available for register renaming since the early-mid 90's.
Re:Not worth the outlay at present (Score:2)
I don't buy the argument that it's the extra registers, because there have been over 56 registers available for register renaming since the early-mid 90's.
I'm no expert, however, from what I understand from the bit of reading I've done and the bit of assembler I've done, it isn't the number of registers on the chip, it is the number of registers available to the user of the chip.
For example, on the classic 32 bit X86, there are only four general purpose registers - EAX, EBX, ECX and EDX. If you want to
Re:Not worth the outlay at present (Score:2)
Re:Not worth the outlay at present (Score:2, Informative)
Not if you want to actually use the stack pointer and your stack-frame base pointer; you have 4 GP regs (EAX
AND, if you want to do multiplications and divisions (the worst offenders, IMO), then two of the GP registers are already spoken for (EAX, EDX).
So actually, the grandparent poster was right.
-gus
Re:Not worth the outlay at present (Score:2)
You just save a few load uops (Score:2)
Re:Not worth the outlay at present (Score:2)
Re:Not worth the outlay at present (Score:2)
Now it is true that I've heard that EM64T is less optimised than AMD64, but I've never seen benchmarks so I don't know if it is true.
Re:Not worth the outlay at present (Score:4, Informative)
486 = 32bits, faster but people still bought 386's due to cost.
The 386 was also a 32-bit processor...
Re:Not worth the outlay at present (Score:2)
Re:Not worth the outlay at present (Score:3, Informative)
That's the 386SX you're talking about; the regular 386DX, which came out before the SX, was full 32-bit.
Re:Not worth the outlay at present (Score:2)
Re:Not worth the outlay at present (Score:2)
The Athlon 64 starts at $141 on NewEgg
See for yourself [newegg.com]
And BTW, the 386 was 32 bit.
Re:Not worth the outlay at present (Score:5, Interesting)
I just bought a new PC, and when compared to all the available options, the AMD64 option (I got an AMD64 2800+) was best. Slightly more expensive than the equivalent XP, cheaper than the P4. And they run so cool, it's the first PC I've had in years where I don't have to worry about the temperature. When I bought an XP 2600+ last year, I spent almost half the chip's price again on cooling.
Just because I'm running a 32bit win XP on it doesn't make it a bad purchase.
Also, I'm one of those people who bought a 386 instead of a 486 (then later a 486 instead of a pentium 1) because of the price difference. The price difference nowadays is nowhere near comparable to what it was then.
Re:Not worth the outlay at present (Score:5, Informative)
If you use Mozilla and Apache, you can use 256-bit AES encryption for SSL (try loading up PayPal with a Mozilla-based browser), but if either the server or client is Microsoft-based you're stuck with the much weaker 128-bit RC4...
MS - always behind the curve, no 256bit encryption, no 64bit os
Re:Not worth the outlay at present (Score:2)
Re:Not worth the outlay at present (Score:2)
In fact, NT4 for the Alpha has a larger userbase than the Itanium, despite being discontinued...
Re:Not worth the outlay at present (Score:2)
What fate would that be? [intel.com] In 1985 I could see buying into a 286 simply because there was really no support for 32-bit protected mode, let alone expanded memory. Hell, extended memory was barely supported. Even in 1990 I could see buying into a 286 if it would save you money. DOS 4.0 was a bug-ridden piece of filth and there still was not a lot of support for 32-bit protected mode.
Re:Not worth the outlay at present (Score:3)
I've spoken to some of the people that made this decision for various companies (e.g. Raytheon). The general consensus was that the difference between the 68K and the x86 was "night and day", but that the Intel
Re:Not worth the outlay at present (Score:2)
64 bit systems aren't exactly new, they've been around for ages and the apps have been there as well. The fairly cheap (at the time) DEC alpha series & (way cheaper) assorted clones popularized them further.
I currently run a 64 bit AMD cpu and my system and all my applications are 64 bit. It's quite easy to run a 64 bit system if you want/need one. You can even tweak your system so it runs 32 bit apps in case you have some old stuff lying around.
Or you can go get a ready
Pentium Pro is the worst example (Score:2)
The Athlon 64 architecture currently runs many or most 32 bit applications faster than comparable Intel processors, and is competitively priced. The ability to run 64 bit code is more like a bonus. This seems more comparable to the Pentiu
Re:Pentium Pro is the worst example (Score:2)
AMD was smart (as in business-smart) by providing a very easy upgrade path from 32 to 64 bit CPUs. I now own an Opteron CPU, and it is a very sweet chip that runs regular old 32-bit WinXP very nicely (as well as m
Ohh... you're not biased... (Score:2)
Opterons are much cheaper than IA-64, and they run 32-bit x86 stuff at full speed. They make porting applications easy because it's still x86. So whether or not the Itanium is faster/better is moot. They are way expensive and way niche.
"RC4 isn't really that relevant in real life as WEP is crap & also easily don
Re:Not worth the outlay at present (Score:2)
Re:Not worth the outlay at present (Score:2, Insightful)
Itanium does really well on encryption in general. Hand-optimized code makes good use of the large register set, the modulo-scheduling of loops and powerful bit manipulation primitives.
IIRC Itanium holds the top spot in Spe
Finally enough horsepower... (Score:4, Funny)
PowerPC G5 (Score:5, Interesting)
Re:PowerPC G5 (Score:2, Insightful)
Seriously.
If you want to get 110% out of your hardware, you have to put effort in to get effort out. Makes sense, doesn't it?
I'm not saying people who don't like ASM are sissies, not at all. But I'm saying that assembly has its place, just like so many other programming languages.
Re:PowerPC G5 (Score:3, Insightful)
But when other projects beckon that don't require assembler work, I'm not about to jump on one that does for "fun" either ;)
Re:PowerPC G5 (Score:2)
Re:PowerPC G5 (Score:4, Informative)
From distributed.net's pages, here's what it has to say on the Opterons for RC5-72 (uniprocessor) [distributed.net]
The Opteron 2420 achieved a score of 9,547,969.00.
The 2GHz G5 for RC5-72 (uniprocessor) [distributed.net] achieved a score of 15,057,412.00 (there are 2.5GHz chips available...) The best multi-cpu scores?
A 2-way 2 GHz Opteron [distributed.net] achieved a score of 15,145,274.67, but
a 2-way 2.5GHz G5 [distributed.net] smoked it with a score of 37,441,192.00.
Apples to apples, my friend, apples to apples.
64-bit (Score:1)
chip names (Score:4, Funny)
well... (Score:4, Insightful)
well, maybe in some areas.
Since this is a cipher, it obviously helps a lot when you can work on 64-Bit chunks of data instead of 32-Bit.
The same speedup can probably be seen with applications that use numbers larger than 32b (or 64b for floats), since the number of operations necessary will essentially halve.
But other than that, I don't see much room for huge speedups.
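The "number of operations will essentially halve" claim can be checked with a small Python sketch that models wide integers as fixed-width limbs. This only counts limb additions for a 128-bit add; it is an idealised model, not a measurement, and the helper names split_limbs, add_limbs and join_limbs are invented for the example.

```python
def split_limbs(value: int, limb_bits: int, total_bits: int) -> list[int]:
    """Split a non-negative integer into little-endian limbs of limb_bits each."""
    mask = (1 << limb_bits) - 1
    return [(value >> shift) & mask for shift in range(0, total_bits, limb_bits)]


def add_limbs(a: list[int], b: list[int], limb_bits: int) -> tuple[list[int], int]:
    """Add two equal-length limb lists; return (result limbs, limb additions used)."""
    mask = (1 << limb_bits) - 1
    carry = 0
    out = []
    for x, y in zip(a, b):
        total = x + y + carry          # one add-with-carry per limb
        out.append(total & mask)
        carry = total >> limb_bits
    # The final carry is dropped, so the result is modulo 2**(total bits).
    return out, len(a)


def join_limbs(limbs: list[int], limb_bits: int) -> int:
    """Reassemble little-endian limbs into a single integer."""
    return sum(limb << (i * limb_bits) for i, limb in enumerate(limbs))
```

With 32-bit limbs a 128-bit addition takes four add-with-carry steps; with 64-bit limbs it takes two. Real speedups also depend on memory traffic and instruction scheduling, which this model ignores.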
The extra GP registers will help (Score:2)
See my earlier post [slashdot.org] as to why.
Re:The extra GP registers will help (Score:2)
Re:The extra GP registers will help (Score:2)
Re:The extra GP registers will help (Score:2)
Ok, you were right. But yeah, if you start with it should be fine unless it overflows
Re:The extra GP registers will help (Score:2)
Heh, you've hit the edge of my assembler knowledge, and I didn't think the example through that well..
However, the point I was trying to show was that on a processor with additional GP registers, you would be able to add to your example
If such an "eex" register existed, instead of
In other words, the additional GP registers allow both the number of "mov" instructions, and the delays they cause, to be reduced,.
Re:The extra GP registers will help (Score:2)
Of course, I understand what you are trying to say.
Re:well... (Score:2)
Depends on what you're doing. An add yes, instead of ab + cd you'd have a+c,b+d (plus some overflow flags). ab * cd? a*c + a*d + b*c + b*d (with appropriate magnitudes, of course).
Still, cryptography is still ideal for going 64-bit. Most other apps won't be significant, it is the added GP registers (which have nothing to do with 64 b
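The ab * cd expansion in this comment is ordinary schoolbook multiplication on half-words. A minimal Python sketch, assuming 64-bit inputs split into 32-bit halves (the function name mul64_from_32bit_parts is made up here):

```python
MASK32 = 0xFFFFFFFF


def mul64_from_32bit_parts(x: int, y: int) -> int:
    """Multiply two 64-bit values using only 32x32 -> 64-bit partial products."""
    a, b = x >> 32, x & MASK32   # x = a * 2**32 + b
    c, d = y >> 32, y & MASK32   # y = c * 2**32 + d
    # Four partial products, each shifted to its magnitude.
    return ((a * c) << 64) + ((a * d) << 32) + ((b * c) << 32) + (b * d)
```

So an addition doubles in cost when the word size halves, but a multiplication roughly quadruples: four multiplies plus the additions to combine them, where a 64-bit machine needs a single instruction.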
MOD PARENT UP (Score:2)
Compiler should do this (Score:2)
It really takes knowledge and skill to write portable code that makes few assumptions about hardware. Porting OpenOffice.org to 64-bit has been worked on for about 18 months. Hopefully 128-bit will not be as hard. See the code for dates that is not Y2K compliant writt
Maybe not with normal apps of TODAY... (Score:2)
Can you imagine a 16-bit version of Office 2003? Or a media player? Or any of the other pretty heavy apps you run now a days?
A 64-bit platform opens new doors for doing things that would require a much faster IA-32 chip to perform. Since we're not going to be seeing huge GHz increases in clock speed for a while, it's a decent thing to focus on.
Re:well... (Score:2)
At one point I was doing metrics for a highly sensitive financial trading application we were working on and did a breakdown of the response time (we had 3 seconds to create, transmit, render, and get a user reply on a trade decision, and we had to hop the Pacific and Atlantic for our international users). The results were that we took a
that's good (Score:2, Interesting)
and because every time we hear good about AMD we're happy :P
Everybody'll get TLS'ed
Optimization First, Features Second (Score:4, Insightful)
Re:Optimization First, Features Second (Score:2, Insightful)
If someone took your idea to the extreme, you might get something like this:
"What does it do?"
"Nothing, but look how *fast* it does it?"
I think the best solution is moderation in both ends.
Re:Optimization First, Features Second (Score:2, Funny)
Don't do it.
The Second Rule of Program Optimization (for experts only!):
Don't do it yet.
Re:Optimization First, Features Second (Score:2)
We Do Not Talk about Optimization
The Second Rule of Program Optimization (for experts only!):
We Do Not Talk about Optimization!!!
Re:Optimization First, Features Second (Score:2)
Re:Optimization First, Features Second (Score:3, Insightful)
Re:Optimization First, Features Second (Score:2)
Want to explain your logic? It seems to me it'd be a once-off win (as everyone switches to focus on optimization), and then business as usual. Think it through:
free market (Score:2)
I think with the lack of upgrades to Windows you are starting to see this effect happening. People are simply sticking to what they have. Microsoft (as an example only) will have to consider performance gains on existing hardware as a marketable thing soon.
RC4 (Score:2)
in an ideal world (Score:2)
Then all the coders need to do is write the code that can be optimised best. The Intel C compiler does magic on Intel processors under Linux etc.; the performance difference is clear.
In other news, it is still worth optimizing... (Score:2)
I don't see the big deal here. I'd like to see what this algorithm would do if fully-optimized on the other processors out there, including the 64-bit G5. Maybe even better, use an algorithm that would have more practical value (wasn't RC4 cracked a while back already?) Try cracking MD5 or SHA-1 or something...
RC4 Code Achieves 411 MB/s On AMD64 Opteron (Score:2)
The interesting thing is that the Opteron 248 CPU is faster than the clock speed increase alone would predict (using timothy's code):
319*(2.2/1.8)=390 < 411
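The comparison above is simple linear scaling by clock frequency. A short Python sketch of the arithmetic, using only the figures quoted in this thread (319 MB/s at 1.8 GHz, 411 MB/s measured at 2.2 GHz); the helper name scale_by_clock is made up:

```python
def scale_by_clock(throughput_mb_s: float, from_ghz: float, to_ghz: float) -> float:
    """Predict throughput at a new clock speed, assuming perfectly linear scaling."""
    return throughput_mb_s * to_ghz / from_ghz


predicted = scale_by_clock(319, 1.8, 2.2)   # about 390 MB/s
measured = 411
print(f"predicted {predicted:.0f} MB/s, measured {measured} MB/s, "
      f"ratio {measured / predicted:.3f}")
```

The measured figure comes out roughly 5% above the linear prediction, so something besides the core clock (memory speed or a different test setup, for instance) presumably contributes. The thread does not say which.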
RC4 is not cryptographically strong (Score:3, Informative)
If you really need speed, you can use RC4 securely but you have to know what you are doing and be aware of these attacks so you can employ protective countermeasures. Otherwise you are better off to use a cipher like AES which is actually secure.
Re:RC4 is not cryptographically strong (Score:2)
As with the G5 (Score:2)
I just think it's great that AMD is making such strides... for being a Mac guy, I pull for them in the PC world. What can I say? I like the underdog story.
64 bit Xeon (Score:2)
Re:Does this change anything for rc5? (Score:5, Interesting)
No, they're entirely different. For a start, RC4 is a stream cipher whereas RC5 is a block cipher. They just share the same inventor, hence the names.
AFAICR, the RC5 effort uses the register width to try and crack many keys in one go anyway - a different approach to this, which is using the register width to generate more of a single stream in one go.
Re:Does this change anything for rc5? (Score:2)
In their applications, perhaps, but they are very different in implementation. I would expect that techniques to optimise the implementation of stream cipher and block ciphers are very different, and the original question was (I thought) whether this optimised RC4 would help provide an optimised RC5.
And my last point was that, as I understand it, the distributed.net people don't crunch R
Re:post C benchmarks (Score:2)
He did: 135 MB/s, near the top of the article, is for OpenSSL's C implementation of RC4 using GCC 3.4.2, -march=opteron -O3.
Now you can probably tweak the compiler flags to improve that but it's a good point to start from.
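For anyone who wants to reproduce an MB/s figure of their own, the measurement itself is just bytes processed divided by wall-clock time. A minimal Python sketch, assuming any callable that takes a bytes buffer; the name throughput_mb_per_s and the best-of-N policy are choices made for this example, not how the quoted numbers were obtained.

```python
import time


def throughput_mb_per_s(func, data: bytes, repeats: int = 5) -> float:
    """Best-of-N throughput of func(data) in megabytes (10**6 bytes) per second."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(data)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse timers.
    return len(data) / max(best, 1e-9) / 1e6


if __name__ == "__main__":
    buf = bytes(1_000_000)
    rate = throughput_mb_per_s(lambda b: bytes(x ^ 0x5A for x in b), buf)
    print(f"{rate:.1f} MB/s")
```

Taking the best of several runs reduces noise from other processes. Whether the figures in the article use 10^6 or 2^20 bytes per MB is not stated, so treat small differences with care.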
SUSE (Score:2)
Definitely SUSE Professional 9.2 [suse.com]
ubuntu not for servers - yet (Score:2)
PS: Do not attempt to put your home on a vfat partition, it fails to install