8/02/2010

08-02-10 - Java will be faster than C

I think it's quite likely that in the next 10 years, Java and C# programs will be faster than C/C++ programs. The languages are simply better - cleaner, better defined, more precise in their operation, and most importantly - much easier to parallelize. C/C++ is just too hard to make parallel for most tasks; it's not worth it for the average programmer. But Java/C# are very easy.

Certainly I think that the Appleophile bloggers who are enamored of "native code" are missing the big picture. It's a damn shame that so many simple utilitarian apps are being written for specific platforms, when we have pretty decent platform-independent languages.

Obviously speed parity has not been achieved yet, though a lot of the gap is 1. Java programmers who are doing things like SetPixel() one pixel at a time instead of using higher-level APIs, and 2. the fact that we're not actually on 8+ cores all the time yet.
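
To make (1) concrete, here's a rough hypothetical sketch in Java (BufferedImage is just standing in for whatever image API you're using, and the class/method names are made up for illustration; I haven't timed this) - pushing pixels through setRGB() one at a time versus handing the API a whole block:

import java.awt.image.BufferedImage;
import java.util.Arrays;

public class PixelFill {
    // The slow way: one API call, with its own checks and conversions, per pixel.
    static void fillSlow(BufferedImage img, int argb) {
        for (int y = 0; y < img.getHeight(); y++)
            for (int x = 0; x < img.getWidth(); x++)
                img.setRGB(x, y, argb);
    }

    // The sane way: fill a flat int[] and pass the whole block to the API at once.
    static void fillFast(BufferedImage img, int argb) {
        int w = img.getWidth(), h = img.getHeight();
        int[] block = new int[w * h];
        Arrays.fill(block, argb);
        img.setRGB(0, 0, w, h, block, 0, w);
    }

    public static void main(String[] args) {
        BufferedImage img = new BufferedImage(1024, 1024, BufferedImage.TYPE_INT_ARGB);
        fillSlow(img, 0xFF00FF00);
        fillFast(img, 0xFFFF0000);
    }
}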

12 comments:

CrazyButcher said...

A bit related to the discussion: http://luajit.org/performance.html is getting pretty fast, and that's purely single-core so far, in an interpreted language. So indeed it's just a matter of time.

jeskola said...

10 years is very optimistic.

cbloom said...

Arguably, Azul is already doing it :

http://www.azulsystems.com/blogs/


And then the next big leap will come when we get hardware assists for transactional memory. Using the semi-HTM in C will be difficult and ugly as usual. Higher-level, cleaner languages with better-defined concurrency behavior will be able to use HTM and start to destroy C on the 24+ core machines we will all have.

Autodidactic Asphyxiation said...

I somewhat agree with this. I don't know whether it will be the 24-core or the 100-core (or 1000-core) machine that proves it, but it seems pretty inevitable that such a thing will soon exist. There is still a chance that C++ will get its shit together, though, and stay on top.

But it really depends on the problem domain. Things that are latency-bound are not going to be helped at all by threads.

Azul is interesting, but they also have their CPU and OS built around Java, something that isn't necessarily going to be true in the desktop, console, or handheld spaces. They are also throughput-oriented, not latency-oriented.

Re: LuaJIT

I think trace-based JITs are awesome, and LuaJIT makes lots of excellent choices, but trace-based JIT really only makes sense for dynamic compilation environments (particularly for dynamic languages and/or space-constrained devices). There's clearly still a need for ahead-of-time compilation that can make global decisions about your program. I think a number of things need to happen in the mythical ideal runtime:

0) lightweight profiling (probably already done)
1) garbage collection that uses all your threads, less of your memory, and doesn't pause
2) combine ahead-of-time and trace-based JIT
3) runtime optimization happens concurrently with execution

cbloom said...

"But it really depends on the problem domain. Things that are latency-bound are not going to be helped at all by threads."

Sure, "embedded" low-memory, high-latency-need type stuff will be C for a long time.

"Azul is interesting, but they also have their CPU and OS built around Java, .. "

Yeah, though more generally there are just too many weird hardware models to expose to C in a clean way. Another example I think is interesting is Sun's Rock. Rock will probably suck or die, but similar things are coming from Intel/IBM as well.

BTW I also don't think Java is actually the best example of this; C# is improved in various ways for threading.

"trace-based JIT really only makes sense for dynamic compilation environments (particularly for dynamic languages and/or space-constrained devices). There's clearly still a need for ahead-of-time compilation that can make global decisions about your program."

Clearly a mix of compilation is needed, but trace-based compilation is also good for PGO (profile guided optimization). Azul's tiered JIT can already do things like inline functions based on use. Having a language with inherent introspection is really nice for PGO.

" 0) lightweight profiling (probably already done)"

Yeah Azul & Doug Lea do this. They have introspective GC and JIT that can monitor the running program and adapt. It will get better.

" 3) runtime optimization happens concurrently with execution "

This is a nice thing to do with all those extra cores that we can't figure out how to use - have some of them watch the running cores and optimize them!

Brian said...

Have you tried benchmarking Java code vs. C lately? They are pretty much at parity already, with the one critical difference being array bounds checks in Java. On top of that, the levels of abstraction in Java's APIs encourage people to write Java code in a way that introduces significant overheads.
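
A hedged illustration of what I mean by abstraction overhead (not benchmarked here, just the shape of it, with made-up names): summing through ArrayList<Integer> boxes every element on the heap, while the int[] loop is essentially the loop a C compiler would emit, plus bounds checks:

import java.util.ArrayList;
import java.util.List;

public class AbstractionCost {
    static final int N = 1000000;

    // Idiomatic "high-level" Java: every element is a boxed Integer object on the heap.
    static long sumBoxed(List<Integer> xs) {
        long sum = 0;
        for (Integer x : xs) sum += x;   // unboxing on every iteration
        return sum;
    }

    // The "C-like" version: a flat primitive array; only the bounds checks are extra.
    static long sumPrimitive(int[] xs) {
        long sum = 0;
        for (int i = 0; i < xs.length; i++) sum += xs[i];
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> boxed = new ArrayList<Integer>(N);
        int[] flat = new int[N];
        for (int i = 0; i < N; i++) { boxed.add(i); flat[i] = i; }
        System.out.println(sumBoxed(boxed) + " " + sumPrimitive(flat));
    }
}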

There is of course the advantage that C code can use vector instructions to get big performance wins.

TM is a mess. HTM only supports a limited amount of store, and if you exceed that you take a huge performance hit. Contention management is also a big issue and makes it hard to reason about performance. The AMD proposal uses an aggressive CM --- it is very easy to get into near livelock situations that kill performance.

cbloom said...

"TM is a mess. HTM only supports a limited amount of store, and if you exceed that you take a huge performance hit. Contention management is also a big issue and makes it hard to reason about performance. The AMD proposal uses an aggressive CM --- it is very easy to get into near livelock situations that kill performance."

Hmm.. I haven't followed it in great detail, but all of them are basically a speculate-then-commit scheme, usually tracked at whole cache-line granularity. eg.

AMD ASF :

http://blogs.amd.com/developer/2009/06/15/just-released-advanced-synchronization-facility-asf-specification/

which I think is very cool and handy. I don't mean that it will actually be used for STM - I have no idea if that will ever come true - but it is a very powerful primitive for general use.

cbloom said...

http://forums.amd.com/devforum/messageview.cfm?catid=203&threadid=112333&highlight_key=y

http://groups.google.com/group/comp.arch/browse_frm/thread/68a9fda9386048d7

http://groups.google.com/group/comp.programming.threads/tree/browse_frm/thread/c1c6c6327aed79b6


In Java or C# you should be able to basically just write mutex-based code, and the runtime should turn it into lock-elided transactions (that fall back to real mutexes).
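
e.g. a sketch of what I mean (the names are made up, and nothing here is a real lock-eliding runtime - it's ordinary Java; the point is that the source stays plain synchronized code, and a sufficiently smart runtime would be free to run the critical section as a hardware transaction and only take the real lock on abort):

import java.util.Random;

public class Account {
    private final int[] balances = new int[1024];

    // Plain mutex-based code. A lock-eliding runtime could speculate the
    // critical section as a transaction and only fall back to the real lock
    // when two threads actually touch the same slots.
    public synchronized void transfer(int from, int to, int amount) {
        balances[from] -= amount;
        balances[to]   += amount;
    }

    public static void main(String[] args) throws InterruptedException {
        final Account acct = new Account();
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(new Runnable() {
                public void run() {
                    Random rnd = new Random();
                    for (int i = 0; i < 100000; i++)
                        acct.transfer(rnd.nextInt(1024), rnd.nextInt(1024), 1);
                }
            });
            threads[t].start();
        }
        for (Thread t : threads) t.join();
    }
}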

Brian said...

Yeah, one problem with the hardware TMs is the following. Writes in a speculative region that conflict with another core's speculative region cause the other core's speculation to immediately abort. See the section on contention in http://developer.amd.com/assets/45432-ASF_Spec_2.1.pdf. Consider:
Thread 1:
trans start
read and write a
compute (some sort of delay)
trans end

Thread 2:
trans start
read and write a
compute (some sort of delay)
trans end

These two threads will just abort each other repeatedly... When one thread updates a, it immediately broadcasts a message that causes the other to abort if it has read the cache line with a.

This isn't insurmountable of course, but the details of how you manage contention are important for performance of your code. And this isn't like a small 20% difference, but an order of magnitude...

Autodidactic Asphyxiation said...

Don't hold your breath for Rock. It might have been dead even before Oracle got to it. I am not a microarchitect, but I think what needs to happen at the h/w level is a kind of unification around speculation; out-of-order execution and simultaneous multithreading (particularly in a transactional memory context) feel similar to me.

"Clearly a mix of compilation is needed, but trace-based compilation is also good for PGO (profile guided optimization). Azul's tiered JIT can already do things like inline functions based on use. Having a language with inherent introspection is really nice for PGO."

Yep, although I'm still suspicious of the whole GC thing. The way I think of it is that there is a time/space tradeoff here, and GC exposes a whole new section of that curve. The really interesting thing is that it makes improvements in developer productivity and program analyzability/type safety.

You know, people seem to be taking reference counting seriously, again...

RCL said...

So why aren't we adapting Java/C# to GPGPU programming (hundreds/thousands of "cores") and instead doing C/C++ conversions like CUDA, OpenCL and DirectCompute?

I believe the fundamental problem with higher-level languages is that they don't allow cutting corners.

E.g. requiring that everything be allocated from the heap, GC'ed, type-checked, that sort of stuff.

Hardware is not going to assist with that, either. Nobody (well, except ARM, but it's optional there anyway) wants to cripple its processor/GPU just to make programming easier.

ryg said...

"Co why we aren't adapting Java/C# to GPGPU programming (hundred/thousands of "cores") and instead doing C/C++ conversions like CUDA, OpenCL and DirectCompute?"
OpenCL is indeed a C conversion (without recursion), for the most part. Not sure about CUDA. D3D11 Compute Shaders most definitely are not.

Compute Shaders don't have pointers, unions/bitwise casts or recursion. All outside-visible memory writes go through special constructs (read-write buffers) that are created and passed in by the host app. Not sure what the aliasing rules are between buffers (which codify explicit memory access), but buffers can never alias with variables local to a compute shader, and local variables and function parameters can't alias with each other. In short, Compute Shaders are at least as compiler-friendly as Java, and more so in some regards.

Furthermore, OpenCL, CUDA and DirectCompute all have a specified memory model and built-in atomics and memory barriers - like Java but unlike C.
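
(For concreteness, a minimal Java sketch of what a specified memory model plus built-in atomics buys you - the names are made up, and this is just an illustration, not anything specific to compute shaders: the volatile flag gives a defined happens-before edge, and AtomicInteger gives an atomic read-modify-write, neither of which plain pre-C11 C could express portably:)

import java.util.concurrent.atomic.AtomicInteger;

public class Publish {
    static int payload;                 // plain data
    static volatile boolean ready;      // volatile write/read gives a happens-before edge
    static final AtomicInteger hits = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(new Runnable() {
            public void run() {
                payload = 42;           // guaranteed visible to any thread that later sees ready == true
                ready = true;
            }
        });
        Thread reader = new Thread(new Runnable() {
            public void run() {
                while (!ready) { /* spin */ }
                if (payload == 42) hits.incrementAndGet();   // built-in atomic RMW
            }
        });
        writer.start(); reader.start();
        writer.join(); reader.join();
        System.out.println("hits = " + hits.get());
    }
}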
