IBM's CodeNet Dataset Can Teach AI To Translate Computer Languages (engadget.com) 40
IBM announced during its Think 2021 conference on Monday that its researchers have crafted a Rosetta Stone for programming code. Engadget reports: In effect, we've taught computers how to speak human, so why not also teach computers to speak more computer? That's what IBM's Project CodeNet seeks to accomplish. "We need our ImageNet, which can snowball the innovation and can unleash this innovation in algorithms," [Ruchir Puri, IBM Fellow and Chief Scientist at IBM Research, said during his Think 2021 presentation]. CodeNet is essentially the ImageNet of computers. It's an expansive dataset designed to teach AI/ML systems how to translate code and consists of some 14 million snippets and 500 million lines spread across more than 55 legacy and active languages -- from COBOL and FORTRAN to Java, C++, and Python.
"Since the data set itself contains 50 different languages, it can actually enable algorithms for many pairwise combinations," Puri explained. "Having said that, there has been work done in human language areas, like neural machine translation which, rather than doing pairwise, actually becomes more language-independent and can derive an intermediate abstraction through which it translates into many different languages." In short, the dataset is constructed in a manner that enables bidirectional translation. That is, you can take some legacy COBOL code -- which, terrifyingly, still constitutes a significant amount of this country's banking and federal government infrastructure -- and translate it into Java as easily as you could take a snippet of Java and regress it back into COBOL.
CodeNet can be used for functions like code search and clone detection, in addition to its intended translational duties and serving as a benchmark dataset. Also, each sample is labeled with its CPU run time and memory footprint, allowing researchers to run regression studies and potentially develop automated code correction systems. Project CodeNet consists of more than 14 million code samples along with 4000-plus coding problems collected and curated from decades of programming challenges and competitions across the globe. "The way the data set actually came about," Puri said, "there are many kinds of programming competitions and all kinds of problems -- some of them more businesslike, some of them more academic. These are the languages that have been used over the last decade and a half in many of these competitions with 1000s of students or competitors submitting solutions." Additionally, users can run individual code samples "to extract metadata and verify outputs from generative AI models for correctness," according to an IBM press release. "This will enable researchers to program intent equivalence when translating one programming language into another." [...] IBM intends to release the CodeNet data to the public domain, allowing researchers worldwide equal and free access.
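Puri's contrast between pairwise translation and an intermediate abstraction can be put in rough numbers. The sketch below is only a counting exercise, not a description of how CodeNet or any IBM model works: it assumes one translator per ordered language pair in the pairwise case, and one encoder plus one decoder per language in the intermediate case. The function names are made up for illustration.

```python
def pairwise_translators(num_languages: int) -> int:
    """Directed translators needed if every ordered pair gets its own model."""
    return num_languages * (num_languages - 1)


def pivot_components(num_languages: int) -> int:
    """Components needed with a shared intermediate form:
    one encoder into it and one decoder out of it per language (assumed design)."""
    return 2 * num_languages


if __name__ == "__main__":
    n = 50  # the figure Puri quotes
    print(pairwise_translators(n))  # 2450
    print(pivot_components(n))      # 100
```

The gap grows quadratically with the number of languages, which is the usual argument for a language-independent intermediate form.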
"Since the data set itself contains 50 different languages, it can actually enable algorithms for many pairwise combinations," Puri explained. "Having said that, there has been work done in human language areas, like neural machine translation which, rather than doing pairwise, actually becomes more language-independent and can derive an intermediate abstraction through which it translates into many different languages." In short, the dataset is constructed in a manner that enables bidirectional translation. That is, you can take some legacy COBOL code -- which, terrifyingly, still constitutes a significant amount of this country's banking and federal government infrastructure -- and translate it into Java as easily as you could take a snippet of Java and regress it back into COBOL.
CodeNet can be used for functions like code search and clone detection, in addition to its intended translational duties and serving as a benchmark dataset. Also, each sample is labeled with its CPU run time and memory footprint, allowing researchers to run regression studies and potentially develop automated code correction systems. Project CodeNet consists of more than 14 million code samples along with 4000-plus coding problems collected and curated from decades' of programming challenges and competitions across the globe. "The way the data set actually came about," Puri said, "there are many kinds of programming competitions and all kinds of problems -- some of them more businesslike, some of them more academic. These are the languages that have been used over the last decade and a half in many of these competitions with 1000s of students or competitors submitting solutions." Additionally, users can run individual code samples "to extract metadata and verify outputs from generative AI models for correctness," according to an IBM press release. "This will enable researchers to program intent equivalence when translating one programming language into another." [...] IBM intends to release the CodeNet data to the public domain, allowing researchers worldwide equal and free access.
Easy to miss the point (Score:2)
Re: (Score:2)
Indeed. I agree with all of that. In addition, the results of such a translation tend not to be human-readable, which makes the whole exercise pretty pointless as it makes the code unmaintainable. If you have really old code, it is better to maintain the compiler for that than do an automatic translation to some other language.
That said, there is a case for translating Python to C, but that one is both solved and comes with some limitations.
Re: (Score:3, Interesting)
I am not sure I agree completely. A long time ago I worked in a place where we "uplifted" a lot of COBOL code to C. The machine translation of COBOL to C (I forget what the product was) essentially converted the DATA DIVISION into a giant C UNION with the records as STRUCTS and all the memory allocated at the start of main().
Basically it was a very direct translation of the COBOL application, both in the high-level code layout and in how the machine is going to execute it. It was absolutely NOT the way a
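For readers who have not seen that kind of output, here is a minimal sketch of the layout being described: record structs overlaid in one union that is allocated once up front. It uses Python's `ctypes` to mirror the C layout; the record names, field names, and widths are invented for illustration and do not come from the product the poster used.

```python
import ctypes


class CustomerRecord(ctypes.Structure):
    # Hypothetical stand-in for a COBOL 01-level record.
    _fields_ = [
        ("cust_id", ctypes.c_char * 6),     # e.g. PIC X(6)
        ("cust_name", ctypes.c_char * 20),  # e.g. PIC X(20)
    ]


class OrderRecord(ctypes.Structure):
    # A second hypothetical record sharing the same storage.
    _fields_ = [
        ("order_id", ctypes.c_char * 8),    # e.g. PIC X(8)
        ("amount", ctypes.c_char * 9),      # e.g. PIC 9(7)V99 kept as display digits
    ]


class DataDivision(ctypes.Union):
    # All records overlay the same bytes, as in a C union of structs.
    _fields_ = [
        ("customer", CustomerRecord),
        ("order", OrderRecord),
    ]


# Allocated once, like the block set up at the start of main().
storage = DataDivision()
storage.customer.cust_id = b"C00042"
storage.customer.cust_name = b"ADA"

# The union is as large as its largest record.
print(ctypes.sizeof(storage))    # 26
# Reading through the other view shows the same bytes reinterpreted.
print(storage.order.order_id)    # b'C00042AD'
```

The last line is the point: nothing in the translated code stops one record's fields from being read through another record's layout, which is faithful to the original program but not how a C programmer would normally design it.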
Re: (Score:2)
I love this. It's a pretty complicated high-wire balancing act because you not only had to understand COBOL really well but obviously had to be a c master to "trace" the older program and its logic.
I have no insight to offer, just thought your thing was rad
Re: (Score:3)
Re: (Score:2)
The success of this really depends on the starting COBOL corpus. If things were done neatly with appropriate copybooks, and programs were kept small and run as a series of separate job steps from JCL, you get useful C modules out that can be treated as little black boxes until you are ready to replace them. If you have that giant single program that does all the things (full ledger reconciliation, payroll, billing, inventory, start to finish), a mud ball as input, you will get a mud ball as output.
Re: (Score:2)
GIGO no matter what. Of course terseness doesn't help.
Re: (Score:2)
I hear what you're saying, but I think since it's all programmatic, translation seems like it would be a matter of mapping and functional replication. I don't see why it couldn't be done although the translated code might be ugly, messy, and top-heavy.
Human coders translate code from one language to another (I've done it myself going from Perl to PHP), so I don't see why it couldn't be done programmatically. There's not a lot of nuance or intent to puzzle out, just replicating the flow of the logic.
I'm not s
Re: (Score:2)
Because computers/compilers make pretty bad translators?
IBM used to sell (and I guess they still support, kinda) something called EGL and VG (VisualGen?). You write code in yet a third language and set up generation options and it will generate either COBOL or Java+JSP code to run under Websphere. While the code it generates builds, runs, etc. and does what we expect it to do, the generated Java is nearly unreadable and has a very weird way of dealing with objects and passing them around.
Fortunately the E
Re: (Score:2)
Because computers/compilers make pretty bad translators?
For the most part they do pretty well with human languages, I would think that it would be even more straightforward with programmatic languages. There would be no issues dealing with nuance, no double meanings, and so on.
Fortunately the EGL/VG code is simple to read and understand so when I have to convert it to "plain" Java (e/t/l stuff) or Spring (rest-like stuff) it is much quicker for me to just do it from scratch and about all I take from it is the sql statements and business rules
Sounds like you might want to write a Java/SQL simplifier program. :) I'm guessing that some (maybe a lot) of what's involved in your conversion is repetitive or determinate and could be done programmatically.
Re: (Score:3)
COBOL terrifying? (Score:5, Funny)
Re: (Score:2)
Re: (Score:2)
Old code isn't a bad thing, and I have no issues with the choice of COBOL as a language for running the infrastructure of a bank.
My worry is the assumptions of the initial team are now invalid.
Little things like the Y2K bug, a company being worth a trillion dollars (or individuals worth 100s of billions), or the methods for data writing and retrieval.
As for finding people to hack away at COBOL, the answer is easy: pay them money to do so.
Shocking how simple money makes problems go away.
Re: (Score:2)
I would suggest to you the assumptions and implicit limitations behind 30+ year old COBOL code are better documented, or at least more obvious than the vast majority of what has found its way into production between then and now.
I suspect 2038 will be a rude surprise for a lot of organizations. I also expect, as we discovered recently, that things like the price of BRK.A are going to do things like overflow unsigned ints. With the old COBOL programs it was very obvious how wide fields were and not in binary 2s com
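Both limits mentioned here fall out of simple arithmetic on 32-bit fields. The sketch below assumes a signed 32-bit Unix timestamp for the 2038 case and, for the price case, an unsigned 32-bit integer counting ten-thousandths of a dollar; that scale factor is an assumption for illustration, and `wrap_uint32` is a hypothetical helper that imitates what a fixed-width field does on overflow.

```python
from datetime import datetime, timezone

INT32_MAX = 2**31 - 1
UINT32_MAX = 2**32 - 1
PRICE_SCALE = 10_000  # assumed: price stored in units of 1/10000 dollar


def wrap_uint32(value: int) -> int:
    """Keep only the low 32 bits, as an unsigned 32-bit field would."""
    return value & UINT32_MAX


# Last second a signed 32-bit Unix timestamp can represent.
last_second = datetime.fromtimestamp(INT32_MAX, tz=timezone.utc)
print(last_second.isoformat())  # 2038-01-19T03:14:07+00:00

# Largest price the assumed unsigned 32-bit field can hold.
print(f"{UINT32_MAX // PRICE_SCALE}.{UINT32_MAX % PRICE_SCALE:04d}")  # 429496.7295

# A price of 430000.0000 dollars no longer fits and silently wraps.
stored = wrap_uint32(430_000 * PRICE_SCALE)
print(f"{stored // PRICE_SCALE}.{stored % PRICE_SCALE:04d}")  # 503.2704
```

A COBOL picture clause such as PIC 9(7)V99 states its own limit in the declaration, which is the contrast the poster is drawing.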
Re: (Score:3)
Re: (Score:3)
The code has been running fine for decades, but sure, rewrite it all in javashit. What can possibly go wrong?
Re: (Score:2)
The code has been running fine
For any values of "fine"
Re: (Score:1)
Agreed.
Banking/finance deals with little things like money. Money is inherently floating point. 2s complement math tends to do rather poorly with such work. Yes, there are workarounds - but in these situations you're better off using a language designed for such work (BCD under the covers) rather than crocking a fix that some poor slob programmer will forget/fail to use.
Second area - COBOL is designed for record processing. C, Java, Python, etc. are byte oriented. The nature of the work tends to be... reco
Re: (Score:3, Insightful)
Banking/finance deals with little things like money. Money is inherently floating point
You should never be let near any financial software.
Money is inherently fixed point.
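A short sketch of why the distinction matters: binary floating point cannot represent most decimal fractions exactly, while integer cents or a decimal type can. `format_cents` is a hypothetical helper and handles non-negative amounts only.

```python
from decimal import Decimal


def format_cents(cents: int) -> str:
    """Render a non-negative amount held as integer cents."""
    return f"{cents // 100}.{cents % 100:02d}"


# Binary floating point: ten cents plus twenty cents is not thirty cents.
float_total = 0.1 + 0.2
print(float_total == 0.3)         # False

# Fixed point as integer cents: exact.
cents_total = 10 + 20
print(format_cents(cents_total))  # 0.30

# Decimal arithmetic: exact, and it keeps the two decimal places.
decimal_total = Decimal("0.10") + Decimal("0.20")
print(decimal_total)              # 0.30
```

Integer cents and decimal types are the usual ways to get fixed-point behaviour in languages without COBOL-style decimal fields.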
Re: (Score:2)
Re: COBOL terrifying? (Score:2)
Because the people who could translate it from human language to COBOL, despite its verbosity, are finding their code outliving them. If no one remembers how to maintain it, when it crashes it'll crash hard and fast and idiots will panic instead of picking up a reference manual.
Not new (Score:3)
There are quite a few languages that get translated to others. That is a very, very old approach. There is also decompilation that reverses the process.
That said, languages have different limitations and no amount of translation can get around that. Code generated in this fashion tends to be unreadable and hence is not maintainable. That limits the utility of this system rather dramatically.
I would say this is another desperate attempt by the "AI" people at IBM to prove they can do something useful.
Re: Not new (Score:1)
Crap In Crap Out (Score:2)
As if most of the code out there is even worth recycling.
Re:Crap In Crap Out (Score:4, Informative)
Re: (Score:2)
Better known as the, "my shit doesn't stink, yours does" philosophy.
Server not found (Score:2)
Legacy translation (Score:2)
"you can take some legacy COBOL code ... and translate it into Java"
So it can translate from one legacy language into another legacy language. I bet they were disappointed there was already an app called Rosetta Stone. :)
Quality of Output (Score:2)
Well, it can't be any worse than the code my co-workers write
The real test... (Score:2)
When it can translate the winning programs from the Obfuscated Perl Contest to Apple Basic I'll be impressed. Until then it's just the computer language equivalent of Google Translate.
"I can write COBOL in any language" (Score:2)
Ever heard of that phrase? And now, apparently, so can the computer!
Re: (Score:2)
While I know this was meant in jest (as was my previous post), I've seen this sentiment quite often, though usually with writing Fortran in C/C++. Simply making something work is often straightforward, but making it work well within the expected style of the language often isn't. I don't expect the AI to translate language idioms well at this point, and I especially don't expect it to be able to translate ideas across different programming approaches. It's hard enough to refactor my own code from procedu
Problems with all of this (Score:2)
Start with code conversion. This is not difficult, far easier than written/spoken language translation given the small set of "words".
The hard part is integration, local or not. Your Cobol has a bunch of hardware/storage specific code, good luck translating that in a useful manner.
And the "databases" may just be flat files with a SQL overlay (IBM stuff certainly can be). Even if you can pull off data layer conversion, what's the database look like on the "other side" (and how is that integration "generat
Terrified about the wrong thing. (Score:1)
That is, you can take some legacy COBOL code -- which, terrifyingly, still constitutes a significant amount of this country's banking and federal government infrastructure -- and translate it into Java as easily as you could take a snippet of Java and regress it back into COBOL.
Actually what would be terrifying would be taking working COBOL code and turning it into Java.
Call me ... (Score:3)
Or even Snobol into Klingon.
"Your predecessor wrote a bunch of Perl scripts" (Score:2)
I'll be a believer and a fan when a salad of regexes and hacks gets translated into something a sane person wrote for the same purpose. :)
If you taught the "computers to speak human" (Score:2)
already, then why do we need programming languages at all?