Magika 1.0 Goes Stable As Google Rebuilds Its File Detection Tool In Rust (googleblog.com)
BrianFagioli writes: Google has released Magika 1.0, a stable version of its AI-based file type detection tool, and rebuilt the entire engine in Rust for speed and memory safety. The system now recognizes more than 200 file types, up from about 100, and is better at distinguishing look-alike formats such as JSON vs JSONL, TSV vs CSV, C vs C++, and JavaScript vs TypeScript. The team used a 3TB training dataset and even relied on Gemini to generate synthetic samples for rare file types, allowing Magika to handle formats that don't have large, publicly available corpora. The tool supports Python and TypeScript integrations and offers a native Rust command-line client.
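To see why "look-alike" formats such as JSON vs JSONL need more than a magic-number check, here is a minimal heuristic sketch in Python. This is not Magika's actual algorithm (Magika uses a trained model); it just illustrates that the two formats can only be told apart by attempting to parse the content:

```python
import json


def _parses(s: str) -> bool:
    """Return True if s is a single valid JSON value."""
    try:
        json.loads(s)
        return True
    except json.JSONDecodeError:
        return False


def guess_json_vs_jsonl(text: str) -> str:
    """Naive heuristic: if the whole text parses as one JSON value,
    call it 'json'; else if every non-empty line parses on its own,
    call it 'jsonl'. Note a one-line JSONL file is also valid JSON,
    so this toy version labels it 'json' -- exactly the ambiguity
    that makes such formats hard to distinguish."""
    if _parses(text):
        return "json"
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if lines and all(_parses(ln) for ln in lines):
        return "jsonl"
    return "unknown"
```

A content-inspecting heuristic like this is already slower and more fragile than reading a magic number, which is part of why a learned classifier is an appealing alternative for these cases.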
Under the hood, Magika uses ONNX Runtime for inference and Tokio for parallel processing, allowing it to scan around 1,000 files per second on a modern laptop core and scale further with more CPU cores. Google says this makes Magika suitable for security workflows, automated analysis pipelines, and general developer tooling. Installation is a single curl or PowerShell command, and the project remains fully open source. The project is available on GitHub and documentation can be found here.
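The throughput claim rests on classifying many files concurrently, which is what Tokio provides on the Rust side. A rough Python analogue of that pattern, using a thread pool over in-memory buffers (a toy stand-in classifier, not Magika's model):

```python
from concurrent.futures import ThreadPoolExecutor


def sniff(head: bytes) -> str:
    """Toy classifier: match a couple of well-known magic numbers."""
    if head.startswith(b"\x89PNG"):
        return "png"
    if head.startswith(b"%PDF"):
        return "pdf"
    return "unknown"


def scan(blobs, workers=4):
    """Classify many byte buffers concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(sniff, blobs))
```

In the real tool the per-file work is an ONNX model inference rather than a prefix match, but the fan-out structure is the same: independent files, so classification parallelizes cleanly across cores.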
So... Garbage in, garbage out, again? (Score:2)
Using AI to train AI. What could go wrong?
What is that for? (Score:5, Interesting)
I'm really curious about the use case for an AI reimplementation of the Unix "file" command that can "scan around 1,000 files per second on a modern laptop core". Who needs to speed up their existing bash scripts to identify the file types of over 1,000 files per second on a laptop?
Re: (Score:3)
An implementation of updatedb could make use of this, allowing users to search files by their identified type.
Re: (Score:2)
This could make a very tedious task a lot easier. If you've ever had to do PII/PCI/ISO auditing, you probably know what I mean. And since it can probably find malicious executable code masquerading as some benign file, all the better.
100MB version of mimetype (Score:4, Interesting)
I understand this is slightly more granular, but damn, it comes at a high cost in resources. While people may be able to look the other way at 100MB replacing a 10KB program, the computational cost is so much higher.
Re: (Score:2)
That's where I see this being very useful.
Re: (Score:2)
Have you ever had to scan hundreds of endpoints and file servers, running a mix of operating systems, for unstructured confidential (PII, PCI, etc) data?
No, literally never. I doubt many people have.
Re: (Score:2)
Why? (Score:2)
What's the point? File types are well defined. Even if you feel the need to distinguish between "look-alikes", why wouldn't you use something that understands the basic common types first, and then do your "AI" nonsense to detect dialects only where it's relevant?
Re: (Score:2)
If that isn't a concern, then yeah, filter out t
Re: (Score:2)
I think you misunderstood my point. It was not to rely on a file extension, but on the same definitions used by file(1).
Re: (Score:2)
I'm with you on speeding the process up for low priority scans by filtering out the obvious, but that's also what
o rly? (Score:4, Informative)
Electricity cost per file identified (Score:2)
Two Asks:
- Need to know the electricity cost per file identified, comparing the old system versus the new system.
- Need to know the relative frequency of each of the file formats, an important benchmark for others building systems needing to process lots of file types.
For example, what percentage of files can be identified with 100% confidence by a binary compare of the first 10 bytes against a known magic number, versus the cost in CPU instructions and electricity of an AI classification of those same files?
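The cheap baseline being proposed here can be sketched in a few lines: a lookup table of well-known magic numbers checked against a file's first bytes. The signatures below are real (PNG, PDF, ZIP, GIF), but the table is an illustrative fragment, not an exhaustive database like the one file(1) ships:

```python
# A tiny fragment of a magic-number table; real databases
# (e.g. the one behind file(1)) contain thousands of entries.
MAGIC = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",
    b"GIF89a": "image/gif",
}


def identify_by_magic(head: bytes):
    """Return a MIME type if the buffer starts with a known
    signature, else None (candidate for a costlier classifier)."""
    for sig, mime in MAGIC.items():
        if head.startswith(sig):
            return mime
    return None
```

Files that match here cost a handful of byte comparisons; only the `None` cases would need to fall through to model inference, which is the trade-off the benchmark above would quantify.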
This is a fine example (Score:5, Insightful)
This is a fine example of a completely pointless, stupid, BRAIN-FUCKING-DEAD application of AI. The UNIX file command has existed for decades. It works well and it's fast.
But for "security" purposes, we need to make a program 1000x as big and 100x as slow, using orders of magnitude more electricity, so our email security software can distinguish text/json from text/jsonl.
We have truly jumped the AI shark.
Re: (Score:2)
Just the other week, my brother-in-law got my wife to buy a little tool for spreading open a spring on my washing machine. $11, over my objection that it was silly to spend any money on something that would probably only be used
Re: (Score:2)
But for "security" purposes, we need to make a program 1000x as big, 100x as slow,
Good sire, you offend me! It's at least 10000x as big and 100000x as slow!
How does Magika detect file type (Score:2)
Not sure I see the point (Score:2)
Re: (Score:2)
so basically 'file' on steroid? (Score:1)
Re: (Score:2)