Magika 1.0 Goes Stable As Google Rebuilds Its File Detection Tool In Rust (googleblog.com) 10
BrianFagioli writes: Google has released Magika 1.0, a stable version of its AI-based file type detection tool, and rebuilt the entire engine in Rust for speed and memory safety. The system now recognizes more than 200 file types, up from about 100, and is better at distinguishing look-alike formats such as JSON vs JSONL, TSV vs CSV, C vs C++, and JavaScript vs TypeScript. The team used a 3TB training dataset and even relied on Gemini to generate synthetic samples for rare file types, allowing Magika to handle formats that don't have large, publicly available corpora. The tool supports Python and TypeScript integrations and offers a native Rust command-line client.
Under the hood, Magika uses ONNX Runtime for inference and Tokio for parallel processing, allowing it to scan around 1,000 files per second on a modern laptop core and scale further with more CPU cores. Google says this makes Magika suitable for security workflows, automated analysis pipelines, and general developer tooling. Installation is a single curl or PowerShell command, and the project remains fully open source. The project is available on GitHub and documentation can be found here.
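The JSON vs JSONL distinction mentioned in the summary is a good illustration of why fixed byte signatures fall short for text formats: both can begin with the same characters. A minimal, hypothetical heuristic in plain Python (this is not Magika's method, which uses a trained model — just a sketch of why the two formats are ambiguous to prefix-based detectors):

```python
import json

def classify_json_like(text: str) -> str:
    """Hypothetical heuristic: distinguish JSON from JSONL (JSON Lines).

    A .jsonl file holds one standalone JSON value per line; a .json file
    holds a single value that may span many lines. A single-record JSONL
    file is genuinely ambiguous and will be reported as "json" here.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) > 1:
        try:
            # Every non-empty line parses on its own -> JSON Lines.
            for ln in lines:
                json.loads(ln)
            return "jsonl"
        except json.JSONDecodeError:
            pass
    try:
        json.loads(text)
        return "json"
    except json.JSONDecodeError:
        return "unknown"
```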
So... Garbage in, garbage out, again? (Score:2)
Using AI to train AI. What could go wrong?
What is that for? (Score:4, Insightful)
I'm really curious about the use case for an AI reimplementation of the Unix "file" command that can scan around 1,000 files per second on a modern laptop core. Who needs to speed up their existing bash scripts to analyse the file type of over 1,000 files per second on a laptop?
Re: (Score:2)
An implementation of updatedb could make use of this, allowing users to search for files by their identified type.
100MB version of mimetype (Score:3)
I understand this is slightly more granular, but damn, it comes at a high cost in resources. While people may be able to look the other way at 100MB to replace a 10KB program, the computational cost is so MUCH higher.
Why? (Score:2)
What's the point? File types are well defined. Even if you feel the need to distinguish between "look-alikes", why wouldn't you use something that understands the basic common types first, and then do your "AI" nonsense to detect dialects only where it's relevant?
o rly? (Score:3)
Electricity cost per file identified (Score:2)
Two Asks:
- Need to know the electricity cost per each file identified comparing the old system versus the new system.
- Need to know the relative frequency of each of the file formats, an important benchmark for others building systems needing to process lots of file types.
For example: the percentage of files that can be identified with 100% certainty by a binary compare of the first 10 bytes against a magic number, versus the cost in CPU instructions and electricity of an AI classification of those same files.
This is a fine example (Score:4)
This is a fine example of a completely pointless, stupid, BRAIN-FUCKING-DEAD application of AI. The UNIX file command has existed for decades. It works well and it's fast.
But for "security" purposes, we need to make a program 1000x as big and 100x as slow, one that uses orders of magnitude more electricity, so our email security software can distinguish text/json from text/jsonl.
We have truly jumped the AI shark.
How does Magika detect file type (Score:2)
Not sure I see the point (Score:2)