Magika 1.0 Goes Stable As Google Rebuilds Its File Detection Tool In Rust (googleblog.com)

BrianFagioli writes: Google has released Magika 1.0, a stable version of its AI-based file type detection tool, and rebuilt the entire engine in Rust for speed and memory safety. The system now recognizes more than 200 file types, up from about 100, and is better at distinguishing look-alike formats such as JSON vs JSONL, TSV vs CSV, C vs C++, and JavaScript vs TypeScript. The team used a 3TB training dataset and even relied on Gemini to generate synthetic samples for rare file types, allowing Magika to handle formats that don't have large, publicly available corpora. The tool supports Python and TypeScript integrations and offers a native Rust command-line client.

Under the hood, Magika uses ONNX Runtime for inference and Tokio for parallel processing, allowing it to scan around 1,000 files per second on a modern laptop core and scale further with more CPU cores. Google says this makes Magika suitable for security workflows, automated analysis pipelines, and general developer tooling. Installation is a single curl or PowerShell command, and the project remains fully open source.
The project and its documentation are available on GitHub.
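The summary describes a parallel pipeline: many worker threads classifying files by content at once. As a rough stdlib sketch of that pattern (not Magika's actual implementation, which runs an ONNX model in Rust), here is a hypothetical scanner where a small magic-byte table stands in for the real classifier:

```python
# Illustrative sketch only: a stdlib approximation of the parallel-scan
# pattern the summary describes. The SIGNATURES table is a hypothetical
# stand-in for Magika's actual model-based classifier.
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",
}

def classify(path: Path) -> tuple[str, str]:
    # Read a short prefix and match it against known signatures.
    head = path.read_bytes()[:16]
    for magic, mime in SIGNATURES.items():
        if head.startswith(magic):
            return (path.name, mime)
    return (path.name, "application/octet-stream")

def scan(paths: list[Path], workers: int = 8) -> dict[str, str]:
    # Fan the classification out over a thread pool, as a scanner
    # processing thousands of files would.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(classify, paths))

if __name__ == "__main__":
    d = Path(tempfile.mkdtemp())
    (d / "a.pdf").write_bytes(b"%PDF-1.7 ...")
    (d / "b.bin").write_bytes(b"\x00\x01")
    print(scan(sorted(d.iterdir())))
```

Swapping the signature lookup for a model call is the step Magika takes; the surrounding fan-out structure stays the same.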
  • Using AI to train AI. What could go wrong?

  • What is that for? (Score:5, Interesting)

    by test321 ( 8891681 ) on Thursday November 06, 2025 @08:26PM (#65778940)

I'm really curious about the use scenario for an AI reimplementation of the Unix "file" command that can "scan around 1,000 files per second on a modern laptop core". Who needs to speed up their existing bash scripts to analyse the file type of over 1,000 files per second on a laptop?

    • by flux ( 5274 )

An implementation of updatedb could make use of this, allowing users to search files by their identified type.

Security audits? Structured/unstructured data auditing can be pretty slow and resource-intensive. And it needs to happen on more than just Linux systems.

      This could make a very tedious task a lot easier. If you've ever had to do PII/PCI/ISO auditing, you probably know what I mean. And since it can probably find malicious executable code masquerading as some benign file, all the better.

  • by Gravis Zero ( 934156 ) on Thursday November 06, 2025 @08:35PM (#65778964)

I understand this is slightly more granular, but damn, it comes at a high cost in resources. While people may be able to look the other way at 100MB to replace a 10KB program, the computational cost is sooo MUCH higher.

    • Have you ever had to scan hundreds of endpoints and file servers, running a mix of operating systems, for unstructured confidential (PII, PCI, etc) data? Not really a lightweight job for an OS specific tool.

      That's where I see this being very useful.

      • Have you ever had to scan hundreds of endpoints and file servers, running a mix of operating systems, for unstructured confidential (PII, PCI, etc) data?

        No, literally never. I doubt many people have.

        • You're probably right that relatively few people have to deal with that sort of compliance crap. I have been that guy though. Be thankful you were not?
What's the point? File types are well defined. Even if you need to distinguish between "look-alikes", why wouldn't you use something that understands the basic common types first, and then do your "AI" nonsense to detect dialects only where it's relevant?

Because sometimes you can't trust that things are what they appear to be. Sure, maybe it claims to be a PDF, but that doesn't mean it isn't actually a disguised executable. Maybe that JPEG is actually a text file filled with financial data someone is trying to exfiltrate, and they had the sense to set the first four characters to match a JPEG signature along with a matching extension. It won't open, but it'll look like an image to anything that just checks extensions or FourCCs.

      If that isn't a concern, then yeah, filter out t

I think you misunderstood my point. It was not to rely on a file extension, but on the same definitions used by file(1).

No, I got it. The first four characters (4CC) of an image file are basically the AV version of the "magic number" file looks for in its second stage. It will recognize a PNG file by seeing the ASCII characters "PNG" at the beginning of the file. File will see that "magic number" and stop, assuming it has correctly identified a PNG file. But what if after that header there's executable code?
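The limitation being described can be shown in a few lines: a file(1)-style signature check looks only at the leading magic bytes, so anything appended after a valid header sails through.

```python
# Sketch of the two-stage check under discussion: a file(1)-style magic
# number test recognizes a PNG by its 8-byte signature and stops there,
# so any payload after the header goes unexamined.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"

def looks_like_png(data: bytes) -> bool:
    # Only the signature is checked; the remaining bytes are ignored.
    return data.startswith(PNG_MAGIC)

real_header = PNG_MAGIC + b"\x00\x00\x00\rIHDR"
disguised = PNG_MAGIC + b"MZ\x90\x00 not-actually-image-data"

print(looks_like_png(real_header))  # True
print(looks_like_png(disguised))    # True -- the signature check is fooled
```

A content-based classifier that samples bytes beyond the header is one way to catch the second case.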

          I'm with you on speeding the process up for low priority scans by filtering out the obvious, but that's also what

  • o rly? (Score:4, Informative)

    by Grady Martin ( 4197307 ) on Thursday November 06, 2025 @10:27PM (#65779142)

    $ find kernel.org/linux/include/linux/ -maxdepth 1 -type f | wc -l
    1235
    $ time file --mime-type -b -- kernel.org/linux/include/linux/* > /dev/null

    real 0m0.690s
    user 0m0.581s
    sys 0m0.033s

Two asks:

    - Need to know the electricity cost per file identified, comparing the old system versus the new system.
    - Need to know the relative frequency of each of the file formats, an important benchmark for others building systems that need to process lots of file types.

    For example: the percentage of files that can be 100% identified by a binary compare of the first 10 bytes (the magic number), versus the cost in CPU instructions and electricity of an AI classification of those same files.

  • by dskoll ( 99328 ) on Thursday November 06, 2025 @10:46PM (#65779174) Homepage

    This is a fine example of a completely pointless, stupid, BRAIN-FUCKING-DEAD application of AI. The UNIX file command has existed for decades. It works well and it's fast.

But for "security" purposes, we need to make a program 1000x as big, 100x as slow, and that uses orders of magnitude more electricity, so our email security software can distinguish text/json from text/jsonl.

    We have truly jumped the AI shark.

    • Sometimes, when a new tool looks useless, it's because you don't have a particular use for it. Someone else might see it and be appreciative about how much time and effort it will save. I saw this and started thinking about how long it took StealthAudit to scan all my endpoints.

      Just the other week, my brother-in-law got my wife to buy a little tool for spreading open a spring on my washing machine. $11, over my objection that it was silly to spend any money on something that would probably only be used

    • But for "security" purposes, we need to make a program 1000x as big, 100x as slow,

      Good sire, you offend me! It's at least 10000x as big and 100000x as slow!

Magika's deep learning system processes three sequences of 512 bytes, from the start, middle, and end of each file, allowing it to pick up unique structural, semantic, and content cues associated with different formats.
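The sampling scheme this comment describes can be sketched in a few lines. This is an approximation for illustration; the exact window-selection and padding rules in Magika may differ:

```python
# Rough sketch of the described sampling scheme: three 512-byte windows
# (start, middle, end) taken from a file's contents as model input.
# Window-edge and padding behavior here is an assumption, not Magika's code.
CHUNK = 512

def sample_windows(data: bytes, chunk: int = CHUNK) -> tuple[bytes, bytes, bytes]:
    if len(data) <= chunk:
        # Small files: every window sees the whole content.
        return (data, data, data)
    mid = (len(data) - chunk) // 2
    return (data[:chunk], data[mid:mid + chunk], data[-chunk:])

blob = bytes(range(256)) * 10  # 2560 bytes of sample data
start, middle, end = sample_windows(blob)
print(len(start), len(middle), len(end))  # 512 512 512
```

Fixed-size windows like these let inference cost stay constant regardless of file size, which is consistent with the throughput figures in the summary.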
  • The best way to detect file types is to make them unambiguously what they are by means of content type, file extension and structure. And fuck any format that decides to be something vague or ambiguous - they brought that problem on themselves.
Doesn't that put text/* on the "to be f-ked" list? Wouldn't life be a little harder without text/csv (the ambiguity between CSV and TSV having been mentioned in the article)? Think how much more complicated it would be if you had to pipe your data out to .xlsx instead.
