Python Open Source Security

How Python is Fighting Open Source's 'Phantom' Dependencies Problem (blogspot.com)

Since 2023 the Python Software Foundation has had a Security Developer-in-Residence (sponsored by the Open Source Security Foundation's vulnerability-finding "Alpha-Omega" project). And he's just published a new 11-page white paper about open source's "phantom dependencies" problem — suggesting a way to solve it.

"Phantom" dependencies aren't tracked with packaging metadata, manifests, or lock files, which makes them "not discoverable" by tools like vulnerability scanners or compliance and policy tools. So Python security developer-in-residence Seth Larson authored a recently-accepted Python Enhancement Proposal offering an easy way for packages to provide metadata through Software Bill-of-Materials (SBOMs). From the whitepaper: Python Enhancement Proposal 770 is backwards compatible and can be enabled by default by tools, meaning most projects won't need to manually opt in to begin generating valid PEP 770 SBOM metadata. Python is not the only software package ecosystem affected by the "Phantom Dependency" problem. The approach using SBOMs for metadata can be remixed and adopted by other packaging ecosystems looking to record ecosystem-agnostic software metadata...

Within Endor Labs' [2023 dependencies] report, Python is named as one of the packaging ecosystems most affected by the "Phantom Dependency" problem. There are multiple reasons that Python is particularly affected:

- There are many methods for interfacing Python with non-Python software, such as through the C-API or FFI. Python can "wrap" and expose an easy-to-use Python API for software written in other languages like C, C++, Rust, Fortran, WebAssembly, and more.

- Python is the premier language for scientific computing and artificial intelligence, meaning many high-performance libraries written in system languages need to be accessed from Python code.

- Finally, Python packages have a distribution type called a "wheel", which is essentially a zip file that is "installed" by being unzipped into a directory, meaning there is no compilation step allowed during installation. This is great for being able to inspect a package before installation, but it means that all compiled languages need to be pre-compiled into binaries before installation...
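As an illustration of that last point, a wheel can be opened with any zip tool, and the pre-compiled binaries it bundles are directly visible. A minimal sketch (the wheel filename is a hypothetical placeholder):

    import zipfile

    # List the compiled artifacts bundled inside a wheel.
    wheel = "example_pkg-1.0-cp312-cp312-manylinux_2_28_x86_64.whl"
    with zipfile.ZipFile(wheel) as whl:
        for name in whl.namelist():
            if name.endswith((".so", ".dylib", ".dll")):
                print(name)  # shared libraries vendored into the wheel

Binaries like these are exactly the kind of "phantom" component that the package's own metadata does not describe.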


When designing a new package metadata standard, one of the top concerns is reducing the amount of effort required from the mostly volunteer maintainers of packaging tools and the thousands of projects being published to the Python Package Index... By defining PEP 770 SBOM metadata as using a directory of files, rather than a new metadata field, we were able to side-step all the implementation pain...
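Because the SBOM documents live as ordinary files inside a package's metadata directory, a scanner needs no new APIs to find them. A minimal sketch of the discovery side, assuming the sboms/ directory layout described in PEP 770 and a hypothetical installed package name:

    from importlib.metadata import distribution

    # Look for PEP 770 SBOM documents shipped in the .dist-info directory.
    dist = distribution("example-pkg")  # hypothetical package
    for path in dist.files or []:
        if ".dist-info/sboms/" in str(path):
            print(path, "->", path.locate())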

We'll be working to submit issues on popular open source SBOM and vulnerability scanning tools, and gradually, Phantom Dependencies will become less of an issue for the Python package ecosystem.

The white paper "details the approach, challenges, and insights into the creation and acceptance of PEP 770 and adopting Software Bill-of-Materials (SBOMs) to improve the measurability of Python packages," explains an announcement from the Python Software Foundation. And the white paper ends with a helpful note.

"Having spoken to other open source packaging ecosystem maintainers, we have come to learn that other ecosystems have similar issues with Phantom Dependencies. We welcome other packaging ecosystems to adopt Python's approach with PEP 770 and are willing to provide guidance on the implementation."


Comments Filter:
  • It sounds and looks like this just means "files at a specific path that weren't included in the package."

    Why is this not a bug?

    • No, it is system libraries bundled with a wheel without proper listing in the SBOM, so CVE scanning tools won't find them and therefore can't give a security warning. It looks like a missing feature in the distribution format. To me the root cause seems to be that there are so many package managers that making a proper interoperable system is impossible. Each language has its own package manager. Linux has at least 3 commonly used package systems. macOS has some. Etc. When a Python package needs
      • There are two reasons there are so many package managers:

        1. Different distros have different filesystem paths and package names they need to support, typically for historical reasons. Most distros are based off of Fedora / Debian, which have their own support contracts that they need to uphold, which means the derived distros take on Fedora's / Debian's package burden when they are made. The technical reason is the lack of a side-by-side installation filesystem schema that the distros
        • Point 1 is way more complicated... Sorting out paths is the easiest, then you need to figure out dynamic linking/import hierarchies, and in the end you will have 2 distros in the same filesystem side by side. That's AppImage/Snap/Flatpak.

          The second is a "devops" problem: developer, integrator and distributor should be separate entities. All three would constantly push and pull with different requirements.

          Language repositories could be a thing, if they could be integrated directly into the distro package manager.

  • FTA:
    " Python is the premier language for scientific computing and artificial intelligence, meaning many high-performance libraries written in system languages need to be accessed from Python code."

    Which has for a long time made me wonder why the entire program isn't written in system languages such as C++. Given how critical performance is to these paradigms, and the amount spent on GPU hardware for acceleration, you'd think ditching a slow language such as Python would be a no-brainer even if maybe as a wrapp

    • Tensorflow was written in C++ and Python, so you can choose which language you want to use.

      Data scientists and machine learning specialists tend to know Python and not C++, so that's why all these libraries are written for use with Python.
      • by Viol8 ( 599362 )

        Frankly if they can learn the maths around data science and AI then learning a new programming language such as C++ shouldn't be much of a problem for them (though the C++ steering committee seem to be doing their best to turn it into a dog's dinner write-only language).

        • It's not that C++ is a problem, it's that they all know Python, and they would rather put their effort into studying improvements in activation functions than learning C++. And actually C++ is not trivial, it does take time to learn.
          • by Viol8 ( 599362 )

            I guess if they or their organisation want to spend money on extra compute power to offset using Python instead of spending a few weeks getting up to speed on C++, that's up to them. Sure, C++ isn't trivial, but it's hardly Klingon either, and they wouldn't need to learn all the obscure dusty parts of its cupboard to use it in this arena.

        • Re:AI and scientific (Score:5, Interesting)

          by superposed ( 308216 ) on Monday August 11, 2025 @12:09PM (#65581770)

          Frankly if they can learn the maths around data science and AI then learning a new programming language such as C++ shouldn't be much of a problem for them

          Both of the programs below will read columns a and b from a parquet file, multiply them elementwise using vectorized operations, and save the original columns plus the new one to a new parquet file. For large datasets, the Python and C++ versions will take equally long to run. But the Python one is much faster to write and debug and easier for partners to understand. The dependency installation process will also be identical across platforms for the Python version. That is why people use Python for data science.

          Python version:

          import pandas as pd
          df = pd.read_parquet("data.parquet", engine="pyarrow")
          df["c"] = df["a"] * df["b"]
          df[["a", "b", "c"]].to_parquet("output.parquet", index=False, engine="pyarrow")

          C++ version:

          #include <iostream>
          #include <memory>

          #include <arrow/api.h>
          #include <arrow/compute/api.h>
          #include <arrow/io/api.h>
          #include <parquet/arrow/reader.h>
          #include <parquet/arrow/writer.h>

          using arrow::Status;

          int main() {
            // Open the input Parquet file
            auto infile_result = arrow::io::ReadableFile::Open("data.parquet");
            if (!infile_result.ok()) {
              std::cerr << "Failed to open data.parquet: " << infile_result.status().ToString() << "\n";
              return 1;
            }
            std::shared_ptr<arrow::io::ReadableFile> infile = *infile_result;

            // Read it as an Arrow Table
            std::unique_ptr<parquet::arrow::FileReader> reader;
            auto st = parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader);
            if (!st.ok()) {
              std::cerr << st.ToString() << "\n";
              return 1;
            }

            std::shared_ptr<arrow::Table> table;
            st = reader->ReadTable(&table);
            if (!st.ok()) {
              std::cerr << st.ToString() << "\n";
              return 1;
            }

            // Fetch columns a and b
            auto a_col = table->GetColumnByName("a");
            auto b_col = table->GetColumnByName("b");
            if (!a_col || !b_col) {
              std::cerr << "Input must contain float columns named 'a' and 'b'.\n";
              return 1;
            }

            // Ensure float64 and compute c = a * b using Arrow's vectorized kernels
            auto f64 = arrow::float64();
            auto a_cast_res = arrow::compute::Cast(arrow::Datum(a_col), f64);
            auto b_cast_res = arrow::compute::Cast(arrow::Datum(b_col), f64);
            if (!a_cast_res.ok() || !b_cast_res.ok()) {
              std::cerr << "Failed to cast columns to float64.\n";
              return 1;
            }
            arrow::Datum a64 = *a_cast_res;
            arrow::Datum b64 = *b_cast_res;

            auto mult_res = arrow::compute::CallFunction("multiply", {a64, b64});
            if (!mult_res.ok()) {
              std::cerr << "Multiply failed: " << mult_res.status().ToString() << "\n";
              return 1;
            }
            arrow::Datum c = *mult_res;

            // Build the output table (a, b, c)
            auto schema = arrow::schema({
                arrow::field("a", f64), arrow::field("b", f64), arrow::field("c", f64)});

            auto out_table = arrow::Table::Make(
                schema, {a64.chunked_array(), b64.chunked_array(), c.chunked_array()});

            // Write the output Parquet file
            auto outfile_result = arrow::io::FileOutputStream::Open("output.parquet");
            if (!outfile_result.ok()) {
              std::cerr << "Failed to open output.parquet for writing: " << outfile_result.status().ToString() << "\n";
              return 1;
            }
            std::shared_ptr<arrow::io::OutputStream> outfile = *outfile_result;

            st = parquet::arrow::WriteTable(*out_table, arrow::default_memory_pool(), outfile, /*chunk_size=*/1024);
            if (!st.ok()) {
              std::cerr << "WriteTable failed: " << st.ToString() << "\n";
              return 1;
            }

            return 0;
          }

          • I would point out that your Python version doesn't have any error handling, the very thing that requires the most lines in the C++ version.

          • You seem to have picked a rather obscure file format (no, most people don't use Apache Hadoop) in order to make the C++ code far more complicated than it would be just reading a CSV file and multiplying as appropriate. And as someone else has said, your Python code has no error checking; it'll just bomb out if there's a problem.

              You seem to have picked a rather obscure file format (no, most people don't use Apache Hadoop) in order to make the C++ code far more complicated than it would be just reading a CSV file and multiplying as appropriate. And as someone else has said, your Python code has no error checking; it'll just bomb out if there's a problem.

              The built-in Python error handling is pretty equivalent to the explicit handling in the C++ program: file not found, missing column a, etc.

              I picked parquet because I see it a lot for data science workflows. If you'd rather use CSV, just change 'parquet' to 'csv' in the Python code and drop the "engine" argument (it's not actually needed anyway). I doubt the change would be so simple in the C++ code, or that you would end up with a C++ program as easy

              • by Viol8 ( 599362 )

                The Python error "handling" would take the form of an exception being thrown and the program being dumped back to the command line. Hardly ideal.

                Also I'd never claim that C++ code would be as short as Python, however the difference wouldn't be as large as your example suggests.
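                For what it's worth, giving the Python version checks equivalent to the explicit ones in the C++ program only takes a few extra lines. A minimal sketch (same files as in the example above):

                import pandas as pd

                # Mirror the explicit checks from the C++ version.
                try:
                    df = pd.read_parquet("data.parquet")
                except FileNotFoundError:
                    raise SystemExit("Failed to open data.parquet")
                if not {"a", "b"}.issubset(df.columns):
                    raise SystemExit("Input must contain columns named 'a' and 'b'.")
                df["c"] = df["a"] * df["b"]
                df[["a", "b", "c"]].to_parquet("output.parquet", index=False)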

      • Re:AI and scientific (Score:4, Informative)

        by HiThere ( 15173 ) <(ten.knilhtrae) (ta) (nsxihselrahc)> on Monday August 11, 2025 @11:37AM (#65581654)

        Additionally, if the heavy compute stuff is done in a compiled language, for efficiency, it doesn't matter much that there's a slow Python layer on top of it. That's often limited by I/O speed rather than by compute speed anyway.

  • by Anonymous Coward

    There is no reason, none whatsoever, for Python to impose this crap on its package managers. Yes, the government and big banks want SBOMs from their software vendors, but that sounds like a "them" problem, not a "Python" problem. They can use Python or not; it makes no difference to the Python maintainers. Just rolling over and doing what the feds want on a hobby project. Awesome.

    The bigger problem is that SBOMs in no way shape or form actually provide for security which is the goal. The security problems are fro

    • by Entrope ( 68843 )

      Merely having an SBOM doesn't solve the problem of having a software package with some security bug, but without an SBOM you're not likely to even know you have a problem. SBOMs are needed in a complex software ecosystem, but yes, they do need to be complete and correct.

      • by Junta ( 36770 )

        Problem being that if the tooling is so hopeless it can't actually do an analysis of an installed instance and correctly identify the SBOM material itself, it's likely the SBOM is going to be of low quality anyway, with mostly false positives.

        I participate routinely in SBOM reviews from one of the platforms in the industry. It's like 95+% false positives. There's another tool which does more than that; it does 'SBOM'-type stuff, but instead of trying for the 'application', it does the 'environment' and is mu

    • by Bongo ( 13261 )

      True.

      There's also the habit of treating security like it's a general state, when it's a highly contextual matter. The more sensitive and important a system's function and data, the smaller and more guarded and conservative it has to be, because checking is so difficult and time-consuming that you can't afford to protect anything but the smallest and simplest system.

    • You misunderstand institutional security. The end sought is not true security, but rather plausible deniability. The institutional users want to be able to point to this SBOM and say, "It passed all our security audits," not "Our analysis missed the security vulnerability which brought down our systems." No one in a large institution wants to take the blame for the inevitable security vulnerability, so things like the SBOM provide the requisite blame deflection back to the package maintainer. This wa
