PyPI Is Reducing Stored IP Address Data (theregister.com)
The PyPI registry of open source Python packages "began evaluating ways to reduce the amount of identifying information that it stores," reports the Register, "even before the U.S. Justice Department came asking for data on suspect users."
But now, "the Python community package registry wants developers to understand that it's working to minimize the user data that it stores." The goal is not to be unable to respond to lawful requests for information; rather it's to store only the minimum amount of data necessary so as not to expose users to unnecessary privacy intrusion. Coincidentally, data minimization may prevent organizations from becoming a preferred source of on-demand surveillance: having excessive amounts of information about users invites legal demands, which staff then have to handle...
Mike Fiedler, a member of the PyPI admin team, said in a statement on Friday that the organization's effort to improve user privacy and security dates back to 2020. Since the receipt of the subpoenas in March and April, that effort has been reinvigorated.
Much of the concern focuses on IP address data, which gets stored in conjunction with web log access; user events such as logins; project events including uploads; events associated with recently introduced organizations; and administrative PyPI journal entries. According to Fiedler, PyPI was able to stop storing IP data for journal entries — an append-only transaction log — because these were only exposed to administrators... To obscure IP addresses, PyPI is salting them — adding an arbitrary value — and then hashing them — running the data through a one-way scrambling function that creates a value called a hash. This provides a way to store a reference to potentially identifying data without actually storing raw data... PyPI has been using its CDN provider Fastly to pass along a salted hash of the IP address for requests via a custom header, along with broad GeoIP data (the country and city where the user is located), and is using that instead of the raw IP address. In April, the registry adopted code changes for hashing and salting IP addresses for requests that PyPI handles directly in Warehouse, the web application that implements the official Python package index.
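The salt-then-hash step described above can be sketched in a few lines of Python. This is only an illustration, not Warehouse's or Fastly's actual code: the function name `hash_ip`, the choice of SHA-256, and the randomly generated `SALT` are all assumptions made for the example.

```python
import hashlib
import secrets

# Hypothetical salt for illustration. A real service would load a fixed
# secret from its configuration so that hashes stay comparable over time.
SALT = secrets.token_bytes(32)


def hash_ip(ip_address: str, salt: bytes = SALT) -> str:
    """Return a salted SHA-256 hex digest of an IP address.

    The same address and salt always give the same digest, so records can
    still be correlated without the raw address being stored.
    """
    return hashlib.sha256(salt + ip_address.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # 192.0.2.1 is a documentation-only address (RFC 5737).
    print(hash_ip("192.0.2.1"))
```

Because the IPv4 address space is small, a scheme like this only obscures addresses while the salt stays secret; anyone holding the salt could hash every possible address and match the results.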
And over the past few days, it has been replacing IP addresses in the PyPI user interface with geolocation data. PyPI still relies on IP address information to identify abuse (the creation of malicious packages, harassment, and so on), but Fiedler says even that is being looked at. "We're thinking about how to manage that without storing IP data, but we're not there yet," he said. Fiedler says the PyPI team will be weighing whether it can remove IP data from event history records after a period of time and whether the service can handle all its requests via CDN.
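Removing IP data from event records after a period of time could look roughly like the sketch below. Everything in it is hypothetical: the 90-day `RETENTION` window, the `scrub_old_ip_data` name, and the event fields `created` and `ip_hash` are assumptions for illustration, not PyPI's schema or a decision the team has announced.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention window; the source does not name a period.
RETENTION = timedelta(days=90)


def scrub_old_ip_data(events, now=None, retention=RETENTION):
    """Blank the IP field on events older than the retention window.

    `events` is a list of dicts with a timezone-aware `created` datetime
    and an `ip_hash` value (both field names are assumed). Returns the
    number of events that were scrubbed.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - retention
    scrubbed = 0
    for event in events:
        if event["created"] < cutoff and event.get("ip_hash") is not None:
            event["ip_hash"] = None  # keep the event, drop the identifier
            scrubbed += 1
    return scrubbed
```

The event itself is kept so the history stays intact; only the identifying field is cleared.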
Re: (Score:2)
But then you can’t monetize it.
Re: (Score:2)
Would be nice if more projects would take the hint, _BEFORE_ they get subpoenaed.
Remember, kids: Data is Toxic. Don't keep anything around you don't want to have to give out to whoever might subpoena you, or just raid your offices and take All The Data.
I think the PyPi folks would have completed their project sooner had some of their Python packages not gotten corrupted.
So friendly, these goons (Score:2)
So friendly, these law enforcement goons. Glad people correctly see them as the enemy.
Re: (Score:2)
Do you have a thin blue line sticker on your car?
Re: So friendly, these goons (Score:2)
Did you misread? My comment was calling the police goons.
Re: (Score:3)
Lots of people love local cops because they harass “undesirables” but hate the federal agencies because they investigate right wing terrorists.
Re: (Score:3)
Funny thing: the ones who bother with subpoenas really are friendly, relatively, because they approach you and demand the information. The other 99% just sneakily take it.
Yes, LE sneakily takes it too, but not always. Sometimes they subpoena, instead. And that's what you're usually going to hear about in the news. Usually.
If you do happen to solve the problem with the sneaky ones, then you'll solve the LE problem as well, since there will be nothing extant for LE's subpoenas to demand. If your defense can't
Re: (Score:3)
Or more correctly, answer the question "why are we keeping so much personal data in the first place?"
Why are websites asking for so much personal data? There's only a limited amount you need to operate on - many times not much more than a username and password. Asking for any other information - name, address, etc., is already asking for too much, and you're now logging IP addresses too?
What sites are starting to realiz