The Environmental Impact of PHP Compared To C++ On Facebook 752
Kensai7 writes "Recently, Facebook provided us with some information on their server park. They use about 30,000 servers, and not surprisingly, most of them are running PHP code to generate pages full of social info for their users. As they only say that 'the bulk' is running PHP, let's assume this to be 25,000 of the 30,000. If C++ had been used instead of PHP, then 22,500 servers could be powered down (assuming a conservative ratio of 10 for the efficiency of C++ versus PHP code), or a reduction of 49,000 tons of CO2 per year. Of course, it is a bit unfair to isolate Facebook here. Their servers are only a tiny fraction of computers deployed world-wide that are interpreting PHP code."
First post from TFA nails it (Score:2, Informative)
Please (Score:2, Informative)
I don't care about your environmentalism.
Re:php is bad for the environment (Score:5, Informative)
Morons. (Score:1, Informative)
This is why people don't take global warming seriously. Please, just stop it. If you really wanted to help, you could just fucking kill yourself and cut your carbon footprint to 0.
Re:Sounds like cheap C-- drugs ! (Score:3, Informative)
Re:people use PHP? (Score:4, Informative)
Re:Please (Score:5, Informative)
Re:10:1... Really? (Score:1, Informative)
A cached C++-generated page takes just as much server load as a cached PHP-generated page. Better?
Author needs a clue about metrics (Score:5, Informative)
Yes, PHP is a heck of a lot slower on processor-bound tasks than C++. In a pure benchmarking contest, C++ will no doubt win.
But what about when both languages have to query a database (be it MySQL/Postgres/Oracle, etc.)? In that case, both are blocked on the speed of the database: a 15 ms query takes 15 ms no matter what language is asking. Facebook is not calculating pi to 10 gazillion digits, and it is not checking factors for the Great Internet Mersenne Prime Search. It is serving up pages containing tons of customized data. This is not processor-bound... it is I/O bound, both on the ins and outs of the database and the ins and outs of the HTTP request. It is also processor-bound on the page render, but the goal of this many machines is to cache to the point where page renders are eliminated.
Once a page is rendered, it can be cached until the data inside it changes. For something like Facebook, I bet a page is rendered once for every ~10 times it is viewed. Caching is done in RAM, and large RAM caches take a lot of machines.
So let's look at those 30,000 machines not by their language, but by their role. We can argue the percentages to death, but let's assume 1/3rd are database, 1/3rd are cache, and 1/3rd are actually running a web server, assembling pages, or otherwise dealing with end users directly. (BTW, I think 1/3rd is way high for that.)
So 1/3rd of the machines are dealing with page composition and serving pages. If they serve a page ~10 times for every render, then only about 1/10th of the page requests actually cause a render... the rest are served from cache. Those page renders are I/O bound, as in the example above: waiting on the database (and other caches, like memcached), so even if they rack up a lot of wait cycles, they are not using processor power on the box. The actual page composition (which might be 20% of the processing those boxes do) would be a lot faster in C++. So of those 10,000 servers, the virtual equivalent of 2,000 are generating pages using PHP, and they could be replaced by 200 boxes running page generation written in C++.
So the choice of PHP is adding ~1,800 machines to the architecture, or ~6% of the total 30,000. Given that a PHP developer is probably 10x more productive than a developer in C++, is the time to market on new features worth that to them? I bet it is.
Re:where did he get this factor? (Score:3, Informative)
Re:No. (Score:3, Informative)
Very true: they are a big contributor to projects like memcached.
Re:10:1... Really? (Score:3, Informative)
PHP's primary issue in the database department is that it doesn't have a clean way of, say, maintaining prepared statement declarations across connection instances. Which is frustrating. APC's handling of shared memory is not the best either, and the memcached extensions for it need polish. Don't get me started on how PHP treats constants.
Where PHP really fails, however, is in memory usage. It takes up dozens of times as much RAM as a well-built C program would. Facebook would not reduce their server count by a factor of ten because PHP is that much less efficient at its job, but because more memory would be available on a given machine to handle more instances at once.
Note: PHP 5.3 addresses a lot of this, but though I haven't tested it, I doubt PHP's memory efficiency will get far into the double-digit percentages of C++'s in one shot.
Re:it's stoopid because (Score:2, Informative)
A C app would be much faster (Score:2, Informative)
The proposed ratio of 1:10 is real, if not bigger. And here's why:
1.) For each request, PHP has to load the entire application responsible for that particular response, including its configuration, etc. With memcache(d), you have to instantiate connection classes and reconfigure them per request. Languages like C/C++, Python and Ruby have a different architecture to begin with: they load ONCE, and each request triggers a FUNCTION or METHOD of a class, with all the app-specific configuration, DB and memcached connections done and configured at app init, NOT per request.
2.) TFA mentions microsecond relevance! Even a simple echo "Hello World" takes much more time in PHP than the equivalent in C. I have yet to see a PHP hello-world app that does it in under 1 ms, let alone the microseconds required.
3.) Arrays in PHP are slow, being always hashmaps. Other data structures can speed things up; you don't always need hashmaps. SplFixedArray is a joke, btw, and available only as of 5.3. It can't compare to a vector anyway, and lots of fixed structures can be represented by structs or classes in C, which are faster than in PHP anyway. Also, the app can instantiate them once at init, and just (re)load them when required.
4.) Even if all the app does is parse input vars and call memcache(d)/database funcs/methods to retrieve/store data, those calls are faster in C. Params can be parsed quicker in C, not requiring hashmaps, for instance.
5.) FastCGI is crap. If this app were done in C, it would need its own HTTP layer, epoll-based (for Linux). It could strip out all the parts of HTTP that are not required to parse the AJAX calls, and would not need to be "generic" enough to deliver static content.
6.) For such dedicated and distributed deployments, garbage collection is sometimes not required. For instance, fixed-length structures can be preallocated at app init, and the app can grab as much RAM as possible on startup. Yes, that would cap the MAX number of users/connections per server, but so what? The app dominates the server; nothing else is required to run (except the basic OS environment for the app), so fixed memory consumption is not a problem.
7.) Even though each request has to wait for I/O of some sorts, either from memcache(d), from disk or from DB, you can process much more of these per front-end server and just scale backend servers as required. For example, with PHP your front-end server can serve 100k/sec, having X DB backends and Y memcached backends. With a C application, the front end can serve, say, 1M/sec. You still get to keep one front-end, even though you had to put more backends.
In short, you can significantly reduce the number of servers required if the app was written in C.
Re:Interpreted Languages... (Score:4, Informative)
For example, consider the following. Say bad things about PHP all you want (it deserves it), but one thing you don't generally see with PHP code is a buffer overflow, where you copy a bunch of strings and concatenate them together, run out of room without noticing, and go clobbering memory. That's because the string manipulation code goes through a bunch of checks whenever you append strings. You can't just skip those checks and hope that everything will work the same. You may know that such-and-such a code path doesn't need all the bounds checking because you're, say, I dunno, assembling fixed-length ZIP+4 codes or something, but the scripting language can't be informed of that fact by any extant mechanism (nor is it clear how you could integrate such a mechanism with the powerful abstraction that lets you not worry about the rest of your strings to begin with).
Moreover, as has already been pointed out, a lot of the computational price of rendering a web page is database queries and memory-cached-object queries which employ compiled code already. The string-manipulation overhead isn't all that significant compared to the abstraction that it buys you. It's probably a better idea to track down logic issues, where your code does stupid useless computations that it doesn't need that make it slow, or could do certain computations in advance to make it faster, or such.
I think there's a lot more potential for interesting machine optimization of code for things coming from the functional paradigm, where you can mathematically show the equivalence of certain portions of code with its optimized replacement, and that this paradigm will be making a resurgence in some places during the upcoming era of 128-core processors. This might be interesting.
Re:10:1... Really? (Score:1, Informative)
Large websites use caching on the server side, as well as the client side.
At the basic level, you can make sure your server-side code spits out proper caching headers, and stick a caching reverse proxy in front of it (like Squid). Any page that's in the cache will be served directly out of the cache and won't even touch the server. Even when a request does get through, checking that the page content hasn't changed and returning a 304 response is going to be just as fast in PHP as in C++: the time taken to set up and tear down the network connection will dwarf the server code.
Even better - you can cache pages, parts of pages, or raw data in the server application, and stick them in some kind of shared, distributed memory cache (like memcache, which Facebook use). Most of the time, the server code will be grabbing some results from the cache, sticking it in a template, and sending it to the client. Since the server code isn't doing much, it's going to be fast no matter what language it's written in.
Worst case - you actually have to build the entire page, with data taken from the actual database. That means you need to connect to the database server, issue some SQL queries, stick the results in a template, and send that to the client. The slowest part here is unquestionably going to be the database - it hardly matters which language you're using, since you're talking to the same database server.
Re:php is bad for the environment (Score:4, Informative)
From my personal experience: data-heavy applications run at a complete crawl in PHP. "10 times slower" is, in my opinion, a vast understatement.
Then again, that’s not the point of PHP. The point is that in PHP, provided you already know how to program, you also get things done more than 10 times faster than in C++, because there is a simple function with defaults and automatisms for literally everything.
It’s only when those defaults and automatisms are other than what you expect that you get into big trouble. And because the PHP interpreter is truly a horrible piece of shit (I was able to run totally illegal constructs, with plain text right in the middle of the code, and it ran, doing nothing of what I expected it to do), that happens quite a lot.
It’s one reason that drove me to the extreme strictness of Haskell, where you have to get it right upfront, so it doesn’t bite you in the ass later.
Re:No. (Score:2, Informative)
MemCache.
Last time I read about Facebook (2008), they had 800+ machines running memcache, providing somewhere in the region of 28 TB of total memory.
Then the database is sharded across thousands of database servers.
I think our C++ friend is rapidly running out of servers to run his efficient C++ on....
Re:10:1... Really? (Score:3, Informative)
In terms of total page delivery latency for a typical I/O bound application, sure. In terms of actual cpu usage, 10x overhead for any dynamically typed language is to be expected. If the application servers are CPU bound, that means a lot more servers.
In addition, dynamically typed languages do not compile or JIT as well as statically typed languages, which severely limits the overhead reduction achievable.
Re:people use PHP? (Score:2, Informative)
So your entire argument is "nuh-uh?"
Re:Interpreted Languages... (Score:5, Informative)
Actually, Facebook uses APC [facebook.com] to compile the code once and cache the optimized opcodes in shared memory, so it doesn't have to be compiled over and over again.
There are other libraries for caching PHP functions on many different levels as well, and they're open source, for the most part. Some real bright minds from Facebook and other large PHP applications have contributed to them.
Bottom line: PHP is quite powerful and efficient when built and extended properly.
Figures off by a factor of 10 to 100 (Score:3, Informative)
My own experience doing server development in C was that it's a minimum of 30:1 (and in some cases, much greater). Plus the speed differential is huge, and also in favour of C.
There's a big difference between a couple of hundred requests a second and 6,000 - 10,000.
Then again, the PHP code had to be served through Apache, while the C code was served directly by a custom server sitting on a separate socket, so there's no telling how much of the overhead was from Apache.
Even the absolute worst-case scenarios were well over 10:1.
Re:Umm... no. (Score:3, Informative)
the author...has no clue about what uses most time (waiting for database results mostly)
Like many here, you are confusing page delivery latency with total processor overhead. If you need more than one processor for page processing, how many you need has little or nothing to do with how much latency there is elsewhere in the system.
Re:10:1... Really? (Score:1, Informative)
No. Most time is usually spent generating pages: the Wikipedia server role [wikimedia.org] page shows it clearly.
Yeah, statistics, they are all lies if they don't support my current opinions ;). See for yourself [debian.org]. PHP is between 3 to 116 times slower than C++.
Re:Assuming... (Score:3, Informative)
Latency is a different question than efficiency. If your page generation efficiency is bad, on a small setup the difference may be imperceptible. On a large installation, i.e. one with a large number of servers dedicated to page generation, the efficiency of those servers makes a big difference. Holding latency constant, in a large installation less efficient page generation means more servers. In a small installation, not so much.
Re:Figures off by a factor of 10 to 100 (Score:4, Informative)
Those were actual benchmarks run at peak load for 5-minute periods: a sustained rate of over 600,000 queries in 5 minutes, or about 2,000 per second (around 2,200 IIRC), on absolutely craptastic hardware, against an 8 GB MySQL table. The benchmark was run with ab (Apache Benchmark) against a custom forking server instead of Apache, tested with between 100 and 400 simultaneous requests. Threads were never "reaped", always reused, so it was important that there were no memory leaks, but never having to spawn another thread after initial startup also contributed to the difference.
Contrast that to PHP, where every script has to be loaded, interpreted, then flushed out of the system so it leaves a clean memory footprint for the next script, and where tons of variables that your script may never use have to be initialized on each run. Obviously, compiling only what you need and loading it once is more efficient :-)
Re:people use PHP? (Score:3, Informative)
I remember when it was the script kiddie's substitute for cgi-perl. What does it offer from a theoretical and engineering PoV, apart from a Visual Basic learning curve?
Market penetration. From a managerial perspective, you can hire PHP developers a dime a dozen, and replace them very quickly if needed. From a developer's perspective, you can grab any of those "PHP in 10 nanoseconds for complete idiots" books and an Apache+PHP+MySQL bundle installer for Windows, and learn it in a few days to a level sufficient to be hired.
Of course, the typical quality of a PHP solution is what you'd expect from such an approach, but when did that ever stop anyone?
If you mean technological advantages, then there are none whatsoever. As a language, PHP today is essentially Java with weak typing, no proper packages, namespaces only just introduced (so no existing library uses them), and some very questionable language design decisions (like $a[1] being the same element as $a["1"], while $a["01"] is distinct).
From a library perspective, the coverage is okay - about what you'd expect from a decent modern platform - but the API design is essentially random and inconsistent, with no common guidelines followed, and things such as Unicode support are usually an afterthought.
In short, there's nothing there over Python or Ruby, or even Groovy or Boo.
As to why it ended up in the spot it is in? Well, it's actually fairly obvious when you look at the history. PHP version 4, the point at which its popularity skyrocketed, was released in 2000. At that point the established frameworks were ASP (not ASP.NET - that didn't exist yet), JSP, and ColdFusion. Mentions of MVC at this point, in the context of Web development, would just earn you some blank stares; at best, some particularly advanced Java devs would be aware of "model 2" [wikipedia.org], handcoded via servlets and JSPs...
ColdFusion was both getting dated, and cost $$$. The latter bit especially meant that it was right out for many.
ASP was really simplistic, with VBScript as a primary language (and that was much more primitive than PHP), no decent IDEs, and not exactly fast either; also, while it was kinda free itself, you needed IIS to run it, and that (in 2000, remember?) came with Win2K, which not everyone in the "casual newbie developer" group had or even wanted to have, and which was more expensive than 9x/ME (meanwhile, Apache ran on 9x).
JSP itself was okay in this context, but it had two problems compared to PHP. First of all, Java is still a rather verbose language, and that complexity showed when you didn't have anything like modern frameworks mapping requests to beans etc. At that point, you had to work with raw request parameters (strings!), query the database in raw SQL, and output plain text data (strings!) - and while you could do all that in Java, the corresponding PHP code was usually much shorter. As well, no-one in PHP land cared about the theoretical advantages of database decoupling that JDBC gave you, because they (we, really; I was doing it at that time as well) just hardcoded mysql_* calls, because that's all that was expected to be supported in the foreseeable future.
The other problem JSP had was setting it all up. Today, you can just download Netbeans and get it all out of the box configured properly; back then, it usually involved getting the JDK first, then downloading and configuring Tomcat (not for the faint of heart, either). I don't recall seeing any all-in-one, one-click setup bundles like there were for PHP.
And documentation. Oh yes, the myth that "PHP has the bestest docs" still persists (witness various fanboi replies in this thread). It hasn't been true for a few years at least, but back then it definitely was. The big deal was that the PHP manual was somewhat tutorial-like: something you could read without having any clue as to how it all works, and get the general idea along with the minimum of details you actually needed to get it all working. Meanwhile
Re:A C app would be much faster (Score:2, Informative)
The proposed ratio of 1:10 is real, if not bigger. And here's why:
1.) For each request, PHP has to load the entire application responsible for that particular response, including its configuration, etc. With memcache(d), you have to instantiate connection classes and reconfigure them per request. Languages like C/C++, Python and Ruby have a different architecture to begin with: they load ONCE, and each request triggers a FUNCTION or METHOD of a class, with all the app-specific configuration, DB and memcached connections done and configured at app init, NOT per request.
With opcode caches like APC, that overhead is very much mitigated. PHP can also use persistent connections to memcached and the database to minimize connection setup delays.
2.) TFA mentions microsecond relevance! Even a simple echo "Hello World" takes much more time in PHP than the equivalent in C. I have yet to see a PHP hello-world app that does it in under 1 ms, let alone the microseconds required.
helloworld.php takes 0.363ms on average here on my laptop.
3.) Arrays in PHP are slow, being always hashmaps. Other data structures can speed things up; you don't always need hashmaps. SplFixedArray is a joke, btw, and available only as of 5.3. It can't compare to a vector anyway, and lots of fixed structures can be represented by structs or classes in C, which are faster than in PHP anyway. Also, the app can instantiate them once at init, and just (re)load them when required.
PHP can also instantiate things once, with the use of the APC cache. It caches opcodes (and thereby constant values/arrays), and you can also cache any data you want; loading that data is fast, since APC is written in C (only a small overhead).
4.) Even if all the app does is parse input vars and call memcache(d)/database funcs/methods to retrieve/store data, those calls are faster in C. Params can be parsed quicker in C, not requiring hashmaps, for instance.
Is waiting on I/O faster in C than in PHP? Nope. So if you're mainly doing database lookups, you'll see an extremely small speedup porting your code to C.
5.) FastCGI is crap. If this app were done in C, it would need its own HTTP layer, epoll-based (for Linux). It could strip out all the parts of HTTP that are not required to parse the AJAX calls, and would not need to be "generic" enough to deliver static content.
I know. I'd like to get rid of that crap too. But is it really worth it? Making a very efficient server with all the capabilities your application has now, with all the error checking, testing and security protections, would take months. HTTP is more complicated than you'd think [wordpress.com]. Is the small speedup you'd gain worth the maintenance of a much larger application?
6.) For such dedicated and distributed deployments, garbage collection is sometimes not required. For instance, fixed-length structures can be preallocated at app init, and the app can grab as much RAM as possible on startup. Yes, that would cap the MAX number of users/connections per server, but so what? The app dominates the server; nothing else is required to run (except the basic OS environment for the app), so fixed memory consumption is not a problem.
7.) Even though each request has to wait for I/O of some sorts, either from memcache(d), from disk or from DB, you can process much more of these per front-end server and just scale backend servers as required. For example, with PHP your front-end server can serve 100k/sec, having X DB backends and Y memcached backends. With a C application, the front end can serve, say, 1M/sec. You still get to keep one front-end, even though you had to put more backends.
In short, you can significantly reduce the number of servers required if the app was written in C.
You're pulling those numbers out of thin air. You sti