In-Database R Coming To SQL Server 2016
theodp writes: Wondering what kind of things Microsoft might do with its purchase of Revolution Analytics? Over at the Revolutions blog, David Smith announces that in-database R is coming to SQL Server 2016. "With this update," Smith writes, "data scientists will no longer need to extract data from SQL server via ODBC to analyze it with R. Instead, you will be able to take your R code to the data, where it will be run inside a sandbox process within SQL Server itself. This eliminates the time and storage required to move the data, and gives you all the power of R and CRAN packages to apply to your database." It'll no doubt intrigue Data Scientist types, but the devil's in the final details, which Microsoft was still cagey about when it talked-the-not-exactly-glitch-free-talk (starts @57:00) earlier this month at Ignite. So, brush up your R, kids, and you can see how Microsoft walks the in-database-walk when SQL Server 2016 public preview rolls out this summer.
Alteryx (Score:1)
Check out http://alteryx.com/ which is already doing in-database R with Oracle and Hadoop (SparkR). It's great that Microsoft is joining the club, but they aren't exactly the first.
Re: (Score:1)
PostgreSQL has had PL/R since 2003.
Re:Alteryx (Score:5, Interesting)
Re: (Score:2)
PostgreSQL has had PL/R since 2003.
Which is nice but doesn't really do anything for you if you're not using PostgreSQL, for example those using SQL Server.
Re:Alteryx (Score:5, Insightful)
for example those using SQL Server.
Though to be fair, that was a questionable decision to begin with. You just don't get any value for your subscription fees.
Databases are one area where open source is beating closed source.
Re: (Score:3)
Re: (Score:2)
Vertica says HELLO! Even though it's -absurdly- expensive, it runs circles around anything open source.
Though in general, large (really large) databases are an area where you actually want commercial support, because things can go wrong in the most fucked up ways.
Open source dbs have companies doing that support, but few have the kind of manpower I'd want when things go very sour.
Re: (Score:3)
Vertica says HELLO! Even though it's -absurdly- expensive, it runs circles around anything open source.
Vertica is a data warehouse
Open source dbs have companies doing that support, but few have the kind of manpower I'd want when things go very sour.
If you sincerely need help from Oracle/Microsoft/HP to deal with your database problems, then your technical expertise isn't very high.
Re: (Score:2)
The line is so thin between data warehouse and transactional dbs. Heck, in this case the only difference is how data is stored and which type of query is fast and which is slow. You can insert, run SQL (we use Postgres as a mock to run persistence layer tests, because it's so close to Vertica), all in real time. Close enough.
And even the biggest of big data giants sometimes end up with issues where you need help. When you need to write a patch for your RDBMS, it's nice to be able to have a vendor to do it, op
Re: (Score:3)
The line is so thin between data warehouse and transactional dbs. Heck, in this case the only difference is how data is stored and which type of query is fast and which is slow.
No, that is actually the difference lol
Re: (Score:2)
And even the biggest of big data giants sometimes end up with issues where you need help. When you need to write a patch for your RDBMS, it's nice to be able to have a vendor to do it, open source or not. Not many companies keep Postgres core developers in house (
I'm interested though, is this an issue you've run into?
Re: (Score:2)
And the only difference between a train and semi is the engine and body.
Re: (Score:3)
With Postgres and MS-SQL being pretty much a tie on TCO, just choose whichever best fits your situation. Postgres does have a low barrier of entry and can do some pretty nifty things, but those things increa
Re: (Score:2)
Postgres was pretty much the best until about $100k, then MS-SQL caught up and it was pretty much a tie.
How did MS-SQL catch up?
Re: (Score:2)
3rd Party Vendors. It is a scary world out there. If you don't like working with it, double your prices to drive them away. Oops, now you're the highest paid person, you're the expert in what you hate. It happens all the time.
If you're willing to use proprietary COTS crapware inside a business, you'll probably get stuck with crap like SQL Server. This is a huge service to poor souls stuck working on these things and doing statistics. You can throw away a whole layer of crapware and move it into the database
But, but? (Score:1)
Re: (Score:1)
Re: But, but? (Score:5, Informative)
Yeah exactly.
MS SQL has a lot of good things going for it - but what you're asking for is one area where Postgres just runs rings around it. You can achieve similar benefits in MS using the CLR, but it will be faster and easier in Postgres. Unless you have some compelling reason to stay MS, I suggest you take the hit and learn a new platform.
Re: But, but? (Score:3)
Faster, yes. Easier? Maybe. I'm migrating a project from SQL Server to Postgres and I will say that SSMS is definitely better than pgAdmin any day. I'm almost tempted to write my own console due to the bugs I encounter.
Re: (Score:1)
Re: (Score:2)
Re: (Score:2)
As someone with a bad memory,
Improve your memory. It can be done.
Re: (Score:2)
Crazy idea I know, but how about making notes?
Re: (Score:1)
No. You're wrong. Use your eyes. Do it his way. ...ahh, that feels better ;).
Memorizing CLIs is a waste of brain space for all but the most static of job descriptions. Why? Because everything changes in this field rapidly. In a world of disposable code, where my random OSS framework could suddenly become the next big thing tomorrow morning, memorizing APIs is about the least efficient thing a developer can spend their time on.
But I will say that it does give an advantage in the workplace to be the square ey
Re: (Score:2)
Using postgres from the command line is not hard. Really.
Re: (Score:1)
No, the reality is the person would have to be a jack of all trades and master of none because of all of the job duties assigned to the position. Said person would constantly be changing between working switches, routers, firewall, PBX, Linux servers, Windows servers, Windows clients, multiple SAN manufacturers, 2 different hypervisors, 2 different relational databases, IIS, Apache, PHP, C#, various shell scripting languages, etc.
Re: (Score:1)
So you've had one of those crazy jobs too, eh? ;)
I love that CLIs let people automate tool chains and crud stuff with ease. I can't stand the groupthink that it's somehow better to use CLI all the time. A billion context switches between languages and esoteric interfaces over the course of an hour is initially exciting and feels productive. Unfortunately, I've never met a program written by a polyglot CLI warrior that wasn't a nightmare of spaghetti fragments that could break with one bad keystroke. Buildin
Re: (Score:2)
Yup, SSMS is far, far better than pgAdmin. SSIS is years ahead of any Postgres ETL tool. There's a bunch of other awesome features in SQL Server too - from memory, MERGE doesn't work in Postgres, procedures/functions are harder to use, and ...
I wasn't trying to say Postgres is all-round better than SQL Server. But there are a few things including R integration and spatial queries where Postgres is so far ahead that you are probably better to put up with the weaknesses.
Re: (Score:3)
SQL Server 2016 will have a JSON column type, so it's most of the way there.
This might not be a good idea ... (Score:5, Interesting)
The problem with R is that everything is a vector. When you hit something as big as a multi-terabyte database, the vector doesn't fit in memory anymore. An interpreted language like R, and even many compiled languages, expect memory accesses to be quick. However, if every data access requires a SQL call, then the R-SQL Server marriage will be very slow. I'm sure they will be able to do some small demonstrations that look quick, but once the database becomes large, things will be very slow.
On the good news side, there are some operations like average and standard deviation that reduce into loops of sums. Those should map onto SQL queries relatively well.
On the bad news side, a popular operation is to build a covariance matrix. With a large data set, it is easy to create a covariance matrix that does not fit into RAM.
R would be a better match against a distributed database (NoSQL, MongoDB), where the memory requirements of the vectors could be split across multiple computers. Although, that too might require some changes to R.
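To make the "loops of sums" point concrete, here is a minimal sketch in Python with the standard-library sqlite3 module standing in for the database (it is not SQL Server, and the table and column names are made up for illustration). Mean, sample variance, standard deviation and a pairwise covariance are all recovered from six aggregates computed in a single pass inside the database, so no vector of rows ever has to be materialized on the client side.

```python
import math
import sqlite3

# Hypothetical table and columns, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)],
)

# One pass inside the database: only six numbers come back.
n, sx, sy, sxx, syy, sxy = conn.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x*x), SUM(y*y), SUM(x*y) FROM obs"
).fetchone()

mean_x = sx / n
mean_y = sy / n

# Sample variance and covariance rebuilt from the running sums.
var_x = (sxx - sx * sx / n) / (n - 1)
var_y = (syy - sy * sy / n) / (n - 1)
cov_xy = (sxy - sx * sy / n) / (n - 1)
std_x = math.sqrt(var_x)
std_y = math.sqrt(var_y)

print(mean_x, std_x, cov_xy)
```

Two caveats. The sum-of-squares formula is the textbook shortcut and can lose precision when the mean is large relative to the spread, so a production version would likely use a more stable update. And a full covariance matrix needs one SUM(a*b) per pair of columns, so the query and the result grow with the square of the column count, not with the row count.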
Re: (Score:1)
The problem with R is that everything is a vector. When you hit something as big as a multi-terabyte database, the vector doesn't fit in memory anymore.
library(bigmemory)
Create, store, access, and manipulate massive matrices. Matrices are, by default, allocated to shared memory and may use memory-mapped files. Packages biganalytics, synchronicity, bigalgebra, and bigtabulate provide advanced functionality.
Re: (Score:2)
RDBMS engines are designed to take routines of in-memory, row-by-row or group-by-group statistical operations and figure out good (optimal) disk/memory organizations for them. That's one of the things they are very very good at.
Re: (Score:1)
Re: (Score:2)
DBAs won't like it and will disable it in most corporate environments. This in effect lets the users/developers "inside" their precious servers where they are the ultimate power in a way they can't fully control (and let's face it, Control Freak is a job requirement for a DBA). Add to that the potential to bring a server to its knees with a badly written fragment of code and the possibility of security holes in a new component and they will have all the ammo they need to convince their bosses that it is a B
Re: (Score:2)
I would imagine one would only be performing datamining/statistical analysis on the data warehouse server, not the transactional database server.
Re: (Score:2)
How about making SQL server respect ASCII nulls on unique constraints?
I would be more impressed with EBCDIC nulls.
In the KELVIN character set, NULL equals ABS(NULL)
Re: (Score:3)
So if you aren't first with a feature, you shouldn't bother?
Re:Why not Python? (Score:4, Interesting)
Why R? The R syntax is deranged. Python is at least more normal for programming. Why not have a .NET like set of language-neutral libraries to interface with this in-memory whatever-it-is feature and let hackers plug in their own languages? Why bake any one language into the database?
This. The language is horrible. What R has going for it is (1) some quite good graph plotting and (2) support for any statistical function you can think of, since every statistics researcher works in R and so the functions are available. No other statistics product comes close.
A Python statistics library with some funky C linkage to the R library would take over in milliseconds when people find they can get all the stats functions while being able to program in a sane language.
Re: (Score:2)
I use both R and Python. R itself is actually quite nice and more efficient for interactive use, once you get used to it. For interactive exploration with statistics, I actually prefer it to Python (and I have been using Python for ~15 years). Lots of helper functions. Everything uses the DataFrame datastructure. Good, concise and consistent documentation.
Unless you are an R library dev, it's best to see R as a shell for statistics rather than a programming language. So its language horriblen
Re: (Score:2)
Just out of curiosity, when you say "Python" are you including IPython Notebook and Pandas and the rest of the SciPy/NumPy modules or are you comparing R strictly to "plain" Python scripts?
Re: (Score:2)
I mean the full Python stack (IPython notebook + Spyder with IPython, PyLab, Pandas, statsmodels).
For almost everything in stats, I prefer the RStudio experience. The flow feels much better, even though my Python is much better than my R. Machine Learning is one stats topic though, where I still prefer Python - I just like Scikit-learn.
If I was doing linear algebra directly, I would have preferred the Python stack with NumPy. PyLab stack is more for Matlab users than R users. On the stats side, Pandas and s
Re: (Score:2)
That's what rpy2 is.
Thank you. I didn't know rpy2 existed.
Isn't R GPL? (Score:1)
Re: (Score:2)
An implementation of R is GPL, but that doesn't extend to all independent implementations, such as the one MS is writing to do this.
Re:Isn't R GPL? (Score:5, Informative)
No - MS will only need to release any changes they make to R.
This sort of thing comes up quite often and largely comes down to coupling. If Microsoft included R code in the binary of SQL Server then they would run into complications. However as long as they keep R on its own and arrange interprocess communication sensibly, they will not be affected by the GPL.
It's quite likely MS will modify R, e.g. writing low level routines for getting data out of SQL without needing to go via ODBC, and those sorts of changes will need to be released. It's also possible MS will want things like .RData readers for putting into SQL and similar - and they might choose to do a clean-room implementation of such bits rather than calling out to R for the loading code in order to avoid too tight coupling.
Incidentally, this has been done before. The PL/R project gives Postgres (BSD) tight coupling with R (GPL) without requiring Postgres to be relicenced. Tableau also released similar features, though they don't add much value at this stage.
Re: (Score:2)
expect to pay $$ (Score:3)
expect it to be in the enterprise version at $7000 a physical CPU core
MS OLAP (Score:2)
I'm curious whether it will be exposed via OLAP - when I was doing some proteomics work with MS OLAP some years back, the retrieval speed was stellar, but the math libraries were pathetic, which seemed pretty sad for something allegedly aimed at analytics. (Yes, I know, most people assumed business analytics, but there's an awful lot of potential for scientific analysis, especially with large, messy datasets.)
Re: (Score:2)
I'm guessing they'll slowly phase out OLAP.
OLAP got its stellar retrieval speed through lots of precomputation and that just isn't compatible with where the whole big data stuff is going. I'd guess instead they will bring in a NoSQL database as a per-table query engine and use that as the OLAP replacement.
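A rough sketch of the precomputation trade-off, in Python with sqlite3 as a stand-in (this is not how MS OLAP is implemented, and the `sales` and `sales_rollup` tables are hypothetical). The rollup answers a lookup without touching the raw rows, but it goes stale as soon as new data arrives, which is the part that fits poorly with fast-growing data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", 2014, 10.0),
        ("east", 2015, 15.0),
        ("east", 2015, 5.0),
        ("west", 2014, 7.0),
        ("west", 2015, 8.0),
    ],
)

# Precomputation step: aggregate once, ahead of query time.
conn.execute(
    "CREATE TABLE sales_rollup AS "
    "SELECT region, year, SUM(amount) AS total "
    "FROM sales GROUP BY region, year"
)


def total_precomputed(region, year):
    """Look the answer up in the rollup; no scan of the raw rows."""
    row = conn.execute(
        "SELECT total FROM sales_rollup WHERE region = ? AND year = ?",
        (region, year),
    ).fetchone()
    return row[0] if row else 0.0


def total_adhoc(region, year):
    """Aggregate the raw rows at query time."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0.0) FROM sales "
        "WHERE region = ? AND year = ?",
        (region, year),
    ).fetchone()
    return row[0]


before = (total_precomputed("east", 2015), total_adhoc("east", 2015))

# New data arrives after the rollup was built.
conn.execute("INSERT INTO sales VALUES ('east', 2015, 2.0)")
after = (total_precomputed("east", 2015), total_adhoc("east", 2015))

print(before, after)
```

Until the rollup is rebuilt, the two answers disagree after the insert; keeping them in sync is the recurring cost of the precomputed approach.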
And so it begins (Score:1)
Embrace, Extend, Extinguish [wikipedia.org]
Just like they did to Lotus 1-2-3 and WordPerfect, just like they did to Java with their J++ before getting spanked, just like they tried to do with C++, and just like they're trying to do by porting Android and iOS apps to their OS, Microsoft is doing it again -- creating a Roach Motel of software in which the developer or user can check into the Microsoft Roach Motel OS, but they sure cannot check out.
What is so egregiously evil about this? They're taking an Open Source product
Are you sure this is a good idea? (Score:2)
Being able to remotely transmit commands in a new general-purpose programming language to the server that stores your irreplaceable data? What could possibly go wrong?
Also, how do you say "Robert'); DROP TABLE Students;" in R?
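For the SQL half of the joke, the usual defence is a bound parameter. Here is a minimal sketch in Python with sqlite3 (not SQL Server, and it says nothing about how the R sandbox will treat its inputs, which the thread doesn't tell us): the hostile string is passed as data, so it is stored verbatim and never parsed as SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (name TEXT)")

hostile = "Robert'); DROP TABLE Students;"

# Bound parameter: the driver passes the value as data, not as SQL text.
conn.execute("INSERT INTO Students (name) VALUES (?)", (hostile,))

# The table still exists and holds the string exactly as given.
stored = conn.execute("SELECT name FROM Students").fetchall()
print(stored)
```

The risk with any embedded language is the same one in a different place: code built by pasting untrusted strings together, whether that code is SQL or R.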
Re: (Score:3)
You can already do that with the CLR. [microsoft.com]
zzzzz (Score:2)
Wake me up when SQL Server comes with an MP3 player built in.