SAS and R are closely related topics: both are popular tools for people like us who solve problems from the world of statistics and machine learning on (more or less) large data volumes. Despite this apparent proximity, there are few points of contact between the two communities, and only a few people work with both tools. As passionate "outside the box" thinkers, we regret that, and with this blog article we want to start a mini-series that deals, in loose order, with topics connecting the two worlds. This first article covers the options for exchanging data between the two systems. Since there are many ways to do this, the article is limited to the transfer from SAS to R; the opposite direction will follow in a later article.
Transferring data from SAS to R
There are various ways to transfer SAS data into an R system. They fall into three rough categories:
1. The generic possibilities
In this context, "generic" means that these methods are generally suitable for data transfer between all kinds of systems, not just R and SAS. Two methods in particular must be named here: transfer via CSV files and joint access to the same relational database.
The benefit of the first method is that it requires little specific know-how. The relevant commands are well known to most users: PROC EXPORT on the SAS side, and read.csv in R or, better, the fread command from the data.table package, which is the superior alternative not only for large data volumes. Those who have used these commands frequently know, however, that time-consuming manual fine-tuning is often required before the data exchange really works. This starts with the sensible choice of the separator (which must be identical on both sides). In particular when longer free texts are to be exchanged, e.g. in text mining applications, it by no means stops there. From the right choice of the string delimiter (which must not occur in the free texts) to the treatment of encoding problems (especially with cross-platform transfer of country-specific special characters), there are numerous opportunities to spend a long time on seemingly minor details. An unpleasant option for impatient people like me, then.
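To make the fine-tuning concrete, a minimal sketch of the R side might look like this; the file name, separator, and encoding are assumptions and must match whatever PROC EXPORT actually wrote on the SAS side:

```r
library(data.table)

# Read a CSV that was exported from SAS (e.g. via PROC EXPORT).
# Separator, string delimiter and encoding are placeholders here --
# they must be identical to the settings used when writing the file.
dt <- fread("mydata.csv",
            sep      = ";",       # must match the SAS-side separator
            quote    = "\"",      # string delimiter; must not occur in free text
            encoding = "UTF-8")   # or "Latin-1", depending on the platform
```

In practice, it is exactly these three arguments that absorb most of the manual precision work described above.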
Joint access to the same relational database is much more pleasant than transfer via CSV. The problems with separators, string delimiters and, to some extent, encodings disappear, and data types can be transferred much more easily. Moreover, the database is useful not only for transferring the data but also for working with it, and during the transfer it is easy to select only the subset of the data that is actually relevant. However, this method has substantial prerequisites: access to the same database from both systems is often simply not available. Still, such access is frequently easier to set up than a direct connection between R and SAS, which the next category of methods requires.
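On the R side, such a setup could be sketched as follows; the DSN, table and column names are purely hypothetical and stand in for whatever database both systems can reach:

```r
library(DBI)
library(odbc)

# Connect via an ODBC data source that both SAS and R can access;
# "analytics_db" is a placeholder DSN.
con <- dbConnect(odbc::odbc(), dsn = "analytics_db")

# One advantage over CSV: pull only the rows and columns
# that are actually relevant for the analysis.
df <- dbGetQuery(con, "SELECT id, score FROM sas_export WHERE year = 2015")

dbDisconnect(con)
```

Note that data types survive this route largely intact, which is exactly what the CSV route cannot guarantee.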
2. The methods which use a local SAS installation
Many methods for importing SAS data into R require access to a local SAS installation. As this requirement is not met in many environments, we will not deal with this possibility in detail.
3. Direct transfer of native SAS files without recourse to an SAS installation
The SAS7BDAT format is the standard format on the SAS side. It exists in several variants; in particular, compression can be activated or deactivated. If one tries to read these files with the R package "foreign", one is in for a surprise: it simply does not work. Admittedly, "foreign" is one of the first packages that come to mind when importing foreign data formats into R, and it does support SAS files. Unfortunately, this support is limited to an ancient data format (SAS XPORT) that is rarely used in the SAS community anymore. The format has almost grotesque limitations; in particular, variable names may not be longer than eight characters. It is, however, the only SAS data format for which SAS has disclosed a specification, which may well be the reason why the developers of foreign support only this format. For our purposes the format is not reasonably usable; it offers only disadvantages compared to simple CSV files.
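For completeness, this is what reading such a legacy file with foreign looks like; the file name is a placeholder, and the call only works for genuine XPORT files, not for ordinary .sas7bdat files:

```r
library(foreign)

# Works only for the legacy SAS XPORT transport format (.xpt),
# which is the sole SAS format that foreign supports.
df <- read.xport("legacy_data.xpt")
```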
Fortunately, Matt Shotwell took on the Sisyphean task of analyzing the SAS7BDAT data format. Based on findings gained by reverse engineering, he wrote an R package of the same name that is able to read these files (writing is not possible). Unfortunately, the package can only cope with the uncompressed variant of the format. See below for alternatives that can also read compressed data.
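Using the package is straightforward; the file name below is a placeholder, and the file must be an uncompressed .sas7bdat:

```r
# install.packages("sas7bdat")  # available on CRAN
library(sas7bdat)

# Reads uncompressed .sas7bdat files directly into a data frame;
# fails on files written with COMPRESS=YES.
df <- read.sas7bdat("mydata.sas7bdat")
```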
Fortunately, it is easy to deactivate compression while creating the file: just set the option "COMPRESS=NO" in the DATA step. However, very many SAS systems write the compressed variant by default (the default behavior can be configured via OPTIONS). This means that in many cases one cannot simply take an existing SAS file and import it into R; rather, one first has to create a file that is readable in R by deactivating compression. Then the import works reliably, although not overly quickly.
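On the SAS side, re-creating an existing data set without compression could look like this; data set and library names are placeholders:

```sas
/* Re-create the data set uncompressed so that the R package
   sas7bdat can read the resulting .sas7bdat file. */
data work.mydata_uncompressed (compress=no);
    set work.mydata;
run;
```

Alternatively, `OPTIONS COMPRESS=NO;` changes the session-wide default, as mentioned above.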
Building on Matt Shotwell's work, a Java library was later developed that can also cope with compressed files. Matt Shotwell has made this Java library accessible in an R package named sas7bdat.parso. However, this package is a little harder to install (it requires rJava) and is not available via CRAN, only via GitHub. Like the sas7bdat package, it is rather slow.
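Installation and use might look roughly like this; the GitHub repository path and the reader function name are stated as I remember them and should be checked against the package's README, and the file name is a placeholder:

```r
# Not on CRAN; install from GitHub (requires rJava and a working Java setup).
# install.packages("devtools")
# devtools::install_github("biostatmatt/sas7bdat.parso")
library(sas7bdat.parso)

# Unlike the pure-R sas7bdat package, this Java-backed reader
# also handles compressed .sas7bdat files.
df <- read.sas7bdat.parso("compressed_data.sas7bdat")
```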
Thus, as is so often the case, the best way depends on the exact situation. Personally, I would prefer a jointly used database and, if that is not possible, switch to one of Matt Shotwell's two packages. Only if there is no other way would I resort to CSV files.