Comparison with SAS¶
For potential users coming from SAS, this page demonstrates how different SAS operations would be performed in pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself with the library.
As is customary, we import pandas and numpy as follows:
In [1]: import pandas as pd
In [2]: import numpy as np
Note
Throughout this tutorial, the pandas DataFrame will be displayed by calling df.head(), which displays the first N (default 5) rows of the DataFrame.
This is often used in interactive work (e.g. Jupyter notebook or terminal) - the equivalent in SAS would be:
proc print data=df(obs=5);
run;
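As a minimal sketch of the head() behavior described in the note, the snippet below builds a small frame and takes its first rows. The eight-row frame and the names first_five and first_three are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical 8-row frame, used only for illustration.
df = pd.DataFrame({"x": range(8), "y": range(8, 16)})

first_five = df.head()    # no argument: the first 5 rows
first_three = df.head(3)  # explicit N: the first 3 rows

print(first_five)
print(first_three)
```

Passing an explicit N plays roughly the same role as changing obs= in the proc print call above.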
Data Structures¶
General Terminology Translation¶
pandas | SAS |
---|---|
DataFrame | data set |
column | variable |
row | observation |
groupby | BY-group |
NaN | . |
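The last two rows of the table can be illustrated with a short sketch. The frame, its column names, and the values are made up for this example; it only shows that NaN marks a missing value and that groupby splits rows by a key, much as a BY-group does.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing value in group "a".
df = pd.DataFrame({
    "group": ["a", "a", "b"],
    "value": [1.0, np.nan, 3.0],
})

missing = df["value"].isna()                 # True where the value is NaN
totals = df.groupby("group")["value"].sum()  # NaN is skipped by default

print(missing)
print(totals)
```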
DataFrame / Series¶
A DataFrame in pandas is analogous to a SAS data set - a two-dimensional
data source with labeled columns that can be of different types. As will be
shown in this document, almost any operation that can be applied to a data set
using SAS’s DATA step can also be accomplished in pandas.
A Series is the data structure that represents one column of a
DataFrame. SAS doesn’t have a separate data structure for a single column,
but in general, working with a Series is analogous to referencing a column
in the DATA step.
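To make the DataFrame / Series relationship concrete, here is a small sketch. The frame is illustrative, and the new column name z is an assumption introduced for the example.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 3, 5], "y": [2, 4, 6]})

col = df["x"]              # selecting a single column returns a Series
print(type(col).__name__)  # Series

# Arithmetic on Series is element-wise, roughly like writing
# `z = x + y;` inside a DATA step.
df["z"] = df["x"] + df["y"]
print(df)
```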
Index¶
Every DataFrame and Series has an Index, which holds the labels on the
rows of the data. SAS does not have an exactly analogous concept. A data set’s
rows are essentially unlabeled, other than an implicit integer index that can be
accessed during the DATA step (_N_).
In pandas, if no index is specified, an integer index is also used by default
(first row = 0, second row = 1, and so on). While using a labeled Index or
MultiIndex can enable sophisticated analyses and is ultimately an important
part of pandas to understand, for this comparison we will essentially ignore the
Index and just treat the DataFrame as a collection of columns. Please
see the indexing documentation for much more on how to use an
Index effectively.
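Although the rest of this page ignores the Index, a brief sketch may help show what the default looks like and how labels differ from it. The labels "a", "b", "c" and the name labeled are hypothetical.

```python
import pandas as pd

# Default: an integer index starting at 0.
df = pd.DataFrame({"x": [1, 3, 5], "y": [2, 4, 6]})
print(df.index)  # RangeIndex(start=0, stop=3, step=1)

# The same data with hypothetical row labels supplied explicitly.
labeled = pd.DataFrame({"x": [1, 3, 5], "y": [2, 4, 6]},
                       index=["a", "b", "c"])
print(labeled.loc["b", "y"])  # 4
```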
Data Input / Output¶
Constructing a DataFrame from Values¶
A SAS data set can be built from specified values by
placing the data after a datalines statement and
specifying the column names.
data df;
input x y;
datalines;
1 2
3 4
5 6
;
run;
A pandas DataFrame can be constructed in many different ways,
but for a small number of values, it is often convenient to specify it as
a Python dictionary, where the keys are the column names
and the values are the data.
In [3]: df = pd.DataFrame({
...: 'x': [1, 3, 5],
...: 'y': [2, 4, 6]})
...:
In [4]: df
Out[4]:
x y
0 1 2
1 3 4
2 5 6
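As one of the other ways to construct a DataFrame, the sketch below supplies the data row by row, which mirrors the layout of the datalines block more closely than the dictionary form. The names rows and df_rows are introduced here for illustration.

```python
import pandas as pd

# One tuple per observation, in the same order as the datalines block.
rows = [(1, 2), (3, 4), (5, 6)]
df_rows = pd.DataFrame(rows, columns=["x", "y"])

print(df_rows)
```

The result holds the same values as the dictionary-based df above; which form is more convenient depends on whether the data arrives by column or by row.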
Reading External Data¶
Like SAS, pandas provides utilities for reading in data from
many formats. The tips dataset, found within the pandas tests (csv),
will be used in many of the following examples.
SAS provides PROC IMPORT to read csv data into a data set.
proc import datafile='tips.csv' dbms=csv out=tips replace;
getnames=yes;
run;
The pandas method is read_csv(), which works similarly.
In [5]: url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'
In [6]: tips = pd.read_csv(url)
In [7]: tips.head()
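Because read_csv() accepts file-like objects as well as paths and URLs, it can be tried without network access. The sketch below parses a small in-memory CSV; the text is made up to stand in for a file and is not the real tips data, and the name sample is an assumption.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk; not the real tips data.
csv_text = """total_bill,tip,day
10.00,1.50,Sun
20.00,3.00,Sat
15.50,2.25,Sun
"""

# The header row supplies the column names, similar to getnames=yes.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())
print(sample.dtypes)
```

Column types are inferred from the data, so the numeric columns here come back as floats and the day column as strings.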