-
Notifications
You must be signed in to change notification settings - Fork 43
/
Copy pathsqlite-vs-pandas.html
195 lines (139 loc) · 8.09 KB
/
sqlite-vs-pandas.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>sqlite-vs-pandas.html</title>
</head>
<body>
<h1>Comparing performance of sqlite and pandas</h1>
<p>Depending on the task, you may need to choose one tool over another for
performance considerations. </p>
<p>For example, your analysis requires you to perform a large number of
group-by calculations on moderately large data. Or, your task is to filter out
many invalid values for a dataset with 10 million records.</p>
<p>Should you go with <code>pandas</code> or <code>sqlite</code>? </p>
<p>Let's find out the tasks at which each of these excel. Below, we compare
Python's <code>pandas</code> to <code>sqlite</code> for some common data analysis operations: sort,
select, load, join, filter, and group by.</p>
<p><em>Note that the axis is logarithmic, so that raw differences are more pronounced.</em></p>
<p><div class='tableauPlaceholder' id='viz1487191497220' style='position: relative'><noscript><a href='#'><img alt='Comparing performance of sqlite vs. pandas ' src='https://public.tableau.com/static/images/Co/Comparingperformanceofsqliteandpython-pandas/bars/1_rss.png' style='border: none' /></a></noscript><object class='tableauViz' style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='site_root' value='' /><param name='name' value='Comparingperformanceofsqliteandpython-pandas/bars' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https://public.tableau.com/static/images/Co/Comparingperformanceofsqliteandpython-pandas/bars/1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div> <script type='text/javascript'> var divElement = document.getElementById('viz1487191497220'); var vizElement = divElement.getElementsByTagName('object')[0]; vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px'; var scriptElement = document.createElement('script'); scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js'; vizElement.parentNode.insertBefore(scriptElement, vizElement); </script></p>
<h1>Analysis details</h1>
<p>For the analysis, we ran the six tasks 10 times each, for 5 different sample
sizes, for each of 3 programs: <code>pandas</code>, <code>sqlite</code>, and <code>memory-sqlite</code> (where
database is in memory instead of on disk). See <a href="#Code">below</a> for the definitions of
each task.</p>
<p>Our sample data was randomly generated. Here's what it looks like:</p>
<pre><code>sql_vs_pandas$ head -n 5 data/sample.100.csv
qqFjQHQc,c,1981,82405.59262172286
vILuhVGz,a,1908,27712.27152250119
mwCjpoOF,f,1992,58974.38538762843
kGbriYAK,d,1927,42258.24179716961
MeoxuJng,c,1955,96907.56416314292
</code></pre>
<p>This consists of a random string of 8 characters, a random single character
(for the filtering operation), a random integer simulating a year (1900-2000), and
a uniform random float value between 10000 and 100000.</p>
<h1>Results</h1>
<p><code>sqlite</code> or <code>memory-sqlite</code> is faster for the following tasks:</p>
<ul>
<li><code>select</code> two columns from data (<.1 millisecond for any data size for
<code>sqlite</code>. <code>pandas</code> scales with the data, up to just under 0.5 seconds for
10 million records)</li>
<li><code>filter</code> data (>10x-50x faster with <code>sqlite</code>. The difference is more
pronounced as data grows in size)</li>
<li><code>sort</code> by single column: <code>pandas</code> is always a bit slower, but this was the
closest</li>
</ul>
<p><code>pandas</code> is faster for the following tasks:</p>
<ul>
<li><code>groupby</code> computation of a mean and sum (significantly better for large
data, only 2x faster for <10k records)</li>
<li><code>load</code> data from disk (5x faster for >10k records, even better for smaller
data)</li>
<li><code>join</code> data (2-5x faster, but slower for smallest dataset of 1000 rows)</li>
</ul>
<p>Comparing <code>memory-sqlite</code> vs. <code>sqlite</code>, there was no meaningful difference,
especially as data size increased.</p>
<p>There is no significant speedup from loading <code>sqlite</code> in its own shell vs. via
<code>pandas</code>.</p>
<p>Overall, joining and loading data is the slowest whereas select and filter are
generally the fastest. Further, <code>pandas</code> seems to be optimized for group-by
operations, where it performs really well (group-by is <code>pandas</code>' second fastest
operation for larger data).</p>
<p>Note that this analysis assumes you are equally proficient in writing code with
both! But these results could encourage you to learn the tool that you are
less familiar with, if the performance gains are significant.</p>
<h1>Code</h1>
<p>All code
is
<a href="https://github.com/thedataincubator/data-science-blogs">on our GitHub page</a>.</p>
<p>Below are the definitions of our six tasks: sort, select, load, join, filter,
and group by (see <code>driver/sqlite_driver.py</code> or <code>driver/pandas_driver.py</code>).</p>
<p><code>sqlite</code> is first, followed by <code>pandas</code>:</p>
<h3>sort</h3>
<pre><code>def sort(self):
self._cursor.execute('SELECT * FROM employee ORDER BY name ASC;')
self._conn.commit()
def sort(self):
self.df_employee.sort_values(by='name')
</code></pre>
<h3>select</h3>
<pre><code>def select(self):
self._cursor.execute('SELECT name, dept FROM employee;')
self._conn.commit()
def select(self):
self.df_employee[["name", "dept"]]
</code></pre>
<h3>load</h3>
<pre><code>def load(self):
self._cursor.execute('CREATE TABLE employee (name varchar(255), dept char(1), birth int, salary double);')
df = pd.read_csv(self.employee_file)
df.columns = employee_columns
df.to_sql('employee', self._conn, if_exists='replace')
self._cursor.execute('CREATE TABLE bonus (name varchar(255), bonus double);')
df_bonus = pd.read_csv(self.bonus_file)
df_bonus.columns = bonus_columns
df_bonus.to_sql('bonus', self._conn, if_exists='replace')
def load(self):
self.df_employee = pd.read_csv(self.employee_file)
self.df_employee.columns = employee_columns
self.df_bonus = pd.read_csv(self.bonus_file)
self.df_bonus.columns = bonus_columns
</code></pre>
<h3>join</h3>
<pre><code>def join(self):
self._cursor.execute('SELECT employee.name, employee.salary + bonus.bonus '
'FROM employee INNER JOIN bonus ON employee.name = bonus.name')
self._conn.commit()
def join(self):
joined = self.df_employee.merge(self.df_bonus, on='name')
joined['total'] = joined['bonus'] + joined['salary']
</code></pre>
<h3>filter</h3>
<pre><code>def filter(self):
self._cursor.execute('SELECT * FROM employee WHERE dept = "a";')
self._conn.commit()
def filter(self):
self.df_employee[self.df_employee['dept'] == 'a']
</code></pre>
<h3>group by</h3>
<pre><code>def groupby(self):
self._cursor.execute('SELECT avg(birth), sum(salary) FROM employee GROUP BY dept;')
self._conn.commit()
def groupby(self):
self.df_employee.groupby("dept").agg({'birth': np.mean, 'salary': np.sum})
</code></pre>
<h1>Details</h1>
<p>We used <code>pandas</code> version <code>0.19.1</code> and <code>sqlite</code> version <code>3.13.0</code>. All tests
were ran on Digital Ocean <code>Ubuntu 14.04</code> with 16GB memory and 8 core CPU.</p>
<h1>References</h1>
<p>To learn more how <code>sqlite</code> works, check
out
<a href="https://jvns.ca/blog/2014/09/27/how-does-sqlite-work-part-1-pages/">this awesome blog series</a>.</p>
<p>Here is
a
<a href="http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html">syntax comparison</a> between
<code>pandas</code> and <code>sql</code>.</p>
</body>
</html>