Tag Archives: SQL

Learning SQL – review

To give you all some time to digest my review of the Art of R Programming, I thought why not continue this trend of book reviews with a review of Learning SQL. This book came highly recommended by some colleagues of mine as a place to whet your SQL appetite. As with the R review, I don’t expect much of what’s in here to be ground-breaking, but rather it can serve as a nice intro for some of you to the SQL language.

I’ll discuss each chapter in high level detail, providing a brief summary of the topics covered along with my own commentary.

A few quick notes:

This book focuses on SQL (Structured Query Language), which is a "special-purpose programming language designed for managing data held in a relational database management system (RDBMS)."*

That is, SQL is a set-based query language that is not tied to any specific company or program. You may have heard of variations such as MySQL, Oracle SQL, SQL Server, PostgreSQL, Sybase, etc. Those are extensions and variations of basic SQL. They all use SQL as the underlying basis for running queries, but each has its own nuances and built-in functionality. Furthermore, some are open-source (MySQL and PostgreSQL) and some are not (Oracle SQL and Microsoft’s SQL Server). For a debate on the pros and cons of each, please see here.

This book’s goal is to explain the SQL language regardless of what program you use to run your queries, i.e. it’s program agnostic. Having said that, the author provides his examples in MySQL. Where there are differences and incompatibilities with Oracle’s SQL or SQL Server, he points them out. Those, however, are the only 3 variations of SQL that he discusses (sorry Postgres folks!).

Finally, I will not be covering the SQL vs. NoSQL debate. While an interesting discussion point, it’s beyond the scope of this post – don’t you worry though, I’ll cover that in a later post.

With that intro out of the way, I can start the review. Next time 🙂

*courtesy of Wikipedia

Set vs. Procedural programming in SQL

Recently I had to perform a procedure in which I counted the number of occurrences of certain unique keywords in a database table where each record held text data. Here is my logic in pseudo-code:

For each record in KeyWord_table -- unique keywords
  Enter the source table   -- this is the table that holds the text data
  For all the rows in my source table    
    Count the number of occurrences of keyword in this record
    Go to the next record of the source table
  Sum all occurrences of keyword and store this value somewhere
  Go to the next record of KeyWord_table and re-enter the above process
Produce table of keywords and count of occurrences
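The pseudo-code above can be sketched as runnable code. Here is a minimal Python version of the nested-loop logic, using the standard-library sqlite3 module purely as a stand-in database; the table and column names (KeyWord_table, source, text_data) are hypothetical placeholders for the real schema:

```python
# Sketch of the nested-loop ("cursor-style") keyword count.
# All table/column names and sample data are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE KeyWord_table (keyword TEXT)")
cur.execute("CREATE TABLE source (text_data TEXT)")
cur.executemany("INSERT INTO KeyWord_table VALUES (?)",
                [("sql",), ("data",)])
cur.executemany("INSERT INTO source VALUES (?)",
                [("sql and data",), ("more data here",), ("nothing",)])

counts = {}
# Outer loop: one pass per keyword (the "cursor" over KeyWord_table)
for (kw,) in cur.execute("SELECT keyword FROM KeyWord_table").fetchall():
    total = 0
    # Inner loop: scan every record of the source table
    for (text,) in cur.execute("SELECT text_data FROM source").fetchall():
        total += text.count(kw)  # occurrences of keyword in this record
    counts[kw] = total  # store the summed count for this keyword

print(counts)  # {'sql': 1, 'data': 2}
```

Note how the database is re-scanned once per keyword, which is exactly the kind of row-by-row processing that cursors encode and that set-based SQL avoids.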

Now the above might seem like a logical approach, especially for compiled language programmers (i.e. C/C++) whose code runs very fast. For R users, however, any time we see loops, we look for ways to optimize (more on this in a later post!). SQL developers are even more opposed to looping, or in SQL-speak, cursors, which is what the above logic would require.

Coming from a procedural programming background, I’m used to applying functions to a process and iterating. This could take the form of an if-statement, a loop, or some other control structure. In this process, we tell the system “what to do” along with “how to do” it. This involves querying the database to obtain a result set and then outlining specific operational and manipulation logic to produce the final result. This is generally frowned upon in the database community, since DB developers think in terms of “sets”, as opposed to procedures.

A set-based approach is one which lets you specify “what to do”, but not “how to do” it. That is, you tell the program what your requirements are for a processed result that has to be obtained from a set of data. The program will decide how to most efficiently perform the process, depending on how the tables are set up, that is, how they are indexed. It then executes the operation and returns the result. This is almost always a much faster and more efficient approach than applying procedural programming logic in SQL. I will say, however, that I get a sense of “black box” programming with this approach, but perhaps that is because I am not a DBA who gets into the weeds of every query.
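As a rough sketch of what “what, not how” looks like in practice, the same keyword count can be written as a single join-and-aggregate query. This assumes the same hypothetical tables as the pseudo-code, again uses sqlite3 only as a convenient stand-in engine, and counts matching rows rather than every occurrence within a row (a simplification of the original task):

```python
# Set-based sketch: one declarative query replaces the nested loops.
# Hypothetical schema; sqlite3 stands in for a real RDBMS.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE KeyWord_table (keyword TEXT)")
cur.execute("CREATE TABLE source (text_data TEXT)")
cur.executemany("INSERT INTO KeyWord_table VALUES (?)",
                [("sql",), ("data",)])
cur.executemany("INSERT INTO source VALUES (?)",
                [("sql and data",), ("more data here",), ("nothing",)])

# For each keyword, count the source rows whose text contains it.
# The engine, not the programmer, decides the access path.
rows = cur.execute("""
    SELECT k.keyword, COUNT(s.text_data) AS n
    FROM KeyWord_table k
    LEFT JOIN source s
      ON s.text_data LIKE '%' || k.keyword || '%'
    GROUP BY k.keyword
    ORDER BY k.keyword
""").fetchall()
print(rows)  # [('data', 2), ('sql', 1)]
```

The LEFT JOIN keeps keywords with zero matches in the result (they come back with a count of 0), which the nested-loop version also handled implicitly.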

What do you all think – does that make sense? Let me know your thoughts and if you could come up with some simple examples to share, that would be great. In a future post, I’m going to share with you my procedural and set based code for the above task to see the differences. Stay tuned!

My current data analysis stack

So the shop that I work at isn’t heavily data-centric, although we are being bombarded with data all the time. We’re starting to appreciate the power of this data, but it’s a process. Having said that, my current tools are R and SQL Server.

I use SQL Server because that’s where all of our data is stored – in databases that have been collecting customer information (user behavior, billing breakdowns and payment patterns, …). I don’t see us transitioning to non-relational databases (NoSQL) any time soon, but then again, I don’t think we need to, since our data is very structured. Using SQL Server I get my data and, depending on the analysis, I’ll probably perform some minor manipulations to clean it up and make the dataset manageable for R.

I use R for my data analysis and some manipulation, where I couldn’t (or chose not to) use SQL. The deeper I dig into this data, however, the more I notice that my datasets are becoming very large, at least for R’s purposes. This sometimes leads to very slow processing within R. I have overcome this by becoming much better at having SQL do the leg work, but I’m not sure I want to continue using SQL as a data manipulation program, when its purpose is really data storage and retrieval (someone correct me if I’m wrong here).
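As a rough illustration of having SQL do the leg work: aggregate in the database so that only a small per-customer summary, rather than every raw row, crosses into the analysis environment. The billing table below is hypothetical, and Python stands in for R here:

```python
# Sketch: push aggregation into the database before analysis.
# The "billing" table and its data are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE billing (customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO billing VALUES (?, ?)",
                [(1, 10.0), (1, 5.0), (2, 7.5), (2, 2.5), (3, 4.0)])

# One row per customer comes back, not one row per transaction.
summary = cur.execute("""
    SELECT customer_id, COUNT(*) AS n_payments, SUM(amount) AS total
    FROM billing
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()
print(summary)  # [(1, 2, 15.0), (2, 2, 10.0), (3, 1, 4.0)]
```

The same GROUP BY would run server-side in SQL Server, shrinking what R has to load and process.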

To combat this, I’ve heard of many data scientists using Python for running scripts, after performing the analysis in R. And then there’s the Hadoop/MapReduce crowd, but I’m not sure my data warrants such “big data” solutions – at least not yet. I’ve even toyed with the idea of using Rshiny for webapps, because I’m increasingly finding myself trying to make R’s power more visually pleasing for those who want more interactivity with R’s graphs. I haven’t given Javascript a thought, but some others have. And they’re reporting significant speed improvements.

So what say you guys – do you do much of your heavy lifting with Python? Relational databases? Or do you focus on maximizing R’s efficiency where possible? Has anyone compared Rshiny to Javascript? Please chime in!