The Problem: We have a 5GB file, and no tool we have can open a file that large (assuming you even have enough free RAM to hold the whole thing in memory).
How do we get around this?
The solution is to use a database. While some people think this answer is common sense, I have seen far too many people intimidated by databases, so this series also serves as an introduction to SQL and Sqlite.
The code is available here. We will be using IPython Notebooks (see here for a quick intro). To start off, we will get a gentle intro to SQL. If you have used SQL with Sqlite before, still have a look at the notebook, as I will be referring back to it. Intro to SqLite.ipynb is the notebook we are working with.
There are 6 videos + transcript in this series. The videos are mixed with the transcripts, so scroll down if you are only interested in the videos. Make sure you turn on HD.
Minor correction to video below: Sqlite can handle concurrent connections, though it may not be as fast.
Transcript:
So the file we want to open is almost 5.3GB. No tool (that I know of) can open it. MS Excel can't open it, OpenOffice can't open it, Notepad++ can't open it. Not to mention, you may not even have enough memory to open it.
Even if you buy a more powerful machine, the lack of tools to work with large files will hold you back.
So how do you open a 5GB file? Not all of it at the same time is the answer. You store the data in a database and read it.
Why use a database?
Many scientists and engineers are happy to use MS Excel, CSV files, or even text files. Nothing wrong with that. It works.
Until it doesn’t.
For an example, just look at the current file. It didn’t start off as a 5GB file, but grew over time till it became too large to open with Excel.
Databases routinely handle Gigabytes of data without any problems, so we should be using them.
Advantages of using a database: http://programmers.stackexchange.com/questions/190482/why-use-a-database-instead-of-just-saving-your-data-to-disk
Have a skim through the top rated answer. You don't need to master everything database related, but make sure that when someone uses the word ACID, you don't say, "Yeah man, groovy. Have some so we can try it now?"
I have seen a fear of SQL in many people (and this is not a judgement, as I was in this club for a long time).
Real Reason (TM) programmers don't use a database: they are scared of the weird syntax of SQL, which looks a lot like a language invented for business types in the 60s and 70s (it was!). But it's not that scary.
Okay, to the lesson. Get the data from here:
Download the file: http://download.cms.gov/nppes/NPI_Files.html
If you are on Windows, use 7zip to unzip it, as Windows gets confused with the file.
We will be using something called Sqlite, a free database that comes inbuilt with Python. While it's a simple database, the main thing I want to emphasise is that it's not a toy database. Sqlite can easily handle websites with 100K visitors a day. And since it is file based, it is easy to share the data.
Introduction to Sqlite with Python
Okay, let’s get started.
import sqlite3
Since sqlite is inbuilt, we just import it.
conn = sqlite3.connect(":memory:")
c = conn.cursor()
Sqlite is fairly simple, in that if you don’t have a database, it creates one for you when you run the connect() function.
If you give it a file name, it will create a database with that name (which we will see in the next example).
For the current example, we are using :memory:, which means the database is stored in memory, and will be lost when the Python session is closed.
In the second line, we open a cursor on the database, which acts as a sort of command line interface to the database (though controlled from Python).
c.execute("CREATE TABLE my_table (Name TEXT NOT NULL, Salary INTEGER NOT NULL DEFAULT 0)")
A few things about SQL. SQL itself is case insensitive, but good practice is to use UPPER CASE for SQL commands, and lower case (or Sentence Case) for stuff we create, like tables.
Data in a database is stored in tables. In the example above, we create a table called my_table, which has two fields. The first is Name. The keywords in uppercase after it are SQL commands. They tell us the field Name is of type TEXT, and cannot be null (empty).
The next field is Salary. This is of type INTEGER, again cannot be null, and has a default value of 0. The default value is not strictly needed here, as we have already said the field cannot be empty. If we did not have the NOT NULL instruction, Sqlite would fill in the default value of 0 for any Salary field left empty (we'll see the default in action after our first INSERT below).
conn.commit()
commit() actually writes the data to the database. If you forget to commit, you will lose all your changes. You can commit after each instruction, or once at the end.
Why would you want to do it once? If you have multiple users, committing once will save time, as the database is locked while writing to it (to prevent data corruption, see ACID). I prefer to commit after each instruction, because I'm not really worried about 200 other people writing to the database at the same time.
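As an aside (we won't use this in the series), the sqlite3 module also lets you use the connection itself as a context manager: the block commits automatically if it succeeds, and rolls back if it raises an exception. A minimal sketch, using an INSERT we will meet properly in a moment ('Jane' is just a made-up row):

# Aside: "with conn:" commits on success and rolls back on error,
# so no explicit conn.commit() is needed inside the block.
with conn:
    conn.execute("INSERT INTO my_table VALUES ('Jane', 1500)")   # hypothetical example row

Note this only manages the transaction; it does not close the connection.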
c.execute("INSERT INTO my_table VALUES ('Joe', 3000)")
conn.commit()
Okay, so our database has been created, but it is empty. Time to fill it with values. INSERT INTO is the SQL command, and what follows is the way to insert data into the table we created.
Warning: SQL commands may vary across different databases (although not for simple commands like INSERT). If you were using a different database like MySQL, you might need slightly different commands. I recommend just Googling “How to do X in database Y”.
We insert two values: the name Joe and a salary of 3000.
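This is also where the DEFAULT from the CREATE TABLE earlier would kick in: if you name the columns and leave Salary out, Sqlite fills in 0 for you. A quick sketch ('Ann' is just a made-up name):

# Aside: naming the columns and skipping Salary lets the DEFAULT 0 fill it in.
c.execute("INSERT INTO my_table (Name) VALUES ('Ann')")
c.execute("SELECT * FROM my_table WHERE Name = 'Ann'")
print(c.fetchall())   # [('Ann', 0)]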
Of course, this is a very slow process. What if you have 10,000 names and salaries? Will you be sitting there running one insert command after another?
Sqlite provides a way to insert multiple values.
names = [('James', 2000), ('Sally', 5000)]
c.executemany("INSERT INTO my_table VALUES (?,?)", names)
conn.commit()
You need to create a list of tuples, as we have done for names above.
Then we use the (?, ?) placeholders with executemany(), and Sqlite will insert all the values for us.
So we have inserted a few values into our database. Now is the time to read them.
c.execute("SELECT * FROM my_table WHERE name = 'Joe'")
result = c.fetchall()
print(result)

[('Joe', 3000)]
The SELECT command is used to search for data. You can select individual columns from a table, though here we are using * to return every column.
Once we run the SQL search command, we need to call fetchall() to get the results. As you can see, we get our name.
What if we search for something that doesn’t exist?
c.execute("SELECT * FROM my_table WHERE name = 'MrT'")
result = c.fetchall()
print(result)

[]
We get an empty list back.
We can also search with conditions.
c.execute("SELECT * FROM my_table WHERE salary > 2000")
result = c.fetchall()
print(result)

[('Joe', 3000), ('Sally', 5000)]
The above returns all people with salaries > 2000.
conn.commit()
conn.close()
A final commit(), and then we close() the database.
This was a very simple introduction to SQL and Sqlite. We will be building on this knowledge and looking at more complicated examples when we open our 5GB file.
I deliberately kept this section simple, so you can see how simple SQL is. Most of the commands are intuitive and easy to follow. If you get stuck, you can get an answer in 2-3 Google searches. In the examples below, we will cover this in more detail.
Okay, with that knowledge, let’s start with reading our 5 GB file.
Reading our file and storing it in a database
I broke the video into 2 parts for easy editing, but it’s actually one long video. The notebook we are using is Open 5 Gb file.ipynb. Transcript is below the 2 videos.
Make sure you have downloaded the file from http://download.cms.gov/nppes/NPI_Files.html
On Windows, use 7Zip to unzip it.
Also download Sqlite Browser, a cool tool that allows us to look at the database visually. http://sqlitebrowser.org/
Have a look at the data you downloaded. The 2 pdf files describe all the data. There is a csv file with FileHeader in the name. That’s just the header. Open it in a text editor.
All of that is one line. It is being word wrapped, but it's one kilometre-long line of text. You can see how many fields each entry in this database has, which is why the file is so big.
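If you want to count the fields for yourself, here is a rough sketch. The file name below is a placeholder, so use whatever the FileHeader csv in your download is actually called, and note a plain comma split is only approximate if any field name contains a comma:

# Rough field count: split the single header line on commas.
# "npidata_FileHeader.csv" is a placeholder name; adjust it to your download.
with open("npidata_FileHeader.csv") as f:
    header = f.readline()
print(len(header.split(",")))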
To make our lives easier (because I really don’t want to type 557 SQL commands to create this database), we will only extract 4 values from this database: Name (of doctor), Legal name (business name), city and state.
Not all of these values may exist: this is one of the risks of working with data that has been compiled from multiple sources. We will have to assume some fields may be empty, which is why we can't use the NOT NULL constraint we used in the last example.
The Name field is itself made up of 3 different fields. If you look at the header file, you will see:
“Provider Last Name (Legal Name)”,”Provider First Name”,”Provider Middle Name”
We will combine these three to store the name as one field. Again, that's a decision I just made; there is no rule that you have to do it this way.
Let’s look at the code. We will create the database first.
import sqlite3

db = "npi_data.db"
conn = sqlite3.connect(db)
c = conn.cursor()
Same as in the last example, except this time we give it a file name to connect to (npi_data.db). Sqlite will create this file if it doesn’t exist (though I recommend you delete the file manually if you run the code again).
We then connect to the database and open a cursor connection.
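If you would rather handle that clean-up step in code instead of deleting the file by hand, a small optional sketch:

import os

# Optional: remove any old copy of the database before re-running the script.
if os.path.exists("npi_data.db"):
    os.remove("npi_data.db")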
c.execute("CREATE TABLE npi_data (name TEXT, legal_name TEXT, city TEXT, state TEXT)")
conn.commit()
And we create our table with 4 fields: name, legal_name, city and state. All are text, to keep things simple.
At this stage, open up Sqlite browser, and open the database file we just created.
This is just a good confirmation that our database was created as expected. We can see the table we created.
You can go to the Browse data tab, but it will be empty for now.
Okay, now we come to the meat of the code that will read our values.
Note: In the video, I develop this code from scratch, so I explain the choices I made, and why I did certain things in certain ways. I also make some mistakes on the way. In the transcript, I will just explain the final working code. For better explanation, I recommend watching the video.
We will see the whole block of code first, and then go over it line by line.
counter = 0
db_list = []

with open("npidata_20050523-20151108.csv") as infile:
    for line in infile:
        line = line.replace('"', '')
        data = line.split(",")
        if (counter > 0):
            city = data[22]
            state = data[23]
            name = data[6] + " " + data[7] + " " + data[5]
            legal_name = data[4]
            db_list.append( (name, legal_name, city, state) )
            #print("Name: " + data[6] + " " + data[7] + " " + data[5])
            #print("Legal Name: " + data[4])
            # print(data[22], data[23])
            #print("\n\n")
        #print(line)
        #if counter == 5:
        #    break
        counter += 1

print(len(db_list))
Okay, let’s go over it in parts.
counter = 0
db_list = []
counter will be explained when we use it. db_list is a list of tuples that will contain the values we want to write to the database (using the (?, ?) method we saw in the first video).
with open("npidata_20050523-20151108.csv") as infile:
    for line in infile:
We open our file with the with statement (a great feature, as it automatically closes the file for us, even if an error occurs).
The next line is important. How do you open a 5GB file? Line by line. We loop over each line in the file.
line = line.replace('"', '')
The problem with the names in the file is that they are stored with inverted commas, so Joe is stored as “Joe”.
When Python reads the line, those inverted commas become part of the string, so what we would end up writing to the database is “Joe”, quotes and all, rather than just Joe.
In the video, this got me and I had to work around it.
The fix is to remove all the inverted commas, which is what the code above is doing. We don't need them in our data; Python keeps track of the strings for us without them.
data = line.split(",")
You may have noticed the file is comma separated (a csv). We will not be using Pandas for this example; we will extract the fields manually.
We call split(",") to create a list of the entries separated by commas. So:
"Hello, World, I am going".split(",")
will become:
['Hello', ' World', ' I am going']
We split each line of the file we get, so we can get all the individual fields.
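As an aside, if you would rather not strip quotes and split by hand, Python's built-in csv module understands quoted fields and will do both steps for you. We stick with the manual approach in this lesson, but a minimal sketch would be:

import csv

# Aside: csv.reader removes the quotes and splits each line into fields for us.
with open("npidata_20050523-20151108.csv") as infile:
    reader = csv.reader(infile)
    header = next(reader)        # the first row is the header
    first_row = next(reader)     # peek at one data row
    print(first_row[4])          # the legal business name column, same position as below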
So in the header field, the first entry is “NPI”. The field we need, the legal name (“Provider Organization Name (Legal Business Name)”) is the 5th entry ( [4] in Python, as Python counts from 0).
I have gone ahead and counted the positions of the values we need to extract for you. This will become clearer in the code.
if (counter > 0):
    city = data[22]
    state = data[23]
    name = data[6] + " " + data[7] + " " + data[5]
    legal_name = data[4]
We are checking if counter > 0 because the first line contains the header.
As I said above, the legal_name is in data[4]. The city and state are in data[22] and data[23] respectively.
The name field might require an explanation. The name is spread across 3 different fields, and is of the format:
Surname, First Name, Middle Name
We are combining these 3 fields to get: first name, middle name, surname.
db_list.append( (name, legal_name, city, state) )
counter += 1
Once we get the four fields, we add them to our list.
You could also write them to the database as you went along. It depends. If you were going to write everything to the database, it would make more sense to write as you go along; otherwise you'd be back to the old problem of holding 5GB of data in memory. Since we are only working with a subset of the data, it will fit in memory.
We then increment our counter, which isn’t really used much, except for debugging.
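If you did want to write as you go along instead of building one big list, a rough sketch of the same loop with batched inserts might look like this (the batch size of 10,000 is an arbitrary choice, not anything from the lesson):

# Aside: insert rows in batches as we read, instead of keeping everything in memory.
batch = []
with open("npidata_20050523-20151108.csv") as infile:
    for counter, line in enumerate(infile):
        if counter == 0:
            continue                              # skip the header row
        data = line.replace('"', '').split(",")
        name = data[6] + " " + data[7] + " " + data[5]
        batch.append((name, data[4], data[22], data[23]))
        if len(batch) >= 10000:                   # arbitrary batch size
            c.executemany("INSERT INTO npi_data VALUES (?, ?, ?, ?)", batch)
            conn.commit()
            batch = []
if batch:                                          # write whatever is left over
    c.executemany("INSERT INTO npi_data VALUES (?, ?, ?, ?)", batch)
    conn.commit()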
print(len(db_list))

4740851
At the end we print the length of our list. It's almost 4.7 million. Note that each element in our list contains 4 values, so the total number of values is 4.74 million * 4, or roughly 19 million.
c.executemany("INSERT INTO npi_data VALUES (?, ?, ?, ?)", db_list)
conn.commit()
conn.close()
Using the technique we learnt in the first video, we write the whole list to the database in one go, and close the database.
Open up the database in Sqlite Browser again, and go to the Browse data tab.
Yup, the data is all there.
Right to the next video. How do we actually search in this database?
Querying (searching) our database
We will start by opening the database.
import sqlite3
db = "npi_data.db"
conn = sqlite3.connect(db)
c = conn.cursor()
For our first search, we want to know how many practices there are in the city of Houston.
c.execute("SELECT * FROM npi_data WHERE lower(city)='houston'")
result = c.fetchall()
print(len(result))

37661
One thing you will note is that I’m only printing the length of the result, not the actual result. That’s because printing 37,661 results will take up the whole screen. Feel free to change the code and print the actual results if you want.
Okay, to the SQL. The SQL instruction should be fairly simple. We are searching the whole database for rows where the city is Houston. A few points.
One, I’m doing lower(city), as the original data is in uppercase. Even though SQL commands are case insensitive, the actual data isn’t.
One more really tricky thing. In the SQL command, I am mixing single and double inverted commas.
Everything passed to c.execute() is wrapped in double inverted commas (" "), while the city houston is in single inverted commas ('houston').
We have to do it this way, otherwise Python will get confused. The double inverted commas delimit the Python string passed to c.execute(). The single ones are so that Sqlite correctly interprets houston as a string. Try changing houston to double inverted commas in your code, and you'll see what I'm talking about.
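As an aside, you can sidestep the quote juggling altogether by using the same ? placeholder we used for the inserts, and passing the value separately:

# Aside: a ? placeholder keeps the quoting out of the SQL string entirely.
c.execute("SELECT * FROM npi_data WHERE lower(city) = ?", ("houston",))
print(len(c.fetchall()))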
Next, I found a city called Athens in the list (like the Greek capital). We search for that:
c.execute("SELECT * FROM npi_data WHERE lower(city)='athens'")
result = c.fetchall()
print(len(result))

2959
There are almost 3,000 results but, surprisingly, there is more than one city called Athens. How do we search for only the Athens in the state of Texas?
c.execute("SELECT * FROM npi_data WHERE lower(city)='athens' and state='TX'")
result = c.fetchall()
print(len(result))

205
SQL allows you to chain conditions using the and keyword. In the example, we search for rows where the city is athens and the state is TX (the code for Texas).
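As a quick aside (using a keyword we haven't covered), SELECT DISTINCT will show you which states those other Athens are in:

# Aside: DISTINCT lists each matching state only once.
c.execute("SELECT DISTINCT state FROM npi_data WHERE lower(city) = 'athens'")
print(c.fetchall())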
This is all very good if you know the exact name of the city. What if all you know is the city starts with a Z?
c.execute("SELECT * FROM npi_data WHERE city LIKE 'Z%'")
result = c.fetchall()
print(result)
The % symbol matches any number of characters after the Z. One more difference: rather than doing an exact comparison using the = sign, we are using the LIKE keyword to tell SQL that we are not looking for an exact match, but a pattern. This will print several hundred results.
What if we want to search for cities that start with Z and have exactly 4 letters?
c.execute("SELECT * FROM npi_data WHERE city LIKE 'Z___'")
result = c.fetchall()
print(result)
The underscore (_) in SQL matches exactly one character. So Z___ (Z followed by three underscores) tells Sqlite to search for all cities that start with Z and have exactly 4 characters. This will return cities like Zuni and Zion.
We can always narrow the search by looking for 4-letter cities starting with Z in the state of Illinois.
c.execute("SELECT * FROM npi_data WHERE city LIKE 'Z___' and state='IL'")
result = c.fetchall()
print(result)
To narrow it down even further, let's extend the previous search. This time, we are also looking for names that start with Ma.
c.execute("SELECT * FROM npi_data WHERE city LIKE 'Z___' and state='IL' and name like 'MA%'")
result = c.fetchall()
print(result)

[('MARGARET ROBINSON', '', 'ZION', 'IL'), ('MARTI SNITTJER', '', 'ZION', 'IL'), ('MARK E MITTEN', '', 'ZION', 'IL'), ('MAULIBEN PATEL', '', 'ZION', 'IL'), ('MARY ANN HOLEM', '', 'ZION', 'IL'), ('MARLA JOY SERDAR', '', 'ZION', 'IL')]
One final thing before we close this section. You don't need to depend on SQL for everything; you can fall back to Python as well. For example, you could have fetched all the cities in Illinois, and then used Python to loop over the list, picking out names that start with Ma.
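A rough sketch of that Python-side approach (same database, just filtering the rows in Python):

# Aside: fetch everything in Illinois, then filter the names in Python.
c.execute("SELECT * FROM npi_data WHERE state = 'IL'")
illinois = c.fetchall()
# Names in this data are stored in uppercase, hence "MA" rather than "Ma".
ma_names = [row for row in illinois if row[0].startswith("MA")]
print(len(ma_names))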
You can look at individual results using Python:
result[0][0]
'MARGARET ROBINSON'

result[0][0].split(" ")
['MARGARET', 'ROBINSON']
This is just to give you an idea. That’s it, see you in the practice session.
Why you need to practice
This short presentation goes over the need to practice coding. Skip it if you have seen it before.
No transcript; instead, download the presentation here.
Finally, time to practice our SQL.
Practice Session
Same repo as before, open up Sql Practice.ipynb.
No transcript. Just open the practice file and follow the instructions. If you get stuck, the video has a few hints, but the biggest hint is: Everything in the practice session is based on what we’ve done before, so go over all the examples in this page, and you should be able to finish the practice session.
More data science lessons here.