SQL Database Normalization Techniques
This guide walks through a poorly constructed database and explains how to implement database normalization techniques in order to ensure the database structure has limited duplication and data integrity issues.
Guide Tasks
  • Read Tutorial
  • Watch Guide Video
Video locked
This video is viewable to users with a Bottega Bootcamp license

In this guide we're going to talk about database normalization now database normalization is essentially a set of best practices and guidelines to allow us to efficiently build and model our database.

So right here I created a table called the ugly table and what my approach with this is to actually show you a very poorly implemented database and then see what kind of things you need to be done in order to fix it.

So that's exactly what we're going to walk through in this guide. So one of the very first things that are required for a database normalization is you do not want to have duplicate values in the same table or you want to limit the amount of duplicate values and so a good example is here.

This ugly table kind of mimics what we've been going through in this course except it has not implemented normalization rules. So here we have an id, a username, a user email, and then we have street one street two city-state zip and then status. So there are a few issues with this.

So imagine a scenario where we have a user with multiple records so they have multiple addresses. Well, what we'd have to do in order to implement that in order for them to be able to be stored in the database we'd have to come here and we would literally have to create multiple records we'd have to create one record put in their username their email address their street. All of those and then when another address comes in we'd have to do the same thing.

Now, what happens if it's someone they get married and their name changes. You're not just going to have to make a name change in one row you're going to have to do it in every row where they're listed and it becomes very challenging and you're not going to want to perform that type of build.

It runs into even more issues. If you were to try to use this to authenticate users because right here you have a user e-mail. That's what a user typically logs in with. So you're going to not even really be able to implement this. So what data normalization does is it does exactly what we've been doing throughout the entire course. It separates any type of different values.

So for example here we have the concept of a user, a user can be connected to the address but it shouldn't be in the same table. So a table should only have one role. So we have our users table here. It only deals with users.

Now a user has addresses but that address really belongs in its own table. So by building it that way what we're essentially able to do is we normalize the table to make sure that we would never run into a situation where we needed to create a new record in this ugly table any time a new address came out.

So whenever you hear about database normalization one of the most common patterns you'll find is it's describing and telling you that you need to be able to clean up a table that has unnecessary columns and usually that means being able to split it up into a few different tables.

I have worked on legacy applications that I came on board for that literally had tables with about 50 60 columns in them and it was an absolute nightmare to work through. They did not follow database normalization standards whatsoever and it just made it very messy to work with the data and the very first thing I had to do was split all of those tables up.

And I can tell you for an application that's about a decade old and has millions of records of data and having to split that and move the data to those other tables that were no fun whatsoever. So that's part of the reason why I wanted to include this in the section was so that you really kept in mind a huge goal when you're building and design your database is to make sure things are as compact and as efficient as possible.

So if you have a situation where like this where you have addresses an address probably needs to live on its own it shouldn't be connected directly to a user. That's what the foreign key relationship is for. You have a foreign key then you can connect it and then you can use joins to wire all of them up.

So that's a big part of normalization. Now another one is want something that this table also does very poorly which is how it manages its data types. So right now we have an ID which is an int that's perfectly fine. That works nicely. Then we have a username, email, street all the way down to zip.

All of these are varchar's that's perfectly fine probably should think about having some different constraints on these. So, for example, a state shouldn't have to be 45 characters long at the end of the day though that's not a huge deal. You should streamline it limit it just for being intelligent about space.

But that's not absolutely critical. But when you come down to status this is where things start to get pretty ugly. Now a status is listed as a varchar right here. But if you think of the way status normally works usually if you have a users table a status is going to be something like a is paid or is free or active or inactive.

You're usually going to have a very limited number of status options and status is usually something that has to be looked up very quickly. So one it should probably not be a varchar. I know it's much easier to store a string but there are many times especially statuses where it's more of a best practice to be able to use integers and so you can map in your application whatever your wiring this too.

You can map this to a zero may be active and a one may be archived or something like that. And if you remember back when we talked about how to use CASE statements you can run a report that outputs each one of those as a string value. So for reporting purposes, it is perfectly fine to do.

That would be if I were shown a legacy application that looks like this. The very first thing I'd look at would be the status. So that's something to keep in mind is make sure that you're smart with the data types that you pick out or else you could run into some tricky things.

picture

Now if I come to the ugly table

picture

and actually, come to look at the details this is where it gets even worse because yes we have a primary key and that's not all that's just the default for the primary key. We don't have unique which it should be list unique. But if you look at it there are no other constraints or no validations anywhere in the database. And whenever you have a situation like this that can get really really bad in a very short period of time.

Now if you're working with frameworks such as the Rails framework many are validations can be placed directly at the application layer and that's fine but if you're working purely with the database that you've built from scratch then you need to make sure that your data integrity is always a top priority.

So, for example, the user email that one has to be in a situation where it's not null. And it also has to be unique for I can't really imagine too many scenarios where that email address would be fine to be duplicated. So that's definitely one that you'd want to cover now.

Normalization in addition to making sure that you have validation so you can have data integrity. It also has to deal with how you're structuring and modeling your data. And that comes down to naming as well. So another issue with this stable is that the naming and also just kind of styles all over the place.

So right here we have this snake case for a few items so snake case is when you use an underscore just like we did for username e-mail and street one then we use camel case right here where we don't use an underscore but we use a capital T. And then here with City we just capitalize the first letter and the whole thing is just a little bit messy.

So being able to streamline this and you already have a model for how this is supposed to look if you look at the guides table or the users table or the addresses table they all have a perfectly uniform naming structure and everything's the same that comes in very handy when you're writing your queries because you want to be able to not have to think about how you might have styled something.

I know if I want to access the address of the table I know that I pre-pended addresses in front of every one of the column names just so I can know I can type that in and that's going to that column and so things like that you know it gets messy and usually run into situations that look like this when you've had about five developers all worked on the same table and didn't follow the structure of the previous developers.

You can have this mishmash of different ideas and styles so it's important to always keep that in mind. Obviously, if you're going to do something like make it change to a legacy application and you're going to make a change with the naming that also can be very problematic because of it's connected to an application that calls city with a capital C or calls street to in camel case. If you change the name it's going to break.

So that's something to keep in mind. A lot of the things I'm telling you are so that you keep them in mind when you're building your own database and when you're modeling your own database. And so those are a number of key normalization kinds of concepts. The very first one is being very intelligent and very intentional about how you actually pick out what columns go in which table and not overloading a table with too many columns.

Another one is being able to make sure that you have all of the validations in place that every single time that you are going to query an item you know that there's going to be a value there for the ones that really need to be filled. And then obviously things that revolve around style. So keep those things in mind whenever you're modeling your own database so that you can have a clean structure that is as bug-free as possible.