SQL Bootcamp - Episode 2

Description

Learn intermediate SQL Server coding concepts. Commands taught: Outer Join (Left Outer Join / Right Outer Join / Full Outer Join), Inner Join, Cross Join, Aggregates (Count, Sum, Avg / Average), Windowed Functions (Row_Number OVER), Group By, Having, Partition By, SubQueries (Common Table Expression, CTE), Declare Variables, Filters (WHERE ON), cast

Transcript

This is part two of my series on learning the SQL database language we'll be completing the practice exercises on the hackerearth.com website once you navigate to the site you can click on the search icon at the top and type SQL then you want to filter for problems or challenges and as you scroll through the challenges and click on each one you'll notice that they've already been completed so for example this particular challenge was done in 2016. that was for people that wanted to compete live even though it has ended we can still complete the problems we just won't be able to compete intition so if I click on solve problems you'll see a list of potential problems I can solve that are part of this challenge but if you go back to the original list you'll notice that each challenge has its own set of maybe five or ten problems so you'll want to go through each challenge one at a time and solve all of the problems inside each challenge in addition once you click on a specific problem you'll want to scroll down to the bottom and pick the database engine that you want so in this case this particular problem can only be solved using the Oracle database engine if that's not the engine you want to practice with you should just skip this problem altogether and instead pick a different problem that supports the engine you want in my case I want to use Microsoft SQL Server 2019 and so this particular problem supports that so this is a problem I might want to solve and one thing to note at the top it says that the problem is not available for practice that's actually incorrect I'm not sure why the website says that because right here it says the table is triangle and it has these columns if I just really quickly do a select star from triangle and click execute and test you'll notice that the query output a bunch of records from the database so obviously it is working just fine so ignore the fact that it says it's not working so let's solve these problems this first one says I have a table with three different sides to a triangle and this person wants me to design a chimney which has the maximum area for a triangular shaped chimp so this is the formula to how to calculate the area of a trend it says that I need to pick the maximum side from each of the columns and calculate the area so if I look at these two rows the maximum side is just going to be the largest of the two so in this case side one the maximum is seven but for side two I have these two choices six and Seventeen so the maximum will be 17 and for side three the maximum will be 18. and then if I consider side one is a side two is B and side three is C I need to calculate this s variable and then after I have the S I can plug it in with s a b and c and calculate the area with that and the answer is in this particular case is 59. so I don't really understand what this formula is accomplishing exactly but I do know if I maximize each of these columns and plug them in that I will get this answer so let's just code it up so first of all I have a triangle table with these columns so let's query that data select side one side two and side three from triangle now it has an ID column but I don't need that in this case in order to get the maximum from each column I can just use the max keyword so Max is a function that's called an aggregate function there are other types of aggregate functions for example Min AVG which is average sum which will add up all the numbers and count and those are just a few examples there are more and essentially what they will do I need to change to Microsoft SQL and I just lost my code okay I'll switch back to mySQL copy my code and then Microsoft SQL server and paste it so what maximum will do is because I'm querying the triangle table then I'm referencing each row in that table almost like one at a time the maximum will take the entire list Loop through them and figure out which one is the largest and return just that one value so where I originally had two rows this will just return one row because it will just be the largest value but it will mix and match the value so for example for site one it will give me the number seven but for side two it will give me the number 17. because it's reading vertically through the entire column and getting the maximum value so it doesn't have to all be from the same room so let's just run this by itself and see if I get 7 17 and 18. if I get different values it could be that um yeah so probably the example data they show at the top is not the actual data set that they run the test against so it's probably fine in this situation that I got 12 22 and 20. so um if I wanted to give a variable name to this column so for example when I run maximum on side one it is no longer the column of side one because it's been modified and it doesn't modify the real data it's just modifying it on the fly as it shows it to the user but it loses its name because once a formula has been executed on it the data's been transformed so I can Alias it because without aliasing it it has no name at all so by using the as keyword I can Alias it and I could call it something like maximum side one and you can call it anything you want just like a variable name can be anything you want but you want it to be descriptive enough so that someone else reading your code is going to understand it or even yourself in the future when you come back that you might be able to understand it because if you just give it a random term that it's not descriptive like value it's like or number that's too generic so it might be hard later on to understand what it's doing so I'll call it maximum side one and actually in their original formula they called it a b and c so I could also use that as a ASB as C and the as keyword is not required if you just do space a the as is implied but I think it's better to still type it even though it's not required just so the code's more readable now I need to add them together in order to calculate the S but I still need to keep the original values because they need to be separated when I'm doing the final formula for area so I think I'm actually going to use variables in general in SQL it's better to just do everything in an actual query and not use variables as much as possible variables are more like traditional languages like C plus plus and the issue is variables only act on one value at a time whereas something like a select statement runs off of an entire table which has multiple rows and so databases are optimized for modifying a large amount of data at once so once you get into variable then it's no longer optimized for a large data set but in this particular case I think it'll be okay so I'm going to make a variable called a and make it an integer which means it cannot have decimals and it has to have numbers no letters and I won't give it a value yet so normally if I put like equals 0 or equals one that would be like a default value but instead of assigning it here I'm going to say at sign a equals inside of this select and then I don't need this Alias on the column anymore and I'll do the same with b and with C you use the at sign whenever you're defining a variable so when you write your syntax this way where the variable gets assigned inside of a select statement your query can only return one row so for example if I did not have this Max on here the triangle has two different rows in it at least in the example up here so this would be confusing because a variable can only have one value just an integer in this case for the a variable and by taking off the max I'm now reading the value 7 and putting it into the letter A the variable a and after it's reading the number four I'm putting into the variable a so since the variable Can Only Hold one value the 4 is going to overwrite the original Seven so the final value will be four but that's kind of like an error because it's trying to put two values in so depending on the database engine it may throw an error or you may just get unexpected results because you don't know which order the values are going to be assigned to the variable but most likely it's going to be the very last value that's going to be assigned and that's just not very useful so if you're not going to do something like Max to enforce that only has one row you could also use a command like top one to make sure that only one row is returned but I'll I'll use the max keyword so now after this query has been executed each of my three variables should have if the correct value then I just need to execute this formula and I'm just going to copy and paste it so I don't have to retype it so because s is a new variable I'm going to use the declare command and I'll use an at sign s and because it's being divided it could have a decimal so instead of integer I'm going to make it a numeric and the first number inside a parentheses the first number tells it how many maximum digits can be in the number so I'll just say something like 10 and then the second is how many decimals of precision so I think I'll say two so and because if you look at the original answer they said 59.39 so they're doing two decimals of precision so technically I could have used four comma two because in this case there's four total numbers and there's two to the right of the decimal but I just chose to do a higher number in case some of the examples they have when they execute tests happen to be larger numbers so because I copy these formulas in they are not an SQL form so I just need to change these to SQL variables by putting an at sign in front and then this should work to calculate the S variable and then I'll do the same thing for the area and again just put the at sign in front of each variable and semicolons at the end of each line are not necessarily required but you can add them just to separate the different commands it's up to you and then afterwards I just want to Output the area that was calculated now because this is a variable it also does not have a column name uh similar to how the max command modified the column and so it lost its column name variables are not tables and not columns they're just individual values so I can still use them in a select but I need to Alias them if I want it to have an actual column name when it's output and you'll notice that in this select statement I don't have a from I'm not reading from a table I'm just reading directly from a variable so in this case there's only going to be one row so let me just run this and see what happens so I am getting an error here I'm not exactly sure what the issue is it says that there so incorrect syntax means there's some kind of keyword or something that I type but what's weird is it says near and then has empty quotes I don't remember putting empty quotes in my code anywhere and it doesn't exactly give me a line number either oh I think I realized what I did wrong on my variables I'm missing the in so I did not Define my B and my C so I can run it again and see if that fixes here but what I was going to say is in a situation where you're not sure what you did wrong what you could do is copy the whole thing and uh back it up just in notepad or word or something and then just delete part of your code and run it and see if the error goes away so that you can narrow down one line at a time to see which particular line of code is broken and that's only when you don't have good error messages usually you're going to have better error messages that tell you the exact line of code now that I've typed this int let me execute again and see if it fixes the issue okay so it's a little bit better it says I have to just Define the variable a so I did Define it here so I'm wondering if it's going to confuse by my semicolons because it's thinking that they are separate queries I'm going to take all those off because it definitely was defined here and so it should have worked here although I did notice with the area in one case I've got an uppercase and somewhere else lowercase it may not matter but just for consistency sake okay so I've got a new error now now it says it doesn't like the equal sign again I'm not sure which equal sign is complaining about but I'm going to assume it's this one with the select so let me just cut out some of the code and re-execute to see okay so in this case I did not get an error but I just said my result was wrong so that makes me think it was a different equal sign it didn't like so let's see if it happens to be line six that's the issue okay you again no issue so maybe it's on the area equals okay now that it came the air came up so it must be on this line here let me make sure that my parentheses are all lined up right so I've got this set of parentheses that look fine these look fine and these look fine and then that square root oh um it makes sense now before my equal sign I did not Define the variable that area is using so it should also be a numeric 10 2 I think let's try again okay not only did it execute with no errors but it gave me a green check mark meaning I got the correct answer so perfect so just to recap declare defines a variable at sign is a prefix you use anytime you use it a variable name variables names should be letters but they can be full words like area they should not have like weird symbols or spaces in them and then int and numeric are the data types so the type of values that are allowed to be inside of that variable and then we've already covered in a previous video the select and the from Clause so I won't go over that but the equals allows me to assign a value to a variable people as it's querying the data the max is an aggregate function that's going to Loop over all the rows and find the largest value in that column and the star symbol in this case is doing multiplication so in the case that you do select star that asterisk means return all columns but in a math formula when you use a star or an asterisk it does multiplication and then at the end you can do a query off of just a variable without a from keyword and that will still execute and return the value of that variable as the final result of your query so I do want to get into more details on aggregate functions like Max but rather than explaining it here let's see if one of the next problems requires those because there are other keywords for example Group by that you'll commonly use along with Aggregates but we'll go over that hopefully in another example okay next example okay so it's given me a table of students and within the table there are three columns name age and height and the the problem basically says that there's going to be a sporting event at the University and students are going to be picked at random to be on teams and each team will have size three and because it's Sports then the height of the student is the indicator of how good they'll be at sports says if I sum all of the heights that will tell me the strength of that team or group of students and it says when I calculate the strength I need to round it to decimal places so in this case this team has a height of two for three different students so if you sum that up two four six answer six but it has two decimal so this seems again I need to change to Microsoft SQL Server this seems similar to the last one but instead of the max keyword I can use the assume keyword so I really just need the height column from the student table so this will return me select a height from student will return me three rows that say two two and two but I want to add them up so I'm going to do some so then it'll only give me one row that has the and the value six but I want it to have two decimals so I'm going to cast it so the original height is an integer so when it's summed up it'll still be an integer but if I cast it as a numeric and it doesn't really matter what value I use on the left as long as it's big enough to handle um like in this case 6.00 it would be three total numbers so for the example I could use three but I'll just make it bigger just in case there are some really tall people I guess if it's meters really three is probably fine but it doesn't hurt to make it larger anyways and then I want two for the decimals and it doesn't really matter what I call this column I don't think it didn't give me it didn't tell me that but again the original was called the height and now that all these formulas have modified it it's no longer going to have a column name so I'll just call it strength because that's what the problem said it's like the strength of the team so let me execute this okay I didn't like my keyword student That was supposed to be the table and it looks like I just missed the S on the end students try again okay I got the green check mark so I got it right next problem okay this problem is actually kind of challenging so we'll break it it up into smaller parts so essentially I have a table called first which stands for the first semester so each semester you have students that take classes in certain subjects and Marx is what kind of grades they got so I assume the higher the number since it's an integer the higher the number the better and for the second semester it's similar where you have subjects but instead of tracking the marks or like what grades they got it's tracking um tests this is actually a true or false so they really should have used a bit column but it's text and they use the word yes or no to say whether or not they took a test in that subject and in particular it says that in a student in the first semester table if it's the student's favorite subject they also get the best grades in their favorite subject so the problem I'm supposed to solve is by taking a list of students in two different semesters so I could match up the students based on the ID so id1 here the student's name is a and in the in the first table so if I take the second table and match up id1 it's even though this does not say the student's name so the subject is Hindi I can join these two up because they have the ID of one that means it's the same student so student a got a score of 205 in math and 30 in English so their favorite subject is math but then within id1 which is also student a they took a test in Hindi and they so these subjects are actually lectures that they attended so they took they attended two lectures in math but only one had a test and the other one did not have a test the other did so uh well the assignment is to figure out which students have attended at least two lectures with a test in their favorite subject so for example student a their favorite subject is math and they did have two lectures in math but only one had a test so they would not be considered an exceptional student student two is named B their favorite subject is Hindi because they had a higher score of 70 compared to 50 in science so Hindi is their favorite subject and within id2 they took Hindi lectures twice and both times it had a test so they would be considered an exceptional student but the idea was to what the name was B and in the output you see that it's supposed to show the name b and then C since they only have one class it's automatically the favorite which is Hindi um and then they also had two Hindi lectures which each had a test so they would be an exceptional student and I noticed this is a little odd but the second table wasn't really sorted so student one actually had an additional math lecture but because it was not tested they are still not considered an exceptional student so let's break this up into smaller pieces so first I'm just going to get my data out of my two tables right now it's two separate queries but I'm going to modify that in a second I'm probably actually going to join them together but let me think of what columns I need so I am going to need the ID column from both so that I can join them I'm going to need the student column from the first because that's what's going to be output is their student name I'm also going to need to join not not only on the ID to see if it's the to match up the student but also on the subject because in this case where student one uh two subjects math and English because English is not their favorite subject that does not matter um so I will need the subject from both um I guess I really need everything because Marx determines if it's the favorite subject and I need to know how many tests were taken so I think in this case I'm actually going to write two separate queries to solve each half of the problem and then I'll join the two together so let me type everything up how I would solve it and then I'll explain it as I go so I'm going to say with favorite subjects as and I'll say select inside of parentheses select Student's name and their ID and the subject okay so I don't need to actually return the marks because I'm only going to be returning the favorite class which is the highest Mark first of all the subject here because it's highlighted in blue and also first because it's highlighted in blue that means those are keywords that are part of the SQL language in this case it should work just fine but anytime there's a word highlighted in blue that's really just a column or a table I wrap it in square brackets and that just removes the coloring because it tells SQL that this is I'm not trying to use an SQL keyword I'm just trying to use like a column name or a table name that's not always required unless it actually gets an error because it can't understand your code but I choose to do that anyways and then next I'm going to use a different aggregate function called row number okay so let's break this down the entire query is not yet written but I'll just show the first part of it so the with keyword create something called a Common Table expression and a lot of times that's abbreviated as a CTE but another similar concept is basically a sub query so what this means is this query is is considered a sub query because I can use it in an outer query so I've essentially got two separate select statements but as a whole it's executed as if it's one single query so it doesn't execute this one and then afterwards execute this one it actually executes the whole thing so in essence what it's actually doing is doing a cut and paste it's taking the inner part and it's replacing where where the table name is aliased and basically wrapping in parentheses so this is another way that you could write it so I'll write it in both ways these two queries are actually identical they there's no difference between them other than just the way that they are read but I prefer the CTE version the comment table expression I think it's slightly more readable in particular when you have multiple sub queries because we will afterwards want one that's based on like the number of tests that a student took and so I could put a second sub query here and then I could join to tests afterwards and I have not yet explained joins yet but I will in a second but I like this CTE way slightly better because I can have multiple sub queries comma separated and then afterwards have my final query whereas uh this other way the code just kind of becomes more jumbled because if I were to do a join instead with tests in the second way the second query would go in parentheses so I think second was the name of the second semester where the tests are taken so in this case I have two separate sub queries that are mixed in with my join and my outer select whereas with the CTE way each of the inner queries are written on their own and then you have a final query so I just like that in the final query everything's very readable whereas in the bottom one everything just becomes more jumbled so that's that's why I prefer the CTE way but either one technically works so if we just look at the sub query what it's doing is it's taking the first semester table and it's returning the ID of the student the name of the student and the subject so in the case of student a they have two subjects both math and English and so it's going to return two separate rows for student a what this row number does is actually let me copy and paste the number so I can show you okay so if I have another column here called row number then the Partition by first is going to essentially group based on the student column so student a is going to be considered his own section student B is separated as well and a student C so it's kind of like three separate buckets one for each and then within that bucket it's going to do a sort based on the marks column but it's going to be doing a descending order so within student a it's going to look at the Mark so math is 205 which is the largest English is 30 which is the smallest so it's going to sort descending so in this case it's already sorted in that order but for student B it's in the wrong order so it's going to swap the order here so that Hindi is on top because it's the higher value and then this one since there's only one row it the sort doesn't matter it's already in the proper order so then afterwards it's going to assign a value to the row number so that it's going to go 1 2 and then it would keep going as 3 4 except that these are in separate buckets so soon because we partitioned by students and it put student b in a separate bucket instead of keeping going with 3 4 it's going to start over one two and then again with student C it's going to start over again at row number one so the value of the row number it's actually extremely useful I use it all the time the value is that I can now use a where Clause off of this row number to look for the number one and I know that number one is essentially the highest value for each student which is essentially their favorite subject so I called this um sub query favorite subjects but technically I don't have a where clause on this row number so it is not yet only their favorite subject it's just their subjects sorted in descending order so ideally what I would want to do is I would want to take this row number and move it down to a where clause and say equals one the problem with that is for some reason SQL Server does not allow you to use aggregate functions so row number is an example of an aggregate but others are average Max some count and an aggregate is essentially anything that takes a group of rows and calculates some kind of value by looking at the group of rows and so those are not allowed to be inside of a where Clause however once it's a sub query you can trick it and to allowing you to put it into a where Clause so the way I would do that is I'll make an additional sub query called favorite subjects and that's why I'm going to rename my original to just say I guess class classes so inside of favorite subjects I can reference my classes sub query as if it was a table so I'll say select start from classes where now I need to give this row number a column name because otherwise I can't reference it in the where Clause if it doesn't have a name and because row number is a calculation then it's a combination of both the student column and the marks column so it doesn't keep either of those column names once the data has been transformed so I'll just call it ronum and I'll just reformat this a little since it was wrapping the text down to the next line so because I gave it the name a ronum I can say where ronum equals one so I have a classes sub query and the favorite subject sub query references classes sub query and it does a filter off of the ronum column and because it's checking for row number one then for student a math will be kept because it's the favorite subject but English will be removed because it's not the favorite subject and for student B Hindi will be kept because it's a favorite subject science will be removed because row number is two not one so that won't be removed and student c will keep Hindi so now I have the three rows which are my favorite subjects so next I need to know how many tests each student took in a second semester for each class now I could do it just based on favorite subjects but instead I'm just going to do it on all subjects so I'll do select count now count is an odd aggregate function in this case I'm going to say that I'm going to count the subject column but in reality it does not matter most of the time which column you count and I'll explain why in a second and I put brackets around it those are not normally required so you just use parentheses when you call a function and so what goes inside the parentheses is the parameter that's passed in to your function but because it's highlighted in blue it's like SQL thinks it's a keyword and so that's why I put brackets around it so the brackets are not really related to count they're just to make subject not highlighted so I guess I'll leave the brackets out just so that it doesn't look confusing okay so the count will be as I'll Alias it as the number of tests so the table I want to read this from is going to be second and because it's highlighted and blew up with brackets around it and I'll say where test equals yes so the test column if it says no I don't really care that they went to the math lecture if it was not tested because I actually need um I only need to count out the ones that are tested and then I'm going to say Group by student actually student is not in this table it's only ID so I'll Group by the student ID so let's grab the data from the second table and show what this is accomplishing so just like we did Partition by before Group by is identical the reason the keyword is different Partition by or group by is one is used uh directly inside of the aggregate function on the like on the same line as you're declaring the column and it only affects that one aggregate column whereas Group by affects the entire query and all of the Aggregates so for example if I was doing a sum and account and an average all at the same time on the second table though so the second semester this group by would affect count sum and average whereas if I wanted each of them to have their own group by then that's when you use the Partition by command and each of these can technically have their own Partition by if you use Group by all of them have to use that same column or and you can do multiple columns but the the group is what determines the buckets that they're grouped into and partition y does the same thing but with Group by it affects all of the columns with Partition by you can do a separate arrangement of buckets for each column so I technically could use either one since they both essentially do the same thing but just to show both versions I'll do group by in this case so because I'm grouping by the ID which is the student ID the count will be based on those buckets so if I look at the second semester table first I'm going to separate id1 into its bucket then I'll do id2 into its own bucket then ID3 into its own bucket and notice that there is an extra ID one row here at the bottom so that needs to go in the first bucket so this will get moved up here so I've got four in this first bucket now the reason it doesn't matter which column I'm counting if you're doing a sum it would matter right because sum is adding the numbers so you know if I did sum of the ID column two plus two equals four but you can't even add a text together like yes plus yes doesn't mean anything and if I had a separate column which was there for example with Marx which is in the first table you know a sum would take 205 and add to it 70. but count it's not doing any math it's not adding the values together it's just counting the rows so in the case of student id1 it would be one two three four so whether I count Hindi math math math or if I count 1 1 1 1 in either case it's the value 4. um so in general it doesn't matter what column you pass into count so I actually a lot of times don't even do a column at all I do just a static value of one kind of to say it doesn't matter what I'm counting um because it's really just counting rows the one caveat where it could matter is if you have a null value which is essentially an the absence of a value then nulls do not get counted so in this case if this was some other column called ASDF and that's what I was counting nulls would get skipped so instead of being four rows it would only be two rows and the other caveat would be if for some reason I don't want math to get counted twice or three times in this case so I just want Hindi and math if I just wanted this to be a count of two because math I didn't want to count the duplicates then I could count the subject column and I could put the word distinct before the subject column and that would say get ignore the duplicates so math would only be counted once now because it's grouped by the ID even though Hindi was used by the first student with the distinct the second group bucket which is 292 it can start over counting again and ignore the distinct so it's it's not going to care that student ID 1 had a Hindi class it's just going to Care this unite D2 has a Indie class and because I have the distinct it would only count it once here at first 93 it would only count at once but in our particular case we don't need the distinct and we don't have to worry about nulls in our case so I think I will just do the count one just kind of say it doesn't matter what I'm counting however I don't actually want student ID 1 to count four I really just want it to count the times that it took a test and in addition I don't want to combine math and Hindi into the same count because even though I took two total tests as student id1 they were in different subjects and I want to know how many tests were on the subject of Hindi and how many tests were on the subject of math so I'm actually going to modify my group by to include the course or subject as part of the separation of buckets so that way Hindi and tsuna D1 gets its bucket math and student ID gets its bucket and then Hindi and student id2 and Hindi Institute 93 so it gives me an additional bucket so then in this case the count will be one in this case the count will be three and then two and two and then finally in the case of math since these tests were not taken I don't want to count them so instead of getting three for math I actually only want to get one for math so that's why I say where test equals yes now finally you can I mentioned with a where Clause that an aggregate cannot be inside of a where Clause however you can do a second where class laws after the aggregate is done and the keyword is called having so I can say having count of one equals two because I I want to know which subjects had two tests or technically I'll say greater than or equal because if they had three tests that also will work so and I think I could have actually done that here with the row number let me try that so I'll um say having row num equals one and then I shouldn't I could actually rename this classes to favorite subjects and get rid of the second sub query um so you can only have one having and one where I believe but the having executes after the aggregate is X is calculated and the where executes before the aggregate is calculated so I said earlier that I didn't know why you know you can put row number in the where clause and I guess that's the reason why it's because of the order of operations the where happens before the select and so the row number would not be calculated yet and that's why you can't put it in the where whereas the having happens after the select so um that's why where why Aggregates cannot be in the word clause okay and then rather than this is fine that I said count greater than or equal to two but it's better to give it a column name and so rather than copying and pasting your formula you can just use the column name it technically it's not doing any extra calculations because it's not going to actually execute the count function twice it'll realize it's the same formula and only execute it once but in case you ever change your formula or if your formula is really long then it's better just to use the column name okay so then rather than just calling it tests technically this is subjects that have two tests so maybe I'll just rename it to two tests so I don't need this example data at the top anymore but now that I have my two sub queries I need to combine them together and I'm going to use the join clause for that and I forgot along with the number of tests I need to know which student it was and I need to know which subject the was for now something to note with a group by command that is different than the Partition by command is the Partition by because it only affects the one column the one aggregate function you can Partition by anything you want and the select the other select columns are not affected by your Partition by but in the case of group by you cannot have any additional columns in your select that are not in your group by you can have missing ones in your select so I could remove subject off the select if I wanted to but I could not add the test column to the select if it's not grouped so if I add test to the select I have to also add it to the group and that's because since the group by is creating the buckets then if my buckets are ID and subject it would not make sense to have values that are unrelated to my bucket show up next to the aggregate of that bucket because the the data wouldn't make sense and so in the case of Partition by it actually can create data that does not make sense but since you're kind of more in control it's your job to figure out how to make that work so for example let's say I had like a city and a population and a country Las Vegas is inside of Nevada which is an inside of USA but there are going to be other cities inside of Nevada and they're going to be other states inside of USA so if you're grouping by the state and summing up the population then that would mean you would take all the cities in Nevada add all of their populations up and that would be like a subtotal of Nevada but it does not make sense to also put the country next to it because the country was not involved in creating that bucket of population um so you need to also Group by the country in the case of country it doesn't seem too wrong because Nevada is inside of USA but think about going the opposite direction if you did City then you know I'm grouped by the state which is Nevada and I'm adding up every single City in Nevada and getting the total population if I also show the city then the final result would say something like Nevada a million people and then Las Vegas well this million didn't only come from Vegas it also came from other cities like Reno and so it it doesn't make sense to show the city because that was one of the multiple items that were added up together so that's why the group by needs to match the select so if I'm going to show City in the select I have to also put in the group by in the case of Partition by If instead I did some population over Partition by state so if I do a partition by I don't need a group by but I could if I wanted have two separate summed populations one by the country and one by the state and so these are actually doing two separate calculations here so in this case this would be like you know total per state this is total per country um so maybe it would show Nevada if I put the state here next to a million I don't actually know how many people are in Nevada but I'm just guessing and then next to that it would show the USA population which is probably like 300 million and so with the partition buy you can do that if you want and it may create data that doesn't make sense because Nevada is not actually 300 million Nevada is 1 million so it's up to you to make sure you name your columns appropriately so that you clarify that this 300 million is for USA and this 1 million is for Nevada so that's one difference between Group by and Partition by so partition buys a little bit more powerful um and it gives you more options okay so I have my favorite subjects I've got my row number and my having so that it filters out any subjects that are not my favorite and then I've got separately all the classes that I took a test in and groups by the student ID and the subject and then I have a count with a having now I don't actually need to return the count I just realized because the only reason I need the count is just so that I can check if it's greater than or equal to two so I can move it out of my select because the select is what's actually going to be output to the screen or in this case since it's a sub query you know sub query is kind of like a temporary table in memory and so like that value would be calculated and kept in memory so that later on when the rest of the query is running then that value is available but I don't actually need it to be returned by the sub query I just need to use it to calculate if it is greater than or equal to two and I think the same is actually true the row number I just need to know if it's the first row I don't necessarily need to return that value so I think that will still work if it doesn't I'll undo those a couple of changes there but anyways now I have my two sub queries so normally when you're you know writing a select from you put a table name in this case I'm writing my sub query name and then with a join you have multiple types of joints you have an inner join and an outer join and a cross join um and then within an outer join you can do left and right now I would just ignore the right because a left and right is almost essentially the same thing so to explain joins let me copy the first semester table and paste the data down here and then let me also take the second semester data and paste it as well and just to simplify it I'll remove a few of the rows so I'll keep two rows from student one and just one row from student three and then I'll do on the other table I'll do one of student one two of student two on one of three so let's talk about the cross join first what a cross joint does is it will take the very first row and match it one at a time against each of the other two rows so it's combining two Separate Tables together into one giant table and so this first row becomes three rows the second row also becomes three rows because it's going to match against that same list this third row also becomes three and this fourth row also becomes three so four rows multiplied by three rows is twelve total rows and another name for cross join is a Cartesian product so let me paste what the results would be so my columns are going to show two times this the on the left is all my first set of columns and on the right is my second set of columns and then this first row is going to get pasted next to each one and then the second and third rows will also get pasted so here's my original set of rows four times so on the first four rows of the original table the second row got pasted next to it and then for the fourth row then all three rows from the second table will get pasted next to it so this is a cross join and an inner join and an outer join are similar but they have slight differences so you're probably wondering what is the value of a cross join well you're typically always going to do a where Clause afterwards so if I have all of this data the data that makes the most sense in the situation is when the IDS match up because the IDS are the student's ID so on the right side student ID 1 took a Hindi lecture and also took a test and then on the left side student id1 has a name of a they took a course in the first semester on math and they scored at 205. so combining these records together makes sense because they're for the same student however it doesn't completely make sense to combine these two records together because one is for Math and one is for Hindi so your where Clause is going to determine which rows of these you keep and which ones you filter out you really it depends on the specific logic that you're trying to achieve with the specific business rules and that you're being asked for in the report that you're writing but in our specific situation we're looking for a favorite subject to match up with a two lectures that each had a test in that same subject so after we do our cross join we would want to wear claws to match up the student IDs and also tests match up the subject and in our case we've actually already filtered out to only favorite subjects and to only courses that had two tests whereas when I did this cross join I didn't run those initial filters I just picked a few random rows of e but let's just pretend that I did this where Clause so on the left this table would be the favorite subject table or sub query and on the right this would be the tests sub query now I can Alias this to make it shorter so I'll say as f for favorite subject and as t for tests so in my where class laws I can say F dot ID equals T dot ID and that will match up these ID values and if they're not equal because I require them to be equal in my filter so if they're not equal they'll get removed so this one is id2 matched up with id1 so it'll be removed same thing here same thing here same thing here so this would be the final result of checking for the IDS to be equal and then if I also require the subjects to be equal F dot subject equals T dot subject then it would also filter out this top row so the cross join by itself creates way too many rows that don't match up but then the where Clause filters it so that you have only the rows that match up together and the reason we put an alias on favorite subjects or tests is because the two column names are identical so without the prefix if you just say ID equals ID SQL Server doesn't know which table each of those columns are in so it's ambiguous and it gets confused so you need to give it a prefix but you could instead instead of using an alias putting the Full Table name but it gets annoying to have to type such a long name over and over again so that's why you typically will Alias it with just the first letter of the table name and if you end up with multiple tables that start with the same letter then you might do something like for favorite subjects this I would use the first letter the second word as well so FS instead of just F so that's the cross join but inner join and left outer join are similar let me get the original data back so I'm going to comment out the cross and the where and instead I'll do the inner join so a comment just means don't execute this code so that can be used to remove code from being executed or it could also be used to write a note to yourself like note this is a cross join so maybe if you're doing something a little weird that you might not remember or understand later you might write yourself a note so with an inner join it's doing the exact same thing as this cross combined with this where the difference is that the inner join require hires you to give it a filter the cross joint does not have to have a filter so if I actually wanted to keep all 12 of these rows I could do that with the cross join because the where command is a completely separate command from Cross join inner join it is required to have an on keyword afterwards so I would say so up here at the top I did favorite subjects join tests if I don't say the word inner it's automatically implied if I just say the word join that it's an inner join but I'll say inner join tests and I guess I'll start with the well I'll write the code up here at the top inner join tests on and then I'll copy this where Clause to the on so I can still have a separate where Clause that does something else like where you know favorite subject equals maths or something like that but the on Clause is essentially an additional where Clause that gets executed early and the reason that's valuable is you may have like three different inner joints to you know different tables like courses and teachers and you where Clause if you were to use cross joins then each of the Cross joins will get exponentially more and more records created and then your where Clause will become very complicated because you'll have a filter for the tests cross join in your where clause and you'll also have a filter for the courses cross joint anywhere clause and also for the teachers cross joining your workflows so that workloads will become very complicated benefit to the on Clause is it's the same as a where Clause but it's filtered early so when the jointed tests happen instead of getting 12 rows it's going to immediately filter it down to the smaller subset of rows based on your criteria and then when you join again two courses with a cross join it would exponentially create you a bunch of rows but by doing the inner join since you would have to have an on Clause afterwards which filters it it's not going to create nearly as many rows and from an efficiency standpoint the fewer rows you're working with in each iteration of joins the quicker the query can be executed so and also it makes it easier to read so for now I'm just going to join to the one table but I'll values the on clause which is essentially the same as a where clause and I'll filter based on that same criteria so just like before it's going to check for the ID to equal the ID and the subject will also need to equal the subject so in this case math and Hindi don't match so this filter this row would be excluded and then in these would be excluded because the IDS don't match this would be kept because both ID and math matches these would be excluded because IDs don't match and this would be kept because the ID and subject patches so you'd be left with just these two rows so the only difference with an outer join and I mentioned that you have a right outer join and a left outer join the reason I suggest to just ignore the right outer join altogether is because you can easily convert between left and right so what it means by left or right is simply the order that the sub queries or tables are joined in so the when this I forgot the word join here just like before the word inner or the word outer is not required because all left joins are also left outer joins they mean the same thing and all right joins are also right outer joints because they mean the same thing so you can just say left join or just say right join so when you're doing a join if you say left essentially this table favorite subjects is on the left and the tests table is on the right because if you were to put it as one sentence then as you're reading left to right favorite subjects is on the left and Tess is on the right I haven't explained what it means to be on the left or right but I mentioned that you can easily convert from one to the other so if I wanted to make an equivalent right join all I would have to do is take favorite subjects and move it to the right and take tests and move it to the left and now I've converted from a left join to a right join simply by changing the order of my tables and because of that I just think it's easier in my brain to always think about left joins and just forget that right joins ever exist and if I ever wants to do a right join I'll just swap the order of my tables and make it a left join it's just easier for me to think about that way but I'll explain both versions just so you understand so what a left join or a right join means is instead of deleting the row out of memory when it's checking if they match instead it's going to essentially do like a comment in memory to just exclude the row temporarily so in this it's still going to use the same criteria ID equals ID and subject equals subject so this row would be commented out because it has a different subject Math versus Hindi this row commented out because different IDs this one will be kept because it's the same ID and the same subject these commented out because IDs don't match and this kept because the subject and the ID matches now the reason I mentioned that it's being commented out is because what the outer join does is after it's done that preliminary filter it's going to re-look over the rows that are kept and if there's any IDs from the original table that were excluded completely excluded it will add that then back in with nulls for the other table because it couldn't find a match so as an example because it's a left join it means kind of like the primary table in the join is the favorite subject so it's saying I want to always keep favorite subject regardless and then I want to try to match it up with tests but if I can't match it I still want to keep favorite subjects I just want to give a null for the tests so in this case the ID of 2 got completely excluded because it was in favorite subjects but it was not in tests so instead of it being completely excluded this row here of two science 50 would be kept but then on the right side would be nulls and the same would be true for um so it's not just based on the ideas but essentially every row that was in favorite subjects is going to be kept if it can't find a match so this first row one a mass 205 it was kept because it matched up with tests so it doesn't need to know null out the test because I found a match this 2B science is the one that needed to have nulls with it because it could not find a match this to be Hindi also needs to be kept but with nulls 3C Hindi 40 that did find a match down here at the bottom with three Hindi yes on the test so it does not need to copy this and null it out and that's all of the original rows that were in the favorite subject table because everything after that is just a copy of so these four rows are a copy of the original four and these four were a copy of the original four the difference was since I originally did a cross join then the first four were matched up with the first row of tests the second four were matched up with a second row of tests and then the third row of test so the left join allowed me to keep these two rows rather than completely filtering them out but because it didn't find a match in the test table it shows all nulls for the test table so a right join would be identical to that except the primary table would be the test table because it's on the right so that would be the rows we don't want to lose and the ones that could be nulled out if they don't match would be the favorite subject so if I put my cross join results back here and then I run my filter the same rows are going to be commented out so I only have these two rows but that's just the preliminary filtering but then it's going to decide which rows from the right table were completely excluded and it's going to keep one copy of it with all moles on the left so in this case the one Hindi yes was excluded completely so it needs to keep a copy of the 180 yes with nulls on the left table but the one maths no that did find a match and the three Hindi yes did find a match so the other two rows in the test table because they found a match they don't need to keep a copy with the nulls so that's the way an outer join works and then there is also one called a full outer join or a full join and it essentially does what both the left and right outer joints do together so if I paste my original cross joint back in and then I run my preliminary filtering criteria to comment out the ones that don't match then I'll check everything on the left to see if it was completely excluded so this so I have these first four rows on the left I need to think about so if if we remember the original left table looked like this without the comments and the original right table looked like this without any of those comments so if any one of these rows is not returned because it's the left table it would keep it but put notes on the right and if any of these rows didn't match which are the right table it would keep it but put nulls on the left because the full outer join is doing both a left and a right outer join so I only have these two rows that were kept which means since I originally had four on the left table there are two that are missing so it's the 2B science and the 2B Hindi so it's going to keep a copy of each of these with all nulls on the right so that's the left outer join portion and then on the right table I have three rows and the one Mass no was did find a match the three Hindi yes also found a match but the one Hindi yes did not find a match so I would keep that one Hindi yes on the right side and put all nulls on the left side so that's a full outer join the most common ones that you're going to use are just join which really means inner join and then also you'll use left join which really means left outer join there's really not a reason to use cross join very often at all that might be something you use like once a year whereas the regular join and left join you'll use like 50 times a day and then the full outer join you'll use that one once in a while but it's probably more like one time a month rather than 20 times a day like these other two joins and the right join which really means right outer join you could use that just as often as the left joint if you wanted but because it's really easy to rewrite a right join into a left join I think it's just easier to forget right joint exists and just always write it as a left join so if I convert this back to just a regular join or an inner join I think this query is all written and I think it does the logic that I was asked to do since this subject's keyword is blue I'm just going to put brackets around it so let me execute this and see if it's happy okay I got an error and it says first.student is invalid in the having Clause because it's not in an aggregate or group by having only works with Group by but does not work with Partition by and row number cannot be used with a group by so instead of doing this having row number equals one I'm just going to move it to the where Clause now I can't put it in the where Clause of this sub query I have to move it to the where Clause of this outer query and because I'm joining two tables it's best to put F dot on the row number that I'm filtering technically it will figure out which of the two sub queries it exists in because the row number column does not exist in the tests sub query but just for readability's sake I'll put the F dot in front and then I think that this second sub query using it's having I think it does work because it uses a group by whereas the other uses a partition by so I can probably even remove it back out of the select but let me try it okay so it doesn't like it with this num tests being referenced in the having Clause let's see if I just paste the count one in the having Clause so it did still give me an error but it's a different error it's saying it doesn't like this join because I called it tests and the sub query is called two tests so I'll fix that so it looks like this did work I'm able to do the having with an Aggregate and because I don't actually need the aggregate returned I don't need to put in the select and because this other one was using a partition by I wasn't able to put in the having and so I did have to return it in the select even though the final result doesn't need it I need it for this bottom query so that I can do the filter and so I really should not use an asterisk here because that's going to return this row number which the original assignment did not ask for that to be output it said the sample out output should just be the name so that would be F dot student and because they called their column names I should do the same thing so let's execute it again okay I got a green check mark which means it's working so essentially this is the query that solves this problem now this is a very complicated query and I wasn't planning to get into an assignment this complicated yet but that's just the assignment that this tool gave me but we learned a lot of good Concepts here so I'm just going to do a quick recap to make sure it all makes sense so the original assignment was to take the first semester table and the second semester table and to in the first table figure out what the favorite subjects were for each student based on which one had the higher marks so for student one math had the higher Mark so that's their favorite class and then with the second table we were supposed to figure out which lectures each student took that had a test and in addition which students had two lectures in that same subject so in this case student three had Hindi with two different lectures and student two also had Hindi with two lectures but student one did not and then finally the assignment was to match these two together and figure out which student had two lectures with a test in their favorite subject so even though student three and student two both had two lectures in Hindi that may not necessarily be their favorite subject so for example student three that is their favorite subject actually both students who and student three it's their favorite subject because in the case of student two Hindi it's a higher value than science at 70 instead of and so because of that students B and C were output which are the names of student two and three so that was the requirements so let's just go over again how we coded it the with keyword is a common table expression or CTE which allows us to make a list of sub queries that can be used in the final query and that allows us to break up our logic into smaller chunks and then use each of the small pieces together in the final query so we wrote the favorite subjects query by querying the first semester table returning the ID the student name and the class name as well as by ordering the rank of their classes based on the highest mark because it's descending order but doing a separate bucket for each student so that student one can determine its highest Market class separately from student two's Highmark class and once we've determined that row number we're filtering for the first row number which ensures that we're only getting the favorite subject and then for the second sub query we're taking the second semester we're returning the student's ID and the subject we're filtering for only courses that had a test or lectures that had a test and then we're grouping by the student ID and the subject and the group by Will Allow our aggregate to create separate buckets for these categories so the combination of the student and the subject creates a bucket so we can get a count for student one with math and a separate count for student one with science and a separate count for student two with Hindi and a separate count for student two with math and we're returning any subject where the student had two tests with that subject or more then after we've created both of these sub queries with those pieces of logic we're joining the two tables together based on the IDS matching and based on the subjects matching so the join will make every possible combination of favorite subjects with subjects that had two tests but initially that join is not going to match up the right students or the right subjects because it's every combination so the on Clause is going to filter out the combinations that don't match our criteria so if the IDS match and the subject match is going to keep them and the other combinations will be excluded so after the join and on have finished executing then we'll have a list of only the students who had two tests in their favorite subject and then we're just going to return the names at the end so we've covered a lot of new keywords here and I don't want to overwhelm you or make this video too long so I think I'll stop here and then I'll go on to other keywords in the next video thanks for watching and let me know what questions you have about SQL that I can answer

Search This Blog

YT_AIGamer

SQL Bootcamp - Episode 2 - HackerEarth

Popular posts from this blog

AoE4 Mod Tutorial: Making a Crafted Map in the Content Editor

TeamFightTactics - Rules & Strategy - Coding an AI - Part 2

Coding an AI to play video games