Friday, February 24, 2012

Does ORDER BY Affect the Query Speed?

hello,
I have a query that inserts into a new table, and then I select from this table.
If I add ORDER BY to the INSERT INTO script, does it affect the speed of the SELECT?
I have a big table that takes about 70 sec|||Yes, it will either be ignored, or it will slow the INSERT down|||I'm confused. ORDER BY is not allowed in INSERT statements, is it? Unless part of a sub-query or something along those lines. And why would you do an order-by on an insert, since it makes no logical sense to try to instill order on a table that is not ordered.

The ORDER BY has to be on the select OUT from the table.|||I use ORDER BY in INSERT INTO because I use INSERT INTO with a SELECT statement; I have the SELECT nested inside the INSERT

My question is: if I add the data to the table in order, do I get better SELECT query performance|||So your ORDER BY is not in the INSERT itself, but rather in the SELECT that is gathering the input. That is what I meant.

You're wasting your time and typing fingers. You cannot instill order when inserting to a SQL table, because there is no valid concept of that type of order in a SQL table.

SO, the short answer is NO. It will not affect later query speed. It is also a direct affront to all things good and beautiful. Most probably it would cause a Smiting Blow from the SQL Gods were it to be implemented in a production database.

If order is important, do it on your SELECT when you are pulling data out of the table.

The only time I would put an ORDER BY in an insert statement is if it was necessary in the SUB-SELECT to assure that the right data is pulled from the source table to be slapped into the INSERT table. OR if you need to limit the input from the select...as in:

INSERT dbo.MyDestinationTable
SELECT TOP 100
    Myfield1,
    Myfield2,
    MyDate
FROM MySourceTable
ORDER BY MyDate DESC

and in this case, the ORDER BY is relative to the SELECT, not the INSERT.|||Thanks man|||You're welcome, dude.|||cowboy,

all of that is straight from the orthodox canon and that's fine and good, yet before I knew any better I wrote a server-side pagination thingie that does an INSERT with a SELECT and an ORDER BY into a temp table and it does the row numbering with an identity column. It worked and still works, and every time I read this thing about it not being reliable, I have never seen it not work in practice.|||"SO, the short answer is NO. It will not affect later query speed." please, sir, i must disagree

i think i gave the correct answer already

this --

INSERT INTO ... SELECT ... FROM ... ORDER BY ...

is going to be slower than this --

INSERT INTO ... SELECT ... FROM ...

simply because the ORDER BY will take extra time

(and, as we all know, the ORDER BY isn't guaranteed to actually insert them in the correct order anyway)|||OK Sean, I see your point, I knew I should have loaded my Orthodox Sidearm with birdshot so it scattered better. I still think that your example is another situation where the ORDER BY is related to the SELECT rather than the INSERT. Maybe it's semantics, but an order by in your situation is actually used to populate one of the INSERT table's columns, which is the ID column. Conceptually (in my mind, anyway) this applies to the gathering of the data to be inserted. NOT the ordering of the insert table.

In your example, the insertion of the data has nothing (reliably) to do with the actual physical order of the data in the table, nor in and of itself anything to do with the speed of a subsequent select from the table, correct? Of course the speed to be gained, if any, on the select has to do with the construction of the SELECT statement, and any applicable indices that may be in place. NOT, per se, by anything you did on the insert itself.|||But the original question was whether subsequent SELECTS from the table will be faster, which they are not. There was no question as to which method of inserting was going to be faster. Presumably that is beyond debate.

If you have some free time, and a busy multi-CPU server, do a sp_helptext sp_who2, and scroll down to the bottom. I always loved the comment there:

-- (Seems always auto sorted.) order by spid_sort

On a busy server, parallelism takes over, and you get the results in every which way. I wonder how much processing time is shaved off by commenting out the order by statement. I mean, you might even have 200 or 500 rows to order!|||please, sir, i must disagree

i think i gave the correct answer already

this --

INSERT INTO ... SELECT ... FROM ... ORDER BY ...

is going to be slower than this --

INSERT INTO ... SELECT ... FROM ...

simply because the ORDER BY will take extra time

(and, as we all know, the ORDER BY isn't guaranteed to actually insert them in the correct order anyway)|||Well then, I guess you've forced my hand. Unfortunately, I've got to disagree with your disagreement then. Your answer was correct in your interpretation of the original question. That is, if he/she (or is that HeShe? I get that confused sometimes, which really has caused problems in the bar at times at closing) was asking if the ORDER BY would cause the INSERT to be slower. If that is the case, I am behind you (at a respectable distance, of course) 100%.

However, if heshe asked if using ORDER BY on the INSERT would have any effect on later and independent SELECTs, then I stand by my answer. Nope.|||Daft I know, but I have always* wondered whether ordering an insert by the clustered index definition is better than\no different to not ordering an insert. I have a feeling I have read somewhere one way or the other but I can't remember what the upshot was. The thing is - unless SS does implicitly order the insert anyway, then it will be slower than an unordered insert as there will be page splits. When using BCP you can tell BCP if the order of data in the file & the clustered index order are the same, and the bulk insert is faster.
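The T-SQL equivalent of that BCP switch is the ORDER hint on BULK INSERT - something like this (table, file and column names made up for illustration, untested):

```sql
-- declare that the data file is already sorted by the clustered key,
-- so the engine can skip the sort and do a faster, minimally-logged load
BULK INSERT dbo.MyBigTable
FROM 'C:\data\mybigtable.dat'
WITH (
    TABLOCK,
    ORDER (MyDate ASC)
);
```

The hint is only an assertion - if the file isn't actually in that order, the load either sorts anyway or errors out, depending on constraints.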

It is this sort of thing that gets me to sleep at night.

*not literally|||"But the original question was whether subsequent SELECTS from the table will be faster"

Oops.
"There was no question as to which method of inserting was going to be faster. Presumably that is beyond debate."
Double oops.|||"I have a feeling I have read somewhere one way or the other but I can't remember what the upshot was." perhaps this may help:
INSERT queries that use SELECT with ORDER BY to populate rows guarantees how identity values are computed but not the order in which the rows are inserted

-- Ordering guarantees in SQL Server... (http://blogs.msdn.com/sqltips/archive/2005/07/20/441053.aspx)|||perhaps this may help:
INSERT queries that use SELECT with ORDER BY to populate rows guarantees how identity values are computed but not the order in which the rows are inserted

-- Ordering guarantees in SQL Server... (http://blogs.msdn.com/sqltips/archive/2005/07/20/441053.aspx) Yeah, what he said.|||Thanks Rudy. That does actually fit with my observations - truncate a table\ clustered index, pump in several million rows in a single statement, check the fragmentation and the table turns out to be totally fragmented. I think I had always hoped that SS would order the input to fill the pages up to the fillfactor. I guess it must insert them in whatever order is most efficient for the select irrespective of what might be the most efficient insert order (assuming ordering the insert < splitting pages all over the place).

I appreciate all the logical "no order in relations" stuff - I am thinking solely about the physical aspect.


Did anyone let mgsn know that the physical order of the data (i.e. the clustered index) may affect many of the queries he\ she might perform on the table?|||"Yeah, what he said." Thanks to you too Sean :)|||.
.
.
On a busy server, parallelism takes over, and you get the results in every which way. ...
So then; limiting the INSERT SP to a single processor might help reduce page splitting? Not "absolute guarantee", but reduce it.

My (unwarranted?) *impression* from reading about order-by in sub-queries not necessarily returning ordered sets to the outer queries, a symptom I've seen happen on rare occasion, was that it had to do with query plans.

Multi-processing affecting this also makes perfect sense.

However; since ordering a clustered insert is relevant, would it follow that setting the "Insert ... from select .. ordered by ..." SP to single processor would help matters?

Again; not looking for perfection, but in the real world, achieving an ordered insert 99.9% of the time.
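i.e. the statement I'm asking about would look something like this (table and column names borrowed from cowboy's example earlier in the thread; just a sketch of the question, not a recommendation):

```sql
-- hypothetical: cap the insert at one CPU to keep parallelism
-- from scrambling the row order coming out of the sort
INSERT INTO dbo.MyDestinationTable (Myfield1, MyDate)
SELECT Myfield1, MyDate
FROM dbo.MySourceTable
ORDER BY MyDate
OPTION (MAXDOP 1);
```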

I'd like to comment that, if the nature of the data is that adjoining records are often the objects of the same select, then having stuff together (in the same page) would be a performance gain.

For example: Let's say you have a non-clustered Child table that's ordered by ParentID. Let's say every "Family" has about 20 children on average, and the pages just happen to fit about 30 or 40 child rows each. Now let's say the child table has 30 million rows (to keep the Query Planner from invoking a full table scan). So; selecting a given 50 families would naturally use the ParentID key to go get the children. You would wind up reading maybe 25 pages on average if the related children (for each family) are all together, and therefore always fitting in either 1 or 2 pages. By contrast; if they're all haphazard you would get 50 x 20 page reads in the child table. Of course; at least some of the WHERE would have to be resolved outside of the child data records.

As a 2nd comment. For similar reasons; wouldn't it make sense to put stuff together for head-movement purposes? Not relevant for nearly all tables, but the situation may come when some table is so huge and so key that you put it on its own HDD, and due to its nature (let's just say it's financial transactions that are usually queried within a range of perhaps 1 month but you don't archive until data is 8 months old), then the disks could predominantly stay in the same area. Same would apply if lots of ordered transfers are done, so if you built an index that resolves the Where Clause but the actual transfer usually has large amounts of pages to read (so, of 100 million rows, let's say it's selecting a group of 5 million), if the data was all over the place, the heads would thrash. I believe that EVEN IF THE TABLE weren't on its own drive, this COULD be a thrashing-saving quality.

I'm just saying that although unordered heaps and clusters are logically equivalent to ordered ones, that there are special cases where there's a cost to being unordered and that the cost can be severe for totally random cases. While I'm no expert in Relational databases - the basic concept of "reduce physical reads" surely remains valid.|||Hi vich

I've read this a couple of times. I think I know where you are going but there are a couple of lines that don't work.
"Let's say you have a non-clustered Child table that's ordered by ParentID." This is an oxymoron. Did you mean clustered child table?

"I'm just saying that although unordered heaps and clusters are logically equivalent to ordered ones" Are you saying that a clustered index can be unordered? The definition of a clustered index has ordering right at the centre of it so again I don't quite get it.

Just to be clear - there are two types of table in the sql server world - Steers and Que... oops not them... Heaps & Clustered Indexes. Heaps are unordered and have no B-Tree structure. CIs are ordered (at the leaf level) by the index column(s) and have a B-Tree.
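In DDL terms the split is just this (table and index names made up for illustration):

```sql
-- a heap: no clustered index, rows live wherever there's free space
CREATE TABLE dbo.ChildHeap (ParentID int NOT NULL, ChildName varchar(50));

-- a clustered index: the leaf level IS the table, ordered by the key
CREATE TABLE dbo.ChildCI (ParentID int NOT NULL, ChildName varchar(50));
CREATE CLUSTERED INDEX IX_ChildCI_ParentID ON dbo.ChildCI (ParentID);
```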

Overall though you are spot on - selected data can be more efficiently retrieved if it is physically contiguous and occupies the minimum number of pages.|||Hi vich

I've read this a couple of times. I think I know where you are going but there are a couple of lines that don't work.
This is an oxymoron. Did you mean clustered child table?

Are you saying that a clustered index can be unordered? The definition of a clustered index has ordering right at the centre of it so again I don't quite get it.

Just to be clear - there are two types of table in the sql server world - Steers and Que... oops not them... Heaps & Clustered Indexes. Heaps are unordered and have no B-Tree structure. CIs are ordered (at the leaf level) by the index column(s) and have a B-Tree.

Overall though you are spot on - selected data can be more efficiently retrieved if it is physically contiguous and occupies the minimum number of pages.
On point 1, by "ordered" I meant "contiguously in sequence" and no, I did mean heap, not clustered.

x1x2x3x4x5
not x3x1x5x4x2

So; "ordered heap" is an oxymoron in strict logical terms, but the records in the heap do physically reside somewhere. My point was that IF you know what groups of rows you will often read together, then physically grouping them will reduce physical reads in direct proportion to rows-per-page. Even if it's only "sometimes", it's still an improvement over being totally random since that will guarantee worst case page hits.

To rephrase; one aspect of maximizing page hits is through placing rows into the same page if they can predictably be read together.

The phrase "Ordered Heap" must be to a DBA as "Perpetual Motion Machine" is to an engineer so lol; hence my clarification.

On the 2nd point, I was redundant. It should have read:
I'm just saying that although unordered heaps and clusters are logically equivalent.|||Ah - I see. However, in the heap the data being physically contiguous would be useless - the engine would still need to scan every page in the table. It doesn't matter how many rows there are - heaps -> scans. Or have I missed your point again :D|||Ah - I see. However, in the heap the data being physically contiguous would be useless - the engine would still need to scan every page in the table. It doesn't matter how many rows there are - heaps -> scans. Or have I missed your point again :D
What about indexes? Direct Access? Heap -> Full Scan is only occasionally true, correct? Often even, but not always.

With all due respect sir; in my case of 30 million rows when it's going after about 1000 of them, wouldn't you agree we should shoot the query plan that reads them all?

EDIT - added section below:
I was thinking about what bearing .mdf file's fragmentation has. My 2:30AM conclusion is; not too much. It's all about page hits. If the pages are scattered around, then seek ahead buffering, head movement, etc would come into play, but those ramifications (I think) would be far less than finding as much as possible per page (block really). It would probably come down to row size vs. block size (what is a "Page" anyway, at least as compared with a "block").

Anyway; this all boggles my uninformed mind and I submit to the engineer's recommendations, but I also want to understand the pieces so I can make good tuning decisions some day.

This thread's complete dismissal of physical placement of indexed heap data just struck me as "file this under: question this one".|||Do you have a NC index on ParentID in this case? And we are still talking about a heap yes (the oxymoron-ordered heap ;))?
Yes, an indexed heap. But if the indexes tell you "go get these 1000 rows", I'm just saying that if prior inserts managed to maximize paging efficiency, then obviously 50 reads are better than 1000.

Sorry; don't know what "NC" is, but I guess I assumed that from the parent/child relationship. However; it wouldn't matter for the point being made. Even if the child's indexes required a full scan; if the DB Engine determined "go get these 1000 rows", it still holds true that 50 page reads trumps 1000. The scenario did say that all children would be read as part of the "family".

Again; I'm just saying there are special cases, and not all that uncommon ones.|||NC = nonclustered. You have a nonclustered index on parentID?|||"Even if the child's indexes required a full scan; if the DB Engine determined "go get these 1000 rows", it still holds true that 50 page reads trumps 1000." I think that is my point. Without any usable index on ParentID then the engine will scan the pages. Remember that scanning the pages involves getting them into memory and checking for the particular ParentID value. It is the number of pages scanned not the number of pages that the data is on that counts. As such, having all the relevant data on a small number of pages is only beneficial if the engine can know this and limit its retrieval to these pages only.|||So then; limiting the INSERT SP to a single processor might help reduce page splitting? Not "absolute guarantee", but reduce it.

Absolutely not. Limiting the number of CPUs that process an insert has no relevance to how data is stored on disk. And creating an "ordered heap" is a bit of a fantasy in 99% of cases, since you expect some deletes to happen on occasion. What do you think SQL Server is going to do with those gaps? Use 'em, that's what.

Besides which, you have eloquently explained in your first scenario EXACTLY why you would want a clustered index on the ParentID column. So the children are all clustered (pun intended) on the same set of pages.|||Just what I have been trying to say. The crucial difference between this clustered index and the magically ordered heap is that SS also can guarantee that these values are only on these pages and not on any others AND can get there quickly via the B-Tree so it does not need to hunt through the entire table looking for them.|||Absolutely not. Limiting the number of CPUs that process an insert has no relevance to how data is stored on disk. And creating an "ordered heap" is a bit of a fantasy in 99% of cases, since you expect some deletes to happen on occasion. What do you think SQL Server is going to do with those gaps? Use 'em, that's what.

Besides which, you have eloquently explained in your first scenario EXACTLY why you would want a clustered index on the ParentID column. So the children are all clustered (pun intended) on the same set of pages.
Drrrr (cue anvil falling on my head). That actually did dawn on me last night when thinking of it after posting.

If you know Parent ID will determine a lot of read clusters during future queries, and being a "parent ID" (presumably auto sequenced), it will tend to be a good cluster key (adding to End).

However; if all families tend to grow (ie: Parent table rarely grows) then would page splitting occur a lot? That's where padded indexing and fill factors would come into play, I imagine.

Example: A scientific statistical program that maintains thousands of counters and is available to the scientific community on the internet (therefore; gets LOTS of reads). For sake of argument, let's say that different scientists are only interested in their pet statistic but when they get it, all data for that statistic type (child rows for that parent) needs to be read. So; the child table (the gathered statistic data) is constantly inserted into and the inserts are for all parent IDs (the statistic type).

This would benefit greatly from physical clustering by Statistic Type (parentID) but inserts would require enough padding in each statistic type (index padding?) to prevent page splits during Insert until the next index reorg can refresh the padding amount.

Thank you. I know it's painful teaching stubborn beginners.|||Broadly yes. Don't be too scared of page splits though. If you set a fill factor whereby you never get page splits then it is set too high. One of the problems with page splits is the increase in the number of pages needing to be read as they are not "full". Fill factor has exactly the same effect, so putting in such a low fill factor that pages never get full is counterproductive as you are increasing the number of pages that need to be read to satisfy queries.
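If you do want to leave some breathing room, it's just part of the periodic index rebuild - e.g. (index and table names made up; 90 is only an illustrative number, not a recommendation):

```sql
-- rebuild with 10% free space per leaf page (SQL 2005+ syntax;
-- on older versions DBCC DBREINDEX did the same job)
ALTER INDEX IX_Child_ParentID ON dbo.Child
REBUILD WITH (FILLFACTOR = 90);
```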
