Blog Home  Home Feed your aggregator (RSS 2.0)  
What did you learn today? - Monday, March 19, 2007
Phil Denoncourt's Technology Rants
 
 Monday, March 19, 2007

On Wednesday, March21st, I'll be giving an introductary talk on Data Mining for the Boston Access User's Group.  I was scheduled to do this last month, but I got sick.

Introduction to Data Mining in SQL 2005
Data mining is the act of deriving patterns and trend that exist in your data.  SQL 2005 provides a host of different algorithms that allow you to explore your data for hidden patterns and trends.  Popular uses of data mining models are used to predict future sales, target customers for marketing, examine patterns of navigation through a website, and group products that are likely to be purchased together.

Phil will give an introduction to data mining concepts, an overview of the different algorithms, and how to incorporate the data mining results into your applications.

Phil Denoncourt is a .NET consultant, who over the past 5 years has developed a wide range of .NET applications and has over 20 years experience writing software. He is the leader of the New England C# User Group and has acquired the MCPD, MCTS, MCSD, MCDBA, MCSE, MCAD, MCSA and MCP+SB certifications. When not coding, or spending time with his wife and 5 children, Phil is an avid fan of the Boston Red Sox and New England Patriots.

Monday, March 19, 2007 12:06:22 PM (GMT Standard Time, UTC+00:00)  #    Comments [0]   Speaking Engagements  | 
 Sunday, March 18, 2007

As a developer, I love to write code.  Getting involved with a problem, putting my head down, coming up a solution and getting immersed in implementing it is an awesome feeling.

 

I also enjoy reading code.  I like to look at a set of code and try to understand what the developer was thinking.  You can tell a lot about how a person thinks by reading their code.  To me, participating in code reviews is a lot of fun. 

 

I originally wrote these articles 6 months ago when I was about to launch a new service to review code (For details, go to http://www.codefrisk.com/).  I’m now much too busy to review code, but these articles might be of value, so I’m uploading them now.  I’ll be posting a set of articles about code reviews, best practices, coding standards…  Here’s the first article in the series:

 

Coding Standards – How to write a useful Coding Standard document

 

Coding standards are an agreement in a development team on how the code for an application will be structured.  They promote consistency and easier maintenance.    A well-written and enforced standards policy will reduce the time that it takes for new developers to feel comfortable with an application.

 

I’ve worked with many organizations over the years, most of which have some sort of coding standards documentation.  The needs of each organization are very different.  Some organizations sold their source code, so they needed to make sure there was a very high level of consistency.  Others were in regulated industries and needed to conform to various practices.   Others were small shops that didn’t see tremendous value in having standards.

 

The standards document had varying degrees of success.  Some negative factors in coding standards I’ve seen are:

  • Aggressively guarded by one person
  • Enforcing rules that were necessary in older technologies, but no longer hold relevance (Especially if they were burned by the problem)
  • Too slim
  • Too thick
  • Too strict
  • Begun, but never completed.
  • Finished, but never enforced.

 

I’ve written a few coding standards docs and want to share what I’ve found useful, and what you should try to avoid.

 

First of all, if you’re the person tasked with writing the documentation, make sure that you involve the rest of the development team as much as possible.  You shouldn’t take the Moses approach; climb a mountain and come down with a set of commandments.  If the developers on your team aren’t consulted on the coding standards, you’ll see a lot of them not adhering to them.  These aren’t your coding standards; they are the teams coding standards.

 

Recognize that you aren’t writing a document with a stone and chisel.  Coding standards must evolve as your team has learned new techniques.  It’s also not uncommon to see coding standard docs that never get finished because the author feels the document must be 100% “complete” before they release it.  Take an iterative approach and get the document out as soon as possible and continue to refine it.  It’s impossible to get it all right the first, second, or even tenth time.  You’ll probably find that most code reviews will result in changes to the standards document.

 

Kick off a discussion for an hour or two and brainstorm with the development team about possible standards.  There will be contention.  If the team can’t come to an agreement after 5 minutes, table the discussion point for later.  If people have time to think about something, they will usually be able to justify or back off on their opinion.    The important thing is to limit the time of the discussion in the interests of getting an initial version within the same day.  Good developers will talk about coding all day.

 

Areas to discuss would include:

  • Naming standards – Camel, Pascal, Hungarian, Dewey Decimal System?
  • Documentation expectations – XML Documentation, when/where to put comments, class headers…
  • Design Patterns that are expected to be used – Factories, Observers, Decorators…
  • Automated Unit Testing strategy – TDD? Tool type, Code Coverage…
  • Exception handling – Handling, Reporting…
  • Use of constants/resource files – Yes, No?  One resource for the entire project, one constant class?
  • Complexity limits for methods - > 25 is a lot of branches
  • Data access methods - Stored Procs, Dynamic SQL, DataSets, ORM, Async, Transaction strategy…
  • Common components to use – Enterprise Library, Log4net, Infragistics…
  • Localization strategy -  Needed?  Roll your own or use .NET framework...
  • Versioning – Versioning assemblies, strong name, signing, delay signing…
  • Build strategy – Nightly builds, one solution or many solutions, creating references
  • !Code Formatting

 

For the most part try to stay away from talking about code formatting.  Formatting standards inevitably come down to one developer’s preference over another.  Capable developers will be able to read the code regardless of how it is formatted. 

 

After you’ve had your discussion with the team, hole yourself up in your office/cube and write up the standards doc.  Unless you’re a dreadfully slow typist, this shouldn’t take more than 1 – 3 hours.

 

The important thing is to make sure you get feedback from your team on the document.  If they don’t feel involved and disagree with a particular piece, they will exhibit a little bit of passive-aggressive behavior and not adhere to the standard.

 

The Coding Standards document sets up expectations at code reviews.  It is a declaration of what is expected from a developer when they submit code.  Without some set of coding standards, code reviews can quickly become counterproductive.

 

Coding standard docs aren’t hard, aren’t intrusive, and are a valuable resource.  When there is a set of guidelines to use when coding, it is much easier to bring new developers into the project.  The benefits justify the one day of effort it takes to put one in place.

 

 

Wouldn’t you feel comfortable having your code reviewed by an expert?  Go to CodeFrisk.com to see how I can proofread your code at a reasonable price.

Sunday, March 18, 2007 11:18:14 AM (GMT Standard Time, UTC+00:00)  #    Comments [0]   Code Reviews | Development  | 
 Friday, February 02, 2007
Introduction to Data Mining by phildenoncourt

On Wednesday, February 21st, I'll be giving an introductary talk on Data Mining for the Boston Access User's Group.

Introduction to Data Mining in SQL 2005
Data mining is the act of deriving patterns and trend that exist in your data.  SQL 2005 provides a host of different algorithms that allow you to explore your data for hidden patterns and trends.  Popular uses of data mining models are used to predict future sales, target customers for marketing, examine patterns of navigation through a website, and group products that are likely to be purchased together.

Phil will give an introduction to data mining concepts, an overview of the different algorithms, and how to incorporate the data mining results into your applications.

Phil Denoncourt is a .NET consultant, who over the past 5 years has developed a wide range of .NET applications and has over 20 years experience writing software. He is the leader of the New England C# User Group and has acquired the MCPD, MCTS, MCSD, MCDBA, MCSE, MCAD, MCSA and MCP+SB certifications. When not coding, or spending time with his wife and 5 children, Phil is an avid fan of the Boston Red Sox and New England Patriots.

Friday, February 02, 2007 1:40:51 PM (GMT Standard Time, UTC+00:00)  #    Comments [0]   Speaking Engagements  | 
 Tuesday, January 09, 2007
Introduction to WCF by phildenoncourt

I'll be speaking at two user groups in the upcoming weeks:

Introduction to WCF (Windows Communication Foundation).

.NET 3.0 was released in November.  WCF is one piece of the 3.0 puzzle and is a substantial improvement over ASP.NET web services. 

 

Phil will show the overall architecture and how easy it is to develop WCF Services.

 

Phil Denoncourt is a .NET consultant, who over the past 5 years has developed a wide range of .NET applications and has over 20 years experience writing software. He is the leader of the New England C# User Group and has acquired the MCPD, MCTS, MCSD, MCDBA, MCSE, MCAD, MCSA and MCP+SB certifications.

 

When not coding, or spending time with his wife and 5 children, Phil is an avid fan of the Boston Red Sox and New England Patriots.

 

Jan 11 2007 at New England C# Users Group

 

Jan 18 2007 at New Hampshire .NET Users Group

Tuesday, January 09, 2007 6:13:05 PM (GMT Standard Time, UTC+00:00)  #    Comments [0]   Speaking Engagements  | 
 Sunday, November 19, 2006
Sample Databases by phildenoncourt

For the past month, I've been focusing on getting up to speed on the data mining features of SQL 2005. Really amazing stuff. I'll be giving a presentation on this in the near future. One of the things that took a significant amount of time was putting together a database that I could use for testing. I didn't want to use AdventureWorks or something like that because I wanted something that had more data and more "real world".

So I downloaded the past 20 years of stock prices from a public quote server, the past 10 years of foreign currency prices, and a slew of economical data from Federal Reserve. (See picture below). The database has over 7,300 companies (I tried to get all NASDAQ, NYSE, and AMEX tickers) with over 16 million quotes.

Now I'm creating models and running predictions. For the most part, I'm able to exercise all of the data mining algorithms. I haven't found the secret formula to the stock market yet, but someday…

If anybody else is interested getting a copy of this database, let me know via the contact link. Because it's over a gigabyte in size and I don't have massive bandwidth allowances in my hosting account, it has to be transported via postal mail. PayPal me $30 to cover the cost of burning a DVD and sending it, and I'll get it out to you.

Sunday, November 19, 2006 3:45:32 PM (GMT Standard Time, UTC+00:00)  #    Comments [0]   SQL  | 
 Wednesday, September 13, 2006

I was sitting in this meeting in which I was only peripherally involved and there's only so much doodling you can do.  So I did what most other people in the meeting were doing, I started daydreaming. 

I'm a big fan of GUIDs, but there's all this noise about how slow they are because they're bigger than an int making for inefficient searching.  So I started to wonder, how much slower are GUIDs than ints? So I started to do the math, and for a table with 4 billion rows, I was coming out with only one more logical read if the key was a GUID than an int.... I know.. Didn't make sense to me.  So I put it to a practical test.

I created two tables.  One called IntWithData and the other called GuidWithData.


I then populated them with a million rows.  --As a side note, I started the scripts to populate the tables at about the same time, and they finished at the same time.  Something I think I will research further is how much slower it is to insert a record in a GUID table compared to an INT table.  Common sense tells you it's supposed be way slower to populate a GUID--

I took a look at the statistics for the index by using the following statements:
SELECT * FROM sys.dm_db_index_physical_stats(db_id(),object_id('dbo.GuidWithData'),null,null,'DETAILED')
SELECT * FROM sys.dm_db_index_physical_stats(db_id(),object_id('dbo.IntWithData'),null,null,'DETAILED')

That produced these results:

database_id object_id index_id partition_number index_type_desc alloc_unit_type_desc index_depth index_level avg_fragmentation_in_percent fragment_count avg_fragment_size_in_pages page_count avg_page_space_used_in_percent record_count ghost_record_count version_ghost_record_count min_record_size_in_bytes max_record_size_in_bytes avg_record_size_in_bytes forwarded_record_count
6 2105058535 1 1 CLUSTERED INDEX IN_ROW_DATA 3 0 99.037 5919 1 5919 68.85677 1000000 0 0 31 31 31 NULL
6 2105058535 1 1 CLUSTERED INDEX IN_ROW_DATA 3 1 96.2963 27 1 27 67.68663 5919 0 0 23 23 23 NULL
6 2105058535 1 1 CLUSTERED INDEX IN_ROW_DATA 3 2 0 1 1 1 8.314801 27 0 0 23 23 23 NULL

 

database_id object_id index_id partition_number index_type_desc alloc_unit_type_desc index_depth index_level avg_fragmentation_in_percent fragment_count avg_fragment_size_in_pages page_count avg_page_space_used_in_percent record_count ghost_record_count version_ghost_record_count min_record_size_in_bytes max_record_size_in_bytes avg_record_size_in_bytes forwarded_record_count
6 2073058421 1 1 CLUSTERED INDEX IN_ROW_DATA 3 0 0.423403 317 8.195584 2598 99.84113 1000000 0 0 19 19 19 NULL
6 2073058421 1 1 CLUSTERED INDEX IN_ROW_DATA 3 1 0 9 1 9 46.33886 2598 0 0 11 11 11 NULL
6 2073058421 1 1 CLUSTERED INDEX IN_ROW_DATA 3 2 0 1 1 1 1.420806 9 0 0 11 11 11 NULL

This is tough to read, but what you should be getting out of this is: Guid table is very fragmented.  Int table is not.  Also the Guid table had about 2.25 more pages than the Int table.  Ideally you would be running an overnight process that reorganized pages in your DB, and that would help with the fragmentation.

So now the test query.  The question I wanted to answer was:  How many logical reads are needed to read a record out of the Int table and how many are needed to read out of the Guid table?

SET STATISTICS IO ON

SELECT * FROM IntWithData WHERE ID = 619284

SELECT * FROM GuidWithData
WHERE ID = '7DD78950-D3CD-4016-8D92-738A6E0666F2' -- I had to find this value in my data

The results:
....

(1 row(s) affected)

Table 'IntWithData'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

Table 'GuidWithData'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

3 logical reads. That means in a table with 1 million rows, there should be no difference between a table with Guid as a key and one that has an int.  This is because the records are searched for using a binary search algorithm.  In a perfectly balanced tree, it shouldn't take the algorithm more than 20 tries to find the key with many of those "tries" are contained in the same index page.  At some volume, and I haven't calculated exactly where, there will be a difference of one logical read, but that's it.

Hopefully, I haven't misinterpreted the results of this test.  I don't think I have.  The bottom line is that Guids take up more physical space, but you're able to find them just as fast as ints.

 

Wednesday, September 13, 2006 9:51:55 PM (GMT Standard Time, UTC+00:00)  #    Comments [0]   Development | SQL  | 
 Friday, August 18, 2006

I was preparing for the SQL Server Exam 70-441 (Designing Database Solutions by Using Microsoft® SQL Server™ 2005) and I kept coming across a whole bunch of small features that you don’t hear a lot about.  I’m not talking about the big ones everyone already knows about like CLR Integration, DDL triggers, XML, and ranking functions.  These are ones that took me by pleasant surprise.  I think they’re really cool, so I wanted to point them out.

 

Included Columns / Covering Indexes

Say you had a critical/oft-used report that ran a statement like:

 

SELECT Name, ProductNumber

FROM Production.Product

WHERE Style='U' AND Class='L'

 

If you wanted to put an index on the table to speed things up, you’d put an index on Style, Class, Name, and ProductNumber.  That would remove the need to go from the index page to the table page to get all the information for the query.  SQL 2005 has added the include clause to the Create Index statement.

 

CREATE INDEX IX_Product_Class_Style

ON Production.Product (Class,Style)

INCLUDE (Name, ProductNumber);

 

What this is doing is indexing just the Class and Style fields of the table.  The Name and ProductNumber are stored at the leaf nodes of the index.  This saves space in the index pages, meaning you can get more rows in an index page, reducing I/O and Disk Space.

 

Dynamic Management Views

Seriously, these are pretty cool.  Really.  If you haven’t looked at these, take the time to play around with a few. 

 

With these new views, you are able to get high visibility into the SQL Server engine.  Wondering what’s about to be written to the disk?  Query sys.dm_io_pending_io_requests.  Troubleshooting execution plans?  Use sys.dm_exec_cached_plans and sys.dm_exec_plan_attributes.  Need to see what’s running on the server right now?  sys.dm_os_tasks.  How often is an index used?  There’s sys.dm_db_index_usage_stats that will answer that question.  There are over 80 views that provide very useful information when you are troubleshooting problems or just curious as to how SQL Server organizes itself.

 

sp_create_plan_guide

Every now and then, I have a query that runs like a jackrabbit on my development machine, but runs like 3 legged turtle on a production machine.  This is usually because it decided not to use an index on the production machine due to differences in data density or because running the query with parallelism is slower with some statements.  To troubleshoot these situations, I end up modifying the stored procedure to add various hints.

 

Rather than modify the stored procedure, you can specify a hint for an existing query plan using the sp_create_plan_guide stored proc.  This way you can experiment without modifying the stored procedure and the hint doesn’t become part of your codebase.  MSDN documentation can be found here.

 

OUTPUT Clause

A lot of people already know about this pretty cool feature, but I didn’t.  In the past, if you do an insert to a table, and the table has an identity column, after the insert, you would look at @@IDENTITY or scope_identity().  That always seemed hackish to me.

 

You can now add an OUTPUT clause to the end of an INSERT, UPDATE, or DELETE statement.  That will return the rows changed by the statement.

 

INSERT INTO Person.Address

      (AddressLine1,AddressLine2,City

      ,StateProvinceID,PostalCode,rowguid,ModifiedDate)

VALUES(@AddressLine1,@AddressLine2,@City

      ,@StateProvinceID,@PostalCode,@rowguid,@ModifiedDate)

OUTPUT INSERTED.AddressID;

 

You use the pseudo tables INSERTED and DELETED, just like in triggers.

 

Storing BLOBs in Table page

Inevitability, on each project I work on, there is a requirement to have a field capable of storing more than 8k in a table.  It’s usually a field that the user can enter unlimited comments about a customer.  So I define it as a text field.  I always feel guilty about it because the user’s text will rarely be more than 100 bytes and I know it’s going to be stored on a different page.

 

In SQL 2005, if the data will fit in the same page as the rest of the record, it will place it in the same page.  This works with varchar/varbinary/nvarchar(max) and XML datatypes.  The behavior is described here at MSDN.

Persistence of Computed Columns

This is kind of neat.  We’ve had the ability to define computed columns in tables for a while.  They worked exactly as they’re named.  When the column was needed, it was computed.  Now the results of the computed column can be saved to disk.  This allows you to index computed columns.  You can also feel free to create calculated columns that can be computationally intense.

 

CREATE TABLE LineItem (

      LineItemID int IDENTITY(1,1),

      OrderID int,

      ItemDesc varchar(512),

      Qty int,

      Cost money,

      LineTotal as Qty * Cost PERSISTED

)

 

Those are all the ones I found.  What you have noticed as small new features in SQL that make life easier?

Friday, August 18, 2006 9:01:27 PM (GMT Standard Time, UTC+00:00)  #    Comments [0]   Development | SQL  | 
 Wednesday, June 14, 2006

Tomorrow I'll be speaking at the New Hampshire .NET User's group.  More information can be found at www.nhdn.com

6:00 - 7:00 PM Xml Serialization.  XML Serialization is the process of transforming a complex object to and from XML.  This is made straightforward with the .NET framework.  We'll take a look out how to serialize a simple object and some of the attributes used to control the output.  Then we'll take a look at some of the more advanced features like capturing processing errors, overrides and unknown elements.  Finally we'll take complete control of the serialization process using the new features of the 2.0 Framework.  Be the envy of all your coworkers by mastering XML serialization

7:00 - 8:00 PM Xml in SQL 2005.  There are a variety of different ways that SQL Server can work with XML data.  Phil will demonstrate the various methods such as the XML datatype, OpenXML, How to return XML data in queries, Annotated XSDs, and interoperability with .NET

Phil Denoncourt is a .NET consultant, who over the past 4 years has developed a wide range of .NET applications and has over 20 years experience writing software. He is the leader of the New England C# User Group and has acquired the MCPD, MCTS, MCSD, MCDBA, MCSE, MCAD, MCSA and MCP+SB certifications. When not coding, or spending time with his wife and 5 children, Phil is an avid fan of the Boston Red Sox and New England Patriots.

Wednesday, June 14, 2006 11:15:28 AM (GMT Standard Time, UTC+00:00)  #    Comments [0]   Speaking Engagements  | 
Copyright © 2010 Phil Denoncourt III. All rights reserved.
DasBlog 'Portal' theme by Johnny Hughes.
Pick a theme: