Matt Cutts discusses what determines the number of pages Google crawls and indexes. PageRank is the most important factor, but host load and duplicate content can also affect a site. You can reduce duplicate content through site architecture and the canonical tag.
Some key points from the interview:
Matt Cutts: The number of pages that we crawl is roughly proportional to your PageRank.
Eric Enge: So you have basically two factors. One is raw PageRank, that tentatively sets how much crawling is going to be done on your site. But host load can impact it as well.
Matt Cutts: That's correct. By far, the vast majority of sites are in the first realm, where PageRank plus other factors determines how deep we'll go within a site. It is possible that host load can impact a site as well, however. That leads into the topic of duplicate content. Imagine we crawl three pages from a site, and then we discover that two of those pages were duplicates of the third. We'll drop two out of the three pages and keep only one, which makes it look like the site has less good content. So we might tend to not crawl quite as much from that site.
Matt Cutts: There are a couple of things to remember here. If you can reduce your duplicate content using site architecture, that's preferable.
Matt Cutts: There are ways to do your site architecture, rather than sculpting the PageRank, where you are getting the products that you think will sell the best or are most important front and center. If those are above-the-fold items, people are very likely to click on them. You can distribute that PageRank very carefully between related products, and use related links straight to your product pages rather than into your navigation. I think there are ways to do that without necessarily going towards trying to sculpt PageRank.
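Where site architecture alone can't eliminate a duplicate URL, the canonical tag mentioned above lets you tell Google which version of a page should be indexed. A minimal sketch (the example.com URLs are hypothetical):

```html
<!-- Placed in the <head> of a duplicate URL, e.g. a page reached
     via a tracking parameter or print-friendly variant.
     It points search engines at the preferred version to index. -->
<head>
  <link rel="canonical" href="https://www.example.com/products/blue-widget" />
</head>
```

Duplicates consolidated this way still spend crawl budget when fetched, which is why Cutts suggests fixing the architecture first and using the canonical tag as a fallback.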
Read the full interview here: Eric Enge interviews Matt Cutts