Matt Cutts discusses what determines the number of pages that Google crawls and indexes. PageRank is the most important factor, but host load and duplicate content can affect a site as well. You can reduce duplicate content with the canonical tag and with site architecture.
Some key points of the interview:
Matt Cutts: The number of pages that we crawl is roughly proportional to your PageRank.
Eric Enge: So you have basically two factors. One is raw PageRank, that tentatively sets how much crawling is going to be done on your site. But host load can impact it as well.
Matt Cutts: That's correct. By far, the vast majority of sites are in the first realm, where PageRank plus other factors determines how deep we'll go within a site. It is possible that host load can impact a site as well, however. That leads into the topic of duplicate content. Imagine we crawl three pages from a site, and then we discover that the two other pages were duplicates of the third page. We'll drop two out of the three pages and keep only one, and that's why it looks like it has less good content. So we might tend to not crawl quite as much from that site.
Matt Cutts: There are a couple of things to remember here. If you can reduce your duplicate content using site architecture, that's preferable.
Matt Cutts: There are ways to do your site architecture, rather than sculpting the PageRank, where you are getting products that you think will sell the best or are most important front and center. If those are above the fold things, people are very likely to click on them. You can distribute that PageRank very carefully between related products, and use related links straight to your product pages rather than into your navigation. I think there are ways to do that without necessarily going towards trying to sculpt PageRank.
Read the full interview here: Eric Enge interviews Matt Cutts
PageRank is like a lighthouse on a dark night: sometimes you see it, sometimes you don't.
"The number of pages that we crawl is roughly proportional to your PageRank."
Totally common sense. PageRank is based on inbound links. It's inbound links that will get your site crawled more frequently and more pages indexed, especially if you have lots of deep links. This includes not only links from external sites, but also your internal linking structure... i.e. your site architecture.
"There are a couple of things to remember here. If you can reduce your duplicate content using site architecture, that's preferable"
Totally agree. A good site architecture can solve MOST problems a typical site might have. Having to use the canonical link element is typically a sign of a poorly thought-out site architecture.
bogart (30 March, 2010)
Text links have always been preferable from a ranking perspective. Image links with alt attributes don't carry nearly the ranking weight that the link text in text links does... because the alt attribute of an image link is supposed to describe the image, NOT the page the image links to. Link text should describe the page being linked to.
It has always been said that 301 redirects cause some loss. Think of how PageRank is calculated... a page doesn't pass 100% of its PR to its outbound links. There is the damping factor to consider (the loss used to be around 15%, i.e. a damping factor of about 0.85). If a page has, say, 100 PR "points" and 10 outbound links, about 15% of that is wasted and the remaining 85 points are divided over the 10 outbound links. So the 10 links get 8.5 pts each passed to them instead of 10 pts.
A redirect basically means you have to go through the redirected page to get to the "final" page. If 10 pts are passed to the redirected page, the damping factor will eat about 15% of that, so what gets passed on to the final page is less than what came into the redirected page.
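To make the arithmetic above concrete, here is a rough Python sketch. It assumes the simplified model described in this post (a flat loss of about 15% each time PageRank is passed on), not Google's actual calculation. The function names and the DAMPING constant are made up for illustration.

```python
# Simplified model from the post: each hop passes on about 85% of the
# incoming "points". This is an assumption, not Google's real formula.
DAMPING = 0.85


def points_per_link(page_points, outbound_links, damping=DAMPING):
    """Points each outbound link receives once the damping loss is applied."""
    return page_points * damping / outbound_links


def points_after_redirects(incoming_points, hops, damping=DAMPING):
    """Points left after passing through `hops` redirects, one loss per hop."""
    return incoming_points * damping ** hops


if __name__ == "__main__":
    # 100 points over 10 links: roughly 8.5 each instead of 10.
    print(round(points_per_link(100, 10), 2))
    # 10 points entering one redirect vs. a chain of three redirects.
    print(round(points_after_redirects(10, 1), 2))
    print(round(points_after_redirects(10, 3), 2))
```

Under this model the loss compounds with every extra hop, which is why a chain of redirects would cost more than a single one.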
So over time lots of people end up with:
A --301--> B --301--> C --301--> D
And Cutts has always said stacked redirects like the ones above are bad. He's always recommended unstacking them so you end up with:
A --301--> D
B --301--> D
C --301--> D
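If your redirects live in a simple old-URL-to-new-URL map, unstacking them can be done mechanically. The sketch below is only one way to do it, and the function name and the dictionary format are assumptions for illustration. It follows each chain to its end and points every source straight at the final destination.

```python
def flatten_redirects(redirects):
    """Return a map where every source points directly at its final target.

    `redirects` is a dict of {old_url: new_url}, a hypothetical format
    chosen for this example.
    """
    flat = {}
    for source in redirects:
        seen = {source}
        target = redirects[source]
        # Keep following the chain until the target is not itself redirected.
        while target in redirects:
            if target in seen:
                raise ValueError(f"redirect loop involving {target!r}")
            seen.add(target)
            target = redirects[target]
        flat[source] = target
    return flat


if __name__ == "__main__":
    stacked = {"A": "B", "B": "C", "C": "D"}
    for old, new in flatten_redirects(stacked).items():
        print(f"{old} --301--> {new}")
```

The loop check is there because a chain that circles back on itself has no final page to point at, so it is better to fail loudly than to write a broken rule.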
bogart (3 April, 2010)