Search Configuration Part 4: The Crawl
I am the Ektron Search Configuration utility. In part 3, I handled the registration of a CMS site with the search server. As the process completed, I asked the search server to perform an initial crawl. Here is my story of the crawl.
The end is near. My threads are closing and I sense that soon my own process will follow. I will cease to exist. My memory will be deallocated and returned to the pool. Nothing will remain of me, yet I know that the work I did persists both in the search and CMS servers. Because of me, users will be able to locate news, products, people and almost any other information they seek.
After establishing the relationship between the search server and the CMS site, my last important task is to request that the relationship be consummated. I do this by sending a full crawl request to the Ektron Search Server Service. In turn, the service passes on the request to the Search Service Application. Before I go for good, let me tell you about the crawl.
The SharePoint Search Service Application (SSA) contains many predefined meta properties that help describe the type of data being indexed. Ektron CMS has its own properties, so one function of the crawl is to extend the SSA meta properties to include CMS properties such as content type, language, and title; user-defined items such as Smart Form indexed fields are added as well. The CMS meta properties are added during the first phases of the crawl: Property Creation and Property Mapping. Oh, one important note: a crawl can be started manually by a user from within the SSA, but that will not define the meta properties for CMS. For that to happen, the request must come either from me, the Search Configuration utility, or from a full crawl request issued from the CMS Workarea. Once the properties are defined, the crawl for content commences.
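The Property Creation and Property Mapping phases can be pictured roughly like this. This is a hedged sketch, not Ektron's actual code: the property names and the `map_cms_properties` helper are illustrative assumptions; only the idea (extend the SSA's existing property set with any CMS properties it lacks) comes from the description above.

```python
# Illustrative CMS properties; real names in the SSA schema may differ.
CMS_PROPERTIES = ["ContentType", "ContentLanguage", "Title", "SmartFormField_Price"]

def map_cms_properties(existing_managed_properties):
    """Return the SSA property set extended with any missing CMS properties,
    plus a list of the properties that had to be newly created."""
    managed = set(existing_managed_properties)
    created = []
    for prop in CMS_PROPERTIES:
        if prop not in managed:
            managed.add(prop)      # "Property Creation"
            created.append(prop)   # record it so "Property Mapping" can follow
    return managed, created

# The SSA already knows some properties; CMS-specific ones get added.
managed, created = map_cms_properties(["Title", "Author"])
```

Here `Title` already exists in the SSA, so only the CMS-specific properties are created; a user-started crawl from within the SSA would skip this step entirely, which is why the request must come from the utility or the Workarea.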
The crawl for content actually has three starting points, one for each major type of crawled content: users, community users, and content. The crawl initially knows nothing about the database, but with the starting points in hand, it learns quickly. Take content, for example. The crawl starts at folder ID=0 by asking the parent (folder 0) for information about its children. Parents love to talk about their children, and of course the children talk about their children. This process continues until there are no more children to be discovered and the crawl has talked to all the parents and children. The crawl does not stop there; next, it asks each folder about the content items it holds. The crawl wants to know details such as: what is the language, is the item published and not archived, is the item marked searchable, and what type of content is it? From this information, the crawl constructs special URI-like strings (ektron://<content source>/<folder path>/<content_id>-<locale>-<extension>); for example, ektron://1server1EktronCMSDB/0/123/456/789_1033.xml points to a piece of Smart Form content. These URI-like constructs do not resolve in a browser, but they are meaningful to the DatabaseProtocolHandler, which uses the information to craft a request to the database.
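The folder walk described above can be sketched as a depth-first traversal that yields one URI-like string per content item. Everything here is an assumption for illustration: the tiny in-memory folder tree, the `crawl` helper, and the `server1/EktronCMSDB` content source are invented; only the shape of the walk and of the ektron:// strings comes from the article.

```python
# Hypothetical folder tree: folder_id -> child folder ids, starting at folder 0.
FOLDERS = {0: [1, 2], 1: [3], 2: [], 3: []}
# Hypothetical content: folder_id -> (content_id, locale, extension) tuples.
CONTENT = {1: [(456, 1033, "xml")], 3: [(789, 1033, "htm")]}

def crawl(folder_id, source="server1/EktronCMSDB", path=""):
    """Ask a folder about its content items, then recurse into its children,
    yielding a URI-like string for each searchable item found."""
    here = f"{path}/{folder_id}" if path else str(folder_id)
    for content_id, locale, ext in CONTENT.get(folder_id, []):
        yield f"ektron://{source}/{here}/{content_id}_{locale}.{ext}"
    for child in FOLDERS.get(folder_id, []):
        yield from crawl(child, source, here)

uris = list(crawl(0))  # the DatabaseProtocolHandler would consume these
```

Each yielded string carries the content source, the folder path from the root, and the item's ID, locale, and extension, which is exactly the information the DatabaseProtocolHandler needs to craft its database request.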
The code that fetches the content modifies its request based on the content type; for example, if the content type is a PageBuilder page, the content to be indexed comes from a different field (one that contains a compilation of the page content) than the one fetched for HTML content. If the content type is an asset, the ID of the asset is passed to the Ektron Protocol Handler FileHelper service. That service communicates with the CMS server (specifically, the Ektron CMS File HelperService service) over ports 6080 and 6081 to check whether the asset file exists. Assuming it does, a transfer is made that places a copy of the asset in a folder under C:\EktronSearchData (defined during site registration). The search server then indexes the contents of the asset using the appropriate filter. If the asset does not exist or cannot be copied to the search server, an entry is made in the Protocol log (also located in C:\EktronSearchData). Some potential reasons for a problem in this area include:
- A firewall is blocking ports 6080 and 6081.
- The search server cannot resolve the CMS server name.
- The CMS database contains extra entries in the AssetServerTable. The search server requests server names from this table, so servers that are no longer present will still be looked for during the crawl, resulting in errors and an increased crawl duration.
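The asset-fetch flow above reduces to three steps: check that the asset exists on the CMS server, copy it locally for the index filter, and log any failure. The sketch below assumes a hypothetical `file_helper` object standing in for the FileHelper service, and an invented log helper; only the flow, the ports, and the C:\EktronSearchData location come from the article.

```python
import datetime
import os

SEARCH_DATA_ROOT = r"C:\EktronSearchData"  # defined during site registration

def log_protocol_error(message):
    """Stand-in for a Protocol log entry under C:\\EktronSearchData."""
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    print(f"{stamp} {message}")

def fetch_asset(asset_id, file_helper):
    """Copy an asset from the CMS server so a filter can index its contents.
    `file_helper` is a hypothetical client for the FileHelper service, which
    talks to the CMS server over ports 6080 and 6081."""
    if not file_helper.exists(asset_id):
        log_protocol_error(f"asset {asset_id} not found on CMS server")
        return None
    local_path = os.path.join(SEARCH_DATA_ROOT, f"{asset_id}.bin")
    try:
        file_helper.copy_to(asset_id, local_path)  # transfer to the search server
        return local_path                          # ready for the index filter
    except OSError as err:
        log_protocol_error(f"asset {asset_id} copy failed: {err}")
        return None
```

Note that every failure path writes to the log rather than raising: a missing or uncopyable asset should cost a log entry and a skipped item, not a halted crawl, which matches the troubleshooting list above.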
While the content is being indexed, the crawl is also gathering user and community user information. If the Community Groups crawl filter was selected during registration, each community user and his or her associated folders will be crawled. Each user has four dedicated folders, so for sites with large numbers of community users, the crawl must look at four folders per user. Put another way, if there are 100K community users, there are 400K folders to be crawled, which is not an insignificant task.
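That scaling is worth making explicit when sizing a crawl. A trivial sketch of the arithmetic (the function name is mine; the four-folders-per-user figure is from the text):

```python
FOLDERS_PER_COMMUNITY_USER = 4  # each community user has four dedicated folders

def community_folder_count(user_count):
    """Folders the crawl must visit for a given number of community users."""
    return user_count * FOLDERS_PER_COMMUNITY_USER

print(community_folder_count(100_000))  # 100K users -> 400K folders to crawl
```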
My user just noticed that the crawl is in progress. He no longer needs me. The mouse pointer, now hovering over the X is a click away from ending my existence. The search configuration is complete but there are other tales my peers in the search universe can tell. I heard that soon Query would like to explain some of what she does. I hope I can be instantiated enough to hear what she has to say.