Thursday, July 26, 2018

Thoughts About Indexing

Updated with more thoughts about automating index creation.
-----Original Message-----
From: Mark Baker
Sent: Friday, July 27, 2018 8:49 AM
To: 'Jonathan Baker'; dick@rlhamilton.net
Cc: techwr-l@lists.techwr-l.com
Subject: RE: Looking for classes in indexing

Index automation can only take you so far. For the book I referred to earlier, we experimented with index automation. The book is about structured writing, so naturally we used structured writing techniques. This included the annotation of subjects mentioned in the text, and of the subjects covered in chapters and sections. This is essentially what an index does -- it points you to the places where different subjects are treated in a document. So it follows that we should be able to derive an index from these annotations.

This is already a much more controlled process than using software to scan an unstructured text for keywords, which is all that automated indexing software can do, short of an AI revolution that has not arrived yet. The subject annotations that we used noted the type of the subject and rectified the terminology (for example: {XML}(markup-language "Extensible Markup Language")). This gave us a significant degree of terminology control and allowed us to detect a lot of inconsistency in the book (as Richard mentioned earlier). It also allowed us to automatically create entries for the major types of subject matter discussed in the book:

markup languages
    XML, 56, 68, 72
    HTML, 345, 403
    Markdown, 3, 76, 432
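
As a rough illustration of how entries like these could be derived from subject annotations, here is a minimal Python sketch. It is not the tooling used for the book. The annotation pattern is modeled only on the single example quoted above, and the names ANNOTATION, HEADINGS, build_index, and format_index, as well as the page-to-text input format, are assumptions made for this illustration.

```python
import re
from collections import defaultdict

# Assumed annotation syntax, modeled on the one example in the text:
#   {XML}(markup-language "Extensible Markup Language")
# The quoted canonical name is treated as optional here (an assumption).
ANNOTATION = re.compile(r'\{([^}]+)\}\(([\w-]+)(?:\s+"([^"]*)")?\)')

# Hypothetical mapping from subject type to the heading shown in the index.
HEADINGS = {"markup-language": "markup languages"}


def build_index(pages):
    """Collect annotated terms by subject type.

    pages maps a page number to the annotated text on that page.
    Returns {subject_type: {term: sorted list of page numbers}}.
    """
    index = defaultdict(lambda: defaultdict(set))
    for page, text in pages.items():
        for term, subject_type, _canonical in ANNOTATION.findall(text):
            index[subject_type][term].add(page)
    return {
        subject_type: {term: sorted(nums) for term, nums in terms.items()}
        for subject_type, terms in index.items()
    }


def format_index(index, headings=HEADINGS):
    """Render one heading per subject type with its terms as subentries."""
    lines = []
    for subject_type in sorted(index):
        lines.append(headings.get(subject_type, subject_type))
        for term in sorted(index[subject_type], key=str.lower):
            pages = ", ".join(str(p) for p in index[subject_type][term])
            lines.append(f"    {term}, {pages}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = {
        56: 'We chose {XML}(markup-language "Extensible Markup Language").',
        68: '{XML}(markup-language "Extensible Markup Language") again.',
        3: "{Markdown}(markup-language) is a lightweight option.",
    }
    print(format_index(build_index(sample)))
```

Because every page reference comes from an annotation, a sketch like this can group terms by type, but it knows nothing about what the surrounding text says about each term.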

The result was a not bad index, but certainly not as good as Richard wanted. In particular, it did not let us do things like this:

constraints, 27, 228, 367–390
    auditing, 431
    cost of reuse, 153
    data entry, 315
    detecting duplication using, 414–415
    extensibility and, 334
    factoring out, 29, 42, 308
    managing reuse, 134
    media-domain, 29
    personalization, 167
    rhetorical, 246
    semantic, 312–315
    structural, 312–315
    types of schema, 389
    uniqueness, 174

These types of entries put terms in their narrative context. This requires a human reading of the surrounding text. It can't be done effectively from subject-domain semantic markup and it certainly can't be done reliably (yet) by indexing apps working on unstructured text.

Why is this important? Search engines have two big advantages over indexes (other than their enormous advantage in scope, which I mentioned earlier). First, indexes work on individual terms, while search engines can work with phrases and sentences. You can type an entire question into a search engine and it will use the whole sentence to discern what you are interested in. In other words, you can put your search terms in their narrative context up front by searching on the right phrase.

Second, they have a ranking algorithm that does a remarkably good job (most of the time) at selecting the most relevant entry and putting it at the top of the list. Indexes, by contrast, list pages in numerical order. If you want to get really fancy, you can bold page numbers for the main entries for a subject, but that is not in any way specific to the user's individual query. Search engines not only rank the subject matter statically, they rank it for the known interests of the individual user.

These entries that put terms into their narrative context help indexes partially make up for these deficiencies vis-à-vis search engines. They can only be created by hand, and Richard felt it was important to do this for the book, particularly in cases where a subject is mentioned many times. An undifferentiated list of 30 page references presents a rather daunting task to the reader. The context-setting entries can help them narrow down what they are looking for.

We did not throw out the automated generation of the index altogether, however. Rather, we added markup that allowed us to supplement the generated entries with human-created entries (100% of which were created by Richard, who is way better at this sort of thing than I am). As a result of this hybrid approach, we were able to reduce the indexing effort significantly, while still incorporating valuable index features that can only be created by hand.
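
The hybrid approach can be pictured as a merge of two sets of entries. The sketch below is only an illustration under assumed data structures, not the markup or tooling used for the book: each index is a dictionary of headings, each heading maps subentries to lists of page locators, and the empty-string subentry holds the heading's own pages. The names merge_entries and first_page are hypothetical.

```python
import re


def first_page(locator):
    """Return the first page of a locator such as '29' or '367–390'."""
    return int(re.split(r"[–-]", locator)[0])


def merge_entries(generated, manual):
    """Merge generated and hand-written index entries.

    Both arguments map heading -> {subentry: [page locators]}.
    The subentry "" (an assumption of this sketch) holds the pages
    listed directly after the heading.
    """
    merged = {}
    for source in (generated, manual):
        for heading, subentries in source.items():
            target = merged.setdefault(heading, {})
            for subentry, locators in subentries.items():
                bucket = target.setdefault(subentry, [])
                for locator in locators:
                    if locator not in bucket:  # skip duplicate references
                        bucket.append(locator)
    # Keep page references in numerical order, as a printed index would.
    for subentries in merged.values():
        for locators in subentries.values():
            locators.sort(key=first_page)
    return merged


if __name__ == "__main__":
    generated = {"constraints": {"": ["27", "228"]}}
    manual = {"constraints": {"": ["367–390"], "auditing": ["431"]}}
    print(merge_entries(generated, manual))
```

In this arrangement the generated entries supply the bulk of the page references, and the hand-written ones add the context-setting subentries that automation cannot produce.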

I'm planning to blog about this and other aspects of the development process for the book sometime soonish.

Mark

> -----Original Message-----
> From: Jonathan Baker
> Sent: Friday, July 27, 2018 7:22 AM
> To: dick@rlhamilton.net
> Cc: techwr-l@lists.techwr-l.com
> Subject: Re: Looking for classes in indexing
>
> I’m not into religious wars, so I won’t go there. However, in following this
> conversation, it occurred to me that there may be some tools out there to
> automate the indexing process. I didn’t do a search, but did stumble upon a
> tool called TExtract (texyz.com). I haven’t used it, but may the next time I
> need to do an index.
>
> Also, one of the best books about indexing was written by Ruth Canedy
> Cross. Unfortunately, Indexing Books is out of print and only sometimes
> available on Amazon.
>
> Jon
>
> Sent from my iPad


Original Post below:
This is from a thread about indexing on the Techwr-L list which I am including here because I am weak at indexing and the links in Monique's email look like some sites I should investigate.

-----Original Message-----
From: techwr-l [Monique Semp]
Sent: Wednesday, July 25, 2018 2:20 PM
To: Lin Sims
Cc: TECHWR-L
Subject: Re: Looking for classes in indexing

Really old, but I have a printout of an article from the October 2012 STC Intercom magazine that I found quite useful: The Top 10 Indexing Errors Made by Technical Writers, by Lorne Griffith. I imagine it wouldn't be too hard to find online. (And if it is, you can ping me and I'd be happy to scan it and send the PDF.)
And I second Lin's comments around the idea that digital search does not take the place of a good, human-produced index. For example, if the text says, "remove", but I search for "delete", I'll never find it. But in a manually-produced index, I'd expect to see an entry for "delete" that says, "see remove".
I was a member of ASI for many years even though, like Lin, the only things I indexed were tech docs that I wrote. But I thought of doing more (until I found how low the pay was for how much work it would take to write a really excellent, vs. mediocre, index), and I liked the erudite discussions among the REAL indexers.
It is unfortunate that ASI seems to require membership to take their courses. Maybe a personal note to them to ask for an exception might work? I did look at their webinars page, and those seem generally available to non-members, although the topics are mostly pretty narrowly focused to a given tool or a given indexing focus. See [1] https://www.asindexing.org/category/webinars.
The Portland chapter of ASI lists a bunch of non-ASI indexing courses: [2] http://pnwasi.org/wp/?page_id=141. Perhaps other chapters (don't recall where you are, so not sure if there's an active chapter near you) have similar lists of resources.
And here's a nice list of resources, including email discussion lists, but a lot of it is geared to people wanting to have their own indexing business. But of course, the email discussion lists would be useful for all indexers. See [3] http://www.backwordsindexing.com/Novice/NoviceNotes.html.
And my last list on this listicle reply: [4] https://indexstudents.wordpress.com/education/ - lots of training and all sorts of resources.
(Yes, I like the subject of indexing!)

References
  1. https://www.asindexing.org/category/webinars
  2. http://pnwasi.org/wp/?page_id=141
  3. http://www.backwordsindexing.com/Novice/NoviceNotes.html
  4. https://indexstudents.wordpress.com/education

This post from Mark Baker was worth capturing as I like his writing style.

From: techwr-l [mbaker@analecta.com]
Sent: Wednesday, July 25, 2018 6:20 PM
To: techwr-l@lists.techwr-l.com
Subject: RE: Looking for classes in indexing

I'm old too, but let's face it, indexes are the paper substitute for a search engine. Anything an index can do, a decent search engine can do better (yes, including synonyms). More to the point, even the old are so habituated to search now that the only way they are going to stumble into your index is if it shows up in search results.

Unless, of course, they actually are reading on paper, because then the index is the poor man's search engine, and in that case it better be good, because it has a lot to live up to.

And if there are those out there that still want to claim that indexes are better than search engines, here is the clincher: An index only works when you have the right book in your hand. Which means you have to find the book before you can use the index. But a search engine searches everything. The reader does not have to locate the book first. Indeed, they probably never know which "book" their results came from. They live in a world of pages, not books, and they find pages using search. Every Page is Page One.

If I was looking for a course to take in this day and age, I would take SEO before I took indexing. Unless, of course, I was actually preparing a book for publication on paper. (Which, as it happens, I am: Structured Writing: Rhetoric and Process, real soon now from XML Press. I think it has a pretty good index, most of which is Richard Hamilton's doing.)
