2chinskec
I am having an issue with LC Classification sorting. When I sort by LC Classification, I end up with entries sorted like this:
TL671.4.A528 1999
TL671.7.K54 2003
TL671.4.M35 1992
TL671.6.M36 2007
TL671.2.N5 2010
TL671.2.N54 2016
TL671.2.R29 2006
TL671.7
TL671.7
TL671.7
Is this an issue with LT, or am I doing something incorrectly?
3Nevov
I've found there is a bug topic open from 2023 about the LC sorting order: https://www.librarything.com/topic/352322 and a few links to older discussions within there, so this might be an LT issue.
4GraceCollection
I don't use LC but at a glance, it seems like it may be ignoring the first number after the decimal and sorting by the following letter instead, which is strange. If that pattern holds true throughout your library, then perhaps that information can help LT staff narrow down the bug.
5chinskec
I'm not sure if that pattern holds true, but I have only noticed the issue in cases where the LC class number has a numerical decimal. Here's another block in my library where I think the sorting is incorrect.
QA76.17.C47 2003
QA76.6.G656 2007
QA76.9.A25A5453 2011
QA76.9.A25C627 2009
QA76.9.A25S69 2011
QA76.73.C15N865 1992
QA76.73.C153S733 2003
QA76.73.F25E847 1992
QA76.73.F25F64 1996
QA76.73.F25M3945 1995
QA76.73.F25N9 1996
QA76.76.O63B3685 2004
QA76.76.O63R5628 2010
QA76.76.O63R855 2006
QA76.76.O63S73449 2004
QA76.76.T48H37 2008
QA76.76.T48W38 2011
QA76.2.T67T67 2001
6Maddz
All bar the last seem to be sorting as a text string: sort by first character, if they are the same, sort by the second character, and so on. I would have expected the last entry to be second on the list as the first character to vary in the list is character #6.
Unfortunately, I don't think you can force the numeric sort without left-padding the numeric sections so they are the same length, so replacing QA76.6 with QA76.06 and adding .00 if there is no decimal. Even then, there are going to be issues with the next section: it ranges from 3 characters to 9, and it's alphanumeric.
7GraceCollection
>6 Maddz: I'm not so sure this is right. Sorting as a text string should have 'QA76.9...' after 'QA76.7...', even if the part of the string between decimals has 2 digits instead of one.
8Maddz
>7 GraceCollection: No, that's right; '.' sorts before a number. '.' is ASCII code 46, '1' is code 49. That's how alphanumeric sorts work.
So for QA76.17 you get the following ASCII sort order 81/65/55/54/46/49/55, whereas for QA76.6. you get 81/65/55/54/46/54/46. Look at the last 2 codes for each string - 49/55 sorts before 54/46.
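For anyone who wants to check those codes without a lookup table, here is a small Python sketch. The helper name `codes` is made up for illustration, and it assumes that a "plain alphanumeric sort" means comparing character codes one position at a time, which is how Python compares strings by default.

```python
def codes(text):
    """Return the character code of each character in text."""
    return [ord(ch) for ch in text]


a = "QA76.17"
b = "QA76.6."

print(codes(a))  # [81, 65, 55, 54, 46, 49, 55]
print(codes(b))  # [81, 65, 55, 54, 46, 54, 46]

# The first five codes match; the sixth differs (49 vs 54),
# so a character-by-character comparison puts a before b.
print(a < b)  # True
```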
9GraceCollection
>8 Maddz: Right. I'm talking about 'QA76.9' and 'QA76.7': 81/65/55/54/46/57 should come after 81/65/55/54/46/55, regardless of whether the next character is a decimal point or another digit. But instead, the 'QA76.9's come first. Therefore, it can't be sorting as an alphanumeric string.
10Maddz
>9 GraceCollection: Look at character #7: QA76.9. and QA76.73 - the . sorts before the 3.
Basically, each character is sorted in turn. To a human brain .9. comes before .73. but to the algorithm it comes after. In order to make the sort correct, each section needs the same number of characters, so you would need to change .9. to .09., which changes 46/57/46 to 46/48/57/46, which sorts before 46/55/51/46 instead of after. This falls down with the third section, which looks alphanumeric or hexadecimal and has a highly variable number of characters, followed by the fourth section, which looks like the year of publication. As I'm not a librarian (or American for that matter), LC Classifications aren't relevant to me so I don't understand how they work.
If you have access to Excel, drop an apparently mis-sorted block into Column A. Sort by Column A, then do a text-to-columns operation using '.' and ' ' as the separators, so you end up with the first section in Column B, then column C and so on. Copy all the columns into a new tab, then do an advanced sort by column B, then C, then D, and then E. Copy the result back into the first tab, pasting it to the right of the initial block. I can't show you because I don't have Excel on this machine (only Google Sheets which only sorts by a single column), but you'll see the difference in the two sorts; the second section should sort as a number not as text. Unfortunately, to get the sort right to a human eye needs these codes splitting into sections, removing the separators and sorting by section not by the complete code including any separator.
You can check the ASCII code table here: https://www.ascii-code.com/ to get the character codes; if you replace each character by its numeric ASCII code you'll see the differences.
Apologies if this comes across as technical.
11Maddz
My techie partner has had a look and thinks there's something going on with the sort rules - he doesn't think they are going character by character as in a true alphanumeric sort. Looking at the description of LC Classification codes on Wikipedia, what seems to be going on is what >5 chinskec noticed: where the topic is a decimal number rather than an integer, it looks as though the decimal section is being treated as a separate entity, not as part of the topic code.
I dunno - I'm no expert on this. I think we'll have to wait until staff can confirm the sort rules.
12chinskec
This is important to me, because my library consists mostly of nonfiction, technical materials.
I wrote some MATLAB code to parse and sort (properly, I think) LC call numbers. For example, 'HN670.3Z9C6B845 2024' gets parsed as:
1. Class letters - HN
2. Class numbers - 670.3 (a real number)
3. 1st Cutter - Z9, which gets represented as 26.9 (a real number)
4. 2nd Cutter - C6, which gets represented as 3.6 (a real number)
5. 3rd Cutter - B845, which gets represented as 2.845 (a real number)
6. Date - 2024 (a real number).
The class letters (can be up to three letters) are also encoded as three real numbers. Then, the database can be sorted using a numerical sort algorithm.
Here is a link to my code, in case it is useful to anyone:
https://github.com/chinske/lccnsort
When I have time, I might port my code to Python, which is probably more accessible for most people.
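In the meantime, the parsing scheme listed above can be roughly illustrated in Python. This is only a sketch under stated assumptions, not a port of the linked repository and not how LT sorts: the function name `lc_sort_key` and the regular expressions are invented here, a missing class number or date is treated as 0, and the class letters are left as a string inside a tuple key rather than being encoded as three real numbers.

```python
import re


def lc_sort_key(call_number):
    """Build a sortable tuple from an LC call number (illustrative only)."""
    text = call_number.strip().upper()

    # Class letters (1-3) followed by an optional class number such as 670.3.
    head = re.match(r"([A-Z]{1,3})\s*(\d+(?:\.\d+)?)?", text)
    if head is None:
        raise ValueError(f"not an LC call number: {call_number!r}")
    letters = head.group(1)
    class_number = float(head.group(2) or 0)

    # A trailing four-digit date, if present (assumed 0 when missing).
    rest = text[head.end():]
    year_match = re.search(r"\s(\d{4})\s*$", rest)
    year = int(year_match.group(1)) if year_match else 0
    cutter_text = rest[:year_match.start()] if year_match else rest

    # Each Cutter: letter position in the alphabet, digits as the decimal
    # part, e.g. Z9 -> 26.9 and B845 -> 2.845.
    cutters = tuple(
        float(f"{ord(letter) - ord('A') + 1}.{digits}")
        for letter, digits in re.findall(r"([A-Z])(\d+)", cutter_text)
    )
    return (letters, class_number, cutters, year)


print(lc_sort_key("HN670.3Z9C6B845 2024"))
# ('HN', 670.3, (26.9, 3.6, 2.845), 2024)

shelf = ["TL671.7.K54 2003", "TL671.4.M35 1992", "TL671.2.N54 2016",
         "TL671.7", "TL671.2.N5 2010", "TL671.4.A528 1999"]
for item in sorted(shelf, key=lc_sort_key):
    print(item)
```

With this key, the class number is compared as a decimal first, then the Cutters in order, then the date, so the TL671.2 entries come before TL671.4, and a bare TL671.7 comes before TL671.7.K54.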
13GraceCollection
>10 Maddz: Look at character #7: QA76.9. and QA76.73 - the . sorts before the 3. Basically, each character is sorted in turn.
Look at the 6th character. 7 comes before 9. If 'each character is sorted in turn,' it does NOT matter whether . is sorted before or after 3, because 9 is always sorted after 7.
To the human brain, 9 and 09 are the same thing, but a computer sorting character-by-character compares the first character only, and compares further only if they are the same. A computer and a human can both agree that seven comes before nine. 46/57 in a string sort should be after 46/55, because, as you say, each character is sorted in turn.
You are looking at the wrong character. The decimal and the 3 do not matter in a character-by-character sort because they are after the 7 and 9.
If we were looking at QA76.7. and QA76.73, you would be correct. But if it were sorting character by character in an alphanumeric string, anything with 5 identical characters and then a 7 as the 6th character would sort before something with the identical 5 characters and then a 9 as the 6th character, regardless of the seventh character.
Therefore, that can't be the way it's sorting. If it's doing some sort of chunk sort, using the decimals to divide each entry into multiple data points, I don't know. But a character-by-character alphanumeric string sort does not work the way it is currently sorting.
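This can be checked with a quick sketch in Python, using a few of the call numbers from >5 in the order they were posted. It assumes that "character-by-character" means a default string comparison by character code, which is what Python's `sorted` does; the variable names are just for illustration.

```python
# A few entries from >5, in the order they currently appear.
observed = [
    "QA76.17.C47 2003",
    "QA76.6.G656 2007",
    "QA76.9.A25S69 2011",
    "QA76.73.C15N865 1992",
    "QA76.76.O63B3685 2004",
    "QA76.2.T67T67 2001",
]

# A plain character-by-character string sort of the same entries.
plain = sorted(observed)
for call_number in plain:
    print(call_number)
# QA76.17.C47 2003
# QA76.2.T67T67 2001
# QA76.6.G656 2007
# QA76.73.C15N865 1992
# QA76.76.O63B3685 2004
# QA76.9.A25S69 2011

# The plain sort puts QA76.73 before QA76.9, unlike the observed order.
print(plain == observed)  # False
```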
14kristilabrie
Just posting that LT staff has seen this and will have an answer for you soon, hopefully by tomorrow or as soon as possible thereafter!
15chinskec
kristilabrie, I'm curious if LT staff has any update on this?
16kristilabrie
Thanks for checking in: not quite yet, sorry for the delay! It is still on the developers' list to discuss when they get the time. Hopefully sooner rather than later; thanks for your patience in the meantime.
18kristilabrie
>17 chinskec: Thanks, I think we've got all the information we need, just need the time and return from holiday breaks to discuss further. :)