Lightroom fails to do "Unicode normalization" on searches

  • 2
  • Problem
  • Updated 1 week ago
  • Acknowledged

Lightroom fails to normalize Unicode in searches, so seemingly identical search strings give different results.

I was searching for all photos with the word "Médano" in the caption. It turns out that there are (at least?) two ways to encode this word in Unicode (UTF-8). Lightroom treats the two as different words and doesn't return one when you search for the other. This is a subtle issue, but anyone working in a language other than English will probably run into it sooner or later.

The two ways to "spell" Médano are "Médano" and "Médano" (I have no idea if these are distinct once pasted into this forum).

You can see the difference using the hexdump command:

~$ echo 'Médano' | hexdump -C
00000000  4d 65 cc 81 64 61 6e 6f  0a                       |Me..dano.|
00000009
~$ echo 'Médano' | hexdump -C
00000000  4d c3 a9 64 61 6e 6f 0a                           |M..dano.|
00000008
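
The same byte sequences can be reproduced without relying on copy and paste, for example in Python. This is only a sketch: the variable names and the `utf8_hex` helper are made up for illustration, and the two spellings are built from explicit code points so that nothing depends on how this page stores the accented letter.

```python
# Two spellings of the same word, built from explicit code points.
decomposed = "Me\u0301dano"   # "e" followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "M\u00e9dano"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE


def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return text.encode("utf-8").hex(" ")


print(utf8_hex(decomposed))       # 4d 65 cc 81 64 61 6e 6f
print(utf8_hex(precomposed))      # 4d c3 a9 64 61 6e 6f
print(decomposed == precomposed)  # False: a plain comparison sees different strings
```

Unlike the hexdump output, there is no trailing 0a here, because no newline is appended to the strings.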

The difference is explained at Wikipedia. I don't fully understand all of this, but the first form seems to be what I get when I copy from a PDF, and the second form is what I get when I type the word using the standard macOS English keyboard.

I guess my only suggestion is that if some searches involving accented characters are not giving you the results you expect, this might be what is going on.
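
As a possible workaround, one could generate both spellings of a search term and try each in turn. The sketch below does this in Python with the standard `unicodedata` module; `search_variants` is a hypothetical helper name, and whether pasting both variants into Lightroom finds every match is an assumption, not something verified here. NFC is the composed form (a single accented letter) and NFD is the decomposed form (base letter plus combining mark).

```python
import unicodedata


def search_variants(term: str) -> list[str]:
    """Return the distinct NFC and NFD spellings of term, NFC first."""
    variants = []
    for form in ("NFC", "NFD"):
        spelled = unicodedata.normalize(form, term)
        if spelled not in variants:
            variants.append(spelled)
    return variants


# Print each variant with its UTF-8 bytes so the two can be told apart.
for variant in search_variants("M\u00e9dano"):
    print(variant, variant.encode("utf-8").hex(" "))
```

A term without accented letters yields a single variant, so there is nothing extra to search for in that case.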

John Ellis adds:

The difference between the two versions of é is that one is a single Unicode character, Latin Small Letter E with Acute (U+00E9), whereas the other is a sequence of two characters, Latin Small Letter E (U+0065) followed by a Combining Acute Accent (U+0301). On Mac, you can type the former by holding down "e" and then selecting the accented version from the popup menu.
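
To see these code points directly, here is a small Python sketch using the standard `unicodedata` module. The `describe` helper and the variable names are illustrative only. It also shows that the two standard normalization forms convert one representation into the other.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """List each code point in text as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


single = "\u00e9"   # one code point
pair = "e\u0301"    # base letter plus combining accent

print(describe(single))  # ['U+00E9 LATIN SMALL LETTER E WITH ACUTE']
print(describe(pair))    # ['U+0065 LATIN SMALL LETTER E', 'U+0301 COMBINING ACUTE ACCENT']

# NFC composes the pair into the single character; NFD does the reverse.
print(unicodedata.normalize("NFC", pair) == single)  # True
print(unicodedata.normalize("NFD", single) == pair)  # True
```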

Alan Harper

  • 449 Posts
  • 89 Reply Likes

Posted 1 month ago

William Warby

  • 3 Posts
  • 2 Reply Likes
This reply was created from a merged topic originally titled Lightroom Smart Collection Filters against words with è accented character don't ....

I have many photos set with both sublocation and keyword "La Quinetière". It is impossible to create a smart collection which shows only these photos - when I set the criteria to find keywords or location with "contains" and "Quinetière", I get zero results. As a test, if I replace the accented è character with a normal e, the smart collection works. I can do a normal keyword search for "La Quinetière" and that works, so this seems to be a problem specific to smart collections.

John R. Ellis, Champion

  • 4047 Posts
  • 1067 Reply Likes
In the original topic, William wrote:

I've figured out the problem but it's the strangest thing - essentially I found that if I copied and pasted the keyword from a photo which had it set, rather than the one I had in the smart filter already, it works. I then added the version I had in the smart filter as a keyword and Lightroom appeared to add the exact same keyword twice. I have copied both variants here:

La Quinetière <-- Broken
La Quinetière <-- Works

They look identical (and where I've copied them here I guess they might be) but to Lightroom at least, they are not identical. I have since established that there are in fact two different representations of the è character in Unicode that look identical but have different character codes:

https://apps.timwhitlock.info/unicode/inspect?s=La+Quineti%C3%A8re
https://apps.timwhitlock.info/unicode/inspect?s=La+Quinetie%CC%80re
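
The difference is already visible in the query strings of those two links. As a rough Python sketch (the `non_ascii_code_points` helper name is made up for illustration), decoding the part after `s=` gives one string with a single U+00E8 and another with a plain e followed by U+0300:

```python
from urllib.parse import unquote_plus


def non_ascii_code_points(query: str) -> list[str]:
    """Percent-decode a query value and list its non-ASCII code points."""
    text = unquote_plus(query)  # "+" becomes a space, %XX bytes are read as UTF-8
    return [f"U+{ord(ch):04X}" for ch in text if ord(ch) > 0x7F]


# The two query values from the links above.
for query in ("La+Quineti%C3%A8re", "La+Quinetie%CC%80re"):
    print(len(unquote_plus(query)), non_ascii_code_points(query))
# 13 ['U+00E8']
# 14 ['U+0300']
```

The decomposed spelling is one code point longer, which is why the two keywords are stored as different strings.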

Alan Harper

  • 449 Posts
  • 89 Reply Likes
I suspect that Adobe programmed smart collections to ignore diacritical accents. I am talking about the text search field in the grid view, where diacritical marks are not ignored, but "identical" marks are not recognized as such.

John R. Ellis, Champion

  • 4047 Posts
  • 1067 Reply Likes
"I suspect that Adobe programmed smart collections to ignore diacritical accents"

I just verified that both smart collections and the Library Filter bar treat the two representations of "é" as different characters.   

John R. Ellis, Champion

  • 4059 Posts
  • 1070 Reply Likes
At the SDK code level, LrStringUtils.compareStrings isn't normalizing the two representations of "é". In particular, LrStringUtils.compareStrings ("\195\169", "\101\204\129") incorrectly returns false.
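
For comparison, here is a minimal Python sketch of what a normalizing comparison might look like. It is not Lightroom SDK code and says nothing about how LrStringUtils is implemented; `normalized_equal` is a hypothetical helper. The decimal escapes in the Lua strings above are raw UTF-8 bytes, so the same bytes are decoded here.

```python
import unicodedata

# The same bytes as the Lua literals "\195\169" and "\101\204\129".
precomposed = bytes([195, 169]).decode("utf-8")      # U+00E9
decomposed = bytes([101, 204, 129]).decode("utf-8")  # "e" + U+0301


def normalized_equal(a: str, b: str) -> bool:
    """Compare two strings after converting both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == decomposed)                  # False: code-point comparison
print(normalized_equal(precomposed, decomposed))  # True: canonically equivalent
```

NFC is used here only as one reasonable choice; comparing in NFD would give the same answer for these two strings.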

John R. Ellis, Champion

  • 4059 Posts
  • 1070 Reply Likes
A similar problem occurs when detecting duplicate photos whose path contains letters with diacritical marks: https://feedback.photoshop.com/photos...

John R. Ellis, Champion

  • 4059 Posts
  • 1070 Reply Likes
That other bug was reported fixed in 8.2: 
https://feedback.photoshop.com/photoshop_family/topics/synchronise-folder-creates-duplicates-when-no...

But it was surely in a different code path from this bug.