[xpost from /r/algorithms] Clustering on non-numeric (categorical) data

Hi all!

I am researching clustering algorithms for data that is not numeric. I am not at liberty to say exactly what data I'm clustering (it's proprietary), but I can give you an idea of the kind of data I have. I am also somewhat new to this area of study, so please let me know if my understanding is off or if I should be trying different things.

So, pretty much all of the clustering algorithms I've come across so far use some notion of distance to find clusters. This is fine if your data is numeric and its values have some intrinsic meaning. For example, we would consider a housing price meaningful because we can compare two instances of it in a helpful way: $200k - $150k = $50k tells us the first house costs $50k more than the second. A distance measure is thus helpful in this case.

This is not so for categorical data. Categorical data is where the values of a feature all fall into a set with no natural ordering. Car makes, for example, would be considered categorical data. There is no meaning in comparing the names of two manufacturers the way we compared the housing prices: Ford - Toyota = ??? is not very helpful. Even if we assigned numbers to the categories, the distances we would get would depend on the arbitrary numbering and hence be meaningless.
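To make the last point concrete, here is a minimal sketch in Python. The car makes and both integer codings are made up for illustration. It shows that a distance computed on integer codes changes when the labels are renumbered, while a plain match/mismatch comparison does not.

```python
# Hypothetical integer codings for the same three car makes.
# Both are equally "valid" because the categories have no natural order.
codes_a = {"Ford": 0, "Toyota": 1, "Honda": 2}
codes_b = {"Ford": 2, "Toyota": 0, "Honda": 1}


def coded_distance(a, b, codes):
    """Absolute difference of integer codes; depends on the labelling."""
    return abs(codes[a] - codes[b])


def matching_distance(a, b):
    """Simple matching: 0 if the categories are equal, 1 otherwise."""
    return 0 if a == b else 1


# The same pair of makes gets a different distance under each coding.
d_a = coded_distance("Ford", "Honda", codes_a)
d_b = coded_distance("Ford", "Honda", codes_b)

# The match/mismatch comparison never looks at the codes at all.
d_match = matching_distance("Ford", "Honda")
```

The mismatch comparison throws away any notion of "how different" two makes are, which is exactly the information categorical data does not carry in the first place.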

My data is a mix of these two things. I have numeric, categorical, and boolean features in my data sets (I consider boolean data a subset of categorical data, although we might find boolean distances meaningful with proper scaling). My research so far has turned up these two promising-ish papers:
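For concreteness, here is a rough sketch of what comparing two mixed records could look like. It uses a Gower-style dissimilarity, which is one commonly cited approach and is not taken from the papers mentioned here. The feature layout, the example records, and the assumed price range are all hypothetical.

```python
def gower_distance(x, y, kinds, ranges):
    """Gower-style dissimilarity between two mixed-type records.

    kinds[i] is "num", "cat", or "bool" for feature i.
    ranges[i] is the (assumed known) value range of a numeric feature,
    and is ignored for the other kinds.
    """
    total = 0.0
    for xi, yi, kind, rng in zip(x, y, kinds, ranges):
        if kind == "num":
            # Scale the absolute difference into [0, 1] by the feature range.
            total += abs(xi - yi) / rng if rng else 0.0
        else:
            # Categorical and boolean features: plain mismatch.
            total += 0.0 if xi == yi else 1.0
    # Average the per-feature contributions so the result stays in [0, 1].
    return total / len(kinds)


# Hypothetical records: (price, car make, has_garage).
kinds = ("num", "cat", "bool")
ranges = (100_000.0, None, None)  # assumed spread of prices in the data set

a = (200_000.0, "Ford", True)
b = (150_000.0, "Toyota", True)

d = gower_distance(a, b, kinds, ranges)
```

Every feature contributes a value between 0 and 1, so no single feature dominates just because of its units. Whether equal weighting across features is appropriate is a separate question that depends on the data.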

I don't know if these are quite what I'm looking for. They seem a bit hand-wavey about certain things (especially the first), but the other papers I found were much worse. Is there anyone on here who has experience with trying to cluster this type of data? If so, could you point me in the direction of some decent papers to read up on?

Thanks in advance!

tl;dr - I need to cluster mixed categorical and numeric data. Know of any good papers?

edit: I was told you guys would be more suited to answer this question. Please let me know if somewhere else would be better. Thanks in advance.

edit2: Also, if you know of a good paper but it is behind a paywall, don't worry. Just link it, as I can get access to almost any publication through our library.

edit3: Thanks all! This was extremely helpful.

submitted by hammerheadquark
