Reading Between the Data

On any given day, people leave behind patterns without realizing it. It could be the order of websites they visit, the sequence of life events they experience, or even the way they structure sentences when they write. To most of us, it’s just noise. To Volodymyr Melnykov, it’s a puzzle waiting to be solved.

Melnykov studies how to group similar patterns together, a process known as clustering. His recent study, “Model based clustering of multivariate categorical sequences,” was published in Pattern Recognition. He tackles how to make sense of data that isn’t made up of neat, clean numbers, a problem many traditional clustering methods struggle with.

“Statistics is about providing inference based on samples,” Melnykov said. “Samples are drawn from populations. But in reality, populations are much more complicated than what we usually cover in our introductory statistics courses. They may consist of various subpopulations representing different ages, health conditions, lifestyles, and other characteristics.”

“Clusters can be viewed as samples from these subpopulations,” he explained.

In simple terms, clustering helps researchers find hidden groups within data. “Observations within each group should be similar… but groups should be relatively distinct,” Melnykov said.

Most existing clustering methods work well when the data is numerical, such as height, weight or speed. But much of the real world is made up of categories like job titles, political parties, or life events like marriage, graduation or illness.

The challenge becomes even more complex when those categories appear as sequences over time. For example, a person’s life is not just a list of events, but an ordered story. The timing and order matter: graduating before starting a career often leads to a very different path than working first and returning to school later.

“Sequence has to be preserved,” Melnykov said. “The very specific order… tells a lot about the individual. It’s not just what happened, but when it happened and what came before or after. If you change the order, you change the interpretation.”

Melnykov’s research goes a step further by looking at multiple sequences at once. Instead of just tracking one aspect of a person’s life, like health, his method can analyze several dimensions simultaneously, such as health, employment, and financial status.

“Each sequence… can impact each other,” he said. “If a person got sick… probably financial well-being will be also affected.”

His model captures these interactions and uses them to better group similar patterns together. The result is a realistic and flexible way to analyze complex data.

One example from the paper involves analyzing writing styles. By breaking text into sequences, such as parts of speech or word lengths, researchers can compare authors in surprising ways.

“You can actually represent the style of a writer using these data,” Melnykov said.

That approach can even help answer long-standing debates. Melnykov and co-author Yingying Zhang explored whether the Russian novel And Quiet Flows the Don was written by someone other than Mikhail Sholokhov, who won the Nobel Prize in Literature for the book, by comparing stylistic patterns.

But the reach of this work doesn’t stop at literature. Melnykov points to projects involving climate science, ancient philosophy, and even volcanic eruptions, all drawing on similar statistical ideas.

At its core, his research reflects the idea that real-world data is messy, layered and full of hidden structure. By finding better ways to uncover those patterns, Melnykov hopes to help researchers across fields make sense of the stories buried inside it.

The Culverhouse College of Business

Media Inquiries

Zach Thomas

Director of Marketing & Communications
