Strings and Character Frequency | Developer Knowledge Base

String interview problems often look different on the surface, but many of them reduce to the same ideas: normalization, counting, preserving the right character state, and deciding whether an array or a HashMap is the better frequency structure.

1. Definition

What is this topic about?

This topic is about solving string problems by modeling characters explicitly.

Instead of thinking only in terms of whole words, you often think in terms of:

character counts,
position of first or last occurrence,
normalized representation,
and whether order matters or only multiset content matters.

Why does it exist?

Because many beginner and intermediate interview questions are disguised counting tasks.

A prompt may talk about:

anagrams,
duplicates,
valid characters,
palindromes,
first unique characters,
or grouped words,

but the real bottleneck is often frequency tracking.

Where does it fit?

This topic sits right after array basics.

That is natural because string problems often become:

array-of-counts problems,
HashMap lookup problems,
or two-pass scan problems.

What it is NOT

This topic is not about advanced string matching like KMP or Rabin-Karp.

It is about the earlier interview layer:

counting,
normalization,
grouping,
and scanning.

Why interviewers care

Interviewers like string problems because they test whether you can:

read the contract carefully,
notice hidden assumptions,
choose a compact state model,
and avoid overengineering.

A lot of wrong string solutions fail because the candidate did not clarify what should be ignored, normalized, or counted.

2. Core Concepts

String as a sequence problem

Many string problems are really sequence problems with characters.

That means the useful questions are often:

what is the unit of work,
what state must survive,
and does order matter?

Frequency counting

Frequency counting asks how many times each character appears.

That is useful for:

anagram checks,
duplicate detection,
majority-style reasoning,
and building canonical representations.

Array versus `HashMap`

When the alphabet is small and fixed, an array is often the cleanest choice.

Examples:

lowercase English letters,
uppercase letters,
digits,
or ASCII-focused prompts.

When the domain is larger or unclear, HashMap<Character, Integer> may be safer.

Normalization

Normalization means converting different-looking inputs into a common comparable form.

Typical normalization steps include:

lowercasing,
skipping spaces,
skipping punctuation,
trimming,
or converting to a sorted or counted signature.

Two-pass scans

A two-pass scan is common in string tasks.

First pass:

build counts,
store positions,
or gather structure.

Second pass:

find the first valid answer,
compare counts,
or build grouped output.

Canonical representation

A canonical representation is a stable form that lets equal logical strings compare the same way.

Examples:

sorted characters for anagrams,
count-signatures like a2#b1#c0...,
normalized lowercase alphanumeric form.

Order-sensitive versus order-insensitive tasks

This distinction matters early.

If order matters, a frequency-only solution may be insufficient.

If order does not matter, preserving original arrangement may be wasted work.

Character domain assumptions

Always ask what characters can appear.

That decides whether:

int[26] is enough,
int[128] is enough,
or a map-based approach is required.

Strings in Java

In Java, String is immutable.

That means repeated concatenation in loops can be costly.

For pure counting tasks, you often avoid building many intermediate strings.

Recognition signals

This topic should trigger when the prompt mentions:

anagram,
repeated characters,
uniqueness,
normalized equality,
or grouped-by-letter-content behavior.

3. Practical Usage

When counting is the right first move

Start with counting when the prompt asks whether two strings have the same characters, whether a character appears exactly once, or whether many words share the same letter composition.

When normalization matters first

Normalize first when uppercase/lowercase differences, punctuation, or whitespace should not affect the answer.

When an array is better than a map

Prefer an array when:

the alphabet is fixed,
the prompt guarantees lowercase English letters,
or you want tighter memory and faster constant factors.

When a map is better than an array

Prefer a map when:

the character set is not fixed,
Unicode or mixed symbols may appear,
or the prompt does not justify a small-domain assumption.

Typical interview flow

A clean explanation often sounds like this:

“I only care about character counts, not original order.”
“The prompt guarantees lowercase English letters.”
“So an int[26] is enough.”
“I update counts in one pass.”
“Then I compare or scan the counts depending on the task.”

Common optimization path

A common path is:

naive nested comparison,
then sorting,
then direct counting.

Sorting is often acceptable, but counting is frequently simpler and faster when the domain is small.

Pair-programming language

Useful phrases include:

“This is really a frequency problem.”
“Order is irrelevant here, so I want a canonical form.”
“The alphabet is fixed, so an array beats a map.”
“I will normalize before I compare.”
"This needs two passes: count first, answer second."

LeetCode #242 — Valid Anagram — int[26] frequency comparison.
LeetCode #387 — First Unique Character in a String — frequency array + second pass.
LeetCode #49 — Group Anagrams — canonical key via sorted characters.
LeetCode #125 — Valid Palindrome — normalization + two-pointer check.
LeetCode #438 — Find All Anagrams in a String — sliding window with frequency match.

4. Code Examples

Basic Example — valid anagram with `int[26]`

This version assumes lowercase English letters.

public class ValidAnagramArray {
    public static boolean isAnagram(String left, String right) {
        if (left == null || right == null || left.length() != right.length()) {
            return false;
        }

        int[] counts = new int[26];

        for (int index = 0; index < left.length(); index++) {
            counts[left.charAt(index) - 'a']++;
            counts[right.charAt(index) - 'a']--;
        }

        for (int count : counts) {
            if (count != 0) {
                return false;
            }
        }

        return true;
    }
}

Why it works:

equal strings must cancel to zero count difference,
the state is tiny,
and time is linear.

Complexity: time O(n) where n = max(s.length, t.length); extra space O(1) (fixed 26-element array).

Basic Example — first unique character with two passes

public class FirstUniqueCharacter {
    public static int firstUniqueIndex(String text) {
        if (text == null || text.isEmpty()) {
            return -1;
        }

        int[] counts = new int[26];

        for (int index = 0; index < text.length(); index++) {
            counts[text.charAt(index) - 'a']++;
        }

        for (int index = 0; index < text.length(); index++) {
            if (counts[text.charAt(index) - 'a'] == 1) {
                return index;
            }
        }

        return -1;
    }
}

This is a classic two-pass scan:

first count,
then find the first valid position.

Complexity: time O(n); extra space O(1) (fixed 26-element array).

Advanced Example — group anagrams with canonical signatures

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupAnagrams {
    public static List<List<String>> group(String[] words) {
        Map<String, List<String>> groups = new HashMap<>();

        for (String word : words) {
            String signature = buildSignature(word);
            groups.computeIfAbsent(signature, ignored -> new ArrayList<>()).add(word);
        }

        return new ArrayList<>(groups.values());
    }

    private static String buildSignature(String word) {
        int[] counts = new int[26];

        for (int index = 0; index < word.length(); index++) {
            counts[word.charAt(index) - 'a']++;
        }

        StringBuilder builder = new StringBuilder();
        for (int count : counts) {
            builder.append('#').append(count);
        }

        return builder.toString();
    }
}

The key move is building a canonical representation that ignores order but preserves composition.

Complexity: time O(n·k) where n = number of words, k = average word length; extra space O(n·k) for the map.

Advanced Example — palindrome check with normalization

public class NormalizedPalindrome {
    public static boolean isPalindrome(String text) {
        if (text == null) {
            return false;
        }

        int left = 0;
        int right = text.length() - 1;

        while (left < right) {
            char leftChar = Character.toLowerCase(text.charAt(left));
            char rightChar = Character.toLowerCase(text.charAt(right));

            if (!Character.isLetterOrDigit(leftChar)) {
                left++;
                continue;
            }

            if (!Character.isLetterOrDigit(rightChar)) {
                right--;
                continue;
            }

            if (leftChar != rightChar) {
                return false;
            }

            left++;
            right--;
        }

        return true;
    }
}

This is a good reminder that some string tasks are not frequency tasks first.

Normalization and pointer movement may be the real model.

Complexity: time O(n); extra space O(1).

Common Pitfall — wrong domain assumption

public class BrokenFrequencyCheck {
    public static boolean containsDuplicateLetter(String text) {
        int[] counts = new int[26];

        for (int index = 0; index < text.length(); index++) {
            counts[text.charAt(index) - 'a']++;
        }

        for (int count : counts) {
            if (count > 1) {
                return true;
            }
        }

        return false;
    }
}

This breaks if the input can contain uppercase letters, digits, punctuation, or non-English characters.

The real bug is not syntax.

It is an unstated assumption about the character domain.

5. Trade-offs

Aspect	Details
⚡ Performance	Frequency arrays and two-pass scans are usually linear and have great constants when the alphabet is fixed.
💾 Memory	Arrays are compact, while maps are more flexible but heavier.
🔧 Maintainability	Counting solutions are often short and review-friendly, but only if the domain assumptions are explicit.
🔄 Flexibility	Map-based solutions handle larger character sets better; array-based solutions are tighter but more specialized.

Performance trade-off: Sorting may be good enough, but counting usually wins when the domain is small.

Memory trade-off: An int[26] is tiny; a map buys flexibility at higher overhead.

Maintainability trade-off: The cleanest code is often the one whose assumptions are stated out loud.

Flexibility trade-off: The more diverse the input, the less safe hard-coded array domains become.

6. Common Mistakes

Assuming lowercase letters without proof
This is one of the most common silent bugs.
Forgetting normalization requirements
Case, spaces, and punctuation often change the correct model.
Using sorting when counting is simpler
Sorting is valid, but it may be unnecessary work.
Using counting when order still matters
Frequency alone does not preserve sequence information.
Building many temporary strings carelessly
In Java, repeated string concatenation in loops can be more expensive than necessary.
Missing the two-pass structure
Some tasks need counts first and a decision pass second.
Not naming the character domain
26, 52, 128, and HashMap are different design decisions, not interchangeable magic constants.

7. Deep Dive

Why string problems often reduce well

String prompts often feel messy because natural language is messy.

But many of them reduce cleanly once you ask whether the task is about:

equality after normalization,
multiset equality,
position of a rare character,
or grouped content.

Why counting arrays are so popular

Counting arrays are popular because they compress the needed state into a tiny fixed structure.

That gives:

predictable memory,
predictable access,
and simple review logic.

Canonical forms are powerful

Canonical forms let you compare things that look different but mean the same thing under the prompt rules.

That is why sorted strings, count signatures, and normalized lowercase forms show up repeatedly.

Strings are not always “just strings”

Sometimes the real model is:

array scanning,
hashing,
two pointers,
or window maintenance.

The string is only the container.

Java-specific caution

Because String is immutable, heavy reconstruction inside loops should be treated carefully.

When you only need counts, indexes, or normalized comparisons, do not allocate more than the task requires.

The explanation advantage

A strong explanation here is simple:

what assumptions the input gives,
what state you preserve,
why array or map is the right domain structure,
and whether order matters.

That already sounds much stronger than “I just used a map.”

8. Interview Questions

Q: When is int[26] better than HashMap<Character, Integer>?
A: When the prompt guarantees lowercase English letters or another small fixed alphabet.

Q: Why do many string problems use two passes?
A: Because the first pass gathers counts or structure, and the second pass turns that structure into the required answer.

Q: What is a canonical representation?
A: A stable form that lets logically equivalent inputs compare the same way, such as a sorted string or a count signature.

Q: What is a common silent bug in character-frequency problems?
A: Assuming a smaller character domain than the prompt actually guarantees.

Q: Why is normalization so important?
A: Because case, punctuation, and whitespace rules often change what “equal” means in the problem.

9. Glossary

Term	Definition
frequency counting	Tracking how many times each character appears.
normalization	Converting input into a common comparable form.
canonical representation	A stable representation used for comparison or grouping.
two-pass scan	One pass to gather state, another pass to answer.
character domain	The set of characters the input may contain.
immutable	A value that cannot be modified in place after creation.
signature	A compact representation of the string’s logical content.

10. Cheatsheet

Ask whether order matters.
Ask what character domain is guaranteed.
Normalize before comparing when the prompt requires it.
Prefer int[26] for fixed lowercase-letter tasks.
Prefer a map when the domain is unclear or broad.
Two-pass scans are common and often cleanest.
Sorting is valid, but counting may be cheaper.
Frequency alone cannot preserve order-sensitive meaning.
State your assumptions out loud in interviews.
Many string problems are really counting problems in disguise.