022 — Codeforces Hashing Pick (Rolling Hash)¶

Source¶

Original source: Self-picked Codeforces problem tagged "hashing" (representative: rolling-hash substring counting)
Platform: Codeforces — https://codeforces.com/problemset/ (filter tag: hashing)
Topic: rolling hash (polynomial string hashing)
Difficulty: Hard
Pattern: rolling hash / counting
Basic idea inherited from: Pattern 7 teaser + CP-Algorithms reading — https://cp-algorithms.com/string/string-hashing.html

1. Problem in simple words¶

এই row-টা "নিজের rating-এর একটা hashing-tagged Codeforces problem বেছে নাও" — তাই এখানে সেই ধাঁচের একটা প্রতিনিধি (representative) problem নিজের ভাষায় ধরছি, যা প্রায় সব string-hashing problem-এর মূলে: একটা text S আর একটা pattern P দেওয়া আছে; P, S-এর মধ্যে (overlap সহ) কতবার substring হিসেবে আসে গুনতে হবে। আসল CF problem আলাদা হতে পারে, কিন্তু rolling-hash যন্ত্রটা এক।

S = "abababab", P = "ab"
"ab" আসে index 0, 2, 4, 6-এ
উত্তর: 4

2. Which basic idea does this inherit from?¶

এটা Pattern 7 (rolling hash)-এর পূর্ণ রূপ — Repeated DNA Sequences (#19)-এ যে window-কে key ভাবার teaser দেখেছ, তার বড় সংস্করণ। মূল ধারণা place-value/positional সংখ্যা: string-কে কোনো base-এর একটা সংখ্যা ভাবা, যা ../../01-math-based-programming-fundamentals/02-modular-arithmetic/-এর modular arithmetic-এর উপর দাঁড়ানো। পুরো treatment CP-Algorithms-এ: https://cp-algorithms.com/string/string-hashing.html।

3. What is the hidden math or logic?¶

লুকানো গণিত হলো polynomial hashing: দৈর্ঘ্য-m একটা string c0 c1 ... c(m-1)-কে ধরা হয় সংখ্যা c0·base^(m-1) + c1·base^(m-2) + ... + c(m-1) (সব mod একটা বড় prime)। মূল সৌন্দর্য — window এক ঘর ডানে সরালে এই সংখ্যা শূন্য থেকে বের করতে হয় না: বেরিয়ে-যাওয়া অক্ষরের contribution = c·base^(m-1) বিয়োগ করো, বাকিটা base দিয়ে গুণো (সবাই এক ঘর বাঁয়ে সরল), নতুন অক্ষর যোগ করো। তাই প্রতিটা slide O(1)।

4. Which data structure is needed?¶

মূলত কয়েকটা integer (running hash, pattern hash, base^(m-1)) আর modular arithmetic — কোনো ভারী structure লাগে না। তবে "কোন কোন window pattern-এর সমান" বা "distinct substring কতগুলো" জাতীয় variant-এ একটা hash set / Counter লাগে দেখা hash জমাতে — সেখানেই hashing chapter-এর seen-set ফিরে আসে।

5. Why this data structure fits¶

দৈর্ঘ্য-m দুটো string সরাসরি তুলনা O(m); কিন্তু তাদের hash তুলনা O(1)। rolling বানানোয় প্রতিটা পরের window-এর hash আগেরটা থেকে O(1)-এ পাওয়া যায়, তাই পুরো text scan O(n)। নাইভ "প্রতি position-এ m-অক্ষর মেলাও" হতো O(n·m); rolling hash সেটাকে O(n + m)-এ নামায়। বড় CF input-এ এটাই পার্থক্য গড়ে দেয়।

6. Real-life analogy¶

ভাবো একটা লম্বা সংখ্যা-ফিতের উপর জানালা সরাচ্ছ আর প্রতিবার জানালার ভেতরের অঙ্কগুলোকে একটা একক সংখ্যা হিসেবে পড়ছ। জানালা এক ঘর ডানে সরালে তুমি পুরোটা আবার পড়ো না — বাঁয়ের সবচেয়ে বড় স্থানীয় মানের অঙ্কটা বাদ দাও, বাকিটা ১০ গুণ করো (সব এক ঘর বাঁয়ে সরল), ডানে নতুন অঙ্ক বসাও। ঠিক এভাবেই rolling hash এক ঘর সরায় — পুরো পুনর্গণনা ছাড়াই নতুন "সংখ্যা" পেয়ে যায়।

7. Visual explanation¶

S = "aaa", P = "aa" (m=2, base=256); প্রতি window-এর hash roll হয়:

P-এর hash  : 'a'·256 + 'a'                     = h(P)
window[0:2]="aa": একই হিসাব = h(P)             → মেলে, যাচাই "aa"=="aa" ✓
roll → window[1:3]="aa":
   বাদ S[0]='a'·base^(m-1), ·base, যোগ S[2]='a' = h(P) আবার → মেলে ✓
উত্তর: 2

প্রতিটা roll-এ মাত্র কয়েকটা গাণিতিক ধাপ — পুরো substring আবার পড়া নয়।

8. Example input and output¶

Input : S = "abababab", P = "ab"     Output: 4   (index 0,2,4,6)
Input : S = "aaaaa",    P = "aa"     Output: 4   (index 0,1,2,3; overlap সহ)
Input : S = "abcdef",   P = "xyz"    Output: 0
Input : S = "mississippi", P = "issi" Output: 2  (index 1, 4)
Input : S = "hello",    P = ""       Output: 0   (খালি pattern গোনা হয় না)

9. Brute force thinking¶

প্রথম সরল চিন্তা: text-এর প্রতিটা শুরু-position-এ pattern-টা অক্ষরে অক্ষরে মিলিয়ে দেখো।

count = 0
প্রতিটা শুরু i (0 .. n-m)-র জন্য:
    S[i : i+m] == P হলে count += 1
return count

10. Why brute force is slow¶

প্রতিটা শুরু-position-এ সবচেয়ে খারাপ ক্ষেত্রে mটা অক্ষর মেলাতে হয় — মোট O(n·m)। n ও m দুটোই বড় হলে (CF-তে 10^5–10^6) এটা time limit ছাড়ায়। অপচয়টা: পাশাপাশি দুটো window-এর প্রায় সব অক্ষর একই, তবু আমরা প্রতিবার গোড়া থেকে তুলনা করছি। hash তুলনা সেই কাজটাকে O(1)-এ নামায় (rolling-এ window-এর hash-ও O(1) আপডেট)।

11. Key observation¶

মূল observation: দুটো string সমান কি না, তা (প্রায় নিশ্চিতভাবে) তাদের একটামাত্র সংখ্যা — hash — তুলনা করে বলা যায়, পুরো অক্ষর না মিলিয়ে। আর সেই hash window slide-এ O(1)-এ আপডেট হয়। দুর্লভ collision সামলাতে hash মিললে একবার আসল slice মিলিয়ে নাও (verify), তাহলে উত্তর সবসময় সঠিক।

12. Optimized intuition¶

P-র hash একবার বানাও। S-এর প্রথম m-অক্ষর window-এর hash বানাও। high = base^(m-1) mod। প্রতিটা শুরু-position-এ: window-hash == P-hash হলে আসল slice মিলিয়ে (collision-guard) count বাড়াও; তারপর hash roll করো — বাঁ অক্ষরের contribution বাদ, ·base, ডান অক্ষর যোগ, mod। text একবার scan, প্রতিটা ধাপ O(1) — মোট O(n + m)।

13. Thinking tweak¶

মোচড়: string equality-কে integer equality-তে নামিয়ে আনো, আর সেই integer-কে window slide-এ incremental রাখো। এটাই Pattern 7-এর প্রাণ। CP-তে আরও একটা মোচড় জরুরি — base/mod randomize করো (বা double hashing), কারণ fixed parameter-কে anti-hash test case ইচ্ছে করে collide করায়। verify-step থাকলে এই গণনায় তা এমনিতেই নিরাপদ।

14. Step-by-step dry run¶

Input S = "aba", P = "ab" (base=256, m=2):

P-hash = ord('a')·256 + ord('b')। window S[0:2]="ab"-এর hash = একই → মেলে; slice "ab"=="ab" → count = 1।
roll: বাদ S[0]='a' (·high), ·256, যোগ S[2]='a' → window এখন "ba"-র hash।
শুরু-position 1: window-hash == P-hash? "ba" != "ab" → hash আলাদা → count বাড়ে না।
শেষ: count = 1।

15. Python solution¶

def count_occurrences(text, pattern):
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return 0
    base = 256                                  # alphabet-এর চেয়ে বড় base
    mod = 1_000_000_007                         # বড় prime, collision কমায়
    p_hash = 0                                  # pattern-এর hash
    for ch in pattern:
        p_hash = (p_hash * base + ord(ch)) % mod
    window_hash = 0                             # প্রথম window-এর hash
    for i in range(m):
        window_hash = (window_hash * base + ord(text[i])) % mod
    high = pow(base, m - 1, mod)                # সবচেয়ে বড় স্থানীয় মান
    count = 0
    for start in range(n - m + 1):
        if window_hash == p_hash:              # hash মিলেছে
            if text[start:start + m] == pattern:   # collision-গার্ড: আসল যাচাই
                count += 1
        if start < n - m:                      # পরের window-তে roll
            out_ch = ord(text[start])
            in_ch = ord(text[start + m])
            window_hash = ((window_hash - out_ch * high) * base + in_ch) % mod
    return count

# ---- tests ----
assert count_occurrences("abababab", "ab") == 4
assert count_occurrences("aaaaa", "aa") == 4
assert count_occurrences("abcdef", "xyz") == 0
assert count_occurrences("mississippi", "issi") == 2
assert count_occurrences("hello", "") == 0
print("সব test pass!")

16. Line-by-line code explanation¶

if m == 0 or m > n: return 0 — খালি pattern বা text-এর চেয়ে বড় pattern-এর কোনো occurrence নেই।
base, mod — polynomial hashing-এর প্যারামিটার; বড় prime mod collision কমায়।
প্রথম for — pattern-কে base-m সংখ্যা হিসেবে এক hash-এ গুটিয়ে নেয়।
দ্বিতীয় for — text-এর প্রথম m-অক্ষর window-এর hash, একইভাবে।
high = pow(base, m - 1, mod) — বাঁ-প্রান্তের অক্ষরের স্থানীয় মান, roll-এ বাদ দিতে লাগে।
if window_hash == p_hash: if text[start:start+m] == pattern: count += 1 — hash মিললে আসল slice মিলিয়ে নিশ্চিত হই (collision থাকলেও উত্তর নির্ভুল)।
window_hash = ((window_hash - out_ch * high) * base + in_ch) % mod — O(1) roll: বাঁ অক্ষর বাদ, এক ঘর বাঁয়ে সরাও, ডান অক্ষর যোগ; Python-এ % mod ঋণাত্মককেও ঠিক রাখে।

17. Output walkthrough¶

"abababab","ab"-এ চারটা position-এ hash মেলে আর verify পাস → 4। "aaaaa","aa"-এ overlap সহ চারটা → 4। "abcdef","xyz"-এ কোনো hash মেলে না → 0। "mississippi","issi"-এ index 1 ও 4-এ "issi" → 2। খালি pattern → 0 (গার্ডেই বেরিয়ে যায়)। শেষে print: সব test pass!।

18. Time complexity¶

O(n + m) — pattern ও প্রথম window-এর hash O(m); তারপর প্রতিটা শুরু-position O(1) roll, মোট O(n)। (collision বিরল বলে verify খরচ amortized নগণ্য; adversarial mod-এ randomize করলে নিরাপদ।)

19. Space complexity¶

O(1) অতিরিক্ত — কয়েকটা integer; slice-verify সাময়িকভাবে O(m)। distinct-substring variant-এ hash set রাখলে O(সংখ্যা)।

20. Common mistakes¶

collision উপেক্ষা করা — শুধু hash মিললেই count বাড়ালে বিরল ভুল ম্যাচ; hash মিললে আসল slice একবার verify করো (বা double hashing)।
fixed base/mod ব্যবহার করা — CF-এর anti-hash test fixed parameter collide করায়; base/mod randomize করো।
roll-এ ঋণচিহ্ন বা high ভুল — high = base^(m-1); বিয়োগের পর % mod দিতে ভুলো না (Python ঋণাত্মক mod ঠিক রাখলেও অভ্যাস রাখা ভালো)।

21. Which basic problem this inherits from¶

ভিত্তি: seen-set of windows (#19 Repeated DNA, Pattern 4)-এর "window = key" idea, এখানে key হলো O(1)-এ roll হওয়া polynomial hash (Pattern 7)। গণিতটা আসে math fundamentals-এর place-value ও modular arithmetic (../../01-math-based-programming-fundamentals/02-modular-arithmetic/) থেকে। সম্পূর্ণ teaser ../patterns.md-র Pattern 7-এ, পূর্ণ পাঠ https://cp-algorithms.com/string/string-hashing.html-এ।

22. Similar harder problems¶

Repeated DNA Sequences — https://leetcode.com/problems/repeated-dna-sequences/ (এই tracker-এর #19; window hashing-এর সরল রূপ)
Longest Duplicate Substring — https://leetcode.com/problems/longest-duplicate-substring/ (rolling hash + binary search)
Find the Index of the First Occurrence in a String — https://leetcode.com/problems/find-the-index-of-the-first-occurrence-in-a-string/ (substring search; rolling hash বা KMP)

23. Pattern learned¶

Rolling hash / counting: string equality-কে integer hash equality-তে নামাও; window slide-এ hash O(1)-এ roll করো (বাঁ contribution বাদ, ·base, ডান যোগ, mod)। hash মিললে verify; CF-তে base/mod randomize।

24. Final summary¶

ভবিষ্যতের আমাকে: "অনেক substring দ্রুত তুলনা / খোঁজা / গোনা" (বিশেষত CF-এর hashing tag) দেখলেই rolling hash মনে পড়বে — string = base-সংখ্যা, slide = O(1) roll, equality = hash তুলনা + verify। এটা Repeated DNA-র window-key idea-র পূর্ণ রূপ; গণিতটা place-value আর modular arithmetic। randomize আর verify — এই দুটো ভুললে CF-এ ধরা।

মনে রাখার নিয়ম (legal): সব নিজের ভাষায় লেখা; problem statement copy করা হয়নি; official link শুধু attribution-এর জন্য; "asked by Google/Amazon" দাবি নেই — এটা একটা common interview pattern।