1.5 Implement a method to perform basic string compression using the counts of repeated characters. For example, the string aabccccaaa would become a2b1c5a3. If the “compressed” string would not become smaller than the original string, your method should return the original string.
This question is asking for a basic implementation of run-length encoding.
String concatenation
def compress_string(value):
if not value:
return value
result = ""
current_char = value[0]
current_count = 1
for i in range(1, len(value)):
char = value[i]
if char == current_char:
current_count += 1
else:
result += f"{current_char}{current_count}"
current_count = 1
result += f"{current_char}{current_count}"
if len(result) >= len(value):
return value
return result
Gayle considers this approach inefficient both in time and space because every time a new character is encountered and we append both the current_char
and the current_count
values to the resulting string, the concatentation operation takes O(K^2) time such that K is the size of our character sequence (the count of unique characters).
Why does the concatenation take O(K^2) time?
Consider the example given in the prompt: a2b1c5a3
. Lets say we’re at the end of the compression in our function above and result
equals a2b1c5
and we have a
and 3
left to add. For each of those last two characters (a
and 3
), we need to copy all of the left hand side (a2b1c5
) and the new character into a new string.
Since we need to iterate through the entire string as well, the runtime complexity is O(p + k^2) such that p
is the length of the string and k
is the length of our character sequence.
Small aside: in the book, gayle makes this runtime estimate based on java, not python. Newer versions of java actually compile string concatenation in certain situations to use string buffers which do not take quadratic time for concatentatios
Allocate a result list to be exact size of final compression
Avoid the overhead of string concatenation by pre-allocating a list. Find the size by running an intial run that only computes the number of characters we’ll be introducing. If it’s less than the original string, then we return the original. Otherwise, we’ll run the compression.
def compress_string(value):
if not value:
return value
# find final result list size
current_char = value[0]
current_count = 1
total_size = 0
for i in range(1, len(value)):
char = value[i]
if char == current_char:
current_count += 1
else:
# note that we have to stringify the count value to count the characters
total_size += (1 + len(str(current_count)))
current_char = char
current_count = 1
total_size += (1 + len(str(current_count)))
if total_size >= len(value):
return value
# calculate compression
result = [None] * total_size
current_char = value[0]
current_count = 1
index = 0
for i in range(1, len(value)):
char = value[i]
if char == current_char:
current_count += 1
else:
result[index] = current_char
index += 1
for c in str(current_count):
result[index] = c
index += 1
current_char = char
current_count = 1
result[index] = current_char
index += 1
for c in str(current_count):
result[index] = c
index += 1
return ''.join(result)