The goal of this lecture note is to illustrate how to program with strings in OCaml.
So strings are built out of characters and are written between double quotes:
# "hello world";;
- : string = "hello world"
#
They are concatenated with the infix operator ^:
# "ab" ^ "ba";;
- : string = "abba"
# "hello" ^ " " ^ "world";;
- : string = "hello world"
#
And if we want to have a double quote in a string, we quote it with a backslash, i.e., \. This backslash is purely notational, in that it does not exist in the string, which we can verify by computing the length of the said string:
# "\"";;
- : string = "\""
# String.length "\"";;
- : int = 1
#
Likewise if we want to have a backslash in a string – we quote it too:
# "\\";;
- : string = "\\"
#
Besides String.length, a function of type string -> int to compute the length of a string, the string library String also contains, e.g, String.get : string -> int -> char, a function to index a character in a string, starting from 0 all the way up to (and excluding) the length of the string. Indexing a string with a negative integer or with an index that is greater or equal to the length of the string raises an error:
# let a_string_of_length_4 = "aBcD";;
val a_string_of_length_4 : string = "aBcD"
# String.length a_string_of_length_4;;
- : int = 4
# String.get a_string_of_length_4 (-1);;
Exception: Invalid_argument "index out of bounds".
# String.get a_string_of_length_4 0;;
- : char = 'a'
# String.get a_string_of_length_4 1;;
- : char = 'B'
# String.get a_string_of_length_4 2;;
- : char = 'c'
# String.get a_string_of_length_4 3;;
- : char = 'D'
# String.get a_string_of_length_4 4;;
Exception: Invalid_argument "index out of bounds".
# String.get a_string_of_length_4 max_int;;
Exception: Invalid_argument "index out of bounds".
#
Assume that strings are represented (i.e., implemented) as a contiguous area of memory (which they actually are). Such a contiguous area of memory is often called a “buffer”, and the fact that it cannot be indexed outside of its bounds (here: 0, inclusively, and the length of the string, exclusively) means that the implementation of OCaml does not suffer from the buffer overflow problem. Small causes, big effects.
Notationally, if the name s denotes a string and i denotes an integer, s.[i] is syntactic sugar (i.e., is unparsed into) String.get s i.
In Exercise 1.g, for example, “adding 1 to the left n times, iteratively and starting with 0” means that we start with 0, and then we add 1, which gives 1, and then we add 1, which gives 2, and then we add 1, which gives 3, etc., iteratively, i.e., one addition at a time, as if we were occupying ourselves with building our own onion:
0 = 0
1 + (0) = 1
1 + (1 + (0)) = 1 + (1) = 2
1 + (1 + (1 + (0))) = 1 + (2) = 3
1 + (1 + (1 + (1 + (0)))) = 1 + (3) = 4
An editorial recommendation: to add 1 to the left n times, iteratively and starting with 0,
Magritte: I approve this mini-interlude.
Harald: Wow. s and s.
Alfrothul (o _ o): s and s.
Harald: One is the name of the string in OCaml, and the other is the string.
Alfrothul: Ah, right. Names and what these names denote.
Harald: Yes. A useful distinction to make, isn’t it?
Alfrothul: In this context, yes. Let me try: evaluating String.length s amounts to applying the function denoted by String.length to the string denoted by s. How am I doing?
Harald: You are doing good, I believe. How about String.length "aBcD"?
Alfrothul: Same same. Evaluating String.length "aBcD" amounts to applying the function denoted by String.length to the string denoted by "aBcD".
Vigfus: But the string denoted by "aBcD" is "aBcD", since "aBcD" is a literal. Wait. Actually no: one is the syntax of the string – how it’s written, and the other is its semantics – what it means, i.e., an OCaml value of type string. I think I’m getting the hang of this.
Since Version 4.00.0, the OCaml library function for strings features String.map, which has type (char -> char) -> string -> string. To quote its documentation, “String.map f s applies function f in turn to all the characters of s (in increasing index order) and stores the results in a new string that is returned.”
For example, String.map can be used to map a string to another string of the same length but containing only the character 'a', as captured in the following unit-test function:
let test_to_a candidate =
let b0 = (candidate "" = "")
and b1 = (candidate "x" = "a")
and b2 = (candidate "xy" = "aa")
and b3 = (candidate "xyz" = "aaa")
(* etc. *)
in b0 && b1 && b2 && b3;;
To implement to_a, we supply the constant function fun c -> 'a' to String.map and to the given string:
let to_a s =
String.map (fun c -> 'a') s;;
This implementation passes the unit test:
# test_to_a to_a;;
- : bool = true
#
In words, fun c -> 'a' is applied to each character of the given string, and the resulting characters 'a' all occur in the resulting string:
Also, since c is not used in the body of fun c -> 'a', we can make that manifest by writing _ instead:
let to_a s =
String.map (fun _ -> 'a') s;;
Alfrothul: Ah! Now we can write a generator of real random strings.
Harald: A generator of real random strings. We already had a generator of random strings, didn’t we?
Alfrothul: Yes we did, but they all contained the same character.
Harald: The same random character, yes.
Alfrothul: And now we can manufacture random strings containing random characters.
Harald: Pray explain.
Alfrothul: Well, we first generate a string of random length, and then we map each of its characters to a random one using String.map. Look:
let random_char () =
char_of_int (Random.int 32 + int_of_char ' ');;
let random_string n =
String.map (fun _ -> random_char ()) (String.make (Random.int n) ' ');;
Harald (prudently): I see that your function has type int -> string.
Alfrothul: It does. Given a non-negative integer, it first creates a string of random length containing the space character using String.make, and then it maps each occurrence of the space character into a random character using String.map. Look:
# random_string 10;;
- : string = "(+%- .62"
# random_string 10;;
- : string = " 3+>,"
# random_string 10;;
- : string = ""
# random_string 10;;
- : string = "0?8,."
#
Harald: Harrumph.
Keeping in mind that int_of_char and char_of_int respectively map a character into its ASCII code and the ASCII code of a character into this character, define a function that maps a string of lowercase letters to the corresponding string of uppercase letters and another function that maps a string of uppercase letters to the corresponding string of lowercase letters:
let test_upper_string_of_letters candidate =
let b0 = (candidate "abc" = "ABC")
(* etc. *)
in b0;;
let test_lower_string_of_letters candidate =
let b0 = (candidate "ABC" = "abc")
(* etc. *)
in b0;;
Vigfus: We are supposed to extend these unit-test functions with our own tests, right?Mimer: Right.Brynja: Any other piece of advice?Mimer: Read the general interlude below.
(The input of upper_string_of_letters is assumed to be a string of lowercase letters and the input of lower_string_of_letters is assumed to be a string of uppercase letters.)
Harald: But what if the given string does not exclusively contain lowercase letters?
Alfrothul: Well, we were asserted it does.
Brynja: My turn to be literal:
let test_upper_letter candidate =
let a = (candidate 'a' = 'A')
and z = (candidate 'z' = 'Z')
in a && z;;
let upper_letter c =
let a = int_of_char 'a'
and n = int_of_char c
and z = int_of_char 'z'
and d = int_of_char 'a' - int_of_char 'A'
in let () = assert (a <= n && n <= z)
in char_of_int (n - d);;
let () = assert (test_upper_letter upper_letter);;
Loki: Wow. Literal and testy.
Harald: Hang on a second. OK, a denotes the ASCII code of the lowercase letter 'a', n denotes the ASCII code of the given letter, z denotes the ASCII code of the lowercase letter 'z', and d denotes... denotes what?
Brynja: Well, the ASCII code for 'a' is 97 and the ASCII code for 'A' is 65, and letters occur in alphabetical order in the ASCII table, so...
Harald: OK. Evaluating n - d gives the ASCII code of the corresponding uppercase letter.
Alfrothul: Which it does, since you have asserted that n lies between a and z, i.e., that the given letter is lowercase. If it isn’t, the assertion fails and the computation stops.
Vigfus: Well, lower_letter follows.
Halcyon: Mutatis mutandis.
This exercise is a continuation of Exercise 2.
Implement a function that maps the lowercase letters in an arbitrary string to uppercase and that leaves the other characters unchanged, be them uppercase letters, digits, or whichever other character:
let test_upper_string candidate =
let b0 = (candidate "abc123" = "ABC123")
and b1 = (candidate "--A B C**" = "--A B C**")
and b2 = (candidate "Abc" = "ABC")
and b3 = (candidate "aBc" = "ABC")
and b4 = (candidate "abC" = "ABC")
(* etc. *)
in b0 && b1 && b2 && b3 && b4;;
Implement a function that maps the uppercase letters in an arbitrary string to lowercase and that leaves the other characters unchanged, be them lowercase letters, digits, or whichever other character:
let test_lower_string candidate =
let b0 = (candidate "ABC123" = "abc123")
and b1 = (candidate "--a b c**" = "--a b c**")
and b2 = (candidate "Abc" = "abc")
and b3 = (candidate "aBc" = "abc")
and b4 = (candidate "abC" = "abc")
(* etc. *)
in b0 && b1 && b2 && b3 && b4;;
Subsidiary questions, keeping in mind that in Emacs, M-b is used to make the cursor jump to the beginning of the latest word (or of the current word), and M-f is used to make the cursor jump to the end of the next word (or of the current word):
Alfrothul: Fun fact – since Version 4.02.0, the OCaml library function for strings also features a function String.mapi, which has type (int -> char -> char) -> string -> string. To quote its documentation, “String.mapi f s calls f with each character of s and its index (in increasing index order) and stores the results in a new string that is returned.”
Harald: So... We should be able to map any string to a string of the same length, but containing successive digits. Blam.
Alfrothul: Beg pardon?
Harald: Let me sketch a unit-test function:
let test_blam candidate =
let b0 = (candidate "" = "")
and b1 = (candidate "a" = "0")
and b2 = (candidate "ab" = "01")
and b3 = (candidate "abc" = "012")
and b4 = (candidate "abcd" = "0123")
(* etc. *)
in b0 && b1 && b2 && b3 && b4;;
Alfrothul: I see. Pretty straighforward indeed:
let blam s =
String.mapi (fun i _ -> char_of_int (i + 48)) s;;
Harald: i + 48, eh.
Alfrothul: That’s because the ASCII code of 0 is 48:
# int_of_char '0';;
- : int = 48
# char_of_int 48;;
- : char = '0'
# char_of_int (48 + 1);;
- : char = '1'
# char_of_int (48 + 2);;
- : char = '2'
#
Harald: I see. But there is no way I want to remember that the ASCII code of 0 is 48.
Alfrothul: What’s that now?
Harald: I mean we should be more explicit in the definition of blam:
let blam s =
String.mapi (fun i _ -> char_of_int (i + int_of_char '0')) s;;
Alfrothul: Mmmh... Right. Still, be that as it may, blam passes the unit tests too, which was the point:
# test_blam blam;;
- : bool = true
#
Harald: And what if blam is applied to a string that is longer than 10? Let me try:
# String.length "abcdefghijkl";;
- : int = 12
# blam "abcdefghijkl";;
- : string = "0123456789:;"
#
Alfrothul: Well, that’s the ASCII table for you. After the characters 8 and 9, the characters : and ;.
Be a pal and
Could you parameterize blam to yield a cycle that is shorter than 10? For example, when applied to a string and to 6, the result should be the cyclic string "012345012345012345...". Of course a new unit-test function will be needed too.
Subsidiary questions:
Vigfus: I’m sorry but String.mapi is undefined:
# String.mapi;;
Characters 0-11:
String.mapi;;
^^^^^^^^^^^
Error: Unbound value String.mapi
Did you mean map?
#
Brynja: What is your version of OCaml?
Vigfus: When OCaml starts, it displays:
OCaml version 4.01.0
#
Jasmine: That’s why.
Brynja: Right – mapi was only added to the strings library since Version 4.02.0.
Vigfus: OK?
Brynja: Can you upgrade your version of OCaml?
Vigfus (with a small voice): No?
Mimer: That’s OK. The accompanying resource file contains an updated version of the String module that contains both a definition of mapi and a definition of init, so just load this file and you are good to go.
Vigfus (clickety clickety click): Yay! I can now generate the 26 letters of the alphabet in a string.
Harald (O_o): Letters? Alphabet? String?
Vigfus: Well, yes. We can map a string of 26 characters to a string that contains all successive letters of the alphabet. Look:
# let lower_case_alphabet =
String.mapi (fun i _ -> char_of_int (i + int_of_char 'a')) (String.make 26 ' ');;
val lower_case_alphabet : string = "abcdefghijklmnopqrstuvwxyz"
#
Mimer: Way to go, Vigfus!
Loki: Wow, Mimer.Mimer: Yes?Loki: That was a quick and dirty definition of String.mapi.Mimer: Did you expect a reference implementation?Loki: Well, it is effective.
Can String.map be defined in terms of String.mapi?
Vigfus: On second thoughts, though...
Brynja: On second thoughts?
Vigfus: It would be a better idea to use String.init here to generate the 26 letters of the alphabet in a string.
Brynja: Pray tell.
Vigfus (taking a deep breath): Well, since Version 4.02.0, the strings library also contains an init function of type int -> (int -> char) -> string that we can use to initialize a string. Given the length of the desired string and a function of type int -> char, this init function yields a string where each of the characters are the result of applying the given function to each of the successive valid indices in this string. Look:
# String.init 26 (fun i -> char_of_int (i + int_of_char 'A'));;
- : string = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
#
Brynja: Mimer is right. Way to go, Vigfus!
Alfrothul: Indeed. Well spotted, Vigfus: the generator of random strings too is better implemented with String.init:
let random_string n =
String.init (Random.int n) (fun _ -> random_char ());;
Harald: How about we generalize?
Alfrothul: Er... Sure, sounds good, let’s. What do you have in mind?
Harald: Well, the ASCII table contains 2 to the 7, i.e., 128 characters, so we could create a string for it.
Alfrothul: Indeed we can:
# let ascii_table = String.init 128 char_of_int;;
val ascii_table : string =
"\000\001\002\003\004\005\006\007\b\t\n\011\012\r\014\015\016\017\018\019\020\021\022\023\024\025\026\027\028\029\030\031 !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\127"
#
Brynja (jumping in): Wow, what a way to print the first 32 characters in the ASCII table. Let me check something:
# '\000';;
- : char = '\000'
# int_of_char '\000';;
- : int = 0
# String.length "\000";;
- : int = 1
#
Brynja: OK, makes sense. Sorry. You were saying?
Vigfus (jumping in too): Actually, OCaml’s ASCII table has 2 to the 8, i.e., 256 entries, and the last one use the same convention, look:
# String.get (String.init 256 char_of_int) 255;;
- : char = '\255'
#
Harald: Thanks. Be that as it may, we can now simulate String.mapi in terms of String.map:
let test_string_mapi candidate =
let b0 = (candidate (fun i c -> char_of_int (int_of_char c + 1)) "abcde" = "bcdef")
and b1 = (candidate (fun i c -> char_of_int (i + int_of_char '0')) "abcde" = "01234")
in b0 && b1;;
let string_mapi f s =
String.map
(fun c -> let i = int_of_char c
in f i (String.get s i))
(String.init (String.length s) char_of_int);;
Alfrothul: Wow. You first create a prefix of the ASCII table, and then you map pointwise a function over this prefix. In effect, this function is given the index of the character it is applied to.
Harald: Right. And then we can apply the given function to the index and to the corresponding character in the given string, as per String.mapi‘s mandate.
Brynja: ...but only for strings up to 256 characters.
Mike Wazowski: We’re still working on it. It’s a work in progress.
Mimer: Mr. Wazowski, thanks for stopping by! Do you need ushers?
Harald (unfazed): Yes, only up to 256 characters:
let string_mapi f s =
let () = assert (String.length s <= 256) in
String.map
(fun c -> let i = int_of_char c
in f i (String.get s i))
(String.init (String.length s) char_of_int);;
Vigfus: Well, this implementation passes the unit test.
The string is a stark data structure and everywhere it is passed there is much duplication of process.It is a perfect vehicle for hiding information.—Alan Perlis‘s programming epigram #34
Integers can be converted to strings (in base 10) and vice-versa:
# string_of_int 123;;
- : string = "123"
# int_of_string "123";;
- : int = 123
#
The function converting a string to an integer is a left inverse of the function converting an integer to a string:
# int_of_string (string_of_int 543210);;
- : int = 543210
# int_of_string "hello world";;
Exception: Failure "int_of_string".
#
Negative integers can also be formatted into a string, but on the occasion one must help the OCaml parser by wrapping them between parentheses or prefixing their sign with ~, so that the negative sign is not read as the infix subtraction operator, which gives rise to, ah, unintuitive error messages:
# int_of_string "-321";;
- : int = -321
# string_of_int -123;; (* this expression is read as subtracting 123 to string_of_int *)
Characters 0-13:
string_of_int -123;;
^^^^^^^^^^^^^
Error: This expression has type int -> string
but an expression was expected of type int
# string_of_int (-123);; (* fix: isolate the negative integer between cosy parentheses *)
- : string = "-123"
# string_of_int ~-123;; (* alternative fix: mark the negative sign as such *)
- : string = "-123"
#
To generalize string_of_int, OCaml offers a library function to format strings, Printf.sprintf, that returns a formatted string. To a first approximation, its first argument is the string to print out (so there is nothing to format here):
# Printf.sprintf "Hello world";;
- : string = "Hello world"
#
This argument string is actually a control directive that determines the behavior of Printf.sprintf: if, for example, it contains an occurrence of %i, then Printf.sprintf expects not just the control string, but an integer too – in fact, as many integers as occurrences of %i in the control string. These integers are formatted into part of the resulting string:
# Printf.sprintf "We are in %ind Street." 42;;
- : string = "We are in 42nd Street."
# Printf.sprintf "We stand at the corner of %ind Street and %ith Avenue." 42 5;;
- : string = "We stand at the corner of 42nd Street and 5th Avenue."
#
Here, the control directive %i specifies that an integer is expected. Other control directives exist, e.g., %s for strings and %B for Booleans:
# Printf.sprintf "Is %s %B?\n" "that" true;;
- : string = "Is that true?\n"
#
There is more information in the OCaml manual.
OCaml also offers a library function to format strings and print them out, Printf.printf, that returns the unit-type value. The only difference between Printf.sprintf and Printf.printf is that one returns the formatted string whereas the other prints it out and returns the unit-type value. In other words, the following expressions are equivalent:
To wit:
# Printf.printf "Hello world\n";;
Hello world
- : unit = ()
# Printf.printf "%s" (Printf.sprintf "Hello world\n");;
Hello world
- : unit = ()
# Printf.printf "%s" (Printf.sprintf "%s" (Printf.sprintf "Hello world\n"));;
Hello world
- : unit = ()
# Printf.printf "%s" (Printf.sprintf "%s" (Printf.sprintf "%s" (Printf.sprintf "Hello world\n")));;
Hello world
- : unit = ()
# Printf.printf "We are in %ind Street.\n" 42;;
We are in 42nd Street.
- : unit = ()
# Printf.printf "%s" (Printf.sprintf "We are in %ind Street.\n" 42);;
We are in 42nd Street.
- : unit = ()
# Printf.printf "%s" (Printf.sprintf "%s" (Printf.sprintf "We are in %ind Street.\n" 42));;
We are in 42nd Street.
- : unit = ()
# Printf.printf "%s" (Printf.sprintf "%s" (Printf.sprintf "%s" (Printf.sprintf "We are in %ind Street.\n" 42)));;
We are in 42nd Street.
- : unit = ()
#
Added the editorial recommendation in the food for thought about Exercise 1.1 [24 Apr 2021]
Added the aftermath, removed the Stringy module, and added the updated String module [11 Mar 2021]
Created 2 versions of the resource file, one for current versions of OCaml and one for older versions, thanks to Ștefan-Cristian Roată’s eagle eye [19 Feb 2021]
Simplified the statement of Exercise 3, at Mohamed Mbarek’s sane prompting [15 Feb 2021]
Added the definition of random_char in the random interlude, at Mohamed Mbarek’s candid prompting [13 Feb 2021]
Created [13 Feb 2021]