Ruby Regular Expressions
I wrote this reference as a summary and cheatsheet on how to use regular expressions in Ruby, along with some tips
### DIFFERENT WAYS OF MATCHING
"Use the force" =~ /force/ # => 8 (index of the first occurrence of the word)
"Use the fork" =~ /force/ # => nil
"Use the fork" !~ /force/ # => true
"Use the force"[/force/] # => "force"
"Use the fork"[/force/] # => nil
"Use the force".match /force/ #=> #<MatchData "force">
"Use the fork".match /force/ #=> nil
"Use the force"[/(the) force/, 1] # => "the" # first capture (index of MatchData)
"Number 123"[/(?<number>\d+)/, "number"] # => "123"
# (!) str =~ regexp is not exactly the same as regexp =~ str.
# Strings captured from named capture groups are assigned to local variables only in the second case (a regexp literal on the left-hand side)
/(?<number>\d+)/ =~ "Number 123"; number #=> "123" # This works
"Number 123" =~ /(?<number>\d+)/; number # => NameError: undefined local variable
"Number 123"[/(?<number>\d+)/]; number # => NameError: undefined local variable
# Example of named captures in Rubular => http://rubular.com/r/SZDykFX2nr
# Rubular is awesome
#### GLOBAL VARIABLES / Regexp.last_match
# These matching operations leave us with some global variables holding the result of the match, but
# it is better to use Regexp.last_match for clarity
str = "Lord of the 7 Kingdoms, and Protector of the Realm"
str[/the (\d+) kingdoms?+/i]
str.match(/the (\d+) kingdoms?+/i)
str =~ /the (\d+) kingdoms?+/i
Regexp.last_match # same as $~ # => #<MatchData "the 7 Kingdoms" 1:"7">
Regexp.last_match(0) # matched string, same as $& # => "the 7 Kingdoms"
Regexp.last_match(1) # first capture, same as $1 # => "7"
Regexp.last_match(n) # nth capture, same as $n
Regexp.last_match.pre_match # same as $` # => "Lord of "
Regexp.last_match.post_match # same as $' # => ", and Protector of the Realm"
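# Despite the $ prefix, the match data is local to the current method (and thread), so a match
# made inside a method does not overwrite the caller's. A minimal sketch of that behaviour;
# kingdom_count is a hypothetical helper used only for illustration:

```ruby
# Hypothetical helper: runs a match and returns the first capture.
def kingdom_count(text)
  text =~ /the (\d+) kingdoms?/i
  Regexp.last_match(1)
end

"no digits here" =~ /\d+/                       # fails, so the match data in this scope is nil
count = kingdom_count("Lord of the 7 Kingdoms") # the match happens inside the method
outer = Regexp.last_match                       # still the caller's own (nil) match data

puts count.inspect # "7"
puts outer.inspect # nil
```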
### WORKING WITH MATCHES
match_data = "Jack Johnson".match /(\w*)\s(\w*)/ #=> #<MatchData "Jack Johnson" 1:"Jack" 2:"Johnson">
match_data.to_a
#=> [
# [0] "Jack Johnson", # matched string
# [1] "Jack", # first capture
# [2] "Johnson" # second capture
#]
match_data.captures
#=> [
# [0] "Jack",
# [1] "Johnson"
#]
match_data = "Number 123".match /(?<number>\d+)/ #=> #<MatchData "123" number:"123">
match_data[:number] # => "123" # same as match_data.captures[0], match_data[1]
match_data.captures
# => ["123"]
match_data.names
# => ["number"]
match_data.names.zip(match_data.captures).to_h
# => {"number"=>"123"}
"Number 123 456".scan(/\d+/)
#=> [
# [0] "123",
# [1] "456"
#]
"Number 123 456".scan(/(?<number>\d+)/)
#=> [
# [0] [
# [0] "123"
# ],
# [1] [
# [0] "456"
# ]
#]
# In this last case the named capture is not helping much. If we do "..".scan(...) { |m| ... },
# m is a string or an array depending on whether the regexp has groups, so it may be cleaner to use Regexp.last_match within the block
"Number 123 456".scan(/(?<number>\d+)/){ puts Regexp.last_match(:number) }
# => 123
# => 456
"123 456 789".scan(/(\d)(\d)(\d)/) #=> [["1", "2", "3"], ["4", "5", "6"], ["7", "8", "9"]]
### REPLACEMENTS
"that man is superman".sub /man/, "bird" #=> "that bird is superman" # replaces once
"that man is superman".gsub /man/, "bird" #=> "that bird is superbird" # global sub
# we also have sub!, gsub!
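# The bang versions modify the receiver in place and return nil when nothing was replaced,
# so chaining on their result is risky. A small sketch (variable names are illustrative):

```ruby
str = "that man is superman".dup       # dup so the example also works with frozen string literals
changed = str.gsub!(/man/, "bird")     # mutates str and returns it
unchanged = str.gsub!(/robot/, "bird") # nothing to replace: returns nil, str is untouched

puts str                 # that bird is superbird
puts changed.equal?(str) # true
puts unchanged.inspect   # nil
```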
# Other options:
str = "that man is superman"
str[/man/] = "bird"
str #=> "that bird is superman" # like sub!
str = "foo bar"
str[/(\w*) (\w*)/, 2] = "baz" # you can target a capture
str # => "foo baz"
# We can reference a capture with \1, \2 ...
"123 456 789".gsub(/(\d+)/, '[\1]') #=> "[123] [456] [789]"
# For more complex changes we can use a block
"WHAT'S GOING ON?".gsub(/\S*/) {|s| s.downcase } # => "what's going on?"
# All of them do the same:
"123 456 789".gsub(/(\d+)/) { |m| m.to_i * 2 }
"123 456 789".gsub(/(\d+)/) { $1.to_i * 2 }
"123 456 789".gsub(/(\d+)/) { Regexp.last_match(0).to_i * 2 }
"123 456 789".gsub(/(?<digits>\d+)/) { $~[:digits].to_i * 2 }
"123 456 789".gsub(/(?<digits>\d+)/) { Regexp.last_match(:digits).to_i * 2 }
#=> "246 912 1578"
# You can also pass a hash to specify matched_string => replacement pairs, such as:
"Mr".gsub(/M(iste)?r/, 'Mister' => 'Doctor', 'Mr' => 'Dr') #=> "Dr"
"Mister".gsub(/M(iste)?r/, 'Mister' => 'Doctor', 'Mr' => 'Dr') #=> "Doctor"
# More recommendations from our guidelines:
url.gsub('http://', 'https://') # bad - there is only one substitution, so use sub
url.sub('http://', 'https://') # good
str.gsub('-', '_') # bad - there is a more specialized / performant alternative
str.tr('-', '_') # good
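# tr maps characters one-to-one and also accepts ranges, while delete covers the "replace with
# nothing" case. A short sketch of both, using made-up sample strings:

```ruby
dashed = "hello-big-world"
underscored = dashed.tr('-', '_')  # single character swap
shifted = "hello".tr('a-y', 'b-z') # ranges map position by position
stripped = dashed.delete('-')      # remove the character instead of replacing it

puts underscored # hello_big_world
puts shifted     # ifmmp
puts stripped    # hellobigworld
```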
### A BUNCH OF USEFUL METHODS AND TIPS
# partition
# >> Searches for sep or pattern (regexp) in the string and returns the part before it, the match, and the part after it.
# >> If it is not found, returns str followed by two empty strings.
"hello".partition("l") #=> ["he", "l", "lo"]
"hello".partition(/l/) #=> ["he", "l", "lo"]
"hello".partition("x") #=> ["hello", "", ""]
# start_with?, end_with?, include?
# from our guides: Prefer Ruby's Standard Library methods (start_with?, end_with?) over ActiveSupport aliases (starts_with?, ends_with?)
"Use the force".start_with?("Use") # => true
"Use the force".include?("the") # => true
"Use the force".end_with?("force") # => true
# Regex equality ===
/hello/ === "hello" # => true (but "hello" === /hello/ => false)
# It is meant to be used within case statements
str = "HELLO"
case str
when /^[a-z]*$/; puts "Lower case"
when /^[A-Z]*$/; puts "Upper case"
else; puts "Mixed case"
end
#=> "Upper case"
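# Because when /regexp/ goes through Regexp#===, a successful branch also sets the match data,
# so captures are available inside it. A minimal sketch; describe_request and the request
# strings are hypothetical:

```ruby
# Hypothetical helper: picks a branch by pattern and reuses the captures of that branch.
def describe_request(line)
  case line
  when /\AGET (?<path>\S+)/
    "read #{Regexp.last_match(:path)}"
  when /\ADELETE (\S+)/
    "remove #{Regexp.last_match(1)}"
  else
    "unknown"
  end
end

puts describe_request("GET /users")      # read /users
puts describe_request("DELETE /users/1") # remove /users/1
puts describe_request("PATCH /users/1")  # unknown
```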
# From our guides:
# Use %r only for regular expressions matching at least one '/' character
%r{\s+} # bad
%r{^/(.*)$} # good
%r{^/blog/2011/(.*)$} # good
# Use non-capturing groups when you don't use the captured result
/(first|second)/ # bad
/(?:first|second)/ # good
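# Besides stating intent, the choice changes results: scan returns arrays of captures as soon as
# the regexp has a capturing group, and capture indexes shift. A small sketch with sample strings:

```ruby
text = "first or second"
with_group = text.scan(/(first|second)/)      # one array of captures per match
without_group = text.scan(/(?:first|second)/) # plain matched strings

capturing = "first place"[/(first|second) (place)/, 1]       # group 1 is the alternation
non_capturing = "first place"[/(?:first|second) (place)/, 1] # group 1 is now (place)

puts with_group.inspect    # [["first"], ["second"]]
puts without_group.inspect # ["first", "second"]
puts capturing             # first
puts non_capturing         # place
```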
# Be careful with ^ and $, as they match the start/end of a line, not of the string.
# To match the whole string, use \A and \z (not to be confused with \Z, which is equivalent to /\n?\z/)
string = "some injection\nusername"
string[/^username$/] # matches
string[/\Ausername\z/] # doesn't match
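# The difference between \z and \Z only shows up with a trailing newline. A quick sketch of the
# three anchor styles on the same input (needs Ruby 2.4+ for match?):

```ruby
with_newline = "username\n"

line_anchors = /^username$/.match?(with_newline)     # $ matches before the trailing newline
strict = /\Ausername\z/.match?(with_newline)         # \z demands the very end of the string
lenient = /\Ausername\Z/.match?(with_newline)        # \Z tolerates one trailing newline
two_newlines = /\Ausername\Z/.match?("username\n\n") # but only one

puts line_anchors # true
puts strict       # false
puts lenient      # true
puts two_newlines # false
```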
# Use the x modifier for complex regexps. It makes them more readable and lets you add useful comments. Just be careful, as whitespace inside the pattern is ignored
regexp = /
start # some text
\s # white space char, you could also use [ ]
(group) # first group
(?:alt1|alt2) # some alternation
end
/x
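# A runnable sketch of the same idea on a concrete pattern, plus the whitespace pitfall.
# The date format and variable names are assumptions chosen for illustration:

```ruby
date_pattern = /
  \A
  (?<year>\d{4})  # four-digit year
  -
  (?<month>\d{2}) # two-digit month
  -
  (?<day>\d{2})   # two-digit day
  \z
/x

match = date_pattern.match("2016-02-01")
puts match[:year]  # 2016
puts match[:month] # 02

loose = /hello world/x    # the space is ignored, so this is really "helloworld"
strict = /hello[ ]world/x # a bracketed space is kept

puts loose.match?("hello world")  # false
puts loose.match?("helloworld")   # true
puts strict.match?("hello world") # true
```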
# Interpolation (#{...}) accepts anything that can be stringified. Even better, I can use another regexp.
GITHUB_COM = %r{https?://(?:www\.)?github\.com}i
%r{\A#{GITHUB_COM}/([^/]+)/?\z}o # The o flag at the end optimizes the regexp by only doing the interpolation once. Don’t use o with dynamic content.
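# Why o is unsafe with dynamic content: the interpolation runs only the first time that literal
# is evaluated, and later calls reuse the first pattern. A minimal sketch; both helper methods
# are hypothetical:

```ruby
# Hypothetical helper using the o flag: the pattern is frozen after the first call.
def cached_word_match?(word, text)
  /\b#{word}\b/o.match?(text)
end

# Same helper without o: the pattern is rebuilt on every call.
def fresh_word_match?(word, text)
  /\b#{word}\b/.match?(text)
end

first = cached_word_match?("force", "Use the force") # builds /\bforce\b/
second = cached_word_match?("fork", "Use the fork")  # still /\bforce\b/, so no match
fresh = fresh_word_match?("fork", "Use the fork")    # rebuilt as /\bfork\b/

puts first  # true
puts second # false
puts fresh  # true
```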
# Name your matches.
# Sometimes a single regexp captures several pieces of information. Instead of capturing only a username, consider the case where I want both a username and a project.
r = %r{\A#{GITHUB_COM}/([^/]+)/([^/]+)/?\z}o
m = r.match('http://github.com/AaronLasseigne/dotfiles') #=> #<MatchData "http://github.com/AaronLasseigne/dotfiles" 1:"AaronLasseigne" 2:"dotfiles">
m[1] #=> "AaronLasseigne"
m[2] #=> "dotfiles"
# compared to:
r = %r{\A#{GITHUB_COM}/(?<username>[^/]+)/(?<project>[^/]+)/?\z}o
m = r.match('http://github.com/AaronLasseigne/dotfiles') #=> #<MatchData "http://github.com/AaronLasseigne/dotfiles" username:"AaronLasseigne" project:"dotfiles">
m[:username] #=> "AaronLasseigne"
m[:project] #=> "dotfiles"
### NEW IN RUBY 2.4
# Regexp#match? can be up to 3x faster than ===, =~ and match; it returns a boolean and does not set global variables
/^foo (\w+)$/.match?('foo wow') # => true
$~ # => nil (assuming no earlier match in this scope)
# MatchData#named_captures, #values_at
pattern = /(?<first_name>John) (?<last_name>\w+)/
pattern.match('John Backus').named_captures # => { "first_name" => "John", "last_name" => "Backus" }
pattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/
pattern.match('2016-02-01').values_at(:year, :month) # => ["2016", "02"]
### REFERENCES AND ADVANCED TOPICS (CONDITIONALS, BACKTRACKING, ATOMIC...)
http://aaronlasseigne.com/2016/07/08/5-tips-for-writing-a-legible-regexp/
http://blog.honeybadger.io/using-conditionals-inside-ruby-regular-expressions/
http://idiosyncratic-ruby.com/11-regular-extremism.html
http://aaronlasseigne.com/2016/06/10/proper-regexp-anchoring/
http://revelry.co/quick-tip-using-regexp-replace-last-occurrence-ruby/
https://github.com/bbatsov/ruby-style-guide#regular-expressions
http://batsov.com/articles/2013/10/03/using-rubys-gsub-with-a-hash/
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454