Ruby Regular Expressions

October 23rd, 2016 - Bonn

I wrote this reference as a summary and cheatsheet on how to use regular expressions in Ruby, along with some tips.

### DIFFERENT WAYS OF MATCHING

"Use the force" =~ /force/                # => 8 (index of the first occurrence of the word)
"Use the fork" =~ /force/                 # => nil
"Use the fork" !~ /force/                 # => true
"Use the force"[/force/]                  # => "force"
"Use the fork"[/force/]                   # => nil
"Use the force".match(/force/)            # => #<MatchData "force">
"Use the fork".match(/force/)             # => nil
"Use the force"[/(the) force/, 1]         # => "the" (first capture, same index as in MatchData)
"Number 123"[/(?<number>\d+)/, "number"]  # => "123"

# (!) str =~ regexp is not exactly the same as regexp =~ str.
# Strings captured from named capture groups are assigned to local variables only in the second case
# (regexp literal on the left-hand side)
/(?<number>\d+)/ =~ "Number 123"; number  # => "123" # This works
"Number 123" =~ /(?<number>\d+)/; number  # => NameError: undefined local variable
"Number 123"[/(?<number>\d+)/]; number    # => NameError: undefined local variable

# Example of named captures in Rubular => http://rubular.com/r/SZDykFX2nr
# Rubular is awesome

#### GLOBAL VARIABLES / Regexp.last_match

# These matching operations leave us with some global variables holding the result of the match,
# but it is better to use Regexp.last_match for clarity
str = "Lord of the 7 Kingdoms, and Protector of the Realm"
str[/the (\d+) kingdoms?+/i]
str.match(/the (\d+) kingdoms?+/i)
str =~ /the (\d+) kingdoms?+/i

Regexp.last_match             # same as $~                 # => #<MatchData "the 7 Kingdoms" 1:"7">
Regexp.last_match(0)          # matched string, same as $& # => "the 7 Kingdoms"
Regexp.last_match(1)          # first capture, same as $1  # => "7"
Regexp.last_match(n)          # nth capture, same as $n
Regexp.last_match.pre_match   # same as $`                 # => "Lord of "
Regexp.last_match.post_match  # same as $'                 # => ", and Protector of the Realm"

### WORKING WITH MATCHES

match_data = "Jack Johnson".match(/(\w*)\s(\w*)/)
#=> #<MatchData "Jack Johnson" 1:"Jack" 2:"Johnson">
match_data.to_a
#=> [
#   [0] "Jack Johnson", # matched string
#   [1] "Jack",         # first capture
#   [2] "Johnson"       # second capture
# ]
match_data.captures
#=> [
#   [0] "Jack",
#   [1] "Johnson"
# ]

match_data = "Number 123".match(/(?<number>\d+)/)
#=> #<MatchData "123" number:"123">
match_data[:number]  # => "123" # same as match_data.captures[0], match_data[1]
match_data.captures  # => ["123"]
match_data.names     # => ["number"]
match_data.names.zip(match_data.captures).to_h # => {"number"=>"123"}

"Number 123 456".scan(/\d+/)
#=> [
#   [0] "123",
#   [1] "456"
# ]
"Number 123 456".scan(/(?<number>\d+)/)
#=> [
#   [0] [
#     [0] "123"
#   ],
#   [1] [
#     [0] "456"
#   ]
# ]

# In this last case the named capture is not helping much. If we do "..".scan(...) { |m| ... },
# m could be a string or an array, so it might be cleaner to use Regexp.last_match within the block
"Number 123 456".scan(/(?<number>\d+)/) { puts Regexp.last_match(:number) }
# => 123
# => 456

"123 456 789".scan(/(\d)(\d)(\d)/) #=> [["1", "2", "3"], ["4", "5", "6"], ["7", "8", "9"]]

### REPLACEMENTS

"that man is superman".sub(/man/, "bird")  #=> "that bird is superman"  # replaces once
"that man is superman".gsub(/man/, "bird") #=> "that bird is superbird" # global sub
# we also have sub!, gsub!

# other options:
str = "that man is superman"
str[/man/] = "bird"
str #=> "that bird is superman" # like sub!

str = "foo bar"
str[/(\w*) (\w*)/, 2] = "baz" # you can target a capture
str # => "foo baz"

# We can reference a capture with \1, \2 ...
"123 456 789".gsub(/(\d+)/, '[\1]') #=> "[123] [456] [789]"

# For more complex changes we can use a block
"WHAT'S GOING ON?".gsub(/\S*/) { |s| s.downcase } # => "what's going on?"
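One caveat about the in-place variants mentioned above: sub! and gsub! return nil when nothing was replaced, so chaining calls on their result can fail. The sketch below illustrates this with a made-up sample string; the variable names are only for illustration.

```ruby
# Sample string, made up for illustration (dup gives us a mutable copy).
phrase = "that man is superman".dup

# gsub! mutates the receiver and returns it when at least one substitution happens.
changed = phrase.gsub!(/man/, "bird")

# When the pattern does not match, gsub! returns nil and leaves the string untouched.
unchanged = phrase.gsub!(/robot/, "bird")

# The non-bang form always returns a string, so it is safe to chain.
chained = phrase.gsub(/robot/, "bird").upcase

puts phrase               # that bird is superbird
p changed.equal?(phrase)  # true
p unchanged               # nil
puts chained              # THAT BIRD IS SUPERBIRD
```

If you need both mutation and chaining, checking the return value for nil first is one simple option.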
# All of them do the same:
"123 456 789".gsub(/(\d+)/) { |m| m.to_i * 2 }
"123 456 789".gsub(/(\d+)/) { $1.to_i * 2 }
"123 456 789".gsub(/(\d+)/) { Regexp.last_match(0).to_i * 2 }
"123 456 789".gsub(/(?<digits>\d+)/) { $~[:digits].to_i * 2 }
"123 456 789".gsub(/(?<digits>\d+)/) { Regexp.last_match(:digits).to_i * 2 }
#=> "246 912 1578"

# You can also pass a hash to specify matched_string => replacement pairs, such as:
"Mr".gsub(/M(iste)?r/, 'Mister' => 'Doctor', 'Mr' => 'Dr')     #=> "Dr"
"Mister".gsub(/M(iste)?r/, 'Mister' => 'Doctor', 'Mr' => 'Dr') #=> "Doctor"

# More recommendations from our guidelines:
url.gsub('http://', 'https://') # bad, there is only one substitution, use sub
url.sub('http://', 'https://')  # good
str.gsub('-', '_') # bad - there is a more specialized / performant alternative
str.tr('-', '_')   # good

# A BUNCH OF USEFUL METHODS AND TIPS

# partition
# >> Searches sep or pattern (regexp) in the string and returns the part before it, the match, and the part after it.
# >> If it is not found, returns str and two empty strings.
"hello".partition("l") #=> ["he", "l", "lo"]
"hello".partition(/l/) #=> ["he", "l", "lo"]
"hello".partition("x") #=> ["hello", "", ""]

# start_with?, end_with?, include?
# From our guides: prefer Ruby's Standard Library methods (start_with?, end_with?) over ActiveSupport aliases (starts_with?, ends_with?)
"Use the force".start_with?("Use") # => true "Use the force".include?("the") # => true "Use the force".end_with?("force") # => true # Regex equality === /hello/ === "hello" # => true (but "hello" === /hello/ => false) # Is meant to be used within case statements str = "HELLO" case str when /^[a-z]*$/; puts "Lower case" when /^[A-Z]*$/; puts "Upper case" else; puts "Mixed case" end #=> "Upper case" # From our guides: # Use %r only for regular expressions matching at least one '/' character %r{\s+} # bad %r{^/(.*)$} # good %r{^/blog/2011/(.*)$} # good # Use non-capturing groups when you don't use the captured result /(first|second)/ # bad /(?:first|second)/ # good #Be careful with ^ and $ as they match start/end of line, not string endings. # If you want to match the whole string use: \A and \z (not to be confused with \Z which is the equivalent of /\n?\z/) string = "some injection\nusername" string[/^username$/] # matches string[/\Ausername\z/] # doesn't match #Use x modifier for complex regexps. This makes them more readable and you can add some useful comments. Just be careful as spaces are ignored regexp = / start # some text \s # white space char, you could also use [ ] (group) # first group (?:alt1|alt2) # some alternation end /x #The interpolation accepts anything that can be stringified. Even better, I can use another regexp. GITHUB_COM = %r{https?://(?:www\.)?github\.com}i %r{\A#{GITHUB_COM}/([^/]+)/?\z}o # The o flag at the end optimizes the regexp by only doing the interpolation once. Don’t use o with dynamic content. # Name your matches. # Sometimes a single regexp will capture several pieces of information. Instead of capturing a username consider the case where I want a username and project. 
r = %r{\A#{GITHUB_COM}/([^/]+)/([^/]+)/?\z}o
m = r.match('http://github.com/AaronLasseigne/dotfiles')
#=> #<MatchData "http://github.com/AaronLasseigne/dotfiles" 1:"AaronLasseigne" 2:"dotfiles">
m[1] #=> "AaronLasseigne"
m[2] #=> "dotfiles"

# compared to:
r = %r{\A#{GITHUB_COM}/(?<username>[^/]+)/(?<project>[^/]+)/?\z}o
m = r.match('http://github.com/AaronLasseigne/dotfiles')
#=> #<MatchData "http://github.com/AaronLasseigne/dotfiles" username:"AaronLasseigne" project:"dotfiles">
m[:username] #=> "AaronLasseigne"
m[:project]  #=> "dotfiles"

# NEW IN RUBY 2.4

# Regexp#match? is up to 3x faster than ===, =~, match; it returns a boolean and does not set global variables
/^foo (\w+)$/.match?('foo wow') # => true
$~ # => nil

# MatchData#named_captures, #values_at
pattern = /(?<first_name>John) (?<last_name>\w+)/
pattern.match('John Backus').named_captures # => { "first_name" => "John", "last_name" => "Backus" }

pattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/
pattern.match('2016-02-01').values_at(:year, :month) # => ["2016", "02"]

### REFERENCES AND ADVANCED TOPICS (CONDITIONALS, BACKTRACKING, ATOMIC...)

http://aaronlasseigne.com/2016/07/08/5-tips-for-writing-a-legible-regexp/
http://blog.honeybadger.io/using-conditionals-inside-ruby-regular-expressions/
http://idiosyncratic-ruby.com/11-regular-extremism.html
http://aaronlasseigne.com/2016/06/10/proper-regexp-anchoring/
http://revelry.co/quick-tip-using-regexp-replace-last-occurrence-ruby/
https://github.com/bbatsov/ruby-style-guide#regular-expressions
http://batsov.com/articles/2013/10/03/using-rubys-gsub-with-a-hash/
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454