require 'nokogiri'
require 'open-uri'
# Get a Nokogiri::HTML::Document for the page we're interested in...
# (URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs)
doc = Nokogiri::HTML(URI.open('http://www.google.com/search?q=tenderlove'))
# Do funky things with it using Nokogiri::XML::Node methods...
####
# Search for nodes by CSS
doc.css('h3.r a.l').each do |link|
  puts link.content
end
doc.at_css('h3').content
####
# Search for nodes by XPath
doc.xpath('//h3/a[@class="l"]').each do |link|
  puts link.content
end
####
# Or mix and match.
doc.search('h3.r a.l', '//h3/a[@class="l"]').each do |link|
  puts link.content
end
####
# Work with attributes
xml = "<foo wam='bam'>bar</foo>"
doc = Nokogiri::XML(xml)
doc.at_css("foo").content # => "bar"
doc.at_css("foo")["wam"]  # => "bam" (node[...] returns the attribute value as a String)
####
# Work with elements
el = doc.at_css("foo")
el.children # => NodeSet of child nodes (elements and text nodes)
####
So for example if we wanted to know all the names of the food items in our
document we simply say:
> doc.xpath("//name").collect(&:text)
=> ["carrot", "tomato", "corn", "grapes", "orange", "pear", "apple"]
If we were interested in the entire node we could leave off the
.collect(&:text). What if we wanted to select all the names of food items that
were best baked? This requires what's called an axis: we first find the
element "baked", then go back up through its ancestor elements to find which
food item it is inside.
> doc.xpath("//tag[text()='baked']/ancestor::node()/name").collect(&:text)
=> ["pear", "apple"]
What if we were only interested in vegetables that were good for roasting?
Just add //veggies:
> doc.xpath("//veggies//tag[text()='roasted']/ancestor::node()/name").collect(&:text)
=> ["carrot", "tomato"]
What about if we wanted to know all the tags ‘corn’ had? Again this is very
easy:
> doc.xpath("//name[text()='corn']/../tags/tag").collect(&:text)
=> ["raw", "boiled", "grilled"]
We can even do searches matching the first character. Let’s say we wanted to
know all the food items that started with the letter ‘c’:
> doc.xpath("//name[starts-with(text(),'c')]").collect(&:text)
=> ["carrot", "corn"]
You could also use [contains(text(),'rot')] and get back just carrot, which is
useful when you want a partial match.
####
# Traversal
node.ancestors            # Returns all ancestors of <node>
node.at('xpath')          # Returns the first node matching the XPath (or CSS) query
node.at_css('selector')   # Returns the first node matching the CSS selector
node.xpath('xpath')       # Returns all nodes matching the XPath query
node.css('selector')      # Returns all nodes matching the CSS selector
node.child                # Returns the first child node
node.children             # Returns all child nodes
node.parent               # Returns the parent node
####
# Data manipulation
node.name # Element name
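node.node_type # Node type as an integer constant
node.content # Returns text as string
# (aka: .inner_text, .text)
node.content = '...' # Replaces the node's content with the given text
node.inner_html # Returns the markup of the child nodes as a string
node.inner_html = '...' # Replaces the child nodes with the parsed markup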
node.attribute_nodes # Returns attributes as nodes
node.attributes # Returns attributes as hash
####
# Tree manipulation
node.add_next_sibling(other) # Place <other> after <node>
node.add_previous_sibling(other) # Place <other> before <node>
node.add_child(other) # Put <other> inside <node>
node.after(data) # Insert <data> (a node or markup string) after <node>
node.before(data) # Insert <data> (a node or markup string) before <node>
node.parent = other # Reparents <node> inside <other>