|
|||||
|
|||||
Swizzle Stream
Swizzle stream is pretty nice for finding/manipulating data in streams. It consists of two approaches:
One limitation is that it does not support regular expressions. All tokens are treated as string literals which keeps the searching/scanning very fast and efficient as we only need to buffer and compare a known and fixed amount of data while reading each new byte of a stream. FiltersInclude/ExcludeThe first set of filters simply include or exclude all data between two tokens in the stream.
For example if you wanted to read in an html document and include only the body, but you also want to exclude any html comments or script elements you could do the following: URL url = new URL("http://somewhere.com/foo/bar.html"); InputStream in = new BufferedInputStream(url.openStream()); // Include only the body in = new IncludeFilterInputStream(in, "<BODY", "</BODY>"); // Exclude any comments in = new ExcludeFilterInputStream(in, "<!--", "-->"); // Exclude any script sections in = new ExcludeFilterInputStream(in, "<SCRIPT", "</SCRIPT>"); try { int b; while ((b = in.read()) != -1) { System.out.print((char) b); } } finally { in.close(); } ReplacementThere are three basic strategies for replacing a chunk of text (called a token) from a stream.
The value X is supplied by you through implementing this interface: public interface StreamTokenHandler { public InputStream processToken(String token) throws IOException; } This was done so that X could also be a stream keeping the library true to it's endless tiker-tape philosophy. The value of X could be a 10 GB file on disk if you desired. There are a couple of standard implementations of StreamTokenHandler:
These are trivial implementations and much more powerful replacements could be done with little effort. For example, here is a DelimitedTokenReplacementInputStream subclass (included in swizzle-stream) with it's own implementation of StreamTokenHandler which can resolve relative URLs to be absolute URLs. package org.codehaus.swizzle.stream; import java.io.IOException; import java.io.InputStream; import java.net.URL; public class ResolveUrlInputStream extends DelimitedTokenReplacementInputStream { public ResolveUrlInputStream(InputStream in, String begin, String end, URL url) { super(in, begin, end, new UrlResolver(begin, end, url), false); } public static class UrlResolver extends StringTokenHandler { private final URL parent; private final String begin; private final String end; public UrlResolver(String begin, String end, URL parent) { this.begin = begin; this.end = end; this.parent = parent; } public String handleToken(String token) throws IOException { String cleanedToken = token.replaceAll("[ \"\']", ""); URL newURL = new URL(parent, cleanedToken); StringBuffer link = new StringBuffer(); link.append(begin.toLowerCase()); if (!begin.endsWith("\"")) { link.append('\"'); } link.append(newURL.toExternalForm()); if (!end.startsWith("\"")) { link.append('\"'); } link.append(end); return link.toString(); } } } You could use the above class as follows: URL url = new URL("http://somewhere.com/foo/bar.html"); InputStream in = new BufferedInputStream(url.openStream()); // Resolve all links relative to the "http://somewhere.com/foo/bar.html" url in = new ResolveUrlInputStream(in, "<A HREF=", ">", url); // Resolve all img src links relative to the "http://somewhere.com/foo/bar.html" url in = new ResolveUrlInputStream(in, "SRC=\"", "\"", url); try { int b; while ((b = in.read()) != -1) { System.out.print((char) b); } } finally { in.close(); } Convenience subclassesFor convenience, there are subclasses of the above replacement filters that incorporate the various handlers. They are:
LexerUnder the covers the lexer is just using the above mentioned filters to achieve its results, however thinking in "stream wrapping" can make your brain hurt after a while. Often you have a specific grammar you are after and simply want an easy way to chop tokens out of the stream. The lexer has two methods:
readToken(token) Seeks in the stream till it finds the start token, reads into a buffer till it finds the end token, then Given the input stream contained the sequence "123ABC456EFG" InputStream in ... StreamLexer lexer = new StreamLexer(in); String token = lexer.readToken("3","C"); // returns the string "AB" char character = (char)in.read(); // returns the character '4' readToken(begin, end) Seeks in the stream till it finds and has completely read the token, then stops. Given the input stream contained the sequence "000[A]111[B]222[C]345[D]" InputStream in ... StreamLexer lexer = new StreamLexer(in); String token = lexer.readToken("222"); // returns the string "222" token = lexer.readToken("[", "]"); // returns the string "C" char character = (char)in.read(); // returns the character '3' Lexer ExampleHere's a chunk of code I whipped up recently after seeing standup comic at a local restaurant. Couldn't remember his name but definitely remembered seeing him on Comedy Central. Swizzle stream to the rescue .. it was a piece of cake write something that would download the picture of every comedian in Comedy Central list of comedians A-Z. (yes, the style and exception handling of this code is terrible ... such is the way of "write once, never use again" code) import org.codehaus.swizzle.stream.StreamLexer; import java.net.URL; import java.io.BufferedInputStream; import java.io.File; import java.io.BufferedOutputStream; import java.io.FileOutputStream; public class FindComedians { public static void main(String[] args) throws Exception { String[] list = {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"}; for (String s : list) { URL url = new URL("http://www.comedycentral.com/comedians/browse/" + s + "/index.jhtml"); BufferedInputStream in = new BufferedInputStream(url.openStream()); StreamLexer lexer = new StreamLexer(in); if (lexer.readToken("bioTextArea_partialScroll") != null) { String comedian = null; while ((comedian = lexer.readToken("<a href=\"/comedians/browse", "\"")) != null) { try { comedian(new URL(url, "/comedians/browse" + comedian), comedian); } catch (Exception e) { System.out.println("Failed: " + e.getMessage()); } } } } } private static void comedian(URL url, String comedian) throws Exception { comedian = comedian.replaceFirst(".*/",""); comedian = comedian.replaceFirst("jhtml$","jpg"); System.out.println(comedian); BufferedInputStream in = new BufferedInputStream(url.openStream()); StreamLexer lexer = new StreamLexer(in); if (lexer.readToken("bioTextArea_partialScroll") == null) fail(comedian, "part1."); if (lexer.readToken("scrollCenter") == null) fail(comedian, "part2."); String img = lexer.readToken("<img src=\"", "\">"); if (img == null) fail(comedian, "no img url."); download(new URL(url, img), comedian); } private static void download(URL url, String comedian) throws Exception { BufferedInputStream in = new BufferedInputStream(url.openStream()); File file = new File("/tmp/comedians"); file = new File(file, comedian); BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(file)); int i = in.read(); while (i != -1){ out.write(i); i = in.read(); } out.close(); in.close(); } private static void fail(String comedian, String message) throws Exception { throw new Exception(comedian+" - "+ message); } } |
|||||
|
Copyright 2003-2006 - The Codehaus. All rights reserved unless otherwise noted.
Powered by Atlassian Confluence
|
|||||